Frequentist multi-armed bandits for complex reward models

Haoran Chen, Weibing Deng, Keqin Liu*, Ting Wu

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

In the classical Multi-Armed Bandit (MAB) problem, a player selects one arm out of a set to play at each time, without knowing the reward models, aiming to maximize the total expected reward over T plays. The regret of an algorithm is the expected total loss after T steps compared to the ideal scenario of knowing the reward models. When the arm reward distributions are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of the Upper Confidence Bound (UCB) and prove that it achieves sublinear regret growth for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods, a dynamic model in which an arm enters a gap period, offering no reward, immediately after being played. This model has broad applications in Internet advertising. Clearly, the player should avoid choosing an arm until it exits its gap period. We extend the algorithm framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching the optimal order O(log T) for light-tailed distributions and sublinear growth for heavy-tailed ones.
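A common way to make UCB robust to heavy-tailed rewards, as in the setting above, is to replace the empirical mean with a robust estimator. The sketch below is illustrative only, not the paper's exact algorithm: it uses a truncated empirical mean with a matching confidence bonus, assuming rewards have a (1+epsilon)-th raw moment bounded by a known constant `u` (both names are assumptions introduced here for the example).

```python
import math
import random

def truncated_mean(rewards, u, t, epsilon):
    """Truncated empirical mean: a sample is dropped when its magnitude
    exceeds a threshold that grows with the sample index, so rare extreme
    heavy-tailed draws cannot dominate the estimate."""
    n = len(rewards)
    total = 0.0
    for i, x in enumerate(rewards, start=1):
        threshold = (u * i / math.log(t ** 2)) ** (1.0 / (1.0 + epsilon))
        if abs(x) <= threshold:
            total += x
    return total / n

def robust_ucb(arms, horizon, u=1.0, epsilon=1.0, seed=0):
    """UCB-style play with a truncated-mean index: pull each arm once,
    then repeatedly pull the arm maximizing truncated mean + bonus.
    Each arm is a callable taking an RNG and returning a reward.
    Returns (total reward, pull counts per arm)."""
    rng = random.Random(seed)
    history = [[arm(rng)] for arm in arms]   # one initial pull per arm
    total = sum(h[0] for h in history)
    counts = [1] * len(arms)
    for t in range(len(arms) + 1, horizon + 1):
        def index(k):
            n = counts[k]
            bonus = (4 * u ** (1 / (1 + epsilon))
                     * (math.log(t ** 2) / n) ** (epsilon / (1 + epsilon)))
            return truncated_mean(history[k], u, t, epsilon) + bonus
        k = max(range(len(arms)), key=index)  # highest upper confidence index
        r = arms[k](rng)
        history[k].append(r)
        counts[k] += 1
        total += r
    return total, counts
```

For example, with one arm that always pays 0 and one that always pays 1, the index concentrates pulls on the better arm while the confidence bonus forces only logarithmically many pulls of the worse one.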

Original language: English
Title of host publication: 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
Editors: Hong-Ming Yin, Ke Chen, Romeo Mestrovic, Teresa A. Oliveira
Publisher: SPIE
ISBN (Electronic): 9781510652026
DOIs
Publication status: Published - 2022
Event: 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021 - Nanjing, China
Duration: 26 Nov 2021 - 28 Nov 2021

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 12163
ISSN (Print): 0277-786X
ISSN (Electronic): 1996-756X

Conference

Conference: 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
Country/Territory: China
City: Nanjing
Period: 26/11/21 - 28/11/21

Keywords

  • Adaptive DSEE for dynamic models
  • Extended UCB for heavy-tailed reward
  • Frequentist multi-armed bandits
  • Online statistical learning and control
