TY - GEN
T1 - Frequentist multi-armed bandits for complex reward models
AU - Chen, Haoran
AU - Deng, Weibing
AU - Liu, Keqin
AU - Wu, Ting
N1 - Publisher Copyright:
© COPYRIGHT SPIE.
PY - 2022
Y1 - 2022
N2 - In the classical Multi-Armed Bandit (MAB) problem, a player selects one arm out of a set of arms to play at each time, without knowing the reward models, aiming to maximize the total expected reward over T steps. The regret of an algorithm is the expected total loss after T steps compared to the ideal scenario of knowing the reward models. When the arm reward distributions are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of the Upper Confidence Bound (UCB) and prove that the algorithm achieves a sublinear growth of regret for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods as a dynamic model in which an arm enters a gap period, during which it offers no reward, immediately after being played. This model finds broad application in Internet advertising. Clearly, the player should avoid choosing an arm until it gets out of its gap period. We extend the algorithmic framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching the optimal order O(log T) for light-tailed distributions and a sublinear growth for the heavy-tailed case.
AB - In the classical Multi-Armed Bandit (MAB) problem, a player selects one arm out of a set of arms to play at each time, without knowing the reward models, aiming to maximize the total expected reward over T steps. The regret of an algorithm is the expected total loss after T steps compared to the ideal scenario of knowing the reward models. When the arm reward distributions are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of the Upper Confidence Bound (UCB) and prove that the algorithm achieves a sublinear growth of regret for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods as a dynamic model in which an arm enters a gap period, during which it offers no reward, immediately after being played. This model finds broad application in Internet advertising. Clearly, the player should avoid choosing an arm until it gets out of its gap period. We extend the algorithmic framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching the optimal order O(log T) for light-tailed distributions and a sublinear growth for the heavy-tailed case.
KW - Adaptive DSEE for dynamic models
KW - Extended UCB for heavy-tailed reward
KW - Frequentist multi-armed bandits
KW - Online statistical learning and control
UR - http://www.scopus.com/inward/record.url?scp=85131782793&partnerID=8YFLogxK
U2 - 10.1117/12.2627482
DO - 10.1117/12.2627482
M3 - Conference Proceeding
AN - SCOPUS:85131782793
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
A2 - Yin, Hong-Ming
A2 - Chen, Ke
A2 - Mestrovic, Romeo
A2 - Oliveira, Teresa A.
PB - SPIE
T2 - 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
Y2 - 26 November 2021 through 28 November 2021
ER -