TY - GEN
T1 - Frequentist multi-armed bandits for complex reward models
AU - Chen, Haoran
AU - Deng, Weibing
AU - Liu, Keqin
AU - Wu, Ting
N1 - Publisher Copyright:
© COPYRIGHT SPIE.
PY - 2022
Y1 - 2022
N2 - In the classical Multi-Armed Bandit (MAB) problem, a player selects one arm out of a set of arms to play at each time, without knowing the reward models, aiming to maximize the total expected reward over T steps. The regret of an algorithm is the expected total loss after T steps compared to the ideal scenario of knowing the reward models. When the arm reward distributions are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of the Upper Confidence Bound (UCB) and prove that the algorithm achieves a sublinear growth of regret for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods as a dynamic model in which an arm enters a gap period, during which it offers no reward, immediately after being played. This model finds broad application in Internet advertising. Clearly, the player should avoid choosing an arm until it gets out of its gap period. We extend the algorithmic framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching the optimal order O(log T) for light-tailed distributions and a sublinear growth for the heavy-tailed case.
AB - In the classical Multi-Armed Bandit (MAB) problem, a player selects one arm out of a set of arms to play at each time, without knowing the reward models, aiming to maximize the total expected reward over T steps. The regret of an algorithm is the expected total loss after T steps compared to the ideal scenario of knowing the reward models. When the arm reward distributions are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of the Upper Confidence Bound (UCB) and prove that the algorithm achieves a sublinear growth of regret for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods as a dynamic model in which an arm enters a gap period, during which it offers no reward, immediately after being played. This model finds broad application in Internet advertising. Clearly, the player should avoid choosing an arm until it gets out of its gap period. We extend the algorithmic framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching the optimal order O(log T) for light-tailed distributions and a sublinear growth for the heavy-tailed case.
KW - Adaptive DSEE for dynamic models
KW - Extended UCB for heavy-tailed reward
KW - Frequentist multi-armed bandits
KW - Online statistical learning and control
UR - http://www.scopus.com/inward/record.url?scp=85131782793&partnerID=8YFLogxK
U2 - 10.1117/12.2627482
DO - 10.1117/12.2627482
M3 - Conference Proceeding
AN - SCOPUS:85131782793
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
A2 - Yin, Hong-Ming
A2 - Chen, Ke
A2 - Mestrovic, Romeo
A2 - Oliveira, Teresa A.
PB - SPIE
T2 - 2021 International Conference on Statistics, Applied Mathematics, and Computing Science, CSAMCS 2021
Y2 - 26 November 2021 through 28 November 2021
ER -