TY - GEN
T1 - Multi-armed bandit problems with heavy-tailed reward distributions
AU - Liu, Keqin
AU - Zhao, Qing
PY - 2011
AB - In the Multi-Armed Bandit (MAB) problem, a player selects one arm from a set to play at each time without knowing the arm reward statistics. The essence of the problem is the tradeoff between exploration and exploitation: playing a less explored arm to learn its reward statistics, or playing the arm that appears to be the best. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that when the moment-generating functions of the arm reward distributions are properly bounded, DSEE achieves the optimal logarithmic order of the regret. The condition on the reward distributions can be gradually relaxed at the cost of a higher (nevertheless sublinear) regret order: for any positive integer p, O(T^{1/p}) regret can be achieved by DSEE when the moments of the reward distributions exist (only) up to the pth order. The proposed DSEE approach complements existing work on MAB by providing corresponding results under a set of relaxed conditions on the reward distributions. Furthermore, with a clearly defined tunable parameter, the cardinality of the exploration sequence, the DSEE approach extends readily to variations of MAB, including decentralized MAB with partial reward observations and restless MAB with unknown Markov dynamics. Potential applications include dynamic spectrum access, multi-agent systems, Internet advertising, and Web search.
UR - http://www.scopus.com/inward/record.url?scp=84856102138&partnerID=8YFLogxK
DO - 10.1109/Allerton.2011.6120206
M3 - Conference Proceeding
AN - SCOPUS:84856102138
SN - 9781457718168
T3 - 2011 49th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2011
SP - 485
EP - 492
BT - 2011 49th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2011
T2 - 2011 49th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2011
Y2 - 28 September 2011 through 30 September 2011
ER -