Deterministic sequencing of exploration and exploitation for multi-armed bandit problems

Sattar Vakili; Keqin Liu; Qing Zhao

doi:10.1109/JSTSP.2013.2263494

Department of Financial and Actuarial Mathematics

Research output: Contribution to journal › Article › peer-review

81 Citations (Scopus)

Abstract

In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length $T$. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined as the total expected reward loss against the ideal case with known reward models. For heavy-tailed reward distributions, DSEE achieves $O(T^{1/p})$ regret when the moments of the reward distributions exist up to the $p$th order for $1 < p \le 2$, and $O(T^{1/(1+p/2)})$ regret for $p > 2$. With the knowledge of an upper bound on a finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing corresponding results for general reward distributions. Furthermore, with a clearly defined tunable parameter (the cardinality of the exploration sequence), the DSEE approach is easily extendable to variations of MAB, including MAB with various objectives, decentralized MAB with multiple players and incomplete reward observations under collisions, restless MAB with unknown dynamics, and combinatorial MAB with dependent arms that often arise in network optimization problems such as the shortest path, the minimum spanning tree, and the dominating set problems under unknown random weights.
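The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `dsee`, the helper `bernoulli`, the exploration budget `n * ceil(w * log(t + 1))`, the round-robin order, and the use of exploration samples only for the sample means are all assumptions made for illustration. The point it shows is that the exploration times are fixed in advance by a schedule whose cardinality grows logarithmically, independent of the observed rewards.

```python
import math
import random


def bernoulli(p):
    """Hypothetical helper: an arm paying 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def dsee(arms, horizon, w=2.0, seed=0):
    """Run a DSEE-style policy for `horizon` steps.

    `arms` is a list of callables taking a random.Random and returning a reward.
    `w` scales the cardinality of the exploration sequence (assumed form).
    Returns (plays per arm, number of exploration steps).
    """
    rng = random.Random(seed)
    n = len(arms)
    counts = [0] * n   # exploration samples collected per arm
    sums = [0.0] * n   # sum of exploration rewards per arm
    pulls = [0] * n    # total plays per arm (exploration + exploitation)
    explored = 0       # cardinality of the exploration sequence so far

    for t in range(1, horizon + 1):
        # Deterministic schedule: explore while the exploration sequence is
        # shorter than n * ceil(w * log(t + 1)). The "+ 1" is an assumption
        # that forces every arm to be sampled before any exploitation.
        budget = n * math.ceil(w * math.log(t + 1))
        if explored < budget:
            arm = explored % n            # round-robin over the arms
            reward = arms[arm](rng)
            counts[arm] += 1
            sums[arm] += reward
            explored += 1
        else:
            # Exploitation: play the arm with the largest sample mean.
            arm = max(range(n), key=lambda i: sums[i] / counts[i])
            arms[arm](rng)
        pulls[arm] += 1
    return pulls, explored
```

Calling `dsee([bernoulli(0.2), bernoulli(0.8)], horizon=5000)` spends only a logarithmic number of steps on exploration and the remainder on the arm that currently looks best. Handling heavy-tailed rewards as described in the abstract would require a different schedule or estimator, which this sketch does not attempt.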

Original language	English
Article number	6516952
Pages (from-to)	759-767
Number of pages	9
Journal	IEEE Journal on Selected Topics in Signal Processing
Volume	7
Issue number	5
DOIs	https://doi.org/10.1109/JSTSP.2013.2263494
Publication status	Published - 2013

Keywords

combinatorial multi-armed bandit
decentralized multi-armed bandit
deterministic sequencing of exploration and exploitation
Multi-armed bandit
regret
restless multi-armed bandit

Cite this

@article{dafe7bc29efa41c9828af1186fcbcc40,

title = "Deterministic sequencing of exploration and exploitation for multi-armed bandit problems",

abstract = "In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length $T$. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined as the total expected reward loss against the ideal case with known reward models. For heavy-tailed reward distributions, DSEE achieves $O(T^{1/p})$ regret when the moments of the reward distributions exist up to the $p$th order for $1 < p \le 2$, and $O(T^{1/(1+p/2)})$ regret for $p > 2$. With the knowledge of an upper bound on a finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing corresponding results for general reward distributions. Furthermore, with a clearly defined tunable parameter (the cardinality of the exploration sequence), the DSEE approach is easily extendable to variations of MAB, including MAB with various objectives, decentralized MAB with multiple players and incomplete reward observations under collisions, restless MAB with unknown dynamics, and combinatorial MAB with dependent arms that often arise in network optimization problems such as the shortest path, the minimum spanning tree, and the dominating set problems under unknown random weights.",

keywords = "combinatorial multi-armed bandit, decentralized multi-armed bandit, deterministic sequencing of exploration and exploitation, Multi-armed bandit, regret, restless multi-armed bandit",

author = "Sattar Vakili and Keqin Liu and Qing Zhao",

year = "2013",

doi = "10.1109/JSTSP.2013.2263494",

language = "English",

volume = "7",

pages = "759--767",

journal = "IEEE Journal on Selected Topics in Signal Processing",

issn = "1932-4553",

number = "5",

}

TY - JOUR

T1 - Deterministic sequencing of exploration and exploitation for multi-armed bandit problems

AU - Vakili, Sattar

AU - Liu, Keqin

AU - Zhao, Qing

PY - 2013

Y1 - 2013

AB - In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length $T$. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined as the total expected reward loss against the ideal case with known reward models. For heavy-tailed reward distributions, DSEE achieves $O(T^{1/p})$ regret when the moments of the reward distributions exist up to the $p$th order for $1 < p \le 2$, and $O(T^{1/(1+p/2)})$ regret for $p > 2$. With the knowledge of an upper bound on a finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing corresponding results for general reward distributions. Furthermore, with a clearly defined tunable parameter (the cardinality of the exploration sequence), the DSEE approach is easily extendable to variations of MAB, including MAB with various objectives, decentralized MAB with multiple players and incomplete reward observations under collisions, restless MAB with unknown dynamics, and combinatorial MAB with dependent arms that often arise in network optimization problems such as the shortest path, the minimum spanning tree, and the dominating set problems under unknown random weights.

KW - combinatorial multi-armed bandit

KW - decentralized multi-armed bandit

KW - deterministic sequencing of exploration and exploitation

KW - Multi-armed bandit

KW - regret

KW - restless multi-armed bandit

UR - http://www.scopus.com/inward/record.url?scp=84884549238&partnerID=8YFLogxK

U2 - 10.1109/JSTSP.2013.2263494

DO - 10.1109/JSTSP.2013.2263494

M3 - Article

AN - SCOPUS:84884549238

SN - 1932-4553

VL - 7

SP - 759

EP - 767

JO - IEEE Journal on Selected Topics in Signal Processing

JF - IEEE Journal on Selected Topics in Signal Processing

IS - 5

M1 - 6516952

ER -
