Learning in a changing world: Restless multiarmed bandit with unknown dynamics

Haoyang Liu; Keqin Liu; Qing Zhao

doi:10.1109/TIT.2012.2230215

Haoyang Liu^*, Keqin Liu, Qing Zhao

^*Corresponding author for this work

Department of Financial and Actuarial Mathematics

Research output: Contribution to journal › Article › peer-review

111 Citations (Scopus)

Abstract

We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of $N$ arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.

Original language	English
Article number	6362216
Pages (from-to)	1902-1916
Number of pages	15
Journal	IEEE Transactions on Information Theory
Volume	59
Issue number	3
DOIs	https://doi.org/10.1109/TIT.2012.2230215
Publication status	Published - 2013

Keywords

Distributed learning
online learning
regret
restless multiarmed bandit (RMAB)

Cite this

@article{428345c1a7c24a26a06e987be3d007d7,
  title = "Learning in a changing world: Restless multiarmed bandit with unknown dynamics",
  abstract = "We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of $N$ arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.",
  keywords = "Distributed learning, online learning, regret, restless multiarmed bandit (RMAB)",
  author = "Haoyang Liu and Keqin Liu and Qing Zhao",
  year = "2013",
  doi = "10.1109/TIT.2012.2230215",
  language = "English",
  volume = "59",
  pages = "1902--1916",
  journal = "IEEE Transactions on Information Theory",
  issn = "0018-9448",
  number = "3",
}

TY - JOUR
T1 - Learning in a changing world
T2 - Restless multiarmed bandit with unknown dynamics
AU - Liu, Haoyang
AU - Liu, Keqin
AU - Zhao, Qing
PY - 2013
Y1 - 2013
N2 - We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of $N$ arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.
AB - We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of $N$ arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.
KW - Distributed learning
KW - online learning
KW - regret
KW - restless multiarmed bandit (RMAB)
UR - http://www.scopus.com/inward/record.url?scp=84873932839&partnerID=8YFLogxK
U2 - 10.1109/TIT.2012.2230215
DO - 10.1109/TIT.2012.2230215
M3 - Article
AN - SCOPUS:84873932839
SN - 0018-9448
VL - 59
SP - 1902
EP - 1916
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 3
M1 - 6362216
ER -
