TY - GEN
T1 - Decentralized multi-armed bandit with multiple distributed players
AU - Liu, Keqin
AU - Zhao, Qing
PY - 2010
Y1 - 2010
N2 - We formulate and study a decentralized multi-armed bandit (MAB) problem, where M distributed players compete for N independent arms with unknown reward statistics. At each time, each player chooses one arm to play without exchanging information with other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. We show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly. A general framework of constructing fair and order-optimal decentralized policies is established based on a Time Division Fair Sharing (TDFS) of the M best arms. A lower bound on the system regret growth rate is established for a general class of decentralized polices, to which all TDFS policies belong.We further develop several fair and order-optimal decentralized polices within the TDFS framework and study their performance in different applications including cognitive radio networks, multi-channel communications in unknown fading environment, target collecting in multi-agent systems, and web search and advertising.
AB - We formulate and study a decentralized multi-armed bandit (MAB) problem, where M distributed players compete for N independent arms with unknown reward statistics. At each time, each player chooses one arm to play without exchanging information with other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. We show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly. A general framework of constructing fair and order-optimal decentralized policies is established based on a Time Division Fair Sharing (TDFS) of the M best arms. A lower bound on the system regret growth rate is established for a general class of decentralized polices, to which all TDFS policies belong.We further develop several fair and order-optimal decentralized polices within the TDFS framework and study their performance in different applications including cognitive radio networks, multi-channel communications in unknown fading environment, target collecting in multi-agent systems, and web search and advertising.
UR - http://www.scopus.com/inward/record.url?scp=77952689239&partnerID=8YFLogxK
U2 - 10.1109/ITA.2010.5454071
DO - 10.1109/ITA.2010.5454071
M3 - Conference Proceeding
AN - SCOPUS:77952689239
SN - 9781424470143
T3 - 2010 Information Theory and Applications Workshop, ITA 2010 - Conference Proceedings
SP - 568
EP - 577
BT - 2010 Information Theory and Applications Workshop, ITA 2010 - Conference Proceedings
T2 - 2010 Information Theory and Applications Workshop, ITA 2010
Y2 - 31 January 2010 through 5 February 2010
ER -