TY - GEN
T1 - Decentralized multi-armed bandit with imperfect observations
AU - Liu, Keqin
AU - Zhao, Qing
AU - Krishnamachari, Bhaskar
PY - 2010
Y1 - 2010
N2 - We consider decentralized multi-armed bandit problems with multiple distributed players. At each time, each player chooses one of the N independent arms with unknown reward statistics to play. Players do not exchange information regarding their observations or actions. A collision occurs when multiple players choose the same arm; in this case, the reward obtained by each player involved in the collision deviates from the actual reward offered by that arm in an arbitrary and unknown way, making it harder to learn the underlying reward statistics of the arm. The objective is to design a decentralized arm selection policy that minimizes the system regret, defined as the total reward loss with respect to the ideal scenario of a known reward model and centralized scheduling among players. We propose a decentralized policy that achieves O(√T) system regret, where T is the length of the time horizon. The policy thus achieves the same maximum average reward as in the ideal scenario. Furthermore, the policy ensures fairness among players, i.e., all players achieve the same local average reward at the same rate. These problems find applications in cognitive radio networks, multi-agent systems, Internet advertising, and web search.
UR - http://www.scopus.com/inward/record.url?scp=79952434360&partnerID=8YFLogxK
DO - 10.1109/ALLERTON.2010.5707117
M3 - Conference Proceeding
AN - SCOPUS:79952434360
SN - 9781424482146
T3 - 2010 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
SP - 1669
EP - 1674
BT - 2010 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
T2 - 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
Y2 - 29 September 2010 through 1 October 2010
ER -