TY - GEN
T1 - Decentralized multi-armed bandit with imperfect observations
AU - Liu, Keqin
AU - Zhao, Qing
AU - Krishnamachari, Bhaskar
PY - 2010
Y1 - 2010
N2 - We consider decentralized multi-armed bandit problems with multiple distributed players. At each time, each player chooses one of the N independent arms with unknown reward statistics to play. Players do not exchange information regarding their observations or actions. A collision occurs when multiple players choose the same arm; in this case, the reward obtained by each player involved in the collision deviates from the actual reward offered by that arm in an arbitrary and unknown way, making it harder to learn the underlying reward statistics of the arm. The objective is to design a decentralized arm selection policy that minimizes the system regret, defined as the total reward loss with respect to the ideal scenario of a known reward model and centralized scheduling among players. We propose a decentralized policy that achieves O(√T) system regret, where T is the length of the time horizon. The policy thus achieves the same maximum average reward as in the ideal scenario. Furthermore, the policy ensures fairness among players, i.e., all players achieve the same local average reward at the same rate. These problems find applications in cognitive radio networks, multi-agent systems, Internet advertising, and web search.
UR - http://www.scopus.com/inward/record.url?scp=79952434360&partnerID=8YFLogxK
DO - 10.1109/ALLERTON.2010.5707117
M3 - Conference Proceeding
AN - SCOPUS:79952434360
SN - 9781424482146
T3 - 2010 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
SP - 1669
EP - 1674
BT - 2010 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
T2 - 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010
Y2 - 29 September 2010 through 1 October 2010
ER -