
SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

Etienne Boursier†, Vianney Perchet*†

†ENS Paris-Saclay, CMLA    *Criteo AI Lab
etienne.boursier@ens-paris-saclay.fr, vianney.perchet@normalesup.org

Summary

• Synchronisation =⇒ communication through collisions in MMAB
• Contradicts previous lower bounds
• Synchronisation is unrealistic and a loophole in the model
• More realistic dynamic model: first logarithmic regret algorithm

Multiplayer bandits model

Motivation: cognitive radio networks
Framework: K arms, M ≤ K players. At each $t \le T$, player j pulls an arm $\pi^j(t)$

Reward: $r^j(t) := X_{\pi^j(t)}(t)\,\big(1 - \eta_{\pi^j(t)}(t)\big)$

• $\eta_k(t) := 1$ if several players pull arm k at time t, $0$ otherwise
• $X_k(t) \in [0, 1]$ i.i.d. with $\mathbb{E}[X_k] = \mu_k$

Decentralized: no information exchange between players
Observations: collision sensing: player j observes $r^j(t)$ and $\eta_{\pi^j(t)}(t)$; no sensing: observes $r^j(t)$ only
Synchro models: static: every player plays for $t = 1, \dots, T$; dynamic: player j plays for $t = 1 + \tau_j, \dots, T$
Goal: with $\mathcal{M}(t)$ the set of active players at time t, minimize the regret

$$R_T := \sum_{t=1}^{T} \sum_{k=1}^{\#\mathcal{M}(t)} \mu_{(k)} \;-\; \mathbb{E}_{\mu}\Big[\sum_{t=1}^{T} \sum_{j \in \mathcal{M}(t)} r^j(t)\Big]$$

Notations: $\mu_{(1)} \ge \dots \ge \mu_{(K)}$ and $\bar{\Delta}_{(M)} := \min_{i=1,\dots,M}\big(\mu_{(i)} - \mu_{(i+1)}\big)$
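To make the model concrete, here is a minimal Python sketch that simulates rounds of this multiplayer bandit with Bernoulli rewards and collision erasures, and accumulates the regret defined above; all function and variable names (play_round, top_M_sum, ...) are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(mu, actions):
    """One round: actions[j] = arm pulled by player j.
    Returns rewards r^j(t) = X_{pi^j(t)}(t) * (1 - eta_{pi^j(t)}(t)),
    where eta_k(t) = 1 iff several players pulled arm k."""
    X = rng.binomial(1, mu)                        # Bernoulli draws X_k(t) with E[X_k] = mu_k
    counts = np.bincount(actions, minlength=len(mu))
    eta = (counts >= 2).astype(float)              # collision indicators eta_k(t)
    return X[actions] * (1.0 - eta[actions])

# Toy run: K = 4 arms, M = 2 static players pulling uniformly at random.
mu = np.array([0.9, 0.7, 0.5, 0.2])
M, T = 2, 10_000
top_M_sum = np.sort(mu)[::-1][:M].sum()            # sum of the M best means
total_reward = sum(play_round(mu, rng.integers(0, len(mu), size=M)).sum() for _ in range(T))
regret = T * top_M_sum - total_reward              # R_T for static players, as defined above
print(f"empirical regret of uniform play after T={T} rounds: {regret:.1f}")
```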

State of the art bounds

Setting | Prior knowledge | Asymptotic upper bound | Reference
Centralized multiplayer | M | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$ | [3]
Decentralized, collision sensing | $\mu_{(M)} - \mu_{(M+1)}$ | $\frac{MK\log(T)}{(\mu_{(M)} - \mu_{(M+1)})^2}$ | [4]
Decentralized, collision sensing | – | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}} + MK\log(T)$ | this work
Decentralized, collision sensing, dynamic | $\bar{\Delta}_{(M)}$ | $\frac{M\sqrt{K\log(T)\,T}}{\bar{\Delta}_{(M)}^2}$ | [4]
Decentralized, no sensing, dynamic | – | $\frac{MK\log(T)}{\bar{\Delta}_{(M)}^2} + \frac{M^2K\log(T)}{\mu_{(M)}}$ | this work

Our results are the two rows marked "this work".

SIC-MMAB: static with collision sensing

Observation: the collision indicator $\eta_k(t) \in \{0, 1\}$ is a bit sent from one player to another =⇒ communication is possible between players

Algorithm 1: SIC-MMAB
Initialization phase: estimate M and the player's rank j in time O(K log(T))
for p = 1, 2, ... do
    Exploration phase: explore each active arm $2^p$ times without collision
    Communication phase: players exchange statistics of arms using collisions
    Eliminate suboptimal arms
    Attribute optimal arms to players (who enter the exploitation phase)
end
Exploitation phase: pull the attributed arm until T
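As an illustration of this phase structure, the sketch below runs the explore/eliminate/accept loop from a single global viewpoint, assuming the exchanged statistics are simply pooled in shared memory (the initialization and the collision-based communication phases are abstracted away; the confidence bound and all names are illustrative, not the paper's exact constants):

```python
import numpy as np

def sic_mmab_phase_loop(mu, M, T, rng=np.random.default_rng(0)):
    """Illustrative SIC-MMAB-style loop: in phase p, every active arm is explored
    2^p times; arms confidently outside the top M are eliminated and arms
    confidently inside it are accepted (then exploited until T)."""
    K = len(mu)
    active = list(range(K))                 # arms still being explored
    accepted = []                           # arms attributed for exploitation
    pulls, sums = np.zeros(K), np.zeros(K)
    t, p = 0, 1
    while t < T and len(accepted) < M:
        for k in active:                    # exploration phase: 2^p collision-free pulls per arm
            n = 2 ** p
            sums[k] += rng.binomial(n, mu[k]); pulls[k] += n; t += n
        # "communication" phase (abstracted): every player sees the pooled statistics
        means = sums[active] / pulls[active]
        conf = np.sqrt(2 * np.log(T) / pulls[active])
        lcb, ucb = means - conf, means + conf
        remaining = M - len(accepted)
        accept = [active[i] for i in range(len(active))
                  if np.sum(lcb[i] > ucb) >= len(active) - remaining]   # surely among the top `remaining`
        reject = [active[i] for i in range(len(active))
                  if np.sum(ucb[i] < lcb) >= remaining]                 # surely outside the top `remaining`
        accepted += accept
        active = [k for k in active if k not in accept and k not in reject]
        p += 1
    return accepted                         # exploitation phase: these arms are pulled until T

print(sic_mmab_phase_loop(np.array([0.9, 0.7, 0.5, 0.2]), M=2, T=100_000))
```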

Communication protocol

When player i sends stats of arm k to player j:

• Pull arm j to send a 1 bit (collision)
• Pull arm i to send a 0 bit (no collision)
• Send the quantized empirical mean in binary

For $2^p$ exploration rounds, statistics are communicated with p bits → sublogarithmic communication regret
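Below is a toy sketch of this protocol in Python; the bit/arm conventions follow the bullets above, while the number of quantization bits and all function names are illustrative assumptions:

```python
def encode_mean(mean, n_bits=8):
    """Quantize an empirical mean in [0, 1] to n_bits binary digits (MSB first)."""
    q = min(int(mean * 2 ** n_bits), 2 ** n_bits - 1)
    return [(q >> i) & 1 for i in reversed(range(n_bits))]

def send_stat(sender, receiver, mean, n_bits=8):
    """Arms the sender pulls to transmit `mean`: the receiver's arm for a 1 bit
    (the receiver, sitting on its own arm, observes a collision), its own arm for a 0 bit."""
    return [receiver if bit else sender for bit in encode_mean(mean, n_bits)]

def receive_stat(collisions, n_bits=8):
    """Decode the quantized mean from the collision indicators seen by the receiver."""
    q = 0
    for eta in collisions:
        q = (q << 1) | int(eta)
    return q / 2 ** n_bits

# Example: player 3 sends the empirical mean 0.73 of some arm to player 1.
arms = send_stat(sender=3, receiver=1, mean=0.73)
collisions_seen = [1 if a == 1 else 0 for a in arms]   # eta observed on the receiver's arm
assert abs(receive_stat(collisions_seen) - 0.73) < 2 ** -8
```

The point of the poster's remark above is that the number of communicated bits grows with the phase index p rather than with log(T), which is what keeps the communication regret sublogarithmic.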

Regret of SIC-MMAB

$$\mathbb{E}[R_T] \;\lesssim\; \overbrace{\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}}^{\text{Exploration phases}} \;+\; \underbrace{MK\log(T)}_{\text{Initialization phase}} \;+\; \overbrace{M^3 K \log_2\!\left(\frac{\log(T)}{(\mu_{(M)} - \mu_{(M+1)})^2}\right)}^{\text{Communication phases}}$$

Contradiction with the lower bound

Centralized lower bound [1]: $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$

Claimed decentralized lower bound [2]: $M \sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$? NO

SIC-MMAB =⇒ Decentralized ∼ Centralized: the claimed decentralized lower bound does not hold.

Toward a realistic model

Unrealistic communication protocols → loophole in the current model. Which assumption went wrong?

• Collision sensing? No: a bit can still be sent in $\frac{\log(T)}{\mu_{(K)}}$ rounds without it

• Synchronisation? Yes! Communication is possible because players know when to talk to each other

Synchronisation is an unrealistic assumption leading to hacking algorithms. Let's remove it → dynamic model.

DYN-MMAB: dynamic no sensing

Algorithm 2: DYN-MMAB
Exploration phase: pull $k \sim \mathcal{U}(K)$
    Update occupied arms and the queue of arms to exploit
    if $r_k(t) > 0$ and k = arm to exploit then
        Enter exploitation phase
    end
Exploitation phase: pull the exploited arm until T

How does it work? Estimate not $\mu_k$ but $\gamma\mu_k$, where $\gamma \simeq \mathbb{P}(\text{no collision})$; this still distinguishes optimal from suboptimal arms.
What if another player exploits arm k? Two cases (see the sketch below):

• either $\hat{r}_k \to 0$ quickly and arm k becomes suboptimal,
• or too many 0s in a row =⇒ k is occupied.
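A rough single-player sketch of this exploration logic (Python, illustrative only: collisions among exploring players are ignored, i.e. γ ≈ 1, and the zero-run threshold is a hypothetical constant, not the one from the paper):

```python
import numpy as np

def dyn_mmab_explore(mu, occupied_truth, T, zero_run_threshold=200, rng=np.random.default_rng(0)):
    """Pull arms uniformly at random, estimate gamma * mu_k from the observed rewards,
    and flag an arm as occupied after too many consecutive zero rewards on it
    (some other player is already exploiting it, so every pull there collides)."""
    K = len(mu)
    pulls, sums = np.zeros(K), np.zeros(K)
    zero_run = np.zeros(K, dtype=int)
    occupied = set()
    for _ in range(T):
        k = int(rng.integers(K))                          # pull k ~ U(K)
        r = 0.0 if occupied_truth[k] else rng.binomial(1, mu[k])
        pulls[k] += 1; sums[k] += r
        zero_run[k] = zero_run[k] + 1 if r == 0 else 0
        if zero_run[k] >= zero_run_threshold:
            occupied.add(k)                               # too many 0s in a row => k is occupied
    gamma_mu_hat = np.divide(sums, pulls, out=np.zeros(K), where=pulls > 0)
    return gamma_mu_hat, occupied

mu = np.array([0.9, 0.7, 0.5, 0.2])
occupied_truth = np.array([False, True, False, False])    # another player already exploits arm 1
est, occ = dyn_mmab_explore(mu, occupied_truth, T=20_000)
print(est, occ)   # arm 1's estimate collapses to 0 and it is flagged as occupied
```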

Regret of DYN-MMAB

$$\mathbb{E}[R_T] \;\lesssim\; \frac{M^2 K \log(T)}{\mu_{(M)}} + \frac{MK\log(T)}{\bar{\Delta}_{(M)}^2}$$

References

[1] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: I.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976.

[2] Besson, L. and Kaufmann, E. (2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory, Lanzarote, Spain.

[3] Komiyama, J., Honda, J., and Nakagawa, H. (2015). Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152–1161.

[4] Rosenski, J., Shamir, O., and Szlak, L. (2016). Multi-player bandits – a musical chairs approach. In International Conference on Machine Learning, pages 155–163.