
SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

Etienne Boursier†, Vianney Perchet*†

†ENS Paris-Saclay, CMLA    *Criteo AI Lab
etienne.boursier@ens-paris-saclay.fr, vianney.perchet@normalesup.org

Summary

• Synchronisation =⇒ communication through collisions in MMAB
• Contradicts previous lower bounds
• Synchronisation is unrealistic and a loophole in the model
• More realistic dynamic model: first logarithmic regret algorithm

Multiplayer bandits model

Motivation: cognitive radio networks
Framework: K arms, M ≤ K players. At each $t \le T$, player j pulls an arm $\pi^j(t)$

Reward: $r^j(t) := X_{\pi^j(t)}(t)\,\big(1 - \eta_{\pi^j(t)}(t)\big)$

• $\eta_k(t) := 1$ if several players pull arm k at time t, $0$ otherwise
• $X_k(t) \in [0, 1]$ i.i.d. with $\mathbb{E}[X_k] = \mu_k$

Decentralized: no information exchange between players
Observations: collision sensing: player j observes $r^j(t)$ and $\eta_{\pi^j(t)}(t)$; no sensing: observes $r^j(t)$ only
Synchro models: static: every player plays for $t = 1, \dots, T$; dynamic: player j plays for $t = 1 + \tau_j, \dots, T$
Goal: with $\mathcal{M}(t)$ the set of active players at time t, minimize the regret

$$R_T := \sum_{t=1}^{T} \sum_{k=1}^{\#\mathcal{M}(t)} \mu_{(k)} \;-\; \mathbb{E}_{\mu}\Big[\sum_{t=1}^{T} \sum_{j \in \mathcal{M}(t)} r^j(t)\Big]$$

Notations: $\mu_{(1)} \ge \dots \ge \mu_{(K)}$ and $\bar{\Delta}_{(M)} := \min_{i=1,\dots,M}\big(\mu_{(i)} - \mu_{(i+1)}\big)$
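To make the model concrete, here is a minimal Python sketch that simulates rounds of this multiplayer bandit with Bernoulli rewards and collision erasures, and accumulates the regret defined above; all function and variable names (play_round, top_M_sum, ...) are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(mu, actions):
    """One round: actions[j] = arm pulled by player j.
    Returns rewards r^j(t) = X_{pi^j(t)}(t) * (1 - eta_{pi^j(t)}(t)),
    where eta_k(t) = 1 iff several players pulled arm k."""
    X = rng.binomial(1, mu)                        # Bernoulli draws X_k(t) with E[X_k] = mu_k
    counts = np.bincount(actions, minlength=len(mu))
    eta = (counts >= 2).astype(float)              # collision indicators eta_k(t)
    return X[actions] * (1.0 - eta[actions])

# Toy run: K = 4 arms, M = 2 static players pulling uniformly at random.
mu = np.array([0.9, 0.7, 0.5, 0.2])
M, T = 2, 10_000
top_M_sum = np.sort(mu)[::-1][:M].sum()            # sum of the M best means
total_reward = sum(play_round(mu, rng.integers(0, len(mu), size=M)).sum() for _ in range(T))
regret = T * top_M_sum - total_reward              # R_T for static players, as defined above
print(f"empirical regret of uniform play after T={T} rounds: {regret:.1f}")
```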

State of the art bounds

Setting | Prior knowledge | Asymptotic upper bound | Reference
Centralized multiplayer | M | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$ | [3]
Decentralized, collision sensing | $\mu_{(M)} - \mu_{(M+1)}$ | $\frac{MK\log(T)}{(\mu_{(M)} - \mu_{(M+1)})^2}$ | [4]
Decentralized, collision sensing | – | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}} + MK\log(T)$ | this work
Decentralized, collision sensing, dynamic | $\bar{\Delta}_{(M)}$ | $\frac{M\sqrt{K\log(T)\,T}}{\bar{\Delta}_{(M)}^2}$ | [4]
Decentralized, no sensing, dynamic | – | $\frac{MK\log(T)}{\bar{\Delta}_{(M)}^2} + \frac{M^2K\log(T)}{\mu_{(M)}}$ | this work

Our results are the two rows marked "this work".

SIC-MMAB: static with collision sensing

Observation: the collision indicator $\eta_k(t) \in \{0, 1\}$ is a bit sent from one player to another =⇒ communication is possible between players

Algorithm 1: SIC-MMAB
Initialization phase: estimate M and the player's rank j in time O(K log(T))
for p = 1, 2, ... do
    Exploration phase: explore each active arm $2^p$ times without collision
    Communication phase: players exchange statistics of arms using collisions
    Eliminate suboptimal arms
    Attribute optimal arms to players (who enter the exploitation phase)
end
Exploitation phase: pull the attributed arm until T
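As an illustration of this phase structure, the sketch below runs the explore/eliminate/accept loop from a single global viewpoint, assuming the exchanged statistics are simply pooled in shared memory (the initialization and the collision-based communication phases are abstracted away; the confidence bound and all names are illustrative, not the paper's exact constants):

```python
import numpy as np

def sic_mmab_phase_loop(mu, M, T, rng=np.random.default_rng(0)):
    """Illustrative SIC-MMAB-style loop: in phase p, every active arm is explored
    2^p times; arms confidently outside the top M are eliminated and arms
    confidently inside it are accepted (then exploited until T)."""
    K = len(mu)
    active = list(range(K))                 # arms still being explored
    accepted = []                           # arms attributed for exploitation
    pulls, sums = np.zeros(K), np.zeros(K)
    t, p = 0, 1
    while t < T and len(accepted) < M:
        for k in active:                    # exploration phase: 2^p collision-free pulls per arm
            n = 2 ** p
            sums[k] += rng.binomial(n, mu[k]); pulls[k] += n; t += n
        # "communication" phase (abstracted): every player sees the pooled statistics
        means = sums[active] / pulls[active]
        conf = np.sqrt(2 * np.log(T) / pulls[active])
        lcb, ucb = means - conf, means + conf
        remaining = M - len(accepted)
        accept = [active[i] for i in range(len(active))
                  if np.sum(lcb[i] > ucb) >= len(active) - remaining]   # surely among the top `remaining`
        reject = [active[i] for i in range(len(active))
                  if np.sum(ucb[i] < lcb) >= remaining]                 # surely outside the top `remaining`
        accepted += accept
        active = [k for k in active if k not in accept and k not in reject]
        p += 1
    return accepted                         # exploitation phase: these arms are pulled until T

print(sic_mmab_phase_loop(np.array([0.9, 0.7, 0.5, 0.2]), M=2, T=100_000))
```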

Communication protocol

When player i sends stats of arm k to player j:

• Pull arm j to send a 1 bit (collision)
• Pull arm i to send a 0 bit (no collision)
• Send the quantized empirical mean in binary

For $2^p$ exploration rounds, statistics are communicated with p bits → sublogarithmic communication regret
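Below is a toy sketch of this protocol in Python; the bit/arm conventions follow the bullets above, while the number of quantization bits and all function names are illustrative assumptions:

```python
def encode_mean(mean, n_bits=8):
    """Quantize an empirical mean in [0, 1] to n_bits binary digits (MSB first)."""
    q = min(int(mean * 2 ** n_bits), 2 ** n_bits - 1)
    return [(q >> i) & 1 for i in reversed(range(n_bits))]

def send_stat(sender, receiver, mean, n_bits=8):
    """Arms the sender pulls to transmit `mean`: the receiver's arm for a 1 bit
    (the receiver, sitting on its own arm, observes a collision), its own arm for a 0 bit."""
    return [receiver if bit else sender for bit in encode_mean(mean, n_bits)]

def receive_stat(collisions, n_bits=8):
    """Decode the quantized mean from the collision indicators seen by the receiver."""
    q = 0
    for eta in collisions:
        q = (q << 1) | int(eta)
    return q / 2 ** n_bits

# Example: player 3 sends the empirical mean 0.73 of some arm to player 1.
arms = send_stat(sender=3, receiver=1, mean=0.73)
collisions_seen = [1 if a == 1 else 0 for a in arms]   # eta observed on the receiver's arm
assert abs(receive_stat(collisions_seen) - 0.73) < 2 ** -8
```

The point of the poster's remark above is that the number of communicated bits grows with the phase index p rather than with log(T), which is what keeps the communication regret sublogarithmic.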

Regret of SIC-MMAB

$$\mathbb{E}[R_T] \;\lesssim\; \overbrace{\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}}^{\text{Exploration phases}} \;+\; \underbrace{MK\log(T)}_{\text{Initialization phase}} \;+\; \overbrace{M^3 K \log_2\!\left(\frac{\log(T)}{(\mu_{(M)} - \mu_{(M+1)})^2}\right)}^{\text{Communication phases}}$$

Contradiction with the lower bound

Centralized lower bound [1]: $\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$

Claimed decentralized lower bound [2]: $M \sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$? NO

SIC-MMAB =⇒ Decentralized ∼ Centralized: the claimed decentralized lower bound does not hold.

Toward a realistic model

Unrealistic communication protocols → loophole in the current model. Which assumption went wrong?

• Collision sensing? No: a bit can still be sent in $\frac{\log(T)}{\mu_{(K)}}$ rounds without it

• Synchronisation? Yes! Communication is possible because players know when to talk to each other

Synchronisation is an unrealistic assumption leading to hacking algorithms. Let's remove it → dynamic model.

DYN-MMAB: dynamic no sensing

Algorithm 2: DYN-MMAB
Exploration phase: pull $k \sim \mathcal{U}(K)$
    Update occupied arms and the queue of arms to exploit
    if $r_k(t) > 0$ and k = arm to exploit then
        Enter exploitation phase
    end
Exploitation phase: pull the exploited arm until T

How does it work? Estimate not $\mu_k$ but $\gamma\mu_k$, where $\gamma \simeq \mathbb{P}(\text{no collision})$; this still distinguishes optimal from suboptimal arms.
What if another player exploits arm k? Two cases (see the sketch below):

• either $\hat{r}_k \to 0$ quickly and arm k becomes suboptimal,
• or too many 0s in a row =⇒ k is occupied.
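A rough single-player sketch of this exploration logic (Python, illustrative only: collisions among exploring players are ignored, i.e. γ ≈ 1, and the zero-run threshold is a hypothetical constant, not the one from the paper):

```python
import numpy as np

def dyn_mmab_explore(mu, occupied_truth, T, zero_run_threshold=200, rng=np.random.default_rng(0)):
    """Pull arms uniformly at random, estimate gamma * mu_k from the observed rewards,
    and flag an arm as occupied after too many consecutive zero rewards on it
    (some other player is already exploiting it, so every pull there collides)."""
    K = len(mu)
    pulls, sums = np.zeros(K), np.zeros(K)
    zero_run = np.zeros(K, dtype=int)
    occupied = set()
    for _ in range(T):
        k = int(rng.integers(K))                          # pull k ~ U(K)
        r = 0.0 if occupied_truth[k] else rng.binomial(1, mu[k])
        pulls[k] += 1; sums[k] += r
        zero_run[k] = zero_run[k] + 1 if r == 0 else 0
        if zero_run[k] >= zero_run_threshold:
            occupied.add(k)                               # too many 0s in a row => k is occupied
    gamma_mu_hat = np.divide(sums, pulls, out=np.zeros(K), where=pulls > 0)
    return gamma_mu_hat, occupied

mu = np.array([0.9, 0.7, 0.5, 0.2])
occupied_truth = np.array([False, True, False, False])    # another player already exploits arm 1
est, occ = dyn_mmab_explore(mu, occupied_truth, T=20_000)
print(est, occ)   # arm 1's estimate collapses to 0 and it is flagged as occupied
```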

Regret of DYN-MMAB

$$\mathbb{E}[R_T] \;\lesssim\; \frac{M^2 K \log(T)}{\mu_{(M)}} + \frac{MK\log(T)}{\bar{\Delta}_{(M)}^2}$$

References

[1] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: I.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976.

[2] Besson, L. and Kaufmann, E. (2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory, Lanzarote, Spain.

[3] Komiyama, J., Honda, J., and Nakagawa, H. (2015). Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152–1161.

[4] Rosenski, J., Shamir, O., and Szlak, L. (2016). Multi-player bandits – a musical chairs approach. In International Conference on Machine Learning, pages 155–163.