The UCB Algorithm
Paper: Finite-time Analysis of the Multiarmed Bandit Problem, by Auer, Cesa-Bianchi, Fischer, Machine Learning 47, 2002
Presented by Markus Enzenberger. Go Seminar, University of Alberta.
March 14, 2007
Introduction: Multiarmed Bandit Problem, Regret, Lai and Robbins (1985), In this Paper
Main Results: Theorem 1 (UCB1), Theorem 2 (UCB2), Theorem 3 (εn-GREEDY), Theorem 4 (UCB1-NORMAL), Independence Assumptions
Experiments: UCB1-TUNED, Setup, Best value for α, Summary of Results, Comparison on Distribution 11
Conclusions
Multiarmed Bandit Problem
- Example for the exploration vs. exploitation dilemma
- K independent gambling machines (armed bandits)
- Each machine has an unknown stationary probability distribution for generating the reward
- Observed rewards when playing machine i: $X_{i,1}, X_{i,2}, \ldots$
- Policy A chooses the next machine based on the previous sequence of plays and rewards
Regret
Regret of policy A:

$$\mu^* n - \sum_{j=1}^{K} \mu_j \, E[T_j(n)]$$

- $\mu_i$: expectation of machine $i$
- $\mu^*$: expectation of the optimal machine
- $T_j(n)$: number of times machine $j$ was played in the first $n$ plays

The regret is the expected loss after n plays due to the fact that the policy does not always play the optimal machine.
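As a small worked example of the regret definition, the sketch below evaluates it for made-up means and play counts. The function names and the numbers are assumptions chosen for illustration. The second function uses the equivalent form "sum of gap times expected plays", which holds because the play counts sum to n.

```python
def expected_regret(means, expected_plays):
    """Regret mu* * n - sum_j mu_j * E[T_j(n)], with n = sum of the play counts."""
    n = sum(expected_plays)
    best = max(means)
    return best * n - sum(m * t for m, t in zip(means, expected_plays))


def regret_from_gaps(means, expected_plays):
    """Equivalent form: sum_j (mu* - mu_j) * E[T_j(n)]."""
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, expected_plays))


# Hypothetical numbers: three machines, 100 plays in total.
means = [0.9, 0.8, 0.5]
plays = [70, 20, 10]
print(expected_regret(means, plays), regret_from_gaps(means, plays))
```

Both forms give the same value, so only plays of suboptimal machines contribute to the regret.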
Lai and Robbins (1985)
Policy for a class of reward distributions (including normal, Bernoulli, Poisson) with regret asymptotically bounded by the logarithm of n:

$$E[T_j(n)] \le \frac{\ln(n)}{D(p_j \,\|\, p^*)} \quad \text{as } n \to \infty$$

$D(p_j \,\|\, p^*)$: Kullback-Leibler divergence between reward densities

- This is the best possible regret
- Policy computes upper confidence index for each machine
- Needs entire sequence of rewards for each machine
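To give a feel for the size of the Lai and Robbins bound, here is a minimal sketch for the Bernoulli case, where the Kullback-Leibler divergence has a closed form. The helper names and the example means 0.8 and 0.9 are assumptions for illustration only.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_plays(p_j, p_star, n):
    """Asymptotic scale ln(n) / D(p_j || p*) for plays of suboptimal machine j."""
    return math.log(n) / bernoulli_kl(p_j, p_star)


# Hypothetical machines: suboptimal mean 0.8, optimal mean 0.9.
for n in (10**3, 10**6):
    print(n, round(lai_robbins_plays(0.8, 0.9, n), 1))
```

Multiplying the number of plays by 1000 only doubles the value, which is what logarithmic growth means in practice.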
In this Paper
- Show policies with logarithmic regret uniformly over time
- Policies are simple and efficient
- Notation: $\Delta_i := \mu^* - \mu_i$
Theorem 1 (UCB1)
Policy with finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support
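A minimal sketch of the UCB1 policy, assuming the form given in the Auer et al. paper: play each machine once, then always play the machine maximizing the average reward plus $\sqrt{2 \ln n / n_j}$, where $n_j$ is the number of plays of machine $j$ and $n$ the total number of plays so far. The function name, the `pull` callback, and the Bernoulli test bed are assumptions made for this illustration, and rewards are taken to lie in [0, 1].

```python
import math
import random


def ucb1(pull, num_arms, horizon):
    """Run UCB1 for `horizon` plays; `pull(j)` returns a reward in [0, 1]."""
    counts = [0] * num_arms   # n_j: times machine j was played
    sums = [0.0] * num_arms   # running sum of rewards of machine j
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # initialization: play each machine once
        else:
            n = t - 1  # overall number of plays done so far
            arm = max(
                range(num_arms),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(n) / counts[j]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical test bed: three Bernoulli machines with made-up means.
rng = random.Random(0)
means = [0.9, 0.6, 0.5]
counts = ucb1(lambda j: 1.0 if rng.random() < means[j] else 0.0, len(means), 5000)
print(counts)
```

The policy only needs the play count and reward sum per machine, not the full reward sequence.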
- $E[T_j(n)] \le \frac{8}{\Delta_j^2} \ln(n)$ is worse than Lai and Robbins
- $D(p_j \,\|\, p^*) \ge 2\Delta_j^2$ with best possible constant 2
→ UCB2 brings main constant arbitrarily close to $\frac{1}{2\Delta_j^2}$
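The ordering of the three leading constants can be checked numerically. The sketch below does this for Bernoulli rewards, an assumption made so that the Kullback-Leibler divergence has a closed form. The function names and the example means are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def leading_constants(p_j, p_star):
    """Factors multiplying ln(n) in the bounds on E[T_j(n)]."""
    gap = p_star - p_j
    return {
        "lai_robbins": 1.0 / bernoulli_kl(p_j, p_star),
        "ucb2_limit": 1.0 / (2.0 * gap ** 2),
        "ucb1": 8.0 / gap ** 2,
    }


# Hypothetical machines: suboptimal mean 0.4, optimal mean 0.5.
consts = leading_constants(0.4, 0.5)
print(consts)
```

Because the divergence is at least $2\Delta_j^2$, the Lai and Robbins constant is never larger than the UCB2 limit, and the UCB1 constant is 16 times that limit.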
Theorem 2 (UCB2)
More complicated version of UCB1 with better constants for the bound on regret.
- First term brings constant arbitrarily close to $\frac{1}{2\Delta_j^2}$ for small α
- $c_\alpha \to \infty$ as $\alpha \to 0$
- Let $\alpha = \alpha_n$ slowly decrease
Theorem 3 (εn-GREEDY)
Similar result for ε-greedy heuristic. (ε needs to go to 0; constant ε has linear regret)
- For c large enough, the bound is of order $c/(d^2 n) + o(1/n)$ → logarithmic bound on regret
- Bound on instantaneous regret
- Need to know a lower bound d on the difference in expectation between the best and second-best machine
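A minimal sketch of the εn-GREEDY policy, assuming the schedule $\varepsilon_n = \min\{1, cK/(d^2 n)\}$ from the Auer et al. paper: with probability $\varepsilon_n$ play a random machine, otherwise play the machine with the highest average reward so far. Summing a per-play bound of order $c/(d^2 n)$ over n plays gives a harmonic sum, which is where the logarithmic regret comes from. The function names, the test bed, and the treatment of unplayed machines are assumptions of this sketch.

```python
import random


def epsilon_n(n, c, d, num_arms):
    """Exploration probability at play n: min(1, c*K / (d^2 * n))."""
    return min(1.0, c * num_arms / (d * d * n))


def epsilon_n_greedy(pull, num_arms, horizon, c, d, rng):
    """Run the policy for `horizon` plays and return the play counts."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for n in range(1, horizon + 1):
        if rng.random() < epsilon_n(n, c, d, num_arms):
            arm = rng.randrange(num_arms)  # explore: uniformly random machine
        else:
            # exploit: highest average so far; unplayed machines count as 0.0
            # (a choice made for this sketch)
            arm = max(
                range(num_arms),
                key=lambda j: sums[j] / counts[j] if counts[j] else 0.0,
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical test bed: Bernoulli machines; d = 0.3 is the true smallest gap.
env = random.Random(1)
means = [0.9, 0.6, 0.5]
counts = epsilon_n_greedy(
    lambda j: 1.0 if env.random() < means[j] else 0.0,
    len(means), 5000, c=5.0, d=0.3, rng=random.Random(2),
)
print(counts)
```

If d is set larger than the true gap, exploration decays too quickly, which matches the slide's remark that d must be a lower bound.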
Theorem 4 (UCB1-NORMAL)
Index-based policy with logarithmically bounded finite-time regret for normally distributed rewards with unknown mean and variance.
- Like UCB1, but since the kind of distribution is known, the sample variance is used to estimate the variance of the distribution
- Proof depends on bounds for the tails of the $\chi^2$ and Student distributions, which were only verified numerically
Independence Assumptions
- Theorems 1–3 also hold for rewards that are not independent across machines: $X_{i,s}$ and $X_{j,t}$ might be dependent for any $s$, $t$ and $i \ne j$
- The rewards of a single machine do not need to be independent and identically distributed. Weaker assumption: $E[X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}] = \mu_i$ for all $1 \le t \le n$
UCB1-TUNED
Fine-tuned version of UCB taking the measured variance into account (no proven regret bounds)
Upper confidence bound on variance of machine j
Replace upper confidence bound in UCB1 by
1/4 is upper bound on variance of a Bernoulli random variable
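The sketch below assumes the expressions given in the Auer et al. paper for UCB1-TUNED: the variance bound $V_j(s) = \frac{1}{s}\sum_{\tau=1}^{s} X_{j,\tau}^2 - \bar{X}_{j,s}^2 + \sqrt{2 \ln t / s}$, and the exploration term $\sqrt{\frac{\ln n}{n_j} \min\{1/4, V_j(n_j)\}}$ in place of UCB1's $\sqrt{2 \ln n / n_j}$. The function names and the example statistics are hypothetical.

```python
import math


def variance_ucb(sum_sq, mean, s, t):
    """V_j(s): sample variance after s plays plus the term sqrt(2 ln t / s)."""
    return sum_sq / s - mean * mean + math.sqrt(2.0 * math.log(t) / s)


def ucb1_tuned_index(mean, sum_sq, n_j, n):
    """Average reward plus the variance-aware exploration term."""
    v = variance_ucb(sum_sq, mean, n_j, n)
    return mean + math.sqrt(math.log(n) / n_j * min(0.25, v))


def ucb1_index(mean, n_j, n):
    """Plain UCB1 index, shown for comparison."""
    return mean + math.sqrt(2.0 * math.log(n) / n_j)


# Hypothetical statistics of one machine after n_j = 100 of n = 1000 plays:
# near-constant rewards around 0.5 (tiny variance) vs. fair coin flips.
low_var = ucb1_tuned_index(mean=0.5, sum_sq=25.01, n_j=100, n=1000)
coin = ucb1_tuned_index(mean=0.5, sum_sq=50.0, n_j=100, n=1000)
plain = ucb1_index(mean=0.5, n_j=100, n=1000)
print(low_var, coin, plain)
```

Since the `min` caps the variance term at 1/4, the tuned exploration term is never larger than the UCB1 term, so the tuned policy explores less.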
Setup
Distributions
Best value for α
- Relatively insensitive, as long as α is small
- Use fixed α = 0.001
Summary of Results
- An optimally tuned εn-GREEDY performs almost always best
- Performance of a not well-tuned εn-GREEDY degrades rapidly
- In most cases UCB1-TUNED performs comparably to a well-tuned εn-GREEDY
- UCB1-TUNED is not sensitive to the variances of the machines
- UCB2 performs similarly to UCB1-TUNED, but always slightly worse
Comparison on Distribution 11
Conclusions
- Simple, efficient policies for the bandit problem on any set of reward distributions with known bounded support with uniform logarithmic regret
- Based on upper confidence bounds (with the exception of εn-GREEDY)
- Robust with respect to the introduction of moderate dependencies
- Many extensions of this work are possible
  - Generalize to non-stationary problems
  - Based on Gittins allocation indices (needs preliminary knowledge or learning of the indices)