Introduction to Bandits
Transcript of Introduction to Bandits
Introduction to Bandits
Chris Liaw
Machine Learning Reading Group
July 10, 2019
"[T]he problem is a classic one; it was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." - Peter Whittle (on the bandit problem)
Motivation and applications
Clinical trials (Thompson '33)
Online ads
Search results
Many others: network routing, recommender systems, etc.
Outline
• Intro to stochastic bandits
• Explore-then-commit
• Upper confidence bound algorithm
• Adversarial bandits & Exp3
• Application: Learning Diverse Rankings
Intro to stochastic bandits
K arms; unknown sequence of stochastic rewards r_1, r_2, … ∈ [0,1]^K; r_t ~ ν
For each round t = 1, 2, …, T (assume the horizon T is known; will say more later):
• Choose arm A_t ∈ [K]
• Obtain reward r_{t,A_t} and only see r_{t,A_t}
Problem was introduced by Robbins (1952).
Example: pull arm 3 and observe reward 0.2; pull arm 1 and observe 0.01; pull arm 3 again and observe 0.6. The rewards of the arms that were not pulled are never revealed.
Intro to stochastic bandits
Same setup as before: K arms, rewards r_1, r_2, … ∈ [0,1]^K with r_t ~ ν, and in each round t = 1, …, T we choose an arm A_t ∈ [K] and see only r_{t,A_t}.
Arm i has mean μ_i, which is unknown.
Goal: Find a policy that minimizes the regret
Regret(T) = T·μ* − E[ Σ_{t∈[T]} r_{t,A_t} ],
where μ* = max_i μ_i is the mean reward of the best arm; the first term is the expected reward of always playing the best arm and the second term is the algorithm's expected reward.
Ideally, we would like that Regret(T) = o(T).
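To make the setting concrete, here is a minimal sketch in Python of a stochastic bandit environment with Bernoulli rewards and the regret computation above; the class and function names are illustrative, not from the talk.

```python
import numpy as np

class BernoulliBandit:
    """K-armed stochastic bandit: arm i pays reward 1 with probability mu[i], else 0."""

    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # Only the reward of the pulled arm is ever observed.
        return float(self.rng.random() < self.mu[i])

def expected_regret(mu, pulls):
    """Regret(T) = T * mu_star - sum_t mu[A_t], i.e. the definition above in expectation."""
    mu = np.asarray(mu, dtype=float)
    return len(pulls) * mu.max() - mu[np.asarray(pulls)].sum()
```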
Exploration-Exploitation tradeoff
At each time step, we can either:
1. (Exploit) Pull the arm we think is the best one; or
2. (Explore) Pull an arm we think is suboptimal.
1. We do not know which arm is the best one, so if we keep exploiting, we may keep pulling a suboptimal arm, which may incur large regret.
2. If we explore, we gather information about the arms, but we pull suboptimal arms, so we may incur large regret again!
The challenge is to trade off exploration and exploitation!
Explore-then-commit (ETC)
Perhaps the simplest algorithm that provably gets sublinear regret!
Let m_0 be a hyper-parameter and assume T ≥ K·m_0.
1. Pull each of the K arms m_0 times.
2. Compute the empirical average μ̂_i of each arm.
3. Pull the arm with the largest empirical average for the remaining T − K·m_0 rounds.
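A minimal sketch of ETC under the assumptions above; `pull(i)` stands for any function returning a reward in [0,1] for arm i (e.g. the environment sketched earlier), and the names are illustrative.

```python
import numpy as np

def explore_then_commit(pull, K, T, m0):
    """ETC sketch: pull each of the K arms m0 times, then commit to the
    empirically best arm for the remaining T - K*m0 rounds (assumes T >= K*m0)."""
    sums = np.zeros(K)
    pulls = []
    # 1. Exploration phase: m0 pulls of every arm.
    for i in range(K):
        for _ in range(m0):
            sums[i] += pull(i)
            pulls.append(i)
    # 2./3. Commit to the arm with the largest empirical average.
    best = int(np.argmax(sums / m0))
    for _ in range(T - K * m0):
        pull(best)
        pulls.append(best)
    return pulls
```

For example, `explore_then_commit(BernoulliBandit([0.3, 0.5, 0.7]).pull, K=3, T=10_000, m0=100)` runs ETC against the toy environment above.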
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. Then
Regret(T) ≤ m_0 · Σ_{i∈[K]} Δ_i + (T − K·m_0) · Σ_{i∈[K]} Δ_i · exp(−m_0·Δ_i²/4).
The first term is the "cost of exploration"; the second term is the suboptimality of each additional step after committing.
Note: the term Δ_i·exp(−m_0·Δ_i²/4) is small when m_0 is large.
Explore-then-commit (ETC)
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. Then
Regret(T) ≤ m_0 · Σ_{i∈[K]} Δ_i + (T − K·m_0) · Σ_{i∈[K]} Δ_i · exp(−m_0·Δ_i²/4).
• This illustrates the exploration-exploitation tradeoff:
• Explore too much (m_0 large) and the first term is large.
• Exploit too much (m_0 small) and the second term is large.
• Can we tune exploration (i.e. m_0) to get sublinear regret?
• Yes! Choose m_0 = T^{2/3}. Can show that Regret(T) = O(K·T^{2/3}).
• If K = 2 arms, can use a data-dependent m_0 to get Regret(T) = O(T^{1/2}) [Garivier, Kaufmann, Lattimore NeurIPS '16].
Explore-then-commit (ETC)
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. Then
Regret(T) ≤ m_0 · Σ_{i∈[K]} Δ_i + (T − K·m_0) · Σ_{i∈[K]} Δ_i · exp(−m_0·Δ_i²/4).
Sketch.
• Initially, we try each arm i for m_0 trials; this incurs regret m_0·Δ_i.
• Next, we exploit; we only pull arm i again if the empirical average of arm i is at least that of the best arm.
• This happens with probability at most exp(−m_0·Δ_i²/4).
• Summing the contribution from all arms gives the claimed regret.
Aside: Doubling Trick
• Previously, we assumed that the time horizon T is known beforehand.
• The doubling trick can be used to get around that.
• Suppose that some algorithm A has regret R(T) if it knew the time horizon beforehand.
• At every power-of-2 step (i.e. at step 2^i for some i), we reset A and assume the time horizon is 2^i.
• Then this gives an algorithm with regret (roughly) R(t) for all t, i.e. an "anytime algorithm".
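A minimal sketch of the doubling trick as a wrapper; `run_alg(pull, T)` stands for any fixed-horizon bandit algorithm (such as the ETC sketch above) and is an assumed interface, not something from the talk.

```python
def doubling_trick(run_alg, pull, total_rounds):
    """Anytime wrapper: run the fixed-horizon algorithm in phases of length
    1, 2, 4, 8, ..., restarting it from scratch at every power of 2."""
    played, k = 0, 0
    while played < total_rounds:
        horizon = min(2 ** k, total_rounds - played)  # last phase may be truncated
        run_alg(pull, horizon)                        # algorithm assumes horizon 2^k
        played += horizon
        k += 1
```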
Upper confidence bound (UCB) algorithm
• Based on the idea of "optimism in the face of uncertainty."
• Algorithm: compute the empirical mean of each arm and a confidence interval; use the upper confidence bound as a proxy for the goodness of an arm.
• Note: the confidence interval is chosen so that the true mean is very unlikely to be outside of it.
Upper confidence bound (UCB) algorithm
[Illustration: confidence intervals around the empirical means μ̂_1, μ̂_2, μ̂_3 of three arms.]
Start by pulling each arm once.
Arm 3 has the highest UCB, so we pull it next.
Now arm 2 has the highest UCB; we pull arm 2.
Upper confidence bound (UCB) algorithm
Let δ ∈ (0,1) be a hyper-parameter.
• Pull each of the K arms once.
• For t = K+1, K+2, …, T:
1. Let n_i(t) be the number of times arm i has been pulled so far and μ̂_i(t) be its empirical average.
2. Let UCB_i(t) = μ̂_i(t) + √(2·log(1/δ) / n_i(t)).
3. Play an arm in argmax_i UCB_i(t).
Claim. Fix an arm i. Then with probability at least 1 − 2δ, we have
|μ_i − μ̂_i(t)| ≤ √(2·log(1/δ) / n_i(t)).
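A minimal sketch of the UCB rule above, with the same assumed `pull(i)` interface as before (illustrative, not the talk's code).

```python
import numpy as np

def ucb(pull, K, T, delta):
    """UCB sketch: pull each arm once, then always play the arm maximizing
    mu_hat_i(t) + sqrt(2 * log(1/delta) / n_i(t))."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    for i in range(K):                         # pull each of the K arms once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(K, T):
        ucbs = sums / counts + np.sqrt(2.0 * np.log(1.0 / delta) / counts)
        i = int(np.argmax(ucbs))               # optimism in the face of uncertainty
        sums[i] += pull(i)
        counts[i] += 1
```

The theorem that follows takes δ ~ 1/T², i.e. one would call `ucb(pull, K, T, delta=1.0 / T**2)`.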
Upper confidence bound (UCB) algorithm
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. If we choose δ ~ 1/T², then
Regret(T) ≤ C · Σ_{i∈[K]} Δ_i + Σ_{i:Δ_i>0} C·log T / Δ_i
for an absolute constant C. The first term we always have to pay. The second term turns out to mean the following: if Δ_i > 0, we only pull arm i roughly Δ_i^{-2}·log T times, incurring regret Δ_i each time.
Upper confidence bound (UCB) algorithm
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. If we choose δ ~ 1/T², then
Regret(T) ≤ C · Σ_{i∈[K]} Δ_i + Σ_{i:Δ_i>0} C·log T / Δ_i.
Sketch.
• Fact. Regret(T) = Σ_{i:Δ_i>0} Δ_i·E[n_i(T)] (n_i(T) counts the number of times arm i was pulled up to time T).
• Want to bound E[n_i(T)] whenever Δ_i > 0.
• W.h.p. UCB_i(t) = μ̂_i(t) + √(2·log(1/δ)/n_i(t)) ≤ μ_i + 2·√(2·log(1/δ)/n_i(t)).
• If n_i(t) ≥ Ω(log(1/δ)·Δ_i^{-2}) then UCB_i(t) < μ*, so w.h.p. we pull arm i only O(log(1/δ)·Δ_i^{-2}) times.
• To conclude, if Δ_i > 0 then Δ_i·E[n_i(T)] ≲ log(1/δ)·Δ_i^{-1}.
• Choose δ ~ 1/T² to beat the union bound.
Upper confidence bound (UCB) algorithm
Theorem. Let Δ_i := μ* − μ_i be the suboptimality of arm i. If we choose δ ~ 1/T², then
Regret(T) ≤ C · Σ_{i∈[K]} Δ_i + Σ_{i:Δ_i>0} C·log T / Δ_i.
This is an instance-dependent bound, but we can also get an instance-free bound.
Corollary. If we choose δ ~ 1/T², then
Regret(T) ≤ O(√(T·K·log T)).
So the regret is O(√(K·T·log T)). (Recall that ETC has regret O(K·T^{2/3}).)
It is possible to get regret O(√(T·K)) [Audibert, Bubeck '10]; this is optimal.
UCB can also be extended to heavier tails (e.g. [Bubeck, Cesa-Bianchi, Lugosi '13]).
ε-greedy algorithm
Let ε_{K+1}, ε_{K+2}, … ∈ [0,1] be an exploration schedule.
• Pull each of the K arms once.
• For t = K+1, K+2, …
1. With probability ε_t, pull a random arm; otherwise pull the arm with the highest empirical mean.
Theorem. For an appropriate choice of ε_t, one can show Regret(t) = O(t^{2/3}·(K·log t)^{1/3}).
Choosing ε_t = t^{-1/3}·(K·log t)^{1/3} gives the theorem (see Theorem 1.4 in the book by Slivkins).
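A minimal sketch of ε-greedy with the schedule from the theorem; `pull(i)` is the same assumed interface as before and the names are illustrative.

```python
import numpy as np

def epsilon_greedy(pull, K, T, seed=0):
    """Epsilon-greedy sketch with schedule eps_t = t^(-1/3) * (K * log t)^(1/3)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for i in range(K):                         # pull each arm once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(K + 1, T + 1):
        eps = min(1.0, t ** (-1.0 / 3) * (K * np.log(t)) ** (1.0 / 3))
        if rng.random() < eps:
            i = int(rng.integers(K))           # explore: uniformly random arm
        else:
            i = int(np.argmax(sums / counts))  # exploit: highest empirical mean
        sums[i] += pull(i)
        counts[i] += 1
```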
Adversarial bandits
Goal: minimize the "pseudo"-regret over all reward vectors (same as in the experts setting):
Regret(T) = max_{i∈[K]} Σ_t r_{t,i} − Σ_t ⟨p_t, r_t⟩.
Assume K experts and rewards r_t ∈ [0,1]^K.
Initialize p_1 (e.g. the uniform distribution over experts). For time t = 1, 2, …:
1. Algorithm plays according to p_t; say it chooses action i.
2. Algorithm gains ⟨p_t, r_t⟩ (expected reward over the randomness of the action).
3. Algorithm receives r_{t,i} and updates p_t to get p_{t+1}.
Step 3 is the only difference from the expert setting (where the full reward vector r_t is revealed).
Adversarial bandits and Exp3
A nifty trick:
• The algorithm only receives r_{t,i}; ideally, we would like all of r_t.
• Define r̃_{t,j} = r_{t,j}/p_{t,j} if the algorithm chose action j, and r̃_{t,j} = 0 otherwise.
• Then E[r̃_t] = r_t, i.e. the algorithm can get an unbiased estimate of r_t.
• One gets the Exp3 algorithm by replacing r_t in MWU with r̃_t!
Exp3
MWU. Assume K experts and rewards r_t ∈ [0,1]^K; step size η. Initialize R_0 = (0, …, 0). For time t = 1, 2, …, T:
1. Set p_{t,i} = exp(η·R_{t−1,i}) / Z_{t−1}, where Z_{t−1} = Σ_j exp(η·R_{t−1,j}).
2. Follow expert i with probability p_{t,i}. Expected reward is ⟨p_t, r_t⟩.
3. Algorithm observes r_t.
4. Update: R_{t,i} = R_{t−1,i} + r_{t,i} for all i.
Exp3
Exp3. Assume K experts and rewards r_t ∈ [0,1]^K; step size η. Initialize R_0 = (0, …, 0). For time t = 1, 2, …, T:
1. Set p_{t,i} = exp(η·R_{t−1,i}) / Z_{t−1}, where Z_{t−1} = Σ_j exp(η·R_{t−1,j}).
2. Follow expert i with probability p_{t,i}. Expected reward is ⟨p_t, r_t⟩.
3. Algorithm observes r_{t,i}. Set r̃_{t,j} = r_{t,j}/p_{t,j} if expert j was followed; else r̃_{t,j} = 0.
4. Update: R_{t,j} = R_{t−1,j} + r̃_{t,j} for all j.
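A minimal sketch of Exp3 as described in the box above, where the importance-weighted estimate is the only change from MWU; `pull(i)` is the same assumed interface as before.

```python
import numpy as np

def exp3(pull, K, T, eta, seed=0):
    """Exp3 sketch: MWU run on importance-weighted reward estimates
    r_tilde[j] = r / p[j] if arm j was played, and 0 otherwise."""
    rng = np.random.default_rng(seed)
    R = np.zeros(K)                            # cumulative estimated rewards
    for t in range(T):
        w = np.exp(eta * (R - R.max()))        # shifting by R.max() leaves p unchanged
        p = w / w.sum()
        i = int(rng.choice(K, p=p))
        r = pull(i)
        R[i] += r / p[i]                       # unbiased estimate: E[r_tilde_t] = r_t
```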
Exp3
Theorem. In the experts setting with K experts, MWU has regret O(√(T·log K)).
Theorem. In the bandits setting with K experts, Exp3 has regret O(√(T·K·log K)).
The proof for Exp3 is nearly identical to MWU! (See [Bubeck, Cesa-Bianchi '12] or Lecture 17 of Nick Harvey's CPSC 531H course.)
In the bandits setting, one can get O(√(T·K)) regret and this is optimal [Audibert, Bubeck '10].
Application: Learning Diverse Rankings
The paper is "Learning Diverse Rankings with Multi-Armed Bandits" by Radlinski, Kleinberg, Joachims (ICML '08).
• Setting is web search.
• A user enters a search query.
• We want to ensure that a relevant document is near the top.
The user may mean the insect or the car, and both appear in the top few results.
An example of a search that is not diverse at all: those searching for bandit algorithms would not click.
Application: Learning Diverse Rankings
• Setting is web search.
• A user enters a search query.
• We want to ensure that a relevant document is near the top.
• Model this as follows.
• Let D be a set of documents (for one fixed query).
• A user u_t arrives with some "type", a probability vector p_t indicating the probability of clicking each specific document.
• If the user clicks, we get a reward of 1; if the user leaves, we get a reward of 0.
• Goal: maximize the number of user clicks.
• Note that the offline problem is NP-hard (it amounts to a max-coverage problem), but we can get a (1 − 1/e)-approximation.
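For intuition about the offline benchmark, here is a minimal sketch of the standard greedy heuristic for max coverage, which achieves the (1 − 1/e)-approximation mentioned above; representing users by the set of documents they would click is an illustrative simplification.

```python
def greedy_max_coverage(click_sets, k):
    """Greedily pick k documents, each time adding the one clicked by the most
    not-yet-covered users; greedy is a (1 - 1/e)-approximation to max coverage."""
    documents = set().union(*click_sets)
    uncovered = list(range(len(click_sets)))
    chosen = []
    for _ in range(min(k, len(documents))):
        best = max(documents - set(chosen),
                   key=lambda d: sum(d in click_sets[u] for u in uncovered))
        chosen.append(best)
        uncovered = [u for u in uncovered if best not in click_sets[u]]
    return chosen

# Example: three users, each described by the documents they would click.
print(greedy_max_coverage([{"a", "b"}, {"b"}, {"c"}], k=2))  # e.g. ['b', 'c']
```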
Ranked Explore and Commit
• Model the user population as static (click probabilities do not change over time).
• Start at the first rank (top of the web page).
• Try every possible document in that position a bunch of times.
• Whichever document gets the most hits at that rank is committed to that rank.
• Repeat this for every rank.
Ranked Explore and Commit
Theorem. With a suitable choice of parameters, the payoff of ranked ETC after T rounds is at least (1 − 1/e)·OPT − O_{k,|D|}(T^{2/3}), where OPT is the optimal payoff in the offline setting and k is the number of slots (documents to show).
If OPT ≥ Ω(T) (i.e. a constant fraction of users want to click on some document), then ranked ETC is competitive with the optimal offline algorithm.
Ranked Bandits Algorithm
• Here, users can change over time.
• Instantiate a multi-armed bandit algorithm for each rank.
• For each rank, we ask its bandit algorithm for a document.
• The bandit corresponding to rank i gets reward 1 if its page is clicked; else it gets zero.
• Note that the MAB algorithm can be arbitrary.
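A minimal sketch of one round of this meta-algorithm; the `select`/`update` interface on each per-rank bandit and the `user_clicks` callback are illustrative assumptions, and details from the paper (such as how duplicate proposals are replaced) are glossed over.

```python
def ranked_bandits_round(bandits, user_clicks):
    """One round: each rank's bandit proposes a document, the user sees the
    ranking, and only the bandit at the clicked rank receives reward 1.
    bandits[i]  : bandit for rank i, with .select() -> doc and .update(doc, reward)
    user_clicks : maps the shown ranking to the clicked rank, or None for no click."""
    ranking = [b.select() for b in bandits]        # one document per rank
    clicked_rank = user_clicks(ranking)            # observe the user's click (if any)
    for rank, bandit in enumerate(bandits):
        reward = 1 if clicked_rank == rank else 0  # every other rank gets reward 0
        bandit.update(ranking[rank], reward)
    return ranking
```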
Ranked Bandits Algorithm
Theorem. With a suitable choice of parameters, the payoff of ranked bandits after T rounds is at least (1 − 1/e)·OPT − k·R(T), where R(T) is the regret of the MAB algorithm, OPT is the optimal payoff in the offline setting, and k is the number of slots (documents to show).
For example, if we use Exp3 then R(T) = Õ_{k,|D|}(T^{1/2}).
If OPT ≥ Ω(T) (i.e. a constant fraction of users want to click on some document), then ranked bandits is competitive with the optimal offline algorithm.
Empirical Results
All their experiments were with synthetic data.
Recap
• We introduced the stochastic bandits problem and saw two algorithms:
• Explore-then-commit, which has an initial exploration stage and then commits for the rest of time.
• The upper confidence bound algorithm, which maintains a confidence interval for each arm and uses the upper confidence bound as a proxy.
• We introduced adversarial bandits and saw the Exp3 algorithm.
• Some references:
• Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by Bubeck and Cesa-Bianchi
• Introduction to Online Convex Optimization by Hazan
• Introduction to Online Optimization by Bubeck
• Bandit Algorithms by Lattimore and Szepesvári
• Introduction to Multi-Armed Bandits by Slivkins