Introduction to Bandits


Transcript of Introduction to Bandits

Page 1: Introduction to Bandits

Introduction to Bandits

Chris Liaw

Machine Learning Reading Group

July 10, 2019

"[T]he problem is a classic one; it was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage" - Peter Whittle (on the bandit problem)

Page 2: Introduction to Bandits

Motivation and applications

Clinical trials (Thompson '33)

Online ads

Search results

Many others: network routing, recommender systems, etc.

Page 3: Introduction to Bandits

Outline

• Intro to stochastic bandits

• Explore-then-commit

• Upper confidence bound algorithm

• Adversarial bandits & Exp3

• Application: Learning Diverse Rankings

Page 4: Introduction to Bandits

Intro to stochastic bandits

$K$ arms; unknown sequence of stochastic rewards $R_1, R_2, \ldots \in [0,1]^K$ with $R_t \sim \nu$.

For each round $t = 1, 2, \ldots, T$ (assume the horizon $T$ is known; more on this later):

• Choose arm $A_t \in [K]$

• Obtain reward $R_{t,A_t}$ and observe only $R_{t,A_t}$


The problem was introduced by Robbins (1952).

Page 5: Introduction to Bandits

[Figure: pulling arms and observing only the chosen arm's reward. Pull arm 3 and observe reward 0.2 (all other rewards stay hidden); pull arm 1 and observe 0.01; pull arm 3 again and observe 0.6.]

Page 6: Introduction to Bandits

Intro to stochastic bandits

$K$ arms; unknown sequence of stochastic rewards $R_1, R_2, \ldots \in [0,1]^K$ with $R_t \sim \nu$.

For each round $t = 1, 2, \ldots, T$ (assume the horizon $T$ is known; more on this later):

• Choose arm $A_t \in [K]$

• Obtain reward $R_{t,A_t}$ and observe only $R_{t,A_t}$.

Arm $i$ has an unknown mean $\mu_i$.

Goal: Find a policy that minimizes the regret

$$\mathrm{Reg}(T) = T \cdot \mu^* - \mathbb{E}\left[\sum_{t \in [T]} R_{t,A_t}\right],$$

where $\mu^* = \max_i \mu_i$ is the mean reward of the best arm and the expectation is the algorithm's total reward.

Ideally, we would like $\mathrm{Reg}(T) = o(T)$.
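To make the protocol and the regret definition concrete, here is a minimal sketch (my own illustration, not from the talk) of a Bernoulli bandit environment together with the regret computation; all names are illustrative.

```python
import numpy as np

class BernoulliBandit:
    """A K-armed stochastic bandit with Bernoulli rewards in {0, 1}."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # true means, hidden from the learner
        self.rng = np.random.default_rng(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        # Only the reward of the pulled arm is revealed.
        return float(self.rng.random() < self.means[arm])


def expected_regret(bandit, chosen_arms):
    """Reg(T) = T * mu_star - sum_t mu_{A_t} (expectation over the rewards)."""
    chosen_arms = np.asarray(chosen_arms)
    mu_star = bandit.means.max()
    return len(chosen_arms) * mu_star - bandit.means[chosen_arms].sum()
```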

Page 7: Introduction to Bandits

Exploration-exploitation tradeoff

At each time step, we can either:

1. (Exploit) Pull the arm we think is the best one; or

2. (Explore) Pull an arm we think is suboptimal.

1. We do not know which arm is best, so if we only exploit, we may keep pulling a suboptimal arm and incur large regret.

2. If we explore, we gather information about the arms, but we also pull suboptimal arms, which again incurs regret!

The challenge is to trade off exploration and exploitation!

Page 8: Introduction to Bandits

Explore-then-commit (ETC)

Perhaps the simplest algorithm that provably achieves sublinear regret!

Let $T_0$ be a hyper-parameter and assume $T \ge K \cdot T_0$.

1. Pull each of the $K$ arms $T_0$ times.
2. Compute the empirical average $\hat{\mu}_i$ of each arm.
3. Pull the arm with the largest empirical average for the remaining $T - K \cdot T_0$ rounds.

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. Then

$$\mathrm{Reg}(T) \le T_0 \sum_{i \in [K]} \Delta_i + (T - K \cdot T_0) \sum_{i \in [K]} \Delta_i \exp\!\left(-\frac{T_0 \Delta_i^2}{C}\right).$$

The first term is the "cost of exploration"; the second term bounds the suboptimality of each additional step after committing.

Note: the term $\Delta_i \exp(-T_0 \Delta_i^2 / 4)$ is small when $T_0$ is large.
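As an illustration, here is a minimal sketch of ETC in code (my own, not from the talk), written against the `BernoulliBandit` sketch above.

```python
import numpy as np

def explore_then_commit(bandit, T, T0):
    """ETC: pull each arm T0 times, then commit to the empirically best arm."""
    K = bandit.k
    assert T >= K * T0
    counts = np.zeros(K)
    sums = np.zeros(K)
    chosen = []

    # 1.-2. Exploration: pull each arm T0 times and track empirical averages.
    for i in range(K):
        for _ in range(T0):
            sums[i] += bandit.pull(i)
            counts[i] += 1
            chosen.append(i)

    # 3. Commit: play the arm with the largest empirical average.
    best = int(np.argmax(sums / counts))
    for _ in range(T - K * T0):
        bandit.pull(best)
        chosen.append(best)
    return chosen
```

For example, `explore_then_commit(bandit, T, T0=int(T ** (2/3)))` matches the $T_0 = T^{2/3}$ tuning discussed on the next slide.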

Page 9: Introduction to Bandits

Explore-then-commit (ETC)

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. Then

$$\mathrm{Reg}(T) \le T_0 \sum_{i \in [K]} \Delta_i + (T - K \cdot T_0) \sum_{i \in [K]} \Delta_i \exp\!\left(-\frac{T_0 \Delta_i^2}{C}\right).$$

• This illustrates the exploration-exploitation tradeoff:

• Explore too much ($T_0$ large) and the first term is large.

• Exploit too much ($T_0$ small) and the second term is large.

• Can we tune exploration (i.e. $T_0$) to get sublinear regret?

• Yes! Choose $T_0 = T^{2/3}$. One can show that $\mathrm{Reg}(T) = O(K \cdot T^{2/3})$.

• If $K = 2$ arms, a data-dependent $T_0$ gives $\mathrm{Reg}(T) = O(T^{1/2})$ [Garivier, Kaufmann, Lattimore NeurIPS '16].

Page 10: Introduction to Bandits

Explore-then-commit (ETC)

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. Then

$$\mathrm{Reg}(T) \le T_0 \sum_{i \in [K]} \Delta_i + (T - K \cdot T_0) \sum_{i \in [K]} \Delta_i \exp\!\left(-\frac{T_0 \Delta_i^2}{C}\right).$$

Sketch.

• Initially, we try each arm $i$ for $T_0$ trials; this incurs regret $T_0 \cdot \Delta_i$ for arm $i$.

• Next, we exploit; we only pull arm $i$ again if the empirical average of arm $i$ is at least that of the best arm.

• By a Hoeffding-type concentration bound, this happens with probability at most $\exp(-T_0 \Delta_i^2 / C)$.

• Summing the contribution from all arms gives the claimed regret.
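For the record, one way to carry out the concentration step (a sketch, with $C$ an unspecified absolute constant as in the theorem statement): since $\hat{\mu}_i$ and $\hat{\mu}_*$ are each averages of $T_0$ bounded samples,

```latex
\begin{align*}
\Pr\big[\hat{\mu}_i \ge \hat{\mu}_*\big]
  &= \Pr\big[(\hat{\mu}_i - \mu_i) - (\hat{\mu}_* - \mu_*) \ge \Delta_i\big] \\
  &\le \exp\!\left(-\frac{T_0 \Delta_i^2}{C}\right)
  && \text{(Hoeffding's inequality for bounded random variables)}
\end{align*}
```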

Page 11: Introduction to Bandits

Aside: Doubling Trick

• Previously, we assumed that the time horizon $T$ is known beforehand.

• The doubling trick can be used to get around that.

• Suppose that some algorithm $\mathcal{A}$ has regret $o(T)$ when it knows the time horizon beforehand.

• At every power-of-2 step (i.e. at step $2^k$ for some $k$), we reset $\mathcal{A}$ and run it assuming the time horizon is $2^k$.

• This gives an algorithm with regret $o(t)$ for all $t$, i.e. an "anytime" algorithm.
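A minimal sketch of the wrapper (illustrative; it assumes a hypothetical `make_algorithm(horizon)` factory that returns a fresh instance of $\mathcal{A}$ exposing `select_arm()` / `update(arm, reward)` methods):

```python
def doubling_trick(make_algorithm, bandit, total_rounds):
    """Run a horizon-dependent algorithm A without knowing the horizon:
    restart A at every power of 2, telling it the horizon is 2^k."""
    t, k = 0, 0
    while t < total_rounds:
        horizon = 2 ** k
        algo = make_algorithm(horizon)       # fresh copy of A for this epoch
        steps = min(horizon, total_rounds - t)
        for _ in range(steps):
            arm = algo.select_arm()
            reward = bandit.pull(arm)
            algo.update(arm, reward)
        t += steps
        k += 1
```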

Page 12: Introduction to Bandits

Upper confidence bound (UCB) algorithm

• Based on the idea of "optimism in the face of uncertainty."

• Algorithm: compute the empirical mean of each arm and a confidence interval; use the upper confidence bound as a proxy for how good the arm is.

• Note: the confidence interval is chosen so that the true mean is very unlikely to fall outside it.

Page 13: Introduction to Bandits

Upper confidence bound (UCB) algorithm

πœ‡1

πœ‡2 πœ‡3

Start by pulling each arm once.

Page 14: Introduction to Bandits

Upper confidence bound (UCB) algorithm

πœ‡1

πœ‡2 πœ‡3

Start by pulling each arm once.

Arm 3 has the highest UCB, we pull that next.ΖΈπœ‡1

ΖΈπœ‡2ΖΈπœ‡3

Page 15: Introduction to Bandits

Upper confidence bound (UCB) algorithm

πœ‡1

πœ‡2 πœ‡3

Start by pulling each arm once.

Arm 3 has the highest UCB, we pull that next.

Now, arm 2 has the highest UCB; we pull arm 2.

ΖΈπœ‡1

ΖΈπœ‡2ΖΈπœ‡3

Page 16: Introduction to Bandits

Upper confidence bound (UCB) algorithm

Let $\delta \in (0,1)$ be a hyper-parameter.

• Pull each of the $K$ arms once.
• For $t = K+1, K+2, \ldots, T$:

1. Let $N_i(t)$ be the number of times arm $i$ has been pulled so far and $\hat{\mu}_i(t)$ its empirical average.

2. Let $UCB_i(t) = \hat{\mu}_i(t) + \sqrt{\dfrac{2\log(1/\delta)}{N_i(t)}}$.

3. Play an arm in $\arg\max_i UCB_i(t)$.

Claim. Fix an arm $i$. Then with probability at least $1 - 2\delta$, we have

$$\big|\mu_i - \hat{\mu}_i(t)\big| \le \sqrt{\frac{2\log(1/\delta)}{N_i(t)}}.$$

Page 17: Introduction to Bandits

Upper confidence bound (UCB) algorithm

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. If we choose $\delta \sim 1/T^2$:

$$\mathrm{Reg}(T) \le C \sum_{i \in [K]} \Delta_i + \sum_{i : \Delta_i > 0} \frac{C \log T}{\Delta_i}.$$

The first term we always have to pay. The second term means the following: if $\Delta_i > 0$, we only pull arm $i$ roughly $\Delta_i^{-2} \log T$ times, incurring regret $\Delta_i$ each time.

Page 18: Introduction to Bandits

Upper confidence bound (UCB) algorithm

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. If we choose $\delta \sim 1/T^2$:

$$\mathrm{Reg}(T) \le C \sum_{i \in [K]} \Delta_i + \sum_{i : \Delta_i > 0} \frac{C \log T}{\Delta_i}.$$

Sketch.

• Fact. $\mathrm{Reg}(T) = \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E}[N_i(T)]$, where $N_i(T)$ counts the number of times arm $i$ is pulled up to time $T$.

• We want to bound $\mathbb{E}[N_i(T)]$ whenever $\Delta_i > 0$.

• W.h.p., $UCB_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{2\log(1/\delta)}{N_i(t)}} \le \mu_i + 2\sqrt{\frac{2\log(1/\delta)}{N_i(t)}}$.

• If $N_i(t) \ge \Omega\big(\log(1/\delta) \, \Delta_i^{-2}\big)$ then $UCB_i(t) < \mu^*$, so arm $i$ is pulled only $O\big(\log(1/\delta) \, \Delta_i^{-2}\big)$ times w.h.p.

• To conclude, if $\Delta_i > 0$ then $\Delta_i \, \mathbb{E}[N_i(T)] \lesssim O\big(\log(1/\delta) \, \Delta_i^{-1}\big)$.

• Choose $\delta \sim 1/T^2$ to beat the union bound.

Page 19: Introduction to Bandits

Upper confidence bound (UCB) algorithm

Theorem. Let $\Delta_i := \mu^* - \mu_i$ be the suboptimality of arm $i$. If we choose $\delta \sim 1/T^2$:

$$\mathrm{Reg}(T) \le C \sum_{i \in [K]} \Delta_i + \sum_{i : \Delta_i > 0} \frac{C \log T}{\Delta_i}.$$

This is an instance-dependent bound, but we can also get an instance-free bound.

Corollary. If we choose $\delta \sim 1/T^2$ then

$$\mathrm{Reg}(T) \le O\big(\sqrt{TK \cdot \log T}\big).$$

So the regret is $O_K(\sqrt{T \cdot \log T})$. (Recall that ETC has regret $O_K(T^{2/3})$.)

It is possible to get regret $O(\sqrt{TK})$ [Audibert, Bubeck '10]; this is optimal.

UCB can also be extended to heavier tails (e.g. [Bubeck, Cesa-Bianchi, Lugosi '13]).

Page 20: Introduction to Bandits

πœ–-greedy algorithm

Let πœ–πΎ+1, πœ–πΎ+2, … ∈ [0,1] be an exploration schedule.β€’ Pull each of 𝐾 arms once.β€’ For 𝑑 = 𝐾 + 1,𝐾 + 2,…

1. With probability πœ–π‘‘, pull a random arm; otherwise pull arm with highest empirical mean.

Theorem. For an appropriate choice of πœ–π‘‘, can show𝑅𝑒𝑔 𝑑 = 𝑂(𝑑2/3 𝐾 log 𝑑 1/3).

Choosing πœ–π‘‘ = π‘‘βˆ’1/3 𝐾 β‹… log 𝑑 1/3 will give the theorem (see Theorem 1.4 in book by Slivkins).

Page 21: Introduction to Bandits

Adversarial bandits

Assume $K$ experts and rewards $r_t \in [0,1]^K$.

Initialize $p_1$ (e.g. the uniform distribution over experts). For time $t = 1, 2, \ldots$:

1. The algorithm plays according to $p_t$; say it chooses action $j$.
2. The algorithm gains $\langle p_t, r_t \rangle$ (expected reward over the randomness of the action).
3. The algorithm receives only $r_{t,j}$ and updates $p_t$ to get $p_{t+1}$.

Step 3 is the only difference from the experts setting (where all of $r_t$ is revealed).

Goal: minimize the "pseudo"-regret over all reward vectors (same as experts):

$$\mathrm{Reg}(T) = \max_{i \in [K]} \sum_t r_{t,i} - \sum_t \langle p_t, r_t \rangle.$$

Page 22: Introduction to Bandits

Adversarial bandits and Exp3

A nifty trick:

• The algorithm only receives $r_{t,j}$; ideally, we would like all of $r_t$.
• Define $\hat{r}_{t,j} = r_{t,j} / p_{t,j}$ if the algorithm chose action $j$, and $\hat{r}_{t,j} = 0$ otherwise.
• Then $\mathbb{E}[\hat{r}_t] = r_t$, i.e. the algorithm gets an unbiased estimate of $r_t$.
• One gets the Exp3 algorithm by replacing $r_t$ in MWU with $\hat{r}_t$!
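To see the unbiasedness claim, condition on $p_t$ and take the expectation over the algorithm's random action:

```latex
\begin{align*}
\mathbb{E}\big[\hat{r}_{t,j} \mid p_t\big]
  = p_{t,j} \cdot \frac{r_{t,j}}{p_{t,j}} + (1 - p_{t,j}) \cdot 0
  = r_{t,j}.
\end{align*}
```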

Recall the protocol: assume $K$ experts and rewards $r_t \in [0,1]^K$; initialize $p_1$ (e.g. the uniform distribution over experts). For time $t = 1, 2, \ldots$:

1. The algorithm plays according to $p_t$; say it chooses action $j$.
2. The algorithm gains $\langle p_t, r_t \rangle$ (expected reward over the randomness of the action).
3. The algorithm receives $r_{t,j}$ and updates $p_t$ to get $p_{t+1}$.

Page 23: Introduction to Bandits

Exp3

MWU. Assume $K$ experts and rewards $r_t \in [0,1]^K$; step size $\eta$. Initialize $R_0 = (0, \ldots, 0)$. For time $t = 1, 2, \ldots, T$:

1. Set $p_{t,j} = \exp(\eta R_{t-1,j}) / Z_{t-1}$ where $Z_{t-1} = \sum_i \exp(\eta R_{t-1,i})$.

2. Follow expert $j$ with probability $p_{t,j}$. The expected reward is $\langle p_t, r_t \rangle$.

3. The algorithm observes $r_t$.
4. Update: $R_{t,j} = R_{t-1,j} + r_{t,j}$ for all $j$.

Page 24: Introduction to Bandits

Exp3

Exp3. Assume $K$ experts and rewards $r_t \in [0,1]^K$; step size $\eta$. Initialize $R_0 = (0, \ldots, 0)$. For time $t = 1, 2, \ldots, T$:

1. Set $p_{t,j} = \exp(\eta R_{t-1,j}) / Z_{t-1}$ where $Z_{t-1} = \sum_i \exp(\eta R_{t-1,i})$.

2. Follow expert $j$ with probability $p_{t,j}$. The expected reward is $\langle p_t, r_t \rangle$.

3. The algorithm observes $r_{t,j}$. Set $\hat{r}_{t,j} = r_{t,j} / p_{t,j}$ if expert $j$ was followed; else $\hat{r}_{t,j} = 0$.

4. Update: $R_{t,j} = R_{t-1,j} + \hat{r}_{t,j}$ for all $j$.
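A minimal code sketch of Exp3 as stated above (my own illustration; the adversary is abstracted as a `reward_fn(t, j)` callback returning $r_{t,j} \in [0,1]$, and the step size `eta` is left as a parameter):

```python
import numpy as np

def exp3(reward_fn, K, T, eta, seed=3):
    """Exp3: multiplicative weights run on importance-weighted reward estimates."""
    rng = np.random.default_rng(seed)
    R = np.zeros(K)                          # cumulative estimated rewards R_{t,j}

    for t in range(1, T + 1):
        # 1. Exponential-weights distribution p_t (max subtracted for stability).
        w = np.exp(eta * (R - R.max()))
        p = w / w.sum()

        # 2. Sample an action from p_t.
        j = int(rng.choice(K, p=p))

        # 3. Observe only r_{t,j}; form the importance-weighted estimate.
        r = reward_fn(t, j)
        r_hat = np.zeros(K)
        r_hat[j] = r / p[j]

        # 4. Update the cumulative (estimated) rewards.
        R += r_hat
```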

Page 25: Introduction to Bandits

Exp3

Theorem. In the experts setting with $K$ experts, MWU has regret $O(\sqrt{T \cdot \log K})$.

Theorem. In the bandits setting with $K$ experts, Exp3 has regret $O(\sqrt{TK \cdot \log K})$.

The proof for Exp3 is nearly identical to MWU! (See [Bubeck, Cesa-Bianchi '12] or Lecture 17 in Nick Harvey's CPSC 531H course.)

In the bandits setting, one can get $O(\sqrt{TK})$ regret, and this is optimal [Audibert, Bubeck '10].

Page 26: Introduction to Bandits

Application: Learning Diverse Rankings

The paper is "Learning Diverse Rankings with Multi-Armed Bandits" by Radlinski, Kleinberg, Joachims (ICML '08).

• Setting is web search

• A user enters a search query

• We want to ensure that a relevant document is near the top.

Page 27: Introduction to Bandits

The user may mean the insect or the car, and both appear in the top few results.

Page 28: Introduction to Bandits

An example of a search that is not diverse at all.

Those searching for bandit algorithms would not click on any of these results.

Page 29: Introduction to Bandits

Application: Learning Diverse Rankings

• Setting is web search

• A user enters a search query

• We want to ensure that a relevant document is near the top.

• Model this as follows.

• Let $\mathcal{D}$ be a set of documents (for one fixed query).

• A user $u_t$ arrives with some "type", a probability vector $p_t$ giving the probability of clicking each specific document.

• If the user clicks, we get reward 1; if the user leaves without clicking, we get reward 0.

• Goal: maximize the number of user clicks.

• Note that the offline problem is NP-hard (we need to solve a max-coverage problem); but a greedy algorithm gets a $(1 - 1/e)$-approximation.
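As an aside (my own sketch, not the paper's pseudocode): for the offline problem, greedily adding the document that covers the most remaining user mass gives the $(1-1/e)$-approximation mentioned above. Here `relevant[u, d]` indicates whether a user of type $u$ would click document $d$, and `weights[u]` is the probability of type $u$; both names are illustrative.

```python
import numpy as np

def greedy_max_coverage(relevant, weights, k):
    """Greedy (1 - 1/e)-approximation for max coverage: pick k documents
    maximizing the weighted fraction of user types that would click
    at least one picked document."""
    n_types, n_docs = relevant.shape
    covered = np.zeros(n_types, dtype=bool)
    picked = []
    for _ in range(k):
        # Marginal gain of each document = newly covered probability mass.
        gains = [weights[~covered & relevant[:, d]].sum() for d in range(n_docs)]
        best = int(np.argmax(gains))
        picked.append(best)
        covered |= relevant[:, best]
    return picked
```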

Page 30: Introduction to Bandits

Ranked Explore and Commit

• Model users as static (their click behaviour does not change over time).

• Start at the first rank (top of the web page).

• Try every possible document in that position a number of times.

• Whichever document gets the most clicks at that rank is committed to that rank.

• Repeat this for every rank.

Page 31: Introduction to Bandits

Ranked Explore and Commit

Theorem. With a suitable choice of parameters, the payoff of ranked ETC after $T$ rounds is at least

$$\left(1 - \frac{1}{e}\right) \cdot OPT - O_{n,k}(T^{2/3}),$$

where $OPT$ is the optimal payoff in the offline setting.

If $OPT \ge \Omega(T)$ (i.e. a constant fraction of users want to click on some document), then ranked ETC is competitive with the optimal offline algorithm.

Page 32: Introduction to Bandits

Ranked Bandits Algorithm

• Here, users can change over time.

• Instantiate a multi-armed bandit algorithm for each rank.

• For each rank, we ask its bandit algorithm for a document.

• The bandit corresponding to rank $r$ gets reward 1 if the document at rank $r$ is clicked; otherwise it gets 0.

• Note that the MAB algorithm can be arbitrary.
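A minimal sketch of one round of the ranked bandits idea (my own illustration, not the paper's exact pseudocode): one bandit per rank proposes a document, and each bandit is rewarded according to whether its slot was clicked. The `select_arm()` / `update(arm, reward)` interface and the `user_clicks` callback are assumptions for illustration; handling of duplicate proposals across ranks is omitted.

```python
def ranked_bandits_round(rank_bandits, docs, user_clicks):
    """One round: each rank's bandit picks a document, the user (possibly)
    clicks one rank, and every bandit is updated with its own reward."""
    proposals = [b.select_arm() for b in rank_bandits]   # one arm per rank
    ranking = [docs[a] for a in proposals]

    clicked_rank = user_clicks(ranking)      # index of the clicked rank, or None

    for r, bandit in enumerate(rank_bandits):
        reward = 1.0 if clicked_rank == r else 0.0
        bandit.update(proposals[r], reward)
    return ranking, clicked_rank
```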

Page 33: Introduction to Bandits

Ranked Bandits Algorithm

Theorem. With a suitable choice of parameters, the payoff of ranked bandits after $T$ rounds is at least

$$\left(1 - \frac{1}{e}\right) \cdot OPT - k \cdot R(T),$$

where $R(T)$ is the regret of the MAB algorithm, $k$ is the number of slots (documents to show), and $OPT$ is the optimal payoff in the offline setting.

For example, if we use Exp3 then $R(T) = \tilde{O}_{n,k}(T^{1/2})$.

If $OPT \ge \Omega(T)$ (i.e. a constant fraction of users want to click on some document), then ranked bandits is competitive with the optimal offline algorithm.

Page 34: Introduction to Bandits

Empirical Results

All their experiments were with synthetic data.

Page 35: Introduction to Bandits

Recap

• We introduced the stochastic bandits problem and saw two algorithms:
  • Explore-then-commit, which has an initial exploration stage and then commits for the rest of time.
  • The upper confidence bound algorithm, which maintains a confidence interval for each arm and uses the upper confidence bound as a proxy.

• We introduced adversarial bandits and saw the Exp3 algorithm.

• Some references:
  • Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by Bubeck and Cesa-Bianchi
  • Introduction to Online Convex Optimization by Hazan
  • Introduction to Online Optimization by Bubeck
  • Bandit Algorithms by Lattimore and Szepesvári
  • Introduction to Multi-Armed Bandits by Slivkins