Transcript of slides: New Algorithms for Contextual Bandits

Page 1:

New Algorithms for Contextual Bandits

Lev Reyzin Georgia Institute of Technology

¹ Work done at Yahoo!

Page 2:

2

• A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees (AISTATS 2011)

• M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits (UAI 2011)

• S. Kale, L. Reyzin, R.E. Schapire. Non-Stochastic Bandit Slate Problems (NIPS 2010)

Page 3:

Serving Content to Users

Query, IP address, browser properties, etc.

3

Page 4:

Serving Content to Users

Query, IP address, browser properties, etc.

result (i.e., ad, news story)

4

Page 5:

Serving Content to Users

Query, IP address, browser properties, etc.

result (i.e., ad, news story)

click or not

5

Page 8:

Serving Content to Users

8

Page 9:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

9

Page 10:

Multiarmed Bandits [Robbins '52]

[Figure: k arms (1, 2, 3, …, k) and a timeline of rounds 1, 2, 3, 4, …, T; the learner starts with $0.]

10

Pages 11-14:

Multiarmed Bandits [Robbins '52]

[Figure, built up over several slides: each round the learner pulls one arm and observes only that arm's payoff (amounts such as $0.50, $0, $0.33, $0.90 appear), accumulating a total of about $0.4T over the T rounds.]

11-14

Page 15:

Multiarmed Bandits [Robbins '52]

[Figure: pulled every round, the arms would have paid roughly $0.5T, $0.2T, $0.33T, …, $0.1T; the algorithm earned $0.4T.]

15

Page 16:

Multiarmed Bandits [Robbins '52]

[Figure: the best arm would have paid $0.5T; the algorithm earned $0.4T.]

regret = $0.1T

16

Page 17:

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]

[Figure: the same k arms and rounds 1, 2, …, T, plus N experts/policies/functions (think of N >> K), each mapping a context to a recommended arm; the learner starts with $0.]

17

Pages 18-19:

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]

[Figure: context x1 arrives; the N experts recommend arms (e.g., 5, 1, K, 1, 4, 3); the learner pulls arm 1 and receives $0.15.]

18-19

Page 20:

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]

[Figure: over contexts x1, x2, x3, x4, …, xT the experts' cumulative payoffs come out to roughly $.12T, $.1T, $.2T, $.22T, 0, and $.17T, while the algorithm earns $0.2T.]

20

Page 21:

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]

[Figure: the best expert earned $0.22T; the algorithm earned $0.2T.]

regret = $0.02T

21

Page 22:

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]

The rewards can come i.i.d. from a distribution or be arbitrary (stochastic / adversarial). The experts can be present or not (contextual / non-contextual).

22

Page 23:

The Setting

• T rounds, K possible actions, N policies π in Π (context → actions)

• for t = 1 to T:
  • world commits to rewards r(t) = r1(t), r2(t), …, rK(t) (adversarial or i.i.d.)
  • world provides context xt
  • learner's policies recommend π1(xt), π2(xt), …, πN(xt)
  • learner chooses action jt
  • learner receives reward rjt(t)

• want to compete with the best policy in hindsight (see the sketch below)

23
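This protocol lends itself to a direct simulation loop. Below is a minimal sketch in Python; the environment functions (draw_context, draw_rewards, choose_action) and the toy policies are placeholder assumptions for illustration, not anything from the talk.

```python
import random

def contextual_bandit_run(T, K, policies, draw_context, draw_rewards, choose_action):
    """Generic contextual bandit protocol: the learner sees a context, its policies
    make recommendations, it picks one arm, and it observes only that arm's reward."""
    learner_total = 0.0
    policy_totals = [0.0] * len(policies)
    for t in range(T):
        rewards = draw_rewards(t)            # r_1(t), ..., r_K(t); hidden from the learner
        x_t = draw_context(t)                # context revealed to the learner
        recs = [pi(x_t) for pi in policies]  # each policy recommends an arm
        j_t = choose_action(x_t, recs)       # learner picks an arm (its algorithm goes here)
        learner_total += rewards[j_t]        # only r_{j_t}(t) is observed
        for i, a in enumerate(recs):         # bookkeeping for regret (not observable in practice)
            policy_totals[i] += rewards[a]
    regret = max(policy_totals) - learner_total
    return learner_total, regret

# Toy usage: K = 3 arms, two fixed policies, a uniformly random learner.
K = 3
policies = [lambda x: 0, lambda x: x % K]
out = contextual_bandit_run(
    T=1000, K=K, policies=policies,
    draw_context=lambda t: t,
    draw_rewards=lambda t: [random.random() for _ in range(K)],
    choose_action=lambda x, recs: random.randrange(K),
)
print(out)
```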

Page 24:

Regret

• reward of algorithm A:  G_A(T) ≐ Σ_{t=1}^{T} r_{j_t}(t)

• expected reward of policy i:  G_i(T) ≐ Σ_{t=1}^{T} π_i(x_t) · r(t)

• algorithm A's regret:  max_i G_i(T) − G_A(T)

24

Page 25:

Regret

• algorithm A's regret:  max_i G_i(T) − G_A(T)

• bound on expected regret:  max_i G_i(T) − E[G_A(T)] < ε

• high probability bound:  P[ max_i G_i(T) − G_A(T) > ε ] ≤ δ

25

Page 26:

• Harder than supervised learning:
  • In the bandit setting we do not know the rewards of actions not taken.

• Many applications:
  • Ad auctions, medicine, finance, …

• Exploration / Exploitation:
  • Can exploit an expert/arm you've learned to be good.
  • Can explore an expert/arm you're not sure about.

26

Page 27:

Some Barriers

Ω((KT)^(1/2)) (non-contextual) and Ω((TK ln N)^(1/2)) (contextual) are known lower bounds on regret [Auer et al. '02], even in the stochastic case.

Any algorithm achieving regret Õ((KT polylog N)^(1/2)) is said to be optimal.

ε-greedy algorithms that first explore (act randomly) and then exploit (follow the best policy) cannot be optimal. Any optimal algorithm must be adaptive.

27

Page 28:

Two Types of Approaches

UCB [Auer '02]
[Figure: upper confidence bounds on each arm's mean, shrinking over rounds t=1, t=2, t=3 on a 0-to-1 scale.]
Algorithm: at every time step,
1) pull the arm with the highest UCB;
2) update the confidence bound of the arm pulled.

EXP3 [Auer et al. '02], Exponential Weights [Littlestone-Warmuth '94]
Algorithm: at every time step,
1) sample from the distribution defined by the weights (mixed with uniform);
2) update the weights "exponentially".

28
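For concreteness, here are minimal sketches of the two approaches in the basic (non-contextual) K-armed setting. These are simplified textbook-style versions with illustrative constants (the √(2 ln t / n) bonus and the γ mixing rate), not the exact parameterizations from the cited papers; pull is an assumed environment callback returning a reward in [0, 1].

```python
import math
import random

def ucb1(T, K, pull):
    """UCB1-style play: try each arm once, then always pull the arm whose
    empirical mean plus confidence bonus is largest."""
    counts = [0] * K
    means = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                   # initialization: one pull per arm
        else:
            arm = max(range(K),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)                                     # reward assumed in [0, 1]
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total

def exp3(T, K, pull, gamma=0.1):
    """EXP3-style play: sample an arm from exponential weights mixed with the
    uniform distribution, then exponentially update the pulled arm's weight
    using an importance-weighted reward estimate."""
    weights = [1.0] * K
    total = 0.0
    for _ in range(T):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        r = pull(arm)                                     # reward assumed in [0, 1]
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / K)
        m = max(weights)
        weights = [w / m for w in weights]                # renormalize to avoid overflow
        total += r
    return total
```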

Page 29:

UCB vs EXP3: A Comparison

UCB [Auer '02]
• Pros
  • Optimal for the stochastic setting.
  • Succeeds with high probability.
• Cons
  • Does not work in the adversarial setting.
  • Is not optimal in the contextual setting.

EXP3 & Friends [Auer-CesaBianchi-Freund-Schapire '02]
• Pros
  • Optimal for both the adversarial and stochastic settings.
  • Can be made to work in the contextual setting.
• Cons
  • Does not succeed with high probability in the contextual setting (only in expectation).

29

Page 30:

30

Algorithm | Regret | High Prob? | Context?
Exp4 [ACFS '02] | Õ((KT ln N)^(1/2)) | No | Yes
ε-greedy, epoch-greedy [LZ '07] | Õ((K ln N)^(1/3) T^(2/3)) | Yes | Yes
Exp3.P [ACFS '02], UCB [Auer '00] | Õ((KT)^(1/2)) | Yes | No

Ω((KT)^(1/2)) lower bound [ACFS '02]

Page 31:

31

Algorithm | Regret | High Prob? | Context?
Exp4 [ACFS '02] | Õ((KT ln N)^(1/2)) | No | Yes
ε-greedy, epoch-greedy [LZ '07] | Õ((K ln N)^(1/3) T^(2/3)) | Yes | Yes
Exp3.P [ACFS '02], UCB [Auer '00] | Õ((KT)^(1/2)) | Yes | No
Exp4.P [BLLRS '10] | Õ((K ln(N/δ) T)^(1/2)) | Yes | Yes

Ω((KT)^(1/2)) lower bound [ACFS '02]

Page 32:

EXP4.P [Beygelzimer-Langford-Li-R-Schapire '11]

EXP4.P combines the advantages of Exponential Weights and UCB:
• optimal for both the stochastic and adversarial settings
• works for the contextual case (and also the non-contextual case)
• a high probability result

Main Theorem [Beygelzimer-Langford-Li-R-Schapire '11]: For any δ > 0, with probability at least 1 − δ, EXP4.P has regret at most O((KT ln(N/δ))^(1/2)) in the adversarial contextual bandit setting.

32

Page 33:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

33

Page 34:

Some Failed Approaches

• Bad idea 1: Maintain a set of plausible hypotheses and randomize uniformly over their predicted actions.
  • Adversary has two actions, one always paying off 1 and the other 0. The hypotheses generally agree on the correct action, except for a different one that defects each round. This incurs regret of ~T/2.

• Bad idea 2: Maintain a set of plausible hypotheses and randomize uniformly among the hypotheses.
  • Adversary has two actions, one always paying off 1 and the other 0. If all but one of > 2T hypotheses always predict the wrong arm, and only 1 hypothesis always predicts the good arm, then with probability > ½ it is never picked and the algorithm incurs regret of T.

34


Page 38:

epsilon-greedy

• Rough idea of ε-greedy (or ε-first): act randomly for ε rounds, then go with the best (arm or expert).

• Even if we know the number of rounds in advance, ε-first won't get us regret O(T^(1/2)), even in the non-contextual setting.

• Rough analysis: even for just 2 arms, we suffer regret of about ε + (T − ε)/ε^(1/2).
  • ε ≈ T^(2/3) is the optimal tradeoff.
  • gives regret ≈ T^(2/3) (the calculation is written out below)

38
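Written out, the tradeoff from the last bullets looks like this (a rough balancing sketch, dropping constants and the K-dependence):

```latex
R(\varepsilon) \;\approx\; \underbrace{\varepsilon}_{\text{exploration}}
  \;+\; \underbrace{\frac{T-\varepsilon}{\sqrt{\varepsilon}}}_{\text{exploitation error}},
\qquad
\varepsilon = T^{2/3} \;\Rightarrow\;
R \;\approx\; T^{2/3} + \frac{T}{T^{1/3}} \;=\; 2\,T^{2/3}.
```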

Page 39:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

39

Page 40:

Ideas Behind Exp4.P (all appeared in previous algorithms)

• exponential weights
  • keep a weight on each expert that scales exponentially with the expert's (estimated) performance

• upper confidence bounds
  • use an upper confidence bound on each expert's estimated reward

• ensuring exploration
  • make sure each action is taken with some minimum probability

• importance weighting
  • give rare events more importance to keep estimates unbiased

(A sketch combining these ingredients follows.)

40
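A schematic sketch of how these four ingredients fit together in an Exp4.P-style round. The exploration rate mu and the confidence-bonus scale here are illustrative guesses, not the paper's exact tuning; policies, contexts, and pull are assumed inputs.

```python
import math
import random

def exp4p_sketch(T, K, policies, contexts, pull, delta=0.05):
    """Rough Exp4.P-style loop: exponential weights over policies, a uniform
    exploration floor, importance-weighted reward estimates, and a per-policy
    confidence bonus folded into the update. Schematic constants only."""
    N = len(policies)
    w = [1.0] * N
    mu = min(1.0, math.sqrt(math.log(max(N, 2)) / (K * T)))       # exploration floor (assumed tuning)
    bonus_scale = math.sqrt(math.log(N / delta) / (K * T))         # confidence-bonus scale (assumed)
    for t in range(T):
        x = contexts[t]
        recs = [pi(x) for pi in policies]                          # each policy recommends an arm in 0..K-1
        W = sum(w)
        p = [mu / K] * K                                           # mix in uniform exploration
        for i, a in enumerate(recs):
            p[a] += (1 - mu) * w[i] / W
        arm = random.choices(range(K), weights=p)[0]
        r = pull(arm, x)                                           # observed reward in [0, 1]
        r_hat = [0.0] * K
        r_hat[arm] = r / p[arm]                                    # importance-weighted estimate
        for i, a in enumerate(recs):
            y_hat = r_hat[a]                                       # policy's estimated reward this round
            v_hat = 1.0 / p[a]                                     # proxy for the estimate's variance
            w[i] *= math.exp((mu / 2) * (y_hat + bonus_scale * v_hat))
        m = max(w)
        w = [wi / m for wi in w]                                   # renormalize to avoid overflow
```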

Page 41:

41

(slide from Beygelzimer & Langford ICML 2010 tutorial)

Page 42:

(Exp4.P) [Beygelzimer, Langford, Li, R, Schapire ’10]

42

Page 43:

43

Lemma 1

Page 44:

44

Lemma 2

Page 45:

EXP4.P [Beygelzimer-Langford-Li-R-Schapire '11]

Main Theorem [Beygelzimer-Langford-Li-R-Schapire '11]: For any δ > 0, with probability at least 1 − δ, EXP4.P has regret at most O((KT ln(N/δ))^(1/2)) in the adversarial contextual bandit setting.

Key insights (on top of UCB / EXP):
1) exponential weights and upper confidence bounds "stack"
2) a generalized Bernstein's inequality for martingales

[Figure: confidence bounds shrinking over rounds t=1, t=2, t=3.]

45

Page 46:

46

Efficiency

Algorithm | Regret | High Prob? | Context? | Efficient?
Exp4 [ACFS '02] | Õ(T^(1/2)) | No | Yes | No
epoch-greedy [LZ '07] | Õ(T^(2/3)) | Yes | Yes | Yes
Exp3.P / UCB [ACFS '02][A '00] | Õ(T^(1/2)) | Yes | No | Yes
Exp4.P [BLLRS '10] | Õ(T^(1/2)) | Yes | Yes | No

Page 47:

47

EXP4P Applied to Yahoo!

Page 48:

48

Experiments on Yahoo! Data

• We chose a policy class for which we could efficiently keep track of the weights.

• Created 5 clusters, with users (at each time step) getting features based on their distances to the clusters.

• Policies mapped clusters to article (action) choices.

• Ran on personalized news article recommendations for the Yahoo! front page.

• We used a learning bucket on which we ran the algorithms and a deployment bucket on which we ran the greedy (best) learned policy.

Page 49:

49

Experimental Results

Reported estimated (normalized) click-through rates on front-page news. Over 41M user visits; 253 total articles; 21 candidate articles per visit.

 | EXP4P | EXP4 | ε-greedy
Learning eCTR | 1.0525 | 1.0988 | 1.3829
Deployment eCTR | 1.6512 | 1.5309 | 1.4290


Why does this work in practice?

Page 51:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

51

Page 52:

Infinitely Many Policies

• What if we have an infinite number of policies?

• Our bound of Õ((K ln(N) T)^(1/2)) becomes vacuous.

• If we assume our policy class has a finite VC dimension d, then we can tackle this problem.

• Need an i.i.d. assumption. We will also assume K = 2 to illustrate the argument.

52

Page 53:

VC Dimension

• The VC dimension of a hypothesis class captures the class's expressive power.

• It is the cardinality of the largest set (in our case, of contexts) that the class can shatter.

• To shatter means to label in all possible configurations.

53

Page 54:

VE, an Algorithm for VC Sets

The VE algorithm:

• Act uniformly at random for τ rounds.

• This partitions our policies Π into equivalence classes according to their labelings of the first τ examples.

• Pick one representative from each equivalence class to form Π'.

• Run Exp4.P on Π' (a small sketch of this construction appears below).

54
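A sketch of the construction, assuming policies are callables from contexts to arms; exp4p_sketch refers to the earlier illustrative sketch, not the paper's exact algorithm.

```python
import random

def ve_representatives(policies, contexts_tau):
    """VE's key step (a sketch): group policies by how they label the first
    tau contexts, and keep one representative per equivalence class."""
    classes = {}
    for pi in policies:
        labeling = tuple(pi(x) for x in contexts_tau)   # the policy's behavior on the tau contexts
        classes.setdefault(labeling, pi)                # first policy seen with this labeling
    return list(classes.values())

def ve_sketch(T, K, policies, contexts, pull, tau):
    """Explore uniformly for tau rounds, then (schematically) hand the
    deduplicated policy set to an Exp4.P-style learner for the rest."""
    for t in range(tau):
        arm = random.randrange(K)
        pull(arm, contexts[t])                          # uniform exploration
    reps = ve_representatives(policies, contexts[:tau])
    # run Exp4.P on reps for rounds tau..T-1 (e.g., the exp4p_sketch above)
    return reps
```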

Page 55:

Outline of Analysis of VE

• Sauer's lemma bounds the number of equivalence classes by (eτ/d)^d.

• Hence, using the Exp4.P bounds, VE's regret to Π' is ≈ τ + O((Td ln τ)^(1/2)).

• We can show that the regret of Π' to Π is ≈ (T/τ)(d ln T)
  • by looking at the probability of disagreeing on future data given agreement for τ steps.

• τ ≈ (Td ln(1/δ))^(1/2) achieves the optimal trade-off.

• Gives Õ((Td)^(1/2)) regret.

• Still inefficient!

55
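Putting the pieces from this slide together (a rough balancing sketch; constants and exact log factors are loose, and the square root on the Exp4.P term is a reading of the bound, not a verbatim quote):

```latex
\text{Regret}(\mathrm{VE}) \;\lesssim\;
\underbrace{\tau}_{\text{exploration}}
\;+\; \underbrace{\sqrt{T d \ln \tau}}_{\text{Exp4.P on }\Pi'}
\;+\; \underbrace{\frac{T}{\tau}\, d \ln T}_{\text{gap between }\Pi'\text{ and }\Pi},
\qquad
\tau \approx \sqrt{Td} \;\Rightarrow\; \text{Regret} = \tilde{O}\!\left(\sqrt{Td}\right).
```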

Page 56:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

56

Page 57:

Hope for an Efficient Algorithm? [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang '11]

For EXP4.P, the dependence on N in the regret is logarithmic.

this suggests

We could compete with a large, even super-polynomial, number of policies! (e.g., N = K^100 becomes 10 log^(1/2) K in the regret)

however

All known contextual bandit algorithms explicitly "keep track" of the N policies. Even worse, just reading in the N policies would take too long for large N.

57

Page 58:

Idea: Use Supervised Learning

• "Competing" with a large (even exponentially large) set of policies is commonplace in supervised learning.
  • Targets: e.g., linear thresholds, CNF, decision trees (in practice only)
  • Methods: e.g., boosting, SVM, neural networks, gradient descent

• The recommendations of the policies don't need to be explicitly read in when the policy class has structure!

[Figure: labeled examples x1, …, x6 flow into a supervised-learning oracle for policy class Π, which returns a good policy in Π.]

idea originates with [Langford-Zhang '07]

58
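One concrete way to read the oracle abstraction in this figure is sketched below: a brute-force stand-in over a small explicit class, purely for illustration. The whole point of the approach is that a real oracle exploits the class's structure rather than enumerating it.

```python
def argmax_oracle(policy_class, dataset):
    """Supervised-learning oracle (schematic interface): given data points
    (x, reward_vector) with a reward for each of the K actions, return the
    policy in the class with the highest total reward on that data.
    Real implementations use a learning algorithm (e.g., a cost-sensitive
    classifier) instead of this brute-force search."""
    def total_reward(pi):
        return sum(rewards[pi(x)] for x, rewards in dataset)
    return max(policy_class, key=total_reward)
```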

Page 59:

Warning: NP-hard in general.

59

Page 60:

Back to Contextual Bandits

[Figure: the contextual bandit picture again: k arms, N experts/policies (N >> K) recommending arms (e.g., 5, 1, k, 1, 4, 3), contexts x1, x2, x3 so far, and observed rewards such as $0.50 and $0.70, for a running total of $1.20.]

60

Pages 61-62:

Back to Contextual Bandits

[Figure: the partially observed bandit feedback is turned into "made-up data" (complete reward vectors) that are fed to the supervised-learning oracle, which returns a good policy.]

61-62
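A standard way to "make up" the missing rewards is inverse-propensity weighting; here is a sketch of that construction (the algorithm in the paper builds on refinements of this idea):

```python
def make_up_rewards(K, arm_pulled, observed_reward, prob_of_arm):
    """Turn one bandit observation into a full reward vector for the oracle:
    the pulled arm gets its reward divided by the probability it was chosen
    (so the vector is unbiased for the true rewards); the other arms get 0."""
    vec = [0.0] * K
    vec[arm_pulled] = observed_reward / prob_of_arm
    return vec

# e.g., arm 2 of 5 was pulled with probability 0.25 and paid $0.50:
print(make_up_rewards(5, 2, 0.50, 0.25))   # [0.0, 0.0, 2.0, 0.0, 0.0]
```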

Page 63:

Randomized-UCB

Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang '11]: For any δ > 0, with probability at least 1 − δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O((KT ln(NT/δ))^(1/2) + K ln(NK/δ)) in the stochastic contextual bandit setting and runs in time poly(K, T, ln N).

63

Page 64:

Randomized-UCB

Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang '11]: For any δ > 0, with probability at least 1 − δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O((KT ln(NT/δ))^(1/2) + K ln(NK/δ)) in the stochastic contextual bandit setting and runs in time poly(K, T, ln N).

• If arms are chosen using only good policies, such that all have variance < approx. 2K, we win; can prove such a choice exists via a minimax theorem.

• This condition can be softened to occasionally allow choosing bad policies, via "randomized" upper confidence bounds.

• This creates the problem of how to choose arms so as to satisfy the constraints, expressed as a convex optimization problem.

• Solvable by the ellipsoid algorithm; a separation oracle can be implemented with the supervised learning oracle.

64

Page 65:

Not practical to implement! (yet)

65

Page 66:

Outline

• The setting and some background
• Show ideas that fail
• Give a high probability optimal algorithm
• Dealing with VC sets
• An efficient algorithm
• Slates

66

Page 67:

Bandit Slate Problems [Kale-R-Schapire '11]

Problem: Instead of selecting one arm, we need to select s ≥ 1 arms (possibly ranked). The motivation is web ads, where a search engine shows multiple ads at once.

67

Page 68:

Slates Setting

• On round t, the algorithm selects a slate St of s arms
  • Unordered or Ordered
  • No context or Contextual

• Algorithm sees rj(t) for all j in St.

• Algorithm gets reward Σ_{j∈St} rj(t).

• The obvious solution is to reduce to the regular bandit problem, but we can do much better.

68

Page 69:

Algorithm Idea

69

Page 70:

Algorithm Idea

[Figure: a weight vector over the arms, e.g., entries of .33.]

70

Page 71:

Algorithm Idea

71

Page 72:

Algorithm Idea

multiplicative update

72

Page 73:

Algorithm Idea

relative entropy projection

Also "Component Hedge," independently by Koolen et al. '10.

73

Page 74:

Algorithm Idea

74
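A rough sketch of the two steps named on these slides, for the unordered non-contextual case: a multiplicative update on per-arm weights followed by a relative-entropy-style projection that caps each arm's marginal at 1, so the marginals remain feasible for slates of size s. The estimator, the learning rate eta, and the iterative capping routine are simplifications for illustration, not the paper's exact algorithm.

```python
import math

def multiplicative_update(weights, est_rewards, eta=0.1):
    """Multiplicative-weights step on per-arm weights, using (importance-weighted)
    reward estimates for the arms that appeared in the chosen slate."""
    return [w * math.exp(eta * r) for w, r in zip(weights, est_rewards)]

def project_to_slate_marginals(weights, s, iters=50):
    """Approximate relative-entropy projection onto the per-arm marginals of
    size-s slates: marginals sum to s and each is at most 1. Implemented as
    iterative capping-and-rescaling of the uncapped arms."""
    K = len(weights)
    p = [w * s / sum(weights) for w in weights]     # scale so the marginals sum to s
    for _ in range(iters):
        capped = [i for i in range(K) if p[i] >= 1.0]
        free = [i for i in range(K) if p[i] < 1.0]
        if not capped or not free:
            break
        budget = s - len(capped)                    # mass left for the uncapped arms
        scale = budget / sum(p[i] for i in free)
        for i in capped:
            p[i] = 1.0
        for i in free:
            p[i] *= scale
    return p                                        # per-arm inclusion probabilities
```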

Page 75:

75

Slate Results

 | Unordered Slates | Ordered, with Positional Factors
No Policies | Õ((sKT)^(1/2)) * | Õ(s(KT)^(1/2))
N Policies | Õ((sKT ln N)^(1/2)) | Õ(s(KT ln N)^(1/2))

* Independently obtained by Uchiya et al. '10, using different methods.

Page 76:

Discussion

• The contextual bandit setting captures many interesting real-world problems.

• We presented the first optimal, high-probability, contextual algorithm.

• We showed how one could possibly make it efficient.
  • Not fully there yet…

• We discussed slates, a more real-world setting.
  • How to make those efficient?

76