New Algorithms for Contextual Bandits

Lev Reyzin, Georgia Institute of Technology

Work done at Yahoo!

- A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees (AISTATS 2011)

- M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits (UAI 2011)

- S. Kale, L. Reyzin, R. E. Schapire. Non-Stochastic Bandit Slate Problems (NIPS 2010)

Serving Content to Users

[Figure: a user arrives with a context (query, IP address, browser properties, etc.); the system serves a result (e.g. an ad or a news story); the user clicks or not. This interaction repeats with each user visit.]

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Multiarmed Bandits [Robbins ’52]

[Figure: K arms, pulled over rounds 1, 2, 3, 4, …, T; on each pull the algorithm sees only the reward of the arm it pulled, e.g. $0.50, $0, $0.33, $0.20, …]

After T rounds, the arms have earned (say) $0.5T, $0.2T, $0.33T, $0.1T, …, while the algorithm has earned $0.4T. Measured against the best single arm in hindsight ($0.5T), the algorithm's regret is $0.1T.

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire ’02]

[Figure: the same K arms, but now before each pull a context x_t arrives (x1, x2, x3, …, xT) and each of N experts/policies — think of N >> K — recommends an arm; the algorithm then pulls an arm and sees only that arm's reward, e.g. $0.15.]

Over T rounds the experts' recommendations earn (say) $.1T, $.2T, $.22T, 0, $.17T, $.12T, …, while the algorithm earns $0.2T. Measured against the best expert in hindsight ($.22T), the algorithm's regret is $0.02T.

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire ’02]

The rewards can come i.i.d. from a distribution or be arbitrary (the stochastic vs. adversarial settings). The experts can be present or not (the contextual vs. non-contextual settings).

The Setting

- T rounds, K possible actions, N policies π in Π (mapping contexts to actions)
- for t = 1 to T:
  - world commits to rewards r(t) = r_1(t), r_2(t), …, r_K(t) (adversarial or i.i.d.)
  - world provides context x_t
  - learner's policies recommend π_1(x_t), π_2(x_t), …, π_N(x_t)
  - learner chooses action j_t
  - learner receives reward r_{j_t}(t)
- want to compete with the best policy in hindsight
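As a concrete illustration, here is a minimal Python sketch of this interaction protocol, with a made-up uniform-random learner standing in for the algorithms discussed later; the policy class, context generator, and reward process are hypothetical placeholders.

```python
import random

def contextual_bandit_protocol(T, K, policies, draw_context, draw_rewards, choose_action):
    """Run the T-round contextual bandit interaction and return the learner's total reward."""
    total = 0.0
    for t in range(T):
        rewards = draw_rewards(t)              # r(t) = r_1(t), ..., r_K(t); hidden from learner
        x_t = draw_context(t)                  # world provides context x_t
        advice = [pi(x_t) for pi in policies]  # each policy recommends an action
        j_t = choose_action(x_t, advice)       # learner picks an action (advice + past feedback only)
        total += rewards[j_t]                  # learner observes only the chosen action's reward
    return total

# Example usage with toy placeholders:
K = 3
policies = [lambda x, a=a: a for a in range(K)]               # constant policies, one per arm
draw_context = lambda t: t                                     # dummy context
draw_rewards = lambda t: [random.random() for _ in range(K)]   # i.i.d. rewards
choose_action = lambda x, advice: random.choice(advice)        # naive learner: follow a random policy
print(contextual_bandit_protocol(1000, K, policies, draw_context, draw_rewards, choose_action))
```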

Regret

- reward of algorithm A: $G_A(T) \doteq \sum_{t=1}^{T} r_{j_t}(t)$
- expected reward of policy i: $G_i(T) \doteq \sum_{t=1}^{T} \pi_i(x_t) \cdot r(t)$
- algorithm A's regret: $\max_i G_i(T) - G_A(T)$
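To make the definitions concrete, here is a tiny Python computation of these quantities from fully logged (hypothetical) reward data; in the actual bandit setting only r_{j_t}(t) would be observed.

```python
def regret(rewards, chosen, policy_actions):
    """rewards[t][a]: reward of action a at round t (full information, for illustration only)
    chosen[t]: action j_t taken by the algorithm
    policy_actions[i][t]: action recommended by policy i at round t"""
    T = len(rewards)
    G_A = sum(rewards[t][chosen[t]] for t in range(T))                      # algorithm's reward
    G = [sum(rewards[t][acts[t]] for t in range(T)) for acts in policy_actions]  # each policy's reward
    return max(G) - G_A

# toy example: 3 rounds, 2 actions, 2 policies
rewards = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
chosen = [1, 1, 0]
policy_actions = [[0, 1, 0], [1, 0, 1]]
print(regret(rewards, chosen, policy_actions))   # best policy earns 3.0, algorithm 2.0 -> regret 1.0
```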

Regret

- algorithm A's regret: $\max_i G_i(T) - G_A(T)$
- bound on expected regret: $\max_i G_i(T) - \mathbb{E}[G_A(T)] < \varepsilon$
- high probability bound: $\Pr\big[\max_i G_i(T) - G_A(T) > \varepsilon\big] \le \delta$

- Harder than supervised learning: in the bandit setting we do not know the rewards of actions not taken.
- Many applications: ad auctions, medicine, finance, …
- Exploration/exploitation: we can exploit an expert/arm we've learned to be good, and explore an expert/arm we're not sure about.

Some Barriers

Ω(√(KT)) (non-contextual) and Ω(√(TK ln N)) (contextual) are known lower bounds on regret [Auer et al. ’02], even in the stochastic case.

Any algorithm achieving regret Õ(√(KT · polylog N)) is said to be optimal.

ε-greedy algorithms that first explore (act randomly) and then exploit (follow the best policy) cannot be optimal. Any optimal algorithm must be adaptive.

Two Types of Approaches

UCB [Auer ’02]
[Figure: upper confidence bounds on each arm's mean reward, shrinking over t = 1, 2, 3, …]
Algorithm: at every time step
1) pull the arm with the highest UCB
2) update the confidence bound of the arm pulled.

EXP3 / Exponential Weights [Littlestone-Warmuth ’94], [Auer et al. ’02]
Algorithm: at every time step
1) sample from the distribution defined by the weights (mixed with uniform)
2) update the weights "exponentially".
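For concreteness, here is a minimal Python sketch of both update rules on a toy stochastic problem; the constants (confidence radius, mixing parameter γ) follow standard textbook forms and are assumptions, not values from the talk.

```python
import math, random

def ucb1(means, T):
    """UCB1: pull the arm with the highest upper confidence bound on a Bernoulli problem."""
    K = len(means)
    counts, sums, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            j = t - 1                                 # pull each arm once first
        else:
            j = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = float(random.random() < means[j])         # Bernoulli reward
        counts[j] += 1; sums[j] += r; total += r
    return total

def exp3(means, T, gamma=0.1):
    """EXP3: sample from exponential weights mixed with uniform; importance-weighted update."""
    K = len(means)
    w, total = [1.0] * K, 0.0
    for _ in range(T):
        W = sum(w)
        p = [(1 - gamma) * wi / W + gamma / K for wi in w]
        j = random.choices(range(K), weights=p)[0]
        r = float(random.random() < means[j])
        total += r
        w[j] *= math.exp(gamma * (r / p[j]) / K)      # exponential update on the unbiased estimate
    return total

means = [0.5, 0.2, 0.33]
print(ucb1(means, 10000), exp3(means, 10000))
```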

UCB vs EXP3: A Comparison

UCB [Auer ’02]
- Pros:
  - Optimal for the stochastic setting.
  - Succeeds with high probability.
- Cons:
  - Does not work in the adversarial setting.
  - Is not optimal in the contextual setting.

EXP3 & Friends [Auer-CesaBianchi-Freund-Schapire ’02]
- Pros:
  - Optimal for both the adversarial and stochastic settings.
  - Can be made to work in the contextual setting.
- Cons:
  - Does not succeed with high probability in the contextual setting (only in expectation).

Algorithm | Regret | High Prob? | Context?
Exp4 [ACFS ’02] | Õ(√(KT ln N)) | No | Yes
ε-greedy, epoch-greedy [LZ ’07] | Õ((K ln N)^(1/3) T^(2/3)) | Yes | Yes
Exp3.P [ACFS ’02], UCB [Auer ’00] | Õ(√(KT)) | Yes | No
Exp4.P [BLLRS ’10] | Õ(√(KT ln(N/δ))) | Yes | Yes

Ω(√(KT)) lower bound [ACFS ’02]

EXP4P [Beygelzimer-Langford-Li-R-Schapire ’11]

EXP4P combines the advantages of Exponential Weights and UCB:
- optimal for both the stochastic and adversarial settings
- works for the contextual case (and also the non-contextual case)
- a high probability result

Main Theorem [Beygelzimer-Langford-Li-R-Schapire ’11]: For any δ > 0, with probability at least 1−δ, EXP4P has regret at most O(√(KT ln(N/δ))) in the adversarial contextual bandit setting.
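To get a feel for the improvement over the T^(2/3) bounds in the table above, here is a quick numeric comparison of the two rates for some made-up problem sizes (constants ignored; the values of K, N, and T are arbitrary choices for illustration):

```python
import math

def exp4p_rate(K, T, N, delta=0.05):
    return math.sqrt(K * T * math.log(N / delta))        # ~ (KT ln(N/delta))^(1/2)

def epoch_greedy_rate(K, T, N):
    return (K * math.log(N)) ** (1 / 3) * T ** (2 / 3)   # ~ (K ln N)^(1/3) T^(2/3)

K, N = 20, 10**6
for T in (10**4, 10**6, 10**8):
    print(T, round(exp4p_rate(K, T, N)), round(epoch_greedy_rate(K, T, N)))
```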

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Some Failed Approaches

- Bad idea 1: Maintain a set of plausible hypotheses and randomize uniformly over their predicted actions.
  - Adversary has two actions, one always paying off 1 and the other 0. The hypotheses generally agree on the correct action, except for a different one that defects each round. This incurs regret of ~T/2.
- Bad idea 2: Maintain a set of plausible hypotheses and randomize uniformly among the hypotheses.
  - Adversary has two actions, one always paying off 1 and the other 0. If all but one of > 2T hypotheses always predict the wrong arm, and only one hypothesis always predicts the good arm, then with probability > 1/2 it is never picked and the algorithm incurs regret of T.

epsilon-greedy

- Rough idea of ε-greedy (or ε-first): act randomly for ε rounds, then go with the best (arm or expert).
- Even if we know the number of rounds in advance, ε-first won't get us regret O(√T), even in the non-contextual setting.
- Rough analysis: even for just 2 arms, we suffer regret of about ε + (T−ε)/√ε.
  - ε ≈ T^(2/3) is the optimal trade-off.
  - This gives regret ≈ T^(2/3).
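A quick way to see the trade-off (a rough sketch, treating constants as 1 and the per-round estimation error after ε exploration rounds as order 1/√ε):

```latex
% exploration cost  ~ eps                   (random play for eps rounds)
% exploitation cost ~ (T - eps)/sqrt(eps)   (estimation error over the remaining rounds)
\[
  \text{regret}(\varepsilon) \;\approx\; \varepsilon + \frac{T-\varepsilon}{\sqrt{\varepsilon}},
  \qquad
  \frac{d}{d\varepsilon}\Big(\varepsilon + T\varepsilon^{-1/2}\Big) = 0
  \;\Rightarrow\; \varepsilon \approx T^{2/3}
  \;\Rightarrow\; \text{regret} \approx T^{2/3}.
\]
```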

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Ideas Behind Exp4.P (all appeared in previous algorithms)

- exponential weights: keep a weight on each expert that drops exponentially in the expert's (estimated) performance
- upper confidence bounds: use an upper confidence bound on each expert's estimated reward
- ensuring exploration: make sure each action is taken with some minimum probability
- importance weighting: give rare events more importance to keep estimates unbiased
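The sketch below combines these four ingredients in Python so their interaction is visible; it is not the paper's exact Exp4.P (the confidence term, learning rates, and constants here are hypothetical placeholders).

```python
import math, random

def exp4p_style(T, K, policies, contexts, reward_fn, gamma=0.1, c=1.0):
    """A sketch combining the four ingredients above (not the paper's exact constants)."""
    N = len(policies)
    w = [1.0] * N                                   # exponential weights, one per expert
    total = 0.0
    for t in range(T):
        x = contexts[t]
        advice = [pi(x) for pi in policies]
        W = sum(w)
        p = [gamma / K] * K                         # ensuring exploration: uniform mixing
        for i in range(N):
            p[advice[i]] += (1 - gamma) * w[i] / W
        j = random.choices(range(K), weights=p)[0]
        r = reward_fn(t, j)                         # only the chosen arm's reward is observed
        total += r
        r_hat = [0.0] * K
        r_hat[j] = r / p[j]                         # importance weighting: unbiased estimate
        for i in range(N):
            a = advice[i]
            bonus = c / (p[a] * math.sqrt(K * T))   # hypothetical upper-confidence-style term
            w[i] *= math.exp(gamma / K * (r_hat[a] + bonus))   # exponential-weights update
    return total

# toy usage: 3 arms, constant policies, Bernoulli rewards
K, T = 3, 5000
policies = [lambda x, a=a: a for a in range(K)]
means = [0.5, 0.2, 0.33]
print(exp4p_style(T, K, policies, list(range(T)),
                  lambda t, j: float(random.random() < means[j])))
```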

(Exp4.P) [Beygelzimer, Langford, Li, R, Schapire ’10]

[Slide: the Exp4.P algorithm pseudocode — slide from the Beygelzimer & Langford ICML 2010 tutorial.]

[Slides: Lemma 1 and Lemma 2 — technical lemmas in the analysis; statements not preserved in this transcript.]

EXP4P [Beygelzimer-Langford-Li-R-Schapire ’11]

Main Theorem [Beygelzimer-Langford-Li-R-Schapire ’11]: For any δ > 0, with probability at least 1−δ, EXP4P has regret at most O(√(KT ln(N/δ))) in the adversarial contextual bandit setting.

Key insights (on top of UCB / EXP):
1) exponential weights and upper confidence bounds "stack"
2) a generalized Bernstein inequality for martingales

Efficiency

Algorithm | Regret | High Prob? | Context? | Efficient?
Exp4 [ACFS ’02] | Õ(T^(1/2)) | No | Yes | No
epoch-greedy [LZ ’07] | Õ(T^(2/3)) | Yes | Yes | Yes
Exp3.P / UCB [ACFS ’02][A ’00] | Õ(T^(1/2)) | Yes | No | Yes
Exp4.P [BLLRS ’10] | Õ(T^(1/2)) | Yes | Yes | No

EXP4P Applied to Yahoo!

Experiments on Yahoo! Data

- We chose a policy class for which we could efficiently keep track of the weights.
- Created 5 clusters, with users (at each time step) getting features based on their distances to the clusters.
- Policies mapped clusters to article (action) choices.
- Ran on personalized news article recommendation for the Yahoo! front page.
- We used a learning bucket, on which we ran the algorithms, and a deployment bucket, on which we ran the greedy (best) learned policy.

Experimental Results

Reported estimated (normalized) click-through rates on front-page news. Over 41M user visits; 253 total articles; 21 candidate articles per visit.

 | EXP4P | EXP4 | ε-greedy
Learning eCTR | 1.0525 | 1.0988 | 1.3829
Deployment eCTR | 1.6512 | 1.5309 | 1.4290

Why does this work in practice?

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Infinitely Many Policies

- What if we have an infinite number of policies?
- Our bound of Õ(√(KT ln N)) becomes vacuous.
- If we assume our policy class has finite VC dimension d, then we can tackle this problem.
- Need an i.i.d. assumption. We will also assume K = 2 to illustrate the argument.

VC Dimension

- The VC dimension of a hypothesis class captures the class's expressive power.
- It is the cardinality of the largest set (in our case, of contexts) the class can shatter.
- To shatter a set means to label it in all possible configurations.

VE, an Algorithm for VC Sets

The VE algorithm:
- Act uniformly at random for τ rounds.
- This partitions our policies Π into equivalence classes according to their labelings of the first τ examples.
- Pick one representative from each equivalence class to form Π’.
- Run Exp4.P on Π’.
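A minimal Python sketch of the partitioning step (the policy objects, contexts, and the downstream Exp4.P call are placeholders):

```python
import random

def ve_representatives(policies, contexts, tau):
    """Group policies by their labelings of the first tau (uniformly explored) contexts
    and return one representative per equivalence class (the set Pi')."""
    explored = contexts[:tau]
    classes = {}
    for pi in policies:
        labeling = tuple(pi(x) for x in explored)   # the policy's behavior on the explored prefix
        classes.setdefault(labeling, pi)            # keep the first policy seen with this labeling
    return list(classes.values())

# toy usage: threshold policies over 1-D contexts, K = 2 actions
policies = [lambda x, th=th: int(x > th) for th in [0.1 * i for i in range(100)]]
contexts = [random.random() for _ in range(1000)]
reps = ve_representatives(policies, contexts, tau=30)
print(len(policies), "policies collapse to", len(reps), "equivalence classes")
# ...then run Exp4.P on `reps` for the remaining rounds.
```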

Outline of Analysis of VE

- Sauer's lemma bounds the number of equivalence classes by (eτ/d)^d.
- Hence, using the Exp4.P bound, VE's regret to Π’ is ≈ τ + Õ(√(Td ln τ)).
- We can show that the regret of Π’ to Π is ≈ (T/τ)(d ln T), by looking at the probability of disagreeing on future data given agreement for τ steps.
- τ ≈ √(Td ln(1/δ)) achieves the optimal trade-off.
- This gives Õ(√(Td)) regret.
- Still inefficient!
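The same balancing act as in the ε-first analysis, now with the two dominant terms from the bullets above (a rough sketch, ignoring constants and lower-order log factors):

```latex
\[
  \text{regret}(\tau) \;\approx\; \tau \;+\; \frac{T}{\tau}\, d \ln T,
  \qquad
  \tau \approx \sqrt{T d \ln T}
  \;\Rightarrow\;
  \text{regret} \approx \sqrt{T d \ln T} \;=\; \tilde{O}\big(\sqrt{Td}\big).
\]
```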

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Hope for an Efficient Algorithm? [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang ’11]

For EXP4P, the dependence on N in the regret is logarithmic.

This suggests we could compete with a large, even super-polynomial, number of policies! (e.g. N = K^100 contributes only 10 · ln^(1/2) K to the regret)

However, all known contextual bandit algorithms explicitly "keep track" of the N policies. Even worse, just reading in the N policies would take too long for large N.
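Concretely, for the example in the text:

```latex
\[
  N = K^{100} \;\Rightarrow\; \ln N = 100 \ln K
  \;\Rightarrow\; \sqrt{KT \ln N} = 10\,\sqrt{KT \ln K},
\]
```

so the huge policy class costs only a constant factor of 10 in the regret bound.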

Idea: Use Supervised Learning

- "Competing" with a large (even exponentially large) set of policies is commonplace in supervised learning.
  - Targets: e.g. linear thresholds, CNF, decision trees (in practice only)
  - Methods: e.g. boosting, SVM, neural networks, gradient descent
- The recommendations of the policies don't need to be explicitly read in when the policy class has structure!

[Figure: a supervised learning oracle takes labeled data and the policy class Π and returns a good policy in Π.]

The idea originates with [Langford-Zhang ’07].
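As a rough illustration of what "access to a supervised learning oracle" means here, the sketch below shows an oracle that returns the policy with the highest total reward on a made-up dataset of (context, per-action reward estimates) pairs; a real oracle would exploit the structure of Π instead of enumerating it as this toy version does.

```python
def argmax_oracle(policies, dataset):
    """Return the policy in the class maximizing total (estimated) reward on the dataset.

    dataset: list of (context, reward_vector) pairs, where reward_vector[a] is an
    estimate of the reward of action a in that context (e.g. importance-weighted).
    """
    def empirical_reward(pi):
        return sum(rewards[pi(x)] for x, rewards in dataset)
    return max(policies, key=empirical_reward)

# toy usage with constant policies over K = 3 actions
K = 3
policies = [lambda x, a=a: a for a in range(K)]
dataset = [(0, [0.0, 1.0, 0.2]), (1, [0.1, 0.9, 0.0]), (2, [0.3, 0.8, 0.1])]
best = argmax_oracle(policies, dataset)
print("oracle picks the policy playing action", best(0))
```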

Warning: NP-Hard in General

Back to Contextual Bandits

[Figure: the contextual bandit interaction as before — K arms, contexts x1, x2, x3, …, and N experts/policies with N >> K — but now the observed rewards (e.g. $1.20, $0.50, $0.70) are turned into made-up data that is fed to the supervised learning oracle, which returns a good policy in Π.]

Randomized-UCB

Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang ’11]: For any δ > 0, with probability at least 1−δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O(√(KT ln(NT/δ)) + K ln(NK/δ)) in the stochastic contextual bandit setting and runs in time poly(K, T, ln N).

Key ideas:
- If arms are chosen among only good policies such that all have variance < approximately 2K, we win; such a choice can be proven to exist via a minimax theorem.
- This condition can be softened to occasionally allow choosing bad policies, via "randomized" upper confidence bounds.
- This creates the problem of how to choose arms so as to satisfy the constraints, expressed as a convex optimization problem.
- The optimization is solvable by the ellipsoid algorithm; a separation oracle can be implemented with the supervised learning oracle.
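One rough way to write the "variance < approximately 2K" condition from the first bullet (this is a paraphrase of the slide's wording, not the paper's exact constraint): if q(a | x) is the probability that the chosen distribution over good policies places on action a under context x, then every (near-)optimal policy π should satisfy roughly

```latex
\[
  \mathbb{E}_{x}\!\left[\frac{1}{q\big(\pi(x)\mid x\big)}\right] \;\lesssim\; 2K ,
\]
```

since this expectation controls the variance of π's importance-weighted reward estimate.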

Not practical to implement! (yet)

Outline

- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Bandit Slate Problems [Kale-R-Schapire ’11]

Problem: Instead of selecting one arm, we need to select s ≥ 1 arms (possibly ranked). The motivation is web ads, where a search engine shows multiple ads at once.

Slates Setting

- On round t, the algorithm selects a slate S_t of s arms.
  - Unordered or ordered
  - No context or contextual
- The algorithm sees r_j(t) for all j in S_t.
- The algorithm receives reward $\sum_{j \in S_t} r_j(t)$.
- The obvious solution is to reduce to the regular bandit problem, but we can do much better.

Algorithm Idea

[Figure: maintain a point in the polytope of slates over the K arms, starting at the uniform point (e.g. weight .33 on each of three arms); on each round, apply a multiplicative update to the weights, followed by a relative entropy projection back onto the polytope.]

Also "Component Hedge," independently by Koolen et al. ’10.
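A rough Python sketch of one round of the unordered-slate update under assumed simplifications (γ mixing, step size η, and a capped-simplex relative-entropy projection computed by iterative capping); the step that decomposes the marginal vector into a distribution over actual size-s slates for sampling is omitted.

```python
import math

def kl_project_capped(w, s, iters=100):
    """Relative-entropy projection of positive weights onto {p : sum(p)=s, 0<=p_i<=1}:
    cap coordinates at 1 and rescale the remaining ones until feasible."""
    p = list(w)
    capped = [False] * len(p)
    for _ in range(iters):
        free = [i for i, c in enumerate(capped) if not c]
        budget = s - (len(p) - len(free))            # mass left for uncapped coordinates
        scale = budget / sum(p[i] for i in free)
        for i in free:
            p[i] *= scale
        overflow = [i for i in free if p[i] > 1.0]
        if not overflow:
            break
        for i in overflow:
            p[i], capped[i] = 1.0, True
    return p

def slate_update(p, chosen, rewards_seen, s, eta=0.05, gamma=0.01):
    """One multiplicative-update + projection step on the per-arm marginals p."""
    K = len(p)
    q = [(1 - gamma) * pi + gamma * s / K for pi in p]   # mix for exploration
    w = list(q)
    for j in chosen:                                     # rewards observed for the whole slate
        w[j] *= math.exp(eta * rewards_seen[j] / q[j])   # multiplicative (importance-weighted) update
    return kl_project_capped(w, s)                       # relative entropy projection

# toy step: K = 5 arms, slates of size s = 2, uniform start
K, s = 5, 2
p = [s / K] * K
p = slate_update(p, chosen=[0, 3], rewards_seen={0: 1.0, 3: 0.2}, s=s)
print(p)
```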

Slate Results

 | Unordered Slates | Ordered, with Positional Factors
No Policies | Õ(√(sKT)) * | Õ(s√(KT))
N Policies | Õ(√(sKT ln N)) | Õ(s√(KT ln N))

* Independently obtained by Uchiya et al. ’10, using different methods.

Discussion

- The contextual bandit setting captures many interesting real-world problems.
- We presented the first optimal, high-probability, contextual algorithm.
- We showed how one could possibly make it efficient. (Not fully there yet…)
- We discussed slates, a more real-world setting. How can those be made efficient?