Introduction to Bandits: Algorithms and Theory

Jean-Yves Audibert (1, 2) & Rémi Munos (3)
1. Université Paris-Est, LIGM, Imagine
2. CNRS / École Normale Supérieure / INRIA, LIENS, Sierra
3. INRIA Sequential Learning team, France
ICML 2011, Bellevue (WA), USA

Transcript of Introduction to Bandits: Algorithms and Theory

Outline
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Bandit game

Parameters available to the forecaster: the number of arms (or actions) K and the number of rounds n.
Unknown to the forecaster: the way the gain vectors g_t = (g_{1,t}, ..., g_{K,t}) ∈ [0, 1]^K are generated.

For each round t = 1, 2, ..., n:
1. the forecaster chooses an arm I_t ∈ {1, ..., K}
2. the forecaster receives the gain g_{I_t,t}
3. only g_{I_t,t} is revealed to the forecaster

Cumulative regret goal: maximize the cumulative gains obtained. More precisely, minimize

  R_n = E[ max_{i=1,...,K} Σ_{t=1}^n g_{i,t} − Σ_{t=1}^n g_{I_t,t} ],

where the expectation E comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of I_t.
Stochastic and adversarial environments

- Stochastic environment: the gain vector g_t is sampled from an unknown product distribution ν_1 ⊗ ... ⊗ ν_K on [0, 1]^K, that is, g_{i,t} ∼ ν_i.
- Adversarial environment: the gain vector g_t is chosen by an adversary (which, at time t, knows all the past, but not I_t).
Numerous variants

- different environments: adversarial, "stochastic", non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extensions with additional rules: varying set of arms, pay-per-observation, ...
Various applications

- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...
Outline
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Stochastic bandit game (Robbins, 1952)

Parameters available to the forecaster: K and n.
Parameters unknown to the forecaster: the reward distributions ν_1, ..., ν_K of the arms (with respective means µ_1, ..., µ_K).

For each round t = 1, 2, ..., n:
1. the forecaster chooses an arm I_t ∈ {1, ..., K}
2. the environment draws the gain vector g_t = (g_{1,t}, ..., g_{K,t}) according to ν_1 ⊗ ... ⊗ ν_K
3. the forecaster receives the gain g_{I_t,t}

Notation: i* = argmax_{i=1,...,K} µ_i,  µ* = max_{i=1,...,K} µ_i,  ∆_i = µ* − µ_i,  T_i(n) = Σ_{t=1}^n 1_{I_t=i}

Cumulative regret: R_n = Σ_{t=1}^n g_{i*,t} − Σ_{t=1}^n g_{I_t,t}, whose expectation satisfies

  E R_n = n µ* − E[ Σ_{i=1}^K T_i(n) µ_i ] = Σ_{i=1}^K ∆_i E[T_i(n)].
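To make the protocol and the regret decomposition concrete, here is a minimal Python sketch (not from the tutorial; the class name and the choice of Bernoulli arms are illustrative assumptions).

```python
import random

class StochasticBandit:
    """K-armed stochastic bandit; here each nu_i is Bernoulli(mu_i) for simplicity."""

    def __init__(self, means):
        self.means = means                 # mu_1, ..., mu_K
        self.counts = [0] * len(means)     # T_i(n): number of pulls of each arm

    def pull(self, i):
        """Draw g_{i,t} ~ nu_i for the chosen arm i and return it."""
        self.counts[i] += 1
        return 1.0 if random.random() < self.means[i] else 0.0

    def regret_decomposition(self):
        """sum_i Delta_i * T_i(n); averaged over runs this estimates sum_i Delta_i E[T_i(n)]."""
        mu_star = max(self.means)
        return sum((mu_star - mu) * t for mu, t in zip(self.means, self.counts))
```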
A simple policy: ε-greedy

For simplicity, all rewards are in [0, 1].

- Playing the arm with the highest empirical mean does not work.
- ε-greedy: at time t,
  - with probability 1 − ε_t, play the arm with the highest empirical mean;
  - with probability ε_t, play a random arm.
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - Let ∆ = min_{i:∆_i>0} ∆_i and consider ε_t = min( 6K / (∆² t), 1 ).
  - When t ≥ 6K/∆², the probability of choosing a suboptimal arm i is bounded by C / (∆² t) for some constant C > 0.
  - As a consequence, E[T_i(n)] ≤ (C/∆²) log n and R_n ≤ Σ_{i:∆_i>0} (C ∆_i / ∆²) log n → logarithmic regret.
- Drawbacks (a code sketch of the rule follows this list):
  - naive exploration for K > 2: no distinction between sub-optimal arms;
  - requires knowledge of ∆;
  - outperformed by the UCB policy in practice.
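A minimal sketch of the ε-greedy rule above (Python, illustrative, not the authors' code); the schedule ε_t = min(6K/(∆²t), 1) takes ∆ as an input, which is exactly the drawback pointed out above. `pull(i)` can be any function returning a reward in [0, 1], e.g. `StochasticBandit.pull` from the earlier sketch.

```python
import random

def epsilon_greedy(pull, n, K, delta):
    """epsilon-greedy with the theoretical schedule eps_t = min(6K / (delta^2 t), 1)."""
    sums, counts = [0.0] * K, [0] * K
    for t in range(1, n + 1):
        eps_t = min(6.0 * K / (delta ** 2 * t), 1.0)
        if random.random() < eps_t:
            i = random.randrange(K)        # explore: uniformly random arm
        else:                              # exploit: highest empirical mean (unplayed arms first)
            i = max(range(K), key=lambda a: sums[a] / counts[a] if counts[a] else float("inf"))
        g = pull(i)
        sums[i] += g
        counts[i] += 1
    return sums, counts
```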
Optimism in face of uncertainty

- At time t, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected reward of each arm.
- Simple implementation: play the arm with the highest UCB.
Why does it make sense?

- Could we keep drawing a wrong arm for a long time? No, since:
  - the more we draw a wrong arm i, the closer its UCB gets to the expected reward µ_i;
  - µ_i < µ* ≤ UCB on µ* (with high probability).
Illustration of UCB policy

[Figure: confidence intervals vs. sampling times.]
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality: let X, X_1, ..., X_m be i.i.d. r.v. taking their values in [0, 1]. For any ε > 0, with probability at least 1 − ε, we have

    E X ≤ (1/m) Σ_{s=1}^m X_s + √( log(ε^{-1}) / (2m) ).

- UCB1 policy (sketched in code below): at time t, play

    I_t ∈ argmax_{i∈{1,...,K}} { µ̂_{i,t−1} + √( 2 log t / T_i(t−1) ) },

  where µ̂_{i,t−1} = (1 / T_i(t−1)) Σ_{s=1}^{T_i(t−1)} X_{i,s} is the empirical mean of arm i.

- Regret bound: R_n ≤ Σ_{i:∆_i>0} (8/∆_i) log n + (1 + π²/3) Σ_{i=1}^K ∆_i.

- UCB1 is an anytime policy (it does not need to know n to be implemented).

- UCB1 corresponds to the confidence level given by 2 log t = log(ε^{-1})/2, hence ε = 1/t⁴.

- Critical confidence level: ε = 1/t (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvari, 2009; Honda & Takemura, 2010).
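A minimal sketch of UCB1 as stated above (Python, illustrative); the parameter ρ defaults to 2, matching the index µ̂_{i,t−1} + √(2 log t / T_i(t−1)), and other values of ρ give the UCB1(ρ) variant discussed on the next slides. `pull(i)` is any function returning a reward in [0, 1].

```python
import math

def ucb1(pull, n, K, rho=2.0):
    """UCB1(rho): play each arm once, then the arm maximizing
    empirical mean + sqrt(rho * log t / T_i(t-1)); rho = 2 is the UCB1 of the slide."""
    sums, counts = [0.0] * K, [0] * K
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1                       # initialization: play each arm once
        else:
            i = max(range(K), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(rho * math.log(t) / counts[a]))
        g = pull(i)                         # pull(i) returns a reward in [0, 1]
        sums[i] += g
        counts[i] += 1
    return sums, counts
```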
Better confidence bounds imply smaller regret

- Hoeffding's inequality, 1/t-confidence bound:

    E X ≤ (1/m) Σ_{s=1}^m X_s + √( log t / (2m) ).

- Empirical Bernstein's inequality, 1/t-confidence bound (it adapts to the empirical variance V̂_m):

    E X ≤ (1/m) Σ_{s=1}^m X_s + √( 2 V̂_m log t / m ) + 3 log t / m

  (Audibert, Munos, Szepesvari, 2009; Maurer, 2009; Audibert, 2010).

- Asymptotic (Gaussian) confidence bound leads to catastrophe:

    E X ≤ (1/m) Σ_{s=1}^m X_s + (σ̂_m / √m) q_t, with q_t s.t. ∫_{q_t}^{+∞} e^{−x²/2} dx / √(2π) = 1/t.
Better confidence bounds imply smaller regret

[Figure: comparison of the confidence bounds used by Hoeffding-based UCB vs. empirical Bernstein-based UCB.]
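As an illustration of the variance-adaptive bound, here is a sketch of an empirical-Bernstein-style upper confidence bound for a single arm (Python; the constants follow the form stated above and are illustrative, not the tutorial's exact choice).

```python
import math

def empirical_bernstein_ucb(rewards, t):
    """Empirical-Bernstein-style 1/t upper confidence bound for one arm.

    Form: mean + sqrt(2 * Vhat * log t / m) + 3 * log t / m (constants are illustrative).
    """
    m = len(rewards)
    mean = sum(rewards) / m
    var = sum((x - mean) ** 2 for x in rewards) / m   # empirical variance Vhat_m
    return mean + math.sqrt(2.0 * var * math.log(t) / m) + 3.0 * math.log(t) / m
```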
Tuning the exploration: simple vs difficult bandit problems

- UCB1(ρ) policy: at time t, play

    I_t ∈ argmax_{i∈{1,...,K}} { µ̂_{i,t−1} + √( ρ log t / T_i(t−1) ) },

  where ρ > 0 is an exploration parameter.
[Plot: expected regret of UCB1(ρ) as a function of the exploration parameter ρ ∈ [0, 2], for n = 1000 and K = 2 arms: Dirac(0.6) and Ber(0.5).]
[Plot: expected regret of UCB1(ρ) as a function of the exploration parameter ρ ∈ [0, 2], for n = 1000 and K = 2 arms: Ber(0.6) and Ber(0.5).]
Tuning the exploration parameter: from theory to practice

- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret;
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
  (Audibert, Munos, Szepesvari, 2009; Bubeck, 2010).
- Practice: ρ = 0.2 seems to be the best default value for n < 10^8.
[Plot: expected regret of UCB1(ρ) as a function of the exploration parameter ρ ∈ [0, 2], for n = 1000 and K = 3 arms: Ber(0.6), Ber(0.5) and Ber(0.5).]
[Plot: expected regret of UCB1(ρ) as a function of the exploration parameter ρ ∈ [0, 2], for n = 1000 and K = 5 arms: Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4) and Ber(0.3).]
Deviations of UCB1 regret

- Recall the UCB1 policy: at time t, play I_t ∈ argmax_{i∈{1,...,K}} { µ̂_{i,t−1} + √( 2 log t / T_i(t−1) ) }.
- An inequality of the form P( R_n > E R_n + γ ) ≤ c e^{−cγ} does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second-best arm, then the regret of UCB1 satisfies: for any C > 0, there exists C′ > 0 such that for any n ≥ 2,

    P( R_n > E R_n + C log n ) > 1 / ( C′ (log n)^{C′} ).
Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any C > 0, there exists C′ > 0 such that for any n ≥ 2,

    P( R_n > E R_n + C log n ) > 1 / ( C′ (log n)^{C′} ).    (★)

- UCB-H policy (the horizon n is known): at time t, play

    I_t ∈ argmax_{i∈{1,...,K}} { µ̂_{i,t−1} + √( 2 log n / T_i(t−1) ) }

  (Audibert, Munos, Szepesvari, 2009).
- UCB-H satisfies P( R_n > E R_n + C log n ) ≤ C/n for some C > 0.
- (★) is unavoidable for anytime policies (Salomon, Audibert, 2011).
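UCB-H differs from UCB1 only by putting the known horizon n, rather than the current time t, inside the logarithm; a one-function sketch of the index (Python, illustrative):

```python
import math

def ucb_h_index(empirical_mean, pulls, horizon):
    """UCB-H index: empirical mean + sqrt(2 * log(n) / T_i(t-1)), with n the known horizon."""
    return empirical_mean + math.sqrt(2.0 * math.log(horizon) / pulls)
```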
Comparison of UCB1 (solid lines) and UCB-H (dotted lines)
Comparison of UCB1(ρ) and UCB-H(ρ) in expectation

[Plot: expected regret vs. exploration parameter ρ ∈ [0, 2] for GCL*, UCB1(ρ) and UCB-H(ρ), with n = 1000 and K = 2 arms: Dirac(0.6) and Ber(0.5).]
[Plot: expected regret vs. exploration parameter ρ ∈ [0, 2] for GCL*, UCB1(ρ) and UCB-H(ρ), with n = 1000 and K = 2 arms: Ber(0.6) and Dirac(0.5).]
Comparison of UCB1(ρ) and UCB-H(ρ) in deviations

- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5).

[Plot: left, smoothed probability mass function of the regret; right, tail distribution of the regret, both as functions of the regret level r.]
Knowing the horizon: theory and practice

- Theory: use UCB-H to avoid heavy tails of the regret.
- Practice: theory is right. Besides, thanks to this robustness, the expected regret of UCB-H(ρ) is consistently smaller than that of UCB1(ρ). However:
  - the gain is small;
  - a better way to obtain small regret tails is to take a larger ρ.
Knowing µ*

- Hoeffding-based GCL* policy: play each arm once, then play

    I_t ∈ argmin_{i∈{1,...,K}} T_i(t−1) ( µ* − µ̂_{i,t−1} )₊²

  (Salomon, Audibert, 2011).
- Underlying ideas:
  - compare the p-values of the K tests H0 = {µ_i = µ*}, i ∈ {1, ..., K};
  - the p-values are estimated using Hoeffding's inequality:

      P_{H0}( µ̂_{i,t−1} ≤ x ) ≤ exp( −2 T_i(t−1) (µ* − x)₊² );

  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm.
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!
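A sketch of the greatest-confidence-level principle when µ* is known (Python, illustrative; the Hoeffding index T_i(t−1)(µ* − µ̂_{i,t−1})₊² below is my reading of the p-value formula above, so treat the exact index as an assumption rather than the authors' code).

```python
def gcl_star_hoeffding(pull, n, K, mu_star):
    """Greatest-confidence-level rule with known mu*: play each arm once, then the arm
    with the largest Hoeffding p-value for H0: mu_i = mu*, i.e. minimize
    T_i(t-1) * max(mu_star - empirical_mean_i, 0)^2 (reconstructed index)."""
    sums, counts = [0.0] * K, [0] * K
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1                       # play each arm once
        else:
            i = min(range(K), key=lambda a: counts[a]
                    * max(mu_star - sums[a] / counts[a], 0.0) ** 2)
        g = pull(i)                         # pull(i) returns a reward in [0, 1]
        sums[i] += g
        counts[i] += 1
    return sums, counts
```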
From Chernoff's inequality to KL-based algorithms

- Let K(p, q) be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters p and q.
- Let X_1, ..., X_T be i.i.d. r.v. of mean µ, taking their values in [0, 1], and let X̄ = (1/T) Σ_{s=1}^T X_s. Chernoff's inequality gives, for any γ > 0,

    P( X̄ ≤ µ − γ ) ≤ exp( −T K(µ − γ, µ) ).

- If µ* is known, using the same idea of comparing the p-values of the tests H0 = {µ_i = µ*}, i ∈ {1, ..., K}, we get the Chernoff-based GCL* policy: play each arm once, then play

    I_t ∈ argmin_{i∈{1,...,K}} T_i(t−1) K( µ̂_{i,t−1}, µ* ).
Back to unknown µ*

- When µ* is unknown, the principle

    "play the arm for which we have the greatest confidence level that it is the optimal arm"

  is replaced by

    "be optimistic in the face of uncertainty":

  - an arm i is represented by the highest mean of a distribution ν for which the hypothesis H0 = {ν_i = ν} has a p-value greater than 1/t^β (critical β = 1, as usual);
  - the arm with the highest index (= UCB) is played.
KL-based algorithms when µ* is unknown

- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies

    lim sup_{n→+∞} R_n / log n ≤ Σ_{i:∆_i>0} ∆_i / K_inf(ν_i, µ*),

  where K_inf(ν, µ) is the minimal KL divergence between ν and the distributions with mean larger than µ (Burnetas & Katehakis, 1996; Honda & Takemura, 2010).
  It matches the lower bound

    lim inf_{n→+∞} R_n / log n ≥ Σ_{i:∆_i>0} ∆_i / K_inf(ν_i, µ*)

  (Lai & Robbins, 1985; Burnetas & Katehakis, 1996).
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappe & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011).
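For the KL-based indices, the upper confidence bound of an arm is the largest mean still compatible with its observations; here is a sketch of this computation by bisection, in the spirit of KL-UCB for Bernoulli-like rewards (Python, illustrative; the threshold log t is a simplification of the one used by Cappe & Garivier).

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """K(p, q): KL divergence between Bernoulli(p) and Bernoulli(q), with clipping."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(empirical_mean, pulls, t, iters=30):
    """Largest q in [mean, 1] with pulls * K(mean, q) <= log t, found by bisection
    (K(mean, q) is increasing in q on [mean, 1], so bisection applies)."""
    threshold = math.log(t) / pulls
    lo, hi = empirical_mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(empirical_mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```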
Outline
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Adversarial bandit

Parameters: the number of arms K and the number of rounds n.

For each round t = 1, 2, ..., n:
1. the forecaster chooses an arm I_t ∈ {1, ..., K}, possibly with the help of an external randomization
2. the adversary chooses a gain vector g_t = (g_{1,t}, ..., g_{K,t}) ∈ [0, 1]^K
3. the forecaster receives and observes only the gain g_{I_t,t}

Goal: maximize the cumulative gains obtained. We consider the regret

  R_n = E[ max_{i=1,...,K} Σ_{t=1}^n g_{i,t} − Σ_{t=1}^n g_{I_t,t} ].

- In full information, step 3 is replaced by: the forecaster receives g_{I_t,t} and observes the full gain vector g_t.
- In both settings, the forecaster should use an external randomization to have o(n) regret.
Adversarial setting in full information: an optimal policy

- Cumulative reward on [1, t−1]: G_{i,t−1} = Σ_{s=1}^{t−1} g_{i,s}.
- Follow-the-leader, I_t ∈ argmax_{i∈{1,...,K}} G_{i,t−1}, is a bad policy.
- An "optimal" policy (the exponentially weighted average forecaster) is obtained by drawing I_t according to

    p_{i,t} = P(I_t = i) = exp(η G_{i,t−1}) / Σ_{k=1}^K exp(η G_{k,t−1}).

- For this policy, R_n ≤ nη/8 + (log K)/η. In particular, for η = √( 8 log K / n ), we have R_n ≤ √( n log K / 2 ).
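A minimal sketch of the exponentially weighted average forecaster in the full-information setting (Python, illustrative); with η = √(8 log K / n) it matches the tuning giving R_n ≤ √(n log K / 2).

```python
import math, random

def exp_weights_full_info(gain_vectors, K, eta):
    """Exponentially weighted average forecaster, full-information setting.

    gain_vectors: iterable of gain vectors g_t = (g_{1,t}, ..., g_{K,t}) in [0, 1]^K.
    Returns the forecaster's cumulative gain."""
    G = [0.0] * K                                    # cumulative gains G_{i,t-1}
    total = 0.0
    for g_t in gain_vectors:
        m = max(G)                                   # subtract the max for numerical stability
        w = [math.exp(eta * (Gi - m)) for Gi in G]
        s = sum(w)
        p = [wi / s for wi in w]                     # p_{i,t} proportional to exp(eta * G_{i,t-1})
        i = random.choices(range(K), weights=p)[0]   # draw I_t ~ p_t
        total += g_t[i]
        G = [Gi + gi for Gi, gi in zip(G, g_t)]      # full information: the whole g_t is observed
    return total
```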
Proof of the regret bound

Let p_{i,t} = P(I_t = i) = exp(η G_{i,t−1}) / Σ_{k=1}^K exp(η G_{k,t−1}) and consider the potential Φ_t = (1/η) log( Σ_{k=1}^K exp(η G_{k,t}) ). By Hoeffding's lemma,

  Φ_t − Φ_{t−1} = (1/η) log( Σ_{i=1}^K p_{i,t} exp(η g_{i,t}) ) ≤ Σ_{i=1}^K p_{i,t} g_{i,t} + η/8 = E_{I_t∼p_t}[ g_{I_t,t} ] + η/8.

Summing over t, and using Φ_0 = (log K)/η and Φ_n ≥ max_i G_{i,n}, we get

  E[ max_{i=1,...,K} G_{i,n} ] − E[ Σ_{t=1}^n g_{I_t,t} ] ≤ nη/8 + (log K)/η,

that is, R_n ≤ nη/8 + (log K)/η.
Adapting the exponentially weighted forecaster

- In the bandit setting, the cumulative gains G_{i,t−1}, i = 1, ..., K, are not observed. Trick: estimate them.
- Precisely, G_{i,t−1} is estimated by G̃_{i,t−1} = Σ_{s=1}^{t−1} g̃_{i,s} with

    g̃_{i,s} = 1 − ( (1 − g_{I_s,s}) / p_{I_s,s} ) 1_{I_s=i}.

- This estimate is unbiased with respect to the draw of I_s: Σ_{k=1}^K p_{k,s} [ 1 − ( (1 − g_{k,s}) / p_{k,s} ) 1_{k=i} ] = g_{i,s}.
- The forecaster then draws I_t according to

    p_{i,t} = P(I_t = i) = exp(η G̃_{i,t−1}) / Σ_{k=1}^K exp(η G̃_{k,t−1})

  (a code sketch of this policy follows).
- For this policy, R_n ≤ nKη/2 + (log K)/η. In particular, for η = √( 2 log K / (nK) ), we have R_n ≤ √( 2 n K log K ).
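A sketch of this estimated-gain forecaster (Python, illustrative; this kind of estimated-gain exponential forecaster is commonly referred to as an Exp3-type policy). Each round, only the gain of the pulled arm is used.

```python
import math, random

def exp_weights_bandit(gain_vectors, K, eta):
    """Bandit setting: only g_{I_t,t} is used; cumulative gains are replaced by the
    estimates G_tilde built from g_tilde as on the slide."""
    G = [0.0] * K                                    # estimated cumulative gains G_tilde
    total = 0.0
    for g_t in gain_vectors:                         # the forecaster only looks at g_t[I_t]
        m = max(G)
        w = [math.exp(eta * (Gi - m)) for Gi in G]
        s = sum(w)
        p = [wi / s for wi in w]
        i = random.choices(range(K), weights=p)[0]   # draw I_t ~ p_t
        total += g_t[i]
        # g_tilde_{j,t} = 1 - (1 - g_{I_t,t}) / p_{I_t,t} * 1{j = I_t}
        for j in range(K):
            G[j] += 1.0 - ((1.0 - g_t[i]) / p[i] if j == i else 0.0)
    return total
```

With η = √(2 log K / (nK)) this matches the tuning giving R_n ≤ √(2nK log K).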
Implicitly Normalized Forecaster (Audibert, Bubeck, 2010)

Let ψ : R*₋ → R*₊ be increasing, convex, twice continuously differentiable, and such that [1/K, 1] ⊂ ψ(R*₋).

For each round t = 1, 2, ...:

- draw I_t ∼ p_t and update the estimated cumulative gains G̃_{i,t};
- compute p_{t+1} = (p_{1,t+1}, ..., p_{K,t+1}), where

    p_{i,t+1} = ψ( G̃_{i,t} − C_t )

  and C_t is the unique real number such that Σ_{i=1}^K p_{i,t+1} = 1 (a bisection sketch of this normalization step follows).
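A sketch of the implicit normalization step: given the estimated cumulative gains, find the unique C_t such that Σ_i ψ(G̃_{i,t} − C_t) = 1 by bisection. Here it is demonstrated with ψ(x) = exp(ηx), which recovers the exponential weights; the choice of ψ and the bisection bounds are illustrative assumptions (Python).

```python
import math

def inf_probabilities(G, psi, iters=60):
    """Implicitly Normalized Forecaster step: p_{i,t+1} = psi(G_i - C_t), with C_t the
    unique real such that the p's sum to 1. psi must be increasing on the negative
    reals, blow up near 0-, and tend to 0 at -infinity."""
    def total(C):
        return sum(psi(g - C) for g in G)

    lo = max(G) + 1e-12            # keep every argument G_i - C strictly negative
    hi = lo + 1.0
    while total(hi) > 1.0:         # expand until the (decreasing) sum drops below 1
        hi = lo + 2 * (hi - lo)
    for _ in range(iters):         # bisection on C: total(C) is decreasing in C
        mid = (lo + hi) / 2
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    C = (lo + hi) / 2
    return [psi(g - C) for g in G]

# With psi(x) = exp(eta * x), this recovers the exponentially weighted forecaster:
eta = 0.1
p = inf_probabilities([3.0, 1.0, 2.5], lambda x: math.exp(eta * x))
```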
Minimax policy

- ψ(x) = exp(ηx) with η > 0: this corresponds exactly to the exponentially weighted forecaster.
- ψ(x) = (−ηx)^{−q} with q > 1 and η > 0: this is a new policy which is minimax optimal: for q = 2 and η = √(2n), we have

    R_n ≤ 2 √( 2 n K ),

  while for any strategy, we have

    sup R_n ≥ (1/20) √( n K ).
Outline
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions