Multi-armed bandits for fun and profit


Transcript of Multi-armed bandits for fun and profit

Page 1

MULTI-ARMED BANDITS FOR FUN AND PROFIT

JANANI SRIRAM, MAD STREET DEN

Page 2

MULTI-ARMED BANDITS

OVERVIEW

▸ Overview - Background, Introduction and Formulation

▸ Optimality - Gittins Index

▸ Optimization Strategies

▸ Epsilon greedy

▸ Upper Confidence Bound

▸ Boltzmann Exploration

▸ Bayesian Bandits

Page 3

OVERVIEW

EXPLORATION VS. EXPLOITATION

▸ Tradeoff between the necessity to try out all arms and the necessity to minimize the total regret suffered due to sub-optimal arms.

▸ The agent can gain knowledge about the environment only by pulling an arm.

▸ But by pulling a bad arm it suffers some regret.

▸ If an algorithm explores forever or exploits forever, it will have linear total regret.

▸ Usage Scenarios

▸ Clinical Trials

▸ A/B Testing online ads*

▸ Restaurant selection

▸ Feynman’s restaurant problem

V^*(s) = R(s) + \max_a \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s')

https://support.google.com/analytics/answer/2844870?hl=en

Page 4

OVERVIEW

MARKOV DECISION PROCESSES

▸ Sequential decision-making process with the stationary Markov property, defined by the tuple (S, A, P, R, \gamma):

▸ States: S = \{s_1, s_2, \ldots, s_n\}

▸ Actions: A = \{a_1, a_2, \ldots, a_n\}

▸ Transition model: P^a_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)

▸ Reward: R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]

▸ Discount factor: \gamma \in [0, 1]

* Reinforcement Learning: An Introduction, Sutton and Barto [1998]
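The Bellman optimality equation from the previous page can be solved for such an MDP by value iteration. Below is a minimal sketch (not from the talk); the array shapes and the toy MDP numbers are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' P[a, s, s'] * V(s')."""
    V = np.zeros(R.shape[0])
    while True:
        Q = P @ V                        # Q[a, s] = sum_s' P[a, s, s'] * V[s']
        V_new = R + gamma * Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a=0, s, s']
              [[0.5, 0.5], [0.0, 1.0]]])    # P[a=1, s, s']
R = np.array([1.0, 0.0])
print(value_iteration(P, R))
```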

Page 5

OVERVIEW

WHAT ARE BANDITS?

▸ Originally described by Robbins [1952]

▸ A gambler is faced with K slot machines each with an unknown distribution of rewards. The goal is to maximize cumulative rewards over a finite number of trials (horizon T).

▸ A Bernoulli bandit is a special case of MAB that has a Bernoulli distributed reward.

▸ Stochastic MABs - Each arm k is associated with an unknown reward distribution \nu_k on [0, 1]; rewards are drawn i.i.d. from \nu_k.

▸ Adversarial Bandits - Rewards are generated by an adversary.
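As a concrete anchor for the strategies that follow, here is a minimal sketch (not from the slides) of a stochastic Bernoulli bandit environment; the class name and the arm means are illustrative assumptions, and the later sketches reuse this class.

```python
import numpy as np

class BernoulliBandit:
    """Stochastic K-armed bandit: arm k pays 1 with probability means[k], else 0."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # hidden from the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        return float(self.rng.random() < self.means[k])

bandit = BernoulliBandit([0.1, 0.5, 0.7])       # arm 2 is optimal
print([bandit.pull(2) for _ in range(5)])
```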

Page 6

OVERVIEW

BACKGROUND

▸ Notation:

▸ Trials: t = 1, 2, \ldots, T

▸ Choice at trial t: i_t \in \{1, 2, \ldots, K\}

▸ Reward for the chosen arm: r_{i_t} \in \mathbb{R}

▸ Optimal arm: i^* = \arg\max_{i=1,\ldots,K} \mu_i, \quad \mu^* = \max_{i=1,\ldots,K} \mu_i, \quad \Delta_i = \mu^* - \mu_i

▸ T_j(T) is the number of times arm j is selected

▸ Goal: To maximize the total reward \sum_{t=1}^{T} r_{i_t}

▸ Or minimize the total expected regret (optimal minus obtained reward):

T\mu^* - \sum_{t=1}^{T} \mathbb{E}[\mu_{i_t}] = \sum_{j=1}^{K} \Delta_j \, \mathbb{E}[T_j(T)]

▸ Lai and Robbins [1985] showed that the optimal regret is O(\log T)
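To make the regret decomposition concrete, a small sketch of my own under the notation above: given the true means \mu_j (known only to us, not the agent) and the counts T_j(T), the total expected regret is \sum_j \Delta_j \mathbb{E}[T_j(T)].

```python
import numpy as np

def expected_regret(means, counts):
    """Sum_j Delta_j * T_j(T), with Delta_j = mu* - mu_j."""
    means, counts = np.asarray(means), np.asarray(counts)
    gaps = means.max() - means          # Delta_j
    return float(gaps @ counts)

# E.g. 100 trials: 80 pulls of the better arm, 20 of the worse one.
print(expected_regret([0.5, 0.7], [20, 80]))    # 20 * 0.2 = 4.0
```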

Page 7

STRATEGIES

EPSILON GREEDY

▸ Set initial empirical means \hat{\mu}_i(0) for each arm i

▸ At time t, with probability 1 - \epsilon_t play the arm with the highest empirical mean, and with probability \epsilon_t play a random arm
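A minimal epsilon-greedy sketch, reusing the BernoulliBandit class from the earlier sketch; using a fixed \epsilon rather than a schedule \epsilon_t is a simplifying assumption (a common schedule is \epsilon_t \propto 1/t).

```python
import numpy as np

def epsilon_greedy(bandit, K, T, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    means = np.zeros(K)                      # empirical means, mu_hat_i(0) = 0
    counts = np.zeros(K)
    for t in range(T):
        if rng.random() < epsilon:
            k = int(rng.integers(K))         # explore: random arm
        else:
            k = int(np.argmax(means))        # exploit: highest empirical mean
        r = bandit.pull(k)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]   # incremental mean update
    return means, counts

bandit = BernoulliBandit([0.1, 0.5, 0.7])
print(epsilon_greedy(bandit, K=3, T=1000))
```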

BOLTZMANN EXPLORATION

▸ At trial t, arm k is selected with probability given by the Gibbs distribution

p_k = \frac{e^{\hat{\mu}_k(t)/\tau}}{\sum_{j=1}^{K} e^{\hat{\mu}_j(t)/\tau}}, \quad k = 1, \ldots, K

▸ \tau is a temperature parameter controlling the randomness of the choice
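A sketch of the Gibbs/softmax choice rule above (my own); subtracting the max before exponentiating is a standard numerical-stability trick, not something on the slide.

```python
import numpy as np

def boltzmann_pick(means, tau, rng):
    """Sample an arm with p_k proportional to exp(mu_hat_k / tau)."""
    logits = np.asarray(means) / tau
    logits -= logits.max()              # stabilize exp() against overflow
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
print(boltzmann_pick([0.2, 0.5, 0.4], tau=0.1, rng=rng))   # low tau -> near-greedy
print(boltzmann_pick([0.2, 0.5, 0.4], tau=10.0, rng=rng))  # high tau -> near-uniform
```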

Page 8

STRATEGIES

UPPER CONFIDENCE BOUND

▸ ‘Optimism in the face of uncertainty’.

▸ Chernoff-Hoeffding bound on the deviation from the mean: P(\bar{Y} \geq \mu + a) \leq e^{-2na^2}

▸ Algorithm:

▸ Setup: Initialize the empirical mean payoff \hat{\mu}_i for each arm i

▸ In each round t, pick the arm

j(t) = \arg\max_i \left( \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}} \right)

where \hat{\mu}_i is the knowledge (exploitation) term and \sqrt{2 \ln t / n_i} is the uncertainty (exploration) bonus

▸ Matches the optimal lower bound on regret, O(\log n)

* Using Confidence Bounds for Exploitation-Exploration Trade-offs, Auer, Cesa-Bianchi & Fischer [2002]
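A minimal UCB1 sketch following the argmax rule above, again reusing the BernoulliBandit environment; playing each arm once first avoids division by zero in the bonus term.

```python
import numpy as np

def ucb1(bandit, K, T):
    means = np.zeros(K)
    counts = np.zeros(K)
    for k in range(K):                        # initialize: play each arm once
        means[k] = bandit.pull(k)
        counts[k] = 1.0
    for t in range(K + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)      # uncertainty term
        k = int(np.argmax(means + bonus))              # knowledge + uncertainty
        r = bandit.pull(k)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]
    return means, counts

bandit = BernoulliBandit([0.1, 0.5, 0.7])
print(ucb1(bandit, K=3, T=1000))
```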

Page 9

STRATEGIES

BAYESIAN BANDITS

▸ Assume a prior distribution P(\theta) on the parameters

▸ The likelihood of the reward is given by P(r \mid a, \theta)

▸ Sample from the posterior distribution and update the priors

▸ For bandits with Bernoulli rewards, start with the standard conjugate prior - the Beta distribution. The posterior is also a Beta distribution:

f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1}, the pdf of a Beta distribution with parameters \alpha > 0, \beta > 0

[Figure: Beta pdfs with \alpha = \beta = 2 (red), \alpha = \beta = 12 (green), \alpha = \beta = 102 (blue)]
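The conjugacy claim is easy to verify in code: a Beta(\alpha, \beta) prior plus a Bernoulli observation r gives a Beta(\alpha + r, \beta + 1 - r) posterior. A small sketch (the observed rewards are made up):

```python
from scipy.stats import beta

def update(a, b, r):
    """Beta(a, b) prior + Bernoulli reward r in {0, 1} -> Beta posterior."""
    return a + r, b + (1 - r)

a, b = 2, 2                        # the 'red' prior above
for r in [1, 1, 0, 1]:             # hypothetical observed rewards
    a, b = update(a, b, r)
print(a, b, beta(a, b).mean())     # Beta(5, 3); posterior mean a/(a+b) = 0.625
```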

Page 10

STRATEGIES

GITTINS INDEX (INFORMATION STATE SEARCH)

▸ Goal: to maximize the total expected discounted reward

▸ Reduces to solving a stopping problem

▸ Bayesian adaptive MDP: Assume a prior on the reward distribution and geometric discounting. Each state transition is a Bayes model update. For Bernoulli bandits this means a Beta prior:

\pi(r \mid \alpha, \beta) = \frac{r^{\alpha - 1} (1 - r)^{\beta - 1}}{B(\alpha, \beta)}, where B is the Beta function

▸ Optimal policy: Select the arm that maximizes the Gittins dynamic allocation index, a normalized sum of time-discounted rewards. For arm i,

v_i = \max_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau} \gamma^t \, r_{i_t}(x_{i_t})\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau} \gamma^t\right]}

where \gamma is the reward discount parameter and \tau is a stopping time.
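Computing the Gittins index exactly is hard; one textbook route is to calibrate the risky arm against a 'standard' arm of known payoff \lambda and bisect for the indifference point. The sketch below is my own rough approximation for a Beta-Bernoulli arm, with the Bayes-adaptive value function truncated at a finite horizon; the horizon and tolerances are arbitrary assumptions.

```python
def gittins_index(a0, b0, gamma=0.9, horizon=60, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with Beta(a0, b0) posterior."""

    def root_value(lam):
        retire = lam / (1.0 - gamma)                 # retire to lam forever
        # V[k] at depth d: value after k successes in d pulls so far.
        V = [retire] * (horizon + 1)                 # truncate: retire at horizon
        for d in range(horizon - 1, -1, -1):         # backward induction
            V_next, V = V, []
            for k in range(d + 1):
                a, b = a0 + k, b0 + (d - k)
                p = a / (a + b)                      # posterior mean of the arm
                pull = p * (1.0 + gamma * V_next[k + 1]) \
                     + (1.0 - p) * gamma * V_next[k]
                V.append(max(retire, pull))          # retire or keep pulling
        return V[0]

    lo, hi = 0.0, 1.0                # a Bernoulli arm's index lies in [0, 1]
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if root_value(lam) > lam / (1.0 - gamma) + 1e-12:
            lo = lam                 # pulling beats retiring: index > lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

print(gittins_index(1, 1))   # uniform prior: noticeably above the mean 0.5
```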

Page 11

STRATEGIES

THOMPSON SAMPLING (PROBABILITY MATCHING)

▸ Start with a prior belief on the parameters of the reward distribution

▸ Play each arm according to the probability that it is optimal: sample \theta_t from the posterior and choose

a_t = \arg\max_a \mathbb{E}(r \mid a, \theta_t)

▸ After every trial, observe a reward and do a Bayesian update

▸ Shown to have logarithmic expected regret [Agrawal 2012]
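A minimal Beta-Bernoulli Thompson sampling sketch (mine, not the speaker's), again reusing the BernoulliBandit environment: sample one \theta_k per arm from its Beta posterior, play the argmax, and do the conjugate update on the observed reward.

```python
import numpy as np

def thompson_bernoulli(bandit, K, T, seed=0):
    rng = np.random.default_rng(seed)
    a = np.ones(K)                     # Beta(1, 1) uniform prior per arm
    b = np.ones(K)
    for t in range(T):
        theta = rng.beta(a, b)         # one posterior sample per arm
        k = int(np.argmax(theta))      # probability matching
        r = bandit.pull(k)
        a[k] += r                      # conjugate Beta update
        b[k] += 1 - r
    return a, b

bandit = BernoulliBandit([0.1, 0.5, 0.7])
a, b = thompson_bernoulli(bandit, K=3, T=1000)
print(a / (a + b))                     # posterior means; arm 2 should dominate
```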

Page 12

STRATEGIES

THOMPSON SAMPLING

▸ Simulation from http://bit.ly/2fqR57P

Page 13

CONCLUSION

REFERENCES

▸ D. Berry and B. Fristedt. Bandit Problems. Chapman and Hall, 1985.

▸ J. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.

▸ T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. 1985.

▸ Shipra Agrawal and Navin Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. 2012.

▸ Volodymyr Kuleshov and Doina Precup. Algorithms for the Multi-armed Bandit Problem.

▸ P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multi-armed Bandit Problem. 2002.

Page 14

THANK YOU