Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a...
Transcript of Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a...
![Page 1: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/1.jpg)
Class 3: Multi-Arm Bandit
Sutton and Barto, Chapter 2
295, class 2 1
Sutton slides and Silver
![Page 2: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/2.jpg)
Multi-Arm BanditsSutton and Barto, Chapter 2
The simplest reinforcement learning
problem
![Page 3: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/3.jpg)
The Exploration/Exploitation Dilemma
295, class 2 3
Online decision-making involves a fundamental choice:
• Exploitation Make the best decision given current information
• Exploration Gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions
![Page 4: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/4.jpg)
Examples
295, class 2 4
Restaurant Selection
Exploitation Go to your favourite restaurant
Exploration Try a new restaurant
Online Banner AdvertisementsExploitation Show the most successful advert
Exploration Show a different advert
Oil Drilling
Exploitation Drill at the best known location
Exploration Drill at a new location
Game PlayingExploitation Play the move you believe is best
Exploration Play an experimental move
![Page 5: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/5.jpg)
You are the algorithm! (bandit1)
![Page 6: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/6.jpg)
The k-armed Bandit Problem
• On each of a sequence of time steps,t=1,2,3,…,
you choose an action At from k possibilities, and receive a real-
valued reward Rt
• These true values are unknown. The distribution is unknown
• Nevertheless, you must maximize your total reward
• You must both try actions to learn their values (explore), and
prefer those that appear best (exploit)
true values
![Page 7: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/7.jpg)
The Exploration/Exploitation Dilemma
![Page 8: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/8.jpg)
Regret
295, class 2 8
The action-value is the mean reward for action a,
• q*(a) = E [r |a]
The optimal value V ∗ is
• V ∗ = Q(a∗) = max q*(a)a∈A
The regret is the opportunity loss for one step
• lt = E [V ∗ − Q(at )]
The total regret is the total opportunity loss
![Page 9: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/9.jpg)
Multi-Armed Bandits
Regret
![Page 10: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/10.jpg)
Multi-Armed Bandits
Regret
0 11 12 13 14 15 16 17 1819
Total regret
ϵ-greedygreedy
1 2 3 4 5 6 7 8 9 10
Time-steps
decaying ϵ-greedy
If an algorithm forever explores it will have linear total regret
If an algorithm never explores it will have linear total regret Is
it possible to achieve sublinear total regret?
![Page 11: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/11.jpg)
Complexity of regret
295, class 2 11
![Page 12: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/12.jpg)
Overview
• Action-value methods
– Epsilon-greedy strategy
– Incremental implementation
– Stationary vs. non-stationary environment
– Optimistic initial values
• UCB action selection
• Gradient bandit algorithms
• Associative search (contextual bandits)
295, class 2 12
![Page 13: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/13.jpg)
Basics
• Maximize total reward collected
– vs learn (optimal) policy (RL)
• Episode is one step
• Complex function of
– True value
– Uncertainty
– Number of time steps
– Stationary vs non-stationary?
295, class 2 13
![Page 14: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/14.jpg)
Action-Value Methods
![Page 15: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/15.jpg)
-Greedy ActionSelection
• In greedy action selection, you always exploit
• In 𝜀-greedy, you are usually greedy, but with probability 𝜀 you
instead pick an action at random (possibly the greedy action
again)
• This is perhaps the simplest way to balance exploration and
exploitation
![Page 16: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/16.jpg)
A simple bandit algorithm
![Page 17: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/17.jpg)
1 2 3 4 7 8 9 10
0
1
2
3
-3
-2
-1
q⇤(1)
q⇤(2)
q⇤(3)
q⇤(4)
q⇤(5)
q⇤(6)
q⇤(7)
q⇤(8)
q⇤(9)
q⇤(10)Reward
distribution
5 6
Action
-4
4
One Bandit Taskfrom
The 10-armedTestbed
Run for 1000 steps
Repeat the whole thing 2000 times with different bandit tasks
Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q(a) of each of the tenactions was selected according to a normal distribution with mean zero and unit variance, and then the actualrewards were selected according to a mean q(a) unit variance normal distribution, as suggested by these graydistributions.
![Page 18: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/18.jpg)
-Greedy Methods on the 10-ArmedTestbed
![Page 19: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/19.jpg)
Averaging ⟶ learning rule
• To simplify notation, let us focus on one action
• We consider only its rewards, and its estimate after n+1 rewards:
• How can we do this incrementally (without storing all the rewards)?
• Could store a running sum and count (and divide), or equivalently:
.Qn =
R1 + R2 + ···+ Rn-1
n - 1
![Page 20: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/20.jpg)
Derivation of incremental update
![Page 21: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/21.jpg)
Tracking a Non-stationary Problem
![Page 22: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/22.jpg)
Standard stochastic approximation convergence conditions
![Page 23: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/23.jpg)
Optimistic InitialValues
So far we have used
• All methods so far depend on Q1(a), i.e.,they are biased.
Q1(a) = 0
• Suppose we initialize the action values optimistically (Q1(a) = 5 ), e.g., on
the 10-armed testbed (with alpha= 0.1 )
20%
0%
40%
60%
80%
100%
%Optimal action
0 200 400 600 800 1000
Plays
0
realistic, -greedy
0
Steps
Q1 = 0, E = 0.1
optimistic, greedy
Q1 = 5, E= 0
![Page 24: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/24.jpg)
Upper Confidence Bound (UCB) action selection• A clever way of reducing exploration over time
• Focus on actions whose estimate has large degree of uncertainty
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound
UCB c = 2
E-greedy E = 0.1
Average reward
Steps
![Page 25: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/25.jpg)
Theorem
t →∞lim Lt ≤ 8 log t
The UCB algorithm achieves logarithmic asymptotic total regret
aa|∆ >0
∆ a
Complexity of UCB Algorithm
![Page 26: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/26.jpg)
Gradient-Bandit Algorithms• Let H t (a ) be a learned preference for taking action a
%
Optimal
action
α =0.1
100%
80%
60%
40%
20%
0%
α =0.4
α =0.1
α =0.4
without baseline
with baseline
0 250 500
Steps
750 1000
![Page 27: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/27.jpg)
Derivation of gradient-bandit algorithm
![Page 28: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/28.jpg)
![Page 29: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/29.jpg)
![Page 30: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/30.jpg)
![Page 31: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/31.jpg)
Summary Comparison of Bandit Algorithms
![Page 32: Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2The k-armed Bandit Problem • On each of a sequence of time steps,t=1,2,3,…, you choose an action At from k possibilities, and](https://reader034.fdocuments.us/reader034/viewer/2022051805/5ff27e9e6badac40722c2ce7/html5/thumbnails/32.jpg)
Conclusions
• These are all simple methods
• but they are complicated enough—we will build on them
• we should understand them completely
• there are still open questions
• Our first algorithms that learn from evaluative feedback
• and thus must balance exploration and exploitation
• Our first algorithms that appear to have a goal
—that learn to maximize reward by trial and error