RL Lecture 5: Monte Carlo Methods
Transcript of RL Lecture 5: Monte Carlo Methods
RL Lecture 5: Monte Carlo Methods
❐ Monte Carlo methods learn from complete sample returns
  – Only defined for episodic tasks
❐ Monte Carlo methods learn directly from experience
  – On-line: No model necessary and still attains optimality
  – Simulated: No need for a full model
Monte Carlo Policy Evaluation
❐ Goal: learn Vπ(s)
❐ Given: some number of episodes under π which contain s
❐ Idea: Average returns observed after visits to s
❐ Every-Visit MC: average returns for every time s is visited in an episode
❐ First-visit MC: average returns only for first time s is visited in an episode
❐ Both converge asymptotically
First-visit Monte Carlo policy evaluation
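A minimal Python sketch of first-visit MC policy evaluation as described above, assuming an undiscounted episodic task and a generate_episode(policy) helper (not from the slide) that plays one episode under the policy and returns its (state, reward) pairs:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V^pi by averaging the returns that follow each first visit to s.

    `generate_episode(policy)` is an assumed helper returning
    [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)] for one episode under `policy`.
    """
    returns_sum = defaultdict(float)   # total return observed after first visits to s
    returns_count = defaultdict(int)   # number of first visits to s
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the return G_t.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:          # first visit to s in this episode
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```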
Blackjack example
❐ Object: Have your card sum be greater than the dealer's without exceeding 21.
❐ States (200 of them):
  – current sum (12-21)
  – dealer's showing card (ace-10)
  – do I have a usable ace?
❐ Reward: +1 for winning, 0 for a draw, -1 for losing
❐ Actions: stick (stop receiving cards), hit (receive another card)
❐ Policy: Stick if my sum is 20 or 21, else hit
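As a hedged illustration of the state encoding and the fixed evaluation policy above (the tuple layout and the STICK/HIT constants are assumed names, not from the slide):

```python
STICK, HIT = 0, 1

def stick_on_20_policy(state):
    """Evaluation policy from the slide: stick iff the player's sum is 20 or 21."""
    player_sum, dealer_showing, usable_ace = state  # e.g. (16, 10, False)
    return STICK if player_sum >= 20 else HIT
```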
Blackjack value functions
(Figure: state-value function for the stick-on-20-or-21 policy, estimated by Monte Carlo after 10,000 and after 500,000 episodes; separate surfaces for a usable ace and no usable ace, plotted over player sum 12-21 and dealer showing A-10, with values between −1 and +1.)
Backup diagram for Monte Carlo
❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states
(Backup diagram: a single sampled episode from the root state down to the terminal state.)
The Power of Monte Carlo
e.g., Elastic Membrane (Dirichlet Problem)
How do we compute the shape of the membrane or bubble?
Two Approaches
  – Relaxation
  – Kakutani's algorithm, 1945
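Relaxation repeatedly replaces each interior height with the average of its neighbours (the DP-like approach); Kakutani's algorithm instead estimates the height at a point as the average boundary height where random walks started from that point first exit (the MC-like approach). A minimal sketch of the latter on a grid, with is_boundary and boundary_height assumed problem-specific helpers:

```python
import random

def membrane_height(start, is_boundary, boundary_height, num_walks=1000):
    """Kakutani-style Monte Carlo estimate of the membrane height at `start`:
    average the wire height at the point where each random walk first leaves
    the interior. `is_boundary` and `boundary_height` are assumed helpers
    over integer grid points (x, y)."""
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        while not is_boundary((x, y)):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_height((x, y))
    return total / num_walks
```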
Monte Carlo Estimation of Action Values (Q)
❐ Monte Carlo is most useful when a model is not available
  – We want to learn Q*
❐ Qπ(s,a) - average return starting from state s and action a following π
❐ Also converges asymptotically if every state-action pair is visited
❐ Exploring starts: Every state-action pair has a non-zero probability of being the starting pair
Monte Carlo Control
❐ MC policy iteration: Policy evaluation using MC methods followed by policy improvement
❐ Policy improvement step: greedify with respect to value (or action-value) function
(Diagram: generalized policy iteration between evaluation, Q → Qπ, and improvement, π → greedy(Q).)
Convergence of MC Control
❐ Policy improvement theorem tells us:
    Qπk(s, πk+1(s)) = Qπk(s, argmaxa Qπk(s,a))
                    = maxa Qπk(s,a)
                    ≥ Qπk(s, πk(s))
                    = Vπk(s)
❐ This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐ To solve the latter:
  – update only to a given level of performance
  – alternate between evaluation and improvement per episode
Monte Carlo Exploring Starts
Fixed point is optimal policy π*
Proof is open question
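A hedged Python sketch of Monte Carlo control with exploring starts, following the description above (evaluate Q by averaging returns, greedify after each episode); generate_episode_es is an assumed helper that starts each episode from a random state-action pair:

```python
from collections import defaultdict
import random

def mc_exploring_starts(generate_episode_es, actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit flavour).

    `generate_episode_es(policy)` is an assumed helper: it picks a random
    starting state-action pair, then follows `policy`, and returns
    [(s_0, a_0, r_1), (s_1, a_1, r_2), ...].
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = defaultdict(lambda: random.choice(actions))   # arbitrary initial policy

    for _ in range(num_episodes):
        episode = generate_episode_es(lambda s: policy[s])
        visited = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:              # first visit to (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Improvement step: greedify at s with respect to Q.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```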
Blackjack example continued
❐ Exploring starts
❐ Initial policy as described before
(Figure: the optimal policy π* found by Monte Carlo ES, shown as STICK/HIT regions over player sum 11-21 versus dealer showing A-10, separately for a usable ace and no usable ace, together with the corresponding optimal value function V*, ranging from −1 to +1.)
On-policy Monte Carlo Control
❐ On-policy: learn about the policy currently being executed
❐ How do we get rid of exploring starts?
  – Need soft policies: π(s,a) > 0 for all s and a
  – e.g., an ε-greedy policy (a soft policy):
      π(s,a) = 1 − ε + ε/|A(s)|   for the greedy action
      π(s,a) = ε/|A(s)|           for each non-max action
❐ Similar to GPI: move policy towards the greedy policy (i.e., ε-greedy)
❐ Converges to the best ε-soft policy
On-policy MC Control
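The key change from Monte Carlo ES is the improvement step, which makes the policy ε-greedy with respect to Q rather than fully greedy. A small sketch of ε-greedy action selection matching the probabilities above (the dict layout and tie-breaking are assumed choices):

```python
import random

def epsilon_greedy_action(Q_s, epsilon):
    """Pick an action from one state's action values `Q_s` (a dict action -> value)
    with the epsilon-greedy probabilities from the slide:
    1 - epsilon + epsilon/|A(s)| for the greedy action, epsilon/|A(s)| otherwise."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))   # explore: any action uniformly
    return max(Q_s, key=Q_s.get)          # exploit: greedy action
```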
Off-policy Monte Carlo control
❐ Behavior policy generates behavior in environment❐ Estimation policy is policy being learned about❐ Average returns from behavior policy by probability their
probabilities in the estimation policy
Learning about π while following π′
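A return generated while following the behavior policy π′ is weighted by the relative probability of its action sequence under the estimation policy π versus π′. A minimal sketch of that weight, assuming both policies are callables returning the probability of an action in a state:

```python
def importance_weight(episode, target_pi, behavior_pi):
    """Relative probability of the episode's actions under the estimation
    (target) policy vs. the behavior policy:
        w = prod_t  target_pi(s_t, a_t) / behavior_pi(s_t, a_t)
    `episode` is assumed to be a list of (s, a, r) triples."""
    w = 1.0
    for s, a, _ in episode:
        w *= target_pi(s, a) / behavior_pi(s, a)
    return w
```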
Off-policy MC control
Incremental Implementation
❐ MC can be implemented incrementally
  – saves memory
❐ Compute the weighted average of each return
Non-incremental:
    Vn = ( Σk=1..n wk Rk ) / ( Σk=1..n wk )

Incremental equivalent:
    Vn+1 = Vn + (wn+1 / Wn+1) [Rn+1 − Vn]
    Wn+1 = Wn + wn+1,   with V0 = W0 = 0
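A small sketch of the incremental update above, keeping only the running estimate V and the cumulative weight W instead of storing every return (names assumed):

```python
class IncrementalWeightedAverage:
    """Maintains V_n = (sum_k w_k R_k) / (sum_k w_k) without storing the returns."""

    def __init__(self):
        self.V = 0.0   # current weighted-average estimate V_n
        self.W = 0.0   # cumulative weight W_n, starts at 0

    def update(self, R, w=1.0):
        # W_{n+1} = W_n + w_{n+1};  V_{n+1} = V_n + (w_{n+1}/W_{n+1}) [R_{n+1} - V_n]
        self.W += w
        self.V += (w / self.W) * (R - self.V)
        return self.V
```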
Racetrack Exercise
(Figure: two example racetrack grids, each with a starting line and a finish line.)
❐ States: grid squares, velocity horizontal and vertical
❐ Rewards: -1 on track, -5 off track
❐ Actions: +1, -1, 0 to velocity
❐ 0 < Velocity < 5
❐ Stochastic: 50% of the time it moves 1 extra square up or right
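A hedged sketch of one step of these dynamics (the grid/velocity representation and the on_track helper are assumptions, not part of the slide):

```python
import random

def racetrack_step(pos, vel, accel, on_track):
    """One step of the racetrack exercise as described above: the action adds
    -1, 0 or +1 to each velocity component, each component is kept strictly
    between 0 and 5, the reward is -1 on the track and -5 off it, and 50% of
    the time the car moves one extra square up or right."""
    vx, vy = (max(1, min(4, v + a)) for v, a in zip(vel, accel))
    x, y = pos[0] + vx, pos[1] + vy
    if random.random() < 0.5:                        # stochastic extra move
        dx, dy = random.choice([(1, 0), (0, 1)])     # one extra square right or up
        x, y = x + dx, y + dy
    reward = -1 if on_track((x, y)) else -5
    return (x, y), (vx, vy), reward
```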
Summary
❐ MC has several advantages over DP:
  – Can learn directly from interaction with the environment
  – No need for full models
  – No need to learn about ALL states
  – Less harmed by violations of the Markov property (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  – exploring starts, soft policies
❐ No bootstrapping (as opposed to DP)