Lecture 3: Model-Free Control
Xi Chen
Stern School of Business, New York University
Slides are based on David Silver’s RL lecture notes and Lihong Li’s RL Tutorial
Xi Chen (NYU) Model-Free Control 1 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo Control
   Generalised Policy Iteration
   Exploration
   Blackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
   Q-Learning
Xi Chen (NYU) Model-Free Control 2 / 38
Model-Free Reinforcement Learning
Last lecture:
Model-free prediction: estimate the value function of an unknown MDP
This lecture:
Model-free control: optimize the value function of an unknown MDP
Xi Chen (NYU) Model-Free Control 4 / 38
Uses of Model-Free Control
Many examples that can be modelled as MDPs
Elevator
Parallel Parking
Ship Steering
Bioreactor
Helicopter
Aeroplane Logistics
Robocup Soccer
Quake
Portfolio management
Protein Folding
Robot walking
Game of Go
For most of these problems, either:
MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples
Model-free control can optimize the MDP for these problems.
Xi Chen (NYU) Model-Free Control 5 / 38
On and Off-Policy Learning
On-policy learning
“Learn on the job”
Learn about policy π from experience sampled from π
Use the currently learned policy π to select actions, which improves stability
Off-policy learning
“Look over someone’s shoulder”
Learn about policy π from experience sampled from a different policy µ
Greater data efficiency: the same data can be re-used to evaluate/optimize
Lower costs/risks: e.g., RL in autonomous driving
Xi Chen (NYU) Model-Free Control 6 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo Control
   Generalised Policy Iteration
   Exploration
   Blackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
   Q-Learning
Xi Chen (NYU) Model-Free Control 7 / 38
Generalised Policy Iteration (Refresher)
Policy evaluation: estimate vπ, e.g., iterative policy evaluation
Policy improvement: generate π′ ≥ π, e.g., greedy policy improvement
Xi Chen (NYU) Model-Free Control 8 / 38
Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, V = vπ?
Policy improvement: greedy policy improvement?
Xi Chen (NYU) Model-Free Control 9 / 38
Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over V(s) requires a model of the MDP (reward and transition):

π′(s) = argmax_{a∈A} [ R(s, a) + γ ∑_{s′} Pr[s′|s, a] vπ(s′) ]

Greedy policy improvement over Q(s, a) is model-free:

π′(s) = argmax_{a∈A} Q(s, a)
Xi Chen (NYU) Model-Free Control 10 / 38
Generalised Policy Iteration with Action-Value Function
Policy evaluation: Monte-Carlo policy evaluation, Q = qπ
Policy improvement: greedy policy improvement? Being too greedy is bad
Xi Chen (NYU) Model-Free Control 11 / 38
Monte-Carlo policy evaluation for Q = qπ
Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode:

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1/N(St, At)) (Gt − Q(St, At))
Xi Chen (NYU) Model-Free Control 12 / 38
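The incremental update above can be sketched in Python. The `sample_episode` callable and the `(state, action, reward)` triple format are assumptions of this sketch, not part of the slides:

```python
from collections import defaultdict

def mc_q_evaluation(sample_episode, policy, num_episodes, gamma=1.0):
    """Every-visit Monte-Carlo evaluation of Q = q_pi.

    sample_episode(policy) is assumed to return one episode generated by
    `policy`, as a list of (state, action, reward) triples.
    """
    N = defaultdict(int)    # visit counts N(S_t, A_t)
    Q = defaultdict(float)  # action-value estimates Q(S_t, A_t)
    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Compute returns G_t backwards from the end of the episode.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            # Incremental mean: Q <- Q + (1/N) * (G - Q)
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q, N
```

Computing G backwards avoids re-summing rewards for every time step.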
Example of Greedy Action Selection
There are two doors in front of you.
You open the left door and get reward 0: V(left) = 0
You open the right door and get reward +1: V(right) = +1
You open the right door and get reward +3: V(right) = +2
You open the right door and get reward +2: V(right) = +2
...
Are you sure you’ve chosen the best door?
Xi Chen (NYU) Model-Free Control 13 / 38
Exploration: ε-greedy
At every step, select a random action with small probability, and the current best guess otherwise:

at = { argmax_a Q(st, a)   with prob. 1 − ε
     { random action       with prob. ε
Larger ε, more exploration
Uniform random exploration when ε = 1
Very simple to implement; can be quite effective in practice
Often use a sequence of decaying ε values, reducing the amount of exploration over time.
We will discuss improvements over ε-greedy at the end of this section.
Xi Chen (NYU) Model-Free Control 14 / 38
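The selection rule above can be sketched as a small helper. The dict keyed by (state, action) is an assumption of this sketch:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick argmax_a Q(state, a) with prob. 1 - epsilon, else a random action.

    Q is assumed to be a dict keyed by (state, action) pairs; unseen pairs
    default to 0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

With epsilon = 1 this is uniform random exploration; with epsilon = 0 it is pure greedy selection.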
ε-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ with respect to qπ is animprovement, vπ′(s) ≥ vπ(s)
qπ(s, π′(s)) = ∑_{a∈A} π′(a|s) qπ(s, a)
            = (ε/m) ∑_{a∈A} qπ(s, a) + (1 − ε) max_{a∈A} qπ(s, a)
            ≥ (ε/m) ∑_{a∈A} qπ(s, a) + (1 − ε) ∑_{a∈A} ((π(a|s) − ε/m)/(1 − ε)) qπ(s, a)
            = ∑_{a∈A} π(a|s) qπ(s, a) = vπ(s)

where m = |A| is the number of actions. Therefore, by the policy improvement theorem, vπ′(s) ≥ vπ(s).
Xi Chen (NYU) Model-Free Control 15 / 38
Monte-Carlo Control
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, Q ≈ qπ
Policy improvement: ε-greedy policy improvement
Xi Chen (NYU) Model-Free Control 16 / 38
GLIE Monte-Carlo Control
Greedy in the Limit with Infinite Exploration (GLIE)
Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode:

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1/N(St, At)) (Gt − Q(St, At))

Improve the policy based on the new action-value function:

ε ← 1/k
π ← ε-greedy(Q)
Xi Chen (NYU) Model-Free Control 17 / 38
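Putting the evaluation step and the ε = 1/k improvement step together gives a minimal GLIE Monte-Carlo control loop. The `run_episode` environment interface is an assumption of this sketch, not part of the slides:

```python
import random
from collections import defaultdict

def glie_mc_control(run_episode, actions, num_episodes, gamma=1.0, rng=random):
    """GLIE Monte-Carlo control with the schedule eps_k = 1/k.

    run_episode(select_action) is assumed to play one episode, calling
    select_action(state) at each decision point, and to return a list of
    (state, action, reward) triples.
    """
    N = defaultdict(int)
    Q = defaultdict(float)

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: epsilon decays to zero

        def select_action(state):
            if rng.random() < eps:
                return rng.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        # Policy evaluation: incremental every-visit MC update of Q.
        episode = run_episode(select_action)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```

Policy improvement is implicit: each episode acts ε-greedily with respect to the latest Q.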
GLIE
Definition
All state-action pairs are explored infinitely many times,

lim_{k→∞} Nk(s, a) = ∞

The policy converges to a greedy policy,

lim_{k→∞} πk(a|s) = 1(a = argmax_{a′∈A} Qk(s, a′))

Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q∗(s, a)

For example, ε-greedy is GLIE if ε decays to zero as εk = 1/k
Xi Chen (NYU) Model-Free Control 18 / 38
Monte-Carlo Control in Blackjack
We store Q(s, a) internally, but present only V∗(s) = max_a Q∗(s, a).
No usable ace: HIT if the dealer shows a large value (HIT is also called TWIST)
Xi Chen (NYU) Model-Free Control 20 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy LearningQ-Learning
Xi Chen (NYU) Model-Free Control 21 / 38
MC vs. TD Control
Temporal-difference (TD) learning has several advantages overMonte-Carlo (MC)
Lower variance
Online
Incomplete sequences
Natural idea: use TD instead of MC in our control loop
Apply TD to Q(S, A)
Use ε-greedy policy improvement
Update every time-step
Xi Chen (NYU) Model-Free Control 22 / 38
Updating Action-Value Functions with Sarsa
Sarsa: state-action-reward-state-action
Q(S, A) ← Q(S, A) + α ( R + γ Q(S′, A′) − Q(S, A) )

where R + γ Q(S′, A′) is the TD target, and A′ is the action taken in S′.
Xi Chen (NYU) Model-Free Control 23 / 38
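The Sarsa backup above can be sketched as a one-step helper. The dict-of-(state, action) representation of Q is an assumption of this sketch:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa backup: Q(S,A) <- Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A)).

    Q is assumed to be a dict keyed by (state, action); unseen pairs
    default to 0.
    """
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
    return Q[(s, a)]
```

Note that a_next is the action actually taken in s_next, which is what makes Sarsa on-policy.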
On-Policy Control With Sarsa
Every time-step:
Policy evaluation: Sarsa (the update involves S, A, R, S′, A′), Q ≈ qπ
Policy improvement: ε-greedy policy improvement, always based on the current value of Q
Xi Chen (NYU) Model-Free Control 25 / 38
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function,Q(s, a)→ q∗(s, a), under the following conditions:
GLIE sequence of policies πt(a|s) (every state has to be visited)
Robbins-Monro sequence of step-sizes αt (e.g., αt = t^(−a) with a ∈ (1/2, 1]):

∑_{t=1}^∞ αt = ∞    ∑_{t=1}^∞ αt² < ∞
Xi Chen (NYU) Model-Free Control 26 / 38
Windy Gridworld Example
Reward = −1 per time-step until reaching goal
The player chooses standard moves from start (S) to goal (G). (Think: what about king’s moves?)
Each column is labelled with the number of cells the wind pushes the agent upward.
Xi Chen (NYU) Model-Free Control 27 / 38
Sarsa on the Windy Gridworld
At the beginning, a random walk takes about 2000 time steps to finish one episode (reaching G).
Later episodes take fewer time steps as the policy improves.
Xi Chen (NYU) Model-Free Control 28 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy LearningQ-Learning
Xi Chen (NYU) Model-Free Control 29 / 38
Off-Policy Learning
Evaluate target policy π(a|s) to compute vπ(s) or qπ(s, a)
While following behavior policy µ(a|s)
{S1,A1,R2, . . . ,ST} ∼ µ
Why is this important?
Learn from observing humans or other agents
Re-use experience generated from old policies
Learn about the optimal policy while following a behavior policy
Xi Chen (NYU) Model-Free Control 30 / 38
Q-Learning
We now consider off-policy learning of action-values Q(s, a)
The next action is chosen using the behaviour policy, At+1 ∼ µ(·|St), which generates St+1
But we take the target successor action A′ to be the one that maximizes the current Q(St+1, ·)
We allow both the behaviour and target policies to improve
And update Q(St, At) towards the value of the target action:

Q(St, At) ← Q(St, At) + α ( Rt+1 + γ Q(St+1, A′) − Q(St, At) )
Xi Chen (NYU) Model-Free Control 31 / 38
Off-Policy Control with Q-Learning
The behavior policy µ is e.g. ε-greedy w.r.t. Q(s, a)
The target policy π is greedy w.r.t. Q(s, a)
π(S′) = argmax_{a′} Q(S′, a′)

The Q-learning update:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )

where R + γ max_{a′} Q(S′, a′) is the Q-learning target.
Xi Chen (NYU) Model-Free Control 32 / 38
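The Q-learning backup above can be sketched as a one-step helper. The dict-of-(state, action) representation of Q and the explicit `actions` list are assumptions of this sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup:
    Q(S,A) <- Q(S,A) + alpha*(R + gamma * max_a' Q(S',a') - Q(S,A)).
    """
    # Off-policy: bootstrap from the greedy (target-policy) successor value,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```

The only difference from the Sarsa backup is the max over successor actions, which is exactly what makes Q-learning off-policy.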
Q-Learning as a Stochastic Approximation

Q∗(s, a) = E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a] V∗(s′)
         = E_{s′} [ r(s, a) + γ max_{a′∈A} Q∗(s′, a′) ]

The Q-learning update is a stochastic approximation to this fixed-point equation:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )
Theorem
Q-learning control converges to the optimal action-value function,Q(s, a)→ q∗(s, a)
Xi Chen (NYU) Model-Free Control 34 / 38
Q-learning in Direct Mail Campaign Example
Slowly decaying learning rate from 0.05 to 0.0001
Different ε values
Convergence is always guaranteed, but at different speeds
Main lesson
Need efficient exploration/exploitation trade-off!
Xi Chen (NYU) Model-Free Control 35 / 38
Inefficiency of ε-greedy
Combination lock
N states, 2 actions
Reward is 1 when reaching G, and 0 otherwise
The optimal action is always the green action
Q-learning with ε-greedy
Assume Q is initialized to 0
Every step has probability 0.5 of going back to state “1”
It takes Θ(2^N) steps on average to reach G for the first time
Can we do better?
Xi Chen (NYU) Model-Free Control 36 / 38
Exploration Bonus
Selection rule
at = argmax_a { Q(st, a) + B(st, a) }

for some pre-defined exploration bonus function B(st, a).

B(s, a) measures the need for exploration, e.g., some decreasing function of C(s, a), the number of times action a has been chosen in state s
Popular choice: B(s, a) = α/√C(s, a) with parameter α > 0
Also known as Upper Confidence Bound (UCB)
An example of guided exploration (as opposed to ε-greedy)
Xi Chen (NYU) Model-Free Control 37 / 38
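The bonus-based selection rule with B(s, a) = α/√C(s, a) can be sketched as follows. The dict-based Q and C tables and the `bonus_scale` name (playing the role of α) are assumptions of this sketch:

```python
import math

def ucb_action(Q, C, state, actions, bonus_scale=1.0):
    """Select argmax_a { Q(s,a) + B(s,a) } with B(s,a) = alpha / sqrt(C(s,a)).

    Unvisited actions (C = 0) receive an infinite bonus, so each action is
    tried at least once before the bonus starts to decay.
    """
    def score(a):
        c = C.get((state, a), 0)
        if c == 0:
            return float("inf")
        return Q.get((state, a), 0.0) + bonus_scale / math.sqrt(c)
    return max(actions, key=score)
```

Unlike ε-greedy, exploration here is guided: rarely tried actions get a larger bonus and are preferred even when their current Q estimate is lower.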
Exploration Bonus
Epsilon-greedy strategy:

with probability 1 − ε: greedy action from s;
with probability ε: random action.

Epoch-dependent strategy (Boltzmann exploration):

pt(a|s, Q) = e^{Q(s,a)/τt} / ∑_{a′∈A} e^{Q(s,a′)/τt}

τt → 0: greedy selection; larger τt: more random actions.
Xi Chen (NYU) Model-Free Control 38 / 38
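The Boltzmann distribution above can be sketched as follows; the dict-of-(state, action) Q table is an assumption of this sketch:

```python
import math

def boltzmann_probs(Q, state, actions, tau):
    """Boltzmann exploration: p(a) proportional to exp(Q(s, a) / tau)."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    vals = [Q.get((state, a), 0.0) / tau for a in actions]
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    z = sum(exps)
    return [e / z for e in exps]
```

As tau shrinks, the distribution concentrates on the greedy action; as tau grows, it approaches uniform random selection.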