Lecture 9: Policy Gradient II (Post lecture)

Emma Brunskill
CS234 Reinforcement Learning
Winter 2018

Additional reading: Sutton and Barto 2018, Chapter 13
With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel

Emma Brunskill (CS234 Reinforcement Learning), Lecture 9: Policy Gradient II (Post lecture), Winter 2018
Class Feedback

Thanks to those that participated!
Of 70 responses, 54% thought too fast, 43% just right
Multiple requests to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors
Common things people find are helping them learn: assignments, mathematical derivations, checking your understanding / talking to a neighbor
Class Structure
Last time: Policy Search
This time: Policy Search
Next time: Midterm review
Recall: Policy-Based RL

Policy search: directly parametrize the policy

π_θ(s, a) = P[a | s; θ]

Goal is to find a policy π with the highest value function V^π
(Pure) policy-based methods: no value function, learned policy
Actor-critic methods: learned value function, learned policy
Recall: Advantages of Policy-Based RL
Advantages:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Disadvantages:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance
Recall: Policy Gradient

Defined V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assumed episodic MDPs
Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of the policy with respect to the parameters θ:

Δθ = α ∇_θ V(θ)

where ∇_θ V(θ) is the policy gradient

∇_θ V(θ) = ( ∂V(θ)/∂θ_1, …, ∂V(θ)/∂θ_n )^T

and α is a step-size parameter
Desired Properties of a Policy Gradient RL Algorithm

Goal: Converge as quickly as possible to a local optimum
We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until we reach a good policy
During policy search we alternate between evaluating the policy and changing (improving) the policy (just like in policy iteration)
Would like each policy update to be a monotonic improvement
Only guaranteed to reach a local optimum with gradient descent; monotonic improvement will achieve this. And in the real world, monotonic improvement is often beneficial.
Desired Properties of a Policy Gradient RL Algorithm

Goal: Obtain large monotonic improvements to the policy at each update
Techniques to try to achieve this:
Last time and today: Get a better estimate of the gradient (intuition: should improve updating the policy parameters)
Today: Change how to update the policy parameters given the gradient
Table of Contents
1 Better Gradient Estimates
2 Policy Gradient Algorithms and Reducing Variance
3 Updating the Parameters Given the Gradient: Motivation
4 Need for Automatic Step Size Tuning
5 Updating the Parameters Given the Gradient: Local Approximation
6 Updating the Parameters Given the Gradient: Trust Regions
7 Updating the Parameters Given the Gradient: TRPO Algorithm
Likelihood Ratio / Score Function Policy Gradient

Recall last time (m is the number of sampled trajectories):

∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t^(i) | s_t^(i))

Unbiased estimate of the gradient, but very noisy
Fixes that can make it practical:
Temporal structure (discussed last time)
Baseline
Alternatives to using Monte Carlo returns R(τ^(i)) as targets
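As a concrete sketch of this estimator, the snippet below computes (1/m) Σ_i R(τ^(i)) Σ_t ∇_θ log π_θ(a_t | s_t) for a softmax policy. The one-step, two-action environment and its reward function are hypothetical toy choices, not from the lecture; they just make the estimate easy to eyeball.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]

def score_function_gradient(theta, reward, m=2000, horizon=1, seed=0):
    # Monte Carlo estimate: (1/m) sum_i R(tau_i) sum_t grad log pi(a_t)
    rng = random.Random(seed)
    g = [0.0] * len(theta)
    for _ in range(m):
        pi = softmax(theta)
        actions = [rng.choices(range(len(theta)), weights=pi)[0]
                   for _ in range(horizon)]
        R = sum(reward(a) for a in actions)  # total trajectory return
        for a in actions:
            for k, dk in enumerate(grad_log_pi(theta, a)):
                g[k] += R * dk / m
    return g

# Toy one-step problem: action 1 pays reward 1, action 0 pays 0.
g = score_function_gradient([0.0, 0.0], reward=lambda a: float(a))
```

For a uniform starting policy the exact gradient in this toy problem is (−0.25, 0.25), so the Monte Carlo estimate should land near that and push probability toward the rewarding action.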
Policy Gradient: Introduce Baseline

Reduce variance by introducing a baseline b(s):

∇_θ E_τ[R] = E_τ [ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t′=t}^{T−1} r_{t′} − b(s_t) ) ]

For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return, b(s_t) ≈ E[r_t + r_{t+1} + … + r_{T−1}]
Interpretation: increase the log probability of action a_t proportionally to how much the returns Σ_{t′=t}^{T−1} r_{t′} are better than expected
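A small numeric illustration of the baseline's effect, under an assumed one-step, two-action setup of our own (reward(a) = offset + a, uniform policy): shifting rewards by a large constant leaves the expected gradient unchanged but inflates the variance of the raw estimator, while a baseline near E[R] removes that variance.

```python
import random

def grad_samples(offset, baseline, n=5000, seed=0):
    # Per-sample score-function gradient for the action-1 logit of a
    # uniform two-action softmax policy; reward(a) = offset + a.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = rng.randrange(2)                     # pi = [0.5, 0.5]
        r = offset + a
        score = (1.0 if a == 1 else 0.0) - 0.5   # d/d theta_1 of log pi(a)
        out.append((r - baseline) * score)
    return out

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

raw = grad_samples(offset=10.0, baseline=0.0)    # no baseline
cen = grad_samples(offset=10.0, baseline=10.5)   # b = E[R]
# Both estimators target the same expected gradient (0.25 in this setup),
# but the baseline collapses the per-sample variance.
```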
Baseline b(s) Does Not Introduce Bias: Derivation

E_τ [ ∇_θ log π(a_t | s_t, θ) b(s_t) ]
= E_{s_{0:t}, a_{0:(t−1)}} [ E_{s_{(t+1):T}, a_{t:(T−1)}} [ ∇_θ log π(a_t | s_t, θ) b(s_t) ] ]   (break up expectation)
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) E_{s_{(t+1):T}, a_{t:(T−1)}} [ ∇_θ log π(a_t | s_t, θ) ] ]   (pull baseline term out)
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) E_{a_t} [ ∇_θ log π(a_t | s_t, θ) ] ]   (remove irrelevant variables)
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) Σ_a π_θ(a_t | s_t) · ∇_θ π(a_t | s_t, θ) / π_θ(a_t | s_t) ]   (likelihood ratio)
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) Σ_a ∇_θ π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) ∇_θ Σ_a π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) ∇_θ 1 ]
= E_{s_{0:t}, a_{0:(t−1)}} [ b(s_t) · 0 ] = 0
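The crux of the derivation, E_{a_t}[∇_θ log π(a_t | s_t, θ)] = 0, can be verified exactly for a small softmax policy by summing over actions instead of sampling; the logits and baseline value below are arbitrary:

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def expected_score_times_baseline(logits, b):
    # Exact sum_a pi(a) * grad_theta log pi(a) * b, which should be the
    # zero vector: sum_a grad pi(a) = grad sum_a pi(a) = grad 1 = 0.
    pi = softmax(logits)
    n = len(logits)
    total = [0.0] * n
    for a in range(n):
        for k in range(n):
            score_k = (1.0 if k == a else 0.0) - pi[k]
            total[k] += pi[a] * score_k * b
    return total

vals = expected_score_times_baseline([0.3, -1.2, 2.0], b=7.5)
# Every component is (numerically) zero, so the baseline adds no bias.
```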
"Vanilla" Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, … do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = Σ_{t′=t}^{T−1} r_{t′}, and
        the advantage estimate A_t = R_t − b(s_t).
    Re-fit the baseline, by minimizing ||b(s_t) − R_t||², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate g, which is a sum of terms ∇_θ log π(a_t | s_t, θ) A_t. (Plug g into SGD or Adam.)
end for
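A minimal runnable sketch of this loop, under assumptions of our own choosing: a two-armed bandit where the reward equals the action index, a constant (state-independent) baseline, and plain SGD. It illustrates the algorithm's shape, not the lecture's reference implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def vanilla_pg(n_iters=100, batch=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # policy parameters (softmax logits)
    b = 0.0              # constant baseline
    for _ in range(n_iters):
        pi = softmax(theta)
        # Collect a batch of one-step trajectories under the current policy.
        actions = rng.choices([0, 1], weights=pi, k=batch)
        returns = [float(a) for a in actions]        # R_t (reward = action)
        advantages = [R - b for R in returns]        # A_t = R_t - b
        # Policy gradient estimate: g = (1/batch) sum grad log pi(a_t) * A_t
        g = [0.0, 0.0]
        for a, A in zip(actions, advantages):
            for k in range(2):
                g[k] += ((1.0 if k == a else 0.0) - pi[k]) * A / batch
        theta = [t + lr * gk for t, gk in zip(theta, g)]   # SGD ascent step
        # Re-fit baseline by minimizing sum (b - R_t)^2, i.e. b = mean return.
        b = sum(returns) / len(returns)
    return softmax(theta)

final_pi = vanilla_pg()
# The policy should concentrate on the rewarding arm (action 1).
```

Re-fitting the baseline to the mean return is exactly the minimizer of Σ (b − R_t)² when b is a single constant.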
Practical Implementation with Autodiff

The usual formula Σ_t ∇_θ log π(a_t | s_t; θ) A_t is inefficient; we want to batch data
Define a "surrogate" function using data from the current batch:

L(θ) = Σ_t log π(a_t | s_t; θ) A_t

Then the policy gradient estimator is g = ∇_θ L(θ)
Can also include the value function fit error:

L(θ) = Σ_t ( log π(a_t | s_t; θ) A_t − ||V(s_t) − R_t||² )
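A sanity check on the surrogate idea, with toy numbers of our own: treat each A_t as a constant, differentiate L(θ) analytically for a softmax policy, and confirm it matches central finite differences of L.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def surrogate(theta, batch):
    # L(theta) = sum_t log pi_theta(a_t) * A_t over fixed (a, A) pairs;
    # A is held constant, so grad L is the policy gradient estimate.
    return sum(math.log(softmax(theta)[a]) * A for a, A in batch)

def analytic_grad(theta, batch):
    pi = softmax(theta)
    g = [0.0] * len(theta)
    for a, A in batch:
        for k in range(len(theta)):
            g[k] += ((1.0 if k == a else 0.0) - pi[k]) * A
    return g

def numeric_grad(theta, batch, eps=1e-6):
    # Central finite differences of the surrogate.
    g = []
    for k in range(len(theta)):
        tp = list(theta); tp[k] += eps
        tm = list(theta); tm[k] -= eps
        g.append((surrogate(tp, batch) - surrogate(tm, batch)) / (2 * eps))
    return g

batch = [(0, 1.5), (1, -0.7), (2, 0.2), (1, 0.9)]   # (action, advantage)
theta = [0.1, -0.4, 0.8]
ga = analytic_grad(theta, batch)
gn = numeric_grad(theta, batch)
```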
Other choices for baseline?

Initialize policy parameter θ, baseline b
for iteration = 1, 2, … do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = Σ_{t′=t}^{T−1} r_{t′}, and
        the advantage estimate A_t = R_t − b(s_t).
    Re-fit the baseline, by minimizing ||b(s_t) − R_t||², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate g, which is a sum of terms ∇_θ log π(a_t | s_t, θ) A_t. (Plug g into SGD or Adam.)
end for
Choosing the Baseline: Value Functions

Recall the Q-function / state-action value function:

Q^{π,γ}(s, a) = E_π [ r_0 + γ r_1 + γ² r_2 + … | s_0 = s, a_0 = a ]

The state-value function can serve as a great baseline:

V^{π,γ}(s) = E_π [ r_0 + γ r_1 + γ² r_2 + … | s_0 = s ] = E_{a∼π} [ Q^{π,γ}(s, a) ]

Advantage function: combining Q with baseline V:

A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)
Choosing the Target

R(τ^(i)) is an estimate of the value function from a single rollout
Unbiased, but high variance
Reduce variance by introducing bias, using bootstrapping and function approximation (just as we saw for TD vs. MC, and in the value function approximation lectures)
The estimate of V / Q is done by a critic
Actor-critic methods maintain an explicit representation of both the policy and the value function, and update both
A3C is a very popular actor-critic method
Policy Gradient Formulas with Value Functions

Recall:

∇_θ E_τ[R] = E_τ [ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t′=t}^{T−1} r_{t′} − b(s_t) ) ]

Approximating the return with a learned state-action value function with parameters w:

∇_θ E_τ[R] ≈ E_τ [ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ( Q(s_t, a_t; w) − b(s_t) ) ]

Letting the baseline be an estimate of the value V, we can represent the gradient in terms of the state-action advantage function:

∇_θ E_τ[R] = E_τ [ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
Choosing the Target: N-step Estimators

∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t^(i) | s_t^(i))

Note that the critic can select any blend between TD and MC estimators for the target to substitute for the true state-action value function:

R_t^(1) = r_t + γ V(s_{t+1})
R_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2})
⋮
R_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + …

If we subtract baselines from the above, we get advantage estimators:

A_t^(1) = r_t + γ V(s_{t+1}) − V(s_t)
A_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + … − V(s_t)

A_t^(1) has low variance but high bias; A_t^(∞) has high variance but low bias.
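These blended targets are easy to compute; the sketch below (with made-up rewards and critic values) implements R_t^(n), falling back to the Monte Carlo return when t + n runs past the end of the episode.

```python
def n_step_return(rewards, values, t, n, gamma):
    # R_t^(n) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1}
    #           + gamma^n V(s_{t+n}),
    # reverting to the full Monte Carlo return if t+n exceeds the episode.
    T = len(rewards)
    R, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        R += discount * rewards[k]
        discount *= gamma
    if t + n < T:
        R += discount * values[t + n]   # bootstrap from the critic
    return R

rewards = [1.0, 0.0, 2.0, 1.0]
values  = [0.5, 1.5, 1.0, 0.8]   # hypothetical critic estimates V(s_t)
gamma = 0.9

R1 = n_step_return(rewards, values, t=0, n=1, gamma=gamma)    # r_0 + 0.9*V(s_1) = 2.35
Rinf = n_step_return(rewards, values, t=0, n=100, gamma=gamma)  # full MC return = 3.349
A1 = R1 - values[0]   # one-step advantage estimate = 1.85
```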
Updating the Policy Parameters Given the Gradient

Initialize policy parameter θ, baseline b
for iteration = 1, 2, … do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = Σ_{t′=t}^{T−1} r_{t′}, and
        the advantage estimate A_t = R_t − b(s_t).
    Re-fit the baseline, by minimizing ||b(s_t) − R_t||², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate g, which is a sum of terms ∇_θ log π(a_t | s_t, θ) A_t. (Plug g into SGD or Adam.)
end for
Policy Gradient and Step Sizes

Goal: each step of policy gradient yields an updated policy π′ whose value is greater than or equal to that of the prior policy π: V^{π′} ≥ V^π
Gradient descent approaches update the weights by a small step in the direction of the gradient
This is a first-order / linear approximation of the value function's dependence on the policy parameterization
Locally a good approximation; further away, less good
Why are step sizes a big deal in RL?

Step size is important in any problem involving finding the optimum of a function
Supervised learning: step too far → the next updates will fix it
Reinforcement learning:
Step too far → bad policy
Next batch: collected under the bad policy
The policy determines the data collected! It essentially controls the exploration / exploitation trade-off, through the particular policy parameters and the stochasticity of the policy
May not be able to recover from a bad choice: a collapse in performance!
Simple step-sizing: Line search in direction of gradient

Simple, but expensive (perform evaluations along the line)
Naive: ignores where the first-order approximation is good or bad
Policy Gradient Methods with Auto-Step-Size Selection

Can we automatically ensure the updated policy π′ has value greater than or equal to that of the prior policy π: V^{π′} ≥ V^π?
Consider this for the policy gradient setting, and hope to address it by modifying the step size
Objective Function

Goal: find policy parameters that maximize the value function*

V(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t); π_θ ]    (1)

where s_0 ∼ μ(s_0), a_t ∼ π(a_t | s_t), s_{t+1} ∼ P(s_{t+1} | s_t, a_t)
Have access to samples from the current policy π (parameterized by θ)
Want to predict the value of a different policy (off-policy learning!)

*For today we will primarily consider discounted value functions.
Objective Function

Goal: find policy parameters that maximize the value function*

V(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t); π_θ ]    (2)

where s_0 ∼ μ(s_0), a_t ∼ π(a_t | s_t), s_{t+1} ∼ P(s_{t+1} | s_t, a_t)
Express the expected return of another policy π̃ in terms of the advantage over the original policy:

V(θ̃) = V(θ) + E_{π_θ̃} [ Σ_{t=0}^{∞} γ^t A_π(s_t, a_t) ]
      = V(θ) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)    (3)

where ρ_π̃(s) is defined as the discounted weighted frequency of state s under policy π̃ (similar to in the Imitation Learning lecture)
We know the advantage A_π and π̃
But we can't compute the above because we don't know ρ_π̃, the state distribution under the new proposed policy

*For today we will primarily consider discounted value functions.
Local approximation

Can we remove the dependency on the discounted visitation frequencies under the new policy?
Substitute in the discounted visitation frequencies under the current policy to define a new objective function:

L_π(π̃) = V(θ) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)    (4)

Note that L_{π_{θ_0}}(π_{θ_0}) = V(θ_0)
The gradient of L is identical to the gradient of the value function when evaluated at θ_0: ∇_θ L_{π_{θ_0}}(π_θ) |_{θ=θ_0} = ∇_θ V(θ) |_{θ=θ_0}
Conservative Policy Iteration

Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective?
Consider mixture policies that blend between an old policy and a different policy:

π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)    (5)

In this case we can guarantee a lower bound on the value of the new policy π_new:

V^{π_new} ≥ L_{π_old}(π_new) − ( 2εγ / (1 − γ)² ) α²    (6)

where ε = max_s | E_{a∼π′(a|s)} [ A_π(s, a) ] |
Check your understanding: is this bound tight if π_new = π_old? Can we remove the dependency on the discounted visitation frequencies under the new policy?
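Plugging hypothetical numbers into bound (6) makes the first check-your-understanding question concrete: the penalty is quadratic in α and vanishes at α = 0, where π_new = π_old and L_{π_old}(π_old) = V^{π_old}, so the bound is tight there.

```python
def cpi_lower_bound(L_new, eps, gamma, alpha):
    # V^{pi_new} >= L_{pi_old}(pi_new) - 2*eps*gamma / (1 - gamma)^2 * alpha^2
    penalty = 2.0 * eps * gamma / (1.0 - gamma) ** 2 * alpha ** 2
    return L_new - penalty

# Hypothetical values: eps = 1, gamma = 0.9, so the penalty coefficient
# 2*eps*gamma / (1 - gamma)^2 is 180.
b0 = cpi_lower_bound(L_new=10.0, eps=1.0, gamma=0.9, alpha=0.0)   # no penalty
b1 = cpi_lower_bound(L_new=10.0, eps=1.0, gamma=0.9, alpha=0.1)   # 10 - 1.8
b2 = cpi_lower_bound(L_new=10.0, eps=1.0, gamma=0.9, alpha=1.0)   # 10 - 180
```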
Find the Lower Bound for General Stochastic Policies

Would like to similarly obtain a lower bound on the potential performance for general stochastic policies (not just mixture policies)
Recall L_π(π̃) = V(θ) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)

Theorem. Let D_TV^max(π_1, π_2) = max_s D_TV(π_1(·|s), π_2(·|s)). Then

V^{π_new} ≥ L_{π_old}(π_new) − ( 4εγ / (1 − γ)² ) ( D_TV^max(π_old, π_new) )²

where ε = max_{s,a} |A_π(s, a)|.

Note that D_TV(p, q)² ≤ D_KL(p, q) for probability distributions p and q. The theorem above then immediately implies that

V^{π_new} ≥ L_{π_old}(π_new) − ( 4εγ / (1 − γ)² ) D_KL^max(π_old, π_new)

where D_KL^max(π_1, π_2) = max_s D_KL(π_1(·|s), π_2(·|s))
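The inequality D_TV(p, q)² ≤ D_KL(p, q) (with D_TV taken as half the L1 distance) can be spot-checked numerically; the distributions below are arbitrary examples:

```python
import math

def tv(p, q):
    # Total variation distance: (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    # KL divergence: sum_i p_i * log(p_i / q_i), in nats
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

pairs = [
    ([0.7, 0.2, 0.1], [0.4, 0.35, 0.25]),
    ([0.5, 0.5], [0.9, 0.1]),
    ([0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]),
]
checks = [tv(p, q) ** 2 <= kl(p, q) for p, q in pairs]
```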
Guaranteed Improvement*

Goal is to compute a policy that maximizes the objective function defining the lower bound:

M_i(π) = L_{π_i}(π) − ( 4εγ / (1 − γ)² ) D_KL^max(π_i, π)    (7)

V^{π_{i+1}} ≥ L_{π_i}(π_{i+1}) − ( 4εγ / (1 − γ)² ) D_KL^max(π_i, π_{i+1}) = M_i(π_{i+1})    (8)
V^{π_i} = M_i(π_i)    (9)
V^{π_{i+1}} − V^{π_i} ≥ M_i(π_{i+1}) − M_i(π_i)    (10)

So as long as the new policy π_{i+1} is equal to or an improvement over the old policy π_i with respect to the lower bound, we are guaranteed to monotonically improve!
The above is a type of minorization-maximization (MM) algorithm.

*L_π(π̃) = V(θ) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)
Guaranteed Improvement

V^{π_new} ≥ L_{π_old}(π_new) − ( 4εγ / (1 − γ)² ) D_KL^max(π_old, π_new)
Emma Brunskill (CS234 Reinforcement Learning. )Lecture 9: Policy Gradient II (Post lecture) 46 Winter 2018 40 / 54
![Page 41: Lecture 9: Policy Gradient II (Post lecture) 2web.stanford.edu/class/cs234/CS234Win2019/slides/lecture9.pdf · Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234](https://reader033.fdocuments.us/reader033/viewer/2022052800/5f0f50637e708231d4438beb/html5/thumbnails/41.jpg)
Table of Contents
1 Better Gradient Estimates
2 Policy Gradient Algorithms and Reducing Variance
3 Updating the Parameters Given the Gradient: Motivation
4 Need for Automatic Step Size Tuning
5 Updating the Parameters Given the Gradient: Local Approximation
6 Updating the Parameters Given the Gradient: Trust Regions
7 Updating the Parameters Given the Gradient: TRPO Algorithm
Optimization of Parameterized Policies

Goal is to optimize

$$\max_\theta \; L_{\theta_{old}}(\theta_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\theta_{old}, \theta_{new}) = L_{\theta_{old}}(\theta_{new}) - C\, D_{KL}^{\max}(\theta_{old}, \theta_{new})$$

where $C$ is the penalty coefficient

In practice, if we used the penalty coefficient recommended by the theory above, $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, the step sizes would be very small

New idea: use a trust region constraint on step sizes. Do this by imposing a constraint on the KL divergence between the new and old policy:

$$\max_\theta \; L_{\theta_{old}}(\theta) \qquad (11)$$

$$\text{subject to } D_{KL}^{s \sim \rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta \qquad (12)$$

This uses the average KL instead of the max (the max would require the KL to be bounded at all states, yielding an impractical number of constraints)

Footnote: $L_\pi(\tilde{\pi}) = V^{\pi} + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a|s)\, A_\pi(s,a)$
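To see why the theory-recommended penalty coefficient forces tiny steps, a quick numeric check helps (the values of gamma and epsilon are illustrative, not from the source):

```python
gamma, eps = 0.99, 1.0  # illustrative values; eps bounds the advantage magnitude
C = 4 * eps * gamma / (1 - gamma) ** 2
print(round(C))  # prints 39600: a KL of just 1e-3 already costs ~40 in the objective
```

With any realistic discount factor, the (1 - gamma)^2 denominator makes C enormous, so the penalized objective only tolerates near-zero KL moves; replacing the penalty with an explicit constraint (11)-(12) sidesteps this.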
From Theory to Practice

Prior objective:

$$\max_\theta \; L_{\theta_{old}}(\theta) \qquad (13)$$

$$\text{subject to } D_{KL}^{s \sim \rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta \qquad (14)$$

where $L_\pi(\tilde{\pi}) = V^{\pi} + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a|s)\, A_\pi(s,a)$

Don't know the visitation weights nor the true advantage function

Instead do the following substitutions:

$$\sum_s \rho_{\theta_{old}}(s)\,[\ldots] \;\to\; \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_{\theta_{old}}}[\ldots] \qquad (15)$$
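The 1/(1 - gamma) factor appears because the discounted visitation weights have total mass sum_t gamma^t = 1/(1 - gamma); normalizing them into a distribution and compensating leaves the sum unchanged. A quick numeric check with made-up visitation weights:

```python
import numpy as np

gamma = 0.9
# Made-up unnormalized discounted visitation weights over 3 states; their
# total mass must be sum_t gamma^t = 1/(1 - gamma) = 10.
rho = np.array([4.0, 3.0, 3.0])
f = np.array([1.0, -2.0, 0.5])                  # any per-state quantity

lhs = float(rho @ f)                            # sum_s rho(s) f(s)
rho_bar = (1 - gamma) * rho                     # normalized state distribution
rhs = float((1 / (1 - gamma)) * (rho_bar @ f))  # (1/(1-gamma)) E_{s~rho_bar}[f(s)]
print(np.isclose(lhs, rhs))  # prints True
```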
From Theory to Practice

Next substitution:

$$\sum_a \pi_\theta(a|s_n)\, A_{\theta_{old}}(s_n, a) \;\to\; \mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a|s_n)}{q(a|s_n)}\, A_{\theta_{old}}(s_n, a)\right] \qquad (16)$$

where $q$ is some sampling distribution over the actions and $s_n$ is a particular sampled state.

This second substitution uses importance sampling to estimate the desired sum, enabling the use of an alternate sampling distribution $q$ (other than the new policy $\pi_\theta$).

Third substitution:

$$A_{\theta_{old}} \to Q_{\theta_{old}} \qquad (17)$$

Note that the above substitutions do not change the solution to the optimization problem (replacing the advantage with the Q-value shifts the objective only by a quantity independent of $\theta$)
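A minimal Monte Carlo sketch of substitution (16), with made-up numbers: sampling actions from a behavior distribution q and reweighting by the likelihood ratio recovers the expectation under pi_theta.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_theta = np.array([0.7, 0.2, 0.1])   # new policy pi_theta(.|s_n), made up
q        = np.array([1/3, 1/3, 1/3])   # sampling (behavior) distribution q(.|s_n)
Q_vals   = np.array([1.0, 0.0, -1.0])  # Q_{theta_old}(s_n, a), made up

exact = float(pi_theta @ Q_vals)       # sum_a pi_theta(a|s_n) Q(s_n, a) = 0.6

actions = rng.choice(3, size=100_000, p=q)
weights = pi_theta[actions] / q[actions]   # importance weights pi_theta/q
estimate = float(np.mean(weights * Q_vals[actions]))
print(round(exact, 3), round(estimate, 3))
```

The Monte Carlo estimate concentrates around the exact value; its variance grows when q puts little mass where pi_theta puts a lot, which is why the choice of sampling policy (next slide) matters.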
Selecting the Sampling Policy

Optimize

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)}\, Q_{\theta_{old}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{old}}}\, D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \le \delta$$

Standard approach: the sampling distribution $q(a|s)$ is simply $\pi_{old}(a|s)$

For the vine procedure, see the paper
Searching for the Next Parameter

Use a linear approximation to the objective function and a quadratic approximation to the constraint

Constrained optimization problem

Use the conjugate gradient method

(Schulman et al. 2015, "Trust Region Policy Optimization")
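The conjugate gradient step can be sketched as follows: solve F x = g using only matrix-vector products, so the Fisher matrix never needs to be formed or inverted explicitly. The 2x2 matrix below is a stand-in for illustration, not an actual Fisher matrix.

```python
import numpy as np

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Solve A x = b given only the matrix-vector product Avp(v) = A @ v."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x, with x = 0
    p = r.copy()
    rs = float(r @ r)
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs / float(p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = float(r @ r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Stand-in for the Fisher matrix: any symmetric positive-definite matrix works.
F = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
step = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ step, g))  # prints True
```

In TRPO itself, `Avp` would be a Fisher-vector (Hessian-vector) product computed by automatic differentiation, which is what makes the method practical for large parameter vectors.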
Table of Contents
1 Better Gradient Estimates
2 Policy Gradient Algorithms and Reducing Variance
3 Updating the Parameters Given the Gradient: Motivation
4 Need for Automatic Step Size Tuning
5 Updating the Parameters Given the Gradient: Local Approximation
6 Updating the Parameters Given the Gradient: Trust Regions
7 Updating the Parameters Given the Gradient: TRPO Algorithm
Practical Algorithm: TRPO

1: for iteration = 1, 2, . . . do
2:   Run policy for T timesteps or N trajectories
3:   Estimate advantage function at all timesteps
4:   Compute policy gradient g
5:   Use CG (with Hessian-vector products) to compute F⁻¹g, where F is the Fisher information matrix
6:   Do line search on surrogate loss and KL constraint
7: end for
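Step 6 can be sketched with a backtracking line search: shrink the proposed step until the surrogate improves and the KL constraint holds. The quadratic "surrogate" and squared-distance "KL" below are toy stand-ins, not the TRPO quantities.

```python
def line_search(surrogate, kl, theta, full_step, delta, max_backtracks=10):
    """Accept the largest fraction of full_step that improves the surrogate
    while keeping kl(theta, theta_new) <= delta; otherwise keep theta."""
    base = surrogate(theta)
    for i in range(max_backtracks):
        frac = 0.5 ** i
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > base and kl(theta, theta_new) <= delta:
            return theta_new
    return theta

# Toy stand-ins (assumed): maximize -(theta - 1)^2 under a squared-distance "KL".
surrogate = lambda th: -(th - 1.0) ** 2
kl = lambda a, b: (a - b) ** 2
print(line_search(surrogate, kl, 0.0, 2.0, delta=0.5))  # prints 0.5
```

Here the full step overshoots and the half step violates the trust region, so the quarter step (theta = 0.5) is the first acceptable candidate.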
(Schulman et al. 2015, "Trust Region Policy Optimization")
Practical Algorithm: TRPO
Applied to
Locomotion controllers in 2D
Atari games with pixel input
(Schulman et al. 2015, "Trust Region Policy Optimization")
TRPO Results

[Results figures from the TRPO paper; not reproduced in this transcript.]
TRPO Summary
Policy gradient approach
Uses surrogate optimization function
Automatically constrains the weight update to a trust region, approximating the region where the first-order approximation is valid
Empirically consistently does well
Very influential: 350+ citations in the few years since it was introduced
Common Template of Policy Gradient Algorithms

1: for iteration = 1, 2, . . . do
2:   Run policy for T timesteps or N trajectories
3:   At each timestep in each trajectory, compute target Q^π(s_t, a_t) and baseline b(s_t)
4:   Compute estimated policy gradient g
5:   Update the policy using g, potentially constrained to a local region
6: end for
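A minimal instantiation of this template on a two-armed bandit (all numbers made up): a softmax policy, a running-average baseline for step 3, the REINFORCE-style gradient for step 4, and a plain unconstrained update for step 5.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)             # softmax preferences over 2 arms
rewards = np.array([0.0, 1.0])  # arm 1 is better (made-up problem)
baseline, alpha = 0.0, 0.1

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()  # step 2: run (one-step) policy
    a = rng.choice(2, p=pi)
    r = float(rewards[a])
    grad_log = -pi.copy()                     # grad log pi(a): one-hot(a) - pi
    grad_log[a] += 1.0
    g = (r - baseline) * grad_log             # step 4: policy gradient estimate
    theta += alpha * g                        # step 5: unconstrained update
    baseline += 0.01 * (r - baseline)         # step 3: running-average baseline

pi = np.exp(theta) / np.exp(theta).sum()
print(pi)  # the better arm ends up with most of the probability
```

Swapping step 5 for a trust-region-constrained update is exactly what turns this template into TRPO; swapping the target in step 3 changes the bias/variance of the gradient estimate.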
Policy Gradient Summary
Extremely popular and useful set of approaches
Can incorporate prior knowledge by specifying the policy parameterization
You should be very familiar with REINFORCE and the policy gradienttemplate on the prior slide
Understand where different estimators can be slotted in (andimplications for bias/variance)
Don’t have to be able to derive or remember the specific formulas inTRPO for approximating the objectives and constraints
Will have the opportunity to practice with these ideas in homework 3