Advanced Policy Gradients
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Class Notes
1. Homework 2 due today (11:59 pm)!
   • Don't be late!
2. Homework 3 comes out this week
   • Start early! Q-learning takes a while to run
Today’s Lecture
1. Why does policy gradient work?
2. Policy gradient is a type of policy iteration
3. Policy gradient as a constrained optimization
4. From constrained optimization to natural gradient
5. Natural gradients and trust regions
• Goals:
  • Understand the policy iteration view of policy gradient
  • Understand how to analyze policy gradient improvement
  • Understand what natural gradient does and how to use it
Recap: policy gradients
1. generate samples (i.e. run the policy)
2. fit a model to estimate return
3. improve the policy

The "reward to go" is the Q̂ term weighting each log-probability gradient in the policy gradient; we can also use function approximation (a critic) there, as in the sketch below.
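A sketch of the on-policy estimator these annotations refer to, in the notation of the earlier policy gradient lecture:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t},
\qquad
\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})
$$

Here Q̂ is the reward to go, and a learned value function or advantage estimator can replace it.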
Why does policy gradient work?
1. generate samples (i.e. run the policy)
2. fit a model to estimate return
3. improve the policy
look familiar?
Policy gradient as policy iteration
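The identity this slide derives, reconstructed here from the lecture (and from the TRPO paper cited below): the improvement of a new policy π_θ′ over the old policy π_θ equals the expected advantage of the old policy under the new policy's trajectory distribution,

$$
J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t} \gamma^{t} A^{\pi_\theta}(s_t, a_t)\right].
$$

Maximizing expected advantage under the new policy is exactly the policy iteration improvement step, hence the title.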
Policy gradient as policy iteration: importance sampling
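Rewriting the expectation over actions with importance sampling (a reconstruction of the slide's derivation):

$$
\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]
= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]
$$

The action expectation now uses samples from the old policy, but the state marginal p_θ′(s_t) still requires running the new policy, which we do not have yet.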
Ignoring distribution mismatch??
why do we want this to be true?
is it true? and when?
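The approximation in question is swapping p_θ′(s_t) for p_θ(s_t), which yields a surrogate objective computable entirely from old-policy samples:

$$
\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]
$$

We want J(θ′) − J(θ) ≈ Ā(θ′) to hold because then maximizing Ā(θ′) using only old-policy samples maximizes the actual improvement. The next slides show that it holds when π_θ′ is close to π_θ.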
Bounding the distribution change
seem familiar?
not a great bound, but a bound!
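The argument mirrors the imitation learning analysis ("seem familiar?"): if π_θ′ deviates from π_θ with probability at most ε at each step, then

$$
p_{\theta'}(s_t) = (1-\epsilon)^t\, p_\theta(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t)
\;\Rightarrow\;
\left| p_{\theta'}(s_t) - p_\theta(s_t) \right| \le 2\left(1 - (1-\epsilon)^t\right) \le 2\epsilon t,
$$

where |·| denotes total variation divergence and p_mistake is some other, unknown distribution. The error grows linearly in t.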
Bounding the distribution change
Proof based on: Schulman, Levine, Moritz, Jordan, Abbeel. “Trust Region Policy Optimization.”
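The general statement, which holds for stochastic π_θ as well: if the policies are close in total variation, |π_θ′(a_t|s_t) − π_θ(a_t|s_t)| ≤ ε for all s_t, then the state marginals satisfy

$$
\left| p_{\theta'}(s_t) - p_\theta(s_t) \right| \le 2 \epsilon t .
$$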
Bounding the objective value
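From the marginal bound, for any bounded function f of the state (reconstructing the slide's argument):

$$
\mathbb{E}_{p_{\theta'}(s_t)}\left[f(s_t)\right] \;\ge\; \mathbb{E}_{p_\theta(s_t)}\left[f(s_t)\right] - 2\epsilon t \max_{s_t} f(s_t).
$$

Applying this with f as the expected importance-weighted advantage shows that Ā(θ′) matches the true objective up to an error on the order of ε T r_max, so for small enough ε, maximizing the surrogate guarantees improvement.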
Where are we at so far?
Break
A more convenient bound
KL divergence has some very convenient properties that make it much easier to approximate!
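The link is Pinsker's inequality: total variation divergence is bounded by the KL divergence,

$$
\left|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)\right| \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}\!\left(\pi_{\theta'}(\cdot \mid s_t)\, \| \,\pi_\theta(\cdot \mid s_t)\right)},
\qquad
D_{\mathrm{KL}}(p_1 \| p_2) = \mathbb{E}_{x \sim p_1}\left[\log \frac{p_1(x)}{p_2(x)}\right].
$$

So constraining the KL also constrains total variation, and the KL is an expectation of a log-ratio, which is easy to estimate from samples and to differentiate.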
How do we optimize the objective?
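Putting the pieces together gives a constrained optimization problem over the new policy parameters:

$$
\theta' \leftarrow \arg\max_{\theta'} \; \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]
\quad \text{s.t.} \quad D_{\mathrm{KL}}\left(\pi_{\theta'} \,\|\, \pi_\theta\right) \le \epsilon .
$$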
How do we enforce the constraint?
can do this incompletely (for a few grad steps)
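One standard option here is dual gradient descent: optimize the Lagrangian of the constrained problem, alternating between the two updates, where the maximization over θ′ can be done incompletely as noted above:

$$
\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda \left( D_{\mathrm{KL}}(\pi_{\theta'} \| \pi_\theta) - \epsilon \right),
\qquad
\theta' \leftarrow \arg\max_{\theta'} \mathcal{L}(\theta', \lambda), \quad
\lambda \leftarrow \lambda + \alpha \left( D_{\mathrm{KL}}(\pi_{\theta'} \| \pi_\theta) - \epsilon \right).
$$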
How (else) do we optimize the objective?
Use a first-order Taylor approximation of the objective (a.k.a. linearization):
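Linearizing Ā around the current parameters θ turns the problem into maximizing an inner product under the KL constraint:

$$
\theta' \leftarrow \arg\max_{\theta'} \; \nabla_{\theta'} \bar{A}(\theta')\big|_{\theta'=\theta}^{\top} (\theta' - \theta)
\quad \text{s.t.} \quad D_{\mathrm{KL}}(\pi_{\theta'} \| \pi_\theta) \le \epsilon .
$$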
How do we optimize the objective?
(see policy gradient lecture for derivation)
exactly the normal policy gradient!
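Evaluating the gradient of the importance-sampled objective at θ′ = θ, the ratio becomes 1 and the usual identity ∇π = π ∇ log π gives back the standard estimator:

$$
\nabla_{\theta'} \bar{A}(\theta')\big|_{\theta'=\theta}
= \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]\right]
= \nabla_\theta J(\theta).
$$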
Can we just use the gradient then?
Can we just use the gradient then?
not the same! Plain gradient ascent solves the linearized problem under a Euclidean constraint ‖θ′ − θ‖² ≤ ε, but some parameters change the policy much more than others, so we want the constraint in distribution space instead. Take a second-order Taylor expansion of the KL divergence:
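Under this (standard) quadratic approximation around θ:

$$
D_{\mathrm{KL}}(\pi_{\theta'} \| \pi_\theta) \approx \tfrac{1}{2} (\theta' - \theta)^{\top} \mathbf{F}\, (\theta' - \theta),
\qquad
\mathbf{F} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right],
$$

where F is the Fisher information matrix, which can be estimated with samples from π_θ.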
Can we just use the gradient then?
natural gradient
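With the quadratic KL approximation, the constrained linearized problem has a closed-form solution, the natural gradient update, with a step size chosen to satisfy the KL constraint exactly under the approximation:

$$
\theta' = \theta + \alpha\, \mathbf{F}^{-1} \nabla_\theta J(\theta),
\qquad
\alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^{\top}\, \mathbf{F}^{-1}\, \nabla_\theta J(\theta)}} .
$$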
Is this even a problem in practice?
(image from Peters & Schaal 2008)
Essentially the same problem as this:
(figure from Peters & Schaal 2008)
Practical methods and notes
• Natural policy gradient
  • Generally a good choice to stabilize policy gradient training
  • See this paper for details: Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
  • Practical implementation: requires efficient Fisher-vector products, a bit non-trivial to do without computing the full matrix (see the sketch after this list)
    • See: Schulman et al. Trust region policy optimization
• Trust region policy optimization
• Just use the IS objective directly
  • Use regularization to stay close to old policy
  • See: Proximal policy optimization
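To make the Fisher-vector product point concrete, here is a minimal NumPy sketch (not the course's reference implementation) of one natural gradient step for a linear-softmax policy over discrete actions. All names are illustrative; `states`, `actions`, and `advantages` are assumed to come from rollouts of the current policy, and conjugate gradient solves F x = g without ever forming the Fisher matrix F explicitly.

```python
import numpy as np

def log_prob_grads(theta, states, actions):
    """Per-sample gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta: (d, A) weights; states: (N, d); actions: (N,) integer indices.
    Returns an (N, d*A) matrix of flattened gradients.
    """
    z = states @ theta                                  # (N, A) logits
    z -= z.max(axis=1, keepdims=True)                   # numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)           # (N, A)
    onehot = np.eye(theta.shape[1])[actions]            # (N, A)
    # grad of log pi w.r.t. theta is outer(s, onehot - probs), flattened
    return (states[:, :, None] * (onehot - probs)[:, None, :]).reshape(len(states), -1)

def conjugate_gradient(fvp, b, iters=20, tol=1e-10):
    """Solve F x = b given only the matrix-vector product fvp(v) = F v."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        step = rr / (p @ Ap)
        x += step * p
        r -= step * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def natural_gradient_step(theta, states, actions, advantages, eps=0.01, damping=1e-3):
    """One natural policy gradient update with a KL-based step size."""
    g_logp = log_prob_grads(theta, states, actions)
    pg = g_logp.T @ advantages / len(advantages)        # vanilla policy gradient
    # Fisher-vector product from samples: F v ~ mean of (g g^T) v, plus damping
    fvp = lambda v: g_logp.T @ (g_logp @ v) / len(g_logp) + damping * v
    nat_g = conjugate_gradient(fvp, pg)                 # F^{-1} g
    alpha = np.sqrt(2 * eps / (pg @ nat_g + 1e-12))     # enforce KL <= eps (approx.)
    return theta + alpha * nat_g.reshape(theta.shape)
```

Conjugate gradient only needs `fvp`, so the same structure scales to neural network policies, where F would be far too large to store; that is the trick used in the TRPO paper.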
Review
1. generate samples (i.e. run the policy)
2. fit a model to estimate return
3. improve the policy
• Policy gradient = policy iteration
• Optimize advantage under new policy state distribution
  • Using old policy state distribution optimizes a bound, if the policies are close enough
• Results in constrained optimization problem
• First order approximation to objective = gradient ascent
  • Regular gradient ascent has the wrong constraint, use natural gradient
• Practical algorithms
  • Natural policy gradient
  • Trust region policy optimization