Meta Reinforcement Learning
rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdf
![Page 1: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/1.jpg)
Meta Reinforcement Learning
Kate Rakelly
11/13/19
![Page 2: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/2.jpg)
Questions we seek to answer
Motivation: What problem is meta-RL trying to solve?
Context: What is the connection to other problems in RL?
Solutions: What are solution methods for meta-RL and their limitations?
Open Problems: What are the open problems in meta-RL?
![Page 3: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/3.jpg)
Robot art by Matt Spangler, mattspangler.com
Meta-learning problem statement
supervised learning
reinforcement learning
“Dalmatian”
“German shepherd” “Pug”
corgi ???
![Page 4: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/4.jpg)
Meta-RL problem statement
Regular RL: learn policy for single task
Meta-RL: learn adaptation rule
Meta-training / Outer loop
Adaptation / Inner loop
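The equations on this slide did not survive extraction. As a sketch, the two problem statements are commonly written as follows; the symbols $\theta$, $\phi_i$, $f_\theta$, and $\mathcal{M}_i$ are notation assumed here and may differ from the original slide. The outer loop (meta-training) optimizes $\theta$; the inner loop (adaptation) computes $\phi_i$ from experience in task $\mathcal{M}_i$.

```latex
\text{Regular RL:}\quad
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\pi_\theta(\tau)}\big[ R(\tau) \big]

\text{Meta-RL:}\quad
\theta^\star = \arg\max_\theta \; \sum_{i=1}^{n} \mathbb{E}_{\pi_{\phi_i}(\tau)}\big[ R(\tau) \big],
\qquad \phi_i = f_\theta(\mathcal{M}_i)
```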
![Page 5: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/5.jpg)
Relation to goal-conditioned policies
Meta-RL can be viewed as a goal-conditioned policy where the task information is inferred from experience
Task information could be about the dynamics or reward functions
Rewards are a strict generalization of goals
Slide adapted from Chelsea Finn
![Page 6: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/6.jpg)
Relation to goal-conditioned policies
Slide adapted from Chelsea Finn
Q: What is an example of a reward function that can’t be expressed as a goal state?
A: E.g., seek while avoiding, action penalties
![Page 7: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/7.jpg)
Adaptation
What should the adaptation procedure do?
- Explore: Collect the most informative data
- Adapt: Use that data to obtain the optimal policy
![Page 8: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/8.jpg)
General meta-RL algorithm outline
In practice, compute update across a batch of tasks
Different algorithms:
- Choice of function f
- Choice of loss function L
Can do more than one round of adaptation
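The outline above can be sketched as a generic loop with pluggable pieces. This is a minimal illustration, not the algorithm from any specific paper: `meta_train`, `collect`, `adapt`, and `meta_update` are hypothetical names, the toy task (match a scalar target `c`) is invented for the example, and the outer update uses a first-order approximation.

```python
import random


def meta_train(theta, sample_task, collect, adapt, meta_update,
               iterations=200, tasks_per_batch=4, adapt_rounds=1):
    """Generic outer loop: adapt per task (inner loop), then update theta."""
    for _ in range(iterations):
        batch = []
        for _ in range(tasks_per_batch):          # update across a batch of tasks
            task = sample_task()
            phi = theta
            for _ in range(adapt_rounds):         # can do more than one round
                data = collect(task, phi)         # D_i: data from the current policy
                phi = adapt(phi, data)            # phi_i = f(theta, D_i)
            batch.append(collect(task, phi))      # D'_i: data from the adapted policy
        theta = meta_update(theta, batch)         # step on the loss L over the batch
    return theta


# Toy instantiation (all hypothetical): a task is a target c, the "policy" is a scalar.
rng = random.Random(0)
ALPHA, BETA = 0.5, 0.1


def sample_task():
    return rng.uniform(2.0, 4.0)


def collect(task, params):
    # Error signal of acting with `params`: gradient of 0.5 * (params - c)^2.
    return params - task


def adapt(params, data):
    return params - ALPHA * data                  # f: one gradient step


def meta_update(theta, batch):
    # First-order approximation: treat post-adaptation errors as the gradient for theta.
    return theta - BETA * sum(batch) / len(batch)


theta_star = meta_train(0.0, sample_task, collect, adapt, meta_update)
```

In this toy setting `theta_star` settles near the middle of the task range, which is the initialization from which one adaptation step gets closest to any sampled target.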
![Page 9: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/9.jpg)
Solution Methods
![Page 10: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/10.jpg)
Solution #1: recurrence
Implement the policy as a recurrent network, train across a set of tasks
Persist the hidden state across episode boundaries for continued adaptation!
Duan et al. 2016, Wang et al. 2016, Heess et al. 2015. Fig adapted from Duan et al. 2016
RNN
PG
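A minimal sketch of what persisting the hidden state across episode boundaries means in code. The class and function names are hypothetical, the weights are random (training with a policy gradient is omitted), and the input layout (observation, previous action, previous reward, done flag) is an assumption based on Duan et al. 2016. The hidden state is reset once per task, never between episodes of the same task.

```python
import math
import random


class TinyRecurrentPolicy:
    """Untrained toy RNN policy over two actions (illustration only)."""

    def __init__(self, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Input is (observation, previous action, previous reward, done flag).
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(4)]
                     for _ in range(hidden_size)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.reset_hidden()

    def reset_hidden(self):
        # Called once per task, NOT at every episode boundary.
        self.h = [0.0] * self.hidden_size

    def act(self, obs, prev_action, prev_reward, done):
        x = [obs, float(prev_action), prev_reward, float(done)]
        self.h = [
            math.tanh(sum(w * v for w, v in zip(self.w_in[j], x))
                      + sum(w * v for w, v in zip(self.w_h[j], self.h)))
            for j in range(self.hidden_size)
        ]
        logit = sum(w * v for w, v in zip(self.w_out, self.h))
        return 1 if logit > 0 else 0


def run_trial(policy, rewarding_arm, episodes=2, steps_per_episode=3):
    """Run several episodes of one task and record the hidden state at each start."""
    policy.reset_hidden()
    prev_action, prev_reward, done = 0, 0.0, False
    hidden_at_episode_start = []
    for _ in range(episodes):
        hidden_at_episode_start.append(list(policy.h))  # no reset here
        for t in range(steps_per_episode):
            action = policy.act(1.0, prev_action, prev_reward, done)
            prev_reward = 1.0 if action == rewarding_arm else 0.0
            prev_action = action
            done = (t == steps_per_episode - 1)
    return hidden_at_episode_start
```

The second episode starts from a non-zero hidden state, so whatever the network has encoded about rewards in the first episode is available for continued adaptation.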
![Page 11: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/11.jpg)
Solution #1: recurrence
![Page 12: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/12.jpg)
Solution #1: recurrence
RNN
PG
Pro: general, expressive
In principle, there exists an RNN that can compute any computable function
Con: not consistent
Q: What does it mean for adaptation to be “consistent”?
A: It will converge to the optimal policy given enough data
![Page 13: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/13.jpg)
Solution #1: recurrence
Duan et al. 2016, Wang et al. 2016
![Page 14: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/14.jpg)
Wait, what if we just fine-tune?
Is pretraining a type of meta-learning?
Better features = faster learning of a new task!
Fine-tuning is sample inefficient, prone to overfitting, and particularly difficult in RL
Slide adapted from Sergey Levine
![Page 15: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/15.jpg)
Solution #2: optimization
Finn et al. 2017. Fig adapted from Finn et al. 2017
Learn a parameter initialization from which fine-tuning for a new task works!
PG
PG
![Page 16: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/16.jpg)
Solution #2: optimization
Finn et al. 2017. Fig adapted from Finn et al. 2017
Requires second order derivatives!
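To see where the second-order derivatives come from, here is a toy sketch with a scalar parameter. It uses a differentiable loss $\tfrac{1}{4}(\theta - c)^4$ in place of an RL return, which is an assumption made for illustration; the function names are hypothetical. The inner step is $\phi = \theta - \alpha \nabla L(\theta)$, so by the chain rule the meta-gradient is $\nabla L(\phi)\,(1 - \alpha \nabla^2 L(\theta))$, and the Hessian term is what a first-order approximation drops.

```python
def loss(params, c):
    return 0.25 * (params - c) ** 4


def grad(params, c):
    return (params - c) ** 3


def hessian(params, c):
    return 3.0 * (params - c) ** 2


def inner_update(theta, c, alpha):
    """One gradient step of adaptation: phi = theta - alpha * grad L(theta)."""
    return theta - alpha * grad(theta, c)


def meta_gradient(theta, c, alpha, first_order=False):
    """d/dtheta of L(phi(theta)); the full version needs the Hessian at theta."""
    phi = inner_update(theta, c, alpha)
    dphi_dtheta = 1.0 if first_order else 1.0 - alpha * hessian(theta, c)
    return grad(phi, c) * dphi_dtheta


def finite_difference(theta, c, alpha, eps=1e-5):
    """Numerical check of the meta-gradient through the inner update."""
    up = loss(inner_update(theta + eps, c, alpha), c)
    down = loss(inner_update(theta - eps, c, alpha), c)
    return (up - down) / (2 * eps)
```

With `theta=1.0`, `c=0.0`, `alpha=0.1`, the full meta-gradient matches the finite-difference estimate, while the first-order version differs by the factor `1 - alpha * hessian`.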
![Page 17: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/17.jpg)
Solution #2: optimization
Fig adapted from Rothfuss et al. 2018
How exploration is learned automatically
The causal relationship between pre- and post-update trajectories is taken into account
Pre-update parameters receive credit for producing good exploration trajectories
PG
PG
![Page 18: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/18.jpg)
Solution #2: optimization
Fig adapted from Rothfuss et al. 2018
PG
PG
View this as a “return” that encourages gradient alignment
![Page 19: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/19.jpg)
Solution #2: optimization
Pro: consistent!
Con: not as expressive
Q: When could the optimization strategy be less expressive than the recurrent strategy?
PG
PG
Suppose reward is given only in this region
Example: when no rewards are collected, adaptation will not change the policy, even though this data gives information about which states to avoid
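A small sketch of this failure case, assuming a one-parameter Bernoulli policy with $p(a=1) = \sigma(\theta)$ and a plain REINFORCE estimator without a baseline (both are choices made for this illustration). When every collected reward is zero, the gradient estimate is zero and the adapted parameters equal the initial ones, regardless of which states were visited.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_gradient(theta, trajectories):
    """REINFORCE estimate: mean over trajectories of (sum of scores) * return."""
    p = sigmoid(theta)
    total = 0.0
    for trajectory in trajectories:
        ret = sum(reward for _, reward in trajectory)
        # d/dtheta log pi(a) for a Bernoulli(sigmoid(theta)) policy is (a - p).
        score = sum(action - p for action, _ in trajectory)
        total += score * ret
    return total / len(trajectories)


def adapt(theta, trajectories, alpha=0.1):
    """One policy-gradient step of adaptation (gradient ascent on return)."""
    return theta + alpha * reinforce_gradient(theta, trajectories)


# Each trajectory is a list of (action, reward) pairs.
no_reward = [[(1, 0.0), (0, 0.0)], [(0, 0.0), (0, 0.0)]]
rewarded = [[(1, 1.0)]]
```

Calling `adapt(0.3, no_reward)` returns `0.3` unchanged, whereas `adapt(0.3, rewarded)` moves the parameter. A recurrent policy has no such restriction, since it can condition on the unrewarded transitions directly.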
![Page 20: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/20.jpg)
Solution #2: optimization
Exploring in a sparse reward setting
Fig adapted from Rothfuss et al. 2018
Cheetah running forward and back after 1 gradient step
Fig adapted from Finn et al. 2017
![Page 21: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/21.jpg)
Meta-RL on robotic systems
![Page 22: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/22.jpg)
Meta-imitation learning
Figure adapted from BAIR Blog Post: One-Shot Imitation from Watching Videos
Demonstration 1-shot imitation
![Page 23: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/23.jpg)
Meta-imitation learning
Yu et al. 2017
Behavior cloning
PG
Training: run behavior cloning for adaptation
Test: perform task given a single robot demo
Meta-training Test time
![Page 24: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/24.jpg)
Meta-imitation learning from human demos
Figure adapted from BAIR Blog Post: One-Shot Imitation from Watching Videos
demonstration 1-shot imitation
![Page 25: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/25.jpg)
Meta-imitation learning from humans
Learned loss
PG
Training: learn a loss function that adapts the policy
Test: perform task given a single human demo
Supervised by paired robot-human demos only during meta-training!
Meta-training Test time
Yu et al. 2018
![Page 26: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/26.jpg)
Model-Based meta-RL
Figure adapted from Anusha Nagabandi
What if the system dynamics change?
- Low battery
- Malfunction
- Different terrain
Re-train model? :(
![Page 27: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/27.jpg)
Model-Based meta-RL
Figure adapted from Anusha Nagabandi
Supervised model learning
MPC
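A toy sketch of the two pieces named on this slide: supervised adaptation of a dynamics model from recent transitions, followed by planning with the adapted model. Everything here is an illustrative assumption: the 1-D dynamics `s' = s + k * a`, the single model parameter `k_hat`, and a one-step lookahead over a grid of candidate actions standing in for MPC. In the model-based meta-RL setting, the starting value of the model would itself be meta-learned, which is not shown.

```python
def adapt_model(k_hat, transitions, lr=0.5, steps=20):
    """Supervised adaptation: gradient steps on the squared one-step prediction error."""
    for _ in range(steps):
        grad = 0.0
        for state, action, next_state in transitions:
            predicted = state + k_hat * action
            grad += 2.0 * (predicted - next_state) * action
        k_hat -= lr * grad / len(transitions)
    return k_hat


def mpc_action(k_hat, state, goal, candidates):
    """One-step lookahead: pick the action whose predicted next state is closest to goal."""
    return min(candidates, key=lambda a: abs(state + k_hat * a - goal))


TRUE_GAIN = 0.5      # e.g. the actuator is weaker than the model expects
PRIOR_GAIN = 1.0     # model parameter before adaptation
recent = [(0.0, a, TRUE_GAIN * a) for a in (1.0, -1.0, 0.5)]
candidates = [i / 10 for i in range(-20, 21)]

adapted_gain = adapt_model(PRIOR_GAIN, recent)
stale_plan = mpc_action(PRIOR_GAIN, 0.0, 1.0, candidates)
adapted_plan = mpc_action(adapted_gain, 0.0, 1.0, candidates)
```

With the stale model the planner picks `a = 1.0` and the real system only reaches 0.5; after adaptation it picks `a = 2.0`, which reaches the goal under the true dynamics.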
![Page 29: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/29.jpg)
Break
![Page 30: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/30.jpg)
Aside: POMDPs
State is unobserved (hidden)
Observation gives incomplete information about the state
Example: incomplete sensor data
“That Way We Go” by Matt Spangler
![Page 31: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/31.jpg)
The POMDP view of meta-RL
Two approaches to solve: 1) policy with memory (RNN) 2) explicit state estimation
![Page 32: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/32.jpg)
Model belief over latent task variables
[Figure, two panels.
POMDP for unobserved state: chain of states S0, S1, S2 with a goal state; “Where am I?”; observed a = “left”, s = S0, r = 0; estimate s = S0.
POMDP for unobserved task: goals for MDP 0, MDP 1, MDP 2; “What task am I in?”; observed a = “left”, s = S0, r = 0; s = S0.]
![Page 33: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/33.jpg)
Model belief over latent task variables
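[Figure, two panels side by side.
POMDP for unobserved state: chain of states S0, S1, S2 with a goal state; “Where am I?”; observed a = “left”, s = S0, r = 0; estimate s = S0.
POMDP for unobserved task: goals for MDP 0, MDP 1, MDP 2; “What task am I in?”; observed a = “left”, s = S0, r = 0; s = S0; the task is drawn from the belief (“sample”).]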
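A minimal sketch of maintaining a belief over the latent task and sampling from it. It assumes, for illustration only, three candidate MDPs whose goals are S0, S1, and S2, and a deterministic reward of 1 exactly when the agent is in the goal state; `update_belief` and `sample_task` are hypothetical names.

```python
import random

GOALS = ["S0", "S1", "S2"]   # goal state of MDP 0, MDP 1, MDP 2 (assumed)


def update_belief(belief, state, reward, goals):
    """Bayes rule over tasks given one observed (state, reward) pair."""
    # Likelihood is 1 if the reward is consistent with the task's goal, else 0.
    likelihoods = [1.0 if (reward == 1.0) == (state == goal) else 0.0
                   for goal in goals]
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation is inconsistent with every task")
    return [u / total for u in unnormalized]


def sample_task(belief, rng):
    """Posterior sampling: draw one task hypothesis and act as if it were true."""
    return rng.choices(range(len(belief)), weights=belief)[0]


prior = [1.0 / 3.0] * 3
# The agent is in S0 and receives r = 0, so the goal cannot be S0.
posterior = update_belief(prior, "S0", 0.0, GOALS)
```

After the observation the belief puts zero mass on MDP 0 and splits the rest evenly between MDP 1 and MDP 2, so sampled hypotheses send the agent to explore only the remaining candidate goals.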
![Page 34: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/34.jpg)
Solution #3: task-belief states
Stochastic encoder
![Page 35: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/35.jpg)
Solution #3: posterior sampling in action
![Page 36: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/36.jpg)
Solution #3: belief training objective
Stochastic encoder
“Likelihood” term (Bellman error)
“Regularization” term / information bottleneck
Variational approximations to posterior and prior
See Control as Inference (Levine 2018) for justification of thinking of Q as a pseudo-likelihood
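The objective itself was lost in extraction. A sketch consistent with the labels above, written as a loss to minimize, is given below. The notation is assumed here: $q_\phi(z \mid c^{\mathcal{T}})$ is the stochastic encoder applied to context $c^{\mathcal{T}}$ collected in task $\mathcal{T}$, $p(z)$ is the prior, $\mathcal{L}_{\text{critic}}$ is the Bellman error playing the role of the “likelihood” term, and $\beta$ weights the information bottleneck.

```latex
\mathcal{L}(\phi) =
\mathbb{E}_{\mathcal{T}} \Big[
  \mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}
    \big[ \mathcal{L}_{\text{critic}}(\mathcal{T}, z) \big]
  + \beta \, D_{\mathrm{KL}}\big( q_\phi(z \mid c^{\mathcal{T}}) \,\|\, p(z) \big)
\Big]
```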
![Page 37: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/37.jpg)
Solution #3: encoder design
Don’t need to know the order of transitions in order to identify the MDP (Markov property)
Use a permutation-invariant encoder for simplicity and speed
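One way to build a permutation-invariant encoder is to map each transition to an independent Gaussian factor and multiply the factors, which, to my understanding, is the construction used by Rakelly et al. 2019. The sketch below assumes a one-dimensional latent, and `transition_factor` is a hypothetical placeholder for a learned per-transition network.

```python
def transition_factor(transition):
    """Placeholder for a learned network mapping one transition to (mean, variance)."""
    state, action, reward, next_state = transition
    return reward, 1.0   # hypothetical: a real encoder would be trained


def product_of_gaussians(factors):
    """Multiply 1-D Gaussian factors: precisions add, means are precision-weighted."""
    precision = sum(1.0 / var for _, var in factors)
    variance = 1.0 / precision
    mean = variance * sum(mu / var for mu, var in factors)
    return mean, variance


def encode(context):
    """Belief over the latent task variable given a set of transitions."""
    return product_of_gaussians([transition_factor(t) for t in context])


context = [(0, 0, 0.0, 1), (1, 1, 2.0, 2), (2, 0, 1.0, 0)]
mean, variance = encode(context)
mean_rev, variance_rev = encode(list(reversed(context)))
```

Because the product only involves sums over transitions, reordering the context leaves the result unchanged, and the variance shrinks as more transitions are added.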
![Page 38: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/38.jpg)
Aside: Soft Actor-Critic (SAC)
“Soft”: Maximize rewards *and* entropy of the policy (higher entropy policies explore better)
“Actor-Critic”: Model *both* the actor (aka the policy) and the critic (aka the Q-function)
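For reference, the “soft” objective is usually written as the expected reward plus an entropy bonus. The temperature $\alpha$ and the state-action marginal $\rho_\pi$ are standard notation from Haarnoja et al. 2018 rather than symbols that appear on this slide.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \big]
```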
SAC: Haarnoja et al. 2018; Control as Inference tutorial: Levine 2018; SAC BAIR Blog Post 2019
Dclaw robot turns valve from pixels
Much more sample efficient than on-policy algorithms.
![Page 39: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/39.jpg)
Soft Actor-Critic
![Page 40: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/40.jpg)
Solution #3: task-belief + SAC
Rakelly & Zhou et al. 2019
SAC
Stochastic encoder
![Page 41: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/41.jpg)
Meta-RL experimental domains
Variable reward function (locomotion direction, velocity, or goal)
Variable dynamics (joint parameters)
Simulated via MuJoCo (Todorov et al. 2012), tasks proposed by Finn et al. 2017 and Rothfuss et al. 2019
![Page 42: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/42.jpg)
ProMP (Rothfuss et al. 2019), MAML (Finn et al. 2017), RL² (Duan et al. 2016)
![Page 43: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/43.jpg)
ProMP (Rothfuss et al. 2019), MAML (Finn et al. 2017), RL² (Duan et al. 2016)
Task-belief + SAC (Rakelly & Zhou et al. 2019) is 20-100X more sample efficient than these baselines!
![Page 44: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/44.jpg)
two views of meta-RL
Slide adapted from Sergey Levine and Chelsea Finn
![Page 45: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/45.jpg)
Summary
Slide adapted from Sergey Levine and Chelsea Finn
![Page 46: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/46.jpg)
Frontiers
![Page 47: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/47.jpg)
Where do tasks come from?
Idea: generate self-supervised tasks and use them during meta-training
[Objective: a maximization whose terms are annotated as follows; the equation itself was lost in extraction]
- Separate skills visit different states
- Skills should be high entropy
Ant learns to run in different directions, jump, and flip
Point robot learns to explore different areas after the hallway
Eysenbach et al. 2018, Gupta et al. 2018
Limitations
- The assumption that skills shouldn’t depend on actions is not always valid
- Distribution shift between meta-train and meta-test
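The maximized objective is not recoverable from the extracted text. As a hedged reconstruction, the skill-discovery objective in Eysenbach et al. 2018 has the following form, where $Z$ is the skill variable, $S$ the state, and $A$ the action. The first term corresponds to separate skills visiting different states, the second to high-entropy skills, and the third encodes the assumption that skills are distinguished by states rather than actions.

```latex
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
```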
![Page 48: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/48.jpg)
How to explore efficiently in a new task?
Learn exploration strategies better...
[Figure panels: plain gradient meta-RL; latent-variable model]
Bias exploration with extra information…
[Figure panels: human-provided demo; robot attempt #1, w/ only demo info; robot attempt #2, w/ demo + reward info]
Gupta et al. 2018, Rakelly et al. 2019, Zhou et al. 2019
![Page 49: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/49.jpg)
Online meta-learning
Meta-training tasks are presented in a sequence rather than a batch
Finn et al. 2019
![Page 50: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/50.jpg)
Summary
Meta-RL finds an adaptation procedure that can quickly adapt the policy to a new task
Three main solution classes: RNN, optimization, and task-belief
Several learning paradigms: model-free (on- and off-policy), model-based, and imitation learning
Connection to goal-conditioned RL and POMDPs
Some open problems (there are more!): better exploration, defining task distributions, meta-learning online
![Page 51: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/51.jpg)
References
Recurrent meta-RL
- Learning to Reinforcement Learn, Wang et al. 2016
- Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al. 2016
- Memory-Based Control with Recurrent Neural Networks, Heess et al. 2015
Optimization-based meta-RL
- Model-Agnostic Meta-Learning, Finn et al. 2017
- Proximal Meta-Policy Search, Rothfuss et al. 2018
Optimization-based meta-RL + imitation learning
- One-Shot Visual Imitation Learning via Meta-Learning, Yu et al. 2017
- One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning, Yu et al. 2018
Model-based meta-RL
- Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning, Nagabandi et al. 2019
Off-policy meta-RL
- Soft Actor-Critic, Haarnoja et al. 2018
- Control as Inference, Levine 2018
- Efficient Off-Policy Meta-RL via Probabilistic Context Variables, Rakelly et al. 2019
![Page 52: Meta Reinforcement Learningrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-20.pdfFig adapted from Rothfuss et al. 2018 How exploration is learned automatically Causal relationship](https://reader033.fdocuments.us/reader033/viewer/2022041613/5e3907205614c21b80356964/html5/thumbnails/52.jpg)
Open Problems
- Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al. 2018
- Unsupervised Meta-learning for RL, Gupta et al. 2018
- Meta-Reinforcement Learning of Structured Exploration Strategies, Gupta et al. 2018
- Watch, Try, Learn: Meta-Learning from Demonstrations and Reward, Zhou et al. 2019
- Online Meta-Learning, Finn et al. 2019
Slides and Figures
- Some slides adapted from Meta-Learning Tutorial at ICML 2019, Finn and Levine
- Robot illustrations by Matt Spangler, mattspangler.com