Meta Reinforcement Learning


Kate Rakelly, 11/13/19

Questions we seek to answer

Motivation: What problem is meta-RL trying to solve?

Context: What is the connection to other problems in RL?

Solutions: What are solution methods for meta-RL and their limitations?

Open Problems: What are the open problems in meta-RL?

Robot art by Matt Spangler, mattspangler.com

Meta-learning problem statement: supervised learning vs. reinforcement learning

(Figure: few-shot image classification. Labeled examples of "Dalmatian," "German shepherd," and "Pug"; a new corgi image to classify.)

Meta-RL problem statement

Regular RL: learn a policy for a single task. Meta-RL: learn an adaptation rule.

Meta-training / Outer loop

Adaptation / Inner loop

Relation to goal-conditioned policies

Meta-RL can be viewed as a goal-conditioned policy where the task information is inferred from experience

Task information could be about the dynamics or reward functions

Rewards are a strict generalization of goals.

Slide adapted from Chelsea Finn

Relation to goal-conditioned policies

Slide adapted from Chelsea Finn

Q: What is an example of a reward function that can’t be expressed as a goal state?

A: For example, seeking a target while avoiding obstacles, or rewards with action penalties.

Adaptation

What should the adaptation procedure do?

- Explore: Collect the most informative data

- Adapt: Use that data to obtain the optimal policy

General meta-RL algorithm outline

In practice, compute update across a batch of tasks

Different algorithms differ in:
- the choice of adaptation function f
- the choice of loss function L

Can do more than one round of adaptation
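One compact way to write this outline (the notation here is ours, not from the slides): the adaptation rule f_θ maps data collected in a task to task-specific parameters, and meta-training tunes θ so that the adapted policy does well in expectation over the task distribution.

```latex
% Inner loop / adaptation: map data D_T collected in task T to adapted parameters
\phi_{\mathcal{T}} \;=\; f_\theta\big(\mathcal{D}_{\mathcal{T}}\big)

% Outer loop / meta-training: minimize the outer-loop loss L of the adapted policy,
% evaluated on fresh data D'_T, in expectation over the task distribution p(T)
\theta^\star \;=\; \arg\min_\theta\; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}
\Big[\, L\big(\phi_{\mathcal{T}},\, \mathcal{D}'_{\mathcal{T}}\big) \Big]
```

The choices of f and L are exactly where the solution methods below differ.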

Solution Methods

Solution #1: recurrence

Implement the policy as a recurrent network and train it across a set of tasks.

Persist the hidden state across episode boundaries for continued adaptation!

Duan et al. 2016, Wang et al. 2016, Heess et al. 2015. Fig. adapted from Duan et al. 2016


Solution #1: recurrence

Pro: general and expressive. There exists an RNN that can compute any function.

Con: not consistent

What does it mean for adaptation to be “consistent”?

Will converge to the optimal policy given enough data

Solution #1: recurrence

Duan et al. 2016, Wang et al. 2016
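A minimal sketch of such a recurrent policy in PyTorch. This is an illustration only (not the exact architecture from Duan et al. or Wang et al.); the GRU, the discrete action head, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """RL^2-style policy: input is (state, previous action, previous reward,
    previous done flag); the hidden state is what carries task information."""

    def __init__(self, state_dim, num_actions, hidden_dim=128):
        super().__init__()
        # input = state + one-hot previous action + previous reward + previous done flag
        self.gru = nn.GRU(state_dim + num_actions + 2, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, state, prev_action, prev_reward, prev_done, hidden):
        x = torch.cat([state, prev_action, prev_reward, prev_done], dim=-1)
        out, hidden = self.gru(x.unsqueeze(1), hidden)   # process one timestep
        logits = self.action_head(out.squeeze(1))
        return torch.distributions.Categorical(logits=logits), hidden

policy = RecurrentPolicy(state_dim=4, num_actions=2)
hidden = torch.zeros(1, 1, 128)   # reset only when a *new task* begins...
dist, hidden = policy(torch.zeros(1, 4), torch.zeros(1, 2),
                      torch.zeros(1, 1), torch.zeros(1, 1), hidden)
action = dist.sample()
# ...and keep `hidden` across episode boundaries within the same task, so the
# policy can keep adapting from everything it has seen so far.
```

The whole network is trained with an ordinary policy-gradient method across the task distribution; the "learning to learn" lives entirely in the recurrent weights.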

Wait, what if we just fine-tune? Is pretraining a type of meta-learning? Better features = faster learning of a new task!

But fine-tuning is sample inefficient, prone to overfitting, and particularly difficult in RL.

Slide adapted from Sergey Levine

Solution #2: optimization

Finn et al. 2017. Fig adapted from Finn et al. 2017

Learn a parameter initialization from which fine-tuning on a new task works!

Solution #2: optimization

Finn et al. 2017. Fig adapted from Finn et al. 2017

Requires second-order derivatives!
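A toy illustration of where the second-order derivatives come from. This is not MAML for RL: the policy-gradient losses on pre- and post-update trajectories are replaced by a 1-D regression loss so the snippet is self-contained, but the inner/outer loop structure and the `create_graph=True` trick are the same.

```python
import torch

torch.manual_seed(0)
theta = torch.zeros(1, requires_grad=True)        # meta-learned initialization
meta_opt = torch.optim.SGD([theta], lr=0.01)
inner_lr = 0.1

def task_loss(w, a):
    # Stand-in for a task's RL objective: fit y = a*x with the model y = w*x.
    x = torch.linspace(-1.0, 1.0, 16)
    return ((w * x - a * x) ** 2).mean()

for _ in range(200):                              # outer loop (meta-training)
    meta_opt.zero_grad()
    for a in torch.empty(4).uniform_(-2.0, 2.0):  # batch of tasks
        inner = task_loss(theta, a)
        # Keep the graph so the outer gradient can flow *through* the inner update.
        (g,) = torch.autograd.grad(inner, theta, create_graph=True)
        theta_adapted = theta - inner_lr * g      # inner loop: one gradient step
        outer = task_loss(theta_adapted, a)       # evaluate the adapted parameters
        outer.backward()                          # second-order terms reach theta
    meta_opt.step()                               # outer loop update
```

Dropping `create_graph=True` gives the first-order approximation that ignores those second-order terms.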

Solution #2: optimization

Fig adapted from Rothfuss et al. 2018

How exploration is learned automatically

The causal relationship between pre- and post-update trajectories is taken into account

Pre-update parameters receive credit for producing good exploration trajectories


Solution #2: optimization

Fig adapted from Rothfuss et al. 2018


View this as a “return” that encourages gradient alignment
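A first-order Taylor expansion is a standard way to see where the alignment term comes from (this derivation is ours, not from the slides). With the inner step θ' = θ + α∇J^pre(θ), the post-update objective expands as

```latex
J^{\text{post}}(\theta')
\;\approx\;
J^{\text{post}}(\theta)
\;+\;
\alpha\, \nabla_\theta J^{\text{pre}}(\theta) \cdot \nabla_\theta J^{\text{post}}(\theta)
```

so maximizing post-update return rewards pre-update parameters whose exploration gradients point in the same direction as the post-update improvement.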

Solution #2: optimization

Pro: consistent!

Con: not as expressive

Q: When could the optimization strategy be less expressive than the recurrent strategy?

(Figure: suppose reward is given only in one small region of the state space.)

Example: when no rewards are collected, the policy-gradient adaptation step will not change the policy, even though this data gives information about which states to avoid

Solution #2: optimization

Exploring in a sparse reward setting

Fig adapted from Rothfuss et al. 2018

Cheetah running forward and backward after one gradient step

Fig adapted from Finn et al. 2017

Meta-RL on robotic systems

Meta-imitation learning

Figure adapted from BAIR Blog Post: One-Shot Imitation from Watching Videos

Demonstration 1-shot imitation

Meta-imitation learning

Yu et al. 2017

Behavior cloning

Meta-training: run behavior cloning for adaptation. Test time: perform the task given a single robot demo.
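Concretely, in Yu et al. 2017 the inner loop is one (or a few) gradient steps of a behavior-cloning loss on the single demonstration d_T (schematic form, notation ours):

```latex
\theta'_{\mathcal{T}}
\;=\;
\theta \;-\; \alpha \,\nabla_\theta
\sum_{(o_t,\, a_t) \in d_{\mathcal{T}}}
\big\| \pi_\theta(o_t) - a_t \big\|^2
```

and the meta-objective trains θ so that the adapted policy π_{θ'} succeeds at the task.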

Meta-imitation learning from human demos

Figure adapted from BAIR Blog Post: One-Shot Imitation from Watching Videos

demonstration 1-shot imitation

Meta-imitation learning from humans

Learned loss

Meta-training: learn a loss function that adapts the policy, supervised by paired robot-human demos (available only during meta-training!). Test time: perform the task given a single human demo.

Yu et al. 2018

Model-Based meta-RL

Figure adapted from Anusha Nagabandi

What if the system dynamics change?
- Low battery
- Malfunction
- Different terrain

Re-train model? :(

Model-Based meta-RL

Figure adapted from Anusha Nagabandi

Supervised model learning

MPC

Model-Based meta-RL

Video from Nagabandi et al. 2019
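A minimal sketch of the test-time control loop: random-shooting MPC over a dynamics function. The dynamics and reward here are toy stand-ins (a point mass pushed toward a goal), not the planner or model from Nagabandi et al.; in the actual method the dynamics function is the meta-learned model, re-adapted online from the most recent transitions.

```python
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([1.0, 0.0])

def dynamics(state, action):
    # Toy stand-in for the (adapted) learned dynamics model.
    return state + 0.1 * action

def reward(state):
    return -np.linalg.norm(state - goal)

def mpc_action(state, horizon=10, n_candidates=256):
    """Random shooting: sample action sequences, roll them out in the model,
    and return the first action of the best sequence."""
    best_return, best_first = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        s, ret = state.copy(), 0.0
        for a in seq:
            s = dynamics(s, a)
            ret += reward(s)
        if ret > best_return:
            best_return, best_first = ret, seq[0]
    return best_first

state = np.zeros(2)
for t in range(20):
    state = dynamics(state, mpc_action(state))   # replan at every step
```

In the meta-RL setting, a model-adaptation step (e.g., a few gradient updates on the last K transitions) would run before each call to `mpc_action`.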

Break

Aside: POMDPs

The state is unobserved (hidden); the observation gives only incomplete information about the state.

Example: incomplete sensor data

“That Way We Go” by Matt Spangler

The POMDP view of meta-RL

Two approaches to solve: 1) policy with memory (RNN) 2) explicit state estimation

Model belief over latent task variables

POMDP for unobserved state

(Figure: the agent asks "Where am I?" among states S0, S1, S2 while trying to reach a goal state; observing a transition such as a = "left", s = S0, r = 0 narrows its belief over the hidden state.)

POMDP for unobserved task

(Figure: the agent asks "What task am I in?" among MDPs 0, 1, and 2, each with its own goal; the same transition, a = "left", s = S0, r = 0, narrows its belief over the task.)

Model belief over latent task variables

(Figure: the two POMDPs side by side. Inferring the unobserved state ("Where am I?") is analogous to inferring the unobserved task ("What task am I in?"); in both cases the agent maintains a belief and can sample from it.)

Solution #3: task-belief states

Stochastic encoder

Solution #3: posterior sampling in action

Solution #3: belief training objective

Stochastic encoder

“Likelihood” term (Bellman error)

“Regularization” term / information bottleneck

Variational approximations to posterior and prior

See Control as Inference (Levine 2018) for justification of thinking of Q as a pseudo-likelihood
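Putting these pieces together, the training objective is roughly the following (a sketch following Rakelly et al. 2019: c^T is the context collected in task T, q_φ is the stochastic encoder, and the prior p(z) is a unit Gaussian):

```latex
\min_{\phi}\;
\mathbb{E}_{\mathcal{T}}
\left[
\mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}
\Big[
\underbrace{R(\mathcal{T}, z)}_{\text{likelihood: Bellman error}}
\;+\;
\beta\,
\underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid c^{\mathcal{T}}) \,\big\|\, p(z)\big)}_{\text{regularization / information bottleneck}}
\Big]
\right]
```

where R(T, z) is the critic's Bellman error when the Q-function and policy are conditioned on the sampled z.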

Solution #3: encoder design

Don’t need to know the order of transitions in order to identify the MDP (Markov property)

Use a permutation-invariant encoder for simplicity and speed
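A minimal sketch of such an encoder. Mean-pooling per-transition embeddings is one simple permutation-invariant choice; PEARL instead multiplies per-transition Gaussian factors, which is also order-independent. All dimensions and layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """Maps a set of transitions (s, a, r, s') to a Gaussian belief over the
    latent task variable z. Each transition is embedded independently and the
    embeddings are mean-pooled, so the output is order-independent."""

    def __init__(self, transition_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),   # mean and log-variance
        )

    def forward(self, context):                      # context: (num_transitions, transition_dim)
        pooled = self.embed(context).mean(dim=0)     # permutation-invariant pooling
        mu, log_var = pooled.chunk(2, dim=-1)
        return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

encoder = PermutationInvariantEncoder(transition_dim=10, latent_dim=5)
belief = encoder(torch.randn(32, 10))    # belief over z from 32 context transitions
z = belief.rsample()                     # posterior sample conditioned on by the policy
```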

Aside: Soft Actor-Critic (SAC)

“Soft”: Maximize rewards *and* entropy of the policy (higher entropy policies explore better)

“Actor-Critic”: Model *both* the actor (aka the policy) and the critic (aka the Q-function)

SAC (Haarnoja et al. 2018), Control as Inference tutorial (Levine 2018), SAC BAIR Blog Post 2019
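The "soft" part corresponds to the maximum-entropy objective from Haarnoja et al. 2018:

```latex
J(\pi)
\;=\;
\sum_{t}
\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[\, r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```

where the temperature α trades off return against policy entropy.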

Dclaw robot turns a valve from pixels. Much more sample efficient than on-policy algorithms.

Soft Actor-Critic

Solution #3: task-belief + SAC

Rakelly & Zhou et al. 2019

SAC

Stochastic encoder

Variable reward function (locomotion direction, velocity, or goal)

Variable dynamics (joint parameters)

Meta-RL experimental domains

Simulated via MuJoCo (Todorov et al. 2012), tasks proposed by (Finn et al. 2017, Rothfuss et al. 2019)

ProMP (Rothfuss et al. 2019), MAML (Finn et al. 2017), RL2 (Duan et al. 2016)


20-100X more sample efficient!

Two views of meta-RL

Slide adapted from Sergey Levine and Chelsea Finn

Summary

Slide adapted from Sergey Levine and Chelsea Finn

Frontiers

Where do tasks come from?


Ant learns to run in different directions, jump, and flip

Point robot learns to explore different areas after the hallway

Idea: generate self-supervised tasks and use them during meta-training

Separate skills visit different states

Skills should be high entropy

Eysenbach et al. 2018, Gupta et al. 2018
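One concrete instantiation of these two criteria is DIAYN (Eysenbach et al. 2018): maximize the mutual information between skills z and the states they visit, plus the entropy of the policy given the skill, implemented with a pseudo-reward from a learned skill discriminator q_φ:

```latex
\max\;\; I(S; Z) \;+\; \mathcal{H}[A \mid S, Z]
\qquad\Longrightarrow\qquad
r_z(s) \;=\; \log q_\phi(z \mid s) \;-\; \log p(z)
```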

Limitations

The assumption that skills should be distinguished by the states they visit rather than the actions they take is not always valid

Distribution shift from meta-train to meta-test tasks

How to explore efficiently in a new task? Learn better exploration strategies... Bias exploration with extra information...

(Figure: plain gradient meta-RL; latent-variable model; human-provided demo.)

Robot attempt #1, w/ only demo info

Robot attempt #2, w/ demo + reward info

Gupta et al. 2018, Rakelly et al. 2019, Zhou et al. 2019

Online meta-learning

Meta-training tasks are presented in a sequence rather than a batch.

Finn et al. 2019

Summary

Meta-RL finds an adaptation procedure that can quickly adapt the policy to a new task.

Three main solution classes (RNN, optimization, task-belief) and several learning paradigms (model-free on- and off-policy, model-based, imitation learning)

Connection to goal-conditioned RL and POMDPs

Some open problems (there are more!): better exploration, defining task distributions, meta-learning online

References

Recurrent meta-RL
- Learning to Reinforcement Learn, Wang et al. 2016
- RL2: Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al. 2016
- Memory-Based Control with Recurrent Neural Networks, Heess et al. 2015

Optimization-based meta-RL
- Model-Agnostic Meta-Learning, Finn et al. 2017
- ProMP: Proximal Meta-Policy Search, Rothfuss et al. 2018

Optimization-based meta-RL + imitation learning
- One-Shot Visual Imitation Learning via Meta-Learning, Yu et al. 2017
- One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning, Yu et al. 2018

Model-based meta-RL
- Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning, Nagabandi et al. 2019

Off-policy meta-RL
- Soft Actor-Critic, Haarnoja et al. 2018
- Control as Inference, Levine 2018
- Efficient Off-Policy Meta-RL via Probabilistic Context Variables, Rakelly et al. 2019

Open problems
- Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al. 2018
- Unsupervised Meta-Learning for RL, Gupta et al. 2018
- Meta-Reinforcement Learning of Structured Exploration Strategies, Gupta et al. 2018
- Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards, Zhou et al. 2019
- Online Meta-Learning, Finn et al. 2019

Slides and figures
- Some slides adapted from the Meta-Learning Tutorial at ICML 2019, Finn and Levine
- Robot illustrations by Matt Spangler, mattspangler.com
