When to use parametric models in reinforcement learning?

Hado van Hasselt, Matteo Hessel, John Aslanides (DeepMind)

Motivation / prior work

Baseline: Rainbow DQN

Performance (= score) after 100,000 steps (= 400,000 frames)

Question

Why does the parametric model perform better than replay?

Models and planning

A model is a function:

r, s’ = m(s, a)

We can use models to plan: spend more compute to improve prediction & policies.

We can also plan with experience replay:

r_{n+1}, s_{n+1} = replay(s_n, a_n)

● Experience replay is similar to a non-parametric model
● But we can only query it at observed state-action pairs (s_n, a_n), n < t (see the sketch below).
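To make the analogy concrete, here is a minimal sketch of a replay buffer viewed as a non-parametric model, assuming hashable states and actions; the class and method names are illustrative, not the paper's code. It can be sampled like replay, but "queried" only at state-action pairs it has actually stored, whereas a parametric model m(s, a) can be queried anywhere.

```python
import random
from collections import defaultdict

class ReplayModel:
    """A replay buffer viewed as a non-parametric model: it can only be
    'queried' at state-action pairs that were actually observed."""

    def __init__(self):
        self.transitions = []                     # list of (s, a, r, s') tuples
        self.by_state_action = defaultdict(list)  # (s, a) -> list of observed (r, s')

    def add(self, s, a, r, s_next):
        self.transitions.append((s, a, r, s_next))
        self.by_state_action[(s, a)].append((r, s_next))

    def sample(self):
        # Replay-style use: sample any stored transition uniformly.
        return random.choice(self.transitions)

    def query(self, s, a):
        # Model-style use: only answers at observed (s_n, a_n), n < t.
        stored = self.by_state_action.get((s, a))
        if not stored:
            raise KeyError("replay holds no transition for this (s, a) pair")
        return random.choice(stored)              # r, s'
```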

Replay and models: properties

Typically, models use less memory and more compute than replay.

But what about data efficiency & performance?

Algorithms

K: iterations
M: real steps per iteration
P: planned steps per iteration

Algorithm                         | Iterations (K) | Real steps per iteration (M) | Planned steps per iteration (P)
SimPLe                            | 16             | 6,400                        | 800,000
Rainbow DQN                       | 12,500,000     | 4                            | 32
Data-efficient Rainbow DQN (new)  | 100,000        | 1                            | 32
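The three columns correspond to a generic training schedule. The sketch below is an assumed, simplified loop showing how K, M and P interact; the agent, env and model interfaces are hypothetical, and this is not the implementation of any specific agent in the table.

```python
def train(agent, env, model, K, M, P):
    """Generic schedule: K iterations, each collecting M real environment
    steps and performing P planned updates (hypothetical interfaces)."""
    s = env.reset()
    for _ in range(K):
        for _ in range(M):                           # real experience
            a = agent.act(s)
            s_next, r, done = env.step(a)
            agent.observe(s, a, r, s_next)           # e.g. store in replay and/or fit the model
            s = env.reset() if done else s_next
        for _ in range(P):                           # planned experience
            s_p, a_p = agent.sample_state_action()   # drawn from some state-sampling distribution d
            r_p, s_next_p = model.predict(s_p, a_p)  # parametric model, or a replay lookup
            agent.update(s_p, a_p, r_p, s_next_p)
```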

Algorithms

Algorithm                         | Total real experience (K x M) | Total planned experience (K x P)
SimPLe                            | 100,000 (400K frames)         | 15,200,000
Rainbow DQN                       | 50,000,000 (200M frames)      | 400,000,000
Data-efficient Rainbow DQN (new)  | 100,000 (400K frames)         | 3,200,000


When do models help performance?

What if learning the model is easy?

Surprising instabilities

● Even with perfect models, learning can be unstable
● This happens surprisingly easily!
● Related to the deadly triad:
  ○ the state-sampling distribution d and the model m may mismatch, even if the model is perfect (see the sketch below).
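A minimal illustration of this kind of instability is the classic two-state "w, 2w" example (Sutton & Barto, Ch. 11), sketched below under assumed simplifications: the model is exact, but planned updates are sampled from a state distribution that does not match the on-policy distribution, and semi-gradient TD with linear function approximation then diverges. This is an illustrative example, not one of the experiments in the talk.

```python
# Two states with features 1 and 2, deterministic transitions s0 -> s1 -> s1, reward 0,
# so the true values are 0. The model below is *perfect*, yet always planning from s0
# (a mismatched state-sampling distribution d) makes the weight w grow without bound
# whenever gamma > 0.5. Hypothetical sketch in the spirit of Sutton & Barto's example.
gamma, alpha = 0.99, 0.1
features = {0: 1.0, 1: 2.0}     # linear values: v(s) = w * features[s]

def model(s):                   # perfect model: every state leads to s1 with reward 0
    return 0.0, 1

w = 1.0
for step in range(50):
    s = 0                       # mismatched sampling: planned updates only ever start from s0
    r, s_next = model(s)
    td_error = r + gamma * w * features[s_next] - w * features[s]
    w += alpha * td_error * features[s]        # semi-gradient TD(0) update
print(w)                        # w has blown up instead of converging to the true value 0
```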

How can we best use learnt models?

Forward planning for credit assignment

Backward planning for credit assignment

Forward planning for behaviour

Conclusions

1. Replay can be used for planning
2. There are different ways to use models:
   a. Forward planning for credit assignment
   b. Forward planning for immediate behaviour
   c. Backward planning for credit assignment
3. b. and c. might be better than a.
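A rough sketch of the difference between a. and c., assuming a tabular Q-function and hypothetical forward_model / backward_model interfaces (the agents discussed in the talk use neural networks); the point is only where the imagined transition sits relative to the state being updated.

```python
GAMMA, ALPHA = 0.99, 0.1

def forward_planning_update(Q, s, a, forward_model):
    """(a) Dyna-style: imagine the *next* transition and update the sampled pair towards it."""
    r, s_next = forward_model(s, a)                       # r, s' = m(s, a)
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

def backward_planning_update(Q, s, backward_model):
    """(c) Imagine transitions that *lead into* s and push its (possibly freshly
    learnt) value back into those predecessor pairs."""
    for s_prev, a_prev, r_prev in backward_model(s):
        target = r_prev + GAMMA * max(Q[s].values())
        Q[s_prev][a_prev] += ALPHA * (target - Q[s_prev][a_prev])
```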

Thank you

https://arxiv.org/abs/1906.05243

Related work on forward planning for credit assignment

“Dyna, an integrated architecture for learning, planning, and reacting” Rich Sutton (1991)
← argues for combining learning with forward planning for credit assignment, using a learnt model

“The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces” Zach Holland, Eric Talvitie, Mike Bowling (2018)
← shows that how you spend your planning budget matters (e.g., short vs long rollouts)

“Model-Based Reinforcement Learning for Atari” Lukasz Kaiser et al. (2019)
← introduces the SimPLe algorithm we compare against

Related work on forward planning for behaviour

Essentially: lots of work on model-predictive control, e.g.:

“Model predictive heuristic control” J. Richalet, A. Rault, J. Testud, and J. Papon (1978)

“Model predictive control: past, present and future” M. Morari and J. H. Lee (1999)

“Model predictive control: Recent developments and future promise” D. Q. Mayne (2014)

“An online learning approach to model predictive control” N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots (2019)
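Common to these methods is that the model is used at decision time: roll candidate action sequences forward, score them, execute only the first action, then re-plan. Below is a minimal random-shooting sketch of that idea, assuming a hypothetical interface model(s, a) -> (r, s_next); it is illustrative, not any of the cited algorithms.

```python
import random

def mpc_action(s, model, actions, horizon=5, n_candidates=64, gamma=0.99):
    """Random-shooting model-predictive control: forward planning for behaviour."""
    best_return, best_first_action = float('-inf'), None
    for _ in range(n_candidates):
        plan = [random.choice(actions) for _ in range(horizon)]   # candidate action sequence
        s_sim, ret = s, 0.0
        for t, a in enumerate(plan):
            r, s_sim = model(s_sim, a)        # imagined transition
            ret += (gamma ** t) * r
        if ret > best_return:
            best_return, best_first_action = ret, plan[0]
    return best_first_action                  # execute only the first action, then re-plan
```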

Related work on backward planning for credit assignment

Comparatively less…? (More common: plan backwards from a real or imagined goal.)

E.g., it doesn’t show up in:
“Model Learning for Robot Control: A Survey” D. Nguyen-Tuong, J. Peters (2011)
“Reinforcement Learning: An Introduction” R. Sutton & A. Barto (2018)

Note: in dynamic programming and replay, forward and backward planning are equivalent.
The only distinction is then which states to update, i.e., prioritized sweeping:
“Prioritized sweeping: Reinforcement learning with less data and less time” Moore, Atkeson (1993)
“Efficient Learning and Planning Within the Dyna Framework” Peng, Williams (1993)
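For reference, a minimal sketch of the prioritized-sweeping idea, assuming a tabular Q, a deterministic model dictionary and a predecessor map; these interfaces are illustrative, not the cited authors' code. States whose values just changed push their predecessors onto a priority queue, so updates sweep backwards through the state space.

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, queue, n_updates,
                         gamma=0.99, theta=1e-3):
    """Assumes Q[s][a], model[(s, a)] = (r, s_next), predecessors[s] = set of (s_prev, a_prev),
    and queue = list of (-priority, (s, a)) heap entries."""
    for _ in range(n_updates):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)                  # highest-priority pair first
        r, s_next = model[(s, a)]
        Q[s][a] = r + gamma * max(Q[s_next].values())     # full backup at (s, a)
        for s_prev, a_prev in predecessors.get(s, ()):    # predecessors whose values may now change
            r_prev, _ = model[(s_prev, a_prev)]
            priority = abs(r_prev + gamma * max(Q[s].values()) - Q[s_prev][a_prev])
            if priority > theta:
                heapq.heappush(queue, (-priority, (s_prev, a_prev)))
```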

The idea of using learnt models to plan backwards for credit assignment is perhaps underexplored

Rainbow DQN hyperparameter changes