Model-based Reinforcement Learning

Learn a model from real experience; learn and plan the value function (and/or policy) from real and simulated experience.

DAVIDE BACCIU – [email protected]
Outline
✓ Introduction
✓ Model-based reinforcement learning
✓ Integrated Architectures
✓ Learning and Planning
✓ Real and simulated experience
✓ Monte-Carlo Tree Search
✓ Go use case


Introduction


Model-Based Reinforcement Learning
✓ Previous lecture
  ✓ Learn value function directly from experience
  ✓ Learn policy directly from experience
  ✓ Model-free RL
✓ Today
  ✓ Learn a model directly from experience
  ✓ Use planning to construct a value function or policy
  ✓ Model-based RL
✓ Integrate learning and planning into a single architecture


Model-Based & Model-Free RL
✓ Model-Free RL
  ✓ No model
  ✓ Learn value function (and/or policy) from experience
✓ Model-Based RL
  ✓ Learn a model from experience
  ✓ Plan value function (and/or policy) from model


Model-Free RL


Model-based RL


Model-based RL - Lifecycle

[Diagram: experience feeds model learning to produce a model; planning with the model produces the value/policy; acting with the value/policy generates new experience, closing the loop]


Model-Based RL – Pros and Cons
✓ Advantages
  ✓ Can efficiently learn the model by supervised learning methods
  ✓ Can reason about model uncertainty
✓ Disadvantages
  ✓ First learn a model, then construct a value function: two sources of approximation error


What is a Model?
✓ A model $\mathcal{M}_\eta$ is a representation of an MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\rangle$, parametrized by $\eta$
✓ Assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
✓ So a model $\mathcal{M}_\eta = \langle P_\eta, \mathcal{R}_\eta\rangle$ represents state transitions $P_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$

$$S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$$
$$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$

✓ Typically assume conditional independence between state transitions and rewards

$$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$$


Model Learning
✓ Estimate the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$
✓ This is a supervised learning task (see the sketch below)

$$S_1, A_1 \to R_2, S_2$$
$$S_2, A_2 \to R_3, S_3$$
$$\ldots$$
$$S_{T-1}, A_{T-1} \to R_T, S_T$$

✓ Learning $s, a \to r$ is a regression problem
✓ Learning $s, a \to s'$ is a density estimation problem
✓ Pick an adequate loss function, e.g. mean-squared error, KL divergence, ...
✓ Find parameters $\eta$ that minimise the empirical loss
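To make the supervised view concrete, a minimal sketch of turning one episode into training pairs (all names illustrative, not from the slides):

```python
# Turn one episode into (s, a) -> (r, s') training pairs.
# Illustrative helper, not part of the lecture material.

def episode_to_pairs(states, actions, rewards):
    """states = [S1, ..., ST], actions = [A1, ..., A(T-1)],
    rewards = [R2, ..., RT]."""
    inputs, targets = [], []
    for t in range(len(actions)):
        inputs.append((states[t], actions[t]))        # (S_t, A_t)
        targets.append((rewards[t], states[t + 1]))   # (R_{t+1}, S_{t+1})
    return inputs, targets

# A reward model can then be fit by regression inputs -> r, and a
# transition model by density estimation inputs -> s'.
```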


Some Learning Models
✓ Table Lookup Model
✓ Linear Expectation Model
✓ Linear Gaussian Model
✓ Gaussian Process Model
✓ Deep Belief Network Model
✓ ...


Model Learning with Table Lookup
✓ Model is an explicit MDP $P, \mathcal{R}$
✓ Count visits $N(s, a)$ to each state-action pair

$$P^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$$
$$\mathcal{R}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t$$

✓ Alternatively
  ✓ At each time-step $t$, record the experience tuple $(S_t, A_t, R_{t+1}, S_{t+1})$
  ✓ To sample the model, randomly pick a tuple matching $(s, a, \cdot, \cdot)$ (sketched below)
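A minimal sketch of the tuple-recording variant (illustrative names; terminal successors are stored as `None` here):

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Record experienced tuples grouped by (s, a); sampling one uniformly
    at random reproduces the empirical P and R defined above."""
    def __init__(self):
        self.transitions = defaultdict(list)

    def update(self, s, a, r, s_next):
        # s_next is None when the episode terminated after (s, a)
        self.transitions[(s, a)].append((r, s_next))

    def sample(self, s, a):
        # Returns (r, s') with empirical frequency N(s,a,r,s') / N(s,a)
        return random.choice(self.transitions[(s, a)])
```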


The Return of the AB Example
✓ Two states A, B; no discounting; 8 episodes of experience

1. A, 0, B, 0
2. B, 1
3. B, 1
4. B, 1
5. B, 1
6. B, 1
7. B, 1
8. B, 0

✓ We have constructed a table-lookup model from experience (reconstructed below)
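Reading the counts off these episodes gives a reconstruction of the slide's model diagram: A was visited once and always moved to B with reward 0, so the model sets $P(B \mid A) = 1$ with $r = 0$; B terminated with reward 1 in 6 of its 8 visits, so the model assigns terminal reward 1 with probability $6/8 = 0.75$ and reward 0 with probability $0.25$.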


Planning with a Model
✓ Given a model $\mathcal{M}_\eta = \langle P_\eta, \mathcal{R}_\eta \rangle$
✓ Solve the MDP $\langle \mathcal{S}, \mathcal{A}, P_\eta, \mathcal{R}_\eta \rangle$
✓ Using your favourite planning algorithm
  ✓ Value iteration
  ✓ Policy iteration
  ✓ Tree search
  ✓ ...


Sample-Based Planning
✓ A simple but powerful approach to planning
✓ Use the model only to generate samples
✓ Sample experience from the model

$$S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$$
$$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$

✓ Apply model-free RL to the samples, e.g.
  ✓ Monte-Carlo control
  ✓ SARSA
  ✓ Q-learning
✓ Sample-based planning methods are often more efficient than full-width planning (see the sketch below)
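A minimal sketch of sample-based planning with Q-learning on the table-lookup model above (illustrative assumptions: `Q` is a `defaultdict(float)` of action values, terminal successors are stored as `None`, and only pairs visited in real experience are replayed):

```python
import random

def plan_with_samples(model, Q, actions, n_updates, alpha=0.1, gamma=1.0):
    visited = list(model.transitions)           # (s, a) pairs seen for real
    for _ in range(n_updates):
        s, a = random.choice(visited)
        r, s_next = model.sample(s, a)          # simulated experience
        best = 0.0 if s_next is None else max(Q[(s_next, b)] for b in actions)
        # standard Q-learning backup applied to the simulated transition
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```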


The AB Example - Again
✓ Construct a table-lookup model from real experience
✓ Apply model-free RL to sampled experience

Real experience:
1. A, 0, B, 0
2. B, 1
3. B, 1
4. B, 1
5. B, 1
6. B, 1
7. B, 1
8. B, 0

Sampled experience:
1. B, 1
2. B, 0
3. B, 1
4. A, 0, B, 1
5. B, 1
6. A, 0, B, 1
7. B, 1
8. B, 0

e.g. Monte-Carlo learning on the sampled experience: V(A) = 1, V(B) = 0.75 (worked check below)
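A quick check of those Monte-Carlo estimates on the sampled experience: episodes 4 and 6 start in A, each with undiscounted return $0 + 1 = 1$, so $V(A) = (1 + 1)/2 = 1$; all eight episodes visit B with returns $1, 0, 1, 1, 1, 1, 1, 0$, so $V(B) = 6/8 = 0.75$.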


Planning with an Inaccurate Model
✓ Given an imperfect model $\langle P_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$
✓ The performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, P_\eta, \mathcal{R}_\eta \rangle$, i.e. model-based RL is only as good as the estimated model
✓ When the model is inaccurate, the planning process will compute a suboptimal policy
  ✓ Solution 1 - When the model is wrong, use model-free RL
  ✓ Solution 2 - Reason explicitly about model uncertainty


Integrated Architectures


Real and Simulated Experience
✓ Consider two sources of experience
✓ Real experience - sampled from the environment (true MDP)

$$S' \sim \mathcal{P}^a_{s,s'}$$
$$R = \mathcal{R}^a_s$$

✓ Simulated experience - sampled from the model (approximate MDP)

$$S' \sim P_\eta(S' \mid S, A)$$
$$R = \mathcal{R}_\eta(R \mid S, A)$$


Integrating Learning and Planning
✓ Model-Free RL
  ✓ No model
  ✓ Learn value function (and/or policy) from real experience
✓ Model-Based RL (using Sample-Based Planning)
  ✓ Learn a model from real experience
  ✓ Plan value function (and/or policy) from simulated experience
✓ Dyna
  ✓ Learn a model from real experience
  ✓ Learn and plan value function (and/or policy) from real and simulated experience


Dyna Architecture

[Diagram: the model-based RL lifecycle extended with a direct RL arrow - experience updates the value/policy both directly (direct RL) and through model learning followed by planning; acting closes the loop from value/policy back to experience]


Dyna-Q Algorithm
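The algorithm box on this slide is the tabular Dyna-Q of Sutton & Barto. A minimal sketch, assuming an `env` with `reset()` and `step(a)` returning `(s', r, done)`, a deterministic table-lookup model, and ε-greedy action selection (all names illustrative):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}   # (s, a) -> (r, s') last observed; s' is None at termination
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # (a) act in the real environment and learn directly (direct RL)
            a = random.choice(actions) if random.random() < eps \
                else max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            best = 0.0 if done else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (b) model learning: remember the observed transition
            model[(s, a)] = (r, None if done else s2)
            # (c) planning: n simulated one-step backups from the model
            for _ in range(n_planning):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                pbest = 0.0 if ps2 is None else max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s2
    return Q
```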


Dyna-Q on a Simple Maze

[Figure: learning curves, steps per episode]

Dyna-Q with an Inaccurate Model (I)

The changed environment is harder


Dyna-Q with an Inaccurate Model (II)

The changed environment is easier


Learning with Simulation


Forward Search
✓ Forward search algorithms select the best action by lookahead
✓ They build a search tree with the current state $s_t$ at the root
✓ Using a model of the MDP to look ahead

No need to solve the whole MDP, just the sub-MDP starting from now


Simulation-Based Search (I)
✓ Forward search paradigm using sample-based planning

✓Simulate episodes of experience from now with the model

✓Apply model-free RL to simulated episodes


Simulation-Based Search (II)
✓ Simulate episodes of experience from now with the model

$$\{s_t, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu$$

✓ Apply model-free RL to the simulated episodes
  ✓ Monte-Carlo control ⟹ Monte-Carlo search
  ✓ SARSA ⟹ TD search


Simple Monte-Carlo Search
✓ Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
✓ For each action $a \in \mathcal{A}$
  ✓ Simulate $K$ episodes from the current (real) state $s_t$

$$\{s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$

  ✓ Evaluate each action by its mean return (Monte-Carlo evaluation)

$$Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t^k \;\xrightarrow{P}\; q_\pi(s_t, a)$$

✓ Select the current (real) action with maximum value (a code sketch follows)

$$a_t = \operatorname*{argmax}_{a \in \mathcal{A}} Q(s_t, a)$$
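A minimal sketch, assuming a generative model `model.sample(s, a) -> (r, s')` for $\mathcal{M}_\nu$, a fixed simulation policy `pi`, termination signalled by `s'` being `None`, and no discounting (all names illustrative):

```python
def rollout(model, pi, s):
    """Return of one simulated episode from state s under policy pi."""
    g = 0.0
    while s is not None:
        r, s = model.sample(s, pi(s))
        g += r
    return g

def mc_search(model, pi, actions, s_t, K):
    # For each action: simulate K episodes with the first step forced to
    # that action, score by mean return, then pick the best real action.
    def score(a):
        total = 0.0
        for _ in range(K):
            r, s = model.sample(s_t, a)
            total += r + (rollout(model, pi, s) if s is not None else 0.0)
        return total / K
    return max(actions, key=score)
```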


Monte-Carlo Tree Search (MCTS) - Evaluate
✓ Given a model $\mathcal{M}_\nu$
✓ Simulate $K$ episodes from the current state $s_t$ using the current simulation policy $\pi$

$$\{s_t, A_t^k, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$

✓ Build a search tree containing visited states and actions
✓ Evaluate states $Q(s, a)$ by the mean return of episodes from $s, a$

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(S_u, A_u = s, a)\, G_u \;\xrightarrow{P}\; q_\pi(s, a)$$

✓ After the search is finished, select the current (real) action with maximum value in the search tree

$$a_t = \operatorname*{argmax}_{a \in \mathcal{A}} Q(s_t, a)$$


Monte-Carlo Tree Search - Simulate
✓ In MCTS, the simulation policy $\pi$ improves
✓ Each simulation consists of two phases (in-tree, out-of-tree)
  ✓ Tree policy (improves): pick actions to maximise $Q(S, A)$
  ✓ Default policy (fixed): pick actions randomly
✓ Repeat (each simulation)
  ✓ Evaluate states $Q(S, A)$ by Monte-Carlo evaluation
  ✓ Improve the tree policy, e.g. by $\epsilon$-greedy(Q)
✓ This is Monte-Carlo control applied to simulated experience
✓ Converges on the optimal search tree, $Q(S, A) \to q_*(S, A)$ (a compact sketch follows)
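A compact MCTS sketch (illustrative, tabular, undiscounted). Assumptions not in the slides: `model.sample(s, a)` is a generative episodic model returning `(r, s')` with `None` signalling termination, `actions` is the action set, and rollouts use a uniform random default policy:

```python
import random
from collections import defaultdict

def mcts(model, actions, s_t, K, eps=0.1):
    Q = defaultdict(float)
    N = defaultdict(int)
    tree = {s_t}                              # states expanded so far
    for _ in range(K):
        s, path = s_t, []
        # In-tree phase: eps-greedy tree policy over the current Q
        while s is not None and s in tree:
            a = random.choice(actions) if random.random() < eps \
                else max(actions, key=lambda b: Q[(s, b)])
            r, s2 = model.sample(s, a)
            path.append((s, a, r))
            s = s2
        if s is not None:
            tree.add(s)                       # expand one new node
        # Out-of-tree phase: random default policy to episode end
        g = 0.0
        while s is not None:
            r, s = model.sample(s, random.choice(actions))
            g += r
        # Backup: each (s, a) on the path gets the return from there on
        for s_b, a_b, r_b in reversed(path):
            g += r_b
            N[(s_b, a_b)] += 1
            Q[(s_b, a_b)] += (g - Q[(s_b, a_b)]) / N[(s_b, a_b)]
    return max(actions, key=lambda a: Q[(s_t, a)])
```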


Advantages of MCTS
✓ Highly selective best-first search

✓Evaluates states dynamically (unlike e.g. DP)

✓Uses sampling to break curse of dimensionality

✓Works for black-box models (only requires samples)

✓Computationally efficient, anytime, parallelisable


Temporal-Difference Search
✓ Simulation-based search
✓ Using TD instead of MC (bootstrapping)
✓ MCTS applies MC control to the sub-MDP from now
✓ TD search applies SARSA to the sub-MDP from now


MC vs TD Search
✓ For model-free reinforcement learning, bootstrapping is helpful
  ✓ TD learning reduces variance but increases bias
  ✓ TD learning is usually more efficient than MC
  ✓ TD($\lambda$) can be much more efficient than MC
✓ For simulation-based search, bootstrapping is also helpful
  ✓ TD search reduces variance but increases bias
  ✓ TD search is usually more efficient than MC search
  ✓ TD($\lambda$) search can be much more efficient than MC search


TD Search
✓ Simulate episodes from the current (real) state $s_t$
✓ Estimate the action-value function $Q(s, a)$
✓ For each step of simulation, update action-values by SARSA (a minimal sketch follows)

$$\Delta Q(S, A) = \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$

✓ Select actions based on the action-values $Q(s, a)$ (e.g. $\epsilon$-greedy)
✓ May also use function approximation for $Q$
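A minimal TD search sketch: SARSA on simulated episodes from $s_t$, with the same illustrative model interface as the sketches above:

```python
import random
from collections import defaultdict

def td_search(model, actions, s_t, K, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)

    def pick(s):  # eps-greedy over the current action values
        return random.choice(actions) if random.random() < eps \
            else max(actions, key=lambda b: Q[(s, b)])

    for _ in range(K):
        s, a = s_t, pick(s_t)
        while s is not None:
            r, s2 = model.sample(s, a)
            a2 = pick(s2) if s2 is not None else None
            target = 0.0 if s2 is None else Q[(s2, a2)]
            # SARSA update on the simulated step
            Q[(s, a)] += alpha * (r + gamma * target - Q[(s, a)])
            s, a = s2, a2
        # each simulated episode both evaluates and improves the policy
    return max(actions, key=lambda a: Q[(s_t, a)])
```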


Dyna-2
✓ In Dyna-2, the agent stores two sets of feature weights
  ✓ Long-term memory
  ✓ Short-term (working) memory
✓ Long-term memory is updated from real experience using TD learning
  ✓ General domain knowledge that applies to any episode
✓ Short-term memory is updated from simulated experience using TD search
  ✓ Specific local knowledge about the current situation
✓ The overall value function is the sum of the long-term and short-term memories (see the note below)
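As a note on the last point: in the original Dyna-2 formulation (Silver, Sutton and Müller, 2008), with long-term weights $\theta$ over features $\phi$ and short-term weights $\bar{\theta}$ over features $\bar{\phi}$, the combined value estimate is

$$\bar{Q}(s, a) = \phi(s, a)^\top \theta + \bar{\phi}(s, a)^\top \bar{\theta}$$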


Go Case Study


Game of Go

✓The ancient oriental game of Go is 2500 years old

✓Considered to be the hardest classic board game

✓Considered a grand challenge task for AI (since McCarthy)

✓Traditional game-tree search has long failed in Go


Go Rules
✓ Usually played on a 19x19 board, also 13x13 or 9x9

✓Simple rules, complex strategy

✓Black and white place down stones alternately

✓Surrounded stones are captured and removed

✓The player with more territory wins the game


Position Evaluation in Go
✓ How good is a position $s$?
✓ Reward function (undiscounted):
  ✓ $R_t = 0$ for all non-terminal steps $t < T$
  ✓ $R_T = 1$ if Black wins
  ✓ $R_T = 0$ if White wins
✓ Policy $\pi = \langle \pi_B, \pi_W \rangle$ selects moves for both players (B, W)
✓ Value function (how good is position $s$):

$$v_\pi(s) = \mathbb{E}[R_T \mid S = s] = \mathbb{P}[\text{Black wins} \mid S = s]$$
$$v_*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)$$


Go – Monte-Carlo Evaluation

[Figure: from the current position s, four simulated games with outcomes 1, 1, 0, 0, giving V(s) = 2/4 = 0.5]


Applying Monte-Carlo Tree Search (I-V)

[Figures: five successive MCTS iterations on a Go position, growing the search tree and updating its node values]

MCTS in Computer Go

[Figure: progress of MCTS programs in Computer Go - before AlphaGo!]

TD Search in Computer Go


AlphaGo
✓ Combining what we have seen so far
  ✓ Value function learning
  ✓ Policy gradient
  ✓ Monte-Carlo Tree Search
✓ Convolutional neural networks to extract a meaningful state representation


Deep Learning in AlphaGo

[Figure: two deep networks taking the board position s as input. The value network outputs V(s) - is the player going to win with the current board? The policy network outputs P(a|s) - how much preference for a specific move in the current board?]


Supervised-Reinforcement Learning Pipeline (Offline Phase)

[Figure: pipeline - maximum-likelihood policy learning on human expert games, then policy gradient on self-play, then value function learning on self-play (by the resulting policy)]


Simulation Phase - Reducing Search Depth

Using the learned value network


Simulation Phase - Reducing Search Breadth

Bias MCTS towards the most promising branches using the policy network


Progress in Computer Go (revised)


The advent of deep learning and AlphaGo


Wrap-up


Take (stay) home messages


✓ Model-based RL is effective
  ✓ If you know the rules of the world (game), you can use them to simulate experience
  ✓ Can use the model of the environment to simulate experience
  ✓ Can integrate simulated experience with real-world experience (Dyna)
✓ Monte-Carlo Tree Search
  ✓ Assess the value of the current state by looking ahead with episodes sampled in simulation
  ✓ Works for black-box models and is highly efficient
✓ TD Search
  ✓ Update action-state values by SARSA on simulated episodes
  ✓ As usual, reduces variance w.r.t. MCTS but increases bias
  ✓ Possibly more efficient than MCTS


Coming up
✓ Exploration and Exploitation
  ✓ Simple naïve exploration (ε-greedy)
  ✓ Optimistic approaches
  ✓ Probability matching & Information Value
✓ Multi-armed bandits
✓ Imitation Learning
  ✓ Demonstration techniques
  ✓ Inverse reinforcement learning
  ✓ Reinforcement learning with generative models