Transcript of “RL + DL” by Rob Platt, Northeastern University (45 slides). Some slides are from David Silver’s Deep RL Tutorial.

Page 1:

RL + DL

Rob Platt Northeastern University

Some slides from David Silver, Deep RL Tutorial

Page 2:

Deep RL is an exciting recent development

Page 3:

Deep RL is an exciting recent development

Page 4:

Remember the RL scenario...

[Diagram: the Agent sends action a to the World; the World returns state s and reward r]

Agent takes actions
Agent perceives states and rewards

Page 5:

Tabular RL

[Diagram: the World provides a state; the agent selects the action via argmax over a tabular Q-function; an update rule adjusts the table entries]

Page 6:

Tabular RL

[Diagram: as above; the Q-function is a lookup table over (state, action) pairs, adjusted by the update rule]
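To make the tabular picture concrete, here is a minimal sketch (my own illustration, not from the slides; the table sizes are arbitrary) of a tabular Q-function with greedy (argmax) action selection:

```python
import numpy as np

n_states, n_actions = 16, 4              # e.g. a small gridworld (sizes assumed)
Q = np.zeros((n_states, n_actions))      # one table entry per (state, action) pair

def greedy_action(s):
    """Select the action that maximizes the tabular Q-function in state s."""
    return int(np.argmax(Q[s]))
```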

Page 7:

Deep RL

[Diagram: the World provides a state; the agent selects the action via argmax over a Q-function represented by a neural network]

Page 8:

Deep RL

[Diagram: as above; the table is replaced by a neural-network Q-function]

But, why would we want to do this?

Page 9:

Where does “state” come from?

[Diagram: the Agent takes action a and perceives state s and reward r]

Agent takes actions
Agent perceives states and rewards

Earlier, we dodged this question: “it’s part of the MDP problem statement”

But that’s a cop-out. How do we get state?

Typically we can’t use “raw” sensor data as state with a tabular Q-function: it’s too big (e.g. Pacman has something like 2^(num pellets) + … states)

Page 10:

Linear function approximation?

In the RL lecture, we suggested using linear function approximation:

    Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + … + wn*fn(s,a)

Q-function encoded by the weights

Page 11:

Linear function approximation?

In the RL lecture, we suggested using linear function approximation:

    Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + … + wn*fn(s,a)

Q-function encoded by the weights

But, this STILL requires a human designer to specify the features!
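For comparison, a minimal sketch (my own, not from the slides) of such a linear Q-function; feature_fn here is a hypothetical, human-designed feature extractor returning the feature vector f(s,a):

```python
import numpy as np

def linear_q(w, feature_fn, s, a):
    """Q(s,a) = w1*f1(s,a) + ... + wn*fn(s,a): the Q-function is encoded entirely by w."""
    return float(np.dot(w, feature_fn(s, a)))   # feature_fn must be hand-designed
```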

Page 12:

Linear function approximation?

In the RL lecture, we suggested using linear function approximation:

    Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + … + wn*fn(s,a)

Q-function encoded by the weights

But, this STILL requires a human designer to specify the features!

Is it possible to do RL WITHOUT hand-coding either states or features?

Page 13:

Deep Q Learning

Page 14:

Deep Q Learning

Instead of a state, we have an image. In practice, it could be a history of the k most recent images stacked as a single k-channel image.

Hopefully this new image representation is Markov… in some domains, it might not be!
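A small sketch (my own illustration; k = 4, grayscale frames, and the padding behavior at episode start are assumptions) of stacking the k most recent images into a single k-channel observation:

```python
from collections import deque
import numpy as np

k = 4                         # number of recent frames to stack (assumed)
frames = deque(maxlen=k)      # holds the k most recent grayscale frames

def observe(new_frame):
    """Append the newest frame and return a (k, H, W) array to use as the 'state'."""
    frames.append(new_frame)
    while len(frames) < k:    # at the start of an episode, pad with copies of the first frame
        frames.append(new_frame)
    return np.stack(frames, axis=0)
```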

Page 15:

Network structure

[Diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output; the whole network is the Q-function]

Page 16:

Network structure

[Diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output; the whole network is the Q-function]

Page 17:

Network structure

[Diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output; the whole network is the Q-function]

Number of output nodes equals the number of actions
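A minimal PyTorch sketch of such a Q-network (my own illustration; the 4-frame stack, 84x84 input, filter counts, and 256-unit hidden layer are assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Conv 1 -> Conv 2 -> FC 1 -> Output, with one output node per action."""
    def __init__(self, n_actions, in_channels=4):   # 4 stacked frames assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),  # Conv 1
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),           # Conv 2
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),     # FC 1 (assumes 84x84 input images)
            nn.Linear(256, n_actions),                 # Output: one Q-value per action
        )

    def forward(self, x):
        return self.net(x)                             # shape: (batch, n_actions)
```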

Page 18:

Updating the Q-function

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Page 19:

Updating the Q-function

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Page 20:

Updating the Q-function

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

    Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Page 21:

Updating the Q-function

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

    Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Let’s call the bracketed quantity, r + γ max_a' Q(s',a'), the “target”.

This equation adjusts Q(s,a) in the direction of the target.

Page 22:

Updating the Q-function

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

    Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Let’s call the bracketed quantity, r + γ max_a' Q(s',a'), the “target”.

This equation adjusts Q(s,a) in the direction of the target.

We’re going to accomplish this same thing in a different way using neural networks...
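Before moving to networks, a tiny tabular sketch (mine, not the slides’ code) of one Q-learning step that moves Q(s,a) toward the target:

```python
def q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Move Q[s][a] a fraction alpha of the way toward the target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s2])                    # the "target" from the slide
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```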

Page 23:

Remember how we train a neural network...

Given a dataset: {(x_i, y_i)}

Define loss function: L(w,b) = Σ_i ( f(x_i; w, b) − y_i )²

Adjust w, b so as to minimize: L(w,b)

Do gradient descent on dataset:

1. repeat
2.     w ← w − α ∂L/∂w
3.     b ← b − α ∂L/∂b
4. until converged
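A minimal PyTorch sketch of this train-by-gradient-descent loop (my own; f, xs, ys, the learning rate, and the step count are placeholders):

```python
import torch

def train(f, xs, ys, alpha=1e-3, steps=1000):
    """Adjust the parameters of model f (its weights and biases) to minimize squared error."""
    opt = torch.optim.SGD(f.parameters(), lr=alpha)
    for _ in range(steps):                    # "repeat ... until converged"
        loss = ((f(xs) - ys) ** 2).sum()      # loss over the whole dataset
        opt.zero_grad()
        loss.backward()                       # gradients of the loss w.r.t. w and b
        opt.step()                            # w <- w - alpha * dL/dw, b <- b - alpha * dL/db
```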

Page 24:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Page 25:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Notice that Q is now parameterized by the weights, w

Page 26:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

(I’m including the bias in the weights)

Page 27:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

target = r + γ max_a' Q(s',a'; w)

Page 28:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

target = r + γ max_a' Q(s',a'; w)

Notice that the target is a function of the weights. This is a problem because the targets are non-stationary.

Page 29:

Updating the Q-network

Use this loss function:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

target = r + γ max_a' Q(s',a'; w)

Notice that the target is a function of the weights. This is a problem because the targets are non-stationary.

So… when we do gradient descent, we’re going to hold the targets constant.

Now we can do gradient descent!
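A PyTorch sketch (my own; the done flag for terminal states is an addition not shown on the slide) of computing this loss with the target held constant:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s2, done, gamma=0.99):
    """Squared error between Q(s,a;w) and the target, with no gradient through the target."""
    with torch.no_grad():                                     # hold the target constant
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s,a;w) for the chosen actions
    return F.mse_loss(q_sa, target)
```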

Page 30:

Deep Q Learning, no replay buffers

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Take one gradient-descent step on the loss:  w ← w − α ∇_w L(w)
        s ← s'
    Until s is terminal

Where:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²   (target held constant)

Page 31:

Deep Q Learning, no replay buffers

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Take one gradient-descent step on the loss:  w ← w − α ∇_w L(w)
        s ← s'
    Until s is terminal

Where:

    L(w) = [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²   (target held constant)

This is all that changed relative to standard Q-learning.
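A rough Python sketch of this loop (my own illustration; env is assumed to follow the Gymnasium API, q_net is a torch module like the one sketched earlier, and opt is a torch optimizer):

```python
import random
import torch

def run_episode(env, q_net, opt, epsilon=0.1, gamma=0.99):
    """One episode of deep Q-learning with no replay buffer."""
    s, _ = env.reset()                                     # Initialize s
    done = False
    while not done:                                        # Repeat for each step of the episode
        s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
        if random.random() < epsilon:                      # epsilon-greedy policy derived from Q
            a = env.action_space.sample()
        else:
            a = int(q_net(s_t).argmax(dim=1))
        s2, r, terminated, truncated, _ = env.step(a)      # Take action a, observe r, s'

        # One gradient step on [r + gamma max_a' Q(s',a';w) - Q(s,a;w)]^2, target held constant
        with torch.no_grad():
            s2_t = torch.as_tensor(s2, dtype=torch.float32).unsqueeze(0)
            target = r + gamma * (0.0 if terminated else float(q_net(s2_t).max()))
        loss = (q_net(s_t)[0, a] - target) ** 2
        opt.zero_grad(); loss.backward(); opt.step()

        s, done = s2, terminated or truncated              # s <- s'; until s is terminal
```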

Page 32:

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H).

Demo!
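For reference, a minimal way to build this environment (assuming the standard Gymnasium FrozenLake-v1 id; the demo’s actual code is not in the slides):

```python
import gymnasium as gym

# 4x4 FrozenLake: reach the goal (G) without falling into a hole (H)
env = gym.make("FrozenLake-v1", map_name="4x4")
obs, info = env.reset()                      # obs is a discrete state index
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```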

Page 33:

Experience replay

Deep learning typically assumes independent, identically distributed (IID) training data

Page 34:

Experience replay

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Take one gradient-descent step on the loss (target held constant)
        s ← s'
    Until s is terminal

Deep learning typically assumes independent, identically distributed (IID) training data

But is this true in the deep RL scenario?

Page 35:

Experience replay

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Take one gradient-descent step on the loss (target held constant)
        s ← s'
    Until s is terminal

Deep learning typically assumes independent, identically distributed (IID) training data

But is this true in the deep RL scenario? No: consecutive experiences within an episode are highly correlated.

Our solution: buffer experiences and then “replay” them during training
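A minimal sketch of such an experience buffer (the class name and capacity are my own choices, not from the slides):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and returns random minibatches of them."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)                # old experiences fall off the back

    def add(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)       # random draw breaks temporal correlation
```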

Page 36:

Deep Q Learning WITH replay buffers

Initialize Q(s,a;w) with random weights
Initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience to the replay buffer D
        If mod(step, trainfreq) == 0:          (train every trainfreq steps)
            Sample batch B from D
            Do one step of gradient descent on the loss with respect to B
        s ← s'
    Until s is terminal

Page 37:

Deep Q Learning WITH replay buffers

Initialize Q(s,a;w) with random weights
Initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience to the replay buffer D
        If mod(step, trainfreq) == 0:          (train every trainfreq steps)
            Sample batch B from D
            Do one step of gradient descent on the loss with respect to B
        s ← s'
    Until s is terminal

Buffers like this are pretty common in DL...
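A sketch of the “one step of gradient descent with respect to B” part (my own; batch is a list of transitions such as the ReplayBuffer above would return, q_net is a torch module, and the done flag is an addition not shown on the slide):

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_step(q_net, opt, batch, gamma=0.99):
    """One gradient step on a minibatch B of (s, a, r, s', done) transitions sampled from D."""
    s, a, r, s2, done = map(np.array, zip(*batch))
    s, s2 = torch.as_tensor(s, dtype=torch.float32), torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    with torch.no_grad():                                      # targets held constant
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s,a;w) for the sampled actions
    loss = F.mse_loss(q_sa, target)

    opt.zero_grad()
    loss.backward()
    opt.step()
```

Sampling the batch at random from D is what gives the roughly IID training data that the buffer exists to provide.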

Page 38:

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H).

Demo!

Page 39:

Atari Learning Environment

Page 40:

Atari Learning Environment

Page 41:

Atari Learning Environment

Page 42:

Additional Deep RL “Tricks”

Page 43:

Additional Deep RL “Tricks”

Page 44:

Additional Deep RL “Tricks”

Page 45:

Additional Deep RL “Tricks”

Combined algorithm: roughly 3x the mean Atari score of DQN w/ replay buffers