
A brief intro to Deep Reinforcement Learning

CEng 783 – Deep Learning, Fall 2019

Emre Akbaş


Today

● Brief introduction to Reinforcement Learning
  – Concepts
  – Markov Decision Processes
  – Bellman Equation
  – Q-learning

● Deep Q-learning
● Deep policy gradients



Reinforcement learning

[Figure: the agent–environment loop. The agent takes an Action; the Environment returns a Reward and the next State.]

● Agent: situated in & operating on an environment.
● Environment: typically stochastic.
● State: a partial or full description of the environment at a certain time.
● Reward: some actions may result in positive or negative rewards.
● Policy: the rule(s) that specify how to choose an action.

Page 6: A brief intro to Deep Reinforcement Learninguser.ceng.metu.edu.tr/~emre/.../week12...learning.pdf · 03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 42 Deep Q-learning “It turns

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 7

Reinforcement learning

Example: the backgammon game

● State

● Agent

● Action

● Reward

● Policy

● Environment (stochastic in general)

Page 7: A brief intro to Deep Reinforcement Learninguser.ceng.metu.edu.tr/~emre/.../week12...learning.pdf · 03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 42 Deep Q-learning “It turns

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 8

How could we train an agent that would perform well in such a setting?

Well = maximize reward

Looking only at immediate rewards would not work well.

We need to take into account “future” rewards.

Page 8: A brief intro to Deep Reinforcement Learninguser.ceng.metu.edu.tr/~emre/.../week12...learning.pdf · 03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 42 Deep Q-learning “It turns

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 9

At time t, the total future reward is

We want to take the action that maximizes Rt.

BUT we have to consider the fact that ...

Page 9: A brief intro to Deep Reinforcement Learninguser.ceng.metu.edu.tr/~emre/.../week12...learning.pdf · 03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 42 Deep Q-learning “It turns

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 10

At time t, the total future reward is

$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$

We want to take the action that maximizes $R_t$.

BUT we have to consider the fact that the environment is stochastic.

So, discounting the future rewards is a good idea.


Discounted future rewards:

$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T$

where γ is the discount factor, between 0 and 1.

The further a reward lies in the future, the less we care about it: with γ = 0.9, for example, a reward 10 steps ahead is weighted by 0.9¹⁰ ≈ 0.35.


Discounted future rewards — consider the two extreme cases:

● γ = 0
  – Considers immediate rewards only.
  – Short-sighted; won't work well.

● γ = 1
  – Future rewards are not discounted.
  – Should work in deterministic environments.

We’ll come back to this with value-based methods.


The goal of RL

Find the policy parameters θ that maximize the expected total reward over trajectories τ:

$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]$

[Figure from Sergey Levine's slides.] http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_3_rl_intro.pdf

[Slide by David Silver] http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf


A policy-gradient method

● Based on directly differentiating the objective (the expected total reward) with respect to the policy parameters.

Writing the objective as $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$ and applying the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$ yields

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \Big( \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) R(\tau) \Big]$

which can be estimated by sampling trajectories from the current policy.


Markov Decision Process

● Relies on the Markov assumption

● The probability of the next state depends only on the current state and action, not on any preceding states or actions:

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$

[Image from nervanasys.com]

One episode (e.g. a game from start to finish) of this process forms a sequence:

$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots, s_{n-1}, a_{n-1}, r_n, s_n$


The REINFORCE algorithm

The previous derivations result in the following algorithm:

1) Sample trajectories {τⁱ} by running the current policy π_θ.
2) Estimate the gradient: $\nabla_\theta J(\theta) \approx \sum_i \big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \big) R(\tau^i)$
3) Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ and go to 1.

[From Sergey Levine’s slides.]

An example application is in the next slides.
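Below is a minimal sketch of this loop, assuming PyTorch and the classic Gym API (where step() returns 4 values); the environment name, network sizes, and hyperparameters are illustrative, not from the slides:

```python
# A minimal REINFORCE sketch for CartPole, assuming PyTorch and
# the classic Gym API (gym<=0.25, where step() returns 4 values).
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 2),                 # logits over the two actions
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    s = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:                   # 1) sample a trajectory from pi_theta
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())
        rewards.append(r)

    returns, R = [], 0.0              # discounted return R_t for each step
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)

    # 2)+3) gradient ascent on J(theta): minimize -sum_t log pi(a_t|s_t) * R_t
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```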


Example: The “Pong” game

Image from http://karpathy.github.io/2016/05/31/rl/

State: current frame minus the last frame (to capture the motion)

Action: UP or DOWN

Reward: +1 if you win the game, otherwise -1


Example: “Pong”

Image from http://karpathy.github.io/2016/05/31/rl/


Learning the optimal policy

1) Initialize the policy network randomly

2) Play a batch of episodes

3) Label all the actions in a WON game as CORRECT

4) Label all the actions in a LOST game as INCORRECT

5) Apply supervised learning, i.e. maximize

$\sum_i r_i \log p(a_i \mid s_i)$

where rᵢ = 1 for any action in a WON game, otherwise −1

6) Go to step 2, repeat until convergence.
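As a sketch of step 5, the outcome-weighted loss could look like the following (assuming PyTorch; the function name and tensor shapes are illustrative):

```python
# A sketch of the outcome-weighted loss in step 5, assuming PyTorch.
import torch
import torch.nn.functional as F

def outcome_weighted_loss(logits, actions, game_won):
    """Negative of sum_i r_i * log p(a_i | s_i) over one game's actions.

    logits:  (N, num_actions) policy outputs for the N frames of a game
    actions: (N,) the actions that were actually taken (the "labels")
    """
    r = 1.0 if game_won else -1.0         # r_i = +1 for WON, -1 for LOST
    log_p = F.log_softmax(logits, dim=1)
    chosen = log_p[torch.arange(len(actions)), actions]
    return -(r * chosen).sum()            # minimizing this maximizes the objective
```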



Q-learning: a value-based method

● The Q-function:
  – The expected total future reward from taking action a_t in state s_t.

Remember the discounted future rewards: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = r_t + \gamma R_{t+1}$

Writing this recursively in terms of the best next action gives

$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$

This is called the Bellman equation.


The policy π

The rule that specifies which action to choose given the current state.

A typical and sensible policy is

$\pi(s) = \arg\max_a Q(s, a)$


Q-learning

Suppose the current state is s and we take action a. Then we observe the next state s' and obtain reward r.

It has been shown¹ that the following iterative update rule converges to the optimal Q-function:

$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$

where α is the learning rate.

1http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

Page 31: A brief intro to Deep Reinforcement Learninguser.ceng.metu.edu.tr/~emre/.../week12...learning.pdf · 03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 42 Deep Q-learning “It turns

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 34

Q-learning pseudo-code

1) Create a table of size num_states × num_actions for Q

2) Initialize the table randomly

3) Observe the initial state s

4) Repeat until the game is over (i.e. the next state is a terminal state):

   1) Choose an action a = π(s)   /* s is the current state */

   2) Receive the next state s' and reward r

   3) Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

   4) s = s'   /* the next state becomes the current state */
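A minimal tabular version in Python might look like this (a sketch assuming a Gym-style discrete environment such as FrozenLake and the classic Gym API; the ε-greedy exploration and all hyperparameters are illustrative additions, not from the slide):

```python
# A minimal tabular Q-learning sketch, assuming a Gym-style environment
# with discrete states and actions.
import numpy as np
import gym

env = gym.make("FrozenLake-v1")
Q = np.random.rand(env.observation_space.n, env.action_space.n)  # steps 1-2
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(5000):
    s = env.reset()                       # step 3: observe the initial state
    done = False
    while not done:                       # step 4: until the terminal state
        # epsilon-greedy version of pi(s) = argmax_a Q(s, a)
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, done, _ = env.step(a)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])     # the iterative update rule
        s = s_next
```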


How can we use deep learning here?

The Q-function can be approximated using a neural network.

[Figure from nervanasys.com: a network that takes the state s and the action a as input and outputs the single value Q(s, a).]



How can we use deep learning here?

[Figure from nervanasys.com: an alternative network that takes only the state s as input and outputs one Q-value per action.]

Now it's efficient to evaluate $\max_{a'} Q(s', a')$: a single forward pass yields the Q-values of all actions.


Example

In DeepMind's 2013 paper [Mnih et al. (2013)], the state is encoded by the images of the last four frames of the game.

After pre-processing, a state is an 84×84×4 tensor.


Example

The network used in DeepMind's 2013 paper [Mnih et al. (2013)]:

Qs: What type of network is this? Why 18? Why no pooling layers?
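For reference, a sketch of that architecture, assuming PyTorch (layer sizes follow Mnih et al. (2013): two conv layers, a 256-unit hidden layer, one output per action):

```python
# A sketch of the Q-network from Mnih et al. (2013), assuming PyTorch.
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions=18):   # Atari exposes up to 18 actions
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # input: 4 stacked 84x84 frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # no pooling: it would discard
            nn.ReLU(),                                   # the spatial detail Q needs
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 84 -> 20 -> 9 spatially
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)
```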


Deep Q-learning

Remember the Bellman equation? For a transition <s, a, r, s'>, the target value is

$y = r + \gamma \max_{a'} Q(s', a')$

The update rule in ordinary Q-learning:

$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ y - Q(s, a) \big]$

In deep Q-learning, we instead regress the network output Q(s, a; θ) toward y, e.g. with the squared-error loss

$L(\theta) = \big( y - Q(s, a; \theta) \big)^2$
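One such update step on a minibatch might look like the following (a sketch assuming PyTorch and the DQN class above; dqn_update and the batch layout are illustrative names, not from the slides):

```python
# A sketch of one deep Q-learning step on a minibatch of transitions.
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                 # float tensors from replay memory
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                         # target y = r + gamma max_a' Q(s', a')
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    loss = F.mse_loss(q_sa, target)               # (y - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```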


Deep Q-learning

[Pseudocode from nervanasys.com] — roughly:

1) Initialize replay memory D
2) Initialize the Q-network with random weights
3) Observe the initial state s
4) Repeat:
   1) Select an action a (ε-greedy w.r.t. Q)
   2) Carry out a; observe reward r and next state s'
   3) Store <s, a, r, s'> in D
   4) Sample a random minibatch of transitions from D and train the network on the targets r + γ max_{a'} Q(s', a')
   5) s = s'


Deep Q-learning

● “It turns out that approximation of Q-values using non-linear functions is not very stable.

● There is a whole bag of tricks (reward clipping, gradient clipping, batch normalization) that you have to use to actually make it converge.

● The most important trick is experience replay.”
[Quote from nervanasys.com]

Experience replay is essentially minibatch training: collect <s, a, r, s'> transitions in a memory and train on random minibatches drawn from it, instead of on transitions one by one, in order.
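A minimal replay-memory sketch (the class name and capacity are illustrative):

```python
# A minimal replay-memory sketch for storing <s, a, r, s'> transitions.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions drop out

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)
```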

[Figure from DeepMind's post.] https://deepmind.com/blog/article/deep-reinforcement-learning


References

● Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

● Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

● Guest Post (Part I): Demystifying Deep Reinforcement Learning (Blog post), https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

● Deep Reinforcement Learning: Pong from Pixels (Blog post) by A. Karpathy, http://karpathy.github.io/2016/05/31/rl/