CS 4803 / 7643: Deep Learning - Georgia Institute of Technology

Transcript of CS 4803 / 7643: Deep Learning (Zsolt Kira, Georgia Tech). Topics: Continue Deep Reinforcement Learning - Q-Learning, Policy gradient methods.

Page 1

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics:
– Continue Deep Reinforcement Learning
  – Q-Learning
  – Policy gradient methods

Page 2

Administrativia

• Projects!
  • Project Check-in due April 11th
    – Will be graded pass/fail; if fail, then you can address the issues
    – Counts for 5 points of the project score
  • Poster due date moved to April 23rd (last day of class)
    – No presentations
  • Final submission due date April 30th

(C) Dhruv Batra & Zsolt Kira 2

Page 3

Administrativia

• Send me requests on Piazza for topics to cover after reinforcement learning

(C) Dhruv Batra & Zsolt Kira 3

Page 4

Types of Learning

• Supervised learning
  – Learning from a “teacher”
  – Training data includes desired outputs

• Unsupervised learning
  – Discover structure in data
  – Training data does not include desired outputs

• Reinforcement learning
  – Learning to act under evaluative feedback (rewards)

(C) Dhruv Batra & Zsolt Kira 4
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 5

RL API

(C) Dhruv Batra & Zsolt Kira 5
Slide Credit: David Silver

Page 6

Summary of Last time - MDPs

Defined by:
- S: set of possible states
- A: set of possible actions
- R: distribution of reward given (state, action) pair
- P: transition probability, i.e. distribution over next state given (state, action) pair
- γ: discount factor

- Life is trajectory: s_0, a_0, r_0, s_1, a_1, r_1, ...

Policy: π(a|s) - specifies what action to take in each state
Value: V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

Bellman Equation: Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') | s, a ]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 7

Summary of Last time – Solving MDPs

• Known transition and reward functions (planning)
  • (Q-)Value iteration – just iteratively update based on the Bellman Equation
  • Q-values are easier to extract a policy from (argmax)

• Unknown transition and reward functions
  • Model-based – Learn T and R and then do planning
  • Model-free
    • Take sample trajectories (policy not controlled by us)
    • Estimate expected values (the world samples from the true distribution)
    • Direct Evaluation: Just average over all samples
    • Temporal Difference (TD): Incremental update per example (see the sketch below)
  • Function approximation: Estimate Q or V using a function, e.g. a neural network

• Today:
  • Policy gradients – Learn a policy directly without values
  • Actor-Critic – Estimate both value and policy functions!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
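Before moving to deep networks, the TD update above can be made concrete with a small tabular sketch. This is a minimal illustration, not from the slides; the environment interface (env.reset(), env.step(), env.actions) and the hyperparameters are assumptions.

```python
# Minimal tabular Q-learning sketch with a TD update (illustrative, not from the slides).
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)  # assumed environment API

            # TD target from the Bellman equation: r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            td_target = r + gamma * best_next

            # Incremental update toward the target
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```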

Page 8

Q-Networks

Slide Credit: David Silver

Page 9

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission. 

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 10

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 11

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Input: state st

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 12

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Familiar conv layers, FC layer

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 13

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 14

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 15

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

A single feedforward pass to compute Q-values for all actions from the current state => efficient!

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
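A rough PyTorch sketch of the architecture described above (16 8x8 conv stride 4, 32 4x4 conv stride 2, FC-256, FC-4); the activation placement and initialization details are assumptions, not taken from the slides.

```python
# Sketch of the DQN Q-network architecture described above (Mnih et al. 2013).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # input: 4 stacked 84x84 grayscale frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-4: one Q-value per action
        )

    def forward(self, state):
        # state: (batch, 4, 84, 84); returns Q(s, a) for every action in one forward pass
        return self.head(self.features(state))

# q_values = QNetwork()(torch.zeros(1, 4, 84, 84))  # shape (1, 4)
```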

Page 16

Remember: want to find a Q-function that satisfies the Bellman Equation:
Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') | s, a ]

Deep Q-learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 17

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:
Forward Pass

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 18

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 19

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Backward Pass
Gradient update (with respect to Q-function parameters θ):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 20

Remember: want to find a Q-function that satisfies the Bellman Equation:
Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') | s, a ]

Deep Q-learning

Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))^2 ]

where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*)

Backward Pass
Gradient update (with respect to Q-function parameters θ):
∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i-1}) - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
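A sketch of this forward/backward pass in PyTorch, assuming a Q-network like the one sketched earlier, an older copy of the weights for the target, and a batch of transition tensors; names and shapes are illustrative.

```python
# Sketch: build the target y_i and regress Q(s, a; theta) toward it.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch  # tensors; actions is a LongTensor

    # Forward pass: Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s_{t+1}, a'; theta_old); no gradient through the target
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones.float()) * max_next_q

    # Backward pass minimizes (y_i - Q(s, a; theta))^2
    return F.mse_loss(q_sa, y)
```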

Page 21

Deep Q-Networks

Page 22

Training the Q-network: Experience Replay

22

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Page 23

Training the Q-network: Experience Replay

23

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples (see the sketch below)

[Mnih et al. NIPS Workshop 2013; Nature 2015]
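A minimal replay-memory sketch following this scheme; the buffer capacity and batch size are illustrative choices, not values from the paper.

```python
# Store transitions (s_t, a_t, r_t, s_{t+1}) and sample random minibatches.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random minibatches break the correlation between consecutive samples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```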

Page 24

Experience Replay

(C) Dhruv Batra 24
Slide Credit: David Silver

Page 25

Stable Deep RL (2): Fixed Target Q-Network
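The slide title refers to the DQN stabilization trick of computing targets y_i with a separate Q-network whose weights are frozen and only periodically synced with the online network. A minimal sketch of that idea, assuming the QNetwork class from the earlier sketch and an arbitrary sync period:

```python
# Keep a frozen copy of the Q-network for computing targets; sync it every C steps.
import copy

q_net = QNetwork()                   # online network (from the earlier sketch; assumed)
target_net = copy.deepcopy(q_net)    # frozen copy used only for the targets y_i
SYNC_EVERY = 10_000                  # illustrative sync period

def maybe_sync(step):
    # Periodically copy the online weights into the target network
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```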

Page 26

Stable Deep RL (3): Reward/Value Range

Page 27

(C) Dhruv Batra & Zsolt Kira 27

Page 28

(C) Dhruv Batra & Zsolt Kira 28

Page 29

(C) Dhruv Batra & Zsolt Kira 29

Page 30

(C) Dhruv Batra & Zsolt Kira 30

Page 31

(C) Dhruv Batra & Zsolt Kira 31

Page 32

(C) Dhruv Batra & Zsolt Kira 32

Page 33

(C) Dhruv Batra & Zsolt Kira 33

Page 34

(C) Dhruv Batra & Zsolt Kira 34

Page 35

(C) Dhruv Batra & Zsolt Kira 35

Page 36

Video by Károly Zsolnai-Fehér. Reproduced with permission.
https://www.youtube.com/watch?v=V1eYniJ0Rnk

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 37

DQN Results in Atari

Page 38

Deep RL

• Value-based RL
  – Use neural nets to represent the Q function
• Policy-based RL
  – Use neural nets to represent the policy
• Model-based RL
  – Use neural nets to represent and learn the model

(C) Dhruv Batra 38

Page 39

Formally, let’s define a class of parameterized policies:

For each policy, define its value:

Policy Gradients

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 40

Formally, let’s define a class of parameterized policies:

For each policy, define its value:

We want to find the optimal policy

How can we do this?

Policy Gradients

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 41

Policy Gradients

Formally, let's define a class of parameterized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = arg max_θ J(θ)

How can we do this?

Gradient ascent on policy parameters!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
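As a concrete (assumed) example of a parameterized policy π_θ(a|s): a small network whose softmax output defines a distribution over actions. The layer sizes are illustrative, not from the slides.

```python
# Sketch of a parameterized policy pi_theta(a|s).
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        # Returns log pi_theta(a|s); gradient ascent on J(theta) pushes these
        # log-probabilities up or down depending on the rewards collected.
        return torch.log_softmax(self.net(state), dim=-1)
```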

Page 42

REINFORCE algorithm

Mathematically, we can write: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ]

where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, ...)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 43

REINFORCE algorithm

Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫_τ r(τ) p(τ; θ) dτ

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 44

REINFORCE algorithm

Now let’s differentiate this:

Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 45

REINFORCE algorithm

Intractable! Expectation of gradient is problematic when p depends on θ

Now let’s differentiate this:

Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 46

REINFORCE algorithm

Intractable! Expectation of gradient is problematic when p depends on θ

Now let’s differentiate this:

However, we can use a nice trick:

Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 47

REINFORCE algorithm

Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫_τ r(τ) p(τ; θ) dτ

Now let's differentiate this: ∇_θ J(θ) = ∫_τ r(τ) ∇_θ p(τ; θ) dτ
Intractable! Expectation of gradient is problematic when p depends on θ

However, we can use a nice trick: ∇_θ p(τ; θ) = p(τ; θ) ∇_θ log p(τ; θ)

If we inject this back:
∇_θ J(θ) = ∫_τ ( r(τ) ∇_θ log p(τ; θ) ) p(τ; θ) dτ = E_{τ∼p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
Can estimate with Monte Carlo sampling

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 48

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities?

We have:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 49

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities?

We have:

Thus:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 50

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities?

We have:

Thus:

And when differentiating:
(Doesn't depend on transition probabilities!)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 51

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities?

We have: p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)

Thus: log p(τ; θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]

And when differentiating: ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)
(Doesn't depend on transition probabilities!)

Therefore, when sampling a trajectory τ, we can estimate J(θ) with
∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
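A sketch of this Monte Carlo estimate in PyTorch: for one sampled trajectory, form the surrogate loss -r(τ) Σ_t log π_θ(a_t|s_t) so that minimizing it takes a gradient-ascent step on J(θ). It assumes a policy network returning log-probabilities (like the earlier sketch) and tensors collected from one episode.

```python
# One REINFORCE update from a single sampled trajectory (illustrative names and shapes).
import torch

def reinforce_update(policy, optimizer, states, actions, rewards):
    # states: (T, state_dim) float tensor, actions: (T,) LongTensor, rewards: list of floats
    log_probs = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)
    trajectory_return = sum(rewards)                                       # r(tau)

    loss = -(trajectory_return * log_probs.sum())  # negative sign: optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```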

Page 52

Intuition

Gradient estimator:

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 53

Intuition

Gradient estimator:

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 54

Intuition

(C) Dhruv Batra 54

Page 55

Intuition

Gradient estimator:

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 56

Variance reduction

Gradient estimator:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 57

Variance reduction

Gradient estimator:

First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 58

Variance reduction

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} r_{t'} ) ∇_θ log π_θ(a_t | s_t)

Second idea: Use discount factor γ to ignore delayed effects:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} ) ∇_θ log π_θ(a_t | s_t)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
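A small helper sketching both ideas: weight each log π_θ(a_t|s_t) by the discounted reward-to-go from time t instead of by the full trajectory return. The helper name and the γ value are illustrative.

```python
# Discounted reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}.
import torch

def discounted_rewards_to_go(rewards, gamma=0.99):
    # rewards: list [r_0, r_1, ..., r_{T-1}]
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns)

# Usage sketch: loss = -(discounted_rewards_to_go(rewards) * log_probs).sum()
```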

Page 59

Variance reduction: Baseline

Problem: The raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions.

What is important then? Whether a reward is better or worse than what you expect to get.

Idea: Introduce a baseline function dependent on the state. Concretely, the estimator is now:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} - b(s_t) ) ∇_θ log π_θ(a_t | s_t)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 60

How to choose the baseline?

A simple baseline: constant moving average of rewards experienced so far from all trajectories

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 61

How to choose the baseline?

A simple baseline: constant moving average of rewards experienced so far from all trajectories

Variance reduction techniques seen so far are typically used in “Vanilla REINFORCE”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
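A sketch of that simple baseline: keep a running average of trajectory returns and subtract it before weighting the log-probabilities. The smoothing factor is an illustrative assumption.

```python
# Constant (state-independent) moving-average baseline for vanilla REINFORCE.
class MovingAverageBaseline:
    def __init__(self, beta=0.9):
        self.beta = beta
        self.value = 0.0

    def update_and_get(self, trajectory_return):
        # Exponential moving average of returns seen so far across trajectories
        self.value = self.beta * self.value + (1 - self.beta) * trajectory_return
        return self.value

# Usage sketch:
# b = baseline.update_and_get(trajectory_return)
# loss = -((trajectory_return - b) * log_probs.sum())
```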

Page 62

REINFORCE in action: Recurrent Attention Model (RAM)

Objective: Image Classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class

- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: Glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

glimpse

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 63

REINFORCE in action: Recurrent Attention Model (RAM)

Objective: Image Classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class

- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: Glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE.
Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

glimpse

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 64

REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: given the input image, the NN outputs the first glimpse location (x1, y1)]

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 65

REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: the NN unrolled two steps over the input image, outputting glimpse locations (x1, y1) and (x2, y2)]

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 66

REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: the NN unrolled three steps over the input image, outputting glimpse locations (x1, y1) through (x3, y3)]

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 67

REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: the NN unrolled four steps over the input image, outputting glimpse locations (x1, y1) through (x4, y4)]

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 68

REINFORCE in action: Recurrent Attention Model (RAM)

[Figure: the NN unrolled five steps over the input image, outputting glimpse locations (x1, y1) through (x5, y5); a final Softmax predicts the class label y = 2]

[Mnih et al. 2014]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 69

REINFORCE in action: Recurrent Attention Model (RAM)

[Mnih et al. 2014]
Figures copyright Daniel Levy, 2017. Reproduced with permission.

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 70

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 71

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

A: Q-function and value function!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 72

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

A: Q-function and value function!
Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) - V^π(s_t) is large. On the contrary, we are unhappy with an action if it's small.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 73

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

A: Q-function and value function!
Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) - V^π(s_t) is large. On the contrary, we are unhappy with an action if it's small.

Using this, we get the estimator:
∇_θ J(θ) ≈ Σ_{t≥0} ( Q^π(s_t, a_t) - V^π(s_t) ) ∇_θ log π_θ(a_t | s_t)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 74

74

Page 75

Actor-Critic Algorithm

Initialize policy parameters θ, critic parameters φ
For iteration = 1, 2, ... do
    Sample m trajectories under the current policy
    For i = 1, ..., m do
        For t = 1, ..., T do
            (advantage and gradient updates; see the sketch below)
        End for

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
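A minimal sketch of one actor-critic update consistent with this outline (an assumed implementation, not the slide's exact pseudocode): the critic V_φ(s) provides the baseline, and the advantage A_t = G_t - V_φ(s_t) weights the policy gradient. It reuses the discounted reward-to-go helper from the earlier sketch.

```python
# One actor-critic update for a single trajectory (illustrative names and shapes).
import torch
import torch.nn.functional as F

def actor_critic_update(policy, critic, policy_opt, critic_opt,
                        states, actions, rewards, gamma=0.99):
    returns = discounted_rewards_to_go(rewards, gamma)      # G_t for every step (earlier sketch)
    values = critic(states).squeeze(-1)                     # V_phi(s_t)
    advantages = returns - values.detach()                  # A_t; no gradient into the critic here

    log_probs = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantages * log_probs).sum()            # push up actions with A_t > 0
    critic_loss = F.mse_loss(values, returns)               # fit V_phi to the observed returns

    policy_opt.zero_grad(); actor_loss.backward(); policy_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```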

Page 76

Summary

- Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

- Guarantees:
  - Policy Gradients: Converges to a local optimum of J(θ), which is often good enough!
  - Q-learning: No guarantees, since you are approximating the Bellman equation with a complicated function approximator

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 77

Playing Go

Page 78

Go is a Difficult Game

Page 79

AlphaGo

• Go is a perfect information game
  – See entire board at all times
  – Has an optimal value function!
• Key idea: We cannot unroll the search tree to learn a policy/value for a large number of states. Instead:
  – Reduce depth of search via position evaluation: Replace subtrees with an estimated value function v(s)
  – Reduce breadth of search via action sampling: Don't perform unlikely actions
• Start by predicting expert actions, which gives you a probability distribution
• Use Monte Carlo rollouts, with a policy, selecting children with higher values
  – As the policy improves, this search improves too

Page 80

Minimax Tree Search

From Wikipedia

Page 81

Monte-Carlo Tree Search

(C) Dhruv Batra & Zsolt Kira 81

From Wikipedia

Page 82
Page 83
Page 84

AlphaGo - Steps

1) Train a supervised learning (SL) policy network from human moves
2) Train a fast policy that can sample actions during rollouts
3) Train an RL policy network that improves the SL policy network by optimizing the final outcome of games
   a) Use self-play
4) Train a value network that predicts the winner of games played by the RL policy network against itself

Page 85

Supervised Learning Policy Network

• CNN to predict p(a|s), trained on expert players
  – 13 layers deep
  – Accuracy 57% (55.7% using only board positions and move history)
  – Previous state of the art was 44.4%
• Also train a fast (2 µs!) rollout policy with just a linear softmax of pattern features
  – Not planes any more

Page 86

Reinforcement Learning Policy Network

• Initialized from the supervised policy network
  – Play against policies sampled from previous iterations
  – Won 80% of games against the supervised network

Page 87

Value Network

• Use RL policy network self-play to train (30M positions)

Page 88

Playing the Game

• Final play combines the policy and value networks using MCTS

Page 89

AlphaGo Zero

• MCTS with Self-Play
  – Don't have to guess what the opponent might do, so…
  – If no exploration, a big-branching game tree becomes one path
  – You get an automatically improving, evenly-matched opponent who is accurately learning your strategy
  – “We have met the enemy, and he is us” (famous variant of a line from Pogo, 1954)
  – No need for human expert scoring rules for boards from unfinished games

• Treat board as an image: use residual convolutional neural network

• AlphaGo Zero: One deep neural network learns both the value function and policy in parallel

• Alpha Zero: Removed rollout altogether from MCTS and just used current neural net estimates instead

(C) Dhruv Batra & Zsolt Kira 89
Slide Credit: Craven & Page

Page 90

(C) Dhruv Batra & Zsolt Kira 90
Slide Credit: Craven & Page

Page 91

AlphaGo Zero

https://deepmind.com/blog/alphago-zero-learning-scratch/
D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017

It’s also more efficient than older engines!