Deep Reinforcement Learningweb.stanford.edu/class/aa228/drl.pdf · Deep Q-Learning: Challenges When...
Transcript of Deep Reinforcement Learningweb.stanford.edu/class/aa228/drl.pdf · Deep Q-Learning: Challenges When...
Deep Reinforcement Learning
Jeremy Morton
November 29, 2017
Jeremy Morton Deep RL November 29, 2017 1 / 22
Reinforcement Learning Challenges
1. Balancing Exploration with Exploitation
2. Credit Assignment
3. Generalizing from experience
How can we generalize effectively in large (and possibly continuous) state and action spaces?
Driving from Images Continuous Action Space
Jeremy Morton Deep RL November 29, 2017 2 / 22
Global Approximation
Recall global approximation:Q(s, a) = θa
ᵀβ(s)
What is β(s)? If β(s) = s, then this is may not be a good approximation of Q(s, a).
But how can we learn appropriate features and weights so that this approximation is accurate?
Jeremy Morton Deep RL November 29, 2017 3 / 22
Neural Networks: Overview
Here, we have a supervised learning problem and would like to approximate y.
Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Networks: Overview
i1 = wᵀ1x+ b1
Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Networks: Overview
o1 = σ(wᵀ1x+ b1)
Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Networks: Overview
o1 = σ(wᵀ1x+ b1)
Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Networks: Overview
y = w3 σ(w2 σ(wᵀ1x+ b1) + b2) + b3
Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Networks: Overview
Neural networks transform inputs via a set of matrix multiplications and element-wisenonlinearities.
y =W3 σ(W2 σ(W1x+ b1) + b2) + b3Jeremy Morton Deep RL November 29, 2017 4 / 22
Neural Network Properties
Neural networks are universal function approximators.
Recall the properties of the sigmoid function: we can control its sharpness and shift it.
−10 −5 0 5 10
0
0.2
0.4
0.6
0.8
1
σ(x)
−10 −5 0 5 10
0
0.2
0.4
0.6
0.8
1
σ(1000x)
−10 −5 0 5 10
0
0.2
0.4
0.6
0.8
1
σ(1000x− 5000)
Jeremy Morton Deep RL November 29, 2017 5 / 22
Neural Network Properties
Neural networks are universal function approximators.
Imagine we wanted to approximate f(x) = x:
0 0.5 1 1.5 2 2.5 30
1
2
3
Jeremy Morton Deep RL November 29, 2017 5 / 22
Neural Network Properties
Neural networks are universal function approximators.
Imagine you have a neural network with three neurons that take in x and output the following:
0 0.5 1 1.5 2 2.5 3
0
0.2
0.4
0.6
0.8
1
σ(1000x− 1000)
0 0.5 1 1.5 2 2.5 3
−1
−0.8
−0.6
−0.4
−0.2
0
−σ(1000x− 2000)
0 0.5 1 1.5 2 2.5 3
0
0.5
1
1.5
2
2σ(1000x− 2000)
Jeremy Morton Deep RL November 29, 2017 5 / 22
Neural Network Properties
Neural networks are universal function approximators.
Now sum the neuron outputs together:
0 0.5 1 1.5 2 2.5 30
1
2
3
Jeremy Morton Deep RL November 29, 2017 5 / 22
Neural Network Properties
Neural networks are universal function approximators.
As more neurons are added, the approximation to f(x) improves.
0 0.5 1 1.5 2 2.5 30
1
2
3
Good explanation here.
Jeremy Morton Deep RL November 29, 2017 5 / 22
Neural Network Properties
Neural networks can be trained efficiently through backpropagation.
Example for network with single hidden layer:
y =W2 σ(W1x+ b1) + b2
If loss function L is a function of y then by chain rule:
∂L∂W2
=∂L∂y
(∂y
∂W2
)ᵀ
=∂L∂y
σ(W1x+ b1)ᵀ
Finally, can update W2 using gradient descent:
W2 ←W2 − α∂L∂W2
Jeremy Morton Deep RL November 29, 2017 6 / 22
Neural Network Properties
Neural networks can be trained efficiently through backpropagation.
Example for network with single hidden layer:
y =W2 σ(W1x+ b1) + b2
If loss function L is a function of y then by chain rule:
∂L∂W2
=∂L∂y
(∂y
∂W2
)ᵀ
=∂L∂y
σ(W1x+ b1)ᵀ
Finally, can update W2 using gradient descent:
W2 ←W2 − α∂L∂W2
Jeremy Morton Deep RL November 29, 2017 6 / 22
Neural Network Properties
Neural networks can be trained efficiently through backpropagation.
Example for network with single hidden layer:
y =W2 σ(W1x+ b1) + b2
If loss function L is a function of y then by chain rule:
∂L∂W2
=∂L∂y
(∂y
∂W2
)ᵀ
=∂L∂y
σ(W1x+ b1)ᵀ
Finally, can update W2 using gradient descent:
W2 ←W2 − α∂L∂W2
Jeremy Morton Deep RL November 29, 2017 6 / 22
Deep Q-Learning
V. Mnih, et al. (2013). Playing Atari with Deep Reinforcement Learning. Available athttps://arxiv.org/abs/1312.5602
Jeremy Morton Deep RL November 29, 2017 7 / 22
Deep Q-Learning: Objective
Recall the Bellman equation:
Q∗(s, a) = Es′∼E[R(s, a) + γmax
a′Q∗(s′, a′)
]We estimate the value of Q∗(s, a) using a neural network with parameters θ. At iteration i,the loss function is given by the temporal difference error:
Li(θi) = Es,a∼ρ(·);s′∼E
[(R(s, a) + γmax
a′Q(s′, a′; θi−1)−Q(s, a; θi)
)2]
where θi−1 are the parameter values from the previous iteration.
Jeremy Morton Deep RL November 29, 2017 8 / 22
Deep Q-Learning: Challenges
I Often in supervised learning it is assumed that successive samples are iid
I However, in reinforcement learning successive samples are highly correlated
I To combat this, transitions were stored in replay memory DI During training, random transitions {st, at, rt, st+1} were sampled from D and used to
train the network
Jeremy Morton Deep RL November 29, 2017 9 / 22
Deep Q-Learning: Challenges
When performing regression in supervised learning, it is common to use the least-squares loss:
L =1
2(y − y)2
where y is the target value and y is the model prediction. In the case of Q-Learning, thistarget value is:
y = rt + γmaxa′
Q(st+1, a′; θ).
Because of this, the network target value changes as the network is trained. The deepQ-Learning algorithm uses a target network whose parameters are changed less frequently inorder to have a stationary target value.
Jeremy Morton Deep RL November 29, 2017 10 / 22
Deep Q-Learning: Challenges
I Input images are 210× 160 pixels with three color channels, so huge inputs and statespace → Images are downsampled, converted to grayscale, and cropped to be 84× 84pixels.
I Images are static, so many states are perceptually aliased → The input to the network isa sequence of 4 preprocessed frames.
Jeremy Morton Deep RL November 29, 2017 11 / 22
Deep Q-Learning: Network Architecture
Jeremy Morton Deep RL November 29, 2017 12 / 22
Deep Q-Learning: Algorithm
Jeremy Morton Deep RL November 29, 2017 13 / 22
Deep Q-Learning: Results
Jeremy Morton Deep RL November 29, 2017 14 / 22
Policy Gradient Methods
I In Deep Q-learning, a neural network is used to approximate Q∗(s, a).
I In policy gradient methods, neural networks are instead used to represent a policyπθ(a | s).
I Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
I Parameters θ optimized to maximize expectation over future rewards Ea∼πθ [Qπθ(s, a)].I Can be applied to both discrete and continuous action spaces.
Jeremy Morton Deep RL November 29, 2017 15 / 22
Policy Gradient Methods
I In Deep Q-learning, a neural network is used to approximate Q∗(s, a).
I In policy gradient methods, neural networks are instead used to represent a policyπθ(a | s).
I Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
I Parameters θ optimized to maximize expectation over future rewards Ea∼πθ [Qπθ(s, a)].I Can be applied to both discrete and continuous action spaces.
Jeremy Morton Deep RL November 29, 2017 15 / 22
Policy Gradient Methods
I In Deep Q-learning, a neural network is used to approximate Q∗(s, a).
I In policy gradient methods, neural networks are instead used to represent a policyπθ(a | s).
I Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
I Parameters θ optimized to maximize expectation over future rewards Ea∼πθ [Qπθ(s, a)].I Can be applied to both discrete and continuous action spaces.
Jeremy Morton Deep RL November 29, 2017 15 / 22
Policy Gradient Methods
I In Deep Q-learning, a neural network is used to approximate Q∗(s, a).
I In policy gradient methods, neural networks are instead used to represent a policyπθ(a | s).
I Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
I Parameters θ optimized to maximize expectation over future rewards Ea∼πθ [Qπθ(s, a)].
I Can be applied to both discrete and continuous action spaces.
Jeremy Morton Deep RL November 29, 2017 15 / 22
Policy Gradient Methods
I In Deep Q-learning, a neural network is used to approximate Q∗(s, a).
I In policy gradient methods, neural networks are instead used to represent a policyπθ(a | s).
I Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
I Parameters θ optimized to maximize expectation over future rewards Ea∼πθ [Qπθ(s, a)].I Can be applied to both discrete and continuous action spaces.
Jeremy Morton Deep RL November 29, 2017 15 / 22
Calculating Gradients
How do we find gradients for the policy parameters?
∇θEa∼πθ [Qπθ(s, a)] = ∇θ∫πθ(a | s)Qπθ(s, a)da
=
∫∇θπθ(a | s)Qπθ(s, a)da
=
∫ ∇θπθ(a | s)πθ(a | s)
πθ(a | s)Qπθ(s, a)da
=
∫∇θ log πθ(a | s)πθ(a | s)Qπθ(s, a)da
= Ea∼πθ [∇θ log πθ(a | s)Qπθ(s, a)]
Jeremy Morton Deep RL November 29, 2017 16 / 22
Calculating Gradients
How do we find gradients for the policy parameters?
∇θEa∼πθ [Qπθ(s, a)] = ∇θ∫πθ(a | s)Qπθ(s, a)da
=
∫∇θπθ(a | s)Qπθ(s, a)da
=
∫ ∇θπθ(a | s)πθ(a | s)
πθ(a | s)Qπθ(s, a)da
=
∫∇θ log πθ(a | s)πθ(a | s)Qπθ(s, a)da
= Ea∼πθ [∇θ log πθ(a | s)Qπθ(s, a)]
Jeremy Morton Deep RL November 29, 2017 16 / 22
Calculating Gradients
How do we find gradients for the policy parameters?
∇θEa∼πθ [Qπθ(s, a)] = ∇θ∫πθ(a | s)Qπθ(s, a)da
=
∫∇θπθ(a | s)Qπθ(s, a)da
=
∫ ∇θπθ(a | s)πθ(a | s)
πθ(a | s)Qπθ(s, a)da
=
∫∇θ log πθ(a | s)πθ(a | s)Qπθ(s, a)da
= Ea∼πθ [∇θ log πθ(a | s)Qπθ(s, a)]
Jeremy Morton Deep RL November 29, 2017 16 / 22
Calculating Gradients
How do we find gradients for the policy parameters?
∇θEa∼πθ [Qπθ(s, a)] = ∇θ∫πθ(a | s)Qπθ(s, a)da
=
∫∇θπθ(a | s)Qπθ(s, a)da
=
∫ ∇θπθ(a | s)πθ(a | s)
πθ(a | s)Qπθ(s, a)da
=
∫∇θ log πθ(a | s)πθ(a | s)Qπθ(s, a)da
= Ea∼πθ [∇θ log πθ(a | s)Qπθ(s, a)]
Jeremy Morton Deep RL November 29, 2017 16 / 22
Calculating Gradients
How do we find gradients for the policy parameters?
∇θEa∼πθ [Qπθ(s, a)] = ∇θ∫πθ(a | s)Qπθ(s, a)da
=
∫∇θπθ(a | s)Qπθ(s, a)da
=
∫ ∇θπθ(a | s)πθ(a | s)
πθ(a | s)Qπθ(s, a)da
=
∫∇θ log πθ(a | s)πθ(a | s)Qπθ(s, a)da
= Ea∼πθ [∇θ log πθ(a | s)Qπθ(s, a)]
Jeremy Morton Deep RL November 29, 2017 16 / 22
Policy Gradients: Estimating Q
Given a sample trajectory {s1, a1, r1, . . . sT−1, aT−1, rT }, can create an empirical estimate ofQ:
Qπθ(st, at) =
T∑k=t
γk−1rk
Then perform the following parameter update at each time step:
θ ← θ + α∇θ log πθ(at | st)Qπθ(st, at).
This is known as the REINFORCE algorithm. It is also possible to use a second neuralnetwork to estimate Q. This is known as an actor-critic algorithm.
Jeremy Morton Deep RL November 29, 2017 17 / 22
Policy Gradients: Demonstration
Full video here.Jeremy Morton Deep RL November 29, 2017 18 / 22
Further Reading
V. Mnih, et al. (2016). Asynchronous Methods for Deep Reinforcement Learning.Available at https://arxiv.org/abs/1602.01783.
Credit: Medium post by A. Juliani here
Performance of Asynchronous AdvantageActor-Critic (A3C) algorithm
Jeremy Morton Deep RL November 29, 2017 19 / 22
Further Reading
M. Hessel, et al. (2017). Rainbow: Combining Improvements in Deep ReinforcementLearning. Available at https://arxiv.org/abs/1710.02298.
Jeremy Morton Deep RL November 29, 2017 19 / 22
Further Reading
I J. Schulman, et al. (2015). Trust Region Policy Optimization. Available athttp://arxiv.org/abs/1502.05477.
I D. Silver, et al. (2016). Mastering the game of Go with deep neural networks andtree search. Nature, 529(7587), pp.484-489. (AlphaGo paper)
I D. Silver, et al. (2017). Mastering the game of Go without human knowledge.Nature, 550(7676), pp.354-359. (AlphaGo Zero paper)
I Y. Li (2017). Deep Reinforcement Learning: An Overview. Available athttps://arxiv.org/abs/1701.07274v5. (Long Survey)
I K. Arulkumaran, et al. (2017). A Brief Survey of Deep Reinforcement Learning.Available at https://arxiv.org/abs/1708.05866v2. (Shorter Survey)
Jeremy Morton Deep RL November 29, 2017 19 / 22
Deep Learning Frameworks
Currently, the (arguably) two most popular frameworks are:
I TensorFlow (https://www.tensorflow.org)
I PyTorch (http://pytorch.org)
With both, you can construct and train networks in Python. Other frameworks includeTheano, CNTK, Caffe, MXNet, etc.
For easy network construction and training, check out Keras.
A much more in-depth discussion of deep learning frameworks can be found in the CS 231ncourse slides.
Jeremy Morton Deep RL November 29, 2017 20 / 22
Additional Resources
Neural Networks:
I CS 231n Course Notes (http://cs231n.stanford.edu)
I Michael Nielsen’s free textbook Neural Networks and Deep Learning(http://neuralnetworksanddeeplearning.com)
I Deep Learning by Bengio et al. (http://www.deeplearningbook.org)
Deep Reinforcement Learning:
I CS 294 Course Notes (UC Berkeley) (http://rll.berkeley.edu/deeprlcourse/)
I David Silver’s RL Course Notes(http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html)
I OpenAI Gym (https://gym.openai.com/docs/)
Jeremy Morton Deep RL November 29, 2017 21 / 22
Thank You
Questions?
Jeremy Morton Deep RL November 29, 2017 22 / 22