Introduction to Deep Reinforcement...
Transcript of Introduction to Deep Reinforcement...
![Page 1: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/1.jpg)
Introduction to Deep Reinforcement Learning
Yen-Chen Wu
2015/12/11
![Page 2: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/2.jpg)
Outline
• Reinforcement Learning
• Markov Decision Process
• How to Solve MDPs
– DP
– MC
– TD
– Q-learning (DQN)
• Paper Review
![Page 3: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/3.jpg)
REINFORCEMENT LEARNING
![Page 4: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/4.jpg)
Branches of Machine Learning
![Page 5: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/5.jpg)
What makes different?
• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non i.i.d data)
• Agent’s actions affect the subsequent data it receives
![Page 6: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/6.jpg)
Goal: Maximize Cumulative Reward
• Actions may have long term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward
![Page 7: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/7.jpg)
Agent & Enviroment
→←↑↓DefenseAttackJump
![Page 8: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/8.jpg)
MARKOV DECISION PROCESS
Markov Processes
Markov Reward Processes
Markov Decision Processes
![Page 9: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/9.jpg)
Markov Process
![Page 10: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/10.jpg)
Markov Reward Processes
![Page 11: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/11.jpg)
Markov Decision Process
![Page 12: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/12.jpg)
Markov Decision Process(MDP)
• S : finite set of states (observations)
• A : finite set of actions
• P : transition probability
• R : immediate reward
• γ : discount factor
• Goal :– Choose policy π
– Maximize expected return :
![Page 13: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/13.jpg)
HOW TO SOLVE MDP
Dynamic Programming
Monte-Carlo
Temporal-Difference
Q-Learning
![Page 14: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/14.jpg)
Model-based
• Dynamic Programming
– Evaluate policy
– Update policy
![Page 15: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/15.jpg)
Model Free
• Unknown Transition Probability & Reward
• MC vs TD
![Page 16: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/16.jpg)
Model Free:Q-learning
• Instead of tabular
• optimal action-value function (Q-learning)
– =
• Bellman equation
Basic idea : iterative update (lack of generalization)
In practical : function approximator
Linear ?
Using DNN !
![Page 17: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/17.jpg)
DEEP Q-NETWORK (DQN)
![Page 19: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/19.jpg)
Deep Q-Network
• compute Q-values for all actions
Input : 84x84x4
Convolves 32 filters of 8x8 with stride 4Convolves 64 filters of 4x4 with stride 2Convolves 64 filters of 3x3 with stride 1
Full-connected 512 nodes
Output a node for each action
![Page 20: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/20.jpg)
Update DQN
• Loss function
• Gradient
![Page 21: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/21.jpg)
Two Technique
• Experience Replay– Experience– Pooled Memory
• Data efficiency (bootstrap)• Avoid correlation between samples (variance between
batches)• Off –policy is suitable for Q-learning
– Random sampled mini-batch– Prioritized sweeping (active learning)
• Separate Target Network– more stable than online learning
Example Learn the value of… Pros & Cons
On-policy SARSA policy being carried out by the agent
Fast but weak
Off-policy DQN optimal policy independently of the agent's actions
Slow but robust
![Page 22: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/22.jpg)
DEMO
![Page 23: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/23.jpg)
PAPER REVIEW
![Page 24: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/24.jpg)
Paper list
• Massively Parallel Methods for Deep Reinforcement Learning
• Continuous control with deep reinforcement learning
• Deep Reinforcement Learning with Double Q-learning
• Policy Distillation
• Dueling Network Architectures for Deep Reinforcement Learning
• Multiagent Cooperation and Competition with Deep Reinforcement Learning
![Page 25: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/25.jpg)
Massively Parallel Methods for Deep Reinforcement LearningArun NairarXiv:1507.04296
![Page 26: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/26.jpg)
DDPG (Deterministic Policy Gradient)
• DDAC (Deep Deterministic Actor-Critic)
Continuous control with deep reinforcement learningTimothy P. LillicraparXiv:1509.02971https://goo.gl/J4PIAz
![Page 27: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/27.jpg)
Double Q-learning
![Page 28: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve](https://reader034.fdocuments.us/reader034/viewer/2022051105/5aa7b1a57f8b9ac5648c76fe/html5/thumbnails/28.jpg)
Policy Distillation
• Soft target