Bayesian Reinforcement Learning, Machine Learning RCC, 16th June 2011
Bayesian Reinforcement Learning
Machine Learning RCC
16th June 2011
Outline
• Introduction to Reinforcement Learning
• Overview of the field
• Model-based BRL
• Model-free RL
References
• ICML 2007 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel
• Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto
Machine Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
Definitions
[Slide figure: the agent–environment loop, labelled with State, Action, Reward (£), Policy, and Reward function]
Markov Decision Process
[Slide figure: a trajectory x0, a0, r0, x1, a1, r1, … generated by the Policy, the Transition Probability, and the Reward function]
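The objects on this slide can be sketched concretely. The two-state MDP below is invented for illustration (its states, actions, rewards, and transition probabilities are not from the talk); it samples a trajectory x0, a0, r0, x1, a1, r1, … under a policy.

```python
import random

# A minimal two-state MDP (all numbers are illustrative placeholders).
TRANSITIONS = {
    # (state, action) -> list of (next_state, probability)
    ("x0", "a0"): [("x0", 0.3), ("x1", 0.7)],
    ("x0", "a1"): [("x0", 0.9), ("x1", 0.1)],
    ("x1", "a0"): [("x0", 0.4), ("x1", 0.6)],
    ("x1", "a1"): [("x0", 0.2), ("x1", 0.8)],
}
REWARDS = {("x0", "a0"): 0.0, ("x0", "a1"): 1.0,
           ("x1", "a0"): 2.0, ("x1", "a1"): 0.5}

def step(state, action, rng=random):
    """Sample x' ~ P(.|x, a) and return (x', r(x, a))."""
    nexts, probs = zip(*TRANSITIONS[(state, action)])
    next_state = rng.choices(nexts, weights=probs)[0]
    return next_state, REWARDS[(state, action)]

def rollout(policy, x0="x0", horizon=5, rng=random):
    """Generate a trajectory (x0, a0, r0), (x1, a1, r1), ... under a policy."""
    traj, x = [], x0
    for _ in range(horizon):
        a = policy(x)
        x_next, r = step(x, a, rng)
        traj.append((x, a, r))
        x = x_next
    return traj
```

A policy here is just a function from state to action, matching the slide's picture of the policy selecting each a_t.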
Value Function
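In standard notation (as in Sutton and Barto), the value of a state x under a policy π is the expected discounted return:

```latex
V^{\pi}(x) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; x_0 = x\right],
\qquad 0 \le \gamma < 1,
```

and the action-value function conditions additionally on the first action: Q^π(x, a) = E_π[Σ_t γ^t r_t | x_0 = x, a_0 = a].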
Optimal Policy
• Assume one optimal action per state
• When the model is known, the optimal value function can be computed by Value Iteration
• In RL the model is unknown
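Value Iteration itself is a short fixed-point computation. The sketch below uses a randomly generated tabular MDP (sizes, discount, and tolerance are illustrative) and repeats the Bellman optimality backup until the values stop changing.

```python
import numpy as np

# Value Iteration on a small random tabular MDP (illustrative data).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.standard_normal((n_states, n_actions))                    # R[x, a]

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# One optimal action per state (ties broken arbitrarily by argmax):
policy = Q.argmax(axis=1)
```

Because the backup is a γ-contraction, the loop converges geometrically; with γ = 0.9 it reaches the tolerance in a few hundred iterations.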
Reinforcement Learning
• RL Problem: Solve MDP when reward/transition models are unknown
• Basic Idea: Use samples obtained from agent’s interaction with environment
Model-Based vs Model-Free RL
• Model-Based: Learn a model of the reward/transition dynamics and derive value function/policy
• Model-Free: Directly learn value function/policy
RL Solutions
• Value Function Algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy
• Examples: SARSA, Q-learning
• Actor-Critic
– Define a policy structure (actor)
– Define a value function (critic)
– Sample state-action-reward
– Update both actor & critic
• Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy
• Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
Online vs Offline
• Offline
– Use a simulator
– Policy fixed for each ‘episode’
– Updates made at the end of each episode
• Online
– Directly interact with the environment
– Learning happens step-by-step
Model-Free Solutions
1. Prediction: Estimate V(x) or Q(x,a)
2. Control: Extract a policy
– On-Policy
– Off-Policy
Monte-Carlo Predictions
[Chart: predicted Value/Reward plotted against State for a driving-home trip]

| State | Leave car park | Get out of city | Motorway | Enter Cambridge |
|---|---|---|---|---|
| Reward received after state | -13 | -15 | -61 | -11 |
| Predicted value | -90 | -83 | -55 | -11 |
| Updated value (actual return) | -100 | -87 | -72 | -11 |
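The arithmetic behind the updated row can be checked directly: with γ = 1, the Monte-Carlo target for each state is the actual return observed from it, i.e. the sum of all subsequent rewards.

```python
# Monte-Carlo targets for the driving-home example (gamma = 1).
rewards = [-13, -15, -61, -11]      # reward incurred after leaving each state
predictions = [-90, -83, -55, -11]  # value estimates before the trip

# Return from state i = sum of all subsequent rewards.
returns = [sum(rewards[i:]) for i in range(len(rewards))]
# returns == [-100, -87, -72, -11], the "updated" row above

# With a step size alpha, each estimate moves part-way toward its return
# (the slide's updated row corresponds to alpha = 1):
alpha = 0.1
updated = [v + alpha * (g - v) for v, g in zip(predictions, returns)]
```

Note that no estimate can be updated until the episode ends, since the return from every state depends on all later rewards.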
Temporal Difference Predictions
[Chart: the same driving-home trip, now updated with TD targets]

| State | Leave car park | Get out of city | Motorway | Enter Cambridge |
|---|---|---|---|---|
| Reward received after state | -13 | -15 | -61 | -11 |
| Predicted value | -90 | -83 | -55 | -11 |
| Updated value (TD target: reward + next prediction) | -96 | -70 | -72 | -11 |
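Unlike Monte-Carlo, the TD(0) target bootstraps from the next state's current prediction rather than waiting for the full return; the updated row above falls out of the rule directly (again with γ = 1 and, to match the slide, α = 1).

```python
# TD(0) targets for the driving-home example (gamma = 1, alpha = 1).
rewards = [-13, -15, -61, -11]
predictions = [-90, -83, -55, -11]

updated = []
for i, r in enumerate(rewards):
    # Bootstrap from the next prediction (0 beyond the terminal state).
    v_next = predictions[i + 1] if i + 1 < len(predictions) else 0
    updated.append(r + v_next)   # TD target: r_t + gamma * V(x_{t+1})
# updated == [-96, -70, -72, -11], the "updated" row above
```

Each update needs only the next step's reward and prediction, which is why TD methods can learn online within an episode.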
Advantages of TD
• Don’t need a model of reward/transitions
• Online, fully incremental
• Proven to converge under standard conditions on the step-size
• “Usually” faster than MC methods
From TD to TD(λ)
[Slide figure: backup diagrams over States and Rewards, ending at the Terminal state]
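TD(λ) interpolates between one-step TD and Monte-Carlo. A minimal sketch of TD(λ) prediction with accumulating eligibility traces (the three-state episode and all constants below are invented for illustration): each TD error is credited to every recently visited state, decayed by γλ per step.

```python
# TD(lambda) prediction with accumulating eligibility traces
# (states, episode, and constants are illustrative placeholders).
gamma, lam, alpha = 1.0, 0.8, 0.1
V = {s: 0.0 for s in ["A", "B", "C"]}
trace = {s: 0.0 for s in V}

episode = [("A", -1.0, "B"), ("B", -1.0, "C"), ("C", -1.0, None)]  # (x, r, x')
for x, r, x_next in episode:
    v_next = V[x_next] if x_next is not None else 0.0
    delta = r + gamma * v_next - V[x]      # one-step TD error
    trace[x] += 1.0                        # accumulating trace for x
    for s in V:
        V[s] += alpha * delta * trace[s]   # credit all eligible states
        trace[s] *= gamma * lam            # decay every trace
```

With λ = 0 this reduces to TD(0); with λ = 1 (and γ = 1) the total credit assigned to each state matches the Monte-Carlo return.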
SARSA & Q-learning

| TD-Learning | SARSA | Q-Learning |
|---|---|---|
| Policy type | On-Policy | Off-Policy |
| Estimates | Value function for the current policy | Value function for the optimal policy |
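The two update rules differ only in their bootstrap term: SARSA bootstraps from the action actually taken next, Q-learning from the greedy action. A minimal sketch (state and action names are placeholders):

```python
# SARSA vs Q-learning updates on a tabular Q (illustrative constants).
gamma, alpha = 0.9, 0.5

def sarsa_update(Q, x, a, r, x_next, a_next):
    """On-policy: bootstrap from the action actually taken next."""
    target = r + gamma * Q[(x_next, a_next)]
    Q[(x, a)] += alpha * (target - Q[(x, a)])

def q_learning_update(Q, x, a, r, x_next, actions):
    """Off-policy: bootstrap from the greedy (max-value) action."""
    target = r + gamma * max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] += alpha * (target - Q[(x, a)])
```

Because Q-learning's target ignores the behaviour policy's next action, it estimates the optimal value function even while exploring; SARSA's target tracks the policy actually being followed.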
GP Temporal Difference
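A hedged sketch of the GPTD idea (after Engel, Mannor and Engel's co-authors' formulation): place a Gaussian-process prior on the value function and condition on observed rewards through the linear relation r_t = V(x_t) - γV(x_{t+1}) + noise. The kernel, states, rewards, and hyperparameters below are all illustrative, not from the talk.

```python
import numpy as np

# GPTD sketch: GP prior V ~ N(0, K), observations r = H V + noise, where
# each row of H encodes the temporal difference e_t - gamma * e_{t+1}.
gamma, sigma2 = 0.9, 0.1
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # visited states x_0..x_3
r = np.array([1.0, 0.5, 2.0])                # rewards r_0..r_2

def kernel(A, B, length=1.0):
    """Squared-exponential kernel between two sets of states."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

T = len(r)
K = kernel(X, X)                             # prior covariance of V at X
H = np.zeros((T, T + 1))                     # H[t] = e_t - gamma * e_{t+1}
for t in range(T):
    H[t, t], H[t, t + 1] = 1.0, -gamma

# GP posterior mean of V at the visited states given r = H V + noise:
G = H @ K @ H.T + sigma2 * np.eye(T)
v_mean = K @ H.T @ np.linalg.solve(G, r)
```

The payoff over point estimates is that the same posterior also yields a covariance over values, i.e. a Bayesian measure of uncertainty in the learned value function.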