An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning
Haitham Bou Ammar Eric Eaton
Gerhard Weiss Kurt Driessens Karl Tuyls
Decebal Mocanu Matthew Taylor
Introduction
Reinforcement learning (RL) is a key technique for learning through interaction with the environment
Problem Definition:
RL problems are formalized as Markov Decision Processes (MDPs) $\langle S, A, P, R, \gamma \rangle$:
• $S$: State Space
• $A$: Action Space
• $P$: Transition Probability
• $R$: Reward Function
• $\gamma$: Discount Factor
Goal
Learn optimal policy by maximizing
$$Q(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right]$$
2 Bou Ammar, Eaton, et al.
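The discounted sum inside the expectation for $Q(s, a)$ can be illustrated for a single finite reward sequence. This is a minimal sketch, not code from the slides: the function name `discounted_return` and the example rewards are assumptions made for illustration, and the infinite sum is truncated at the length of the sequence.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * R_t for a finite list of rewards.

    Hypothetical helper: truncates the infinite-horizon sum at len(rewards).
    """
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many rollouts that start in state `s` with action `a` would give a Monte Carlo estimate of the expectation.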
Motivation
Problem: Reinforcement learners are slow to learn
Possible Solution: Reuse knowledge from other sources
• Learning from Demonstration
• Transfer Learning
Impressive Results
Transfer Learning
New target task; pool of source tasks from the same domain
Questions to answer:
1. How to transfer? (lots of approaches)
2. What to transfer? (lots of approaches)
3. When to transfer? (less progress has been achieved; needs a task similarity measure)
RBDist: Similarity Measure Between MDPs
Our measure is based on Restricted Boltzmann Machines (RBMs):
• Set of visible units (Visible Layer)
• Set of hidden units (Hidden Layer)
• RBM Energy Function
• Probability distribution
Weights are trained using contrastive divergence
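The energy function and probability distribution named on this slide did not survive extraction. For orientation only, the textbook form for an RBM with binary units is given here. It may differ from the variant the authors used (for example, continuous states are often modeled with Gaussian visible units). The symbols $v$, $h$, $a$, $b$, $W$, and $Z$ are assumptions, not notation recovered from the slide.

$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i W_{ij} h_j$$

$$p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-E(v, h)}$$

Here $v$ and $h$ are the visible and hidden unit vectors, $a$ and $b$ their biases, $W$ the weight matrix, and $Z$ the normalizing constant.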
Step 1: Train an RBM to approximate the source task’s dynamics
• Sampled trajectories capturing source dynamics
• Separate into i.i.d. tuples $\langle s, a, s' \rangle$ and train RBM
The RBM learns a generative model that captures the source dynamics.
Key Idea: If the dynamics of a source and target domain are similar, the RBM trained on the source task should be able to reconstruct trajectories from the target task.
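Separating trajectories into $\langle s, a, s' \rangle$ tuples can be sketched as follows. This is an illustrative sketch only: the function name `trajectories_to_tuples` and the trajectory layout (a pair of a state array with T+1 rows and an action array with T rows) are assumptions, since the slides do not specify a data format.

```python
import numpy as np


def trajectories_to_tuples(trajectories):
    """Flatten trajectories into rows of the form [s, a, s'].

    Assumed layout (hypothetical): each trajectory is (states, actions),
    where states has T+1 entries and actions has T entries.
    """
    rows = []
    for states, actions in trajectories:
        states = np.asarray(states, dtype=float)
        actions = np.asarray(actions, dtype=float)
        # Force 2-D so scalar states/actions become single-column arrays.
        states = states.reshape(len(states), -1)
        actions = actions.reshape(len(actions), -1)
        # Pair each state with the action taken there and the next state.
        rows.append(np.hstack([states[:-1], actions, states[1:]]))
    return np.vstack(rows)


# One toy trajectory with scalar states 0 -> 1 -> 2 and two actions.
tuples = trajectories_to_tuples([([0.0, 1.0, 2.0], [10.0, 20.0])])
```

Each row of the result is one training example for the RBM; treating the rows as i.i.d. discards their order within the trajectory.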
Step 2: Reconstruct target task trajectories by sampling the trained RBM
• Input: trajectories from target task
• Sampling through the visible and hidden layers of the trained RBM
• Output: reconstruction of target trajectories based on source task’s dynamics
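Step 3: Measure reconstruction error of sampled target trajectories as RBDist
$$\text{RBDist} = \frac{1}{n} \sum_{k=1}^{n} e_k, \qquad e_k = L_2\left( \left\langle s_2^{(k)}, a_2^{(k)}, {s'}_2^{(k)} \right\rangle_0,\ \left\langle s_2^{(k)}, a_2^{(k)}, {s'}_2^{(k)} \right\rangle_1 \right)$$
where subscript 0 marks the original tuple and subscript 1 the reconstructed tuple.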
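The three steps can be sketched end to end with a very small RBM. This is a sketch under stated assumptions, not the authors' implementation: `TinyRBM` and `rbdist` are hypothetical names, the RBM uses binary (Bernoulli) units with one-step contrastive divergence, tuples are assumed to be rescaled to [0, 1], and reconstruction uses a deterministic mean-field pass rather than sampling. The toy source and target data are invented for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)

    def fit(self, data, epochs=200, lr=0.1):
        n = len(data)
        for _ in range(epochs):
            # Positive phase: hidden activations driven by the data.
            h0 = self.hidden_probs(data)
            h_sample = (self.rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one Gibbs step back to the visible layer.
            v1 = self.visible_probs(h_sample)
            h1 = self.hidden_probs(v1)
            self.W += lr * (data.T @ h0 - v1.T @ h1) / n
            self.a += lr * (data - v1).mean(axis=0)
            self.b += lr * (h0 - h1).mean(axis=0)
        return self

    def reconstruct(self, v):
        # Mean-field pass: visible -> hidden -> visible (no sampling).
        return self.visible_probs(self.hidden_probs(v))


def rbdist(rbm, target_tuples):
    """Mean L2 distance between target tuples and their reconstructions."""
    recon = rbm.reconstruct(target_tuples)
    errors = np.linalg.norm(target_tuples - recon, axis=1)  # e_k per tuple
    return float(errors.mean())


# Toy data (invented): source tuples follow one fixed pattern.
source = np.tile([1.0, 0.0, 1.0, 0.0, 1.0, 0.0], (20, 1))
similar_target = source.copy()
dissimilar_target = 1.0 - source

rbm = TinyRBM(n_visible=6, n_hidden=4).fit(source)   # Step 1
d_similar = rbdist(rbm, similar_target)               # Steps 2 and 3
d_dissimilar = rbdist(rbm, dissimilar_target)
```

On this toy data the target that matches the source pattern gets the smaller distance, which is the behavior the key idea predicts. Real state and action values are continuous, so a practical version would need either rescaling, as assumed here, or an RBM variant with real-valued visible units.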
Experiments & Results
Dynamical Systems & Benchmarks
• Inverted Pendulum: Swing and balance pole upright by applying torques
• Cart Pole: Balance pole upright by applying linear forces
• Mountain Car: Control car to reach goal by oscillating around the valley
Results: Dynamical Phases
RBDist can automatically cluster dynamical phases
[Figure panels: Inverted Pendulum, Mountain Car, Cart Pole]
Results: Transfer Performance
RBDist correlates with transfer performance
[Figure: "Jump Start" reward plots for Inverted Pendulum, Mountain Car, and Cart Pole; axis labels "Different Cartpoles" and "Reward"]
Thank you!
This work was supported in part by ONR N00014-11-1-0139,
AFOSR FA8750-14-1-0069, and NSF IIS-1149917.
Please send correspondence to:
Haitham Bou Ammar Eric Eaton [email protected] [email protected]