Transcript of "Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight" (slides: ai.stanford.edu/~pabbeel//presentations/job-talk-slides.pdf)
Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion
and Autonomous Helicopter Flight
Pieter Abbeel
Stanford University
(Variation hereof) presented at Cornell/USC/UCSD/Michigan/UNC/Duke/UCLA/UW/EPFL/Berkeley/CMU
Winter/Spring 2008
In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.
Big picture and key challenges
The dynamics model P_sa is a probability distribution over next states given the current state and action. The reward function R describes the desirability of being in a state. Reinforcement learning / optimal control turns these into a controller/policy π, which prescribes the action to take in each state.

- Key challenges:
  - Providing a formal specification of the control task.
  - Building a good dynamics model.
  - Finding closed-loop controllers.
Overview
- Apprenticeship learning algorithms
  - Leverage expert demonstrations to learn to perform a desired task.
  - Formal guarantees on:
    - Running time
    - Sample complexity
    - Performance of the resulting controller
  - Enabled us to solve highly challenging, previously unsolved, real-world control problems in:
    - Quadruped locomotion
    - Autonomous helicopter flight
Example task: driving
Problem setup
- Input:
  - Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  - No reward function
  - Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, … (= trace of the teacher's policy π*)
- Desired output:
  - Policy π which (ideally) has performance guarantees, i.e., expected sum of discounted rewards under R* within ε of the teacher's.
  - Note: R* is unknown.
Prior work: behavioral cloning
- Formulate as a standard machine learning problem:
  - Fix a policy class, e.g., support vector machine, neural network, decision tree, deep belief net, …
  - Estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), …
- E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
Prior work: behavioral cloning
- Limitations:
  - Fails to provide strong performance guarantees.
  - Underlying assumption: policy simplicity.
Problem structure
- Dynamics model P_sa.
- Reward function R: often fairly succinct.
- Controller/policy π: prescribes the action to take in each state; typically very complex.
Reinforcement learning / optimal control derives the controller/policy from the dynamics model and reward function.
Apprenticeship learning [Abbeel & Ng, 2004]
- Assume the reward is linear in known features: R_w(s) = wᵀφ(s).
- Initialize: pick some controller π_0.
- Iterate for i = 1, 2, …:
  - "Guess" the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers. (The key idea: learn through reward functions rather than directly learning policies.)
  - Find the optimal control policy π_i for the current guess of the reward function R_w.
  - If the teacher no longer significantly outperforms the controllers found so far, exit the algorithm: there is no reward function for which the teacher significantly outperforms the thus-far found policies.
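The iteration above can be sketched in code. The following is a minimal illustration on an invented 5-state chain MDP with one-hot state features, using the projection variant of the algorithm; value iteration stands in for the generic optimal-control step, and all constants are illustrative, not the actual systems from the talk:

```python
import numpy as np

# Tiny deterministic chain MDP: states 0..4, actions 0 (left) / 1 (right).
N_S, GAMMA, HORIZON = 5, 0.9, 50

def step(s, a):
    return min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)

def phi(s):
    f = np.zeros(N_S); f[s] = 1.0; return f  # one-hot state features

def optimal_policy(w):
    # Approximate value iteration for reward R(s) = w . phi(s).
    V = np.zeros(N_S)
    for _ in range(HORIZON):
        V = np.array([w @ phi(s) + GAMMA * max(V[step(s, 0)], V[step(s, 1)])
                      for s in range(N_S)])
    return [int(V[step(s, 1)] >= V[step(s, 0)]) for s in range(N_S)]

def feature_expectations(policy, s0=0):
    # Discounted feature counts of a policy's trajectory from s0.
    mu, s = np.zeros(N_S), s0
    for t in range(HORIZON):
        mu += GAMMA ** t * phi(s)
        s = step(s, policy[s])
    return mu

mu_E = feature_expectations([1] * N_S)        # teacher: always move right
mu_bar = feature_expectations([0] * N_S)      # initial controller pi_0: always left
for _ in range(30):
    w = mu_E - mu_bar                          # "guess" the reward weights
    if np.linalg.norm(w) < 1e-6:
        break                                  # teacher no longer outperforms
    mu_i = feature_expectations(optimal_policy(w))
    a = mu_i - mu_bar                          # projection step (Abbeel & Ng, 2004)
    mu_bar = mu_bar + (a @ (mu_E - mu_bar)) / (a @ a) * a
print(np.linalg.norm(mu_E - mu_bar))           # small: matches the teacher
```

On this toy problem the loop terminates after matching the teacher's feature expectations, at which point the learned controller performs as well as the teacher for any reward linear in these features.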
Theoretical guarantees
- Guarantee w.r.t. the unrecoverable reward function of the teacher.
- Sample complexity does not depend on the complexity of the teacher's policy π*.
Related work
- Prior work:
  - Behavioral cloning (covered earlier).
  - Utility elicitation / inverse reinforcement learning: Ng & Russell, 2000.
  - No strong performance guarantees.
- Closely related later work: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.
- Work on specialized reward functions, namely trajectories: e.g., Atkeson & Schaal, 1997.
Highway driving
Videos: teacher in training world; learned policy in testing world.
- Input:
  - Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
  - Teacher's demonstration: 1 minute in the "training world"
  - Note: R* is unknown.
  - Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances.
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.
Parking lot navigation [Abbeel et al., submitted]
- Reward function trades off:
  - curvature,
  - smoothness,
  - distance to obstacles,
  - alignment with principal directions.
Experimental setup
- Demonstrate parking lot navigation on a "train parking lot."
- Run our apprenticeship learning algorithm to find a set of reward weights w.
- Receive a "test parking lot" map plus starting point and destination.
- Find a policy for navigating the test parking lot.
Learned reward weights
Learned controller
Quadruped [Kolter, Abbeel & Ng, 2008]
- Reward function trades off 25 features.
Experimental setup
- Demonstrate a path across the "training terrain."
- Run our apprenticeship learning algorithm to find a set of reward weights w.
- Receive the "testing terrain": a height map.
- Find a policy for crossing the testing terrain.
Learned reward weights
Without learning
With learned reward function
Apprenticeship learning
From the teacher's flight (s_0, a_0, s_1, a_1, …), learn the reward function R; together with the dynamics model P_sa, reinforcement learning / optimal control yields the controller π.
Motivating example
- One route to an accurate dynamics model P_sa: a textbook model plus specification.
- Alternative: collect flight data and learn the model from the data.
  - How to fly for data collection?
  - How to ensure that the entire flight envelope is covered?
Aggressive (manual) exploration
Desired properties
- Never any explicit exploration (neither manual nor autonomous).
- Near-optimal performance (compared to the teacher).
- Small number of teacher demonstrations.
- Small number of autonomous trials.
Apprenticeship learning of the model
From the teacher's flight (s_0, a_0, s_1, a_1, …) and from autonomous flights, learn the dynamics model P_sa; with the reward function R, reinforcement learning / optimal control yields the controller π.
Theoretical guarantees
No explicit exploration required.
[Abbeel & Ng, 2005]
Apprenticeship learning summary
From the teacher's flight (s_0, a_0, s_1, a_1, …), learn both the dynamics model P_sa (together with data from autonomous flights) and the reward function R; reinforcement learning / optimal control then yields the controller π.
Other relevant parts of the story
- Learning the dynamics model from data
  - Locally weighted models
  - Exploiting structure from physics
  - Simulation accuracy at time-scales relevant for control
- Reinforcement learning / optimal control
  - Model predictive control
  - Receding-horizon differential dynamic programming
[Abbeel et al. 2005, 2006a, 2006b, 2007]
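The receding-horizon idea can be illustrated with a minimal sketch that re-solves a finite-horizon LQR problem at every step and applies only the first control. The double-integrator dynamics and cost weights below are invented for illustration; the actual helicopter work used differential dynamic programming on the learned nonlinear model:

```python
import numpy as np

# Receding-horizon LQR on a double integrator (a linear stand-in for
# receding-horizon DDP on the helicopter model).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # position/velocity dynamics
B = np.array([[0.0], [dt]])
Q = np.eye(2)                            # penalize state deviation
R = np.array([[0.1]])                    # penalize control effort

def finite_horizon_gains(H):
    # Backward Riccati recursion for a horizon of H steps.
    P, Ks = Q.copy(), []
    for _ in range(H):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        Ks.append(K)
    return Ks[::-1]

x = np.array([[2.0], [0.0]])             # start 2 m from the target
for _ in range(100):
    K0 = finite_horizon_gains(H=20)[0]   # re-plan over a short horizon
    u = -K0 @ x                          # apply only the first control
    x = A @ x + B @ u
print(float(np.linalg.norm(x)))          # near zero: regulated to the target
```

Re-planning every step is wasteful for a time-invariant linear system, but it is exactly what makes the approach robust when the model is relearned or the reference trajectory changes online.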
Related work
- Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry, 2004a (2001).
- Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al., 2004.
- Gavrilets, Martinos, Mettler & Feron, 2002; Ng et al., 2004b.
- The maneuvers presented here are significantly more challenging than those flown by any other autonomous helicopter.
Autonomous aerobatic flips (attempt) before apprenticeship learning
Task description: meticulously hand-engineered. Model: learned from (commonly used) frequency-sweep data.
Experimental setup for helicopter
1. Our expert pilot demonstrates the airshow several times.
Demonstrations
Experimental setup for helicopter
1. Our expert pilot demonstrates the airshow several times.
2. Learn a reward function: a target trajectory.
3. Learn a dynamics model.
Learned reward (trajectory)
Experimental setup for helicopter
1. Our expert pilot demonstrates the airshow several times.
2. Learn a reward function: a target trajectory.
3. Learn a dynamics model.
4. Find the optimal control policy for the learned reward and dynamics model.
5. Autonomously fly the airshow.
6. Learn an improved dynamics model. Go back to step 4.
Accuracy
White: target trajectory. Black: autonomously flown trajectory.
Summary
- Apprenticeship learning algorithms
  - Learn to perform a task from observing expert demonstrations of the task.
  - Formal guarantees on:
    - Running time.
    - Sample complexity.
    - Performance of the resulting controller.
  - Enabled us to solve highly challenging, previously unsolved, real-world control problems.
Current and future work
- Applications:
  - Autonomous helicopters to assist in wildland fire fighting.
  - Fixed-wing formation flight: estimated fuel savings for a three-aircraft formation: 20%.
- Learning from demonstrations only scratches the surface of the potential impact of work at the intersection of machine learning and control on robotics:
  - Safe autonomous learning.
  - More general advice taking.
Thank you.
Chaos (short)
Chaos (long)
Flips
Auto-rotation descent
Tic-toc
Autonomous nose-in funnel
Model Learning: Proof Idea
- From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the state space (s, a) visited by the pilot.
- Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.
- Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.
- Thus, each time we solve for the optimal controller using the current model/simulator P_sa, we will find a controller that successfully flies the helicopter according to P_sa.
- If, on the actual helicopter, this controller fails to fly the helicopter despite the model P_sa predicting that it should, then it must be visiting parts of the state space that are inaccurately modeled.
- Hence, we get useful training data to improve the model. This can happen only a small number of times.
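The argument can be illustrated with a toy version of the loop: fit a model to the teacher's data, compute the controller that is optimal under the fitted model, and add data from any run where the model's predictions fail. The scalar dynamics, teacher controller, and thresholds below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) scalar dynamics; the learner only sees sampled transitions.
def true_step(x, u):
    return 0.9 * x + 0.5 * u

def rollout(policy, x0=5.0, T=30):
    xs, us = [x0], []
    for _ in range(T):
        u = policy(xs[-1])
        us.append(u)
        xs.append(true_step(xs[-1], u))
    return np.array(xs), np.array(us)

def fit_model(data):
    # Least-squares fit of x' = a x + b u from all transitions seen so far.
    X = np.array([[x, u] for x, u, _ in data])
    y = np.array([xn for _, _, xn in data])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b

teacher = lambda x: -1.6 * x + 0.1 * rng.standard_normal()  # noisy demonstration
xs, us = rollout(teacher)
data = list(zip(xs[:-1], us, xs[1:]))

for _ in range(5):
    a, b = fit_model(data)
    controller = lambda x, a=a, b=b: -(a / b) * x   # deadbeat for the fitted model
    xs, us = rollout(controller)
    if abs(xs[-1]) < 1e-3:
        break                                       # succeeds on the real system
    data += list(zip(xs[:-1], us, xs[1:]))          # new data from the failure
```

Here the true system happens to be linear, so the teacher's data identifies it immediately and the loop exits on the first pass; in general, each failed run contributes data from a badly modeled region, and the slide's argument bounds how often that can happen.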
- Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
- Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
- Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
- Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
- Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
- An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
- Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.
Helicopter dynamics model in auto
Performance guarantee intuition
- Intuition by example:
  - Suppose the teacher's reward is a weighted sum of features with unknown weights.
  - If the returned controller π matches the teacher's expected feature counts,
  - then no matter what the values of the unknown weights are, the controller π performs as well as the teacher's controller π*.
Probabilistic graphical model for multiple demonstrations
Full model
Learning algorithm
- Step 1: find the time-warping and the distributional parameters.
  - We use EM, together with dynamic time warping, to alternately optimize over the different parameters.
- Step 2: find the intended trajectory.
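Dynamic time warping itself is a small dynamic program. A minimal sketch follows; the scalar trajectories and absolute-difference cost are illustrative choices, not the actual trajectory representation used for the helicopter:

```python
import numpy as np

def dtw(x, y):
    # Dynamic-programming alignment cost between sequences x and y:
    # D[i, j] = cost of the best monotone alignment of x[:i] with y[:j].
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a step of x
                                 D[i, j - 1],      # skip a step of y
                                 D[i - 1, j - 1])  # match both
    return D[n, m]

fast = [0.0, 1.0, 2.0, 1.0, 0.0]
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]  # same shape, slower
print(dtw(fast, slow))  # → 0.0: identical shape despite different timing
```

This is why two demonstrations of the same maneuver flown at slightly different speeds can still be aligned before averaging them into an intended trajectory.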
Teacher demonstration for quadruped
- Full teacher demonstration = sequence of footsteps.
- Much simpler to "teach hierarchically":
  - Specify a body path.
  - Specify the best footstep in a small area.
Hierarchical inverse RL
- Quadratic programming problem (QP): quadratic objective, linear constraints.
- Constraint generation for path constraints.
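A toy sketch of constraint generation for the footstep constraints: repeatedly find the most-violated margin constraint and enforce it. Here a perceptron-style update stands in for the QP, and the candidate footsteps, features, and labels are all invented for illustration:

```python
import numpy as np

# Hypothetical footstep features: [terrain slope, deviation from body path].
# The teacher labels the best footstep among the candidates at each step.
examples = [
    # (candidate feature vectors, index of the teacher's choice)
    (np.array([[0.1, 0.2], [0.9, 0.1], [0.2, 0.8]]), 0),
    (np.array([[0.8, 0.9], [0.1, 0.1], [0.7, 0.2]]), 1),
    (np.array([[0.3, 0.9], [0.9, 0.3], [0.1, 0.2]]), 2),
]

w = np.zeros(2)
for _ in range(100):
    violated = False
    for feats, best in examples:
        costs = feats @ w
        rival = min((i for i in range(len(feats)) if i != best),
                    key=lambda i: costs[i])       # most-violated constraint
        if costs[best] > costs[rival] - 1.0:      # margin not satisfied
            w += feats[rival] - feats[best]       # perceptron step, not the QP
            violated = True
    if not violated:
        break                                     # all generated constraints hold

held_out = np.array([[0.2, 0.1], [0.9, 0.8]])
print(int(np.argmin(held_out @ w)))  # → 0: the flat, on-path footstep
```

The structure mirrors the QP formulation on the slide: each generated constraint says the teacher's footstep must cost at least a margin less than any rival, and the loop terminates once no constraint is violated.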
Experimental setup
- Training:
  - Have the quadruped walk straight across a fairly simple board with fixed-spacing foot placements.
  - Around each foot placement: label the best foot placement (about 20 labels).
  - Label the best body path for the training board.
  - Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
- Test on hold-out terrains:
  - Plan a path across the test board.