Pieter Abbeel and Andrew Y. Ng
Reinforcement Learning and
Apprenticeship Learning
Pieter Abbeel and Andrew Y. Ng
Stanford University
Example of Reinforcement Learning (RL) Problem
Highway driving.
Reinforcement Learning (RL) formalism
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π
max_π E[R(s0) + … + R(sT)]
RL formalism
• Assume that at each time step, our system is in some state st.
• Upon taking an action a, our state randomly transitions to some new state st+1.
• We are also given a reward function R.
• The goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].
s0 → s1 → s2 → … → sT-1 → sT (each arrow: one step of the system dynamics)
R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT)
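The expected score E[R(s0) + … + R(sT)] can be estimated by simple Monte Carlo rollouts. A minimal sketch on a made-up two-state chain (not from the slides), comparing the rollout estimate against the exact value computed by propagating the state distribution:

```python
import numpy as np

# Toy 2-state Markov chain (a fixed policy's induced dynamics, made up here).
# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([0.0, 1.0])  # reward depends only on the state
T = 10                    # horizon
s0 = 0

# Exact value of E[R(s0) + ... + R(sT)] via the state distribution.
d = np.zeros(2); d[s0] = 1.0
exact = 0.0
for t in range(T + 1):
    exact += d @ R
    d = d @ P

# Monte Carlo estimate from rollouts.
rng = np.random.default_rng(0)
n_rollouts = 5000
total = 0.0
for _ in range(n_rollouts):
    s = s0
    for t in range(T + 1):
        total += R[s]
        s = rng.choice(2, p=P[s])
mc = total / n_rollouts

print(exact, mc)
```

The two numbers agree up to Monte Carlo noise; the exact computation is what dynamic-programming RL methods exploit when the model P is known.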
RL formalism
• Markov Decision Process (S, A, P_sa, s0, R)
• W.l.o.g. we assume R(s) = w^T φ(s) for a known feature mapping φ : S → R^k.
• Policy π: a mapping from states to actions.
• Utility of a policy π for reward R = w^T φ(s): U_w(π) = E[R(s0) + … + R(sT) | π] = w^T μ(π), where μ(π) = E[φ(s0) + … + φ(sT) | π] are the feature expectations of π.
RL formalism
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π
max_π E[R(s0) + … + R(sT)]
Part I: Apprenticeship learning via inverse reinforcement learning
Motivation
Reinforcement learning (RL) gives powerful tools for solving MDPs, but it can be difficult to specify the reward function. Example: highway driving.
Apprenticeship Learning
• Learning from observing an expert.
• Previous work:
– Learn to predict expert’s actions as a function of states.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)
• Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy with performance as good as the expert's, as measured by the expert's unknown reward function.
Algorithm
For t = 1, 2, …
Inverse RL step:
Estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
RL step:
Compute the optimal policy π_t for the estimated reward weights w.
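The two alternating steps can be sketched in a few lines. This is only a toy illustration under loud assumptions: the RL step is replaced by a stand-in that picks the best of a small, made-up set of candidate policies (each represented directly by its feature expectations), and the inverse RL step uses the simplified projection form described later in the deck:

```python
import numpy as np

# Expert's feature expectations and a few candidate policies' feature
# expectations -- all numbers made up for illustration.
mu_E = np.array([0.4, 0.5])
candidates = [np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              np.array([0.0, 0.0]),
              np.array([0.5, 0.5])]

def rl_step(w):
    """Stand-in for 'compute the optimal policy for reward R = w^T phi'."""
    return max(candidates, key=lambda mu: w @ mu)

mu_bar = rl_step(np.ones(2))            # initialize with an arbitrary policy
for i in range(20):
    w = mu_E - mu_bar                   # inverse RL step (projection version)
    if np.linalg.norm(w) < 1e-6:
        break
    mu_new = rl_step(w)                 # RL step
    # Project mu_E onto the segment between mu_bar and mu_new.
    d = mu_new - mu_bar
    alpha = np.clip(d @ (mu_E - mu_bar) / (d @ d), 0.0, 1.0)
    mu_bar = mu_bar + alpha * d

gap = np.linalg.norm(mu_E - mu_bar)
print(gap)
```

When mu_E lies in the convex hull of the candidates' feature expectations, the gap shrinks geometrically, mirroring the convergence guarantee stated two slides later.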
Algorithm: Inverse RL step
Feature Expectation Closeness and Performance
If we can find a policy π such that
‖μ(π_E) − μ(π)‖₂ ≤ ε,
then for any underlying reward R*(s) = w*^T φ(s) with ‖w*‖₂ ≤ 1, we have
|U_w*(π_E) − U_w*(π)| = |w*^T μ(π_E) − w*^T μ(π)| ≤ ‖w*‖₂ ‖μ(π_E) − μ(π)‖₂ ≤ ε.
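The inequality chain above is just the Cauchy–Schwarz inequality; a quick numerical sanity check on random vectors (dimension and ε chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
max_gap = 0.0
for _ in range(1000):
    # Random reward weights with ||w*||_2 <= 1.
    w = rng.normal(size=4)
    w /= max(1.0, np.linalg.norm(w))
    # Two feature-expectation vectors within eps of each other.
    mu_E = rng.uniform(size=4)
    delta = rng.normal(size=4)
    delta *= eps * rng.uniform() / np.linalg.norm(delta)
    mu = mu_E - delta
    # |U_{w*}(pi_E) - U_{w*}(pi)| = |w* . (mu(pi_E) - mu(pi))|
    max_gap = max(max_gap, abs(w @ mu_E - w @ mu))
print(max_gap)
```

No matter which admissible w* is drawn, the utility gap never exceeds ε, which is exactly why matching feature expectations suffices without knowing the true reward.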
Theoretical Results: Convergence
Theorem. Let an MDP (without reward function), a k-dimensional feature mapping φ, and the expert's feature expectations μ(π_E) be given. Then after at most
O(kT²/ε²)
iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e.,
U_w*(π) ≥ U_w*(π_E) − ε.
Gridworld Experiments
Reward function is piecewise constant over small regions. Features for IRL are these small regions.
128x128 grid, small regions of size 16x16.
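As a small-scale analogue of this setup (a 4x4 grid with 2x2 regions rather than 128x128 with 16x16, and a random-walk policy standing in for a learned one), the following sketch computes feature expectations μ(π) exactly by propagating the state distribution, with the region indicators as features:

```python
import numpy as np

n = 4          # 4x4 grid, regions of size 2x2 (scaled-down, made-up instance)
T = 30         # horizon
n_states = n * n

def region(s):
    r, c = divmod(s, n)
    return (r // 2) * 2 + (c // 2)   # which 2x2 block the cell lies in

# Features: indicator of the region containing the state.
phi = np.zeros((n_states, 4))
for s in range(n_states):
    phi[s, region(s)] = 1.0

# Random-walk policy: move up/down/left/right uniformly, clamped to the grid.
P = np.zeros((n_states, n_states))
for s in range(n_states):
    r, c = divmod(s, n)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        r2 = min(max(r + dr, 0), n - 1)
        c2 = min(max(c + dc, 0), n - 1)
        P[s, r2 * n + c2] += 0.25

# Feature expectations mu(pi) = sum_t E[phi(s_t)], starting from state 0.
d = np.zeros(n_states); d[0] = 1.0
mu = np.zeros(4)
for t in range(T + 1):
    mu += d @ phi
    d = d @ P
print(mu)
```

Because the region indicators partition the states, the components of μ(π) sum to T+1; starting in the top-left region, that region accumulates the largest share of expected visits.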
Case study: Highway driving
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
Input: Driving demonstration Output: Learned behavior
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Conclusions for Part I
• Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
• The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.
• The algorithm exploits reward “simplicity” (vs. policy “simplicity” in previous approaches).
Part II: Apprenticeship learning for learning the transition model
Learning the dynamics model Psa from data
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π
max_π E[R(s0) + … + R(sT)]
Estimate P_sa from data.
Transition model
• Consider the problem of controlling a complicated system like a helicopter.
• No models are available that specify the dynamics accurately as a function of the helicopter's specifications.
• So we need to estimate the dynamics from data.
• We have to collect enough data to model all relevant parts of the flight envelope.
Collecting data to learn dynamical model
State of the art: the E³ algorithm (Kearns and Singh, 2002).
Have a good model of the dynamics?
YES → “Exploit”
NO → “Explore”
Learning the dynamics
Expert human pilot flight: (a1, s1, a2, s2, a3, s3, …) → learn P_sa
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π
max_π E[R(s0) + … + R(sT)]
Autonomous flight: (a1, s1, a2, s2, a3, s3, …) → re-learn P_sa
Apprenticeship learning of model
Theorem. Suppose that we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number N of iterations of testing/re-learning, with high probability, we will obtain a policy π whose performance is comparable to the expert's:
U(π) ≥ U(π_E) − ε
Thus, so long as a demonstration is available, it isn't necessary to explicitly explore.
Proof idea
• From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot.
• Our model/simulator will correctly predict the helicopter's behavior under the pilot's policy π_E.
• Consequently, there is at least one policy (namely π_E) that looks like it is able to fly the helicopter well in our simulation.
• Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa.
• If, on the actual helicopter, this policy fails to fly the helicopter, despite the model P_sa predicting that it should, then it must be visiting parts of the flight envelope that the model fails to accurately capture.
• Hence, this gives useful training data to model new parts of the flight envelope.
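The test/re-learn loop in this argument can be sketched on a made-up tabular "helicopter" (3 states, 2 actions, hypothetical dynamics, known reward): repeatedly fit P̂_sa from all flight data so far, plan in the fitted model, and fly the resulting policy, with no explicit exploration. After a few rounds the model's predicted performance closely tracks the real one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular stand-in for the helicopter: true dynamics are
# hidden from the learner, reward is known.
S, A, T = 3, 2, 8
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # P_true[s, a] = next-state dist.
R = np.array([0.0, 0.5, 1.0])

counts = np.full((S, A, S), 0.1)                 # smoothed transition counts

def plan(P_hat):
    """Finite-horizon DP: optimal (non-stationary) policy for the fitted model."""
    V = R.copy()
    pi = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        Q = R[:, None] + np.einsum('sap,p->sa', P_hat, V)
        pi[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V[0]                              # policy, predicted value from s0 = 0

def rollout(pi):
    s, total = 0, R[0]
    for t in range(T):
        s2 = rng.choice(S, p=P_true[s, pi[t, s]])
        counts[s, pi[t, s], s2] += 1             # every flight adds model data
        total += R[s2]
        s = s2
    return total

# Test/re-learn loop: always "exploit" the current model, never explicitly explore.
for it in range(30):
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    pi, predicted = plan(P_hat)
    for _ in range(20):
        rollout(pi)

real = np.mean([rollout(pi) for _ in range(500)])
print(predicted, real)
```

The data collected by exploiting concentrates exactly on the (s, a) pairs the final policy visits, which is why the fitted model's prediction ends up accurate where it matters.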
Conclusions
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π
max_π E[R(s0) + … + R(sT)]
Given expert demonstrations, our inverse RL algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
Given an initial demonstration, there is no need to explicitly explore the state/action space. Even if you repeatedly “exploit” (use your best policy), you will collect enough data to learn a sufficiently accurate dynamical model to carry out your control task.
Thanks for your attention.
Different Formulation
LP formulation of the RL problem:
max_λ Σ_{s,a} λ(s,a) R(s)
s.t. ∀s: Σ_a λ(s,a) = Σ_{s′,a} P(s|s′,a) λ(s′,a)

QP formulation of apprenticeship learning:
min_{λ,μ} Σ_i (μ_{E,i} − μ_i)²
s.t. ∀s: Σ_a λ(s,a) = Σ_{s′,a} P(s|s′,a) λ(s′,a)
     ∀i: μ_i = Σ_{s,a} φ_i(s) λ(s,a)
Different Formulation (ctd.)
Our algorithm is equivalent to iteratively
• linearizing the QP at the current point (inverse RL step), then
• solving the resulting LP (RL step).
Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality). Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.
Simplification of the Inverse RL step: QP → Euclidean projection
• In the Inverse RL step:
– set μ̄^(i−1) = the orthogonal projection of μ_E onto the line through μ̄^(i−2) and μ(π^(i−1)),
– set w^(i) = μ_E − μ̄^(i−1).
• Note: the theoretical results on convergence and sample complexity hold unchanged for the simpler algorithm.
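The projection step can be written directly. A minimal sketch with made-up vectors; the `np.clip` keeps μ̄ on the segment between the two points, a small safeguard not spelled out on the slide:

```python
import numpy as np

def projection_step(mu_E, mu_bar_prev, mu_new):
    """One simplified inverse-RL step: project mu_E onto the line through
    mu_bar_prev and mu_new, then point w at the remaining gap."""
    d = mu_new - mu_bar_prev
    alpha = d @ (mu_E - mu_bar_prev) / (d @ d)
    mu_bar = mu_bar_prev + np.clip(alpha, 0.0, 1.0) * d
    w = mu_E - mu_bar
    return mu_bar, w

# Made-up example vectors.
mu_E = np.array([0.4, 0.5, 0.2])
mu_bar_prev = np.array([1.0, 0.0, 0.0])
mu_new = np.array([0.0, 1.0, 0.3])
mu_bar, w = projection_step(mu_E, mu_bar_prev, mu_new)
print(mu_bar, w)
```

When no clipping occurs, the residual w is orthogonal to the line direction, and the projection can only move μ̄ closer to μ_E, which is the geometric heart of the convergence proof.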
Proof (sketch)
[Figure: feature-expectation space with axes μ1 and μ2, showing μ(π^(0)), w^(1), μ(π^(1)), μ(π_E), and the distances d0 and d1.]
Algorithm (projection version)
[Figure: feature-expectation space with axes μ1 and μ2, showing μ_E, μ(π^(0)), the weight vector w^(1), and μ(π^(1)).]
Appendix: Different View
Bellman LP for solving MDPs:
min_V c′V
s.t. ∀s, a: V(s) ≥ R(s,a) + Σ_{s′} P(s,a,s′) V(s′)

Dual LP:
max_λ Σ_{s,a} λ(s,a) R(s,a)
s.t. ∀s: c(s) − Σ_a λ(s,a) + Σ_{s′,a} P(s′,a,s) λ(s′,a) = 0

Apprenticeship learning as a QP:
min_λ Σ_i (μ_{E,i} − Σ_{s,a} λ(s,a) φ_i(s))²
s.t. ∀s: c(s) − Σ_a λ(s,a) + Σ_{s′,a} P(s′,a,s) λ(s′,a) = 0
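The dual-LP flow constraint can be checked numerically: on a small episodic MDP (made up here, with an absorbing terminal state), the expected visit counts λ(s,a) of any fixed policy satisfy c(s) − Σ_a λ(s,a) + Σ_{s′,a} P(s′,a,s) λ(s′,a) = 0 at every transient state:

```python
import numpy as np

# Tiny episodic MDP (made up): states {0, 1, 2}, state 2 absorbing.
# lambda(s, a) = expected number of times (s, a) occurs before absorption.
S, A = 3, 2
P = np.zeros((S, A, S))              # P[s, a, s'] transition probabilities
P[0, 0] = [0.5, 0.5, 0.0]
P[0, 1] = [0.0, 0.5, 0.5]
P[1, 0] = [0.2, 0.0, 0.8]
P[1, 1] = [0.0, 0.3, 0.7]
P[2, :, 2] = 1.0                     # absorbing
c = np.array([1.0, 0.0, 0.0])        # initial state distribution

pi = np.array([[0.5, 0.5],           # a fixed stochastic policy pi(a|s)
               [1.0, 0.0],
               [0.5, 0.5]])

# Expected visits d(s) over transient states solve d = c + P_pi^T d.
P_pi = np.einsum('sa,sap->sp', pi, P)        # state-to-state chain under pi
trans = [0, 1]                               # transient states
M = np.eye(2) - P_pi[np.ix_(trans, trans)].T
d = np.zeros(S)
d[trans] = np.linalg.solve(M, c[trans])
lam = d[:, None] * pi                        # lambda(s, a) = d(s) * pi(a|s)

# Flow constraint residual at each state:
#   c(s) - sum_a lambda(s, a) + sum_{s', a} P(s', a, s) lambda(s', a)
resid = c - lam.sum(axis=1) + np.einsum('sap,sa->p', P, lam)
print(resid)
```

The residual vanishes on the transient states; the LP/QP above search over all λ satisfying this conservation law, i.e. over occupancy measures of all policies at once.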
Car driving results (more detail)

| Style | Quantity | Collision | Offroad Left | Left Lane | Middle Lane | Right Lane | Offroad Right |
|---|---|---|---|---|---|---|---|
| 1 | Feature distr. (expert) | 0 | 0 | 0.1325 | 0.2033 | 0.5983 | 0.0658 |
| 1 | Feature distr. (learned) | 5.00E-05 | 0.0004 | 0.0904 | 0.2286 | 0.604 | 0.0764 |
| 1 | Weights learned | -0.0767 | -0.0439 | 0.0077 | 0.0078 | 0.0318 | -0.0035 |
| 2 | Feature distr. (expert) | 0.1167 | 0 | 0.0633 | 0.4667 | 0.47 | 0 |
| 2 | Feature distr. (learned) | 0.1332 | 0 | 0.1045 | 0.3196 | 0.5759 | 0 |
| 2 | Weights learned | 0.234 | -0.1098 | 0.0092 | 0.0487 | 0.0576 | -0.0056 |
| 3 | Feature distr. (expert) | 0 | 0 | 0 | 0.0033 | 0.7058 | 0.2908 |
| 3 | Feature distr. (learned) | 0 | 0 | 0 | 0 | 0.7447 | 0.2554 |
| 3 | Weights learned | -0.1056 | -0.0051 | -0.0573 | -0.0386 | 0.0929 | 0.0081 |
| 4 | Feature distr. (expert) | 0.06 | 0 | 0 | 0.0033 | 0.2908 | 0.7058 |
| 4 | Feature distr. (learned) | 0.0569 | 0 | 0 | 0 | 0.2666 | 0.7334 |
| 4 | Weights learned | 0.1079 | -0.0001 | -0.0487 | -0.0666 | 0.059 | 0.0564 |
| 5 | Feature distr. (expert) | 0.06 | 0 | 0 | 1 | 0 | 0 |
| 5 | Feature distr. (learned) | 0.0542 | 0 | 0 | 1 | 0 | 0 |
| 5 | Weights learned | 0.0094 | -0.0108 | -0.2765 | 0.8126 | -0.51 | -0.0153 |
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng
Stanford University