Inverse Reinforcement Learning
Chelsea Finn
3/5/2017
Course Reminders:
- March 22nd: Project group & title due
- April 17th: Milestone report due & milestone presentations
- April 26th: Beginning of project presentations
Inverse RL: Outline
1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions
[Figure: a reinforcement learning agent receiving a reward (Mnih et al. '15); video from Montessori New Zealand]
What is the reward?
In the real world, humans don't get a score.
A reward function is essential for RL (Kohl & Stone '04; Tesauro '95; Mnih et al. '15; Silver et al. '16).
In real-world domains, the reward/cost is often difficult to specify:
• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…
Motivation
Behavioral Cloning: mimic actions of expert
- but no reasoning about outcomes or dynamics
Can we reason about what the expert is trying to achieve?
- the expert might have different degrees of freedom
Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer cost/reward function from expert demonstrations (Kalman '64, Ng & Russell '00)
Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations

Given:
- state & action space
- roll-outs from π*
- dynamics model [sometimes]

Goal:
- recover reward function
- then use reward to get policy

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

Compare to DAgger: no direct access to π*
Early IRL Approaches
All: alternate between solving the MDP w.r.t. the current cost and updating the cost
- Ng & Russell '00: expert actions should have higher value than other actions; a larger gap is better
- Abbeel & Ng '04: the policy that is optimal w.r.t. the learned cost should match the feature counts of the expert trajectories (see the feature-matching sketch below)
- Ratliff et al. '06: max-margin formulation between the value of expert actions and other actions
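To make the feature-matching idea concrete, here is a minimal tabular sketch in the spirit of Abbeel & Ng '04. It is not the original algorithm (which uses a more careful projection / max-margin step); the tiny value-iteration solver, the helper names, and the simplified weight update are all illustrative assumptions.

```python
# Minimal feature-matching IRL sketch (in the spirit of Abbeel & Ng '04).
# All names and the simplified weight update are illustrative, not the original algorithm.
import numpy as np

def value_iteration(P, r, gamma=0.95, iters=200):
    """P: [A, S, S] transition probabilities, r: [S] per-state reward.
    Returns a greedy deterministic policy as an array of actions, one per state."""
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r[None, :] + gamma * (P @ V)   # Q[a, s] = r(s) + gamma * E[V(s')]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

def feature_expectations(P, policy, features, s0, gamma=0.95, horizon=100):
    """Discounted expected feature counts mu = E[sum_t gamma^t f(s_t)] under a policy."""
    S, K = features.shape
    d = np.zeros(S); d[s0] = 1.0             # state distribution at time t
    mu = np.zeros(K)
    for t in range(horizon):
        mu += (gamma ** t) * (d @ features)
        P_pi = P[policy, np.arange(S), :]     # [S, S] transition matrix under the policy
        d = d @ P_pi
    return mu

def feature_matching_irl(P, features, mu_expert, s0, n_iters=20):
    """Alternate between solving the MDP for the current linear reward and
    nudging the reward weights toward the expert's unmatched feature counts."""
    w = np.zeros(features.shape[1])
    for _ in range(n_iters):
        policy = value_iteration(P, features @ w)      # reward r(s) = w^T f(s)
        mu_pi = feature_expectations(P, policy, features, s0)
        w = mu_expert - mu_pi                           # simplified projection-style update
    return w, policy
```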
How to handle ambiguity? What if expert is not perfect?
Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions
Whiteboard
Maximum Entropy Inverse RL (Ziebart et al. '08)
Notation:
- c_θ: cost function with parameters θ [linear case: c_θ(τ) = θᵀ f(τ)]
- D: dataset of demonstrations
- p(s' | s, a): transition dynamics
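The derivation itself is done on the whiteboard; as a written reference point with this notation, the MaxEnt IRL model (Ziebart et al. '08) and its negative log-likelihood objective are commonly written as:

$$p_\theta(\tau) = \frac{\exp(-c_\theta(\tau))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-c_\theta(\tau))\, d\tau$$

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}} c_\theta(\tau) + \log Z(\theta), \qquad
\nabla_\theta \mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \nabla_\theta c_\theta(\tau) \;-\; \mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta c_\theta(\tau)\right]$$

The second term, an expectation under the model's own trajectory distribution, is what requires solving the (soft) MDP under the current cost; it is the term later estimated with samples.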
Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions
Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost
NIPS Deep RL workshop 2015; IROS 2016
Need to iteratively solve the MDP for every weight update
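To see why, here is a rough tabular sketch of that inner MDP solve (a backward soft-value pass and a forward visitation pass, roughly following Ziebart et al. '08). The function names, the finite horizon, and the omission of discounting are simplifying assumptions, not the case study's actual code.

```python
# Rough sketch of the tabular MaxEnt IRL inner loop: every cost update needs a fresh
# (soft) MDP solve. Names, finite horizon, and lack of discounting are simplifications.
import numpy as np

def expected_state_visitations(P, cost, s0, horizon=50):
    """P: [A, S, S] known dynamics, cost: [S] current per-state cost.
    Returns expected state visitation counts under the MaxEnt trajectory distribution."""
    A, S, _ = P.shape
    # Backward pass: soft value iteration for reward = -cost.
    V = np.zeros(S)
    for _ in range(horizon):
        Q = -cost[None, :] + P @ V                   # [A, S]
        V = np.log(np.exp(Q).sum(axis=0))            # soft (log-sum-exp) backup
    policy = np.exp(Q - V[None, :])                  # stochastic policy pi(a|s), [A, S]
    # Forward pass: propagate the start-state distribution through policy & dynamics.
    d = np.zeros(S); d[s0] = 1.0
    visitation = np.zeros(S)
    for _ in range(horizon):
        visitation += d
        d = np.einsum('s,as,ast->t', d, policy, P)   # one-step state distribution update
    return visitation

def nll_gradient_wrt_cost(P, cost, demo_visitation, s0):
    """Gradient of the MaxEnt negative log-likelihood w.r.t. a per-state cost:
    empirical demo visitations minus expected visitations under the current model.
    (For a neural-net cost, this difference weights the per-state gradients of c_theta.)"""
    return demo_visitation - expected_state_visitations(P, cost, s0)
```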
[Figure: demonstrations; feature maps: mean height, height variance, cell visibility]
120km of demonstration data
[Table: test-set trajectory prediction, compared against a manually designed cost; MHD: modified Hausdorff distance]
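For reference, a minimal sketch of the modified Hausdorff distance metric mentioned above, following the usual Dubuisson & Jain-style definition (the paper's exact evaluation code may differ):

```python
# Sketch of the modified Hausdorff distance (MHD) between two point sets.
import numpy as np

def modified_hausdorff(A, B):
    """A: [N, 2], B: [M, 2] arrays of trajectory points (e.g. grid cells)."""
    # Pairwise Euclidean distances between every point of A and every point of B.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # [N, M]
    d_ab = D.min(axis=1).mean()   # mean distance from A's points to nearest point in B
    d_ba = D.min(axis=0).mean()   # mean distance from B's points to nearest point in A
    return max(d_ab, d_ba)
```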
Strengths
- scales to neural net costs
- efficient enough for real robots
Limitations
- still need to repeatedly solve the MDP
- assumes known dynamics
Whiteboard
What about unknown dynamics?
Goals:
- remove the need to solve the MDP in the inner loop
- be able to handle unknown dynamics
- handle continuous state & action spaces
Case Study: Guided Cost Learning
ICML 2016
guided cost learning algorithm:
- start from initial policy π0
- generate policy samples from π
- update cost using samples & demos
- update π w.r.t. cost
[Diagram: policy π alternating with a neural-network cost c_θ(x) that maps inputs x through hidden layers to a scalar cost]
guided cost learning algorithm (GAN view):
- start from initial policy π0
- generate policy samples from π (policy π ↔ generator)
- update cost c_θ(x) using samples & demos (cost c ↔ discriminator)
- update π w.r.t. cost (partially optimize)
- the cost is updated in the inner loop of policy optimization
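A compact sketch of that loop, written as plain Python; the helper callables (sample_trajectories, cost_update_step, policy_update_step) are placeholders for the paper's actual trajectory sampling, importance-sampled cost objective, and policy-optimization steps, so treat this as structure rather than a faithful implementation.

```python
# Structural sketch of the guided cost learning loop described on this slide.
# The three callables are assumptions standing in for the real sampling, cost, and policy code.
def guided_cost_learning(demos, init_policy, cost_params,
                         sample_trajectories, cost_update_step, policy_update_step,
                         n_iterations=100, n_samples=20):
    policy = init_policy
    samples = []
    for _ in range(n_iterations):
        # 1. Generate policy samples from the current policy pi (the "generator").
        samples += sample_trajectories(policy, n_samples)
        # 2. Update the cost (the "discriminator") using both samples and demos;
        #    the samples serve as an importance-sampling estimate of the partition function.
        cost_params = cost_update_step(cost_params, demos, samples)
        # 3. Partially optimize the policy w.r.t. the current cost (not run to convergence).
        policy = policy_update_step(policy, cost_params)
    return cost_params, policy
```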
Same structure, connecting to Ho et al., ICML '16, NIPS '16: generate policy samples from π (generator), update cost using samples & demos (discriminator), update π w.r.t. cost (partially optimize).
What about unknown dynamics? Adaptive importance sampling.
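Concretely, the intractable partition function from the MaxEnt objective can be estimated with importance sampling from a proposal distribution q over trajectories; "adaptive" refers to updating q (via the current policy) as learning proceeds. One standard way to write the estimate:

$$Z(\theta) = \int \exp(-c_\theta(\tau))\, d\tau \;\approx\; \frac{1}{M}\sum_{j=1}^{M} \frac{\exp(-c_\theta(\tau_j))}{q(\tau_j)}, \qquad \tau_j \sim q$$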
GCL Experiments: Real-world Tasks
- dish placement: state includes goal plate pose
- pouring almonds: state includes unsupervised visual features [Finn et al. '16]
action: joint torques
Comparisons:
- Relative Entropy IRL (Boularias et al. '11)
- Path Integral IRL (Kalakrishnan et al. '13)
Both generate policy samples from a distribution q (rather than the current policy π) and update the cost c using samples & demos.
Dish placement, demos
Dish placement, standard cost
Dish placement, RelEnt IRL
• video: dish placement, baseline method
Dish placement, GCL policy
• video: dish placement, our method (samples & reoptimizing)
Pouring, demos
• video: pouring demos
Pouring, RelEnt IRL
• video: pouring, baseline method
Pouring, GCL policy
• video: pouring, our method (samples)
Conclusion: We can recover successful policies for new positions.
Is the cost function also useful for new scenarios?
Dish placement, GCL reoptimization
• video: dish placement, our method (samples & reoptimizing)
Pouring, GCL reoptimization
• video: pouring, our method (reoptimization)
Note: normally the GAN discriminator is discarded
Case Study: Generative Adversarial Imitation Learning
NIPS 2016
- demonstrations from TRPO-optimized policy
- use TRPO as a policy optimizer
- OpenAI gym tasks
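For reference, the GAIL objective is usually written as a GAN-style saddle point over the policy π and a discriminator D on state-action pairs, with an entropy bonus H(π). One common convention (sign conventions vary across write-ups) is:

$$\min_\pi \max_{D} \;\; \mathbb{E}_{\pi}\!\left[\log D(s,a)\right] + \mathbb{E}_{\pi_E}\!\left[\log\big(1 - D(s,a)\big)\right] - \lambda H(\pi)$$

Under this convention, the policy step is implemented by TRPO using the per-step cost log D(s,a).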
Guided Cost Learning & Generative Adversarial Imitation Learning
Strengths
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots
Limitations
- adversarial optimization is hard
- can't scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)
Next Time: Back to forward RL (advanced policy gradients)
IOC is under-defined; need regularization:
• encourage slowly changing cost
• cost of demos decreases strictly monotonically in time
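A rough sketch of what these two regularizers can look like, computed over the per-timestep cost values c_θ(x_t) along a demonstration; the exact functional forms in the paper may differ, so treat the second-difference form and the hinge margin as illustrative assumptions.

```python
# Illustrative regularizers on a demo's per-timestep costs; exact published forms may differ.
import numpy as np

def slowly_changing_penalty(costs):
    """Encourage a slowly changing cost: penalize the squared second difference of the
    per-timestep cost, i.e. abrupt changes in how the cost evolves along the trajectory."""
    second_diff = costs[2:] - 2.0 * costs[1:-1] + costs[:-2]
    return np.sum(second_diff ** 2)

def monotonicity_penalty(costs, margin=1.0):
    """Encourage the demo's cost to decrease strictly over time: hinge penalty on any
    timestep whose cost fails to drop by at least `margin` relative to the previous step."""
    increase = costs[1:] - costs[:-1]
    return np.sum(np.maximum(0.0, increase + margin) ** 2)
```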
[Plot: regularization ablation on Reaching and Peg Insertion; x-axis: samples, y-axis: distance. Legend: ours (demo init), ours (rand. init), ours no lcr reg (demo init), ours no lcr reg (rand. init), ours no mono reg (demo init), ours no mono reg (rand. init)]