Page 1:

Inverse Reinforcement Learning
Chelsea Finn

3/5/2017

1

Page 2:

Course Reminders:
- March 22nd: Project group & title due
- April 17th: Milestone report due & milestone presentations
- April 26th: Beginning of project presentations

2

Page 3:

1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

3

Inverse RL: Outline

Page 4:

1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

4

Inverse RL: Outline

Page 5:

Mnih et al. ’15

video from Montessori New Zealand

[diagram: reinforcement learning agent, reward]

what is the reward?

In the real world, humans don’t get a score.

5

Page 6:

reward function is essential for RL

real-world domains: reward/cost often difficult to specify

• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…

Kohl & Stone ’04, Mnih et al. ’15, Silver et al. ’16, Tesauro ’95

6

Page 7:

Behavioral Cloning: Mimic actions of expert

7

Motivation

- but no reasoning about outcomes or dynamics

Can we reason about what the expert is trying to achieve?

- the expert might have different degrees of freedom

Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from expert demonstrations

(IOC/IRL) (Kalman ’64, Ng & Russell ’00)

Page 8:

8

Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

given:
- state & action space
- roll-outs from π*
- dynamics model [sometimes]

goal:
- recover reward function
- then use reward to get policy

Compare to DAgger: no direct access to π*
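To make the given/goal statement above concrete, here is a brief formal sketch (a standard way of writing the IRL problem, not taken from the slide itself):

```latex
\text{Given } \mathcal{D} = \{\tau_1, \dots, \tau_N\},\;\; \tau_i \sim \pi^*,\;
\text{ find a cost } c \text{ such that}\quad
\pi^* \in \arg\min_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big],
\quad\text{then recover a policy by optimizing the learned } c.
```

Note that this is underdefined as stated: for instance, c ≡ 0 makes every policy (including the expert) optimal, which is exactly the ambiguity listed under "Challenges" above.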

Page 9:

9

Early IRL Approaches

All: alternate between solving MDP w.r.t. cost and updating cost

Ng & Russell ’00: expert actions should have higher value than other actions; a larger margin is better

Abbeel & Ng ’04: a policy optimized w.r.t. the learned cost should match the feature counts of the expert trajectories (see the sketch below)

Ratliff et al. ’06: max-margin formulation between the value of expert actions and that of other actions
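A brief sketch of the feature-matching idea behind Abbeel & Ng ’04, in the linear-cost notation used later in the lecture (these formulas are standard and not reproduced from the slide):

```latex
\mu(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} f(s_t)\Big],
\qquad
\mu(\pi) = \mu(\pi^*) \;\Rightarrow\; \theta^{\top}\mu(\pi) = \theta^{\top}\mu(\pi^*)
\;\;\text{for all } \theta
```

So for any linear cost c_θ(s) = θᵀ f(s), a policy that matches the expert's expected feature counts achieves the expert's expected cost, even though the true θ is unknown.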

How to handle ambiguity? What if expert is not perfect?

Page 10:

1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

10

Inverse RL: Outline

Page 11:

11

Whiteboard

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Notation:
- c_θ(τ): cost with parameters θ [linear case: c_θ(τ) = θᵀ f(τ), with f(τ) = Σ_t f(s_t)]
- D = {τ_i}: dataset of demonstrations
- p(s_{t+1} | s_t, a_t): transition dynamics
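Since the derivation happens on the whiteboard, here is a compact written version of the standard MaxEnt IRL objective in the notation above (the negative log-likelihood of the demos, to be minimized):

```latex
p(\tau) = \frac{1}{Z}\exp\big(-c_\theta(\tau)\big),
\qquad
Z = \int \exp\big(-c_\theta(\tau)\big)\, d\tau

\mathcal{L}(\theta)
  = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i) + \log Z,
\qquad
\nabla_\theta \mathcal{L}
  = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
  - \mathbb{E}_{\tau \sim p(\tau)}\big[\nabla_\theta c_\theta(\tau)\big]
```

With stochastic dynamics, p(τ) also carries the transition probability factors, but the gradient keeps the same structure: a term over the demos minus an expectation under the model distribution.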

Page 12:

12

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Page 13:

1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions

13

Inverse RL: Outline

Page 14:

14

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

NIPS Deep RL workshop 2015

IROS 2016

Page 15:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

15

Need to iteratively solve the MDP for every weight update
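In the tabular setting, "solving the MDP" for each weight update amounts to a soft value iteration (backward pass) followed by a forward pass that yields expected state visitation counts; the MaxEnt gradient compares these to the demo visitation counts and is backpropagated into the cost network. Below is a minimal NumPy sketch under assumed conventions (transition tensor P of shape [A, S, S], per-state cost vector from the current network, finite horizon T; the function names are mine, not from the papers):

```python
import numpy as np
from scipy.special import logsumexp

def expected_state_visitations(P, cost, T, p0):
    """Soft value iteration + forward pass for a tabular MDP with known dynamics.

    P    : float array [A, S, S], P[a, s, s'] = transition probability
    cost : float array [S], current per-state cost (e.g. output of the cost net)
    T    : int, horizon
    p0   : float array [S], initial state distribution
    Returns expected state visitation counts under the MaxEnt (soft-optimal) policy.
    """
    A, S, _ = P.shape

    # Backward pass: soft value iteration under the current cost (reward = -cost).
    V = np.zeros(S)
    policy = np.zeros((T, S, A))
    for t in reversed(range(T)):
        Q = -cost[:, None] + np.einsum('asn,n->sa', P, V)   # Q[s, a]
        V = logsumexp(Q, axis=1)                             # soft maximum over actions
        policy[t] = np.exp(Q - V[:, None])                   # stochastic MaxEnt policy

    # Forward pass: propagate the state distribution under that policy.
    mu = np.zeros((T, S))
    mu[0] = p0
    for t in range(T - 1):
        mu[t + 1] = np.einsum('s,sa,asn->n', mu[t], policy[t], P)
    return mu.sum(axis=0)

# The gradient of the MaxEnt objective w.r.t. the per-state cost is then
# (expected visitations) - (empirical demo visitations), backpropagated through
# the cost network, which is why an MDP solve is needed for every weight update.
```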

Page 16:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

16

demonstrations

mean height, height variance, cell visibility

120km of demonstration data

[table: test-set trajectory prediction, learned cost vs. manually designed cost; MHD = modified Hausdorff distance]

Page 17:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

17

Strengths
- scales to neural net costs
- efficient enough for real robots

Limitations
- still need to repeatedly solve the MDP
- assumes known dynamics

Page 18:

18

Whiteboard

What about unknown dynamics?

Page 19:

19

Goals:
- remove the need to solve the MDP in the inner loop
- handle unknown dynamics
- handle continuous state & action spaces

Case Study: Guided Cost Learning

ICML 2016

Page 20:

Update cost using samples & demos

generate policy samples from π

update π w.r.t. cost

[figure: neural network cost c_θ(x); labels: policy π, cost c]

guided cost learning algorithm

20

policy π0
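Read as pseudocode, the three steps on this slide alternate as in the sketch below. The callables are placeholders standing in for a trajectory sampler, a cost update, and a partial policy-optimization step; none of these names come from the released implementation.

```python
from typing import Any, Callable, List

Trajectory = Any  # placeholder for whatever trajectory representation is used

def guided_cost_learning_loop(
    demos: List[Trajectory],
    sample_trajectories: Callable[[], List[Trajectory]],    # generate policy samples from pi
    update_cost: Callable[[List[Trajectory], List[Trajectory]], None],  # samples & demos
    partially_optimize_policy: Callable[[], None],           # update pi w.r.t. current cost
    n_iterations: int = 100,
) -> None:
    """Outer loop sketched on this slide: the cost is updated inside the policy
    optimization rather than requiring a full MDP solve per cost update."""
    for _ in range(n_iterations):
        samples = sample_trajectories()
        update_cost(demos, samples)
        partially_optimize_policy()
```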

Page 21:

Update cost using samples & demos

generate policy samples from π

update π w.r.t. cost (partially optimize)

generator

[figure: neural network cost c_θ(x); labels: policy π, cost c]

discriminator

guided cost learning algorithm

update cost in inner loop of policy optimization

21

policy π0

Page 22:

Update cost using samples & demos

generate policy samples from π

generator

Ho et al., ICML ’16, NIPS ‘16

[figure: neural network cost c_θ(x); labels: policy π, cost c]

discriminator

guided cost learning algorithm

update π w.r.t. cost (partially optimize)

22

policy π0
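One way to make the generator/discriminator labels on these slides precise (following the GAN/IRL connection discussed in Finn et al. ’16 and Ho & Ermon ’16; treat the exact form here as an assumption) is to build the discriminator from the current cost and the sampler density q(τ):

```latex
D_\theta(\tau)
  = \frac{\tfrac{1}{Z}\exp\big(-c_\theta(\tau)\big)}
         {\tfrac{1}{Z}\exp\big(-c_\theta(\tau)\big) + q(\tau)}
```

Training D_θ with the usual binary-classification GAN loss on demos vs. samples then corresponds to the MaxEnt IRL cost update, while the partial policy optimization against this discriminator plays the role of the generator update.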

Page 23:

23

What about unknown dynamics? Adaptive importance sampling.
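Concretely, "adaptive importance sampling" here refers to estimating the partition function with trajectories sampled from the current policy (or sampler) q, so no dynamics model or full MDP solve is needed. A standard self-normalized estimator has the form:

```latex
Z \approx \frac{1}{M}\sum_{\tau_j \sim q} \frac{\exp\big(-c_\theta(\tau_j)\big)}{q(\tau_j)},
\qquad
\nabla_\theta \mathcal{L}
  \approx \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
  - \frac{1}{\sum_j w_j}\sum_{j} w_j\, \nabla_\theta c_\theta(\tau_j),
\quad
w_j = \frac{\exp\big(-c_\theta(\tau_j)\big)}{q(\tau_j)}
```

The sampler is "adaptive" because q is repeatedly refit (the policy is partially optimized against the current cost), keeping the proposal close to the distribution induced by the cost.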

Page 24:

GCL Experiments

dish placement, pouring almonds

Real-world Tasks

dish placement: state includes goal plate pose; pouring: state includes unsupervised visual features [Finn et al. ’16]

24

action: joint torques

Page 25:

Update cost using samples & demos

generate policy samples from q

[figure: neural network cost c_θ(x); label: cost c]

Comparisons:
- Relative Entropy IRL (Boularias et al. ’11)
- Path Integral IRL (Kalakrishnan et al. ’13)

25

Page 26:

Dish placement, demos

26

Page 27:

Dish placement, standard cost

27

Page 28:

Dish placement, RelEnt IRL

• video of dish baseline method

28

Page 29:

Dish placement, GCL policy

• video of dish our method - samples & reoptimizing

29

Page 30:

Pouring, demos

• video of pouring demos

30

Page 31:

Pouring, RelEnt IRL

• video of pouring baseline method

31

Page 32:

Pouring, GCL policy

• video of pouring our method - samples

32

Page 33:

Conclusion: We can recover successful policies for new positions.

Is the cost function also useful for new scenarios?

33

Page 34:

Dish placement - GCL reopt.

• video of dish our method - samples & reoptimizing

34

Page 35:

Pouring - GCL reopt.

• video of pouring our method - reoptimization

35

Note: normally the GAN discriminator is discarded

Page 36:

36

Case Study: Generative Adversarial Imitation Learning

NIPS 2016

Page 37:

37

Case Study: Generative Adversarial Imitation Learning

- demonstrations from TRPO-optimized policy
- use TRPO as a policy optimizer
- OpenAI Gym tasks
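For reference, the adversarial objective used in generative adversarial imitation learning, written in the same spirit as the discriminator view above (the signs follow Ho & Ermon ’16 and should be read as a sketch):

```latex
\min_{\pi}\; \max_{D}\;
\mathbb{E}_{\pi}\big[\log D(s,a)\big]
+ \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
- \lambda H(\pi)
```

Here H(π) is a causal-entropy regularizer, and the policy step is taken with TRPO using log D(s, a) as the cost, which is what the "use TRPO as a policy optimizer" bullet above refers to.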

Page 38:

Guided Cost Learning & Generative Adversarial Imitation Learning

38

Strengths
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots

Limitations
- adversarial optimization is hard
- can’t scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)

Page 39:

Next Time: Back to forward RL (advanced policy gradients)

39

Page 40:

40

Page 41:

IOC is under-defined; need regularization:
• encourage slowly changing cost

• cost of demos decreases strictly monotonically in time
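One concrete instantiation of these two regularizers, roughly the forms used in the guided cost learning paper (treat the exact constants as an assumption), where x_t denotes the state at time t along a demo:

```latex
g_{\text{lcr}}(\tau) = \sum_{t}
  \Big[\big(c_\theta(x_{t+1}) - c_\theta(x_t)\big)
     - \big(c_\theta(x_t) - c_\theta(x_{t-1})\big)\Big]^{2}

g_{\text{mono}}(\tau) = \sum_{t}
  \Big[\max\big(0,\; c_\theta(x_t) - c_\theta(x_{t-1}) - 1\big)\Big]^{2}
```

g_lcr penalizes a quickly changing cost slope along the demonstration ("slowly changing cost"), and g_mono penalizes demo costs that fail to decrease over time (with a small slack). These correspond to the "lcr" and "mono" regularizers ablated on the next slide.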

Page 42:

Regularization ablation

[plots: Reaching and Peg Insertion; x-axis: samples, y-axis: distance]

Legend: ours (demo init), ours (rand. init), ours (no lcr reg, demo init), ours (no lcr reg, rand. init), ours (no mono reg, demo init), ours (no mono reg, rand. init)