Page 1:

Inverse Reinforcement Learning
Chelsea Finn

3/5/2017

1

Page 2:

Course Reminders:
- March 22nd: Project group & title due
- April 17th: Milestone report due & milestone presentations
- April 26th: Beginning of project presentations

2

Page 3:

1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

3

Inverse RL: Outline

Page 4:

1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

4

Inverse RL: Outline

Page 5:

Mnih et al. ’15

video from Montessori New Zealand

[diagram: reinforcement learning agent, reward]

what is the reward?

In the real world, humans don’t get a score.

5

Page 6:

reward function is essential for RL

real-world domains: reward/cost often difficult to specify

• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…

Kohl & Stone ’04, Mnih et al. ’15, Silver et al. ’16, Tesauro ’95

6

Page 7:

Behavioral Cloning: Mimic actions of expert

7

Motivation

- but no reasoning about outcomes or dynamics

Can we reason about what the expert is trying to achieve?

- the expert might have different degrees of freedom

Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from expert demonstrations

(IOC/IRL) (Kalman ’64, Ng & Russell ’00)

Page 8:

8

Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

given:
- state & action space
- roll-outs from π*
- dynamics model [sometimes]

goal:
- recover reward function
- then use reward to get policy

Compare to DAgger: no direct access to π*
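To make the given/goal statement above concrete, here is a brief formal sketch (a standard way of writing the IRL problem, not taken from the slide itself):

```latex
\text{Given } \mathcal{D} = \{\tau_1, \dots, \tau_N\},\;\; \tau_i \sim \pi^*,\;
\text{ find a cost } c \text{ such that}\quad
\pi^* \in \arg\min_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big],
\quad\text{then recover a policy by optimizing the learned } c.
```

Note that this is underdefined as stated: for instance, c ≡ 0 makes every policy (including the expert) optimal, which is exactly the ambiguity listed under "Challenges" above.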

Page 9:

9

Early IRL Approaches

All: alternate between solving MDP w.r.t. cost and updating cost

Ng & Russell ’00: expert actions should have higher value than other actions; a larger margin is better

Abbeel & Ng ’04: a policy optimized w.r.t. the learned cost should match the feature counts of the expert trajectories (see the sketch below)

Ratliff et al. ’06: max-margin formulation between the value of expert actions and that of other actions
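A brief sketch of the feature-matching idea behind Abbeel & Ng ’04, in the linear-cost notation used later in the lecture (these formulas are standard and not reproduced from the slide):

```latex
\mu(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} f(s_t)\Big],
\qquad
\mu(\pi) = \mu(\pi^*) \;\Rightarrow\; \theta^{\top}\mu(\pi) = \theta^{\top}\mu(\pi^*)
\;\;\text{for all } \theta
```

So for any linear cost c_θ(s) = θᵀ f(s), a policy that matches the expert's expected feature counts achieves the expert's expected cost, even though the true θ is unknown.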

How to handle ambiguity? What if expert is not perfect?

Page 10:

1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

10

Inverse RL: Outline

Page 11:

11

Whiteboard

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Notation:
- c_θ(τ): cost with parameters θ [linear case: c_θ(τ) = θᵀ f(τ), with f(τ) = Σ_t f(s_t)]
- D = {τ_i}: dataset of demonstrations
- p(s_{t+1} | s_t, a_t): transition dynamics
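Since the derivation happens on the whiteboard, here is a compact written version of the standard MaxEnt IRL objective in the notation above (the negative log-likelihood of the demos, to be minimized):

```latex
p(\tau) = \frac{1}{Z}\exp\big(-c_\theta(\tau)\big),
\qquad
Z = \int \exp\big(-c_\theta(\tau)\big)\, d\tau

\mathcal{L}(\theta)
  = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i) + \log Z,
\qquad
\nabla_\theta \mathcal{L}
  = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
  - \mathbb{E}_{\tau \sim p(\tau)}\big[\nabla_\theta c_\theta(\tau)\big]
```

With stochastic dynamics, p(τ) also carries the transition probability factors, but the gradient keeps the same structure: a term over the demos minus an expectation under the model distribution.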

Page 12:

12

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Page 13:

1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions

13

Inverse RL: Outline

Page 14:

14

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

NIPS Deep RL workshop 2015

IROS 2016

Page 15:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

15

Need to iteratively solve the MDP for every weight update
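In the tabular setting, "solving the MDP" for each weight update amounts to a soft value iteration (backward pass) followed by a forward pass that yields expected state visitation counts; the MaxEnt gradient compares these to the demo visitation counts and is backpropagated into the cost network. Below is a minimal NumPy sketch under assumed conventions (transition tensor P of shape [A, S, S], per-state cost vector from the current network, finite horizon T; the function names are mine, not from the papers):

```python
import numpy as np
from scipy.special import logsumexp

def expected_state_visitations(P, cost, T, p0):
    """Soft value iteration + forward pass for a tabular MDP with known dynamics.

    P    : float array [A, S, S], P[a, s, s'] = transition probability
    cost : float array [S], current per-state cost (e.g. output of the cost net)
    T    : int, horizon
    p0   : float array [S], initial state distribution
    Returns expected state visitation counts under the MaxEnt (soft-optimal) policy.
    """
    A, S, _ = P.shape

    # Backward pass: soft value iteration under the current cost (reward = -cost).
    V = np.zeros(S)
    policy = np.zeros((T, S, A))
    for t in reversed(range(T)):
        Q = -cost[:, None] + np.einsum('asn,n->sa', P, V)   # Q[s, a]
        V = logsumexp(Q, axis=1)                             # soft maximum over actions
        policy[t] = np.exp(Q - V[:, None])                   # stochastic MaxEnt policy

    # Forward pass: propagate the state distribution under that policy.
    mu = np.zeros((T, S))
    mu[0] = p0
    for t in range(T - 1):
        mu[t + 1] = np.einsum('s,sa,asn->n', mu[t], policy[t], P)
    return mu.sum(axis=0)

# The gradient of the MaxEnt objective w.r.t. the per-state cost is then
# (expected visitations) - (empirical demo visitations), backpropagated through
# the cost network, which is why an MDP solve is needed for every weight update.
```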

Page 16:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

16

demonstrations

mean height, height variance, cell visibility

120km of demonstration data

[table: test-set trajectory prediction, learned cost vs. manually designed cost; MHD = modified Hausdorff distance]

Page 17:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

17

Strengths
- scales to neural net costs
- efficient enough for real robots

Limitations
- still need to repeatedly solve the MDP
- assumes known dynamics

Page 18:

18

Whiteboard

What about unknown dynamics?

Page 19:

19

Goals:
- remove the need to solve the MDP in the inner loop
- handle unknown dynamics
- handle continuous state & action spaces

Case Study: Guided Cost Learning

ICML 2016

Page 20:

Update cost using samples & demos

generate policy samples from π

update π w.r.t. cost

[figure: neural network cost c_θ(x); labels: policy π, cost c]

guided cost learning algorithm

20

policy π0
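Read as pseudocode, the three steps on this slide alternate as in the sketch below. The callables are placeholders standing in for a trajectory sampler, a cost update, and a partial policy-optimization step; none of these names come from the released implementation.

```python
from typing import Any, Callable, List

Trajectory = Any  # placeholder for whatever trajectory representation is used

def guided_cost_learning_loop(
    demos: List[Trajectory],
    sample_trajectories: Callable[[], List[Trajectory]],    # generate policy samples from pi
    update_cost: Callable[[List[Trajectory], List[Trajectory]], None],  # samples & demos
    partially_optimize_policy: Callable[[], None],           # update pi w.r.t. current cost
    n_iterations: int = 100,
) -> None:
    """Outer loop sketched on this slide: the cost is updated inside the policy
    optimization rather than requiring a full MDP solve per cost update."""
    for _ in range(n_iterations):
        samples = sample_trajectories()
        update_cost(demos, samples)
        partially_optimize_policy()
```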

Page 21:

Update cost using samples & demos

generate policy samples from π

update π w.r.t. cost (partially optimize)

generator

[figure: neural network cost c_θ(x); labels: policy π, cost c]

discriminator

guided cost learning algorithm

update cost in inner loop of policy optimization

21

policy π0

Page 22:

Update cost using samples & demos

generate policy samples from π

generator

Ho et al., ICML ’16, NIPS ‘16

[figure: neural network cost c_θ(x); labels: policy π, cost c]

discriminator

guided cost learning algorithm

update π w.r.t. cost (partially optimize)

22

policy π0
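One way to make the generator/discriminator labels on these slides precise (following the GAN/IRL connection discussed in Finn et al. ’16 and Ho & Ermon ’16; treat the exact form here as an assumption) is to build the discriminator from the current cost and the sampler density q(τ):

```latex
D_\theta(\tau)
  = \frac{\tfrac{1}{Z}\exp\big(-c_\theta(\tau)\big)}
         {\tfrac{1}{Z}\exp\big(-c_\theta(\tau)\big) + q(\tau)}
```

Training D_θ with the usual binary-classification GAN loss on demos vs. samples then corresponds to the MaxEnt IRL cost update, while the partial policy optimization against this discriminator plays the role of the generator update.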

Page 23:

23

What about unknown dynamics? Adaptive importance sampling.
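Concretely, "adaptive importance sampling" here refers to estimating the partition function with trajectories sampled from the current policy (or sampler) q, so no dynamics model or full MDP solve is needed. A standard self-normalized estimator has the form:

```latex
Z \approx \frac{1}{M}\sum_{\tau_j \sim q} \frac{\exp\big(-c_\theta(\tau_j)\big)}{q(\tau_j)},
\qquad
\nabla_\theta \mathcal{L}
  \approx \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
  - \frac{1}{\sum_j w_j}\sum_{j} w_j\, \nabla_\theta c_\theta(\tau_j),
\quad
w_j = \frac{\exp\big(-c_\theta(\tau_j)\big)}{q(\tau_j)}
```

The sampler is "adaptive" because q is repeatedly refit (the policy is partially optimized against the current cost), keeping the proposal close to the distribution induced by the cost.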

Page 24:

GCL Experiments

dish placement, pouring almonds

Real-world Tasks

dish placement: state includes goal plate pose; pouring: state includes unsupervised visual features [Finn et al. ’16]

24

action: joint torques

Page 25:

Update cost using samples & demos

generate policy samples from q

[figure: neural network cost c_θ(x); label: cost c]

Comparisons:
- Relative Entropy IRL (Boularias et al. ’11)
- Path Integral IRL (Kalakrishnan et al. ’13)

25

Page 26:

Dish placement, demos

26

Page 27:

Dish placement, standard cost

27

Page 28:

Dish placement, RelEnt IRL

• video of dish baseline method

28

Page 29:

Dish placement, GCL policy

• video of dish our method - samples & reoptimizing

29

Page 30:

Pouring, demos

• video of pouring demos

30

Page 31:

Pouring, RelEnt IRL

• video of pouring baseline method

31

Page 32:

Pouring, GCL policy

• video of pouring our method - samples

32

Page 33:

Conclusion: We can recover successful policies for new positions.

Is the cost function also useful for new scenarios?

33

Page 34:

Dish placement - GCL reopt.

• video of dish our method - samples & reoptimizing

34

Page 35:

Pouring - GCL reopt.

• video of pouring our method - reoptimization

35

Note: normally the GAN discriminator is discarded

Page 36:

36

Case Study: Generative Adversarial Imitation Learning

NIPS 2016

Page 37:

37

Case Study: Generative Adversarial Imitation Learning

- demonstrations from TRPO-optimized policy
- use TRPO as a policy optimizer
- OpenAI Gym tasks
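For reference, the adversarial objective used in generative adversarial imitation learning, written in the same spirit as the discriminator view above (the signs follow Ho & Ermon ’16 and should be read as a sketch):

```latex
\min_{\pi}\; \max_{D}\;
\mathbb{E}_{\pi}\big[\log D(s,a)\big]
+ \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
- \lambda H(\pi)
```

Here H(π) is a causal-entropy regularizer, and the policy step is taken with TRPO using log D(s, a) as the cost, which is what the "use TRPO as a policy optimizer" bullet above refers to.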

Page 38:

Guided Cost Learning & Generative Adversarial Imitation Learning

38

Strengths
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots

Limitations
- adversarial optimization is hard
- can’t scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)

Page 39:

Next Time: Back to forward RL (advanced policy gradients)

39

Page 40:

40

Page 41:

IOC is under-defined; need regularization:
• encourage slowly changing cost

• cost of demos decreases strictly monotonically in time
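One concrete instantiation of these two regularizers, roughly the forms used in the guided cost learning paper (treat the exact constants as an assumption), where x_t denotes the state at time t along a demo:

```latex
g_{\text{lcr}}(\tau) = \sum_{t}
  \Big[\big(c_\theta(x_{t+1}) - c_\theta(x_t)\big)
     - \big(c_\theta(x_t) - c_\theta(x_{t-1})\big)\Big]^{2}

g_{\text{mono}}(\tau) = \sum_{t}
  \Big[\max\big(0,\; c_\theta(x_t) - c_\theta(x_{t-1}) - 1\big)\Big]^{2}
```

g_lcr penalizes a quickly changing cost slope along the demonstration ("slowly changing cost"), and g_mono penalizes demo costs that fail to decrease over time (with a small slack). These correspond to the "lcr" and "mono" regularizers ablated on the next slide.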

Page 42:

Regularization ablation

[plots: Reaching and Peg Insertion; x-axis: samples, y-axis: distance]

Legend: ours (demo init), ours (rand. init), ours (no lcr reg, demo init), ours (no lcr reg, rand. init), ours (no mono reg, demo init), ours (no mono reg, rand. init)