Inverse Reinforcement Learning for Telepresence Robots
Transcript of Inverse Reinforcement Learning for Telepresence Robots
Shimon Whiteson, Dept. of Computer Science
University of Oxford
joint work with Kyriacos Shiarlis & João Messias
Acknowledgements
João Messias, Kyriacos Shiarlis
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
What Is Telepresence?
“Skype on a stick”
[Diagram: Human Controller / Pilot ↔ Interaction Target]
Telepresence Benefits
• Greater physical presence: “Your alter ego on wheels”
• Mobility enables spontaneous interaction
Application: Education Accessibility
Application: Remote Health Care
Application: Elderly Accessibility
“Café Philo” Discussion Group
Les Arcades Prevention Centre in Troyes, France
Telepresence Limitations
• Learning curve for human controller
• Otherwise automatic behaviours (e.g., body language) require manual execution
• Human controller must simultaneously make high-level and low-level decisions
• Leads to cognitive overload: mistakes at the low level; less attention for the high level [Tsui et al. 2011]
• Result is poor quality social interaction
TERESA Solution
• A new telepresence system with partial autonomy: automate low-level decision making
• Free human controller to focus on high-level decisions
• Requires social intelligence:
• Social navigation
• Social conversation
TERESA Robot
Experiments with Real Subjects
Les Arcades Prevention Centre in Troyes, France
Video
Interface
Cognitive Architecture
[Architecture diagram: audio & video from the environment feed the audiovisual processing module, which extracts people's locations, behaviour, and reactions (to the controller and to each other); the state estimator and cost estimator turn these into state and cost estimates; the navigator and body pose controller combine them with the human controller's high-level control signals to produce motion and body poses.]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Inverse Reinforcement Learning
Reward & Value
$R(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a)$
($w_k$: unknown weight; $\phi_k$: feature function)
$V^\pi(s) = \mathbb{E}\left\{ \sum_{t=1}^{h} R(s_t, a_t) \mid s_1 = s \right\}$
$V^\pi(s) = \sum_{k=1}^{K} w_k \left( \mathbb{E}\left\{ \sum_{t=1}^{h} \phi_k(s_t, a_t) \mid s_1 = s \right\} \right) = \sum_{k=1}^{K} w_k \, \mu^{\pi,1:h}_k|_{s_1 = s}$
($\mu^{\pi,1:h}_k|_{s_1 = s}$: feature expectation under $s_1$)
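To make the notation concrete, here is a minimal Python sketch (not the TERESA code) of a reward that is linear in hand-coded features; the particular features and weights below are hypothetical stand-ins.

```python
import numpy as np

def phi(s, a):
    """Hypothetical feature function phi(s, a) for a toy discrete task."""
    return np.array([float(s == 0),   # e.g. "at the goal state"
                     float(a == 1),   # e.g. "moving forward"
                     1.0])            # constant / bias feature

w = np.array([1.0, 0.1, -0.01])       # unknown in practice; IRL must recover these

def reward(s, a):
    # R(s, a) = sum_k w_k * phi_k(s, a)
    return float(w @ phi(s, a))

print(reward(0, 1))                   # 1.0 + 0.1 - 0.01 = 1.09
```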
Data & Feature Expectations
Dataset: $D = \{\tau_1, \tau_2, \ldots, \tau_N\}$
Trajectory: $\tau_i = \langle (s^{\tau_i}_1, a^{\tau_i}_1), (s^{\tau_i}_2, a^{\tau_i}_2), \ldots, (s^{\tau_i}_h, a^{\tau_i}_h) \rangle$
Empirical feature expectation: $\tilde{\mu}^D_k = \frac{1}{N} \sum_{\tau \in D} \sum_{t=1}^{h} \phi_k(s^\tau_t, a^\tau_t)$
Feature expectation of $\pi$: $\mu^\pi_k|_D = \sum_{s \in S} P_D(s_1 = s) \, \mu^\pi_k|_{s_1 = s}$
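As a small illustration, the empirical feature expectation is just the per-trajectory sum of features, averaged over the dataset. The feature function and the two toy trajectories below are made up for the example.

```python
import numpy as np

def phi(s, a):
    """Hypothetical feature function for a toy task."""
    return np.array([float(s == 0), float(a == 1), 1.0])

# Each trajectory is a list of (state, action) pairs of length h.
D = [[(2, 1), (1, 1), (0, 0)],
     [(3, 1), (2, 1), (1, 1)]]

def empirical_feature_expectations(trajectories):
    # mu~^D_k = (1/N) * sum over trajectories of sum_t phi_k(s_t, a_t)
    K = len(phi(*trajectories[0][0]))
    mu = np.zeros(K)
    for tau in trajectories:
        mu += sum(phi(s, a) for s, a in tau)
    return mu / len(trajectories)

print(empirical_feature_expectations(D))   # [0.5, 2.5, 3.0]
```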
Maximum Causal Entropy IRL
find: $\max_\pi H(A^h \| S^h)$
subject to: $\tilde{\mu}^D_k = \mu^\pi_k|_D \quad \forall k$
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
where $H$ is the causal entropy:
$H(A^h \| S^h) = - \sum_{t=1}^{h} \sum_{s_{1:t} \in S^t,\, a_{1:t} \in A^t} P(a_{1:t}, s_{1:t}) \log P(a_t \mid s_t)$
B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.
depends only on previous states & actions
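A quick sanity check of this definition (toy numbers, not from the talk): if the policy ignores the state entirely, the causal entropy reduces to $h$ times the per-step action entropy.

```python
import numpy as np

# Toy check: with pi(a|s) identical in every state and at every time step,
# H(A^h || S^h) = -sum_t E[log pi(a_t | s_t)] = h * per-step entropy.
h = 3
pi_a = np.array([0.7, 0.3])                    # state-independent action probabilities
per_step_entropy = -(pi_a * np.log(pi_a)).sum()
print(h * per_step_entropy)                    # causal entropy of this policy
```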
Lagrangian
$L(\pi, w) = H(A^h \| S^h) + \sum_{k=1}^{K} w_k \left( \mu^\pi_k|_D - \tilde{\mu}^D_k \right)$
find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
Solving the Lagrangian
Take $\nabla_{\pi(s_t, a_t)} L(\pi, w)$. Setting it to zero implies:
$\pi(s_t, a_t) \propto \exp\!\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} w_k \, \mu^{\pi,t:h}_k|_{s_t, a_t} \right)$
(the bracketed quantity plays the role of a soft $Q^\pi(s_t, a_t)$)
Finding $\pi$ corresponds to performing soft value iteration:
$Q^{\mathrm{soft}}_w(s, a) = \sum_{k=1}^{K} w_k \phi_k(s, a) + \sum_{s'} T(s, a, s') \, V^{\mathrm{soft}}_w(s')$
$V^{\mathrm{soft}}_w(s) = \log \sum_{a} \exp\!\left( Q^{\mathrm{soft}}_w(s, a) \right)$
$\pi(s, a) = \exp\!\left( Q^{\mathrm{soft}}_w(s, a) - V^{\mathrm{soft}}_w(s) \right)$
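A compact Python sketch of this soft value iteration for a finite-horizon tabular MDP follows. The toy transition model T, feature table PHI, sizes, and weights are hypothetical stand-ins, not TERESA's navigation model.

```python
import numpy as np

S, A, K, H = 4, 2, 3, 10                        # toy sizes and horizon
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(S), size=(S, A))      # T[s, a, s'] transition probabilities
PHI = rng.random((S, A, K))                     # phi_k(s, a)

def soft_value_iteration(w, horizon=H):
    """Backward pass returning the stochastic policy pi[t, s, a]."""
    R = PHI @ w                                 # R[s, a] = sum_k w_k phi_k(s, a)
    V = np.zeros(S)                             # value beyond the horizon is zero
    pi = np.zeros((horizon, S, A))
    for t in reversed(range(horizon)):
        Q = R + T @ V                           # Q_soft(s, a)
        V = np.log(np.exp(Q).sum(axis=1))       # V_soft(s) = log sum_a exp Q
        pi[t] = np.exp(Q - V[:, None])          # pi(s, a) = exp(Q_soft - V_soft)
    return pi

pi = soft_value_iteration(np.array([1.0, -0.5, 0.2]))
print(pi[0].sum(axis=1))                        # each row sums to 1
```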
Outer Loop
Finding $\pi$ for a fixed $w$ corresponds to the inner loop of:
find: $\min_w \left\{ \max_\pi L(\pi, w) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
Gradient descent w.r.t. $w$ in the outer loop:
$\nabla_w L(\pi, w) = \mu^\pi|_D - \tilde{\mu}^D$
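Continuing the same toy setup (this reuses S, K, T, PHI, and soft_value_iteration from the sketch above, so it is only a sketch, not the talk's implementation), the outer loop is plain gradient descent on $w$, where the gradient is the gap between the policy's feature expectations and the empirical ones.

```python
import numpy as np

def policy_feature_expectations(pi, P0):
    """mu^pi|_D: expected summed features of pi, starting from distribution P0."""
    mu, d = np.zeros(K), P0.copy()
    for t in range(pi.shape[0]):
        sa = d[:, None] * pi[t]                      # joint P(s_t, a_t)
        mu += (sa[:, :, None] * PHI).sum(axis=(0, 1))
        d = np.einsum('sa,sap->p', sa, T)            # next-state distribution
    return mu

def maxent_irl(mu_D, P0, lr=0.05, iters=200):
    w = np.zeros(K)
    for _ in range(iters):
        pi = soft_value_iteration(w)                 # inner loop: soft value iteration
        grad = policy_feature_expectations(pi, P0) - mu_D
        w -= lr * grad                               # descend on w
    return w

P0 = np.full(S, 1.0 / S)                             # uniform start-state distribution
w_true = np.array([1.0, -0.5, 0.2])
mu_D = policy_feature_expectations(soft_value_iteration(w_true), P0)
print(maxent_irl(mu_D, P0))                          # should approach w_true
```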
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Failed Demonstrations
• Existing IRL methods learn only from success
• IRL motivation: describing reward is hard but demonstrating success is easy
• Failed demonstrations also often available whenever humans learn via trial and error
• We extend IRL to include learning from failure
• Finds cost function that encourages behaviour that:
• Maximises similarity to successful demos
• Minimises similarity to failed demos
Inverse Reinforcement Learning from Failure
Kyriacos Shiarlis, João Messias, and Shimon Whiteson. Inverse Reinforcement Learning from Failure. In AAMAS, 2016.
Naive Approach
Add inequality constraints:
$\left| \tilde{\mu}^F_k - \mu^\pi_k|_F \right| > \alpha_k \quad \forall k$
$\max_{\pi, \alpha} H(A^h \| S^h) + \sum_{k=1}^{K} \alpha_k$
Non-convex!
Convex relaxation:
$\max_{\pi, \theta} H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k \left( \mu^\pi_k|_F - \tilde{\mu}^F_k \right)$
Intractable!
IRLF
find: $\max_{\pi, \theta, z} H(A^h \| S^h) + \sum_{k=1}^{K} \theta_k z_k - \frac{\lambda}{2} \|\theta\|^2$
(causal entropy + dissimilarity term − regularisation)
subject to: $\mu^\pi_k|_D = \tilde{\mu}^D_k \quad \forall k$
and: $\mu^\pi_k|_F - \tilde{\mu}^F_k = z_k \quad \forall k$ (new constraint)
and: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
Lagrangian
$L(\pi, \theta, z, w^D, w^F) = H(A^h \| S^h) - \frac{\lambda}{2} \|\theta\|^2 + \sum_{k=1}^{K} \theta_k z_k + \sum_{k=1}^{K} w^D_k \left( \mu^\pi_k|_D - \tilde{\mu}^D_k \right) + \sum_{k=1}^{K} w^F_k \left( \mu^\pi_k|_F - \tilde{\mu}^F_k - z_k \right)$
find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
Solving the Lagrangian
First find critical points w.r.t. $z$ and $\theta$, then w.r.t. $\pi$, yielding:
$\pi(s_t, a_t) \propto \exp\!\left( H(A^{t:h} \| S^{t:h}) + \sum_{k=1}^{K} (w^D_k + w^F_k) \, \mu^{\pi,t:h}_k|_{s_t, a_t} \right)$
This yields a new soft value iteration:
$Q^{\mathrm{soft}}_w(s, a) = \sum_{k=1}^{K} (w^D_k + w^F_k) \, \phi_k(s, a) + \sum_{s'} T(s, a, s') \, V^{\mathrm{soft}}_w(s')$
$V^{\mathrm{soft}}_w(s) = \log \sum_{a} \exp\!\left( Q^{\mathrm{soft}}_w(s, a) \right)$
$\pi(s, a) = \exp\!\left( Q^{\mathrm{soft}}_w(s, a) - V^{\mathrm{soft}}_w(s) \right)$
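In code, the only change to the inner loop relative to plain MaxEnt IRL is that the reward weights become the sum of the two multiplier vectors. Reusing the toy soft_value_iteration sketch from earlier (the numbers are arbitrary):

```python
import numpy as np

w_D = np.array([1.0, -0.5, 0.2])            # multipliers for the success constraints
w_F = np.array([-0.2, 0.1, 0.0])            # multipliers for the failure constraints
pi_irlf = soft_value_iteration(w_D + w_F)   # Q_soft now uses (w^D_k + w^F_k)
```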
Outer Loop
Finding $\pi, \theta, z$ for a fixed $w$ corresponds to the inner loop of:
find: $\min_w \left\{ \max_{\pi, \theta, z} L(\pi, \theta, z, w^D, w^F) \right\}$
subject to: $\sum_{a \in A} \pi(s, a) = 1 \quad \forall s \in S$
and: $\pi(s, a) \ge 0 \quad \forall s \in S, a \in A$
Gradient descent w.r.t. $w^D$ and $w^F$ in the outer loop:
$\nabla_{w^D_k} L^*(w^D, w^F) = \mu^{\pi^*}_k|_D - \tilde{\mu}^D_k$
$\nabla_{w^F_k} L^*(w^D, w^F) = \mu^{\pi^*}_k|_F - \tilde{\mu}^F_k - \lambda w^F_k$
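A sketch of the corresponding outer loop, again reusing the toy helpers from the MaxEnt IRL sketches. For simplicity it uses the same start-state distribution for both datasets, which is an assumption of this toy, not of IRLF.

```python
import numpy as np

def irlf(mu_D, mu_F, P0, lam=1.0, lr=0.05, iters=200):
    """Gradient descent on (w^D, w^F) using the IRLF outer-loop gradients."""
    w_D, w_F = np.zeros(K), np.zeros(K)
    for _ in range(iters):
        pi = soft_value_iteration(w_D + w_F)          # inner loop
        mu_pi = policy_feature_expectations(pi, P0)
        w_D -= lr * (mu_pi - mu_D)                    # gradient w.r.t. w^D_k
        w_F -= lr * (mu_pi - mu_F - lam * w_F)        # gradient w.r.t. w^F_k
    return w_D, w_F
```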
Incremental IRLF
• Updates to each set of weights affect feature expectations and thus interact with each other
• Some failed demos may be better for fine tuning
• Solution: anneal λ
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Simulated Robot Domain
Simulated Robot Results
Three scenarios (E = expert demonstrations, T = taboo/failed demonstrations):
• Overlapping — E: reach target, avoid obstacles; T: reach target, hit obstacle
• Complementary — E: reach target; T: hit obstacle
• Contrasting — E: reach target, avoid obstacles; T: hit obstacle
[Plots: performance w.r.t. Expert and w.r.t. Taboo for each scenario]
Varying # of Successful Demos
(Contrasting Scenario)
Learned Reward Functions (Contrasting Scenario)
[Figure: learned reward maps for Expert vs. Taboo, under IRL and IRLF]
Factory Domain Results
[Plots: performance w.r.t. Expert and w.r.t. Taboo]
R. Dearden & C. Boutilier. Abstraction and Approximate Decision-Theoretic Planning. Artificial Intelligence, 89(1):219–283, 1997.
Real Robot Data
Real-World IRL Evaluation
• Missing ground-truth reward functions
• Current practice: use qualitative and/or ad-hoc metrics
• New approach (see the sketch after this list):
• Apply IRL to D and F separately
• Resulting reward functions := ground truth
• Generate new data from corresponding policies
• Apply IRLF to new datasets
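Using the toy helpers from the earlier sketches (and feature expectations standing in for raw trajectories), the protocol could look roughly like this; every quantity below is a made-up stand-in, not data from the TERESA experiments.

```python
import numpy as np

# Toy stand-in for the failed-demonstration feature expectations
# (mu_D and P0 were defined in the earlier outer-loop sketch).
mu_F = policy_feature_expectations(
    soft_value_iteration(np.array([-1.0, 0.5, 0.0])), P0)

w_D_gt = maxent_irl(mu_D, P0)        # 1. apply IRL to D alone
w_F_gt = maxent_irl(mu_F, P0)        # 2. apply IRL to F alone
# 3. treat the recovered rewards as ground truth and regenerate data from
#    their induced policies
mu_D_new = policy_feature_expectations(soft_value_iteration(w_D_gt), P0)
mu_F_new = policy_feature_expectations(soft_value_iteration(w_F_gt), P0)
# 4. apply IRLF to the regenerated datasets and compare against the ground truth
w_D_hat, w_F_hat = irlf(mu_D_new, mu_F_new, P0)
print(w_D_hat, w_F_hat)
```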
Real Robot Results
[Plots: performance w.r.t. Expert and w.r.t. Taboo]
Outline
• TERESA
• Inverse Reinforcement Learning
• Inverse Reinforcement Learning from Failure
• Results
• Next Steps
Rapidly Exploring Random Trees
Figure source: Wikipedia
Next Version of TERESA
Extra Slides
Wizard of Oz Methodology
[Diagram: Human Controller, Interaction Target, Human Obstacles, Wizard of Oz operator]
Deep IRLF
Figure source: Wulfmeier et al.
Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv:1507.04888, 2015.