Learning Social Affordance for Human-Robot Interaction


Transcript of the poster "Learning Social Affordance for Human-Robot Interaction"

Source: tianmin.shu/SocialAffordance/Poster_IJCAI.pdf

Learning Social Affordance for Human-Robot Interaction
Tianmin Shu (1), M. S. Ryoo (2), Song-Chun Zhu (1)
(1) Center for Vision, Cognition, Learning, and Autonomy, UCLA; (2) Indiana University, Bloomington

Introduction

Objective: Learn explainable knowledge from noisy observations of human interactions in RGB-D videos to enable human-robot interaction.

Key idea: Going beyond traditional object and scene affordances, we propose weakly supervised learning of social affordances for human-robot interaction (HRI).

Contributions:
• First formulation and hierarchical representation of social affordance
• Weakly supervised learning from noisy skeleton input
• Efficient motion synthesis based on the learned hierarchical affordances

Model

[Figure: hierarchical graphical model. An interaction category c (with prior p(c)) generates a sequence of latent sub-events s_1, s_2, ..., s_K; each sub-event s_k carries an affordance Z_{s_k} (joint selection and grouping, with prior p(Z)) that explains the observed skeleton joints J^t over its time segment T_k (interaction parsing). The example shown is Shake Hands, with the current frame and its affordance.]

For one instance of category c:

p(G \mid Z_c) \propto \prod_k p(\{J^t\}_{t \in T_k} \mid Z_{s_k}, s_k, c) \cdot p(c) \cdot \prod_{k=2}^{K} p(s_k \mid s_{k-1}, c) \cdot \prod_{k=1}^{K} p(s_k \mid c)

where the four factors are, respectively, the likelihood, the interaction prior, the sub-event transition, and the sub-event prior. The affordance of a category factorizes over sub-events,

p(Z_c) = \prod_{s \in S} p(Z_s \mid c),

and the likelihood of each sub-event decomposes into a grouping term and a motion term,

p(\{J^t\}_{t \in T} \mid Z_s, s, c) = g(\{J^t\}_{t \in T}, Z_s, s) \, m(\{J^t\}_{t \in T}, Z_s, s).

For N training examples of category c, G = \{G_n\}_{n=1,\dots,N}:

p(G, Z_c) = p(Z_c) \prod_{n=1}^{N} p(G_n \mid Z_c)
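To make the factorization easier to follow, here is a minimal Python sketch of assembling the unnormalized log score of one parsed instance. The function name, its inputs, and the toy numbers are hypothetical; the actual likelihood, transition, and prior terms are learned model components that the poster does not spell out.

import numpy as np

def log_parse_score(log_lik, log_trans, log_sub_prior, log_cat_prior):
    """Unnormalized log p(G | Z_c) for one parsed instance, following the
    factorization above."""
    return (np.sum(log_lik)           # sum_k log p({J^t}_{t in T_k} | Z_{s_k}, s_k, c)
            + log_cat_prior           # log p(c)
            + np.sum(log_trans)       # sum_{k=2..K} log p(s_k | s_{k-1}, c)
            + np.sum(log_sub_prior))  # sum_k log p(s_k | c)

# Toy usage with made-up probabilities for a parse with K = 3 sub-events.
score = log_parse_score(
    log_lik=np.log([0.4, 0.5, 0.3]),
    log_trans=np.log([0.6, 0.7]),
    log_sub_prior=np.log([0.5, 0.3, 0.2]),
    log_cat_prior=np.log(0.2),
)
print(score)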

Learning

Goal: Obtain the optimal joint selection and grouping and the optimal interaction parsing by maximizing the joint probability p(G, Z_c) = p(G \mid Z_c)\, p(Z_c).

Algorithm:
• Initialization: skeleton clustering for the initial sub-event parsing.
• Outer loop: a Metropolis-Hastings algorithm for latent sub-event parsing, with splitting / merging / relabeling dynamics.
• Inner loop: Gibbs sampling for our modified Chinese restaurant process (CRP).

In the inner loop, each joint assignment z^s_{ai} is resampled from

z^s_{ai} \sim p(G \mid Z'_c)\, p(z^s_{ai} \mid Z^s_{-ai}),

with the modified CRP prior

p(z^s_{ai} \mid Z^s_{-ai}) =
\begin{cases}
\frac{\beta \gamma}{M - 1 + \gamma} & \text{if } z^s_{ai} > 0 \text{ and } M^s_{z_{ai}} = 0 \\
\frac{\beta M^s_{z_{ai}}}{M - 1 + \gamma} & \text{if } z^s_{ai} > 0 \text{ and } M^s_{z_{ai}} > 0 \\
1 - \beta & \text{if } z^s_{ai} = 0
\end{cases}
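The inner loop can be illustrated with a small, hedged Python sketch of the modified CRP prior and one Gibbs resampling step for a single joint assignment z^s_{ai}. Everything beyond the three-case prior above is an assumption made for illustration: the placeholder likelihood loglik_fn, the hyper-parameter values beta and gamma, and the candidate label set; the Metropolis-Hastings outer loop over sub-event parses is not shown.

import numpy as np

def modified_crp_prior(z, assignments, i, beta, gamma):
    """p(z^s_{ai} = z | Z^s_{-ai}) following the three-case prior above.

    z           : candidate group label for joint i (0 = joint not selected)
    assignments : array (M,) of current group labels for the M candidate joints
    beta, gamma : selection / concentration hyper-parameters (assumed values)
    """
    M = len(assignments)
    if z == 0:                       # joint i is not selected for this sub-event
        return 1.0 - beta
    others = np.delete(assignments, i)
    M_z = np.sum(others == z)        # other joints already assigned to group z
    if M_z == 0:                     # open a new group
        return beta * gamma / (M - 1 + gamma)
    return beta * M_z / (M - 1 + gamma)  # join an existing group

def gibbs_resample_joint(assignments, i, candidate_labels, loglik_fn,
                         beta=0.8, gamma=1.0, rng=np.random.default_rng(0)):
    """One inner-loop step: resample z^s_{ai} proportional to
    p(G | Z'_c) * p(z^s_{ai} | Z^s_{-ai}).

    candidate_labels should contain 0, the existing group labels, and one
    unused label (a potential new group). loglik_fn(assignments) stands in
    for log p(G | Z'_c); the real likelihood model is not given on the poster.
    """
    log_post = []
    for z in candidate_labels:
        proposal = assignments.copy()
        proposal[i] = z
        prior = modified_crp_prior(z, proposal, i, beta, gamma)
        log_post.append(loglik_fn(proposal) + np.log(prior))
    log_post = np.array(log_post)
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()
    assignments[i] = rng.choice(candidate_labels, p=probs)
    return assignments

# Toy usage: 5 candidate joints, existing groups {1, 2}, label 3 = potential new group.
dummy_loglik = lambda a: -1.0 * len(set(a[a > 0]))   # stand-in that prefers fewer groups
assign = np.array([1, 1, 0, 2, 2])
print(gibbs_resample_joint(assign, i=2, candidate_labels=[0, 1, 2, 3],
                           loglik_fn=dummy_loglik))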

Motion Synthesis

Goal: Given the initial 10 frames (25 fps), synthesize the motion of one agent given the motion of the other agent and the interaction type.

Algorithm: At time t,
1) estimate the current sub-event by dynamic programming (DP);
2) predict the ending time t_0 of the sub-event and the corresponding joint positions;
3) obtain the joint positions at t + 5 through interpolation.

[Figure: synthesis timeline for the human agent and the synthesized agent across sub-events s_1 and s_2, marking the times t, t_0, and t + 5.]
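Step 3 can be sketched as simple linear interpolation toward the predicted sub-event ending pose. This is a hypothetical illustration only: the array shapes, the helper name synthesize_next_joints, and the behavior when the sub-event ends within the horizon are assumptions, and the DP sub-event estimate and ending-pose prediction from steps 1-2 are treated as given inputs.

import numpy as np

def synthesize_next_joints(joints_t, joints_t0_pred, t, t0, horizon=5):
    """Step 3: interpolate from the current pose at frame t toward the
    predicted pose at the sub-event ending frame t0, and return the
    synthesized joint positions at frame t + horizon (t + 5 by default).

    joints_t       : array (J, 3) -- current joint positions
    joints_t0_pred : array (J, 3) -- predicted joint positions at t0
    """
    if t0 <= t + horizon:            # sub-event ends within the horizon
        return joints_t0_pred
    alpha = horizon / float(t0 - t)  # fraction of the remaining interval covered
    return (1.0 - alpha) * joints_t + alpha * joints_t0_pred

# Toy usage: 2 joints, current frame t = 40, predicted sub-event end t0 = 60.
cur = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.3]])
end = np.array([[0.5, 0.0, 0.0], [0.1, 0.6, 0.3]])
print(synthesize_next_joints(cur, end, t=40, t0=60))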

Experiment

UCLA Human-Human-Object Interaction Dataset
• Five types of interactions (Shake Hands, High-Five, Pull Up, Hand Over a Cup, Throw and Catch); on average 23.6 instances per interaction, performed by 8 actors in total. Each instance lasts 2-7 s and is presented at 10-15 fps.
• RGB-D videos, skeletons, and annotations are available at http://www.stat.ucla.edu/~tianmin.shu/SocialAffordance

[Figure: synthesis examples for the five interactions at sampled frames, comparing GT, Ours, and an HMM baseline.]

[Figure: examples of discovered latent sub-events and their sub-goals for the five interactions.]

Exp 1: Average joint distance in meters (compared with GT skeletons).

Exp 2: User study (14 subjects). Q1: Successful? Q2: Natural? Q3: Human vs. robot? Each rated from 1 (worst) to 5 (best).

[Figure: frequencies of high scores (4 or 5) for Q3, comparing GT and Ours.]
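Exp 1's metric can be computed as below, assuming the synthesized and ground-truth skeletons are frame-aligned (T, J, 3) arrays of joint positions in meters; the exact joint set and any alignment protocol are not specified on the poster, so this is only a sketch.

import numpy as np

def average_joint_distance(pred, gt):
    """Mean Euclidean distance (in meters) between synthesized and GT joints.

    pred, gt : arrays of shape (T, J, 3) -- T frames, J joints, 3D positions.
    """
    assert pred.shape == gt.shape
    per_joint = np.linalg.norm(pred - gt, axis=-1)   # (T, J) per-joint distances
    return per_joint.mean()

# Toy usage with random skeletons: 10 frames, 15 joints.
rng = np.random.default_rng(0)
gt = rng.normal(size=(10, 15, 3))
pred = gt + rng.normal(scale=0.05, size=gt.shape)
print(f"average joint distance: {average_joint_distance(pred, gt):.3f} m")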

Acknowledgment

This research has been sponsored by DARPA SIMPLEX grant N66001-15-C-4035 and ONR MURI grant N00014-16-1-2007.

Scan the QR code on the poster to visit our project website: http://www.stat.ucla.edu/~tianmin.shu/SocialAffordance