
Efficient Model Learning from Joint-Action Demonstrations for Human-Robot Collaborative Tasks

Stefanos Nikolaidis∗

Ramya Ramakrishnan, Keren Gu, Julie Shah

[email protected], [email protected], [email protected], [email protected]
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

* Stefanos Nikolaidis' present affiliation is with The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213

ABSTRACT

We present a framework for automatically learning human user models from joint-action demonstrations that enables a robot to compute a robust policy for a collaborative task with a human. First, the demonstrated action sequences are clustered into different human types using an unsupervised learning algorithm. A reward function is then learned for each type through the employment of an inverse reinforcement learning algorithm. The learned model is then incorporated into a mixed-observability Markov decision process (MOMDP) formulation, wherein the human type is a partially observable variable. With this framework, we can infer online the human type of a new user that was not included in the training set, and can compute a policy for the robot that will be aligned to the preference of this user. In a human subject experiment (n = 30), participants agreed more strongly that the robot anticipated their actions when working with a robot incorporating the proposed framework (p < 0.01), compared to manually annotating robot actions. In trials where participants faced difficulty annotating the robot actions to complete the task, the proposed framework significantly improved team efficiency (p < 0.01). The robot incorporating the framework was also found to be more responsive to human actions compared to policies computed using a hand-coded reward function by a domain expert (p < 0.01). These results indicate that learning human user models from joint-action demonstrations and encoding them in a MOMDP formalism can support effective teaming in human-robot collaborative tasks.

1. INTRODUCTION

The development of new industrial robotic systems that operate in the same physical space as people highlights the emerging need for robots that can integrate seamlessly into human group dynamics. In order to be efficient and productive teammates, robotic assistants need to be able to adapt to the personalized style of their human counterparts. This adaptation requires learning a statistical model of human behavior and integrating this model into the decision-making algorithm of the robot in a principled way.

In this paper, we describe a framework that allows for the learning of human user models through joint-action demonstrations. This framework enables the robot to compute a robust policy for a collaborative task with a human, assuming access to demonstrations of human teams working on the task. We hypothesize that a limited number of "dominant" strategies can capture the majority of demonstrated sequences. Using this assumption, we denote the preference of a human team member as a partially observable variable in a mixed-observability Markov decision process (MOMDP) [24] and constrain its value to a limited set of possible assignments. We chose the MOMDP formulation because the number of observable variables for human-robot collaborative tasks in a manufacturing setting is much larger than that of partially observable variables. Denoting the human preference for the action toward task completion as a hidden variable naturally models human collaboration, since the intentions of the participants can never be directly observed during training, and must be inferred through interaction and observation.

We start by describing the clustering of demonstrated action sequences into different human types using an unsupervised learning algorithm. These demonstrated sequences are used by the robot to learn a reward function that is representative for each type, through the employment of an inverse reinforcement learning algorithm. The learned models are then incorporated as part of a MOMDP formulation. With this framework, we can infer, either offline or online, the human type of a new user that was not included in the training set, and can compute a policy for the robot that will be aligned to the preference of this new user. In a human subject experiment (n = 30), participants agreed more strongly that the robot anticipated their actions when working with a robot utilizing the proposed framework (p < 0.01), compared to manually annotating robot actions. Additionally, in trials where the sequence of robot actions toward task completion was not trivially inferred by the participants, the proposed framework significantly improved team efficiency (p < 0.01). Finally, the robot was found to be more responsive to human actions with the learned policy, compared with executing a policy from a reward function hand-coded by a domain expert (p < 0.01).

First, we place our work in the context of other related work in Section 2, and introduce the proposed framework in Section 3. Next, we describe the clustering of demonstrated action sequences in Section 4. The learned models are then used as part of a MOMDP formulation (Section 5). We describe the human subject experiment in Section 6, discuss the results in Section 7 and conclude with potential directions for future research in Section 8.

2. RELEVANT WORK

For a robot to learn a human model, a human expert is typically required to explicitly teach the robot a skill or specific task [3, 4, 1, 21, 8, 2]. In this work, demonstrations of human teams executing a task are used to automatically learn human types in an unsupervised fashion. The data from each cluster then serves as input for an inverse reinforcement learning algorithm. In the context of control theory, this problem is known as inverse optimal control, originally posed by Kalman and solved in [6]. We follow the approach of Abbeel and Ng [1], and solve a quadratic program iteratively to find feature weights that attempt to match the expected feature counts of the resulting policy with those of the expert demonstrations. There have also been game-theoretic approaches [28, 30] that aim to model multi-agent behavior. Recent state-of-the-art imitation learning algorithms [15] have been shown to preserve theoretical guarantees of performance, while minimizing risk during learning. Through human demonstrations, our framework learns a number of different human types and a reward function for each type, and uses these as part of a MOMDP formulation.

Related approaches to learning user models include natural language interaction with a robot wheelchair [9], where a user model is learned simultaneously with a dialog manager policy. In this case, the system assumes that the model parameters are initially uncertain, and improves the model through interaction. More recently, humans and robots have been able to learn a shared plan for a collaborative task through a training process called "cross-training" [22]. Such approaches do not have the limitation of a fixed set of available models; however, learning a good model requires a large amount of data, which can be an issue when using the model for large-scale, real-world applications. Rather than learning a new model for each human user, which can be tedious and time-consuming, we use demonstrations by human teams to infer some "dominant" human types, and then associate each new user with one of these types.

Recent work has also inferred human intentions during collaborative tasks for game AI applications. One such study [20] focused on inferring the intentions of a human player, allowing a non-player character (NPC) to assist the human. Alternatively, [17] proposed the partially observable Monte-Carlo cooperative planning system, in which human intention is inferred for a "cops-and-robbers" turn-based game. In both works, the model of the human type is assumed to be known beforehand.

Partially observable Markov decision process (POMDP) models have been used to infer human intention during driving tasks [7], as well. In this case, the user model is represented by the transition matrix of a POMDP and is learned through task-specific action-rules. Recently, the mixed observability predictive state representation framework (MO-PSR) [23] has been shown to learn accurate models of mixed observability systems directly from observable quantities, but has not yet been verified for task planning applications. The MOMDP formulation [24] has been shown to achieve significant computational efficiency, and has been used in motion planning applications [5], with uncertainty about the intention of the human over their own actions. In the aforementioned work, the reward structure of the task is assumed to be known. In our work, the reward function that corresponds to each human type is learned automatically from unlabeled demonstrations.

In summary, we present a pipeline to automatically learn the reward function of the MOMDP through unsupervised learning and inverse reinforcement learning. The proposed framework enables the rapid estimation of a human user model online, through the a priori unsupervised learning of a set of "dominant models" encoded in a MOMDP formulation. Using a human subject experiment, we show that the learned MOMDP policies accurately represent the preferences of human participants and can support effective teaming for human-robot collaborative tasks. We describe the proposed framework in the next section.

3. METHOD

Our framework has two main stages, as shown in Figure 1. The training data is preprocessed in the first stage. In the second stage, the robot infers the personalized style of a new human teammate and executes its role in the task according to the preference of this teammate.

Figure 1: Framework flowchart. Training Data Preprocessing: input database of sequences, cluster sequences, learn a reward function per cluster, compute a MOMDP policy. Online Task Execution: sequences of a new user, user type is inferred, human and robot perform task actions.

When a robot is introduced to work with a new human worker, it needs to infer the human type and choose actions aligned to the preference of that human. Additionally, the robot should reason over the uncertainty on the type of the human. The first stage of our framework assumes access to a set of demonstrated sequences of actions from human teams working together on a collaborative task, and uses an unsupervised learning algorithm to cluster the data into dominating human types. The cluster indices serve as the values of a partially observable variable denoting human type in a mixed-observability Markov decision process. Our framework then learns a reward function for each human type, which represents the preference of a human of the given type on a subset of task-related robot actions. Finally, the framework computes an approximately optimal policy for the robot that reasons over the uncertainty on the human type and maximizes the expected accumulated reward.

In the second stage, a new human subject is asked to execute the collaborative task with the robot. The human can then demonstrate a few sequences of human and robot actions, and a belief about his type can be computed according to the likelihood of the human sequences belonging to each cluster. Alternatively, if the human actions are informative of his type, as in the human subject experiment described in Section 6, the human type can be estimated online. The robot then executes the action of the computed policy of the MOMDP, based on the current belief of the human type at each time step.
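As an illustration of this belief computation, the following Python sketch scores a few demonstrated sequences against each cluster's transition matrix (learned as in Section 4) and normalizes the result into a belief over types. The function names, the uniform prior and the theta[z][prev, next] indexing convention are our assumptions, not details from the paper.

import numpy as np

def sequence_log_likelihood(seq, T):
    """Log-likelihood of an action-index sequence under transition matrix T[prev, next] (the per-cluster term of Eq. (1))."""
    return sum(np.log(T[seq[j - 1], seq[j]]) for j in range(1, len(seq)))

def belief_over_types(sequences, thetas, prior=None):
    """Belief over human types given a few demonstrated sequences and per-type transition matrices."""
    k = len(thetas)
    prior = np.full(k, 1.0 / k) if prior is None else np.asarray(prior, dtype=float)
    log_post = np.log(prior)
    for z in range(k):
        log_post[z] += sum(sequence_log_likelihood(s, thetas[z]) for s in sequences)
    log_post -= log_post.max()          # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()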

In the following section, we describe the first block of the proposed framework: finding the number of dominating human types in a collaborative task by clustering the demonstrated sequences.

4. CLUSTERING OF HUMAN TYPES

To improve a robot's ability to adapt to human preferences, we first try to find human preferences using an unsupervised clustering approach. In this problem, we have a data set D = {x1, ..., xn}, where each xi is a demonstrated sequence of alternating, discrete human and robot actions. The robot actions are those that the human annotates for the robot. The goal is to find the number of human types, k, within this data and the assignment of each sequence of actions xi to a type.

Previous work has approached this problem of clustering sequential data through various methods. Murphy and Martin [19] clustered ranking or ordinal data through expectation-maximization (EM) by learning distance-based models that had two parameters: a central ranking and a precision parameter. The distance between rankings was defined using Kendall's, Spearman's and Cayley's distances, as specified in [18]. In another work, Jaaskinen [13] clustered DNA sequences modeled as Markov chains using a Dirichlet process prior over the partitions. A greedy search of joining and splitting partitions was used to determine the number of clusters, and EM was used to learn transition probability matrices and to correctly assign sequences to clusters. In solving our clustering problem, we chose to use a hybrid approach combining these two methods. Similar to [13], our framework learns transition matrices between human and robot actions using EM, because this provides information about how the human will act based on the actions of the robot, and vice-versa.

We begin by using a hard variant of EM, similar to [13], to cluster the data into a set of human preferences. In the algorithm, we represent each preference or cluster by a transition matrix of size |A| x |A|, where |A| is the size of the action space A = Ar ∪ Ah, which includes both robot actions Ar and human actions Ah. Since the data consists of a sequence of actions in which the human and robot take turns, the transition matrix encodes information about how the human will act based on the previous robot action, and vice-versa. We then define θ as the set of k representative transition matrices {θ1, ..., θk} that correspond to the k clusters. Every sequence xi, each of length l, in the data D = {x1, ..., xn} must be assigned to one of these k clusters. The assignments of these sequences to clusters can be denoted as Z = {z1, ..., zn}, where each zi ∈ {1, ..., k}.
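As a concrete illustration of this representation, the sketch below encodes one sequence as integer indices into the joint action alphabet and counts its transitions. The action labels and the rows-index-the-previous-action convention are ours, not the ones used in the study.

import numpy as np

# Joint action alphabet A = Ar ∪ Ah; the labels are illustrative only.
ACTIONS = ["r_move_left", "r_move_right", "r_tilt_down", "h_reach_left", "h_reach_right", "h_polish"]
NUM_ACTIONS = len(ACTIONS)

def transition_counts(seq, num_actions=NUM_ACTIONS):
    """Count n[prev, next]: how often action `next` immediately follows action `prev` in one sequence."""
    n = np.zeros((num_actions, num_actions), dtype=int)
    for j in range(1, len(seq)):
        n[seq[j - 1], seq[j]] += 1
    return n

# One demonstrated sequence of alternating human and robot action indices.
x1 = [3, 0, 3, 0, 5]
print(transition_counts(x1))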

Algorithm: Cluster-Transition-Matrices(k)

1. Initialize θ by randomizing θ1, ..., θk
2. Initialize sequence assignments Z = {z1, ..., zn}
3. repeat
4.   E-step: compute the assignment for each sequence zi, for i = 1, ..., n:
       zi = argmax_{zi} ∏_{j=2}^{l} θ_{zi}(x_i^j | x_i^{j-1})
5.   M-step: update each transition matrix θz, for z = 1, ..., k:
       θ_{z,i|j} = n_{i|j} / Σ_{x=1}^{|A|} n_{x|j},  for i, j = 1, ..., |A|,
       where n_{i|j} is the observed count of transitions from i to j
6. until Z converges to stable assignments

Figure 2: Cluster Transition Matrices using EM

The probability of one sequence xi parameterized by θ can be represented as follows, where x_i^j denotes the jth element of the ith demonstrated sequence:

P(x_i; \theta) = \sum_{z_i=1}^{k} P(z_i) P(x_i \mid z_i; \theta) = \sum_{z_i=1}^{k} P(z_i) \prod_{j=2}^{l} \theta_{z_i}(x_i^j \mid x_i^{j-1})    (1)

For all data points, the log-likelihood is:

l(D; \theta) = \sum_{i=1}^{n} \log P(x_i; \theta) = \sum_{z=1}^{k} \sum_{i=1}^{n} \delta(z \mid z_i) \log \left( P(z_i) \prod_{j=2}^{l} \theta_{z_i}(x_i^j \mid x_i^{j-1}) \right)    (2)

where δ(z | z_i) = 1 if z = z_i and zero otherwise.

The cluster-transition-matrices EM algorithm learns the optimal transition matrices θ1, ..., θk by iteratively performing the E-step and the M-step. First, lines 1-2 randomly initialize k transition matrices and sequence assignments; then, lines 3 through 6 repeatedly execute the E-step and M-step until the assignments Z converge to stable values. In the E-step, we complete the data by assigning each sequence to the cluster with the highest log-likelihood (line 4). In the M-step, each cluster's transition matrix is updated by counting the transitions in all sequences assigned to that cluster (line 5). These two steps are repeated until the assignments z1, ..., zn do not change (line 6).
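A minimal Python sketch of this hard-EM procedure is shown below. It assumes each sequence is a list of integer action indices, uses the theta[z][prev, next] indexing convention from the earlier sketch, and adds a small smoothing constant (our choice; the paper does not discuss smoothing) so that unseen transitions do not produce zero probabilities.

import numpy as np

def hard_em_cluster(sequences, k, num_actions, max_iter=100, eps=1e-6, seed=0):
    """Hard-EM clustering of action sequences into k transition-matrix clusters (cf. Figure 2)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k random row-stochastic transition matrices theta[z][prev, next].
    theta = rng.random((k, num_actions, num_actions)) + eps
    theta /= theta.sum(axis=2, keepdims=True)
    z = rng.integers(k, size=len(sequences))           # 2. random initial assignments

    def log_lik(seq, T):
        # log of the per-cluster term of Eq. (1): sum of log transition probabilities
        return sum(np.log(T[seq[j - 1], seq[j]]) for j in range(1, len(seq)))

    for _ in range(max_iter):
        # 4. E-step: assign each sequence to the cluster with the highest log-likelihood.
        new_z = np.array([np.argmax([log_lik(seq, theta[c]) for c in range(k)])
                          for seq in sequences])
        # 5. M-step: re-estimate each cluster's transition matrix from its assigned sequences.
        counts = np.full((k, num_actions, num_actions), eps)   # smoothing avoids log(0)
        for seq, c in zip(sequences, new_z):
            for j in range(1, len(seq)):
                counts[c, seq[j - 1], seq[j]] += 1
        theta = counts / counts.sum(axis=2, keepdims=True)
        # 6. Stop when the assignments no longer change.
        if np.array_equal(new_z, z):
            break
        z = new_z
    return theta, z

In practice one would typically run the procedure from several random initializations and keep the clustering with the highest total log-likelihood, since hard EM can converge to a local optimum.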

For the EM algorithm, we choose the number of clusters, k, by calculating the within-cluster dispersion for a range of values of k and selecting the value of k such that adding another cluster does not result in a significant decrease of the within-cluster dispersion [29].
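One simple way to realize this selection rule is sketched below, using the total negative log-likelihood of the sequences under their assigned clusters as the dispersion measure and a relative-gain threshold for the elbow; both choices are ours, and [29] describes the more principled gap statistic.

import numpy as np

def within_cluster_dispersion(sequences, theta, z):
    """Total negative log-likelihood of each sequence under its assigned cluster's transition matrix."""
    total = 0.0
    for seq, c in zip(sequences, z):
        total -= sum(np.log(theta[c][seq[j - 1], seq[j]]) for j in range(1, len(seq)))
    return total

def pick_k(dispersions, min_relative_gain=0.1):
    """Smallest k after which adding a cluster reduces dispersion by less than min_relative_gain."""
    ks = sorted(dispersions)
    for a, b in zip(ks, ks[1:]):
        if (dispersions[a] - dispersions[b]) / dispersions[a] < min_relative_gain:
            return a
    return ks[-1]

# Usage (with hard_em_cluster from the previous sketch):
# dispersions = {k: within_cluster_dispersion(seqs, *hard_em_cluster(seqs, k, NUM_ACTIONS)) for k in range(2, 8)}
# k_star = pick_k(dispersions)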

We then input the learned clusters into a mixed-observability Markov decision process, as described in the next section.

5. MOMDP LEARNING AND PLANNING

The clusters of demonstrated action sequences represent different types of humans. When a robot is introduced to work with a new human worker, it needs to infer the human type for that worker and choose actions that are aligned to their preference. Additionally, the robot should reason over the uncertainty on the type of the human. Therefore, the cluster indices serve as the values of a partially observable variable denoting the human type in a mixed-observability Markov decision process (MOMDP).

Our framework learns a reward function for each human type. We then compute an approximately optimal policy for the robot that reasons over the uncertainty on the human type and maximizes the expected accumulated reward. We describe the MOMDP formulation, the learning of the reward function and the computation of an approximately optimal policy, as follows:

5.1 MOMDP Formulation

We treat the unknown human type as a hidden variable in a MOMDP, and have the robot choose actions according to the estimated human type. The MOMDP framework uses proper factorization of the observable and unobservable state variables, reducing the computational load. The MOMDP is described by a tuple (X, Y, S, Ar, Tx, Ty, R, Ω, O), so that the following hold (a minimal code sketch of this structure appears after the list):

• X is the set of observable variables in the MOMDP. In our framework, the observable variable is the current task-step among a finite set of task-steps that signify progress toward task completion.

• Y is the set of partially observable variables in the MOMDP. In our framework, a partially observable variable, y, represents the human type.

• S : X × Y is the set of states in the MOMDP consisting of the observable and non-observable variables. The state s ∈ S consists of the task-step x, which we assume is fully observable, and the unobservable type of the human y.

• Ar is a finite set of discrete task-level robot actions.

• Tx : S × Ar → Π(X) is the probability of the fully observable variable being x′ at the next time step if the robot takes action ar at state s.

• Ty : S × Ar × X → Π(Y) is the probability of the partially observable variable being y′ at the next time step if the robot takes action ar at state s, and the next fully observable state variable has value x′.

• R : S × Ar → ℝ is a reward function that gives an immediate reward for the robot taking action ar at state s. It is a function of the observable task-step x, the partially observable human type y and the robot action ar.

• Ω is the set of observations that the robot receives through observation of the actions taken by the human and the robot.

• O : S × Ar → Π(Ω) is the observation function, which gives a probability distribution over possible observations for each state s and robot action ar. We write O(s, ar, o) for the probability that we receive observation o given s and ar.
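A minimal sketch of how this factored tuple might be held in code, with the sets enumerated as integer indices and the functions stored as dense arrays; the field names and shapes are our assumptions:

from dataclasses import dataclass
import numpy as np

@dataclass
class MOMDP:
    """Factored MOMDP (X, Y, S, Ar, Tx, Ty, R, Omega, O) with sets enumerated as integer indices."""
    n_x: int            # |X|: number of observable task-steps
    n_y: int            # |Y|: number of hidden human types (cluster indices from Section 4)
    n_a: int            # |Ar|: number of discrete robot actions
    n_o: int            # |Omega|: number of observations
    Tx: np.ndarray      # shape (|X|, |Y|, |Ar|, |X|): P(x' | s = (x, y), ar)
    Ty: np.ndarray      # shape (|X|, |Y|, |Ar|, |X|, |Y|): P(y' | s = (x, y), ar, x')
    R:  np.ndarray      # shape (|X|, |Y|, |Ar|): immediate reward R(s, ar)
    O:  np.ndarray      # shape (|X|, |Y|, |Ar|, |Omega|): P(o | s, ar), evaluated at the next state in Eq. (3)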

5.2 Belief-State Estimation

Based on the above, the belief update is then [24]:

b_y(y') = \eta \, O(s', a_r, o) \sum_{y \in Y} T_x(s, a_r, x') \, T_y(s, a_r, s') \, b_y(y)    (3)
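A sketch of this update in Python, using the array layout of the previous sketch; the indexing conventions are ours, and the normalizing constant η is realized by dividing at the end:

import numpy as np

def belief_update(b_y, x, x_next, a_r, o, Tx, Ty, O):
    """MOMDP belief update over the hidden human type y (cf. Eq. 3).

    b_y:       current belief over types, array of shape (|Y|,)
    x, x_next: current and next observable task-step indices
    a_r:       robot action index;  o: observation index
    Tx, Ty, O: arrays shaped as in the MOMDP sketch above
    """
    new_b = np.zeros(len(b_y))
    for y_next in range(len(b_y)):
        total = 0.0
        for y in range(len(b_y)):
            total += Tx[x, y, a_r, x_next] * Ty[x, y, a_r, x_next, y_next] * b_y[y]
        new_b[y_next] = O[x_next, y_next, a_r, o] * total
    return new_b / new_b.sum()     # η: normalize so the belief sums to one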

5.3 Inverse Reinforcement Learning

Given a reward function, an exact value function and an optimal policy for the robot can be calculated. Since we want the robot to choose actions that align with the human type of its teammate, a reward function must be specified for every value that the human type can take. Manually specifying a reward function for practical applications can be tedious and time-consuming, and would represent a significant barrier for the applicability of the proposed framework. In this section, we describe the learning of a reward function for each human type using the demonstrated sequences that belong to the cluster associated with that specific type.

For a fixed human type y, we can reduce the MOMDP into a Markov decision process (MDP). The MDP, in this context, is a tuple (X, Ar, Tx, R, γ), where X, Ar, Tx and R are defined in the MOMDP Formulation section above. Given demonstrated sequences of state-action pairs, we can estimate the reward function of the MDP using the inverse reinforcement learning (IRL) algorithm [1]. We assume the human type to be constant in the demonstrated sequences. To compute the reward function for each cluster, we first assume that a feature vector ϕ exists for each state, and each given policy has a feature expectation that represents the expected discounted accumulation of feature values based on that policy. Formally, we define the feature expectations of a policy π to be:

\mu(\pi) = E\left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi \right]    (4)

We also require an estimate of the feature expectations for each human type. Given a set of nz demonstrated state-action trajectories per human type z, we denote the empirical estimate for the feature expectation as follows:

\mu_z = \frac{1}{n_z} \sum_{i=1}^{n_z} \sum_{t=0}^{\infty} \gamma^t \phi(s_t^{(i)})    (5)
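A sketch of the empirical estimate of Eq. (5), assuming each demonstrated trajectory of a type is a list of state indices and phi maps a state index to its feature vector (both assumptions of ours):

import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.95):
    """Average discounted feature counts over the demonstrated trajectories of one human type (cf. Eq. 5)."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)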

The IRL algorithm begins with a single random policy and attempts to generate a policy that is a mixture of existing policies, with feature expectations that are similar to those for the policy followed by the expert. In our case, the "expert" demonstrations were those followed by all humans of a particular type. The algorithm terminates when ||µz − µ(π)||2 ≤ ε, and is implemented as described in [1].

The output is a list of policies π(i) and corresponding reward functions Ri(s), with mixture weights λi. We use the reward function of the policy with the maximum weight λi.
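The outer loop of the algorithm of [1] can be sketched as follows, in its simpler projection variant (the paper instead solves a quadratic program at each iteration, per Section 2). The helper solve_mdp, which returns an optimal policy and its feature expectations for a given weight vector, stands in for a standard MDP solver and is not shown; the final step that computes the mixture weights λi (a small quadratic program in [1]) is also omitted.

import numpy as np

def apprenticeship_irl(mu_expert, solve_mdp, rng, epsilon=0.01, max_iter=100):
    """Projection-variant apprenticeship learning [1], sketched.

    mu_expert:    empirical feature expectations of one human type (Eq. 5)
    solve_mdp(w): assumed helper returning (policy, mu) for the reward R(s) = w . phi(s)
    Returns the candidate policies and their reward weight vectors.
    """
    d = len(mu_expert)
    policies, weights = [], []
    # Start from a random reward direction to obtain the first policy.
    w = rng.standard_normal(d)
    pi, mu = solve_mdp(w)
    policies.append(pi); weights.append(w)
    mu_bar = mu
    for _ in range(max_iter):
        w = mu_expert - mu_bar                  # new reward direction
        if np.linalg.norm(w) <= epsilon:        # termination: feature expectations matched
            break
        pi, mu = solve_mdp(w)
        policies.append(pi); weights.append(w)
        # Project mu_expert onto the line through mu_bar and mu.
        diff = mu - mu_bar
        mu_bar = mu_bar + (diff @ (mu_expert - mu_bar)) / (diff @ diff) * diff
    return policies, weights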

For each human type, the framework applies IRL, using the demonstrated sequences of that type as input to calculate an associated reward function. With a reward function for any assignment of the partially observable human type variable y, we can now compute an approximately optimal policy for the robot, as described in the next section.

5.4 Policy Computation

We solve the MOMDP for a policy that takes into account the uncertainty of the robot over the human type, while maximizing the agent's expected total reward. MOMDPs are structured variants of POMDPs, and finding an exact solution for a POMDP is computationally expensive [14]. Point-based approximation algorithms have greatly improved the speed of POMDP planning [27, 16, 26] by updating selected sets of belief points. In this work, we use the SARSOP solver [16], which, combined with the MOMDP formulation, can scale up to hundreds of thousands of states [5]. The SARSOP algorithm samples a representative set of points from the belief space that are reachable from the initial belief, and uses this set as an approximate representation of the space, allowing for the efficient computation of a satisfactory solution.
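Point-based solvers such as SARSOP typically represent the computed policy as sets of α-vectors, and in the MOMDP case one set per observable state x, each vector tagged with a robot action [24]. The sketch below shows online action selection under that representation; the output format is our assumption and not the interface of any particular solver implementation.

import numpy as np

def select_action(x, b_y, alpha_vectors):
    """Pick the robot action of the alpha-vector that maximizes the expected value at belief b_y.

    alpha_vectors[x]: list of (alpha, action) pairs for observable task-step x,
    where alpha has shape (|Y|,) and scores the belief over hidden human types.
    """
    best_value, best_action = -np.inf, None
    for alpha, action in alpha_vectors[x]:
        value = float(np.dot(alpha, b_y))
        if value > best_value:
            best_value, best_action = value, action
    return best_action

After the selected action is executed and an observation is received, the belief over the human type is refreshed with the update of Eq. (3) before the next decision.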

6. HUMAN-ROBOT TEAMING EXPERIMENT

We conducted a large-scale experiment to evaluate the proposed framework. In particular, we were interested in showing that the learned policies enabled the robot to take anticipatory actions that matched the preference of a human worker, compared to policies computed using a reward function hand-coded by a domain expert. We were also interested in the quality of the learned reward function compared to a hand-coded reward function, as well as in the effect of the anticipatory actions on the task execution efficiency of the human-robot team. For conditions 2 and 3 below, 15 subjects provided demonstrations by annotating human and robot actions.

6.1 Independent Variables

In our experiment, we controlled the robot decision-making mechanism in a human-robot collaborative task. This independent variable can have one of three values, which we refer to as "Manual session," "Auto1 session" and "Auto2 session."

1. Manual Control - The subject annotates robot actions by verbally commanding the robot.

2. Automatic Control with Manually Hand-coded Rewards (Auto1) - A domain expert observed the demonstrated sequences, manually specified the number of human types and explicitly specified a reward function that best explained the preference of each human type. A MOMDP policy was then computed, as described in Section 5.4. The robot automatically takes actions by executing that policy. The expert experimented several times by changing the reward function until the resulting policy was satisfactory.

3. Automatic Control with Learned Rewards (Auto2) - We used clustering and inverse reinforcement learning to learn the reward function of the MOMDP and compute a policy, as described in Sections 3, 4 and 5.

6.2 Hypotheses

H1 Participants will agree more strongly that the robot takes anticipatory actions when working with the robot in the Auto1 and the Auto2 conditions, compared to working with the robot in the Manual condition. The learned MOMDP policy enables the robot to infer online the preference of the human participant, and take anticipatory actions that match that preference. We expected the robot actions from the second and third condition to match the goals of the participants and to be strongly perceived as anticipatory, compared with manually jogging the robot. In prior work [5], MOMDPs have successfully been applied to recognizing human intention and using that information for decision-making.

H2 The robot performance as a teammate, as perceived by the participants, will be comparable between the Auto1 and Auto2 conditions. We posited that the reward function learned by the proposed framework for each cluster of demonstrated action sequences would accurately represent the goals of participants with that preference, and would result in performance similar to that achieved using a reward function carefully annotated by a domain expert.

H3 The team efficiency and fluency of human-robot teams of the Auto1 and Auto2 conditions will be better than that of teams of the Manual condition. We posited that the robot anticipatory actions of the learned MOMDP policy, as well as of the MOMDP policy from the hand-coded reward, would result in faster task execution and better team fluency compared with manually annotating robot actions. Automating robot behaviors [11] and, in particular, enabling anticipatory actions [12] has previously resulted in significant improvements in team efficiency and fluency in manufacturing applications.

6.3 Experiment Setting

We conducted a large-scale experiment of 36 human subjects to test the three hypotheses mentioned above, using a human-robot collaborative task. The human's role was to refinish the surface of a box attached to an ABB industrial robot. The robot's role was to move the box to a position that matched the human preference. All participants executed the task from four different positions, as shown in Figure 3. Each position required the participant to think of a different configuration for the box; therefore, the preference of the participant for the robot actions was dependent on his position. Our experiments used a within-subject design to mitigate the effects of inter-subject variability. We divided the subjects into balanced groups for each of the k! = 6 orderings of our k = 3 conditions. All experiment sessions were recorded. After each condition, all participants were asked to answer a post-session survey that used a five-point Likert scale to assess their responses to working with the robot. At the end of the experiment, participants also responded to an open-ended post-experimental questionnaire.

Figure 3: Execution of a hand-finishing task by a human worker and an industrial robot. All participants were asked to execute the task from four different positions: (a) middle-left, (b) far-left, (c) middle-right and (d) far-right. In the first two positions (top), participants were asked to refinish the left surface of the box; in the other two (bottom), they were asked to refinish the right surface.

For the robot in the second and third conditions, the observable state variables x of the MOMDP framework were the box position along the horizontal and vertical axis, as well as the tilt angle of the box. The size of the observable state-space X was 726. The starting configuration of the box was specified by a small initial rotation and displacement from the center and level position. The demonstration data used to learn the MOMDP was provided prior to the experiment by 15 participants who were different than the 36 subjects. For the second condition, a domain expert grouped the participants into four types based on their final position. For the third condition, the clustering algorithm of the proposed framework identified four clusters based on whether the box was moved to the left or right side, as well as whether it was tilted and moved down or not. Therefore, including the partially observable variables, the total size of the state-space of the MOMDP was 4 × 726 = 2904 states. The robot actions corresponded to discrete changes in position and orientation of the box along three different axes. The human actions corresponded to the discretized motion of the human hand in the horizontal plane. To recognize the actions of the human, we used a Phasespace motion capture system consisting of eight cameras [25] that tracked the motion of a Phasespace glove worn by the participant. Measurements of the hand position were averaged in fixed intervals, and an action was detected when the difference between two consecutive averages exceeded a set threshold. Additionally, an oscillating hand-motion with a brush, simulating a surface refinishing, was recorded as the final action that ended the task. For both the second and third conditions, the observation function for the MOMDP was empirically specified. We leave learning an observation function from the demonstrated sequences for each cluster for future work.
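A sketch of this window-averaging action detector is given below; the window length, threshold and the four discretized direction labels are illustrative values of ours, not parameters reported in the paper.

import numpy as np

def detect_actions(hand_positions, window=15, threshold=0.05):
    """Detect discrete human actions from a stream of 2D hand positions (horizontal plane).

    hand_positions: np.ndarray of shape (T, 2) of motion-capture measurements.
    An action is reported whenever the mean position of one fixed window differs from the
    mean of the previous window by more than `threshold` (units assumed to be meters).
    """
    actions = []
    means = [hand_positions[i:i + window].mean(axis=0)
             for i in range(0, len(hand_positions) - window + 1, window)]
    for prev, curr in zip(means, means[1:]):
        delta = curr - prev
        if np.linalg.norm(delta) > threshold:
            # Quantize the displacement direction into one of the discretized human actions.
            if abs(delta[0]) >= abs(delta[1]):
                actions.append("h_right" if delta[0] > 0 else "h_left")
            else:
                actions.append("h_forward" if delta[1] > 0 else "h_backward")
    return actions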

7. RESULTS AND DISCUSSION

7.1 Subjective Measures

Two participants belonging to different groups did not answer the questions on the second page of their post-session surveys, and were therefore not taken into consideration. To maintain an equal number of participants per group, we randomly removed one participant from each of the remaining four groups, resulting in a final number of n = 30 participants.

Table 1: P-Values for three-way and pairwise comparisons (n = 30). Statistically significant values are shown in bold.

Question   Omnibus    Auto2 v. Auto1   Auto2 v. Man   Auto1 v. Man
Q1         p < 0.01   p = 0.01         p = 0.20       p = 0.01
Q2         p < 0.01   p = 0.34         p < 0.01       p < 0.01

Q1: "The robot was responsive to me."
Q2: "The robot anticipated my actions."

As shown in Table 1, a three-way Friedman's test confirmed a statistically significant difference for question Q2. A pairwise Wilcoxon signed rank test with Bonferroni correction showed that, compared with manually annotating robot actions, participants agreed more strongly that the robot anticipated their actions when utilizing the proposed framework, Auto2 (Q2, p < 0.01), as well as when working with the robot in condition Auto1 (Q2, p < 0.01). These results follow from the fact that the MOMDP formulation allows the robot to reason over its uncertainty on the human type. Once the robot has enough information on the human type or preference, it will take actions toward task completion that follow that preference. Interestingly, the generated robot behavior emulates behavior frequently seen in human teams, where one member may let his teammate start working on a task, wait until he has enough information to confidently associate his teammates' actions with a familiar pattern based on prior experience, and then move onto task execution himself. This supports hypothesis H1 of Section 6.2.
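For reference, omnibus and pairwise comparisons of this kind can be computed with standard tools; the sketch below uses scipy, with synthetic Likert ratings standing in for the actual survey responses.

import numpy as np
from scipy import stats

# Synthetic 5-point Likert ratings for illustration only; the real input is the survey data (n = 30).
rng = np.random.default_rng(0)
manual = rng.integers(1, 6, size=30)
auto1  = rng.integers(1, 6, size=30)
auto2  = rng.integers(1, 6, size=30)

chi2, p_omnibus = stats.friedmanchisquare(manual, auto1, auto2)   # omnibus three-way test

pairs = {"Auto2 v. Auto1": (auto2, auto1),
         "Auto2 v. Man":   (auto2, manual),
         "Auto1 v. Man":   (auto1, manual)}
for name, (a, b) in pairs.items():
    stat, p = stats.wilcoxon(a, b)                                # pairwise signed-rank test
    print(name, "p =", min(1.0, p * len(pairs)))                  # Bonferroni correction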

Recall our second hypothesis H2: that the robot performance as perceived by the participants would be comparable between conditions Auto1 and Auto2. We performed a TOST equivalence test, similarly to [10], and showed that participants rated robot intelligence (p < 0.01), accuracy of robot actions (p = 0.02), the robot's anticipation of participants' actions (p = 0.03), the trustworthiness of the robot (p < 0.01) and the smoothness of the robot's trajectories (p = 0.04) similarly between the two conditions. Interestingly, results from a pairwise Wilcoxon signed rank test indicated that participants agreed more strongly that the robot was responsive to them when working with the robot of the proposed framework (Auto2), compared with the robot that executed the MOMDP policy with the hand-coded reward function (Q1, p = 0.01). We believe that this is a result of the quality of the clustering algorithm: Whereas the domain expert grouped human subjects according to their final position upon completing the task (Figure 3), our algorithm clustered subjects based on demonstrated human and robot action sequences. Therefore, when the robot of the proposed framework inferred the human preference and took anticipatory actions, it was perceived as more responsive.

7.2 Quantitative Measures

Here, we consider the total task completion time and human idle time for each condition of the experiment. As Figure 4 shows, participants executed the task faster, on average, when working with the robot of the proposed framework (Auto2: M = 102.8, SD = 8.33), compared with the other two conditions (Auto1: M = 111.4, SD = 8.85 and Manual: M = 107.4, SD = 33.0). Furthermore, the human idle time was shorter when working with the robot of the proposed framework (Auto2: M = 85.2, SD = 8.3) compared with the other two conditions (Auto1: M = 92.9, SD = 7.67 and Manual: M = 89.8, SD = 33.7). (All units are in seconds.) A repeated-measure analysis of variance did not find the differences between the conditions in task execution and human idle time to be statistically significant. Additionally, no learning effects were observed as a function of session number.

Whereas this result does not provide adequate support for hypothesis H3 of Section 6.2, we observed that the variance of the task completion and human idle time between the subjects was much smaller when working with the robot of the proposed framework, compared to when manually annotating human and robot actions. We attribute this to the large variation in the amount of time subjects needed to find out which robot actions would bring the box to the desired position when manually controlling the robot. Additionally, some subjects annotated short trajectories during the manual condition, while others took multiple redundant steps when commanding the robot. Reducing the variation in execution time is important in the manufacturing domain, as it enables more efficient process planning and scheduling.

Figure 4: Average and standard error for task completion and human idle times in each condition.

We conducted a post-hoc experimental analysis of the data, and observed that task completion times varied significantly depending on the position from which participants were asked to execute the task (Figure 3). In particular, when manually annotating robot actions in the far-left position, a small number of robot motions within the horizontal plane were enough to bring the box from the starting point to a comfortable position. Indeed, a repeated measures analysis of variance with a Greenhouse-Geisser correction demonstrated statistically significant differences in task completion time between participants for the far-left position (F(1.16, 46.0) = 26.61, p < 0.01). Completion time at this position in the Manual condition was significantly lower than in the Auto2 condition (t(29) = −6.651, p < 0.01).

On the other hand, in the middle-right and far-right positions, several users spent a considerable amount of time during the Manual session trying to determine which robot actions would bring the box to a comfortable position from the starting configuration. A repeated measures analysis of variance with a Greenhouse-Geisser correction indicated statistically significant differences in task completion time as a function of condition at both the middle-left (F(1.12, 32.5) = 7.03, p = 0.01) and far-right positions (F(1.66, 48.13) = 8.96, p < 0.01). Completion time in the Auto2 condition was significantly lower than in the Manual condition at the middle-left (t(29) = 3.16, p < 0.01) and far-right (t(29) = 3.39, p < 0.01) positions (Figure 5). These results show that, at positions where the robot actions that would bring the box to a comfortable position were straightforward, simply annotating robot actions seemed to be the most efficient approach. For more complex configurations, the robot anticipatory behavior enabled by our framework resulted in a significant benefit to team efficiency.

Figure 5: Task execution time by position. Average and standard error for the task completion times in the Manual and Auto2 conditions at each of the four positions.

7.3 Open-Ended Responses

When participants were asked to comment on their overall experience, some suggested that they appreciated the automated motion of the robot, because they found it difficult to determine how to move it to a favorable position. When asked to describe the part of the experiment that they liked most, some mentioned the ease of standing in place and allowing the robot to maneuver itself. One subject mentioned that, during the Manual session, she "had to think how to rotate the box, which turned out to be not trivial." Several subjects, on the other hand, suggested that they preferred to be in control during the manual condition, rather than ceding control to the robot, and that both automated trials resulted in some unnecessary motion on the part of the robot. Interestingly, we observed a moderate correlation between the execution time of the subjects in the manual condition and their preference for manually controlling the robot (r = -0.58). This result is indicative of a relationship between user preference and performance that warrants future investigation.

8. CONCLUSION

We presented a framework for automatically learning human user models from joint-action demonstrations, enabling the robot to compute a robust policy for a collaborative task with a human. First, we described the clustering of demonstrated action sequences into different human types using an unsupervised learning algorithm. These demonstrated sequences were used to learn a reward function that is representative for each type, through the employment of an inverse reinforcement learning algorithm. The learned models were then included as part of a MOMDP formulation, wherein the human type was a partially observable variable. In a human subject experiment (n = 30), participants agreed more strongly that the robot anticipated their actions when using the proposed framework (p < 0.01), compared with manually annotating robot actions. In trials where participants faced difficulty annotating the robot actions in order to complete the task, the proposed framework significantly improved team efficiency (p < 0.01). Also, compared with policies computed using a reward function hand-coded by a domain expert, the robot was more responsive to human actions when using our framework (p < 0.01). These results indicate that learning human user models through joint-action demonstrations and encoding them in a MOMDP formalism can support effective teaming in human-robot collaborative tasks.

While our assumption of full observability of the task-steps is reasonable in a manufacturing setting with well-defined task procedures, many other domains, including home settings, involve less structured tasks and would require extensions to our approach. Future work also includes the exploration of a hybrid approach between inferring "dominant" human types and learning an individualized user model: the robot starts with a policy corresponding to the general type of a new human user, and further refines its action selection mechanism through interaction.

9. REFERENCES

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.
[2] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz. Trajectories and keyframes for kinesthetic teaching: a human-robot interaction perspective. In HRI, 2012.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robot. Auton. Syst., May 2009.
[4] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In ICML, pages 12-20, 1997.
[5] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus. Intention-aware motion planning. In WAFR. Springer, 2013.
[6] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Stud. Appl. Math. SIAM, June 1994.
[7] F. Broz, I. Nourbakhsh, and R. Simmons. Designing POMDP models of socially situated tasks. In RO-MAN, 2011.
[8] S. Chernova and M. Veloso. Teaching multi-robot coordination using demonstration of communication and state sharing. In Proc. AAMAS, 2008.
[9] F. Doshi and N. Roy. Efficient model learning for dialog management. In Proc. HRI, March 2007.
[10] A. Dragan, R. Holladay, and S. Srinivasa. An analysis of deceptive robot motion. In RSS, 2014.
[11] M. C. Gombolay, R. A. Gutierrez, G. F. Sturla, and J. A. Shah. Decision-making authority, team efficiency and human worker satisfaction in mixed human-robot teams. In RSS, 2014.
[12] G. Hoffman and C. Breazeal. Effects of anticipatory action on human-robot teamwork efficiency, fluency, and perception of team. In Proc. HRI, 2007.
[13] V. Jaaskinen, V. Parkkinen, L. Cheng, and J. Corander. Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model. Stat. Appl. Genet. Mol., 2013.
[14] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
[15] B. Kim and J. Pineau. Maximum mean discrepancy imitation learning. In Proc. RSS, 2013.
[16] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, pages 65-72, 2008.
[17] O. Macindoe, L. P. Kaelbling, and T. Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
[18] J. I. Marden. Analyzing and modeling rank data. CRC Press, 1995.
[19] T. B. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 2003.
[20] T.-H. D. Nguyen, D. Hsu, W. S. Lee, T.-Y. Leong, L. P. Kaelbling, T. Lozano-Perez, and A. H. Grant. CAPIR: Collaborative action planning with intention recognition. In AIIDE, 2011.
[21] M. N. Nicolescu and M. J. Mataric. Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proc. AAMAS, 2003.
[22] S. Nikolaidis and J. Shah. Human-robot cross-training: computational formulation, modeling and evaluation of a human team training strategy. In Proc. HRI, 2013.
[23] S. C. Ong, Y. Grinberg, and J. Pineau. Mixed observability predictive state representations. In AAAI, 2013.
[24] S. C. Ong, S. W. Png, D. Hsu, and W. S. Lee. Planning under uncertainty for robotic tasks with mixed observability. IJRR, 29(8):1053-1068, 2010.
[25] Phasespace, http://www.phasespace.com, 2012.
[26] J. Pineau, G. Gordon, S. Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025-1032, 2003.
[27] G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers. In Proc. AAMAS, 2013.
[28] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Proc. NIPS, 2007.
[29] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist. Soc. Ser. B, 2003.
[30] K. Waugh, B. D. Ziebart, and J. A. D. Bagnell. Computational rationalization: The inverse equilibrium problem. In Proc. ICML, June 2011.