
Noname manuscript No. (will be inserted by the editor)

Multi-Cue Contingency Detection

Jinhan Lee · Crystal Chao · Aaron F. Bobick · Andrea L. Thomaz

Received: date / Accepted: date

Abstract The ability to detect a human's contingent response is an essential skill for a social robot attempting to engage new interaction partners or maintain ongoing turn-taking interactions. Prior work on contingency detection focuses on single cues from isolated channels, such as changes in gaze, motion, or sound. We propose a framework that integrates multiple cues for detecting contingency from multimodal sensor data in human-robot interaction scenarios. We describe three levels of integration and discuss our method for performing sensor fusion at each of these levels. We perform a Wizard-of-Oz data collection experiment in a turn-taking scenario in which our humanoid robot plays the turn-taking imitation game "Simon says" with human partners. Using this data set, which includes motion and body pose cues from a depth and color image and audio cues from a microphone, we evaluate our contingency detection module with the proposed integration mechanisms and show gains in accuracy of our multi-cue approach over single-cue contingency detection. We show the importance of selecting the appropriate level of cue integration as well as the implications of varying the referent event parameter.

ONR YIP N000140810842.

Jinhan Lee
801 Atlantic Drive, Atlanta, GA 30332-0280
School of Interactive Computing
E-mail: [email protected]

Crystal Chao
E-mail: [email protected]

Aaron F. Bobick
E-mail: [email protected]

Andrea L. Thomaz
E-mail: [email protected]

Keywords Contingency Detection · Response Detection · Cue Integration

1 Introduction

A social robot can use contingency as a primary or auxiliary cue to decide how to act appropriately in various interaction scenarios (e.g. [1,2]). One such scenario is identifying willing interaction partners and subsequently initiating interactions with them, such as a shop robot attempting to engage customers who might need help. To do this, the robot would generate a signal intended to elicit a behavioral response and then look to see if any such response occurs. The presence of a change in the behavior of a human at the right time is a good indication that he or she is willing to follow up with an interaction. Similarly for disengagement, the absence of contingent responses can verify that a human has finished interacting with the robot.

Another scenario involves reciprocal turn-taking. In such contexts, contingency can be used as a signal that helps a robot determine when the human is ready to take a turn and when it is appropriate for the robot to relinquish the floor to the human. For example, if the robot sends a signal to yield the floor to the human and detects a contingent response from the human midway through, this may indicate the human's readiness to take a turn. In this case, the robot can interrupt its signal and yield the floor to the human.

When a robot understands the semantics of a human's activity in a given context and has a specific expectation over a set of known possible actions, then checking for a contingent response might simply entail matching the human's executed action against the


expected action set. This strategy of detecting expected reactions makes the sometimes problematic assumptions that the set of appropriate or meaningful responses can be enumerated in advance, and that specific recognition methods can be constructed for each response. For example, the robot might expect the human to utter a sentence that is contained in the grammar of its speech recognizer; a match would indicate a contingent response. In these cases, preprogrammed domain-specific responses may be sufficient to decide how to act. In many realistic interaction scenarios, however, the potential natural human responses are so varied that it would be difficult to enumerate them a priori and to preprogram methods of detecting those actions. In addition, knowing which action was generated is not always necessary; often, the presence of a contingent action is enough to enable appropriate social behavior. Thus, a robot needs an additional mechanism to detect contingent responses that are not modeled in advance: detecting a behavioral change that is indicative of a contingent response. We believe that these two mechanisms are complementary for contingency detection. Our main focus in this work is to provide a framework that detects behavior changes without a priori knowledge of possible responses.

A contingent behavioral change by a human can occur in one or multiple communication channels. For example, a robot that waves to a human to get his attention may receive a spoken "Hello!" with a simultaneous wave motion as a response. Here, we consider the problem of human contingency detection with multimodal sensor data as input when forming our computational model. We validate the multiple-cue approach using multimodal data from a turn-taking scenario and show that modeling a response using multiple cues and merging them at the appropriate levels leads to improvements in accuracy in contingency detection. Collectively with our previous work [3], the results in this paper demonstrate that our behavior-change-based contingency detector provides a highly indicative perceptual signal for response detection in both engagement and turn-taking scenarios.

In this paper, we make the following contributions.

1. We present a contingency detection framework that integrates multiple cues, each of which models different aspects of a human's behavior. We extend our prior work, which only uses the visual cues of motion and body pose, to create a more generic contingency detection module. In our proposed framework, the cue response can be modeled as either an event or a change.

2. We propose three different levels of cue integration: the frame level, the module level, and the decision level. We show that for change-based detection, integration of visual cues at the module level outperforms integration at the decision level.

3. We examine the effects of selecting different timing models and referent events. In particular, we show how selecting the minimum necessary information referent event [4] improves detection and requires a smaller amount of data, increasing the tractability of the real-time detection problem.

4. We provide a probabilistic method for measuring the reliability of visual cues and adaptively integrating those cues based on their reliability.

5. We evaluate our proposed contingency detection framework using multimodal data and demonstrate that multi-cue contingency detection is a necessary component for interactions with humans and their multimodal responses.

Fig. 1 A human being contingent to the robot in a turn-taking scenario based on the game "Simon says." The robot sends both motion and speech signals to the human subject by simultaneously waving and saying hello. The subject waves back to the robot in response.

The rest of the paper is organized as follows. Section 2 reviews prior research related to contingency detection. Section 3 describes our computational model for contingency detection and cue integration. Section 4 introduces two different approaches for response detection: event-based detection and change-based detection. Section 5 describes the derivation of motion, body pose, and audio cues from image and depth sensors and the implementation of contingency detectors using these cues. Section 6 explains the different levels of cue integration in detail. Sections 7–10 explain human data collection, model evaluation with the collected data, and experimental results. Finally, Section 11 presents our conclusions.

2 Related Work

Prior work has shown how contingency detection can be leveraged by a social robot to learn about structure in


functional or social environments. Contingency detection has been used to allow robots to learn human gaze behaviors [5–7] and to understand social interactions [1,2,8].

Butko and Movellan [7] broke down the problem of contingency detection into two subproblems: response detection and timing interpretation. They developed a model of behavior that queried the environment with an auditory signal and detected a responsive agent using a thresholded sound sensor. The goal of the controller was to query at times that maximized the information gained about the responsiveness of the agent. Thus, their focus was on the timing constraints of the contingency problem. Similarly, Gold and Scassellati [9] focused on learning effective timing windows for contingency detection, which they did separately for an auditory signal and a visual signal.

There is evidence that humans use this contingency mechanism to learn some causal relationships. Watson found that infants use this mechanism for self-recognition and for detection of responsive social agents [10,11]. A robot can also use contingency for recognition of self and others [12,13]. In this formulation of the problem, the presence of a visual change is directly mapped to the presence of a response. Thus, the presence of the response is assumed, but the source of the response is unknown and is attributed using timing interpretation.

In our work, we formulate the problem slightly differently. Because we are interested in human-robot interaction domains, such as engagement detection and turn-taking, we focus on a single response source: the human. We also cannot assume that visual changes always indicate responses, as visual changes frequently occur without correlating with contingent responses. Detecting contingency thus requires more complex analysis of what visual changes are observed. Previously in [3], we implemented contingency detection using single vision-based cues and demonstrated the application of our contingency detection module as a perceptual component for engagement detection. Other work has focused on processing other individual channels, independently demonstrating the significance of gaze shift [14,15], agent trajectory [16,17], or audio cues [18] as contingent reactions.

An ideal contingency detector should be able to accept a variety of sensory cues, because certain perceptual and social cues are more informative for some interactional situations than for others. Here we extend the modalities supported by our framework in [3] to include the audio channel and detail an approach for the integration of multiple cues.

Fig. 2 Causal relationship between a robot's interactive signal and a human's response. Top: tS is the start of the robot signal, and tR is the start of the human response. Bottom: two time windows, WB and WA, defined with respect to the current time tC, are used to model the human behavior before and after tRef. WB and WA are of size ∆B and ∆A, respectively. The time window starting at tRef and valid over the maximum response delay (MRD) is examined for the contingent human response. Note that tRef may not be the same as tS.

3 Approach

Contingency detection consists of two sub-problems: response detection and timing estimation. Figure 2 shows the causal relationship between a robot signal and the corresponding human response as well as time windows for detecting such a response. A robot generates some interactive signal to a human by gesturing to, speaking to, or approaching her. Some time after the robot initiates a signal, and indeed sometimes before the robot completes its action, the human may initiate a response that could last for a certain duration.

The time interval during which a human may respond to a given robot signal needs to be estimated. We define its endpoints to be the time at which the robot begins to look for a human response, tRef, and the maximum time for the human's response delay after tRef, (tRef + MRD). To detect a response, we evaluate sensor data within that time window. Because we are not considering anticipation on the part of the human, tRef is defined to be after tS; to be effective, it also should precede the initiation time of the human's response, tR. Chao et al. [4] define the notion of the minimum necessary information (MNI): the moment in time at which an actor (human or robot) has conveyed sufficient information such that the other actor may respond appropriately. (We will describe this concept in more detail within the context of our experiments in Section 7.) In the results we will show that the MNI is a more effective referent than the time at which the robot completes its signal.

To detect human responses within the estimated evaluation time window, we take two different modeling approaches: event detection and change detection.


Fig. 3 Situation-independent contingency detection. Situation-specific information, such as robot signals and timing models, are parameters of the contingency module.

As shown in Figure 2, we define WB and WA as time windows from which data are used to model the human behavior before and after the robot's referent signal, tRef, respectively. An event detection models a response as an event, and that event is looked for in WA. On the other hand, a change detection models a response as a change, which is measured by comparing observations between WB and WA. When the human does not respond to the robot, both WB and WA describe the same aspect of the human's behavior. However, in the contingent cases the human changes her behavior to make a contingent response, so WB and WA model different behaviors. To detect such changes, we measure how likely it is that the sensor data in WA reflect the same behavior observed in the sensor data in WB.

Figure 3 shows the information flow between a contingency module and other robot modules. To make our contingency detection module as situation-independent as possible, we assume that the expected timing of a response is determined by another of the robot's modules and is taken as a parameter to the contingency detection module.
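A minimal sketch of the interface implied by Figure 3 is shown below; the class and field names are our own illustrative choices, not the paper's. The point is only that the situation-aware module supplies the timing model as a parameter, so the detector itself stays situation-independent.

```python
# Sketch, assuming hypothetical names: the situation-aware module supplies a
# TimingModel, and the contingency detector only consumes it as a parameter.
from dataclasses import dataclass

@dataclass
class TimingModel:
    t_ref: float    # referent event time (seconds), e.g. signal start or MNI
    mrd: float      # maximum response delay after t_ref
    delta_b: float  # length of the "before" window W_B
    delta_a: float  # length of the "after" window W_A

class ContingencyDetector:
    def evaluate(self, cue_buffer, timing: TimingModel, t_now: float) -> bool:
        """Return True if a contingent response is detected within
        [timing.t_ref, timing.t_ref + timing.mrd] using the given cue data."""
        raise NotImplementedError  # realized by event- or change-based detectors
```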

3.1 Cue Representation

To model a given aspect of human behavior, we derive information from observations from a single sensor or from multiple sensors, hereafter referred to as a cue. Depending on the characteristics of the sensors used, a cue is encoded as either a binary variable or a continuous variable. When the underlying sensors produce low-dimensional observations, the derived cue is easily represented as a binary variable. For example, audio and touch sensor signals can be classified as either on or off by simply thresholding the magnitude of the raw observations. Sensors that generate high-dimensional observations, such as image and depth sensors, require more complicated preprocessing, and thus a derived cue would be encoded as a continuous, high-dimensional variable. Section 5 describes procedures for extracting cues from sensors in detail.

3.2 Multi-cue Integration

The extracted cues from sensors should be integrated in such a way that the contingency detection module reduces uncertainty and increases accuracy in its decision-making. We adapt the data fusion framework introduced by Hall and Llinas [19] to integrate cues. Here, we define a frame as a representation of cue information and as an input to the contingency detection module. We define three levels of integration: 1) the frame level, at which cues are merged into one augmented frame as an input to the contingency module; 2) the module level, at which cues are integrated within one contingency detection module; and 3) the decision level, at which outputs from multiple single-cue contingency modules merge. These levels are shown in Figure 4.

Intuitively, the principal difference between frame- and module-level integration and decision-level integration is whether the cues are combined into a single signal whose variation during the contingency window is evaluated for a behavioral change, or whether each cue is considered independently and the two decisions are fused to provide a final answer.

Cues should be integrated at the right level based on the characteristics of a cue: the underlying sensor's sampling rate and dimensionality, and the encoded perceptual information. If two cues encode the same perceptual modality of a human and complement each other, thus modeling behavior in a more discriminative way, then the two cues should be merged either at the frame level or at the module level rather than at the decision level.

The difference between frame-level integration and module-level integration is whether the cues are augmented and evaluated by one common distance metric, or whether each individual cue is evaluated by a cue-specific distance metric and the two evaluations are fused. One evaluation on merged cues captures the correlation between cue observations better than merging multiple individual evaluations. If such a distance metric is available, cues should be integrated at the frame level rather than at the module level. Because of the dissimilarity of the physical measurements, we do not explore frame-level integration in this paper.


Fig. 4 Cue integration at different levels: a) at the frame level, b) at the module level, and c) at the decision level.

Fig. 5 Change detection framework [3].

4 Response Detection

4.1 Event Detection

When the possible forms of expected responses are known to a robot a priori, responses can be modeled and recognized as events. If temporal events are modeled with a high-dimensional cue, some vision-based methods such as generative models (Hidden Markov Models and their variants) and discriminative models (Conditional Random Fields and their variants) can be adopted [20]. Simple events such as touching a robot or making a sound can be modeled using low-dimensional cues, so they are easier to detect using simple filtering methods. To detect vocal responses, we derive the audio cue from a microphone sensor and model these responses as events.

4.2 Change Detection

Change detection measures a human's behavioral change as a response to a robot's signal. We look for a significant perturbation in the human behavior by modeling that behavior before and after the signal and looking for significant differences. We assume that any observed perturbation happening within the allowed time interval is the human's response. A change is measured using a history of cue data between WB and WA as input. To measure the degree of behavioral difference in the response, we proposed a change detection framework [3], as shown in Figure 5.

As the change detector accumulates cue data into a buffer over time, it keeps only enough to fill the past ∆B amount of time and discards frames that are older. When triggered at time tRef, the data for WB stays fixed, and the detector transitions to accumulating cue data to fill WA until time tRef + ∆A.
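A minimal sketch of this buffering behavior is given below, under the assumption that cue vectors arrive with timestamps; the class and method names (CueBuffer, trigger) are ours, not the paper's.

```python
# Sketch of the cue buffering described above: keep only delta_b seconds of
# history before the trigger, freeze W_B at trigger time, then keep filling W_A.
from collections import deque

class CueBuffer:
    def __init__(self, delta_b, delta_a):
        self.delta_b, self.delta_a = delta_b, delta_a
        self.samples = deque()   # (timestamp, cue_vector) pairs
        self.t_ref = None        # set when the robot signal triggers
        self.frozen_wb = None    # W_B data fixed at trigger time

    def add(self, t, cue_vector):
        self.samples.append((t, cue_vector))
        if self.t_ref is None:
            # Before a trigger, keep only the most recent delta_b seconds.
            while self.samples and self.samples[0][0] < t - self.delta_b:
                self.samples.popleft()

    def trigger(self, t_ref):
        # Freeze W_B = [t_ref - delta_b, t_ref]; keep accumulating data for W_A.
        self.t_ref = t_ref
        self.frozen_wb = [v for (ts, v) in self.samples
                          if t_ref - self.delta_b <= ts <= t_ref]

    def window_a(self, t_now):
        # W_A = [t_now - delta_a, t_now], as defined in Section 4.2.1.
        return [v for (ts, v) in self.samples
                if t_now - self.delta_a <= ts <= t_now]
```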

Once cue data exist over all of WB and WA, a cue distance matrix is calculated between data in WB and WA using a cue-specific distance metric (Section 4.2.1). The distance matrix is converted into a distance graph by applying specific node connectivity (Section 4.2.2). Then, we calculate the dissimilarity score by measuring a statistical difference between the graph nodes representing data from WB and WA (Section 4.2.3). Finally, as an extension to [3] and as one of the contributions of this paper, we introduce a method that uses probabilistic models to evaluate a dissimilarity score S within our multi-cue contingency detection framework (Section 4.2.4).

4.2.1 Building the Distance Matrix

We define a cue buffer to be the set of cue feature vectors vt computed at each instant of time. From the cue buffer, we first extract cue data from the two time windows WB and WA based on tRef and tC. WB is the time interval between tRef − ∆B and tRef, and WA is the time interval between tC − ∆A and tC. Let V^B and V^A denote the cue data extracted from the time windows WB and WA, respectively:

V^B = \{v_t \mid t \in W_B\} = \{v_{l+1}, v_{l+2}, \ldots, v_{l+P}\}
V^A = \{v_t \mid t \in W_A\} = \{v_{m+1}, v_{m+2}, \ldots, v_{m+Q}\}.

Let V = V^B ∪ V^A, so |V| = P + Q. Let Vi denote the ith element of V. The distance matrix DM_X of a cue X is calculated by measuring the pairwise distance between cue elements in V; DM_X(i, j) describes the distance between cue vectors Vi and Vj using a predefined distance metric for the cue X.


Fig. 6 Building a distance graph from the distance matrix. The edge weight between Vi and Vj, w(Vi, Vj), is the distance between Vi and Vj, DM_X(i, j). N(Vi, k) is the kth nearest neighbor of Vi in WB. Blue and red nodes are from V^B and V^A, respectively.

We will describe distance metrics for the visual cues: motion (see Section 5.2) and body pose (see Section 5.3).
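The sketch below shows how such a pairwise distance matrix can be built for a generic cue, parameterized by a cue-specific metric; the function names are ours, and the Euclidean metric in the usage comment is only one example.

```python
# Sketch of the DM_X construction of Section 4.2.1 over V = V_B ∪ V_A.
import numpy as np

def build_distance_matrix(v_b, v_a, metric):
    """v_b, v_a: lists of cue feature vectors from W_B and W_A.
    metric(x, y): cue-specific distance. Returns the (P+Q) x (P+Q) matrix."""
    v = list(v_b) + list(v_a)
    n = len(v)
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(v[i], v[j])
            dm[i, j] = dm[j, i] = d
    return dm

# Example with a Euclidean metric (e.g. for reconstructed motion-cue frames):
# dm = build_distance_matrix(v_b, v_a, lambda x, y: np.linalg.norm(x - y))
```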

4.2.2 Building the Distance Graph

We construct a distance graph from the distance matrix. The distance graph has the following characteristics, as shown in Figure 6:

1. Nodes from WB (e.g. V1 to V4) are fully connected to each other.

2. Nodes from WA (e.g. V5 and V6) are never connected to each other.

3. Nodes from WA are only connected to the κ nearest nodes from WB.

The edge weight between two nodes Vi and Vj, w(Vi, Vj), corresponds to DM_X(i, j).

4.2.3 Calculating the Dissimilarity Measure

We measure dissimilarity by calculating the ratio of the cross-dissimilarity between V^B and V^A to the self-dissimilarity of V^B. N(Vi, k) denotes the kth nearest neighbor of Vi in V^B. Let E denote the number of dissimilarity evaluations. CD(V) measures the cross-dissimilarity between V^B and V^A in the following equation:

CD(V) = \sum_{q=1}^{Q} \sum_{k=1}^{\kappa} \sum_{e=1}^{E} w(V_{P+q}, N(N(V_{P+q}, k), e)),    (1)

where P is |V^B| and Q is |V^A|.

SD(V) measures the self-dissimilarity within V^B:

SD(V) = \sum_{q=1}^{Q} \sum_{k=1}^{\kappa} \sum_{e=1}^{E} w(N(V_{P+q}, k), N(N(V_{P+q}, k), e))    (2)

The dissimilarity of V, DS(V), is defined as:

DS(V) = \frac{CD(V)}{SD(V)}    (3)

DS(V) is the dissimilarity score S for a given V.
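A direct sketch of Equations (1)–(3) follows, assuming a precomputed distance matrix over V with the first P rows/columns coming from W_B; the function names are ours. A small constant can be added to the W_B block beforehand to stabilize the denominator, as described for the visual cues in Sections 5.2 and 5.3.

```python
# Sketch of the dissimilarity score DS(V) = CD(V) / SD(V), Equations (1)-(3).
import numpy as np

def dissimilarity_score(dm, p, q, kappa, e):
    """dm: (P+Q)x(P+Q) distance matrix; p = |V_B|, q = |V_A|;
    kappa: nearest W_B neighbors per W_A node; e: evaluations per neighbor."""
    def nearest_in_wb(idx, count):
        # Indices of the `count` nearest W_B nodes to node `idx` (excluding itself).
        order = np.argsort(dm[idx, :p])
        order = order[order != idx]
        return order[:count]

    cd = 0.0  # cross-dissimilarity, Eq. (1)
    sd = 0.0  # self-dissimilarity, Eq. (2)
    for a in range(p, p + q):                 # each node from W_A
        for b in nearest_in_wb(a, kappa):     # N(V_a, k), k = 1..kappa
            for c in nearest_in_wb(b, e):     # N(N(V_a, k), e'), e' = 1..E
                cd += dm[a, c]
                sd += dm[b, c]
    return cd / sd                            # DS(V), Eq. (3)
```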

4.2.4 Evaluating Dissimilarity Score

In [3], we learned a threshold value on the dissimilarity score from the training data and used it to classify the score as being contingent or not. This simple evaluation method cannot be used in our probabilistic model for multi-cue integration because it does not provide a confidence in the decision made, and because it does not take into account how informative the cue used is.

We propose a new evaluation method that resolves the two problems described above. To determine whether an observed change (i.e. a dissimilarity score) actually resulted from a human response and not from changes that occur naturally in the human's background behavior, we should evaluate the change not only under the contingency condition, but also under the non-contingency condition. To this end, we model two conditional probability distributions: a probability distribution of the dissimilarity score S under the contingency condition C, P(S|C), and a probability distribution of S under the non-contingency condition C̄, P(S|C̄). Assuming that a human changes her behavior when responding, these two distributions need to differ for the cue to be considered informative.

We learn the distribution P(S|C) off-line from training data in which human subjects are being contingent to the robot's action. We estimate the distribution P(S|C̄), the null hypothesis, on the fly during an interaction from observations of the human's behavior before the robot triggers a signal. It is important to note that the null hypothesis is estimated with on-line data, particularly the data gathered immediately before the robot's signal is triggered. As shown in Figure 7, this distribution is estimated from dissimilarity score samples, each of which is obtained as if the robot's signal were triggered and enough data were accumulated at each point in time.


Fig. 7 A method for building the null hypothesis. Dissimilarity score samples are obtained by evaluating data in windows over time.
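The sketch below illustrates the on-line null-hypothesis estimate of Figure 7, reusing the build_distance_matrix and dissimilarity_score sketches from Sections 4.2.1 and 4.2.3. The Gaussian fit is our illustrative assumption; the paper only requires some probability model of the pre-signal scores.

```python
# Sketch: sample dissimilarity scores by sliding the (W_B, W_A) window pair
# over the pre-trigger cue history, then fit a simple density as P(S | not C).
import numpy as np

def null_hypothesis_scores(cue_vectors, timestamps, delta_b, delta_a,
                           metric, kappa, e, step):
    """cue_vectors/timestamps: pre-trigger cue history (time-ordered)."""
    scores = []
    t = timestamps[0] + delta_b
    while t <= timestamps[-1] - delta_a:
        v_b = [v for v, ts in zip(cue_vectors, timestamps) if t - delta_b <= ts <= t]
        v_a = [v for v, ts in zip(cue_vectors, timestamps) if t < ts <= t + delta_a]
        if len(v_b) > 1 and len(v_a) > 0:
            dm = build_distance_matrix(v_b, v_a, metric)      # Section 4.2.1 sketch
            scores.append(dissimilarity_score(dm, len(v_b), len(v_a), kappa, e))
        t += step
    return scores

def fit_gaussian(scores):
    # Returns a density s -> p(s); an illustrative choice for P(S | not C).
    mu, sigma = np.mean(scores), np.std(scores) + 1e-6
    return lambda s: np.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
```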

5 Cues for Contingency Detection

The choice of cues for response detection should be determined by the nature of the interaction. When a robot engages in face-to-face interaction with a human, a shift in the human's eye gaze is often enough to determine the presence of a response [14,15]. In situations where a gaze cue is less indicative or is less reliable to perceive, other cues should be leveraged for response detection. Here, we are interested in modeling human behavior using three different cues: 1) a motion cue, the pattern of observed motion; 2) a body pose cue, the observed human body configuration; and 3) an audio cue, the presence of human sound.

5.1 Audio Cue

For the audio cue, the contingent response is modeled as an event that occurs during the evaluation time window WA (from tRef to tRef + MRD). The audio cue models the presence of the human response through the sound channel and is derived from a microphone sensor. We build a simple event detector using the audio cue. The audio cue is estimated as a binary variable by thresholding the raw sound data to remove a base level of background noise, as shown in Figure 8(a). Only raw sound samples greater than a certain threshold are set to 1; otherwise, they are set to 0. At time t, this binary audio cue, denoted as At, is added to the audio cue buffer. After a robot triggers a signal, the evaluation starts. During the evaluation, at time tC, cue data A_T is first extracted from WA: A_T = {At | t ∈ WA}. A_T is classified as Aon only if all elements in A_T are 1; otherwise it is classified as Aoff. The classification ratio of Aon, R(Aon), is calculated as:

R(A_{on}) = \frac{P(C \mid A_{on})}{P(\bar{C} \mid A_{on})} = \frac{P(A_{on} \mid C)\, P(C)}{P(A_{on} \mid \bar{C})\, P(\bar{C})}    (4)

Since the presence of a human response can only be considered when the audio cue is on, the classification ratio for Aoff is set to 1 and thus does not have any influence on the overall contingency decision when it is integrated with other cues.
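A minimal sketch of this audio event detector and the ratio of Equation (4) is given below; the threshold and the learned probabilities are assumed to come from training (Section 8.1), and the function names are ours.

```python
# Sketch of the audio event detector and R(A_on), Equation (4).
import numpy as np

def audio_cue(raw_samples, noise_threshold):
    # Binary cue A_t: 1 where the raw magnitude exceeds the background level.
    return (np.abs(np.asarray(raw_samples)) > noise_threshold).astype(int)

def classify_window(a_t_window):
    # A_T is "on" only if every binary cue sample in W_A is 1.
    return "on" if len(a_t_window) > 0 and all(a_t_window) else "off"

def audio_ratio(a_t_window, p_on_given_c, p_on_given_not_c, p_c=0.5):
    """R(A_on) from Eq. (4); returns 1.0 for A_off so that the audio cue leaves
    the fused decision unchanged, as described in the text."""
    if classify_window(a_t_window) == "off":
        return 1.0
    return (p_on_given_c * p_c) / (p_on_given_not_c * (1.0 - p_c))
```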

5.2 Motion Cue

The motion cue models the motion patterns of the human of interest in the image coordinate system. This cue is derived from observations from image and depth sensors. To observe only motions generated by a human subject in a scene, we segment the region in which the human subject appears. To do so, we use the OpenNI API [21], which provides the functionality of detecting and tracking multiple humans using depth images. The process for generating the motion cue is illustrated in Figure 8(b). First, the motion in an image is estimated using a dense optical flow calculation [22]. After grouping motion regions using a connected-components-based blob detection method, groups with small motions and groups that do not coincide with the location of the human of interest are filtered out. The motion cue comprises the remaining motion regions. This extracted motion cue is used as an input to the motion contingency module and is accumulated over time.

In the evaluation of [3], we included an intermediate step of dimensionality reduction using Non-negative Matrix Factorization (NMF) [23] on the combined cue data. The NMF coefficients of the data are normalized, so the magnitude of the original motion is encoded in the NMF basis vectors. We calculate Euclidean distances between the cue data using only the NMF coefficients of the data. When the human does not move, small motions that occur from noise observations can stand out and determine a dissimilarity score. If some noise occurs in M_T^A, then a high dissimilarity score would be produced. To remove this instability, we now reconstruct the cue data using the NMF basis vectors to recover the original magnitude of motion and calculate Euclidean distances between the reconstructed cue data. The small value εM is added to the distance matrix of M_T^B to increase the overall self-dissimilarity of M_T^B.
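The sketch below illustrates this reconstruction-based motion distance, using scikit-learn's NMF as one possible implementation; the number of components and the epsilon value are illustrative choices, not the paper's.

```python
# Sketch of the motion-cue distance matrix with NMF reconstruction and the
# epsilon added to the W_B block, as described above.
import numpy as np
from sklearn.decomposition import NMF

def motion_distance_matrix(motion_b, motion_a, n_components=10, eps_m=1e-3):
    """motion_b, motion_a: arrays of flattened, non-negative motion-cue frames
    (rows) from W_B and W_A. Returns the distance matrix over V = V_B ∪ V_A."""
    data = np.vstack([motion_b, motion_a])           # combined cue data
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    coeffs = nmf.fit_transform(data)                 # per-frame coefficients
    recon = coeffs @ nmf.components_                 # recover motion magnitude
    dm = np.linalg.norm(recon[:, None, :] - recon[None, :, :], axis=2)
    p = len(motion_b)
    dm[:p, :p] += eps_m                              # inflate W_B self-distances
    return dm
```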

5.3 Body Pose Cue

The body pose cue models the body configuration of the human of interest. The human body configuration is estimated from a 3D point cloud extracted from the human of interest. The process for generating the body pose cue is illustrated in Figure 8(c). As for the motion cue, we segment the region of the human from a depth scene using the OpenNI API.


Fig. 8 Extracting cues from different sensors: a) audio cue, b) motion cue, and c) body pose cue.

Then, we extract a set of 3D points by sparse sampling from the human region in a depth image and reconstructing 3D points from those depth samples.

As a distance metric for the body pose cue, we measure the pairwise distance between two sets of 3D points. Let P1 and P2 be two sets of 3D points: P_1 = \{X^1_i \mid i = 1, \ldots, m\} and P_2 = \{X^2_j \mid j = 1, \ldots, n\}. The distance between these point sets P1 and P2, PD(P1, P2), is defined as follows:

PD(P_1, P_2) = \frac{1}{mK} \sum_{i=1}^{m} \sum_{k=1}^{K} \left\| X^1_i - X^2_{ik} \right\|^2 + \frac{1}{nK} \sum_{j=1}^{n} \sum_{k=1}^{K} \left\| X^2_j - X^1_{jk} \right\|^2,    (5)

where X^2_{ik} is the kth nearest neighbor in P2 to X^1_i and X^1_{jk} is the kth nearest neighbor in P1 to X^2_j. When calculating the distance matrix DM_D of M_T^B and M_T^A, the small value εD is added to the distance matrix of M_T^B for the same reason of handling noise effects as for the motion cue.
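A compact sketch of the symmetric K-nearest-neighbor point-set distance of Equation (5) follows; scipy's cKDTree is one convenient way to find the K nearest neighbors, and the value of K is illustrative.

```python
# Sketch of PD(P1, P2), Equation (5).
import numpy as np
from scipy.spatial import cKDTree

def pose_distance(p1, p2, k=3):
    """p1, p2: (m, 3) and (n, 3) arrays of 3D points sampled from the human."""
    tree1, tree2 = cKDTree(p1), cKDTree(p2)
    # Distances from each point in P1 to its K nearest points in P2, and vice
    # versa; Equation (5) averages their squares over points and neighbors.
    d12, _ = tree2.query(p1, k=k)
    d21, _ = tree1.query(p2, k=k)
    return np.mean(d12 ** 2) + np.mean(d21 ** 2)
```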

6 Multiple Cue Integration and Adaptation

Cues can be integrated at one of three levels: the frame level, the module level, or the decision level. The specific level should be chosen based on characteristics of the cues, such as dimensionality, sampling rates, and associated semantics.

Since the motion and body pose cues are extracted at similar sampling rates from image and depth sensors and also model the same perceptual region, we can combine the motion and configuration aspects to yield a feature that is more discriminative than either separately. Because a common distance metric for these cues is not available, we merge these cues at the module level, at which distance matrices calculated from individual cues are merged. Due to its different sampling rate and low-dimensional signal, the audio cue is integrated at the decision level with the output of the merged motion and body pose cues. For the decision level of integration, we use a naïve Bayes probabilistic model. Our implementation of the proposed framework for contingency detection with the motion, body pose, and audio cues is shown in Figure 9.

6.1 Cue Integration at the Decision Level

At the decision level of integration, we use the naïve Bayes probabilistic model to integrate dissimilarity scores obtained from cues. We chose this model because cues that are integrated at this level are assumed to be conditionally independent of each other given the contingency value; otherwise, they should be integrated at other levels. Assume that some function fX is provided that summarizes the current and past observations of cue X into a dissimilarity score. For two cues Cuei and Cuej that are integrated at the decision level, the overall contingency C in terms of Si = fi(Cuei) and Sj = fj(Cuej) is estimated by standard Bayesian mechanics, as described in Equation 8.


Fig. 9 An implementation of the proposed contingency framework with the motion, body pose, and audio cues.

P(C \mid S_i, S_j) = \frac{P(S_i, S_j \mid C)\, P(C)}{P(S_i, S_j)} = \frac{P(S_i \mid C)\, P(S_j \mid C)\, P(C)}{P(S_i, S_j)}    (6)

P(\bar{C} \mid S_i, S_j) = \frac{P(S_i, S_j \mid \bar{C})\, P(\bar{C})}{P(S_i, S_j)} = \frac{P(S_i \mid \bar{C})\, P(S_j \mid \bar{C})\, P(\bar{C})}{P(S_i, S_j)}    (7)

\frac{P(C \mid S_i, S_j)}{P(\bar{C} \mid S_i, S_j)} = \frac{P(S_i \mid C)}{P(S_i \mid \bar{C})} \cdot \frac{P(S_j \mid C)}{P(S_j \mid \bar{C})} \cdot \frac{P(C)}{P(\bar{C})}    (8)

If this ratio is greater than 1, we declare that the human behavior has changed; otherwise, no change is detected.
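A minimal sketch of this decision-level fusion follows, with the conditional densities P(S|C) and P(S|C̄) supplied as callables (e.g. fitted as in Section 4.2.4); the names are ours.

```python
# Sketch of the posterior ratio of Equation (8) for decision-level fusion.
def decision_level_ratio(scores, likelihood_c, likelihood_not_c, p_c=0.5):
    """scores: dissimilarity scores S_i from the independently evaluated cues.
    likelihood_c[i](s), likelihood_not_c[i](s): per-cue conditional densities.
    Returns P(C | S_1..S_n) / P(not C | S_1..S_n)."""
    ratio = p_c / (1.0 - p_c)                  # prior ratio P(C) / P(not C)
    for i, s in enumerate(scores):
        ratio *= likelihood_c[i](s) / likelihood_not_c[i](s)
    return ratio

# Contingency is declared when the ratio exceeds 1:
# contingent = decision_level_ratio(scores, densities_c, densities_not_c) > 1.0
```

With the paper's uniform prior P(C) = 0.5, the prior factor is 1, so the audio cue's ratio R(Aon) from Equation (4) can simply be multiplied into this product as an additional factor.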

6.2 Integrating Motion and Body Pose Cues at the Module Level

At the module level of integration, the cue data are integrated when calculating a distance matrix. From the accumulated cue data for both the motion and body pose cues, two distance matrices are calculated independently and merged into a new distance matrix. Since the motion and body pose cues are represented in different coordinate spaces, an image space for motion and a 3D world space for body pose, the distance matrices need to be normalized. We denote the distance matrices for the motion and body pose cues as DM_M and DM_D, respectively. The merged distance matrix DM_N is calculated in the following way:

DM_N = \frac{1}{2} \left( \frac{DM_M}{\| DM_M \|_F} + \frac{DM_D}{\| DM_D \|_F} \right),    (9)

where \| DM_X \|_F is the Frobenius norm of a matrix DM_X. After building DM_N, we use this distance matrix to calculate the dissimilarity measure, as explained in Section 4.2.3.
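The module-level merge of Equation (9) reduces to a few lines; the sketch below assumes both matrices are computed over the same V = V_B ∪ V_A, and the function name is ours.

```python
# Sketch of the module-level merge, Equation (9): Frobenius-normalize the
# per-cue distance matrices and average them before computing DS(V).
import numpy as np

def merge_distance_matrices(dm_motion, dm_pose):
    dm_m = dm_motion / np.linalg.norm(dm_motion, ord="fro")
    dm_d = dm_pose / np.linalg.norm(dm_pose, ord="fro")
    return 0.5 * (dm_m + dm_d)   # DM_N, fed to the Section 4.2.3 dissimilarity
```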

6.3 Audio Cue Integration

Because of the low dimensionality and fast sampling rate of the audio cue, the audio cue is naturally integrated with other cues at the decision level. We model the specific event of sound being on for some amount of time. In contrast to the motion and body pose cues, the probability distributions of this event for the naïve Bayes model are learned off-line from the training data.

7 Data Collection

Two important scenarios in HRI where contingency detection is applicable are engagement detection and turn-taking. In previous work, we demonstrated our contingency detection module with single cues in an engagement detection scenario using Simon, our upper-torso humanoid robot [3]. In that experiment, the robot attempted to get a human subject's attention while the human was performing a background task, such as playing with toys or talking on a cell phone.

In this paper, we validate multiple-cue contingency detection within a turn-taking scenario in which the robot plays an imitation game with a human partner. Importantly, this imitation game was designed not for evaluation of a contingency detector but for generating natural interactions with human-like timings. To collect naturalistic data, the robot was teleoperated over the whole interaction by one of the authors, with randomly generated timing variations. In this section, we describe how we conducted data collection for this turn-taking scenario.

This imitation game was based on the traditional children's game "Simon says," and is described extensively in [4]. The interaction setup is shown in Figure 1. In this game, one participant plays the leader and the other plays the follower. The leader is referred to as "Simon." There are two phases in the interaction: a game phase and a negotiation phase.


Fig. 10 An example of an interaction between a human and a robot.

(a) “Wave” (b) “Bow” (c) “Shrug”

(d) “Fly like a bird” (e) “Play air guitar”

Fig. 11 Actions in the “Simon says” game.

An example of an interaction is shown in Figure 10.

During the game phase, the leader says sentences of the form "Simon says, [perform an action]" or simply "[Perform an action]." The five motions understandable to the robot were waving, bowing, shrugging, flying like a bird, and playing air guitar. These motions are shown in Figure 11. The follower must imitate the leader's action when the leader starts with "Simon says," but when the leader does not, the follower must refrain from performing the action. If the follower mistakenly performs the action when the leader does not start with "Simon says," the follower loses the game. The leader concludes a game phase segment after observing an incorrect response by declaring, "You lose!" or "I win!"

In the negotiation phase, the leader and the follower negotiate about switching roles. The follower can ask, "Can I play Simon?" or state, "I want to play Simon." The leader can acquiesce to or reject the follower's request. The leader also has the option of asking the follower, "Do you want to play Simon?" or saying to him, "You can play Simon now." Similarly, the follower can acquiesce to or reject the leader's request. The leader and follower can exchange roles at any time during the interaction.

We collected multimodal data from 11 human subjects. There were about 4 minutes of data per subject. The sensors recorded were one of the robot's eye cameras, an external camera mounted on a tripod, a structured-light depth sensor ("Kinect") mounted on a tripod, and a microphone worn around the participant's neck. We also logged the robot's signals with timestamps, so that we could link the time of the robot's signal with the sensor data. The robot's eye camera and the external camera were intended to extract a human subject's face information, which we did not use in this paper. We used only depth and camera data from the Kinect sensor and the microphone sensor data.

8 Model Building and Evaluation

To build and evaluate a computational model of contingency detection for the "Simon says" interaction game, we use a supervised learning approach. We employ a leave-one-subject-out cross-validation procedure; a single subject is iteratively left out of the training data and used as testing data.

First we split the recorded data sessions into shorter video segments, each of which starts at the tRef of one robot signal and ends at the tRef of the following one. We examine two different points of time as a referent event, tRef. One referent point is tS, the time at which the robot signal starts, and the other is the point of minimum necessary information (MNI), the time at which the robot has finished conveying enough information for a human to give a semantically appropriate response.

The time location of the MNI was coded from video by two authors. In the game phase, the human needs to know whether or not to respond and also which motion to respond with if she needs to respond, so the MNI is the point at which both pieces of information have been delivered. In the negotiation phase, since the main channel for delivering information is speech, the information is sent by a pronoun. Some examples of coding robot MNI in the game phase and the negotiation phase are shown in Figure 12 and Figure 13, respectively. The coder agreement was 99.8% for robot MNI events.


(a) All informative speech occurs before the animation starts.

(b) The action is conveyed through motion before the human knows whether or not to execute it.

Fig. 12 Examples of coding robot MNI in the game phase.

(a) Pronouns demarcate information for turn exchange.

(b) The emotive phrase announces the end of a set.

(c) Examples of acknowledgments.

Fig. 13 Examples of coding robot MNI in the negotiation phase.

Depending on the presence or absence of a human's response, video segments were partitioned into two sets, contingent and non-contingent. In both phases, if the human makes noticeable body movements or vocal utterances as a response to a robot's signal, then the corresponding video segments are classified as contingent. Therefore, the contingent set included responses that were not covered by the game semantics, which occurred when the robot did not say "Simon says." All video segments were coded independently by two of the authors, and for each video segment that was agreed upon, the coded time was averaged. The coder agreement was 100% for classification. (The average difference in coded time was 123 milliseconds.) We collected 246 video segments: 101 (97 contingent) cases from negotiation phases and 163 (120 contingent) cases from game phases.

8.1 Model Training

To train a computational model for contingency detection, we set or learn values for the system parameters: the time windows (WB, WA, and MRD) and the conditional probability distributions for the cues.

For the visual cues, we set the time windows WB and WA to 4 seconds and 2 seconds, respectively. We chose these values because, in the data, human responses usually last less than two seconds, and from our empirical observations, WB should be at least two times longer than WA to make a reliable dissimilarity evaluation. We learned the model of P(S|C) from dissimilarity scores, which were obtained by processing contingent video segments from the game phase interactions in the training data set. Instead of manually collecting dissimilarity scores at which a response occurred, we extract one second of dissimilarity scores around the maximum.

A probabilistic model of P(S|C̄), the null hypothesis, is not learned from the training set, since the null hypothesis should represent the amount of behavioral change that occurs naturally during interaction and is not indicative of a contingent response. Therefore, it is learned on the fly by evaluating a human's normal behavior before the referent event. In building the null hypothesis, our proposed method requires both WB + WA and an extra α, used to generate dissimilarity samples. We set α to 2 seconds. Overall, the method requires 8 seconds of data.

For the audio cue, modeled as an event, we set the time window WA to 200 ms. This is based on our observation of how long the vocal responses last in our data set. We learn the conditional models P(Aon|C) and P(Aon|C̄) and a prior model P(C). These models are learned from both the negotiation phase and game phase interactions in the training data set. P(Aon|C) is simply calculated as the ratio of the total amount of time during which audio events are found to be on to the total amount of time of contingent interactions. We learned P(Aon|C̄) in a similar manner using only non-contingent interactions. We set the prior P(C) to 0.5 to eliminate classification bias towards contingency due to our unbalanced training data.
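The duration-ratio estimate described above reduces to simple bookkeeping; the sketch below assumes hypothetical per-segment records of onset and total durations in seconds.

```python
# Sketch of the training-time estimate of P(A_on | C) as a ratio of durations.
def estimate_p_on(segments):
    """segments: list of (onset_duration, total_duration) pairs for one class
    (contingent or non-contingent interactions)."""
    onset = sum(d for d, _ in segments)
    total = sum(t for _, t in segments)
    return onset / total if total > 0 else 0.0

# p_on_given_c     = estimate_p_on(contingent_segments)
# p_on_given_not_c = estimate_p_on(non_contingent_segments)
```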

Depending on the choice of the referent event tRef, we use different evaluation windows MRD. When tS is used as the referent event, the evaluation terminates after the following interaction begins or after it lasts for 8 seconds, whichever comes first. When the MNI is used, MRD is set to 4 seconds, during which most of the responses happened in our "Simon says" game data.

8.2 Experimental Setup

Our experiments test the effects of cue integration and referent event selection on the overall performance of contingency detection in our experimental scenario.


Fig. 14 Effects of referent event selection on classification accuracy for each cue combination (TS vs. MNI referent). As plotted: A: 0.94 (TS), 0.96 (MNI); M: 0.68, 0.78; B: 0.72, 0.80; M&B: 0.77, 0.86; M/B: 0.71, 0.81.

We test nine different cue combinations (see footnote 1) and two types of referent events. One referent event is the time when the robot starts sending a signal, tS. The other is the point of MNI in the robot signal.

To test the effects of referent event selection, we build two different classifiers: TS (with tS as the referent event) and MNI (with the MNI as the referent event).

9 Results

9.1 Effects of Referent Event

As shown in Figure 14, the detectors with the MNI as their referent event found visual responses in game phases and vocal responses in negotiation phases better than those with tS. This reflects the fact that the closer the data used for building the null hypothesis is to the referent event, the better the classification accuracy. When we extract data to model a human's background behavior, we want this data to be as recent as possible and not to include any contingent responses. We find that the point in time when the MNI happens meets these criteria for a good referent event.

The speed of exchanges in turn-taking interactions may not always permit the delay needed for building the null hypothesis. This problem occurred frequently in our collected data because the teleoperator focused on generating natural interactions with human-like timings rather than generating data for a contingency detector.

Footnote 1: M (Motion only), B (Body Pose only), A (Audio only), M&B (Motion and Body Pose merged at the module level), M/B (Motion and Body Pose merged at the decision level), M/A (Motion and Audio at the decision level), B/A (Body Pose and Audio at the decision level), M/B/A (Motion, Body Pose, and Audio all at the decision level), M&B/A (Motion and Body Pose at the module level and Audio at the decision level).

Thus, many interaction segments do not contain enough data to model the human's baseline behavior. Using the MNI as a referent event provides the detector with a much tighter evaluation window than using the start time of the generated signal. We measured what percentage of the data used for building the null hypothesis comes from previous interactions. From our coded interactions, we observed that 67% of the data (5.37 seconds) comes from previous interactions for the TS-based detectors, and 51% of the data (4.15 seconds) for the MNI-based detectors.

In cases where the robot knows the MNI (i.e. from the gesture or speech that is being delivered), it is the best point in time to use as the referent. By using the MNI, the robot knows when to stop the evaluation. But in cases where this is not known, the beginning of the robot signal can serve as the fallback referent. The MNI is critical to making our contingency detectors operate in natural interactions in which the robot interacts with humans with human-like timings. This is necessary for real-time detection in realistic and natural interaction scenarios.

9.2 Cue Integration

We run experiments to test how indicative single or combined cues are for detecting the presence of responses. For clarity, we explain our observations only with the MNI-based classifiers. It should be noted that the same observations hold for the TS-based classifiers.

During negotiation phases, as shown in Figure 15(a), classifiers using audio cues detect speech-based responses with an accuracy of 0.96. The error of 0.04 consists entirely of false negatives and comes from interactions in which two robot actions occur in close sequence without delay. Thus, human subjects consider the two actions as one and wait to respond until the second action ends. Classifiers that do not use the audio cue perform below an average accuracy of 0.50. When the visual cues are merged together with the audio cue, the overall accuracy is slightly increased. This increase in accuracy results from more interaction data being classified as contingent; in some negotiation interaction cases, humans made visible responses by waving back to the robot without making vocal responses.

On the other hand, during game phases, the visual cues, such as the motion, body pose, and merged cues, are more indicative of responses than the audio cue, as shown in Figure 15(b). Due to the nature of the game phase interactions, where the robot asks human subjects to imitate its motion commands, the most frequently detectable form of response is a visual change.


Fig. 15 Accuracy of the MNI-based classifiers (MNI as the referent event). X&Y indicates cue X and cue Y combined at the module level; X/Y indicates integration at the decision level. As plotted:
(a) Negotiation phase: M 0.45, B 0.53, A 0.96, M&B 0.49, M/B 0.49, M/A 0.97, B/A 0.98, M/B/A 0.96, M&B/A 0.98.
(b) Game phase: M 0.78, B 0.80, A 0.42, M&B 0.86, M/B 0.81, M/A 0.79, B/A 0.81, M/B/A 0.83, M&B/A 0.88.
(c) Both phases: M 0.65, B 0.70, A 0.62, M&B 0.73, M/B 0.70, M/A 0.86, B/A 0.88, M/B/A 0.87, M&B/A 0.91.

We test four different combinations for the visual cues: 1) motion only, 2) body pose only, 3) motion and body pose merged at the module level, and 4) motion and body pose merged at the decision level. The best accuracy is obtained when the motion and body pose cues are integrated at the module level.

As shown, the accuracy of the classifier built with visual cues merged at the decision level (81%) is only 1% greater than that achieved by the best single visual cue, body pose. However, when integrated at the module level, the combination of pose and motion is 5% higher than pose alone. We argue that because the visual cues model different but related aspects of the same perceptual signal, the cues are strongly related and thus not conditionally independent. By combining them at the module level, this dependence is explicitly modeled, and thus the merged cues generate more discriminative features. This is supported by our result that the classifiers built with a single visual cue or with visual cues integrated at the decision level produce more false negative cases than the classifiers built with visual cues merged at the module level.

As shown in Figure 15(b), when the audio cue is merged, the overall accuracy is slightly increased. We observed that when the audio cue is used together with the visual cues, more interaction data are classified as contingent; sometimes, human subjects make some sound as a response during interactions, regardless of the robot's commands. For the MNI data, there are 33 cases where human subjects make some sound while they are responding by imitating the robot, and 10 cases where human subjects make sound when they are not supposed to imitate. These 10 cases are detected only with the audio cue.

When considering both negotiation and game phases, classifiers using either the visual cues or the audio cue alone perform well only for one phase, but not for the other. As shown in Figure 15(c), the average classification accuracies when using the M, B, A, M&B, and M/B cues are no higher than 0.73. Overall, the best classification result, an accuracy of 0.91, is obtained when the motion and body pose cues are integrated at the module level and the audio cue is merged with the integrated visual cues at the decision level.

10 Discussion

10.1 Three-Level Cue Integration

We integrated three different cues to improve contingency detection. There are many factors to consider when determining the appropriate level for cue integration. They include whether to model a response as an event or as a change, and the characteristics of the cues (sampling rate, dimensionality, the correlation between cue observations, and so on). The appropriate integration level for a cue modeled as an event is the decision level, because the presence of the event can be detected independently of the output of the change-based detectors.

The best integration level for cues used in change-based detectors, however, depends on the correlation between cue observations. With our data, we obtain the best classifier when the visual cues are integrated at the module level and the audio cue is integrated at the decision level. Since the audio cue differs from the motion and body pose cues in sampling rate and dimensionality, it is integrated at the decision level. It is modeled as an event because a direct functional mapping from cue to response exists.

Visual cues are modeled as changes because no such direct mapping exists. At the decision level, the decision from each cue is merged with the others under an assumption of conditional independence. At the module level, by contrast, the correlation between cue observations can be modeled explicitly when making a decision. The visual cues model different aspects of the same perceptual region, so there is strong correlation between them, and one would therefore predict an advantage for module-level integration. Our experimental data support this conclusion.
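The assumption behind this argument can be written out explicitly. Using notation introduced here only for illustration (a response hypothesis R and per-cue observations o_M, o_B, o_A), decision-level integration factorizes the likelihood of the observations, whereas module-level integration retains the joint term over the correlated visual cues:

```latex
% Decision level: cue observations assumed conditionally independent given R.
P(o_M, o_B, o_A \mid R) \;\approx\; P(o_M \mid R)\, P(o_B \mid R)\, P(o_A \mid R)

% Module level: the joint term over the correlated visual cues is kept,
% so their dependence can be modeled directly (chain rule, no approximation).
P(o_M, o_B \mid R) \;=\; P(o_M \mid R)\, P(o_B \mid o_M, R)
```

When the factor P(o_B | o_M, R) differs substantially from P(o_B | R), as it does for motion and body pose, the independence approximation discards information, which is consistent with the accuracy gap reported above.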

10.2 Alternative to MNI as a Referent Event

We examined the effect of using the point of MNI as the referent event on the accuracy of contingency detection. The MNI point is the time at which the robot has finished conveying enough information for a human to give a semantically appropriate response. To detect a response, our contingency detector evaluates 4 seconds of data after the point of MNI. If the MNI point is not available, a default choice for the referent event would be tS, the time at which the robot starts a signal. The evaluation window with this referent event is uncertain and larger, because the location of the MNI point relative to tS varies: the MNI point could fall anywhere between tS and the time at which the robot finishes its signal.
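As a concrete illustration of how the choice of referent event changes the evaluation window, the sketch below uses hypothetical timestamps and the 4-second window from above; it is not the implementation used in our system.

```python
WINDOW_AFTER_MNI = 4.0  # seconds of data evaluated after the referent event

def evaluation_window(t_signal_start, t_signal_end, t_mni=None):
    """Return the (start, end) interval searched for a response.

    If the MNI point is known, the window is tight: 4 s after the MNI.
    Otherwise the referent event falls back to t_S, and the window must
    cover every time the MNI could have occurred, i.e. up to the end of
    the signal plus the 4 s response window.
    """
    if t_mni is not None:
        return t_mni, t_mni + WINDOW_AFTER_MNI
    return t_signal_start, t_signal_end + WINDOW_AFTER_MNI

# A 3-second robot signal whose MNI occurs 1.8 s after the signal starts.
print(evaluation_window(10.0, 13.0, t_mni=11.8))  # (11.8, 15.8): 4 s window
print(evaluation_window(10.0, 13.0))              # (10.0, 17.0): 7 s window
```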

We think that a more informative referent event and evaluation window could be learned through interaction using contingency itself. A robot initially uses tS as the referent event and a long timing window for evaluation. Every time a response is detected, the MNI is inversely estimated and associated with the particular type of signal made by the robot, whether speech-only, motion-only, or both. After obtaining enough associations, the robot adjusts its timing model and estimates the MNI without modeling the information transfer from the start.
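One possible realization of this bootstrapping is sketched below. It assumes that each detected response can be attributed to a single robot signal and that a fixed nominal reaction delay is subtracted when inverting the estimate; the names and constants are introduced here and do not come from our implementation.

```python
from collections import defaultdict
from statistics import mean

EXPECTED_RESPONSE_DELAY = 1.0    # assumed human reaction time (seconds)
mni_offsets = defaultdict(list)  # estimated MNI offsets, keyed by signal type

def record_response(signal_type, t_signal_start, t_response_detected):
    """Inversely estimate the MNI from a detected response and store its
    offset relative to the signal start t_S."""
    est_mni = t_response_detected - EXPECTED_RESPONSE_DELAY
    mni_offsets[signal_type].append(est_mni - t_signal_start)

def estimated_mni(signal_type, t_signal_start, min_samples=5):
    """Once enough associations exist for this signal type, predict the MNI
    directly; otherwise fall back to t_S and a long evaluation window."""
    samples = mni_offsets[signal_type]
    if len(samples) < min_samples:
        return None
    return t_signal_start + mean(samples)

# A few "motion-only" interactions, followed by a prediction.
for t_s, t_r in [(0.0, 2.9), (10.0, 13.1), (20.0, 23.0), (30.0, 32.8), (40.0, 43.2)]:
    record_response("motion-only", t_s, t_r)
print(estimated_mni("motion-only", 50.0))   # -> 52.0 (an offset of ~2 s is learned)
```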

10.3 The Role of Contingency Detection in Interaction

We introduced a computational model for semantics-free contingency detection in which an event-based detector using the audio cue looks for the presence of vocal utterances, and a change-based detector using the visual cues looks for the presence of visible changes in human behavior. Because these cues model a generic property of a human's natural response, our learned model should be transferable to a new task.

We have applied contingency detection in different interaction scenarios. In [3], we used contingency detection as a direct measurement of engagement. To draw a human's attention, the robot generates a probe signal toward the human of interest and looks for a behavioral change. The presence of a contingent response is a good indication that he or she is willing to have a follow-up interaction with the robot. We make a similar argument for detecting disengagement: the absence of contingent responses is a good indication that the human has finished interacting with the robot.

In reciprocal turn-taking situations, such as the interactions in the "Simon says" game, contingency can also serve as a signal. When taking turns with a human, a robot needs to know when to relinquish the floor. If the robot detects a contingent response from the human, this may indicate that the human is ready to take a turn or has already taken it, and the robot can then decide to yield the floor.
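A sketch of such a floor-yielding policy is given below; the robot and detector objects are hypothetical stand-ins for the actual system components, introduced here only to make the control flow explicit.

```python
import time

class FakeRobot:
    """Minimal stand-in for a robot signaling interface (illustrative only)."""
    def __init__(self, signal_duration=3.0):
        self.signal_duration = signal_duration
        self._end = None
    def start_signal(self, signal):
        print(f"robot: starting signal '{signal}'")
        self._end = time.time() + self.signal_duration
    def is_signaling(self):
        return self._end is not None and time.time() < self._end
    def interrupt_signal(self):
        print("robot: interrupting signal")
        self._end = None
    def yield_floor(self):
        print("robot: yielding the floor")

class FakeDetector:
    """Stand-in contingency detector that fires after a fixed delay."""
    def __init__(self, fires_after=1.0):
        self._fire_time = time.time() + fires_after
    def contingent_response_detected(self):
        return time.time() >= self._fire_time

def deliver_signal_yielding_floor(robot, detector, signal, poll_rate=0.1):
    """Play a signal while polling the contingency detector; interrupt the
    signal and yield the floor early if a contingent response is detected."""
    robot.start_signal(signal)
    while robot.is_signaling():
        if detector.contingent_response_detected():
            robot.interrupt_signal()   # the human is ready to take a turn
            robot.yield_floor()
            return True                # floor yielded early
        time.sleep(poll_rate)
    robot.yield_floor()                # signal finished without interruption
    return False

if __name__ == "__main__":
    deliver_signal_yielding_floor(FakeRobot(), FakeDetector(), "Simon says wave")
```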

10.4 Limitation of Semantics-free Contingency Detection

The main limitation of our semantics-free contingency module is that detected responses are not grounded. For the robot to act appropriately on such responses, it may need additional information, such as the semantics of a response and its validity with respect to the robot's action (signal) and the context in which the robot is situated.

To overcome this limitation, we believe a robot should model and recognize a set of grounded responses built from knowledge about the nature of the interaction situation, and should also be able to ground the responses found by semantics-free contingency detection. Within the current framework, we can model and recognize grounded responses as events. As future work, we will investigate how to attribute semantics to ungrounded responses through iterative interactions.

11 Conclusion

In this paper, we proposed a contingency detection framework that integrates data from multiple cues. We collected multimodal sensor data from a human-robot interaction scenario based on the turn-taking imitation game "Simon says." We implemented our multiple-cue approach and evaluated it using motion and body pose cues from a structured-light depth sensor and audio cues from a microphone. The results show that integrating multiple cues at the appropriate levels offers an improvement over individual cues for contingency detection. We also showed that using the MNI as the referent event provides contingency detectors with a more reliable evaluation window. From our constrained experiment, we made the important observation that human response timing and the effective integration of cues are key factors in detecting contingency. We believe this observation will help in understanding the factors at play in more complex and ambiguous interactions, and that our contingency detection module improves a social robot's ability to engage in multimodal interactions with humans when the semantics of the human's behavior are not known to the robot.

References

1. K. Lohan, A. Vollmer, J. Fritsch, K. Rohlfing, and B. Wrede, "Which ostensive stimuli can be used for a robot to detect and maintain tutoring situations?" in ACII Workshop, 2009.
2. K. Pitsch, H. Kuzuoka, Y. Suzuki, L. Sussenbach, P. Luff, and C. Heath, "The first five seconds: Contingent stepwise entry into an interaction as a means to secure sustained engagement," in IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2009.
3. J. Lee, J. Kiser, A. Bobick, and A. Thomaz, "Vision-based contingency detection," in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2011.
4. C. Chao, J. Lee, M. Begum, and A. Thomaz, "Simon plays Simon says: The timing of turn-taking in an imitation game," in IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2011.
5. H. Sumioka, Y. Yoshikawa, and M. Asada, "Reproducing interaction contingency toward open-ended development of social actions: Case study on joint attention," in IEEE Transactions on Autonomous Mental Development, 2010.
6. J. Triesch, C. Teuscher, G. Deak, and E. Carlson, "Gaze following: why (not) learn it?" in Developmental Science, 2006.
7. N. Butko and J. Movellan, "Infomax control of eye movements," in IEEE Transactions on Autonomous Mental Development, 2010.
8. G. Csibra and G. Gergely, "Social learning and social cognition: The case for pedagogy," in Processes of Change in Brain and Cognitive Development. Attention and Performance. Oxford University Press, 2006.
9. K. Gold and B. Scassellati, "Learning acceptable windows of contingency," in Connection Science, 2006.
10. J. Watson, "Smiling, cooing, and 'the game'," in Merrill-Palmer Quarterly, 1972.
11. ——, "The perception of contingency as a determinant of social responsiveness," in Origins of the Infant's Social Responsiveness, 1979.
12. K. Gold and B. Scassellati, "Using probabilistic reasoning over time to self-recognize," in Robotics and Autonomous Systems, 2009.
13. A. Stoytchev, "Self-detection in robots: a method based on detecting temporal contingencies," in Robotica. Cambridge University Press, 2011.
14. B. Mutlu, T. Shiwa, H. Ishiguro, and N. Hagita, "Footing in human-robot conversations: how robots might shape participant roles using gaze cues," in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2009.
15. C. Rich, B. Ponsler, A. Holroyd, and C. Sidner, "Recognizing engagement in human-robot interaction," in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2010.
16. M. Michalowski, S. Sabanovic, and R. Simmons, "A spatial model of engagement for a social robot," in International Workshop on Advanced Motion Control (AMC), 2006.
17. S. Muller, S. Hellbach, E. Schaffernicht, A. Ober, A. Scheidig, and H. Gross, "Whom to talk to? Estimating user interest from movement trajectories," in IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2008.
18. N. Butko and J. Movellan, "Detecting contingencies: An infomax approach," in IEEE Transactions on Neural Networks, 2010.
19. D. Hall and J. Llinas, "An introduction to multisensor data fusion," in Proceedings of the IEEE, 1997.
20. R. Poppe, "A survey on vision-based human action recognition," in Image and Vision Computing, 2010.
21. The OpenNI API, http://www.openni.org.
22. M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof, "Anisotropic Huber-L1 optical flow," in Proceedings of the British Machine Vision Conference (BMVC), 2009.
23. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2000.

Jinhan Lee obtained his M.S. in Computer Science from the Georgia Institute of Technology in 2007. He is currently pursuing his Ph.D. degree in Robotics at the Georgia Institute of Technology. His research interests include perception for human-robot interaction, cue learning in multimodal interactions, and robot learning from demonstration.

Crystal Chao received her B.S. in Computer Science from the Massachusetts Institute of Technology in 2008. She is currently a Ph.D. student in Robotics at the Georgia Institute of Technology. Her research interests include interactive learning, human-robot dialogue, and turn-taking in multimodal interactions.

Aaron F. Bobick is Professor and Chair of the School of Interactive Computing in the College of Computing at the Georgia Institute of Technology. He has B.Sc. degrees from MIT in Mathematics (1981) and Computer Science (1981) and a Ph.D. from MIT in Cognitive Science (1987). He joined the MIT Media Laboratory faculty in 1992, where he led the Media Lab DARPA VSAM project. In 1999 Prof. Bobick joined the faculty at Georgia Tech, where he became the Director of the GVU Center and, in 2005, the founding chair of the School of Interactive Computing. He is a pioneer in the area of action recognition by computer vision, having authored over 80 papers in this area alone. His current focus is on robots understanding the affordances of objects and the action-based intentions of human beings.

Andrea L. Thomaz is an Assistant Professor of Interactive Computing at the Georgia Institute of Technology. She directs the Socially Intelligent Machines lab, which is affiliated with the Robotics and Intelligent Machines (RIM) Center and with the Graphics, Visualization and Usability (GVU) Center. She earned Sc.M. (2002) and Ph.D. (2006) degrees from MIT. She has published in the areas of Artificial Intelligence, Robotics, Human-Robot Interaction, and Human-Computer Interaction.