
IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 3, NO. 1, MARCH 2011

Using the Rhythm of Nonverbal Human–Robot Interaction as a Signal for Learning

Pierre Andry, Arnaud Blanchard, and Philippe Gaussier

Abstract—Human–robot interaction is a key issue for building robots for everyone. The difficulty people have in understanding how robots work and how they must be controlled will be one of the main limits to the broad adoption of robotics. In this paper, we study a new way of interacting with robots that requires neither understanding how robots work nor giving them explicit instructions. This work is based on psychological data showing that synchronization and rhythm are very important features of pleasant interaction. We propose a biologically inspired architecture using rhythm detection to build an internal reward for learning. After showing the results of keyboard interactions, we present and discuss the results of real human–robot (Aibo and Nao) interactions. We show that our minimalist control architecture allows the discovery and learning of arbitrary sensorimotor association games with expert users. With nonexpert users, we show that using only the rhythm information is not sufficient for learning all the associations, due to the different strategies used by the humans. Nevertheless, this last experiment shows that the rhythm still allows the discovery of subsets of associations, making it one of the promising signals for tomorrow's social applications.

Index Terms—Artificial neural networks, autonomous robotics, human–robot interaction, rhythm detection and prediction, self-supervised learning.

I. INTRODUCTION

UNDERSTANDING nonverbal communication is crucial for building truly adaptive and interactive robots.

Young infants, even at preverbal stages, take turns, naturally switch roles, and maintain bidirectional interaction without the use of explicit codes or declarative “procedures” [1]. Among the implicit but important signals of interaction, synchrony and rhythm appear to be fundamental mechanisms of early communication among humans [2], [3].

Our long-term goal is to understand the emergence of turn taking or role switching between a human and a robot (or even between two robots). Understanding what the good dynamical properties of turn taking are would be an important step toward obtaining, for free, robust and long-lasting interaction (without explicit programming or explicit signals).

Manuscript received February 15, 2010; revised August 13, 2010 and October 26, 2010; accepted November 04, 2010. Date of publication December 10, 2010; date of current version March 16, 2011. This work was supported by the French Region Ile de France, the Institut Universitaire de France (IUF), the FEELIX GROWING European Project FP6 IST-045169, and the French ANR Interact Project.

The authors are with the ETIS, University Cergy-Pontoise, Cergy-Pontoise, F-95000, France (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.org.

Digital Object Identifier 10.1109/TAMD.2010.2097260

It could lead to the learning of long and complex sequences of behaviors. Moreover, this work is linked to the fundamental question of how a “natural” interaction can emerge from a few basic principles [4]. Taking inspiration from numerous studies in developmental psychology on how newborns and young preverbal infants react in interpersonal relations (see the next section for a review), this work focuses on the role of rhythm in human–robot face-to-face interaction. Our main goal in this paper is to show that an implicit signal such as the rhythm of the interaction can be used to build internal rewards and enhance the learning of an interacting robot. To do so, we propose a simple methodology illustrating how many good associations can be learned thanks to the conjunction of a rhythm detector with different reinforcement learning rules: is a robot that detects cadence changes able to improve its responses and converge toward a satisfying interaction with a caregiver? We present an artificial neural network (NN) control architecture inspired by properties of the hippocampus and cerebellum. This model allows online learning and the prediction of the rhythm of the sensorimotor flow of the robot. As a result, the robot is able, from its own dynamics, to detect changes in the interaction rhythm, that is to say, changes in the cadence of the gesture production of the human–robot system. The architecture uses the rhythm prediction to compute an internal reward reinforcing the robot's set of sensorimotor associations. Therefore, a stable and rhythmic interaction should indicate that the robot's behavior is correct (generating internal positive rewards, strengthening sensorimotor rules, and allowing the robot to keep responding correctly). On the other hand, an interaction full of breaks (the time for the caregiver to express his disagreement or even to interrupt the interaction) should lead the robot to change its motor responses, due to the internal negative reward, and to converge toward new ones corresponding to the human's expectancies. Rhythm prediction plays the role of a control mechanism allowing the building of an internal reward, which is used to guide the learning of sensorimotor associations.

Finally, we will discuss the importance of the learning rule coupled to the rhythm prediction mechanism and highlight how the properties of the rule can affect the quality of the interaction and the learning of the task.

II. INTERDISCIPLINARY MOTIVATION

In the last decade, an important effort has been made to make robots more attractive to human beings. For instance, a major focus of interest has been put on the expressiveness and the appearance of robots [5]–[7]. By targeting the facilitation of human–machine communication, these approaches have neglected the importance of the dynamics when two agents interact.


In our opinion, intuitive communication (verbal or not) refers to the ability to “take turns” and to respond coherently to the stimulations of others without needing the presence of an expert. In summary, it means being able to detect the crucial signals of the interaction and to use them to adapt one's dynamics to the other's behavior. Obviously, such an issue questions the sense of “self” and “others,” and the place of such notions in the frame of building an interactive robot. In the present work, we do not want to define an a priori boundary between the robot and the external world. We follow a bottom–up approach, testing how the combination of a few low-level mechanisms can allow the emergence of cooperative behaviors without any notion of self. From an epigenetic perspective, this combination of low-level mechanisms can then appear as an interesting hypothesis in the discussion of the emergence of self and agency in artificial and natural systems. Among these bootstrapping mechanisms, we will emphasize novelty detection (and more precisely rhythm detection) as a way of building an internal reward structuring the learning of any interacting agent.

In developmental robotics, a lot of research focuses on how artificial systems can detect novelty in their surrounding environment in order to learn and generalize new skills, i.e., how robots can autonomously develop and ground their own learning experience [8]. The notion of environment itself plays a crucial role, with numerous studies focusing on the embodiment or on the surrounding physical environment, while few deal with the adaptation to the social environment of the robot. One of the seminal works concerns the role of prediction in the building of the relationships between the environment and the robot. In [9], the authors showed how multiple internal predictors playing the role of experts can help a mobile robot segment the outside world according to the confidence and the scale of the prediction of each expert. The segmentation of the external world into different categories (and therefore the convergence of the experts' predictions) is made online, while the robot is exploring the environment. Another good example of an active exploration and learning of the physical environment is given by the humanoid robots Cog and Obrero. In [10], Cog produces random arm movements to roll and push small objects, and uses the induced optical movement to discriminate the objects from the background (see also [11] for the same experiment using sound with the robot Obrero). This is used to visually segment the objects and to learn some of their properties (roll versus nonroll, etc.). These experiments are examples in which the robot's progress is directly linked to its activity and the changes it induces. Also in the frame of the exploration of the surrounding physical environment, authors have proposed to introduce the notion of intrinsic motivation in artificial systems [12], [13]. They propose that self-motivated experiments of a prediction-based controller allow the robot to discover and split its physical environment into a finite number of states. The different areas of interest are segmented and labeled according to the consequence of the experiments on the learning progress of the robot [14]. This mechanism allows the robot to seek the situations where novelty will be the most “interesting” by providing the maximal learning progress (fully predictable and totally unpredictable situations progressively becoming repellers). Such an “artificial curiosity” is a good example of how the building of an internal reward value can drive the development of the robot in its physical environment.

Finally, in preliminary experiments, Pitti et al. [15] have started to link internal dynamics with the dynamics of a human manipulating the robot. They highlight how synchronization allows sensing and acting to be integrated in an embodied robot, and thereby facilitates the learning of simple sensorimotor patterns. The authors show that the synchronization between input (sensing) and output (action) is a necessary condition for information transmission between the different parts of the architecture and therefore for learning new patterns.

Reviewing these studies, one important question we may ask is: can the same principles of novelty detection and intrinsic motivation be applied to a social environment, for example in the case of an autonomous robot exchanging with a caregiver? Early work started to show that behaviors such as imitation could allow autonomous robots to get some adaptation to the physical environment for free thanks to simple “social” behaviors [16]. Following this idea, we have started to consider low-level imitation (imitation of meaningless gestures) as a proto-communicational behavior obtained as an emergent property of a simple perception–action homeostat based on perception ambiguity [17], [18]. Tending to balance vision and proprioception (direct motor feedback), the controller acts to correct errors induced by an elementary visual system (unexpected movement detection). An imitative behavior emerges without any notion or detection of the experimenter. Based on the same principles, [19] highlights how a system naturally looking for the initial equilibrium of its first perceptions (based on the imprinting paradigm) can alternate between an exploration of the physical environment and a return to the caregiver for the grounding of learning. More recently, research has started to investigate the stability of human–robot face-to-face interactions, using synchrony and antisynchrony as a link between stable internal dynamics and bidirectional phase locking with the caregiver [20], [4], [21], [22]. According to developmental psychology, the ability to synchronize, to detect synchrony, and the sensitivity to timing appear to play an important role in early communication among infants from birth on. Studies have shown that babies are endowed with key sensorimotor skills exploited during interpersonal relations [3]. For example, young infants show a sensitivity to timing, with the ability to imitate, exchange, and preserve the temporality and rhythm of sequences of sounds during bidirectional interaction with their mother [23]. Neonates (between two and 48 hours after birth) are able to anticipate the causality between two events, and to display a negative expression if the second event does not follow the first one [24], [25]. Gergely and Watson have unified these findings in a model of contingency detection (DCM) extending the notion of causality to social events [26]. According to the DCM model, infants are able to discriminate the causality provoked by their own actions (direct feedback inducing a perfect synchrony with their own motor production), for example when seeing and/or feeling the movement of their own body, from the contingencies of social interactions (due to the regular but nonperfect synchrony between the emission of the stimulus by the baby and the response of the caregiver). From this point, contingency detection has turned out to be one of the key mechanisms of the development of self and social cognition [27]. In the same line, Prince et al. are interested in understanding the development of infant skills with a model of contingency detection based on the intermodal perception of audio–visual synchrony [28].

Page 3: Pierre Andry, Arnaud Blanchard, and Philippe Gaussierbctill/papers/ememcog/Andry_etal_2011.pdf · 2011-03-17 · ANDRY et al.: USING THE RHYTHM OF NONVERBAL HUMAN–ROBOT INTERACTION

32 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 3, NO. 1, MARCH 2011

An extended version of this low-level model has also been tested by Rolf et al. in order to bind words or sounds with the vision of objects [29].

In parallel, a lot of studies have aimed at understanding how infants detect changes in the other's responses during face-to-face interactions. Since 1978, the still-face paradigm introduced by Tronick et al. has been widely used and studied (see [30] for a review), especially during preverbal interactions. A still-face consists in the production of a neutral, still face by the caregiver after a few minutes of interaction. Interestingly, this sudden break of the response and of the timing of the interaction induces a fall of the infant's positive responses. The same responses were also measured with the more accurate double-video paradigm, which allows the timing of the interaction to be shifted using a dual display and recording system [2]. In this second paradigm, the content of the caregiver's responses remains the same as in a normal interaction, but a more or less long delay can be introduced in the display of the mother's responses. Using this device during natural, bidirectional mother–baby interactions, Nadel and Prepin [2] highlight the importance of the timing of the response and show how synchrony and rhythm are fundamental parameters of preverbal communication. Breaking the timing results in violating the infant's expectations, and produces a strong increase in negative responses.

Consequently, in the frame of a face-to-face human–robot interaction composed of simple gestures (for example, an imitation game), our working hypotheses will be the following:

• a constant rhythm should naturally emerge if the interaction goes well (that is to say, if the robot's responses correspond to the human's expectancies);

• conversely, if the robot produces the wrong behavior or the wrong responses, we suppose that the human may introduce more breaks into the interaction, for example to take the time to restart the game, to manifest his disagreement (even if the robot is not able to process any signal concerning this dissatisfaction), or simply to withdraw from the interaction.

Therefore, we will present in the next section our model for rhythm learning and prediction, and the internal reward extraction mechanism based on rhythm detection.

III. MODEL

Our model is an artificial NN divided into two main parts (Fig. 1). The first part (the rhythm detection and prediction layer in Fig. 1, up) learns online and very quickly (one-shot learning) the rhythm of the input–output activity and computes a reward value according to the accuracy of the rhythm prediction. The second part (the sensorimotor learning layer in Fig. 1, bottom) is a reinforcement learning mechanism allowing the input–output associations to change according to the reward values updated during the interaction.

A. Rhythm Detection and Prediction

The role of this network is to learn and predict the timing of the changing sensorimotor events. The network is inspired by the functions of two brain structures involved in memory

Fig. 1. Model for rhythm detection and prediction. The interaction dynamics is read through the sum of the input stimulations (input). Input–output associations are explored and learned thanks to a reward based on the difference between the prediction of the next stimulation (PO) and the effective detection of the current stimulation (TD).

and timing learning: the cerebellum and the hippocampus (see [31] for the neurobiological model we have proposed). It processes the sum of the information from the sensorimotor pathway (integrating vision processing and the link between vision and motor output, see Section III-B), stimulated each time the human makes a gesture in front of the robot. Consequently, the robot can have information about the whole dynamics of the interaction by monitoring only the flow of its own sensorimotor activity, without having any notion of the other. A single neuron sums the activities coming from the sensorimotor pathway: connected to the input vector, it produces an activity representing the total intensity of the human stimulations. In all the experiments, the input responses are the result of a competition applied to the visual detection. Therefore, this sum gives the undifferentiated activity of the successive inputs activated.

The core of the timing prediction is composed of three groups of neurons (Fig. 1): the time derivation (TD) group, the time base (TB) group, and the prediction output (PO) group. A single TD neuron fires at the beginning of new actions (it detects deviation). Therefore, TD detects the successive changes of activation of the input group. The TB group decomposes the time elapsed between two TD spikes, with cells responding at different timings and with different time spans. It simulates the activities of cells of different sizes responding to the same TD stimulation (such cells are known to be found in the dentate gyrus of the hippocampus [31])

(1)

where the index identifies the TB cell, each cell having its own time constant and standard deviation, and the reference instant is the time of the TD spike that activates TB.
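One form consistent with the description above (overlapping bell-shaped traces triggered by a TD spike, each cell having its own time constant and standard deviation) would be, for instance,

\[ \mathrm{Act}^{TB}_i(t) = \exp\!\left( - \frac{\left( t - t_0 - m_i \right)^2}{2\,\sigma_i^2} \right), \]

where \(i\), \(m_i\), \(\sigma_i\), and \(t_0\) are hypothetical names for the cell index, its time constant, its standard deviation, and the instant of the activating TD spike.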

One PO cell learns the association between the TB cells' states (the time trace of the previous input stimulation) and the new TD activation (the detection of the current input stimulation).


Fig. 2. Left: Activity of a group of TB cells measuring the time elapsed between two TD spikes. Here TB is composed of 12 cells; each cell has a distinct set of parameters allowing the cells' activities to overlap. TB plays the role of a short-term memory (here of 20 s) of one TD spike. Right: example of the shape of two PO predictions. The solid line is the prediction activity for a timing of 1.7 s. The dotted line is the prediction activity for a timing of 9 s.

The strength of the connection between the PO neuron and the TB battery is modified according to (2): when TD fires, each weight is set to the current activity of the corresponding TB cell, and it is left unmodified otherwise.

The potential of PO is the sum of the information coming from TD and of the delayed activity in TB, weighted by the learned connections, as expressed in (3)–(5).

After one learning step (i.e., at least the presentation of two input stimulations), the TB cells' activation leads to a growing activity of PO up to a maximum corresponding to the stimulation period (Fig. 2). The learning is done online, meaning that repetitive activations of the different input units will lead to repetitive predictions by PO of the timing of the next input. The prediction itself is not linked to a particular input unit, but only to the precise timing at which the next input unit should be observed. Finally, comparing the PO and TD activities gives the success value of the rhythm predictor. This comparison is used to build the reward for the learning of the sensorimotor associations (6): the reward is computed, at the instant of each TD spike, from the difference between the activity of PO (bounded in [0, 1.5]) and the activity of TD (bounded in [0, 1]), weighted by a constant kept fixed in all the experiments.
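Given the quantities just listed, one plausible form for the reward of (6), written here only as an assumption, is

\[ R(t_{TD}) = \mathrm{PO}(t_{TD}) - \lambda\,\mathrm{TD}(t_{TD}), \]

where \(t_{TD}\) and \(\lambda\) are hypothetical names for the instant of the TD spike and for the constant kept fixed in all the experiments; with this sign convention, a well-predicted spike (PO close to its maximum of 1.5) yields a positive reward, while an unpredicted one (PO close to 0) yields a negative reward.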

For example, let us suppose that the input is activated with a constant period. PO learns to predict this timing, with a maximal activity at the end of each period, and the reward remains maximal as long as the frequency of the input activation stays unchanged. If, for some reason, the input frequency changes to a new period, then the bell shape of the PO activity induces a decrease of the reward proportional to the shift between the old and the new periods (see Fig. 3). If the new period is too different

Fig. 3. Building the reinforcement. Up: illustration of the principle. The analog activity of the prediction group PO peaks at the next TD period. By computing the difference between PO and TD when TD fires, we obtain the difference between the prediction and the effective rhythm. If the experimenter maintains a constant rhythm, then PO predicts correctly and the TD–PO comparison is positive. Conversely, the more the experimenter changes the timing of his responses, the more negative the TD–PO comparison becomes. Bottom: experimental record of the TD–PO values. A nominal rhythm is learned, and the TD–PO response is recorded for each test period ranging from 0 to 5000 ms.

from the previously learned one, a negative reward is produced (rhythm-break detection) at the first occurrence of the new TD spike. After the emission of the negative reward, the new period is learned by PO (online learning), becoming the new reference of the interaction rhythm and the new period predicted. Letting PO learn the period online yields a system that adapts to changes of timing (see Fig. 4), without negative rewards if the frequency is slowly changing and sliding during the interaction. Moreover, (1) and (2) allow a one-shot learning of the timing between two stimulations. It means that the rhythm is learned once the human and the robot have exchanged two gestures (approximately 3 to 5 s of interaction, depending on the human's intrinsic period of interaction). In the frame of a human–robot interaction, the rhythm cannot be learned faster. During the interaction, the rhythm prediction is constantly updated according to the flow of the input/output information, at each new gesture. Of course, the rhythm prediction can also be updated or progressively modified to take past values into account (averaging with a sliding window).

Such properties are welcome in the frame of a real human–robot interaction, where the rhythm can slowly vary without negative impact until a strong break arises. When a negative reward is emitted (case of a strong rhythm break), PO is temporarily inhibited until the next TD activation. This modulation of PO prevents the learning of the duration of the rhythm break itself and the emission of two consecutive negative rewards.

One of the main interests of this NN algorithm is that the PO prediction respects the Weber–Fechner law [31]: the precision of the PO activity (the inverse of its standard deviation) is inversely proportional to the period of the timing learned; in other words, the longer the learned period, the wider the prediction curve. This property is particularly appropriate for the detection of various human–robot interaction rhythms, where, for example, a delay of 100 ms must not have the same consequence if it happens in an interaction carried out at 0.5 Hz or at 5 Hz. The global shape of the reinforcement depends on the PO activity, which itself directly depends on the incoming energy given by the TB cells (3).


Fig. 4. Sample of a rhythmic interaction. Up: the user produces constant movements activating motor responses of the robot, first at 0.3 Hz and then at 0.15 Hz. Bottom: PO activity showing the rhythm prediction. The neuron has adapted its prediction of the rhythm online.

The TB parameters play a direct role in the accuracy and the shape of the PO activity. In the above experiments, the number of TB cells was set between 10 and 30 according to the precision of PO needed (this number does not affect the global shape of the PO activity, only the resolution of the curve). The time constant of each cell defines the center of its curve on the time course elapsed since the activation of the battery. It is important to maintain an overlap of the activities of the TB cells, otherwise a strict separation would result in noisy and erroneous predictions (in this case, the maximum of PO would not correspond to the learned timing). Practically, the time constants shift the centers of the successive curves along the time course, and the standard deviation of each cell is linked to its time constant so as to ensure the overlap of multiple cells and the production of a "bell"-shaped PO curve. This ensures that the first cells of TB produce thin activity curves, while the last cells produce larger ones. This variation of the thickness of the TB curves induces the variation of thickness of the PO curve that is at the origin of the Weber–Fechner property (see Fig. 2).
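To make the mechanism of Section III-A concrete, the following is a minimal numerical sketch, assuming Gaussian TB traces whose width grows with their delay (the Weber–Fechner property discussed above) and a PO readout implemented as a normalized match between the stored trace and the current one, rescaled to the [0, 1.5] range mentioned in the text. The class name, the parameter values, and these functional forms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class RhythmPredictor:
    """Sketch of the TB/PO timing predictor and of the rhythm-based reward."""

    def __init__(self, n_cells=12, max_delay=20.0, lam=1.0):
        # Centers of the TB cells spread over the memory span (assumed values).
        self.centers = np.linspace(0.5, max_delay, n_cells)
        # Widths grow with the delay: longer intervals are predicted with a
        # wider bell, which gives the Weber-Fechner-like tolerance.
        self.sigmas = 0.15 * self.centers
        self.weights = np.zeros(n_cells)   # one-shot PO weights (last TB trace)
        self.last_spike = None             # time of the previous TD spike
        self.lam = lam                     # weighting of the TD term in the reward

    def tb_activity(self, t):
        """Bell-shaped traces of the TB cells since the last TD spike."""
        if self.last_spike is None:
            return np.zeros_like(self.centers)
        dt = t - self.last_spike
        return np.exp(-((dt - self.centers) ** 2) / (2.0 * self.sigmas ** 2))

    def po_activity(self, t):
        """Prediction of the next spike, rescaled to [0, 1.5] as in the paper."""
        tb = self.tb_activity(t)
        denom = np.linalg.norm(self.weights) * np.linalg.norm(tb)
        if denom == 0.0:
            return 0.0
        return 1.5 * float(self.weights @ tb) / denom

    def on_td_spike(self, t):
        """Called when a new gesture is detected (TD fires); returns the reward."""
        reward = 0.0
        if self.last_spike is not None:
            if self.weights.any():
                # Compare the prediction with the detection at the spike instant.
                reward = self.po_activity(t) - self.lam
            # One-shot (re)learning of the current interval: store the TB trace.
            # (The full model additionally inhibits PO right after a strong break.)
            self.weights = self.tb_activity(t)
        self.last_spike = t
        return reward
```

With gestures arriving every 3 s, the third spike already yields a positive reward (about +0.5 with these assumed values), whereas a gesture delayed to 7 s after the last one produces a clearly negative reward before the new interval is adopted as the reference.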

B. Sensorimotor Learning

The input–output link (see Fig. 1, bottom) is the sensorimotor pathway of the architecture. The input processes visual information from the CCD camera. The visual space of the robot is split into different areas stimulated by the gestures of the human. For the sake of simplicity, the actions of the human are filtered by a color detector, and the human is asked to hold a pink ball (see Fig. 5).

The output triggers the robot's motor response.

Fig. 5. Elementary visual system. Left: image captured by the robot. Middle: detection of the red component. Right: detection of the pink color before the competition that forms the input activity.

The activation of one output neuron induces a movement of the robot's effector toward a region of the working space. The number of input and output neurons is the same, and the division of the working space is the same on each layer. For example, the first input neuron is activated when the upper-right area of the visual space of the robot is stimulated, and the triggering of the first output neuron will induce the robot to raise its right arm upward.

In the frame of our experiments, the input and output groups are competitive layers, forcing the activation of a single input–output association at a time. Both groups also have activity thresholds, allowing the robot to stay still (no motor command) if no relevant input is provided (no significant visual activity). From these properties, we obtain an elementary but reliable basis for a nonverbal human–robot interaction. Each time the human caregiver makes a move (stimulating the input neurons), the robot triggers a motor response. The input–output links are initially random, and in the frame of our experiments, the robot has to discover and learn the right input–output associations corresponding to the caregiver's expectancies. Therefore, finding the good responses corresponds to discovering the right sensorimotor associations among the possible ones. To do so, a learning rule using the reward computed by the rhythm detection and prediction layer (Section III-A) is introduced.
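As an illustration of the input stage just described, the sketch below splits a color-filtered image into a grid of visual regions and applies a thresholded winner-take-all; the grid size, the threshold, and the function name are assumptions chosen to match the four-region setups used later in the paper.

```python
import numpy as np

def input_activity(pink_mask: np.ndarray, grid=(2, 2), threshold=50):
    """Thresholded winner-take-all over a grid of visual regions (sketch).

    pink_mask is a binary image from the color detector (1 where pink is seen).
    Returns a one-hot vector over the regions, or all zeros when no region
    receives enough stimulation (the robot then stays still).
    """
    rows, cols = grid
    h, w = pink_mask.shape
    counts = np.array([
        pink_mask[r * h // rows:(r + 1) * h // rows,
                  c * w // cols:(c + 1) * w // cols].sum()
        for r in range(rows) for c in range(cols)
    ])
    activity = np.zeros(rows * cols)
    if counts.max() >= threshold:          # activity threshold: ignore weak input
        activity[counts.argmax()] = 1.0    # competition: a single winning region
    return activity
```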

The rule learning the input–output associations is directly inspired by the associative search element (ASE) of Barto, Sutton, and Anderson [32]. ASE uses a “generate and test” search process in order to find the correct outputs. To simplify the notation, single letters stand for the input and output groups in the following equations.

The output is computed as a Heaviside function of the weighted input activity plus an additional amount of random noise (7). At the beginning of the experiment, all the weights have the same value, and the noise value drives the selection of the output neurons; this ensures the initial exploration of the search space before any learning. When an output is tested, the correction of the weight linking the activated input to the tested output is made, according to (8), from the variations of the input and output activities and from the reward: for an input activated at a given step, the network tests the corresponding output at the next step and obtains the reward at the following one. This learning rule results in a modification


Fig. 6. Variation of the weight according to (8). As shown in (7), the output is binary, which explains the limited number of cases: only the weights of the input–output associations that were actually activated are updated.

of the input–output link according to the variations of the input and output activities (Fig. 6).

Therefore, the constant online updates of the rhythm prediction produce a new reward value at each interaction step (that is to say, at each new human–robot exchange of gestures) and change the input–output weights accordingly. In doing so, it is important to notice that the convergence time is strongly dependent on the interaction cadence, the reward value being updated at each exchange according to each new rhythm prediction.
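Independently of the exact forms of (7) and (8), the following minimal sketch illustrates the kind of ASE-inspired loop described in this section: a noisy competition selects one output for the detected input, and the association that was just tested is then strengthened or weakened by the rhythm-based reward. The update rule and its constants are illustrative assumptions.

```python
import numpy as np

class SensorimotorLayer:
    """Sketch of the input-output layer trained by the rhythm-based reward."""

    def __init__(self, n, noise=0.1, lr=0.2, seed=0):
        self.w = np.full((n, n), 0.5)          # all associations start equal
        self.noise = noise                     # drives the initial exploration
        self.lr = lr
        self.rng = np.random.default_rng(seed)
        self.last_pair = None                  # (input, output) tested last

    def act(self, input_index):
        """Select one output for the detected input (noisy competition)."""
        scores = self.w[input_index] + self.noise * self.rng.random(self.w.shape[1])
        output_index = int(scores.argmax())
        self.last_pair = (input_index, output_index)
        return output_index

    def learn(self, reward):
        """Reinforce or weaken the association tested at the previous step."""
        if self.last_pair is None:
            return
        i, o = self.last_pair
        self.w[i, o] = float(np.clip(self.w[i, o] + self.lr * reward, 0.0, 1.0))
```

Coupled with the rhythm-predictor sketch above, each detected gesture would call act() to trigger the robot's response, and the reward returned at the next gesture by on_td_spike() would be passed to learn(), closing the loop described in this section.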

IV. EXPERIMENTS

In this section, we describe three kinds of experiments.
1) A simple human–machine interaction based on key–sound associations. The experimenter has to teach the computer to associate sounds with keystrokes. In this experiment, the human is an expert, and the goal is to demonstrate the convergence of the model with a growing number of input–output associations.

2) A human–robot interaction with a Sony Aibo robot, testing gesture–gesture associations. The human has to teach the robot to respond with the right gesture among four possible ones, according to four human gestures. In this experiment, we tested expert and nonexpert users in the frame of very limited instructions.

3) A human–robot interaction with a small humanoid Aldebaran Robotics Nao robot. The subjects are involved in the same gesture experiment as in experiment 2, and we also tested two groups of subjects, experts versus nonexperts. Taking into account the results of experiment 2, we propose to enhance the learning rule to take into account a delayed reinforcement, and we introduce a confidence value to enhance the stability of the robot's association testing.

In all the experiments, we have defined expert and nonexpert subjects as follows.

• An expert user is a subject who knows that the rhythm of the interaction has an effect on the robot's learning. This concerns people from the lab aware of this research, and also people naive to robotics who are instructed before the experiment that they can change the timing of the interaction or break the rhythm in order to change the robot's behavior.

• A nonexpert user is a subject who is not aware of the robot's functioning and whose only instruction is the one given for each experiment, without additional information.

A. Human–Machine Interaction: The Sound Experiment

In order to test the feasibility of our model, we have established a human–computer interaction setup.

Fig. 7. Principle of the human–computer interaction. The user pushes a keyboard key and expects a given note in response. While the machine responds correctly, the user continues the interaction and tests new keys. When the note emitted is incorrect, the user briefly interrupts the interaction (or at least noticeably changes the timing) before restarting it, for example to express his disagreement before going on.

1) Setup: The setup is designed as a game where the expert must teach the computer to produce a given note when he hits a given key. As presented in Section III, the architecture must learn to associate keystrokes with sounds through the interaction (see Fig. 7).

2) Results: We have tested the network with sizes of the input and output groups going from two to 10 neurons. In all tests, the number of neurons in the input and output groups is equal, meaning that the search rule had to discover the right set of associations among all the possible input–output pairings. Fig. 8 shows the average time for discovering the right set of associations according to the group size (see Fig. 8, up) and the progress of the online learning during one experiment with 10 neurons per group (Fig. 8, bottom).

The system succeeded in learning the set of correct associations. Obviously, we see in Fig. 8, up, that the convergence time of the architecture is strongly linked to the combinatorics of the exploration space to be searched before discovering the right associations. For example, the experiment presented in Fig. 8, bottom, took approximately seven minutes of interaction before fully converging. The network is running permanently, but it is important to notice that the learning rule can only be triggered after each update of the interaction cycle. Therefore, the time taken to converge is not a limitation of the algorithm, but a limitation imposed by the user's cadence, the user taking the time to interact every two or three seconds in the frame of such a keyboard–human game. If we take for example the sequence of 10 elements, it means that the interaction has to go through the test of the 100 possible input–output associations. With an interaction cycle of three seconds, this gives us a "theoretical" total interaction duration of 300 s, without taking into account the additional time taken to break the interaction. Our algorithm converges in a mean time of 445 s of interaction.
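Spelling out the estimate above, with 10 input and 10 output neurons and one association tested per interaction cycle of roughly 3 s:

\[ T_{\text{theoretical}} \approx 10 \times 10 \times 3\ \text{s} = 300\ \text{s}, \qquad T_{\text{measured}} \approx 445\ \text{s}, \]

the difference corresponding mainly to the additional time taken to break the interaction.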

The choice of this first test (under the form of a human–computer interaction using sounds, for the sake of simplicity) allows the human to react quickly and intuitively to the sound emitted by the computer. With such a simple setup, all the tests were done with an expert user, that is to say, a user comfortable with the interface and who knows that he can use the rhythm to control the learning.


Fig. 8. Mean convergence time for discovering and learning the correct associations among all possible ones through a keyboard–sound interaction, using the rhythm of the interaction and a classical reinforcement rule. Up: mean time in seconds to converge to the right set of associations according to the number of associations to learn; each point is the mean of a set of five tests conducted with the same experimenter. Bottom: learning progress during one experiment consisting in the discovery of 10 key–sound associations among 100 possible ones. Each association discovered gives one point, and the convergence takes about 11 000 computation steps (see text for more details about the convergence time).

The objective of this experiment was to show that the convergence of a classical reinforcement rule is possible when it is combined with a neural network rhythm predictor learning online, even with a large number of possible sensorimotor associations (here, up to 100 associations). "Large" must here be understood as an important set of associations to be explored in a limited interaction time. In all the experiments, the first association has always been the most difficult to discover. While none of the responses of the system are those expected by the user, it is difficult to establish a default rhythm of interaction. Conversely, once a first association is discovered and learned, the user tends to use it to support the exchanges, installing a default rate with the system before testing the new keys, and so on.

B. First Human–Robot Experiment

1) Setup: For this experiment, we have used a Sony Aibo robot. The robot was seated on a table in front of the human (see Fig. 9 for an illustration). The robot controlled four degrees of freedom (the elbow and the shoulder of each arm), and the head was still. The control architecture was implemented with four different actions: raising the left foreleg, raising the right foreleg, lowering the left foreleg, or lowering the right foreleg. The task was a "mirror action" game where the robot has to learn to make the same action as the human.

The robot must produce the same action, on the same side, as the action done by the human (human and robot being face-to-face). During the game, the robot has to explore its motor repertoire when stimulated by the human, and to discover and reinforce the mirror gestures of the human's ones.

Fig. 9. Setup of the "mirror action" game with Aibo.

Consequently, the robot had to discover the four associations corresponding to the imitative actions among the 16 possible ones.

2) Instructions: The subjects were told to "teach the robot to do the same action." To illustrate the instructions, a little program was run to show the four possible actions of the robot using the two upper legs. No additional instruction was given to the subject (for example, whether the robot is able to process speech, faces, etc.). Such a setup was chosen to easily engage the human to lead the interaction, and also to give as few cues as possible about the functioning of the robot, especially for nonexpert users. The human proposes a gesture and waits for the robot's response, but he has no access to any special interface indicating to the robot whether the response was good or bad. At the end of the experiment, we questioned the subjects about the quality of the interaction, the teaching procedure they used, and whether they could understand how the robot was learning.

3) Results: Experts: For all expert subjects, the learning of the four input–output associations was completed in a mean time of four minutes of interaction. Fig. 10 shows the result of one experiment conducted with an expert according to the setup described in the previous section. The correct learning of the four associations took approximately three minutes.1 As in the sound experiment, the experts found the first association the most difficult one to teach, because they had no basis to trigger a constant rhythm and therefore no basis to control the reinforcement. Most of the experts, and especially the experts naive to robotics (but instructed to use the timing to change the robot's learning), also felt the robot's action choices to be inconsistent during the experiment. The interaction was found to be "disturbing" or "unstable," because the robot changed the explored sensorimotor associations randomly. This behavior of the robot is directly linked to the noise parameter added to the output activity. While this variable is fundamental to the exploration of the search space by the algorithm, the corresponding behavior of the robot is felt as inconsistent by the experimenter, the action changing from time to time "for no reason." This situation particularly happens at the beginning of the experiment, when the reward is too weak to induce strong changes of the weights, letting the noise value decide the winning output.

1 A video of this experiment can be seen at http://www.youtube.com/watch?v=7OQK_NjX8D8.


Fig. 10. Expert–robot experiment with Aibo. First Line: activity of TD detecting the beginning of the human actions. Second Line: responses of the robot, labeled from one to five. Third Line: evolution of the score of the robot; each correct association (mirror action) gives one point. Fourth Line: reward built by the robot, oscillating between positive and negative values according to the rhythm of the interaction.

4) Results: Nonexperts: None of the five nonexpert subjects succeeded in teaching the robot the four input–output associations. The best score was the teaching of two associations (population mean score: 1.4 taught associations), with a mean interaction time of 5.2 minutes. When questioned, none of the subjects had detected that the timing of the exchanges played a role in the learning of the robot. When combining the data records (mainly the robot's inputs) and the responses of the subjects to an interview about the quality of the interaction, we can formulate the following comments.

• As for the expert subjects, the nonexperts felt the robot's behavior to be inconsistent, with randomly changing actions. A more "stable" behavior in the responses of the robot would have helped the subject take the time to work on a given sensorimotor association, testing it iteratively and giving a better chance of installing an interaction rhythm.

• There is a strong overlap between the durations of the actions and reactions of the nonexperts, whether these durations are linked to a positive or to a negative reward (see Fig. 11). In this experiment, the variance and overlap of each category show that there is no obvious distinction between the timing of successful actions and that of incorrect ones. We can explain the variance of positive actions as follows: cadences are not always the same, and the period of a constant, satisfying interaction can slide during the experiment. This effect is taken into account by our algorithm, which adapts to small successive variations of the rhythm. We can explain the variance of negative actions as follows: when observing the experiment and the inputs of the robot (reflecting the timing of the subject's actions), nonexpert users often carry on the interaction at a constant rhythm when the robot is wrong, waiting for two or three wrong actions before making a strong break. This explains why the durations of negative actions and reactions overlap with the positive ones.

Fig. 11. Record of the delays and durations of a nonexpert's actions during a 3.5-min interaction with Aibo. Delays and durations are the components determining the time between two actions, and therefore the rhythm of the interaction. For all experiments, we recorded the delays and durations (from the robot's perceptions) of the actions responsible for the reward attribution. On the left side are the values of delays and durations corresponding to negative rewards, versus the positive ones on the right. The mean delay and duration for negative rewards are 3.2 s and 1.7 s, versus 6.1 s and 2.3 s for the positive rewards. Similar distributions were found for each of the five nonexperts.

Such information leads us to think that if an algorithm using immediate reinforcement is suited to experts (they immediately break the rhythm after a wrong answer), it may not be suited to nonexpert users, who wait a mean of two or three wrong responses before breaking the interaction. In this case, our algorithm should also take delayed rewards into account, for example by distributing the effect of a negative reward over the two or three past associations.

• Finally, and more generally, it is important to mention that experiments lasting more than four minutes were found to be tiring by the nonexperts. We did not conduct intensive research on how to identify precisely how long a human would accept to interact with such a setup. The number of variables is too large (it would require a strong analysis of the experimental conditions), and such a study is out of the scope of our competences (the human being in the loop, it would require a psychological study of the human acceptance of interaction with such artificial systems). Nevertheless, these naive but recurrent observations of tired or bored reactions after four or five minutes of interaction suggest that this is a limit for studying a "one-shot" interaction with our autonomous robots (given the entertainment level of our robots).

C. Second Human–Robot Experiment

To investigate the limitations of the first set of robotic experiments, we tested a second learning rule, in order to: 1) take into account a delayed reward, for example if the break of the rhythm does not only concern the preceding input–output association; and 2) give the robot the possibility to maintain the associations longer, in order to test more thoroughly whether a rule must be learned or unlearned.

1) The PCR Rule: The probabilistic conditioning rule (PCR) [33] was proposed to allow mobile robots to learn the correct set of perception–action associations when receiving delayed rewards.


A typical task solved by PCR was to allow a mobile robot to learn the right actions to perform (go ahead, turn left, or turn right) in order to escape from a maze. Each branch of the maze was recognized by the robot using vision (the recognition of the places being the inputs of the learning network), and the robot received negative rewards each time it bumped into a wall, and a unique positive reward when escaping from the maze. The rule was designed with the following properties:

• the robot must be able to test a given number of hypotheses during a given amount of time before deciding whether to learn or not (for example, always turning left when sensing a wall close to the front sensors);

• the robot can manage delayed rewards, thanks to a memory of the associations that were selected since the last reward (for example, rewarding the robot only at the exit of the maze).

Whereas this learning rule was initially designed to solve mobile robotics tasks with external motivations, we found it interesting that its properties fit some of the issues raised by our first robotic application.

In order to be able to reward past associations, PCR uses a correlation over a sliding window in order to keep a short-term memory of the activated inputs, outputs, and associations (9).

In order to be able to test a given association longer, a confidence value is introduced. This value represents how much the corresponding association is trusted by the architecture. A high confidence indicates that the corresponding input–output association should not be changed; conversely, a weak confidence indicates that the association must be changed because it is not rewarding. Consequently, the reward is used to change the confidence values of the associations; the confidence value is updated according to (10) and (11).

These updates involve the global reinforcement signal coming from PO, a learning speed, a reinforcement factor, a forget factor, and a reward threshold. Moreover, the short-term memory of the past associations is calculated with (9) and (12).

At each iteration, the confidence of each association is tested. If a random draw is higher than the confidence, then the weight value and its confidence value are inverted (13).

Fig. 12. Setup of the "mirror action" game with Nao. The robot is seated in front of the user. As in the Aibo experiment, a pink ball mediates the interaction, but more freedom is given to the experimenter thanks to an algorithm tracking the ball.

The lower the confidence, the higher the probability of changing the rule. This confidence-inversion mechanism is also very important, since it allows occasional tests of new rules even when all the associations have a strong confidence. For example, there is always a very small probability of changing an association that has a confidence of 0.99. If this happens, the benefit of the past learning (which has driven the rule to a 0.99 confidence) is not lost: the new rule will only have a confidence of 0.01, meaning that the system should rapidly come back to the previous association. Using this mechanism, the network is able to occasionally test new hypotheses before settling on sustainable responses.

Finally, the output is selected according to (14) and (15), from the weighted input activity plus an additional amount of random noise.
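The following sketch only illustrates the confidence mechanism as it is described in the prose of this section: each association carries a confidence that is reinforced when the recent associations are rewarded and decays otherwise, and a random draw occasionally flips a low-confidence association, the flipped rule starting with the complementary confidence so that past learning can be recovered quickly. The update forms, constants, and names are assumptions, not the published PCR equations.

```python
import numpy as np

class ConfidenceRule:
    """Illustrative sketch of the confidence mechanism described for PCR."""

    def __init__(self, n, lr=0.1, forget=0.02, threshold=0.2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w = self.rng.integers(0, 2, size=(n, n)).astype(float)  # random initial rules
        self.c = np.full((n, n), 0.5)       # confidence in each association
        self.recent = []                    # sliding-window STM of tested pairs, cf. (9)
        self.lr, self.forget, self.threshold = lr, forget, threshold

    def remember(self, i, o, window=3):
        """Keep the last few tested associations (delayed rewards reach them all)."""
        self.recent = (self.recent + [(i, o)])[-window:]

    def reinforce(self, reward):
        """Strengthen the confidence of recently tested pairs, or let it decay."""
        for i, o in self.recent:
            if reward > self.threshold:
                self.c[i, o] = min(1.0, self.c[i, o] + self.lr * reward)
            else:
                self.c[i, o] = max(0.0, self.c[i, o] - self.forget)

    def maybe_invert(self):
        """Occasionally flip low-confidence rules; a flipped rule starts with the
        complementary confidence, so a previously trusted rule is quickly recovered."""
        flip = self.rng.random(self.c.shape) > self.c
        self.w[flip] = 1.0 - self.w[flip]
        self.c[flip] = 1.0 - self.c[flip]
```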

2) Setup: Fig. 12 illustrates the experimental setup. Nao is seated on a table in front of the user.

The head, torso, and arms are free to move. We added a simple algorithm allowing the head to track the pink color (two degrees of freedom, pan and tilt axes). This enhancement allows Nao to follow the pink ball when the experimenter moves it. The tracking behavior gives more freedom to the experimenter's moves (allowing movements with a large amplitude while being seated at approximately 1 m from the robot) and also provides an important feedback to know whether the robot is looking at the target. Nevertheless, if the experimenter decides to hide the ball from Nao's field of view, the head comes back to a central position. A side effect of the pink tracking is that Nao sometimes looks at the head of the experimenter (depending on the skin tone) if no other pink stimulus is present in the field of view. We decided to keep this side effect because: 1) it does not change the tracking performance (the pink ball is always the winning stimulus); and 2) it adds a coherent interactive behavior to the robot (the robot looks at the head of the experimenter if the ball is hidden, that is to say, when there is nothing to do).

3) Instructions: As in the previous experiment, we tested two kinds of users: experts and nonexperts (as described in Sections IV-A and IV-B).


Fig. 13. Illustration of the instructions given to the experimenter: "try to teach the following actions to the robot: when you raise the left arm, the robot should respond like this, etc."

Both types of subjects were instructed to use the pink ball to teach the robot to make the same moves, and were shown the illustration of Fig. 13. All subjects were free to interact with the robot as long as they wished. Subjects were told that they could, if they wished, talk to the robot, but none were informed whether the robot was able to process or understand what they were saying. No other oral or written instructions were given to the subjects. The group of experts was composed of 10 subjects, and the group of nonexperts was composed of eight subjects.

4) Results: Experts: For the experts, the results are similar to the previous Aibo experiment. The network converges toward the imitative associations with a convergence speed similar to the ASE-inspired algorithm (with a mean convergence time of 4–5 min). The use of an algorithm able to cope with delayed rewards did not change the convergence time, since experts have a tendency to immediately break the rhythm after the robot's wrong responses and, therefore, provide immediate feedback to the system. According to the subjects' interviews, the quality of the interaction was noticeably improved thanks to the use of the confidence value on the weights. Whereas Aibo was perceived as changing its actions randomly, with no reason, Nao is perceived as far more stable, testing the same set of actions before changing the weights. With expert subjects, using the rhythm of the interaction appears to be an efficient means of teaching the robot new associations. The interaction is free, without the use of any programming skills or debugging information. Interestingly, when parametrizing the robotic setup, we noticed that using a debug display (showing the activity of the neurons) to explicitly see when the prediction fires, in order to manage our actions and test the rhythm-break detection, brought poorer results than naturally interacting with the robot.

5) Results: Nonexperts: Results with nonexperts showed progress compared to the Aibo experiment. None of the nonexperts managed to achieve the learning of all four associations, but all subjects obtained at least a score of two correct associations, and six subjects obtained a score of three correct associations (mean group score: 2.8). Experiments lasted between three and six min (mean duration: 4 min 20 s). The failure to obtain perfect scores can be explained by the fact that nonexperts often use different strategies in the

Fig. 14. Typical interaction with a nonexpert. The subject alternates the strategies, sometimes using a constant rhythm to strengthen the interaction, with breaks when the robot does not respond correctly (from t = … to t = … s), and sometimes a time-independent behavior (just seeking the possible actions of the robot, for example, from t = … to t = … s). After t = … the subject starts to get bored and progressively gives up the game: he acts very rapidly, trying to "force" new responses from the robot. During this experiment the robot discovers three of the four correct associations among the 16 possible ones.

same experiment. As shown in Fig. 14, which illustrates the course of a whole experiment, the subjects can combine multiple strategies in order to teach and test the robot. Among the strategies used by the subjects, we can identify three that were often observed.

• The first strategy is consistent with our working hypotheses: the subject maintains a stable rate of actions when the robot responds correctly and breaks the rhythm once the robot starts to make wrong actions, or after a succession of two or three wrong actions. This strategy was observed as part of every subject's interaction. It was more used among the subjects who chose to speak to the robot, accompanying the correct actions with positive words during the movement, while taking the time to speak (and break the interaction) when responses were wrong. Nevertheless, although this strategy was part of the interaction, it was never the sole one, explaining why the subjects never succeeded in reaching a score of four.

• We observed subjects simply "scanning" the possible responses of the robot without changing the action rate. The result of such a behavior is that the architecture generates a positive reward for all input–output associations tested during this "scan," whether they are correct or not. Such a strategy strongly affects the interaction, since it results in lowering the score and producing wrong associations that will have to be unlearned before the robot proposes new ones. Interestingly, it appears that such a strategy is also consistent with our initial hypotheses: the subject simply wanted to check whether the robot responds to every possible stimulation, and the fact that the robot


TABLE I. MEAN PROPORTION OF THE DIFFERENT STRATEGIES OBSERVED DURING THE NONEXPERTS' EXPERIMENTS.

responds by an action—whatever the action itself—is precisely the behavior expected by the human, who therefore continues the check rhythmically (but reinforces wrong input–output associations).

• A last strategy was often observed at the end of the experiment (most often after four to five min of interaction), when the subject started to get bored by the interaction and by the fact that not all the associations could be learned. In this case, the subject simply accelerates the rhythm of the exchanges strongly, in order to go as fast as possible and see whether, at some point, "by luck," the robot will perform the correct actions.

Table I summarizes the mean proportion of each strategy observed among the eight nonexperts. This table is based on a combination of visual observation of the experiments, subjects' interviews, and a record of the robot's inputs reflecting the cadence of the subjects' actions. The proportion of the expected behavior represents 38% of the interaction. Subjects speaking to the robot showed a higher proportion of this behavior, corresponding to a mean of 56% of the interaction (observed with three subjects). Although the proportion of the "scanning" behavior is limited, such a behavior has a strong negative effect on the course of the interaction, as explained above. The strong proportion of "other" behaviors is explained by parts of the interaction that we were not able to identify, and by transitions between identified behaviors.

All these data suggest that rhythmic components are present and can be exploited both to improve the learning of the robot and to enhance the quality of the interaction. Nevertheless, solutions strictly predicting the rhythm of the interaction cannot be the only ones guiding the robot's learning, as shown by the nonexpert scores. Of course, the performance is highly dependent on the instructions given to the subjects. For example, an instruction such as "you have to talk to the robot when its behavior is wrong" is a way to guide the subjects toward an appropriate behavior for such algorithms. In the present paper, we decided not to give such instructions, especially instructions that would have constrained the subject's behavior in case of wrong responses of the robot. We wanted to obtain an evaluation of the usability of rhythm prediction in the frame of the least constrained interaction possible. With the same idea, we chose to limit the robot's behavior, but improvements could enhance the results. For example, some components of the behavior (such as the legs balancing, or changes in the eyes' color) could indicate that the robot perceives a rhythm and provide a behavioral carrier to the caregiver, that is to say, turn the interaction into a bidirectional exchange. To illustrate the perspectives of using the rhythm of the interaction as a tool for human–robot interactions, we give in the next section a brief description of ongoing research where rhythm detection plays a different role.

V. PERSPECTIVES IN HUMAN–ROBOT INTERACTIONS

This research concerns the learning and recognition of facial emotional expressions in the frame of a face-to-face interaction with a robotic head [34]. As in the previous experiments, the robot has no notion of others. The robotic head simply expresses, alternately, five different expressions (due to an arbitrary variation of its internal state). The control architecture is composed of an advanced vision system, processing the detection of feature points in the visual flow and the categorization of the position (where) and the local content (what) of the selected feature points. The robot is then able to associate the result of the visual detection and categorization with the current expression, or with the current internal state triggering that expression. If a human comes and starts to imitate the robot, then he closes the loop of the interaction: the human returns a visual equivalent of the robot's own "muscular" facial expression. Such an experiment shows that a categorization of the visual emotional expression is possible with an unconstrained vision system that simply learns the properties of the feature points of the whole scene. Once the salient features of the scene are extracted and associated with the corresponding internal emotion, the robot is able to respond to the human with the same emotional expression. The roles are reversed and the robot is now able to recognize and imitate the facial expression of the human, without knowing what a human, a face, or an agent is. But to be able to ground the visual categorization, the robot must learn only when the human is interacting. And as mentioned above, the robot has no notion of what a face or a human is. An elegant solution resides in the use of the interaction rhythm to provide the information that the robot is effectively interacting with a human face. In such an experiment, the rhythm detection and prediction mechanism is exactly the one described in Section III-A. Because the robot imposes the rhythm of the interaction, the imitating human produces a rhythmic response that can be perceived by the robot (simply by computing the sum of the visual movement detected). The rhythm prediction plays the role of a modulation, supporting the categorization of the visual field when the movement detection coincides with the PO activity. Moreover, the rhythm information can then trigger the learning of the difference between "face" and "nonface" visual categories, that is to say between "rhythmic" and "nonrhythmic" visual stimuli ("rhythmic" referring here to the stimuli that are locked on the rhythm of the robot's actions).
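A schematic sketch of this rhythm-gated learning is given below: the visual categorizer only learns when a burst of detected movement coincides with the rhythm prediction (PO) activity. The signal names, thresholds, and the categorizer.learn call are hypothetical placeholders for illustration, not the actual architecture.

```python
def rhythm_gated_update(categorizer, features, internal_emotion,
                        movement_energy, po_activity,
                        movement_threshold=0.1, po_threshold=0.5):
    """Learn visual categories only when the partner is locked on the robot's rhythm.

    The robotic head imposes the rhythm; an imitating human produces
    movement bursts that coincide with the rhythm prediction (PO) activity.
    Only in that case are the current visual features associated with the
    internal emotion driving the robot's expression ("rhythmic" stimuli);
    otherwise the stimulus is treated as "nonrhythmic" and nothing is learned.
    """
    interacting = (movement_energy > movement_threshold
                   and po_activity > po_threshold)
    if interacting:
        categorizer.learn(features, internal_emotion)  # hypothetical learning call
    return interacting
```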

VI. CONCLUSION

In this paper, we have presented a NN model able to learn and predict the rhythm of an interaction. Following a bottom–up perspective, the input of the rhythm detection is the sensorimotor pathway of our architecture. By monitoring its own sensorimotor dynamics, the autonomous robot has access to the rhythm of the whole interaction it is part of. This principle allows artificial systems to learn from human behavior and to use social cues without any notion of what a human is. We have shown that a combination of rhythm detection and classical learning rules allows a familiar user to drive the learning of the robot's sensorimotor associations without using any explicit setup or ad hoc procedure to reward the robot. We have shown that the use of a more elaborate learning rule taking into account: 1) a delayed


reward; and 2) the sustaining of the tested associations, increases the quality of the interaction. We have also tested our architecture with nonexpert subjects, showing that our working hypotheses only partly covered the strategies used by the humans. Even if rhythm variation is not the only behavior used by nonexperts, it is nevertheless a mechanism that remains present in the interaction. Finally, we have suggested and illustrated another use of rhythm detection, allowing a robotic head to detect that a human is interacting, thus triggering the categorization of the visual stimuli. All these experiments show that taking into account the timing of the interaction is a necessary condition to develop adaptive skills from the social context. Timing is one of the sole amodal dimensions of social interaction. Therefore, it can underlie the interaction and the use of all the other modalities. Even if our experiments with nonexperts show that such a property is not sufficient to allow an optimal learning or framing of the interaction, rhythm prediction seems to be a promising tool that needs to be combined with other learning strategies, and especially with learning rules exploiting the other modalities.

Prepin et al. have recently proposed that rhythm and synchrony play the role of phatic signals in preverbal interaction [35]. The authors highlight that the term phatic, "usually used in a verbal context to define periverbal signals whose function is to structure and regulate the interaction, is also adapted to the nonverbal interaction where these signals are at the same time exchanged signals and regulation signals..."2 In future work we wish to extend the role of the rhythm as a regulation signal. Our next working hypothesis will be to show how predicting the rhythm can play the role of an internal signal, for example generating positive (correct predictions of the timing) versus negative feelings (wrong predictions). These positive or negative internal values could be expressed by the robot and lead to changes in the human's behavior. Thus, changing the rhythm could modify the internal state of the robot, an internal state whose expression would subsequently change the human's responses and rhythm. Testing such a model could be an interesting step toward the establishment of locked versus unlocked phases of interaction, allowing progress toward the establishment of turn taking and/or role switching.

ACKNOWLEDGMENT

The authors would like to thank N. Garnault for the work done programming Aibo and Nao, and S. Boucenna for his participation in Section V.

REFERENCES

[1] A. Meltzoff and W. Prinz, Eds., The Imitative Mind: Development, Evolution and Brain Bases. Cambridge, MA: Cambridge Univ. Press, 2000.

[2] J. Nadel, K. Prepin, and M. Okanda, "Experiencing contingency and agency: First step toward self-understanding?," Interact. Stud., vol. 2, pp. 447–462, 2005.

[3] P. Rochat, "Five levels of self-awareness as they unfold early in life," Conscious. Cogn., vol. 12, pp. 717–731, 2003.

[4] T. Kuriyama and Y. Kuniyoshi, "Acquisition of human–robot interaction rules via imitation and response observation," in Proc. 10th Int. Conf. Simulation Adapt. Behav., Osaka, Japan, 2008.

2 English translation of [35], the reference being available in French only.

[5] C. Breazeal, "Regulation and entrainment for human–robot interaction," Int. J. Exp. Robot., vol. 10–11, no. 21, pp. 883–902, 2003.

[6] H. Kozima, C. Nakagawa, and H. Yano, "Can a robot empathize with people?," Artif. Life Robot., vol. 8, no. 1, pp. 83–88, Sep. 2004.

[7] M. Masahiro, "On the uncanny valley," in Proc. Humanoids-2005 Workshop, Tsukuba, Japan, Dec. 2005.

[8] M. Lungarella and L. Berthouze, "Modelling embodied cognition in humans with artifacts," Adapt. Behav., vol. 10, pp. 223–241, 2003.

[9] J. Tani and S. Nolfi, "Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems," in From Animals to Animats: Simulation of Adaptive Behavior SAB'98, R. Pfeifer, B. Blumberg, J. Meyer, and S. Wilson, Eds. Cambridge, MA: MIT Press, 1998, pp. 270–279.

[10] P. Fitzpatrick, "First contact: An active vision approach to segmentation," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Las Vegas, NV, 2003, pp. 27–31.

[11] E. Torres-Jara, L. Natale, and P. Fitzpatrick, "Tapping into touch," in Proc. 5th Int. Workshop Epig. Robot. (EPIROB'05), Osaka, Japan, 2005.

[12] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in Proc. 3rd Int. Conf. Develop. Learn. (ICDL 2004), San Diego, CA, 2004.

[13] P.-Y. Oudeyer and F. Kaplan, "Intelligent adaptive curiosity: A source of self-development," in Proc. 4th Int. Workshop Epig. Robot., Genoa, Italy, 2004, pp. 127–130.

[14] P.-Y. Oudeyer, F. Kaplan, and V. Hafner, "Intrinsic motivation systems for autonomous mental development," IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. 265–286, May 2007.

[15] A. Pitti, M. Lungarella, and Y. Kuniyoshi, "Synchronization: Adaptive mechanism linking internal and external dynamics," in Proc. 6th Int. Conf. Epig. Robot. (EPIROB'06), Paris, France, 2006, pp. 127–134.

[16] K. Dautenhahn, "Getting to know each other—Artificial social intelligence for autonomous robots," Robot. Autonom. Syst., vol. 16, no. 2–4, pp. 333–356, Dec. 1995.

[17] P. Gaussier, S. Moga, M. Quoy, and J. Banquet, "From perception action loops to imitation processes: A bottom-up approach of learning by imitation," Appl. Artif. Intell., vol. 12, no. 7–8, pp. 701–727, Oct.–Dec. 1998.

[18] P. Andry, P. Gaussier, S. Moga, J. Banquet, and J. Nadel, "Learning and communication in imitation: An autonomous robot perspective," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 31, no. 5, pp. 431–444, Sep. 2001.

[19] A. Blanchard and L. Cañamero, "From imprinting to adaptation: Building a history of affective interaction," in Proc. 5th Int. Workshop Epig. Robot. (EPIROB05), Nara, Japan, Jul. 2005, pp. 42–50.

[20] K. Prepin and A. Revel, "Human–machine interaction as a model of machine–machine interaction: How to make machines interact as humans do," Adv. Robot., Section Focused on Imitative Robot. (2), vol. 21, no. 15, pp. 18–31, Dec. 2007.

[21] A. Blanchard and J. Nadel, "Designing a turn-taking mechanism as a balance between familiarity and novelty," in Proc. 9th Int. Conf. Epigenetic Robot. (EPIROB2009), Venice, Italy, Nov. 2009, pp. 199–201.

[22] A. Revel and P. Andry, "Emergence of structured interactions: From a theoretical model to pragmatic robotics," Neural Netw., vol. 22, pp. 116–125, 2009.

[23] C. Trevarthen, "The foundations of intersubjectivity: Development of interpersonal and cooperative understanding in infants," in The Social Foundations of Language and Thought: Essays in Honor of Jerome Bruner. New York: Norton, 1980.

[24] J. Gewirtz, "Mechanism of social learning: Some roles of stimulation and behaviour in early development," in Handbook of Socialization Theory and Research. Chicago: Rand McNally, 1969, pp. 57–212.

[25] E. Blass, J. Ganchrow, and J. Steiner, "Classical conditioning in newborn humans 2–48 hours of age," Inf. Behav. Develop., vol. 7, pp. 223–235, 1984.

[26] G. Gergely and J. Watson, "Early social-emotional development: Contingency perception and the social biofeedback model," Early Soc. Cogn., pp. 101–136, 1999.

[27] K. Hiraki, "Detecting contingency: A key to understanding development of self and social cognition," Jpn. Psychol. Res., vol. 48, no. 3, pp. 204–212, 2006.

[28] C. G. Prince, G. J. Hollich, N. A. Helder, E. J. Mislivec, A. Reddy, S. Salunke, and N. Memon, "Taking synchrony seriously: A perceptual level model of infant synchrony detection," in Proc. 5th Int. Conf. Epig. Robot. (EPIROB'05), Nara, Japan, 2005, pp. 89–95.

[29] M. Rolf, M. Hanheide, and K. Rohlfing, "Attention via synchrony: Making use of multi-modal cues in social learning," IEEE Trans. Autonom. Mental Develop., vol. 1, no. 1, May 2009.


[30] J. Nadel, R. Soussignan, P. Canet, G. Libert, and P. Gérardin, "Two-month-old infants of depressed mothers show mild, delayed and persistent change in emotional state after non-contingent interaction," Inf. Behav. Develop., vol. 28, pp. 418–425, 2005.

[31] J. Banquet, P. Gaussier, J. Dreher, C. Joulain, and A. Revel, "Space-time, order and hierarchy in fronto-hippocampal system: A neural basis of personality," in Cognitive Science Perspectives on Personality and Emotion. Amsterdam, The Netherlands: Elsevier Science BV, 1997, pp. 123–189.

[32] A. Barto, R. Sutton, and C. Anderson, "Neuronlike adaptive elements that can solve difficult control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983.

[33] P. Gaussier, A. Revel, C. Joulain, and S. Zrehen, "Living in a partially structured environment: How to bypass the limitation of classical reinforcement techniques," Robot. Autonom. Syst., 1997.

[34] S. Boucenna, P. Gaussier, and P. Andry, "What should be taught first: The emotional expression or the face?," in Proc. 8th Int. Workshop Epig. Robot., Modeling Cogn. Develop. Robot. Syst., Lund, Sweden, 2008, pp. 97–100.

[35] K. Prepin, "Développement et modélisation des capacités d'interactions homme-robot: L'imitation comme modèle de communication," Ph.D. dissertation, Centre Emotion CNRS—Université de Cergy-Pontoise, 2008.

Pierre Andry received the M.Sc. degree in artificial intelligence from Paris VI University, Pierre et Marie Curie, France, in 1999, and the Ph.D. degree in computer science from the Cergy-Pontoise University, Cergy-Pontoise, France, in 2002, investigating the role of imitation in learning and communication skills of autonomous robots.

He is currently a Researcher at the ETIS Lab, University of Cergy-Pontoise, Cergy-Pontoise, France. His research interests include epigenetic robotics, i.e., the way autonomous robots could develop in order to adapt to their physical and social environment. This concern includes issues such as imitation, emotions, preverbal interaction processes, turn-taking, and emergent behaviors.

Arnaud Blanchard received the M.Sc. degree in cognitive science from Paris IV University, Paris, France, and the Ph.D. degree in computer science from the University of Hertfordshire, Hatfield, U.K., in 2007, where he was working in the Adaptive Systems Research Group with L. Cañamero.

He then held two Postdoctoral positions within the European project FEELIX GROWING: one in the emotion center of the Salpêtrière hospital in Paris with J. Nadel, and one in the neurocybernetic group of P. Gaussier at the ETIS laboratory in Cergy-Pontoise, France. He is now a Research Engineer in the laboratory, where he is mainly working on interaction, imitation, and learning.

Philippe Gaussier received the M.S. degree in electronics from Aix-Marseille University, Marseille, France, in 1989, and the Ph.D. degree in computer science from the University of Paris XI, Orsay, France, for his work on the modeling and simulation of a visual system inspired by mammalian vision.

From 1992 to 1994, he conducted research in neural network applications and in the control of autonomous mobile robots at the Swiss Federal Institute of Technology, Zurich, Switzerland. He is currently a Professor at Cergy-Pontoise University, Cergy-Pontoise, France, and leads the Neurocybernetic Team of the Image and Signal Processing Lab (ETIS). His research interests are focused on the modeling of the cognitive mechanisms involved in visual perception, motivated navigation, and action selection, and on the study of the dynamical interactions between individuals, with a particular focus on imitation and emotions.