
Emotion Recognition from Facial Expressions using Multilevel HMM

Ira Cohen, Ashutosh Garg, Thomas S. Huang
Beckman Institute for Advanced Science and Technology

The University of Illinois at Urbana-Champaign
[email protected], [email protected], [email protected]

Abstract

Human-computer intelligent interaction (HCII) is an emerging field of science aimed at providing natural ways for humans to use computers as aids. It is argued that for the computer to be able to interact with humans, it needs to have the communication skills of humans. One of these skills is the ability to understand the emotional state of the person. The most expressive way humans display emotions is through facial expressions. This work focuses on automatic facial expression recognition from live video input using temporal cues. Methods for using temporal information have been extensively explored for speech recognition applications. Among these methods are template matching using dynamic programming methods and hidden Markov models (HMM). This work exploits existing methods and proposes a new architecture of HMMs for automatically segmenting and recognizing human facial expressions from video sequences. The novelty of this architecture is that both segmentation and recognition of the facial expressions are done automatically using a multilevel HMM architecture, while increasing the discrimination power between the different classes. In this work we explore person-dependent and person-independent recognition of expressions.

1 Introduction

In recent years there has been a growing interest in improving all aspects of the interaction between humans and computers. This emerging field has been a research interest for scientists from several different scholastic tracks, i.e., computer science, engineering, psychology, and neuroscience. These studies focus not only on improving computer interfaces, but also on improving the actions the computer takes based on feedback from the user. Feedback from the user has traditionally been through the keyboard and mouse. Other devices have also been developed for more application-specific interfaces, such as joysticks, trackballs, datagloves, and touch screens. The rapid advance of technology in recent years has made computers cheaper and more powerful, and has made the use of microphones and PC cameras affordable and easily available. The microphones and cameras enable the computer to "see" and "hear," and to use this information to act. A good example of this is the "Smart-Kiosk" project being done at Compaq research laboratories [7]. It is argued that to truly achieve effective human-computer intelligent interaction (HCII), there is a need for the computer to be able to interact naturally with the user, similar to the way human-human interaction takes place. Humans interact with each other mainly through speech, but also through body gestures, to emphasize a certain part of the speech, and through the display of emotions. Emotions are displayed by visual, vocal, and other physiological means. There is a growing amount of evidence showing that emotional skills are part of what is called "intelligence" [16, 8].

There are many ways that humans display their emotions. The most natural way to display emotions is using facial expressions. In the past 20 years there has been much research on recognizing emotion through facial expressions. This research was pioneered by Ekman and Friesen [6], who started their work from the psychology perspective. In the early 1990s the engineering community started to use these results to construct automatic methods of recognizing emotions from facial expressions in images or video [12, 13, 18, 15, 2]. Work on recognition of emotions from voice and video has recently been suggested and shown to work by Chen [2], Chen et al. [3], and De Silva et al. [5].

This work suggests another method for recognizing the emotion through facial expressions displayed in live video. The method uses all of the temporal information displayed in the video. The logic behind using all of the temporal information is that any emotion being displayed has a unique temporal pattern. Most facial expression research has classified each frame of the video to a facial expression based on some set of features computed for that time frame. An exception is the work of Otsuka and Ohya [13], which used simple hidden Markov models (HMM) to recognize sequences of emotion.


The novelty of this work is that a method to automatically segment the video and do the recognition is proposed, using a multilevel HMM structure. The first level of the architecture is comprised of independent HMMs related to the different emotions; but instead of looking just at the final output of these HMMs and using an ML classifier, the state sequence of the HMMs is used as the input of the higher-level HMM. This serves to segment the video sequence, but also increases the discrimination between the classes, since it tries to find not only the probability of the sequence displaying one emotion, but the probability of the sequence displaying one emotion and not displaying all the other emotions at the same time. We demonstrate this system using a small database of 5 people and show that in general this architecture outperforms a single-level HMM maximum likelihood classifier, where the input to that classifier is a presegmented video sequence, in contrast to the continuous video sequence fed into the multilevel HMM architecture.

2 Hidden Markov Models

Hidden Markov models have been widely used for many classification and modeling problems. Perhaps the most common application of HMMs is in speech recognition. One of the main advantages of HMMs is their ability to model nonstationary signals or events. Dynamic programming methods allow one to align the signals so as to account for the nonstationarity. However, the main disadvantage of this approach is that it is very time-consuming, since all of the stored sequences are used to find the best match. The HMM finds an implicit time warping in a probabilistic parametric fashion. It uses the transition probabilities between the hidden states and learns the conditional probabilities of the observations given the state of the model. In the case of emotion expression, the signal is the measurements of the facial motion. This signal is nonstationary in nature, since an expression can be displayed at varying rates and with varying intensities, even for the same individual.

An HMM is given by the following set of parameters:

λ = (A, B, π)
A = {a_ij} = {P(q_{t+1} = j | q_t = i)},  1 ≤ i, j ≤ N
B = {b_j(O_t)} = {P(O_t | q_t = j)},  1 ≤ j ≤ N
π = {π_i} = {P(q_1 = i)}        (1)

where A is the state transition probability matrix, B is the observation probability distribution, and π is the initial state distribution. The number of states of the HMM is given by N. It should be noted that the observations (O_t) can be either discrete or continuous, and can be vectors. In the discrete case, B becomes a matrix of probability entries (a conditional probability table), and in the continuous case, B will be given by the parameters of the probability distribution function of the observations (normally chosen to be the Gaussian distribution or a mixture of Gaussians). Given an HMM, there are three basic problems that are of interest. The first is how to efficiently compute the probability of the observations given the model. This problem is related to classification in the sense that it gives a measure of how well a certain model describes an observation sequence. The second is how, given a set of observations and the model, to find the corresponding state sequence in some optimal way. This will become an important part of the algorithm to recognize the expressions from live input and will be described later in this paper. The third is how to learn the parameters of the model λ given the set of observations, so as to maximize the probability of the observations given the model. This problem relates to the learning phase of the HMMs which describe each facial expression sequence. A comprehensive tutorial on HMMs is given by Rabiner [14].
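As an illustration of the first problem, the following is a minimal sketch (not the authors' code) of a discrete-observation HMM λ = (A, B, π) and the forward algorithm for computing P(O | λ); all names are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of a discrete-observation HMM lambda = (A, B, pi).
class HMM:
    def __init__(self, A, B, pi):
        self.A = np.asarray(A)    # A[i, j] = P(q_{t+1}=j | q_t=i)
        self.B = np.asarray(B)    # B[j, k] = P(O_t=k | q_t=j)
        self.pi = np.asarray(pi)  # pi[i]   = P(q_1=i)

    def likelihood(self, obs):
        """Forward algorithm: P(O | lambda) for an observation sequence."""
        alpha = self.pi * self.B[:, obs[0]]          # initialization
        for o in obs[1:]:
            alpha = (alpha @ self.A) * self.B[:, o]  # induction step
        return alpha.sum()                           # termination
```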

3 Expression Recognition Using Emotion-Specific HMMs

Since the display of a certain facial expression in video is represented by a temporal sequence of facial motions, it is natural to model each expression using an HMM trained for that particular type of expression. There will be six such HMMs, one for each expression: {happy (1), angry (2), surprise (3), disgust (4), fear (5), sad (6)}. There are several choices of model structure that can be used. The two main models are the left-to-right model and the ergodic model. In the left-to-right model, the probability of going back to the previous state is set to zero, and therefore the model will always start from a certain state and end up in an 'exiting' state. In the ergodic model, every state can be reached from any other state in a finite number of time steps. In [13], Otsuka and Ohya used left-to-right models with three states to model each type of facial expression. The advantage of using this model lies in the fact that it seems natural to model a sequential event with a model that also starts from a fixed starting state and always reaches an end state. It also involves fewer parameters, and therefore is easier to train. However, it reduces the degrees of freedom the model has to try to account for the observation sequence. There has been no study to indicate that the facial expression sequence is indeed modeled well by the left-to-right model. On the other hand, using the ergodic HMM allows more freedom for the model to account for the observation sequences, and in fact, for an infinite amount of training data it can be shown that the ergodic model will reduce to the left-to-right model, if that is indeed the true model. In this work, both types of models were tested with various numbers of states in an attempt to study the best structure that can model facial expressions.

In Figure 1 an example of a five-state left-to-right HMM (with return) is shown, with the probabilities as learned from the experiments described in the following section.

[Figure 1: Example of a five-state left-to-right HMM (with return), with the transition probabilities as learned from the experiments.]

The observation vector O_t for the HMM represents the continuous motion of the facial action units. Therefore, B is represented by the probability density functions (pdfs) of the observation vector at time t given the state of the model. The Gaussian distribution is chosen to represent these pdfs, i.e.,

B = {b_j(O_t)} = {N(O_t; μ_j, Σ_j)},  1 ≤ j ≤ N        (2)

where μ_j and Σ_j are the mean vector and full covariance matrix, respectively.

The parameters of each emotion-specific HMM are learned using the well-known Baum-Welch reestimation formulas. See [11] for details of the algorithm. For learning, hand-labeled sequences of each of the facial expressions are used as ground truth sequences, and the Baum algorithm is used to derive the maximum likelihood (ML) estimation of the model parameters (λ).

Parameter learning is followed by the construction of an ML classifier. Figure 2 shows the structure of the ML classifier. Given an observation sequence O_t, where t ∈ (1, T), the probability of the observation given each of the six models, P(O | λ_c), is computed using the forward-backward procedure [14]. The sequence is classified as the emotion corresponding to the model that yielded the highest probability, i.e.,

c* = argmax_{1 ≤ c ≤ 6} [P(O | λ_c)]        (3)
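A minimal sketch of Eq. (3), reusing the hypothetical likelihood() method from the earlier HMM sketch (the names and the models dict are assumptions, not the authors' API):

```python
# The six emotion-specific models, in the numbering used above.
EMOTIONS = ["happy", "angry", "surprise", "disgust", "fear", "sad"]

def classify(obs, models):
    """Eq. (3): return the emotion whose HMM maximizes P(O | lambda_c)."""
    return max(EMOTIONS, key=lambda c: models[c].likelihood(obs))
```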

4 Automatic Segmentation and Recognition of Emotions Using Multilevel HMM

The main problem with the approach taken in the previous section is that it works on isolated facial expression sequences, or on presegmented sequences of the expressions from the video.

[Figure 2: Maximum likelihood classifier for the emotion-specific HMMs. The video sequence is passed through face tracking and action unit measurement to produce the observation sequence O, which is scored by each of the six emotion-specific HMM models (1)-(6); the maximum of P(O|Model 1), ..., P(O|Model 6) gives the index of the recognized expression.]

In reality, this segmentation is not available, and therefore there is a need to find an automatic way of segmenting the sequences. Concatenation of the HMMs representing phonemes, in conjunction with the use of grammar, has been used in many systems for continuous speech recognition. Dynamic programming for continuous speech has also been proposed in different studies. It is not very straightforward to apply these methods to the emotion recognition problem, since there is no clear notion of language in displaying emotions. Otsuka and Ohya [13] used a heuristic method based on changes in the motion of several regions of the face to decide that an expression sequence is beginning and ending. After detecting the boundaries, the sequence is classified as one of the emotions using the emotion-specific HMMs. This method is prone to errors because of the sensitivity of the classifier to the segmentation result. Although the results of the HMMs are independent of each other, if we assume that they realistically model the motion of the facial features related to each emotion, the combination of the state sequences of the six HMMs together can provide very useful information and enhance the discrimination between the different classes. Since we will use a left-to-right model (with return), the changing of the state sequence can have a physical attribute attached to it (such as the opening and closing of the mouth when smiling), and therefore we can gain useful information from looking at the state sequence and using it to discriminate between the emotions at each point in time.

To solve the segmentation problem and enhance the discrimination between the classes, a different kind of architecture is needed. Figure 3 shows the proposed architecture for automatic segmentation and recognition of the displayed expression at each time instance. As can be seen, the motion features are fed continuously to the six emotion-specific HMMs. The state sequence of each of the HMMs is decoded and used as the observation vector for the high-level HMM.


The high-level HMM consists of seven states, one for each of the six emotions and one for neutral. The neutral state is necessary because, for a large portion of the time, there is no display of emotion on a person's face. The transitions between emotions are imposed to pass through the neutral state, since it is fair to assume that the face resumes a neutral position before it displays a new emotion. For instance, a person cannot go from expressing happy to sad without returning the face to its neutral position (even for a very brief interval). The recognition of the expression is done by decoding the state that the high-level HMM is in at each point in time, since the state represents the displayed emotion. To get a more stable recognition, the output of the classifier will actually be a smoothed version of the state sequence, i.e., the high-level HMM will have to stay in a particular state for a long enough time in order for the output to be the emotion related to that state.
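The transition structure this paragraph implies can be written down directly: emotion-to-emotion transitions that bypass neutral are zero. The sketch below is an illustration under assumed placeholder probabilities (the paper estimates the actual values from training-set frequencies, as described below):

```python
import numpy as np

NEUTRAL = 6  # states 0-5 are the six emotions; state 6 is neutral

def make_high_level_A(stay=0.95, enter=0.02):
    """7x7 transition matrix in which emotions connect only via neutral."""
    A = np.zeros((7, 7))
    for e in range(6):
        A[e, e] = stay             # remain in the current emotion
        A[e, NEUTRAL] = 1 - stay   # an emotion may exit only to neutral
    A[NEUTRAL, :6] = enter         # neutral may enter any emotion
    A[NEUTRAL, NEUTRAL] = 1 - 6 * enter
    return A                       # every row sums to 1
```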

[Figure 3: Multilevel HMM architecture for automatic segmentation and recognition of emotion. The tracking results (action unit measurements) at times t, t+1, ..., t+4 are fed to the six emotion-specific HMM models; their decoded state sequences form the observation sequence for the higher-level HMM, whose seven states (Happy, Angry, Surprise, Disgust, Fear, Sad, Neutral) give the recognition of the emotion at each sampling time.]

The training procedure of the system is as follows:

- Train the emotion-specific HMMs using hand-segmented sequences, as described in the previous section.

- Feed all six HMMs with the continuous (labeled) facial expression sequence. Each expression sequence contains several instances of each facial expression, with neutral instances separating the emotions.

- Obtain the state sequence of each HMM to form the six-dimensional observation vector of the higher-level HMM, i.e., O_t^h = [q_t^(1), ..., q_t^(6)]^T, where q_t^(i) is the state of the ith emotion-specific HMM. The decoding of the state sequence is done using the Viterbi algorithm [14].

- Learn the observation probability matrix for each state of the high-level HMM using b_s^(i)(j) = {expected frequency of model i being in state j, given that the true state was s}, and

  b_s(O_t^h) = ∏_{i=1}^{6} b_s^(i)(q_t^(i))        (4)

  where j ∈ (1, number of states of the lower-level HMM). (A sketch of this computation appears after this list.)

- Compute the transition probabilities A = {a_ss'} of the high-level HMM using the frequency of transiting from each of the six emotion classes to the neutral state in the training sequences, and from the neutral state to the other emotion states. For notation, the neutral state is numbered 7, and the other states are numbered as in the previous section. It should be noted that the transition probabilities from one emotion state to another emotion state that is not neutral are set to zero.

- Set the initial probability of the high-level HMM to be 1 for the neutral state and 0 for all other states. This forces the model to always start at the neutral state and assumes that a person will display a neutral expression at the beginning of any video sequence. This assumption is made just for simplicity of the testing.
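The product in Eq. (4) is straightforward to compute once the per-model frequency tables are tabulated; below is a minimal sketch, assuming a hypothetical lookup table b indexed as described in the third bullet above:

```python
import numpy as np

def high_level_obs_prob(b, s, q):
    """Eq. (4): b has shape (6, 7, n_low), where b[i, s, j] is the
    tabulated frequency of model i being in state j when the true
    high-level state was s; q holds the six decoded lower-level
    states at time t."""
    return np.prod([b[i, s, q[i]] for i in range(6)])
```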

The steps followed during the testing phase are very similar to the ones followed during training. The face tracking sequence is fed into the lower-level HMMs, and a decoded state sequence is obtained using the Viterbi algorithm. The decoded lower-level state sequence O_t^h is fed into the higher-level HMM, and the observation probabilities are computed using Eq. (4). Note that in this way of computing the probability, it is assumed that the state sequences of the lower-level HMMs are independent given the true labeling of the sequence. This assumption is reasonable, since the HMMs are trained independently and on different training sequences. In addition, without this assumption, the size of B would be enormous, since it would have to account for all possible combinations of states of the six lower-level HMMs, and it would require a huge amount of training data.

Using the Viterbi algorithm again, this time for the high-level HMM, a most likely state sequence is produced. The state that the HMM was in at time t corresponds to the expressed emotion in the video sequence at time t. To make the classification result robust to undesired fast changes, a smoothing of the state sequence is done by not changing the actual classification result if the HMM did not stay in a particular state for more than T samples, where T can vary between 1 and 15 samples (assuming a 30-Hz sampling rate). The introduction of the smoothing factor T will cause a delay in the decision of the system, but of no more than T sample times.
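One way to read this smoothing rule: the reported label switches only after the decoded high-level path has remained in a new state for at least T consecutive samples. A minimal sketch (an interpretation, not the authors' code):

```python
def smooth_states(path, T=10):
    """Suppress state changes shorter than T samples in a decoded path."""
    current = path[0]            # label currently being reported
    pending, run = current, 0
    out = []
    for s in path:
        if s == pending:
            run += 1
        else:
            pending, run = s, 1  # a candidate new state begins
        if run >= T:             # it has persisted long enough
            current = pending
        out.append(current)
    return out
```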


[Figure 4: Examples of images from the video sequences used in the experiment, one frame per emotion: (a) Anger, (b) Disgust, (c) Fear, (d) Happiness, (e) Sadness, (f) Surprise.]

5 Experiments

The testing of the algorithms described in the previous sections is performed on a database of people who were instructed to display facial expressions corresponding to the six types of emotions. This database is the same as the one tested in [2], and the data collection method is described in detail in [2]. However, the classification done in [2] was on a frame-by-frame basis, whereas in this work the classification is based on an entire sequence of one displayed emotion. All the tests of the algorithms are performed on a set of five people, each one displaying six sequences of each one of the six emotions, and always coming back to a neutral state between each emotion sequence. The video was used as the input to the face tracking algorithm. We used a face tracking algorithm developed by Tao [17]. The tracking algorithm uses a 3D Bezier volume model for face tracking and outputs the values of 12 action-unit-like measurements corresponding to the motion of various regions of the face for each frame. These AU measurements are used as the input to the HMM architecture. The sampling rate was 30 Hz, and a typical emotion sequence is about 70 samples long (~2 s). Figure 4 shows one frame of each emotion for three subjects. The data was collected in an open recording scenario, where the person is asked to display the expression corresponding to the emotion being induced. This is of course not the ideal way of collecting emotion data. The ideal way would be using a hidden recording, inducing the emotion through events in the normal environment of the subject, not in a studio. The main problems with collecting the data this way are the impracticality of it and the ethical issue of hidden recording. In the following experiments, both approaches (emotion-specific HMMs and the multilevel HMM) are tested using the database. In all of the tests, leave-one-out cross validation is used to obtain the probability of error.

6 Person-Dependent Tests

A person-dependent test is tried first. Since there are six sequences of each facial expression for each person, for each test one sequence of each emotion is left out, and the rest are used as the training sequences. For the HMM-based models, several numbers of states were tried (3-12), and both the ergodic and the left-to-right with return models were tested. The results presented below are for the best configuration (an ergodic model using 11 states). Table 1 shows the recognition rate for each person for the two classifiers, and the total recognition rate averaged over the five people. Notice that the fifth person has the worst recognition rate.

The fact that subject 5 was poorly classified can be attributed to the inaccurate tracking result and a lack of sufficient variability in displaying the emotions. It can be seen that the multilevel HMM does not significantly decrease the recognition rate (and improves it in some cases), even though the input is unsegmented continuous video, in contrast to the emotion-specific HMMs, which need the segmented emotion sequences. Analysis of the confusion between the different emotions (described in detail in [4]) shows that happiness and surprise are well recognized by both classifiers, with happiness achieving near 100% and surprise approximately 90%.


Table 1. Person-dependent emotion recognition rates using the emotion-specific HMMs and the multilevel HMM.

Subject   Single HMM   Multilevel HMM
1         82.86%       80%
2         91.43%       85.71%
3         80.56%       80.56%
4         83.33%       88.89%
5         54.29%       77.14%
Total     78.49%       82.46%

The other, more 'subtle' emotions are confused with each other more frequently, with sadness being the most confused emotion. Although the other emotions are usually not confused with happiness, in some instances surprise was confused with happiness, due to the fact that the subject smiled while displaying surprise, something that does happen in real life when the surprise is a good one. These results suggest the use of a different labeling of the emotional states, on scales of positive/negative and intensity of the emotions. This 2D representation of the emotions has been described by Lang [10].

7 Person-Independent Tests

In the previous section it was seen that a good recognition rate was achieved when the training sequences were taken from the same subject as the test sequences. The main challenge is to see if this can be generalized to person-independent recognition. For this test, all of the sequences of one subject are used as the test sequences, and the sequences of the remaining four subjects are used as training sequences. This test is repeated five times, each time leaving a different person out (leave-one-out cross validation). Table 2 shows the recognition rate of the test for the two algorithms. The results indicate that in this case the multilevel HMM gave better results than the one-layered HMM, and both gave results much higher than pure chance. In general, the recognition rate is much lower than in the person-dependent case (58% at best, compared to 88%). The first reason for this drop is the fact that the subjects are very different from each other (three females, two males, and different ethnic backgrounds); hence, they display their emotions differently. In fact, the recognition rate of subject 3, an Asian woman, was the lowest in this case (36% for the multilevel HMM). Although this appears to contradict the universality of the facial expressions as studied by Ekman and Friesen [6], it shows that for practical automatic emotion recognition, considerations of gender and race play a role in the training of the system.

Table 2. Recognition rates for the person-independent test.

                   Single HMM   Multilevel HMM
Recognition rate   55%          58%

This conclusion cannot be made strongly, since the database is small. A study of a larger database of subjects could confirm or dispute this conclusion; there are suggestions in the literature on the validity of this conclusion.

8 Discussion

In this work, a new method for emotion recognition from video sequences of facial expressions was explored. The emotion-specific HMMs rely on segmentation of a continuous video into sequences of emotions (or the neutral state), whereas the multilevel HMM performs automatic segmentation and recognition from a continuous signal. The experiments on a database of five people showed that the recognition rates for a person-dependent test are very high using both methods. The recognition rates drop dramatically for a person-independent test. This implies that a larger database is needed for the training, and possibly the subjects should be classified according to some categories, such as ethnic background and gender. The tests also showed that some emotions are greatly confused with each other (anger, disgust, sadness, and fear), while happiness and surprise are usually classified well. This suggests the use of a different set of classes to get more robust classification. The classes can be positive, negative, surprise, and neutral. This scale clusters the emotions into four categories, and can improve the recognition rate dramatically.

One of the main drawbacks in all of the work done on emotion recognition from facial expression videos is the lack of a benchmark database for testing different algorithms. This work relied on a database collected by Chen [2], but it is difficult to compare the results to other works using different databases. The recently constructed database by Kanade et al. [9] will be a useful tool for testing these algorithms.

A useful extension of this work would be to build a real-time system comprised of a fast and accurate face tracking algorithm combined with the multilevel HMM structure. By giving this feedback to the computer, a better interaction can be achieved. This can be used in many ways. For example, it can help in education by helping children learn effectively with computers.

Recognizing the emotion from just the facial expressions is probably not accurate enough. For a computer to truly understand the emotional state of a human, other measurements probably have to be made. Voice and gestures are


widely believed to play an important role as well [2, 5], and physiological states such as heartbeat and skin conductivity are being suggested [1]. People also use context as an indicator of the emotional state of a person. This work is just another step on the way toward achieving the goal of building more effective computers that can serve us better.

References

[1] J. T. Cacioppo and L. G. Tassinary. Inferring psychological significance from physiological signals. American Psychologist, 45:16-28, January 1990.

[2] L. S. Chen. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction. PhD thesis, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.

[3] L. S. Chen, H. Tao, T. S. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proc. IEEE Workshop on Multimedia Signal Processing, pages 83-88, Los Angeles, CA, USA, Dec. 7-9, 1998.

[4] I. Cohen. Automatic facial expression recognition from video sequences using temporal information. MS thesis, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.

[5] L. C. De Silva, T. Miyasato, and R. Nakatsu. Facial emotion recognition using multimodal information. In Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS'97), pages 397-401, Singapore, Sept. 1997.

[6] P. Ekman and W. V. Friesen. Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, Palo Alto, CA, 1978.

[7] A. Garg, V. Pavlovic, J. Rehg, and T. S. Huang. Audio-visual speaker detection using dynamic Bayesian networks. In Proc. of 4th Intl Conf. Automatic Face and Gesture Rec., pages 374-471, 2000.

[8] D. Goleman. Emotional Intelligence. Bantam Books, New York, 1995.

[9] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proc. of 4th Intl Conf. Automatic Face and Gesture Rec., pages 46-53, 2000.

[10] P. Lang. The emotion probe: Studies of motivation and attention. American Psychologist, 50(5):372-385, May 1995.

[11] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62(4):1035-1072, April 1983.

[12] K. Mase. Recognition of facial expression from optical flow. IEICE Transactions, E74(10):3474-3483, October 1991.

[13] T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Proc. Int. Conf. on Image Processing (ICIP-97), pages 546-549, Santa Barbara, CA, USA, Oct. 26-29, 1997.

[14] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[15] M. Rosenblum, Y. Yacoob, and L. S. Davis. Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7(5):1121-1138, September 1996.

[16] P. Salovey and J. D. Mayer. Emotional intelligence. Imagination, Cognition and Personality, 9(3):185-211, 1990.

[17] H. Tao and T. S. Huang. Connected vibrations: A modal analysis approach to non-rigid motion tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'98), Santa Barbara, CA, USA, June 23-25, 1998.

[18] Y. Yacoob and L. S. Davis. Recognizing human facial expressions from long image sequences using optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):636-642, June 1996.
