
Emotion Recognition from Facial Expressions using Multilevel HMM

Ira Cohen, Ashutosh Garg, Thomas S. Huang
Beckman Institute for Advanced Science and Technology

The University of Illinois at Urbana-Champaign
[email protected], [email protected], [email protected]

Abstract

Human-computer intelligent interaction (HCII) is an emerging field of science aimed at providing natural ways for humans to use computers as aids. It is argued that for the computer to be able to interact with humans, it needs to have the communication skills of humans. One of these skills is the ability to understand the emotional state of the person. The most expressive way humans display emotions is through facial expressions. This work focuses on automatic facial expression recognition from live video input using temporal cues. Methods for using temporal information have been extensively explored for speech recognition applications. Among these methods are template matching using dynamic programming methods and hidden Markov models (HMM). This work exploits existing methods and proposes a new architecture of HMMs for automatically segmenting and recognizing human facial expressions from video sequences. The novelty of this architecture is that both segmentation and recognition of the facial expressions are done automatically using a multilevel HMM architecture, while increasing the discrimination power between the different classes. In this work we explore person-dependent and person-independent recognition of expressions.

1 Introduction

In recent years there has been a growing interest in improving all aspects of the interaction between humans and computers. This emerging field has been a research interest for scientists from several different scholastic tracks, i.e., computer science, engineering, psychology, and neuroscience. These studies focus not only on improving computer interfaces, but also on improving the actions the computer takes based on feedback from the user. Feedback from the user has traditionally been through the keyboard and mouse. Other devices have also been developed for more application-specific interfaces, such as joysticks, trackballs, datagloves, and touch screens. The rapid advance of technology in recent years has made computers cheaper and more powerful, and has made the use of microphones and PC cameras affordable and easily available. The microphones and cameras enable the computer to "see" and "hear," and to use this information to act. A good example of this is the "Smart-Kiosk" project being done at Compaq research laboratories [7]. It is argued that to truly achieve effective human-computer intelligent interaction (HCII), there is a need for the computer to be able to interact naturally with the user, similar to the way human-human interaction takes place. Humans interact with each other mainly through speech, but also through body gestures, to emphasize a certain part of the speech, and through the display of emotions. Emotions are displayed by visual, vocal, and other physiological means. There is a growing amount of evidence showing that emotional skills are part of what is called "intelligence" [16, 8].

There are many ways that humans display their emotions. The most natural way to display emotions is using facial expressions. In the past 20 years there has been much research on recognizing emotion through facial expressions. This research was pioneered by Ekman and Friesen [6], who started their work from the psychology perspective. In the early 1990s the engineering community started to use these results to construct automatic methods of recognizing emotions from facial expressions in images or video [12, 13, 18, 15, 2]. Work on recognition of emotions from voice and video has recently been suggested and shown to work by Chen [2], Chen et al. [3], and De Silva et al. [5].

This work suggests another method for recognizing the emotion through facial expressions displayed in live video. The method uses all of the temporal information displayed in the video. The logic behind using all of the temporal information is that any emotion being displayed has a unique temporal pattern. Most facial expression research has classified each frame of the video to a facial expression based on some set of features computed for that time frame. An exception is the work of Otsuka and Ohya [13], which used simple hidden Markov models (HMM) to recognize sequences of emotion.


The novelty of this work is that a method to automatically segment the video and do the recognition is proposed, using a multilevel HMM structure. The first level of the architecture is comprised of independent HMMs related to the different emotions; but instead of looking just at the final output of these HMMs and using an ML classifier, the state sequence of the HMMs is used as the input of the higher-level HMM. This serves to segment the video sequence, but also increases the discrimination between the classes, since it tries to find not only the probability of the sequence displaying one emotion, but the probability of the sequence displaying one emotion and not displaying all the other emotions at the same time. We demonstrate this system using a small database of 5 people and show that in general this architecture outperforms a single-level HMM maximum likelihood classifier, where the input to that classifier is a presegmented video sequence, in contrast to the continuous video sequence fed into the multilevel HMM architecture.

2 Hidden Markov Models

Hidden Markov models have been widely used for many classification and modeling problems. Perhaps the most common application of HMMs is in speech recognition. One of the main advantages of HMMs is their ability to model nonstationary signals or events. Dynamic programming methods allow one to align the signals so as to account for the nonstationarity. However, the main disadvantage of this approach is that it is very time-consuming, since all of the stored sequences are used to find the best match. The HMM finds an implicit time warping in a probabilistic parametric fashion. It uses the transition probabilities between the hidden states and learns the conditional probabilities of the observations given the state of the model. In the case of emotion expression, the signal is the measurements of the facial motion. This signal is nonstationary in nature, since an expression can be displayed at varying rates and with varying intensities, even for the same individual.

An HMM is given by the following set of parameters:

λ = (A, B, π)
A = {a_ij} = {P(q_{t+1} = j | q_t = i)},  1 ≤ i, j ≤ N
B = {b_j(O_t)} = {P(O_t | q_t = j)},  1 ≤ j ≤ N
π = {π_i} = {P(q_1 = i)}        (1)

where A is the state transition probability matrix, B is the observation probability distribution, and π is the initial state distribution. The number of states of the HMM is given by N. It should be noted that the observations (O_t) can be either discrete or continuous, and can be vectors. In the discrete case, B becomes a matrix of probability entries (a conditional probability table), and in the continuous case, B will be given by the parameters of the probability distribution function of the observations (normally chosen to be the Gaussian distribution or a mixture of Gaussians). Given an HMM, there are three basic problems that are of interest. The first is how to efficiently compute the probability of the observations given the model. This problem is related to classification in the sense that it gives a measure of how well a certain model describes an observation sequence. The second is how, given a set of observations and the model, to find the corresponding state sequence in some optimal way. This will become an important part of the algorithm to recognize the expressions from live input and will be described later in this paper. The third is how to learn the parameters of the model λ given the set of observations, so as to maximize the probability of the observations given the model. This problem relates to the learning phase of the HMMs which describe each facial expression sequence. A comprehensive tutorial on HMMs is given by Rabiner [14].
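As an illustration of the first problem, the following is a minimal sketch (not the authors' code) of a discrete-observation HMM λ = (A, B, π) and the forward algorithm for computing P(O | λ); all names are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of a discrete-observation HMM lambda = (A, B, pi).
class HMM:
    def __init__(self, A, B, pi):
        self.A = np.asarray(A)    # A[i, j] = P(q_{t+1}=j | q_t=i)
        self.B = np.asarray(B)    # B[j, k] = P(O_t=k | q_t=j)
        self.pi = np.asarray(pi)  # pi[i]   = P(q_1=i)

    def likelihood(self, obs):
        """Forward algorithm: P(O | lambda) for an observation sequence."""
        alpha = self.pi * self.B[:, obs[0]]          # initialization
        for o in obs[1:]:
            alpha = (alpha @ self.A) * self.B[:, o]  # induction step
        return alpha.sum()                           # termination
```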

3 Expression Recognition Using Emotion-Specific HMMs

Since the display of a certain facial expression in video is represented by a temporal sequence of facial motions, it is natural to model each expression using an HMM trained for that particular type of expression. There will be six such HMMs, one for each expression: {happy (1), angry (2), surprise (3), disgust (4), fear (5), sad (6)}. There are several choices of model structure that can be used. The two main models are the left-to-right model and the ergodic model. In the left-to-right model, the probability of going back to the previous state is set to zero, and therefore the model will always start from a certain state and end up in an 'exiting' state. In the ergodic model, every state can be reached from any other state in a finite number of time steps. In [13], Otsuka and Ohya used left-to-right models with three states to model each type of facial expression. The advantage of using this model lies in the fact that it seems natural to model a sequential event with a model that also starts from a fixed starting state and always reaches an end state. It also involves fewer parameters, and therefore is easier to train. However, it reduces the degrees of freedom the model has to try to account for the observation sequence. There has been no study to indicate that the facial expression sequence is indeed modeled well by the left-to-right model. On the other hand, using the ergodic HMM allows more freedom for the model to account for the observation sequences, and in fact, for an infinite amount of training data it can be shown that the ergodic model will reduce to the left-to-right model, if that is indeed the true model. In this work, both types of models were tested with various numbers of states in an attempt to study the best structure that can model facial expressions.

In Figure 1 an example of a five-state left-to-right HMM (with return) is shown, with the probabilities as learned from the experiments described in the following section.

[Figure 1: Example of a five-state left-to-right HMM (with return), with the transition probabilities as learned from the experiments.]

The observation vector O_t for the HMM represents the continuous motion of the facial action units. Therefore, B is represented by the probability density functions (pdfs) of the observation vector at time t given the state of the model. The Gaussian distribution is chosen to represent these pdfs, i.e.,

B = {b_j(O_t)} = {N(O_t; μ_j, Σ_j)},  1 ≤ j ≤ N        (2)

where μ_j and Σ_j are the mean vector and full covariance matrix, respectively.

The parameters of each emotion-specific HMM are learned using the well-known Baum-Welch reestimation formulas. See [11] for details of the algorithm. For learning, hand-labeled sequences of each of the facial expressions are used as ground truth sequences, and the Baum algorithm is used to derive the maximum likelihood (ML) estimation of the model parameters (λ).

Parameter learning is followed by the construction of an ML classifier. Figure 2 shows the structure of the ML classifier. Given an observation sequence O_t, where t ∈ (1, T), the probability of the observation given each of the six models, P(O | λ_c), is computed using the forward-backward procedure [14]. The sequence is classified as the emotion corresponding to the model that yielded the highest probability, i.e.,

c* = argmax_{1 ≤ c ≤ 6} [P(O | λ_c)]        (3)
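A minimal sketch of Eq. (3), reusing the hypothetical likelihood() method from the earlier HMM sketch (the names and the models dict are assumptions, not the authors' API):

```python
# The six emotion-specific models, in the numbering used above.
EMOTIONS = ["happy", "angry", "surprise", "disgust", "fear", "sad"]

def classify(obs, models):
    """Eq. (3): return the emotion whose HMM maximizes P(O | lambda_c)."""
    return max(EMOTIONS, key=lambda c: models[c].likelihood(obs))
```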

4 Automatic Segmentation and Recognition of Emotions Using Multilevel HMM

The main problem with the approach taken in the previous section is that it works on isolated facial expression sequences, or on presegmented sequences of the expressions from the video.

[Figure 2: Maximum likelihood classifier for the emotion-specific HMMs. The video sequence is passed through face tracking and action unit measurement to produce the observation sequence O, which is scored by each of the six emotion-specific HMM models (1)-(6); the maximum of P(O|Model 1), ..., P(O|Model 6) gives the index of the recognized expression.]

In reality, this segmentation is not available, and therefore there is a need to find an automatic way of segmenting the sequences. Concatenation of the HMMs representing phonemes, in conjunction with the use of grammar, has been used in many systems for continuous speech recognition. Dynamic programming for continuous speech has also been proposed in different studies. It is not very straightforward to apply these methods to the emotion recognition problem, since there is no clear notion of language in displaying emotions. Otsuka and Ohya [13] used a heuristic method based on changes in the motion of several regions of the face to decide that an expression sequence is beginning and ending. After detecting the boundaries, the sequence is classified as one of the emotions using the emotion-specific HMMs. This method is prone to errors because of the sensitivity of the classifier to the segmentation result. Although the results of the HMMs are independent of each other, if we assume that they realistically model the motion of the facial features related to each emotion, the combination of the state sequences of the six HMMs together can provide very useful information and enhance the discrimination between the different classes. Since we will use a left-to-right model (with return), the changing of the state sequence can have a physical attribute attached to it (such as the opening and closing of the mouth when smiling), and therefore we can gain useful information from looking at the state sequence and using it to discriminate between the emotions at each point in time.

To solve the segmentation problem and enhance the discrimination between the classes, a different kind of architecture is needed. Figure 3 shows the proposed architecture for automatic segmentation and recognition of the displayed expression at each time instance. As can be seen, the motion features are fed continuously to the six emotion-specific HMMs. The state sequence of each of the HMMs is decoded and used as the observation vector for the high-level HMM.


The high-level HMM consists of seven states, one for each of the six emotions and one for neutral. The neutral state is necessary because, for a large portion of the time, there is no display of emotion on a person's face. The transitions between emotions are imposed to pass through the neutral state, since it is fair to assume that the face resumes a neutral position before it displays a new emotion. For instance, a person cannot go from expressing happy to sad without returning the face to its neutral position (even for a very brief interval). The recognition of the expression is done by decoding the state that the high-level HMM is in at each point in time, since the state represents the displayed emotion. To get a more stable recognition, the output of the classifier will actually be a smoothed version of the state sequence, i.e., the high-level HMM will have to stay in a particular state for a long enough time in order for the output to be the emotion related to that state.
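The transition structure this paragraph implies can be written down directly: emotion-to-emotion transitions that bypass neutral are zero. The sketch below is an illustration under assumed placeholder probabilities (the paper estimates the actual values from training-set frequencies, as described below):

```python
import numpy as np

NEUTRAL = 6  # states 0-5 are the six emotions; state 6 is neutral

def make_high_level_A(stay=0.95, enter=0.02):
    """7x7 transition matrix in which emotions connect only via neutral."""
    A = np.zeros((7, 7))
    for e in range(6):
        A[e, e] = stay             # remain in the current emotion
        A[e, NEUTRAL] = 1 - stay   # an emotion may exit only to neutral
    A[NEUTRAL, :6] = enter         # neutral may enter any emotion
    A[NEUTRAL, NEUTRAL] = 1 - 6 * enter
    return A                       # every row sums to 1
```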

[Figure 3: Multilevel HMM architecture for automatic segmentation and recognition of emotion. The tracking results (action unit measurements) at times t, t+1, ..., t+4 are fed to the six emotion-specific HMM models; their decoded state sequences form the observation sequence for the higher-level HMM, whose seven states (Happy, Angry, Surprise, Disgust, Fear, Sad, Neutral) give the recognition of the emotion at each sampling time.]

The training procedure of the system is as follows:

- Train the emotion-specific HMMs using hand-segmented sequences, as described in the previous section.

- Feed all six HMMs with the continuous (labeled) facial expression sequence. Each expression sequence contains several instances of each facial expression, with neutral instances separating the emotions.

- Obtain the state sequence of each HMM to form the six-dimensional observation vector of the higher-level HMM, i.e., O_t^h = [q_t^(1), ..., q_t^(6)]^T, where q_t^(i) is the state of the ith emotion-specific HMM. The decoding of the state sequence is done using the Viterbi algorithm [14].

- Learn the observation probability matrix for each state of the high-level HMM using b_s^(i)(j) = {expected frequency of model i being in state j, given that the true state was s}, and

  b_s(O_t^h) = ∏_{i=1}^{6} b_s^(i)(q_t^(i))        (4)

  where j ∈ (1, number of states of the lower-level HMM). (A sketch of this computation appears after this list.)

- Compute the transition probabilities A = {a_ss'} of the high-level HMM using the frequency of transiting from each of the six emotion classes to the neutral state in the training sequences, and from the neutral state to the other emotion states. For notation, the neutral state is numbered 7, and the other states are numbered as in the previous section. It should be noted that the transition probabilities from one emotion state to another emotion state that is not neutral are set to zero.

- Set the initial probability of the high-level HMM to be 1 for the neutral state and 0 for all other states. This forces the model to always start at the neutral state and assumes that a person will display a neutral expression at the beginning of any video sequence. This assumption is made just for simplicity of the testing.
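The product in Eq. (4) is straightforward to compute once the per-model frequency tables are tabulated; below is a minimal sketch, assuming a hypothetical lookup table b indexed as described in the third bullet above:

```python
import numpy as np

def high_level_obs_prob(b, s, q):
    """Eq. (4): b has shape (6, 7, n_low), where b[i, s, j] is the
    tabulated frequency of model i being in state j when the true
    high-level state was s; q holds the six decoded lower-level
    states at time t."""
    return np.prod([b[i, s, q[i]] for i in range(6)])
```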

The steps followed during the testing phase are very similar to the ones followed during training. The face tracking sequence is fed into the lower-level HMMs, and a decoded state sequence is obtained using the Viterbi algorithm. The decoded lower-level state sequence O_t^h is fed into the higher-level HMM, and the observation probabilities are computed using Eq. (4). Note that in this way of computing the probability, it is assumed that the state sequences of the lower-level HMMs are independent given the true labeling of the sequence. This assumption is reasonable, since the HMMs are trained independently and on different training sequences. In addition, without this assumption, the size of B would be enormous, since it would have to account for all possible combinations of states of the six lower-level HMMs, and it would require a huge amount of training data.

Using the Viterbi algorithm again, this time for the high-level HMM, a most likely state sequence is produced. The state that the HMM was in at time t corresponds to the expressed emotion in the video sequence at time t. To make the classification result robust to undesired fast changes, a smoothing of the state sequence is done by not changing the actual classification result if the HMM did not stay in a particular state for more than T samples, where T can vary between 1 and 15 samples (assuming a 30-Hz sampling rate). The introduction of the smoothing factor T will cause a delay in the decision of the system, but of no more than T sample times.
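One way to read this smoothing rule: the reported label switches only after the decoded high-level path has remained in a new state for at least T consecutive samples. A minimal sketch (an interpretation, not the authors' code):

```python
def smooth_states(path, T=10):
    """Suppress state changes shorter than T samples in a decoded path."""
    current = path[0]            # label currently being reported
    pending, run = current, 0
    out = []
    for s in path:
        if s == pending:
            run += 1
        else:
            pending, run = s, 1  # a candidate new state begins
        if run >= T:             # it has persisted long enough
            current = pending
        out.append(current)
    return out
```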


[Figure 4: Examples of images from the video sequences used in the experiment, one frame per emotion: (a) Anger, (b) Disgust, (c) Fear, (d) Happiness, (e) Sadness, (f) Surprise.]

5 Experiments

The testing of the algorithms described in the previous sections is performed on a database of people who were instructed to display facial expressions corresponding to the six types of emotions. This database is the same as the one tested in [2], and the data collection method is described in detail in [2]. However, the classification done in [2] was on a frame-by-frame basis, whereas in this work the classification is based on an entire sequence of one displayed emotion. All the tests of the algorithms are performed on a set of five people, each one displaying six sequences of each one of the six emotions, and always coming back to a neutral state between each emotion sequence. The video was used as the input to the face tracking algorithm. We used a face tracking algorithm developed by Tao [17]. The tracking algorithm uses a 3D Bezier volume model for face tracking and outputs the values of 12 action-unit-like measurements corresponding to the motion of various regions of the face for each frame. These AU measurements are used as the input to the HMM architecture. The sampling rate was 30 Hz, and a typical emotion sequence is about 70 samples long (~2 s). Figure 4 shows one frame of each emotion for three subjects. The data was collected in an open recording scenario, where the person is asked to display the expression corresponding to the emotion being induced. This is of course not the ideal way of collecting emotion data. The ideal way would be using a hidden recording, inducing the emotion through events in the normal environment of the subject, not in a studio. The main problems with collecting the data this way are the impracticality of it and the ethical issue of hidden recording. In the following experiments, both approaches (emotion-specific HMMs and the multilevel HMM) are tested using the database. In all of the tests, leave-one-out cross validation is used to obtain the probability of error.

6 Person-Dependent Tests

A person-dependent test is tried first. Since there are six sequences of each facial expression for each person, for each test one sequence of each emotion is left out, and the rest are used as the training sequences. For the HMM-based models, several numbers of states were tried (3-12), and both the ergodic and the left-to-right with return models were tested. The results presented below are for the best configuration (an ergodic model using 11 states). Table 1 shows the recognition rate for each person for the two classifiers, and the total recognition rate averaged over the five people. Notice that the fifth person has the worst recognition rate.

The fact that subject 5 was poorly classified can be attributed to the inaccurate tracking result and a lack of sufficient variability in displaying the emotions. It can be seen that the multilevel HMM does not significantly decrease the recognition rate (and improves it in some cases), even though the input is unsegmented continuous video, in contrast to the emotion-specific HMMs, which need the segmented emotion sequences. Analysis of the confusion between the different emotions (described in detail in [4]) shows that happiness and surprise are well recognized by both classifiers, with happiness achieving near 100% and surprise approximately 90%.


Table 1. Person-dependent emotion recognition rates using the emotion-specific HMMs and the multilevel HMM.

Subject   Single HMM   Multilevel HMM
1         82.86%       80%
2         91.43%       85.71%
3         80.56%       80.56%
4         83.33%       88.89%
5         54.29%       77.14%
Total     78.49%       82.46%

The other, more 'subtle' emotions are confused with each other more frequently, with sadness being the most confused emotion. Although the other emotions are usually not confused with happiness, in some instances surprise was confused with happiness, due to the fact that the subject smiled while displaying surprise, something that does happen in real life when the surprise is a good one. These results suggest the use of a different labeling of the emotional states, on scales of positive/negative and intensity of the emotions. This 2D representation of the emotions has been described by Lang [10].

7 Person-Independent Tests

In the previous section it was seen that a good recognition rate was achieved when the training sequences were taken from the same subject as the test sequences. The main challenge is to see if this can be generalized to person-independent recognition. For this test, all of the sequences of one subject are used as the test sequences, and the sequences of the remaining four subjects are used as training sequences. This test is repeated five times, each time leaving a different person out (leave-one-out cross validation). Table 2 shows the recognition rate of the test for the two algorithms. The results indicate that in this case the multilevel HMM gave better results than the one-layered HMM, and both gave results much higher than pure chance. In general, the recognition rate is much lower than in the person-dependent case (58% at best, compared to 88%). The first reason for this drop is the fact that the subjects are very different from each other (three females, two males, and different ethnic backgrounds); hence, they display their emotions differently. In fact, the recognition rate of subject 3, an Asian woman, was the lowest in this case (36% for the multilevel HMM). Although this appears to contradict the universality of the facial expressions as studied by Ekman and Friesen [6], it shows that for practical automatic emotion recognition, considerations of gender and race play a role in the training of the system.

Table 2. Recognition rates for the person-independent test.

                   Single HMM   Multilevel HMM
Recognition rate   55%          58%

This conclusion cannot be made strongly, since the database is small. A study of a larger database of subjects could confirm or dispute this conclusion; there are suggestions in the literature on the validity of this conclusion.

8 Discussion

In this work, a new method for emotion recognition from video sequences of facial expressions was explored. The emotion-specific HMMs rely on segmentation of a continuous video into sequences of emotions (or the neutral state), whereas the multilevel HMM performs automatic segmentation and recognition from a continuous signal. The experiments on a database of five people showed that the recognition rates for a person-dependent test are very high using both methods. The recognition rates drop dramatically for a person-independent test. This implies that a larger database is needed for the training, and possibly the subjects should be classified according to some categories, such as ethnic background and gender. The tests also showed that some emotions are greatly confused with each other (anger, disgust, sadness, and fear), while happiness and surprise are usually classified well. This suggests the use of a different set of classes to get more robust classification. The classes can be positive, negative, surprise, and neutral. This scale clusters the emotions into four categories, and can improve the recognition rate dramatically.

One of the main drawbacks in all of the work done on emotion recognition from facial expression videos is the lack of a benchmark database for testing different algorithms. This work relied on a database collected by Chen [2], but it is difficult to compare the results to other works using different databases. The recently constructed database by Kanade et al. [9] will be a useful tool for testing these algorithms.

A useful extension of this work would be to build a real-time system comprised of a fast and accurate face tracking algorithm combined with the multilevel HMM structure. By giving this feedback to the computer, a better interaction can be achieved. This can be used in many ways. For example, it can help in education by helping children learn effectively with computers.

Recognizing the emotion from just the facial expressions is probably not accurate enough. For a computer to truly understand the emotional state of a human, other measurements probably have to be made. Voice and gestures are


widely believed to play an important role as well [2, 5], and physiological states such as heartbeat and skin conductivity are being suggested [1]. People also use context as an indicator of the emotional state of a person. This work is just another step on the way toward achieving the goal of building more effective computers that can serve us better.

References

[1] J. T. Cacioppo and L. G. Tassinary. Inferring psychological significance from physiological signals. American Psychologist, 45:16-28, January 1990.

[2] L. S. Chen. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction. PhD thesis, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.

[3] L. S. Chen, H. Tao, T. S. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proc. IEEE Workshop on Multimedia Signal Processing, pages 83-88, Los Angeles, CA, USA, Dec. 7-9, 1998.

[4] I. Cohen. Automatic facial expression recognition from video sequences using temporal information. MS thesis, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.

[5] L. C. De Silva, T. Miyasato, and R. Nakatsu. Facial emotion recognition using multimodal information. In Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS'97), pages 397-401, Singapore, Sept. 1997.

[6] P. Ekman and W. V. Friesen. Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, Palo Alto, CA, 1978.

[7] A. Garg, V. Pavlovic, J. Rehg, and T. S. Huang. Audio-visual speaker detection using dynamic Bayesian networks. In Proc. of 4th Intl Conf. Automatic Face and Gesture Rec., pages 374-471, 2000.

[8] D. Goleman. Emotional Intelligence. Bantam Books, New York, 1995.

[9] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proc. of 4th Intl Conf. Automatic Face and Gesture Rec., pages 46-53, 2000.

[10] P. Lang. The emotion probe: Studies of motivation and attention. American Psychologist, 50(5):372-385, May 1995.

[11] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62(4):1035-1072, April 1983.

[12] K. Mase. Recognition of facial expression from optical flow. IEICE Transactions, E74(10):3474-3483, October 1991.

[13] T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Proc. Int. Conf. on Image Processing (ICIP-97), pages 546-549, Santa Barbara, CA, USA, Oct. 26-29, 1997.

[14] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[15] M. Rosenblum, Y. Yacoob, and L. S. Davis. Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7(5):1121-1138, September 1996.

[16] P. Salovey and J. D. Mayer. Emotional intelligence. Imagination, Cognition and Personality, 9(3):185-211, 1990.

[17] H. Tao and T. S. Huang. Connected vibrations: A modal analysis approach to non-rigid motion tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'98), Santa Barbara, CA, USA, June 23-25, 1998.

[18] Y. Yacoob and L. S. Davis. Recognizing human facial expressions from long image sequences using optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):636-642, June 1996.
