


Unsupervised Discovery of Facial Events

Feng Zhou† Fernando De la Torre† Jeffrey F. Cohn‡

† Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213.
‡ University of Pittsburgh, Department of Psychology, Pittsburgh, Pennsylvania 15260.

Abstract

Automatic facial image analysis has been a long-standing research problem in computer vision. A key component in facial image analysis, largely conditioning the success of subsequent algorithms (e.g., facial expression recognition), is to define a vocabulary of possible dynamic facial events. To date, that vocabulary has come from the anatomically-based Facial Action Coding System (FACS) or more subjective approaches (i.e., emotion-specified expressions). The aim of this paper is to discover facial events directly from video of naturally occurring facial behavior, without recourse to FACS or other labeling schemes. To discover facial events, we propose a novel temporal clustering algorithm, Aligned Cluster Analysis (ACA), and a multi-subject correspondence algorithm for matching expressions. Image data were from posed (Cohn-Kanade) and more complex unposed (RU-FACS) facial behavior, as well as video of infants. To validate our unsupervised learning approach, we compared the resulting facial events with manual FACS labeling. We found moderate intersystem agreement with FACS. In addition, several applications for visualization and summarization of facial expression are illustrated.

1. Introduction

The face is one of the most powerful channels of nonverbal communication. Facial expression provides cues about emotional response, regulates interpersonal behavior, and communicates aspects of psychopathology. While people have believed for centuries that facial expressions can reveal what people are thinking and feeling, it is relatively recently that the face has been studied scientifically for what it can tell us about internal states, social behavior, and psychopathology.

Faces possess their own language. To represent the elemental units of this language, Ekman and Friesen [12] proposed the Facial Action Coding System (FACS) in the 1970s. FACS segments the visible effects of facial muscle activation into "action units". Each action unit is related to one or more facial muscles. The FACS taxonomy was developed by manually observing graylevel variation between expressions in images and, to a lesser extent, by recording the electrical activity of underlying facial muscles [3]. Because of its descriptive power, FACS has become the state of the art in manual measurement of facial expression and is widely used in studies of spontaneous facial behavior. In part for these reasons, much effort in automatic facial image analysis seeks to automatically recognize FACS action units [1, 28, 24, 27].

Figure 1. Selected video frames of unposed facial behavior from three participants. Different colors and shapes represent dynamic events discovered by unsupervised learning: smile (green circles) and lip compressor (blue hexagons). Dashed lines indicate correspondences between persons.

In this paper, we ask whether unsupervised learning can discover useful facial units in video sequences of one or more persons, and whether the discovered facial events correspond to manual coding of FACS action units. We propose an unsupervised temporal clustering algorithm, Aligned Cluster Analysis (ACA). ACA is an extension of kernel k-means to clustering multi-dimensional time series. Using this unsupervised learning approach, it is possible to find meaningful dynamic clusters of similar facial expressions in one individual and correspondences between facial events across individuals. Image data are from a variety of sources: posed facial behavior from the Cohn-Kanade database, unscripted facial behavior from the RU-FACS database, and infants observed with their mothers. Fig. 1 illustrates the main idea of the paper for the case of the RU-FACS database. In addition, we show how our algorithms for temporal clustering of facial events can be used for summarization and visualization.

2. Temporal segmentation and clustering of human behavior

This section describes previous work on temporal clustering and segmentation of facial and human behavior.

With few exceptions, previous work on facial expression or action unit recognition has been supervised in nature (i.e., event categories are defined in advance in labeled training data; see [1, 28, 24, 27] for a review of state-of-the-art algorithms). Little attention has been paid to the problem of unsupervised temporal segmentation or clustering prior to recognition. In a pioneering study, Mase and Pentland [19] found that zero crossings in the velocity contour of facial motion are useful for temporal segmentation of visual speech. Hoey [15] presented a multilevel Bayesian network to learn the dynamics of facial expression in a weakly supervised manner. Bettinger et al. [2] used an AAM to learn the dynamics of person-specific facial expression models. Zelnik-Manor and Irani [34] proposed a modification of structure-from-motion factorization to temporally segment rigid and non-rigid facial motion. De la Torre et al. [8] proposed a geometric-invariant clustering algorithm to decompose a stream of one person's facial behavior into facial gestures. Their approach suggested that unusual facial expressions might be detected through temporal outlier patterns. In summary, previous work in facial expression addresses temporal segmentation of facial expression in a single person. The current work extends previous approaches to unsupervised temporal clustering across individuals.

Outside of the facial expression literature, unsupervised temporal segmentation and clustering of human and animal behavior has been addressed by several groups. Zelnik-Manor and Irani [33] extracted spatio-temporal features at multiple temporal scales to isolate and cluster events. Guerra-Filho and Aloimonos [13] presented a linguistic framework to learn human activity representations; the low-level representation of their framework, motion primitives referred to as kinetemes, was proposed as the foundation for a kinetic language. Yin et al. [31] proposed a discriminative feature selection method to discover a set of temporal segments, or units, in American Sign Language (ASL); these units could be distinguished with sufficient reliability to improve accuracy in ASL recognition. Wang et al. [30] used deformable template matching of shape and context in static images to discover action classes. Turaga et al. [29] presented a cascade of dynamical systems to cluster a video sequence into activities. Niebles et al. [22] proposed an unsupervised method to learn human action categories: they represented video as a bag-of-words model of space-time interest points, used latent topic models to learn their probability distribution, and intermediate topics corresponded to human action categories. Oh et al. [23] proposed parametric segmental switching dynamical models to segment honeybee behavior. Related work in temporal segmentation has been done, as well, in the areas of data mining [17] and change-point detection [14]. Unlike previous approaches, we propose extensions of Aligned Cluster Analysis (ACA). ACA generalizes kernel k-means to cluster time series, providing a simple yet effective and robust method to cluster multi-dimensional time series with few parameters to tune.

3. Facial feature tracking and image features

Over the last decade, appearance models [4, 20] have become increasingly prominent in computer vision. In the work below, we use AAMs [20] to detect and track facial features, and to align faces and extract features. Fig. 2a shows an example using image data from RU-FACS [1].

Sixty-six facial features and the related face texture are tracked throughout an image sequence. To register images to a canonical view and face, a normalization step registers each image with respect to an average face. After the normalization step, we build shape and appearance features for the upper and lower face regions. Shape features include $x^U_1$, the distance between inner brow and eye; $x^U_2$, the distance between outer brow and eye; $x^U_3$, the height of the eye; $x^L_1$, the height of the lip; $x^L_2$, the height of the teeth; and $x^L_3$, the angle of the mouth corners. Appearance features are composed of SIFT descriptors computed at points around the outer outline of the mouth (at 11 locations) and on the eyebrows (5 points). The dimensionality of the resulting feature vector is reduced using PCA to retain 95% of the energy, yielding appearance features for the upper ($x^U_4$) and lower ($x^L_4$) face. For the task of clustering emotions, features from both face parts were used to obtain a holistic representation of the face. For more precise facial action segmentation, each face part was considered individually. See Fig. 2b for an illustration of the feature extraction process.
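The PCA step above is a standard energy-retention projection. As a concrete illustration, here is a minimal sketch of reducing stacked appearance descriptors to the components that retain 95% of the variance; this is our own illustrative code (function name and interface are ours), not the authors' implementation:

```python
import numpy as np

def pca_energy(X, energy=0.95):
    """Project rows of X (n_samples x d) onto the top principal
    components that retain the given fraction of total variance."""
    Xc = X - X.mean(axis=0)                      # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S**2) / np.sum(S**2)       # cumulative energy per component
    k = int(np.searchsorted(ratio, energy)) + 1  # smallest k reaching the target
    return Xc @ Vt[:k].T                         # reduced appearance features
```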

4. Aligned Cluster Analysis (ACA)

This section describes Aligned Cluster Analysis (ACA), an extension of kernel k-means to clustering time series. ACA combines kernel k-means with the Dynamic Time Alignment Kernel (DTAK). A preliminary version of ACA was presented in [35].


Figure 2. Facial features used for temporal clustering. (a) AAM fitting across different subjects. (b) Eight different features extracted from distances between tracked points, heights of facial parts, angles of mouth corners, and appearance patches.

4.1. Dynamic time alignment kernel (DTAK)

To align time series, a frequent approach is Dynamic Time Warping (DTW). A known drawback of using this measure as a distance is that it fails to satisfy the triangle inequality. To address this issue, Shimodaira et al. [26] extended DTW to satisfy the Cauchy-Schwarz inequality and proposed the Dynamic Time Alignment Kernel (DTAK). Given two sequences, $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$ (see notation^1) and $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$, DTAK, $\tau(\mathbf{X}, \mathbf{Y})$, is defined as:

$$\tau = \max_{\mathbf{Q}} \sum_{c=1}^{l} \frac{1}{n_x + n_y}\,(q_{1c} - q_{1,c-1} + q_{2c} - q_{2,c-1})\,\kappa_{q_{1c} q_{2c}},$$

where $\kappa_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{y}_j)$ represents the kernel similarity between frames $\mathbf{x}_i$ and $\mathbf{y}_j$. $\mathbf{Q} \in \mathbb{R}^{2 \times l}$ is an integer matrix that contains the indexes of the alignment path between $\mathbf{X}$ and $\mathbf{Y}$: if the $c$th column of $\mathbf{Q}$ is $[q_{1c}\; q_{2c}]^T$, the $q_{1c}$th frame in $\mathbf{X}$ corresponds to the $q_{2c}$th frame in $\mathbf{Y}$. $l$ is the number of steps needed to align both signals.

DTAK finds the path that maximizes the weighted sum of the similarity between sequences. A more revealing mathematical expression can be achieved by considering a new normalized correspondence matrix $\mathbf{W} \in \mathbb{R}^{n_x \times n_y}$, where $w_{ij} = \frac{1}{n_x + n_y}(q_{1c} - q_{1,c-1} + q_{2c} - q_{2,c-1})$ if $q_{1c} = i$ and $q_{2c} = j$ for some $c$, and $w_{ij} = 0$ otherwise. Then DTAK can be rewritten:

$$\tau(\mathbf{X}, \mathbf{Y}) = \mathrm{tr}(\mathbf{K}^T \mathbf{W}) = \psi(\mathbf{X})^T \psi(\mathbf{Y}), \quad (1)$$

where $\psi(\cdot)$ denotes a mapping of the sequence into a feature space, and $\mathbf{K} \in \mathbb{R}^{n_x \times n_y}$. More details are given in [36].

^1 Bold capital letters denote a matrix $\mathbf{X}$; bold lower-case letters a column vector $\mathbf{x}$; all non-bold letters denote scalar variables. $\mathbf{x}_j$ represents the $j$th column of the matrix $\mathbf{X}$. $x_{ij}$ denotes the scalar in row $i$ and column $j$ of the matrix $\mathbf{X}$. $\mathbf{I}_k \in \mathbb{R}^{k \times k}$ denotes the identity matrix. $\|\mathbf{x}\|_2^2$ denotes the squared Euclidean norm of the vector $\mathbf{x}$. $\mathrm{tr}(\mathbf{X}) = \sum_i x_{ii}$ is the trace of the matrix $\mathbf{X}$. $\|\mathbf{X}\|_F^2 = \mathrm{tr}(\mathbf{X}^T\mathbf{X}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^T)$ designates the Frobenius norm of a matrix. $\circ$ denotes the Hadamard or point-wise product.
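For intuition, DTAK can be computed with the same dynamic program as DTW, replacing min-cost distances by max-similarity kernel values; the step weights (1 for a horizontal or vertical move, 2 for a diagonal move) correspond to the increments $q_{1c} - q_{1,c-1} + q_{2c} - q_{2,c-1}$ above. A minimal sketch under that reading, assuming a user-supplied frame kernel (the function names are ours):

```python
import numpy as np

def dtak(X, Y, kernel):
    """Dynamic Time Alignment Kernel between X (d x nx) and Y (d x ny).
    `kernel(x, y)` is the frame similarity kappa, e.g. a Gaussian kernel."""
    nx, ny = X.shape[1], Y.shape[1]
    K = np.array([[kernel(X[:, i], Y[:, j]) for j in range(ny)]
                  for i in range(nx)])
    G = np.full((nx, ny), -np.inf)
    G[0, 0] = 2 * K[0, 0]                    # first step is diagonal: weight 2
    for i in range(nx):
        for j in range(ny):
            if i == 0 and j == 0:
                continue
            best = -np.inf
            if i > 0:
                best = max(best, G[i - 1, j] + K[i, j])          # weight-1 step
            if j > 0:
                best = max(best, G[i, j - 1] + K[i, j])          # weight-1 step
            if i > 0 and j > 0:
                best = max(best, G[i - 1, j - 1] + 2 * K[i, j])  # weight-2 step
            G[i, j] = best
    return G[-1, -1] / (nx + ny)             # normalization by nx + ny
```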

4.2. k-means and kernel k-means

Clustering refers to the partition of $n$ data points into $k$ disjoint clusters. Among the various approaches to unsupervised clustering, k-means [18, 11] and kernel k-means (KKM) [10, 32] are among the simplest and most popular. k-means and KKM split a set of $n$ objects into $k$ groups by minimizing the within-cluster variation. That is, kernel k-means finds the partition of the data that is a local optimum of the energy function [11, 7]:

$$J_{kkm}(\mathbf{M}, \mathbf{G}) = \|\psi(\mathbf{X}) - \mathbf{M}\mathbf{G}\|_F^2, \quad (2)$$

where $\mathbf{X} \in \mathbb{R}^{d \times n}$, $\mathbf{G} \in \mathbb{R}^{k \times n}$ and $\mathbf{M} \in \mathbb{R}^{d \times k}$. $\mathbf{G}$ is an indicator matrix such that $\sum_c g_{ci} = 1$, $g_{ci} \in \{0, 1\}$, and $g_{ci} = 1$ if $\mathbf{x}_i$ belongs to class $c$; $n$ denotes the number of samples. The columns of $\mathbf{X}$ contain the original data points, and the columns of $\mathbf{M}$ represent the cluster centroids; $d$ is the dimension of the kernel mapping. In the case of KKM, $d$ can be infinite and typically $\mathbf{M}$ cannot be computed explicitly. Substituting the optimal value of $\mathbf{M}$ into eq. (2) results in:

$$J_{kkm}(\mathbf{G}) = \mathrm{tr}(\mathbf{L}\mathbf{K}), \quad \mathbf{L} = \mathbf{I}_n - \mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}. \quad (3)$$

The KKM method typically uses a local search [10] to find a matrix $\mathbf{G}$ that is maximally correlated with the sample kernel matrix $\mathbf{K} = \psi(\mathbf{X})^T\psi(\mathbf{X})$.
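Because $\mathbf{M}$ lives in feature space, KKM is implemented entirely through the kernel matrix: the squared distance from a point to a centroid expands into kernel averages. The following is a minimal sketch of this standard iteration (our code, not the authors'), useful as a baseline against which ACA is compared in Section 5:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=100, seed=0):
    """Kernel k-means on an n x n kernel matrix K; returns cluster labels."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(k, size=n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue
            # ||phi(x_i) - m_c||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged
            break
        labels = new_labels
    return labels
```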

4.3. ACA objective function

Given a sequence $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ with $n$ samples, ACA decomposes $\mathbf{X}$ into $m$ disjoint segments, each of which corresponds to one of $k$ classes. The $i$th segment, $\mathbf{Y}_i = [\mathbf{x}_{s_i}, \cdots, \mathbf{x}_{s_{i+1}-1}] = \mathbf{X}_{[s_i, s_{i+1})} \in \mathbb{R}^{d \times w_i}$, is composed of the samples that begin at position $s_i$ and end at $s_{i+1} - 1$. The length of each segment is constrained as $s_{i+1} - s_i \leq n_{max}$, where $n_{max}$ is the maximum segment length and controls the temporal granularity of the factorization. An indicator matrix $\mathbf{G} \in \{0, 1\}^{k \times m}$ assigns each segment to a class; $g_{ci} = 1$ if $\mathbf{Y}_i$ belongs to class $c$.

ACA combines kernel k-means with DTAK to achieve temporal clustering by minimizing:

$$J_{aca}(\mathbf{G}, \mathbf{M}, \mathbf{s}) = \|\psi(\mathbf{X}_{[s_i, s_{i+1})}) - \mathbf{M}\mathbf{G}\|_F^2. \quad (4)$$

Observe that the only difference between KKM and ACA is the introduction of the variable $\mathbf{s}$ that determines the start and end of each segment. $\psi(\cdot)$ is a non-linear mapping such that $\tau_{ij} = \psi(\mathbf{Y}_i)^T\psi(\mathbf{Y}_j) = \mathrm{tr}(\mathbf{K}_{ij}^T\mathbf{W}_{ij})$ is the DTAK.
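To make the bookkeeping concrete, the boundaries $\mathbf{s}$ and per-segment labels fully determine the indicator matrices $\mathbf{G}$ above and the frame-segment indicator $\mathbf{H}$ used in eq. (5) below. A small sketch (our naming; boundaries are 0-indexed with $s_{m+1} = n$):

```python
import numpy as np

def indicator_matrices(s, labels, k, n):
    """Build the segment-class indicator G (k x m) and the frame-segment
    indicator H (m x n) from boundaries s = [s_1, ..., s_{m+1}]."""
    m = len(s) - 1
    G = np.zeros((k, m), dtype=int)
    H = np.zeros((m, n), dtype=int)
    for i in range(m):
        G[labels[i], i] = 1          # segment i belongs to class labels[i]
        H[i, s[i]:s[i + 1]] = 1      # frames s_i .. s_{i+1}-1 form segment i
    return G, H
```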


Figure 3. Example of temporal clustering. (a) 1-D sequence. (b) Results of temporal clustering. (c) Self-similarity matrix (K). (d) Correspondence matrix (W). (e) Frame-segment indicator matrix (H). (f) Segment-class indicator matrix (G).

Observe that there are two kernel matrices: $\mathbf{T} \in \mathbb{R}^{m \times m}$ is the kernel segment matrix and $\mathbf{K} \in \mathbb{R}^{n \times n}$ is the kernel sample matrix. Unfortunately, DTAK is not a strictly positive definite kernel [6]. Thus, we add a scaled identity matrix to the sample kernel matrix, that is, $\mathbf{K} \leftarrow \mathbf{K} + \sigma\mathbf{I}_n$, where $\sigma$ is chosen to be the absolute value of the smallest eigenvalue of the kernel segment matrix if it has negative eigenvalues.

After substituting the optimal value of $\mathbf{M}$ in eq. (4), a more revealing form of $J_{aca}$ can be derived by rearranging the $m \times m$ blocks $\mathbf{W}_{ij} \in \mathbb{R}^{w_i \times w_j}$ into a global correspondence matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$:

$$J_{aca}(\mathbf{G}, \mathbf{s}) = \mathrm{tr}((\mathbf{L} \circ \mathbf{W})\mathbf{K}), \quad (5)$$

where $\mathbf{L} = \mathbf{I}_n - \mathbf{H}^T\mathbf{G}^T(\mathbf{G}\mathbf{G}^T)^{-1}\mathbf{G}\mathbf{H}$. $\mathbf{H} \in \{0, 1\}^{m \times n}$ is the frame-segment indicator matrix; $h_{ij} = 1$ if the $j$th frame belongs to the $i$th segment. Fig. 3 illustrates the matrices $\mathbf{K}$, $\mathbf{H}$, $\mathbf{W}$ and $\mathbf{G}$ in a synthetic example of temporal clustering. Consider the special case when $m = n$ and $\mathbf{H} = \mathbf{I}_n$; that is, each frame is treated as a segment. In this case, DTAK reduces to a kernel between two frames, i.e., $\mathbf{W} = \mathbf{1}_n\mathbf{1}_n^T$, and ACA is equivalent to kernel k-means, eq. (3).

Optimizing ACA is a non-convex problem. We use a coordinate descent strategy that alternates between optimizing $\mathbf{G}$ and $\mathbf{s}$ while implicitly computing $\mathbf{M}$. Given a sequence $\mathbf{X}$ of length $n$, the number of possible segmentations is exponential (i.e., $O(2^n)$), which typically renders a brute-force search infeasible. We adopt a dynamic programming (DP) based algorithm with complexity $O(n^2 n_{max})$ to exhaustively examine all possible segmentations. See [36] for more details on the optimization and the code.
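The DP step of the coordinate descent can be sketched as follows. Suppose (our assumption for illustration) that, with the current clusters fixed, seg_cost(b, e) returns the minimal DTAK-based distance from candidate segment $\mathbf{X}_{[b,e)}$ to any cluster center; the best segmentation then satisfies a simple recurrence over segment end points. This is an illustrative sketch under those assumptions, not the released code of [36]:

```python
def segment_dp(n, nmax, seg_cost):
    """Optimal segmentation of n frames into segments of length <= nmax.
    seg_cost(b, e) returns the cost of assigning segment [b, e) to its
    best cluster. Returns the list of (start, end) boundaries."""
    INF = float("inf")
    J = [INF] * (n + 1)          # J[e]: best cost of segmenting frames [0, e)
    J[0] = 0.0
    back = [0] * (n + 1)
    for e in range(1, n + 1):
        for w in range(1, min(nmax, e) + 1):     # try all admissible lengths
            cost = J[e - w] + seg_cost(e - w, e)
            if cost < J[e]:
                J[e], back[e] = cost, e - w
    bounds, e = [], n                            # backtrack the boundaries
    while e > 0:
        bounds.append((back[e], e))
        e = back[e]
    return bounds[::-1]
```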

4.4. Learning good features for ACA

The success of kernel machines largely depends on the choice of kernel parameters and the functional form of the kernel. As in previous work on multiple kernel learning [9, 5], we consider the frame kernel as a positive combination of multiple kernels, that is, $\mathbf{K}(\mathbf{a}) = \sum_{l=1}^{d} a_l\mathbf{K}_l$, s.t. $\mathbf{a} \geq \mathbf{0}_d$, where the set $\{\mathbf{K}_1, \cdots, \mathbf{K}_d\}$ is given and the $a_l$'s are to be optimized.

In the ideal case [9, 5], the kernel function outputs a similarity of 1 if two samples belong to the same class and 0 otherwise. In the case of temporal segmentation, the label of the $i$th frame is given by $\mathbf{G}\mathbf{h}_i$. Assuming that all labels ($\mathbf{G}$, $\mathbf{H}$, $\mathbf{W}$) are known, we minimize the distance between the ideal kernel matrix and the parameterized one, that is:

$$J_{learn}(\mathbf{a}) = \|\mathbf{W} \circ (\mathbf{F} - \mathbf{K}(\mathbf{a}))\|_F^2, \quad (6)$$

where $\mathbf{F} = \mathbf{H}^T\mathbf{G}^T\mathbf{G}\mathbf{H}$, and the correspondence matrix $\mathbf{W}$ individually weights the pairs of frames used in the calculation of DTAK.

To optimize $J_{learn}$ with respect to $\mathbf{a}$, we rewrite eq. (6) as a quadratic programming problem: $J_{learn} = \mathbf{a}^T\mathbf{Z}\mathbf{a} - 2\mathbf{f}^T\mathbf{a} + c$, where $z_{ij} = \mathrm{tr}((\mathbf{W} \circ \mathbf{K}_i)^T(\mathbf{W} \circ \mathbf{K}_j))$, $f_i = \mathrm{tr}((\mathbf{W} \circ \mathbf{F})^T(\mathbf{W} \circ \mathbf{K}_i))$, and $c$ is a constant independent of $\mathbf{a}$.

5. Experiments

This section first reports experimental results for unsupervised temporal segmentation of facial behavior and compares them with emotion and FACS labels in two scenarios: first for individual subjects and then for sets of subjects. The latter explores generalizability, or correspondences between facial events for different individuals. We use a variety of video sources: posed facial behavior from the Cohn-Kanade database [16], unscripted facial behavior from the RU-FACS database [1], and infants observed with their mothers [21].

5.1. Facial event discovery for individual subjects

This section describes two experiments in facial event discovery for one individual. The first experiment compares the clustering results with those provided by FACS. The second experiment uses unsupervised temporal clustering to summarize facial behavior in an infant.

5.1.1 RU-FACS database

The RU-FACS database [1] consists of digitized video and manual FACS coding of 34 young adults. They were recorded during an interview of approximately 2 minutes' duration in which they lied or told the truth in response to an interviewer's questions. Pose was mostly frontal with small to moderate out-of-plane head motion. Image data from five subjects could not be analyzed due to image artifacts; thus, image data from 29 subjects were used. We randomly selected 10 sets of 19 subjects for training and 10 subjects for testing. We compared the performance of unsupervised ACA, ACA with learned weights (ACA+Learn), and kernel k-means (KKM). We used 8 features (4 upper face and 4 lower face) as described in Section 3.

For each realization (10 in total), we used 10 random subjects from the RU-FACS database and ran ACA, ACA+Learn, and KKM 10 times for each subject starting from different initializations, selecting the solution with least error. Because the number and frequency of action units varied among subjects, and to investigate the stability of the clustering w.r.t. the number of clusters, between 8 and 11 clusters were selected for the lower face and between 4 and 7 for the upper face. The clustering results are the average over all clusters. The length constraint for facial actions was set to $n_{max} = 80$. Accuracy is computed as the percentage of temporal clusters found by ACA that contain the same AU or AU combination, using the confusion matrix:

$$C(c_1, c_2) = \sum_{i=1}^{m_{alg}} \sum_{j=1}^{m_{truth}} g^{alg}_{c_1 i}\, g^{truth}_{c_2 j}\, |\mathbf{Y}^{alg}_i \cap \mathbf{Y}^{truth}_j|, \quad (7)$$

where $\mathbf{Y}^{alg}_i$ is the $i$th segment returned by ACA (or KKM), and $\mathbf{Y}^{truth}_j$ is the $j$th segment of the ground-truth data. $C(c_1, c_2)$ represents the total number of frames in temporal cluster $c_1$ that are shared with cluster $c_2$ in the ground truth. $g^{alg}_{c_1 i}$ is a binary value that indicates whether the $i$th segment is assigned to the $c_1$th temporal cluster by ACA. $|\mathbf{Y}^{alg}_i \cap \mathbf{Y}^{truth}_j|$ denotes the number of frames the two segments share. The Hungarian algorithm is applied to find the optimal solution to the cluster correspondence problem. Empty rows or columns are inserted if the number of clusters differs from the ground truth, i.e., $k_{alg} \neq k_{truth}$. Because multiple AUs can occur in the same frame, we consider AU combinations as distinct temporal clusters. We consider AUs with a minimum duration of 10 video frames. Any frames in which no AUs occurred were omitted.
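In practice, this matching step can be carried out with an off-the-shelf Hungarian solver on the confusion matrix of eq. (7), padding with empty rows or columns when the cluster counts differ. A minimal sketch (our code, not the authors'):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(C):
    """Accuracy from a confusion matrix C (found clusters x true clusters),
    where C[c1, c2] counts frames shared by clusters c1 and c2 (eq. 7)."""
    k = max(C.shape)
    P = np.zeros((k, k))                    # pad if k_alg != k_truth
    P[:C.shape[0], :C.shape[1]] = C
    rows, cols = linear_sum_assignment(-P)  # Hungarian; maximize matched frames
    return P[rows, cols].sum() / P.sum()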

Fig. 4b shows the mean accuracy and variance of the temporal clustering for both the unsupervised and supervised (learned weights) versions of ACA and for KKM. The clustering accuracy was about 60% for lower face AUs and above 80% for upper face AUs. As expected, learning weights for ACA improved the temporal clustering in the upper and lower face, although not substantially. The mean and variance of the weights for all the features in the lower and upper face are shown in Fig. 4a. As expected, the weights give more importance to the appearance features. Fig. 4c shows a representative lower-face confusion matrix for subject S014. The AUs were: AU 50 or AU 25, AU 50+12, AU 14, AU 12, AU 12+25+26, AU 12+15 and 17. More details are given in [36].

5.1.2 Summarization of infant facial expression

This experiment shows an application of the proposed techniques to summarizing the facial expression of an infant. Infant facial behavior is known to be more temporally complex than that of adults. Fig. 5 shows the results of running unsupervised ACA with 10 clusters on 1000 frames of a 6-month-old infant who was recorded while she interacted with her mother [21].

Figure 4. Performance of person-specific segmentation on the RU-FACS database. (a) Mean and standard deviation of the feature weights. (b) Temporal clustering accuracy. (c) Confusion matrix for subject S014.

Figure 6. Clustering of 5 different facial expressions. (a) Embedding of segments by ACA. (b) Embedding of frames by kernel PCA. (c) Embedding of frames by PCA. (d) Clustering accuracy. (e) Temporal clustering accuracy.

We used the appearance and shape features for the eyes and mouth. These 10 clusters provide a summarization of the infant's facial events.

5.2. Facial event discovery for multiple subjects

In this section we test the ability of ACA to cluster facial behavior corresponding to different subjects. We first report results for posed facial actions. We then report results for the more challenging case of unposed, naturally occurring facial behavior in an interview setting.

5.2.1 Cohn-Kanade (CK) database

ACA was used to segment the facial expression data from the Cohn-Kanade database [16]. The database contains recordings of facial behavior for 100 adults between 18 and 50 years of age. We selected the facial behavior coded with five emotion labels: surprise, sadness, anger, fear and joy.

There are small changes in pose and illumination; all expressions are brief (about 20 frames on average), begin at neutral and proceed to a target expression, and are well differentiated relative to unposed facial behavior in a naturalistic context (e.g., RU-FACS). Following person-specific AAM tracking as described above, we used the 6 shape features described in Section 3. All the features were normalized by dividing them by their values in the first frame. The frame kernel was computed as a linear combination of 6 kernels with equal weights.


Figure 5. Temporal clustering of an infant's facial behavior. Each facial gesture is coded with a different color. The feature shown is the angle of the mouth. Observe how frames of the same cluster correspond to similar facial expressions.


In the first experiment on the CK database we test the ability of ACA to temporally cluster several expressions of 30 randomly selected subjects (the number of facial expressions varies across subjects). Fig. 6e shows the mean (and variance) results for 10 realizations. The number of clusters is five and $n_{max} = 25$. Both ACA and KKM are initialized 10 times and the solution with least error is selected. Again, ACA outperforms KKM. Fig. 7 shows one example of the temporal clustering achieved by unsupervised ACA.

The second experiment explores the use of ACA for visualization of facial events. Fig. 6a shows the ACA embedding of 112 sequences from 30 randomly selected subjects (different expressions). The embedding is done by computing the first three eigenvectors of the kernel segment matrix (T). In this experiment, the kernel segment matrix is computed using the ground-truth data (expression labels). Each point represents a video segment of facial expression. Fig. 6b and Fig. 6c show the embeddings found by kernel PCA and PCA using independent frames (the frames are embedded using the first three eigenvectors of the kernel sample matrix K). Because each frame represents a point, it is harder to visualize the temporal structure of the facial events. To test the quality of the embedding for clustering, we randomly generated 10 sets of facial expressions for 30 subjects. For each set, the ground-truth label is known and the "optimal" three-dimensional embedding is computed. Then we ran KKM to cluster the data into five clusters. The results (mean and variance) of the clustering are shown in Fig. 6d. As expected, the segment embedding provided by ACA achieves higher clustering accuracy than kernel PCA or PCA on independent frames.
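The three-dimensional embedding amounts to a spectral decomposition of the kernel segment matrix. A sketch of one common convention (our code; the eigenvector scaling by the square root of the eigenvalue is our assumption, in the style of kernel PCA, and is not specified by the paper):

```python
import numpy as np

def kernel_embedding(T, dims=3):
    """Embed segments using the top eigenvectors of the kernel segment matrix T."""
    vals, vecs = np.linalg.eigh(T)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dims]      # indices of the largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```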

5.2.2 RU-FACS database

This section tests the ability of ACA to discover dynamic facial events in a more challenging database of naturally occurring (i.e., not posed) facial behavior of multiple people. Several issues contribute to the challenge of this task on the RU-FACS database, including non-frontal pose, moderate out-of-plane head motion, large variability in the temporal scale of facial gestures, variability between persons, and the exponential number of possible facial action combinations.

Figure 8. Temporal clustering across individuals on the RU-FACS database.

To address this challenging scenario, two strategies are considered: (1) ACA+CAT: concatenate all videos and run ACA on the concatenated video sequence; (2) ACA+MDA: run ACA independently on each sequence and solve for the correspondence of clusters across people using the Multidimensional Assignment algorithm (MDA) [25]. We propose a heuristic approach to the multidimensional assignment problem called Pairwise Approximation Multidimensional Assignment (PA-MDA). Details of the algorithm are given in [36].

Using the same features as in Section 5.1.1, we randomly selected 10 sets of 5 people and report the mean and variance of the clustering results. For ACA+MDA, we kept the same parameter settings as in the previous segmentation of one subject. The number of clusters in ACA+CAT was set to between 14 and 17, and the length constraint is the same as before (80). As shown in Fig. 8, ACA+MDA achieved more accurate segmentation than ACA+CAT. Moreover, ACA+MDA scales better when clustering many videos; recall that ACA+CAT scales quadratically in space and time, which can be a limitation when processing many subjects. As expected, the clustering performance is lower than when clustering only one individual.

Fig. 9a shows the results of temporal segmentation achieved by ACA on subjects S012, S028 and S049. Each color denotes a temporal cluster discovered by ACA. Fig. 9b shows some of the dynamic vocabularies for facial expression analysis discovered by ACA+MDA. The algorithm correctly discovers smiling, silence, and talking as different facial events. Visual inspection of all subjects' data suggests that the vocabulary of facial events is moderately consistent with human evaluation. More details and updated results are given in [36].


Figure 7. (a) Mouth angle; blue dots correspond to frames. (b) Manual labels, unsupervised ACA, and KKM.

Figure 9. (a) Results obtained by ACA on subjects S012, S028 and S049. (b) Corresponding frames found by ACA+MDA.


6. Conclusions and future work

At present, taxonomies of facial expression are based on FACS or other observer-based schemes. Consequently, approaches to automatic facial expression recognition depend on access to corpora of FACS-labeled or similarly labeled video. This is a significant concern, in that recent work suggests that extremely large corpora of labeled data may be needed to train robust classifiers. This paper raises the question of whether facial actions can be learned directly from video in an unsupervised manner.

We developed a method for temporal clustering of facial behavior that solves for correspondences between dynamic events and has shown promising concurrent validity with manual FACS. In experimental tests using the RU-FACS database, agreement between facial actions identified by unsupervised analysis of face dynamics and FACS approached the level of agreement that has been found between independent FACS coders. These findings suggest that unsupervised learning of facial expression is a promising alternative to supervised learning of FACS-based actions. At least three benefits follow. First is the prospect that automatic facial expression analysis may be freed from its dependence on observer-based labeling. Second, because the current approach is fully empirical, it can potentially identify regularities in video that have not been anticipated by top-down approaches such as FACS; new discoveries become possible. This becomes especially important as automatic facial expression analysis increasingly develops new metrics, such as system dynamics, not easily captured by observer-based labeling. Third, similar benefits may accrue in other areas of image understanding of human behavior. Recent efforts to develop vocabularies and grammars of human actions [13] depend on advances in unsupervised learning; the current work may contribute to this effort. Current challenges include how best to scale ACA for very large databases and how to increase accuracy for subtle facial actions. We are especially interested in applications of ACA to the detection of anomalous actions and to efficient image indexing and retrieval.

Acknowledgements. This work was partially supported by National Institutes of Health Grant R01 MH 051435. Thanks to Tomas Simon for helpful discussions and Iain Matthews for assistance with the AAM code.

References

[1] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Automatic recognition of facial actions in spontaneous expressions. J. Multimedia, 2006.
[2] F. Bettinger, T. F. Cootes, and C. J. Taylor. Modelling facial behaviours. In BMVC, 2002.
[3] J. F. Cohn, Z. Ambadar, and P. Ekman. Observer-based measurement of facial expression with the Facial Action Coding System. In The Handbook of Emotion Elicitation and Assessment, 2007.
[4] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In ECCV, 1998.
[5] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In NIPS, 2002.
[6] M. Cuturi, J.-P. Vert, O. Birkenes, and T. Matsui. A kernel for time series based on global alignments. In ICASSP, 2007.
[7] F. de la Torre. A unification of component analysis methods. In Handbook of Pattern Recognition and Computer Vision (4th edition), October 2009.
[8] F. de la Torre, J. Campoy, Z. Ambadar, and J. Cohn. Temporal segmentation of facial behavior. In ICCV, 2007.
[9] F. de la Torre and O. Vinyals. Learning kernel expansions for image classification. In CVPR, 2007.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. PAMI, 29(11):1944–1957, 2007.
[11] C. Ding and X. He. K-means clustering via principal component analysis. In ICML, 2004.
[12] P. Ekman and W. Friesen. Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press, 1978.
[13] G. Guerra-Filho and Y. Aloimonos. A language for human action. IEEE Computer, 40(5):42–51, 2007.
[14] Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. In NIPS, 2009.
[15] J. Hoey. Hierarchical unsupervised learning of facial expression categories. In IEEE Workshop on Detection and Recognition of Events in Video, 2001.
[16] T. Kanade, Y.-L. Tian, and J. F. Cohn. Comprehensive database for facial expression analysis. In AFGR, 2000.
[17] E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In ICDM, 2001.
[18] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[19] K. Mase and A. Pentland. Automatic lipreading by computer. Trans. Inst. Elect. Information and Communication Eng., 1990.
[20] I. Matthews and S. Baker. Active appearance models revisited. IJCV, 60(2):135–164, 2004.
[21] D. S. Messinger, M. H. Mahoor, S. M. Chow, and J. F. Cohn. Automated measurement of facial expression in infant-mother interaction: A pilot study. Infancy, 14(3):285–305, 2009.
[22] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79:299–318, 2008.
[23] S. M. Oh, J. M. Rehg, T. Balch, and F. Dellaert. Learning and inferring motion patterns using parametric segmental switching linear dynamic systems. IJCV, 77(1-3):103–124, 2008.
[24] M. Pantic and I. Patras. Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans. Syst. Man Cybern. B, 36(2):433–449, 2006.
[25] W. P. Pierskalla. The multidimensional assignment problem. Operations Research, 16(2):422–431, 1968.
[26] H. Shimodaira, K.-I. Noma, M. Nakai, and S. Sagayama. Dynamic time-alignment kernel in support vector machine. In NIPS, 2001.
[27] T. Simon, M. H. Nguyen, F. de la Torre, and J. F. Cohn. Action unit detection with segment-based SVMs. In CVPR, 2010.
[28] Y. Tong, W. Liao, and Q. Ji. Facial action unit recognition by exploiting their dynamic and semantic relationships. PAMI, 29(10):1683–1699, 2007.
[29] P. Turaga, A. Veeraraghavan, and R. Chellappa. Unsupervised view and rate invariant clustering of video sequences. CVIU, 113:353–371, 2009.
[30] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li, and G. Mori. Unsupervised discovery of action classes. In CVPR, 2006.
[31] P. Yin, T. Starner, H. Hamilton, I. Essa, and J. M. Rehg. Learning the basic units in American Sign Language using discriminative segmental feature selection. In ICASSP, 2009.
[32] R. Zass and A. Shashua. A unifying approach to hard and probabilistic clustering. In ICCV, 2005.
[33] L. Zelnik-Manor and M. Irani. Event-based analysis of video. In CVPR, 2001.
[34] L. Zelnik-Manor and M. Irani. Temporal factorization vs. spatial factorization. In ECCV, 2004.
[35] F. Zhou, F. de la Torre, and J. K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In AFGR, 2008.
[36] F. Zhou, T. Simon, F. de la Torre, and J. F. Cohn. Unsupervised discovery of facial events. Technical Report TR-10-10, Robotics Institute, Carnegie Mellon University, May 2010.
