
Identity-Aware Convolutional Neural Network for Facial Expression Recognition

Zibo Meng*1 Ping Liu*2 Jie Cai1 Shizhong Han1 Yan Tong1

1 Department of Computer Science and Engineering, South Carolina University, USA    2 Sony Electronics, USA

Abstract— Facial expression recognition suffers under real-world conditions, especially on unseen subjects due to high inter-subject variations. To alleviate variations introduced by personal attributes and achieve better facial expression recognition performance, a novel identity-aware convolutional neural network (IACNN) is proposed. In particular, a CNN with a new architecture is employed as individual streams of a bi-stream identity-aware network. An expression-sensitive contrastive loss is developed to measure the expression similarity to ensure the features learned by the network are invariant to expression variations. More importantly, an identity-sensitive contrastive loss is proposed to learn identity-related information from identity labels to achieve identity-invariant expression recognition. Extensive experiments on three public databases, including a spontaneous facial expression database, have shown that the proposed IACNN achieves promising results in the real world.

I. INTRODUCTION

Facial activity is the most powerful and natural means for understanding emotional expression for humans. Extensive efforts have been devoted to facial expression recognition in the past decades [31], [51], [36]. An automatic facial expression recognition system is desired in emerging applications in human-computer interaction (HCI), such as online/remote education, interactive games, and intelligent transportation.

Although great progress has been made from posed or deliberate facial displays, facial expression recognition in the real world suffers from various factors including unconstrained face pose, illumination change, and high inter-subject variations. Moreover, recognition performance usually degrades on unseen subjects, primarily due to high inter-subject variations introduced by age, gender, and especially person-specific characteristics associated with identity, as discussed in [43]. Since these factors are nonlinearly coupled with facial expressions in a multiplicative way, features extracted through existing methods are not purely related to expressions. As illustrated in Fig. 1, I1 and I2 are the same subject displaying different expressions, whereas I1 and I3 are different subjects displaying the same expression. For facial expression recognition, it is desired to have the distance D1 between images I1 and I2 larger than D2 between I1 and I3 in the feature space, as in Fig. 1b. However, due to high inter-subject variations, D1 is usually smaller than D2 in the feature spaces of existing approaches, as in Fig. 1a. This motivates us to attack the challenging problem of learning and extracting person-independent and expression-discriminative features.

* indicates equal contribution.

Fig. 1: Illustration of the features learned by (a) existing methods, and (b) the proposed IACNN. Best viewed in color.

To solve this problem, we develop an identity-aware convolutional neural network (IACNN) to learn expression-related representations and, at the same time, to learn identity-related features to facilitate identity-invariant facial expression recognition.

In addition to minimizing the classification errors, a similarity metric for expression is developed and employed in the IACNN to pull the samples with the same expression together while pushing those with different expressions apart in the feature space. As a result, the intra-expression variations are reduced, while the inter-expression differences are increased. However, the learned expression representations may contain irrelevant identity information, such that the performance of facial expression recognition is affected by high inter-subject variations, as illustrated in Fig. 1a. To alleviate the effect of the inter-subject variations, a similarity metric for identity is proposed in the IACNN to learn identity-related features, which will be combined with the expression-related representations for facial expression recognition.

The architecture of the proposed IACNN is illustrated in Fig. 2. During the training process, expression-related and identity-related features are jointly estimated through a deep CNN framework, which is composed of two identical CNN streams and trained by simultaneously minimizing the classification errors while maximizing the expression and identity similarities. Specifically, given a pair of images, each of which is fed into one CNN stream, the similarity losses are computed using the expression-related and identity-related features, respectively. In addition, the classification errors in terms of expression recognition are also calculated for both images and used to fine-tune the model parameters to ensure the learned features are meaningful for expression recognition. During testing, an input image is fed into one CNN stream, and predictions are generated based on both the expression-related and the identity-related features.


Fig. 2: The architecture of the IACNN used for training, where a pair of images is input into two identical CNNs with shared weights, respectively. In each CNN, there are two FC layers, i.e., FC_exp and FC_ID, on top of the first FC layer for learning the expression-related and identity-related features, respectively. L^{Exp}_{Contrastive} / L^{ID}_{Contrastive} is a contrastive loss used to minimize the differences between the samples with the same expression/identity. With explicit expression labels, L^{Exp}_{Softmax} and L_{Softmax} are employed to ensure the learned features are meaningful for expression recognition. Loss layers are highlighted by gray blocks. Best viewed in color.

In summary, our main contributions in this paper are:

1) Developing an IACNN, which is capable of utilizing both expression-related and identity-related information for facial expression recognition;

2) Introducing a new auxiliary layer with an identity-sensitive contrastive loss to learn identity-related representations to alleviate high inter-subject variations;

3) Proposing a joint loss function, which considers classification errors of expression recognition as well as expression and identity similarities, to fine-tune the expression-related and identity-related features simultaneously.

Extensive experiments on two well-known posed facial expression databases, i.e., the Extended Cohn-Kanade database (CK+) [14], [25] and the MMI database [32], have demonstrated the effectiveness of the proposed method for facial expression recognition. Furthermore, the proposed method was evaluated on a spontaneous facial expression dataset, i.e., Static Facial Expressions in the Wild (SFEW) [6], which contains face images with large head pose variations and different illuminations and has been widely used for benchmarking facial expression recognition. Experimental results on the SFEW dataset have shown that the proposed approach achieves promising results in the real world.

II. RELATED WORK

As elaborated in the surveys [31], [51], [36], facial expression recognition has been extensively studied over the past decades. Both 2D and 3D features have been extracted from static images or image sequences to capture the appearance and geometric facial changes caused by target expressions. The features employed can be human-designed, including Histograms of Oriented Gradients (HOG) [2], Scale Invariant Feature Transform (SIFT) features [50], [5], histograms of Local Binary Patterns (LBP) [43], [3], histograms of Local Phase Quantization (LPQ) [12], and their spatiotemporal extensions [18], [38], [52], [12], [47]. Other spatiotemporal approaches, such as temporal modeling of shapes (TMS) [10], the interval temporal Bayesian network (ITBN) [46], expressionlets on a spatiotemporal manifold (STM-ExpLet) [22], and spatiotemporal covariance descriptors (Cov3D) [35], have been developed to utilize both spatial and temporal information in an image sequence.

Human-crafted features can achieve high performance on posed expression data. However, performance degrades on spontaneous data with large variations in head pose and illumination. In contrast, features can be learned in a data-driven manner by sparse coding [24], [54], [33], [28] or deep learning [34], [42], [23], [29], [20], [16], [49], [30], [13], [21], [53]. Most recently, CNN-based deep learning approaches [16], [49], [30] have been demonstrated to be more robust to real-world conditions in EmotiW2015 [6].

Most of the aforementioned approaches focus on improving person-independent recognition, while a few of them learn models from person-specific data. For example, Chen et al. [3] learn a person-specific model from a few person-specific samples based on transfer learning.

Existing approaches learn a similarity metric using either human-designed features [17], [37] or deep metric learning such as CNNs with pairwise constraints [41], [4], [53] or triplet constraints [45].


Fig. 3: The proposed Exp-Net for facial expression recognition includes a pair of identical component CNNs, whose weights are shared. As depicted in the dashed rectangle, the component CNN includes three convolutional layers, each of which is followed by a PReLU layer and a BN layer. A max pooling layer is employed after each of the first two BN layers. Following the third convolutional layer, two FC layers are used to generate the representation for each input sample. Finally, a softmax loss layer (L^{Exp}_{Softmax}) is employed to produce the distribution over the target expressions and to calculate the classification errors for fine-tuning the parameters. A contrastive loss, i.e., L^{Exp}_{Contrastive}, is employed to reduce the intra-class variations while decreasing the inter-class similarity. Loss layers are highlighted by gray blocks. Best viewed in color.

Compared with those based on hand-crafted features, deep metric learning achieves more powerful representations with low intra-class yet high inter-class distances in a data-driven manner and thus has shown promising results in many applications. For example, Zhao et al. [53] considered the similarity between peak and non-peak frames from the same subject to achieve invariance to expression intensity. In this work, we explicitly learn expression-related and identity-related features to achieve identity-invariant expression recognition, where a pairwise-constraint-based contrastive loss is chosen as the loss function for learning the similarity metric for either expression or identity. Unlike previous approaches, we utilize both a contrastive loss and a softmax loss to learn both features jointly. More importantly, a new auxiliary layer plus a contrastive loss is employed, which is responsible for learning identity-related representations to alleviate the inter-subject variations introduced by personal attributes. The proposed framework is an end-to-end system, which can be optimized via standard stochastic gradient descent (SGD).

III. PROPOSED METHOD

A. A Component CNN

Instead of utilizing an off-the-shelf model, e.g., AlexNet [19], a new CNN architecture is designed for the component network of the IACNN, due to the limited expression-labeled data. As shown in Fig. 3, the CNN enclosed in the dashed rectangle includes three convolutional layers, each of which is followed by a parametric rectified linear unit (PReLU) layer [8] as the activation function. As an extension of the rectified linear unit (ReLU) activation function, PReLU has better fitting capability than the sigmoid function or hyperbolic tangent function [19] and further boosts the classification performance as compared to the traditional ReLU. After each PReLU layer, a batch normalization (BN) layer [9] is utilized to normalize each scalar feature to zero mean and unit variance. The BN layer has been shown to improve classification performance and accelerate the training process [9]. After each of the first two BN layers, there is a max pooling layer. Following the third PReLU layer, two fully connected (FC) layers, each consisting of 1,024 neurons, are employed. Finally, a softmax loss layer is used to generate the probabilistic distribution over the K target expressions and to calculate the classification loss given the expression labels.
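As a concrete reading of this layer ordering, the following is a minimal PyTorch sketch of the component CNN (the paper's implementation uses Caffe); the kernel sizes, channel counts, and single-channel 48 × 48 input are illustrative assumptions, since only the FC widths and the conv-PReLU-BN-pooling ordering are stated above.

```python
# Minimal PyTorch sketch of the component CNN: conv -> PReLU -> BN blocks,
# max pooling after the first two blocks, then two 1,024-unit FC layers and a
# softmax classifier. Channel counts, kernel sizes, and the single-channel
# 48x48 input are illustrative assumptions.
import torch
import torch.nn as nn

class ComponentCNN(nn.Module):
    def __init__(self, num_expressions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),    # conv1 (assumed size)
            nn.PReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),                               # pooling after block 1
            nn.Conv2d(32, 64, kernel_size=5, padding=2),   # conv2 (assumed size)
            nn.PReLU(),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2),                               # pooling after block 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # conv3 (assumed size)
            nn.PReLU(),
            nn.BatchNorm2d(128),
        )
        self.fc1 = nn.Linear(128 * 12 * 12, 1024)          # first 1,024-unit FC layer
        self.fc_exp = nn.Linear(1024, 1024)                # FC_exp: expression features
        self.classifier = nn.Linear(1024, num_expressions) # logits for the softmax loss

    def forward(self, x):
        h = self.fc1(self.features(x).flatten(1))
        f_exp = self.fc_exp(h)
        return f_exp, self.classifier(f_exp)

# quick shape check on two random 48x48 grayscale crops
f_exp, logits = ComponentCNN()(torch.randn(2, 1, 48, 48))
```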

B. An Exp-Net for Facial Expression Recognition

In real-world scenarios, facial expression recognition suffers from intra-expression variations. As a result, CNNs can generate quite different representations for image samples containing the same expression. To cope with this problem, an Exp-Net composed of two identical component CNNs is constructed, as illustrated in Fig. 3. Two softmax losses and one contrastive loss are employed to learn representations that are meaningful for facial expression recognition. Given the expression labels, a softmax loss is employed on top of each component network to calculate the classification errors and is used to fine-tune the parameters in the lower layers. In addition, a contrastive loss is utilized to learn a similarity metric for image pairs to make sure that the samples with the same expression have similar representations and, at the same time, those with different expressions are far away in the feature space.
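A minimal sketch of how such training pairs could be assembled is given below; the container names `images` and `exp_labels` are hypothetical, and the pair label z^E feeds the contrastive loss defined next.

```python
# Hypothetical pair sampler: each Exp-Net training example is a pair of face
# crops plus the pair label z_E used by the contrastive loss (1 if the two
# images carry the same expression label, 0 otherwise).
import random

def sample_pair(images, exp_labels):
    i, j = random.sample(range(len(images)), 2)               # two distinct samples
    z_e = 1.0 if exp_labels[i] == exp_labels[j] else 0.0      # same expression?
    return images[i], images[j], exp_labels[i], exp_labels[j], z_e
```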

1) Expression-Sensitive Contrastive Loss: As illustrated in Fig. 4, an expression-sensitive contrastive loss is designed to pull the samples with the same expression towards each other, and to push the samples with different expressions away from each other at the same time. Specifically, the similarity between two images I1 and I2 is defined as the squared Euclidean distance in the feature space:

D(f^E(I_1), f^E(I_2)) = \| f^E(I_1) - f^E(I_2) \|_2^2    (1)


Fig. 4: During the training process, the contrastive loss (a) pulls the samples with the same expression towards each other, and (b) pushes the samples with different expressions apart. L_p and L_n represent the contrastive losses for image pairs with the same expression and with different expressions, respectively.

where f^E(·) is the expression-related image representation, i.e., the output of the FC_exp layer in Fig. 4. A smaller distance D(f^E(I_1), f^E(I_2)) indicates that the two images are more similar in the feature space.

Then, the expression-sensitive contrastive loss is defined as follows:

L^{Exp}_{Contrastive} = \sum_{i=1}^{M} L^E(z^E_i, f^E(I_{i,1}), f^E(I_{i,2}))    (2)

L^E(z^E_i, f^E(I_{i,1}), f^E(I_{i,2})) = \frac{z^E_i}{2} D(f^E(I_{i,1}), f^E(I_{i,2})) + \frac{1 - z^E_i}{2} \max(0, \alpha^E - D(f^E(I_{i,1}), f^E(I_{i,2})))    (3)

where M is the number of image pairs, and (z^E_i, f^E(I_{i,1}), f^E(I_{i,2})) represents the label and the expression-related image features for the i-th pair of training samples, respectively. When the pair of images has the same expression label, z^E_i = 1 and the loss is

L^E(1, f^E(I_{i,1}), f^E(I_{i,2})) = \frac{1}{2} D(f^E(I_{i,1}), f^E(I_{i,2})).    (4)

Otherwise, z^E_i = 0 and the loss becomes

L^E(0, f^E(I_{i,1}), f^E(I_{i,2})) = \frac{1}{2} \max(0, \alpha^E - D(f^E(I_{i,1}), f^E(I_{i,2})))    (5)

where α^E > 0 is a parameter that determines how much dissimilar pairs contribute to the loss function. The loss is only incurred when the distance between the two dissimilar samples in a pair is less than α^E. In our experiments, α^E is set to 10 for the expression-sensitive contrastive loss, empirically.
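The following is a direct batched transcription of Eqs. (1)-(5) as a PyTorch function; f1 and f2 stand for the FC_exp outputs of the two images of each pair and z is the pair label z^E. The same form is reused later for the identity-sensitive loss with α^{ID}.

```python
# Batched transcription of Eqs. (1)-(5): f1 and f2 are the FC_exp outputs of a
# batch of image pairs, z is the pair label z_E, and alpha is the margin
# (set to 10 in the paper). The same function serves the identity loss in
# Eqs. (6)-(7) when fed FC_ID features and the subject-based label z_ID.
import torch

def contrastive_loss(f1, f2, z, alpha=10.0):
    d = (f1 - f2).pow(2).sum(dim=1)                           # Eq. (1): squared L2 distance
    pos = 0.5 * z * d                                         # Eq. (4): same-label pairs
    neg = 0.5 * (1.0 - z) * torch.clamp(alpha - d, min=0.0)   # Eq. (5): margin on dissimilar pairs
    return (pos + neg).sum()                                  # Eq. (2): sum over the M pairs
```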

C. The IACNN for Facial Expression Recognition

The performance of facial expression recognition often drops significantly for unseen subjects, mainly due to high inter-subject variations introduced by age, gender, and especially person-specific characteristics associated with identity. To deal with this problem, an IACNN is proposed by introducing an auxiliary FC layer into the Exp-Net, which takes its input from the lower layers of the Exp-Net but has its own set of learnable weights.
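One possible wiring of this auxiliary branch is sketched below by extending the ComponentCNN sketch from Section III-A: FC_ID reads the same first-FC-layer output as FC_exp but keeps its own weights, and a second classifier operates on the concatenated FC_feat vector. This is an assumption about the wiring, not the authors' exact Caffe definition.

```python
# Hypothetical extension of the ComponentCNN sketch: FC_ID shares the lower
# layers with FC_exp but has its own weights; a second classifier uses the
# concatenated FC_feat vector. Layer widths reuse the 1,024-unit assumption.
import torch
import torch.nn as nn

class IACNNStream(ComponentCNN):
    def __init__(self, num_expressions=7):
        super().__init__(num_expressions)
        self.fc_id = nn.Linear(1024, 1024)                       # auxiliary FC_ID layer
        self.classifier_cat = nn.Linear(2048, num_expressions)   # softmax over FC_feat

    def forward(self, x):
        h = self.fc1(self.features(x).flatten(1))                # shared lower layers
        f_exp, f_id = self.fc_exp(h), self.fc_id(h)              # two heads, same input
        fc_feat = torch.cat([f_exp, f_id], dim=1)                # concatenated feature vector
        return f_exp, f_id, self.classifier(f_exp), self.classifier_cat(fc_feat)
```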

As shown in Fig. 2, the FC_ID layer is on top of the first FC layer of the Exp-Net after the convolutional layers. Given the identity labels, the FC_ID layer is responsible for learning identity-related features using an identity-sensitive contrastive loss L^{ID}_{Contrastive}, defined as follows:

L^{ID}_{Contrastive} = \sum_{i=1}^{M} L^{ID}(z^{ID}_i, f^{ID}(I_{i,1}), f^{ID}(I_{i,2}))    (6)

where f^{ID}(·) is the identity-related feature, i.e., the output of the FC_ID layer, and L^{ID}(z^{ID}_i, f^{ID}(I_{i,1}), f^{ID}(I_{i,2})) is defined similarly to the expression-sensitive contrastive loss as below:

L^{ID}(z^{ID}_i, f^{ID}(I_{i,1}), f^{ID}(I_{i,2})) = \frac{z^{ID}_i}{2} D(f^{ID}(I_{i,1}), f^{ID}(I_{i,2})) + \frac{1 - z^{ID}_i}{2} \max(0, \alpha^{ID} - D(f^{ID}(I_{i,1}), f^{ID}(I_{i,2})))    (7)

where z^{ID}_i = 1 if the pair of images comes from the same subject; otherwise, z^{ID}_i = 0. α^{ID} is set to 10 for the identity-sensitive contrastive loss, empirically.

The identity-related features are then concatenated with the expression-related features encoded by the FC_exp layer to form the final feature vector FC_feat for facial expression recognition. Therefore, the overall loss function of the proposed IACNN is defined as

L = \lambda_1 L^{Exp}_{Contrastive} + \lambda_2 L^{ID}_{Contrastive} + \lambda_3 L^{1}_{Softmax} + \lambda_4 L^{2}_{Softmax} + \lambda_5 L^{Exp1}_{Softmax} + \lambda_6 L^{Exp2}_{Softmax}    (8)

where λ1 to λ6 are the weights of the individual losses. L^{Exp}_{Contrastive} and L^{ID}_{Contrastive} are the expression-sensitive contrastive loss and the identity-sensitive contrastive loss, as defined in Eq. 2 and Eq. 6, respectively. L^{Exp1}_{Softmax} and L^{Exp2}_{Softmax} represent the classification errors computed using only the expression-related features from the two component CNNs, while L^{1}_{Softmax} and L^{2}_{Softmax} represent the classification errors computed using the concatenated feature vector FC_feat.
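Putting the pieces together, Eq. (8) can be sketched as follows, reusing the contrastive_loss function and the IACNNStream sketch above; because the two streams share weights, a single module processes both images of each pair. The default weights reflect Section IV-B (λ2 = 5 as for CK+/SFEW, all other weights 1).

```python
# Sketch of Eq. (8) for one batch of image pairs, building on the earlier
# contrastive_loss and IACNNStream sketches (both assumptions, not the
# authors' Caffe code).
import torch.nn.functional as F

def iacnn_loss(stream, img1, img2, y1, y2, id1, id2,
               lambdas=(1.0, 5.0, 1.0, 1.0, 1.0, 1.0)):
    fe1, fid1, le1, lc1 = stream(img1)            # first image of each pair
    fe2, fid2, le2, lc2 = stream(img2)            # second image, same (shared) stream
    z_exp = (y1 == y2).float()                    # same expression label?
    z_id = (id1 == id2).float()                   # same subject?
    l1, l2, l3, l4, l5, l6 = lambdas
    return (l1 * contrastive_loss(fe1, fe2, z_exp)             # expression-sensitive term
            + l2 * contrastive_loss(fid1, fid2, z_id)          # identity-sensitive term
            + l3 * F.cross_entropy(lc1, y1) + l4 * F.cross_entropy(lc2, y2)   # on FC_feat
            + l5 * F.cross_entropy(le1, y1) + l6 * F.cross_entropy(le2, y2))  # on FC_exp
```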

The overall loss is back-propagated to both FC_exp and FC_ID. As a result, the expression-related and identity-related features are fine-tuned jointly in the IACNN.

During testing, only one stream, i.e., the component CNN, is employed for making the decision. Given a testing sample I_i, the expression-related and identity-related representations, i.e., f^E(I_i) and f^{ID}(I_i), are calculated and concatenated to construct the final feature vector, i.e., FC_feat, for facial expression recognition. In this work, classification is performed using the softmax classifier of the CNN.
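A test-time sketch of this single-stream decision rule, assuming the IACNNStream sketch above: one image passes through one stream, the concatenated FC_feat is classified by the softmax layer, and the arg-max gives the prediction.

```python
# Test-time sketch, assuming the IACNNStream sketch above: a single stream
# computes f_E and f_ID for one image, FC_feat is formed internally, and the
# softmax classifier over FC_feat gives the predicted expression index.
import torch

@torch.no_grad()
def predict(stream, image):
    stream.eval()
    _, _, _, logits_cat = stream(image.unsqueeze(0))        # one 1x48x48 crop, one stream
    return logits_cat.softmax(dim=1).argmax(dim=1).item()   # softmax over FC_feat
```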


IV. EXPERIMENTS

To demonstrate the effectiveness of the proposed method in terms of facial expression recognition, extensive experiments have been conducted on three public databases, including two posed facial expression databases, i.e., the CK+ database [14], [25] and the MMI database [32], and, more importantly, a spontaneous facial expression database, i.e., the SFEW dataset [6].

A. Preprocessing

Face alignment is conducted to reduce variation in face scale and in-plane rotation across different facial images. Specifically, 66 landmarks are detected using a state-of-the-art face alignment method, i.e., Discriminative Response Map Fitting (DRMF) [1]. The face regions are aligned based on three fiducial points, the centers of the eyes and the mouth, and then cropped and scaled to a size of 60 × 60.
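A minimal alignment sketch with OpenCV, assuming the two eye centers and the mouth center have already been derived from the 66 DRMF landmarks; the canonical target coordinates below are illustrative, not the paper's values.

```python
# Alignment sketch: a three-point affine warp maps the fiducial points to
# canonical positions in a 60x60 crop. Target coordinates are assumptions.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, mouth, size=60):
    src = np.float32([left_eye, right_eye, mouth])       # three fiducial points
    dst = np.float32([[0.3 * size, 0.35 * size],         # canonical left-eye position
                      [0.7 * size, 0.35 * size],         # canonical right-eye position
                      [0.5 * size, 0.80 * size]])        # canonical mouth position
    M = cv2.getAffineTransform(src, dst)                 # 2x3 affine from the 3 point pairs
    return cv2.warpAffine(image, M, (size, size))        # cropped and scaled 60x60 face
```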

It may not be sufficient to learn a deep model with the limited number of sequences or images in the facial expression databases. To alleviate the chance of over-fitting, an augmentation procedure is employed to train the CNN models, where a 48 × 48 patch is randomly cropped from an image and randomly flipped horizontally as the input of the CNN, resulting in training data 288 times larger than the original set. During testing, only the 48 × 48 patch centered at the face image is used as the input to one stream of the IACNN.
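The cropping-and-flipping augmentation described above can be sketched as follows; `face60` stands for the 60 × 60 aligned face produced by the previous step.

```python
# Augmentation sketch: a random 48x48 patch cropped from the 60x60 aligned
# face, flipped horizontally with probability 0.5; at test time only the
# centered patch is used.
import random
import numpy as np

def augment(face60, crop=48):
    max_off = face60.shape[0] - crop                      # offsets 0..12 per axis
    top, left = random.randint(0, max_off), random.randint(0, max_off)
    patch = face60[top:top + crop, left:left + crop]
    if random.random() < 0.5:                             # random horizontal flip
        patch = np.fliplr(patch)
    return patch

def center_crop(face60, crop=48):
    off = (face60.shape[0] - crop) // 2
    return face60[off:off + crop, off:off + crop]
```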

B. Implementation Details

The proposed component CNN is fine-tuned from a CNN model pretrained on the Facial Expression Recognition (FER-2013) dataset [7] using stochastic gradient descent with a batch size of 128, a momentum of 0.9, and a weight decay of 0.005. Dropout is applied to each FC layer with a probability of 0.6, i.e., zeroing out the output of a neuron with a probability of 0.6. In Eq. 8, λ2 is set to 5 for CK+/SFEW and 2 for MMI, while the other weights are set to 1 for all datasets, empirically. The CNNs are implemented using the Caffe library [11].
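A training-configuration sketch matching the stated hyper-parameters; the paper trains with Caffe, so this PyTorch equivalent (with an assumed learning rate and the ComponentCNN sketch from Section III-A) is for illustration only.

```python
# Training-configuration sketch: SGD with momentum 0.9, weight decay 0.005,
# dropout 0.6 on the FC layers, and batches of 128. The learning rate is an
# assumption; it is not stated in the text.
import torch
import torch.nn as nn

model = ComponentCNN()                            # pretrained on FER-2013 in the paper
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,             # assumed value
                            momentum=0.9,
                            weight_decay=0.005)
dropout = nn.Dropout(p=0.6)                       # applied to the output of each FC layer
# Batches of 128 image pairs would be drawn from a pair dataset, e.g.
# loader = torch.utils.data.DataLoader(pair_dataset, batch_size=128, shuffle=True)
```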

Identity information is required to train the IACNN model. Labeled subject IDs are provided in the CK+ and MMI databases, and we manually labeled subject IDs for the SFEW database. In practice, the proposed system only requires weakly supervised identity information, i.e., whether two images are from the same subject, which can be automatically obtained by an off-the-shelf face verification method.

C. Experimental Results

To better demonstrate the effectiveness of the proposed model, two baseline methods are employed, i.e., the one-stream component CNN described in Section III-A, denoted by CNN, and the Exp-Net introduced in Section III-B, denoted by Exp-Net.

1) Results on CK+ dataset: The CK+ database [14], [25] is widely used for evaluating facial expression recognition systems. It contains 327 image sequences collected from 118 subjects, each of which is labeled as one of 7 expressions, i.e., anger, contempt, disgust, fear, happiness, sadness, and surprise. For each sequence, the label is only provided for the last frame (the peak frame).

TABLE I: Confusion matrix of the proposed IACNN method evaluated on the CK+ database [14], [25]. The ground truth and the predicted labels are given by the first column and the first row, respectively.

     An      Co      Di      Fe      Ha      Sa      Su
An   91.1%   0%      0%      1.1%    0%      7.8%    0%
Co   5.6%    86.1%   0%      2.7%    0%      5.6%    0%
Di   0%      0%      100%    0%      0%      0%      0%
Fe   0%      4%      0%      98%     2%      0%      8%
Ha   0%      0%      0%      0%      100%    0%      0%
Sa   3.6%    0%      0%      1.8%    0%      94.6%   0%
Su   0%      1.2%    0%      0%      0%      0%      98.8%

TABLE II: Performance comparison on the CK+ database [14], [25] in terms of the average accuracy over 7 expressions.

Method              Accuracy
3DCNN [21]          85.9
MSR [33]            91.4
HOG 3D [18]         91.44
TMS [10]            91.89
Cov3D [35]          92.3
3DCNN-DAP [21]      92.4
STM-ExpLet [22]     94.19
DTAGN [13]          97.25
CNN (baseline)      89.30
Exp-Net (baseline)  92.81
IACNN               95.37

To collect more data, the last three frames of each sequence are selected as peak frames associated with the provided expression label. Thus, an experimental database consisting of 981 images is built. The database is further divided into 8 subsets, where the subjects in any two subsets are mutually exclusive. Then an 8-fold cross-validation strategy is employed, where, for each run, data from 6 subsets are used for training and data from the remaining two subsets are used for validation and testing, respectively.
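A minimal sketch of such a subject-independent partition, assuming `subject_ids` lists the subject label of every image: the subjects are split into 8 mutually exclusive groups, and each run takes 6 for training and one each for validation and testing.

```python
# Subject-independent split sketch: 8 disjoint subject groups; per run,
# 6 groups train the model and the remaining two serve as validation and
# test sets, respectively.
import numpy as np

def subject_folds(subject_ids, n_folds=8, seed=0):
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    return np.array_split(subjects, n_folds)             # 8 disjoint subject groups

def run_split(folds, run):
    val = set(folds[run % len(folds)])
    test = set(folds[(run + 1) % len(folds)])
    train = set(np.concatenate(folds)) - val - test      # the remaining 6 groups
    return train, val, test
```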

The proposed IACNN and the two baseline methods are trained and tested on static images. The final sequence-level predictions are obtained by choosing the class with the highest average score over the three images. The results are reported as the average of the 8 runs. The confusion matrix of the proposed IACNN model is reported in Table I, where diagonal entries represent the recognition accuracy for each expression. As shown in Table II, the proposed IACNN outperforms the two baseline methods, especially the one-stream component CNN, in terms of the average accuracy over the 7 expressions. The IACNN model is also compared with the state-of-the-art methods evaluated on the CK+ database, including methods using human-crafted features (HOG 3D [18], TMS [10], Cov3D [35], and STM-ExpLet [22]), methods using sparse coding (MSR [33]), and CNN-based methods (3DCNN and 3DCNN-DAP [21], and DTAGN [13]). As shown in Table II, the IACNN outperforms the methods based on human-crafted features or sparse coding and also performs better than or at least comparably to the CNN-based methods. Note that all these methods except MSR employed temporal information extracted from image sequences. In contrast, the proposed IACNN learns and extracts features from static images, which is more suitable for applications where videos or image sequences are not available.


Fig. 5: A visualization study of (a) the original raw images and the features learned by (b) the component CNN, (c) the Exp-Net, and (d) the IACNN model on the CK+ database. The number of samples is 981, including 247 × 3 training data from 6 subsets, 38 × 3 validation data, and 42 × 3 testing data. The dots, stars, and diamonds represent training, validation, and testing data, respectively. The features learned by the IACNN are better separated according to expressions for both the validation and testing data. Best viewed in color.

TABLE III: Confusion matrix of the proposed IACNN method evaluated on the MMI database [32]. The ground truth and the predicted labels are given by the first column and the first row, respectively.

     An      Di      Fe      Ha      Sa      Su
An   81.8%   3%      3%      1.5%    10.6%   0%
Di   10.9%   71.9%   3.1%    4.7%    9.4%    6%
Fe   5.4%    8.9%    41.1%   7.1%    7.1%    30.4%
Ha   1.1%    3.6%    0%      92.9%   2.4%    0%
Sa   17.2%   7.8%    0%      1.6%    73.4%   0%
Su   7.3%    0%      14.6%   1.2%    0%      76.9%

Visualization Study: To further demonstrate the effectiveness in terms of learning good representations for expression recognition, we visualize the features learned by the component CNN, the Exp-Net, and the IACNN, respectively, using t-SNE [44], which is widely employed to visualize high-dimensional data. In Fig. 5, training samples are denoted by dots, validation samples by stars, and testing samples by diamonds. As illustrated in Fig. 5a, the original raw images are randomly distributed. In contrast, the learned features (Fig. 5b, c, and d) are clustered based on their expression labels. A close-up enclosed by the dashed rectangle is given for Fig. 5b, c, and d. In Fig. 5b (the component CNN), the samples of the same subject with different expressions lie closer to each other than to their corresponding cluster centers, marked by magenta circles; this effect is reduced in Fig. 5c (the Exp-Net) and Fig. 5d (the IACNN). Compared to the features learned by the component CNN and the Exp-Net, the proposed IACNN model yields a better separation of the features, as depicted in Fig. 5d.
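A visualization sketch along these lines, assuming `features` is an (N, D) array of learned features and `labels` the corresponding expression labels; scikit-learn's t-SNE projects them to 2-D for a Fig. 5-style plot.

```python
# Visualization sketch: project the learned features to 2-D with t-SNE and
# color the points by expression label, as in Fig. 5.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_features(features, labels):
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for c in np.unique(labels):                           # one color per expression
        m = labels == c
        plt.scatter(emb[m, 0], emb[m, 1], s=6, label=str(c))
    plt.legend()
    plt.show()
```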

2) Results on MMI dataset: The MMI dataset [32] consists of 213 image sequences, among which 208 sequences containing frontal-view faces of 31 subjects are used in our experiment. Each sequence is labeled as one of six basic expressions, i.e., anger, disgust, fear, happiness, sadness, and surprise. Starting with a neutral expression, each sequence develops the facial expression as time goes on, reaches the peak near the middle of the sequence, and ends with a neutral expression. Since the actual location of the peak frame is not provided, three frames in the middle of each image sequence are collected as peak frames and associated with the provided expression labels.

TABLE IV: Performance comparison on the MMI database [32] in terms of the average accuracy over 6 expressions.

Method                                       Accuracy
3DCNN [21]                                   53.2
ITBN [46]                                    59.7
HOG 3D [18]                                  60.89
3DCNN-DAP [21]                               63.4
3D SIFT [38]                                 64.39
DTAGN [13]                                   70.24
STM-ExpLet [22]                              75.12
CNN (baseline)                               57.00
Exp-Net (baseline)                           64.31
IACNN                                        69.48
CNN_CK (baseline trained from MMI+CK+)       65.17
Exp-Net_CK (baseline trained from MMI+CK+)   70.55
IACNN_CK (trained from MMI+CK+)              71.55

Hence, there are a total of 208 × 3 images used in our experiments.

Similar to the protocol on the CK+ database, the proposed IACNN and the two baseline methods are trained and tested on static images, and the final sequence-level predictions are made by selecting the class with the highest average score over the three images. The dataset is divided into 10 subsets for person-independent 10-fold cross-validation, where, for each run, data from 8 subsets are used for training and those from the remaining 2 subsets are used for validation and testing, respectively. The results are reported as the average of 10 runs. The confusion matrix of the proposed IACNN model evaluated on the MMI dataset is reported in Table III.

As shown in Table IV, the proposed IACNN outperforms the two baseline methods significantly. Furthermore, the IACNN also outperforms most of the state-of-the-art methods. Note that the image sequences in the MMI database contain a full temporal pattern of expressions, i.e., from neutral to apex and then back to neutral, and thus especially favor methods exploiting temporal information, e.g., all the state-of-the-art methods in comparison.

Since the MMI dataset contains a small number of samples, i.e., only 624 face images, it is not large enough to train a deep model. To demonstrate that the IACNN can achieve better performance given more training data, we employed additional data of the six basic expressions from the CK+ dataset. Specifically, for each run, the 8 training subsets of the MMI dataset plus the data from the CK+ dataset are used as the training set, and the remaining two subsets of the MMI dataset are used as the validation and testing sets, respectively.


TABLE V: Confusion matrix of the proposed IACNN method evaluated on the SFEW [6] validation set. The ground truth and the predicted labels are given by the first column and the first row, respectively.

     An      Di      Fe      Ha      Ne      Sa      Su
An   70.7%   0%      2.7%    5.2%    6.7%    8%      6.7%
Di   19.1%   0%      0%      4.8%    33.3%   38%     4.8%
Fe   37.8%   0%      8.9%    11.1%   15.6%   13.3%   13.3%
Ha   9.9%    0%      0%      70.4%   9.9%    8.4%    1.4%
Ne   5.1%    0%      2.6%    1.3%    60.3%   24.3%   6.4%
Sa   7.4%    0%      2.9%    5.9%    20.6%   58.8%   4.4%
Su   23.1%   0%      3.8%    3.8%    28.9%   11.5%   28.9%

TABLE VI: Confusion matrix of the proposed IACNN method evaluated on the SFEW [6] testing set. The ground truth and the predicted labels are given by the first column and the first row, respectively.

     An      Di      Fe      Ha      Ne      Sa      Su
An   79.8%   0%      1.4%    0%      8.7%    1.4%    8.7%
Di   35.3%   0%      0%      29.4%   11.8%   23.5%   0%
Fe   34.1%   0%      9.8%    12.2%   9.8%    7.3%    26.8%
Ha   8.4%    0%      0%      75.8%   4.2%    7.4%    4.2%
Ne   17.2%   0%      1.7%    3.5%    60.4%   10.3%   6.9%
Sa   23.6%   0%      7.3%    20%     5.5%    34.6%   9%
Su   24.3%   0%      2.7%    8.1%    10.8%   8.1%    46%

The results are reported at the bottom of Table IV, denoted as CNN_CK, Exp-Net_CK, and IACNN_CK for the one-stream CNN, the Exp-Net, and the proposed IACNN, respectively. With additional training data, the recognition accuracy of the IACNN further improves from 69.48 to 71.55.

3) Results on SFEW dataset: The SFEW dataset [6] is the most widely used benchmark for facial expression recognition, which “targets the efforts required towards affect analysis in the wild” [6]. The SFEW database is composed of 1,766 images, i.e., 958 for training, 436 for validation, and 372 for testing. Each of the images has been assigned to one of seven expression categories, i.e., anger, disgust, fear, neutral, happy, sad, and surprise. The expression labels of the training and validation sets are provided, while those of the testing set are held back by the challenge organizers.

As illustrated in Table VII, Kim et al. [16], Yu et al. [49], Ng et al. [30], and Yao et al. [48] are ranked 1st to 4th among the 18 teams in the EmotiW2015 challenge, respectively. Note that the first two methods, i.e., Kim et al. [16] and Yu et al. [49], used an ensemble of CNNs. For example, Kim et al. [16] employed an ensemble of CNNs with different architectures, which can boost the final performance. The proposed IACNN model outperforms the baseline of SFEW (35.93 on the validation set and 39.13 on the testing set) by a large margin. The proposed IACNN is ranked 5th on both the validation set and the testing set among all the methods compared and achieves comparable performance with Ng et al. [30] and Yao et al. [48], which demonstrates the effectiveness of the IACNN model in the real world.

4) Cross-database validation: To further demonstrate that the proposed IACNN is less affected by identity, a cross-database experiment was conducted.

TABLE VII: Performance comparison on the SFEW database [6] in terms of the average accuracy over 7 expressions.

Method                              Validation Set   Test Set
Kim et al. [16]                     53.9             61.6
Yu et al. [49]                      55.96            61.29
Ng et al. [30]                      48.5             55.6
Yao et al. [48]                     43.58            55.38
Sun et al. [40]                     51.02            51.08
Zong et al. [55]                    N/A              50
Kaya et al. [15]                    53.06            49.46
Mollahosseini et al. [29]           47.7             N/A
Dhall et al. [6] (baseline of SFEW) 35.93            39.13
CNN (baseline)                      47.80            50.54
Exp-Net                             49.51            52.96
IACNN                               50.98            54.30

TABLE VIII: Cross-database facial expression recognition performance in terms of average accuracy.

Test Set   [26]   [39]   [27]   CNN     IACNN
CK+        47.1   -      56     69.26   71.29
MMI        51.4   50.8   36.8   54.87   55.41

The classifiers trained on CK+ and MMI in the previous experiments are directly applied to MMI and CK+, respectively. The results are reported as the average of 10 runs for CK+ and 8 runs for MMI. As shown in Table VIII, the proposed IACNN outperforms all the state-of-the-art methods as well as the baseline CNN method.

V. CONCLUSION

In this work, we proposed a novel identity-aware CNN to capture both expression-related and identity-related information to alleviate the effect of personal attributes on facial expression recognition. Specifically, a softmax loss combined with an expression-sensitive contrastive loss is utilized to learn the expression-related representations. To alleviate the variations introduced by different identities, a new auxiliary layer with an identity-sensitive contrastive loss is employed to learn identity-related representations. Both the expression-related and identity-related features are concatenated and employed to achieve identity-invariant facial expression recognition.

Experimental results on two posed facial expression datasets have demonstrated that the proposed IACNN model outperforms the baseline CNN methods as well as most of the state-of-the-art methods that exploit dynamic information extracted from image sequences. More importantly, the IACNN has shown promise on a spontaneous facial expression dataset, which demonstrates its effectiveness in the real world. In the future, we plan to recognize facial expressions from videos by incorporating temporal information into the proposed IACNN model.

VI. ACKNOWLEDGEMENT

This work is supported by the National Science Foundation under CAREER Award IIS-1149787.

REFERENCES

[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In CVPR, pages 3444–3451, 2013.


[2] T. Baltrusaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In FG, volume 6, pages 1–6, 2015.
[3] J. Chen, X. Liu, P. Tu, and A. Aragones. Learning person-specific models for facial expression and action unit recognition. Pattern Recognition Letters, 34(15):1964–1970, 2013.
[4] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, pages 539–546, 2005.
[5] W. Chu, F. De la Torre, and J. Cohn. Selective transfer machine for personalized facial expression analysis. IEEE T-PAMI, 2016.
[6] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In ICMI, pages 423–426. ACM, 2015.
[7] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. Lee, et al. Challenges in representation learning: A report on three machine learning contests. In ICML, pages 117–124. Springer, 2013.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
[9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[10] S. Jain, C. Hu, and J. K. Aggarwal. Facial expression recognition with temporal modeling of shapes. In ICCV Workshops, pages 1642–1649, 2011.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678. ACM, 2014.
[12] B. Jiang, M. Valstar, and M. Pantic. Action unit detection using sparse appearance descriptors in space-time video volumes. In FG, 2011.
[13] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In ICCV, pages 2983–2991, 2015.
[14] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In FG, pages 46–53, 2000.
[15] H. Kaya, F. Gurpinar, S. Afshar, and A. A. Salah. Contrasting and combining least squares based learners for emotion recognition in the wild. In ICMI, pages 459–466, 2015.
[16] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In ICMI, pages 427–434, 2015.
[17] J. J. Kivinen and C. K. Williams. Transformation equivariant Boltzmann machines. In ICANN, pages 1–9. Springer, 2011.
[18] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, pages 275–1, 2008.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[20] M. Liu, S. Li, S. Shan, and X. Chen. AU-aware deep networks for facial expression recognition. In FG, pages 1–6, 2013.
[21] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply learning deformable facial action parts model for dynamic expression analysis. In ACCV, 2014.
[22] M. Liu, S. Shan, R. Wang, and X. Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In CVPR, pages 1749–1756, 2014.
[23] P. Liu, S. Han, Z. Meng, and Y. Tong. Facial expression recognition via a boosted deep belief network. In CVPR, pages 1805–1812, 2014.
[24] P. Liu, S. Han, and Y. Tong. Improving facial expression analysis using histograms of log-transformed nonnegative sparse representation with a spatial pyramid structure. In FG, pages 1–7, 2013.
[25] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. In CVPR Workshops, pages 94–101, 2010.
[26] C. Mayer, M. Eggers, and B. Radig. Cross-database evaluation for facial expression recognition. Pattern Recognition and Image Analysis, 24(1):124–132, 2014.
[27] Y. Miao, R. Araujo, and M. S. Kamel. Cross-domain facial expression recognition using supervised kernel mean matching. In ICMLA, volume 2, pages 326–332, 2012.
[28] M. R. Mohammadi, E. Fatemizadeh, and M. H. Mahoor. Simultaneous recognition of facial expression and identity via sparse representation. In WACV, pages 1066–1073, 2014.
[29] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In WACV, 2015.
[30] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In ICMI, pages 443–449, 2015.
[31] M. Pantic, A. Pentland, A. Nijholt, and T. S. Huang. Human computing and machine understanding of human behavior: A survey. In T. S. Huang, A. Nijholt, M. Pantic, and A. Pentland, editors, Artificial Intelligence for Human Computing, LNAI. Springer Verlag, London, 2007.
[32] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In ICME, 5 pp. IEEE, 2005.
[33] R. Ptucha, G. Tsagkatakis, and A. Savakis. Manifold based sparse representation for robust expression recognition without neutral subtraction. In ICCV Workshops, pages 2136–2143, 2011.
[34] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, pages 808–822, 2012.
[35] A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell. Spatio-temporal covariance descriptors for action and gesture recognition. In WACV, pages 103–110, 2013.
[36] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic analysis of facial affect: A survey of registration, representation and recognition. IEEE T-PAMI, 37(6):1113–1133, 2015.
[37] U. Schmidt and S. Roth. Learning rotation-aware features: From invariant priors to equivariant descriptors. In CVPR, pages 2050–2057, 2012.
[38] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM, pages 357–360, 2007.
[39] C. Shan, S. Gong, and P. McOwan. Facial expression recognition based on Local Binary Patterns: A comprehensive study. J. IVC, 27(6):803–816, 2009.
[40] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, and Q. Wei. Combining multimodal features within a fusion network for emotion recognition in the wild. In ICMI, pages 497–502, 2015.
[41] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[42] Y. Tang. Deep learning using linear support vector machines. In ICML, 2013.
[43] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer. Meta-analysis of the first facial expression recognition challenge. IEEE T-SMC-B, 42(4):966–979, 2012.
[44] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
[45] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, pages 1386–1393, 2014.
[46] Z. Wang, S. Wang, and Q. Ji. Capturing complex spatio-temporal relations among facial muscles for facial expression recognition. In CVPR, pages 3422–3429, 2013.
[47] P. Yang, Q. Liu, and D. N. Metaxas. Boosting encoded dynamic features for facial expression recognition. Pattern Recognition Letters, 30(2):132–139, 2009.
[48] A. Yao, J. Shao, N. Ma, and Y. Chen. Capturing AU-aware facial features and their latent relations for emotion recognition in the wild. In ICMI, pages 451–458, 2015.
[49] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network learning. In ICMI, pages 435–442, 2015.
[50] A. Yuce, H. Gao, and J. Thiran. Discriminant multi-label manifold embedding for facial action unit detection. In FG, 2015.
[51] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE T-PAMI, 31(1):39–58, 2009.
[52] G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE T-PAMI, 29(6):915–928, 2007.
[53] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan. Peak-piloted deep network for facial expression recognition. In ECCV, pages 425–442. Springer, 2016.
[54] L. Zhong, Q. Liu, P. Yang, J. Huang, and D. Metaxas. Learning multiscale active facial patches for expression analysis. IEEE Trans. on Cybernetics, 45(8):1499–1510, 2015.
[55] Y. Zong, W. Zheng, X. Huang, J. Yan, and T. Zhang. Transductive transfer LDA with Riesz-based volume LBP for emotion recognition in the wild. In ICMI, pages 491–496, 2015.