
Facial Expression Recognition by SVM-based Two-stage Classifier on

Gabor Features

Fan CHEN, Kazunori KOTANI

School of Information Science

Japan Advanced Institute of Science and Technology

Ashahi-dai 1-8, Nomi, Ishikawa 923-1292, Japan

[email protected] [email protected]

Abstract

We propose a two-stage classifier for elastic bunch graph matching based recognition of facial expressions. The major purpose is to compute a distinctive similarity between image patterns by applying optimal weights to the responses from different Gabor kernels and to those from different fiducial points. In the first stage, we perform SVM on each fiducial point individually to extract a weighted feature from the Gabor response. The optimal fusion of those features is then calculated by another stage of SVM, providing the weights between fiducial points. In numerical experiments, the proposed method shows improved performance compared with other methods.

1 Introduction

Normalization errors caused by head pose variations and fiducial point displacements significantly affect the performance of appearance-based methods for both face and facial expression recognition, e.g., eigenface, Fisherface, etc. Model-based methods such as Elastic Bunch Graph Matching (EBGM) [1] can alleviate this problem by explicitly including the position matching of features. Using Gabor filters in EBGM to extract local descriptors further improves robustness against lighting and contrast [2][3]. Extensions of EBGM that tolerate larger pose variations and handle larger datasets have also been proposed [4]. After the extraction of local descriptors, the similarity between two images is computed for facial expression classification. We observe that not only are some fiducial points more distinctive than others in discriminating facial expressions, but so are the responses from certain Gabor filters. Instead of using the simple sum of local similarity values at each fiducial point, it is therefore reasonable to apply a weighted sum to both the responses from different Gabor kernels and the local descriptors at different points. Many statistical methods have been applied to EBGM to search for the optimal weights between fiducial points, including Linear Discriminant Analysis (LDA) [5][6], Support Vector Machines (SVM) [7][8], and neural networks [11]. From the viewpoint of combining classifiers, these approaches can be regarded as performing output fusion on the bank of Gabor filters [9][10]. If we extract a more distinctive vector from the Gabor jet at each point, further improvements in accuracy are possible. Direct classification of the feature vector constructed from all local descriptors of each image, however, may cause over-fitting due to the curse of dimensionality. We therefore consider a two-stage classifier for facial expression recognition, which applies base classifiers to all fiducial points individually to estimate the optimal weights of the responses from different Gabor kernels; the fusion of their outputs is then calculated by a second-stage classifier to obtain the final classification support. In particular, we use SVMs for both the base classifiers and the second-stage classifier because of their good generalization ability.

In Section 2, we briefly introduce the architecture of our two-stage classifier. In Section 3, numerical experiments are carried out to investigate the performance of the proposed method, and comparisons with other methods are made. Concluding remarks and comments on our future work are given in Section 4.

2 SVM-based Two-stage Classifier for Facial Expression Recognition

Our major purpose is to find an optimal classifier on the local descriptors gathered from EBGM. We therefore omit an introduction of EBGM and focus on the development of the classifier on the obtained local descriptors. Let $q^{(n)}$ be the $n$-th image vector in a set of $N$ images $Q = \{q^{(n)} \mid n \in \{1, \cdots, N\}\}$. $z = \{z_n \mid n \in \{1, \cdots, N\}\}$ is the set of class labels, with $z_n \in \{1, \cdots, C\}$ being the label of the $n$-th image vector and $C$ the number of classes. We extract $M$ fiducial points for the $n$-th image, manually or automatically, whose coordinates form the set $\{(x^{(n)}_m, y^{(n)}_m) \mid m \in \{1, \cdots, M\}\}$. Local descriptors at the fiducial points are extracted by a Gabor bank of $K$ filters, $G = \{G_k \mid k \in \{1, \cdots, K\}\}$, as in Ref. [3], where the element at coordinate $(x, y)$ of the $k$-th filter reads

$$G_{kxy} = \frac{\exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_{kx}^2} + \frac{y^2}{\sigma_{ky}^2}\right)\right]}{2\pi\sigma_{kx}\sigma_{ky}} \left\{\exp\left[i\left(\omega_{kx}x + \omega_{ky}y\right)\right] - \exp\left[-\frac{(\omega_{kx}\sigma_{kx})^2 + (\omega_{ky}\sigma_{ky})^2}{2}\right]\right\}, \qquad (1)$$
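As an illustration, Eq. (1) can be sampled on a discrete grid; the grid size and the kernel parameters below are illustrative choices, not values from the paper.

```python
import numpy as np

def gabor_kernel(size, omega_x, omega_y, sigma_x, sigma_y):
    """Sample the DC-free Gabor kernel of Eq. (1) on a square grid.

    The second exponential subtracts the kernel's DC component so that
    the response is invariant to uniform brightness shifts.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    envelope /= 2.0 * np.pi * sigma_x * sigma_y
    carrier = np.exp(1j * (omega_x * x + omega_y * y))
    dc = np.exp(-((omega_x * sigma_x)**2 + (omega_y * sigma_y)**2) / 2.0)
    return envelope * (carrier - dc)

# Illustrative parameters: a horizontally tuned kernel on a 31x31 grid.
g = gabor_kernel(size=31, omega_x=0.5, omega_y=0.0, sigma_x=4.0, sigma_y=4.0)
# DC-freeness: the kernel sums to approximately zero.
print(abs(g.sum()))
```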


MVA2007 IAPR Conference on Machine Vision Applications, May 16-18, 2007, Tokyo, JAPAN (paper 13-6)


which is a complex sinusoid centered at frequency $(\omega_{kx}, \omega_{ky})$ and modulated by a Gaussian envelope, as shown in Fig. 1. $\sigma_{kx}$ and $\sigma_{ky}$ are the standard deviations of the elliptical Gaussian along $x$ and $y$. The second term inside the braces makes the Gabor kernel DC-free, which improves robustness against brightness variation. A Gabor jet is defined as the vector of Gabor responses at the $m$-th fiducial point, whose $k$-th element is the convolution of the image with the $k$-th Gabor kernel at offset $(x^{(n)}_m, y^{(n)}_m)$, i.e.,

$$u^{(n)}_{mk} = \sum_x \sum_y q^{(n)}_{x^{(n)}_m - x,\; y^{(n)}_m - y}\, G_{kxy}. \qquad (2)$$

$q^{(n)}_{x,y}$ is the intensity at pixel $(x, y)$ of the $n$-th image.
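The convolution in Eq. (2), evaluated only at the fiducial point, can be sketched directly. This is an unoptimized double loop; a real implementation would use FFT-based filtering, and treating out-of-image pixels as zero is our assumption.

```python
import numpy as np

def gabor_jet(image, x_m, y_m, kernels):
    """Eq. (2): for each kernel G_k, accumulate
    q[y_m - y, x_m - x] * G_k[y, x] over the kernel support, i.e., the
    convolution of image and kernel evaluated at fiducial point
    (x_m, y_m). Pixels outside the image are treated as zero."""
    jet = np.empty(len(kernels), dtype=complex)
    for k, g in enumerate(kernels):
        half = g.shape[0] // 2
        acc = 0j
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                yy, xx = y_m - dy, x_m - dx
                if 0 <= yy < image.shape[0] and 0 <= xx < image.shape[1]:
                    acc += image[yy, xx] * g[dy + half, dx + half]
        jet[k] = acc
    return jet

# Sanity check with a delta "kernel": the jet reduces to the pixel value.
img = np.arange(25, dtype=float).reshape(5, 5)
delta = np.zeros((3, 3))
delta[1, 1] = 1.0
print(gabor_jet(img, 2, 3, [delta])[0])  # -> (17+0j)
```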

Fig. 1: A Gabor kernel. (a) Real part; (b) Imaginary part.

Since the feature vector constructed from all local descriptors of an image is generally longer than the size of the training set, direct classification of this feature vector may cause over-fitting due to the curse of dimensionality. We consider a two-stage classifier, whose diagram is given in Fig. 2. The weights of the responses from different Gabor kernels are optimized by the first-stage classifiers, and the fusion of their outputs is computed by the second-stage classifier to produce the classification support.

1.) For the $n$-th image, EBGM is used to search for the optimal coordinates $\{(x^{(n)}_m, y^{(n)}_m)\}$ of all fiducial points by performing local searches inside a neighborhood of a starting point next to the ground truth. Detailed explanations of EBGM can be found in Refs. [1] and [3].

2.) For the located $m$-th fiducial point, its Gabor jet is calculated and normalized to unit length, i.e., $u^{(n)}_m = \left[\,|u^{(n)}_{mk}| \Big/ \sqrt{\textstyle\sum_k |u^{(n)}_{mk}|^2} \;\Big|\; k \in \{1, \cdots, K\}\right]^T$. The magnitude normalization of Gabor jets improves robustness against light and contrast variations.
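Step 2 keeps only the magnitudes of the complex jet entries and scales them to unit Euclidean length, which can be sketched as:

```python
import numpy as np

def normalize_jet(jet):
    """Magnitude-normalize a Gabor jet: keep |u_mk| and scale the
    resulting vector to unit length, discarding absolute contrast."""
    mags = np.abs(np.asarray(jet))
    return mags / np.linalg.norm(mags)

u = normalize_jet([3 + 4j, 0j])  # magnitudes (5, 0) -> unit vector (1, 0)
print(u)
```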

3.) The Gabor jets from the different training images at the $m$-th fiducial point form a Gabor bunch $\{u^{(n)}_m \mid n \in \{1, \cdots, N\}\}$, which is used to train the $m$-th max-win SVM classifier of the first stage. In the testing phase, the Gabor jet $u^{(n)}_m$ is fed into the trained max-win SVM classifier, which outputs $v^{(n)}_m = [v^{(n)}_{md} \mid d \in \{1, \cdots, D\}]^T$, with $D$ being the output dimensionality.

4.) The outputs for the $n$-th image from all $M$ classifiers are reorganized into a decision profile $V^{(n)} = [v^{(n)}_m \mid m \in \{1, \cdots, M\}]^T$, which is vectorized and sent to the pairwise SVM classifier for classification.

Different kinds of combiners for the fusion of the decision profile have been proposed; they can be classified into non-trainable and trainable combiners [9].

Fig. 2: Block diagram of the proposed two-stage classifier.

Average-fusion and max-fusion, two commonly used non-trainable combiners, are defined as

$$\text{average-fusion:} \quad o^{(n)}_d = \frac{1}{M} \sum_m v^{(n)}_{md}, \quad \hat{D} = D, \qquad (3)$$

$$\text{max-fusion:} \quad o^{(n)}_d = \max_m v^{(n)}_{md}, \quad \hat{D} = D, \qquad (4)$$

respectively. $o^{(n)} = \{o^{(n)}_{\hat{d}} \mid \hat{d} \in \{1, \cdots, \hat{D}\}\}$ is the classification support from the combiner. Different from the usual weighted average fusion,

$$o^{(n)}_d = \frac{1}{M} \sum_m w_m v^{(n)}_{md}, \quad \hat{D} = D, \qquad (5)$$

we consider a generalized weighted average fusion, i.e.,

$$o^{(n)} = w \hat{V}^{(n)}, \qquad (6)$$

where $\hat{V}^{(n)} = [[v^{(n)}_m]^T \mid m \in \{1, \cdots, M\}]^T$ is the vectorized form of $V^{(n)}$.
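The combiners of Eqs. (3)-(6) differ only in how they reduce the decision profile. A minimal numpy sketch follows; the sizes are hypothetical and the fusion matrix is random here, whereas in the proposed method it is learned by the second-stage classifier.

```python
import numpy as np

# Decision profile V: M first-stage outputs, each of dimension D
# (hypothetical sizes, for illustration only).
M, D = 3, 4
rng = np.random.default_rng(0)
V = rng.random((M, D))

# Eq. (3): average-fusion -- mean support over the M classifiers.
o_avg = V.mean(axis=0)

# Eq. (4): max-fusion -- maximum support over the M classifiers.
o_max = V.max(axis=0)

# Eq. (5): weighted average fusion, one scalar weight per fiducial point.
w_m = rng.random(M)
o_wavg = (w_m[:, None] * V).sum(axis=0) / M

# Eq. (6): generalized weighted fusion -- a full matrix acting on the
# vectorized decision profile.
V_hat = V.reshape(-1)           # vectorized profile, length M*D
W = rng.random((D, M * D))      # fusion matrix (random here, normally learned)
o_gen = W @ V_hat
```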

Various statistical classifiers can serve as the second-stage classifier to calculate the optimal fusion matrix $w$ of the decision profile. In both stages we use SVMs, whose purpose in a two-class problem is to maximize the margin between the two classes, or equivalently, to search for the $w$ that minimizes the following objective function:

$$L_p = \frac{1}{2} w^T w - \sum_n \alpha_n \left[ z^*_n \left( w^T \hat{V}^{(n)} + w_0 \right) - 1 \right], \qquad (7)$$

where $\{\alpha_n \mid n = 1, \cdots, N, \alpha_n > 0\}$ are the Lagrange multipliers and $z^*_n$ takes 1 for $z_n = 0$ and $-1$ for $z_n = 1$. Detailed explanations of SVMs can be found in Ref. [7].

When extending the two-class SVM to the multi-class case, a max-win multi-class SVM adopts the one-against-rest strategy, while a pairwise one takes the one-against-one approach. We use the max-win multi-class SVM for the first-stage classifiers because it produces continuous outputs. The pairwise multi-class SVM is used in the second stage because it provides better performance, based on numerical data not shown in the present paper.
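The overall two-stage structure can be sketched end to end. To keep the sketch self-contained, a tiny one-vs-rest linear SVM trained by batch sub-gradient descent on the hinge loss stands in for a full SVM solver, and its continuous decision values stand in for the max-win outputs; the data, sizes, and hyperparameters are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_linear_svm_ovr(X, y, n_classes, lam=0.01, lr=0.1, epochs=300):
    """One-vs-rest linear SVM via batch sub-gradient descent on the
    regularized hinge loss -- a stand-in for a full SVM solver."""
    n, d = X.shape
    W, b = np.zeros((n_classes, d)), np.zeros(n_classes)
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)
        for _ in range(epochs):
            viol = t * (X @ W[c] + b[c]) < 1.0          # margin violators
            W[c] -= lr * (lam * W[c] - (t[viol, None] * X[viol]).sum(0) / n)
            b[c] -= lr * (-t[viol].sum() / n)
    return W, b  # continuous per-class supports are X @ W.T + b

rng = np.random.default_rng(1)
N, M, K, C = 90, 4, 6, 3                 # images, points, jet dim, classes
y = np.arange(N) % C
jets = rng.random((N, M, K))
jets[np.arange(N), :, y] += 2.0          # synthetic, separable toy jets

# Stage 1: one classifier per fiducial point; its continuous outputs
# form that point's row of the decision profile.
profiles = np.empty((N, M, C))
for m in range(M):
    W, b = train_linear_svm_ovr(jets[:, m, :], y, C)
    profiles[:, m, :] = jets[:, m, :] @ W.T + b

# Stage 2: one classifier on the vectorized decision profile.
V = profiles.reshape(N, -1)
W2, b2 = train_linear_svm_ovr(V, y, C)
pred = (V @ W2.T + b2).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

The paper's second stage is a pairwise (one-vs-one) SVM; the one-vs-rest stand-in above is used in both stages only for brevity.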



Fig. 3: Two kinds of multi-class SVM: max-win SVM and pairwise SVM.

3 Numerical Experiments and Discussions

We focus on performance comparisons between the proposed method and other fusion methods. Single-stage versions of the classifiers are also compared. The Japanese Female Facial Expression (JAFFE) database [2] is adopted, which includes 213 images in total. The goal of recognition is to classify each image as a neutral face or as one of the six elemental facial expressions suggested by Ekman et al. [12], i.e., happiness, anger, fear, disgust, sadness, and surprise. We normalize these images by aligning both eye positions for later comparisons with the appearance-based SVM. Some normalized samples are given in Fig. 4. All images are resized to 100 × 120 pixels.

Fig. 4: Some normalized samples from the JAFFE database used in our numerical experiments. (a) Neutral, (b) Happiness, (c) Anger, (d) Fear, (e) Disgust, (f) Sadness, and (g) Surprise.

Fiducial points are manually landmarked in our experiments for both the training and testing datasets. The 52 selected fiducial points are shown in Fig. 5, with 30 first-tier points (marked with crosses) and 22 second-tier points (marked with circles). First-tier points are located independently, while second-tier points are calculated from the positions of their neighboring first-tier points by taking center or crossing points. We refer to the case of using LDA as the second-stage classifier as LDA-fusion, and to the case of using SVM as SVM-fusion. Further, the methods that apply SVM directly to feature vectors of raster-scanned pixels or of the Gabor jets stacked over all fiducial points of an image are referred to as the appearance-based SVM method and the stacked Gabor jet based SVM method, respectively.

To evaluate the performance quantitatively, we define a recognition rate as

$$r_c = \frac{1}{N} \sum_{n=1}^{N} \delta(z_n, z^*_n), \qquad (8)$$
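Eq. (8) is simply the fraction of predicted labels that match the ground truth:

```python
import numpy as np

def recognition_rate(z_true, z_pred):
    """Eq. (8): the Kronecker delta of true and estimated labels,
    averaged over the image set."""
    return np.mean(np.asarray(z_true) == np.asarray(z_pred))

print(recognition_rate([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 match -> 0.75
```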

Fig. 5: Locations of the selected fiducial points. Crosses mark first-tier fiducial points and circles mark second-tier fiducial points.

where $\delta(x, y)$ is the Kronecker delta and $z^*_n$ is the estimated label. Due to the limited size of the training dataset, the recognition rate on the training data is 1.0 or near 1.0 in almost all cases. We therefore only give results on the testing dataset and mainly focus on comparing the methods' generalization ability to untrained testing data. Numerical experiments were performed on 15 randomly selected training sets with size $N$ varying from 44 to 100. For each $N$, a pair of mutually exclusive sets was created, one with $N$ images for training and another with $213 - N$ images for testing. In total, 15 pairs of training and testing datasets are used in our experiments, and the results are summarized as follows:

Fig. 6: Recognition rate rc on the testing dataset for the appearance-based SVM method, the stacked Gabor jet based SVM method, and the SVM-fusion method. The proposed SVM-fusion achieves the best recognition performance.

1.) In Fig. 6, the recognition rate rc is plotted as a function of the training set size N for three methods: the appearance-based SVM method, the stacked Gabor jet based SVM method, and the SVM-fusion method. Compared with the appearance-based SVM, the proposed method significantly increases recognition accuracy by including explicit position matching of fiducial points. We also note that rc for the two-stage SVM is, on average, about 3.1 points higher than that of the stacked Gabor jet based SVM method.

2.) Recognition rates rc under different fusion methods are compared in Fig. 7. Average-fusion (AVE-fusion), max-fusion, LDA-fusion, and the proposed SVM-fusion were tested. We found that SVM-fusion yields the highest recognition accuracy, which we attribute to the better generalization ability of SVM compared with LDA.

3.) In the above experiments, we used linear kernels in all SVM implementations. In Fig. 8, a comparison is made between SVM-fusion with a linear kernel and with a Gaussian RBF kernel, i.e., $\phi(x_i, x_j) = \exp(-\gamma |x_i - x_j|^2)$; we take $\gamma = 0.1$ for this figure. For small training datasets the nonlinear version may cause over-fitting, which harms its generalization ability, while with enough training samples the nonlinear SVM-fusion outperforms the linear-kernel version.
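The RBF kernel used in this comparison, with the paper's γ = 0.1, can be written as:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.1):
    """Gaussian RBF kernel phi(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-gamma * diff @ diff)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical inputs -> 1.0
print(rbf_kernel([0.0], [1.0]))            # exp(-0.1), about 0.905
```

Larger γ makes the kernel narrower, which increases the risk of over-fitting on small training sets, consistent with the observation above.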

Fig. 7: Recognition rate rc on the testing dataset under different fusion methods.

Although only one sample set was tested for each bunch size, we still conclude that the proposed method performs better, because it outperforms the other methods for all sizes of testing datasets we tested.

4 Conclusions

We have proposed a method to enhance fiducial-point-based recognition of facial expressions by estimating optimal weights over both the different Gabor kernels and the different fiducial points. We achieved this goal by developing a two-stage classifier: max-win multi-class SVMs are used as the base classifiers to produce continuous outputs, and a pairwise multi-class SVM is applied for output fusion. Numerical experiments on manually landmarked fiducial points verified the effectiveness of the proposed method for facial expression recognition. Further numerical experiments on robustness against the localization error of EBGM will be carried out in our future work. We will also consider including bagging for base classifier selection and using a decision tree to organize the base classifiers for further improvements.

Fig. 8: Recognition rate rc on the testing dataset under linear and nonlinear SVM-fusion. The nonlinear SVM uses an RBF kernel with γ = 0.1.

References

[1] J. Zhang, Y. Yan, and M. Lades, "Face recognition: eigenface, elastic matching and neural nets," Proceedings of the IEEE, vol.85, pp.1423-1435, 1997.

[2] M.J. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," Proc. 3rd IEEE Int'l Conf. on Automatic Face and Gesture Recognition, vol.1, pp.200-205, 1998.

[3] L. Wiskott, J.M. Fellous, N. Kruger, and C. Malsburg, "Face recognition by elastic bunch graph matching," IEEE Trans. PAMI, vol.19, pp.775-779, 1997.

[4] R.P. Wurtz, "Object recognition robust under translations, deformations and changes in background," IEEE Trans. PAMI, vol.19, pp.769-775, 1997.

[5] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," IEEE Trans. Image Processing, vol.11, pp.467-476, 2002.

[6] Y. Pang, L. Zhang, M. Li, Z. Liu, and W. Ma, "A novel Gabor-LDA based face recognition method," Lecture Notes in Computer Science, vol.3331, pp.352-358, 2004.

[7] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol.2, no.2, pp.955-974, 1998.

[8] A. Tefas, C. Kotropoulos, and I. Pitas, "Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication," IEEE Trans. PAMI, vol.23, pp.735-746, 2001.

[9] L.I. Kuncheva, "Combining Pattern Classifiers: Methods and Algorithms," Wiley, 2004.

[10] W.F. Liu and Z.F. Wang, "Facial expression recognition based on fusion of multiple Gabor features," Proc. 18th IEEE Int'l Conf. on Pattern Recognition (ICPR'06), vol.1, pp.536-539, 2006.

[11] N. Ueda, "Optimal linear combination of neural networks for improving classification performance," IEEE Trans. PAMI, vol.22, pp.207-215, 2000.

[12] P. Ekman, W.V. Friesen, and P. Ellsworth, "Emotion in the Human Face: Guidelines for Research and an Integration of Findings," Pergamon Press, New York, 1972.
