
ROBUST FACE TRACKING WITH A CONSUMER DEPTH CAMERA

Fei Yang¹, Junzhou Huang², Xiang Yu¹, Xinyi Cui¹, Dimitris Metaxas¹

¹Rutgers University  ²University of Texas at Arlington

ABSTRACT

We address the problem of tracking human faces under various poses and lighting conditions. Reliable face tracking is a challenging task. The shapes of the faces may change dramatically with various identities, poses and expressions. Moreover, poor lighting conditions may cause a low contrast image or cast shadows on faces, which will significantly degrade the performance of the tracking system. In this paper, we develop a framework to track face shapes by using both color and depth information. Since the faces in various poses lie on a nonlinear manifold, we build piecewise linear face models, each model covering a range of poses. The low-resolution depth image is captured by using Microsoft Kinect, and is used to predict head pose and generate extra constraints at the face boundary. Our experiments show that, by exploiting the depth information, the performance of the tracking system is significantly improved.

Index Terms— Face tracking, Depth camera

1. INTRODUCTION

Reliable tracking of 3D deformable faces is a challenging task in computer vision. For one thing, the shapes of faces change dramatically with various identities, poses and expressions. For another, poor lighting conditions may cause a low contrast image or cast shadows on faces, which will significantly degrade the performance of the tracking system.

Most previous face tracking systems use a single optical camera. Existing methods can be divided into two categories: appearance based methods and feature based methods. The appearance based methods use generative appearance models to capture the shape and texture variations of faces, such as active appearance models (AAMs) [1] [2] and 3D morphable models [3]. The deformation parameters can be estimated using gradient descent optimization. These methods may suffer from weak generalization capability due to lighting and texture variations.

The feature based methods track local facial features and apply global shape models to constrain the feature locations. Vogler et al. [4] developed a system that detects facial landmarks by using active shape models (ASMs) [5]. Zhang et al. [6] proposed a method using a combination of semantic features, silhouette features, and online tracking features, and

estimated model parameters in the form of an energy minimization problem. In addition to linear shape subspaces, sparse learning based methods [7] [8] are also applied to model shape priors to handle complex shape variations and non-Gaussian errors. The feature based methods have better generalization to new faces, but they may still lose tracking due to the lack of semantic features and occlusions.

One of the main difficulties with visual descriptors on RGB data is their insufficient discrimination among shapes, textures and foreground objects. The problem gets even worse under poor lighting conditions and changing viewpoints. Depth information, on the other hand, is invariant to color, texture and lighting, making it easier to separate foreground objects from the background. With the recent development of inexpensive depth cameras, there is rapidly growing interest in exploiting depth information in vision systems.

A large body of previous work with depth cameras focuses on estimating the gestures of human bodies [9] [10]. As 3D face models attract more attention for face recognition and animation [3] [11], some recent studies also use depth cameras to recover 3D face shapes and other characteristics. Fanelli et al. [12] developed a random forest algorithm to estimate head orientation from range data. Cai et al. [13] developed a maximum likelihood solution to track face shapes from noisy input depth data. Weise et al. [14] developed a realtime system to reconstruct 3D head shapes from range data, which are used to generate face animations. Lai et al. [15] proposed a method for object recognition combining both RGB and depth data. Other applications include 3D scene reconstruction [16] and indoor robotics [17].

In this paper, we develop a framework to track face shapes by using both color and depth information. Since the faces in various poses lie on a nonlinear manifold, we build piecewise linear face models, each model covering a range of poses. The low-resolution depth image is captured by using Microsoft Kinect, and is used to predict the head pose and generate extra constraints at the face boundary. KLT trackers [18] are employed to track individual facial landmarks under global shape constraints.

An overview of our system is shown in Fig. 1. A Kinect sensor is used to capture both RGB and depth data, and the depth image is used to estimate the face orientation. The face subspace of the closest pose is selected to constrain the shape of the face.



Fig. 1. Overview of our system. (a) A Kinect sensor is used to capture both RGB and depth data. (b) The depth image is used to estimate the face orientation. (c) The face subspace of the closest pose is selected to constrain the face shape. (d) Both RGB and depth information are used to track the face.

We estimate the landmark locations using both the 1D gradient features described in [5] and the face silhouette captured from the depth image. Finally the landmarks are tracked with the global shape constraints.

2. POSE ESTIMATION AND FACE SUBSPACES

We follow the work of Fanelli et al. [12] to estimate the face orientation. In their approach, head pose estimation is formulated as a regression problem, and the pose parameters are estimated directly from the depth data. The regression is implemented within a random forest framework, learning a mapping from simple depth features to a probabilistic estimation of real-valued parameters such as the 3D nose coordinates and the head rotation angles. The system works in real time on a frame-by-frame basis.
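The following sketch illustrates this kind of depth-based pose regression with scikit-learn's RandomForestRegressor. The depth-difference features, the training data, and all parameter values are illustrative placeholders, not the voting forest of Fanelli et al. [12].

```python
# Minimal sketch of regressing head pose from depth data with a random forest.
# Features, data, and parameters are placeholders for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def depth_patch_features(depth, center, size=32, n_pairs=64, rng=None):
    """Mean-depth differences between random sub-rectangles of a patch at center=(y, x)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y, x = center
    half = size // 2
    patch = depth[y - half:y + half, x - half:x + half].astype(np.float32)
    feats = []
    for _ in range(n_pairs):
        (y1, x1), (y2, x2) = rng.integers(0, size - 8, size=(2, 2))
        feats.append(patch[y1:y1 + 8, x1:x1 + 8].mean() - patch[y2:y2 + 8, x2:x2 + 8].mean())
    return np.asarray(feats)

# Placeholder training set: one feature vector per depth frame, labels are
# (yaw, pitch, roll) in degrees; real training would use annotated depth data.
X_train = np.random.rand(500, 64)
y_train = np.random.uniform(-60, 60, size=(500, 3))

forest = RandomForestRegressor(n_estimators=20, max_depth=12)
forest.fit(X_train, y_train)
yaw, pitch, roll = forest.predict(np.random.rand(1, 64))[0]
```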

The face shapes under various poses lie on a hypersphere-like manifold. Traditional shape models, including active shape models (ASMs) [5] and active appearance models (AAMs) [19], constrain the variation of face shapes to a linear subspace, which cannot handle large pose changes. To solve this problem, we train piecewise linear ASM models. The model is switched during the tracking procedure, based on the head pose estimated from the depth data.

We train a total of 7 ASM models, i.e., frontal, half profile (left and right), full profile (left and right), upper, and lower. Each model covers a range of face poses. Some examples of the training faces are shown in Fig. 2.
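As a concrete illustration, the tracker could switch among the seven models with a simple rule on the estimated yaw and pitch angles. The thresholds and model labels below are assumptions for illustration, not values taken from the paper.

```python
# Illustrative switching among the seven pose-specific ASM models based on the
# estimated head pose. Angle thresholds and labels are assumptions.
def select_asm_model(yaw_deg, pitch_deg):
    if pitch_deg > 20:
        return "upper"
    if pitch_deg < -20:
        return "lower"
    if yaw_deg <= -60:
        return "full_profile_left"
    if yaw_deg <= -25:
        return "half_profile_left"
    if yaw_deg >= 60:
        return "full_profile_right"
    if yaw_deg >= 25:
        return "half_profile_right"
    return "frontal"

# asm_models would map each label to its trained mean shape and PCA basis:
# model = asm_models[select_asm_model(yaw, pitch)]
```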

3. FACE TRACKING

Traditional shape fitting methods use the mean shape as the initial shape placed on the face image.


Fig. 2. Training set of the active shape models. (a) frontal; (b) left half profile; (c) lower; (d) right half profile.

The current shape is iteratively modified by updating each landmark position and then constraining the overall shape to lie in the shape subspace. The active shape model and its recent extension [20] use 1D and 2D profile models to locate the approximate position of each landmark by template matching. Any template matcher can be used; for instance, the classical ASM forms a fixed-length normalized gradient vector (called the profile) by sampling the image along a line orthogonal to the shape boundary at the landmark. However, the gradient features often suffer from a lack of discrimination, especially under poor lighting conditions and with complex backgrounds. The landmarks on the face boundary are then often misaligned to background points with stronger gradients.
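For reference, a minimal version of this classical 1D profile matcher might look as follows. The profile length, normalization, and search range follow common ASM conventions and are assumptions here, not details from the paper.

```python
# Sketch of the classical ASM 1D profile matcher: sample a normalized gradient
# profile along the landmark normal and score candidate offsets by Mahalanobis
# distance to the trained profile model (mean g_mean, inverse covariance S_inv).
import numpy as np

def sample_profile(gray, pt, normal, k=6):
    """Normalized 1D gradient profile of length 2k+1 along the normal at pt=(x, y)."""
    offsets = np.arange(-k, k + 2)                        # 2k+2 samples -> 2k+1 gradients
    coords = pt[None, :] + offsets[:, None] * normal[None, :]
    xs = np.clip(np.rint(coords[:, 0]).astype(int), 0, gray.shape[1] - 1)
    ys = np.clip(np.rint(coords[:, 1]).astype(int), 0, gray.shape[0] - 1)
    grad = np.diff(gray[ys, xs].astype(np.float32))       # 1D gradient along the profile
    return grad / (np.abs(grad).sum() + 1e-6)             # normalize, as in classical ASM

def best_offset(gray, pt, normal, g_mean, S_inv, search=4, k=6):
    """Offset along the normal that minimizes the Mahalanobis profile distance."""
    scores = []
    for t in range(-search, search + 1):
        d = sample_profile(gray, pt + t * normal, normal, k) - g_mean
        scores.append(float(d @ S_inv @ d))
    return int(np.argmin(scores)) - search
```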

We propose a new local fitting method to solve this problem. For face boundary landmarks, instead of only matching the Mahalanobis distance to the profile model, we also estimate the edge information in the depth map. The score of each candidate position along the boundary normal is defined as

d^2 = (g - \bar{g})^T S_g^{-1} (g - \bar{g}) + \lambda \|\nabla I_D\|^2    (1)

where g is the gradient profile at the current position, and \bar{g} and S_g are the mean and covariance matrix acquired during the training phase. The additional term \|\nabla I_D\|^2 measures the edge intensity of the depth map I_D, and \lambda is a parameter controlling the trade-off between the two terms.
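A small sketch of evaluating Eq. (1) at a candidate boundary position is given below. The Sobel-based depth gradient and the value of \lambda are assumptions, and sample_profile refers to the profile matcher sketched above.

```python
# Sketch of the boundary-landmark score of Eq. (1): Mahalanobis distance of the
# intensity gradient profile plus lambda times the squared depth-gradient
# magnitude. The Sobel-based depth gradient and lambda value are assumptions.
import cv2
import numpy as np

def depth_edge_strength(depth):
    """Per-pixel squared gradient magnitude ||grad I_D||^2 of the depth map."""
    d = depth.astype(np.float32)
    gx = cv2.Sobel(d, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(d, cv2.CV_32F, 0, 1, ksize=3)
    return gx ** 2 + gy ** 2

def boundary_score(gray, depth_edges, pt, normal, g_mean, S_inv, lam=0.1, k=6):
    """Evaluate d^2 of Eq. (1) for a candidate position pt on the boundary normal."""
    diff = sample_profile(gray, pt, normal, k) - g_mean   # profile matcher sketched above
    x = int(np.clip(round(float(pt[0])), 0, depth_edges.shape[1] - 1))
    y = int(np.clip(round(float(pt[1])), 0, depth_edges.shape[0] - 1))
    return float(diff @ S_inv @ diff) + lam * float(depth_edges[y, x])
```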

Running the ASM in every frame is computationally expensive and causes jittering. To solve this problem, we track the features across consecutive frames using the KLT tracker [18]. The KLT tracker registers two local features and computes the displacement of each feature by minimizing an intensity matching cost, efficiently giving the new location of each landmark in the next frame.


The landmark coordinates are then projected into the shape subspace selected by the pose estimator, and the positions are further updated by using the global shape constraints.
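A minimal sketch of this tracking step, using OpenCV's pyramidal KLT implementation and a PCA-style shape constraint, is given below. The similarity alignment between image and model coordinates is omitted, and the mean_shape/basis/eigvals interface is an assumption rather than the paper's code.

```python
# Sketch of propagating landmarks with the pyramidal KLT tracker and then
# constraining the tracked shape with the selected linear shape subspace.
# Procrustes alignment to the model frame is omitted; the mean_shape/basis/
# eigvals interface is assumed for illustration.
import cv2
import numpy as np

def track_landmarks(prev_gray, gray, landmarks):
    """One KLT step; landmarks is an (N, 2) float32 array of (x, y) positions."""
    pts = landmarks.reshape(-1, 1, 2).astype(np.float32)
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                                     winSize=(21, 21), maxLevel=3)
    return new_pts.reshape(-1, 2), status.ravel().astype(bool)

def constrain_shape(landmarks, mean_shape, basis, eigvals, max_std=3.0):
    """Project the shape onto the PCA subspace and clamp each coefficient."""
    x = landmarks.ravel() - mean_shape            # (2N,) deviation from the mean shape
    b = basis.T @ x                               # subspace coefficients
    limit = max_std * np.sqrt(eigvals)            # e.g. keep within +/- 3 std dev
    b = np.clip(b, -limit, limit)
    return (mean_shape + basis @ b).reshape(-1, 2)
```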

4. EXPERIMENTS

We test our system with a subject sitting in front of the Kinect sensor and performing different head poses and facial expressions. The background is static, but with rich textures. As shown in Fig. 3, our system effectively tracks the facial landmarks, while disabling the depth data often leads to lost tracking in this scenario.

5. CONCLUSION

In this paper, we develop a framework to track face shapes by using both color and depth information. Since the faces in various poses lie on a nonlinear manifold, we build piecewise linear face models, each model covering a range of poses. The low-resolution depth image is captured by using Microsoft Kinect, and is used to predict head pose and generate extra constraints at the face boundary. Our experiments show that, by exploiting the depth information, the performance of the tracking system is significantly improved.

6. REFERENCES

[1] Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade, "Real-time combined 2D+3D active appearance models," in Proc. CVPR, 2004, pp. 535–542.

[2] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn, "Face alignment through subspace constrained mean-shifts," in Proc. ICCV, 2009, pp. 1034–1041.

[3] Volker Blanz and Thomas Vetter, "A morphable model for the synthesis of 3D faces," in Proc. SIGGRAPH, 1999, pp. 187–194.

[4] Christian Vogler, Zhiguo Li, Atul Kanaujia, Siome Goldenstein, and Dimitris N. Metaxas, "The best of both worlds: Combining 3D deformable models with active shape models," in Proc. ICCV, 2007, pp. 1–7.

[5] Timothy F. Cootes, Christopher J. Taylor, David H. Cooper, and Jim Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.

[6] Wei Zhang, Qiang Wang, and Xiaoou Tang, "Real time feature based 3-D deformable face tracking," in Proc. ECCV, 2008, pp. 720–732.

[7] Shaoting Zhang, Yiqiang Zhan, Maneesh Dewan, Junzhou Huang, Dimitris N. Metaxas, and Xiang Sean Zhou, "Towards robust and effective shape modeling: Sparse shape composition," Medical Image Analysis, vol. 16, no. 1, pp. 265–277, 2012.

[8] Fei Yang, Junzhou Huang, and Dimitris N. Metaxas, "Sparse shape registration for occluded facial feature localization," in Proc. 9th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2011, pp. 272–277.

[9] Jamie Shotton, Andrew W. Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake, "Real-time human pose recognition in parts from single depth images," in Proc. CVPR, 2011, pp. 1297–1304.

[10] Mao Ye, Xianwang Wang, Ruigang Yang, Liu Ren, and Marc Pollefeys, "Accurate 3D pose estimation from a single depth image," in Proc. ICCV, 2011, pp. 731–738.

[11] Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev, and Dimitris Metaxas, "Expression flow for 3D-aware face component transfer," ACM Trans. Graphics, vol. 30, no. 4, p. 60, 2011.

[12] Gabriele Fanelli, Thibaut Weise, Juergen Gall, and Luc J. Van Gool, "Real time head pose estimation from consumer depth cameras," in Proc. DAGM-Symposium, 2011, pp. 101–110.

[13] Qin Cai, David Gallup, Cha Zhang, and Zhengyou Zhang, "3D deformable face tracking with a commodity depth camera," in Proc. ECCV, 2010, pp. 229–242.

[14] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly, "Realtime performance-based facial animation," ACM Trans. Graphics, vol. 30, no. 4, p. 77, 2011.

[15] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, "Sparse distance learning for object recognition combining RGB and depth information," in Proc. ICRA, 2011, pp. 4007–4013.

[16] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard A. Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew J. Davison, and Andrew W. Fitzgibbon, "KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera," in Proc. 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 559–568.

[17] Joao Cunha, Eurico Pedrosa, Cristovao Cruz, Antonio J. R. Neves, and Nuno Lau, "Using a depth camera for indoor robot localization and navigation," in Robotics Science and Systems 2011 Workshop on Advanced Reasoning with Depth Cameras, 2011.

[18] Jianbo Shi and Carlo Tomasi, "Good features to track," in Proc. CVPR, 1994, pp. 593–600.


Fig. 3. Face tracking results

[19] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor, "Active appearance models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, 2001.

[20] Stephen Milborrow and Fred Nicolls, "Locating facial features with an extended active shape model," in Proc. ECCV, 2008, pp. 504–513.