Affordable 3D Face Tracking Using Projective Vision

D.O. Gorodnichy (author to whom correspondence should be sent), S. Malik, G. Roth
Computational Video Group, National Research Council, Ottawa, Canada K1A 0R6
School of Computer Science, Carleton University, Ottawa, Canada, K1S 5B6
http://www.cv.iit.nrc.ca/research/Stereotracker

Abstract

For humans, viewing a scene with two eyes is clearly more advantageous than viewing it with one. In computer vision, however, most high-level vision tasks, face tracking being one example, are still performed with a single camera. This is because, unlike in the human brain, the relationship between the images observed by two arbitrary video cameras is in many cases not known. Recent advances in projective vision theory, however, have produced a methodology for computing this relationship. The relationship is obtained naturally while observing the same scene with both cameras, and knowing it not only makes it possible to track features in 3D, but also makes tracking much more robust and precise. In this paper, we establish a framework based on projective vision for tracking faces in 3D using two arbitrary cameras, and describe a stereo tracking system which uses the proposed framework to track faces in 3D with the aid of two USB cameras. While being very affordable, our stereo tracker exhibits pixel-size precision and is robust to head rotations about all three axes.

1 Introduction

We consider the problem of tracking faces using a video camera and focus our attention on the design of vision-based perceptual user interface systems [28]. The main applications of these systems are in HCI, teleconferencing, entertainment, security, and industry for the disabled [27, 30].

Being a high-level vision problem, face tracking poses four major challenges: robustness, precision, speed and affordability.
While the last two have become much less critical over the last few years, due to the significant increase in computer power and decrease in camera cost, the first two remain unresolved.

Approaches to face tracking can be divided into two classes: global (image-based) and local (feature-based) [10, 31]. Global approaches use global cues such as skin colour, head geometry and motion; they are more robust, but cannot be used for pixel-size precision tracking. Local approaches, which are based on tracking facial features, can in theory track very precisely. In practice, however, they do not, because they are very fragile in the face of the variety of motions and expressions a face may exhibit.

While for humans it is definitely easier to track objects with two eyes than with one, in computer vision face tracking is usually done with one camera only. A few authors do use stereo for face tracking [14, 15, 20, 29]. They, however, use the second camera mainly for the purpose of acquiring the third dimension rather than for making tracking more robust, precise or affordable. In fact, conventional stereo setups are usually precalibrated and quite expensive.

The problem is that the human brain knows and makes use of the relationship between the two images while processing them [2, 3], while computers do not. Recent advances in computer vision, though, have provided a projective-vision-based methodology that allows one to compute this relationship for any two cameras [9, 23]. The relationship is obtained naturally while observing the same scene with both cameras and is represented by the fundamental matrix, which relates the two images to one another.

This paper describes how to use projective vision techniques for tracking faces with two arbitrary cameras.
The most significant result is that computing the fundamental matrix not only allows one to recover the 3D position of the object with low-cost cameras, but also makes tracking much more robust. Combining the proposed projective-vision-based tracking approach with the robust convex-shape nose tracking technique described in [5] allowed us to build a stereo tracking system which is able to track faces in 3D using two generic USB cameras. The robustness of the system, the binary code of which can be downloaded from our website, is such that head rotations of up to 40 degrees about all three axes can be tracked.

The paper is organized as follows. After presenting the outline of our framework for affordable stereo tracking (Section 2), we recap the projective vision properties which make the framework possible (Section 3). This is followed by the description of the stereo selfcalibration procedure


Figure 1: Stereo tracking with two web-cameras. The key issue is finding the relationship between the cameras.

(Section 4). Then we describe the tracking procedure of the framework and present the experimental results (Section 5).

2 Stereo system setup

In order to present a framework for doing stereo tracking with two arbitrary uncalibrated cameras, we will describe the StereoTracker developed by our group, in which this framework is implemented.

The StereoTracker allows a user to do high-level vision tasks with off-the-shelf vision equipment. It tracks the face of a user in 3D with the aid of two ordinary USB web-cameras and consists of three major modules: 1) stereo selfcalibration, 2) facial feature learning, and 3) feature tracking. The setup of the system is as follows.

A user mounts two USB cameras on top of the computer monitor so that his face is seen by both cameras (see Figure 1), after which the user runs the selfcalibration module of the StereoTracker to acquire the relationship between the cameras. For this, the user captures his head at the same time with both cameras at the distance to be used in tracking (Figure 2). When the stereo calibration information has been calculated, the StereoTracker makes use of it to learn robust facial features (Figure 3), and then makes use of it again while tracking the features in 3D (Figure 4).

Before proceeding to the description of the calibration procedure, which is the basis of our stereo tracking framework, we need to describe the notation and properties to be used throughout the paper.

3 Projective vision interlude

3.1 Points, lines and calibration matrix

According to the projective paradigm, every 2D pixel p = (x, y) of an image is associated with a 3D vector v which starts at the camera origin and passes through all 3D space points that are projected to that pixel on the image plane.¹ A line in an image is represented by a 3D vector l which is perpendicular to all points v belonging to the line: l^T v = 0.

The relationship between the vector v and the pixel position p of a point in the image is expressed as

    ~p ≃ K v,   (1)

where ~p denotes the 3D vector obtained from the 2D vector p by adding one as the last element. The matrix K in this equation, the simplified form of which is written below, is termed the calibration matrix of the camera. It describes the intrinsic parameters of the camera, the most important of which are the center of the image (u0, v0), measured in pixel coordinates, and the focal length f of the camera, defined as the distance from the camera origin to the camera image plane, measured in pixels:

    K = | f  0  u0 |
        | 0  f  v0 |
        | 0  0  1  |   (2)

Computing the calibration matrix of the camera constitutes the calibration process of the camera. The goal of selfcalibration is to compute this matrix directly from an image sequence, without resorting to a formal calibration procedure. One way to do this is by using the concept of the fundamental matrix.
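As a concrete illustration, the pixel-ray relation of Eqs. 1-2 can be sketched in a few lines of numpy. The intrinsic values below (f = 400 pixels, image center (160, 120)) are assumed for illustration only; they are not measured values from the paper's cameras.

```python
import numpy as np

# Assumed intrinsics of a 320x240 webcam (illustrative values).
f = 400.0               # focal length in pixels
cx, cy = 160.0, 120.0   # image center (principal point) in pixels

# Simplified calibration matrix of Eq. 2.
K = np.array([[f,   0.0, cx],
              [0.0, f,   cy],
              [0.0, 0.0, 1.0]])

def pixel_to_ray(p):
    """Back-project a 2D pixel to its 3D viewing ray: v ~ K^-1 ~p (Eq. 1)."""
    p_h = np.array([p[0], p[1], 1.0])   # homogeneous pixel ~p
    return np.linalg.solve(K, p_h)

ray = pixel_to_ray((200.0, 100.0))
# Re-projecting the ray through K recovers the original pixel (up to scale),
# confirming the two directions of Eq. 1 are consistent.
back = K @ ray
back = back / back[2]
```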

3.2 Stereo and fundamental matrix

When a 3D point P in space is observed by two cameras, it is projected onto the image plane of each camera. This generates two vectors v and v' starting from the origin of each camera. These vectors are related to each other through the equation

    v^T (t × R v') = 0,   (3)

where t is the translation vector between the camera positions and R is the rotation matrix. This equation simply states the fact that the vectors v, t and R v' are coplanar. In computer vision this equation is known as the epipolar constraint and is usually written as

    v^T E v' = 0,   (4)

where

    E := [t]× R   (5)

is termed the essential matrix ([t]× denotes the skew-symmetric matrix for which [t]× v = t × v).
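The construction of Eqs. 3-5 is easy to check numerically: for any 3D point seen by both cameras, the bilinear form v^T E v' vanishes. The relative pose below (a 10-degree rotation and a roughly 15 cm baseline) is an assumed example, not a pose from the paper.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x of Eq. 5, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0,  -t[2],  t[1]],
                     [t[2],  0.0,  -t[0]],
                     [-t[1], t[0],  0.0]])

# Assumed relative pose: camera-1 coordinates X relate to camera-2
# coordinates X' by X = R X' + t (illustrative values).
th = np.deg2rad(10.0)
R = np.array([[np.cos(th), 0.0, np.sin(th)],
              [0.0,        1.0, 0.0],
              [-np.sin(th), 0.0, np.cos(th)]])
t = np.array([0.15, 0.0, 0.02])     # ~15 cm baseline

E = skew(t) @ R                     # essential matrix of Eq. 5

# A 3D point observed by both cameras satisfies the epipolar constraint (4);
# the scale of the viewing vectors is irrelevant.
Xp = np.array([0.3, -0.1, 1.2])     # point in camera-2 coordinates
X = R @ Xp + t                      # the same point in camera-1 coordinates
constraint = X @ E @ Xp             # v^T E v'
```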

The epipolar constraint holds for any two-camera setup and is very important for 3D applications, as it defines the relationship between the corresponding points in the two camera images. For a handmade stereo setup with off-the-shelf

¹ Some authors [11, 8] prefer using the vector (x, y, f)^T instead of (x, y, 1)^T in order to avoid unbalanced values.


Figure 2: Images captured by the two cameras to be used in selfcalibration. They should have enough visual features.

cameras, t and R are not known. If the cameras are uncalibrated, then the matrix K is not known either. In this case the epipolar constraint is rewritten, using Eq. 1, as

    ~p^T F ~p' = 0,   (6)

where ~p = (x, y, 1)^T and ~p' = (x', y', 1)^T are the raw pixel coordinates corresponding to the calibrated vectors v and v', and the matrix F, defined as

    F := K^-T E K^-1,   (7)

is termed the fundamental matrix of the stereo setup.

Computing the fundamental matrix constitutes the calibration process of the stereo setup. The next section describes this process in detail for the case of low-quality cameras; below we describe the properties of the fundamental matrix to be used in stereo tracking.
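Equation 7 can be exercised on synthetic data: project one 3D point through two assumed cameras with shared intrinsics and check Eq. 6 directly in raw pixel coordinates. All numeric values here are illustrative, not calibration results from the paper.

```python
import numpy as np

# Assumed shared intrinsics (Eq. 2) and relative pose (illustrative values).
K = np.array([[400.0, 0.0, 160.0],
              [0.0, 400.0, 120.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.15, 0.0, 0.0])      # pure sideways translation
tx = np.array([[0.0,  -t[2],  t[1]],
               [t[2],  0.0,  -t[0]],
               [-t[1], t[0],  0.0]])
E = tx @ R                          # essential matrix (Eq. 5)

# Eq. 7: the fundamental matrix relates raw pixels of the two images.
Kinv = np.linalg.inv(K)
F = Kinv.T @ E @ Kinv

# Project one 3D point into both cameras (X = R X' + t convention of Eq. 3).
Xp = np.array([0.2, 0.05, 1.0])     # camera-2 coordinates
X = R @ Xp + t                      # camera-1 coordinates
p = K @ X;  p = p / p[2]            # homogeneous pixel ~p
pp = K @ Xp; pp = pp / pp[2]        # homogeneous pixel ~p'
pix_residual = p @ F @ pp           # Eq. 6: ~p^T F ~p' = 0
```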

3.3 Epipolar lines

Given a point ~p in one image, the fundamental matrix can be used to compute a line in the other image on which the matching point must lie. This line is called the epipolar line and can be computed as

    l' = F^T ~p.   (8)

It is this piece of extra information that makes tracking with two cameras much more robust than tracking with one camera. In order to filter out bad matches, the epipolar error, defined as the sum of the squared distances of the points to their epipolar lines, can be used [9]. Using Eq. 8, the relationship between the epipolar error ε and the fundamental matrix F can be derived as

    ε := d^2(~p', l') + d^2(~p, l)
       = (~p^T F ~p')^2 · [ 1/((F^T ~p)_1^2 + (F^T ~p)_2^2) + 1/((F ~p')_1^2 + (F ~p')_2^2) ],   (9)

where l = F ~p' is the epipolar line of ~p' in the first image and (·)_i denotes the i-th component of a vector. The following proposition can then be used to make tracking more robust.

Proposition: Provided that the fundamental matrix F of the stereo setup is known, the best pair of matches ~p and ~p' corresponding to a 3D feature is the one that minimizes the epipolar error defined by Eq. 9.
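A minimal sketch of the epipolar error of Eq. 9 follows. The fundamental matrix used for the check is an assumed one, corresponding to a rectified (purely sideways-translated, identity-intrinsics) pair, for which matching points must share an image row.

```python
import numpy as np

def epipolar_error(F, p, pp):
    """Epipolar error of Eq. 9 for a candidate match (p, pp).

    p and pp are homogeneous pixels (x, y, 1) in the first and second image;
    the error is the sum of squared distances of each point to its epipolar line."""
    lp = F.T @ p                   # epipolar line of p in the second image (Eq. 8)
    l = F @ pp                     # epipolar line of pp in the first image
    alg = float(p @ F @ pp) ** 2   # squared algebraic residual of Eq. 6
    return alg / (lp[0] ** 2 + lp[1] ** 2) + alg / (l[0] ** 2 + l[1] ** 2)

# Assumed fundamental matrix of a rectified sideways-translating pair:
# correct matches keep their row, so error grows with the row mismatch.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
e_good = epipolar_error(F, np.array([100.0, 50.0, 1.0]), np.array([120.0, 50.0, 1.0]))
e_bad = epipolar_error(F, np.array([100.0, 50.0, 1.0]), np.array([120.0, 53.0, 1.0]))
```

For this F, a 3-pixel row mismatch contributes 3² in each image, so e_bad comes out to 18 while e_good is zero.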

4 Stereo system selfcalibration

The calibration procedure described below is implemented using the public domain Projective Vision Toolkit (PVT) [18] and is based on finding correspondences in two images captured with both cameras observing the same static scene. Because off-the-shelf USB cameras are usually of low quality and resolution, extra care is taken to deal with bad matches by applying a set of filters and using robust statistics. A description of each step follows.

Finding interest points. After two images of the same scene are taken, the first step is to find a set of corners, or interest points, in each image. These are the points where there is a significant change in intensity gradient in both the x and y directions. A local interest point operator [26] is used and a fixed number of corners is returned. The final results are not particularly sensitive to the number of corners; typically, on the order of 200 corners are found in each image.

Matching corners and symmetry filter. The next step is to match corners between the images. A local window around each corner is correlated against all other corner windows in the adjacent image that are within a certain pixel distance [32]. This distance represents an upper bound on the maximum disparity and is set to 1/3 of the image size. All corner pairs that pass a minimum correlation threshold are then filtered using a symmetry test, which requires the correlation to be maximal in both directions. This filters out half of the matches and forces the remaining matches to be one-to-one.
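The symmetry test can be sketched as follows, given a table C of correlation scores between corners of the two images. The scores below are toy values; a real implementation would fill C by correlating intensity windows.

```python
import numpy as np

def mutual_best_matches(C):
    """Symmetry filter: keep corner pairs (i, j) whose correlation C[i, j] is
    the maximum of both row i and column j, forcing matches to be one-to-one."""
    matches = []
    for i in range(C.shape[0]):
        j = int(np.argmax(C[i]))          # best match of left corner i
        if int(np.argmax(C[:, j])) == i:  # is i also the best match of j?
            matches.append((i, j))
    return matches

# Toy correlation table between 3 corners in each image (illustrative values).
# Corner 2 prefers corner 0, but corner 0 prefers it back less than corner 0
# of the left image does, so that pair is rejected.
C = np.array([[0.95, 0.20, 0.10],
              [0.30, 0.15, 0.90],
              [0.92, 0.40, 0.25]])
kept = mutual_best_matches(C)
```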

Disparity gradient filter. The next step is to perform local filtering of these matches. We use a relaxation-like process based on the concept of the disparity gradient [12], which measures the compatibility of two correspondences between an image pair. It is defined as the ratio

    D = |d_i - d_j| / |d_cyc|,   (10)

where d_i and d_j are the disparity vectors of the two corners, d_cyc is the vector that joins the midpoints of these disparity vectors, and |·| denotes the length of a vector. The smaller the disparity gradient, the more the two correspondences are in agreement with each other. This filter is very efficient: at a very low computational cost, it removes a significant number of incorrect matches.
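Equation 10 can be sketched directly. The corner positions and disparity vectors below are hypothetical, chosen only to show that consistent disparities score low and a wild disparity on a nearby corner scores high.

```python
import numpy as np

def disparity_gradient(p1, d1, p2, d2):
    """Disparity gradient of Eq. 10 for two correspondences.

    p1, p2 are corner positions in the first image and d1, d2 their disparity
    vectors; the denominator is the vector joining the midpoints of the two
    disparity segments (the cyclopean separation)."""
    p1, d1, p2, d2 = map(np.asarray, (p1, d1, p2, d2))
    d_cyc = (p2 + d2 / 2.0) - (p1 + d1 / 2.0)
    return float(np.linalg.norm(d1 - d2) / np.linalg.norm(d_cyc))

# Two nearby corners with identical disparity agree perfectly (gradient 0);
# a wildly different disparity on a nearby corner signals a bad match.
g_ok = disparity_gradient([50, 60], [8, 0], [70, 60], [8, 0])
g_bad = disparity_gradient([50, 60], [8, 0], [52, 60], [-20, 6])
```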

Using random sampling. The final step is to use the filtered matches to compute the fundamental matrix. This process must be robust, since it still cannot be assumed that all filtered correspondences are correct. Robustness is achieved by using a random sampling algorithm. This is a "generate and test" process in which a minimal set of correspondences, the smallest number necessary to define a unique fundamental matrix (7 points), is randomly chosen [25, 22]. A fundamental matrix is then computed from this minimal set using Eq. 6. The set of all corners that satisfy this fundamental matrix, in terms of Eq. 9, is called the support set. The fundamental matrix with the largest support set is used in stereo tracking. Before it can be used, however, it needs to be evaluated, because robust and precise tracking is possible only under the assumption that the computed fundamental matrix correctly represents the current stereo setup.
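The generate-and-test loop can be sketched as below. For brevity this sketch substitutes a linear 8-point fit for the paper's 7-point minimal solver, and an algebraic residual for the epipolar distance of Eq. 9; the data are a synthetic, outlier-free rectified pair, so every correspondence should land in the support set.

```python
import numpy as np

def fit_F_linear(p, pp):
    """Linear least-squares fit of F from >= 8 correspondences (8-point style),
    with the rank-2 constraint enforced. Stands in for the 7-point solver."""
    # Each correspondence gives one row: ~p^T F ~p' = F.ravel() . outer(~p, ~p').ravel()
    A = np.stack([np.outer(a, b).ravel() for a, b in zip(p, pp)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)   # null vector of A
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0                                  # a fundamental matrix has rank 2
    return U @ np.diag(s) @ Vt

def ransac_F(p, pp, iters=100, thresh=1e-4, seed=0):
    """Generate and test: sample minimal sets, keep the F with the largest support."""
    rng = np.random.default_rng(seed)
    best_F, best_support = None, -1
    for _ in range(iters):
        idx = rng.choice(len(p), 8, replace=False)
        F = fit_F_linear(p[idx], pp[idx])
        resid = np.abs(np.einsum('ij,jk,ik->i', p, F, pp))  # algebraic residuals
        support = int((resid < thresh).sum())
        if support > best_support:
            best_F, best_support = F, support
    return best_F, best_support

# Synthetic rectified pair: every match keeps its row, disparity varies per point.
rng = np.random.default_rng(1)
n = 30
x, y = rng.uniform(0, 320, n), rng.uniform(0, 240, n)
p = np.stack([x, y, np.ones(n)], axis=1)
pp = np.stack([x + rng.uniform(5, 40, n), y, np.ones(n)], axis=1)
F_best, support = ransac_F(p, pp)
```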

4.1 Evaluating the quality of calibration

The evaluation is done in two ways: analytically, by examining the size of the support set, and visually, by examining the epipolar lines.

It has been found empirically that for the fundamental matrix to be correct it should have at least 35 matches in the support set. If the number is smaller, it means that either i) there were not enough visual features present in the images, or ii) the cameras are located too far from each other. In this case the cameras have to be repositioned, or some extra objects, in addition to the head, should be added to the cameras' field of view, and the calibration procedure should be repeated.

Figure 3 shows the epipolar line (on the right side) corresponding to the tip of the nose (on the left side) as obtained for the stereo setup consisting of the two Intel USB webcams shown in Figure 1. As can be seen, it passes correctly through the nose, thus verifying the correctness of the fundamental matrix and providing a useful constraint for locating the nose in the tracking stage.

4.2 Finding the calibration matrix

Knowing the matrix F allows one to compute the matrix K. Under the assumption that the intrinsic parameters of both cameras are the same, the approach we use is that of [17]. It is based on the fact that the matrix E, defined by Eq. 5 and related to the matrix F through Eq. 7, has exactly two non-zero and equal singular values. The idea is to find the calibration matrix that makes the two singular values of E as close as possible. Given the two non-zero singular values of E, σ1 and σ2 (σ1 ≥ σ2), consider the difference (σ1 - σ2)/σ1.

Given a fundamental matrix F, selfcalibration proceeds by finding the calibration matrix that minimizes this difference. This is an optimization problem, which is solved by a dynamic hill climbing gradient descent approach [16]. In [24] it is shown that this procedure, while quite simple, is not inferior to more complicated approaches to selfcalibration, such as those using the Kruppa equations [13].
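The selfcalibration criterion can be sketched as follows. A coarse grid scan over candidate focal lengths stands in for the paper's hill-climbing optimizer, and the ground-truth setup (f = 400 pixels, a small rotation, a short baseline) is an assumed example used only to synthesize F.

```python
import numpy as np

def calib(f, cx=160.0, cy=120.0):
    """Simplified calibration matrix of Eq. 2 (assumed principal point)."""
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

def selfcal_cost(F, f):
    """(sigma1 - sigma2)/sigma1 of E = K^T F K: zero at the correct focal length,
    since a true essential matrix has two equal non-zero singular values."""
    K = calib(f)
    s = np.linalg.svd(K.T @ F @ K, compute_uv=False)
    return (s[0] - s[1]) / s[0]

# Assumed ground truth: f = 400, a 10-degree rotation and a small baseline.
th = np.deg2rad(10.0)
R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
t = np.array([0.15, 0.03, 0.02])
tx = np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0.0]])
Kinv = np.linalg.inv(calib(400.0))
F = Kinv.T @ (tx @ R) @ Kinv        # fundamental matrix of Eq. 7

# Coarse scan in place of hill climbing: the cost dips at the true focal length.
fs = [300.0, 350.0, 400.0, 450.0, 500.0]
f_best = min(fs, key=lambda f: selfcal_cost(F, f))
```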

Knowing the matrix K and the focal length allows one to reconstruct the spatial relationship of the observed features. For the tracking process, however, this information is not required.

Figure 3: Learning the features using the epipolar constraints. The nose feature must lie on the epipolar line.

5 Stereo feature tracking

We use the pattern recognition paradigm to represent a feature as a multi-dimensional vector made of feature attributes. In the case of image-based feature tracking, the feature attributes are the image intensities obtained by centering a specially designed peephole mask on the position of the feature (as in [6]).

The major advantage of stereo tracking, i.e. tracking with two cameras, over tracking with one camera is the additional epipolar constraint, which ensures that the observed features belong to a rigid body (see Section 3.3). Using this constraint allows us to relax the matching constraint used to enforce the visual similarity between the observed and the template feature vectors. The matching constraint is the main reason why feature-based tracking with one camera is not robust, and relaxing it makes tracking much more robust.

5.1 Feature learning

In this work we are not concerned with automatic detection of facial features: all features are chosen manually at the learning stage. During this stage, the user selects a feature in one of the pair of images by clicking with the mouse (see Figure 3). This generates an epipolar line in the other image, and the template vectors of both the selected feature and its best match lying on the epipolar line in the other image are stored. Examples of learnt templates can be seen in Figure 4, underneath the face images. By visually examining the similarity of these templates, a user can ensure that the stereo setup is well calibrated and that the epipolar constraint indeed provides a useful constraint.

Several features can be selected for tracking; however, one of these features must be the tip of the nose. The tip of the nose is a very unique and important feature: as shown in [5], due to its convex shape, it is invariant to both the rotation and the scale of the head. If detection of the face orientation is not required, then robust face tracking in 3D can be achieved using this feature only.

Other features may include conventional edge-based features such as the inner corners of the brows and the corners of the mouth. At least two of these features are required in order to track the orientation of the head. These features are not invariant to the 3D motion of the head. Therefore, in order to make their tracking robust, a rigidity constraint which relates their positions to the nose position is imposed. This constraint is computed while selecting the features. Another constraint used in stereo tracking is the disparity constraint, which defines the allowable disparity values for the features during tracking, thus limiting the area of search. Both the rigidity and the disparity constraints are computed during the selection of features in the learning stage and can be engaged or disengaged during tracking.

5.2 Tracking procedure

Each selected facial feature is tracked using the following four-step procedure.

Step 1. The area for the local search is obtained using the rigidity constraint and the knowledge of the nose tip position. When the rigidity constraint is not engaged, or when the feature is the nose, the local search area is set around the previous position of the feature. If this knowledge is not available (as in the case of mistracking), the search area is set to the entire image.

Step 2. A set of the N best candidate matches is generated in both images by using the peephole mask and correlating the local search area with the template vector learnt in training. In order to be considered valid, all candidate matches are required to have a correlation with the template feature larger than a certain threshold. In our experiments, the allowable minimum correlation is set to 0.9.
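The correlation test of this step can be sketched with a plain normalized cross-correlation between intensity windows (the paper's peephole mask is omitted here, and the windows below are synthetic stand-ins for learnt templates).

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two intensity windows, in [-1, 1]."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

# A candidate is kept only if its correlation with the learnt template exceeds
# the threshold (0.9 in the paper's experiments).
rng = np.random.default_rng(0)
template = rng.normal(size=(7, 7))       # stand-in for a learnt template window
same = ncc(template, template.copy())    # identical window: correlation 1
noisy = ncc(template, template + rng.normal(scale=2.0, size=(7, 7)))
```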

Step 3. Out of the N² possible match pairs between the two images, the best M pairs are selected using the cross-correlation between the images. In our experiments, N is set to 10 and M is set to 5. Increasing N or M increases the processing time, but makes tracking more robust.

Step 4. Finally, the proposition of Section 3 is used, and the match pair that minimizes the epipolar error ε defined by Eq. 9 is returned by the stereo tracking system. If the returned match has an epipolar error less than a certain threshold, the feature is considered successfully tracked; otherwise, the feature is considered mistracked. The value of the maximum allowable epipolar error ε_max depends on the quality of the stereo selfcalibration and should be determined during the learning stage by observing the epipolar lines at different parts of the image. In the experiments presented in this paper it is set equal to 5 pixels.
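Steps 3-4 reduce to picking, among the surviving candidate pairs, the one with the smallest epipolar error. A minimal sketch follows, reusing a rectified-pair fundamental matrix as an assumed example; the candidate pixels are illustrative.

```python
import numpy as np

def best_match_pair(F, cands1, cands2, eps_max=5.0):
    """Step 4 sketch: among candidate pixels from the two images, return the
    pair minimizing the epipolar error of Eq. 9, or None on mistracking."""
    def err(p, pp):
        lp, l = F.T @ p, F @ pp          # epipolar lines of p and pp (Eq. 8)
        alg = float(p @ F @ pp) ** 2
        return alg / (lp[0] ** 2 + lp[1] ** 2) + alg / (l[0] ** 2 + l[1] ** 2)
    e_best, pair = min(
        ((err(p, pp), (tuple(p), tuple(pp))) for p in cands1 for pp in cands2),
        key=lambda x: x[0])
    return pair if e_best < eps_max else None

# Assumed rectified-pair fundamental matrix: correct matches share a row.
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
c1 = [np.array([100.0, 50.0, 1.0])]
c2 = [np.array([120.0, 53.0, 1.0]), np.array([118.0, 50.0, 1.0])]
pair = best_match_pair(F, c1, c2)   # picks the same-row candidate
```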

When a feature is successfully tracked, its 3D position can be calculated using the essential matrix E, as in [9]. The distance between the cameras can either be measured by hand or set equal to one. In the latter case, the reconstruction will be known up to a scale factor, which is sufficient for most applications.

Figure 4: The StereoTracker at work. The orientation and scale of the virtual man (at the bottom right) is controlled by the position of the observed face.

5.3 Experiments

The tracked facial features provide information about the position and the orientation of the head with respect to the cameras. This information can be used to control a 2D object on the screen, such as a cursor or a turret in an aim-and-shoot game (as in [7]), or it can be used to control a 3D object, examples of which can be found in computer games and avatar-based computer-generated communication programs (as in [19]). The StereoTracker system which we have developed allows one to test the applicability of the proposed stereo tracking technique in either of these applications.

Figure 4 shows two images which are controlled by the motion of the head. The image on the bottom left is the reprojection of the detected features onto the 2D image plane. Our experiments show that, by thinking of the nose as a pointing device, one can easily pinpoint any pixel on the perspective-view screen. The system is robust to the rotation of the head. This makes it possible to draw a straight line with the nose, represented by the lowest vertex of the triangle in the figure, by simply rotating the head.

The user interface of the StereoTracker also allows one to change the perspective view of the features to a top-down view. This is used to evaluate the potential of using the third dimension of the tracked features for controlling the size of a cursor or another virtual object on the screen. The image on the bottom right of Figure 4 shows a virtual man, the position of which in a virtual 3D space is controlled by the motion of the user's head; the scale of the man is proportional to the distance between the features and the camera, and the roll rotation of the man coincides with the roll rotation of the head.

Due to the robustness of the stereo tracking, there is quite a wide range of head motion which can be tracked. For example, for the setup shown in Figure 1, the experiments show that the head is tracked successfully within a range from 20 cm to 60 cm. The tracking, however, becomes less robust to the rotation of the head as the user moves further from the original position.

Another observation is that stereo tracking with low-quality cameras does not allow one to retrieve the pan and tilt rotation of the head. This, however, does not come as a surprise if we recall that the depth calculation error due to the low resolution of the image can be as high as 10% of the measured depth distance, depending on the position of the feature [4, 21].

The user interface of the StereoTracker allows one to evaluate the robustness, speed and precision of stereo tracking with respect to the internal parameters and constraints of tracking. These parameters can be changed during the tracking stage and include the already mentioned minimum allowable correlation, the number of feature candidates (N), the number of refined matches after cross-correlation (M), the maximum epipolar error, the size of the area for subpixel refinement, which can be used for features possessing the continuity property (see [5]), and the maximum colour difference between the stored feature pixel and a pixel being scanned. The last constraint is due to the fact that template matching in our system is done with black-and-white images; adding this constraint allows us to use the colour information to reduce the number of feature candidates.

In addition, the facial features can be checked in and out for tracking using the menu of the program. The disparity, rigidity and epipolar constraints can also be engaged and disengaged using the menu. While tracking is performed, the StereoTracker outputs statistics such as the minimum and maximum values of the correlation, cross-correlation and epipolar errors of the detected features, as well as the detected 3D position and roll orientation of the head and the position of the cursor controlled by the head.

While the paper presents only snapshots of our experiments, full MPEG videos of the StereoTracker at work are available at our website. The binary code of the program can also be downloaded from the website. In order to run it, a user only has to have two USB webcams connected to a computer. In our experiments we used two Intel USB web-cameras; other USB webcams, such as Logitech, Creative and 3Com, were also tried. Accessing the images is done using the DirectX interface. The resolution of the images is 320 by 240. A lower resolution of 160 by 120 was also tried and found to be sufficient for precise and robust tracking. Image smoothing is done using the Intel Open Source CV library [1]. On a Pentium III 800 MHz processor, all processing takes 0.10 ms per frame on average. This allows one to do stereo face tracking in real time.

Figure 5 shows a typical range of head motion which is tracked using our framework, along with the positions of the 2D and 3D objects controlled by the head motion: tilt motion (a), pan motion (b), roll motion (c), and motion further from the cameras (d). Yellow boxes around the features show the areas for the local search of the facial features. Using three features was found to be most optimal for recovering the orientation of the head, with the inner corners of the brows being preferable to the corners of the mouth. Using five features slowed down the processing and often resulted in breaking the rigidity constraint. The figure shows well the precision and the robustness of stereo tracking. By switching the epipolar constraint on and off, we were able to observe that using one camera does not allow one to detect features in images like those shown in Figure 5(c),(d), whereas using two cameras does. In a similar fashion, by switching the rigidity constraint on and off, we could see how stereo tracking allows one to track robustly even such non-robust features as the corners of the brows.

We have experimented with different baselines between the cameras, and a distance of about 10-20 cm appears to be the most optimal. Larger baselines result in too few of the matches needed to compute the fundamental matrix, while smaller baselines cause large epipolar errors. It was also observed that the alignment of the cameras is not very critical for our framework, which is attributed to using the rotation-invariant convex-shape nose feature and the combination of the epipolar constraint with the rigidity constraint.

6 Conclusions

In this paper we defined a framework for tracking faces in 3D using two generic web-cameras. The advantages of using two cameras (two "eyes") for face tracking, as exhibited by our StereoTracker, are summarized below. One can track features:

1. In low-quality images. Low-cost plug-and-play USB cameras are used.
2. More precisely, with sub-pixel accuracy. One can move the screen cursor with one's head one pixel at a time on a 360x240 grid.
3. More robustly, with respect to scale and rotations. Features are tracked for up to almost 40 degrees of head rotation in all three directions: "yes" (up-down), "no" (left-right), and "don't know" (clockwise).
4. In real time. After the two cameras have been calibrated, the work of tracking becomes simpler. Calibration can be done automatically, as soon as the head is seen in both cameras.
5. In 3D. The 3D coordinates and the roll angle of the head are recovered.

Projective vision and robust statistics enable us to deal with uncalibrated cameras (i.e. cameras for which the intrinsic parameters such as focal length, optical center, etc. are not known), which are the most common cameras on the market. The gain in accuracy and robustness is achieved by using two cameras where only one camera has been used in the past. We thus believe that the proposed technology brings high-level computer vision solutions closer to the market.
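As an illustration of how the 3D coordinates mentioned in item 5 can be recovered once two camera projection matrices are available (e.g. after the autocalibration of [24]), the standard linear (DLT) triangulation from [9] can be sketched as follows. This is a generic textbook method, not the exact code of our StereoTracker, and `triangulate` is a hypothetical name.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) image coordinates of the same point in each view.
    Returns the inhomogeneous 3D point (X, Y, Z).
    """
    # Each view contributes two linear equations A X = 0 on the
    # homogeneous 3D point X, built from x * P[2] - P[row].
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of A with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With calibrated cameras the result is metric; with only the fundamental matrix it is defined up to a projective transformation, which is sufficient for the relative head measurements used here.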

References

[1] G. Bradski. OpenCV: Examples of use and new applications in stereo, recognition and tracking. In Proc. Intern. Conf. on Vision Interface (VI'2002), Calgary, May 2002.

[2] D. Burr, M. Morrone, and L. Vaina. Large receptive fields for optic flow detectors in humans. Vision Research, 38(12):1731–1743, 1998.

[3] D. Gorodnichy. Investigation and design of high performance neural networks. PhD dissertation, Glushkov Cybernetics Center of Ac.Sc. of Ukraine in Kiev (http://www.cs.ualberta.ca/~ai/alumni/dmitri/PINN), 1997.

[4] D. Gorodnichy. Vision-based world modeling using a piecewise linear representation of the occupancy function. PhD dissertation, Computing Science Dept, University of Alberta (http://www.cs.ualberta.ca/~ai/alumni/dmitri/WorldModeling), 2000.

[5] D. Gorodnichy. On importance of nose for face tracking. In Proc. IEEE Intern. Conf. on Automatic Face and Gesture Recognition (FG'2002), Washington, D.C., May 2002.

[6] D. Gorodnichy, W. Armstrong, and X. Li. Adaptive logic networks for facial feature detection. In Proc. Intern. Conf. on Image Analysis and Processing (ICIAP'97), Vol. II (LNCS, Vol. 1311), pp. 332–339, Springer, 1997.

[7] D. Gorodnichy, S. Malik, and G. Roth. Nouse 'Use your nose as a mouse' - a new technology for hands-free games and interfaces. In Proc. Intern. Conf. on Vision Interface (VI'2002), Calgary, May 2002.

[8] R. Hartley. In defence of the 8 point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6), 1997.

[9] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2000.

[10] E. Hjelmas and B. K. Low. Face detection: A survey. Computer Vision and Image Understanding, 83:236–274, 2001.

[11] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press, 1993.

[12] R. Klette, K. Schluns, and A. Koschan. Computer Vision: three-dimensional data from images. Springer, 1996.

[13] L. Lourakis and R. Deriche. Camera self-calibration using the SVD of the fundamental matrix. Technical Report 3748, INRIA, August 1999.

[14] G. Loy, E. Holden, and R. Owens. 3D head tracker for an automatic lipreading system. In Proc. Australian Conf. on Robotics and Automation (ACRA2000), Melbourne, Australia, August 2000.

[15] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In Proc. IEEE Intern. Conf. on Automatic Face and Gesture Recognition (FG2000), 2000.

[16] M. Maza and D. Yuret. Dynamic hill climbing. AI Expert, pages 26–31, 1994.

[17] P. Mendonca and R. Cipolla. A simple technique for self-calibration. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'99), pages 112–116, Fort Collins, Colorado, June 1999.

[18] NRC. PVT (Projective Vision Toolkit) website. http://www.cv.iit.nrc.ca/research/PVT, 2001.

[19] M. Pantic and L. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on PAMI, 22(12):1424–1445, 2000.

[20] R. Newman, Y. Matsumoto, S. Rougeaux, and A. Zelinsky. Real-time stereo tracking for head pose and gaze estimation. In Proc. IEEE Intern. Conf. on Automatic Face and Gesture Recognition (FG2000), 2000.

[21] J. Rodriguez and J. Aggarwal. Stochastic analysis of stereo quantization error. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):467–470, 1990.

[22] G. Roth and M. D. Levine. Extracting geometric primitives. Computer Vision, Graphics and Image Processing: Image Understanding, 58(1):1–22, July 1993.

[23] G. Roth and A. Whitehead. Using projective vision to find camera positions in an image sequence. In Proc. Intern. Conf. on Vision Interface (VI'2000), pages 87–94, Montreal, May 2000.

[24] G. Roth and A. Whitehead. Some improvements on two autocalibration algorithms based on the fundamental matrix. In Proc. Intern. Conf. on Pattern Recognition (ICPR2002), Quebec City, Canada, 2002.

[25] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, December 1984.

[26] S. Smith and J. Brady. SUSAN - a new approach to low level image processing. International Journal of Computer Vision, pages 45–78, May 1997.

[27] K. Toyama. Prolegomena for robust face tracking. Technical Report MSR-TR-98-65, Microsoft Research, November 1998.

[28] M. Turk, C. Hu, R. Feris, F. Lashkari, and A. Beall. TLA based face tracking. In Proc. Intern. Conf. on Vision Interface (VI'2002), Calgary, May 2002.

[29] M. Xu and T. Akatsuka. Detecting head pose from stereo image sequence for active face recognition. In Proc. IEEE Intern. Conf. on Automatic Face and Gesture Recognition (FG'98), 1998.

[30] J. Yang, R. Stiefelhagen, U. Meier, and A. Waibel. Real-time face and facial feature tracking and applications. In Proc. AVSP'98, pages 79–84, Terrigal, Australia, 1998.

[31] M. Yang, N. Ahuja, and D. Kriegman. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, 2002.

[32] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78:87–119, October 1995.



Figure 5: Snapshots of experiments showing the robustness and precision of the stereo tracking accomplished with two web-cameras: tilt (a), pan (b), roll (c), and motion farther from the cameras (d).