CHAPTER 3 DATABASE FOR EXPERIMENTAL ANALYSIS 3.1...

CHAPTER 3

DATABASE FOR EXPERIMENTAL ANALYSIS

3.1 INTRODUCTION

Face tracking can be considered to be a kind of algorithm that

analyzes video frames and outputs the location of moving faces within each

frame. For each tracked face, three steps are involved, i.e., initialization,

tracking, and stopping. Most methods use a face detector for initialization of

their tracking processes. An often-overlooked difficulty with this step is how to

control false face detections as described above. Although there have been

studies on profile or intermediate pose face detectors, they all suffer from

the false-detection problem far more than a frontal face detector does.

Chaudhury et al., 2003, used two face probability maps instead of a fixed

threshold to initialize the face tracker, one for frontal views and one for

profiles. All local maxima in these maps are chosen as face candidates, the

face probabilities of which are propagated throughout the temporal

sequence. Candidates whose probabilities either go to zero or remain low

over time are determined to be non-face and are eliminated. The information

from the two face probability maps is combined to represent an intermediate

head pose. Their experiments showed that the proposed probabilistic

detector was more accurate than a traditional face detector and could handle

head movements covering ±90 degrees out-of-plane rotation (yaw).

After initialization, one should choose the features to track before

tracking a face. The exploitation of color is one of the more common choices

because it is invariant to facial expressions, scale, and pose changes

(Boccignone et al., 2005; Li et al., 2006). However, color-based face

trackers often depend on a learning set dedicated to a certain type of

processed video and might not work on unknown videos with varying

illumination conditions or on faces of people of different races. Moreover, the

color image is susceptible to occlusion by other head-like objects. Two other

choices that are more robust to varying illumination and occlusion are key

points and facial features (Arnaud et al., 2005; Zhu et al., 2005; Tong et al.,

2007), e.g. eyes, nose, and mouth. Although the generality of key points

allows for tracking of different kinds of objects, without any face-specific

knowledge, this method’s power to discriminate between the target and

clutter might not be enough to deal with background noise or other adverse

conditions. Facial features enable tracking of high-level facial information,

but they are of little use when the video is of low quality. Most facial-

feature-based face trackers have been tested using only non-broadcast

video.

An appearance-based or featureless tracker matches an observation

model of the entire facial appearance with the input image, instead of

choosing only a few features to track. Li et al., 2006, use a multi-view face

detector to detect and track faces from different poses. Besides the face-

based observation model, a head model is also included to represent the

back of the head. This model is based on the idea that a head can be an

object of interest because the face is not always trackable. An extended

particle filter is used to fuse these two sets of information to handle

occlusions due to out-of-plane head rotations (yaw) exceeding ±90 degrees.

During the tracking procedure, face tracking systems usually use a

motion model that describes how the image of the target might change for

different possible face motions. Assuming the face to be a planar object, the

corresponding motion model can be a 2D transformation, e.g. affine

transformation or homography, of a facial image, e.g. the initial frame.

Some research treats the face as a rigid 3D object; the resulting motion

model defines aspects depending on 3D position and orientation. However, a

face is actually both 3D and deformable. Some systems try to model faces in

this sense, covering the image of the face with a mesh, i.e. a

sophisticated geometry and texture face model (Dornaika et al., 2004, 2006).

The motion of the face is defined by the position of the nodes of the mesh. If

the quality of the video is high, a more sophisticated motion model will give

more accurate results. For instance, a sophisticated geometry and texture

model might be less susceptible to false face detections and drifting than

a simple 2D transformation model. However, most 3D-based and mesh-

based face trackers require a relatively clear appearance, high resolution,

and a limited pose variation, e.g. out-of-plane head rotations (roll and yaw)

that are far less than ±90 degrees. These requirements cannot be satisfied

in the case of broadcast video. Therefore, most 3D-based and mesh-based

face trackers are only tested on non-broadcast video, e.g. webcam video.
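
The planar (2D) motion model described above can be illustrated with a short sketch. It assumes the face is represented by the four corners of a bounding box in the initial frame and that the motion is a 2x3 affine matrix; the names apply_affine, face_box, and motion are hypothetical and are not taken from any of the cited trackers.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed layout of a 2x3 affine matrix: ((a, b, tx), (c, d, ty))
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def apply_affine(points: List[Point], m: Affine) -> List[Point]:
    """Map each (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    (a, b, tx), (c, d, ty) = m
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Hypothetical face box corners in the initial frame.
face_box = [(0.0, 0.0), (10.0, 0.0), (10.0, 20.0), (0.0, 20.0)]

# Hypothetical motion: uniform scale by 1.5, then translate by (5, -2).
motion = ((1.5, 0.0, 5.0), (0.0, 1.5, -2.0))

warped = apply_affine(face_box, motion)
```

A homography would add a third row to the matrix and a division by the resulting third coordinate; the affine case is shown only because it is the simpler of the two transformations named in the text.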

Finally, the stopping procedure is a major weakness of most

face tracking algorithms, which are generally unable to stop a face track in

case of tracking errors, i.e. drifting. Arnaud et al., 2005, proposed an

approach that uses a general object tracker for face tracking and a stopping

criterion based on the addition of an eye tracker to alleviate drifting. The two

positions of the tracked eyes are compared with the tracked face position. If

neither of the eyes is in the face region, drifting is determined to be

occurring and the tracking process stops. In addition, most mesh-based or

top-down trackers are assumed to be able to avoid drifting.
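
The eye-based stopping criterion described above can be sketched as follows. This is only an illustration of the rule as stated in the text (stop when neither tracked eye lies in the tracked face region). The box format (x, y, width, height) and the function names are assumptions, not the implementation of Arnaud et al.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, width, height)
Point = Tuple[float, float]


def inside(box: Box, p: Point) -> bool:
    """Return True if point p lies within the axis-aligned box."""
    x, y, w, h = box
    return x <= p[0] <= x + w and y <= p[1] <= y + h


def should_stop(face: Box, left_eye: Point, right_eye: Point) -> bool:
    """Declare drifting when neither tracked eye is inside the face region."""
    return not inside(face, left_eye) and not inside(face, right_eye)
```

For example, with a face box of (0, 0, 100, 100), eyes at (30, 40) and (70, 40) keep the track alive, while eyes at (150, 40) and (190, 40) stop it.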

3.2 CHALLENGES IN FACE TRACKING

The main challenges that face tracking methods have to overcome are

(i) variations of pose and lighting, (ii) facial deformations, (iii) occlusion and

clutter, and (iv) facial resolution.

Robustness to Pose and Illumination Variations: Pose and illumination

variations often lead to loss of track. One

of the well-known methods for dealing with illumination variations was

presented in Hager and Belhumeur, 1998, where the authors proposed using

a parameterized function to describe the movement of the image points,

taking into account illumination variation by modifying the brightness

constancy constraint of optical flow. Illumination invariant 3D tracking was

considered within the Active Appearance Model (AAM) framework in (Koterba

et al., 2005), but the method requires training images to build the model

and the result depends on the quality and variety of such data. 3D

model-based motion estimation algorithms are usually robust to pose

variations, but often lack robustness to illumination. Xu and Roy-Chowdhury,

2007, proposed a model-based face tracking method that was robust to both

pose and lighting changes. This was achieved through an analytically derived

model for describing the appearance of a face in terms of its pose, the

incident lighting, shape and surface reflectance.

Tracking through Facial Deformations: Tracking faces through changes

of expressions, i.e., through facial deformations, is another challenging

problem. A well-known work in this area is (Terzopoulos and Waters, 1993),

which has been used by many researchers for tracking, recognition and

reconstruction. In contrast to this model-based approach, Black and Yacoob,

1995, proposed a data-driven approach for tracking and recognition of

non-rigid facial motion. More recently, the 3D morphable model (Blanz and

Vetter, 2003) has been quite popular in synthesizing different facial

expressions, which implies that it can also be used for tracking by posing the

problem as estimation of the synthesis parameters (coefficients of a set of

basis functions representing the morphable model).

Occlusion and Clutter: As with most tracking problems, occlusion and

clutter affect the performance of most face trackers. One of the robust

tracking approaches in this scenario is the use of particle filters

(Arulampalam et al., 2002), which can recover from a loss of track given a

high enough number of particles and observations. However, in practice,

occlusion and clutter remain serious impediments in the design of highly

robust face tracking systems.
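
To make the particle filter idea concrete, the following is a minimal bootstrap-style sketch. It assumes a one-dimensional state (for example, the horizontal face position), a random-walk motion model, and a Gaussian observation likelihood. All parameter names and default values are illustrative assumptions, and a real face tracker would use a higher-dimensional state and an image-based likelihood.

```python
import math
import random
from typing import List


def particle_filter_1d(observations: List[float], n_particles: int = 500,
                       motion_std: float = 2.0, obs_std: float = 3.0,
                       seed: int = 0) -> List[float]:
    """Return one position estimate per observation (weighted particle mean)."""
    rng = random.Random(seed)
    # Initialise the particles around the first observation.
    particles = [rng.gauss(observations[0], obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: diffuse each particle with the assumed random-walk model.
        particles = [p + rng.gauss(0.0, motion_std) for p in particles]
        # Update: weight each particle by the assumed Gaussian likelihood.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Hypothetical track: the target moves one unit per frame for 20 frames.
track = particle_filter_1d([float(t) for t in range(20)])
```

Because the particles are spread by the motion model at every step, some of them can land near the target again after a short loss of track, which is the recovery behaviour mentioned in the text. More particles make this more likely at a higher computational cost.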

Facial resolution: Low resolution will hamper performance of any

tracking algorithm, face tracking being no exception. Zhao et al., 2003,

identified low resolution to be one of the main impediments in video-based

face recognition. Super-resolution of faces is a challenging problem by itself

because of detailed facial features that need to be modeled accurately.

Recently, Dedeoglu et al., 2006, proposed a method for face

super-resolution using AAMs.

Super-resolution requires registration of multiple images, followed by

interpolation. Usually, these two stages are treated separately, i.e.,

registration is obtained through a tracking procedure followed by super-

resolution. Yu et al., 2007, proposed feeding back the super-resolved texture

in the nth frame for tracking the (n+1)th frame. This improves the tracking,

which, in turn, improves the super-resolution output. This could be an

interesting area of future work taking into consideration issues of stability

and convergence.

3.3 APPLICATIONS OF FACE TRACKING

Video Surveillance: Since faces are often the most easily

recognizable signature of identity and intent from a distance, video

surveillance systems often focus on the face (Zhao et al., 2003). This requires

tracking the face over multiple frames.

Biometrics: Video-based face recognition systems require alignment

of the faces before they can be compared. This alignment compensates for

changes of pose. Face tracking, especially 3D pose estimation, is therefore

an important component of such applications. Also, integration of identity

over the entire video sequence requires tracking the face.

Face Modeling: Reconstruction of the 3D model of a face from a

video sequence using structure from motion requires tracking. This is

because the depth estimates are related non-linearly to the 3D motion of the

object. This is a difficult non-linear estimation problem and many papers can

be found that focus primarily on this, some examples being (Shan et al.,

2001; Roy-Chowdhury et al., 2005).

Video Communications and Multimedia Systems: Face tracking is

also important for applications like video communications. Motion estimates

remove the inter-frame redundancy in video compression schemes like

MPEG and H.26x. In multimedia systems like sports videos, face tracking can

be used in conjunction with recognition or reconstruction modules, or for

focusing on a region of interest in the image.

3.4 DATABASE CREATION

Table 3.1 presents sample frames. Row-1 and Row-2 present frames

from video 1. Row-2 shows the lady who is on site, talking with the person in

the TV studio shown in Row-1. Frames are to be communicated between the TV

studio and the on-site location only when they contain relevant changes.

Row-3 presents two frames of a news reader in video 2.

3.5 SUMMARY

This chapter presented the creation of a database from two live videos and one

software-created ‘ruth’ video database available on the internet. Chapter 4 presents

the implementation of a radial basis function for facial tracking.

Table 3.1 Sample frames from two videos

Video                                                 Start frame  End frame  Frame start  Frame end  Total frames
1  Person from television station (Video-1)           (image)      (image)    968          2997       2029
   Person from on-the-spot news collection (Video-1)  (image)      (image)    2999         3783       784
2  Video 2                                            (image)      (image)    131          192        61
3  Video 3                                            (image)      (image)    34749        35125      376

Fig 3.1 Frame differences for the Person from Television station in

video 1

There are about 3000 frames in video-1. Frames 968 to 2997 contain

the person shown in Row-1 of Table 3.1. Figure 3.1 plots the frame number on

the x-axis and the summed intensity difference between adjacent frames on the

y-axis. The y-axis values vary considerably across frames. A threshold has to be

fixed so that only frames whose summed difference is above a certain value

are considered for extracting information and sending to the receiver.
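
The thresholding step described above can be sketched as follows. The sketch assumes grayscale frames stored as nested lists of pixel intensities and uses the sum of absolute pixel differences between adjacent frames as the difference measure. The function names and the tiny example frames are hypothetical, and the chapter does not state the exact difference measure or threshold used.

```python
from typing import List

Frame = List[List[int]]  # assumed: grayscale image as rows of pixel intensities


def summed_difference(a: Frame, b: Frame) -> int:
    """Sum of absolute pixel differences between two equally sized frames."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def select_frames(frames: List[Frame], threshold: int) -> List[int]:
    """Indices of frames whose difference from the previous frame exceeds threshold."""
    return [i for i in range(1, len(frames))
            if summed_difference(frames[i - 1], frames[i]) > threshold]


# Hypothetical 2x2 frames: only frame 2 differs strongly from its predecessor.
frames = [
    [[10, 10], [10, 10]],
    [[10, 11], [10, 10]],
    [[50, 60], [10, 10]],
    [[50, 60], [10, 12]],
]
selected = select_frames(frames, threshold=5)
```

Only the selected frames would then be passed on for information extraction and transmission, in line with the goal of sending frames only when they contain relevant changes.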

[Figure 3.1 plot. x-axis: Frame number (0 to 3000); y-axis: Total summed difference for the adjacent frames x 10000 (0 to 400)]

Fig 3.2 Number of frames above a threshold for Figure 3.1

The total number of frames that are considered is 80.
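
A curve of the number of frames above a threshold, as a function of that threshold, can be computed with a simple sweep. The sketch below assumes the per-frame summed differences are already available as a list. The values shown are made up for illustration and are not the data behind the figures in this chapter.

```python
from typing import List


def frames_above(differences: List[float], thresholds: List[float]) -> List[int]:
    """For each threshold, count the frames whose summed difference exceeds it."""
    return [sum(1 for d in differences if d > t) for t in thresholds]


# Hypothetical summed differences for six frames.
differences = [3.0, 45.0, 12.0, 80.0, 7.0, 60.0]
counts = frames_above(differences, [0, 10, 50, 100])
```

Raising the threshold can only lower the count, so the threshold can be chosen from such a curve to match the number of frames one is prepared to transmit.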

[Figure 3.2 plot. x-axis: Threshold (0 to 100); y-axis: Total number of frames above Threshold (0 to 500)]

Fig 3.3 Frame differences for the lady from on the spot news

collection in video 1

The frames that correspond to the lady of Row-2 lie between

frame 3000 and frame 3800.

[Figure 3.3 plot. x-axis: Frame number (0 to 4000); y-axis: Total summed difference for the adjacent frames x 10000 (0 to 20)]

Fig 3.4 Number of frames above a threshold for Figure 3.3

The total number of frames that are considered is 20.

[Figure 3.4 plot. x-axis: Threshold (0 to 100); y-axis: Total number of frames above Threshold (0 to 25)]

Table 3.2 Various positions of eyes, lips, and eyebrows

[Table of image frames. Row labels: Video-2; ‘Ruth’ video database]

The first row of Table 3.2 shows various facial expressions of the news reader.

Similarly, the second row shows facial expressions from the standard ‘ruth’

video database.

Fig. 3.5 Frames 1-30 of ‘ruth’ video

Fig. 3.8 Left eyelash position of frames 1-72 that belong to ‘ruth’

video

Figure 3.8 shows the left eyelash position in frames 1-72 of the ‘ruth’

video.

Fig. 3.9 Mouth position of ‘ruth’ video

Figure 3.9 shows the mouth position in frames 1-78 of the ‘ruth’ video.

Fig. 3.10 Right eyelash position of ‘ruth’ video

Figure 3.10 shows the right eyelash position in frames 1-78 of the ‘ruth’

video.
