CHAPTER 3
DATABASE FOR EXPERIMENTAL ANALYSIS
3.1 INTRODUCTION
Face tracking can be considered to be a kind of algorithm that
analyzes video frames and outputs the location of moving faces within each
frame. For each tracked face, three steps are involved, i.e., initialization,
tracking, and stopping. Most methods use a face detector for initialization of
their tracking processes. An often-ignored difficulty with this step is how to
control false face detections as described above. Although there have been
studies on profile or intermediate pose face detectors, they all suffer from
the false-detection problem far more than a frontal face detector does.
Chaudhury et al. (2003) used two face probability maps instead of a fixed
threshold to initialize the face tracker, one for frontal views and one for
profiles. All local maxima in these maps are chosen as face candidates, the
face probabilities of which are propagated throughout the temporal
sequence. Candidates whose probabilities either go to zero or remain low
over time are determined to be non-face and are eliminated. The information
from the two face probability maps is combined to represent an intermediate
head pose. Their experiments showed that the proposed probabilistic
detector was more accurate than a traditional face detector and could handle
head movements covering ±90 degrees out-of-plane rotation (yaw).
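The candidate elimination step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `prune_candidates`, the threshold `low`, and the window `max_low_frames` are hypothetical and are not taken from Chaudhury et al. (2003).

```python
def prune_candidates(prob_history, low=0.2, max_low_frames=3):
    """Drop face candidates whose probability has stayed low over time.

    prob_history maps a candidate id to its list of per-frame face
    probabilities. A candidate is dropped when its last `max_low_frames`
    values are all below `low` (assumed values); this also covers
    probabilities that have gone to zero.
    """
    kept = {}
    for cid, probs in prob_history.items():
        recent = probs[-max_low_frames:]
        stays_low = (len(recent) == max_low_frames
                     and all(p < low for p in recent))
        if not stays_low:
            kept[cid] = probs
    return kept


# Illustrative probabilities for three hypothetical candidates.
history = {
    "a": [0.6, 0.7, 0.8, 0.9],   # consistently face-like: kept
    "b": [0.5, 0.1, 0.0, 0.0],   # goes to zero: eliminated
    "c": [0.1, 0.1, 0.1, 0.6],   # recovers in the last frame: kept
}
survivors = prune_candidates(history)
```

A sliding window is used here so that a single low value does not eliminate a candidate; other propagation rules are equally possible.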
After initialization, one should choose the features to track before
tracking a face. The exploitation of color is one of the more common choices
because it is invariant to facial expressions, scale, and pose changes
(Boccignone et al., 2005; Li et al., 2006). However, color-based face
trackers often depend on a learning set dedicated to a certain type of
processed video and might not work on unknown videos with varying
illumination conditions or on faces of people of different races. Moreover, the
color image is susceptible to occlusion by other head-like objects. Two other
choices that are more robust to varying illumination and occlusion are key
points and facial features, e.g. eyes, nose and mouth (Arnaud et al., 2005;
Zhu et al., 2005; Tong et al., 2007). Although the generality of key points
allows for tracking of different kinds of objects, without any face-specific
knowledge, this method’s power to discriminate between the target and
clutter might not be enough to deal with background noise or other adverse
conditions. Facial features enable tracking of high-level facial information,
but they are of little use when the video is of low quality. Most facial-
feature-based face trackers have been tested using only non-broadcast
video.
An appearance-based or featureless tracker matches an observation
model of the entire facial appearance with the input image, instead of
choosing only a few features to track. Li et al. (2006) use a multi-view face
detector to detect and track faces in different poses. Besides the
face-based observation model, a head model is also included to represent the
back of the head. This model is based on the idea that the head can serve as
the object of interest because the face is not always trackable. An extended
particle filter is used to fuse these two sets of information to handle
occlusions due to out-of-plane head rotations (yaw) exceeding ±90 degrees.
During the tracking procedure, face tracking systems usually use a
motion model that describes how the image of the target might change for
different possible face motions. Assuming the face to be a planar object, the
corresponding motion model can be a 2D transformation, e.g. affine
transformation or homography, of a facial image, e.g. the initial frame.
Some research treats the face as a rigid 3D object; the resulting motion
model defines aspects depending on 3D position and orientation. However, a
face is actually both 3D and deformable. Some systems try to model faces in
this sense, and the image of the face can be covered with a mesh, i.e. a
sophisticated geometry and texture face model (Dornaika et al., 2004, 2006).
The motion of the face is defined by the position of the nodes of the mesh. If
the quality of the video is high, a more sophisticated motion model will give
more accurate results. For instance, a sophisticated geometry and texture
model might be less susceptible to false face detections and drifting than
a simple 2D transformation model. However, most 3D-based and mesh-
based face trackers require a relatively clear appearance, high resolution,
and a limited pose variation, e.g. out-of-plane head rotations (roll and yaw)
that are far less than ±90 degrees. These requirements cannot be satisfied
in the case of broadcast video. Therefore, most 3D-based and mesh-based
face trackers are only tested on non-broadcast video, e.g. webcam video.
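As a minimal sketch of the planar case discussed above, a 2D affine motion model maps each point of a reference face region into the current frame. The function name `apply_affine`, the face box and the motion parameters are illustrative assumptions, not values from any of the cited trackers.

```python
import numpy as np


def apply_affine(points, A, t):
    """Map N x 2 points with x' = A @ x + t (2D affine motion model)."""
    points = np.asarray(points, dtype=float)
    return points @ np.asarray(A, dtype=float).T + np.asarray(t, dtype=float)


# Corners (x, y) of a hypothetical face box in the reference frame.
face_box = np.array([[0.0, 0.0], [40.0, 0.0], [40.0, 50.0], [0.0, 50.0]])

# Assumed motion: 10 degree in-plane rotation, 5% scale-up, small shift.
theta = np.deg2rad(10.0)
A = 1.05 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
t = np.array([3.0, -2.0])

warped_box = apply_affine(face_box, A, t)
```

A tracker built on this model would search for the six parameters (the entries of `A` and `t`) that best align the warped reference image with the current frame; a homography adds two more parameters for perspective effects.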
Finally, the stopping procedure is a major deficiency of face
tracking algorithms, which are generally unable to stop a face track in
case of tracking errors, i.e. drifting. Arnaud et al. (2005) proposed an
approach that uses a general object tracker for face tracking and a stopping
criterion based on the addition of an eye tracker to alleviate drifting. The
positions of the two tracked eyes are compared with the tracked face position.
If neither of the eyes is in the face region, drifting is determined to be
occurring and the tracking process stops. In addition, most mesh-based or
top-down trackers are assumed to be able to avoid drifting.
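The eye-based stopping criterion can be sketched as a simple containment test. The box format `(left, top, width, height)` and the function names are assumptions made for illustration; Arnaud et al. (2005) may define the face region differently.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside box = (left, top, width, height)."""
    x, y = point
    left, top, width, height = box
    return left <= x <= left + width and top <= y <= top + height


def should_stop_tracking(face_box, left_eye, right_eye):
    """Declare drift when neither tracked eye lies inside the face box."""
    return not (point_in_box(left_eye, face_box)
                or point_in_box(right_eye, face_box))


# Hypothetical tracker outputs for one frame.
face_box = (100, 100, 80, 100)
no_drift = should_stop_tracking(face_box, (120, 130), (160, 130))
one_eye_lost = should_stop_tracking(face_box, (120, 130), (300, 130))
drifting = should_stop_tracking(face_box, (10, 10), (300, 130))
```

Following the description above, tracking continues as long as at least one eye remains inside the face region.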
3.2 CHALLENGES IN FACE TRACKING
The main challenges that face tracking methods have to overcome are
(i) variations of pose and lighting, (ii) facial deformations, (iii) occlusion
and clutter, and (iv) facial resolution.
Robustness to Pose and Illumination Variations: Pose and illumination
variations often lead to loss of track. One
of the well-known methods for dealing with illumination variations was
presented in Hager and Belhumeur, 1998, where the authors proposed using
a parameterized function to describe the movement of the image points,
taking into account illumination variation by modifying the brightness
constancy constraint of optical flow. Illumination invariant 3D tracking was
considered within the Active Appearance Model (AAM) framework in (Koterba
et al., 2005), but the method requires training images to build the model
and the result depends on the quality and variety of such data. 3D
model-based motion estimation algorithms are usually robust to pose
variations, but often lack robustness to illumination. Xu and Roy-Chowdhury,
2007, proposed a model-based face tracking method that was robust to both
pose and lighting changes. This was achieved through an analytically derived
model for describing the appearance of a face in terms of its pose, the
incident lighting, shape and surface reflectance.
Tracking through Facial Deformations: Tracking faces through changes
of expressions, i.e., through facial deformations, is another challenging
problem. A well-known work in this area is Terzopoulos and Waters (1993),
which has been used by many researchers for tracking, recognition and
reconstruction. In contrast to this model-based approach, Black and Yacoob
(1995) proposed a data-driven approach for tracking and recognition of
non-rigid facial motion. More recently, the 3D morphable model (Blanz and
Vetter, 2003) has been quite popular for synthesizing different facial
expressions, which implies that it can also be used for tracking by posing the
problem as estimation of the synthesis parameters (coefficients of a set of
basis functions representing the morphable model).
Occlusion and Clutter: As with most tracking problems, occlusion and
clutter affect the performance of most face trackers. One of the robust
tracking approaches in this scenario is the use of particle filters
(Arulampalam et al., 2002), which can recover from a loss of track given a
high enough number of particles and observations. However, in practice,
occlusion and clutter remain serious impediments in the design of highly
robust face tracking systems.
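To make the particle filter idea concrete, the following is a minimal bootstrap particle filter for a one-dimensional position. It is a sketch under assumptions: the random-walk motion model, the Gaussian observation model, the noise levels and the 0 to 200 pixel range are all illustrative and are not drawn from the cited work.

```python
import numpy as np


def particle_filter_step(particles, weights, observation, rng,
                         motion_std=2.0, obs_std=3.0):
    """One predict / update / resample cycle of a bootstrap particle filter."""
    n = len(particles)
    # Predict: random-walk motion model (assumed).
    particles = particles + rng.normal(0.0, motion_std, size=n)
    # Update: Gaussian likelihood of the observation given each particle.
    likelihood = np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    weights = weights * likelihood + 1e-300  # guard against all-zero weights
    weights = weights / weights.sum()
    estimate = float(np.sum(weights * particles))
    # Resample (multinomial) so particles concentrate on likely states.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n), estimate


rng = np.random.default_rng(0)
n_particles = 500
# Particles start spread over the whole assumed range, so the target can
# be acquired without knowing its initial position.
particles = rng.uniform(0.0, 200.0, size=n_particles)
weights = np.full(n_particles, 1.0 / n_particles)

true_positions = [50.0 + 1.5 * k for k in range(30)]
estimates = []
for pos in true_positions:
    observation = pos + rng.normal(0.0, 3.0)  # noisy synthetic measurement
    particles, weights, est = particle_filter_step(
        particles, weights, observation, rng)
    estimates.append(est)
```

In a face tracker the state would typically include position, scale and pose, and the likelihood would come from an appearance model rather than a scalar measurement.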
Facial resolution: Low resolution will hamper performance of any
tracking algorithm, face tracking being no exception. Zhao et al., 2003,
identified low resolution to be one of the main impediments in video-based
face recognition. Super-resolution of faces is a challenging problem by itself
because of detailed facial features that need to be modeled accurately.
Recently, Dedeoglu et al. (2006) proposed a method for face
super-resolution using AAMs.
Super-resolution requires registration of multiple images, followed by
interpolation. Usually, these two stages are treated separately, i.e.,
registration is obtained through a tracking procedure followed by super-
resolution. Yu et al., 2007, proposed feeding back the super-resolved texture
in the nth frame for tracking the (n+1)th frame. This improves the tracking,
which, in turn, improves the super-resolution output. This could be an
interesting area of future work taking into consideration issues of stability
and convergence.
3.3 APPLICATIONS OF FACE TRACKING
Video Surveillance: Since faces are often the most easily
recognizable signature of identity and intent from a distance, video
surveillance systems often focus on the face (Zhao et al., 2003). This requires
tracking the face over multiple frames.
Biometrics: Video-based face recognition systems require alignment
of the faces before they can be compared. This alignment compensates for
changes of pose. Face tracking, especially 3D pose estimation, is therefore
an important component of such applications. Also, integration of identity
over the entire video sequence requires tracking the face.
Face Modeling: Reconstruction of the 3D model of a face from a
video sequence using structure from motion requires tracking. This is
because the depth estimates are related non-linearly to the 3D motion of the
object. This is a difficult non-linear estimation problem and many papers can
be found that focus primarily on this, some examples being (Shan et al.,
2001; Roy-Chowdhury et al., 2005).
Video Communications and Multimedia Systems: Face tracking is
also important for applications like video communications. Motion estimates
remove the inter-frame redundancy in video compression schemes like
MPEG and H.26x. In multimedia systems like sports videos, face tracking can
be used in conjunction with recognition or reconstruction modules, or for
focusing on a region of interest in the image.
3.4 DATABASE CREATION
Table 3.1 presents sample frames. Rows 1 and 2 present frames
from video 1. Row 2 shows the lady who is on site, talking with the person in
the TV studio shown in Row 1. Only frames with relevant changes have to be
communicated between the TV studio and the on-site location.
Row 3 presents two frames of a news reader in video 2.
3.5 SUMMARY
This chapter presented the creation of a database from two live videos and one
software-created ‘ruth’ database available on the internet. Chapter 4 presents
the implementation of a radial basis function for facial tracking.
Table 3.1 Sample frames from two videos
[Start frame and End frame image columns not reproduced]
Video                                                  Frame start  Frame end  Total frames
1  Person from television station (Video-1)                    968       2997          2029
   Person from on-the-spot news collection (Video-1)          2999       3783           784
2  Video 2                                                      131        192            61
3  Video 3                                                    34749      35125           376
Fig 3.1 Frame differences for the Person from Television station in
video 1
There are about 3000 frames in video-1. Frames 968 to 2997 contain
the person shown in Row-1 of Table 3.1. Figure 3.1 shows the frame number on
the x-axis and the summed intensity difference between adjacent frames on the
y-axis. The y-axis values vary considerably across frame numbers. A threshold
has to be fixed so that only frames whose summed difference exceeds it
are considered for extracting information and sending to the receiver.
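The frame selection described above can be sketched as follows. The use of the absolute pixel difference, the function names and the synthetic frames are assumptions for illustration; the chapter does not specify how the summed difference is computed.

```python
import numpy as np


def summed_frame_differences(frames):
    """Summed absolute intensity difference between adjacent frames.

    frames: grayscale array of shape (num_frames, height, width).
    Returns an array of length num_frames - 1, where entry k compares
    frame k with frame k + 1.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(frames[1:] - frames[:-1])
    return diffs.sum(axis=(1, 2))


def select_frames(differences, threshold):
    """Indices of frames that differ from their predecessor by more than threshold."""
    return np.flatnonzero(differences > threshold) + 1


# Synthetic stand-in for a video: six 4 x 4 frames.
frames = np.zeros((6, 4, 4))
frames[2:] += 10.0   # large change starting at frame 2
frames[5] += 1.0     # small change at frame 5

differences = summed_frame_differences(frames)
selected = select_frames(differences, threshold=50.0)
```

With these synthetic frames only frame 2 exceeds the threshold; the small change at frame 5 is ignored, which is the intended effect of thresholding.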
[Figure 3.1 plot. x-axis: Frame number (0 to 3000); y-axis: Total summed difference for the adjacent frames × 10000 (0 to 400)]
Fig.3.2 Number of frames above a threshold for Figure 3.1
The total number of frames that are considered is 80.
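A curve of this kind can be produced by counting, for each candidate threshold, how many frame differences exceed it. The sketch below uses made-up difference values and an assumed threshold grid; it does not reproduce the counts reported for the videos.

```python
import numpy as np


def frames_above_threshold(differences, thresholds):
    """For each threshold, count the frame differences that exceed it."""
    differences = np.asarray(differences, dtype=float)
    return np.array([int(np.count_nonzero(differences > t))
                     for t in thresholds])


# Illustrative summed differences (not measured from the videos).
differences = np.array([0.0, 160.0, 0.0, 0.0, 16.0, 75.0])
thresholds = np.arange(0, 101, 20)   # 0, 20, 40, 60, 80, 100
counts = frames_above_threshold(differences, thresholds)
```

The counts can only decrease or stay level as the threshold rises, so the curve helps in choosing a threshold that keeps a manageable number of frames.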
[Figure 3.2 plot. x-axis: Threshold (0 to 100); y-axis: Total number of frames above threshold (0 to 500)]
Fig 3.3 Frame differences for the lady from on the spot news
collection in video 1
The frames that correspond to the lady of Row-2 lie between
frames 3000 and 3800.
[Figure 3.3 plot. x-axis: Frame number (0 to 4000); y-axis: Total summed difference for the adjacent frames × 10000 (0 to 20)]
Fig.3.4 Number of frames above a threshold for Figure 3.3
The total number of frames that are considered is 20.
[Figure 3.4 plot. x-axis: Threshold (0 to 100); y-axis: Total number of frames above threshold (0 to 25)]
Table 3.2 Various positions of eyes, lips and eyebrows
[Image table not reproduced. Row labels: Video-2; ‘Ruth’ video database]
The first row of Table 3.2 shows various facial expressions of the news reader.
Similarly, the second row shows the facial expressions in the standard ‘ruth’
video database.
Fig. 3.5 Frames 1-30 of ‘ruth’ video
Fig. 3.6 Frames 31-60 of ‘ruth’ video
Fig. 3.7 Frames 61-90 of ‘ruth’ video
Fig. 3.8 Left eyelash position of frames 1-72 that belong to ‘ruth’
video
Figure 3.8 shows the left eyelash position in frames 1-72 of the ‘ruth’
video.
Fig. 3.9 Mouth position of ‘ruth’ video
Figure 3.9 shows the mouth position in frames 1-78 of the ‘ruth’ video.
Fig. 3.10 Right eyelash position of ‘ruth’ video
Figure 3.10 shows the right eyelash position in frames 1-78 of the ‘ruth’
video.