INVITED PAPER

Audio-Visual Biometrics

Low-cost technology that combines recognition of individuals' faces with voice recognition is now available for low-security applications, but overcoming false rejections remains an unsolved problem.
By Petar S. Aleksic and Aggelos K. Katsaggelos
ABSTRACT | Biometric characteristics can be utilized in order to enable reliable and robust-to-impostor-attacks person recognition. Speaker recognition technology is commonly utilized in various systems enabling natural human–computer interaction. The majority of speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information, and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.
KEYWORDS | Audio-visual biometrics; audio-visual databases;
audio-visual fusion; audio-visual person recognition; face
tracking; hidden Markov models; multimodal recognition;
visual feature extraction
I. INTRODUCTION
Biometrics, or biometric recognition, refers to the utilization of physiological and behavioral characteristics for automatic person recognition [1], [2]. Person recognition can be classified into two problems: person identification and person verification (authentication). Person identification is the problem of determining the identity of a person from a closed set of candidates, while person verification refers to the problem of determining whether a person is who s/he claims to be. In general, a person verification system should be capable of rejecting claims from impostors, i.e., persons not registered with the system or registered but attempting access under someone else's identity, and accepting claims from clients, i.e., persons registered with the system and claiming their own identity. Applications that can employ person recognition systems include automatic banking, computer network security, information retrieval, secure building access, e-commerce, teleworking, etc. Personal devices, such as cell phones, PDAs, laptops, and cars, could also have built-in person recognition systems that would help prevent impostors from using them. Traditional person recognition methods, including knowledge-based (e.g., passwords, PINs) and token-based (e.g., ATM or credit cards, and keys) methods, are vulnerable to impostor attacks. Passwords can be compromised, while keys and cards can be stolen or duplicated. Identity theft is one of the fastest growing crimes worldwide. For example, in the U.S., over ten million people were victims of identity theft within a 12-month period in 2004 and 2005 (approximately 4.6% of the adult population) [3].
Unlike knowledge- and token-based information, biometric characteristics cannot be forgotten or easily stolen, and the systems that utilize them exhibit improved robustness to impostor attacks [1], [2], [4]. There are many different biometric characteristics that can be used in person recognition systems, including fingerprints, palm prints, hand and finger geometry, hand veins, iris and retinal scans, infrared thermograms, DNA, ears, faces, gait, voice, signature, etc. [1] (see Fig. 1). The choice of biometric characteristics depends on many factors, including the best achievable performance, robustness to noise, cost and size of biometric sensors, invariance of characteristics with time, robustness to attacks, uniqueness, population coverage, scalability, template (representation) size, etc. [1], [2].

Each biometric modality has its own advantages and disadvantages with respect to the above-mentioned factors. All of them are usually considered when choosing the most
Manuscript received September 2, 2005; revised August 16, 2006.
The authors are with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208 USA (e-mail: [email protected]; http://ivpl.ece.northwestern.edu/Staff/Petar.html; [email protected]).
Digital Object Identifier: 10.1109/JPROC.2006.886017
Vol. 94, No. 11, November 2006 | Proceedings of the IEEE. 0018-9219/$20.00 © 2006 IEEE
appropriate biometric characteristics for a certain application. A comparison of the most commonly used biometric characteristics with respect to maturity, accuracy, scalability, cost, obtrusiveness, sensor size, and template size is shown in Table 1 [2]. The best recognition performance is achieved when iris, fingerprints, hand, or signature are used as biometric features. However, systems utilizing these biometric features require user cooperation, are considered intrusive, and/or utilize high-cost sensors. Although reliable, they are usually not widely acceptable, except in high-security applications. In addition, there are many biometric applications, such as sport venue (gym) entrance check, desktop access, building access, etc., which do not require high security, and in which it is very important to use unobtrusive and low-cost methods for extracting biometric features, thus enabling natural person recognition and reducing inconvenience.

In this review paper, we address audio-visual (AV) biometrics, where speech is utilized together with static video frames of the face or certain parts of the face (face recognition) [5]–[9] and/or video sequences of the face or mouth area (visual speech) [10]–[16] in order to improve person recognition performance (see Fig. 2). With respect to the type of acoustic and visual information they use, person recognition systems can be classified into audio-only, visual-only-static (only visual features obtained from a single face image are used), visual-only-dynamic (visual features containing temporal information obtained from video sequences are used), audio-visual-static, and audio-visual-dynamic. Systems that utilize acoustic information only are extensively covered in the audio-only speaker recognition literature [17] and will not be discussed here. In addition, visual-only-static (face recognition) systems are also covered extensively in the literature [18]–[22] and will not be discussed separately, but only in the context of AV biometrics.
Speaker recognition systems that rely only on audio data are sensitive to microphone types (headset, desktop, telephone, etc.), acoustic environment (car, plane, factory, babble, etc.), channel noise (telephone lines, VoIP, etc.), or complexity of the scenario (speech under stress, Lombard speech, whispered speech). On the other hand, systems that rely only on visual data are sensitive to visual noise, such as extreme lighting changes, shadowing, changing background, speaker occlusion and nonfrontality, segmentation errors, low spatial and temporal resolution video, compressed video, and appearance changes (hair style, make-up, clothing).

Audio-only speaker recognizers can perform poorly even at typical acoustic background signal-to-noise ratio (SNR) levels (−10 to 15 dB), and the incorporation of additional biometric modalities can alleviate problems characteristic of a single modality and improve system performance. This has been well established in the literature [5]–[16], [23]–[51]. The use of visual information, in addition to audio, improves speaker recognition performance even in noise-free environments [14], [28], [34]. The potential for such improvements is greater in acoustically noisy environments, since visual speech information is typically much less affected by acoustic noise than the acoustic speech information. It is true, however, that there is an equivalent to the acoustic Lombard effect in the visual
Fig. 1. Biometric characteristics: (a) fingerprints, (b) palm print, (c) hand and finger geometry, (d) hand veins, (e) retinal scan, (f) iris, (g) infrared thermogram, (h) DNA, (i) ear, (j) face, (k) gait, (l) speech, and (m) signature.
Table 1 Comparison of Biometric Characteristics (Adapted From [2])
domain, although it has been shown that it does not affect visual speech recognition as much as the acoustic Lombard effect affects acoustic speech recognition [52].

Audio-only and static-image-based (face recognition) biometric systems are susceptible to impostor attacks (spoofing) if the impostor possesses a photograph and/or speech recordings of the client. It is considerably more difficult for an impostor to impersonate both acoustic and dynamic visual information simultaneously. Overall, AV person recognition is one of the most promising user-friendly, low-cost person recognition technologies that is rather resilient to spoofing. It holds promise for wider adoption due to the low cost of audio and video biometric sensors and the ease of acquiring audio and video signals (even without assistance from the client) [53].
The remainder of the paper is organized as follows. We first describe in Section II the justification for combining the audio and video modalities for person recognition. We then describe the structure of an AV biometric system in Section III, and in Section IV we review the various approaches for extracting and representing important visual features. Subsequently, in Section V we describe the speaker recognition process, and in Section VI we review the main approaches for integrating the audio and visual information. In Section VII, we provide a description of some of the AV biometric systems that have appeared in the literature. Finally, in Section VIII, we provide an assessment of the topic, describe some of the open problems, and conclude the paper.
II. IMPORTANCE OF AV BIOMETRICS
Although great progress has been achieved over the past decades in computer processing of speech, it still lags significantly behind human performance levels [54], especially in noisy environments. On the other hand, humans easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information. Face visibility benefits speech perception due to the fact that the visual signal is both correlated to the produced audio signal [55]–[60] and also contains complementary information to it [55], [61]–[66]. There has been significant work on investigating the relationship between articulatory movements, vocal tract shape, and speech acoustics [67]–[70]. It has also been shown that there exists a strong correlation among face motion, vocal tract shape, and speech acoustics [55], [61]–[66].
For example, Yehia et al. [55] investigated the degree of this correlation. They measured the motion of markers placed on the face and in the vocal tract. Their results show that 91% of the total variance observed in the facial motion could be determined from the vocal tract motion, using simple linear estimators. In addition, looking at the reverse problem, they determined that 80% of the total variance observed in the vocal tract could be estimated from face motion. Regarding speech acoustics, linear estimators were sufficient to determine between 72% and 85% (depending on subject and utterance) of the variance observed in the root-mean-squared amplitude and line-spectrum-pair parametric representation of the spectral envelope from face motion. They also showed that even the tongue motion can be reasonably well recovered from the face motion, since the tongue frequently displays similar motion to the jaw during speech articulation.
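As a rough illustration of this kind of analysis (not the authors' actual method or data), the fraction of variance in one measurement stream that a simple linear estimator recovers from another can be computed with ordinary least squares; all quantities below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 frames of 6-D "vocal tract" motion and a
# correlated 8-D "facial" motion (a linear mixture plus noise).
T = rng.standard_normal((500, 6))                # vocal tract trajectories
A = rng.standard_normal((6, 8))                  # unknown linear coupling
F = T @ A + 0.3 * rng.standard_normal((500, 8))  # facial marker trajectories

# Fit a linear estimator F ≈ T W by ordinary least squares.
W, *_ = np.linalg.lstsq(T, F, rcond=None)
F_hat = T @ W

# Fraction of total variance in F explained by the linear prediction.
explained = 1.0 - np.sum((F - F_hat) ** 2) / np.sum((F - F.mean(axis=0)) ** 2)
print(f"variance explained: {explained:.2f}")
```

With real marker data, figures like the 91% and 80% quoted above are ratios of exactly this form.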
Jiang et al. [56] investigated the correlation among external face movements, tongue movements, and speech acoustics for consonant–vowel (CV) syllables and sentences. They showed that multilinear regression could successfully be used to predict face movements from speech acoustics for short speech segments, such as CV syllables. The prediction was best for chin movements, followed by lip and cheek movements. They also showed, like the authors of [55], that there is a high correlation between tongue and face movements.
Hearing-impaired individuals utilize lipreading and speechreading in order to improve their speech perception. In addition, normal-hearing persons also use lipreading and speechreading to a certain extent, especially in acoustically noisy environments [61]–[66]. Lipreading represents the
Fig. 2. AV biometric system.
The preprocessing should be coupled with the choice and extraction of acoustic and visual features, as depicted by the dashed lines in Fig. 3. Acoustic features are chosen based on their robustness to channel and background noise. A number of results have been reported in the literature on extracting the appropriate acoustic features for both clean and noisy speech conditions [88]–[91]. The most commonly utilized acoustic features are mel-frequency cepstral coefficients (MFCCs) and linear prediction coefficients (LPCs). Features that are more robust to noise, obtained with the use of spectral subband centroids [90] or the zero-crossings with peak-amplitudes model [89] as an acoustic front end, have also been proposed. Acoustic features are usually augmented by their first- and second-order derivatives (delta and delta–delta coefficients) [88]. The appropriate selection and extraction of acoustic features is not addressed in this paper.

On the other hand, the establishment of visual features for speaker recognition is a relatively newer research topic. Various approaches have been implemented towards face detection and tracking and facial feature extraction [22], [92]–[96], and will be discussed in more detail in Section IV. The dynamics of the visual speech are captured, similarly to acoustic features, by augmenting the "static" (frame-based) visual feature vector with its first- and second-order time derivatives, which are computed over a short temporal window centered at the current video frame [86]. Mean normalization of the visual feature vectors can also be utilized to reduce variability due to illumination [115].
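The delta/delta–delta augmentation described above can be sketched as follows, using a generic regression-style delta over a short window (an illustrative formulation, not necessarily the exact one used in [86] or [88]):

```python
import numpy as np

def compute_delta(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """First-order time derivatives of a (T, D) feature matrix via the
    standard regression delta:
        d_t = sum_{k=1..W} k * (x_{t+k} - x_{t-k}) / (2 * sum_k k^2)
    with edge frames replicated."""
    T, D = feats.shape
    denom = 2 * sum(k * k for k in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    delta = np.zeros_like(feats)
    for k in range(1, window + 1):
        delta += k * (padded[window + k : window + k + T]
                      - padded[window - k : window - k + T])
    return delta / denom

def augment(feats: np.ndarray) -> np.ndarray:
    d1 = compute_delta(feats)   # delta coefficients
    d2 = compute_delta(d1)      # delta-delta coefficients
    return np.hstack([feats, d1, d2])

# A 13-D MFCC-like stream of 100 frames becomes 39-D after augmentation.
mfcc_like = np.random.randn(100, 13)
print(augment(mfcc_like).shape)   # (100, 39)
```

The same routine applies unchanged to frame-based visual feature vectors.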
AV fusion combines audio and visual information in order to achieve higher person recognition performance than both audio-only and visual-only person recognition systems. If no fusion of the acoustic and visual information takes place, then audio-only and visual-only person recognition systems result (see Fig. 3). The main advantage of AV biometric systems lies in their robustness, since each modality can provide independent and complementary information and therefore prevent performance degradation due to noise present in one or both of the modalities. There exist various adaptive fusion approaches, which weight the contributions of the different modalities based on their discrimination ability and reliability, as discussed in Section VI. The rates of the acoustic and visual features are typically different. The rate of acoustic features is usually 100 Hz [86], while video frame rates can be up to 25 frames per second (50 fields per second) for PAL or 30 frames per second (60 fields per second) for NTSC. For fusion methods that require the same rate for both modalities, the video is typically up-sampled using an interpolation technique in order to achieve AV feature synchrony at the audio rate. Finally, adaptation of the person's models is an important part of the AV system in Fig. 3. It is usually performed when the environment or the speaker's voice characteristics change, or when the person's appearance changes due, for example, to pose or illumination changes, facial hair, glasses, or aging [80].
IV. ANALYSIS OF VISUAL FEATURES UTILIZED FOR AV BIOMETRICS
The choice of acoustic features for speaker recognition has been thoroughly investigated in the literature [88] (see also other papers in this issue). Therefore, in this paper, we focus on presenting various approaches for the extraction of visual features utilized for AV speaker recognition. Visual features are usually extracted from two-dimensional (2-D) or three-dimensional (3-D) images [97]–[99], in the visible or infrared part of the spectrum [100]. Facial visual features can be classified into global or local, depending on whether the face is represented by only one or multiple feature vectors. Each local feature vector represents information contained in small image patches of the face or specific regions of the face (e.g., eyes, nose, mouth, etc.). Visual features can also be either static (a single face image is used) or dynamic (a video sequence of only the mouth region, yielding visual-labial features, or of the whole face is used). Static visual features are commonly used for face recognition [18]–[22], [92]–[96], while dynamic visual features are used for speaker recognition, since they contain additional important temporal information that captures the dynamics of facial feature changes, especially the changes in the mouth region (visual speech). The various sets of visual facial features proposed in the literature are generally grouped into three categories [101]: 1) appearance-based features, such as transformed vectors of the face or mouth region pixel intensities using, for example, image compression techniques [28], [77], [79]–[81], [83], [102]; 2) shape-based features, such as geometric or model-based representations of the face or lip contours [77], [79]–[83]; and 3) features that are a combination of both appearance and shape features in 1) and 2) [79]–[81], [103].

The algorithms utilized for detecting and tracking the face, mouth, or lips depend on the visual features that will be used for speaker recognition, along with the quality of the video data and the resource constraints. For example, only a rough detection of the face or mouth region is sufficient to obtain appearance-based visual features, requiring only the tracking of the face and the two mouth corners. On the other hand, a computationally more expensive lip extraction and tracking algorithm is additionally required for obtaining shape-based features, a challenging task especially in low-resolution videos.
A. Facial Feature Detection, Tracking, and Extraction

Face detection constitutes, in general, a difficult problem, especially in cases where the background, head pose, and lighting are varying. It has attracted significant interest in the literature [93]–[96], [104]–[106]. Some reported face detection systems use traditional image processing techniques, such as edge detection, image thresholding, template matching, color segmentation, or motion information in image sequences [106]. They take advantage of the fact that many local facial subfeatures contain strong edges and are approximately rigid. Nevertheless, the most widely used techniques follow a statistical modeling of the face appearance to obtain a classification of image regions into face and nonface classes. Such regions are typically represented as vectors of grayscale or color image pixel intensities over normalized rectangles of a predetermined size. They are often projected onto lower dimensional spaces and are defined over a "pyramid" of possible locations, scales, and orientations in the image [93]. These regions are usually classified using one or more techniques, such as neural networks, clustering algorithms along with distance metrics from the face or nonface spaces, simple linear discriminants, support vector machines (SVMs), and Gaussian mixture models (GMMs) [93], [94], [104]. An alternative popular approach instead uses a cascade of weak classifiers that are trained using the AdaBoost technique and operate on local appearance features within these regions [105]. If color information is available, image regions that do not contain a sufficient number of skin-tone-like pixels can be determined (for example, utilizing hue and saturation) [107], [108] and eliminated from the search. Typically, face detection goes hand-in-hand with tracking, in which the temporal correlation is taken into account (tracking can be performed at the face or facial feature level). The simplest possible approach to capitalize on the temporal correlation is to assume that the face (or facial feature) will be present in the same spatial location in the next frame.
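A minimal sketch of the hue/saturation skin-tone pruning idea follows; the threshold values are illustrative assumptions, not values taken from [107], [108]:

```python
import numpy as np

def rgb_to_hs(img: np.ndarray):
    """Convert an (H, W, 3) float RGB image in [0, 1] to hue (degrees)
    and saturation channels, vectorized with numpy."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx = img.max(axis=-1)
    mn = img.min(axis=-1)
    delta = mx - mn
    hue = np.zeros_like(mx)
    nz = delta > 1e-12
    # Piecewise hue definition depending on which channel is largest.
    rmax = nz & (mx == r)
    gmax = nz & (mx == g) & ~rmax
    bmax = nz & ~rmax & ~gmax
    hue[rmax] = (60.0 * (g - b)[rmax] / delta[rmax]) % 360.0
    hue[gmax] = 60.0 * (b - r)[gmax] / delta[gmax] + 120.0
    hue[bmax] = 60.0 * (r - g)[bmax] / delta[bmax] + 240.0
    sat = np.where(mx > 1e-12, delta / np.maximum(mx, 1e-12), 0.0)
    return hue, sat

def skin_mask(img, hue_max=50.0, sat_range=(0.15, 0.7)):
    """Boolean mask of pixels whose hue/saturation fall in a loose
    skin-tone range; regions with few True pixels can be skipped."""
    hue, sat = rgb_to_hs(img)
    return (hue <= hue_max) & (sat >= sat_range[0]) & (sat <= sat_range[1])

# Example: a skin-like pixel versus a saturated blue pixel.
patch = np.array([[[0.8, 0.55, 0.45], [0.1, 0.2, 0.9]]])
print(skin_mask(patch))   # [[ True False]]
```

In a full detector this mask only prunes the search space; candidate regions that survive are still passed to the statistical face/nonface classifier.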
After successful face detection, the face image can be processed to obtain an appearance-based representation of the face. Hierarchical techniques can also be used at this point to detect a number of interesting facial features, such as the mouth corners, eyes, nostrils, and chin, by utilizing prior knowledge of their relative positions on the face in order to simplify the search. Also, if color information is available, hue and saturation information can be utilized in order to directly detect and extract certain facial features (especially lips), or to constrain the search area and enable more accurate feature extraction [107], [108]. These features can be used to extract and normalize the mouth region-of-interest (ROI), containing useful visual speech information. The normalization is usually performed with respect to head-pose information and lighting [Fig. 4(a) and (b)]. The appearance-based features are extracted from the ROI using image transforms (see Section IV-B).
On the other hand, shape-based visual mouth features (divided into geometric, parametric, and statistical, as in Fig. 5) are extracted from the ROI utilizing techniques such as snakes [109], templates [110], and active shape and appearance models [111]. A snake is an elastic curve represented by a set of control points, and it is used to detect important visual features, such as lines, edges, or contours. The snake control point coordinates are iteratively updated, converging towards a minimum of an energy function defined on the basis of curve smoothness constraints and a matching criterion to desired features of the image [109]. Templates are parametric curves that are fitted to the desired shape by minimizing an energy function defined similarly to that of snakes. Examples of lip contour estimation using a gradient vector field (GVF) snake and two parabolic templates are depicted in Fig. 4(c) [82]. Examples of statistical models are active shape models (ASMs) and active appearance models (AAMs) [111]. The former are obtained by applying principal component analysis (PCA) [112] to training vectors containing the coordinates of a set of points that lie on the shapes of interest, such as the lip inner and outer contours. These vectors are projected onto a lower dimensional space defined by the eigenvectors corresponding to the largest PCA eigenvalues, representing the axes of shape variation. The latter are extensions of ASMs that, in addition, capture the appearance variation of the region around the desired shape. AAMs remove the redundancy due to shape and appearance correlation and create a single model that describes both shape and the corresponding appearance deformation.
B. Visual Features

With appearance-based approaches to visual feature representation, the pixel values of the face or mouth ROI, tracked and extracted according to the discussion of the previous section, are directly considered. The extracted ROI is typically a rectangle containing the mouth, possibly including larger parts of the lower face, such as the jaw and cheeks [80], or could even be the entire face [103] (see Figs. 4(b) and 5). It can also be extended into a three-dimensional rectangle containing adjacent frame ROIs, thus capturing dynamic visual speech information. Alternatively, the mouth ROI can be obtained from a number of image profiles vertical to the estimated lip contour, as in [79], or from a disc around the mouth center [113]. A feature vector x_t (see Fig. 5) is created by ordering the grayscale pixel values inside the ROI. The
Fig. 4. Mouth appearance and shape tracking for visual feature extraction. (a) Commonly detected facial features. (b) Two corresponding mouth ROIs of different sizes. (c) Lip contour estimation using a gradient vector field snake (upper; the snake's external force field is depicted) and two parabolas (lower) [82].
dimension d of this vector typically becomes prohibitively large for successful statistical modeling of the classes of interest, and thus a lower dimensional transformation of it is used instead. A D × d dimensional linear transform matrix R is generally sought, such that the transformed data vector y_t = R x_t contains most speechreading information in its D ≪ d elements (see Fig. 5). Matrix R is often obtained based on a number of training ROI grayscale pixel value vectors, utilizing techniques borrowed from the image compression and pattern classification literature. Examples of such transforms are PCA, generating "eigenlips" (or "eigenfaces" if applied to face images for face recognition) [113], the discrete cosine transform (DCT) [114], [115], the discrete wavelet transform (DWT) [115], linear discriminant analysis (LDA) [11], [77], [80], the Fisher linear discriminant (FLD), and the maximum likelihood linear transform (MLLT) [28], [80]. PCA provides a low-dimensional representation optimal in the mean-squared error sense, while LDA and FLD provide the most discriminant features, that is, features that offer a clear separation between the pattern classes. Often, these transforms are applied in a cascade [11], [80] in order to cope with the "curse of dimensionality" problem. In addition, fast algorithmic implementations are available for some of these transformations. It is important to point out that appearance-based features allow for dynamic visual feature extraction in real time, due to the fact that a rough ROI extraction can be achieved by utilizing computationally inexpensive face detection algorithms. Clearly, the quality of the appearance-based visual features degrades under intense head-pose and lighting variations.
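As one concrete instance of the transforms listed above, a 2-D DCT of the mouth ROI can be implemented directly with numpy; the ROI size and the number of retained coefficients below are illustrative assumptions:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

def dct_features(roi: np.ndarray, keep: int = 6) -> np.ndarray:
    """2-D DCT of a square grayscale ROI; keep the top-left keep x keep
    block of low-frequency coefficients as the feature vector y_t."""
    n = roi.shape[0]
    C = dct_matrix(n)
    coeffs = C @ roi @ C.T      # separable 2-D transform
    return coeffs[:keep, :keep].ravel()

# A 32 x 32 mouth ROI reduced to a 36-D appearance feature vector.
roi = np.random.rand(32, 32)
print(dct_features(roi).shape)   # (36,)
```

Keeping only the low-frequency block plays the role of the D ≪ d reduction; a data-driven transform such as PCA or LDA would replace the fixed DCT basis with one learned from training ROIs.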
With shape-based features, it is assumed that most of the information is contained in face contours or the shape of the speaker's lips [77], [102], [103], [116]. Therefore, such features achieve a compact representation of facial images and visual speech using low-dimensional vectors and are invariant to head pose and lighting. However, their extraction requires robust algorithms, which is often difficult and computationally intensive in realistic scenarios. Geometric features, such as the height, width, and perimeter of the mouth, are meaningful to humans and can be readily extracted from the mouth images. Geometric mouth features have been used for visual speech recognition [87], [117] and speaker recognition [117]. Alternatively, model-based visual features are typically obtained in conjunction with a parametric or statistical facial feature extraction algorithm. With model-based approaches, the model parameters are directly used as visual speech features [79], [82], [111].

An example of model-based visual features is represented by the facial animation parameters (FAPs) of the outer- and inner-lip contours [14], [82], [102]. FAPs describe facial movement and are used in the MPEG-4 AV object-based video representation standard to control facial animation, together with the so-called facial definition parameters (FDPs) that describe the shape of the face. The FAP extraction system described in [82] is shown in Fig. 6. The system first employs a template matching algorithm to locate the person's nostrils by searching the central area of the face in the first frame of each sequence. Tracking is performed by centering the search area in the next frame at the location of the nose in the previous frame. The nose location is used to determine the approximate mouth location. Subsequently, the outer lip contour is determined by using a combination of a GVF snake and a parabolic template [see also Fig. 4(c)]. Following the outer lip contour detection and tracking, ten FAPs describing the outer-lip shape ("group 8" FAPs [82])

Fig. 5. Various visual speech feature representation approaches discussed in this section: appearance-based (upper) and shape-based features (lower) that may utilize lip geometry, parametric, or statistical lip models.
are extracted from the resulting lip contour (see also Figs. 4 and 5). These are placed into a feature vector, which is subsequently projected by means of PCA onto a three-dimensional space [82]. The resulting visual features (eigenFAPs) are augmented by their first- and second-order derivatives, providing a nine-dimensional dynamic visual speech vector.
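The geometric lip features mentioned earlier in this section (mouth height, width, perimeter) can be read off a tracked contour directly. A toy sketch, with the contour represented as an ordered list of (x, y) points (this representation is an assumption for illustration, not the one used in [82]):

```python
import numpy as np

def geometric_lip_features(contour: np.ndarray) -> dict:
    """Compute simple geometric features from an ordered (N, 2) array
    of outer-lip contour points, treated as a closed polygon."""
    width = contour[:, 0].max() - contour[:, 0].min()
    height = contour[:, 1].max() - contour[:, 1].min()
    # Perimeter: distances between consecutive points, closing the loop.
    diffs = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    perimeter = np.sqrt((diffs ** 2).sum(axis=1)).sum()
    return {"width": width, "height": height, "perimeter": perimeter}

# A unit-square "contour" as a sanity check: width = height = 1, perimeter = 4.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(geometric_lip_features(square))
```

Tracked per frame, such scalars form a low-dimensional dynamic feature stream that can be augmented with deltas like any other visual feature.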
Since appearance- and shape-based visual features contain, respectively, low- and high-level information about the person's face and lip movements, their combination has been utilized in the expectation of improving the performance of the recognition system. Features of each type are usually just concatenated [79], or a single model of face shape and appearance is created [103], [111]. For example, PCA appearance features are combined with snake-based features or ASMs, or a single model of face shape and appearance is created using AAMs [111]. PCA can be applied further to this single model vector [103].

In summary, a number of approaches can be used for extracting and representing visual information utilized for speaker recognition. Unfortunately, limited work exists in the literature comparing the relative performance of visual speech features. The advantage of appearance-based features is that they, unlike shape-based features, do not require sophisticated extraction methods. In addition, appearance-based visual features contain information that cannot be captured by shape-based visual features. Their disadvantage is that they are generally sensitive to lighting and rotation changes. The dimensionality of appearance-based visual features is also usually much higher than that of shape-based visual features, which affects reliable training of AV person recognition systems. Most comparisons of visual features are made for features within the same category (appearance- or shape-based), in the context of AV or V-only person recognition or AV-ASR [103], [114]–[116]. Occasionally, features across categories are compared, but in most cases with inconclusive results [102], [103], [114]. Thus, the question of what are the most appropriate and robust visual speech features remains to a large extent unresolved. Clearly, the characteristics of the particular application and factors such as computational requirements, video quality, and the visual environment have to be considered in addressing this question.
V. SPEAKER RECOGNITION PROCESS
Although single-modality biometric systems can achieve high performance in some cases, they are usually not robust under nonideal conditions and do not meet the needs of many potential person recognition applications. In order to improve the robustness of biometric systems, multisample (multiple samples of the same biometric characteristic), multialgorithm (multiple algorithms applied to the same biometric sample), and multimodal (different biometric characteristics) biometric systems have been developed [5]–[16], [23]–[51]. Different modalities can provide independent and complementary information, thus alleviating problems characteristic of single modalities. The fusion of multiple modalities is a critical issue in the design of any recognition system, as is the case with the fusion of the audio and visual modalities in the design of AV person recognition systems. In order to justify the complexity and cost of incorporating the visual modality into a person recognition system, fusion strategies should ensure that the performance of the resulting AV system exceeds that of its single-modality counterparts, hopefully by a significant amount, especially in nonideal conditions (e.g., acoustically noisy environments). The choice of classifiers and of algorithms for feature and classifier fusion is clearly central to the design of AV person recognition systems. In this section, we describe the speaker recognition process, and in the next section we review the main concepts and techniques of combining the acoustic and visual feature streams.
Audio-only speaker recognition has been extensively discussed in the literature [17], and the objective of this paper is to concentrate only on AV speaker recognition systems. The process of speaker recognition consists of two
Fig. 6. Shape-based visual feature extraction system of [82], depicted schematically in parallel with the audio front end, as used for AV speaker recognition experiments [14].
Aleksic and Katsaggelos: Audio-Visual Biometrics
2032 Proceedings of the IEEE | Vol. 94, No. 11, November 2006
phases, those of training and testing. In the training phase, the speaker is asked to utter certain phrases in order to acquire the data to be used for training. In general, the larger the amount of training data, the better the performance of a speaker recognition system. In the testing phase, the speaker utters a certain phrase, and the system accepts or rejects his/her claim (speaker authentication) or makes a decision on the speaker's identity (speaker identification). The testing phase can be followed by an adaptation phase, during which the recognized speaker's data are used to update the models corresponding to him/her. This phase is used for improving the robustness of the system to changes in the speaker's acoustic and visual speech characteristics over time, due to channel and environmental noise, visual appearance changes, etc. Maximum a posteriori (MAP) [112] adaptation is commonly used in this phase.
Speaker recognition systems can be classified into text-dependent and text-independent, based on the text used in the testing phase. Text-dependent systems can be further divided into fixed-phrase and prompted-phrase systems. Fixed-phrase systems are trained on the phrase that is also used for testing. However, these systems are very vulnerable to attacks (an impostor, for example, could play a recorded phrase uttered by the user). Prompted-phrase systems ask the claimant to utter a word sequence (phoneme sequence) not used in the training phase or in previous tests. These systems are less vulnerable to attacks (an impostor would need to generate in real time an AV representation of the speaker) but require an interface for prompting phrases. In text-independent systems, the speech used for testing is unconstrained. These systems are convenient for applications in which we cannot control the claimant's input. However, they are more vulnerable to attacks than the prompted-phrase systems.
A. Classifiers in AV Speaker Recognition
The speaker recognition problem is a classification problem. A set of classes needs to be defined first, and then, based on the observations, one of these classes is chosen. Let C denote the set of all classes. In speaker identification systems, C typically consists of the enrolled subject population, possibly augmented by a class denoting the unknown subject. On the other hand, in speaker authentication systems, C reduces to a two-member set, consisting of the class corresponding to the user and the general population (impostor class).
The number of classes can be larger than mentioned above, as, for example, in text-dependent speaker recognition (and ASR). In this case, subphonetic units are considered, utilizing tree-based clustering of possible phonetic contexts (bi-phones or tri-phones, for example), to allow coarticulation modeling [86]. The set of classes C then becomes the product space between the set of speakers and the set of phonetic-based units. Similarly, one could consider visemic subphonetic units, obtained for example by decision-tree clustering based on visemic context (visemic units have been used in ASR [118]). The same set of classes is typically used if both speech modalities are considered, since the use of different classes complicates AV integration, especially in text-dependent speaker recognition.
The input observations to the recognizer are represented by the extracted feature vectors at time t, o_{s,t}, and their sequence over a time interval T, O_s = {o_{s,t}, t ∈ T}, where s ∈ S denotes the available modalities. For instance, for the AV speaker recognition system, S = {a, v, f}, where a stands for audio, v for visual-dynamic (visual-labial), and f for face appearance input.
A number of approaches can be used to model our knowledge of how the observations are generated by each class. Such approaches are usually statistical in nature and express our prior knowledge in terms of the conditional probability P(o_{s,t} | c), where c ∈ C, by utilizing artificial neural networks (ANNs), support vector machines (SVMs), Gaussian mixture models (GMMs), hidden Markov models (HMMs), etc. The parameters of the prior model are estimated during training. During testing, based on the trained model, which describes the way the observations are generated by each class, the posterior probability is maximized. In single-modality speaker identification systems, the objective is to determine the class c*, corresponding to one of the enrolled persons or the unknown person, that best matches the person's biometric data, that is

    c* = argmax_{c ∈ C} P(c | o_{s,t}),   s ∈ {a, v, f}                    (1)
where P(c | o_{s,t}) represents the posterior conditional probability. In closed-set identification systems, the unknown person is not modeled and the classification is forced into one of the enrolled persons' classes.
In single-modality speaker verification systems there are only two classes: w, the class corresponding to the general population (impostor class), and c, the class corresponding to the true claimant. Then, the following similarity measure D can be defined:

    D = log P(c | o_{s,t}) - log P(w | o_{s,t}),   s ∈ {a, v, f}           (2)

If D is larger than an a priori defined verification threshold, the claim is accepted; otherwise, it is rejected. The world model is usually obtained using biometric data from the general speaker population.
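The decision rule in (2) can be sketched in a few lines; the scores and the threshold value below are illustrative, and the two log-posteriors are assumed to be already computed by the claimed-client and world models:

```python
import math

def verify(log_p_client, log_p_world, threshold=0.0):
    """Decision rule of (2): accept the identity claim if the
    log-probability difference D exceeds an a priori threshold."""
    D = log_p_client - log_p_world
    return D > threshold

print(verify(math.log(0.8), math.log(0.2)))  # True: claim accepted
print(verify(math.log(0.1), math.log(0.5)))  # False: claim rejected
```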
A widely used prior model is the GMM, expressed by

    P(o_{s,t} | c) = Σ_{k=1}^{K_{s,c}} w_{s,c,k} N(o_{s,t}; m_{s,c,k}, Σ_{s,c,k})        (3)
where K_{s,c} denotes the number of mixture components, the mixture weights w_{s,c,k} are positive and add up to one, and N(o; m, Σ) represents a multivariate Gaussian distribution with mean m and covariance matrix Σ, typically considered to be diagonal. During training, the parameters of each GMM in (3), namely w_{s,c,k}, m_{s,c,k}, and Σ_{s,c,k}, are estimated. In certain applications, for example, text-independent speaker recognition, a single GMM is used to model the entire observation sequence O_s for each class c. Then, for a given modality s, the MAP estimate of the unknown class is obtained as

    c* = argmax_{c ∈ C} P(c | O_s) = argmax_{c ∈ C} P(c) Π_{t ∈ T} P(o_{s,t} | c)      (4)

where P(c) denotes the class prior.
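Equations (3) and (4) can be sketched in pure Python; the toy one-dimensional, single-component speaker models and their values below are illustrative, not taken from any system described in this paper:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_gauss_diag(o, mean, var):
    """log N(o; m, Sigma) with a diagonal covariance matrix."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def gmm_loglik(o, gmm):
    """log P(o | c) under the GMM of (3); gmm is a list of
    (weight, mean, var) components whose weights sum to one."""
    return logsumexp([math.log(w) + log_gauss_diag(o, m, v) for w, m, v in gmm])

def map_identify(frames, models, priors):
    """MAP class estimate of (4): argmax_c P(c) * prod_t P(o_t | c),
    computed in the log domain."""
    def score(c):
        return math.log(priors[c]) + sum(gmm_loglik(o, models[c]) for o in frames)
    return max(models, key=score)

# Toy 1-D, single-component models for two enrolled speakers (illustrative).
models = {"spk1": [(1.0, [0.0], [1.0])],
          "spk2": [(1.0, [5.0], [1.0])]}
frames = [[4.8], [5.1], [5.3]]
print(map_identify(frames, models, {"spk1": 0.5, "spk2": 0.5}))  # spk2
```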
In a number of other applications, such as text-dependent speaker recognition, the observations are generated by a temporal sequence of interacting states. In this case, HMMs are widely used. They consist of a number of states. At each time instance, one of them generates ("emits") the observed features with class-conditional probability given by (3). The HMM parameters (mixture weights, means, variances, transition probabilities) are typically estimated iteratively during training, using the expectation-maximization (EM) algorithm [84], [86], or by discriminative training methods [84]. Once the model parameters are estimated, HMMs can be used to obtain the "optimal state sequence" λ_c = {λ^c_t, t ∈ T} per class c, given an observation sequence O_s over an interval T; namely

    λ*_c = argmax_{λ_c} P(λ_c | O_s)
         = argmax_{λ_c} P(c) Π_{t ∈ T} P(λ^c_t | λ^c_{t-1}) P(o_{s,t} | λ^c_t)         (5)

where P(λ^c_t | λ^c_{t-1}) denotes the transition probability from state λ^c_{t-1} to state λ^c_t. The Viterbi algorithm is used for solving (5), based on dynamic programming [84], [86]. Then, the MAP estimate of the unknown class is obtained as

    c* = argmax_{c ∈ C} P(λ*_c | O_s).                                      (6)
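The Viterbi recursion used to solve (5) can be sketched as follows; the two-state left-to-right model and all probability values are illustrative:

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Best state sequence of (5) by dynamic programming.
    obs_loglik[t][j] = log P(o_t | state j), log_trans[i][j] = log P(j | i),
    log_init[j] = log probability of starting in state j."""
    n = len(log_init)
    delta = [log_init[j] + obs_loglik[0][j] for j in range(n)]
    backptrs = []
    for t in range(1, len(obs_loglik)):
        new_delta, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j] + obs_loglik[t][j])
            ptr.append(best_i)
        delta = new_delta
        backptrs.append(ptr)
    path = [max(range(n), key=lambda j: delta[j])]
    for ptr in reversed(backptrs):      # trace the best path backwards
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(delta)

# Toy two-state left-to-right model (all values illustrative).
L = math.log
obs = [[L(0.9), L(0.1)], [L(0.2), L(0.8)], [L(0.1), L(0.9)]]
trans = [[L(0.6), L(0.4)], [L(1e-12), L(1.0 - 1e-12)]]
init = [L(1.0 - 1e-12), L(1e-12)]
path, score = viterbi(obs, trans, init)
print(path)  # [0, 1, 1]
```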
B. Performance Evaluation Measures
The performance of identification systems is usually reported in terms of the identification error or the rank-N correct identification rate, defined as the probability that the correct match of the unknown person's biometric data is in the top N similarity scores (this scenario corresponds to an identification system which is not fully automated, needing human intervention or additional identification systems applied in cascade).
Two commonly used error measures for verification performance are the false acceptance rate (FAR), the rate at which an impostor is accepted, and the false rejection rate (FRR), the rate at which a client is rejected. They are defined by

    FAR = (I_A / I) × 100%,   FRR = (C_R / C) × 100%                       (7)

where I_A denotes the number of accepted impostors, I the number of impostor claims, C_R the number of rejected clients, and C the number of client claims. There is an inherent tradeoff between FAR and FRR, which is controlled by an a priori chosen verification threshold.
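A minimal sketch of (7), assuming each trial produces a scalar score and a claim is accepted when the score exceeds the threshold (the score values below are illustrative):

```python
def far_frr(impostor_scores, client_scores, threshold):
    """FAR and FRR, in percent, as in (7): a trial is accepted
    when its score exceeds the verification threshold."""
    I_A = sum(s > threshold for s in impostor_scores)   # accepted impostors
    C_R = sum(s <= threshold for s in client_scores)    # rejected clients
    far = 100.0 * I_A / len(impostor_scores)
    frr = 100.0 * C_R / len(client_scores)
    return far, frr

# Illustrative scores (higher = more client-like).
impostors = [0.1, 0.4, 0.6, 0.2]
clients = [0.7, 0.9, 0.5, 0.8]
print(far_frr(impostors, clients, threshold=0.55))  # (25.0, 25.0)
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is the tradeoff discussed above.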
The receiver operating characteristic (ROC) curve or the detection error tradeoff (DET) curve can be used to graphically represent the tradeoff between FAR and FRR [119]. DET and ROC curves depict FRR as a function of FAR on a log and a linear scale, respectively. The detection cost function (DCF) is a measure derived from FAR and FRR according to

    DCF = Cost(FR) · P(client) · FRR + Cost(FA) · P(impostor) · FAR        (8)

where P(client) and P(impostor) are the prior probabilities that a client or an impostor will use the system, respectively, while Cost(FA) and Cost(FR) represent, respectively, the costs of false acceptance and false rejection. The half total error rate (HTER) [119], [120] is a special case of the DCF obtained when the prior probabilities are equal to 0.5 and the costs equal to 1, resulting in HTER = (1/2)(FRR + FAR). Verification system performance is often reported using a single measure, either by choosing the threshold for which FAR and FRR are equal, resulting in the equal error rate (EER), or by choosing the threshold that minimizes the DCF (or HTER). The appropriate threshold can be found either using the test set (providing biased results) or a separate validation set [121]. Expected performance curves (EPCs) are proposed as a verification measure in [121] and [122]. They provide unbiased expected system performance analysis, using a validation set to compute thresholds corresponding to various criteria related to real-life applications.
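The threshold selection described above can be sketched as a simple sweep over candidate thresholds; with the default priors and costs below, the minimized quantity reduces to the HTER of (8) expressed in percent (all score values are illustrative):

```python
def error_rates(impostors, clients, thr):
    """FAR and FRR in percent at one threshold, as in (7)."""
    far = 100.0 * sum(s > thr for s in impostors) / len(impostors)
    frr = 100.0 * sum(s <= thr for s in clients) / len(clients)
    return far, frr

def min_dcf(impostors, clients, thresholds,
            p_client=0.5, cost_fr=1.0, cost_fa=1.0):
    """Sweep candidate thresholds and return (best threshold, minimum DCF)
    according to (8); the defaults make the DCF equal to the HTER."""
    best = None
    for thr in thresholds:
        far, frr = error_rates(impostors, clients, thr)
        dcf = cost_fr * p_client * frr + cost_fa * (1.0 - p_client) * far
        if best is None or dcf < best[1]:
            best = (thr, dcf)
    return best

impostors = [0.1, 0.2, 0.3, 0.6]
clients = [0.5, 0.7, 0.8, 0.9]
print(min_dcf(impostors, clients, [0.25, 0.4, 0.55, 0.75]))  # (0.4, 12.5)
```

Performing this sweep on the test set itself yields the biased estimate mentioned above; an unbiased protocol picks the threshold on a separate validation set.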
VI. AV FUSION METHODS
Information fusion is used for the integration of different sources of information, with the ultimate goal of achieving superior classification results. Fusion approaches are usually classified into three categories: premapping fusion, midst-mapping fusion, and postmapping fusion [8]
(see Fig. 7). These are also referred to in the literature as early integration, intermediate integration, and late integration, respectively [80], and the terms will be used interchangeably in this paper. In premapping fusion, audio and visual information are combined before the classification process. In midst-mapping fusion, audio and visual information are combined during the mapping from sensor data or feature space into opinion or decision space. Finally, in postmapping fusion, information is combined after the mapping from sensor data or feature space into opinion or decision space. In the remainder of this section, following [8], we will review some of the most commonly used information fusion methods and comment on their use for the fusion of acoustic and visual information.
A. Premapping Fusion (Early Integration)
Premapping fusion can be divided into sensor data level and feature level fusion. In sensor data level fusion [123], the sensor data obtained from different sensors of the same modality are combined. Weighted summation and mosaic construction are typically utilized in order to enable sensor data level fusion. In the weighted summation approach, the data are first normalized, usually by mapping them to a common interval, and then combined utilizing weights. For example, weighted summation can be utilized to combine multiple visible and infrared images, or to combine acoustic data obtained from several microphones of different types and quality. Mosaic construction can be utilized to create one image from images of parts of the face obtained by several different cameras.
Feature level fusion represents the combination of the features obtained from different sensors. Joint feature vectors are obtained either by weighted summation (after normalization) or by concatenation (e.g., by appending a visual to an acoustic feature vector, with normalization, in order to obtain a joint feature vector). The features obtained by the concatenation approach are usually of high dimensionality, which can affect reliable training of a classification system (the "curse of dimensionality") and ultimately recognition performance. In addition, concatenation does not allow for the modeling of the reliability of individual feature streams. For example, in the case of audio and visual streams, it cannot take advantage of information that might be available about the acoustic or the visual noise in the environment. Furthermore, the audio and visual feature streams should be synchronized before the concatenation is performed.
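A minimal sketch of concatenative feature-level fusion, under the simplifying assumptions that the audio frame rate is an integer multiple of the video frame rate and that frame repetition is an acceptable form of synchronization (both are illustrative choices, not prescribed by the paper):

```python
def z_norm(frames):
    """Per-dimension zero-mean, unit-variance normalization."""
    dims = list(zip(*frames))
    means = [sum(d) / len(d) for d in dims]
    stds = [max((sum((x - mu) ** 2 for x in d) / len(d)) ** 0.5, 1e-8)
            for d, mu in zip(dims, means)]
    return [[(x - mu) / s for x, mu, s in zip(f, means, stds)] for f in frames]

def concat_fusion(audio, video):
    """Feature-level fusion by concatenation: the slower visual stream is
    upsampled by frame repetition so the streams are synchronized, then each
    visual vector is appended to the corresponding (normalized) acoustic one."""
    ratio = len(audio) // len(video)          # assumes an integer rate ratio
    video_up = [v for v in video for _ in range(ratio)]
    return [a + v for a, v in zip(z_norm(audio), z_norm(video_up))]

audio = [[1.0], [2.0], [3.0], [4.0]]   # e.g., a 100-Hz acoustic stream
video = [[10.0], [20.0]]               # e.g., a 50-Hz visual stream
joint = concat_fusion(audio, video)
print(len(joint), len(joint[0]))       # 4 2
```

The joint vectors have the summed dimensionality of the two streams, which is the source of the "curse of dimensionality" noted above.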
B. Midst-Mapping Fusion (Intermediate Integration)
With midst-mapping fusion, the information streams are processed during the procedure of mapping the feature space into the opinion or decision space. It exploits the temporal dynamics contained in the different streams, thus avoiding problems resulting from vector concatenation, such as the "curse of dimensionality" and the requirement of matching rates. Furthermore, stream weights can be utilized in midst-mapping fusion to account for the reliability of the different streams of information. Multistream and extended HMMs [32], [79], [80], [82], [88] are commonly used for midst-mapping fusion in AV speaker recognition systems [13], [30], [32], [33]. For example, multistream HMMs can be used to combine acoustic and visual dynamic information [80], [82], [88], allowing for easy modeling of the reliability of the audio and visual streams and various levels of asynchronicity between them.
In the state-synchronous case, the probability density function for each HMM state is defined as [77], [88]

    P(o_{s,t} | λ^c_t) = Π_{s ∈ {a,v}} [ Σ_{k=1}^{K_{s,c}} w_{s,λ^c_t,k} N(o_{s,t}; m_{s,c,k}, Σ_{s,c,k}) ]^{γ_s}    (9)
Fig. 7. AV fusion methods (adapted from [8]).
where γ_s denotes the stream weight corresponding to modality s. The stream weights add up to one, that is, γ_a + γ_v = 1. The combination of audio and visual information can also be performed at the phone, word, and utterance level [80], allowing for different levels of asynchronicity between the audio and visual streams.
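In the log domain, the stream exponents γ_s of (9) become a weighted sum of per-stream emission log-likelihoods, which can be sketched as follows (the likelihood values and weights below are illustrative):

```python
import math

def multistream_state_loglik(stream_logliks, stream_weights):
    """Log of (9) for one HMM state: raising each stream's emission
    likelihood to its weight gamma_s becomes a weighted sum of the
    per-stream log-likelihoods."""
    assert abs(sum(stream_weights.values()) - 1.0) < 1e-9
    return sum(stream_weights[s] * ll for s, ll in stream_logliks.items())

# Illustrative per-stream emission log-likelihoods for one frame.
logliks = {"a": math.log(0.5), "v": math.log(0.1)}
# In clean audio, the (more reliable) acoustic stream gets the larger weight.
clean = multistream_state_loglik(logliks, {"a": 0.8, "v": 0.2})
noisy = multistream_state_loglik(logliks, {"a": 0.2, "v": 0.8})
print(clean > noisy)  # True: upweighting the better-matching stream raises the score
```

Adapting the weights to the estimated acoustic SNR is the mechanism by which such systems stay robust when one modality degrades.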
C. Postmapping Fusion (Late Integration)
Postmapping fusion approaches are grouped into decision and opinion fusion (the latter also referred to as score-level fusion). With decision fusion [11], [14], classifier decisions are combined in order to obtain the final decision, by utilizing majority voting, combination of ranked lists, or the AND and OR operators. In majority voting [48], the final decision is made when the majority of the classifiers reach the same decision. The number of classifiers should be chosen carefully in order to prevent ties (e.g., for a two-class problem, such as speaker verification, the number of classifiers should be odd). In ranked list combination fusion [48], [124], the ranked lists provided by each classifier are combined in order to obtain the final ranked list. There exist various approaches for the combination of ranked lists [124]. In AND fusion [125], the final decision is made only if all classifiers reach the same decision. This type of fusion is typically used for high-security applications, where very low FA rates are desired, at the expense of higher FR rates. On the other hand, when the OR fusion method is utilized, the final decision is made as soon as one of the classifiers reaches a decision. This type of fusion is utilized for low-security applications, where lower FR rates are desired in order to prevent causing inconvenience to the registered users, at the cost of higher FA rates.
Unlike decision fusion, in opinion (score-level) fusion
methods, the experts do not provide final decisions but only opinions (scores) on each possible decision. The opinions are usually first normalized, by mapping them to a common interval, and then combined utilizing weights (e.g., by either weighted summation or weighted product fusion). The weights are determined based on the discriminating ability of each classifier and on the quality of the utilized features (usually affected by the feature extraction method and/or the presence of different types of noise). For example, when audio and visual information are employed, the acoustic SNR and/or the quality of the visual feature extraction algorithms are considered in determining the weights. After the opinions are combined, the class that corresponds to the highest opinion is chosen. In the postclassifier opinion fusion approach [8], the likelihoods corresponding to each of the N_C classes of interest, obtained utilizing each of the N_L available experts, are considered as features in an N_L × N_C dimensional space, in which the classification of the resulting features is performed. This method is particularly useful in verification applications, since only two classes are available. In the case of AV speaker verification, the number of experts can also be small (two or more). However, utilizing this method in identification problems (with a large number of classes), or when the number of experts N_L is large, can result in features of high dimensionality, which could cause inadequate performance.
It is important to point out that the problem of combining single-modality biometric results becomes more complicated in the case of speaker identification, which requires one-to-many comparisons to find the best match. In addition, when choosing the biometric modalities to be used in a multimodal biometric system, one should consider not only how much the modalities complement each other in terms of identification accuracy, but also the identification speed. In some cases, a cascade of single-modality biometric classifiers can be used: the number of candidates is first narrowed down by a fast, but not necessarily highly accurate, first biometric system, and then a second, higher accuracy biometric system is used on the remaining candidates to improve the overall identification performance.
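The weighted-summation opinion fusion described earlier in this section can be sketched as follows; the expert names, scores, and weights are illustrative, and min-max normalization is one common choice for mapping scores to a common interval:

```python
def min_max_norm(scores):
    """Map one expert's scores onto the common interval [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {c: (s - lo) / (hi - lo) for c, s in scores.items()}

def weighted_sum_fusion(expert_scores, weights):
    """Opinion (score-level) fusion: normalize each expert's opinions,
    combine them by weighted summation, and pick the top-scoring class."""
    normed = {e: min_max_norm(s) for e, s in expert_scores.items()}
    classes = next(iter(expert_scores.values()))
    fused = {c: sum(weights[e] * normed[e][c] for e in expert_scores)
             for c in classes}
    return max(fused, key=fused.get), fused

# Illustrative opinions of an audio expert and a face expert on three speakers;
# the audio expert is weighted more heavily (e.g., high acoustic SNR).
scores = {"audio": {"spk1": 2.0, "spk2": 5.0, "spk3": 1.0},
          "face":  {"spk1": 0.9, "spk2": 0.6, "spk3": 0.1}}
best, fused = weighted_sum_fusion(scores, {"audio": 0.7, "face": 0.3})
print(best)  # spk2
```

In an adaptive variant, the weights would be re-estimated at test time from an estimate of each stream's reliability, rather than fixed in advance.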
VII. AV BIOMETRIC SYSTEMS
In this section, we briefly review corpora commonly used for AV person recognition research and present examples of specific systems and their performance.
A. AV Databases
In contrast to the abundance of audio-only databases, there exist only a few databases suitable for AV biometric research. This is because the field is relatively young, but also due to the fact that AV corpora pose additional challenges concerning database collection, storage, distribution, and privacy. The databases most commonly used in the literature were collected by a few university groups or individual researchers with limited resources, and as a result, they usually contain a small number of subjects and have relatively short duration. AV databases usually vary greatly in the number of speakers, vocabulary size, number of sessions, nonideal acoustic and visual conditions, and evaluation measures. This makes the comparison of different visual features and fusion methods, with respect to the overall performance of an AV biometric system, difficult. They lack realistic variability and are usually limited to one area, or focus on one aspect of biometric person recognition. Some of the currently publicly available AV databases which have been used in the published literature for AV biometrics are the M2VTS (multimodal verification for teleservices and security applications) [126] and XM2VTS (extended M2VTS) [127], BANCA (biometric access control for networked and e-commerce applications) [128], VidTIMIT [26] (video recordings of people reciting sentences from the TIMIT corpus), DAVID [129], VALID [130], and AVICAR (AV speech corpus in a car environment) [131] databases. We next provide a short description of each of them.
The M2VTS [126] database consists of audio recordings and video sequences of 37 subjects uttering the digits
0 through 9 in five sessions spaced apart by at least one week. The subjects were also asked to rotate their head to the left and then to the right in each session, in order to obtain a head rotation sequence that can provide 3-D face features to be used for face recognition purposes. The main drawbacks of this database are its small size and limited vocabulary. The extended M2VTS database [127] consists of audio recordings and video sequences of 295 subjects uttering three fixed phrases, two ten-digit sequences and one seven-word sentence, with two utterances of each phrase, in four sessions. The main drawback of this database is its limitation to the development of text-dependent systems. Both the M2VTS and XM2VTS databases have been frequently used in the literature for the comparison of different AV biometric systems (see Table 2).
The BANCA database consists of audio recordings and video sequences of 208 subjects (104 male, 104 female) recorded in three different scenarios, controlled, degraded, and adverse, over 12 different sessions spanning three months. The subjects were asked to say a random 12-digit number, their name, their address, and their date of birth during each of the recordings. The BANCA database was captured
Table 2 Sample AV Person Recognition Systems
in four European languages. Both high- and low-quality microphones and cameras were used for the recordings. This database provides realistic and challenging conditions and allows for the comparison of different systems with respect to their robustness.
The VidTIMIT database consists of audio recordings and video sequences of 43 subjects (19 female and 24 male) reciting short sentences from the test section of the NTIMIT corpus [26], in three sessions with an average delay of a week between sessions, allowing for appearance and mood changes. Each person utters ten sentences. The first two sentences are the same for all subjects, while the remaining eight are generally different for each person. All sessions contain phonetically balanced sentences. In addition to the sentences, the subjects were asked to move their heads left, right, up, and then down, in order to obtain a head rotation sequence. AV biometric systems that utilize the VidTIMIT corpus are described in [8].
The DAVID database consists of audio and video recordings (frontal and profile views) of more than
100 speakers, including 30 subjects recorded in five sessions over a period of several months. The utterances include a digit set, an alphabet set, vowel-consonant-vowel syllables, and phrases. The challenging visual conditions include illumination changes and variable scene background complexity.
The VALID database consists of five recordings of 106 subjects (77 male, 29 female) over a period of one month. Four of the sessions were recorded in an office environment in the presence of visual noise (illumination changes) and acoustic noise (background noise). In addition, one session was recorded in a studio environment and contains a head rotation sequence, for which the subjects were asked to face four targets placed above, below, left, and right of the camera. The database consists of recordings of the same utterances as those recorded in the XM2VTS database, therefore enabling comparison of the performance of different systems and investigation of the effect of challenging visual environments on the performance of algorithms developed with the XM2VTS database.
The AVICAR database [131] consists of audio recordings and video sequences of 100 speakers (50 male and 50 female), with various language backgrounds (60% native American English speakers), uttering isolated digits, isolated letters, phone numbers, and TIMIT sentences inside a car. The audio recordings are obtained using a visor-mounted array of eight microphones under five different car noise conditions (car idling; 35 and 55 mph with all windows rolled up, or with just the front windows rolled down). The video sequences are obtained using a dashboard-mounted array of four video cameras. This database provides different challenges for the tracking and extraction of visual features and can be utilized for analysis of the effect of nonideal acoustic and visual conditions on AV speaker recognition performance.
Additional datasets for AV biometrics research are the Clemson University AV Experiments (CUAVE) corpus containing connected digit strings [132], the AMP/CMU database of 78 isolated words [133], the Tulips1 set of four isolated digits [134], the IBM AV database [15], and the AV-TIMIT corpus [9].
None of the existing AV databases has all the desirable characteristics, such as an adequate number of subjects, size of vocabulary and utterances, realistic variability (representing, for example, speaker identification on a mobile hand-held device, or taking into account other nonideal acoustic and visual conditions), recommended experiment protocols (specifying, for example, that certain specific subjects are to be used as clients and certain specific subjects as impostors), and the ability to utilize them for text-independent as well as text-dependent verification systems. There is, therefore, a great need for new, standardized databases and evaluation measures (see Section V-B) that would enable fair comparison of different systems and represent realistic nonideal conditions. Experiment protocols should also be defined in a way that avoids biased results and allows for fair comparison of different person recognition systems.
B. Examples of AV Biometric Systems
The performance of AV person recognition systems strongly depends on the choice and accurate extraction of the acoustic and visual features and on the AV fusion approach utilized. Due to differences in visual features, fusion methods, AV databases, and evaluation procedures, it is usually very difficult to compare systems. As mentioned in Section I, audio-only speaker recognition and face recognition systems are extensively covered elsewhere and will not be discussed here. In this section, we present various visual-only-dynamic, audio-visual-static, and audio-visual-dynamic systems and provide some comparisons. Table 2 shows an overview of the various AV person recognition systems found in the literature. Some of these systems are discussed in more detail in the remainder of this section.
Luettin et al. [135] developed a visual-only speaker identification system utilizing only the dynamic visual information present in video recordings of the mouth area. They utilized the Tulips1 database [134], consisting of recordings of 12 speakers uttering the first four English digits, extracted shape- and appearance-based visual features, and performed both text-dependent and text-independent experiments. Their person identification system, based on HMMs, achieved 72.9%, 89.6%, and 91.7% recognition rates when shape-based, appearance-based, and joint (concatenation fusion) visual features were utilized, respectively, in text-dependent experiments. In text-independent experiments, their system achieved 83.3%, 95.8%, and 97.9% recognition rates when shape-based, appearance-based, and joint (concatenation) visual features were utilized, respectively. In summary, they
achieved better results with appearance-based than with shape-based visual features, and the identification performance improved further when joint features were utilized.
1) Audio-Visual-Static Biometric Systems: Chibelushi et al.
[5] developed an AV biometrics system that utilizesacoustic information and static visual information con-
tained in face profiles. They utilized an AV database that
consists of audio recordings and face images of ten speakers[5]. The images are taken at different head orientations,
image scales, and subject positions. They combinedacoustic and visual information utilizing weighted summa-
tion fusion. Their system achieved an EER of 3.4%, 3.0%,
and 1.5% when only speech information, only visualinformation, or both acoustic and visual information were
used, respectively.
Brunelli and Falavigna [6] developed a text-independent speaker identification system that combines audio-only speaker identification and face recognition. They utilized an AV database that consists of audio recordings and face images of 89 speakers collected in three sessions [6]. The system comprises five classifiers, two acoustic and three visual. The two acoustic classifiers correspond to two sets of acoustic features (static and dynamic) derived from the short-time spectral analysis of the speech signal. Their audio-only speaker identification system is based on vector quantization (VQ). The three visual classifiers correspond to the visual features extracted from three regions of the face, i.e., eyes, nose, and mouth. The individually obtained classification scores are combined using the weighted product approach. The identification rate of the integrated system is 98%, compared to the 88% and 91% rates obtained by the audio-only speaker recognition and face recognition systems, respectively.
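The weighted product rule combines the individual classifier scores multiplicatively, which is equivalent to a weighted sum of log-scores. The sketch below illustrates only the rule; the five scores and the reliability weights are made-up values, not those estimated in [6]:

```python
import math

def weighted_product_fusion(scores, weights):
    """Fuse per-classifier scores with the weighted product rule.

    Computed as a weighted sum in the log domain, which is numerically
    equivalent; scores must be positive (e.g., normalized similarities).
    """
    if len(scores) != len(weights):
        raise ValueError("one weight per classifier is required")
    log_fused = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_fused)

# Five classifier scores: two acoustic (static, dynamic) and three
# visual (eyes, nose, mouth); values and weights are hypothetical.
scores = [0.8, 0.9, 0.7, 0.6, 0.75]
weights = [0.3, 0.3, 0.15, 0.1, 0.15]
fused = weighted_product_fusion(scores, weights)
```

Because a single near-zero score drives the product toward zero, this rule is less forgiving of one failing classifier than weighted summation.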
Ben-Yacoub et al. [7] developed both text-dependent and text-independent AV speaker verification systems by utilizing acoustic information and frontal face visual information from the XM2VTS database. They utilized elastic graph matching in order to obtain face matching scores. They investigated several binary classifiers for postclassifier opinion fusion, namely, SVM, Bayesian classifier, Fisher's linear discriminant, decision tree, and multilayer perceptron (MLP). They obtained the best results utilizing the SVM and Bayesian classifiers, which also outperformed the single modalities.
Sanderson and Paliwal [8] utilized speech and face information to perform text-independent identity verification. They extracted appearance-based visual features by performing PCA on the face image window containing the eyes and the nose. The acoustic features consisted of MFCCs and their corresponding deltas and maximum autocorrelation values, which capture pitch and voicing information. A voice activity detector (VAD) was used to remove the feature vectors which represent silence or background noise, while a GMM classifier was used as a modality (speech or face) expert to obtain opinions from the features. They performed an elaborate analysis and evaluation of several nonadaptive and adaptive approaches for information fusion and compared them in noisy and clean audio conditions with respect to overall verification performance on the VidTIMIT database. The fusion methods they analyzed include weighted summation, Bayesian classifier, SVM, concatenation, adaptive weighted summation, a proposed piecewise linear postclassifier, and a modified Bayesian postclassifier. The utilized fusion methods take into account how the distributions of opinions are likely to change due to noisy conditions, without making a direct assumption about the type of noise present in the testing features. The verification results obtained for various SNRs in the presence of operations-room noise are shown in Fig. 8. The operations-room noise contains background speech as well as machinery sounds. The results are reported in terms of total error (TE), defined as TE = FAR + FRR. They concluded that the performance of most of the nonadaptive fusion systems was similar and that it degraded in noisy conditions.

Hazen et al. [9] developed a text-dependent speaker authentication system that utilizes lower quality audio and visual signals obtained by a handheld device. They detected 14 face components and used ten of them, after normalization, as visual features for a face recognition algorithm which utilizes SVMs. They achieved a 90% reduction in speaker verification EER when fusing face and speaker identification information.
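The error measures appearing throughout these comparisons (FAR, FRR, and the total error TE = FAR + FRR) are computed from genuine and impostor score sets at a chosen decision threshold. A small self-contained sketch with made-up scores:

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """False acceptance and false rejection rates at one threshold.

    A trial is accepted when its score is >= threshold, so impostor
    scores above the threshold count as false acceptances and genuine
    scores below it count as false rejections.
    """
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

genuine = [0.9, 0.8, 0.75, 0.6]   # hypothetical client-trial scores
impostor = [0.4, 0.55, 0.7, 0.3]  # hypothetical impostor-trial scores
far, frr = far_frr(genuine, impostor, threshold=0.65)
total_error = far + frr           # TE = FAR + FRR
```

The EER reported by many of the systems above is the value of FAR (equivalently FRR) at the threshold where the two rates are equal.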
Fig. 8. AV person recognition performance obtained for various SNRs in the presence of operations-room noise, utilizing several nonadaptive and adaptive AV fusion methods described in [8], on the VidTIMIT database [26].

2) Audio-Visual-Dynamic Biometric Systems: Jourlin et al. [10] developed a text-dependent AV speaker verification system that utilizes both acoustic and visual dynamic information and tested it on the M2VTS database. Their 39-dimensional acoustic features consist of LPC coefficients and their first- and second-order derivatives. They use 14 lip shape parameters, ten intensity parameters, and the scale as visual features, resulting in a 25-dimensional visual feature vector. They utilize HMMs to perform audio-only, visual-only, and AV experiments. The AV score is computed as a weighted sum of the audio and visual scores. Their results demonstrate a reduction of FAR from 2.3% when the audio-only system is used to 0.5% when the multimodal system is used.
Wark et al. [11]–[13] employed multi-stream HMMs to develop text-independent AV speaker verification and identification systems tested on the M2VTS database. They utilized MFCCs as acoustic features and lip contour information, obtained after applying PCA and LDA, as visual features. They trained the system in clean conditions and tested it in degraded acoustic conditions. At low SNRs, the AV system achieved significant performance improvement over the audio-only system and also outperformed the visual-only system, while at high SNRs its performance was similar to that of the audio-only system.

We have developed an AV speaker recognition system
with the AMP/CMU database [133], utilizing 13 MFCC coefficients and their first- and second-order derivatives as acoustic features [14]. A visual shape-based feature vector consisting of ten FAPs, which describe the movement of the outer-lip contour [82], extracted using the systems previously discussed in Section IV, was projected by means of PCA onto a three-dimensional space (see Fig. 6). The resulting visual features were augmented with first- and second-order derivatives, providing nine-dimensional dynamic visual feature vectors. We used a feature fusion integration approach and single-stream HMMs to integrate the dynamic acoustic and visual information. Speaker verification and identification experiments were performed using audio-only and AV information, under both clean and noisy audio conditions at SNRs ranging from 0 to 30 dB. The results obtained for both speaker identification and verification experiments, expressed in terms of the identification error and EER, are shown in Table 3. Significant improvement in performance over the audio-only (AU) speaker recognition system was achieved, especially under noisy acoustic conditions. For instance, the identification error was reduced from 53.1%, when audio-only information was utilized, to 12.82%, when AV information was employed, at 0-dB SNR.
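The feature fusion integration approach described above can be sketched as follows: per-frame acoustic and visual vectors, each augmented with first- and second-order derivatives, are concatenated into a single observation vector for a single-stream HMM. In this sketch the derivatives are simple frame differences rather than the regression deltas common in speech front ends, and the random arrays merely stand in for real MFCC and FAP streams:

```python
import numpy as np

def add_deltas(features):
    """Append first- and second-order time derivatives (simple frame
    differences; production systems typically use regression deltas)."""
    delta = np.diff(features, axis=0, prepend=features[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([features, delta, delta2])

def fuse_features(acoustic, visual):
    """Concatenate per-frame acoustic and visual feature vectors
    (feature fusion); the streams must be frame-synchronous."""
    assert acoustic.shape[0] == visual.shape[0], "streams must be aligned"
    return np.hstack([acoustic, visual])

T = 50                              # number of frames
mfcc = np.random.randn(T, 13)       # 13 MFCCs per frame
faps = np.random.randn(T, 3)        # 3-D PCA projection of 10 FAPs
av = fuse_features(add_deltas(mfcc), add_deltas(faps))
# av has 39 acoustic + 9 visual dimensions per frame
```

Because the visual frame rate is usually lower than the acoustic one, a real pipeline interpolates or repeats visual frames before this concatenation.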
Chaudhari et al. [15] developed an AV speaker identification and verification system which modeled the reliability of the audio and video information streams with time-varying and context-dependent parameters. The acoustic features consisted of 23 MFCC coefficients, while the visual features consisted of 24 DCT coefficients from the transformed ROI. They utilized GMMs to model speakers, and parameters that depended on time, modality, and speaker to model stream reliability. The system was tested on the IBM database [15], achieving an EER of 1.04%, compared to 1.71%, 1.51%, and 1.22% for the audio-only, video-only, and AV (feature fusion) systems, respectively.

Dieckmann et al. [16] developed a system which used features obtained from three modalities: face, voice, and lip movement. Their fusion scheme utilized majority voting and opinion fusion. Two of the three experts had to agree on the opinion, and the combined opinion had to exceed a predefined threshold. The identification error decreased to 7% when all three modalities were used, compared to 10.4%, 11%, and 18.7% when voice, lip movement, and face features were used individually.
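The decision rule described for this system, at least two of the three experts agreeing plus a combined opinion above a predefined threshold, can be sketched as follows. The per-expert acceptance threshold and the combined threshold are assumptions for illustration, not values from [16]:

```python
def two_of_three_fusion(opinions, accept_threshold=0.5, combined_threshold=1.5):
    """Decision fusion in the spirit of the scheme described above.

    At least two of the three experts must individually accept
    (opinion >= accept_threshold), and the combined opinion (here the
    plain sum) must exceed a predefined threshold; both thresholds are
    illustrative assumptions.
    """
    votes = sum(o >= accept_threshold for o in opinions)
    return votes >= 2 and sum(opinions) > combined_threshold

# Opinions from face, voice, and lip-movement experts (hypothetical).
decision = two_of_three_fusion([0.8, 0.7, 0.3])
```

Requiring both conditions guards against a single overconfident expert dominating the decision while still tolerating one failing modality.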
VIII. DISCUSSION/CONCLUSION
In this paper, we addressed how the joint processing of audio and visual signals, both generated by a talking person, can provide valuable information to benefit AV speaker recognition applications. We first concentrated on the analysis of visual signals and described various ways of representing and extracting the information available in them that is unique to each speaker. We then discussed how the visual features can complement features extracted (by well-studied methods) from the acoustic signal, and how the two-modality representations can be fused together to allow joint AV processing. We reviewed several speaker identification and/or authentication systems that have appeared in the literature and presented some experimental results. These results demonstrated the importance of utilizing visual information for speaker recognition, especially in the presence of acoustic noise.
The field of joint AV person recognition is new and active, with many accomplishments (as described in this paper) and exciting opportunities for further research and development. Some of these open issues and opportunities are the following.
There is a need for additional resources for advancing and assessing the performance of AV speaker recognition systems. Publicly available multimodal corpora that better reflect realistic conditions, such as acoustic noise and lighting changes, would help in investigating the robustness of AV systems. They can serve as a reference point for development, as well as for the evaluation and comparison of various systems. Baseline algorithms and systems could also be agreed upon and made available in order to facilitate separate investigation of the effects that various factors, such as the choice of acoustic and visual features, the information fusion approach, or the classification algorithms, have on system performance.

Table 3. Speaker Recognition Performance Obtained for Various SNRs Utilizing Audio-Only and AV Systems in [14], Tested on the AMP/CMU Database [133]
In comparing and evaluating speaker recognition systems, the statistical significance of the results needs to be determined [136]. It is not adequate to simply report that one system achieved a lower error rate than another one (and is therefore better) using the same experimental setup. The mean and the variance of a particular error measure can assist in determining the relative performance of systems. In addition, standard experiment protocols and evaluation procedures should be defined in order to enable fair comparison of different systems. Experiment protocols could include a number of different configurations in which the available subjects are randomly divided into enrolled subjects and impostors, therefore providing performance measures obtained for each of the configurations. These measures can be used to determine the statistical significance of the results.
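Such a protocol can be realized by repeating the evaluation over several random enrolled/impostor configurations and reporting the mean and variance of the error measure. In the sketch below the recognition experiment itself is abstracted into an error_fn placeholder, since a full verification run is out of scope:

```python
import random
import statistics

def evaluate_configuration(subjects, n_enrolled, seed, error_fn):
    """Randomly split subjects into enrolled clients and impostors,
    then run one evaluation; error_fn stands in for a full
    recognition experiment returning an error measure (e.g., EER)."""
    rng = random.Random(seed)          # reproducible per-configuration split
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    enrolled, impostors = shuffled[:n_enrolled], shuffled[n_enrolled:]
    return error_fn(enrolled, impostors)

def protocol_summary(subjects, n_enrolled, n_configs, error_fn):
    """Mean and variance of the error measure over random
    configurations, as a basis for significance statements."""
    errors = [evaluate_configuration(subjects, n_enrolled, seed, error_fn)
              for seed in range(n_configs)]
    return statistics.mean(errors), statistics.variance(errors)

# Toy error function standing in for a real verification experiment.
def toy_error(enrolled, impostors):
    return 0.01 * len(impostors)

mean_err, var_err = protocol_summary(list(range(20)), 15, 10, toy_error)
```

With mean and variance in hand, two systems evaluated under the same configurations can be compared with a standard paired significance test rather than a single error figure.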
In many cases, the performance of person identification systems is reported for a closed-set system. The underlying assumption is that the same performance will carry over to an open-set system. This, however, may well not hold true; therefore, a model for an unknown person should be used.
The design of a truly high-performing visual feature representation system with improved robustness to the visual environment, possibly employing 2.5-D or 3-D face information [97]–[99], needs to be further investigated. Finally, improved AV integration algorithms that allow unconstrained AV asynchrony modeling and robust, localized reliability estimation of the signal information content, degraded due, for example, to occlusion, illumination change, or pose, are also needed.

Concerning the practical deployment of AV biometric technology, robustness represents the grand challenge. There are few applications where the environment and the data acquisition mechanism can be carefully controlled. Widespread use of the technology will require the ability to handle variability in the environment and data acquisition devices, as well as degradations due to data and channel encoding and to spoofing. Overall, the technology is quite robust to spoofing (when audio and dynamic visual features are used), although it may become possible in the future to synthesize a video of a person talking and replay it to the biometric system, for example on a laptop and even on the fly, in order to defeat a prompted-phrase system.
The technology is readily available, however, for most day-to-day applications that have the following characteristics [137]:
1) low security and highly user friendly, e.g., access to a desktop using biometric log-in;
2) high security, but the user can tolerate the inconvenience of being falsely rejected, e.g., access to military property;
3) low security, and convenience is the prime factor (more than any other factor, such as cost), e.g., time-stamping in a factory setting.
The technology is not yet available for highly secure and highly user-friendly applications (such as banking). Further research and development is therefore required for AV biometric systems to become widespread in practice.
REFERENCES

[1] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4–20, Jan. 2004.
[2] N. K. Ratha, A. W. Senior, and R. M. Bolle, "Automated biometrics," in Proc. Int. Conf. Advances Pattern Recognition, Rio de Janeiro, Brazil, 2001, pp. 445–474.
[3] Financial crimes report to the public, Fed. Bur. Investigation, Financial Crimes Section, Criminal Investigation Division. [Online]. Available: http://www.fbi.gov/publications/financial/fcs_report052005/fcs_report052005.htm
[4] A. K. Jain and U. Uludag, "Hiding biometric data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 11, pp. 1494–1498, Nov. 2003.
[5] C. C. Chibelushi, F. Deravi, and J. S. Mason, "Voice and facial image integration for speaker recognition," in Proc. IEEE Int. Symp. Multimedia Technologies Future Appl., Southampton, U.K., 1993.
[6] R. Brunelli and D. Falavigna, "Person identification using multiple cues," IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, pp. 955–965, Oct. 1995.
[7] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, "Fusion of face and speech data for person identity verification," IEEE Trans. Neural Networks, vol. 10, pp. 1065–1074, 1999.
[8] C. Sanderson and K. K. Paliwal, "Identity verification using speech and face information," Digital Signal Processing, vol. 14, no. 5, pp. 449–480, 2004.
[9] T. J. Hazen, E. Weinstein, R. Kabir, A. Park, and B. Heisele, "Multi-modal face and speaker identification on a handheld device," in Proc. Workshop Multimodal User Authentication, Santa Barbara, CA, 2003, pp. 113–120.
[10] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, "Integrating acoustic and labial information for speaker identification and verification," in Proc. 5th Eur. Conf. Speech Communication Technology, Rhodes, Greece, 1997, pp. 1603–1606.
[11] T. Wark, S. Sridharan, and V. Chandran, "Robust speaker verification via fusion of speech and lip modalities," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Phoenix, AZ, 1999, pp. 3061–3064.
[12] ———, "Robust speaker verification via asynchronous fusion of speech and lip information," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 37–42.
[13] ———, "The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMMs," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Istanbul, Turkey, 2000, pp. 2389–2392.
[14] P. S. Aleksic and A. K. Katsaggelos, "An audio-visual person identification and verification system using FAPs as visual features," in Proc. Workshop Multimodal User Authentication, Santa Barbara, CA, 2003, pp. 80–84.
[15] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, "Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction," in Proc. Int. Conf. Multimedia Expo, Baltimore, MD, Jul. 6–9, 2003, pp. 9–12.
[16] U. Dieckmann, P. Plankensteiner, and T. Wagner, "SESAM: A biometric person identification system using sensor fusion," Pattern Recogn. Lett., vol. 18, pp. 827–833, 1997.
[17] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[18] W.-Y. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys, pp. 399–458, Dec. 2003.
[19] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 586–591, Sep. 1991.
[20] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 1, pp. 103–108, Jan. 1990.
[21] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces versus fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 711–720, 1997.
[22] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[23] J. Luettin, "Visual speech and speaker recognition," Ph.D. dissertation, Dept. Computer Science, Univ. Sheffield, Sheffield, U.K., 1997.
[24] C. C. Chibelushi, F. Deravi, and J. S. Mason, "Audio-visual person recognition: An evaluation of data fusion strategies," in Proc. Eur. Conf. Security Detection, London, U.K., 1997, pp. 26–30.
[25] R. Brunelli, D. Falavigna, T. Poggio, and L. Stringa, "Automatic person recognition using acoustic and geometric features," Machine Vision Appl., vol. 8, pp. 317–325, 1995.
[26] C. Sanderson and K. K. Paliwal, "Noise compensation in a person verification system using face and multiple speech features," Pattern Recognition, vol. 36, no. 2, pp. 293–302, Feb. 2003.
[27] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, "Acoustic-labial speaker verification," Pattern Recogn. Lett., vol. 18, pp. 853–858, 1997.
[28] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, "Audio-visual speaker recognition using time-varying stream reliability prediction," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. V-712–V-715.
[29] S. Bengio, "Multimodal authentication using asynchronous HMMs," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777.
[30] A. V. Nefian, L. H. Liang, T. Fu, and X. X. Liu, "A Bayesian approach to audio-visual speaker identification," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 761–769.
[31] T. Fu, X. X. Liu, L. H. Liang, X. Pi, and A. V. Nefian, "Audio-visual speaker identification using coupled hidden Markov models," in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 29–32.
[32] S. Bengio, "Multimodal authentication using asynchronous HMMs," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777.
[33] ———, "Multimodal speech processing using asynchronous hidden Markov models," Information Fusion, vol. 5, pp. 81–89, 2004.
[34] N. A. Fox, R. Gross, P. de Chazal, J. F. Cohn, and R. B. Reilly, "Person identification using automatic integration of speech, lip, and face experts," in Proc. ACM SIGMM 2003 Multimedia Biometrics Methods and Applications Workshop (WBMA'03), Berkeley, CA, 2003, pp. 25–32.
[35] N. A. Fox and R. B. Reilly, "Audio-visual speaker identification based on the use of dynamic audio and visual features," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 743–751.
[36] Y. Abdeljaoued, "Fusion of person authentication probabilities by Bayesian statistics," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 172–175.
[37] Y. Yemez, A. Kanak, E. Erzin, and A. M. Tekalp, "Multimodal speaker identification with audio-video processing," in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 5–8.
[38] A. Kanak, E. Erzin, Y. Yemez, and A. M. Tekalp, "Joint audio-video processing for biometric speaker identification," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. 561–564.
[39] E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using an adaptive classifier cascade based on modality reliability," IEEE Trans. Multimedia, vol. 7, no. 5, pp. 840–852, Oct. 2005.
[40] M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using canonical correlation analysis," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Processing, Toulouse, France, May 2006, pp. 613–616.
[41] J. Kittler, J. Matas, K. Johnsson, and M. U. Ramos-Sánchez, "Combining evidence in personal identity verification systems," Pattern Recogn. Lett., vol. 18, pp. 845–852, 1997.
[42] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226–239, 1998.
[43] J. Kittler and K. Messer, "Fusion of multiple experts in multimodal biometric personal identity verification systems," in Proc. 12th IEEE Workshop Neural Networks Signal Processing, Switzerland, 2002, pp. 3–12.
[44] E. S. Bigun, J. Bigun, B. Duc, and S. Fisher, "Expert conciliation for multi modal person authentication systems by Bayesian statistics," in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, Mar. 1997, pp. 291–300.
[45] R. W. Frischholz and U. Dieckmann, "BioID: A multimodal biometric identification system," Computer, vol. 33, pp. 64–68, 2000.
[46] S. Basu, H. S. M. Beigi, S. H. Maes, M. Ghislain, E. Benoit, C. Neti, and A. W. Senior, "Methods and apparatus for audio-visual speaker recognition and utterance verification," U.S. Patent 6 219 640, 1999.
[47] L. Hong and A. Jain, "Integrating faces and fingerprints for personal identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 1295–1307, 1998.
[48] V. Radova and J. Psutka, "An approach to speaker identification using multiple classifiers," in Proc. IEEE Conf. Acoustics, Speech Signal Processing, Munich, Germany, 1997, vol. 2, pp. 1135–1138.
[49] A. Ross and A. Jain, "Information fusion in biometrics," Pattern Recogn. Lett., vol. 24, pp. 2115–2125, 2003.
[50] V. Chatzis, A. G. Bors, and I. Pitas, "Multimodal decision-level fusion for person authentication," IEEE Trans. Systems, Man, Cybernetics, Part A: Syst. Humans, vol. 29, no. 5, pp. 674–680, Nov. 1999.
[51] N. Fox, R. Gross, J. Cohn, and R. B. Reilly, "Robust automatic human identification using face, mouth, and acoustic information," in Proc. Int. Workshop Analysis Modeling of Faces and Gestures, Beijing, China, Oct. 2005, pp. 263–277.
[52] F. J. Huang and T. Chen, "Consideration of Lombard effect for speechreading," in Proc. Workshop Multimedia Signal Processing, 2001, pp. 613–618.
[53] J. D. Woodward, "Biometrics: Privacy's foe or privacy's friend?" Proc. IEEE, vol. 85, pp. 1480–1492, 1997.
[54] R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1–15, 1997.
[55] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, "Quantitative association of vocal-tract and facial behavior," Speech Commun., vol. 26, no. 1–2, pp. 23–43, 1998.
[56] J. Jiang, A. Alwan, P. A. Keating, E. T. Auer, Jr., and L. E. Bernstein, "On the relationship between face movements, tongue movements, and speech acoustics," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1174–1188, Nov. 2002.
[57] J. P. Barker and F. Berthommier, "Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models," in Proc. Int. Conf. Auditory Visual Speech Processing, Santa Cruz, CA, 1999, pp. 112–117.
[58] H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson, "Using speech acoustics to drive facial motion," in Proc. 14th Int. Congr. Phonetic Sciences, San Francisco, CA, 1999, pp. 631–634.
[59] A. V. Barbosa and H. C. Yehia, "Measuring the relation between speech acoustics and 2-D facial motion," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Salt Lake City, UT, 2001, vol. 1, pp. 181–184.
[60] P. S. Aleksic and A. K. Katsaggelos, "Speech-to-video synthesis using MPEG-4 compliant visual features," IEEE Trans. CSVT, Special Issue Audio Video Analysis for Multimedia Interactive Services, pp. 682–692, May 2004.
[61] A. Q. Summerfield, "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by Eye: The Psychology of Lip-Reading, R. Campbell and B. Dodd, Eds. London, U.K.: Lawrence Erlbaum, 1987, pp. 3–51.
[62] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," Amer. Scientist, vol. 86, no. 3, pp. 236–244, 1998.
[63] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. Neural Networks, Special Issue Intelligent Multimedia, vol. 13, no. 4, pp. 900–915, Jul. 2002.
[64] Q. Summerfield, "Use of visual information in phonetic perception," Phonetica, vol. 36, pp. 314–331, 1979.
[65] ———, "Lipreading and audio-visual speech perception," Phil. Trans. R. Soc. Lond. B, vol. 335, pp. 71–78, 1992.
[66] K. W. Grant and L. D. Braida,BEvaluating the articulation index forauditory-visual input,[ J. Acoustical Soc.
Amer., vol. 89, pp. 2950–2960, Jun. 1991.
[67] G. Fant, Acoustic Theory of Speech Production,S-Gravenhage. Amsterdam,The Netherlands: Mouton, 1960.
[68] J. L. Flanagan, Speech Analysis Synthesis
and Perception. Berlin, Germany:Springer-Verlag, 1965.
[69] S. Narayanan and A. Alwan,B Articulatory-acoustic models for fricativeconsonants,[ IEEE Trans. Speech AudioProcessing, vol. 8, no. 3, pp. 328–344,Jun. 2000.
[70] J. Schroeter and M. Sondhi, BTechniquesfor estimating vocal-tract shapes from thespeech signal,[ IEEE Trans. Speech AudioProcessing, vol. 2, no. 1, pp. 133–150,Feb. 1994.
[71] H. McGurk and J. MacDonald, BHearinglips and seeing voices,[ Nature, vol. 264,pp. 746–748, 1976.
[72] T. Chen and R. R. Rao, B Audio-visualintegration in multimodal communication,[
Proc. IEEE, vol. 86, no. 5, pp. 837–852,May 1998.
[73] S. Oviatt, P. Cohen, L. Wu, J. Vergo,L. Duncan, B. Suhm, J. Bers, T. Holzman,T. Winograd, J. Landay, J. Larson, andD. Ferro, BDesigning the user interface formultimodal speech and pen-based gestureapplications: State-of-the-art systems andresearch directions,[ Human-Computer Interaction, vol. 15, no. 4, pp. 263–322, Aug. 2000.
[74] J. Schroeter, J. Ostermann, H. P. Graf,M. Beutnagel, E. Cosatto, A. Syrdal, A. Conkie, and Y. Stylianou, BMultimodalspeech synthesis,[ in Proc. Int. Conf.
Multimedia Expo, New York, 2000,pp. 571–574.
[75] C. C. Chibelushi, F. Deravi, and J. S. D.Mason, B A review of speech-based bimodalrecognition,[ IEEE Trans. Multimedia, vol. 4,no. 1, pp. 23–37, Mar. 2002.
[76] D. G. Stork and M. E. Hennecke, Eds.,Speechreading by Humans and Machines.Berlin, Germany: Springer, 1996.
[77] P. S. Aleksic, G. Potamianos, and A. K. Katsaggelos, BExploiting visualinformation in automatic speechprocessing,[ in Handbook of Image and VideoProcessing , A. Bovik, Ed. New York: Academic, Jun. 2005, pp. 1263–1289.
[78] E. Petajan,B Automatic lipreading to enhancespeech recognition,[ Ph.D. dissertation,Univ. Illinois at Urbana-Champaign, Urbana,IL, 1984.
[79] S. Dupont and J. Luettin,B Audio-visualspeech modeling for continuous speech
recognition,[ IEEE Trans. Multimedia, vol. 2,no. 3, pp. 141–151, Sep. 2000.
[80] G. Potamianos, C. Neti, G. Gravier, A. Garg,and A. W. Senior, BRecent advances in theautomatic recognition of audiovisualspeech,[ Proc. IEEE, vol. 91, no. 9,pp. 1306–1326, Sep. 2003.
[81] G. Potamianos, C. Neti, J. Luettin, andI. Matthews, B Audio-visual automatic speechrecognition: An overview,[ in Issues in Visualand Audio-Visual Speech Processing , G. Bailly,E. Vatikiotis-Bateson, and P. Perrier, Eds.Cambridge, MA: MIT Press, 2004.
[82] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, B Audio-visual speechrecognition using MPEG-4 compliant visualfeatures,[ EURASIP J. Appl. Signal
Processing, vol. 2002, no. 11, pp. 1213–1227,Nov. 2002.
[83] T. Chen, B Audiovisual speech processing.Lip reading and lip synchronization,[ IEEESignal Processing Mag., vol. 18, no. 1,pp. 9–21, Jan. 2001.
[84] J. R. Deller, Jr., J. G. Proakis, and J. H. L.Hansen, Discrete-Time Processing of Speech
Signals. Englewood Cliffs, NJ: Macmillan,1993.
[85] R. Campbell, B. Dodd, and D. Burnham,Eds., Hearing by Eye II: Advances in thePsychology of Speechreading and Auditory Visual Speech. Hove, U.K.: Psychology Press, 1998.
[86] S. Young, G. Evermann, T. Hain,D. Kershaw, G. Moore, J. Odell, D. Ollason,D. Povey, V. Valtchev, and P. Woodland,The HTK Book. London, U.K.: Entropic,2005.
[87] A. J. Goldschen, O. N. Garcia, andE. D. Petajan, BRationale for phoneme-visememapping and feature selection in visualspeech recognition,[ in Speechreading byHumans and Machines, D. G. Stork and
M. E. Hennecke, Eds. Berlin, Germany:Springer, 1996, pp. 505–515.
[88] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ:Prentice Hall, 1993.
[89] D.-S. Kim, S.-Y. Lee, and R. M. Kil,B Auditory processing of speech signals forrobust speech recognition in real-world noisy environments,[ IEEE Trans. Speech AudioProcessing, vol. 7, no. 1, pp. 55–69, Jan. 1999.
[90] K. K. Paliwal, BSpectral subband centroidsfeatures for speech recognition,[ in Proc. Int.Conf. Acoustics, Speech and Signal Processing ,Seattle, WA, 1998, vol. 2, pp. 617–620.
[91] M. Akbacak and J. H. L. Hansen,BEnvironmental sniffing: Noise knowledgeestimation for robust speech systems,[ in
Proc. Int. Conf. Acoustics, Speech and SignalProcessing , Hong Kong, China, 2003, vol. 2,pp. 113–116.
[92] H. A. Rowley, S. Baluja, and T. Kanade,BNeutral networks-based face detection,[IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[93] A. W. Senior, BFace and feature finding for a face recognition system,[ in Proc. Int. Conf.
Audio Video-based Biometric Person Authentication, Washington, DC, 1999,pp. 154–159.
[94] K. Sung and T. Poggio, BExample-basedlearning for view-based human facedetection,[ IEEE Trans. Pattern Anal.
Machine Intell., vol. 20, no. 1, pp. 39–51,1998.
[95] E. Hjelmas and B. K. Low,B
Face detection: A survey,[ Computer Vision and ImageUnderstanding, vol. 83, no. 3, pp. 236–274,Sep. 2001.
[96] M.-H. Yang, D. Kriegman, and N. Ahuja,BDetecting faces in images: A survey,[ IEEETrans. Pattern Anal. Machine Intell., vol. 24,no. 1, pp. 34–58, Jan. 2002.
[97] V. Blanz, P. Grother, P. J. Phillips, andT. Vetter, BFace recognition based on frontal views generated from non-frontal images,[in Proc. Computer Vision PatternRecognition, 2005, pp. 454–461.
[98] C. Sanderson, S. Bengio, and Y. Gao,BOn transforming statistical models fornon-frontal face verification,[ PatternRecognition, vol. 39, no. 2, pp. 288–302,2006.
[99] K. W. Bowyer, K. Chang, and P. Flynn,B A survey of approaches and challenges in
3-D and multi-modal 3-D face recognition,[Computer Vision Image Understanding, vol. 101, no. 1, pp. 1–15, 2006.
[100] S. G. Kong, J. Heo, B. R. Abidi, J. Paik, andM. A. Abidi, BRecent advances in visualand infrared face recognition V A review,[Computer Vision Image Understanding, vol. 97,no. 1, pp. 103–135, 2005.
[101] M. E. Hennecke, D. G. Stork, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. Berlin, Germany: Springer, 1996, pp. 331–349.
[102] P. S. Aleksic and A. K. Katsaggelos, "Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Montreal, Canada, 2004, pp. 917–920.
[103] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, "A comparison of model and transform-based visual features for audio-visual LVCSR," in Proc. Int. Conf. Multimedia Expo, 2001, pp. 22–25.
[104] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[105] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. Conf. Computer Vision Pattern Recognition, Kauai, HI, Dec. 11–13, 2001, pp. 511–518.
[107] M. T. Chan, Y. Zhang, and T. S. Huang,BReal-time lip tracking and bimodalcontinuous speech recognition,[ in Proc.Workshop Multimedia Signal Processing ,Redondo Beach, CA, 1998, pp. 65–70.
[108] G. Chetty and M. Wagner,BFLiveness_ verification in audio-videoauthentication,[ in Proc. Int. Conf. SpokenLanguage Processing , Jeju Island, Korea,2004, pp. 2509–2512.
[109] M. Kass, A. Witkin, and D. Terzopoulos,BSnakes: Active contour models,[ Int. J.Computer Vision, vol. 4, no. 4,pp. 321–331, 1988.
[110] A. L. Yuille, P. W. Hallinan, and D. S. Cohen,BFeature extraction from faces usingdeformable templates,[ Int. J. Computer Vision, vol. 8, no. 2, pp. 99–111, 1992.
[111] T. F. Cootes, G. J. Edwards, and C. J. Taylor,B Active appearance models,[ in Proc. Eur.Conf. Computer Vision, Freiburg,Germany, 1998, pp. 484–498.
[112] R. O. Duda, P. E. Hart, and D. G. Stork,Pattern Classification. Hoboken, NJ: Wiley,2001.
[113] B. Maison, C. Neti, and A. Senior,B Audio-visual speaker recognition forbroadcast news: Some fusion techniques,[ inProc. Works. Multimedia Signal Processing ,Copenhagen, Denmark, 1999, pp. 161–167.
[114] P. Duchnowski, U. Meier, and A. Waibel,BSee me, hear me: Integrating automaticspeech recognition and lip-reading,[ in Proc.Int. Conf. Spoken Lang. Processing , Yokohama,Japan, Sep. 18–22, 1994, pp. 547–550.
[115] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM-based automatic lipreading," in Proc. Int. Conf. Image Processing, Chicago, IL, Oct. 4–7, 1998, vol. 1, pp. 173–177.
Aleksic and Katsaggelos: Audio-Visual Biometrics
Vol. 94, No. 11, November 2006 | Proceedings of the IEEE 2043
[116] P. S. Aleksic and A. K. Katsaggelos, "Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance," in Proc. Int. Conf. Image Processing, Italy, Sep. 2005, vol. 5, pp. 501–504.
[117] X. Zhang, C. C. Broun, R. M. Mersereau, and M. Clements, "Automatic speechreading with applications to human-computer interfaces," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002.
[118] M. Gordan, C. Kotropoulos, and I. Pitas, "A support vector machine-based dynamic network for visual speech recognition applications," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1248–1259, 2002.
[119] F. Cardinaux, C. Sanderson, and S. Bengio, "User authentication via adapted statistical models of face images," IEEE Trans. Signal Processing, vol. 54, no. 1, pp. 361–373, Jan. 2006.
[120] G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation: Overview, methodology, systems, results, perspective," Speech Commun., vol. 31, no. 2–3, pp. 225–254, 2000.
[121] S. Bengio, J. Mariethoz, and M. Keller, "The expected performance curve," in Int. Conf. Machine Learning, Workshop ROC Analysis in Machine Learning, Bonn, Germany, 2005.
[122] S. Bengio and J. Mariethoz, "The expected performance curve: A new assessment measure for person authentication," in Proc. Speaker Language Recognition Workshop (Odyssey), Toledo, Spain, 2004, pp. 279–284.
[123] D. L. Hall and J. Llinas, "Multisensor data fusion," in Handbook of Multisensor Data Fusion, D. L. Hall and J. Llinas, Eds. Boca Raton, FL: CRC, 2001, pp. 1–10.
[124] T. K. Ho, J. J. Hull, and S. N. Srihari, "Decision combination in multiple classifier systems," IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 66–75, 1994.
[125] R. C. Luo and M. G. Kay, "Introduction," in Multisensor Integration and Fusion for Intelligent Machines and Systems, R. C. Luo and M. G. Kay, Eds. Norwood, NJ: Ablex, 1995, pp. 1–26.
[126] S. Pigeon and L. Vandendorpe, "The M2VTS multimodal face database (release 1.00)," in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, 1997, pp. 403–409.
[127] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 72–77.
[128] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J.-P. Thiran, "The BANCA database and evaluation protocol," in Proc. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 625–638.
[129] C. C. Chibelushi, F. Deravi, and J. S. Mason, "BT DAVID database," Internal Rep., Speech and Image Processing Research Group, Dept. of Electrical and Electronic Engineering, Univ. of Wales Swansea, 1996.
[130] N. Fox, B. O'Mullane, and R. B. Reilly, "The realistic multi-modal VALID database and visual speaker identification comparison experiments," in Lecture Notes in Computer Science, T. Kanade, A. K. Jain, and N. K. Ratha, Eds. New York: Springer-Verlag, 2005, vol. 3546, p. 777.
[131] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, "AVICAR: Audio-visual speech corpus in a car environment," in Proc. Conf. Spoken Language, Jeju, Korea, 2004.
[132] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Orlando, FL, 2002.
[133] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Mag., vol. 18, pp. 9–21, Jan. 2001.
[134] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7.
[135] J. Luettin, N. Thacker, and S. Beet, "Speaker identification by lipreading," in Proc. Int. Conf. Speech and Language Processing, Philadelphia, PA, 1996, pp. 62–64.
[136] S. Bengio and J. Mariethoz, "A statistical significance test for person authentication," in Proc. Speaker and Language Recognition Workshop (Odyssey), Toledo, Spain, 2004, pp. 237–244.
[137] N. Poh and J. Korczak, "Biometric authentication in the e-World," in Automated Authentication Using Hybrid Biometric System, D. Zhang, Ed. Boston, MA: Kluwer, 2003, ch. 16.
ABOUT THE AUTHORS
Petar S. Aleksic received the B.S. degree in
electrical engineering from the University of
Belgrade, Serbia, in 1999, and the M.S. and Ph.D.
degrees in electrical engineering from Northwest-
ern University, Evanston, IL, in 2001 and 2004,
respectively.
He has been a member of the Image and Video
Processing Lab at Northwestern University, since
1999, where he is currently a Postdoctoral Fellow.
He has published more than ten articles in the area
of audio-visual signal processing, pattern recognition, and computer vision. His primary research interests include visual feature extraction
and analysis, audio-visual speech recognition, audio-visual biometrics,
multimedia communications, computer vision, pattern recognition, and
multimedia data mining.
Aggelos K. Katsaggelos received the Diploma
degree in electrical and mechanical engineering
from the Aristotelian University, Thessaloniki,
Greece, in 1979, and the M.S. and Ph.D. degrees
from the Georgia Institute of Technology, Atlanta,
in 1981 and 1985, respectively, both in electrical
engineering.
He is currently a Professor of electrical engi-
neering and computer science at Northwestern
University, Evanston, IL, and also the Director of
the Motorola Center for Seamless Communications and a member of the academic affiliate staff, Department of Medicine, Evanston Hospital.
He is the Editor of Digital Image Restoration (New York, Springer,
1991), Coauthor of Rate-Distortion Based Video Compression (Kluwer,
Norwell, 1997), and Coeditor of Recovery Techniques for Image and
Video Compression and Transmission (Kluwer, Norwell, 1998). Also, he
is Coinventor of ten international patents.
Dr. Katsaggelos is a member of the Publication Board of the IEEE
PROCEEDINGS and has served as the Editor-in-Chief of the IEEE Signal
Processing Magazine (1997–2002). He has been a recipient of the IEEE
Third Millennium Medal (2000), the IEEE Signal Processing Society
Meritorious Service Award (2001), and an IEEE Signal Processing Society
Best Paper Award (2001).