
A simplified audiovisual fusion model with application to large-vocabulary recognition of French Canadian speech


L. Gagnon, S. Foucher, F. Laliberté, and G. Boulianne∗

A new, simple and practical way of fusing audio and visual information to enhance audiovisual automatic speech recognition within the framework of an application of large-vocabulary speech recognition of French Canadian speech is presented, and the experimental methodology is described in detail. The visual information about mouth shape is extracted off-line using a cascade of weak classifiers and a Kalman filter, and is combined with the large-vocabulary speech recognition system of the Centre de Recherche Informatique de Montréal. The visual classification is performed by a pair-wise kernel-based linear discriminant analysis (KLDA) applied on a principal component analysis (PCA) subspace, followed by a binary combination and voting algorithm on 35 French phonetic classes. Three fusion approaches are compared: (1) standard low-level feature-based fusion, (2) decision-based fusion within the framework of the transferable belief model (an interpretation of the Dempster-Shafer evidential theory), and (3) a combination of (1) and (2). For decision-based fusion, the audio information is considered to be a precise Bayesian source, while the visual information is considered an imprecise evidential source. This treatment ensures that the visual information does not significantly degrade the audio information in situations where the audio performs well (e.g., a controlled noise-free environment). Results show significant improvement in the word error rate to a level comparable to that of more sophisticated systems. To the authors' knowledge, this work is the first to address large-vocabulary audiovisual recognition of French Canadian speech and decision-based audiovisual fusion within the transferable belief model.


Keywords: audiovisual speech recognition; Dempster-Shafer theory; mouth tracking; multimodal fusion

I Introduction

The aim of this paper is to explore audiovisual fusion approaches for the recognition of French Canadian speech within a large-vocabulary application, that is, multi-speaker reading of French-language television news. The technical goal is to identify and implement a practical mouth shape descriptor and to feed the audio-only speech recognition system of the Centre de Recherche Informatique de Montréal (CRIM) in order to maintain good word-recognition performance when audio acquisition conditions are poor (SNR as low as 10 dB in our tests).

Although audiovisual automatic speech recognition (AV-ASR) is not a new topic, investigations of practical systems for large-vocabulary and multi-speaker speech recognition are relatively recent [1]–[4]. Most of the work done so far in AV-ASR has been conducted on data of short duration and, in most cases, has been limited to a small number of speakers (mostly fewer than ten) and to small-vocabulary tasks like recognizing nonsense words, connected letters, closed-set sentences and small-vocabulary continuous speech reading (see [1] and references therein for a detailed review of AV-ASR research). However, as expressed in [1], "if the visual modality is to become a viable component in real-world AV-ASR systems, research work is required on larger vocabulary tasks, developing speech reading systems on data of sizable duration and of large subject populations." A first attempt towards this goal was made in 2000, when a speaker-independent AV-ASR system for large-vocabulary continuous speech recognition was developed for English speech [5]. In fact, most AV databases are recorded in English. Few French AV databases are available [1], and, to our knowledge, no AV database for French Canadian speech recognition has previously existed.

∗L. Gagnon, S. Foucher, F. Laliberté, and G. Boulianne are with the Research and Development Department, Centre de Recherche Informatique de Montréal (CRIM), 550 Sherbrooke West, Suite 100, Montreal, P.Q., Canada H3A 1B9. E-mail: [email protected].

AV-ASR systems consist of three main parts [6]: (1) a visual front-end (i.e., mouth detection and tracking and visual characteristics extractor), (2) an audiovisual fusion strategy, and (3) speech recognition. Many techniques are being investigated for efficient head/mouth tracking in the context of AV-ASR. One of the most popular is based on facial key-point detection and template matching. This approach provides good performance in a controlled environment [7]. Here, we use a neural network approach [8]–[9] instead, in order to achieve greater robustness to pose and illumination changes. The speech recognition component is provided by CRIM's proven speech recognition system, which is based on hidden Markov models (HMMs), Gaussian mixtures and the N-gram language model [10].

AV-ASR fusion strategies can be split into two main categories: (1) feature-based fusion (i.e., stack-vector fusion), where visual features are simply concatenated with the audio features, and (2) decision-based fusion, where fusion is performed at a higher level [1]–[4]. Most works use linear representations and a strict Bayesian framework. Here, we explore decision-based fusion using the Dempster-Shafer evidential theory [11] on phoneme decision probability as well as a combination of feature- and decision-based approaches.

Decision-based fusion seems to be an appropriate choice for AV-ASR applications [1]. By nature, the visual evidence is highly imprecise compared to the audio evidence, mainly because most of the articulators involved in speech production are not visible (tongue body, velum, glottis). Visual evidence is also impaired by the coarser sampling rate (usually three audio frames for one visual frame) and variations in the head pose and in mouth appearance. The Dempster-Shafer theory offers a very powerful framework for fusion of heterogeneous data. It is being applied to various data fusion problems (for instance, see [12]–[14] and references therein). In particular, an extension of the evidential theory, called the transferable belief model (TBM), offers an even more flexible framework [12]. Recently, [14] has proposed using Dempster-Shafer theory to fuse decisions from an imprecise information source (modelled by an evidential mass function) with a precise source (modelled by a Bayesian probability). This approach has the advantage of providing a Bayesian mass function that can be further handled by a standard Bayesian algorithm. Following this idea, we propose to model the visual information by an evidential mass function while the audio information remains a Bayesian mass function. We also exploit a few concepts present in the TBM framework, such as the ballooning extension and the conjunctive rule of combination [12].

The paper is organized as follows. Section II describes the French Canadian AV database specifically acquired for the project. Section III gives a description of the work done, along with a justification of the technical choices regarding the head/mouth tracker, the mouth shape features, the visual and audio classifiers, as well as the stack- and decision-based fusion approaches. Performance results for phoneme classification and word-recognition rate are given in Section IV, and Section V describes a newly developed experimental tool. Finally, we conclude with a discussion about the advantages and limitations of our work and possible future improvements. All the mathematical details have been put in the Appendix for text clarity.

II Database

The AV corpus was acquired at CRIM and consists of television news texts read by 26 native French-speaking Quebecers. A total of 740 distinct utterances were read, providing a total of 4.5 hours of full-face frontal video data. The raw audio data was acquired at 48 kHz, linearly quantized to 16 bits, and further downsampled to 16 kHz (a 0.0625 ms sample period). Audio characteristics for speech recognition were extracted every 10 ms. The SNR is around 26 dB (NIST SNR). Images are in RGB colour with a size of 380 × 540 pixels and were acquired at a rate of 29.97 Hz (a single video frame lasts 33.367 ms). There are thus 3.3367 frames of audio characteristics for each video frame. The video characteristics were linearly upsampled every 10 ms in order to be synchronous with the audio characteristics.
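As a concrete illustration of this synchronization step, a minimal NumPy sketch of the linear upsampling is given below (array and function names are illustrative; the paper does not specify its implementation):

```python
import numpy as np

def upsample_visual_features(vis_feats, video_fps=29.97, audio_step=0.010):
    """Linearly interpolate per-frame visual features onto the 10 ms audio grid.

    vis_feats: array of shape (n_video_frames, feat_dim), one row per video frame.
    Returns an array with one row per 10 ms audio frame, covering the same span.
    """
    n_video = vis_feats.shape[0]
    t_video = np.arange(n_video) / video_fps                  # video frame timestamps (s)
    t_audio = np.arange(0.0, t_video[-1] + 1e-9, audio_step)  # 10 ms audio frame times
    # Interpolate each feature dimension independently.
    return np.stack(
        [np.interp(t_audio, t_video, vis_feats[:, d]) for d in range(vis_feats.shape[1])],
        axis=1,
    )
```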

For each subject, the data was split into two sets: 80% for training and 20% for the test. A subset of the test set, limited to six subjects (2514 words), was also used as a development set. Training was done in a multi-speaker framework, i.e., the same subjects were used in training and testing (although training and testing utterances were distinct). Noisy versions (19 dB, 14 dB and 10 dB) of the same corpus were generated by adding artificial speech babble (taken from the NOISEX noise database [15]) and pink noise to the original audio waveforms. Noise levels were measured with the NIST speech quality evaluation tools [16]. The exact speaker speeches were stored in audio transcription files, which contain the phonetic transcription of what was read, with starting time and ending time for each phoneme to an accuracy of 0.0625 ms. These files constitute the data ground truth. Also, additional videos consisting of documentary films, news bulletins and parliamentary debates were used to test our mouth tracker under variable visual conditions.

Figure 1: Block diagram of the main processing steps used in this study.

Figure 2: Block diagram of the head/mouth detection and tracking approach implemented for this study.

III System description

Fig. 1 shows the processing steps of the data. After detecting the mouth region of interest (ROI) on all frames of the dataset, we perform principal component analysis (PCA). Samples are used to carry out kernel-based linear discriminant analysis (KLDA) [17] on 35 French phonetic classes, which are assumed to be Gaussian. The discriminating analysis is of binary type (one-against-one), i.e., each discriminating function is built on the basis of a binary classification problem. During the test phase, classification of the PCA coefficients into one of the 35 classes is made via a binary-decision round-robin algorithm [18]. Finally, the original Bayesian audio likelihoods are merged with the visual evidence using a conjunctive TBM rule [12]. Two different mass functions for the visual information were tested: Bayesian and simple consonance. The following provides more details regarding the main processing steps.

III.A Head/mouth tracking

The face/mouth detection and tracking approach is based on an improved version of the Viola and Jones object detection/tracking algorithm, implemented in the OpenCV library [19], which is based on cascades of boosted classifiers [8]–[9]. The term cascade refers to the fact that the resulting classifier is composed of several simple classifiers that are applied in a hierarchical way. The term boosted refers to the fact that the classifiers, at each stage, are composed of basic classifiers using one of the following four boosting techniques: discrete AdaBoost, real AdaBoost, gentle AdaBoost and LogitBoost [8]–[9]. The entries of the basic classifiers are structures similar to Haar structures (Fig. 2). Two classifiers in cascade are used to locate the area of the mouth in the lower half of the face: one specialized for speakers with beards, and the other for those without. The mouth tracking is initiated only if the mouth is detected in several successive images. During the tracking, the updated mouth ROI is estimated using a linear Kalman filter [20]. In order to stabilize the ROI, post-processing is applied, which consists of (1) a linear interpolation that fills holes in the trajectory caused by missed detections, (2) a median filter to eliminate bad detections, and (3) a Gaussian filter to remove the fast trajectory oscillations.

Figure 3: Examples of head/mouth detection on various video types: documentary films (top), news bulletins (middle) and parliamentary debates (bottom).

The original detected mouth ROI is 64 × 64 pixels. This image is cropped to 40 × 54 pixels and then resized to 10 × 16 pixels. A grayscale vector of dimension 160 pixels then serves as the input for mouth shape characterization. Fig. 3 shows examples of face and mouth detection obtained using a few frame samples from documentary films, news bulletins and parliamentary debates. The large rectangle and the small square in each example are, respectively, the face and mouth ROI. The mouth ROI is detected for various mouth sizes and poses, as well as various image resolutions, lighting conditions and occlusions.

Figure 4: Examples of images of the first 36 PCA components (top) and energy content in each component (bottom).
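A hedged sketch of the detection and ROI preparation described in Section III.A, using OpenCV's stock Haar cascades: the mouth cascade file, the crop offsets and the use of the first detection are stand-ins (the paper's own bearded/non-bearded mouth cascades, Kalman smoothing and trajectory post-processing are not reproduced here).

```python
import cv2
import numpy as np

# Assumptions: OpenCV ships a frontal-face cascade; "mouth_cascade.xml" is a
# hypothetical stand-in for the two in-house mouth cascades used in the paper.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_cascade = cv2.CascadeClassifier("mouth_cascade.xml")

def extract_mouth_vector(frame_bgr):
    """Detect a face, search its lower half for the mouth, and return the
    160-dimensional grayscale vector used for mouth shape characterization."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    lower_half = gray[y + h // 2 : y + h, x : x + w]          # mouth is searched here
    mouths = mouth_cascade.detectMultiScale(lower_half, scaleFactor=1.1, minNeighbors=5)
    if len(mouths) == 0:
        return None
    mx, my, mw, mh = mouths[0]
    roi = cv2.resize(lower_half[my : my + mh, mx : mx + mw], (64, 64))
    roi = roi[5:59, 12:52]                 # crop 64x64 to 54 rows x 40 columns (offsets illustrative)
    roi = cv2.resize(roi, (10, 16))        # resize to 10 x 16 pixels
    return roi.astype(np.float32).ravel()  # 160-dimensional feature vector
```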

III.B Mouth shape features

The output of the mouth detector is a 160-pixel vector. A standard PCA subspace projection keeps the first 56 vector components, which account for 95% of the total variance (energy). In order to reduce the impact of lighting conditions, each feature vector was centred on the mean feature vector of each speaker. Fig. 4 shows an example of mouth ROI projection on the first 36 PCA components for one of the 26 speakers.
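A minimal sketch of this step, assuming a matrix of 160-dimensional mouth vectors and per-frame speaker labels (scikit-learn is an assumption; the paper does not name its PCA implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_mouth_vectors(mouth_vectors, speaker_ids, n_components=56):
    """Per-speaker mean centring followed by a 56-component PCA projection.

    mouth_vectors: (n_frames, 160) array of grayscale mouth ROI vectors.
    speaker_ids:   (n_frames,) array identifying the speaker of each frame.
    """
    centred = mouth_vectors.astype(np.float64).copy()
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        centred[mask] -= centred[mask].mean(axis=0)   # subtract the speaker's mean mouth
    pca = PCA(n_components=n_components)
    coeffs = pca.fit_transform(centred)               # (n_frames, 56) PCA coefficients
    # With 56 components this should retain roughly 95% of the variance, per the paper.
    print("variance retained:", pca.explained_variance_ratio_.sum())
    return coeffs, pca
```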

III.B.1 Visual feature analysis

Visual feature analysis in the AV-ASR context aims at finding and grouping mouth shape patterns. This can be done in relation to or apart from audio information (phonemes or sets of phonemes). Visual mouth shape classification is difficult because of the high variability in the visual content. In addition to variations due to the different phonemes, intra-speaker variability, pose variations, tracking errors and pronunciation variations make the problem extremely complex and often database- and speaker-dependent. It is thus important to perform this type of data analysis on our dataset.


Figure 5: Value plot of the PCA components (vertical axis) for all the visual frames containing only the phoneme @, for two different speakers: before (top) and after (bottom) the three normalization approaches.

Figure 6: Mean mouths reconstructed from the first 56 PCA components for each phoneme.

We note that an audio characteristics frame (10 ms) almost always represents only one phoneme (the minimum duration of a phoneme being about 30 ms); on the other hand, a video frame (33.367 ms) can often cover the end of one phoneme and the beginning of another. The result is ambiguity about the association between the mouth shape and the pronounced phoneme. This is a standard issue in all AV-ASR systems. Pure video frames are those covering the duration of only one phoneme, as described in the transcription file. It is important to identify these pure video frames prior to any statistical data analysis regarding mouth shape and audio characteristics association.
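A small sketch of how such pure frames can be identified from the transcription files (function and argument names are illustrative):

```python
def find_pure_frames(phoneme_intervals, n_frames, frame_dur=1 / 29.97):
    """Return (frame_index, phoneme) pairs for video frames fully covered by one phoneme.

    phoneme_intervals: list of (start_s, end_s, phoneme) taken from the transcription files.
    """
    pure = []
    for i in range(n_frames):
        t0, t1 = i * frame_dur, (i + 1) * frame_dur
        covering = [p for (s, e, p) in phoneme_intervals if s <= t0 and e >= t1]
        if len(covering) == 1:          # the frame lies entirely inside one phoneme
            pure.append((i, covering[0]))
    return pure
```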

III.B.1.a Inter-speaker variability: The first statistical analysis was related to the question of the variability of the visual information between speakers for a given phoneme. The raw PCA visual information for the characterization of a phoneme is very nonlinear. The PCA coefficients encode not only information about the mouth shape of a given speaker, but also the variability between speakers. Fig. 5 illustrates the phenomenon for two speakers and a given phoneme. One can reduce this inter-speaker variability (without, however, totally eliminating it, as we will see) by normalizing the input images. Three normalization approaches have been tested: (1) subtraction of the mean mouth image for each speaker, (2) subtraction of the mean mouth image plus variance normalization, and (3) subtraction of the mean mouth image, signal whitening and covariance normalization (the so-called Cholesky decomposition). Fig. 5 illustrates the effect of these three methods with the corresponding Bhattacharyya distance; a small distance corresponds to low variability between the speakers. The first method offers the best compromise between performance and implementation complexity. In fact, little improvement is obtained when the covariance matrix is normalized. The Bhattacharyya distances are still very high even after normalization, which indicates that speakers still form separable clusters within the same phoneme. Fig. 6 shows the average, over all speakers, of the reconstructed mouths for each phoneme. One sees the mouth shape variation according to the pronounced phoneme.

Figure 7: Jeffries-Matusita distance matrix (top) with a threshold at 1.78 (bottom) on the PCA components. Each black square represents a non-separable pair of phonemes (<1.78).

Even after correcting for the mean mouth, we find that other measures show a significant residual variability between the speakers for each phoneme. We measured this residual variability from the covariance matrices of the PCA coefficients for each speaker and for each phoneme. It turns out that the matrices are quite different (if the speakers are not separable, these matrices should be very similar for a given phoneme). The conclusion is that the speakers are thus separable at the raw PCA data level, making a speaker-independent visual classification task more complex. We therefore decided to adopt a speaker-dependent approach.

Figure 8: Example of mean mouth variance for one speaker and phoneme @ for the five K-means clusters that are separable according to the Bhattacharyya distance.

III.B.1.b Visemes and phonemes: The French language is made up of 37 phonemes, including "silence." It is generally accepted that, physiologically, several of these phonemes correspond to roughly the same mouth shape, called a viseme. Several viseme classifications exist (see [21] and [22], for instance). Although there seems to be general agreement for the consonants, the phoneme groupings for vowels are less obvious because they are very accent-dependent. Table 1 gives an example of a phoneme grouping we have developed from in-house experience with French-speaking Quebecers.

Visemes are the visual equivalent of phonemes. It thus seems natural to base visual clustering analysis of mouth shapes on the physiological viseme concept. We therefore investigated whether pure video frames could indeed be grouped according to physiological viseme classes by measuring phonetic class separability of the PCA components with respect to the Jeffries-Matusita distance.

Several tests yielded between six and 18 groupings among the phonetic classes. Fig. 7 shows an example of phonetic class separability. None of the tests provided results similar to standard physiological viseme classifications ([21]–[22]). Intra-class separability (spreading of a class) was good, but inter-class separability (distance between the classes) was poor in some cases. In particular, it was always possible to group the phoneme @ with the viseme class d, n, t. This suggests non-homogeneity of the basic phonetic classes and the need to investigate more carefully with another grouping procedure.
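For reference, a standard Gaussian-class formulation of the Bhattacharyya and Jeffries-Matusita distances is sketched below (the paper does not detail its estimator; under this formulation, phoneme pairs whose JM distance falls below the 1.78 threshold of Fig. 7 would be flagged as non-separable):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian classes."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

def jeffries_matusita(mu1, cov1, mu2, cov2):
    """Jeffries-Matusita distance, bounded in [0, 2]; values near 2 indicate full separability."""
    return 2.0 * (1.0 - np.exp(-bhattacharyya(mu1, cov1, mu2, cov2)))
```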

III.B.1.c Phoneme cluster tendency: Cluster tendency analysis aims at finding clustering structures within data. We analyzed the results of a K-means clustering on the PCA components for each speaker and each phoneme and checked whether the K-means classes were separable according to the Bhattacharyya distance. For instance, five groups were found within the phonetic class @ for speaker #19, and all these classes were found to be separable. This result indicated that the phoneme @ (and most probably the others) is not homogeneous, as is visually apparent for speaker #19 in Fig. 8. This observation raised the concept of contextual phonemes (or triphones). To clarify this, we identified all the triphones for a given pure frame in the dataset, extracted the number of occurrences of all triphones, and checked whether the triphones having the largest number of occurrences have a tendency to form separable groups according to Jeffries-Matusita distances. We found that this was indeed the case. For instance, for the mean mouth shapes of speaker #19 and for phoneme @, the triphones occurring most frequently were t @ sil, k @ sil, s @ sil, r @ sil and l @ sil. All these triphones were found to be separable among themselves and even with respect to any group formed by all the other triphones.

Table 1: In-house physiological viseme classification for French Quebecers' pronunciation

Vowels
 1   a                    patte
 2   i                    lit
 3   y, u, x, o, o∼, h    uni, cours, feu, idiot, long, fruit
 4   e, E, e∼             blé, fait, pain
 5   O                    comme
 6   œ, œ∼                fleur, brun
 7   a∼                   banc
 8   @                    table

Consonants
 9   p, b, m              pont, bleu, moule
10   t, d, n              tube, doux, non
11   k, g                 cage, goutte
12   f, v                 faux, vrai
13   s, z                 son, zéro
14   S, Z                 chat, jus
15   l                    le
16   w                    voir (oui, ouatte)
17   j                    hier (fille)
18   r                    rond
19   G, N∼                signe, parking
20   sil                  silence

The results of the above analysis confirm that mouth shape, for a given phoneme, is largely variable and strongly depends on the context. Single phonemes are not atomic representations of mouth shapes and are heavily dependent on the context (e.g., the mouth shapes in pronouncing the phoneme p in the French words pain and pont are different). Also, physiological viseme classes do not seem to correlate well with the raw-data cluster tendency analysis. Thus, contextual phonemes are certainly more appropriate to use in practical AV-ASR systems. However, this approach is not practical in our setting: it would require training a visual classifier for tens of thousands of triphones (for the French language). We prefer to explore a new decision-based fusion approach that is a kind of tradeoff between the physiological viseme and the triphone approaches. As we will see, the fusion algorithm compensates for the potential limitation of the phoneme-based approach by adapting the confidence level of the visually based phoneme decision according to the context.

III.C Binary kernel-based visual classifier

Kernel-based learning techniques are still an intensive research topic [17]. They are powerful tools for separating nonlinear patterns using distance embedding techniques and are efficient in describing high-dimensional nonlinear manifolds. The kernel function in KLDA allows for nonlinear extensions of the linear feature extraction methods. The input vectors are mapped into a new higher-dimensional feature space in which the linear methods are applied.

Once the mapping function is applied, a standard LDA criterion is maximized to produce a discriminant function with a kernel function that allows the construction of a kernel matrix (the so-called Gram matrix). One major drawback of kernel-based learning is the size of the Gram matrix, which increases with the number of training samples. Therefore, training on large datasets is prohibitive. In particular, multi-class KLDA with a large number of classes (as in our case) requires a large number of samples per class. Thus, solving the generalized eigenvalue problem required by LDA is usually too computationally expensive. In this study, we chose to train a set of pair-wise binary classifiers (one-against-one) instead of dealing directly with the multi-class problem (Fig. 9). Each pair-wise classifier trained on a class pair produces a discriminant function. A binary Gaussian classifier was then trained on this discriminant function. Once the pair-wise classification was performed, we recombined the individual binary decisions to form a likelihood vector on the original set of visual classes. A simple voting algorithm (round-robin algorithm [18]) was used to build a histogram for each decision. In the case of a tie, the prior phoneme probabilities were used.

The one-against-one approach was adopted in preference to the one-against-all approach because of its simplicity. Similar borders were given for all 595 possible pair combinations for the 35 French phonetic classes. A total of (35 × 34)/2 = 595 combinations of binary classifiers were thus analyzed for each subject. We did not consider the phonemes G (as in the French word signe) and N∼ (as in parking) in this work because of the small number of representatives in the dataset. (Note that these phonemes are more common in English.) In operational mode, a mouth ROI is assigned 595 decisions corresponding to the 595 discriminating distances: a decision for each pair of classifiers. The class that receives the largest number of votes is retained as the most probable class. The membership probabilities of the 34 remaining classes are sorted according to the number of assignments.

For instance, Fig. 9 illustrates a simple classification problem. The dotted line is the discriminating border between classes 1 and 2. A feature vector located near the centre of cluster 3 would have the following assignations for the 15 possible pairs of binary classifiers: 1-2→2, 1-3→3, 1-4→1, 1-5→1, 1-6→1, 2-3→3, 2-4→2, 2-5→2, 2-6→2, 3-4→3, 3-5→3, 3-6→3, 4-5→4, 4-6→4, 5-6→6. The number of assignations for each class would be 1:3, 2:4, 3:5, 4:2, 5:0 and 6:1. The most probable class would then be 3, followed by 2, 1, 4, 6 and 5.
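A minimal sketch of this voting step is given below; `pairwise_decision` is a hypothetical stand-in for the trained binary KLDA/Gaussian classifiers, and `priors` for the prior phoneme probabilities used to break ties.

```python
from itertools import combinations

def round_robin_vote(feature, classes, pairwise_decision, priors):
    """One-against-one voting: each binary classifier votes for one class of its pair;
    the vote histogram orders the classes, with ties broken by the prior probabilities."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        winner = pairwise_decision(feature, a, b)   # returns either a or b
        votes[winner] += 1
    # Sort by vote count first, then by prior phoneme probability for ties.
    ranking = sorted(classes, key=lambda c: (votes[c], priors[c]), reverse=True)
    return ranking, votes
```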

III.D Audiovisual classifier

In the acoustic front-end of the speech recognition system, each 10 ms audio-data characteristic vector is composed of 10 mel-frequency cepstral coefficients with their first and second time derivatives, constituting a 30-dimensional feature vector [10]. Cepstral mean subtraction (CMS) was applied on a per-utterance basis.
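A comparable 30-dimensional acoustic front-end can be sketched with librosa (an assumption; the paper's own front-end [10] is not specified beyond the description above):

```python
import librosa
import numpy as np

def acoustic_features(wav_path):
    """10 MFCCs plus first and second deltas at a 10 ms frame step, with per-utterance CMS."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10, hop_length=160)  # 160 samples = 10 ms
    mfcc -= mfcc.mean(axis=1, keepdims=True)        # cepstral mean subtraction per utterance
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T              # (n_frames, 30) feature matrix
```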

The baseline acoustic models were speaker- and gender-independent continuous three-state HMMs created from scratch with the standard Hidden Markov Model Toolkit (HTK), using only the training set. Decision-tree clustering was used to obtain 4000 cross-word triphone models sharing 1466 state distributions, each distribution being a mixture of eight Gaussians with diagonal covariances.

The language model was a 3-gram back-off model with Kneser-Ney smoothing [23], trained on a general corpus of 150 million words from French-language Quebec newspapers, interpolated with another 3-gram model trained on 13 million words from broadcasters' archives. Entropy pruning was applied to yield a final model containing 836 000 probabilities.

The vocabulary was composed of 20 000 words (with case preserved) selected from the language model training text using NHK-weighted vocabulary selection.

III.E Fusion approaches

III.E.1 Stack fusion

Stack fusion consists simply of concatenating the audio and visual low-level characteristics vectors. In our experiments, the super-vectors were generated every 10 ms and were composed of 20 audio coefficients (10 cepstral coefficients and 10 first derivatives), plus the 10 best video coefficients of the KLDA (10 best visually detected phonemes). The training was then done on these "super-vectors" with the audio classifier, as if they were pure audio vectors. For the stack-based fusion approach, acoustic models were produced from scratch, following the same procedure described above but replacing the feature vectors with stack vectors and adjusting decision-tree thresholds to get 1448 state distributions.
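The stack vector itself is a simple concatenation at the 10 ms frame step. The sketch below assumes that per-frame audio features and upsampled visual phoneme scores are already available; keeping the 10 largest visual scores per frame is our reading of "the 10 best video coefficients of the KLDA."

```python
import numpy as np

def stack_fusion_vectors(audio_feats, visual_scores):
    """Build 30-dimensional stack vectors: 20 audio coefficients (MFCCs + first deltas)
    concatenated with the 10 best per-frame visual phoneme scores from the KLDA stage.

    audio_feats:   (n_frames, 30) MFCC + delta + delta-delta matrix.
    visual_scores: (n_frames, 35) per-phoneme visual scores, upsampled to 10 ms.
    """
    audio_part = audio_feats[:, :20]                          # drop the second derivatives
    top10 = np.sort(visual_scores, axis=1)[:, -10:][:, ::-1]  # 10 best visual scores per frame
    return np.hstack([audio_part, top10])                     # (n_frames, 30) super-vectors
```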

III.E.2 Decision-based fusion

Decision-based fusion consists of combining the independent phonetic decisions coming from the audio and video classifiers. The principal advantages of decisional fusion are that it (1) avoids training from characteristics vectors of large dimension (which in general requires much more data for training without guaranteeing improvement in classification performance), (2) improves the audio classification performance only if it has a low reliability, and (3) models the ignorance through the use of the Dempster-Shafer theory. When the environment is noise-free, the audio source (and the classifier) provides accurate and reliable information on the recognized phonemes. The video source provides less accurate and reliable information because of the ambiguity about the phoneme associated with the mouth shape and the fact that context is not taken into account. However, in a very noisy environment, the audio source becomes less accurate and reliable than the video source, since the latter is not affected by the noisy environment. The concern when combining the decisions coming from these two sources of information is to ensure that the less reliable one does not affect the performance of the more reliable one.

Figure 9: Sketch representation of the one-against-one pair-wise classification approach.

The audio source is typically a Bayesian source because it has a strong potential for discriminating between phonemes. One can then consider that the decision hypothesis is at the phoneme level (i.e., singleton hypothesis). The video source is of "evidential" type because it does not have as great a capacity to discriminate between phonemes and is therefore highly imprecise. The decision hypothesis of the video source can group several phonemes (i.e., non-singleton hypothesis). The Dempster-Shafer evidential theory provides a statistical framework for the combination of these two types of hypotheses. Fundamentally, it is a question of assigning a degree of confidence (i.e., a mass function) to the various audio and video decisions and combining them. Since the audio source is Bayesian, the mass function in this case is the recognition probability for each phoneme (i.e., the likelihood). On the other hand, for the evidential video source, many possibilities exist for the mass function (see Appendix for mathematical details). We tested one possibility, the simple consonance mass functions.

Simple consonance consists of ordering the phonemes according to their decreasing probability, P(1), P(2), . . ., and assigning the mass m(1) = 1 − P(2)/P(1) to the phoneme of rank "1" having the highest probability, P(1). The remaining mass, P(2)/P(1), is assigned to the full set of phonemes (representing ignorance), so that the masses sum to 1 (see (A-17) in the Appendix). In this way, if P(1) = P(2), the ignorance is maximum, i.e., the imprecision of the decision is maximum. On the other hand, if P(2) = 0, the confidence is maximum. If the video source is treated as a Bayesian source, the mass function becomes the standard likelihood.
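A sketch of the resulting per-phoneme score, following (A-15), (A-17) and (A-18) in the Appendix, where the visual contribution enters through its plausibility; the weight values below are placeholders (the paper tunes the weights on the development set), and the handling of excluded phonemes (G, N∼) follows the ballooning extension.

```python
import numpy as np

def visual_plausibility_simple_consonance(video_probs, excluded=()):
    """Per-phoneme plausibility under a simple consonance visual mass function:
    1 for the top-ranked phoneme (and for phonemes the visual classifier does not
    cover, e.g. G and N~), and P(2)/P(1) for all the others (cf. (A-17)-(A-18))."""
    pl = np.empty_like(video_probs, dtype=float)
    order = np.argsort(video_probs)[::-1]
    p1, p2 = video_probs[order[0]], video_probs[order[1]]
    pl[:] = p2 / p1 if p1 > 0 else 1.0
    pl[order[0]] = 1.0
    for i in excluded:                  # phonemes outside the visual hypothesis set
        pl[i] = 1.0
    return pl

def fused_score(audio_likelihoods, video_probs, w_audio=1.0, w_video=0.2, excluded=()):
    """Score(i) = audio_likelihood(i)^w_audio * visual_plausibility(i)^w_video."""
    pl = visual_plausibility_simple_consonance(video_probs, excluded)
    return audio_likelihoods ** w_audio * pl ** w_video
```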

The combination (fusion) of the probabilities of the phonetic decisions of the audio and video sources for a phoneme i was made according to the relation Score(i) = (Audio likelihood(i))^(Audio weight) × (Video mass(i))^(Visual weight) (see Appendix for mathematical details on deriving this relation). A video phoneme with a large mass will thus reinforce the audio mass for this phoneme with respect to the others. When the visual information is completely unreliable (same mass for all phonemes), the video contribution vanishes (i.e., no audio likelihood is reinforced). The weights are calculated empirically by minimizing the word-recognition error rate on the development subset of the original dataset. Finally, during Viterbi decoding with speech-only models, the posterior forward probability computed for each arc expansion is recorded for every frame. Then a phone-level probability is estimated by merging the triphone model probabilities belonging to the same phoneme, using the Viterbi approximation instead of a summation. This operation yields the forward probability Audio likelihood(i) for the best path, starting from the initial state at the first frame and ending somewhere inside phoneme i at the current frame.

Figure 10: Average WER on test data for babble noise (top) and pink noise (bottom).

IV Results

Performance results were measured in terms of word error rate (WER) after fusion with the audio component. The WER is calculated using the standard formula WER(%) = 100 · (S + D + I)/N, where S is the number of substituted words, D is the number of deleted words, I is the number of inserted words, and N is the total number of words. Fusion results on the test set for the clean data (SNR = 26 dB) and up to three levels of babble and pink noise (19 dB, 14 dB and 10 dB) are given in Fig. 10.

In addition to the audio-only recognition, four types of fusion approaches were tested: (1) decision-based fusion of simple consonance type, (2) decision-based fusion of Bayesian type, (3) stack-based fusion with KLDA components, and (4) a combination of stack- and decision-based fusion. Fig. 10 clearly shows a tendency to reinforce word recognition with visual information when the noise level is high. The best results (for the two highest noise levels) are obtained with the last approach, followed closely by stack-based fusion. One observes that these two approaches are less powerful than the audio recognition for noise-free data. We recall that the 10 components of the second derivative of the audio vector were replaced by the 10 video components to limit the vector dimension to 30, while pure audio recognition results include these 10 second-order components. One concludes that the 10 visual components are less relevant than the 10 second derivatives in low-noise conditions. On the other hand, pure decision-based fusion improves the speech recognition rate by a few percent in a clean environment. Fusion allows a reduction of the WER from 82% to 67% for the worst noise conditions, or an equivalent gain of 4 dB in the SNR. Similar behaviours were obtained for pink noise (Fig. 10). Moreover, the fusion performances are even better at 14 dB than in the case of babble noise (a reduction of the WER from 88% to 65%, or a 5 dB gain in SNR).

It is always difficult to compare AV-ASR system performances against other published results because they are not tested on the same dataset. One fair comparison point for us is the work done by IBM, which has developed an AV-ASR system for large-vocabulary English recognition, trained and tested against the ViaVoice dataset [1]. This very large corpus consists of full-face frontal video of 290 subjects, uttering continuous read speech with mostly verbalized punctuation, dictation-style. High-quality wideband audio was collected synchronously with the video at a rate of 16 kHz and in a relatively clean audio environment (quiet office with some background computer noise), resulting in a 19.5 dB SNR. The duration of the entire database is approximately 50 hours, and it contains 24 325 transcribed utterances with a 10 403-word vocabulary. Except for the dataset size and language and the noise level of the clean audio, our dataset has approximately the same characteristics. Furthermore, as was done in our tests, the ViaVoice dataset was artificially degraded with babble noise at 9.5 dB. Various fusion approaches have also been tested by the IBM group. It would take too long to describe them here, but one can mention that all these approaches are based on the same fusion paradigms (stack, decisional and hybrid) and use a similar speech recognition system based on HMM, Gaussian mixtures and an N-gram language model.

The best WER obtained by the IBM group after audio-only recognition at a noise level of about 10 dB is 48% (compared to 81% for us); after audiovisual contextual fusion, IBM achieved a WER of 35% (compared to 65% for us). This represents an improvement of (48 − 35)/48 = 27% for IBM, compared to an improvement of (81 − 65)/81 = 20% in our work. This result is quite encouraging, given that (1) the IBM speech recognition system is much more sophisticated (larger models trained with many hundreds of hours of data, more complex data preprocessing using maximum-likelihood linear transformation (MLT) and heteroscedastic LDA (HLDA), and mismatch correction techniques between the clean and noisy data), and (2) our work uses a rather simple decision-based fusion approach with no contextual visual information (although the original audio noise in our case is of better quality).

Finally, Table 2 gives a concrete example of utterance output for babble noise at 10 dB, obtained without fusion and after the stack- and decision-based fusion approaches. The ground-truth utterance is Hier on avait quatorze points d'écarts chez les francophones donc les gains que le PQ a fait au cours des derniers jours sont annulés et aujourd'hui il y a que dix points d'écart. The boldface words in Table 2 are those correctly recognized. Although the best output sentence is not totally meaningful, the improvement in WER is significant.

V Experimental tool

In order to facilitate the analysis as well as further work, we have implemented an experimental tool for the data manipulation and fusion tests (Fig. 11). The tool has been integrated as a plug-in for the open-source video editing tool VirtualDub [24]. The plug-in interface allows the user to adjust all face and mouth tracking parameters, along with the display, file reading and saving options. The plug-in inputs are (1) the video files, (2) the phoneme transcription file used for performance measures, and (3) the off-line visual classification phoneme files. The visual phoneme classification can also be done automatically using trained off-line classifiers.

Table 2: Example of utterance output for babble noise at 10 dB

Audio only (WER 86%): D'hier on avait quatorze ans invités à se défendre dans l'intérêt des afin de réaliser son vin aujourd'hui laquelle nous idéale.

Stack-based (WER 71%): Vingt six ans avait hâte à chez les francophones que compte la figure des derniers jours une aiguille et aujourd'hui il est une pour idéal.

Stack- and decision-based (WER 55%): Il garde avait hâte à chez les francophones et de honte le PQ a fait au cours des derniers jours une aiguille et aujourd'hui il est une pour idéal.

The output information is displayed in VirtualDub's main output window. The displayed information includes (1) the exact phonemes in the transcription file (first timeline at the top of the window), (2) the visually recognized phonemes (second timeline), (3) the detected mouth region (or the reconstructed mouth from the 56 PCA components), (4) the phoneme detection rate (PDR) calculated against the transcription audio file, and (5) the face/mouth detection regions (large and small squares respectively). Two vertical lines crossing the timelines at the top of the output window represent the current video frame. One interesting feature of this tool is that it allows the user to visually compare the mouth shape with the ground truth and visually detected phonemes. In particular, this feature is useful for visually identifying "pure" video frames, i.e., frames that contain only one phoneme and that have been used to train the visual classifier.

VI Conclusion

We reported results of a study aimed at exploring the potential of a new decision-based audiovisual fusion method for large-vocabulary speech recognition. Three fusion approaches were tested: (1) low-level feature-based vector fusion (i.e., stack-vector-based fusion), (2) phoneme decision fusion (i.e., decision-based fusion), and (3) a combination of (1) and (2). Performance results for mouth detection and tracking were measured on four different video datasets: in-house AV data of news bulletins, documentary films, news reports and videos of parliamentary debates. The AV-ASR dataset consists of French-language television news bulletins read by 26 subjects, totalling 4.5 hours of full-face frontal video data acquired with a digital camera.

On the fusion side, we explored the potential of the Dempster-Shafer and TBM frameworks. The evidential framework, particularly the TBM, offers the following advantages: (1) the possibility of extending mass functions to a larger set of hypotheses (ballooning extension), (2) the manipulation of non-singleton hypotheses, and (3) the modelling of total imprecision. However, the design of the most appropriate mass function for this application is still an open problem. Here, we chose a statistical evidence framework initially proposed by [25], mainly because it produces simple and efficient functions that are easily combinable. More work is needed to identify the best choice of the mass functions.

Performance results on word recognition were satisfactory, given the fact that we used a simple fusion approach and no contextual visual information. A combination of stack- and decision-based fusion can reduce the WER from 80% (audio only) to 65% (audiovisual fusion) for babble noise and from 85% to 62% for pink noise at 10 dB. These improvements are comparable to other published results [1]–[7]. We are thus very confident that our result can be improved by considering, for instance, more refined (possibly context-dependent) subphonetic classification procedures and/or other fusion schemes. However, a potential limitation is the pair-wise classification approach, which might be difficult to extend to a larger number of triphone classes because of the quadratic increase in the number of binary classifiers. Kernel-based learning might also be difficult to implement in practice because of the computational cost associated with the kernel matrix and the resulting discriminant function.

In summary, our work can be positioned with respect to others as follows. First, to our knowledge, it is the first to address audiovisual recognition of French Canadian speech. Second, it deals with a large-vocabulary application, which is still an open research issue for any language. Third, it explores a new decision-based fusion within the framework of the transferable belief model. Fourth, the project has led to the acquisition of a large audiovisual database of French Canadian readings. Finally, it targets practical results by comparing simple but efficient fusion approaches that have potential for future real-time implementation.

Acknowledgements

This work has been supported in part by CANARIE Inc. through the Advanced Research in Interactive Media (ARIM) program, by the Ministère du Développement Économique de la Recherche et de l'Exportation du Gouvernement du Québec, and by the Natural Sciences and Engineering Research Council of Canada (NSERC). The text of the news bulletins was provided by the French-language television network TVA.

Appendix

We present here the mathematical background for the Dempster-Shafer evidential theory and the TBM used in this work for the decision-based fusion approach. The TBM is an interpretation of the Dempster-Shafer evidential theory [11]. It was originally proposed by [12] in order to compensate for some of the shortcomings of the evidential theory. The main mathematical concepts used here come from set and probability theory.

A Background and definitions

The core element of the TBM is the basic belief assignment (or mass) function $m(\cdot)$. The mass function assigns a belief to the subsets of the set $\Omega = \{\omega_i\}_{i=1}^{M}$, constituted by $M$ mutually exclusive hypotheses $\omega_i$. Based on the available evidence (the facts) $E$, the mass function is defined by

$$m^{\Omega}[E](\cdot) : 2^{\Omega} \to [0,1], \quad \text{with} \quad \sum_{B \subseteq \Omega} m^{\Omega}[E](B) = 1, \tag{A-1}$$

where $2^{\Omega}$ denotes the set of all subsets of $\Omega$ (the so-called power set). In the Dempster-Shafer theory, an additional normalization is imposed to ensure that the null hypothesis has a null belief ($m(\emptyset) = 0$). The TBM does not require this.

We call the focal set (denoted by $F$) the set of subsets of $\Omega$ having a non-null mass (the focal elements), i.e., $F = \{A \subseteq \Omega \mid m^{\Omega}[E](A) > 0\}$.


Figure 11: Snapshots of the test environment: main display window (top) and configuration window (bottom).

The belief function $\mathrm{bel}(\cdot)$ is defined as

$$\mathrm{bel} : 2^{\Omega} \to [0,1], \quad \text{such that} \quad \mathrm{bel}(A) = \sum_{\emptyset \neq B \subseteq A} m^{\Omega}[E](B), \quad \forall A \subseteq \Omega. \tag{A-2}$$

The degree of belief $\mathrm{bel}(A)$ represents the degree of justified (i.e., $B$ supports $A$; thus $B \subseteq A$) and specific (i.e., not free to support any other hypothesis; in other words, $B$ does not support $\bar{A}$, and thus $B \not\subseteq \bar{A}$) support for hypothesis $A$. The belief function plays the same role as a probability function in probability models.

The plausibility function $\mathrm{pl}(A)$ represents the degree of support that could be assigned to $A$, but could also support another subset:

$$\mathrm{pl} : 2^{\Omega} \to [0,1], \quad \text{such that} \quad \mathrm{pl}(A) = \sum_{B \cap A \neq \emptyset} m^{\Omega}[E](B), \quad \forall A \subseteq \Omega. \tag{A-3}$$

The two quantities $\mathrm{bel}(A)$ and $\mathrm{pl}(A)$ are often interpreted as the lower and upper bounds of an unknown probability measure $P$ on $A$. In addition, the difference $\mathrm{pl}(A) - \mathrm{bel}(A)$ is an indicator of the degree of knowledge imprecision on $P(A)$.

A.1 Conjunctive combination

When two distinct pieces of evidence, $E_1$ and $E_2$, are available, we can combine the associated mass functions to form a new one. Assuming that the two sources are fully reliable, we can derive a new mass function as

$$m^{\Omega}[E_1, E_2](A) = \big( m^{\Omega}[E_1] \otimes m^{\Omega}[E_2] \big)(A) = \sum_{B_1 \cap B_2 = A} m^{\Omega}[E_1](B_1)\, m^{\Omega}[E_2](B_2). \tag{A-4}$$

A.2 Ballooning extension

The ballooning extension is a useful concept when belief is available on a subset $\Omega'$ of the full set of hypotheses $\Omega$. This happens, for instance, when beliefs are built on a limited set and one discovers afterwards that some alternatives were not considered. The least committed mass function on $\Omega$, denoted by $m^{\Omega' \Uparrow \Omega}(\cdot)$, is derived from the partial mass function as

$$m^{\Omega' \Uparrow \Omega}(A) = \begin{cases} m^{\Omega'}(A') & \text{if } A' \subseteq \Omega',\ A = A' \cup \bar{\Omega}', \\ 0 & \text{otherwise}, \end{cases} \tag{A-5}$$

where $\bar{\Omega}'$ is the complementary set of $\Omega'$ on $\Omega$. In particular, this new extended mass function produces a new plausibility function:

$$\mathrm{pl}^{\Omega' \Uparrow \Omega}(A) = \begin{cases} \mathrm{pl}^{\Omega'}(A) & \text{if } A \subseteq \Omega', \\ 1 & \text{if } A \subseteq \bar{\Omega}'. \end{cases} \tag{A-6}$$

We use this principle to extend the visual beliefs on the phoneme subset $\Omega_v = \Omega \setminus \{G, N\!\sim\}$ to the full set of phonemes $\Omega$ ($\Omega_v \subseteq \Omega$). This is useful because the G and N∼ phonemes are rare in French and we do not have enough samples in our training set to properly learn them.

A.3 Mass function constructionWithin the TBM framework, [12] proposed the generalized Bayesiantheorem (GBT). However, the computational cost associated with theGBT is too high for our application because of the combinatory na-ture of the GBT (it requires the calculation of the mass function onthe entire power set). Therefore, we choose to use a method proposedby Shafer, called statistical evidence theory, which simply constructsmass functions from observed likelihoods [11]. This theory has beengeneralized by [25] when Ω is partitioned into τ subsets with the defi-nition of the partially consonant belief. In fact, a belief function bel(A)is defined as partially consonant on Ω if it is defined as consonant ona partition Ω = ∪k=1,...,τVk. The associated belief function has thefollowing representation:

\[
\mathrm{bel}^{\Omega}[E](A) =
\frac{\displaystyle \sum_{k=1}^{\tau} \Bigl( \max_{\omega_i \in V_k} p(x \mid \omega_i) \;-\; \max_{\omega_i \in \overline{A} \cap V_k} p(x \mid \omega_i) \Bigr)}
     {\displaystyle \sum_{k=1}^{\tau} \max_{\omega_i \in V_k} p(x \mid \omega_i)}. \tag{A-7}
\]

Depending on the type of partition of Ω, (A-7) leads to different mass functions. The two limit cases are as follows:

1. τ = 1, in which case we obtain the simple consonant mass function proposed by Shafer:

\[
\mathrm{bel}^{\Omega}[E](\omega_j) = 1 - \frac{\max_{\omega_i \in \overline{\omega_j}} p(x \mid \omega_i)}{\max_{\omega_i \in \Omega} p(x \mid \omega_i)}, \quad \forall \omega_j \in \Omega; \tag{A-8}
\]

2. τ = |Ω| = M, in which case we obtain the Bayesian belief function

\[
\mathrm{bel}^{\Omega}[E](\omega_j) = \frac{p(x \mid \omega_j)}{\sum_{k=1}^{M} p(x \mid \omega_k)}, \quad \forall \omega_j \in \Omega. \tag{A-9}
\]
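The two limit cases can be computed directly from the observed likelihoods. The sketch below is our own illustration of (A-8) and (A-9); the hypothesis names and likelihood values are invented.

```python
def bayesian_bel(like, w):
    """tau = |Omega|: Bayesian belief (A-9), i.e. the normalized likelihood."""
    return like[w] / sum(like.values())

def consonant_bel(like, w):
    """tau = 1: Shafer's simple consonant belief (A-8)."""
    best_other = max(p for v, p in like.items() if v != w)
    return 1.0 - best_other / max(like.values())

like = {"a": 0.6, "i": 0.3, "u": 0.1}   # toy likelihoods p(x | omega_i)
print(bayesian_bel(like, "a"))          # 0.6
print(consonant_bel(like, "a"))         # 1 - 0.3/0.6 = 0.5
print(consonant_bel(like, "i"))         # 1 - 0.6/0.6 = 0.0 (only the ML hypothesis gets support)
```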

B Application to AV-ASR

In the literature [1]–[3], the HMM state-dependent emission probability of an audiovisual observation vector is represented by a weighted product of the audio and visual likelihoods for each frame t and context-dependent HMM state c:

\[
P(o_{av,t} \mid c) = P(o_{a,t} \mid c)^{\lambda_{a,c,t}} \, P(o_{v,t} \mid c)^{\lambda_{v,c,t}}, \quad \forall c \in C, \tag{A-10}
\]

where λ_{a,c,t} (λ_{v,c,t}) is the reliability factor and o_{a,t} (o_{v,t}) is the observed low-level feature vector for the audio (visual) source. The non-negative reliability factors control the contribution of each modality. Usually, c is composed of three phonemes (ωi, ωj, ωk) and is modelled by a Markovian method. For the visual information, the likelihood is non-contextual, and

\[
P(o_{v,t} \mid c) = P(o_{v,t} \mid \omega_j) \quad \text{with } c = (\omega_i, \omega_j, \omega_k), \quad \forall \omega_i, \omega_k \in \Omega. \tag{A-11}
\]

To estimate the reliability factors, we propose to formulate (A-10) in an evidential framework, i.e., in the form

\[
m^{\Omega}[a, v](c) = \bigl( m^{\Omega}[a] \otimes m^{\Omega}[v] \bigr)(c), \tag{A-12}
\]

where m^Ω[a] (m^Ω[v]) is the mass function associated with the audio (visual) information and m^Ω[a, v] is the combined audiovisual mass. We assume that the audio modality is a precise Bayesian source, so that the audio mass function is the directly observed audio likelihood, i.e.,

\[
m^{\Omega}[a](c) = P(o_{a,t} \mid c). \tag{A-13}
\]

In this case, the combined audiovisual mass function (A-12) becomes

\[
m^{\Omega}[a, v](c) = P(o_{a,t} \mid c) \sum_{A \subseteq \Omega,\; A \cap c = c} m^{\Omega}[v](A) = P(o_{a,t} \mid c)\, \mathrm{pl}^{\Omega}[v](c). \tag{A-14}
\]

In addition, we introduce the reliability coefficients to get

\[
m^{\Omega}[a, v](c) = P(o_{a,t} \mid c)^{\lambda_a} \, \bigl[ \mathrm{pl}^{\Omega}[v](c) \bigr]^{\lambda_v}. \tag{A-15}
\]

Equation (A-15) can be seen as a generalization of (A-10). It is the generic form of the intuitive scoring relation given in Section III.E.2. We now need to express the visual plausibility function for the chosen belief structures. In the following we give the results for the Bayesian, simple consonance, and partially consonant belief structures.
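In practice, a score of the form (A-15) would typically be evaluated in the log domain during decoding. The following sketch is a minimal illustration of the scoring rule only; the reliability values and function names are assumptions, not those of the CRIM system.

```python
import math

def av_log_score(p_audio, pl_visual, lam_a=1.0, lam_v=0.5):
    """log m[a,v](c) = lam_a * log P(o_a | c) + lam_v * log pl[v](c), as in (A-15)."""
    return lam_a * math.log(p_audio) + lam_v * math.log(pl_visual)

# When the visual source is vacuous (pl = 1), the audio score is unchanged,
# which is the behaviour motivating the evidential treatment.
print(av_log_score(1e-3, 1.0))   # equals 1.0 * log(1e-3)
print(av_log_score(1e-3, 0.2))   # penalized when the mouth shape disagrees
```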

We assume that an ordered set of hypotheses Ωv = {ω(1), . . . , ω(M)}, resulting from ordering the hypotheses ωi ∈ Ω according to the observed likelihoods pv(ω(1)) > pv(ω(2)) > · · · > pv(ω(M)), is known.

B.1 Bayesian mass function

In this case the focal set is F = Ωv and the mass function is given by (A-9). Applying the ballooning extension (A-6), we get, for all A ∈ Ω,

\[
\mathrm{pl}^{\Omega}[v](A) =
\begin{cases}
1 & \text{if } A \in \overline{\Omega}_v, \\
m^{\Omega_v}[v](A) & \text{if } A \in \Omega_v.
\end{cases} \tag{A-16}
\]

B.2 Simple consonance mass function

In this case the focal set is F = {ω(1), Ωv} and the visual mass function is given by (A-8):

\[
m^{\Omega_v}[v](A) =
\begin{cases}
1 - \dfrac{p_v(\omega_{(2)})}{p_v(\omega_{(1)})} & \text{if } A = \omega_{(1)}, \\[2ex]
\dfrac{p_v(\omega_{(2)})}{p_v(\omega_{(1)})} & \text{if } A = \Omega_v, \\[2ex]
0 & \text{otherwise}.
\end{cases} \tag{A-17}
\]

After applying the ballooning extension, we obtain, for all A ∈ Ω, the visual plausibility function

\[
\mathrm{pl}^{\Omega}[v](A) =
\begin{cases}
1 & \text{if } A \in \{\omega_{(1)}\} \cup \overline{\Omega}_v, \\
m^{\Omega_v}[v](\Omega_v) & \text{if } A \notin \{\omega_{(1)}\} \cup \overline{\Omega}_v.
\end{cases} \tag{A-18}
\]

B.3 Partially consonant mass function

In this case we assume the partition Ωv = {ω(1)} ∪ \overline{\{\omega_{(1)}\}}, which leads to the visual mass function

\[
m^{\Omega_v}[v](A) =
\begin{cases}
\dfrac{p_v(\omega_{(1)})}{p_v(\omega_{(1)}) + p_v(\omega_{(2)})} & \text{if } A = \omega_{(1)}, \\[2.5ex]
\dfrac{p_v(\omega_{(2)}) - p_v(\omega_{(3)})}{p_v(\omega_{(1)}) + p_v(\omega_{(2)})} & \text{if } A = \omega_{(2)}, \\[2.5ex]
\dfrac{p_v(\omega_{(3)})}{p_v(\omega_{(1)}) + p_v(\omega_{(2)})} & \text{if } A = \Omega_v.
\end{cases} \tag{A-19}
\]


The visual plausibility function is then, for all A ∈ Ω,

\[
\mathrm{pl}^{\Omega}[v](A) =
\begin{cases}
1 & \text{if } A \in \overline{\Omega}_v, \\
1 - m^{\Omega_v}[v](\omega_{(2)}) & \text{if } A = \omega_{(1)}, \\
1 - m^{\Omega_v}[v](\omega_{(1)}) & \text{if } A = \omega_{(2)}, \\
m^{\Omega_v}[v](\Omega_v) & \text{otherwise}.
\end{cases} \tag{A-20}
\]
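To summarize, the plausibility of a single phoneme under the three belief structures can be computed directly from the ordered visual likelihoods. The sketch below is our own condensed illustration of (A-16), (A-18) and (A-20); the function names and toy likelihoods are invented, pv is assumed to hold a likelihood for every phoneme in Ωv, and phonemes outside Ωv (G, N~) receive pl = 1 in all three cases.

```python
def pl_bayesian(pv_sorted, rank):
    """Bayesian structure (A-16): normalized likelihood of the phoneme at `rank`."""
    return pv_sorted[rank] / sum(pv_sorted)

def pl_consonant(pv_sorted, rank):
    """Simple consonance structure (A-18): 1 for the top phoneme, p2/p1 otherwise."""
    p1, p2 = pv_sorted[0], pv_sorted[1]
    return 1.0 if rank == 0 else p2 / p1

def pl_partially_consonant(pv_sorted, rank):
    """Partially consonant structure (A-20), using the masses of (A-19)."""
    p1, p2, p3 = pv_sorted[0], pv_sorted[1], pv_sorted[2]
    z = p1 + p2
    if rank == 0:
        return 1.0 - (p2 - p3) / z   # 1 - m(omega_(2))
    if rank == 1:
        return 1.0 - p1 / z          # 1 - m(omega_(1))
    return p3 / z                    # m(Omega_v)

pv = [0.5, 0.3, 0.2]                 # toy likelihoods, already sorted in decreasing order
print([pl_partially_consonant(pv, r) for r in range(3)])  # [0.875, 0.375, 0.25]
```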

References

[1] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, “Audio-visual automatic speech recognition: An overview,” in Issues in Visual and Audio-Visual Speech Processing, ed. G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Cambridge, Mass.: MIT Press, 2004.

[2] G. Potamianos, C. Neti, G. Gravier, and A. Garg, “Recent advances in the automatic recognition of audio-visual speech,” Proc. IEEE, vol. 91, no. 9, Sept. 2003, pp. 1306–1326.

[3] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade image transform for speaker independent automatic speechreading,” in IEEE Int. Conf. Multimedia and Expo, vol. 2, New York, 2000, pp. 1097–1100.

[4] A.V. Nefian, L. Liang, X. Pi, X. Liu, C. Mao, and K. Murphy, “A coupled HMM for audio-visual speech recognition,” in Proc. Int. Conf. Acoust., Speech, Signal Processing (ICASSP 2002), vol. 2, Orlando, Fla., May 2002, pp. 2013–2016.

[5] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, “Audio-visual speech recognition,” Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, Md., Final Workshop 2000 Report, 2000.

[6] M.E. Hennecke, D.G. Stork, and K.V. Prasad, “Visionary speech: Looking ahead to practical speechreading systems,” in Speechreading by Humans and Machines, ed. D.G. Stork and M.E. Hennecke, Berlin, Germany: Springer, 1996, pp. 331–349.

[7] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant features for lipreading,” in Proc. Int. Conf. Image Processing (ICIP 2001), vol. 3, Thessaloniki, Greece, 2001, pp. 250–253.

[8] P. Viola and M. Jones, “Robust real-time object detection,” Cambridge Research Laboratory, Cambridge, U.K., Tech. Report No. CRL2001/01, 2001.

[9] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proc. IEEE Int. Conf. Image Processing (ICIP 2002), vol. 1, 2002, pp. 900–903.

[10] G. Boulianne, J. Brousseau, P. Ouellet, and P. Dumouchel, “French large vocabulary recognition with cross-word phonology transducers,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP 2000), vol. 3, Istanbul, Turkey, June 5–9, 2000, pp. 1675–1678.

[11] G. Shafer, A Mathematical Theory of Evidence, Princeton, N.J.: Princeton University Press, 1976.

[12] F. Delmotte and P. Smets, “Target identification based on the transferable belief model interpretation of Dempster-Shafer model,” IEEE Trans. Syst., Man, Cybern. A, vol. 34, July 2004, pp. 457–471.

[13] S. Fabre, X. Briottet, and A. Appriou, “Impact of contextual information integration on pixel fusion,” IEEE Trans. Geosci. Remote Sensing, vol. 40, no. 9, 2002, pp. 1997–2010.

[14] A. Bendjebbour, Y. Delignon, L. Fouque, V. Samson, and W. Pieczynski, “Dempster-Shafer fusion in Markov fields context,” IEEE Trans. Geosci. Remote Sensing, vol. 39, no. 8, 2001, pp. 1789–1798.

[15] A. Varga et al., “The Noise-92 study on the effect of additive noise on automatic speech recognition,” DRA Speech Research Unit, Malvern, Worcestershire, U.K., 1999.

[16] National Institute of Standards and Technology (NIST), “Tools,” Gaithersburg, Md.: NIST, Dec. 20, 2007, http://www.nist.gov/speech/tools.

[17] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, New York: Cambridge University Press, 2004.

[18] J. Furnkranz, “Round robin classification,” J. Machine Learning Research, vol. 2, Mar. 2002, pp. 721–747.

[19] SourceForge, Inc., “Open Computer Vision Library,” SourceForge, Inc., 2007, http://sourceforge.net/projects/opencvlibrary.

[20] M.D. Cordea, E.M. Petriu, N.D. Georganas, D.C. Petriu, and T.E. Whalen, “Real-time 2(1/2)-D head pose recovery for model-based video coding,” IEEE Trans. Instrum. Meas., vol. 50, no. 4, 2001, pp. 1007–1013.

[21] C. Benoît, T. Lallouache, T. Mohamadi, and C. Abry, “A set of French visemes for visual speech synthesis,” in Talking Machines: Theories, Models, and Designs, ed. G. Bailly and C. Benoît, Amsterdam: Elsevier, 1992, pp. 485–504.

[22] B. Jutras, J.-P. Gagné, M. Picard, and J. Roy, “Identification visuelle et catégorisation de consonnes en français québécois” (Visual identification and categorization of consonants in Quebec French), Revues d'orthophonie et d'audiologie, vol. 22, no. 2, 1998, pp. 81–87.

[23] R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP 1995), vol. 1, 1995, pp. 181–184.

[24] VirtualDub v1.7.8, www.virtualdub.org.

[25] P. Walley, “Belief function representation of statistical evidence,” The Annals of Statistics, vol. 15, no. 4, 1987, pp. 1439–1465.

Langis Gagnon received a Ph.D. in physics-mathematics from the Université de Montréal, Montreal, Quebec, Canada, in 1988. Until 1995, he was a research officer at the Centre d'Optique, Photonique et Laser de l'Université Laval, Quebec, Quebec, Canada; the Centre de Recherches Mathématiques de l'Université de Montréal; and the Laboratoire de Physique Nucléaire de l'Université de Montréal. From 1995 to 1998, he was a specialized researcher at Lockheed Martin Canada, where he worked on developing radar-image processing tools for aerial surveillance applications. Langis has published close to 150 scientific articles relating to the fields of image processing, object recognition and math-based nonlinear optical modelling. He is a member of SPIE, ACM, IEEE, AIA and IASTED.

Samuel Foucher has an engineering degree with a major in telecommunications and a Ph.D. in radar imaging from CARTEL. He has acquired expertise in image processing, multiresolution encoding techniques (wavelets), data fusion, belief theory and Markovian techniques. From 1999 to 2002, as a research scientist for the India Meteorology Department, he contributed to an industrial project (for the SEPIA company) based on image mining from the Insat-2E satellite. He joined the Centre de Recherche Informatique de Montréal (CRIM), Montreal, Quebec, Canada, in March 2002.

France Laliberté has a Ph.D. in physics from Université Laval, Quebec, Quebec, Canada, in the area of registration, fusion and three-dimensional reconstruction of retinal images. She joined the Centre de Recherche Informatique de Montréal (CRIM), Montreal, Quebec, Canada, in 2003, first as a postdoctoral fellow and then as a researcher.

Gilles Boulianne has over 14 years of experience in the field of speech recognition and speech synthesis. While at INRS Telecommunications, he studied the complexities of very large-vocabulary speech recognition. Since joining the Centre de Recherche Informatique de Montréal (CRIM), Montreal, Quebec, Canada, his areas of focus have included transducer-based search techniques aimed at accelerating system operation and/or increasing the size of the vocabulary of the CRIM speech recognition system.