
IMPACT-2013

Enhancement of VSR Using Low Dimension Visual Feature

Prashant Upadhyaya1, Omar Farooq2, Priyanka Varshney3, Amit Upadhyaya4

Figure 1. Main Concept of Audio Visual Speech Recognition

1 Assistant Professor, Department of Electronics and Communication Engineering, SRI, Datia-475661, India. [email protected]

2 Associate Professor, Department of Electronics Engineering, Aligarh Muslim University, Aligarh-202002, India. [email protected]

3 M.Tech, Department of Electronics Engineering, GLA University, Mathura-281406, India. [email protected]

4 Station Controller, Metro One Operation Private Limited, Mumbai-400053, India. [email protected]

Abstract—This paper presents a study of low dimension visual (LDV) space features and investigates the improvement in audio-visual automatic speech recognition obtained with different sets of visual features. The experiment is divided into three phases: in the first phase, recognition is performed on 12 static DCT features; in the second, on a combination of 6 static and 6 dynamic features; and in the third, on 12 low-dimension DCT features. For this work the Hindi AMUAV (Aligarh Muslim University Audio-Visual) database was developed, with audio sampled at 44.1 kHz and video recorded at 25 frames per second. The Hidden Markov Model Toolkit (HTK) with left-right HMMs was used for recognition, and an overall improvement of 26.04% in word recognition is achieved with LDV space features.

I. INTRODUCTION

In today's technological world, much research attention is devoted to delivering high performance in fields centered on human-machine interaction. At the same time, researchers working in a competitive environment must design cost-effective technology of the highest quality to cater to the needs of industry, Research and Development (R&D) organizations, and academia.

One of the technologies studied during the war years of the 1940s and 1950s was the evaluation of speech performance in noisy environments, which led to the development of audio-visual automatic speech recognition (AVASR) systems [1]. Figure 1 gives a pictorial view of the need for an audio-visual speech recognition system.

In the cutting-edge area of speech recognition applications, it is therefore important to achieve a high standard of performance using robust techniques that mitigate the effect of background noise. Among the various robust techniques used for enhancing speech recognition [1], AVASR has provided the best recognition performance, not only in noisy environments but also in noise-free environments. This is the key reason researchers have given priority to examining how visual features, used along with audio features, improve the recognition rate. The next section reviews the enhancement techniques used for improving speech recognition systems.

II. TECHNIQUES TO ENHANCE AUDIO VISUAL SPEECH RECOGNITION SYSTEM

Sumby and Pollack in 1954 [2] presented the first known work on audio-visual speech processing, in which they reported that humans can tolerate higher noise levels in speech when lip information is used than when it is not; an effective improvement of up to 15 dB in the speech-to-noise ratio was achieved. Sumby and Pollack thus showed that adding visual information to the audio signal improves performance. However, how the visual modality improves the overall ASR system remained unknown.

Figure 2. Speaker samples from the AMUAV database.

McGurk and MacDonald in 1976 [3] gave a clearer view of how the visual modality contributes to the audio signal, demonstrating the bimodal nature of speech via the McGurk effect: when a person watches repeated utterances of the syllable /ga/ with the sound /ba/ dubbed onto the lip movements, the person often perceives neither /ga/ nor /ba/ but instead the sound /da/. This work showed that adding visual information not only improves speech intelligibility but does so by providing complementary information, which is the key motivation behind AVASR. It indicated what role the visual modality plays in providing complementary information to the acoustic channel, but it was not sufficient to help the hearing impaired, who use visual information to increase their speech intelligibility.

Chen [4] identified three core components of a lip-reading system: visual feature extraction, audio feature extraction, and the recognizer/classifier. In his model he used sixteenth-order linear prediction coding (LPC) coefficients, but a limitation of this technique was that the speaker needed to be in front of the camera. Chen compared the performance of an HMM-based speech recognizer with audio-only, audio-visual, and visual-only input. Finally, experiments were carried out by adding additive white Gaussian noise to the audio at SNRs ranging from 32 dB down to about 16 dB. Chen concluded that automatic lip reading could enhance the reliability of speech recognition.

The first actual implementation of an AVASR system was developed by Petajan in 1984 [5]. In this initial system, Petajan extracted simple black-and-white images of a speaker's mouth and took the mouth height, width, perimeter, and area as visual cues, applied together with the acoustic waveform to the recognizer. Petajan's results confirmed that higher recognition rates are obtained with the addition of visual cues, in line with Chen's conclusion.

Luettin and Dupont [6] showed that visual features can be extracted from image sequences by any of several approaches: image-based, geometric-feature-based, visual-motion-based, and model-based. They combined the inner and outer lip contour positions with lip intensity information to form a 24-element visual feature vector, while the audio features were 24 linear prediction coefficients (LPC). The audio and visual features were applied to an HMM for speech recognition. They tested their system on clean speech and reported an error rate of 48% with visual features only and 3.4% with the audio signal only; when both audio and visual features were used, the error dropped to 2.6%.

Potamianos [7] proposed a technique for improving automatic recognition by taking the 24 highest coefficients of a Discrete Cosine Transform (DCT) of a 64x64-pixel region of interest (ROI) containing the speaker's mouth as the visual feature, while 24 cepstral coefficients of the acoustic speech were used as the audio feature. Both features were concatenated into a single feature vector that was then applied to a Hidden Markov Model (HMM). Incorporating the visual information improved the effective SNR by 61% over audio-only processing. A major challenge for developing AVASR for real-time applications, however, was the collection of databases, owing to the lack of a large vocabulary for developing speaker-independent systems. In a major effort to remedy this situation, IBM's Human Language Technologies Department at the T. J. Watson Research Center coordinated a workshop at Johns Hopkins University in Baltimore, USA, in 2000 to collect such a database and to further improve techniques associated with AVASR [8]. As AVASR is a technology highly dependent on data, recent progress in this field has centered on the work conducted by IBM [8], owing to their ability to capture high-quality audio-visual data.

Varshney, Upadhyaya, and Farooq [9] evaluated performance on a Hindi speech database by choosing three viseme classes, selecting thirteen Mel Frequency Cepstral Coefficient (MFCC) audio features and 2-D DCT (Discrete Cosine Transform) visual features. The experiment was performed for phoneme recognition and repeated while varying the visual features; the results showed improved phoneme recognition when a smaller number of visual features is used along with the audio features. That experiment motivated the authors of [9] to investigate how visual features affect AVASR performance. This paper extends that work to word recognition.

III. AMUAV DATABASE

The Aligarh Muslim University Audio Visual (AMUAV) corpus is an audio-visual database designed to meet several criteria. It is a database of continuously spoken sentences with high-quality audio and video of each speaker. Each speaker in the AMUAV corpus recorded 10 sentences, of which 2 sentences are common to all speakers. Recordings were made in realistic conditions for testing robust audio-visual schemes. The database was recorded in the Speech Signal Processing Lab, Aligarh Muslim University, Aligarh.


Figure 4. Method for visual feature extraction.

A large set of sentences was collected and recorded in the lab, chosen to cover all the phonemes that occur in the Hindi language. Hindi was chosen as the basis of this work because it is written as it is spoken and has a larger number of phonemes than English.

Figure 2 shows samples of speakers with different facial characteristics and skin tones from the recorded database. Video was recorded at 640x380 resolution at 25 frames per second (fps) in full colour. Feature vectors for the visual stream are therefore extracted at a rate of 25 vectors per second, in contrast with audio features, which are extracted at a rate of 100 vectors per second; the visual feature vectors must be interpolated to match the audio features. The sound was recorded in 16-bit stereo at 44.1 kHz. During recording, the background conditions were kept almost the same for all speakers. A NOISE-AMUAV database was also recorded with an EDIROL R-09 device using an uncompressed 24-bit/48 kHz recording format. To match the bandwidth of the speech and noise signals, the speech signal was downsampled to 8 kHz, since most of the speech energy is concentrated up to 8 kHz.
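The paper does not specify the interpolation scheme used to match the two feature rates; a minimal sketch, assuming simple linear interpolation of the 25 Hz visual stream up to the 100 Hz audio feature rate (the function name and use of NumPy are illustrative, not from the paper):

```python
import numpy as np

def upsample_visual_features(vis_feats, vis_rate=25.0, audio_rate=100.0):
    """Linearly interpolate visual feature vectors to the audio feature rate.

    vis_feats: array of shape (T_v, d_v), one row per video frame (25 fps).
    Returns an array of shape (T_a, d_v) sampled at audio_rate (100 Hz).
    """
    t_vis = np.arange(vis_feats.shape[0]) / vis_rate         # frame times in seconds
    n_out = int(vis_feats.shape[0] * audio_rate / vis_rate)  # matching number of audio frames
    t_out = np.arange(n_out) / audio_rate
    # Interpolate each feature dimension independently.
    return np.stack([np.interp(t_out, t_vis, vis_feats[:, d])
                     for d in range(vis_feats.shape[1])], axis=1)
```

With 25 fps video and 100 Hz audio features this yields four visual vectors per original video frame, so the two streams can be aligned frame by frame.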

IV. EXPERIMENTAL APPROACH

The experiment was carried out to evaluate the performance of Visual Speech Recognition (VSR) using different combinations of static and dynamic visual features. It was conducted in three phases: in the first phase, recognition is performed on 12 static DCT features; in the second phase, on a combination of 6 static and 6 dynamic features, giving 12 DCT features overall; and in the third phase, on 12 low-dimension DCT features obtained by passing the DCT coefficients through a low-dimension space, as shown in Figure 3.

In total, 100 sentences were used for this research work, corresponding to 1225 words. The experiment was performed in Matlab version 7.12.0.635 (R2011a); the HMM code was executed in Matlab by calling a C library module and served as a benchmark, while HTK version 3.4.1 was selected for comparison. All speech signals are modeled with left-right HMMs, and in order to present rigorous comparisons of visual features, only the visual-only speech recognition (VSR) results are shown here. For visual speech recognition, twelve 2-D DCT coefficients are used as the standard visual feature, i.e., a 12-dimensional feature vector (dv = 12).
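In a left-right topology each state may only remain where it is or advance, which suits the sequential nature of speech. A minimal sketch of such a transition matrix (the number of states and the self-transition probability are assumptions for illustration, not values from the paper):

```python
import numpy as np

def left_right_transitions(n_states=5, stay_prob=0.6):
    """Build a left-right HMM transition matrix.

    Each emitting state may only remain in place (probability stay_prob)
    or advance to the next state; zeros elsewhere enforce the topology.
    The last state is absorbing. Each row sums to 1.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay_prob        # stay in state i
        A[i, i + 1] = 1.0 - stay_prob  # advance to state i + 1
    A[-1, -1] = 1.0
    return A
```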

The visual features are extracted by the algorithm proposed in [9]; the step-by-step procedure is shown in Figure 4. The visual features obtained are applied to the HMM classifier to perform recognition and also to investigate the compatibility of the HTK toolkit [10] with visual features.
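The extraction algorithm itself is given in [9] and Figure 4; as a rough sketch of the usual DCT-based pipeline such methods follow, assuming a grayscale mouth ROI and zig-zag coefficient ordering (both assumptions, not details confirmed by the paper):

```python
import numpy as np
from scipy.fft import dctn

def dct_visual_features(roi, n_coeffs=12):
    """Keep the lowest-frequency 2-D DCT coefficients of a mouth ROI.

    roi: 2-D grayscale array cropped around the speaker's mouth.
    The coefficients are taken in zig-zag order, where most of the
    image energy concentrates, giving an n_coeffs-dimensional feature.
    """
    C = dctn(roi.astype(float), norm='ortho')   # 2-D type-II DCT
    h, w = C.shape
    # Zig-zag scan: visit indices in order of increasing diagonal i + j.
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([C[i, j] for i, j in order[:n_coeffs]])
```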

V. RESULT ANALYSIS

This section is divided into three subsections presenting the experimental results obtained for VSR.

A. Visual-only recognition for clean signal using twelve static visual (SV) features

Visual speech recognition (VSR) is first performed using static visual (SV) features. Table I shows that the percentage of correctly recognized words (C) obtained with SV features is 31.67%. From Table I it can also be observed that HTK produces many deletions (D) and substitutions (S) during visual speech recognition.

TABLE I. VISUAL-ONLY RECOGNITION FOR 12 SV FEATURES

    C       A      H    D    S    I    N
  31.67   29.71   388  415  422   24  1225

where C = Percentage of correctly recognized words, A = Percentage of accuracy, H = Number of words correctly recognized, D = Total number of deletions, S = Total number of substitutions, I = Total number of insertions and N = Total number of evaluated words.
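For reference, these quantities are related by the standard scoring formulas H = N - D - S, C = (H/N) x 100% and A = ((H - I)/N) x 100%. Table I is consistent with these: H = 1225 - 415 - 422 = 388, C = 388/1225 = 31.67% and A = (388 - 24)/1225 = 29.71%; the same relations hold for Tables II and III below.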

B. Visual-only recognition for clean signal using static dynamic visual (SDV) features

Another experiment was performed using a combined set of static and dynamic visual (SDV) features. A percentage correction of 34.69% was obtained, as shown in Table II; compared with the SV features, this is an improvement of 3.02% in percentage correction.

Figure 3. Method applied for Visual Speech Recognition.

Figure 5. Visual features comparative evaluation.

TABLE II. VISUAL-ONLY RECOGNITION FOR 12 SDV FEATURES

    C       A      H    D    S    I    N
  34.69   31.59   425  227  573   38  1225
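The paper does not state how the 6 dynamic features are computed; a minimal sketch, assuming the regression-based delta formula common in speech front-ends (the window size is an assumption):

```python
import numpy as np

def add_delta_features(static, window=2):
    """Append regression-based delta (dynamic) features to static ones.

    static: array of shape (T, d), e.g. d = 6 static DCT coefficients.
    The delta at frame t is the least-squares slope of the coefficients
    over +/- window neighbouring frames; edges are handled by repeating
    the first and last frames. Returns an array of shape (T, 2d).
    """
    T = static.shape[0]
    padded = np.pad(static, ((window, window), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(static, dtype=float)
    for k in range(1, window + 1):
        deltas += k * (padded[window + k:window + k + T]
                       - padded[window - k:window - k + T])
    deltas /= denom
    return np.hstack([static, deltas])  # e.g. 6 static + 6 dynamic
```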

C. Visual-only recognition for clean signal using low dimension visual (LDV) features

Finally, in the last experiment, the static visual features are passed through a low-dimension space projection before being applied to the classifier. Here the low-dimension features are obtained using the Singular Value Decomposition (SVD): the DCT features are projected onto the subspace identified by the SVD. Once the directions of greatest variation have been identified, the best approximation of the original data points can be found using fewer dimensions, so SVD is used here as a data-reduction method. The results obtained with this procedure show a substantial improvement in percentage correction: 57.71%, as shown in Table III. From Table III one can conclude that the LDV features improve the percentage correction by 26.04% over the SV features. Figure 5 shows the comparison graph.
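A minimal sketch of this data-reduction step, assuming the DCT feature vectors are stacked row-wise and mean-centred (details the paper does not specify):

```python
import numpy as np

def svd_reduce(features, k=12):
    """Project feature vectors onto their top-k singular directions.

    features: array of shape (T, d), one DCT feature vector per row.
    The SVD identifies the directions of greatest variation in the
    data; keeping only the k strongest gives the best rank-k
    approximation, so this acts as a data-reduction step.
    """
    mean = features.mean(axis=0)
    X = features - mean                            # centre the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:k]                                 # top-k right singular vectors
    return X @ basis.T, basis, mean
```

A new feature vector x would then be reduced the same way, as (x - mean) @ basis.T, before being passed to the classifier.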

TABLE III. VISUAL-ONLY RECOGNITION FOR LDV FEATURES

    C       A      H    D    S    I    N
  57.71   56.00   707  183  335   21  1225

VI. CONCLUSION

From the results in Table I one can conclude that HTK is built around acoustic modeling, and the visual features used here are high-dimensional, resulting in inadequate modeling with HTK. Hence Table I shows that high-dimension features give a poor recognition rate when modeled with HTK. On the other hand, Table II shows a slight increase in recognition rate when dynamic features are added to the SV features: HTK's performance improves slightly when dynamic features are used along with static features. It can therefore be concluded that adding dynamic features improves the recognition rate compared to static features alone.

Finally, Table III shows an improvement of 26.04% in percentage correction with the LDV features. It can therefore be concluded from the visual-only results that HTK gives better performance when the visual features are projected into low-dimension spaces, since HTK as a tool is designed for acoustic features that are represented by low-dimension vectors such as MFCCs.

REFERENCES

[1] Q. Summerfield, "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by Eye: The Psychology of Lip-Reading, pp. 3–51, London, United Kingdom: Lawrence Erlbaum Associates, 1987.

[2] W. Sumby and I. Pollack, "Visual contribution to speech intelligibility," Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.

[3] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, December 1976.

[4] T. Chen, "Audio Visual Speech Processing," IEEE Signal Processing Magazine, pp. 9-21, January 2001.

[5] E. Petajan, “Automatic lipreading to enhance speech recognition,” in IEEE Global Telecommunications Conference, (Atlanta, GA, USA), pp. 265–272, IEEE, 1984.

[6] J. Luettin and S. Dupont, "Continuous audio-visual speech recognition," in Proc. Fifth European Conference on Computer Vision, Freiburg, Germany, June 1998, vol. 2, pp. 657–673.

[7] G. Potamianos and C. Neti, "Automatic speechreading of impaired speech," in Proceedings of the Audio-Visual Speech Processing Workshop, September 2001.

[8] G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging environments," in Proceedings of the European Conference on Speech Communication and Technology, Geneva, Switzerland, pp. 1293–1296, 2003.

[9] P. Varshney, P. Upadhyaya, and O. Farooq, "Transform based visual features for bimodal recognition of Hindi visemes," International Journal of Electronics and Computer Science Engineering, vol. 1, no. 3, pp. 892–897, 2012.

[10] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2008.
