
AN UNSUPERVISED APPROACH FOR

AUTOMATIC LANGUAGE IDENTIFICATION

by

Tuba İslam

B.S. in Electrical and Electronics Eng., Boğaziçi University, 2000

Submitted to the Institute for Graduate Studies in

Science and Engineering in partial fulfillment of

the requirements for the degree of

Master of Science

Graduate Program in Electrical and Electronics Engineering

Boğaziçi University

2003


ABSTRACT

AN UNSUPERVISED APPROACH FOR AUTOMATIC LANGUAGE IDENTIFICATION

Today, the need for multi-language communication applications, which can serve people from different nations in their native languages, has gained increasing importance. Automatic Language Identification plays a significant role in the pre-processing phase of multi-language systems. Conventional systems require a difficult and time-consuming process of labeling the phoneme boundaries of the utterances in the speech corpus. In this work, we propose an unsupervised method for building an automatic language identification system that requires neither a labeled speech database nor linguistic information about the target languages. The method comprises two branches processed in parallel: a nearest neighbor selection method using language-dependent Gaussian mixtures, and a mono-lingual phoneme recognition method using language-dependent network files. The performance of the system is compared with previous studies; a decrease of 24.3 per cent is observed in the worst case and an increase of 13.9 per cent in the best case. With the proposed method, a robust system with a tolerable performance is built, into which any language can easily be integrated.


ÖZET

OTOMATİK DİL TANIMADA GÖZETİMSİZ YAKLAŞIM

Today, the need for systems that can communicate in different languages in order to serve people in their own languages is steadily increasing. Automatic Language Identification serves as a pre-processing step in multi-language systems. In conventional systems, before the models are trained, the speech data goes through a laborious and time-consuming labeling stage in which the phoneme boundaries of the utterances are determined. The method followed in this thesis comprises an unsupervised approach to an automatic language identification system developed without the need for a labeled database or linguistic knowledge of the languages. This method consists of two separate branches operating in parallel: a nearest neighbor mixture selection method realized with language-dependent Gaussian mixtures, and a mono-lingual phoneme recognition method realized with language-dependent network structures. When the obtained results are compared with previous studies, the performance is observed to decrease by 24.3 per cent in the worst case and to increase by 13.9 per cent in the best case. Using the proposed method, a robust language identification system with an acceptable performance has been developed.


TABLE OF CONTENTS

ABSTRACT .......................................................................................................................ii

ÖZET...................................................................................................................................iii

LIST OF FIGURES.........................................................................................................vi

LIST OF TABLES.........................................................................................................viii

LIST OF ABBREVIATIONS.......................................................................................ix

1. INTRODUCTION ...................................................................................................1

1.1. Motivation..............................................................................................................1

1.2. Applications ...........................................................................................................2

1.3. Language Discrimination Basics ...........................................................................2

1.4. Previous Research..................................................................................................7

1.4.1. LID Using Spectral Content ......................................................................8

1.4.2. LID Using Prosody ....................................................................................9

1.4.3. LID Using Phone-Recognition ................................................................10

1.4.4. LID Using Word-Recognition .................................................................11

1.4.5. LID Using Continuous Speech Recognition............................................11

2. THEORETICAL BACKGROUND ..................................................................13

2.1. Speech Representation.........................................................................................13

2.2. Vector Quantization.............................................................................................15

2.3. Hidden Markov Model Topology ........................................................................17

3. BASE SYSTEM FOR LANGUAGE IDENTIFICATION .........................23

3.1. OGI Multi-Language Database............................................................................23

3.2. Feature Extraction................................................................................................24

3.3. Gaussian Mixture Modeling using Vector Quantization .....................................25

4. PROPOSED SYSTEM.........................................................................................28

4.1. Language-Dependent Gaussian Mixtures............................................................28

4.1.1. Uni-gram and Bi-gram Modeling ............................................................28

4.1.2. Language Normalization..........................................................................30

4.2. Unsupervised Language Modeling Using Mono-Lingual Phoneme Recognizer 31


4.3. Superposition of Two Methods............................................................................33

5. CONCLUSIONS....................................................................................................38

APPENDIX A: LIST OF OGI TEST FILES ...........................................................40

APPENDIX B: MATLAB FILE FOR COMPUTING MFCC............................41

APPENDIX C: C FUNCTIONS FOR GMM TRAINING AND NNMS ........42

APPENDIX D: HTK COMMANDS FOR HMM GENERATION...................44

APPENDIX E: NET FILE OF HTK FOR PHONEME RECOGNITION.......45

REFERENCES ................................................................................................................46


LIST OF FIGURES

Figure 1.1. An experimental perspective of spoken language...............................................3

Figure 1.2. Phoneme error rates of some languages ..............................................................5

Figure 1.3. Perceptual language identification results...........................................................7

Figure 2.1. Filter bank for computing MFCCs ....................................................................14

Figure 2.2. Representation of a mixture with M-components .............................................18

Figure 2.3. Matrix visualization of Viterbi algorithm .........................................................22

Figure 3.1. Distributions of c1 vs. c0 coefficients of mixtures for some languages .............24

Figure 3.2. Distance search algorithm .................................................................................26

Figure 3.3. Mean and variance of target-language ranks with a plain NNMS system ........27

Figure 4.1. Mean and variance values of target-language ranks for different u values.......29

Figure 4.2. Mean and variance values of target-language ranks for different b values.......29

Figure 4.3. Mean and variance values of target-language ranks for different n values.......30

Figure 4.4. Five-state left-to-right HMM architecture.........................................................31

Figure 4.5. Mean and variance values of language ranks with phoneme recognizer ..........33

Figure 4.6. The superposition of two methods ....................................................................33


Figure 4.7. The algorithm for merging the two results........................................................34


LIST OF TABLES

Table 1.1. Phoneme categories of Turkish with examples of words .....................................4

Table 1.2. Phoneme categories of English with examples of words .....................................4

Table 3.1. Number of speakers in OGI database .................................................................23

Table 3.2. Effect of feature normalization on LID for baseline system ..............................25

Table 3.3. Mean values of language ranks obtained by plain NNMS .................................27

Table 4.1. Mean values of language ranks obtained by phoneme recognizer .....................32

Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1 ..................................34

Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1 ............................35

Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1 .........36

Table 4.5. Percentage of correct identification for six-language test ..................................36

Table 4.6. Percentage of correct identification for three-language test ...............................37

Table 4.7. Percentage of correctness for pair-wise test of English......................................37


LIST OF ABBREVIATIONS

FFT Fast Fourier Transform

GMM Gaussian Mixture Model

HMM Hidden Markov Model

Hz Hertz

LID Language Identification

LD Language Dependent

LI Language Independent

LPCC Linear Predictive Cepstrum Coefficients

MFCC Mel Frequency Cepstrum Coefficient

ML Mono Lingual

NNMS Nearest Neighbor Mixture Selection

OGI-TS Oregon Graduate Institute-Telephone Speech

PRLM-P Phoneme Recognition followed by Language Modeling - Parallel

TURTEL Turkish Telephone Speech Corpora

VQ Vector Quantization


1. INTRODUCTION

1.1. Motivation

The necessity for multilingual capabilities grows with the development of worldwide communication. Speaking different languages will remain an obstacle until either multi-lingual large vocabulary continuous speech recognition or automatic language identification systems reach excellent performance and reliability. Automatically identifying a language from the acoustics alone, without understanding the language, is a challenging problem. A multilingual person has no problem identifying the languages he or she understands; word spotting is the basic method the brain follows during this process. In a machine, however, it is not easy to model the words of different languages and to build a successful language identification system, since doing so needs a large amount of labeled speech data and linguistic information about the target languages.

In all speech processing applications, one major restriction on achieving better performance is the limited amount of data. The accuracy of the system relies on the quality and the variety of the database. Another restriction is the need for labeled speech corpora for the languages under test; labeling raw speech material and other linguistic data takes great effort and time. Systems using multiple large vocabulary continuous speech recognizers give the best results. These systems include a complete word recognizer for each language and use word and sentence level language modeling. To build such a system, a large amount of labeled speech is necessary to train the recognizers, and large amounts of written text are needed to train language models of word n-grams. A simpler but successful approach is parallel language-dependent phone recognition followed by language modeling, but since it is based on multiple language-specific phone recognizers, it also requires labeled speech to train those recognizers.

Our motivation in this thesis is to search for methods of building a language identification system that requires neither linguistic information nor labeled speech corpora of the target languages. The system will not depend heavily on pre-processed data, and therefore there will be no difficulty in adding a new language to the application.

1.2. Applications

The purpose of a language identification application includes automatically adapting a speech-based tool, such as online banking or information retrieval, to the native language of the user. With the growth of the Internet, we now live in a worldwide society communicating and doing business with people who use a wide variety of languages, which makes language identification more important each day. Multilingual environments may have a political, military, scientific, commercial or tourist context (Adda-Decker, 2000).

Just a few of the many settings where a language identifier may be useful are natural language processing systems, information retrieval systems, speech mining applications, speech file filtering systems, software translation services, knowledge management systems, and anywhere else more than one language must be handled.

1.3. Language Discrimination Basics

Humans and machines can use many different attributes to distinguish one language

from another. There are some essential cues for understanding a spoken language. The

most accurate way is to catch some of the spoken words. Detecting phones that are not common in most languages, or focusing on the intonation and stress, also helps but is not sufficient by itself. The basic perspective of a spoken language is presented in Figure 1.1

(Greenberg, 2001).

(Diagram: levels of spoken language, from the acoustic modulation spectrum through phonetic features (voicing, rounding, place and manner of articulation), segments, syllables, words, prosody and the lexicon, up to syntax, morphology and understanding, with characteristic time scales ranging from roughly 40 ms to 1000 ms.)

Figure 1.1. An experimental perspective of spoken language

One way of representing speech sounds is by using phonemes. A “phoneme” is an abstract representation of a phonological unit in a language. Across the world’s languages,

the sound inventory varies considerably. The size of the phoneme inventory used for

speech recognition can be 29 phonemes as it is in Turkish or 46 phonemes as it is in

Portuguese. Formally, we can define the phoneme as a linguistic unit such that, if one

phoneme is substituted for another in a word, the meaning of that word could change. This

is only true for a set of phonemes in one language. Therefore in a single language, a finite

set of phonemes exists. However, when different languages are compared, there are

differences; for example, in Turkish, /l/ and /r/ (as in "laf" and "raf") are two different

phonemes, whereas in Japanese, they are not (Ladefoged, 1962). Similarly, the presence of

individual sounds, such as the "clicks" found in some sub-Saharan African languages, or the velar fricatives found in Arabic, attracts the attention of listeners fluent in languages that do not contain these phonemes. Still, as the vocal apparatus used in the production of languages is universal, phoneme sets mostly overlap and the total number of phonemes is finite (Ladefoged, 1962). The Turkish phonemes, subdivided into groups based on the way they are produced, are given in Table 1.1.

Table 1.1. Phoneme categories of Turkish with examples of words

Vowels: Semivowels: Fricatives: Nasals: Plosives: Affricates: kim rey sar mal bul can gül lale şal nal del çam kel yer far gir göl lala hep pul çal zor ter yıl dağ kaç kul ver gem bol jüri

Different from the Turkish phoneme structure, there are many diphthongs in some

other languages like English and German. The classification of English phonemes is given

with examples of words in Table 1.2.

Table 1.2. Phoneme categories of English with examples of words

Vowels: Diphthongs: Semivowels: Fricatives: Nasals: Plosives: Affricates: heed bay was sail am bat jaw hid by ran ship an disc chore head bow lot funnel sang goat had bough yacht thick pool hard beer hull tap hod doer zoo kite hoard boar azure hood boy that who'd bear valve hut heard the

The vowel systems also differ from one language to another. In the study of Pellegrino et al. in 1999, the phonemic differences based on vowels were taken into account in language identification. Five languages (Spanish, Japanese, Korean, French and Vietnamese) were chosen for the evaluation of the LID system because of their phonologically different vowel systems. The Spanish and Japanese vowel systems, for example, are rather simple, as they include only five vowels. The Korean and French systems, on the other hand, are quite complex, and they make use of secondary articulations (long vs. short vowel opposition in Korean and nasalization in French).

The phoneme error rate of a language correlates with the number of phonemes used to model that language. In the study of Schultz (2001), the acoustic confusability of languages, obtained with phoneme-based recognizers, is given in Figure 1.2. The phoneme error rates range from 33.8 per cent to 46.4 per cent. Turkish is an exception in this result because of the high substitution rate between the vowels "e", "i" and "y".

Figure 1.2. Phoneme error rates of some languages as an example for acoustic

confusability

It is also possible to distinguish between speech sounds depending on the way they are produced. The speech units in this case are known as phones. A "phone" is a realization of an acoustic-phonetic unit or segment; it is the actual sound produced when a speaker intends to utter a phoneme. Phone and phoneme sets differ from one language to another, even though many languages share a common subset of phones and phonemes (Schultz, 2001). Phone and phoneme frequencies of occurrence may also differ from one language to another. Phonotactics, the rules governing the allowed sequences of phones and phonemes, also differ in most cases.


There are more phones than phonemes, as some of them are produced in different ways depending on the context. For example, the pronunciation of the phoneme /l/ differs slightly when it occurs before consonants and at the end of utterances, as in "salı" (Tuesday) and in "kalk" (wake up). As they are both different forms of the same phoneme, they form a set of allophones. Any machine-based speech recognizer would need to be aware of the existence of allophone sets. The morphology, i.e. the word roots and lexicons, also usually differs from one language to another. Each language has its own vocabulary and its own way of forming words.

The stress, rhythm and intonation of speech are the prosodic features. Duration of

phonemes, pitch characteristics, and stress patterns differ from one language to another.

Stress is used at two different levels: it indicates the most important words in a sentence and the prominent syllables within words, which may change the meaning completely. As an example, the English word "object" could be understood as either a noun or a verb,

depending on whether the stress is placed on the first or second syllable. Intonation, or

pitch movement, is very important in indicating the meaning of an English sentence. In

tonal languages, such as Mandarin and Vietnamese, the intonation determines the meaning

of individual words as well.

The syntax, i.e. the sentence patterns, also differs among languages. Although the same word may be shared by two languages, such as "hat" in German and Turkish or "ten" in English and Turkish, the neighboring words in the sentence and also the suffixes or prefixes attached to the word will be different.

The perceptual confusion of different languages was examined by Muthusamy in 1994, and the responses of the subjects are shown in Figure 1.3.

(Bar chart: average subject performance, in per cent, for the languages EN, FA, FR, GE, JA, KO, MA, SP, TA and VI, shown separately for the first quarter and the last quarter of the test.)

Figure 1.3. Perceptual language identification results

The difference between the first quarter and the last quarter, which denote the beginning and the end of the test respectively, shows the effect of the subjects' learning during the test, caused by the feedback given after each response. Muthusamy implies that Korean, having the lowest score in human perception, is confused more often with Farsi, Japanese, Mandarin, Tamil and Vietnamese.

1.4. Previous Research

Since the 1970s, research has focused on automatic language identification from speech. Systems implemented to date vary mainly according to their methods for modeling languages. There are two phases of language identification: the "training phase" and the "recognition phase".

During the “training phase” the base system is presented with examples of speech

from a variety of languages. Each training speech utterance is converted into a stream of

feature vectors that are computed from short windows of the speech waveform (e.g. 20ms).

The windowed speech waveform is assumed stationary in some manner. The feature

vectors are recomputed with a pre-defined step size (e.g. 10ms, 50 per cent overlapping)


and contain cepstral information about the speech signal. The training algorithm analyzes a

sequence of such vectors and produces a model for each language.

During the “recognition phase” of LID, feature vectors computed from the new

utterance are compared to each of the language-dependent models. The likelihood that the

new utterance was spoken in the same language as the speech used to train each model is

computed by a distance measure, and the model with the maximum likelihood is found. The language of the speech that was used to train this model is assigned as the language of the utterance.

More sophisticated language identification systems use phonemes to model speech. During the training phase of the phoneme models, these systems use either a phonetic transcription, i.e. the sequence of symbols representing the spoken sounds, or an orthographic transcription, i.e. the text of the words spoken, in which case a phonemic transcription dictionary is also necessary.

1.4.1. LID Using Spectral Content

In the earliest research on language identification, developers focused on the

differences in spectral content among languages. The basic idea was that different

languages contain different phonemes and phones. A set of short-term spectra is obtained

from the training utterances and these prototypes are compared to the ones obtained from

the test speech.

There are many different options to choose during the implementation of this

approach. The training and test spectra can be used directly or can be used to obtain the

feature vectors for the speech such as the cepstrum coefficients or formant-based vectors.

The training data can be chosen directly from the training utterances or they can be

synthesized by using K-means clustering. The similarity between the sets of training and

test spectra can be calculated by the Euclidean, Mahalanobis, or any other distance metric.

Examples of such language identification systems were proposed and developed by Cimarusti and Ives in 1982, Goodman et al. in 1989 and Sugiyama in 1991.


In order to compute the similarity between languages, most of the early systems

calculated the distance between the test vector and its closest train vector and accumulated

the result as an overall distance. In these systems, the language with the lowest distance is

assigned as the identified language. Later, Gaussian mixture modeling was applied to this

approach by Nakagawa et al. in 1992 and Zissman in 1993. In this case, each vector is

assumed to be generated randomly according to a probability density that is a weighted

sum of multi-variate Gaussian densities (Zissman, 2001). During the training, Gaussian

mixture models for the feature vectors are computed for each language. During the

recognition, the likelihood of the test utterance feature vectors is computed given each of

the language models. The language having the maximum likelihood is proposed as the

identified language. In this approach instead of only one feature vector from the training

set, the whole set of training feature vectors affects the scoring of each test vector, and

therefore this may be called a soft version of vector quantization.

Since vector quantization gives a static classification, language identification systems with Hidden Markov Modeling were implemented in the late 80s in order to model the sequential characteristics of speech. HMM-based language identification systems were first proposed by House and Neuburg in 1977 (Zissman, 2001). In these systems, HMM training was performed on unlabeled training speech, and the system performance was in some cases even worse than that of static classifiers.

Later, a new approach was proposed by Li in 1994 by labeling the vowels of each

speech utterance automatically and computing spectral vectors in the neighborhood of the

vowels. Instead of modeling the feature vectors over all training data, the selected portions

are used. During the recognition, the selected portions of the test data are processed and the language with the maximum likelihood is assigned as the identified language.

1.4.2. LID Using Prosody

Pitch frequency (fundamental frequency) of speech is defined as the frequency at

which the vocal cords vibrate during a voiced sound (Hess, 1983). It is difficult to make a

reliable estimate of the pitch frequency from the speech data since the harmonics of the

side frequencies cause a distortion.


One of the basic and simplest algorithms used depends on the multiple measures of

periodicity in the signal. Fundamental frequency (f0) is usually processed on a logarithmic

scale rather than a linear one in order to match the resolution of human auditory system.

Normally 50 Hz ≤ f0 ≤ 500 Hz for voiced speech. For unvoiced speech f0 is undefined and

by convention, it is zero in log scale.

Since the fundamental frequency implies the characteristics of the speaker, it does not

give global information about the language or the utterance. The slope of the pitch

frequency, however, gives some clues about the prosody and the stress on the utterance,

which might differ from language to language. Humans can also use prosodic information

in order to guess the spoken language (Muthusamy et al., 1994).

Language identification systems depending on prosody alone have also been proposed by Itahashi et al. (1994, 1995), and especially in noisy environments pitch estimation is argued to be more robust than spectral parameters. However, compared to phonetic information, prosody carries little information about the language (Hazen, 1993). Some studies imply that systems with both prosodic and phonetic parameters perform about the same as systems using only phonetic parameters. Therefore, the prosodic information of speech is not considered in this thesis.

1.4.3. LID Using Phone-Recognition

Different languages have different phone distributions and that leads many

researchers to build LID systems that extract the phone sequence of the utterances and

determine the language based on the statistics of that sequence. An example of this

approach is implemented by Lamel, who built two HMM-based phone recognizers for

English and French (Lamel and Gauvain, 1993). He found that the likelihood scores

obtained from language-dependent phone recognizers can be used to distinguish between

the two languages.

In a different approach by Schultz (2001), the language-specific phonemes of N languages are unified into one global set. A target language for which we do not have enough information is modeled by adapting other well-known languages, using the phoneme descriptions of the languages.

As building a system that depends on phone recognition necessitates multi-

language phonetically labeled corpora, it becomes more difficult to include new languages

into the language identification process. This difficulty can be handled by using a phone

recognizer for a single language and obtaining the phonetic distributions for other

languages. Hazen and Zue (1993) and Zissman and Singer (1994) developed LID systems

that use a single-language front-end phone recognizer, with successful performance. This work was extended by Zissman and Singer (1994) and Yan and Barnard (1995) to multiple single-language front ends.

In this thesis, a phoneme recognizer based on Turkish phonemes is developed and the

phonetic distributions of the languages in OGI multi-language database are evaluated.

1.4.4. LID Using Word-Recognition

The systems based on word-recognition are more complicated than the phone-level

systems and less complicated than the large-vocabulary systems. They use the lexical

information of languages and score the occurrence of words for each language.

In the approach of Kadambe and Hieronymus (1995), which uses lexical modeling for language identification, the incoming utterance is processed by parallel language-dependent phone recognizers and possible word sequences are identified from the resulting phone sequences. Obtaining the lexical information of all target languages is not an easy task, since each language-dependent lexicon includes several thousand entries.

1.4.5. LID Using Continuous Speech Recognition

In order to obtain better LID performance, researchers try to add more and more

knowledge to their systems. Large-vocabulary continuous-speech recognition systems are

the most complicated ones used for this purpose. During the training process, one speech

recognizer per language is created and during the evaluations all recognizers are run in


parallel to select the most likely one as the recognized language. Mendoza et al. (1996),

Schultz and Waibel (1998) and Hieronymus and Kadambe (1997) have worked on these

systems.

As these systems use higher-level knowledge (words and word sequences) rather than lower-level knowledge (phones and phone sequences), their identification performance is better than that of simpler systems. On the other hand, they require many hours of labeled training data for each language to be recognized, and the algorithms are the most computationally complex (Zissman, 2001).


2. THEORETICAL BACKGROUND

2.1. Speech Representation

Since their introduction in the early 1970s, homomorphic signal processing techniques have been of great interest in speech recognition. Homomorphic systems are a class of nonlinear systems that obey a generalized principle of superposition. Linear systems are a special, simple case of homomorphic systems (Picone, 1993).

In speech processing, the homomorphic system should have the following property:

D[[x1(n)]α • [x2(n)] β] = αD[x1(n)] + βD[x2(n)] (2.1)

Homomorphic systems are considered useful for speech processing because they provide a way of separating the excitation signal from the vocal tract characteristics. The

convolution of two sources in time domain is represented as follows:

s(n) = g(n) * v(n) (2.2)

where g(n) denotes the excitation signal and v(n) the vocal tract impulse response and “*”

the convolution.

The frequency domain representation of this process is as follows:

S(f) = G(f) . V(f) (2.3)

When we take the logarithm of both sides, we get:

Log(S(f)) = Log (G(f) . V(f))

= Log (G(f)) + Log (V(f)) (2.4)


The cepstrum is one homomorphic transformation that allows the separation of the source from the filter: the excitation and the vocal tract shape become superposed (additive) in the log domain, so they can be separated. The cepstrum of a speech segment can be computed by windowing the signal with a window of length N.
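Written compactly, the real cepstrum of a windowed segment is the inverse transform of the log magnitude spectrum; this simply restates Equations (2.2)-(2.4), with the analysis window denoted w(n) here only for clarity:

c(n) = \mathcal{F}^{-1}\left\{ \log \left| \mathcal{F}\{\, s(n)\, w(n) \,\} \right| \right\}

where \mathcal{F} denotes the Fourier transform and w(n) is the N-point analysis window.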

Mel Frequency Cepstrum Coefficients (MFCC) vector is a representation defined as

the real cepstrum of a windowed short-time signal derived from the FFT of that signal

(Huang, 2001). The difference from the real cepstrum is that a non-linear frequency scale,

called mel-scale, is used in order to approximate the behavior of the auditory response of

the human ear.

To compute MFCCs of raw speech data, we first compute the spectral magnitudes, apply them to the filter banks, take the logarithm of the filter outputs and then compute the inverse Fourier transform. The filter banks are linearly spaced in mel frequency; the mel scale is approximately linear below 1000 Hz and logarithmic above. The mapping from linear frequency to mel frequency is defined as follows.

Mel(f) = 2595log10(1+ f / 700) (2.5)

In Figure 2.1, the triangular band-pass filters that are equally spaced along the mel-frequency scale between 0 and 4000 Hz are shown.

Figure 2.1. Filter bank for computing MFCCs


The expression for computing the MFCCs using the discrete cosine transform is

given as follows.

\mathrm{MFCC}_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left( \frac{\pi i (j - 0.5)}{N} \right) \qquad (2.6)

where N is the number of filters and m_j are the log band-pass filter output amplitudes.
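A small sketch of this computation, from a frame's power spectrum through the mel filter bank to the DCT of Equation (2.6), is given below. It is an illustration only; the implementation actually used in this work is the MATLAB code in Appendix B, and the filter-bank construction details here (FFT size, edge placement) are assumptions made for the example.

import numpy as np

def mel(f):
    # Mel scale of Equation (2.5)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=16, n_fft=256, fs=8000):
    # Triangular band-pass filters equally spaced on the mel scale (Figure 2.1)
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft / 2) * edges / (fs / 2.0)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ctr):
            fb[i - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi + 1):
            fb[i - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mfcc_frame(frame, fb, n_ceps=12):
    # Power spectrum -> log filter-bank amplitudes m_j -> DCT of Equation (2.6)
    n_fft = (fb.shape[1] - 1) * 2
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    m = np.log(fb @ spectrum + 1e-10)
    n = len(m)
    j = np.arange(1, n + 1)
    return np.array([np.sqrt(2.0 / n) * np.sum(m * np.cos(np.pi * i * (j - 0.5) / n))
                     for i in range(1, n_ceps + 1)])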

It has been shown that the performance of such systems is enhanced by the addition of time derivatives to the static parameters. The first-order derivatives are referred to as the delta coefficients, and they are also included in the feature vector.

MFCCs are one of the most popular parameterization methods used by researchers in speech applications because of their capability of capturing the phonetically important characteristics of speech. A small drawback is that, because of the Fast Fourier Transform (FFT) applied in the early stages to convert speech from the time domain to the frequency domain, MFCCs are computationally more complex than other methods such as LPCC (Wong and Sridharan, 2001).

2.2. Vector Quantization

Quantization is the process of approximating continuous amplitude signals by

discrete symbols. The quantization of a single signal value or parameter is the scalar

quantization. In vector quantization, instead of individual values, small arrays of them are

represented. VQ was first proposed as a highly efficient quantization method for LPC

parameters (Linde et al., 1980), and later was applied to waveform coding.

A vector quantizer is described by a codebook, which is a set of fixed prototype

vectors. Each of these vectors is also referred to as codeword. During the quantization

process the input vector is matched against each codeword in the codebook using some

distortion measure. The input vector is then assigned to the index of the codeword with the

smallest distortion. Therefore, a vector quantization process involves two main components: the distortion measure and the generation of the codewords in the codebook.


To design an M-level codebook, it is necessary to partition d-dimensional space into

M cells and assign a quantized vector to each cell. The criterion for optimizing the

quantizer is to minimize the overall distortion over M levels. There are two necessary

conditions for an optimal quantizer. The first is that, the quantizer is realized by using a

nearest-neighbor selection rule expressed in Equation (2.7), where x is quantized as z. The

second condition is that, each codeword zi is chosen to minimize the average distortion in

the cell i.

q(\mathbf{x}) = \mathbf{z}_i \quad \text{if and only if} \quad i = \arg\min_k d(\mathbf{x}, \mathbf{z}_k) \qquad (2.7)

The procedure known as the K-means algorithm or the generalized Lloyd algorithm

partitions the set of training vectors into M clusters in such a way that the two necessary

conditions for optimality are satisfied. The algorithm can be described as follows:

Initialization: Initial values for the codewords in the codebook are assigned.

Nearest-Neighbor Classification: Each training vector is classified into one of the

cells by choosing the closest codeword.

Codebook Updating: The codeword of every cell is updated by computing the centroid of the training vectors assigned to that cell.

Iteration: The classification and updating steps are repeated until the ratio of the new overall distortion to the previous one rises above the pre-defined threshold.

Since the initial values of the codebook are critical to the ultimate quality of the quantizer, another procedure called the LBG algorithm, which is also known as the extended K-means algorithm, is proposed in order to design the M-vector codebook in stages. Different from the K-means algorithm, the LBG algorithm first computes a 1-vector

codebook, then uses a splitting algorithm on the codewords to obtain the initial 2-vector

codebook and continues splitting until the desired M-vector codebook is obtained (Huang,

2001).
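A small sketch of the LBG idea, combining the splitting step with K-means refinement, is given below. It is illustrative only; the perturbation constant, the distortion-ratio threshold and the iteration limit are arbitrary choices for the example, not values used in this work.

import numpy as np

def kmeans_refine(data, codebook, threshold=0.999, max_iter=50):
    # Lloyd / K-means refinement: classify, update the centroids, iterate
    prev_distortion = np.inf
    for _ in range(max_iter):
        # Nearest-neighbor classification of every training vector
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        distortion = d2[np.arange(len(data)), labels].sum()
        # Codebook updating: centroid of the vectors assigned to each cell
        for k in range(len(codebook)):
            if np.any(labels == k):
                codebook[k] = data[labels == k].mean(axis=0)
        # Stop when the distortion ratio rises above the threshold
        if distortion / prev_distortion > threshold:
            break
        prev_distortion = distortion
    return codebook

def lbg_codebook(data, size=32, eps=1e-3):
    # LBG: start from a 1-vector codebook and split until the target size
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1.0 + eps), codebook * (1.0 - eps)])
        codebook = kmeans_refine(data, codebook)
    return codebook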


2.3. Hidden Markov Model Topology

Hidden Markov Modeling (HMM) is one of the basic methods used in speech

processing. It is a widely used statistical method of characterizing the spectral properties of

frames of an utterance, which is assumed to be a random process, and the parameters of

this process can be estimated in a precise, well-defined way. Simply put, an HMM is a Markov model in which the states are hidden.

HMMs can be classified into different types of models such as discrete models,

continuous models or semi-continuous models depending on the observable events

assigned to each state being discrete, continuous or both. The states may be defined as

ergodic, which implies that there is a transition from each state to any state at any time. A simpler topology, also used in our system, is left-to-right, in which the states follow a route from left to right until the sequence terminates.

In all speech processing applications, there ought to be a training phase followed by a

recognition phase. During the training, the parameters of the base reference model are

computed. There are three parameters to be estimated. One of them is the state transition

probability matrix A, with elements of aij denoting the transition probability of being at

state i at time t and at state j at time t + 1. When an observation sequence O = {o1, o2, …,

oT} is defined, each vector element of this sequence denotes the feature parameter vector in

the speech recognition task. The matrix B = [bj(ot)] is the observation symbol probability distribution, where bj(ot) is the probability of observing vector ot at time t in state j. The vector π = {πi} denotes the initial state distribution, which gives the probability of being in state i at the beginning. These three parameter sets form the compact representation λ = {A, B, π}, which is used to represent an HMM in general. There are also some other parameters, namely the number of states, N, and the number of mixtures in each state, M. There

are various ways of representing observation symbol probabilities but the continuous

probability densities are preferred generally. The multivariate Gaussian distributions are

also widely used. The expression for the computation of bj(ot) is as follows:

b_j(\mathbf{o}_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}) \right] \qquad (2.8)


where Ms is the number of mixture components in stream s, cjsm is the weight of the mth

component and N(·; µ, Σ) is a multivariate Gaussian with mean vector µ and the

covariance matrix Σ, expressed as follows:

\mathcal{N}(\mathbf{o};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n |\boldsymbol{\Sigma}|}}\; e^{-\frac{1}{2}(\mathbf{o}-\boldsymbol{\mu})'\, \boldsymbol{\Sigma}^{-1} (\mathbf{o}-\boldsymbol{\mu})} \qquad (2.9)

where n is the dimensionality of o.

At the beginning of the training process, a rough estimate of the λ values of HMMs

should be computed. The Viterbi algorithm is used to assign the initial values after the

observation sequence is uniformly segmented with a Segmental K-means algorithm. Using

these initial values of the parameters, further improvements are achieved with a Baum-

Welch or Expectation-Maximization re-estimation procedure.

Figure 2.2. Representation of a mixture with M-components

For a single-state HMM, the estimation of the parameters of the symbol observation probability is given in Equation (2.10). Since the Gaussian mixtures are independent, the computation for multi-component models is also straightforward.

b_j(\mathbf{o}) = \frac{1}{\sqrt{(2\pi)^n |\boldsymbol{\Sigma}_j|}}\; e^{-\frac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_j)'\, \boldsymbol{\Sigma}_j^{-1} (\mathbf{o}-\boldsymbol{\mu}_j)} \qquad (2.10)


The maximum likelihood estimates of µ j and Σ j are simple averages,

\hat{\boldsymbol{\mu}}_j = \frac{1}{T} \sum_{t=1}^{T} \mathbf{o}_t \qquad (2.11)

and

\hat{\boldsymbol{\Sigma}}_j = \frac{1}{T} \sum_{t=1}^{T} (\mathbf{o}_t - \boldsymbol{\mu}_j)(\mathbf{o}_t - \boldsymbol{\mu}_j)' \qquad (2.12)

Let Lj (t) be the probability of being in state j at time t then using Equation (2.11) and

Equation (2.12) the weighted averages are obtained as follows:

\hat{\boldsymbol{\mu}}_j = \frac{\sum_{t=1}^{T} L_j(t)\, \mathbf{o}_t}{\sum_{t=1}^{T} L_j(t)} \qquad (2.13)

and

\hat{\boldsymbol{\Sigma}}_j = \frac{\sum_{t=1}^{T} L_j(t)\, (\mathbf{o}_t - \boldsymbol{\mu}_j)(\mathbf{o}_t - \boldsymbol{\mu}_j)'}{\sum_{t=1}^{T} L_j(t)} \qquad (2.14)

Equations (2.13) and (2.14) are known as the Baum-Welch re-estimation expressions

for means and variances. In order to calculate the mean and variance terms above, the state occupation probability Lj(t) should be evaluated by using an efficient algorithm such as the

Forward-Backward algorithm. The forward probability αj (t) for a model with M mixtures

and N states is defined as the joint probability of observing the first t speech vectors in

sequence and being in state j at time t.

\alpha_j(t) = P(\mathbf{o}_1, \ldots, \mathbf{o}_t,\; x(t) = j \mid M) \qquad (2.15)

Equation (2.15) can be computed efficiently by the recursion below:

\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(\mathbf{o}_t) \qquad (2.16)


The initial condition and final condition of the recursion expression in Equation

(2.16) for 1< j < N are given as follows:

\alpha_1(1) = 1, \qquad \alpha_j(1) = a_{1j}\, b_j(\mathbf{o}_1) \qquad (2.17)

\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN} \qquad (2.18)

The calculation of forward probability yields the total likelihood P(O | M).

P(\mathbf{O} \mid M) = \alpha_N(T) \qquad (2.19)

The backward probability is defined below and the recursion expression is given in

Equation (2.21).

\beta_j(t) = P(\mathbf{o}_{t+1}, \ldots, \mathbf{o}_T \mid x(t) = j,\, M) \qquad (2.20)

\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(\mathbf{o}_{t+1})\, \beta_j(t+1) \qquad (2.21)

The initial condition and final condition of recursion expression above for 1< j < N

are given as follows:

\beta_i(T) = a_{iN}, \qquad \beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(\mathbf{o}_1)\, \beta_j(1) \qquad (2.22)

Backward probability is defined as a conditional probability, whereas the forward

probability is defined as a joint probability. The product of these two probabilities is

proportional to the probability of state occupation. By using the definitions above we

obtain the following expressions:


\alpha_j(t)\, \beta_j(t) = P(\mathbf{O},\, x(t) = j \mid M) \qquad (2.23)

L_j(t) = P(x(t) = j \mid \mathbf{O},\, M) = \frac{P(\mathbf{O},\, x(t) = j \mid M)}{P(\mathbf{O} \mid M)} = \frac{1}{P}\, \alpha_j(t)\, \beta_j(t) \qquad (2.24)

where P = P (O | M).

During the recognition phase of HMM-based speech processing applications, the

Baum-Welch algorithm can also be used. While computing the forward probability with the efficient recursive method, the total likelihood P = P(O | M) is obtained as a by-product.

Therefore, this algorithm can be used to find out which model would maximize the value

of P = P (O | Mi), where i denotes the individual models.

However, it is more convenient to have the recognition process based on the

maximum likelihood state sequence. There is a small modification in the algorithm such

that the summation is replaced by a maximization. If φj(t) represents the maximum

likelihood of observing speech vectors from o1 to ot and being in state j at time t, the partial

likelihood can be computed from Equation (2.16) as the following:

\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\; b_j(\mathbf{o}_t) \qquad (2.25)

where the initial and final conditions of recursion expression for 1< j < N are as follows:

\phi_1(1) = 1, \qquad \phi_j(1) = a_{1j}\, b_j(\mathbf{o}_1), \qquad \phi_N(T) = \max_i \{ \phi_i(T)\, a_{iN} \} \qquad (2.26)

The expression of the final condition is equal to the maximum likelihood, P (O | M).

The probability values in the equations are very small and the multiplication of these

numbers may cause an underflow. Therefore, the log likelihood values should be preferred

instead of the linear ones. By this approach Equation (2.27) is obtained as follows:


\psi_j(t) = \max_i \{ \psi_i(t-1) + \log(a_{ij}) \} + \log(b_j(\mathbf{o}_t)) \qquad (2.27)

The equation above is the basis of the Viterbi decoding algorithm. This algorithm can be visualized as finding the best path through a matrix, as shown in Figure 2.3. The rows represent the states and the columns the speech frames. Each large dot denotes the log probability of

observing that frame at that time and each arc represents the log transition probability

between those states. The log probability of any path is equal to the sum of the dots and

arcs passed through.

Figure 2.3. Matrix visualization of Viterbi algorithm

The path extends from left-to-right, column-by-column. At time t, any partial path

ψi(t-1) is known, therefore any ψj(t) can be computed from Equation (2.27) (Young et al.,

2000).
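As a concrete illustration of Equation (2.27) and of the path search in Figure 2.3, a toy log-domain Viterbi routine is sketched below. It is not the HTK implementation used in this work; the entry and exit states are folded into an initial distribution log_pi for simplicity.

import numpy as np

def viterbi_log(log_pi, log_a, log_b):
    # log_pi[j]   : log probability of starting in state j
    # log_a[i, j] : log transition probability from state i to state j
    # log_b[j, t] : log probability of emitting frame t in state j
    n_states, n_frames = log_b.shape
    psi = np.full((n_states, n_frames), -np.inf)
    back = np.zeros((n_states, n_frames), dtype=int)
    psi[:, 0] = log_pi + log_b[:, 0]
    for t in range(1, n_frames):
        for j in range(n_states):
            # Equation (2.27): best predecessor plus log transition, then emission
            scores = psi[:, t - 1] + log_a[:, j]
            back[j, t] = int(scores.argmax())
            psi[j, t] = scores.max() + log_b[j, t]
    # Trace back the maximum-likelihood state sequence (the best path of Figure 2.3)
    path = [int(psi[:, -1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return float(psi[:, -1].max()), path[::-1]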


3. BASE SYSTEM FOR LANGUAGE IDENTIFICATION

3.1. OGI Multi-Language Database

The mixture models of different languages are trained and the results of the proposed

system are evaluated using the recordings of the Oregon Graduate Institute Multi-Language Telephone Speech Corpus (OGI-TS), described in Muthusamy (1992).

This database comprises 11 spoken languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. There are a total of 23,118 files, of which 568 are time-aligned with broad phonetic transcriptions. The types of utterances are nlg (native language), clg (common language), dow (days of the week), num (numbers 0 through 10), htl (hometown likes), htc (hometown climate), roo (room description), mea (description of most recent meal), stb (free speech before the tone) and sta (free speech after the tone). The recordings classified as "stories before the tone" (stb files), each lasting 45 seconds, are used in our evaluations. The numbers of speakers for the training and test sets in the OGI database are given in Table 3.1.

Table 3.1. Number of speakers in OGI database

Language        Training Set    Evaluation Set
English              50              141
Farsi                49               51
French               50               57
German               50               59
Hindi               173*              52
Korean               50               40
Japanese             49               37
Mandarin             49               52
Spanish              50               60
Tamil                50               55
Vietnamese           50               50

* "stb" files only.

The speech files have the NIST SPHERE header format. All files, compressed with the "shorten" speech compression method, are decompressed and byte-swapped before the

feature extraction phase.


3.2. Feature Extraction

The speech files in OGI multi-language corpora, sampled at 8kHz with 16-bit

resolution, are parameterized every 20ms with 10ms overlap between contiguous frames.

For each frame a 24-dimensional feature vector is computed; 12 cepstrum coefficients, 12

delta cepstrum coefficients. Speech utterances are windowed by using a 160-point

Hamming window to get the short-term energy spectrum. After that they are filtered with a

filter bank of 16 filters.

The energy coefficient is not included in the feature vector because of the different recording levels over the telephone line. In the study of Wong in 2001, it is shown that the

static log energy coefficient reduces the performance of the language identification system.

As an explanation of this result, it is implied that the static short-term features do not

encapsulate the language specific information in contrast to the transient features.

Cepstral normalization is performed in order to minimize the channel effect. During

this process, the mean cepstrum of each file is calculated and then the obtained value is

subtracted from each feature vector. The effect of cepstrum normalization is examined

through a plain system test for normalized and unnormalized feature vectors.
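A sketch of this per-file cepstral mean normalization is given below; the frame-by-coefficient matrix layout is an assumption made for the example.

import numpy as np

def cepstral_mean_normalize(features):
    # features: one row per frame, one column per cepstral coefficient.
    # The mean cepstrum of the file is subtracted from every frame, so that
    # a constant (convolutive) channel term, which is additive in the
    # cepstral domain, is removed.
    return features - features.mean(axis=0, keepdims=True)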

Figure 3.1. Distributions of c1 vs. c0 coefficients of mixtures for some languages


Table 3.2. Effect of feature normalization on LID for baseline system

Language   Unnormalized Parameters   Normalized Parameters   Improvement in mean rank
EN                 1.40                     1.00                      3.6%
FA                 8.25                     8.20                      0.5%
FR                 6.45                     6.85                      3.6%
GE                 1.80                     2.00                      1.8%
HI                 3.70                     3.55                      1.4%
JA                 3.75                     4.20                     -4.1%
KO                 8.70                     7.35                      4.1%
MA                 7.25                     6.90                      3.2%
SP                 4.30                     3.80                      4.5%
TA                 4.20                     3.60                      5.4%
VI                10.60                    10.80                     -1.8%
Overall            5.49                     5.29                      3.6%

The results in Table 3.2 are in terms of the ranks of the target languages, and they imply that the normalization process improves LID system performance for all languages except Japanese and Vietnamese. The unexpected result for these two languages may be related to

the voiced phoneme distribution of these languages. The ratio of the high-energy frames

might be corrupted during the normalization process.

3.3. Gaussian Mixture Modeling using Vector Quantization

In order to build an LID system that does not depend on the amount of labeled

speech data, we implement a method based on Gaussian mixtures generated by vector

quantization. The speech files in the training set of OGI corpora for 11 languages are used

in order to obtain the codebook of the system for each language.

A composite of mixtures is evaluated with an optimal codebook size of 32. Mixture

splitting is performed by using an entropy-based distance measure, defined over the

codewords of each language.

The algorithm can be summarized as follows:

Decide on the codebook size (N=32).

Evaluate the initial mean value (codeword) using the input feature vectors.


Split the mean vector with the maximum weight in the codebook recursively until the

total number of codewords is reached.

Using the sum of squared error as the distortion measure, cluster the input vectors

around each codeword. This is done by finding the distance between the input vector

and each codeword. The input vector belongs to the cluster of the codeword that

yields the minimum distance.

Re-estimate the new set of codewords. This is done by obtaining the average of each cluster: the i-th components of the vectors in the cluster are summed and divided by the number of vectors,

z_i = \frac{1}{m} \sum_{j=1}^{m} x_{j,i} \qquad (3.1)

where i is the component of each vector (x, y, z, ... directions), m is the number of vectors in the cluster and x_{j,i} is the i-th component of the j-th vector in the cluster.

Repeat the previous two steps until either the codewords do not change or the change

is quite small.

The evaluation of this system is based on the Nearest Neighbor Selection algorithm

(Higgins, 1993), which is a non-parametric approach using the averaged nearest neighbor

distance to classify features into mixtures. The Gaussian that best fits the input vector of

mel cepstrum coefficients, c, is found by evaluating the distance using the log likelihood

values. The distance search algorithm can be expressed briefly as follows:

dmin = ∞;
for m = 1..M
    d = 0;
    for n = 1..N
        d = d + ( cn - µnm ) * ( cn - µnm ) / σnm;
    end
    if d < dmin
        dmin = d;
        argmin = m;
    end
end
return dmin;

Figure 3.2. Distance search algorithm
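Building on the search of Figure 3.2, the per-utterance language scoring can be sketched as follows. This is an illustration only; the actual C implementation of GMM training and NNMS is in Appendix C, and diagonal covariances are assumed here.

import numpy as np

def nearest_mixture_distance(c, means, variances):
    # Distance of feature vector c to the closest of a language's mixtures,
    # following the search of Figure 3.2 (diagonal covariances assumed)
    d = (((c - means) ** 2) / variances).sum(axis=1)
    return d.min()

def rank_languages(frames, models):
    # models: {language: (means, variances)}; the distance is accumulated
    # over all frames of the 45-second utterance for every language model.
    totals = {lang: sum(nearest_mixture_distance(c, mu, var) for c in frames)
              for lang, (mu, var) in models.items()}
    # The best-matching language (smallest accumulated distance, i.e. the
    # maximum likelihood) receives rank 1, the worst rank 11.
    ordered = sorted(totals, key=totals.get)
    return {lang: rank + 1 for rank, lang in enumerate(ordered)}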


The likelihood values obtained from 11 languages are sorted in order to get the rank

of each language, such that the one with the maximum likelihood takes the rank of 1, and

the minimum takes the rank of 11. The mean values of the language ranks for each model

tested for each test set (45-sec utterances of 20 files) are given in Table 3.3. The mean and

variance values of the correct matches are plotted in Figure 3.3.

Table 3.3. Mean values of language ranks obtained by plain NNMS

(Rows: test files; columns: languages of model files.)

Test Files     EN    FA    FR    GE    HI    JA    KO    MA    SP    TA     VI
EN           1.00  9.60  8.50  2.05  4.65  4.40  8.10  7.55  3.70  5.45  11.00
FA           1.10  8.20  8.00  1.95  3.95  5.80  8.70  8.30  4.10  4.90  11.00
FR           1.30  9.60  6.85  2.20  3.90  5.30  8.15  8.20  4.05  5.45  11.00
GE           1.20  9.45  8.25  2.00  3.75  4.90  8.25  7.40  4.30  5.55  10.95
HI           1.15  8.90  8.55  2.80  3.55  5.40  7.85  7.90  4.35  4.60  10.95
JA           1.30  9.45  8.75  1.90  4.30  4.20  7.85  7.85  3.95  5.50  10.95
KO           1.55  9.30  8.15  1.65  4.40  4.90  7.35  8.40  3.95  5.35  11.00
MA           1.55  9.75  8.40  2.05  3.55  4.00  7.50  6.90  5.15  6.20  10.95
SP           1.25  9.25  8.55  2.45  3.95  4.50  8.00  8.10  3.80  5.35  10.80
TA           1.30  9.20  8.85  2.35  4.30  5.10  8.20  7.65  4.55  3.60  10.90
VI           1.35  9.40  8.90  2.15  4.35  4.10  7.35  7.70  4.65  5.25  10.80


Figure 3.3. Mean and variance of target-language ranks with a plain NNMS system


4. PROPOSED SYSTEM

4.1. Language-Dependent Gaussian Mixtures

4.1.1. Uni-gram and Bi-gram Modeling

For a better LID performance, uni-gram and bi-gram values of mixtures are included

in the computation of language probabilities. Each of the 32 mixtures generated for each

language is assumed to correspond to a phoneme. Therefore, the frequency of mixture

occurrences is assigned as the uni-gram value and the frequency of one mixture following

the other is assigned as the bi-gram value.

The conventional bi-gram modeling of a language depends on the word distributions

and an approximation is made such that the probability of a word depends only on the

identity of the preceding word, expressed as P(wi|wi-1) (Huang, 2001). In this study, using

the same approach, the bi-gram probability values of the acoustic mixture distributions,

P(mi|mi-1), are calculated for each language. During the computation of the mixture

statistics, the floor value of 0.001 is assigned for the mixture pairs that occur infrequently.

In order to estimate the uni-gram values using the training set, we simply count the number

of occurrences of each mixture in the output sequence.

The uni-gram and bi-gram values are weighted by optimum coefficients, which are computed during the development tests, and the results are added to the log likelihood value of the relevant frame. The probabilities for each frame are accumulated over the processed input file.
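A sketch of how these mixture uni-gram and bi-gram statistics can be estimated and folded into the frame scores is given below. It is illustrative only: the way the 0.001 floor is applied and the weights u and b are assumptions standing in for the coefficients tuned in the development tests.

import numpy as np

FLOOR = 0.001  # probability floor for infrequent mixture pairs

def mixture_ngrams(train_sequences, n_mixtures=32):
    # train_sequences: per-file lists of nearest-mixture indices for one language
    uni = np.zeros(n_mixtures)
    bi = np.zeros((n_mixtures, n_mixtures))
    for seq in train_sequences:
        for m in seq:
            uni[m] += 1.0
        for prev, cur in zip(seq[:-1], seq[1:]):
            bi[prev, cur] += 1.0
    uni = np.maximum(uni / max(uni.sum(), 1.0), FLOOR)
    bi = np.maximum(bi / np.maximum(bi.sum(axis=1, keepdims=True), 1.0), FLOOR)
    return uni, bi          # uni[m] ~ P(m), bi[p, m] ~ P(m | p)

def utterance_score(frame_logliks, mixture_seq, uni, bi, u=1.0, b=1.0):
    # Weighted log uni-gram and bi-gram terms are added to each frame's
    # log likelihood and accumulated over the whole input file.
    total, prev = 0.0, None
    for loglik, m in zip(frame_logliks, mixture_seq):
        total += loglik + u * np.log(uni[m])
        if prev is not None:
            total += b * np.log(bi[prev, m])
        prev = m
    return total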

The tests are performed for 20 files of spontaneous speech from each language in the

OGI database and the language ranks are computed. The mean values of the ranks for the

uni-gram coefficients of u = 1, 2, 4 and 6 are given in Figure 4.1. The results for the bi-

gram weight coefficients of b = 0.8, 1, 1.2, 2 are plotted in Figure 4.2.


Figure 4.1. Mean and variance values of target-language ranks for different u values


Figure 4.2. Mean and variance values of target-language ranks for different b values


4.1.2. Language Normalization

In the evaluations of the LD-GMM based LID system, it is observed that some of the languages, such as English and German, are biased in the domain of the OGI database; therefore,

language normalization is applied to the mixture probabilities of the languages during the

identification phase.

In order to compute the language normalization coefficients, the training files for

each language are processed separately by the plain LID system using the corresponding

model file. The average of the likelihood values is assigned as the normalization coefficient, and the weighted coefficient is added to the frame likelihoods of the hypothesized language. The mean values of the detected language ranks with weight coefficients of n = 0.4, 0.6, 0.8, 1 and 1.2 are given in Figure 4.3.
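The normalization step can be sketched as follows. This is an illustration only; the coefficient is applied here exactly as described above (added to the frame likelihood of the hypothesized language, scaled by the weight n), and the function names are hypothetical.

import numpy as np

def language_normalization_coefficient(train_frame_logliks):
    # Average frame likelihood of one language's own training files,
    # obtained by running the plain LID system with the matching model file
    return float(np.mean(train_frame_logliks))

def normalized_frame_score(frame_loglik, norm_coeff, n=0.8):
    # The weighted coefficient is combined with the frame likelihood of the
    # hypothesized language; n is the weight swept over 0.4-1.2 in Figure 4.3.
    return frame_loglik + n * norm_coeff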


Figure 4.3. Mean and variance values of target-language ranks for different n values


4.2. Unsupervised Language Modeling Using Mono-Lingual Phoneme Recognizer

The Hidden Markov Modeling toolkit HTK, developed at Cambridge University, is used for the implementation of the phoneme recognizer. Since we cannot provide labeled speech data for training the mixture models of each language, we decided to implement a reference model apart from the languages under test. For this purpose, we trained Turkish phonemes using the TÜBİTAK-TURTEL database described in Yapanel et al. (2001), which includes the labels of Turkish utterances at the word level.

A phoneme recognizer based on continuous mixture HMMs is used for decoding the incoming spoken utterance. In the training phase, the HTK modules HERest and HHEd are used. Initial single-Gaussian phoneme models are passed through the embedded training module HERest (six times). After that process, the mixtures are split by the HHEd module of HTK, and the resulting four-mixture models are trained by HERest (two times). Later, the mixtures are split again and the eight-mixture models are obtained.

During training, the validity of the models with varying numbers of states and mixtures is examined by a continuous word-recognition test using 30 sentence recordings from the TURTEL database. The test is performed for two-mixture, four-mixture, eight-mixture and sixteen-mixture models. The percentages of true hits, obtained with the HResults tool of HTK, were 57.5 per cent for the eight-mixture and 56.7 per cent for the sixteen-mixture models. Therefore, we decided to develop a system based on eight-mixture models, which appear sufficient to represent the language while keeping the computation time of the evaluations affordable. The effect of a differing number of states (three-state versus five-state) on the performance is also examined. As a result, the eight-mixture model with the conventional five-state left-to-right architecture shown in Figure 4.4 is selected for the statistical training.

(HMM topology: states s1-s5 connected left to right by the transitions a12, a23, a34, a45, with self-loop probabilities a22, a33, a44 on the emitting states s2-s4.)

Figure 4.4. Five-state left-to-right HMM architecture


Since we have a mono-lingual phoneme recognizer, there is no acoustic diversity between the language models, and the languages must be distinguished according to their lexical information. To achieve this, the training utterances of each language are decoded using a plain net file with no statistical information. The output phoneme sequences are obtained and the probabilities of one phoneme following another are computed. The network structure of the phoneme recognizer for each language is then generated by inserting these bi-gram transition values. An example of the net file for English is given in Appendix E.
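The bi-gram estimation itself is straightforward; a minimal sketch is given below. The phoneme inventory size, the floor value and the helper names are assumptions for illustration and do not come from the thesis code.

#include <cmath>

const int NUMPHN = 28;           // phonemes used in this work (see below)
const double FLOOR_P = 1e-5;     // assumed floor probability for unseen pairs

// Sketch (assumption): estimate bi-gram transition log probabilities from the
// decoded phoneme sequence of one language; these are the l= values inserted
// into the language-dependent net file (see Appendix E).
void EstimateBigrams(const int *phnSeq, int seqLen, double bigramLog[NUMPHN][NUMPHN])
{
    long count[NUMPHN][NUMPHN] = {{0}};
    long rowTotal[NUMPHN] = {0};

    for (int t = 1; t < seqLen; t++) {        // count one phoneme following the other
        count[phnSeq[t-1]][phnSeq[t]]++;
        rowTotal[phnSeq[t-1]]++;
    }
    for (int i = 0; i < NUMPHN; i++)
        for (int j = 0; j < NUMPHN; j++) {
            double p = rowTotal[i] ? (double)count[i][j] / rowTotal[i] : 0.0;
            if (p < FLOOR_P)
                p = FLOOR_P;
            bigramLog[i][j] = log(p);
        }
}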

Including the silence model, the Turkish phoneme list comprises 30 phonemes, but 28 phonemes are used in our system. The phoneme "ğ" (referred to as G) and the silence model (referred to as Z) are excluded from the list because these two are strongly biased in the output phoneme sequences.

The set of test files used for the evaluation of the phoneme recognizer includes the same 20 files per language as those used for the evaluation of the LD-GMMs. Using the HVite recognition module with the language-dependent net files and the Turkish phoneme model file, the results listed in Table 4.1 are obtained by the phoneme recognizer.

Table 4.1. Mean values of language ranks obtained by phoneme recognizer
(rows: languages of the test files; columns: languages of the model files)

        EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
EN     3.00  6.15  5.35  8.30  5.10  5.75  5.90  5.15  7.80  6.85  6.65
FA     5.95  2.55  6.05  7.00  5.45  6.55  5.05  7.25  6.50  7.65  6.00
FR     4.65  6.60  1.15  7.65  7.20  5.30  6.25  6.45  6.50  6.60  7.65
GE     4.55  4.50  7.70  1.70  7.80  6.60  5.70  7.30  6.85  6.50  6.80
HI     5.25  7.40  6.70  7.20  2.35  5.45  5.50  6.60  7.05  6.05  6.45
JA     5.80  6.80  4.80  6.90  6.70  3.55  6.00  7.40  6.20  5.90  5.95
KO     5.05  6.55  4.90  6.35  7.95  4.10  3.35  8.15  7.45  6.30  5.85
MA     5.60  6.10  4.80  7.90  6.85  5.40  6.25  3.10  7.05  5.80  7.15
SP     5.75  6.80  6.85  6.40  6.50  6.65  4.80  6.10  3.15  6.80  6.20
TA     6.05  6.85  5.35  7.30  6.20  6.90  6.05  5.65  5.50  3.55  6.60
VI     6.15  4.80  4.90  8.60  4.75  6.05  6.75  5.80  7.25  7.25  3.70


Figure 4.5. Mean and variance values of language ranks with phoneme recognizer

4.3. Superposition of Two Methods

In our proposed LID system, the two separate methods explained above are processed in parallel and their outputs are merged at the final stage, as shown in Figure 4.6.

(Block diagram: the input utterance passes through feature extraction and is processed in parallel by the nearest neighbor search of LD-GMMs and by the monolingual phoneme recognizer with LD networks; the resulting ranks of the languages are merged to give the detected language.)

Figure 4.6. The superposition of two methods

(Flowchart: the language with the highest rank in output_1 gives index I and the language with the highest rank in output_2 gives index J; rank_I is the rank of language I in output_2 and rank_J is the rank of language J in output_1; the two ranks are compared with a threshold and with each other to decide whether language I or language J is detected, or a false detection is declared.)

Figure 4.7. The algorithm for merging the two results

In both methods, the ranks of the languages for each file are obtained as output, and the results are merged by the algorithm described in Figure 4.7. The output of one method is checked against the output of the other, which leads the system to make more accurate decisions and to catch some of the false detections. Each language in the test domain also behaves as a garbage model for the target language.
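Because the flowchart of Figure 4.7 is only sketched here, the following fragment shows one plausible reading of the merge step, not the exact thesis algorithm: rank 1 is assumed to be the best rank, and thresh is the rank threshold mentioned in the figure.

// Sketch (one possible interpretation of Figure 4.7): cross-check the top
// candidate of each method against the other method's ranking.
// output1[l], output2[l]: rank of language l according to the two methods (1 = best).
// Returns the index of the detected language, or -1 for a false detection.
int MergeOutputs(const int *output1, const int *output2, int numLangs, int thresh)
{
    int I = 0, J = 0;
    for (int l = 1; l < numLangs; l++) {
        if (output1[l] < output1[I]) I = l;    // best language of method 1
        if (output2[l] < output2[J]) J = l;    // best language of method 2
    }
    int rank_I = output2[I];                   // rank of I in the other method
    int rank_J = output1[J];                   // rank of J in the other method

    if (I == J)
        return I;                              // both methods agree
    if (rank_I <= thresh && rank_J <= thresh)
        return (rank_I <= rank_J) ? I : J;     // pick the better cross-rank
    if (rank_I <= thresh)
        return I;
    if (rank_J <= thresh)
        return J;
    return -1;                                 // false detection
}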

The results for the 11 languages with different coefficient settings of the LID system are listed in Table 4.2 and Table 4.3. The evaluation with uni-gram, bi-gram and normalization coefficients of 0.8, 0.4 and 1, respectively, gives a better overall performance than the system with weights of 0.6, 0.8 and 1. In an LID application, the weights of the system should be tuned according to the results obtained for the languages under test.

Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1

%          LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN           15.0            15.0                  30.0
FA           15.0            30.0                  15.0
FR           20.0            85.0                  55.0
GE           55.0            45.0                  75.0
HI           30.0            35.0                  45.0
JA           00.0            15.0                  15.0
KO           15.0            10.0                  15.0
MA           05.0            15.0                  15.0
SP           35.0            00.0                  35.0
TA           45.0            15.0                  35.0
VI           00.0            15.0                  05.0
Overall      21.4            25.5                  30.9

Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1

%          LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN           55.0            15.0                  50.0
FA           20.0            30.0                  25.0
FR           40.0            85.0                  65.0
GE           25.0            45.0                  50.0
HI           25.0            35.0                  40.0
JA           00.0            15.0                  10.0
KO           40.0            10.0                  40.0
MA           10.0            15.0                  20.0
SP           25.0            00.0                  25.0
TA           45.0            15.0                  40.0
VI           15.0            15.0                  15.0
Overall      25.2            25.5                  34.5

The confusion probabilities of the target languages are given in Table 4.4. It is observed that English test files are most often confused with French and Spanish, each with a probability of 0.1. French, German and Hindi are mostly confused with English, with probabilities of 0.15, 0.1 and 0.15, respectively.


Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1
(rows: languages of the test files; columns: languages of the model files)

        EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
EN     0.50  0.05  0.10  0.05  0.00  0.05  0.05  0.05  0.10  0.05  0.00
FA     0.15  0.25  0.20  0.10  0.00  0.00  0.00  0.00  0.10  0.00  0.00
FR     0.15  0.00  0.65  0.00  0.00  0.05  0.00  0.00  0.05  0.00  0.00
GE     0.10  0.05  0.05  0.50  0.00  0.00  0.00  0.00  0.00  0.05  0.00
HI     0.15  0.00  0.00  0.00  0.40  0.10  0.05  0.00  0.00  0.10  0.00
JA     0.00  0.05  0.10  0.10  0.10  0.10  0.05  0.10  0.15  0.05  0.00
KO     0.05  0.00  0.10  0.05  0.05  0.00  0.40  0.00  0.10  0.05  0.05
MA     0.10  0.05  0.10  0.10  0.15  0.00  0.05  0.20  0.05  0.05  0.00
SP     0.15  0.00  0.05  0.10  0.05  0.05  0.05  0.00  0.25  0.05  0.05
TA     0.00  0.00  0.00  0.05  0.00  0.05  0.10  0.00  0.20  0.40  0.00
VI     0.00  0.05  0.15  0.10  0.15  0.10  0.10  0.05  0.00  0.00  0.15

For the final evaluations of the system, different language groups are tested in order to compare the results with those reported in previous studies.

The first test set includes six languages: English, German, Hindi, Japanese, Mandarin and Spanish. This set was used in a study by Zissman in 1995, which applies a high-level LID system with multiple single-language phoneme recognizers followed by n-gram language models. The comparison with the proposed system, using coefficients of 0.8, 0.4 and 1 for the uni-gram, bi-gram and normalization factors, respectively, is listed in Table 4.5.

Table 4.5. Percentage of correct identification for six-language test

%        PRLM-P* (10 sec.)  PRLM-P* (45 sec.)  LD-NNMS  LI-Phoneme Recognizer  Superposed System
EN             ~57                ~75            60.0           25.0                 60.0
GE             ~56                ~73            30.0           80.0                 90.0
HI             ~55                ~70            35.0           55.0                 50.0
JA             ~51                ~66            05.0           40.0                 35.0
MA             ~53                ~68            15.0           45.0                 40.0
SP             ~55                ~75            30.0           30.0                 50.0
Overall        ~54.5              ~71.6          29.2           45.8                 54.2

* Phoneme Recognition followed by Language Modeling performed in Parallel.

The other set comprises three languages: English, German and Spanish. In the study of Dalsgaard in 1996, an LID system consisting of three parallel phoneme recognizers followed by three language-modeling modules, each characterizing the bi-gram probabilities, is applied. The language-dependent phoneme models are used together with language-independent speech units. The results are given in Table 4.6 together with those of our proposed system with coefficients of 0.6, 0.8 and 1 for the uni-gram, bi-gram and normalization factors, respectively.

Table 4.6. Percentage of correct identification for three-language test

%        Phoneme Clusters*   LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN             90.0            20.0            80.0                  70.0
GE             82.0            65.0            90.0                  90.0
SP             77.0            50.0            60.0                  80.0
Overall        83.0            45.0            76.6                  80.0

* Language-independent phoneme clusters of speech units included.

The final test of the proposed system is the pair-wise test of English against ten languages in the OGI database. The results listed in Table 4.7 are compared with those obtained by Zissman in 1993 and Arslan in 1997.

Table 4.7. Percentage of correctness for pair-wise test of English

Test Pairs   Lincoln* Algorithm   Viterbi** Algorithm   NNMS   HTK-PR   Superposed System
EN-FA                80                   84              70      80            72
EN-FR                83                   84              70      85            85
EN-GE                67                   70              65      90            85
EN-HI                 -                    -              68      80            85
EN-JA                79                   91              80      72            88
EN-KO                82                   78              78      70            85
EN-MA                86                   92              72      70            72
EN-SP                83                   92              63      88            82
EN-TA                84                   84              75      80            80
EN-VI                78                   83              72      88            88
Overall              72                   84              71      80            82

* Results in the study of Zissman (1993), ** results in the study of Arslan (1997).

When the proposed system is compared with the previous studies, the overall results show a relative decrease of 24.3 per cent for the six-language test, a decrease of 3.6 per cent for the three-language test, and, for the pair-wise test, a decrease of 2.4 per cent against one system and an increase of 13.9 per cent against the other (relative changes computed from the overall rates in Tables 4.5-4.7, e.g. (71.6 - 54.2)/71.6 is approximately 24.3 per cent).


5. CONCLUSIONS

The amount of available speech data is critical for any speech processing application. For LID systems, the problem is even more severe since the required amount is multiplied by the number of languages in the target set. In order to build a high-level system, the speech data must be processed so that the phoneme boundaries of the utterances are labeled.

Our motivation in this thesis was to propose a language identification system that does not require labeled speech data. The system comprises two branches processing in parallel: nearest neighbor selection over language-dependent GMMs and a monolingual phoneme recognizer with language-dependent network files.

The first method uses Gaussian mixtures, generated by vector quantization, for each language in the OGI database. The Gaussian that best fits the input vector of mel-cepstrum coefficients, c, is selected as the nearest one in terms of the log-likelihood distance. All languages in the target set are modeled by their distributions of mixtures, and 32 mixtures are generated for each. The frequency of occurrence of a mixture is assigned as its uni-gram value, and the frequency of one mixture following another is assigned as the bi-gram value. These values are weighted with the optimum coefficients obtained during the development phase and added to the log likelihood of the relevant frame. The results for each frame are accumulated over the processed file. In the second method, an HMM-based phoneme recognizer is built using the speech recognition toolkit HTK. Since we do not have labeled speech data to model all languages, the TÜBİTAK-TURTEL Turkish database is used for training the eight-mixture phoneme HMMs. Using these models as a reference, the phoneme statistics of each language are obtained and inserted into the generated network structure.

The two separate systems are processed in parallel and their outputs are merged at the final stage. In the overall identification performance, the superposed system shows an increase of 36.9 per cent and 35.3 per cent compared with the language-dependent Gaussian mixtures and the mono-lingual phoneme recognizer, respectively.


When the proposed system is compared with previous studies such as the PRLM-P of Zissman and the phoneme clusters of Dalsgaard, decreases of 24.3 per cent and 3.6 per cent, respectively, are obtained in the overall results. In the system of Zissman, which includes parallel phoneme recognizers of six languages followed by language models, and also in the system of Dalsgaard, high-level pre-processed speech data are used to train the acoustic models of the languages under test. In the pair-wise test of English against the other languages, the performance of the proposed system shows a decrease of 2.4 per cent with respect to the method of Arslan and an increase of 13.9 per cent with respect to the HMM-based method of Zissman. In all these systems, the need for labeled speech data and linguistic information about the languages limits the variety of languages that can be included in the application, since inserting a new language becomes a difficult task. With the proposed method, a robust LID system with a tolerable performance is built that easily integrates any language into the application without such restrictions.


APPENDIX A: LIST OF OGI TEST FILES

The test files used for the evaluations are spontaneous utterances of 45 seconds (*stb.wav) listed in Table A.1. Twenty files per language are selected from the OGI multi-language telephone-speech database.

Table A.1. List of OGI files used in evaluations

EN   FA   FR   GE   HI   JA   KO   MA   SP   TA   VI
031  031  072  072  336  092  039  083  076  078  085
077  077  075  075  334  097  073  090  077  105  094
078  078  081  077  335  100  074  093  078  108  098
081  081  082  078  337  101  079  098  079  110  100
083  083  085  080  189  102  094  101  080  111  101
087  087  086  083  182  110  101  105  082  114  104
093  093  098  086  112  116  105  106  087  117  105
097  097  100  102  342  117  109  109  089  122  107
099  099  102  118  340  118  110  118  093  132  114
105  105  104  120  348  121  111  119  095  133  117
106  106  105  123  349  124  112  124  102  137  127
107  107  117  125  355  129  113  140  103  138  129
109  109  122  129  354  131  118  146  105  139  131
113  113  123  136  352  136  120  147  108  142  134
114  114  124  144  358  137  139  149  110  148  135
115  115  126  148  367  138  141  160  115  150  146
116  116  131  150  363  139  143  163  117  152  149
117  117  132  152  368  140  145  165  118  153  152
118  118  138  156  375  141  147  174  121  155  153
123  123  143  157  377  142  148  180  143  179  156


APPENDIX B: MATLAB FILE FOR COMPUTING MFCC

function m_pFeature = mfcc8000(x)
% mfcc8000(x): converts one frame of raw speech samples to an MFCC vector.
fbank = 17;                     % m_nFilterBanks
mel = 14;                       % m_nFeatureVectorLength
m_DimNorm = sqrt(2.0/fbank);
fs = 8000;
melmax = 2595*log10(1.0 + (fs/2.0)/700.0);
tmp = melmax/(fbank+1);         % mel spacing between filter-bank centers
% center frequencies of the triangular filters (FFT bin indices)
for i = 1:fbank+2
    m_CenterFreqs(i) = floor(1.5 + (512/4000)*700.0*(10^((1.0/2595.0)*(i-1)*tmp) - 1.0));
end
x = filter([1 -0.97], 1, x);    % pre-emphasis
x = x.*hanning(length(x));      % Hanning window
fx = abs(fft(x, 1024));         % magnitude spectrum
first = 0;
last = fbank;
for i = first+1:last            % log filter-bank energies
    m(i-first) = 0.0;
    len = m_CenterFreqs(i+2) - m_CenterFreqs(i) + 1;
    wgt = triang(len)/sum(triang(len));
    m(i-first) = log(max(1, sum(fx(m_CenterFreqs(i):m_CenterFreqs(i+2)).*wgt)));
end
m = m(1:fbank-1);
fbank = length(m);
for i = 1:mel                   % DCT of the log energies gives the cepstrum
    m_pFeature(i) = 0.0;
    for j = 1:fbank
        m_pFeature(i) = m_pFeature(i) + m(j)*cos((((i-1)*pi)/fbank)*(j-0.5));
    end
    m_pFeature(i) = m_DimNorm*m_pFeature(i);
end
m_pFeature(1) = 0.1*m_pFeature(1);   % scale down the first coefficient
m_pFeature = m_pFeature(:);
m_pFeature = m_pFeature(1:mel);
return;


APPENDIX C: C FUNCTIONS FOR GMM TRAINING AND NNMS

// TRAINMODEL -----------------------------------------------------
// INPUT : script file of parametrized data used for training.
// OUTPUT: model file with mean, variance and weight values.
void CTrain::TrainModel(int numMixes)
{
    FILE *fid, *finput, *fout;
    char inputfilename[FNLEN];
    int argmax;
    int frmNo = 0;

    fid = FRead(scriptfile);
    printf("training started..\n");
    SetNumMixes(numMixes);
    weights = new float[numMixes];
    means = new float[numMixes][NUMMEL];
    vars = new float[numMixes][NUMMEL];
    memset(weights, 0, numMixes*sizeof(float));
    memset(means, 0, numMixes*NUMMEL*sizeof(float));
    memset(vars, 0, numMixes*NUMMEL*sizeof(float));

    /* initialization */
    while (!feof(fid)) {
        if (!ReadLineFromScript(fid, inputfilename))
            break;
        finput = FRead(inputfilename);              // read cepstrum values.
        while (ReadFrame(finput, ceps, NUMMEL)) {
            frmNo++;                                // total number of frames.
            UpdateAvg(means[0], ceps, frmNo);       // update mean.
        }
        fclose(finput);
    }
    numberofFrms = frmNo;

    /* mixture splitting */
    fseek(fid, 0, SEEK_SET);
    for (int nmix = 1; nmix < numMixes; nmix++) {
        // find mixture of max weight.
        argmax = FindMaxWeightArg(weights, nmix);
        // increase the number of mean vectors by splitting.
        SplitMean(means[argmax], ceps);
        for (int nmel = 0; nmel < NUMMEL; nmel++)
            means[nmix][nmel] = ceps[nmel];
        printf("update %d mixtures\n", nmix+1);
        UpdateMeans(fid, nmix+1);
        fseek(fid, 0, SEEK_SET);
    }

    /* iterations of training */
    for (int iter = 0; iter < numIter; iter++) {
        printf("iterno: %d\n", iter+1);
        UpdateMeans(fid, numMixes);
        fseek(fid, 0, SEEK_SET);
    }
    UpdateVars(fid, numMixes);

    fout = FWrite(modelfile);
    WriteModelFile(fout, means, vars, weights);
    fclose(fout);
    fclose(fid);
    delete [] means;
    delete [] vars;
    delete [] weights;
    return;
}

// FIND NEAREST MODEL ------------------------------
// INPUT : observation input and model index.
// OUTPUT: most likely model.
float CTest::FindNearest(float *frm, int m, int *argmin)
{
    int nm, nc;
    float mdiff, mindist, dist;

    mindist = (float)MAXVALUE;
    for (nm = 0; nm < numMixes; nm++) {
        dist = 0;
        for (nc = 0; nc < NUMMEL; nc++) {
            mdiff = frm[nc] - means[m*numMixes + nm][nc];
            dist = dist + mdiff*mdiff/vars[m*numMixes + nm][nc];
        }
        if (dist < mindist) {
            mindist = dist;
            argmin[0] = nm;
        }
    }
    return mindist;
}
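A hypothetical caller of FindNearest is sketched below for clarity; it is not part of the original appendix and only illustrates how the per-frame distances could be accumulated when a whole utterance is scored against one language model.

// Hypothetical usage sketch (assumption): accumulate the nearest-mixture
// distances of all frames for the hypothesized language with index lang.
// The language with the smallest accumulated distance is ranked first.
float ScoreLanguage(CTest &test, float frames[][NUMMEL], int numFrames, int lang)
{
    float total = 0.0f;
    int mixIndex;
    for (int t = 0; t < numFrames; t++)
        total += test.FindNearest(frames[t], lang, &mixIndex);
    return total;
}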


APPENDIX D: HTK COMMANDS FOR HMM GENERATION

HCopy

D:\heteke\bin> HCopy -C .\config.txt -S ..\script\parmAllFiles.list -T 1

HInit

G:\heteke\bin> HInit -S ..\script\initFiles.scp -l a -L ..\label\initLabs -M ..\HInitMFC\hmmA ..\script\proto.txt

HRest

G:\heteke\bin> HRest -S ..\script\initFiles.scp -l a -L ..\label\initLabs -M ..\hmm\hmm0 ..\hmm\hmmA\a.hmm

HERest

G:\heteke\bin> HERest -S ..\script\trainFiles.scp -I ..\label\all.mono.mlf -H

..\hmm\hmm01\model1.gmm -M ..\hmm\hmm02 ..\lib\monophones.list

HHEd

G:\heteke\bin> HHEd -H ..\hmm\hmm6\model1.gmm -M ..\hmm\hmm7 ..\script\mixsplit.hed

..\lib\monophones.list

HVite

for bi-gram statistics:

D:\heteke\bin> HVite -H ..\hmm\hmm21\model8.gmm -S ..\script\train\train_EN.list -T 1 -i

..\stats\ogi_EN.out -w ..\lib\net\mono.net ..\lib\dic\mono.dic ..\lib\monophones.list

for final evaluations:

D:\heteke\bin> HVite -H ..\hmm\hmm21\model8.gmm -S ..\script\test\stb_EN.list -T 1 -i

..\result\ogi_EN.out -w ..\lib\net\net_EN.net ..\lib\dic\mono.dic ..\lib\monophones.list


APPENDIX E: NET FILE OF HTK FOR PHONEME RECOGNITION

VERSION=1.0
N=30  L=840
I=0   W=!NULL
I=29  W=!NULL
I=1   W=a
I=2   W=b
I=3   W=c
I=4   W=C
I=5   W=d
I=6   W=e
I=7   W=f
I=8   W=g
I=9   W=h
I=10  W=I
I=11  W=i
I=12  W=j
I=13  W=k
I=14  W=l
I=15  W=m
I=16  W=n
I=17  W=o
I=18  W=O
I=19  W=p
I=20  W=r
I=21  W=s
I=22  W=S
I=23  W=t
I=24  W=u
I=25  W=U
I=26  W=v
I=27  W=y
I=28  W=z
J=0   S=0  E=1
J=1   S=0  E=2
J=2   S=0  E=3
...
J=26  S=0  E=27
J=27  S=0  E=28
J=28  S=1  E=29
J=29  S=2  E=29
...
J=55  S=28 E=29
J=56  S=1  E=1   l=-11.51293
J=57  S=1  E=2   l=-6.16081
...
J=83  S=1  E=28  l=-8.93340
J=84  S=2  E=1   l=-7.14164
...
J=838 S=28 E=27  l=-11.51293
J=839 S=28 E=28  l=-7.2286


REFERENCES

Adda-Decker, M., 2000, Towards Multilingual Interoperability in Automatic Speech

Recognition, Spoken Language Processing Group Report, LIMSI, France.

Arslan, L. M., and J. H. L. Hansen, 1997, “Frequency Characteristics of Foreign Accented

Speech”, IEEE, pp. 1123-1126.

Cimarusti, D., and R. B. Ives, 1982, “Development of an Automatic Identification System

of Spoken Languages: phase I”, ICASSP, pp. 1661-1663.

Dalsgaard, P., O. Andersen, H. Hesselager, B. Petek, 1996, “Language Identification using

Language Dependent Phonemes and Language Independent Speech Units”, ICSLP,

pp. 1808-1811.

Goodman, F. J., A. F. Martin and R. E. Wohlford, 1989, “Improved Automatic Language

Identification in Noisy Speech”, ICASSP, Vol. 1, pp. 528-531.

Greenberg, S., 2001, “What are the Essential cues for Understanding Spoken

Language?”, Presented at the 141st Meeting of the Acoustical Society of America,

Chicago, Vol. 2.

Hazen, T. J. and V. W. Zue, 1993, “Automatic Language Identification using a Segment-

Based Approach ”, Eurospeech, Vol. 2, pp. 1303-1306.

Hess, W., 1983, Pitch Determination of Speech Signals, Springer-Verlag, New York.

Hieronymus, J. L. and S. Kadambe, 1997, “Robust Spoken Language Identification using

Large Vocabulary Speech Recognition”, ICASSP, Vol. 2, pp. 1111-1114.

Higgins et al., 1993, ICASSP, Vol. 2, pp. 275-278.


House, A. S. and E. P. Neuburg, 1977, “Toward Automatic Identification of the Language

of an Utterance. I. Preliminary Methodological Considerations”, J. Acoust. Soc.

Amer., Vol. 62, pp. 708-713.

Kadambe, S. and J. L. Hieronymus, 1995, “Language Identification with Phonological and

Lexical Models”, ICASSP, Vol. 5, pp. 3507-3511.

Ladefoged, P., 1962, Elements of Acoustic Phonetics, The University of Chicago Press.

Lamel, L. F. and J. L. Gauvain, 1993, “Cross-Lingual Experiments with Phone

Recognition”, ICASSP, Vol. 2, pp. 507-510.

Li, K. -P., 1994, “Automatic Language Identification Using Syllabic Spectral Features”,

ICASSP, Vol. 1, pp. 297-300.

Linde, Y., A. Buzo, and R. M. Gray, 1980, “An Algorithm for Vector Quantizer Design”,

IEEE Trans. Commun., COM-28, Vol. 1, pp. 84-95.

Mendoza et al., 1996, “Automatic Language Identification using Large Vocabulary

Continuous Speech Recognition”, ICASSP, Vol. 2, pp. 785-788.

Muthusamy, Y. K., R. A. Cole and B. T. Oshika, 1992, “The OGI Multi-Language

Telephone Speech Corpus,” Proceedings International Conference on Spoken

Language Processing 92, Banff, Alberta, Canada.

Muthusamy, Y. K., N. Jain, R. A. Cole, 1994, “Perceptual Benchmarks for Automatic

Language Identification”, International Conference on Acoustics, Speech, and Signal

Processing, Vol. 1, pp. 333-336.

Nakagawa, S., Y. Ueda and T. Seino, 1992, “Speaker-Independent, Text-Independent

Language Identification by HMM”, ICASSP, Vol. 2, pp. 1011-1014.


Pellegrino, F., J. Farinas and R. André-Obrecht, 1999, “Vowel System Modeling: A

Complement to Phonetic Modeling in Language Identification”, Proceedings of the

ESCA-NATO Workshop on Multi-Lingual Interoperability in Speech Technology

(MIST), Leusden, The Netherlands, pp.119-124.

Picone, J., 1993, “Signal Modeling Techniques in Speech Recognition”, Proceedings of the IEEE.

Schultz, T. and A. Waibel, 1998, “Language Independent and Language Adaptive Large

Vocabulary Speech Recognition”, ICASSP, Vol. 5, pp. 1819-1823.

Sugiyama, M., 1991, “Automatic Language Recognition using Acoustic Features”,

ICASSP, Vol. 2, pp. 813-816.

Yapanel, Ü., T. İslam, M. U. Doğan and H. Palaz, 2001, TURTEL Database Technical

Report, TÜBİTAK-UEKAE.

Young et al., 2000, The HTK Book.

Wong, E. and S. Sridharan, 2001, “Comparison of Linear Prediction Cepstrum Coefficients

and Mel-Frequency Cepstrum Coefficients for Language Identification”, Proceedings

of 2001 International Symposium on Intelligent Multimedia, Video and Speech

Processing, pp. 95-98.

Zissman, M. A., 1993, “Automatic Language Identification using Gaussian Mixture and

Hidden Markov Models”, ICASSP, Vol. 2, pp. 399-402.

Zissman, M. A. and K. M. Berkling, 2001, “Automatic Language Identification”, Speech

Communication, Vol. 35, pp. 115-124.

Zissman, M. A. and E. Singer, 1994, “Automatic Language Identification of Telephone

Speech Messages using Phoneme Recognition and N-gram Modeling”, ICASSP, Vol.

1, pp. 305-308.