AN UNSUPERVISED APPROACH FOR
AUTOMATIC LANGUAGE IDENTIFICATION
by
Tuba İslam
B.S. in Electrical and Electronics Eng., Boğaziçi University, 2000
Submitted to the Institute for Graduate Studies in
Science and Engineering in partial fulfillment of
the requirements for the degree of
Master of Science
Graduate Program in Electrical and Electronics Engineering
Boğaziçi University
2003
ABSTRACT
AN UNSUPERVISED APPROACH FOR AUTOMATIC LANGUAGE IDENTIFICATION
Today, multi-language communication applications, which can serve people from
different nations in their native languages, have become increasingly important.
Automatic Language Identification plays a significant role in the pre-processing phase of
such multi-language systems. Conventional systems require a difficult and time-consuming
process of labeling the phoneme boundaries of the utterances in the speech corpus.
In this work, we propose an unsupervised method for building an automatic language
identification system that requires neither a labeled speech database nor linguistic
information about the target languages. The method comprises two branches processing in
parallel: a nearest neighbor selection method using language-dependent Gaussian mixtures,
and a mono-lingual phoneme recognition method using language-dependent network files.
The performance of the system is compared with previous studies; a 24.3 per cent decrease
is observed in the worst case and a 13.9 per cent increase in the best case. With the
proposed method, a robust system with a tolerable performance is built, into which any
language can easily be integrated for the LID application.
ÖZET
OTOMATİK DİL TANIMADA GÖZETİMSİZ YAKLAŞIM
Günümüzde insanlara kendi dillerinde hizmet sunabilmek için farklı dillerde iletişim
kurabilen sistemlere duyulan gereksinim giderek artmaktadır. Otomatik Dil Tanıma
uygulaması çok dilli sistemlerde ön işlem olarak yer almaktadır. Geleneksel sistemlerde
modellerin eğitimi öncesinde ses verisi zahmetli ve zaman alıcı bir etiketleme aşamasından
geçerek konuşmalardaki fonem sınırları belirlenmektedir. Bu tezde izlenen yöntem
etiketlenmiş veritabanına veya dillere ait linguistik bilgiye ihtiyaç duymaksızın geliştirilen
otomatik dil tanıma sistemine ait gözetimsiz bir yaklaşım içermektedir. Bu yöntem paralel
olarak işleyen iki ayrı daldan oluşmaktadır; dile özgü Gauss karışımlar kullanılarak
gerçekleştirilen en yakın komşu karışım seçimi yöntemi ve dile özgü ağ yapıları
kullanılarak gerçekleştirilen tek dilde eğitilmiş fonem tanıma yöntemi. Elde edilen
sonuçlarla önceki çalışmalar karşılaştırıldığında en kötü durumda başarımın yüzde 24.3
azaldığı, en iyi durumda ise yüzde 13.9 arttığı görülmüştür. Önerilen yöntem kullanılarak
kabul edilebilir bir başarıma sahip gürbüz bir dil tanıma sistemi geliştirilmiştir.
TABLE OF CONTENTS
ABSTRACT .......................................................................................................................ii
ÖZET...................................................................................................................................iii
LIST OF FIGURES.........................................................................................................vi
LIST OF TABLES.........................................................................................................viii
LIST OF ABBREVIATIONS.......................................................................................ix
1. INTRODUCTION ...................................................................................................1
1.1. Motivation..............................................................................................................1
1.2. Applications ...........................................................................................................2
1.3. Language Discrimination Basics ...........................................................................2
1.4. Previous Research..................................................................................................7
1.4.1. LID Using Spectral Content ......................................................................8
1.4.2. LID Using Prosody ....................................................................................9
1.4.3. LID Using Phone-Recognition ................................................................10
1.4.4. LID Using Word-Recognition .................................................................11
1.4.5. LID Using Continuous Speech Recognition............................................11
2. THEORETICAL BACKGROUND ..................................................................13
2.1. Speech Representation.........................................................................................13
2.2. Vector Quantization.............................................................................................15
2.3. Hidden Markov Model Topology ........................................................................17
3. BASE SYSTEM FOR LANGUAGE IDENTIFICATION .........................23
3.1. OGI Multi-Language Database............................................................................23
3.2. Feature Extraction................................................................................................24
3.3. Gaussian Mixture Modeling using Vector Quantization .....................................25
4. PROPOSED SYSTEM.........................................................................................28
4.1. Language-Dependent Gaussian Mixtures............................................................28
4.1.1. Uni-gram and Bi-gram Modeling ............................................................28
4.1.2. Language Normalization..........................................................................30
4.2. Unsupervised Language Modeling Using Mono-Lingual Phoneme Recognizer 31
4.3. Superposition of Two Methods............................................................................33
5. CONCLUSIONS....................................................................................................38
APPENDIX A: LIST OF OGI TEST FILES ...........................................................40
APPENDIX B: MATLAB FILE FOR COMPUTING MFCC............................41
APPENDIX C: C FUNCTIONS FOR GMM TRAINING AND NNMS ........42
APPENDIX D: HTK COMMANDS FOR HMM GENERATION...................44
APPENDIX E: NET FILE OF HTK FOR PHONEME RECOGNITION.......45
REFERENCES ................................................................................................................46
LIST OF FIGURES
Figure 1.1. An experimental perspective of spoken language...............................................3
Figure 1.2. Phoneme error rates of some languages ..............................................................5
Figure 1.3. Perceptual language identification results...........................................................7
Figure 2.1. Filter bank for computing MFCCs ....................................................................14
Figure 2.2. Representation of a mixture with M-components .............................................18
Figure 2.3. Matrix visualization of Viterbi algorithm .........................................................22
Figure 3.1. Distributions of c1 vs. c0 coefficients of mixtures for some languages .............24
Figure 3.2. Distance search algorithm .................................................................................26
Figure 3.3. Mean and variance of target-language ranks with a plain NNMS system ........27
Figure 4.1. Mean and variance values of target-language ranks for different u values.......29
Figure 4.2. Mean and variance values of target-language ranks for different b values.......29
Figure 4.3. Mean and variance values of target-language ranks for different n values.......30
Figure 4.4. Five-state left-to-right HMM architecture.........................................................31
Figure 4.5. Mean and variance values of language ranks with phoneme recognizer ..........33
Figure 4.6. The superposition of two methods ....................................................................33
Figure 4.7. The algorithm for merging the two results........................................................34
LIST OF TABLES
Table 1.1. Phoneme categories of Turkish with examples of words .....................................4
Table 1.2. Phoneme categories of English with examples of words .....................................4
Table 3.1. Number of speakers in OGI database .................................................................23
Table 3.2. Effect of feature normalization on LID for baseline system ..............................25
Table 3.3. Mean values of language ranks obtained by plain NNMS .................................27
Table 4.1. Mean values of language ranks obtained by phoneme recognizer .....................32
Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1 ..................................34
Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1 ............................35
Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1 .........36
Table 4.5. Percentage of correct identification for six-language test ..................................36
Table 4.6. Percentage of correct identification for three-language test ...............................37
Table 4.7. Percentage of correctness for pair-wise test of English......................................37
LIST OF ABBREVIATIONS
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
Hz Hertz
LID Language Identification
LD Language Dependent
LI Language Independent
LPCC Linear Predictive Cepstrum Coefficients
MFCC Mel Frequency Cepstrum Coefficient
ML Mono Lingual
NNMS Nearest Neighbor Mixture Selection
OGI-TS Oregon Graduate Institute-Telephone Speech
PRLM-P Phoneme Recognition followed by Language Modeling - Parallel
TURTEL Turkish Telephone Speech Corpora
VQ Vector Quantization
1. INTRODUCTION
1.1. Motivation
The necessity for multilingual capabilities grows with the development of worldwide
communication. Speaking different languages will remain an obstacle until either
multilingual large-vocabulary continuous speech recognition or automatic language
identification systems reach excellent performance and reliability. Automatically
identifying a language from the acoustics alone, without understanding the language, is a
challenging problem. A multilingual person has no problem identifying the languages he
understands; word spotting is the basic method the brain follows during this process. In a
human-made system, however, it is not easy to model the words of different languages, and
building a successful word-based language identification system requires a large amount of
labeled speech data and linguistic information about the target languages.
In all speech processing applications, one major restriction on performance is the
limitation of data. The accuracy of the system relies on the quality and variety of the
database. Another restriction is the necessity of labeled speech corpora for the
languages under test. The labeling of raw speech material and other linguistic data
takes great effort and time. Systems using multiple large vocabulary continuous speech
recognizers give the best results. These systems include a complete word recognizer for
each language and use word and sentence level language modeling. To build such a
system, a large amount of labeled speech is necessary to train the recognizers and large
amounts of written text are needed to train language models of word n-grams. A simpler
but successful approach is parallel language-dependent phone-recognition followed by
language modeling but since it is based on multiple language-specific phone recognizers, it
also requires labeled speech to train those recognizers.
Our motivation in this thesis is to develop methods for building a language
identification system that does not require linguistic information or labeled speech
corpora for the target languages. The system will not depend heavily on pre-processed
data, and therefore there will be no difficulty in adapting the application to a new
language.
1.2. Applications
The purpose of a language identification application includes automatically
adapting a speech-based tool, such as online banking or information retrieval, to the
native language of the user. With the growth of the Internet, we now live in a worldwide
society, communicating and doing business with people who use a wide variety of
languages, which makes language identification more important each day. Multilingual
environments may have a political, military, scientific, commercial or tourist context
(Adda-Decker, 2000).

Just a few of the many places where a language identifier may be useful are natural
language processing systems, information retrieval systems, speech mining applications,
speech file filtering systems, knowledge management systems, software translation
services, and anywhere else one might need to work with more than one language.
1.3. Language Discrimination Basics
Humans and machines can use many different attributes to distinguish one language
from another. There are some essential cues for identifying a spoken language. The most
reliable way is to catch some of the spoken words. Detecting phones that are uncommon in
most languages, or focusing on intonation and stress, also helps but is not sufficient on its
own. The basic perspective of a spoken language is presented in Figure 1.1
(Greenberg, 2001).
[Figure 1.1 depicts a layered view of spoken language, from acoustics (modulation
spectrum) through phonetic features, segments and syllables up to prosody, the lexicon and
understanding (syntax, morphology), with characteristic time scales ranging from about
40 ms for features to roughly 1000 ms for word- and sentence-level units.]
Figure 1.1. An experimental perspective of spoken language
One way of representing speech sounds is by using phonemes. A “phoneme” is an
abstract representation of a phonological unit in a language. Across the world’s languages,
the sound inventory varies considerably. The size of the phoneme inventory used for
speech recognition can be 29 phonemes, as in Turkish, or 46 phonemes, as in
Portuguese. Formally, we can define the phoneme as a linguistic unit such that, if one
phoneme is substituted for another in a word, the meaning of that word could change. This
is only true for the set of phonemes of a single language; therefore, in a single language, a finite
set of phonemes exists. However, when different languages are compared, there are
differences; for example, in Turkish, /l/ and /r/ (as in "laf" and "raf") are two different
phonemes, whereas in Japanese, they are not (Ladefoged, 1962). Similarly, the presence of
distinctive sounds, such as the “clicks” found in some sub-Saharan African languages or
the velar fricatives found in Arabic, catches the attention of listeners fluent in languages
that do not contain these phonemes. Still, as the vocal apparatus used in producing
languages is universal, phoneme sets mostly overlap and the total number of phonemes is finite
(Ladefoged, 1962). The Turkish phonemes, subdivided into groups based on the way they
are produced, are given in Table 1.1.
Table 1.1. Phoneme categories of Turkish with examples of words
Vowels: kim, gül, kel, göl, çal, yıl, kul, bol
Semivowels: rey, lale, yer, lala
Fricatives: sar, şal, far, hep, zor, dağ, ver, jüri
Nasals: mal, nal
Plosives: bul, del, gir, pul, ter, kaç, gem
Affricates: can, çam
Different from the Turkish phoneme structure, there are many diphthongs in some
other languages like English and German. The classification of English phonemes is given
with examples of words in Table 1.2.
Table 1.2. Phoneme categories of English with examples of words
Vowels: heed, hid, head, had, hard, hod, hoard, hood, who'd, hut, heard
Diphthongs: bay, by, bow, bough, beer, doer, boar, boy, bear
Semivowels: was, ran, lot, yacht
Fricatives: sail, ship, funnel, thick, hull, zoo, azure, that, valve, the
Nasals: am, an, sang
Plosives: bat, disc, goat, pool, tap, kite
Affricates: jaw, chore
The vowel systems also differ from one language to the other. In the study of
Pellegrino et al. in 1999, the phonemic differences based on vowels were taken into
account in language identification. Five languages (Spanish, Japanese, Korean, French and
Vietnamese) were chosen for the evaluation of the LID system because of their
phonologically different vowel systems. The Spanish and Japanese vowel systems are, for
example, rather simple, as they include only five vowels. On the other hand, the Korean
and French systems are quite complex and make use of secondary articulations (long
vs. short vowel opposition in Korean and nasalization in French).
The phoneme error rate of a language correlates with the number of phonemes used
to model that language. In the study of Schultz (2001), the acoustic confusability of
languages, obtained with phoneme-based recognizers, is given in Figure 1.2. The
phoneme error rates range from 33.8 per cent to 46.4 per cent. Turkish is an exception in
this result because of the high substitution rate between the vowels “e”, “i” and “y”.
Figure 1.2. Phoneme error rates of some languages as an example for acoustic
confusability
It is also possible to distinguish between speech sounds depending on the way they
are produced. The speech units in this case are known as the phones. A “phone” is a
realization of an acoustic-phonetic unit or segment. It is the actual sound produced when a
speaker thinks of speaking a phoneme. Phone and phoneme sets differ from one language
to another, even though many languages share a common subset of phones and phonemes
(Schultz, 2001). Phone and phoneme frequencies of occurrence may also differ from one
language to another, and the phonotactics, i.e. the rules governing the allowed sequences
of phones and phonemes, differ in most cases as well.
There are more phones than phonemes, as some phonemes are produced in different
ways depending on the context. For example, the pronunciation of the phoneme /l/ differs
slightly when it occurs before consonants and at the end of utterances, as in “salı”
(Tuesday) and in “kalk” (wake up). As both are different forms of the same phoneme,
they form a set of allophones. Any machine-based speech recognizer needs to be
aware of the existence of allophone sets.
The morphology, i.e. the word roots and lexicons, also usually differs from one
language to another. Each language has its own vocabulary and its own way of forming
words.
Stress, rhythm and intonation of speech are the prosodic features. The duration of
phonemes, pitch characteristics, and stress patterns differ from one language to another.
Stress is used at two different levels: it indicates the most important words in a sentence
and the prominent syllables in a word, which may change the meaning completely. As an
example, the English word “object” could be understood as either a noun or a verb,
depending on whether the stress is placed on the first or the second syllable. Intonation, or
pitch movement, is very important in indicating the meaning of an English sentence. In
tonal languages, such as Mandarin and Vietnamese, the intonation determines the meaning
of individual words as well.
The syntax, i.e. the sentence patterns, also differs among languages. Although the
same word may be shared by two languages, such as “hat” in German and Turkish or “ten”
in English and Turkish, the neighboring words in the sentence, as well as the suffixes or
prefixes attached to the word, will be different.
The perceptual confusion of different languages was examined by Muthusamy in
1994; the responses of the subjects are shown in Figure 1.3.
Average subject performance (per cent correct) for the language codes EN, FA, FR, GE,
JA, KO, MA, SP, TA and VI:

First quarter: 100.0, 22.4, 84.1, 79.5, 38.2, 13.5, 24.5, 70.5, 51.7, 19.0
Last quarter: 100.0, 30.2, 83.3, 77.8, 40.0, 14.7, 27.3, 81.4, 60.9, 34.5
Figure 1.3. Perceptual language identification results
The difference between the first quarter and the last quarter, which denote the
beginning and the end of the test respectively, shows the learning effect on the subjects
during the test due to the feedback given after each response. Muthusamy notes that
Korean, which has the lowest score in human perception, is most often confused with
Farsi, Japanese, Mandarin, Tamil and Vietnamese.
1.4. Previous Research
Since the 1970s, research has focused on automatic language identification from
speech. The systems implemented to date vary mainly in their methods for modeling
languages. There are two phases of language identification: the “training phase” and the
“recognition phase”.
During the “training phase” the base system is presented with examples of speech
from a variety of languages. Each training speech utterance is converted into a stream of
feature vectors that are computed from short windows of the speech waveform (e.g. 20 ms).
The windowed speech waveform is assumed to be quasi-stationary. The feature
vectors are recomputed with a pre-defined step size (e.g. 10 ms, giving 50 per cent overlap)
and contain cepstral information about the speech signal. The training algorithm analyzes a
sequence of such vectors and produces a model for each language.
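The front end described above can be sketched as follows. This is a minimal illustration only, not the thesis's actual front end: 8 kHz telephone-band sampling and a Hamming window are assumptions here.

```python
import math

def frame_signal(samples, sample_rate=8000, win_ms=20, step_ms=10):
    """Split a waveform into short overlapping analysis windows.

    A 20 ms window recomputed every 10 ms gives the 50 per cent
    overlap mentioned in the text.
    """
    win = int(sample_rate * win_ms / 1000)    # samples per window
    step = int(sample_rate * step_ms / 1000)  # samples per step
    frames = []
    for start in range(0, len(samples) - win + 1, step):
        chunk = samples[start:start + win]
        # A Hamming window tapers the frame edges (an assumption here;
        # the exact window is a front-end design choice).
        frames.append([x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (win - 1)))
                       for i, x in enumerate(chunk)])
    return frames

# One second of a dummy 440 Hz tone at 8 kHz telephone-band sampling.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = frame_signal(signal)
```

Each frame would then be converted into a cepstral feature vector before training.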
During the “recognition phase” of LID, feature vectors computed from the new
utterance are compared to each of the language-dependent models. The likelihood that the
new utterance was spoken in the same language as the speech used to train each model is
computed by a distance measure, and the most likely model is found. The language of
the speech that was used to train this maximum-likelihood model is assigned as
the language of the utterance.
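The decision rule of the recognition phase reduces to an argmax over per-language scores. A minimal sketch, in which the language names and the log-likelihood values are hypothetical:

```python
def identify_language(per_frame_scores):
    """Sum each language model's per-frame log-likelihoods over the
    utterance and return the maximum-likelihood language."""
    totals = {lang: sum(scores) for lang, scores in per_frame_scores.items()}
    return max(totals, key=totals.get)

# Hypothetical per-frame log-likelihoods for a three-language test.
scores = {
    "english": [-4.1, -3.9, -4.4],
    "turkish": [-3.2, -3.0, -3.5],
    "german":  [-4.8, -4.6, -5.0],
}
```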
Complicated language identification systems use phonemes to model speech. During
the training phase of the phoneme models, these systems use either a phonetic transcription,
i.e. the sequence of symbols representing the spoken sounds, or an orthographic
transcription, i.e. the text of the words spoken, together with a phonemic transcription
dictionary.
1.4.1. LID Using Spectral Content
In the earliest research on language identification, developers focused on the
differences in spectral content among languages. The basic idea was that different
languages contain different phonemes and phones. A set of short-term spectra is obtained
from the training utterances, and these prototypes are compared to the ones obtained from
the test speech.
There are many options to choose from when implementing this approach. The
training and test spectra can be used directly, or they can be used to obtain feature vectors
for the speech, such as cepstrum coefficients or formant-based vectors. The training
prototypes can be chosen directly from the training utterances, or they can be synthesized
by K-means clustering. The similarity between the sets of training and test spectra can be
calculated with the Euclidean, Mahalanobis, or any other distance metric. Examples of
such language identification systems were proposed and developed by Cimarusti and Ives
in 1982, Goodman et al. in 1989 and Sugiyama in 1991.
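The nearest-prototype scoring used by these early spectral systems can be sketched as follows. This is a toy illustration with Euclidean distance and two-dimensional vectors; the real systems operated on spectral or cepstral vectors:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def accumulated_distance(test_vectors, prototypes):
    """For every test vector, add the distance to its closest prototype."""
    return sum(min(euclidean(t, p) for p in prototypes) for t in test_vectors)

def identify(test_vectors, prototypes_per_language):
    """The language whose prototypes yield the lowest accumulated
    distance is assigned as the identified language."""
    return min(prototypes_per_language,
               key=lambda lang: accumulated_distance(
                   test_vectors, prototypes_per_language[lang]))
```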
In order to compute the similarity between languages, most of the early systems
calculated the distance between each test vector and its closest training vector and
accumulated the results into an overall distance. In these systems, the language with the
lowest distance is assigned as the identified language. Later, Gaussian mixture modeling
was applied to this approach by Nakagawa et al. in 1992 and Zissman in 1993. In this case,
each vector is assumed to be generated randomly according to a probability density that is
a weighted sum of multi-variate Gaussian densities (Zissman, 2001). During training, a
Gaussian mixture model of the feature vectors is computed for each language. During
recognition, the likelihood of the test utterance feature vectors is computed given each of
the language models, and the language having the maximum likelihood is proposed as the
identified language. In this approach, instead of only one feature vector from the training
set, the whole set of training feature vectors affects the scoring of each test vector; this may
therefore be called a soft version of vector quantization.
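A sketch of this "soft vector quantization": every Gaussian component of a language's mixture contributes to the score of every test vector. Diagonal covariances are assumed for simplicity.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    """Log of a weighted sum of Gaussian densities: every component
    contributes to the score of every vector (the 'soft' counterpart
    of picking a single nearest codeword)."""
    comps = [math.log(w) + gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))

def utterance_loglik(vectors, weights, means, variances):
    """Total log-likelihood of an utterance under one language's GMM."""
    return sum(gmm_loglik(x, weights, means, variances) for x in vectors)
```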
Since vector quantization gives a static classification, language identification systems
with Hidden Markov Modeling were implemented in the late 1980s in order to model the
sequential characteristics of speech. HMM-based language identification was first
proposed by House and Neuburg in 1977 (Zissman, 2001). In these systems, HMM
training was performed on unlabeled training speech, and the system performance was even
worse than that of static classifiers in some cases.
Later, a new approach was proposed by Li in 1994: the vowels of each speech
utterance are labeled automatically, and spectral vectors are computed in the neighborhood
of the vowels. Instead of modeling the feature vectors over all the training data, only the
selected portions are used. During recognition, the selected portions of the test data are
processed, and the language with the maximum likelihood is assigned as the identified
language.
1.4.2. LID Using Prosody
The pitch frequency (fundamental frequency) of speech is defined as the frequency at
which the vocal cords vibrate during a voiced sound (Hess, 1983). It is difficult to make a
reliable estimate of the pitch frequency from speech data, since the harmonics of the
side frequencies cause distortion.
One of the simplest algorithms used depends on multiple measures of periodicity in
the signal. The fundamental frequency (f0) is usually processed on a logarithmic scale
rather than a linear one, in order to match the resolution of the human auditory system.
Normally 50 Hz ≤ f0 ≤ 500 Hz for voiced speech. For unvoiced speech f0 is undefined and,
by convention, it is set to zero on the log scale.
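This convention can be captured in a couple of lines; the f0 values in the example contour are made up:

```python
import math

def log_f0(f0_hz):
    """Map fundamental frequency to a log scale; unvoiced frames,
    where f0 is undefined, are mapped to zero by convention."""
    if f0_hz <= 0.0:          # unvoiced frame
        return 0.0
    return math.log(f0_hz)

# A made-up f0 contour in Hz (0.0 marks unvoiced frames).
track = [0.0, 120.0, 130.0, 0.0, 240.0]
log_track = [log_f0(f) for f in track]
```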
Since the fundamental frequency reflects the characteristics of the speaker, it does not
give global information about the language or the utterance. The slope of the pitch
frequency, however, gives some clues about the prosody and the stress of the utterance,
which might differ from language to language. Humans can also use prosodic information
to guess the spoken language (Muthusamy et al., 1994).
Language identification systems depending on prosody alone have also been proposed
by Itahashi et al. (1994, 1995); especially in noisy environments, pitch estimation is
argued to be more robust than spectral parameters. However, compared to phonetic
information, prosody carries little information about the language (Hazen, 1993). Some
studies suggest that systems with prosodic and phonetic parameters perform about the
same as systems using only phonetic parameters. Therefore, the prosodic information
of speech is not considered in this thesis.
1.4.3. LID Using Phone-Recognition
Different languages have different phone distributions, and this has led many
researchers to build LID systems that extract the phone sequence of an utterance and
determine the language based on the statistics of that sequence. An example of this
approach was implemented by Lamel and Gauvain (1993), who built two HMM-based
phone recognizers for English and French. They found that the likelihood scores
obtained from language-dependent phone recognizers can be used to distinguish between
the two languages.
In a different approach by Schultz (2001), the language-specific phonemes of N
languages are unified into one global set. A target language for which we do not have
enough information is modeled by adapting other well-known languages using the
phoneme descriptions of the languages.
As building a system that depends on phone recognition necessitates multi-language
phonetically labeled corpora, it becomes more difficult to include new languages
in the language identification process. This difficulty can be handled by using a phone
recognizer for a single language and obtaining the phonetic distributions for the other
languages. Hazen and Zue (1993) and Zissman and Singer (1994) developed LID systems
that use a single-language front-end phone recognizer, with successful performance.
This work was extended to multiple single-language front ends by Zissman and Singer
(1994) and Yan and Barnard (1995).
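The single-front-end idea, phone recognition followed by language modeling, can be sketched with bigram statistics over decoded phone strings. This is a toy illustration with a two-phone vocabulary and made-up training strings; real systems use the full phone inventory of the front-end recognizer:

```python
import math
from collections import Counter

def train_bigram(phone_strings, vocab):
    """Estimate bigram probabilities from decoded phone strings,
    with add-one smoothing over the phone vocabulary."""
    pair_counts, context_counts = Counter(), Counter()
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    v = len(vocab)
    return {a: {b: (pair_counts[(a, b)] + 1) / (context_counts[a] + v)
                for b in vocab}
            for a in vocab}

def bigram_score(phones, model):
    """Log probability of a decoded phone string under one language model."""
    return sum(math.log(model[a][b]) for a, b in zip(phones, phones[1:]))

def prlm_identify(phones, models):
    """Assign the language whose model scores the phone string highest."""
    return max(models, key=lambda lang: bigram_score(phones, models[lang]))
```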
In this thesis, a phoneme recognizer based on Turkish phonemes is developed and the
phonetic distributions of the languages in OGI multi-language database are evaluated.
1.4.4. LID Using Word-Recognition
The systems based on word-recognition are more complicated than the phone-level
systems and less complicated than the large-vocabulary systems. They use the lexical
information of languages and score the occurrence of words for each language.
In the approach of Kadambe and Hieronymus (1995), which uses lexical modeling for
language identification, the incoming utterance is processed by parallel language-
dependent phone recognizers, and possible word sequences are identified from the resulting
phone sequences. Obtaining the lexical information of all target languages is not an easy
task, since each language-dependent lexicon includes several thousand entries.
1.4.5. LID Using Continuous Speech Recognition
In order to obtain better LID performance, researchers try to add more and more
knowledge to their systems. Large-vocabulary continuous-speech recognition systems are
the most complicated ones employed for this purpose. During the training process, one
speech recognizer per language is created, and during the evaluations all recognizers are
run in parallel to select the most likely one as the recognized language. Mendoza et al.
(1996), Schultz and Waibel (1998) and Hieronymus and Kadambe (1997) have worked on
these systems.
As these systems use higher-level knowledge (words and word sequences) rather than
lower-level knowledge (phones and phone sequences), the identification performance is
better than other simpler systems. On the other hand, they require many hours of labeled
training data for each language to be recognized and the algorithms are the most complex
ones in computation (Zissman, 2001).
2. THEORETICAL BACKGROUND
2.1. Speech Representation
Since their introduction in the early 1970s, homomorphic signal processing techniques
have been of great interest in speech recognition. Homomorphic systems are a class of
nonlinear systems that obey a generalized principle of superposition; linear systems are a
special case of homomorphic systems (Picone, 1993).
In speech processing, the homomorphic system should have the following property:
D[x1(n)^α · x2(n)^β] = α·D[x1(n)] + β·D[x2(n)]    (2.1)
Homomorphic systems are considered useful for speech processing because they
provide a way of separating the excitation signal from the vocal tract characteristics. The
convolution of the two sources in the time domain is represented as follows:
s(n) = g(n) * v(n) (2.2)
where g(n) denotes the excitation signal and v(n) the vocal tract impulse response and “*”
the convolution.
The frequency domain representation of this process is as follows:
S(f) = G(f) . V(f) (2.3)
When we take the logarithm of both sides, we get:
Log(S(f)) = Log (G(f) . V(f))
= Log (G(f)) + Log (V(f)) (2.4)
14
The cepstrum is a homomorphic transformation that allows the separation of the
source from the filter: the excitation and the vocal tract shape become superposed in the
log domain, where they can be separated. The cepstrum of a
speech segment can be computed by windowing the signal with a window of length N.
The Mel Frequency Cepstral Coefficients (MFCC) vector is a representation defined as
the real cepstrum of a windowed short-time signal derived from the FFT of that signal
(Huang, 2001). The difference from the real cepstrum is that a non-linear frequency scale,
called the mel scale, is used in order to approximate the frequency response of the human
auditory system.
To compute the MFCCs of raw speech data, we first compute the spectral magnitudes,
apply the resulting values to the filter banks, take the logarithm of the filter outputs and
then compute the inverse Fourier transform. The filter banks are linearly spaced in mel
frequency; the mel scale itself is approximately linear below 1000 Hz and logarithmic
above it. The mapping from linear frequency to mel frequency is defined as follows.
Mel(f) = 2595log10(1+ f / 700) (2.5)
In Figure 2.1, the triangular band-pass filters, equally spaced along the mel-frequency
scale between 0 and 4000 Hz, are shown.
Figure 2.1. Filter bank for computing MFCCs
The expression for computing the MFCCs using the discrete cosine transform is
given as follows.
MFCC_i = sqrt(2/N) · Σ_{j=1..N} m_j · cos( πi (j − 0.5) / N ) (2.6)

where N is the number of filters and m_j are the log band-pass filter output amplitudes.
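The pipeline above can be sketched in a few lines of Python/NumPy. This is an illustrative sketch only, not the thesis implementation: the function names and filter-bank construction are our own choices, while the mel mapping follows Equation (2.5) and the final DCT step follows Equation (2.6).

```python
import numpy as np

np.random.seed(0)

def hz_to_mel(f):
    # Eq. (2.5): mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters=16, n_fft=256, sr=8000):
    # Triangular band-pass filters equally spaced on the mel scale (cf. Figure 2.1)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)   # inverse of Eq. (2.5)
    bins = np.floor((n_fft // 2) * hz_pts / (sr / 2.0)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fb, n_ceps=12):
    # Power spectrum of a Hamming-windowed frame, mel filter bank, log, then DCT
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=256)) ** 2
    m = np.log(fb @ spec + 1e-10)        # log band-pass outputs m_j
    N = len(m)
    j = np.arange(1, N + 1)
    # Eq. (2.6): MFCC_i = sqrt(2/N) * sum_j m_j cos(pi*i*(j - 0.5)/N)
    return np.array([np.sqrt(2.0 / N) * np.sum(m * np.cos(np.pi * i * (j - 0.5) / N))
                     for i in range(1, n_ceps + 1)])

fb = mel_filterbank()
coeffs = mfcc_frame(np.random.randn(160), fb)   # one 20 ms frame at 8 kHz
print(coeffs.shape)
```

The 160-point frame and the 16-filter bank mirror the settings used later in Section 3.2.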
It has been shown that system performance is enhanced by appending time derivatives to
the static parameters. The first-order derivatives, referred to as the delta coefficients, are
also included in the feature vector.
MFCCs are among the most popular parameterizations used in speech applications
because of their capability of capturing the phonetically important characteristics of
speech. A small drawback is that, because of the Fast Fourier Transform (FFT) required
at the early stages to convert speech from the time domain to the frequency domain,
MFCCs are computationally more complex than other methods such as LPCC (Wong and
Sridharan, 2001).
2.2. Vector Quantization
Quantization is the process of approximating continuous amplitude signals by
discrete symbols. The quantization of a single signal value or parameter is the scalar
quantization. In vector quantization, instead of individual values, small arrays of them are
represented. VQ was first proposed as a highly efficient quantization method for LPC
parameters (Linde et al., 1980), and later was applied to waveform coding.
A vector quantizer is described by a codebook, which is a set of fixed prototype
vectors, each referred to as a codeword. During the quantization process the input vector
is matched against each codeword in the codebook using some distortion measure, and
the input vector is then assigned the index of the codeword with the smallest distortion.
Therefore, a vector quantization process involves two main components: the distortion
measure and the generation of the codewords in the codebook.
To design an M-level codebook, it is necessary to partition d-dimensional space into
M cells and assign a quantized vector to each cell. The criterion for optimizing the
quantizer is to minimize the overall distortion over M levels. There are two necessary
conditions for an optimal quantizer. The first is that, the quantizer is realized by using a
nearest-neighbor selection rule expressed in Equation (2.7), where x is quantized as z. The
second condition is that, each codeword zi is chosen to minimize the average distortion in
the cell i.
q(x) = z_i if and only if i = argmin_k d(x, z_k) (2.7)
The procedure known as the K-means algorithm or the generalized Lloyd algorithm
partitions the set of training vectors into M clusters in such a way that the two necessary
conditions for optimality are satisfied. The algorithm can be described as follows:
Initialization: Initial values for the codewords in the codebook are assigned.
Nearest-Neighbor Classification: Each training vector is classified into one of the
cells by choosing the closest codeword.
Codebook Updating: Codeword of every cell is updated by computing the centroid of
the training vector assigned to each cell.
Iteration: The classification and updating steps are repeated until the ratio of the new
overall distortion to the previous one rises above a pre-defined threshold, i.e., until the
distortion no longer decreases significantly.
Since the initial values of the codebook are critical to the ultimate quality of the
quantizer, another procedure called the LBG algorithm, also known as the extended
K-means algorithm, is proposed in order to design the M-vector codebook in stages.
Unlike the K-means algorithm, the LBG algorithm first computes a 1-vector codebook,
then uses a splitting algorithm on the codewords to obtain the initial 2-vector codebook,
and continues splitting until the desired M-vector codebook is obtained (Huang, 2001).
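The staged LBG procedure can be sketched as follows. This is a minimal illustration of the splitting-plus-refinement idea (the perturbation factor and iteration counts are arbitrary choices of ours, not values from the thesis):

```python
import numpy as np

def lbg(data, M=32, eps=0.01, n_iter=10):
    """LBG / extended K-means: grow the codebook by splitting (a sketch)."""
    codebook = data.mean(axis=0, keepdims=True)            # 1-vector codebook
    while len(codebook) < M:
        # split every codeword into a slightly perturbed pair
        codebook = np.vstack([codebook * (1.0 + eps), codebook * (1.0 - eps)])
        for _ in range(n_iter):                            # K-means refinement
            d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
            labels = d.argmin(axis=1)                      # nearest-neighbor rule, Eq. (2.7)
            for i in range(len(codebook)):                 # centroid update
                members = data[labels == i]
                if len(members):
                    codebook[i] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-3.0, 0.5, (100, 2)),
                  rng.normal(3.0, 0.5, (100, 2))])
cb = lbg(data, M=4)
print(cb.shape)
```

Each pass through the while-loop doubles the codebook size, so the two optimality conditions are re-established at every stage of growth.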
2.3. Hidden Markov Model Topology
Hidden Markov Modeling (HMM) is one of the basic methods used in speech
processing. It is a widely used statistical method of characterizing the spectral properties
of the frames of an utterance, which is assumed to be a random process whose parameters
can be estimated in a precise, well-defined way. Simply put, an HMM is a Markov model
in which the states are hidden.
HMMs can be classified into different types, such as discrete, continuous or
semi-continuous models, depending on whether the observable events assigned to each
state are discrete, continuous or both. The states may be defined as ergodic, which implies
that there is a transition from each state to every other state at any time. A simpler
topology, also used in our system, is left-to-right, in which the states follow a route from
left to right until the sequence terminates.
In all speech processing applications, there ought to be a training phase followed by a
recognition phase. During the training, the parameters of the base reference model are
computed. There are three parameters to be estimated. One of them is the state transition
probability matrix A, with elements of aij denoting the transition probability of being at
state i at time t and at state j at time t + 1. When an observation sequence O = {o1, o2, …,
oT} is defined, each vector element of this sequence denotes the feature parameter vector in
the speech recognition task. The matrix B = [b_j(o_t)] is the observation symbol
probability distribution, where b_j(o_t) is the probability of observing vector o_t at time t
in state j. The vector π = {π_i} denotes the initial state distribution, i.e., the probability of
being in state i at the beginning. These three parameter sets form the compact
representation λ = {A, B, π}, which is used to represent an HMM in general. There are also some other parameters
which are the number of states, N, and the number of the mixtures in each state, M. There
are various ways of representing observation symbol probabilities but the continuous
probability densities are preferred generally. The multivariate Gaussian distributions are
also widely used. The expression for the computation of bj(ot) is as follows:
b_j(o_t) = Π_{s=1..S} [ Σ_{m=1..M_s} c_{jsm} N(o_{st}; µ_{jsm}, Σ_{jsm}) ] (2.8)
where M_s is the number of mixture components in stream s, c_{jsm} is the weight of the
m-th component and N(·; µ, Σ) is a multivariate Gaussian with mean vector µ and
covariance matrix Σ, expressed as follows:
N(o; µ, Σ) = (1 / sqrt((2π)^n |Σ|)) exp( −(1/2) (o − µ)' Σ^{−1} (o − µ) ) (2.9)
where n is the dimensionality of o.
At the beginning of the training process, a rough estimate of the λ parameters of the
HMMs should be computed. The Viterbi algorithm is used to assign the initial values after
the observation sequence is uniformly segmented with a segmental K-means algorithm.
Using these initial parameter values, further improvements are achieved with the
Baum-Welch (Expectation-Maximization) re-estimation procedure.
Figure 2.2. Representation of a mixture with M-components
For a single-state HMM, the estimation of the parameters of the symbol observation
probability is given in Equation (2.10). Since the Gaussian mixture components are
independent, the computation extends easily to multi-component models.
b_j(o_t) = (1 / sqrt((2π)^n |Σ|)) exp( −(1/2) (o_t − µ)' Σ^{−1} (o_t − µ) ) (2.10)
The maximum likelihood estimates of µ j and Σ j are simple averages,
µ_j = (1/T) Σ_{t=1..T} o_t (2.11)

and

Σ_j = (1/T) Σ_{t=1..T} (o_t − µ_j)(o_t − µ_j)' (2.12)
Let L_j(t) be the probability of being in state j at time t; then, using Equations (2.11)
and (2.12), the weighted averages are obtained as follows:
µ_j = [ Σ_{t=1..T} L_j(t) o_t ] / [ Σ_{t=1..T} L_j(t) ] (2.13)

and

Σ_j = [ Σ_{t=1..T} L_j(t) (o_t − µ_j)(o_t − µ_j)' ] / [ Σ_{t=1..T} L_j(t) ] (2.14)
Equations (2.13) and (2.14) are known as the Baum-Welch re-estimation expressions
for the means and covariances. In order to calculate the mean and covariance terms above,
the state occupation probability L_j(t) should be evaluated with an efficient algorithm such
as the Forward-Backward algorithm. The forward probability α_j(t) for a model M with
N states is defined as the joint probability of observing the first t speech vectors in
sequence and being in state j at time t:
α_j(t) = P(o_1, ..., o_t, x(t) = j | M) (2.15)
Equation (2.15) can be computed efficiently by the recursion below:
α_j(t) = [ Σ_{i=2..N−1} α_i(t − 1) a_ij ] b_j(o_t) (2.16)
The initial condition and final condition of the recursion expression in Equation
(2.16) for 1< j < N are given as follows:
α_1(1) = 1
α_j(1) = a_1j b_j(o_1) (2.17)

α_N(T) = Σ_{i=2..N−1} α_i(T) a_iN (2.18)
The calculation of the forward probability yields the total likelihood P(O | M):
P(O | M) = α_N(T) (2.19)
The backward probability is defined below and the recursion expression is given in
Equation (2.21).
β_j(t) = P(o_{t+1}, ..., o_T | x(t) = j, M) (2.20)

β_i(t) = Σ_{j=2..N−1} a_ij b_j(o_{t+1}) β_j(t + 1) (2.21)
The initial and final conditions of the recursion expression above, for 1 < j < N, are
given as follows:
β_i(T) = a_iN
β_1(1) = Σ_{j=2..N−1} a_1j b_j(o_1) β_j(1) (2.22)
The backward probability is defined as a conditional probability, whereas the forward
probability is defined as a joint probability. The product of the two is proportional to the
probability of state occupation. Using the definitions above, we obtain the following
expressions:
α_j(t) β_j(t) = P(O, x(t) = j | M) (2.23)

L_j(t) = P(x(t) = j | O, M) = P(O, x(t) = j | M) / P(O | M) = (1/P) α_j(t) β_j(t) (2.24)

where P = P(O | M).
During the recognition phase of HMM-based speech processing applications, the
forward computation of the Baum-Welch algorithm is used again. While the forward
probability is evaluated with the efficient recursive method, the total likelihood
P = P(O | M) is obtained as a by-product. Therefore, this algorithm can be used to find
the model that maximizes P(O | M_i), where i denotes the individual models.
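The model-scoring use of the forward recursion can be sketched as follows. This is an illustrative log-domain sketch with a toy two-state model and pre-computed emission scores (the thesis uses Gaussian mixture emissions; the numbers below are hypothetical):

```python
import numpy as np

def forward_loglik(log_A, log_B, log_pi):
    """Total log-likelihood log P(O|M) via the forward recursion.

    log_A: (N, N) log transition probabilities; log_B: (T, N) per-frame log
    emission probabilities log b_j(o_t); log_pi: (N,) log initial distribution.
    """
    T, N = log_B.shape
    alpha = log_pi + log_B[0]                      # initialization, cf. Eq. (2.17)
    for t in range(1, T):
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_j(o_t), in the log domain
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)              # total likelihood, cf. Eq. (2.19)

# toy two-state model (hypothetical numbers, not from the thesis)
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_pi = np.log([0.6, 0.4])
log_B = np.log([[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]])
print(forward_loglik(log_A, log_B, log_pi))
```

Keeping the recursion in the log domain with `np.logaddexp` avoids the numerical underflow discussed later for the Viterbi case.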
However, it is more convenient to base the recognition process on the maximum
likelihood state sequence. A small modification of the algorithm is needed: the summation
is replaced by a maximization. If φ_j(t) represents the maximum likelihood of observing
speech vectors o_1 to o_t and being in state j at time t, the partial likelihood can be
computed from Equation (2.16) as follows:
φ_j(t) = max_i { φ_i(t − 1) a_ij } b_j(o_t) (2.25)
where the initial and final conditions of recursion expression for 1< j < N are as follows:
φ_1(1) = 1
φ_j(1) = a_1j b_j(o_1)
φ_N(T) = max_i { φ_i(T) a_iN } (2.26)
The expression of the final condition is equal to the maximum likelihood, P (O | M).
The probability values in the equations are very small and the multiplication of these
numbers may cause an underflow. Therefore, the log likelihood values should be preferred
instead of the linear ones. By this approach Equation (2.27) is obtained as follows:
ψ_j(t) = max_i { ψ_i(t − 1) + log(a_ij) } + log(b_j(o_t)) (2.27)
The equation above is the basis of the Viterbi decoding algorithm. This algorithm can be
visualized as finding the best path through a matrix, as shown in Figure 2.3, where the
rows represent the states and the columns the speech frames. Each dot denotes the log
probability of observing that frame in that state, and each arc represents the log transition
probability between those states. The log probability of any path is equal to the sum of the
dots and arcs it passes through.
Figure 2.3. Matrix visualization of Viterbi algorithm
The path extends from left-to-right, column-by-column. At time t, any partial path
ψi(t-1) is known, therefore any ψj(t) can be computed from Equation (2.27) (Young et al.,
2000).
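The column-by-column path extension can be sketched as follows. This is an illustrative sketch with a toy two-state model (the numbers and function name are ours); the forward pass applies Equation (2.27) and a back-pointer array recovers the best path:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best-path log-likelihood and state sequence via Eq. (2.27)."""
    T, N = log_B.shape
    psi = log_pi + log_B[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = psi[:, None] + log_A            # psi_i(t-1) + log a_ij
        back[t] = scores.argmax(axis=0)
        psi = scores.max(axis=0) + log_B[t]      # ... + log b_j(o_t)
    # trace the best path back through the trellis (cf. Figure 2.3)
    path = [int(psi.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(psi.max()), path[::-1]

log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_pi = np.log([0.5, 0.5])
log_B = np.log([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
score, path = viterbi(log_A, log_B, log_pi)
print(path)   # → [0, 0, 1]
```

Working with log probabilities throughout implements exactly the underflow remedy described above.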
3. BASE SYSTEM FOR LANGUAGE IDENTIFICATION
3.1. OGI Multi-Language Database
The mixture models of the different languages are trained, and the results of the proposed
system are evaluated, using the records of the Oregon Graduate Institute Multi-Language
Telephone Speech Corpus (OGI-TS), described in Muthusamy (1992).
This database comprises 11 spoken languages: English, Farsi, French, German, Hindi,
Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. There are a total of
23,118 files, 568 of which have time-aligned broad phonetic transcriptions. The utterance
types are nlg (native language), clg (common language), dow (days of the week), num
(numbers 0 through 10), htl (hometown likes), htc (hometown climate), roo (room
description), mea (description of most recent meal), stb (free speech before the tone) and
sta (free speech after the tone). The records classified as "stories before the tone" (stb
files), each lasting 45 seconds, are used in our evaluations. The numbers of speakers for
the training and test sets in the OGI database are given in Table 3.1.
Table 3.1. Number of speakers in OGI database
Language      Training Set   Evaluation Set
English       50             141
Farsi         49             51
French        50             57
German        50             59
Hindi         173*           52
Korean        50             40
Japanese      49             37
Mandarin      49             52
Spanish       50             60
Tamil         50             55
Vietnamese    50             50
* "stb" files only.
The speech files are stored in the NIST SPHERE header format. All files, compressed
with the "shorten" speech compression method, are decompressed and byte-swapped
before the feature extraction phase.
3.2. Feature Extraction
The speech files in the OGI multi-language corpus, sampled at 8 kHz with 16-bit
resolution, are parameterized every 20 ms with a 10 ms overlap between contiguous
frames. For each frame a 24-dimensional feature vector is computed: 12 cepstral
coefficients and 12 delta cepstral coefficients. Speech utterances are windowed with a
160-point Hamming window to obtain the short-term energy spectrum and then filtered
with a bank of 16 filters.
The energy coefficient is not included in the feature vector because of the differing
recording levels over the telephone line. Wong and Sridharan (2001) showed that the
static log-energy coefficient reduces the performance of a language identification system;
as an explanation, it is suggested that static short-term features, in contrast to transient
features, do not encapsulate language-specific information.
Cepstral normalization is performed in order to minimize the channel effect. During
this process, the mean cepstrum of each file is calculated and then the obtained value is
subtracted from each feature vector. The effect of cepstrum normalization is examined
through a plain system test for normalized and unnormalized feature vectors.
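The per-file mean subtraction described above amounts to a one-line operation; the sketch below is illustrative (the function name is ours), with rows holding the frame-level feature vectors of one file:

```python
import numpy as np

def cepstral_mean_normalize(features):
    """Subtract the per-file mean cepstrum (rows = frame-level feature vectors)."""
    return features - features.mean(axis=0, keepdims=True)

feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # toy 3-frame file
norm = cepstral_mean_normalize(feats)
print(norm.mean(axis=0))   # → [0. 0.]
```

After normalization the mean cepstrum of every file is zero, which removes the stationary channel component shared by all frames of that file.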
Figure 3.1. Distributions of c1 vs. c0 coefficients of mixtures for some languages
Table 3.2. Effect of feature normalization on LID for baseline system
           Unnormalized   Normalized   Improvement
           Parameters     Parameters   in mean rank
EN         1.40           1.00         3.6%
FA         8.25           8.20         0.5%
FR         6.45           6.85         3.6%
GE         1.80           2.00         1.8%
HI         3.70           3.55         1.4%
JA         3.75           4.20         -4.1%
KO         8.70           7.35         4.1%
MA         7.25           6.90         3.2%
SP         4.30           3.80         4.5%
TA         4.20           3.60         5.4%
VI         10.60          10.80        -1.8%
Overall    5.49           5.29         3.6%
The results in Table 3.2 are given in terms of the mean ranks of the target languages;
they show that the normalization process improves LID performance for all languages
except Japanese and Vietnamese. The unexpected result for these two languages may be
related to their voiced-phoneme distributions: the ratio of high-energy frames might be
distorted by the normalization process.
3.3. Gaussian Mixture Modeling using Vector Quantization
In order to build an LID system that does not depend on the amount of labeled speech
data, we implement a method based on Gaussian mixtures generated by vector
quantization. The speech files in the training set of the OGI corpus for the 11 languages
are used to obtain the codebook of the system for each language.
A composite of mixtures is evaluated with an optimal codebook size of 32. Mixture
splitting is performed by using an entropy-based distance measure, defined over the
codewords of each language.
The algorithm can be summarized as follows:
Decide on the codebook size (N=32).
Evaluate the initial mean value (codeword) using the input feature vectors.
Split the mean vector with the maximum weight in the codebook recursively until the
total number of codewords is reached.
Using the sum of squared errors as the distortion measure, cluster the input vectors
around the codewords: find the distance between the input vector and each codeword;
the input vector belongs to the cluster of the codeword that yields the minimum
distance.
Re-estimate the new set of codewords by averaging each cluster: add up the
components of the vectors in the cluster and divide by the number of vectors,

c_i = (1/m) Σ_{j=1..m} x_{j,i} (3.1)

where i indexes the components of each vector (x, y, z, ... directions) and m is the
number of vectors in the cluster.
Repeat the previous two steps until either the codewords do not change or the change
is quite small.
The evaluation of this system is based on the Nearest Neighbor Selection algorithm
(Higgins, 1993), a non-parametric approach that uses the averaged nearest-neighbor
distance to classify features into mixtures. The Gaussian that best fits the input vector of
mel-cepstral coefficients, c, is found by evaluating the distance using the log-likelihood
values. The distance search algorithm can be expressed briefly as follows:
dmin = ∞;
for m = 1..M
    d = 0;
    for n = 1..N
        d = d + (c_n − µ_n^m) * (c_n − µ_n^m) / σ_n^m;
    end
    if d < dmin
        dmin = d;
        argmin = m;
    end
end
return dmin;

Figure 3.2. Distance search algorithm
The likelihood values obtained for the 11 languages are sorted to obtain the rank of
each language: the language with the maximum likelihood receives rank 1 and the one
with the minimum likelihood rank 11. The mean values of the language ranks for each
model, tested on each test set (45-second utterances from 20 files), are given in Table 3.3.
The mean and variance values of the correct matches are plotted in Figure 3.3.
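The rank statistic used throughout the evaluations can be computed as follows; this is an illustrative sketch (the function name and the score values are hypothetical):

```python
def language_ranks(loglikes):
    """Rank languages by likelihood: rank 1 = most likely (as in Table 3.3)."""
    order = sorted(loglikes, key=loglikes.get, reverse=True)
    return {lang: rank for rank, lang in enumerate(order, start=1)}

scores = {"EN": -1200.0, "GE": -1250.0, "FR": -1500.0}   # hypothetical log-likelihoods
ranks = language_ranks(scores)
print(ranks)   # → {'EN': 1, 'GE': 2, 'FR': 3}
```

A mean rank close to 1.0 for the target language, as in the EN column of Table 3.3, indicates that the correct model wins on almost every test file.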
Table 3.3. Mean values of language ranks obtained by plain NNMS
Languages of model files (mean ranks)

Test   EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
files
EN     1.00  9.60  8.50  2.05  4.65  4.40  8.10  7.55  3.70  5.45  11.00
FA     1.10  8.20  8.00  1.95  3.95  5.80  8.70  8.30  4.10  4.90  11.00
FR     1.30  9.60  6.85  2.20  3.90  5.30  8.15  8.20  4.05  5.45  11.00
GE     1.20  9.45  8.25  2.00  3.75  4.90  8.25  7.40  4.30  5.55  10.95
HI     1.15  8.90  8.55  2.80  3.55  5.40  7.85  7.90  4.35  4.60  10.95
JA     1.30  9.45  8.75  1.90  4.30  4.20  7.85  7.85  3.95  5.50  10.95
KO     1.55  9.30  8.15  1.65  4.40  4.90  7.35  8.40  3.95  5.35  11.00
MA     1.55  9.75  8.40  2.05  3.55  4.00  7.50  6.90  5.15  6.20  10.95
SP     1.25  9.25  8.55  2.45  3.95  4.50  8.00  8.10  3.80  5.35  10.80
TA     1.30  9.20  8.85  2.35  4.30  5.10  8.20  7.65  4.55  3.60  10.90
VI     1.35  9.40  8.90  2.15  4.35  4.10  7.35  7.70  4.65  5.25  10.80
Figure 3.3. Mean and variance of target-language ranks with a plain NNMS system
4. PROPOSED SYSTEM
4.1. Language-Dependent Gaussian Mixtures
4.1.1. Uni-gram and Bi-gram Modeling
For better LID performance, uni-gram and bi-gram statistics of the mixtures are included
in the computation of the language probabilities. Each of the 32 mixtures generated for a
language is assumed to correspond roughly to a phoneme; therefore, the frequency of
mixture occurrences is taken as the uni-gram value and the frequency of one mixture
following another as the bi-gram value.
The conventional bi-gram modeling of a language depends on the word distributions
and an approximation is made such that the probability of a word depends only on the
identity of the preceding word, expressed as P(wi|wi-1) (Huang, 2001). In this study, using
the same approach, the bi-gram probability values of the acoustic mixture distributions,
P(mi|mi-1), are calculated for each language. During the computation of the mixture
statistics, the floor value of 0.001 is assigned for the mixture pairs that occur infrequently.
In order to estimate the uni-gram values using the training set, we simply count the number
of occurrences of each mixture in the output sequence.
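The counting and flooring steps above can be sketched as follows; this is an illustrative sketch (the function name and toy sequence are ours, while the 0.001 floor is the value stated in the text):

```python
from collections import Counter

def mixture_ngrams(seq, n_mixtures=32, floor=0.001):
    """Uni-gram and floored bi-gram probabilities of a mixture-index sequence."""
    uni = Counter(seq)
    bi = Counter(zip(seq[:-1], seq[1:]))
    total = len(seq)
    p_uni = {m: uni[m] / total for m in range(n_mixtures)}
    p_bi = {}
    for prev in range(n_mixtures):
        for cur in range(n_mixtures):
            p = bi[(prev, cur)] / uni[prev] if uni[prev] else 0.0
            p_bi[(prev, cur)] = max(p, floor)    # floor for rarely seen pairs
    return p_uni, p_bi

p_uni, p_bi = mixture_ngrams([0, 1, 0, 1, 2], n_mixtures=3)
print(p_uni[0], p_bi[(0, 1)])   # → 0.4 1.0
```

The floor prevents unseen mixture pairs from driving the accumulated log-likelihood to minus infinity during scoring.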
The uni-gram and bi-gram values are weighted by the optimal coefficients, determined
during the development tests, and the results are added to the log-likelihood value of the
relevant frame. The per-frame probabilities are accumulated over the processed input file.
The tests are performed for 20 files of spontaneous speech from each language in the
OGI database and the language ranks are computed. The mean values of the ranks for the
uni-gram coefficients of u = 1, 2, 4 and 6 are given in Figure 4.1. The results for the bi-
gram weight coefficients of b = 0.8, 1, 1.2, 2 are plotted in Figure 4.2.
Figure 4.1. Mean and variance values of target-language ranks for different u values
Figure 4.2. Mean and variance values of target-language ranks for different b values
4.1.2. Language Normalization
In the evaluations of the LD-GMM based LID system, it is observed that some
languages, such as English and German, are biased within the OGI database; therefore,
language normalization is applied to the mixture probabilities of the languages during the
identification phase.
In order to compute the language normalization coefficients, the training files for each
language are processed separately by the plain LID system using the corresponding model
file. The average of the likelihood values is assigned as the normalization coefficient, and
the weighted coefficients are added to the frame likelihoods of the hypothesized language.
The mean values of the detected language ranks for the weight coefficients n = 0.4, 0.6,
0.8, 1 and 1.2 are given in Figure 4.3.
Figure 4.3. Mean and variance values of target-language ranks for different n values
4.2. Unsupervised Language Modeling Using Mono-Lingual Phoneme Recognizer
The Hidden Markov Model toolkit HTK, developed at Cambridge University, is used
for the implementation of the phoneme recognizer. Since we cannot provide labeled
speech data for training the mixture models of each language, we decided to implement a
reference model apart from the languages under test. For this purpose, we trained Turkish
phoneme models using the TÜBİTAK-TURTEL database described in Yapanel et al.
(2001), which includes word-level labels of Turkish utterances.
A phoneme recognizer based on continuous-mixture HMMs is used for decoding the
incoming spoken utterance. The training phase uses the HTK modules HERest and HHEd.
Initial single-Gaussian phoneme models are passed through the embedded training module
HERest six times. After that, the mixtures are split by the HHEd module and the resulting
four-mixture models are trained with HERest two more times. The mixtures are then split
again to obtain the eight-mixture models.
During training, the validity of models with varying numbers of states and mixtures is
examined by a continuous word-recognition test using 30 sentence records from the
TURTEL database. The test is performed for two-, four-, eight- and sixteen-mixture
models. The percentages of true hits, obtained with the HResults tool of HTK, were
57.5 per cent for the eight-mixture models and 56.7 per cent for the sixteen-mixture
models. Therefore, we decided to develop the system with eight-mixture models, which
seem adequate to represent the language while keeping the computation time of the
evaluations affordable. The effect of the number of states (three versus five) on the
performance is also examined. As a result, the eight-mixture model with the conventional
five-state left-to-right architecture shown in Figure 4.4 is selected for the statistical
training.
Figure 4.4. Five-state left-to-right HMM architecture
Since we have a mono-lingual phoneme recognizer, there is no acoustic diversity and
we need to distinguish the languages by their lexical information. To achieve this, the
training utterances of each language are decoded using a plain network file with no
statistical information. The output phoneme sequences are obtained and the probabilities
of one phoneme following another are computed. The network structure of the phoneme
recognizer for each language is then generated by inserting these bi-gram transition
values. An example of the network file for English is given in Appendix E.
Including the silence model, the Turkish phoneme list comprises 30 phonemes, but 28
are used in our system. The phoneme "ğ" (referred to as G) and the silence model
(referred to as Z) are excluded from the list because these two are strongly biased in the
output phoneme sequences.
The set of test files used for the evaluation of the phoneme recognizer includes the
same 20 files from each language as the ones used for the evaluation of LD-GMMs. Using
the recognition module of HVite with the language dependent net files and the model file
of Turkish phonemes, the following results are obtained by the phoneme recognizer.
Table 4.1. Mean values of language ranks obtained by phoneme recognizer
Languages of model files (mean ranks)

Test   EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
files
EN     3.00  6.15  5.35  8.30  5.10  5.75  5.90  5.15  7.80  6.85  6.65
FA     5.95  2.55  6.05  7.00  5.45  6.55  5.05  7.25  6.50  7.65  6.00
FR     4.65  6.60  1.15  7.65  7.20  5.30  6.25  6.45  6.50  6.60  7.65
GE     4.55  4.50  7.70  1.70  7.80  6.60  5.70  7.30  6.85  6.50  6.80
HI     5.25  7.40  6.70  7.20  2.35  5.45  5.50  6.60  7.05  6.05  6.45
JA     5.80  6.80  4.80  6.90  6.70  3.55  6.00  7.40  6.20  5.90  5.95
KO     5.05  6.55  4.90  6.35  7.95  4.10  3.35  8.15  7.45  6.30  5.85
MA     5.60  6.10  4.80  7.90  6.85  5.40  6.25  3.10  7.05  5.80  7.15
SP     5.75  6.80  6.85  6.40  6.50  6.65  4.80  6.10  3.15  6.80  6.20
TA     6.05  6.85  5.35  7.30  6.20  6.90  6.05  5.65  5.50  3.55  6.60
VI     6.15  4.80  4.90  8.60  4.75  6.05  6.75  5.80  7.25  7.25  3.70
Figure 4.5. Mean and variance values of language ranks with phoneme recognizer
4.3. Superposition of Two Methods
In our proposed system of LID, the two separate methods, explained above, are
processed in parallel and the outputs are merged at final stage as shown in Figure 4.6.
Figure 4.6. The superposition of two methods
The merging logic of Figure 4.7 can be summarized as follows: the top-ranked language
of the first output (index I) and of the second output (index J) are found; then the rank of
I within the second output and the rank of J within the first output are looked up. If
neither cross-rank passes the threshold test, a false detection is declared; otherwise the
candidate with the better cross-rank (I or J) is selected as the detected language.

Figure 4.7. The algorithm for merging the two results
In both methods, the ranks of the languages for each file are obtained as output, and the
results are merged by the algorithm described in Figure 4.7. The output of one method is
checked against the output of the other, which leads the system to more accurate decisions
and catches some of the false detections. Each language in the test domain also behaves
as a garbage model for the target language.
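The decision step can be sketched as below. This is one plausible reading of the Figure 4.7 flowchart, not a transcription of the thesis implementation; the threshold value and all names are hypothetical:

```python
def merge_decisions(ranks_1, ranks_2, thresh=3):
    """Cross-check the two branches' top hypotheses (one plausible reading of
    Figure 4.7; the threshold value used here is hypothetical)."""
    best = lambda ranks: min(ranks, key=ranks.get)     # rank 1 = best
    lang_i, lang_j = best(ranks_1), best(ranks_2)
    if lang_i == lang_j:
        return lang_i                                  # both branches agree
    rank_i = ranks_2[lang_i]      # how branch 2 ranks branch 1's winner
    rank_j = ranks_1[lang_j]      # how branch 1 ranks branch 2's winner
    if rank_i > thresh and rank_j > thresh:
        return None               # neither hypothesis confirmed: false detection
    return lang_i if rank_i <= rank_j else lang_j

r1 = {"EN": 1, "GE": 2, "FR": 3}   # hypothetical ranks from the NNMS branch
r2 = {"GE": 1, "EN": 2, "FR": 3}   # hypothetical ranks from the recognizer branch
print(merge_decisions(r1, r2))   # → EN
```

Returning None when neither cross-rank is acceptable corresponds to the "False Detection" exit of the flowchart.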
The results for 11 languages with different coefficients of LID system are listed in
Table 4.2 and Table 4.3. The evaluations, obtained for the uni-gram, bi-gram and
normalization coefficients of “0.8”, “0.4” and “1” respectively, give better results in the
overall performance compared to the system with weights of “0.6”, “0.8” and “1”. In an
LID application, the weights of the system should be tuned depending on the results
obtained for the languages under test.
Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1
%         LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN        15.0      15.0                    30.0
FA        15.0      30.0                    15.0
FR        20.0      85.0                    55.0
GE        55.0      45.0                    75.0
HI        30.0      35.0                    45.0
JA        00.0      15.0                    15.0
KO        15.0      10.0                    15.0
MA        05.0      15.0                    15.0
SP        35.0      00.0                    35.0
TA        45.0      15.0                    35.0
VI        00.0      15.0                    05.0
Overall   21.4      25.5                    30.9
Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1
%         LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN        55.0      15.0                    50.0
FA        20.0      30.0                    25.0
FR        40.0      85.0                    65.0
GE        25.0      45.0                    50.0
HI        25.0      35.0                    40.0
JA        00.0      15.0                    10.0
KO        40.0      10.0                    40.0
MA        10.0      15.0                    20.0
SP        25.0      00.0                    25.0
TA        45.0      15.0                    40.0
VI        15.0      15.0                    15.0
Overall   25.2      25.5                    34.5
The confusion probabilities of the target languages are given in Table 4.4. It is
observed that English test files are most often confused with French and Spanish, each
with a probability of 0.10. French, German and Hindi are most often confused with
English, with probabilities of 0.15, 0.10 and 0.15, respectively.
Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1
Languages of model files

Test   EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
files
EN     0.50  0.05  0.10  0.05  0.00  0.05  0.05  0.05  0.10  0.05  0.00
FA     0.15  0.25  0.20  0.10  0.00  0.00  0.00  0.00  0.10  0.00  0.00
FR     0.15  0.00  0.65  0.00  0.00  0.05  0.00  0.00  0.05  0.00  0.00
GE     0.10  0.05  0.05  0.50  0.00  0.00  0.00  0.00  0.00  0.05  0.00
HI     0.15  0.00  0.00  0.00  0.40  0.10  0.05  0.00  0.00  0.10  0.00
JA     0.00  0.05  0.10  0.10  0.10  0.10  0.05  0.10  0.15  0.05  0.00
KO     0.05  0.00  0.10  0.05  0.05  0.00  0.40  0.00  0.10  0.05  0.05
MA     0.10  0.05  0.10  0.10  0.15  0.00  0.05  0.20  0.05  0.05  0.00
SP     0.15  0.00  0.05  0.10  0.05  0.05  0.05  0.00  0.25  0.05  0.05
TA     0.00  0.00  0.00  0.05  0.00  0.05  0.10  0.00  0.20  0.40  0.00
VI     0.00  0.05  0.15  0.10  0.15  0.10  0.10  0.05  0.00  0.00  0.15
For the final evaluations of the system, different language groups are tested in order
to compare the results with the ones computed in the previous studies.
The first test set includes six languages: English, German, Hindi, Japanese, Mandarin
and Spanish. This set was used in the study of Zissman (1995), which applies a high-level
LID system with multiple single-language phoneme recognizers followed by n-gram
language models. The comparison of the results with the proposed system, using
coefficients of 0.8, 0.4 and 1 for the uni-gram, bi-gram and normalization factors
respectively, is listed in Table 4.5.
Table 4.5. Percentage of correct identification for six-language test
%         PRLM-P* (10 sec.)  PRLM-P* (45 sec.)  LD-NNMS  LI-Phoneme Recognizer  Superposed System
EN        ~57                ~75                60.0     25.0                   60.0
GE        ~56                ~73                30.0     80.0                   90.0
HI        ~55                ~70                35.0     55.0                   50.0
JA        ~51                ~66                05.0     40.0                   35.0
MA        ~53                ~68                15.0     45.0                   40.0
SP        ~55                ~75                30.0     30.0                   50.0
Overall   ~54.5              ~71.6              29.2     45.8                   54.2
* Phoneme Recognition followed by Language Modeling performed in Parallel.
The other set comprises three languages: English, German and Spanish. In the study of
Dalsgaard (1996), an LID system consisting of three parallel phoneme recognizers
followed by three language-modeling modules, each characterizing the bi-gram
probabilities, is applied. The language-dependent phoneme models are used together with
language-independent speech units. The results are given in Table 4.6 for comparison
with those of our proposed system with coefficients of 0.6, 0.8 and 1 for the uni-gram,
bi-gram and normalization factors, respectively.
Table 4.6. Percentage of correct identification for three-language test

%        Phoneme     LD-NNMS  LI-Phoneme   Superposed
         Clusters*            Recognizer   System
EN       90.0        20.0     80.0         70.0
GE       82.0        65.0     90.0         90.0
SP       77.0        50.0     60.0         80.0
Overall  83.0        45.0     76.6         80.0

* Language-independent phoneme clusters of speech units included.
The final test of the proposed system includes the pair-wise test of English with ten
languages in the OGI database. The results listed in Table 4.7 are compared with those
obtained by Zissman in 1993 and Arslan in 1997.
Table 4.7. Percentage of correctness for pair-wise test of English

Test     Lincoln*    Viterbi**   NNMS  HTK-PR  Superposed
Pairs    Algorithm   Algorithm                 System
EN-FA    80          84          70    80      72
EN-FR    83          84          70    85      85
EN-GE    67          70          65    90      85
EN-HI    -           -           68    80      85
EN-JA    79          91          80    72      88
EN-KO    82          78          78    70      85
EN-MA    86          92          72    70      72
EN-SP    83          92          63    88      82
EN-TA    84          84          75    80      80
EN-VI    78          83          72    88      88
Overall  72          84          71    80      82

* Results in the study of Zissman (1993), ** results in the study of Arslan (1997).
When the proposed system is compared with the previous studies, the overall results
show a decrease of 24.3 per cent for the six-language test, a decrease of 3.6 per cent for
the three-language test, and, for the pair-wise test, a decrease of 2.4 per cent relative to
one system and an increase of 13.9 per cent relative to the other.
5. CONCLUSIONS
The amount of data for a speech processing application is critical in all aspects.
For LID systems, this problem is even more severe since the amount is multiplied by
the number of languages in the target set. In order to build a high-level system, the speech
data should be processed such that the phoneme boundaries of the utterances are labeled.
Our motivation in this thesis was to propose a language identification system that does
not require labeled speech data. The system comprises two branches processing in
parallel: nearest neighbor selection with language-dependent GMMs and a monolingual
phoneme recognizer with language-dependent network files.
The first method includes Gaussian mixtures, generated by vector quantization, for
each language in the OGI database. The Gaussian that best fits the input vector of mel
cepstrum coefficients, c, is selected by evaluating the nearest distance over the log
likelihood values. All languages in the target set are modeled according to their
distribution of mixtures, and 32 mixtures are generated for each. The frequency of mixture
occurrences is assigned as the mixture’s uni-gram value, and the frequency of one mixture
following another is assigned as the bi-gram value. These values are weighted with the
optimum coefficients obtained during the development phase and added to the log
likelihood of the relevant frame. Results for each frame are accumulated through the
processed file. In the second method, an HMM-based phoneme recognizer is built using
the speech recognition toolkit HTK. Since we do not have labeled speech data to model
all languages, the Turkish database of TÜBİTAK-TURTEL is used for training the HMMs
of phonemes with eight mixtures. Using these models as reference, phoneme statistics of
each language are obtained and inserted into the generated network structure.
The two separate systems are processed in parallel and their outputs are merged at the
final stage. In the overall identification performance of the superposed system, there is an
increase of 36.9 per cent and 35.3 per cent compared with the results of the language-dependent
Gaussian mixtures and the mono-lingual phoneme recognizer, respectively.
When the proposed system is compared with previous studies such as the PRLM-P of
Zissman and the phoneme clusters of Dalsgaard, decreases of 24.3 per cent and 3.6 per
cent, respectively, are obtained in the overall results. In the system of Zissman, which
includes the parallel phoneme recognizers of six languages followed by language models,
and also in the system of Dalsgaard, high-level pre-processed speech data are used in order
to train the acoustic models of the languages under test. In the pair-wise test of English
versus the other languages, the performance of our proposed system shows a decrease of
2.4 per cent against the method of Arslan and an increase of 13.9 per cent against the
method of Zissman based on Hidden Markov Models. In all these systems, the necessity of
labeled speech data and linguistic information about the languages limits the variety of
languages included in the application, since the insertion of a new language becomes a
difficult task. With the proposed method, a robust LID system with tolerable performance
is built which easily integrates any language into the application without any restrictions.
APPENDIX A: LIST OF OGI TEST FILES
The test files used for the evaluations are spontaneous utterances of 45 seconds
(*stb.wav) listed in Table A.1. Twenty files for each language are selected from the OGI
multi-language telephone-speech database.
Table A.1. List of OGI files used in evaluations

EN   FA   FR   GE   HI   JA   KO   MA   SP   TA   VI
031  031  072  072  336  092  039  083  076  078  085
077  077  075  075  334  097  073  090  077  105  094
078  078  081  077  335  100  074  093  078  108  098
081  081  082  078  337  101  079  098  079  110  100
083  083  085  080  189  102  094  101  080  111  101
087  087  086  083  182  110  101  105  082  114  104
093  093  098  086  112  116  105  106  087  117  105
097  097  100  102  342  117  109  109  089  122  107
099  099  102  118  340  118  110  118  093  132  114
105  105  104  120  348  121  111  119  095  133  117
106  106  105  123  349  124  112  124  102  137  127
107  107  117  125  355  129  113  140  103  138  129
109  109  122  129  354  131  118  146  105  139  131
113  113  123  136  352  136  120  147  108  142  134
114  114  124  144  358  137  139  149  110  148  135
115  115  126  148  367  138  141  160  115  150  146
116  116  131  150  363  139  143  163  117  152  149
117  117  132  152  368  140  145  165  118  153  152
118  118  138  156  375  141  147  174  121  155  153
123  123  143  157  377  142  148  180  143  179  156
APPENDIX B: MATLAB FILE FOR COMPUTING MFCC

function m_pFeature = mfcc8000(x)
% mfcc8000(x): converts raw speech data (8 kHz) to an MFCC feature vector.
fbank = 17;                          % m_nFilterBanks
mel = 14;                            % m_nFeatureVectorLength
m_DimNorm = sqrt(2.0/fbank);
fs = 8000;
melmax = 2595*log10(1.0+((fs/2.0)/700.0));
tmp = melmax/(fbank+1);              % mel-scale step between filter centers
for i = 1:fbank+2
    m_CenterFreqs(i) = floor(1.5 + (512/4000)*700.0*(10^((1.0/2595.0)*(i-1)*tmp)-1.0));
end
x = filter([1 -0.97], 1, x);         % pre-emphasis
x = x.*hanning(length(x));           % windowing
fx = abs(fft(x,1024));
first = 0;
last = fbank;
for i = first+1:last
    m(i-first) = 0.0;
    len = m_CenterFreqs(i+2)-m_CenterFreqs(i)+1;
    wgt = triang(len)/sum(triang(len));
    m(i-first) = log(max(1,sum(fx(m_CenterFreqs(i):m_CenterFreqs(i+2)).*wgt)));
end
m = m(1:fbank-1);
fbank = length(m);
for i = 1:mel                        % DCT of the log filter-bank energies
    m_pFeature(i) = 0.0;
    for j = 1:fbank
        m_pFeature(i) = m_pFeature(i) + m(j)*cos((((i-1)*pi)/fbank)*(j-0.5));
    end
    m_pFeature(i) = m_DimNorm*m_pFeature(i);
end
m_pFeature(1) = 0.1*m_pFeature(1);
m_pFeature = m_pFeature(:);
m_pFeature = m_pFeature(1:mel);
return;
APPENDIX C: C FUNCTIONS FOR GMM TRAINING AND NNMS
// TRAINMODEL -----------------------------------------------------
// INPUT : script file of parametrized data used for training.
// OUTPUT: model file with mean, variance and weight values.
void CTrain::TrainModel(int numMixes)
{
    FILE *fid, *finput, *fout;
    char inputfilename[FNLEN];
    int argmax;
    int frmNo = 0;

    fid = FRead(scriptfile);
    printf("training started..\n");
    SetNumMixes(numMixes);
    weights = new float[numMixes];
    means = new float[numMixes][NUMMEL];
    vars = new float[numMixes][NUMMEL];
    memset(weights, 0, numMixes*sizeof(float));
    memset(means, 0, numMixes*NUMMEL*sizeof(float));
    memset(vars, 0, numMixes*NUMMEL*sizeof(float));

    /* initialization */
    while (!feof(fid)) {
        if (!ReadLineFromScript(fid, inputfilename))
            break;
        finput = FRead(inputfilename);          // read cepstrum values.
        while (ReadFrame(finput, ceps, NUMMEL)) {
            frmNo++;                            // total number of frames.
            UpdateAvg(means[0], ceps, frmNo);   // update mean.
        }
        fclose(finput);
    }
    numberofFrms = frmNo;

    /* mixture splitting */
    fseek(fid, 0, SEEK_SET);
    for (int nmix = 1; nmix < numMixes; nmix++) {
        // find mixture of max weight.
        argmax = FindMaxWeightArg(weights, nmix);
        // increase the number of mean vectors by splitting.
        SplitMean(means[argmax], ceps);
        for (int nmel = 0; nmel < NUMMEL; nmel++)
            means[nmix][nmel] = ceps[nmel];
        printf("update %d mixtures\n", nmix+1);
        UpdateMeans(fid, nmix+1);
        fseek(fid, 0, SEEK_SET);
    }

    /* iterations of training */
    for (int iter = 0; iter < numIter; iter++) {
        printf("iterno: %d\n", iter+1);
        UpdateMeans(fid, numMixes);
        fseek(fid, 0, SEEK_SET);
    }
    UpdateVars(fid, numMixes);

    fout = FWrite(modelfile);
    WriteModelFile(fout, means, vars, weights);
    fclose(fout);
    fclose(fid);
    delete [] means;
    delete [] vars;
    delete [] weights;
    return;
}
// FIND NEAREST MODEL ------------------------------
// INPUT : observation input and model index.
// OUTPUT: most likely model.
float CTest::FindNearest(float *frm, int m, int *argmin)
{
    int nm, nc;
    float mdiff, mindist, dist;

    mindist = (float)MAXVALUE;
    for (nm = 0; nm < numMixes; nm++) {
        dist = 0;
        for (nc = 0; nc < NUMMEL; nc++) {
            mdiff = frm[nc] - means[m*numMixes+nm][nc];
            dist = dist + mdiff*mdiff/vars[m*numMixes+nm][nc];
        }
        if (dist < mindist) {
            mindist = dist;
            argmin[0] = nm;
        }
    }
    return mindist;
}
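A decision stage around such a routine can be sketched as follows. This is a hypothetical driver, not part of the thesis code: `dist_fn` stands in for a `FindNearest`-style distance, and `toy_dist` is a dummy distance used only to exercise the loop. The language model that accumulates the smallest total distance over all frames of the file is the one selected:

```c
#include <float.h>

#define NUMMEL 2    /* 14 mel coefficients in the thesis; reduced for the sketch */

/* Per-frame distance of a frame to the nearest mixture of model m. */
typedef double (*dist_fn)(int m, const double *frm);

/* Accumulate the nearest-mixture distance of every frame under each
 * language model and return the index of the model with the smallest
 * total distance. */
int identify(dist_fn dist, int nmodels, double frames[][NUMMEL], int nframes)
{
    int best = 0;
    double best_total = DBL_MAX;
    for (int m = 0; m < nmodels; m++) {
        double total = 0.0;
        for (int t = 0; t < nframes; t++)
            total += dist(m, frames[t]);
        if (total < best_total) {
            best_total = total;
            best = m;
        }
    }
    return best;
}

/* Dummy distance for illustration: model 1 is closest for every frame. */
static double toy_dist(int m, const double *frm)
{
    (void)frm;
    return (m == 1) ? 0.1 : 1.0;
}
```

In the full system this total is combined with the n-gram terms before the comparison across languages.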
APPENDIX D: HTK COMMANDS FOR HMM GENERATION
HCopy
D:\heteke\bin> HCopy -C .\config.txt -S ..\script\parmAllFiles.list -T 1
HInit
G:\heteke\bin> HInit -S ..\script\initFiles.scp -l a -L ..\label\initLabs -M ..\HInitMFC\hmmA
..\script\proto.txt
HRest
G:\heteke\bin> HRest -S ..\script\initFiles.scp -l a -L ..\label\initLabs -M ..\hmm\hmm0
..\hmm\hmmA\a.hmm
HERest
G:\heteke\bin> HERest -S ..\script\trainFiles.scp -I ..\label\all.mono.mlf -H
..\hmm\hmm01\model1.gmm -M ..\hmm\hmm02 ..\lib\monophones.list
HHEd
G:\heteke\bin> HHEd -H ..\hmm\hmm6\model1.gmm -M ..\hmm\hmm7 ..\script\mixsplit.hed
..\lib\monophones.list
HVite
for bi-gram statistics:
D:\heteke\bin> HVite -H ..\hmm\hmm21\model8.gmm -S ..\script\train\train_EN.list -T 1 -i
..\stats\ogi_EN.out -w ..\lib\net\mono.net ..\lib\dic\mono.dic ..\lib\monophones.list
for final evaluations:
D:\heteke\bin> HVite -H ..\hmm\hmm21\model8.gmm -S ..\script\test\stb_EN.list -T 1 -i
..\result\ogi_EN.out -w ..\lib\net\net_EN.net ..\lib\dic\mono.dic ..\lib\monophones.list
APPENDIX E: NET FILE OF HTK FOR PHONEME RECOGNITION
VERSION=1.0
N=30 L=840
I=0  W=!NULL   I=29 W=!NULL
I=1  W=a   I=2  W=b   I=3  W=c   I=4  W=C   I=5  W=d   I=6  W=e   I=7  W=f
I=8  W=g   I=9  W=h   I=10 W=I   I=11 W=i   I=12 W=j   I=13 W=k   I=14 W=l
I=15 W=m   I=16 W=n   I=17 W=o   I=18 W=O   I=19 W=p   I=20 W=r   I=21 W=s
I=22 W=S   I=23 W=t   I=24 W=u   I=25 W=U   I=26 W=v   I=27 W=y   I=28 W=z
J=0   S=0   E=1
J=1   S=0   E=2
J=2   S=0   E=3
. . .
J=26  S=0   E=27
J=27  S=0   E=28
J=28  S=1   E=29
J=29  S=2   E=29
. . .
J=55  S=28  E=29
J=56  S=1   E=1    l=-11.51293
J=57  S=1   E=2    l=-6.16081
. . .
J=83  S=1   E=28   l=-8.93340
J=84  S=2   E=1    l=-7.14164
. . .
J=838 S=28  E=27   l=-11.51293
J=839 S=28  E=28   l=-7.2286
REFERENCES
Adda-Decker, M., 2000, Towards Multilingual Interoperability in Automatic Speech
Recognition, Spoken Language Processing Group Report, LIMSI, France.
Arslan, L. M., and J. H. L. Hansen, 1997, “Frequency Characteristics of Foreign Accented
Speech”, ICASSP, pp. 1123-1126.
Cimarusti, D., and R. B. Ives, 1982, “Development of an Automatic Identification System
of Spoken Languages: phase I”, ICASSP, pp. 1661-1663.
Dalsgaard, P., O. Andersen, H. Hesselager and B. Petek, 1996, “Language Identification
using Language Dependent Phonemes and Language Independent Speech Units”,
ICSLP, pp. 1808-1811.
Goodman, F. J., A. F. Martin and R. E. Wohlford, 1989, “Improved Automatic Language
Identification in Noisy Speech”, ICASSP, Vol. 1, pp. 528-531.
Greenberg, S., 2001, “What are the Essential cues for Understanding Spoken
Language?”, Presented at the 141st Meeting of the Acoustical Society of America,
Chicago, Vol. 2.
Hazen, T. J. and V. W. Zue, 1993, “Automatic Language Identification using a Segment-Based
Approach”, Eurospeech, Vol. 2, pp. 1303-1306.
Hess, W., 1983, Pitch Determination of Speech Signals, Springer-Verlag, New York,
USA.
Hieronymus, J. L. and S. Kadambe, 1997, “Robust Spoken Language Identification using
Large Vocabulary Speech Recognition”, ICASSP, Vol. 2, pp. 1111-1114.
Higgins et al., 1993, ICASSP, Vol. 2, pp. 275-278.
House, A. S. and E. P. Neuburg, 1977, “Toward Automatic Identification of the Language
of an Utterance. I. Preliminary Methodological Considerations”, J. Acoust. Soc.
Amer., Vol. 62, pp. 708-713.
Kadambe, S. and J. L. Hieronymus, 1995, “Language Identification with Phonological and
Lexical Models”, ICASSP, Vol. 5, pp. 3507-3511.
Ladefoged, P., 1962, Elements of Acoustic Phonetics, The University of Chicago Press.
Lamel, L. F. and J. L. Gauvain, 1993, “Cross-Lingual Experiments with Phone
Recognition”, ICASSP, Vol. 2, pp. 507-510.
Li, K. -P., 1994, “Automatic Language Identification Using Syllabic Spectral Features”,
ICASSP, Vol. 1, pp. 297-300.
Linde, Y., A. Buzo, and R. M. Gray, 1980, “An Algorithm for Vector Quantizer Design”,
IEEE Trans. Commun., COM-28, Vol. 1, pp. 84-95.
Mendoza et al., 1996, “Automatic Language Identification using Large Vocabulary
Continuous Speech Recognition”, ICASSP, Vol. 2, pp. 785-788.
Muthusamy, Y. K., R. A. Cole and B. T. Oshika, 1992, “The OGI Multi-Language
Telephone Speech Corpus,” Proceedings International Conference on Spoken
Language Processing 92, Banff, Alberta, Canada.
Muthusamy, Y. K., N. Jain and R. A. Cole, 1994, “Perceptual Benchmarks for Automatic
Language Identification”, ICASSP, Vol. 1, pp. 333-336.
Nakagawa, S., Y. Ueda and T. Seino, 1992, “Speaker-Independent, Text-Independent
Language Identification by HMM”, ICASSP, Vol. 2, pp. 1011-1014.
Pellegrino, F., J. Farinas and R. André-Obrecht, 1999, “Vowel System Modeling: A
Complement to Phonetic Modeling in Language Identification”, Proceedings of the
ESCA-NATO Workshop on Multi-Lingual Interoperability in Speech Technology
(MIST), Leusden, The Netherlands, pp. 119-124.
Picone, J., 1993, “Signal Modeling Techniques in Speech Recognition”, Proceedings of
the IEEE.
Schultz, T. and A. Waibel, 1998, “Language Independent and Language Adaptive Large
Vocabulary Speech Recognition”, ICASSP, Vol. 5, pp. 1819-1823.
Sugiyama, M., 1991, “Automatic Language Recognition using Acoustic Features”,
ICASSP, Vol. 2, pp. 813-816.
Yapanel, Ü., T. İslam, M. U. Doğan and H. Palaz, 2001, TURTEL Database Technical
Report, TÜBİTAK-UEKAE.
Young et al., 2000, The HTK Book, Cambridge University Engineering Department.
Wong, E. and S. Sridharan, 2001, “Comparison of Linear Prediction Cepstrum Coefficients
and Mel-Frequency Cepstrum Coefficients for Language Identification”, Proceedings
of 2001 International Symposium on Intelligent Multimedia, Video and Speech
Processing, pp. 95-98.
Zissman, M. A., 1993, “Automatic Language Identification using Gaussian Mixture and
Hidden Markov Models”, ICASSP, Vol. 2, pp. 399-402.
Zissman, M. A. and K. M. Berkling, 2001, “Automatic Language Identification”, Speech
Communication, Vol. 35, pp. 115-124.
Zissman, M. A. and E. Singer, 1994, “Automatic Language Identification of Telephone
Speech Messages using Phoneme Recognition and N-gram Modeling”, ICASSP, Vol.
1, pp. 305-308.