
Building A Highly Accurate Mandarin Speech Recognizer with Language-Independent Technologies and Language-Dependent Modules

M. Hwang, G. Peng, M. Ostendorf
{mhwang,gpeng,mo}@ee.washington.edu

University of Washington, Seattle, WA

W. [email protected]

SRI International, Menlo Park, CA

A. [email protected]

ICSI, UC Berkeley, CA

A. [email protected]

National Taiwan University, Taipei, Taiwan

UWEE Technical Report
Number UWEETR-2008-0020
May 2008

Department of Electrical Engineering
University of Washington
Box 352500
Seattle, Washington 98195-2500
PHN: (206) 543-2150
FAX: (206) 543-3842
URL: http://www.ee.washington.edu


Abstract

We describe in detail a system for highly accurate large-vocabulary Mandarin automatic speech recognition (ASR), a collaborative success among four research organizations. The prevailing hidden Markov model (HMM) based technologies are essentially language independent and constitute the backbone of our system. These include minimum-phone-error (MPE) discriminative training and maximum-likelihood linear regression (MLLR) adaptation, among others. Additionally, careful consideration is given to Mandarin-specific issues, including lexical word segmentation, tone modeling, pronunciation design, and automatic acoustic segmentation.

Our system comprises two acoustic models (AMs) with significant differences but similar accuracy, for the purpose of cross adaptation. The outputs of the two sub-systems are then rescored by an adapted n-gram language model (LM). Final confusion network combination (CNC) [1] yielded a 9.1% character error rate (CER) on the DARPA GALE [2] 2007 official evaluation, the best Mandarin ASR system that year. In this paper, we illustrate the experimental contributions of the various technical components. We have learned that most components are equally important (although some techniques are clearly better, such as MPE rather than MLE training), and that fortunately their merits do not completely overlap. Error analysis on the GALE 2007 evaluation set reveals that performance on broadcast news (BN) is impressive (3.4%), yet further improvement is needed for broadcast conversational (BC) speech in dealing with spontaneous speech phenomena such as hesitations, re-starts, filled pauses, cross talk, etc. We will explore this direction further in our future work.


1 Introduction

With China’s rapid economic growth and Mandarin being one of the most widely spoken languages in the world, many applications demand a highly accurate Mandarin automatic speech recognizer. Under the DARPA GALE Project, this paper seeks to achieve such a goal on broadcast news and broadcast conversational speech. Note that the broadcast news genre traditionally consists of “talking head” style broadcasts, i.e., generally one person reading a news script. The broadcast conversation genre, by contrast, is more interactive and spontaneous, consisting of talk shows, interviews, call-in programs and roundtable discussions. We will show that the core technologies developed for English ASR are applicable to a new language such as Mandarin. However, to achieve the best performance, one needs to add language-dependent components, which for Mandarin include extraction of tone-related features, word segmentation, phonetic pronunciation, and optimization of automatic acoustic segmentation. We have published our work on extraction of tone-related features in [3], [4], [5], and [6]. This paper elaborates our word segmentation algorithm and the design philosophy of our pronunciation phone sets and acoustic segmentation, along with experimental results. The core speech technologies include discriminative training criteria with discriminative features, multiple-pass unsupervised cross adaptation between two different feature front ends, and language model adaptation. We will discuss how we design two different acoustic models to achieve comparable accuracy for optimal cross adaptation and system combination. Furthermore, topic-based language model adaptation is applied before confusion network system combination.

This paper starts with a description of the acoustic and language model training data used in building the system, and the lexical word segmentation algorithm. Then we summarize our decoding architecture, which illustrates the need for two complementary sub-systems; in this setup, one can also clearly see where LM adaptation fits. The next three sections describe the key components of the system: Section 4 elaborates the improvement we made in automatically segmenting long recordings into utterances of a few seconds. Section 5 describes the two front-end features that are used to build two complementary acoustic models. Section 6 discusses the LM adaptation approach we adopted and other possible varieties.

Finally, Section 7 presents our experimental results and demonstrates the relative improvements of various components. In Section 8 we summarize our contribution and future work.

2 Speech and Text Corpora

2.1 Acoustic Data

In this paper, we use about 866 hours of speech data collected by the LDC, including the corpora of the Mandarin Hub4 (30 hours), automatically filtered TDT4 (89 hours), and GALE Phase 1 release (747 hours, up to shows dated before 11/1/2006 per GALE evaluation restriction), for training our acoustic models. Chronologically, they span from 1997 through July 2006, from TV and radio shows on CCTV, RFA, NTDTV, PHOENIX, ANHUI, and so on. They include both BN and BC types of speech. The TDT4 data do not have manual transcriptions associated with them – instead, only closed captions. We use a flexible alignment algorithm to filter out bad segments where the search paths differ significantly from the closed captions [7]. After the filtering, we keep 89 hours of data for training. Table 1 summarizes our acoustic training data.

Table 1: Acoustic training data.

Corpus        Year           Duration
Hub4          1997           30 hrs
TDT4          2000-2001      89 hrs
GALE Phase 1  2004-10/2006   747 hrs
Total         1997-2006      866 hrs

We use three different data sets for various studies during our development: the DARPA EARS RT-04 evaluation set (Eval04), the DARPA GALE 2006 evaluation set (Eval06), and the GALE 2007 development set (Dev07).1 Once the system parameters are finalized based on these development test sets, we then apply the settings to the GALE 2007 evaluation set (Eval07). All data sets are selected by LDC, from segments of different shows, as summarized in Table 2, where segments are mostly a few minutes long each, but occasionally can be as long as 20 minutes as in Eval04.

1 The Dev07 set used here is the IBM-modified version, not the original LDC-released version.

Table 2: Summary of all test sets. The first three are used for system development. The last one is the evaluation set for the final system.

Data    Year/Month  #Shows  BC       BN
Eval04  2004/04     3       0        1 hr
Eval06  2006/02     24      1 hr     1.16 hrs
Dev07   2006/11     74      1.5 hrs  1 hr
Eval07  2006/11,12  83      1 hr     1.25 hrs

2.2 Text Data, Word Segmentation and Lexicon

Our text corpora come from a wider range of data. In addition to the transcripts of the acoustic training data, we add the LDC Chinese TDT2, TDT3, TDT4 corpora (LDC2001T57, LDC2001T58, LDC2005T16), Chinese News Translation Text Part 1 (LDC2005T06), Multiple-Translation Chinese Corpus Part 1, 2, and 3 (LDC2002T01, LDC2003T17, and LDC2004T07), the Chinese Gigaword corpus (LDC2005T14), all GALE-related Chinese web text releases with news dated before 11/1/2006, 183M characters of National Taiwan University (NTU) Web-downloaded BN transcripts and newswire text, 33M words of BN and newswire web data collected by Cambridge University, and the Mandarin conversational telephone speech (CTS) ASR system LM training data described in [8], including 1M words of acoustic transcripts and 300M words of Web-downloaded conversational-style and topic-related data.

Before applying word segmentation and LM training, we first perform text normalization on the Chinese text data to remove HTML tags, get rid of phrases with bad or corrupted codes, convert full-width punctuation and letters to half-width, and convert numbers, dates and currencies into their verbalized forms in Chinese (English phrases and sentences mixed in Chinese text are also normalized, including verbalization of non-standard words). Word fragments, laughter, and background noise transcriptions are mapped to a special garbage word, which is treated as a lexical word.

To leverage English ASR technologies for Mandarin directly, our Mandarin ASR system is based on “word” recognition. We believe word-based ASR is better than character-based ASR because of its longer acoustic and language contexts. Chinese word segmentation aims at inserting boundaries between sequences of Chinese characters to form grammatical and meaningful Chinese sentences. Starting from the 64,000-word BBN-modified LDC Chinese word lexicon, we manually augment it with 20,000 new words (both Chinese and English words) over time from various sources of frequent names and word lists. For word segmentation, we begin with a simple longest-first match (LFM) algorithm with this 80K-word list to segment all training text. The most frequent 60,000 words are then selected as our lexicon to train an n-gram LM, with all out-of-vocabulary (OOV) words mapped to the garbage word. In total, after word segmentation, the LM training corpora comprise around 1.4 billion running words. Among them, 11 million words are from the BC genre.
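For concreteness, the following is a minimal sketch of a greedy longest-first match segmenter; the function name, the maximum word length, and the single-character fallback are illustrative assumptions rather than the exact implementation used here.

    def longest_first_match(text, lexicon, max_word_len=8):
        """Greedy longest-first segmentation: at each position, take the
        longest lexicon entry that matches; otherwise fall back to a
        single character (an OOV, later mapped to the garbage word)."""
        words, i = [], 0
        while i < len(text):
            for n in range(min(max_word_len, len(text) - i), 0, -1):
                candidate = text[i:i + n]
                if n == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += n
                    break
        return words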

Longest-first word segmentation does not take context into account and sometimes makes segmentation errors that result in wrong or difficult-to-interpret semantics, as the following examples show:

(English) The reporter learned, from the Chinese National Planning Committee, that ...
(LFM)    �V ,¥ ) )����ÊÌ ��\  Üç (wrong)
(1gram)  �V , ¥) )����ÊÌ ��\  Üç (correct)

(English) The Green Party and Qin-Min Party reached a consensus.
(LFM)    Ì�j Z* Ìj HÄ á# (wrong)
(1gram)  Ì�j Z *Ìj HÄ á# (correct)

To improve word segmentation, we use the initial n-gram LM trained on the LFM-segmented text to re-segment the training text, based on a maximum-likelihood (ML) search. A lower-order n-gram is preferred for the re-segmentation if the initial n-gram is ML-trained on the same text, so that there is a better chance of escaping the locally optimal segmentation. In our case, we used the unigram portion to re-segment the same training text. Having done this, the only possible out-of-vocabulary (OOV) words now are OOV English words and OOV single-character Chinese words. We did not add these OOV single-character words to our decoding lexicon because (a) adding them would only increase n-gram perplexity and acoustic confusability among the existing vocabulary, and (b) they are so rare that they are not worth the increased recognition complexity. In our experience, ML word segmentation results in only slightly better perplexity and usually translates to no further improvement in recognition. However, we believe that ML word segmentation more often offers a semantically better segmentation, which helps downstream applications such as machine translation (MT).2

2 A personal communication with Y-C Pan at National Taiwan University indicated that their ML word segmentation improved character recognition accuracy from 74.42% using LFM to 75.01%.
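As a minimal sketch, the ML re-segmentation can be viewed as a Viterbi search over character positions that maximizes the sum of unigram log-probabilities; the function and parameter names below (e.g. unigram_logprob, oov_logprob) are illustrative assumptions rather than the actual implementation.

    import math

    def ml_segment(text, unigram_logprob, max_word_len=8, oov_logprob=-20.0):
        """Viterbi search for the segmentation maximizing the total unigram
        log-probability; best[i] is the best score for text[:i], back[i]
        the start of the last word.  Unknown multi-character strings are
        disallowed; unknown single characters get a flat OOV penalty."""
        n = len(text)
        best = [0.0] + [-math.inf] * n
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                word = text[j:i]
                logp = unigram_logprob.get(
                    word, oov_logprob if i - j == 1 else -math.inf)
                if best[j] + logp > best[i]:
                    best[i], back[i] = best[j] + logp, j
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))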

After the ML word segmentation, we re-train our n-gram LMs with the modified Kneser-Ney smoothing method [9] using the SRILM toolkit [10]. N-grams of the garbage word are also trained.

Table 3 lists the sizes of the full n-gram LMs and their pruned versions. The pruned n-grams (qLMn) are used in fast decoding, as will be explained in Section 3. For example, our full 4-gram LM has 316 million 3-gram entries and 201 million 4-gram entries. To estimate the difficulty of the ASR task, we extract a subset of Dev07 where there is no OOV word with respect to the 60K lexicon, for computing perplexity. All noises labeled in the reference are removed when computing perplexity. This Dev07-IV (in-vocabulary) set contains about 44K Chinese characters, accounting for 99% of the full Dev07 set.

Table 3: Numbers of entries in n-gram LMs. qLMn are the highly pruned versions of the full n-grams. Dev07-IV is a subset of Dev07 which contains no OOV word.

Language Model  #bigrams  #trigrams  #4-grams  Perplexity (Dev07-IV)
qLM3            6M        3M         —         379.8
qLM4            19M       24M        6M        331.2
LM3             38M       108M       —         325.7
LM4             58M       316M       201M      297.8

3 Decoding Architecture

Figure 1 illustrates the flowchart of our recognition engine. We will briefly describe the architecture in this section, while details will be presented in the next three sections.

3.1 Acoustic Segmentation and Feature Normalization

GALE test data come with per-show recordings. However, only specified segments (usually a few minutes long per segment) in each recording need to be recognized. The file that contains the specified segments is called the Unpartitioned Evaluation Map (UEM) file. Instead of feeding the minutes-long segment into our decoder, we refine each segment into utterances a few seconds long, separated by long pauses, and run utterance-based recognition. This allows us to run a wide-beam search and keep rich search lattices per utterance. We call this first step the acoustic segmentation step.

Next we perform automatic speaker clustering using Gaussian mixture models of static MFCC features and K-means clustering. We call these speakers auto speakers. Note that the number of speakers in the show is unknown. Therefore, we limit the average number of utterances per speaker by an empirical threshold, tuned on the development sets. Vocal tract length normalization (VTLN) is then performed for each auto speaker, followed by utterance-based cepstral mean normalization (CMN) and cepstral variance normalization (CVN) on all features of the models used in the subsequent decoding passes. Speaker boundary information is also used in later steps for speaker-dependent acoustic model adaptation.
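As a small illustration, utterance-based CMN/CVN amounts to standardizing each feature dimension over the frames of an utterance; the sketch below is a generic formulation under that assumption, not our exact front-end code.

    import numpy as np

    def cmn_cvn(features):
        """Per-utterance cepstral mean and variance normalization:
        subtract the mean and divide by the standard deviation of each
        feature dimension over the utterance (features: frames x dims)."""
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        return (features - mean) / np.maximum(std, 1e-8)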



Figure 1: System decoding architecture. Single arrows represent the best word sequence output, while block arrows represent top N-best word sequences. [The flowchart runs: acoustic segmentation → auto speaker clustering → VTLN/CMN/CVN → MLP-SI (qLM3) → PLP-SA (MLLR, LM3) → MLP-SA (MLLR, LM3); the N-best outputs of the PLP-SA and MLP-SA passes are each adapted/rescored with qLM4 and then fed to the confusion network combination.]

3.2 Search with Trigrams and Cross Adaptation

Referring to Figure 1, the decoding is composed of three trigram recognition passes:

1. MLP-SI Search: Using the highly pruned trigram LM, we start a quick search with a speaker-independent (SI) within-word (non-cross-word, nonCW) triphone MPE-trained acoustic model, based on MFCC cepstrum features, pitch features, and multi-layer perceptron (MLP) generated phoneme posterior features. This model is denoted the MLP model, which will be elaborated in Section 5.1. This step can generate a good initial adaptation hypothesis quickly.

2. PLP-SA Search: Next we use the above hypothesis to learn the speaker-dependent SAT feature transform and to perform MLLR adaptation [11] per speaker, on the cross-word (CW) triphone MPE-trained model, with PLP cepstrum features and pitch features. This PLP model and the SAT transform will be explained in Section 5.2.

After the acoustic model is speaker adapted (SA), we then run full-trigram decoding to produce an N-best list for each utterance.3

3. MLP-SA Search: Similar to the PLP-SA Search, we run cross adaptation first, using the trigram hypothesis from the PLP-SA Search to adapt the CW triphone MPE-trained MLP model, followed by full-trigram decoding to produce another N-best list. This model has the same feature front end as the one at the MLP-SI step.

3.3 LM Adaptation and Confusion Network Combination

Based on topic-dependent LM adaptation, the N-best list generated by the PLP-SA Search is used to adapt the 4-gram LM on a per-utterance basis. All of the N-best hypotheses of an utterance are then rescored by this adapted 4-gram LM. Finally all the words in each N-best hypothesis are split into character sequences, with the adapted acoustic score and adapted language score unchanged.

Next we apply the same 4-gram adaptation procedure to the N-best list generated by the MLP-SA Search. Finally a character-level confusion network is obtained by combining both rescored character N-best lists, and the most likely character in each column of the network is output as the system output.
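A minimal sketch of the final voting step is given below; it assumes the two character-level confusion networks have already been built and aligned bin-by-bin following [1], and the data representation (a list of character-to-posterior dicts per system) and the weights are illustrative assumptions.

    from collections import defaultdict

    def combine_confusion_networks(cn_a, cn_b, weight_a=0.5, weight_b=0.5):
        """Combine two aligned character-level confusion networks by
        weighted posterior voting per bin; '*DELETE*' marks an empty slot.
        Returns the most likely character sequence."""
        output = []
        for bin_a, bin_b in zip(cn_a, cn_b):
            scores = defaultdict(float)
            for ch, p in bin_a.items():
                scores[ch] += weight_a * p
            for ch, p in bin_b.items():
                scores[ch] += weight_b * p
            best = max(scores, key=scores.get)
            if best != "*DELETE*":
                output.append(best)
        return "".join(output)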

3 For various legacy and computation reasons, the actual implementation is to use a pruned bigram to dump word lattices first, and then expand the bigram lattices into full trigram lattices, from which we then extract N-best lists.


4 Acoustic Segmentation

In error analysis of our previous system, we discovered that deletion errors were particularly serious. Deletion errors not only degrade ASR performance, but are particularly bad for downstream processing such as machine translation; their effect is worse than that of insertion errors. Our garbage word is modeled acoustically by a noise phone rej. We found that some of the deletion errors were caused by false-alarm garbage words; that is, speech was mapped to garbage. To control these garbage false alarms, we introduce a garbage penalty into the decoder, which is successful in removing some deletion errors.

However, most of our deletion errors came from speech segments dropped by faulty acoustic segmentation. Therefore, we attempt to improve acoustic segmentation so that fewer speech segments are dropped while new insertion errors are simultaneously avoided [12].

4.1 Previous Segmenter

Our previous segmenter is a speech/silence detector applied to a long recording segment. A recognizer is run with a finite state grammar as shown in Figure 2(a). There are three words in the vocabulary of the decoder: silence, noise, and speech, whose pronunciations are shown in the figure. Each pronunciation phone (bg, rej, fg) is modeled by a 3-state HMM, with 300 Gaussians per state. The HMMs are ML-trained on Hub4, with 39-dimensional features of MFCC and its first-order and second-order differences. The segmenter operates without any knowledge of the underlying phoneme sequence contained in the speech waveform. More seriously, due to the pronunciation of speech, each speech segment is defined as having at least 18 consecutive fg phones, which forces any speech segment to have a minimum duration of 540 ms, given that our front end computes one feature vector every 10 ms.

Figure 2: The finite state grammar of our acoustic segmenter: (a) previous segmenter, (b) new segmenter. [Panel (a) loops between start and end over the words silence /bg bg/, noise /rej rej/, and speech /fg18+/; panel (b) loops over silence /bg bg/, noise /rej rej/, man1 /I1 F/, man2 /I2 F/, and eng /forgn forgn/.]

After speech/silence is detected, segments composed of only silence and noises are discarded. Then, if the segmenter judges that two consecutive segments are from the same speaker (based on MFCC Gaussian-mixture models) and the pause in between is very short (under a certain threshold), it attempts to merge short speech segments into utterances that are at most 9 seconds long. The speaker boundary decision algorithm here is similar to, but finer-grained than and independent of, the auto-speaker clustering step in Figure 1.

4.2 New Segmenter

Our new segmenter, shown in Figure 2(b), makes use of broad phonetic knowledge of Mandarin and models the input recording with five words: silence, noise, a Mandarin character with a voiceless initial, a Mandarin character with a voiced initial, and a non-Mandarin word. Altogether there are 6 distinct HMMs (bg, rej, I1, F, I2, forgn) for speech/silence detection, and the minimum speech duration is reduced to 60 ms. Except for the finite-state grammar and the pronunciations, the rest of the segmentation process remains the same. As shown later in Section 7.1, we are able to recover most of the discarded speech segments via the new finite state grammar and the new duration constraint.


5 Two Acoustic Systems

As illustrated in Figure 1, a key component of our system is cross adaptation and system combination between two sub-systems. We seek to create two sub-systems having approximately the same error rate but with error behaviors as different as possible, so that they will compensate for each other. [13] showed three discriminative techniques that are effective in reducing recognition errors: MLP features [14], the MPE (MPFE) [15], [16] discriminative learning criterion, and the fMPE feature transform [17]. Table 4, cited from [13], shows that the three discriminative techniques are additive, under a speaker-independent recognition setup. However, combining all three yields minimal further improvement compared with combining only two of them, especially after unsupervised adaptation. In designing our two acoustic systems, we therefore decide to choose the most effective combinations of two techniques: MLP+MPE and fMPE+MPE, where the most effective single technique, MPE training, is always applied. Furthermore, to diversify the error behaviors, we specifically design a new pronunciation phone set for the second system, in the hope that it will behave better on BC speech. The differences between our two acoustic systems are summarized in Table 5.

Table 4: Contribution of three discriminative techniques. The experiments were conducted on an English speaker-independent ASR task.

MLP   MPE   fMPE   WER
—     —     —      17.1%
yes   —     —      15.3%
—     yes   —      14.6%
—     —     yes    15.6%
yes   yes   —      13.4%
yes   —     yes    14.7%
—     yes   yes    13.9%
yes   yes   yes    13.1%

Table 5: Differences of our two acoustic models.

              System-MLP            System-PLP
feature dim   74 (MFCC+pitch+MLP)   42 (PLP+pitch)
fMPE          no                    yes
# of phones   72                    81

5.1 System-MLP

5.1.1 The Pronunciation Phone Set

The first system uses 70 phones for pronunciations, case-sensitive and inherited from the BBN dictionary with minor fixes. Vowels usually have 4 distinct tones, accounting for 4 different phones (or tonemes, for tonal phonemes), although some rare tonemes such as I2 are not modeled. Neutral tones are not modeled in this phone set either; instead, the third tone is used to represent the neutral tone. The pronunciation rule follows the main-vowel idea proposed in [18] for reducing the size of the phone set and easy parameter sharing. Table 6 lists some key syllables for a better understanding of the conversion from Pinyin to the main-vowel phonetic pronunciation. Notice that for the Pinyin syllable finals /an/ and /ai/, we use the capital phone /A/ to distinguish them from other /a/ vowels. I2 is modeled by I1, as I2 does not exist in this phone set.

In addition to the 70 phones, there is one phone designated for silence, and another one, rej, for modeling all noises, laughter, and foreign (non-Mandarin) speech. Both the silence phone and the noise phone are context-independent. The pronunciation of the lexical garbage word is two or more rej phones. All HMMs have the 3-state Bakis topology without skipping arcs.


Table 6: Main-vowel phonetic pronunciations for Chinese syllables.

Sample character  Pinyin  Phonetic pronunciation
²                 ba      b a3
�                 kang    k a2 NG
�                 kan     k a3 N
�                 yao     y a4 W
�                 bian    b y A1 N
�                 ai      A4 Y
ÿ                 e       e2
û                 ben     b e1 N
�                 beng    b e1 NG
                  er      er2
�                 ye      y E1
�                 chui    ch w E1 Y
·                 wo      w o3
�                 you     y o3 W
�                 yi      y i1
O                 yin     y i1 N
]                 ying    y i1 NG
X                 bu      b u4
�                 zhi     zh IH1
Ï                 chi     ch IH1
4                 shi     sh IH4
�                 ri      r IH4
ý                 zi      z I1
�                 ci      c I1
�                 si      s I1
ß                 chong   ch o1 NG
Á                 jiong   j y o3 NG
í                 juan    j v A1 N
�                 xu      x yu1
¬                 yu      v yu4
ß                 yun     v e2 N

5.1.2 The HMM Observation Features

The front-end features consist of 74 dimensions per frame (about 10 ms), including

• 13-dim MFCC cepstra, and their first- and second-order derivatives;

• Spline smoothed pitch feature [5], and its first- and second-order derivatives;

• 32-dim phoneme-posterior features generated by multi-layer perceptrons (MLPs) [14, 13].

The MLP feature is designed to provide discriminative phonetic information at the frame level. Its generation involves the following three main steps, as illustrated in Figure 3. In all MLPs mentioned in this paper, all hidden units use the sigmoid output function:

y_j = \sum_i w_{ij} x_i , \qquad o_j = \frac{1}{1 + e^{-a(y_j + b)}}


where x_i denotes the MLP input; w_{ij} is the weight from input i to hidden unit j; a and b are constants. All output units, however, use the softmax output function:

z_k = \sum_j h_{jk} o_j , \qquad p_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}

where h_{jk} is the weight from hidden unit j to output unit k, so that the output p_k is guaranteed to be between 0 and 1.

1. The Tandem feature generated by one MLP
For each input frame, we first concatenate its neighboring 9 frames of PLP and pitch features as the input to an MLP. That is, there are 42 × 9 = 378 input units in this MLP. Each output unit of the MLP models the likelihood of the central frame belonging to a certain phone, given the 9-frame intermediate temporal acoustic evidence. We call this vector of output probabilities the Tandem phoneme posteriors. Particularly in our case, there are 15,000 hidden units and 71 output units in the Tandem MLP. The noise phone is excluded from the MLP output because it is presumably not a very discriminable class: it is aligned to many kinds of noises and foreign speech, not any particular phonetic segment. The acoustic training data are Viterbi-aligned using an existing acoustic model to identify the target phone label for each frame, which is then used during MLP back-propagation training. The data for the noise phone are excluded from MLP training.

2. The HATs feature generated by a network of MLPs
Next, we separately construct a two-stage network structure where the first stage contains 19 MLPs and the second stage is a single MLP. The purpose of each MLP in the first stage, with 60 hidden units each, is to identify a different class of phonemes, based on the log energy of a different critical band across a long temporal context (51 frames ≈ 0.5 seconds). Suppose the log energy of frequency band f at time t is e_t^{(f)}, a scalar. The input to the f-th MLP of the first stage is e_{t-25}^{(f)}, e_{t-24}^{(f)}, ..., e_t^{(f)}, e_{t+1}^{(f)}, ..., e_{t+25}^{(f)}. From [14], it is more advantageous to feed forward to the second stage the output from the hidden units rather than the output of the output-layer units, probably because the softmax function mitigates the discriminative power of the output layer.

The second stage of MLP processing then combines the information from all of the hidden units (60 × 19 = 1140) from the first stage to make a grand judgment on the phoneme identity for the central frame. This merger MLP has 8,000 hidden units. The output of the second stage is called the HATs phoneme posteriors (hidden activation temporal patterns) and is illustrated in Figure 3.

3. Combining Tandem and HATs
Finally, the 71-dim Tandem and HATs posterior vectors are combined using the Dempster-Shafer [19] algorithm. The values in the combined vector still range from 0 to 1. We then apply a logarithm to make the posterior distributions appear more Gaussian-like. Principal component analysis (PCA) is then applied to the log posteriors to (a) make each dimension independent, as our HMM models use Gaussian mixtures with diagonal covariances, and (b) reduce the dimensionality from 71 to 32. The 32 dimensions of phoneme posteriors are then appended to the MFCC and pitch features. This system with 74-dim features is thus referred to as System-MLP because of its use of the MLP features.
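As a rough illustration of the last step, the sketch below applies the log and PCA projection to an already-combined posterior matrix; the Dempster-Shafer combination itself is not shown, and the function name and flooring constant are illustrative assumptions.

    import numpy as np

    def project_posteriors(combined, n_components=32):
        """Take combined Tandem/HATs phoneme posteriors (frames x 71),
        apply a log to make them more Gaussian-like, then reduce to
        n_components dimensions with PCA so the resulting dimensions are
        approximately decorrelated (suitable for diagonal covariances)."""
        logp = np.log(np.maximum(combined, 1e-10))   # floor to avoid log(0)
        centered = logp - logp.mean(axis=0)
        # PCA via SVD of the centered data; rows of vt are principal axes
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:n_components].T        # frames x n_components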

MLP feature extraction as described in this paper differs from our 2006 system [20] in three principal ways: (1) the amount of training data is doubled, (2) pitch features are added to the Tandem MLP input layer, and (3) to combine the two posterior feature streams (Tandem and HATs), we use the Dempster-Shafer theory of evidence rather than inverse-entropy weighted summation.

While there is no data like more data, this introduces considerable practical complications for MLP training, since the online training algorithm cannot be easily parallelized across multiple machines. To address this, we optimize our multi-threaded QuickNet code to run on an 8-core server in a quasi-online batched mode: each network update is performed using the feed-forward error signal accumulated from a batch of 2048 randomized input frames. Additionally, we decrease the training time by partitioning the training data and applying the learning schedule described in [21].


Figure 3: The HATs feature, computed using a two-stage MLP. Notice the output from the hidden units of the first-stage MLPs is the input to the second-stage MLP.

5.1.3 Model Training

Except for the Hub4 training corpus, where we use the hand-labeled speaker information, we apply the same automatic speaker clustering algorithm as in Figure 1 to all other training corpora, followed by speaker-based VTLN/CMN/CVN.

Then, following the flowchart shown in Figure 4, we first train nonCW triphone models with the 74-dim MLP feature, using the ML objective function. Next, through unigram recognition, we generate phone lattices on the training data, which are used for MPE training [15, 16]. The MPE training updates only Gaussian means, while keeping the ML Gaussian covariances and mixture weights intact. The nonCW MPE model is used at Step MLP-SI.

With the nonCW triphone ML model,4 we learn the speaker-dependent SAT feature transform based on 1-class constrained MLLR [22, 11], or cMLLR. Assume the SI Gaussian is N(\mu, \Sigma) and the speaker-dependent transform is W_k and d_k for speaker k, such that N(W_k \mu + d_k, W_k \Sigma W_k^T) is the SA Gaussian. Then the log of the density value of the SA Gaussian distribution is

\ln N(x; W_k \mu + d_k, W_k \Sigma W_k^T) = -\ln|W_k| + \ln N(y; \mu, \Sigma)    (1)

where

y = W_k^{-1}(x - d_k) = A_k x + b_k    (2)

with

A_k = W_k^{-1}, \quad b_k = -W_k^{-1} d_k    (3)

Since there is only one transform per speaker, the constant -\ln|W_k| in Formula (1) is canceled during training while computing the occupancy count. It is also irrelevant during time-synchronous decoding, as it does not affect the relative merit of any search path of the same time length. Therefore we can apply the learned SAT transform in the feature domain, as Formula (2) shows, and keep the model parameters intact. As shown in Formula (2), the SAT feature transform is a full 74x74 + 74 transformation. It is applied to train the CW triphone ML model, which is then used with a unigram decoder to generate CW triphone lattices on the training data for the purpose of training a CW triphone MPE model. The CW triphone SAT ML model is also used to learn the speaker-dependent MLLR model transforms at Step MLP-SA depicted in Figure 1. Notice this MLLR transform usually contains multiple regression classes, depending on the amount of SA data, and is applied in the model domain of the MPE model during the full trigram search.
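A minimal sketch of applying the learned cMLLR/SAT transform in the feature domain (Formula (2)) follows; the function name and array shapes are illustrative assumptions.

    import numpy as np

    def apply_sat_transform(features, A_k, b_k):
        """Apply the per-speaker feature transform y = A_k x + b_k to every
        frame of an utterance; features is (frames x dim), A_k is
        (dim x dim), b_k is (dim,).  Model parameters stay untouched."""
        return features @ A_k.T + b_k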

The last column of Figure 4 is not used by System-MLP and will be explained in the next section.

4 The ML model is used because SAT is an ML learning criterion.


Figure 4: The acoustic model training process.

[The figure shows three training columns. Column 1 (features x_t): nonCW ML model → unigram recognition → phone lattices → MPE training → nonCW MPE model; cMLLR on the nonCW ML model yields the SAT transforms A_k, b_k. Column 2 (SAT features y_t): ML training → CW ML model → unigram recognition → phone lattices → MPE training → CW MPE model. Column 3 (features z_t): smaller ML training → smaller CW ML model h → fMPE training → fMPE transform M.]

All models use decision-tree based HMM state clustering [23] to determine Gaussian parameter sharing. In our system, there are 3500 shared states, each with 128 Gaussians. This model size is denoted as 3500x128.

5.2 System-PLP

The second acoustic system contains 42-dimensional features with static, first- and second-order derivatives of PLP and pitch features. Similar to System-MLP, we first train a 3500x128 nonCW triphone ML model and use it to learn the speaker-dependent SAT feature transforms (dimension 42x42 + 42). Next, referring to the third column in Figure 4, we ML-train a smaller (3500x32) CW triphone model with the SAT transforms, for the purpose of learning the fMPE [17] feature transform. A smaller model is adopted for its computational efficiency.

5.2.1 fMPE Feature Transform

From [13], the addition of an fMPE feature transform [17] on top of MLP features with MPE training brought marginal further improvement on SI recognition, and hardly any after unsupervised acoustic adaptation. Therefore, we decide to leave fMPE out of the MLP system. However, in order for the PLP system to compete with the MLP models, which have a stronger feature representation, the fMPE feature transform is particularly useful.

For each time frame, we compute all Gaussian density values of 5 neighboring frames, given the 3500x32 smaller CW SAT ML model. This gives us a Gaussian posterior vector, h_t, of 3500 × 32 × 5 = 560K dimensions if context-independent phone models are excluded. The final feature used is

z_t = (A_k x_t + b_k) + M h_t

where M is on the order of 42 × 560K dimensions, is learned through an MPE criterion [17], and is speaker-independent.
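For illustration, the per-utterance computation can be sketched as below, under the assumption that the Gaussian posterior vectors h_t are already computed and stacked into a matrix; the names and shapes are illustrative, not the actual implementation.

    import numpy as np

    def fmpe_features(x, A_k, b_k, M, h):
        """Compute z_t = (A_k x_t + b_k) + M h_t for all frames of an
        utterance: x is (frames x 42) PLP+pitch features, (A_k, b_k) the
        per-speaker SAT transform, h (frames x D) the Gaussian posterior
        vectors, and M (42 x D) the speaker-independent fMPE transform."""
        return (x @ A_k.T + b_k) + h @ M.T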

Finally, with the z_t features, we train 3500x128 CW triphone ML and MPE models. Both models are used at Step PLP-SA: the ML model is used to learn MLLR transforms, which are applied to the MPE model for the full-trigram search.

5.2.2 The Pronunciation Phone Set

To make the two acoustic systems behave more differently and to focus on the more difficult BC speech, we design a different pronunciation phone set aimed at modeling conversational speech. Since BC speech tends to be spoken more quickly than narrative speech, we first introduce a few diphthongs in the PLP model as shown in Table 7, where vowels with no tone represent all four tones. Combining two phones into one reduces the minimum duration requirement by half and hence is likely better for fast speech. The addition of diphthongs also naturally removes the need for the syllable-ending Y and W sounds.

Second, the 72-phone set does not model the neutral tone (tone 5); instead, the third tone is used as a replacement. As there are a few very common characters carrying the neutral tone, we add three neutral-tone phones for them.

Furthermore, we add the phone /V/ for the v sound in English words, as this phone is missing in Mandarin but not difficult at all for Chinese speakers to pronounce accurately. For the two common filled-pause characters (ã, F), we use two separate phones to model them individually, so that they do not share parameters with regular Chinese words. As there is not much training data for V and the filled pauses, we make these three phones context-independent, indicated by the asterisks in Table 7.

Finally, to keep the size of the new phone set manageable, we merge /A/ into /a/, and both /I/ and /IH/ into /i/. We rely on triphone modeling to distinguish these allophones of the same phoneme. The 72-phone set does not model the somewhat rare phone /I2/ as in # ; instead, it uses /I1/ in that case. With /I2/ represented by /i2/, the second tone of the non-retroflex /i/ sound is now modeled correctly.

In sum, the new phone set has 81 base phones, including three extra context-independent phones.

Table 7: Difference between the 72-phone and 81-phone sets. Asterisks indicate context-independent phones.

Example    phone-72   phone-81
�          a W        aw
ð          E Y        ey
�          o W        ow
�          a Y        ay
�          A N        a N
'          I          i
Ú          IH         i
ê          e3         e5
m          a3         a5
�          i3         i5
victory    w          V∗
ã          o3         fp o∗
F          e3 N       fp en∗

6 Topic-Based Language Model Adaptation

After we obtain two N-best lists for each utterance, we attempt to update the language score with adapted higher-order n-gram probabilities.

We perform topic-based language model adaptation using a Latent Dirichlet Allocation (LDA) topic model [24, 25]. The topic inference algorithm takes as input a weighted bag of words w (e.g., in one topic-coherent story) and an initial topic mixture θ_0, and returns a topic mixture θ. During training, we label the topic of each individual sentence to be the one with the maximum weight in θ, and add the sentence to this topic's corpus. We then use the resulting topic-specific corpora to train one n-gram LM per topic [26]. The general LMs trained in Table 3 are called topic-independent background LMs.

During decoding, we infer the topic mixture weights dynamically for each utterance; we select the top few most relevant topics above a threshold, and use their weights in θ to interpolate with the background language model.

In order to make topic inference more robust against recognition errors, we weight the words in w based on an N-best-list derived confidence measure; additionally, we include words not only from the utterance being rescored but also from surrounding utterances in the same story chunk via a decay factor, where the words of distant utterances are given less weight than those of nearer utterances. As a heuristic, utterances that are in the same show and less than 4 seconds apart are considered to be part of the same story chunk. The adapted n-gram is then used to rescore the N-best list.
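As a minimal sketch of the interpolation step, assume topic inference has produced a mixture θ (topic id → weight); the threshold, background weight, and LM accessor below are illustrative assumptions rather than our actual settings.

    import math

    def topic_interpolation_weights(theta, threshold=0.05, background_weight=0.5):
        """Keep topics whose inferred weight exceeds the threshold and
        renormalize them to share (1 - background_weight); the background
        LM keeps background_weight.  Returns {lm_name: weight}."""
        selected = {t: w for t, w in theta.items() if w >= threshold}
        if not selected:
            return {"background": 1.0}
        total = sum(selected.values())
        weights = {"background": background_weight}
        for t, w in selected.items():
            weights[t] = (1.0 - background_weight) * w / total
        return weights

    def adapted_logprob(word, history, lms, weights):
        """Linearly interpolate word probabilities from the background and
        selected topic LMs; lms[name].prob(word, history) is hypothetical."""
        p = sum(wt * lms[name].prob(word, history) for name, wt in weights.items())
        return math.log(p)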

7 Experimental Results

7.1 Acoustic Segmentation

Tables 8 and 9 show the CERs with different segmenters at the MLP-SI Search step and the PLP-SA Search step, respectively, on Eval06. The error distributions and our manual error analysis both show that the main benefit of the new segmenter is in recovering lost speech segments and thus in lowering deletion errors. However, those lost speech segments are usually of lower speech quality and therefore lead to more substitution and insertion errors. For comparison, we also show the CERs with the oracle segmentation as derived from the reference transcriptions. These results show that our segmenter is very competitive.

Table 8: CERs at Step MLP-SI on Eval06 using different acoustic segmenters.

Segmenter           Sub    Del    Ins    Overall
Previous segmenter  9.7%   7.0%   1.9%   18.6%
New segmenter       9.9%   6.4%   2.0%   18.3%
Oracle segmenter    9.5%   6.8%   1.8%   18.1%

Table 9: CERs at Step PLP-SA on Eval06 using different acoustic segmenters.

Segmenter           Sub    Del    Ins    Overall
Previous segmenter  9.0%   5.4%   2.0%   16.4%
New segmenter       9.2%   4.8%   2.1%   16.1%
Oracle segmenter    8.8%   5.3%   2.0%   16.1%

7.2 MLP Features

Since Mandarin is a tonal language, it is well known that adding pitch information helps with speech recognition [3]. For this reason, we investigate adding pitch to the input of the Tandem neural nets. For quick verification, we used Hub4 to train nonCW triphone ML models. Table 10 shows the SI bigram CER performance on Eval04. Pitch information obviously provides extra information for both the MFCC front end and the Tandem front end.

Table 10: SI bigram CERs on Eval04, using 30 hours of acoustic training data for within-word triphone ML models.

HMM Feature       MLP Input  CER
MFCC              —          24.1%
MFCC+F0           —          21.4%
MFCC+F0+Tandem    PLP        20.3%
MFCC+F0+Tandem    PLP+F0     19.7%

To compare the impact of different phoneme posterior combination methods, we trained all neural networks with the 866 hours of training data. Then, to have a fast turnaround, we trained two nonCW triphone ML models with 98 hours of data (Hub4 and a subset of TDT4): one model was trained using the MLP feature combined by the inverse entropy method, the other by Dempster-Shafer. Table 11 suggests the superiority of the Dempster-Shafer approach, where the first column shows speaker-independent recognition and the second column shows recognition with unsupervised MLLR adaptation using the first-pass output.

Table 11: Bigram CERs on Eval04, using different methods of combining Tandem and HATs features. The acoustic models were within-word triphones, ML-trained on 98 hours of data.

Tandem+HATs      SI-pass  SA-pass
Inverse Entropy  17.6%    16.5%
Dempster-Shafer  17.0%    16.4%

7.3 Cross Adaptation Using Outside Regions

To increase the amount of unsupervised adaptation data, we also decode those speech segments outside the UEM-specified range. In particular, we decode 60 seconds before and after each specified testing range. If any utterance in these outside regions is classified as one of the auto speakers in the inside region, it is then added to the MLLR adaptation data.

The top three rows of Table 12 show how CERs change as we increase the amount of unsupervised acoustic adaptation data, on Dev07. Acoustic models are obtained using all 866 hours of training data. Dev07-0-0 means no outside region is used in either acoustic segmentation or adaptation. Dev07-60-0 means the ±60s outside regions are used during acoustic segmentation, but not during AM adaptation. As our acoustic segmenter is affected by the surrounding context, the acoustic segmentation from Dev07-0-0 and that from Dev07-60-0 can be different. Dev07-60-60 means the outside regions are used in both acoustic segmentation and AM adaptation. Note that Dev07-60-0 and Dev07-60-60 thus have identical acoustic segmentation.

Unfortunately, it seems that most of the improvement comes from better acoustic segmentation rather than from more AM adaptation data, perhaps because the speaker boundary information is not very accurate, or maybe because the MLLR regression classes need to be learned using a more sophisticated approach.5

5 Currently the MLLR regression classes are fixed at 3 or 4 classes (silence/noises, consonants, and vowels).

Table 12: Decoding progress on Dev07. All AMs are speaker-adapted. The top three rows use the full trigram for decoding. All AMs are trained on 866 hours of speech.

Config        PLP-SA  MLP-SA  CNC
Dev07-0-0     12.4%   –       –
Dev07-60-0    12.1%   –       –
Dev07-60-60   12.0%   11.9%   –
adapted qLM4  11.7%   11.4%   11.2%
static LM4    11.9%   11.7%   11.4%

7.4 Pronunciation Phone Sets

Table 13 shows the CERs on Dev07 with the two different phone sets, using the Dev07-60 acoustic segmentation from Table 12. To perform a fair comparison, two nonCW triphone PLP models are ML-trained with the 866 hours of data: one with the 81-phone set and the other with the 72-phone set. These comparisons were conducted with the SI models and the pruned trigram in Table 3.

A careful analysis reveals that the improvement in the BC portion from the 81-phone set is completely due to the reduction in deletion errors. Therefore, despite the modest overall improvement, the new phone set achieves our goal of generating different error patterns.



Table 13: qLM3 CERs on Dev07 using different phone sets. Both models are nonCW PLP models ML-trained on 866 hours of speech.

Model      BN     BC      Avg
phone-81   7.6%   27.3%   18.9%
phone-72   7.4%   27.6%   19.0%

7.5 Adaptation on Language Models

Due to memory constraints, we are unable to adapt the full LM4. Instead, we train 64 topic-dependent 4-grams and interpolate them with qLM4 in Table 3. All N-best hypotheses of an utterance are used to compute the topic mixture weights θ, and the most relevant topics (those whose weights in θ are above a threshold) are then selected and interpolated with qLM4, on a per-utterance basis. The adapted 4-gram is then applied to rescore the N-best list of each utterance. The result is shown in the fourth row of Table 12. Compared with the full static 4-gram in the next row, the adapted 4-gram is slightly but consistently better.

7.6 System Combination

Finally, a character-level confusion network combination of the two rescored N-best lists yields an 11.2% CER on Dev07, as shown in the last column of Table 12. When this system was officially evaluated on Eval07 in June 2007, we achieved the best GALE Mandarin ASR error rate of 9.1%,6 as shown in the last row of Table 14.

6 Our official result of 8.9% CER was obtained by CNC with yet another two N-best lists generated by adapting our two systems with RWTH University top-1 hypotheses.

Table 14: Official CERs on Eval07. System parameter settings were tuned on Dev07.

Genre  CER
BN     3.4%
BC     16.3%
Avg    9.1%

8 Contribution and Future Work

This paper presents, step by step, a highly accurate Mandarin speech recognizer. It illustrates the language-independent technologies adopted and details the language-dependent modules that make our system stand out.

Anecdotal error analysis on Dev07 shows that diphthongs did help in examples such as ðL (/b ey3 d a4/, Beijing University), and merging /A/ and /a/ was not harmful. But merging /I/ and /IH/ into /i/ seemed to cause somewhat more confusion among characters such as (4, �, �) = (shi, zhi, di). Perhaps we need to reverse that last decision.

We have recently investigated a technique for improving the quality of the MLP-based features, adjusting the network’s outputs for training data using knowledge of the Viterbi-aligned labels [27]. Also, we have been addressing the practical issues involved with processing very large datasets; because MLP back-propagation training is an online procedure, the synchronization of updates makes it difficult to distribute the task over a compute cluster. We have therefore been trying to port our efficient software implementation to new parallel computing architectures, such as specialized graphics cards with up to 128 processing units.

The topic-based LM adaptation yielded a small, but not quite satisfactory, improvement. Further refinement in the algorithm and in the implementation is needed to adapt the full 4-gram and obtain greater significance. Our previous study [20] showed that full re-recognition with the adapted LM offered more improvement than N-best rescoring, yet the computation was expensive. A lattice or word-graph re-search is worth investigating.

Finally, the system still has a much higher error rate on BC speech than on BN. Cross-talk is definitely a major source of errors, with little hope of recovery unless multiple channels of recordings are supplied. Interjections and fast speech are common in BC but have not received much attention in our system design. There is also not much BC text available to train conversation-specific language models. Web crawling of more conversations is strongly needed. Alternatively, if disfluencies can be discovered, removed, and/or repaired, LM context may become more powerful. For example, if we were able to remove the filled pauses, shown in brackets in the following recognition output, even n-gram LMs trained mostly on BN data would become more powerful, and the translation to another language would become easier as well.

é �/D { pF )ñ [ YÇ] �v P� { [ YÇ] � �:
made [this] a looser [this] law interpretation regarding the dispute on the special expenses

Here [this] plays a linguistic role similar to “you know” in English conversational speech. We are working along all of the above directions.

Acknowledgment

The authors would like to express their gratitude to SRI International for providing Decipher as the backbone of this study, and particularly for all of the technical support from A. Stolcke and Z. Jing. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References

[1] L. Mangu, E. Brill, and A. Stolcke, “Finding consensus in speech recognition: Word error minimization and other applications of confusion networks,” Computer Speech and Language, pp. 373–400, 2000.

[2] “Global Autonomous Language Exploitation (GALE),” http://www.darpa.mil/ipto/programs/gale/.

[3] M.Y. Hwang, X. Lei, T. Ng, I. Bulyko, M. Ostendorf, A. Stolcke, W. Wang, and J. Zheng, “Progress on Mandarin conversational telephone speech recognition,” in International Symposium on Chinese Spoken Language Processing, 2004.

[4] X. Lei, M.Y. Hwang, and M. Ostendorf, “Incorporating tone-related MLP posteriors in feature representation for Mandarin ASR,” in Proc. Interspeech, 2005, pp. 2981–2984.

[5] X. Lei, M. Siu, M.Y. Hwang, M. Ostendorf, and T. Lee, “Improved tone modeling for Mandarin broadcast news speech recognition,” in Proc. Interspeech, 2006.

[6] X. Lei and M. Ostendorf, “Word-level tone modeling for Mandarin speech recognition,” in Proc. ICASSP, 2007.

[7] A. Venkataraman, A. Stolcke, W. Wang, D. Vergyri, V. Gadde, and J. Zheng, “An efficient repair procedure for quick transcriptions,” in Proc. ICSLP, 2004.

[8] T. Ng, M. Ostendorf, M.Y. Hwang, M. Siu, I. Bulyko, and X. Lei, “Web data augmented language models for Mandarin conversational speech recognition,” in Proc. ICASSP, 2005, pp. 589–592.

[9] S. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Science Group, Harvard University, TR-10-98, 1998.

[10] A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Seventh International Conference on Spoken Language Processing, 2002.

[11] M.J.F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, 1998.

[12] G. Peng, M.Y. Hwang, and M. Ostendorf, “Automatic acoustic segmentation for speech recognition on broadcast recordings,” in Proc. Interspeech, 2007.


[13] J. Zheng, O. Cetin, M.Y. Hwang, X. Lei, A. Stolcke, and N. Morgan, “Combining discriminative feature, transform, and model training for large vocabulary speech recognition,” in Proc. ICASSP, 2007.

[14] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in LVCSR using neural networks,” in Proc. ICSLP, 2004.

[15] D. Povey and P.C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in Proc. ICASSP, 2002.

[16] J. Zheng and A. Stolcke, “Improved discriminative training using phone lattices,” in Proc. Interspeech, 2005, pp. 2125–2128.

[17] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Discriminatively trained features for speech recognition,” in Proc. ICASSP, 2005.

[18] C.J. Chen et al., “New methods in continuous Mandarin speech recognition,” in Proc. Eur. Conf. Speech Communication Technology, 1997, vol. 3, pp. 1543–1546.

[19] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” in Proc. ICASSP, 2007.

[20] M.Y. Hwang, W. Wang, X. Lei, J. Zheng, O. Cetin, and G. Peng, “Advances in Mandarin broadcast speech recognition,” in Proc. Interspeech, 2007.

[21] Q. Zhu, A. Stolcke, B.Y. Chen, and N. Morgan, “Using MLP features in SRI’s conversational speech recognition system,” in Proc. Interspeech, Lisbon, 2005.

[22] T. Anastasakos, J. McDonough, and J. Makhoul, “Speaker adaptive training: A maximum likelihood approach to speaker normalization,” in International Conference on Spoken Language Processing, 1996.

[23] M.Y. Hwang, X.D. Huang, and F. Alleva, “Predicting unseen triphones with senones,” in Proc. ICASSP, 1993, pp. 311–314.

[24] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence, 1999.

[25] D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, pp. 993–1022, 2003.

[26] A. Heidel and L.S. Lee, “Robust topic inference for latent semantic language model adaptation,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2007.

[27] A. Faria and N. Morgan, “Corrected Tandem features for acoustic model training,” in Proc. ICASSP, 2008.
