
Analyzing Single and Combined G2P Converter Outputs over

Training Data

Studienarbeit am Cognitive Systems Lab
Fakultät für Informatik

Karlsruher Institut für Technologie

von

cand. inform. Wolf Quaschningk

Betreuer:

Dipl.-Inform. Tim Schlippe
Prof. Dr.-Ing. Tanja Schultz

Tag der Anmeldung: 02. Mai 2013
Tag der Abgabe: 31. Juli 2013

KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft www.kit.edu


Ich erkläre hiermit, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Karlsruhe, den 31. Juli 2013


Abstract

In this work we present the combination of grapheme-to-phoneme (G2P) converter outputs in order to improve the outputs gained by single G2P converters. This is beneficial where only low resources are available to train the single converters.

Our experiments are conducted with English, German, French and Spanish, and it is shown that in most cases the combination outperforms the single converters. Especially with only little training data, the impact of the fusion is high, which shows its great importance for under-resourced languages. We detected that the output of G2P converters built with web-derived word-pronunciation pairs can further improve pronunciation quality.

Altogether we achieved improvements with the combination in 41 of 60 scenarios. Furthermore, we used our improved G2P converter outputs for building speech recognizers in two different languages, English and German.

Kurzfassung

In dieser Arbeit kombinieren wir Graphem-zu-Phonem (G2P) Konverter-Ausgaben, um Ausgaben zu verbessern, die durch einzelne G2P Konverter erzeugt wurden. Das ist von Vorteil, wenn wenig Trainingsmaterial zur Verfügung steht, um die Konverter zu trainieren.

Unsere Experimente werden mit Englisch, Deutsch, Französisch und Spanisch durchgeführt und es wird gezeigt, dass in den meisten Fällen die Kombination die einzelnen Konverter verbessert. Insbesondere mit wenig Trainingsmaterial ist der Einfluss der Fusion groß, was ihr großes Potential für Sprachen mit wenig Trainingsmaterial offenbart. Wir fanden heraus, dass die Ausgabe von G2P Konvertern, die mithilfe von Wort-Aussprache-Paaren aus dem Web erzeugt wurden, nochmals die Qualität der Aussprache verbessern kann.

Alles in allem erzielten wir in 41 von 60 Fällen eine Verbesserung durch die Kombination. Zusätzlich verwendeten wir unsere verbesserten G2P Konverter-Ausgaben, um Spracherkenner in zwei verschiedenen Sprachen, Englisch und Deutsch, zu bauen.


Contents

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Structure

2 Related Work

3 Experimental Setup
  3.1 Data
  3.2 G2P Converters
    3.2.1 Sequitur G2P
    3.2.2 Phonetisaurus
    3.2.3 Moses
    3.2.4 CART Tree
    3.2.5 Rules
  3.3 Combination Scheme
    3.3.1 ROVER-based Combination Scheme
      3.3.1.1 System Architecture Overview
      3.3.1.2 Alignment Process
      3.3.1.3 Voting Scheme
    3.3.2 Combination Script nbest-lattice
  3.4 Analysis of the different G2P Converter Outputs

4 Experiments and Results
  4.1 Analysis of the G2P Converters' Output
  4.2 Combining the G2P Converters' Output
  4.3 Web-derived Pronunciations
  4.4 ASR Experiments

5 Conclusion and Future Work

Bibliography


1. Introduction

1.1 Motivation

With more than 6,500 languages in the world [15], the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The task of finding the pronunciation of a word given its written form is called grapheme-to-phoneme (G2P) conversion.

The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) systems. They are used to train speech processing systems by describing the pronunciation of words in terms of manageable units, typically phonemes [14]. Pronunciation dictionaries can also be used to build generalized G2P models for the purpose of providing pronunciations for words that do not appear in the dictionary [23].

It would be easy to generate a pronunciation dictionary if there were a direct, simple G2P relationship in all languages. In fact, in most languages the association between graphemes and phonemes is ambiguous and irregular. This is why the manual production of dictionaries can be time-consuming and expensive. To overcome these difficulties, knowledge-based and data-driven G2P conversion approaches for automatic dictionary generation have been introduced.

As pronunciation dictionaries are so fundamental to speech processing systems, much care has to be taken to create a dictionary that is as free of errors as possible. For ASR systems, faulty pronunciations in the dictionary may lead to incorrect training of the system and consequently to a system that does not function to its full potential. Flawed dictionary entries can originate from G2P converters with shortcomings.

1.2 Goals

Our goal is to investigate if a combination of G2P converter outputs outperforms the single converters. This is particularly important for the rapid bootstrapping of speech processing systems if not many manually created example word-pronunciation pairs are available and therefore a single G2P converter performs poorly.

We combine the G2P converter outputs based on a voting scheme at the phoneme level. Our motivation is that the converters are reasonably close in performance but at the same time produce outputs that differ in their errors. This provides complementary information which, in combination, leads to performance improvements.

With the phoneme error rate (PER), we evaluate how close the resulting pronunciations come to pronunciations which have been successfully used in speech processing projects. For training the G2P converters, we selected different amounts (200, 500, 1k, 5k, 10k) of English, French, German, and Spanish word-pronunciation pairs. We chose particularly small amounts of word-pronunciation pairs to simulate scenarios of under-resourced languages. Such languages normally have few high-quality pronunciation pairs available, because the manual production of these pairs is expensive and time-consuming.

Furthermore, we investigate the impact of the combined pronunciations on ASR performance.
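The phoneme error rate used for this evaluation can be computed with a standard Levenshtein alignment at the phoneme level. The following is a minimal sketch (the function name and example pronunciations are ours, not from the thesis; substitutions, insertions and deletions are weighted equally):

```python
def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref),
    computed with a phoneme-level Levenshtein alignment."""
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# Reference vs. hypothesis pronunciation as phoneme lists:
ref = ["K", "EH", "9r", "ZH"]
hyp = ["K", "AE", "ZH"]
print(phoneme_error_rate(ref, hyp))  # 1 substitution + 1 deletion -> 0.5
```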

1.3 Structure

Section 2 gives a short general overview of the evolution of G2P converters and their authors. In Section 3 we first describe the different pronunciation dictionaries used in this work and then introduce the G2P converters in more detail; furthermore, our combination script nbest-lattice, which is central to this work, is presented in this section. In Section 4 we show the results of our experiments: we analyze the outputs of our single G2P converters, then investigate the performance improvements through the combination, and finally check the impact on our speech recognizers. In Section 5 we summarize and conclude our work.


2. Related Work

As it is our goal to gain new G2P converter outputs through a combination, we give a short overview of the development of G2P conversion approaches in the following:

There are two different kinds of G2P conversion approaches: knowledge-based and data-driven. Knowledge-based approaches with rule-based G2P conversion systems can typically be expressed as finite-state automata [10] [3]. Often, these methods require specific linguistic skills and exception rules formulated by human experts. In contrast, data-driven approaches are based on the idea that, given enough examples, it should be possible to predict the pronunciation of unseen words purely by analogy. The benefit of the data-driven approach is that it trades the time- and cost-consuming task of designing rules, which requires linguistic knowledge, for the much simpler one of providing example pronunciations.

[3] propose Classification and Regression Trees (CART) for the G2P task. To build CARTs, a pronunciation dictionary is taken as a word list and rules are derived that make the best predictive split of this list. This is done recursively on subsets until a decision tree is generated. We use the Text-to-Phoneme Converter Builder (t2p) [3] in our work.

In [12], the alignment between graphemes and phonemes is generated using a variant of the Baum-Welch expectation maximization algorithm. [5], [19] and [21] use a joint-sequence model. In our work, we use the Sequitur G2P converter [4], which also works with joint-sequence models. This state-of-the-art converter uses a modified expectation-maximization algorithm to simultaneously learn G2P alignment and G2P chunk information. While decoding in the approach of [4] may be slow, [16] and [8] utilize weighted finite-state transducers (WFSTs) as a representation of the joint-sequence model for decoding. We choose Phonetisaurus [8] in our work. [7], [2], and [11] apply statistical machine translation (SMT)-based methods to G2P conversion. To work with this SMT approach, we use Moses [18] for our experiments.

A good overview of certain state-of-the-art G2P methods is given in [20]. In theirwork, five G2P converters are compared in order to tackle the G2P conversion task.


Unlike our work, large pronunciation dictionaries in five different languages were used there to train the different G2P converters, whereas we used small pronunciation dictionaries and only open-source G2P converters. [20] conducted experiments with different G2P converters and stated that Sequitur G2P [4] and Phonetisaurus [8] outperform all other analysed G2P converters, which we can confirm with our results. In contrast to us, they combined ASR systems whose dictionaries were generated with different G2P converters (late fusion), which resulted in only very slight improvements. We instead analyze a combination of pronunciation dictionaries generated by different G2P converters at the phoneme level (early fusion) and use the resulting dictionaries in ASR experiments.


3. Experimental Setup

3.1 Data

As our methods should work for languages with different degrees of regularity in the G2P relationship, our experiments are conducted with German (de), English (en), Spanish (es), and French (fr).

G2P accuracy is a measure of the regularity of the G2P relationship of a language: [23] showed that the G2P accuracy for English is very low and for Spanish very high, whereas German and French are located in between.

As manually produced pronunciations as training data for G2P conversion can be expensive, we investigate how well our methods perform on little training data, given different subsets of the qualified word-pronunciation pairs.

For evaluating our G2P conversion methods, we use GlobalPhone dictionaries for German and Spanish as reference data, since they have been successfully used in Large Vocabulary Conversational Speech Recognition (LVCSR) [24]. GlobalPhone is a multilingual speech and text corpus that covers speech data from 20 languages, including Arabic, Bulgarian, Chinese (Mandarin and Shanghai), Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Thai, Turkish, and Vietnamese. For all languages, the GlobalPhone pronunciation dictionaries cover the vocabulary words in the transcripts and were manually cross-checked after a rule-based creation process.

For French, we employ a dictionary developed within the Quaero Project. The English dictionary is based on the CMU dictionary¹.

For each language, we randomly selected 10k word-pronunciation pairs from the dictionary for testing and a further disjoint 10k word-pronunciation pairs for a development set, which was used to determine optimal parameters for our voting scheme. From the remainder, we extracted 200, 500, 1k, 5k, and 10k word-pronunciation pairs for training, to investigate how well our methods perform on small amounts of data. To evaluate the quality of the G2P converters' outputs, we apply them to the words in the test set and compute their phoneme error rate (PER) against the original pronunciations.

¹ http://www.speech.cs.cmu.edu/cgi-bin/cmudict

For the ASR experiments, we replace the pronunciations in the dictionaries of our German and English GlobalPhone-based speech recognizers with pronunciations generated by the G2P converters. Then we use them to build and decode ASR systems.
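The disjoint test/dev/train selection described above can be sketched as follows. This is a simplified illustration under our own assumptions (the function name is ours, and whether the thesis nests the training subsets inside each other is not stated there):

```python
import random

def split_dictionary(pairs, seed=0):
    """Randomly select disjoint test, dev, and training subsets
    from a list of (word, pronunciation) pairs."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    test = shuffled[:10000]
    dev = shuffled[10000:20000]          # disjoint from the test set
    remainder = shuffled[20000:]
    # Training subsets of increasing size from the remainder:
    train = {n: remainder[:n] for n in (200, 500, 1000, 5000, 10000)}
    return test, dev, train
```

Because test, dev, and training pairs come from non-overlapping slices of one shuffled list, no test word can leak into G2P training.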

As we are able to find English, French, German and Spanish word-pronunciation pairs in Wiktionary², we additionally built a G2P converter from these data for each language. First, we naively extracted word-pronunciation pairs without any filtering (5-30k phoneme tokens with corresponding grapheme tokens, to reflect a saturated G2P consistency [23]). Second, we filtered them before building the G2P converter, as described in [22]. The number of word-pronunciation pairs gained from these web-derived pronunciations ranges from 1.5k to 5.6k.

3.2 G2P Converters

[4] distinguish the G2P converter approaches shown in Figure 3.1. G2P converters can generally be classified into knowledge-based and data-driven converters. Regarding knowledge-based converters, a naive approach is a manual dictionary look-up to generate pronunciations for given grapheme strings: an expert with linguistic skills learns rules from existing dictionaries and maps the words to their pronunciations manually. This method is effective, but it is also exhausting and costly, particularly for huge dictionaries.

Rule-based conversion systems, as another knowledge-based approach, were developed to overcome the limitations of manual dictionary look-up. These systems are able to generate pronunciations for a complete dictionary, but they have two disadvantages. First, creating rules is time-consuming and may be hard for the rule designers, who need linguistic skills in addition to their technical knowledge. Second, many languages contain irregularities which cannot be captured with standard rules, so the designers must create exception lists covering the irregularities. The relations between rules can be quite complex, so rule designers have to recheck that the output of applying the rules is correct in all cases. This makes development and maintenance of rule-based conversion systems hard in practice. Furthermore, a rule-based G2P converter is still likely to make mistakes for words which were not considered by the rule designer. We decided to use this rule-based approach by creating hand-crafted rules (see Section 3.2.5).

Compared to the knowledge-based methods above, data-driven approaches need no linguistic experts. The philosophy is that the system is fed with data examples in order to train a model which produces pronunciations for unseen words purely by analogy. This is beneficial because the hard and time-consuming task of manual rule design is skipped.

When choosing data examples to train a model, it is worth considering the application domain of the dictionary. In practice, available pronunciation dictionaries, which typically cover the most frequent words of the language, are used to train data-driven G2P models.

² http://www.wiktionary.org

Figure 3.1: Classification of our single G2P converters according to [4]

In general, three different types of data-driven G2P conversion approaches can be distinguished: pronunciation by analogy, local classification, and probabilistic approaches. As local classification method, we chose t2p [3], described in Section 3.2.4. [4] define approaches with nearest-neighbor techniques as "pronunciation by analogy"; however, the term can also be applied to all data-driven approaches. When transcribing unseen words, this technique scans the training pronunciation dictionary and searches for known grapheme string patterns. This technique goes beyond local classification approaches.

The technique based on local classification works by checking the input grapheme sequence from left to right. Based on the context of the current grapheme, a phoneme is predicted. This method is called local classification because every decision is made for each character before stepping to the next one. On the one hand, this is not an optimal solution because it only works locally; on the other hand, it avoids sophisticated search algorithms for globally optimal solutions. Neural networks and decision trees are representative methods for local classification.

The probabilistic approach uses probabilities to predict its transcriptions. This method also requires alignments, which are created in a probabilistic manner. As can be seen in Figure 3.1, we decided to use Sequitur G2P (Section 3.2.1), Phonetisaurus (Section 3.2.2) and Moses (Section 3.2.3) for this probabilistic approach.

3.2.1 Sequitur G2P

Sequitur G2P [4] was first presented in 2008. It addresses the G2P conversion task with a data-driven approach. The alignment is performed in a probabilistic way, in this case known as the joint-sequence modeling approach. The decision strategy is optimal with regard to word error, which means that the risk of not obtaining the correct pronunciation for an unknown input sequence is minimized.

The basic idea of joint-sequence models is that a common sequence of joint units which carry both input and output symbols generates the relation between input (grapheme) and output (phoneme) sequences. The easiest case is when each unit carries zero or one input and zero or one output symbol, which agrees with the conventional definition of finite state transducers (FSTs). When units carry more than one input and output symbol, the unit is called a co-sequence, joint multigram, or grapheme-phoneme joint multigram (graphone). A graphone is called singular when it carries at most one grapheme and one phoneme. The grouping of the grapheme and phoneme sequences into an equal number of segments is called segmentation or co-segmentation. This segmentation into graphones is not unique, so a word with its corresponding pronunciation can be segmented into different numbers of graphones, as shown in Figure 3.2 and Figure 3.3.

Figure 3.2: Pronunciation of mixing with sequence of four graphones

Figure 3.3: Pronunciation of mixing with sequence of seven graphones
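The non-uniqueness of the co-segmentation can be illustrated by enumerating all joint segmentations of a word and its pronunciation into graphones with bounded length. This is a toy sketch under our own assumptions (phonemes written as single characters for brevity; it is not Sequitur's actual training algorithm, which weights segmentations with EM rather than enumerating them):

```python
def co_segmentations(graphemes, phonemes, max_len=2):
    """Enumerate segmentations of a (grapheme, phoneme) pair into
    graphones carrying up to max_len symbols on each side."""
    if not graphemes and not phonemes:
        return [[]]
    results = []
    for g in range(0, min(max_len, len(graphemes)) + 1):
        for p in range(0, min(max_len, len(phonemes)) + 1):
            if g == 0 and p == 0:
                continue  # a graphone must carry at least one symbol
            unit = (graphemes[:g], phonemes[:p])
            for rest in co_segmentations(graphemes[g:], phonemes[p:], max_len):
                results.append([unit] + rest)
    return results

segs = co_segmentations("mix", "mIks")
print(len(segs))  # several distinct segmentations for one word
```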

Sequitur G2P uses a special Expectation-Maximization (EM) algorithm to train its models. Thereby it can simultaneously learn grapheme-to-phoneme alignment and grapheme-to-phoneme chunk information.

[4] show that, regarding the size of graphones, sequence models with little or no context information (n-grams with n ≤ 2) perform better with longer chunks. With more context information the situation is reversed, and the accuracy worsens the bigger the chunks are. With 4-grams and beyond, the performance is better when using only singular graphones. We use Sequitur G2P with 6-grams in our work.

3.2.2 Phonetisaurus

Presented in 2012, Phonetisaurus [8] is, like Sequitur G2P, a data-driven G2P converter and uses a new modified EM algorithm for alignment. [8] introduce three main modifications:

• Only m-to-one and one-to-m arcs are considered during training

• For each input entry, a joint alignment lattice is constructed during initialization and every unconnected arc is deleted

• All arcs are forced to maintain a non-zero weight


Figure 3.4: Steps of model building

The step after the alignment is to convert the aligned sequence pairs into sequences of aligned joint label pairs. Then an n-gram language model (LM) is trained. In the final step, the LM is converted to a weighted finite-state transducer (WFST) (Figure 3.4). Phonetisaurus supports three different decoding schemes:

1. Recurrent Neural Network Language Model (RNNLM) n-best rescoring

2. Lattice Minimum Bayes-Risk (LMBR) decoding for G2P

3. Simple shortest path

Because RNNLM n-best rescoring and LMBR decoding are novel in the context of G2P, they are worth mentioning in connection with Phonetisaurus. However, they were not yet mature at the time of this work and need further improvement. We therefore use the simple shortest-path decoder, which extracts the shortest path through the phoneme lattice created as follows:

ShortestPath(Opt(ProjectOutput(Compose(W,M)))).

Here, ShortestPath extracts the shortest path(s) through the resulting hypothesis lattice; Opt describes optional network optimizations such as determinization and epsilon-removal; ProjectOutput projects onto the output labels (phonemes); Compose names the generic weighted composition algorithm; W is the novel input word and M the joint-sequence G2P n-gram model. In our setup we use 7-grams. In contrast to the other G2P converters in this work, Phonetisaurus is remarkably fast, and this speed does not come at the cost of accuracy, as can be seen in Section 4.
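The final ShortestPath step can be illustrated on a small weighted phoneme lattice. This is our own sketch of shortest-path extraction with Dijkstra's algorithm over weights interpreted as negative log-probabilities; it is not Phonetisaurus' internal implementation, and the lattice weights are invented:

```python
import heapq

def shortest_path(lattice, start, end):
    """Dijkstra over a phoneme lattice: nodes are states, edges are
    (next_state, phoneme, weight) with weights as negative log-probs."""
    heap = [(0.0, start, [])]
    seen = set()
    while heap:
        cost, state, path = heapq.heappop(heap)
        if state == end:
            return cost, path
        if state in seen:
            continue
        seen.add(state)
        for nxt, phoneme, w in lattice.get(state, ()):
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [phoneme]))
    return float("inf"), []

# Toy lattice for the word "cars" (states 0..4, weights made up):
lattice = {
    0: [(1, "K", 0.1)],
    1: [(2, "EH", 0.4), (2, "AE", 0.9)],
    2: [(3, "9r", 0.3), (4, "ZH", 0.2)],
    3: [(4, "ZH", 0.2)],
}
cost, phones = shortest_path(lattice, 0, 4)
print(phones)  # ['K', 'EH', 'ZH']
```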


3.2.3 Moses

Different from all other G2P converters in this work, Moses [18] represents an approach for statistical machine translation (SMT), where the original focus lies on translation between languages. But this system can also be used for G2P conversion when the source language is replaced by graphemes and the target language by phonemes. To train an SMT system, a parallel text corpus is required: a collection of sentences in two different languages which have to be sentence-aligned, so that each sentence in one language matches its corresponding translated sentence in the other language.

During the training process, Moses reads the parallel data and derives analogies between the languages by detecting co-occurrences of words and segments. There are different types of translation such as phrase-based, hierarchical phrase-based or syntax-based machine translation. The last two types add more structure to the correspondences between the two languages; for example, hierarchical machine translation learns changes of the sentence structure which typically occur when one language is translated into another. But this is not necessary for G2P conversion. Therefore, in this work the phrase-based translation is used, where continuous sequences of words are simply linked between the two corresponding languages. This is done in a phrase table which contains the learned translation rules.

Also necessary for the translation system is the LM, a statistical model built from monolingual data which reflects the probability of word orders in the target language. The decoder needs the information from the LM to produce its output. Tuning is the final step in the creation of the SMT system: the different statistical models are weighted against each other to produce the best possible translations before decoding on the final test set is performed.
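The phrase-table idea behind this setup can be sketched for G2P: each word is a "sentence" of graphemes, its pronunciation a "sentence" of phonemes, and translation probabilities are relative co-occurrence frequencies. A toy illustration under our own assumptions (the alignments and example entries are invented; Moses estimates phrase tables from word alignments rather than pre-segmented pairs):

```python
from collections import Counter

def extract_phrases(aligned_pairs):
    """Count grapheme-sequence -> phoneme-sequence co-occurrences from
    pre-aligned training pairs, as in phrase-table estimation."""
    counts = Counter()
    for segments in aligned_pairs:
        for graphemes, phonemes in segments:
            counts[(graphemes, tuple(phonemes))] += 1
    # Relative frequencies as translation probabilities:
    totals = Counter()
    for (g, p), c in counts.items():
        totals[g] += c
    return {(g, p): c / totals[g] for (g, p), c in counts.items()}

# Two aligned German training words (alignments assumed given):
data = [
    [("sch", ["Z"]), ("u", ["u:"]), ("l", ["l"]), ("e", ["@"])],
    [("sch", ["Z"]), ("o", ["o:"]), ("n", ["n"])],
]
table = extract_phrases(data)
print(table[("sch", ("Z",))])  # 1.0
```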

3.2.4 CART Tree

Another approach to grapheme-to-phoneme model building are Classification and Regression Trees (CART), applied to G2P in 1998 [3]. The CART algorithm itself has existed since 1984 [13]. CART creates classification and regression trees for predicting continuous dependent variables (regression) or categorical variables (classification). It is characterized by recursively splitting the feature space into a set of non-overlapping regions in an optimal way. The CART algorithm creates only binary trees, i.e. at every split exactly two branches exist. The algorithm stops when no further gain can be made; alternatively, the data can be split as much as possible and the created tree is pruned afterwards. For our experiments, we use t2p [3]. Here, a graphemic sliding window of three letters to the left and three letters to the right of each central letter is used to create the tree.
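The graphemic sliding window used by t2p can be sketched as the feature extraction step feeding such a tree. A minimal illustration (the function name and padding symbol are ours; the actual tree induction in t2p is more involved):

```python
def windows(word, left=3, right=3, pad="#"):
    """For each letter, extract a context window of `left` letters to the
    left and `right` letters to the right, padding at word boundaries."""
    padded = pad * left + word + pad * right
    feats = []
    for i, letter in enumerate(word):
        center = i + left
        feats.append((padded[center - left:center],             # left context
                      letter,                                   # central letter
                      padded[center + 1:center + 1 + right]))   # right context
    return feats

print(windows("mix"))
# [('###', 'm', 'ix#'), ('##m', 'i', 'x##'), ('#mi', 'x', '###')]
```

Each tuple, paired with the phoneme aligned to the central letter, becomes one training example for the decision tree.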

3.2.5 Rules

In order to gain further G2P converter outputs for our combination, we decided to use simple manual substitution rules. For each language, we inspected all existing graphemes and replaced each grapheme by its most frequent corresponding phoneme. This resulted in 26 one-to-one substitutions for English, 28 for German, 43 for French and 33 for Spanish.

As there are typical grapheme strings in each language which can also be replaced by one phoneme, we considered such substitutions as well. A typical German example is the word "Schule", where the grapheme string "Sch" is mapped to the phoneme "Z". Altogether we used 32 simple rules for English, 37 for German, 61 for French and 36 for Spanish.
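Such substitution rules can be implemented as ordered, longest-match-first replacement, so that a string rule like "sch" wins over the single-grapheme rule "s". A sketch under our own assumptions (the rule entries shown are illustrative, not the thesis' actual language-specific rule sets):

```python
def rule_based_g2p(word, rules):
    """Apply grapheme -> phoneme substitution rules, longest grapheme
    string first, as in a hand-crafted rule converter."""
    ordered = sorted(rules, key=len, reverse=True)  # "sch" before "s"
    phonemes, i = [], 0
    word = word.lower()
    while i < len(word):
        for grapheme in ordered:
            if word.startswith(grapheme, i):
                phonemes.append(rules[grapheme])
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this grapheme: skip it
    return phonemes

# A few illustrative German rules, including the "Sch" example:
rules = {"sch": "Z", "u": "u:", "l": "l", "e": "@", "s": "z", "c": "k", "h": "h"}
print(rule_based_g2p("Schule", rules))  # ['Z', 'u:', 'l', '@']
```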

3.3 Combination Scheme

In order to combine our G2P outputs, we use the combination script nbest-lattice. It applies a voting method at the phoneme level and is based on the ROVER algorithm [9]. Unlike the phoneme-level combination of nbest-lattice, the ROVER method produces a composite ASR output when multiple ASR system outputs are available. As ROVER is fundamental to nbest-lattice, it is described in the following subsections.

3.3.1 ROVER-based Combination Scheme

ROVER stands for Recognizer Output Voting Error Reduction and was presented as early as 1997 [9]. It is a post-recognition method which takes the outputs of different ASR systems and combines them such that the resulting composite ASR output achieves a lower error rate than the individual ASR outputs. In addition, knowledge sources such as confidence scores can be incorporated in order to further reduce the error rate. The combination of the ASR outputs results in a single, minimum-cost word transition network (WTN) generated through iterative applications of dynamic programming. An automatic rescoring or voting scheme then decides which sequence of the resulting word transition network to take, by choosing the best-scoring output sequence.

3.3.1.1 System Architecture Overview

The ROVER system consists of two modules, as shown in Figure 3.5: an alignment module and a voting module [9]. First, the system takes the different ASR outputs and combines them into a single WTN; this alignment process, which uses dynamic programming to generate the network, is described in the following section. The voting module is then responsible for selecting the best transcription by voting for the best-scoring sequence found in the resulting WTN.

Figure 3.5: ROVER system architecture


3.3.1.2 Alignment Process

The alignment process builds the first stage in generating a single composite WTN by gathering multiple hypothesis transcripts from different ASR systems. When aligning more than two WTNs, the optimal solution would be to apply dynamic programming requiring a hyper-dimensional search. In fact, such an algorithm is difficult to implement; thus [9] use an approximate solution which handles the dynamic programming alignment two-dimensionally.

The initial step in the align-and-combine process of several WTNs is to create one WTN for each single ASR output. It is important that these outputs have a linear topology without branches, in order to simplify the combination process. Figure 3.6 shows the linear topology of the initial disjoint WTNs.

Figure 3.6: Topology of WTNs

The first WTN is determined as the base network (WTN-BASE) from which the composite WTN is developed. The second WTN is then aligned to WTN-BASE and expanded with word transition arcs, which results in a sequence of correspondence sets between WTN-BASE and WTN-2.

With the help of the correspondence sets, a new WTN (WTN-BASE') is created by adding the word transition arcs from WTN-2 into WTN-BASE'.

Generally, there are four different rules of how to insert word transition arcs intoWTN-BASE :

1. "Correct", represented by CS2 and CS4 in the example. In this case, a copy of the word transition arc from WTN-2 is added to the corresponding word in WTN-BASE.

2. “Substitution” which is CS3 in the example. Thereby a copy of the word transitionarc from WTN-2 is added to WTN-BASE.

3. “Deletion”. Here a NULL word transition arc is added to WTN-BASE, exemplifiedby CS1.

4. "Insertion", the fourth rule, CS5 in the example. A sub-WTN is created and pushed between the contiguous nodes in WTN-BASE in order to record the fact that WTN-2 inserted a word at this location. The sub-WTN is a two-node WTN that contains a copy of the word transition arc from WTN-2 plus P NULL transition arcs, where P is the number of WTNs previously merged into WTN-BASE. Here P is 1, because this is the first WTN merge in the example.

After WTN-BASE and WTN-2 have been combined into the new WTN-BASE', WTN-3 is merged into WTN-BASE', which results in the final WTN-BASE'' (Figure 3.7):

Figure 3.7: Insertion of word transition arcs
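The pairwise alignment that drives this merge can be sketched as an edit-distance alignment producing correspondence sets, with None standing in for NULL arcs. This is a simplification under our own assumptions (real WTNs carry arcs with confidences, and ties between equally cheap alignments may be resolved differently):

```python
def align(base, hyp):
    """Align two hypothesis sequences with dynamic programming and
    return correspondence sets as (base_symbol, hyp_symbol) pairs,
    using None for the NULL arcs of insertions/deletions."""
    n, m = len(base), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if base[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # Trace back to recover the correspondence sets:
    sets, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if base[i - 1] == hyp[j - 1] else 1):
            sets.append((base[i - 1], hyp[j - 1]))  # correct or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            sets.append((base[i - 1], None))        # deletion: NULL arc for hyp
            i -= 1
        else:
            sets.append((None, hyp[j - 1]))         # insertion: NULL arc for base
            j -= 1
    return sets[::-1]

print(align(["K", "EH", "9r", "ZH"], ["K", "AE", "ZH"]))
```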

3.3.1.3 Voting Scheme

After aligning and creating a composite WTN in the alignment module, the voting module searches the WTN for the best-scoring word sequence. Unfortunately, ROVER does not support using any context information in the voting module; it regards all transition arcs (correspondence sets) out of a graph node as separate and independent entities. There are three different voting schemes:

1. Voting by frequency of occurrence

2. Voting by frequency of occurrence and average word confidence

3. Voting by frequency of occurrence and maximum confidence

The scoring formula is:

Score(w, i) = α · (N(w, i)/Ns) + (1 − α) · C(w, i)

The voting operates on the set of unique word types within a correspondence set CSi. In the equation above, N(w, i) is the frequency of occurrence of word type w in correspondence set i, and C(w, i) is its confidence score. α is a parameter which balances the decision between word frequency and confidence scores. When voting by frequency of occurrence is performed, α is set to 1, ignoring all confidence scores added by the user. N(w, i) is divided by Ns, the number of combined systems, to scale the frequency of occurrence to unity.

Voting by frequency of occurrence and average word confidence uses the confidence scores to compute an average confidence score for each word type, stored in the array C(w, i). In this method, optimal values for the parameters α and Conf(@) are found a priori by minimizing the word error rate on a training set; Conf(@) is a trained parameter used when a NULL transition arc occurs in the composite WTN. The minimization is achieved by quantizing the parameter space into a grid of possible values, which is then scanned for the lowest word error rate.

Voting by frequency of occurrence and maximum confidence uses the confidence scores to find the maximum confidence score for each word type in the array C(w, i). As in the second voting method, α and Conf(@) are trained a priori on the training data using the grid-search algorithm.
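The scoring formula above can be sketched in a few lines of Python. This is a minimal illustration rather than ROVER itself; the correspondence-set representation, the "@" NULL symbol and the per-arc confidences are hypothetical stand-ins, and C(w, i) is taken here as the maximum confidence of a word type.

```python
NULL = "@"  # stand-in for the NULL word transition arc

def vote(corr_sets, num_systems, alpha=1.0):
    """Select the best-scoring word type in each correspondence set.

    corr_sets: list of correspondence sets, each a list of (word, confidence)
    arcs; NULL arcs would carry the trained Conf(@) as their confidence.
    With alpha=1 this degenerates to plain voting by frequency of occurrence.
    """
    best_sequence = []
    for cs in corr_sets:
        scores = {}
        for w in {word for word, _ in cs}:
            n = sum(1 for word, _ in cs if word == w)        # N(w, i)
            c = max(conf for word, conf in cs if word == w)  # C(w, i)
            scores[w] = alpha * n / num_systems + (1 - alpha) * c
        winner = max(scores, key=scores.get)
        if winner != NULL:          # a winning NULL arc emits no word
            best_sequence.append(winner)
    return best_sequence

# Voting by frequency over a composite WTN for the word "cars":
wtn = [
    [("K", 0.9), ("K", 0.8), ("K", 0.7)],
    [("EH", 0.9), ("EH", 0.6), ("AE", 0.4)],
    [("9r", 0.8), ("@", 0.0), ("9r", 0.7)],
    [("ZH", 0.9), ("ZH", 0.8), ("S", 0.5)],
]
print(vote(wtn, num_systems=3))   # ['K', 'EH', '9r', 'ZH']
```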


Converter        N-best hypotheses   Extracted hypothesis
Sequitur         K EH 9r ZH
Phonetisaurus    K EH ZH
CART             K AE ZH             K EH 9r ZH
Moses            K AA 9r S
Rules            K AX 9r S

Table 3.1: Combination of Composite G2P Outputs of the English Word "cars" with the Outvoted Hypothesis

3.3.2 Combination Script nbest-lattice

Nbest-lattice is a script of the SRI Language Modeling Toolkit [25] and generalizes the ROVER algorithm. It combines hypotheses from multiple N-best lists at the word level, constructs a word lattice from all N-best hypotheses, and extracts the path with the lowest expected word error. When combining, the ordering of the G2P outputs plays an important role. In our work, we chose the order by evaluating our development set, mentioned in Section 3.1: we calculated the phoneme error rate (PER) for every single G2P converter on the development set and, based on this calculation, determined the ordering beginning with the output with the lowest PER. Table 3.1 shows an example of how composite G2P outputs are merged into one hypothesis at the phoneme level. The alignment and voting process takes place according to the ROVER algorithm.
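The ordering step can be sketched as follows; `per` and `order_converters` are hypothetical helpers, assuming each converter output and each reference pronunciation is given as a list of phonemes.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def per(hyps, refs):
    """Phoneme error rate (%): total edit distance over total reference length."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return 100.0 * errors / sum(len(r) for r in refs)

def order_converters(outputs, refs):
    """Order converter names by ascending PER on the development set."""
    return sorted(outputs, key=lambda name: per(outputs[name], refs))

refs = [["K", "AA", "9r", "S"]]
outputs = {"Moses": [["K", "AX", "S"]], "Sequitur": [["K", "AA", "9r", "S"]]}
print(order_converters(outputs, refs))   # ['Sequitur', 'Moses']
```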

3.4 Analysis of the different G2P Converter Outputs

Figure 3.8: Edit Distances at Phoneme Level between G2P Converter Outputs (English / German / French / Spanish).

As it is a requirement that the G2P converters produce different outputs in order to gain new outputs through combination, we present the edit distances at the phoneme level between the G2P converter outputs trained with 1k word-pronunciation pairs in Figure 3.8. How much they differ depends on the similarity of the corresponding techniques: for example, the smallest distances are between Sequitur and Phonetisaurus, while Rules has the highest distances to the other approaches. It also depends on the G2P relationship: while the English outputs differ most across all G2P converters, the Spanish ones are closest. The distances of French and German lie in between.


4. Experiments and Results

In this section, we first analyze how well each single converter performs on our 10k test set. In Section 4.2, the results of the combined G2P outputs via nbest-lattice are presented. Furthermore, we create speech recognizers based on the pronunciation dictionaries produced with our combination scheme.

4.1 Analysis of the G2P Converters’ Output

For all G2P converters, we use context and tuning parameters that result in the lowest phoneme error rates on the test set with 1k training data. Figure 4.1 and Table 4.1 show the PERs of the single G2P converter outputs. The PERs were determined by evaluating the edit distances to the reference pronunciations at the phoneme level.

In general, we observe lower PERs with increasing amounts of training data; with decreasing amounts, the PERs rise across all languages and G2P converter outputs. The lowest PERs are achieved with Sequitur G2P and Phonetisaurus for all languages and data sizes, except for French (fr) with less than 1k data, where Carttree outperforms Sequitur G2P and Phonetisaurus. The performance difference between Sequitur G2P and Phonetisaurus in our experiments is marginal, so both converters perform quite well even with little training data.

Disregarding the performance of Carttree in French (fr) with less than 1k data, the quality of its generated output lies in the middle of all G2P outputs. Except for German (de), Carttree always performs better than Moses. Moses is always worse than Sequitur and Phonetisaurus, even if it is very close for German (de). Rules generally performs worst; it is remarkable that for 200 English (en) and French (fr) word-pronunciation pairs, Rules can even outperform Moses.

With regard to the different G2P relationships of the languages used, Spanish (es) has the simplest one: it has clearly the lowest PERs, no matter which G2P converter generated the output. In contrast, English (en) has the highest PERs for all amounts of training data. Similar to the edit distance comparison in Section 3.4, German (de) and French (fr) lie in between.


Figure 4.1: PER over Amount of Training Data.

English
      Sequitur G2P   Phonetisaurus   Moses   Carttree   Rules
200   31.9           33.8            47.7    36.3       47.1
500   26.6           27.5            43.7    32.2       47.1
1k    23.8           23.9            40.3    28.8       47.1
5k    17.8           17.7            34.0    22.6       47.1
10k   16.1           16.0            31.8    19.9       47.1

German
      Sequitur G2P   Phonetisaurus   Moses   Carttree   Rules
200   18.0           18.7            18.0    21.1       35.3
500   14.8           15.5            15.3    17.8       35.3
1k    12.8           13.0            13.6    16.0       35.3
5k    10.0           9.9             10.7    12.2       35.3
10k   9.1            9.0             10.0    11.0       35.3

French
      Sequitur G2P   Phonetisaurus   Moses   Carttree   Rules
200   14.3           21.0            39.6    19.4       33.0
500   9.5            15.0            30.1    13.8       33.0
1k    7.1            8.7             27.0    11.0       33.0
5k    5.1            5.4             19.9    7.6        33.0
10k   4.6            4.7             18.0    6.4        33.0

Spanish
      Sequitur G2P   Phonetisaurus   Moses   Carttree   Rules
200   4.7            4.1             10.2    5.7        12.1
500   3.1            3.2             8.8     3.8        12.1
1k    2.4            2.5             8.4     2.8        12.1
5k    1.7            2.0             4.9     2.2        12.1
10k   1.5            1.6             4.5     2.1        12.1

Table 4.1: PER (in %) over Amount of Training Data


      English   German   French   Spanish
200   0.4       1.0      -0.3     -0.6
500   0.1       0.8      0.0      -0.1
1k    0.2       0.5      -0.1     -0.2
5k    0.0       0.2      0.1      -0.1
10k   0.1       0.1      0.2      -0.1

Table 4.2: Absolute PER Change over Training Data Size (in %)

4.2 Combining the G2P Converters’ Output

For the combination, we apply nbest-lattice at the phoneme level. Figure 4.2 shows the change in PER compared to the G2P converter output with the highest quality; positive numbers mean an improvement, negative ones a degradation. The exact absolute numbers are given in Table 4.2. We found that in some cases the combination of subsets of G2P converter outputs improved the PER slightly; in other cases, single much worse G2P converter outputs even helped to improve quality. As the impact is not clear in a real scenario, we continued our experiments with the combination of all converter outputs.

In 11 of 20 cases, the combination performs better than the best single converter. For German (de), we observe improvements for all training data sizes, for English (en) in four of five cases; therefore we selected these languages for our ASR experiments (see Section 4.4). For Spanish (es), the language with the most regular G2P relationship, the combination never results in improvements. While for German (de) the improvement is higher with less training data, the best French (fr) improvement is found with 10k training data.

Figure 4.2: Relative PER Change over Training Data Size

To enhance the combination with weights for the G2P converter outputs, we investigated two approaches: First, we assigned weights according to each converter's performance. However, this method could only reach the quality of the best single converter, not outperform it. In the second approach, we extracted weights based on normalized values of the G2P output confidences. But the correlation between the confidences and the output quality in terms of PER was too poor to obtain reliable weights. Figure 4.3 illustrates the poor relationship between the confidences and the resulting PER for pronunciations generated with Sequitur (the correlation coefficient R² is 0.1203).
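The coefficient of determination underlying Figure 4.3 can be computed as in this sketch: a plain least-squares fit of PER against confidence, with the function name being our own choice.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```

A perfectly linear relationship yields R² = 1.0; a value near 0, as observed here, means the confidences carry almost no information about the resulting PER.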


Figure 4.3: Confidence over PER.

4.3 Web-derived Pronunciations

We used Sequitur G2P to build additional G2P converters based on pronunciations which we found in Wiktionary together with the corresponding words (web-derived pronunciations, WDP) and analyzed their impact on the combination quality. Figure 4.4 shows the changes without (w/oWDP) and with the additional converter outputs compared to the best single converter; Table 4.3 shows the exact numbers. First, we built G2P converters after extracting word-pronunciation pairs (1949 for en, 1503 for de, 1573 for es and 1517 for fr) from Wiktionary without any filtering (unfiltWDP). Second, we filtered them before building the G2P converters, as described in [22] (filtWDP); in this case, we used 5663 word-pronunciation pairs for en, 4389 for de, 1578 for es and 4378 for fr. We observe that unfiltWDP outperforms the best single method in 16 of the 20 cases; in 19 cases it is better than w/oWDP. Like unfiltWDP, filtWDP outperforms the best single method in 16 cases. However, it is better than unfiltWDP in 16 cases and better than w/oWDP in 19 cases.

In order to achieve improvements, it is necessary that the new output gained through the combination differs from the outputs of the single converters. Table 4.4 shows the number of new outputs that arose through combination and were never created by any of the single converters. Note that each entry in the table represents the number of genuinely new outputs from our 10k test set; the first column shows the amounts of word-pronunciation pairs for each language. It can be seen that, with decreasing amounts of training data, the number of outputs newly created by combination grows across all languages: through the combination we gained many changes in the G2P converter outputs with little training data, while with more training data the situation is reversed. It is also noticeable that the number of new outputs is quite similar whether or not unfiltered or filtered web-derived pronunciations are taken into account.


English
      w/o WDP        unfiltWDP      filtWDP
200   0.4 / 1.3      2.7 / 8.5      4.4 / 13.8
500   0.1 / 0.4      1.4 / 5.3      2.0 / 7.7
1k    0.2 / 0.8      1.1 / 4.6      1.6 / 6.7
5k    0.0 / 0.0      0.2 / 1.1      0.5 / 2.8
10k   0.1 / 0.6      0.3 / 1.9      0.4 / 2.5

German
      w/o WDP        unfiltWDP      filtWDP
200   1.0 / 5.6      1.4 / 8.4      3.1 / 17.2
500   0.8 / 5.4      1.2 / 8.1      1.3 / 9.2
1k    0.5 / 3.9      0.8 / 6.3      0.9 / 7.0
5k    0.2 / 2.0      0.3 / 3.0      0.4 / 4.0
10k   0.1 / 1.1      0.2 / 2.2      0.2 / 2.2

French
      w/o WDP        unfiltWDP      filtWDP
200   -0.3 / -2.1    2.3 / 16.1     3.2 / 23.1
500   0.0 / 0.0      1.4 / 14.7     1.5 / 15.8
1k    -0.1 / -1.4    0.6 / 8.5      0.7 / 9.9
5k    0.1 / 2.0      0.3 / 5.9      0.4 / 7.8
10k   0.2 / 4.3      0.3 / 6.5      0.4 / 8.7

Spanish
      w/o WDP        unfiltWDP      filtWDP
200   -0.6 / -14.6   -0.2 / -4.9    -0.1 / -2.4
500   -0.1 / -3.2    0.2 / 6.5      0.2 / 6.5
1k    -0.2 / -8.3    0.0 / 0.0      0.0 / 0.0
5k    -0.1 / -5.9    0.0 / 0.0      0.0 / 0.0
10k   -0.1 / -6.7    -0.1 / -6.7    -0.1 / -6.7

Table 4.3: Absolute / Relative Improvements in terms of PER (in %). Negative numbers show that no improvement was possible through combination.


English
      Combination   unfiltWDP   filtWDP
200   3,279         3,386       3,839
500   2,369         2,393       3,071
1k    2,003         2,042       1,958
5k    1,287         1,401       1,270
10k   946           1,091       1,005

German
      Combination   unfiltWDP   filtWDP
200   1,668         1,763       1,707
500   1,127         865         1,120
1k    767           610         601
5k    612           435         412
10k   583           438         411

French
      Combination   unfiltWDP   filtWDP
200   2,631         1,976       2,105
500   1,261         940         961
1k    718           484         472
5k    461           346         342
10k   359           269         235

Spanish
      Combination   unfiltWDP   filtWDP
200   198           178         180
500   81            79          84
1k    87            65          73
5k    67            40          54
10k   71            50          55

Table 4.4: Number of genuinely new Outputs gained by Combination


Figure 4.4: Relative PER Change using Converters trained with Web-derived Pro-nunciations

4.4 ASR Experiments

We replaced the pronunciations in the dictionaries of our German and English speech recognizers with pronunciations generated by the G2P converters. Then we used them to build and decode the systems. Table 4.5 shows the results in terms of word error rate (WER). Figure 4.5 depicts variations in WER with increasing amounts of training data, even if there is a general decrease with more training data using our data-driven G2P converters, except for English with Moses. As in our PER evaluation, Sequitur G2P and Phonetisaurus outperform the other approaches in most cases.

Figure 4.5: WER over Amount of Training Data.

However, Rules results in the lowest word error rates for most scenarios with less than 1k training data. Figure 4.6 and Table 4.6 show the changes of the combination without (w/oWDP) and with the additional converter outputs compared to the best single converter. unfiltWDP outperforms the best single method in only one of the 10 cases; in four cases it outperforms w/oWDP.


Figure 4.6: Rel. WER (%) Change over Training Data Size With and Without Web-derived Data.

English
      Sequitur   Phonetisaurus   Moses   Carttree   Rules
200   30.9       31.9            35.8    32.3       27.5
500   29.7       29.8            37.1    32.7       27.5
1k    28.5       27.7            40.3    33.5       27.5
5k    25.8       23.8            39.7    26.9       27.5
10k   22.2       23.4            43.1    27.1       27.5

German
      Sequitur   Phonetisaurus   Moses   Carttree   Rules
200   18.6       18.6            21.4    20.0       17.9
500   17.8       17.8            17.6    19.1       17.9
1k    18.6       18.7            18.4    20.2       17.9
5k    17.7       17.4            18.6    20.2       17.9
10k   17.1       18.0            17.5    19.3       17.9

Table 4.5: WER (in %) over Amount of Training Data

filtWDP is better than the best single method in six cases. It is better than unfiltWDP in nine cases and better than w/oWDP in seven cases.

Additionally, we compare our combination approach to two approaches of building the training and decoding dictionary in which the outputs of the single G2P converters are used as pronunciation variants. For this, we include the output of the G2P converters trained with the filtered web data: In the first approach, we use all pronunciations in the training and decoding dictionary, as in [6] (var1-filtWDP). Assuming that speakers in the training and test sets use similar speaking styles and vocabulary, and that our training process automatically selects the most likely pronunciations, in the second approach we remove from the decoding dictionary all pronunciations that were not used in training (var2-filtWDP).
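The var2 pruning step can be sketched as follows, assuming a dictionary mapping words to lists of pronunciation variants and a set of (word, pronunciation) pairs actually observed in training. All names are hypothetical, and the fallback to the first variant for words whose pronunciations were never observed is our own assumption.

```python
def prune_decoding_dict(dictionary, used_in_training):
    """Keep for decoding only the pronunciation variants seen in training.

    dictionary: {word: [pronunciation, ...]}, each pronunciation a phoneme list.
    used_in_training: set of (word, tuple_of_phonemes) pairs observed in training.
    """
    pruned = {}
    for word, prons in dictionary.items():
        kept = [p for p in prons if (word, tuple(p)) in used_in_training]
        # assumption: fall back to the first variant if none was observed
        pruned[word] = kept or prons[:1]
    return pruned

lexicon = {"cars": [["K", "AA", "9r", "S"], ["K", "EH", "ZH"]],
           "bus": [["B", "AH", "S"]]}
used = {("cars", ("K", "AA", "9r", "S"))}
print(prune_decoding_dict(lexicon, used))
```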

Figure 4.6 shows that for English, var1-filtWDP and var2-filtWDP outperformed neither the corresponding best single converter nor the combination approaches in most cases. For German, var1-filtWDP and var2-filtWDP each result in better WERs than our combination approaches in three of the five cases. Thereby var2-filtWDP is always better than var1-filtWDP.


English
      w/o WDP        unfiltWDP      filtWDP
200   -4.7 / -17.1   -3.9 / -14.3   0.2 / 0.5
500   -2.3 / -8.6    -3.9 / -14.4   -0.2 / -0.8
1k    -1.7 / -6.4    -1.6 / -5.9    1.0 / 3.8
5k    -0.9 / -3.6    -1.6 / -6.7    -0.5 / -2.3
10k   -1.9 / -8.7    -1.0 / -4.5    -1.2 / -5.4

German
      w/o WDP        unfiltWDP      filtWDP
200   -0.9 / -5.0    -0.8 / -4.6    0.2 / 1.3
500   0.7 / 3.9      -0.5 / -3.1    0.3 / 1.9
1k    0.5 / 3.0      0.1 / 0.3      0.5 / 2.6
5k    -0.3 / -1.6    -0.4 / -2.6    -0.2 / -1.0
10k   0.4 / 2.0      -1.1 / -6.2    0.1 / 0.5

Table 4.6: Absolute / Relative Improvements in terms of WER (in %). Negative numbers show that no improvement was possible through combination.


5. Conclusion and Future Work

We have analyzed the combination of G2P converter outputs for four languages with differing grades of G2P regularity and simulated scenarios with small amounts of word-pronunciation pairs (200, 500, 1k, 5k, 10k).

We used a test set of 10k words in every language to draw our comparisons. As in [20], we observed that the state-of-the-art G2P converters Sequitur G2P and Phonetisaurus perform better on the test sets than the other single converters used in this work: these G2P converters generated output that was closer to our validated reference pronunciations at the phoneme level. Although they performed quite well, it was possible to improve the quality of the G2P outputs through a post-processing combination, even with worse G2P outputs. As a requirement, it was necessary that the converters differ in their outputs in order to produce new outputs at all; we showed this by calculating the edit distances between the outputs of the different converters in Section 3.4. Furthermore, we discovered that with decreasing amounts of training data, the number of new, unseen outputs created by the phoneme level combination grows across all languages (see Table 4.4 in Section 4.3).

In 41 of 60 scenarios, we achieved improvements with the output combination. We could also see that the impact of the combination is bigger with decreasing amounts of training data. This can be quite beneficial for under-resourced languages, where only few word-pronunciation pairs are available as training data. With the help of web-derived word-pronunciation pairs from Wiktionary, the output of the G2P converters could be improved even further.

We conducted ASR experiments by replacing the pronunciations in the dictionaries of our speech recognizers with pronunciations generated by the G2P converters. As languages we chose German and English because, as shown in Section 4.2, the best results with the combination were achieved for these languages. The positive impact of the combination in terms of lower PERs compared to qualified pronunciations had only little influence on the WERs of our ASR systems, more for German than for English. Using the single converter outputs as variants performed


much worse for English. However, for German it outperformed our combination approaches in most cases.

As we could only simulate under-resourced scenarios by splitting our dictionaries into small portions, it would be interesting to examine how our combination would perform in a real scenario with under-resourced languages. It is important to keep in mind that a post-processed combination may only be of benefit if the G2P relationship of the language is not too simple. Especially for Spanish, which has a simple G2P relationship, we were not able to gain improvements through our combination; here, single G2P converters like Sequitur G2P and Phonetisaurus achieve satisfying results by themselves.

In this work, we used four open source G2P converters and, as a fifth converter, decided to create simple rules manually. As a continuation of this work, it would be interesting to see whether it is possible to further improve G2P output quality by combining more than five converters. Additionally, since we did not take any graphemic context information into account during the phoneme level combination, this would be a next step to tackle.


Bibliography

[1] http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm.

[2] A. Laurent, R. Deleglise and S. Meignier: Grapheme to Phoneme Conversion using a SMT System. In Interspeech, 2009.

[3] A. W. Black, K. Lenzo and V. Pagel: Issues in Building General Letter to Sound Rules. ESCA/COCOSDA Workshop on Speech Synthesis, 1998.

[4] Bisani, M. and H. Ney: Joint Sequence Models for Grapheme-to-Phoneme Conversion. In Speech Communication, 2008.

[5] Chen, S. F.: Conditional and Joint Models for Grapheme-to-Phoneme Conversion. In Eurospeech, 2003.

[6] D. Jouvet, D. Fohr and I. Illina: Evaluating Grapheme-to-Phoneme Converters in Automatic Speech Recognition Context. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

[7] Gerosa, M. and M. Federico: Coping with Out-of-Vocabulary Words: Open Versus Huge Vocabulary ASR. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009.

[8] J. Novak, N. Minematsu and K. Hirose: WFST-based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding. In International Workshop on Finite State Methods and Natural Language Processing, 2012.

[9] J. Fiscus: A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER). 1997.

[10] Kaplan, R. M. and M. Kay: Regular Models of Phonological Rule Systems. In Computational Linguistics, 1994.

[11] Karanasou, P. and L. Lamel: Comparing SMT Methods for Automatic Generation of Pronunciation Variants. In IceTAL (Conference on Natural Language Processing), 2010.

[12] Kneser, R.: Grapheme-to-Phoneme Study. In Philips Speech Processing, Germany, WYT-P4091/00002, 2000.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone: Classification and Regression Trees. Wadsworth & Brooks, 1984.


[14] Martirosian and Davel: Error Analysis of a Public Domain Pronunciation Dictionary. In PRASA, 2007.

[15] Nettle, D.: Linguistic Diversity, 1999.

[16] Novak, J.: Phonetisaurus: A WFST-driven Phoneticizer. http://code.google.com/p/phonetisaurus/, 2011.

[17] Och, F. J. and H. Ney: A Systematic Comparison of Various Statistical Alignment Models. In Computational Linguistics, vol. 29, 2003.

[18] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst: Open Source Toolkit for Statistical Machine Translation. In Association for Computational Linguistics (ACL), demonstration session, 2007.

[19] P. Vozila, J. Adams and R. Thomas: Grapheme to Phoneme Conversion and Dictionary Verification using Graphonemes. In Eurospeech, 2003.

[20] S. Hahn, P. Vozila and M. Bisani: Comparison of Grapheme-to-Phoneme Methods on Large Pronunciation Dictionaries and LVCSR Tasks. In Interspeech, 2012.

[21] S. Jiampojamarn, G. Kondrak and T. Sherif: Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion. In Human Language Technologies (HLT), 2007.

[22] T. Schlippe, S. Ochs, N. T. Vu and T. Schultz: Automatic Error Recovery for Pronunciation Dictionaries. In Interspeech, 2012.

[23] T. Schlippe, S. Ochs and T. Schultz: Grapheme-to-Phoneme Model Generation for Indo-European Languages. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

[24] T. Schultz, N. T. Vu and T. Schlippe: GlobalPhone: A Multilingual Text and Speech Database in 20 Languages. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.

[25] Stolcke, A.: SRILM: An Extensible Language Modeling Toolkit. In International Conference on Spoken Language Processing (ICSLP), 2002.