
SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

Tim Schlippe, Wolf Quaschningk, Tanja Schultz

15-May-2014

Outline

1. Motivation and Goals

2. Experimental Setup
   2.1 Grapheme-to-phoneme converters
   2.2 Data

3. Experiments and Results
   3.1 Single grapheme-to-phoneme converters’ performance
   3.2 Phoneme-level combination scheme
   3.3 Adding web-driven grapheme-to-phoneme converters
   3.4 Automatic speech recognition experiments

4. Conclusion and Future Work


Motivation

About 7,100 languages exist in the world (www.ethnologue.com)

Only a few languages have speech processing systems

Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)

Manual production of pronunciations is slow and costly: 19.2–30 s per word for Afrikaans (Davel and Barnard, 2004)

Automatic grapheme-to-phoneme (G2P) conversion

But: consistent pronunciations are only reached from ~3.7k word-pronunciation pairs for training (30k phoneme tokens)

Methods to reduce manual effort


Goals

Common approaches use their single favorite G2P conversion tool

Idea: Use synergy effects of multiple G2P converters

Converters that are close in performance but produce outputs that differ in their errors

This provides complementary information

Achieve pronunciations with higher quality through combination of G2P converter outputs

Reduce manual effort in semi-automatic methods

Impact on ASR performance



Grapheme-to-phoneme converters


G2P converters (according to Bisani and Ney, 2008)

Knowledge-based
    Manual, rule-based: hand-crafted rules

Data-driven
    Local classification
        CART-based "t2p" (Lenzo, 1998)
    Probabilistic
        Graphone-based "Sequitur" (Bisani & Ney, 2008)
        WFST-based "Phonetisaurus" (Novak, 2011)
        SMT-based "Moses" (Koehn, 2005)

Example: c a r s → K AX 9r S


Data

Languages: English, German, French, Spanish

Dictionaries:
English: CMU dictionary
German, Spanish: GlobalPhone
French: Quaero Project

Data sets (randomly chosen):
Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs
Development / test set: 10k word-pronunciation pairs (disjoint)

Different small training data sizes simulate low-resource conditions

Languages with different degrees of regularity in the G2P relationship


Analysis of Single G2P Converter Outputs


Edit distance to reference pronunciations at phoneme level (phoneme error rate (PER))

Lower PERs with increasing amount of training data
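
A minimal sketch of the metric named above, not the authors' evaluation script: PER is taken here as the Levenshtein (edit) distance between a hypothesis and its reference phoneme sequence, divided by the reference length. The function name `phoneme_error_rate` and the whitespace-separated phoneme notation are assumptions made for illustration.

```python
def phoneme_error_rate(reference, hypothesis):
    """Edit distance between two phoneme lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference pronunciation must not be empty")
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis phonemes
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            curr.append(min(prev[j - 1] + (ref_ph != hyp_ph),  # match or substitution
                            prev[j] + 1,                       # deletion
                            curr[j - 1] + 1))                  # insertion
        prev = curr
    return prev[-1] / len(reference)


reference = "K AA 9r ZH".split()
# One deleted phoneme out of four reference phonemes
print(phoneme_error_rate(reference, "K AA ZH".split()))
```

For a whole test set, the edit distances and reference lengths would typically be summed over all words before dividing; the slides do not say which variant was used.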





Lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; Moses comes very close for de





With 200 word-pronunciation (W-P) pairs for en and fr, Rules outperforms Moses


Phoneme-level combination scheme


Based on ROVER (Recognizer Output Voting Error Reduction; Fiscus, 1997), which is traditionally applied at word level

Voting module decides by frequency of occurrence, since G2P confidence scores are not reliable


Phoneme-level combination scheme


Example (trained with 200 W-P pairs)
Reference: cars → K AA 9r ZH

Converter          Output        PER
Sequitur G2P       K EH 9r ZH    25%
Phonetisaurus      K AA ZH       25%
CART               K AE ZH       50%
Moses              K AA 9r S     25%
1:1 G2P (Rules)    K AX 9r S     50%

PLC output         K AA 9r ZH     0%
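
The voting idea can be sketched as follows. This is an illustrative ROVER-style approximation, not the authors' implementation: each converter output is aligned progressively to a growing list of slots by edit distance, and the most frequent entry per slot wins. The unit edit costs, the tie-breaking order in the backtrace, and the names `align_to_network` and `combine` are all assumptions of this sketch.

```python
from collections import Counter

NULL = ""  # marks "no phoneme" in an alignment slot


def align_to_network(network, hyp):
    """Merge one phoneme sequence into a list of alignment slots.

    Each slot holds one entry per hypothesis merged so far. Unit edit costs
    and the tie-breaking order below are assumptions of this sketch.
    """
    n, m = len(network), len(hyp)
    merged_so_far = len(network[0]) if network else 0
    # cost[i][j]: cheapest alignment of the first i slots with the first j phonemes
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[j - 1] in network[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitution
                             cost[i - 1][j] + 1,        # slot gets no phoneme
                             cost[i][j - 1] + 1)        # phoneme opens a new slot
    # Backtrace; ties are resolved as match > deletion > insertion > substitution.
    slots, i, j = [], n, m
    while i > 0 or j > 0:
        hit = i > 0 and j > 0 and hyp[j - 1] in network[i - 1]
        if hit and cost[i][j] == cost[i - 1][j - 1]:
            slots.append(network[i - 1] + [hyp[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            slots.append(network[i - 1] + [NULL])
            i -= 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            slots.append([NULL] * merged_so_far + [hyp[j - 1]])
            j -= 1
        else:
            slots.append(network[i - 1] + [hyp[j - 1]])
            i, j = i - 1, j - 1
    slots.reverse()
    return slots


def combine(hypotheses):
    """Vote by frequency of occurrence in every slot; drop slots won by NULL."""
    network = []
    for hyp in hypotheses:
        network = align_to_network(network, hyp)
    winners = (Counter(slot).most_common(1)[0][0] for slot in network)
    return [phoneme for phoneme in winners if phoneme != NULL]


# The five converter outputs for "cars" from the example slide
outputs = ["K EH 9r ZH", "K AA ZH", "K AE ZH", "K AA 9r S", "K AX 9r S"]
print(" ".join(combine([o.split() for o in outputs])))
```

Equal counts within a slot fall to the converter listed first, so the order of the inputs matters in this sketch; a different alignment or tie-breaking policy could change individual results.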


Phoneme-level combination


Relative PER change compared to best single converter output

In 10 of 16 cases, the combination is equal to or better than the best single converter





Most improvement for de and en, which are therefore used in the ASR experiments





For es (most regular G2P relationship), the combination never brings improvements


Wiktionary


39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)

Figure: growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)

T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014


Wiktionary

Additional G2P converters based on word-pronunciation pairs in Wiktionary


Figure: internal consistency (PER %); the chart labels the four data sets with 3.3k, 1.5k, 3.8k and 4.6k W-P pairs


Data

Filtered web-derived pronunciations (WDP)

Fully automatic filtering methods from (Schlippe, 2012a, 2012b, 2014)

~15% with each filtering method


Language        Best method    unfiltWDP    filtWDP    Rel. change
English (en)    M2NAlign       33.18%       26.13%     +21.25%
French (fr)     Eps            14.96%       13.97%      +6.62%
German (de)     G2PLen         16.74%       14.17%     +15.35%
Spanish (es)    M2NAlign       10.25%       10.90%      -6.34%
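
The "Rel. change" column appears consistent with (unfiltWDP − filtWDP) / unfiltWDP, so a positive value means the filtered data gave the lower figure. The short check below recomputes the column under that assumption; the function name `relative_change` is hypothetical, and the input values are copied from the table.

```python
def relative_change(unfiltered, filtered):
    """Relative reduction in percent; positive means the filtered value is lower."""
    return (unfiltered - filtered) / unfiltered * 100


# (unfiltWDP, filtWDP) per language, copied from the table above
table = {
    "en": (33.18, 26.13),
    "fr": (14.96, 13.97),
    "de": (16.74, 14.17),
    "es": (10.25, 10.90),
}
for language, (unfiltered, filtered) in table.items():
    print(f"{language}: {relative_change(unfiltered, filtered):+.2f}%")
```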





PLC-unfiltWDP is already better than PLC-w/oWDP





Filtering web-derived pronunciations helps

23.1% rel. PER reduction


ASR experiments


Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters

Train and decode the systems

Word Error Rate (WER)

• As in PER evaluation: Sequitur and Phonetisaurus very good in most cases
• However: Rules results in lowest WERs for most scenarios


ASR experiments


In only 1 case is PLC-w/oWDP better than or equal to the best single converter


ASR experiments


Filtering web-derived word-pronunciation pairs helps.


ASR experiments


Confusion Network Combination (CNC) outperforms PLC


ASR experiments


In 9 cases, adding the system with PLC helps in CNC


Conclusion and Future Work

In most cases, PLC comes closer to the validated reference pronunciations than the single converters

Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)

Weighting single G2P converters’ outputs gave no improvement

according to performance on dev set

according to converters’ confidences
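
For illustration only, weighting could look like the sketch below: each converter's entry in an already aligned slot contributes its weight instead of a single vote. The function name `weighted_vote` and the weight values are hypothetical; the slides report that weighting, whether derived from dev-set performance or from converter confidences, gave no improvement over plain frequency voting.

```python
from collections import defaultdict


def weighted_vote(slot, weights):
    """Return the slot entry with the highest summed weight.

    slot:    one aligned phoneme per converter
    weights: one weight per converter (hypothetical values in this sketch)
    """
    scores = defaultdict(float)
    for phoneme, weight in zip(slot, weights):
        scores[phoneme] += weight
    # On equal scores, the entry seen first wins
    return max(scores, key=scores.get)


final_slot = ["ZH", "ZH", "ZH", "S", "S"]  # last phoneme of "cars", five converters
print(weighted_vote(final_slot, [1, 1, 1, 1, 1]))          # uniform weights: plain voting
print(weighted_vote(final_slot, [0.1, 0.1, 0.1, 1.0, 1.0]))  # made-up weights
```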

Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort



Conclusion and Future Work

The positive impact of the combination in terms of lower PERs had only a small influence on the WERs of our ASR systems

Including systems whose pronunciation dictionaries were built with PLC in CNC can lead to improvements

Future work: Embedding PLC and web-derived pronunciations into the semi-automatic pronunciation dictionary creation

Further languages and further G2P converters



Thank you for your attention!

