SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages
St. Petersburg, Russia

www.kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

Tim Schlippe, Wolf Quaschningk, Tanja Schultz

15-May-2014

Outline

1. Motivation and Goals
2. Experimental Setup
   1. Grapheme-to-phoneme converters
   2. Data
3. Experiments and Results
   1. Single grapheme-to-phoneme converters' performance
   2. Phoneme-level combination scheme
   3. Adding web-driven grapheme-to-phoneme converters
   4. Automatic speech recognition experiments
4. Conclusion and Future Work


Motivation

About 7,100 languages exist in the world (www.ethnologue.com)

Only a few languages have speech processing systems

Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)

Manual production of pronunciations is slow and costly
    19.2–30 s per word for Afrikaans (Davel and Barnard, 2004)

Automatic grapheme-to-phoneme (G2P) conversion
    But: consistent pronunciations are obtained only from about 3.7k word-pronunciation pairs for training (30k phoneme tokens)

Methods to reduce manual effort


Goals

Common approaches use a single favorite G2P conversion tool

Idea: use synergy effects of multiple G2P converters
    Converters that are close in performance but differ in the errors they produce
    Their outputs provide complementary information

Achieve higher-quality pronunciations by combining G2P converter outputs

Reduce manual effort in semi-automatic methods

Investigate the impact on ASR performance


Grapheme-to-phoneme converters

G2P converters (taxonomy according to Bisani and Ney, 2008)
    Knowledge-based
        Manual rule-based
            Hand-crafted rules
    Data-driven
        Local classification
            CART-based "t2p" (Lenzo, 1998)
        Probabilistic
            Graphone-based "Sequitur" (Bisani & Ney, 2008)
            WFST-based "Phonetisaurus" (Novak, 2011)
            SMT-based "Moses" (Koehn, 2005)

CART: classification and regression tree; WFST: weighted finite-state transducer; SMT: statistical machine translation

Example: c a r s → K AX 9r S


Data

Languages: English, German, French, Spanish (different degrees of regularity in the G2P relationship)

Dictionaries:
    English: CMU dictionary
    German, Spanish: GlobalPhone
    French: Quaero Project

Data sets (randomly chosen):
    Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs (different small training set sizes to simulate low-resource conditions)
    Development / test set: 10k word-pronunciation pairs (disjoint)
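The random split described above can be sketched as follows. This is only an illustration: the function name `split_dictionary`, the fixed seed, the choice of separate development and test sets of equal size, and the nesting of the smaller training sets inside the larger ones are all assumptions, since the slides do not say how the sets were drawn.

```python
import random


def split_dictionary(entries, train_sizes=(200, 500, 1000, 5000, 10000),
                     heldout_size=10000, seed=0):
    """Randomly split (word, pronunciation) pairs into train/dev/test sets.

    Assumption: dev and test are separate sets of `heldout_size` pairs each,
    and every training set is a prefix of the same shuffled pool (nested).
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # reproducible random order

    dev = shuffled[:heldout_size]
    test = shuffled[heldout_size:2 * heldout_size]
    pool = shuffled[2 * heldout_size:]  # disjoint from dev and test

    if len(pool) < max(train_sizes):
        raise ValueError("dictionary too small for the requested split")

    train = {size: pool[:size] for size in train_sizes}
    return train, dev, test
```

Because all subsets are slices of one shuffled list, the held-out sets never overlap with any training set.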


Analysis of Single G2P Converter Outputs

Edit distance to reference pronunciations at phoneme level (phoneme error rate (PER))

Lower PERs with increasing amount of training data
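The PER used in this evaluation can be sketched as a phoneme-level Levenshtein distance divided by the number of reference phonemes. The function names below are illustrative, and pooling errors over all words before dividing is an assumption; the slides do not state how the rate is aggregated.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """PER in percent over (reference, hypothesis) pairs, errors pooled over all words."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total


# Phoneme strings taken from the deck's "cars" example.
reference = "K AA 9r ZH".split()
print(phoneme_error_rate([(reference, "K AE ZH".split())]))  # 50.0
```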

The lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; for German (de), Moses is also very close

With 200 English (en) and French (fr) word-pronunciation (W-P) pairs, the rule-based converter (Rules) outperforms Moses


Phoneme-level combination scheme

Based on ROVER (Recognizer Output Voting Error Reduction; Fiscus, 1997), which is traditionally applied at word level

The voting module decides by frequency of occurrence, since G2P confidence scores are not reliable

Example (trained with 200 W-P pairs):
Reference: cars → K AA 9r ZH

Converter          Output        PER
Sequitur G2P       K EH 9r ZH    25%
Phonetisaurus      K AA ZH       25%
CART               K AE ZH       50%
Moses              K AA 9r S     25%
1:1 G2P (Rules)    K AX 9r S     50%

PLC output         K AA 9r ZH    0%
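A phoneme-level combination in the spirit of ROVER can be sketched as below. This is a simplified illustration, not the authors' implementation: each hypothesis is aligned in turn to a growing list of slots with 0/1 costs, and each slot is then decided by frequency of occurrence. The names `align` and `combine`, the `<eps>` marker, the cost scheme, and the tie-breaking order (skip before match before new slot; earlier converter wins a tied vote) are all assumptions.

```python
from collections import Counter

EPS = "<eps>"  # marks "no phoneme" for a converter in a slot


def align(network, hyp, n_prev):
    """Align one phoneme sequence to the slot network built from n_prev hypotheses."""
    n, m = len(network), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i > 0:  # hypothesis has no phoneme for this slot
                options.append((cost[i - 1][j] + (EPS not in network[i - 1]), "skip"))
            if i > 0 and j > 0:  # phoneme goes into an existing slot
                options.append((cost[i - 1][j - 1] + (hyp[j - 1] not in network[i - 1]), "match"))
            if j > 0:  # phoneme opens a new slot
                options.append((cost[i][j - 1] + 1, "new"))
            if options:
                cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    # Trace back and rebuild the network with the new hypothesis added.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "skip":
            out.append(network[i - 1] + [EPS])
            i -= 1
        elif move == "match":
            out.append(network[i - 1] + [hyp[j - 1]])
            i, j = i - 1, j - 1
        else:
            out.append([EPS] * n_prev + [hyp[j - 1]])
            j -= 1
    return out[::-1]


def combine(hypotheses):
    """Vote per slot by frequency of occurrence; slots won by EPS are dropped."""
    network = [[p] for p in hypotheses[0]]
    for k, hyp in enumerate(hypotheses[1:], 1):
        network = align(network, hyp, k)
    winners = [Counter(slot).most_common(1)[0][0] for slot in network]
    return [p for p in winners if p != EPS]


outputs = ["K EH 9r ZH", "K AA ZH", "K AE ZH", "K AA 9r S", "K AX 9r S"]
print(combine([o.split() for o in outputs]))  # ['K', 'AA', '9r', 'ZH']
```

With the five converter outputs from the example, the vote recovers the reference pronunciation even though no single converter produced it. Other alignments with the same cost exist, so a different tie-breaking order may give a different result.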


Phoneme-level combination

Relative PER change compared to best single converter output

In 10 of 16 cases, the combination is equal to or better than the best single converter

Most improvement for de and en (the languages used in the ASR experiments)

For es (most regular G2P relationship), the combination never yields improvements


Wiktionary

39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)

Figure: growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)

T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014

Additional G2P converters based on word-pronunciation pairs in Wiktionary

Figure: internal consistency (PER %) of the Wiktionary word-pronunciation pairs; set sizes shown: 3.3k, 1.5k, 3.8k, and 4.6k W-P pairs


Data

Filtered web-derived pronunciations
    Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
    ~15% with each filtering method

unfiltWDP / filtWDP: unfiltered / filtered web-derived pronunciations

Language        Best method    unfiltWDP    filtWDP    Rel. change
English (en)    M2NAlign       33.18%       26.13%     +21.25%
French (fr)     Eps            14.96%       13.97%     +6.62%
German (de)     G2PLen         16.74%       14.17%     +15.35%
Spanish (es)    M2NAlign       10.25%       10.90%     -6.34%
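The "Rel. change" column is consistent with a relative reduction of the filtered value against the unfiltered one, where a positive sign means the filtered value is lower. A minimal sketch that reproduces the column from the two percentage columns (the function name `relative_reduction` is illustrative):

```python
def relative_reduction(before, after):
    """Relative reduction in percent; positive when `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# (unfiltWDP, filtWDP) values copied from the table above.
rows = {
    "en": (33.18, 26.13),
    "fr": (14.96, 13.97),
    "de": (16.74, 14.17),
    "es": (10.25, 10.90),
}
for lang, (unfiltered, filtered) in rows.items():
    print(lang, f"{relative_reduction(unfiltered, filtered):+.2f}%")
```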


Phoneme-level combination

Relative PER change compared to best single converter output

PLC with unfiltered web-derived pronunciations (PLC-unfiltWDP) is already better than PLC without web-derived pronunciations (PLC-w/oWDP)

Filtering web-derived pronunciations helps

23.1% rel. PER reduction


ASR experiments

Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters

Train and decode the systems

Word Error Rate (WER)

• As in the PER evaluation: Sequitur and Phonetisaurus are very good in most cases
• However: Rules results in the lowest WERs for most scenarios

In only 1 case is PLC-w/oWDP better than or equal to the best single converter

Filtering web-derived word-pronunciation pairs helps.

Confusion Network Combination (CNC) outperforms PLC

In 9 cases, adding a system built with PLC helps in CNC


Conclusion and Future Work

In most cases, PLC comes closer to the validated reference pronunciations than the single converters

Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)

Weighting the single G2P converters' outputs gave no improvement
    according to performance on the dev set
    according to the converters' confidences

Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort

The positive impact of the combination in terms of lower PERs had only a small influence on the WERs of our ASR systems

Including systems whose pronunciation dictionaries have been built with PLC in CNC can lead to improvements

Future work:
    Embedding PLC and web-derived pronunciations into semi-automatic pronunciation dictionary creation
    Further languages and further G2P converters


Thank you for your attention!
