SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)
Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation
in Low-Resource Scenarios
Tim Schlippe, Wolf Quaschningk, Tanja Schultz
15-May-2014
Outline
1. Motivation and Goals
2. Experimental Setup
    1. Grapheme-to-phoneme converters
    2. Data
3. Experiments and Results
    1. Single grapheme-to-phoneme converters' performance
    2. Phoneme-level combination scheme
    3. Adding web-driven grapheme-to-phoneme converters
    4. Automatic speech recognition experiments
4. Conclusion and Future Work
Motivation
About 7,100 languages exist in the world (www.ethnologue.com)
Only a few languages have speech processing systems
Pronunciation dictionaries are needed for text-to-speech and automatic speech recognition (ASR)
Manual production of pronunciations is slow and costly
19.2–30 s / word for Afrikaans (Davel and Barnard, 2004)
Automatic grapheme-to-phoneme (G2P) conversion
But: consistent pronunciations are reached only from ~3.7k word-pronunciation pairs for training (30k phoneme tokens)
Methods to reduce manual effort
Goals
Common approaches use their single favorite G2P conversion tool
Idea: Use synergy effects of multiple G2P converters
Converters are close in performance, yet their outputs differ in their errors
Their outputs therefore provide complementary information
Achieve pronunciations with higher quality through combination of G2P converter outputs
Reduce manual effort in semi-automatic methods
Impact on ASR performance
Grapheme-to-phoneme converters
G2P converters (categorization according to Bisani and Ney, 2008)
Knowledge-based
    Manual rule-based: hand-crafted rules
Data-driven
    Local classification: CART-based "t2p" (Lenzo, 1998)
    Probabilistic:
        Graphone-based "Sequitur" (Bisani & Ney, 2008)
        WFST-based "Phonetisaurus" (Novak, 2011)
        SMT-based "Moses" (Koehn, 2005)
Example: c a r s → K AX 9r S
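As a minimal sketch of the simplest family above, a hand-crafted 1:1 rule converter maps each grapheme to exactly one phoneme. The four rules below are taken from the example on this slide (c a r s → K AX 9r S). The table name `RULES`, the function `rules_g2p` and the decision to skip unknown graphemes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical 1:1 grapheme-to-phoneme rule table. Only these four entries
# come from the slide's example; a real table would cover the whole alphabet.
RULES = {"c": "K", "a": "AX", "r": "9r", "s": "S"}


def rules_g2p(word):
    """Map each grapheme to one phoneme; unknown graphemes are skipped (assumption)."""
    return [RULES[g] for g in word.lower() if g in RULES]


print(" ".join(rules_g2p("cars")))
```

Such a converter needs no training data, which may explain why it stays competitive when only a few hundred word-pronunciation pairs are available.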
Data
Languages: English, German, French, Spanish (different degrees of regularity in the G2P relationship)
Dictionaries:
    English: CMU dictionary
    German, Spanish: GlobalPhone
    French: Quaero Project
Data sets (randomly chosen):
    Training: 200, 500, 1k, 5k, 10k word-pronunciation pairs (small training sets of different sizes to simulate low resources)
    Development / test set: 10k word-pronunciation pairs (disjoint)
Analysis of Single G2P Converter Outputs
Edit distance to reference pronunciations at phoneme level (phoneme error rate (PER))
Lower PERs with increasing amount of training data
Lowest PERs are achieved with Sequitur and Phonetisaurus for all languages and data sizes; for de, even Moses is very close
With 200 word-pronunciation (W-P) pairs for en and fr, the rule-based converter (Rules) outperforms Moses
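The PER used in this analysis can be sketched as a Levenshtein distance over phoneme tokens, normalised by the number of reference phonemes. The function names are illustrative, and the slides do not state which scoring tool the authors actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitution (free if identical)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """PER in percent over paired reference / hypothesis pronunciations."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


reference = "K AA 9r ZH".split()
print(phoneme_error_rate([reference], ["K AA ZH".split()]))
```

Summing errors over the whole test set before dividing, as done here, weights long words more than averaging per-word rates would; which of the two the authors used is an assumption.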
Phoneme-level combination scheme
Based on ROVER (Recognizer Output Voting Error Reduction; Fiscus, 1997), which is traditionally applied at word level
Voting module decides by frequency of occurrence, since G2P confidence scores are not reliable
Example (trained with 200 W-P pairs)
Reference: cars → K AA 9r ZH

Converter          Output        PER
Sequitur G2P       K EH 9r ZH    25%
Phonetisaurus      K AA ZH       25%
CART               K AE ZH       50%
Moses              K AA 9r S     25%
1:1 G2P (Rules)    K AX 9r S     50%
PLC output         K AA 9r ZH    0%
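The example can be reproduced with a small ROVER-style sketch: each converter output is aligned to a growing network of slots by edit distance, and every slot is then decided by frequency of occurrence. This is a simplified illustration, not the authors' implementation. The names `align` and `combine`, the unit costs, the tie-break ranks and the order in which the outputs are merged are all assumptions, and ROVER-style alignment is known to depend on such choices.

```python
from collections import Counter

NULL = "-"  # marks "no phoneme" in a slot


def align(network, hyp, n_merged):
    """Merge one phoneme sequence into the slot network via edit-distance alignment."""
    n, m = len(network), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = hyp[j - 1] in network[i - 1]
            # (cost, tie-break rank, operation); the ranks are an assumption
            options = [
                (cost[i - 1][j - 1] + (0 if match else 1), 0 if match else 2, "diag"),
                (cost[i - 1][j] + 1, 1, "del"),
                (cost[i][j - 1] + 1, 3, "ins"),
            ]
            cost[i][j], _, back[i][j] = min(options)
    # Trace back from the end and extend every slot by one vote.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "diag":
            merged.append(network[i - 1] + [hyp[j - 1]])
            i, j = i - 1, j - 1
        elif op == "del":
            merged.append(network[i - 1] + [NULL])
            i -= 1
        else:
            merged.append([NULL] * n_merged + [hyp[j - 1]])
            j -= 1
    return merged[::-1]


def combine(hyps):
    """Phoneme-level combination: align all outputs, then vote per slot."""
    network = [[p] for p in hyps[0]]
    for k, hyp in enumerate(hyps[1:], start=1):
        network = align(network, hyp, k)
    winners = [Counter(slot).most_common(1)[0][0] for slot in network]
    return [p for p in winners if p != NULL]


outputs = [
    "K EH 9r ZH".split(),  # Sequitur G2P
    "K AA ZH".split(),     # Phonetisaurus
    "K AE ZH".split(),     # CART
    "K AA 9r S".split(),   # Moses
    "K AX 9r S".split(),   # 1:1 G2P (Rules)
]
print(" ".join(combine(outputs)))
```

No single converter produces the reference here, but each error is outvoted in its slot: AA wins the vowel slot with two votes, while 9r and ZH each win with three of five.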
Phoneme-level combination
Relative PER change compared to best single converter output
In 10 of 16 cases, the combination is equal to or better than the best single converter
Most improvement for de and en (the languages used in the ASR experiments)
For es (most regular G2P relationship), the combination never improves over the best single converter
Wiktionary
39 Wiktionary editions with more than 1k IPA pronunciations (June 2012)
Growth of Wiktionary entries over several years (meta.wikimedia.org/wiki/List_of_Wiktionaries)
T. Schlippe, S. Ochs, T. Schultz: Web-based tools and methods for rapid pronunciation dictionary creation, Speech Communication, vol. 56, pp. 101–118, January 2014
Additional G2P converters based on word-pronunciation pairs in Wiktionary
[Figure: internal consistency (PER %) of the Wiktionary-based converters; bar labels: 3.3k, 1.5k, 3.8k and 4.6k W-P pairs]
Data
Filtered web-derived pronunciations
Fully automatic methods from (Schlippe, 2012a, 2012b, 2014)
~15% with each filtering method
Language        Best method    unfiltWDP    filtWDP    Rel. change
English (en)    M2NAlign       33.18%       26.13%     +21.25%
French (fr)     Eps            14.96%       13.97%     +6.62%
German (de)     G2PLen         16.74%       14.17%     +15.35%
Spanish (es)    M2NAlign       10.25%       10.90%     -6.34%
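The "Rel. change" column is consistent with a relative reduction computed from the two percentage columns, which can be checked with a short sketch. It assumes that both columns are error rates where lower is better, so a positive value is an improvement; the function name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Relative reduction in percent; positive means the error rate went down."""
    return 100.0 * (before - after) / before


# (unfiltWDP, filtWDP) values copied from the table above
rows = {
    "en": (33.18, 26.13),
    "fr": (14.96, 13.97),
    "de": (16.74, 14.17),
    "es": (10.25, 10.90),
}
for lang, (unfilt, filt) in rows.items():
    print(lang, f"{relative_change(unfilt, filt):+.2f}%")
```

Under this reading, filtering helps for en, fr and de, while for es the filtered data performs slightly worse than the unfiltered data.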
Phoneme-level combination
Relative PER change compared to best single converter output
PLC with unfiltered web-derived pronunciations (PLC-unfiltWDP) is already better than PLC without them (PLC-w/oWDP)
Filtering the web-derived pronunciations helps
23.1% relative PER reduction
ASR experiments
Replace dictionaries in de & en recognizers with pronunciations generated with G2P converters
Train and decode the systems
Evaluate the word error rate (WER)
As in the PER evaluation: Sequitur and Phonetisaurus are very good in most cases
However: Rules yields the lowest WERs in most scenarios
In only 1 case is PLC-w/oWDP better than or equal to the best single converter
Filtering web-derived word-pronunciation pairs helps.
Confusion Network Combination (CNC) outperforms PLC
In 9 cases, adding a system built with PLC helps in CNC
Conclusion and Future Work
In most cases, PLC comes closer to the validated reference pronunciations than the single converters
Web-derived word-pronunciation pairs can further improve quality (filtering the web data is helpful)
Weighting the single G2P converters' outputs gave no improvement, neither according to performance on the dev set nor according to the converters' confidences
Potential to enhance semi-automatic pronunciation dictionary creation by reducing the human editing effort
Positive impact of the combination in terms of lower PERs had only little influence on the WERs of our ASR systems
Including systems whose pronunciation dictionaries have been built with PLC in CNC can lead to improvements
Future work: Embedding PLC and web-derived pronunciations into the semi-automatic pronunciation dictionary creation
Further languages and further G2P converters
Thank you for your attention!