Multilingual Speech Processing - Rapid Language Adaptation Tools & Technologies

Tanja Schultz (1,2) & Alan W Black (1)

(1) Language Technologies Institute (LTI), Carnegie Mellon

(2) Cognitive Systems Lab, Karlsruhe Institute of Technology

Interspeech 2010, Tutorial on Multilingual Speech Processing

Sunday, September 26 2010, T-S2-R3 13:00 – 15:45, Makuhari, Japan

Tutorial Agenda

13:00 – 14:15 Part 1: Introduction and Motivation

o Motivation

o History and Leveraged Work

o Rapid Language Adaptation Server: Spice

o What, Why, and How

o Building process

14:15 – 14:30 BREAK

14:30 – 15:45 Part 2: SPICE - Under the hood

o Latest Experiments and Results in ASR and TTS

o Lessons Learnt from past studies

o Future

Outline Part 1

13:00 – 14:15 Part 1: Introduction

o Motivation

o World of Languages – Languages of the World

o Speech Processing Systems

o Speech-to-Speech Translation

o Spoken Dialog Systems

o History and Leveraged Work

o Globalphone

o FestVox

o Spice: A Rapid Language Adaptation Server

o User's Level View

o Walkthrough

o What, Why and How to use SPICE

o Overview of the Building Process

Introduction Outline

o Many Languages – so what?

o Growing Language Diversity on the web

o Why do we need Speech Processing in many languages?

o Is this really science – not just retraining on a new language?

o Language Characteristics

o Written form, scripts, letter-to-sound relationship

o Issues and Differences between languages

o Language Extinction

o Do we care? What can we do about it?

o Challenges of Multilingual Speech Processing

o Lack of Resources

o Lack of Experts

o Solutions

o Prior Work: GlobalPhone and FestVox

o Intelligent Learning Systems

o Rapid Language Adaptation Server

Many Languages – So What?

Do we really need Speech Processing in many languages?

Myth: “Everyone speaks English, why bother?”

NO: About 6900 different languages in the world

Increasing number of languages on the web

Humanitarian and military needs

Rural areas, uneducated people, illiteracy

Why is this a research issue?

Myth: “It’s just retraining on foreign data – simple!”

NO: Other languages bring unseen challenges, for example:

different scripts, no vowelization, no writing system

no word segmentation, rich morphology,

tonality, click sounds,

social factors: trust, access, exposure, cultural background

Everyone speaks English, why bother?

o Huge number of Languages in the world: 6912

o Language is not only a communication tool but fundamental to cultural identity and empowerment

o Treat linguistic diversity as we treat bio-diversity (David Crystal)

o The strongest eco systems are the most diverse

o Cultures, ideas, memories are transmitted through language

Each dot gives the geographic center of the 6,912 living languages, http://www.ethnologue.com (accessed Jul 2007)

Top Languages – Distribution

http://www.ethnologue.com, as of 25.08.2010

Distribution of Living Languages

So we need language support but why Speech?

Computerization: Speech is the key technology

Ubiquitous Information Access: on the go, phone-based

Mobile Devices: Too small and cumbersome for keyboards

Globalization:

Cross-cultural Human-Human Interaction

Multilingual Communities: EU, South Africa, …

Humanitarian needs, disaster, health care

Military ops, communicate with local people

Human-Machine Interfaces

People expect speech-driven applications in their mother tongue

Speech Processing in multiple Languages

Why Speech Processing?

ML Speech Processing – Research Issue?

It’s just retraining on foreign data - no science!

o New language – new challenges

o Writing system: different or no script, no vowelization, G-2-P

o Word segmentation, morphology

o Sound system: tonals, clicks

o Different Cultures – social factors

o trust, access, exposure, background

o Lack of Data and Resources

o Audio recordings, corresponding transcripts

o Pronunciation Dictionaries, Lexicon

o Text corpora, parallel bilingual data

o Lack of Experts

o Technology experts without language expertise

o Native language experts without technology expertise

Language Characteristics

Prosody, Tonality: Stress, Pitch, Length pattern, Tonal contours

(e.g. Mandarin 4, Cantonese 8, Thai & Vietnamese 5)

Sound system: simple vs very complex sound systems

(e.g. Hawaiian 5V+8C vs. German 17V+3D+22C)

Phonotactics: simple syllable structure vs complex consonant clusters

(e.g. Japanese Mora-syllables vs. German pf, st, ks)

Segmentation: Does the written form separate words by white space?

(NO: Chinese, Japanese, Thai, Vietnamese)

Morphology: short units, compounds, agglutination

English: Natural segmentation into short units – great!

German: Compounds – not quite so good

Donau-dampf-schiffahrts-gesellschafts-kapitäns-mütze …

Turkish: Agglutination – looooong phrases

Osman-lı-laş-tır-ama-yabil-ecek-ler-imiz-den-miş-siniz

behaving as if you were of those whom we might consider not converting into Ottoman

Writing Systems

Writing systems – basic unit is a Grapheme:

Logographic: based on semantic units, grapheme represents meaning

Chinese: >10.000 hanzi; Japanese ~7000 kanji, Korean to some extent

Phonographic: based on sound units, grapheme represents sound

Segmental: grapheme roughly corresponds to phonemes

Latin (190), Cyrillic (65), Arabic (22) graphemes

Abjads = consonantal segmental phonographic, e.g. Arabic

Syllabic: grapheme represents entire syllable, e.g. Japanese kana

Abugidas = mix of segmental and syllabic systems

Featural: elements smaller than phone, e.g. articulatory features

e.g. Korean: ~5600 gulja

Map legend (Wikipedia, August 2007):

Segmental: Latin, Cyrillic, Latin & Cyrillic, Greek, Georgian or Armenian

Abjads: Arabic, Arabic & Latin, Hebrew & Arabic

Abugidas: North Indic, South Indic, Ethiopic, Thaana, Canadian Syllabic

Logographic + syllabic: pure logographic, mixed logographic & syllabaries, featural syllabary + limited logographic, featural-alphabetic syllabary

Scripts – Examples

Scripts of some languages: Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, English, Greek,

Hebrew, Hindi, Italian, Japanese, Korean, Romanian, Russian, Serbian, Thai

How many languages have a written form?

• Omniglot lists about 780 languages that have scripts

• True number might be closer to 1000

(Source Simon Ager, 2007, www.omniglot.com)

Logographic scripts, mostly 2 representatives:

• Chinese: ~ 10.000 hanzi,

• Japanese: ~7000 kanji (+ 3 other scripts )

Phonographic:

• Korean: ~5600 gulja,

• Arabic, Devanagari, Cyrillic, Roman: ~100 characters

Grapheme-to-Phoneme Relation

Grapheme-to-Phoneme (Letter-to-Sound) Relationship:

Logographic: NO relationship at all

concern for Chinese, Japanese, Korean

Phonographic: segmental: close – far – complicated

e.g. Finnish, Spanish: more or less 1:1, -- English: try „Phydough“

Phonographic: segmental – consonantal

e.g. Arabic: no short vowels written

Phonographic: syllabic

e.g. Thai, Devanagari: C-V flips

Automatic Generation of Pronunciations might get complicated

Figure: languages placed on a scale from phonographic to logographic writing (ratio of phonetic to semantic code); languages shown: English, French, Finnish, Korean, Japanese, Chinese

One more Reason for MLSP …

6900 Languages in the world …. BUT

o Extinction of languages on massive scale (David Crystal, Spotlight 3/2000)

o Half of all existing languages will die out over the next century

o On average: one language dies every two weeks!

o Survey Feb 1999 from Summer Institute of Linguistics

51 languages with 1 speaker left

28 of those in Australia alone

500 languages with 500 spks

1500 languages with < 1000 spks

3000 languages with < 10.000

5000 languages with < 100.000

96% of the world's languages are spoken by only 4% of its people

The Future of Language

Is a language with 100.000 speakers safe?

o Survival for generations depends on pressure imposed on language

o Dominance of another language, Attitude of the speakers

o Example Breton: at the beginning of the 20th century it had 1 Mio speakers, now it is down to 250.000; without effort Breton could be gone in 50 years

Reasons that languages die:

o Disaster: Earthquake in Papua New Guinea: Sissano, Warapu, Arop

o Genocide: 90% of America's natives died within 200 years of the Europeans' arrival

o Cultural assimilation: Colonialism, Suppression, Assimilation:

o (1) Political, social, economic pressure to speak the dominant language,

o (2) Emerging bilingualism,

o (3) self-conscious semilingualism, (4) monolingualism

Why should we care?

o Massive death of languages reduces the diversity

o Bio-diversity has been accepted to be a good thing

o Maybe we should accept this for language diversity (D. Crystal)

What can we do?

What do we learn from other languages?

o Intellectual issues: increase awareness of world history

such as movements of early civilization

o Practical issues: medical practices, alternative treatment forms

o Literature … but also new things about the language itself

o Slovakian proverb: “with each newly learned language you acquire a new soul”

How to save endangered languages:

o Community itself must want it, Surrounding culture must respect it

o Funding for courses, materials, and teachers, support the community

o Get linguists into the field, publish information, grammars, dictionaries

Costs associated:

o Depends on conditions (written vs. unwritten languages, etc.)

o Crystal estimates about $80.000 / year per language

o 3000 endangered languages is about $700Mio …

o Organizations to raise funds

o Foundation for Endangered Languages (FEL), UNESCO project

o Lack of Resources: Stochastic approaches need large amounts of data

o Hundreds of hours audio recordings and corresponding transcriptions

Audio data 40 languages; Transcriptions take up to 40x real time

o Pronunciation dictionaries for large vocabularies (>100.000 words)

Large vocabulary pronunciation dictionaries 20 languages

o Mono- and bilingual text corpora: few language pairs, pivot mostly English

o Algorithms are language independent – MLSP is not!

o Other Languages bring unseen challenges (segmentation, G2P, etc.)

o Have we already seen ALL or MOST of the language characteristics?

o Social and Cultural Aspects

o Non-native speech and language, code switching

o Combinatorial explosion (domain, speaking style, accent, dialect, ...)

o Few native speakers at hand for minority (endangered) languages

o Do we have the right data?

o Lack of Language Experts

o Bridge the gap between technology experts and language experts

Challenges of MLSP

Intelligent systems that learn a language from the user

o Efficient learning algorithms for speech processing

o Learning:

o Interactive learning with user in the loop

o Statistical modeling approaches

o Efficiency:

o Reduce amount of data (save time and costs): at least by factor of 10

o Speed up development cycles: days rather than months

Rapid Language Adaptation from universal models

o Bridge the gap between language and technology experts

o Technology experts do not speak all languages in question

o Native users are not in control of the technology

One Solution: Learning Systems

GlobalPhone

Prior Work: GlobalPhone and FestVox

Multilingual Database

Widespread languages

Native Speakers

Uniform Data

Broad Domain

Large Text Resources

Internet, Newspaper

Corpus

19 Languages … counting

1800 native speakers

400 hrs Audio data

Read Speech

Filled pauses annotated

Arabic

Ch-Mandarin

Ch-Shanghai

German

French

Japanese

Korean

Croatian

Portuguese

Russian

Spanish

Swedish

Turkish

+ Thai

+ Creole

+ Polish

+ Bulgarian

+ Vietnamese

+ ... ???

GlobalPhone

http://www.cs.cmu.edu/~tanja/GlobalPhone

Available from ELRA

Or check with Tanja

Phones in GlobalPhone

Multilingual Speech Processing, Schultz&Kirchhoff (ed.), Chapter 4, p.86

GlobalPhone Recognizers in 14 Languages

Figure: word error rate and phoneme error rate for the GlobalPhone recognizers in 14 languages (JA, DE, EN, KO, CH, TU, FR, PO, KR, SP, BL, CZ, PL, RU); plotted values range from 10 to 46.4

GlobalPhone: Morphology & OOV-Rate

Language        Corpus        Size (Mio word tokens)   Vocabulary (k word types)   OOV at 60k / 64k
English         WSJ           19                       105                         1%
Spanish         Newspaper     100                      490                         ~1.5%
Portuguese      Newspaper     11                       270                         4.3%
German          Broadcast N   45                       >900                        4.4%
Czech           Newspaper     16                       415                         8%
Serbo-Croatian  Internet      12                       350                         8.7% (49k)
MSA             Newspaper     19                       690                         11%
Turkish         Newspaper     16                       500                         15%
Korean          Newspaper     15 (eojeols)             1400 (eojeols)              31%
                              44 (syllables)           3.5 (syllables)             ~0.01%
Chinese         Newspaper     82 (pinyin)              59                          0%

Multilingual Speech Processing, Schultz&Kirchhoff (ed.), Chapter 4, 5, 9

FestVox

Prior Work: GlobalPhone and FestVox

FestVox: Building Synthetic Voices

http://festvox.org [Black and Lenzo 2000]

o Documentation, Tools, Scripts, Examples

o Building Synthetic Voices in the Festival Speech Synthesis System

o Supports:

o Diphone, unit selection, (later Statistical Parametric Synthesis)

o Lexicon, letter to sound rules

o Text processing support.

Early FestVox Example Languages

o CMU development

o Croatian, Thai, Chinese (Mandarin), Japanese,

Catalan, Spanish, Nepali, Telugu, Tamil, Dari,

Pashto, Farsi

o Non-CMU

o At least: Italian, Malay, Maori, Mongolian,

Spanish, Telugu, Hindi, Japanese, English

(Many), German, Swedish, Polish, …

TTS Build Tasks

o Define phone set

o Define pronunciations (LTS vs. Lexicon)

o Design prompt list

o Record data

o Write text front-end

o Number, symbol expansion

o Write/train prosody model

o Deal with something novel

o Word segmentation, no vowels, declensions

TTS Build Results

o Results strongly correlated to effort

o Must-have for funded project

o Involve speech experts

o End of semester (student graduates)

o Almost random distribution rights

o Others can't always use the previous results

o No explicit copyrights (and no way to change them)

o Results often not in format for re-use

Joint Speech Model Development

CMU projects: Arabic, Thai, Croatian, Farsi

o Shared audio data collection

o Prompts with phonetic coverage

o Lots of (ASR) / Single (TTS) speaker(s)

o Shared Phone set

o Sometimes “similar” e.g. with/without Tone

o Shared Pronunciation Data

o (Note) input and output have different vocabularies

But we need a much tighter coupling ….

Speech Processing: Interactive Creation & Evaluation toolkit

• National Science Foundation, Grant 2004-2008 (Schultz & Black)

• Bridge the gap between technology experts and language experts

• Automatic Speech Recognition (ASR),

• Machine Translation (MT),

• Text-to-Speech (TTS)

• Develop web-based intelligent systems

• Interactive Learning with user in the loop

• Rapid Adaptation from universal models

• SPICE webpage http://cmuspice.org

Rapid Language Adaptation Toolkit (RLAT)

• Text Data Webcrawling, Focused Recrawling, Text Normalization

• Wiktionary-based Pronunciation Generation, Telephone Interface

• RLAT webpage http://csl.ira.uka.de/rlat-dev

SPICE and RLAT

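Speech Processing Systems

Figure: speech processing pipeline. Input: speech. Components: acoustic model (AM, from phone set & speech data), lexicon (Lex, from pronunciation rules, e.g. hi /h/ /ai/, you /j/ /u/, we /w/ /i/), language model (LM, from text data, e.g. "hi you", "you are", "I am"), NLP. Output: speech & text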

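Speech-to-Speech Translation

Figure: speech-to-speech translation between a source language Ls and a target language Lt. In each direction (input Ls to output Lt, and input Lt to output Ls) the pipeline combines acoustic models (AM), dictionaries (Dict), lexicons (Lex) and N-gram language models (LM) over word sequences of the source (s) and target (t) language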

SPICE Design Principles

o Data Sharing → Language Universal Models

o Knowledge Sharing across System Components

Monolingual Dialog Systems

o Speech Models

o ASR acoustic and language models

o TTS models

o Lexical coverage

o NLP models

o Parsing

o Generation

o Interpretation (back-end processing)

o Collects:

o Appropriate text data

o Appropriate audio data

o Defines:

o Phoneme set

o Rich prompt set

o Lexical pronunciations

o Produces:

o Pronunciation model

o ASR acoustic model

o ASR language model

o TTS voice

o Maintains:

o Projects and users login

o Data and Models

User's View of SPICE

Building support for a new language

o Login and Project Registration

o Text Collection & Prompt Selection

o Speech Data Collection

o Phone Set Specification and Selection

o Lexical construction

o TTS Voice Building

o ASR Bootstrap & training

o ASR Language model

Login and Project Registration

o Separate “projects“ for each language

o could share info between different projects

o All task times are logged

o Allows us to do cost/efficiency studies

Text Collection

o We need text data for the target language

o Web crawler

o Plus boost data from similar sites

o Language encoding

o Non-trivial, but ...

o Deal with very common alphabets

o Internally all utf-8

o In-domain vs general text

o Character analysis

o Find the character classes:

o casing, numerals, punctuation etc
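
The character analysis step could be approximated with Unicode general categories. The following is a minimal sketch, not the SPICE implementation; the function name `character_classes` and the class labels are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

def character_classes(text):
    """Count characters by coarse class using Unicode general categories."""
    counts = Counter()
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == "Lu":
            counts["uppercase"] += 1
        elif cat == "Ll":
            counts["lowercase"] += 1
        elif cat.startswith("L"):
            counts["other_letter"] += 1   # letters of scripts without casing
        elif cat.startswith("N"):
            counts["numeral"] += 1
        elif cat.startswith("P"):
            counts["punctuation"] += 1
        elif cat.startswith("Z") or ch.isspace():
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    return counts
```

A script that yields only `other_letter` counts has no casing, which tells the text normalization step that lowercasing is not needed.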

Prompt Selection

o Prompts for recording:

o Collection without transcription

o “Good” coverage will give “clean” models

o Prompts should be:

o Easy to say (no hard words, numerals etc)

o Rich in variability

Finding Nice Prompts

o Only contain high frequency words

o No unusual words with unusual spelling

o 5 to 15 words

o Make them easy to say without errors

o Easy to say in one breath group

o “Phonetically” rich

o But we have no phonetic information yet

o Make them orthographically rich

o Greedily select to maximize tri-graphs
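
The greedy selection can be sketched as follows. This is an illustrative sketch under stated assumptions: tri-graphs are taken to be character trigrams of the lowercased sentence, the 5 to 15 word limit is applied as a filter, and the high-frequency-word filter is left out. The names `trigraphs` and `select_prompts` are hypothetical.

```python
def trigraphs(sentence):
    """Return the set of character trigrams (tri-graphs) of a sentence."""
    s = " ".join(sentence.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}

def select_prompts(sentences, n_prompts, min_words=5, max_words=15):
    """Greedily pick sentences that add the most not-yet-covered tri-graphs."""
    candidates = [s for s in sentences
                  if min_words <= len(s.split()) <= max_words]
    covered, selected = set(), []
    while candidates and len(selected) < n_prompts:
        best = max(candidates, key=lambda s: len(trigraphs(s) - covered))
        if not trigraphs(best) - covered:
            break                      # nothing new left to cover
        selected.append(best)
        covered |= trigraphs(best)
        candidates.remove(best)
    return selected
```

Working on letters only is what makes this usable before any phone set or lexicon exists for the language.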

Speech Data Collection

o Online audio recording tools

o Collaboratively record large number of speakers

o Speakers may be separate from the developer

o Visual feedback during recording

o Automatic upload on completion

o Java based for portability

o Works with *many* browsers

o In control of recording

o We can control the recording format

o File contents and directory structure

New Alternative: Telephone-based Collection

o Option 1: use web-based recorder

o Option 2: Telephone (RLAT)

IVR: Interactive Voice Response

PSTN: Public Switched Telephone Network

VoIP: Voice over IP

Implementation

• Dialplan implementation: extensions.conf

• Phone number +49 721 180 30 681

Phoneme Selection

o Selection from standard IPA chart

o User's names for phonemes

o Can match their lexicon (if one exists)

o Can match their familiarity

o Audio feedback

o Click to hear recording of each phone

o Allows us to map their phone names

o We map phones to IPA

o Get phonetic features for user's phones

o (what are vowels, what are stops etc)

13:00 – 14:15 Part 1: Introduction

o Motivation, History, Leveraged Work

o Spice: A Rapid Language Adaptation Server

o User's Level Walkthrough

Login, Data Collection, Phone Selection

After the Break:

o Spice – Under the Hood

o Latest Experiments and Results

o Future

SPICE: Demo Tape

Outline Part 2

14:30 – 15:45 Part 2:

o SPICE – Under the hood

o Text collection & Prompt Selection

o Phone set specification

1. Forget lexicon construction, i.e. perform Grapheme-based ASR

2. Harvest Internet (Wiktionary) Resources

3. Interactive Learning (LexLearner)

o Evaluation

o Future Steps

Rapid Portability: Pronunciation Dictionary

Figure: speech processing pipeline (input: speech; AM, Lex, LM; text data; output: speech & text). Pronunciation rules, for example:

„adios“ /a/ /d/ /i/ /o/ /s/

„Hallo“ /h/ /a/ /l/ /o/

„Phydough“ ???

Phoneme- vs Grapheme based ASR

Figure: bar chart comparing phoneme-based, grapheme-based and grapheme-based (FTT) ASR for English, Spanish, German, Russian and Thai; plotted values range from 12.7 to 36.4

Problem:

• 1 Grapheme ≠ 1 Phoneme

Flexible Tree Tying (FTT):

One decision tree

• Improved parameter tying

• Less over specification

• Fewer inconsistencies

Figure: example decision tree with questions 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid?, -1=obstruent?, 0=end?
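
For grapheme-based ASR the pronunciation dictionary is trivial to generate: each word is "pronounced" as its letter sequence. The sketch below is only an illustration of that idea; the function name `grapheme_lexicon` is an assumption, and the simple `isalpha` filter drops combining marks, which matters for some scripts.

```python
def grapheme_lexicon(words):
    """Map each word to a 'pronunciation' that is just its letter sequence."""
    lexicon = {}
    for word in words:
        # Keep alphabetic characters only; hyphens, digits etc. are dropped.
        units = [ch for ch in word.lower() if ch.isalpha()]
        if units:
            lexicon[word] = units
    return lexicon
```

Note that „Hallo“ gets two `l` units here, whereas a phoneme dictionary lists a single /l/; the acoustic model, helped by the context questions of the decision tree, has to absorb such mismatches.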

Wiktionary as Source

o Automatic Pronunciation Dictionary Generation

o Idea: Automatically extract pronunciations from Wiktionary

Paper here at Interspeech 2010, Oral presentation, Wednesday, 5:20pm, Hall A/B

Tim Schlippe, Sebastian Ochs, and Tanja Schultz

Wiktionary as a Source for Automatic Pronunciation Extraction

Wiktionary

• Quantity Check: Given a word list, what is the percentage of words for which phonetic

notations are found in a complete IPA representation?

• Quality Check: How many pronunciations derived from Wiktionary are identical to

existing GlobalPhone pronunciations?

How does adding Wiktionary pronunciations impact the performance

of ASR systems?

Wiktionary as a Source

Figure: Top ten Wiktionary language editions (July 2010), http://meta.wikimedia.org/wiki/List_of_Wiktionaries

Figure: Quantity of pronunciations found (% of pages with pronunciations)

• Quantity of proper names

Proper names can be of diverse etymological origin and can surface in another language without undergoing the process of assimilation to the phonetic system of the new language

They are important because they are difficult to generate with letter-to-sound rules

Searched for pronunciations of 189 international city names and 201 country names to investigate the coverage of proper names

Number of compared pronunciations, percentage of identical ones and number of new pronunciation variants

Impact on ASR performance

Approach I: Using all Wiktionary pronunciations for training and decoding

Approach II: Using only those Wiktionary pronunciations in decoding that were chosen in training

Dictionary: Interactive Learning

o Follows the work of Davel & Barnard

o Word list W: extracted from text

o Loop over the word list:

1. Select the best word wi from W

2. Generate pronunciation P(wi)

3. P(wi) okay? Yes: delete wi from W. Otherwise: improve P(wi) (or LexSkip), then delete wi

4. Update G-2-P

o Update after each wi: effective training

o G-2-P: explicit map rules, neural networks, decision trees, instance learning (grapheme context)
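
The interactive loop can be sketched in a few lines. This is a toy illustration, not LexLearner itself: the G-2-P model is a plain grapheme-to-phone map (the "explicit map rules" option), the word selection criterion (best coverage by current rules) is an assumption, and `ask_user` is a hypothetical callback that returns the verified phone list or `None` to skip a word.

```python
def lex_learner(word_list, ask_user, g2p=None):
    """Predict a pronunciation, let the user confirm or fix it, update G2P."""
    g2p = dict(g2p or {})            # grapheme -> phone, toy rule model
    lexicon = {}
    remaining = [w for w in word_list if w]
    while remaining:
        # Assumed criterion: pick the word best covered by the current rules.
        word = max(remaining,
                   key=lambda w: sum(ch in g2p for ch in w) / len(w))
        predicted = [g2p.get(ch, "?") for ch in word]
        verified = ask_user(word, predicted)   # phones, or None to skip
        remaining.remove(word)
        if verified is None:
            continue
        lexicon[word] = verified
        if len(verified) == len(word):         # learn from 1:1 alignments only
            for ch, ph in zip(word, verified):
                g2p[ch] = ph                   # update after each word
    return lexicon, g2p
```

Because the rules are updated after every word, later predictions need fewer corrections, which is the point of keeping the user in the loop.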

Lex Learner

Issues and Challenges

o How to make best use of the human?

o Definition of successful completion

o Which words to present in what order

o How to be robust against mistakes

o Feedback that keeps users motivated to continue

o How many words to be solicited?

o G2P complexity depends on the language (SP easy, EN hard)

o 80% coverage: hundred (SP) to thousands (EN)

o G2P rule system perplexity

Language Perplexity

English 50.11

Dutch 16.80

German 16.70

Afrikaans 11.48

Italian 3.52

Spanish 1.21

Lex Learner TTS

o TTS feedback good for lexical pronunciation

o [Davel and Barnard 2004]

o Play TTS version of predicted pronunciation

o This even helps expert phoneticians

o Need full IPA TTS voice

o We do phonetic based TTS

o Unit selection (high fidelity)

o Flat prosody (but only isolated words)

o Good enough to keep user on track

Rapid Portability: TTS

Figure: speech processing pipeline (input: speech; AM, Lex, LM; output: speech & text)

Statistical Parametric TTS

o Text-to-speech for Applications:

o Common technologies

o Diphone: too hard to record and label

o Unit selection: too much to record and label accurately

o Statistical Parametric: “just right”

o Statistical Parametric Synthesis

o “HMM synthesis”

o clustergen trajectory synthesis

o Clusters representing context-dependent allophones

o PRO:

o can work with little speech (10 minutes)

o Robust to poor data

o CON:

o Signal sounds “buzzy”, can lack varied prosody

Voice Building Process

o Can usually collect 300-500 utterances

o Single speaker, rich prompt set

o Have lexical coverage (from Lex Learner)

o Automatic labeling from acoustic models

o Automatic: spectral and prosodic models

o But prosody will be similar to recordings

o No text processing front end (yet)

Cross Lingual Voice Conversion

o Use non-native acoustic models

o Adapt them to target language

o Use small amount of target data

o Align and build mapping function (MLLR or GMM)

o Requires phoneme mapping

o Automatic or by hand

[Anumanchipalli and Black SLTU 2010]

CLVC from English

o Conversion to German and Telugu

Using ASR data for TTS

o Conventional TTS databases:

o Single speaker, well recorded

o Conventional ASR databases:

o Multi speaker, varied quality

o [Yamagishi et al IS2009]

o Speaker Adaptive Training

o Statistical Parametric Synthesis

o (Select “similar” speakers from DB)

Rapid Portability: Acoustic Models

Figure: speech processing pipeline (input: speech; AM, Lex, LM; output: speech & text)

Rapid Portability: Data

Step 1:

• Uniform multilingual database (GlobalPhone)

• Build Monolingual acoustic models in many languages

Multilingual Acoustic Modeling

Step 2:

• Combine monolingual acoustic models into a set of multilingual “language independent” acoustic models

Speech Production is independent from Language → IPA

1) IPA-based Universal Sound Inventory

2) Each sound class is trained by data sharing

Reduction from 485 to 162 sound classes

m,n,s,l appear in all 12 languages

p,b,t,d,k,g,f and i,u,e,a,o in almost all

Universal Sound Inventory
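
Data sharing across languages amounts to pooling all language-specific phones that carry the same IPA symbol into one sound class. A minimal sketch, assuming each language's phone set is already labelled with IPA symbols; the function name and the toy phone sets in the usage note are illustrative, not GlobalPhone data.

```python
from collections import defaultdict

def build_universal_inventory(phone_sets):
    """phone_sets: language -> iterable of IPA symbols.
    Returns IPA symbol -> set of languages whose data trains that class."""
    inventory = defaultdict(set)
    for lang, phones in phone_sets.items():
        for ipa in phones:
            inventory[ipa].add(lang)
    return dict(inventory)
```

With toy input `{"DE": ["m", "n", "pf"], "JA": ["m", "n"], "SP": ["m", "n", "ɲ"]}`, eight language-specific phones collapse into four shared classes, the same kind of reduction as the 485 to 162 sound classes reported above.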

Rapid Portability: Acoustic Models

Step 3:

• Define mapping between ML set and new language

• Bootstrap acoustic model of unseen language
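
Once a phone mapping exists, bootstrapping starts by rewriting the new language's dictionary in terms of the phones of the existing models. The sketch below assumes a hand-written one-to-one mapping; `map_dictionary` and the example mapping in the usage note are hypothetical.

```python
def map_dictionary(target_dict, phone_map):
    """Rewrite target-language pronunciations with source-model phones.

    target_dict: word -> list of target-language phones
    phone_map:   target phone -> phone of the existing (source) model set
    """
    mapped = {}
    for word, phones in target_dict.items():
        missing = [p for p in phones if p not in phone_map]
        if missing:
            # An incomplete mapping would silently break decoding, so fail.
            raise KeyError(f"no mapping for {missing} in '{word}'")
        mapped[word] = [phone_map[p] for p in phones]
    return mapped
```

For example, `map_dictionary({"ano": ["a", "n", "o"]}, {"a": "a", "n": "n", "o": "o"})` returns the entry unchanged, while a phone without a counterpart has to be mapped by hand to its closest neighbour before the call succeeds.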

Acoustic Model Building

o Acoustic Model Building requires:

o Recorded Read Speech Data

(since data is read, we have the transcripts!)

o Phone set definition

o Pronunciation Lexicon

o Two step process:

1. Configuration

2. Model Training

Acoustic Model Building - Configuration

o Checks dependencies and errors

o Lexicon and phone set correspond

o Words in recorded prompts are covered by the lexicon

o Divides the recorded data into training and test sets

o Performance evaluation

o Few data: K-fold cross-validation, with K = #speakers

o More data: Data split into 90% (train) and 10% (test)
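
Both evaluation schemes are easy to sketch. The code below is an assumption-laden illustration: utterances are `(speaker_id, utterance_id)` pairs, the K-fold case holds out one speaker per fold, and the 90/10 split is done per utterance (the slides do not say whether it is per utterance or per speaker).

```python
import random

def speaker_folds(utterances):
    """K-fold cross-validation with K = number of speakers.
    Yields (held_out_speaker, train, test) for each speaker."""
    speakers = sorted({spk for spk, _ in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        yield held_out, train, test

def train_test_split(utterances, test_fraction=0.1, seed=0):
    """Random split into training and test utterances (default 90% / 10%)."""
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Holding out whole speakers keeps the test speaker unseen in training, which matters when only a handful of speakers were recorded.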

Acoustic Model Building - Training

o Requires successful configuration

o Creates Log files and Displays to the user

o All steps of training

o EM Training for Context Independent Models

o 3-state HMM

o Number of Gaussians per Model depends on data

o EM Training for Context Dependent Models

o Number of models depends on data

o MFCC front-end, LDA

o Progress of training procedure

o Results of performance evaluation

Unsupervised Adaptation

• Goal:

– Build Automatic Speech Recognition (ASR) for unseen

Language/Accent/Dialect with minimal human effort

• Challenge:

– No or Few Data, i.e. no transcribed Data!!!

• Solution:

– No transcriptions: apply unsupervised training approaches

– Lack of Linguistic Knowledge: transfer knowledge from other languages

– Here:

• Use several languages

• … of the same language family

• … and combine knowledge

Experimental Setup

• Given: Data, Transcripts, ASR for several languages

• Recognition Systems for 4 Slavic Languages

– Croatian (South-Slavic, 7M spks)

– Russian (East-Slavic, 165M spks)

– Bulgarian (South-Slavic, 12M spks)

– Polish (West-Slavic, 56M spks)

• Wanted: ASR for Czech: (West-Slavic, 12M spks)

• GlobalPhone (ELRA), read speech, 100spks, ~ 20hrs per language

Cross-Language Transfer

Figure: acoustic models of the source languages (Russian, Bulgarian, Croatian, Polish) are mapped to the target language (Czech) by cross-language transfer

• Benefit: Source Acoustic Model not touched, apply CD models

• Benefit: Faster Decoding

• Drawback: If applied iteratively (see later), no adaptation of target AM

Manual Phone Mapping

Cross-Language Transfer (C-T)

o Given: ASR in 4 languages (Bulgarian, Croatian, Polish, Russian)

o Audio Data in Czech but NO Transcripts

o Czech Pronunciation Rules (G-2-P Mapping for Dictionary)

o Text Data, Vocabulary Selection (Automatically derived)

o ASR Performance on Czech

o Apply Manual Phone Mapping to Dictionary

o Apply Source Language Acoustic Models (AM) as is

o Recognize Target Language Czech (Dev set)

o Overall performance is depressingly bad

o Word Error Rates (WER) vary with source language

o Automatic Mapping gives slightly better numbers but not worth it
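
Word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The following sketch uses the standard dynamic-programming formulation; the function name is an assumption, and real scoring tools additionally normalize text before comparing.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 1.0 (100%) when the hypothesis is much longer than the reference.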

Improvements: Unsupervised Training (UT)

• UT: Assume audio but no transcription

– take recognizer hypothesis as transcription

• Several interesting works on UT (Zavaliagkos, 1998; Lamel et al. 2002; Wessel/Lööf/Ney 1999-2009, …)

• UT effective only in combination with „confidence scores“

to select/weight the correct portions of the hypotheses

• Kemp and Schaaf, 1997 proposed „gamma“ and „A-stabil“

as confidence measures (JRTk)

• Problem: A-Stabil works well for well trained Acoustic Models (AM) but not with badly trained AMs, so NO option for C-T
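
How confidence scores gate unsupervised training can be illustrated with a simple filter. This sketch does not compute "gamma" or "A-stabil"; it assumes word-level confidences in [0, 1] are already attached to each hypothesis and keeps only utterances whose mean confidence reaches a threshold. The function name and the threshold value are assumptions.

```python
def select_confident(hypotheses, threshold=0.8):
    """hypotheses: list of (utt_id, [(word, confidence), ...]).
    Returns (utt_id, transcript) pairs trusted enough for training."""
    selected = []
    for utt_id, words in hypotheses:
        if not words:
            continue                      # empty hypothesis, nothing to learn
        mean_conf = sum(conf for _, conf in words) / len(words)
        if mean_conf >= threshold:
            transcript = " ".join(word for word, _ in words)
            selected.append((utt_id, transcript))
    return selected
```

The selected hypotheses then serve as transcriptions for the next training iteration; weighting individual words instead of dropping whole utterances is a common refinement.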

Multilingual A-Stabil

Results on Multilingual A-Stabil

Bootstrap Framework

Performance on Czech Development Set

WER         C-T    Iter1   Iter2
Bulgarian   61.0   24.5    23.6
Croatian    57.2   24.6    23.7
Polish      55.8   24.4    24.1
Russian     64.3   24.1    23.8
BestCZ      23.1 (supervised)

Poster here at IS2010: N.T. Vu, T. Schlippe, F. Kraus, T. Schultz, Rapid Bootstrapping of five Eastern European languages using the Rapid Language Adaptation Toolkit, Tuesday 10:00, Room B

Language Model Building

o Get as much relevant text data as possible

o Use the text data for

o Generating recording prompts

o Generating vocabulary lists

o Build Language Models for ASR

Approach

1. User supplies a URL to SPICE for crawling

2. Crawler retrieves N documents (web-pages)

3. Compute the statistics (TF-IDF) from the N documents

4. Terms with highest TF-IDF score form query terms

5. Query search engine (Google) to get the URLs for the query terms

6. Crawl the URLs for the data
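
Steps 3 and 4 can be sketched as below. This is an illustrative sketch only: tokenization is plain whitespace splitting, TF is the relative term frequency within a document, IDF is log(N / document frequency), and each term keeps its highest score over all documents (the aggregation rule is an assumption). The crawling and search-engine steps are left out.

```python
import math
from collections import Counter

def top_tfidf_terms(documents, n_terms=5):
    """Return the n_terms terms with the highest TF-IDF score."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = max(scores[term], tf * idf)
    return [term for term, _ in scores.most_common(n_terms)]
```

Terms that occur in every crawled page get an IDF of zero, so function words drop out and topical words end up as query terms.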

RLAT – Snapshot function

o Informative feedback about the quality of the crawled text

o Results show quality (PP, OOV), computed and displayed periodically (to be defined by the user) during the crawling process
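
The OOV part of such a snapshot is a simple count. A minimal sketch, assuming the vocabulary is the `vocab_size` most frequent words of the crawled text and the rate is measured on held-out test tokens; the function name is hypothetical, and perplexity (PP) would additionally need a trained language model.

```python
from collections import Counter

def oov_rate(crawled_tokens, test_tokens, vocab_size):
    """Percentage of test tokens not covered by the top-N crawled words."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    vocab = {w for w, _ in Counter(crawled_tokens).most_common(vocab_size)}
    n_oov = sum(1 for w in test_tokens if w not in vocab)
    return 100.0 * n_oov / len(test_tokens)
```

Calling this at regular intervals during crawling gives the falling OOV curve that tells the user when further crawling stops paying off.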

RLAT webpage: http://csl.ira.uka.de/rlat-dev

Example: 18 days of crawling www.leparisien.fr (French)

Figure: n-gram coverage, OOV rate (%) and vocabulary size plotted over the total number of crawled words (50 Mio to 400 Mio)

Text Normalization based on SMT and Internet User Support

o Text extraction, HTML tag removal

o Language independent normalization

o Language-specific normalization

o Common abbreviations, punctuation, numbers, dates, casing

o New approach: (See Interspeech 2010 paper)

o Web-based user interface for language-specific text normalization

o Hybrid approach (rules + SMT)

Figure: Web-based User Interface for Text Normalization

o Experiments and Results:

o How well does SMT perform in comparison to LI-rule (language-independent), LS-rule (language-specific) and human?

o How does the performance of SMT evolve over the amount of training data?

o How can we modify our system to get a time and effort reduction?

o Evaluation:

o Comparing the quality of 1k output sentences derived from the systems to text which was normalized by native speakers in our lab

o Creating 3-gram LMs from our hypotheses and evaluating their perplexities on 500 sentences manually normalized by native speakers

o Detailed Results: Paper here at Interspeech 2010

Tim Schlippe, Chenfei Zhu, Jan Gebhardt, and Tanja Schultz, Text Normalization based on Statistical Machine Translation and Internet User Support
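
Comparing system output with human-normalized text by edit distance could look like the sketch below. The word-level granularity and the name `edit_distance` are assumptions for illustration; the paper may define the measure differently.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists.

    Counts the minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref (row-by-row dynamic programming).
    """
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            substitution = previous[j - 1] + (h != r)
            current.append(min(previous[j] + 1,      # delete h
                               current[j - 1] + 1,   # insert r
                               substitution))        # match or substitute
        previous = current
    return previous[-1]

# Usage: system output versus a manually normalized reference sentence.
hypothesis = "on 3 may dr smith paid 5 dollars".split()
reference = "on third may doctor smith paid five dollars".split()
distance = edit_distance(hypothesis, reference)
```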

Figure: Performance (edit dist.) over amount of training data

SPICE 2005: Afrikaans – English

o Goal: Build Afrikaans – English Speech Translation System with SPICE

o Cooperation with University of Stellenbosch and ARMSCOR

o Bilingual PhD visited CMU for 3 months

o Afrikaans: Related to Dutch and English, g-2-p very close, regular grammar, simple morphology

o SPICE: all components apply the statistical modeling paradigm

o ASR: HMMs, N-gram LM (JRTk-ISL)

o MT: Statistical MT (SMT-ISL)

o TTS: Unit-Selection (Festival)

o Dictionary: G-2-P rules using CART decision trees

o Text: 39 hansards; 680k words; 43k bilingual aligned sentence pairs

o Audio: 6 hours read speech; 10k utterances, telephone speech (AST)

o Good results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9)

o Shared pronunciation dictionaries (for ASR+TTS) and LM (for ASR+MT)

o Most time-consuming process: data preparation, so reduce the amount of data!

o Still too much expert knowledge required (e.g. ASR parameter tuning!)

Figure: Time effort (days) per phase (Data, Training, Tuning, Evaluation, Prototype) for AM (ASR), Lex, LM (ASR, MT), TM (MT), TTS, S-2-S

Herman Engelbrecht, Tanja Schultz, Rapid Development of an Afrikaans-English Speech-to-Speech Translator, IWSLT 2005, Pittsburgh, PA, October 2005

SPICE 2007: Field Experiments

o Now targeting more languages in a shorter time frame

o 6-week hands-on course at CMU in Spring 2007

o Adopt native languages of participating students as targets

o Added up to 10 different languages: Bulgarian, English, French, German, Hindi, Konkani, Mandarin, Telugu, Turkish, Vietnamese

o Teams of two students with different native languages

o Course goal was to build a simple S-2-S system and use this to communicate with each other in their mother tongue

o Solely rely on SPICE tools

o Build speech recognition components in two languages

o Build simple SMT component in two directions

o Build speech synthesis components in two languages

o Report back on problems and system shortfalls

Field Experiments (2)

o The 10 languages cover a broad range of peculiarities

o Writing system:

o Logographic Hanzi (Mandarin);

o Cyrillic (Bulgarian);

o Roman (German, French and English);

o phonographic segmental (Telugu and Hindi);

o phonographic featural (Vietnamese)

o No script: Konkani

o Segmentation: no segmentation (Chinese); white spaces do not necessarily indicate word boundaries (Vietnamese)

o Morphology: simple, low inflecting (English), compounding (German), agglutinating (Turkish) …

o Sound System: tonal (Mandarin and Vietnamese), stress (Bulgarian)

o G-2-P: straightforward (Turkish), challenging (Hindi), difficult (English), no relationship (Chinese), invented (Konkani)

Lessons Learned

o It is possible to create speech processing components for 10 languages in 6 weeks using SPICE

o Each language brings new challenges

o Many SPICE features turned out to be very helpful, e.g. there was only ONE speaker of Konkani in Pittsburgh; the web recorder allowed remote collection of more speakers

o Log: time spent in SPICE interface

o Improve interface using breakdown

o Use feedback

o Interface allows for collaborative work

Task               Time Spent [hh:mm]
Text Collection    8:35
Audio Collection   10:07
Phoneme Selection  4:05
LM building        1:25
G-2-P specs        1:30

Outline Part 2

14:30 – 15:45 Part 2:

o Evaluation of time vs expertise

o Future

Initial evaluations

o Conducted 2 semester-long lab courses

o students use SPICE to create working ASR and TTS in a language of their choice

o bonus for the ambitious

o train statistical MT system between two languages to create a speech-to-speech translation system

o Evaluation includes

o time to complete

o task difficulties

o ASR word error rate

o TTS voice quality

Evaluation details on TTS

o Research questions:

o Which features contribute most?

o Are language-dependent features critical?

o How does voice quality vary with:

o Amount of recorded data

o Number of lexical entries

o Can objective measures estimate voice quality?

o Can any measures motivate the user?

TTS from phonemes

o “welcome” -> W EH L K AH M (lexicon or LTS)

o symbolics: clustered 'allophones'

o acoustics: units or trajectories

o parametric synthesis (trajectory-based), predicted from phonetic and linguistic context using CART trees

CART tree features

o Four categories of training features

1. name: phoneme and HMM state context

2. position: e.g., % from start (of phrase, word, phone, HMM-state)

3. IPA features: based on phoneme set

4. linguistic: e.g. PoS, syllable segmentation

o Increasing difficulty

o 1. and 2. are language-independent

o 3. requires a defined phoneset

o 4. requires a computational linguist

Effect of feature classes

Feature class    Number  Lang-dep.  Δ MCD
no CART trees    1       no         baseline
name symbolics   16      no         -0.452
position values  7       no         -0.402
IPA symbolics    72      yes        -0.001
linguistic sym.  14      yes        +0.004

o Mel-cepstral distortion (MCD)

o Standard objective measure in TTS/VC

o lower numbers are better

o ~ 0.2 is perceptually noticeable

o ~ 0.08 is statistically significant

o first two feature classes (name and position) matter
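
For reference, MCD can be computed as in the sketch below, using the commonly used definition (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name and the input format are assumptions: frames are taken to be already time-aligned, and the caller is assumed to have dropped the 0th (energy) coefficient.

```python
import math

def mel_cepstral_distortion(reference, synthesized):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    reference, synthesized: equally long lists of frames; each frame is a
    list of coefficients (assumption: alignment is done beforehand).
    """
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)
```

Identical sequences give an MCD of 0; the Δ MCD values in the table are differences between two such averages.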

Effect of feature classes on MCD

Figure: Cumulative Feature Classes, MCD over Wagon Stop Value (10, 100, 1000) for arctic_slt. Legend: names + posn; names + posn + IPA; + linguistic (hidden)

Effect of database size

o How does MCD improve with more speech?

o improvement of 0.2 from 3.75 to 7.5 minutes

o improvement of 0.1-0.12 per doubling beyond that (near-linear improvement with each doubling of data)

o 2x the data is perceptually better when <10 min of speech

o 4x the data is perceptually better after 10 min of speech

o Where is the asymptote?

o don't know yet!

o maybe 20 hours
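
The 2x and 4x figures follow from the per-doubling gains and the roughly 0.2 MCD threshold for a perceptually noticeable difference. A small worked sketch, where the function name and the assumption of a constant gain per doubling are illustrative:

```python
import math

def data_factor_for_noticeable_gain(gain_per_doubling, threshold=0.2):
    """Factor by which the speech data must grow for a noticeable MCD gain.

    Assumes a constant MCD improvement per doubling of data. The small
    epsilon guards against floating-point round-up in the division.
    """
    doublings = math.ceil(threshold / gain_per_doubling - 1e-9)
    return 2 ** doublings

small_db = data_factor_for_noticeable_gain(0.2)   # below 10 minutes
large_db = data_factor_for_noticeable_gain(0.1)   # beyond 10 minutes
```

With a gain of 0.2 per doubling one doubling suffices; with 0.1 to 0.12 per doubling two doublings are needed.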

Database size vs MCD

Figure: Effect of Database Size on MCD, over Wagon Stop Value (10, 100, 1000) for arctic_slt; curves tested on training data and tested on 10% heldout. Legend: 1/16 hour, 1/8 hour, 1/4 hour, 1/2 hour, 1 hour

Effect of a good Lexicon

o Grapheme-based voice

o 26 letters a-z are a substitute 'phone' set

o no IPA and linguistic features

o 3 slides ago showed that these don't matter

o English has highly irregular spelling

o the acoustic classes are impure

o measuring global voice quality – overlooking mispronounced words

o Results:

o MCD improves by 0.27 with the phoneme-based lexicon

o consistent across CART leaf node size (“stop value”)
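
A grapheme-based "lexicon" needs no pronunciation knowledge at all, which is what makes it a useful baseline. A minimal sketch, with a hypothetical function name:

```python
def grapheme_pronunciation(word):
    """Use each letter a-z as its own 'phone'; drop all other characters."""
    return [ch for ch in word.lower() if "a" <= ch <= "z"]

# Usage: the grapheme units stand in for a phoneme string such as
# W EH L K AH M, which would come from a lexicon or LTS rules.
units = grapheme_pronunciation("Welcome")
```

The same letter then covers acoustically different sounds (for example the two "e" in "welcome"), which is why the acoustic classes are impure.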

Grapheme vs Phoneme (English)

Figure: Grapheme versus Phoneme-based Voices, MCD over Wagon Stop Value (10, 100, 1000) for arctic_slt. Legend: grapheme based, phoneme based

10 non-English test languages

o European

o Bulgarian, French, German, Turkish

o Indian

o Hindi, Konkani, Tamil, Telugu

o East Asian

o Mandarin, Vietnamese

Evaluating non-English voices

o Need a 'good' and a 'bad' voice

o Phoneme-based English is good

o Grapheme-based English is bad

o Data covers 3m to 1h of speech

o may be extrapolated to about 4h

o Following voices are from student lab projects

Non-English languages

Figure: Effect of Database Size on MCD - Multi-Lingual, MCD over Database Size (h), with arctic_slt for comparison. Voices shown: Konkani, Mandarin, Bulgarian, Hindi, Tamil, German, French, Vietnamese. Legend: character-based, phoneme-based

Characterizing voice quality

o MCD and size permits a quick assessment

o French is in good shape

o German could use lexicon improvements

o Hindi and Tamil are good for their size

o recommendation: collect more speech

o Bulgarian, Konkani and Mandarin need more speech and a better lexicon

o Vietnamese voice had character set issues

o resulted in only ¼ of the speech being used

More speech or a better lexicon?

o From the English MCD error curves

o 5x the speech = fixing the phoneset+lexicon

o Which is more time effective?

o assume 3-4 sentence recordings per minute

o assume 2-3 lexicon verifications per minute

o Answer

o If speech DB is small, record more speech

o If speech DB is large, work on the lexicon

o The transition point is language dependent
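
The trade-off can be worked through with a small calculation. This is a sketch under loud assumptions: the function name is hypothetical, the rates are the midpoints of the ranges above (3.5 recordings and 2.5 verifications per minute), every lexicon entry is assumed to need verification, and reaching 5x the speech means recording 4x the current amount on top.

```python
def cheaper_option(current_sentences, lexicon_entries,
                   recordings_per_min=3.5, verifications_per_min=2.5,
                   speech_factor=5):
    """Compare the time for 5x the speech with the time to verify the lexicon."""
    extra_sentences = (speech_factor - 1) * current_sentences
    record_minutes = extra_sentences / recordings_per_min
    lexicon_minutes = lexicon_entries / verifications_per_min
    if record_minutes <= lexicon_minutes:
        choice = "record more speech"
    else:
        choice = "work on the lexicon"
    return choice, record_minutes, lexicon_minutes

# Usage: a small and a large speech database with the same 5000-entry lexicon.
small = cheaper_option(current_sentences=100, lexicon_entries=5000)
large = cheaper_option(current_sentences=2000, lexicon_entries=5000)
```

For the small database recording is far cheaper (about 114 versus 2000 minutes); for the large one the lexicon work wins. The crossover depends on the lexicon size, which is one reason the transition point differs between languages.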

Evaluation on TTS: Conclusions

o Discovered

o a way to estimate the quality of new voices

o a framework for directing the best course of actions (recording more data vs lexical improvements)

o that IPA and our particular language dependent features are not critical to success (Phew!)

o Next stage

o detecting bad lexical entries from acoustics

Conclusion

o Challenges in Multilingual Speech Processing

o Well defined build processes: ASR, MT, TTS … BUT:

o Every new language brings unseen challenges

o Current (statistical) approaches require lots of data

o … and native language expert and technology expertise

o How to bridge the gap between language and tech expert?

o Proposed solution: SPICE and RLAT

o Learning by interaction from a cooperative (but naïve) user

o Rapid adaptation from language universal models

o Knowledge sharing across components

o Development cycle: Days rather than weeks

o Better automatic identification of problem types

Future

o Continuous Server Support

o Improve Interface based on user feedback and lessons learned

o Improve Language Robustness: font encoding, …

o Software Engineering, Scaling

o Collaboration

o Multiple people working on the same project

o Leverage from archived projects

o Cross-confirmation

o Multiple views for within and across project confirmation

o Confidence measure to find appropriate combination

o Error-blaming

o End-to-end system Evaluation vs Component Evaluation

o Automatic Generation of Recommendations to improve systems

Latest Information

o System is online at http://cmuspice.org

o Use system for your own project

o Create new login/passwd and project

o Preloaded Hindi Example

o Login as

o Login: demo

o Passwd: demo

o Choose project # (your birth day)
