Bootstrapping a Language-Independent Synthesizer

Craig Olinsky, Media Lab Europe / University College Dublin, 15 January 2002


Page 1: Bootstrapping a Language-Independent Synthesizer

Bootstrapping a Language-Independent Synthesizer

Craig Olinsky

Media Lab Europe / University College Dublin

15 January 2002

Page 2: Bootstrapping a Language-Independent Synthesizer

Introducing the Problem

Given a set of recordings and transcriptions in an arbitrary language, can we quickly and easily build a speech synthesizer?

YES, if we know something about the language.

However, for the majority of languages for which such resources don’t exist…

Page 3: Bootstrapping a Language-Independent Synthesizer

Starting from Sample

PROS

The existing synthesizer provides a store of “linguistic” knowledge we can start from.

Analogous to speaker adaptation in speech recognition systems.

Overall, quality should be better.

CONS

Difficulty is related to the degree of difference between sample and target language.

Best as a gradual process: accent/dialect, not language.

Page 4: Bootstrapping a Language-Independent Synthesizer

Starting from Scratch

PROS

Difficulty is directly proportional to the complexity of the language.

Common procedure based upon machine learning from recordings and transcripts.

CONS

We don’t have a great deal of relevant knowledge to apply to the task.

If not using a principled phone set, it is necessary to segment / label recordings cleanly.

Page 5: Bootstrapping a Language-Independent Synthesizer

The Obvious Compromise

Take what we do know from building speech synthesis, and generalize it to an existing framework.

-- we’re not specifically learning from “scratch”

-- at the same time, we’re not making linguistic assumptions pre-coded into the source voices

Page 6: Bootstrapping a Language-Independent Synthesizer

“Generic” Synthesis Framework/Toolkit

A set of scripts, utilities, and definition files to help automate the creation of reasonable speech synthesis voices for an arbitrary language, without the need for linguistic or language-specific information.

Built on top of the Festival Speech Synthesis System and FestVox toolkit (for waveform synthesis; most text processing and pronunciation handling is externalized to locally developed tools).

Page 7: Bootstrapping a Language-Independent Synthesizer

Language-Dependent Synthesis Components

Phone set

Word pronunciation (lexicon and/or letter-to-sound rules)

Token processing rules (numbers etc)

Durations

Intonation (accents and F0 contour)

Prosodic phrasing method
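
The component list above can be read as a checklist for a new voice. The following is a minimal sketch of that idea in Python; the `VoiceDefinition` class, its field names, and `missing_components` are hypothetical names introduced here for illustration and are not part of Festival or FestVox.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VoiceDefinition:
    """Hypothetical container for the language-dependent parts of a voice."""
    phone_set: Optional[list] = None
    pronunciation: Optional[dict] = None  # lexicon and/or letter-to-sound rules
    token_rules: Optional[dict] = None    # numbers etc.
    durations: Optional[dict] = None
    intonation: Optional[dict] = None     # accents and F0 contour
    phrasing: Optional[str] = None        # prosodic phrasing method


def missing_components(voice):
    """Return the names of components that have not been supplied yet."""
    return [f.name for f in fields(voice) if getattr(voice, f.name) is None]


# A voice for which only a phone set is known so far.
voice = VoiceDefinition(phone_set=["c", "a", "t"])
print(missing_components(voice))
```

Everything the function reports as missing is something that has to be learned from data or supplied by hand for each new language.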

Page 8: Bootstrapping a Language-Independent Synthesizer

Phoneme Sets

If we rely on a pre-existing set of pronunciation rules, lexicon, etc., we are automatically limited to using the phone-set used in those resources (or something which they can be mapped to); most likely something language-dependent.

IPA, SAMPA: something language-universal?

We need to generate pronunciations: how do we create the relationship between our training database / phonetic representation / orthography?
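
To illustrate what mapping an existing resource's phone set onto another inventory might involve, here is a small sketch. The table is an assumption made for this example: a handful of ARPAbet-style symbols paired with SAMPA counterparts, deliberately incomplete. The function name `map_pronunciation` is also hypothetical.

```python
# Illustrative fragment only: a few ARPAbet-style symbols and SAMPA counterparts.
# A real mapping would need to cover the full inventory of the source lexicon.
ARPABET_TO_SAMPA = {"K": "k", "AE": "{", "T": "t", "D": "d", "G": "g"}


def map_pronunciation(phones, table):
    """Map each phone through the table; collect symbols with no mapping."""
    mapped, unmapped = [], []
    for phone in phones:
        if phone in table:
            mapped.append(table[phone])
        else:
            unmapped.append(phone)
    return mapped, unmapped


print(map_pronunciation(["K", "AE", "T"], ARPABET_TO_SAMPA))
print(map_pronunciation(["D", "AO", "G"], ARPABET_TO_SAMPA))
```

The second call leaves `AO` unmapped, which is the practical limitation noted above: the borrowed resource is only usable to the extent that its phone set can be mapped.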

Page 9: Bootstrapping a Language-Independent Synthesizer

“Multilingual” Phoneme Sets: IPA, SAMPA

We don’t want to be stuck with a set of phonemes targeted for a specific language, so we instead use a phoneme definition designed to be inclusive of all languages.

But… this still assumes we know the relationship between the phone set and orthography of the language; i.e. for any given text we can generate a pronunciation.

This approach still assumes linguistic knowledge!

Page 10: Bootstrapping a Language-Independent Synthesizer

Orthography as Pronunciation

cf: R. Singh, B. Raj and R.M. Stern, “Automatic Generation of Phone Sets and Lexical Transcriptions;” ..

Suppose we begin with the orthography of the written language.

e.g.

CAT = [c] [a] [t]

DOG = [d] [o] [g]

This implies

• A relation between the number of characters in a spelling and the length of the pronunciation

• The orthography of a language is consistent / efficient
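
The idea of treating spelling as pronunciation can be sketched in a few lines. This is a minimal illustration assuming whitespace-separated transcripts and one "phone" per alphabetic character; the function names are hypothetical, and this is not the toolkit's actual implementation.

```python
def spelling_to_phones(word):
    """Treat each letter of the spelling as one 'phone'."""
    return [ch.lower() for ch in word if ch.isalpha()]


def build_lexicon(transcripts):
    """Derive a word -> phone-list lexicon directly from transcript text."""
    lexicon = {}
    for line in transcripts:
        for word in line.split():
            phones = spelling_to_phones(word)
            if phones:
                lexicon["".join(phones)] = phones
    return lexicon


def phone_set(lexicon):
    """The phone set is simply the set of letters seen in the transcripts."""
    return sorted({p for phones in lexicon.values() for p in phones})


lex = build_lexicon(["CAT", "DOG cat"])
print(lex)
print(phone_set(lex))
```

Under this scheme the pronunciation always has exactly as many units as the spelling has letters, which is the first implication listed above.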

Page 11: Bootstrapping a Language-Independent Synthesizer

Orthography as Pronunciation

Page 12: Bootstrapping a Language-Independent Synthesizer

Implications for Data Labeling and Training

Page 13: Bootstrapping a Language-Independent Synthesizer

Non-Roman Orthography: Questions of Transcription

Page 14: Bootstrapping a Language-Independent Synthesizer

Difficulties in Machine Learning of Pronunciation

“But there is a much more fundamental problem … in that it crucially assumes that letter-to-phoneme correspondences can in general be determined on the basis of information local to a particular portion of the letter string. While this is clearly true in some languages (e.g. Spanish), it is simply false for others….

“…It is unreasonable to expect that good results will be obtained from a system trained with no guidance of this kind, or … with data that is simply insufficient to the task.”

– Sproat et al., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, pp. 76-77

Page 15: Bootstrapping a Language-Independent Synthesizer

Lexicon / Letter-to-Sound Rules

Page 16: Bootstrapping a Language-Independent Synthesizer

Token Processing

Page 17: Bootstrapping a Language-Independent Synthesizer

Duration and Stress Modeling

Page 18: Bootstrapping a Language-Independent Synthesizer

Intonation and Phrasing

Page 19: Bootstrapping a Language-Independent Synthesizer

Unit Selection and Waveform Synthesis

Page 20: Bootstrapping a Language-Independent Synthesizer

Overview: Adaptation for Accent and Dialect

Page 21: Bootstrapping a Language-Independent Synthesizer

Final Points

Page 22: Bootstrapping a Language-Independent Synthesizer