Speech Synthesis in the SPACE Reading Tutor

Closing Symposium of the SPACE Project, 06 FEB 2009

Laboratory for Digital Speech and Audio Processing - DSSP

Yuk On Kong, Lukas Latacz, Werner Verhelst

Laboratory for Digital Speech and Audio Processing

Vrije Universiteit Brussel

Introduction


To Record or Not to Record: That’s the question.

• Pre-recorded speech in existing reading tutors

• Advantages / disadvantages?

Application-specific TTS

• Speaker / voice

• Material in speech corpus

• How to synthesize

• Any extra mode necessary?

• The child is too slow…

• How to maximize quality

Speaker / Voice

• Speaker
– Appealing to children
– Female speaker
– Standard Flemish pronunciation, no noticeable regional accent
– Experienced speaker


Material in Speech Corpus

• Database (about 6 hours)
– Material from stories for children
– Words expected at 6 years of age
– Diphones

How to synthesize

• Based on the general unit selection paradigm.

• Heterogeneous units: units can be of various sizes

• Rationale:
– Use of longer chunks leads to quality improvement.
– Used for synthesizing domain-specific utterances.

[Figure: the word “oma” to synthesize and its multi-tier segmentation. Word tier: oma. Syllable tier: o, ma. Segment tier (labels as shown on the slide): _-o, o-m, m-a.]


• Basic algorithm:
– Search top-down and select the longest sequence of targets at each level; go to lower levels if no candidates are found.

• Coarticulation:
– Even across word boundaries

• Levels: diphone, syllable, word, phrase
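The top-down search described above can be sketched roughly as follows. This is a simplified illustration, not the tutor's implementation: the `Target` tree, the `inventory` dictionary and the assignment of diphones to syllables are all assumptions made for the example, and the sketch looks up one node at a time rather than the longest sequence of consecutive targets.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Target:
    """One node of a (hypothetical) multi-tier target description."""
    level: str                      # "phrase", "word", "syllable" or "diphone"
    label: str
    children: List["Target"] = field(default_factory=list)


def select_units(target: Target,
                 inventory: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (level, label) pairs, preferring the largest available unit."""
    # Take the unit at this level if the database has a candidate for it.
    if target.label in inventory.get(target.level, set()):
        return [(target.level, target.label)]
    # Otherwise descend to the next lower level.
    if not target.children:
        raise LookupError(f"no candidate for {target.level} {target.label!r}")
    units: List[Tuple[str, str]] = []
    for child in target.children:
        units.extend(select_units(child, inventory))
    return units


# Assumed target tree for the word "oma" (diphone grouping is illustrative).
oma = Target("word", "oma", [
    Target("syllable", "o", [Target("diphone", "_-o")]),
    Target("syllable", "ma", [Target("diphone", "o-m"),
                              Target("diphone", "m-a")]),
])

# Assumed inventory: the word itself is missing, one syllable is present.
inventory = {"word": set(), "syllable": {"ma"},
             "diphone": {"_-o", "o-m", "m-a"}}

print(select_units(oma, inventory))
```

With this inventory the word "oma" is not found, so the search falls back to the syllable "ma" and, for the first syllable, to the diphone "_-o".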


[Figure: synthesizer architecture. Front-end modules (order not recoverable from the transcript): Tokenisation, Text Normalisation, Part of speech, Word Pronunciation, Phrase and Pause Prediction, Silence Insertion, Word Accent, ToDI Intonation. Back-end modules: Unit Selection and Unit Concatenation, using the speech database (Speech DB). Example input: “Als het flink vriest, kunnen we schaatsen.” (“If it freezes hard, we can go skating.”)]


• Target prosody is described symbolically

• Best sequence of units is selected
– Weighted sum of target and join costs
– Viterbi search

• Joins:
– Costs based on spectrum, pitch, energy, duration and adjacency
– PSOLA-based algorithm with optimal coupling
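As an illustration of selecting the best unit sequence with a weighted sum of target and join costs, here is a minimal Viterbi sketch. The cost functions, the weights `w_target` and `w_join`, and the toy candidate tuples are assumptions for the example; the slides do not give the actual cost definitions.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, int, float]   # (label, position in database, pitch in Hz)


def viterbi_select(candidates: Sequence[Sequence[Unit]],
                   target_cost: Callable[[int, Unit], float],
                   join_cost: Callable[[Unit, Unit], float],
                   w_target: float = 1.0,
                   w_join: float = 1.0) -> Tuple[List[Unit], float]:
    """Return the cheapest unit sequence and its total cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")
    # best[t][k]: lowest cost of any path ending in candidates[t][k]
    best = [[w_target * target_cost(0, u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row, pointers = [], []
        for unit in candidates[t]:
            costs = [best[t - 1][k] + w_join * join_cost(prev, unit)
                     for k, prev in enumerate(candidates[t - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + w_target * target_cost(t, unit))
            pointers.append(k_best)
        best.append(row)
        back.append(pointers)
    # Trace back from the cheapest final state.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path: List[Unit] = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][k])
        k = back[t][k]
    path.reverse()
    return path, total


def toy_target_cost(t: int, unit: Unit) -> float:
    return 0.0                   # assume every candidate fits its target


def toy_join_cost(prev: Unit, unit: Unit) -> float:
    # Units adjacent in the database join for free; otherwise penalise
    # the join and the pitch mismatch (illustrative numbers only).
    if unit[1] == prev[1] + 1:
        return 0.0
    return 1.0 + abs(prev[2] - unit[2]) / 100.0


candidates = [
    [("_-o", 10, 200.0), ("_-o", 50, 180.0)],
    [("o-m", 51, 182.0), ("o-m", 90, 210.0)],
    [("m-a", 52, 185.0), ("m-a", 91, 205.0)],
]
path, total = viterbi_select(candidates, toy_target_cost, toy_join_cost)
print([u[1] for u in path], total)
```

In this toy setup the search prefers the three units that are adjacent in the database (positions 50, 51, 52), which reflects the idea that longer contiguous chunks avoid joins.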

Target costs by level

• Segment
– Phonemic identity*
– Pause type (if silence)*
– Position in syllable

• Syllable
– Phoneme sequence
– Lexical stress*
– ToDI accent*
– Is_accented*
– Onset and coda type*
– Onset, nucleus and coda size*
– Distance to next/previous stressed syllable, in syllables
– Number of stressed syllables until next/previous phrase break
– Distance to next/previous accented syllable, in syllables
– Number of accented syllables until next/previous phrase break

• Word
– Position in phrase
– Part of speech*
– Is_content_word*
– Has_accented_syllable(s)*
– Is_capitalized*
– Position in phrase*
– Token punctuation*
– Token prepunctuation*
– Number of words until next/previous phrase break
– Number of content words until next/previous phrase break

Features marked with * are also calculated for the neighboring segments, syllables or words. “Neighboring syllables” are restricted to the syllables of the current word. For segments and words, three neighbors on the left and three on the right are taken into account.
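One way to picture a symbolic target cost with neighbor context is a weighted count of feature mismatches, extended to three neighbors on each side for the starred features. The sketch below works at the segment level only; the weights, the `context_weight` discount and the feature dictionaries are hypothetical and are not taken from the slides.

```python
from typing import Dict, List, Set

Features = Dict[str, str]


def mismatch_cost(target: Features, candidate: Features,
                  weights: Dict[str, float]) -> float:
    """Sum the weights of all features whose symbolic values differ."""
    return sum(w for name, w in weights.items()
               if target.get(name) != candidate.get(name))


def target_cost(targets: List[Features], t: int,
                database: List[Features], c: int,
                weights: Dict[str, float], starred: Set[str],
                span: int = 3, context_weight: float = 0.5) -> float:
    """Cost of using database segment c for target segment t."""
    cost = mismatch_cost(targets[t], database[c], weights)
    # Starred features are also compared on up to `span` neighbors per side
    # (the discount applied to neighbors is an assumption of this sketch).
    context_weights = {name: weights[name] * context_weight for name in starred}
    for offset in range(-span, span + 1):
        if offset == 0 or not 0 <= t + offset < len(targets):
            continue
        in_range = 0 <= c + offset < len(database)
        neighbor = database[c + offset] if in_range else {}
        cost += mismatch_cost(targets[t + offset], neighbor, context_weights)
    return cost


def seg(phoneme: str, position: str) -> Features:
    return {"phonemic_identity": phoneme, "position_in_syllable": position}


weights = {"phonemic_identity": 1.0, "position_in_syllable": 0.5}
starred = {"phonemic_identity"}
targets = [seg("o", "nucleus"), seg("m", "onset"), seg("a", "nucleus")]
database = [seg("k", "onset"), seg("o", "nucleus"), seg("m", "onset"),
            seg("a", "nucleus"), seg("m", "onset"), seg("i", "nucleus")]

print(target_cost(targets, 1, database, 2, weights, starred))
print(target_cost(targets, 1, database, 4, weights, starred))
```

Both candidates are an "m" in onset position, but only the first one sits in the same phonemic context as the target, so it receives the lower cost.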

Extra Modes?

• Phoneme-by-phoneme mode
– Stress

• Syllable mode

Extra Modes?

• Demonstration:
– Phoneme-by-phoneme
  • Stress on first phoneme
– Syllable
– Normal mode

Example words: Moeilijk (“difficult”), Koffiezetapparaat (“coffee maker”)

The Child is Too Slow…

• Choosing the appropriate reading speed for the child
– Uniform WSOLA time-scaling
– Insertion of additional silences between neighboring words

• Reading along
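The effect of the two slow-down mechanisms on word timing can be sketched as below. The function only computes when each word would start and end after uniform time-scaling and silence insertion; it does not process audio. The function name, the millisecond units and the example numbers are assumptions.

```python
from typing import List, Tuple


def slowed_word_timings(word_ms: List[int], scale: float,
                        extra_pause_ms: int) -> List[Tuple[int, int]]:
    """Return (start, end) times in ms for each word after slowing down.

    scale          -- uniform time-scale factor (1.5 makes words 50% longer)
    extra_pause_ms -- additional silence inserted between neighboring words
    """
    if scale <= 0 or extra_pause_ms < 0:
        raise ValueError("scale must be positive and the pause non-negative")
    timings: List[Tuple[int, int]] = []
    cursor = 0
    for index, duration in enumerate(word_ms):
        if index > 0:
            cursor += extra_pause_ms        # silence before every later word
        length = round(duration * scale)    # uniformly stretched word
        timings.append((cursor, cursor + length))
        cursor += length
    return timings


# Three words of 300, 200 and 500 ms, slowed by 1.5 with 100 ms extra pauses.
print(slowed_word_timings([300, 200, 500], 1.5, 100))
```

Timing information of this kind is what a reading-along display would need in order to highlight each word while it is being spoken.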

The Child is Too Slow…

[Figure: integration of the synthesizer into the reading tutor. Labels on the diagram: Synthesizer, Synthesis module, Playback module, Cygwin, Reading tutor, Teacher’s module, Assessment, Error detection, Tracking, Audio, Commands & Timing Info, Windows XP.]

How to Maximize Quality

• Major synthesis problems
– Join artifacts
– Inappropriate prosody

• Interactive tuning of synthesis
– Assisted by quality management
– User can make small changes to the input text

How to Maximize Quality

• Approach:
– For each word, calculate average target and join costs
– Predictor:

$$u_j = \sum_{i=1}^{n} \left( w_i\, c_{ij} + w_i^{tr}\, tr(\alpha_i, c_{ij}) \right)$$

  • $tr(\alpha, c)$: threshold based on max and min of cost $c$
  • $u_j$ usually lies between 0 and 1 because of training settings.
  • Accept if $u_j < 0.5$ and reject otherwise.
– Weights: linear regression
– Best alphas found iteratively (maximizing F-score)
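The accept/reject decision can be illustrated with a small sketch. The slides only say that the threshold term is based on the maximum and minimum of a cost, so the step function `tr` below is a hypothetical choice, as are all numeric weights; fitting the weights by linear regression and searching for the alphas are not shown. The sketch also assumes that the index i runs over the cost types of one word.

```python
from typing import Sequence


def tr(alpha: float, cost: float, c_min: float, c_max: float) -> float:
    """Hypothetical threshold term: 1.0 once the cost exceeds the point
    lying a fraction alpha of the way from c_min to c_max, else 0.0."""
    return 1.0 if cost > c_min + alpha * (c_max - c_min) else 0.0


def predict_u(costs: Sequence[float], w: Sequence[float],
              w_tr: Sequence[float], alphas: Sequence[float],
              c_mins: Sequence[float], c_maxs: Sequence[float]) -> float:
    """u = sum_i ( w_i * c_i + w_tr_i * tr(alpha_i, c_i) ) for one word."""
    return sum(w_i * c + wt_i * tr(a, c, lo, hi)
               for c, w_i, wt_i, a, lo, hi
               in zip(costs, w, w_tr, alphas, c_mins, c_maxs))


def accept(u: float) -> bool:
    """Accept the word if u < 0.5 and reject it otherwise."""
    return u < 0.5


# Two cost types per word: average target cost and average join cost.
w, w_tr = [0.2, 0.3], [0.3, 0.4]            # illustrative weights
alphas, c_mins, c_maxs = [0.5, 0.5], [0.0, 0.0], [1.0, 1.0]

good_word = predict_u([0.2, 0.1], w, w_tr, alphas, c_mins, c_maxs)
bad_word = predict_u([0.9, 0.8], w, w_tr, alphas, c_mins, c_maxs)
print(accept(good_word), accept(bad_word))
```

With these made-up numbers the low-cost word is accepted and the high-cost word is rejected, mirroring the 0.5 decision threshold on the slide.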

Other Special Aspects

• Phrase and Silence Prediction

• Context-dependent Weight Training


Phrase and Silence Prediction

• Types of pauses: heavy, medium and light
– Phrase breaks: both heavy and medium pauses

• Training
– No manual labeling, but based on the pauses automatically labeled in the speech database
– Iterative classification based on these pauses
– Training of memory-based learner (features such as POS, punctuation, ...)
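Memory-based learning is commonly realised as nearest-neighbour classification over stored examples, so the pause predictor can be pictured roughly as follows. The feature layout (POS of the current word, POS of the next word, punctuation), the stored examples and the extra "none" label are invented for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[str, ...], str]       # (symbolic features, pause label)


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of feature positions whose symbolic values differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_pause(memory: List[Example], features: Sequence[str],
                  k: int = 3) -> str:
    """Majority vote among the k stored examples closest to `features`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], features))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory: (POS, POS of next word, punctuation) -> pause type.
memory: List[Example] = [
    (("N", "V", ","), "medium"),
    (("N", "CONJ", ","), "medium"),
    (("V", "N", "."), "heavy"),
    (("N", "N", "."), "heavy"),
    (("ADJ", "N", ""), "none"),
    (("DET", "N", ""), "none"),
    (("V", "DET", ""), "light"),
]

print(predict_pause(memory, ("N", "V", ",")))
print(predict_pause(memory, ("ADJ", "N", ""), k=1))
```

In the system described on the slide, the stored examples would come from the pauses automatically labeled in the speech database rather than from a hand-written list.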


Context-dependent Weight Training

• Automatic adaptation (tuning) of weights

• Context-dependent weights
– Context is described symbolically per phone

• Training:
– Optimizing weights
– Clustering optimized weights (decision trees)


Context-dependent Weight Training

• 7 subjects

• 4 conditions
– Randomly selected corpus; context-dependent weights
– Randomly selected corpus; untrained weights
– Corpus selected based on word frequency; context-dependent weights
– Corpus selected based on word frequency; untrained weights

• 25 test utterances, AVI 1-5 (5 utterances per level)

• Results:

Condition                                                             MOS
Randomly selected corpus; context-dependent weights                   3.1
Randomly selected corpus; untrained weights                           3.1
Corpus selected based on word frequency; context-dependent weights    3.3
Corpus selected based on word frequency; untrained weights            3.0

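Demonstration

• Hierarchical unit selection:
– AVI 1: “Dit is te gek, gilt ze.” (“This is crazy, she shrieks.”)
– AVI 3: “Toch had hij liever de hond gehad.” (“Still, he would rather have had the dog.”)
– AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” (“Roel will be in the hospital for a few more days.”)
– AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” (“The small houses stand close together.”)
– AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!” (“Well Henk, now you can see that your mother is being wonderfully cared for here!”)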

WSOLA

[Figure: illustration of the WSOLA strategy. Top: original signal. Bottom: WSOLA time-scaling.]
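For readers without the figure, the general WSOLA idea (overlap-adding windowed input segments, each chosen near its nominal position so that it best continues the previously copied segment) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the similarity measure is a sum of absolute differences chosen for brevity, and the frame length and search tolerance are arbitrary defaults, not the tutor's settings.

```python
import math
from typing import List


def wsola(x: List[float], alpha: float, frame: int = 256,
          tolerance: int = 64) -> List[float]:
    """Time-scale x by alpha (alpha > 1 slows down), WSOLA style."""
    if alpha <= 0 or frame % 2 or len(x) < frame:
        raise ValueError("need alpha > 0, even frame and len(x) >= frame")
    hop = frame // 2
    # Periodic Hann window: copies shifted by `hop` sum to one.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame)
              for n in range(frame)]
    out_len = int(len(x) * alpha)
    y = [0.0] * (out_len + frame)
    last_start = len(x) - frame          # latest start of a full input frame
    prev = 0                             # start of the segment copied last
    for out_pos in range(0, out_len, hop):
        if out_pos == 0:
            pos = 0
        else:
            nominal = round(out_pos / alpha)         # where time-scaling points
            natural = min(prev + hop, last_start)    # natural continuation
            hi = min(last_start, nominal + tolerance)
            lo = min(max(0, nominal - tolerance), hi)
            # Pick the segment near `nominal` most similar to the natural
            # continuation of the previously copied segment.
            pos = min(range(lo, hi + 1),
                      key=lambda s: sum(abs(x[s + n] - x[natural + n])
                                        for n in range(frame)))
        for n in range(frame):
            y[out_pos + n] += window[n] * x[pos + n]
        prev = pos
    return y[:out_len]


signal = [math.sin(0.1 * n) for n in range(2000)]
slow = wsola(signal, 1.5, frame=128, tolerance=32)
print(len(signal), len(slow))
```

Because segments are copied rather than resampled, the output becomes longer while the local waveform shape, and therefore the pitch, is preserved.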

Other Application

• Audio-visual TTS
– Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.”
– Database containing about 20 minutes (LIPS Challenge ’08)
– For better audio quality, the database should be much larger

Future Work

• Optimizing synthesis
– User feedback

• Expressive speech synthesis
– Automated prosodic annotations

• Quality Management
– Evaluation & optimization of the algorithm
– Compare with the perceived quality of synthesized sentences (MOS)

Questions?

• Thank you for your attention.

• Acknowledgments:
– Prof. Wivine Decoster (our speaker)
– Jacques, Leen and other SPACE members
– Wesley and other DSSP people
– IWT


THE END
