Speech Synthesis in the SPACE Reading Tutor Closing Symposium of the SPACE Project 06 FEB 2009
Laboratory for Digital Speech and Audio Processing - DSSP
Speech Synthesis in the SPACE Reading Tutor
Closing Symposium of the SPACE Project
06 FEB 2009
Yuk On Kong, Lukas Latacz, Werner Verhelst
Laboratory for Digital Speech and Audio Processing
Vrije Universiteit Brussel
To Record or Not to Record: That’s the question.
• Pre-recorded speech in existing reading tutors
• Advantages / disadvantages?
Application-specific TTS
• Speaker / voice
• Material in speech corpus
• How to synthesize
• Any extra mode necessary?
• The child is too slow…
• How to maximize quality
Speaker / Voice
• Speaker
– Appealing to children
– Female speaker
– Standard Flemish pronunciation, no noticeable regional accent
– Experienced speaker
Material in Speech Corpus
• Database (about 6 hours)
– Material from stories for children
– Words expected at 6 years of age
– Diphones
How to synthesize
• Based on the general unit selection paradigm.
• Heterogeneous units: units could be of various sizes
• Bases:
– Use of longer chunks leads to quality improvement.
– Used for synthesizing domain-specific utterances.
Fig. Word “oma” to synthesize and its multi-tier segmentation: word (oma), syllables (o, ma) and segments (_-o, o-m, m-a).
How to synthesize
• Basic algorithm:
– Search top-down and select the longest sequence of targets at each level; go to lower levels if no candidates are found.
• Coarticulation:
– Even across word boundaries
• Levels: diphone, syllable, word, phrase
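The top-down search can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the project's implementation: each target is looked up individually (the slide speaks of the longest sequence of targets), the inventory is a plain set of labels per level, and the names `Target`, `select_units` and `inventory` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Target:
    level: str                      # "phrase", "word", "syllable" or "diphone"
    label: str
    children: List["Target"] = field(default_factory=list)


def select_units(target: Target, inventory: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Cover `target` with the largest available units, backing off top-down."""
    # Use the unit at the current level if the database has a candidate for it.
    if target.label in inventory.get(target.level, set()):
        return [(target.level, target.label)]
    # Otherwise go one level down and cover each child separately.
    if not target.children:
        raise LookupError(f"no candidate for {target.level} {target.label!r}")
    units: List[Tuple[str, str]] = []
    for child in target.children:
        units.extend(select_units(child, inventory))
    return units


# Word "oma"; the assignment of diphones to syllables is an assumption for illustration.
oma = Target("word", "oma", [
    Target("syllable", "o", [Target("diphone", "_-o")]),
    Target("syllable", "ma", [Target("diphone", "o-m"), Target("diphone", "m-a")]),
])
# Hypothetical inventory: the word is missing, one syllable and all diphones exist.
inventory = {"word": set(), "syllable": {"ma"}, "diphone": {"_-o", "o-m", "m-a"}}
selected = select_units(oma, inventory)
print(selected)
```

In this toy inventory the word “oma” is absent, so the search falls back to the syllable “ma” and, for the missing syllable “o”, to its diphone.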
How to synthesize
Fig. Synthesizer architecture, divided into a front-end and a back-end. Modules shown: Tokenisation, Text Normalisation, Part of speech, Word Pronunciation, Word Accent, Phrase and Pause Prediction, Silence Insertion, ToDI Intonation, Unit Selection, Speech DB, Unit Concatenation. Example input: “Als het flink vriest, kunnen we schaatsen.” (“If it freezes hard, we can go skating.”)
How to synthesize
• Target prosody is described symbolically
• Best sequence of units is selected
– Weighted sum of target and join costs
– Viterbi search
• Joins:
– Costs based on spectrum, pitch, energy, duration and adjacency
– PSOLA-based algorithm with optimal coupling
Target cost features per level:
• Segment
– Phonemic identity*
– Pause type (if silence)*
– Position in syllable
• Syllable
– Phoneme sequence
– Lexical stress*
– ToDI accent*
– Is_accented*
– Onset and coda type*
– Onset, nucleus and coda size*
– Distance to next/previous stressed syllable, in syllables
– Number of stressed syllables until next/previous phrase break
– Distance to next/previous accented syllable, in syllables
– Number of accented syllables until next/previous phrase break
• Word
– Position in phrase
– Part of speech*
– Is_content_word*
– Has_accented_syllable(s)*
– Is_capitalized*
– Position in phrase*
– Token punctuation*
– Token prepunctuation*
– Number of words until next/previous phrase break
– Number of content words until next/previous phrase break
Those with a * are also calculated for the neighboring segments, syllables or words. “Neighboring syllables” are restricted to the syllables of the current word. As for segments & words, three neighbors on the left and three on the right are taken into account.
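Selecting the best unit sequence as a weighted sum of target and join costs with a Viterbi search can be sketched as follows. This is a generic dynamic-programming illustration, not the project's code: the function name `viterbi_select`, the tuple layout of a unit and both toy cost functions are assumptions (the real join cost also uses spectrum, energy and duration).

```python
def viterbi_select(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """Return (best unit sequence, total cost) over per-position candidate lists."""
    if not candidates or any(not c for c in candidates):
        raise ValueError("every target position needs at least one candidate")
    # cost[k]: cheapest cost of any path ending in candidate k of the current position
    cost = [w_target * target_cost(0, u) for u in candidates[0]]
    back = []  # back[t-1][k]: best predecessor index for candidate k at position t
    for t in range(1, len(candidates)):
        prev = candidates[t - 1]
        new_cost, pointers = [], []
        for u in candidates[t]:
            best_k = min(range(len(prev)),
                         key=lambda k: cost[k] + w_join * join_cost(prev[k], u))
            new_cost.append(cost[best_k] + w_join * join_cost(prev[best_k], u)
                            + w_target * target_cost(t, u))
            pointers.append(best_k)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(cost)), key=cost.__getitem__)
    total = cost[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)], total


# Toy units: (label, position in database, pitch in Hz). All values are made up.
candidates = [
    [("a", 10, 100.0), ("a", 50, 120.0)],
    [("b", 11, 105.0), ("b", 70, 120.0)],
]
target_pitch = [120.0, 120.0]


def toy_target_cost(t, unit):
    return abs(unit[2] - target_pitch[t]) / 10


def toy_join_cost(left, right):
    # Units adjacent in the database join for free; others pay a penalty plus pitch mismatch.
    return 0.0 if right[1] == left[1] + 1 else 1.0 + abs(right[2] - left[2]) / 10


best_units, best_total = viterbi_select(candidates, toy_target_cost, toy_join_cost)
print(best_units, best_total)
```

With equal weights, the pair that matches the target pitch wins here even though the other pair is adjacent in the database; raising `w_join` shifts the balance towards adjacent units.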
Extra Modes?
• Phoneme-by-phoneme mode
– Stress
• Syllable mode
Extra Modes?
• Demonstration:
– Phoneme-by-phoneme (stress on first phoneme)
– Syllable
– Normal mode
• Demonstration words: “moeilijk” (difficult), “koffiezetapparaat” (coffee maker)
The Child is Too Slow…
• Choosing the appropriate reading speed for the child
– Uniform WSOLA time-scaling
– Insertion of additional silences between neighboring words
• Reading along
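Inserting additional silences between neighboring words can be sketched as below. The sketch assumes each synthesized word is available as a list of samples and also returns per-word start and end times, which a reading-along display could use; the function name `insert_silences`, the data layout and the default sample rate are assumptions, not details from the project.

```python
def insert_silences(words, pause_s, sample_rate=16000):
    """Concatenate word waveforms with `pause_s` seconds of silence between them.

    words: list of (text, samples) pairs, samples being a list of floats.
    Returns (audio, timing) where timing holds (text, start_s, end_s) per word.
    """
    silence = [0.0] * int(round(pause_s * sample_rate))
    audio, timing = [], []
    for i, (text, samples) in enumerate(words):
        if i > 0:
            audio.extend(silence)          # extra pause before every word but the first
        start = len(audio) / sample_rate
        audio.extend(samples)
        timing.append((text, start, len(audio) / sample_rate))
    return audio, timing


# Tiny made-up example at 4 samples per second to keep the numbers readable.
words = [("als", [0.1] * 4), ("het", [0.2] * 2)]
audio, timing = insert_silences(words, pause_s=0.5, sample_rate=4)
print(len(audio), timing)
```

A longer `pause_s` slows the overall reading rate without altering the words themselves, which is the complement of uniform time-scaling.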
The Child is Too Slow…
Fig. System diagram. Labels shown: Synthesizer (synthesis module, playback module; Cygwin), Reading tutor, Teacher’s module, Assessment, Error detection, Tracking; links labelled “Commands & Timing Info” and “Audio”; platform: Windows XP.
How to Maximize Quality
• Major synthesis problems
– Join artifacts
– Inappropriate prosody
• Interactive tuning of synthesis
– Assisted by quality management
– User can make small changes to the input text
How to Maximize Quality
• Approach:
– For each word, calculate average target and join costs
– Predictor: $u_j = \sum_{i=1}^{n} \left( w_i c_{ij} + w_i^{tr}\, tr(c_{ij}, \alpha_i) \right)$
• $tr(c, \alpha)$: threshold based on max and min of cost $c$
• $u_j$ usually lies between 0 and 1 because of training settings.
• Accept if $u_j < 0.5$ and reject otherwise.
– Weights: linear regression
– Best alphas found iteratively (maximizing F-score)
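The accept/reject decision can be sketched in a few lines. The sketch reads the predictor as a weighted sum, per cost, of the cost itself and a thresholded version of it. The slide only says the threshold is based on the max and min of the cost, so the concrete form of `tr` below (1 when the cost lies above a fraction alpha of its range) is an assumption, as are all numeric values and the name `predict`.

```python
def tr(c, c_min, c_max, alpha):
    # ASSUMPTION: step threshold placed a fraction `alpha` of the way from min to max.
    return 1.0 if c > c_min + alpha * (c_max - c_min) else 0.0


def predict(costs, w, w_tr, alphas, ranges):
    """Return (u_j, accepted) for one word from its average costs.

    costs:  average target cost, average join cost, ... for the word
    w, w_tr: regression weights for the raw and the thresholded cost
    ranges: (min, max) of each cost, as observed in training
    """
    u = sum(
        w_i * c + w_tr_i * tr(c, lo, hi, a)
        for c, w_i, w_tr_i, a, (lo, hi) in zip(costs, w, w_tr, alphas, ranges)
    )
    return u, u < 0.5   # accept if u_j < 0.5, reject otherwise


# Made-up weights and cost ranges, for illustration only.
w, w_tr = [0.3, 0.2], [0.2, 0.4]
alphas, ranges = [0.5, 0.5], [(0.0, 1.0), (0.0, 1.0)]
u_bad, ok_bad = predict([0.2, 0.6], w, w_tr, alphas, ranges)    # high join cost
u_good, ok_good = predict([0.2, 0.4], w, w_tr, alphas, ranges)  # moderate join cost
print(u_bad, ok_bad, u_good, ok_good)
```

In this toy setting the first word is rejected because its join cost crosses the threshold, while the second stays well below 0.5 and is accepted.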
Other Special Aspects
• Phrase and Silence Prediction
• Context-dependent Weight Training
Phrase and Silence Prediction
• Types of pauses: heavy, medium and light
– Phrase breaks: both heavy and medium pauses
• Training
– No manual labeling, but based on the pauses automatically labeled in the speech database
– Iterative classification based on these pauses
– Training of memory-based learner (features such as POS, punctuation, ...)
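A memory-based learner stores its training instances and classifies a new one by its nearest stored neighbours. The sketch below shows that idea with an overlap distance on symbolic features. The feature set (POS of the word, POS of the next word, punctuation after the word), the stored examples and the extra label "none" are all made up for illustration; the actual learner and features used in the project are not specified beyond the slide.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, memory, k=3):
    """Majority label among the k stored instances closest to `instance`."""
    neighbours = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical memory: (POS, POS of next word, punctuation) -> pause type after the word.
memory = [
    (("V", "PRON", ","), "medium"),
    (("N", "V", ","), "medium"),
    (("V", "N", "."), "heavy"),
    (("N", "N", "."), "heavy"),
    (("DET", "N", ""), "none"),
    (("ADJ", "N", ""), "none"),
    (("N", "V", ""), "light"),
]
# A verb followed by a comma and another verb, as after "vriest," in the example sentence.
pause = classify(("V", "V", ","), memory, k=3)
print(pause)
```

Because the labels come from pauses found automatically in the speech database, such a memory can be filled without manual annotation.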
Context-dependent Weight Training
• Automatic adaptation (tuning) of weights
• Context-dependent weights
– Context is described symbolically per phone
• Training:
– Optimizing weights
– Clustering optimized weights (decision trees)
Context-dependent Weight Training
• 7 subjects
• 4 conditions
– Randomly selected corpus; context-dependent weights
– Randomly selected corpus; untrained weights
– Corpus selected based on word frequency; context-dependent weights
– Corpus selected based on word frequency; untrained weights
• 25 test utterances, AVI 1-5 (5 utterances per level)
• Results:
Condition                                                             MOS
Randomly selected corpus; context-dependent weights                   3.1
Randomly selected corpus; untrained weights                           3.1
Corpus selected based on word frequency; context-dependent weights    3.3
Corpus selected based on word frequency; untrained weights            3.0
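Demonstration
• Hierarchical unit selection:
– AVI 1: “Dit is te gek, gilt ze.” (“This is crazy, she shrieks.”)
– AVI 3: “Toch had hij liever de hond gehad.” (“Still, he would rather have had the dog.”)
– AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” (“Roel will be in hospital for a few more days.”)
– AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” (“The small houses stand close against one another.”)
– AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!” (“Well Henk, now you see that your mother is being wonderfully cared for here!”)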
WSOLA
Fig. Illustration of the WSOLA strategy. Top: original signal. Bottom: WSOLA time-scaling.
Other Application
• Audio-visual TTS
– Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.”
– Database containing about 20 minutes (LIPS Challenge ’08)
– For better audio quality, the database should be much larger
Future Work
• Optimizing synthesis
– User feedback
• Expressive speech synthesis
– Automated prosodic annotations
• Quality Management
– Evaluation & optimization of the algorithm
– Compare with the perceived quality of synthesized sentences (MOS)
Questions?
• Thank you for your attention.
• Acknowledgments:
– Prof. Wivine Decoster (our speaker)
– Jacques, Leen and other SPACE members
– Wesley and other DSSP people
– IWT