Fundamentals & recent advances in HMM-based speech synthesis
Heiga ZEN
Toshiba Research Europe Ltd.
Cambridge Research Laboratory
FALA2010, Vigo, Spain
November 10th, 2010
Text-to-speech synthesis (TTS): Text (sequence of discrete symbols) → Speech (continuous time series)
Automatic speech recognition (ASR): Speech (continuous time series) → Text (sequence of discrete symbols)
Machine translation (MT): Text (sequence of discrete symbols) → Text (sequence of discrete symbols)
2
Text-to-speech as a mapping problem
Dobré ráno Good morning
Good morning
Good morning
text (concept)
speech
[Figure: speech production — air flow; sound source (voiced: pulse, unvoiced: noise); frequency transfer characteristics; fundamental frequency; magnitude start-end. Speech is modulation of a carrier wave by speech information: fundamental freq., voiced/unvoiced, freq. transfer char.]
3
Speech production process
Grapheme-to-Phoneme
Part-of-Speech tagging
Text normalization
Pause prediction, Prosody prediction
Waveform generation
4
Flow of text-to-speech synthesis
TEXT
Text analysis
SYNTHESIZED SPEECH
Speech synthesis
This talk focuses on the speech synthesis part of TTS
Rule-based, formant synthesis (~’90s)
– Based on parametric representation of speech
– Hand-crafted rules to control phonetic unit
DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]
5
Speech synthesis methods (1)
Block diagram of KlattTalk
Corpus-based, concatenative synthesis (’90s~)
– Concatenate small speech units (e.g., phone) from a database
– Large data + automatic learning → high-quality synthetic voices
Single inventory: diphone synthesis [Moulines;'90]
Multiple inventory: unit selection synthesis [Sagisaka;'92, Black;'96]
6
Speech synthesis methods (2)
Corpus-based, statistical parametric synthesis (mid ’90s~)
– Large data + automatic training → automatic voice building
– Source-filter model + statistical modeling → flexible to change voice characteristics
Hidden Markov models (HMMs) as its statistical acoustic model
HMM-based speech synthesis (HTS) [Yoshimura;‘02]
7
Speech synthesis methods (3)
Parameter generation
Model training
Feature extraction
Waveform generation
This talk gives the basic implementations & some recent work in HMM-based speech synthesis research
8
Popularity of HMM-based speech synthesis
Outline
9
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Mathematical formulation
Statistical parametric speech synthesis
– Training: estimate model parameters λ̂ = arg maxλ p(O | W, λ) from speech data O and transcriptions W
– Synthesis: generate speech parameters ô = arg maxo p(o | w, λ̂) for a given text w
Using HMM as generative model
HMM-based speech synthesis (HTS) [Yoshimura;’02]
10
[Block diagram: Training part — speech database → excitation parameter extraction & spectral parameter extraction → excitation parameters, spectral parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation; spectral parameters → synthesis filter → SYNTHESIZED SPEECH.]
HMM-based speech synthesis system (HTS)
11
[Speech production figure repeated.]
Speech production process
14
Divide speech into frames
15
Speech is a non-stationary signal
… but can be assumed to be quasi-stationary
Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
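As a rough illustration, this framing step can be sketched as follows (a minimal numpy sketch; the 25 ms length and 5 ms shift follow the slide, while the 16 kHz sampling rate and function name are assumptions):

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=25, shift_ms=5):
    """Split a 1-D signal into overlapping short-time frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 400 samples
    shift = int(sample_rate * shift_ms / 1000)       # 5 ms  -> 80 samples
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

# one second of a dummy signal -> 196 frames of 400 samples each
frames = frame_signal(np.zeros(16000))
```

Each frame is then treated as quasi-stationary and analyzed independently.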
lineartime-invariant
system
excitationpulse train(voiced)
white noise(unvoiced)
speech
Excitation (source) part Spectral (filter) part
Fourier transform
Source-filter model
16
Parametric models of the speech spectrum
Autoregressive (AR) model Exponential (EX) model
ML estimation of spectral model parameters
– AR model → linear prediction (LP) [Itakura;'70]
– EX model → ML-based cepstral analysis
Spectral (filter) model
17
Excitation (source) model
excitationpulse train
white noise
18
Excitation model: pulse/noise excitation
– Voiced (periodic): pulse trains
– Unvoiced (aperiodic): white noise
Excitation model parameters
– V/UV decision
– Fundamental frequency (F0) for voiced frames
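A minimal sketch of this pulse/noise excitation scheme (the function name, frame shift, and noise scale are illustrative assumptions; real vocoders place pulses with sub-sample precision):

```python
import numpy as np

def generate_excitation(f0, vuv, sample_rate=16000, shift=80):
    """Pulse train for voiced frames, white noise for unvoiced frames."""
    e = np.zeros(len(f0) * shift)
    next_pulse = 0.0
    for t, (f, v) in enumerate(zip(f0, vuv)):
        start = t * shift
        if v:
            period = sample_rate / f          # pulse spacing in samples
            while next_pulse < start + shift:
                if next_pulse >= start:
                    e[int(next_pulse)] = 1.0  # place a unit pulse
                next_pulse += period
        else:
            e[start:start + shift] = 0.1 * np.random.randn(shift)
            next_pulse = start + shift        # restart pulses after noise
    return e
```

For example, ten voiced frames at F0 = 100 Hz give pulses spaced 160 samples apart.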
Speech samples
19
Natural speech
Reconstructed speech from extracted parameters (cepstral coefficients & F0 with V/UV decisions)
Quality degrades, but main characteristics are preserved
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Excitation
Spectralparameters
Speech signal Training part
Synthesis part
Training HMMs
Spectralparameters
Excitationparameters
Labels
HMM-based speech synthesis system (HTS)
20
Structure of state-output (observation) vector
Spectrum part
Excitation part
Spectral parameters(e.g., cepstrum, LSPs)
log F0 with V/UV
21
Dynamic features
22
HMM-based modeling
[Figure: HMM with states 1…N generating an observation sequence; state sequence e.g. 1 1 2 3 3 … N]
23
Sentence HMM
Transcription: sil a i sil (3 states per phoneme model)
Stream 1: spectrum (cepstrum or LSP, & dynamic features)
Streams 2-4: excitation (log F0 & dynamic features)
Multi-stream HMM structure
24
[Figure: log F0 contour over time; F0 is undefined in unvoiced regions.]
F0 is a continuous value in voiced regions but undefined in unvoiced regions → unable to model with a purely continuous or purely discrete distribution
Observation of F0
25
[Figure: 3-state HMM; each state output covers voiced and unvoiced spaces with voiced/unvoiced weights.]
Multi-space probability distribution (MSD)
26
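The idea behind the MSD can be sketched as follows: a discrete weight covers the zero-dimensional unvoiced space, and a weighted Gaussian density covers the voiced (log F0) space. This is an illustrative simplification, not the HTS implementation:

```python
import math

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is None for an unvoiced frame, a log-F0 value for a voiced one."""
    if obs is None:                # zero-dimensional unvoiced space
        return 1.0 - w_voiced
    gauss = math.exp(-(obs - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss        # voiced space: weighted Gaussian density
```

Both branches are probabilities under one distribution, so the voiced weight and the Gaussian parameters can be reestimated jointly by EM.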
Stream 1 (spectral params): single Gaussian
Streams 2-4 (log F0): MSD (Gaussian & discrete)
Structure of state-output distributions
27
data & labels → compute variance floor → initialize CI-HMMs by segmental k-means → reestimate CI-HMMs by EM algorithm (monophone; context-independent, CI) → copy CI-HMMs to CD-HMMs (full context; context-dependent, CD) → reestimate CD-HMMs by EM algorithm → decision tree-based clustering → reestimate CD-HMMs by EM algorithm → untie parameter tying structure → estimate CD duration models from forward-backward stats → decision tree-based clustering → estimated duration models & estimated HMMs
Training process
29
HMM-based modeling
[Figure repeated from slide 23.]
30
Sentence HMM
Transcription: sil sil-a+i a-i+sil sil (3 states per model)
Phoneme: current phoneme; {preceding, succeeding} two phonemes
Syllable: # of phonemes in {preceding, current, succeeding} syllable; {accent, stress} of {preceding, current, succeeding} syllable; position of current syllable in current word; # of {preceding, succeeding} {accented, stressed} syllables in current phrase; # of syllables {from previous, to next} {accented, stressed} syllable; vowel within current syllable
Word: part of speech of {preceding, current, succeeding} word; # of syllables in {preceding, current, succeeding} word; position of current word in current phrase; # of {preceding, succeeding} content words in current phrase; # of words {from previous, to next} content word
Phrase: # of syllables in {preceding, current, succeeding} phrase
…..
Huge # of combinations ⇒ difficult to have models for all of them
Context-dependent modeling
31
[Training process flowchart repeated.]
Training process
32
[Figure: decision tree over full-context labels (k-a+b/A:…, t-e+h/A:…, w-a+t/A:…, w-i+sil/A:…, w-o+sh/A:…, gy-e+sil/A:…, gy-a+pau/A:…, g-u+pau/A:…). Internal nodes ask yes/no questions — R=silence?, L="w"?, L="gy"?, C=voiced? — and leaf nodes hold the clustered (synthesized) states.]
Decision tree-based context clustering [Odell;’95]
33
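Looking up the tied state for an unseen full-context label is then a simple tree walk. The toy tree below mirrors the questions on the slide but is hypothetical in shape; real trees are grown greedily by maximizing a likelihood gain criterion [Odell;'95]:

```python
# A label here is the triple (left phoneme, centre phoneme, right phoneme)
# of a context-dependent model; leaves name the tied (shared) states.
tree = ("R", {"sil"},                        # question: R = silence?
        ("L", {"w"}, "leaf_1", "leaf_2"),    # yes branch: L = "w"?
        ("C", {"a", "e", "i", "o", "u"},     # no branch: C = voiced? (toy set)
         "leaf_3", "leaf_4"))

def find_leaf(node, label):
    """Walk yes/no questions until a leaf (tied state) is reached."""
    if isinstance(node, str):                # leaf node
        return node
    key, values, yes, no = node
    pos = {"L": 0, "C": 1, "R": 2}[key]
    return find_leaf(yes if label[pos] in values else no, label)
```

For example, the label ("w", "a", "sil") descends yes/yes to leaf_1; this is how unseen contexts are mapped to trained parameters at synthesis time.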
Spectrum & excitation have different context dependency → build decision trees separately
Decision trees for mel-cepstrum; decision trees for F0
Stream-dependent clustering
34
[Training process flowchart repeated.]
Training process
35
[Figure: occupancy statistics over frames t0…t1 of state i in an utterance of T frames.]
Estimation of state duration models [Yoshimura;’98]
36
Decision trees for mel-cepstrum and for F0 (HMM streams); a separate decision tree for state duration models
Stream-dependent clustering
37
[Training process flowchart repeated.]
Training process
38
[HTS block diagram repeated.]
HMM-based speech synthesis system (HTS)
39
Composition of the sentence HMM for a given text
TEXT
Text analysis
context-dependent label sequence
G2P
POS tagging
Text normalization
Pause prediction
Compose the sentence HMM from the given labels; this sentence HMM gives p(o | w, λ̂)
Speech parameter generation algorithm
41
[Figure: HMM states 1 2 3; state durations 4, 10, 5 yield state sequence 1 1 1 1 2 2 … 3 3 over the observation sequence.]
Determine the state sequence by determining state durations
Determination of state sequence (1)
42
Determination of state sequence (2)
43
[Figure: state-duration probability vs. duration (1-8), geometric vs. Gaussian duration models.]
Determination of state sequence (3)
44
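The contrast in that figure can be sketched numerically: an HMM self-transition probability a implies a geometric duration distribution P(d) = a^(d−1)(1 − a), which always peaks at d = 1, while an explicit Gaussian duration model can peak at a typical duration. The parameter values below are illustrative assumptions:

```python
import math

def geometric_dur(d, a=0.6):
    """Duration mass implied by self-transition probability a."""
    return (a ** (d - 1)) * (1 - a)

def gaussian_dur(d, mean=4.0, var=2.0):
    """Explicit Gaussian duration density, peaking at the typical duration."""
    return math.exp(-(d - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

The geometric mass decays monotonically from d = 1, whereas the Gaussian peaks at d = 4 here, which better matches observed phone durations.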
45
Speech parameter generation algorithm
Without dynamic features, generation yields step-wise trajectories that simply follow the state mean values (with state variances)
46
Speech parameter vectors include both static & dynamic features: o_t = [c_t⊤, Δc_t⊤]⊤, where Δc_t = (c_{t+1} − c_{t−1}) / 2
The relationship between o and c can be arranged in matrix form as o = W c, where W stacks the static and delta windows
Integration of dynamic features
47
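A toy 1-D sketch of the resulting generation problem: with o = W c and Gaussian statistics (μ, Σ), the maximum-likelihood trajectory is ĉ = (W⊤Σ⁻¹W)⁻¹W⊤Σ⁻¹μ. The 4-frame means and identity precision below are illustrative assumptions:

```python
import numpy as np

# Each observation is [static c_t, delta (c_{t+1} - c_{t-1}) / 2].
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                         # static window row
    if t + 1 < T:
        W[2 * t + 1, t + 1] = 0.5             # delta window row
    if t - 1 >= 0:
        W[2 * t + 1, t - 1] = -0.5

# Per-frame means [c, delta-c] interleaved; identity precision for simplicity.
mu = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 1.0, 0.0])
prec = np.eye(2 * T)                          # Sigma^{-1}
A = W.T @ prec @ W
c_hat = np.linalg.solve(A, W.T @ prec @ mu)   # smooth ML trajectory
```

Because the delta rows couple adjacent frames, the solution rises smoothly from the low statics to the high ones instead of jumping step-wise.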
48
Speech parameter generation algorithm
Solution: maximizing N(W c; μ, Σ) with respect to c gives ĉ = (W⊤ Σ⁻¹ W)⁻¹ W⊤ Σ⁻¹ μ
49
[Figure: means and variances of static and dynamic features, and the generated speech parameter trajectory.]
50
[Figure: spectrograms (0-5 kHz) of generated spectra for "sil a i sil".]
Generated spectra
51
w/o dynamic features
w/ dynamic features
[HTS block diagram repeated.]
HMM-based speech synthesis system (HTS)
52
Linear time-invariant system: excitation (pulse train / white noise) → synthesized speech
Generated excitation parameters (log F0 with V/UV); generated spectral parameters (cepstrum, LSP)
Source-filter model
53
filtering
Speech samples
55
w/o dynamic features
w/ dynamic features
Use of dynamic features can reduce discontinuity
Outline
56
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Adaptation/adaptive training of HMMs
– Originally developed for ASR, but works very well in TTS
– Average voice model (AVM)-based speech synthesis [Yamagishi;'06]
– Requires only a small amount of target speaker/speaking-style data
More efficient creation of new voices
[Diagram: training speakers → adaptive training → average-voice model → adaptation → target speakers.]
Adaptation (mimic voice)
57
Speaker adaptation
– VIP voice
– Child voice
Style (expressiveness) adaptation (in Japanese)
– Joyful
– Sad
– Rough
Difficult to collect a large amount of such data → the adaptation framework gives a way to build TTS from a small database
58
Adaptation demo
Some samples are from http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Interpolate parameters among representative HMM sets
– Can obtain new voices even when no adaptation data is available
– Gradually change speakers & speaking styles [Yoshimura;'97, Tachibana;'05]
Interpolation (mixing voice)
59
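One simple way to realize such interpolation, sketched under the assumption that the HMM sets share a common tying structure, is to interpolate the Gaussian parameters of corresponding states (linear interpolation of variances is a common simplification; the cited work interpolates whole model sets):

```python
import numpy as np

def interpolate_gaussian(mean_a, var_a, mean_b, var_b, w):
    """Linearly interpolate two tied-state Gaussians with weight w in [0, 1]."""
    mean = (1 - w) * mean_a + w * mean_b
    var = (1 - w) * var_a + w * var_b     # simplification for illustration
    return mean, var
```

Sweeping w from 0 to 1 morphs the synthetic voice gradually from speaker/style A to B, with no adaptation data needed.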
Speaker interpolation (in Japanese)
– Male ↔ Female
Style interpolation
– Neutral ↔ Angry
– Neutral ↔ Happy
60
Interpolation demo
Samples are from http://www.sp.nitech.ac.jp/& http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Outline
63
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Drawbacks of HTS
Quality of synthesized speech
– Buzzy
– Flat
– Muffled
Three major factors degrade the quality
– Poor vocoding: how to parameterize speech?
– Inaccurate acoustic modeling: how to model extracted speech parameter trajectories?
– Over-smoothing: how to recover generated speech parameter trajectories?
Recent work has drastically improved the quality of HTS
64
[Figure: excitation e(n) — pulse train (voiced) / white noise (unvoiced).]
Problem in vocoding
Binary pulse-or-noise excitation
– Voiced (periodic): pulse trains
– Unvoiced (aperiodic): white noise
Difficult to model mixtures of voiced & unvoiced sounds (e.g., voiced fricatives) → results in buzziness in synthesized speech
65
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
[Figure: mean and variance of c(2) are piece-wise constant across the states of /sil/ /a/ /i/ /sil/.]
Problem in acoustic modeling
66
[Figure: states q1 q2 q3 emitting observations o1 o2 o3.]
Problem in acoustic modeling
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
– Frame-wise conditional independence assumption: state output probability depends only on the current state
67
[Figure: geometric state-duration probability decreasing with duration (1-8).]
68
69
Problem in acoustic modeling
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
– Frame-wise conditional independence assumption: state output probability depends only on the current state
– Weak duration modeling: state duration probability decreases exponentially with time
None of them hold for real speech data
Speech parameters are generated from acoustic models
Better acoustic model may produce better speech
[Figure: spectrograms (0-8 kHz), natural vs. generated speech.]
70
Over-smoothing problem
Speech parameter generation algorithm
– Dynamic features make generated parameters smooth
– Sometimes too smooth → muffled speech
Why?
– Fine structure of the spectra is removed by the averaging effect
– Model underfitting
71
Nitech-HTS 2005 for Blizzard Challenge
Blizzard Challenge
– Open evaluation of TTS
– Common database
– Held since 2005
Nitech-HTS 2005
– Achieved the best naturalness & intelligibility
– Still used as a calibration system in the Blizzard Challenge
– Includes various enhancements
Advanced vocoding
High-quality vocoder STRAIGHT
Better acoustic modeling
Hidden semi-Markov model (HSMM)
Oversmoothing compensation
Speech parameter generation algorithm considering GV
[Diagram: STRAIGHT. Analysis: waveform → fixed-point analysis → F0 extraction; F0-adaptive spectral smoothing in the time-frequency region → F0, smoothed spectrum, aperiodic factors. Synthesis: mixed excitation with phase manipulation → synthetic waveform.]
STRAIGHT analysis
72
[Figure: power (dB) vs. frequency (0-8 kHz) — FFT power spectrum, FFT + mel-cepstral analysis, STRAIGHT + mel-cepstral analysis.]
Mel-cepstrum from STRAIGHT spectrum
73
Examples of generated excitation signals
pulse/noise
STRAIGHT
75
76
HSMM-based modeling
Hidden semi-Markov model (HSMM)
– Can be viewed as an HMM with explicit state-duration probability distributions
– State-output, state-duration, & state-transition probability distributions can be estimated simultaneously
Generated trajectory has smaller GV than the natural one
[Figure: 2nd mel-cepstral coefficient over time (0-3 sec), natural vs. generated.]
Parameter generation considering GV
77
Global variance (GV): intra-utterance variance of the speech parameters
Objective function of standard parameter generation
Objective function of parameter generation with GV
– 1st term is the same as the standard one
– 2nd term works as a penalty against oversmoothing
78
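The GV term can be sketched as follows: compute the intra-utterance variance of the trajectory and penalize its deviation from a Gaussian GV model (additive constants omitted; in practice μ_gv and σ²_gv are trained on natural speech, and the names here are illustrative):

```python
import numpy as np

def global_variance(c):
    """Intra-utterance variance of a (T x D) speech-parameter trajectory."""
    return np.mean((c - np.mean(c, axis=0)) ** 2, axis=0)

def gv_log_penalty(c, mu_gv, var_gv):
    """Gaussian log-penalty on the GV (constants dropped)."""
    gv = global_variance(c)
    return float(np.sum(-0.5 * (gv - mu_gv) ** 2 / var_gv))
```

An over-smoothed trajectory has a GV well below μ_gv and thus a large penalty, so maximizing the combined objective pushes the generated trajectory back toward natural variability.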
Parameter generation considering GV
[Bar chart: MOS (scale 1-5) for eight systems combining 24th/39th-order analysis, mel-cepstrum (mcep.) or STRAIGHT (ST.), HMM or HSMM, with and without GV. Scores: 2.0, 3.1, 2.0, 3.2, 3.2, 3.1, 4.7, 4.7; the GV-based STRAIGHT HSMM systems score highest (4.7).]
Evaluations
80
Recent work to improve quality
Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
Oversmoothing
– Pre- & postfiltering
– Improvements of GV
– Hybrid approaches
& more…
81
Other challenging topics
Keynote speech by Simon King (CSTR) in SSW7 this year
Speech synthesis is easy, if ...
• voice is built offline & carefully checked for errors
• speech is recorded in clean conditions
• word transcriptions are correct
• accurate phonetic labels are available or can be obtained
• speech is in the required language & speaking style
• speech is from a suitable speaker
• a native speaker is available, preferably a linguist
Speech synthesis is not easy if we don't have the right data
82
Other challenging topics
Non-professional speakers
• AVM + adaptation (CSTR)
Too little speech data
• VTLN-based rapid speaker adaptation (Titech, IDIAP)
Noisy recordings
• Spectral subtraction & AVM + adaptation (CSTR)
No labels
• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
Insufficient knowledge of the language or accent
• Letter (grapheme)-based synthesis (CSTR)
• No prosodic contexts (CSTR, Titech)
Wrong language
• Cross-lingual speaker adaptation (MSRA, EMIME)
• Speaker & language adaptive training (Toshiba)
83
Outline
84
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
HMM-based speech synthesis
– Getting popular in the speech synthesis area
– Implementation
– Flexible to change its voice characteristics
– Its quality is improving
– There are many challenging research topics
Summary
85
TTS research is attractive for you!!
86
[Diagram: TTS research spans text analysis, vocoding, and statistical modeling.]
Text analysis
• Part-of-speech tagging, Grapheme-to-Phoneme conversion, etc.
Vocoding
• Better signal processing, better speech representation
Statistical modeling
• Robust modeling in ASR, knowledge from machine learning
Natural language processing
Signal processing, speech coding
ASR, machine learning
87
Please join TTS research!
Thanks!