Statistical Speech Synthesis
Heiga ZEN
Toshiba Research Europe Ltd.
Cambridge Research Laboratory
Speech Synthesis Seminar Series @ CUED, Cambridge, UK
January 11th, 2011
Text-to-speech synthesis (TTS)
Text (sequence of discrete symbols) → Speech (continuous time series)
Automatic speech recognition (ASR)
Speech (continuous time series) → Text (sequence of discrete symbols)
Machine translation (MT)
Text (sequence of discrete symbols) → Text (sequence of discrete symbols)
Text-to-speech as a mapping problem
[Figure: the same message expressed as text in different languages ("Dobré ráno", "Good morning") and as spoken utterances of "Good morning" — TTS is a mapping from a text (concept) to speech.]
Speech production process
[Figure: human speech production. Air flow from the lungs provides a sound source (voiced: pulse train; unvoiced: noise) with a given fundamental frequency; the vocal tract imposes its frequency transfer characteristics on the magnitude spectrum. Speech is thus the modulation of a carrier wave by the speech information: fundamental frequency, voiced/unvoiced decision, and frequency transfer characteristics.]
Speech synthesis methods (1)
Rule-based, formant synthesis (~’90s)
– Based on a parametric representation of speech
– Hand-crafted rules to control each phonetic unit
DECtalk (or KlattTalk / MITTalk) [Klatt;’82]
[Figure: block diagram of KlattTalk]
Speech synthesis methods (2)
Corpus-based, concatenative synthesis (’90s~)
– Concatenate small speech units (e.g., phones) from a database
– Large data + automatic learning → high-quality synthetic voices
Single inventory: diphone synthesis [Moulines;’90]
Multiple inventory: unit-selection synthesis [Sagisaka;’92, Black;’96]
Speech synthesis methods (3)
Corpus-based, statistical parametric synthesis (mid ’90s~)
– Large data + automatic training → automatic voice building
– Source-filter model + statistical modeling → flexible to change its voice characteristics
Hidden Markov models (HMMs) as the statistical acoustic model
→ HMM-based speech synthesis (HTS) [Yoshimura;’02]
[Figure: corpus-based statistical parametric synthesis pipeline — feature extraction, model training, parameter generation, waveform generation.]
Popularity of statistical speech synthesis
[Figure: number of statistical-speech-synthesis-related papers at ICASSP per year, 1995–2009.]
Aim of this talk
Statistical speech synthesis is getting popular, but not many researchers fully understand how it works.
→ Formulate & understand the whole corpus-based speech synthesis process in a unified statistical framework.
Outline
HMM-based speech synthesis
– Overview
– Implementation of individual components
Bayesian framework for speech synthesis
– Formulation
– Realizations in HMM-based speech synthesis
– Recent work
Conclusions
– Summary
– Future research topics
HMM-based speech synthesis system (HTS)
[Block diagram. Training part: speech signals from the SPEECH DATABASE go through excitation parameter extraction and spectral parameter extraction; the resulting excitation and spectral parameters, together with labels, are used for training HMMs. Synthesis part: input TEXT is converted by text analysis into labels; excitation and spectral parameters are generated from the trained HMMs, and the excitation generation and synthesis filter modules produce the SYNTHESIZED SPEECH.]
HMM-based speech synthesis system (HTS)
[The same block diagram, showing the context-dependent HMMs & state duration models linking the training and synthesis parts; the next slides describe speech parameter (feature) extraction.]
Speech production process
[Figure: speech production, as before — a sound source (voiced: pulse, unvoiced: noise), the fundamental frequency, and the vocal-tract frequency transfer characteristics; speech is the modulation of the carrier wave by this information.]
Divide speech into frames
Speech is a non-stationary signal
… but can be assumed to be quasi-stationary
Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
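A minimal sketch of this framing step (assuming a NumPy array x holding the waveform and a 16 kHz sampling rate, which is an assumption here rather than a value given on the slide):

  import numpy as np

  def frame_signal(x, sample_rate=16000, frame_length_ms=25, frame_shift_ms=5):
      # Convert frame length / shift from milliseconds to samples
      length = int(sample_rate * frame_length_ms / 1000)
      shift = int(sample_rate * frame_shift_ms / 1000)
      # Assumes len(x) >= length; each row is one quasi-stationary short-time frame
      starts = range(0, len(x) - length + 1, shift)
      frames = np.stack([x[s:s + length] for s in starts])
      # A tapering window (e.g., Hamming) is usually applied before spectral analysis
      return frames * np.hamming(length)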
Source-filter model
[Figure: a pulse train (voiced) or white noise (unvoiced) excitation drives a linear time-invariant system to produce speech; in the Fourier-transform domain, the speech spectrum factors into an excitation (source) part and a spectral (filter) part.]
Spectral (filter) model
Parametric models of the speech spectrum: the autoregressive (AR) model and the exponential (EX) model
ML estimation of the spectral model parameters:
– AR model → linear prediction (LP) [Itakura;’70]
– EX model → ML-based cepstral analysis
LP analysis (1)
LP analysis (2)
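As a rough illustration of the LP analysis step, here is a sketch of the autocorrelation method for fitting the all-pole (AR) filter H(z) = g / (1 − Σ a_k z^{-k}); the prediction order and the use of SciPy's Toeplitz solver are assumptions, not details from the slides:

  import numpy as np
  from scipy.linalg import solve_toeplitz

  def lp_analysis(frame, order=20):
      # Autocorrelation r[k] of the windowed frame
      r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
      # Solve the Yule-Walker (normal) equations R a = r for the AR coefficients
      a = solve_toeplitz(r[:order], r[1:order + 1])
      # Residual energy gives the filter gain
      gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 1e-12))
      return a, gain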
Excitation (source) model
[Figure: switched excitation — a pulse train for voiced frames, white noise for unvoiced frames.]
Excitation model: pulse/noise excitation
– Voiced (periodic) pulse trains
– Unvoiced (aperiodic) white noise
Excitation model parameters
– V/UV decision
– Voiced frames: fundamental frequency (F0)
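A toy version of pulse/noise excitation generation from these parameters (a sketch; the sample rate, the 80-sample frame shift corresponding to 5 ms at 16 kHz, and the array layout are assumptions):

  import numpy as np

  def generate_excitation(f0, voiced, sample_rate=16000, frame_shift=80):
      # f0, voiced: per-frame F0 values (Hz) and V/UV decisions
      excitation = np.zeros(len(f0) * frame_shift)
      phase = 0.0
      for t, (hz, v) in enumerate(zip(f0, voiced)):
          start = t * frame_shift
          if v:
              # Voiced: unit pulses spaced one pitch period (sample_rate / hz) apart
              for n in range(frame_shift):
                  phase += hz / sample_rate
                  if phase >= 1.0:
                      excitation[start + n] = 1.0
                      phase -= 1.0
          else:
              # Unvoiced: white Gaussian noise
              excitation[start:start + frame_shift] = np.random.randn(frame_shift)
              phase = 0.0
      return excitation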
Speech samples
Natural speech
Reconstructed speech from extracted parameters (cepstral coefficients & F0 with V/UV decisions)
Quality degrades, but main characteristics are preserved
HMM-based speech synthesis system (HTS)
[The same block diagram, repeated; the next slides describe the HMM training part.]
Structure of state-output (observation) vector
[Figure: each observation vector has a spectrum part — spectral parameters (e.g., cepstrum, LSPs) — and an excitation part — log F0 with V/UV.]
Dynamic features
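Dynamic (delta and delta-delta) features are appended to the static parameters before modeling. A sketch using the usual centered-difference windows — the exact window coefficients used in HTS are an assumption here:

  import numpy as np

  def add_dynamic_features(c):
      # c: (T, D) matrix of static parameters (e.g., cepstra), one row per frame
      padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
      # First-order dynamics: delta(t) = 0.5 * (c(t+1) - c(t-1))
      delta = 0.5 * (padded[2:] - padded[:-2])
      # Second-order dynamics: delta2(t) = c(t+1) - 2 c(t) + c(t-1)
      delta2 = padded[2:] - 2 * padded[1:-1] + padded[:-2]
      # Observation vector o(t) = [c(t), delta(t), delta2(t)]
      return np.hstack([c, delta, delta2])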
HMM-based modeling
[Figure: a left-to-right HMM with states 1…N generating the observation sequence; each observation is emitted from the state occupied at that frame (state sequence).]
Multi-stream HMM structure
[Figure: a sentence HMM built by concatenating phone HMMs for the label sequence "sil a i sil" (states 1 2 3 per phone). Stream 1 models the spectrum (cepstrum or LSP and its dynamic features); streams 2–4 model the excitation (log F0 and its dynamic features).]
Observation of F0
[Figure: F0 contour over time (log frequency vs. time); F0 exists only in voiced regions.]
The F0 observation cannot be modelled by a standard continuous or discrete distribution alone: it is a continuous value in voiced regions and undefined in unvoiced regions.
Multi-space probability distribution (MSD)
[Figure: in each HMM state, the log F0 observation is modelled by an MSD — a voiced space (a continuous distribution over log F0) and an unvoiced space (a discrete symbol), combined with state-dependent voiced/unvoiced weights.]
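A sketch of how a state-level MSD output probability for the log F0 stream could be evaluated — a single voiced Gaussian plus a discrete unvoiced space; the names and parameterisation here are assumptions for illustration, and the voiced weight is assumed to lie strictly between 0 and 1:

  import math

  def msd_log_likelihood(obs, w_voiced, mean, var):
      # obs is either ("voiced", log_f0) or ("unvoiced", None)
      kind, log_f0 = obs
      if kind == "unvoiced":
          # Unvoiced space: only the space weight contributes
          return math.log(1.0 - w_voiced)
      # Voiced space: space weight times a Gaussian density over log F0
      gauss = math.exp(-0.5 * (log_f0 - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
      return math.log(w_voiced * gauss)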
Structure of state-output distributions
[Figure: stream 1 (spectral parameters) uses a single Gaussian; streams 2, 3, and 4 (log F0 and its dynamic features) each use an MSD combining a Gaussian (voiced) and a discrete (unvoiced) component.]
Training process
[Flowchart: from data & labels, compute the variance floor; initialize monophone (context-independent, CI) HMMs by segmental k-means; re-estimate the CI-HMMs by the EM algorithm; copy the CI-HMMs to fullcontext (context-dependent, CD) HMMs; re-estimate the CD-HMMs by EM; apply decision-tree-based clustering; re-estimate the clustered CD-HMMs by EM; untie the parameter-tying structure; estimate CD state-duration models from the forward-backward statistics and cluster them with decision trees → estimated HMMs and estimated duration models.]
HMM-based modeling
[Figure: a sentence HMM composed from context-dependent phone models for the transcription "sil  sil-a+i  a-i+sil  sil" (3 states per model), aligned against the observation sequence.]
Context-dependent modeling
Phoneme
  – current phoneme
  – {preceding, succeeding} two phonemes
Syllable
  – # of phonemes in {preceding, current, succeeding} syllable
  – {accent, stress} of {preceding, current, succeeding} syllable
  – position of current syllable in current word
  – # of {preceding, succeeding} {accented, stressed} syllables in current phrase
  – # of syllables {from previous, to next} {accented, stressed} syllable
  – vowel within current syllable
Word
  – part of speech of {preceding, current, succeeding} word
  – # of syllables in {preceding, current, succeeding} word
  – position of current word in current phrase
  – # of {preceding, succeeding} content words in current phrase
  – # of words {from previous, to next} content word
Phrase
  – # of syllables in {preceding, current, succeeding} phrase
  – …
Huge # of combinations ⇒ difficult to have models for all possible contexts
Training process
[The same training flowchart as above; the next slides detail the decision-tree-based clustering step.]
Decision tree-based context clustering [Odell;’95]
[Figure: a binary decision tree over context questions (e.g., R=silence?, L=“gy”?, C=voiced?, L=“w”?). Fullcontext states such as k-a+b/A:…, t-e+h/A:…, w-a+t/A:… are routed through yes/no answers to leaf nodes; states sharing a leaf are tied, and the leaves also provide the synthesized states for contexts unseen in training.]
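To make the clustering idea concrete, here is a toy sketch of routing a fullcontext label through such a tree to a tied leaf distribution at synthesis time; the question set and context representation are invented for illustration:

  def find_leaf(node, context):
      # node: either a leaf id (str) or (question, yes_child, no_child); context: dict of features
      while not isinstance(node, str):
          question, yes_child, no_child = node
          node = yes_child if question(context) else no_child
      return node  # tied leaf (shared distribution) id

  # Hypothetical tree: first ask whether the right phone is silence,
  # then whether the centre phone is voiced.
  tree = (lambda c: c["right"] == "sil",
          "leaf_R_sil",
          (lambda c: c["centre_voiced"], "leaf_voiced", "leaf_unvoiced"))

  print(find_leaf(tree, {"right": "a", "centre_voiced": True}))  # -> leaf_voiced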
Stream-dependent clustering
Spectrum & excitation have different context dependencies → build decision trees separately.
[Figure: separate decision trees for the mel-cepstrum streams and for the F0 streams.]
Training process
[The same training flowchart; the next slides detail the estimation of state duration models from the forward-backward statistics.]
Estimation of state duration models [Yoshimura;’98]
[Figure: state-occupancy statistics over frames t = 0 … T are accumulated for each state i via the forward-backward algorithm, and a duration distribution is fitted per state.]
Stream-dependent clustering
[Figure: as well as the separate decision trees for mel-cepstrum and F0, the state duration models attached to each HMM are clustered with their own decision tree.]
Training process
[The same training flowchart; its outputs — the estimated context-dependent HMMs and duration models — feed the synthesis part described next.]
HMM-based speech synthesis system (HTS)
[The same block diagram; the next slides describe the synthesis part — text analysis and parameter generation from the HMMs.]
Composition of sentence HMM for a given text
[Figure: TEXT → text analysis (text normalization, G2P, POS tagging, pause prediction) → context-dependent label sequence → sentence HMM composed from the context-dependent models for those labels.]
Speech parameter generation algorithm
Determination of state sequence (1)
[Figure: the state sequence (e.g., 1 1 1 1 2 2 3 3 …) is fixed by choosing a duration for each state (e.g., 4, 10, 5 frames); the observations are then generated state by state.]
Determine the state sequence by determining the state durations.
Determination of state sequence (2)
[Figure: state-duration probability vs. duration — the implicit geometric duration distribution of a standard HMM compared with an explicit Gaussian state-duration model.]
Determination of state sequence (3)
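One common rule for picking durations from Gaussian state-duration models, also used to control speaking rate, is d_i = m_i + ρ·σ_i² (mean plus a shared scaling of the variances); whether this exact rule is what the slide shows is an assumption. A sketch:

  import numpy as np

  def pick_state_durations(means, variances, rho=0.0):
      # means, variances: per-state Gaussian duration statistics (in frames)
      # rho > 0 slows speech down, rho < 0 speeds it up, rho = 0 uses the means
      durations = means + rho * variances
      # Durations must be positive integers
      return np.maximum(np.round(durations), 1).astype(int)

  print(pick_state_durations(np.array([4.2, 9.8, 5.1]), np.array([1.0, 2.5, 1.2])))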
Speech parameter generation algorithm
Without dynamic features
[Figure: state-output means and variances along the state sequence; with static features only, the generated trajectory is a step-wise sequence of the state mean values.]
Integration of dynamic features
The speech parameter vector includes both static and dynamic features: o(t) = [c(t)ᵀ, Δc(t)ᵀ]ᵀ (dimension 2M for M static coefficients).
The relationship between the static feature sequence c and the observation sequence o can be arranged in matrix form as o = W c, where W is the window matrix that appends the dynamic features.
Speech parameter generation algorithm
Solution: maximizing the output probability under the constraint o = W c leads to the linear system
Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ
where μ and Σ are the means and covariances along the chosen state sequence.
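A dense (non-banded) sketch of that solution for one feature dimension, assuming the per-frame means mu and diagonal variances var are already concatenated over the chosen state sequence and stacked as [statics; deltas], with the same delta window 0.5·(c(t+1) − c(t−1)) as above; a real implementation exploits the band structure of WᵀΣ⁻¹W instead of forming W explicitly:

  import numpy as np

  def mlpg(mu, var):
      # mu, var: (2T,) stacked [static; delta] means and variances
      T = len(mu) // 2
      W = np.zeros((2 * T, T))
      W[:T, :] = np.eye(T)                    # static window: identity
      for t in range(T):                      # delta window: 0.5 * (c[t+1] - c[t-1])
          W[T + t, max(t - 1, 0)] -= 0.5
          W[T + t, min(t + 1, T - 1)] += 0.5
      P = np.diag(1.0 / var)                  # Sigma^{-1} (diagonal)
      # Solve (W' Sigma^{-1} W) c = W' Sigma^{-1} mu
      return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)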
Generated speech parameter trajectory
[Figure: the static and dynamic mean/variance sequences along the chosen state path and the smooth static-parameter trajectory generated from them.]
Generated spectra
[Figure: spectrograms (in kHz) of the utterance "sil a i sil" generated without and with dynamic features.]
HMM-based speech synthesis system (HTS)
[The same block diagram; the remaining step is waveform generation — excitation generation and the synthesis filter.]
Source-filter model
[Figure: the generated excitation parameters (log F0 with V/UV) drive the pulse-train / white-noise excitation, and the generated spectral parameters (cepstrum, LSP) define the linear time-invariant synthesis filter that produces the synthesized speech.]
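A crude per-frame sketch of this final filtering step, assuming per-frame LP coefficients and gains from the analysis stage and the excitation generated above; filter-state continuity between frames is ignored for simplicity, and the actual HTS synthesis filter for cepstral parameters differs from this:

  import numpy as np
  from scipy.signal import lfilter

  def synthesize(excitation, lp_coeffs, gains, frame_shift=80):
      # lp_coeffs: (T, order) AR coefficients per frame; gains: (T,) per-frame gains
      out = np.zeros_like(excitation)
      for t, (a, g) in enumerate(zip(lp_coeffs, gains)):
          seg = slice(t * frame_shift, (t + 1) * frame_shift)
          # All-pole synthesis filter H(z) = g / (1 - sum_k a_k z^{-k}), applied frame by frame
          out[seg] = lfilter([g], np.concatenate(([1.0], -a)), excitation[seg])
      return out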
[Figure: for unvoiced frames, white noise is filtered with the LP spectral coefficients to produce the synthesized speech.]
Driving a linear filter with white noise is equivalent to sampling from a Gaussian distribution.
Speech samples
w/o dynamic features
w/ dynamic features
Use of dynamic features can reduce discontinuity
Outline
HMM-based speech synthesis
– Overview
– Implementation of individual components
Bayesian framework for speech synthesis
– Formulation
– Realizations in HMM-based speech synthesis
– Recent work
Conclusions
– Summary
– Future research topics
Statistical framework for speech synthesis (1)
We have a speech database, i.e., a set of texts & corresponding speech waveforms.
Given a text to be synthesized, what is the speech waveform corresponding to that text?
x : speech waveform to be synthesized (unknown)
X : speech waveforms in the database (given)
W : set of texts in the database (given)
w : text to be synthesized (given)
Bayesian framework for speech synthesis (2)
Bayesian framework for prediction:
1. Estimate the predictive distribution of x given the known variables (w, X, W)
2. Draw a sample from that distribution
(x: speech waveform to be synthesized; X: speech waveforms; W: set of texts; w: text to be synthesized)
Bayesian framework for speech synthesis (3)
1. Estimating the predictive distribution directly is hard
→ Introduce acoustic model parameters λ (e.g., an HMM) and marginalize over them
Bayesian framework for speech synthesis (4)
2. Using the speech waveform directly is difficult
→ Introduce its parametric representation o (e.g., cepstrum, LPC, LSP, F0, aperiodicity)
Bayesian framework for speech synthesis (5)
3. The same text can have multiple pronunciations, POS tags, etc.
→ Introduce labels l derived from the text (e.g., pronunciations, POS, lexical stress, grammar, pauses)
Bayesian framework for speech synthesis (6)
4. The integral & sum over the auxiliary variables are difficult to perform
→ Approximate them by a joint maximization
Bayesian framework for speech synthesis (7)
5. The joint maximization is hard
→ Approximate it by step-by-step maximizations
Bayesian framework for speech synthesis (8)
6. Training also requires a parametric form of the waveforms & labels
→ Introduce O (parametric representation of the database speech waveforms) and L (labels derived from the database texts), again approximating by step-by-step maximizations
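Putting steps 1–6 together, the chain can be written out as follows — a sketch in LaTeX using the notation assumed above (x, o, l for the synthesis side; X, O, L for the database side; λ for the acoustic model). The exact factorization and the order of the step-by-step maximizations are assumptions, chosen to match the module correspondence shown on the next slide:

  \begin{align*}
    p(x \mid w, X, W)
      &= \sum_{l} \sum_{L} \iiint
         p(x \mid o)\, p(o \mid l, \lambda)\, P(l \mid w)\,
         p(\lambda \mid O, L)\, p(O \mid X)\, P(L \mid W)
         \;\mathrm{d}o \,\mathrm{d}O \,\mathrm{d}\lambda \\[4pt]
    \hat{L}       &= \arg\max_{L} P(L \mid W)
                  && \text{text analysis (database texts)} \\
    \hat{O}       &= \arg\max_{O} p(O \mid X)
                  && \text{speech parameter extraction} \\
    \hat{\lambda} &= \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)
                  && \text{HMM training} \\
    \hat{l}       &= \arg\max_{l} P(l \mid w)
                  && \text{text analysis (input text)} \\
    \hat{o}       &= \arg\max_{o} p(o \mid \hat{l}, \hat{\lambda})
                  && \text{parameter generation} \\
    \hat{x}       &= \arg\max_{x} p(x \mid \hat{o})
                  && \text{waveform synthesis (vocoder)}
  \end{align*}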
Bayesian framework for speech synthesis (9)
[Figure: the HTS block diagram again, illustrating the correspondence between the step-by-step maximizations and the system modules — speech parameter extraction, text analysis, HMM training, parameter generation, and waveform synthesis.]
Problems
Many approximations:
– Integral & sum ≈ max
– Joint max ≈ step-by-step max
→ Poor approximations
Recent work to relax the approximations:
– Max → integral & sum: Bayesian acoustic modeling, multiple labels
– Step-wise max → joint max: statistical vocoding
Bayesian acoustic modeling (1)
Bayesian acoustic modeling (2)
Bayesian approach:
– Model parameters are treated as hidden variables & marginalized out
– The Bayesian approach with hidden variables is intractable
→ Variational Bayes [Attias;’99]: bound the marginal likelihood from below via Jensen’s inequality
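A sketch of that bound — standard variational Bayes; whether the slides use exactly this factorization is an assumption. With Z the hidden state sequences and q(Z, λ) an approximate posterior:

  \begin{align*}
    \log p(O \mid L)
      &= \log \sum_{Z} \int p(O, Z, \lambda \mid L)\,\mathrm{d}\lambda \\
      &= \log \sum_{Z} \int q(Z, \lambda)\,
         \frac{p(O, Z, \lambda \mid L)}{q(Z, \lambda)}\,\mathrm{d}\lambda \\
      &\ge \sum_{Z} \int q(Z, \lambda)
         \log \frac{p(O, Z, \lambda \mid L)}{q(Z, \lambda)}\,\mathrm{d}\lambda
         \;=\; \mathcal{F}(q)
  \end{align*}

q is usually factorized as q(Z) q(λ), and the lower bound F(q) is maximized by alternating updates of the two factors.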
Bayesian acoustic modeling (3)
Variational Bayesian acoustic modeling for speech synthesis [Nankaku;’03]
– Fully VB-based speech synthesis
  Training: posterior distribution of the model parameters
  Parameter generation: from the predictive distribution
– Automatic model selection
  The Bayesian approach provides the posterior probability of the model structure
– Setting priors
  Evidence maximization [Hashimoto;’06]
  Cross validation [Hashimoto;’09]
– The VB approach works better than the ML one when
  the data set is small or the model is large
Multiple labels (1)
The label sequence is regarded as a hidden variable & marginalized out
Multiple labels (2)
Joint front-end / back-end model training [Oura;’08]
– Labels are regarded as hidden variables & marginalized out
  → Robust against label errors
– Front-end & back-end models are trained simultaneously
  → Combines the text analysis & acoustic models into a unified model
Simple pulse/noise vocoding
[Figure: pulse-train / white-noise excitation driving the vocal tract filter to produce synthesized speech.]
Basic pulse/noise vocoder:
– Binary switching between voiced & unvoiced excitations
  → Difficult to represent mixtures of voiced & unvoiced sounds
– The excitation signals of human speech are neither a pulse train nor noise
  → Colored voiced/unvoiced excitations
State-dependent filtering [Maia;’07]
[Figure: the sentence HMM generates log F0 values and mel-cepstral coefficients; a pulse-train generator (voiced) and white noise (unvoiced) are passed through state-dependent filters and summed to form a mixed excitation, which then drives the synthesis filter to give the synthesized speech.]
Waveform-level statistical model (1) [Maia;’10]
[Figure: pulse-train (voiced) and white-noise (unvoiced) excitations are filtered and summed into a mixed excitation that produces the synthesized speech.]
Waveform-level statistical model (2) [Maia;’10]
The integral & sum are intractable → approximate them by a joint maximization
– Conventional: step-by-step maximization
– Proposed: iterative joint maximization
Outline
HMM-based speech synthesis
– Overview
– Implementation of individual components
Bayesian framework for speech synthesis
– Formulation
– Realizations in HMM-based speech synthesis
– Recent work
Conclusions
– Summary
– Future research topics
Summary
HMM-based speech synthesis
– A statistical parametric speech synthesis approach
– Source-filter representation of speech + statistical acoustic modeling
– Getting popular
Bayesian framework for speech synthesis
– Formulation
– Decomposition into sub-problems
– Correspondence between the sub-problems & the modules of the HMM-based speech synthesis system
– Recent work to relax the approximations
Drawbacks of HMM-based speech synthesis
Quality of synthesized speech:
– Buzzy
– Flat
– Muffled
Three major factors degrade the quality:
– Poor vocoding: how to parameterize speech?
– Inaccurate acoustic modeling: how to model the extracted speech parameter trajectories?
– Over-smoothing: how to recover natural detail in the generated speech parameter trajectories?
A lot of work is still needed to improve the quality
Future challenging topics in speech synthesis
Keynote speech by Simon King at ISCA SSW7 last year
Speech synthesis is easy, if ...
• voice is built offline & carefully checked for errors
• speech is recorded in clean conditions
• word transcriptions are correct
• accurate phonetic labels are available or can be obtained
• speech is in the required language & speaking style
• speech is from a suitable speaker
• a native speaker is available, preferably a linguist
Speech synthesis is not easy if we don’t have the right data
Future challenging topics in speech synthesis
Non-professional speakers
• AVM + adaptation (CSTR)
Too little speech data
• VTLN-based rapid speaker adaptation (Titech, IDIAP)
Noisy recordings
• Spectral subtraction & AVM + adaptation (CSTR)
No labels
• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
Insufficient knowledge of the language or accent
• Letter (grapheme)-based synthesis (CSTR)
• No prosodic contexts (CSTR, Titech)
Wrong language
• Cross-lingual speaker adaptation (MSRA, EMIME)
• Speaker & language adaptive training (Toshiba)
Thanks!