Parametric speech synthesis - an overview
Sébastien Le Maguer, Bernd Möbius
5th January 2016
Introduction - Corpus-based TTS
[Diagram: corpus-based TTS. Offline stage: text → N.L.P. → descriptive features; speech signal → S.P. → acoustic coefficients. Online stage: text → N.L.P. → descriptive features.]
Introduction - Focus on Unit Selection
[Diagram: the corpus-based pipeline with the unit-selection online stage expanded: descriptive features → generation, drawing units from the database → S.P. → post-processing.]
Introduction - Focus on Unit Selection
Main hypothesis: nothing will ever be better than the speech signal itself.

Advantages: signal quality!
Drawbacks: not flexible!
Limited prosody variation (⇒ expressive speech synthesis?); all knowledge lies in the corpus (corpus design!).

⇒ Parametric speech synthesis

Some samples
Sample 1 Sample 2
Introduction - Parametric corpus-based TTS
[Diagram: the corpus-based pipeline with the parametric back-end: offline training of models (HMM, DNN, ...) from descriptive features and acoustic coefficients; online generation from those models, followed by S.P.]
Pre-requisite : Signal parametrization
Objective: how to represent speech with numerical coefficients.
Trend: quality grows with complexity.

Vital constraint: ACOUSTIC PARAMETERS ⇒ SOUND!
Pre-requisite : Signal parametrization
Vocoder - Source-filter model
[Diagram: source-filter model. Source: F0-driven periodic signal generation or white-noise generation produces the excitation e(t); filter: the vocal-tract parameters shape e(t) into the speech signal s(t).]
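The source-filter idea can be sketched in a few lines: excite an all-pole "vocal tract" filter with an impulse train at F0. This is a toy illustration, not the actual vocoder used in the talk; the function name, filter coefficients and resonance frequency are invented for the example.

```python
import numpy as np

def source_filter_synth(f0, ar_coeffs, n_samples=8000, sr=16000):
    """Toy source-filter synthesis: a periodic impulse train e(t) at F0
    is shaped by an all-pole 'vocal tract' filter to give s(t)."""
    # Source e(t): one impulse every F0 period (voiced excitation)
    e = np.zeros(n_samples)
    period = int(round(sr / f0))
    e[::period] = 1.0
    # Filter: s(t) = e(t) - sum_k a_k * s(t - k)  (all-pole recursion)
    s = np.zeros(n_samples)
    for t in range(n_samples):
        acc = e[t]
        for k, a in enumerate(ar_coeffs, start=1):
            if t - k >= 0:
                acc -= a * s[t - k]
        s[t] = acc
    return s

# Hypothetical filter: a single resonance near 500 Hz (pole radius 0.9)
a1 = -2.0 * 0.9 * np.cos(2.0 * np.pi * 500.0 / 16000.0)
signal = source_filter_synth(120.0, ar_coeffs=[a1, 0.81])
```

Swapping the impulse train for white noise gives the unvoiced branch of the diagram.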
Pre-requisite : Signal parametrization
Vocoder - Source-filter model (voiced example)
(http://www.haskins.yale.edu/featured/heads/mmsp/acoustic.html)
Pre-requisite : Signal parametrization
Vocoder - Mixed-mode excitation
[Diagram: mixed-mode excitation. Periodic signal generation (driven by F0) and white-noise generation are each bandpass filtered and mixed according to aperiodicity (ap.) parameters to form e(t); the vocal-tract filter (spectrum parameters) then produces s(t).]
Achieved by STRAIGHT [Kawahara et al., 1999]
Pre-requisite : Signal parametrization
Cepstrum - Mel-Frequency Cepstral Coefficients
s_n → pre-emphasis → windowing → |FFT| → Mel filterbank → log → FFT⁻¹ → c_n

[Figure: triangular Mel filters m_1 ... m_j ... m_P along the frequency axis; each collects the energy in its band (MELSPEC).]
HTKBook figure [Young et al., 2005]
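The chain above can be sketched end to end for a single frame. The filterbank construction and the parameter values (24 bands, 0.97 pre-emphasis, Hamming window) are common defaults, not values taken from the slide.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters equally spaced on the Mel scale (the figure's m_1..m_P)."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, sr=16000, n_mels=24, n_ceps=13):
    """One frame through the slide's chain:
    pre-emphasis -> windowing -> |FFT| -> Mel -> log -> FFT^-1 -> c_n."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))                      # windowing
    spec = np.abs(np.fft.rfft(frame))                           # |FFT|
    logmel = np.log(mel_filterbank(n_mels, len(frame), sr) @ spec + 1e-10)
    return np.fft.irfft(logmel)[:n_ceps]                        # FFT^-1 -> c_n
```

The final inverse FFT follows the slide's diagram; HTK itself uses a DCT at that step, which differs only by symmetry assumptions.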
Pre-requisite : Signal parametrization
Cepstrum - Mel-Log Spectrum Approximation (MLSA) filter [Fukada et al., 1992]

Why? In speech synthesis we need to generate the spectrum from the cepstrum.

Properties:
accuracy: maximal spectral error of 0.24 dB
O(8M) multiply/add operations per sample
stable
Pre-requisite : Signal parametrization
Taking dynamics into account
Some samples
Sample 1 Sample 2
THE EQUATION TO REMEMBER

O = W · C

C = extracted coefficients
W = a windowing matrix
O = observations
Pre-requisite : Signal parametrization
Taking dynamics into account
O = W × C

C = (c₁, c₂, ..., c_T)ᵀ, dimension MT
O = (o₁, o₂, ..., o_T)ᵀ with o_t = (c_t, ∆c_t, ∆²c_t)ᵀ, dimension 3MT

∆c_t = Σ_{τ=−L¹..L¹} w¹(τ) · c_{t+τ}
∆²c_t = Σ_{τ=−L²..L²} w²(τ) · c_{t+τ}

W is the 3MT × MT band matrix which, for each frame t, stacks a static block row (1·I centred on column t), a ∆ block row (the window coefficients w¹(−L¹)·I, ..., w¹(L¹)·I around column t) and a ∆² block row (w²(−L²)·I, ..., w²(L²)·I around column t), with the windows truncated at the sequence edges.
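Rather than building the full 3MT × MT band matrix, the stacking O = WC can be computed directly per frame. The regression window w = (−0.5, 0, 0.5) with L = 1 is a common choice assumed here, not taken from the slide.

```python
import numpy as np

def delta(C):
    """First-order dynamics: Delta c_t = 0.5 * (c_{t+1} - c_{t-1}),
    replicating edge frames (window w = (-0.5, 0, 0.5), L = 1)."""
    prev = np.vstack([C[:1], C[:-1]])   # c_{t-1}
    nxt = np.vstack([C[1:], C[-1:]])    # c_{t+1}
    return 0.5 * (nxt - prev)

def stack_observations(C):
    """O = W C: each o_t stacks (c_t, Delta c_t, Delta^2 c_t) -> shape (T, 3M)."""
    d = delta(C)
    return np.hstack([C, d, delta(d)])

C = np.random.default_rng(0).normal(size=(100, 25))  # T = 100 frames, M = 25 coefficients
O = stack_observations(C)                            # shape (100, 75)
```

A constant trajectory yields all-zero ∆ and ∆² columns; it is exactly this coupling between frames that the generation step later exploits to produce smooth trajectories.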
Pre-requisite : Get descriptive features
Pre-requisite : Get descriptive features
Objective: how to describe speech.

What are they? Generally based on linguistic information... but you can use whatever you want.
Only discrete values (booleans, numbers, enumerations, ...).

Main constraint: they have to be automatically predictable.
Pre-requisite : Get descriptive features
Pre-requisite : Get descriptive features - Example
label example
label format description
Training stage
Where we are!
[Diagram: the parametric TTS pipeline; we are now in the offline training stage, where acoustic coefficients and descriptive features are used to train the models (HMM, DNN, ...).]
Training stage
Problems to be solved

1. Acoustic modelling ⇒ Gaussian Hidden Semi-Markov Models
2. Heterogeneous data ⇒ Multi-Space Distribution
3. Sparseness ⇒ decision tree + multi-stage training process
Training stage
Statistical distribution - Gaussian vectors
Definition - Multivariate Gaussian distribution

N(µ, Σ)    (1)

µ = mean vector
Σ = covariance matrix

Probability density function:

f_{µ,Σ}(X) = (2π)^{−N/2} |Σ|^{−1/2} exp(−½ (X − µ)ᵀ Σ⁻¹ (X − µ))    (2)
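Eq. (2) translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, Sigma) evaluated at x, as in Eq. (2)."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    n = mu.size
    diff = x - mu
    # Normalisation constant (2 pi)^{N/2} |Sigma|^{1/2}
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(sigma))
    # Quadratic form via a linear solve (avoids forming Sigma^{-1} explicitly)
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

p = gaussian_pdf([0.0], [0.0], [[1.0]])  # standard 1-D normal at 0: 1/sqrt(2 pi)
```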
Training stage
Statistical distribution - MSD
MSD = Multi-Space Distribution
Definition
X = (V, S)    (3)

S = a space
V = a value in the space S

Application to F0:

MSD(X) = w₁ · N(V(X); µ, Σ)  if S(X) = {1}  (voiced)
         w₂ · δ₀(V(X))       if S(X) = {0}  (unvoiced)    (4)
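Eq. (4) can be sketched for F0, encoding an unvoiced frame as None; that encoding and the function name are illustrative conventions, not the HTS implementation.

```python
import math

def msd_f0_likelihood(x, w_voiced, mu, var):
    """Observation likelihood under the two-space MSD of Eq. (4):
    a weighted Gaussian in the voiced space, with the remaining mass
    w2 = 1 - w1 on the zero-dimensional unvoiced space."""
    if x is None:                 # unvoiced frame: discrete space {0}
        return 1.0 - w_voiced     # w2 * delta_0 contributes its weight
    # voiced frame: weighted univariate Gaussian density
    g = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * g
```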
Training stage
Markov Models (MM)
http://www.americanscientist.org/issues/pub/2013/2/first-links-in-the-markov-chain/99999
Training stage
Hidden Markov Models (HMM)
Now we cannot observe the weather sequence directly!
Main assumption: there is an underlying phenomenon (a hidden state sequence Q) which can explain our observations.
Go back to our example
[Diagram: 3-state HMM; arc transition probabilities 0.6, 0.3, 0.1, 0.2, 0.3, 0.5, 0.4, 0.1, 0.5.]

Emission probabilities:

         state 1   state 2   state 3
Rainy      0.5       0.2       0.3
Sunny      0.4       0.5       0.0
Cloudy     0.1       0.8       0.1
Training stage
Hidden Markov Models (HMM) - Continuous
Based on the previous HMM
[Diagram: the same 3-state HMM, but each state now emits from a continuous (Gaussian) distribution rather than a discrete table.]
Training stage
Hidden Markov Models (HMM) - the 3 problems
Tutorial = [Rabiner, 1989]
Problem 1: given an observation sequence O = (O_1, ..., O_T) and an HMM λ, what is the probability P(O|λ)? ⇒ forward-backward algorithm.
Problem 2: given an observation sequence O = (O_1, ..., O_T) and an HMM λ, what is the state sequence Q which best explains O according to the model λ? ⇒ Viterbi algorithm.
Problem 3: how to estimate the model parameters (A, B, π)? ⇒ given a set of observation sequences O = {O^1, ..., O^K}, an E.M.-based algorithm (Baum-Welch).
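Problem 1 can be run on the weather example. The emission table is the one from the slide; the assignment of the figure's arc weights to a transition matrix, and the uniform initial distribution π, are assumptions made for illustration.

```python
import numpy as np

# Transition matrix (rows = from-state; arc weights from the figure,
# row assignment assumed) and emission table from the slide.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
B = np.array([[0.5, 0.2, 0.3],   # Rainy,  per state 1..3
              [0.4, 0.5, 0.0],   # Sunny
              [0.1, 0.8, 0.1]])  # Cloudy
pi = np.full(3, 1.0 / 3.0)       # assumed uniform initial distribution

def forward(obs):
    """P(O | lambda) by the forward algorithm (Problem 1)."""
    alpha = pi * B[obs[0]]            # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[o]    # induction over frames
    return float(alpha.sum())         # termination

p = forward([0, 1, 2])  # observation sequence: Rainy, Sunny, Cloudy
```

Replacing the sum over states by a max (and keeping back-pointers) turns this same recursion into the Viterbi algorithm of Problem 2.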
Training stage
Hidden Markov Models (HMM) - Training input (discrete)

Example data

raining - sunny - sunny - sunny - raining - cloudy
sunny - raining - raining - cloudy
cloudy - cloudy
...
Training stage
Hidden Markov Models (HMM) - Training input (continuous)

Example data

0.66693531 - 0.71573471 - 0.44575163 - 0.34354172 - 0.18041787
0.76179169 - 0.10598874
0.22910708 - 0.24211562 - 0.90336061 - 0.72944093
0.60625303 - 0.24092975 - 0.94397027
0.39016014 - 0.32262471 - 0.10743479 - 0.06654687
...
Training stage
Hidden Markov Models (HMM) - Speech
Definition
λ = (A, B, π)

A = {a_ij},     ∀ i, j ∈ [1..S]
B = {b_j(o_t)}, ∀ j ∈ [1..S], t ∈ [1..T]
π = {π_i},      ∀ i ∈ [1..S]

a_ij     = P(q_t = j | q_{t−1} = i), ∀ i, j ∈ [1..S], t ∈ [2..T]
b_j(o_t) = P(o_t | q_t = j),         ∀ j ∈ [1..S], t ∈ [1..T]
π_i      = P(q_1 = i),               ∀ i ∈ [1..S]
Training stage
Hidden Semi-Markov Models (HSMM)
HMM ⇒ geometric state-duration distribution:

Pr(X = k) = (1 − p)^(k−1) · p    (5)

PROBLEM! Optimizing the duration during the synthesis stage yields 1 frame per state!
⇒ use a Gaussian distribution for the duration (synthesis)
⇒ HSMM (training)
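Eq. (5) makes the problem concrete: the geometric pmf is maximal at k = 1, so maximizing duration probability always picks the shortest stay. A quick check (p here is the state-exit probability, i.e. 1 − a_ii):

```python
def geometric_duration_pmf(k, p):
    """Pr(X = k) = (1 - p)^(k-1) * p, the implicit HMM state-duration model (Eq. 5)."""
    return (1.0 - p) ** (k - 1) * p

# Strictly decreasing in k: the most likely duration is always 1 frame,
# which is exactly the synthesis-stage problem the slide points out.
probs = [geometric_duration_pmf(k, 0.3) for k in range(1, 6)]
```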
Training stage
Decision tree + state tying [Young et al., 1994]
Objective: dealing with sparseness. English = 53 descriptive features, German = 70 descriptive features! Far too many distinct contexts for direct statistical training.

[Diagram: binary decision tree, e.g. "C-voiced?" then "N-vowel?", with Yes/No branches leading to tied states.]

Iterative algorithm ⇒ stop/split criterion = Minimum Description Length (based on variance).
Training stage
Training process
Initialisation (MSD-HMM) → Monophone (MSD-HSMM) → Full-context (MSD-HSMM) → Clustering tree + (MSD-HSMM)

(from initialisation to "finalisation")
Synthesis stage
Where we are!
[Diagram: the parametric TTS pipeline again, now entering the online stage: descriptive features → generation from the trained models (HMM, DNN, ...) → S.P.]
Synthesis stage
HMM to produce speech
0. Input = sequence of descriptive features
start-tt-an an-tt-end...
Synthesis stage
HMM to produce speech
1. HMM-phrase building
start-tt-an an-tt-end...
[Diagram: one HMM per label, concatenated into a phrase HMM spanning all segments (Nb segments).]
Synthesis stage
HMM to produce speech
2. Associate distributions [Yoshimura et al., 1999]
[Diagram: for each label (start-tt-an, ..., an-tt-end) and state, decision trees with questions such as "C-Plosive?", "L-start?", "N-Vowel?", "N-Voiced?" select the tied distributions.]
Synthesis stage
HMM to produce speech
3. Acoustic coefficient generation [Tokuda et al., 2000]
Produce C which maximizes P(O|λ) = P(WC|λ).
E.M.-based algorithm [Baum et al., 1970]; hidden sequence = observations.
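Given a fixed state alignment with per-frame means and diagonal variances for static and ∆ features, maximizing P(WC|λ) has the closed form C = (WᵀU⁻¹W)⁻¹WᵀU⁻¹µ. A single-dimension sketch, assuming the ∆ window (−0.5, 0, 0.5):

```python
import numpy as np

def build_W(T):
    """Windowing matrix for static + Delta features: rows interleave
    o_t = (c_t, Delta c_t), Delta window (-0.5, 0, 0.5), edges clipped."""
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] += -0.5
        D[t, min(t + 1, T - 1)] += 0.5
    W = np.zeros((2 * T, T))
    W[0::2] = np.eye(T)   # static rows
    W[1::2] = D           # delta rows
    return W

def generate(mu, var):
    """ML parameter generation: C = (W' U^-1 W)^-1 W' U^-1 mu,
    where mu/var interleave (static, delta) targets per frame."""
    T = len(mu) // 2
    W = build_W(T)
    Uinv = np.diag(1.0 / np.asarray(var, dtype=float))
    A = W.T @ Uinv @ W
    b = W.T @ Uinv @ np.asarray(mu, dtype=float)
    return np.linalg.solve(A, b)
```

With constant static means and zero ∆ means the generated trajectory is that constant; when the static means jump between states, the ∆ terms smooth the transition, which is the whole point of modelling dynamics.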
Synthesis stage
HMM to produce speech
4. Signal synthesis [Zen and Toda, 2005]
Use MLSA Filter [Fukada et al., 1992] + Vocoder [Kawahara et al., 1999]
Adaptation
Information

Why it's great: an average voice is built ⇒ no need for a huge amount of training data.
Flexible (voice, accent, expression, ...) + robust.
Samples
Sample 1 Sample 2
Small introduction to DNN
Why it is well suited

Main advantages: multi-layer models match the layered structure of speech (frames, phones, syllables, words, ...);
really powerful modelling;
current state of the art (Long Short-Term Memory models).

Problem: a HUGE amount of data is needed to train the models;
introspection and analysis are complicated (not achieved yet).
Conclusion
Summary of this talk
2 corpus-based TTS methodologies: unit selection and parametric speech synthesis.
Focus on statistical speech synthesis: which problems this kind of synthesis implies, and which solutions can be provided.
HMM text-to-speech synthesis: HTS (http://hts.sp.nitech.ac.jp), MaryTTS (http://mary.dfki.de/).
Solutions:
acoustic modelling = HMM + MSD
duration modelling = HSMM
data sparseness = decision tree + training process
DYNAMICS!
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics, 41(1):164–171.

Fukada, T., Tokuda, K., Kobayashi, T., and Imai, S. (1992). An adaptive algorithm for mel-cepstral analysis of speech. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 137–140.

Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., and Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1315–1318.

Yamagishi, J., Zen, H., Toda, T., and Tokuda, K. (2007). Speaker-Independent HMM-based Speech Synthesis System - HTS-2007 System for the Blizzard Challenge 2007. In Proceedings of the ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), Bonn, Germany.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), volume 5, pages 2347–2350, Budapest, Hungary.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2005). The HTK Book.

Young, S. J., Odell, J. J., and Woodland, P. C. (1994). Tree-based state tying for high accuracy acoustic modelling. In Proceedings of the Workshop on Human Language Technology (HLT), pages 307–312, Morristown, New Jersey, USA. Association for Computational Linguistics.

Zen, H., Senior, A., and Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7962–7966. IEEE.

Zen, H. and Toda, T. (2005). An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech), Lisbon, Portugal.