Parametric speech synthesis - an overview
Sébastien Le Maguer, Bernd Möbius
5th January 2016
Introduction - Corpus-based TTS
[Diagram: corpus-based TTS. Offline stage: text → N.L.P. → descriptive features; speech signal → S.P. → acoustic coefficients. Online stage: text → N.L.P. → descriptive features.]
Introduction - Focus on Unit Selection
[Diagram: the corpus-based pipeline with the unit-selection online stage expanded: descriptive features → generation, drawing units from the database → S.P. → post-processing.]
Introduction - Focus on Unit Selection
Main hypothesis: nothing will ever be better than the speech signal itself.

Advantages: signal quality!
Drawbacks: not flexible!
Limited prosody variation (⇒ expressive speech synthesis?); all knowledge lies in the corpus (corpus design!).

⇒ Parametric speech synthesis

Some samples
Sample 1 Sample 2
Introduction - Parametric corpus-based TTS
[Diagram: the corpus-based pipeline with the parametric back-end: offline training of models (HMM, DNN, ...) from descriptive features and acoustic coefficients; online generation from those models, followed by S.P.]
Pre-requisite : Signal parametrization
Objective: how to represent speech with numerical coefficients.
Trend: quality grows with complexity.

Vital constraint: ACOUSTIC PARAMETERS ⇒ SOUND!
Pre-requisite : Signal parametrization
Vocoder - Source-filter model
[Diagram: source-filter model. Source: F0-driven periodic signal generation or white-noise generation produces the excitation e(t); filter: the vocal-tract parameters shape e(t) into the speech signal s(t).]
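The source-filter idea can be sketched in a few lines: excite an all-pole "vocal tract" filter with an impulse train at F0. This is a toy illustration, not the actual vocoder used in the talk; the function name, filter coefficients and resonance frequency are invented for the example.

```python
import numpy as np

def source_filter_synth(f0, ar_coeffs, n_samples=8000, sr=16000):
    """Toy source-filter synthesis: a periodic impulse train e(t) at F0
    is shaped by an all-pole 'vocal tract' filter to give s(t)."""
    # Source e(t): one impulse every F0 period (voiced excitation)
    e = np.zeros(n_samples)
    period = int(round(sr / f0))
    e[::period] = 1.0
    # Filter: s(t) = e(t) - sum_k a_k * s(t - k)  (all-pole recursion)
    s = np.zeros(n_samples)
    for t in range(n_samples):
        acc = e[t]
        for k, a in enumerate(ar_coeffs, start=1):
            if t - k >= 0:
                acc -= a * s[t - k]
        s[t] = acc
    return s

# Hypothetical filter: a single resonance near 500 Hz (pole radius 0.9)
a1 = -2.0 * 0.9 * np.cos(2.0 * np.pi * 500.0 / 16000.0)
signal = source_filter_synth(120.0, ar_coeffs=[a1, 0.81])
```

Swapping the impulse train for white noise gives the unvoiced branch of the diagram.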
Pre-requisite : Signal parametrization
Vocoder - Source-filter model (voiced example)
(http://www.haskins.yale.edu/featured/heads/mmsp/acoustic.html)
Pre-requisite : Signal parametrization
Vocoder - Mixed-mode excitation
[Diagram: mixed-mode excitation. Periodic signal generation (driven by F0) and white-noise generation are each bandpass filtered and mixed according to aperiodicity (ap.) parameters to form e(t); the vocal-tract filter (spectrum parameters) then produces s(t).]
Achieved by STRAIGHT [Kawahara et al., 1999]
Pre-requisite : Signal parametrization
Cepstrum - Mel-Frequency Cepstral Coefficients
s_n → pre-emphasis → windowing → |FFT| → Mel filterbank → log → FFT⁻¹ → c_n

[Figure: triangular Mel filters m_1 ... m_j ... m_P along the frequency axis; each collects the energy in its band (MELSPEC).]
HTKBook figure [Young et al., 2005]
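The chain above can be sketched end to end for a single frame. The filterbank construction and the parameter values (24 bands, 0.97 pre-emphasis, Hamming window) are common defaults, not values taken from the slide.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters equally spaced on the Mel scale (the figure's m_1..m_P)."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, sr=16000, n_mels=24, n_ceps=13):
    """One frame through the slide's chain:
    pre-emphasis -> windowing -> |FFT| -> Mel -> log -> FFT^-1 -> c_n."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))                      # windowing
    spec = np.abs(np.fft.rfft(frame))                           # |FFT|
    logmel = np.log(mel_filterbank(n_mels, len(frame), sr) @ spec + 1e-10)
    return np.fft.irfft(logmel)[:n_ceps]                        # FFT^-1 -> c_n
```

The final inverse FFT follows the slide's diagram; HTK itself uses a DCT at that step, which differs only by symmetry assumptions.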
Pre-requisite : Signal parametrization
Cepstrum - Mel-Log Spectrum Approximation (MLSA) filter [Fukada et al., 1992]

Why? In speech synthesis we need to generate the spectrum from the cepstrum.

Properties:
accuracy: maximal spectral error of 0.24 dB
O(8M) multiply/add operations per sample
stable
Pre-requisite : Signal parametrization
Taking dynamics into account
Some samples
Sample 1 Sample 2
THE EQUATION TO REMEMBER

O = W · C

C = extracted coefficients
W = a windowing matrix
O = observations
Pre-requisite : Signal parametrization
Taking dynamics into account
O = W × C

C = (c₁, c₂, ..., c_T)ᵀ, dimension MT
O = (o₁, o₂, ..., o_T)ᵀ with o_t = (c_t, ∆c_t, ∆²c_t)ᵀ, dimension 3MT

∆c_t = Σ_{τ=−L¹..L¹} w¹(τ) · c_{t+τ}
∆²c_t = Σ_{τ=−L²..L²} w²(τ) · c_{t+τ}

W is the 3MT × MT band matrix which, for each frame t, stacks a static block row (1·I centred on column t), a ∆ block row (the window coefficients w¹(−L¹)·I, ..., w¹(L¹)·I around column t) and a ∆² block row (w²(−L²)·I, ..., w²(L²)·I around column t), with the windows truncated at the sequence edges.
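Rather than building the full 3MT × MT band matrix, the stacking O = WC can be computed directly per frame. The regression window w = (−0.5, 0, 0.5) with L = 1 is a common choice assumed here, not taken from the slide.

```python
import numpy as np

def delta(C):
    """First-order dynamics: Delta c_t = 0.5 * (c_{t+1} - c_{t-1}),
    replicating edge frames (window w = (-0.5, 0, 0.5), L = 1)."""
    prev = np.vstack([C[:1], C[:-1]])   # c_{t-1}
    nxt = np.vstack([C[1:], C[-1:]])    # c_{t+1}
    return 0.5 * (nxt - prev)

def stack_observations(C):
    """O = W C: each o_t stacks (c_t, Delta c_t, Delta^2 c_t) -> shape (T, 3M)."""
    d = delta(C)
    return np.hstack([C, d, delta(d)])

C = np.random.default_rng(0).normal(size=(100, 25))  # T = 100 frames, M = 25 coefficients
O = stack_observations(C)                            # shape (100, 75)
```

A constant trajectory yields all-zero ∆ and ∆² columns; it is exactly this coupling between frames that the generation step later exploits to produce smooth trajectories.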
Pre-requisite : Get descriptive features
Pre-requisite : Get descriptive features
Objective: how to describe speech.

What are they? Generally based on linguistic information... but you can use whatever you want.
Only discrete values (booleans, numbers, enumerations, ...).

Main constraint: they have to be automatically predictable.
Pre-requisite : Get descriptive features
Pre-requisite : Get descriptive features - Example
label example
label format description
Training stage
Where we are!
[Diagram: the parametric TTS pipeline; we are now in the offline training stage, where acoustic coefficients and descriptive features are used to train the models (HMM, DNN, ...).]
Training stage
Problems to be solved

1. Acoustic modelling ⇒ Gaussian Hidden Semi-Markov Models
2. Heterogeneous data ⇒ Multi-Space Distribution
3. Sparseness ⇒ decision tree + multi-stage training process
Training stage
Statistical distribution - Gaussian vectors
Definition - Multivariate Gaussian distribution

N(µ, Σ)    (1)

µ = mean vector
Σ = covariance matrix

Probability density function:

f_{µ,Σ}(X) = (2π)^{−N/2} |Σ|^{−1/2} exp(−½ (X − µ)ᵀ Σ⁻¹ (X − µ))    (2)
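Eq. (2) translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, Sigma) evaluated at x, as in Eq. (2)."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    n = mu.size
    diff = x - mu
    # Normalisation constant (2 pi)^{N/2} |Sigma|^{1/2}
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(sigma))
    # Quadratic form via a linear solve (avoids forming Sigma^{-1} explicitly)
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

p = gaussian_pdf([0.0], [0.0], [[1.0]])  # standard 1-D normal at 0: 1/sqrt(2 pi)
```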
Training stage
Statistical distribution - MSD
MSD = Multi-Space Distribution
Definition
X = (V, S)    (3)

S = a space
V = a value in the space S

Application to F0:

MSD(X) = w₁ · N(V(X); µ, Σ)  if S(X) = {1}  (voiced)
         w₂ · δ₀(V(X))       if S(X) = {0}  (unvoiced)    (4)
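Eq. (4) can be sketched for F0, encoding an unvoiced frame as None; that encoding and the function name are illustrative conventions, not the HTS implementation.

```python
import math

def msd_f0_likelihood(x, w_voiced, mu, var):
    """Observation likelihood under the two-space MSD of Eq. (4):
    a weighted Gaussian in the voiced space, with the remaining mass
    w2 = 1 - w1 on the zero-dimensional unvoiced space."""
    if x is None:                 # unvoiced frame: discrete space {0}
        return 1.0 - w_voiced     # w2 * delta_0 contributes its weight
    # voiced frame: weighted univariate Gaussian density
    g = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * g
```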
Training stage
Markov Models (MM)
http://www.americanscientist.org/issues/pub/2013/2/first-links-in-the-markov-chain/99999
Training stage
Hidden Markov Models (HMM)
Now we cannot observe the weather sequence directly!
Main assumption: there is an underlying phenomenon (a hidden state sequence Q) which can explain our observations.
Go back to our example
[Diagram: 3-state HMM; arc transition probabilities 0.6, 0.3, 0.1, 0.2, 0.3, 0.5, 0.4, 0.1, 0.5.]

Emission probabilities:

         state 1   state 2   state 3
Rainy      0.5       0.2       0.3
Sunny      0.4       0.5       0.0
Cloudy     0.1       0.8       0.1
Training stage
Hidden Markov Models (HMM) - Continuous
Based on the previous HMM
[Diagram: the same 3-state HMM, but each state now emits from a continuous (Gaussian) distribution rather than a discrete table.]
Training stage
Hidden Markov Models (HMM) - the 3 problems
Tutorial = [Rabiner, 1989]
Problem 1: given an observation sequence O = (O_1, ..., O_T) and an HMM λ, what is the probability P(O|λ)? ⇒ forward-backward algorithm.
Problem 2: given an observation sequence O = (O_1, ..., O_T) and an HMM λ, what is the state sequence Q which best explains O according to the model λ? ⇒ Viterbi algorithm.
Problem 3: how to estimate the model parameters (A, B, π)? ⇒ given a set of observation sequences O = {O^1, ..., O^K}, an E.M.-based algorithm (Baum-Welch).
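Problem 1 can be run on the weather example. The emission table is the one from the slide; the assignment of the figure's arc weights to a transition matrix, and the uniform initial distribution π, are assumptions made for illustration.

```python
import numpy as np

# Transition matrix (rows = from-state; arc weights from the figure,
# row assignment assumed) and emission table from the slide.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
B = np.array([[0.5, 0.2, 0.3],   # Rainy,  per state 1..3
              [0.4, 0.5, 0.0],   # Sunny
              [0.1, 0.8, 0.1]])  # Cloudy
pi = np.full(3, 1.0 / 3.0)       # assumed uniform initial distribution

def forward(obs):
    """P(O | lambda) by the forward algorithm (Problem 1)."""
    alpha = pi * B[obs[0]]            # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[o]    # induction over frames
    return float(alpha.sum())         # termination

p = forward([0, 1, 2])  # observation sequence: Rainy, Sunny, Cloudy
```

Replacing the sum over states by a max (and keeping back-pointers) turns this same recursion into the Viterbi algorithm of Problem 2.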
Training stage
Hidden Markov Models (HMM) - Training input (discrete)

Example data

raining - sunny - sunny - sunny - raining - cloudy
sunny - raining - raining - cloudy
cloudy - cloudy
...
Training stage
Hidden Markov Models (HMM) - Training input (continuous)

Example data

0.66693531 - 0.71573471 - 0.44575163 - 0.34354172 - 0.18041787
0.76179169 - 0.10598874
0.22910708 - 0.24211562 - 0.90336061 - 0.72944093
0.60625303 - 0.24092975 - 0.94397027
0.39016014 - 0.32262471 - 0.10743479 - 0.06654687
...
Training stage
Hidden Markov Models (HMM) - Speech
Definition
λ = (A, B, π)

A = {a_ij},     ∀ i, j ∈ [1..S]
B = {b_j(o_t)}, ∀ j ∈ [1..S], t ∈ [1..T]
π = {π_i},      ∀ i ∈ [1..S]

a_ij     = P(q_t = j | q_{t−1} = i), ∀ i, j ∈ [1..S], t ∈ [2..T]
b_j(o_t) = P(o_t | q_t = j),         ∀ j ∈ [1..S], t ∈ [1..T]
π_i      = P(q_1 = i),               ∀ i ∈ [1..S]
Training stage
Hidden Semi-Markov Models (HSMM)
HMM ⇒ geometric state-duration distribution:

Pr(X = k) = (1 − p)^(k−1) · p    (5)

PROBLEM! Optimizing the duration during the synthesis stage yields 1 frame per state!
⇒ use a Gaussian distribution for the duration (synthesis)
⇒ HSMM (training)
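Eq. (5) makes the problem concrete: the geometric pmf is maximal at k = 1, so maximizing duration probability always picks the shortest stay. A quick check (p here is the state-exit probability, i.e. 1 − a_ii):

```python
def geometric_duration_pmf(k, p):
    """Pr(X = k) = (1 - p)^(k-1) * p, the implicit HMM state-duration model (Eq. 5)."""
    return (1.0 - p) ** (k - 1) * p

# Strictly decreasing in k: the most likely duration is always 1 frame,
# which is exactly the synthesis-stage problem the slide points out.
probs = [geometric_duration_pmf(k, 0.3) for k in range(1, 6)]
```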
Training stage
Decision tree + state tying [Young et al., 1994]
Objective: dealing with sparseness. English = 53 descriptive features, German = 70 descriptive features! Far too many distinct contexts for direct statistical training.

[Diagram: binary decision tree, e.g. "C-voiced?" then "N-vowel?", with Yes/No branches leading to tied states.]

Iterative algorithm ⇒ stop/split criterion = Minimum Description Length (based on variance).
Training stage
Training process
Initialisation (MSD-HMM) → Monophone (MSD-HSMM) → Full-context (MSD-HSMM) → Clustering tree + (MSD-HSMM)

(from initialisation to "finalisation")
Synthesis stage
Where we are!
[Diagram: the parametric TTS pipeline again, now entering the online stage: descriptive features → generation from the trained models (HMM, DNN, ...) → S.P.]
Synthesis stage
HMM to produce speech
0. Input = sequence of descriptive features
start-tt-an an-tt-end...
Synthesis stage
HMM to produce speech
1. HMM-phrase building
start-tt-an an-tt-end...
[Diagram: one HMM per label, concatenated into a phrase HMM spanning all segments (Nb segments).]
Synthesis stage
HMM to produce speech
2. Associate distributions [Yoshimura et al., 1999]
[Diagram: for each label (start-tt-an, ..., an-tt-end) and state, decision trees with questions such as "C-Plosive?", "L-start?", "N-Vowel?", "N-Voiced?" select the tied distributions.]
Synthesis stage
HMM to produce speech
3. Acoustic coefficient generation [Tokuda et al., 2000]
Produce C which maximizes P(O|λ) = P(WC|λ).
E.M.-based algorithm [Baum et al., 1970]; hidden sequence = observations.
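Given a fixed state alignment with per-frame means and diagonal variances for static and ∆ features, maximizing P(WC|λ) has the closed form C = (WᵀU⁻¹W)⁻¹WᵀU⁻¹µ. A single-dimension sketch, assuming the ∆ window (−0.5, 0, 0.5):

```python
import numpy as np

def build_W(T):
    """Windowing matrix for static + Delta features: rows interleave
    o_t = (c_t, Delta c_t), Delta window (-0.5, 0, 0.5), edges clipped."""
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] += -0.5
        D[t, min(t + 1, T - 1)] += 0.5
    W = np.zeros((2 * T, T))
    W[0::2] = np.eye(T)   # static rows
    W[1::2] = D           # delta rows
    return W

def generate(mu, var):
    """ML parameter generation: C = (W' U^-1 W)^-1 W' U^-1 mu,
    where mu/var interleave (static, delta) targets per frame."""
    T = len(mu) // 2
    W = build_W(T)
    Uinv = np.diag(1.0 / np.asarray(var, dtype=float))
    A = W.T @ Uinv @ W
    b = W.T @ Uinv @ np.asarray(mu, dtype=float)
    return np.linalg.solve(A, b)
```

With constant static means and zero ∆ means the generated trajectory is that constant; when the static means jump between states, the ∆ terms smooth the transition, which is the whole point of modelling dynamics.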
Synthesis stage
HMM to produce speech
4. Signal synthesis [Zen and Toda, 2005]
Use MLSA Filter [Fukada et al., 1992] + Vocoder [Kawahara et al., 1999]
Adaptation
Information

Why it's great: an average voice is built ⇒ no need for a huge amount of training data.
Flexible (voice, accent, expression, ...) + robust.
Samples
Sample 1 Sample 2
Small introduction to DNN
Why it is well suited

Main advantages: multi-layer models match the layered structure of speech (frames, phones, syllables, words, ...);
really powerful modelling;
current state of the art (Long Short-Term Memory models).

Problem: a HUGE amount of data is needed to train the models;
introspection and analysis are complicated (not achieved yet).
Conclusion
Summary of this talk
2 corpus-based TTS methodologies: unit selection and parametric speech synthesis.
Focus on statistical speech synthesis: which problems this kind of synthesis implies, and which solutions can be provided.
HMM text-to-speech synthesis: HTS (http://hts.sp.nitech.ac.jp), MaryTTS (http://mary.dfki.de/).
Solutions:
acoustic modelling = HMM + MSD
duration modelling = HSMM
data sparseness = decision tree + training process
DYNAMICS!
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics, 41(1):164–171.

Fukada, T., Tokuda, K., Kobayashi, T., and Imai, S. (1992). An adaptive algorithm for mel-cepstral analysis of speech. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 137–140.

Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., and Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1315–1318.

Yamagishi, J., Zen, H., Toda, T., and Tokuda, K. (2007). Speaker-Independent HMM-based Speech Synthesis System - HTS-2007 System for the Blizzard Challenge 2007. In Proceedings of the ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), Bonn, Germany.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), volume 5, pages 2347–2350, Budapest, Hungary.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2005). The HTK Book.

Young, S. J., Odell, J. J., and Woodland, P. C. (1994). Tree-based state tying for high accuracy acoustic modelling. In Proceedings of the Workshop on Human Language Technology (HLT), pages 307–312, Morristown, New Jersey, USA. Association for Computational Linguistics.

Zen, H., Senior, A., and Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7962–7966. IEEE.

Zen, H. and Toda, T. (2005). An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech), Lisbon, Portugal.