Fundamentals & recent advances in HMM-based speech synthesis

Fundamentals & recent advances in HMM-based speech synthesis. Heiga Zen, Toshiba Research Europe Ltd., Cambridge Research Laboratory. FALA 2010, Vigo, Spain, November 10th, 2010.

Description

FALA 2010 keynote, November 2010. Statistical parametric speech synthesis based on HMMs has grown in popularity over recent years. In this talk, its system architecture is outlined, and then the basic techniques used in the system, including algorithms for speech parameter generation from HMMs, are described with simple examples. Its relation to the unit selection approach and recent improvements are summarized. Techniques developed for increasing the flexibility and improving the speech quality are also reviewed.

Transcript of Fundamentals & recent advances in HMM-based speech synthesis

Page 1: Fundamentals & recent advances in HMM-based speech synthesis

Fundamentals & recent advances in HMM-based speech synthesis

Heiga ZEN

Toshiba Research Europe Ltd.

Cambridge Research Laboratory

FALA2010, Vigo, Spain

November 10th, 2010

Page 2: Fundamentals & recent advances in HMM-based speech synthesis

Text-to-speech synthesis (TTS): Text (seq. of discrete symbols) → Speech (continuous time series)

Automatic speech recognition (ASR): Speech (continuous time series) → Text (seq. of discrete symbols)

Machine translation (MT): Text (seq. of discrete symbols) → Text (seq. of discrete symbols)

2

Text-to-speech as a mapping problem

Figure examples: "Dobré ráno" (Czech) → "Good morning" for MT; the text "Good morning" ↔ spoken "Good morning" for TTS/ASR.

Page 3: Fundamentals & recent advances in HMM-based speech synthesis

[Figure: the speech production process, from text (concept) to speech. Air flow from the lungs excites a sound source (voiced: pulse, unvoiced: noise) with a fundamental frequency; the vocal tract imposes frequency transfer characteristics (magnitude, start–end). In other words, speech is the modulation of a carrier wave by speech information: fundamental freq., voiced/unvoiced, freq. trans. char.]

3

Speech production process

Page 4: Fundamentals & recent advances in HMM-based speech synthesis

Grapheme-to-Phoneme

Part-of-Speech tagging

Text normalization

Pause prediction

Prosody prediction

Waveform generation

4

Flow of text-to-speech synthesis

TEXT

Text analysis

SYNTHESIZED SPEECH

Speech synthesis

This talk focuses on the speech synthesis part of TTS

Page 5: Fundamentals & recent advances in HMM-based speech synthesis

Rule-based, formant synthesis (~’90s)

– Based on parametric representation of speech

– Hand-crafted rules to control each phonetic unit

DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]

5

Speech synthesis methods (1)

Block diagram of KlattTalk

Page 6: Fundamentals & recent advances in HMM-based speech synthesis

Corpus-based, concatenative synthesis (’90s~)

– Concatenate small speech units (e.g., phone) from a database

– Large data + automatic learning → high-quality synthetic voices

Single inventory: diphone synthesis [Moulines;'90]

Multiple inventory: unit selection synthesis [Sagisaka;'92, Black;'96]

6

Speech synthesis methods (2)

Page 7: Fundamentals & recent advances in HMM-based speech synthesis

Corpus-based, statistical parametric synthesis (mid ’90s~)

– Large data + automatic training → automatic voice building

– Source-filter model + statistical modeling → flexible to change voice characteristics

Hidden Markov models (HMMs) as its statistical acoustic model

HMM-based speech synthesis (HTS) [Yoshimura;‘02]

7

Speech synthesis methods (3)

[Diagram: feature extraction → model training (training); parameter generation → waveform generation (synthesis).]

Page 8: Fundamentals & recent advances in HMM-based speech synthesis

This talk describes the basic implementation & some recent work in HMM-based speech synthesis research

8

Popularity of HMM-based speech synthesis

Page 9: Fundamentals & recent advances in HMM-based speech synthesis

Outline

9

Fundamentals of HMM-based speech synthesis

– Mathematical formulation

– Implementation of individual components

– Examples demonstrating its flexibility

Recent work

– Quality improvements

– Vocoding

– Acoustic modeling

– Oversmoothing

– Other challenging research topics

Conclusions

Page 10: Fundamentals & recent advances in HMM-based speech synthesis

Mathematical formulation

Statistical parametric speech synthesis

– Training

– Synthesis

Using HMM as generative model

HMM-based speech synthesis (HTS) [Yoshimura;’02]

10
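As a sketch of the standard formulation behind this slide (the notation here is the usual one and may differ from the original): training estimates the acoustic model parameters from speech data $\mathbf{O}$ and the corresponding labels $\mathbf{W}$, and synthesis generates speech parameters for a new label sequence $\mathbf{w}$:

$\hat{\lambda} = \arg\max_{\lambda}\, p(\mathbf{O} \mid \mathbf{W}, \lambda)$   (training)

$\hat{\mathbf{o}} = \arg\max_{\mathbf{o}}\, p(\mathbf{o} \mid \mathbf{w}, \hat{\lambda})$   (synthesis)

with an HMM used as the generative model $p(\cdot \mid \cdot, \lambda)$.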

Page 11: Fundamentals & recent advances in HMM-based speech synthesis

[Block diagram. Training part: speech signals from the SPEECH DATABASE undergo excitation parameter extraction and spectral parameter extraction; the resulting excitation & spectral parameters, together with labels, are used for training HMMs, yielding context-dependent HMMs & state duration models. Synthesis part: TEXT is converted by text analysis into labels; parameter generation from the HMMs produces excitation & spectral parameters, which drive excitation generation and the synthesis filter to produce SYNTHESIZED SPEECH.]

HMM-based speech synthesis system (HTS)

11

Page 12: Fundamentals & recent advances in HMM-based speech synthesis

(Block diagram repeated from Page 11.)

HMM-based speech synthesis system (HTS)

12

Page 13: Fundamentals & recent advances in HMM-based speech synthesis

(Block diagram repeated from Page 11.)

HMM-based speech synthesis system (HTS)

13

Page 14: Fundamentals & recent advances in HMM-based speech synthesis

(Figure repeated from Page 3.)

Speech production process

14

Page 15: Fundamentals & recent advances in HMM-based speech synthesis

Divide speech into frames

15

Speech is a non-stationary signal

… but can be assumed to be quasi-stationary

Divide speech into short-time frames (e.g., 5 ms shift, 25 ms frame length)
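A minimal numpy sketch of this framing step (assumptions: a 16 kHz mono waveform, 5 ms shift and 25 ms frame length as above, Hamming windowing):

    import numpy as np

    def frame_signal(x, sr=16000, shift_ms=5, length_ms=25):
        """Slice a 1-D waveform into overlapping, windowed short-time frames."""
        shift = int(sr * shift_ms / 1000)      # 80 samples at 16 kHz
        length = int(sr * length_ms / 1000)    # 400 samples at 16 kHz
        n_frames = 1 + (len(x) - length) // shift   # assumes len(x) >= length
        frames = np.stack([x[i * shift:i * shift + length] for i in range(n_frames)])
        return frames * np.hamming(length)     # taper each frame before analysis

    frames = frame_signal(np.zeros(16000))     # one second of silence -> (196, 400)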

Page 16: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: source-filter model. Excitation: pulse train (voiced) or white noise (unvoiced) → linear time-invariant system → speech; the excitation (source) part and the spectral (filter) part are related to the speech spectrum via the Fourier transform.]

Source-filter model

16

Page 17: Fundamentals & recent advances in HMM-based speech synthesis

Parametric models of the speech spectrum

– Autoregressive (AR) model

– Exponential (EX) model

ML estimation of the spectral model parameters

– AR model: linear prediction (LP) [Itakura;'70]

– EX model: ML-based cepstral analysis

(standard forms of both models are sketched below)

Spectral (filter) model

17
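For reference, the two model families usually take the following standard forms (the slide's notation may differ); the AR (all-pole) model leads to linear prediction, while the exponential model underlies ML-based (mel-)cepstral analysis:

$H(z) = \dfrac{K}{1 - \sum_{m=1}^{M} a(m)\, z^{-m}}$   (autoregressive model)

$H(z) = \exp \sum_{m=0}^{M} c(m)\, z^{-m}$   (exponential model)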

Page 18: Fundamentals & recent advances in HMM-based speech synthesis

Excitation (source) model

[Diagram: excitation = pulse train (voiced) or white noise (unvoiced).]

18

Excitation model: pulse/noise excitation

– Voiced (periodic): pulse trains

– Unvoiced (aperiodic): white noise

Excitation model parameters (see the sketch below)

– V/UV decision

– Fundamental frequency (F0) for voiced frames
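A minimal sketch of such a pulse/noise excitation generator (frame-level F0 and V/UV inputs are hypothetical; a real vocoder interpolates F0 and handles gain more carefully):

    import numpy as np

    def make_excitation(f0, vuv, sr=16000, shift=80):
        """Pulse train for voiced frames, white noise for unvoiced frames."""
        e = np.zeros(len(f0) * shift)
        next_pulse = 0.0
        for i, (f, v) in enumerate(zip(f0, vuv)):
            start = i * shift
            if v and f > 0:                    # voiced: one impulse every 1/F0 seconds
                period = sr / f
                while next_pulse < start + shift:
                    e[int(next_pulse)] = np.sqrt(period)   # match unit-variance noise power
                    next_pulse += period
            else:                              # unvoiced: white Gaussian noise
                e[start:start + shift] = np.random.randn(shift)
                next_pulse = start + shift
        return e

    # 0.5 s voiced at 120 Hz followed by 0.5 s unvoiced
    exc = make_excitation([120.0] * 100 + [0.0] * 100, [1] * 100 + [0] * 100)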

Page 19: Fundamentals & recent advances in HMM-based speech synthesis

Speech samples

19

Natural speech

Reconstructed speech from extracted parameters (cepstral coefficients & F0 with V/UV decisions)

Quality degrades, but main characteristics are preserved

Page 20: Fundamentals & recent advances in HMM-based speech synthesis

(Block diagram repeated from Page 11.)

HMM-based speech synthesis system (HTS)

20

Page 21: Fundamentals & recent advances in HMM-based speech synthesis

Structure of state-output (observation) vector

Spectrum part

Excitation part

Spectral parameters (e.g., cepstrum, LSPs)

log F0 with V/UV

21

Page 22: Fundamentals & recent advances in HMM-based speech synthesis

Dynamic features

22
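As a concrete example, dynamic (delta) features are typically simple regression coefficients computed from neighbouring static frames (these are the commonly used windows; the exact windows on the slide may differ):

$\Delta c_t = \tfrac{1}{2}\,(c_{t+1} - c_{t-1}), \qquad \Delta^2 c_t = c_{t-1} - 2c_t + c_{t+1}$

and the state-output vector stacks static and dynamic parts, $o_t = [\,c_t^\top, \Delta c_t^\top, \Delta^2 c_t^\top\,]^\top$.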

Page 23: Fundamentals & recent advances in HMM-based speech synthesis

HMM-based modeling

[Diagram: sentence HMM for the transcription "sil a i sil", built by concatenating 3-state phoneme HMMs; the hidden state sequence (1 1 2 3 3 …) generates the observation sequence.]

23

Page 24: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: the observation vector is split into streams: stream 1 = spectrum (cepstrum or LSP, & dynamic features); streams 2–4 = excitation (log F0 & dynamic features).]

Multi-stream HMM structure

24

Page 25: Fundamentals & recent advances in HMM-based speech synthesis

[Plot: observed log F0 versus time: continuous values in voiced regions, undefined in unvoiced regions.]

Cannot be modeled by an ordinary continuous or discrete distribution

Observation of F0

25

Page 26: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: for each state (1, 2, 3), the F0 observation space is split into a voiced space and an unvoiced space with voiced/unvoiced weights.]

Multi-space probability distribution (MSD)

26
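A sketch of the resulting state-output probability for the F0 stream (the standard MSD form; notation may differ from the slide): each state $j$ has voiced/unvoiced weights with $w_j^{\mathrm{v}} + w_j^{\mathrm{u}} = 1$, a Gaussian on the voiced (continuous log F0) space, and a point probability for the unvoiced symbol:

$b_j(o_t) = \begin{cases} w_j^{\mathrm{v}}\,\mathcal{N}(o_t;\, \mu_j, \sigma_j^2) & \text{if } o_t \text{ is a log F0 value (voiced)} \\ w_j^{\mathrm{u}} & \text{if } o_t \text{ is the unvoiced symbol} \end{cases}$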

Page 27: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: stream 1 (spectral params): single Gaussian; streams 2, 3, 4 (log F0 & dynamics): MSD (Gaussian for the voiced space & a discrete probability for the unvoiced space).]

Structure of state-output distributions

27

Page 28: Fundamentals & recent advances in HMM-based speech synthesis

[Flowchart: data & labels → compute variance floor → initialize CI-HMMs (monophone, context-independent) by segmental k-means → re-estimate CI-HMMs by the EM algorithm → copy CI-HMMs to CD-HMMs (full-context, context-dependent) → re-estimate CD-HMMs by the EM algorithm → decision tree-based clustering → re-estimate CD-HMMs by the EM algorithm → untie parameter-tying structure → estimate CD duration models from forward-backward (FB) statistics → decision tree-based clustering → estimated HMMs & estimated duration models.]

Training process

29

Page 29: Fundamentals & recent advances in HMM-based speech synthesis

HMM-based modeling

[Diagram: sentence HMM built by concatenating context-dependent phone HMMs for the transcription "sil sil-a+i a-i+sil sil"; the hidden state sequence generates the observation sequence.]

30

Page 30: Fundamentals & recent advances in HMM-based speech synthesis

Phoneme: current phoneme; {preceding, succeeding} two phonemes

Syllable: # of phonemes in {preceding, current, succeeding} syllable; {accent, stress} of {preceding, current, succeeding} syllable; position of current syllable in current word; # of {preceding, succeeding} {accented, stressed} syllables in current phrase; # of syllables {from previous, to next} {accented, stressed} syllable; vowel within current syllable

Word: part of speech of {preceding, current, succeeding} word; # of syllables in {preceding, current, succeeding} word; position of current word in current phrase; # of {preceding, succeeding} content words in current phrase; # of words {from previous, to next} content word

Phrase: # of syllables in {preceding, current, succeeding} phrase

…

Huge # of combinations ⇒ difficult to cover all possible context combinations with separate models

Context-dependent modeling

31

Page 31: Fundamentals & recent advances in HMM-based speech synthesis

(Training-process flowchart repeated from Page 28.)

Training process

32

Page 32: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: a binary decision tree of context questions (R=silence?, L="gy"?, C=voiced?, L="w"?, R=silence?) with yes/no branches; full-context models such as k-a+b/A:…, t-e+h/A:…, w-a+t/A:…, w-i+sil/A:…, w-o+sh/A:…, gy-e+sil/A:…, gy-a+pau/A:…, g-u+pau/A:… are clustered into the leaf nodes, whose distributions are shared by the synthesized states.]

Decision tree-based context clustering [Odell;’95]

33
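A minimal sketch of how such a tree is used at synthesis time: starting at the root, context questions are answered yes/no until a leaf (a tied, trained distribution) is reached, so even unseen full-context labels map to a model. The questions and leaf names below are illustrative only, not the actual HTS question set:

    # Internal node: (question, yes_child, no_child); leaves: names of tied distributions.
    def is_silence(ph):
        return ph in ("sil", "pau")

    tree = (lambda c: is_silence(c["R"]),                        # R = silence?
            (lambda c: c["L"] == "gy", "leaf_A", "leaf_B"),      # L = "gy"?
            (lambda c: c["C"] in "aeiou", "leaf_C", "leaf_D"))   # C = voiced? (toy vowel check)

    def find_leaf(node, context):
        """Descend the tree by answering context questions until a leaf is reached."""
        if isinstance(node, str):
            return node
        question, yes_child, no_child = node
        return find_leaf(yes_child if question(context) else no_child, context)

    # An unseen context such as w-a+t still receives a trained leaf distribution
    print(find_leaf(tree, {"L": "w", "C": "a", "R": "t"}))       # -> leaf_C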

Page 33: Fundamentals & recent advances in HMM-based speech synthesis

Spectrum & excitation have different context dependencies

→ Build decision trees separately for each stream

[Diagram: separate decision trees for mel-cepstrum and for F0.]

Stream-dependent clustering

34

Page 34: Fundamentals & recent advances in HMM-based speech synthesis

(Training-process flowchart repeated from Page 28.)

Training process

35

Page 35: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: state i occupied from frame t0 to t1 along the time axis (1 … T); such occupancy statistics are used to estimate the state-duration distribution.]

Estimation of state duration models [Yoshimura;’98]

36

Page 36: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: for each HMM, separate decision trees for mel-cepstrum, for F0, and for the state-duration model.]

Stream-dependent clustering

37

Page 37: Fundamentals & recent advances in HMM-based speech synthesis

(Training-process flowchart repeated from Page 28.)

Training process

38

Page 38: Fundamentals & recent advances in HMM-based speech synthesis

(Block diagram repeated from Page 11.)

HMM-based speech synthesis system (HTS)

39

Page 39: Fundamentals & recent advances in HMM-based speech synthesis

Composition of sentence HMM for given text

TEXT

Text analysis

context-dependent label sequence

G2P

POS tagging

Text normalization

Pause prediction

Compose the sentence HMM from the given context-dependent labels

This sentence HMM gives the probability distribution of speech parameters for the given text

Page 40: Fundamentals & recent advances in HMM-based speech synthesis

Speech parameter generation algorithm

41

Page 41: Fundamentals & recent advances in HMM-based speech synthesis

Determine the state sequence by determining the state durations

[Diagram: state durations of 4, 10, and 5 frames for states 1, 2, 3 expand into the state sequence (1 1 1 1 … 2 2 … 3 3) aligned with the observation sequence.]

Determination of state sequence (1)

42

Page 42: Fundamentals & recent advances in HMM-based speech synthesis

Determination of state sequence (2)

43
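A sketch of the usual approach (as in [Yoshimura;'98]-style duration modeling; details may differ from the slide): with a Gaussian duration model $\mathcal{N}(m_k, \sigma_k^2)$ for each state $k$, durations are set to

$\hat{d}_k = m_k + \rho\,\sigma_k^2$

where $\rho = 0$ simply uses the mean durations, and for a desired total length of $T$ frames (e.g., to control the speaking rate) $\rho = \bigl(T - \sum_k m_k\bigr) \big/ \sum_k \sigma_k^2$.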

Page 43: Fundamentals & recent advances in HMM-based speech synthesis

[Plot: state-duration probability versus duration (1–8); the geometric duration distribution implied by HMM self-transitions decays monotonically, whereas a Gaussian duration model peaks around its mean.]

Determination of state sequence (3)

44

Page 44: Fundamentals & recent advances in HMM-based speech synthesis

45

Speech parameter generation algorithm

Page 45: Fundamentals & recent advances in HMM-based speech synthesis

Without dynamic features, the generated parameters are step-wise: each state simply outputs its mean values.

[Plot: state means and variances with the resulting step-wise parameter sequence.]

46

Page 46: Fundamentals & recent advances in HMM-based speech synthesis

Speech parameter vectors include both static & dynamic features: o_t = [c_t^T, Δc_t^T, Δ²c_t^T]^T

The relationship between o = [o_1^T, …, o_T^T]^T and c = [c_1^T, …, c_T^T]^T can be arranged in matrix form as

o = W c

where W is a sparse matrix of window (regression) coefficients (an explicit example is given below).

47
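For the commonly used windows ($\Delta c_t = \tfrac{1}{2}(c_{t+1}-c_{t-1})$, $\Delta^2 c_t = c_{t-1} - 2c_t + c_{t+1}$, assumed here for illustration), each frame $t$ contributes three block rows to $W$:

$\begin{bmatrix} c_t \\ \Delta c_t \\ \Delta^2 c_t \end{bmatrix} = \begin{bmatrix} \cdots & 0 & I & 0 & \cdots \\ \cdots & -\tfrac{1}{2}I & 0 & \tfrac{1}{2}I & \cdots \\ \cdots & I & -2I & I & \cdots \end{bmatrix} \begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}$

so $W$ is a sparse band matrix of size $3TM \times TM$ for $M$-dimensional static features.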

Page 47: Fundamentals & recent advances in HMM-based speech synthesis

48

Speech parameter generation algorithm

Page 48: Fundamentals & recent advances in HMM-based speech synthesis

Solution

49
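A sketch of the standard closed-form solution for a fixed state sequence $q$ (notation may differ from the slide): maximizing $\mathcal{N}(Wc;\, \mu_q, \Sigma_q)$ with respect to $c$ gives the linear system

$W^\top \Sigma_q^{-1} W\, \hat{c} \;=\; W^\top \Sigma_q^{-1}\, \mu_q$

where $\mu_q$ and $\Sigma_q$ stack the state-output means and (diagonal) covariances along $q$; since $W^\top \Sigma_q^{-1} W$ is banded, $\hat{c}$ can be computed efficiently (e.g., by Cholesky decomposition).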

Page 49: Fundamentals & recent advances in HMM-based speech synthesis

[Plot: state means and variances of the static and dynamic features, with the generated speech parameter trajectory.]

Generated speech parameter trajectory

50
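A minimal numpy sketch of this generation step for a single static dimension with one delta window (a toy illustration of the normal equations above, not the HTS implementation; means and variances are hypothetical per-frame state statistics):

    import numpy as np

    def mlpg_1d(means, variances):
        """ML parameter generation for [static, delta] features of one dimension.

        means, variances: (T, 2) arrays of per-frame statistics for the static
        coefficient and its delta, with delta_t = 0.5 * (c[t+1] - c[t-1]).
        Returns the smooth static trajectory c of length T.
        """
        T = len(means)
        W = np.zeros((2 * T, T))
        for t in range(T):
            W[2 * t, t] = 1.0                    # static row
            if t > 0:                            # delta row (boundaries truncated)
                W[2 * t + 1, t - 1] = -0.5
            if t < T - 1:
                W[2 * t + 1, t + 1] = 0.5
        mu = means.reshape(-1)                   # interleaved static/delta targets
        prec = 1.0 / variances.reshape(-1)       # diagonal inverse covariances
        A = W.T @ (prec[:, None] * W)            # W' Sigma^-1 W
        b = W.T @ (prec * mu)                    # W' Sigma^-1 mu
        return np.linalg.solve(A, b)

    # Two "states" with different static means; delta targets are zero.
    means = np.array([[0.0, 0.0]] * 10 + [[1.0, 0.0]] * 10)
    variances = np.array([[0.1, 0.01]] * 20)
    c = mlpg_1d(means, variances)                # smooth transition instead of a step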

Page 50: Fundamentals & recent advances in HMM-based speech synthesis

[Spectrograms (0–5 kHz) of the utterance "sil a i sil":]

Generated spectra

51

w/o dynamic features

w/ dynamic features

Page 51: Fundamentals & recent advances in HMM-based speech synthesis

(Block diagram repeated from Page 11.)

HMM-based speech synthesis system (HTS)

52

Page 52: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: generated excitation parameters (log F0 with V/UV) drive a pulse-train (voiced) / white-noise (unvoiced) excitation, which is filtered by a linear time-invariant system controlled by the generated spectral parameters (cepstrum, LSP) to produce the synthesized speech.]

Source-filter model

53

Page 53: Fundamentals & recent advances in HMM-based speech synthesis

Speech samples

55

w/o dynamic features

w/ dynamic features

Use of dynamic features can reduce discontinuity

Page 54: Fundamentals & recent advances in HMM-based speech synthesis

Outline

56

Fundamentals of HMM-based speech synthesis

– Mathematical formulation

– Implementation of individual components

– Examples demonstrating its flexibility

Recent work

– Quality improvements

– Vocoding

– Acoustic modeling

– Oversmoothing

– Other challenging research topics

Conclusions

Page 55: Fundamentals & recent advances in HMM-based speech synthesis

Adaptation / adaptive training of HMMs

– Originally developed in ASR, but works very well in TTS

– Average voice model (AVM)-based speech synthesis [Yamagishi;'06]

– Requires only a small amount of target speaker / speaking-style data

→ More efficient creation of new voices

[Diagram: training speakers → adaptive training → average-voice model → adaptation → target speakers.]

Adaptation (mimic voice)

57

Page 56: Fundamentals & recent advances in HMM-based speech synthesis

Speaker adaptation

– VIP voice

– Child voice

Style (expressiveness) adaptation (in Japanese)

– Joyful

– Sad

– Rough

It is difficult to collect a large amount of such data; the adaptation framework gives a way to build TTS voices from a small database.

58

Adaptation demo

Some samples are from http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Page 57: Fundamentals & recent advances in HMM-based speech synthesis

Interpolate parameters among representative HMM sets

– Can obtain new voices even when no adaptation data is available

– Gradually change speakers & speaking styles [Yoshimura;'97, Tachibana;'05]

Interpolation (mixing voice)

59

Page 58: Fundamentals & recent advances in HMM-based speech synthesis

Speaker interpolation (in Japanese)

– Male ↔ Female

Style interpolation

– Neutral ↔ Angry

– Neutral ↔ Happy

60

Interpolation demo

Samples are from http://www.sp.nitech.ac.jp/ & http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Page 59: Fundamentals & recent advances in HMM-based speech synthesis

Outline

63

Fundamentals of HMM-based speech synthesis

– Mathematical formulation

– Implementation of individual components

– Examples demonstrating its flexibility

Recent work

– Quality improvements

– Vocoding

– Acoustic modeling

– Oversmoothing

– Other challenging research topics

Conclusions

Page 60: Fundamentals & recent advances in HMM-based speech synthesis

Drawbacks of HTS

Quality of synthesized speech

– Buzzy

– Flat

– Muffled

Three major factors degrade the quality:

– Poor vocoding

how to parameterize speech?

– Inaccurate acoustic modeling

how to model extracted speech parameter trajectories?

– Over-smoothing

how to recover generated speech parameter trajectories?

Recent work has drastically improved the quality of HTS

64

Page 61: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: excitation e(n): pulse train during voiced segments, white noise during unvoiced segments.]

Problem in vocoding

Binary pulse or noise excitation

– Voiced (periodic): pulse trains

– Unvoiced (aperiodic): white noise

Difficult to model a mixture of voiced & unvoiced sounds (e.g., voiced fricatives)

→ Results in buzziness in the synthesized speech

65

Page 62: Fundamentals & recent advances in HMM-based speech synthesis

Limitations of HMMs for modeling speech parameters

– Piece-wise constant statistics

Statistics do not vary within an HMM state

[Plot: the 2nd mel-cepstral coefficient c(2) over the utterance /sil/ /a/ /i/ /sil/, with the piece-wise constant state means and variances overlaid.]

Problem in acoustic modeling

66

Page 63: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: graphical model with hidden states q1, q2, q3 and observations o1, o2, o3; each observation depends only on the current state.]

Problem in acoustic modeling

Limitations of HMMs for modeling speech parameters

– Piece-wise constant statistics

Statistics do not vary within an HMM state

– Frame-wise conditional independence assumption

State output probability depends only on the current state

67

Page 64: Fundamentals & recent advances in HMM-based speech synthesis

[Plot: state-duration probability versus duration, decaying geometrically.]

68

Problem in acoustic modeling

Limitations of HMMs for modeling speech parameters

– Piece-wise constant statistics

Statistics do not vary within an HMM state

– Frame-wise conditional independence assumption

State output probability depends only on the current state

– Weak duration modeling

State duration probability decreases exponentially with time

Page 65: Fundamentals & recent advances in HMM-based speech synthesis

69

Problem in acoustic modeling

Limitations of HMMs for modeling speech parameters

– Piece-wise constant statistics

Statistics do not vary within an HMM state

– Frame-wise conditional independence assumption

State output probability depends only on the current state

– Weak duration modeling

State duration probability decreases exponentially with time

None of them hold for real speech data

Speech parameters are generated from the acoustic models

→ A better acoustic model may produce better speech

Page 66: Fundamentals & recent advances in HMM-based speech synthesis

[Spectrograms (0–8 kHz): natural speech vs. generated speech.]

70

Over-smoothing problem

Speech parameter generation algorithm

– Dynamic features make the generated parameters smooth

– Sometimes too smooth → muffled speech

Why?

– Fine structure of the spectra is removed by the averaging effect

– Model underfitting

Page 67: Fundamentals & recent advances in HMM-based speech synthesis

71

Nitech-HTS 2005 for Blizzard Challenge

Blizzard Challenge

– Open evaluation of TTS

– Common database

– Held since 2005

Nitech-HTS 2005

– Achieved the best naturalness & intelligibility

– Still used as a calibration system in the Blizzard Challenge

– Includes various enhancements:

– Advanced vocoding: high-quality vocoder STRAIGHT

– Better acoustic modeling: hidden semi-Markov model (HSMM)

– Oversmoothing compensation: speech parameter generation algorithm considering GV

Page 68: Fundamentals & recent advances in HMM-based speech synthesis

[Diagram: STRAIGHT. Analysis: waveform → F0 extraction (fixed-point analysis) and F0-adaptive spectral smoothing in the time-frequency region → F0, smoothed spectrum, aperiodic factors. Synthesis: mixed excitation with phase manipulation → synthetic waveform.]

STRAIGHT analysis

72

Page 69: Fundamentals & recent advances in HMM-based speech synthesis

[Plot: power (dB) versus frequency (0–8 kHz), comparing the FFT power spectrum, FFT + mel-cepstral analysis, and STRAIGHT + mel-cepstral analysis.]

Mel-cepstrum from STRAIGHT spectrum

73

Page 70: Fundamentals & recent advances in HMM-based speech synthesis

Examples of generated excitation signals

pulse/noise

STRAIGHT

75

Page 71: Fundamentals & recent advances in HMM-based speech synthesis

76

HSMM-based modeling

Hidden semi-Markov model (HSMM)

– Can be viewed as an HMM with explicit state-duration probability distributions

– State-output, state-duration, & state-transition probability distributions can be estimated simultaneously

Page 72: Fundamentals & recent advances in HMM-based speech synthesis

Generated trajectory has smaller GV

[Plot: 2nd mel-cepstral coefficient versus time (0–3 sec) for natural and generated speech.]

Parameter generation considering GV

77

Global variance (GV): intra-utterance variance of the speech parameters

Page 73: Fundamentals & recent advances in HMM-based speech synthesis

Objective function of standard parameter generation

Objective function of parameter generation with GV

– The 1st term is the same as the standard objective

– The 2nd term works as a penalty against oversmoothing (one common form is sketched below)

78

Parameter generation considering GV
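A sketch of one common form of this objective (following the GV idea; the weighting and exact form may differ from the slide): the standard term is augmented with the log-likelihood of the global variance $v(c)$ of the generated static features under a Gaussian GV model,

$\mathcal{L}(c) \;=\; \log \mathcal{N}\bigl(Wc;\, \mu_q, \Sigma_q\bigr) \;+\; \omega \,\log \mathcal{N}\bigl(v(c);\, \mu_v, \Sigma_v\bigr)$

where $v(c)$ holds the per-utterance variance of each static coefficient and $\omega$ balances the two terms; because $v(c)$ is quadratic in $c$, the maximization is carried out numerically (e.g., by gradient ascent).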

Page 74: Fundamentals & recent advances in HMM-based speech synthesis

[Bar chart: 5-point MOS for eight configurations: 24 mcep. HMM non-GV, 24 ST. HMM non-GV, 24 mcep. HSMM non-GV, 24 ST. HSMM non-GV, 39 ST. HSMM non-GV, 24 mcep. HSMM GV, 24 ST. HSMM GV, 39 ST. HSMM GV (mcep. = mel-cepstrum, ST. = STRAIGHT, 24/39 = analysis order). MOS values shown on the chart: 2.0, 3.1, 2.0, 3.2, 3.2, 3.1, 4.7, 4.7.]

Evaluations

80

Page 75: Fundamentals & recent advances in HMM-based speech synthesis

Recent works to improve quality

Vocoding

– MELP-style / CELP-style excitation

– LF model

– Sinusoidal models

Acoustic model

– Segment models, trajectory models

– Model combination (product of experts)

– Minimum generation error training

– Bayesian modeling

Oversmoothing

– Pre- & post-filtering

– Improvements of GV

– Hybrid approaches

& more…

81

Page 76: Fundamentals & recent advances in HMM-based speech synthesis

Other challenging topics

Keynote speech by Simon King (CSTR) at SSW7 this year

Speech synthesis is easy, if ...

• voice is built offline & carefully checked for errors

• speech is recorded in clean conditions

• word transcriptions are correct

• accurate phonetic labels are available or can be obtained

• speech is in the required language & speaking style

• speech is from a suitable speaker

• a native speaker is available, preferably a linguist

Speech synthesis is not easy if we don't have the right data

82

Page 77: Fundamentals & recent advances in HMM-based speech synthesis

Other challenging topics

Non-professional speakers

• AVM + adaptation (CSTR)

Too little speech data

• VTLN-based rapid speaker adaptation (Titech, IDIAP)

Noisy recordings

• Spectral subtraction & AVM + adaptation (CSTR)

No labels

• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)

Insufficient knowledge of the language or accent

• Letter (grapheme)-based synthesis (CSTR)

• No prosodic contexts (CSTR, Titech)

Wrong language

• Cross-lingual speaker adaptation (MSRA, EMIME)

• Speaker & language adaptive training (Toshiba)

83

Page 78: Fundamentals & recent advances in HMM-based speech synthesis

Outline

84

Fundamentals of HMM-based speech synthesis

– Mathematical formulation

– Implementation of individual components

– Examples demonstrating its flexibility

Recent work

– Vocoding

– Acoustic modeling

– Oversmoothing

– Other challenging research topics

Conclusions

Page 79: Fundamentals & recent advances in HMM-based speech synthesis

HMM-based speech synthesis

– Getting popular in the speech synthesis area

– Implementation

– Flexible to change its voice characteristics

– Its quality is improving

– There are many challenging research topics

Summary

85

Page 80: Fundamentals & recent advances in HMM-based speech synthesis

TTS research has something attractive for you!!

86

text analysis

vocoding

statistical modeling

Text analysis

• Part-of-speech tagging, Grapheme-to-Phoneme conversion, etc.

Vocoding

• Better signal processing, better speech representation

Statistical modeling

• Robust modeling in ASR, knowledge from machine learning

Natural language processing

Signal processing, speech coding

ASR, machine learning

Page 81: Fundamentals & recent advances in HMM-based speech synthesis

87

Please join TTS research!

Thanks!