Fundamentals & recent advances in HMM-based speech synthesis
Heiga ZEN
Toshiba Research Europe Ltd.
Cambridge Research Laboratory
FALA2010, Vigo, Spain
November 10th, 2010
Text-to-speech synthesis (TTS): Text (sequence of discrete symbols) → Speech (continuous time series)
Automatic speech recognition (ASR): Speech (continuous time series) → Text (sequence of discrete symbols)
Machine translation (MT): Text (sequence of discrete symbols) → Text (sequence of discrete symbols)
2
Text-to-speech as a mapping problem
Dobré ráno Good morning
Good morning
Good morning
text (concept)
speech
[Figure: speech production — air flow; sound source (voiced: pulse, unvoiced: noise); frequency transfer characteristics; fundamental frequency; magnitude start-end. Speech is modulation of a carrier wave by speech information: fundamental freq., voiced/unvoiced, freq. transfer char.]
3
Speech production process
Grapheme-to-Phoneme
Part-of-Speech tagging
Text normalization
Pause prediction, Prosody prediction
Waveform generation
4
Flow of text-to-speech synthesis
TEXT
Text analysis
SYNTHESIZED SPEECH
Speech synthesis
This talk focuses on the speech synthesis part of TTS
Rule-based, formant synthesis (~’90s)
– Based on parametric representation of speech
– Hand-crafted rules to control phonetic unit
DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]
5
Speech synthesis methods (1)
Block diagram of KlattTalk
Corpus-based, concatenative synthesis (’90s~)
– Concatenate small speech units (e.g., phone) from a database
– Large data + automatic learning → high-quality synthetic voices
Single inventory: diphone synthesis [Moulines;'90]
Multiple inventory: unit selection synthesis [Sagisaka;'92, Black;'96]
6
Speech synthesis methods (2)
Corpus-based, statistical parametric synthesis (mid ’90s~)
– Large data + automatic training → automatic voice building
– Source-filter model + statistical modeling → flexible to change voice characteristics
Hidden Markov models (HMMs) as its statistical acoustic model
HMM-based speech synthesis (HTS) [Yoshimura;‘02]
7
Speech synthesis methods (3)
Parameter generation
Model training
Feature extraction
Waveform generation
This talk gives the basic implementations & some recent work in HMM-based speech synthesis research
8
Popularity of HMM-based speech synthesis
Outline
9
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Mathematical formulation
Statistical parametric speech synthesis
– Training: estimate model parameters λ̂ = arg maxλ p(O | W, λ) from speech data O and transcriptions W
– Synthesis: generate speech parameters ô = arg maxo p(o | w, λ̂) for a given text w
Using HMM as generative model
HMM-based speech synthesis (HTS) [Yoshimura;’02]
10
[Block diagram: Training part — speech database → excitation parameter extraction & spectral parameter extraction → excitation parameters, spectral parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation; spectral parameters → synthesis filter → SYNTHESIZED SPEECH.]
HMM-based speech synthesis system (HTS)
11
[Speech production figure repeated.]
Speech production process
14
Divide speech into frames
15
Speech is a non-stationary signal
… but can be assumed to be quasi-stationary
Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
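As a rough illustration, this framing step can be sketched as follows (a minimal numpy sketch; the 25 ms length and 5 ms shift follow the slide, while the 16 kHz sampling rate and function name are assumptions):

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=25, shift_ms=5):
    """Split a 1-D signal into overlapping short-time frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 400 samples
    shift = int(sample_rate * shift_ms / 1000)       # 5 ms  -> 80 samples
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

# one second of a dummy signal -> 196 frames of 400 samples each
frames = frame_signal(np.zeros(16000))
```

Each frame is then treated as quasi-stationary and analyzed independently.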
lineartime-invariant
system
excitationpulse train(voiced)
white noise(unvoiced)
speech
Excitation (source) part Spectral (filter) part
Fourier transform
Source-filter model
16
Parametric models of the speech spectrum
Autoregressive (AR) model Exponential (EX) model
ML estimation of spectral model parameters
– AR model → linear prediction (LP) [Itakura;'70]
– EX model → ML-based cepstral analysis
Spectral (filter) model
17
Excitation (source) model
excitationpulse train
white noise
18
Excitation model: pulse/noise excitation
– Voiced (periodic): pulse trains
– Unvoiced (aperiodic): white noise
Excitation model parameters
– V/UV decision
– Fundamental frequency (F0) for voiced frames
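A minimal sketch of this pulse/noise excitation scheme (the function name, frame shift, and noise scale are illustrative assumptions; real vocoders place pulses with sub-sample precision):

```python
import numpy as np

def generate_excitation(f0, vuv, sample_rate=16000, shift=80):
    """Pulse train for voiced frames, white noise for unvoiced frames."""
    e = np.zeros(len(f0) * shift)
    next_pulse = 0.0
    for t, (f, v) in enumerate(zip(f0, vuv)):
        start = t * shift
        if v:
            period = sample_rate / f          # pulse spacing in samples
            while next_pulse < start + shift:
                if next_pulse >= start:
                    e[int(next_pulse)] = 1.0  # place a unit pulse
                next_pulse += period
        else:
            e[start:start + shift] = 0.1 * np.random.randn(shift)
            next_pulse = start + shift        # restart pulses after noise
    return e
```

For example, ten voiced frames at F0 = 100 Hz give pulses spaced 160 samples apart.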
Speech samples
19
Natural speech
Reconstructed speech from extracted parameters (cepstral coefficients & F0 with V/UV decisions)
Quality degrades, but main characteristics are preserved
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Excitation
Spectralparameters
Speech signal Training part
Synthesis part
Training HMMs
Spectralparameters
Excitationparameters
Labels
HMM-based speech synthesis system (HTS)
20
Structure of state-output (observation) vector
Spectrum part
Excitation part
Spectral parameters(e.g., cepstrum, LSPs)
log F0 with V/UV
21
Dynamic features
22
HMM-based modeling
[Figure: HMM with states 1…N generating an observation sequence; state sequence e.g. 1 1 2 3 3 … N]
23
Sentence HMM
Transcription: sil a i sil (3 states per phoneme model)
Stream 1: spectrum (cepstrum or LSP, & dynamic features)
Streams 2-4: excitation (log F0 & dynamic features)
Multi-stream HMM structure
24
[Figure: log F0 contour over time; F0 is undefined in unvoiced regions.]
F0 is a continuous value in voiced regions but undefined in unvoiced regions → unable to model with a purely continuous or purely discrete distribution
Observation of F0
25
[Figure: 3-state HMM; each state output covers voiced and unvoiced spaces with voiced/unvoiced weights.]
Multi-space probability distribution (MSD)
26
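The idea behind the MSD can be sketched as follows: a discrete weight covers the zero-dimensional unvoiced space, and a weighted Gaussian density covers the voiced (log F0) space. This is an illustrative simplification, not the HTS implementation:

```python
import math

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is None for an unvoiced frame, a log-F0 value for a voiced one."""
    if obs is None:                # zero-dimensional unvoiced space
        return 1.0 - w_voiced
    gauss = math.exp(-(obs - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss        # voiced space: weighted Gaussian density
```

Both branches are probabilities under one distribution, so the voiced weight and the Gaussian parameters can be reestimated jointly by EM.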
Stream 1 (spectral params): single Gaussian
Streams 2-4 (log F0): MSD (Gaussian & discrete)
Structure of state-output distributions
27
data & labels → compute variance floor → initialize CI-HMMs by segmental k-means → reestimate CI-HMMs by EM algorithm (monophone; context-independent, CI) → copy CI-HMMs to CD-HMMs (full context; context-dependent, CD) → reestimate CD-HMMs by EM algorithm → decision tree-based clustering → reestimate CD-HMMs by EM algorithm → untie parameter tying structure → estimate CD duration models from forward-backward stats → decision tree-based clustering → estimated duration models & estimated HMMs
Training process
29
HMM-based modeling
[Figure repeated from slide 23.]
30
Sentence HMM
Transcription: sil sil-a+i a-i+sil sil (3 states per model)
Phoneme: current phoneme; {preceding, succeeding} two phonemes
Syllable: # of phonemes in {preceding, current, succeeding} syllable; {accent, stress} of {preceding, current, succeeding} syllable; position of current syllable in current word; # of {preceding, succeeding} {accented, stressed} syllables in current phrase; # of syllables {from previous, to next} {accented, stressed} syllable; vowel within current syllable
Word: part of speech of {preceding, current, succeeding} word; # of syllables in {preceding, current, succeeding} word; position of current word in current phrase; # of {preceding, succeeding} content words in current phrase; # of words {from previous, to next} content word
Phrase: # of syllables in {preceding, current, succeeding} phrase
…..
Huge # of combinations ⇒ difficult to have models for all of them
Context-dependent modeling
31
[Training process flowchart repeated.]
Training process
32
[Figure: decision tree over full-context labels (k-a+b/A:…, t-e+h/A:…, w-a+t/A:…, w-i+sil/A:…, w-o+sh/A:…, gy-e+sil/A:…, gy-a+pau/A:…, g-u+pau/A:…). Internal nodes ask yes/no questions — R=silence?, L="w"?, L="gy"?, C=voiced? — and leaf nodes hold the clustered (synthesized) states.]
Decision tree-based context clustering [Odell;’95]
33
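Looking up the tied state for an unseen full-context label is then a simple tree walk. The toy tree below mirrors the questions on the slide but is hypothetical in shape; real trees are grown greedily by maximizing a likelihood gain criterion [Odell;'95]:

```python
# A label here is the triple (left phoneme, centre phoneme, right phoneme)
# of a context-dependent model; leaves name the tied (shared) states.
tree = ("R", {"sil"},                        # question: R = silence?
        ("L", {"w"}, "leaf_1", "leaf_2"),    # yes branch: L = "w"?
        ("C", {"a", "e", "i", "o", "u"},     # no branch: C = voiced? (toy set)
         "leaf_3", "leaf_4"))

def find_leaf(node, label):
    """Walk yes/no questions until a leaf (tied state) is reached."""
    if isinstance(node, str):                # leaf node
        return node
    key, values, yes, no = node
    pos = {"L": 0, "C": 1, "R": 2}[key]
    return find_leaf(yes if label[pos] in values else no, label)
```

For example, the label ("w", "a", "sil") descends yes/yes to leaf_1; this is how unseen contexts are mapped to trained parameters at synthesis time.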
Spectrum & excitation have different context dependency → build decision trees separately
Decision trees for mel-cepstrum; decision trees for F0
Stream-dependent clustering
34
[Training process flowchart repeated.]
Training process
35
[Figure: occupancy statistics over frames t0…t1 of state i in an utterance of T frames.]
Estimation of state duration models [Yoshimura;’98]
36
Decision trees for mel-cepstrum and for F0 (HMM streams); a separate decision tree for state duration models
Stream-dependent clustering
37
[Training process flowchart repeated.]
Training process
38
[HTS block diagram repeated.]
HMM-based speech synthesis system (HTS)
39
Composition of the sentence HMM for a given text
TEXT
Text analysis
context-dependent label sequence
G2P
POS tagging
Text normalization
Pause prediction
Compose the sentence HMM from the given labels; this sentence HMM gives p(o | w, λ̂)
Speech parameter generation algorithm
41
[Figure: HMM states 1 2 3; state durations 4, 10, 5 yield state sequence 1 1 1 1 2 2 … 3 3 over the observation sequence.]
Determine the state sequence by determining state durations
Determination of state sequence (1)
42
Determination of state sequence (2)
43
[Figure: state-duration probability vs. duration (1-8), geometric vs. Gaussian duration models.]
Determination of state sequence (3)
44
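The contrast in that figure can be sketched numerically: an HMM self-transition probability a implies a geometric duration distribution P(d) = a^(d−1)(1 − a), which always peaks at d = 1, while an explicit Gaussian duration model can peak at a typical duration. The parameter values below are illustrative assumptions:

```python
import math

def geometric_dur(d, a=0.6):
    """Duration mass implied by self-transition probability a."""
    return (a ** (d - 1)) * (1 - a)

def gaussian_dur(d, mean=4.0, var=2.0):
    """Explicit Gaussian duration density, peaking at the typical duration."""
    return math.exp(-(d - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

The geometric mass decays monotonically from d = 1, whereas the Gaussian peaks at d = 4 here, which better matches observed phone durations.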
45
Speech parameter generation algorithm
Without dynamic features, generation yields step-wise trajectories that simply follow the state mean values (with state variances)
46
Speech parameter vectors include both static & dynamic features: o_t = [c_t⊤, Δc_t⊤]⊤, where Δc_t = (c_{t+1} − c_{t−1}) / 2
The relationship between o and c can be arranged in matrix form as o = W c, where W stacks the static and delta windows
Integration of dynamic features
47
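A toy 1-D sketch of the resulting generation problem: with o = W c and Gaussian statistics (μ, Σ), the maximum-likelihood trajectory is ĉ = (W⊤Σ⁻¹W)⁻¹W⊤Σ⁻¹μ. The 4-frame means and identity precision below are illustrative assumptions:

```python
import numpy as np

# Each observation is [static c_t, delta (c_{t+1} - c_{t-1}) / 2].
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                         # static window row
    if t + 1 < T:
        W[2 * t + 1, t + 1] = 0.5             # delta window row
    if t - 1 >= 0:
        W[2 * t + 1, t - 1] = -0.5

# Per-frame means [c, delta-c] interleaved; identity precision for simplicity.
mu = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 1.0, 0.0])
prec = np.eye(2 * T)                          # Sigma^{-1}
A = W.T @ prec @ W
c_hat = np.linalg.solve(A, W.T @ prec @ mu)   # smooth ML trajectory
```

Because the delta rows couple adjacent frames, the solution rises smoothly from the low statics to the high ones instead of jumping step-wise.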
48
Speech parameter generation algorithm
Solution: maximizing N(W c; μ, Σ) with respect to c gives ĉ = (W⊤ Σ⁻¹ W)⁻¹ W⊤ Σ⁻¹ μ
49
[Figure: means and variances of static and dynamic features, and the generated speech parameter trajectory.]
50
[Figure: spectrograms (0-5 kHz) of generated spectra for "sil a i sil".]
Generated spectra
51
w/o dynamic features
w/ dynamic features
[HTS block diagram repeated.]
HMM-based speech synthesis system (HTS)
52
Linear time-invariant system: excitation (pulse train / white noise) → synthesized speech
Generated excitation parameters (log F0 with V/UV); generated spectral parameters (cepstrum, LSP)
Source-filter model
53
filtering
Speech samples
55
w/o dynamic features
w/ dynamic features
Use of dynamic features can reduce discontinuity
Outline
56
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Adaptation/adaptive training of HMMs
– Originally developed for ASR, but works very well in TTS
– Average voice model (AVM)-based speech synthesis [Yamagishi;'06]
– Requires only a small amount of target speaker/speaking-style data
More efficient creation of new voices
[Diagram: training speakers → adaptive training → average-voice model → adaptation → target speakers.]
Adaptation (mimic voice)
57
Speaker adaptation
– VIP voice
– Child voice
Style (expressiveness) adaptation (in Japanese)
– Joyful
– Sad
– Rough
Difficult to collect a large amount of such data → the adaptation framework gives a way to build TTS from a small database
58
Adaptation demo
Some samples are from http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Interpolate parameters among representative HMM sets
– Can obtain new voices even when no adaptation data is available
– Gradually change speakers & speaking styles [Yoshimura;'97, Tachibana;'05]
Interpolation (mixing voice)
59
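One simple way to realize such interpolation, sketched under the assumption that the HMM sets share a common tying structure, is to interpolate the Gaussian parameters of corresponding states (linear interpolation of variances is a common simplification; the cited work interpolates whole model sets):

```python
import numpy as np

def interpolate_gaussian(mean_a, var_a, mean_b, var_b, w):
    """Linearly interpolate two tied-state Gaussians with weight w in [0, 1]."""
    mean = (1 - w) * mean_a + w * mean_b
    var = (1 - w) * var_a + w * var_b     # simplification for illustration
    return mean, var
```

Sweeping w from 0 to 1 morphs the synthetic voice gradually from speaker/style A to B, with no adaptation data needed.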
Speaker interpolation (in Japanese)
– Male ↔ Female
Style interpolation
– Neutral ↔ Angry
– Neutral ↔ Happy
60
Interpolation demo
Samples are from http://www.sp.nitech.ac.jp/& http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Outline
63
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Quality improvements
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
Drawbacks of HTS
Quality of synthesized speech
– Buzzy
– Flat
– Muffled
Three major factors degrade the quality
– Poor vocoding: how to parameterize speech?
– Inaccurate acoustic modeling: how to model extracted speech parameter trajectories?
– Over-smoothing: how to recover generated speech parameter trajectories?
Recent work has drastically improved the quality of HTS
64
[Figure: excitation e(n) — pulse train (voiced) / white noise (unvoiced).]
Problem in vocoding
Binary pulse-or-noise excitation
– Voiced (periodic): pulse trains
– Unvoiced (aperiodic): white noise
Difficult to model mixtures of voiced & unvoiced sounds (e.g., voiced fricatives) → results in buzziness in synthesized speech
65
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
[Figure: mean and variance of c(2) are piece-wise constant across the states of /sil/ /a/ /i/ /sil/.]
Problem in acoustic modeling
66
[Figure: states q1 q2 q3 emitting observations o1 o2 o3.]
Problem in acoustic modeling
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
– Frame-wise conditional independence assumption: state output probability depends only on the current state
67
[Figure: geometric state-duration probability decreasing with duration (1-8).]
68
69
Problem in acoustic modeling
Limitations of HMMs for modeling speech parameters
– Piece-wise constant statistics: statistics do not vary within an HMM state
– Frame-wise conditional independence assumption: state output probability depends only on the current state
– Weak duration modeling: state duration probability decreases exponentially with time
None of them hold for real speech data
Speech parameters are generated from acoustic models
Better acoustic model may produce better speech
[Figure: spectrograms (0-8 kHz), natural vs. generated speech.]
70
Over-smoothing problem
Speech parameter generation algorithm
– Dynamic features make generated parameters smooth
– Sometimes too smooth → muffled speech
Why?
– Fine structure of the spectra is removed by the averaging effect
– Model underfitting
71
Nitech-HTS 2005 for Blizzard Challenge
Blizzard Challenge
– Open evaluation of TTS
– Common database
– Held since 2005
Nitech-HTS 2005
– Achieved the best naturalness & intelligibility
– Still used as a calibration system in the Blizzard Challenge
– Includes various enhancements
Advanced vocoding
High-quality vocoder STRAIGHT
Better acoustic modeling
Hidden semi-Markov model (HSMM)
Oversmoothing compensation
Speech parameter generation algorithm considering GV
[Diagram: STRAIGHT. Analysis: waveform → fixed-point analysis → F0 extraction; F0-adaptive spectral smoothing in the time-frequency region → F0, smoothed spectrum, aperiodic factors. Synthesis: mixed excitation with phase manipulation → synthetic waveform.]
STRAIGHT analysis
72
[Figure: power (dB) vs. frequency (0-8 kHz) — FFT power spectrum, FFT + mel-cepstral analysis, STRAIGHT + mel-cepstral analysis.]
Mel-cepstrum from STRAIGHT spectrum
73
Examples of generated excitation signals
pulse/noise
STRAIGHT
75
76
HSMM-based modeling
Hidden semi-Markov model (HSMM)
– Can be viewed as an HMM with explicit state-duration probability distributions
– State-output, state-duration, & state-transition probability distributions can be estimated simultaneously
Generated trajectory has smaller GV than the natural one
[Figure: 2nd mel-cepstral coefficient over time (0-3 sec), natural vs. generated.]
Parameter generation considering GV
77
Global variance (GV): intra-utterance variance of the speech parameters
Objective function of standard parameter generation
Objective function of parameter generation with GV
– 1st term is the same as the standard one
– 2nd term works as a penalty against oversmoothing
78
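The GV term can be sketched as follows: compute the intra-utterance variance of the trajectory and penalize its deviation from a Gaussian GV model (additive constants omitted; in practice μ_gv and σ²_gv are trained on natural speech, and the names here are illustrative):

```python
import numpy as np

def global_variance(c):
    """Intra-utterance variance of a (T x D) speech-parameter trajectory."""
    return np.mean((c - np.mean(c, axis=0)) ** 2, axis=0)

def gv_log_penalty(c, mu_gv, var_gv):
    """Gaussian log-penalty on the GV (constants dropped)."""
    gv = global_variance(c)
    return float(np.sum(-0.5 * (gv - mu_gv) ** 2 / var_gv))
```

An over-smoothed trajectory has a GV well below μ_gv and thus a large penalty, so maximizing the combined objective pushes the generated trajectory back toward natural variability.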
Parameter generation considering GV
[Bar chart: MOS (scale 1-5) for eight systems combining 24th/39th-order analysis, mel-cepstrum (mcep.) or STRAIGHT (ST.), HMM or HSMM, with and without GV. Scores: 2.0, 3.1, 2.0, 3.2, 3.2, 3.1, 4.7, 4.7; the GV-based STRAIGHT HSMM systems score highest (4.7).]
Evaluations
80
Recent work to improve quality
Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
Oversmoothing
– Pre- & postfiltering
– Improvements of GV
– Hybrid approaches
& more…
81
Other challenging topics
Keynote speech by Simon King (CSTR) in SSW7 this year
Speech synthesis is easy, if ...
• voice is built offline & carefully checked for errors
• speech is recorded in clean conditions
• word transcriptions are correct
• accurate phonetic labels are available or can be obtained
• speech is in the required language & speaking style
• speech is from a suitable speaker
• a native speaker is available, preferably a linguist
Speech synthesis is not easy if we don't have the right data
82
Other challenging topics
Non-professional speakers
• AVM + adaptation (CSTR)
Too little speech data
• VTLN-based rapid speaker adaptation (Titech, IDIAP)
Noisy recordings
• Spectral subtraction & AVM + adaptation (CSTR)
No labels
• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
Insufficient knowledge of the language or accent
• Letter (grapheme)-based synthesis (CSTR)
• No prosodic contexts (CSTR, Titech)
Wrong language
• Cross-lingual speaker adaptation (MSRA, EMIME)
• Speaker & language adaptive training (Toshiba)
83
Outline
84
Fundamentals of HMM-based speech synthesis
– Mathematical formulation
– Implementation of individual components
– Examples demonstrating its flexibility
Recent work
– Vocoding
– Acoustic modeling
– Oversmoothing
– Other challenging research topics
Conclusions
HMM-based speech synthesis
– Getting popular in the speech synthesis area
– Implementation
– Flexible to change its voice characteristics
– Its quality is improving
– There are many challenging research topics
Summary
85
TTS research is attractive for you!!
86
[Diagram: TTS research spans text analysis, vocoding, and statistical modeling.]
Text analysis
• Part-of-speech tagging, Grapheme-to-Phoneme conversion, etc.
Vocoding
• Better signal processing, better speech representation
Statistical modeling
• Robust modeling in ASR, knowledge from machine learning
Natural language processing
Signal processing, speech coding
ASR, machine learning
87
Please join TTS research!
Thanks!