HMM-Based Synthesis of Creaky Voice

Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl

Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based VocodingHMM-Based Synthesis of Creaky Voice Raitio, Kane, Drugman, Gobl

Creaky voice

• Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
• Highly irregular, with secondary laryngeal excitations

Use of creaky voice

• Usually involuntary, but various systematic usages have been reported
• For instance, creaky voice has been observed as a
  • phrase boundary marker
  • turn-yielding mechanism
  • indication of hesitation
  • portrayal of social status
  • cue for communicating attitude and affective states

Synthesis of creaky voice

• HMM-based synthesis of creaky voice requires:
  1. Algorithm for automatic detection of creaky voice
  2. Accurate f0 estimation and voicing decision
  3. Prediction of creaky voice from context (text input)
  4. Vocoder capable of rendering creaky excitation

This work…

1. Compares different f0 estimation methods suitable for building creaky voice synthesis

2. Brings the previous research together into a framework for creaky voice synthesis

3. Explores the conversion of normal synthetic voice to a creaky one

What modifications are required to construct a creaky voice synthesis system from a conventional HTS system?

[Figure: block diagram of a conventional HTS system. Training: speech data, f0 and voicing decision, spectrum estimation, HMM training (spectrum, f0), voice model. Synthesis: text, front-end, labels, parameter generation (f0, spectrum), synthesis, speech.]

A) Use a database of creaky voice

[Figure: block diagram of the HTS system with creaky speech data as the training input. Training: speech data (creaky), f0 and voicing decision, spectrum estimation, HMM training (spectrum, f0), voice model. Synthesis: text, front-end, labels, parameter generation (f0, spectrum), synthesis, speech.]

B) Replace the f0 estimation method with one suitable for creaky voice

[Figure: block diagram of the HTS system trained on creaky speech data, with the same blocks: f0 and voicing decision, spectrum estimation, HMM training (spectrum, f0), voice model, front-end, labels, parameter generation (f0, spectrum), synthesis.]

f0 estimation of creaky voice

• Creaky voice has low f0 and irregular excitation
• Many f0 trackers output spurious values or classify creak as unvoiced
• A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  1. GlottHMM
  2. SWIPE (with SPTK 3.6 voicing decision)
  3. RAPT (SPTK 3.6)
  4. SPTK 3.1 cepstrum-based pitch function
  5. STRAIGHT TEMPO

f0 estimation of creaky voice – Evaluation

• Methods were mostly used with default settings
• Frame length was set to 45 ms whenever possible
• Speech data:
  • 3 databases of read speech for TTS development
    • American English male BDL
    • Finnish male MV
    • Finnish female HS
  • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
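One way to score an f0 tracker against a reference track is sketched below. The slides do not state which error measures were used, so the two shown here (voicing decision error, and gross pitch error with a 20% tolerance) are assumptions, as are the function name `f0_errors` and the convention that unvoiced frames carry f0 = 0.

```python
import math


def f0_errors(f0_ref, f0_est, gross_tol=0.2):
    """Return (voicing decision error, gross pitch error) for two f0 tracks.

    Both tracks hold one f0 value in Hz per frame; 0 marks an unvoiced frame.
    gross_tol is a hypothetical relative tolerance, not a value from the slides.
    """
    if len(f0_ref) != len(f0_est) or len(f0_ref) == 0:
        raise ValueError("tracks must be non-empty and of equal length")

    voicing_errors = 0  # frames where the voiced/unvoiced decisions differ
    both_voiced = 0     # frames voiced in both tracks
    gross_errors = 0    # voiced frames whose f0 deviates by more than gross_tol

    for ref, est in zip(f0_ref, f0_est):
        ref_voiced = ref > 0
        est_voiced = est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) / ref > gross_tol:
                gross_errors += 1

    vde = voicing_errors / len(f0_ref)
    gpe = gross_errors / both_voiced if both_voiced else math.nan
    return vde, gpe


# Toy example: one creaky frame lost as unvoiced, one octave error.
print(f0_errors([0, 50, 50, 100], [0, 0, 100, 100]))
```

In this sketch a tracker that marks a creaky frame as unvoiced adds to the voicing decision error, while an octave jump adds to the gross pitch error.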

f0 estimation of creaky voice – Results

• GlottHMM [1] performed best with TTS data
• SPTK performed best with conversational speech
• For creaky voice TTS development, GlottHMM f0 estimation was chosen

[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011

What modifications are required to construct a creaky voice synthesis system from a conventional HTS system?

C) Detect creaky regions and model creak as a special case

[Figure, built up over three slides: block diagram of the conventional HTS system (f0 and voicing decision, spectrum estimation, HMM training, voice model, front-end, labels, parameter generation, synthesis), first with ordinary speech data and then with creaky speech data as the training input.]

[Figure: block diagram of the creaky voice synthesis system. Training: speech data (creaky), f0 and voicing decision, spectrum estimation, creaky voice detection, extraction of the creaky excitation (average creaky residual), HMM training (spectrum, f0, creaky probability), creaky voice model. Synthesis: text, front-end, labels, parameter generation (creaky probability, f0, spectrum), synthesis (normal/creak), speech (creaky).]

Creaky voice detection

• Hand-annotation is too laborious, so automatic methods are needed
• An automatic creaky voice detection method by Kane & Drugman [1,2]
• Based on linear prediction (LP) residual features

[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
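The detection features are derived from the LP residual. As a minimal sketch of that first step only (not the detector of [1,2]), the code below estimates LP coefficients for one frame with the autocorrelation method and the Levinson-Durbin recursion, then inverse-filters the frame. The function names, the Hamming window and the default order of 12 are assumptions chosen for illustration.

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Coefficients a[0..order] of A(z) = 1 + a1 z^-1 + ... from autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a


def lp_residual(frame, order=12):
    """Return (a, residual): inverse-filter coefficients and the LP residual."""
    n = len(frame)
    if n <= order:
        raise ValueError("frame must be longer than the LP order")
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    r = autocorrelation([s * w for s, w in zip(frame, window)], order)
    if r[0] <= 0.0:                         # silent frame: nothing to predict
        return [1.0] + [0.0] * order, list(frame)
    a = levinson_durbin(r, order)
    # Inverse filtering: e[t] = sum_k a[k] * x[t - k]
    residual = [sum(a[k] * frame[t - k] for k in range(order + 1) if t - k >= 0)
                for t in range(n)]
    return a, residual
```

On a strongly periodic frame the residual carries far less energy than the frame itself, leaving mainly the excitation peaks from which residual-based features can be computed.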

[Figure: LP residual of a speech segment and the corresponding probability of creak.]

Modeling creaky excitation

• Extension of the deterministic plus stochastic model (DSM) [1,2] that incorporates dedicated modeling of creaky voice

[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech and Lang. Proc., 2012

[Figure: creaky excitation across several glottal closure instants (GCIs), showing the deterministic component with its main and secondary excitation, and the envelope of the stochastic component.]

[Figure: block diagram of the creaky voice synthesis system, shown again: creaky voice detection, creaky excitation extraction and a creaky probability stream added to the HTS training and synthesis pipeline.]

Voice building and synthesis

Training:
• Standard HTS method with the addition of a 1-dimensional stream of creaky probability
• Spectrum: 30th-order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs)

Synthesis:
• Excitation: DSM vocoder with creaky parts rendered with the creaky excitation
• Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
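The generated creaky probability stream has to be turned into a per-frame choice between normal and creaky excitation. The slides do not say how that decision is made, so the sketch below assumes a fixed threshold of 0.5; the function name `select_excitation` and the label strings are likewise hypothetical.

```python
def select_excitation(creaky_prob, voiced, threshold=0.5):
    """Label each frame as 'unvoiced', 'normal' or 'creaky'.

    creaky_prob: generated creaky probability per frame (0..1)
    voiced:      generated voicing decision per frame (bool)
    threshold:   assumed decision threshold, not a value from the slides
    """
    if len(creaky_prob) != len(voiced):
        raise ValueError("streams must have the same number of frames")

    labels = []
    for prob, is_voiced in zip(creaky_prob, voiced):
        if not is_voiced:
            labels.append("unvoiced")   # noise excitation
        elif prob >= threshold:
            labels.append("creaky")     # render with the creaky excitation
        else:
            labels.append("normal")     # render with the regular DSM excitation
    return labels


print(select_excitation([0.1, 0.8, 0.9, 0.7], [True, True, True, False]))
```

A real system might smooth the probability track or enforce a minimum segment length before switching excitation; that is left out here.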

Evaluation

• The following systems were compared:
  1. Conventional (STRAIGHT f0)
  2. Proposed (GlottHMM f0)
  3. Proposed (GlottHMM f0 and creaky excitation)
• Subjective online listening tests
  1. Stimuli: 20 sentences from the held-out data of BDL and MV
  2. 29 test subjects

Evaluation – MOS naturalness

• Results indicate that systems 2 and 3 have higher (p<0.001) ratings than system 1
• Difference between systems 2 and 3 is not significant
• Conclusions:
  • Use of GlottHMM f0 improves naturalness
  • Modeling of creaky excitation has no effect on MOS

[Figure: MOS naturalness ratings for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation.]

Evaluation – Creaky rendering

• Pairwise comparison of samples
• Systems 2 and 3 are preferred over system 1
• System 3 is preferred over system 2
• Conclusions:
  • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering

[Figure: pairwise preference results, each with a "no preference" share, for three pairs: STRAIGHT f0 (1) vs. GlottHMM f0 (2), STRAIGHT f0 (1) vs. GlottHMM f0 + creaky excitation (3), and GlottHMM f0 (2) vs. GlottHMM f0 + creaky excitation (3).]

Is it possible to transplant a creaky voice quality to a non-creaky speaker?

Adding creak for a non-creaky speaker

• Convert the non-creaky voice of Scottish English male AWB to creaky
• Transplantation strategy:
  1. Creaky voice is predicted from American English male BDL
  2. Creaky excitation pulse from BDL is used to render creak
  3. f0 is either:
     a) kept as is
     b) substituted with BDL f0 by stream substitution
     c) transformed only in the creaky parts
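The three f0 options can be illustrated on already generated, frame-aligned f0 tracks. This is only a sketch under stated assumptions: the slides do not describe the transformation used in option (c), so a hypothetical scaling factor `creak_scale` stands in for it, and the function and mode names are invented for this example. In the actual system, stream substitution takes place in the voice model rather than on generated tracks.

```python
def transplant_f0(target_f0, donor_f0, creaky, mode, creak_scale=0.5):
    """Return the f0 track to use for the target speaker.

    target_f0: f0 per frame of the non-creaky target speaker (0 = unvoiced)
    donor_f0:  f0 per frame of the creaky donor speaker, frame-aligned
    creaky:    per-frame creak decision (bool)
    mode:      'keep' (a), 'substitute' (b) or 'transform' (c)
    creak_scale: hypothetical lowering factor applied in creaky frames
    """
    if not (len(target_f0) == len(donor_f0) == len(creaky)):
        raise ValueError("all tracks must have the same number of frames")

    if mode == "keep":
        return list(target_f0)
    if mode == "substitute":
        return list(donor_f0)
    if mode == "transform":
        # Change f0 only where the frame is both voiced and creaky.
        return [f0 * creak_scale if is_creaky and f0 > 0 else f0
                for f0, is_creaky in zip(target_f0, creaky)]
    raise ValueError("unknown mode: " + repr(mode))


print(transplant_f0([120.0, 110.0, 0.0], [60.0, 40.0, 0.0],
                    [False, True, True], "transform"))
```

Option (c) leaves the target speaker's own prosody untouched outside the creaky regions, which is the property that distinguishes it from full substitution.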

Evaluation

• Four different voices were built:
  1. AWB (baseline)
  2. AWB with BDL creaky excitation
  3. AWB with BDL creaky excitation and BDL f0
  4. AWB with BDL creaky excitation and f0 transformation

Evaluation

• Subjective online listening tests
• 14 test subjects
• 28 synthesized stimuli
• Samples were rated on two scales:
  1. Standard MOS naturalness
  2. Impression of creakiness from 1 to 5 (1 – does not sound like creaky voice, 5 – sounds exactly like creaky voice)
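Ratings on a 1 to 5 scale are usually summarised per system as a mean with a confidence interval. The sketch below shows that summary under a normal approximation; the slides do not state how the scores were aggregated or which significance test was used, so the function name `mos_with_ci` and the 95% interval are assumptions.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of the confidence interval) for 1-5 ratings.

    z = 1.96 gives an approximate 95% interval under a normal approximation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = float(statistics.mean(ratings))
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


mean, half_width = mos_with_ci([4, 5, 3, 4, 4, 2, 5])
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

The same summary applies to both the naturalness and the creakiness scale.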

Evaluation results – MOS

• System 3 is rated lower than system 1
• No other statistically significant differences
• Conclusions:
  • Creaky voice transformation does not decrease naturalness, except when f0 of BDL was used
  • Degradation of system 3 is probably due to different prosody

[Figure: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]

Evaluation results – Creakiness

• System 1 is rated less creaky than the other systems
• Conclusions:
  • Creaky voice transformation is successful: all transformed voices are rated creaky
  • f0 has less effect on the impression of creakiness, but it contributes to naturalness

[Figure: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]

Summary

• Methods for the HMM-based synthesis of creaky voice were investigated
• This requires:
  1. a method for detecting creaky voice
  2. a robust pitch tracker and voicing decision
  3. prediction of creaky voice from contextual factors
  4. a dedicated vocoder for rendering the creaky excitation
• Evaluation showed a significant improvement in naturalness and creakiness
• Transformation of a non-creaky speaker to a creaky one was successful

Thank you!