HMM-Based Synthesis of Creaky Voice

Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl


Creaky voice

• Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
• Highly irregular, with secondary laryngeal excitations


Use of creaky voice

• Usually involuntary, but various systematic usages have been reported
• For instance, creaky voice has been observed as:
  • phrase boundary marker
  • turn-yielding mechanism
  • indication of hesitations
  • portrayal of social status
  • cue for communicating attitude and affective states


Synthesis of creaky voice

• HMM-based synthesis of creaky voice requires:
  1. An algorithm for automatic detection of creaky voice
  2. Accurate f0 estimation and voicing decision
  3. Prediction of creaky voice from context (text input)
  4. A vocoder capable of rendering creaky excitation



This work…

1. Compares different f0 estimation methods suitable for building creaky voice synthesis

2. Culminates the previous research by creating a framework for creaky voice synthesis

3. Explores the conversion of normal synthetic voice to a creaky one


What modifications are required in order to construct a creaky voice synthesis system from a conventional HTS system?


[Figure: block diagram of a conventional HTS system. Training: speech data; f0 and voicing decision; spectrum estimation; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; labels; parameter generation (f0, spectrum); synthesis; speech.]



A) Use a database of creaky voice


[Figure: HTS block diagram (Training, Synthesis) with creaky speech data as the training input.]


B) Replace f0 estimation method with one suitable for creaky voice




f0 estimation of creaky voice

• Creaky voice has low f0 and irregular excitation
• Many f0 trackers output spurious values or classify creak as unvoiced
• A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  1. GlottHMM
  2. SWIPE (with SPTK 3.6 voicing decision)
  3. RAPT (SPTK 3.6)
  4. SPTK 3.1 cepstrum-based pitch function
  5. STRAIGHT TEMPO


f0 estimation of creaky voice – Evaluation

• Methods were mostly used with default settings
• Frame length was set to 45 ms whenever possible
• Speech data:
  • 3 databases of read speech for TTS development
    • American English male BDL
    • Finnish male MV
    • Finnish female HS
  • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
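As a rough illustration of why a long analysis frame matters for creak, the sketch below computes the lowest f0 whose periods still fit inside a frame of a given length. The rule of thumb that a tracker needs about two full periods per frame is an assumption made here for illustration, not a figure from the slides, and `min_f0_for_frame` is a hypothetical helper.

```python
def min_f0_for_frame(frame_ms: float, periods: float = 2.0) -> float:
    """Lowest f0 (Hz) for which `periods` full cycles fit in one frame.

    `periods=2.0` is an assumed rule of thumb, not a value from the slides.
    """
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    frame_s = frame_ms / 1000.0   # frame length in seconds
    return periods / frame_s      # cycles per second that fill the frame


if __name__ == "__main__":
    for frame_ms in (25.0, 45.0):
        print(f"{frame_ms:.0f} ms frame: f0 >= {min_f0_for_frame(frame_ms):.1f} Hz")
```

Under this assumption, a 45 ms frame covers f0 values down to about 44 Hz, while a 25 ms frame only reaches about 80 Hz, which is above the range where creak often occurs.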


f0 estimation of creaky voice – Results

• GlottHMM [1] performed best with TTS data
• SPTK performed best with conversational speech
• For creaky voice TTS development, GlottHMM f0 estimation was chosen

[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011




C) Detect creaky regions and model creak as a special case


[Figure: block diagram of the creaky voice synthesis system. Training: speech data (creaky); spectrum estimation; creaky voice detection; extract creaky excitation; average creaky residual; HMM training (spectrum, f0, creaky probability); creaky voice model. Synthesis: text; front-end; labels; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); speech (creaky).]


• Hand annotation is too laborious, so automatic methods are needed
• An automatic creaky voice detection method by Kane & Drugman [1,2]
• Based on linear prediction (LP) residual features

[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
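The detector in [1,2] operates on features of the LP residual. A minimal sketch of how such a residual can be obtained for one frame is given below, using the autocorrelation method of linear prediction. It does not implement the detection features or the classifier of [1,2]; the function name `lp_residual`, the default order of 12, the Hamming window and the small regularization term are all assumptions for illustration.

```python
import numpy as np


def lp_residual(frame, order=12):
    """Return (LP coefficients, residual) for one frame (illustrative sketch)."""
    x = np.asarray(frame, dtype=float)
    w = x * np.hamming(len(x))  # window only for the coefficient estimate
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(w[: len(w) - k], w[k:]) for k in range(order + 1)])
    if r[0] == 0.0:
        # Silent frame: no prediction possible, residual equals the input
        return np.zeros(order), x.copy()
    # Toeplitz normal equations R a = r[1:], lightly regularized (assumption)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    # Inverse filter A(z) = 1 - sum_k a_k z^-k applied to the unwindowed frame
    inverse_filter = np.concatenate(([1.0], -a))
    residual = np.convolve(x, inverse_filter)[: len(x)]
    return a, residual
```

The residual is what remains after the vocal tract resonances predicted by the LP model are removed, so excitation events such as glottal pulses tend to stand out in it more clearly than in the speech waveform.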


[Figure: LP residual and the detected probability of creak.]


Modeling creaky excitation

• An extension of the deterministic plus stochastic model (DSM) [1,2] that integrates a dedicated model of the creaky excitation

[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2012


[Figure: creaky excitation between consecutive glottal closure instants (GCIs), showing the deterministic component with its main excitation and secondary excitation, and the envelope of the stochastic component.]
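The idea of a deterministic waveform with a main and a secondary excitation, plus a stochastic component shaped by an envelope, can be sketched as a toy example. This is a simplified additive illustration only: it leaves out the split into low and high frequency bands used in DSM, and the pulse shapes, the Gaussian envelope, the gains and the name `creaky_period` are all assumptions rather than the model of [1,2].

```python
import numpy as np


def creaky_period(length, main_pos, secondary_pos,
                  secondary_gain=0.5, noise_gain=0.1, seed=0):
    """Toy one-period excitation: two impulses plus envelope-shaped noise."""
    # Deterministic part: a main and a weaker secondary excitation (assumed shapes)
    deterministic = np.zeros(length)
    deterministic[main_pos] = -1.0
    deterministic[secondary_pos] = -secondary_gain
    # Assumed envelope: Gaussian bumps centred on the two excitation instants
    n = np.arange(length)
    width = length / 20.0
    envelope = (np.exp(-0.5 * ((n - main_pos) / width) ** 2)
                + secondary_gain * np.exp(-0.5 * ((n - secondary_pos) / width) ** 2))
    # Stochastic part: white noise modulated in time by the envelope
    noise = np.random.default_rng(seed).standard_normal(length)
    return deterministic + noise_gain * envelope * noise
```

Setting `noise_gain=0.0` returns the deterministic part alone, which makes the contribution of each component easy to inspect.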




Voice building and synthesis

Training:
• Standard HTS method with the addition of a 1-dimensional stream of creaky probability
• Spectrum: 30th-order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs)

Synthesis:
• Excitation: DSM vocoder, with creaky parts rendered using the creaky excitation
• The excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
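One way to picture how a generated creaky probability stream could steer the vocoder is a per-frame decision between normal and creaky excitation. The sketch below is an assumption about how such a switch might look, not the implementation used in this work: the threshold of 0.5, the label names and the function `select_excitation` are hypothetical.

```python
def select_excitation(creaky_prob, voiced, threshold=0.5):
    """Pick an excitation type per frame (illustrative sketch).

    creaky_prob: generated creaky probability per frame, in [0, 1]
    voiced:      voicing decision per frame (True/False)
    threshold:   assumed decision threshold, not taken from the slides
    """
    if len(creaky_prob) != len(voiced):
        raise ValueError("creaky_prob and voiced must have the same length")
    labels = []
    for prob, is_voiced in zip(creaky_prob, voiced):
        if not is_voiced:
            labels.append("unvoiced")
        elif prob >= threshold:
            labels.append("creaky")
        else:
            labels.append("normal")
    return labels
```

For example, `select_excitation([0.1, 0.9], [True, True])` yields one normal frame followed by one creaky frame.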


Evaluation

• The following systems were compared:
  1. Conventional (STRAIGHT f0)
  2. Proposed (GlottHMM f0)
  3. Proposed (GlottHMM f0 and creaky excitation)
• Subjective online listening tests:
  1. Stimuli: 20 sentences from the held-out data of BDL and MV
  2. 29 test subjects


Evaluation – MOS naturalness

• Results indicate that systems 2 and 3 have higher ratings than system 1 (p < 0.001)
• The difference between systems 2 and 3 is not significant
• Conclusions:
  • Use of GlottHMM f0 improves naturalness
  • Modeling of creaky excitation has no significant effect on MOS

[Figure: MOS naturalness of the three systems: STRAIGHT f0; GlottHMM f0; GlottHMM f0 + creaky excitation.]


Evaluation – Creaky rendering

• Pairwise comparison of samples
• Systems 2 and 3 are preferred over system 1
• System 3 is preferred over system 2
• Conclusions:
  • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering

[Figure: pairwise preference results, each with a "No pref." share, for STRAIGHT f0 (1) vs. GlottHMM f0 (2), STRAIGHT f0 (1) vs. GlottHMM f0 + cr. exc. (3), and GlottHMM f0 (2) vs. GlottHMM f0 + cr. exc. (3).]


Is it possible to transplant creaky voice quality to a non-creaky speaker?


Adding creak to a non-creaky speaker

• Convert the non-creaky voice of Scottish English male AWB to creaky
• Transplantation strategy:
  1. Creaky voice is predicted from American English male BDL
  2. Creaky excitation pulse from BDL is used to render creak
  3. f0 is either:
     a) kept as is
     b) substituted with BDL f0 by stream substitution
     c) transformed only in the creaky parts
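The three f0 options can be sketched on frame-aligned f0 contours. The slides do not say how f0 is transformed in option c), so the sketch simply takes the donor value in creaky frames as a stand-in. The function name `transplant_f0`, the mode names and the assumption that both contours have the same number of frames are hypothetical.

```python
import numpy as np


def transplant_f0(target_f0, donor_f0, creaky_mask, mode):
    """Combine target (e.g. AWB) and donor (e.g. BDL) f0 contours (sketch).

    mode "keep":        option a), target f0 kept as is
    mode "substitute":  option b), donor f0 replaces the whole stream
    mode "creaky_only": option c), donor f0 used in creaky frames only
                        (a stand-in for the unspecified transformation)
    """
    target = np.asarray(target_f0, dtype=float)
    donor = np.asarray(donor_f0, dtype=float)
    mask = np.asarray(creaky_mask, dtype=bool)
    if not (target.shape == donor.shape == mask.shape):
        raise ValueError("all inputs must have the same number of frames")
    if mode == "keep":
        return target.copy()
    if mode == "substitute":
        return donor.copy()
    if mode == "creaky_only":
        return np.where(mask, donor, target)
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the modes in one function makes it easy to generate the same utterance with each option and compare them side by side, as in the listening test setup described next.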


Evaluation

• Four different voices were built:
  1. AWB (baseline)
  2. AWB with BDL creaky excitation
  3. AWB with BDL creaky excitation and BDL f0
  4. AWB with BDL creaky excitation and f0 transformation


Evaluation

• Subjective online listening tests
• 14 test subjects
• 28 synthesized stimuli
• Samples were rated on two scales:
  1. Standard MOS naturalness
  2. Impression of creakiness from 1 to 5 (1 = does not sound like creaky voice, 5 = sounds exactly like creaky voice)


Evaluation results – MOS

• System 3 is rated lower than system 1
• No other statistically significant differences
• Conclusions:
  • Creaky voice transformation does not decrease naturalness, except when the f0 of BDL was used
  • The degradation of system 3 is probably due to different prosody

[Figure: MOS naturalness of the four voices: Baseline AWB; AWB + creaky excitation; AWB + BDL f0 stream + creaky excitation; AWB + f0 transformation + creaky excitation.]


Evaluation results – Creakiness

• System 1 is rated less creaky than the other systems
• Conclusions:
  • Creaky voice transformation is successful: all transformed voices are rated creaky
  • f0 has less effect on the impression of creakiness, but it contributes to naturalness

[Figure: creakiness ratings of the four voices: AWB; AWB + creaky excitation; AWB + BDL f0 stream + creaky excitation; AWB + f0 transformation + creaky excitation.]


Summary

• Methods for the HMM-based synthesis of creaky voice were investigated
• This requires:
  1. a method for detecting creaky voice
  2. a robust pitch tracker and voicing decision
  3. prediction of creaky voice from contextual factors
  4. a dedicated vocoder for rendering the creaky excitation
• Evaluation showed a significant improvement in naturalness and creakiness
• Transformation of a non-creaky speaker to a creaky one was successful

Thank you!
