HMM-Based Synthesis of Creaky Voice
Tuomo Raitio John Kane Thomas Drugman Christer Gobl
Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based VocodingHMM-Based Synthesis of Creaky Voice Raitio, Kane, Drugman, Gobl
Creaky voice
• Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
• Highly irregular with secondary laryngeal excitations
Use of creaky voice
• Usually involuntary, but various systematic usages have been reported
• For instance, creaky voice has been observed as:
  • phrase boundary marker
  • turn-yielding mechanism
  • indication of hesitations
  • portrayal of social status
  • cue for communicating attitude and affective states
Synthesis of creaky voice
• HMM-based synthesis of creaky voice requires:
  1. Algorithm for automatic detection of creaky voice
  2. Accurate f0 estimation and voicing decision
  3. Prediction of creaky voice from context (text input)
  4. Vocoder capable of rendering creaky excitation
This work…
1. Compares different f0 estimation methods suitable for building creaky voice synthesis
2. Brings together the previous research in a framework for creaky voice synthesis
3. Explores the conversion of normal synthetic voice to a creaky one
What modifications are required to build a creaky voice synthesizer from a conventional HTS system?
[Figure: block diagram of a conventional HTS system. Training: speech data → f0 and voicing decision, spectrum estimation → HMM training (spectrum, f0) → voice model. Synthesis: text → front-end → labels → parameter generation (f0, spectrum) → synthesis → speech.]
A) Use a database of creaky voice
[Figure: HTS block diagram with creaky speech data as the training input.]
B) Replace the f0 estimation method with one suitable for creaky voice
f0 estimation of creaky voice
• Creaky voice has low f0 and irregular excitation
• Many f0 trackers output spurious values or classify creak as unvoiced
• A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  1. GlottHMM
  2. SWIPE (with SPTK 3.6 voicing decision)
  3. RAPT (SPTK 3.6)
  4. SPTK 3.1 cepstrum-based pitch function
  5. STRAIGHT TEMPO
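The two failure modes listed here (spurious f0 values, creak classified as unvoiced) can be scored separately. The following is a minimal sketch, assuming frame-aligned reference and estimated f0 tracks in which 0 marks an unvoiced frame. The function name, the 20% tolerance, and the example tracks are illustrative assumptions, not the evaluation settings of this work.

```python
def f0_errors(ref, est, tol=0.2):
    """Return (voicing_error_rate, gross_error_rate) for two f0 tracks.

    ref, est: f0 values in Hz, one per frame; 0 means unvoiced.
    tol: relative deviation above which a voiced frame counts as a
         gross error (0.2 is an assumed, commonly used value).
    """
    if len(ref) != len(est):
        raise ValueError("tracks must have the same number of frames")
    voicing_errors = 0
    gross_errors = 0
    both_voiced = 0
    for r, e in zip(ref, est):
        if (r > 0) != (e > 0):
            # Voiced frame labelled unvoiced, or the reverse.
            voicing_errors += 1
        elif r > 0:
            both_voiced += 1
            if abs(e - r) / r > tol:
                # Spurious value, e.g. an octave jump.
                gross_errors += 1
    ver = voicing_errors / len(ref) if ref else 0.0
    ger = gross_errors / both_voiced if both_voiced else 0.0
    return ver, ger


# Hypothetical tracks: one octave jump and one creaky frame lost as unvoiced.
ref = [0.0, 60.0, 55.0, 50.0, 52.0, 0.0]
est = [0.0, 61.0, 110.0, 0.0, 51.0, 0.0]
ver, ger = f0_errors(ref, est)
```

Keeping the two rates apart shows whether a tracker fails on creak mainly by dropping voicing or by reporting wrong f0 values.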
f0 estimation of creaky voice – Evaluation
• Methods were mostly used with default settings
• Frame length was set to 45 ms whenever possible
• Speech data:
  • 3 databases of read speech for TTS development
    • American English male BDL
    • Finnish male MV
    • Finnish female HS
  • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
f0 estimation of creaky voice – Results
• GlottHMM [1] performed best with TTS data
• SPTK performed best with conversational speech
• For creaky voice TTS development, GlottHMM f0 estimation was chosen
[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011
What modifications are required to build a creaky voice synthesizer from a conventional HTS system?
C) Detect creaky regions and model creak as a special case
[Figure: block diagram of the creaky voice synthesis system. Training: creaky speech data → f0 and voicing decision, spectrum estimation, creaky voice detection → HMM training (spectrum, f0, creaky probability) → creaky voice model; the creaky excitation is extracted and averaged into a creaky residual. Synthesis: text → front-end → labels → parameter generation (creaky probability, f0, spectrum) → synthesis (normal/creak) → creaky speech.]
Creaky voice detection
• Hand annotation is too laborious → automatic methods are needed
• An automatic creaky voice detection method by Kane & Drugman [1,2]
• Based on linear prediction (LP) residual features
[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
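To make the term concrete, the LP residual is what remains after inverse filtering speech with its own linear predictor. The sketch below uses the textbook autocorrelation method with the Levinson-Durbin recursion; it covers only the residual computation and is not the detection features of [1,2]. The function names and the toy signal (an impulse train through a two-pole resonator) are assumptions made for illustration.

```python
def lpc(signal, order):
    """LP coefficients a[0..order] (a[0] = 1) via Levinson-Durbin."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(signal, a):
    """Inverse-filter the signal with A(z) = sum_k a[k] z^-k."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


# Toy "speech": one impulse every 80 samples through a two-pole resonator.
speech = []
for n in range(400):
    x = 1.0 if n % 80 == 0 else 0.0
    s1 = speech[n - 1] if n >= 1 else 0.0
    s2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(x + 1.6 * s1 - 0.8 * s2)

a = lpc(speech, 2)
residual = lp_residual(speech, a)
```

In this toy case the residual is close to the original impulse train, which is why excitation events such as the irregular pulses of creak are easier to observe in the residual than in the speech waveform.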
[Figure: LP residual and the corresponding probability of creak.]
Modeling creaky excitation
• Extension of the deterministic plus stochastic model (DSM) [1,2] that adds explicit modeling of creaky voice
[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech and Lang. Proc., 2012
[Figure: creaky excitation showing the deterministic component and the envelope of the stochastic component, with main excitations, secondary excitations, and glottal closure instants (GCIs) marked.]
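The idea of combining a deterministic waveform with envelope-shaped noise at each GCI can be sketched in a few lines. This is a simplified illustration only: it ignores the frequency split between the two components, and the pulse shape, envelope, GCI positions, and `noise_gain` are all hypothetical values rather than quantities from the DSM papers.

```python
import random


def build_excitation(length, gcis, pulse, envelope, noise_gain=0.1, seed=0):
    """Overlap-add a deterministic pulse plus envelope-modulated noise.

    pulse, envelope: equal-length lists centred on each GCI.
    gcis: sample indices of the glottal closure instants.
    """
    if len(pulse) != len(envelope):
        raise ValueError("pulse and envelope must have the same length")
    rng = random.Random(seed)
    out = [0.0] * length
    half = len(pulse) // 2
    for gci in gcis:
        for i, (p, env) in enumerate(zip(pulse, envelope)):
            n = gci - half + i
            if 0 <= n < length:
                # Deterministic part plus noise shaped in time by the envelope.
                out[n] += p + noise_gain * env * rng.gauss(0.0, 1.0)
    return out


# Hypothetical creaky pulse: a main peak followed by a weaker secondary one.
pulse = [0.0, 0.1, 0.3, -0.4, 1.0, -0.5, 0.2, 0.35, 0.0]
envelope = [0.1, 0.2, 0.5, 0.8, 1.0, 0.8, 0.5, 0.2, 0.1]
gcis = [40, 170, 260, 420]  # irregular spacing, as in creak

clean = build_excitation(500, gcis, pulse, envelope, noise_gain=0.0)
noisy = build_excitation(500, gcis, pulse, envelope)
```

Setting `noise_gain=0.0` leaves only the deterministic pulses, which makes the irregular GCI spacing easy to inspect.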
Voice building and synthesis
Training:
• Standard HTS method with the addition of a 1-dimensional stream of creaky probability
• Spectrum: 30th-order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs)
Synthesis:
• Excitation: DSM vocoder with creaky parts rendered with the creaky excitation
• Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
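The two analysis parameters have simple closed forms. As a sketch under the standard mel-generalized cepstrum definitions, alpha controls the frequency warping of a first-order all-pass function (0.42 is a value often used to approximate the mel scale for 16 kHz speech; the sampling rate is not stated on the slide), and gamma selects a generalized logarithm that lies between an all-pole model (gamma = -1) and the cepstrum (gamma = 0). The function names below are illustrative and do not come from any toolkit.

```python
import math

ALPHA = 0.42        # frequency warping factor from the slide
GAMMA = -1.0 / 3.0  # power parameter from the slide


def warp_frequency(omega, alpha=ALPHA):
    """Warped frequency (radians): phase of the first-order all-pass."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def generalized_log(x, gamma=GAMMA):
    """s_gamma(x) = (x**gamma - 1) / gamma, tending to ln(x) as gamma -> 0."""
    if gamma == 0.0:
        return math.log(x)
    return (x ** gamma - 1.0) / gamma


def generalized_exp(y, gamma=GAMMA):
    """Inverse of generalized_log."""
    if gamma == 0.0:
        return math.exp(y)
    return (1.0 + gamma * y) ** (1.0 / gamma)


# With alpha > 0, low frequencies are stretched: the warped value exceeds
# the linear one anywhere strictly between 0 and pi.
warped = [warp_frequency(w) for w in (0.0, 0.5, 1.0, math.pi)]
```

The warping leaves 0 and pi fixed and expands the low-frequency region, so more of the 30 coefficients describe the perceptually important part of the spectrum.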
Evaluation
• The following systems were compared:
  1. Conventional (STRAIGHT f0)
  2. Proposed (GlottHMM f0)
  3. Proposed (GlottHMM f0 and creaky excitation)
• Subjective online listening tests
  1. Stimuli: 20 sentences from the held-out data of BDL and MV
  2. 29 test subjects
Evaluation – MOS naturalness
• Results indicate that systems 2 and 3 have higher ratings than system 1 (p < 0.001)
• The difference between systems 2 and 3 is not significant
• Conclusions:
  • Use of GlottHMM f0 improves naturalness
  • Modeling of creaky excitation has no significant effect on MOS
[Figure: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation.]
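For readers unfamiliar with MOS reporting, the sketch below shows one common way to summarize ratings: the mean score with a normal-approximation 95% confidence interval. The slides do not state which statistical test produced the p-values, so this is not the authors' procedure, and the ratings shown are hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 naturalness ratings for a single system.
ratings = [3, 4, 4, 5, 3, 4]
mean, half_width = mos_with_ci(ratings)
```

Overlapping intervals between two systems are only a rough indication; a paired test on per-listener scores is the usual way to check a difference such as the one between systems 2 and 3.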
Evaluation – Creaky rendering
• Pairwise comparison of samples
• Systems 2 and 3 are preferred over system 1
• System 3 is preferred over system 2
• Conclusions:
  • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering
[Figure: pairwise preference results, including no preference, for STRAIGHT f0 (1), GlottHMM f0 (2), and GlottHMM f0 + creaky excitation (3).]
Is it possible to transplant a creaky voice quality to a non-creaky speaker?
Adding creak to a non-creaky speaker
• Convert the non-creaky voice of Scottish English male AWB to creaky
• Transplantation strategy:
  1. Creaky voice is predicted from American English male BDL
  2. Creaky excitation pulse from BDL is used to render creak
  3. f0 is either:
     a) kept as is
     b) substituted with BDL f0 by stream substitution
     c) transformed only in the creaky parts
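The three f0 options can be written as one small function. This is a sketch under strong assumptions: the tracks are frame-aligned, the creaky parts are the frames whose predicted creaky probability exceeds a threshold, and the transformation in option c) is a plain scaling. The slides do not specify the actual transformation, so `threshold`, `scale`, the function name, and the example values are hypothetical.

```python
def apply_f0_strategy(target_f0, source_f0, creaky_prob, strategy,
                      threshold=0.5, scale=0.5):
    """Return the f0 track used for synthesis.

    target_f0: f0 of the non-creaky speaker (AWB), 0 = unvoiced.
    source_f0: f0 of the creaky speaker (BDL) for the same frames.
    creaky_prob: predicted creaky probability per frame.
    strategy: "keep" (a), "substitute" (b), or "transform" (c).
    """
    if not (len(target_f0) == len(source_f0) == len(creaky_prob)):
        raise ValueError("all tracks must have the same number of frames")
    if strategy == "keep":
        return list(target_f0)
    if strategy == "substitute":
        return list(source_f0)
    if strategy == "transform":
        # Assumed transformation: lower f0 only where creak is predicted.
        return [f * scale if p > threshold else f
                for f, p in zip(target_f0, creaky_prob)]
    raise ValueError("unknown strategy: " + strategy)


# Hypothetical frame-level values.
awb_f0 = [120.0, 118.0, 115.0, 110.0, 0.0]
bdl_f0 = [100.0, 95.0, 60.0, 50.0, 0.0]
creak = [0.1, 0.2, 0.8, 0.9, 0.0]

transformed = apply_f0_strategy(awb_f0, bdl_f0, creak, "transform")
```

Option c) keeps the target speaker's own prosody outside the creaky regions, in contrast to option b), which replaces the whole f0 stream.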
Evaluation
• Four different voices were built:
  1. AWB (baseline)
  2. AWB with BDL creaky excitation
  3. AWB with BDL creaky excitation and BDL f0
  4. AWB with BDL creaky excitation and f0 transformation
Evaluation
• Subjective online listening tests
  • 14 test subjects
  • 28 synthesized stimuli
• Samples were rated on two scales:
  1. Standard MOS naturalness
  2. Impression of creakiness from 1 (does not sound like creaky voice) to 5 (sounds exactly like creaky voice)
Evaluation results – MOS
• System 3 is rated lower than system 1
• No other statistically significant differences
• Conclusions:
  • Creaky voice transformation does not decrease naturalness, except when f0 of BDL was used
  • Degradation of system 3 is probably due to different prosody
[Figure: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]
Evaluation results – Creakiness
• System 1 is rated less creaky than other systems
• Conclusions:
  • Creaky voice transformation is successful: all transformed voices are rated creaky
  • f0 has less effect on impression of creakiness, but it contributes to naturalness
[Figure: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]
Summary
• Methods for the HMM-based synthesis of creaky voice were investigated
• This requires:
  1. a method for detecting creaky voice
  2. a robust pitch tracker and voicing decision
  3. prediction of creaky voice from contextual factors
  4. a dedicated vocoder for rendering the creaky excitation
• Evaluation showed a significant improvement in naturalness and creakiness
• Transformation of a non-creaky speaker to a creaky one was successful
Thank you!