Post on 19-Mar-2016
Outline
• Objectives
• Study of Thai tones
  • Characteristics of Thai tones
  • Categorizations of Thai tones
• Tree-based context clustering
  • Construction of contextual factors
  • Design of decision-tree structures
  • Design of context clustering styles
• Experiments
  • Evaluation of overall tone correctness
  • Evaluation of tone correctness for each tone type
  • Evaluation of syllable duration distortion
• Conclusions
Objectives
To implement an HMM-based speech synthesis system for the Thai language that achieves the highest possible tone correctness.
Study of Thai tones
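Characteristics of Thai tones
Syllable structure [Nakasakul2002]
Thai is a tonal language.
General syllable structure: Ci(Ci)V(V)T(Cf), where Ci is an initial consonant, V a vowel, T the tone, and Cf a final consonant.
Syllables with a final consonant: CiVTCf, CiVVTCf, CiCiVVTCf, CiCiVTCf
• รัก r-a-k^-3 (love)
• เรื่อย r-va-j^-2 (always)
• เคร่ง khr-e-ng^-2 (strict)
• เครียด khr-ia-t^-2 (stress)
Syllables without a final consonant:
• CiVT: และ l-x-3 (and)
• CiCiVVT: เพลีย phl-iia-0 (exhausted)
• CiVVT: เสีย s-iia-4 (spoil)
• CiCiVT: ปริ pr-i-1 (break)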
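The transcriptions in the examples appear to follow the pattern initial-vowel-final^-tone, with the final consonant omitted for open syllables. The following is a minimal sketch of a parser for that pattern; the pattern is inferred from the examples on this slide only, and the names `Syllable` and `parse_syllable` are illustrative, not part of the original system.

```python
from typing import NamedTuple, Optional


class Syllable(NamedTuple):
    initial: str           # initial consonant or cluster, e.g. "khr"
    vowel: str             # vowel symbol, e.g. "iia"
    final: Optional[str]   # final consonant without the "^" mark, or None
    tone: int              # tone number, 0 to 4


def parse_syllable(text: str) -> Syllable:
    """Split a transcription such as 'r-a-k^-3' into its parts.

    Assumed format (inferred from the slide examples):
      initial-vowel-final^-tone   for syllables with a final consonant
      initial-vowel-tone          for syllables without one
    """
    parts = text.split("-")
    if len(parts) == 4 and parts[2].endswith("^"):
        initial, vowel, final_marked, tone_text = parts
        final: Optional[str] = final_marked[:-1]  # drop the "^" mark
    elif len(parts) == 3:
        initial, vowel, tone_text = parts
        final = None
    else:
        raise ValueError(f"unrecognized transcription: {text!r}")

    tone = int(tone_text)
    if tone not in range(5):
        raise ValueError(f"tone must be 0..4, got {tone}")
    return Syllable(initial, vowel, final, tone)


if __name__ == "__main__":
    for example in ["r-a-k^-3", "khr-e-ng^-2", "l-x-3", "phl-iia-0"]:
        print(example, "->", parse_syllable(example))
```

A closed syllable yields a non-empty `final`, an open one yields `None`, which is enough to tell the two families of syllable patterns apart.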
Study of Thai tones
Characteristics of Thai tones
F0 contours of Standard Thai tones (normalized duration) [Luksaneeyanawin1992]
[Figure: F0 (Hz, axis from 140 to 290) against duration (0% to 100%) for the five tones: middle (0), low (1), falling (2), high (3), rising (4)]
Thai tone names: สามัญ middle (0), เอก low (1), โท falling (2), ตรี high (3), จัตวา rising (4)
Study of Thai tones
Categorizations of Thai tones
Abramson divided the tones into two groups:
• static group (tones 0, 1, 3)
• dynamic group (tones 2, 4)
According to the final trend of the F0 contours, the tones can also be divided into:
• upward trend group (tones 3, 4)
• downward trend group (tones 0, 1, 2)
[Figure: F0 (Hz) against duration (0% to 100%) for the five tones: middle (0), low (1), falling (2), high (3), rising (4)]
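The two categorizations can be written down as simple lookup tables. This is only an illustrative sketch: the group memberships are taken from the tree diagrams later in this deck (static: tones 0, 1, 3; dynamic: tones 2, 4; upward trend: tones 3, 4; downward trend: tones 0, 1, 2), and the function and constant names are hypothetical.

```python
# Tone numbers and names as used throughout the slides.
TONE_NAMES = {0: "middle", 1: "low", 2: "falling", 3: "high", 4: "rising"}

# Abramson's grouping by contour constancy.
STATIC_TONES = frozenset({0, 1, 3})
DYNAMIC_TONES = frozenset({2, 4})

# Grouping by the final trend of the F0 contour.
UPWARD_TONES = frozenset({3, 4})
DOWNWARD_TONES = frozenset({0, 1, 2})


def constancy_group(tone: int) -> str:
    """Return 'static' or 'dynamic' for a tone number 0..4."""
    if tone in STATIC_TONES:
        return "static"
    if tone in DYNAMIC_TONES:
        return "dynamic"
    raise ValueError(f"unknown tone: {tone}")


def trend_group(tone: int) -> str:
    """Return 'upward' or 'downward' for a tone number 0..4."""
    if tone in UPWARD_TONES:
        return "upward"
    if tone in DOWNWARD_TONES:
        return "downward"
    raise ValueError(f"unknown tone: {tone}")


if __name__ == "__main__":
    for tone, name in TONE_NAMES.items():
        print(tone, name, constancy_group(tone), trend_group(tone))
```

Note that the two groupings cut across each other: the high tone (3) is static but has an upward trend, while the falling tone (2) is dynamic with a downward trend.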
HMM-based speech synthesizer
• Phoneme-based speech unit modeling
• Provides flexible models and efficient adaptation (speaker adaptation, speaking style conversion)
• 1994: K. Tokuda et al. proposed an HMM-based speech synthesizer for Japanese
[Figure: block diagram of the HMM-based speech synthesizer]
Training stage: Speech Database → Speech Signal → Excitation Parameter Extraction and Spectral Parameter Extraction → Excitation Parameter, Spectral Parameter and Label → Training of HMM → Context Dependent HMMs
Synthesis stage: Text → Text Analysis → Label → Parameter Generation from HMM → Excitation Parameter and Spectral Parameter → Excitation Generation → Synthesis Filter → Synthetic Speech
Tree-based context clustering
Construction of contextual factors
Context clustering addresses the problem of limited training data.
Phoneme level
• {preceding, current, succeeding} phonetic type
• {preceding, current, succeeding} part of syllable structure
Syllable level
• {preceding, current, succeeding} tone type
• the number of phones in {preceding, current, succeeding} syllable
• current phone position in current syllable
Word level
• current syllable position in current word
• part of speech
• the number of syllables in {preceding, current, succeeding} word
Phrase level
• current word position in current phrase
• the number of syllables in {preceding, current, succeeding} phrase
Utterance level
• current phrase position in current sentence
• the number of syllables in current sentence
• the number of words in current sentence
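As a rough illustration of how a subset of these contextual factors could be packed into one context-dependent label per phone, here is a minimal sketch. The field selection, the class name `PhoneContext`, and the label layout (`/T:` and `/P:` fields) are assumptions made for this example; the slides do not specify the actual label format used by the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneContext:
    # Phoneme level: phonetic type of the neighbouring and current phones.
    prev_type: str
    cur_type: str
    next_type: str
    # Syllable level: tone type of the neighbouring and current syllables.
    prev_tone: int
    cur_tone: int
    next_tone: int
    # Position factors from the syllable, word, phrase and utterance levels.
    phone_pos_in_syllable: int
    syllable_pos_in_word: int
    word_pos_in_phrase: int
    phrase_pos_in_sentence: int

    def to_label(self) -> str:
        """Serialize the context into a single label string.

        The layout is hypothetical: 'prev-cur+next/T:p_c_n/P:a_b_c_d'.
        """
        return (
            f"{self.prev_type}-{self.cur_type}+{self.next_type}"
            f"/T:{self.prev_tone}_{self.cur_tone}_{self.next_tone}"
            f"/P:{self.phone_pos_in_syllable}_{self.syllable_pos_in_word}"
            f"_{self.word_pos_in_phrase}_{self.phrase_pos_in_sentence}"
        )


if __name__ == "__main__":
    # Illustrative values only.
    ctx = PhoneContext("liquid", "vowel", "stop", 0, 3, 2, 2, 1, 1, 1)
    print(ctx.to_label())
```

Because every distinct combination of factors gives a distinct label, the number of possible models grows far beyond what the training data can cover, which is the data-sparsity problem that context clustering is meant to handle.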
Tree-based context clustering
Design of decision-tree structures
Problem of misshaped F0 contour
[Figure: F0 contours, frequency (Hz, 150 to 250) against time (s, 5.0 to 6.4), of (a) synthesized speech from the clustering style of a single binary tree without tone type questions and (b) natural speech.]
Tree-based context clustering
Design of decision-tree structures
[Figure: four decision-tree structures]
(a) Single binary tree
(b) Separate trees for Tone 0, Tone 1, Tone 2, Tone 3 and Tone 4
(c) Separate trees for static tones (Tone 0, 1, 3) and dynamic tones (Tone 2, 4)
(d) Separate trees for upward trend tones (Tone 3, 4) and downward trend tones (Tone 0, 1, 2)
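One way to picture the four structures is as different partitions of the five tones, where each part gets its own decision tree. The sketch below assumes exactly that reading; the structure names follow the wording of the conclusions, while `SUBTREES`, `subtree_index` and `partition_by_subtree` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Each structure is a list of tone sets; one decision tree per set (assumed reading).
SUBTREES: Dict[str, List[Set[int]]] = {
    "single binary tree": [{0, 1, 2, 3, 4}],
    "simple tone-separated": [{0}, {1}, {2}, {3}, {4}],
    "constancy-based tone-separated": [{0, 1, 3}, {2, 4}],   # static, dynamic
    "trend-based tone-separated": [{3, 4}, {0, 1, 2}],       # upward, downward
}


def subtree_index(structure: str, tone: int) -> int:
    """Return the index of the tree that handles `tone` under `structure`."""
    for index, tones in enumerate(SUBTREES[structure]):
        if tone in tones:
            return index
    raise ValueError(f"tone {tone} is not covered by {structure!r}")


def partition_by_subtree(structure: str, tones: Iterable[int]) -> Dict[int, List[int]]:
    """Group a sequence of syllable tones by the tree that would cluster them."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for tone in tones:
        groups[subtree_index(structure, tone)].append(tone)
    return dict(groups)


if __name__ == "__main__":
    utterance_tones = [0, 3, 2, 2, 4, 1]
    for name in SUBTREES:
        print(name, partition_by_subtree(name, utterance_tones))
```

The more trees a structure has, the less training data each tree receives, which is one plausible reason the grouped structures behave differently from the fully separated one when the corpus is small.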
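Tree-based context clustering
Design of 8 context clustering styles (a)-(h)
Styles (a)-(d) use the four decision-tree structures without tone type questions:
(a) Single binary tree
(b) Separate trees for Tone 0, Tone 1, Tone 2, Tone 3 and Tone 4
(c) Separate trees for static tones (Tone 0, 1, 3) and dynamic tones (Tone 2, 4)
(d) Separate trees for upward trend tones (Tone 3, 4) and downward trend tones (Tone 0, 1, 2)
Styles (e)-(h) add tone type questions to the same structures:
(e) = (a) + tone type questions
(f) = (b) + tone type questions
(g) = (c) + tone type questions
(h) = (d) + tone type questions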
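The eight styles can be enumerated as the combination of four tree structures with the presence or absence of tone type questions. The following sketch assumes the lettering implied by the examples table in this deck ((a) pairs with (e), (b) with (f), (c) with (g), (d) with (h)) and uses structure names borrowed from the conclusions; `clustering_styles` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, Tuple

# Assumed order of the four decision-tree structures (a)-(d).
STRUCTURES = [
    "single binary tree",
    "simple tone-separated",
    "constancy-based tone-separated",
    "trend-based tone-separated",
]


def clustering_styles() -> Dict[str, Tuple[str, bool]]:
    """Map style letters 'a'..'h' to (tree structure, uses tone type questions)."""
    letters = "abcdefgh"
    styles: Dict[str, Tuple[str, bool]] = {}
    # product() varies the structure fastest, so (a)-(d) come without
    # tone questions and (e)-(h) repeat the structures with them.
    for letter, (with_questions, structure) in zip(
        letters, product([False, True], STRUCTURES)
    ):
        styles[letter] = (structure, with_questions)
    return styles


if __name__ == "__main__":
    for letter, (structure, with_questions) in clustering_styles().items():
        suffix = " + tone type questions" if with_questions else ""
        print(f"({letter}) {structure}{suffix}")
```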
System Preparations
1. Sentence structure analysis
2. Word structure analysis
3. Full context labeling
4. Construction of question set for context clustering
5. Feature extraction
[Figure: data flow of the system preparations]
• Inputs: VAJA speech corpus (wav files and label files) and ORCHID text corpus
• Intermediate data: XML files
• Full context labeling produces full-context label files (.lab)
• Feature extraction (mcep, f0) produces parameter files (.cmp)
• The label files and parameter files feed HMM training and synthesis, which outputs synthetic speech
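Step 4, the construction of the question set, can be illustrated for the tone type questions alone. The sketch below emits questions in the `QS "name" { pattern }` form used by HTK's HHEd decision-tree clustering. The label field `/T:prev_cur_next` that the patterns match is a hypothetical layout chosen for this example, not the label format of the original system, and `tone_questions` is an illustrative name.

```python
from typing import List

# Pattern templates for the tone of the preceding (L), current (C) and
# succeeding (R) syllable, matched against an assumed '/T:p_c_n' label field.
POSITION_TEMPLATES = {
    "L": "{t}_?_?",
    "C": "?_{t}_?",
    "R": "?_?_{t}",
}


def tone_questions() -> List[str]:
    """Build one question per (position, tone) pair: 3 positions x 5 tones."""
    questions = []
    for position, template in POSITION_TEMPLATES.items():
        for tone in range(5):
            # '*' matches any string and '?' any single character.
            pattern = "*/T:" + template.format(t=tone) + "*"
            questions.append(f'QS "{position}-Tone=={tone}" {{ {pattern} }}')
    return questions


if __name__ == "__main__":
    for line in tone_questions():
        print(line)
```

Leaving these questions out of the question set corresponds to the clustering styles without tone type questions; including them corresponds to the styles with them.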
Experiments
Evaluation of overall tone correctness
[Figure: nine panels (a)-(i), frequency (Hz, 150 to 250) against time (s, 5.0 to 6.4)]
Figure 5: F0 contours of synthesized speech from 8 different clustering styles, (a)-(h), and F0 contour of natural speech, (i).
Experiments
Evaluation of overall tone correctness
[Figure: error percentage (0 to 50) against number of training utterances (100, 200, 300, 400, 500, 1000, 1500, 2000, 2500) for styles (a), (b), (c), (d)]
Figure 6: Tone error percentages of synthesized speech from 4 different clustering styles
Experiments
Evaluation of overall tone correctness
[Figure: error percentage (0 to 50) against number of training utterances (100, 200, 300, 400, 500, 1000, 1500, 2000, 2500) for styles (a), (b), (c), (d), (e), (f), (g), (h)]
Figure 7: Tone error percentages of synthesized speech from 8 different clustering styles
[Pie chart: tone 0 38%, tone 1 22%, tone 2 17%, tone 3 15%, tone 4 8%]
Experiments
Evaluation of tone correctness for each tone type
[Figure: eight panels (a)-(h), error percentage against number of training utterances (100, 300, 500, 1500, 2500), with one series each for tone 0, tone 1, tone 2, tone 3, tone 4]
Figure 8: Tone error percentages of synthesized speech from 8 different clustering styles categorized by tone types
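The slides do not state how the tone error percentages were computed. As a minimal sketch, assume each evaluated syllable comes with its intended tone and the tone judged from the synthesized speech; the overall and per-tone error percentages then follow by counting mismatches. The function name and input format are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tone_error_percentages(
    pairs: Iterable[Tuple[int, int]]
) -> Tuple[float, Dict[int, float]]:
    """Compute overall and per-tone error percentages.

    `pairs` holds (intended_tone, judged_tone) for each syllable; this
    input format is an assumption, not taken from the slides.
    """
    totals: Counter = Counter()
    errors: Counter = Counter()
    for intended, judged in pairs:
        totals[intended] += 1
        if judged != intended:
            errors[intended] += 1

    per_tone = {
        tone: 100.0 * errors[tone] / totals[tone] for tone in sorted(totals)
    }
    syllables = sum(totals.values())
    overall = 100.0 * sum(errors.values()) / syllables if syllables else 0.0
    return overall, per_tone


if __name__ == "__main__":
    judgements = [(0, 0), (0, 1), (2, 2), (4, 3)]
    overall, per_tone = tone_error_percentages(judgements)
    print(f"overall: {overall:.1f}%")
    for tone, percentage in per_tone.items():
        print(f"tone {tone}: {percentage:.1f}%")
```

Reporting the per-tone figures alongside the overall one matters because the tones are not equally frequent, so a rare tone can have a high error rate without moving the overall percentage much.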
Experiments
Evaluation of syllable duration distortion
[Figure: score (%, 0 to 80) against number of training utterances (100, 300, 500, 1500, 2500) for styles (e), (f), (g), (h)]
Data labels recovered from the chart, as four series of five values (the assignment of series to styles is not recoverable from the extracted text):
• 71, 60, 55, 53, 51
• 11, 28, 42, 47, 49
• 61, 57, 49, 49, 48
• 56, 55, 54, 51, 52
Figure 9: Scores of a paired-comparison test for natural duration among 4 different clustering styles
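Examples of synthesized speech
Female voice, by corpus size
Method | Corpus size (number of training utterances) | Examples
HMM | 100 | 1, 2, 3
HMM | 500 | 1, 2, 3
HMM | 2500 | 1, 2, 3
VAJA (Unit Selection) | | 1, 2, 3
Analysis-Synthesis speech | | 1, 2, 3
Female voice, by clustering style
Method | Tree structure | Add tone question set
HMM | (a) | (e)
HMM | (b) | (f)
HMM | (c) | (g)
HMM | (d) | (h)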
Conclusions
• This paper analyzed tree-based context clustering in an HMM-based Thai speech synthesis system.
• Four decision-tree structures were designed according to tone groups and tone types to obtain higher tone correctness in the synthesized speech.
• The results show that the tone-separated tree structures significantly reduce the tone error percentage of the synthesized speech compared to the single binary tree structure.
• Using the contextual tone information at the syllable level improves the tone correctness for all decision-tree structures.
• Some distortion of syllable duration appears when the simple tone-separated tree context clustering is used with a small amount of training data. It can be relieved by using the constancy-based-tone-separated or the trend-based-tone-separated tree context clustering.
• Future work: analysis of the tone correctness of the average-voice-based speech model, and intonation analysis.