Text-to-Speech Lin Cheng Yuan 2014,6,18. Agenda TTS Introduction HTS Q & A
Text-to-Speech
Lin Cheng Yuan
2014,6,18
Agenda
• TTS Introduction
• HTS
• Q & A
Stephen Hawking’s Voice
Professor Stephen Hawking selects NeoSpeech Text-to-Speech as his new voice. (Mar. 15, 2004)
TTS Software Comparison
http://en.wikipedia.org/wiki/Comparison_of_speech_synthesizers
Unit-selection synthesis (USS) (1/2)
[Figure: block diagram of a unit-selection system.
Training part: SPEECH DATABASE, speech signal, prosodic parameter extraction, spectral parameter extraction, prosodic parameters, spectral parameters, transcriptions, Build USS system, unit database & cost functions.
Synthesis part: TEXT, text analysis, prosody prediction, unit-selection algorithm, signal processing, SYNTHESIZED SPEECH.]
Unit-selection synthesis (USS) (2/2)
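Unit-selection algorithm [Hunt; ’96]
Major techniques
– General selection technique [Hunt; ’96]
– Clustering-based technique [Donovan; ’95]

Target cost between target unit t_i and candidate unit u_i:
\[ C^t(t_i, u_i) = \sum_{j=1}^{p} w_j^t \, C_j^t(t_i, u_i) \]
Concatenation cost between consecutive candidate units u_{i-1} and u_i:
\[ C^c(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^c \, C_k^c(u_{i-1}, u_i) \]
Total cost of a candidate unit sequence:
\[ C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i) \]
Selected unit sequence:
\[ \hat{u}_1^n = \arg\min_{u_1^n} C(t_1^n, u_1^n) \]
[Figure: target units t_{i-1}, t_i, t_{i+1} aligned with candidate units u_{i-1}, u_i, u_{i+1}; the target cost links each target unit to its candidate unit, and the concatenation cost links adjacent candidate units.]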
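The search for the unit sequence with the lowest total of target and concatenation costs can be sketched as a dynamic-programming (Viterbi-style) pass over the candidate lists. This is only a minimal illustration: the function name `select_units`, the toy numeric "units", and the two cost functions are assumptions made for the example, not part of any toolkit mentioned in these slides.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Return (units, total_cost) minimizing target + concatenation costs.

    targets:    list of n target specifications
    candidates: list of n lists of candidate units (one list per target)
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        layer = []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate of the previous target.
            prev_cost, prev_idx = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            layer.append((prev_cost + target_cost(targets[i], u), prev_idx))
        best.append(layer)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example (hypothetical): units and targets are plain pitch values.
toy_targets = [100, 120, 110]
toy_candidates = [[95, 130], [118, 90], [111, 140]]
units, cost = select_units(
    toy_targets,
    toy_candidates,
    target_cost=lambda t, u: abs(t - u),
    concat_cost=lambda a, b: 0.1 * abs(a - b),
)
```

In a real system the cost functions would be weighted sums of sub-costs over prosodic and spectral features, as in the formulas above; the search itself stays the same.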
HMM-based speech synthesis system
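[Figure: block diagram of the HMM-based speech synthesis system.
Training part: SPEECH DATABASE → speech signal → excitation parameter extraction and spectral parameter extraction; the excitation parameters, spectral parameters, and labels are used for Training HMMs, which yields context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters and spectral parameters; excitation parameters → excitation generation → excitation → synthesis filter (controlled by the spectral parameters) → SYNTHESIZED SPEECH.]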
HTS History
December 25, 2002
HTS version 1.0 was released.
December 29, 2006
HTS version 2.0 was released.
October 1, 2007
HTS version 2.0.1 and hts_engine_API version 0.9 were released.
July 7, 2011
HTS version 2.2 was released.
December 25, 2012 (latest)
HTS version 2.3 alpha was released to the hts-users ML members.
More detail:
http://hts.sp.nitech.ac.jp/?History#v5285cba
Before HTS
• HTK
  – The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.
  – Latest version: 3.4.1 (2009)
• SPTK
  – The Speech Signal Processing Toolkit (SPTK) is a suite of speech signal processing tools.
  – Latest version: 3.7 (2013)
• hts_engine API
  – hts_engine is software that synthesizes speech waveforms from HMMs trained by the HMM-based speech synthesis system (HTS).
  – Latest version: 1.08 (2013)
• Optional:
  – Festival
  – STRAIGHT
HTS Process
Speech database
– At least 2 hours of speech is suggested
Label file generation
– A good text analysis tool is a must
Feature extraction
– mcp, lf0, cmp, ...
Model training
– Using a modified HTK 3.4.1 with the HTS patch
HTS Installation
HTS Installation (1) – HTS for HTK patch
HTS Installation (2) – Normal Demo
1. Install Festival, SPTK, HTS, and hts_engine
2. Configure
3. Make
HTS Directory Structure Before Installation
HTS Directory Structure After Installation (include training)
[Figure: directory tree after installation, with the labels Data, Scripts, Results, and training.]
Voices Data
– .pdf (probability density function file)
  • Binary-formatted
  • Statistics of model parameters
– .winX (window function)
  • Text-formatted
  • Not actually used
– .inf (decision tree file)
  • Text-formatted
  • Selected questions
  • Tree topology
Text-To-Speech using hts_engine
HTS Training Data - Labels
labels/mono : Mono-phone (context-independent) label files
– Labels need to include time information.
labels/full : Context-dependent label files
– Time information can be omitted.
labels/gen : Context-dependent label files for test
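As a rough illustration of what "time information" means in a label file, the sketch below reads one line of an HTK-style label file, where a timed line has the form `start end label` with times in units of 100 ns and an untimed line holds only the label. The function name `parse_label_line` and the sample label strings are hypothetical.

```python
def parse_label_line(line):
    """Parse one HTK-style label line.

    Returns (start_seconds, end_seconds, label); the two times are None
    when the line carries no time information.
    """
    parts = line.split()
    if len(parts) == 3:
        # HTK label times are integers in units of 100 ns (1e-7 s).
        start = int(parts[0]) / 10_000_000
        end = int(parts[1]) / 10_000_000
        return start, end, parts[2]
    if len(parts) == 1:
        return None, None, parts[0]
    raise ValueError("unexpected label line: " + repr(line))


timed = parse_label_line("0 2500000 sil")   # e.g. a line from labels/mono
untimed = parse_label_line("sil-a+n")       # e.g. a label without times
```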
How To Get Label Files?
Label information:
– Raw Text Input
– Markup Parser (XML tag)
– Text Converter
  • Sentence Segmentation
  • Text Normalization (e.g. 06/18 → 六月十八號, "June 18th")
– Linguistic Analyzer
  • Word Segmentation & POS (Part-of-Speech) tagging
  • Word Formation (構詞)
    – Quantifier compounds (定量式複合詞), e.g. 二月, 十本, 六萬
    – Reduplicated words (重疊詞), e.g. 慢慢看, 快快樂樂
  • Heteronym processing (破音字處理)
    – e.g. 「著」火 vs. 「著」名; 「還」來 vs. 「還」錢來
  • Tone sandhi (變調)
    – e.g. 「一」定 (second tone) vs. 第「一」名 (first tone)
    – e.g. 「不」要 (second tone) vs. 「不」可 (fourth tone)
Time information:
– HTK forced alignment + manual revisions (optional)
[Figure labels: Time, Tone, Syllable, Word, POS, ...]
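The tone sandhi examples for 「一」 and 「不」 follow a simple rule that depends on the tone of the next syllable. The sketch below encodes a simplified version of that rule; the function name `sandhi_tone` and its parameters are hypothetical, the tone of the following syllable is assumed to be already known from the lexicon, and neutral-tone followers and other exceptions are ignored.

```python
def sandhi_tone(char, next_tone, ordinal=False):
    """Return the surface tone (1 to 4) of 一 or 不.

    next_tone: lexical tone of the following syllable, or None if there is none
    ordinal:   True when 一 is used as an ordinal or counted digit (e.g. 第一名)
    """
    if char == "不":
        # 不 is lexically fourth tone; it becomes second tone before a fourth tone.
        return 2 if next_tone == 4 else 4
    if char == "一":
        if ordinal or next_tone is None:
            return 1  # citation tone is kept
        # Second tone before a fourth tone, fourth tone before tones 1 to 3.
        return 2 if next_tone == 4 else 4
    raise ValueError("rule only covers 一 and 不")


yi_ding = sandhi_tone("一", 4)                 # 一定: 定 is fourth tone
di_yi = sandhi_tone("一", 2, ordinal=True)     # 第一名: ordinal use
bu_yao = sandhi_tone("不", 4)                  # 不要: 要 is fourth tone
bu_ke = sandhi_tone("不", 3)                   # 不可: 可 is third tone
```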
HTS Training Data - Feature Data
o /data/mcp :
  o mel-cepstral features
  o Extracted with the SPTK toolkit (mcep)
o /data/lf0 :
  o logarithmic fundamental frequency
  o Extracted with the SPTK toolkit (pitch)
o /data/cmp :
  o composite features (MCP + LF0)
  o Built with Perl scripts (data/scripts)
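To make the relation between the three feature types concrete, here is a small sketch that turns an F0 contour into log F0 and joins it frame by frame with mel-cepstral vectors. It is an illustration only: the function names are hypothetical, the value used to mark unvoiced frames (`-1.0e10`) is an assumption about the convention, and a real cmp file also contains the dynamic (delta) features, which are omitted here.

```python
import math

UNVOICED = -1.0e10  # assumed marker for frames without F0


def f0_to_lf0(f0_contour):
    """Convert F0 values in Hz to log F0; zero or negative F0 means unvoiced."""
    return [math.log(f0) if f0 > 0 else UNVOICED for f0 in f0_contour]


def compose_frames(mcp_frames, lf0_contour):
    """Concatenate each mel-cepstral frame with the log F0 of the same frame."""
    if len(mcp_frames) != len(lf0_contour):
        raise ValueError("mcp and lf0 must have the same number of frames")
    return [list(mcp) + [lf0] for mcp, lf0 in zip(mcp_frames, lf0_contour)]


lf0 = f0_to_lf0([0.0, 100.0])
cmp_frames = compose_frames([[0.1, 0.2], [0.3, 0.4]], lf0)
```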
HTS Training Data - Question Set
Example:
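As a rough illustration of how a question set is applied, the sketch below parses one question written in HTK's `QS "name" {pattern,...}` form and checks whether a label answers it. The question `C-Vowel`, its patterns, and the short labels are made up for the example, and Python's `fnmatch` is used as a stand-in for HTK's own `*` and `?` wildcard matching (note that `fnmatch` also treats `[` specially).

```python
import fnmatch
import re


def parse_question(line):
    """Split a QS line into its name and its list of wildcard patterns."""
    m = re.match(r'\s*QS\s+"([^"]+)"\s*\{([^}]*)\}', line)
    if m is None:
        raise ValueError("not a QS line: " + line)
    patterns = [p.strip().strip('"') for p in m.group(2).split(",") if p.strip()]
    return m.group(1), patterns


def answers_yes(patterns, label):
    """A label answers 'yes' if it matches at least one pattern."""
    return any(fnmatch.fnmatchcase(label, p) for p in patterns)


# Hypothetical question: "is the current phone one of a, i, u?"
q_name, q_patterns = parse_question('QS "C-Vowel" {*-a+*,*-i+*,*-u+*}')
```

During decision-tree clustering, each node of the tree holds one such question, and the yes/no answer decides which branch a context-dependent label follows.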
HTS Training Data - Others
o /data/scp
  o train.scp for training
  o gen.scp for synthesis
o /data/lists
  o Full.list for all unique labels of training data
  o Full_all.list for all unique labels of training data and test data
o /data/win
  o parameters used in feature extraction
  o lf0.win1 ~ 3, mcp.win1 ~ 3
Introduction to the Basic Principles of HTS
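MCP (Mel-Cepstral Parameter) Delta
– Taking mcp.win2 as an example, its content is "3 -0.5 0.0 0.5" (3 coefficients: -0.5, 0.0, 0.5)
[Figure: an MCP sequence with M dimensions and K frames (N=1 ... N=K). The coefficients -0.5, 0.0, 0.5 are applied to frames N=1, N=2, N=3 and summed to give the MCP delta at frame N=2; repeating this for every frame gives an MCP delta sequence with M dimensions and K frames.]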
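The window operation can be written out in a few lines. The sketch below parses the window text "3 -0.5 0.0 0.5" and applies it to a sequence of frames, so that the delta at frame t is 0.5 * c[t+1] - 0.5 * c[t-1]. The function names are hypothetical, and repeating the first and last frame at the edges is an assumption about boundary handling.

```python
def parse_window(text):
    """Parse a window definition: a count followed by that many coefficients."""
    fields = text.split()
    count = int(fields[0])
    return [float(x) for x in fields[1:1 + count]]


def apply_window(frames, window):
    """Apply an odd-length window along the time axis of K frames x M dims."""
    half = len(window) // 2
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[t])
        for offset, coeff in zip(range(-half, half + 1), window):
            # Clamp the index so edge frames are repeated (assumed behaviour).
            src = min(max(t + offset, 0), last)
            for dim, value in enumerate(frames[src]):
                acc[dim] += coeff * value
        result.append(acc)
    return result


delta_window = parse_window("3 -0.5 0.0 0.5")
deltas = apply_window([[1.0], [2.0], [4.0]], delta_window)
```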
NGASR Workshop 2010 (2010/1/25)
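HTS - Feature Representation
[Figure: composition of the observation vector (cmp). Window functions add Δ and ΔΔ features to the spectral parameters (MCP) and to the excitation parameters (LF0).]
– Stream 1: MCP, Δ-MCP, ΔΔ-MCP (M × 1 each, 3M × 1 in total); one Gaussian per state
– Stream 2: LF0 (1 × 1)
– Stream 3: Δ-LF0 (1 × 1)
– Stream 4: ΔΔ-LF0 (1 × 1)
– Streams 2 to 4: two mixtures per state, “voiced” and “unvoiced”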
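Observation of F0
[Figure: log frequency plotted against time; F0 values appear only in voiced regions.]
F0 cannot be modeled by a continuous or a discrete distribution alone ⇒ Multi-space probability distribution HMM (MSD-HMM)
– Voiced: \Omega_1 = R^1 (one-dimensional space)
– Unvoiced: \Omega_2 = R^0 (zero-dimensional space)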
MSD-HMM for F0 modeling
[Figure: a 3-state HMM for F0 (states 1, 2, 3). Each state i has a voiced distribution on \Omega_1 = R^1 with weight w_{1,i} and an unvoiced distribution on \Omega_2 = R^0 with weight w_{2,i} (voiced / unvoiced weights).]
Enables modeling with dynamic F0 features
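A state-output probability under the multi-space idea can be sketched as follows: a voiced frame is scored by the voiced weight times a one-dimensional Gaussian density, and an unvoiced frame (which carries no value) is scored by the unvoiced weight alone. The function name and parameters are hypothetical, and a single Gaussian for the voiced space is assumed.

```python
import math


def msd_output_prob(log_f0, voiced_weight, mean, variance):
    """Output probability of one F0 observation for one state.

    log_f0: a float for a voiced frame, or None for an unvoiced frame
    voiced_weight: weight of the voiced space; the unvoiced weight is 1 - voiced_weight
    """
    if log_f0 is None:
        # Zero-dimensional space: only the space weight contributes.
        return 1.0 - voiced_weight
    density = math.exp(-0.5 * (log_f0 - mean) ** 2 / variance) / math.sqrt(
        2.0 * math.pi * variance
    )
    return voiced_weight * density


p_unvoiced = msd_output_prob(None, voiced_weight=0.8, mean=5.0, variance=0.04)
p_voiced = msd_output_prob(5.0, voiced_weight=0.8, mean=5.0, variance=0.04)
```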
Structure of F0 state-output distributions
[Figure: the Log F0 part of each state's output consists of p, Δp, and Δ²p; each of the three is modeled by an MSD (Gaussian & discrete) with a voiced and an unvoiced space.]
Generated F0
[Figure: three panels plotting frequency (50 to 200) against time (0 to 2 s): natural speech, F0 generated without dynamic features, and F0 generated with dynamic features.]
Speech parameter generation algorithm [Tokuda; ’00]
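For a given HMM \lambda, determine the speech parameter vector sequence
\[ o = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top \]
which maximizes
\[ P(o \mid \lambda) = \sum_{q} P(o \mid q, \lambda)\, P(q \mid \lambda) \approx \max_{q} P(o \mid q, \lambda)\, P(q \mid \lambda) \]
⇒
\[ \hat{q} = \arg\max_{q} P(q \mid w, \lambda) \]
\[ \hat{o} = \arg\max_{o} P(o \mid \hat{q}, \lambda) \]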
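Once the state sequence is fixed, maximizing the output probability over the static features, subject to the delta constraints, reduces to a weighted least-squares problem. The sketch below shows this for a single one-dimensional feature, using the delta window (-0.5, 0.0, 0.5). Everything here is a simplified assumption for illustration: diagonal variances, no delta-delta stream, delta constraints skipped at the first and last frame, and a plain Gaussian-elimination solver instead of the banded solvers used in practice.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def generate_trajectory(static_mean, static_var, delta_mean, delta_var):
    """Most likely static trajectory given per-frame static and delta Gaussians."""
    frames = len(static_mean)
    rows, means, precisions = [], [], []
    for t in range(frames):
        row = [0.0] * frames
        row[t] = 1.0                       # static feature: c[t]
        rows.append(row)
        means.append(static_mean[t])
        precisions.append(1.0 / static_var[t])
        if 0 < t < frames - 1:
            row = [0.0] * frames
            row[t - 1], row[t + 1] = -0.5, 0.5   # delta: 0.5*c[t+1] - 0.5*c[t-1]
            rows.append(row)
            means.append(delta_mean[t])
            precisions.append(1.0 / delta_var[t])
    # Normal equations: (W' P W) c = W' P mu, with P the diagonal precisions.
    lhs = [[sum(p * r[i] * r[j] for r, p in zip(rows, precisions))
            for j in range(frames)] for i in range(frames)]
    rhs = [sum(p * r[i] * mu for r, p, mu in zip(rows, precisions, means))
           for i in range(frames)]
    return solve_linear(lhs, rhs)


# A ramp whose static and delta means agree is reproduced exactly.
ramp = generate_trajectory([0.0, 1.0, 2.0, 3.0, 4.0], [1.0] * 5,
                           [0.0, 1.0, 1.0, 1.0, 0.0], [0.1] * 5)
# A step in the static means is smoothed when the delta means are zero.
step = generate_trajectory([0.0, 0.0, 0.0, 10.0, 10.0, 10.0], [1.0] * 6,
                           [0.0] * 6, [0.01] * 6)
```

Without the delta rows the solution would simply equal the static means, which is the step-like trajectory that dynamic features are meant to avoid.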
Structure of state-output distributions
[Figure: the output vector o_t of each state has a spectrum part and an excitation part. Spectrum: mel-cepstrum c_t, Δc_t, Δ²c_t. Excitation: log F0 p_t, Δp_t, Δ²p_t, each with a voiced and an unvoiced space.]
Determination of state sequence (1/3)
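Determine state sequence via determining state durations
[Figure: a 3-state left-to-right HMM with self-transitions a_{11}, a_{22}, a_{33}, forward transitions a_{12}, a_{23}, and output probabilities b_1(o_t), b_2(o_t), b_3(o_t). The observation sequence o = o_1, o_2, ..., o_T is aligned with the state sequence q (1 1 1 1 2 2 ... 3 3), whose state durations are d = 4, 10, 5.]
a_{ij} : State transition probability
b_q(o_t) : Output probability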
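The step from state durations to a state sequence is mechanical, as the short sketch below shows. The function names are hypothetical, and taking each duration as the rounded mean of its duration model (with a minimum of one frame) is an assumption made for this example.

```python
def durations_from_means(duration_means):
    """Pick an integer duration per state: the rounded mean, at least 1 frame."""
    return [max(1, int(mean + 0.5)) for mean in duration_means]


def expand_state_sequence(durations):
    """Repeat each state index (1-based) for its number of frames."""
    sequence = []
    for state, frames in enumerate(durations, start=1):
        sequence.extend([state] * frames)
    return sequence


durations = durations_from_means([4.2, 9.6, 5.0])
state_sequence = expand_state_sequence(durations)
```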
Relation between two approaches (2/3)
[Figure: the two pipelines side by side.
Unit selection (serial): Text → Text analysis → Dur. → F0 → unit selection over units and features → Synthesis → SYNTHESIZED SPEECH.
HMM-based (parallel): Text → Text analysis → Labels → Dur., Spect., and F0 generated in parallel → Synthesis → SYNTHESIZED SPEECH.]
Comparison of two approaches
Unit selection
– Advantage: High quality at the waveform level (especially in a limited domain)
– Disadvantage:
  • Large footprints
  • Discontinuous
  • Unstable quality
– Fixed voice characteristics
HMM-based synthesis
– Disadvantage: Vocoder sound (domain-independent)
– Advantage:
  • Small footprints
  • Smooth
  • Stable quality
– Various voice characteristics
Text Normalization Examples
– 1750 as seventeen-fifty (if year) or one-thousand seven-hundred and fifty (if measure).
– 1/3 as January third (if date) or one third (if fraction).
– St. can be saint or street.
– $100 million as one hundred million dollars, not as one hundred dollars million.
– RAM is usually pronounced as a word and NBA letter by letter.
– Chapter III should be read as Chapter three.
– US means United States, UK means United Kingdom, how about USE?
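Two of these cases, the currency amount and the Roman numeral, are regular enough to handle with pattern rules. The sketch below is a toy normalizer under stated assumptions: the function names are hypothetical, only amounts below 1000 followed by "million" or "billion" and only the word "Chapter" are covered, and the context-dependent cases (year vs. measure, date vs. fraction, saint vs. street) are not attempted.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def number_to_words(n):
    """Spell out an integer in the range 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    rest = n % 100
    return ONES[n // 100] + " hundred" + (" and " + number_to_words(rest) if rest else "")


def roman_to_int(numeral):
    """Convert a Roman numeral; a smaller value before a larger one is subtracted."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        following = ROMAN[numeral[i + 1]] if i + 1 < len(numeral) else 0
        total += -value if value < following else value
    return total


def _chapter(match):
    number = roman_to_int(match.group(1))
    # Leave the text untouched if the number is outside the spelled-out range.
    return match.group(0) if number >= 1000 else "Chapter " + number_to_words(number)


def normalize(text):
    """Rewrite '$N million' and 'Chapter <Roman numeral>' into spoken form."""
    text = re.sub(
        r"\$(\d{1,3}) (million|billion)",
        lambda m: number_to_words(int(m.group(1))) + " " + m.group(2) + " dollars",
        text,
    )
    return re.sub(r"\bChapter ([IVXLC]+)\b", _chapter, text)


spoken_price = normalize("It costs $100 million.")
spoken_chapter = normalize("See Chapter III")
```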
Hybrid approaches (3/4)
Smoothing units
Smooth unit boundaries using HMM statistics
– Microsoft [Plumpe; ’98], OGI [Wouters; ’00]
[Figure: TEXT → Text analysis → Unit selection & concatenation → spectra & state sequences → Spectral smoother → smoothed spectra → Waveform synthesizer → SYNTHESIZED SPEECH.]