A Bayesian Approach to HMM-Based Speech Synthesis
![Page 1: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/1.jpg)
A Bayesian Approach to HMM-Based Speech Synthesis
Kei Hashimoto¹, Heiga Zen¹, Yoshihiko Nankaku¹, Takashi Masuko², and Keiichi Tokuda¹
¹ Nagoya Institute of Technology
² Tokyo Institute of Technology
![Page 2: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/2.jpg)
Background
- HMM-based speech synthesis system
  - Spectrum, excitation, and duration are modeled
  - Speech parameter seqs. are generated
- Maximum likelihood (ML) criterion
  - Train HMMs and generate speech parameters
  - Point estimate ⇒ The over-fitting problem
- Bayesian approach
  - Estimate posterior dist. of model parameters
  - Prior information can be used
  - ⇒ Alleviates the over-fitting problem
![Page 3: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/3.jpg)
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Bayesian context clustering
  - Prior distribution using cross validation
- Experiments
- Conclusion & Future work
![Page 4: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/4.jpg)
Bayesian speech synthesis (1/2)
Model training and speech synthesis
- ML criterion
- Bayesian criterion
Quantities involved:
- Model parameters
- Label seq. for synthesis
- Label seq. for training
- Training data seq.
- Synthesis data seq.
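In symbols, the two criteria can be sketched as follows. The symbol names are assumptions, since the slide's own symbols are not preserved in this transcript: $\Lambda$ for the model parameters, $s$ and $S$ for the label sequences for synthesis and training, $O$ for the training data sequence, and $o$ for the synthesis data sequence.

```latex
% ML: a point estimate of the parameters, followed by generation
\hat{\Lambda}_{\mathrm{ML}} = \arg\max_{\Lambda} \, p(O \mid S, \Lambda),
\qquad
\hat{o}_{\mathrm{ML}} = \arg\max_{o} \, p(o \mid s, \hat{\Lambda}_{\mathrm{ML}})

% Bayes: generation directly from the predictive distribution
\hat{o}_{\mathrm{Bayes}} = \arg\max_{o} \, p(o \mid s, O, S)
```

Under this reading, the ML route commits to a single parameter value before synthesis, while the Bayesian route never fixes the parameters and conditions on the training data directly.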
![Page 5: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/5.jpg)
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood)
Quantities involved:
- HMM state seq. for synthesis data
- HMM state seq. for training data
- Likelihood of synthesis data
- Likelihood of training data
- Prior distribution for model parameters
Variational Bayesian method [Attias; ’99]
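Written out, the predictive distribution marginalizes over both state sequences and over the model parameters. A sketch, assuming the symbol names $o$ and $O$ (synthesis and training data), $s$ and $S$ (their label sequences), $z$ and $Z$ (their HMM state sequences), and $\Lambda$ (model parameters); none of these names are preserved on the slide:

```latex
p(o \mid s, O, S) = \frac{p(o, O \mid s, S)}{p(O \mid S)},
\qquad
p(o, O \mid s, S)
  = \sum_{z} \sum_{Z} \int
      p(o, z \mid s, \Lambda)\,   % likelihood of synthesis data
      p(O, Z \mid S, \Lambda)\,   % likelihood of training data
      p(\Lambda)\,                % prior distribution for model parameters
    d\Lambda
```

The sum over state sequences combined with the integral over $\Lambda$ has no closed form for HMMs, which is the usual motivation for a variational approximation.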
![Page 6: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/6.jpg)
Variational Bayesian method (1/2)
Estimate approximate posterior dist. ⇒ Maximize a lower bound (Jensen’s inequality)
- Expectation w.r.t. the approximate distribution
- Approximate distribution of the true posterior distribution
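The bound can be sketched as below. The notation is assumed rather than taken from the slide: $Q(z, Z, \Lambda)$ is the approximate posterior over the synthesis state sequence $z$, the training state sequence $Z$, and the model parameters $\Lambda$; $\langle \cdot \rangle_{Q}$ is an expectation with respect to $Q$; and $\mathcal{F}$ names the lower bound.

```latex
\log p(o, O \mid s, S)
  = \log \sum_{z} \sum_{Z} \int Q(z, Z, \Lambda)\,
      \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
           {Q(z, Z, \Lambda)}\, d\Lambda
  \ge \left\langle
        \log \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
                  {Q(z, Z, \Lambda)}
      \right\rangle_{Q(z, Z, \Lambda)}
  \equiv \mathcal{F}
```

The inequality is Jensen's inequality applied to the concave logarithm. The gap between the two sides is the KL divergence from $Q$ to the true posterior, so maximizing $\mathcal{F}$ over $Q$ drives $Q$ toward that posterior.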
![Page 7: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/7.jpg)
Variational Bayesian method (2/2)
- Random variables are assumed to be statistically independent
- Optimal posterior distributions (with normalization terms)
- Iterative updates, as in the EM algorithm
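The alternating structure of these updates can be illustrated on a much smaller model than an HMM. The sketch below is a stand-in, not the slide's method: it runs mean-field variational Bayes for a single univariate Gaussian with unknown mean and precision, assuming a Normal-Gamma prior. The function name `vb_gaussian` and all hyperparameter names are hypothetical.

```python
import numpy as np


def vb_gaussian(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau).

    Assumed prior: mu | tau ~ N(mu0, 1/(kappa0 * tau)), tau ~ Gamma(a0, rate=b0).
    Assumed factorization: q(mu, tau) = q(mu) q(tau), with
    q(mu) = N(m, 1/lam) and q(tau) = Gamma(a, rate=b).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()

    # These two quantities do not change across iterations.
    m = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    a = a0 + (n + 1) / 2.0

    e_tau = a0 / b0  # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current E[tau].
        lam = (kappa0 + n) * e_tau
        # Update q(tau) given the current q(mu):
        # E[(x_i - mu)^2] = (x_i - m)^2 + 1/lam under q(mu).
        e_sq_data = np.sum((x - m) ** 2) + n / lam
        e_sq_prior = kappa0 * ((m - mu0) ** 2 + 1.0 / lam)
        b = b0 + 0.5 * (e_sq_data + e_sq_prior)
        e_tau = a / b
    return m, lam, a, b
```

Each pass updates one factor while holding the other fixed, which mirrors the E-step and M-step alternation of EM, except that every update produces a distribution rather than a point value.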
![Page 8: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/8.jpg)
Approximation for speech synthesis
- The posterior distribution of model parameters is dependent on synthesis data ⇒ Huge computational cost in the synthesis part
- Ignore the dependency on synthesis data ⇒ Estimation from only training data
![Page 9: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/9.jpg)
Prior distribution
- Conjugate prior distribution ⇒ Posterior dist. belongs to the same family of dist. as the prior dist.
- Determination using statistics of prior data
  - Dimension of feature
  - Covariance of prior data
  - # of prior data
  - Mean of prior data
Equations on the slide: likelihood function, conjugate prior distribution
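A minimal sketch of how conjugacy works for a Gaussian likelihood, assuming a Gauss-Wishart prior on the mean and precision matrix. The mapping from prior-data statistics to hyperparameters in `prior_from_statistics` is one hypothetical choice, not necessarily the one used on the slide, and all function and key names are illustrative.

```python
import numpy as np


def prior_from_statistics(mean_prior, cov_prior, n_prior):
    """Build Gauss-Wishart hyperparameters from prior-data statistics.

    Assumption: the prior data act as n_prior pseudo-observations, and the
    degrees of freedom are offset by the dimension so the Wishart is proper.
    """
    mean_prior = np.asarray(mean_prior, dtype=float)
    cov_prior = np.asarray(cov_prior, dtype=float)
    d = mean_prior.shape[0]
    return {
        "m": mean_prior,                 # prior mean
        "beta": float(n_prior),          # strength of belief in the mean
        "nu": float(n_prior + d),        # Wishart degrees of freedom
        "W_inv": n_prior * cov_prior,    # inverse Wishart scale matrix
    }


def posterior_update(prior, x):
    """Standard conjugate update; the result has the same form as the prior."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n = x.shape[0]
    xbar = x.mean(axis=0)
    scatter = (x - xbar).T @ (x - xbar)
    beta = prior["beta"] + n
    diff = xbar - prior["m"]
    return {
        "m": (prior["beta"] * prior["m"] + n * xbar) / beta,
        "beta": beta,
        "nu": prior["nu"] + n,
        "W_inv": prior["W_inv"] + scatter
        + (prior["beta"] * n / beta) * np.outer(diff, diff),
    }
```

Because the posterior is described by the same four hyperparameters as the prior, it can be fed back in as the prior for a further batch of data.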
![Page 10: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/10.jpg)
Speech parameter generation
- Speech parameter
  - Consists of static and dynamic features
  - ⇒ Only the static feature seq. is generated
- Speech parameter generation based on the Bayesian approach
  - ⇒ Maximize the lower bound
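Generating only the static sequence under dynamic-feature constraints reduces to a linear system. The sketch below shows the standard formulation for a one-dimensional static feature with diagonal covariances: the full observation is written as `W @ c`, and the static sequence `c` solves `(W^T P W) c = W^T P mu`. The window coefficients and the function names are assumptions. In a Bayesian variant, `means` and `precisions` would presumably hold expected values under the posterior rather than point estimates.

```python
import numpy as np

# Assumed windows: static, delta, and delta-delta (common choices).
WINDOWS = [
    np.array([1.0]),
    np.array([-0.5, 0.0, 0.5]),
    np.array([1.0, -2.0, 1.0]),
]


def build_window_matrix(n_frames, windows):
    """Return W with shape (n_frames * n_windows, n_frames), frame-major rows."""
    n_win = len(windows)
    W = np.zeros((n_frames * n_win, n_frames))
    for t in range(n_frames):
        for k, win in enumerate(windows):
            half = len(win) // 2
            for j, coef in enumerate(win):
                tau = t + j - half
                if 0 <= tau < n_frames:  # drop taps that fall off the edges
                    W[t * n_win + k, tau] = coef
    return W


def generate_static(means, precisions, windows=WINDOWS):
    """Solve for the static sequence.

    means, precisions: arrays of shape (n_frames, n_windows) holding the
    per-frame Gaussian mean and precision of each static/dynamic component.
    """
    n_frames = means.shape[0]
    W = build_window_matrix(n_frames, windows)
    P = np.diag(precisions.reshape(-1))
    A = W.T @ P @ W
    b = W.T @ P @ means.reshape(-1)
    return np.linalg.solve(A, b)
```

A dense solve is used for brevity; the system matrix is banded, so practical implementations typically exploit that structure.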
![Page 11: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/11.jpg)
Relation between Bayes and ML
- Compare with the ML criterion: output dist. under ML vs. under Bayes
- Bayes uses expectations of model parameters
- Can be solved in the same fashion as ML
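One way to see why the same solver applies: for a Gaussian output distribution whose mean $\mu$ and precision $P$ have a Gauss-Wishart posterior, the expected log-likelihood is still quadratic in the observation. A sketch, assuming hypothetical posterior hyperparameters $m$, $\beta$, $\nu$, $W$ with $Q(\mu, P) = \mathcal{N}(\mu \mid m, (\beta P)^{-1})\, \mathcal{W}(P \mid W, \nu)$ and feature dimension $D$ (names not taken from the slide):

```latex
\left\langle \log \mathcal{N}(o_t \mid \mu, P^{-1}) \right\rangle_{Q(\mu, P)}
  = -\tfrac{1}{2} (o_t - m)^{\top} (\nu W) (o_t - m)
    - \frac{D}{2\beta}
    + \tfrac{1}{2} \left\langle \log |P| \right\rangle_{Q(P)}
    - \frac{D}{2} \log 2\pi
```

Only the first term depends on $o_t$, and it matches a Gaussian with mean $m$ and precision $\nu W$. Substituting these expected values for the ML point estimates leaves the generation problem in the same form.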
![Page 12: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/12.jpg)
![Page 13: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/13.jpg)
Bayesian context clustering
Context clustering based on maximizing the lower bound
- Select question (e.g., "Is this phoneme a vowel?", with yes/no branches)
- Gain of the lower bound
- Stopping condition ⇒ Split node based on gain
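The question-selection loop can be sketched as follows. As a stand-in for the variational lower bound, this example scores a node with the closed-form log marginal likelihood of one-dimensional data under a Normal-Gamma prior; that substitution, the default hyperparameters, and the names `log_marginal` and `best_split` are all assumptions made for illustration.

```python
import math


def log_marginal(xs, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Log marginal likelihood of 1-D data under a Normal-Gamma prior."""
    n = len(xs)
    if n == 0:
        return 0.0
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    kappa_n = kappa0 + n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return (
        math.lgamma(a_n) - math.lgamma(a0)
        + a0 * math.log(b0) - a_n * math.log(b_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


def best_split(samples, questions, score=log_marginal):
    """Pick the yes/no question with the largest positive gain.

    samples: list of (context_dict, value) pairs at the node.
    questions: dict mapping a question name to a predicate on the context.
    Returns (None, 0.0) when no question improves the score, which acts
    as the stopping condition.
    """
    parent = score([v for _, v in samples])
    best_name, best_gain = None, 0.0
    for name, predicate in questions.items():
        yes = [v for c, v in samples if predicate(c)]
        no = [v for c, v in samples if not predicate(c)]
        if not yes or not no:
            continue  # a question that does not divide the node is useless
        gain = score(yes) + score(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain
```

With a marginal-likelihood style score, the stopping rule falls out naturally: a node is split only while the gain is positive, so no separate penalty term is needed in this sketch.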
![Page 14: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/14.jpg)
Impact of prior distribution
- Affects model selection, acting as tuning parameters
  - ⇒ Requires a technique for determining the prior dist.
- Conventional: maximize the marginal likelihood
  - Leads to the over-fitting problem, as with ML
  - Tuning parameters are still required
- Determination technique of prior distribution using cross validation [Hashimoto; ’08]
![Page 15: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/15.jpg)
Bayesian approach using CV
Prior distribution based on cross validation
- Training data is randomly divided into K groups
- Cross valid prior dist. is determined for each group from the remaining groups (for K = 3: from groups 2,3; 1,3; and 1,2)
- Posterior dist.
- Calculate likelihood
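The fold structure can be sketched as below. This is an illustration under stated assumptions rather than the procedure from the cited work: data are one-dimensional, each held-out group is scored by its log marginal likelihood under a Normal-Gamma prior built from the remaining groups, and the hyperparameter mapping in `prior_from_data` is a hypothetical choice. It also assumes every group receives at least one sample.

```python
import math
import random


def log_marginal(xs, mu0, kappa0, a0, b0):
    """Log marginal likelihood of 1-D data under a Normal-Gamma prior."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    kappa_n = kappa0 + n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return (
        math.lgamma(a_n) - math.lgamma(a0)
        + a0 * math.log(b0) - a_n * math.log(b_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


def prior_from_data(xs):
    """Assumed mapping: treat the other groups as pseudo-observations."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return {
        "mu0": mean,
        "kappa0": float(n),
        "a0": n / 2.0,
        "b0": max(n * var / 2.0, 1e-8),  # floor keeps log(b0) finite
    }


def cross_valid_score(data, k=3, seed=0):
    """Sum over groups of the held-out log marginal likelihood."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # random division into k groups
    folds = [shuffled[i::k] for i in range(k)]
    total = 0.0
    for i, held_out in enumerate(folds):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        total += log_marginal(held_out, **prior_from_data(rest))
    return total
```

The point of the construction is that the prior for each group never sees that group's own data, so the resulting score does not reward a prior that merely memorizes the data it is evaluated on.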
![Page 16: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/16.jpg)
![Page 17: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/17.jpg)
Experimental conditions (1/2)

| Condition | Setting |
| --- | --- |
| Database | ATR Japanese speech database B-set |
| Speaker | MHT |
| Training data | 450 utterances |
| Test data | 53 utterances |
| Sampling rate | 16 kHz |
| Window | Blackman window |
| Frame size / shift | 25 ms / 5 ms |
| Feature vector | 24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions) |
| HMM | 5-state left-to-right HMM without skip transition |
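The numbers in these conditions can be cross-checked with a little arithmetic. The sketch below assumes that "24 mel-cepstrum" denotes the analysis order, so that 25 coefficients (0th through 24th) are used; that reading is an assumption, chosen because it is the one that reproduces the stated 78 dimensions. The variable names are illustrative.

```python
SAMPLING_RATE_HZ = 16_000
FRAME_SIZE_MS = 25
FRAME_SHIFT_MS = 5

# Frame size and shift expressed in samples.
frame_size_samples = SAMPLING_RATE_HZ * FRAME_SIZE_MS // 1000
frame_shift_samples = SAMPLING_RATE_HZ * FRAME_SHIFT_MS // 1000

MEL_CEPSTRUM_ORDER = 24
N_MEL_CEPSTRUM = MEL_CEPSTRUM_ORDER + 1  # assumption: includes the 0th coefficient
N_LOG_F0 = 1
N_WINDOWS = 3  # static, delta, delta-delta

feature_dim = (N_MEL_CEPSTRUM + N_LOG_F0) * N_WINDOWS

print(frame_size_samples, frame_shift_samples, feature_dim)
```

Under these assumptions each frame spans 400 samples, consecutive frames are 80 samples apart, and the feature vector has 78 dimensions, in agreement with the conditions listed above.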
![Page 18: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/18.jpg)
Experimental conditions (2/2)
Compared approaches

| Method | Training | Context clustering | # of states |
| --- | --- | --- | --- |
| ML-MDL | ML | MDL | 2,491 |
| Bayes-Bayes | Bayes | Bayes using CV | 25,911 |
| Bayes-MDL | Bayes | Bayes using CV, adjust threshold | 2,553 |
| ML-Bayes | ML | MDL, adjust threshold | 27,106 |

Mean Opinion Score (MOS) test
- Subjects were 10 Japanese students
- 20 sentences were chosen at random
![Page 19: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/19.jpg)
Subjective listening test
[Figure: mean opinion score for each compared approach; axis labels give the number of states: 2,491, 25,911, 27,106, 2,553]
![Page 20: A Bayesian Approach to HMM-Based Speech Synthesis](https://reader036.fdocuments.us/reader036/viewer/2022081503/568136fd550346895d9e8bfb/html5/thumbnails/20.jpg)
Conclusions and future work
- A new framework based on the Bayesian approach
  - All processes are derived from a single predictive distribution
  - Improves the naturalness of synthesized speech
- Future work
  - Introduce HSMM instead of HMM
  - Investigate the relation between the speech quality and model structures