An Evaluation of Many-to-One Voice Conversion Algorithms
with Pre-Stored Speaker Data Sets
Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano
Nara Institute of Science and Technology (NAIST), Japan
August 23rd, 2007
Contents
Many-to-One VC framework
Many-to-One VC algorithms
Experimental evaluations
Conclusion
Many-to-One VC framework
Conventional Voice Conversion (VC)
[Figure: training of the conversion model. The source speaker and the target speaker are each asked "Please say the same thing."]
Training of the conversion model has some limitations.
• Using parallel data
• Using around 50 pairs
• Converting only the trained source speaker
We would like to make VC more flexible!
• Using arbitrary utterances
• Using a few utterances
• Converting arbitrary source speakers
Many-to-One VC (M-to-O VC)
Converts arbitrary source speakers' voices into the target speaker's voice
[T. Toda et al.]
[Figure: pre-stored source speakers and an unknown source speaker ("?"), each converted to the target speaker]
• Initial model training with multiple parallel data sets
• Adaptation of model parameters for an arbitrary source speaker
Applications
• Voice changer to movie stars
• Speech translation system, etc.
M-to-O VC Algorithms
1. Based on source independent GMM (SI-GMM) [T. Toda et al.]
2. Based on speaker selection (new algorithm)
3. Based on eigenvoice conversion (EVC) [T. Toda et al.]
4. Based on EVC with speaker adaptive training (SAT) (new algorithm)
1. M-to-O VC based on Source Independent GMM (SI-GMM) [T. Toda et al.]
We train the conversion model for arbitrary source speakers.
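Parameters of the i-th mixture component of SI-GMM
Weight: $\alpha_i$
Mean vector: $\boldsymbol{\mu}_i^{(Z)} = \begin{bmatrix} \boldsymbol{\mu}_i^{(X)} \\ \boldsymbol{\mu}_i^{(Y)} \end{bmatrix}$
Covariance matrix: $\boldsymbol{\Sigma}_i^{(ZZ)} = \begin{bmatrix} \boldsymbol{\Sigma}_i^{(XX)} & \boldsymbol{\Sigma}_i^{(XY)} \\ \boldsymbol{\Sigma}_i^{(YX)} & \boldsymbol{\Sigma}_i^{(YY)} \end{bmatrix}$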
[Figure: source and target mean vectors of the 1st, 2nd, and 3rd mixture components for Speaker A (red), Speaker B (blue), and Speaker C (green). Markers distinguish the phonemes /a/, /i/, and /o/; a further legend entry marks the tied parameters.]
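The slides list the joint-GMM parameters but do not spell out how they are used at conversion time. A minimal sketch follows, assuming the standard minimum mean-square error mapping for joint-density GMMs and, for brevity, scalar features. The function names and dictionary keys are hypothetical and not taken from the slides.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(x, mixtures):
    """Map one scalar source feature x to the target feature space.

    Each mixture is a dict with the (assumed) keys:
      weight  - mixture weight
      mu_x    - source mean, mu_y - target mean
      var_xx  - source variance, cov_yx - target/source cross-covariance
    """
    # Weighted likelihood of x under the source marginal of each component.
    likes = [m["weight"] * gauss_pdf(x, m["mu_x"], m["var_xx"]) for m in mixtures]
    total = sum(likes)
    y = 0.0
    for like, m in zip(likes, mixtures):
        posterior = like / total
        # Conditional mean of the target given x for this component.
        cond_mean = m["mu_y"] + m["cov_yx"] / m["var_xx"] * (x - m["mu_x"])
        y += posterior * cond_mean
    return y


if __name__ == "__main__":
    toy = [
        {"weight": 0.5, "mu_x": 0.0, "mu_y": 1.0, "var_xx": 1.0, "cov_yx": 0.5},
        {"weight": 0.5, "mu_x": 4.0, "mu_y": -1.0, "var_xx": 1.0, "cov_yx": 0.5},
    ]
    print(convert_frame(0.3, toy))
```

With real spectral features the scalars become vectors and matrices, and a numerically safer implementation would work with log-likelihoods.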
Advance Training of SI-GMM
[Figure: advance training process. The SI-GMM is trained using all parallel data sets between the multiple pre-stored source speakers (1st, 2nd, ..., S-th) and the target speaker.]
The SI-GMM converts an arbitrary source speaker's voice without any adaptation process.
Problem of SI-GMM
Phonemic spaces of a certain speaker often overlap with those of another speaker.
The SI-GMM might cause conversion errors!
[Figure: phonemic spaces (/a/, /i/, /o/) of Speaker A (red), Speaker B (blue), and Speaker C (green), overlaid with the 1st, 2nd, and 3rd mixture components]
2. M-to-O VC based on Speaker Selection
We train the conversion model using the subset of pre-stored source speakers whose voice characteristics are similar to those of the given source speaker.
Speaker selection: [S. Yoshizawa, et al., 2001]
[Figure: phonemic spaces (/a/, /i/, /o/) of Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black), overlaid with the 1st, 2nd, and 3rd mixture components. Speakers A and C are selected.]
Process of Speaker Selection
Advance training process (multiple pre-stored source speakers 1st, 2nd, ..., S-th and the target speaker):
1. Training of SI-GMM
2. Training of speaker dependent GMMs (SD-GMMs)
Adaptation process (adaptation data of the source speaker):
3. Calculation of the likelihood for each SD-GMM
4. Sorting by likelihood
5. Selection of N-best parallel data sets based on likelihoods
6. Training of the conversion model using the selected pre-stored source speakers and the target speaker
[Figure: example in which the SD-GMMs are ranked by likelihood and the 3rd, 37th, and 15th pre-stored source speakers are among those selected]
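Steps 3 to 5 of the speaker selection process can be sketched as follows. This is a minimal illustration under stated assumptions: features are scalars, each SD-GMM is a list of hypothetical `(weight, mean, variance)` tuples, and the function names are invented for this example.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of scalar frames under a GMM of (weight, mean, var) tuples."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def select_speakers(adaptation_frames, sd_gmms, n_best):
    """Score every SD-GMM on the adaptation data, sort, and keep the N best speaker IDs."""
    scores = {spk: gmm_log_likelihood(adaptation_frames, gmm)
              for spk, gmm in sd_gmms.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_best]


if __name__ == "__main__":
    sd_gmms = {
        "A": [(1.0, 0.0, 1.0)],
        "B": [(1.0, 5.0, 1.0)],
        "C": [(1.0, 0.5, 1.0)],
    }
    print(select_speakers([0.1, 0.2, -0.1], sd_gmms, n_best=2))
```

The parallel data sets of the returned speakers would then be pooled with the target speaker's data to train the conversion model (step 6), which is not shown here.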
Problem of Speaker Selection
The resulting conversion model just covers the selected pre-stored source speakers.
Such a model is not necessarily suitable for the given source speaker.
[Figure: Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black). Speakers A and C are selected; the conversion model trained by speaker selection is compared with the desired conversion model.]
3. M-to-O VC based on Eigenvoice Conversion (EVC) [T. Toda et al.]
The conversion model is adapted by adjusting weights for individual eigenvoices.
[Figure: unsupervised adaptation. The conversion model for the source speaker is formed by weighting the 1st, 2nd, ..., (S-1)th eigenvectors.]
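Eigenvoice GMM (EV-GMM)
Parameters of the i-th mixture component
Weight: $\alpha_i$
Mean vector: $\boldsymbol{\mu}_i^{(Z)} = \begin{bmatrix} \boldsymbol{\mu}_i^{(X)} \\ \boldsymbol{\mu}_i^{(Y)} \end{bmatrix} = \begin{bmatrix} \boldsymbol{B}_i \boldsymbol{w} + \boldsymbol{b}_i^{(0)} \\ \boldsymbol{\mu}_i^{(Y)} \end{bmatrix}$
Covariance matrix: $\boldsymbol{\Sigma}_i^{(ZZ)} = \begin{bmatrix} \boldsymbol{\Sigma}_i^{(XX)} & \boldsymbol{\Sigma}_i^{(XY)} \\ \boldsymbol{\Sigma}_i^{(YX)} & \boldsymbol{\Sigma}_i^{(YY)} \end{bmatrix}$
• $\boldsymbol{B}_i$: representative vectors (eigenvoices)
• $\boldsymbol{b}_i^{(0)}$: bias vector (average voice)
• $\boldsymbol{w}$: weight vector (free parameter)
The free parameter can be estimated with adaptation data. All other parameters are tied.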
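The source mean of the EV-GMM can be illustrated with a short sketch of $\boldsymbol{\mu}_i^{(X)} = \boldsymbol{B}_i \boldsymbol{w} + \boldsymbol{b}_i^{(0)}$. The function name and the plain-list data layout are assumptions made for this example only.

```python
def adapted_source_mean(B_i, b0_i, w):
    """Return mu_i^(X) = B_i w + b_i^(0).

    B_i  - representative vectors as a D x J list of rows
    b0_i - bias vector of length D
    w    - weight vector of length J (the free parameter)
    """
    return [
        sum(b_dj * w_j for b_dj, w_j in zip(row, w)) + bias
        for row, bias in zip(B_i, b0_i)
    ]


if __name__ == "__main__":
    B_i = [[1.0, 0.0], [0.0, 2.0]]
    b0_i = [0.5, -1.0]
    # A zero weight vector leaves the bias vector (average voice) unchanged.
    print(adapted_source_mean(B_i, b0_i, [0.0, 0.0]))
    print(adapted_source_mean(B_i, b0_i, [2.0, 3.0]))
```

Because only `w` changes per speaker, adaptation has to estimate just J values regardless of the number of mixture components.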
Advance Training of EV-GMM (algorithm 3)
Advance training process, using the multiple pre-stored source speakers (1st, 2nd, ..., S-th) and the target speaker:
1. Training of SI-GMM
2. Training of SD-GMMs
3. Construction of supervectors: $SV^{(s)} = \left[ \boldsymbol{\mu}_1^{(X)}(s)^\top, \boldsymbol{\mu}_2^{(X)}(s)^\top, \ldots, \boldsymbol{\mu}_M^{(X)}(s)^\top \right]^\top$ for $s = 1, 2, \ldots, S$
4. Estimation of bias vectors $\boldsymbol{b}_1^{(0)}, \boldsymbol{b}_2^{(0)}, \ldots, \boldsymbol{b}_M^{(0)}$ and representative vectors $\boldsymbol{B}_1, \boldsymbol{B}_2, \ldots, \boldsymbol{B}_M$
5. Construction of EV-GMM
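Steps 3 and 4 can be sketched with NumPy. The slides do not name the estimation method; this sketch assumes principal component analysis of the supervectors, computed through a singular value decomposition. The function name and array layout are hypothetical.

```python
import numpy as np


def build_eigenvoices(sd_source_means, n_eigenvoices):
    """Estimate bias and representative vectors from SD-GMM source means.

    sd_source_means - array of shape (S, M, D): for each of S speakers,
                      the D-dimensional source mean of each of M mixtures
    Returns (bias, basis) with shapes (M, D) and (M, D, J).
    """
    S, M, D = sd_source_means.shape
    # Step 3: one supervector per speaker, concatenating all mixture means.
    supervectors = sd_source_means.reshape(S, M * D)
    # Step 4: the bias supervector is the average over speakers.
    bias = supervectors.mean(axis=0)
    centered = supervectors - bias
    # PCA via SVD (an assumption); rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_eigenvoices].T  # shape (M * D, J)
    return bias.reshape(M, D), basis.reshape(M, D, n_eigenvoices)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.normal(size=(4, 2, 3))  # toy data: S=4, M=2, D=3
    bias, basis = build_eigenvoices(means, n_eigenvoices=3)
    print(bias.shape, basis.shape)
```

After mean removal, S supervectors span at most S-1 directions, which is consistent with the (S-1)th eigenvector shown on the EVC slide.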
Problem of EVC
The tied parameters of the EV-GMM are taken from the SI-GMM.
They are not suitable for the given source speaker; for example, the source covariance values are much larger than those of the desired conversion model.
[Figure: Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black). The EV-GMM and the adapted EV-GMM are compared with the desired conversion model.]
4. M-to-O VC based on EVC with Speaker Adaptive Training (SAT)
SAT: [T. Anastasakos, et al., 1996]
We train the EV-GMM in advance so that the adaptation performance is improved.
Training criterion:
$$\hat{\lambda}^{(EV)} = \arg\max_{\lambda^{(EV)}} \sum_{s=1}^{S} \sum_{t=1}^{T} \log P\left( \boldsymbol{Z}_t^{(s)} \mid \lambda^{(EV)}, \boldsymbol{w}^{(s)} \right)$$
• The inner sum over $t$ is the likelihood of the adapted EV-GMM for each pre-stored source speaker.
• The outer sum over $s$ gives the total likelihood over all pre-stored source speakers.
[Figure: Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black). The EV-GMM and the adapted EV-GMM are compared with the EV-GMM with SAT and the adapted EV-GMM with SAT.]
SAT for EV-GMM
Advance training process, using the multiple pre-stored source speakers (1st, 2nd, ..., S-th) and the target speaker:
1. Training of speaker dependent parameters: the weight vectors $\boldsymbol{w}^{(1)}, \boldsymbol{w}^{(2)}, \ldots, \boldsymbol{w}^{(S)}$
2. Training of tied parameters of the canonical EV-GMM $\lambda^{(EV)}$: GMM weights, bias vectors, representative vectors, target mean vectors, and covariance matrices
3. Iteration
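The alternating structure of steps 1 to 3 can be illustrated with a toy analogue. This is not the EV-GMM update itself: it assumes a single mixture component, an identity covariance, and equal amounts of data per speaker, so that maximizing the likelihood reduces to least squares on the speaker means. The function name `sat_toy` and its arguments are hypothetical.

```python
import numpy as np


def sat_toy(speaker_means, B, b, n_iter=5):
    """Alternate between speaker weights and tied parameters (B, b).

    speaker_means - array (S, D), one mean vector per pre-stored speaker
    B             - initial representative vectors, shape (D, J)
    b             - initial bias vector, shape (D,)
    Returns the updated B, b, the weights W (S, J), and the squared-error
    history (one value per iteration, lower is better).
    """
    S = speaker_means.shape[0]
    history = []
    for _ in range(n_iter):
        # Step 1: speaker dependent weights, tied parameters held fixed.
        W = np.linalg.lstsq(B, (speaker_means - b).T, rcond=None)[0].T
        # Step 2: tied parameters, speaker weights held fixed.
        design = np.hstack([W, np.ones((S, 1))])
        params = np.linalg.lstsq(design, speaker_means, rcond=None)[0]
        B, b = params[:-1].T, params[-1]
        history.append(float(np.sum((speaker_means - W @ B.T - b) ** 2)))
    return B, b, W, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.normal(size=(6, 4))
    B0 = rng.normal(size=(4, 2))
    _, _, _, history = sat_toy(means, B0, means.mean(axis=0))
    print(history)
```

Each step exactly minimizes the same objective over one block of parameters, so the error cannot increase from one iteration to the next. In the actual method the tied parameters also include GMM weights, target means, and covariances.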
Comparison of M-to-O VC Algorithms

                       Based on       Based on             Based on       Based on EVC
                       SI-GMM         speaker selection    EVC            with SAT
Source mean vectors    Not adapted    Roughly adapted      Adapted        Adapted
Tied parameters        Not adapted    Roughly adapted      Not adapted    Optimized in advance
Experimental Conditions
Training stage: 160 pre-stored source speakers (80 males and 80 females) and 1 male target speaker
Adaptation stage: 10 source speakers (5 males and 5 females)
50 sentences uttered by each speaker
The number of mixtures: 128
The number of representative vectors: 159
The number of selected speakers: 27
Experimental Conditions (cont’d)
Objective evaluation
• Test data: 21 sentences
• Objective measure: spectral distortion
• The number of adaptation sentences: varying from 1/32 to 32
Subjective evaluation
• Preference test on speech quality of converted voices
• The number of subjects: 6 (each subject evaluated 120 sample-pairs)
• The number of adaptation sentences: 2
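The slides name spectral distortion as the objective measure without giving a formula. A minimal sketch follows, assuming the mel-cepstral distortion commonly used in voice conversion work and assuming that target and converted frames are already time-aligned. The function names are hypothetical.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mel-cepstral distortion in dB between two mel-cepstral vectors.

    Assumed formula: (10 / ln 10) * sqrt(2 * sum_d (t_d - c_d)^2).
    The caller decides which coefficients to pass in.
    """
    squared = sum((t - c) ** 2 for t, c in zip(target, converted))
    return 10.0 / math.log(10.0) * math.sqrt(2.0 * squared)


def mean_distortion(target_frames, converted_frames):
    """Average the frame-level distortion over time-aligned frame pairs."""
    values = [mel_cepstral_distortion(t, c)
              for t, c in zip(target_frames, converted_frames)]
    return sum(values) / len(values)


if __name__ == "__main__":
    target = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.1]]
    converted = [[0.9, 0.6, -0.2], [0.8, 0.3, 0.0]]
    print(mean_distortion(target, converted))
```

Lower values mean the converted spectrum is closer to the target speaker's spectrum.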
Result of Objective Evaluation
[Figure: objective evaluation results, with an axis running from "Better" to "Worse"]
• The adaptation techniques improve the conversion accuracy. SAT yields further improvements.
• EVC and EVC with SAT cause large distortions when the amount of adaptation data is very limited.
• Speaker selection is effective even with a very limited amount of adaptation data.
Result of Subjective Evaluation
Every adaptation technique improves the converted speech quality.
Conclusions
We conducted an experimental evaluation of many-to-one VC algorithms:
• based on SI-GMM [T. Toda, et al.]
• based on EVC [T. Toda, et al.]
• based on speaker selection (new method)
• based on EVC with SAT (new method)
Results of objective and subjective evaluations showed that
• the adaptation process results in a better conversion model than the SI-GMM;
• the algorithm based on speaker selection works well with a very small amount of adaptation data.
Thank you for your attention!
Any questions?