An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano

Nara Institute of Science and Technology (NAIST), Japan

August 23rd, 2007

Contents

Many-to-One VC framework

Many-to-One VC algorithms

Experimental evaluations

Conclusion


Conventional Voice Conversion (VC)

[Figure: training of a conversion model from a source speaker and a target speaker, who are both asked: "Please say the same thing."]

Training of the conversion model has some limitations:
• Using parallel data
• Using around 50 utterance pairs
• Converting only the trained source speaker

We would like to make VC more flexible!
• Using arbitrary utterances
• Using a few utterances
• Converting arbitrary source speakers

Many-to-One VC (M-to-O VC)

Convert arbitrary source speakers into the target speaker [T. Toda et al.]

• Initial model training with multiple parallel data sets (pre-stored source speakers and the target speaker)
• Adaptation of model parameters for an arbitrary source speaker

Applications
• Voice changer to movie stars
• Speech translation system, etc.

M-to-O VC Algorithms

1. Based on source independent GMM (SI-GMM) [T. Toda et al.]
2. Based on speaker selection (new algorithm)
3. Based on eigenvoice conversion (EVC) [T. Toda et al.]
4. Based on EVC with speaker adaptive training (SAT) (new algorithm)

1. M-to-O VC based on Source Independent GMM (SI-GMM) [T. Toda et al.]

We train the conversion model for arbitrary source speakers.

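Parameters of the i-th mixture component of SI-GMM

Weight: $\alpha_i$

Mean vector: $\mu_i^{(Z)} = \begin{bmatrix} \mu_i^{(X)} \\ \mu_i^{(Y)} \end{bmatrix}$

Covariance matrix: $\Sigma_i^{(ZZ)} = \begin{bmatrix} \Sigma_i^{(XX)} & \Sigma_i^{(XY)} \\ \Sigma_i^{(YX)} & \Sigma_i^{(YY)} \end{bmatrix}$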

[Figure: source and target mean vectors of the 1st, 2nd, and 3rd mixture components for Speaker A (red), Speaker B (blue), and Speaker C (green); markers denote /a/, /i/, /o/, and tied parameters are marked.]

Previous Training of SI-GMM

[Figure: previous training process. The SI-GMM is trained using all parallel data sets between the multiple pre-stored source speakers (1st, 2nd, ..., S-th) and the target speaker.]

The SI-GMM converts an arbitrary source speaker's voice without any adaptation process.

Problem of SI-GMM

Phonemic spaces of a certain speaker often overlap with those of another speaker.

The SI-GMM might therefore cause conversion errors!


[Figure: phonemic spaces (/a/, /i/, /o/) of several speakers, including Speaker C (green), around the 3rd mixture component.]

2. M-to-O VC based on Speaker Selection*

We train the conversion model using the subset of pre-stored source speakers whose voice characteristics are similar to those of the given source speaker.

* Speaker selection [S. Yoshizawa et al., 2001]


[Figure: Speaker C (green) and the source speaker (black) around the 3rd mixture component; markers denote /a/, /i/, /o/. Speakers A and C are selected.]

Process of Speaker Selection

Previous training process (pre-stored source speakers 1st, 2nd, ..., S-th and the target speaker):

1. Training of SI-GMM
2. Training of speaker dependent GMMs (SD-GMMs)

Adaptation process (using adaptation data of the source speaker):

3. Calculation of likelihood
4. Sorting of likelihoods
5. Selection of N-best parallel data sets based on likelihoods
6. Training of conversion model

[Figure: the pre-stored source speakers are ranked 1st, 2nd, ..., N-th, ..., S-th by likelihood; in the example, the 3rd, 37th, and 15th speakers are among those selected, and their parallel data with the target speaker are used to train the conversion model.]
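Steps 3 to 5 of the speaker selection process (likelihood calculation, sorting, N-best selection) can be sketched roughly as follows. This is a toy illustration under stated assumptions: features are scalars, each SD-GMM is a list of (weight, mean, variance) tuples, and the speaker ids and values are invented. The actual feature type and GMM structure used by the authors are not given in the slides.

```python
import math

def log_likelihood(frames, gmm):
    """Total log-likelihood of scalar frames under a 1-D GMM.

    gmm is a list of (weight, mean, variance) tuples (hypothetical layout).
    """
    total = 0.0
    for x in frames:
        density = sum(
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in gmm
        )
        total += math.log(density)
    return total

def select_speakers(adaptation_frames, sd_gmms, n_best):
    """Return the ids of the n_best pre-stored speakers, most likely first."""
    # Step 3: likelihood of the adaptation data under every SD-GMM.
    scores = {spk: log_likelihood(adaptation_frames, gmm) for spk, gmm in sd_gmms.items()}
    # Step 4: sort speakers by likelihood, best first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Step 5: keep the N-best speakers.
    return ranked[:n_best]

# Toy SD-GMMs for three pre-stored speakers (invented values).
sd_gmms = {
    "spk01": [(1.0, 0.0, 1.0)],
    "spk02": [(1.0, 5.0, 1.0)],
    "spk03": [(1.0, 1.0, 1.0)],
}
adaptation_frames = [0.8, 1.1, 0.9]
selected = select_speakers(adaptation_frames, sd_gmms, n_best=2)
```

The parallel data sets of the returned speakers would then be pooled with the target speaker's data to train the conversion model (step 6), which is not shown here.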

Problem of Speaker Selection

The resulting conversion model covers only the selected pre-stored source speakers.

Such a model is not necessarily suitable for the given source speaker.

[Figure: Speakers A and C are selected; the conversion model trained by speaker selection is compared with the desired conversion model.]

3. M-to-O VC based on Eigenvoice Conversion (EVC) [T. Toda et al.]

The conversion model is adapted by adjusting weights for individual eigenvoices.

[Figure: unsupervised adaptation to the source speaker. The conversion model is obtained by weighting the 1st, 2nd, ..., (S-1)-th eigenvectors.]

Eigenvoice GMM (EV-GMM)

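Parameters of the i-th mixture component

Weight: $\alpha_i$

Mean vector: $\mu_i^{(Z)} = \begin{bmatrix} \mu_i^{(X)} \\ \mu_i^{(Y)} \end{bmatrix}$, with $\mu_i^{(X)} = B_i w + b_i^{(0)}$

Covariance matrix: $\Sigma_i^{(ZZ)} = \begin{bmatrix} \Sigma_i^{(XX)} & \Sigma_i^{(XY)} \\ \Sigma_i^{(YX)} & \Sigma_i^{(YY)} \end{bmatrix}$

• $B_i$: representative vectors (eigenvoices)
• $b_i^{(0)}$: bias vector (average voice)
• $w$: weight vector (free parameter)

The free parameter $w$ can be estimated with adaptation data. All parameters other than $w$ are tied parameters.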
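To make the EV-GMM source mean concrete, the following sketch evaluates $B_i w + b_i^{(0)}$ for a single mixture component using plain Python lists. The function name, the 3-dimensional features, the two eigenvoices, and all numeric values are hypothetical; how $w$ is estimated from adaptation data is not covered here.

```python
def adapted_source_mean(B_i, w, b0_i):
    """Return B_i w + b0_i for one mixture component.

    B_i  : list of rows, one row per feature dimension and one column per eigenvoice
    w    : weight vector (the free parameter)
    b0_i : bias vector (average voice)
    """
    return [
        sum(b * wj for b, wj in zip(row, w)) + bias
        for row, bias in zip(B_i, b0_i)
    ]

# Toy component: 3 feature dimensions, 2 eigenvoices (invented values).
B_i = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
b0_i = [0.1, 0.2, 0.3]
w = [2.0, 1.0]
mu_x_i = adapted_source_mean(B_i, w, b0_i)
```

With $w$ set to the zero vector the function returns the bias vector itself, which matches its description as the average voice.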

Previous Training of EV-GMM

Previous training process (pre-stored source speakers 1st, 2nd, ..., S-th and the target speaker):

1. Training of SI-GMM
2. Training of SD-GMMs
3. Construction of supervectors: $SV^{(s)} = [\mu_1^{(X)}(s)^\top, \mu_2^{(X)}(s)^\top, \ldots, \mu_M^{(X)}(s)^\top]^\top$ for $s = 1, 2, \ldots, S$
4. Estimation of bias vectors $b_1^{(0)}, b_2^{(0)}, \ldots, b_M^{(0)}$ and representative vectors $B_1, B_2, \ldots, B_M$
5. Construction of EV-GMM: SI-GMM + (bias vectors & representative vectors) = EV-GMM
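Step 3 and part of step 4 can be illustrated with a small sketch. It assumes that each supervector is the concatenation of the source mean vectors of one SD-GMM, and that the bias supervector is the element-wise average over speakers, as is usual when principal component analysis is applied. The slides name the steps but do not show the computation, so both assumptions, all names, and all values here are hypothetical. Extracting the representative vectors from the centered supervectors is left out.

```python
def build_supervector(source_means):
    """Concatenate the M source mean vectors of one SD-GMM into one flat list."""
    return [value for mean in source_means for value in mean]

def bias_supervector(supervectors):
    """Element-wise average over the supervectors of all pre-stored speakers."""
    count = len(supervectors)
    return [sum(column) / count for column in zip(*supervectors)]

# Toy data: 2 speakers, M = 2 mixtures, 2-dimensional features (invented values).
sd_source_means = {
    "spk01": [[1.0, 2.0], [3.0, 4.0]],
    "spk02": [[3.0, 2.0], [5.0, 8.0]],
}
supervectors = [build_supervector(means) for means in sd_source_means.values()]
bias = bias_supervector(supervectors)

# Centered supervectors; an eigen-decomposition of these would give the
# representative vectors (not shown in this sketch).
centered = [[v - b for v, b in zip(sv, bias)] for sv in supervectors]
```

Splitting `bias` back into M consecutive blocks would give the per-mixture bias vectors.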

Problem of EVC

The tied parameters of the EV-GMM are taken from the SI-GMM.

They are not suitable for the given source speaker, e.g., source covariance values are much larger than those of the desired conversion model.

[Figure: the EV-GMM and the adapted EV-GMM compared with the desired conversion model.]

4. M-to-O VC based on EVC with Speaker Adaptive Training (SAT)*

* SAT [T. Anastasakos et al., 1996]

We train the EV-GMM in advance so that the adaptation performance is improved.

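Training criterion:

$\hat{\lambda}^{(EV)} = \arg\max_{\lambda^{(EV)}} \sum_{s=1}^{S} \sum_{t=1}^{T_s} \log P\left(Z_t^{(s)} \mid \lambda^{(EV)}, w_s\right)$

• Inner sum over $t$: likelihood of the adapted EV-GMM for each pre-stored source speaker
• Outer sum over $s$: total likelihood over all pre-stored source speakers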



[Figure: effect of SAT. The EV-GMM and its adapted version are compared with the EV-GMM with SAT and its adapted version.]

SAT for EV-GMM

Previous training process (pre-stored source speakers 1st, 2nd, ..., S-th and the target speaker):

1. Training of speaker dependent parameters: weight vectors $w_1, w_2, \ldots, w_S$
2. Training of tied parameters of the canonical EV-GMM $\lambda^{(EV)}$: GMM weights, bias vectors, representative vectors, target mean vectors, covariance matrices
3. Iteration

Comparison of M-to-O VC Algorithms

Algorithm                     Source mean vectors    Tied parameters
Based on SI-GMM               Not adapted            Not adapted
Based on speaker selection    Roughly adapted        Roughly adapted
Based on EVC                  Adapted                Not adapted
Based on EVC with SAT         Adapted                Previously optimized

Experimental Conditions

Training stage:
• 160 pre-stored source speakers (80 males and 80 females)
• 1 male target speaker

Adaptation stage:
• 10 source speakers (5 males and 5 females)

50 sentences uttered by each speaker

Number of mixtures: 128
Number of representative vectors: 159
Number of selected speakers: 27

Experimental Conditions (cont’d)

Objective evaluation
• Test data: 21 sentences
• Objective measure: spectral distortion
• Number of adaptation sentences: varying from 1/32 to 32

Subjective evaluation
• Preference test on speech quality of converted voices
• Number of subjects: 6 (each subject evaluated 120 sample-pairs)
• Number of adaptation sentences: 2
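The slides name the objective measure only as spectral distortion. One widely used measure of this kind is mel-cepstral distortion; the sketch below assumes that measure and assumes the target and converted frames are already time-aligned. Whether the authors used exactly this formula is not stated in the presentation, and the function name and toy frames are hypothetical.

```python
import math

def mel_cepstral_distortion(target, converted):
    """Average mel-cepstral distortion in dB over time-aligned frames.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients (the energy term is assumed to be excluded already).
    """
    if len(target) != len(converted):
        raise ValueError("frame sequences must have the same length")
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((t - c) ** 2 for t, c in zip(tf, cf)))
        for tf, cf in zip(target, converted)
    ]
    return sum(per_frame) / len(per_frame)

# Toy frames (invented values): the first frame matches, the second differs.
target_frames = [[0.0, 0.0], [1.0, 1.0]]
converted_frames = [[0.0, 0.0], [1.0, 2.0]]
mcd = mel_cepstral_distortion(target_frames, converted_frames)
```

Lower values indicate converted features closer to the target speaker's features.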

Result of Objective Evaluation

[Figure: spectral distortion of each algorithm as a function of the number of adaptation sentences; the distortion axis runs from "Better" to "Worse".]

• The adaptation techniques improve the conversion accuracy, and SAT brings further improvements.
• EVC and EVC with SAT cause large distortions when the amount of adaptation data is very limited.
• Speaker selection is effective even with a very limited amount of adaptation data.

Result of Subjective Evaluation

Every adaptation technique improves the converted speech quality.

Conclusions

We conducted an experimental evaluation of many-to-one VC algorithms:
• based on SI-GMM [T. Toda et al.]
• based on EVC [T. Toda et al.]
• based on speaker selection (new method)
• based on EVC with SAT (new method)

Results of objective and subjective evaluations showed that
• the adaptation process results in a better conversion model than the SI-GMM;
• the algorithm based on speaker selection works well with a very small amount of adaptation data.

Thank you for your attention!

Any questions?
