An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano Nara Institute of Science and Technology (NAIST), Japan August 23rd, 2007

Page 1: An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano

Nara Institute of Science and Technology (NAIST), Japan

August 23rd, 2007

Page 2

Contents

Many-to-One VC framework

Many-to-One VC algorithms

Experimental evaluations

Conclusion

Many-to-One VC framework

Page 3

Conventional Voice Conversion (VC)

[Figure: the source speaker and the target speaker are each asked, "Please say the same thing."; training on these utterances yields the conversion model.]

Training of the conversion model has some limitations:
• Using parallel data
• Using around 50 utterance pairs
• Converting only the trained source speaker

We would like to make VC more flexible!
• Using arbitrary utterances
• Using a few utterances
• Converting arbitrary source speakers

Page 4

Many-to-One VC (M-to-O VC) [T. Toda et al.]

Convert arbitrary source speakers into the target speaker.

[Figure: the pre-stored source speakers and an unknown source speaker ("?") are mapped to the target speaker.]

• Initial model training with multiple parallel data sets
• Adaptation of model parameters for an arbitrary source speaker

Applications
• Voice changer to movie stars
• Speech translation system, etc.

Page 5

Contents

Many-to-One VC framework

Many-to-One VC algorithms

Experimental evaluations

Conclusion

Page 6

M-to-O VC Algorithms

1. Based on source independent GMM (SI-GMM) [T. Toda et al.]
2. Based on speaker selection (new algorithm)
3. Based on eigenvoice conversion (EVC) [T. Toda et al.]
4. Based on EVC with speaker adaptive training (SAT) (new algorithm)

Page 7

1. M-to-O VC based on Source Independent GMM (SI-GMM) [T. Toda et al.]

We train the conversion model for arbitrary source speakers.

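Parameters of the i-th mixture component of the SI-GMM:

• Weight of the i-th mixture component
• Mean vector: $\mu_i^{(Z)} = \begin{bmatrix} \mu_i^{(X)} \\ \mu_i^{(Y)} \end{bmatrix}$, where $\mu_i^{(X)}$ is the source mean vector and $\mu_i^{(Y)}$ is the target mean vector
• Covariance matrix: $\Sigma_i^{(ZZ)} = \begin{bmatrix} \Sigma_i^{(XX)} & \Sigma_i^{(XY)} \\ \Sigma_i^{(YX)} & \Sigma_i^{(YY)} \end{bmatrix}$

(The slide highlights the tied parameters.)

[Figure: the 1st, 2nd, and 3rd mixture components over data from Speaker A (red), Speaker B (blue), and Speaker C (green); markers distinguish the phonemes /a/, /i/, and /o/.]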

Page 8

Advance Training of SI-GMM

Advance training process: the SI-GMM is trained using all parallel data sets, each pairing one of the multiple pre-stored source speakers (1st, 2nd, ..., S-th) with the target speaker.

The SI-GMM converts an arbitrary source speaker's voice without any adaptation process.
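To make the conversion step concrete, the following is a minimal sketch of how a joint GMM with the blocks $\mu_i^{(X)}$, $\mu_i^{(Y)}$, $\Sigma_i^{(XX)}$, and $\Sigma_i^{(YX)}$ can map one source frame to the target space. It assumes the common minimum mean-square-error mapping; the slides do not state which conversion rule the authors used, and the function name `gmm_convert` and its argument layout are illustrative only.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source frame x (D,) with a joint GMM.

    Assumed rule (not confirmed by the slides): MMSE mapping
    y = sum_i P(i|x) * (mu_y[i] + cov_yx[i] @ inv(cov_xx[i]) @ (x - mu_x[i])).
    Shapes: weights (M,), mu_x (M, D), mu_y (M, D),
            cov_xx (M, D, D), cov_yx (M, D, D).
    """
    n_mix, dim = mu_x.shape

    # Log posterior of each component given x, from the source marginal.
    log_post = np.empty(n_mix)
    for i in range(n_mix):
        diff = x - mu_x[i]
        _, logdet = np.linalg.slogdet(cov_xx[i])
        maha = diff @ np.linalg.solve(cov_xx[i], diff)
        log_post[i] = np.log(weights[i]) - 0.5 * (
            dim * np.log(2.0 * np.pi) + logdet + maha
        )
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Posterior-weighted sum of the per-component linear regressions.
    y = np.zeros(dim)
    for i in range(n_mix):
        y += post[i] * (
            mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i])
        )
    return y
```

Because the SI-GMM is trained on all pre-stored speakers, the same parameters would be passed to this function for any source speaker, with no adaptation step.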

Page 9

Problem of SI-GMM

The phonemic spaces of one speaker often overlap with those of another speaker.

The SI-GMM might therefore cause conversion errors!

[Figure: the 1st, 2nd, and 3rd mixture components over data from Speaker A (red), Speaker B (blue), and Speaker C (green); markers distinguish the phonemes /a/, /i/, and /o/.]

Page 10

2. M-to-O VC based on Speaker Selection

We train the conversion model using a subset of the pre-stored source speakers whose voice characteristics are similar to those of the given source speaker.

Speaker selection [S. Yoshizawa, et al., 2001]

[Figure: the 1st, 2nd, and 3rd mixture components over data from Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black); markers distinguish the phonemes /a/, /i/, and /o/. Speakers A and C are selected.]

Page 11

Process of Speaker Selection

Advance training process (multiple pre-stored source speakers: 1st, 2nd, ..., S-th; one target speaker)
1. Training of SI-GMM
2. Training of speaker-dependent GMMs (SD-GMMs)

Adaptation process (using the adaptation data of the source speaker)
3. Calculation of likelihood (adaptation data against the SD-GMMs)
4. Sorting of likelihoods (the sorted list in the slide's example includes the 3rd, 37th, 15th, and 26th speakers)
5. Selection of N-best parallel data sets based on likelihoods (in the example, the 3rd, 37th, and 15th pre-stored source speakers are selected)
6. Training of the conversion model using the selected pre-stored source speakers and the target speaker
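The likelihood calculation, sorting, and N-best selection in this process can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each SD-GMM is taken to be a diagonal-covariance GMM given as a `(weights, means, variances)` tuple, and the names `diag_gmm_loglik` and `select_speakers` are hypothetical.

```python
import numpy as np

def diag_gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    Shapes: weights (M,), means (M, D), variances (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    log_comp = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                        # (T, M)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())

def select_speakers(frames, sd_gmms, n_best):
    """Return indices of the n_best pre-stored speakers, best first.

    sd_gmms is a list of (weights, means, variances) tuples, one per
    pre-stored source speaker (an assumed representation).
    """
    scores = [diag_gmm_loglik(frames, *gmm) for gmm in sd_gmms]
    order = np.argsort(scores)[::-1]          # highest likelihood first
    return [int(s) for s in order[:n_best]]
```

The parallel data sets of the returned speakers would then be pooled to train the conversion model in the final step.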

Page 12

Problem of Speaker Selection

The resulting conversion model just covers the selected pre-stored source speakers.

Such a model is not necessarily suitable for the given source speaker.

[Figure: the conversion model trained by speaker selection compared with the desired conversion model, over data from Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black). Speakers A and C are selected.]

Page 13

3. M-to-O VC based on Eigenvoice Conversion (EVC) [T. Toda et al.]

The conversion model is adapted by adjusting weights for individual eigenvoices.

[Figure: unsupervised adaptation to the source speaker; the conversion model is formed by weighting the 1st eigenvector, the 2nd eigenvector, ..., and the (S-1)-th eigenvector.]

Page 14

Eigenvoice GMM (EV-GMM)

Parameters of the i-th mixture component:

• Weight of the i-th mixture component
• Mean vector: $\mu_i^{(Z)} = \begin{bmatrix} \mu_i^{(X)} \\ \mu_i^{(Y)} \end{bmatrix}$ with $\mu_i^{(X)} = B_i w + b_i^{(0)}$
• Covariance matrix: $\Sigma_i^{(ZZ)} = \begin{bmatrix} \Sigma_i^{(XX)} & \Sigma_i^{(XY)} \\ \Sigma_i^{(YX)} & \Sigma_i^{(YY)} \end{bmatrix}$

Here $B_i$ holds the representative vectors (eigenvoices), $b_i^{(0)}$ is the bias vector (average voice), and the weight vector $w$ is the free parameter.

The free parameter can be estimated with adaptation data.

(The slide highlights the tied parameters.)
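As a rough illustration of how the weight vector can be estimated from adaptation data, the sketch below computes the adapted source means $B_i w + b_i^{(0)}$ and performs one maximum-likelihood update of $w$. It assumes diagonal source covariances and component posteriors that have already been computed in an E-step; this is a standard eigenvoice-style update and is not confirmed as the authors' exact procedure. The function names are hypothetical.

```python
import numpy as np

def adapted_source_means(B, b0, w):
    """Source means of the adapted EV-GMM: mu_x[i] = B[i] @ w + b0[i].

    Shapes: B (M, D, J), b0 (M, D), w (J,). Returns (M, D).
    """
    return B @ w + b0

def update_weight_vector(frames, post, B, b0, var_xx):
    """One ML update of the weight vector w (assumed diagonal covariances).

    frames: (T, D) source adaptation frames
    post:   (T, M) component posteriors, computed beforehand
    var_xx: (M, D) diagonal of each source covariance
    Solves  sum_i occ_i B_i^T S_i^-1 B_i w = sum_i B_i^T S_i^-1 sum_t post_ti (x_t - b_i).
    """
    n_mix, _, n_rep = B.shape
    lhs = np.zeros((n_rep, n_rep))
    rhs = np.zeros(n_rep)
    for i in range(n_mix):
        scaled_B = B[i] / var_xx[i][:, None]        # S_i^-1 B_i, shape (D, J)
        occupancy = post[:, i].sum()
        lhs += occupancy * (B[i].T @ scaled_B)
        residual = post[:, i] @ (frames - b0[i])    # shape (D,)
        rhs += scaled_B.T @ residual
    return np.linalg.solve(lhs, rhs)
```

In practice the posteriors and the weight vector would be re-estimated alternately for a few iterations; only the update of `w` is shown here.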

Page 15

3. Advance Training of EV-GMM

Advance training process (multiple pre-stored source speakers: 1st, 2nd, ..., S-th; one target speaker)
1. Training of SI-GMM
2. Training of SD-GMMs
3. Construction of supervectors: $SV^{(s)} = [\mu_1^{(X)}(s)^\top, \mu_2^{(X)}(s)^\top, \ldots, \mu_M^{(X)}(s)^\top]^\top$ for $s = 1, 2, \ldots, S$
4. Estimation of bias vectors $b_1^{(0)}, b_2^{(0)}, \ldots, b_M^{(0)}$ and representative vectors $B_1, B_2, \ldots, B_M$
5. Construction of EV-GMM (source means of the form $\mu_i^{(X)} = B_i w + b_i^{(0)}$)
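The estimation of bias and representative vectors from the supervectors can be sketched with principal component analysis. PCA is an assumption here: it is the usual choice for eigenvoice methods and is consistent with the slides' count of at most S-1 eigenvectors, but the slides only say "estimation". The function name `estimate_eigenvoices` is hypothetical.

```python
import numpy as np

def estimate_eigenvoices(supervectors, n_mix, n_rep):
    """Estimate bias and representative vectors by PCA (assumed method).

    supervectors: (S, M*D), one row per pre-stored source speaker,
                  formed by stacking that speaker's M source mean vectors.
    Returns B with shape (M, D, n_rep) and b0 with shape (M, D).
    """
    n_speakers, total_dim = supervectors.shape
    dim = total_dim // n_mix

    # Bias supervector: the average over all pre-stored speakers.
    bias = supervectors.mean(axis=0)
    centered = supervectors - bias

    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_rep].T                       # (M*D, n_rep)

    # Split the stacked vectors back into per-mixture blocks.
    B = basis.reshape(n_mix, dim, n_rep)
    b0 = bias.reshape(n_mix, dim)
    return B, b0
```

With S speakers the centered supervectors span at most S-1 dimensions, so `n_rep` larger than S-1 would add nothing.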

Page 16

Problem of EVC

The tied parameters of the EV-GMM are taken from the SI-GMM.

They are not suitable for the given source speaker; e.g., source covariance values are much larger than those of the desired conversion model.

[Figure: the EV-GMM, the adapted EV-GMM, and the desired conversion model over data from Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black).]

Page 17

4. M-to-O VC based on EVC with Speaker Adaptive Training (SAT)

SAT [T. Anastasakos, et al., 1996]

We train the EV-GMM in advance so that the adaptation performance is improved.

Training criterion:

$\hat{\lambda}^{(EV)} = \arg\max_{\lambda^{(EV)}} \sum_{s=1}^{S} \sum_{t=1}^{T} \log P\left(Z_t^{(s)} \mid \lambda^{(EV)}, w^{(s)}\right)$

The inner sum over $t$ is the likelihood of the adapted EV-GMM for each pre-stored source speaker; the outer sum over $s$ gives the total likelihood over all pre-stored source speakers.

[Figure: the EV-GMM and the adapted EV-GMM compared with the EV-GMM with SAT and the adapted EV-GMM with SAT, over data from Speaker A (red), Speaker B (blue), Speaker C (green), and the source speaker (black).]
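To show what the SAT training criterion measures, the sketch below evaluates the total log-likelihood over all pre-stored source speakers, where each speaker's joint frames are scored by the EV-GMM adapted with that speaker's own weight vector. It only evaluates the objective and does not perform the optimization; the argument layout and the names `ev_gmm_loglik` and `sat_objective` are assumptions for illustration.

```python
import numpy as np

def ev_gmm_loglik(Z, w, weights, B, b0, mu_y, cov_zz):
    """Log-likelihood of joint frames Z (T, 2D) under the adapted EV-GMM.

    Shapes: w (J,), weights (M,), B (M, D, J), b0 (M, D),
            mu_y (M, D), cov_zz (M, 2D, 2D).
    """
    mu_x = B @ w + b0                                # adapted source means
    mu_z = np.concatenate([mu_x, mu_y], axis=1)      # joint means, (M, 2D)
    n_frames, joint_dim = Z.shape
    log_comp = np.empty((n_frames, len(weights)))
    for i in range(len(weights)):
        diff = Z - mu_z[i]
        _, logdet = np.linalg.slogdet(cov_zz[i])
        maha = np.sum(diff * np.linalg.solve(cov_zz[i], diff.T).T, axis=1)
        log_comp[:, i] = np.log(weights[i]) - 0.5 * (
            joint_dim * np.log(2.0 * np.pi) + logdet + maha
        )
    # Log-sum-exp over components, then sum over frames.
    peak = log_comp.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())

def sat_objective(Z_per_speaker, w_per_speaker, weights, B, b0, mu_y, cov_zz):
    """Total log-likelihood over all pre-stored source speakers."""
    return sum(
        ev_gmm_loglik(Z, w, weights, B, b0, mu_y, cov_zz)
        for Z, w in zip(Z_per_speaker, w_per_speaker)
    )
```

SAT would alternate between updating the per-speaker weight vectors and the shared parameters so that this quantity increases.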

Page 18

SAT for EV-GMM

Advance training process (multiple pre-stored source speakers: 1st, 2nd, ..., S-th; one target speaker)
1. Training of speaker-dependent parameters: weight vectors $w^{(1)}, w^{(2)}, \ldots, w^{(S)}$
2. Training of tied parameters of the canonical EV-GMM $\lambda^{(EV)}$:
• GMM weights
• Bias vectors
• Representative vectors
• Target mean vectors
• Covariance matrices
3. Iteration

Page 19

Comparison of M-to-O VC Algorithms

Algorithm                     Source mean vectors   Tied parameters
Based on SI-GMM               Not adapted           Not adapted
Based on speaker selection    Roughly adapted       Roughly adapted
Based on EVC                  Adapted               Not adapted
Based on EVC with SAT         Adapted               Optimized in advance

Page 20

Contents

Many-to-One VC framework

Many-to-One VC algorithms

Experimental evaluations

Conclusion

Page 21

Experimental Conditions

Training stage: 160 pre-stored source speakers (80 males and 80 females) and 1 male target speaker
Adaptation stage: 10 source speakers (5 males and 5 females)
50 sentences uttered by each speaker

The number of mixtures: 128
The number of representative vectors: 159
The number of selected speakers: 27

Page 22

Experimental Conditions (cont’d)

Objective evaluation
Test data: 21 sentences
Objective measure: Spectral distortion
The number of adaptation sentences: Varying from 1/32 to 32

Subjective evaluation
Preference test on speech quality of converted voices
The number of subjects: 6 (each subject evaluated 120 sample-pairs)
The number of adaptation sentences: 2
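The slides name the objective measure only as spectral distortion. One widely used form is mel-cepstral distortion in dB, sketched below purely as an assumption about what such a measure could look like; the slides do not confirm the formula, the feature type, or the function name `mel_cepstral_distortion`.

```python
import numpy as np

def mel_cepstral_distortion(target, converted):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    target, converted: arrays of shape (T, D) holding time-aligned
    mel-cepstral coefficients (assumed features; typically the 0th
    energy coefficient is excluded before calling this).
    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    """
    diff = np.asarray(target, dtype=float) - np.asarray(converted, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

A lower value means the converted features are closer to the target speaker's features.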

Page 23

Result of Objective Evaluation

[Figure: result of the objective evaluation; the axis is annotated from "Worse" to "Better".]

• The adaptation techniques improve the conversion accuracy. SAT yields further improvements.
• EVC and EVC with SAT cause large distortions when the amount of adaptation data is very limited.
• Speaker selection is effective even with a very limited amount of adaptation data.

Page 24

Result of Subjective Evaluation

Every adaptation technique improves the converted speech quality.

Page 25

Contents

Many-to-One VC framework

Many-to-One VC algorithms

Experimental evaluations

Conclusion

Page 26

Conclusions

We conducted an experimental evaluation of many-to-one VC algorithms:
• based on SI-GMM [T. Toda, et al.]
• based on EVC [T. Toda, et al.]
• based on speaker selection (new method)
• based on EVC with SAT (new method)

Results of objective and subjective evaluations showed that:
• the adaptation process results in a better conversion model than the SI-GMM.
• the algorithm based on speaker selection works well with a very small amount of adaptation data.

Page 27

Thank you for your attention!

Any questions?