Design and Implementation of Voice Conversion Application (VOCAL)
Transcript of Design and Implementation of Voice Conversion Application (VOCAL)
Elizabeth Kwan (26406025)
Supervised by: Ms. Liliana, M.Eng
Mr. Resmana Lim, M.Eng
DEFINITION: What is Voice Conversion?
A method to transform the input speech signal such that the output signal will be perceived as produced by another speaker
BACKGROUND: Why Voice Conversion?
Rapid development in speech technology
Speech recognition and text-to-speech have been the priorities in research efforts to improve human-machine (computer) interaction
Improve the naturalness of human-machine (computer) interaction
Voice conversion is used in the personification of speech-enabled systems
SCOPE & LIMITATION: Scope and limitations of the project
GENERAL: Format: wave file (.wav), single channel (mono)
INPUT: Source speaker and target speaker speaking the same utterances
Home recording
One person with minimal noise (no background sound)
For speech only
SCOPE & LIMITATION: Scope and limitations of the project
PROCESS: Not real-time; pre-recorded speech is needed
Text-dependent
OUTPUT: The output signal will be perceived as produced by another speaker, judged subjectively by human auditory perception
Dialect not included
SCOPE & LIMITATION: Scope and limitations of the project
Test using Mean Opinion Score (MOS)
Developed in .NET environment (C# .NET Visual Studio 2005)
VOICE CONVERSION METHOD: Brief explanation of Voice Conversion
Different conversion systems use different methods
General system: A method to represent the speaker-specific characteristics of the speech waveform
A method to map the source and the target acoustical spaces
A method to modify the characteristics of the source speech using the mapping obtained in the previous step
VOICE CONVERSION METHOD (see Page 33)
Flowchart summary (Sound A = source speaker, Sound B = target speaker):
1. Both sounds are segmented into chunks A(i)…A(n) and B(i)…B(n).
2. Each chunk is resampled and analyzed with LPC.
3. An inverse filter yields the excitation of one chunk; the other chunk's LPC filter is applied to that excitation.
4. Pitch periods are computed for both chunks, and the pitch is replaced.
5. Each converted chunk C(i)…C(n) is synthesized, and a window combination joins them into Sound C (converted).
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
WHY IS IT DIFFICULT? External Problems
Complexity of human language
Speech is more than a sequence of phones that form words and sentences; it carries additional information (rhythm, intonation, word stress, etc.)
This information varies from one person to another
This endless variety raises the application's complexity, especially in segmentation
WHY IS IT DIFFICULT? External Problems
Speaker Variability
Every voice is unique; speech produced by one person may also vary in:
- Realization
- Speaking style
- Sex of speaker
- Anatomy of the vocal tract
- Speed of speech
- Dialects
WHY IS IT DIFFICULT? Internal Problems
The digital form only contains amplitude information per sample period
Amplitude cannot be used directly to determine the speech parameters (a problem for the analysis process)
Manipulating (adding to or deleting) some part of the sound affects the whole sound
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
SEGMENTATION (Flow Chart, see Page 34)
It is difficult to process an entire phrase, as tone, pitch, and other characteristics may vary over the whole signal
Split based on syllables
Use end-point detection methods: a combination of volume (two volume thresholds) and zero-crossing rate (ZCR)
SEGMENTATION (Flow Chart, see Page 34)
Volume: the loudness of the audio signal
Zero-Crossing Rate (ZCR): the rate at which the signal changes from positive to negative, and vice versa
volume = Σ_{i=1}^{n} |S(i)|
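A minimal sketch (in Python, though the application itself was built in C#) of the two-threshold end-point detection described above, using frame volume and ZCR. The frame length and threshold ratios are illustrative assumptions, not the thesis values:

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Compute volume (sum of |amplitude|) and ZCR per frame."""
    n_frames = len(signal) // frame_len
    volumes, zcrs = [], []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        volumes.append(np.sum(np.abs(frame)))          # volume = sum |S(i)|
        signs = np.sign(frame)
        zcrs.append(int(np.sum(signs[:-1] != signs[1:])))  # sign changes
    return np.array(volumes), np.array(zcrs)

def detect_endpoints(signal, frame_len=256, high_ratio=0.5, low_ratio=0.1):
    """Two-threshold end-point detection: a segment must exceed the high
    volume threshold somewhere, and is extended outward while frames stay
    above the low threshold. Returns (start, end) sample indices."""
    vol, _ = frame_features(signal, frame_len)
    high = high_ratio * vol.max()
    low = low_ratio * vol.max()
    segments, i, n = [], 0, len(vol)
    while i < n:
        if vol[i] >= high:
            start = i
            while start > 0 and vol[start - 1] >= low:
                start -= 1
            end = i
            while end < n - 1 and vol[end + 1] >= low:
                end += 1
            segments.append((start * frame_len, (end + 1) * frame_len))
            i = end + 1
        else:
            i += 1
    return segments
```

On a recording with clear silence between syllables, each detected segment then corresponds to one syllable-sized chunk.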
SEGMENTATION (Flow Chart, see Page 34)
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
ANALYSIS OR MODELING: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
ANALYSIS OR MODELING: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
ANALYSIS OR MODELING: Modeling the Vocal Tract
Source: signal x(t) [excitation signal]
Filter: linear time-invariant h(t) [transfer function]
Speech: convolution of source and filter, y(t) = x(t) * h(t)
ANALYSIS OR MODELING: Modeling the Vocal Tract
Deconvolution is needed
Use of LPC methods: predicting a sample of a speech signal based on several previous samples
ANALYSIS OR MODELING: Modeling the Vocal Tract
ŝ[n] = Σ_{k=1}^{p} a_k · s[n−k]
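The predictor above can be estimated with the autocorrelation method and the Levinson-Durbin recursion. A minimal Python sketch (not the thesis code; the order and frame handling are assumptions):

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients a_1..a_p for one frame
    (autocorrelation method, Levinson-Durbin recursion)."""
    n = len(frame)
    # Autocorrelation r[0..p]
    r = [float(np.dot(frame[:n - k], frame[k:])) for k in range(order + 1)]
    a = [0.0] * (order + 1)   # a[0] unused; a[1..p] are the coefficients
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]  # a_1 .. a_p such that s_hat[n] = sum a_k * s[n-k]
```

Run on a signal that actually follows an order-p autoregressive model, the recursion recovers the model's coefficients.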
ANALYSIS OR MODELING: Linear Predictive Coding
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 36)
Pitch Period Computation
Pitch Analysis
Glottal Pulse Computation
Pitch Tier Computation
Pitch Analysis: based on autocorrelation methods (Boersma, 1993)
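A simplified sketch of autocorrelation-based pitch-period estimation, in the spirit of the cited Boersma (1993) method but without its windowing and interpolation refinements; the search range values are assumptions:

```python
import numpy as np

def pitch_period(frame, fs, fmin=75.0, fmax=500.0):
    """Return the estimated pitch period in samples: the lag of the
    autocorrelation peak within [fs/fmax, fs/fmin]."""
    frame = frame - np.mean(frame)
    # One-sided autocorrelation, normalized so ac[0] == 1
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / ac[0]
    lo = int(fs / fmax)               # shortest plausible period
    hi = int(fs / fmin)               # longest plausible period
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag
```

For voiced frames the dominant autocorrelation peak sits at the glottal period; dividing the sampling rate by the returned lag gives the pitch in Hz.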
ANALYSIS OR MODELING: Pitch Period Computation
Glottal Pulse Computation: the repeated pattern of a voiced sound
τ : glottal pulse
ANALYSIS OR MODELING: Pitch Period Computation
Pitch Tier Calculation: the number of points corresponds to the number of voiced frames in the pitch contour obtained in the previous step
ANALYSIS OR MODELING: Pitch Period Computation
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
TRANSFORMATION: Transform the speech parameters obtained
Transformation:
- Extract the pitch from the target chunk (the original target chunk, before resampling)
- Extract the pitch from the already-filtered source
- Replace the pitch
- Return
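A naive sketch of the pitch-replacement idea: each pitch period of the (inverse-filtered) source excitation is resampled to the target pitch-period length. The thesis works point-by-point on a pitch tier; this fixed-period version only illustrates the concept:

```python
import numpy as np

def replace_pitch(excitation, src_period, tgt_period):
    """Resample each src_period-length chunk of the excitation to
    tgt_period samples, changing the pitch while keeping the number
    of glottal periods."""
    out = []
    for start in range(0, len(excitation) - src_period + 1, src_period):
        chunk = excitation[start:start + src_period]
        # Linear interpolation of the chunk onto the target period length
        x_old = np.linspace(0.0, 1.0, src_period)
        x_new = np.linspace(0.0, 1.0, tgt_period)
        out.append(np.interp(x_new, x_old, chunk))
    return np.concatenate(out)
```

Stretching each period (tgt_period > src_period) lowers the pitch; shrinking it raises the pitch, while the per-period waveform shape is preserved.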
SYNTHESIS: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
SYNTHESIS (Flow Chart, see Page 46)
Use of the LPC filter method to reconstruct the transformed speech
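The synthesis step, and the matching inverse filter used earlier in the pipeline, can be sketched as direct-form recursions. A minimal Python illustration (not the thesis code), with coefficients assumed to come from an LPC analysis step:

```python
import numpy as np

def lpc_inverse_filter(signal, a):
    """FIR inverse filter: e[n] = s[n] - sum_{k=1..p} a_k * s[n-k],
    recovering the excitation from the speech signal."""
    p = len(a)
    e = np.zeros(len(signal))
    for n in range(len(signal)):
        acc = signal[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * signal[n - k]
        e[n] = acc
    return e

def lpc_synthesize(excitation, a):
    """All-pole synthesis filter: s[n] = e[n] + sum_{k=1..p} a_k * s[n-k],
    reconstructing speech from an excitation."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s
```

Inverse filtering followed by synthesis with the same coefficients reconstructs the original signal exactly; conversion comes from swapping in the other speaker's coefficients and pitch before synthesis.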
EXPERIMENTAL RESULT
TESTING: Effect of the choice of recording hardware
Microphone: Philips PC Headset (SHM7410U/1)
Soundcard: Realtek HD Audio
TESTING: Effect of the choice of recording hardware
Microphone: Shure Beta 58
Soundcard: Realtek HD Audio
TESTING: Effect of the choice of recording hardware
Microphone: Shure Beta 58
Soundcard: EMU0404
TESTING: Test on segmentation
Speech: “Hai” from 4 (four) different speakers
TESTING: Test on segmentation
Speech: “Hai” from 4 (four) different speakers
Percentage result: for speech with only 1 (one) syllable, 100% success
TESTING: Test on segmentation
Speech: “Saya” from 4 (four) different speakers
Percentage result: for speech with 2 (two) syllables without a pause, 0% success (all detected as only 1 (one) syllable). However, it still works well in the application: 100% success.
TESTING: Test on segmentation
Speech: “Sistem Cerdas” from 4 (four) different speakers
Percentage result: for speech with more complex forms, 50% success. This is related to speaker variability.
TESTING: Test on segmentation
TESTING: Test on pitch modification
No | Utterance | Source Speaker | Source Freq (Hz) | Target Speaker | Target Freq (Hz) | Converted (Hz)
1  | Good      | Kath  | 242.04 | Liz   | 266.95 | 263.16
2  | Hai       | Kath  | 227.09 | Zefan | 176.26 | 172.41
3  | Saya      | Liz   | 259.11 | Will  | 170.14 | 172.41
4  | Hallo     | Zefan | 162.01 | Liz   | 100.18 | 100
5  | A         | Will  | 151.44 | Zefan | 191.57 | 188.68
TESTING: Test on pitch modification
No | Utterance | Target Speaker | Target Freq (Hz) | Converted (Hz) | Success Rate
1  | Good      | Liz   | 266.95 | 263.16 | 98.58%
2  | Hai       | Zefan | 176.26 | 172.41 | 97.82%
3  | Saya      | Will  | 170.14 | 172.41 | 98.66%
4  | Hallo     | Liz   | 100.18 | 100    | 99.82%
5  | A         | Zefan | 191.57 | 188.68 | 98.49%
Average percentage result: 98.67%
TESTING: Subjectivity Test
Similarity (based on human auditory perception). Tested on 20 people, 5 utterances.
Overall result: 3.71 of 5.0
Utterance | Source | Target | Avg. score
Good  | Kath  | Liz   | 3.55 of 5.0
Hai   | Kath  | Zefan | 4.1 of 5.0
Saya  | Liz   | Will  | 3.4 of 5.0
Hallo | Zefan | Liz   | 3.65 of 5.0
A     | Will  | Zefan | 3.85 of 5.0
TESTING: Subjectivity Test
Based on gender. Tested on 22 people, 2 utterances, with 4 gender combinations for each utterance.
From   | To     | Overall Rank
Female | Female | 2.591
Female | Male   | 1.818
Male   | Female | 2.727
Male   | Male   | 2.864
TESTING: Subjectivity Test
Similarity of speaker characteristics. Tested on 22 people, 5 utterances.
Overall result: 3.64 of 5.0
No | Utterance  | Source Speaker | Target Speaker | Avg. Score
1  | Carike     | Leonita | Daniel  | 3.29 of 5.0
2  | Mboh yo    | Daniel  | Leonita | 3.95 of 5.0
3  | Ndek mana  | Melinda | Indro   | 4.16 of 5.0
4  | Ra mangan  | Melinda | Angela  | 3.41 of 5.0
5  | Ya toh     | Indro   | Liz     | 3.36 of 5.0
CONCLUSION: Conclusions from the experimental results
The segmentation result is fairly effective for certain speech; it depends on the input speech, which can be very diverse
For segmentation, longer speech yields a lower success rate
Segmentation affects the conversion result
CONCLUSION: Conclusions from the experimental results
The pitch modification calculation works successfully (average 98.67%)
The system is fairly effective at imitating a given target speaker (average score 3.71 of 5.0)
Female-to-male conversion gives the best results (overall rank 1.818 of 4.0)
Speaker characteristics are fairly well recognized by auditory perception (overall score 3.64 of 5.0)
SUGGESTION: For future development
Semi-automatic segmentation is needed for better results
Currently, the system only converts 2 voices saying the same word or phrase (text-dependent); a neural network is needed to make the system text-independent
A real-time system is possible
More research on frequency-domain processing is needed
Design and Implementation of Voice Conversion Application (VOCAL)
THANKS FOR YOUR ATTENTION