Design and Implementation of Voice Conversion Application (VOCAL)
Transcript of Design and Implementation of Voice Conversion Application (VOCAL)
Elizabeth Kwan (26406025)
Supervised by: Ms. Liliana, M.Eng
Mr. Resmana Lim, M.Eng
DEFINITION: What is Voice Conversion?
A method to transform the input speech signal such that the output signal will be perceived as produced by another speaker
BACKGROUND: Why Voice Conversion?
Rapid development in speech technology
Speech recognition and text-to-speech have been the priorities in research efforts to improve human-machine (computer) interaction
Improve the naturalness of human-machine (computer) interaction
Voice conversion is used in the personification of speech-enabled systems
SCOPE & LIMITATION: Scope and limitations of the project
GENERAL: Format: wave file (.wav), single channel (mono)
INPUT: Source speaker and target speaker speaking the same utterances
Home recording
One person with minimal noise (no background sound)
For speech only
SCOPE & LIMITATION: Scope and limitations of the project
PROCESS: Not real-time; pre-recorded speech is needed
Text-dependent
OUTPUT: The output signal will be perceived as produced by another speaker, judged subjectively by human auditory perception
Dialect not included
SCOPE & LIMITATION: Scope and limitations of the project
Test using Mean Opinion Score (MOS)
Developed in .NET environment (C# .NET Visual Studio 2005)
VOICE CONVERSION METHOD: Brief explanation of Voice Conversion
Different conversion systems use different methods
General system: A method to represent the speaker-specific characteristics of the speech waveform
A method to map the source and the target acoustical spaces
A method to modify the characteristics of the source speech using the mapping obtained in the previous step
VOICE CONVERSION METHOD (see Page 33)
Flowchart summary (Sound A = source speaker, Sound B = target speaker):
1. Both sounds are segmented into chunks A(i)…A(n) and B(i)…B(n).
2. Each chunk is resampled and analyzed with LPC.
3. An inverse filter yields the excitation of one chunk; the other chunk's LPC filter is applied to that excitation.
4. Pitch periods are computed for both chunks, and the pitch is replaced.
5. Each converted chunk C(i)…C(n) is synthesized, and a window combination joins them into Sound C (converted).
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
WHY IS IT DIFFICULT? External Problems
Complexity of human language
Speech is more than a sequence of phones that form words and sentences; it carries additional information (rhythm, intonation, word stress, etc.)
This information varies from one person to another
This endless variety raises the application's complexity, especially in segmentation
WHY IS IT DIFFICULT? External Problems
Speaker Variability
Every voice is unique; speech produced by one person may also vary in:
- Realization
- Speaking style
- Sex of speaker
- Anatomy of the vocal tract
- Speed of speech
- Dialects
WHY IS IT DIFFICULT? Internal Problems
The digital form only contains amplitude information per sample period
Amplitude cannot be used directly to determine the speech parameters (a problem for the analysis process)
Manipulating (adding to or deleting) some part of the sound affects the whole sound
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
SEGMENTATION (Flow Chart, see Page 34)
It is difficult to process an entire phrase, as tone, pitch, and other characteristics may vary over the whole signal
Split based on syllables
Use end-point detection methods: a combination of volume (two volume thresholds) and zero-crossing rate (ZCR)
SEGMENTATION (Flow Chart, see Page 34)
Volume: the loudness of the audio signal
Zero-Crossing Rate (ZCR): the rate at which the signal changes from positive to negative, and vice versa
volume = Σ_{i=1}^{n} |S(i)|
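A minimal sketch (in Python, though the application itself was built in C#) of the two-threshold end-point detection described above, using frame volume and ZCR. The frame length and threshold ratios are illustrative assumptions, not the thesis values:

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Compute volume (sum of |amplitude|) and ZCR per frame."""
    n_frames = len(signal) // frame_len
    volumes, zcrs = [], []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        volumes.append(np.sum(np.abs(frame)))          # volume = sum |S(i)|
        signs = np.sign(frame)
        zcrs.append(int(np.sum(signs[:-1] != signs[1:])))  # sign changes
    return np.array(volumes), np.array(zcrs)

def detect_endpoints(signal, frame_len=256, high_ratio=0.5, low_ratio=0.1):
    """Two-threshold end-point detection: a segment must exceed the high
    volume threshold somewhere, and is extended outward while frames stay
    above the low threshold. Returns (start, end) sample indices."""
    vol, _ = frame_features(signal, frame_len)
    high = high_ratio * vol.max()
    low = low_ratio * vol.max()
    segments, i, n = [], 0, len(vol)
    while i < n:
        if vol[i] >= high:
            start = i
            while start > 0 and vol[start - 1] >= low:
                start -= 1
            end = i
            while end < n - 1 and vol[end + 1] >= low:
                end += 1
            segments.append((start * frame_len, (end + 1) * frame_len))
            i = end + 1
        else:
            i += 1
    return segments
```

On a recording with clear silence between syllables, each detected segment then corresponds to one syllable-sized chunk.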
SEGMENTATION (Flow Chart, see Page 34)
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
ANALYSIS OR MODELING: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
ANALYSIS OR MODELING: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
ANALYSIS OR MODELING: Modeling the Vocal Tract
Source: signal x(t) [excitation signal]
Filter: linear time-invariant h(t) [transfer function]
Speech: convolution of source and filter, y(t) = x(t) * h(t)
ANALYSIS OR MODELING: Modeling the Vocal Tract
Deconvolution is needed
Use of LPC methods: predicting a sample of a speech signal based on several previous samples
ANALYSIS OR MODELING: Modeling the Vocal Tract
ŝ[n] = Σ_{k=1}^{p} a_k · s[n−k]
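The predictor above can be estimated with the autocorrelation method and the Levinson-Durbin recursion. A minimal Python sketch (not the thesis code; the order and frame handling are assumptions):

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients a_1..a_p for one frame
    (autocorrelation method, Levinson-Durbin recursion)."""
    n = len(frame)
    # Autocorrelation r[0..p]
    r = [float(np.dot(frame[:n - k], frame[k:])) for k in range(order + 1)]
    a = [0.0] * (order + 1)   # a[0] unused; a[1..p] are the coefficients
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]  # a_1 .. a_p such that s_hat[n] = sum a_k * s[n-k]
```

Run on a signal that actually follows an order-p autoregressive model, the recursion recovers the model's coefficients.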
ANALYSIS OR MODELING: Linear Predictive Coding
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 36)
ANALYSIS or MODELING
Linear Predictive Coding
Pitch Period Computation
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 36)
Pitch Period Computation
Pitch Analysis
Glottal Pulse Computation
Pitch Tier Computation
Pitch Analysis: based on autocorrelation methods (Boersma, 1993)
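A simplified sketch of autocorrelation-based pitch-period estimation, in the spirit of the cited Boersma (1993) method but without its windowing and interpolation refinements; the search range values are assumptions:

```python
import numpy as np

def pitch_period(frame, fs, fmin=75.0, fmax=500.0):
    """Return the estimated pitch period in samples: the lag of the
    autocorrelation peak within [fs/fmax, fs/fmin]."""
    frame = frame - np.mean(frame)
    # One-sided autocorrelation, normalized so ac[0] == 1
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / ac[0]
    lo = int(fs / fmax)               # shortest plausible period
    hi = int(fs / fmin)               # longest plausible period
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag
```

For voiced frames the dominant autocorrelation peak sits at the glottal period; dividing the sampling rate by the returned lag gives the pitch in Hz.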
ANALYSIS OR MODELING: Pitch Period Computation
Glottal Pulse Computation: the repeated pattern of a voiced sound
τ : glottal pulse
ANALYSIS OR MODELING: Pitch Period Computation
Pitch Tier Calculation: the number of points corresponds to the number of voiced frames in the pitch contour obtained in the previous step
ANALYSIS OR MODELING: Pitch Period Computation
VOICE CONVERSION METHOD: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
TRANSFORMATION: Transform the speech parameters obtained
Transformation:
- Extract the pitch from the target chunk (the original target chunk, before resampling)
- Extract the pitch from the already-filtered source
- Replace the pitch
- Return
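A naive sketch of the pitch-replacement idea: each pitch period of the (inverse-filtered) source excitation is resampled to the target pitch-period length. The thesis works point-by-point on a pitch tier; this fixed-period version only illustrates the concept:

```python
import numpy as np

def replace_pitch(excitation, src_period, tgt_period):
    """Resample each src_period-length chunk of the excitation to
    tgt_period samples, changing the pitch while keeping the number
    of glottal periods."""
    out = []
    for start in range(0, len(excitation) - src_period + 1, src_period):
        chunk = excitation[start:start + src_period]
        # Linear interpolation of the chunk onto the target period length
        x_old = np.linspace(0.0, 1.0, src_period)
        x_new = np.linspace(0.0, 1.0, tgt_period)
        out.append(np.interp(x_new, x_old, chunk))
    return np.concatenate(out)
```

Stretching each period (tgt_period > src_period) lowers the pitch; shrinking it raises the pitch, while the per-period waveform shape is preserved.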
SYNTHESIS: Main Process (Flow Chart, see Page 30)
SEGMENTATION
ANALYSIS or MODELING
TRANSFORMATION
SYNTHESIS
SYNTHESIS (Flow Chart, see Page 46)
Use of the LPC filter method to reconstruct the transformed speech
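The synthesis step, and the matching inverse filter used earlier in the pipeline, can be sketched as direct-form recursions. A minimal Python illustration (not the thesis code), with coefficients assumed to come from an LPC analysis step:

```python
import numpy as np

def lpc_inverse_filter(signal, a):
    """FIR inverse filter: e[n] = s[n] - sum_{k=1..p} a_k * s[n-k],
    recovering the excitation from the speech signal."""
    p = len(a)
    e = np.zeros(len(signal))
    for n in range(len(signal)):
        acc = signal[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * signal[n - k]
        e[n] = acc
    return e

def lpc_synthesize(excitation, a):
    """All-pole synthesis filter: s[n] = e[n] + sum_{k=1..p} a_k * s[n-k],
    reconstructing speech from an excitation."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s
```

Inverse filtering followed by synthesis with the same coefficients reconstructs the original signal exactly; conversion comes from swapping in the other speaker's coefficients and pitch before synthesis.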
EXPERIMENTAL RESULT
TESTING: Effect of the choice of recording hardware
Microphone: Philips PC Headset (SHM7410U/1)
Soundcard: Realtek HD Audio
TESTING: Effect of the choice of recording hardware
Microphone: Shure Beta 58
Soundcard: Realtek HD Audio
TESTING: Effect of the choice of recording hardware
Microphone: Shure Beta 58
Soundcard: EMU0404
TESTING: Test on segmentation
Speech: “Hai” from 4 (four) different speakers
TESTING: Test on segmentation
Speech: “Hai” from 4 (four) different speakers
Percentage result: for speech with only 1 (one) syllable, 100% success
TESTING: Test on segmentation
Speech: “Saya” from 4 (four) different speakers
Percentage result: for speech with 2 (two) syllables without a pause, 0% success (all detected as only 1 (one) syllable). However, it still works well in the application: 100% success.
TESTING: Test on segmentation
Speech: “Sistem Cerdas” from 4 (four) different speakers
Percentage result: for speech with more complex forms, 50% success. This is related to speaker variability.
TESTING: Test on segmentation
TESTING: Test on pitch modification
No | Utterance | Source Speaker | Source Freq (Hz) | Target Speaker | Target Freq (Hz) | Converted (Hz)
1  | Good      | Kath  | 242.04 | Liz   | 266.95 | 263.16
2  | Hai       | Kath  | 227.09 | Zefan | 176.26 | 172.41
3  | Saya      | Liz   | 259.11 | Will  | 170.14 | 172.41
4  | Hallo     | Zefan | 162.01 | Liz   | 100.18 | 100
5  | A         | Will  | 151.44 | Zefan | 191.57 | 188.68
TESTING: Test on pitch modification
No | Utterance | Target Speaker | Target Freq (Hz) | Converted (Hz) | Success Rate
1  | Good      | Liz   | 266.95 | 263.16 | 98.58%
2  | Hai       | Zefan | 176.26 | 172.41 | 97.82%
3  | Saya      | Will  | 170.14 | 172.41 | 98.66%
4  | Hallo     | Liz   | 100.18 | 100    | 99.82%
5  | A         | Zefan | 191.57 | 188.68 | 98.49%
Average percentage result: 98.67%
TESTING: Subjectivity Test
Similarity (based on human auditory perception). Tested on 20 people, 5 utterances.
Overall result: 3.71 of 5.0
Utterance | Source | Target | Avg. score
Good  | Kath  | Liz   | 3.55 of 5.0
Hai   | Kath  | Zefan | 4.1 of 5.0
Saya  | Liz   | Will  | 3.4 of 5.0
Hallo | Zefan | Liz   | 3.65 of 5.0
A     | Will  | Zefan | 3.85 of 5.0
TESTING: Subjectivity Test
Based on gender. Tested on 22 people, 2 utterances, with 4 gender combinations for each utterance.
From   | To     | Overall Rank
Female | Female | 2.591
Female | Male   | 1.818
Male   | Female | 2.727
Male   | Male   | 2.864
TESTING: Subjectivity Test
Similarity of speaker characteristics. Tested on 22 people, 5 utterances.
Overall result: 3.64 of 5.0
No | Utterance  | Source Speaker | Target Speaker | Avg. Score
1  | Carike     | Leonita | Daniel  | 3.29 of 5.0
2  | Mboh yo    | Daniel  | Leonita | 3.95 of 5.0
3  | Ndek mana  | Melinda | Indro   | 4.16 of 5.0
4  | Ra mangan  | Melinda | Angela  | 3.41 of 5.0
5  | Ya toh     | Indro   | Liz     | 3.36 of 5.0
CONCLUSION: Conclusions from the experimental results
The segmentation result is fairly effective for certain speech; it depends on the input speech, which can be very diverse
For segmentation, longer speech yields a lower success rate
Segmentation affects the conversion result
CONCLUSION: Conclusions from the experimental results
The pitch modification calculation works successfully (average 98.67%)
The system is fairly effective at imitating a given target speaker (average score 3.71 of 5.0)
Female-to-male conversion gives the best results (overall rank 1.818 of 4.0)
Speaker characteristics are fairly well recognized by auditory perception (overall score 3.64 of 5.0)
SUGGESTION: For future development
Semi-automatic segmentation is needed for better results
Currently, the system only converts 2 voices saying the same word or phrase (text-dependent); a neural network is needed to make the system text-independent
A real-time system is possible
More research on frequency-domain processing is needed
Design and Implementation of Voice Conversion Application (VOCAL)
THANKS FOR YOUR ATTENTION