Post on 06-Feb-2018
12/1/2016 1
Voice Transformation : Speech Morphing By PWI
Submitted by: Gidon Porat Supervisor: Dr. Yizhar Lavner
Signal and Image Processing Lab
12/1/2016 2
Objective
Gradually change a source speaker’s voice, to sound like the voice of a target speaker.
The inputs : two reference voice signals.
12/1/2016 3
Applications
● Multimedia and video entertainment: While seeing a face gradually changing from one person to another’s (like often done in video clips) ,we could simultaneously hear his voice changing as well.
● Forensic voice identification by synthesis:Identifying a suspect voice by creating a voice-bank.
12/1/2016 4
The Challenges
Produce a natural sounding speech signal by a computer.
The “naive solution” gives poor results.
The source and target voice signals will never be of the same length.
Speaker A Speaker B interpolation
12/1/2016 5
Speech Modeling - Synthesis
Impulsetrain
generator
Glottal PulsemodelG(z)
RandomNoise
generator
Vocaltract model
V(z)
RadiationmodelR(z)
PitchDetector
LinearPredictor
SpeechProduction
System
Voiced/unvoiced
The basic synthesis of digital speech is done by the discrete-time system model.
12/1/2016 6
Speech Modeling
A1 A2 A3 Ak
Sound transmission in the vocal tract can be modeled as sound passing through concatenated lossless acoustic tubes.
A known mathematical relation between the areas of these tubes and the vocal tract’s filter, will help in the implementation of our algorithm.
⇒ ⇒1
1
k kk
k k
A ArA A
+
+
−=
+
01
1 1
11 1
( ) 1
( ) ( ) ( ). . :
1 (1)
kk k k k
D zD z D z r z D ze gD r z
− −− −
−
=
= + ⋅ ⋅
= + ⋅ ⋅
( )( )GV z
D z=
12/1/2016 7
Prototype Waveform Interpolation
PWI is a speech signal representation method, which is based on the presentation of a speech signal, or its residual error function, by a 3-D surface.
This method allows the coding of a prototype waveform’s structure evolution in time. φ t
12/1/2016 80 100 200 300 400 500 600 700 800 900 10000
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
time
0 100 200 300 400 500 600 700 800 900 10000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
time0 100 200 300 400 500 600 700 800 900 1000
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
time0 100 200 300 400 500 600 700 800 900 1000
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
time
time
φφ
tt
12/1/2016 9
The Solution
Use 3D surfaces that will capture each vocal phoneme’s residual error signal characteristics and interpolate between the two speakers.
Unvoiced phonemes are not dealt with due to the fact that they carry less information about the speaker.
12/1/2016 10
Algorithm – Block Diagram
PhonemesBoundaries
Demarcation
LPCAnalysis
PitchDetector
PhonemesBoundariesDemarcation
PitchDetector
LPCAnalysis
PWICreation
PWICreation
Interpolator
e(n)reconstruction
new LPCcalculator
Speaker A Speaker B
Synthesizer
12/1/2016 11
PWI Surface Interpolation
+
Intermediate surface
Speaker A
Speaker B
The new error function, that will be reconstructed from that surface, will then be the input of a new Vocal Tract Filter.
Once the surfaces for both source and target speakers are created (for each phoneme) an interpolated surface is created.
( , ) ( , ) (1 ) ( , )new s tu t u t u tφ α φ α φ= ⋅ + − ⋅
12/1/2016 12
The New Excitation - ReconstructionThe new, intermediate error signal can be evaluated from the new surface by the equation :
And that :
( ) ( , ( )), [0 : ]new new new newe t u t t t Tφ= ∀ =
0
''
2( )( )
t
newnewt
t dtp t
πφ = ∫
Assuming the pitch cycle changes slowly in time :
( )new tφ
( ) ( ) (1 ) ( )new s tp t p t p tα α= ⋅ + − ⋅
12/1/2016 13
New Lossless Tube Model
A1s A2s A3s A4s A5s A6s
A1n A2n A3n A4n A5n A6n
The areas of the new tube model will be an interpolation between the source’s and the target’s.Once the new Areas are computed the LPC parameters and V(z) can be calculated, and the signal can be synthesized.1 2 3 4 5 6 7( , , , , , , )a a a a a a a
(1 ) :[1, 2,..., ]new s ti i iA A A i Nα α= ⋅ + − ⋅ ∀
12/1/2016 14
The morphing factor
Variant factor (from t=0 to t=T) => Gradually changing voice.
Invariant factor => Intermediate voice.
In order for one to hear a “linear” change between the source’s and the target’s voices, the morphing factor, (the relative part of ) has to vary nonlinearly in time in a logarithmic way(specifically in this algorithm).
α
( , )su t φ
12/1/2016 15
0 1000 2000 3000 4000 5000 6000 7000-0.03
-0.02
-0.01
0
0.01
0.02
0.03
0.04
Concatenating the new phonemes
The final morphed speech signal is created by concatenating the new vocal phonemes, sequentially, along with the source’s/target’s unvoiced phonemes and silence periods.
1
silence
2voiced3
unvoiced
4voiced
12/1/2016 16
Demonstrations
Speaker A.
Speaker B.
Superposition intermediate signal.Algorithm’s intermediate signal.
Superposition gradual changing signal.
Algorithm’s gradual changing signal.
12/1/2016 17
ConclusionAdvantages:
The utterances produced were shown to consist of intermediate features of the two speakers, as apposed to the “naive solution”.
This field of study is relatively short of publicly available algorithms.
Limitations:
The fact that the unvoiced parts, are not dealt with and the need to better define an “intermediate sound”.
12/1/2016 18
Surface Construction Algorithm
error signalcalculation
LPC parametersPitch detectionmarks
short timepitch function
calculation
prototypewaveformsampler
alignbetween
0 and 2*pi
align withpreviouswaveform
interpolatealong
time axis
the errorwaveform surface
LPF
12/1/2016 19
New Voiced Phoneme Creation
evaluatenew p(t)
source and target error surface
source and target short time
pitch function
modificationparameter (a)
calculatenew fi(t)
create newerror surface
recover newerror function
claculate newtube areas
source and targetLPC parameters
calulatenew LPC
parameters
synthesizenew speech
signalnew
phoneme
modificationparameter (a)
( )tφ
12/1/2016 20
Appendix - Equations
( , ) ( , ) (1 ) ( , )new s t alignedu t u t u tφ α β φ α γ φ−= ⋅ ⋅ + − ⋅ ⋅
( , ) ( , )t aligned t nu t u tφ φ φ− = +2
0 0
( , ) ( , )arg max { }
( , )e
Tnew
s t et
ns t e
u t u t dtd
u u t
π
φφ
φ φ φ φφ
φ φ= =
⋅ +
=⋅ +
∫ ∫
( ) ( ) (1 ) ( )new s tp t p t p tα β α γ= ⋅ ⋅ + − ⋅ ⋅
12/1/2016 21
Appendix – transfer function.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
12/1/2016 22
The Challenges cont.
Here are two identical words (“shade”) form source and target
The target speaker’s word lasts longer then the source speaker’s and it’s “shape” is quite different.
12/1/2016 23
Interpolation between source and target surfaces is carried out.
Intermediate surface construction
( , ) ( , ) (1 ) ( , )new s tu t u t u tφ α β φ α γ φ= ⋅ ⋅ + − ⋅ ⋅
/ ; /new s new tT T T Tβ γ= =
(1 )new s tT T Tα α= ⋅ + − ⋅