An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis
Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI
National Institute of Informatics, Japan
2017-03-07
Contact: [email protected] (for suggestions and discussion)
ICASSP 2017, New Orleans, USA
ABBREVIATION
GMM: Gaussian mixture model
NN: Neural network
RNN: Recurrent neural network
MDN: Mixture density network
RMDN: Recurrent mixture density network
AR: Autoregressive model
AR-RMDN: Autoregressive RMDN (proposed model)
CONTENTS
• Introduction
• Model definition, interpretation, implementation
• Experiments
• Conclusion
INTRODUCTION
Text-to-Speech
• Based on parametric speech synthesis
  - this work: on acoustic models based on neural networks
[Figure: TTS pipeline with the blocks text, text analyzer, acoustic model, and vocoder]
INTRODUCTION
Acoustic models based on neural networks
• RNN [1]
  - output of RNN: $\hat{o}_t$ (spectral features, F0, ...)
[Figure: RNN mapping input textual features $x_1, \dots, x_5$ to generated acoustic features $\hat{o}_1, \dots, \hat{o}_5$ along the time (frame) axis]
[1] Fan, Y., Qian, Y., Xie, F.-L., & Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. INTERSPEECH (pp. 1964–1968).
INTRODUCTION
Acoustic models based on neural networks
• RMDN [2]
  - output of RMDN: $M_t$, the generated parameter set for the distribution $p(o_t; M_t)$
[Figure: RMDN mapping input textual features $x_1, \dots, x_5$ to parameter sets $M_1, \dots, M_5$, each defining a distribution $p(o_t; M_t)$ over $o_1, \dots, o_5$]
[2] Schuster, M. (1999). Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks. In Proc. NIPS (pp. 589–595).
INTRODUCTION
Acoustic models based on neural networks
• RMDN [3, 4]
  - RMDN using GMM:
    $$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)$$
    $$M_t = \{\omega_t^1, \cdots, \omega_t^M, \mu_t^1, \cdots, \mu_t^M, \Sigma_t^1, \cdots, \Sigma_t^M\}$$
  - generate $\hat{o}_t$ from the GMM using MLPG [5]
[3] Bishop, C. M. (2004). Mixture Density Networks. Retrieved from http://eprints.aston.ac.uk/373/
[4] Schuster, M. (1999). Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks. In Proc. NIPS (pp. 589–595).
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP (pp. 1315–1318).
MOTIVATION
Conventional RMDN
• Independence assumption:
  $$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t)$$
• Alternative?
  - the idea of the autoregressive HMM (for speech synthesis [6])
  - ...
[6] Shannon, M., Zen, H., & Byrne, W. (2013). Autoregressive models for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 21(3), 587–597.
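DEFINITION
Proposed model: AR + RMDN
• AR assumption:
  $$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}; M_t)$$
[Figure: RMDN mapping $x_1, \dots, x_5$ to $M_1, \dots, M_5$, with additional links between neighbouring outputs $o_1, \dots, o_5$]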
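DEFINITION
AR-RMDN using GMM
• baseline RMDN:
  $$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)$$
• AR-RMDN:
  $$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m + f(o_{t-K:t-1}), \Sigma_t^m)$$
  $$f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$$
• The AR parameters $\{a_1, \cdots, a_K, b\}$ are
  1. time invariant (context-independent)
  2. jointly trained with the RMDN using back-propagation
[Figure: outputs $o_1, \dots, o_5$ linked by the shared AR weights $a_1$ (one frame back) and $a_2$ (two frames back)]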
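The AR-RMDN density above can be sketched for a single frame as follows. This is only an illustration under stated assumptions, not the authors' toolkit: covariances are taken to be diagonal, and the function names `ar_offset` and `gmm_log_density` are hypothetical.

```python
import math

def ar_offset(history, a, b):
    """f(o_{t-K:t-1}) = sum_k a_k * o_{t-k} + b, element-wise per dimension.

    history[k-1] holds o_{t-k}; a[k-1] holds a_k (hypothetical layout).
    """
    dims = len(b)
    return [b[d] + sum(a[k][d] * history[k][d] for k in range(len(a)))
            for d in range(dims)]

def gmm_log_density(o_t, weights, means, variances, offset):
    """log sum_m w_m N(o_t; mu_m + offset, diag(var_m)); diagonal covariance assumed."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_n = 0.0
        for d in range(len(o_t)):
            diff = o_t[d] - mu[d] - offset[d]
            log_n += -0.5 * math.log(2.0 * math.pi * var[d]) - diff * diff / (2.0 * var[d])
        log_terms.append(math.log(w) + log_n)
    # log-sum-exp over the mixture components for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(v - top) for v in log_terms))
```

Passing an all-zero `offset` recovers the baseline RMDN density, so the only difference between the two models in this sketch is the shift of each component mean.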
BEFORE INTERPRETATION
A glimpse of the results
• smooth?
• larger dynamic range?
[Figure: Generated trajectories of the 30th dimension of mel-generalized cepstrum (MGC) coefficients for utterance BC2011_nancy_APDC2-166-00; x-axis: frame index; y-axis: MGC (30th dim); curves: natural data, RNN, RMDN, AR-RMDN]
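INTERPRETATION
Signals and filters in AR-RMDN
• Simple case: 1st order AR
  $$f(o_{t-1}) = a \odot o_{t-1} + b, \quad \text{where } a = [a_1, \cdots, a_D]^\top$$
  - for the GMM in AR-RMDN, with mixture weights $w_t^m$:
    $$p(o_t \mid o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^{m})^2}{2 (\sigma_{t,d}^{m})^2}\right)$$
    $$o_{t,d} - f(o_{t-1,d}) = \underbrace{o_{t,d} - a_d o_{t-1,d}}_{c_{t,d}} - b_d$$
  where $d \in [1, D]$ is the dimension index and $t \in [1, T]$ is the frame index.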
INTERPRETATION
Signals and filters in AR-RMDN
• Simple case: 1st order AR, $c_{t,d} = o_{t,d} - a_d o_{t-1,d}$, where $d \in [1, D]$ is the dimension index and $t \in [1, T]$ is the frame index
  - if $T = 2$: $c_{1,d} = o_{1,d}$ and $c_{2,d} = o_{2,d} - a_d o_{1,d}$, that is
    $$\begin{bmatrix} 1 & 0 \\ -a_d & 1 \end{bmatrix} \cdot \begin{bmatrix} o_{1,d} \\ o_{2,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \end{bmatrix}$$
  - if $T = 3$: $c_{1,d} = o_{1,d}$, $c_{2,d} = o_{2,d} - a_d o_{1,d}$, $c_{3,d} = o_{3,d} - a_d o_{2,d}$, that is
    $$\begin{bmatrix} 1 & 0 & 0 \\ -a_d & 1 & 0 \\ 0 & -a_d & 1 \end{bmatrix} \cdot \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \end{bmatrix}$$
  - in general:
    $$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix} \cdot \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$
INTERPRETATION
Signals and filters in AR-RMDN
• Simple case: 1st order AR, with dimension index $d \in [1, D]$
  - a filtering process from $o_{1:T,d}$ to $c_{1:T,d}$:
    $$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix} \cdot \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$
    $o_{1:T,d} \rightarrow A_d(z) \rightarrow c_{1:T,d}$, where $A_d(z) = 1 - a_d z^{-1}$ denotes the filter in the z-domain
  - the inverse filtering process from $c_{1:T,d}$ to $o_{1:T,d}$:
    $$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix}^{-1} \cdot \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix} = \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix}$$
    $c_{1:T,d} \rightarrow H_d(z) \rightarrow o_{1:T,d}$, where $H_d(z) = \frac{1}{A_d(z)} = \frac{1}{1 - a_d z^{-1}}$ denotes the filter in the z-domain
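The matrix view of the 1st order case can be checked numerically. The sketch below builds the analysis matrix and its inverse in closed form for one feature dimension; the helper names are hypothetical, and the closed-form inverse (entries $a_d^{i-j}$ on and below the diagonal) is a standard property of this bidiagonal matrix rather than something stated on the slides.

```python
def analysis_matrix(T, a_d):
    """T x T matrix with 1 on the diagonal and -a_d just below it (maps o to c)."""
    return [[1.0 if i == j else (-a_d if i == j + 1 else 0.0) for j in range(T)]
            for i in range(T)]

def synthesis_matrix(T, a_d):
    """Inverse of the analysis matrix: entry (i, j) is a_d**(i-j) for i >= j (maps c to o)."""
    return [[a_d ** (i - j) if i >= j else 0.0 for j in range(T)] for i in range(T)]

def matvec(M, v):
    """Matrix-vector product on plain lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    """Matrix-matrix product on plain lists."""
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]
```

Each column of the synthesis matrix is a truncated impulse response of $H_d(z)$, which is why it is dense below the diagonal while the analysis matrix has only two non-zero bands.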
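INTERPRETATION
Signals and filters in AR-RMDN
• With high order AR, for each dimension index $d \in [1, D]$
  - joint training of the AR filter and the RMDN using back-propagation:
    $o_{1:T,d} \rightarrow$ AR analysis filter $A_d(z) \rightarrow c_{1:T,d}$, modelled by the RMDN (driven by textual features) as $\prod_{t=1}^{T} p(c_t; M_t)$
    $$A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$$
  - generation:
    textual features $\rightarrow$ RMDN, $\prod_{t=1}^{T} p(c_t; M_t) \rightarrow \hat{c}_{1:T,d} \rightarrow$ AR synthesis filter $H_d(z) \rightarrow \hat{o}_{1:T,d}$
    $$H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$$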
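The analysis and synthesis filters for one dimension can be written as two short recursions. This is a minimal sketch, assuming frames before the first one are zero and ignoring the bias $b$; `ar_analysis` and `ar_synthesis` are hypothetical names.

```python
def ar_analysis(o, a):
    """A_d(z): c_t = o_t - sum_k a_k * o_{t-k}, with frames before t = 1 taken as 0."""
    c = []
    for t in range(len(o)):
        pred = sum(a[k] * o[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        c.append(o[t] - pred)
    return c

def ar_synthesis(c, a):
    """H_d(z) = 1 / A_d(z): o_t = c_t + sum_k a_k * o_{t-k}, fed back on its own output."""
    o = []
    for t in range(len(c)):
        pred = sum(a[k] * o[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        o.append(c[t] + pred)
    return o
```

The analysis filter only looks at observed frames (finite impulse response), while the synthesis filter feeds its own output back (infinite impulse response), so applying one after the other returns the original trajectory.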
IMPLEMENTATION
Stability of the filter
• $H_d(z)$ must be stable
  - the AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ is an infinite impulse response filter [7]
  - unstable $H_d(z)$ → $\hat{o}_{1:T,d}$ will be infinitely large!
[Diagram: textual features → RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR filter part $H_d(z)$ → $\hat{o}_{1:T,d}$]
[Figure: F0 output from an unstable filter]
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd Ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
IMPLEMENTATION
Stability of the filter
• General requirement:
  - the poles of $H_d(z)$ are inside the unit circle [7]
• A simple trick (with a strong constraint):
  $$H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}$$
  - requirement: $\alpha_k \in (-1, 1)$
  - how to: $\alpha_k = \tanh(\hat{\alpha}_k)$, with $\hat{\alpha}_k \in \mathbb{R}$
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd Ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
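The tanh trick can be sketched as a mapping from unconstrained parameters to AR coefficients. The code below is an illustration only: `stable_ar_coefficients` is a hypothetical name, and the polynomial expansion is just one way to recover $a_k$ from the real poles $\alpha_k$.

```python
import math

def stable_ar_coefficients(alpha_hat):
    """Map unconstrained alpha_hat_k to real poles alpha_k = tanh(alpha_hat_k),
    then expand prod_k (1 - alpha_k z^-1) = 1 - sum_k a_k z^-k to get a_k."""
    poles = [math.tanh(v) for v in alpha_hat]
    poly = [1.0]  # coefficients of z^0, z^-1, z^-2, ...
    for p in poles:
        nxt = poly + [0.0]
        for i in range(len(poly)):
            nxt[i + 1] -= p * poly[i]  # multiply the polynomial by (1 - p z^-1)
        poly = nxt
    a = [-coef for coef in poly[1:]]
    return poles, a
```

Because every pole is produced by tanh, it lies on the real axis between -1 and 1 regardless of the trained value of $\hat{\alpha}_k$. The price is the strong constraint noted above: complex poles cannot be represented.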
DATA
Corpus
Name | Size | Usage
Blizzard Challenge 2011 corpus [8] | About 12,000 utterances, 16 hours | Validation set: 500 utterances; test set: 500 utterances; training set: the rest

Features
Type | Feature and description | Dimension
Input features $x_t$ | Phoneme sequence, stress, pitch accent ... (extracted using Flite [9]) | 382
Target features $o_t$ | Mel-generalized cepstrum coefficients (MGC) | 60
Target features $o_t$ | Continuous F0 trajectory + unvoiced/voiced flag | 1 + 1
Target features $o_t$ | Band aperiodicities (BAP) | 25

[8] King, S., & Karaiskos, V. (2011). The Blizzard Challenge 2011. Retrieved from http://festvox.org/blizzard/bc2011/summary_Blizzard2011.pdf
[9] HTS Working Group. (2014). The English TTS System "Flite+hts_engine." Retrieved from http://hts-engine.sourceforge.net/
SYSTEMS
System | Dynamic features $\Delta o_t, \Delta^2 o_t$ | Configuration
RNN | without | (1) feedforward tanh [512], (2) feedforward tanh [512], (3) Bi-LSTM [256], (4) Bi-LSTM [256], (5) feedforward linear output [259]
RMDN | without | Same as RNN + MDN: 1. GMM (2 mix) for MGC; 2. GMM (2 mix) for F0; 3. GMM (1 mix) for BAP; 4. Binary distribution for U/V
RNN+MLPG [5] | with | Same as RNN
AR-RMDN | without | Same as RMDN, with AR: 1. GMM (2 mix, 1st order AR) for MGC; 2. GMM (2 mix, 2nd order AR) for F0; 3. GMM (1 mix) for BAP; 4. Binary distribution for U/V

Number of AR parameters: 1*60 + 60 + 2*1 + 1 = 123
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP (Vol. 3, pp. 1315–1318).
EXPERIMENTS
Results
• Trajectory of the generated MGC and F0 (utterance BC2011_nancy_APDC2-166-00; x-axis: frame index; curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN)
[Figure: 1st dimension of MGC]
[Figure: 2nd dimension of MGC]
[Figure: 15th dimension of MGC]
[Figure: 30th dimension of MGC]
[Figure: 45th dimension of MGC]
[Figure: F0 in Hz (after unvoiced/voiced classification)]
EXPERIMENTS
Analysis
• Global variance (GV) [10] of the generated MGC and F0 trajectories
[Figure: GV of MGC versus order of MGC; curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]
[Figure: GV of F0 for NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]
[10] Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, 90(5), 816–824.
EXPERIMENTS
Analysis
• A larger dynamic range (higher GV)?
[Diagram: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $\hat{o}_{1:T,d}$, with dimension index $d \in [1, D]$]
[Figure: Frequency response of $H_d(z)$ for MGC]
EXPERIMENTS
Results
• Samples (RNN, RNN+MLPG, RMDN, AR-RMDN, natural; without and with formant enhancement):
  - formant enhancement was used in the subjective evaluation
Other samples: tonywangx.github.io
CONCLUSION
• Model:
  - AR-RMDN: a pair of AR filter + RMDN
• Results:
  - AR synthesis filter: a 'low-pass' filter
  - AR analysis filter: a 'high-pass' filter (see pp. 45-46)
  - generated trajectories with a larger dynamic range
  - better perceived quality
[Figure: Subjective evaluation (MUSHRA test), ratings on a 0-100 scale]
RECENT WORK
• Inverse AR filter with complex poles
  - using sigmoid and tanh functions
  - please see attached slides, pp. 58
[Diagram: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $\hat{o}_{1:T,d}$]
FUTURE WORK
• Sampling from the model
  - AR is still weak for temporal correlation?
[Figure: F0 (Hz) versus frame index for utterance BC2011_nancy_APDC2-166-00; curves: NAT, AR-RMDN, AR-RMDN (Sampling)]
Thank you for your attention
Q&A
• Toolkit, scripts, slides, and samples: tonywangx.github.io
ARGUMENT
• AR vs RNN as the output layer [1]
  - To simplify the equations
    § suppose the GMM has one mixture component with $\sigma_t = 1$ and $o_t \in \mathbb{R}$, so that $M_t = \{\mu_t\}$ (thus, RMDN is equivalent to RNN trained using MSE)
    § consider only two frames
    § $M_t = \{\mu_t\}$ is the output of the RMDN, not $o_t$
  - For example, the baseline (no RNN output layer, no AR) will be:
    $$p(o_{1:2}; M_{1:2}) = p(o_1; M_1)\, p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2)^2}{2}\right)$$
[1] Zen, H. and Sak, H. (2015). Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474.
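ARGUMENT
• AR vs RNN as the output layer
  - RNN output layer:
    § suppose the weight on the RNN output layer is $a$, no bias
    $$p(o_{1:2}; M_{1:2}) = p(o_1; M_1)\, p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}\right)$$
[Figure: $x_1, x_2$ → $M_1, M_2$ → $o_1, o_2$, with the weight $a$ linking $M_1$ to $M_2$]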
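ARGUMENT
• AR vs RNN as the output layer
  - AR:
    § suppose the weight of AR is $a$, no bias
    $$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}\right)$$
[Figure: $x_1, x_2$ → $M_1, M_2$ → $o_1, o_2$, with the weight $a$ linking $o_1$ to $o_2$]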
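ARGUMENT
• AR vs RNN as the output layer
  - Difference in probability density function (RNN's case):
    $$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}\right)$$
    $$= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2')^2}{2}\right) = p(o_1; M_1)\, p(o_2; M_2')$$
    where $\mu_2' = \mu_2 + a\mu_1$
    § this just changes the distribution's parameter
    § the two distributions are still independent
    § the RNN output layer builds the dependency between parameters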
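ARGUMENT
• AR vs RNN as the output layer
  - Difference in probability density function (AR's case):
    $$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}\right) = \frac{1}{2\pi} \exp\left(-\frac{1}{2} (o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$
    where
    $$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \mu_2 + a\mu_1]^\top, \quad \Sigma = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}, \quad \Sigma^{-1} = \begin{bmatrix} 1 + a^2 & -a \\ -a & 1 \end{bmatrix}, \quad |\Sigma| = 1$$
    § AR builds the dependency between random variables
    § the 2-dim Gaussian distribution has a non-diagonal covariance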
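ARGUMENT
• AR vs RNN as the output layer
  - Difference in probability density function (AR's case): let's compare $(o - \mu)^\top \Sigma^{-1} (o - \mu)$ and $(o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2$
    Let $m = [m_1, m_2]^\top = o - \mu$. Then
    $$m^\top \Sigma^{-1} m = \begin{bmatrix} m_1 & m_2 \end{bmatrix} \begin{bmatrix} 1 + a^2 & -a \\ -a & 1 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = m_1^2 + m_1^2 a^2 - 2 m_1 m_2 a + m_2^2 = m_1^2 + (m_1 a - m_2)^2$$
    As $m_1 = o_1 - \mu_1$ and $m_2 = o_2 - \mu_2 - \mu_1 a$,
    $$m^\top \Sigma^{-1} m = (o_1 - \mu_1)^2 + (o_1 a - \mu_1 a - o_2 + \mu_2 + \mu_1 a)^2 = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - o_1 a)^2$$
    so the two expressions are equal.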
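The identity derived above can also be checked numerically. The sketch below evaluates both exponents for arbitrary test values (all numbers are made up for illustration; the function names are hypothetical).

```python
def ar_exponent(o1, o2, mu1, mu2, a):
    """(o1 - mu1)^2 + (o2 - mu2 - a*o1)^2, the exponent of the factorised AR density."""
    return (o1 - mu1) ** 2 + (o2 - mu2 - a * o1) ** 2

def joint_gaussian_exponent(o1, o2, mu1, mu2, a):
    """(o - mu)^T Sigma^{-1} (o - mu) with mu = [mu1, mu2 + a*mu1] and
    Sigma^{-1} = [[1 + a^2, -a], [-a, 1]]."""
    m1 = o1 - mu1
    m2 = o2 - (mu2 + a * mu1)
    return (1.0 + a * a) * m1 * m1 - 2.0 * a * m1 * m2 + m2 * m2
```

For any inputs the two functions agree up to floating-point error, which is the numerical counterpart of the last line of the derivation.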
ARGUMENT
• AR vs RNN as the output layer
  - Difference in random sampling (RNN's case):
    1. calculate $M_1$ by the neural network, then we have $p(o_1; M_1)$
    2. calculate $M_2$ by the neural network and $M_1$, then we have $p(o_2; M_2)$
    3. draw samples $\hat{o}_1, \hat{o}_2$ from $p(o_1; M_1)$ and $p(o_2; M_2)$
    § the sample $\hat{o}_1$ doesn't influence $p(o_2; M_2)$
ARGUMENT
• AR vs RNN as the output layer
  - Difference in random sampling (AR's case):
    1. calculate $M_1$ by the neural network, then we have $p(o_1; M_1)$
    2. draw sample $\hat{o}_1$ from $p(o_1; M_1)$
    3. calculate $p(o_2; M_2)$ from $M_2$ (given by the neural network) and $\hat{o}_1$
    4. draw sample $\hat{o}_2$ from $p(o_2; M_2)$
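The two sampling procedures differ only in what the second mean depends on. A minimal sketch for the two-frame, unit-variance case used in these slides follows; the noise terms `eps1` and `eps2` are passed in explicitly (for example from `random.gauss(0.0, 1.0)`) so that both cases can be compared on the same noise, and the function names are hypothetical.

```python
def sample_rnn_output_layer(mu1, mu2, a, eps1, eps2):
    """RNN output layer: the mean of o2 depends on the parameter mu1, not on the sample o1."""
    o1 = mu1 + eps1
    o2 = (mu2 + a * mu1) + eps2
    return o1, o2

def sample_ar(mu1, mu2, a, eps1, eps2):
    """AR: the mean of o2 depends on the sample o1 drawn in the previous step."""
    o1 = mu1 + eps1
    o2 = (mu2 + a * o1) + eps2
    return o1, o2
```

With the same noise, the first sample is identical in both cases; only in the AR case does the random deviation of $\hat{o}_1$ carry over into $\hat{o}_2$.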
ARGUMENT
• AR vs RNN as the output layer
  - Difference in training using back-propagation:
    RNN's case:
    $$\frac{\partial(-\text{LogLikelihood})}{\partial a} = -(o_2 - \mu_2 - a\mu_1)\,\mu_1$$
    AR's case:
    $$\frac{\partial(-\text{LogLikelihood})}{\partial a} = -(o_2 - \mu_2 - a o_1)\,o_1$$
    The gradient is also different.
ARGUMENT
• Sampling from the model
  - AR is still weak for temporal correlation? YES!
  - Why? The AR synthesis filter is a feature transformation from $c_{1:T,d}$ to $o_{1:T,d}$:
    $$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix}^{-1} \cdot \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix} = \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix}$$
[Diagram: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ (feature transformation) → $\hat{o}_{1:T,d}$]
[Figure: F0 (Hz) versus frame index for utterance BC2011_nancy_APDC2-166-00; curves: NAT, AR-RMDN, AR-RMDN (Sampling)]
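ARGUMENT
• Sampling from the model
  - AR is still weak for temporal correlation? YES!
  - Why? Call the inverse matrix below the feature transformation matrix $H$:
    $$\underbrace{\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix}^{-1}}_{H} \cdot \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix} = \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix}$$
    1. Vector $c$ contains samples from i.i.d. Gaussian distributions
    2. Vector $o$ follows a Gaussian distribution with covariance matrix $\mathrm{Cov}(o) = H \, \mathrm{Cov}(c) \, H^\top$
    3. Unfortunately, the off-diagonal elements of $\mathrm{Cov}(o)$ decay too quickly (exponentially) because of $H$
    4. Similar results hold for high order AR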
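The exponential decay in point 3 can be made concrete for the 1st order case. The sketch below assumes $\mathrm{Cov}(c) = I$ (unit-variance, independent $c_t$) and a hypothetical coefficient value; `ar1_covariance` is an illustrative name.

```python
def ar1_covariance(T, a_d):
    """Cov(o) = H H^T for a 1st order AR filter, assuming Cov(c) = I.

    H is lower triangular with entries a_d**(i-j), the inverse of the analysis matrix.
    """
    H = [[a_d ** (i - j) if i >= j else 0.0 for j in range(T)] for i in range(T)]
    return [[sum(H[i][k] * H[j][k] for k in range(T)) for j in range(T)]
            for i in range(T)]
```

In this setting the covariance between frame $i$ and a later frame $j$ equals $a_d^{\,j-i}$ times the variance of frame $i$, so with for example $a_d = 0.5$ the dependence on a frame five steps away is already about 3% of the variance.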
ARGUMENT
• Simple experiment
  - Set 100 frames, let $c_t = \sin(w_1 t) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
  - Use matrix $H$ from the trained AR model, transform $c$ into $o$ (red line)
  - Define a transformation matrix based on $\sigma_{i,j} = \exp(-0.5 \cdot l \cdot (i - j)^2)$, transform $c$ into another vector (green line)
  - So: the transformation matrix of AR is too simple
[Figure: Covariance matrix based on AR (time index versus time index)]
[Figure: Covariance matrix based on GP (time index versus time index)]
[Figure: Amplitude versus time index for $c$, the signal generated from AR, and the signal generated from GP]
EXPERIMENTS
Objective results
System | MGC RMSE | F0 RMSE | F0 CORR
RNN | 1.000 | 39.827 | 0.768
RMDN | 0.994 | 39.797 | 0.772
RNN+MLPG | 0.988 | 39.252 | 0.775
AR-RMDN | 1.133 | 47.512 | 0.772
EXPERIMENTS
Analysis
• A larger dynamic range?
[Diagram: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $\hat{o}_{1:T,d}$]
[Figure: Modulation spectrum [11, 12] of the 30th dimension of MGC; x-axis: frequency index; y-axis: MS of MGC (dB); curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]
[Figure: Magnitude (dB) of $H_1(z)$, $H_{15}(z)$, $H_{30}(z)$, $H_{60}(z)$ versus frequency bin]
[11] Hermansky, H. (1997). The modulation spectrum in the automatic recognition of speech. In Proc. ASRU (pp. 140–147).
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP (pp. 290–294).
EXPERIMENTS
Analysis (the training process)
• AR model in training stage: $o_{1:T,d}$ → AR analysis filter $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ → $c_{1:T,d}$, modelled by the RMDN part as $\prod_{t=1}^{T} p(c_t; M_t)$
  - On the 1st order AR for MGC: $A_d(z) = 1 - \tanh(\alpha_d) z^{-1}$
[Figure: Magnitude (dB) of $A_1(z)$, $A_{15}(z)$, $A_{30}(z)$, $A_{60}(z)$ versus frequency bin]
[Figure: Feature trajectory of the 30th MGC (normalized), original data versus filtered data]
[Figure: Modulation spectrum of the 30th MGC, original data versus filtered data]
[Figure: Frequency response of $A(z)$ in dB versus normalized frequency]
  - On the 2nd order AR for F0
[Figure: Feature trajectory of interpolated F0 (normalized), original data versus filtered data]
[Figure: Modulation spectrum of interpolated F0, original data versus filtered data]
[Figure: Frequency response of $A(z)$ in dB versus normalized frequency]
EXPERIMENTS
Analysis (the training process)
• AR model in training stage: $o_{1:T,d}$ → AR analysis filter $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ → $c_{1:T,d}$, modelled by the RMDN part as $\prod_{t=1}^{T} p(c_t; M_t)$
  - 2nd order for MGC is also 'high-pass': $A_d(z) = (1 - a_1 z^{-1})(1 - a_2 z^{-1})$
[Figure: Values of $a_1$ and $a_2$ versus order of the MGC]
[Figure: Frequency response of $A_d(z)$: magnitude (dB) of $A_1(z)$, $A_{15}(z)$, $A_{30}(z)$, $A_{60}(z)$ versus frequency bin]
  - 'High-pass' AR analysis filter
[Figure: Frequency response of $A(z)$: magnitude (dB) versus frequency bin for orders 2 to 6]
Note: order 2 is re-trained.
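Why a positive AR coefficient gives a 'high-pass' analysis filter can be seen by evaluating $A(z)$ on the unit circle. The sketch below is illustrative: the coefficient value 0.5 in the usage note is a made-up example, not a trained value, and `analysis_magnitude_db` is a hypothetical name.

```python
import cmath
import math

def analysis_magnitude_db(a, omega):
    """Magnitude in dB of A(z) = 1 - sum_k a_k z^-k at z = exp(j * omega)."""
    z_inv = cmath.exp(-1j * omega)
    response = 1.0 - sum(a_k * z_inv ** (k + 1) for k, a_k in enumerate(a))
    return 20.0 * math.log10(abs(response))
```

For a 1st order filter with a hypothetical $a = 0.5$, the gain is $|1 - a|$ at $\omega = 0$ (attenuation) and $|1 + a|$ at $\omega = \pi$ (amplification). The synthesis filter $H(z) = 1 / A(z)$ has the negated dB response, hence its 'low-pass' behaviour.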
EXPERIMENTS II
Compare with post-filtering method [15]
[Diagram: AR-RMDN: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $\hat{o}_{1:T,d}$]
[Diagram: RMDN-MS: RMDN → $\hat{o}_{1:T,d}$ → modulation-spectrum-based post-filter → $\hat{o}'_{1:T,d}$]
ID | Description
RMDN | recurrent mixture density network
RMDN-MS | recurrent mixture density network with modulation-spectrum-based post-filter
AR-RMDN | proposed model
[15] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP (pp. 290–294).
EXPERIMENTS II
Results
[Figure: 30th dimension of MGC versus frame index for utterance BC2011_nancy_APDC2-166-00; curves: NAT, RMDN, RMDN-MS, AR-RMDN]
[Figure: Modulation spectrum of the 30th dimension of MGC; curves: NAT, RMDN, RMDN-MS, AR-RMDN]
• Only compare MGC (samples: RMDN, RMDN-MS, AR-RMDN, natural; without and with formant enhancement)
  - All systems used the same generated F0 from RMDN
• Compare MGC + F0 (samples: RMDN, RMDN-MS, AR-RMDN, natural; without and with formant enhancement)
EXPERIMENTS II
Wrap up
• Why? AR-RMDN avoids FFT/iFFT on acoustic trajectories
Note: RMDN-MS could be better [11].
[Figure: Rated quality (from 0 (min) to 100 (max)) for RMDN, RMDN-MS F, AR-RMDN F, RMDN-MS M, AR-RMDN M, RMDN-MS, AR-RMDN; groups labelled "only enhance F0" and "only enhance MGC"]
[11] Takamichi, S. (2016). Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis. Nara Institute of Science and Technology. Retrieved from http://hdl.handle.net/10061/10609
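INTERPRETATION
Signals and filters in AR-RMDN
• Signal
  - Define acoustic feature trajectories $o_{1:T,d} = [o_{1,d}, \cdots, o_{T,d}]^\top$, $d \in [1, D]$
    $$\begin{bmatrix} o_{1,1} & o_{2,1} & o_{3,1} & \cdots & o_{t,1} & \cdots & o_{T,1} \\ o_{1,2} & o_{2,2} & o_{3,2} & \cdots & o_{t,2} & \cdots & o_{T,2} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ o_{1,d} & o_{2,d} & o_{3,d} & \cdots & o_{t,d} & \cdots & o_{T,d} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ o_{1,D} & o_{2,D} & o_{3,D} & \cdots & o_{t,D} & \cdots & o_{T,D} \end{bmatrix}$$
    The matrix has $T$ columns (frames) and $D$ rows (dimensions): each column is a frame $o_t \in \mathbb{R}^D$, and the $d$-th row is the trajectory $o_{1:T,d}^\top$.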
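INTERPRETATION
Signals and filters in AR-RMDN
• Signal
  1. Consider a 1st order AR model:
     $$f(o_{t-1}) = a \odot o_{t-1} + b, \quad \text{where } a = [a_1, \cdots, a_D]^\top$$
  2. Look into the GMM with AR:
     $$p(o_t \mid o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^{m})^2}{2 (\sigma_{t,d}^{m})^2}\right)$$
     $$= \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - a_d o_{t-1,d} - \mu_{t,d}^{m} - b_d)^2}{2 (\sigma_{t,d}^{m})^2}\right)$$
  3. Define $c_{t,d} = o_{t,d} - a_d o_{t-1,d}$, $d \in [1, D]$
  4. Define $c_{1:T,d} = [c_{1,d}, \cdots, c_{T,d}]^\top$, $d \in [1, D]$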
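INTERPRETATION
Signal and filter in AR-RMDN
• Filters
  - Still consider a 1st order AR, where $c_{t,d} = o_{t,d} - a_d o_{t-1,d}$ and $d \in [1, D]$, $t \in [1, T]$
  - Then, $o_{1:T,d}$ and $c_{1:T,d}$ are related by
    $$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$
[Diagram: $o_{1:T,d}$ → $A_d(z) = 1 - a_d z^{-1}$ → $c_{1:T,d}$, a filter with finite impulse response; labels: excitation signal, filtered signal]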
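INTERPRETATION
Signal and filter in AR-RMDN
• Filters
  - In general: $f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$
  - From $o_{1:T,d}$ to $c_{1:T,d}$:
    $$\begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ -a_{1,d} & 1 & 0 & 0 & \cdots & 0 & 0 \\ -a_{2,d} & -a_{1,d} & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & \cdots & 0 & -a_{K,d} & \cdots & -a_{1,d} & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$
[Diagram: $o_{1:T,d}$ → $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ → $c_{1:T,d}$, a filter with finite impulse response; labels: excitation signal, filtered signal]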
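INTERPRETATION
Signal and filter in AR-RMDN
• Filters
  - In general: $f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$
  - From $c_{1:T,d}$ to $o_{1:T,d}$:
    $$\begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ -a_{1,d} & 1 & 0 & 0 & \cdots & 0 & 0 \\ -a_{2,d} & -a_{1,d} & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & \cdots & 0 & -a_{K,d} & \cdots & -a_{1,d} & 1 \end{bmatrix}^{-1} \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix} = \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix}$$
[Diagram: $c_{1:T,d}$ → $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $o_{1:T,d}$, a filter with infinite impulse response]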
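MODEL IMPLEMENTATION
Stability of the filter
• To make $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$, $d \in [1, D]$, stable?
  - Please find the technical report SLP-115 Kagawa on tonywangx.github.io

Constraints | Filters | Trainable parameters
no constraint | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}}$ | $a_k$
real poles inside the unit circle | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}$ | $\alpha_k$
one or no real pole and pairs of complex poles inside the unit circle (please read the report) | see below | $\hat{\alpha}_k, \hat{\beta}_k$

For the complex-pole case:
$$H(z) = \begin{cases} \dfrac{1}{\prod_{k=1}^{K/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})} & \text{if } K \text{ is even} \\[2ex] \dfrac{1}{(1 - \alpha_0 z^{-1}) \prod_{k=1}^{(K-1)/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})} & \text{if } K \text{ is odd} \end{cases}$$
$$\beta_k = -\mathrm{sigmoid}(\hat{\beta}_k), \quad \alpha_k = 2 \sqrt{\mathrm{sigmoid}(\hat{\beta}_k)} \, \tanh(\hat{\alpha}_k)$$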
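The sigmoid/tanh parametrisation of one second-order section can be checked numerically. This sketch covers a single factor $1 - \alpha_k z^{-1} - \beta_k z^{-2}$ only; the function names are hypothetical, and the pole formula is the ordinary quadratic formula applied to $z^2 - \alpha_k z - \beta_k = 0$.

```python
import cmath
import math

def second_order_section(alpha_hat, beta_hat):
    """beta = -sigmoid(beta_hat), alpha = 2 * sqrt(sigmoid(beta_hat)) * tanh(alpha_hat)."""
    s = 1.0 / (1.0 + math.exp(-beta_hat))
    beta = -s
    alpha = 2.0 * math.sqrt(s) * math.tanh(alpha_hat)
    return alpha, beta

def section_poles(alpha, beta):
    """Roots of z^2 - alpha * z - beta = 0, i.e. the poles of 1 / (1 - alpha z^-1 - beta z^-2)."""
    root = cmath.sqrt(alpha * alpha + 4.0 * beta)
    return (alpha + root) / 2.0, (alpha - root) / 2.0
```

Since $\alpha_k^2 + 4\beta_k = 4\,\mathrm{sigmoid}(\hat{\beta}_k)(\tanh^2(\hat{\alpha}_k) - 1) \le 0$, the two poles form a complex-conjugate pair whose squared magnitude is $\mathrm{sigmoid}(\hat{\beta}_k) < 1$, which places them inside the unit circle.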
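EXPERIMENTS II
High order AR filter
• Learned AR filter on F0 data

System | Constraints on AR filter $H_d(z)$ | Form of the AR filter
U6 | 6-order, unconstrained | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}}$
R6 | 6-order, stable, with real poles | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}$
C6 | 6-order, stable, with complex poles | $H(z) = \frac{1}{\prod_{k=1}^{K/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})}$

Note: constraints are used in both the training and the generation stage.
[Diagram: RMDN part, $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ → $\hat{o}_{1:T,d}$]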
EXPERIMENTS II
High order AR filter
• Learned AR filter on F0 data

System | Constraints on AR filter $H_d(z)$
U6 | 6-order, unconstrained
R6 | 6-order, stable, with real poles
C6 | 6-order, stable, with complex poles

[Figure: Magnitude (dB) versus frequency bin for U6, R6, C6]
[Figure: Pole plots (real axis versus imaginary axis) for U6, R6, and C6]
EXPERIMENTS II
High order AR filter
• Generated F0
  - U6 generated very large F0 values (unstable IIR filter)
  - U1 (1-order, unconstrained) is plotted instead
  - Vibration of the F0 in C6
[Figure: Magnitude (dB) versus frequency bin for U6, R6, C6]
[Figure: F0 (Hz) versus frame index for utterance BC2011_nancy_APDC2-166-00; curves: NAT, U1, R6, C6]