
An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis

Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI
National Institute of Informatics, Japan

2017-03-07

Contact: [email protected] (suggestions and discussion are welcome)

ICASSP 2017, New Orleans, USA

ABBREVIATION

GMM       Gaussian mixture model
NN        Neural network
RNN       Recurrent neural network
MDN       Mixture density network
RMDN      Recurrent mixture density network
AR        Autoregressive model
AR-RMDN   Autoregressive RMDN (proposed model)

CONTENTS

• Introduction
• Model definition, interpretation, implementation
• Experiments
• Conclusion

INTRODUCTION

Text-to-Speech
• Based on parametric speech synthesis
  • this work: on acoustic models based on neural networks

[Figure: pipeline text → text analyzer → acoustic model → vocoder]

Acoustic models based on neural networks
• RNN [1]
  • output of RNN: $\hat{o}_t$ (spectral features, F0, ...)

[Figure: RNN unrolled along the time (frame) axis; input textual features $x_1, \dots, x_5$; generated acoustic features $\hat{o}_1, \dots, \hat{o}_5$]

7/15/17

[1] Fan, Y., Qian, Y., Xie, F.-L., & Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. INTERSPEECH (pp. 1964–1968).

Acoustic models based on neural networks
• RMDN [2]
  • output of RMDN: parameter set $M_t$ for the distribution $p(o_t; M_t)$

[Figure: input textual features $x_1, \dots, x_5$ → generated parameter sets $M_1, \dots, M_5$ → distributions $p(o_t; M_t)$ over $o_1, \dots, o_5$]

[2] Schuster, M. (1999). Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks. In Proc. NIPS (pp. 589–595).

Acoustic models based on neural networks
• RMDN [3, 4]
  • RMDN using GMM:

$$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)$$

$$M_t = \{\omega_t^1, \cdots, \omega_t^M, \mu_t^1, \cdots, \mu_t^M, \Sigma_t^1, \cdots, \Sigma_t^M\}$$

  • generate $\hat{o}_t$ from the GMM using MLPG [5]

[Figure: parameter sets $M_1, \dots, M_5$ define a GMM for each of $o_1, \dots, o_5$]

[3] Bishop, C. M. (2004). Mixture Density Networks. Retrieved from http://eprints.aston.ac.uk/373/
[4] Schuster, M. (1999). Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks. In Proc. NIPS (pp. 589–595).
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP (pp. 1315–1318).
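The GMM likelihood of the baseline RMDN can be sketched numerically as follows. This is a minimal illustration, not the authors' toolkit code: it assumes diagonal covariance matrices, and the function names `gmm_log_likelihood` and `sequence_log_likelihood` are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log sum_m w_m N(o; mu_m, diag(var_m)) for a single frame.

    o: (D,), weights: (M,), means: (M, D), variances: (M, D).
    Diagonal covariances are an assumption of this sketch.
    """
    diff = o[None, :] - means
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum(diff ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    top = np.max(log_comp)  # log-sum-exp for numerical stability
    return top + np.log(np.sum(np.exp(log_comp - top)))

def sequence_log_likelihood(o_seq, weights, means, variances):
    """log p(o_{1:T}; M_{1:T}): frames are independent, so the logs add up."""
    return sum(
        gmm_log_likelihood(o_seq[t], weights[t], means[t], variances[t])
        for t in range(len(o_seq))
    )
```

In an RMDN the arrays `weights`, `means`, and `variances` would be produced frame by frame by the network from the textual features; here they are plain inputs.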


MOTIVATION

Conventional RMDN
• Independence assumption:

$$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t)$$

• Alternative?
  • the idea of the autoregressive HMM (for speech synthesis [6])
  • ...

[Figure: $x_1, \dots, x_5$ → $M_1, \dots, M_5$ → distributions $p(o_t; M_t)$ over $o_1, \dots, o_5$, with no links between frames]

[6] Shannon, M., Zen, H., & Byrne, W. (2013). Autoregressive models for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 21(3), 587–597.

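DEFINITION

Proposed model: AR + RMDN
• AR assumption:

$$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}; M_t)$$

[Figure: RMDN ($x_1, \dots, x_5$ → $M_1, \dots, M_5$) with additional links from previous outputs to the current output among $o_1, \dots, o_5$]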

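AR-RMDN using GMM
• baseline RMDN:

$$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)$$

• AR-RMDN:

$$p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \, \mathcal{N}(o_t; \mu_t^m + f(o_{t-K:t-1}), \Sigma_t^m)$$

$$f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$$

• AR parameters $\{a_1, \cdots, a_K, b\}$:
  1. time invariant (context-independent)
  2. joint training with RMDN using back-propagation

[Figure: links $a_1$ (from $o_{t-1}$) and $a_2$ (from $o_{t-2}$) between $o_1, \dots, o_5$]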
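The AR term only shifts the mean of every mixture component by $f(o_{t-K:t-1})$. A minimal sketch of that shift, assuming the history is stored most-recent-first; the function name `ar_shift` and the toy numbers are hypothetical:

```python
import numpy as np

def ar_shift(o_history, a, b):
    """f(o_{t-K:t-1}) = sum_k a_k * o_{t-k} + b, with element-wise products.

    o_history: (K, D), where o_history[k-1] holds o_{t-k}
    a:         (K, D) AR coefficients, b: (D,) bias
    """
    return np.sum(a * o_history, axis=0) + b

# Toy example with K = 2 and D = 3 (values chosen only for illustration).
a = np.array([[0.5, 0.5, 0.5],
              [0.1, 0.1, 0.1]])
b = np.zeros(3)
o_prev = np.array([[1.0, 2.0, 3.0],    # o_{t-1}
                   [4.0, 5.0, 6.0]])   # o_{t-2}
mu_t = np.zeros(3)                     # component mean given by the RMDN
shifted_mean = mu_t + ar_shift(o_prev, a, b)
```

Because `a` and `b` do not depend on `t`, the same shift function is applied at every frame, which matches the "time invariant" property listed above.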


BEFORE INTERPRETATION

A glimpse of the results
• smooth?
• larger dynamic range?

[Figure: generated trajectories of the 30th dimension of the mel-generalized cepstrum coefficients (MGC), utterance BC2011_nancy_APDC2-166-00; x-axis: frame index; y-axis: MGC (30th dim); curves: natural data, RNN, RMDN, AR-RMDN]

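INTERPRETATION

Signals and filters in AR-RMDN
• Simple case: 1st order AR

$$f(o_{t-1}) = a \odot o_{t-1} + b, \quad \text{where } a = [a_1, \cdots, a_D]^\top$$

  • for the GMM in AR-RMDN:

$$p(o_t \mid o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2 (\sigma_{t,d}^{m})^2}\right)$$

$$c_{t,d} = o_{t,d} - f(o_{t-1,d}) = o_{t,d} - a_d \, o_{t-1,d} - b_d$$

$d \in [1, D]$: dimension index ($D$: dimension of the feature vector); $t \in [1, T]$: frame index ($T$: number of frames)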

Signals and filters in AR-RMDN
• Simple case: 1st order AR (the bias $b_d$ is omitted here):

$$c_{t,d} = o_{t,d} - a_d \, o_{t-1,d}, \quad d \in [1, D]$$

  • if T = 2: $c_{1,d} = o_{1,d}$, $c_{2,d} = o_{2,d} - a_d \, o_{1,d}$

$$\begin{bmatrix} 1 & 0 \\ -a_d & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \end{bmatrix}$$

  • if T = 3:

$$\begin{bmatrix} 1 & 0 & 0 \\ -a_d & 1 & 0 \\ 0 & -a_d & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \end{bmatrix}$$

  • in general:

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$
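The matrix form can be checked against the frame-by-frame definition of $c_{t,d}$. The sketch below builds the lower-bidiagonal matrix for one feature dimension; `analysis_matrix`, the coefficient 0.8, and the toy trajectory are illustrative assumptions.

```python
import numpy as np

def analysis_matrix(a_d, T):
    """T x T matrix with ones on the diagonal and -a_d on the first subdiagonal."""
    A = np.eye(T)
    A[np.arange(1, T), np.arange(0, T - 1)] = -a_d
    return A

a_d = 0.8                                 # hypothetical AR coefficient
o = np.array([1.0, 2.0, 0.5, -1.0])       # toy trajectory o_{1:T,d}

c_matrix = analysis_matrix(a_d, len(o)) @ o

# Same result from c_t = o_t - a_d * o_{t-1}, with c_1 = o_1.
c_direct = o.copy()
c_direct[1:] -= a_d * o[:-1]
```

Both computations give the same vector, which is why the matrix product can be read as a filtering operation on the trajectory.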




  • a filtering process: with $A_d$ the lower-bidiagonal $T \times T$ matrix (ones on the diagonal, $-a_d$ on the first subdiagonal),

$$A_d \, o_{1:T,d} = c_{1:T,d}$$

    is equivalent to passing $o_{1:T,d}$ through a filter to obtain $c_{1:T,d}$:

    $o_{1:T,d}$ → [ $A_d(z)$ ] → $c_{1:T,d}$

  Note: $A_d(z) = 1 - a_d z^{-1}$ denotes the filter in the z-domain; $d$ is the dimension index.


  • a filtering process in the opposite direction: with $A_d$ the lower-bidiagonal $T \times T$ matrix (ones on the diagonal, $-a_d$ on the first subdiagonal),

$$A_d^{-1} \, c_{1:T,d} = o_{1:T,d}$$

    is equivalent to passing $c_{1:T,d}$ through the inverse filter to obtain $o_{1:T,d}$:

    $c_{1:T,d}$ → [ $H_d(z)$ ] → $o_{1:T,d}$

$$H_d(z) = \frac{1}{A_d(z)} = \frac{1}{1 - a_d z^{-1}}$$

  Note: $H_d(z)$ denotes the filter in the z-domain; $d$ is the dimension index.
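The equivalence between inverting the matrix and running the recursive filter $H_d(z)$ can be verified on a toy signal. This is a sketch under assumed values; `synthesis_filter` is a hypothetical helper written as a plain recursion rather than a library call.

```python
import numpy as np

def synthesis_filter(c, a_d):
    """Apply H_d(z) = 1 / (1 - a_d z^-1): o_t = c_t + a_d * o_{t-1}, with o_0 = 0."""
    o = np.zeros(len(c))
    prev = 0.0
    for t, c_t in enumerate(c):
        prev = c_t + a_d * prev
        o[t] = prev
    return o

a_d = 0.8                                  # hypothetical AR coefficient
c = np.array([1.0, 1.2, -1.1, -1.4])       # toy signal c_{1:T,d}
T = len(c)

# Lower-bidiagonal analysis matrix: ones on the diagonal, -a_d below it.
A = np.eye(T) - a_d * np.eye(T, k=-1)

o_recursive = synthesis_filter(c, a_d)
o_inverse = np.linalg.solve(A, c)          # same as inv(A) @ c
```

The recursion needs only the previous output, whereas the matrix inverse is dense; in practice the recursive form is the natural way to run the synthesis filter at generation time.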

Signals and filters in AR-RMDN
• With high order AR ($d \in [1, D]$ is the dimension index, $D$ the dimension of the feature vector):

$$A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}, \qquad H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$$

  • joint training of the AR filter and the RMDN using back-propagation:

    $o_{1:T,d}$ → [ AR analysis filter $A_d(z)$ ] → $c_{1:T,d}$, modeled by the RMDN as $\prod_{t=1}^{T} p(c_t; M_t)$ given the textual features

  • generation:

    textual features → RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ AR synthesis filter $H_d(z)$ ] → $\hat{o}_{1:T,d}$


IMPLEMENTATION

Stability of the filter
• $H_d(z)$ must be stable
  • the AR synthesis filter is an infinite impulse response filter [7]:

$$H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$$

    textual features → RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ AR synthesis filter $H_d(z)$ ] → $\hat{o}_{1:T,d}$

[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd Ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.

Stability of the filter
• $H_d(z)$ must be stable
  • unstable $H_d(z)$ → $\hat{o}_{1:T,d}$ will be infinitely large!

    RMDN part $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ AR filter part $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

[Figure: F0 output from an unstable filter]

Stability of the filter
• General requirement:
  • $H_d(z)$'s poles are inside the unit circle [7]
• A simple trick (with a strong constraint):
  • requirement:

$$H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}, \qquad \alpha_k \in (-1, 1)$$

  • how to:

$$\alpha_k = \tanh(\hat{\alpha}_k), \qquad \hat{\alpha}_k \in \mathbb{R}$$
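The tanh trick can be illustrated by mapping unconstrained values to real poles and expanding the product back into AR coefficients. This is a sketch, not the authors' implementation; `stable_ar_coefficients` and the input values are hypothetical, and `np.poly` is used only to expand the product of first-order factors.

```python
import numpy as np

def stable_ar_coefficients(alpha_hat):
    """Map unconstrained alpha_hat (K values) to AR coefficients a_1..a_K.

    Poles are alpha_k = tanh(alpha_hat_k), all real and inside (-1, 1).
    prod_k (1 - alpha_k z^-1) = 1 + c_1 z^-1 + ... + c_K z^-K, so a_k = -c_k.
    """
    poles = np.tanh(np.asarray(alpha_hat, dtype=float))
    denom = np.poly(poles)      # [1, c_1, ..., c_K]
    return -denom[1:], poles

a, poles = stable_ar_coefficients([3.0, -0.5])   # arbitrary unconstrained values
```

Whatever values the optimizer assigns to the unconstrained parameters, the resulting poles stay inside the unit circle, at the cost of excluding complex poles.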


DATA

Corpus

Name | Size | Usage
Blizzard Challenge 2011 corpus [8] | About 12,000 utterances, 16 hours | Validation set: 500 utterances; Test set: 500 utterances; Training set: the rest

Features

Feature | Description | Dimension
Input features $x_t$ | Phoneme sequence, stress, pitch accent, ... (extracted using Flite [9]) | 382
Target features $o_t$ | Mel-generalized cepstrum coefficients (MGC) | 60
Target features $o_t$ | Continuous F0 trajectory + unvoiced/voiced flag | 1 + 1
Target features $o_t$ | Band aperiodicities (BAP) | 25

[8] King, S., & Karaiskos, V. (2011). The Blizzard Challenge 2011. Retrieved from http://festvox.org/blizzard/bc2011/summary_Blizzard2011.pdf
[9] HTS Working Group. (2014). The English TTS System "Flite+hts_engine." Retrieved from http://hts-engine.sourceforge.net/

SYSTEMS

System | $\Delta o_t, \Delta^2 o_t$ | Configuration
RNN | without | (1) feedforward tanh [512], (2) feedforward tanh [512], (3) Bi-LSTM [256], (4) Bi-LSTM [256], (5) feedforward linear output [259]
RMDN | without | Same as RNN + MDN: 1. GMM (2 mix) for MGC; 2. GMM (2 mix) for F0; 3. GMM (1 mix) for BAP; 4. Binary distribution for U/V
RNN+MLPG [5] | with | Same as RNN
AR-RMDN | without | Same as RMDN, with AR: 1. GMM (2 mix, 1st order AR) for MGC; 2. GMM (2 mix, 2nd order AR) for F0; 3. GMM (1 mix) for BAP; 4. Binary distribution for U/V

Number of AR parameters: 1*60 + 60 + 2*1 + 1 = 123

[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP (Vol. 3, pp. 1315–1318).


EXPERIMENTS

Results
• Trajectory of the generated MGC:

[Figure: 1st and 2nd dimensions of MGC; curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]

[Figure: 15th dimension of MGC]

[Figure: 30th and 45th dimensions of MGC]

• Trajectory of the generated MGC and F0:

[Figure: F0 (Hz) after unvoiced/voiced classification, utterance BC2011_nancy_APDC2-166-00; x-axis: frame index (100 to 400); y-axis: F0 (Hz)]

Analysis
• Global variance [10] of the generated MGC and F0 trajectories

[Figure: GV of MGC; x-axis: order of MGC (1 to 60); y-axis: GV of MGC; curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]

[Figure: GV of F0 for NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]

[10] Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, 90(5), 816–824.
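Global variance is the per-dimension variance of a feature trajectory over the frames of an utterance. The sketch below uses that definition only; whether the plots above show it on a log scale or averaged over utterances is not stated in the slides, and the function name and toy data are hypothetical.

```python
import numpy as np

def global_variance(trajectory):
    """Per-dimension variance over time for one utterance; trajectory: (T, D)."""
    return np.var(trajectory, axis=0)

# Toy illustration: an over-smoothed copy with half the amplitude has a quarter of the GV.
frames = np.arange(200)
natural = np.stack([np.sin(0.1 * frames), np.cos(0.05 * frames)], axis=1)
over_smoothed = 0.5 * natural

gv_natural = global_variance(natural)
gv_smoothed = global_variance(over_smoothed)
```

A generated trajectory whose GV is well below that of natural speech has a reduced dynamic range, which is the effect the comparison above examines.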

Analysis
• A larger dynamic range (higher GV)?

    RMDN part $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$, where $d \in [1, D]$ is the dimension index

[Figure: frequency response of $H_d(z)$ for MGC]

Results
• Samples:

[Audio samples: RNN, RNN+MLPG, RMDN, AR-RMDN, Natural; without formant enhancement and with formant enhancement]

  • formant enhancement was used in the subjective evaluation

Other samples: tonywangx.github.io


CONCLUSION

• Model:
  • AR-RMDN: a pair of AR filter + RMDN
• Results:
  • AR synthesis filter: a 'low-pass' filter
  • AR analysis filter: a 'high-pass' filter (see pp. 45-46)
  • generated trajectories with a larger dynamic range
  • better perceived quality

[Figure: subjective evaluation (MUSHRA test); y-axis: rates (0-100)]

RECENT WORK

• Inverse AR filter with complex poles
  • using the sigmoid and tanh functions
  • please see the attached slides, p. 58

    RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

FUTURE WORK

• Sampling from the model
  • AR is still weak for temporal correlation?

[Figure: F0 (Hz) trajectories; curves: NAT, AR-RMDN, AR-RMDN (Sampling)]

Thank you for your attention

Q&A

• Toolkit, scripts, slides, and samples: tonywangx.github.io

ARGUMENT

• AR vs RNN as the output layer [1]
  • To simplify the equations:
    § suppose $o_t \in \mathbb{R}$ and the GMM has one mixture component with $\sigma_t = 1$, so $M_t = \{\mu_t\}$ (thus, RMDN is equivalent to RNN trained using MSE)
    § consider only two frames
  • For example, the baseline (no RNN output layer, no AR) will be:

$$p(o_{1:2}; M_{1:2}) = p(o_1; M_1)\, p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2)^2}{2}\right)$$

  Note: $M_t = \{\mu_t\}$ is the output of the RMDN, not $o_t$.

[Figure: $M_1$, $M_2$ above $o_1$, $o_2$, with no link between frames]

[1] Zen, H., & Sak, H. (2015). Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP (pp. 4470–4474).

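• AR vs RNN as the output layer
  • RNN output layer:
    § suppose the weight on the RNN output layer is $a$, no bias

$$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}\right)$$

[Figure: $x_1$, $x_2$ → $M_1$, $M_2$ → $o_1$, $o_2$, with a link of weight $a$ from $M_1$ to $M_2$]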

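• AR vs RNN as the output layer
  • AR:
    § suppose the weight of AR is $a$, no bias

$$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}\right)$$

[Figure: $x_1$, $x_2$ → $M_1$, $M_2$ → $o_1$, $o_2$, with a link of weight $a$ from $o_1$ to $o_2$]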

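• AR vs RNN as the output layer
  • Difference in the probability density function, RNN's case:

$$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}\right)$$

$$= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu'_2)^2}{2}\right) = p(o_1; M_1)\, p(o_2; M'_2)$$

    where $\mu'_2 = \mu_2 + a\mu_1$

    § the RNN output layer just changes the distribution's parameter
    § the distributions are still independent
    § the RNN output layer builds the dependency between parameters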

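  • Difference in the probability density function, AR's case:
    § AR builds the dependency between random variables
    § the 2-dim Gaussian distribution has a non-diagonal covariance

$$p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_1 - \mu_1)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}\right) = \frac{1}{2\pi} \exp\left(-\frac{1}{2} (o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$

$$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \mu_2 + a\mu_1]^\top$$

$$\Sigma = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}, \quad \Sigma^{-1} = \begin{bmatrix} 1 + a^2 & -a \\ -a & 1 \end{bmatrix}, \quad |\Sigma| = 1$$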


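Let's compare $(o - \mu)^\top \Sigma^{-1} (o - \mu)$ and $(o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2$ (AR's case).

Let $m = [m_1, m_2]^\top = o - \mu$. Then

$$m^\top \Sigma^{-1} m = \begin{bmatrix} m_1 & m_2 \end{bmatrix} \begin{bmatrix} 1 + a^2 & -a \\ -a & 1 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = m_1^2 + m_1^2 a^2 - 2 m_1 m_2 a + m_2^2 = m_1^2 + (m_1 a - m_2)^2$$

As $m_1 = o_1 - \mu_1$ and $m_2 = o_2 - \mu_2 - \mu_1 a$,

$$m^\top \Sigma^{-1} m = (o_1 - \mu_1)^2 + (o_1 a - \mu_1 a - o_2 + \mu_2 + \mu_1 a)^2 = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - o_1 a)^2$$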
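The identity derived above can also be checked numerically. The values of `a`, `mu1`, `mu2`, `o1`, and `o2` below are arbitrary numbers chosen for this sketch.

```python
import numpy as np

a, mu1, mu2 = 0.7, 0.3, -0.2      # arbitrary AR weight and means
o1, o2 = 1.1, 0.4                 # arbitrary observations

sigma = np.array([[1.0, a],
                  [a, 1.0 + a * a]])
mu = np.array([mu1, mu2 + a * mu1])
m = np.array([o1, o2]) - mu

# Exponent of the joint 2-dim Gaussian.
quad_joint = m @ np.linalg.inv(sigma) @ m
# Exponent of the product of the two conditional 1-dim Gaussians (AR's case).
quad_ar = (o1 - mu1) ** 2 + (o2 - mu2 - a * o1) ** 2

det_sigma = np.linalg.det(sigma)
```

The two quadratic forms agree and the determinant is 1, so the normalizing constants $1/(2\pi)$ match as well.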

• AR vs RNN as the output layer
  • Difference in random sampling, RNN's case:
    1. calculate $M_1$ by the neural network, then we have $p(o_1; M_1)$
    2. calculate $M_2$ by the neural network and $M_1$, then we have $p(o_2; M_2)$
    3. draw samples $\hat{o}_1, \hat{o}_2$ from $p(o_1; M_1)$ and $p(o_2; M_2)$
    § the sample $\hat{o}_1$ doesn't influence $p(o_2; M_2)$

• AR vs RNN as the output layer
  • Difference in random sampling, AR's case:
    1. calculate $M_1$ by the neural network, then we have $p(o_1; M_1)$
    2. draw a sample $\hat{o}_1$ from $p(o_1; M_1)$
    3. calculate $M_2$ by the neural network and $\hat{o}_1$
    4. draw a sample $\hat{o}_2$ from $p(o_2; M_2)$
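The two sampling procedures can be contrasted in the two-frame, unit-variance setting used in this argument. The sketch below draws many sample pairs under each scheme and estimates the covariance between the two frames; the weight 0.9, the zero means, and the sample count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, mu1, mu2, n = 0.9, 0.0, 0.0, 200_000   # hypothetical weight, means, sample count

# RNN output layer: the mean of frame 2 depends on mu1, not on the sample of frame 1.
o1_rnn = mu1 + rng.standard_normal(n)
o2_rnn = (mu2 + a * mu1) + rng.standard_normal(n)

# AR: the sample of frame 1 is fed back before frame 2 is drawn.
o1_ar = mu1 + rng.standard_normal(n)
o2_ar = (mu2 + a * o1_ar) + rng.standard_normal(n)

cov_rnn = np.cov(o1_rnn, o2_rnn)[0, 1]
cov_ar = np.cov(o1_ar, o2_ar)[0, 1]
```

The estimated covariance is close to zero in the RNN case and close to `a` in the AR case, in line with the off-diagonal entry of the covariance matrix derived for the AR model.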

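• AR vs RNN as the output layer
  • Difference in training using back-propagation:

    RNN's case:

$$\frac{\partial (-\text{LogLikelihood})}{\partial a} = -(o_2 - \mu_2 - a\mu_1)\,\mu_1$$

    AR's case:

$$\frac{\partial (-\text{LogLikelihood})}{\partial a} = -(o_2 - \mu_2 - a o_1)\,o_1$$

    The gradient is also different: it depends on the parameter $\mu_1$ in the RNN's case and on the observed data $o_1$ in the AR's case.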
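The two gradients can be compared with a finite-difference check in the same two-frame, unit-variance setting. The function names and the numeric values are hypothetical; constant terms of the negative log-likelihood are omitted because they do not depend on `a`.

```python
def nll_rnn(a, o1, o2, mu1, mu2):
    """Negative log-likelihood (constants dropped) with an RNN output layer."""
    return 0.5 * (o1 - mu1) ** 2 + 0.5 * (o2 - mu2 - a * mu1) ** 2

def nll_ar(a, o1, o2, mu1, mu2):
    """Negative log-likelihood (constants dropped) with the AR link."""
    return 0.5 * (o1 - mu1) ** 2 + 0.5 * (o2 - mu2 - a * o1) ** 2

def numeric_grad(f, a, *args, eps=1e-6):
    """Central finite difference of f with respect to a."""
    return (f(a + eps, *args) - f(a - eps, *args)) / (2 * eps)

a, o1, o2, mu1, mu2 = 0.4, 1.5, 0.2, 0.3, -0.1   # arbitrary values

grad_rnn_analytic = -(o2 - mu2 - a * mu1) * mu1
grad_ar_analytic = -(o2 - mu2 - a * o1) * o1
grad_rnn_numeric = numeric_grad(nll_rnn, a, o1, o2, mu1, mu2)
grad_ar_numeric = numeric_grad(nll_ar, a, o1, o2, mu1, mu2)
```

The analytic and numeric gradients agree for each model, while the two models give different gradients for the same data.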

• Sampling from the model
  • AR is still weak for temporal correlation? YES!
  • Why?

[Figure: F0 (Hz) trajectories; curves: NAT, AR-RMDN, AR-RMDN (Sampling)]

    RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ feature transformation $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

    For the 1st order case, with $A_d$ the lower-bidiagonal $T \times T$ matrix (ones on the diagonal, $-a_d$ on the first subdiagonal), the feature transformation from $c_{1:T,d}$ to $o_{1:T,d}$ is

$$A_d^{-1} \, c_{1:T,d} = o_{1:T,d}$$

    Call the inverse matrix $A_d^{-1}$ the feature transformation matrix $H$, so that $o_{1:T,d} = H \, c_{1:T,d}$.

    1. Vector $c$ contains samples from i.i.d. Gaussian distributions
    2. Vector $o$ follows a Gaussian distribution, with covariance matrix $\mathrm{Cov}(o) = H \, \mathrm{Cov}(c) \, H^\top$
    3. Unfortunately, the off-diagonal elements of $\mathrm{Cov}(o)$ decay too quickly (exponentially) because of $H$
    4. Similar results hold for high order AR
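The exponential decay of the off-diagonal elements can be seen directly for a 1st order AR model. This sketch assumes $\mathrm{Cov}(c) = I$ and a hypothetical coefficient of 0.9.

```python
import numpy as np

a_d, T = 0.9, 100                          # hypothetical coefficient, 100 frames
A = np.eye(T) - a_d * np.eye(T, k=-1)      # analysis matrix (ones, -a_d below the diagonal)
H = np.linalg.inv(A)                       # feature transformation matrix

cov_o = H @ H.T                            # Cov(o) = H Cov(c) H^T with Cov(c) = I

# Covariance between frame 50 and frames 50, 51, ..., 55, relative to the variance of frame 50.
ref = 50
lags = np.arange(6)
relative_cov = cov_o[ref, ref + lags] / cov_o[ref, ref]
```

`relative_cov` equals `a_d ** lags`, so even with a coefficient as large as 0.9 the covariance falls to roughly 0.59 after five frames and keeps shrinking geometrically.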

• Simple experiment
  • Set 100 frames, let $c_t = \sin(w_1 t) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
  • Use the matrix $H$ from the trained AR model to transform $c$ into $o$ (red line)
  • Define a transformation matrix based on $\Sigma_{i,j} = \exp(-0.5 \cdot l \cdot (i - j)^2)$ and transform $c$ into another vector (green line)
  • So: the transformation matrix of AR is too simple

[Figure: covariance matrix based on AR (time index vs time index); covariance matrix based on GP (time index vs time index); amplitude vs time index for $c$, the vector generated from AR, and the vector generated from GP]

EXPERIMENTS

Objective results

System | MGC RMSE | F0 RMSE | F0 CORR
RNN | 1.000 | 39.827 | 0.768
RMDN | 0.994 | 39.797 | 0.772
RNN+MLPG | 0.988 | 39.252 | 0.775
AR-RMDN | 1.133 | 47.512 | 0.772

Analysis
• A larger dynamic range?

    RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

[Figure: modulation spectrum [11, 12] of the 30th dimension of MGC; x-axis: frequency index (π/2048); y-axis: MS of MGC (dB); curves: NAT, RNN, RNN+MLPG, RMDN, AR-RMDN]

[Figure: frequency response of $H_d(z)$; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: $H_1(z)$, $H_{15}(z)$, $H_{30}(z)$, $H_{60}(z)$]

[11] Hermansky, H. (1997). The modulation spectrum in the automatic recognition of speech. In Proc. ASRU (pp. 140–147).
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP (pp. 290–294).

Analysis (the training process)
• AR model in the training stage:

    $o_{1:T,d}$ → [ AR analysis filter $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ ] → $c_{1:T,d}$, modeled by the RMDN part as $\prod_{t=1}^{T} p(c_t; M_t)$

  • On the 1st order AR for MGC:

$$A_d(z) = 1 - \tanh(\alpha_d) z^{-1}$$

[Figure: frequency response of $A_d(z)$; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: $A_1(z)$, $A_{15}(z)$, $A_{30}(z)$, $A_{60}(z)$]

[Figure: feature trajectory of the 30th MGC (normalized MGC 30 vs frame; original data and filtered data); modulation spectrum of the 30th MGC (vs frequency index, π/2048); frequency response of $A(z)$ (dB vs normalized frequency)]
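The 'high-pass' shape of a 1st order analysis filter, and the 'low-pass' shape of the matching synthesis filter, follow from evaluating $A_d(z)$ on the unit circle. The sketch below assumes a positive coefficient of 0.5, which is an illustrative value rather than a trained one.

```python
import numpy as np

def analysis_magnitude_db(a_d, n_bins=1024):
    """Magnitude of A_d(z) = 1 - a_d z^-1 at z = exp(j*omega), omega in [0, pi]."""
    omega = np.linspace(0.0, np.pi, n_bins)
    response = 1.0 - a_d * np.exp(-1j * omega)
    return omega, 20.0 * np.log10(np.abs(response))

omega, analysis_db = analysis_magnitude_db(0.5)   # hypothetical coefficient
synthesis_db = -analysis_db                       # H_d(z) = 1 / A_d(z)
```

With a positive coefficient the analysis filter attenuates low frequencies (about -6 dB at omega = 0 for 0.5) and boosts high ones, while the synthesis filter does the opposite.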

  • On the 2nd order AR for F0:

    $o_{1:T,d}$ → [ $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ ] → $c_{1:T,d}$, modeled as $\prod_{t=1}^{T} p(c_t; M_t)$

[Figure: feature trajectory of interpolated F0 (normalized F0 vs frame; original data and filtered data); modulation spectrum of interpolated F0 (original data and filtered data); frequency response of $A(z)$ (dB)]


  • 2nd order AR for MGC is also 'high-pass':

$$A_d(z) = (1 - a_1 z^{-1})(1 - a_2 z^{-1})$$

    $o_{1:T,d}$ → [ $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ ] → $c_{1:T,d}$, modeled as $\prod_{t=1}^{T} p(c_t; M_t)$

[Figure: values of $a_1$ and $a_2$ vs order of the MGC (0 to 60)]

[Figure: frequency response of $A_d(z)$; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: $A_1(z)$, $A_{15}(z)$, $A_{30}(z)$, $A_{60}(z)$]


  • 'High-pass' AR analysis filter:

    $o_{1:T,d}$ → [ $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$ ] → $c_{1:T,d}$, modeled as $\prod_{t=1}^{T} p(c_t; M_t)$

[Figure: frequency response of $A(z)$; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: order 2, order 3, order 4, order 5, order 6]

  Note: order 2 is re-trained.

EXPERIMENTS II

Compare with post-filtering method [15]

    AR-RMDN: RMDN part $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ AR synthesis filter $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

    RMDN-MS: RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{o}_{1:T,d}$ → [ modulation-spectrum-based post-filter ] → $\hat{o}'_{1:T,d}$

ID | Description
RMDN | recurrent mixture density network
RMDN-MS | recurrent mixture density network with modulation-spectrum-based post-filter
AR-RMDN | proposed model

[15] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP (pp. 290–294).


Results

[Figure: 30th dimension of MGC; curves: NAT, RMDN, RMDN-MS, AR-RMDN]

[Figure: modulation spectrum of the 30th dimension of MGC; x-axis: frequency index (π/2048); y-axis: MS of MGC (dB); curves: NAT, RMDN, RMDN-MS, AR-RMDN]

Only compare MGC
• All systems used the same generated F0 from RMDN

[Audio samples: RMDN, RMDN-MS, AR-RMDN, natural; without formant enhancement]

Compare MGC + F0

[Audio samples: RMDN, RMDN-MS, AR-RMDN, natural; without formant enhancement]

Wrap up

[Figure: rated quality, from 0 (min) to 100 (max), for RMDN; RMDN-MS F and AR-RMDN F (only enhance F0); RMDN-MS M and AR-RMDN M (only enhance MGC); RMDN-MS; AR-RMDN]

• Why? AR-RMDN avoids FFT/iFFT on acoustic trajectories

Note: RMDN-MS could be better [11]

[11] Takamichi, S. (2016). Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis. Nara Institute of Science and Technology. Retrieved from http://hdl.handle.net/10061/10609

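INTERPRETATION

Signals and filters in AR-RMDN
• Signal
  • Define acoustic feature trajectories $o_{1:T,d} = [o_{1,d}, \cdots, o_{T,d}]^\top$, $d \in [1, D]$, with $o_t \in \mathbb{R}^D$:

$$\begin{bmatrix} o_{1,1} & o_{2,1} & o_{3,1} & \cdots & o_{t,1} & \cdots & o_{T,1} \\ o_{1,2} & o_{2,2} & o_{3,2} & \cdots & o_{t,2} & \cdots & o_{T,2} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ o_{1,d} & o_{2,d} & o_{3,d} & \cdots & o_{t,d} & \cdots & o_{T,d} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ o_{1,D} & o_{2,D} & o_{3,D} & \cdots & o_{t,D} & \cdots & o_{T,D} \end{bmatrix}$$

    Columns: $T$ frames; rows: $D$ dimensions. Each column is $o_t$ and each row is $o_{1:T,d}^\top$.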

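Signals and filters in AR-RMDN
• Signal
  1. Consider a 1st order AR model:

$$f(o_{t-1}) = a \odot o_{t-1} + b, \quad \text{where } a = [a_1, \cdots, a_D]^\top$$

  2. Look into the GMM with AR:

$$p(o_t \mid o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2 (\sigma_{t,d}^{m})^2}\right)$$

$$= \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi (\sigma_{t,d}^{m})^2}} \exp\left(-\frac{(o_{t,d} - a_d \, o_{t-1,d} - \mu_{t,d}^m - b_d)^2}{2 (\sigma_{t,d}^{m})^2}\right)$$

  3. Define $c_{t,d} = o_{t,d} - a_d \, o_{t-1,d} - b_d$, $d \in [1, D]$
  4. Define $c_{1:T,d} = [c_{1,d}, \cdots, c_{T,d}]^\top$, $d \in [1, D]$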

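Signal and filter in AR-RMDN
• Filters
  • Still consider a 1st order AR, where $d \in [1, D]$ and $t \in [1, T]$
  • Then, $o_{1:T,d}$ and $c_{1:T,d}$ are related by

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -a_d & 1 & 0 & \cdots & 0 \\ 0 & -a_d & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -a_d & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$

    $o_{1:T,d}$ → [ $A_d(z) = 1 - a_d z^{-1}$, a filter with finite impulse response ] → $c_{1:T,d}$ (diagram labels: excitation signal, filtered signal)

  • For a general order $K$: $A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}$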

  • In general, with $f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$:
  • From $o_{1:T,d}$ to $c_{1:T,d}$ (filter with finite impulse response; diagram labels: excitation signal, filtered signal):

$$\begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ -a_{1,d} & 1 & 0 & 0 & \cdots & 0 & 0 \\ -a_{2,d} & -a_{1,d} & 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & -a_{K,d} & \cdots & -a_{1,d} & 1 \end{bmatrix} \begin{bmatrix} o_{1,d} \\ o_{2,d} \\ o_{3,d} \\ \vdots \\ o_{T,d} \end{bmatrix} = \begin{bmatrix} c_{1,d} \\ c_{2,d} \\ c_{3,d} \\ \vdots \\ c_{T,d} \end{bmatrix}$$

  • From $c_{1:T,d}$ to $o_{1:T,d}$ (filter with infinite impulse response): multiply $c_{1:T,d}$ by the inverse of the same banded matrix, which corresponds to

$$H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$$

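MODEL IMPLEMENTATION

Stability of the filter
• How to make $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$, $d \in [1, D]$, stable?
  • Please find the technical report SLP-115 Kagawa on tonywangx.github.io

Constraints | Filters | Trainable parameters
no constraint | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}}$ | $a_k$
real poles inside the unit circle | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}$ | $\alpha_k$
one or no real pole & pairs of complex poles inside the unit circle | see below (please read the report) | $\hat{\alpha}_k, \hat{\beta}_k$

$$H(z) = \begin{cases} \dfrac{1}{\prod_{k=1}^{K/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})} & \text{if } K \text{ is even} \\[2ex] \dfrac{1}{(1 - \alpha_0 z^{-1}) \prod_{k=1}^{(K-1)/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})} & \text{if } K \text{ is odd} \end{cases}$$

$$\beta_k = -\mathrm{sigmoid}(\hat{\beta}_k), \qquad \alpha_k = 2\sqrt{\mathrm{sigmoid}(\hat{\beta}_k)}\,\tanh(\hat{\alpha}_k)$$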
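The sigmoid and tanh parameterization of a second-order section can be checked by computing its poles. This is a sketch of the formulas above, not the implementation from the technical report; `second_order_section` and the test inputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def second_order_section(alpha_hat, beta_hat):
    """Map unconstrained (alpha_hat, beta_hat) to (alpha, beta) and the section's poles.

    The section is 1 / (1 - alpha z^-1 - beta z^-2); its poles are the roots
    of z^2 - alpha z - beta.
    """
    s = sigmoid(beta_hat)
    beta = -s
    alpha = 2.0 * np.sqrt(s) * np.tanh(alpha_hat)
    poles = np.roots([1.0, -alpha, -beta])
    return alpha, beta, poles

alpha, beta, poles = second_order_section(1.5, 0.0)   # arbitrary unconstrained values
pole_radius = np.abs(poles)
```

Since $\alpha_k^2 + 4\beta_k = 4\,\mathrm{sigmoid}(\hat{\beta}_k)(\tanh^2(\hat{\alpha}_k) - 1) < 0$, the two poles form a complex-conjugate pair with radius $\sqrt{\mathrm{sigmoid}(\hat{\beta}_k)} < 1$, so each section is stable for any unconstrained input.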

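EXPERIMENTS II

High order AR filter
• Learned AR filter on F0 data

    RMDN $\prod_{t=1}^{T} p(c_t; M_t)$ → $\hat{c}_{1:T,d}$ → [ $H_d(z) = \frac{1}{1 - \sum_{k=1}^{K} a_{k,d} z^{-k}}$ ] → $\hat{o}_{1:T,d}$

Systems | Constraints on AR filter $H_d(z)$ | Form of the AR filter
U6 | 6-order, unconstrained | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}}$
R6 | 6-order, stable, with real poles | $H(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \prod_{k=1}^{K} \frac{1}{1 - \alpha_k z^{-1}}$
C6 | 6-order, stable, with complex poles | $H(z) = \frac{1}{\prod_{k=1}^{K/2} (1 - \alpha_k z^{-1} - \beta_k z^{-2})}$

Note: the constraints are used in both the training and the generation stage.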

High order AR filter
• Learned AR filter on F0 data

[Figure: magnitude response of the learned filters; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: U6, R6, C6]

[Figure: pole plots (real axis vs imaginary axis) for U6, R6, and C6]

High order AR filter
• Generated F0
  • U6 generated very large F0 values (unstable IIR filter)
  • U1 (1-order unconstrained) is plotted instead
  • Vibration of the F0 in C6

[Figure: magnitude response; x-axis: frequency bin (π/1024); y-axis: magnitude (dB); curves: U6, R6, C6]

[Figure: generated F0 (Hz); curves: NAT, U1, R6, C6]
