
NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Xin Wang1, Shinji Takaki1, Junichi Yamagishi1∗

1 National Institute of Informatics, Japan
[email protected], [email protected], [email protected]

ABSTRACT

Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and the blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be directly trained using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and that the quality of its synthetic speech is close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.

Index Terms— speech synthesis, neural network, waveform modeling

1. INTRODUCTION

Text-to-speech (TTS) synthesis, a technology that converts texts into speech waveforms, has been advanced by using end-to-end architectures [1] and neural-network-based waveform models [2, 3, 4]. Among those waveform models, the WaveNet [2] directly models the distributions of waveform sampling points and has demonstrated outstanding performance. The vocoder version of WaveNet [5], which converts the acoustic features into the waveform, also outperformed other vocoders for pipeline TTS systems [6].

As an autoregressive (AR) model, the WaveNet is quite slow in waveform generation because it has to generate the waveform sampling points one by one. To improve the generation speed, the Parallel WaveNet [3] and the ClariNet [4] introduce a distilling method to transfer 'knowledge' from a teacher AR WaveNet to a student non-AR model that simultaneously generates all the waveform sampling points. However, the concatenation of two large models and the mix of distilling and other training criteria reduce the model's interpretability and raise the implementation cost.

In this paper, we propose a neural source-filter waveform model that converts acoustic features into speech waveforms. Inspired by classical speech modeling methods [7, 8], we used a source

∗ This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan, and by MEXT KAKENHI Grant Numbers (16H06302, 16K16096, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.

module to generate a sine-based excitation signal with a specified fundamental frequency (F0). We then used a dilated-convolution-based filter module to transform the sine-based excitation into the speech waveform. The proposed model was trained by minimizing spectral amplitude and phase distances, which can be efficiently implemented using discrete Fourier transforms (DFTs). Because the proposed model is a non-AR model, it generates waveforms much faster than the AR WaveNet. A large-scale listening test showed that the proposed model was close to the AR WaveNet in terms of the mean opinion score (MOS) on the quality of synthetic speech. An ablation test showed that both the sine-wave excitation and the spectral amplitude distance were crucial to the proposed model.

The model structure and training criteria are explained in Section 2, after which the experiments are described in Section 3. Finally, this paper is summarized and concluded in Section 4.

2. PROPOSED MODEL AND TRAINING CRITERIA

2.1. Model structure

The proposed model (shown in Figure 1) converts an input acoustic feature sequence c_{1:B} of length B into a speech waveform \hat{o}_{1:T} of length T. It includes a source module that generates an excitation signal e_{1:T}, a filter module that transforms e_{1:T} into the speech waveform, and a condition module that processes the acoustic features for the source and filter modules. None of the modules takes the previously generated waveform sample as input. The waveform is assumed to be real-valued, i.e., \hat{o}_t ∈ R, 0 < t ≤ T.

2.1.1. Condition module

The condition module takes as input the acoustic feature sequence c_{1:B} = {c_1, · · · , c_B}, where each c_b = [f_b, s_b^\top]^\top contains the F0 f_b and the spectral features s_b of the b-th speech frame. The condition module upsamples the F0 by duplicating f_b to every time step within the b-th frame and feeds the upsampled F0 sequence f_{1:T} to the source module. Meanwhile, it processes c_{1:B} using a bi-directional recurrent layer with long short-term memory (LSTM) units [9] and a convolutional (CONV) layer, after which the processed features are upsampled and sent to the filter module. The LSTM and CONV layers were used so that the condition module was similar to that of the WaveNet-vocoder [10] in the experiment. They can be replaced with a feedforward layer in practice.
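For illustration, the F0 upsampling by duplication can be sketched in a few lines of NumPy; the frame shift of 80 samples is an assumption that corresponds to the 5 ms frame shift at the 16 kHz sampling rate used in the experiments, and the function name is illustrative rather than part of the paper's implementation:

```python
import numpy as np

def upsample_f0(f0_frames, frame_shift=80):
    """Duplicate the frame-level F0 value f_b to every time step of the b-th frame,
    producing the per-sample sequence f_{1:T} that is fed to the source module."""
    return np.repeat(np.asarray(f0_frames, dtype=float), frame_shift)
```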

2.1.2. Source module

Given the input F0 sequence f_{1:T}, the source module generates a sine-based excitation signal e_{1:T} = {e_1, · · · , e_T}, where e_t ∈ R, ∀t ∈ {1, · · · , T}.



Fig. 1. Structure of the proposed model. B and T denote the lengths of the input feature sequence and the output waveform, respectively. FF, CONV, and Bi-LSTM denote feedforward, convolutional, and bi-directional recurrent layers, respectively. DFT denotes the discrete Fourier transform.

Suppose the F0 value of the t-th time step is f_t ∈ R_{≥0}, where f_t = 0 denotes an unvoiced time step. By treating f_t as the instantaneous frequency [11], a signal e^{<0>}_{1:T} can be generated as

e^{<0>}_t = \begin{cases} \alpha \sin\big(\sum_{k=1}^{t} 2\pi \frac{f_k}{N_s} + \phi\big) + n_t, & \text{if } f_t > 0 \\ \frac{1}{3\sigma} n_t, & \text{if } f_t = 0 \end{cases},    (1)

where n_t ∼ N(0, σ^2) is Gaussian noise, φ ∈ [−π, π] is a random initial phase, and N_s is equal to the waveform sampling rate.

Although we can directly set e_{1:T} = e^{<0>}_{1:T}, we tried two additional tricks. First, a 'best' phase φ^∗ for e^{<0>}_{1:T} can be determined in the training stage by maximizing the correlation between e^{<0>}_{1:T} and the natural waveform o_{1:T}; during generation, φ is randomly generated. The second trick is to generate harmonics by increasing f_k in Equation (1) and to use a feedforward (FF) layer to merge the harmonics and e^{<0>}_{1:T} into e_{1:T}. In this paper we use 7 harmonics and set σ = 0.003 and α = 0.1.
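As a minimal NumPy sketch of Equation (1), assuming the F0 contour has already been upsampled to one value per waveform sample (the function name and arguments are illustrative, not the paper's implementation):

```python
import numpy as np

def sine_excitation(f0, sr=16000, alpha=0.1, sigma=0.003, phi=None, rng=None):
    """Generate the sine-based excitation e^{<0>}_{1:T} of Equation (1).

    f0: per-sample F0 values in Hz (0 marks unvoiced steps); sr: sampling rate N_s.
    """
    f0 = np.asarray(f0, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    phi = rng.uniform(-np.pi, np.pi) if phi is None else phi
    noise = rng.normal(0.0, sigma, size=f0.shape)
    # cumulative instantaneous phase: sum_{k<=t} 2*pi*f_k / N_s
    phase = np.cumsum(2.0 * np.pi * f0 / sr)
    return np.where(f0 > 0,
                    alpha * np.sin(phase + phi) + noise,
                    noise / (3.0 * sigma))
```

The harmonics mentioned above can be obtained, for example, by calling the same function with 2·f0, 3·f0, and so on, and merging the resulting signals with the FF layer.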

2.1.3. Neural filter module

Given the excitation signal e_{1:T} from the source module and the processed acoustic features from the condition module, the filter module modulates e_{1:T} using multiple stages of dilated convolution and affine transformations similar to those in ClariNet [4]. For example, the first stage takes e_{1:T} and the processed acoustic features as input and produces two signals a_{1:T} and b_{1:T} using dilated convolution. The signal e_{1:T} is then transformed using e_{1:T} ⊙ b_{1:T} + a_{1:T}, where ⊙ denotes element-wise multiplication. The transformed signal is further processed in the following stages, and the output of the final stage is used as the generated waveform \hat{o}_{1:T}.

The dilated convolution blocks are similar to those in Parallel WaveNet [3]. Specifically, each block contains multiple dilated convolution layers with a filter size of 3. The outputs of the convolution layers are merged with the features from the condition module through gated activation functions [3]. After that, the merged features are transformed into a_{1:T} and b_{1:T}. To make sure that b_{1:T} is positive, it is parameterized as the exponential of the block's raw output.

Unlike ClariNet or Parallel WaveNet, the proposed model does not use the distilling method. It is unnecessary to compute the mean and standard deviation of the transformed signal. Neither is it necessary to form the convolution and transformation blocks as an inverse autoregressive flow [12].
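The sketch below illustrates one dilated-convolution-plus-affine-transformation stage in PyTorch. It is not the paper's CURRENNT implementation; the channel sizes, the residual connections inside the block, and the layer names are assumptions made for a compact, runnable example:

```python
import torch
import torch.nn as nn

class FilterStage(nn.Module):
    """One stage of the neural filter: dilated CONVs with gated activations,
    followed by the affine transformation e * b + a (illustrative sketch)."""
    def __init__(self, channels=64, n_layers=10, kernel_size=3, cond_dim=64):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, 1)
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, 2 * channels, kernel_size,
                      dilation=2 ** k, padding=(kernel_size - 1) // 2 * 2 ** k)
            for k in range(n_layers))
        self.cond_proj = nn.ModuleList(
            nn.Conv1d(cond_dim, 2 * channels, 1) for _ in range(n_layers))
        self.out_proj = nn.Conv1d(channels, 2, 1)   # produces a and log(b)

    def forward(self, e, cond):
        # e: (batch, 1, T) excitation; cond: (batch, cond_dim, T) processed features
        h = self.input_proj(e)
        for conv, cproj in zip(self.convs, self.cond_proj):
            gate, filt = (conv(h) + cproj(cond)).chunk(2, dim=1)
            h = h + torch.tanh(filt) * torch.sigmoid(gate)   # gated activation, residual
        a, log_b = self.out_proj(h).chunk(2, dim=1)
        return e * torch.exp(log_b) + a    # exp keeps b positive; affine transform of e
```

Stacking five such stages, with the output of the last stage taken as \hat{o}_{1:T}, would mirror the configuration used in Section 3.2.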

2.2. Training criteria in frequency domain

Because speech perception heavily relies on acoustic cues in the frequency domain, we define training criteria that minimize the spectral amplitude and phase distances, which can be implemented using DFTs. Given these criteria, the proposed model is trained using the stochastic gradient descent (SGD) method.

2.2.1. Spectral amplitude distance

Following the convention of short-time Fourier analysis, we conduct waveform framing and windowing before producing the spectrum of each frame. For the generated waveform \hat{o}_{1:T}, we use \hat{x}^{(n)} = [\hat{x}^{(n)}_1, · · · , \hat{x}^{(n)}_M]^\top ∈ R^M to denote the n-th waveform frame of length M. We then use \hat{y}^{(n)} = [\hat{y}^{(n)}_1, · · · , \hat{y}^{(n)}_K]^\top ∈ C^K to denote the spectrum of \hat{x}^{(n)} calculated using the K-point DFT. We similarly define x^{(n)} and y^{(n)} for the natural waveform o_{1:T}.

Suppose the waveform is sliced into N frames. Then the log spectral amplitude distance is defined as follows:

L_s = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[\log\frac{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}\Big]^2,    (2)

where Re(·) and Im(·) denote the real and imaginary parts of a complex number, respectively.
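A minimal NumPy sketch of Equation (2) for a single framing/DFT configuration is given below. The default values correspond to the first configuration in Table 1; the one-sided rfft spectrum is used for brevity, so symmetric bins are counted once rather than twice, and a small numerical floor is added, neither of which is part of the paper's definition:

```python
import numpy as np

def log_spectral_amplitude_distance(o_hat, o, frame_len=320, frame_shift=80, n_fft=512):
    """Log spectral amplitude distance L_s of Equation (2) for one configuration."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(o) - frame_len) // frame_shift

    def spectra(wav):
        frames = np.stack([wav[n * frame_shift: n * frame_shift + frame_len] * win
                           for n in range(n_frames)])
        return np.fft.rfft(frames, n=n_fft, axis=-1)   # zero-padded K-point DFT per frame

    y_hat, y = spectra(o_hat), spectra(o)
    eps = 1e-12                                        # numerical floor (not in the paper)
    log_ratio = np.log((np.abs(y_hat) ** 2 + eps) / (np.abs(y) ** 2 + eps))
    return 0.5 * np.sum(log_ratio ** 2)
```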

Although L_s is defined on complex-valued spectra, the gradient ∂L_s/∂\hat{o}_{1:T} ∈ R^T for SGD training can be efficiently calculated. Let us consider the n-th frame and compose a complex-valued vector g^{(n)} = ∂L_s/∂Re(\hat{y}^{(n)}) + j ∂L_s/∂Im(\hat{y}^{(n)}) ∈ C^K, where the k-th element is g^{(n)}_k = ∂L_s/∂Re(\hat{y}^{(n)}_k) + j ∂L_s/∂Im(\hat{y}^{(n)}_k) ∈ C. It can be shown that, as long as g^{(n)} is Hermitian symmetric, the inverse DFT of g^{(n)} is equal to ∂L_s/∂\hat{x}^{(n)} = [∂L_s/∂\hat{x}^{(n)}_1, · · · , ∂L_s/∂\hat{x}^{(n)}_M] ∈ R^M.¹ Using the same method, ∂L_s/∂\hat{x}^{(n)} for n ∈ {1, · · · , N} can be computed in parallel. Given {∂L_s/∂\hat{x}^{(1)}, · · · , ∂L_s/∂\hat{x}^{(N)}}, the value of each ∂L_s/∂\hat{o}_t in ∂L_s/∂\hat{o}_{1:T} can be easily accumulated since the relationship between \hat{o}_t and each \hat{x}^{(n)}_m has been determined by the framing and windowing operations.

¹ In the implementation using the fast Fourier transform, \hat{x}^{(n)} of length M is zero-padded to length K before the DFT. Accordingly, the inverse DFT of g^{(n)} also gives the gradients w.r.t. the zero-padded part, which should be discarded (see https://arxiv.org/abs/1810.11946).

Page 3: NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR … · NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS Xin Wang 1, Shinji Takaki , Junichi Yamagishi

Table 1. Three framing and DFT configurations for L_s and L_p
                   L_s1 & L_p1     L_s2 & L_p2    L_s3 & L_p3
DFT bins K         512             128            2048
Frame length M     320 (20 ms)     80 (5 ms)      1920 (120 ms)
Frame shift        80 (5 ms)       40 (2.5 ms)    640 (40 ms)
Note: all configurations use a Hann window.

In fact, ∂L_s/∂\hat{o}_{1:T} ∈ R^T can be calculated in the same manner no matter how we set the framing and DFT configuration, i.e., the values of N, M, and K. Furthermore, multiple L_s's with different configurations can be computed, and their gradients ∂L_s/∂\hat{o}_{1:T} can simply be summed up. For example, using the three L_s's in Table 1 was found to be essential to the proposed model (see Section 3.3).
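For instance, the total criterion L = L_{s1} + L_{s2} + L_{s3} used in the experiments can be sketched by summing the function above over the three Table 1 configurations; o_hat and o are assumed to be the generated and natural waveforms as NumPy arrays:

```python
# (frame length, frame shift, DFT bins) for Ls1, Ls2, and Ls3 in Table 1
configs = [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)]
L = sum(log_spectral_amplitude_distance(o_hat, o, m, s, k) for (m, s, k) in configs)
```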

The Hermitian symmetry of g^{(n)} is satisfied if L_s is carefully defined. For example, L_s can be the square error or Kullback-Leibler divergence (KLD) of the spectral amplitudes [13, 14]. The phase distance defined below also satisfies the requirement.

2.2.2. Phase distance

Given the spectra, a phase distance [15] is computed as

L_p = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\big|1 − \exp(j(\hat{\theta}^{(n)}_k − \theta^{(n)}_k))\big|^2 = \sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1 − \frac{\mathrm{Re}(\hat{y}^{(n)}_k)\mathrm{Re}(y^{(n)}_k) + \mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Im}(y^{(n)}_k)}{|\hat{y}^{(n)}_k|\,|y^{(n)}_k|}\Big],    (3)

where \hat{\theta}^{(n)}_k and \theta^{(n)}_k are the phases of \hat{y}^{(n)}_k and y^{(n)}_k, respectively. The gradient ∂L_p/∂\hat{o}_{1:T} can be computed by the same procedure as ∂L_s/∂\hat{o}_{1:T}. Multiple L_p's and L_s's with different framing and DFT configurations can be added up as the ultimate training criterion L. For the different L_∗'s, additional DFT/iDFT and framing/windowing blocks should be added to the model in Figure 1.

3. EXPERIMENTS

3.1. Corpus and features

This study used the same Japanese speech corpus and data division recipe as our previous study [16]. This corpus [17] contains neutral reading speech uttered by a female speaker. Both the validation and test sets contain 480 randomly selected utterances. Among the 48-hour training data, 9,000 randomly selected utterances (15 hours) were used as the training set in this study. For the ablation test in Section 3.3, the training set was further reduced to 3,000 utterances (5 hours). Acoustic features, including 60 dimensions of Mel-generalized cepstral coefficients (MGCs) [18] and 1 dimension of F0, were extracted from the 48 kHz waveforms at a frame shift of 5 ms using WORLD [19]. The natural waveforms were then downsampled to 16 kHz for model training and the listening test.

3.2. Comparison of proposed model, WaveNet, and WORLD

The first experiment compared the four models listed in Table 2.²

² The models were implemented using a modified CURRENNT toolkit [20] on a single Nvidia P100 GPU card. Codes, recipes, and generated speech can be found at https://nii-yamagishilab.github.io.

Table 2. Models for the comparison test in Section 3.2
WOR   WORLD vocoder
WAD   WaveNet-vocoder for 10-bit discrete µ-law waveform
WAC   WaveNet-vocoder using a Gaussian dist. for the raw waveform
NSF   Proposed model for the raw waveform

Fig. 2. MOS scores of natural speech, synthetic speech given natural acoustic features (blue), and synthetic speech given acoustic features generated from acoustic models (red), on a 1-to-5 quality (MOS) scale. White dots are mean values.

Table 3. Average number of waveform points generated in 1 s
WAD     NSF (memory-save mode)    NSF (normal mode)
0.19k   20k                       227k

The WAD model, which was trained in our previous study [6], contained a condition module, a post-processing module, and 40 dilated CONV blocks, where the k-th CONV block had a dilation size of 2^{modulo(k,10)}. WAC was similar to WAD but used a Gaussian distribution to model the raw waveform at the output layer [4].

The proposed NSF contained 5 stages of dilated CONV and transformation, each stage including 10 convolutional layers with a dilation size of 2^{modulo(k,10)} and a filter size of 3. Its condition module was the same as that of WAD and WAC. NSF was trained using L = L_{s1} + L_{s2} + L_{s3}, and the configuration of each L_{s∗} is listed in Table 1. The phase distance L_{p∗} was not used in this test.

Each model generated waveforms using natural and generated acoustic features, where the generated acoustic features were produced by the acoustic models in our previous study [6]. The generated and natural waveforms were then evaluated by paid native Japanese speakers. In each evaluation round, the evaluator listened to one speech waveform per screen and rated the speech quality on a 1-to-5 MOS scale. An evaluator could take at most 10 evaluation rounds and could replay the samples during evaluation. The waveforms in an evaluation round were for the same text and were played in a random order. Note that the waveforms generated from NSF and WAC were converted to 16-bit PCM format before evaluation.

A total of 245 evaluators conducted 1444 valid evaluation rounds in all, and the results are plotted in Figure 2. Two-sided Mann-Whitney tests showed that the difference between any pair of models is statistically significant (p < 0.01) except between NSF and WAC when the two models used generated acoustic features. In general, NSF outperformed WOR and WAC but performed slightly worse than WAD. The gap in the mean MOS scores between NSF and WAD was about 0.12, given either natural or generated acoustic features. A possible reason for this result may be the difference between the non-AR and AR model structures, which is similar to the difference between finite and infinite impulse response filters. WAC performed worse than WAD because some syllables were perceived to be trembling in pitch, which may be caused by the random sampling generation method. WAD alleviated this artifact by using a one-best generation method in voiced regions [6].

After the MOS test, we compared the waveform generation speed of NSF and WAD.


Fig. 3. Spectrogram (top) and instantaneous frequency (bottom) of the natural waveform and of waveforms generated by NSFs, L3 (NSFs w/o L_{s2} or L_{s3}), L4 (NSFs w/ L_{s1,2,3} and L_{p1,2,3}), S3 (NSFs with noise excitation), and N2 (NSFs with b_{1:T} = 0) from Table 4, given natural acoustic features in the test set (utterance AOZORAR 03372 T01). Figures are plotted using a 5 ms frame length and a 2.5 ms frame shift.

Table 4. Models for the ablation test (Section 3.3)
NSFs  NSF trained on 5-hour data
L1    NSFs without L_{s3} (i.e., L = L_{s1} + L_{s2})
L2    NSFs without L_{s2} (i.e., L = L_{s1} + L_{s3})
L3    NSFs without L_{s2} or L_{s3} (i.e., L = L_{s1})
L4    NSFs using L = L_{s1} + L_{s2} + L_{s3} + L_{p1} + L_{p2} + L_{p3}
L5    NSFs using KLD of spectral amplitudes
S1    NSFs without harmonics
S2    NSFs without harmonics or the 'best' phase φ^∗
S3    NSFs using only noise as excitation
N1    NSFs with b_{1:T} = 1 in the filter's transformation layers
N2    NSFs with b_{1:T} = 0 in the filter's transformation layers

The implementation of NSF has a normal generation mode and a memory-save mode. The normal mode allocates all the required GPU memory at once but cannot generate waveforms longer than 6 seconds because of insufficient memory space on a single GPU card. The memory-save mode can generate long waveforms because it releases and allocates the memory layer by layer, but the repeated memory operations are time-consuming.

We evaluated NSF using both modes on a smaller test set, in which each of the 80 generated test utterances was around 5 seconds long. As the results in Table 3 show, NSF is much faster than WAD. Note that WAD allocates and re-uses a small amount of GPU memory, which needs no repeated memory operations; WAD is slow mainly because of the AR generation process. Of course, both WAD and NSF can be improved if our toolkit is further optimized. In particular, if the memory operations can be sped up, the memory-save mode of NSF will be much faster.

3.3. Ablation test on proposed model

This experiment was an ablation test on NSF. Specifically, the 11 variants of NSF listed in Table 4 were trained using the 5-hour training set. For a fair comparison, NSF was re-trained using the 5-hour data, and this variant is referred to as NSFs. The speech waveforms were generated given the natural acoustic features and rated in 1444 evaluation rounds by the same group of evaluators as in Section 3.2. This test excluded the natural waveforms from the evaluation.

The results are plotted in Figure 4. The difference between NSFs and any other model except S2 was statistically significant (p < 0.01). The comparison among NSFs, L1, L2, and L3 shows that using the multiple L_s's listed in Table 1 is beneficial. For L3, which used only L_{s1}, the generated waveform points clustered around one peak in each frame, and the waveform suffered from a pulse-train noise.

Fig. 4. MOS scores of synthetic samples from NSFs and its variants given natural acoustic features. White dots are mean MOS scores.

This can be observed from the L3 panel of Figure 3, whose spectrogram in the high-frequency band shows clearer vertical stripes than those of the other models. Accordingly, this artifact can be alleviated by adding L_{s2}, with its frame length of 5 ms, for model training, which explains the improvement in L1. Using the phase distance (L4) did not improve the speech quality even though the value of the phase distance consistently decreased on both training and validation data.

The good result of S2 indicates that a single sine-wave excitation with a random initial phase also works. Without the sine-wave excitation, S3 generated waveforms that were intelligible but lacked a stable harmonic structure. N1 slightly outperformed NSFs, while N2 produced unstable harmonic structures. Because the transformation in N1 is equivalent to a skip-connection [21], the result indicates that the skip-connection may help the model training.

4. CONCLUSION

In this paper, we proposed a neural waveform model with separated source and filter modules. The source module produces a sine-wave excitation signal with a specified F0, and the filter module uses dilated convolution to transform the excitation into a waveform. Our experiment demonstrated that the sine-wave excitation was essential for generating waveforms with harmonic structures. We also found that multiple spectrum-based training criteria and the transformation in the filter module contributed to the performance of the proposed model. Compared with the AR WaveNet, the proposed model generated speech with similar quality at a much faster speed.

The proposed model can be improved in many aspects. For example, it is possible to simplify the dilated convolution blocks. It is also possible to try classical speech modeling methods, including glottal waveform excitations [22, 23] and two-band or multi-band approaches [24, 25], on the waveforms. When applying the model to convert linguistic features into the waveform, we observed an over-smoothing effect in the high-frequency band and will investigate this issue in future work.


5. REFERENCES

[1] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779-4783.

[2] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[3] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018, pp. 3918-3926.

[4] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.

[5] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118-1122.

[6] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018, pp. 4804-4808.

[7] Per Hedelin, "A tone oriented voice excited vocoder," in Proc. ICASSP, 1981, vol. 6, pp. 205-208.

[8] Robert McAulay and Thomas Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744-754, 1986.

[9] Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Ph.D. thesis, Technische Universitat Munchen, 2008.

[10] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigation of WaveNet for text-to-speech synthesis," Tech. Rep. 6, SIG Technical Reports, Feb. 2018.

[11] John R. Carson and Thornton C. Fry, "Variable frequency electric circuit theory with application to the theory of frequency-modulation," Bell System Technical Journal, vol. 16, no. 4, pp. 513-540, 1937.

[12] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in Proc. NIPS, 2016, pp. 4743-4751.

[13] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001, pp. 556-562.

[14] Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, "Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis," in Proc. Interspeech, 2017, pp. 1128-1132.

[15] Shinji Takaki, Toru Nakashika, Xin Wang, and Junichi Yamagishi, "STFT spectral loss for training a neural speech waveform model," in Proc. ICASSP, 2019 (accepted).

[16] Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, and Nobuyuki Nishizawa, "Investigating accuracy of pitch-accent annotations in neural-network-based speech synthesis and denoising effects," in Proc. Interspeech, 2018, pp. 37-41.

[17] Hisashi Kawai, Tomoki Toda, Jinfu Ni, Minoru Tsuzaki, and Keiichi Tokuda, "XIMERA: A new TTS from ATR based on corpus-based technologies," in Proc. SSW5, 2004, pp. 179-184.

[18] Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, and Satoshi Imai, "Mel-generalized cepstral analysis - a unified approach," in Proc. ICSLP, 1994, pp. 1043-1046.

[19] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.

[20] Felix Weninger, Johannes Bergmann, and Bjorn Schuller, "Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit," The Journal of Machine Learning Research, vol. 16, no. 1, pp. 547-551, 2015.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770-778.

[22] Gunnar Fant, Johan Liljencrants, and Qi-guang Lin, "A four-parameter model of glottal flow," STL-QPSR, vol. 4, no. 1985, pp. 1-13, 1985.

[23] Lauri Juvela, Bajibabu Bollepalli, Manu Airaksinen, and Paavo Alku, "High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network," in Proc. ICASSP, 2016, pp. 5120-5124.

[24] John Makhoul, R. Viswanathan, Richard Schwartz, and A. W. F. Huggins, "A mixed-source model for speech compression and synthesis," The Journal of the Acoustical Society of America, vol. 64, no. 6, pp. 1577-1581, 1978.

[25] D. W. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988.


A. FORWARD COMPUTATION

Figure 5 plots the two steps to derive the spectrum from the generated waveform \hat{o}_{1:T}. We use \hat{x}^{(n)} = [\hat{x}^{(n)}_1, · · · , \hat{x}^{(n)}_M]^\top ∈ R^M to denote the n-th waveform frame of length M. We then use \hat{y}^{(n)} = [\hat{y}^{(n)}_1, · · · , \hat{y}^{(n)}_K]^\top ∈ C^K to denote the spectrum of \hat{x}^{(n)} calculated using the K-point DFT, i.e., \hat{y}^{(n)} = DFT_K(\hat{x}^{(n)}). For the fast Fourier transform, K is set to a power of 2, and \hat{x}^{(n)} is zero-padded to length K before the DFT.

Fig. 5. Framing/windowing and DFT steps. T, M, and K denote the waveform length, frame length, and number of DFT bins, respectively.

A.1. Framing and windowing

The framing and windowing operation is also parallelized over \hat{x}^{(n)}_m using the for_each command in CUDA/Thrust. However, for explanation, let us use the matrix operation in Figure 6. In other words, we compute

\hat{x}^{(n)}_m = \sum_{t=1}^{T} \hat{o}_t \, w^{(n,m)}_t,    (4)

where w^{(n,m)}_t is the element in the [(n − 1) × M + m]-th row and the t-th column of the transformation matrix W.

Fig. 6. A matrix format of the framing/windowing operation with W ∈ R^{NM×T}, where w_{1:M} denote the coefficients of the Hann window.
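A small NumPy sketch of the matrix W in Equation (4) and Figure 6, written loop-based for clarity (the actual implementation parallelizes this with CUDA/Thrust instead of materializing W):

```python
import numpy as np

def framing_matrix(T, frame_len, frame_shift, n_frames):
    """Build W of Eq. (4): row (n*M + m) holds the Hann-window coefficient that
    maps the waveform sample o_t to the framed/windowed sample x^{(n)}_m."""
    win = np.hanning(frame_len)
    W = np.zeros((n_frames * frame_len, T))
    for n in range(n_frames):
        for m in range(frame_len):
            t = n * frame_shift + m
            if t < T:
                W[n * frame_len + m, t] = win[m]
    return W

# W @ o_hat then reproduces all framed/windowed samples, stacked frame by frame.
```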

A.2. DFT

Our implementation uses cuFFT (cufftExecR2C and cufftPlan1d)³ to compute {\hat{y}^{(1)}, · · · , \hat{y}^{(N)}} in parallel.

³ https://docs.nvidia.com/cuda/cufft/index.html


B. BACKWARD COMPUTATION

For back-propagation, we need to compute the gradient ∂L/∂\hat{o}_{1:T} ∈ R^T following the steps plotted in Figure 7.

Fig. 7. Steps to compute the gradients.

B.1. The 2nd step: from ∂L/∂\hat{x}^{(n)}_m to ∂L/∂\hat{o}_t

Suppose we have {∂L/∂\hat{x}^{(1)}, · · · , ∂L/∂\hat{x}^{(N)}}, where each ∂L/∂\hat{x}^{(n)} ∈ R^M and ∂L/∂\hat{x}^{(n)}_m ∈ R. Then, we can compute ∂L/∂\hat{o}_t on the basis of Equation (4) as

\frac{\partial L}{\partial \hat{o}_t} = \sum_{n=1}^{N}\sum_{m=1}^{M} \frac{\partial L}{\partial \hat{x}^{(n)}_m} w^{(n,m)}_t,    (5)

where w^{(n,m)}_t are the framing/windowing coefficients. This equation explains what we mean by saying that '∂L/∂\hat{o}_t can be easily accumulated since the relationship between \hat{o}_t and each \hat{x}^{(n)}_m has been determined by the framing and windowing operations'. Our implementation uses the CUDA/Thrust for_each command to launch T threads and compute ∂L/∂\hat{o}_t, t ∈ {1, · · · , T}, in parallel. Because \hat{o}_t is only used in a few frames and there is only one w^{(n,m)}_t ≠ 0 for each {n, t}, Equation (5) can be optimized as

\frac{\partial L}{\partial \hat{o}_t} = \sum_{n=N_{t,\min}}^{N_{t,\max}} \frac{\partial L}{\partial \hat{x}^{(n)}_{m_{t,n}}} w^{(n, m_{t,n})}_t,    (6)

where [N_{t,\min}, N_{t,\max}] is the range of frames in which \hat{o}_t appears, and m_{t,n} is the position of \hat{o}_t in the n-th frame.
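A NumPy sketch of this accumulation (Equations (5)-(6)), written as an overlap-add over frames rather than the per-sample thread launch used in the CUDA/Thrust implementation:

```python
import numpy as np

def accumulate_waveform_grad(grad_frames, T, frame_len, frame_shift):
    """Overlap-add the per-frame gradients dL/dx^{(n)} back to dL/do_t (Eqs. (5)-(6)).
    grad_frames: iterable of length-M arrays, one per frame."""
    win = np.hanning(frame_len)
    grad_o = np.zeros(T)
    for n, g in enumerate(grad_frames):
        start = n * frame_shift
        end = min(start + frame_len, T)
        grad_o[start:end] += g[:end - start] * win[:end - start]   # chain through w^{(n,m)}_t
    return grad_o
```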

B.2. The 1st step: compute ∂L/∂\hat{x}^{(n)}_m

Remember that \hat{y}^{(n)} = [\hat{y}^{(n)}_1, · · · , \hat{y}^{(n)}_K]^\top ∈ C^K is the K-point DFT spectrum of \hat{x}^{(n)} = [\hat{x}^{(n)}_1, · · · , \hat{x}^{(n)}_M]^\top ∈ R^M. Therefore we know that

\mathrm{Re}(\hat{y}^{(n)}_k) = \sum_{m=1}^{M} \hat{x}^{(n)}_m \cos\big(\tfrac{2\pi}{K}(k − 1)(m − 1)\big),    (7)

\mathrm{Im}(\hat{y}^{(n)}_k) = −\sum_{m=1}^{M} \hat{x}^{(n)}_m \sin\big(\tfrac{2\pi}{K}(k − 1)(m − 1)\big),    (8)

where k ∈ [1, K]. Note that, although the sum should be \sum_{m=1}^{K}, the summation over the zero-padded part \sum_{m=M+1}^{K} 0 \cdot \cos(\tfrac{2\pi}{K}(k − 1)(m − 1)) can be safely ignored.⁴

⁴ Although we can avoid zero-padding by setting K = M, in practice K is usually a power of 2 while the frame length M is not.


Suppose we compute a log spectral amplitude distance L over the N frames as

L = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[\log\frac{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}\Big]^2.    (9)

Because L, \hat{x}^{(n)}_m, Re(\hat{y}^{(n)}_k), and Im(\hat{y}^{(n)}_k) are real-valued numbers, we can compute the gradient ∂L/∂\hat{x}^{(n)}_m using the chain rule:

\frac{\partial L}{\partial \hat{x}^{(n)}_m} = \sum_{k=1}^{K} \frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} \frac{\partial \mathrm{Re}(\hat{y}^{(n)}_k)}{\partial \hat{x}^{(n)}_m} + \sum_{k=1}^{K} \frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} \frac{\partial \mathrm{Im}(\hat{y}^{(n)}_k)}{\partial \hat{x}^{(n)}_m}    (10)

= \sum_{k=1}^{K} \frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} \cos\big(\tfrac{2\pi}{K}(k − 1)(m − 1)\big) − \sum_{k=1}^{K} \frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} \sin\big(\tfrac{2\pi}{K}(k − 1)(m − 1)\big).    (11)

Once we have computed ∂L/∂\hat{x}^{(n)}_m for each m and n, we can use Equation (6) to compute the gradient ∂L/∂\hat{o}_t.
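As a quick numerical check of Equations (7)-(8), the explicit sums agree with a library FFT of the zero-padded real frame (0-based indices replace the 1-based (k−1)(m−1) terms; the sizes are arbitrary):

```python
import numpy as np

# Eqs. (7)-(8): explicit real/imaginary sums of a zero-padded real frame vs. np.fft.fft
M, K = 5, 8
x = np.random.randn(M)
k, m = np.arange(K)[:, None], np.arange(M)[None, :]
re = np.sum(x * np.cos(2 * np.pi / K * k * m), axis=1)
im = -np.sum(x * np.sin(2 * np.pi / K * k * m), axis=1)
y = np.fft.fft(x, n=K)                       # zero-pads x to length K
assert np.allclose(re, y.real) and np.allclose(im, y.imag)
```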

B.3. Implementation of the 1st step using inverse DFT

Because ∂L/∂Re(\hat{y}^{(n)}_k) and ∂L/∂Im(\hat{y}^{(n)}_k) are real numbers, we can directly implement Equation (11) using matrix multiplication. However, a more efficient way is to use the inverse DFT (iDFT). Suppose we have a complex-valued signal g = [g_1, g_2, · · · , g_K] ∈ C^K; we compute b = [b_1, · · · , b_K] as the K-point inverse DFT of g by⁵

b_m = \sum_{k=1}^{K} g_k e^{j\frac{2\pi}{K}(k−1)(m−1)}    (12)
    = \sum_{k=1}^{K} \big[\mathrm{Re}(g_k) + j\mathrm{Im}(g_k)\big]\big[\cos\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) + j\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big)\big]    (13)
    = \sum_{k=1}^{K} \mathrm{Re}(g_k)\cos\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) − \sum_{k=1}^{K} \mathrm{Im}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big)    (14)
    + j\Big[\sum_{k=1}^{K} \mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) + \sum_{k=1}^{K} \mathrm{Im}(g_k)\cos\big(\tfrac{2\pi}{K}(k−1)(m−1)\big)\Big].    (15)

For the first term in Line (15), we can write

\sum_{k=1}^{K} \mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big)    (16)
= \mathrm{Re}(g_1)\sin\big(\tfrac{2\pi}{K}(1−1)(m−1)\big) + \mathrm{Re}(g_{\frac{K}{2}+1})\sin\big(\tfrac{2\pi}{K}(\tfrac{K}{2}+1−1)(m−1)\big)    (17)
+ \sum_{k=2}^{\frac{K}{2}} \mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) + \sum_{k=\frac{K}{2}+2}^{K} \mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big)    (18)
= \sum_{k=2}^{\frac{K}{2}} \Big[\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) + \mathrm{Re}(g_{K+2−k})\sin\big(\tfrac{2\pi}{K}(K+2−k−1)(m−1)\big)\Big]    (19)
= \sum_{k=2}^{\frac{K}{2}} \big[\mathrm{Re}(g_k) − \mathrm{Re}(g_{K+2−k})\big]\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big).    (20)

Note that in Line (17), \mathrm{Re}(g_1)\sin(\tfrac{2\pi}{K}(1−1)(m−1)) = \mathrm{Re}(g_1)\sin(0) = 0, and \mathrm{Re}(g_{\frac{K}{2}+1})\sin(\tfrac{2\pi}{K}(\tfrac{K}{2}+1−1)(m−1)) = \mathrm{Re}(g_{\frac{K}{2}+1})\sin((m−1)\pi) = 0.

It is easy to see that Line (20) is equal to 0 if Re(g_k) = Re(g_{K+2−k}) for any k ∈ [2, \tfrac{K}{2}]. Similarly, it can be shown that \sum_{k=1}^{K} \mathrm{Im}(g_k)\cos(\tfrac{2\pi}{K}(k−1)(m−1)) = 0 if Im(g_k) = −Im(g_{K+2−k}) for k ∈ [2, \tfrac{K}{2}] and Im(g_1) = Im(g_{\frac{K}{2}+1}) = 0. When these two terms are equal to 0, the imaginary part in Line (15) will be 0, and b_m = \sum_{k=1}^{K} g_k e^{j\frac{2\pi}{K}(k−1)(m−1)} in Line (12) will be a real number.

⁵ cuFFT performs un-normalized FFTs, i.e., the scaling factor \frac{1}{K} is not used.


To summarize, if g satisfies the conditions below,

\mathrm{Re}(g_k) = \mathrm{Re}(g_{K+2−k}), \quad k ∈ [2, \tfrac{K}{2}],    (21)

\mathrm{Im}(g_k) = \begin{cases} −\mathrm{Im}(g_{K+2−k}), & k ∈ [2, \tfrac{K}{2}] \\ 0, & k ∈ \{1, \tfrac{K}{2}+1\} \end{cases},    (22)

the inverse DFT of g will be real-valued:

\sum_{k=1}^{K} g_k e^{j\frac{2\pi}{K}(k−1)(m−1)} = \sum_{k=1}^{K} \mathrm{Re}(g_k)\cos\big(\tfrac{2\pi}{K}(k−1)(m−1)\big) − \sum_{k=1}^{K} \mathrm{Im}(g_k)\sin\big(\tfrac{2\pi}{K}(k−1)(m−1)\big).    (23)

This is a basic concept in signal processing: the iDFT of a conjugate-symmetric (Hermitian)⁶ signal is a real-valued signal.

We can observe from Equations (23) and (11) that, if \big[\frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_1)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_1)}, · · · , \frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_K)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top is conjugate-symmetric, the gradient vector \frac{\partial L}{\partial \hat{x}^{(n)}} = \big[\frac{\partial L}{\partial \hat{x}^{(n)}_1}, · · · , \frac{\partial L}{\partial \hat{x}^{(n)}_M}\big]^\top can be computed using the iDFT:

\Big[\frac{\partial L}{\partial \hat{x}^{(n)}_1}, \frac{\partial L}{\partial \hat{x}^{(n)}_2}, · · · , \frac{\partial L}{\partial \hat{x}^{(n)}_M}, \frac{\partial L}{\partial \hat{x}^{(n)}_{M+1}}, · · · , \frac{\partial L}{\partial \hat{x}^{(n)}_K}\Big]^\top = \mathrm{iDFT}\Big(\Big[\frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_1)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_1)}, · · · , \frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_K)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_K)}\Big]^\top\Big).    (24)

Note that \{\frac{\partial L}{\partial \hat{x}^{(n)}_{M+1}}, · · · , \frac{\partial L}{\partial \hat{x}^{(n)}_K}\} are the gradients w.r.t. the zero-padded part, which will not be used and can safely be set to 0. The iDFT of a conjugate-symmetric signal can be executed using the cuFFT cufftExecC2R command. This is more efficient than other implementations of Equation (11) because

• there is no need to compute the imaginary part;
• there is no need to compute or allocate GPU memory for g_k with k ∈ [\tfrac{K}{2} + 2, K] because of the conjugate symmetry;
• the iDFT can be executed for all N frames in parallel.
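The same trick carries over to frameworks with a one-sided real FFT. The sketch below computes ∂L_s/∂\hat{x}^{(n)} for one frame with np.fft.irfft; unlike cuFFT, NumPy's irfft applies the 1/K normalization (cf. footnote 5), hence the factor K, and a small numerical floor is added that is not part of the paper's derivation:

```python
import numpy as np

def grad_Ls_wrt_frame(x_hat, x, K):
    """Gradient of the log spectral amplitude distance (Eq. (9)) w.r.t. one
    generated frame, computed via the inverse real FFT as in Eq. (24)."""
    M = len(x_hat)
    y_hat = np.fft.rfft(x_hat, n=K)          # one-sided half of the K-point spectrum
    y = np.fft.rfft(x, n=K)
    eps = 1e-12                              # numerical floor (not in the paper)
    log_ratio = np.log((np.abs(y_hat) ** 2 + eps) / (np.abs(y) ** 2 + eps))
    # g_k = dL/dRe(y_hat_k) + j*dL/dIm(y_hat_k) (Eqs. (25)-(26)); Hermitian by construction
    g = log_ratio * 2.0 * y_hat / (np.abs(y_hat) ** 2 + eps)
    # irfft assumes conjugate symmetry and divides by K; Eq. (24) uses the
    # un-normalized iDFT, hence the factor K
    grad = np.fft.irfft(g, n=K) * K
    return grad[:M]                          # discard gradients w.r.t. the zero-padded part
```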

B.4. Conjugate symmetry of the complex-valued gradient vector

The conjugate symmetry of \big[\frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_1)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_1)}, · · · , \frac{\partial L}{\partial \mathrm{Re}(\hat{y}^{(n)}_K)} + j\frac{\partial L}{\partial \mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top is satisfied if L is carefully chosen. Luckily, most of the common distance metrics can be used.

B.4.1. Log spectral amplitude distance

Given the log spectral amplitude distance L_s in Equation (9), we can compute

\frac{\partial L_s}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} = \Big[\log\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big) − \log\big(\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2\big)\Big]\frac{2\,\mathrm{Re}(\hat{y}^{(n)}_k)}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2},    (25)

\frac{\partial L_s}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} = \Big[\log\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big) − \log\big(\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2\big)\Big]\frac{2\,\mathrm{Im}(\hat{y}^{(n)}_k)}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}.    (26)

Because \hat{y}^{(n)} is the DFT spectrum of a real-valued signal, \hat{y}^{(n)} is conjugate-symmetric, and Re(\hat{y}^{(n)}_k) and Im(\hat{y}^{(n)}_k) satisfy the conditions in Equations (21) and (22), respectively. Because the amplitude Re(\hat{y}^{(n)}_k)^2 + Im(\hat{y}^{(n)}_k)^2 does not change the symmetry, \frac{\partial L_s}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} and \frac{\partial L_s}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} also satisfy the conditions in Equations (21) and (22), respectively, and \big[\frac{\partial L_s}{\partial \mathrm{Re}(\hat{y}^{(n)}_1)} + j\frac{\partial L_s}{\partial \mathrm{Im}(\hat{y}^{(n)}_1)}, · · · , \frac{\partial L_s}{\partial \mathrm{Re}(\hat{y}^{(n)}_K)} + j\frac{\partial L_s}{\partial \mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top is conjugate-symmetric.

⁶ Strictly speaking, this should be called circular conjugate symmetry.
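This conjugate symmetry can be verified numerically with a short, self-contained NumPy check (0-based indices; the frame and DFT sizes are arbitrary):

```python
import numpy as np

# Check that the complex gradient vector of Eqs. (25)-(26) satisfies Eqs. (21)-(22)
rng = np.random.default_rng(0)
M, K = 80, 128
y_hat = np.fft.fft(rng.standard_normal(M), n=K)    # spectrum of a real frame: Hermitian
y = np.fft.fft(rng.standard_normal(M), n=K)
log_ratio = np.log(np.abs(y_hat) ** 2 / np.abs(y) ** 2)
g = log_ratio * 2.0 * y_hat / np.abs(y_hat) ** 2   # dLs/dRe + j*dLs/dIm per bin
assert np.allclose(g[1:], np.conj(g[1:][::-1]))    # g_k = conj(g_{K+2-k})
assert abs(g[0].imag) < 1e-8 and abs(g[K // 2].imag) < 1e-8   # DC and Nyquist bins real
```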


B.4.2. Phase distance

Let \hat{\theta}^{(n)}_k and \theta^{(n)}_k be the phases of \hat{y}^{(n)}_k and y^{(n)}_k, respectively. Then, the phase distance is defined as

L_p = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\big|1 − \exp(j(\hat{\theta}^{(n)}_k − \theta^{(n)}_k))\big|^2    (27)
    = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\big|1 − \cos(\hat{\theta}^{(n)}_k − \theta^{(n)}_k) − j\sin(\hat{\theta}^{(n)}_k − \theta^{(n)}_k)\big|^2    (28)
    = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[\big(1 − \cos(\hat{\theta}^{(n)}_k − \theta^{(n)}_k)\big)^2 + \sin^2(\hat{\theta}^{(n)}_k − \theta^{(n)}_k)\Big]    (29)
    = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1 + \cos^2(\hat{\theta}^{(n)}_k − \theta^{(n)}_k) + \sin^2(\hat{\theta}^{(n)}_k − \theta^{(n)}_k) − 2\cos(\hat{\theta}^{(n)}_k − \theta^{(n)}_k)\Big]    (30)
    = \sum_{n=1}^{N}\sum_{k=1}^{K}\big(1 − \cos(\hat{\theta}^{(n)}_k − \theta^{(n)}_k)\big)    (31)
    = \sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1 − \big(\cos(\hat{\theta}^{(n)}_k)\cos(\theta^{(n)}_k) + \sin(\hat{\theta}^{(n)}_k)\sin(\theta^{(n)}_k)\big)\Big]    (32)
    = \sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1 − \frac{\mathrm{Re}(\hat{y}^{(n)}_k)\mathrm{Re}(y^{(n)}_k) + \mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Im}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}\sqrt{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}}\Big],    (33)

where

\cos(\hat{\theta}^{(n)}_k) = \frac{\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}}, \quad \cos(\theta^{(n)}_k) = \frac{\mathrm{Re}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}},    (34)

\sin(\hat{\theta}^{(n)}_k) = \frac{\mathrm{Im}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}}, \quad \sin(\theta^{(n)}_k) = \frac{\mathrm{Im}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}}.    (35)

Therefore, we get

\frac{\partial L_p}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} = −\cos(\theta^{(n)}_k)\frac{\partial \cos(\hat{\theta}^{(n)}_k)}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} − \sin(\theta^{(n)}_k)\frac{\partial \sin(\hat{\theta}^{(n)}_k)}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)}    (36)

= −\cos(\theta^{(n)}_k)\frac{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2} − \mathrm{Re}(\hat{y}^{(n)}_k)\frac{1}{2}\frac{2\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}}}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2} − \sin(\theta^{(n)}_k)\frac{−\mathrm{Im}(\hat{y}^{(n)}_k)\frac{1}{2}\frac{2\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}}}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}    (37)

= −\cos(\theta^{(n)}_k)\frac{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2 − \mathrm{Re}(\hat{y}^{(n)}_k)^2}{\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{\frac{3}{2}}} − \sin(\theta^{(n)}_k)\frac{−\mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)}{\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{\frac{3}{2}}}    (38)

= −\frac{\mathrm{Re}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)^2 − \mathrm{Re}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)^2 − \mathrm{Im}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)}{\big(\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2\big)^{\frac{1}{2}}\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{\frac{3}{2}}}    (39)

= −\frac{\mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k) − \mathrm{Im}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)}{\big(\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2\big)^{\frac{1}{2}}\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{\frac{3}{2}}}\,\mathrm{Im}(\hat{y}^{(n)}_k),    (40)

\frac{\partial L_p}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} = −\cos(\theta^{(n)}_k)\frac{\partial \cos(\hat{\theta}^{(n)}_k)}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} − \sin(\theta^{(n)}_k)\frac{\partial \sin(\hat{\theta}^{(n)}_k)}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)}    (41)

= −\frac{\mathrm{Im}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k) − \mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)}{\big(\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2\big)^{\frac{1}{2}}\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{\frac{3}{2}}}\,\mathrm{Re}(\hat{y}^{(n)}_k).    (42)

Because both \hat{y}^{(n)} and y^{(n)} are conjugate-symmetric, it can be easily observed that \frac{\partial L_p}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} and \frac{\partial L_p}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} satisfy the conditions in Equations (21) and (22), respectively.
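A self-contained finite-difference check of Equation (40) for a single spectral bin (the random values and tolerances are arbitrary choices made for the check):

```python
import numpy as np

# Finite-difference check of Eq. (40) on one spectral bin
rng = np.random.default_rng(1)
y_hat = complex(*rng.standard_normal(2))     # generated-spectrum bin
y = complex(*rng.standard_normal(2))         # natural-spectrum bin

def lp_bin(re, im):
    # contribution of this bin to L_p, Eq. (33)
    return 1.0 - (re * y.real + im * y.imag) / (np.hypot(re, im) * abs(y))

eps = 1e-6
numerical = (lp_bin(y_hat.real + eps, y_hat.imag)
             - lp_bin(y_hat.real - eps, y_hat.imag)) / (2 * eps)
analytic = -(y.real * y_hat.imag - y.imag * y_hat.real) * y_hat.imag / (abs(y) * abs(y_hat) ** 3)
assert np.isclose(numerical, analytic, rtol=1e-5, atol=1e-8)
```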


C. MULTIPLE DISTANCE METRICS

Different distance metrics can be merged easily. For example, we can define

L = L_{s1} + · · · + L_{sS} + L_{p1} + · · · + L_{pP},    (43)

where each L_{s∗} ∈ R and L_{p∗} ∈ R may use a different number of DFT bins, frame length, or frame shift. Although the dimension of the gradient vector ∂L_∗/∂\hat{x}^{(n)} may differ across configurations, the gradient ∂L_∗/∂\hat{o}_{1:T} ∈ R^T will always be a real-valued vector of dimension T after de-framing/windowing. The gradients can then simply be merged together as

\frac{\partial L}{\partial \hat{o}_{1:T}} = \frac{\partial L_{s1}}{\partial \hat{o}_{1:T}} + · · · + \frac{\partial L_{sS}}{\partial \hat{o}_{1:T}} + \frac{\partial L_{p1}}{\partial \hat{o}_{1:T}} + · · · + \frac{\partial L_{pP}}{\partial \hat{o}_{1:T}}.    (44)

Fig. 8. Using multiple distances {L_1, · · · , L_L}: each configuration has its own framing/windowing, DFT, iDFT, and de-framing/windowing blocks, and the per-configuration gradients ∂L_l/∂\hat{o}_{1:T} are summed.
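As an end-to-end illustration, the sketch below composes the per-frame gradient of the earlier B.3 sketch (grad_Ls_wrt_frame) with the overlap-add of Section B.1 for one configuration and then sums the three Table 1 configurations as in Equation (44); o_hat and o are assumed to be the generated and natural waveforms as NumPy arrays, and the boundary handling is simplified:

```python
import numpy as np

def waveform_grad_for_config(o_hat, o, frame_len, frame_shift, n_fft):
    """dLs/do_hat for one framing/DFT configuration: per-frame gradients via the
    iDFT trick (Eq. (24)), chained through the Hann window and overlap-added (Eq. (5))."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(o) - frame_len) // frame_shift
    grad_o = np.zeros_like(o_hat)
    for n in range(n_frames):
        s = n * frame_shift
        gx = grad_Ls_wrt_frame(o_hat[s:s + frame_len] * win, o[s:s + frame_len] * win, n_fft)
        grad_o[s:s + frame_len] += gx * win
    return grad_o

# Eq. (44): gradients from the different configurations are simply summed.
grad_total = sum(waveform_grad_for_config(o_hat, o, m, s, k)
                 for (m, s, k) in [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)])
```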