Comparison of Single Channel Blind Dereverberation Methods for Speech Signals

Deha Deniz Türköz - MSc Thesis
Thesis Supervisor: Hakan Erdoğan

Sabancı Üniversitesi
27.06.2016

OUTLINE

1) Introduction

2) Background
a) Features of speech
b) Reverberation model
c) Room impulse response (RIR)
d) Non-negative matrix factorization (NMF)

3) Blind-Dereverberation Methods
a) Delayed linear prediction (DLP)
b) Weighted prediction error (G-WPE)
c) Laplacian based weighted prediction error (L-WPE)
d) NMF based spectral modeling (NMF+N-CTF)
e) Sparsity penalized weighted least squares method (SPWLS)

4) Experiments and Comparisons

5) Discussion and Conclusion

1. Introduction

Reverberation:

● is an effect on speech signals caused by sound reflecting off walls and other surfaces,

● decreases speech intelligibility,

● degrades applications such as ASR, hands-free teleconferencing,

● can be modeled with an LTI filter.

● If the filter h is known, then the clean signal s can be recovered with a simple deconvolution operation, called dereverberation.

● In most cases h and s are unknown, and the reverberated signal x is the only known quantity. Estimating h and s from x alone is called the "blind dereverberation problem", which is the main subject of this work.
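A minimal sketch of the non-blind case described above, assuming a noise-free FIR model x = h * s with h[0] != 0. The function names and the toy values of s and h are illustrative only and are not taken from the thesis.

```python
def convolve(h, s):
    """Linear convolution x[t] = sum_k h[k] * s[t - k]."""
    x = [0.0] * (len(h) + len(s) - 1)
    for i, hi in enumerate(h):
        for j, sj in enumerate(s):
            x[i + j] += hi * sj
    return x


def deconvolve(x, h, n_out):
    """Recover the first n_out samples of s from x = h * s.

    Solves the convolution sum sample by sample (requires h[0] != 0).
    """
    s = []
    for t in range(n_out):
        acc = x[t]
        for k in range(1, min(t, len(h) - 1) + 1):
            acc -= h[k] * s[t - k]
        s.append(acc / h[0])
    return s


s = [1.0, -0.5, 0.25, 0.0, 0.8]   # toy "clean" signal (hypothetical values)
h = [1.0, 0.6, 0.3]               # toy room filter (hypothetical values)
x = convolve(h, s)                # reverberated observation
s_hat = deconvolve(x, h, len(s))  # recovery when h is known
```

In the blind setting h is not available, so this direct inversion cannot be applied; the methods compared in this work estimate the needed quantities from x alone.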

The aim of this work is to compare the existing blind-dereverberation methods

○ DLP: delayed linear prediction,
○ G-WPE: Gaussian based weighted prediction error,
○ L-WPE: Laplacian based weighted prediction error,
○ NMF+N-CTF: NMF based spectral-temporal modeling,

and to offer a new algorithm called

○ SPWLS: sparsity penalized weighted least squares.

2. Background

a. Features of Speech

● Speech is a signal created by the human vocal system.
● The input of the vocal tract is called the glottal signal, which is modeled as:
○ white noise (unvoiced sounds),
○ an impulse train (voiced sounds).
● The vocal tract can be modeled as an all-pole filter, which means that speech production is a simple LTI filtering operation applied to the glottal signal.

● Speech signals are non-stationary.
● General approach: divide the signal into short time segments and assume each of them is stationary.
● To analyze speech: short-time Fourier transform (STFT).
● The STFT divides the speech signal into overlapping segments, called frames, using a window function, and calculates the DFT of each frame.

Formulation of the STFT:

$$X(n,k) = \sum_{m=0}^{N-1} x[m + nL]\, W[m]\, e^{-j 2\pi k m / N}$$

L: frame shift,

N: frame size,

X(n,k): discrete STFT coefficients of the speech signal x[m] at frame n and frequency bin k,

W[m]: Hamming window.

● The STFT of a signal can be interpreted as a matrix whose columns hold the complex DFT coefficients of the frames.

● To visualize how the signal's frequency content changes over time: spectrogram.
● The spectrogram S(n,k) uses the power spectral density (PSD) values of the STFT matrix X(n,k) as intensity values in a 2D image:

$$S(n,k) = |X(n,k)|^2$$
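The STFT and spectrogram described above can be sketched as follows. This is an illustrative pure-Python version with a direct DFT; the function names, the symmetric Hamming window formula and the toy sinusoid are assumptions, not the thesis implementation. Frames are stored as rows here, which is the transpose of the column layout mentioned above.

```python
import cmath
import math


def hamming(N):
    """Symmetric Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (N - 1)) for m in range(N)]


def stft(x, N, L):
    """Return X[n][k]: DFT of the n-th windowed frame (frame size N, shift L)."""
    w = hamming(N)
    n_frames = 1 + (len(x) - N) // L
    X = []
    for n in range(n_frames):
        frame = [x[n * L + m] * w[m] for m in range(N)]
        X.append([
            sum(frame[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)
        ])
    return X


def spectrogram(X):
    """Power spectrogram S(n,k) = |X(n,k)|^2."""
    return [[abs(c) ** 2 for c in row] for row in X]


# Toy input: a sinusoid whose frequency falls exactly on DFT bin 4 for N = 32.
N, L = 32, 8
x = [math.sin(2 * math.pi * 4 * t / N) for t in range(128)]
S = spectrogram(stft(x, N, L))
peak_bin = max(range(N // 2), key=lambda k: S[0][k])  # strongest bin in frame 0
```

A direct DFT costs O(N^2) per frame; a practical implementation would use an FFT, but the resulting coefficients are the same.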


2. Background

b. Reverberation Model

● A reverberant environment can be modeled as an LTI filter, which is called the room impulse response (RIR).
● Reverberation model:

$$x(t) = h(t) * s(t)$$

where * denotes convolution.

h(t): RIR, unknown

s(t): clean signal (anechoic signal), unknown

x(t): reverberated signal (echoed signal), known

Figure: reverberation effect on the spectrogram.

c. Room Impulse Response (RIR)

The length of the RIR depends on:

● room size,
● room temperature,
● room shape,
● the microphone's distance to the speech source,
● the absorption of sound in the room.

RT60: the time required for the reflected signal to drop by 60 dB.

● The RIR has the characteristics of an FIR filter.

Usually the RIR is divided into two parts:

1. Early reverberation

2. Late reverberation: the most detrimental part of the echo

$$x(t) = \sum_{\tau=0}^{L_h-1} h(\tau)\, s(t-\tau) + n(t) = d(t) + r(t) + n(t)$$

n(t): noise

d(t): early echo + clean signal (desired signal)

r(t): late echo

Lh: the length of the RIR

h(t): RIR (early echo + late echo)

Then, the early and late reverberations are

$$d(t) = \sum_{\tau=0}^{D-1} h(\tau)\, s(t-\tau), \qquad r(t) = \sum_{\tau=D}^{L_h-1} h(\tau)\, s(t-\tau)$$

D: the length of early reverberation
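As a small illustration of the early/late split, the sketch below cuts a toy RIR at D taps and checks that the desired part d and the late part r add up to the reverberated signal. It assumes a noise-free case (n(t) = 0); the values of s, h and D are hypothetical.

```python
def convolve(h, s):
    """Linear convolution x[t] = sum_k h[k] * s[t - k]."""
    x = [0.0] * (len(h) + len(s) - 1)
    for i, hi in enumerate(h):
        for j, sj in enumerate(s):
            x[i + j] += hi * sj
    return x


s = [1.0, -0.5, 0.25, 0.0, 0.8, -0.3]   # toy clean signal
h = [1.0, 0.5, 0.25, 0.125, 0.0625]     # toy decaying RIR, Lh = 5
D = 2                                   # length of the early part

h_early, h_late = h[:D], h[D:]

x = convolve(h, s)                           # full reverberated signal
d = convolve(h_early, s)                     # desired signal: clean + early echo
d = d + [0.0] * (len(x) - len(d))            # zero-pad to the length of x
r = [0.0] * D + convolve(h_late, s)          # late echo, delayed by D samples

recon = [dt + rt for dt, rt in zip(d, r)]    # d(t) + r(t) should equal x(t)
```

The D leading zeros in r reflect that the late taps start at lag D, which is the property the delayed prediction methods rely on.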


d. Non-negative Matrix Factorization (NMF)

NMF: decomposition of a non-negative matrix V into the product of two matrices B and G with non-negative entries:

$$V \approx BG$$

B: basis or dictionary matrix, G: weight or gains matrix.

● This can be posed as an optimization problem:

$$\min_{B \ge 0,\; G \ge 0} C(V, BG)$$

where C is the cost function measuring the distance between V and BG.

● Columns of B are called basis vectors.
● The number of columns of B is kept smaller than the size of V.
● Iterative algorithms are used to solve the NMF problem, since there is no unique solution.
● The initial B and G matrices can be random positive numbers, or supervised matrices for faster convergence.
● Popular cost functions for the distance between V and BG are:
○ Euclidean distance,
○ Kullback-Leibler (KL) divergence,
○ Itakura-Saito (IS) divergence.

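The Kullback-Leibler divergence between V and BG is defined as [6]:

$$D_{KL}(V \,\|\, BG) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(BG)_{ij}} - V_{ij} + (BG)_{ij} \right)$$

In the corresponding multiplicative update rules of [6], "1" is the matrix of ones with the same size as V.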
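A minimal sketch of NMF with the KL divergence, using the multiplicative update rules of Lee and Seung [6]: G <- G .* (B^T (V ./ BG)) ./ (B^T 1) and B <- B .* ((V ./ BG) G^T) ./ (1 G^T). The helper names, the toy matrix V, the random initialization and the iteration count are illustrative assumptions.

```python
import math
import random


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def kl_divergence(V, VH):
    """D(V || VH) = sum(V * log(V / VH) - V + VH), with 0 * log 0 taken as 0."""
    total = 0.0
    for v_row, h_row in zip(V, VH):
        for v, vh in zip(v_row, h_row):
            total += (v * math.log(v / vh) if v > 0 else 0.0) - v + vh
    return total


def nmf_kl(V, R, n_iter=50, seed=0):
    """Factor V (F x T) into B (F x R) and G (R x T) with KL multiplicative updates."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    B = [[rng.random() + 0.1 for _ in range(R)] for _ in range(F)]
    G = [[rng.random() + 0.1 for _ in range(T)] for _ in range(R)]
    history = []
    for _ in range(n_iter):
        # Update G with B fixed.
        BG = matmul(B, G)
        Q = [[V[f][t] / BG[f][t] for t in range(T)] for f in range(F)]
        for r in range(R):
            col_sum = sum(B[f][r] for f in range(F))          # entry of B^T 1
            for t in range(T):
                G[r][t] *= sum(B[f][r] * Q[f][t] for f in range(F)) / col_sum
        # Update B with G fixed.
        BG = matmul(B, G)
        Q = [[V[f][t] / BG[f][t] for t in range(T)] for f in range(F)]
        for r in range(R):
            row_sum = sum(G[r])                               # entry of 1 G^T
            for f in range(F):
                B[f][r] *= sum(Q[f][t] * G[r][t] for t in range(T)) / row_sum
        history.append(kl_divergence(V, matmul(B, G)))
    return B, G, history


# Toy strictly positive matrix with an exact rank-2 non-negative factorization.
V = [[1.0, 2.0, 3.0, 1.0, 2.0],
     [2.0, 1.0, 1.0, 3.0, 1.0],
     [3.0, 3.0, 4.0, 4.0, 3.0],
     [4.0, 5.0, 7.0, 5.0, 5.0]]
B, G, history = nmf_kl(V, R=2)
```

The recorded divergence should not increase from one iteration to the next, which is the monotonicity property shown in [6]; a different seed generally leads to a different pair (B, G).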

● The NMF problem is non-convex and has multiple local minima. As a result, B and G can vary for the same V matrix.

● NMF is a common method in speech processing, deep learning, clustering, and computer vision.

● In speech processing, NMF has applications in audio source separation, source/filter modeling, blind dereverberation [3][4], speech denoising and so on.

3. Blind-Dereverberation Methods

a. Delayed linear prediction (DLP)

We denote the time-domain signals by x(t), s(t) and h(t).

The corresponding STFT-domain signals are x(n,k), s(n,k) and h(n,k), respectively.

Then,

● DLP estimates inverse filter coefficients from the reverberated signal.
● An inverse filter of length Lw can be used to approximately obtain a dereverberated signal as:

● In matrix form, reverberation can be formulated as

● The desired signal can therefore be estimated using only the reverberated signal and its past samples.

● The number of zeros in the resulting inverse filter vector is equal to the delay D.
● In conclusion, the DLP algorithm is a simple technique to achieve dereverberation.
● However, it may not work well in most cases, because the inverse filter is restricted to an FIR filter.
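A hedged time-domain sketch of delayed linear prediction: the current sample is predicted from samples at least D steps in the past, and the prediction residual is taken as the estimate of the desired signal. The least-squares solver, the small ridge term, and the toy white-noise source are assumptions for illustration; the thesis implementation uses the Levinson-Durbin algorithm instead.

```python
import random


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][j] * w[j] for j in range(r + 1, n))) / M[r][r]
    return w


def delayed_linear_prediction(x, D, Lw, ridge=1e-8):
    """Minimize sum_t (x[t] - sum_k w[k] * x[t - D - k])^2 over w.

    Returns the filter w and the prediction residual d_hat for t >= D + Lw - 1.
    """
    start = D + Lw - 1
    rows = [[x[t - D - k] for k in range(Lw)] for t in range(start, len(x))]
    target = x[start:]
    # Normal equations (X^T X + ridge * I) w = X^T x.
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(Lw)] for i in range(Lw)]
    b = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(Lw)]
    w = solve(A, b)
    d_hat = [y - sum(wk * rk for wk, rk in zip(w, r)) for r, y in zip(rows, target)]
    return w, d_hat


# Toy data: white source plus two late echoes at lags 3 and 5 (hypothetical values).
rng = random.Random(0)
s = [rng.gauss(0.0, 1.0) for _ in range(400)]
x = [s[t] + (0.6 * s[t - 3] if t >= 3 else 0.0) + (0.3 * s[t - 5] if t >= 5 else 0.0)
     for t in range(len(s))]
w, d_hat = delayed_linear_prediction(x, D=3, Lw=6)
```

Because the zero filter is always a feasible choice, the residual energy cannot exceed the energy of the unprocessed samples over the same range; how close the residual gets to the clean signal depends on the source and filter lengths.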



b. Weighted prediction error (G-WPE)

Assumption 1: the speech signal has a local Gaussian distribution within short frames of length Lf,

Assumption 2: samples are mutually uncorrelated beyond a certain distance,

Assumption 3: the variance is constant within short-time frames of length Lf.

● Dereverberation can be done both in the time domain and in the STFT domain.
● Working in the time domain is very costly because of the large matrices involved, so the STFT domain is used.
● Probability density function of the desired signal in the STFT domain,

n: frame number, k: frequency bin; the variance is time-varying.

Then,

● Variance values change only with respect to time frames.

Thus,

● Apply likelihood maximization to the Gaussian pdf. The log-likelihood function for the dereverberation process in the STFT domain then becomes:

Parameter vector for likelihood maximization:

Maximizing this function with respect to the parameter vector cannot be done analytically, since there is no closed-form solution. Thus, an iterative algorithm is needed.

A two-step procedure has been proposed in [1] to solve the likelihood maximization problem.

1. Keep the variances constant and solve for the prediction filter coefficients that maximize the likelihood, then obtain the desired signal estimate;

2. Keep the filter coefficients constant and update the variances;

and so on, until a convergence criterion is satisfied or a maximum number of iterations is completed.
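The two-step procedure above can be sketched for a single frequency bin as follows. This is a simplified illustration, not the implementation of [1]: it assumes the residual convention d(n) = x(n) - sum_l g[l] * x(n - D - l), a variance estimated from one frame (|d(n)|^2 floored at eps), and a small ridge term for numerical safety. All names and toy values are hypothetical.

```python
import random


def solve(A, b):
    """Solve A g = b (real or complex) by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    g = [0.0] * n
    for r in reversed(range(n)):
        g[r] = (M[r][n] - sum(M[r][j] * g[j] for j in range(r + 1, n))) / M[r][r]
    return g


def wpe_one_bin(x, D, Lg, n_iter=5, eps=1e-6):
    """Iterative variance-weighted delayed prediction on complex coefficients x(n)."""
    start = D + Lg - 1
    rows = [[x[n - D - l] for l in range(Lg)] for n in range(start, len(x))]
    y = x[start:]
    d = list(y)                       # initial desired-signal estimate
    g = [0.0] * Lg
    for _ in range(n_iter):
        # Variance update: per-frame variance from the current estimate.
        var = [max(abs(dn) ** 2, eps) for dn in d]
        # Filter update: weighted least squares with weights 1 / var.
        A = [[sum(r[i].conjugate() * r[j] / v for r, v in zip(rows, var))
              + (eps if i == j else 0.0) for j in range(Lg)] for i in range(Lg)]
        b = [sum(r[i].conjugate() * yn / v for r, yn, v in zip(rows, y, var))
             for i in range(Lg)]
        g = solve(A, b)
        d = [yn - sum(gl * rl for gl, rl in zip(g, r)) for r, yn in zip(rows, y)]
    return g, d


# Toy complex sequence with late echoes at lags 3 and 4 (hypothetical values).
rng = random.Random(1)
s = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
x = [s[n] + (0.6 * s[n - 3] if n >= 3 else 0) + (0.3 * s[n - 4] if n >= 4 else 0)
     for n in range(len(s))]
g, d = wpe_one_bin(x, D=2, Lg=4)
g1, d1 = wpe_one_bin(x, D=2, Lg=4, n_iter=1)
```

Each filter update is a closed-form weighted least-squares solve, which is why only the alternation between the two steps needs to be iterated.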

c. Laplacian based weighted prediction error (L-WPE)

L-WPE [2] suggests that speech can be modeled more precisely with a Laplacian model than with a Gaussian model in the STFT domain.

● Assumption 1: the speech signal has a local Laplacian distribution within short frames of length Lf,

● Assumption 2: the real and imaginary parts of the desired signal's STFT coefficients are independent and have equal variance in each time-frequency bin.

Then, the pdf of the Laplacian model is

As in the G-WPE method, maximum likelihood (ML) estimation is used for the parameter vector. The likelihood function is then:

The likelihood function has no closed-form maximizer, so it is solved numerically:

1. Keep the variances constant and solve for the filter coefficients that maximize the likelihood (or, equivalently, minimize the l1 norm), then obtain the desired signal estimate;

2. Keep the filter coefficients constant and update the variances.

Step 1: fix the variances and update the filter coefficients

The likelihood function can then be rewritten as

Thus, the likelihood function can be written as:

Then, the problem can be interpreted as a linear programming problem:

Step 2: fix the filter coefficients and update the variances

After taking the log-likelihood and maximizing it with respect to the variance, the closed-form solution for the variance becomes:

● These two steps are repeated until a convergence criterion is satisfied or the maximum number of iterations is reached.


d. NMF based spectral modeling (NMF+N-CTF)

● The method in [3] combines the non-negative convolutive transfer function (N-CTF) model with non-negative matrix factorization (NMF).

● N-CTF model assumption: for each frequency bin, the power spectrogram of the reverberated signal is given by the convolution, across frames, of the power spectrogram of the clean speech signal with that of the RIR.
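A small sketch of the N-CTF forward model described above: for every frequency bin k, the reverberant power spectrogram is the convolution over frames of the clean power spectrogram with the non-negative RIR coefficients. The array layout (frames as rows) and the toy values are assumptions for illustration.

```python
def nctf_forward(S, H):
    """X[n][k] = sum_tau H[tau][k] * S[n - tau][k] (convolution over frames)."""
    n_frames, n_bins = len(S), len(S[0])
    X = [[0.0] * n_bins for _ in range(n_frames)]
    for n in range(n_frames):
        for tau in range(min(len(H), n + 1)):
            for k in range(n_bins):
                X[n][k] += H[tau][k] * S[n - tau][k]
    return X


# Toy clean power spectrogram: energy only in frame 1 (4 frames, 2 bins).
S = [[0.0, 0.0],
     [1.0, 2.0],
     [0.0, 0.0],
     [0.0, 0.0]]
# Toy non-negative RIR power coefficients: 2 taps per bin, decaying.
H = [[1.0, 1.0],
     [0.5, 0.25]]

X = nctf_forward(S, H)
```

In this toy case the energy of frame 1 leaks into frame 2, scaled by the second tap of each bin, which is the temporal smearing that reverberation produces in a spectrogram.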

Assumptions:

● The phase components at different frames are mutually independent.

● The coefficients are zero-mean random variables with a Gaussian distribution.
● The clean signal and RIR spectral coefficients are mutually independent.

For simplicity, a shorthand notation is set for x(n,k), and likewise for s(n,k) and h(n,k) (different from the other methods).

The Kullback-Leibler (KL) divergence is used to estimate the power spectrogram of s(n,k) from the previous equation:

where the model output is the estimated power spectrogram of the reverberated signal.

To obtain a more accurate estimate, the sparsity of the clean speech spectrogram can be added as a regularization term with a weight.

As a non-negativity constraint, the estimated quantities are required to be non-negative.

This model can be solved with an iterative learning method:

Let's add the NMF approach:

The clean speech magnitude spectrogram S can be formulated as the product of a dictionary matrix B and a weight matrix G.

where

R: the number of basis vectors in the dictionary matrix B (dictionary size); R < N (frame size)

After combining the N-CTF and NMF methods, the problem definition becomes:

Approach: keep two of the variables fixed and update the third, in turn, until a convergence criterion is met or the maximum number of iterations is reached.

● To remove scale ambiguity, each column of B is normalized to sum to one after each iteration.
● The columns of H are divided element-wise by the first column of H.
● By nature, the RIR consists of decaying impulses.

● The mapping coefficient matrix between the clean speech signal and the reverberated speech signal can be formulated as:

where,

● For the online method, the basis matrix B and the weight matrix G are initialized with random non-negative numbers.

● B and G can be initialized with supervised matrices to increase efficiency.

● In this work, we employ the online method.


e. Sparsity penalized weighted least squares method (SPWLS)

❖ SPWLS combines the idea of variance normalization through a weight matrix with the sparsity of speech spectrogram matrices.

❖ To promote sparsity of a variable, ℓ1 norm regularization is generally used.

❖ With ℓ1 regularization, the optimization problem, also known as the Lasso problem, requires an iterative algorithm to solve.

❖ Some popular algorithms for solving the Lasso problem are
➢ ISTA (iterative shrinkage-thresholding algorithm) [7]
➢ FISTA
➢ SALSA

The convolution equation (in the STFT domain, for a fixed frequency k) can be rewritten in matrix form as:

$$x = Hs + n$$

Then, with a regularization term for sparsity, we need to solve the Lasso problem:

n: noise signal, s: clean speech signal, x: reverberated signal,

H: convolution matrix of the RIR.

● Add weights to the problem, as in the L-WPE and G-WPE methods.
● Add an extra regularization on the norm of the filter h to avoid a trivial solution.
● Our optimization loss function becomes:

where W is a diagonal weight matrix holding 1/(standard deviation) values, k is the (fixed) frequency index and n is the frame index. The remaining symbols are a regularization parameter and the target norm for the filter h.

● The problem is non-differentiable at its local minimum.
● s and h need to be calculated numerically with an iterative approach.
● Our approach requires a good initialization for s and h, which can be obtained from an earlier method such as G-WPE.

● Our approach: perform alternating updates of s and h, each minimizing the objective function with respect to the corresponding variable.

● The ISTA algorithm is used for updating s and h.

ISTA minimizes functions of the form f(s) + g(s), where the first function is differentiable and the second is usually not differentiable, but simple.

Step 1 to update s: take a gradient descent step on the first function f(.):

(i: iteration index)

The result is an intermediate solution.

● If we calculate the gradient of the first function f(.):

The positive step size parameter indicates how far we move along the negative gradient.

Step 2 to update s: a proximal operator step of g(.) is performed around that intermediate solution as follows:

The proximal step corresponds to a thresholding/shrinkage operation for the ℓ1 norm penalty:

Basically, this step erases the components with small energy and shrinks the others. The threshold a is chosen specifically for our algorithm.
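The two ISTA steps above can be sketched for a generic Lasso problem, min over s of 0.5 * ||x - H s||^2 + lam * ||s||_1. This is a textbook real-valued version for illustration, not the weighted SPWLS objective: the step size 1 / ||H||_F^2, the threshold mu * lam, and the toy H and x are assumptions.

```python
def soft_threshold(v, a):
    """Shrink each entry toward zero by a; entries with |v| <= a become 0."""
    out = []
    for vi in v:
        mag = abs(vi)
        out.append(vi * (mag - a) / mag if mag > a else 0.0)
    return out


def objective(H, x, s, lam):
    """0.5 * ||x - H s||^2 + lam * ||s||_1."""
    resid = [xi - sum(hij * sj for hij, sj in zip(row, s)) for row, xi in zip(H, x)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(sj) for sj in s)


def ista(H, x, lam, n_iter=100):
    """Iterative shrinkage-thresholding for the Lasso problem."""
    n = len(H[0])
    mu = 1.0 / sum(h * h for row in H for h in row)   # safe step: 1 / ||H||_F^2
    s = [0.0] * n
    for _ in range(n_iter):
        resid = [xi - sum(hij * sj for hij, sj in zip(row, s))
                 for row, xi in zip(H, x)]
        grad = [-sum(H[i][j] * resid[i] for i in range(len(H))) for j in range(n)]
        z = [sj - mu * gj for sj, gj in zip(s, grad)]   # step 1: gradient step
        s = soft_threshold(z, mu * lam)                 # step 2: proximal step
    return s


# Toy convolution matrix of the filter [1, 0.5] and a sparse source.
H = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [0.0, 0.0, 0.5]]
x = [2.0, 1.0, -1.0, -0.5]          # equals H applied to [2, 0, -1]
s_est = ista(H, x, lam=0.1)
shrunk = soft_threshold([3.0, -0.5, 1.0, -2.0], 1.0)
```

The same soft-threshold function also works on complex entries, since it only rescales the magnitude and keeps the phase, which is the relevant case for STFT coefficients.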

● After the update of s, we update the W matrix according to the new variance values of s.

Now, we need to solve the problem for h. Update h according to:

● Use ISTA again:

Step 1 to update h: minimizer for f(.), a simple least-squares problem with an exact solution:

Step 2 to update h: proximal operation step for the regularization of h

● The step size parameter for the inner gradient descent iteration for s can be set to change at each iteration as

where the schedule is controlled by hyperparameters and the initial step size, and is indexed by the inner and outer iteration indices.

4) Experiments & Comparisons

Test data

Experiment 1: 3 male and 3 female (clean) voices convolved with 6 different RIR samples, with additive noise at 30 dB and 60 dB (for the DLP, G-WPE, NMF+N-CTF and SPWLS methods).

72 different samples were dereverberated.

Experiment 2: 1 male and 1 female (clean) voices convolved with 5 different RIR samples, with additive noise at 30 dB and 60 dB (for all methods).

20 different samples were dereverberated.

● The test data was taken from the "Reverb Challenge" data set.
● The sampling frequency was 16 kHz for all files.
● The reverberation times (RT60) of the RIRs were 0.17, 0.11, 0.95, 0.33, 0.54 and 0.35 s, respectively.
● The L-WPE method was run on all RIRs except the one with RT60 = 0.95 s, due to excessive run time.
● As additive noise, cafe environment noise at 30 dB and 60 dB levels was used.

Setup

● The prediction delay D was set to 3 frames for the G-WPE, L-WPE and DLP methods.
● Lf, the number of frames used for variance calculations, was set to 1 frame for the G-WPE, L-WPE and SPWLS methods.
● The number of iterations for the G-WPE, L-WPE and SPWLS methods was set to 5.
● The number of iterations for the NMF+N-CTF method was set to 100.
● STFT parameters: hop size = 10 ms, window size = 30 ms.
● Minimum variance to avoid division by zero: v = 1e-6.
● The number of STFT frames used for prediction changes with the RT60 estimate computed internally.

SPWLS parameters specific to this method:

● step size = 1e-7,
● ISTA regularization parameter = 1e5,
● number of inner ISTA iterations i = 10,
● ISTA regularization parameter for the filter = 10.
● The SPWLS initialization for the RIR, H, is set to the output of the G-WPE method.

The NMF+N-CTF method uses

● a dictionary matrix size "ndict" of 100,
● the online method.

Computational efficiency

● All the algorithms are implemented in MATLAB on a computer with an Intel Xeon CPU at 2.5 GHz.
● The fastest is the SPWLS method, followed by G-WPE, DLP, NMF+N-CTF and L-WPE, in that order.
● L-WPE is very slow due to its linear programming (LP) part. The CVX tool for MATLAB is used for the LP part.
● Run times for the data with RT60 = 0.54 s:
○ L-WPE: ~1 day
○ NMF+N-CTF: ~1.5 hours (100 iterations, ndict = 100)
○ G-WPE: ~4 min (5 iterations)
○ SPWLS: ~2 min (5 iterations)
○ DLP: ~3 min (1 iteration), implemented with the Levinson-Durbin algorithm

Test methods

● The accuracy of the dereverberation process is measured with the average cepstral distortion (CD) over short time frames.
● CD is a popular measure of speech quality between the clean signal and the reconstructed signal.

The measure compares the cepstral coefficients of the clean speech signal (1st to 12th order) with those of the estimated speech signal (1st to 12th order), together with the zero-order coefficient, which denotes the power spectrum envelope in dB.

● The CD between similar signals approaches 0.
● Our aim is to keep the CD as small as possible after the dereverberation process.
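A hedged sketch of the cepstral distortion measure, assuming the commonly used form CD = (10 / ln 10) * sqrt((c0 - c0_hat)^2 + 2 * sum over k = 1..12 of (ck - ck_hat)^2), averaged over frames. The exact formula used in the thesis was not recoverable from the slides, so this form, the function names and the toy coefficient vectors are assumptions.

```python
import math


def cepstral_distortion(c, c_hat):
    """CD in dB for one frame; c and c_hat hold coefficients [c0, c1, ..., c12]."""
    d0 = (c[0] - c_hat[0]) ** 2
    dk = sum((a - b) ** 2 for a, b in zip(c[1:13], c_hat[1:13]))
    return (10.0 / math.log(10.0)) * math.sqrt(d0 + 2.0 * dk)


def average_cd(C, C_hat):
    """Mean CD over frames; C and C_hat are lists of per-frame coefficient lists."""
    return sum(cepstral_distortion(a, b) for a, b in zip(C, C_hat)) / len(C)


# Toy frames: identical coefficients, then a difference of 1 in c0 only.
clean = [0.5] * 13
same = list(clean)
shifted = [1.5] + [0.5] * 12

cd_same = cepstral_distortion(clean, same)
cd_shifted = cepstral_distortion(clean, shifted)
cd_avg = average_cd([clean, clean], [same, shifted])
```

Identical frames give a distortion of exactly 0, and a unit difference in the zero-order coefficient alone gives 10 / ln 10, roughly 4.34 dB.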

● STOI, short-time objective intelligibility measure: for short-time frames, STOI compares the temporal envelopes of the clean and dereverberated speech in terms of correlation coefficients.

● PESQ, Perceptual Evaluation of Speech Quality: a common standardized test method for measuring speech quality. Three types of PESQ measure are applied.

● Signal-to-noise ratio (SNR) between the clean signal and the dereverberated signal.

● Segmental SNR (segSNR): SNR computed over short time frames.
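The SNR and segmental SNR measures can be sketched as below, treating the difference between the clean and dereverberated signals as the noise. The small eps guard, the non-overlapping frames and the absence of per-frame clipping are simplifying assumptions; the thesis settings for these details are not given in the slides.

```python
import math


def snr_db(s, s_hat, eps=1e-12):
    """10 * log10(signal energy / error energy) between clean s and estimate s_hat."""
    num = sum(a * a for a in s)
    den = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    return 10.0 * math.log10((num + eps) / (den + eps))


def seg_snr_db(s, s_hat, frame_len):
    """Mean of per-frame SNR values over non-overlapping frames."""
    vals = []
    for i in range(0, len(s) - frame_len + 1, frame_len):
        vals.append(snr_db(s[i:i + frame_len], s_hat[i:i + frame_len]))
    return sum(vals) / len(vals)


# Toy signals: the estimate is the clean signal scaled by 0.9.
s = [1.0, 1.0, 1.0, 1.0]
s_hat = [0.9, 0.9, 0.9, 0.9]

overall = snr_db(s, s_hat)
segmental = seg_snr_db(s, s_hat, frame_len=2)
```

For this toy case the error is 10% of the amplitude in every sample, so both measures come out at about 20 dB; on real speech the two differ because quiet frames weigh as much as loud ones in the segmental average.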

Test results - iteration # (experiment 2, 20 files): (figures)

Test results - NMF+N-CTF method (experiment 2, 20 files): (figures)

Spectrogram results of dereverberated signals, iter# = 1: (figures)

Numerical results, iter# = 5: (table)

Test results - average: (figures)

Numerical results for the long RIR with RT60 = 0.54 s: (table)

Numerical results - NMF+N-CTF method

ndict = dictionary matrix size, #iter = number of iterations

NNCTF1: ndict = 100, #iter = 100
NNCTF2: ndict = 500, #iter = 200
NNCTF3: ndict = 1000, #iter = 200
NNCTF4: ndict = 1000, #iter = 400
NNCTF5: ndict = 1000, #iter = 240

Listen to the results

5) DISCUSSION & CONCLUSION

● The best test results belong to the L-WPE method.
● Considering both time efficiency and test results, G-WPE works better and could be suitable for real-time applications.
● The L-WPE algorithm is much more complex than G-WPE because of its linear programming part. Thus, it runs very slowly.
● NMF+N-CTF results:
○ the method converges,
○ the test results are not as good as those reported in the paper,
○ the method could perform better with a good initialization or a supervised dictionary matrix,
○ increasing the dictionary size improves the test results, but increasing the number of iterations does not always improve them,
○ no phase information is used.
● For one iteration, L-WPE was slower and G-WPE was faster than DLP.
● SPWLS did not perform well in terms of CD. To improve the performance, more constraints can be set for h. In SPWLS, we try to eliminate the whole echo, not only the late part as in G-WPE, L-WPE and DLP. Also, the step size might be decreased.
● SPWLS shows promise in terms of time efficiency, SNR and PESQ results.
● Spectrogram results show that L-WPE and G-WPE successfully eliminate the late reverberant parts.
● DLP is used only for comparison with the L-WPE and G-WPE methods, since they are derived from the DLP method. As expected, L-WPE and G-WPE perform better.

REFERENCES

[1] Nakatani, Tomohiro, et al. "Speech dereverberation based on variance-normalized delayed linear prediction." IEEE Transactions on Audio, Speech, and Language Processing 18.7 (2010): 1717-1731.

[2] Jukić, Ante, and Simon Doclo. "Speech dereverberation using weighted prediction error with Laplacian model of the desired signal." 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

[3] Mohammadiha, Nasser, Paris Smaragdis, and Simon Doclo. "Joint acoustic and spectral modeling for speech dereverberation using non-negative representations." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

[4] Mohammadiha, Nasser, and Simon Doclo. "Speech dereverberation using non-negative convolutive transfer function and spectro-temporal modeling." IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.2 (2016): 276-289.

[5] Selesnick, Ivan. "Introduction to sparsity in signal processing." Connexions (2012).

[6] Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems. 2001.

[7] Combettes, Patrick L., and Jean-Christophe Pesquet. "Proximal splitting methods in signal processing." Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer New York, 2011. 185-212.
