
1680 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 8, AUGUST 2017

Cramér–Rao Bound Analysis of Reverberation Level Estimators for Dereverberation and Noise Reduction

Ofer Schwartz, Student Member, IEEE, Sharon Gannot, Senior Member, IEEE, and Emanuël A. P. Habets, Senior Member, IEEE

Abstract—The reverberation power spectral density (PSD) is often required for dereverberation and noise reduction algorithms. In this work, we compare two maximum likelihood (ML) estimators of the reverberation PSD in a noisy environment. In the first estimator, the direct path is first blocked. Then, the ML criterion for estimating the reverberation PSD is stated according to the probability density function of the blocking matrix (BM) outputs. In the second estimator, the speech component is not blocked. Instead, the ML criterion for estimating the speech and reverberation PSDs is stated according to the probability density function of the microphone signals. To compare the expected mean square error (MSE) between the two ML estimators of the reverberation PSD, the Cramér–Rao Bounds (CRBs) for the two ML estimators are derived. We show that the CRB for the joint reverberation and speech PSD estimator is lower than the CRB for estimating the reverberation PSD from the BM outputs. Experimental results show that the MSE of the two estimators indeed obeys the CRB curves. Experimental results of a multimicrophone dereverberation and noise reduction algorithm show the benefits of using the ML estimators in comparison with other baseline estimators.

Index Terms—array processing, dereverberation, diffuse noise, noise reduction.

I. INTRODUCTION

REVERBERATION and ambient noise may degrade the quality and intelligibility of the signals captured by the microphones of mobile devices, smart TVs, and audio conferencing systems. While intelligibility is not degraded by early reflections, it can be significantly degraded by late reflections (i.e., reverberation) due to overlap masking effects [1]. The estimation of the reverberation PSD is a much more challenging task than the estimation of the ambient noise PSD, since the reverberation is highly non-stationary and since speech-absence periods cannot be utilized.

Both single- and multi-microphone techniques have been proposed to reduce reverberation (see [2], [3] and the references

Manuscript received May 20, 2016; revised February 9, 2017; accepted March 23, 2017. Date of publication April 24, 2017; date of current version June 28, 2017. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hiroshi Saruwatari. (Corresponding author: Sharon Gannot.)

O. Schwartz and S. Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel (e-mail: [email protected]; [email protected]).

E. A. P. Habets is with the International Audio Laboratories Erlangen (a joint institution between the University of Erlangen-Nuremberg and Fraunhofer IIS), Erlangen 91058, Germany (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASLP.2017.2696308

therein), while many of these techniques require an estimate of the reverberation PSD (e.g., [4]). A statistical model of reverberant signals usually depends on the PSD of the anechoic speech component. Since the anechoic speech PSD changes rapidly across time and is unknown in advance, it should be either ignored or estimated as well. In our previous work [4], a multi-microphone minimum mean square error (MMSE) estimator of the early speech component was implemented as a minimum variance distortionless response (MVDR) beamformer followed by a postfilter. The reverberation and the ambient noise were treated by both the MVDR stage and the postfiltering stage. The ambient noise PSD matrix was assumed to be known. The most difficult task was to provide an accurate estimate of the reverberation PSD that is required for both stages. The reverberation PSD was estimated by averaging the marginal reverberation levels at the microphones, obtained using the single-channel estimator proposed in [5]. The postfilter was calculated using the decision-directed approach [6]. The speech PSD was not estimated explicitly. In the sequel, multichannel estimators of the speech, noise, or reverberation PSDs are discussed.

In [7], the authors proposed an estimator for the noise PSD matrix assuming a non-reverberant scenario and a time-varying noise level. A normalized version of the noise PSD matrix was estimated using the most recent speech-absent segments. Next, for the speech-present segments, only the level of the noise was estimated. The authors proposed to estimate the level of the noise from the signals at the output of a BM, which blocks the speech signal. Since the speech signal is blocked, the PSD matrix of the BM outputs comprises only noise components. A closed-form ML estimator of the noise level was proposed w.r.t. the probability density function (p.d.f.) of the BM outputs. This estimator may also be applied to reverberation, assuming that the observed signals do not contain additive noise.

In [8], [9], a reverberant and noisy environment was addressed. The reverberation was modelled as a diffuse sound field with a time-varying level, and the noise PSD matrix was assumed to be known. In these works, the authors proposed to estimate the time-varying level of the reverberation from the signals at the output of a BM. The authors defined an error matrix using the PSD matrix of the BM outputs. The elements of the error matrix were assumed to be zero-mean Gaussian variables. The best fitting value for the reverberation PSD was estimated by minimizing the Frobenius norm of the error matrix.

2329-9290 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


SCHWARTZ et al.: CRB ANALYSIS OF REVERBERATION LEVEL ESTIMATORS 1681

Recently, in [10], an estimator for the reverberation PSD in a noisy environment was proposed. First, the received signals are filtered by a BM to block the anechoic speech. Then, the likelihood of the reverberation PSD, given the signals at the output of the BM, is maximized. Due to the complexity of the p.d.f., a closed-form solution cannot be derived. Instead, Newton's method was used to maximize the likelihood. To apply the iterations, the first- and second-order derivatives of the log-likelihood were calculated in closed form. However, since the BM processes the data and reduces the number of available signals, applying the ML criterion at the output of the BM is prone to sub-optimality.

In [11], the dereverberation task for hearing-aid applications was addressed, assuming a noise-free environment. The reverberation component was modeled similarly as in [8]. Instead of blocking the speech signal, the authors propose to estimate the speech PSD as well. A closed-form solution for the maximum likelihood estimators (MLEs) of the time-varying reverberation PSD and the anechoic speech PSD is then presented, based on an expression derived in [12]. The MLE was derived under the assumption that the speech component and the reverberant component are zero-mean Gaussian vectors, with a rank-1 PSD matrix for the speech component and a full-rank PSD matrix for the reverberant component.

More recently, in [13], an optimal estimator in the ML sense for the reverberation PSD in a noisy environment was proposed without using a blocking stage. Instead, the reverberation and the anechoic speech PSD levels were jointly estimated. Similarly to [10], a closed-form solution cannot be derived; thus, Newton's method was used to maximize the likelihood. To apply the iterations, the first- and second-order derivatives of the log-likelihood were calculated in closed form. It is unclear whether the preliminary blocking stage used in [10] degrades or enhances the accuracy of the reverberation estimation. It is therefore interesting to compare the two MLEs, i.e., the MLE of the reverberation PSD without using the BM [13] and the MLE which uses the BM [10].

A common way to analyze an MLE is by examining its CRB, since the MSE of an MLE asymptotically achieves the CRB. By careful examination of the CRB expression, one can reveal factors that may decrease the MSE of the MLE. In [14], the authors derive the CRB expressions for the MLEs of the reverberation and speech PSDs proposed in [11], [12]. The authors show that the CRB expression for the reverberation PSD is independent of the speech component. It should be emphasized that this result is valid only for the noiseless case. In [15], the authors show that the MLE derived in [11], which circumvents the blocking of the desired speech, has a lower MSE than the MLE derived in [8], which uses the BM.

In the current paper, the CRB for the estimator in [10] that uses the blocking stage and the CRB for the joint estimator in [13] that circumvents the blocking stage are derived and compared. Although the estimators are based on different signal models, they both provide an exact solution in the ML sense given the considered signal model. The ML estimator is commonly used in the signal processing community as it is mathematically rigorous. Moreover, the CRB is a reliable tool for checking the estimation error of the ML estimator. The comparison between these two estimators is interesting from both practical and theoretical points of view, as it provides new insights into the pros and cons of using a blocking stage, which is used in various noise reduction and dereverberation algorithms (see references [7]–[10]).

The paper is organized as follows. In Section II, the reverberant and noisy signal model is formulated. In Section III, the two ML estimators are derived and summarized. In Section IV, the two CRB expressions are derived and compared. Section V presents the simulation study by calculating the MSE of the two MLEs and their respective CRBs, and demonstrates the dereverberation and noise reduction capabilities of an algorithm that uses these estimators. Finally, conclusions are drawn in Section VI and the work is summarized.

II. PROBLEM FORMULATION

We consider N microphone observations consisting of reverberant speech and additive noise. The reverberant speech can be split into two components, the direct and reverberation parts:

Yi(m, k) = Xi(m, k) + Ri(m, k) + Vi(m, k), (1)

where Yi(m, k) denotes the ith microphone observation at time-frame index m and frequency-bin index k, Xi(m, k) denotes the direct speech component, Ri(m, k) denotes the reverberation, and Vi(m, k) denotes the ambient noise. Here Xi(m, k) is modeled as a multiplication between the anechoic speech and the direct transfer function (DTF)

Xi(m, k) = Gd,i(k)S(m, k), (2)

where Gd,i(k) is the DTF and S(m, k) is the desired anechoic speech.

The N microphone signals are concatenated in the vector

y(m, k) = x(m, k) + r(m, k) + v(m, k) (3)

where

y(m, k) = [Y1(m, k) Y2(m, k) . . . YN(m, k)]^T

x(m, k) = [X1(m, k) X2(m, k) . . . XN(m, k)]^T = gd(k) S(m, k)

gd(k) = [Gd,1(k) Gd,2(k) . . . Gd,N(k)]^T

r(m, k) = [R1(m, k) R2(m, k) . . . RN(m, k)]^T

v(m, k) = [V1(m, k) V2(m, k) . . . VN(m, k)]^T.

The speech signal is modeled as a Gaussian process with S(m, k) ∼ N(0, φS(m, k)), where φS(m, k) is the speech PSD. The reverberation and the noise components are assumed to be uncorrelated and may be modelled by zero-mean multivariate Gaussian probability density functions. The PSD matrix of the noise, Φv(k) = E{v(m, k) v^H(m, k)}, is assumed to be time-invariant and known in advance, or can be accurately estimated during speech-absent periods. The PSD matrix of the reverberation is time-varying, since the reverberation originates from the speech source. However, the spatial characteristic of the reverberation is assumed to be time-invariant, as long as the speaker and microphone positions do not change. Therefore, it is reasonable to model the PSD matrix of the reverberation as a time-invariant matrix with a time-varying level. Hence, the p.d.f. of the reverberant signal vector is modelled as:

r(m, k) ∼ N (0, φR (m, k)Γ(k)) , (4)

where Γ(k) is the time-invariant spatial coherence matrix of the reverberation and φR(m, k) is the temporal level of the reverberation. In the current work, we assume that the reverberation can be modelled using a spatially homogeneous and spherically isotropic sound field (as in many previous works [4], [16]–[19]). Under this assumption, and assuming omnidirectional microphones and a known microphone array geometry, the spatial coherence between the ith and jth microphones is uniquely defined as

Γij(k) = sinc( (2πk / K) · (Di,j / (Ts c)) ),   (5)

where sinc(x) = sin(x)/x, K is the number of frequency bins, Di,j is the inter-microphone distance between microphones i and j, Ts is the sampling time, and c is the sound velocity. The goal in this work is to compare two ML estimators of φR(m, k), given the measurements y(m, k), with the other parameters gd(k), Γ(k), and Φv(k) known. As the speech PSD φS(m, k) is unobservable, φS(m, k) should be estimated or eliminated. One alternative is to simultaneously estimate the variances of the reverberation and the speech. Another alternative is to first block the speech signal using a standard BM, and then estimate the reverberation variance alone. The main goal of the paper is to analytically compare two ML estimators of the reverberation power. The main contributions of the paper are 1) to describe the ML estimators with and without using the blocking matrix; 2) to derive the CRBs of each estimator; 3) to analytically compare these CRBs; 4) to discuss the considerations for choosing an appropriate estimator for a given scenario; and in particular 5) to analytically study the influence of applying the blocking stage on the estimation error. This result may also be applicable to other algorithms that use a blocking matrix.
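The coherence model in (5) is straightforward to compute for an arbitrary array. The following sketch is our own illustration, not code from the paper; the array geometry, DFT length, and sampling rate used below are arbitrary assumptions:

```python
import numpy as np

def diffuse_coherence(mic_positions, k, K, fs, c=343.0):
    """Spatial coherence matrix Gamma(k) of a spherically isotropic
    (diffuse) field, Eq. (5): Gamma_ij = sinc(2*pi*k/K * D_ij / (Ts*c)).

    mic_positions: (N, 3) array of microphone coordinates in meters.
    k: frequency-bin index; K: number of frequency bins (DFT length);
    fs: sampling frequency (Ts = 1/fs); c: speed of sound in m/s.
    """
    pos = np.asarray(mic_positions, dtype=float)
    # Pairwise inter-microphone distances D_ij
    D = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    Ts = 1.0 / fs
    x = 2.0 * np.pi * k / K * D / (Ts * c)
    # np.sinc(t) computes sin(pi*t)/(pi*t), so divide the argument by pi
    return np.sinc(x / np.pi)

# Example: a linear array of three microphones, 5 cm apart, fs = 16 kHz
pos = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.10, 0.0, 0.0]])
Gamma = diffuse_coherence(pos, k=32, K=512, fs=16000)
```

The resulting matrix is symmetric with a unit diagonal, as required of a coherence matrix.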

In the following sections, we first derive the MLEs (Section III) and then derive and compare their respective CRBs (Section IV).

III. PRESENTATION OF THE TWO MAXIMUM LIKELIHOOD SCHEMES

First, the joint MLE of both the reverberation and the anechoic speech PSDs is derived. Second, the MLE is derived for the reverberation PSD alone, while the speech signal is blocked. In the following, whenever applicable, the frequency index k and the time index m are omitted for brevity.

A. Joint Estimation of Reverberation and Speech PSDs

The PSDs can be assumed to be smooth over time. It is therefore proposed to estimate φS and φR using a set of successive segments by concatenating several snapshots, assumed to be i.i.d. Denote ȳ(m) as a vector of M i.i.d. past snapshots of y:

ȳ(m) ≡ [y^T(m − M + 1) . . . y^T(m)]^T.   (6)

Collecting all definitions, the p.d.f. of the microphone signals ȳ(m) is

f(ȳ(m)) = ∏_{m′=m−M+1}^{m} (1 / (π^N |Φy|)) exp(−y^H(m′) Φy^{−1} y(m′))

        = (1 / (π^N |Φy|))^M exp(−∑_{m′=m−M+1}^{m} y^H(m′) Φy^{−1} y(m′))

        = (1 / (π^N |Φy|))^M exp(−Tr[Φy^{−1} M Ry(m)])

        = ((1 / (π^N |Φy|)) exp(−Tr[Φy^{−1} Ry(m)]))^M,   (7)
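The step in (7) from the sum of quadratic forms to the trace form can be checked numerically. The sketch below is a self-contained illustration with synthetic data; the Hermitian positive-definite matrix standing in for Φy and the snapshots are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 100  # microphones, snapshots

# A random Hermitian positive-definite stand-in for Phi_y
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Phi_y = A @ A.conj().T + N * np.eye(N)

# M i.i.d. complex snapshots y(m') stacked as columns
Y = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

Phi_inv = np.linalg.inv(Phi_y)
R_y = (Y @ Y.conj().T) / M                      # sample PSD matrix, Eq. (9)

# Sum of quadratic forms and its trace form, as used in Eq. (7)
quad_sum = sum((Y[:, m].conj() @ Phi_inv @ Y[:, m]).real for m in range(M))
trace_form = np.trace(Phi_inv @ (M * R_y)).real

# Log-likelihood corresponding to Eq. (7)
log_lik = -M * (N * np.log(np.pi) + np.log(np.linalg.det(Phi_y).real)) - trace_form
```

The two quantities `quad_sum` and `trace_form` agree to machine precision, which is exactly the identity that turns the product of Gaussian densities into the M-th power form of (7).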

where Φy is the PSD matrix of the observed signals, which is assumed to be fixed during the entire segment and is defined as

Φy = φS gd gd^H + φR Γ + Φv   (8)

and

Ry(m) ≡ (1/M) ∑_{m′=m−M+1}^{m} y(m′) y^H(m′).   (9)

The MLE aims at finding the most likely parameter set φ = [φR φS]^T w.r.t. the p.d.f. of ȳ(m), i.e.,

φ^{ML,y} = argmax_φ log f(ȳ(m); φ).   (10)

Since there is no closed-form solution for φ^{ML,y}, Newton's method may be applied to derive an iterative search (cf. [20])

φ^{(ℓ+1)} = φ^{(ℓ)} − H^{−1}(φ^{(ℓ)}) d(φ^{(ℓ)}),   (11)

where d(φ) is the first-order derivative of the log-likelihood with respect to φ, and H(φ) is the corresponding Hessian matrix, i.e.,

d(φ) ≡ ∂ log f(ȳ(m); φ) / ∂φ,

H(φ) ≡ ∂² log f(ȳ(m); φ) / (∂φ ∂φ^T).   (12)

The first-order derivative d(φ) is a 2-dimensional vector

d(φ) ≡ [DR(φ) DS(φ)]^T,   (13)

Di(φ) = M Tr[ (Φy^{−1} Ry(m) − I) Φy^{−1} (∂Φy/∂φi) ],   (14)

where i ∈ {R, S}, ∂Φy/∂φR = Γ, and ∂Φy/∂φS = gd gd^H. The Hessian is a 2 × 2 matrix:

H(φ) ≡ [ HRR(φ)  HSR(φ)
         HRS(φ)  HSS(φ) ].   (15)


Applying a second derivative to DR(φ) and DS(φ) yields the elements of H(φ):

Hij(φ) = −M Tr[ Φy^{−1} (∂Φy/∂φj) Φy^{−1} Ry(m) Φy^{−1} (∂Φy/∂φi) + (Φy^{−1} Ry(m) − I) Φy^{−1} (∂Φy/∂φj) Φy^{−1} (∂Φy/∂φi) ],   (16)

where i, j ∈ {R, S}.

In order to apply Newton's method, the following practical considerations have to be taken into account:
1) The PSDs of the reverberation and the speech are always positive. Thus, Newton's method may be initialized with φ^{(0)} = 0.
2) In cases where the first-order derivatives d(φ) at φ = [0, 0]^T are negative, the maximum point is probably located in the third quadrant (i.e., φS < 0 and φR < 0). In such a case, the ML estimates φR^{ML,y} and φS^{ML,y} can be set to zero without executing Newton's method. However, to avoid dividing by zero, the reverberation and speech levels can be set to some small positive value ε.
3) To avoid illegal PSD level estimates, the parameter set φ^{(ℓ)} may be confined to the range [ε, Z], where Z ≡ (1/N) y^H y − (1/N) Tr[Φv(k)] is the a posteriori level of the entire reverberant speech component.

The joint ML estimation of φ is summarized in Algorithm 1.
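The Newton search (11)–(16) can be sketched as follows. This is our own minimal illustration under stated assumptions, not the authors' Algorithm 1: only the lower clamp to ε is implemented (the upper clamp to Z is omitted), and a fixed iteration count is assumed.

```python
import numpy as np

def joint_newton_mle(R_y, Gamma, g_d, Phi_v, M, phi0=None, n_iter=30, eps=1e-8):
    """Sketch of the joint Newton search of Eqs. (11)-(16) for
    phi = [phi_R, phi_S]; Gamma, g_d and Phi_v are assumed known."""
    N = Gamma.shape[0]
    # Derivatives of Phi_y w.r.t. phi_R and phi_S (below Eq. (14))
    dPhi = [Gamma, np.outer(g_d, np.conj(g_d))]
    phi = np.array([eps, eps]) if phi0 is None else np.asarray(phi0, dtype=float)
    for _ in range(n_iter):
        Phi_y = phi[0] * dPhi[0] + phi[1] * dPhi[1] + Phi_v   # Eq. (8)
        P = np.linalg.inv(Phi_y)
        E = P @ R_y - np.eye(N)
        # First-order derivatives, Eq. (14)
        d = np.array([M * np.trace(E @ P @ D).real for D in dPhi])
        # Hessian, Eq. (16)
        H = np.empty((2, 2))
        for i in range(2):
            for j in range(2):
                H[i, j] = -M * np.trace(
                    P @ dPhi[j] @ P @ R_y @ P @ dPhi[i]
                    + E @ P @ dPhi[j] @ P @ dPhi[i]).real
        # Newton update, Eq. (11), clamped to keep the PSDs positive
        phi = np.maximum(phi - np.linalg.solve(H, d), eps)
    return phi
```

A quick sanity check: with Ry set exactly to the model PSD matrix (8) (the infinite-data limit), the gradient (14) vanishes at the true parameters, so the iteration is a fixed point there.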

Additional technical details can be found in [13]. For a noiseless case (which can be determined by evaluating Tr[Φv]), a closed-form solution for φR is available [11, Eq. 7a]:

φR^{ML,y} = (1/(N − 1)) Tr[ ( Γ^{−1} − (Γ^{−1} gd gd^H Γ^{−1}) / (gd^H Γ^{−1} gd) ) Ry(m) ].   (17)

B. Estimation of the Reverberation PSD Using a BM

In this section, the speech component is first blocked, and then an MLE of the reverberation PSD is derived. The speech signal is blocked using a standard blocking matrix satisfying B^H gd = 0. A sparse option for the blocking matrix is (see, e.g., [4], [21])

B = [ −Gd,2^*/Gd,1^*   −Gd,3^*/Gd,1^*   . . .   −Gd,N^*/Gd,1^*
      1                 0                . . .   0
      0                 1                . . .   0
      ⋮                 ⋮                ⋱       ⋮
      0                 0                . . .   1 ],   (18)

i.e., the first row contains the conjugated DTF ratios and the remaining rows form the (N − 1) × (N − 1) identity matrix.

The output of the blocking matrix is given by

u = B^H y = B^H (r + v).   (19)

The p.d.f. of a set of M i.i.d. snapshots ū(m) (where ū(m) is defined similarly to (6)) is given by

f(ū(m)) = ( (1 / (π^{N−1} |Φu|)) exp(−Tr[Φu^{−1} Ru(m)]) )^M,   (20)

where

Φu = B^H (φR Γ + Φv) B   (21)

and

Ru(m) = (1/M) ∑_{m′=m−M+1}^{m} u(m′) u^H(m′) = B^H Ry(m) B.   (22)

Under this model, the MLE of φR aims at finding the most likely φR w.r.t. the p.d.f. of ū(m), i.e.,

φR^{ML,u} = argmax_{φR} log f(ū(m); φR).   (23)

Similar to the previous estimator, there is no closed-form solution for φR^{ML,u}. Newton's method may be used to derive an iterative search (cf. [20])

φR^{(ℓ+1)} = φR^{(ℓ)} − D(φR^{(ℓ)}) / H(φR^{(ℓ)}),   (24)

where D(φR) and H(φR) are the first- and second-order derivatives

D(φR) ≡ ∂ log f(ū(m); φR) / ∂φR,   (25)

H(φR) ≡ ∂² log f(ū(m); φR) / ∂φR².   (26)

The first-order derivative equals

D(φR) = M Tr[ (Φu^{−1} Ru(m) − I) Φu^{−1} (∂Φu/∂φR) ],   (27)

where ∂Φu/∂φR = B^H Γ B. The second-order derivative equals

H(φR) = −M Tr[ Φu^{−1} (∂Φu/∂φR) Φu^{−1} Ru(m) Φu^{−1} (∂Φu/∂φR) + (Φu^{−1} Ru(m) − I) Φu^{−1} (∂Φu/∂φR) Φu^{−1} (∂Φu/∂φR) ].   (28)

Practical considerations similar to those described in the previous section can be applied to the estimator which uses the BM. The MLE of φR which uses the BM is summarized in Algorithm 2. Additional technical details can be found in [10].
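For the BM-based estimator the Newton recursion (24)–(28) is scalar. A minimal sketch under assumptions analogous to the earlier one (B^H Γ B and B^H Φv B precomputed and known, fixed iteration count, lower clamp only; not the authors' Algorithm 2):

```python
import numpy as np

def bm_newton_mle(R_u, Gamma_u, Phi_vu, M, phi0=1.0, n_iter=30, eps=1e-8):
    """Scalar Newton search of Eqs. (24)-(28) for phi_R from the BM
    outputs. Gamma_u = B^H Gamma B and Phi_vu = B^H Phi_v B are assumed
    known; R_u is the sample PSD matrix of the BM outputs, Eq. (22)."""
    Nb = Gamma_u.shape[0]
    phi = float(phi0)
    for _ in range(n_iter):
        Phi_u = phi * Gamma_u + Phi_vu                     # Eq. (21)
        P = np.linalg.inv(Phi_u)
        E = P @ R_u - np.eye(Nb)
        D = M * np.trace(E @ P @ Gamma_u).real             # Eq. (27)
        H = -M * np.trace(P @ Gamma_u @ P @ R_u @ P @ Gamma_u
                          + E @ P @ Gamma_u @ P @ Gamma_u).real  # Eq. (28)
        phi = max(phi - D / H, eps)                        # Eq. (24), clamped
    return phi
```

As with the joint estimator, setting Ru exactly to the model PSD matrix (21) makes the gradient (27) vanish at the true level, so the iteration is a fixed point there.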

For the noiseless case, a closed-form solution for φR is available [7, Eq. 7]:

φR^{ML,u} = (1/(N − 1)) Tr[ (B^H Γ B)^{−1} Ru(m) ].   (29)

Interestingly, the estimators in (17) and in (29) are mathematically identical. In Appendix A, the following identity is proved:

Γ^{−1} − (Γ^{−1} gd gd^H Γ^{−1}) / (gd^H Γ^{−1} gd) = B (B^H Γ B)^{−1} B^H.   (30)

Substituting the above identity in (17) and using the cyclic property of the trace operation on B yields the estimator in (29). Hence, in the absence of noise the two ML estimators are identical.
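The identity (30), and hence the equivalence of (17) and (29), can be verified numerically. In the sketch below, Γ is a random Hermitian positive-definite stand-in for the coherence matrix (an illustrative assumption) and B is the sparse BM of (18):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Random Hermitian positive-definite stand-in for Gamma
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = A @ A.conj().T + N * np.eye(N)

# Blocking matrix of Eq. (18), satisfying B^H g = 0
B = np.zeros((N, N - 1), dtype=complex)
B[0, :] = -np.conj(g[1:]) / np.conj(g[0])
B[1:, :] = np.eye(N - 1)

Gi = np.linalg.inv(Gamma)
lhs = Gi - (Gi @ np.outer(g, g.conj()) @ Gi) / (g.conj() @ Gi @ g)  # (30), LHS
rhs = B @ np.linalg.inv(B.conj().T @ Gamma @ B) @ B.conj().T        # (30), RHS

# Consequently the noiseless estimators (17) and (29) coincide for any R_y
C = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
R_y = C @ C.conj().T
est17 = np.trace(lhs @ R_y).real / (N - 1)
est29 = np.trace(np.linalg.inv(B.conj().T @ Gamma @ B)
                 @ (B.conj().T @ R_y @ B)).real / (N - 1)
```

Both sides of (30) agree to machine precision, and the two estimates are equal, mirroring the trace-cycling argument in the text.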

IV. CRB ANALYSIS OF THE ML ESTIMATORS

In the following section, the CRBs of the estimators summarized in Section III are derived and compared. By comparing the CRBs, some insights about the estimators may be deduced.

A. CRB Derivation for the Joint Estimator

In the following section, the CRB for φR under the joint estimator of the reverberation and speech PSDs is derived. To derive the CRB, the Fisher information matrix (FIM) of φ^{ML,y} is first derived. The FIM of φR and φS w.r.t. the p.d.f. of y can be decomposed as

I^y = [ I^y_RR  I^y_SR
        I^y_RS  I^y_SS ],   (31)

where I^y_ij is the Fisher information of φi and φj (where i, j ∈ {R, S}). The CRB for φR is obtained by inverting the FIM and taking the corresponding component:

CRB^y_φR = {(I^y)^{−1}}_11 = I^y_SS / (I^y_RR I^y_SS − I^y_RS I^y_SR).   (32)

Both φR and φS are components of the PSD matrix Φy. The Fisher information of two parameters associated with a PSD matrix of a Gaussian vector is given by [22], [12], [23]:

I^y_ij = M Tr[ Φy^{−1} (∂Φy/∂φi) Φy^{−1} (∂Φy/∂φj) ].   (33)

Substituting the derivatives ∂Φy/∂φR and ∂Φy/∂φS in (33) yields the expressions for the elements of the FIM:

I^y_RR = M Tr[ Φy^{−1} Γ Φy^{−1} Γ ],   (34a)

I^y_SS = M Tr[ Φy^{−1} gd gd^H Φy^{−1} gd gd^H ],   (34b)

I^y_RS = I^y_SR = M Tr[ Φy^{−1} Γ Φy^{−1} gd gd^H ].   (34c)

To continue the derivation, Φy^{−1} is calculated using the matrix-inversion lemma:

Φy^{−1} = Ψ^{−1} − (Ψ^{−1} gd gd^H Ψ^{−1}) / (φS^{−1} + gd^H Ψ^{−1} gd),   (35)

where

Ψ ≡ φR Γ + Φv.   (36)

Substituting (35) in (34) and employing some algebraic steps, the following results are obtained:

I^y_RR = M ( γ0 − 2γ3/(γ1 + φS^{−1}) + γ2²/(γ1 + φS^{−1})² ),   (37a)

I^y_SS = M γ1² / ( φS² (γ1 + φS^{−1})² ),   (37b)

I^y_SR = M γ2 / ( φS² (γ1 + φS^{−1})² ),   (37c)

where

γ0 = Tr[ Ψ^{−1} Γ Ψ^{−1} Γ ],   (38a)

γ1 = gd^H Ψ^{−1} gd,   (38b)

γ2 = gd^H Ψ^{−1} Γ Ψ^{−1} gd,   (38c)

γ3 = gd^H Ψ^{−1} Γ Ψ^{−1} Γ Ψ^{−1} gd.   (38d)

Finally, substituting the results from (37) in (32) yields the final expression for the CRB:

CRB^y_φR(φS) = (1/M) · 1 / ( γ0 − 2γ3/γ1 + γ2²/γ1² + 2δ/(γ1²(φS γ1 + 1)) ),   (39)

where δ ≡ γ3 γ1 − γ2².


For the noiseless case the definitions in (38) are simplified to:

γ0 = N φR^{−2},   (40a)

γ1 = φR^{−1} gd^H Γ^{−1} gd,   (40b)

γ2 = φR^{−2} gd^H Γ^{−1} gd,   (40c)

γ3 = φR^{−3} gd^H Γ^{−1} gd.   (40d)

The expression for the CRB becomes simpler and identical to the CRB derived in [14, Eq. 12]:

CRB^y_φR = (1/M) · φR² / (N − 1).   (41)
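The simplification from (40) to (41) can be reproduced numerically: in the noiseless case δ vanishes and the CRB depends on neither Γ nor gd. The sketch below uses random stand-ins for Γ and gd (illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, phi_R = 6, 25, 0.8
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = A @ A.conj().T + N * np.eye(N)   # Hermitian positive-definite stand-in

# Noiseless case: Psi = phi_R * Gamma, so the quantities of Eq. (40) become
Gi = np.linalg.inv(Gamma)
q = (g.conj() @ Gi @ g).real             # gd^H Gamma^{-1} gd
gamma0 = N / phi_R**2                    # Eq. (40a)
gamma1 = q / phi_R                       # Eq. (40b)
gamma2 = q / phi_R**2                    # Eq. (40c)
gamma3 = q / phi_R**3                    # Eq. (40d)

# Cross-check gamma0 against its definition (38a) with Psi = phi_R * Gamma
Psi_inv = Gi / phi_R
gamma0_direct = np.trace(Psi_inv @ Gamma @ Psi_inv @ Gamma).real

delta = gamma3 * gamma1 - gamma2**2      # vanishes in the noiseless case
crb = 1.0 / (M * (gamma0 - 2 * gamma3 / gamma1 + gamma2**2 / gamma1**2))
```

Here `delta` evaluates to zero and `crb` collapses to φR²/(M(N − 1)), i.e., Eq. (41), independently of the drawn Γ and gd.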

B. CRB for the Estimator With BM

In the following section, the FIM and the CRB of φR w.r.t. the p.d.f. of u are derived. Similarly to (33), the Fisher information of φR equals

I^u_RR = M Tr[ Φu^{−1} (∂Φu/∂φR) Φu^{−1} (∂Φu/∂φR) ],   (42)

and the corresponding CRB is given by

CRB^u_φR = (I^u_RR)^{−1}.   (43)

Using (21) and (36), one can notice that Φu^{−1} = (B^H Ψ B)^{−1}. Substituting this expression in (42), the FIM now reads:

I^u_RR = M Tr[ (B^H Ψ B)^{−1} B^H Γ B (B^H Ψ B)^{−1} B^H Γ B ].   (44)

To derive the expression for the CRB, the identity presented in (30) may be used (the identity is proved in Appendix A):

B (B^H Ψ B)^{−1} B^H = Ψ^{−1} − (Ψ^{−1} gd gd^H Ψ^{−1}) / (gd^H Ψ^{−1} gd).   (45)

Using the cyclic property of the trace operation, the Fisher information from (44) may be written as

I^u_RR = M Tr[ B (B^H Ψ B)^{−1} B^H Γ B (B^H Ψ B)^{−1} B^H Γ ].   (46)

Substituting the identity from (45) in (46), and using the definitions in (38), the Fisher information can be obtained as

I^u_RR = M ( γ0 − 2γ3/γ1 + γ2²/γ1² ).   (47)

Finally, employing the definition of the CRB in (43) yields the expression for the CRB:

CRB^u_φR = (1/M) · 1 / ( γ0 − 2γ3/γ1 + γ2²/γ1² ).   (48)

Note that the expressions of the CRBs in (39) and (48) are similar apart from the additional component in the denominator of (39), namely 2δ/(γ1²(φS γ1 + 1)). For the noiseless case, it can be shown that CRB^u_φR is identical to CRB^y_φR in (41).
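The comparison between (39) and (48) can be reproduced numerically from the definitions in (38). The sketch below uses random stand-ins for Γ and gd and a spatially white Φv (all illustrative assumptions) and checks that δ ≥ 0, that CRB^y ≤ CRB^u, and that CRB^y approaches CRB^u for large φS, i.e., the limit (49) below:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, phi_R = 4, 30, 0.8
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = A @ A.conj().T + N * np.eye(N)   # Hermitian positive-definite stand-in
Phi_v = 0.5 * np.eye(N)                  # spatially white ambient noise

Psi = phi_R * Gamma + Phi_v              # Eq. (36)
Pi = np.linalg.inv(Psi)

gamma0 = np.trace(Pi @ Gamma @ Pi @ Gamma).real               # Eq. (38a)
gamma1 = (g.conj() @ Pi @ g).real                             # Eq. (38b)
gamma2 = (g.conj() @ Pi @ Gamma @ Pi @ g).real                # Eq. (38c)
gamma3 = (g.conj() @ Pi @ Gamma @ Pi @ Gamma @ Pi @ g).real   # Eq. (38d)
delta = gamma3 * gamma1 - gamma2**2

base = gamma0 - 2 * gamma3 / gamma1 + gamma2**2 / gamma1**2
crb_u = 1.0 / (M * base)                                      # Eq. (48)

def crb_y(phi_S):                                             # Eq. (39)
    return 1.0 / (M * (base + 2 * delta / (gamma1**2 * (phi_S * gamma1 + 1))))
```

Since δ ≥ 0 (a Cauchy-Schwarz consequence, proved in Appendix B), the extra denominator term is non-negative and shrinks as φS grows, which is exactly the ordering and the limit discussed next.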

Fig. 1. Illustration of CRB^y_φR(φS) and CRB^u_φR as a function of φS.

C. Comparison of the CRBs

In the following section, the CRBs in (39) and (48) are compared, to determine which of them provides the more accurate estimate in the minimum MSE sense.

Although CRB^u_φR is independent of φS while CRB^y_φR(φS) depends on φS, both CRBs may be evaluated as a function of φS. The following observations can be made w.r.t. CRB^u_φR and CRB^y_φR(φS):

1) Compared to CRB^u_φR, CRB^y_φR(φS) has an additional component in the denominator that is equal to 2δ/(γ1²(φS γ1 + 1)). Since γ1 is positive¹ and φS is positive, the sign of the additional component depends only on δ. In Appendix B, it is proven that δ ≥ 0. Accordingly, it can be concluded that CRB^y_φR(φS) is always smaller than or equal to CRB^u_φR.

2) The additional component 2δ/(γ1²(φS γ1 + 1)) is inversely proportional to φS. Hence, it can be deduced that CRB^y_φR(φS) is monotonically increasing as a function of φS. Moreover, the limit of CRB^y_φR(φS) as φS goes to infinity is

lim_{φS→∞} CRB^y_φR(φS) = CRB^u_φR.  (49)

Finally, an illustration of CRB^y_φR(φS) and CRB^u_φR as a function of φS is depicted in Fig. 1.

Some relevant conclusions from the behavior of the CRBscan be drawn:

1) Since, theoretically, CRB^y_φR(φS) ≤ CRB^u_φR consistently, the ML estimation of φR without the blocking operation has, asymptotically, a lower MSE than the ML estimation of φR with the blocking operation.

2) When the signal to reverberation plus noise ratio (SRNR) decreases (i.e., φS becomes smaller), the performance difference increases. This probably occurs since low levels of speech render the blocking operation unnecessary. However, in high SRNR cases, the blocking operation becomes more effective.

3) In the case where φS = 0 (i.e., the blocking action is entirely unnecessary), the CRB for the joint estimator simplifies to

CRB^y_φR(φS = 0) = (1/M) · 1 / ((γ0 − 2γ3/γ1 + γ2²/γ1²) + 2δ/γ1²).  (50)

Thus, even in cases where the blocking operation is unnecessary, the CRBs are equal only if δ equals zero.

¹Ψ consists of PSD matrices and therefore is a positive-definite matrix. Thus, according to (38b), γ1 is always positive.


4) In Appendix B, it is shown that δ equals zero when: 1) theambient noise level equals zero, or 2) the spatial fields ofthe ambient noise and the reverberation are identical. Insuch cases both CRBs are identical.

5) For large inter-microphone distances and/or high frequen-cies, the reverberant sound field tends to be spatially white.If the ambient noise is also spatially white, it can be shownthat both CRBs are identical.

V. PERFORMANCE EVALUATION

In this section, an experimental comparison between the reverberation PSD estimators in Sections III-A and III-B is presented. First, a Monte-Carlo experiment was carried out to investigate and confirm the theoretical results presented in this paper. Secondly, the reverberation PSD was estimated from real reverberant and noisy signals using the ML estimators and other competing estimators. The average log-error between the estimated reverberation PSDs and the oracle reverberation PSDs is presented. Thirdly, a dereverberation and noise reduction algorithm was applied using the estimates of the two estimators (and some other competing estimators). The output signal was compared with the direct-path signals in terms of the perceptual evaluation of speech quality (PESQ) score and the log-spectral distance (LSD).

A. CRB Comparison Using Monte-Carlo Simulations

This section provides several Monte-Carlo simulations to demonstrate the theoretical results obtained in the previous sections.

In each simulation, the influence of only one parameter is examined while the other parameters remain fixed. N microphone signals were synthetically created with M i.i.d. snapshots. Pure-tone signals with frequency f = 800 Hz were used throughout this section. A uniform linear array was simulated. The DTF g_d was assumed to be a pure delay system

G_{d,i} = exp(−ι2πfτ · (i − 1)),  i ∈ {1, 2, . . . , N}  (51)

where ι = √−1 and τ is the time difference of arrival (TDOA) between two neighboring microphones. The TDOA depends on the direction of arrival (DOA), τ = d cos(θ)/c, where θ is the angle of arrival (from 0° to 180°), d is the inter-microphone distance and c is the sound velocity (c = 340 m/s). The spatial coherence matrix Γ was set as a diffuse noise field, Γ_ij = sinc(2πfd|i − j|/c). The noise PSD matrix was set as Φ_v = φV I, where I denotes the identity matrix. The values of the nominal parameter set are presented in Table I.

In each simulation, 2000 Monte-Carlo experiments were executed. In each experiment, φR was estimated according to Algorithms 1 and 2. The MSE of the reverberation PSD estimate was computed.

Four measurements were calculated: 1) the theoretical CRB in (39), which refers to the joint ML estimator that circumvents the application of the blocking operation; 2) the MSE of the estimates using Algorithm 1; 3) the theoretical CRB in (48), which refers to the ML estimator that uses the blocking operation; 4) the MSE of the estimates using Algorithm 2. In the following sections,

TABLE I
NOMINAL PARAMETER SET FOR THE SIMULATED EXPERIMENTS

Parameter                       Nominal value
Snapshot number M               100
Microphone number N             4
Speech PSD φS                   0.5
Rev. PSD φR                     0.5
Noise PSD φV                    0.5
Inter-microphone distance d     0.06 m
Frequency f                     800 Hz
Direction of arrival θ          90°
Iteration number L              10

Fig. 2. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of the number of snapshots M.

the four measurements are presented vs. the values of M, φS, φR, φV, N, d, θ and f.

1) CRBs and MSEs as a Function of the Number of Snapshots M: Fig. 2 shows the CRBs and the MSEs of the reverberation level estimates as a function of the number of snapshots M. The number of snapshots varies from 5 to 500 (it should be noted that, since in practice the reverberation is highly time-varying, such a high number of snapshots can only be used in this artificial experiment).

It can be verified that the theoretical CRB curves match well with the MSE curves, especially when the number of snapshots is large. For a small number of snapshots, the MSE curves are even lower than the CRB curves. This result may be attributed to the lower and upper delimitations described in Algorithms 1 and 2. As expected, the CRB and MSE curves decrease as the number of snapshots increases.

2) CRBs and MSEs as a Function of φS: In the second experiment, the oracle φS was set to various values, such that the SRNR varies from −20 dB to 15 dB, where SRNR ≡ 10 log(φS/(φR + φV)). Fig. 3 shows the CRBs and the MSEs of the reverberation level estimates as a function of the SRNR.

It can be verified that the theoretical CRB curves match the MSE curves and follow the theoretical contours depicted in Fig. 1. In low SRNR there is a difference between the CRBs, while in high SRNR the CRBs are identical.

3) CRBs and MSEs as a Function of φR: In the third experiment, the oracle φR was set to various values, such that the signal-to-reverberant ratio (SRR) varies from −20 dB to +20 dB, where SRR ≡ 10 log(φS/φR). The signal-to-noise ratio (SNR) was 0 dB, where SNR ≡ 10 log(φS/φV). Fig. 4 shows the CRBs and the MSEs of the reverberation level estimates as a function of the SRR. In the middle range, the MSE curves match the CRB curves. For high φR values (i.e., low SRR), the MSEs of both estimators are higher than the CRB curves. For low φR values (i.e., high SRR), the MSEs are smaller than the CRBs. These effects often occur when the oracle value of the desired parameter is very small or very large relative to the other parameters. We have no theoretical explanation for these effects.

Fig. 3. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of the SRNR.

Fig. 4. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of the SRR.

In Fig. 5, the difference between CRB^u_φR and CRB^y_φR,

D ≡ CRB^u_φR − CRB^y_φR,  (52)

is presented as a function of φS and φR. It can be verified that as φS decreases or φR increases, the difference becomes larger.

4) CRBs and MSEs as a Function of φV: In the fourth experiment, the noise power φV was set to various values, such that the SNR varies from −10 dB to +30 dB. The SRR was 0 dB. Fig. 6 shows the CRBs and the MSEs of the reverberation PSD estimates as a function of the SNR.

Fig. 5. The difference between the CRBs, D (52), as a function of φS and φR.

Fig. 6. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of the SNR with L = 10.

Fig. 7. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of the SNR with L = 20.

For high φV values (i.e., low SNR), both MSEs are lower than the CRB curves. In the middle range, the MSE curves match the CRB curves. For low φV values (i.e., high SNR), the MSEs are much higher than the CRB curves. Moreover, the MSE of Algorithm 2 outperforms the MSE of Algorithm 1. The CRB curves in the high SNR case are identical, as shown in (41). Apparently, in high SNR cases, the parameter update rule converges slowly [24], especially when two parameters are estimated. Therefore, it is worthwhile to increase the iteration number L in such cases. Fig. 7 shows the CRBs and the MSEs of the reverberation PSD estimates as a function of the SNR, where L = 20 iterations of the Newton search were applied


Fig. 8. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of N.

Fig. 9. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of d.

(nominally, L = 10 iterations were applied). It can be verified that the MSEs are now well matched with the CRBs.

5) CRBs and MSEs as a Function of the Microphone Number N: In the fifth experiment, the number of microphones is varied from 2 to 20. Fig. 8 shows the CRBs and the MSEs of the reverberation PSD estimates as a function of N.

The MSE and CRB curves decrease as the number of microphones increases. Conceptually, as the amount of data increases, the MSE decreases. Moreover, since the blocking stage blocks only one dimension, the difference between the CRB curves decreases as the number of dimensions of the data increases. There is a difference between the estimators when 3–7 microphones are used.

6) CRBs and MSEs as a Function of the Inter-Microphone Distance d: In the sixth experiment, four microphones were used. The inter-microphone distance d was set in the range 0.01–0.2 m. Fig. 9 depicts the CRBs and the MSEs of the reverberation PSD estimates as a function of d.

There is a noticeable difference between the estimators only in the range 0.03–0.12 m. Generally, the MSEs and the CRBs decrease as d increases. At large inter-microphone distances the diffuse field behaves as a spatially white noise field. In our case, the fields of the reverberation and the ambient noise are

Fig. 10. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of θ.

Fig. 11. The CRBs and the MSEs of the reverberation level estimates of the two MLE schemes as a function of f.

therefore identical. As was mentioned in Section IV-C, the CRBs are identical in this case.

7) CRBs and MSEs as a Function of the DOA: In the seventh experiment, various directions of arrival were examined. The angle of arrival θ was set in the range 1°–180°. Fig. 10 depicts the CRBs and the MSEs of the reverberation PSD estimates as a function of θ. It can be observed that the estimator without blocking achieves a lower CRB and MSE compared to the estimator with blocking for all directions of arrival. Interestingly, in endfire scenarios, the CRBs and the MSEs decrease.

8) CRBs and MSEs as a Function of the Frequency f: In the eighth experiment, the frequency was set in the range 0–4000 Hz. Fig. 11 shows the CRBs and the MSEs of the reverberation PSD estimates as a function of f.

There is a difference between the estimators in the frequency range 400–1300 Hz. The MSEs decrease as the frequency increases. At high frequencies the diffuse field resembles a spatially white noise field. As the reverberation and ambient noise fields are then identical, the CRBs are also identical.

B. Performance Evaluation With Measured Room Impulse Responses

The performance of the two MLEs is now examined using real reverberant and noisy signals. As in [13], the performance


of the two estimators is evaluated by: 1) examining the log-error between the estimated value of φ^ML_R and the true reverberation PSD, obtained by convolving the anechoic speech signal with the late component of the acoustic impulse response; and 2) utilizing the estimated PSD levels φ^ML_R in a speech dereverberation and noise reduction task.

1) Experimental Setup: The experiments consist of reverberant signals plus directional noise with various SNR levels. Spatially white (sensor) noise was also added, with power 20 dB lower than the directional noise power. Anechoic speech signals were convolved with room impulse responses (RIRs), downloaded from an open-source database recorded in our lab. Details about the database and the RIR identification method can be found in [25]. The reverberation time was set by adjusting the room panels, and was measured to be approximately T60 = 0.61 s. The SRR was measured to be approximately −5 dB. The spatial PSD matrix Φ_v was estimated using periods in which the desired speech source was inactive. An oracle voice activity detector was used for this task. The loudspeaker was positioned in front of a four-microphone linear array, such that the steering vector was set to g_d = [1 1 1 1]^T. The inter-distances between the microphones were [3, 8, 3] cm. The sampling frequency was 16 kHz, and the frame length of the short-time Fourier transform (STFT) was 32 ms with 8 ms between successive time-frames (i.e., 75% overlap). We also set ε = 10^{-10}. The number of iterations was set to L = 10. In cases where the a posteriori SNR, (1/N) y^H y / φV, was at least 60 dB, the high-SNR estimator (17) was used. All measures were computed by averaging the results obtained using 50 sentences, 4–8 s long, evenly distributed between female and male speakers.

Practically, due to the non-stationarity of the signals, we substituted the sliding-window averaging (9) by the recursive averaging that was also used in our previous work [13], i.e.,

R_y(m) = α R_y(m − 1) + (1 − α) y(m) y^H(m),  (53)

where 0 ≤ α < 1 is a smoothing factor. The smoothing factor α was set to 0.7, i.e., only a few frames are involved in the smoothing. At each time-index m, R_y(m) in (53) was used for estimating φS(m) and φR(m).
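The recursive averaging in (53) can be sketched in a few lines. This is our own illustrative sketch: the frames y(m) stand in for the STFT vectors of the N microphones at one frequency bin and are synthetic here.

```python
import numpy as np

# Minimal sketch of the recursive PSD-matrix averaging in (53).
rng = np.random.default_rng(1)
N, alpha = 4, 0.7                       # alpha = 0.7 as in the experiments

R_y = np.zeros((N, N), dtype=complex)
for m in range(200):
    # Synthetic circularly-symmetric complex Gaussian frame (illustrative)
    y = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    # Eq. (53): R_y(m) = alpha * R_y(m-1) + (1 - alpha) * y(m) y(m)^H
    R_y = alpha * R_y + (1 - alpha) * np.outer(y, y.conj())

# With alpha = 0.7 the frame weights decay as 0.3 * 0.7^k, so only a few of
# the most recent frames effectively contribute to the estimate.
print(np.diag(R_y).real)
```

The short effective memory is the point of the design choice: the reverberation PSD is highly time-varying, so a long window would smear it.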

2) Accuracy of the ML Estimators: The performance of the two ML estimators was compared to two other existing estimators in terms of the log-error between the estimated PSD and the oracle PSD: 1) the estimator in [8], denoted henceforth as Braun2013; 2) the estimator in [26],² denoted henceforth as Lefkimmiatis2006.

In Fig. 12, the results for several SNR levels are depicted. The mean log-errors between the estimated PSD levels and the oracle PSD levels are presented. In order to calculate the oracle PSD levels, the anechoic speech was filtered with the reverberation tails of the RIRs. The reverberation tails were set to start 2 ms after the arrival time of the direct path. To reduce the variance of the oracle PSD, the mean value of the oracle PSDs

²Note that the algorithm in [26] aims to estimate the noise variance given the received signals' PSD matrix, while the noise coherence matrix is assumed to be known. In our implementation, we treat the reverberation as an additive noise and estimate its level, while we subtract the ambient noise PSD matrix from the received signals' PSD matrix.

Fig. 12. Log-errors of the two ML reverberation PSD level estimators in comparison with Braun2013 [8] and Lefkimmiatis2006 [26]. The upper part of each bar represents the underestimation error, while the lower part represents the overestimation error.

Fig. 13. Occurrence of the various SRNRs in the time-frequency bins.

over all microphones was computed. The resulting bars are split to distinguish between underestimation errors and overestimation errors. It is evident that the joint estimator outperforms the other three competing estimators in terms of overall log-error for the −2 dB, 3 dB and 8 dB SNR values. For the higher SNR of 13 dB, Lefkimmiatis2006 and Braun2013 outperform the proposed ML estimators.

We also examined the mean log-errors between the estimated PSD levels and the oracle PSD levels for various values of SRNR. We computed the true SRNR for each time-frequency bin and calculated the mean log-errors for each SRNR separately. In Fig. 13, the occurrence of the various SRNRs in the time-frequency bins is presented. It can be seen that the most common SRNRs are found in the range −10 dB to 0 dB. In Fig. 14, the mean log-error results are presented vs. the SRNR level for the blocking-based ML estimator and the non-blocking-based ML estimator. It can be seen that for low SRNR the joint ML estimator outperforms the blocking ML estimator, but for high SRNR the mean log-errors of the two estimators become equal. It can be deduced that for the most frequently occurring SRNRs the joint estimator outperforms the blocking ML estimator. This result is in

Fig. 14. Mean log-errors of the two ML late reverberation PSD estimators vs. the SRNR axis.

Fig. 15. Mean log-errors of Braun2013 and Lefkimmiatis2006 vs. the SRNR axis.

line with the trends presented in the theoretical derivation in Section IV-C. In Fig. 15, the mean log-error results are presented vs. the SRNR level for Braun2013 and Lefkimmiatis2006.

3) Dereverberation Performance: The performance of the proposed estimators was also examined by utilizing the estimated PSDs in a joint dereverberation and noise reduction task. The estimated PSDs were used to compute the multichannel Wiener filter presented in [4]. The multichannel Wiener filter (MCWF) was designed to jointly suppress the power of the total interference (i.e., the reverberation and the noise) by deriving the MMSE estimator of the direct path. The MCWF was implemented by a two-stage approach: a minimum variance distortionless response beamformer followed by a corresponding postfilter. The performance of the dereverberation algorithm was evaluated in terms of two objective measures commonly used in the speech enhancement community, namely PESQ [27] and LSD. The clean reference for evaluation in all cases was the anechoic speech signal filtered by only the direct path of the RIR.

In Tables II and III, the performance measures for several input SNR levels are depicted. The joint estimator outperforms all competing estimators with respect to the LSD measures, while the estimator that uses the blocking outperforms all competing estimators with respect to the PESQ measures. The PESQ seems

TABLE II
PESQ SCORES FOR THE MCWF [4] USING VARIOUS ESTIMATORS

SNR                     −2 dB   3 dB    8 dB    13 dB
Unprocessed             1.33    1.53    1.74    1.86
Oracle φR               1.91    2.04    2.11    2.15
Braun2013 [8]           1.78    1.96    2.07    2.13
Lefkimmiatis2006 [26]   1.81    1.98    2.08    2.13
ML with blocking        1.91    2.06    2.14    2.17
ML without blocking     1.83    1.97    2.06    2.12

TABLE III
LSD SCORES FOR THE MCWF [4] USING VARIOUS ESTIMATORS

SNR                     −2 dB   3 dB    8 dB    13 dB
Unprocessed             9.42    8.08    6.91    6.14
Oracle φR               5.93    5.47    5.17    4.95
Braun2013 [8]           6.27    5.65    5.30    5.10
Lefkimmiatis2006 [26]   6.10    5.58    5.27    5.08
ML with blocking        6.13    5.55    5.20    4.98
ML without blocking     6.09    5.51    5.19    4.96

to mainly reflect reverberation reduction rather than speech distortion. The estimator that uses the blocking slightly overestimates the reverberation level and hence results in more reverberation reduction. The dereverberation and noise reduction results are given for a specific case merely to illustrate one usage of the estimated reverberation power. The differences between the performance of the two variants are indeed small. Note that the oracle PSD of the reverberation level does not yield a significantly better result either.

VI. CONCLUSION

In this work, two ML estimators of the reverberation PSD level in a noisy environment were compared. The first one jointly estimates the reverberation and the anechoic speech PSDs. The second one first blocks the speech component and then estimates the reverberation PSD using the outputs of the blocking matrix. The CRB expressions associated with the two ML estimators were derived and compared. The CRB associated with the joint estimator is proven to be lower than the CRB associated with the estimator that uses the blocking stage. An experimental study computed the MSE curves of the two estimators vs. the various problem parameters. Monte-Carlo simulations show that the MSE is consistent with the CRB curves. The ML estimators were applied to real recordings in order to estimate the reverberation PSD levels. The estimated levels were utilized to implement a recently proposed dereverberation and noise reduction algorithm. The results obtained by the two ML estimators are presented and compared with the results obtained by other competing estimators from the literature.

APPENDIX A

In this appendix, the following identity is proven:

B (B^H Ψ B)^{-1} B^H = Ψ^{-1} − (Ψ^{-1} g_d g_d^H Ψ^{-1}) / (g_d^H Ψ^{-1} g_d).  (A.54)


Note that the only condition for the identity is B^H g_d = 0. In order to prove the above identity, the well-known criterion of the MVDR beamformer (BF) is used in the following. Regarding the signal model in (3), the MVDR-BF that minimizes the power of the interference (i.e., the reverberation and the noise) while maintaining a distortionless response of the direct path is defined by [21]

h_MVDR = argmin_h h^H Ψ h  s.t.  h^H g_d = F,  (A.55)

where F is the desired response. The solution of this optimization problem is

h_MVDR = (Ψ^{-1} g_d / (g_d^H Ψ^{-1} g_d)) F*.  (A.56)

The MVDR-BF can be implemented using a generalized sidelobe canceller (GSC) structure [21], [28]:

h_MVDR = h_0 − B h_NC,  (A.57)

where h_0 is the fixed beamformer (FBF) satisfying h_0^H g_d = F, and h_NC is the noise canceller (NC) that is responsible for mitigating the residual interference at the output of the FBF,

h_NC = (B^H Ψ B)^{-1} B^H Ψ h_0.  (A.58)

A common choice for h_0 is h_0 = g_d/‖g_d‖² [21]. Note that for such a choice, the branches of the GSC (i.e., h_0 and −B h_NC) are orthogonal.

In cases where the desired response F is set as F = G_{d,i} (where G_{d,i} is the i-th component of g_d), the FBF can be set as h_0 = I_i (where I_i is the i-th column of the identity matrix I), since I_i^H g_d = G_{d,i}. Note that by this selection, the MVDR decomposes into two vectors which are not necessarily orthogonal. In previous work [4], we elaborated on the benefits that can be gained from such a structure.

Denoting the MVDR-BF that satisfies F = G_{d,i} by h_{MVDR,i}, the standard MVDR expression (A.56) and its corresponding GSC implementation (A.57) (which uses h_0 = I_i) are proved to be identical [21], i.e.,

h_{MVDR,i} = (Ψ^{-1} g_d / (g_d^H Ψ^{-1} g_d)) G*_{d,i} = I_i − B (B^H Ψ B)^{-1} B^H Ψ I_i.  (A.59)

Concatenating all the MVDR and the GSC expressions for each F = G_{d,i}, the following is obtained:

[h_{MVDR,1} . . . h_{MVDR,N}] = (Ψ^{-1} g_d / (g_d^H Ψ^{-1} g_d)) g_d^H = I − B (B^H Ψ B)^{-1} B^H Ψ.  (A.60)

Multiplying both sides of the equation above from the right by Ψ^{-1} concludes the proof of the identity (A.54).
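The identity and the MVDR/GSC equivalence it rests on are easy to check numerically. The following sketch is ours and illustrative: Ψ is a random Hermitian positive-definite matrix, g_d a random steering vector, B a null-space basis of g_d^H, and the GSC check uses h_0 = g_d/‖g_d‖² with F = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4

# Random Hermitian positive-definite Psi and steering vector g_d (illustrative)
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Psi = A @ A.conj().T + N * np.eye(N)
g_d = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Any full-rank B with B^H g_d = 0 works; take a null-space basis of g_d^H
B = np.linalg.svd(g_d[None, :].conj())[2][1:].conj().T   # N x (N-1)

Psi_inv = np.linalg.inv(Psi)

# Identity (A.54): B (B^H Psi B)^{-1} B^H = Psi^{-1} - Psi^{-1} g g^H Psi^{-1} / (g^H Psi^{-1} g)
lhs = B @ np.linalg.inv(B.conj().T @ Psi @ B) @ B.conj().T
rhs = Psi_inv - np.outer(Psi_inv @ g_d, g_d.conj() @ Psi_inv) / (g_d.conj() @ Psi_inv @ g_d)
assert np.allclose(lhs, rhs)

# Cross-check the GSC form (A.57)/(A.58) against (A.56) for F = 1
h_mvdr = (Psi_inv @ g_d) / (g_d.conj() @ Psi_inv @ g_d)
h0 = g_d / np.linalg.norm(g_d) ** 2                      # FBF with h0^H g_d = 1
h_nc = np.linalg.inv(B.conj().T @ Psi @ B) @ B.conj().T @ Psi @ h0
assert np.allclose(h_mvdr, h0 - B @ h_nc)
```

Both assertions hold for any full-column-rank B satisfying B^H g_d = 0, which is the only condition the proof uses.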

APPENDIX B

In this appendix, it is proven that δ ≡ γ3γ1 − γ2² is nonnegative. In the following, a detailed expression for δ is derived. Since γ1, γ2 and γ3 consist of Ψ^{-1} (see (38b)–(38d)), some transformations may be applied to Ψ^{-1}, as elaborated in the sequel.

First, the noise PSD matrix is whitened by decomposing Φ_v = φV C C^H, where φV is the noise level, C C^H is the normalized³ PSD matrix of the noise and C is an invertible matrix.⁴ Therefore, from (36), Ψ^{-1} can now be expressed as

Ψ^{-1} = C^{-H} (φR C^{-1} Γ C^{-H} + φV I)^{-1} C^{-1}.  (B.61)

Now, the EVD is applied to C^{-1} Γ C^{-H}, by defining

C^{-1} Γ C^{-H} = V^H Λ V.  (B.62)

The matrix Ψ^{-1} now reads

Ψ^{-1} = C^{-H} V^H Υ^{-1} V C^{-1},  (B.63)

where Υ ≡ φR Λ + φV I. Defining d ≡ V C^{-1} g_d and using (B.63), the definitions in (38b)–(38d) may be re-expressed as

γ1 = d^H Υ^{-1} d,  (B.64a)
γ2 = d^H Υ^{-1} Λ Υ^{-1} d,  (B.64b)
γ3 = d^H Υ^{-1} Λ Υ^{-1} Λ Υ^{-1} d.  (B.64c)

Since Λ and Υ^{-1} are diagonal matrices, γ1, γ2 and γ3 can again be re-expressed as

γ1 = Σ_i |d_i|² / (φR λ_i + φV),  (B.65a)
γ2 = Σ_i |d_i|² λ_i / (φR λ_i + φV)²,  (B.65b)
γ3 = Σ_i |d_i|² λ_i² / (φR λ_i + φV)³,  (B.65c)

where d_i are the elements of d and λ_i are the diagonal elements of Λ.

Substituting the results from (B.65) in δ yields:

δ = [Σ_i |d_i|² λ_i² / (φR λ_i + φV)³] [Σ_j |d_j|² / (φR λ_j + φV)]
  − [Σ_i |d_i|² λ_i / (φR λ_i + φV)²] [Σ_j |d_j|² λ_j / (φR λ_j + φV)²].  (B.66)

Computing a common denominator for the above components and executing several algebraic steps yields

δ = Σ_i Σ_j |d_i|² |d_j|² φV λ_i (λ_i − λ_j) / ((φR λ_i + φV)³ (φR λ_j + φV)²).  (B.67)

³Such that its trace equals N.
⁴Since Φ_v is symmetric and usually full-rank, C can be found using the eigenvalue decomposition (EVD) of Φ_v.


For the double summation, a useful identity may be used, which is obtained by exchanging the indexes of the summations:

Σ_i Σ_j f(i, j) = (1/2) Σ_i Σ_j f(i, j) + (1/2) Σ_j Σ_i f(j, i),  (B.68)

where f(i, j) is some function of i and j. Using the above identity in (B.67) yields

δ = (1/2) Σ_i Σ_j |d_i|² |d_j|² φV λ_i (λ_i − λ_j) / ((φR λ_i + φV)³ (φR λ_j + φV)²)
  + (1/2) Σ_j Σ_i |d_j|² |d_i|² φV λ_j (λ_j − λ_i) / ((φR λ_j + φV)³ (φR λ_i + φV)²).  (B.69)

Finally, after computing a common denominator and a few algebraic steps, δ is proved to be nonnegative, since all of its terms are nonnegative:⁵

δ = (φV²/2) Σ_i Σ_j |d_i|² |d_j|² (λ_i − λ_j)² / ((φR λ_i + φV)³ (φR λ_j + φV)³) ≥ 0.  (B.70)

Some conclusions may be drawn from the final expression of δ: 1) if φV equals zero (i.e., the noiseless case), δ equals zero; or 2) if the λ_i are identical for all i, δ equals zero (which occurs when C^{-1} Γ C^{-H} = I, namely when the spatial fields of the ambient noise and the reverberation are identical). In cases where δ equals zero, both CRB expressions (with and without blocking) become identical.
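The nonnegativity of δ is also easy to confirm numerically. The sketch below is ours, with illustrative random values for |d_i|² and the eigenvalues λ_i; it computes δ from the γ sums in (B.65) and compares it with the symmetrized double-sum closed form obtained by combining the terms of (B.66) over a common denominator.

```python
import numpy as np

# Numerical sanity check of Appendix B: delta = gamma3*gamma1 - gamma2^2 >= 0.
rng = np.random.default_rng(3)
n = 6
a = rng.uniform(0.1, 2.0, n)        # |d_i|^2 (illustrative)
lam = rng.uniform(0.0, 3.0, n)      # eigenvalues of C^{-1} Gamma C^{-H}
phi_R, phi_V = 0.5, 0.5
u = phi_R * lam + phi_V             # diagonal of Upsilon

# gamma quantities from (B.65)
gamma1 = np.sum(a / u)
gamma2 = np.sum(a * lam / u**2)
gamma3 = np.sum(a * lam**2 / u**3)
delta = gamma3 * gamma1 - gamma2**2

# Symmetrized closed form: (phi_V^2 / 2) * sum_ij a_i a_j (lam_i - lam_j)^2 / (u_i u_j)^3
closed = 0.5 * phi_V**2 * np.sum(
    np.outer(a, a) * (lam[:, None] - lam[None, :]) ** 2 / np.outer(u, u) ** 3
)
assert delta >= 0 and np.isclose(delta, closed)
```

Setting phi_V = 0, or making all entries of lam equal, drives both expressions to zero, matching the two degenerate cases listed above.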

REFERENCES

[1] A. Kjellberg, “Effects of reverberation time on the cognitive load in speechcommunication: Theoretical considerations,” Noise Health, vol. 7, no. 25,pp. 11–21, 2004.

[2] P. A. Naylor and N. D. Gaubitch, Eds., Speech Dereverberation. NewYork, NY, USA: Springer, 2010.

[3] E. A. P. Habets, “Single- and multi-microphone speech dereverberationusing spectral enhancement,” Ph.D. dissertation, Technische UniversiteitEindhoven, Eindhoven, The Netherlands, Jun. 2007.

[4] O. Schwartz, S. Gannot, and E. A. P. Habets, “Multi-microphone speechdereverberation and noise reduction using relative early transfer func-tions,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 2,pp. 240–251, Feb. 2015.

[5] E. A. P. Habets, S. Gannot, and I. Cohen, “Late reverberant spectralvariance estimation based on a statistical model,” IEEE Signal Process.Lett., vol. 16, no. 9, pp. 770–773, Sep. 2009.

[6] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Trans.Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121,Dec. 1984.

[7] U. Kjems and J. Jensen, “Maximum likelihood based noise covariancematrix estimation for multi-microphone speech enhancement,” in Proc.20th Eur. Signal Process. Conf., 2012, pp. 295–299.

[8] S. Braun and E. A. P. Habets, “Dereverberation in noisy environmentsusing reference signals and a maximum likelihood estimator,” in Proc.21st Eur. Signal Process. Conf., 2013, pp. 1–5.

[9] S. Braun and E. A. Habets, “A multichannel diffuse power es-timator for dereverberation in the presence of multiple sources,”EURASIP J. Audio, Speech Music Process., vol. 2015, no. 1, pp. 1–14,2015.

5Note that since λi are eigenvalues of a positive definite matrix, they arenecessarily positive.

[10] O. Schwartz, S. Braun, S. Gannot, and E. A. P. Habets, “Maximum likeli-hood estimation of the late reverberant power spectral density in noisy en-vironments,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust.,New-Paltz, NY, USA, Oct. 2015, pp. 1–5.

[11] A. Kuklasinski, S. Doclo, S. H. Jensen, and J. Jensen, “Maximum like-lihood based multi-channel isotropic reverberation reduction for hearingaids,” in Proc. 22nd Eur. Signal Process. Conf., 2014, pp. 61–65.

[12] H. Ye and R. D. DeGroat, “Maximum likelihood DOA estimation andasymptotic Cramer-Rao bounds for additive unknown colored noise,”IEEE Trans. Signal Process., vol. 43, no. 4, pp. 938–949, 1995.

[13] O. Schwartz, S. Gannot, and E. A. P. Habets, “Joint maximum likelihoodestimation of late reverberant and speech power spectral density in noisyenvironments,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,Shanghai, China, Mar. 2016, pp. 151–155.

[14] J. Jensen and M. S. Pedersen, “Analysis of beamformer directed single-channel noise reduction system for hearing aid applications,” in Proc.IEEE Int. Conf. Acoust., Speech Signal Process., Brisbane, Qld, Australia,Apr. 2015, pp. 5728–5732.

[15] A. Kuklasinski, S. Doclo, T. Gerkmann, S. H. Jensen, and J. Jensen,“Multi-channel PSD estimators for speech dereverberation-a theoreticaland experimental comparison,” in Proc. IEEE Int. Conf. Acoust. SpeechSignal Process., 2015, pp. 91–95.

[16] E. A. P. Habets and J. Benesty, “A two-stage beamforming approach fornoise reduction and dereverberation,” IEEE Trans. Audio Speech Lang.Process., vol. 21, no. 5, pp. 945–958, May 2013.

[17] E. A. P. Habets, “Towards multi-microphone speech dereverberation us-ing spectral enhancement and statistical reverberation models,” in Proc.Asilomar Conf. Signals Syst. Comput., 2008, pp. 806–810.

[18] N. Dal Degan and C. Prati, “Acoustic noise analysis and speech enhance-ment techniques for mobile radio applications,” Signal Process., vol. 15,no. 1, pp. 43–56, 1988.

[19] E. A. P. Habets and S. Gannot, “Generating sensor signals in isotropicnoise fields,” J. Acoust. Soc. Amer., vol. 122, pp. 3464–3470, Dec. 2007.

[20] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.:Cambridge Univ. Press, 2004.

[21] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement us-ing beamforming and nonstationarity with applications to speech,” IEEETrans. Signal Process., vol. 49, no. 8, pp. 1614–1626, Aug. 2001.

[22] W. J. Bangs, “Array processing with generalized beamformers,” Ph.D.dissertation, , Yale University, New Haven, CT, USA, 1972.

[23] H. Hung and M. Kaveh, “On the statistical sufficiency of the coherentlyaveraged covariance matrix for the estimation of the parameters of wide-band sources,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,1987, vol. 12, pp. 33–36.

[24] A. Ozerov and C. Fevotte, “Multichannel nonnegative matrix factorizationin convolutive mixtures for audio source separation,” IEEE Trans. Audio,Speech, Lang. Process., vol. 18, no. 3, pp. 550–563, Mar. 2010.

[25] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Proc. 14th Int. Workshop Acoust. Signal Enhanc., Sep. 2014, pp. 313–317.

[26] S. Lefkimmiatis, D. Dimitriadis, and P. Maragos, “An optimum microphone array post-filter for speech applications,” in Proc. 9th Int. Conf. Spoken Lang. Process., 2006, pp. 2142–2145.

[27] ITU-T, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” ITU, Geneva, Switzerland, Recommendation P.862, Feb. 2001.

[28] L. Griffiths and C. W. Jim, “An alternative approach to linearly constrained adaptive beamforming,” IEEE Trans. Antennas Propag., vol. AP-30, no. 1, pp. 27–34, Jan. 1982.

Ofer Schwartz received the B.Sc. (cum laude) and M.Sc. degrees in electrical engineering from Bar-Ilan University, Ramat Gan, Israel, in 2010 and 2013, respectively. He is currently working toward the Ph.D. degree in electrical engineering at the Speech and Signal Processing Laboratory, Faculty of Engineering, Bar-Ilan University. His research interests include statistical signal processing, in particular dereverberation and noise reduction using microphone arrays, and speaker localization and tracking.



Sharon Gannot (S’92–M’01–SM’06) received the B.Sc. degree (summa cum laude) from the Technion - Israel Institute of Technology, Haifa, Israel, in 1986, and the M.Sc. (cum laude) and Ph.D. degrees from Tel-Aviv University, Tel Aviv, Israel, in 1995 and 2000, respectively, all in electrical engineering. In 2001, he held a postdoctoral position at the Department of Electrical Engineering (ESAT-SISTA), K.U. Leuven, Leuven, Belgium. From 2002 to 2003, he held a research and teaching position at the Faculty of Electrical Engineering, Technion - Israel Institute of Technology. Currently, he is a Full Professor at the Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel, where he heads the Speech and Signal Processing Laboratory and the Signal Processing Track. His research interests include multi-microphone speech processing, specifically distributed algorithms for ad hoc microphone arrays for noise reduction and speaker separation; dereverberation; single-microphone speech enhancement; and speaker localization and tracking.

He served as an Associate Editor of the EURASIP Journal on Advances in Signal Processing in 2003–2012, and as an Editor of several special issues on multi-microphone speech processing of the same journal. He has also served as a Guest Editor of the Elsevier Speech Communication and Signal Processing journals. He served as an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING in 2009–2013 and is currently a Senior Area Chair of the same journal. He also serves as a reviewer for many IEEE journals and conferences. He has been a member of the Audio and Acoustic Signal Processing (AASP) Technical Committee of the IEEE since January 2010 and, since January 2017, serves as its Chair. He has also been a member of the Technical and Steering Committee of the International Workshop on Acoustic Signal Enhancement (IWAENC) since 2005 and was the General Co-Chair of IWAENC held in Tel-Aviv, Israel, in August 2010. He served as the General Co-Chair of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in October 2013. He was selected (with colleagues) to present tutorial sessions at ICASSP 2012, EUSIPCO 2012, ICASSP 2013, and EUSIPCO 2013. He received the Bar-Ilan University outstanding lecturer award for 2010 and 2014, and is a co-recipient of seven best paper awards.

Emanuël A. P. Habets (S’02–M’07–SM’11) received the B.Sc. degree in electrical engineering from the Hogeschool Limburg, Limburg, The Netherlands, in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the Technische Universiteit Eindhoven, Eindhoven, The Netherlands, in 2002 and 2007, respectively. He is an Associate Professor at the International Audio Laboratories Erlangen (a joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg and Fraunhofer IIS), and the Head of the Spatial Audio Research Group at Fraunhofer IIS, Germany.

From 2007 to 2009, he was a Postdoctoral Fellow at the Technion - Israel Institute of Technology and at Bar-Ilan University, Israel. From 2009 to 2010, he was a Research Fellow in the Communication and Signal Processing Group at Imperial College London, U.K. His research activities center around audio and acoustic signal processing, and include spatial audio signal processing, spatial sound recording and reproduction, speech enhancement (dereverberation, noise reduction, echo reduction), and sound localization and tracking.

He was a member of the organization committee of the 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC), Eindhoven, a General Co-Chair of the 2013 International Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in New Paltz, NY, USA, and a General Co-Chair of the 2014 International Conference on Spatial Audio (ICSA), Erlangen, Germany. He was a member of the IEEE Signal Processing Society Standing Committee on Industry Digital Signal Processing Technology (2013–2015), and a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING and the EURASIP Journal on Advances in Signal Processing. He received, with S. Gannot and I. Cohen, the 2014 IEEE Signal Processing Letters Best Paper Award. Currently, he is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing, the Vice-Chair of the EURASIP Special Area Team on Acoustic, Sound and Music Signal Processing, an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS, and the Editor-in-Chief of the EURASIP Journal on Audio, Speech, and Music Processing.