A Bayesian Hierarchical Model for Speech Enhancement with … · 2021. 4. 21. · types and...

1

A Bayesian Hierarchical Model for SpeechEnhancement with Time-Varying Audio Channel

Yaron Laufer, Student Member, IEEE, and Sharon Gannot, Senior Member, IEEE

Abstract—We present a fully Bayesian hierarchical approachfor multichannel speech enhancement with time-varying audiochannel. Our probabilistic approach relies on a Gaussian priorfor the speech signal and a Gamma hyperprior for the speechprecision, combined with a multichannel linear-Gaussian state-space model for the acoustic channel. Furthermore, we assumea Wishart prior for the noise precision matrix. We derive avariational Expectation-Maximization (VEM) algorithm whichuses a variant of multichannel Wiener filter (MCWF) to inferthe sound source and a Kalman smoother to infer the acousticchannel. It is further shown that the VEM speech estimatorcan be recast as a multichannel minimum variance distortionlessresponse (MVDR) beamformer followed by a single-channel va-riational postfilter. The proposed algorithm was evaluated usingboth simulated and real room environments with several noisetypes and reverberation levels. Both static and dynamic scenariosare considered. In terms of speech quality, it is shown that asignificant improvement is obtained with respect to the noisysignal, and that the proposed method outperforms a baselinealgorithm. In terms of channel alignment and tracking ability, asuperior channel estimate is demonstrated.

Index Terms—Adaptive beamforming, Kalman smoother, va-riational EM.

I. INTRODUCTION

Background noise significantly degrades the quality of thespeech signal and its intelligibility. Speech enhancement aimsat recovering a speech source from microphone signals re-corded in a noisy and reverberant environment, with a vastamount of applications as cellular phones, hearing aids, huma-noid robots and autonomous systems. Personal assistants andnavigation systems also take advantage of speech enhancementmethods to obtain noise-robust automatic speech recognition.Speech enhancement algorithms have therefore attracted agreat deal of interest in recent years.

In this paper, we address the challenging scenario of time-varying audio channel, which arises when the acoustic impulseresponses (AIRs) between the speaker and the microphoneschange over time. Time-varying acoustic channel can describemoving speaker, moving microphones, or other environmentalchanges.

Array beamforming is a common multichannel method thatexploits spatial diversity, i.e. the different acoustic propertiesbetween channels, for enhancing desired source while suppres-sing sounds from other directions. In this work, the steeringvector used for beamforming is the relative transfer function(RTF) [1]. Design of beamformers requires the estimation ofseveral model parameters. For instance, multichannel Wienerfilter (MCWF) beamformers require the RTF of the speaker,

Yaron Laufer and Sharon Gannot are with the Faculty of Engi-neering, Bar-Ilan University, Ramat-Gan, 5290002, Israel (e-mail: ya-ron [email protected], [email protected]).

its respective variance (or alternatively, the speaker’s entirecovariance matrix), and the covariance matrix of the noise[2]. Numerous methods exist for estimating these parame-ters. Some approaches utilize the speech presence probability(SPP), by first determining the time-frequency bins dominatedby either speech or noise, and then estimating independentlythe model parameters [3].

In this context, the problem of RTF identification can betackled with various approaches. In [1], [4], the nonstationarityof speech signal is exploited. A subspace-based approach wasproposed in [5], where a generalized eigenvalue decomposition(GEVD) of the covariance matrix was used to extract an RTFestimator. However, many realistic scenarios are dynamic. Theapplication of standard beamforming techniques to dynamicscenarios is not an easy task, since the need to track changesin the RTF enforces short time frames, which in turn reducesthe number of data points for the RTF estimation. A modifiedversion of the nonstationarity-based system identification al-gorithm was introduced in [6], based on hypothesis testing. Aweighted least-squares (WLS) estimator, which incorporatesthe SPP decisions, was proposed in [7].

Rather than estimating independently the parameters assuggested by the SPP-based methods, other methods jointlyestimate all model parameters by optimizing some criterion,such as maximum likelihood (ML) or maximum a posteriori(MAP) criteria [8]–[10]. When the resulting optimizationproblem cannot be solved directly, a common solution is theExpectation-Maximization (EM) algorithm [11], [12], whichbreaks down the problem into the signal estimate and theparameters estimate that are iteratively solved.

The Bayesian approach defines prior distributions overeach parameter, and thus provides an elegant way to exploreuncertainty in the model and to incorporate prior knowledgeinto the learning process. Hierarchical Bayesian models arevery useful, since they allow to define complex structuredmodels. These models use a multi-level modeling to captureimportant dependencies among parameters. The resulting com-plex structure is of particular interest when the parameter ofinterest is not related directly to the observation but rather toanother parameter, which itself is related to the observation.However, to apply the EM algorithm we must know theposterior distribution, which might be intractable in complexBayesian models. To alleviate this drawback, one may resortto approximate methods, which are broadly divided into twoclasses, namely stochastic approximations and deterministicapproximations [13]. Stochastic techniques are based on nu-merical sampling methods, such as Markov Chain Monte-Carlo (MCMC). Given large computational resource, they cangenerate exact results. In practice, however, sampling methods

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.The final version of record is available at http://dx.doi.org/10.1109/TASLP.2018.2876177

Copyright (c) 2018 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

2

can be computationally expensive, often limiting their use. Thedeterministic approximation methods employ an analyticalapproximation to the posterior distribution by assuming a spe-cific form. These methods are highly efficient and thus allowBayesian techniques to be used in large-scale applications. Acommon method in this family is the variational Expectation-Maximization (VEM) [13]–[15]. Rather than estimating asingle value for the model parameters, this approach estimatestheir posterior distributions. As the inference process relieson the entire posterior probability density function (PDF)instead of point estimates, this method has the potential toobtain estimators that are more robust and less sensitive tolocal maxima and overfitting, compared to the ML and MAPestimators [2], [15].

In [16], the problem of blind adaptive beamforming isaddressed in the short-time Fourier transform (STFT) domain,using a VEM approach. The speech signal is modelled asa Gaussian process and the RTF is modelled as a first-order Markov model. By modelling the speech signal andthe channel as latent variables, their posterior distributions arejointly estimated in the E-step. However, the considered modelis not fully Bayesian, since the variance of the speaker andcovariance matrix of the noise are modelled as deterministicunknown parameters, rather than latent random variables.These parameters are estimated in the M-step using pointestimators. In addition, the RTF was estimated using the casualKalman filter, i.e. using only past and present observations.

The Bayesian approach has also been pursued in the contextof speech dereverberation in [17], [18] and in the context ofsource separation, e.g. in [19], [20].

In this paper, we extend the probabilistic model proposed in[16] towards a fully Bayesian model. We introduce a hierar-chical model which is based on a Gaussian prior for the speechsignal and a Gamma hyperprior for the speech precision.Furthermore, we assume a complex Wishart prior for the noiseprecision matrix. This way, the precisions are also modelledas latent random variables, for which posterior distributionsare inferred. The inference of the hidden variables is carriedout using a VEM algorithm. Inspired by the decompositionof the multichannel minimum mean square error (MMSE)estimator of a single speech signal, namely the MCWF,into a multichannel minimum variance distortionless response(MVDR) beamformer followed by a subsequent single-channelWiener postfilter [21], [22], we show that the VEM speechestimator has an analogous decomposition. Similarly to theMCWF, it includes an MVDR beamformer as an initial stage.However, the single-channel Wiener postfilter, which aims tominimize the residual noise at the MVDR output, is substitutedby a variational postfilter, which also takes into account theuncertainty in the RTF estimate, and weights accordingly thesingle-channel at the MVDR output.

Considering the channel coefficients as random latent vari-ables, may have higher modelling capacity than consideringthem as deterministic and unknown variables [23]. Therefore,it was proposed in the context of source separation [19] tomodel the time-varying audio channel as a set of hiddenrandom variables, which are temporally related through a first-order linear dynamical system (LDS) [13], with a reduced set

of parameters. We adopt this latent Bayesian model for theunderlying filtering, and derive a Kalman smoother to inferthe time-varying RTF. This way we exploit all available datato estimate the RTF.

In order to simultaneously infer all hidden variables, ourVEM algorithm consists of alternating the following steps:(i) inferring the instantaneous speech signal by means ofWiener filtering, (ii) inferring the RTF sequence with Kalmansmoother, (iii) inferring the speech precision, (iii) inferringthe noise precision matrix, and (iv) updating rules for thehyperparameters.

The current paper is an essential extension of [24]. In[24], the noise is assumed to be a spatially homogeneousand spherically diffuse sound field, with a time-invariantcovariance matrix modelled by a known spatial coherencematrix multiplied by an unknown noise power. Hence, theinverse power of the noise is treated as a latent randomvariable, modelled with the Gamma distribution. Here wegeneralize the model by omitting the assumptions about thespatial structure of the noise covariance. The entire precisionmatrix of the noise is therefore treated as a latent variable,modelled with the complex Wishart distribution. By allowingcomplex-valued precision matrices, our model accounts forany spatial structure of the noise field. We also provide here acomprehensive description of the assumed model, and derivethe proposed VEM algorithm in detail. We describe the fullmathematical derivation that was omitted in [24]. We alsoextend the preliminary performance evaluation, presented in[24], using variety of speech signals and noise types. Inaddition, we examine the performance on real-life recordingsin static as well as dynamic scenarios.

The remainder of the paper is organized as follows.Section II introduces some notations and preliminary notes.Section III presents the problem formulation, and describesthe probabilistic model. Section IV describes the proposedVEM algorithm. Section V demonstrates the performance ofthe proposed algorithm by an extensive experimental studybased on both simulated data and real recordings. The paperis concluded in Section VI.

II. NOTATION AND PRELIMINARIES

In our notation, scalars are denoted with regular lowercaseletters, vectors are denoted with bold lowercase letters and ma-trices are denoted with bold uppercase letters. The superscripts> and H describe transposition and Hermitian transposition,respectively. | · | denotes the determinant of a matrix, tr{·}denotes the trace operator and c

= denotes equality up to anadditive constant. For a random vector a, a multivariate propercomplex Gaussian distribution is given by [25]–[27]:

Nc(a;µa,Φa) =1

|πΦa|exp

[−(a− µa

)HΦ−1a

(a− µa

)],

(1)where µa is the mean vector and Φa is an Hermitian positivedefinite complex covariance matrix. The inverse covariancematrix, Φ−1

a , is the precision matrix. A Gamma distributionfor a non-negative random variable λ with shape and rate



3

parameters a, b > 0 is given by [13]:

Gam(λ; a, b) =

{ 1Γ(a)b

aλa−1 exp (−bλ) λ > 0

0 λ ≤ 0 ,(2)

where Γ(·) is the gamma function defined by

Γ(x) ≡∫ ∞

0

ux−1 exp(−u)du. (3)

The Gamma distribution has the following properties [13]:

E [λ] =a

b, (4a)

E [lnλ] = Ψ(a)− ln(b), (4b)

where Ψ(·) is the digamma function, defined by

Ψ(a) ≡ d

daln Γ(a). (5)

A complex Wishart distribution for a random N×N Hermitianpositive definite matrix Tc is given by [26]:

Wc(Tc;Wc, νc) = Bc(Wc, νc)|Tc|νc−N exp[− tr

(W−1

c Tc

)],

(6)

with

Bc(Wc, νc) = |Wc|−νc[πN(N−1)/2

N∏i=1

Γ(νc + 1− i

)]−1

, (7)

where Wc is the Hermitian positive definite scale matrix, andνc is the number of degrees of freedom. The complex Wishartdistribution has the following properties (see [28]):

E [Tc] = νcWc, (8a)

E [ln |Tc|] =

N∑i=1

Ψ (νc + 1− i) + ln |Wc|. (8b)

III. PROBLEM FORMULATION

A. Signal Model

Consider a speech signal received by N microphones, in anoisy and reverberant acoustic environment. We work with theSTFT representation of the measured signals. Let k ∈ [1,K]denote the frequency bin index, and ` ∈ [1, L] denote thetime frame index. The N -channel observation signal x(`, k) =[x1(`, k), · · · , xN (`, k)]> is given by

x(`, k) = s(`, k)a(`, k) + u(`, k), (9)

where s(`, k) is the echoic speech signal as received by thefirst microphone (designated as a reference microphone) anda(`, k) = [1, a2(`, k), · · · , aN (`, k)]> is the RTF vector. Bytaking a reference microphone and normalizing the acoustictransfer function (ATF) to construct the RTF, we circumventthe problem of gain ambiguity between the source signal andthe ATF [1]. The ambient noise, which may originate fromboth microphone responses and from environmental sources,is denoted by u(`, k) = [u1(`, k), · · · , uN (`, k)]>.

Given the observed STFTs {x(`, k)}L,K`,k=1, we are interestedin estimating the speech signal {s(`, k)}L,K`,k=1. This requiresthe estimation of the RTF and the characteristics of both thespeech and noise.

B. Probabilistic Model

The speech STFT coefficients are assumed to follow a zero-mean proper (i.e. circularly-symmetric) complex Gaussiandistribution with time-varying precision τ(`, k), and they areall statistically independent frame- and frequency-wise. Hence,the PDF writes:

p(s(`, k)|τ(`, k)

)= Nc

(s(`, k); 0, τ−1(`, k)

). (10)

The noise is modelled as a zero-mean proper complex multi-variate Gaussian, given by

p(u(`, k)|T (k)

)= Nc

(u(`, k); 0,T−1(k)

), (11)

where T (k) denotes the noise precision matrix, which isassumed to be time-invariant. The conditional data distributionis therefore given by

p(x(`, k)|s(`, k),a(`, k),T (k)

)= Nc

(x(`, k); s(`, k)a(`, k),T−1(k)

). (12)

The RTF vector may be time-varying if the speaker or themicrophone array are moving around the enclosure. Thus itshould be described with a dynamical model. We model theRTF as a set of hidden variables and parametrize its temporalprogression as a random walk [19]. Accordingly, it is assumedthat for each frequency k, the RTF sequence {a(`, k)}L`=1

is governed by a first-order LDS. The initial and transitiondistributions are both assumed to follow complex Gaussiandistributions. Specifically, this writes

p(a(1, k)

)= Nc

(a(1, k);µa(k),Φa(k)

), (13)

p(a(`, k)|a(`− 1, k)

)= Nc

(a(`, k); a(`− 1, k),Φa(k)

), (14)

with the mean vector µa(k) ∈ CN and the covariance matrixΦa(k) ∈ CN×N . For brevity, a(1:L, k) = {a(`, k)}L`=1

denotes the entire sequence of RTFs at frequency k.

C. Conjugate Priors

In the Bayesian framework, it is common to introduceprobabilistic priors over the latent variables, which allows us totake into account the uncertainty about the model. For the ex-ponential family of distributions, choosing a conjugate form ofprior leads to a posterior distribution with the same functionalform, and therefore results with a simplified Bayesian analysis.In this framework, it is often more convenient to work withthe precision rather than the covariance [13]. Our hierarchicalgenerative model is therefore established by introducing priorsfor the precisions of the speaker and the noise. For a Gaussianvector, the conjugate prior distribution of the precision matrixis the Wishart distribution [13]. Since the speech precisionτ(`, k) is a scalar, we use the degenerated form of the Wishartdistribution, namely the Gamma distribution:

p(τ(`, k)

)= Gam

(τ(`, k); a(`, k), b0(`, k)

), (15)

where a0(`, k), b0(`, k) are hyperparameters. Each time-frequency (TF) bin is modelled with distinct shape and rate,to allow the flexibility of modeling local characteristics ofthe speech signal. This choice of two-level hierarchical priorcan be further justified by the fact that by marginalizing over



4

a0(`), b0(`) τ(`) s(`)

µa,Φa a(`)

x(`) T W0, ν0

a(`− 1)

Fig. 1: Graphical model. Double circles represent observati-ons, circles represent latent variables and rectangles representdeterministic parameters (Frequency index is omitted).

the precision τ(`, k), the true prior distribution of the speechsignal turns to be a Student’s t-distribution [13]. The heavytailed Student-t prior favors sparse models, i.e. models witha few nonzero parameters, and thus was proposed to modelspeech coefficients [29], [30], which have sparse distributionin the TF domain. For the time-invariant precision matrixof the ambient noise T (k), we assume a complex Wishartdistribution as prior:

p(T (k)

)=Wc

(T (k);W0(k), ν0(k)

). (16)

In general, the noise precision matrix might be complex-valued, hence a better fit can be obtained by modelling it witha complex distribution. The complete graphical model of theproposed hierarchical model is shown in Fig. 1.

IV. VARIATIONAL BAYESIAN ALGORITHM

In this section, we develop a VEM algorithm which jointlyinfer the latent variables and estimate the model parameters.We begin with the general problem of Bayesian inference, thenwe introduce the VEM framework. Then we describe our E-step, composed of an E-s step for the speech signal, E-a stepfor the RTF sequence, E-τ step for the speech precision andE-T step for the noise precision matrix. Finally, the M-step ispresented.

A. Variational Inference

Consider a probabilistic model in which we denote all ofthe observed data as X and all of the latent variables asH. The joint distribution p(X ,H; θ) is governed by a setof parameters denoted θ. Bayesian inference aims to inferthe latent variables according to their posterior mean (PM),using an ML estimation of the model parameters. To thisend, it is required to compute the posterior distribution of thehidden variables p(H|X ; θ) = p(X ,H; θ)/p(X ; θ). The EMalgorithm is a general technique for finding ML solutions forprobabilistic models having latent variables [11]. However, theEM algorithm requires knowledge of the posterior in order tomaximize the likelihood function iteratively. A serious shortco-ming of this methodology is that for complex Bayesian modelsthis posterior is not available, and thus the EM algorithm isnot applicable.

The VEM approach provides approximate solutions forBayesian inference problems. This method relaxes the limitingrequirements of the EM algorithm and thus is applicableto complex problems. In the variational inference procedure,the exact posterior distribution over the hidden variables is

substituted by an approximate one q(H) ≈ p(H|X ; θ). In orderto achieve tractable distributions, we must restrict the family ofdistribution q(H). A particular form that has been used withsuccess is the factorized one, called mean field theory [14],[31]. According to this approximation, we assume that q(H)can be factorized over some partition of the hidden variables:

q(H) =∏Hi⊂H

q(Hi). (17)

The variational approach is based on the following decomposi-tion of the log-likelihood function. For any PDF q(H) definedover the latent variables, the following decomposition holds:

ln p(X ; θ) = F (q(H); θ) + KL(q(H)||p(H|X ; θ)

), (18)

with

F (q(H); θ) =

∫q(H) ln

p(X ,H; θ)

q(H)dH, (19)

and

KL(q(H)||p(H|X ; θ)) = −∫q(H) ln

p(H|X ; θ)

q(H)dH, (20)

where KL(q(H)||p(H|X ; θ)

)is the Kullback-Leibler (KL)

divergence between q(H) and the posterior distributionp(H|X ; θ). Since the KL divergence is always non-negative, itfollows that F (q(H); θ) is a lower bound on the log-likelihood,with equality if and only if q(H) = p(H|X ; θ).

We wish to find q(H) of the form of (17) that maximizesthe lower bound F (q(H); θ). The maximization of the lowerbound consists in iterating the following two steps untilconvergence. In the E-step we compute

q∗(H) = argmaxq(H)

F (q(H); θold), (21)

where θold is the current value of the parameter vector. Forsolving (21), it is well known that the optimal marginalposterior distribution of a subset Hi ⊆ H is given by [13]

ln q∗(Hi) = Eq(H/Hi)[ln p(X ,H; θ)] + const, (22)

where q(H/Hi) denotes the approximated joint posteriordistribution of all latent variables, excluding the subset Hi.The additive constant in (22) can be obtained through norma-lization. Accordingly, q(H) is inferred by alternating over eachsubset Hi ⊂ H. Once we have the posterior distributions ofall the variables in H, F (q∗(H); θ) is maximized with respectto the parameters in the M-Step. To this end, F (q(H); θ) canbe further decomposed as

F (q(H); θ) = Eq(H) [ln p(X ,H; θ)]− Eq(H) [ln q(H)] , (23)

where the first term is the expected log-likelihood of thecomplete-data (denoted by L(θ)) and the second term is theentropy of the distribution q(H), which is independent of θ.Using (23) the maximization in the M-step writes

θnew = argmaxθ

Eq∗(H) [ln p(X ,H; θ)] . (24)



5

B. Variational EM for Speech Enhancement

In this work, the observations set is X = {x(`, k)}L,K`,k=1,the set of hidden variables consists of

H ={s(`, k),a(`, k), τ(`, k),T (k)

}L,K`,k=1

,

and the parameter set consists of

θ ={a0(`, k), b0(`, k),W0(k), ν0(k),µa(k),Φa(k)

}L,K`,k=1

.

The complete-data distribution writes

p(X ,H; θ) = p(X|H; θ)p(H; θ)

=

L,K∏`,k=1

[p(x(`, k)|s(`, k),a(`, k),T (k)

)× p(s(`, k)|τ(`, k)

)p(τ(`, k); a0(`, k), b0(`, k)

)]×

K∏k=1

[p(T (k);W0(k), ν0(k)

)p(a(1, k);µa(k),Φa(k)

)×

L∏`=2

p(a(`, k); a(`− 1, k),Φa(k)

)]. (25)

However, the likelihood p(X ; θ) =∫p(X ,H; θ)dH can-

not be computed analytically from (25), hence the posteriorp(H|X ; θ) cannot be expressed in closed-form and exactinference becomes intractable. We resort to the approximatevariational inference methodology, which circumvents thisdifficulty by approximating the posterior. According to thefactorized approximation (17), we assume that the speech sig-nal, RTF, speech precision and the noise precision matrix areconditionally independent given the observations. Hence, theapproximate posterior distribution has the following factorizedform:

q(H) =

L,K∏`,k=1

[q(s(`, k)

)q(τ(`, k)

)]

×K∏k=1

[q(T (k)

)q(a(1:L, k)

)]. (26)

For the sake of brevity, the frequency bin index k is henceforthomitted.

C. E-s Step

The approximate posterior distribution of the source canbe computed from (25) and (22) by recognizing the terms ofln q(s(`)) that are function of s(`):

ln q(s(`))c= Eq(a(`))q(τ(`))q(T )

[ln p(x(`)|s(`),a(`),T

)+ ln p

(s(`)|τ(`)

)]c= −|s(`)|2

{Eq(a(`))q(T )

[aH(`)Ta(`)

]+ τ(`)

}+ s∗(`)aH(`)Tx(`) + s(`)xH(`)T a(`), (27)

where a(`),Σa(`), T and τ(`) are posterior statistics that willbe defined in Sections IV-D-IV-F. The remaining expectation

can be simplified by the cyclic property of the trace operation:

Eq(a(`))q(T )

[aH(`)Ta(`)

]= tr

{Eq(a(`))q(T )

[a(`)aH(`)T )

]}= tr

{[a(`)aH(`) + Σa(`)

]T}

= aH(`)T a(`) + tr[Σa(`)T

]. (28)

Eq. (27) is an incomplete quadratic form in s(`), thus repre-sents a Gaussian distribution:

q (s(`)) = Nc(s(`); s(`),Σs(`)

), (29)

with mean s(`) and variance Σs(`) given by:

Σs(`) ={

aH(`)T a(`) + tr[Σa(`)T

]+ τ(`)

}−1

, (30)

s(`) =aH(`)Tx(`)

aH(`)T a(`) + tr[Σa(`)T

]+ τ(`)

. (31)

The speech signal can be estimated by the PM, namely s(`),with the variance Σs(`). This speech estimator resembles theform of the MCWF [32], [33]:

sMCWF(`) =aH(`)Tx(`)

aH(`)T a(`) + τ(`), (32)

except the term tr[Σa(`)T

]. The MCWF is the optimal beam-

former in the sense of minimizing the MMSE of the speechsignal. Furthermore, it can be decomposed into a multichannelMVDR beamformer followed by a single-channel Wienerpostfilter [21], [22], [33]:

sMCWF(`) =aH(`)T a(`)

aH(`)T a(`) + τ(`)︸︷︷︸HW (`)

× aH(`)T

aH(`)T a(`)︸︷︷︸wH

MVDR(`)

x(`), (33)

where HW (`) is the single-channel Wiener filter at the outputof the MVDR. In a similar way to the MCWF decomposition,s(`) can be decomposed as

s(`) =aH(`)T a(`)

aH(`)T a(`) + tr[Σa(`)T

]+ τ(`)︸︷︷︸

H(`)

× aH(`)T

aH(`)T a(`)︸︷︷︸wH

MVDR(`)

x(`)

(34)

where HW (`) is replaced by a single-channel variationalpost-filter, H(`), having the following role. Due to the RTFuncertainty, the speech signal at the output of the MVDR stagemay be distorted. The MCWF treats the RTF estimator a(`) asa point estimator, and thus ignores its uncertainty. In contrast,the VEM estimator treats the RTF as a latent variable andconsiders its posterior distribution, which captures the uncer-tainty regarding the parameter estimate. Therefore, it includesH(`) as a postfilter that takes into account that uncertaintylevel, expressed by Σa(`), and weights accordingly the single-channel at the MVDR output. When Σa(`)→ 0, s(`) reducesto sMCWF(`).



6

D. E-a Step

The joint posterior distribution of the RTF sequence isobtained by identifying the terms that are function of a(1:L):

ln q(a(1:L))c=

L∑`=1

Eq(s(`))q(T )

[ln p(x(`)|s(`),a(`),T

)]+ ln p

(a(1:L)

). (35)

Following the same lines as in [19], the first term reduces to aGaussian distribution Nc

(µL(`); a(`),ΦL(`)

)with ΦL(`) =(

|s(`)|2T)−1

and µL(`) = s∗(`)

|s(`)|2x(`). Thus we obtain:

q(a(1:L)

)= p(a(1:L)

) L∏`=1

Nc(µL(`); a(`),ΦL(`)

). (36)

We identify (36) as a first-order LDS [13] over the latent states{a(`)}L`=1, where (13) is the initial distribution, (14) is thetransition distribution and Nc(µL(`); a(`),ΦL(`)) denotes theemission distribution. Hence, the marginal posterior distribu-tion of each frame is also a Gaussian distribution, which can berecursively calculated using the Kalman smoother [13]. Thisis a recursive algorithm that consists of a forward pass and abackward pass. Forward inference over the states is used toestimate a distribution over a(`) using all observations up totime `, whereas backward inference uses future observationsfrom time ` + 1 up to L. The two passes are afterwardscombined to give the posterior marginal distribution of thestate variable a(`) given all observations:

q(a(`)) = Nc(a(`); a(`),Σa(`)

), (37)

with

Σa(`) =(Φf (`)

−1+ Φβ(`)

−1)−1

(38)

a(`) = Σa(`)(Φf (`)

−1af (`) + Φβ(`)

−1aβ(`)

). (39)

Φf (`),af (`) are provided by the forward pass, andΦβ(`),aβ(`) are provided by the backward pass, both detailedin the Appendix. The RTF can therefore be estimated bythe smoothed PM, namely a(`) ∈ CN , with uncertaintyΣa(`) ∈ CN×N . Note that in [16] the RTF was inferredusing the forward pass alone, i.e. a(`) was estimated givenonly causal observations up to time `. We propose to find themarginal for a(`) based on all observations up to time L, i.e.comprise future as well as past observations. Although thiscannot be used for real-time prediction, it plays a key role inlearning the model.

Another quantity of interest is the pair-wise joint posteriordistribution of two successive frames a(`) and a(`−1), whichwill be required to update Φa in Section IV-G6. Marginalizingout all other frames in (36) results with a Gaussian distribution,q(a(`),a(`− 1)

)= Nc

( [a(`)>,a(`− 1)>

]>; aξ(`),Σξ(`)

),

where the mean vector aξ(`) ∈ C2N and the covariance matrixΣξ(`) ∈ C2N×2N are given in the Appendix. The second-order joint posterior moment is defined as

Q(`) = Σξ(`) + aξ(`)aHξ (`). (40)

E. E-τ Step

Using (25) and (22), the posterior distribution of the speechprecision writes:

ln q(τ(`))c= Eq(s(`))

[ln p(s(`)|τ(`)

)]+ln p

(τ(`); a0(`), b0(`)

)c= a0(`) ln τ(`)− τ(`)

(b0(`) + |s(`)|2

). (41)

This is an exponent of a Gamma distribution:

q (τ(`)) = Gam(τ(`); ap(`), bp(`)

), (42)

with parameters ap(`) and bp(`), given by

ap(`) = a0(`) + 1, (43)

bp(`) = b0(`) + |s(`)|2. (44)

Thus, the PM estimate for the source precision writes

τ(`) =ap(`)

bp(`)=

a0(`) + 1

b0(`) + |s(`)|2, (45)

where the hyperparameters a0(`), b0(`) are updated in the M-Step. It should be noted that treating the speech precision asan unknown deterministic parameter as in [16], leads to thefollowing point estimator τD(`) = 1/|s(`)|2. It is instructiveto relate the variational solution to the deterministic one. Todo this, consider the marginal case where the parameters arefixed to very small values, i.e. a0(`) = b0(`) = 0, in whichthe VEM posterior estimator coincides with the deterministicestimator. This equivalence can be explained by the fact that anon-informative prior is obtained for the Gamma distributionas the special case a0(`) = b0(`) = 0, since it corresponds tothe limit of an infinitely broad prior [13].

F. E-T Step

Similarly, the posterior distribution of the noise precision isgiven by:

ln q(T )c=

L∑`=1

Eq(s(`))q(a(`))

[ln p(x(`)|s(`),a(`),T

)]+ ln p(T ;W0, ν0)

c= −tr

(T{W−1

0 +L∑`=1

Eq(s(`))q(a(`))

[(x(`)− s(`)a(`)

)×(x(`)− s(`)a(`)

)H]})

+ (ν0 −N + L) ln |T |. (46)

This is an exponent of a complex Wishart distribution:

q(T ) =W(T ;Wp, vp), (47)

with parameters Wp and vp, given by

W−1p = W−1

0 +L∑`=1

[x(`)xH(`)− x(`)aH(`)s∗(`)

− s(`)a(`)xH(`) + |s(`)|2(a(`)aH(`) + Σa(`)

)], (48)

vp = ν0 + L. (49)



7

Thus, the PM estimate for the noise precision matrix writes

T = vpWp = (ν0 + L)Wp. (50)

Treating the precision matrix of the noise as an unknowndeterministic parameter as in [16], leads to the following pointestimator:

T−1D =

1

L

L∑`=1

[x(`)xH(`)− x(`)aH(`)s∗(`)

− s(`)a(`)xH(`) + |s(`)|2(a(`)aH(`) + Σa(`)

)]. (51)

Note that W−1p = W−1

0 + LT−1D , hence (50) becomes T =

(ν0 + L)(W−1

0 + LT−1D

)−1

. For the Wishart distribution, anon-informative prior is obtained as the special case ν0 → 0and W0 = 1

λI where λ → 0. In this case we obtain T =

TD. However, the least informative proper Wishart prior isobtained by setting ν0 = N , leading to T =

(NL + 1

)TD.

The approximate posterior distributions in (29), (37), (42)and (47) are then iteratively updated until convergence, sincethey depend on the statistics of each other.

G. M-Step

In order to estimate the parameters, we maximize theexpected log-likelihood of the complete-data L(θ) =Eq(H) [ln p(X ,H; θ)] with respect to each parameter. For ourmodel, L(θ) is given by (52), at the bottom of the page.

1) M-a0(`) Step: Differentiating L(θ) w.r.t. a0(`), weobtain∂L(θ)

∂a0(`)=

∂

∂a0(`)Eq(τ(`))

[ln p(τ(`); a0(`), b0(`)

)]= −Ψ(a0(`)) + ln b0(`) + Eq(τ(`))

[ln(τ(`)

)], (53)

where Ψ(·) is the digamma function, defined in (5). Using(4b) and comparing the derivative to zero yields:

a0(`) = Ψ−1

[Ψ(ap(`)

)+ ln

b0(`)

bp(`)

]. (54)

2) M-b0(`) Step: Differentiating L(θ) w.r.t. b0(`), weobtain

∂L(θ)

∂b0(`)=a0(`)

b0(`)− Eq(τ(`))

[τ(`)

]. (55)

Using (4a) and comparing to zero yields:

b0(`) =a0(`)

ap(`)bp(`). (56)

3) M-ν0 Step: Differentiating L(θ) w.r.t. ν0, we obtain

∂L(θ)

∂ν0=

∂

∂ν0Eq(T )

[ln p(T ;W0, ν0

)]= − ln |W0| −

N∑i=1

Ψ (ν0 + 1− i) + Eq(T )

[ln |T |

]. (57)

Using (8b) and comparing to zero yields the following equa-tion:

N∑i=1

Ψ (ν0 + 1− i) =N∑i=1

Ψ (νp + 1− i) + ln|Wp||W0|

. (58)

Since this equation cannot be solved in a closed-form for ν0,we apply the iterative method of Newton to find a maximum[34]:

ν(m+1)0 = ν

(m)0 −

d(ν

(m)0

)h(ν

(m)0

) , (59)

where d (ν0) and h (ν0) are the first and second-order deriva-tives of L(θ) w.r.t. ν0:

d (ν0) ≡ ∂L(θ)

∂ν0, h (ν0) ≡ ∂2L(θ)

∂ν20

. (60)

The derivatives in our case are given by

d (ν0) =N∑i=1

[Ψ(νp + 1− i

)−Ψ

(ν0 + 1− i

)]+ ln

|Wp||W0|

,

(61)

h (ν0) = −N∑i=1

Ψ′(ν0 + 1− i

), (62)

where Ψ′(·) is the trigamma function.

4) M-W0 Step: Differentiating L(θ) w.r.t. W0, we obtain

∂L(θ)

∂W0= −ν0W

−10 +W−1

0 Eq(T )

[T]W−1

0 . (63)

Using (8a) and comparing to zero yields:

W0 =νpν0Wp. (64)

5) M-µa Step: Differentiating L(θ) w.r.t. µa, we obtain

∂L(θ)

µaH =∂

∂µaH Eq(a(1))

[ln p(a(1);µa,Φa

)]= Φ−1

a

{Eq(a(1))

[a(1)

]− µa

}. (65)

L(θ) =

L,K∑`,k=1

Eq(s(`,k))q(a(`,k))q(T (k))

[ln p(x(`, k)|s(`, k),a(`, k),T (k)

)]+

L,K∑`,k=1

Eq(s(`,k))q(τ(`,k))

[ln p(s(`, k)|τ(`, k)

)]

+

L,K∑`,k=1

Eq(τ(`,k))

[ln p(τ(`, k); a0(`, k), b0(`, k)

)]+

K∑k=1

Eq(T (k))

[ln p(T (k);W0(k), ν0(k)

)]

+K∑k=1

{Eq(a(1,k))

[ln p(a(1, k);µa(k),Φa(k)

)]+

L∑`=2

Eq(a{`:`−1,k})

[ln p(a(`, k)|a(`− 1, k); Φa(k)

)]}. (52)



8

Comparing the derivative to zero yields:

µa = a(1). (66)6) M-Φa Step: Differentiating L(θ) w.r.t. Φa, we obtain

∂L(θ)

∂Φa=

∂

∂Φa

{Eq(a(1))

[ln p(a(1);µa,Φa

)]+

L∑`=2

Eq(a{`:`−1})

[ln p(a(`); a(`− 1); Φa

)]}= Φ−1

a Eq(a(1))

[(a(1)− µa

)(a(1)− µa

)H]Φ−1a

+ Φ−1a

L∑`=2

Eq(a{`:`−1})

[(a(`)− a(`− 1)

)×(a(`)− a(`− 1)

)H]Φ−1a − LΦ−1

a . (67)

Using (66) and comparing to zero yields:

Φa =1

L

[Σa(1) +

L∑`=2

(Q11(`)−Q21(`)−Q12(`)

+Q22(`))], (68)

where the four Qnp(`), (n, p) ∈ {1, 2} are N × N non-overlapping subblocks of the second-order joint posteriormoment Q(`), defined in (40). The complete VEM algorithmfor speech enhancement is summarized in Algorithm 1.

V. EXPERIMENTAL STUDY

In this section, the proposed algorithm is evaluated usingboth simulated and real room environments. In the simulationpart we provide a comprehensive performance evaluation overvarious reverberation times, noise types and signal to noiseratio (SNR) levels. In addition, we also report the resultson real recordings, to demonstrate the applicability of theproposed method in real-life conditions. We begin by pre-senting the performance measures and the baseline method.Then, we describe the blind initialization method for the VEMalgorithm. In the subsequent sections we detail setups andresults for each series of experiments.

A. Performance Measures

In this section we describe the performance measures thatare used for evaluating the quality of performance. The speechquality was assessed with two common objective measures,namely perceptual evaluation of speech quality (PESQ) [35],and log-spectral distance (LSD). Both the PESQ and the LSDwere measured by comparing the speech signal as receivedby the first microphone, namely s(`, k), with its PM estimates(`, k), given by (31). The quality of the RTF estimate wasevaluated in terms of the normalized projection misalignment(NPM) measure [36], which is a widely used error measure toevaluate an estimated impulse response, disregarding possiblegain error. The NPM is computed with:

NPM [dB] = 20 log10

1−

(hHh

‖h‖‖h‖

)2 . (69)

Algorithm 1 VEM algorithm for speech enhancement

Input {x(`)}L`=1.

Initialize s(`),Σs(`), τ(`), T , θ.repeat

E-step

E-a step (Forward-Backward Algorithm):Set Φf (1) with (72), af (1) with (73).

for ` : 2 to LCalculate Φf (`) with (70), af (`) with (71).

endSet Φβ(L) = Φf (L), aβ(L) = af (L).

for ` : L− 1 to 1Calculate Φi(`) with (74).Calculate Φβ(`) with (75), aβ(`) with (76).

endCalculate Σa(`) with (38), a(`) with (39).Calculate Σξ(`) with (78), aξ(`) with (79).Then Calculate Q(`) with (40).

E-s step: Calculate Σs(`) with (30), s(`) with (31).E-τ step: Calculate ap(`) with (43), bp(`) with (44).

Then Calculate τ(`) with (45).E-T step: Calculate Wp with (48), vp with (49).

Then Calculate T with (50).M-stepM-a step: Update µa with (66), Φa with (68).M-τ step: Update a0(`) with (54), b0(`) with (56).M-T step: Update ν0 with (59), W0 with (64).

until convergencereturn the estimated source s(`).

B. Baseline Method

The baseline method is [16], where the speech signal and theRTF are modelled as hidden variables, but the covariances aretreated as deterministic parameters. For comparison, we alsoevaluated the performance obtained by our previous algorithmin [24]. In [24], the noise is assumed to be a spatially homoge-neous and spherically diffuse sound field, with a time-invariantcovariance matrix modelled by a known spatial coherencematrix multiplied by an unknown noise power. Assuming aknown microphone array geometry, the spatial coherence bet-ween microphones i, j is given by: Γij(k) = sinc

(2πfskK

dijc

),

where fs is the sampling rate, dij is the inter-distance betweenmicrophones i and j and c is the sound velocity. This diffuse-based method, which estimates only the power of the noise,will be referred to as Prop. (diff.). The method proposed inthis paper, which estimates the entire noise precision matrix,is denoted henceforth as Prop. (mat.).

The results of the MCWF in (32) with true parametersare also presented, in order to illustrate the efficiency of theproposed method. We refer to this algorithm as the oraclealgorithm, since it knows a priori the RTF, the variance ofthe speaker and the covariance of the noise, information that



9

is not available to the fully blind VEM algorithm. For thismethod, the RTF and the speech variance were estimated fromthe clean microphone signal, while the noise covariance wasestimated from the noise signal. The results obtained by thisoracle method indicates the best attainable results.

C. Initialization

The proposed VEM algorithm requires the initialization ofthe prior parameters{

µa(k),Φa(k), a0(`, k), b0(`, k), ν0(k),W0(k)}L,K`,k=1

.

as well as the posterior parameters{s(`, k),Σs(`, k), ap(`, k), bp(`, k), νp(k),Wp(k)

}L,K`,k=1

.

The proposed scheme has the great advantage of converging tothe correct solution in a fully blind setup, where all initializa-tions are either based on the observations or drawn randomly.The speech estimator was set to s(`, k) = x1(`, k), i.e. thenoisy signal at the first microphone, and the posterior variancewas set to Σs(`, k) = 0. The prior channel parameters wereinitialized with µa(k) = 1,Φa(k) = εI, where ε = 0.1. Thesource precision parameters were initialized with a0(`, k) =100, b0(`, k) = 10−8, bp(`, k) = b0(`, k) + |x1(`, k)|2, andap(`, k) was drawn randomly. The noise precision matrix pa-rameters were initialized with ν0(k) = 10, W0(k) = δI where

δ = 0.01, Wp(k) =

(W−1

0 (k) +L∑=1

x(`)xH(`)

)−1

and

νp(k) was drawn randomly. The initialization for Newton’smethod was ν(0)

0 = N . Similar initialization was used for thebaseline algorithm.

D. Simulations

1) Simulation Setup: To evaluate the performance of theproposed algorithm, we simulated the following static scena-rio. We downloaded time-invariant room impulse responses(RIRs) from the RIR database presented in [37]. The impulseresponses were recorded in the 6 × 6 × 2.4 m acoustic labof the Engineering Faculty at Bar-Ilan University (BIU). Thereverberation time was set to T60 = 160 ms and T60 = 360 msby configuring 60 dedicated panels attached to the roomwalls, ceiling and floor. The measurements were recordedusing an eight microphones uniform linear array (ULA) withmicrophone spacing of 8 cm. The loudspeaker was located inthe end-fire direction (angle of 90◦), at 1 m distance from thearray. The room setup is illustrated in Fig. 2. To construct themicrophone signals, we convolved the clean speech signals offive male and five female speakers from the TIMIT database[38] with the corresponding RIR, where each speaker uttersentence of 3−5 sec long. For the additive noise, we examinethree different types of noise sources: i) Speech-like Diffusenoise: An artificial diffuse noise with speech-like spectrumwas created by applying the method presented in [39] on astationary noise signal with a speech-like spectrum; ii) BabbleDiffuse noise: Applying [39] on a babble noise signal from theNOISEX-92 database [40]; and iii) Directional noise: A noisesource was positioned at a distance of 1 m from the array at

90◦

45◦

0◦

−45◦

−90◦

Speaker

Noise Source

1 m

Fig. 2: Geometric setup.

angle of 45◦, as illustrated in Fig. 2. To generate a directionalnoise signal we convolved a noise signal (with speech-likespectrum) with the corresponding RIR. We also added anadditional white noise with a directional noise-to white noiseratio of 15 dB, in order to guarantee the invertibility of thenoise covariance matrix. Finally, the microphone signals weresynthesized by adding the noise signal to the reverberantspeech signals with several SNR levels.

The following values of the parameters were common forboth simulation and real recordings. The sampling rate wasset to 16 kHz, the STFT was computed with windows of128 ms and 32 ms overlap between adjacent time frames. Thenumber of frequency bins was 2048. The VEM algorithm wasimplemented using 100 iterations. For the Newton search, 5iterations were applied.

All experiments described in this paper were run in MAT-LAB R2016b on a HP Compaq Elite 8300 PC, with an Intel4-core (8 threads) i7-3770 CPU at 3.4 GHz and 16 GB RAM.In order to process a 2 sec recording at sampling rate of16 kHz with 8 microphones, the average running time ofthe matrix-based proposed algorithm is 31 sec per iteration,the proposed diffuse-based algorithm [24] requires 30 secper iteration, while the baseline method [16] requires 13 secper iteration. The computational complexity of the proposedmethod is therefore about 2.4 times higher than that of thebaseline method.

2) Results of Simulations: The performance measures werecalculated by averaging over 10 trials with different speakers.Tables I, II, and III summarize the results for the speech-like diffuse noise case, the babble diffuse noise case andthe directional noise case, respectively. NPM results aresummarized in Tables IV, V and VI. The best results arehighlighted in boldface. The proposed method outperformsthe baseline algorithm for all noise types in all SNR levels,and for both reverberation levels. It can also be noted that inthe diffuse noise case, the proposed matrix-based method isslightly superior to the diffuse-based method [24], except forthe case of babble noise with SNR = 0dB. However, in thedirectional noise case, the wrong diffuseness assumption of[24] degrades the results significantly.

E. Real Recordings

1) Real Recordings Setup: The performance of the propo-sed method was also verified on real-life recordings, demon-strating the feasibility of the algorithm in real conditions. Therecordings were taken in the BIU acoustic lab. The signals



10

TABLE I: Quality Measures for Simulations with Speech-LikeDiffuse Noise

T60 = 160msec T60 = 360msec

Alg.\SNR 0dB 5dB 10dB 0dB 5dB 10dB

PESQ

Unprocessed 1.3 1.5 1.8 1.3 1.6 2.0Baseline [16] 1.8 1.9 2.0 1.8 2.0 2.0

Prop. (diff.) [24] 2.1 2.6 3.1 2.1 2.6 3.1Prop. (mat.) 2.2 2.9 3.4 2.4 3.2 3.6

Oracle MCWF 3.5 3.7 3.8 3.4 3.4 3.5

LSD



Oracle MCWF 1.7 1.2 0.9 2.4 2.0 1.8

TABLE II: Quality Measures for Simulations with BabbleDiffuse Noise

T60 = 160msec T60 = 360msec


PESQ



Oracle MCWF 3.6 3.8 3.9 3.5 3.6 3.6

LSD



Oracle MCWF 1.2 0.9 0.7 2.0 1.8 1.7

TABLE III: Quality Measures for Simulations with DirectionalNoise

T60 = 160msec T60 = 360msec


PESQ



Oracle MCWF 2.7 3.0 3.4 2.7 3.1 3.4

LSD


Prop. (diff) [24] 7.3 4.8 3.6 7.8 5.3 4.0Prop. (mat.) 3.6 2.7 2.7 3.8 3.2 3.1

Oracle MCWF 3.3 2.5 1.8 4.0 3.2 2.4

TABLE IV: NPM Results for Simulations with Speech-LikeDiffuse Noise

T60 = 160msec T60 = 360msec


Baseline [16] -1.6 -2.6 -3.3 -1.6 -2.4 -2.8Prop. (diff.) [24] -3.6 -5.4 -6.7 -3.7 -5.4 -6.6

Prop. (mat.) -5.3 -7.5 -8.7 -5.3 -7.2 -8.0

TABLE V: NPM Results for Simulations with Babble DiffuseNoise

T60 = 160msec T60 = 360msec



Prop. (mat.) -6.1 -7.3 -7.8 -5.9 -6.8 -6.9

TABLE VI: NPM Results for Simulations with DirectionalNoise

T60 = 160msec T60 = 360msec



Prop. (mat.) -4.6 -7.4 -8.5 -4.8 -6.8 -7.5

were measured by the CK32 omnidirectional microphones ofAKG. Eight microphones were mounted on a metal ruler,with inter-distances of [8, 6, 4, 3, 3, 4, 6] cm. For the recordingwe used the Hammerfall HDSPe MADI audio card of RME,where an ANDIAMO.MC acquisition panel was used as amicrophone preamp and A/D converter. A Fostex 6301BXloudspeaker was utilized as a noise source. The signals weremeasured with 24-bits resolution and 48kHz sampling rate,and then were downsampled to 16 kHz, according to thespeech signals bandwidth. The room panels were adjustedto create two reverberation levels, namely T60 = 200 ms orT60 = 400 ms. A first series of experiments was conductedwith speech utterances that were drawn from the TIMITdatabase [38]. Real human speakers were first emulated usinga Head and Torso Simulator (HATS) mannequin with built-in mouth (Type 4128C- B&K), which emulates the acousticproperties of an average human head and torso. The utteranceswere played in the room using the mannequin, which waslocated at an angle of 90◦, a 1 m distance from the array.The room layout is illustrated in Fig. 3(a), and a picture ofthe room setup and the equipment is shown in Fig. 3(b). Tendifferent female and male speakers were used, each uttering3 − 5 sec long signal. Then, a second series was conductedwith real-life speakers. This series consists of one male andone female speakers, which were asked to utter speech in staticand dynamic scenarios. In the static scenario, the speaker waspositioned 1 m in front of the array, at angle of 90◦, andwas asked to utter a single sentence. We also examined thecapability of the proposed method to track a time-varying RTFseries of a moving speaker. We devised a dynamic scenario,in which the speaker was asked to utter speech while movingon the trajectory drawn in Fig. 3(a). The speaker walked quitequickly and the entire movement lasted 15 s.

For the noise, we examined two different types of recordednoise: i) Air-conditioner noise; and ii) Directional-noise: Noisesignal with a speech-like spectrum from the NOISEX-92database was played from a loudspeaker located at a distance



11

Rack

)60X80)

Window

Door

1m 90◦

-45◦

2m

Air-conditioner

(a)

Microphones

Air-conditioner

Loudspeaker

Andiamo.mc

Mannequin

Room panels

(b)

Fig. 3: (a) The room sketch: microphone locations are depictedwith blue ‘x’ marks, position of the speaker is marked by redcircle, and position of noise loudspeaker is marked by orangesquare. The trajectory for the dynamic scenario is depicted bygreen dash lines. (b) A photograph of the room.

of 2 m from the array at angle of −45◦, as illustrated inFig. 3(a). The noise signals were recorded separately and thenwere added with various SNR levels to the measured speechsignals.

2) Results of Static Real-Life Speakers: PESQ and LSDscores for the human speakers in the static scenario arepresented in Tables VII and VIII for the air-conditioner noisecase and the directional noise case, respectively. Each reportedmeasure is the average over the two speakers. It can be seenthat the proposed method outperforms the baseline method,and obtains a higher speech quality and lower distortionlevel for both noise types. The advantage of the proposed

TABLE VII: Quality Measures for Real-Life Static Speakerswith Air-Conditioner Noise

T60 = 200msec T60 = 400msec


PESQ



Oracle MCWF 3.7 3.9 4.0 3.3 3.6 3.7

LSD



Oracle MCWF 1.5 1.3 1.2 2.2 1.9 1.7

TABLE VIII: Quality Measures for Real-Life Static Speakerswith Directional Noise

T60 = 200msec T60 = 400msec


PESQ



Oracle MCWF 3.2 3.5 3.7 3.3 3.5 3.7

LSD



Oracle MCWF 1.7 1.4 1.1 2.1 1.8 1.6

method is clearly demonstrated in all considered scenarios.Similar trends are obtained for both simulation and real-liferecordings. It can also be noted that the proposed matrix-basedmethod is superior to the diffuse-based method [24], especiallyfor the directional noise case, except for the LSD measurein T60 = 400msec. Fig. 4 depicts sonogram examples ofthe various signals for air-conditioner noise with input SNRof 0 dB at T60 = 200 ms. Fig. 4(a) shows s, the cleanspeech signal as received by the first microphone. Fig. 4(b)depicts x1, the noisy signal at the first microphone. Theestimated speech signal obtained by the baseline method [16]is shown in Fig. 4(c). The output of the proposed diffuse-based method [24] is depicted in Fig. 4(d). Fig. 4(e) depicts s,the estimated speech of the matrix-based proposed algorithm.We can conclude that the proposed algorithm reduces noisewhile maintaining low distortion. In contrast, the estimatorproposed by [16] provides aggressive noise reduction, at thecost of severe speech distortion. The proposed diffuse-basedmethod [24] produces less speech distortion than the baselinemethod [16]. However, it is still slightly inferior compared tothe matrix-based proposed algorithm. Informal listening testsverify that the output of the proposed method is a more naturaldenoised speech, in comparison to the output of the baselinemethod [16]. Audio examples can be found in our website.1

1http://www.eng.biu.ac.il/gannot/speech-enhancement/



12

0

2

4

6

8

Freq

uenc

y [K

Hz]

-40

-30

-20

-10

0

10

20

30

0 0.5 1 1.5 2 2.5 3 3.5 4

Time [Sec]

-1

0

1

Am

plitu

de

(a) Clean speech at microphone #1.

0

2

4

6

8

Freq

uenc

y [K

Hz]

-40

-30

-20

-10

0

10

20

30

0 0.5 1 1.5 2 2.5 3 3.5 4

Time [Sec]

-1

0

1

Am

plitu

de

(b) Noisy signal at microphone #1.

0

2

4

6

8

Freq

uenc

y [K

Hz]

-40

-30

-20

-10

0

10

20

30

0 0.5 1 1.5 2 2.5 3 3.5 4

Time [Sec]

-1

0

1

Am

plitu

de

(c) Output of [16].

0

2

4

6

8

Freq

uenc

y [K

Hz]

-40

-30

-20

-10

0

10

20

30

0 0.5 1 1.5 2 2.5 3 3.5 4

Time [Sec]

-1

0

1

Am

plitu

de

(d) Output of the proposed diffuse-based algorithm [24].

0

2

4

6

8

Freq

uenc

y [K

Hz]

-40

-30

-20

-10

0

10

20

30

0 0.5 1 1.5 2 2.5 3 3.5 4

Time [Sec]

-1

0

1

Am

plitu

de

(e) Output of the matrix-based proposed algorithm.

Fig. 4: Sonograms of a real-life example, with T60 = 200 ms and SNR of 0 dB.

The results obtained for the TIMIT utterances that wereplayed from the mannequin are pretty similar, and wereomitted here due to lack of space.

3) Results of Moving Speakers: Table IX reports the PESQand LSD scores for a moving speaker in the air-condition noisecase and the directional noise case. Despite the challengingscenario, the proposed algorithm produces a high-quality es-timate of the speech signal.

In order to further demonstrate the quality of the estimatedtime-varying RTFs, we use it to estimate the direction of arrival(DOA) of the moving speaker. We first find the peak of therelative AIR (obtained by applying the inverse FFT to theRTF). Assuming a low reverberation level, the peak locationcorresponds to the time difference of arrival (TDOA) w.r.t. the

reference microphone, and thus can be utilized for estimatingthe source DOA. The tracking ability is illustrated in Fig. 5,where the estimated DOA is presented as a function of timecompared to the true trajectory, which was estimated fromthe clean measured signal using the peak of an instantaneousleast-squares (LS) estimate of the RTF.

F. Analysis of the Noise Covariance Estimation

In this section we briefly discuss the benefit of estimatingthe entire noise covariance matrix, rather than assuming adiffuse spatial coherence matrix and estimating only the noisepower as in [24]. On the one hand, the proposed diffuse-basedmethod [24] offers slightly lower computational complexitycompared to the proposed matrix-based method, due to the



13

TABLE IX: Quality Measures for Real-Life Moving Speaker,T60 = 200 msec

Noise Type Air-conditioner Directional


PESQ



Oracle MCWF 3.9 4.1 4.3 3.0 3.4 3.7

LSD



Oracle MCWF 1.6 1.2 0.9 2.2 1.7 1.3

0 2 4 6 8 10 12 14Time [sec]

-80

-60

-40

-20

0

20

40

60

80

DO

A [

deg]

TrueProp. (mat.)Baseline

Fig. 5: Estimated DOA of a moving speaker.

reduced number of parameters to be estimated. On the otherhand, from the quality measures in Tables I-IX it can beobserved that the estimation of the entire noise covariancematrix yields better results than estimating only the powerof the noise, especially for the directional noise case, exceptfor the LSD in T60 = 400msec. This can be explained bythe fact that the diffuse model becomes more accurate as thereverberation level increases.

VI. CONCLUSIONS

In this contribution, we presented a fully Bayesian hierar-chical framework for the challenging problem of blind multi-channel speech enhancement with time-varying audio channel.The model extends [16] to include the speech precision andthe noise precision matrix as part of the hidden data. The set oflatent variables consists of the speech signal (as received by areference microphone), the RTF, the speech precision and thenoise precision matrix. The inference of the hidden variablesis performed using a VEM algorithm, leading to a variantof MCWF for estimating the source and a Kalman smootherfor the time-varying acoustic channel. The speech estimatorwas decomposed into an MVDR beamformer followed by avariational postfilter. The proposed algorithm is fully-blind anduses only the available data. The discussion is supported by

an extensive experimental study based on both simulated dataand real-life recording, for static and dynamic scenarios. Theproposed method outperforms the baseline method in termsof both objective speech quality scores and informal listeningtests. In terms of NPM and tracking ability, the proposedalgorithm offers a superior channel estimate compared to [16].

VII. APPENDIX

In this appendix, we detail the recursions for the forwardand backward passes, as well as the pairwise statistics for thejoint posterior distribution of two successive frames.

1) Forward pass: The forward distribution writes [13]p(a(`),µL(1:`)

)= Nc

(a(`); af (`),Φf (`)

), where

Φf (`) =[Φ−1L (`) +

(Φa + Φf (`− 1)

)−1]−1

, (70)

af (`) = Φf (`)[Φ−1L (`)µL(`)

+(Φa + Φf (`− 1)

)−1af (`− 1)

], (71)

with the initialization

Φf (1) =(Φ−1L (1) + Φ−1

a

)−1, (72)

af (1) = Φf (1)(Φ−1L (1)µL(1) + Φ−1

a µa). (73)

2) Backward pass: The backward distribution writesp(µL(`+ 1:L)|a(`)

)= Nc

(a(`); aβ(`),Φβ(`)

), where

Φi(`) =(Φ−1L (`+ 1) + Φ−1

β (`+ 1))−1

, (74)

Φβ(`) = Φa + Φi(`), (75)

aβ(`) = Φi(`)(Φ−1L (`+ 1)µL(`+ 1)

+ Φ−1β (`+ 1)aβ(`+ 1)

). (76)

The backward recursion starts from the last time step L,with the last update from the forward pass as initialization:Φβ(L) = Φf (L), aβ(L) = af (L).

3) Marginal posterior: The forward and backward proba-bility distributions are then combined to obtain the posteriormarginal distribution:

q(a(`)

)∝ p(a(`),µL(1:`)

)p(µL(`+ 1:L)|a(`)

). (77)

4) Joint posterior distribution of two consecutive frames:Marginalizing out all other frames in (36) leads to a Gaussiandistribution, with the following covariance and mean:

Σξ(`) =

[Φ−1i (`− 1) + Φ−1

a −Φ−1a

−Φ−1a Φ−1

f (`− 1) + Φ−1a

]−1

, (78)

aξ(`) = Σξ(`)

[Φ−1i (`− 1)aβ(`− 1)

Φ−1f (`− 1)af (`− 1)

]. (79)

REFERENCES

[1] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement usingbeamforming and nonstationarity with applications to speech,” IEEETran. on Signal Processing, vol. 49, no. 8, pp. 1614–1626, 2001.

[2] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A con-solidated perspective on multimicrophone speech enhancement andsource separation,” IEEE/ACM Tran. on Audio, Speech, and LanguageProcessing, vol. 25, no. 4, pp. 692–730, 2017.



14

[3] M. Souden, J. Chen, J. Benesty, and S. Affes, “Gaussian model-basedmultichannel speech presence probability,” IEEE Tran. on Audio, Speech,and Language Processing, vol. 18, no. 5, pp. 1072–1077, 2010.

[4] O. Shalvi and E. Weinstein, “System identification using nonstationarysignals,” IEEE Tran. on Signal Processing, vol. 44, no. 8, pp. 2055–2063, 1996.

[5] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspacebeamforming in a reverberant noisy environment with multiple inter-fering speech signals,” IEEE Tran. on Audio, Speech, and LanguageProcessing, vol. 17, no. 6, pp. 1071–1086, 2009.

[6] I. Cohen, S. Gannot, and B. Berdugo, “An integrated real-time beam-forming and postfiltering system for nonstationary noise environments,”EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 1064–1073, 2003.

[7] I. Cohen, “Relative transfer function identification using speech signals,”IEEE Tran. on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.

[8] C. Fevotte and J.-F. Cardoso, “Maximum likelihood approach for blindaudio source separation using time-frequency Gaussian source models,”in IEEE Workshop on Applications of Signal Processing to Audio andAcoustics (WASPAA), 2005, pp. 78–81.

[9] E. Vincent, S. Arberet, and R. Gribonval, “Underdetermined instanta-neous audio source separation via local Gaussian modeling.” in Inter-national Conference on Independent Component Analysis and SignalSeparation (ICA), 2009, pp. 775–782.

[10] N. Q. Duong, E. Vincent, and R. Gribonval, “Under-determined reverbe-rant audio source separation using a full-rank spatial covariance model,”IEEE Tran. on Audio, Speech, and Language Processing, vol. 18, no. 7,pp. 1830–1840, 2010.

[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihoodfrom incomplete data via the EM algorithm,” Journal of the royalstatistical society. Series B (methodological), pp. 1–38, 1977.

[12] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signalprocessing magazine, vol. 13, no. 6, pp. 47–60, 1996.

[13] C. M. Bishop, Pattern Recognition and Machine Learning. New York,NY, USA: Springer, 2006.

[14] T. S. Jaakkola, “Variational methods for inference and estimation ingraphical models,” Ph.D. dissertation, Massachusetts Institute of Techno-logy, 1997.

[15] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, “The variational ap-proximation for Bayesian inference,” IEEE Signal Processing Magazine,vol. 25, no. 6, pp. 131–146, 2008.

[16] S. Malik, J. Benesty, and J. Chen, “A Bayesian framework for blindadaptive beamforming,” IEEE Tran. on Signal Processing, vol. 62, no. 9,pp. 2370–2384, 2014.

[17] D. Schmid, G. Enzner, S. Malik, D. Kolossa, and R. Martin, “Vari-ational Bayesian inference for multichannel dereverberation and noisereduction,” IEEE/ACM Tran. on Audio, Speech and Language Processing(TASLP), vol. 22, no. 8, pp. 1320–1335, 2014.

[18] A. Jukic, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Speech dere-verberation with convolutive transfer function approximation using mapand variational deconvolution approaches,” in International Workshopon Acoustic Signal Enhancement (IWAENC), 2014, pp. 50–54.

[19] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, andR. Horaud, “A variational EM algorithm for the separation of time-varying convolutive audio mixtures,” IEEE/ACM Tran. on Audio, Speech,and Language Processing, vol. 24, no. 8, pp. 1408–1423, 2016.

[20] K. Adiloglu and E. Vincent, “Variational bayesian inference for sourceseparation and robust feature extraction,” IEEE/ACM Tran. on Audio,Speech, and Language Processing, vol. 24, no. 10, pp. 1746–1758, 2016.

[21] K. U. Simmer, J. Bitzer, and C. Marro, “Post-filtering techniques,” inMicrophone arrays. Springer, 2001, pp. 39–60.

[22] R. Balan and J. Rosca, “Microphone array speech enhancement byBayesian estimation of spectral amplitude and phase,” in IEEE SensorArray and Multichannel Signal Process. Workshop, 2002, pp. 209–213.

[23] L. Girin and R. Badeau, “On the use of latent mixing filters in audiosource separation,” in International Conference on Latent VariableAnalysis and Signal Separation. Springer, 2017, pp. 225–235.

[24] Y. Laufer and S. Gannot, “A Bayesian hierarchical model for speechenhancement,” in IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP), 2018.

[25] R. A. Wooding, “The multivariate distribution of complex normalvariables,” Biometrika, vol. 43, no. 1/2, pp. 212–215, 1956.

[26] N. R. Goodman, “Statistical analysis based on a certain multivariatecomplex gaussian distribution (an introduction),” The Annals of mathe-matical statistics, vol. 34, no. 1, pp. 152–177, 1963.

[27] F. D. Neeser and J. L. Massey, “Proper complex random processes withapplications to information theory,” IEEE transactions on informationtheory, vol. 39, no. 4, pp. 1293–1302, 1993.

[28] S. N. Anfinsen, A. P. Doulgeris, and T. Eltoft, “Estimation of theequivalent number of looks in polarimetric synthetic aperture radarimagery,” IEEE Tran. on Geoscience and Remote Sensing, vol. 47,no. 11, pp. 3795–3809, 2009.

[29] C. Fevotte and S. J. Godsill, “A Bayesian approach for blind separa-tion of sparse sources,” IEEE Tran. on Audio, Speech, and LanguageProcessing, vol. 14, no. 6, pp. 2174–2188, 2006.

[30] S. Leglaive, R. Badeau, and G. Richard, “Student’s t source andmixing models for multichannel audio source separation,” IEEE/ACMTransactions on Audio, Speech, and Language Processing, 2018.

[31] G. Parisi, Statistical field theory. Addison-Wesley, 1988.[32] H. L. Van Trees, Optimum array processing: Part IV of detection,

estimation and modulation theory. Wiley, 2002.[33] O. Schwartz, S. Gannot, and E. A. P. Habets, “Multi-microphone speech

dereverberation and noise reduction using relative early transfer functi-ons,” IEEE/ACM Tran. on Audio, Speech and Language Processing,vol. 23, no. 2, pp. 240–251, 2015.

[34] S. Boyd and L. Vandenberghe, Convex optimization. Cambridgeuniversity press, 2004.

[35] I.-T. Recommendation, “Perceptual evaluation of speech quality (PESQ):An objective method for end-to-end speech quality assessment ofnarrow-band telephone networks and speech codecs,” Rec. ITU-T P. 862,2001.

[36] D. R. Morgan, J. Benesty, and M. M. Sondhi, “On the evaluation ofestimated impulse responses,” IEEE Signal processing letters, vol. 5,no. 7, pp. 174–176, 1998.

[37] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audiodatabase in various acoustic environments,” in International Workshopon Acoustic Signal Enhancement (IWAENC), 2014, pp. 313–317.

[38] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett,“DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.NIST speech disc 1-1.1,” Disc, 1993.

[39] E. A. P. Habets and S. Gannot, “Generating sensor signals in isotropicnoise fields,” The Journal of the Acoustical Society of America, vol. 122,no. 6, pp. 3464–3470, 2007.

[40] A. Varga and H. J. Steeneken, “Assessment for automatic speechrecognition: II. NOISEX-92: A database and an experiment to studythe effect of additive noise on speech recognition systems,” Speechcommunication, vol. 12, no. 3, pp. 247–251, 1993.



A Bayesian Hierarchical Model for Speech Enhancement with … · 2021. 4. 21. · types and...

Documents

Transcript of A Bayesian Hierarchical Model for Speech Enhancement with … · 2021. 4. 21. · types and...