Speech Enhancement Based on Adaptive Line Enhancer
Research Thesis
Aviva Atkins
April 7th, 2020
Supervised by Prof. Israel Cohen
Outline
▪ Introduction
▪ The problem researched
▪ The challenges
▪ Research contributions
▪ Adaptive Line Enhancer background
▪ Conventional fixed step size
▪ Mutual Information approach
▪ Proposed method
▪ Conclusions and future research
Sound test
Noise is Everywhere!
[Illustration: a source degraded by interference, reverberation, echo, and additive noise]
Speech Enhancement
Applications
Harmonic noise
▪ Contains deterministic sinusoidal components
The problem researched
Reducing nonstationary harmonic noise from a speech signal recorded with a single microphone
[Illustration: a source signal corrupted by nonstationary harmonic additive noise]
The challenges▪ Single channel – only the noisy signal is available with no access to additional reference signals and no spatial information
→only intrinsic properties of speech or noise can be used
▪ The vast majority of methods require an estimate of the noise spectrum
▪ When the noise is stationary, it can be estimated during segments when speech is absent
▪ When the noise is nonstationary it needs to be tracked continuously
→it is more difficult to estimate nonstationary noise
▪ Trade-off between noise reduction and speech distortion
▪ The developed method needs to be relevant for real-time applications
Research Contributions
▪ Introduced a filtering method based on the frequency-domain Adaptive Line Enhancer that enables better reduction of nonstationary harmonic noise
▪ Proposed the combined filter – a combination of the commonly-used forward adaptive linear filter and a non-causal backward adaptive linear filter used together, increasing the reduction span of the noise transient
▪ Applied the filter based on a comparison to the noisy spectrum, reducing noise overestimation
▪ Applied the filter based on a noise presence indicator for better speech preservation
▪ Employed a set of filter lengths, to ensure the combined filter spans the entire noise transient
Additional contributions
▪ Investigated a statistical model as an alternative to the Decision-Directed a-priori SNR estimator, and showed that it can eliminate the musical noise while compromising between signal distortion and noise reduction.
▪ Introduced a beamformer that enables fine tuning of the compromise between the Directivity Factor and the White Noise Gain, through a simple, computationally efficient algorithm.
Why use the Adaptive Line Enhancer?
▪ Exploits the structure of the harmonic noise
▪ Simple, with low computational cost
▪ Modifies both magnitude and phase, so it has the potential to improve signal intelligibility and not just quality
Adaptive Noise Canceller (ANC)

[Block diagram: the primary input x(n) + v(n) carries the signal plus noise; a reference input v₀(n), correlated with the noise, passes through an adaptive filter to produce the noise estimate v̂(n); subtracting v̂(n) from the primary input gives the output x̂(n) = e(n), which also drives the adaptive algorithm. The slide marks possible distortion d(n) leaking into the inputs, and marks the reference input with a question mark.]

x̂ = x + v − v̂
min E[x̂²] = E[x²] + min E[(v − v̂)²]
min E[(x − x̂)²] = min E[(v − v̂)²]
Ideal case: v̂ = v, x̂ = x
Adaptive Line Enhancer (ALE)

[Block diagram: the input y(n) = x(n) + v(n) is delayed (z⁻ᵗ) to form z(n), which feeds the adaptive filter; the filter output x̂(n) is subtracted from y(n) to form the error e(n), which drives the adaptive algorithm. Both e(n) and x̂(n) serve as possible outputs.]

The delay decorrelates one component while the other remains correlated:
▪ Signal decorrelated, noise correlated → the filter predicts the noise
▪ Noise decorrelated, signal correlated → the filter predicts the signal
Adaptive Line Enhancer (ALE)

The same structure is moved from the time domain (TD) to the frequency domain (FD), operating per STFT bin: the input Y(k,m) = X(k,m) + V(k,m) is delayed by τ frames, the adaptive filter produces Z(k,m), and the error E(k,m) = Y(k,m) − Z(k,m) serves as the estimate X̂(k,m).
Adaptive Line Enhancer (ALE)

In the frequency domain, with filter length L, delay τ, and step size μ:

Z(k,m) = hᴴ(k,m) y(k,m−τ)
h(k,m) = [H₀(k,m), …, H_{L−1}(k,m)]ᵀ
y(k,m−τ) = [Y(k,m−τ), …, Y(k,m−τ−L+1)]ᵀ

NLMS: h(k,m+1) = h(k,m) + μ E*(k,m) y(k,m−τ) / ‖y(k,m−τ)‖²
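The per-bin filtering and NLMS update above can be sketched in numpy; this is a minimal illustration, not the thesis implementation – the function name, the zero-padding of the first frames, and the regularization constant `delta` are my own choices.

```python
import numpy as np

def fd_ale_nlms(Y, tau=1, L=3, mu=0.1, delta=1e-8):
    """Frequency-domain ALE: per-bin NLMS prediction of the correlated
    (harmonic-noise) component of the STFT Y, shape (K bins, M frames).
    Returns the error E(k, m) = Y(k, m) - Z(k, m)."""
    K, M = Y.shape
    H = np.zeros((K, L), dtype=complex)          # per-bin filter h(k, m)
    E = np.zeros_like(Y)
    for m in range(M):
        # regressor y(k, m - tau) = [Y(k,m-tau), ..., Y(k,m-tau-L+1)]
        cols = [m - tau - l for l in range(L)]
        y = np.stack([Y[:, c] if c >= 0 else np.zeros(K, complex) for c in cols],
                     axis=1)
        Z = np.sum(np.conj(H) * y, axis=1)       # Z(k, m) = h^H y
        E[:, m] = Y[:, m] - Z
        norm = delta + np.sum(np.abs(y) ** 2, axis=1)
        # NLMS: h(k, m+1) = h(k, m) + mu E*(k, m) y(k, m - tau) / ||y||^2
        H += mu * (np.conj(E[:, m]) / norm)[:, None] * y
    return E
```

When the input is strongly correlated across frames (as a harmonic noise is), the filter converges and the error in those bins goes to zero.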
Conventional Fixed Step Size Example
For the conventional fixed step size, it is difficult to both reduce the noise and maintain high quality of the enhanced signal
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal; L = 3, τ = 1]
Mutual Information Approach
Taghia, J., Martin, R., 2016, "A frequency-domain adaptive line enhancer with step-size control based on mutual information for harmonic noise reduction," IEEE Trans. Audio Speech Lang. Process.
▪ Frequency dependent step size, detecting harmonic noise presence per frequency
▪ Based on Mutual Information (MI)
▪ Step size: μ(k) = μ₀ Q(k), with μ₀ a constant and Q(k) a binary decision per frequency bin:

Q(k) = 1 if I_P(k) ≥ I_thr, and 0 otherwise

where I_P(k) is the mutual information estimated at bin k, normalized by the total mutual information over all K bins.
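As a rough illustration of the step-size control, the sketch below gates each bin by a normalized mutual-information score. The histogram-based MI estimator and the threshold rule `thr / K` are crude stand-ins for the estimator and threshold of Taghia and Martin, not their actual method.

```python
import numpy as np

def hist_mi(a, b, bins=8):
    """Plug-in mutual information (nats) between two real sequences via a
    joint histogram -- a crude stand-in for the paper's MI estimator."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_step_sizes(Y, Z, mu0=0.1, thr=0.5):
    """Per-bin step size mu(k) = mu0 * Q(k): a bin is adapted only when
    its normalized MI exceeds a threshold set relative to 1/K."""
    K = Y.shape[0]
    I = np.array([hist_mi(np.abs(Y[k]), np.abs(Z[k])) for k in range(K)])
    I_P = I / max(I.sum(), 1e-12)          # normalize by the total MI
    Q = (I_P >= thr / K).astype(float)     # binary decision per bin
    return mu0 * Q
```

Bins whose filter output is statistically dependent on the input (i.e., contain predictable harmonic noise) receive the step size μ₀; the rest are frozen.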
MI Approach Example
[Plots: (b) Noisy-signal spectrogram (frequency index vs. frame index); (c) MI step size μ vs. frequency [kHz]; Q = 1]
MI Approach Example
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Enhanced signal – fixed step size, (c) Enhanced signal – MI]
MI Approach
▪ Implemented in a block-wise manner
▪ Assumption: the noise is stationary over at least the block length
▪ Taghia and Martin use a block length of 3 seconds
The assumption does not hold for highly non-stationary signals, such as heart-monitor beeping: the decision Q(k) is then often zero.
[Spectrogram of 3.4 s of heart-monitor beeping]
MI Approach Example – Non-stationary
[Plots: (a) Noisy-signal spectrogram (frequency index vs. frame index); (b) MI step size μ vs. frequency [kHz]; Q = 0]
MI Approach Example – Non-stationary
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI, with Q ignored]
Non-Stationary Noise – Filter Output Estimate

[Block diagram as before: input y(n) = x(n) + v(n), delay z⁻ᵗ, adaptive filter, error e(n). For non-stationary noise, can the filter output serve as the noise estimate v̂(n), so that x̂(n) = y(n) − v̂(n)?]
Experimental Setup
▪ Clean speech: 20 different speech signals from different speakers from the TIMIT database (50% male, 50% female)
▪ Sampled at 16 kHz
▪ SNR range [0,20] dB
▪ STFT, overlap-add
▪ Noise: 26 different non-stationary harmonic noise signals, e.g., heart monitor beeping, train door beeping, house alarm, railroad crossing bells.
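The STFT analysis/synthesis chain used throughout can be sketched as a weighted overlap-add pair; this sketch uses a Hann window and illustrative frame sizes, and is not the thesis code.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """STFT with a Hann analysis window (75% overlap for hop = n_fft/4)."""
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(X, n_fft=512, hop=128):
    """Weighted overlap-add inverse using the same Hann synthesis window."""
    w = np.hanning(n_fft)
    frames = np.fft.irfft(X, n=n_fft, axis=1) * w
    n = hop * (len(frames) - 1) + n_fft
    y = np.zeros(n)
    wsum = np.zeros(n)
    for i, f in enumerate(frames):
        y[i * hop:i * hop + n_fft] += f
        wsum[i * hop:i * hop + n_fft] += w ** 2
    return y / np.maximum(wsum, 1e-12)   # normalize by accumulated window power
```

Normalizing by the accumulated squared window makes the pair give (near-)perfect reconstruction away from the signal edges, so any enhancement applied to the STFT coefficients maps cleanly back to the time domain.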
Correlation

γ_X(k,m,τ) = E[X(k,m) x*(k,m−τ)] / E[|X(k,m)|²]
γ_V(k,m,τ) = E[V(k,m) v*(k,m−τ)] / E[|V(k,m)|²]

[Plot: correlation vs. lag τ [frames]; 1 frame = 32 ms]
Proposed Approach
▪ Combined filter (CMLNLMS):

E_c(k,m) =
  E_b(k,m+L),  if |E_b(k,m+L)|² ≤ |E_f(k,m)|² and |E_b(k,m+L)|² ≤ |Y(k,m)|²
  E_f(k,m),    if |E_b(k,m+L)|² > |E_f(k,m)|² and |E_f(k,m)|² ≤ |Y(k,m)|²
  Y(k,m),      else

[Diagram: the forward (F) and backward (B) filter outputs are combined into C]
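The piecewise selection rule can be written as a vectorized per-bin choice. In this sketch `Eb` is assumed to be already time-aligned (i.e., `Eb[k, m]` holds E_b(k, m+L)), and the function name is hypothetical.

```python
import numpy as np

def combined_filter(Y, Ef, Eb):
    """Per-bin combined filter: choose the backward error Eb, the forward
    error Ef, or the noisy input Y itself, per the piecewise rule above
    (never letting the output magnitude exceed |Y|)."""
    use_b = (np.abs(Eb) ** 2 <= np.abs(Ef) ** 2) & (np.abs(Eb) ** 2 <= np.abs(Y) ** 2)
    use_f = np.abs(Ef) ** 2 <= np.abs(Y) ** 2
    return np.where(use_b, Eb, np.where(use_f, Ef, Y))
```

Falling back to Y(k,m) whenever both errors exceed the noisy spectrum is what limits noise overestimation.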
Proposed Approach
▪ Harmonic-noise presence detector for better speech preservation:

I(k,m) = 1 if V(k,m) ∈ ℋ₀, 0 if V(k,m) ∈ ℋ₁

▪ A set of filters with increasing length, up to the maximal filter length L, based on the available number of noise samples

[Diagram: the forward (F) and backward (B) filter outputs are combined into C]
Performance Measures
▪ Distortion index 𝑣_sd
▪ Noise-reduction factor ξ_nr
▪ Perceptual Evaluation of Speech Quality (PESQ), ITU-T P.862.2
▪ Short-Time Objective Intelligibility (STOI)
Transient Reduction

[Plot: NRR [dB] vs. frame index; L = 3, τ = 3, indicator threshold −25 dB, μ = 0.5]

Better noise reduction, which leads to improved ξ_nr, PESQ, and STOI levels for the combined filter
Step Size
▪ An appropriate selection of the step size is required
▪ Fixed step size

[Plots: 𝑣_sd and NRR [dB] vs. frame index, comparing maximal, MI-based, and constant step sizes; μ = 0.5, delay τ in frames]
[Plots: PESQ, STOI, 𝑣_sd [dB], and ξ_nr [dB] vs. τ [frames]; μ = 0.5, indicator threshold −25 dB]

Combined & MI-Combined show better results than MI
[Plots: PESQ, STOI, 𝑣_sd [dB], and ξ_nr [dB] vs. τ [frames]; μ = 0.5, indicator threshold −25 dB, L_short = 1]

Recommendation:
Combined & MI-Combined show better results than MI
Experimental Results Summary
L = 3, τ = 1, indicator threshold −25 dB

[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI, (d) Enhanced signal – Combined]
Conclusions
▪ Introduced the combined filter
▪ Parameter selection
▪ Noise presence indicator impact
▪ Improved results compared to other methods
Future Research
▪ Noise presence indicator implementation
▪ Residual noise at transient edges
▪ Deep Learning approach for noise reduction
Speech Enhancement Using the ARCH Model
▪ We investigate the use of the autoregressive conditional heteroscedasticity (ARCH) model as a replacement for the well-known Decision-Directed estimator by Ephraim and Malah
▪ We employ three sound quality measures: speech distortion, noise reduction and musical noise, and explain the effect the ARCH model parameters have on these measures.
▪ We demonstrate that the ARCH model achieves better results than the decision-directed for some of these measures, while compromising between the speech distortion and noise reduction.
Problem Formulation
▪ Let Yℓ(k) = Xℓ(k) + Dℓ(k) denote an observed noisy speech signal in the STFT domain.
▪ Given an error function between the clean signal and its estimate, the spectral enhancement problem can be formulated as

X̂ℓ(k) = argmin_X E[ e(Xℓ(k), X(k)) | Y₀(k), …, Yℓ′(k) ]

▪ We consider the causal case ℓ′ ≤ ℓ and the LSA error function

e_LSA(Xℓ(k), X̂ℓ(k)) = ( log|Xℓ(k)| − log|X̂ℓ(k)| )²
Problem Formulation
▪ The estimate is obtained by applying a spectral gain to each noisy spectral component:

X̂ℓ = G_LSA(ξℓ|ℓ′) ⋅ Yℓ

where the a-priori and a-posteriori SNRs are defined, respectively, by

ξℓ|ℓ′ ≜ λℓ|ℓ′ / σℓ²,  γℓ ≜ |Yℓ|² / σℓ²

σℓ² = E[|Dℓ|²] denotes the short-term spectrum of the noise, and λℓ|ℓ′ = E[|Xℓ|² | Y₀(k), …, Yℓ′(k)] denotes the short-term spectrum of the speech signal.
Decision-Directed
Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, pp. 1109–1121, December 1984

▪ Over the past decades, the decision-directed (DD) approach has become the accepted estimation method for the a-priori SNR:

ξ̂ℓ|ℓ = max( α |X̂ℓ₋₁|²/σℓ² + (1 − α) P(γℓ − 1), ξ_min )

where P(x) = x if x ≥ 0 and P(x) = 0 otherwise.
▪ The decision-directed approach is not supported by a statistical model.
▪ α and ξ_min have to be determined by simulations.
▪ α and ξ_min are fixed constants and are not adapted to the speech components.
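A minimal single-bin sketch of the decision-directed recursion; for simplicity it uses the Wiener gain ξ/(ξ+1) as the amplitude estimator in place of the MMSE-STSA gain of Ephraim and Malah, and the default parameter values are illustrative only.

```python
import numpy as np

def decision_directed(Y2, sigma2, alpha=0.98, xi_min=10 ** (-15 / 10)):
    """Decision-directed a-priori SNR track for one frequency bin.
    Y2: |Y_l|^2 per frame; sigma2: noise PSD per frame.
    Uses the Wiener gain xi/(xi+1) as the amplitude estimator, a
    simplification of the MMSE-STSA gain."""
    gamma = Y2 / sigma2                          # a-posteriori SNR
    xi = np.empty_like(gamma)
    X2_prev = 0.0                                # |X_hat_{l-1}|^2
    for l, g in enumerate(gamma):
        xi[l] = max(alpha * X2_prev / sigma2[l]
                    + (1 - alpha) * max(g - 1.0, 0.0), xi_min)
        G = xi[l] / (xi[l] + 1.0)                # Wiener gain stand-in
        X2_prev = G ** 2 * Y2[l]
    return xi
```

The heavy weighting α on the previous amplitude estimate is what smooths the a-priori SNR and suppresses musical noise, at the price of a fixed, signal-independent trade-off.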
ARCH Model▪ The GARCH (generalized autoregressive conditional heteroscedasticity) model is extensively used in financial applications where it is necessary to model time varying volatility while taking into account heavy tailed behavior and volatility clustering.
▪ Recently [1], it was proposed to use the GARCH model for statistically modeling speech signals in the STFT domain, as they exhibit these two characteristics.
▪ In this work, we investigate the use of a simplified case of the GARCH, the ARCH model. We explain the effect that the ARCH model parameters have on commonly used performance measures and compare it to the decision-directed estimator.
[1] I. Cohen, “Modeling speech signals in the time frequency domain using GARCH,” Signal Processing, vol. 84 (12), pp. 2453–2459, 2004.
ARCH Model
We use a two-step estimator to recursively update the estimate of the conditional a-priori SNR as new data arrives.

Given an estimate ξ̂ℓ|ℓ₋₁ and a new noisy spectral component Yℓ:

Update step: ξ̂ℓ|ℓ = E[ |Xℓ|²/σℓ² | ξ̂ℓ|ℓ₋₁, Yℓ ]

Using ARCH(1), propagate the a-priori SNR to obtain the one-frame-ahead a-priori SNR:

Propagation step: ξ̂ℓ|ℓ₋₁ = κ + μ ξ̂ℓ₋₁|ℓ₋₁,  κ > 0, 0 ≤ μ < 1
ARCH Model
▪ Solving the update step we get: ξ̂ℓ|ℓ = G_SP²(ξ̂ℓ|ℓ₋₁, γℓ) ⋅ γℓ

where G_SP(ξℓ|ℓ′, γℓ) = √[ ξℓ|ℓ′/(ξℓ|ℓ′ + 1) ⋅ ( 1/γℓ + ξℓ|ℓ′/(ξℓ|ℓ′ + 1) ) ]

▪ Employing some algebra, we can write: ξ̂ℓ|ℓ = αℓ ξ̂ℓ|ℓ₋₁ + (1 − αℓ)(γℓ − 1)

where αℓ = 1 − ( ξ̂ℓ|ℓ₋₁/(ξ̂ℓ|ℓ₋₁ + 1) )²,  αℓ ∈ (0, 1)

▪ Note the similarity of form to the decision-directed estimator, but with a time-varying, frequency-dependent weighting factor αℓ.
ARCH Model
▪ Since the a-priori SNR needs to equal ξ_min when speech is absent, we obtain a condition on κ: κ = (1 − μ) ξ_min
▪ Using ARCH(1) we have two parameters, ξ_min and μ:

Propagation step: ξ̂ℓ|ℓ₋₁ = (1 − μ) ξ_min + μ ξ̂ℓ₋₁|ℓ₋₁
Update step: ξ̂ℓ|ℓ = αℓ ξ̂ℓ|ℓ₋₁ + (1 − αℓ)(γℓ − 1), where αℓ = 1 − ( ξ̂ℓ|ℓ₋₁/(ξ̂ℓ|ℓ₋₁ + 1) )², αℓ ∈ (0, 1)
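The two-step ARCH(1) recursion above can be sketched for a single frequency bin as follows (the function name and default parameter values are illustrative only):

```python
import numpy as np

def arch1_snr(Y2, sigma2, mu=0.9, xi_min=10 ** (-15 / 10)):
    """ARCH(1) two-step a-priori SNR track for one frequency bin:
    propagation: xi_{l|l-1} = (1 - mu) xi_min + mu xi_{l-1|l-1}
    update:      xi_{l|l}   = a_l xi_{l|l-1} + (1 - a_l)(gamma_l - 1)
    with a_l = 1 - (xi_{l|l-1} / (xi_{l|l-1} + 1))^2."""
    gamma = Y2 / sigma2
    xi = np.empty_like(gamma)
    xi_post = xi_min                                     # xi_{l-1|l-1}
    for l, g in enumerate(gamma):
        xi_prior = (1 - mu) * xi_min + mu * xi_post      # propagation step
        a = 1.0 - (xi_prior / (xi_prior + 1.0)) ** 2     # time-varying weight
        xi_post = a * xi_prior + (1.0 - a) * (g - 1.0)   # update step
        xi[l] = xi_post
    return xi
```

Unlike the decision-directed recursion, the weight a_l adapts per frame: near the noise floor a_l is close to 1 (heavy smoothing, no musical noise), while for strong speech components a_l drops and the estimate tracks γ − 1 quickly.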
Distortion and NRR
We employ three performance measures commonly used for the quality assessment of a speech enhancement algorithm. The first two are easily understood when we express the estimated signal as

X̂ℓ = G(ξℓ|ℓ′, γℓ) Xℓ + G(ξℓ|ℓ′, γℓ) Dℓ = X_fd + D_rn

Speech distortion:

J_X ≜ E[ ( log|Xℓ(k)| − log|G(ξℓ|ℓ′, γℓ) Xℓ| )² ]

Noise Reduction Ratio (NRR):

NRR ≜ E[|Dℓ|²] / E[|G(ξℓ|ℓ′, γℓ) Dℓ|²]
Musical Noise via Higher-Order Statistics
The attenuated noise will be composed of isolated spectral components, also known as tonal components. The amount of tonal components can be quantified by the kurtosis: kurtosis = μ₄/μ₂², where μ_m is the mth-order moment of the signal.
As we are interested in the amount of tonal components caused by the processing, we use the log-ratio of the kurtosis after and before the processing:

LKR ≜ log₁₀( kurtosis_proc / kurtosis_orig )

which is evaluated on noise-only frames. The LKR increases as the musical noise increases; the absence of musical noise corresponds to an LKR of zero or below.
Musical Noise via Higher-Order Statistics
Analytical calculation of the kurtosis ratio requires the use of a specific noise reduction method or assumptions about the statistics of the spectral components. Here, we use the sample kurtosis:

kurtosis = (1/L) Σ_{ℓ=0}^{L} [ (1/N) Σ_{k=0}^{N−1} ( |Dℓ(k)|² − ⟨|Dℓ(k)|²⟩ )⁴ ] / [ (1/N) Σ_{k=0}^{N−1} ( |Dℓ(k)|² − ⟨|Dℓ(k)|²⟩ )² ]²

where ⟨|Dℓ(k)|²⟩ = (1/N) Σ_{k=0}^{N−1} |Dℓ(k)|²
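A sketch of the sample kurtosis and the LKR on a spectrogram of noise-only frames; the array layout and function names are my own choices.

```python
import numpy as np

def spectral_kurtosis(D):
    """Frame-averaged sample kurtosis of the spectral power |D(k)|^2.
    D: complex spectrum of noise-only frames, shape (frames, bins)."""
    P = np.abs(D) ** 2
    dev = P - P.mean(axis=1, keepdims=True)      # deviation from the frame mean
    return ((dev ** 4).mean(axis=1) / (dev ** 2).mean(axis=1) ** 2).mean()

def lkr(D_proc, D_orig):
    """Log-kurtosis ratio; positive values indicate musical (tonal) noise."""
    return np.log10(spectral_kurtosis(D_proc) / spectral_kurtosis(D_orig))
```

Zeroing most bins of a noise spectrum (leaving a few isolated components, as musical noise does) raises the per-frame kurtosis and drives the LKR above zero.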
Experimental Setup▪ Speech signals: 20 different utterances from 20 different speakers, sampled at 16 kHz and degraded by white Gaussian noise with SNRs in the range [0,20]dB.
▪ The noisy signals are transformed to the time-frequency domain using the STFT, with 75%-overlapping Hamming analysis windows of 32 ms length.
▪ The evaluation of the musical noise was done separately on a complex white Gaussian noise in the time-frequency domain, to emulate performance in noise only frames.
Experimental Setup

[Figure: comparison of decision-directed (solid lines) and ARCH (dashed lines) estimators for 5 dB SNR: (a) Distortion, (b) NRR, and (c) LKR, with varying α (upper axis) and μ (lower axis) respectively per estimator, and ξ_min of −20 dB (square), −15 dB (circle), and, for the decision-directed method only, ξ_min = 0 (triangle).]
▪ We get the expected decision-directed behavior.
▪ For the ARCH estimator, increasing the value of μ decreases the distortion.
▪ When μ increases, the NRR also decreases. The lower we take the noise floor ξ_min, the more noise reduction we get.
▪ The musical noise mainly depends on the noise floor ξ_min. A lower ξ_min means a higher αℓ, resulting in a smoother a-priori SNR around ξ_min, thus reducing the musical noise.
▪ For the decision-directed estimator we have to compromise between the amount of distortion and the amount of musical noise, while for the ARCH estimator the musical noise can be eliminated by choosing an appropriate value of ξ_min. However, for the ARCH estimator we need to compromise between the amount of distortion and the amount of residual noise.
Conclusions
Results summary:
▪ We presented the use of the ARCH estimator, which is based on a statistical model.
▪ We explained the effect the ARCH model parameters have on three commonly used quality measures.
▪ We demonstrated that the ARCH model can achieve better results than the decision-directed, while compromising between the speech distortion and noise reduction.
Future work:
▪ We used the ARCH(1) model for the a-priori SNR estimator, which is a special case of the GARCH(0,1). It would be interesting to expand the model to a full GARCH(p,q) model and conduct a similar analysis, to understand whether the full general model could provide additional advantages.
Robust Superdirective Beamformer with Optimal Regularization
▪ We introduce an optimal beamformer design that facilitates a compromise between high directivity and low white noise amplification.
▪ The proposed beamformer involves a regularization factor, whose optimal value is determined using a simple and efficient one-dimensional search algorithm.
▪ Simulation results demonstrate controlled tuning of various gain properties of the desired beamformer, and improved performance compared to a competing method.
Signal Model and Array Setup
▪ We consider a plane wave, in the far field, impinging on an array at angle θ
▪ Uniform linear microphone array of M sensors, with distance δ between them
▪ The desired signal X(ω) propagates from θ = 0 (endfire)
▪ Neglecting the propagation attenuation, the observed signal is

𝐲(ω) = 𝐝(ω, θ) X(ω) + 𝐯(ω)

where 𝐝(ω, θ) is the steering vector and 𝐯(ω) is the additive noise vector:

𝐝(ω, θ) = [1, e^{−jω cosθ τ₀}, …, e^{−j(M−1)ω cosθ τ₀}]ᵀ,  τ₀ = δ/c
Signal Model and Array Setup
▪ For the endfire direction, 𝐝(ω) = 𝐝(ω, 0)
▪ Applying a complex linear filter 𝐡 𝜔 , the estimated signal is
Z 𝜔 = 𝐡𝐻 𝜔 𝐲 𝜔 = 𝐡𝐻 𝜔 𝐝 𝜔 𝑋 𝜔 + 𝐡𝐻 𝜔 𝐯(𝜔)
▪ The beamformer is distortionless when 𝐡𝐻 𝜔 𝐝 𝜔 = 1
Performance Measures
▪ Taking the first microphone as reference, we define the input and output SNRs:

iSNR(ω) = 𝜙_X(ω) / 𝜙_{V₁}(ω)

oSNR(ω) = 𝜙_X(ω)/𝜙_{V₁}(ω) × |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐯(ω)𝐡(ω))

where 𝜙_f(ω) = E(|f(ω)|²) is the variance of f ∈ {X, V₁}, and 𝚪_𝐯(ω) = E[𝐯(ω)𝐯ᴴ(ω)] / 𝜙_{V₁}(ω) is the pseudo-coherence matrix of the noise.
▪ We deduce the gain in SNR:

𝒢(𝐡(ω)) = oSNR(ω)/iSNR(ω) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐯(ω)𝐡(ω))

▪ WNG (white noise, 𝚪_𝐯(ω) = 𝐈_M): 𝒲(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝐡(ω))

▪ DF (diffuse noise, 𝚪_𝐯(ω) = 𝚪_𝐝(ω) = (1/2) ∫₀^π 𝐝(ω, θ)𝐝ᴴ(ω, θ) sinθ dθ):

𝒟(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω))
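These quantities can be sketched for a uniform linear array; the sinc form used for the diffuse-field pseudo-coherence is the standard closed-form evaluation of the integral above for a ULA.

```python
import numpy as np

def steering(omega_tau0, M, theta=0.0):
    """ULA steering vector; omega_tau0 = omega * delta / c."""
    return np.exp(-1j * omega_tau0 * np.cos(theta) * np.arange(M))

def gamma_diffuse(omega_tau0, M):
    """Diffuse-field pseudo-coherence of a ULA:
    [Gamma_d]_ij = sin(x)/x with x = omega_tau0 * (j - i)."""
    idx = np.arange(M)
    x = omega_tau0 * (idx[None, :] - idx[:, None])
    return np.sinc(x / np.pi)          # np.sinc(t) = sin(pi t)/(pi t)

def wng(h, d):
    """White noise gain |h^H d|^2 / (h^H h)."""
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ h)

def df(h, d, Gd):
    """Directivity factor |h^H d|^2 / (h^H Gamma_d h)."""
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ Gd @ h)
```

For the delay-and-sum beamformer h = d/M, these reproduce 𝒲 = M and a DF of at least 1, as stated on the next slide.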
Conventional Beamformers
▪ Delay-and-Sum (DS): maximizes the WNG subject to the distortionless constraint

𝐡_DS(ω, θ) = 𝐝(ω, θ)/M,  𝒲(𝐡_DS(ω, θ)) = M = 𝒲_max

𝒟(𝐡_DS(ω, θ)) = M² / (𝐝ᴴ(ω, θ)𝚪_𝐝(ω)𝐝(ω, θ)) ≥ 1

While the DS maximizes the WNG, it never amplifies diffuse noise.

▪ Superdirective (SD): maximizes the DF subject to the distortionless constraint, for the specific case of θ = 0 and small δ

𝐡_SD(ω) = 𝚪_𝐝⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_𝐝⁻¹(ω)𝐝(ω))

While maximizing the DF, 𝐡_SD(ω) can amplify the white noise, especially at low frequencies.
Conventional Beamformers
▪ Robust Superdirective:

𝐡_{R,ε}(ω) = (𝚪_𝐝(ω) + ε𝐈_M)⁻¹𝐝(ω) / (𝐝ᴴ(ω)(𝚪_𝐝(ω) + ε𝐈_M)⁻¹𝐝(ω))

where ε ≥ 0 is a Lagrange multiplier, which enables a compromise between the DF and the WNG.

If we define 𝚪_ε(ω) = 𝚪_𝐝(ω) + ε𝐈_M, we can write

𝐡_{R,ε}(ω) = 𝚪_ε⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_ε⁻¹(ω)𝐝(ω))

While the robust superdirective beamformer controls the white noise amplification, it is not easy to find a closed-form expression for ε for a desired value of the WNG.
Combined Beamformer
R. Berkun, I. Cohen, and J. Benesty, "Combined beamformers for robust broadband regularized superdirective beamforming," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, pp. 877–886, May 2015

Berkun et al. proposed the combined beamformer:

𝐡_{α,ε}(ω) = (𝚪_ε⁻¹(ω) + α(ω)𝐈_M)𝐝(ω) / (𝐝ᴴ(ω)(𝚪_ε⁻¹(ω) + α(ω)𝐈_M)𝐝(ω)),  α ∈ ℝ

It can be reformulated as

𝐡_{α,ε}(ω) = 𝐡_{R,ε}(ω)/(1 + α_ε(ω)) + 𝐡_DS(ω)/(1 + α_ε⁻¹(ω))

where α_ε(ω) = α(ω) 𝒲_max / 𝒟_{max,ε}(ω) and 𝒟_{max,ε}(ω) = 𝐝ᴴ(ω)𝚪_ε⁻¹(ω)𝐝(ω).

For a fixed 𝒲(𝐡_{α,ε}(ω)) = 𝒲₀ < M or a fixed 𝒟(𝐡_{α,ε}(ω)) = 𝒟₀, it is possible to analytically calculate α_ε(ω), and hence α(ω).

While the method finds a closed-form solution for the parameter α(ω), which enables control of the trade-off between the WNG and the DF, it does not address finding the regularization parameter ε, and assumes it is user-determined.
New Noise Field
▪ We assume the signal is corrupted by both diffuse noise and additive white noise.
▪ The input and output SNRs:

iSNR(ω) = tr(𝜙_X(ω)𝐝(ω)𝐝ᴴ(ω)) / tr(𝜙_d(ω)𝚪_𝐝(ω) + 𝜙_w(ω)𝐈_M) = 𝜙_X(ω) / (𝜙_d(ω) + 𝜙_w(ω))

oSNR(ω) = 𝜙_X(ω)|𝐡ᴴ(ω)𝐝(ω)|² / (𝜙_d(ω)𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω) + 𝜙_w(ω)𝐡ᴴ(ω)𝐡(ω))

▪ The SNR gain:

𝒢(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / ( (1 − α(ω))𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω) + α(ω)𝐡ᴴ(ω)𝐡(ω) )

where

α(ω) = 𝜙_w(ω) / (𝜙_d(ω) + 𝜙_w(ω)),  0 ≤ α(ω) ≤ 1
The Optimal Beamformer
▪ The proposed beamformer, which maximizes the SNR gain, is:

𝐡_α(ω) = 𝚪_{𝐝,α}⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_{𝐝,α}⁻¹(ω)𝐝(ω)), where 𝚪_{𝐝,α}(ω) = (1 − α(ω))𝚪_𝐝(ω) + α(ω)𝐈_M

▪ The SNR gain: 𝒢(𝐡_α(ω)) = 𝐝ᴴ(ω)𝚪_{𝐝,α}⁻¹(ω)𝐝(ω)
▪ The proposed beamformer is equivalent to 𝐡_{R,ε}(ω) with ε(ω) = α(ω)/(1 − α(ω))
▪ Problem: 𝜙_d(ω) and 𝜙_w(ω) are not known → α(ω) is not known.
▪ Advantage 1: α(ω) varies from 0 to 1.
▪ Advantage 2: the gain is continuous and has a single minimum point in this range; the WNG and the DF are monotonic in this range.
▪ Solution: α(ω) is found by employing a binary-like search on each monotonic section.
Algorithm 1
▪ Input: desired gain 𝒢₀ and tolerance
▪ Output: optimal regularization α
1. Find α_min that minimizes the gain (e.g., using gradient descent)
2. Divide the range [0, 1] into 2 sections in which the gain is monotonic: [0, α_min] and [α_min, 1]
3. For each section, apply the following continuous binary search:
4.   Divide the section into 2 sub-sections
5.   Calculate the gain 𝒢ₖ in the middle of each sub-section
6.   Choose the gain 𝒢ₖ and its respective sub-section for which |𝒢ₖ − 𝒢₀| is minimal
7.   if |𝒢ₖ − 𝒢₀| ≤ tolerance then
8.     α ← (middle of chosen sub-section) and stop
9.   else
10.    update the range to be the chosen sub-section and go back to 4
11.  end if
12. Compare the results for [0, α_min] and [α_min, 1], and choose the best
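A sketch of Algorithm 1 for the ULA/diffuse-field model above, with two simplifications of my own: step 1 uses a grid scan instead of gradient descent, and the continuous binary search within each monotonic section is implemented as a plain bisection toward the target gain.

```python
import numpy as np

def steering(omega_tau0, M):
    return np.exp(-1j * omega_tau0 * np.arange(M))   # endfire steering vector

def gamma_diffuse(omega_tau0, M):
    idx = np.arange(M)
    return np.sinc(omega_tau0 * (idx[None, :] - idx[:, None]) / np.pi)

def snr_gain(alpha, d, Gd):
    """G(h_alpha) = d^H Gamma_{d,alpha}^{-1} d,
    with Gamma_{d,alpha} = (1 - alpha) Gamma_d + alpha I."""
    sol = np.linalg.solve((1 - alpha) * Gd + alpha * np.eye(len(d)), d)
    return np.real(d.conj() @ sol)

def find_alpha(G0, d, Gd, tol=1e-6, iters=200):
    """Algorithm 1 sketch: grid-scan for the gain minimum (step 1),
    split [0, 1] into two monotonic sections (step 2), bisect each
    toward the target gain G0 (steps 3-11), keep the best (step 12)."""
    grid = np.linspace(1e-6, 1 - 1e-6, 1024)
    a_min = grid[np.argmin([snr_gain(a, d, Gd) for a in grid])]
    best_err, best_a = np.inf, None
    for lo, hi in [(1e-6, a_min), (a_min, 1 - 1e-6)]:
        direction = np.sign(snr_gain(hi, d, Gd) - snr_gain(lo, d, Gd))
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if (snr_gain(mid, d, Gd) - G0) * direction > 0:
                hi = mid                 # target lies in the lower half
            else:
                lo = mid                 # target lies in the upper half
            if hi - lo <= tol:
                break
        mid = 0.5 * (lo + hi)
        err = abs(snr_gain(mid, d, Gd) - G0)
        if err < best_err:
            best_err, best_a = err, mid
    return best_a
```

Because the gain is continuous with a single interior minimum, the search over the two monotonic sections is one-dimensional and cheap, which is what makes the per-frequency tuning practical.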
Experimental Results
Setup: M = 8 microphones, δ = 1 cm spacing
Array gains for fixed SNR gain
𝛼(𝜔) is found for desired fixed SNR gain 𝒢0 using the proposed algorithm
Experimental Results
Array gains for fixed WNG
𝛼(𝜔) is found for maximal SNR gain under a constant desired WNG 𝒲0 using the proposed algorithm from step 4
→ Our proposed beamformer outperforms the combined beamformer with ε = 10⁻⁴
Experimental Results
Array gains for fixed DF in multi-band
𝛼(𝜔) is found for maximal SNR gain under a piece-wise constant gradually increasing DF using the proposed algorithm from step 4
→ The WNG–DF trade-off can be considered at each frequency band separately!
→ Our proposed beamformer outperforms the combined beamformer with ε = 10⁻⁴
Conclusions
Results summary:
▪ The proposed approach facilitates the design of beamformers with fixed SNR gain, beamformers with maximal SNR gain for constant WNG or DF, and multi-band fixed beamformers.
▪ Enables a fine tuning of the compromise between the DF and robustness against white noise.
Future work:
▪ Testing various angles of incidence other than the end-fire direction.
▪ Incorporating other considerations such as side-lobe requirements and performance under other types of noise fields.