Speech Enhancement Based on Adaptive Line Enhancer
Research Thesis
Aviva Atkins
April 7th, 2020
Supervised by Prof. Israel Cohen
Outline
▪ Introduction
▪ The problem researched
▪ The challenges
▪ Research contributions
▪ Adaptive Line Enhancer background
▪ Conventional fixed step size
▪ Mutual Information approach
▪ Proposed method
▪ Conclusions and future research
Sound test
Noise is Everywhere!
[Illustration: a source degraded by interference, reverberation, echo, and additive noise]
Speech Enhancement
Applications
Harmonic noise
▪ Contains deterministic sinusoidal components
The problem researched
Reducing nonstationary harmonic noise from a speech signal recorded with a single microphone
[Illustration: a source signal corrupted by nonstationary harmonic additive noise]
The challenges▪ Single channel – only the noisy signal is available with no access to additional reference signals and no spatial information
→only intrinsic properties of speech or noise can be used
▪ The vast majority of methods require an estimate of the noise spectrum
▪ When the noise is stationary, it can be estimated during segments when speech is absent
▪ When the noise is nonstationary it needs to be tracked continuously
→it is more difficult to estimate nonstationary noise
▪ Trade-off between noise reduction and speech distortion
▪ The developed method needs to be relevant for real-time applications
Research Contributions
▪ Introduced a filtering method based on the frequency-domain Adaptive Line Enhancer that enables better reduction of nonstationary harmonic noise
▪ Proposed the combined filter – a combination of the commonly-used forward adaptive linear filter and a non-causal backward adaptive linear filter used together, increasing the reduction span of the noise transient
▪ Applied the filter based on a comparison to the noisy spectrum, reducing noise overestimation
▪ Applied the filter based on a noise presence indicator for better speech preservation
▪ Employed a set of filter lengths, to ensure the combined filter spans the entire noise transient
Additional contributions
▪ Investigated a statistical model as an alternative to the Decision-Directed a-priori SNR estimator, and showed that it can eliminate the musical noise while compromising between signal distortion and noise reduction.
▪ Introduced a beamformer that enables fine tuning of the compromise between the Directivity Factor and the White Noise Gain, through a simple, computationally efficient algorithm.
Why use the Adaptive Line Enhancer?
▪ Exploits the structure of the harmonic noise
▪ Simple, with low computational cost
▪ Modifies both magnitude and phase, so it has the potential to improve signal intelligibility and not just quality
Adaptive Noise Canceller (ANC)

[Block diagram: the primary input x(n) + v(n) carries the signal plus noise; a reference input v₀(n), correlated with the noise, passes through an adaptive filter to produce the noise estimate v̂(n); subtracting v̂(n) from the primary input gives the output x̂(n) = e(n), which also drives the adaptive algorithm. The slide marks possible distortion d(n) leaking into the inputs, and marks the reference input with a question mark.]

x̂ = x + v − v̂
min E[x̂²] = E[x²] + min E[(v − v̂)²]
min E[(x − x̂)²] = min E[(v − v̂)²]
Ideal case: v̂ = v, x̂ = x
Adaptive Line Enhancer (ALE)

[Block diagram: the input y(n) = x(n) + v(n) is delayed (z⁻ᵗ) to form z(n), which feeds the adaptive filter; the filter output x̂(n) is subtracted from y(n) to form the error e(n), which drives the adaptive algorithm. Both e(n) and x̂(n) serve as possible outputs.]

The delay decorrelates one component while the other remains correlated:
▪ Signal decorrelated, noise correlated → the filter predicts the noise
▪ Noise decorrelated, signal correlated → the filter predicts the signal
Adaptive Line Enhancer (ALE)

The same structure is moved from the time domain (TD) to the frequency domain (FD), operating per STFT bin: the input Y(k,m) = X(k,m) + V(k,m) is delayed by τ frames, the adaptive filter produces Z(k,m), and the error E(k,m) = Y(k,m) − Z(k,m) serves as the estimate X̂(k,m).
Adaptive Line Enhancer (ALE)

In the frequency domain, with filter length L, delay τ, and step size μ:

Z(k,m) = hᴴ(k,m) y(k,m−τ)
h(k,m) = [H₀(k,m), …, H_{L−1}(k,m)]ᵀ
y(k,m−τ) = [Y(k,m−τ), …, Y(k,m−τ−L+1)]ᵀ

NLMS: h(k,m+1) = h(k,m) + μ E*(k,m) y(k,m−τ) / ‖y(k,m−τ)‖²
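The per-bin filtering and NLMS update above can be sketched in numpy; this is a minimal illustration, not the thesis implementation – the function name, the zero-padding of the first frames, and the regularization constant `delta` are my own choices.

```python
import numpy as np

def fd_ale_nlms(Y, tau=1, L=3, mu=0.1, delta=1e-8):
    """Frequency-domain ALE: per-bin NLMS prediction of the correlated
    (harmonic-noise) component of the STFT Y, shape (K bins, M frames).
    Returns the error E(k, m) = Y(k, m) - Z(k, m)."""
    K, M = Y.shape
    H = np.zeros((K, L), dtype=complex)          # per-bin filter h(k, m)
    E = np.zeros_like(Y)
    for m in range(M):
        # regressor y(k, m - tau) = [Y(k,m-tau), ..., Y(k,m-tau-L+1)]
        cols = [m - tau - l for l in range(L)]
        y = np.stack([Y[:, c] if c >= 0 else np.zeros(K, complex) for c in cols],
                     axis=1)
        Z = np.sum(np.conj(H) * y, axis=1)       # Z(k, m) = h^H y
        E[:, m] = Y[:, m] - Z
        norm = delta + np.sum(np.abs(y) ** 2, axis=1)
        # NLMS: h(k, m+1) = h(k, m) + mu E*(k, m) y(k, m - tau) / ||y||^2
        H += mu * (np.conj(E[:, m]) / norm)[:, None] * y
    return E
```

When the input is strongly correlated across frames (as a harmonic noise is), the filter converges and the error in those bins goes to zero.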
Conventional Fixed Step Size Example
For the conventional fixed step size, it is difficult to both reduce the noise and maintain high quality of the enhanced signal
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal; L = 3, τ = 1]
Mutual Information Approach
Taghia, J., Martin, R., 2016, "A frequency-domain adaptive line enhancer with step-size control based on mutual information for harmonic noise reduction," IEEE Trans. Audio Speech Lang. Process.
▪ Frequency dependent step size, detecting harmonic noise presence per frequency
▪ Based on Mutual Information (MI)
▪ Step size: μ(k) = μ₀ Q(k), with μ₀ a constant and Q(k) a binary decision per frequency bin:

Q(k) = 1 if I_P(k) ≥ I_thr, and 0 otherwise

where I_P(k) is the mutual information estimated at bin k, normalized by the total mutual information over all K bins.
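As a rough illustration of the step-size control, the sketch below gates each bin by a normalized mutual-information score. The histogram-based MI estimator and the threshold rule `thr / K` are crude stand-ins for the estimator and threshold of Taghia and Martin, not their actual method.

```python
import numpy as np

def hist_mi(a, b, bins=8):
    """Plug-in mutual information (nats) between two real sequences via a
    joint histogram -- a crude stand-in for the paper's MI estimator."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_step_sizes(Y, Z, mu0=0.1, thr=0.5):
    """Per-bin step size mu(k) = mu0 * Q(k): a bin is adapted only when
    its normalized MI exceeds a threshold set relative to 1/K."""
    K = Y.shape[0]
    I = np.array([hist_mi(np.abs(Y[k]), np.abs(Z[k])) for k in range(K)])
    I_P = I / max(I.sum(), 1e-12)          # normalize by the total MI
    Q = (I_P >= thr / K).astype(float)     # binary decision per bin
    return mu0 * Q
```

Bins whose filter output is statistically dependent on the input (i.e., contain predictable harmonic noise) receive the step size μ₀; the rest are frozen.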
MI Approach Example
[Plots: (b) Noisy-signal spectrogram (frequency index vs. frame index); (c) MI step size μ vs. frequency [kHz]; Q = 1]
MI Approach Example
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Enhanced signal – fixed step size, (c) Enhanced signal – MI]
MI Approach
▪ Implemented in a block-wise manner
▪ Assumption: the noise is stationary over at least the block length
▪ Taghia and Martin use a block length of 3 seconds
The assumption does not hold for highly non-stationary signals, such as heart-monitor beeping: the decision Q(k) is then often zero.
[Spectrogram of 3.4 s of heart-monitor beeping]
MI Approach Example – Non-stationary
[Plots: (a) Noisy-signal spectrogram (frequency index vs. frame index); (b) MI step size μ vs. frequency [kHz]; Q = 0]
MI Approach Example – Non-stationary
[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI, with Q ignored]
Non-Stationary Noise – Filter Output Estimate

[Block diagram as before: input y(n) = x(n) + v(n), delay z⁻ᵗ, adaptive filter, error e(n). For non-stationary noise, can the filter output serve as the noise estimate v̂(n), so that x̂(n) = y(n) − v̂(n)?]
Experimental Setup
▪ Clean speech: 20 different speech signals from different speakers from the TIMIT database (50% male, 50% female)
▪ Sampled at 16 kHz
▪ SNR range [0,20] dB
▪ STFT, overlap-add
▪ Noise: 26 different non-stationary harmonic noise signals, e.g., heart monitor beeping, train door beeping, house alarm, railroad crossing bells.
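The STFT analysis/synthesis chain used throughout can be sketched as a weighted overlap-add pair; this sketch uses a Hann window and illustrative frame sizes, and is not the thesis code.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """STFT with a Hann analysis window (75% overlap for hop = n_fft/4)."""
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(X, n_fft=512, hop=128):
    """Weighted overlap-add inverse using the same Hann synthesis window."""
    w = np.hanning(n_fft)
    frames = np.fft.irfft(X, n=n_fft, axis=1) * w
    n = hop * (len(frames) - 1) + n_fft
    y = np.zeros(n)
    wsum = np.zeros(n)
    for i, f in enumerate(frames):
        y[i * hop:i * hop + n_fft] += f
        wsum[i * hop:i * hop + n_fft] += w ** 2
    return y / np.maximum(wsum, 1e-12)   # normalize by accumulated window power
```

Normalizing by the accumulated squared window makes the pair give (near-)perfect reconstruction away from the signal edges, so any enhancement applied to the STFT coefficients maps cleanly back to the time domain.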
Correlation

γ_X(k,m,τ) = E[X(k,m) x*(k,m−τ)] / E[|X(k,m)|²]
γ_V(k,m,τ) = E[V(k,m) v*(k,m−τ)] / E[|V(k,m)|²]

[Plot: correlation vs. lag τ [frames]; 1 frame = 32 ms]
Proposed Approach
▪ Combined filter (CMLNLMS):

E_c(k,m) =
  E_b(k,m+L),  if |E_b(k,m+L)|² ≤ |E_f(k,m)|² and |E_b(k,m+L)|² ≤ |Y(k,m)|²
  E_f(k,m),    if |E_b(k,m+L)|² > |E_f(k,m)|² and |E_f(k,m)|² ≤ |Y(k,m)|²
  Y(k,m),      else

[Diagram: the forward (F) and backward (B) filter outputs are combined into C]
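The piecewise selection rule can be written as a vectorized per-bin choice. In this sketch `Eb` is assumed to be already time-aligned (i.e., `Eb[k, m]` holds E_b(k, m+L)), and the function name is hypothetical.

```python
import numpy as np

def combined_filter(Y, Ef, Eb):
    """Per-bin combined filter: choose the backward error Eb, the forward
    error Ef, or the noisy input Y itself, per the piecewise rule above
    (never letting the output magnitude exceed |Y|)."""
    use_b = (np.abs(Eb) ** 2 <= np.abs(Ef) ** 2) & (np.abs(Eb) ** 2 <= np.abs(Y) ** 2)
    use_f = np.abs(Ef) ** 2 <= np.abs(Y) ** 2
    return np.where(use_b, Eb, np.where(use_f, Ef, Y))
```

Falling back to Y(k,m) whenever both errors exceed the noisy spectrum is what limits noise overestimation.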
Proposed Approach
▪ Harmonic-noise presence detector for better speech preservation:

I(k,m) = 1 if V(k,m) ∈ ℋ₀, 0 if V(k,m) ∈ ℋ₁

▪ A set of filters with increasing length, up to the maximal filter length L, based on the available number of noise samples

[Diagram: the forward (F) and backward (B) filter outputs are combined into C]
Performance Measures
▪ Distortion index 𝑣_sd
▪ Noise-reduction factor ξ_nr
▪ Perceptual Evaluation of Speech Quality (PESQ), ITU-T P.862.2
▪ Short-Time Objective Intelligibility (STOI)
Transient Reduction

[Plot: NRR [dB] vs. frame index; L = 3, τ = 3, indicator threshold −25 dB, μ = 0.5]

Better noise reduction, which leads to improved ξ_nr, PESQ, and STOI levels for the combined filter
Step Size
▪ An appropriate selection of the step size is required
▪ Fixed step size

[Plots: 𝑣_sd and NRR [dB] vs. frame index, comparing maximal, MI-based, and constant step sizes; μ = 0.5, delay τ in frames]
[Plots: PESQ, STOI, 𝑣_sd [dB], and ξ_nr [dB] vs. τ [frames]; μ = 0.5, indicator threshold −25 dB]

Combined & MI-Combined show better results than MI
[Plots: PESQ, STOI, 𝑣_sd [dB], and ξ_nr [dB] vs. τ [frames]; μ = 0.5, indicator threshold −25 dB, L_short = 1]

Recommendation:
Combined & MI-Combined show better results than MI
Experimental Results Summary
L = 3, τ = 1, indicator threshold −25 dB

[Spectrograms (frequency index vs. frame index): (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI, (d) Enhanced signal – Combined]
Conclusions
▪ Introduced the combined filter
▪ Parameter selection
▪ Noise presence indicator impact
▪ Improved results compared to other methods
Future Research
▪ Noise presence indicator implementation
▪ Residual noise at transient edges
▪ Deep Learning approach for noise reduction
Speech Enhancement Using the ARCH Model
▪ We investigate the use of the autoregressive conditional heteroscedasticity (ARCH) model as a replacement for the well-known Decision-Directed estimator by Ephraim and Malah
▪ We employ three sound quality measures: speech distortion, noise reduction and musical noise, and explain the effect the ARCH model parameters have on these measures.
▪ We demonstrate that the ARCH model achieves better results than the decision-directed for some of these measures, while compromising between the speech distortion and noise reduction.
Problem Formulation
▪ Let Yℓ(k) = Xℓ(k) + Dℓ(k) denote an observed noisy speech signal in the STFT domain.
▪ Given an error function between the clean signal and its estimate, the spectral enhancement problem can be formulated as

X̂ℓ(k) = argmin_X E[ e(Xℓ(k), X(k)) | Y₀(k), …, Yℓ′(k) ]

▪ We consider the causal case ℓ′ ≤ ℓ and the LSA error function

e_LSA(Xℓ(k), X̂ℓ(k)) = ( log|Xℓ(k)| − log|X̂ℓ(k)| )²
Problem Formulation
▪ The estimate is obtained by applying a spectral gain to each noisy spectral component:

X̂ℓ = G_LSA(ξℓ|ℓ′) ⋅ Yℓ

where the a-priori and a-posteriori SNRs are defined, respectively, by

ξℓ|ℓ′ ≜ λℓ|ℓ′ / σℓ²,  γℓ ≜ |Yℓ|² / σℓ²

σℓ² = E[|Dℓ|²] denotes the short-term spectrum of the noise, and λℓ|ℓ′ = E[|Xℓ|² | Y₀(k), …, Yℓ′(k)] denotes the short-term spectrum of the speech signal.
Decision-Directed
Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, pp. 1109–1121, December 1984

▪ Over the past decades, the decision-directed (DD) approach has become the accepted estimation method for the a-priori SNR:

ξ̂ℓ|ℓ = max( α |X̂ℓ₋₁|²/σℓ² + (1 − α) P(γℓ − 1), ξ_min )

where P(x) = x if x ≥ 0 and P(x) = 0 otherwise.
▪ The decision-directed approach is not supported by a statistical model.
▪ α and ξ_min have to be determined by simulations.
▪ α and ξ_min are fixed constants and are not adapted to the speech components.
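A minimal single-bin sketch of the decision-directed recursion; for simplicity it uses the Wiener gain ξ/(ξ+1) as the amplitude estimator in place of the MMSE-STSA gain of Ephraim and Malah, and the default parameter values are illustrative only.

```python
import numpy as np

def decision_directed(Y2, sigma2, alpha=0.98, xi_min=10 ** (-15 / 10)):
    """Decision-directed a-priori SNR track for one frequency bin.
    Y2: |Y_l|^2 per frame; sigma2: noise PSD per frame.
    Uses the Wiener gain xi/(xi+1) as the amplitude estimator, a
    simplification of the MMSE-STSA gain."""
    gamma = Y2 / sigma2                          # a-posteriori SNR
    xi = np.empty_like(gamma)
    X2_prev = 0.0                                # |X_hat_{l-1}|^2
    for l, g in enumerate(gamma):
        xi[l] = max(alpha * X2_prev / sigma2[l]
                    + (1 - alpha) * max(g - 1.0, 0.0), xi_min)
        G = xi[l] / (xi[l] + 1.0)                # Wiener gain stand-in
        X2_prev = G ** 2 * Y2[l]
    return xi
```

The heavy weighting α on the previous amplitude estimate is what smooths the a-priori SNR and suppresses musical noise, at the price of a fixed, signal-independent trade-off.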
ARCH Model▪ The GARCH (generalized autoregressive conditional heteroscedasticity) model is extensively used in financial applications where it is necessary to model time varying volatility while taking into account heavy tailed behavior and volatility clustering.
▪ Recently [1], it was proposed to use the GARCH model for statistically modeling speech signals in the STFT domain, as they exhibit these two characteristics.
▪ In this work, we investigate the use of a simplified case of the GARCH, the ARCH model. We explain the effect that the ARCH model parameters have on commonly used performance measures and compare it to the decision-directed estimator.
[1] I. Cohen, “Modeling speech signals in the time frequency domain using GARCH,” Signal Processing, vol. 84 (12), pp. 2453–2459, 2004.
ARCH Model
We use a two-step estimator to recursively update the estimate of the conditional a-priori SNR as new data arrives.

Given an estimate ξ̂ℓ|ℓ₋₁ and a new noisy spectral component Yℓ:

Update step: ξ̂ℓ|ℓ = E[ |Xℓ|²/σℓ² | ξ̂ℓ|ℓ₋₁, Yℓ ]

Using ARCH(1), propagate the a-priori SNR to obtain the one-frame-ahead a-priori SNR:

Propagation step: ξ̂ℓ|ℓ₋₁ = κ + μ ξ̂ℓ₋₁|ℓ₋₁,  κ > 0, 0 ≤ μ < 1
ARCH Model
▪ Solving the update step we get: ξ̂ℓ|ℓ = G_SP²(ξ̂ℓ|ℓ₋₁, γℓ) ⋅ γℓ

where G_SP(ξℓ|ℓ′, γℓ) = √[ ξℓ|ℓ′/(ξℓ|ℓ′ + 1) ⋅ ( 1/γℓ + ξℓ|ℓ′/(ξℓ|ℓ′ + 1) ) ]

▪ Employing some algebra, we can write: ξ̂ℓ|ℓ = αℓ ξ̂ℓ|ℓ₋₁ + (1 − αℓ)(γℓ − 1)

where αℓ = 1 − ( ξ̂ℓ|ℓ₋₁/(ξ̂ℓ|ℓ₋₁ + 1) )²,  αℓ ∈ (0, 1)

▪ Note the similarity of form to the decision-directed estimator, but with a time-varying, frequency-dependent weighting factor αℓ.
ARCH Model
▪ Since the a-priori SNR needs to equal ξ_min when speech is absent, we obtain a condition on κ: κ = (1 − μ) ξ_min
▪ Using ARCH(1) we have two parameters, ξ_min and μ:

Propagation step: ξ̂ℓ|ℓ₋₁ = (1 − μ) ξ_min + μ ξ̂ℓ₋₁|ℓ₋₁
Update step: ξ̂ℓ|ℓ = αℓ ξ̂ℓ|ℓ₋₁ + (1 − αℓ)(γℓ − 1), where αℓ = 1 − ( ξ̂ℓ|ℓ₋₁/(ξ̂ℓ|ℓ₋₁ + 1) )², αℓ ∈ (0, 1)
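The two-step ARCH(1) recursion above can be sketched for a single frequency bin as follows (the function name and default parameter values are illustrative only):

```python
import numpy as np

def arch1_snr(Y2, sigma2, mu=0.9, xi_min=10 ** (-15 / 10)):
    """ARCH(1) two-step a-priori SNR track for one frequency bin:
    propagation: xi_{l|l-1} = (1 - mu) xi_min + mu xi_{l-1|l-1}
    update:      xi_{l|l}   = a_l xi_{l|l-1} + (1 - a_l)(gamma_l - 1)
    with a_l = 1 - (xi_{l|l-1} / (xi_{l|l-1} + 1))^2."""
    gamma = Y2 / sigma2
    xi = np.empty_like(gamma)
    xi_post = xi_min                                     # xi_{l-1|l-1}
    for l, g in enumerate(gamma):
        xi_prior = (1 - mu) * xi_min + mu * xi_post      # propagation step
        a = 1.0 - (xi_prior / (xi_prior + 1.0)) ** 2     # time-varying weight
        xi_post = a * xi_prior + (1.0 - a) * (g - 1.0)   # update step
        xi[l] = xi_post
    return xi
```

Unlike the decision-directed recursion, the weight a_l adapts per frame: near the noise floor a_l is close to 1 (heavy smoothing, no musical noise), while for strong speech components a_l drops and the estimate tracks γ − 1 quickly.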
Distortion and NRR
We employ three performance measures commonly used for the quality assessment of a speech enhancement algorithm. The first two are easily understood when we express the estimated signal as

X̂ℓ = G(ξℓ|ℓ′, γℓ) Xℓ + G(ξℓ|ℓ′, γℓ) Dℓ = X_fd + D_rn

Speech distortion:

J_X ≜ E[ ( log|Xℓ(k)| − log|G(ξℓ|ℓ′, γℓ) Xℓ| )² ]

Noise Reduction Ratio (NRR):

NRR ≜ E[|Dℓ|²] / E[|G(ξℓ|ℓ′, γℓ) Dℓ|²]
Musical Noise via Higher-Order Statistics
The attenuated noise will be composed of isolated spectral components, also known as tonal components. The amount of tonal components can be quantified by the kurtosis: kurtosis = μ₄/μ₂², where μ_m is the mth-order moment of the signal.
As we are interested in the amount of tonal components caused by the processing, we use the log-ratio of the kurtosis after and before the processing:

LKR ≜ log₁₀( kurtosis_proc / kurtosis_orig )

which is evaluated on noise-only frames. The LKR increases as the musical noise increases; the absence of musical noise corresponds to an LKR of zero or below.
Musical Noise via Higher-Order Statistics
Analytical calculation of the kurtosis ratio requires the use of a specific noise reduction method or assumptions about the statistics of the spectral components. Here, we use the sample kurtosis:

kurtosis = (1/L) Σ_{ℓ=0}^{L} [ (1/N) Σ_{k=0}^{N−1} ( |Dℓ(k)|² − ⟨|Dℓ(k)|²⟩ )⁴ ] / [ (1/N) Σ_{k=0}^{N−1} ( |Dℓ(k)|² − ⟨|Dℓ(k)|²⟩ )² ]²

where ⟨|Dℓ(k)|²⟩ = (1/N) Σ_{k=0}^{N−1} |Dℓ(k)|²
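A sketch of the sample kurtosis and the LKR on a spectrogram of noise-only frames; the array layout and function names are my own choices.

```python
import numpy as np

def spectral_kurtosis(D):
    """Frame-averaged sample kurtosis of the spectral power |D(k)|^2.
    D: complex spectrum of noise-only frames, shape (frames, bins)."""
    P = np.abs(D) ** 2
    dev = P - P.mean(axis=1, keepdims=True)      # deviation from the frame mean
    return ((dev ** 4).mean(axis=1) / (dev ** 2).mean(axis=1) ** 2).mean()

def lkr(D_proc, D_orig):
    """Log-kurtosis ratio; positive values indicate musical (tonal) noise."""
    return np.log10(spectral_kurtosis(D_proc) / spectral_kurtosis(D_orig))
```

Zeroing most bins of a noise spectrum (leaving a few isolated components, as musical noise does) raises the per-frame kurtosis and drives the LKR above zero.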
Experimental Setup▪ Speech signals: 20 different utterances from 20 different speakers, sampled at 16 kHz and degraded by white Gaussian noise with SNRs in the range [0,20]dB.
▪ The noisy signals are transformed to the time-frequency domain using the STFT, with 75%-overlapping Hamming analysis windows of 32 ms length.
▪ The evaluation of the musical noise was done separately on a complex white Gaussian noise in the time-frequency domain, to emulate performance in noise only frames.
Experimental Setup

[Figure: comparison of decision-directed (solid lines) and ARCH (dashed lines) estimators for 5 dB SNR: (a) Distortion, (b) NRR, and (c) LKR, with varying α (upper axis) and μ (lower axis) respectively per estimator, and ξ_min of −20 dB (square), −15 dB (circle), and, for the decision-directed method only, ξ_min = 0 (triangle).]
▪ We get the expected decision-directed behavior.
▪ For the ARCH estimator, increasing the value of μ decreases the distortion.
▪ When μ increases, the NRR also decreases. The lower we take the noise floor ξ_min, the more noise reduction we get.
▪ The musical noise mainly depends on the noise floor ξ_min. A lower ξ_min means a higher αℓ, resulting in a smoother a-priori SNR around ξ_min, thus reducing the musical noise.
▪ For the decision-directed estimator we have to compromise between the amount of distortion and the amount of musical noise, while for the ARCH estimator the musical noise can be eliminated by choosing an appropriate value of ξ_min. However, for the ARCH estimator we need to compromise between the amount of distortion and the amount of residual noise.
Conclusions
Results summary:
▪ We presented the use of the ARCH estimator, which is based on a statistical model.
▪ We explained the effect the ARCH model parameters have on three commonly used quality measures.
▪ We demonstrated that the ARCH model can achieve better results than the decision-directed, while compromising between the speech distortion and noise reduction.
Future work:
▪ We used the ARCH(1) model for the a-priori SNR estimator, which is a special case of the GARCH(0,1). It would be interesting to expand the model to a full GARCH(p,q) model and conduct a similar analysis, to understand whether the full general model could provide additional advantages.
Robust Superdirective Beamformer with Optimal Regularization
▪ We introduce an optimal beamformer design that facilitates a compromise between high directivity and low white noise amplification.
▪ The proposed beamformer involves a regularization factor, whose optimal value is determined using a simple and efficient one-dimensional search algorithm.
▪ Simulation results demonstrate controlled tuning of various gain properties of the desired beamformer, and improved performance compared to a competing method.
Signal Model and Array Setup
▪ We consider a plane wave, in the far field, impinging on an array at angle θ
▪ Uniform linear microphone array of M sensors, with distance δ between them
▪ The desired signal X(ω) propagates from θ = 0 (endfire)
▪ Neglecting the propagation attenuation, the observed signal is

𝐲(ω) = 𝐝(ω, θ) X(ω) + 𝐯(ω)

where 𝐝(ω, θ) is the steering vector and 𝐯(ω) is the additive noise vector:

𝐝(ω, θ) = [1, e^{−jω cosθ τ₀}, …, e^{−j(M−1)ω cosθ τ₀}]ᵀ,  τ₀ = δ/c
Signal Model and Array Setup
▪ For the endfire direction, 𝐝(ω) = 𝐝(ω, 0)
▪ Applying a complex linear filter 𝐡 𝜔 , the estimated signal is
Z 𝜔 = 𝐡𝐻 𝜔 𝐲 𝜔 = 𝐡𝐻 𝜔 𝐝 𝜔 𝑋 𝜔 + 𝐡𝐻 𝜔 𝐯(𝜔)
▪ The beamformer is distortionless when 𝐡𝐻 𝜔 𝐝 𝜔 = 1
Performance Measures
▪ Taking the first microphone as reference, we define the input and output SNRs:

iSNR(ω) = 𝜙_X(ω) / 𝜙_{V₁}(ω)

oSNR(ω) = 𝜙_X(ω)/𝜙_{V₁}(ω) × |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐯(ω)𝐡(ω))

where 𝜙_f(ω) = E(|f(ω)|²) is the variance of f ∈ {X, V₁}, and 𝚪_𝐯(ω) = E[𝐯(ω)𝐯ᴴ(ω)] / 𝜙_{V₁}(ω) is the pseudo-coherence matrix of the noise.
▪ We deduce the gain in SNR:

𝒢(𝐡(ω)) = oSNR(ω)/iSNR(ω) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐯(ω)𝐡(ω))

▪ WNG (white noise, 𝚪_𝐯(ω) = 𝐈_M): 𝒲(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝐡(ω))

▪ DF (diffuse noise, 𝚪_𝐯(ω) = 𝚪_𝐝(ω) = (1/2) ∫₀^π 𝐝(ω, θ)𝐝ᴴ(ω, θ) sinθ dθ):

𝒟(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / (𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω))
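These quantities can be sketched for a uniform linear array; the sinc form used for the diffuse-field pseudo-coherence is the standard closed-form evaluation of the integral above for a ULA.

```python
import numpy as np

def steering(omega_tau0, M, theta=0.0):
    """ULA steering vector; omega_tau0 = omega * delta / c."""
    return np.exp(-1j * omega_tau0 * np.cos(theta) * np.arange(M))

def gamma_diffuse(omega_tau0, M):
    """Diffuse-field pseudo-coherence of a ULA:
    [Gamma_d]_ij = sin(x)/x with x = omega_tau0 * (j - i)."""
    idx = np.arange(M)
    x = omega_tau0 * (idx[None, :] - idx[:, None])
    return np.sinc(x / np.pi)          # np.sinc(t) = sin(pi t)/(pi t)

def wng(h, d):
    """White noise gain |h^H d|^2 / (h^H h)."""
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ h)

def df(h, d, Gd):
    """Directivity factor |h^H d|^2 / (h^H Gamma_d h)."""
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ Gd @ h)
```

For the delay-and-sum beamformer h = d/M, these reproduce 𝒲 = M and a DF of at least 1, as stated on the next slide.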
Conventional Beamformers
▪ Delay-and-Sum (DS): maximizes the WNG subject to the distortionless constraint

𝐡_DS(ω, θ) = 𝐝(ω, θ)/M,  𝒲(𝐡_DS(ω, θ)) = M = 𝒲_max

𝒟(𝐡_DS(ω, θ)) = M² / (𝐝ᴴ(ω, θ)𝚪_𝐝(ω)𝐝(ω, θ)) ≥ 1

While the DS maximizes the WNG, it never amplifies diffuse noise.

▪ Superdirective (SD): maximizes the DF subject to the distortionless constraint, for the specific case of θ = 0 and small δ

𝐡_SD(ω) = 𝚪_𝐝⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_𝐝⁻¹(ω)𝐝(ω))

While maximizing the DF, 𝐡_SD(ω) can amplify the white noise, especially at low frequencies.
Conventional Beamformers
▪ Robust Superdirective:

𝐡_{R,ε}(ω) = (𝚪_𝐝(ω) + ε𝐈_M)⁻¹𝐝(ω) / (𝐝ᴴ(ω)(𝚪_𝐝(ω) + ε𝐈_M)⁻¹𝐝(ω))

where ε ≥ 0 is a Lagrange multiplier, which enables a compromise between the DF and the WNG.

If we define 𝚪_ε(ω) = 𝚪_𝐝(ω) + ε𝐈_M, we can write

𝐡_{R,ε}(ω) = 𝚪_ε⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_ε⁻¹(ω)𝐝(ω))

While the robust superdirective beamformer controls the white noise amplification, it is not easy to find a closed-form expression for ε for a desired value of the WNG.
Combined Beamformer
R. Berkun, I. Cohen, and J. Benesty, "Combined beamformers for robust broadband regularized superdirective beamforming," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, pp. 877–886, May 2015

Berkun et al. proposed the combined beamformer:

𝐡_{α,ε}(ω) = (𝚪_ε⁻¹(ω) + α(ω)𝐈_M)𝐝(ω) / (𝐝ᴴ(ω)(𝚪_ε⁻¹(ω) + α(ω)𝐈_M)𝐝(ω)),  α ∈ ℝ

It can be reformulated as

𝐡_{α,ε}(ω) = 𝐡_{R,ε}(ω)/(1 + α_ε(ω)) + 𝐡_DS(ω)/(1 + α_ε⁻¹(ω))

where α_ε(ω) = α(ω) 𝒲_max / 𝒟_{max,ε}(ω) and 𝒟_{max,ε}(ω) = 𝐝ᴴ(ω)𝚪_ε⁻¹(ω)𝐝(ω).

For a fixed 𝒲(𝐡_{α,ε}(ω)) = 𝒲₀ < M or a fixed 𝒟(𝐡_{α,ε}(ω)) = 𝒟₀, it is possible to analytically calculate α_ε(ω), and hence α(ω).

While the method finds a closed-form solution for the parameter α(ω), which enables control of the trade-off between the WNG and the DF, it does not address finding the regularization parameter ε, and assumes it is user-determined.
New Noise Field
▪ We assume the signal is corrupted by both diffuse noise and additive white noise.
▪ The input and output SNRs:

iSNR(ω) = tr(𝜙_X(ω)𝐝(ω)𝐝ᴴ(ω)) / tr(𝜙_d(ω)𝚪_𝐝(ω) + 𝜙_w(ω)𝐈_M) = 𝜙_X(ω) / (𝜙_d(ω) + 𝜙_w(ω))

oSNR(ω) = 𝜙_X(ω)|𝐡ᴴ(ω)𝐝(ω)|² / (𝜙_d(ω)𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω) + 𝜙_w(ω)𝐡ᴴ(ω)𝐡(ω))

▪ The SNR gain:

𝒢(𝐡(ω)) = |𝐡ᴴ(ω)𝐝(ω)|² / ( (1 − α(ω))𝐡ᴴ(ω)𝚪_𝐝(ω)𝐡(ω) + α(ω)𝐡ᴴ(ω)𝐡(ω) )

where

α(ω) = 𝜙_w(ω) / (𝜙_d(ω) + 𝜙_w(ω)),  0 ≤ α(ω) ≤ 1
The Optimal Beamformer
▪ The proposed beamformer, which maximizes the SNR gain, is:

𝐡_α(ω) = 𝚪_{𝐝,α}⁻¹(ω)𝐝(ω) / (𝐝ᴴ(ω)𝚪_{𝐝,α}⁻¹(ω)𝐝(ω)), where 𝚪_{𝐝,α}(ω) = (1 − α(ω))𝚪_𝐝(ω) + α(ω)𝐈_M

▪ The SNR gain: 𝒢(𝐡_α(ω)) = 𝐝ᴴ(ω)𝚪_{𝐝,α}⁻¹(ω)𝐝(ω)
▪ The proposed beamformer is equivalent to 𝐡_{R,ε}(ω) with ε(ω) = α(ω)/(1 − α(ω))
▪ Problem: 𝜙_d(ω) and 𝜙_w(ω) are not known → α(ω) is not known.
▪ Advantage 1: α(ω) varies from 0 to 1.
▪ Advantage 2: the gain is continuous and has a single minimum point in this range; the WNG and the DF are monotonic in this range.
▪ Solution: α(ω) is found by employing a binary-like search on each monotonic section.
Algorithm 1
▪ Input: desired gain 𝒢₀ and tolerance
▪ Output: optimal regularization α
1. Find α_min that minimizes the gain (e.g., using gradient descent)
2. Divide the range [0, 1] into 2 sections in which the gain is monotonic: [0, α_min] and [α_min, 1]
3. For each section, apply the following continuous binary search:
4.   Divide the section into 2 sub-sections
5.   Calculate the gain 𝒢ₖ in the middle of each sub-section
6.   Choose the gain 𝒢ₖ and its respective sub-section for which |𝒢ₖ − 𝒢₀| is minimal
7.   if |𝒢ₖ − 𝒢₀| ≤ tolerance then
8.     α ← (middle of chosen sub-section) and stop
9.   else
10.    update the range to be the chosen sub-section and go back to 4
11.  end if
12. Compare the results for [0, α_min] and [α_min, 1], and choose the best
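A sketch of Algorithm 1 for the ULA/diffuse-field model above, with two simplifications of my own: step 1 uses a grid scan instead of gradient descent, and the continuous binary search within each monotonic section is implemented as a plain bisection toward the target gain.

```python
import numpy as np

def steering(omega_tau0, M):
    return np.exp(-1j * omega_tau0 * np.arange(M))   # endfire steering vector

def gamma_diffuse(omega_tau0, M):
    idx = np.arange(M)
    return np.sinc(omega_tau0 * (idx[None, :] - idx[:, None]) / np.pi)

def snr_gain(alpha, d, Gd):
    """G(h_alpha) = d^H Gamma_{d,alpha}^{-1} d,
    with Gamma_{d,alpha} = (1 - alpha) Gamma_d + alpha I."""
    sol = np.linalg.solve((1 - alpha) * Gd + alpha * np.eye(len(d)), d)
    return np.real(d.conj() @ sol)

def find_alpha(G0, d, Gd, tol=1e-6, iters=200):
    """Algorithm 1 sketch: grid-scan for the gain minimum (step 1),
    split [0, 1] into two monotonic sections (step 2), bisect each
    toward the target gain G0 (steps 3-11), keep the best (step 12)."""
    grid = np.linspace(1e-6, 1 - 1e-6, 1024)
    a_min = grid[np.argmin([snr_gain(a, d, Gd) for a in grid])]
    best_err, best_a = np.inf, None
    for lo, hi in [(1e-6, a_min), (a_min, 1 - 1e-6)]:
        direction = np.sign(snr_gain(hi, d, Gd) - snr_gain(lo, d, Gd))
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if (snr_gain(mid, d, Gd) - G0) * direction > 0:
                hi = mid                 # target lies in the lower half
            else:
                lo = mid                 # target lies in the upper half
            if hi - lo <= tol:
                break
        mid = 0.5 * (lo + hi)
        err = abs(snr_gain(mid, d, Gd) - G0)
        if err < best_err:
            best_err, best_a = err, mid
    return best_a
```

Because the gain is continuous with a single interior minimum, the search over the two monotonic sections is one-dimensional and cheap, which is what makes the per-frequency tuning practical.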
Experimental Results
Setup: M = 8 microphones, δ = 1 cm spacing
Array gains for fixed SNR gain
𝛼(𝜔) is found for desired fixed SNR gain 𝒢0 using the proposed algorithm
Experimental Results
Array gains for fixed WNG
𝛼(𝜔) is found for maximal SNR gain under a constant desired WNG 𝒲0 using the proposed algorithm from step 4
→ Our proposed beamformer outperforms the combined beamformer with ε = 10⁻⁴
Experimental Results
Array gains for fixed DF in multi-band
𝛼(𝜔) is found for maximal SNR gain under a piece-wise constant gradually increasing DF using the proposed algorithm from step 4
→ The WNG–DF trade-off can be considered at each frequency band separately!
→ Our proposed beamformer outperforms the combined beamformer with ε = 10⁻⁴
Conclusions
Results summary:
▪ The proposed approach facilitates the design of beamformers with fixed SNR gain, beamformers with maximal SNR gain for constant WNG or DF, and multi-band fixed beamformers.
▪ Enables a fine tuning of the compromise between the DF and robustness against white noise.
Future work:
▪ Testing various angles of incidence other than the end-fire direction.
▪ Incorporating other considerations such as side-lobe requirements and performance under other types of noise fields.