SPEECH ENHANCEMENT
Sefik Emre Eskimez
Dept. of Electrical and Computer Engineering
University of Rochester, Rochester, NY
Motivation
Corruption present in the speech signal reduces the performance of automatic processes, such as:
Automatic speech recognition (ASR)
Automatic speaker identification/verification (ASID/ASV)
Automatic speech emotion recognition (ASER)
Try it with Amazon's Alexa or Google's Assistant!
Hearing implant performance suffers in noisy conditions.
Ideal Cases
Problem Definition – Additive Noise
s(t) is the speech signal
n(t) is the noise signal
m(t) = s(t) + n(t)
Given m(t), estimate s(t)!
Approaches
1. Spectral Subtraction (early work, 1979-1984)
   Estimate the noise spectrum and subtract it from the noisy speech spectrum.
   a. Wiener Filtering
      An LTI filter to estimate the clean speech.
   b. Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA) Estimator
      A short-time spectral amplitude estimator that minimizes the mean-square error of the log-spectra.
2. Non-negative Dictionary Learning
   Uses sparse coding and a voice activity detector to determine which frames belong to noise and which belong to speech. Usually two dictionaries are built, one for speech and one for noise.
3. Deep Learning Approaches
Spectral Subtraction
Taking the Fourier transform yields:
m(t) = s(t) + n(t) ⟷ M(e^jw) = S(e^jw) + N(e^jw)
The speech spectrum S(e^jw) can then be estimated as:
Ŝ(e^jw) = (|M(e^jw)| − N̂_μ(e^jw)) e^{jθ_M},
where Ŝ and N̂_μ are the speech and noise estimates, θ_M is the phase of the noisy spectrum, and
N̂_μ(e^jw) = E[|N(e^jw)|].
The noise estimate is usually computed from the first few frames of the input signal.
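Below is a minimal numpy/scipy sketch of this procedure; the frame length, the number of noise-only frames, and the half-wave rectification of negative magnitudes are illustrative choices, not parameters taken from the slides.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(m, fs, noise_frames=6):
    """Magnitude spectral subtraction, reusing the noisy phase."""
    # STFT of the noisy mixture m(t)
    f, t, M = stft(m, fs=fs, nperseg=512)
    mag, phase = np.abs(M), np.angle(M)

    # Noise magnitude estimate: average of the first few frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate and clip negative values (half-wave rectification)
    s_mag = np.maximum(mag - noise_mag, 0.0)

    # Recombine with the noisy phase and invert
    S_hat = s_mag * np.exp(1j * phase)
    _, s_hat = istft(S_hat, fs=fs, nperseg=512)
    return s_hat
```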
Wiener Filtering
M(e^jw) = S(e^jw) + N(e^jw)
A filter can be defined as follows:
H(e^jw) = S(e^jw) / M(e^jw)
The filter can be estimated using the noise estimate:
Ĥ(e^jw) = (M(e^jw) − N̂(e^jw)) / M(e^jw)
(Block diagram: the noisy mixture of s(t) and n(t) passes through H(e^jw) to produce the enhanced speech ŝ(t).)
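A matching sketch of this gain-based view, applying Ĥ(e^jw) = (|M| − N̂)/|M| per time-frequency bin; the noise estimate from the first frames and the clipping of the gain to [0, 1] are again illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_like_gain(m, fs, noise_frames=6, eps=1e-8):
    """Apply the slide's gain H = (|M| - N_hat) / |M| per T-F bin."""
    f, t, M = stft(m, fs=fs, nperseg=512)
    mag = np.abs(M)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Clip the gain to [0, 1] to avoid negative or amplifying values
    H = np.clip((mag - noise_mag) / (mag + eps), 0.0, 1.0)

    _, s_hat = istft(H * M, fs=fs, nperseg=512)
    return s_hat
```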
Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA)
Let's simplify the notation: S(e^jw) → S
Log-MMSE STSA minimizes the logarithmic mean square error:
E[(log₁₀ S − log₁₀ Ŝ)²]
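For reference, the closed-form gain of this estimator (Ephraim and Malah, 1985) can be written in terms of the a priori SNR ξ and the a posteriori SNR γ. The sketch below only evaluates that gain and assumes ξ and γ have already been estimated (e.g., with decision-directed smoothing); it is not the full enhancement pipeline.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(v)

def log_mmse_gain(xi, gamma):
    """Log-MMSE STSA gain: G = xi/(1+xi) * exp(0.5 * E1(v)), with v = gamma*xi/(1+xi).

    xi:    a priori SNR per T-F bin
    gamma: a posteriori SNR per T-F bin
    """
    v = gamma * xi / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

# The enhanced magnitude is |S_hat| = G * |M|, recombined with the noisy phase.
```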
Non-negative Dictionary Learning
Let us denote the basis matrices of speech and noise as W_s and W_n, respectively.
The basis matrix for the noisy signal is the concatenation
W = [W_s  W_n].
The noisy spectrogram can be represented as M ≈ WH,
where the noisy NMF coefficients are
H = [H_s^T  H_n^T]^T.
Non-negative Dictionary Learning
The speech mask can be obtained as follows (element-wise):
m_s = (W_s H_s) / (W_s H_s + W_n H_n)
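A numpy sketch of this masking step, assuming W_s and W_n have already been learned from clean speech and noise spectrograms; the multiplicative update used to estimate H with the dictionary held fixed is the standard Frobenius-norm NMF rule, and all variable names are illustrative.

```python
import numpy as np

def nmf_speech_mask(M, Ws, Wn, n_iter=100, eps=1e-8):
    """Estimate H for the mixture with W = [Ws Wn] fixed, then build the soft mask.

    M:  noisy magnitude spectrogram, shape (freq, time)
    Ws: speech basis matrix, shape (freq, k_s)
    Wn: noise basis matrix,  shape (freq, k_n)
    """
    W = np.hstack([Ws, Wn])
    H = np.random.rand(W.shape[1], M.shape[1])

    # Multiplicative updates minimizing the Frobenius reconstruction error,
    # with the dictionary W held fixed
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ (W @ H) + eps)

    Hs, Hn = H[:Ws.shape[1]], H[Ws.shape[1]:]
    mask = (Ws @ Hs) / (Ws @ Hs + Wn @ Hn + eps)  # m_s from the slide
    return mask, mask * M
```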
Time-Frequency (T-F) Masks
T-F masks operate on the magnitude spectra of the signal.
Let S_t(f), N_t(f) and M_t(f) be the magnitude spectra of the speech, noise and mixture signals, respectively.
Time-Frequency (T-F) Masks
(Figure: spectrograms of the speech S, noise N and mixture M.)
T-F Masks
Ideal Binary Masks (IBM), with a 0 dB threshold:
IBM_t(f) = 1 if S_t(f) > N_t(f), 0 otherwise
Ideal Binary Masks (IBM)
(Figure: spectrograms of S, M, the IBM, and IBM ⊙ M.)
The problem becomes a binary classification task!
Given M_t(f), determine whether each T-F unit belongs to speech or noise.
This can be estimated with any machine learning classifier.
Problem: the results obtained even from the ground-truth IBM mask contain "musical noise".
T-F Masks
Amplitude Soft Masks (ASM), also called Ideal Ratio Masks (IRM):
IRM_t(f) = S_t(f) / (S_t(f) + N_t(f))
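Both ideal masks are one-liners once the clean speech and noise magnitude spectrograms are available (which, by construction, is only the case at training time). A minimal sketch:

```python
import numpy as np

def ideal_binary_mask(S, N):
    """IBM_t(f) = 1 where S_t(f) > N_t(f), else 0 (0 dB threshold)."""
    return (S > N).astype(np.float32)

def ideal_ratio_mask(S, N, eps=1e-8):
    """IRM_t(f) = S_t(f) / (S_t(f) + N_t(f))."""
    return S / (S + N + eps)

# Applying a mask to the mixture magnitude M recovers a speech estimate:
# S_hat = IBM * M   or   S_hat = IRM * M   (element-wise, i.e. the ⊙ in the slides)
```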
Ideal Ratio Masks (IRM)
(Figure: spectrograms of S, M, the IRM, and IRM ⊙ M.)
Predicting Masks – System Overview
Predicting Masks – Features
Mel-Frequency Cepstrum (MFC)
Magnitude Spectra
Raw waveform
These can be supplemented by traditional features (a feature-extraction sketch follows below).
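A minimal feature-extraction sketch using the third-party librosa library; the sampling rate, FFT size, hop length, number of coefficients and file path are illustrative, not the settings used in this work.

```python
import librosa
import numpy as np

# Load a waveform (path and sampling rate are illustrative)
y, sr = librosa.load("noisy.wav", sr=16000)

# Magnitude spectra (input to the mask-predicting networks)
mag = np.abs(librosa.stft(y, n_fft=512, hop_length=256))

# Mel-frequency cepstral coefficients as a traditional supplementary feature
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=512, hop_length=256)
```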
Autoencoder-Based Methods
Two types:
1. Trained with only clean speech
   The network learns a speech representation.
2. Trained with noisy-clean speech pairs
   The network learns the transfer function from noisy to clean speech.
Recurrent Neural Network (RNN)
RNNs are useful for modeling temporal relations.
Huang et al. (Huang, Kim et al. 2015) proposed predicting masks with the following cost function (a sketch of this loss appears below):
min ‖m̂_s − m_s‖² + ‖m̂_n − m_n‖² − ‖m̂_s − m_n‖² − ‖m̂_n − m_s‖²
where m_s and m_n are the target speech and noise masks, and m̂_s and m̂_n are their estimates.
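A PyTorch sketch of this discriminative loss; the original paper weights the two cross terms with a constant, which the slide omits, so a unit weight is assumed here, and all tensor names are illustrative.

```python
import torch

def discriminative_mask_loss(ms_hat, mn_hat, ms, mn):
    """||m̂s - ms||² + ||m̂n - mn||² - ||m̂s - mn||² - ||m̂n - ms||²"""
    sq = lambda a, b: ((a - b) ** 2).sum()  # squared Frobenius norm
    return sq(ms_hat, ms) + sq(mn_hat, mn) - sq(ms_hat, mn) - sq(mn_hat, ms)
```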
Redundant Convolutional Encoder-Decoder (R-CED)
Park et al. (Park and Lee 2016) proposed a convolutional network with 1-dimensional convolutions that operate along the frequency axis (see the sketch below).
Convolutional networks have fewer parameters than RNNs, which makes them feasible for small devices such as hearing implants!
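A minimal PyTorch sketch of the key idea: 1-D convolution along the frequency axis, with consecutive time frames acting as input channels. The number of frames, filters and kernel size below are illustrative, not the R-CED configuration from the paper.

```python
import torch
import torch.nn as nn

class FreqConvBlock(nn.Module):
    """1-D convolution along the frequency axis; input frames act as channels."""
    def __init__(self, in_frames=8, filters=16, kernel=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_frames, filters, kernel, padding=kernel // 2),
            nn.BatchNorm1d(filters),
            nn.ReLU(),
            nn.Conv1d(filters, 1, kernel, padding=kernel // 2),  # back to one output frame
        )

    def forward(self, x):          # x: (batch, in_frames, freq_bins)
        return self.net(x)         # (batch, 1, freq_bins): enhanced centre frame

# Example: 8 context frames of a 257-bin magnitude spectrogram
x = torch.randn(4, 8, 257)
y = FreqConvBlock()(x)             # torch.Size([4, 1, 257])
```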
Predicting Masks – Our Methods
Convolutional Encoder-Decoder (CED) network with skip connections
Bidirectional Long Short-Term Memory (BLSTM) network
Convolutional Encoder-Decoder (CED)
(Network diagram.) The input spectrogram passes through an encoder of Conv + BN + ReLU blocks with 64, 128, 256 and 512 filters, followed by a decoder of Deconv + BN + ReLU blocks with 256, 128 and 64 filters; skip connections link the encoder and decoder. Two final Conv + BN + ReLU layers with a single filter each produce the speech mask and the noise mask.
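A PyTorch sketch of one plausible reading of the diagram: 2-D Conv/Deconv + BN + ReLU blocks with the filter counts above, additive skip connections, and two single-filter heads for the speech and noise masks. Kernel sizes, strides and the exact skip wiring are assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

def block(conv, in_ch, out_ch):
    """Conv/Deconv + BatchNorm + ReLU; the 3x3 kernel is an assumption."""
    return nn.Sequential(conv(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU())

class CED(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 64 -> 128 -> 256 -> 512 filters
        self.enc1 = block(nn.Conv2d, 1, 64)
        self.enc2 = block(nn.Conv2d, 64, 128)
        self.enc3 = block(nn.Conv2d, 128, 256)
        self.enc4 = block(nn.Conv2d, 256, 512)
        # Decoder: 256 -> 128 -> 64 filters
        self.dec1 = block(nn.ConvTranspose2d, 512, 256)
        self.dec2 = block(nn.ConvTranspose2d, 256, 128)
        self.dec3 = block(nn.ConvTranspose2d, 128, 64)
        # Two single-filter Conv + BN + ReLU heads: speech mask and noise mask
        self.speech_head = block(nn.Conv2d, 64, 1)
        self.noise_head = block(nn.Conv2d, 64, 1)

    def forward(self, x):                      # x: (batch, 1, freq, time) spectrogram
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d1 = self.dec1(e4) + e3                # additive skip connections (an assumption)
        d2 = self.dec2(d1) + e2
        d3 = self.dec3(d2) + e1
        return self.speech_head(d3), self.noise_head(d3)

# Example forward pass on a (257-bin x 64-frame) spectrogram patch
ms, mn = CED()(torch.randn(2, 1, 257, 64))
```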
Bidirectional Long Short-Term Memory (BLSTM)
Predicting Masks – Comparison with Other Methods
(Figure: (a) noisy and (b) clean spectrograms, and enhanced spectrograms from (c) SS, (d) Log-MMSE, (e) RNN, (f) R-CED, (g) BLSTM and (h) CED.)
Evaluation metrics
• Objective measures:
  • Perceptual Evaluation of Speech Quality (PESQ) – ranges from -0.5 to 4.5
  • Short-Time Objective Intelligibility (STOI) – ranges from 0 to 1
  • Segmental SNR (SSNR, in dB)
  • Log-Spectral Distortion (LSD, in dB)
  • Hearing Aids Speech Quality Index (HASQI)
  • Hearing Aids Speech Perception Index (HASPI)
  • Speech Distortion Index (SDI)
• Subjective measures:
  • Listening tests
The most important metrics! (A computation sketch for PESQ and STOI follows below.)
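A minimal sketch of computing the two headline metrics with the third-party `pesq` and `pystoi` packages; the package choice, sampling rate and file names are assumptions, since the slides do not say which implementation was used.

```python
import soundfile as sf
from pesq import pesq     # ITU-T P.862 implementation (pip install pesq)
from pystoi import stoi   # pip install pystoi

clean, fs = sf.read("clean.wav")        # reference signal (illustrative file names)
enhanced, _ = sf.read("enhanced.wav")   # output of the enhancement system

# 'wb' (wide-band) mode assumes 16 kHz audio; PESQ scores range roughly from -0.5 to 4.5
print("PESQ:", pesq(fs, clean, enhanced, "wb"))

# STOI ranges from 0 to 1
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```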
RESULTS - PESQ
RESULTS - STOI
More examples…
The End…
Thank you!
References
Loizou, Philipos C. Speech enhancement: theory and practice. CRC press, 2013.
Boll, Steven. "Suppression of acoustic noise in speech using spectral subtraction." IEEE Transactions
on acoustics, speech, and signal processing 27.2 (1979): 113-120.
Ephraim, Yariv, and David Malah. "Speech enhancement using a minimum mean-square error log-
spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing 33.2
(1985): 443-445.
Huang, Po-Sen, et al. "Joint optimization of masks and deep recurrent neural networks for monaural
source separation." IEEE/ACM Transactions on Audio, Speech and Language Processing
(TASLP) 23.12 (2015): 2136-2147.
Park, Se Rim, and Jinwon Lee. "A fully convolutional neural network for speech
enhancement." arXiv preprint arXiv:1609.07132 (2016).
Mohammadiha, Nasser. Speech Enhancement Using Nonnegative Matrix Factorization and Hidden
Markov Models. Diss. KTH Royal Institute of Technology, 2013.
Wang, Yuxuan, Arun Narayanan, and DeLiang Wang. "On training targets for supervised speech
separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22.12
(2014): 1849-1858.
Lu, Xugang, et al. "Speech enhancement based on deep denoising autoencoder." Interspeech. 2013.