SPEECH ENHANCEMENT
Sefik Emre Eskimez
Dept. of Electrical and Computer Engineering
University of Rochester, Rochester, NY
Motivation
Corruption present in the speech signal reduces the performance of automatic processes, such as:
Automatic speech recognition (ASR)
Automatic speaker identification/verification (ASID/ASV)
Automatic speech emotion recognition (ASER)
Try it with Amazon's Alexa or Google's Assistant!
Hearing implant performance suffers in noisy conditions.
Ideal Cases
Problem Definition – Additive Noise
s(t) is the speech signal
n(t) is the noise signal
m(t) = s(t) + n(t)
Given m(t), estimate s(t)!
Approaches
1. Spectral Subtraction (early work, 1979-1984)
   Estimate the noise spectrum and subtract it from the noisy speech spectrum.
   a. Wiener Filtering
      An LTI filter to estimate the clean speech.
   b. Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA) Estimator
      A short-time spectral amplitude estimator that minimizes the mean-square error of the log-spectra.
2. Non-negative Dictionary Learning
   Uses sparse coding and a voice activity detector to determine which frames belong to noise and which belong to speech. Usually two dictionaries are built, one for speech and one for noise.
3. Deep Learning Approaches
Spectral Subtraction
Taking the Fourier transform yields:
m(t) = s(t) + n(t) ⟷ M(e^jw) = S(e^jw) + N(e^jw)
The speech spectrum S(e^jw) can then be estimated as:
Ŝ(e^jw) = (|M(e^jw)| − N̂_μ(e^jw)) e^{jθ_M},
where Ŝ and N̂_μ are the speech and noise estimates, θ_M is the phase of the noisy spectrum, and
N̂_μ(e^jw) = E[|N(e^jw)|].
The noise estimate is usually computed from the first few frames of the input signal.
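Below is a minimal numpy/scipy sketch of this procedure; the frame length, the number of noise-only frames, and the half-wave rectification of negative magnitudes are illustrative choices, not parameters taken from the slides.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(m, fs, noise_frames=6):
    """Magnitude spectral subtraction, reusing the noisy phase."""
    # STFT of the noisy mixture m(t)
    f, t, M = stft(m, fs=fs, nperseg=512)
    mag, phase = np.abs(M), np.angle(M)

    # Noise magnitude estimate: average of the first few frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate and clip negative values (half-wave rectification)
    s_mag = np.maximum(mag - noise_mag, 0.0)

    # Recombine with the noisy phase and invert
    S_hat = s_mag * np.exp(1j * phase)
    _, s_hat = istft(S_hat, fs=fs, nperseg=512)
    return s_hat
```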
Wiener Filtering
M(e^jw) = S(e^jw) + N(e^jw)
A filter can be defined as follows:
H(e^jw) = S(e^jw) / M(e^jw)
The filter can be estimated using the noise estimate:
Ĥ(e^jw) = (M(e^jw) − N̂(e^jw)) / M(e^jw)
(Block diagram: the noisy mixture of s(t) and n(t) passes through H(e^jw) to produce the enhanced speech ŝ(t).)
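A matching sketch of this gain-based view, applying Ĥ(e^jw) = (|M| − N̂)/|M| per time-frequency bin; the noise estimate from the first frames and the clipping of the gain to [0, 1] are again illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_like_gain(m, fs, noise_frames=6, eps=1e-8):
    """Apply the slide's gain H = (|M| - N_hat) / |M| per T-F bin."""
    f, t, M = stft(m, fs=fs, nperseg=512)
    mag = np.abs(M)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Clip the gain to [0, 1] to avoid negative or amplifying values
    H = np.clip((mag - noise_mag) / (mag + eps), 0.0, 1.0)

    _, s_hat = istft(H * M, fs=fs, nperseg=512)
    return s_hat
```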
Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA)
Let's simplify the notation: S(e^jw) → S
Log-MMSE STSA minimizes the logarithmic mean square error:
E[(log₁₀ S − log₁₀ Ŝ)²]
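For reference, the closed-form gain of this estimator (Ephraim and Malah, 1985) can be written in terms of the a priori SNR ξ and the a posteriori SNR γ. The sketch below only evaluates that gain and assumes ξ and γ have already been estimated (e.g., with decision-directed smoothing); it is not the full enhancement pipeline.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(v)

def log_mmse_gain(xi, gamma):
    """Log-MMSE STSA gain: G = xi/(1+xi) * exp(0.5 * E1(v)), with v = gamma*xi/(1+xi).

    xi:    a priori SNR per T-F bin
    gamma: a posteriori SNR per T-F bin
    """
    v = gamma * xi / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

# The enhanced magnitude is |S_hat| = G * |M|, recombined with the noisy phase.
```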
Non-negative Dictionary Learning
Let us denote the basis matrices of speech and noise as W_s and W_n, respectively.
The basis matrix for the noisy signal is the concatenation
W = [W_s  W_n].
The noisy spectrogram can be represented as M ≈ WH,
where the noisy NMF coefficients are
H = [H_s^T  H_n^T]^T.
Non-negative Dictionary Learning
The speech mask can be obtained as follows (element-wise):
m_s = (W_s H_s) / (W_s H_s + W_n H_n)
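A numpy sketch of this masking step, assuming W_s and W_n have already been learned from clean speech and noise spectrograms; the multiplicative update used to estimate H with the dictionary held fixed is the standard Frobenius-norm NMF rule, and all variable names are illustrative.

```python
import numpy as np

def nmf_speech_mask(M, Ws, Wn, n_iter=100, eps=1e-8):
    """Estimate H for the mixture with W = [Ws Wn] fixed, then build the soft mask.

    M:  noisy magnitude spectrogram, shape (freq, time)
    Ws: speech basis matrix, shape (freq, k_s)
    Wn: noise basis matrix,  shape (freq, k_n)
    """
    W = np.hstack([Ws, Wn])
    H = np.random.rand(W.shape[1], M.shape[1])

    # Multiplicative updates minimizing the Frobenius reconstruction error,
    # with the dictionary W held fixed
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ (W @ H) + eps)

    Hs, Hn = H[:Ws.shape[1]], H[Ws.shape[1]:]
    mask = (Ws @ Hs) / (Ws @ Hs + Wn @ Hn + eps)  # m_s from the slide
    return mask, mask * M
```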
Time-Frequency (T-F) Masks
T-F masks operate on the magnitude spectra of the signal.
Let S_t(f), N_t(f) and M_t(f) be the magnitude spectra of the speech, noise and mixture signals, respectively.
Time-Frequency (T-F) Masks
(Figure: spectrograms of the speech S, noise N and mixture M.)
T-F Masks
Ideal Binary Masks (IBM), with a 0 dB threshold:
IBM_t(f) = 1 if S_t(f) > N_t(f), 0 otherwise
Ideal Binary Masks (IBM)
(Figure: spectrograms of S, M, the IBM, and IBM ⊙ M.)
The problem becomes a binary classification task!
Given M_t(f), determine whether each T-F unit belongs to speech or noise.
This can be estimated with any machine learning classifier.
Problem: the results obtained even from the ground-truth IBM mask contain "musical noise".
T-F Masks
Amplitude Soft Masks (ASM), also called Ideal Ratio Masks (IRM):
IRM_t(f) = S_t(f) / (S_t(f) + N_t(f))
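Both ideal masks are one-liners once the clean speech and noise magnitude spectrograms are available (which, by construction, is only the case at training time). A minimal sketch:

```python
import numpy as np

def ideal_binary_mask(S, N):
    """IBM_t(f) = 1 where S_t(f) > N_t(f), else 0 (0 dB threshold)."""
    return (S > N).astype(np.float32)

def ideal_ratio_mask(S, N, eps=1e-8):
    """IRM_t(f) = S_t(f) / (S_t(f) + N_t(f))."""
    return S / (S + N + eps)

# Applying a mask to the mixture magnitude M recovers a speech estimate:
# S_hat = IBM * M   or   S_hat = IRM * M   (element-wise, i.e. the ⊙ in the slides)
```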
Ideal Ratio Masks (IRM)
(Figure: spectrograms of S, M, the IRM, and IRM ⊙ M.)
Predicting Masks – System Overview
Predicting Masks – Features
Mel-Frequency Cepstrum (MFC)
Magnitude Spectra
Raw waveform
These can be supplemented by traditional features (a feature-extraction sketch follows below).
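A minimal feature-extraction sketch using the third-party librosa library; the sampling rate, FFT size, hop length, number of coefficients and file path are illustrative, not the settings used in this work.

```python
import librosa
import numpy as np

# Load a waveform (path and sampling rate are illustrative)
y, sr = librosa.load("noisy.wav", sr=16000)

# Magnitude spectra (input to the mask-predicting networks)
mag = np.abs(librosa.stft(y, n_fft=512, hop_length=256))

# Mel-frequency cepstral coefficients as a traditional supplementary feature
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=512, hop_length=256)
```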
Autoencoder-Based Methods
Two types:
1. Trained with only clean speech
   The network learns a speech representation.
2. Trained with noisy-clean speech pairs
   The network learns the transfer function from noisy to clean speech.
Recurrent Neural Network (RNN)
RNNs are useful for modeling temporal relations.
Huang et al. (Huang, Kim et al. 2015) proposed predicting masks with the following cost function (a sketch of this loss appears below):
min ‖m̂_s − m_s‖² + ‖m̂_n − m_n‖² − ‖m̂_s − m_n‖² − ‖m̂_n − m_s‖²
where m_s and m_n are the target speech and noise masks, and m̂_s and m̂_n are their estimates.
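A PyTorch sketch of this discriminative loss; the original paper weights the two cross terms with a constant, which the slide omits, so a unit weight is assumed here, and all tensor names are illustrative.

```python
import torch

def discriminative_mask_loss(ms_hat, mn_hat, ms, mn):
    """||m̂s - ms||² + ||m̂n - mn||² - ||m̂s - mn||² - ||m̂n - ms||²"""
    sq = lambda a, b: ((a - b) ** 2).sum()  # squared Frobenius norm
    return sq(ms_hat, ms) + sq(mn_hat, mn) - sq(ms_hat, mn) - sq(mn_hat, ms)
```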
Redundant Convolutional Encoder-Decoder (R-CED)
Park et al. (Park and Lee 2016) proposed a convolutional network with 1-dimensional convolutions that operate along the frequency axis (see the sketch below).
Convolutional networks have fewer parameters than RNNs, which makes them feasible for small devices such as hearing implants!
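A minimal PyTorch sketch of the key idea: 1-D convolution along the frequency axis, with consecutive time frames acting as input channels. The number of frames, filters and kernel size below are illustrative, not the R-CED configuration from the paper.

```python
import torch
import torch.nn as nn

class FreqConvBlock(nn.Module):
    """1-D convolution along the frequency axis; input frames act as channels."""
    def __init__(self, in_frames=8, filters=16, kernel=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_frames, filters, kernel, padding=kernel // 2),
            nn.BatchNorm1d(filters),
            nn.ReLU(),
            nn.Conv1d(filters, 1, kernel, padding=kernel // 2),  # back to one output frame
        )

    def forward(self, x):          # x: (batch, in_frames, freq_bins)
        return self.net(x)         # (batch, 1, freq_bins): enhanced centre frame

# Example: 8 context frames of a 257-bin magnitude spectrogram
x = torch.randn(4, 8, 257)
y = FreqConvBlock()(x)             # torch.Size([4, 1, 257])
```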
Predicting Masks – Our Methods
Convolutional Encoder-Decoder (CED) network with skip connections
Bidirectional Long Short-Term Memory (BLSTM) network
Convolutional Encoder-Decoder (CED)
(Network diagram.) The input spectrogram passes through an encoder of Conv + BN + ReLU blocks with 64, 128, 256 and 512 filters, followed by a decoder of Deconv + BN + ReLU blocks with 256, 128 and 64 filters; skip connections link the encoder and decoder. Two final Conv + BN + ReLU layers with a single filter each produce the speech mask and the noise mask.
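A PyTorch sketch of one plausible reading of the diagram: 2-D Conv/Deconv + BN + ReLU blocks with the filter counts above, additive skip connections, and two single-filter heads for the speech and noise masks. Kernel sizes, strides and the exact skip wiring are assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

def block(conv, in_ch, out_ch):
    """Conv/Deconv + BatchNorm + ReLU; the 3x3 kernel is an assumption."""
    return nn.Sequential(conv(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU())

class CED(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 64 -> 128 -> 256 -> 512 filters
        self.enc1 = block(nn.Conv2d, 1, 64)
        self.enc2 = block(nn.Conv2d, 64, 128)
        self.enc3 = block(nn.Conv2d, 128, 256)
        self.enc4 = block(nn.Conv2d, 256, 512)
        # Decoder: 256 -> 128 -> 64 filters
        self.dec1 = block(nn.ConvTranspose2d, 512, 256)
        self.dec2 = block(nn.ConvTranspose2d, 256, 128)
        self.dec3 = block(nn.ConvTranspose2d, 128, 64)
        # Two single-filter Conv + BN + ReLU heads: speech mask and noise mask
        self.speech_head = block(nn.Conv2d, 64, 1)
        self.noise_head = block(nn.Conv2d, 64, 1)

    def forward(self, x):                      # x: (batch, 1, freq, time) spectrogram
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d1 = self.dec1(e4) + e3                # additive skip connections (an assumption)
        d2 = self.dec2(d1) + e2
        d3 = self.dec3(d2) + e1
        return self.speech_head(d3), self.noise_head(d3)

# Example forward pass on a (257-bin x 64-frame) spectrogram patch
ms, mn = CED()(torch.randn(2, 1, 257, 64))
```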
Bidirectional Long Short-Term Memory (BLSTM)
Predicting Masks – Comparison with Other Methods
(Figure: (a) noisy and (b) clean spectrograms, and enhanced spectrograms from (c) SS, (d) Log-MMSE, (e) RNN, (f) R-CED, (g) BLSTM and (h) CED.)
Evaluation metrics
• Objective measures:
  • Perceptual Evaluation of Speech Quality (PESQ) – ranges from -0.5 to 4.5
  • Short-Time Objective Intelligibility (STOI) – ranges from 0 to 1
  • Segmental SNR (SSNR, in dB)
  • Log-Spectral Distortion (LSD, in dB)
  • Hearing Aids Speech Quality Index (HASQI)
  • Hearing Aids Speech Perception Index (HASPI)
  • Speech Distortion Index (SDI)
• Subjective measures:
  • Listening tests
The most important metrics! (A computation sketch for PESQ and STOI follows below.)
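A minimal sketch of computing the two headline metrics with the third-party `pesq` and `pystoi` packages; the package choice, sampling rate and file names are assumptions, since the slides do not say which implementation was used.

```python
import soundfile as sf
from pesq import pesq     # ITU-T P.862 implementation (pip install pesq)
from pystoi import stoi   # pip install pystoi

clean, fs = sf.read("clean.wav")        # reference signal (illustrative file names)
enhanced, _ = sf.read("enhanced.wav")   # output of the enhancement system

# 'wb' (wide-band) mode assumes 16 kHz audio; PESQ scores range roughly from -0.5 to 4.5
print("PESQ:", pesq(fs, clean, enhanced, "wb"))

# STOI ranges from 0 to 1
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```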
RESULTS - PESQ
RESULTS - STOI
More examples…
The End…
Thank you!
References
Loizou, Philipos C. Speech enhancement: theory and practice. CRC press, 2013.
Boll, Steven. "Suppression of acoustic noise in speech using spectral subtraction." IEEE Transactions
on acoustics, speech, and signal processing 27.2 (1979): 113-120.
Ephraim, Yariv, and David Malah. "Speech enhancement using a minimum mean-square error log-
spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing 33.2
(1985): 443-445.
Huang, Po-Sen, et al. "Joint optimization of masks and deep recurrent neural networks for monaural
source separation." IEEE/ACM Transactions on Audio, Speech and Language Processing
(TASLP) 23.12 (2015): 2136-2147.
Park, Se Rim, and Jinwon Lee. "A fully convolutional neural network for speech
enhancement." arXiv preprint arXiv:1609.07132 (2016).
Mohammadiha, Nasser. Speech Enhancement Using Nonnegative Matrix Factorization and Hidden
Markov Models. Diss. KTH Royal Institute of Technology, 2013.
Wang, Yuxuan, Arun Narayanan, and DeLiang Wang. "On training targets for supervised speech
separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22.12
(2014): 1849-1858.
Lu, Xugang, et al. "Speech enhancement based on deep denoising autoencoder." Interspeech. 2013.