
SPEECH ENHANCEMENT

Sefik Emre Eskimez

Dept. of Electrical and Computer Engineering

University of Rochester, Rochester, NY

Motivation

Corruption present in the speech signal reduces the performance of automatic processes, such as:

Automatic speech recognition (ASR)

Automatic speaker identification/verification (ASID/ASV)

Automatic speech emotion recognition (ASER)

Try it with Amazon's Alexa or Google's Assistant!

Hearing implant performance also suffers in noisy conditions.

Ideal Cases

Problem Definition – Additive Noise

s(t) is the speech signal

n(t) is the noise signal

m(t) = s(t) + n(t)

Given m(t), estimate s(t)!
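To make the additive model concrete, here is a minimal Python sketch that mixes a clean utterance with noise at a target SNR. The file names are hypothetical placeholders, and the soundfile package is an assumption, not something the slides specify.

```python
# A minimal sketch of the additive-noise model: mix clean speech s(t) with
# noise n(t) at a target SNR.
import numpy as np
import soundfile as sf  # assumes the soundfile package is installed

s, sr = sf.read("clean.wav")   # hypothetical clean-speech file
n, _ = sf.read("noise.wav")    # hypothetical noise file
n = n[: len(s)]                # trim noise to the speech length

snr_db = 0.0                   # target signal-to-noise ratio in dB
# Scale the noise so that 10*log10(P_s / P_n) equals snr_db.
p_s = np.mean(s ** 2)
p_n = np.mean(n ** 2)
n_scaled = n * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))

m = s + n_scaled               # the observed mixture m(t) = s(t) + n(t)
sf.write("mixture.wav", m, sr)
```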

Approaches

1. Spectral Subtraction

Estimate the noise spectrum and subtract it from the noisy speech spectrum.

a. Wiener Filtering

An LTI filter that estimates the clean speech.

b. Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA) Estimator

A short-time spectral amplitude estimator that minimizes the mean-square error of the log-spectra.

2. Non-negative Dictionary Learning

Uses sparse coding and a voice activity detector to determine which frames belong to noise and which belong to speech. Usually two dictionaries are built, one for speech and one for noise.

3. Deep Learning Approaches

Early work (1979-1984)

Spectral Subtraction

Taking the Fourier transform yields:

m(t) = s(t) + n(t) ⟷ M(e^{jω}) = S(e^{jω}) + N(e^{jω})

The speech spectrum S(e^{jω}) can be estimated as:

Ŝ(e^{jω}) = [ |M(e^{jω})| − N_μ(e^{jω}) ] e^{jθ_M},

where Ŝ and N_μ are the speech and noise estimates,

N_μ(e^{jω}) = E[ |N(e^{jω})| ].

The noise estimate is usually computed from the first few frames of the input signal.
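A minimal sketch of this procedure, assuming stationary noise estimated from the first few STFT frames; the 512-sample frame length is an illustrative choice, not taken from the slides.

```python
# A minimal spectral-subtraction sketch (after Boll 1979).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(m, sr, noise_frames=10):
    f, t, M = stft(m, fs=sr, nperseg=512)
    mag, phase = np.abs(M), np.angle(M)
    # Noise magnitude estimate N_mu: average over the first few frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract and half-wave rectify to keep magnitudes non-negative.
    s_mag = np.maximum(mag - noise_mag, 0.0)
    # Reuse the noisy phase theta_M for reconstruction.
    _, s_hat = istft(s_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return s_hat
```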

Wiener Filtering

M(e^{jω}) = S(e^{jω}) + N(e^{jω})

A filter can be defined as follows:

H(e^{jω}) = S(e^{jω}) / M(e^{jω})

The filter can be estimated using the noise estimate:

H(e^{jω}) = [ M(e^{jω}) − N̂(e^{jω}) ] / M(e^{jω})

[Block diagram: s(t) and n(t) sum to the mixture, which passes through H(e^{jω}) to produce the estimate ŝ(t).]
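A minimal sketch of this filter, using a power-spectrum variant of the gain (the slide writes it in magnitude terms) and the same first-frames noise estimate as above.

```python
# A minimal frequency-domain Wiener-style gain sketch.
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(m, sr, noise_frames=10, eps=1e-8):
    f, t, M = stft(m, fs=sr, nperseg=512)
    noise_psd = (np.abs(M[:, :noise_frames]) ** 2).mean(axis=1, keepdims=True)
    mix_psd = np.abs(M) ** 2
    # H = (|M|^2 - |N|^2) / |M|^2, clipped to [0, 1].
    H = np.clip((mix_psd - noise_psd) / (mix_psd + eps), 0.0, 1.0)
    _, s_hat = istft(H * M, fs=sr, nperseg=512)
    return s_hat
```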

Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA)

Let's simplify the notation: S(e^{jω}) → S.

Log-MMSE STSA minimizes the logarithmic mean-square error:

E[ ( log₁₀ S − log₁₀ Ŝ )² ]
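For reference, the closed-form gain that results from this criterion (Ephraim and Malah 1985) can be written in a few lines. Estimating the a-priori SNR ξ and a-posteriori SNR γ (e.g., by decision-directed smoothing) is omitted here.

```python
# A minimal sketch of the log-MMSE STSA gain, applied per T-F bin.
import numpy as np
from scipy.special import exp1  # exponential integral E1

def log_mmse_gain(xi, gamma):
    # G = (xi / (1 + xi)) * exp(0.5 * E1(v)), with v = xi * gamma / (1 + xi)
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

# S_hat = log_mmse_gain(xi, gamma) * M   # apply to the noisy spectrum
```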

Non-negative Dictionary Learning

Let us denote the basis matrices of speech and noise as W_s and W_n, respectively.

The basis matrix for the noisy signal is the concatenation

W = [ W_s  W_n ].

The noisy spectrogram can be represented as M ≈ WH, where the noisy NMF coefficients are defined as

H = [ H_s^T  H_n^T ]^T.

Non-negative Dictionary Learning

The mask can be obtained as follows:

m_s = W_s H_s / ( W_s H_s + W_n H_n )
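A minimal numpy sketch of this step, assuming the dictionaries W_s and W_n were learned beforehand (e.g., by running NMF on clean speech and on noise separately); only the activations H are inferred for the noisy spectrogram, then the soft mask is formed.

```python
# Infer activations H for a noisy magnitude spectrogram M given fixed
# dictionaries, then build the soft mask m_s = Ws Hs / (Ws Hs + Wn Hn).
import numpy as np

def nmf_mask(M, Ws, Wn, n_iter=100, eps=1e-10):
    W = np.concatenate([Ws, Wn], axis=1)       # W = [Ws Wn]
    H = np.random.rand(W.shape[1], M.shape[1])
    for _ in range(n_iter):
        # Multiplicative update for H minimizing ||M - WH||_F with W fixed.
        H *= (W.T @ M) / (W.T @ (W @ H) + eps)
    Hs, Hn = H[:Ws.shape[1]], H[Ws.shape[1]:]
    return (Ws @ Hs) / (Ws @ Hs + Wn @ Hn + eps)
```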

Time-Frequency (T-F) Masks

T-F masks operate on the magnitude spectra of the signal.

Let S_t(f), N_t(f) and M_t(f) be the magnitude spectra of the speech, noise and mixture signals, respectively.

Time-Frequency (T-F) Masks

[Figure: magnitude spectrograms of the speech S, the noise N, and the mixture M.]

T-F Masks

Ideal Binary Mask (IBM), 0 dB threshold:

IBM_t(f) = 1 if S_t(f) > N_t(f), 0 otherwise
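A one-line realization of this definition:

```python
# Ideal binary mask at a 0 dB local SNR threshold.
import numpy as np

def ideal_binary_mask(S_mag, N_mag):
    # 1 where speech dominates the noise in a T-F unit, 0 otherwise.
    return (S_mag > N_mag).astype(np.float32)

# Enhanced magnitude: IBM ⊙ M (element-wise product with the mixture).
# S_hat_mag = ideal_binary_mask(S_mag, N_mag) * M_mag
```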

Ideal Binary Masks (IBM)

[Figure: the clean spectrogram S, the mixture M, the IBM, and the masked mixture IBM ⊙ M.]

The problem becomes a binary classification task!

Given M_t(f), determine whether each T-F unit belongs to speech or noise.

Can be estimated with any machine learning classifier.

Problem: even the results obtained from the ground-truth IBM contain "musical noise".


T-F Masks

Amplitude Soft Mask (ASM), also called the Ideal Ratio Mask (IRM):

IRM_t(f) = S_t(f) / ( S_t(f) + N_t(f) )
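The corresponding sketch for the soft mask:

```python
# Ideal ratio mask: a soft mask with values in [0, 1].
import numpy as np

def ideal_ratio_mask(S_mag, N_mag, eps=1e-10):
    return S_mag / (S_mag + N_mag + eps)

# As with the IBM, the enhanced magnitude is IRM ⊙ M.
```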

Ideal Ratio Masks (IRM)

[Figure: the clean spectrogram S, the mixture M, the IRM, and the masked mixture IRM ⊙ M.]

Predicting Masks – System Overview

Predicting Masks – Features

Mel-Frequency Cepstrum (MFC)

Magnitude Spectra

Raw waveform

These can be supplemented by traditional features.

Autoencoder-based methods

Two types:

1. Trained with only clean speech: the network learns a speech representation.

2. Trained with noisy-clean speech pairs: the network learns a transfer function from noisy to clean speech (see the sketch after this list).
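A minimal PyTorch sketch of the second type: a denoising autoencoder mapping noisy magnitude spectra to clean ones. The layer sizes are illustrative assumptions, not taken from any specific paper.

```python
import torch
import torch.nn as nn

n_freq = 257  # e.g., frequency bins of a 512-point STFT

dae = nn.Sequential(
    nn.Linear(n_freq, 512), nn.ReLU(),   # encoder
    nn.Linear(512, 128), nn.ReLU(),      # bottleneck
    nn.Linear(128, 512), nn.ReLU(),      # decoder
    nn.Linear(512, n_freq),              # estimate of the clean spectrum
)
loss_fn = nn.MSELoss()                   # trained on noisy-clean pairs
```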

Recurrent Neural Network (RNN)

RNNs are useful for modeling temporal relations.

Huang et al. (Huang, Kim et al. 2015) proposed predicting masks with the following cost function:

min ‖m̂_s − m_s‖² + ‖m̂_n − m_n‖² − ‖m̂_s − m_n‖² − ‖m̂_n − m_s‖²

where m_s and m_n are the speech and noise masks, respectively, and m̂_s and m̂_n are their predictions.
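This cost translates directly into code; a minimal PyTorch sketch:

```python
# Discriminative mask cost: penalize error against the correct target while
# rewarding distance from the wrong one.
import torch

def discriminative_loss(ms_hat, mn_hat, ms, mn):
    return (torch.sum((ms_hat - ms) ** 2) + torch.sum((mn_hat - mn) ** 2)
            - torch.sum((ms_hat - mn) ** 2) - torch.sum((mn_hat - ms) ** 2))
```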

Redundant Convolutional Encoder-Decoder (R-CED)

Park et al. (Park and Lee 2016) proposed a convolutional network with 1-dimensional convolutions that operate along the frequency axis, as sketched below.

Convolutional networks have fewer parameters than RNNs, which makes them feasible for small devices such as hearing implants!
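A minimal sketch of the key idea: a 1-D convolution sliding over the frequency axis of each frame. The filter count and kernel size are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Each time frame's magnitude spectrum is a 1-D signal with one channel.
frames = torch.randn(32, 1, 257)   # (batch of frames, channels, frequency bins)
conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=9, padding=4)
out = conv(frames)                 # (32, 16, 257): filters slide over frequency
```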

Predicting Masks – Our Methods

Convolutional Encoder-Decoder (CED) network with skip connections

Bidirectional Long Short-Term Memory (BLSTM) network

Convolutional Encoder-Decoder (CED)

Input: spectrogram, with skip connections between encoder and decoder layers.

Encoder: Conv + BN + ReLU, 64 filters → Conv + BN + ReLU, 128 filters → Conv + BN + ReLU, 256 filters → Conv + BN + ReLU, 512 filters

Decoder: Deconv + BN + ReLU, 256 filters → Deconv + BN + ReLU, 128 filters → Deconv + BN + ReLU, 64 filters

Output: two Conv + BN + ReLU layers with 1 filter each, producing the speech mask and the noise mask.
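A PyTorch sketch of this architecture. The slide gives only the filter counts, so kernel sizes and strides are assumptions: the first convolution keeps resolution, the remaining encoder convolutions downsample by 2, the deconvolutions upsample by 2, and skip connections add encoder features back in.

```python
import torch
import torch.nn as nn

def conv(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

def deconv(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class CED(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1 = conv(1, 64, 1)                       # encoder
        self.e2 = conv(64, 128, 2)
        self.e3 = conv(128, 256, 2)
        self.e4 = conv(256, 512, 2)
        self.d1 = deconv(512, 256)                     # decoder
        self.d2 = deconv(256, 128)
        self.d3 = deconv(128, 64)
        # Two 1-filter heads produce the speech and noise masks.
        self.speech_head = conv(64, 1, 1)
        self.noise_head = conv(64, 1, 1)

    def forward(self, x):                  # x: (batch, 1, freq, time)
        h1 = self.e1(x)
        h2 = self.e2(h1)
        h3 = self.e3(h2)
        h4 = self.e4(h3)
        u = self.d1(h4) + h3               # skip connection
        u = self.d2(u) + h2                # skip connection
        u = self.d3(u) + h1                # skip connection
        return self.speech_head(u), self.noise_head(u)
```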

Bidirectional Long Short-Term Memory (BLSTM)
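A minimal sketch of a BLSTM mask predictor of this kind: per-frame spectra in, per-frame speech mask out. The hidden size and layer count are illustrative, not read off the slide.

```python
import torch
import torch.nn as nn

class BLSTMMask(nn.Module):
    def __init__(self, n_freq=257, hidden=384):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, x):          # x: (batch, time, freq)
        h, _ = self.blstm(x)       # forward and backward context per frame
        return self.out(h)         # predicted mask in [0, 1]
```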

Predicting Masks – Comparison with other methods

[Figure: (a) noisy spectrogram, (b) clean spectrogram, and spectrograms enhanced by (c) SS, (d) Log-MMSE, (e) RNN, (f) R-CED, (g) BLSTM, and (h) CED.]

Evaluation metrics

• Objective measures:

• Perceptual evaluation of speech quality (PESQ) – ranges from -0.5 to 4.5

• Short-time objective intelligibility (STOI) – ranges from 0 to 1 (see the sketch after this list)

• Segmental SNR (SSNR, in dB)

• Log-spectral distortion (LSD, in dB)

• Hearing aid speech quality index (HASQI)

• Hearing aid speech perception index (HASPI)

• Speech distortion index (SDI)

• Subjective measures:

• Listening tests – the most important metrics!
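A minimal sketch of computing the two objective metrics reported in the results, PESQ and STOI, with the third-party pesq and pystoi packages (pip install pesq pystoi), assuming 16 kHz signals; the file names are placeholders.

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, sr = sf.read("clean.wav")        # reference signal, assumed 16 kHz
enhanced, _ = sf.read("enhanced.wav")   # output of an enhancement system

print("PESQ:", pesq(sr, clean, enhanced, "wb"))  # wideband mode, -0.5..4.5
print("STOI:", stoi(clean, enhanced, sr))        # 0..1
```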

RESULTS - PESQ

RESULTS - STOI

More examples…

The End…

Thank you!

References

Loizou, Philipos C. Speech Enhancement: Theory and Practice. CRC Press, 2013.

Boll, Steven. "Suppression of acoustic noise in speech using spectral subtraction." IEEE Transactions on Acoustics, Speech, and Signal Processing 27.2 (1979): 113-120.

Ephraim, Yariv, and David Malah. "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing 33.2 (1985): 443-445.

Huang, Po-Sen, et al. "Joint optimization of masks and deep recurrent neural networks for monaural source separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 23.12 (2015): 2136-2147.

Park, Se Rim, and Jinwon Lee. "A fully convolutional neural network for speech enhancement." arXiv preprint arXiv:1609.07132 (2016).

Mohammadiha, Nasser. Speech Enhancement Using Nonnegative Matrix Factorization and Hidden Markov Models. Diss. KTH Royal Institute of Technology, 2013.

Wang, Yuxuan, Arun Narayanan, and DeLiang Wang. "On training targets for supervised speech separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 22.12 (2014): 1849-1858.

Lu, Xugang, et al. "Speech enhancement based on deep denoising autoencoder." Interspeech, 2013.