Noise Robustness
Mark Gales
Lent 2007
Module 2A: Lecture 2
MPhil in Computer Speech Text and Internet Technology
Module 2A: Noise Robustness
Environment Mismatches
• For most tasks there is some mismatch between training and testing
– Acoustic environment: the background noise/microphone differ from the training data
– Speaker: the test speaker is “different” from the speakers seen in training
• In the previous lecture speaker adaptation was considered
– this lecture considers noise robustness
• In contrast to speaker adaptation:
– the impact of noise on the speech signal is “easier” to characterise
– most speaker adaptation schemes (not VTLN) are general approaches
– noise robustness schemes make use of a mismatch function
• Speaker adaptation schemes such as MLLR [1] also used for noise robustness.
Cambridge University Engineering Department
Robust Speech Recognition
• Good performance may be obtained using HMM-based speech recognition systems in matched conditions.

– matched means that the type of microphone (acoustic channel) being used and the ambient (background) noise are the same in training and testing.

• When either or both of these conditions vary the performance of the recogniser may drop dramatically.
[Figure: recognition rate against signal-to-noise ratio, comparing matched training and clean training.]
Noise robustness schemes should:
• achieve performance as good as “matched”
• have low computational cost:

– for a fixed noise condition
– for changes in the noise condition
Cambridge UniversityEngineering Department
MPhil in Computer Speech Text and Internet Technology 2
Module 2A: Noise Robustness
Why Is It Important?
Few situations exist where there is complete control over the acoustic environment.
• Hands-free voice dialling allows use of car phones without use of the hands
– high levels of background noise - tyre rumble, wind flow over the car, passengers.
– reflective surfaces (windows), multi-path.
• Speech recognition over the telephone: home banking, “mail-order” catalogues, automated operators.
– Unknown microphone - variations due to the handset used.
– Unknown channel - distortions due to the telephone wires and exchanges. Noise may also be “added” at the exchange.
– Background noise at the “human” end.
• Interaction with PDA ...
General Environment Model
[Figure: general environment model - the word sequence is produced by the speaker (subject to task, stress and Lombard effects), passes through additive ambient noise, the channel difference (microphone), additive channel noise and additive receiver noise, giving the corrupted speech.]
The noise-corrupted speech, o(t), and the noise-free speech, x(t), are related by

o(t) = {([x(t)|Stress, Lombard] + n1(t)) ∗ hmike(t) + n2(t)} ∗ hchan(t) + n3(t)

where n1(t) is the background noise, hmike(t) is the impulse response of the microphone, n2(t) and hchan(t) are the additive noise and impulse response of the transmission channel, and n3(t) is the noise present at the receiver.
Stressed Speech

The “speech” generated by the speaker is conditional on:
• Lombard: speakers alter the way they speak according to the background noise [2] - the aim is to increase intelligibility

– energy: decrease of energy in the vowels between 0-500Hz. For females, increase in energy between 4-5kHz. For nasals, fricatives and plosives, decrease in all frequency bands.
– formants: frequency of the first formant increases for vowels, glides, liquids and nasals (more important for females). The second formant frequency only increases for females.
– duration: increase in duration for vowels, slight decrease for consonants (overall increase in the length of words).
– pitch: for males the pitch of the vowels increases.
• Stress: the user may be stressed due to work, or the recogniser not working, ...
“Simplified” Acoustic Environment

• A simplified model of the effects of noise is often used
[Figure: simplified acoustic environment - the speech is convolved with the channel difference (convolutional noise) and the additive noise is then added, giving the corrupted speech.]

• Ignore the effects of stress
• Group the noise sources
o(t) = x(t) ∗ h(t) + n(t)
• Squared magnitude of the Fourier transform of the signal

O(f)O∗(f) = |H(f)X(f)|² + |N(f)|² + 2|N(f)||H(f)X(f)| cos(θ)

where θ is the angle between the vectors N(f) and H(f)X(f).

• Average (over Mel bins) and assume speech and noise independent

o^l_t = h^l x^l_t + n^l_t
Dealing with Adverse Environments
• Single-microphone techniques [3] may be split into
– inherently robust speech parameterisation - no modifications to the system.
– clean speech estimation - alters the front-end processing scheme.
– acoustic model compensation - the models are altered so that they are representative of speech in the new acoustic environment.
• Multiple-microphones - microphone arrays may be used
– increase the SNR by reducing the beam-width of the effective microphone.
– additional/specialised hardware required.
• Only single-microphone techniques are considered in this lecture.
• If something is known about the possible test acoustic environment
– multi-style (multi-environment) training may be used
– the “clean” model is trained under a variety of conditions
– also helps general robustness
Effects of Noise on the Signal
[Figure: (a) clean waveform, (b) noisy waveform, (c) clean spectrum, (d) noisy spectrum.]

• Illustration of the impact of noise in the waveform/spectral domains
Approaches to Noise Robustness
• The approaches described in this lecture will be split into three groups
– inherently robust schemes are not discussed
• Front-End Processing Schemes: normally highly efficient
– Cepstral mean normalisation (CMN), spectral subtraction [4], SPLICE [5]
• Model Compensation Schemes: normally yield good performance
– parallel model combination [6] (and vector Taylor series (VTS) [7])
• Uncertainty Decoding: compromise between front-end/model compensation
– observation uncertainty [8] and joint uncertainty decoding [9]
Cepstral Mean Normalisation
• Already come across CMN - often used in LVCSR systems
– handles time-invariant convolutional noise
– doesn’t handle additive noise
• Rewriting the mismatch function in the cepstral domain

o_t = x_t + h

• In CMN the mean of the cepstral feature is subtracted, so

ô_t = o_t − (1/T) Σ^T_{τ=1} o_τ = x_t − (1/T) Σ^T_{τ=1} x_τ

ô_t is independent of the channel difference, h.
• BUT no feature vectors are generated until the end of the segment
– often implemented as a high-pass filter (similar to RASTA processing [10])
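The cancellation above is exact and easy to demonstrate; a minimal sketch with synthetic cepstral features (all values illustrative, not the lecture's own code):

```python
import numpy as np

def cmn(feats):
    # subtract the per-segment mean of each cepstral coefficient
    return feats - feats.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 13))    # clean cepstral features (T x D)
h = rng.normal(size=(1, 13))      # time-invariant channel offset
o = x + h                         # corrupted features: o_t = x_t + h

# after CMN the channel term cancels exactly
assert np.allclose(cmn(o), cmn(x))
```

Note the whole segment is needed before any normalised feature can be emitted, which is why streaming systems approximate CMN with a high-pass filter.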
Spectral Subtraction
• One of the simplest (earliest) forms of speech enhancement

– estimates the clean speech in the spectral domain, x̂^l_t
• The general implementation has the form (µ^l is the noise mean)

x̂^l_ti = d_i(t) o^l_ti,   d_i(t) = [(|o^l_ti|^γ − α|µ^l_i|^γ) / |o^l_ti|^γ]^β  if x̂^l_ti > k o^l_ti,  d_i(t) = k otherwise

• The form of the transform is specified by k, α, β and γ

– k determines the floor - prevents negative energy!
– α, β and γ yield the various forms of spectral subtraction:
  for power-domain subtraction, γ = 2, β = 0.5 and α = 1
– often α ≥ 1 - an overestimate of the noise is subtracted
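A minimal NumPy sketch of the general form above (magnitude-domain inputs, parameter defaults chosen to give power-domain subtraction; the floor is applied to the gain whenever the subtraction would go below it):

```python
import numpy as np

def spectral_subtract(o, noise_mu, alpha=1.0, beta=0.5, gamma=2.0, k=0.01):
    """Spectral subtraction: x_hat = d(t) * o, with the gain floored at k.

    The defaults give power-domain subtraction (gamma=2, beta=0.5, alpha=1)."""
    num = np.abs(o) ** gamma - alpha * np.abs(noise_mu) ** gamma
    den = np.maximum(np.abs(o) ** gamma, 1e-12)
    gain = np.clip(num / den, 0.0, None) ** beta
    gain = np.maximum(gain, k)          # floor - prevents negative energy
    return gain * o

o = np.array([10.0, 1.0])               # one high-SNR bin, one noise-dominated bin
mu = np.array([1.0, 1.0])               # noise mean estimate
x_hat = spectral_subtract(o, mu)
```

In the high-SNR bin only a little energy is removed; the noise-dominated bin is floored to k times the observation, which is exactly the mechanism behind musical noise on the next slide.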
Spectral Subtraction for Enhancement
[Figure: spectral subtraction for enhancement - the corrupted speech y(f) is analysed with a short-term FFT, the squared-magnitude noise estimate |n(f)|² is subtracted from |y(f)|², the result is floored, recombined with the phase, and an inverse FFT gives the estimated clean speech.]

• Need to handle the phase of the corrupted signal

– simplest to assume it is the same as that of the corrupted signal
• Problems: peaks and troughs in the short-term spectrum give

– Musical noise: narrow peaks, perceived as time-varying tones
– Broadband noise: wider peaks, perceived as time-varying broadband noise.

• α and k affect the levels of broadband noise and musical noise, and may be set accordingly.

– Increasing α decreases the broadband noise.
– Increasing k reduces the perceived musical tones.
SPLICE

• If “stereo” data is available, a mapping from corrupted to clean speech can be learnt

– stereo data is the same utterance in clean/corrupted conditions
– sometimes collected using close-talking/far-field microphones
– may be simulated by adding noise
• SPLICE is a simple transformation scheme
– First construct a GMM on the noise-corrupted acoustic space
– For each component, k, of the GMM calculate the offset between clean and corrupted speech, b^(k)
– The estimated clean speech is given by

x̂_t = o_t − Σ^K_{k=1} P(k|o_t) b^(k)
– SPLICE with uncertainty also examined [11]
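The SPLICE transformation can be sketched directly from the bullets above (NumPy, diagonal-covariance GMM, all numbers illustrative): compute the component posteriors on the corrupted observation, then subtract the posterior-weighted offsets.

```python
import numpy as np

def splice(o, weights, means, variances, offsets):
    """x_hat = o - sum_k P(k|o) b(k), with P(k|o) from a diagonal-covariance
    GMM trained on the corrupted acoustic space."""
    diff = o[None, :] - means                                   # (K, D)
    log_lik = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=1)
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                          # P(k|o)
    return o - post @ offsets

# Two-component toy GMM in 2-D
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
offsets = np.array([[1.0, 1.0], [3.0, 3.0]])                    # b(k)

o = np.array([0.1, -0.2])        # lies firmly in component 0
x_hat = splice(o, weights, means, variances, offsets)
```

Because the observation sits under component 0, the posterior concentrates there and the enhancement reduces to subtracting that component's offset.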
Effects of Noise on the Distribution
[Figure: corrupted-speech distributions over log (Ln) power for noise means of 0, 2, 4 and 6.]
• Comments (note: in the log-spectral domain)

– The mean increases, the variance decreases.
– The distributions may become multi-modal.
Model-Based Compensation
[Figure: model combination - the clean speech HMM and noise HMM are mapped from the cepstral domain to the log-spectral domain via C⁻¹, combined, and mapped back via C to give the corrupted speech HMM.]
• Mismatch function (Cepstral domain)
o_t = C log(exp(C⁻¹x_t) + exp(C⁻¹n_t))
• Assume the combined distribution is Gaussian

• The parameters of the model are then given by

µ = E{o},   Σ = E{o o′} − µ µ′

where the expectation is taken over the observations of the component.
• Compensation requires estimate of the noise distribution
– may be estimated using ML in an unsupervised fashion
• Extension to convolutional noise: o_t = C log(exp(C⁻¹(x_t + h)) + exp(C⁻¹n_t))
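The moments µ and Σ can be estimated by sampling through the mismatch function. A scalar sketch (taking C as the identity, i.e. working in a single log-spectral bin - a simplifying assumption, not the lecture's full cepstral setup):

```python
import numpy as np

def mc_compensate(mu_x, var_x, mu_n, var_n, n_samples=200_000, seed=0):
    """Fit a Gaussian to corrupted-speech samples obtained by pushing
    clean-speech and noise samples through o = log(exp(x) + exp(n))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), n_samples)
    n = rng.normal(mu_n, np.sqrt(var_n), n_samples)
    o = np.logaddexp(x, n)               # mismatch function, identity C
    return o.mean(), o.var()

# With negligible noise the compensated model tends to the clean model
mu_o, var_o = mc_compensate(mu_x=10.0, var_x=4.0, mu_n=-20.0, var_n=1.0)
```

With the noise mean raised towards the speech mean, the compensated mean rises above µ_x and the variance falls below Σ_x, matching the behaviour on the "Effects of Noise on the Distribution" slide.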
Model-Based Compensation
• There are no closed-form solutions to the integral. It is possible to use [6]

– Numerical Integration: standard numerical approximations
– “Monte-Carlo” style integration: draw samples from the distributions to compute the expectations
– Log-Add: ignore the variances in compensation
– Log-Normal approximation: add the distributions in the linear spectral domain
• Vector Taylor Series is one popular approximation [7]

– Taylor series expansion about the “current” parameter values
– the mismatch function is approximated using a first-order series

o_t ≈ f(µ_x, µ_n) + ∇_x f(x, n)|_{µ_x}(x − µ_x) + ∇_n f(x, n)|_{µ_n}(n − µ_n)

where f(x, n) is the mismatch function
– gives a simple approach to estimating the noise parameters
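For a single log-spectral bin with f(x, n) = log(exp(x) + exp(n)) (again taking C as the identity, a simplifying assumption), the first-order expansion gives closed-form compensated parameters, since the two gradients are a sigmoid and its complement:

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS: linearise f(x, n) = log(exp(x) + exp(n))
    about (mu_x, mu_n)."""
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))    # df/dx at the expansion point
    mu_o = np.logaddexp(mu_x, mu_n)          # f(mu_x, mu_n)
    var_o = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
    return mu_o, var_o

# With negligible noise the compensation leaves the clean model unchanged
mu_o, var_o = vts_compensate(mu_x=10.0, var_x=4.0, mu_n=-20.0, var_n=1.0)
```

The same linearisation, with the roles of the expansion points swapped, is what makes ML estimation of the noise parameters tractable.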
“Uncertain” Observations
• Front-end schemes assume that all estimates are “perfect”
– doesn’t reflect the fact that when speech is masked by noise it is hard to estimate
• “Uncertain” Observations : propagate a distribution of the observations
[Figure: (i) enhancement - the noise model and clean acoustic model drive feature enhancement of the corrupted speech before decoding; (j) uncertain observations - the noise model instead yields an observation uncertainty that is passed into the decoder.]
• Normally p(xt|ot) is Gaussian distributed
– the variance is a measure of the “confidence” in the observations
– simple to incorporate into the recogniser
• Intuitive rather than mathematically motivated
Uncertainty Decoding
[Figure: dynamic Bayesian network relating the clean speech x_t (state q_t^(1)), the noise n_t (state q_t^(2)) and the corrupted observations o_t, with example conditional distributions p(o|x).]

p(o_t) = ∫∫ p(o_t|x_t, n_t) p(x_t) p(n_t) dx_t dn_t
• Need to model the conditional distribution p(o_t|x_t) = ∫ p(o_t|x_t, n_t) p(n_t) dn_t

– p(x_t) is given by the clean speech models
– the form of the distribution is important for efficient decoding
• Simple form is Joint Uncertainty Decoding
Joint Uncertainty Decoding

• Assume that stereo data is available - initially model the joint distribution

p(x_t, o_t) = N( [x_t; o_t] ; [µ_x; µ_o] , [Σ_xx Σ_xo; Σ_ox Σ_oo] )

– the conditional distribution p(o_t|x_t) is also Gaussian
– the joint distribution may be estimated using the mismatch function/noise estimates
• The likelihood calculation for recogniser component m has the form

p(o_t|m) = |A| N(A o_t + b; µ^(m), Σ^(m) + Σ_b)
– the compensation parameters A, b and Σ_b are simple to define from the joint distribution
• Use regression class trees (as in MLLR) to form multiple joint distributions
– improves conditional independence modelling, p(ot|xt)
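The transform parameters follow from standard Gaussian conditioning on the joint distribution; a sketch with hypothetical 2-D numbers, which builds A, b and Σ_b and checks the compensated likelihood against direct marginalisation of p(o_t|x_t) over the component:

```python
import numpy as np

def gauss_pdf(y, mu, cov):
    d = y - mu
    k = len(y)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / np.sqrt(
        (2 * np.pi) ** k * np.linalg.det(cov))

# Hypothetical 2-D joint distribution over clean (x) and corrupted (o) speech
mu_x, mu_o = np.array([1.0, 2.0]), np.array([1.5, 3.0])
S_xx = np.array([[1.0, 0.2], [0.2, 1.0]])
S_xo = np.array([[0.6, 0.1], [0.1, 0.5]])   # cross-covariance block (S_ox = S_xo^T)
S_oo = np.array([[1.5, 0.3], [0.3, 1.4]])

# JUD transform parameters from the joint distribution
A = S_xx @ np.linalg.inv(S_xo.T)            # A = Sigma_xx Sigma_ox^{-1}
b = mu_x - A @ mu_o
S_cond = S_oo - S_xo.T @ np.linalg.solve(S_xx, S_xo)   # cov of p(o|x)
S_b = A @ S_cond @ A.T

# Clean-speech component m and a test observation
mu_m = np.array([0.5, 1.0])
S_m = np.eye(2) * 0.8
o = np.array([2.0, 2.5])

lik_jud = abs(np.linalg.det(A)) * gauss_pdf(A @ o + b, mu_m, S_m + S_b)

# Check against direct marginalisation: int p(o|x) N(x; mu_m, S_m) dx
J = S_xo.T @ np.linalg.inv(S_xx)            # Sigma_ox Sigma_xx^{-1}
mu_d = mu_o + J @ (mu_m - mu_x)
S_d = J @ S_m @ J.T + S_cond
assert np.isclose(lik_jud, gauss_pdf(o, mu_d, S_d))
```

With regression class trees, one such (A, b, Σ_b) triple is estimated per base-class rather than per component, which is what keeps the scheme cheap.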
Joint Uncertainty Decoding and Linear Transforms
• Schemes such as CMLLR are similar to JUD

p(o_t|m) = |A| N(A o_t + b; µ^(m), Σ^(m) + Σ_b)
– both apply linear transforms to the model parameters
– both can make use of regression class trees (and base-classes)
• However some important differences
– JUD uses a bias on the covariance matrix - CMLLR doesn’t
– JUD can use as many transforms and base-classes as desired:
  the transform parameters are a function of the noise model estimates;
  CMLLR estimates all its linear transforms independently
• When a full JUD transform is used, Σ_b is full

– the decoding cost is the same as using a full covariance matrix
Example Performance

• Experimental setup:

– Wall Street Journal (WSJ) task
– speaker-independent, 5,000-word closed-vocabulary task
– cross-word triphone model set, tri-gram language model
• Additive Noise: car noise artificially added at various SNRs.
Model Set      Clean   22dB   16dB   10dB
Clean (-CMN)     6.7   20.4   35.0   59.3
Clean (+CMN)     5.8   14.7   30.0   54.3
PMC (-CMN)        —     7.2    7.5   10.1
• Unknown Microphone: test data recorded with an unknown microphone
Model Set          Baseline   MLLR Means   MLLR Means+Vars
Clean (+CMN)           17.4         12.1              —
PMC (-CMN)             10.3          8.6             8.0
Multi-Env (+CMN)        9.0          7.4             7.1
References

[1] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Computer Speech and Language, vol. 9, pp. 171–186, 1995.

[2] J. Junqua and Y. Anglade, “Acoustic and perceptual studies of Lombard speech: Application to isolated-word automatic speech recognition,” in Proc. ICASSP, 1990, pp. 841–844.

[3] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Communication, vol. 16, pp. 261–291, 1995.

[4] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions ASSP, vol. 27, pp. 113–120, 1979.

[5] L. Deng, A. Acero, J. Droppo, and X. D. Huang, “High-performance robust speech recognition using stereo training data,” in Proc. ICASSP, 2001.

[6] M. J. F. Gales and S. J. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, 1996.

[7] P. Moreno, “Speech recognition in noisy environments,” Ph.D. dissertation, Carnegie Mellon University, 1996.

[8] J. A. Arrowood and M. A. Clements, “Using observation uncertainty in HMM decoding,” in Proc. ICSLP, Denver, Colorado, Sept. 2002.

[9] H. Liao and M. J. F. Gales, “Uncertainty decoding for noise robust speech recognition,” University of Cambridge, Tech. Rep. CUED/F-INFENG/TR499, 2004. Available from: mi.eng.cam.ac.uk/∼hl251.

[10] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, Oct. 1994.

[11] J. Droppo, A. Acero, and L. Deng, “Uncertainty decoding with SPLICE for noise robust speech recognition,” in Proc. ICASSP, Orlando, Florida, May 2002.