Noise Robustness
Mark Gales
Lent 2007
Module 2A: Lecture 2
MPhil in Computer Speech Text and Internet Technology
Module 2A: Noise Robustness
Environment Mismatches
• For most tasks there is some mismatch between training and testing
– Acoustic environment: the background noise/microphone differ from the training data
– Speaker: the test speaker is “different” from the speakers seen in training
• In the previous lecture speaker adaptation was considered
– this lecture considers noise robustness
• In contrast to speaker adaptation:
– the impact of noise on the speech signal is “easier” to characterise
– most speaker adaptation schemes (not VTLN) are general approaches
– noise robustness schemes make use of a mismatch function
• Speaker adaptation schemes such as MLLR [1] also used for noise robustness.
Cambridge University Engineering Department
Robust Speech Recognition
• Good performance may be obtained using HMM-based speech recognition systems in matched conditions.

– matched means that the type of microphone (acoustic channel) being used and the ambient (background) noise are the same in training and testing.

• When either or both of these conditions vary the performance of the recogniser may drop dramatically.
[Figure: recognition rate against signal-to-noise ratio, comparing matched training and clean training.]
Noise robustness schemes should:
• achieve performance as good as “matched”
• have low computational cost:

– for a fixed noise condition
– for changes in the noise condition
Cambridge UniversityEngineering Department
MPhil in Computer Speech Text and Internet Technology 2
Module 2A: Noise Robustness
Why Is It Important?
Few situations exist where there is complete control over the acoustic environment.
• Hands-free voice dialling allows use of car phones without use of the hands
– high levels of background noise - tyre rumble, wind flow over the car, passengers.
– reflective surfaces (windows), multi-path.
• Speech recognition over the telephone: home banking, “mail-order” catalogues, automated operators.
– Unknown microphone - variations due to the handset used.
– Unknown channel - distortions due to the telephone wires and exchanges. Noise may also be “added” at the exchange.
– Background noise at the “human” end.
• Interaction with PDA ...
General Environment Model
[Figure: general environment model - the word sequence is produced by the speaker (subject to task, stress and Lombard effects), passes through additive ambient noise, the channel difference (microphone), additive channel noise and additive receiver noise, giving the corrupted speech.]
The noise-corrupted speech, o(t), and the noise-free speech, x(t), are related by

o(t) = {([x(t)|Stress, Lombard] + n1(t)) ∗ hmike(t) + n2(t)} ∗ hchan(t) + n3(t)

where n1(t) is the background noise, hmike(t) is the impulse response of the microphone, n2(t) and hchan(t) are the additive noise and impulse response of the transmission channel, and n3(t) is the noise present at the receiver.
Stressed Speech

The “speech” generated by the speaker is conditional on:
• Lombard: speakers alter the way they speak according to the background noise [2] - the aim is to increase intelligibility

– energy: decrease of energy in the vowels between 0-500Hz. For females, increase in energy between 4-5kHz. For nasals, fricatives and plosives, decrease in all frequency bands.
– formants: frequency of the first formant increases for vowels, glides, liquids and nasals (more important for females). The second formant frequency only increases for females.
– duration: increase in duration for vowels, slight decrease for consonants (overall increase in the length of words).
– pitch: for males the pitch of the vowels increases.
• Stress: the user may be stressed due to work, or the recogniser not working, ...
“Simplified” Acoustic Environment

• A simplified model of the effects of noise is often used
[Figure: simplified acoustic environment - the speech is convolved with the channel difference (convolutional noise) and the additive noise is then added, giving the corrupted speech.]

• Ignore the effects of stress
• Group the noise sources
o(t) = x(t) ∗ h(t) + n(t)
• Squared magnitude of the Fourier transform of the signal

O(f)O∗(f) = |H(f)X(f)|² + |N(f)|² + 2|N(f)||H(f)X(f)| cos(θ)

where θ is the angle between the vectors N(f) and H(f)X(f).

• Average (over Mel bins) and assume speech and noise independent

o^l_t = h^l x^l_t + n^l_t
Dealing with Adverse Environments
• Single-microphone techniques [3] may be split into
– inherently robust speech parameterisation - no modifications to the system.
– clean speech estimation - alters the front-end processing scheme.
– acoustic model compensation - the models are altered so that they are representative of speech in the new acoustic environment.
• Multiple-microphones - microphone arrays may be used
– increase the SNR by reducing the beam-width of the effective microphone.
– additional/specialised hardware required.
• Only single-microphone techniques are considered in this lecture.
• If something is known about the possible test acoustic environment
– multi-style (multi-environment) training may be used
– the “clean” model is trained under a variety of conditions
– also helps general robustness
Effects of Noise on the Signal
[Figure: (a) clean waveform, (b) noisy waveform, (c) clean spectrum, (d) noisy spectrum.]

• Illustration of the impact of noise in the waveform/spectral domains
Approaches to Noise Robustness
• The approaches described in this lecture will be split into three groups
– inherently robust schemes are not discussed
• Front-End Processing Schemes: normally highly efficient
– Cepstral mean normalisation (CMN), spectral subtraction [4], SPLICE [5]
• Model Compensation Schemes: normally yield good performance
– parallel model combination [6] (and vector Taylor series (VTS) [7])
• Uncertainty Decoding: compromise between front-end/model compensation
– observation uncertainty [8] and joint uncertainty decoding [9]
Cepstral Mean Normalisation
• Already come across CMN - often used in LVCSR systems
– handles time-invariant convolutional noise
– doesn’t handle additive noise
• Rewriting the mismatch function in the cepstral domain

o_t = x_t + h

• In CMN the mean of the cepstral feature is subtracted, so

ô_t = o_t − (1/T) Σ^T_{τ=1} o_τ = x_t − (1/T) Σ^T_{τ=1} x_τ

ô_t is independent of the channel difference, h.
• BUT no feature vectors are generated until the end of the segment
– often implemented as a high-pass filter (similar to RASTA processing [10])
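The cancellation above is exact and easy to demonstrate; a minimal sketch with synthetic cepstral features (all values illustrative, not the lecture's own code):

```python
import numpy as np

def cmn(feats):
    # subtract the per-segment mean of each cepstral coefficient
    return feats - feats.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 13))    # clean cepstral features (T x D)
h = rng.normal(size=(1, 13))      # time-invariant channel offset
o = x + h                         # corrupted features: o_t = x_t + h

# after CMN the channel term cancels exactly
assert np.allclose(cmn(o), cmn(x))
```

Note the whole segment is needed before any normalised feature can be emitted, which is why streaming systems approximate CMN with a high-pass filter.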
Spectral Subtraction
• One of the simplest (earliest) forms of speech enhancement

– estimates the clean speech in the spectral domain, x̂^l_t
• The general implementation has the form (µ^l is the noise mean)

x̂^l_ti = d_i(t) o^l_ti,   d_i(t) = [(|o^l_ti|^γ − α|µ^l_i|^γ) / |o^l_ti|^γ]^β  if x̂^l_ti > k o^l_ti,  d_i(t) = k otherwise

• The form of the transform is specified by k, α, β and γ

– k determines the floor - prevents negative energy!
– α, β and γ yield the various forms of spectral subtraction:
  for power-domain subtraction, γ = 2, β = 0.5 and α = 1
– often α ≥ 1 - an overestimate of the noise is subtracted
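A minimal NumPy sketch of the general form above (magnitude-domain inputs, parameter defaults chosen to give power-domain subtraction; the floor is applied to the gain whenever the subtraction would go below it):

```python
import numpy as np

def spectral_subtract(o, noise_mu, alpha=1.0, beta=0.5, gamma=2.0, k=0.01):
    """Spectral subtraction: x_hat = d(t) * o, with the gain floored at k.

    The defaults give power-domain subtraction (gamma=2, beta=0.5, alpha=1)."""
    num = np.abs(o) ** gamma - alpha * np.abs(noise_mu) ** gamma
    den = np.maximum(np.abs(o) ** gamma, 1e-12)
    gain = np.clip(num / den, 0.0, None) ** beta
    gain = np.maximum(gain, k)          # floor - prevents negative energy
    return gain * o

o = np.array([10.0, 1.0])               # one high-SNR bin, one noise-dominated bin
mu = np.array([1.0, 1.0])               # noise mean estimate
x_hat = spectral_subtract(o, mu)
```

In the high-SNR bin only a little energy is removed; the noise-dominated bin is floored to k times the observation, which is exactly the mechanism behind musical noise on the next slide.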
Spectral Subtraction for Enhancement
[Figure: spectral subtraction for enhancement - the corrupted speech y(f) is analysed with a short-term FFT, the squared-magnitude noise estimate |n(f)|² is subtracted from |y(f)|², the result is floored, recombined with the phase, and an inverse FFT gives the estimated clean speech.]

• Need to handle the phase of the corrupted signal

– simplest to assume it is the same as that of the corrupted signal
• Problems: peaks and troughs in the short-term spectrum give

– Musical noise: narrow peaks, perceived as time-varying tones
– Broadband noise: wider peaks, perceived as time-varying broadband noise.

• α and k affect the levels of broadband noise and musical noise, and may be set accordingly.

– Increasing α decreases the broadband noise.
– Increasing k reduces the perceived musical tones.
SPLICE

• If “stereo” data is available, a mapping from corrupted to clean speech can be learnt

– stereo data is the same utterance in clean/corrupted conditions
– sometimes collected using close-talking/far-field microphones
– may be simulated by adding noise
• SPLICE is a simple transformation scheme
– First construct a GMM on the noise-corrupted acoustic space
– For each component, k, of the GMM calculate the offset between clean and corrupted speech, b^(k)
– The estimated clean speech is given by

x̂_t = o_t − Σ^K_{k=1} P(k|o_t) b^(k)
– SPLICE with uncertainty also examined [11]
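The SPLICE transformation can be sketched directly from the bullets above (NumPy, diagonal-covariance GMM, all numbers illustrative): compute the component posteriors on the corrupted observation, then subtract the posterior-weighted offsets.

```python
import numpy as np

def splice(o, weights, means, variances, offsets):
    """x_hat = o - sum_k P(k|o) b(k), with P(k|o) from a diagonal-covariance
    GMM trained on the corrupted acoustic space."""
    diff = o[None, :] - means                                   # (K, D)
    log_lik = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=1)
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                          # P(k|o)
    return o - post @ offsets

# Two-component toy GMM in 2-D
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
offsets = np.array([[1.0, 1.0], [3.0, 3.0]])                    # b(k)

o = np.array([0.1, -0.2])        # lies firmly in component 0
x_hat = splice(o, weights, means, variances, offsets)
```

Because the observation sits under component 0, the posterior concentrates there and the enhancement reduces to subtracting that component's offset.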
Effects of Noise on the Distribution
[Figure: corrupted-speech distributions over log (Ln) power for noise means of 0, 2, 4 and 6.]
• Comments (note: in the log-spectral domain)

– The mean increases, the variance decreases.
– The distributions may become multi-modal.
Model-Based Compensation
[Figure: model combination - the clean speech HMM and noise HMM are mapped from the cepstral domain to the log-spectral domain via C⁻¹, combined, and mapped back via C to give the corrupted speech HMM.]
• Mismatch function (Cepstral domain)
o_t = C log(exp(C⁻¹x_t) + exp(C⁻¹n_t))
• Assume the combined distribution is Gaussian

• The parameters of the model are then given by

µ = E{o},   Σ = E{o o′} − µ µ′

where the expectation is taken over the observations of the component.
• Compensation requires estimate of the noise distribution
– may be estimated using ML in an unsupervised fashion
• Extension to convolutional noise: o_t = C log(exp(C⁻¹(x_t + h)) + exp(C⁻¹n_t))
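The moments µ and Σ can be estimated by sampling through the mismatch function. A scalar sketch (taking C as the identity, i.e. working in a single log-spectral bin - a simplifying assumption, not the lecture's full cepstral setup):

```python
import numpy as np

def mc_compensate(mu_x, var_x, mu_n, var_n, n_samples=200_000, seed=0):
    """Fit a Gaussian to corrupted-speech samples obtained by pushing
    clean-speech and noise samples through o = log(exp(x) + exp(n))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), n_samples)
    n = rng.normal(mu_n, np.sqrt(var_n), n_samples)
    o = np.logaddexp(x, n)               # mismatch function, identity C
    return o.mean(), o.var()

# With negligible noise the compensated model tends to the clean model
mu_o, var_o = mc_compensate(mu_x=10.0, var_x=4.0, mu_n=-20.0, var_n=1.0)
```

With the noise mean raised towards the speech mean, the compensated mean rises above µ_x and the variance falls below Σ_x, matching the behaviour on the "Effects of Noise on the Distribution" slide.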
Model-Based Compensation
• There are no closed-form solutions to the integral. It is possible to use [6]

– Numerical Integration: standard numerical approximations
– “Monte-Carlo” style integration: draw samples from the distributions to compute the expectations
– Log-Add: ignore the variances in compensation
– Log-Normal approximation: add the distributions in the linear spectral domain
• Vector Taylor Series is one popular approximation [7]

– Taylor series expansion about the “current” parameter values
– the mismatch function is approximated using a first-order series

o_t ≈ f(µ_x, µ_n) + ∇_x f(x, n)|_{µ_x}(x − µ_x) + ∇_n f(x, n)|_{µ_n}(n − µ_n)

where f(x, n) is the mismatch function
– gives a simple approach to estimating the noise parameters
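For a single log-spectral bin with f(x, n) = log(exp(x) + exp(n)) (again taking C as the identity, a simplifying assumption), the first-order expansion gives closed-form compensated parameters, since the two gradients are a sigmoid and its complement:

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS: linearise f(x, n) = log(exp(x) + exp(n))
    about (mu_x, mu_n)."""
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))    # df/dx at the expansion point
    mu_o = np.logaddexp(mu_x, mu_n)          # f(mu_x, mu_n)
    var_o = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
    return mu_o, var_o

# With negligible noise the compensation leaves the clean model unchanged
mu_o, var_o = vts_compensate(mu_x=10.0, var_x=4.0, mu_n=-20.0, var_n=1.0)
```

The same linearisation, with the roles of the expansion points swapped, is what makes ML estimation of the noise parameters tractable.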
“Uncertain” Observations
• Front-end schemes assume that all estimates are “perfect”
– doesn’t reflect the fact that when speech is masked by noise it is hard to estimate
• “Uncertain” Observations : propagate a distribution of the observations
[Figure: (i) enhancement - the noise model and clean acoustic model drive feature enhancement of the corrupted speech before decoding; (j) uncertain observations - the noise model instead yields an observation uncertainty that is passed into the decoder.]
• Normally p(xt|ot) is Gaussian distributed
– the variance is a measure of the “confidence” in the observations
– simple to incorporate into the recogniser
• Intuitive rather than mathematically motivated
Uncertainty Decoding
[Figure: dynamic Bayesian network relating the clean speech x_t (state q_t^(1)), the noise n_t (state q_t^(2)) and the corrupted observations o_t, with example conditional distributions p(o|x).]

p(o_t) = ∫∫ p(o_t|x_t, n_t) p(x_t) p(n_t) dx_t dn_t
• Need to model the conditional distribution p(o_t|x_t) = ∫ p(o_t|x_t, n_t) p(n_t) dn_t

– p(x_t) is given by the clean speech models
– the form of the distribution is important for efficient decoding
• Simple form is Joint Uncertainty Decoding
Joint Uncertainty Decoding

• Assume that stereo data is available - initially model the joint distribution

p(x_t, o_t) = N( [x_t; o_t] ; [µ_x; µ_o] , [Σ_xx Σ_xo; Σ_ox Σ_oo] )

– the conditional distribution p(o_t|x_t) is also Gaussian
– the joint distribution may be estimated using the mismatch function/noise estimates
• The likelihood calculation for recogniser component m has the form

p(o_t|m) = |A| N(A o_t + b; µ^(m), Σ^(m) + Σ_b)
– the compensation parameters A, b and Σ_b are simple to define from the joint distribution
• Use regression class trees (as in MLLR) to form multiple joint distributions
– improves conditional independence modelling, p(ot|xt)
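The transform parameters follow from standard Gaussian conditioning on the joint distribution; a sketch with hypothetical 2-D numbers, which builds A, b and Σ_b and checks the compensated likelihood against direct marginalisation of p(o_t|x_t) over the component:

```python
import numpy as np

def gauss_pdf(y, mu, cov):
    d = y - mu
    k = len(y)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / np.sqrt(
        (2 * np.pi) ** k * np.linalg.det(cov))

# Hypothetical 2-D joint distribution over clean (x) and corrupted (o) speech
mu_x, mu_o = np.array([1.0, 2.0]), np.array([1.5, 3.0])
S_xx = np.array([[1.0, 0.2], [0.2, 1.0]])
S_xo = np.array([[0.6, 0.1], [0.1, 0.5]])   # cross-covariance block (S_ox = S_xo^T)
S_oo = np.array([[1.5, 0.3], [0.3, 1.4]])

# JUD transform parameters from the joint distribution
A = S_xx @ np.linalg.inv(S_xo.T)            # A = Sigma_xx Sigma_ox^{-1}
b = mu_x - A @ mu_o
S_cond = S_oo - S_xo.T @ np.linalg.solve(S_xx, S_xo)   # cov of p(o|x)
S_b = A @ S_cond @ A.T

# Clean-speech component m and a test observation
mu_m = np.array([0.5, 1.0])
S_m = np.eye(2) * 0.8
o = np.array([2.0, 2.5])

lik_jud = abs(np.linalg.det(A)) * gauss_pdf(A @ o + b, mu_m, S_m + S_b)

# Check against direct marginalisation: int p(o|x) N(x; mu_m, S_m) dx
J = S_xo.T @ np.linalg.inv(S_xx)            # Sigma_ox Sigma_xx^{-1}
mu_d = mu_o + J @ (mu_m - mu_x)
S_d = J @ S_m @ J.T + S_cond
assert np.isclose(lik_jud, gauss_pdf(o, mu_d, S_d))
```

With regression class trees, one such (A, b, Σ_b) triple is estimated per base-class rather than per component, which is what keeps the scheme cheap.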
Joint Uncertainty Decoding and Linear Transforms
• Schemes such as CMLLR are similar to JUD

p(o_t|m) = |A| N(A o_t + b; µ^(m), Σ^(m) + Σ_b)
– both apply linear transforms to the model parameters
– both can make use of regression class trees (and base-classes)
• However some important differences
– JUD uses a bias on the covariance matrix - CMLLR doesn’t
– JUD can use as many transforms and base-classes as desired:
  the transform parameters are a function of the noise model estimates;
  CMLLR estimates all its linear transforms independently
• When a full JUD transform is used, Σ_b is full

– the decoding cost is the same as using a full covariance matrix
Example Performance

• Experimental setup:

– Wall Street Journal (WSJ) task
– speaker-independent, 5,000-word closed-vocabulary task
– cross-word triphone model set, tri-gram language model
• Additive Noise: car noise artificially added at various SNRs.
Model Set      Clean   22dB   16dB   10dB
Clean (-CMN)     6.7   20.4   35.0   59.3
Clean (+CMN)     5.8   14.7   30.0   54.3
PMC (-CMN)        —     7.2    7.5   10.1
• Unknown Microphone: test data recorded with an unknown microphone
Model Set          Baseline   MLLR Means   MLLR Means+Vars
Clean (+CMN)           17.4         12.1              —
PMC (-CMN)             10.3          8.6             8.0
Multi-Env (+CMN)        9.0          7.4             7.1
References

[1] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Computer Speech and Language, vol. 9, pp. 171–186, 1995.

[2] J. Junqua and Y. Anglade, “Acoustic and perceptual studies of Lombard speech: Application to isolated-word automatic speech recognition,” in Proc. ICASSP, 1990, pp. 841–844.

[3] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Communication, vol. 16, pp. 261–291, 1995.

[4] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions ASSP, vol. 27, pp. 113–120, 1979.

[5] L. Deng, A. Acero, J. Droppo, and X. D. Huang, “High-performance robust speech recognition using stereo training data,” in Proc. ICASSP, 2001.

[6] M. J. F. Gales and S. J. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, 1996.

[7] P. Moreno, “Speech recognition in noisy environments,” Ph.D. dissertation, Carnegie Mellon University, 1996.

[8] J. A. Arrowood and M. A. Clements, “Using observation uncertainty in HMM decoding,” in Proc. ICSLP, Denver, Colorado, Sept. 2002.

[9] H. Liao and M. J. F. Gales, “Uncertainty decoding for noise robust speech recognition,” University of Cambridge, Tech. Rep. CUED/F-INFENG/TR499, 2004. Available from: mi.eng.cam.ac.uk/∼hl251.

[10] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, Oct. 1994.

[11] J. Droppo, A. Acero, and L. Deng, “Uncertainty decoding with SPLICE for noise robust speech recognition,” in Proc. ICASSP, Orlando, Florida, May 2002.