Speech Enhancement
EE 516 Spring 2009
Alex Acero
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
Additive noise
• Stationary noise: properties don’t change over time
  – White noise x[n]
    • Flat power spectrum: $S_{xx}(f) = q$
    • Samples are uncorrelated: $R_{xx}[n] = q\,\delta[n]$
  – White Gaussian noise
    • Pdf is Gaussian (see chapter 10)
  – Typical noise is colored
    • Pink noise: low-pass in nature
• Non-stationary noise: properties change over time
  – Babble noise
  – Cocktail party effect
Reverberation
• Impulse response of an average office
[Figure: room impulse response vs. time (samples), 0 to 2000 samples]
$h[n] = \sum_{k=0}^{\infty} \frac{\rho_k}{r_k}\,\delta[n - T_k] = \frac{1}{c} \sum_{k=0}^{\infty} \frac{\rho_k}{T_k}\,\delta[n - T_k]$
Model of the Environment
[Diagram: x[m] passes through h[m]; noise n[m] is added at a summing node to give y[m]]
$y[m] = x[m] * h[m] + n[m]$
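The environment model y[m] = x[m] * h[m] + n[m] can be simulated directly. The following is a minimal sketch, assuming white Gaussian noise for n[m]; the function names `convolve` and `corrupt` and the example filter are illustrative, not taken from the slides.

```python
import random

def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def corrupt(x, h, noise_std, seed=0):
    """Return y[m] = x[m] * h[m] + n[m], with n[m] white Gaussian (assumption)."""
    rng = random.Random(seed)
    filtered = convolve(x, h)
    return [v + rng.gauss(0.0, noise_std) for v in filtered]

# A unit impulse through a short filter, with the noise switched off,
# returns the filter's impulse response padded with zeros.
x = [1.0, 0.0, 0.0, 0.0]
h = [1.0, 0.5, 0.25]
y = corrupt(x, h, noise_std=0.0)
```

Setting `noise_std` above zero adds the additive-noise term on top of the channel distortion.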
Cepstral Mean Normalization
Compute mean of cepstrum: $\bar{\mathbf{x}} = \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{x}_t$
And subtract it from input: $\hat{\mathbf{x}}_t = \mathbf{x}_t - \bar{\mathbf{x}}$
CMN robust to channel distortion
Normalizes average vocal tract or short filters
Average must include > 2 sec of speech
[Figure: word error rate (%) vs. SNR (dB) at 10, 15, 20, and 30 dB; series: No CMN, CMN-2]
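The mean subtraction can be sketched in a few lines. This is a minimal illustration, assuming each frame is a list of cepstral coefficients and the whole utterance is available; the function name is hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all T frames from every frame."""
    T = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / T for i in range(dim)]
    return [[frame[i] - mean[i] for i in range(dim)] for frame in frames]

frames = [[1.0, 2.0], [3.0, 6.0]]
normalized = cepstral_mean_normalize(frames)

# A fixed channel adds a constant vector in the cepstral domain;
# CMN removes it, so the normalized frames are unchanged.
shifted = [[c + 5.0 for c in frame] for frame in frames]
normalized_shifted = cepstral_mean_normalize(shifted)
```

The second call shows why CMN is robust to channel distortion: a constant cepstral offset cancels in the subtraction.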
RASTA
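• In CMN the mean is computed by a low-pass filter with rectangular window; subtracting it removes the slowly varying component: $\hat{\mathbf{x}}_t = \mathbf{x}_t - \frac{1}{T} \sum_{t'=0}^{T-1} \mathbf{x}_{t'}$
• Can use other low-pass filters too
• RASTA filter is band-pass: $H(z) = 0.1\,z^{4} \cdot \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\,z^{-1}}$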
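A rough sketch of the RASTA filter as a difference equation, applied to one cepstral coefficient over time. It assumes the commonly quoted transfer function H(z) = 0.1 z^4 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1) and drops the non-causal z^4 advance, so the output is delayed by four frames; the function name is illustrative.

```python
def rasta_filter(x):
    """Causal RASTA filter over a time trajectory of one coefficient.

    y[t] = 0.98*y[t-1] + 0.1*(2*x[t] + x[t-1] - x[t-3] - 2*x[t-4])
    (the z^4 advance is omitted, giving a 4-frame delay).
    """
    b = [0.2, 0.1, 0.0, -0.1, -0.2]  # numerator taps, already scaled by 0.1
    y = []
    prev = 0.0
    for t in range(len(x)):
        acc = 0.98 * prev
        for k, bk in enumerate(b):
            if t - k >= 0:
                acc += bk * x[t - k]
        y.append(acc)
        prev = acc
    return y

# A constant input (a fixed channel offset) is rejected: the numerator
# taps sum to zero, so the output decays towards zero.
out = rasta_filter([1.0] * 2000)
```

Because the DC gain is zero, a constant channel offset is removed, much as in CMN, while the pole at 0.98 sets how slowly the filter forgets.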
Retrain with noisy data
• Mismatches between training and testing are bad for pattern recognition systems
• Retrain with noisy data
• Approximation: add noise to clean data and retrain
[Figure: word error rate (%) vs. SNR (dB), 0 to 30 dB; curves: Mismatched, Matched (Noisy)]
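The approximation of adding noise to clean data can be sketched as follows. This assumes the SNR is defined globally over the whole utterance and that the noise recording is at least as long as the clean signal; the function names are hypothetical.

```python
import math
import random

def power(signal):
    """Average power of a sequence."""
    return sum(s * s for s in signal) / len(signal)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean/noise power matches snr_db, then add it."""
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * v for c, v in zip(clean, noise)]

rng = random.Random(0)
clean = [math.sin(0.05 * n) for n in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)

# Check the SNR that was actually obtained.
residual = [y - c for y, c in zip(noisy, clean)]
measured_snr_db = 10.0 * math.log10(power(clean) / power(residual))
```

Repeating this over several noise types and SNR levels gives the kind of training set used for multi-condition training.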
Multi-condition training
• Very hard to predict exactly the type of noise we’ll encounter at test time
• Too expensive to retrain the system for each noise condition
• Train system offline with several noise types and levels
[Figure: word error rate (%) vs. SNR (dB), 5 to 30 dB; curves: Matched Noise, Multistyle]
Condenser Microphone
[Figure: condenser microphone circuit; blocks: Microphone, Preamplifier; labels: b, h, Z_M, R_L, v(t), G]
Omnidirectional microphones
• Polar response
[Figure: polar response plot; diagram labels: Diaphragm, Mic opening]
Bidirectional microphones
[Figure: bidirectional microphone with speech sound wave from the front and noise sound wave from the side]
[Figure: geometry with a source at distance r, points (d, 0) and (–d, 0), and distances r1 and r2]
[Figure: polar response plot]
Bidirectional microphones
• Bidirectional microphone with d = 1 cm at 0°
• Solid line corresponds to far-field conditions ($d/r = 0.02$) and the dotted line to near-field conditions ($d/r = 0.5$)
[Figure: difference in air pressure (dB) vs. frequency (Hz), 100 Hz to 10 kHz]
Unidirectional microphones
[Figure: polar response plot]
[Figure: unidirectional microphone with speech sound wave from the front and noise sound wave from the side]
Dynamic microphones
[Figure: dynamic microphone; labels: Diaphragm, Coil, Magnet, Output voltage]
Acoustic Echo cancellation
$\mathrm{ERLE\,(dB)} = 10 \log_{10} \frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}$
[Diagram: acoustic echo canceller; blocks: Loudspeaker, Acoustic path H, Microphone, Adaptive filter; labels: Speech signal, x[n], s[n], r[n], e[n], d[n], $\hat{d}[n]$, local noise v[n]]
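The echo return loss enhancement (ERLE), the ratio of echo power to residual echo power in dB, can be estimated from recorded signals. A minimal sketch, assuming the expectations are replaced by averages over the available samples; the function name is illustrative.

```python
import math

def erle_db(d, d_hat):
    """ERLE in dB: echo energy over residual energy, using sample sums."""
    echo_energy = sum(v * v for v in d)
    residual_energy = sum((v - w) ** 2 for v, w in zip(d, d_hat))
    return 10.0 * math.log10(echo_energy / residual_energy)

# An estimate that captures 90% of the echo amplitude leaves a residual
# 20 dB below the echo.
d = [1.0, -1.0, 1.0, -1.0]
d_hat = [0.9, -0.9, 0.9, -0.9]
erle = erle_db(d, d_hat)
```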
Line echo cancellation
[Diagram: line echo canceller; blocks: Speaker A, Speaker B, Hybrid circuit H, Adaptive filter; labels: x[n], s[n], r[n], e[n], d[n], $\hat{d}[n]$, noise v[n]]
$\mathrm{ERLE\,(dB)} = 10 \log_{10} \frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}$
Least Mean Squares (LMS)
• Given input: $\mathbf{X}[n] = \{x[n], x[n-1], \ldots, x[n-L+1]\}$
• Estimate output: $y[n] = \sum_{k=0}^{L-1} w_k[n]\,x[n-k] = \mathbf{W}^T[n]\,\mathbf{X}[n]$
• Compute error: $e[n] = d[n] - y[n]$
• Update filter: $\mathbf{W}[n+1] = \mathbf{W}[n] + \varepsilon\,e[n]\,\mathbf{X}[n]$
• Need to tune step size $\varepsilon$
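The four LMS steps (form the input vector, estimate the output, compute the error, update the filter) can be sketched as below. This is an illustration under stated assumptions: a fixed step size called `step`, samples before n = 0 treated as zero, and a noise-free 3-tap system invented for the demonstration.

```python
import random

def lms(x, d, L, step):
    """LMS adaptive filter: returns the final weights and the error signal."""
    w = [0.0] * L
    errors = []
    for n in range(len(x)):
        # Input vector {x[n], x[n-1], ..., x[n-L+1]}, zero before the start.
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        y = sum(wk * xk for wk, xk in zip(w, X))   # estimate output
        e = d[n] - y                               # compute error
        w = [wk + step * e * xk for wk, xk in zip(w, X)]  # update filter
        errors.append(e)
    return w, errors

# Identify a hypothetical 3-tap echo path from white Gaussian input.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
true_w = [0.5, -0.3, 0.1]
d = [sum(true_w[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(len(x))]
w, errors = lms(x, d, L=3, step=0.01)
```

With unit-variance input this step size is well inside the stable range; a louder input would need a smaller step, which is the motivation for normalizing it.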
Normalized LMS
• Make step size adaptive to ensure convergence: $\varepsilon[n] = \frac{\mu}{L\,\hat{\sigma}_x^2[n]}$
• Where we track the input energy: $\hat{\sigma}_x^2[n] = (1-\beta)\,\hat{\sigma}_x^2[n-1] + \beta\,x^2[n]$
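Normalized LMS divides the step size by a running estimate of the input energy. The sketch below assumes a recursive energy tracker with smoothing constant `beta`, a normalized step `mu`, and an initial energy guess `init_energy`; all three names and their default values are assumptions for illustration.

```python
import random

def nlms(x, d, L, mu=0.2, beta=0.01, init_energy=1.0):
    """Normalized LMS: step[n] = mu / (L * energy[n])."""
    w = [0.0] * L
    energy = init_energy
    errors = []
    for n in range(len(x)):
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        # Track the input energy with a first-order recursion.
        energy = (1.0 - beta) * energy + beta * x[n] ** 2
        step = mu / (L * energy + 1e-12)  # small constant avoids division by zero
        e = d[n] - sum(wk * xk for wk, xk in zip(w, X))
        w = [wk + step * e * xk for wk, xk in zip(w, X)]
        errors.append(e)
    return w, errors

# Same hypothetical 3-tap identification task as for plain LMS.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
true_w = [0.5, -0.3, 0.1]
d = [sum(true_w[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(len(x))]
w, errors = nlms(x, d, L=3)
```

The initial energy guess matters only during the first few hundred samples, until the tracker has caught up with the actual input level.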
Recursive Least Squares (RLS)
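• Newton-Raphson: $x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$
[Figure: f(x) with successive iterates x0 and x1]
• New weights: $\mathbf{w}_{i+1} = \mathbf{w}_i - \varepsilon[n]\,\left[\nabla^2 e(\mathbf{w}_i)\right]^{-1} \nabla e(\mathbf{w}_i)$, with $\nabla^2 e(\mathbf{w}_i) = \mathbf{R}[n] = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}$ and $\mathbf{R}[n] = \lambda\,\mathbf{R}[n-1] + \mathbf{x}[n]\,\mathbf{x}^T[n]$
• Faster convergence, but more CPU intensive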
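A sketch of the textbook exponentially weighted RLS recursion, which keeps P = R^-1 up to date with the matrix inversion lemma instead of inverting the recursively updated correlation matrix at every sample. It may differ in detail from the formulation on the slide; the forgetting factor `lam`, the regularization `delta`, and the test system are assumptions.

```python
import random

def rls(x, d, L, lam=0.99, delta=1e-3):
    """Recursive least squares with forgetting factor lam."""
    w = [0.0] * L
    # P approximates the inverse correlation matrix; start from (1/delta) * I.
    P = [[(1.0 / delta if i == j else 0.0) for j in range(L)] for i in range(L)]
    for n in range(len(x)):
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        PX = [sum(P[i][j] * X[j] for j in range(L)) for i in range(L)]
        denom = lam + sum(X[i] * PX[i] for i in range(L))
        gain = [v / denom for v in PX]
        e = d[n] - sum(w[i] * X[i] for i in range(L))
        w = [w[i] + gain[i] * e for i in range(L)]
        # P is symmetric, so X^T P equals PX transposed.
        P = [[(P[i][j] - gain[i] * PX[j]) / lam for j in range(L)]
             for i in range(L)]
    return w

rng = random.Random(2)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
true_w = [0.5, -0.3, 0.1]
d = [sum(true_w[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(len(x))]
w = rls(x, d, L=3)
```

Each update costs on the order of L^2 operations instead of L for LMS, which is the CPU cost paid for the faster convergence.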
Microphone arrays: delay & sum
5 microphones spaced 5 cm apart. Source located at 5 m.
Angle 0°
[Figure: four polar beam patterns, at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]
[Diagram: linear array of microphones M-2, M-1, M0, M1, M2 and source S; label a]
$\hat{\theta} = \arg\max_{\theta} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - i\,a \sin\theta] \right)^2$
$y[n] = \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - i\,a \sin\theta]$
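Delay-and-sum beamforming shifts each channel so that the source lines up across microphones and then averages; steering picks the delays that maximize output energy. The sketch below is a simplification under stated assumptions: per-microphone delays are integer samples (standing in for a term proportional to i sin θ), the sign convention is y_i[n - delay_i], and samples outside the recording are zero. All names are illustrative.

```python
def delay_and_sum(channels, delays):
    """y[n] = (1/N) * sum_i y_i[n - delays[i]], zero outside the recording."""
    N = len(channels)
    length = len(channels[0])
    out = []
    for n in range(length):
        acc = 0.0
        for channel, delay in zip(channels, delays):
            m = n - delay
            if 0 <= m < length:
                acc += channel[m]
        out.append(acc / N)
    return out

def steered_energy(channels, delays):
    """Energy of the beamformer output for one candidate set of delays."""
    return sum(v * v for v in delay_and_sum(channels, delays))

def pulse(position, length=20):
    return [1.0 if n == position else 0.0 for n in range(length)]

# Three microphones; the pulse reaches microphone i 2*i samples earlier.
channels = [pulse(10), pulse(8), pulse(6)]

# Search over candidate per-microphone delay increments.
candidates = [0, 1, 2, 3]
best = max(candidates,
           key=lambda tau: steered_energy(channels, [i * tau for i in range(3)]))
aligned = delay_and_sum(channels, [0, 2, 4])
```

With the correct delays the three pulses add coherently; with any other candidate they land on different samples and the output energy drops.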
Microphone arrays: delay & sum
5 microphones spaced 5 cm apart. Source located at 5 m.
Angle 30°
[Figure: four polar beam patterns, at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]
[Diagram: linear array of microphones M-2, M-1, M0, M1, M2 and source S; label a]
$\hat{\theta} = \arg\max_{\theta} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - i\,a \sin\theta] \right)^2$
$y[n] = \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - i\,a \sin\theta]$
WITTY: Who Is Talking To You?
$Y(f) = X(f) + V(f)$
$B(f) = H(f)\,X(f) + G(f)\,V(f) + W(f)$
Bone microphone for noise robust ASR
• Conventional microphones are sensitive to noise
• Bone microphones are more noise resistant, but distort the signal
• Not enough data to retrain recognizer with bone microphone
• Fusion between acoustic microphone and bone microphone
Acoustic Microphone
Bone Microphone
Microphone fusion
Relationship between acoustic mic and bone mic
Acoustic
Contact
Relationship between acoustic mic and bone mic
WITTY: Who is talking to you?
Blind source separation
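• Linear mixing: $\mathbf{y}[n] = \mathbf{G}\,\mathbf{x}[n]$
• Estimate filter: $\mathbf{H} = \mathbf{G}^{-1}$
• Separate signals: $\mathbf{x}[n] = \mathbf{H}\,\mathbf{y}[n]$
• Using the assumption that signals are independent:
  $p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|\,p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$
  $p_{\mathbf{y}}(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \prod_{n=0}^{N-1} p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|^{N} \prod_{n=0}^{N-1} p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$
• Do gradient descent: $\mathbf{H}_{n+1} = \mathbf{H}_n + \varepsilon \left[ (\mathbf{H}_n^T)^{-1} - \phi(\mathbf{H}_n \mathbf{y}[n])\,\mathbf{y}^T[n] \right]$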
Blind source separation
Idea: Estimate filters h11[n] and h12[n] that maximize the likelihood of z1[n] under a speech model, where the model is an HMM.
Approximate the HMM by a Gaussian Mixture Model with LPC parameters => EM algorithm with a linear set of equations
[Diagram: inputs y1[n] and y2[n] pass through filters h11[n], h12[n], h21[n], h22[n] and two summing nodes to give outputs z1[n] and z2[n]]
Spectral subtraction
Corrupted signal: $y[m] = x[m] + n[m]$
Power spectrum: $|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)| \cos\theta$
but $E\{\cos\theta\} = 0$
So $|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$
Estimate noise power spectrum from M frames that contain only noise: $|\hat{N}(f)|^2 = \frac{1}{M} \sum_{i=0}^{M-1} |Y_i(f)|^2$
Estimate clean power spectrum as $|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2 \left( 1 - \frac{1}{SNR(f)} \right)$, with $SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}$
Spectral subtraction
Keep original phase: $\hat{X}(f) = Y(f)\,H_{ss}(f)$
Ensure it’s positive: $H_{ss}(f) = \max\left( 1 - \frac{1}{SNR(f)},\ a \right)$
[Figure: gain (dB) vs. instantaneous SNR (dB), -5 to 20 dB; curves: spectral subtraction, magnitude subtraction, oversubtraction]
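Spectral subtraction with a floor can be sketched on per-bin power spectra. This is a minimal illustration under stated assumptions: the noise estimate is the average power over noise-only frames, the instantaneous SNR is noisy power over estimated noise power, and the floored gain max(1 - 1/SNR, a) is applied to the power spectrum. The floor value and function names are hypothetical.

```python
def estimate_noise_power(noise_frames):
    """Average the power spectrum over M noise-only frames, bin by bin."""
    M = len(noise_frames)
    bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / M for k in range(bins)]

def subtraction_gain(snr, a=0.01):
    """Gain 1 - 1/SNR, floored at a so the estimate stays positive."""
    return max(1.0 - 1.0 / snr, a)

def spectral_subtraction(noisy_power, noise_power, a=0.01):
    """Clean power estimate per bin; noise_power must be positive."""
    return [y * subtraction_gain(y / n, a)
            for y, n in zip(noisy_power, noise_power)]

noise_power = estimate_noise_power([[1.0, 2.0], [3.0, 2.0]])
# Bin 0 is well above the noise; bin 1 is below the noise estimate.
clean_power = spectral_subtraction([10.0, 1.0], noise_power)
```

In the second bin the plain difference would be negative, so the floor takes over; the original phase would be reused when converting back to a waveform.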
Aurora2
• ETSI STQ group
• TIDigits
• Added noise at SNRs: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB
• Set A: subway, babble, car, exhibition
• Set B: restaurant, airport, street, station
• Set C: one noise from set A and one noise from set B
• Aurora 3 recorded in car (no digital mixing!)
• Aurora 4 for large vocabulary
• Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction
Aurora 2 (Clean training)
Using SPLICE algorithm

Word accuracy (%):
| SNR | Subway | Babble | Car | Exhibition | Avg A | Restaurant | Street | Airport | Station | Avg B | Subway M | Street M | Avg C | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 98.16 | 98.52 | 98.72 | 98.27 | 98.42 | 98.65 | 97.58 | 98.81 | 98.7 | 98.44 | 98.34 | 98.04 | 98.19 | 98.38 |
| 15 dB | 96.65 | 97.64 | 98.09 | 96.61 | 97.25 | 97.88 | 96.89 | 97.97 | 97.84 | 97.65 | 96.81 | 96.4 | 96.61 | 97.28 |
| 10 dB | 93.77 | 94.68 | 95.71 | 93.09 | 94.31 | 94.75 | 93.44 | 95.85 | 94.6 | 94.66 | 93.18 | 91.23 | 92.21 | 94.03 |
| 5 dB | 87.47 | 84.46 | 88.46 | 85.53 | 86.48 | 85.08 | 83.71 | 87.03 | 84.94 | 85.19 | 84.31 | 80.35 | 82.33 | 85.13 |
| 0 dB | 65.92 | 57.13 | 63.67 | 63.78 | 62.63 | 59.72 | 57.83 | 63.11 | 57.42 | 59.52 | 59.23 | 52.9 | 56.07 | 60.07 |
| -5 dB | | | | | | | | | | | | | | |
| Average | 88.39 | 86.49 | 88.93 | 87.46 | 87.82 | 87.22 | 85.89 | 88.55 | 86.70 | 87.09 | 86.37 | 83.78 | 85.08 | 86.98 |

Relative improvement:
| SNR | Subway | Babble | Car | Exhibition | Avg A | Restaurant | Street | Airport | Station | Avg B | Subway M | Street M | Avg C | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 37.63% | 84.97% | 50.58% | 52.08% | 56.31% | 86.51% | 43.19% | 87.29% | 75.38% | 73.09% | 74.62% | 59.75% | 67.19% | 65.20% |
| 15 dB | 48.54% | 91.01% | 80.82% | 57.41% | 69.45% | 91.08% | 73.07% | 91.17% | 86.79% | 85.53% | 75.89% | 67.54% | 71.71% | 76.33% |
| 10 dB | 70.72% | 89.48% | 87.00% | 71.61% | 79.70% | 88.39% | 80.05% | 91.01% | 86.40% | 86.46% | 73.87% | 65.70% | 69.79% | 80.42% |
| 5 dB | 73.81% | 78.77% | 82.49% | 73.77% | 77.21% | 78.37% | 73.53% | 81.38% | 79.11% | 78.10% | 67.80% | 61.31% | 64.56% | 75.04% |
| 0 dB | 53.94% | 52.74% | 57.53% | 55.80% | 55.00% | 54.76% | 48.67% | 56.90% | 51.85% | 53.05% | 45.33% | 38.90% | 42.12% | 51.64% |
| -5 dB | | | | | | | | | | | | | | |
| Average | 61.96% | 73.03% | 71.90% | 63.75% | 68.48% | 73.03% | 63.33% | 75.52% | 70.02% | 70.83% | 59.73% | 52.14% | 55.93% | 67.39% |
Aurora 2 (multi-condition training)
Using SPLICE algorithm

Word accuracy (%):
| SNR | Subway | Babble | Car | Exhibition | Avg A | Restaurant | Street | Airport | Station | Avg B | Subway M | Street M | Avg C | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 98.53 | 98.64 | 98.51 | 98.64 | 98.58 | 98.46 | 97.91 | 98.6 | 98.58 | 98.39 | 98.4 | 98.25 | 98.33 | 98.45 |
| 15 dB | 97.64 | 98.07 | 98.33 | 97.69 | 97.93 | 97.79 | 97.49 | 97.44 | 97.47 | 97.55 | 97.88 | 97.16 | 97.52 | 97.70 |
| 10 dB | 95.98 | 96.37 | 96.84 | 95.65 | 96.21 | 95.27 | 94.41 | 95.11 | 95.12 | 94.98 | 95.79 | 93.8 | 94.80 | 95.43 |
| 5 dB | 92.08 | 88.94 | 92.78 | 90.25 | 91.01 | 87.63 | 88.06 | 88.16 | 87.04 | 87.72 | 90.97 | 85.85 | 88.41 | 89.18 |
| 0 dB | 78.02 | 65.57 | 76.83 | 74.42 | 73.71 | 65.37 | 68.23 | 69.49 | 65.57 | 67.17 | 72.67 | 65.42 | 69.05 | 70.16 |
| -5 dB | | | | | | | | | | | | | | |
| Average | 92.45 | 89.52 | 92.66 | 91.33 | 91.49 | 88.90 | 89.22 | 89.76 | 88.76 | 89.16 | 91.14 | 88.10 | 89.62 | 90.18 |

Relative improvement:
| SNR | Subway | Babble | Car | Exhibition | Avg A | Restaurant | Street | Airport | Station | Avg B | Subway M | Street M | Avg C | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 38.49% | 40.09% | 24.37% | 47.49% | 37.61% | 50.80% | 13.64% | 45.31% | 52.51% | 40.56% | 40.74% | 49.28% | 45.01% | 40.27% |
| 15 dB | 33.14% | 34.80% | 30.13% | 30.63% | 32.17% | 52.98% | 31.98% | 34.02% | 43.40% | 40.59% | 41.92% | 36.47% | 39.19% | 36.95% |
| 10 dB | 27.70% | 23.09% | 25.82% | 26.15% | 25.69% | 41.17% | 1.06% | 27.12% | 31.56% | 25.23% | 36.79% | 17.33% | 27.06% | 25.78% |
| 5 dB | 31.96% | 11.16% | 40.82% | 21.37% | 26.33% | 24.85% | 17.03% | 13.89% | 21.36% | 19.28% | 48.66% | 19.00% | 33.83% | 25.01% |
| 0 dB | 33.60% | 9.04% | 50.24% | 28.23% | 30.27% | 14.93% | 17.82% | 12.55% | 21.54% | 16.71% | 48.61% | 24.10% | 36.35% | 26.06% |
| -5 dB | | | | | | | | | | | | | | |
| Average | 32.85% | 13.01% | 45.52% | 27.57% | 30.15% | 24.04% | 16.83% | 17.14% | 24.99% | 21.05% | 47.14% | 24.13% | 36.01% | 27.87% |
Wiener Filtering
• Find linear estimate of clean signal: $y[n] = x[n] + v[n]$, $\hat{x}[n] = \sum_m h[m]\,y[n-m]$
• MMSE (Minimum Mean Squared Error): minimize $E\left\{ \left( x[n] - \sum_m h[m]\,y[n-m] \right)^2 \right\}$
• Wiener-Hopf equation: $R_{xy}[l] = \sum_m h[m]\,R_{yy}[l-m]$, with $R_{xy}[l] = \sum_m x[m]\,y[m-l]$ and $R_{yy}[l] = \sum_m y[m]\,y[m-l]$
• In freq domain: $H(f) = \frac{S_{xy}(f)}{S_{yy}(f)}$
• If noise and signal are uncorrelated: $H(f) = \frac{S_{xx}(f)}{S_{xx}(f) + S_{vv}(f)}$
Wiener Filtering
• Find linear estimate of clean signal: $y[n] = x[n] + v[n]$, $\hat{x}[n] = \sum_m h[m]\,y[n-m]$
• If noise and signal are uncorrelated: $H(f) = \frac{S_{xx}(f)}{S_{yy}(f)} = \frac{S_{yy}(f) - S_{vv}(f)}{S_{yy}(f)} = 1 - \frac{1}{SNR(f)}$
• With $SNR(f) = \frac{S_{yy}(f)}{S_{vv}(f)}$
• Compare with Spectral Subtraction: $H_{ss}(f) = \max\left( 1 - \frac{1}{SNR(f)},\ a \right)$
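For uncorrelated signal and noise, the two forms of the Wiener gain, S_xx/(S_xx + S_vv) and 1 - 1/SNR with SNR = S_yy/S_vv, give the same number. A small numerical check, with per-bin power values invented for illustration:

```python
def wiener_gain(s_xx, s_vv):
    """Wiener gain from clean-signal and noise power spectra in one bin."""
    return s_xx / (s_xx + s_vv)

def wiener_gain_from_snr(s_yy, s_vv):
    """Same gain written as 1 - 1/SNR, with SNR = S_yy / S_vv."""
    snr = s_yy / s_vv
    return 1.0 - 1.0 / snr

s_xx, s_vv = 3.0, 1.0
s_yy = s_xx + s_vv  # uncorrelated signal and noise: power spectra add
gain_direct = wiener_gain(s_xx, s_vv)
gain_snr = wiener_gain_from_snr(s_yy, s_vv)
```

Unlike the floored spectral subtraction gain, this expression never goes negative as long as the true S_yy is used, since S_yy is at least S_vv.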
Spectral Subtraction
[Figure: word error rate (%) vs. SNR (dB), 0 to 30 dB; curves: Clean Speech Training, Spectral Subtraction, Matched Noisy Training]
Vector Taylor Series (VTS)
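• Acero, Moreno
• Corrupted signal: $y[m] = x[m] * h[m] + n[m]$
• The power spectrum, on the average: $|Y(f_i)|^2 \approx |X(f_i)|^2\,|H(f_i)|^2 + |N(f_i)|^2$
• Taking logs: $\ln|Y(f_i)|^2 = \ln|X(f_i)|^2 + \ln|H(f_i)|^2 + \ln\left( 1 + \exp\left( \ln|N(f_i)|^2 - \ln|X(f_i)|^2 - \ln|H(f_i)|^2 \right) \right)$
• Cepstrum is DCT (matrix C) of log power spectrum: $\mathbf{y} = \mathbf{x} + \mathbf{h} + \mathbf{g}(\mathbf{n} - \mathbf{x} - \mathbf{h})$, with $\mathbf{g}(\mathbf{z}) = \mathbf{C} \ln\left( 1 + e^{\mathbf{C}^{-1}\mathbf{z}} \right)$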
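The log-domain relation can be checked numerically for a single frequency bin. The sketch below works directly on log power spectra (equivalent to taking the matrix C as the identity, which is an assumption made to keep the example scalar); the function name is illustrative.

```python
import math

def corrupted_log_power(x, h, n):
    """ln|Y|^2 = x + h + ln(1 + exp(n - x - h)) for log power spectra x, h, n."""
    return x + h + math.log1p(math.exp(n - x - h))

# Compare against the power-domain sum |X|^2 |H|^2 + |N|^2.
x, h, n = 2.0, 0.5, 1.0
log_domain = corrupted_log_power(x, h, n)
power_domain = math.log(math.exp(x) * math.exp(h) + math.exp(n))
```

When the noise is far below the filtered speech the correction term vanishes and the result is just x + h; when the noise dominates, the result approaches n.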
Vector Taylor Series (VTS)
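• x, h, and n are Gaussian random vectors with means $\boldsymbol{\mu}_x$, $\boldsymbol{\mu}_h$, and $\boldsymbol{\mu}_n$ and covariance matrices $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_h$, and $\boldsymbol{\Sigma}_n$
• Expand y in first-order Taylor series:
  $\mathbf{y} \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h) + \mathbf{A}(\mathbf{x} - \boldsymbol{\mu}_x) + \mathbf{A}(\mathbf{h} - \boldsymbol{\mu}_h) + (\mathbf{I} - \mathbf{A})(\mathbf{n} - \boldsymbol{\mu}_n)$
  $\mathbf{A} = \mathbf{C}\mathbf{F}\mathbf{C}^{-1}$, with $\mathbf{F}$ a diagonal matrix whose elements are given by $\mathbf{f}(\boldsymbol{\mu}) = \frac{1}{1 + e^{\mathbf{C}^{-1}\boldsymbol{\mu}}}$ evaluated at $\boldsymbol{\mu} = \boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h$
  $\boldsymbol{\mu}_y \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h)$
  $\boldsymbol{\Sigma}_y \approx \mathbf{A}\boldsymbol{\Sigma}_x\mathbf{A}^T + \mathbf{A}\boldsymbol{\Sigma}_h\mathbf{A}^T + (\mathbf{I} - \mathbf{A})\boldsymbol{\Sigma}_n(\mathbf{I} - \mathbf{A})^T$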
Vector Taylor Series
• Distribution of corrupted log-spectra
• Noise with mean of 0 dB and std dev of 2 dB
• Speech with mean of 25 dB
• Monte Carlo simulation
• Speech std dev: 25 dB, 10 dB, 5 dB
[Figure: three histograms of the corrupted log-spectrum, one per speech std dev]
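The Monte Carlo experiment can be reproduced for a single log-spectral bin and compared with a first-order Taylor approximation of the mean and standard deviation. This is a rough sketch under stated assumptions: everything is scalar and in dB, there is no channel term, speech and noise are independent Gaussians in the dB domain, and the trial count and seed are arbitrary.

```python
import math
import random

def corrupted_db(x_db, n_db):
    """Log-spectrum (dB) of speech plus noise whose powers add."""
    return 10.0 * math.log10(10.0 ** (x_db / 10.0) + 10.0 ** (n_db / 10.0))

def first_order_approx(mu_x, sd_x, mu_n, sd_n):
    """First-order Taylor expansion around the means (scalar case)."""
    a = 1.0 / (1.0 + 10.0 ** ((mu_n - mu_x) / 10.0))  # derivative w.r.t. speech
    mean = corrupted_db(mu_x, mu_n)
    std = math.sqrt((a * sd_x) ** 2 + ((1.0 - a) * sd_n) ** 2)
    return mean, std

def monte_carlo(mu_x, sd_x, mu_n, sd_n, trials=20000, seed=0):
    """Sample speech and noise in dB and measure the corrupted statistics."""
    rng = random.Random(seed)
    ys = [corrupted_db(rng.gauss(mu_x, sd_x), rng.gauss(mu_n, sd_n))
          for _ in range(trials)]
    mean = sum(ys) / trials
    var = sum((y - mean) ** 2 for y in ys) / trials
    return mean, math.sqrt(var)

# Speech at 25 dB with 5 dB std dev, noise at 0 dB with 2 dB std dev.
approx_mean, approx_std = first_order_approx(25.0, 5.0, 0.0, 2.0)
mc_mean, mc_std = monte_carlo(25.0, 5.0, 0.0, 2.0)
```

Rerunning with a larger speech standard deviation tends to widen the gap between the two, since more of the speech distribution falls near or below the noise level where the relation is strongly nonlinear.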
Phase matters
Corrupted signal: $y[m] = x[m] + n[m]$
Spectrum: $|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)| \cos\theta$
But $|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$ is only an approximation: $E\{\cos\theta\} = 0$ holds on average, while in an individual frame $t$ the cross term $2\,|X_t(f)|\,|N_t(f)| \cos\theta_t \neq 0$
[Figure: two scatter plots, both axes from -6 to 2]
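The cross term is easy to see with two complex spectral values. A small sketch, with magnitudes and phases chosen arbitrarily for illustration:

```python
import cmath
import math

def cross_term(X, N):
    """2 |X| |N| cos(theta), with theta the phase difference between X and N."""
    theta = cmath.phase(X) - cmath.phase(N)
    return 2.0 * abs(X) * abs(N) * math.cos(theta)

# One frequency bin of speech and noise in a single frame.
X = cmath.rect(2.0, 0.3)
N = cmath.rect(1.0, 1.5)

exact = abs(X + N) ** 2
with_cross_term = abs(X) ** 2 + abs(N) ** 2 + cross_term(X, N)
without_cross_term = abs(X) ** 2 + abs(N) ** 2
```

Here `exact` and `with_cross_term` agree, while `without_cross_term` is off by the cross term for this frame; only an average over many frames with random relative phase drives that term towards zero.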
Non-stationary noise
• Speech/Noise decomposition (Varga et al.)
Observations
Speech HMM
Noise HMM