Noise Compensation for Subspace Gaussian Mixture Models
Liang Lu, University of Edinburgh
Joint work with KK Chin, A. Ghoshal and S. Renals
Liang Lu, Interspeech, September, 2012 RTSC
Outline
- Motivation
  - Subspace GMM (SGMM) works well in matched speech conditions [Povey et al., 2011]
  - In mismatched conditions (i.e. noise), the gain disappears
- Goal
  - Noise compensation for SGMM
- Method
  - Model space compensation
  - Joint uncertainty decoding (JUD) [Liao and Gales, 2005]
HMM-GMM acoustic model
[Figure: HMM with states j − 1, j, j + 1 and GMM output distributions]
Subspace Gaussian Mixture Models [Povey et al., 2011]
[Figure: SGMM — HMM states j − 1, j, j + 1; state vectors v_jk; global parameters M_i, w_i, Σ_i, i = 1, . . . , I]
- Global (shared) parameters, i = 1, . . . , I
  - M_i is the basis for the means
  - w_i is the basis for the weights
  - Σ_i is the covariance matrix
- State-dependent parameters
  - v_jk is a low-dimensional vector (e.g. 40-dim)
  - Gaussian mean: μ_jki = M_i v_jk
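The mean construction above can be sketched numerically. This is a toy example with hypothetical dimensions; the weight model follows the SGMM convention of a softmax over w_i^T v_jk:

```python
import numpy as np

# Toy, hypothetical dimensions: D = 39 (features), S = 40 (subspace), I = 4 regions.
rng = np.random.default_rng(0)
D, S, I = 39, 40, 4

M = rng.standard_normal((I, D, S))   # mean bases M_i (global)
w = rng.standard_normal((I, S))      # weight bases w_i (global)
v_jk = rng.standard_normal(S)        # state-dependent subspace vector v_jk

# Gaussian means for state (j, k): mu_jki = M_i v_jk, one per region i
mu = np.einsum('ids,s->id', M, v_jk)  # shape (I, D)

# Mixture weights: softmax of w_i^T v_jk across the I regions
logits = w @ v_jk
weights = np.exp(logits - logits.max())
weights /= weights.sum()
```

Note how few state-dependent parameters there are: each substate stores only the S-dimensional v_jk, while the large matrices M_i, w_i, Σ_i are shared globally.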
Subspace Gaussian Mixture Models
I More intuitively, suppose we have an acoustic space like this
Subspace Gaussian Mixture Models
I We then partition the whole acoustic space into I regions.
I This can be done by learning a GMM using the training data.
[Figure: acoustic space partitioned into regions 1, 2, 3, . . . , I]
Subspace Gaussian Mixture Models
- We then introduce parameters to structure each region:
  - Σ_i models the covariance of the region
  - M_i spans the basis for the Gaussian means
  - w_i spans the basis for the Gaussian weights
[Figure: regions 1, 2, 3, . . . with M_i, w_i, Σ_i attached to each region]
Subspace Gaussian Mixture Models
Given a class with some data, such as an HMM state
[Figure: data points of an HMM state scattered across the partitioned acoustic space, summarised by the state vector v_jk]
Subspace Gaussian Mixture Models
Then we learn a GMM for this class
[Figure: as above, with a GMM for this class fitted over the regions via v_jk]
Noise compensation
- Larger modelling power → higher recognition accuracy
- In our systems on Aurora 4, the number of Gaussians is 6.4M (SGMM) vs. 50k (GMM)
  - SGMM vs. GMM → 5.2% vs. 7.7% WER in the clean condition
  - SGMM vs. GMM → 59.9% vs. 59.3% WER in the noisy condition
- Can we do noise compensation for SGMMs?
[Bar chart: WER of GMM and SGMM systems in the clean and noisy conditions]
Noise compensation
There is a large body of work on noise compensation for robust ASR [Deng, 2011]:
- Feature domain
  - Spectral subtraction, CMN/CVN
  - Cepstral mean square error estimation
  - Algonquin
  - SPLICE
  - Feature-space vector Taylor series (VTS)
- Model domain
  - MLLR, noise-constrained MLLR
  - PMC, data-driven PMC (DPMC), iterative DPMC
  - VTS, joint uncertainty decoding (JUD)
  - Linear spline interpolation (LSI)
  - Unscented transform (UT)
- Hybrid
  - Noise adaptive training
Noise compensation for SGMM
- Model space compensation for SGMM
- Not data-driven, but based on heuristic knowledge
- Mismatch function y = f(x, h, n, α) [Acero, 1990]
- α denotes the phase term between noise and speech [Deng et al., 2004]

[Diagram: clean speech x and channel noise h combine (⊕) with additive noise n to give noisy speech y]
Noise compensation for SGMM
The mismatch function is
y = f(x, h, n, α)
  = x + h + C log[ 1 + exp(C⁻¹(n − x − h)) + 2α • exp(C⁻¹(n − x − h)/2) ],   (1)

where C is the DCT matrix, and the last term inside the brackets is the phase term.
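Equation (1) can be evaluated directly. The sketch below uses an orthogonal DCT-II matrix for C (so C⁻¹ = Cᵀ), which is one common convention; the paper's exact normalisation and truncation of C may differ:

```python
import numpy as np

def dct_matrix(d):
    """Orthogonal DCT-II matrix; rows are the DCT basis vectors."""
    k = np.arange(d)[:, None]
    m = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi / d * (m + 0.5) * k)
    C[0] /= np.sqrt(2.0)
    return C

def mismatch(x, h, n, alpha, C):
    """Eq. (1): noisy cepstrum y from clean speech x, channel h,
    additive noise n, and phase factor alpha (C orthogonal, so C^-1 = C.T)."""
    e = C.T @ (n - x - h)                  # back to the log-spectral domain
    phase = 2.0 * alpha * np.exp(e / 2.0)  # phase-sensitive term
    return x + h + C @ np.log1p(np.exp(e) + phase)

d = 13
C = dct_matrix(d)
x = np.zeros(d)
h = np.zeros(d)
n = C @ (-30.0 * np.ones(d))   # very quiet noise, set in the log-spectral domain
y = mismatch(x, h, n, 0.0, C)  # with negligible noise, y is close to x + h
```

As a sanity check, when the noise level is far below the speech level the log term vanishes and y reduces to x + h, as expected from (1).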
Noise compensation
- Aim: estimate μ_y and Σ_y for each Gaussian component
- Difficulty: y = f(x, h, n, α) is highly nonlinear; there is no analytic solution!
- Solution: vector Taylor series (VTS) approximation [Moreno et al., 1996]
- Cost: real-time factor > 100 and memory > 10 GB for a (medium-sized) SGMM with 6.4M Gaussians
- Inelegant: directly applying VTS destroys the compact structure of SGMMs
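For reference, the first-order VTS expansion (with α = 0) can be sketched as below. This is a generic illustration of the approximation, not the paper's SGMM-specific implementation; the toy example takes C = I, i.e. it operates directly in the log-spectral domain:

```python
import numpy as np

def vts_moments(mu_x, Sigma_x, mu_n, Sigma_n, mu_h, C):
    """First-order VTS moments of y around (mu_x, mu_h, mu_n), with alpha = 0.
    C is an orthogonal DCT matrix (C^-1 = C.T); a sketch, not the paper's code."""
    u = C.T @ (mu_n - mu_x - mu_h)
    mu_y = mu_x + mu_h + C @ np.log1p(np.exp(u))     # f at the expansion point
    G = C @ np.diag(1.0 / (1.0 + np.exp(u))) @ C.T   # Jacobian dy/dx
    F = np.eye(len(mu_x)) - G                        # Jacobian dy/dn
    Sigma_y = G @ Sigma_x @ G.T + F @ Sigma_n @ F.T  # channel h treated as fixed
    return mu_y, Sigma_y

d = 4
C = np.eye(d)              # identity for this toy: stay in the log-spectral domain
mu_x, mu_h = np.zeros(d), np.zeros(d)
mu_n = -10.0 * np.ones(d)  # noise well below the speech level
mu_y, Sigma_y = vts_moments(mu_x, np.eye(d), mu_n, 0.5 * np.eye(d), mu_h, C)
# with negligible noise, mu_y ≈ mu_x and Sigma_y ≈ Sigma_x
```

Applying this per Gaussian is what makes direct VTS so expensive here: with 6.4M Gaussians, every component needs its own Jacobians and compensated full covariance.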
Noise compensation
- Solution: joint uncertainty decoding (JUD)

[Figure: VTS vs. JUD compensation schemes]
Noise compensation
- Applying JUD to SGMM

[Figure: JUD transforms attached to the acoustic-space regions 1, 2, 3, . . . , I]

- Cost: real-time factor ∼ 10 for an SGMM with 6.4M Gaussians
Experiments
- Database
  - Aurora 4 dataset
  - Clean speech and noisy speech with SNR in [5 dB, 15 dB]
  - Close-talking microphone and desk-mounted microphone
  - ∼15 hours of training data
  - 330 test utterances
- System configuration
  - 39-dim MFCCs
  - #triphone states: 3.1k (GMM) vs. 3.9k (SGMM)
  - #Gaussians: 50k (GMM) vs. 6.4M (SGMM)
  - #regression classes: 112 (GMM) vs. 400 (SGMM)
Noise compensation experiments
[Bar chart: WER of GMM and SGMM systems for the baseline, JUD, and VTS]
Experiments
Results obtained by tuning the value of the phase factor:

[Plot: word error rate (%) vs. phase factor value for the VTS/GMM, JUD/GMM, and JUD/SGMM systems]
- The JUD/SGMM system achieved 16.8% WER on the Aurora 4 database
Remarks
- The phase term is very effective for noise compensation
- Similar improvements were also observed in other studies, e.g. [Li et al., 2009]
- One possible reason is that it compensates for the linearization bias and performs domain compensation [Li et al., 2009]
- Our insight is that it may help to avoid over-estimation of the noise model
Conclusion
- SGMM is a promising alternative for acoustic modelling
- Noise compensation using JUD works well for SGMMs
- The phase term is particularly effective for noise compensation
- Future work: noise adaptive training, and compensation in the log-spectral domain
Noise compensation
- With JUD, the marginal likelihood can be obtained as

p(y | m) ≈ |A^(r)| N(A^(r) y + b^(r); μ_m, Σ_m + Σ_b^(r)).   (2)
- The transformation is applied in the feature space, to each frame
- Computation is saved since #frames ≪ #Gaussians
- The transformation should be diagonalized in GMM systems, but need not be in the SGMM system, since we use full covariance matrices
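A minimal sketch of evaluating the log of Eq. (2) for one component follows; names and shapes are illustrative (A, b, Σ_b are the regression-class transform, μ_m, Σ_m the clean-model component):

```python
import numpy as np

def jud_loglike(y, A, b, mu_m, Sigma_m, Sigma_b):
    """log p(y|m) ≈ log|A| + log N(A y + b; mu_m, Sigma_m + Sigma_b), cf. Eq. (2)."""
    d = len(y)
    z = A @ y + b - mu_m
    S = Sigma_m + Sigma_b               # full covariance is fine in the SGMM case
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_A = np.linalg.slogdet(A)  # log |A|
    quad = z @ np.linalg.solve(S, z)
    return logdet_A - 0.5 * (d * np.log(2.0 * np.pi) + logdet_S + quad)

# Sanity check: identity transform, zero bias, unit covariance
d = 3
ll = jud_loglike(np.zeros(d), np.eye(d), np.zeros(d),
                 np.zeros(d), np.eye(d), np.zeros((d, d)))
```

Since A^(r), b^(r), and Σ_b^(r) are shared by all Gaussians in a region, the transform is applied once per frame per region, which is where the saving over per-Gaussian VTS comes from.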
Experiments
Table: GMM systems with α = 0.

Methods       Clean   Avg
Clean model    7.7   59.3
MTR model     12.7   26.9
VTS            7.3   18.3
JUD            7.0   21.1

Table: SGMM systems with α = 0.

Methods       Clean   Avg
Clean model    5.2   59.9
MTR model      6.8   22.2
JUD            5.3   20.3
References

Acero, A. (1990). Acoustic and Environmental Robustness in Automatic Speech Recognition. PhD thesis, Carnegie Mellon University.

Deng, L., Droppo, J., and Acero, A. (2004). Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise. IEEE Transactions on Speech and Audio Processing, 12(2):133–143.

Droppo, J., Acero, A., and Deng, L. (2002). Uncertainty decoding with SPLICE for noise robust speech recognition. In Proc. ICASSP. IEEE.

Gales, M. (1995). Model-based Techniques for Noise Robust Speech Recognition. PhD thesis, Cambridge University.
Hu, Y. and Huo, Q. (2006). An HMM compensation approach using unscented transformation for noisy speech recognition. Chinese Spoken Language Processing, pages 346–357.

Li, J., Deng, L., Yu, D., Gong, Y., and Acero, A. (2009). A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Computer Speech & Language, 23(3):389–405.

Liao, H. and Gales, M. (2005). Joint uncertainty decoding for noise robust speech recognition. In Proc. INTERSPEECH.

Moreno, P., Raj, B., and Stern, R. (1996). A vector Taylor series approach for environment-independent speech recognition. In Proc. ICASSP, volume 2, pages 733–736. IEEE.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., Glembek, O., Goel, N., Karafiat, M., Rastrow, A., Rose, R., Schwarz, P., and Thomas, S. (2011). The subspace Gaussian mixture model—A structured model for speech recognition. Computer Speech & Language, 25(2):404–439.