Noise Compensation for Subspace Gaussian Mixture Models
Liang Lu, University of Edinburgh
Joint work with KK Chin, A. Ghoshal and S. Renals
Liang Lu, Interspeech, September, 2012 RTSC
Outline
- Motivation
  - Subspace GMM (SGMM) works well in matched speech conditions [Povey et al., 2011]
  - In mismatched conditions (i.e. noise), the gain disappears
- Goal
  - Noise compensation for SGMM
- Method
  - Model space compensation
  - Joint uncertainty decoding (JUD) [Liao and Gales, 2005]
HMM-GMM acoustic model
[Figure: HMM with states j − 1, j, j + 1 and GMM output distributions]
Subspace Gaussian Mixture Models [Povey et al., 2011]
[Figure: SGMM — HMM states j − 1, j, j + 1; state vectors v_jk; global parameters M_i, w_i, Σ_i, i = 1, . . . , I]
- Global (shared) parameters, i = 1, . . . , I
  - M_i is the basis for the means
  - w_i is the basis for the weights
  - Σ_i is the covariance matrix
- State-dependent parameters
  - v_jk is a low-dimensional vector (e.g. 40-dim)
  - Gaussian mean: μ_jki = M_i v_jk
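The mean construction above can be sketched numerically. This is a toy example with hypothetical dimensions; the weight model follows the SGMM convention of a softmax over w_i^T v_jk:

```python
import numpy as np

# Toy, hypothetical dimensions: D = 39 (features), S = 40 (subspace), I = 4 regions.
rng = np.random.default_rng(0)
D, S, I = 39, 40, 4

M = rng.standard_normal((I, D, S))   # mean bases M_i (global)
w = rng.standard_normal((I, S))      # weight bases w_i (global)
v_jk = rng.standard_normal(S)        # state-dependent subspace vector v_jk

# Gaussian means for state (j, k): mu_jki = M_i v_jk, one per region i
mu = np.einsum('ids,s->id', M, v_jk)  # shape (I, D)

# Mixture weights: softmax of w_i^T v_jk across the I regions
logits = w @ v_jk
weights = np.exp(logits - logits.max())
weights /= weights.sum()
```

Note how few state-dependent parameters there are: each substate stores only the S-dimensional v_jk, while the large matrices M_i, w_i, Σ_i are shared globally.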
Subspace Gaussian Mixture Models
I More intuitively, suppose we have an acoustic space like this
Subspace Gaussian Mixture Models
I We then partition the whole acoustic space into I regions.
I This can be done by learning a GMM using the training data.
[Figure: acoustic space partitioned into regions 1, 2, 3, . . . , I]
Subspace Gaussian Mixture Models
- We then introduce parameters to structure each region:
  - Σ_i models the covariance of the region
  - M_i spans the basis for the Gaussian means
  - w_i spans the basis for the Gaussian weights
[Figure: regions 1, 2, 3, . . . with M_i, w_i, Σ_i attached to each region]
Subspace Gaussian Mixture Models
Given a class with some data, such as an HMM state
[Figure: data points of an HMM state scattered across the partitioned acoustic space, summarised by the state vector v_jk]
Subspace Gaussian Mixture Models
Then we learn a GMM for this class
[Figure: as above, with a GMM for this class fitted over the regions via v_jk]
Noise compensation
- Larger modelling power → higher recognition accuracy
- In our systems on Aurora 4, the number of Gaussians is 6.4M (SGMM) vs. 50k (GMM)
  - SGMM vs. GMM → 5.2% vs. 7.7% WER in the clean condition
  - SGMM vs. GMM → 59.9% vs. 59.3% WER in the noisy condition
- Can we do noise compensation for SGMMs?
[Bar chart: WER of GMM and SGMM systems in the clean and noisy conditions]
Noise compensation
There is a large body of work on noise compensation for robust ASR [Deng, 2011]:
- Feature domain
  - Spectral subtraction, CMN/CVN
  - Cepstral mean square error estimation
  - Algonquin
  - SPLICE
  - Feature-space vector Taylor series (VTS)
- Model domain
  - MLLR, noise-constrained MLLR
  - PMC, data-driven PMC (DPMC), iterative DPMC
  - VTS, joint uncertainty decoding (JUD)
  - Linear spline interpolation (LSI)
  - Unscented transform (UT)
- Hybrid
  - Noise adaptive training
Noise compensation for SGMM
- Model space compensation for SGMM
- Not data-driven, but based on heuristic knowledge
- Mismatch function y = f(x, h, n, α) [Acero, 1990]
- α denotes the phase term between noise and speech [Deng et al., 2004]

[Diagram: clean speech x and channel noise h combine (⊕) with additive noise n to give noisy speech y]
Noise compensation for SGMM
The mismatch function is
y = f(x, h, n, α)
  = x + h + C log[ 1 + exp(C⁻¹(n − x − h)) + 2α • exp(C⁻¹(n − x − h)/2) ],   (1)

where C is the DCT matrix, and the last term inside the brackets is the phase term.
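Equation (1) can be evaluated directly. The sketch below uses an orthogonal DCT-II matrix for C (so C⁻¹ = Cᵀ), which is one common convention; the paper's exact normalisation and truncation of C may differ:

```python
import numpy as np

def dct_matrix(d):
    """Orthogonal DCT-II matrix; rows are the DCT basis vectors."""
    k = np.arange(d)[:, None]
    m = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi / d * (m + 0.5) * k)
    C[0] /= np.sqrt(2.0)
    return C

def mismatch(x, h, n, alpha, C):
    """Eq. (1): noisy cepstrum y from clean speech x, channel h,
    additive noise n, and phase factor alpha (C orthogonal, so C^-1 = C.T)."""
    e = C.T @ (n - x - h)                  # back to the log-spectral domain
    phase = 2.0 * alpha * np.exp(e / 2.0)  # phase-sensitive term
    return x + h + C @ np.log1p(np.exp(e) + phase)

d = 13
C = dct_matrix(d)
x = np.zeros(d)
h = np.zeros(d)
n = C @ (-30.0 * np.ones(d))   # very quiet noise, set in the log-spectral domain
y = mismatch(x, h, n, 0.0, C)  # with negligible noise, y is close to x + h
```

As a sanity check, when the noise level is far below the speech level the log term vanishes and y reduces to x + h, as expected from (1).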
Noise compensation
- Aim: estimate μ_y and Σ_y for each Gaussian component
- Difficulty: y = f(x, h, n, α) is highly nonlinear; there is no analytic solution!
- Solution: vector Taylor series (VTS) approximation [Moreno et al., 1996]
- Cost: real-time factor > 100 and memory > 10 GB for a (medium-sized) SGMM with 6.4M Gaussians
- Inelegant: directly applying VTS destroys the compact structure of SGMMs
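For reference, the first-order VTS expansion (with α = 0) can be sketched as below. This is a generic illustration of the approximation, not the paper's SGMM-specific implementation; the toy example takes C = I, i.e. it operates directly in the log-spectral domain:

```python
import numpy as np

def vts_moments(mu_x, Sigma_x, mu_n, Sigma_n, mu_h, C):
    """First-order VTS moments of y around (mu_x, mu_h, mu_n), with alpha = 0.
    C is an orthogonal DCT matrix (C^-1 = C.T); a sketch, not the paper's code."""
    u = C.T @ (mu_n - mu_x - mu_h)
    mu_y = mu_x + mu_h + C @ np.log1p(np.exp(u))     # f at the expansion point
    G = C @ np.diag(1.0 / (1.0 + np.exp(u))) @ C.T   # Jacobian dy/dx
    F = np.eye(len(mu_x)) - G                        # Jacobian dy/dn
    Sigma_y = G @ Sigma_x @ G.T + F @ Sigma_n @ F.T  # channel h treated as fixed
    return mu_y, Sigma_y

d = 4
C = np.eye(d)              # identity for this toy: stay in the log-spectral domain
mu_x, mu_h = np.zeros(d), np.zeros(d)
mu_n = -10.0 * np.ones(d)  # noise well below the speech level
mu_y, Sigma_y = vts_moments(mu_x, np.eye(d), mu_n, 0.5 * np.eye(d), mu_h, C)
# with negligible noise, mu_y ≈ mu_x and Sigma_y ≈ Sigma_x
```

Applying this per Gaussian is what makes direct VTS so expensive here: with 6.4M Gaussians, every component needs its own Jacobians and compensated full covariance.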
Noise compensation
- Solution: joint uncertainty decoding (JUD)

[Figure: VTS vs. JUD compensation schemes]
Noise compensation
- Applying JUD to SGMM

[Figure: JUD transforms attached to the acoustic-space regions 1, 2, 3, . . . , I]

- Cost: real-time factor ∼ 10 for an SGMM with 6.4M Gaussians
Experiments
- Database
  - Aurora 4 dataset
  - Clean speech and noisy speech with SNR in [5 dB, 15 dB]
  - Close-talking microphone and desk-mounted microphone
  - ∼15 hours of training data
  - 330 test utterances
- System configuration
  - 39-dim MFCCs
  - #triphone states: 3.1k (GMM) vs. 3.9k (SGMM)
  - #Gaussians: 50k (GMM) vs. 6.4M (SGMM)
  - #regression classes: 112 (GMM) vs. 400 (SGMM)
Noise compensation experiments
[Bar chart: WER of GMM and SGMM systems for the baseline, JUD, and VTS]
Experiments
Results obtained by tuning the value of the phase factor:

[Plot: word error rate (%) vs. phase factor value for the VTS/GMM, JUD/GMM, and JUD/SGMM systems]
- The JUD/SGMM system achieved 16.8% WER on the Aurora 4 database
Remarks
- The phase term is very effective for noise compensation
- Similar improvements were also observed in other studies, e.g. [Li et al., 2009]
- One possible reason is that it compensates for the linearization bias and performs domain compensation [Li et al., 2009]
- Our insight is that it may help to avoid over-estimation of the noise model
Conclusion
- SGMM is a promising alternative for acoustic modelling
- Noise compensation using JUD works well for SGMMs
- The phase term is particularly effective for noise compensation
- Future work: noise adaptive training, and compensation in the log-spectral domain
Noise compensation
- With JUD, the marginal likelihood can be obtained as

p(y | m) ≈ |A^(r)| N(A^(r) y + b^(r); μ_m, Σ_m + Σ_b^(r)).   (2)
- The transformation is applied in the feature space, to each frame
- Computation is saved since #frames ≪ #Gaussians
- The transformation should be diagonalized in GMM systems, but need not be in the SGMM system, since we use full covariance matrices
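A minimal sketch of evaluating the log of Eq. (2) for one component follows; names and shapes are illustrative (A, b, Σ_b are the regression-class transform, μ_m, Σ_m the clean-model component):

```python
import numpy as np

def jud_loglike(y, A, b, mu_m, Sigma_m, Sigma_b):
    """log p(y|m) ≈ log|A| + log N(A y + b; mu_m, Sigma_m + Sigma_b), cf. Eq. (2)."""
    d = len(y)
    z = A @ y + b - mu_m
    S = Sigma_m + Sigma_b               # full covariance is fine in the SGMM case
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_A = np.linalg.slogdet(A)  # log |A|
    quad = z @ np.linalg.solve(S, z)
    return logdet_A - 0.5 * (d * np.log(2.0 * np.pi) + logdet_S + quad)

# Sanity check: identity transform, zero bias, unit covariance
d = 3
ll = jud_loglike(np.zeros(d), np.eye(d), np.zeros(d),
                 np.zeros(d), np.eye(d), np.zeros((d, d)))
```

Since A^(r), b^(r), and Σ_b^(r) are shared by all Gaussians in a region, the transform is applied once per frame per region, which is where the saving over per-Gaussian VTS comes from.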
Experiments
Table: GMM systems with α = 0.

Methods       Clean   Avg
Clean model    7.7   59.3
MTR model     12.7   26.9
VTS            7.3   18.3
JUD            7.0   21.1

Table: SGMM systems with α = 0.

Methods       Clean   Avg
Clean model    5.2   59.9
MTR model      6.8   22.2
JUD            5.3   20.3
References

Acero, A. (1990). Acoustic and Environmental Robustness in Automatic Speech Recognition. PhD thesis, Carnegie Mellon University.

Deng, L., Droppo, J., and Acero, A. (2004). Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise. IEEE Transactions on Speech and Audio Processing, 12(2):133–143.

Droppo, J., Acero, A., and Deng, L. (2002). Uncertainty decoding with SPLICE for noise robust speech recognition. In Proc. ICASSP. IEEE.

Gales, M. (1995). Model-based Techniques for Noise Robust Speech Recognition. PhD thesis, Cambridge University.
Hu, Y. and Huo, Q. (2006). An HMM compensation approach using unscented transformation for noisy speech recognition. Chinese Spoken Language Processing, pages 346–357.

Li, J., Deng, L., Yu, D., Gong, Y., and Acero, A. (2009). A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Computer Speech & Language, 23(3):389–405.

Liao, H. and Gales, M. (2005). Joint uncertainty decoding for noise robust speech recognition. In Proc. INTERSPEECH.

Moreno, P., Raj, B., and Stern, R. (1996). A vector Taylor series approach for environment-independent speech recognition. In Proc. ICASSP, volume 2, pages 733–736. IEEE.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., Glembek, O., Goel, N., Karafiat, M., Rastrow, A., Rose, R., Schwarz, P., and Thomas, S. (2011). The subspace Gaussian mixture model—A structured model for speech recognition. Computer Speech & Language, 25(2):404–439.