Modified MFCCs for Robust Speaker Recognition


  • 8/12/2019 Modified MFCCs for Robust Speaker Recognition


    Modified MFCCs for robust speaker recognition

    a a J i

    State Key Laboratory for Novel Software Technology,

    Nanjing University, Nanjing, China

    [email protected]

    Abstract-Mel-scale frequency cepstrum coefficients (MFCCs) are commonly used features in speaker recognition systems, but MFCC values are not very robust in the presence of noise. Therefore, modified MFCCs (named SMN-CMN-MFCC) based on the general noisy speech model are proposed in this paper. They use spectrum mean normalization (SMN) to suppress the additive noise and cepstral mean normalization (CMN) to remove the effect of convolutional noise. Theoretical analyses show that the combination of SMN and CMN can inhibit additive and convolutional noise at the same time. To verify the performance of the SMN-CMN-MFCC, we have conducted some speaker recognition experiments. With the same convolutional noise component, the additive white noise experiment and the additive factory noise experiment show that SMN-CMN-MFCC provides 10.5% and 9.6% relative improvement over the conventional MFCC and ΔMFCC features, respectively.

    Keywords-Mel-scale Frequency Cepstral Coefficient; feature extraction; speaker recognition

    Institute of Computer Application and Research,

    Chi University, Chi, China

    978-1-4244-6585-9/10/$26.00 ©2010 IEEE

    I. INTRODUCTION

    Most speaker recognition systems perform well in low-noise, low-stress environments. However, because of the complex nature of the speech signal and the various noises that exist in almost every practical application environment, the performance degrades dramatically once the system is applied in a noisy environment. Reducing the effect of noise is therefore critical to a practical speaker recognition system. In principle, it is possible to use generic noise suppression techniques to enhance the quality of the original time-domain signal prior to the feature transformation. However, signal enhancement as an additional step in the entire recognition process increases the computational load. Hence, quite a few improved approaches that produce better recognition performance in noisy environments have been proposed, such as [1][2][3]. Meanwhile, more and more new noise-robust characteristic features have been developed, such as the sub-band Mel-spectrum centroid (SMSC) proposed in [4], but these features can only suppress the effect of some simple or specific noise.

    As a matter of convenience for research, it is usually assumed that the noise is stationary or slowly changing non-stationary noise. On the premise of this hypothesis, the power spectrum of the noise changes more slowly than that of the speech signal. So, as long as we develop a kind of filter that can remove the direct current (DC) sub-component and the slowly changing sub-components, such as the Relative Spectral (RASTA) algorithm [5] and the cepstral mean normalization (CMN) [6], the robustness of speaker recognition can be improved. Currently, CMN is widely used for its comparative simplicity and effectiveness.

    As we all know, Mel-scale frequency cepstrum coefficients (MFCCs) are currently the most commonly used features in speaker recognition systems. Furthermore, dynamic features such as delta and delta-delta cepstra have been shown to play an essential role in capturing the transitional characteristics of the speech signal. So MFCC, ΔMFCC [7], and other related features such as delta cepstral energy (DCE) [8] and delta-delta cepstral energy (DDCE) have also been introduced into speaker recognition systems. Since MFCC values are not very robust in the presence of noise, researchers have proposed various modifications to the basic MFCC to improve robustness. In this paper, we propose modified MFCCs, named SMN-CMN-MFCC.

    The remainder of this paper is organized as follows. Sections 2 and 3 study the general anti-noise principle of feature difference and feature mean normalization, respectively. Section 4 briefly introduces the simplified noisy speech model. Accordingly, SMN-CMN-MFCC using the Spectral Mean Normalization (SMN) and CMN is proposed in Section 5. Section 6 covers the experiments and discussions. Finally, a brief concluding remark is given in Section 7.

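    II. THE GENERAL ANTI-NOISE PRINCIPLE OF FEATURE DIFFERENCE

    Generally, what speaker recognition fundamentally deals with is not so much the speech signal as the observation sequence of the speech signal. The observation sequence contains the various characteristic parameters extracted from the speaker's speech frames, and it can be considered as a multidimensional signal $y(m,k)$, where $m$ is the frame number of the feature, and each frame has a duration of $k$ samples of the speech signal. Let us suppose that the observation sequence $y(m,k)$ can be decomposed into two mutually independent components, as in

    $$y(m,k) = x(m,k) + c(m,k), \tag{1}$$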


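    where $x(m,k)$ and $c(m,k)$ represent the observation sequences that are assumed to come from the clean speech and the noise, respectively. With the further assumption that the noise is stationary and uncorrelated with the speech, $c(m,k)$ becomes a constant that has no relation to $m$. Hence (1) can be expressed as

    $$y(m,k) = x(m,k) + c(k). \tag{2}$$

    Let us consider the first-order derivative with respect to $m$ on both sides of formula (2), that is,

    $$\Delta y(m,k) = \frac{\partial y(m,k)}{\partial m} = \frac{\partial x(m,k)}{\partial m} = \Delta x(m,k), \quad 0 \le k \le N-1, \tag{3}$$

    where $N$ is the total frame number of the given speech signal.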

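    It is clear that the noisy observation sequence and the clean speech observations have the same first-order difference with respect to $m$, which does not depend on the noise. Thus, anti-noise characteristics can be obtained by replacing $y(m,k)$ with $\Delta y(m,k)$. However, the direct use of the first-order difference amplifies noise. Therefore, given the observation sequence $y(m+t,k)$, we calculate $\Delta y(m,k)$ by employing the quadratic polynomial regression of $y(m+t,k)$, as in

    $$y(m+t,k) \approx h_1 + h_2 t + h_3 t^2, \quad t = -L, -L+1, \ldots, 0, 1, \ldots, L, \tag{4}$$

    where $L$ represents the duration for calculating the difference, and $h_1$, $h_2$, $h_3$ are polynomial coefficients. According to the minimum square error criterion, we get

    $$\Delta y(m,k) \approx \left.\frac{\partial y(m+t,k)}{\partial t}\right|_{t=0} = h_2 = \frac{\sum_{t=-L}^{L} t\, y(m+t,k)}{\sum_{t=-L}^{L} t^2}, \quad 0 \le m \le M-1,\ 0 \le k \le N-1. \tag{5}$$

    Formula (5) is the linear regression formula used to calculate the first-order difference of parameters such as ΔMFCC.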
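The regression in (5) can be illustrated with a short NumPy sketch. This is not the authors' code: the function name `delta_regression`, the array layout (frames along the first axis), and the edge-frame padding are assumptions made for the illustration. It also checks numerically that a frame-independent offset $c(k)$ leaves the difference feature unchanged, as (3) predicts.

```python
import numpy as np

def delta_regression(features, L=2):
    """First-order difference per Eq. (5): sum_t t*y(m+t,k) / sum_t t^2.

    features: array of shape (M frames, K coefficients).
    Assumption: edge frames are repeated so that y(m+t, k) exists for every m.
    """
    features = np.asarray(features, dtype=float)
    M = features.shape[0]
    padded = np.pad(features, ((L, L), (0, 0)), mode="edge")
    denom = 2.0 * sum(t * t for t in range(1, L + 1))  # sum of t^2 over -L..L
    delta = np.zeros_like(features)
    for t in range(-L, L + 1):
        # padded[L + t + m] corresponds to frame m + t of the original array
        delta += t * padded[L + t : L + t + M]
    return delta / denom

# A stationary offset c(k), constant over frames, does not change the delta.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 13))   # stand-in "clean" feature sequence
c = rng.normal(size=(1, 13))    # stand-in noise term c(k)
d_clean = delta_regression(x)
d_noisy = delta_regression(x + c)

# A linear ramp y = 3m has slope 3 away from the padded edges.
ramp = delta_regression(np.arange(10.0)[:, None] * 3.0, L=2)
```

Away from the first and last `L` frames the result does not depend on the padding choice; a different boundary rule would only change those edge frames.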

    III. THE GENERAL ANTI-NOISE PRINCIPLE OF FEATURE MEAN NORMALIZATION

    Let us consider the average of $y(m,k)$ in formula (2) with respect to the frame index $m$, which is defined as

    $$E[y(m,k)] = \frac{1}{N}\sum_{m=1}^{N} y(m,k), \tag{6}$$

    where $N$ is the total frame number of the given speech signal. By subtracting $E[y(m,k)]$ from $y(m,k)$, we get

    $$\tilde{y}(m,k) = y(m,k) - E[y(m,k)] = \{x(m,k) + c(k)\} - \{E[x(m,k)] + c(k)\} = x(m,k) - E[x(m,k)] = \tilde{x}(m,k), \tag{7}$$

    where $\tilde{y}(m,k)$ denotes the mean normalized feature of $y(m,k)$. As we can see from (7), the mean normalized feature of the noisy observation sequence is equal to that of the clean speech and no longer depends on the noise. Thereby, it can be concluded that any feature that satisfies (2) can be mean normalized to improve its noise robustness.

    From the above analysis, we see that both difference and mean normalization can be used to improve the anti-noise performance of the features. The difference operation is itself more sensitive to noise, and it essentially transforms the original features into the corresponding dynamic features, so the properties of the characteristic features are changed. The mean normalization operation is relatively stable, and it only subtracts the mean of the features, so the properties of the characteristic features do not change.
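Equation (7) can be checked numerically with a few lines of NumPy. This is a minimal sketch under the model of (2): the function name `mean_normalize` and the random stand-in data are assumptions for illustration only.

```python
import numpy as np

def mean_normalize(features):
    """Subtract the per-coefficient mean over frames, as in Eqs. (6) and (7).

    features: array of shape (N frames, K coefficients).
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 13))  # stand-in clean observation sequence x(m,k)
c = rng.normal(size=(1, 13))    # stationary noise term c(k), constant over m
y = x + c                       # noisy observation sequence, Eq. (2)

y_tilde = mean_normalize(y)
x_tilde = mean_normalize(x)
```

Here `y_tilde` and `x_tilde` agree up to floating-point error, because the offset `c` is absorbed into the mean that is subtracted.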

    IV. THE GENERAL NOISY SPEECH MODEL

    Considering all the potential interference factors over the whole speech signal transmission channel, Hansen & Arslan [9] proposed a general noisy speech generation and transmission model. Considering that the noise can finally be reduced to an equivalent additive noise and an equivalent convolutional noise, we get a simplified noisy speech model as

    $$y(n) = x(n) * h(n) + w(n). \tag{8}$$

    V. MODIFIED MFCC

    From basic signal processing knowledge, we know that when the noise is stationary and not correlated with the signal, the simplest feature that satisfies (2) is the autocorrelation or power spectrum of the signal, and the characteristics of the power spectrum are the foundation of the most commonly used speaker features, including the MFCC. So we first focus our investigation on the conventional MFCC extraction approach, step by step. Then, according to the general noisy speech model, the modified MFCCs for speaker recognition are derived theoretically by using the spectral mean normalization and the cepstral mean normalization together.

    Given that the speech signal has been sliced into $M$ frames, and each frame contains $N$ samples, we can construct a frame-based noisy speech model from (8) as

    $$y(m,n) = x(m,n) * h(m,n) + w(m,n), \quad 0 \le m \le M-1,\ 0 \le n \le N-1, \tag{9}$$

    where $y(m,n)$ denotes the noisy speech signal, $x(m,n)$ denotes the clean speech signal, $h(m,n)$ represents the convolutional noise, and $w(m,n)$ is the additive noise. Here we not only assume that the additive noise is stationary and uncorrelated with the speech, but also assume that the power spectrum of the convolutional noise is stationary or changes considerably slowly. The second assumption means that $h$ is linear and shift invariant; thus, (9) can be simplified as

    $$y(m,n) = x(m,n) * h(n) + w(m,n). \tag{10}$$

    By expressing (10) in the power spectrum, where $n$ now indexes the frequency bin, we get

    $$P_y(m,n) = P_x(m,n)\,|H(n)|^2 + P_w(n), \tag{11}$$

    where $P$ denotes the power spectrum of the signal marked by its subscript, and $H$ is the transfer function of $h$. Now applying the same mean normalization to (11) as illustrated in (7) leads to the relation

    $$\tilde{P}_y(m,n) = P_y(m,n) - E[P_y(m,n)] = \{P_x(m,n) - E[P_x(m,n)]\}\,|H(n)|^2 = \tilde{P}_x(m,n)\,|H(n)|^2. \tag{12}$$

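    We can see that the effect of the additive noise has been eliminated in (12). As in the conventional MFCC extraction approach, our next step is to apply the Mel-scale frequency filter bank to (12). The Mel-scale frequency filter bank can be viewed as applying a certain weighting to $\tilde{P}_y(m,n)$. So if we assume that $|H(n)|^2$ is constant or almost constant within every critical frequency band, then the output of the Mel-scale frequency filter bank is

    $$A(m,d) = \sum_{n} w(n,d)\,\tilde{P}_x(m,n)\,|H(n)|^2, \tag{13}$$

    where $w(n,d)$ is the weighting coefficient, which is independent of the frame index $m$, and $d$ denotes the index of the filter in the Mel-scale frequency filter bank.

    Taking the logarithm on both sides of (13), and writing $|H_d|^2$ for the (assumed constant) value of $|H(n)|^2$ within the $d$-th band, we obtain

    $$\log A(m,d) = \log\Big[\sum_{n} w(n,d)\,\tilde{P}_x(m,n)\Big] + \log |H_d|^2. \tag{14}$$

    Taking the discrete cosine transform (DCT) on both sides of (14) yields

    $$c_y(m,d) = \mathrm{DCT}\{\log A(m,d)\} = \mathrm{DCT}\Big\{\log\Big[\sum_{n} w(n,d)\,\tilde{P}_x(m,n)\Big]\Big\} + \mathrm{DCT}\{\log |H_d|^2\} = c_x(m,d) + c_h(d), \tag{15}$$

    where $c_x(m,d)$ and $c_y(m,d)$ are defined as the spectral mean normalized MFCC (SMN-MFCC) coefficients with respect to the imaginary clean speech and to the original noisy speech, respectively, and $c_h(d)$ is the SMN-MFCC coefficient with respect to the convolutional noise.

    Now applying the same mean normalization to (15) as illustrated in (7) and (12) again leads to the relation

    $$\tilde{c}_y(m,d) = c_y(m,d) - E[c_y(m,d)] = \{c_x(m,d) + c_h(d)\} - \{E[c_x(m,d)] + c_h(d)\} = c_x(m,d) - E[c_x(m,d)] = \tilde{c}_x(m,d), \tag{16}$$

    where $\tilde{c}_x(m,d)$ and $\tilde{c}_y(m,d)$ are the CMN-SMN-MFCC coefficients with respect to the imaginary clean speech and to the original noisy speech, respectively. Consequently, $\tilde{c}_y(m,d)$ equals $\tilde{c}_x(m,d)$, which means that SMN-CMN-MFCC is noise robust: SMN suppresses the additive noise while CMN suppresses the convolutional noise at the same time. The resulting extraction approach of SMN-CMN-MFCC is shown in Fig. 1.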

    Figure 1. Extraction approach of the SMN-CMN-MFCC
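The chain of operations in (11) to (16) can be sketched end to end in NumPy. This is an illustrative sketch, not the authors' implementation. Several details are assumptions because the text does not specify them: the Hamming window, the triangular filter shapes, the number of cepstral coefficients (`n_ceps=12`), the unnormalized DCT-II, and in particular the flooring of the filter-bank outputs before the logarithm (a mean-normalized power spectrum can be negative, and the text does not say how that case is handled). The frame length of 256, the step of 80 samples, the 18 filters, the 8 kHz rate and the 0.95 pre-emphasis coefficient follow the experimental description.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel-scale filter bank, shape (n_filters, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for d in range(1, n_filters + 1):
        left, center, right = bins[d - 1], bins[d], bins[d + 1]
        for n in range(left, center):
            fb[d - 1, n] = (n - left) / (center - left)
        for n in range(center, right):
            fb[d - 1, n] = (right - n) / (right - center)
    return fb

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, shape (n_out, n_in)."""
    d = np.arange(n_in)
    q = np.arange(n_out)[:, None]
    return np.cos(np.pi * q * (2 * d + 1) / (2.0 * n_in))

def smn_cmn_mfcc(signal, sample_rate=8000, frame_len=256, hop=80,
                 n_filters=18, n_ceps=12, floor=1e-10):
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis with (1 - 0.95 z^-1)
    emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])
    # Slice into overlapping frames and window them (Hamming is an assumption)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum P_y(m, n)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # SMN: subtract the mean over frames in every frequency bin, Eq. (12)
    power_smn = power - power.mean(axis=0, keepdims=True)
    # Mel filter bank weighting, Eq. (13)
    bank_out = power_smn @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # Assumption: floor non-positive outputs so the logarithm is defined
    bank_out = np.maximum(bank_out, floor)
    # Log and DCT, Eqs. (14) and (15)
    ceps = np.log(bank_out) @ dct_matrix(n_ceps, n_filters).T
    # CMN: subtract the mean over frames in every cepstral bin, Eq. (16)
    return ceps - ceps.mean(axis=0, keepdims=True)

# One second of stand-in audio at 8 kHz gives 97 frames of 12 coefficients.
features = smn_cmn_mfcc(np.random.default_rng(3).normal(size=8000))
```

The flooring step is the weakest part of this sketch: wherever it is active, the exact cancellation in (16) no longer holds, so the choice of `floor` should be treated as a tunable assumption rather than part of the method.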

    VI. EXPERIMENTS

    A. Experimental Setup

    Speaker recognition experiments have been conducted to test the performance of the modified MFCC. In the experiments, we use real clean speech recordings with a sampling rate of 8 kHz, and the speech analysis frame length is set to 256 samples with a skip of 80 samples. From the recorded speech, silence and low-energy speech parts are removed using a general energy detection technique, and the frames that have higher energy than the pre-defined threshold are selected and concatenated to form the experimental speech signal.

    To obtain the noisy speech for recognition, the clean speech is fed into a 10-tap FIR filter, which is used to simulate the convolutional noise and whose amplitude response is shown in Fig. 2. The filter outputs are then mixed with the additive noise according to the given signal-to-noise ratio (SNR). Here we select two kinds of additive noise: one is White Gaussian Noise (WGN) and the other is Factory Noise [10].
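The noisy-speech generation described above (FIR filtering followed by mixing at a target SNR) might look as follows in NumPy. This is a sketch under stated assumptions: the function names are hypothetical, the filter taps are random placeholders because the text only shows the amplitude response of the actual 10-tap filter, and the SNR is measured against the filtered speech.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db (in dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def simulate_noisy_speech(clean, fir_taps, noise, snr_db):
    """Convolutional noise via an FIR channel, then additive noise at snr_db."""
    filtered = np.convolve(clean, fir_taps, mode="full")[: len(clean)]
    return add_noise_at_snr(filtered, noise, snr_db)

rng = np.random.default_rng(2)
clean = rng.normal(size=8000)   # stand-in for one second of clean speech
taps = rng.normal(size=10)      # hypothetical 10-tap FIR channel
noise = rng.normal(size=8000)   # stand-in for white Gaussian noise
noisy = simulate_noisy_speech(clean, taps, noise, snr_db=12.0)
```

Measuring the SNR after the channel filter matches the order of operations in the text; measuring it against the unfiltered clean speech would give a different noise level whenever the filter changes the signal power.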


    Figure 2. Amplitude response of the 10-tap FIR channel filter used to simulate the convolutional noise (horizontal axis: normalized frequency, ×π rad/sample)

    As for feature extraction, the clean speech and the noisy speech are all pre-emphasized with the filter $(1 - 0.95z^{-1})$, and there are 18 filters in the Mel-scale frequency filter bank. As for recognition, a Gaussian mixture model (GMM) with 32 Gaussian components is selected as the speaker model. There are 10 male speakers; each provides one segment of speech for training his GMM model and 10 one-second segments of speech for the recognition test.

    B. Experimental Results

    For the clean speech, the WGN-corrupted noisy speech and the Factory-noise-corrupted noisy speech at 24, 18, 12, 6 and 0 dB SNR, respectively, MFCC, MFCC+ΔMFCC and the proposed CMN-SMN-MFCC are extracted and used with the GMM model, and the recognition rates are shown in Table 1 and Table 2.

    TABLE 1. RECOGNITION RATE UNDER WGN AND CONVOLUTIONAL NOISE CONDITION (%)

    Speaker feature        SNR (dB)
                           clean   24   18   12    6    0
    MFCC                      77   75   63   45   23   13
    MFCC+ΔMFCC (L=1)          83   76   65   46   25   15
    MFCC+ΔMFCC (L=2)          79   75        42   22   14
    SMN-CMN-MFCC              74   78   70   52   34   19

    TABLE 2. RECOGNITION RATE UNDER FACTORY NOISE AND CONVOLUTIONAL NOISE CONDITION (%)

    Speaker feature        SNR (dB)
                           clean   24   18   12    6    0
    MFCC                      77   72   59        19   11
    MFCC+ΔMFCC (L=1)          83   74   61   47   21   12
    MFCC+ΔMFCC (L=2)          79   73        40   18   10
    SMN-CMN-MFCC              74   76        50   29   14

    The experimental results show that SMN-CMN-MFCC provides 10.5% and 9.6% relative improvement over the conventional MFCC and ΔMFCC features, respectively, while its performance for clean speech is not significantly affected.


    VII. CONCLUSIONS

    A major deficiency in state-of-the-art automatic speaker recognition (ASR) systems is the lack of robustness in additive and convolutional noisy environments. Aiming to improve the performance of speaker recognition in noisy conditions, in this paper we have first investigated the general anti-noise principle of the difference and the mean normalization for the observation sequence of the speech signal. Then, according to the simplified noisy speech model, we propose a new modified MFCC feature which employs SMN to suppress the additive stationary noise while using CMN to suppress the convolutional stationary noise at the same time. Finally, the SMN-CMN-MFCC has been examined and compared with the common MFCC and ΔMFCC in ASR experiments. The experimental results demonstrate the effectiveness and robustness of the SMN-CMN-MFCC in noisy conditions.

    ACKNOWLEDGMENT

    This work is supported mainly by the Department of Education of Xinjiang Uygur Autonomous Region of China, University Key Project Fund of Scientific Research P, (XJDU200940), 2009, and by a Grant-in-Aid of Ci University, China.

    REFERENCES

    [1] BA Jun-i, ZHANG Shi-ei, ZHANG Shu-wu and XU B, "Robust Speaker Recognition in Noisy Environment," Journal of Chinese Information Processing, vol. 20, pp. 91-97, February 2006.

    [2] Z. H. Chen, Y. F. Liao and Y. T. Juang, "Prosodic Modeling and Eigen-Prosody Analysis for Robust Speaker Recognition," Proc. ICASSP 2005, PA, USA, vol. I, pp. 185-188, March 2005.

    [3] Gong, W.-G., Yang, L.-P., and Chen, D., "Pitch synchronous based feature extraction for noise-robust speaker verification," Proc. Image and Signal Processing (CISP 2008) (May 2008), vol. 5, pp. 295-298.

    [4] DG Jing, ZHENG Fang, L Jian and WU Wenhu, "Using subband Mel-spectrum centroid and Gaussian mixture correlation for robust speaker identification," Acta Acustica, vol. 5, pp. 471-475, 2006.

    [5] H. Hermansky, N. Morgan, "RASTA processing of speech signal," IEEE Trans. on Speech and Audio Processing, vol. 4, pp. 578-589, February 1994.

    [6] F. H. Liu, A. Acero, R. Stern, "Efficient joint compensation of speech for the effects of additive noise and linear filtering," Proc. of IEEE ICASSP, pp. 257-260, January 1992.

    [7] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, NJ, 1993.

    [8] Nosratighods, M., Ambikairajah, E. and Epps, J., "Speaker Verification Using A Novel Set of Dynamic Features," Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, Hong Kong, China, vol. 4, pp. 266-269, September 2006.

    [9] J. H. L. Hansen, L. M. Arslan, "Robust feature-estimation and objective quality assessment for noisy speech recognition using credit card corpus," IEEE Trans. on Speech and Audio Processing, 1995, 3(3): 169-184.

    [10] A. P. Varga, H. J. M. Steeneken, M. Tomlinson, D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Documentation included in the NOISEX-92 CD-ROMs, 1992.