Contextual invariant-integration features for improved speaker-independent speech recognition



Speech Communication 53 (2011) 830–841

Florian Müller *, Alfred Mertins

Institute for Signal Processing, University of Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany

Received 26 March 2010; received in revised form 23 December 2010; accepted 1 February 2011. Available online 12 February 2011.

doi:10.1016/j.specom.2011.02.002

* Corresponding author. E-mail addresses: [email protected] (F. Müller), [email protected] (A. Mertins).

Abstract

This work presents a feature-extraction method that is based on the theory of invariant integration. The invariant-integration features are derived from an extended time period, and their computation has a very low complexity. Recognition experiments show a superior performance of the presented feature type compared to cepstral coefficients using a mel filterbank (MFCCs) or a gammatone filterbank (GTCCs) in matching as well as in mismatching training-testing conditions. Even without any speaker adaptation, the presented features yield accuracies that are higher than for MFCCs combined with vocal tract length normalization (VTLN) in matching training-test conditions. Also, it is shown that the invariant-integration features (IIFs) can be successfully combined with additional speaker-adaptation methods to further increase the accuracy. In addition to standard MFCCs, contextual MFCCs are also introduced. Their performance lies between that of MFCCs and IIFs.
© 2011 Elsevier B.V. All rights reserved.

Keywords: Speech recognition; Speaker-independency; Invariant-integration

1. Introduction

A wide variety of applications of automatic speech recognition (ASR) systems can be found nowadays. Though major advances are reported regularly, the performance of ASR systems is generally still far below that of humans, which degrades user acceptance. Different reasons for this performance gap can be identified, for example, sensitivity to environmental noise or to the characteristics of the speech-transmission channel. A general problem in speaker-independent ASR is the high variability that is inherent in human speech. Benzeghiba et al. (2007) give a detailed review of different kinds of variabilities in ASR. Two broad groups of variabilities can be defined: extrinsic (non-speaker related) and intrinsic (speaker-related) variabilities. Environmental noise and transmission artifacts are two examples of extrinsic variabilities.


Besides varying accents, speaking styles and rates, age, and emotional state, it is the shape of the vocal tract that intrinsically contributes to the variable appearance of speech signals representing the same textual content. The problems originating from different vocal tract lengths (VTLs) become especially apparent in mismatching training-testing conditions. For example, if children use an ASR system whose acoustic models have only been trained with adult data, the recognition performance degrades significantly compared to the performance for adult users. Therefore, in speaker-independent ASR systems, one often uses speaker-adaptation techniques to reduce the influence of speaker-related variabilities. There is ongoing work addressing each of the variabilities mentioned above. The work presented in this paper focuses on the VTL as a source of variability between individual speakers.

A common model of human speech production is the source-filter model (Huang et al., 2001). In this model, the source corresponds to the air stream originating from the lungs, and the filter corresponds to the vocal tract, which is located between the glottis and the lips and is composed of different cavities. The VTL describes the distance between the glottis and the lips. On average, the VTL is about 14 cm for adult women and 17 cm for adult men (Boë et al., 2006). The locations of the vocal tract's resonance frequencies (the "formants") shape the overall short-time spectrum and define the phonetic content. The spectral effects of different VTLs have been widely studied; see Benzeghiba et al. (2007) and references therein. An important observation is that, while the absolute formant positions of individual phones are speaker specific, their relative positions for different speakers are somewhat constant. A relation that describes this observation is given by considering a uniform tube model with length l. Here, the resonances occur at frequencies F_i = c(2i-1)/(4l), i = 1, 2, 3, ..., where c is the speed of sound (Deller et al., 1993). Using this model, the spectra S of the same utterance from two speakers A and B with different VTLs are related by a scaling factor α_S, which is also known as the frequency-warping factor:

S_A(\omega) = S_B(\alpha_S \cdot \omega). \qquad (1)

Though Eq. (1) is only a rough approximation of the real relationship between spectra from speakers with different VTLs, methods that try to achieve speaker independency for an ASR system commonly take this relationship as their fundamental assumption.
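As a quick numerical illustration of this model (our own example, not from the paper; the tube lengths are the average values quoted above, and the speed of sound is an assumed 343 m/s), the following Python sketch evaluates the tube resonances and shows how a single warping factor α_S relates the formants of two speakers:

```python
# Illustration of the uniform tube model F_i = c*(2i-1)/(4l) and the
# frequency-warping factor alpha_S; all numeric values are assumptions.
C = 343.0  # speed of sound in m/s (assumed)

def tube_resonances(length_m, n=3):
    """First n resonances of a uniform tube of the given length."""
    return [C * (2 * i - 1) / (4 * length_m) for i in range(1, n + 1)]

f_female = tube_resonances(0.14)  # ~[612.5, 1837.5, 3062.5] Hz
f_male = tube_resonances(0.17)    # ~[504.4, 1513.2, 2522.1] Hz

# Under Eq. (1), S_A(w) = S_B(alpha_S * w); for uniform tubes the factor
# is the inverse length ratio, so warping the male formants by alpha_S
# recovers the female ones exactly:
alpha_S = 0.17 / 0.14  # ~1.21
print([f * alpha_S for f in f_male])  # matches f_female
```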

A time-frequency (TF) analysis of the speech signal is usually the first operation in an ASR feature-extraction stage, after possible preprocessing steps such as pre-emphasis or noise cancelation. This analysis tries to simulate the human auditory system up to a certain degree, and different methods have been proposed. As is done for the computation of the well-known mel frequency cepstral coefficients (MFCCs), a basic approach is to apply the fast Fourier transform (FFT) to windowed short-time signals and to weight the output with a set of triangular bandpass filters in the spectral domain (Huang et al., 2001). Another common filterbank approach uses gammatone filters (Patterson et al., 1992). These filters were shown to fit the impulse response of the human auditory filters well. Both types of TF analysis methods have in common that they locate the center frequencies of the filters evenly spaced on nonlinear, auditorily motivated scales; in case of the MFCCs the mel scale is used (Davis and Mermelstein, 1980), and in case of a gammatone filterbank the equivalent rectangular bandwidth (ERB) scale is used (Patterson et al., 1992; Moore and Glasberg, 1996; Ishizuka et al., 2006). Different works make use of the observation that both the mel and the ERB scale approximately map the spectral scaling described in Eq. (1) to a translation along the subband-index space of the TF analysis. More details will be given below.

The present methods that try to achieve speaker independency can be roughly grouped into three categories. These groups act on different stages of the ASR process and often may be combined within the same ASR system. One group tries to normalize the features after the extraction (Lee and Rose, 1998; Welling et al., 2002; Sinha and Umesh, 2002) by estimating the implicit warping factors of the utterances. These techniques are commonly referred to as VTL normalization (VTLN) methods. A second group of methods adapts the acoustic models to the features of each utterance (Leggetter and Woodland, 1995; Gales, 1998). The use of maximum-likelihood linear regression (MLLR) methods is part of most state-of-the-art recognition systems nowadays. It was shown in (Pitz and Ney, 2005) that certain types of VTLN methods are equivalent to constrained MLLR. The third group of methods works on the feature-extraction stage and tries to compute features that are speaker independent. In theory, this group of methods does not need an additional speaker-adaptation step and, hence, promises lower computational costs than the first two groups of methods. While this is true in principle, practical experiments carried out in this work show that even these methods may benefit from further speaker-adaptation techniques.

The concept of computing features that are independent of the VTL has been taken up by several works in the past, and different methods were proposed. In the following, a brief summary of these ideas is presented.

To begin with, Cohen (1993) introduced the scale transformation, which was further investigated for its applicability in the field of ASR by Umesh et al. (1999). Its use in ASR is motivated by the relationship given in Eq. (1). One property of the scale transformation is that the magnitudes of the transformations of two scaled versions of one and the same signal are the same. Thus, the magnitudes can be seen as scaling invariant. The scale cepstrum, which has the same invariance property, was also introduced in the work of Umesh et al. (1999). The scale transformation is a special case of the Mellin transformation. The work of Patterson (2000) describes a so-called auditory image model (AIM) that was extended with the Mellin transformation in the work of Irino and Patterson (2002). Further studies of the Mellin transformation have been conducted, for example, by Sena and Rocchesso (2005).

Various works rely on the assumption that the effects of inter-speaker variability caused by VTL differences are mapped to translations along the subband-index space of an appropriate filterbank analysis (Müller and Mertins, 2009, 2010; Müller et al., 2009; Mertins and Rademacher, 2005, 2006; Rademacher et al., 2006; Monaghan et al., 2008). Mertins and Rademacher (2005, 2006) used a wavelet transformation for the TF analysis and proposed so-called vocal tract length invariant (VTLI) features based on auto- and cross-correlations of wavelet coefficients. Subsequent work (Rademacher et al., 2006) showed that a gammatone filterbank instead of the previously used wavelet filterbank leads to a higher robustness against VTL changes.

Previous works of the authors of this manuscript investigated translation-invariant transformations that were originally developed within the field of image analysis (Müller and Mertins, 2009, 2010; Müller et al., 2009). In the present work, we propose a feature-extraction method that is based on the principle of invariant integration. We refer to these features as invariant-integration features (IIFs) in the following. While preliminary investigations (Müller and Mertins, 2009) showed the applicability of this method as a proof of concept, here the method is substantially enhanced and studied in detail. Formally, the features are designed to be invariant to translations along the subband-index space of the TF representation. The method for their computation is based on the general approach of integrating feature functions over a group action, which is mathematically well founded. It is shown that IIFs can be found that are more robust to different VTLs than standard MFCCs and that, with an optimal parameter choice, even perform superior to MFCCs in combination with speaker adaptation. Besides a comparison to MFCCs in combination with VTLN and MLLR, this article gives a detailed study of the influence of the parameters on the recognition performance. Furthermore, the robustness of the IIFs in different training-testing scenarios as well as their performance on different corpora is investigated.

The following section explains general notions of the theory of invariants and describes three canonical approaches for obtaining invariants. Section 3 describes the approach of invariant integration and shows how it can be applied to the field of speaker-independent ASR. In that section the IIFs are formally defined. Starting with a basic definition of an IIF founded on a theoretical basis, it is further developed to incorporate contextual information for an ASR task. Since the proposed method has a high degree of freedom in its choice of parameters, an appropriate feature-selection method is described. The fourth part of this article describes the experiments together with their results. In this section, in addition to contextual IIFs, the contextual selection of MFCCs is also investigated. Conclusions and an outlook on future work are given in the last section.

Fig. 1. (a) Relation of invariant set I_T and orbit O for a given transformation T and observation x; (b) notion of completeness of a transformation T. After Burkhardt and Siggelkow (2001).

2. Invariant features and their construction

Nonlinear transformations that lead to invariant features have been investigated and successfully applied for decades in the field of pattern recognition. A brief introduction to the general notions and concepts for the construction of invariants is given in the following. It is based on the book chapter by Burkhardt and Siggelkow (2001) and the thesis by Schulz-Mirbach (1995a).

The idea of invariant features is to find a mapping T that is able to extract features which are the same for different observations x of the same equivalence class with respect to a group action G. Such a transformation T maps all observations of an equivalence class into one point of the feature space:

x_1 \sim_G x_2 \;\Rightarrow\; T(x_1) = T(x_2). \qquad (2)

In our case, it means that all frequency-warped versions of the same utterance should result in the same sequence of feature vectors. Given a transformation T and an observation x, the set of all observations that are mapped into one point in the feature space is denoted as the set of invariants I_T(x) of an observation:

I_T(x) = \{x_i \mid T(x_i) = T(x)\}. \qquad (3)

The set of all possible observations within one equivalence class is called the orbit O(x): given a prototype x, all other equivalent observations can be generated by applying the group action G,

O(x) := \{x_i \mid x_i \sim_G x\}. \qquad (4)

A transformation T is said to be complete if the set of invariants of an observation and the orbit of the same observation are equal. Complete transformations have no ambiguities regarding class discrimination. Incomplete transformations, on the other hand, have the property

O(x) \subseteq I_T(x)

and may lead to the same features for observations of different equivalence classes, and thus cannot distinguish them (Burkhardt and Siggelkow, 2001). The described terms are visualized in Fig. 1.

In practice, a “high degree of completeness” is desired, which means that I_T(x) should not be much larger than O(x). Principles to systematically construct invariants with this property can be divided into three categories:

1. Normalization. Here, the observations are transformed with respect to extreme points on the orbit. Usually, a certain subset of the observations is chosen for the parameter estimation of the transformation.

2. Differential approach. This group of techniques obtains invariant features by solving partial differential equations (PDEs). Its main idea is that the features should be insensitive to infinitesimally small variations of the parameter(s) of the group action. Let these parameters


be denoted by b = (b_1, b_2, ..., b_N). For a group element g ∈ G, with g(b) denoting a transformation g ∈ G with parameters b applied to an input signal x, it is demanded that

\frac{\partial T(g(b)x)}{\partial b_i} \equiv 0, \quad i = 1, 2, \ldots, N. \qquad (5)

The solutions of the partial differential equations are the invariants.

3. Integral approach. The idea of the third group of methods is to compute averages of arbitrary functions over the entire orbit. Hurwitz introduced the principle of integrating over the transformation group for constructing invariant features in 1897 (Hurwitz, 1897). The resulting integral is independent of the parameter(s) of the group action G:

T_f(x) = \frac{1}{|G|} \int_G f(gx)\, dg, \qquad (6)

where |G| := \int_G dg is the “power” of the group, and f is a (possibly complex-valued) kernel function.

One drawback of the normalization approach is the problem of a proper choice of a subset of the parameters. Another disadvantage is that a reduction of dimensionality is not given with this method. For the differential approach, solving the PDEs is the main difficulty in practice. In contrast, the integral approach does not need any preprocessing and, as will be shown in the following, is easily applied in the context of this work. Successful applications of this approach in the field of image processing were described, for example, by Schulz-Mirbach (1995b) and Siggelkow (2002).

The three described principles represent general approaches for the construction of invariants for arbitrary transformation groups with a high degree of completeness. Certainly, there exist other methods that are invariant to specific transformations. However, the set of invariants I_T(x) of these methods is generally much larger than the orbit O(x). This means that the generated features are invariant to more operations than desired. Examples of translation-invariant transformations are the magnitude of the Fourier transformation as well as the auto- and cross-correlation functions. Both the Fourier-transform magnitude and the autocorrelation of a signal remain unchanged when applied to translated versions of the same signal. For the cross-correlation of two signals, both signals may be translated by the same arbitrary amount without affecting the result. Known invariant transformations from the field of image analysis are the generalized cyclic transformations (Lohweg et al., 2004) and the transformations of the class CT (Burkhardt and Müller, 1980), which includes the rapid transformation (Reitboeck and Brody, 1969). Although, theoretically, these transformations have a larger set of invariants compared to the methods of the three described groups, all of them proved to be valuable in different kinds of applications (Fang et al., 1989; Lohweg et al., 2004).
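Before applying these notions to ASR, a small self-contained check may be helpful (illustrative code, not part of the original paper): it verifies numerically that the Fourier-transform magnitude and the circular autocorrelation named above are indeed unchanged under translation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)  # arbitrary test signal
y = np.roll(x, 5)            # circularly translated version of x

# Fourier-transform magnitude: identical for the signal and its translate
assert np.allclose(np.abs(np.fft.fft(x)), np.abs(np.fft.fft(y)))

# Circular autocorrelation via the Wiener-Khinchin relation: it depends
# only on |X(k)|^2 and is therefore translation invariant as well
acf = lambda s: np.fft.ifft(np.abs(np.fft.fft(s)) ** 2).real
assert np.allclose(acf(x), acf(y))
```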

3. Feature computation and selection

This section introduces in its first part the contextual IIFs and their application to ASR. As it turns out, the proposed features have a high degree of freedom in the choice of parameters. A feature-selection method is described in the second part of this section.

3.1. Invariant-integration features

The proposed features rely on a TF analysis that approximately maps the spectral scaling due to different VTLs to translations along the subband-index space. With respect to the notions introduced in Section 2, this translation effect can be attributed to the action of the group G of translations. With this assumption, the TF representations S of the same utterance from two speakers A and B are related by a translation parameter α_T,

S_A(f(\omega)) = S_B(\alpha_T + f(\omega)), \qquad (7)

where f(ω) = log(ω) or some other function like the mel or the ERB scale. Formally, let v_k(n) denote the TF representation of a speech signal, where n is the time index, 1 ≤ n ≤ N, and k is the subband index with 1 ≤ k ≤ K. A frame for time index n is then given by v_n = (v_1(n), v_2(n), ..., v_K(n))^T. When considering a translation according to α_T in the subband-index space, some boundary conditions need to be introduced. Periodic boundary conditions, where all subband indices are understood modulo K, have been used in (Müller and Mertins, 2009, 2010), because they were required by the applied invariance transformations. In this paper, we use repeated boundary conditions v_k(n) = v_1(n) for k < 1 and v_k(n) = v_K(n) for k > K, because they form a closer match to frequency warping in the analog domain.

By assuming that a finite set of translations along the subband-index space describes the occurring spectral effects due to different VTLs sufficiently (this is similar to the assumption made by a typical grid-search based VTLN method, where a warping factor is chosen out of a finite set of possible warping factors), a finite group of translations Ĝ with |Ĝ| elements can be defined. For reasons of simplicity, integer translations are used in the rest of this article. Nevertheless, non-integer translations could be used if an appropriate interpolation scheme were incorporated in the following definitions. According to the integral approach described in Section 2, Eq. (6) becomes

\hat{T}_f(x) = \frac{1}{|\hat{G}|} \sum_{g \in \hat{G}} f(gx). \qquad (8)

The question of how to define the functions f arises. According to Noether's theorem (Noether, 1915; Schulz-Mirbach, 1992, 1995b), a complete transformation

\hat{T}(x) = \left( \hat{T}_{f_1}(x), \hat{T}_{f_2}(x), \ldots, \hat{T}_{f_F}(x) \right)^{\top} \qquad (9)

Fig. 2. (a) Schematic plot of a non-contextual IIF with exponents l_1, l_2, l_3 ≠ 0; (b) schematic plot of a contextual IIF with exponents l = (l_1, l_2, l_3), l_1, l_2, l_3 ≠ 0, and corresponding temporal offsets m = (m_1, m_2, m_3) = (+1, 0, −1).

can be constructed by considering only monomials for the kernel functions f. Given the vectors k = (k_1, k_2, ..., k_M) and l = (l_1, l_2, ..., l_M), containing element indices and integer exponents with k ∈ N^M and l ∈ N_0^M, respectively, a non-contextual monomial m(n; w, k, l) with M components is defined in the following as

m(n; w, \mathbf{k}, \mathbf{l}) := \left[ \prod_{i=1}^{M} v_{k_i+w}^{l_i}(n) \right]^{1/c(m)}, \qquad (10)

where w ∈ N_0 is a spectral offset parameter that is used for ease of notation in the following definitions. The value c(m) denotes the order of a monomial m:

c(m) := \sum_{i=1}^{M} l_i. \qquad (11)

The all-encompassing exponent 1/c(m) in Eq. (10) acts as a normalizing term with respect to the order of the monomial. Noether showed that for input signals of dimensionality K and finite groups with |Ĝ| elements, the group averages of monomials with order less than or equal to |Ĝ| form a generating system of the input space. Such a basis has at most

\binom{|\hat{G}| + K}{K}

elements. It has to be pointed out that this is an upper bound, and it was shown in many applications that the number of practically needed basis functions is considerably smaller (e.g., Schulz-Mirbach, 1992, 1995b; Siggelkow, 2002). We can assume that the maximal translation that occurs as an effect of VTL changes in the subband-index space is limited to a certain range W (Sinha and Umesh, 2002). A non-contextual invariant-integration feature (IIF) A_m(n) for a frame at time n, as a group average on the basis of a monomial m, is defined as

A_m(n) := \frac{1}{2W+1} \sum_{w=-W}^{W} m(n; w, \mathbf{k}, \mathbf{l}), \qquad (12)

with W ∈ N_0 determining the window size. To give an explanatory example, we consider a monomial of order 3 with exponents l_i = 0 for all i ∈ {1, 2, ..., K}\{k_1, k_2, k_3} and l_i = 1 for i ∈ {k_1, k_2, k_3}. The corresponding IIF A_m(n) with window parameter W = 1 is then given by

A_m(n) = \frac{1}{3} \Big[ \big(v_{k_1-1}(n)\, v_{k_2-1}(n)\, v_{k_3-1}(n)\big)^{1/3} + \big(v_{k_1}(n)\, v_{k_2}(n)\, v_{k_3}(n)\big)^{1/3} + \big(v_{k_1+1}(n)\, v_{k_2+1}(n)\, v_{k_3+1}(n)\big)^{1/3} \Big]. \qquad (13)

Fig. 2(a) shows a schematic plot of the computation of an IIF as defined in Eq. (12).
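A minimal NumPy sketch of Eqs. (10)-(12) may make the computation concrete (our illustration under the stated assumptions, not the authors' implementation; the 1-based subband indexing follows the text, and the TF matrix `V` is a random stand-in):

```python
import numpy as np

def non_contextual_iif(V, k, l, W):
    """A_m(n) of Eq. (12): average over spectral shifts w = -W..W of the
    order-normalized monomial of Eq. (10).
    V: TF representation of shape (K, N); k: 1-based subband indices;
    l: integer exponents; W: window-size parameter."""
    K, N = V.shape
    c = sum(l)                       # order of the monomial, Eq. (11)
    A = np.zeros(N)
    for w in range(-W, W + 1):
        mono = np.ones(N)
        for ki, li in zip(k, l):
            idx = min(max(ki + w, 1), K) - 1   # repeated boundary conditions
            mono *= V[idx] ** li
        A += mono ** (1.0 / c)       # normalization by the order, Eq. (10)
    return A / (2 * W + 1)

# toy usage: order-3 monomial as in Eq. (13), on compressed random "spectra"
V = np.abs(np.random.randn(110, 200)) ** 0.1
A = non_contextual_iif(V, k=(20, 40, 60), l=(1, 1, 1), W=1)
```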

A monomial for frame n following the definition in Eq. (10) is only evaluated on the components of frame n, and temporal context is not considered. Since the spectral translation between different speakers is assumed to be time independent (cf. Eq. (7)), the definition of a monomial in Eq. (10) can be extended such that it also considers neighboring frames for its computation. The resulting feature type with contextual monomials is called contextual IIF here. Given a vector m ∈ Z^M containing temporal offsets, a contextual monomial m with M components is defined as

m(n; w, \mathbf{k}, \mathbf{l}, \mathbf{m}) := \left[ \prod_{i=1}^{M} v_{k_i+w}^{l_i}(n + m_i) \right]^{1/c(m)}. \qquad (14)

Consequently, a contextual IIF A_m(n) is then given by replacing the non-contextual monomial by the contextual one in Eq. (12):

A_m(n) := \frac{1}{2W+1} \sum_{w=-W}^{W} m(n; w, \mathbf{k}, \mathbf{l}, \mathbf{m}). \qquad (15)

With this definition, the non-contextual IIFs are a special case of the contextual ones. A schematic plot for the computation of the contextual IIFs according to Eq. (15) is shown in Fig. 2(b). Following Noether's theorem as described above, an adequately chosen IIF set

\mathcal{A} := \{A_{m_1}, A_{m_2}, \ldots, A_{m_F}\} \qquad (16)

yields features that, on the one hand, are invariant to the translational spectral effects due to different VTLs and, on the other hand, allow for discriminating between the individual classes.
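Extending the sketch from above to Eq. (15) only requires shifting the frame index by the per-component temporal offsets. The following illustration makes the same assumptions as before; clamping the frame index at the utterance edges is our own choice, as the text does not specify the temporal boundary handling:

```python
import numpy as np

def contextual_iif(V, k, l, m, W):
    """Contextual IIF A_m(n) of Eq. (15): component i reads subband
    k_i + w at frame n + m_i; the offsets m_i may be negative."""
    K, N = V.shape
    c = sum(l)
    A = np.zeros(N)
    n = np.arange(N)
    for w in range(-W, W + 1):
        mono = np.ones(N)
        for ki, li, mi in zip(k, l, m):
            idx = min(max(ki + w, 1), K) - 1    # repeated boundary in k
            frames = np.clip(n + mi, 0, N - 1)  # clamp temporal context
            mono *= V[idx, frames] ** li
        A += mono ** (1.0 / c)
    return A / (2 * W + 1)

# example: an order-one contextual IIF with subband 22, offset +1, W = 10
V = np.abs(np.random.randn(110, 200)) ** 0.1
A = contextual_iif(V, k=(22,), l=(1,), m=(+1,), W=10)
```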

The contextual IIFs can be applied to any TF representation that fulfills the assumption that spectral scaling is mapped to translation. This is approximately the case when a mel or an ERB scale is used for locating the frequency centers within the TF analysis (Umesh et al., 2002; Monaghan et al., 2008). Depending on the application, a mean normalization can be applied to the features of each utterance to reduce the effect of channel sensitivity.

3.2. Feature selection for invariant-integration features

The set of parameters of the IIFs, consisting of the window size, element indices, exponents, and temporal offsets, leads to a huge number of possible combinations. Generally, with F features, T possible temporal offsets, B possible window sizes, K subbands, and a maximum order of D, the total count C of possible IIF sets is given by

C = B \cdot \binom{\sum_{d=1}^{D} (K \cdot T)^d}{F}. \qquad (17)

Page 6: Contextual invariant-integration features for improved speaker-independent speech recognition

F. Muller, A. Mertins / Speech Communication 53 (2011) 830–841 835

Depending on the choice of parameter constraints, a count of more than 10^28 different feature sets is possible in practice. For finding a good subset of IIFs, an appropriate feature selection has to be done. As stated above, Noether's theorem gives an upper bound for the order of the monomials that is needed to construct a basis for the observation space. Hence, one constraint for the feature selection that can be defined is an assumption about the maximum number of group elements, i.e., the maximum number of different translations that can be observed in our setting. Although the maximum reasonable monomial order is constrained by this theorem, the experiments of this work show that already an order of up to three leads to good results.

Because of the high degree of freedom and the large amount of data, a requirement for the feature-selection method is a low computational complexity. We used a method for feature selection based on the so-called feature finding neural network (FFNN) (Gramss, 1991). This method was applied successfully in the field of speech recognition (e.g., Gramss, 1991; Kleinschmidt and Gelbart, 2002; Kleinschmidt, 2002), especially in case of small sets of training data. The FFNN approach works iteratively with a single-layer perceptron at its basis. A fast training of this linear classifier is guaranteed by its closed-form solution. It should be emphasized that the linear classifier does not need to form complex decision boundaries, but to generalize well (Gramss, 1991). The feature-selection method can be summarized in the following four steps:

1. Start with a set of F + 1 features whose parameters are randomly chosen.
2. Use the linear classifier to compute the relevance of each feature.
3. Remove the feature with the least relevance.
4. If a stopping criterion (e.g., the total number of iterations) is not fulfilled, add a new, randomly generated feature to the current feature set and go back to the second step; otherwise stop.

During the feature selection, the parameter set that leads to the highest mean relevance is memorized and returned at the end of the selection process. The linear classifier performs a frame-wise phone classification. The relevance of a feature i is computed on the basis of the root mean square (RMS) error of the linear classifier: it is defined as the difference between the RMS error without feature i and the RMS error when using all features. It follows that this approach yields a ranking of the features according to their relevance. For determining the relevance, again, different setups are possible. For example, it can be determined by using a matching (with respect to the mean VTL) training-testing scenario. Another possibility would be to compute the relevance for each feature as the mean relevance over different mismatching training-test scenarios in which, for example, male utterances are used for training and female ones for testing, and vice versa. The experimental part considers both ways of relevance computation.
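The following loop sketches this selection scheme (our illustration of the FFNN-style procedure; `random_feature` and `extract` are hypothetical placeholders for sampling an IIF parameter set and evaluating it on the data, and a closed-form least-squares fit stands in for the single-layer perceptron):

```python
import numpy as np

def rms_error(X, Y):
    """Closed-form least-squares fit of the linear classifier, then its
    RMS error; X: (frames, features), Y: (frames, phone targets)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sqrt(np.mean((X @ W - Y) ** 2))

def select_features(V, Y, F=30, iterations=1500):
    feats = [random_feature() for _ in range(F + 1)]          # step 1
    best_feats, best_rel = list(feats), -np.inf
    for _ in range(iterations):
        X = np.column_stack([extract(V, f) for f in feats])
        err_all = rms_error(X, Y)
        # relevance of feature i: how much the RMS error grows without it
        rel = np.array([rms_error(np.delete(X, i, axis=1), Y) - err_all
                        for i in range(len(feats))])          # step 2
        if rel.mean() > best_rel:       # memorize the most relevant set
            best_feats, best_rel = list(feats), rel.mean()
        feats.pop(int(rel.argmin()))                          # step 3
        feats.append(random_feature())                        # step 4
    return best_feats
```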

4. Experiments

A series of experiments has been conducted with the contextual IIFs. In the following, we compare the results with those for MFCCs and GTCCs. First, we consider IIFs with monomials of order one. Then we investigate whether the feature-selection method used to find IIFs can also be applied to select MFCCs from a time window. Experiments with IIFs of higher order are considered afterwards. In the second part of the experiments, we include speaker adaptation based on VTLN and MLLR. The generality of the selected feature sets is investigated in the last part of the experiments.

4.1. Data and setup

The TIMIT corpus (Garofolo et al., 1993) with a sampling rate of 16 kHz was used for finding the feature sets and for the first part of the recognition experiments, in which we look at phone recognition. Overall, it consists of 6300 utterances from 630 speakers (438 male and 192 female). For training, the NIST training set (Halberstadt, 1998) was used. It consists of 462 speakers, excludes the dialect (SA) sentences, and contains 3696 utterances. The complete test set (Halberstadt, 1998) contains 1344 utterances from 168 speakers (SA sentences excluded), with no overlap between training and test sets. The last part of the experiments used the TIDIGITS corpus (Leonard and Doddington, 1993), downsampled to a sampling rate of 16 kHz. Overall, this corpus consists of about 25,000 read digit sequences produced by adult and child speakers. The corpus provides predefined training and test sets, which were used in this work.

Similar to the feature-selection process described in Section 3.2, different training-test scenarios were defined for both corpora in order to simulate matching and mismatching training-test conditions. The matching scenario on TIMIT refers to the standard training and test sets. Two mismatching scenarios were defined for TIMIT. The first used only the female utterances from the training set for training and the male utterances from the test set for testing. Similarly, the second setup used only the male utterances from the training set and the female utterances from the test set. Accordingly, the mismatching scenarios are denoted as “F–M” and “M–F” in the following. In the M–F setting, the amount of test data is reduced to one third of the complete test set, so that the obtained accuracies are less statistically significant than for the complete test set used in the matching scenario. However, the number of utterances is still more than twice that of the core test set (Halberstadt, 1998), so that statistical significance of M–F results is not a serious issue that could lead to a misinterpretation of the properties of different feature sets.

Page 7: Contextual invariant-integration features for improved speaker-independent speech recognition

836 F. Muller, A. Mertins / Speech Communication 53 (2011) 830–841

For the TIDIGITS corpus, two scenarios were defined. The first used only the corresponding utterances by adults for training and testing. The second scenario used the utterances by adults for training and those by children for testing. These two scenarios are denoted in the following as “A–A” and “A–C”, respectively.

While this work explicitly looks at the VTL as a source of variability, other speaker dependencies also affect the results. For example, the corpora contain different dialects and speaking rates, so that good results on these corpora also indicate robustness to such variabilities. Of course, in general, one would want robustness to even more types of variation, which cannot be studied with these corpora. Experiments with regard to other variations are therefore planned for the future.

The TF analysis methods differed for the considered feature types: in case of the MFCCs, the standard HTK setup was used, which consists of a 26-channel mel filterbank (Davis and Mermelstein, 1980; Young et al., 2009). A gammatone filterbank implemented with an FFT approach (Ellis, 2009) was used in case of the GTCCs and IIFs. The minimum center frequency of the filters was set to 40 Hz, and the maximum center frequency was set to 8 kHz. With these constraints, the center frequencies were evenly spaced on the ERB scale. In case of GTCCs, 26 channels were used. In order to have sufficiently many spectral values for the computation of the IIFs, the number of channels was chosen as 110 in this case. Certainly, the number of channels could be smaller if the parameter for the window size were non-integer and an appropriate interpolation scheme were used. However, minimizing the number of channels is not within the scope of this article. The frame length was set to 20 ms, and a frame shift of 10 ms was chosen. A power-law compression of the spectral values with an exponent of 0.1 was applied in order to resemble the nonlinear compression found in the human auditory system.
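For orientation, a much-simplified front end in this spirit could look as follows (illustrative only: Gaussian spectral weights replace the actual gammatone filters of Ellis (2009), the bandwidth choice is an ad hoc assumption, and the ERB formula is the one by Glasberg and Moore):

```python
import numpy as np

def erb_scale(f):      # Hz -> ERB rate
    return 21.4 * np.log10(4.37e-3 * f + 1.0)

def erb_scale_inv(e):  # ERB rate -> Hz
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def tf_representation(x, fs=16000, K=110, fmin=40.0, fmax=8000.0):
    """20 ms frames, 10 ms shift, FFT, K channels evenly spaced on the
    ERB scale, and power-law compression with exponent 0.1."""
    frame, shift, nfft = int(0.02 * fs), int(0.01 * fs), 512
    centers = erb_scale_inv(np.linspace(erb_scale(fmin), erb_scale(fmax), K))
    bins = np.fft.rfftfreq(nfft, 1.0 / fs)
    # crude Gaussian filter shapes around each center (assumed bandwidths)
    Wts = np.exp(-0.5 * ((bins[None, :] - centers[:, None])
                         / (0.1 * centers[:, None] + 25.0)) ** 2)
    frames = np.array([x[i:i + frame] * np.hanning(frame)
                       for i in range(0, len(x) - frame, shift)])
    spec = np.abs(np.fft.rfft(frames, nfft, axis=1))
    return (spec @ Wts.T).T ** 0.1   # shape (K, N)
```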

The recognizer was based on the Hidden Markov Model Toolkit (HTK) (Young et al., 2009) in all experiments. Phone recognition experiments were conducted on the TIMIT corpus. State-clustered, cross-word triphone models with diagonal covariance modeling were used. All phone models had three emitting states with a left-to-right topology. Additionally, a long-silence and a short-silence model were included. A bigram language model was used that was computed on the basis of the TIMIT training data. Following Lee and Hon (1989), the phone-recognition results were folded to yield 39 final classes. The number of Gaussian mixtures was optimized for both feature types. While for the cepstral-coefficient-based feature types a mixture of 16 Gaussian distributions per state was chosen, it turned out to be beneficial to use mixtures of eight Gaussian distributions when using IIFs as features. Word recognition experiments were conducted with TIDIGITS. Here, whole-word left-to-right HMMs without skips over the states were trained. The number of states was chosen according to the average length of the individual words and varied between 9 and 15 states. A mixture of up to eight Gaussian distributions was used for all states.

For baseline accuracies, MFCCs and GTCCs with 12 coefficients plus log-energy together with first- and second-order derivatives were computed. Cepstral mean subtraction and variance normalization were performed. As a standard VTLN method, the approach described by Welling et al. (2002) was used. It estimates maximum-likelihood warping factors based on whole utterances with a grid-search approach. The considered warping factors α_S were chosen from the set {0.88, 0.90, ..., 1.12} in case of MFCCs. For the GTCCs, the warping parameters α_T were chosen empirically from {−1.5, −1.25, ..., 1.5}. During training, individual warping factors were first estimated for all speakers. Then these warping factors were used to train speaker-independent models. The decoding of the test data used a two-pass strategy: first, a hypothesis was computed based on the unwarped features; then, the hypothesis was aligned to a set of warped features, and the warping factor with the highest confidence was used for the final decoding. Speaker-adaptive training and speaker adaptation with MLLR for the tests were also considered during the experiments. When MLLR was used, a regression class tree with eight terminal nodes was employed.
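The two-pass decoding strategy can be summarized schematically as follows (`extract_features`, `decode`, and `align_score` are hypothetical stand-ins for the corresponding HTK-based steps; this is a sketch of the procedure, not the actual tooling):

```python
def two_pass_vtln_decode(utterance, models, alphas):
    """Grid-search VTLN at test time in the spirit of Welling et al. (2002):
    hypothesize on unwarped features, then keep the warping factor whose
    warped features align best to that hypothesis."""
    feats = extract_features(utterance, alpha=1.0)   # unwarped first pass
    hypothesis = decode(models, feats)
    scores = {a: align_score(models, hypothesis,
                             extract_features(utterance, alpha=a))
              for a in alphas}
    best = max(scores, key=scores.get)               # most likely factor
    return decode(models, extract_features(utterance, alpha=best))
```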

4.2. Feature selection

For the feature selection, the parameters were constrained as follows: the number of features F and the maximum order of the used monomials were varied within the individual experiments. The temporal offsets m were allowed to be within an interval of ±3 frames, which corresponds to a maximum temporal context of 80 ms. The maximum window-size parameter W was limited to 80 subbands. This number was chosen so high to be on the safe side, and in fact, the features that were selected had values for W of up to 65. According to Eq. (17), the total count of possible feature sets for F = 30 and a maximum order of 1 is of magnitude 10^110. Because of technical limitations, only every tenth utterance was used for the feature selection. Therefore, about 370 male and female utterances with about 110,000 frames were considered. Of course, a larger amount of data for the feature selection could lead to better generalization capabilities, but the found feature sets already show good robustness, even if an entirely different corpus is used for testing. The last part of the experiments described here investigates the generalization capabilities of the TIMIT-based feature sets by using these feature sets for word-recognition experiments on the TIDIGITS corpus.

The result of a feature-selection process according to the described method depends on its initialization and the randomly chosen features during its runtime. Thus, for each experiment, several repetitions of the feature-selection process were made. A number of ten repetitions was experimentally determined; an increase beyond ten did not significantly improve the results. No further heuristics for the initialization of the parameters have been used. The overall process can be described as follows:

1. Compute the TF representation of the training data as described in Section 4.1.
2. Perform the feature selection as described in Section 3.2 ten times.
3. Choose the feature set with the highest mean relevance.

With this choice of parameters, up to 15,000 different feature sets can be examined during the feature selection. The evolution of the smallest RMS error and the corresponding phone error rates (PER) are shown for an exemplary feature selection in Fig. 3.

Fig. 3. Exemplary feature selection: (a) evolution of the smallest mean RMS error, and (b) corresponding phone error rates.

For this example, it can be observed that the mean RMS error correlates well with the PER for the first 400 iterations. During the following 1000 iterations, the PER increases and decreases while the best RMS error is decreasing. This behavior in the vicinity of the optimum could be expected, as the results obtained with a linear classifier can predict the performance of an HMM-based recognizer only to a limited extent. Generally, it was observed that after 1500 iterations of the described feature selection, the IIF set with the smallest mean RMS error yielded a significant improvement in accuracy compared to the randomly initialized IIF set. The IIFs obtained after the feature selection usually have small as well as large integration windows and cover the whole range of allowed parameter values. Considering the relevances of the individual features as described in Section 3.2, we observed that IIFs with large as well as with small integration windows were considered highly relevant by the feature selection. Two exemplary contextual IIFs A_{m_1}(n) and A_{m_2}(n) that were selected during the experiments are shown in the following. They use monomials of order one, which corresponds to computing the mean spectral value for a certain frequency range:

A_{m_1}(n) = \frac{1}{21} \sum_{w=-10}^{10} v_{22+w}(n+1), \qquad
A_{m_2}(n) = \frac{1}{47} \sum_{w=-23}^{23} v_{60+w}(n-1). \qquad (18)

Here, the integration range for A_{m_1}(n) is from about 150 Hz to about 480 Hz, and for A_{m_2}(n) from about 600 Hz to about 3200 Hz.
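Using the ERB-spaced center frequencies from the simplified front-end sketch in Section 4.1 (same assumptions as there), these quoted ranges can be checked by mapping the integrated channel indices back to frequencies:

```python
import numpy as np
# erb_scale / erb_scale_inv as defined in the front-end sketch above
centers = erb_scale_inv(np.linspace(erb_scale(40.0), erb_scale(8000.0), 110))
print(centers[11], centers[31])  # channels 12..32 (A_m1): ~151 and ~482 Hz
print(centers[36], centers[82])  # channels 37..83 (A_m2): ~603 and ~3297 Hz
```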

Table 1
Accuracies [%] of phone recognition experiments on TIMIT for MFCCs, GTCCs, and for IIFs of order one (with F = 10, 20, 30). The feature vector dimensions are shown in brackets.

Features            Match    Mismatch
                    FM–FM    F–M    M–F
MFCC (39)           72.2     53.7   54.7
MFCC+VTLN (39)      73.3     67.7   70.0
GTCC (39)           72.5     55.4   54.0
GTCC+VTLN (39)      73.9     66.8   68.7
10 IIF (33)         73.4     61.6   62.9
20 IIF (55)         74.8     60.3   61.4
30 IIF (55)         75.3     60.2   61.1

4.3. Invariant integration features of order one

Three IIF feature sets of order one and with sizes 10, 20, and 30 were selected in a matching scenario as described above. For the decoding, the log-energy was appended together with first- and second-order derivatives. Then, a linear discriminant analysis (LDA) was used to project the feature vectors down to 55 dimensions. Here, the phone segments were considered as individual classes (Haeb-Umbach and Ney, 1992). The dimensionality-reduction step was omitted for the feature set that consists of 10 IIFs. Finally, a maximum likelihood linear transformation (MLLT) (Saon et al., 2000) was applied to allow for diagonal covariance modeling. In the following, the resulting accuracies will be compared with those for MFCCs and GTCCs with and without VTLN in a phone recognition task on TIMIT. Further results on cepstral-coefficient-based feature types with speaker adaptation will be reported in Section 4.6. Experiments with contextually enhanced MFCCs are described in the next section. To study the robustness to VTL mismatches, the usual matching case as well as the two mismatching scenarios M–F and F–M were considered for the comparison. Table 1 summarizes the results. The first two rows list the results for the MFCCs. By comparing the results for the MFCC matching scenario with those for the corresponding mismatching ones, it can be seen that the performance of standard MFCCs differs by about 17 percentage points. The improvement when using MFCCs with VTLN is larger for the mismatching scenarios than for the matching scenario. The next two rows show the same behavior for GTCCs.

Analyzing the results for the IIFs, it can be seen that all three feature sets outperform the MFCCs and GTCCs without VTLN in all scenarios. Furthermore, for the

matching case, an increase in accuracy is observable with an increasing number of IIFs. Interestingly, the accuracies of the IIFs for the mismatching scenarios are highest for the smallest feature set consisting of 10 IIFs and lowest for the largest feature set. An explanation may be that the larger IIF sets are better adapted to the corpus they were selected on and, therefore, do not generalize as well as the smaller IIF set. Notably, all IIF feature sets yield accuracies that are comparable (10 IIFs) or even higher (20 IIFs, 30 IIFs) than those of the cepstral-coefficient-based feature types with VTLN in the matching scenario, which represents the normal mode of operation of ASR systems.

4.4. Contextually selected MFCCs

Besides approximations of first- and second-order time derivatives of the feature components, another common approach for considering temporal context information is to concatenate feature vectors and apply an LDA. We also carried out LDA-based combinations of MFCC and GTCC feature vectors for an 80 ms context. Because of their similar results, only the ones for the MFCCs are shown in the upper part of Table 2. The comparison with Table 1 shows that the accuracies for the LDA extension are higher when no speaker adaptation is used. However, when MLLR or VTLN+MLLR are applied, the features with the time-derivative extensions yield in most cases slightly higher accuracies with our setup. The properties of the LDA-based approach have, for example, been discussed by Schlüter et al. (2006), where it was pointed out that the performance of the LDA method often drops when features are highly correlated, too many frames are concatenated, or the training set is too small. To further investigate contextually enhanced MFCCs, we now describe another approach for including contextual information, which leads to significant improvements under matching as well as under mismatching conditions.

Contextual IIFs of order one are sums of weighted spectral values where the weighting coefficients have a rectangular shape. In case of the IIFs, the rectangular shape originates from the idea of integrating over the group of all possible translations. This procedure is similar to the computation of MFCCs, in which the spectral values at the output of a mel filterbank are weighted with triangular-shaped coefficients and integrated. For a further comparison between IIFs and MFCCs, we selected 30 MFCCs from an 80 ms context using the same feature-selection algorithm as for the IIFs. Similar to the IIFs, log-energy and first- and second-order time derivatives were appended to the feature vectors, which were finally reduced with an LDA to 55 dimensions. The lower part of Table 2 shows the results of these experiments. The recognition rates for selected GTCCs are not listed, because they were slightly inferior to those of the selected MFCCs.

Table 2
Accuracies [%] of phone recognition experiments on TIMIT for LDA-reduced concatenated MFCCs (MFCC_LDA) and for contextually selected MFCCs (MFCC_select). The feature vector dimensions are shown in brackets.

Features                       Match    Mismatch
                               FM–FM    F–M    M–F
MFCC_LDA (55)                  73.8     57.0   57.3
MFCC_LDA+MLLR (55)             74.2     60.4   61.2
MFCC_LDA+VTLN (55)             74.8     68.6   69.6
MFCC_LDA+VTLN+MLLR (55)        75.3     69.3   70.8
MFCC_select (55)               74.7     60.3   61.5
MFCC_select+MLLR (55)          75.2     63.2   64.4
MFCC_select+VTLN (55)          76.1     70.9   71.2
MFCC_select+VTLN+MLLR (55)     76.4     71.2   72.1

The comparison of these results with the accuracies of the MFCCs as listed in Table 1 shows that the contextually selected MFCCs yield much higher accuracies than the standard MFCCs for most of the setups. It is noteworthy that the accuracies of the contextual MFCCs increase especially under mismatching conditions. Our conclusion from these findings is that a good selection of individual components within a contextual temporal window may be more beneficial than a (linear) combination of components as is done, e.g., by an LDA. The pure selection of MFCCs from a context window has, to the best of our knowledge, not been proposed before.

4.5. Invariant integration features of higher order and the feature-selection scenario

In the next experiment, the allowed order of the monomials was constrained to two and three, respectively. Furthermore, each feature selection was performed on the matching scenario as well as on the mismatching scenarios M–F and F–M. The chosen size of the feature sets was set to 30. For comparison purposes, an IIF set of order one was also selected on the mismatching scenarios. Table 3 shows the results of the experiments.

Table 3
Accuracies [%] of phone recognition experiments on TIMIT for 30 IIFs of higher order. “FS scenario” denotes the scenario that was used for the relevance computation during the feature selection (matching or mismatching); “order” denotes the maximum monomial order of the IIFs.

FS scenario    Order    Match    Mismatch
                        FM–FM    F–M    M–F
Matching       1        75.3     60.2   61.1
               ≤2       74.4     62.4   61.4
               ≤3       74.2     62.2   61.7
Mismatching    1        74.2     62.3   62.2
               ≤2       73.3     63.3   64.9
               ≤3       73.5     64.3   64.1

It can be seen that the accuracies in the matching scenario degrade when IIFs of higher order are included, whereas the accuracies in the corresponding mismatching scenarios increase. This effect is most noticeable when

the IIFs are selected on the basis of mismatching scenarios. For example, the IIF set whose results are shown in the last row of Table 3 yields accuracies similar to those of the MFCCs combined with VTLN. Thus, accuracy in the normal matching case can be traded off to increase the robustness to larger mismatches between training and testing.

4.6. Invariant integration features combined with VTLN and/or MLLR

Adaptation and normalization methods are commonly part of state-of-the-art ASR systems nowadays. In the following, we investigate whether the superior properties of the IIFs are still observable when VTLN and/or MLLR are also used within the ASR system. A block-diagonal structure with three blocks was set as a constraint for all MLLR-transform estimations, because it turned out to be beneficial for all considered feature types. While in the experiments in Sections 4.3 and 4.5 an LDA was applied to the IIFs together with their derivatives, the experiments in this part had a different sequence of feature-processing steps: for the IIF sets of size 30, an LDA with a final dimensionality of 20 was applied. No dimensionality reduction was performed with the IIF sets of size 10 and 20. Then, the log-energy and first- and second-order derivatives were appended, and a 3-block-constrained MLLT was computed to decorrelate the features. Speaker-adaptive training was performed with constrained MLLR (CMLLR), while a combination of CMLLR and MLLR was applied in the decoding stage. When VTLN was used with the IIFs, which use a 110-channel gammatone filterbank, the considered warping parameters α_T were −8, −7, ..., 8. In case of MFCCs and GTCCs, the warping parameters were chosen as described in Section 4.1. Table 4 shows the results of the experiments.

Table 4
Accuracies [%] of phone recognition experiments on TIMIT for MFCCs, GTCCs, and for IIFs of order one (F = 10, 20, 30) when adaptation methods are used. The feature vector dimensions are shown in brackets.

Features                   Match    Mismatch
                           FM–FM    F–M    M–F
MFCC+MLLR (39)             75.2     65.3   66.9
MFCC+VTLN (39)             73.3     67.7   70.0
MFCC+VTLN+MLLR (39)        75.4     69.6   71.8
GTCC+MLLR (39)             74.9     66.2   66.3
GTCC+VTLN (39)             73.9     66.8   68.7
GTCC+VTLN+MLLR (39)        76.3     70.4   72.4
10 IIF+MLLR (33)           74.9     68.0   68.6
20 IIF+MLLR (63)           75.4     67.8   69.4
30 IIF+MLLR (63)           76.2     68.1   69.4
10 IIF+VTLN (33)           75.4     69.7   70.5
20 IIF+VTLN (63)           76.2     70.1   70.1
30 IIF+VTLN (63)           77.2     71.4   70.9
10 IIF+VTLN+MLLR (33)      76.0     71.2   72.5
20 IIF+VTLN+MLLR (63)      77.1     72.4   72.3
30 IIF+VTLN+MLLR (63)      77.4     73.4   72.4

As expected, the cepstral-coefficient-based systems that use both MLLR and VTLN yield higher accuracies in all scenarios than the cepstral-coefficient-based systems that use only one of the adaptation methods or none of them. Generally, it can be observed that IIF-based ASR systems do benefit from the use of MLLR and/or VTLN. Compared to the accuracies from Table 1, it can be seen that the additional use of MLLR and VTLN increases the accuracy of the MFCC- and GTCC-based systems in the matching scenario by about 3 and 4 percentage points, respectively. Combining MLLR and VTLN with IIFs yields increases in accuracy for the matching scenario between 2.1 and 2.6 percentage points. Overall, the IIF set of size 30 in combination with MLLR and VTLN yields the highest accuracies in the experiments. Also, in comparison to the accuracies of the contextually selected MFCCs as listed in Table 2, the IIF set with 30 features yields the highest accuracies in most of the setups.

4.7. Experiments on TIDIGITS

Since the features were selected on the basis of an individual corpus (in our case TIMIT), the question arises how well the IIF sets perform on another corpus. Therefore, the last part of the experiments considered the “generalization capabilities” of the features. The same IIF sets that were selected on the TIMIT corpus were used for the feature extraction for the word-recognition task on TIDIGITS. The results of these experiments are shown in Table 5.

Table 5
Accuracies [%] of word recognition experiments on TIDIGITS for MFCCs, GTCCs, and for IIFs of order one (F = 10, 20, 30). The feature vector dimensions are shown in brackets.

Features            Match    Mismatch
                    A–A      A–C
MFCC (39)           99.52    96.02
MFCC+VTLN (39)      99.59    97.25
GTCC (39)           99.65    97.67
GTCC+VTLN (39)      99.66    98.94
10 IIF (33)         99.59    97.40
20 IIF (55)         99.64    97.95
30 IIF (55)         99.68    97.89
10 IIF+VTLN (33)    99.65    99.22
20 IIF+VTLN (55)    99.73    99.38
30 IIF+VTLN (55)    99.76    99.30

The first four rows of Table 5 show the results obtained by the cepstral-coefficient-based ASR systems without adaptation and with VTLN, respectively. From row five on, the accuracies of the IIF-based systems without any adaptation and with VTLN are shown. For both scenarios, it can be seen that the IIF-based systems without VTLN yield accuracies that are equally high (10 IIF, A–A) or higher than those of the MFCC-based systems. It is particularly remarkable that the IIF-based system without adaptation outperforms the MFCC-based one with VTLN even in

the A–C setting, where the recognizer is trained on adult speech and tested with children's speech. Due to the very low number of errors in the A–A case, the statistical significance of the results may be in question for this scenario. However, in the A–C setting, where the accuracy is generally lower, it could be improved by up to 0.7 percentage points when using IIFs instead of MFCCs with VTLN. In absolute terms, this means that the number of correctly recognized digits could be increased by the IIFs by up to 90 digits compared to MFCCs, which can be seen as a significant increase in accuracy. The comparison of IIFs with GTCCs without VTLN shows comparable accuracies. However, when VTLN is used, the accuracy of the IIFs in the A–C scenario is still about 0.4 percentage points higher than in the GTCC+VTLN case, which corresponds to 30 more correctly recognized digits.

5. Discussion and conclusions

We presented a feature-extraction method that is based on the theory of invariant integration. A major assumption is that the spectral effects caused by different VTLs are mapped to translations along the subband-index axis of the TF representation. Invariant-integration features were defined such that they incorporate information about temporal context. Since the proposed IIFs are based on monomials, their computation has a very low complexity. However, the definition of these features has a high degree of parametric freedom, so an appropriate feature-selection method is needed. Within this work, a method based on a linear classifier was used that iteratively enhances a feature set of fixed size.
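For illustration, the following numpy sketch computes one contextual IIF per frame from a time-frequency representation. It follows the general monomial-plus-integration idea described above, but the function signature, the cyclic boundary handling, and the example parameter values are our own assumptions, not the exact parameterization used in the experiments.

```python
import numpy as np

def iif(Y, offsets, taus, exponents, num_shifts):
    """One contextual invariant-integration feature per frame (a sketch).

    Y          : TF representation, Y[subband, frame] (e.g., filterbank magnitudes)
    offsets    : subband offsets k_i of the monomial factors
    taus       : temporal offsets of the factors (the contextual part)
    exponents  : exponents m_i (order-one IIFs use a single factor with m = 1)
    num_shifts : number of subband translations that are averaged over
    """
    n_sub, n_frames = Y.shape
    t_start = max(0, -min(taus))              # frames with full left context
    t_stop = n_frames - max(0, max(taus))     # frames with full right context
    out = np.zeros(t_stop - t_start)
    for t in range(t_start, t_stop):
        acc = 0.0
        for s in range(num_shifts):           # integration over subband shifts
            prod = 1.0
            for k, tau, m in zip(offsets, taus, exponents):
                # cyclic continuation of the subband axis (an assumption made
                # here only to keep the sketch short)
                prod *= Y[(k + s) % n_sub, t + tau] ** m
            acc += prod
        out[t - t_start] = acc / num_shifts   # averaging yields the invariance
    return out

# Illustrative values only: a second-order monomial with +/-2 frames of context.
Y = np.abs(np.random.default_rng(0).standard_normal((23, 200))) + 1e-3
feat = iif(Y, offsets=[0, 3], taus=[-2, 2], exponents=[1, 1], num_shifts=16)
```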

Experiments showed that contextually selected MFCCs yield much higher accuracies in matching as well as in mismatching scenarios than standard MFCCs and LDA-reduced concatenated MFCCs. Invariant-integration features can be seen as a further refinement that finds important spectral cues within a certain temporal context and enhances the robustness of the resulting features to VTL changes.
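The contextual selection itself can be pictured as a hill-climbing loop: starting from a random feature set of the target size, single features are exchanged and an exchange is kept whenever a linear classifier scores better. The sketch below uses scikit-learn's logistic regression as a stand-in for the linear classifier; the scoring scheme and all parameter values are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score(X, y):
    """Linear-classifier score used to compare candidate feature sets."""
    clf = LogisticRegression(max_iter=500)
    return cross_val_score(clf, X, y, cv=3).mean()

def select_features(X_cand, y, set_size, n_iter=200, seed=0):
    """Iteratively improve a feature set of fixed size by single exchanges.

    X_cand : (n_samples, n_candidates) matrix of all candidate features
    y      : frame labels (e.g., phone classes)
    """
    rng = np.random.default_rng(seed)
    n_cand = X_cand.shape[1]
    selected = list(rng.choice(n_cand, size=set_size, replace=False))
    best = score(X_cand[:, selected], y)
    for _ in range(n_iter):
        pos = int(rng.integers(set_size))     # slot to re-fill
        cand = int(rng.integers(n_cand))      # random replacement feature
        if cand in selected:
            continue
        trial = selected.copy()
        trial[pos] = cand
        s = score(X_cand[:, trial], y)
        if s > best:                          # keep improving exchanges only
            selected, best = trial, s
    return selected, best
```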

When no speaker adaptation was used, the experiments showed a superior performance of the IIFs compared to cepstral coefficients (MFCCs and GTCCs) in matching, and especially in mismatching, training–testing conditions. In the matching scenario, IIFs of order one perform better than IIFs based on higher-order monomials. However, using higher-order monomials yields better performance in mismatching training–testing scenarios. The experiments showed that the combination of IIFs with MLLR and/or VTLN further increases their accuracy. Within the experiments, the IIF-based systems led to the highest accuracies and outperformed the cepstral-coefficient-based systems in matching training–testing conditions for the TIMIT phone-recognition task by more than one percentage point. The last part of the experiments showed that the TIMIT-based IIF sets can equally well be used for word recognition on the TIDIGITS corpus. Here, the IIF-based systems also yield accuracies that are higher than those of the MFCC-based systems without and with VTLN.

Future work will focus on finding a general set of IIFs that works equally well on different corpora and under a larger class of variabilities. A comprehensive study of the robustness of the IIFs to different types of distortions will also be conducted. Finally, more sophisticated TF-analysis methods can be combined with the presented feature type and might lead to further improvements.

Acknowledgement

This work has been supported by the German Research Foundation under Grant No. ME1170/2-1.

References

Benzeghiba, M., Mori, R.D., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., Wellekens, C., 2007. Automatic speech recognition and speech variability: a review. Speech Comm. 49 (10–11), 763–786.

Boë, L.-J., Granat, J., Badin, P., Autesserre, D., Pochic, D., Zga, N., Henrich, N., Ménard, L., 2006. Skull and vocal tract growth from newborn to adult. In: Proc. 7th Int. Seminar on Speech Production (ISSP7), Ubatuba, Brazil, pp. 75–82.

Burkhardt, H., Müller, X., 1980. On invariant sets of a certain class of fast translation-invariant transforms. IEEE Trans. Acoust. Speech Signal Process. 28 (5), 517–523.

Burkhardt, H., Siggelkow, S., 2001. Invariant features in pattern recognition – fundamentals and applications. In: Kotropoulos, C., Pitas, I. (Eds.), Nonlinear Model-Based Image/Video Processing and Analysis. John Wiley & Sons, pp. 269–307.

Cohen, L., 1993. The scale representation. IEEE Trans. Signal Process. 41 (12), 3275–3292.

Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28 (4), 357–366.

Deller, J.R., Proakis, J.G., Hansen, J.H.L., 1993. Discrete-Time Processing of Speech Signals. Macmillan, New York.

Ellis, D.P.W., 2009. Gammatone-like spectrograms. Web resource: <http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatone>.

Fang, M., Häusler, G., 1989. Modified rapid transform. Appl. Opt. 28 (6), 1257–1262.

Gales, M.J.F., 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12 (2), 75–98.

Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., 1993. DARPA TIMIT acoustic phonetic speech corpus. Linguistic Data Consortium, Philadelphia.

Gramss, T., 1991. Word recognition with the feature finding neural network (FFNN). In: Proc. IEEE Workshop Neural Networks for Signal Processing, Princeton, NJ, USA, pp. 289–298.

Haeb-Umbach, R., Ney, H., 1992. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, San Francisco, CA, USA, pp. 13–16.

Halberstadt, A.K., 1998. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. Ph.D. Thesis, Massachusetts Institute of Technology.

Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, New York, USA.

Hurwitz, A., 1897. Ueber die Erzeugung der Invarianten durch Integration. Nachrichten von der Königl. Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-physikalische Klasse, pp. 71–90.


Irino, T., Patterson, R., 2002. Segregating information about the size and the shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Commun. 36 (3), 181–203.

Ishizuka, K., Nakatani, T., Minami, Y., Miyazaki, N., 2006. Speech feature extraction method using subband-based periodicity and nonperiodicity decomposition. J. Acoust. Soc. Amer. 120 (1), 443–452.

Kleinschmidt, M., 2002. Robust speech recognition based on spectro-temporal processing. Ph.D. Thesis, Universität Oldenburg.

Kleinschmidt, M., Gelbart, D., 2002. Improving word accuracy with Gabor feature extraction. In: Proc. Int. Conf. Spoken Language Processing, pp. 25–28.

Lee, K.F., Hon, H.W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 37 (11), 1641–1648.

Lee, L., Rose, R.C., 1998. A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6 (1), 49–60.

Leggetter, C., Woodland, P., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 9 (2), 171–185.

Leonard, R.G., Doddington, G., 1993. TIDIGITS. Linguistic Data Consortium, Philadelphia.

Lohweg, V., Diederichs, C., Müller, D., 2004. Algorithms for hardware-based pattern recognition. EURASIP J. Appl. Signal Process. 2004, 1912–1920.

Mertins, A., Rademacher, J., 2005. Vocal tract length invariant features for automatic speech recognition. In: Proc. 2005 IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Puerto Rico, pp. 308–312.

Mertins, A., Rademacher, J., 2006. Frequency-warping invariant features for automatic speech recognition. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. V, Toulouse, France, pp. 1025–1028.

Monaghan, J.J., Feldbauer, C., Walters, T.C., Patterson, R.D., 2008. Low-dimensional, auditory feature vectors that improve vocal-tract-length normalization in automatic speech recognition. J. Acoust. Soc. Amer. 123 (5), 3066.

Moore, B.C.J., Glasberg, B.R., 1996. A revision of Zwicker's loudness model. Acta Acustica united with Acustica 82 (11), 245–335.

Müller, F., Belilovsky, E., Mertins, A., 2009. Generalized cyclic transformations in speaker-independent speech recognition. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Merano, Italy, pp. 211–215.

Müller, F., Mertins, A., 2009. Invariant-integration method for robust feature extraction in speaker-independent speech recognition. In: Proc. Int. Conf. Spoken Language Processing (Interspeech 2009 – ICSLP), Brighton, UK, pp. 2975–2978.

Müller, F., Mertins, A., 2010. Nonlinear translation-invariant transformations for speaker-independent speech recognition. In: Solé-Casals, J., Zaiats, V. (Eds.), Advances in Nonlinear Speech Processing, LNAI, vol. 5933. Springer, Heidelberg, Germany, pp. 111–119.

Noether, E., 1915. Der Endlichkeitssatz der Invarianten endlicher Gruppen. Mathematische Annalen 77 (1), 89–92.

Patterson, R.D., 2000. Auditory images: How complex sounds are represented in the auditory system. J. Acoust. Soc. Japan (E) 21 (4), 183–190.

Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., Allerhand, M., 1992. Complex sounds and auditory images. In: Cazals, Y., Demany, L., Horner, K. (Eds.), Auditory Physiology and Perception. Advanced Bioscience, vol. 83. Pergamon, Oxford, pp. 429–446.

Pitz, M., Ney, H., 2005. Vocal tract normalization equals linear transformation in cepstral space. IEEE Trans. Speech Audio Process. 13 (5), 930–944.

Rademacher, J., Wächter, M., Mertins, A., 2006. Improved warping-invariant features for automatic speech recognition. In: Proc. Int. Conf. Spoken Language Processing (Interspeech 2006 – ICSLP), Pittsburgh, PA, USA, pp. 1499–1502.

Reitboeck, H., Brody, T.P., 1969. A transformation with invariance under cyclic permutation for applications in pattern recognition. Inform. Control 15 (2), 130–154.

Saon, G., Padmanabhan, M., Gopinath, R., Chen, S., 2000. Maximum likelihood discriminant feature spaces. In: Proc. Int. Conf. Audio Speech and Signal Processing, pp. 1129–1132.

Schlüter, R., Zolnay, A., Ney, H., 2006. Feature combination using linear discriminant analysis and its pitfalls. In: Proc. Int. Conf. Spoken Language Processing (ICSLP/Interspeech), Pittsburgh, USA, pp. 345–348.

Schulz-Mirbach, H., 1992. On the existence of complete invariant feature spaces in pattern recognition. In: Proc. Int. Conf. Pattern Recognition, vol. 2, The Hague, The Netherlands, pp. 178–182.

Schulz-Mirbach, H., 1995a. Anwendung von Invarianzprinzipien zur Merkmalgewinnung in der Mustererkennung. TR-402-95-018, Universität Hamburg, Hamburg, Germany.

Schulz-Mirbach, H., 1995b. Invariant features for gray scale images. In: Mustererkennung 1995, 17. DAGM-Symposium. Springer, London, UK, pp. 1–14.

Sena, A.D., Rocchesso, D., 2005. A study on using the Mellin transform for vowel recognition. In: Proc. Sound and Music Conf., Salerno, Italy.

Siggelkow, S., 2002. Feature histograms for content-based image retrieval. Ph.D. Thesis, Fakultät für Angewandte Wissenschaften, Albert-Ludwigs-Universität Freiburg, Breisgau, Germany.

Sinha, R., Umesh, S., 2002. Non-uniform scaling based speaker normalization. In: Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP'02), vol. 1, Orlando, USA, pp. I-589–I-592.

Umesh, S., Cohen, L., Marinovic, N., Nelson, D.J., 1999. Scale transform in speech analysis. IEEE Trans. Speech Audio Process. 7 (1), 40–45.

Umesh, S., Kumar, S.V.B., Vinay, M.K., Sharma, R., Sinha, R., 2002. A simple approach to non-uniform vowel normalization. In: Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'02), vol. 1, Orlando, USA, pp. I-517–I-520.

Welling, L., Ney, H., Kanthak, S., 2002. Speaker adaptive modeling by vocal tract normalization. IEEE Trans. Speech Audio Process. 10 (6), 415–426.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2009. The HTK Book (for HTK Version 3.4.1). Cambridge University Engineering Department, Cambridge, UK.