
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition

Chanwoo Kim and Richard M. Stern, Member, IEEE

Abstract—This paper presents a new feature extraction algorithm called Power Normalized Cepstral Coefficients (PNCC) that is motivated by auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC coefficients, a noise-suppression algorithm based on asymmetric filtering that suppresses background excitation, and a module that accomplishes temporal masking. We also propose the use of medium-time power analysis, in which environmental parameters are estimated over a longer duration than is commonly used for speech, as well as frequency smoothing. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing, and without degrading the recognition accuracy that is observed while training and testing using clean speech. PNCC processing also provides better recognition accuracy in noisy environments than techniques such as Vector Taylor Series (VTS) and the ETSI Advanced Front End (AFE) while requiring much less computation. We describe an implementation of PNCC using "online processing" that does not require future knowledge of the input.

Index Terms—Robust speech recognition, feature extraction, physiological modeling, rate-level curve, power function, asymmetric filtering, medium-time power estimation, spectral weight smoothing, temporal masking, modulation filtering, online speech processing

EDICS Category: SPE-ROBU, SPE-SPER

I. INTRODUCTION

In recent decades following the introduction of hidden Markov models (e.g., [1]) and statistical language models (e.g., [2]), the performance of speech recognition systems in benign acoustical environments has dramatically improved. Nevertheless, most speech recognition systems remain sensitive to the nature of the acoustical environments within which they are deployed, and their performance deteriorates sharply in the presence of sources of degradation such as additive noise, linear channel distortion, and reverberation.

One of the most challenging contemporary problems is that recognition accuracy degrades significantly if the test environment is different from the training environment and/or if the acoustical environment includes disturbances such as additive noise, channel distortion, speaker differences, reverberation, and so on.

Chanwoo Kim (Corresponding Author) is with The Microsoft Corporation, Redmond, WA 98052 USA (email: [email protected]). Richard M. Stern is with the Language Technologies Institute and the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA (email: [email protected]).

Manuscript received XXXXX, XXXXX; revised XXXXX, XXXXX.

Over the years, dozens if not hundreds of algorithms have been introduced to address this problem. Many of these conventional noise compensation algorithms have provided substantial improvement in accuracy for recognizing speech in the presence of quasi-stationary noise (e.g., [3], [4], [5], [6], [7], [8], [9], [10]). Unfortunately, these same algorithms frequently do not provide significant improvements in more difficult environments with transitory disturbances such as a single interfering speaker or background music (e.g., [11]).

Many of the current systems developed for automatic speech recognition, speaker identification, and related tasks are based on variants of one of two types of features: mel frequency cepstral coefficients (MFCC) [12] or perceptual linear prediction (PLP) coefficients [13]. Spectro-temporal features have also been recently introduced with promising results (e.g., [14], [15]). It has been observed that two-dimensional Gabor filters provide a reasonable approximation to the spectro-temporal response fields of neurons in the auditory cortex, which has led to various approaches to extract features for speech recognition (e.g., [16], [17], [18], [19]). In this paper we describe the development of an additional feature set for speech recognition which we refer to as power-normalized cepstral coefficients (PNCC).

We introduced several previous versions of PNCC processing in [20] and [21], and these implementations have been evaluated by several teams of researchers and compared to several different algorithms including zero crossing peak amplitude (ZCPA) [22], RASTA-PLP [23], perceptual minimum variance distortionless response (PMVDR) [24], invariant-integration features (IIF) [25], and subband spectral centroid histograms (SSCH) [26]. As described in several papers (e.g., [27], [28], [29], [30], [31]), PNCC has been shown to provide better speech recognition accuracy than the other algorithms cited above, particularly in conditions of training that is mismatched across environments. For example, Müller and Mertins [32] found that PNCC provides better results than the original IIF features, but if IIF is combined with PNCC (PN-IIF), the result is somewhat better than the original PNCC. Similar results had been obtained with delta-spectral cepstral coefficients (DSCC) [33] as well. Our previous implementations of PNCC have also been employed in industry [34]. In selected other studies, portions of PNCC processing have been incorporated into other feature extraction algorithms (e.g., [35], [36]). Even though previous implementations of PNCC processing appear to be promising, a major problem is that they cannot be easily implemented for online applications without look-ahead over an entire sentence. In addition, previous implementations of PNCC did not consider the effects of temporal masking, as is the case for MFCC and PLP processing.

The implementation of PNCC processing in the present paper has been significantly revised to address these issues in a fashion that enables it to provide superior recognition accuracy over a broad range of conditions of noise and reverberation using features that are computable in real time using "online" algorithms that do not require extensive look-ahead, and with a computational complexity that is comparable to that of traditional MFCC and PLP features. In the subsequent subsections of this Introduction we discuss the broader motivations and overall structure of PNCC processing. We specify the key elements of the processing in some detail in Sec. II. In Sec. III we compare the recognition accuracy provided by PNCC processing under a variety of conditions with that of other processing schemes, and we consider the impact of various components of PNCC on these results. We compare the computational complexity of the MFCC, PLP, and PNCC feature extraction algorithms in Sec. IV and we summarize our results in the final section.

Fig. 1. Comparison of the structure of the MFCC, PLP, and PNCC feature extraction algorithms. The modules of PNCC that function on the basis of "medium-time" analysis (with a temporal window of 65.6 ms) are plotted in the rightmost column. If the shaded blocks of PNCC are omitted, the remaining processing is referred to as simple power-normalized cepstral coefficients (SPNCC).

A. Broader motivation for the PNCC algorithm

The development of PNCC feature extraction was motivated by a desire to obtain a set of practical features for speech recognition that are more robust with respect to acoustical variability in their native form, without loss of performance when the speech signal is undistorted, and with a degree of computational complexity that is comparable to that of MFCC and PLP coefficients. While many of the attributes of PNCC processing have been strongly influenced by consideration of various attributes of human auditory processing, we have favored approaches that provide pragmatic gains in robustness at small computational cost over approaches that are more faithful to auditory physiology in developing the specific processing that is performed.

Some of the innovations of the PNCC processing that we consider to be the most important include:

• The replacement of the log nonlinearity in MFCC processing by a power-law nonlinearity that is carefully chosen to approximate the nonlinear relation between signal intensity and auditory-nerve firing rate. We believe that this nonlinearity provides superior robustness by suppressing small signals and their variability, as discussed in Sec. II-G.

• The use of "medium-time" processing with a duration of 50-120 ms to analyze the parameters characterizing environmental degradation, in combination with the traditional short-time Fourier analysis with frames of 20-30 ms used in conventional speech recognition systems. We believe that this approach enables us to estimate environmental degradation more accurately while maintaining the ability to respond to rapidly-changing speech signals, as discussed in Sec. II-B.

• The use of a form of "asymmetric nonlinear filtering" to estimate the level of the acoustical background noise for each time frame and frequency bin. We believe that this approach enables us to remove slowly-varying components easily without the need to deal with many of the artifacts associated with over-correction in techniques such as spectral subtraction [37], as discussed in Sec. II-C. As shown in Sec. III-C, this approach is more effective than RASTA processing [23].

• The development of a signal processing block that realizes temporal masking.

• The development of computationally-efficient realizations of the algorithms above that support "online" real-time processing that does not require substantial non-causal look-ahead of the input signal to compute the PNCC coefficients.

B. Structure of the PNCC algorithm

Figure 1 compares the structure of conventional MFCC processing [12], PLP processing [13], [23], and the new PNCC approach which we introduce in this paper. As was noted above, the major innovations of PNCC processing include the redesigned nonlinear rate-intensity function, along with the series of processing elements to suppress the effects of background acoustical activity based on medium-time analysis.

As can be seen from Fig. 1, the initial processing stages of PNCC processing are quite similar to the corresponding stages of MFCC and PLP analysis, except that the frequency analysis is performed using gammatone filters [38]. This is followed by the series of nonlinear time-varying operations that are performed using the longer-duration temporal analysis and that accomplish noise subtraction as well as a degree of robustness with respect to reverberation. The final stages of processing are also similar to MFCC and PLP processing, with the exception of the carefully-chosen power-law nonlinearity with exponent 1/15, which will be discussed in Sec. II-G below. Finally, we note that if the shaded blocks in Fig. 1 are omitted, the processing that remains is referred to as simple power-normalized cepstral coefficients (SPNCC). SPNCC processing has been employed in other studies on robust recognition (e.g., [36]).

II. COMPONENTS OF PNCC PROCESSING

In this section we describe and discuss the major components of PNCC processing in greater detail. While the detailed description below assumes a sampling rate of 16 kHz, the PNCC features are easily modified to accommodate other sampling frequencies.

Fig. 2. The frequency response of a gammatone filterbank with each area of the squared frequency response normalized to be unity. Characteristic frequencies are uniformly spaced between 200 and 8000 Hz according to the Equivalent Rectangular Bandwidth (ERB) scale [39].

A. Initial processing

As in the case of MFCC, a pre-emphasis filter of the form H(z) = 1 - 0.97z^{-1} is applied. A short-time Fourier transform (STFT) is performed using Hamming windows of duration 25.6 ms, with 10 ms between frames, using a DFT size of 1024. Spectral power in 40 analysis bands is obtained by weighting the magnitude-squared STFT outputs for positive frequencies by the frequency response associated with a 40-channel gammatone-shaped filter bank [38] whose center frequencies are linearly spaced in Equivalent Rectangular Bandwidth (ERB) [39] between 200 Hz and 8000 Hz, using the implementation of gammatone filters in Slaney's Auditory Toolbox [40]. In previous work [20] we observed that the use of gammatone frequency weighting provides slightly better ASR accuracy in white noise, but the differences compared to the traditional triangular weights in MFCC processing are small. The frequency response of the gammatone filterbank is shown in Fig. 2. In each channel the area under the squared transfer function is normalized to unity to satisfy the equation:

\int_0^{8000} |H_l(f)|^2 \, df = 1 \qquad (1)

where H_l(f) is the frequency response of the lth gammatone channel. To reduce the amount of computation, we modified the gammatone filter responses slightly by setting H_l(f) equal to zero for all values of f for which the unmodified H_l(f) would be less than 0.5 percent of its maximum value (corresponding to -46 dB).

We obtain the short-time spectral power P[m, l] using the squared gammatone summation as below:

P[m, l] = \sum_{k=0}^{(K/2)-1} \left| X[m, e^{j\omega_k}] \, H_l(e^{j\omega_k}) \right|^2 \qquad (2)

where K is the DFT size, m and l represent the frame and channel indices, respectively, and ω_k = 2πk/F_s, with F_s representing the sampling frequency. X[m, e^{jω_k}] is the short-time spectrum of the mth frame of the signal.
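As an illustration of Eqs. (1) and (2), the following NumPy sketch computes the per-channel spectral power of one frame. It is not the authors' implementation; the precomputed matrix of squared gammatone responses (here called H_sq, e.g., derived from Slaney's Auditory Toolbox and thresholded as described above) and the function name are our own assumptions.

    import numpy as np

    def spectral_power(frame, H_sq, n_fft=1024):
        # frame: one pre-emphasized, Hamming-windowed block of samples (25.6 ms at 16 kHz).
        # H_sq:  array of shape (40, n_fft // 2) holding |H_l(e^{j w_k})|^2 for the
        #        positive-frequency bins, normalized per channel as in Eq. (1).
        X = np.fft.fft(frame, n_fft)[: n_fft // 2]   # positive-frequency STFT bins
        return H_sq @ (np.abs(X) ** 2)               # P[m, l] for l = 0, ..., 39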


B. Temporal integration for environmental analysis

Most speech recognition and speech coding systems use analysis frames of duration between 20 ms and 30 ms. Nevertheless, it is frequently observed that longer analysis windows provide better performance for noise modeling and/or environmental normalization (e.g., [21], [41], [42]), because the power associated with most background noise conditions changes more slowly than the instantaneous power associated with speech.

In PNCC processing we estimate a quantity we refer to as "medium-time power" Q̃[m, l] by computing the running average of P[m, l], the power observed in a single analysis frame, according to the equation:

\tilde{Q}[m, l] = \frac{1}{2M+1} \sum_{m'=m-M}^{m+M} P[m', l] \qquad (3)

where m represents the frame index and l is the channel index. We will apply the tilde symbol to all power estimates that are performed using medium-time analysis.

We observed experimentally that the choice of the temporal integration factor M has a substantial impact on performance in white noise (and presumably other types of broadband background noise). This factor has less impact on the accuracy that is observed in more dynamic interference or reverberation, although the longer temporal analysis window does provide some benefit in these environments as well [43]. We chose the value of M = 2 (corresponding to five consecutive windows with a total net duration of 65.6 ms) on the basis of these observations. Since Q̃[m, l] is the moving average of P[m, l], Q̃[m, l] is a low-pass function of m. If M = 2, the upper frequency is approximately 15 Hz. Nevertheless, if we were to use features based on Q̃[m, l] directly for speech recognition, recognition accuracy would be degraded because onsets and offsets of the frequency components would become blurred. Hence in PNCC, we use Q̃[m, l] only for noise estimation and compensation, which are used to modify the information based on the short-time power estimates P[m, l]. We also apply smoothing over the various frequency channels, which will be discussed in Sec. II-E below.
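A minimal sketch of Eq. (3) in NumPy is shown below; the treatment of frames near the ends of the utterance, where the full (2M+1)-frame window is unavailable, is our own assumption.

    import numpy as np

    def medium_time_power(P, M=2):
        # P: short-time power P[m, l], shape (num_frames, num_channels).
        # Returns the medium-time power Q~[m, l] of Eq. (3).
        num_frames = P.shape[0]
        Q = np.empty_like(P)
        for m in range(num_frames):
            lo, hi = max(0, m - M), min(num_frames, m + M + 1)
            Q[m] = P[lo:hi].mean(axis=0)   # window is truncated at the utterance edges
        return Q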

C. Asymmetric noise suppression

In this section, we discuss a new approach to noise compensation which we refer to as asymmetric noise suppression (ANS). This procedure is motivated by the observation mentioned above that the speech power in each channel usually changes more rapidly than the background noise power in the same channel. Alternately we might say that speech usually has a higher-frequency modulation spectrum than noise. Motivated by this observation, many algorithms have been developed using either high-pass filtering or band-pass filtering in the modulation spectrum domain (e.g., [23], [44]). The simplest way to accomplish this objective is to perform high-pass filtering in each channel (e.g., [45], [46]), which has the effect of removing slowly-varying components which typically represent the effects of additive noise sources rather than the speech signal.

Fig. 3. Functional block diagram of the modules for asymmetric noise suppression (ANS) and temporal masking in PNCC processing. All processing is performed on a channel-by-channel basis. Q̃[m, l] is the medium-time-averaged input power as defined by Eq. (3), R̃[m, l] is the speech output of the ANS module, and S̃[m, l] is the output after temporal masking (which is applied only to the speech frames). The block labelled Temporal Masking is depicted in detail in Fig. 5.

One significant problem with the application of conventional linear high-pass filtering in the power domain is that the filter output can become negative. Negative values for the power coefficients are problematic in the formal mathematical sense (in that power itself is positive). They also cause problems in the application of the compressive nonlinearity and in speech resynthesis unless a suitable floor value is applied to the power coefficients (e.g., [46]). Rather than filtering in the power domain, we could perform filtering after applying the logarithmic nonlinearity, as is done with conventional cepstral mean normalization in MFCC processing. Nevertheless, as will be seen in Sec. III, this approach is not very helpful for environments with additive noise. Spectral subtraction is another way to reduce the effects of noise whose power changes slowly (e.g., [37]). In spectral subtraction techniques, the noise level is typically estimated from the power of non-speech segments (e.g., [37]) or through the use of a continuous-update approach (e.g., [45]). In the approach that we introduce, we obtain a running estimate of the time-varying noise floor using an asymmetric nonlinear filter, and subtract that from the instantaneous power.

Figure 3 is a block diagram of the complete asymmetric nonlinear suppression processing with temporal masking. Let us begin by describing the general characteristics of the asymmetric nonlinear filter that is the first stage of processing. This filter is represented by the following equation for arbitrary input and output Q̃_in[m, l] and Q̃_out[m, l], respectively:

\tilde{Q}_{out}[m, l] =
\begin{cases}
\lambda_a \tilde{Q}_{out}[m-1, l] + (1 - \lambda_a) \tilde{Q}_{in}[m, l], & \text{if } \tilde{Q}_{in}[m, l] \ge \tilde{Q}_{out}[m-1, l] \\
\lambda_b \tilde{Q}_{out}[m-1, l] + (1 - \lambda_b) \tilde{Q}_{in}[m, l], & \text{if } \tilde{Q}_{in}[m, l] < \tilde{Q}_{out}[m-1, l]
\end{cases} \qquad (4)

where m is the frame index and l is the channel index, and λ_a and λ_b are constants between zero and one.

If λ_a = λ_b, it is easy to verify that Eq. (4) reduces to a conventional IIR filter that is lowpass in nature because the values of the λ parameters are positive, as shown in Fig. 4(a). In contrast, if 1 > λ_b > λ_a > 0, the nonlinear filter functions as a conventional "upper" envelope detector, as illustrated in Fig. 4(b). Finally, and most usefully for our purposes, if 1 > λ_a > λ_b > 0, the filter output Q̃_out tends to follow the lower envelope of Q̃_in[m, l], as seen in Fig. 4(c). In our processing, we will use this slowly-varying lower envelope in Fig. 4(c) to serve as a model for the estimated medium-time noise level, and the activity above this envelope is assumed to represent speech activity. Hence, subtracting this low-level envelope from the original input Q̃_in[m, l] will remove a slowly varying non-speech component.

We will use the notation

\tilde{Q}_{out}[m, l] = \mathcal{AF}_{\lambda_a, \lambda_b}\left[\tilde{Q}_{in}[m, l]\right] \qquad (5)

to represent the nonlinear filter described by Eq. (4). We note that this filter operates only on the frame indices m for each channel index l.
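The recursion of Eq. (4) is straightforward to implement; the sketch below is one possible NumPy realization, with the array layout and the optional first-frame initialization argument being our own assumptions (the text initializes Q̃_le with 0.9 Q̃[0, l] and Q̃_f with Q̃_0[0, l]).

    import numpy as np

    def asymmetric_filter(Q_in, lam_a, lam_b, init=None):
        # Asymmetric nonlinear filter AF_{lam_a, lam_b}[.] of Eqs. (4)-(5).
        # Q_in: shape (num_frames, num_channels); filtering runs along the frame axis.
        Q_out = np.empty_like(Q_in)
        Q_out[0] = Q_in[0] if init is None else init
        for m in range(1, Q_in.shape[0]):
            rising = Q_in[m] >= Q_out[m - 1]          # per-channel comparison in Eq. (4)
            lam = np.where(rising, lam_a, lam_b)
            Q_out[m] = lam * Q_out[m - 1] + (1.0 - lam) * Q_in[m]
        return Q_out

Under these assumptions, the lower envelope of Eq. (6) below would be obtained as asymmetric_filter(Q, 0.999, 0.5, init=0.9 * Q[0]).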

Keeping the characteristics of the asymmetric filter described above in mind, we may now consider the structure shown in Fig. 3. In the first stage, the lower envelope Q̃_le[m, l], which represents the average noise power, is obtained by ANS processing according to the equation

\tilde{Q}_{le}[m, l] = \mathcal{AF}_{0.999, 0.5}\left[\tilde{Q}[m, l]\right] \qquad (6)

as depicted in Fig. 4(c). Q̃_le[0, l] is initialized to 0.9 Q̃[0, l]. Q̃_le[m, l] is subtracted from the input Q̃[m, l], effectively highpass filtering the input, and that signal is passed through an ideal half-wave linear rectifier to produce the rectified output Q̃_0[m, l]. The impact of the specific values of the forgetting factors λ_a and λ_b on speech recognition accuracy is discussed below.

The remaining elements of ANS processing in the right-hand side of Fig. 3 (other than the temporal masking block) are included to cope with problems that develop when the rectifier output Q̃_0[m, l] remains zero for an interval, or when the local variance of Q̃_0[m, l] becomes excessively small. Our approach to this problem is motivated by our previous work [21] in which it was noted that applying a well-motivated flooring level to power is very important for noise robustness. In PNCC processing we apply the asymmetric nonlinear filter for a second time to obtain the lower envelope of the rectifier output Q̃_f[m, l], and we use this envelope to establish this floor level. This envelope Q̃_f[m, l] is obtained using asymmetric filtering as before:

\tilde{Q}_f[m, l] = \mathcal{AF}_{0.999, 0.5}\left[\tilde{Q}_0[m, l]\right] \qquad (7)

Fig. 4. Sample inputs (solid curves) and outputs (dashed curves) of the asymmetric nonlinear filter defined by Eq. (4) for conditions when (a) λ_a = λ_b, (b) λ_a < λ_b, and (c) λ_a > λ_b. In this example, the channel index l is 8.

Q̃_f[0, l] is initialized as Q̃_0[0, l]. As shown in Fig. 3, we use the lower envelope of the rectified signal Q̃_f[m, l] as a floor level for the ANS processing output R̃[m, l] after temporal masking:

\tilde{R}_{sp}[m, l] = \max\left(\tilde{Q}_{tm}[m, l], \tilde{Q}_f[m, l]\right) \qquad (8)

where Q̃_tm[m, l] is the temporal masking output depicted in Fig. 3. Temporal masking for speech segments is discussed in Sec. II-D.

We have found that applying lowpass filtering to the signal segments that do not appear to be driven by a periodic excitation function (as in voiced speech) improves recognition accuracy in noise by a small amount. For this reason we use the lower envelope of the rectified signal, Q̃_f[m, l], directly for these non-excitation segments. This operation, which is effectively a further lowpass filtering, is not performed for the speech segments because blurring the power coefficients for speech degrades recognition accuracy.

Excitation/non-excitation decisions for this purpose are obtained for each value of m and l in a very simple fashion:

\text{"excitation segment"} \quad \text{if } \tilde{Q}[m, l] \ge c\,\tilde{Q}_{le}[m, l] \qquad (9a)
\text{"non-excitation segment"} \quad \text{if } \tilde{Q}[m, l] < c\,\tilde{Q}_{le}[m, l] \qquad (9b)

where Q̃_le[m, l] is the lower envelope of Q̃[m, l] as described above, and c is a fixed constant. In other words, a particular value of Q̃[m, l] is not considered to be a sufficiently large excitation if it is less than a fixed multiple of its own lower envelope.


Fig. 5. Block diagram of the components that accomplish temporal masking in Fig. 3.

We observed experimentally that while a broad range of values of λ_b between 0.25 and 0.75 appear to provide reasonable recognition accuracy, the choice of λ_a = 0.9 appears to be best under most circumstances [43]. The parameter values used for the current standard implementation are λ_a = 0.999 and λ_b = 0.5, which were chosen in part to maximize the recognition accuracy in clean speech as well as performance in noise. We also observed (in experiments in which the temporal masking described below was bypassed) that the threshold-parameter value c = 2 provides the best performance for white noise (and presumably other types of broadband noise). The value of c has little impact on performance in background music and in the presence of reverberation.
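Pulling Eqs. (6), (7), and (9) together, one possible sketch of the ANS path (reusing the asymmetric_filter sketch above, and leaving temporal masking to the next subsection) is the following; the return values and defaults are our own choices.

    import numpy as np

    def asymmetric_noise_suppression(Q, lam_a=0.999, lam_b=0.5, c=2.0):
        # Q: medium-time power Q~[m, l], shape (num_frames, num_channels).
        Q_le = asymmetric_filter(Q, lam_a, lam_b, init=0.9 * Q[0])   # noise-floor envelope, Eq. (6)
        Q_0 = np.maximum(Q - Q_le, 0.0)                              # subtraction + half-wave rectification
        Q_f = asymmetric_filter(Q_0, lam_a, lam_b, init=Q_0[0])      # floor of the rectified signal, Eq. (7)
        excitation = Q >= c * Q_le                                   # excitation decision, Eqs. (9a)-(9b)
        return Q_0, Q_f, excitation

The max with Q̃_f of Eq. (8) and the excitation/non-excitation switch are applied after temporal masking, as described in Sec. II-D.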

D. Temporal masking

Many authors have noted that the human auditory system appears to focus more on the onset of an incoming power envelope rather than the falling edge of that same power envelope (e.g., [47], [48]). This observation has led to several onset enhancement algorithms (e.g., [49], [46], [50], [51]). In this section we describe a simple way to incorporate this effect in PNCC processing, by obtaining a moving peak for each frequency channel l and suppressing the instantaneous power if it falls below this envelope.

The processing invoked for temporal masking is depicted in block diagram form in Fig. 5. We first obtain the online peak power Q̃_p[m, l] for each channel using the following equation:

\tilde{Q}_p[m, l] = \max\left(\lambda_t \tilde{Q}_p[m-1, l], \tilde{Q}_0[m, l]\right) \qquad (10)

where λ_t is the forgetting factor for obtaining the online peak. As before, m is the frame index and l is the channel index.

Fig. 6. Demonstration of the effect of temporal masking in the ANS module for speech in simulated reverberation with T60 = 0.5 s (upper panel) and clean speech (lower panel). In this example, the channel index l is 18.

Temporal masking for speech segments is accomplished using the following equation:

\tilde{R}_{sp}[m, l] =
\begin{cases}
\tilde{Q}_0[m, l], & \tilde{Q}_0[m, l] \ge \lambda_t \tilde{Q}_p[m-1, l] \\
\mu_t \tilde{Q}_p[m-1, l], & \tilde{Q}_0[m, l] < \lambda_t \tilde{Q}_p[m-1, l]
\end{cases} \qquad (11)

We have found [43] that if the forgetting factor λ_t is equal to or less than 0.85 and if µ_t ≤ 0.2, recognition accuracy remains almost constant for clean speech and most additive noise conditions, and if λ_t increases beyond 0.85, performance degrades. The value of λ_t = 0.85 also appears to be best in the reverberant condition. For these reasons we use the values λ_t = 0.85 and µ_t = 0.2 in the standard implementation of PNCC. Note that λ_t = 0.85 corresponds to a time constant of 28.2 ms, which means that the offset attenuation lasts approximately 100 ms. This characteristic is in accordance with observed data for humans [52].
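Eqs. (10) and (11) together amount to a decaying per-channel peak tracker that passes rising portions of the rectified power and suppresses falling portions. A minimal sketch follows; the first-frame handling is our own assumption.

    import numpy as np

    def temporal_masking(Q_0, lam_t=0.85, mu_t=0.2):
        # Q_0: rectified power Q~_0[m, l], shape (num_frames, num_channels).
        R_sp = np.empty_like(Q_0)
        R_sp[0] = Q_0[0]
        Q_p_prev = Q_0[0].copy()                                  # online peak state Q~_p[m-1, l]
        for m in range(1, Q_0.shape[0]):
            keep = Q_0[m] >= lam_t * Q_p_prev                     # rising "attack" portions pass through
            R_sp[m] = np.where(keep, Q_0[m], mu_t * Q_p_prev)     # falling portions are suppressed, Eq. (11)
            Q_p_prev = np.maximum(lam_t * Q_p_prev, Q_0[m])       # peak update, Eq. (10)
        return R_sp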

Figure 6 illustrates the effect of this temporal masking. In general, with temporal masking the response of the system is inhibited for portions of the input signal R̃[m, l] other than rising "attack transients". The difference between the signals with and without masking is especially pronounced in reverberant environments, for which the temporal processing module is especially helpful.

The final output of the asymmetric noise suppression and temporal masking modules is R̃[m, l] = R̃_sp[m, l] for the excitation segments and R̃[m, l] = Q̃_f[m, l] for the non-excitation segments.

E. Spectral weight smoothing

In our previous research on speech enhancement and noise compensation techniques (e.g., [20], [21], [41], [53], [54]) it has been frequently observed that smoothing the response across channels is helpful. This is true especially in processing schemes such as PNCC where there are nonlinearities and/or thresholds that vary in their effect from channel to channel, as well as processing schemes that are based on inclusion of responses only from a subset of time frames and frequency channels (e.g., [53]) or systems that rely on missing-feature approaches (e.g., [55]).

From the discussion above, we can represent the combined effects of asymmetric noise suppression and temporal masking for a specific time frame and frequency bin as the transfer function R̃[m, l]/Q̃[m, l]. Smoothing the transfer function across frequency is accomplished by computing the running average over the channel index l of the ratio R̃[m, l]/Q̃[m, l]. Hence, the frequency-averaged weighting function S̃[m, l] (which had previously been subjected to temporal averaging) is given by:

\tilde{S}[m, l] = \frac{1}{l_2 - l_1 + 1} \sum_{l' = l_1}^{l_2} \frac{\tilde{R}[m, l']}{\tilde{Q}[m, l']} \qquad (12)

where l_2 = min(l + N, L) and l_1 = max(l - N, 1), and L is the total number of channels.

The time-averaged, frequency-averaged transfer function S̃[m, l] is used to modulate the original short-time power P[m, l]:

T[m, l] = P[m, l] \, \tilde{S}[m, l] \qquad (13)

In the present implementation of PNCC, we use a value of N = 4 and a total number of L = 40 gammatone channels, again based on empirical optimization from the results of pilot studies [43]. We note that if we were to use a different number of channels L, the optimal value of N would also be different.
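A sketch of Eqs. (12) and (13), with 0-based channel indexing and a small constant added to avoid division by zero (a detail the text does not specify), might look like the following.

    import numpy as np

    def smooth_and_modulate(R, Q, P, N=4, eps=1e-20):
        # R, Q: ANS/temporal-masking output R~[m, l] and medium-time power Q~[m, l].
        # P:    short-time power P[m, l] to be modulated.  All shapes: (frames, channels).
        num_frames, L = Q.shape
        ratio = R / (Q + eps)                                 # per-channel transfer function R~/Q~
        S = np.empty_like(ratio)
        for l in range(L):
            l1, l2 = max(l - N, 0), min(l + N, L - 1)         # channel window of Eq. (12)
            S[:, l] = ratio[:, l1 : l2 + 1].mean(axis=1)      # frequency-averaged weight S~[m, l]
        return P * S                                          # T[m, l] of Eq. (13)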

F. Mean power normalization

In conventional MFCC processing, multiplication of the input signal by a constant scale factor produces only an additive shift of the C_0 coefficient because a logarithmic nonlinearity is included in the processing, and this shift is easily removed by cepstral mean normalization. In PNCC processing, however, the replacement of the log nonlinearity by a power-law nonlinearity as discussed below causes the response of the processing to be affected by changes in absolute power, even though we have observed that this effect is usually small. In order to minimize further the potential impact of amplitude scaling in PNCC we invoke a stage of mean power normalization.

While the easiest way to normalize power would be to divide the instantaneous power by the average power over the utterance, this is not feasible for real-time online processing because of the "look ahead" that would be required. For this reason, we normalize input power in the present online implementation of PNCC by dividing the incoming power by a running average of the overall power. The mean power estimate µ[m] is computed from the simple difference equation:

\mu[m] = \lambda_\mu \mu[m-1] + \frac{1 - \lambda_\mu}{L} \sum_{l=0}^{L-1} T[m, l] \qquad (14)

where m and l are the frame and channel indices, as before, and L represents the number of frequency channels. We use a value of 0.999 for the forgetting factor λ_µ. For the initial value of µ[m], we use the value obtained from the training database. Since the time constant corresponding to λ_µ is around 4.6 seconds, we normally do not need to incorporate a formal voice activity detector (VAD) in conjunction with PNCC if continuous non-speech portions are not longer than 3 to 4 seconds.

Fig. 7. Synapse output for a pure tone input with a carrier frequency of 500 Hz at 60 dB SPL. This synapse output is obtained using the auditory model by Heinz et al. [56].

Fig. 8. Comparison of the onset rate (solid curve) and sustained rate (dashed curve) obtained using the model proposed by Heinz et al. [56]. The curves were obtained by averaging responses over seven frequencies. See text for details.

If silences of longer duration are interspersed with the speech, however, we recommend the use of a VAD in combination with PNCC processing.

The normalized power is obtained directly from the running power estimate µ[m]:

U[m, l] = k \frac{T[m, l]}{\mu[m]} \qquad (15)

where the value of the constant k is arbitrary. In pilot experiments we found that the speech recognition accuracy obtained using the online power normalization described above is comparable to the accuracy that would be obtained by normalizing according to a power estimate that is computed over the entire utterance in offline fashion.
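The running normalization of Eqs. (14) and (15) can be sketched as follows; the initial estimate mu_init stands in for the value obtained from the training database, and the names and defaults are our own.

    import numpy as np

    def mean_power_normalize(T, lam_mu=0.999, mu_init=1.0, k=1.0):
        # T: smoothed power T[m, l], shape (num_frames, num_channels).
        U = np.empty_like(T)
        mu = mu_init
        for m in range(T.shape[0]):
            mu = lam_mu * mu + (1.0 - lam_mu) * T[m].mean()   # running mean power, Eq. (14)
            U[m] = k * T[m] / mu                              # normalized power, Eq. (15)
        return U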

G. Rate-level nonlinearity

Several studies in our group (e.g., [20], [54]) have confirmed the critical importance of the nonlinear function that describes the relationship between incoming signal amplitude in a given frequency channel and the corresponding response of the processing model. This "rate-level nonlinearity" is explicitly or implicitly a crucial part of every conceptual or physiological model of auditory processing (e.g., [57], [58], [59]). In this section we summarize our approach to the development of the rate-level nonlinearity used in PNCC processing.


Fig. 9. Comparison between a human rate-intensity relation using the auditory model developed by Heinz et al. [56], a cube root power-law approximation, an MMSE power-law approximation, and a logarithmic function approximation. Upper panel: Comparison using the pressure (Pa) as the x-axis. Lower panel: Comparison using the sound pressure level (SPL) in dB as the x-axis.

It is well known that the nonlinear curve relating sound pressure level in decibels to the auditory-nerve firing rate is compressive (e.g., [56], [60]). It has also been observed that the average auditory-nerve firing rate exhibits an overshoot at the onset of an input signal. As an example, we compare in Fig. 8 the average onset firing rate versus the sustained rate as predicted by the model of Heinz et al. [56]. The curves in this figure were obtained by averaging the rate-intensity values obtained from sinusoidal tone bursts over seven frequencies: 100, 200, 400, 800, 1600, 3200, and 6400 Hz. For the onset-rate results we partitioned the response into bins of length 2.5 ms, and searched for the bin with maximum rate during the initial 10 ms of the tone burst. To measure the sustained rate, we averaged the response rate between 50 and 100 ms after the onset of the signals. The curves were generated under the assumption that the spontaneous rate is 50 spikes/second. We observe in Fig. 8 that the sustained firing rate (broken curve) is S-shaped with a threshold around 0 dB SPL and a saturating segment that begins at around 30 dB SPL. The onset rate (solid curve), on the other hand, increases continuously without apparent saturation over the conversational hearing range of 0 to 80 dB SPL. We choose to model the onset rate-intensity curve for PNCC processing because of the important role that it appears to play in auditory perception. Figure 9 compares the onset rate-intensity curve depicted in Fig. 8 with various analytical functions that approximate this function. The curves are plotted as a function of dB SPL in the lower panel of the figure and as a function of absolute pressure in Pascals in the upper panel, and the putative spontaneous firing rate of 50 spikes per second is subtracted from the curves in both cases.

The most widely used current feature extraction algorithms are mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients. Both the MFCC and PLP procedures include an intrinsic nonlinearity, which is logarithmic in the case of MFCC and a cube-root power function in the case of PLP analysis. We plot these curves relating the power of the input pressure p to the responses in Fig. 9, using values of the arbitrary scaling parameters that are chosen to provide the best fit to the curve of the Heinz et al. model, resulting in the following equations:

s_{cube} = 4294.1 \, p^{2/3} \qquad (16)
s_{log} = 120.2 \log(p) + 1319.3 \qquad (17)

We note that the exponent of the power function is doubled because we are plotting power rather than pressure. Even though scaling and shifting by fixed constants in Eqs. (16) and (17) do not have any significance in speech recognition systems, we included them in the above equations to fit these curves to the rate-intensity curve in Fig. 9(a). The constants in Eqs. (16) and (17) are obtained using an MMSE criterion for the sound pressure range between 0 dB (20 µPa) and 80 dB (0.2 Pa) from the linear rate-intensity curve in the upper panel of Fig. 8.

We have also observed experimentally [43] that the power-law curve with an exponent of 1/15 for sound pressure provides a reasonably good fit to the physiological data while optimizing recognition accuracy in the presence of noise. We have observed that larger values of the pressure exponent such as 1/5 provide better performance in white noise, but they degrade the recognition accuracy that is obtained for clean speech [43]. We consider the value 1/15 for the pressure exponent to represent a pragmatic compromise that provides reasonable accuracy in white noise without sacrificing recognition accuracy for clean speech, producing the power-law nonlinearity

V[m, l] = U[m, l]^{1/15} \qquad (18)

where again U[m, l] and V[m, l] have the dimensions of power. This curve is closely approximated by the equation

s_{power} = 1389.6 \, p^{0.1264} \qquad (19)

which is also plotted in Fig. 9. The exponent of 0.1264 happens to be the best fit to the Heinz et al. data as depicted in the upper panel of Fig. 8. As before, this estimate was developed in the MMSE sense over the sound pressure range between 0 dB (20 µPa) and 80 dB (0.2 Pa).
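In code, the nonlinearity of Eq. (18) is a one-liner; we show it only so that the later pipeline sketch is self-contained.

    def power_law_nonlinearity(U, exponent=1.0 / 15.0):
        # Power-law rate-level nonlinearity of Eq. (18); U and V both have dimensions of power.
        return U ** exponent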

The power law function was chosen for PNCC processing for several reasons. First, it is a relationship that is not affected in form by multiplying the input by a constant. Second, it has the attractive property that its asymptotic response at very low intensities is zero rather than negative infinity, which reduces variance in the response to low-level inputs such as spectral valleys or silence segments. Finally, the power law has been demonstrated to provide a good approximation to the "psychophysical transfer functions" that are observed in experiments relating the physical intensity of sensation to the perceived intensity using direct magnitude-estimation procedures (e.g., [61]).

Fig. 10. The effects of the asymmetric noise suppression, temporal masking, and the rate-level nonlinearity used in PNCC processing. Shown are the outputs of these stages of processing for clean speech and for speech corrupted by street noise at an SNR of 5 dB when the logarithmic nonlinearity is used without ANS processing or temporal masking (upper panel), and when the power-law nonlinearity is used with ANS processing and temporal masking (lower panel). In this example, the channel index l is 8.

Figure 10 is a final comparison of the effects of the asymmetric noise suppression, temporal masking, channel weighting, and power-law nonlinearity modules discussed in Secs. II-C through II-G. The curves in both panels compare the response of the system in the channel with center frequency 490 Hz to clean speech and speech in the presence of street noise at an SNR of 5 dB. The curves in the upper panel were obtained using conventional MFCC processing, including the logarithmic nonlinearity and without ANS processing or temporal masking. The curves in the lower panel were obtained using PNCC processing, which includes the power-law transformation described in this section, as well as ANS processing and temporal masking. We note that the difference between the two curves representing clean and noisy speech is much greater with MFCC processing (upper panel), especially for times during which the signal is at a low level.
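For orientation only, the sketches above can be chained into an utterance-level flow roughly following Fig. 1. This is our own illustrative composition, not the authors' reference implementation: the final DCT and cepstral mean normalization stages follow the block diagram rather than equations given in the text, and the helper functions are the hedged sketches from the preceding subsections.

    import numpy as np
    from scipy.fft import dct

    def pncc_from_power(P, n_ceps=13):
        # P: short-time gammatone power P[m, l] from Eq. (2), shape (frames, channels).
        Q = medium_time_power(P)                                   # Eq. (3)
        Q_0, Q_f, excitation = asymmetric_noise_suppression(Q)     # Eqs. (6), (7), (9)
        R_sp = np.maximum(temporal_masking(Q_0), Q_f)              # Eqs. (10), (11), (8)
        R = np.where(excitation, R_sp, Q_f)                        # final ANS output, Sec. II-D
        T = smooth_and_modulate(R, Q, P)                           # Eqs. (12), (13)
        U = mean_power_normalize(T)                                # Eqs. (14), (15)
        V = power_law_nonlinearity(U)                              # Eq. (18)
        cepstra = dct(V, type=2, axis=1, norm='ortho')[:, :n_ceps] # DCT, per Fig. 1
        return cepstra - cepstra.mean(axis=0)                      # mean normalization, per Fig. 1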

III. EXPERIMENTAL RESULTS

In this section we present experimental results that are intended to demonstrate the superiority of PNCC processing over competing approaches in a wide variety of acoustical environments. We begin in Sec. III-A with a review of the experimental procedures that were used. We provide some general results for PNCC processing and assess the contributions of its various components in Sec. III-B, and we compare PNCC to a small number of other approaches in Sec. III-C.

It should be noted that in general we selected an algorithm configuration and associated parameter values that provide very good performance over a wide variety of conditions using a single set of parameters and settings, without sacrificing word error rate in clean conditions relative to MFCC processing. In previous work we had described slightly different feature extraction algorithms that provide even better performance for speech recognition in the presence of reverberation [21] and in background music [46], but these approaches do not perform as well as MFCC processing in clean speech. We used five standard testing environments in our work: (1) digitally-added white noise, (2) digitally-added noise that had been recorded live on urban streets, (3) digitally-added single-speaker interference, (4) digitally-added background music, and (5) passage of the signal through simulated reverberation. The street noise was recorded by us on streets with steady but moderate traffic. The masking signal used for single-speaker-interference experiments consisted of other utterances drawn from the same database as the target speech, and background music was selected from music segments from the original DARPA Hub 4 Broadcast News database. The reverberation simulations were accomplished using the Room Impulse Response open source software package [62] based on the image method [63]. The room size used was 3 × 4 × 5 meters, the microphone was placed in the center of the room, the spacing between the target speaker and the microphone was assumed to be 3 meters, and reverberation time was manipulated by changing the assumed absorption coefficients in the room appropriately. These conditions were selected so that interfering additive noise sources of progressively greater difficulty were included, along with basic reverberation effects.

A. Experimental Configuration

The PNCC features described in this paper were evaluated by comparing the recognition accuracy obtained with PNCC to that obtained using MFCC and RASTA-PLP processing. We used the version of conventional MFCC processing implemented as part of sphinx_fe in sphinxbase 0.4.1, both from the CMU Sphinx open source codebase [64]. We used the PLP-RASTA implementation that is available at [65]. In all cases decoding was performed using the publicly-available CMU Sphinx 3.8 system [64] using training from SphinxTrain 1.0. We also compared PNCC with the vector Taylor series (VTS) noise compensation algorithm [4] and the ETSI Advanced Front End (AFE), which has several noise suppression algorithms included [8]. In the case of the ETSI AFE, we excluded the log energy element because this provided better results in our experiments. A bigram language model was used in all the experiments. We used feature vectors of length 39 including delta and delta-delta features. For experiments using the DARPA Resource Management (RM1) database we used subsets of 1600 utterances of clean speech for training and 600 utterances of clean or degraded speech for testing. For experiments based on the DARPA Wall Street Journal (WSJ) 5000-word database we trained the system using the WSJ0 SI-84 training set and tested it on the WSJ0 5K test set.

We typically plot word recognition accuracy, which is 100 percent minus the word error rate (WER), using the standard definition for WER of the number of insertions, deletions, and substitutions divided by the number of words spoken.
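For concreteness, the WER convention used throughout the result figures is simply:

    def word_error_rate(insertions, deletions, substitutions, words_spoken):
        # WER in percent: (I + D + S) / N; the plotted accuracy is 100 - WER.
        return 100.0 * (insertions + deletions + substitutions) / words_spoken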

B. General performance of PNCC in noise and reverberation

In this section we describe the recognition accuracy obtained using PNCC processing in the presence of various types of degradation of the incoming speech signals. Figures 11 and 12 describe the recognition accuracy obtained with PNCC processing in the presence of white noise, street noise, background music, and speech from a single interfering speaker as a function of SNR, as well as in the simulated reverberant environment as a function of reverberation time. These results are plotted for the DARPA RM database in Fig. 11 and for the DARPA WSJ database in Fig. 12. For the experiments conducted in noise we prefer to characterize the improvement in recognition accuracy by the amount of lateral shift of the curves provided by the processing, which corresponds to an increase of the effective SNR. For white noise using the RM task, PNCC provides an improvement of about 12 dB to 13 dB compared to MFCC processing, as shown in Fig. 11. In the presence of street noise, background music, and interfering speech, PNCC provides improvements of approximately 8 dB, 3.5 dB, and 3.5 dB, respectively. We also note that PNCC processing provides considerable improvement in reverberation, especially for longer reverberation times. PNCC processing exhibits similar performance trends for speech from the DARPA WSJ0 database in similar environments, as seen in Fig. 12, although the magnitude of the improvement is diminished somewhat, which is commonly observed as we move to larger databases.

The curves in Figs. 11 and 12 are also organized in a way that highlights the various contributions of the major components. Beginning with baseline MFCC processing, the remaining curves show the effects of adding in sequence (1) the power-law nonlinearity (along with mean power normalization and the gammatone frequency integration), (2) the ANS processing including spectral smoothing, and finally (3) temporal masking. It can be seen from the curves that a substantial improvement can be obtained by simply replacing the logarithmic nonlinearity of MFCC processing by the power-law rate-intensity function described in Sec. II-G. The addition of the ANS processing provides a substantial further improvement for recognition accuracy in noise. Although it is not explicitly shown in Figs. 11 and 12, temporal masking is particularly helpful in improving accuracy for reverberated speech and for speech in the presence of interfering speech.

C. Comparison with other algorithms

Figures 13 and 14 provide comparisons of PNCC processingto the baseline MFCC processing with cepstral mean normal-ization, MFCC processing combined with the vector Taylorseries (VTS) algorithm for noise robustness [4], as well asRASTA-PLP feature extraction [23] and the ETSI AdvancedFront End (AFE) [8]. We compare PNCC processing to MFCCand RASTA-PLP processing because these features are mostwidely used in baseline systems, even though neither MFCCnor PLP features were designed to be robust in the presence

[Figure 11 appears here: panels (a)–(e) plot accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the RM1 task; the curves compare PNCC, the power-law nonlinearity with ANS processing, the power-law nonlinearity alone, and MFCC.]

Fig. 11. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA RM1 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

The experimental conditions used were the same as those used to produce Figs. 11 and 12.



[Figure 12 appears here: panels (a)–(e) plot accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the WSJ0-5k task; the curves compare PNCC, the power-law nonlinearity with ANS processing, the power-law nonlinearity alone, and MFCC.]

Fig. 12. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA WSJ0 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

We note in Figs. 13 and 14 that PNCC provides substantially better recognition accuracy than both MFCC and RASTA-PLP processing for all conditions examined. It also provides recognition accuracy that is better than the combination of MFCC with VTS, and at a substantially lower computational cost than that incurred in implementing VTS.

[Figure 13 appears here: panels (a)–(e) plot accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the RM1 task; the curves compare PNCC, the ETSI AFE, MFCC with VTS, MFCC, and RASTA-PLP.]

Fig. 13. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA RM1 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.

We also note that the VTS algorithm provides little or no improvement over the baseline MFCC performance in difficult environments such as background music, a single-channel interfering speaker, or reverberation.


[Figure 14 appears here: panels (a)–(e) plot accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the WSJ0-5k task; the curves compare PNCC, the ETSI AFE, MFCC with VTS, MFCC, and RASTA-PLP.]

Fig. 14. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA WSJ0 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.


The ETSI Advanced Front End (AFE) [8] generally provides slightly better recognition accuracy than VTS in noisy environments, but the accuracy obtained with the AFE does not approach that obtained with PNCC processing in the most difficult noise conditions. Neither the ETSI AFE nor VTS improves recognition accuracy in reverberant environments compared to MFCC features, while PNCC provides measurable improvements in reverberation, and a closely related algorithm [46] provides even greater recognition accuracy in reverberation (at the expense of somewhat worse performance in clean speech).

IV. COMPUTATIONAL COMPLEXITY

Table I provides estimates of the computational demands of MFCC, PLP, and PNCC feature extraction. (RASTA processing is not included in these tabulations.) As before, we use the standard open-source Sphinx code in sphinx_fe [64] for the implementation of MFCC, and the implementation in [65] for PLP. We assume that the window length is 25.6 ms, that the interval between successive windows is 10 ms, that the sampling rate is 16 kHz, and that a 1024-point FFT is used for each analysis frame.
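A minimal sketch, assuming the frame parameters just listed, of one accounting that reproduces the per-frame counts shared by all three algorithms in Table I (the counting conventions are ours; the paper's exact bookkeeping is not spelled out here):

```python
import math

# Frame parameters stated above: 16-kHz sampling, 25.6-ms window, 1024-point FFT.
fs_hz = 16000
window_s = 0.0256
n_fft = 1024

window_samples = round(fs_hz * window_s)        # 410 -> the 410 multiplies per frame
                                                #        for windowing (and pre-emphasis)
fft_multiplies = n_fft * int(math.log2(n_fft))  # 1024 * 10 = 10240, matching the FFT row
magnitude_squared = n_fft // 2                  # 512 squared magnitudes retained

print(window_samples, fft_multiplies, magnitude_squared)   # 410 10240 512
```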

It can be seen in Table I that, because all three algorithms use 1024-point FFTs, the greatest difference from algorithm to algorithm in the amount of computation required is associated with the spectral integration component. Specifically, the triangular weighting used in the MFCC calculation encompasses a narrower range of frequencies than the trapezoids used in PLP processing, which is in turn considerably narrower than the gammatone filter shapes, and the amount of computation needed for spectral integration is directly proportional to the effective bandwidth of the channels. For this reason, as mentioned in Sec. II-A, we limited the gammatone filter computation to those frequencies for which the filter transfer function is 0.5 percent or more of the maximum filter gain. In Table I, for all spectral integration types, we considered only the portion of each filter whose magnitude is 0.5 percent or more of the maximum filter gain.
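The truncation rule just described can be sketched as follows. The bell-shaped response below is only a stand-in for the actual gammatone magnitude response of Sec. II-A, and the center frequency and bandwidth are illustrative; the point is the 0.5-percent-of-peak cutoff, which keeps the spectral-integration cost proportional to each channel's effective bandwidth.

```python
import numpy as np

def active_bins(freqs_hz, center_hz, bandwidth_hz, rel_threshold=0.005):
    """Mask of FFT bins over which one channel's weighting is applied.

    A Gaussian bump stands in for the gammatone magnitude response;
    bins whose gain falls below 0.5% of the peak are skipped.
    """
    response = np.exp(-0.5 * ((freqs_hz - center_hz) / bandwidth_hz) ** 2)
    return response >= rel_threshold * response.max()

freqs = np.linspace(0.0, 8000.0, 513)   # bin centers of a 1024-point FFT at 16 kHz
mask = active_bins(freqs, center_hz=1000.0, bandwidth_hz=130.0)
print(int(mask.sum()), "of", freqs.size, "bins retained for this channel")
```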

As can be seen in Table I, PLP processing by this tabulation is about 32.9 percent more costly than baseline MFCC processing. PNCC processing is approximately 34.6 percent more costly than MFCC processing and 1.31 percent more costly than PLP processing.
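These percentages follow directly from the per-frame totals in Table I:

```python
# Relative per-frame cost computed from the "Sum" row of Table I.
totals = {"MFCC": 13010, "PLP": 17289, "PNCC": 17516}

plp_vs_mfcc  = 100.0 * (totals["PLP"]  / totals["MFCC"] - 1.0)   # ~32.9 %
pncc_vs_mfcc = 100.0 * (totals["PNCC"] / totals["MFCC"] - 1.0)   # ~34.6 %
pncc_vs_plp  = 100.0 * (totals["PNCC"] / totals["PLP"]  - 1.0)   # ~1.31 %

print(f"{plp_vs_mfcc:.1f}% {pncc_vs_mfcc:.1f}% {pncc_vs_plp:.2f}%")
```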

V. SUMMARY

In this paper we introduce power-normalized cepstral coefficients (PNCC), which we characterize as a feature set that provides better recognition accuracy than MFCC and RASTA-PLP processing in the presence of common types of additive noise and reverberation. PNCC processing is motivated by the desire to develop computationally efficient feature extraction for automatic speech recognition that is based on a pragmatic abstraction of various attributes of auditory processing, including the rate-level nonlinearity, temporal and spectral integration, and temporal masking. The processing also includes a component that implements suppression of various types of common additive noise. PNCC processing requires only about 33 percent more computation than MFCC.


TABLE I
NUMBER OF MULTIPLICATIONS AND DIVISIONS IN EACH FRAME

Item                              MFCC     PLP    PNCC
Pre-emphasis                       410       -     410
Windowing                          410     410     410
FFT                              10240   10240   10240
Magnitude squared                  512     512     512
Medium-time power calculation        -       -      40
Spectral integration               958    4955    4984
ANS filtering                        -       -     200
Equal loudness pre-emphasis          -     512       -
Temporal masking                     -       -     120
Weight averaging                     -       -     120
IDFT                                 -     504       -
LPC and cepstral recursion           -     156       -
DCT                                480       -     480
Sum                              13010   17289   17516

Further details about the motivation for and implementation of PNCC processing are available in [43]. This thesis also includes additional relevant experimental findings, including results obtained for PNCC processing using multi-style training and in combination with speaker-by-speaker MLLR.

Open-source MATLAB code for PNCC may be found at http://www.cs.cmu.edu/~robust/archive/algorithms/PNCC_IEEETran. The code in this directory was used for obtaining the results reported in this paper.

ACKNOWLEDGEMENTS

This research was supported by NSF (Grants IIS-0420866 and IIS-0916918). The authors are grateful to Bhiksha Raj, Kshitiz Kumar, and Mark Harvilla for many helpful discussions. A summary version of part of this paper was published in [66].

REFERENCES

[1] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, New Jersey: PTR Prentice Hall, 1993.

[2] F. Jelinek, Statistical Methods for Speech Recognition (Language, Speech, and Communication). MIT Press, 1998.

[3] A. Acero and R. M. Stern, “Environmental Robustness in Automatic Speech Recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (Albuquerque, NM), vol. 2, Apr. 1990, pp. 849–852.

[4] P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach for environment-independent speech recognition,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, May 1996, pp. 733–736.

[5] P. Pujol, D. Macho, and C. Nadeu, “On real-time mean-and-variance normalization of speech recognition features,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 1, May 2006, pp. 773–776.

[6] R. M. Stern, B. Raj, and P. J. Moreno, “Compensation for environmental degradation in automatic speech recognition,” in Proc. of the ESCA Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, Apr. 1997, pp. 33–42.

[7] R. Singh, R. M. Stern, and B. Raj, “Signal and feature compensation methods for robust speech recognition,” in Noise Reduction in Speech Applications, G. M. Davis, Ed. CRC Press, 2002, pp. 219–244.

[8] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms, European Telecommunications Standards Institute ES 202 050, Rev. 1.1.5, Jan. 2007.

[9] S. Molau, M. Pitz, and H. Ney, “Histogram based normalization in the acoustic feature space,” in IEEE Workshop on Automatic Speech Recognition and Understanding, Nov. 2001, pp. 21–24.

[10] H. Mirsa, S. Ikbal, H. Bourlard, and H. Hermansky, “Spectral entropy based feature for robust ASR,” in IEEE Int. Conf. Acoust., Speech, and Signal Processing, May 2004, pp. 193–196.

[11] B. Raj, V. N. Parikh, and R. M. Stern, “The effects of background music on speech recognition accuracy,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 2, Apr. 1997, pp. 851–854.

[12] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.

[13] H. Hermansky, “Perceptual linear prediction analysis of speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, Apr. 1990.

[14] S. Ganapathy, S. Thomas, and H. Hermansky, “Robust spectro-temporal features based on autoregressive models of Hilbert envelopes,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4286–4289.

[15] M. Heckmann, X. Domont, F. Joublin, and C. Goerick, “A hierarchical framework for spectro-temporal feature extraction,” Speech Communication, vol. 53, no. 5, pp. 736–752, May–June 2011.

[16] N. Mesgarani, M. Slaney, and S. Shamma, “Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations,” IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 920–929, May 2006.

[17] M. Kleinschmidt, “Localized spectro-temporal features for automatic speech recognition,” in INTERSPEECH-2003, Sept. 2003, pp. 2573–2576.

[18] H. Hermansky and F. Valente, “Hierarchical and parallel processing of modulation spectrum for ASR applications,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2008, pp. 4165–4168.

[19] S. Y. Zhao and N. Morgan, “Multi-stream spectro-temporal features for robust speech recognition,” in INTERSPEECH-2008, Sept. 2008, pp. 898–901.

[20] C. Kim and R. M. Stern, “Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction,” in INTERSPEECH-2009, Sept. 2009, pp. 28–31.

[21] ——, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4574–4577.

[22] D.-S. Kim, S.-Y. Lee, and R. M. Kil, “Auditory processing of speech signals for robust speech recognition in real-world noisy environments,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 55–69, 1999.

[23] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.

[24] U. H. Yapanel and J. H. L. Hansen, “A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition,” Speech Communication, vol. 50, no. 2, pp. 142–152, Feb. 2008.

[25] F. Muller and A. Mertins, “Contextual invariant-integration features for improved speaker-independent speech recognition,” Speech Communication, vol. 53, no. 6, pp. 830–841, July 2011.

[26] B. Gajic and K. K. Paliwal, “Robust parameters for speech recognition based on subband spectral centroid histograms,” in Eurospeech-2001, Sept. 2001, pp. 591–594.

[27] F. Kelly and N. Harte, “A comparison of auditory features for robust speech recognition,” in EUSIPCO-2010, Aug. 2010, pp. 1968–1972.

[28] ——, “Auditory features revisited for robust speech recognition,” in International Conference on Pattern Recognition, Aug. 2010, pp. 4456–4459.

[29] J. K. Siqueira and A. Alcaim, “Comparação dos atributos MFCC, SSCH e PNCC para reconhecimento robusto de voz contínua,” in XXIX Simpósio Brasileiro de Telecomunicações, Oct. 2011.

[30] G. Sarosi, M. Mozsary, B. Tarjan, A. Balog, P. Mihajlik, and T. Fegyo, “Recognition of multiple language voice navigation queries in traffic situations,” in COST 2102 International Conference, Sept. 2010, pp. 199–213.

[31] G. Sarosi, M. Mozsary, P. Mihajlik, and T. Fegyo, “Comparison of feature extraction methods for speech recognition in noise-free and in traffic noise environment,” in Speech Technology and Human-Computer Dialogue (SpeD), May 2011, pp. 1–8.

[32] F. Muller and A. Mertins, “Noise robust speaker-independent speech recognition with invariant-integration features using power-bias subtraction,” in INTERSPEECH-2011, Aug. 2011, pp. 1677–1680.


[33] K. Kumar, C. Kim, and R. M. Stern, “Delta-spectral cepstral coefficients for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 4784–4787.

[34] H. Zhang, X. Zhu, T.-R. Su, K.-W. Eom, and J.-W. Lee, “Data-driven lexicon refinement using local and web resources for Chinese speech recognition,” in International Symposium on Chinese Spoken Language Processing, Dec. 2010, pp. 233–237.

[35] A. Fazel and S. Chakrabartty, “Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition,” IEEE Trans. Audio, Speech, Language Processing, Dec. 2011 (to appear).

[36] M. J. Harvilla and R. M. Stern, “Histogram-based subband power warping and spectral averaging for robust speech recognition under matched and multistyle training,” in IEEE Int. Conf. Acoust. Speech Signal Processing, May 2012 (to appear).

[37] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, Apr. 1979.

[38] P. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. H. Allerhand, “Complex sounds and auditory images,” in Auditory and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford, UK: Pergamon Press, 1992, pp. 429–446.

[39] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker’s loudness model,” Acustica – Acta Acustica, vol. 82, pp. 335–345, 1996.

[40] M. Slaney, “Auditory toolbox version 2,” Interval Research Corporation Technical Report, no. 10, 1998. [Online]. Available: http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/

[41] C. Kim and R. M. Stern, “Power function-based power distribution normalization algorithm for robust speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 188–193.

[42] D. Gelbart and N. Morgan, “Evaluating long-term spectral subtraction for reverberant ASR,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, pp. 103–106.

[43] C. Kim, “Signal processing for robust speech recognition motivated by auditory processing,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA USA, Dec. 2010.

[44] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1–3, pp. 117–132, Aug. 1998.

[45] H. G. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 1995, pp. 153–156.

[46] C. Kim and R. M. Stern, “Nonlinear enhancement of onset for robust speech recognition,” in INTERSPEECH-2010, Sept. 2010, pp. 2058–2061.

[47] C. Lemyre, M. Jelinek, and R. Lefebvre, “New approach to voiced onset detection in speech signal and its application for frame error concealment,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2008, pp. 4757–4760.

[48] S. R. M. Prasanna and P. Krishnamoorthy, “Vowel onset point detection using source, spectral peaks, and modulation spectrum energies,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 17, no. 4, pp. 556–565, May 2009.

[49] K. D. Martin, “Echo suppression in a computational model of the precedence effect,” in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 1997.

[50] C. Kim, K. Kumar, and R. M. Stern, “Binaural sound source separation motivated by auditory processing,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 5072–5075.

[51] T. S. Gunawan and E. Ambikairajah, “A new forward masking model and its application to speech enhancement,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2006, pp. 149–152.

[52] W. Jesteadt, S. P. Bacon, and J. R. Lehman, “Forward masking as a function of frequency, masker level, and signal delay,” J. Acoust. Soc. Am., vol. 71, no. 4, pp. 950–962, Apr. 1982.

[53] C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain,” in INTERSPEECH-2009, Sept. 2009, pp. 2495–2498.

[54] C. Kim, K. Kumar, and R. M. Stern, “Robust speech recognition using small power boosting algorithm,” in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 243–248.

[55] B. Raj and R. M. Stern, “Missing-feature methods for robust automatic speech recognition,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, Sept. 2005.

[56] M. G. Heinz, X. Zhang, I. C. Bruce, and L. H. Carney, “Auditory-nerve model for predicting performance limits of normal and impaired listeners,” Acoustics Research Letters Online, vol. 2, no. 3, pp. 91–96, July 2001.

[57] S. Seneff, “A joint synchrony/mean-rate model of auditory speech processing,” J. Phonetics, vol. 16, no. 1, pp. 55–76, Jan. 1988.

[58] J. Tchorz and B. Kollmeier, “A model of auditory perception as front end for automatic speech recognition,” J. Acoust. Soc. Am., vol. 106, no. 4, pp. 2040–2050, 1999.

[59] X. Zhang, M. G. Heinz, I. C. Bruce, and L. H. Carney, “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression,” J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.

[60] ——, “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression,” J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.

[61] S. S. Stevens, “On the psychophysical law,” Psychological Review, vol. 64, no. 3, pp. 153–181, 1957.

[62] S. G. McGovern, “A model for room acoustics,” http://2pi.us/rir.html.

[63] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, April 1979.

[64] CMU Sphinx Consortium. CMU Sphinx Open Source Toolkit for Speech Recognition: Downloads. [Online]. Available: http://cmusphinx.sourceforge.net/wiki/download/

[65] D. Ellis. (2006) PLP and RASTA (and MFCC, and inversion) in MATLAB using melfcc.m and invmelfcc.m. [Online]. Available: http://labrosa.ee.columbia.edu/matlab/rastamat/

[66] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2012 (to appear).

Chanwoo Kim is presently a software development engineer at the Microsoft Corporation. He received a Ph.D. from the Language Technologies Institute of the Carnegie Mellon University School of Computer Science in 2010. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. Dr. Kim’s doctoral research was focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Toward this end he has developed a number of different algorithms for single-microphone, dual-microphone, and multiple-microphone applications, which have been applied to various real-world systems. Between 2003 and 2005 Dr. Kim was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked for EdumediaTek and SK Teletech as an R&D engineer.

Richard Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Electrical and Computer Engineering, Computer Science, and Biomedical Engineering Departments, and the Language Technologies Institute, and a Lecturer in the School of Music. Much of Dr. Stern’s current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. He has also developed sentence parsing and speaker adaptation algorithms for earlier CMU speech systems. In addition to his work in speech recognition, Dr. Stern also maintains an active research program in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the Acoustical Society of America, the 2008–2009 Distinguished Lecturer of the International Speech Communication Association, and a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as General Chair of Interspeech 2006. He is also a member of the IEEE and of the Audio Engineering Society.