Temporal envelope compensation for robust phoneme recognition using modulation spectrum


Authors: Sriram Ganapathy, Samuel Thomas, and Hynek Hermansky

Outline

Introduction
Frequency domain linear prediction
Noise compensation in FDLP
Gain normalization in FDLP
Robustness in telephone channel noise
Feature extraction
Experiments
Phoneme recognition task
TIMIT database
Results
Phoneme recognition in CTS
Relative contribution of various processing stages
Modifications
Results
Summary

Introduction (1/2)

Conventional speech analysis techniques start by estimating the spectral content of relatively short (about 10-20 ms) segments of the signal (the short-term spectrum).

Most of the information contained in these acoustic features relates to formants, which provide important cues for recognition of basic speech units.

It has been shown that important information for phoneme perception lies in the 1-16 Hz range of modulation frequencies. The recognition of consonants, especially the stops, suffers more when the temporal modulations below 16 Hz are filtered out.

Even when the spectral information is limited to four sub-bands, the use of temporal amplitude modulations alone provides good human phoneme recognition.

Introduction (2/2)

However, in the presence of noise, the number of spectral channels needed for good vowel recognition increases, whereas the contribution of temporal modulations remains similar in clean and noisy conditions.

For machine recognition of phonemes in noisy speech, there is considerable benefit in using a larger temporal context for the feature representation of a single phoneme.

Techniques based on deriving long-term modulation frequencies do not preserve fine temporal events such as onsets and offsets, which are important in separating some phoneme classes. On the other hand, signal-adaptive techniques, which try to represent local temporal fluctuation, strongly attenuate higher modulation frequencies, which makes them less effective even in clean conditions.

Frequency domain linear prediction (1/2)

Typically, AR models have been used in speech/audio applications for representing the envelope of the power spectrum of the signal [time domain linear prediction (TDLP)]. This paper utilizes AR models for obtaining smoothed, minimum-phase, parametric models of temporal rather than spectral envelopes.

The input signal is transformed into the frequency domain using the DCT. The full-band DCT is windowed using Bark-spaced windows to yield sub-band DCT components. In each sub-band, the inverse discrete Fourier transform (IDFT) of the DCT coefficients represents the discrete-time analytic signal. Spectral autocorrelations are derived by applying the DFT to the squared magnitude of the analytic signal. These autocorrelations are used for linear prediction. The output of linear prediction is a set of AR model parameters which characterize the sub-band Hilbert envelopes.
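The pipeline described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dct_ii`, `levinson_durbin`, `fdlp_envelope`, `synthetic_burst`) are hypothetical, a plain rectangular slice of DCT coefficients (`band`) stands in for the Bark-spaced windows, and the autocorrelation is computed directly from the DCT coefficients as a shortcut for the IDFT, squared magnitude, DFT route.

```python
import cmath
import math
import random


def dct_ii(x):
    """Unnormalized type-II DCT, O(N^2); adequate for a short demo signal."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def levinson_durbin(r, order):
    """Solve the LP normal equations. Returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err


def fdlp_envelope(x, order=10, band=None):
    """AR model of the temporal envelope of x, sampled at len(x) time points.

    band is an assumed (lo, hi) slice of DCT indices, a rectangular
    stand-in for one Bark-spaced window.
    """
    n = len(x)
    c = dct_ii(x)
    if band is not None:
        c = c[band[0]:band[1]]
    # Autocorrelation of the DCT sequence (lags 0..order).
    r = [sum(c[i] * c[i + lag] for i in range(len(c) - lag))
         for lag in range(order + 1)]
    a, gain = levinson_durbin(r, order)
    env = []
    for t in range(n):
        w = math.pi * (t + 0.5) / n         # time t maps to angle w
        denom = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        env.append(gain / abs(denom) ** 2)
    return env


def synthetic_burst(n=200, center=60, width=6.0, seed=0):
    """Test signal: a modulated Gaussian burst plus a small noise floor."""
    rng = random.Random(seed)
    return [math.exp(-0.5 * ((t - center) / width) ** 2) * math.cos(0.9 * t)
            + 0.01 * rng.gauss(0.0, 1.0) for t in range(n)]


if __name__ == "__main__":
    env = fdlp_envelope(synthetic_burst())
    print("envelope peak at sample", max(range(len(env)), key=lambda t: env[t]))
```

Because the linear prediction runs over DCT coefficients instead of time samples, the all-pole "spectrum" it produces is a function of time, which is why the envelope is evaluated at one angle per input sample.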

Frequency domain linear prediction (2/2)

Noise compensation in FDLP (1/2)

When a speech signal is corrupted by additive noise:

$x[n] = s[n] + w[n]$

Assuming that the speech and noise are uncorrelated:

$P_X(m, \omega_k) = P_S(m, \omega_k) + P_W(m, \omega_k)$

where $P_X$, $P_S$, and $P_W$ are the short-term power spectral densities (PSD) of the noisy speech, clean speech, and noise in frame $m$ at frequency $\omega_k$.

Conventional feature extraction techniques for ASR estimate the short-term (10-30 ms) PSD of speech on the Bark or mel scale. Hence, most of the recently proposed noise-robust feature extraction techniques apply some kind of spectral subtraction, in which an estimate of the noise PSD is subtracted from the noisy speech PSD.

A voice activity detector (VAD) operates on the input speech signal to indicate the presence of non-speech frames.
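As an illustration of the spectral subtraction idea described on this slide, the sketch below averages the PSD over the frames a VAD marks as non-speech and subtracts that estimate from every frame. The function names, the list-of-lists frame layout, and the `floor` parameter (which keeps the result positive) are assumptions made for the example and do not come from the slides.

```python
def estimate_noise_psd(psd_frames, is_speech):
    """Average the PSD over frames flagged as non-speech by a VAD."""
    noise_frames = [f for f, speech in zip(psd_frames, is_speech) if not speech]
    if not noise_frames:
        raise ValueError("VAD marked no non-speech frames; cannot estimate noise")
    n_bins = len(noise_frames[0])
    return [sum(f[k] for f in noise_frames) / len(noise_frames)
            for k in range(n_bins)]


def spectral_subtraction(psd_frames, is_speech, floor=0.01):
    """Subtract the noise PSD estimate from each noisy frame.

    Each bin is clamped to floor * (noisy value) so the PSD stays positive.
    """
    noise = estimate_noise_psd(psd_frames, is_speech)
    return [[max(p - nk, floor * p) for p, nk in zip(frame, noise)]
            for frame in psd_frames]


if __name__ == "__main__":
    noisy = [[1.0, 2.0], [1.0, 2.0], [5.0, 6.0]]   # frames x frequency bins
    vad = [False, False, True]                     # only the last frame is speech
    print(spectral_subtraction(noisy, vad))
```

The subtraction relies on the uncorrelated-noise assumption stated above: only then does the noisy PSD split into a clean-speech term plus a noise term that can be estimated separately.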

Noise compensation in FDLP (2/2)

Long segments of the input speech signal are transformed to the DCT domain, where they are decomposed into sub-band DCT components. The discrete-time analytic signal is obtained as the IDFT of the sub-band DCT signal, and its squared magnitude gives the Hilbert envelope. Short-term noise subtraction is applied to the analytic signal.

The proposed approach operates like an envelope normalization procedure as opposed to a noise removal technique.

Gain normalization in FDLP

In reverberant environments, the speech signal that reaches the microphone is superimposed with multiple reflected versions of the original speech signal.

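The reverberant speech is modeled as the convolution of the original speech with the room impulse response:

$r[n] = s[n] * h[n]$

where $r[n]$ is the reverberant speech, $s[n]$ is the original speech, and $h[n]$ is the impulse response.

For typical room impulse responses, the analytic signal factors into a slowly varying, positive envelope function and an instantaneous phase function.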
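The slides do not spell out how the gain is normalized. One common reading, assumed here, is that the gain of the AR model is set to unity so that only the pole structure shapes the envelope. The sketch below (hypothetical function names, standard library only) checks the property this would rely on: scaling the autocorrelations by a constant changes the model gain but leaves the AR coefficients, and hence the gain-normalized envelope, unchanged.

```python
import cmath
import math


def levinson_durbin(r, order):
    """Solve the LP normal equations. Returns (a, gain) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def ar_envelope(a, gain, n_points):
    """Evaluate gain / |A(e^{jw})|^2 on n_points angles in (0, pi)."""
    env = []
    for t in range(n_points):
        w = math.pi * (t + 0.5) / n_points
        denom = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        env.append(gain / abs(denom) ** 2)
    return env


def gain_normalized_envelope(r, order, n_points):
    """Assumed form of gain normalization: discard the gain, use 1.0."""
    a, _gain = levinson_durbin(r, order)
    return ar_envelope(a, 1.0, n_points)


if __name__ == "__main__":
    r = [1.0, 0.5, 0.2]
    scaled = [4.0 * v for v in r]          # same shape, four times the energy
    print(levinson_durbin(r, 2))
    print(levinson_durbin(scaled, 2))
```

Under this assumption, any distortion that acts mainly as a scale factor on a sub-band envelope is removed by the normalization, while the envelope shape encoded in the poles is kept.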

Robustness in telephone channel noise

When a speech signal is passed through a telephone channel:

$x[n] = s[n] * c[n] + w[n]$

where $c[n]$ is the channel noise (a convolutive distortion) and $w[n]$ is the additive noise.

In such conditions, the combination of noise compensation and gain normalization provides suppression of additive and convolutive distortions.

Feature extraction (1/2)

[Figure labels: logarithm, time constants]

Feature extraction (2/2)

Experiments (1/3)

MRASTA: multi-resolution RASTA
LDMN: log-DFT mean normalization
LTLSS: long-term log spectral subtraction

Experiments (2/3)

gammatone frequency cepstral coefficients

Experiments (3/3)

The Hilbert envelopes form an improved representation compared to short-term critical band energy trajectories.

Summary

The application of linear prediction in the frequency domain forms an efficient method for deriving sub-band modulations.

The two-stage compression scheme of deriving static and dynamic modulation spectrum results in good phoneme recognition for all phoneme classes even in the presence of noise.

The noise compensation technique provides a way to derive a robust representation of speech in almost all types of noise and SNR conditions.

The robustness of the proposed features is further enhanced by the application of the gain normalization technique.

These envelope normalization techniques provide substantial improvements in noisy conditions over the previous work.
