USING COMPUTATIONAL MODELS OF BINAURAL HEARING TO IMPROVE AUTOMATIC SPEECH RECOGNITION: Promise, Progress, and Problems

Richard Stern

Department of Electrical and Computer Engineering and School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213

Telephone: (412) 268-2535; FAX: (412) 268-3890
Email: [email protected]; http://www.ece.cmu.edu/~rms

AFOSR Workshop on Computational Audition
August 9, 2002

Carnegie Mellon Slide 2 ECE and SCS Robust Speech Group

Introduction - using binaural processing to improve speech recognition accuracy

• We’re doing better than I expected, but there is still a lot to be done
• Talk outline …
  – Briefly review some relevant binaural phenomena
  – Talk briefly about “classical” modeling approaches
  – Discuss some specific implementations of binaural processors that are particularly relevant to automatic speech recognition (ASR)
  – Comment a bit on results and future prospects

Carnegie Mellon Slide 3 ECE and SCS Robust Speech Group

I’ll focus on …

• Studies that
  – are based on the binaural temporal representation and
  – are applied in a meaningful fashion to automatic speech recognition
• Will not consider …
  – Other types of multiple-microphone processing
    » Fixed and adaptive beamforming
    » Noise cancellation using adaptive arrays
  – Applications to source localization and separation
  – Applications to the hearing impaired
  – Other ASR applications using other approaches (like sub-band coherence-based enhancement)

Carnegie Mellon Slide 4 ECE and SCS Robust Speech Group

How can two (or more) ears help us?

• The binaural system can focus attention on a single speaker in a complex acoustic scene

Carnegie Mellon Slide 5 ECE and SCS Robust Speech Group

Another big binaural advantage …

• The binaural system can suppress reflected components of sound in a reverberant environment

Carnegie Mellon Slide 6 ECE and SCS Robust Speech Group

Primary binaural phenomena

• Interaural time delays (ITDs)
• Interaural intensity differences (IIDs)
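
As a rough illustration of these two cues, the sketch below estimates an ITD as the lag that maximizes the cross-correlation between the ear signals, and an IID as the ratio of their energies in dB. The function names, sign conventions, sampling rate, and synthetic test signal are all assumptions made for this example, not details from the talk.

```python
import math
import random

def cross_correlation(left, right, lag):
    """Sum over t of left[t] * right[t + lag], using only in-range samples."""
    n = min(len(left), len(right))
    start = max(0, -lag)
    stop = min(n, n - lag)
    return sum(left[t] * right[t + lag] for t in range(start, stop))

def estimate_itd(left, right, fs, max_lag):
    """ITD in seconds; positive when the right-ear signal lags the left."""
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda lag: cross_correlation(left, right, lag))
    return best_lag / fs

def estimate_iid(left, right):
    """IID in dB; positive when the left-ear signal is more intense."""
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10(energy_left / energy_right)

# Synthetic test signal (an assumption): noise that reaches the right ear
# 8 samples later and at half the amplitude.
fs = 16000
rng = random.Random(0)
left = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
delay = 8
right = [0.0] * delay + [0.5 * x for x in left[:-delay]]

itd = estimate_itd(left, right, fs, max_lag=16)
iid = estimate_iid(left, right)
print(f"ITD = {itd * 1e6:.0f} us, IID = {iid:.1f} dB")
```

With an 8-sample delay at 16 kHz the estimated ITD is 500 µs, and halving the amplitude gives an IID of about 6 dB.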

Carnegie Mellon Slide 7 ECE and SCS Robust Speech Group

The classical model of binaural processing (Colburn and Durlach, 1978)

Carnegie Mellon Slide 8 ECE and SCS Robust Speech Group

Jeffress’s model of ITD extraction (1948)

Carnegie Mellon Slide 9 ECE and SCS Robust Speech Group

Response to a 500-Hz tone with –1500-µs ITD
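
A minimal sketch of this case, assuming an idealized Jeffress-style bank of coincidence detectors that simply cross-correlates the two ear signals over a range of internal delays, with no peripheral filtering or rectification. It takes the ITD to be -1500 µs; the function name, the delay grid, and the convention right(t) = left(t - ITD) are assumptions for illustration.

```python
import math

def interaural_correlation(freq_hz, itd_us, delays_us, fs=100_000, n_samples=2000):
    """Time-averaged product left(t) * right(t + tau) for each internal delay tau."""
    itd = itd_us * 1e-6
    output = []
    for tau_us in delays_us:
        tau = tau_us * 1e-6
        total = 0.0
        for n in range(n_samples):
            t = n / fs
            left = math.sin(2.0 * math.pi * freq_hz * t)
            # Assumed convention: right(t) = left(t - ITD), so this is right(t + tau)
            right = math.sin(2.0 * math.pi * freq_hz * (t + tau - itd))
            total += left * right
        output.append(total / n_samples)
    return output

# Internal delays from -2000 to +2000 microseconds in 100-microsecond steps
delays_us = list(range(-2000, 2001, 100))
response = interaural_correlation(500.0, -1500.0, delays_us)

# Local maxima of the response along the internal-delay axis
peaks_us = [delays_us[i] for i in range(1, len(delays_us) - 1)
            if response[i] > response[i - 1] and response[i] > response[i + 1]]
print(peaks_us)  # prints [-1500, 500]
```

Because a 500-Hz tone repeats every 2000 µs, the response peaks at the true ITD and again one period later at +500 µs. A pure tone alone therefore leaves the delay ambiguous.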

Carnegie Mellon Slide 10 ECE and SCS Robust Speech Group

Response to 500-Hz noise with –1500-µs ITD

Carnegie Mellon Slide 11 ECE and SCS Robust Speech Group

Other important binaural phenomena

• Binaural “sluggishness”
  – Temporal integration “blurs” effects of running time
• Information from head-related transfer functions (HRTFs)
  – Impetus for significant work in externalization and virtual environments
  – Also enables analysis of relationships between ITDs and IIDs
• The precedence effect
  – First-arriving wavefront has greatest impact on localization
• Secondary “slipped-cycle” effects and “straightness weighting”
  – Localization mechanisms also responsive to consistency over frequency

Carnegie Mellon Slide 12 ECE and SCS Robust Speech Group

Some of the groups involved

• Pittsburgh - Stern, Sullivan, Palm, Raj, Seltzer et al.
• Bochum - Blauert, Lindemann, Gaik, Bodden et al.
• Oldenburg - Kollmeier, Peissig, Kleinschmidt, Schortz et al.
• Dayton - Anderson, DeSimio, Francis
• Sheffield - Green, Cooke, Brown (and Ellis, Wang, Roman) et al.

Carnegie Mellon Slide 13 ECE and SCS Robust Speech Group

Typical elements of binaural models for ASR

• Peripheral processing
  – HRTFs or explicit extraction of ITDs and IIDs vs. frequency band
• Model of auditory transduction
  – Prosaic (BPF, rectification, nonlinear compression) or AIM
• Interaural timing comparison
  – Direct (cross-correlation, stereausis, etc.) or enhanced for precedence (à la Lindemann)
• Time-intensity interaction
  – Use of interaural intensity information to reinforce/vitiate temporal information (e.g. Gaik, Peissig)
• Possible restoration of “missing” features
• Feature extraction of enhanced display
• Decision making (Bayesian or using neural networks)
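
The “prosaic” model of auditory transduction listed above (band-pass filtering, rectification, nonlinear compression) can be sketched for a single channel as follows. The second-order band-pass design, the Q value, half-wave rectification, and the square-root compression are assumptions chosen to keep the example short, not the specific choices of any system in the talk.

```python
import math

def bandpass(signal, fs, centre_hz, q=4.0):
    """Second-order band-pass filter (standard biquad, unity gain at centre_hz)."""
    w0 = 2.0 * math.pi * centre_hz / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha            # the b1 coefficient is zero for this design
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def transduce(signal, fs, centre_hz, exponent=0.5):
    """Band-pass filter, half-wave rectify, then apply power-law compression."""
    filtered = bandpass(signal, fs, centre_hz)
    rectified = [max(v, 0.0) for v in filtered]
    return [v ** exponent for v in rectified]

# A 500-Hz tone analysed by two channels (centre frequencies are illustrative)
fs = 16000
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(1600)]
mean_rate = {}
for centre_hz in (500.0, 2000.0):
    out = transduce(tone, fs, centre_hz)
    mean_rate[centre_hz] = sum(out) / len(out)
print(mean_rate)
```

The channel centred on the tone frequency gives the larger mean output. In a binaural model the same stage would run on both ear signals in every frequency band before the interaural timing comparison.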

Carnegie Mellon Slide 14 ECE and SCS Robust Speech Group

Some (old) work from CMU: correlation-based ASR motivated by binaural hearing

Carnegie Mellon Slide 15 ECE and SCS Robust Speech Group

The good news: vowel representations improved by correlation processing

• Reconstructed features of vowel /a/

[Figure panels: two inputs, zero delay; two inputs, 120-µs delay; eight inputs, 120-µs delay]

• But the bad news is that error rates in real environments go down only a small amount, with a lot more processing

Carnegie Mellon Slide 16 ECE and SCS Robust Speech Group

The Lindemann model to accomplish the precedence effect

[Figure labels: Blauert cross-correlation; Lindemann inhibition]

Carnegie Mellon Slide 17 ECE and SCS Robust Speech Group

Sharpening effect of Lindemann inhibition

Comment: Also observe precedence phenomena (as expected) and a natural time-intensity trade.

Carnegie Mellon Slide 18 ECE and SCS Robust Speech Group

Other techniques used by the Bochum group

• Gaik
  – Collected statistics of ITDs and IIDs of signals through HRTF filters
  – Used statistics to estimate joint pdf of ITD and IID, conditioned on source location
• Bodden
  – Detected source location and implemented source separation algorithm by differentially weighting different frequency bands
• Comment: Oldenburg group has developed a similar model (that differs in many details), but without the Lindemann inhibition
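
The idea attributed to Gaik above, a joint pdf of ITD and IID conditioned on source location, can be illustrated with a toy histogram estimator. This is only a sketch under stated assumptions: the training samples are synthetic Gaussian draws rather than measurements through HRTF filters, and the location names, cue means, and bin widths are hypothetical.

```python
import math
import random
from collections import Counter

def bin_index(itd_us, iid_db, itd_step=100.0, iid_step=2.0):
    """Map an (ITD, IID) pair to a cell of a 2-D histogram."""
    return (math.floor(itd_us / itd_step), math.floor(iid_db / iid_step))

def fit_joint_histogram(samples):
    """Estimate P(ITD cell, IID cell | location) from (ITD, IID) samples."""
    counts = Counter(bin_index(itd, iid) for itd, iid in samples)
    total = len(samples)
    return {cell: count / total for cell, count in counts.items()}

def most_likely_location(models, itd_us, iid_db, floor=1e-6):
    """Pick the location whose joint histogram gives the cue pair most mass."""
    cell = bin_index(itd_us, iid_db)
    return max(models, key=lambda loc: models[loc].get(cell, floor))

# Synthetic training data (an assumption): mean ITD in microseconds and
# mean IID in dB for three hypothetical source locations.
rng = random.Random(0)
cue_means = {"left": (-400.0, -6.0), "center": (0.0, 0.0), "right": (400.0, 6.0)}
models = {}
for location, (mean_itd, mean_iid) in cue_means.items():
    samples = [(rng.gauss(mean_itd, 60.0), rng.gauss(mean_iid, 1.5))
               for _ in range(2000)]
    models[location] = fit_joint_histogram(samples)

print(most_likely_location(models, 380.0, 5.5))
```

A cue pair that is plausible under one location model and implausible under the others selects that location.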

Carnegie Mellon Slide 19 ECE and SCS Robust Speech Group

Missing-feature recognition

• General approach:
  – Determine which cells of a spectrogram-like display are unreliable (or “missing”)
  – Ignore missing features or make best guess about their values based on data that are present
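
The “ignore missing features” option can be sketched as marginalization in a classifier with diagonal-covariance Gaussian class models: the log-likelihood is summed only over the features marked reliable. The class models, feature values, and mask below are toy numbers invented for illustration. The reconstruction option (“best guess”) is not shown.

```python
import math

def masked_log_likelihood(features, mask, means, variances):
    """Diagonal-Gaussian log-likelihood using only features marked reliable."""
    total = 0.0
    for x, reliable, mu, var in zip(features, mask, means, variances):
        if reliable:
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total

def classify(features, mask, models):
    """Return the class name with the highest masked log-likelihood."""
    return max(models,
               key=lambda name: masked_log_likelihood(features, mask, *models[name]))

# Toy class models (assumptions): (means, variances) per feature
models = {
    "class_a": ([1.0, 5.0, 2.0, 8.0], [1.0, 1.0, 1.0, 1.0]),
    "class_b": ([4.0, 1.0, 6.0, 3.0], [1.0, 1.0, 1.0, 1.0]),
}

# A class_a observation whose second and fourth features were corrupted
observed = [1.1, 1.0, 2.2, 3.0]
all_reliable = [True, True, True, True]
mask = [True, False, True, False]

print(classify(observed, all_reliable, models))  # corrupted cells mislead: class_b
print(classify(observed, mask, models))          # corrupted cells ignored: class_a
```

With every cell trusted, the two corrupted values pull the decision to the wrong class. Once they are marked missing, the remaining features recover the correct one.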

Carnegie Mellon Slide 20 ECE and SCS Robust Speech Group

Original speech spectrogram

Carnegie Mellon Slide 21 ECE and SCS Robust Speech Group

Spectrogram corrupted by white noise at SNR 15 dB

• Some regions are affected far more than others

Carnegie Mellon Slide 22 ECE and SCS Robust Speech Group

Ignoring regions in the spectrogram that are corrupted by noise

• All regions with SNR less than 0 dB deemed missing (dark blue)
• Recognition performed based on colored regions alone
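
The 0-dB rule above amounts to a per-cell threshold on local SNR. The sketch below assumes the speech and noise power in each time-frequency cell are known separately, which is true only in simulation (an “oracle” mask). The function name, the small guard constant, and the toy power values are assumptions.

```python
import math

def oracle_mask(speech_power, noise_power, threshold_db=0.0, eps=1e-12):
    """Mark each time-frequency cell reliable (True) if its local SNR >= threshold."""
    mask = []
    for speech_row, noise_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(speech_row, noise_row):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(local_snr_db >= threshold_db)
        mask.append(row)
    return mask

# Toy 2 x 2 "spectrograms" of power values (rows are frames, columns are bands)
speech_power = [[4.0, 0.5], [0.1, 9.0]]
noise_power = [[1.0, 1.0], [1.0, 1.0]]

mask = oracle_mask(speech_power, noise_power)
print(mask)  # prints [[True, False], [False, True]]
```

Cells marked False are the ones treated as missing; recognition then uses only the cells marked True.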

Carnegie Mellon Slide 23 ECE and SCS Robust Speech Group

Recognition accuracy using compensated cepstra, speech corrupted by white noise

• Large improvements in recognition accuracy can be obtained by reconstruction of corrupted regions of noisy speech spectrograms
• Knowledge of locations of “missing” features needed

[Figure: accuracy (%) vs. SNR (dB) from 0 to 25 dB; curves: Cluster Based Recon., Temporal Correlations, Spectral Subtraction, Baseline]

Carnegie Mellon Slide 24 ECE and SCS Robust Speech Group

Recognition accuracy using compensated cepstra, speech corrupted by music

• Recognition accuracy goes up from 7% to 69% at 0 dB with cluster-based reconstruction

[Figure: accuracy (%) vs. SNR (dB) from 0 to 25 dB; curves: Cluster Based Recon., Temporal Correlations, Spectral Subtraction, Baseline]

Carnegie Mellon Slide 25 ECE and SCS Robust Speech Group

Latest system from the Oldenburg group

• Peripheral processing:
  – Gammatone filters
  – Envelope extraction, lowpass filtering
  – Nonlinear temporal adaptation
  – Lowpass filtering
• Binaural processing:
  – Direct running cross-correlation (no inhibition)
  – Learning of corresponding ITD, IID using a neural network
  – Feature extraction from representation in “look direction”
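
A direct running cross-correlation without inhibition, as named in the list above, can be sketched as a set of lag-specific products smoothed over time. The exponential smoothing window, its constant `alpha`, and the lag range are assumptions for illustration; the slide does not specify how the Oldenburg system windows the correlation.

```python
import random

def running_cross_correlation(left, right, max_lag, alpha=0.99):
    """Exponentially smoothed cross-correlation, updated sample by sample.

    Returns the list of lags and, for each time step, the smoothed
    correlation at every lag. A peak at lag k means right lags left by k samples.
    """
    lags = list(range(-max_lag, max_lag + 1))
    state = [0.0] * len(lags)
    history = []
    for t in range(2 * max_lag, len(left)):
        centre = t - max_lag  # keeps centre + lag inside [0, t] (causal)
        for i, lag in enumerate(lags):
            product = left[centre] * right[centre + lag]
            state[i] = alpha * state[i] + (1.0 - alpha) * product
        history.append(list(state))
    return lags, history

# Synthetic input (an assumption): noise arriving 5 samples later at the right ear
rng = random.Random(1)
left = [rng.uniform(-1.0, 1.0) for _ in range(1500)]
delay = 5
right = [0.0] * delay + left[:-delay]

lags, history = running_cross_correlation(left, right, max_lag=10)
final = history[-1]
peak_lag = lags[final.index(max(final))]
print(peak_lag)
```

The smoothed correlation settles on a peak at the imposed 5-sample delay. A smaller `alpha` tracks changes faster but gives a noisier estimate.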

Carnegie Mellon Slide 26 ECE and SCS Robust Speech Group

Sample results from the Oldenburg group (Kleinschmidt et al. 2001)

[Figure, two panels: anechoic environment; “moderate” reverberation. Each plots % correct (0 to 100) vs. SNR (dB): Clean, 15, 10, 5, 0, -5, -10]

• Comment: System performs worse in reverberation

Carnegie Mellon Slide 27 ECE and SCS Robust Speech Group

Some systems developed by the Dayton group

• Binaural Auditory Image Model (BAIM):
  – HRTFs
  – Auditory image model (AIM)
  – Cross-correlation with and without Lindemann inhibition
  – ITD/IID comparison using Kohonen self-organizing feature map
• Cocktail-party Processor (1995):
  – HRTFs
  – Conventional peripheral processing with Kates model
  – Cross-correlation with Lindemann inhibition

[BAIM worked somewhat better for most conditions]

Carnegie Mellon Slide 28 ECE and SCS Robust Speech Group

Some comments, kudos, and concerns …

• Be very skeptical of results obtained using artificially added signals and noise! Nevertheless, some progress has definitely been made.
  – Digitally adding noise almost invariably inflates performance
  – Use of room image models to simulate reverberant room acoustics may be more reasonable
• Lots of information is being ignored in many current models
  – Synchrony info à la Seneff, Ghitza;
  – Complex timbre information as suggested by Lyon, Slaney?
• The Lindemann model may not be the best way to capture precedence
• Missing feature approaches should be very promising
• Too much computational modeling and not enough insight into fundamental processes
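
For reference, the “digitally added noise” procedure that the first comment warns about usually amounts to the sketch below: the noise is scaled to a target SNR and summed with the clean signal sample by sample, with no room acoustics involved. The function name and the stand-in signals (a tone for speech, uniform random noise) are assumptions.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain

# Stand-in signals (assumptions): a 440-Hz tone for "speech", uniform noise
fs = 16000
rng = random.Random(2)
speech = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(4000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

mixture, gain = mix_at_snr(speech, noise, snr_db=5.0)

# Check the SNR actually achieved by the scaled noise
speech_power = sum(s * s for s in speech) / len(speech)
scaled_noise_power = sum((gain * n) ** 2 for n in noise) / len(noise)
achieved_snr_db = 10.0 * math.log10(speech_power / scaled_noise_power)
print(f"{achieved_snr_db:.2f} dB")
```

Because the mixture is a plain sum, it contains none of the reflections of a real room, which is the gap that room image models are meant to narrow.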

Carnegie Mellon Slide 29 ECE and SCS Robust Speech Group

Summary

• Binaural processing has the ability (in principle) to improve speech recognition accuracy by providing spatial filtering and by combating the effects of room reverberation.
• Current systems are realizing some gains but are just now beginning to realize that promise.
• Faster and more efficient computation will be a real spur for research in this area over the next five years.
