Lecture 4: Auditory Perception - Columbia University

44
EE E6820: Speech & Audio Processing & Recognition Lecture 4: Auditory Perception Mike Mandel <[email protected]> Dan Ellis <[email protected]> Columbia University Dept. of Electrical Engineering http://www.ee.columbia.edu/dpwe/e6820 February 10, 2009 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 1 / 44

Transcript of Lecture 4: Auditory Perception - Columbia University

Page 1: Lecture 4: Auditory Perception - Columbia University

EE E6820: Speech & Audio Processing & Recognition

Lecture 4: Auditory Perception

Mike Mandel <[email protected]>Dan Ellis <[email protected]>

Columbia University Dept. of Electrical Engineeringhttp://www.ee.columbia.edu/∼dpwe/e6820

February 10, 2009

1 Motivation: Why & how2 Auditory physiology3 Psychophysics: Detection & discrimination4 Pitch perception5 Speech perception6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 1 / 44

Page 2: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 2 / 44

Page 3: Lecture 4: Auditory Perception - Columbia University

Why study perception?

Perception is messy: can we avoid it?No!

Audition provides the ‘ground truth’ in audioI what is relevant and irrelevantI subjective importance of distortion (coding etc.)I (there could be other information in sound. . . )

Some sounds are ‘designed’ for auditionI co-evolution of speech and hearing

The auditory system is very successfulI we would do extremely well to duplicate it

We are now able to model complex systemsI faster computers, bigger memories

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 3 / 44

Page 4: Lecture 4: Auditory Perception - Columbia University

How to study perception?

Three different approaches:

Analyze the example: physiology

I dissection & nerve recordings

Black box input/output: psychophysics

I fit simple models of simple functions

Information processing modelsI investigate and model complex functions

e.g. scene analysis, speech perception

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 4 / 44

Page 5: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 5 / 44

Page 6: Lecture 4: Auditory Perception - Columbia University

Physiology

Processing chain from air to brain:

Outerear

Middleear

Inner ear(cochlea)

Auditorynerve

Midbrain

Cortex

Study via:I anatomyI nerve recordings

Signals flow in both directions

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 6 / 44

Page 7: Lecture 4: Auditory Perception - Columbia University

Outer & middle ear

Pinna

Ear canal

Eardrum(tympanum)

Middle earbones

Cochlea(inner ear)

Pinna ‘horn’I complex reflections give spatial (elevation) cues

Ear canalI acoustic tube

Middle earI bones provide impedance matching

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 7 / 44

Page 8: Lecture 4: Auditory Perception - Columbia University

Inner ear: Cochlea

Cochlea

Oval window (from ME bones)

Basilar Membrane (BM)

Travelling wave

Resonant frequency

Position

16 kHz

50 Hz

0 35mm

Mechanical input from middle ear starts traveling wavemoving down Basilar membrane

Varying stiffness and mass of BM results in continuousvariation of resonant frequency

At resonance, traveling wave energy is dissipated in BMvibration

I Frequency (Fourier) analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 8 / 44

Page 9: Lecture 4: Auditory Perception - Columbia University

Cochlea hair cells

Ear converts sound to BM motionI each point on BM corresponds to a frequency

Cochlea

Basilarmembrane

Tectorialmembrane

Inner Hair Cell(IHC) Outer Hair Cell

(OHC)

Auditory nerve

Hair cells on BM convert motion into nerve impulses (firings)

Inner Hair Cells detect motion

Outer Hair Cells? Variable damping?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 9 / 44

Page 10: Lecture 4: Auditory Perception - Columbia University

Inner Hair Cells

IHCs convert BM vibration into nerve firings

Human ear has ∼3500 IHCsI each IHC has ∼7 connections to Auditory Nerve

Each nerve fires (sometimes) near peak displacement

Local BM displacement

Typical nerve signal (mV)

time / ms50

Histogram to get firing probability

Firing count

Cycle angle

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 10 / 44

Page 11: Lecture 4: Auditory Perception - Columbia University

Auditory nerve (AN) signalsSingle nerve measurements

Tone burst histogram Frequency thresholdSpike count

Time

100

100 ms

Tone burst (log) frequency

100 Hz 1 kHz 10 kHz

20

40

60

80

dB SPL

Rate vs intensity

Spi

kes/

sec

Intensity / dB SPL

300

200

100

2000

40 60 80 100

One fiber: ~ 25 dB dynamic range

Hearing dynamic range > 100 dB

Hard to measure: probe living ANs?E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 11 / 44

Page 12: Lecture 4: Auditory Perception - Columbia University

AN population responseAll the information the brain has about sound

average rate & spike timings on 30,000 fibers

Not unlike a (constant-Q) spectrogram

time / ms

freq

/ 8v

e re

100

Hz

PatSla rectsmoo on bbctmp2 (2001-02-18)

0

1

2

3

4

5

0 10 20 30 40 50 60

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 12 / 44

Page 13: Lecture 4: Auditory Perception - Columbia University

Beyond the auditory nerve

Ascending and descending

Tonotopic × ?I modulation, position, source??

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 13 / 44

Page 14: Lecture 4: Auditory Perception - Columbia University

Periphery models

Outer/middle ear

filteringSound

Cochlea filterbank

IHC

IHC

Modeled aspectsI outer / middle earI hair cell

transductionI cochlea filteringI efferent feedback?

time / s

chan

nel

SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218)

0 0.1 0.2 0.3 0.4 0.5

10

20

30

40

50

60

Results: ‘neurogram’ / ‘cochleagram’

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 14 / 44

Page 15: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 15 / 44

Page 16: Lecture 4: Auditory Perception - Columbia University

Psychophysics

Physiology looks at the implementationPsychology looks at the function/behavior

Analyze audition as signal detection: p(θ | x)I psychological tests reflect internal decisionsI assume optimal decision processI infer nature of internal representations, noise, . . .→ lower bounds on more complex functions

Different aspects to measureI time, frequency, intensityI tones, complexes, noiseI binauralI pitch, detuning

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 16 / 44

Page 17: Lecture 4: Auditory Perception - Columbia University

Basic psychophysicsRelate physical and perceptual variablese.g. intensity → loudness

frequency → pitchMethodology: subject tests

I just noticeable difference (JND)I magnitude scaling e.g. “adjust to twice as loud”

Results for Intensity vs Loudness:Weber’s law ∆I ∝ I ⇒ log(L) = k log(I )

-20 -10 0 101.4

1.6

1.8

2.0

2.2

2.4

2.6

Sound level / dB

Log(

loud

ness

rat

ing)

Hartmann(1993) Classroom loudness scaling data

Power law fit:

L α I 0.22

Textbook figure:

L α I 0.3

log2(L) = 0.3 log2(I )

= 0.3log10 I

log10 2

=0.3

log10 2

dB

10

= dB/10

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 17 / 44

Page 18: Lecture 4: Auditory Perception - Columbia University

Loudness as a function of frequency

Fletcher-Munsen equal-loudness curves

freq / Hz

Inte

nsity

/ dB

SP

L

0

40

20

60

100

80

120

1000100 10,000

Intensity / dB

Equ

ival

ent l

oudn

ess

@

1kH

z

400 0

40

80

80

100

60

20

20 60 20 600

100

60

20

0Intensity / dB

Equ

ival

ent l

oudn

ess

@

1kH

z

40

40

80

80

rapid loudness growth

100 Hz 1 kHz

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 18 / 44

Page 19: Lecture 4: Auditory Perception - Columbia University

Loudness as a function of bandwidthSame total energy, different distributione.g. 2 channels at −6 dB (not −10 dB)

time

freq

freq

mag

freq

mag

Same total

energy I·B

... but wider perceived as louder

I0I1

B0 B1

Bandwidth B‘Critical’ bandwidth

Loud

ness

Critical bands: independent frequency channels

I ∼25 total (4-6 / octave)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 19 / 44

Page 20: Lecture 4: Auditory Perception - Columbia University

Simultaneous maskingA louder tone can ‘mask’ the perception of a second tone nearby infrequency:

masked threshold

log freq

absolute threshold

masking tone

Inte

nsity

/ dB

Suggests an ‘internal noise’ model:

decision variable x

internal noise

p(x | I)p(x | I)

p(x | I+∆I)

σn

I

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 20 / 44

Page 21: Lecture 4: Auditory Perception - Columbia University

Sequential masking

Backward/forward in time:

time

Inte

nsity

/ dB

masker envelope

masked threshold

simultaneous masking ~10 dB

backward masking ~5 ms

forward masking ~100 ms

→ Time-frequency masking ‘skirt’:

time

freq

inte

nsity

Masking tone

Masked threshold

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 21 / 44

Page 22: Lecture 4: Auditory Perception - Columbia University

What we do and don’t hear

A B XX = A or B?

“two-interval forced-choice”:

time

Timing: 2 ms attack resolution, 20 ms discriminationI but: spectral splatter

Tuning: ∼1% discriminationI but: beats

Spectrum: profile changes, formantsI variables time-frequency resolution

Harmonic phase?

Noisy signals & texture

(Trace vs categorical memory)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 22 / 44

Page 23: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 23 / 44

Page 24: Lecture 4: Auditory Perception - Columbia University

Pitch perception: a classic argument in psychophysics

Harmonic complexes are a pattern on AN

10

20

30

40

50

60

70

0 0.05 0.1 time/s

freq

. cha

n.

I but give a fused percept (ecological)

What determines the pitch percept?I not the fundamental

How is it computed?Two competing models: place and time

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 24 / 44

Page 25: Lecture 4: Auditory Perception - Columbia University

Place model of pitch

AN excitation pattern shows individual peaks

‘Pattern matching’ method to find pitch

frequency channel

frequency channel

AN

exc

itatio

n

Pitch strength

resolved harmonics

broader HF channels cannot resolve

harmonics

Correlate with harmonic ‘sieve’:

Support: Low harmonics are very important

But: Flat-spectrum noise can carry pitch

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 25 / 44

Page 26: Lecture 4: Auditory Perception - Columbia University

Time model of pitch

Timing information is preserved in AN down to ∼1 ms scale

Extract periodicity by e.g. autocorrelation and combine acrossfrequency channels

lag / ms

time

freq per-channel

autocorrelation

autocorrelation

Summary autocorrelation

0 10 20 30common period

(pitch)

But: HF gives weak pitch (in practice)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 26 / 44

Page 27: Lecture 4: Auditory Perception - Columbia University

Alternate & competing cues

Pitch perception could rely on various cuesI average excitation patternI summary autocorrelationI more complex pattern matching

Relying on just one cue is brittleI e.g. missing fundamental

→ Perceptual system appears to use a flexible, opportunisticcombination

Optimal detector justification?

argmaxθ

p(θ | x) = argmaxθ

p(x | θ)p(θ)

= argmaxθ

p(x1 | θ)p(x2 | θ)p(θ)

I if x1 and x2 are conditionally independent

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 27 / 44

Page 28: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 28 / 44

Page 29: Lecture 4: Auditory Perception - Columbia University

Speech perception

Highly specialized functionI subsequent to source organization?

. . . but also can interact

Kinds of speech sounds

20

30

40

50

60

1.4 1.6 1.8 2 2.2 2.4 2.6 time/slevel/dB

freq

/ H

z

0

1000

2000

3000

4000

watch thin as a dimeahas

stop burstfricativevowelnasalglide

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 29 / 44

Page 30: Lecture 4: Auditory Perception - Columbia University

Cues to phoneme perception

Linguists describe speech with phonemes

watch thin as a dimeahas

mdnctcl

^

θ zwzh e

III ayε

Acoustic-phoneticians describe phonemes by

• formants & transitions

• bursts & onset times

time

freq

vowel formants

transition stop burst

voicing onset time

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 30 / 44

Page 31: Lecture 4: Auditory Perception - Columbia University

Categorical perception

(Some) speech sounds perceived categorically rather thananalogically

I e.g. stop-burst and timing:

T

P K P

Pi e a c o uε

following vowelbu

rst f

req

f b /

Hz

1000

2000

3000

4000

time

freq stop burst vowel

formants

fb

I tokens within category are hard to distinguishI category boundaries are very sharp

Categories are learned for native tongueI “merry” / “Mary” / “marry”

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 31 / 44

Page 32: Lecture 4: Auditory Perception - Columbia University

Where is the information in speech?

‘Articulation’ of high/low-pass filtered speech:

Art

icul

atio

n / %

1000

20

40

60

80

2000 3000 4000freq / Hz

high-pass low-pass

sums to more than 1. . .

Speech message is highly redundant

e.g. constraints of language, context

→ listeners can understand with very few cues

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 32 / 44

Page 33: Lecture 4: Auditory Perception - Columbia University

Top-down influences: Phonemic restoration (Warren, 1970)

What if a noise burst obscures speech?

1.4 1.6 1.8 2 2.2 2.4 2.6 time / s

freq

/ H

z

0

1000

2000

3000

4000

auditory system ‘restores’ the missing phoneme. . . based on semantic content. . . even in retrospect

Subjects are typically unaware of which sounds are restored

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 33 / 44

Page 34: Lecture 4: Auditory Perception - Columbia University

A predisposition for speech: Sinewave replicas

Replace each formant with a single sinusoid (Remez et al., 1981)

010002000300040005000

0.5 1 1.5 2 2.5 30

10002000300040005000

time / s

freq

/ H

zfr

eq /

Hz

Speech

Sines

speech is (somewhat) intelligible

people hear both whistles and speech (“duplex”)

processed as speech despite un-speech-like

What does it take to be speech?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 34 / 44

Page 35: Lecture 4: Auditory Perception - Columbia University

Simultaneous vowelsMix synthetic vowels with different f0s

freq+ =

dB

/iy/ @ 100 Hz

/ah/ @ 125 Hz

Pitch difference helps(though not necessarily)

DV identification vs. ∆f0 (200ms) (Culling & Darwin 1993)

% b

oth

vow

els

corr

ect

∆f0 (semitones)

0 11/4 1/2 2 4

25

50

75

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 35 / 44

Page 36: Lecture 4: Auditory Perception - Columbia University

Computational models of speech perception

Various theoretical-practical models of speech comprehensione.g.

Phoneme recognition

Lexical access

Grammar constraints

Speech

Words

Open questions:I mechanism of phoneme classificationI mechanism of lexical recallI mechanism of grammar constraints

ASR is a practical implementation (?)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 36 / 44

Page 37: Lecture 4: Auditory Perception - Columbia University

Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 37 / 44

Page 38: Lecture 4: Auditory Perception - Columbia University

Auditory organization

Detection model is huge simplification

The real role of hearing is much more general:Recover useful information from the outside world

→ Sound organization into events and sources

0 2 4 time/s

frq/Hz

0

2000

4000 Voice

Stab

Rumble

Research questions:I what determines perception of sources?I how do humans separate mixtures?I how much can we tell about a source?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 38 / 44

Page 39: Lecture 4: Auditory Perception - Columbia University

Auditory scene analysis: simultaneous fusion

Harmonics are distinct on AN,but perceived as one sound (“fused”)

time

freq

I depends on common onsetI depends on harmonicity (common period)

Methodologies:I ask subject how many ‘objects’I match attributes e.g. object pitchI manipulate high level e.g. vowel identity

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 39 / 44

Page 40: Lecture 4: Auditory Perception - Columbia University

Sequential grouping: streaming

Pattern / rhythm: property of a set of objectsI subsequent to fusion ∵ employs fused events?

–2 octaves

TRT: 60-150 ms

time

freq

uenc

y

∆f:1 kHz

Measure by relative timing judgmentsI cannot compare between streams

Separate ‘coherence’ and ‘fusion’ boundaries

Can interact and compete with fusion

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 40 / 44

Page 41: Lecture 4: Auditory Perception - Columbia University

Continuity and restoration

Tone is interrupted by noise burst: what happened?

time

freq

+

+ +

?

I masking makes tone undetectable during noise

Need to infer most probable real-world eventsI observation equally likely for either explanationI prior on continuous tone much higher ⇒ choose

Top-down influence on perceived events. . .

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 41 / 44

Page 42: Lecture 4: Auditory Perception - Columbia University

Models of auditory organization

Psychological accounts suggest bottom-up

inputmixture

signalfeatures(maps)

discreteobjectsFront end Object

formationGrouping

rulesSourcegroups

onsetperiodfrq.mod

time

freq

Brown and Cooke (1994)

Complicated in practice

formation of separate elements

contradictory cues

influence of top-down constraints (context, expectations, . . . )

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 42 / 44

Page 43: Lecture 4: Auditory Perception - Columbia University

Summary

Auditory perception provides the ‘ground truth’ underlyingaudio processing

Physiology specifies information available

Psychophysics measure basic sensitivities

Sounds sources require further organization

Strong contextual effects in speech perception

Transduce Scene analysis

Multiple represent'ns

High-level recognitionSound

Parting thought

Is pitch central to communication? Why?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 43 / 44

Page 44: Lecture 4: Auditory Perception - Columbia University

References

Richard M. Warren. Perceptual restoration of missing speech sounds. Science, 167(3917):392–393, January 1970.

R. E. Remez, P. E. Rubin, D. B. Pisoni, and T. D. Carrell. Speech perception withouttraditional speech cues. Science, 212(4497):947–949, May 1981.

G. J. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech& Language, 8(4):297–336, 1994.

Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press,fifth edition, April 2003. ISBN 0125056281.

James O. Pickles. An Introduction to the Physiology of Hearing. Academic Press,second edition, January 1988. ISBN 0125547544.

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 44 / 44