Colleagues:
• Allen Braun, NIH
• Greg Hickok, UC Irvine
• Jonathan Simon, Univ. Maryland

A brain’s-eye-view of speech perception

David Poeppel

Cognitive Neuroscience of Language Lab
Department of Linguistics and Department of Biology
Neuroscience and Cognitive Science Program
University of Maryland College Park

Students:
• Anthony Boemio
• Maria Chait
• Huan Luo
• Virginie van Wassenhove

“chair” “uncomfortable” “lunch” “soon”

encoding ?

representation ?

Is this a hard problem?
Yes!
If it could be solved straightforwardly (e.g. by machine), Mark Liberman would be in Tahiti having cold beers.

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time

- Psychophysical evidence for temporal integration

- Imaging evidence

interface with lexical items, word recognition


hypothesis about storage: distinctive features
[-voice]  [+voice] [+voice]
[+labial] [+high]  [+labial]
[-round]  [+round] [-round]
[….]      [….]     [….]



production, articulation of speech



hypothesis about production: distinctive features
[-voice]  [+voice]
[+labial] [+high]
[….]      [….]


analysis of auditory signal spectro-temporal rep. FEATURES

interface with lexical items, word recognition FEATURES

production, articulation of speech FEATURES


coordinate transform from acoustic to articulatory space

auditory-motor interface

auditory-lexical interface



Unifying concept: distinctive feature





STG (bilateral): acoustic-phonetic speech codes

pMTG (left): sound-meaning interface

Area Spt (left): auditory-motor interface

pIFG/dPM (left): articulatory-based speech codes

Hickok & Poeppel (2000), Trends in Cognitive Sciences
Hickok & Poeppel (in press), Cognition

MTG and IFG overlap when controlling for the overt/covert distinction across tasks

Hypothesized functions:
- lexical selection (MTG)
- lexical phon. code retr. (MTG)
- post-lexical syllabification (IFG)

Shared neural correlates of word production and perception processes

- Bilat mid/post STG
- L anterior STG
- L mid/post MTG
- L post IFG

Indefrey & Levelt, in press, Cognition
Meta-analysis of neuroimaging data, perception/production overlap

Scott & Johnsrude 2003

Possible Subregions of Inferior Frontal Gyrus
Burton (2001)

Auditory Studies

Burton et al. (2000), Demonet et al. (1992, 1994), Fiez et al. (1995), Zatorre et al. (1992, 1996)

Visual Studies

Sergent et al. (1992, 1993), Poldrack et al. (1999), Paulesu et al. (1993, 1996), Shaywitz et al. (1995)

Auditory lexical decision versus FM/sweeps (a), CP/syllables (b), and rest (c)

[Figure: panels (a), (b), (c); axial slices at z=+6, z=+9, z=+12]

D. Poeppel et al. (in press)

fMRI (yellow blobs) and MEG (red dots) recordings of speech perception show pronounced bilateral activation of left and right temporal cortices

T. Roberts & D. Poeppel (in preparation)

Binder et al. 2000







The local/global distinction in vision is intuitively clear

Chuck Close

What information does the brain extract from speech signals?

Acoustic and articulatory phonetic phenomena occur on different time scales

Phenomena at the scale of formant transitions, subsegmental cues
“short stuff” -- order of magnitude 20-50ms

Phenomena at the scale of syllables (tonality and prosody)
“long stuff” -- order of magnitude 150-250ms

fine structure

envelope

Does different granularity in time matter?

Segmental and subsegmental information

serial order in speech: fool/flu, carp/crap, bat/tab

Supra-segmental information

prosody: Sleep during lecture! Sleep during lecture?

The local/global distinction can be conceptualized as a multi-resolution analysis in time

Further processing

Supra-segmental information

(time ~200ms)

Segmental information

(time ~20-50ms)

syllabicity metrics tone features, segments

Binding process


Temporal integration windows

Psychophysical and electrophysiologic evidence suggests that perceptual information is integrated and analysed in temporal integration windows (v. Bekesy 1933; Stevens and Hall 1966; Näätänen 1992; Theunissen and Miller 1995; etc).

The importance of the concept of a temporal integration window is that it suggests the discontinuous processing of information in the time domain. The CNS, on this view, treats time not as a continuous variable but as a series of temporal windows, and extracts data from a given window.
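The windowing idea can be made concrete with a small sketch. This is an illustration only, not a model from the talk: the sampling rate, the helper name `integrate_in_windows`, and the choice of mean energy as the quantity extracted from each window are all assumptions.

```python
def integrate_in_windows(samples, sample_rate_hz, window_ms):
    """Split a signal into consecutive, non-overlapping windows and
    return one value (mean energy) per window."""
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    if window_len < 1:
        raise ValueError("window shorter than one sample")
    values = []
    for start in range(0, len(samples) - window_len + 1, window_len):
        chunk = samples[start:start + window_len]
        values.append(sum(x * x for x in chunk) / window_len)
    return values


# Hypothetical input: 1 s of a constant-amplitude signal at 1000 samples/s.
signal = [1.0] * 1000
short_windows = integrate_in_windows(signal, 1000, 25)    # 25 ms windows
long_windows = integrate_in_windows(signal, 1000, 200)    # 200 ms windows
print(len(short_windows), len(long_windows))
```

The same one-second input yields many values under the short window and few under the long one, which is the sense in which time is treated as a series of windows rather than as a continuous variable.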

arrow of time, physics

arrow of time, Central Nervous System

25ms

short temporal integration windows

long temporal integration windows

200ms

Asymmetric sampling/quantization of the speech waveform

This paper is hard to publish

Two spectrograms of the same word illustrate how different analysis windows highlight different aspects of the sounds.

(a) high time resolution - each glottal pulse visible as vertical striation

(b) high frequency resolution - each harmonic visible as horizontal stripe

(a) High time, low frequ.-resolution

(b) Low time, high frequ.-resolution
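The trade-off between the two panels can be sketched with a rule of thumb: an analysis window of length T resolves events separated by more than T in time, and frequencies separated by more than about 1/T. The function name, the fundamental frequency of 100 Hz, and the two window lengths below are assumptions chosen for illustration, and the thresholds are deliberately simplistic.

```python
def what_is_resolved(window_ms, f0_hz):
    """Rule-of-thumb check of what a spectrogram window can show.

    Glottal pulses appear as vertical striations when the window is shorter
    than one pitch period; harmonics appear as horizontal stripes when the
    frequency resolution (about 1 / window length) is finer than f0.
    """
    period_ms = 1000.0 / f0_hz
    freq_res_hz = 1000.0 / window_ms
    return {
        "glottal_pulses": window_ms < period_ms,
        "harmonics": freq_res_hz < f0_hz,
    }


# Assumed voice with f0 = 100 Hz (pitch period 10 ms).
print(what_is_resolved(3, 100))    # short window, as in panel (a)
print(what_is_resolved(30, 100))   # long window, as in panel (b)
```

No single window length satisfies both conditions, which is why two analyses of the same word look so different.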

Hypothesis: Asymmetric Sampling in Time (AST)

Left temporal cortical areas preferentially extract information over 25ms temporal integration windows.

Right hemisphere areas preferentially integrate over long, 150-250ms integration windows.

By assumption, the auditory input signal has a neural representation that is bilaterally symmetric (e.g. at the level of core); beyond the initial representation, the signal is elaborated asymmetrically in the time domain.

Another way to conceptualize the AST proposal is to say that the sampling rate of non-primary auditory areas is different, with LH sampling at high frequencies (~40Hz) and RH sampling at low frequencies (4-10Hz).
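The correspondence between window size and sampling rate is simply that of a period and its frequency. A minimal sketch, with a function name that is an assumption:

```python
def window_to_rate_hz(window_ms):
    """Rate (Hz) of an oscillation whose period equals the window length."""
    return 1000.0 / window_ms


for window_ms in (25, 100, 250):
    print(window_ms, "ms ->", window_to_rate_hz(window_ms), "Hz")
```

On this reading, 25 ms windows pair with 40 Hz, and the 4-10 Hz range pairs with windows of 250 ms down to 100 ms.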

[Figure: proportion of neuronal ensembles in LH and RH as a function of the size of temporal integration windows (ms) and associated oscillatory frequency (Hz), from 25 ms (40 Hz) to 250 ms (4 Hz)]

Symmetric representation of spectro-temporal receptive fields in primary auditory cortex

a. Physiological lateralization

LH RH

Analyses requiring high temporal resolution, e.g. formant transitions

Analyses requiring high spectral resolution, e.g. intonation contours

b. Functional lateralization

Temporally asymmetric elaboration of perceptual representations in non-primary cortex

Asymmetric sampling in time (AST) characteristics

• AST is an example of functional segregation, a standard concept.

• AST is an example of multi-resolution analysis, a signal processing strategy common in other cortical domains (cf. visual areas MT and V4 which, among other differences, have phasic versus tonic firing properties, respectively).

• AST speaks to the “granularity” of perceptual representations: the model suggests that there exist basic perceptual representations that correspond to the different temporal windows (e.g. featural info is equally basic to the envelope of syllables, on this view).

• The AST model connects in plausible ways to the local versus global distinction: there are multiple representations of a given signal on different scales (cf. wavelets)

Global ==> ‘large-chunk’ analysis, e.g., syllabic level
Local ==> ‘small-chunk’ analysis, e.g., subsegmental level






Outline





- A hypothesis about the quantization of time: AST model


- Imaging evidence

Perception of FM sweeps

Huan Luo, Mike Gordon, Anthony Boemio, David Poeppel

FM Sweep Example

[Figure: waveform (amplitude -0.2 to 0.2) and spectrogram (0-5000 Hz) of the sweep, time axis 0 to 0.08 s]

80 msec linear FM sweep, from 3 to 2 kHz

The rationale

• Important cues for speech perception: formant transitions in speech sounds (for example, F2 direction can distinguish /ba/ from /da/)
• Importance in tone languages
• Vertebrate auditory system is well equipped to analyze FM signals.

Tone languages

• For example, Chinese, Thai…

• The direction of FM (of the fundamental frequency) is important in the language to make lexical distinctions.

• (Four tones in Chinese)

/Ma 1/, /Ma 2/ , /Ma 3/, /Ma 4/

Questions

• How good are we at discriminating these signals?
  - Determine the threshold stimulus duration (corresponding to FM rate) for the detection of FM direction.
  - Is there any performance difference between UP and DOWN detection?

• Will language experience affect the performance of such a basic perceptual ability?

Stimuli

• Linearly frequency modulated
• Frequency range studied: 2-3 kHz (0.5 oct)
• Two directions (Up / Down)
• Changing FM rate (frequency range/time) by changing duration. For each frequency range, frequency span is kept constant (slow / fast)
• Stimulus duration: from 5 msec (100 oct/sec) to 640 msec (0.8 oct/sec)

Tasks
• Detection and discrimination of UP versus DOWN
• 2AFC, 2IFC, 3IFC
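The stimulus description can be sketched in code. This is a hedged illustration rather than the authors' stimulus-generation procedure: the function names, the 44.1 kHz sampling rate, and the absence of onset/offset ramps are assumptions. The FM rate is taken as frequency span divided by duration, using the 0.5 oct span quoted above.

```python
import math


def fm_rate_oct_per_s(span_oct, duration_ms):
    """FM rate = frequency span (octaves) / duration (seconds)."""
    return span_oct / (duration_ms / 1000.0)


def linear_sweep(f_start_hz, f_end_hz, duration_ms, sample_rate_hz=44100):
    """Linear FM sweep: instantaneous frequency moves linearly from
    f_start_hz to f_end_hz; the phase is the integral of the frequency."""
    n = int(round(sample_rate_hz * duration_ms / 1000.0))
    slope = (f_end_hz - f_start_hz) / (duration_ms / 1000.0)  # Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        phase = 2.0 * math.pi * (f_start_hz * t + 0.5 * slope * t * t)
        samples.append(math.sin(phase))
    return samples


durations_ms = [5, 10, 20, 30, 40, 50, 80, 160, 320, 640]
for d in durations_ms:
    print(d, "ms ->", round(fm_rate_oct_per_s(0.5, d), 1), "oct/sec")

down_sweep = linear_sweep(3000, 2000, 80)   # 80 ms, 3-to-2 kHz example
up_sweep = linear_sweep(2000, 3000, 80)     # same span and duration, opposite direction
```

Because the span is fixed, shortening the stimulus is the only way the rate changes, so duration and rate are two labels for the same axis.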

Performance

[Figure: % correct (30-100%) for Up and Down sweeps as a function of FM rate (oct/sec) [stimulus duration (ms)]: 100[5], 50[10], 25[20], 16.7[30], 12.5[40], 10[50], 6.2[80], 3.1[160], 1.6[320], 0.8[640]; one panel per frequency range: 2-3 kHz, 1-1.5 kHz, 600-900 Hz]

English speakers

• 3 frequency ranges relevant to speech (approximately F1, F2, F3 ranges)
• single-interval 2-AFC

Two main findings:

• threshold for UP at 20ms
• UP better than DOWN

Gordon & Poeppel (2001), JASA-ARLO

2IFC
• To eliminate the possibility that subjects use a bias strategy
• To see whether the asymmetric performance of English subjects is due to their “Up preference bias”

Interval 1: UP
Interval 2: Down

Which interval (1 or 2) contains the sound with a given direction?

The two sounds have the same duration, so the only difference is direction

Results for Chinese Subjects

[Figure: Expt. 2 (Chinese subjects): percent correct (0-100%) as a function of duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms) for Up and Down]

No significant difference between Up and Down

Threshold for both UP and DOWN is about 20 msec

Results for English Subjects

[Figure: Expt. 2 (English subjects): percent correct (0-100%) as a function of duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms) for up and down]

No difference now between UP and DOWN

Threshold for both at 20msec

No difference between Chinese and English subjects now.

3IFC

Standard: UP
Interval 1: UP
Interval 2: Down

Choose which interval contains the sound that is DIFFERENT among the three sounds (different quality rather than only direction)

[Figure: Expt. 3 vs. Expt. 2 (English speakers): percent correct (0-100%) as a function of duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms) for up, down, and 3-interval difference]

[Figure: Expt. 3 vs. Expt. 2 (Chinese speaker): percent correct (0-100%) as a function of duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms) for Up, Down, and Difference detection]

No difference between Chinese and English subjects

Threshold confirmed at 20ms

3 IFC versus 2 IFC

Conclusion

• Importance of 20 msec as the threshold for discrimination of FM sweeps

- corresponds to temporal order threshold determined by Hirsh 1959

- consistent with Schouten 1985, 1989 testing FM sweeps

- this basic threshold arguably reflects the shortest integration window that generates robust auditory percepts.

Click trains

Anthony Boemio & David Poeppel

Click Stimuli

Psychophysics

Auditory visual integration: the McGurk effect

Virginie van Wassenhove, Ken Grant, David Poeppel

McGurk Effect

• Audiovisual (AV) token

• Visual (V) token

• Auditory (A) token

[Figure: response rate (-40% to 100%) as a function of SOA (ms), from -467 (A lead) through 0 to +467 (A lag); series: Fusion Rate, Visually driven, Auditorily driven, Corrected Fusion Rate]

Response rate as a function of SOA (ms) in the ApVk McGurk pair.

Mean responses (N=21) and standard errors. Fusion rate (open red squares) and corrected fusion rate (filled red squares, dotted line) are /ta/ responses, visually driven responses (open green triangles) are /ka/, and auditorily driven responses (filled blue circles) are /pa/. A negative value in corrected fusion rate is interpreted as a visually dominated error response /ta/.

Identification Task (3AFC) ApVk

True bimodal responses

TWI

Simultaneity Judgment Task (2AFC) ApVk vs. AtVt and AbVg vs. AdVd

[Figure: simultaneity rate (0-100%) as a function of SOA (ms), from -467 (A lead) through 0 to +467 (A lag); series: ApVk, AtVt, AbVg, AdVd]

Simultaneity judgment task. Simultaneity judgment as a function of SOA (ms) in both incongruent and congruent conditions (ApVk and AtVt N=21; AbVg and AdVd N=18). The congruent conditions (open symbols) are associated with a broader and higher simultaneity judgment profile than the incongruent conditions (filled symbols).

Temporal Window of Integration (TWI) across Tasks and Bimodal Speech Stimuli

[Figure: temporal windows of integration on an SOA axis from -200 ms (A lead) to +200 ms (A lag) for AdVd, AtVt, AbVg, and ApVk, by task (S, ID)]

Stimulus  Task  A Lead, Left    A Lag, Right    Plateau      Window
                Boundary (ms)   Boundary (ms)   Center (ms)  Size (ms)
ApVk      ID    -25             +136            +56          161
ApVk      S     -44             +117            +37          161
AtVt      S     -80             +125            +23          205
AbVg      ID    -34             +174            +70          208
AbVg      S     -37             +122            +43          159
AdVd      S     -74             +131            +29          205

(ID = identification task, S = simultaneity judgment task)
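The columns of this table are related by simple arithmetic, which the sketch below checks. The window size is the right boundary minus the left boundary. Treating the plateau center as the midpoint of the two boundaries, rounded half up, is an inference that happens to reproduce the reported values; the slides do not state how the center was computed.

```python
import math

# (stimulus, task, left boundary ms, right boundary ms), as reported in the table
rows = [
    ("ApVk", "ID", -25, 136),
    ("ApVk", "S", -44, 117),
    ("AtVt", "S", -80, 125),
    ("AbVg", "ID", -34, 174),
    ("AbVg", "S", -37, 122),
    ("AdVd", "S", -74, 131),
]


def window_size(left_ms, right_ms):
    """Width of the temporal window of integration."""
    return right_ms - left_ms


def plateau_center(left_ms, right_ms):
    """Midpoint of the window, rounded half up (rounding rule is assumed)."""
    return math.floor((left_ms + right_ms) / 2.0 + 0.5)


sizes = [window_size(left, right) for _, _, left, right in rows]
centers = [plateau_center(left, right) for _, _, left, right in rows]
for (stim, task, _, _), size, center in zip(rows, sizes, centers):
    print(stim, task, size, center)
```

All centers are positive, that is, every window is shifted toward auditory lag.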

Outline





- A hypothesis about the quantization of time
  • AST model

- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20-30ms integration
  • AV processing in McGurk: 200ms integration

- Imaging evidence

Binding of Temporal Quanta in Speech Processing

Maria Chait, Steven Greenberg, Takayuki Arai, David Poeppel

Multi Resolution Analysis Hypothesis

“SYLLABLE”

Supra-segmental information

(t.s ~300 ms)

(Sub)-segmental information

(t.s ~30 ms)

syllabicity stress tone feature

Binding process

Signal-processing pipeline (diagram):

Original
Filtering: 0-265 Hz, 265-315 Hz, ..., 5045-6000 Hz
Computing the envelope and fine structure: E1, FS1; E2, FS2; ...; E14, FS14
Low Pass Filter: Low Pass E1 (0-3 Hz) in one path, High Pass E1 (22- Hz) in the other
Multiply E by FS: E1×FS1, E2×FS2, ..., E14×FS14
Outputs: S_low, S_high

Signal Processing:

• 0-6 kHz

• 14 channels

• spaced in 1/3 octave steps along the cochlear frequency map.

• Every two neighboring channels are separated by 50 Hz
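The envelope/fine-structure manipulation can be sketched for a single band. This is not the processing used in the study: the envelope here is estimated by rectification and smoothing, the low-pass step is a crude moving average rather than a 0-3 Hz filter, and the test band (a 1 kHz carrier with 2 Hz and 30 Hz modulation), the sampling rate, and all function names are assumptions.

```python
import math


def moving_average(x, width):
    """Crude low-pass filter: centered moving average, computed with prefix sums."""
    half = width // 2
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def envelope_and_fine_structure(band, smooth_width):
    """Envelope E = smoothed, rectified band (scaled by pi/2, since the mean of
    |sin| is 2/pi); fine structure FS = band / E, so that band == E * FS."""
    rectified = [abs(v) for v in band]
    env = [v * math.pi / 2.0 for v in moving_average(rectified, smooth_width)]
    fine = [b / e if e > 1e-9 else 0.0 for b, e in zip(band, env)]
    return env, fine


fs = 16000                                     # assumed sampling rate (Hz)
t = [i / fs for i in range(fs)]                # 1 s of signal
slow = [1.0 + 0.4 * math.sin(2 * math.pi * 2 * x) for x in t]    # 2 Hz modulation
fast = [0.3 * math.sin(2 * math.pi * 30 * x) for x in t]         # 30 Hz modulation
carrier = [math.sin(2 * math.pi * 1000 * x) for x in t]
band = [(s + f) * c for s, f, c in zip(slow, fast, carrier)]

env, fine = envelope_and_fine_structure(band, smooth_width=161)  # about 10 ms
env_low = moving_average(env, 533)     # about 33 ms: averages out the 30 Hz part
band_low = [e * f for e, f in zip(env_low, fine)]   # "multiply E by FS"
```

In a multi-channel version, the same steps would run per band and the modified bands would be summed. The point of the sketch is only that the slow and fast envelope fluctuations can be separated while the fine structure is left untouched.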

Envelope Extraction

[Figure: amplitude vs. time for the original envelope, the low-passed envelope, and the high-passed envelope]

Evidence:

• Comodulation masking release
• Ahissar et al. (2001) - Phase locking in the auditory cortex to the envelope of sentence stimuli.
• Shannon (1995)
• Drullman (1994):

Effect of low pass filtering the envelope on speech reception:
* severe reduction at 0-2Hz cutoff frequencies
* marginal contribution of frequencies above 16Hz
Effect of high pass filtering the envelope:
* reduction in speech intelligibility for cutoff frequencies above 64Hz
* no reduction in sentence intelligibility when only frequencies below 4Hz are reduced

Experiment 1

Stimuli:
- 53 Sentences from the IEEE corpus.
- Nonsense Syllables (CUNY): 8 Blocks – 2 (voiced/voiceless) * 2 vowels (/a/, /i/) * 2 (CV/VC)
- 3 manipulations:
  0-3 Hz Low Pass
  22-40 Hz Band Pass
  0-3 and 22-40 Hz

Each subject hears all 53 sentences but only one manipulation per sentence. A practice block of 26 sentences precedes the experiment.

Task:
- Sentences: subjects asked to write down what they heard as precisely as they can
- Syllables: 7-alternative forced choice

Presented Dichotically

Results

[Figure: percent correct (0-100%) for three conditions: 0-3 (low-pass), 22-40 (high-pass), and Dichotic (high-pass plus low-pass?)]

Result reflects the interaction between information carried on the short and long time scales.

Outline






- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20-30ms integration
  • AV processing in McGurk: 200ms integration
  • Interaction of temporal windows

- Imaging evidence

fMRI study of temporal structure in concatenated FMs

Anthony Boemio, Allen Braun, Steven Fromm, David Poeppel

Stimulus Properties


All 13 stimuli have nearly identical long-term spectra and RMS power over the entire 9-second stimulus duration. Stimuli differ only in segment duration which was determined by drawing from a Gaussian distribution (previous panel), with means of 12, 25, 45, 85, 160, and 300ms.
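The construction of a segment sequence can be sketched as follows. The slide gives the means and the 9-second total but not the spread of the Gaussian, so `sd_fraction` is a placeholder, and the function name, the seed, and the trimming of the last segment are likewise assumptions.

```python
import random


def draw_segment_durations(mean_ms, total_ms=9000.0, sd_fraction=0.1, seed=0):
    """Draw segment durations (ms) from a Gaussian until they fill total_ms.

    sd_fraction is a placeholder: the standard deviation is not given
    in the slides.
    """
    rng = random.Random(seed)
    durations = []
    remaining = total_ms
    while remaining > 0:
        d = rng.gauss(mean_ms, sd_fraction * mean_ms)
        d = max(1.0, d)           # guard against non-positive draws
        d = min(d, remaining)     # trim the last segment so the total fits
        durations.append(d)
        remaining -= d
    return durations


segments_by_mean = {m: draw_segment_durations(m) for m in (12, 25, 45, 85, 160, 300)}
for mean_ms, segs in segments_by_mean.items():
    print(mean_ms, "ms mean:", len(segs), "segments")
```

Each segment would then be filled with an FM sweep, a tone, or the constant control, so that only the temporal structure differs across stimuli.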

[Figure: spectrograms, amplitude vs. time (0-1 s), and PSDs (frequency 100 Hz to 1E4 Hz, power 1E-10 to 1) for the FM, CNST, and TONE stimuli]

fMRI
• Single-trial sparse acquisition paradigm (clustered volume acqu.)
• 1.5T GE Signa, echo-planar sequence
• 11.4s TR (9s signal, 2.4s volume), TE 40ms
• 24 reps/condition
• SPM 99 random-effects model, p<0.05 corrected

SPM 99 Cohort Analysis

FMs-CNST Categorical Contrasts (p < 0.05 corr.)

[Figure: mean supra-threshold voxels (0-200) vs. segment duration (12, 25, 45, 85, 160, 300 ms), summed over all conditions/auditory areas/hemispheres; error bars are SEM]

Threshold set by categorical contrast to CNST stimulus: anything below this level will be zero in the SPM

[Figure: stimulus and acquisition timeline; only 1 second of stimuli is shown for clarity]

Segment / Segment Transition

Hemodynamic response/stimulus model
Not all segment transitions are equal.

A model that includes both the segments themselves and the segment transitions, and that assumes transitions between long segments contribute more to the response than transitions between short ones, produces the observed activation vs. segment-duration relation (left).

FM/TONE
CNST

STS Only

[Figure: suprathreshold voxels (0-200) vs. SOA (12, 25, 45, 85, 160, 300 ms), STS only; a second panel shows the same conditions on a 0.00-0.75 scale]

MTG/STS P-Value
Type                0.5994
Hemi                0.3127
Rate                <.0001
Type x Hemi         0.9396
Type x Rate         0.3772
Hemi x Rate         0.0034
Type x Hemi x Rate  0.4137

STG Only

[Figure: suprathreshold voxels (0-450) vs. SOA (12, 25, 45, 85, 160, 300 ms) for Left Hemi and Right Hemi, STG only; a second panel shows the same conditions on a 0.00-0.75 scale]

STG P-Value
Type                0.0933
Hemi                0.7514
Rate                <.0001
Type x Hemi         0.8152
Type x Rate         0.0578
Hemi x Rate         0.7211
Type x Hemi x Rate  0.3209

MEG study of spectral responses to complex sounds

David Poeppel, Huan Luo, Dana Ritter, Anthony Boemio, Didier Depireux, Jonathan Simon

[Figure: sensitivity of neuronal ensembles in LH and RH as a function of temporal integration window, 25 ms (40 Hz) to 250 ms (4 Hz)]

Asymmetric sampling in time (AST) hypothesis predicts electrophysiological asymmetries in specific frequency bands, gamma (25-55Hz) and theta (3-8Hz) ….

… because the hypothesized temporal quantization is reflected as oscillatory activity.

Flow chart

LH, RH
Gamma band-pass filter, RMS: Gamma for LH, Gamma for RH
Theta band-pass filter, RMS: Theta for LH, Theta for RH
Multi-taper spectral analysis

Result

Power ratio in specific frequency bands

• The difference is much greater in Theta band (low frequency band) and RH activation in Theta band is greater than LH

Power ratio P(L)/(P(L)+P(R)) by filter type:

        Kaiser  Remez   Elliptic
Gamma   0.4769  0.4751  0.4733
Theta   0.3958  0.3965  0.4210
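The ratio P(L)/(P(L)+P(R)) equals 0.5 when power is symmetric and falls below 0.5 when the right hemisphere carries more power. The sketch below applies that reading to the reported values; the function name and the averaging across filter types are illustrative assumptions, not part of the reported analysis.

```python
def left_power_ratio(p_left, p_right):
    """P(L) / (P(L) + P(R)): 0.5 means symmetric, below 0.5 means RH > LH."""
    return p_left / (p_left + p_right)


# Ratios reported in the table (three filter types per band).
reported = {
    "Gamma": [0.4769, 0.4751, 0.4733],
    "Theta": [0.3958, 0.3965, 0.4210],
}

mean_ratio = {band: sum(r) / len(r) for band, r in reported.items()}
# Implied RH/LH power ratio for a given left ratio r: (1 - r) / r
implied_rh_over_lh = {band: (1 - r) / r for band, r in mean_ratio.items()}

for band in reported:
    print(band, round(mean_ratio[band], 4), round(implied_rh_over_lh[band], 2))
```

Both bands sit below 0.5, and the theta values sit further below than the gamma values, which is the asymmetry the bullet describes.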

Distribution of spectral responses

Outline






- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20-30ms integration
  • AV processing in McGurk: 200ms integration
  • Interaction of temporal windows

- Imaging evidence
  • fMRI: temporal sensitivity and lateralization
  • MEG spectral lateralization











Asymmetric sampling in time (AST) builds on anatomical symmetry but permits functional asymmetry

Conclusion

The input signal (e.g. speech) must interface with higher-order symbolic representations of different types (e.g. segmental representations relevant to lexical access and supra-segmental representations relevant to interpretation).

These higher-order representation categories appear to be lateralized (e.g. segmental phonology/LH, phrasal prosody/RH).

The timing-based asymmetry provides a possible cortical ‘logistical’ or ‘administrative’ device that helps create representations of the appropriate granularity.

If this is on the right track, the syllable is, at least for perception, as elementary a unit as the feature/segment. Both are basic.

Analysis-by-synthesis I

Hypothesize-and-test models

Analysis

Synthesis

Peripheral auditory processing

Segmentation and labeling

spectral representation

Lexical access code

Long-term memory: Abstract lexical repr.

Recoding

acoustic-phonetic manifestations of words

contextual information

MATCHING PROCESS

BEST LEXICAL CANDIDATE

Where do the candidates for synthesis come from?

Analysis-by-synthesis II

Analysis-by-synthesis model of lexical hypothesis generation and verification (adapted and extended from Klatt, 1979)

spectral analysis

analysis-by-synthesis verification;

“internal forward model”

speech waveform

segmental analysis

lexical search

synt./seman. analysis

peripheral and central ‘neurogram’

partial feature matrix

lexical hypotheses

predicted subsequent items

best-scoring lexical candidates

acceptable word string
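The hypothesize-and-test loop can be illustrated with a toy sketch. Nothing here comes from Klatt's model or from the talk beyond the overall flow (partial feature matrix, lexical hypotheses, verification against predicted forms, best-scoring candidate): the lexicon, the feature values, the scoring rule, and all names are hypothetical.

```python
# Toy lexicon: each word is a sequence of feature bundles (all values hypothetical).
LEXICON = {
    "pa": [{"voice": "-", "labial": "+"}, {"voice": "+", "high": "-"}],
    "ba": [{"voice": "+", "labial": "+"}, {"voice": "+", "high": "-"}],
    "bi": [{"voice": "+", "labial": "+"}, {"voice": "+", "high": "+"}],
}


def consistent(partial, full):
    """Lexical search: a partial feature matrix is consistent with a stored
    form if every feature it specifies has the same value in that form."""
    if len(partial) != len(full):
        return False
    return all(full_seg.get(k) == v
               for part_seg, full_seg in zip(partial, full)
               for k, v in part_seg.items())


def score(predicted, observed):
    """Verification: count observed feature values that the predicted
    (internally synthesized) form reproduces."""
    return sum(1 for seg_p, seg_o in zip(predicted, observed)
               for k, v in seg_o.items() if seg_p.get(k) == v)


def recognize(partial, observed):
    """Generate hypotheses from the partial matrix, then keep the best scorer."""
    hypotheses = [w for w, form in LEXICON.items() if consistent(partial, form)]
    return max(hypotheses, key=lambda w: score(LEXICON[w], observed), default=None)


partial = [{"labial": "+"}, {}]                               # early, incomplete analysis
observed = [{"voice": "+", "labial": "+"}, {"high": "+"}]     # fuller evidence
best = recognize(partial, observed)
print(best)
```

The early partial matrix admits all three toy words; the verification step, which compares each predicted form with the fuller evidence, is what narrows them to one.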

Analysis-by-synthesis III

spectral analysis

analysis-by-synthesis verification;

“internal forward model”

speech waveform

segmental analysis

lexical search

synt./seman. analysis

peripheral and central ‘neurogram’

partial feature matrix

lexical hypotheses

predicted subsequent items

best-scoring lexical candidates

acceptable word string

auditory cortex

pSTG? MTG? ITG?

frontal areas (articulatory codes) - left IFG, premotor

temporo-parietal areas?
