Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Page 1

Colleagues:
• Allen Braun, NIH
• Greg Hickok, UC Irvine
• Jonathan Simon, Univ. Maryland

A brain’s-eye-view of speech perception

David Poeppel

Cognitive Neuroscience of Language Lab
Department of Linguistics and Department of Biology
Neuroscience and Cognitive Science Program
University of Maryland College Park

Students:
• Anthony Boemio
• Maria Chait
• Huan Luo
• Virginie van Wassenhove

Page 2

“chair” “uncomfortable” “lunch” “soon”

encoding ?

representation ?

Is this a hard problem? Yes! If it could be solved straightforwardly (e.g. by machine), Mark Liberman would be in Tahiti having cold beers.

Page 3

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time

- Psychophysical evidence for temporal integration

- Imaging evidence

Page 4

interface with lexical items, word recognition

Page 5

interface with lexical items, word recognition

hypothesis about storage: distinctive features
[-voice]  [+voice]  [+voice]
[+labial] [+high]   [+labial]
[-round]  [+round]  [-round]
[….]      [….]      [….]

Page 6

interface with lexical items, word recognition

hypothesis about storage: distinctive features
[-voice]  [+voice]  [+voice]
[+labial] [+high]   [+labial]
[-round]  [+round]  [-round]
[….]      [….]      [….]

production, articulation of speech

Page 7

interface with lexical items, word recognition

hypothesis about storage: distinctive features
[-voice]  [+voice]  [+voice]
[+labial] [+high]   [+labial]
[-round]  [+round]  [-round]
[….]      [….]      [….]

hypothesis about production: distinctive features
[-voice]  [+voice]
[+labial] [+high]
[….]      [….]

production, articulation of speech

Page 8

analysis of auditory signal spectro-temporal rep. FEATURES

interface with lexical items, word recognition FEATURES

production, articulation of speech FEATURES

Page 9

interface with lexical items, word recognition

coordinate transform from acoustic to articulatory space

auditory-motor interface

auditory-lexical interface

analysis of auditory signal spectro-temporal rep. FEATURES

production, articulation of speech

Unifying concept: distinctive feature

Page 10

interface with lexical items, word recognition

coordinate transform from acoustic to articulatory space

analysis of auditory signal spectro-temporal rep. FEATURES

production, articulation of speech

Page 11

STG (bilateral): acoustic-phonetic speech codes
pMTG (left): sound-meaning interface
Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes

Hickok & Poeppel (2000), Trends in Cognitive Sciences
Hickok & Poeppel (in press), Cognition

Page 12

MTG and IFG overlap when controlling for the overt/covert distinction across tasks

Hypothesized functions:
- lexical selection (MTG)
- lexical phon. code retr. (MTG)
- post-lexical syllabification (IFG)

Shared neural correlates of word production and perception processes:
Bilat mid/post STG
L anterior STG
L mid/post MTG
L post IFG

Indefrey & Levelt, in press, Cognition
Meta-analysis of neuroimaging data, perception/production overlap

Page 13

Scott & Johnsrude 2003

Page 14

Possible Subregions of Inferior Frontal Gyrus
Burton (2001)

Auditory Studies

Burton et al. (2000), Demonet et al. (1992, 1994), Fiez et al. (1995), Zatorre et al. (1992, 1996)

Visual Studies

Sergent et al. (1992, 1993), Poldrack et al. (1999), Paulesu et al. (1993, 1996), Shaywitz et al. (1995)

Page 15

Auditory lexical decision versus FM/sweeps (a), CP/syllables (b), and rest (c)

[Figure: panels (a), (b), (c); slices at z=+6, z=+9, z=+12]

D. Poeppel et al. (in press)

Page 16

fMRI (yellow blobs) and MEG (red dots) recordings of speech perception show pronounced bilateral activation of the temporal cortices

T. Roberts & D. Poeppel (in preparation)

Page 17

Binder et al. 2000

Page 18

STG (bilateral): acoustic-phonetic speech codes
pMTG (left): sound-meaning interface
Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes

Hickok & Poeppel (2000), Trends in Cognitive Sciences
Hickok & Poeppel (in press), Cognition

Page 19

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time

- Psychophysical evidence for temporal integration

- Imaging evidence

Page 20
Page 21

The local/global distinction in vision is intuitively clear

Chuck Close

Page 22

What information does the brain extract from speech signals?

Page 23

Acoustic and articulatory phonetic phenomena occur on different time scales

Phenomena at the scale of formant transitions, subsegmental cues: “short stuff”, order of magnitude 20-50 ms

Phenomena at the scale of syllables (tonality and prosody): “long stuff”, order of magnitude 150-250 ms

fine structure

envelope

Page 24

Does different granularity in time matter?

Segmental and subsegmental information
serial order in speech: fool/flu, carp/crap, bat/tab

Supra-segmental information
prosody: Sleep during lecture! Sleep during lecture?

Page 25

The local/global distinction can be conceptualized as a multi-resolution analysis in time

Supra-segmental information (time ~200 ms): syllabicity, metrics, tone

Segmental information (time ~20-50 ms): features, segments

Binding process

Further processing

Page 26

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time

- Psychophysical evidence for temporal integration

- Imaging evidence

Page 27

Temporal integration windows

Psychophysical and electrophysiological evidence suggests that perceptual information is integrated and analysed in temporal integration windows (v. Bekesy 1933; Stevens and Hall 1966; Näätänen 1992; Theunissen and Miller 1995; etc.).

The importance of the concept of a temporal integration window is that it suggests the discontinuous processing of information in the time domain. The CNS, on this view, treats time not as a continuous variable but as a series of temporal windows, and extracts data from a given window.

arrow of time, physics

arrow of time, Central Nervous System
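
The idea of treating time as a series of windows, rather than as a continuous variable, can be illustrated with a toy sketch. It assumes back-to-back (non-overlapping) windows and uses the mean as the value "extracted" from each window; both choices, and the function name `integrate_in_windows`, are illustrative assumptions, not something the slide specifies.

```python
def integrate_in_windows(samples, sample_rate_hz, window_ms):
    """Cut a signal into back-to-back windows and return one mean per window."""
    window_len = max(1, int(round(sample_rate_hz * window_ms / 1000.0)))
    means = []
    for start in range(0, len(samples), window_len):
        chunk = samples[start:start + window_len]
        means.append(sum(chunk) / len(chunk))
    return means

# Toy input: 1 s of a ramp sampled at 1 kHz (assumed values, for illustration)
signal = [float(i) for i in range(1000)]
short = integrate_in_windows(signal, 1000, 25)    # short windows
long_ = integrate_in_windows(signal, 1000, 200)   # long windows
print(len(short), len(long_))
```

The same input yields many fine-grained values with the short window and a few coarse values with the long one, which is the sense in which the two window sizes give different "views" of one signal.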

Page 28

Asymmetric sampling/quantization of the speech waveform

short temporal integration windows: 25 ms

long temporal integration windows: 200 ms

[Illustration: the sentence “This paper is hard to publish” cut into chunks of different sizes]

Page 29

Two spectrograms of the same word illustrate how different analysis windows highlight different aspects of the sounds.

(a) High time resolution, low frequency resolution: each glottal pulse visible as a vertical striation

(b) Low time resolution, high frequency resolution: each harmonic visible as a horizontal stripe
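
The trade-off between the two spectrograms can be sketched numerically: an analysis window of length T resolves events about T apart in time and about 1/T apart in frequency. The window lengths (5 ms and 40 ms) and the 100 Hz voice pitch below are assumed values chosen only for illustration; the slide does not state them.

```python
def resolution(window_ms):
    """Return (time resolution in ms, frequency resolution in Hz) for a window."""
    return window_ms, 1000.0 / window_ms

F0_HZ = 100.0                      # assumed voice pitch, illustration only
pulse_period_ms = 1000.0 / F0_HZ   # spacing of glottal pulses in time
harmonic_spacing_hz = F0_HZ        # spacing of harmonics in frequency

for label, window_ms in (("(a) short window", 5.0), ("(b) long window", 40.0)):
    dt_ms, df_hz = resolution(window_ms)
    print(label,
          "| pulses resolved:", dt_ms < pulse_period_ms,
          "| harmonics resolved:", df_hz < harmonic_spacing_hz)
```

Under these assumptions the short window separates individual glottal pulses but smears the harmonics, and the long window does the reverse.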

Page 30

Hypothesis: Asymmetric Sampling in Time (AST)

Left temporal cortical areas preferentially extract information over 25 ms temporal integration windows.

Right hemisphere areas preferentially integrate over long, 150-250 ms integration windows.

By assumption, the auditory input signal has a neural representation that is bilaterally symmetric (e.g. at the level of core); beyond the initial representation, the signal is elaborated asymmetrically in the time domain.

Another way to conceptualize the AST proposal is to say that the sampling rate of non-primary auditory areas is different, with LH sampling at high frequencies (~40 Hz) and RH sampling at low frequencies (4-10 Hz).
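
The link between window size and sampling frequency is the reciprocal relation f = 1/T. A minimal sketch of that arithmetic follows; the function name `window_to_rate_hz` is illustrative.

```python
def window_to_rate_hz(window_ms):
    """Rate (Hz) at which back-to-back windows of the given length recur."""
    return 1000.0 / window_ms

for window_ms in (25, 100, 250):
    print(f"{window_ms} ms window -> {window_to_rate_hz(window_ms):g} Hz")
```

A 25 ms window corresponds to 40 Hz, and windows of 100-250 ms correspond to the 4-10 Hz range.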

Page 31

[Figure: Asymmetric sampling in time]

a. Physiological lateralization
Symmetric representation of spectro-temporal receptive fields in primary auditory cortex.
Temporally asymmetric elaboration of perceptual representations in non-primary cortex.
Axes: proportion of neuronal ensembles, for LH and RH, against size of temporal integration windows (ms) [associated oscillatory frequency (Hz)], from 25 ms [40 Hz] to 250 ms [4 Hz].

b. Functional lateralization
LH: analyses requiring high temporal resolution, e.g. formant transitions
RH: analyses requiring high spectral resolution, e.g. intonation contours

Page 32

Asymmetric sampling in time (AST) characteristics

• AST is an example of functional segregation, a standard concept.

• AST is an example of multi-resolution analysis, a signal processing strategy common in other cortical domains (cf. visual areas MT and V4 which, among other differences, have phasic versus tonic firing properties, respectively).

• AST speaks to the “granularity” of perceptual representations: the model suggests that there exist basic perceptual representations that correspond to the different temporal windows (e.g. featural info is equally basic to the envelope of syllables, on this view).

• The AST model connects in plausible ways to the local versus global distinction: there are multiple representations of a given signal on different scales (cf. wavelets)

Global ==> ‘large-chunk’ analysis, e.g., syllabic level
Local ==> ‘small-chunk’ analysis, e.g., subsegmental level

Page 33

[Figure: Asymmetric sampling in time]

a. Physiological lateralization
Symmetric representation of spectro-temporal receptive fields in primary auditory cortex.
Temporally asymmetric elaboration of perceptual representations in non-primary cortex.
Axes: proportion of neuronal ensembles, for LH and RH, against size of temporal integration windows (ms) [associated oscillatory frequency (Hz)], from 25 ms [40 Hz] to 250 ms [4 Hz].

b. Functional lateralization
LH: analyses requiring high temporal resolution, e.g. formant transitions
RH: analyses requiring high spectral resolution, e.g. intonation contours

Page 34

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time
  • AST model

- Psychophysical evidence for temporal integration

- Imaging evidence

Page 35

Perception of FM sweeps

Huan Luo, Mike Gordon, Anthony Boemio, David Poeppel

Page 36

FM Sweep Example

[Figure: waveform (amplitude -0.2 to 0.2) and spectrogram (0 to 5000 Hz) of the sweep, time axis 0 to 0.08 s]

80 ms linear FM sweep, from 3 to 2 kHz
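
A sweep like the one described here can be synthesized by integrating a linearly changing instantaneous frequency into a phase. The sketch below is a hypothetical reconstruction, not the authors' stimulus code: the 44.1 kHz sample rate is an assumption, and the 0.2 amplitude is taken from the waveform axis range.

```python
import math

def linear_fm_sweep(f_start_hz, f_end_hz, duration_s,
                    sample_rate_hz=44100, amplitude=0.2):
    """Samples of a sine whose frequency moves linearly from f_start to f_end."""
    n = int(round(duration_s * sample_rate_hz))
    slope = (f_end_hz - f_start_hz) / duration_s   # Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        # Phase is 2*pi times the integral of the instantaneous frequency
        phase = 2.0 * math.pi * (f_start_hz * t + 0.5 * slope * t * t)
        samples.append(amplitude * math.sin(phase))
    return samples

def instantaneous_frequency(f_start_hz, f_end_hz, duration_s, t):
    """Frequency (Hz) of the linear sweep at time t (s)."""
    return f_start_hz + (f_end_hz - f_start_hz) * t / duration_s

sweep = linear_fm_sweep(3000.0, 2000.0, 0.080)   # 80 ms, 3 kHz down to 2 kHz
print(len(sweep), instantaneous_frequency(3000.0, 2000.0, 0.080, 0.040))
```

Swapping the first two arguments gives the corresponding UP sweep with the same duration and frequency span.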

Page 37

The rationale

• Important cues for speech perception: formant transitions in speech sounds (for example, F2 direction can distinguish /ba/ from /da/)
• Importance in tone languages
• Vertebrate auditory system is well equipped to analyze FM signals.

Page 38

Tone languages

• For example, Chinese, Thai…

• The direction of FM (of the fundamental frequency) is important in the language to make lexical distinctions.

• (Four tones in Chinese)

/Ma 1/, /Ma 2/ , /Ma 3/, /Ma 4/

Page 39

Questions

• How good are we at discriminating these signals?
  Determine the threshold of the duration of stimuli (corresponding to rate) for the detection of FM direction.
  Is there any performance difference between UP and DOWN detection?

• Will language experience affect the performance of such a basic perceptual ability?

Page 40

Stimuli

• Linearly frequency modulated
• Frequency range studied: 2-3 kHz (0.5 oct)
• Two directions (Up / Down)
• Changing FM rate (frequency range/time) by changing duration. For each frequency range, frequency span is kept constant (slow / fast)
• Stimulus duration: from 5 ms (100 oct/sec) to 640 ms (0.8 oct/sec)

Tasks
• Detection and discrimination of UP versus DOWN
• 2AFC, 2IFC, 3IFC
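
Because the frequency span is held constant, FM rate is simply span divided by duration. The sketch below uses the nominal 0.5-octave span quoted on the slide and a list of durations matching the ones that appear on the results axes; the function name is illustrative.

```python
SPAN_OCTAVES = 0.5   # nominal span quoted on the slide for the 2-3 kHz range
DURATIONS_MS = [5, 10, 20, 30, 40, 50, 80, 160, 320, 640]

def fm_rate_oct_per_s(duration_ms, span_octaves=SPAN_OCTAVES):
    """FM rate in octaves per second for a sweep of fixed span."""
    return span_octaves / (duration_ms / 1000.0)

for d in DURATIONS_MS:
    print(f"{d:>4} ms -> {fm_rate_oct_per_s(d):6.1f} oct/s")
```

This runs from 100 oct/s at 5 ms down to about 0.8 oct/s at 640 ms, the two endpoints given in the stimulus description.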

Page 41

Performance (English speakers)

[Figure: three panels, one per frequency range (2-3 kHz, 1-1.5 kHz, 600-900 Hz). Each plots % correct (30% to 100%) for Up and Down sweeps against FM rate (oct/sec) [stimulus duration (ms)]: 100 [5], 50 [10], 25 [20], 16.7 [30], 12.5 [40], 10 [50], 6.2 [80], 3.1 [160], 1.6 [320], 0.8 [640].]

• 3 frequency ranges relevant to speech (approximately F1, F2, F3 ranges)
• single-interval 2-AFC

Two main findings:
• threshold for UP at 20 ms
• UP better than DOWN

Gordon & Poeppel (2001), JASA-ARLO

Page 42

2IFC
• To eliminate the possibility of a bias strategy that subjects can use
• To see whether the asymmetric performance of English subjects is due to their “Up preference bias”

Interval 1: UP
Interval 2: Down

Which interval (1 or 2) contains the sound with a given direction?

The two sounds have the same duration, so the only difference is direction.

Page 43

Results for Chinese Subjects

[Figure: Expt. 2 (Chinese subjects). Percent correct (0% to 100%) for Up and Down against duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms).]

No significant difference between Up and Down

Threshold for both UP and DOWN is about 20 msec

Page 44

Results for English Subjects

[Figure: Expt. 2 (English subjects). Percent correct (0% to 100%) for up and down against duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms).]

No difference now between UP and DOWN

Threshold for both at 20 msec

No difference between Chinese and English subjects now.

Page 45

3IFC

Standard: UP
Interval 1: UP
Interval 2: Down

Choose which interval contains the sound that differs from the other two (different quality rather than only direction)

Page 46

3 IFC versus 2 IFC

[Figure: Expt. 3 vs. Expt. 2 (English speakers). Percent correct (0% to 100%) against duration (5, 10, 20, 30, 40, 50, 80, 160, 320 ms) for up, down, and 3-interval difference.]

[Figure: Expt. 3 vs. Expt. 2 (Chinese speaker). Same axes, for Up, Down, and difference detection.]

No difference between Chinese and English subjects

Threshold confirmed at 20 ms

Page 47

Conclusion

• Importance of 20 msec as the threshold for discrimination of FM sweeps

- corresponds to temporal order threshold determined by Hirsh 1959

- consistent with Schouten 1985, 1989 testing FM sweeps

- this basic threshold arguably reflects the shortest integration window that generates robust auditory percepts.

Page 48

Click trains

Anthony Boemio & David Poeppel

Page 49

Click Stimuli

Page 50

Psychophysics

Page 51

Auditory visual integration: the McGurk effect

Virginie van Wassenhove, Ken Grant, David Poeppel

Page 52

McGurk Effect

• Audiovisual (AV) token

• Visual (V) token

• Auditory (A) token

Page 53

Identification Task (3AFC) ApVk

[Figure: response rate (%), from -40% to 100%, against SOA (ms) from -467 (A lead) to +467 (A lag) in steps of about 67 ms. Series: fusion rate, visually driven, auditorily driven, corrected fusion rate. Annotations: true bimodal responses; TWI.]

Response rate as a function of SOA (ms) in the ApVk McGurk pair.

Mean responses (N=21) and standard errors. Fusion rate (open red squares) and corrected fusion rate (filled red squares, dotted line) are /ta/ responses, visually driven responses (open green triangles) are /ka/, and auditorily driven responses (filled blue circles) are /pa/. A negative value in corrected fusion rate is interpreted as a visually dominated error response /ta/.

Page 54

Simultaneity Judgment Task (2AFC) ApVk vs. AtVt and AbVg vs. AdVd

[Figure: simultaneity rate (%), from 0% to 100%, against SOA (ms) from -467 (A lead) to +467 (A lag) in steps of about 67 ms. Series: ApVk, AtVt, AbVg, AdVd.]

Simultaneity judgment task. Simultaneity judgment as a function of SOA (ms) in both incongruent and congruent conditions (ApVk and AtVt N=21; AbVg and AdVd N=18). The congruent conditions (open symbols) are associated with a broader and higher simultaneity judgment profile than the incongruent conditions (filled symbols).

Page 55

Temporal Window of Integration (TWI) across Tasks and Bimodal Speech Stimuli

[Figure: TWI extents for AdVd, AtVt, AbVg and ApVk on an SOA axis from -200 ms (A lead) to +200 ms (A lag), in steps of 50 ms; tasks labelled S and ID.]

Stimulus  Task  A lead, left    A lag, right    Plateau      Window
                boundary (ms)   boundary (ms)   center (ms)  size (ms)
ApVk      ID    -25             +136            +56          161
ApVk      S     -44             +117            +37          161
AtVt      S     -80             +125            +23          205
AbVg      ID    -34             +174            +70          208
AbVg      S     -37             +122            +43          159
AdVd      S     -74             +131            +29          205
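
The last two columns of the table can be derived from the two boundaries. The sketch below assumes that the window size is the distance between the boundaries and that the plateau center is their midpoint rounded half up; the slide does not state how the center was computed, so the midpoint rule is an assumption that happens to reproduce the listed values.

```python
import math

# (stimulus, task, left boundary ms, right boundary ms), copied from the table
BOUNDARIES = [
    ("ApVk", "ID", -25, 136),
    ("ApVk", "S", -44, 117),
    ("AtVt", "S", -80, 125),
    ("AbVg", "ID", -34, 174),
    ("AbVg", "S", -37, 122),
    ("AdVd", "S", -74, 131),
]

def window_size_ms(left, right):
    """Width of the integration window."""
    return right - left

def plateau_center_ms(left, right):
    """Midpoint of the window, rounded half up (assumed rule)."""
    return math.floor((left + right) / 2 + 0.5)

for stim, task, left, right in BOUNDARIES:
    print(stim, task, plateau_center_ms(left, right), window_size_ms(left, right))
```

All six windows come out between roughly 160 and 210 ms wide, with centers shifted toward auditory lag.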

Page 56

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time
  • AST model

- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20-30 ms integration
  • AV processing in McGurk: 200 ms integration

- Imaging evidence

Page 57

Binding of Temporal Quanta in Speech Processing

Maria Chait, Steven Greenberg, Takayuki Arai, David Poeppel

Page 58

Multi Resolution Analysis Hypothesis

“SYLLABLE”

Supra-segmental information (t.s ~300 ms): syllabicity, stress, tone

(Sub)-segmental information (t.s ~30 ms): feature

Binding process

Page 59

Signal Processing:

1. Filtering: the original signal is split into frequency bands (0-265 Hz, 265-315 Hz, ..., 5045-6000 Hz).
2. Computing the envelope and fine structure of each band: (E1, FS1), (E2, FS2), ..., (E14, FS14).
3. Filtering each envelope: low pass (0-3 Hz) for S_low, high pass (22- Hz) for S_high.
4. Multiply E by FS in each band: E1×FS1, E2×FS2, ..., E14×FS14.

Outputs: S_low and S_high
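
The per-channel step (extract the envelope, filter it, multiply it back onto the fine structure) can be sketched for a single band as follows. This is a simplified stand-in and not the authors' pipeline: it assumes rectification plus smoothing for envelope extraction and a first-order low-pass filter, and the names `one_pole_lowpass`, `modify_channel`, and the 50 Hz smoothing cutoff are all hypothetical choices.

```python
import math

def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """First-order low-pass filter (simple stand-in for the slide's filter)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out

def modify_channel(band_signal, envelope_cutoff_hz, sample_rate_hz, smooth_hz=50.0):
    """Low-pass the envelope of one band and re-impose it on the fine structure."""
    # Envelope E: rectify, then smooth (assumed extraction method)
    envelope = one_pole_lowpass([abs(x) for x in band_signal],
                                smooth_hz, sample_rate_hz)
    # Fine structure FS: band signal divided by its envelope
    fine = [x / e if e > 1e-9 else 0.0 for x, e in zip(band_signal, envelope)]
    # Filter the envelope, then recombine as E_filtered x FS
    slow = one_pole_lowpass(envelope, envelope_cutoff_hz, sample_rate_hz)
    return [e * f for e, f in zip(slow, fine)]

# Toy band: 1 kHz carrier with a 30 Hz amplitude modulation (assumed values)
FS_HZ = 8000
tone = [(1.0 + 0.8 * math.sin(2 * math.pi * 30 * i / FS_HZ))
        * math.sin(2 * math.pi * 1000 * i / FS_HZ) for i in range(FS_HZ)]
low_only = modify_channel(tone, 3.0, FS_HZ)   # keep only slow envelope changes
print(len(low_only))
```

A fuller version would apply this to each of the filter-bank channels and combine the results; swapping the low-pass for a band-pass on the envelope would give the fast-modulation condition.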

Page 60

• 0-6 kHz

• 14 channels

• spaced in 1/3 octave steps along the cochlear frequency map.

• Every two neighboring channels are separated by 50 Hz

Page 61

Envelope Extraction

Time

Amplitude

Page 62

Original Envelope

Low Passed Envelope

High Passed Envelope

Page 63

Original

High Passed Low Passed

Page 64

Evidence:

• Comodulation masking release
• Ahissar et al. (2001): phase locking in the auditory cortex to the envelope of sentence stimuli
• Shannon (1995)
• Drullman (1994):
  Effect of low-pass filtering the envelope on speech reception:
  * severe reduction at 0-2 Hz cutoff frequencies
  * marginal contribution of frequencies above 16 Hz
  Effect of high-pass filtering the envelope:
  * reduction in speech intelligibility for cutoff frequencies above 64 Hz
  * no reduction in sentence intelligibility when only frequencies below 4 Hz are reduced

Page 65

Experiment 1

Stimuli:
- 53 sentences from the IEEE corpus
- Nonsense syllables (CUNY): 8 blocks = 2 (voiced/voiceless) × 2 vowels (/a/, /i/) × 2 (CV/VC)
- 3 manipulations:
  0-3 Hz low pass
  22-40 Hz band pass
  0-3 and 22-40 Hz

Each subject hears all 53 sentences but only one manipulation per sentence. A practice block of 26 sentences precedes the experiment.

Task:
- Sentences: subjects asked to write down what they heard as precisely as they can
- Syllables: 7-alternative forced choice

Presented dichotically

Page 66

Results

[Figure: bar chart, 0% to 100%, with conditions 0-3, 22-40, and Dichotic. Label shown: high-pass.]

Page 67

Results

[Figure: same chart. Labels shown: high-pass, low-pass.]

Page 68

Results

[Figure: same chart. Labels shown: high-pass, low-pass, high-pass plus low-pass?]

Page 69

Results

[Figure: same chart. Labels shown: high-pass, low-pass, high-pass plus low-pass?]

Result reflects the interaction between information carried on the short and long time scales.

Page 70

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time
  • AST model

- Psychophysical evidence for temporal integration
  • FM sweeps and click trains: 20-30 ms integration
  • AV processing in McGurk: 200 ms integration
  • Interaction of temporal windows

- Imaging evidence

Page 71

fMRI study of temporal structure in concatenated FMs

Anthony Boemio, Allen Braun, Steven Fromm, David Poeppel

Page 72

Stimulus Properties

Page 73

Stimulus Properties

All 13 stimuli have nearly identical long-term spectra and RMS power over the entire 9-second stimulus duration. Stimuli differ only in segment duration which was determined by drawing from a Gaussian distribution (previous panel), with means of 12, 25, 45, 85, 160, and 300ms.
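
Drawing segment durations from a Gaussian until the 9-second stimulus is filled can be sketched as below. Only the means and the total duration come from the slide; the standard deviation, the 1 ms lower clamp, the trimming of the final segment, and the function name are assumptions made for illustration.

```python
import random

def draw_segment_durations(mean_ms, sd_ms, total_ms=9000.0, seed=0):
    """Gaussian-distributed segment durations that together fill total_ms."""
    rng = random.Random(seed)
    durations, elapsed = [], 0.0
    while elapsed < total_ms:
        d = max(1.0, rng.gauss(mean_ms, sd_ms))  # clamp: no non-positive durations
        d = min(d, total_ms - elapsed)           # trim the last segment to fit
        durations.append(d)
        elapsed += d
    return durations

# Hypothetical spread of 5 ms around the 25 ms mean
segments = draw_segment_durations(mean_ms=25.0, sd_ms=5.0)
print(len(segments), sum(segments))
```

Repeating the call with each of the listed means would give stimuli that share total duration but differ in how finely they are segmented.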

[Figure: spectrograms (time 0 to 1 sec), PSDs (frequency 100 Hz to 1E4 Hz, scale 1E-10 to 1), and amplitude vs. time for the FM, CNST, and TONE stimuli]

Page 74

fMRI
• Single-trial sparse acquisition paradigm (clustered volume acquisition)
• 1.5T GE Signa, echo-planar sequence
• 11.4 s TR (9 s signal, 2.4 s volume), TE 40 ms
• 24 reps/condition
• SPM 99 random-effects model, p<0.05 corrected

Page 75

SPM 99 Cohort Analysis

FMs-CNST Categorical Contrasts (p < 0.05 corr.)

Page 76

Mean Supra-threshold Voxels vs. Segment Duration Summed Over All Conditions/Auditory Areas/Hemispheres

[Figure: mean supra-threshold voxels (0 to 200) against segment duration (12, 25, 45, 85, 160, 300 ms). Error bars are SEM.]

Page 77
Page 78

Hemodynamic response/stimulus model

Not all segment transitions are equal.

Including the segment transitions and the segments themselves, but assuming that transitions between long segments contribute more to the response than transitions between shorter ones, produces the observed activation vs. segment-duration relation (left).

[Figure labels: segment; segment transition; acquisition; FM/TONE; CNST. Only 1 second of stimuli is shown for clarity. The threshold is set by the categorical contrast to the CNST stimulus; anything below this level will be zero in the SPM.]

Page 79

[Figure: STS only. Suprathreshold voxels (one panel scaled 0 to 200, one scaled 0.00 to 0.75) against SOA (12, 25, 45, 85, 160, 300 ms).]

MTG/STS               P-Value
Type                  0.5994
Hemi                  0.3127
Rate                  <.0001
Type x Hemi           0.9396
Type x Rate           0.3772
Hemi x Rate           0.0034
Type x Hemi x Rate    0.4137

[Figure: STG only. Suprathreshold voxels (one panel scaled 0 to 450, one scaled 0.00 to 0.75) against SOA (12, 25, 45, 85, 160, 300 ms), for Left Hemi and Right Hemi.]

STG                   P-Value
Type                  0.0933
Hemi                  0.7514
Rate                  <.0001
Type x Hemi           0.8152
Type x Rate           0.0578
Hemi x Rate           0.7211
Type x Hemi x Rate    0.3209

Page 80

MEG study of spectral responses to complex sounds

David Poeppel, Huan Luo, Dana Ritter, Anthony Boemio, Didier Depireux, Jonathan Simon

Page 81

The asymmetric sampling in time (AST) hypothesis predicts electrophysiological asymmetries in specific frequency bands, gamma (25-55 Hz) and theta (3-8 Hz), because the hypothesized temporal quantization is reflected as oscillatory activity.

[Figure: sensitivity of neuronal ensembles, for LH and RH, against size of temporal integration windows (ms) [associated oscillatory frequency (Hz)], from 25 ms [40 Hz] to 250 ms [4 Hz].]

Page 82
Page 83
Page 84
Page 85

Flow chart

LH, RH -> gamma band-pass filter -> RMS -> gamma for LH, gamma for RH
LH, RH -> theta band-pass filter -> RMS -> theta for LH, theta for RH

Page 86
Page 87

Multi-taper spectral analysis

Page 88

Result

Page 89

Power ratio in specific frequency bands

• The left/right difference is much greater in the theta (low-frequency) band, where RH activation is greater than LH.

Power ratio P(L)/(P(L)+P(R)):

         Kaiser   Remez    Elliptic
Gamma    0.4769   0.4751   0.4733
Theta    0.3958   0.3965   0.4210
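
The ratio in the table can be computed from band-limited signals as sketched below. It assumes that band power is the squared RMS of the filtered signal, which is one reading of the band-pass, then RMS flow chart rather than a confirmed detail, and the function names are illustrative.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a band-limited signal."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def left_power_ratio(left_band, right_band):
    """P(L) / (P(L) + P(R)), with power taken as squared RMS (assumed)."""
    p_left = rms(left_band) ** 2
    p_right = rms(right_band) ** 2
    return p_left / (p_left + p_right)

# Toy signals: the right channel has twice the amplitude of the left
ratio = left_power_ratio([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0])
print(ratio)
```

A value of 0.5 means equal power in both hemispheres; values below 0.5, like the theta row, indicate more power on the right.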

Page 90

Distribution of spectral responses

Page 91: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Outline

(1) Fractionating the problem in space:

Towards a functional anatomy of speech perception

(2) Fractionating the problem in time:

Towards a functional physiology of speech perception

- A hypothesis about the quantization of time
• AST model

- Psychophysical evidence for temporal integration
• FM sweeps and click trains: 20-30ms integration
• AV processing in McGurk: 200ms integration
• Interaction of temporal windows

- Imaging evidence
• fMRI: temporal sensitivity and lateralization
• MEG spectral lateralization

Page 92: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

STG (bilateral): acoustic-phonetic speech codes

pMTG (left): sound-meaning interface

Area Spt (left): auditory-motor interface

pIFG/dPM (left): articulatory-based speech codes

Hickok & Poeppel (2000), Trends in Cognitive Sciences
Hickok & Poeppel (in press), Cognition

Page 93: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

LH: Analyses requiring high temporal resolution, e.g. formant transitions

RH: Analyses requiring high spectral resolution, e.g. intonation contours

[Figure: LH and RH panels. Y-axis: Proportion of neuronal ensembles. X-axis: Size of temporal integration windows (ms) [Associated oscillatory frequency (Hz)], from 25 ms [40Hz] to 250 ms [4Hz]]

Symmetric representation of spectro-temporal receptive fields in primary auditory cortex

a. Physiological lateralization

b. Functional lateralization

Temporally asymmetric elaboration of perceptual representations in non-primary cortex

Asymmetric sampling in time (AST) builds on anatomical symmetry but permits functional asymmetry

Page 94: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Conclusion

The input signal (e.g. speech) must interface with higher-order symbolic representations of different types (e.g. segmental representations relevant to lexical access and supra-segmental representations relevant to interpretation).

These higher-order representation categories appear to be lateralized (e.g. segmental phonology/LH, phrasal prosody/RH).

The timing-based asymmetry provides a possible cortical ‘logistical’ or ‘administrative’ device that helps create representations of the appropriate granularity.

If this is on the right track, the syllable is, at least for perception, as elementary a unit as the feature/segment. Both are basic.

Page 95: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Analysis-by-synthesis I

Hypothesize-and-test models

Analysis

Synthesis

Peripheral auditory processing

Segmentation and labeling

spectral representation

Lexical access code

Long-term memory: Abstract lexical repr.

Recoding

acoustic-phonetic manifestations of words

contextual information

MATCHING PROCESS

BEST LEXICAL CANDIDATE

Where do the candidates for synthesis come from?

Page 96: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Analysis-by-synthesis II

Analysis-by-synthesis model of lexical hypothesis generation and verification (adapted and extended from Klatt, 1979)

spectral analysis

analysis-by-synthesis verification;

“internal forward model”

speech waveform

segmental analysis

lexical search

synt./seman. analysis

peripheral and central ‘neurogram’

partial feature matrix

lexical hypotheses

predicted subsequent items

best-scoring lexical candidates

acceptable word string
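The hypothesize-and-test loop in this diagram can be illustrated with a toy example: a partial analysis of the input proposes lexical hypotheses, an internal forward model synthesizes what each hypothesis predicts, and the predictions are matched against the input to pick the best-scoring candidate. Everything below is a hypothetical simplification (letters stand in for segments, the lexicon is made up, and the scoring rule is invented for illustration); it is not Klatt's model or the model on the slide.

```python
# Hypothetical toy lexicon; letters stand in for phonetic segments.
LEXICON = ["chair", "chain", "lunch", "soon"]


def lexical_search(partial, lexicon):
    """Hypothesis generation: words of matching length whose first segment fits."""
    return [w for w in lexicon
            if len(w) == len(partial) and partial[0] in (None, w[0])]


def synthesize(word):
    """Stand-in for the internal forward model: predicted segments for a word."""
    return list(word)


def match_score(predicted, partial):
    """Fraction of the observed (non-None) segments that the prediction matches."""
    known = [(p, o) for p, o in zip(predicted, partial) if o is not None]
    if not known:
        return 0.0
    return sum(p == o for p, o in known) / len(known)


def best_candidate(partial, lexicon):
    """Verification: score every hypothesis and return (score, word) for the best."""
    scored = [(match_score(synthesize(w), partial), w)
              for w in lexical_search(partial, lexicon)]
    return max(scored) if scored else None


# Partial feature matrix: the third segment could not be identified.
observed = ["c", "h", None, "i", "r"]
print(lexical_search(observed, LEXICON))
print(best_candidate(observed, LEXICON))
```

The point of the sketch is the division of labor: the search step only needs coarse cues to propose candidates, and the synthesis-and-match step does the fine-grained verification.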

Page 97: Colleagues : Allen Braun, NIH Greg Hickok, UC Irvine Jonathan Simon, Univ. Maryland

Analysis-by-synthesis III

spectral analysis

analysis-by-synthesis verification;

“internal forward model”

speech waveform

segmental analysis

lexical search

synt./seman. analysis

peripheral and central ‘neurogram’

partial feature matrix

lexical hypotheses

predicted subsequent items

best-scoring lexical candidates

acceptable word string

auditory cortex

pSTG? MTG? ITG?

frontal areas (articulatory codes) - l IFG, premotor

temporo-parietal areas?