Computational Psycholinguistics
Lecture 2: surprisal, incremental syntactic processing, and approximate surprisal
Florian Jaeger & Roger Levy
LSA 2011 Summer Institute, Boulder, CO, 12 July 2011
Comprehension: Theoretical Desiderata
[Figure: how to get from here ("the boy will eat…") …to here?]
• Realistic models of human sentence comprehension must account for:
  • Robustness to arbitrary input
  • Accurate disambiguation
  • Inference on the basis of incomplete input
(Tanenhaus et al 1995, Altmann and Kamide 1999, Kaiser and Trueswell 2004)
• Processing difficulty is differential and localized
Review
• Garden-pathing under Jurafsky 1996
• Scoring relative probability of incremental trees
• An incremental tree is a fully connected sequence of nodes from the root category (typically, S) to all the terminals (words) that have been seen so far
• Nodes on the right frontier of an incremental tree are still “open” (could accrue further daughters)
• What kind of uncertainty does the Jurafsky 1996 model of garden-pathing deal with?
  • Uncertainty about what has already been said
Generalizing incremental disambiguation
• Another type of uncertainty
• This is uncertainty about what has not yet been said
• Reading-time (Ehrlich & Rayner, 1981) and EEG (Kutas & Hillyard, 1980, 1984) evidence shows this affects processing rapidly
• A good model should account for expectations about how this uncertainty will be resolved
The old man stopped and stared at the … (woman? dog? view? statue?)
The squirrel stored some nuts in the tree
the reporter who the senator attacked
Non-probabilistic complexity
• On the traditional view, resource limitations, especially memory, drive processing complexity
• Gibson 1998, 2000 (DLT): multiple and/or more distant dependencies are harder to process
Processing easy: the reporter who attacked the senator
Processing hard: the reporter who the senator attacked
Probabilistic complexity: surprisal
• Hale (2001) proposed that a word’s complexity in sentence comprehension is determined by its surprisal
• This idea can actually be traced back (at least) to Mandelbrot (1953)
• (Cognitive science in the 1950s was extremely interesting -- many ideas to be mined!)
The surprisal graph
[Figure: surprisal (-log P) as a function of probability P over (0, 1]; surprisal grows without bound as P approaches 0 and is 0 at P = 1]
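For reference, the quantity plotted here is the negative log conditional probability of a word given the preceding context; in LaTeX notation:

\mathrm{surprisal}(w_i) \;=\; -\log P(w_i \mid w_1 \ldots w_{i-1}) \;=\; \log \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}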
Garden-pathing under surprisal
• Another type of local syntactic ambiguity
• Compare with:
When the dog scratched the vet and his new assistant removed the muzzle.
When the dog scratched, the vet and his new assistant removed the muzzle.
When the dog scratched its owner the vet and his new assistant removed the muzzle.
A small PCFG for this sentence type
Two incremental trees
Surprisal for the two variants
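To make the computation concrete, here is a minimal Python sketch. The grammar is not the slides' actual PCFG; the probabilities are illustrative assumptions. Surprisal of the disambiguating verb is obtained by marginalizing over the incremental trees compatible with the prefix.

import math

# Illustrative (assumed) probabilities of the two incremental analyses of the prefix
# "When the dog scratched the vet and his new assistant ...":
#   object:  "the vet and his new assistant" is the object of "scratched"
#            (a main-clause subject is still expected)
#   subject: "scratched" is intransitive; "the vet and his new assistant"
#            is the main-clause subject (a main verb is expected next)
p_analysis = {"object": 0.8, "subject": 0.2}

# Assumed probability of the next word being "removed" under each analysis
p_removed_given_analysis = {"object": 0.001, "subject": 0.2}

def surprisal(next_word_probs, analysis_probs):
    """Surprisal (bits) of the next word, marginalizing over incremental trees."""
    p_word = sum(analysis_probs[a] * next_word_probs[a] for a in analysis_probs)
    return -math.log2(p_word)

# Ambiguous prefix: high surprisal at "removed" (the garden path)
print(surprisal(p_removed_given_analysis, p_analysis))
# With the disambiguating comma ("When the dog scratched, ..."), the object
# analysis is ruled out and "removed" is much less surprising
print(surprisal(p_removed_given_analysis, {"object": 0.0, "subject": 1.0}))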
Expectations versus memory
• Suppose you know that some event class X has to happen in the future, but you don't know:
  1. When X is going to occur
  2. Which member of X it's going to be
• The things W you see before X can give you hints about (1) and (2)
• If expectations facilitate processing, then seeing W should generally speed processing of X
• But you also have to keep W in memory and retrieve it at X
• This could slow processing at X
Study 1: Verb-final domains
• Konieczny 2000 looked at reading times at German final verbs in a self-paced reading experiment
Er hat die Gruppe auf den Berg geführt
He has the group to the mountain led
"He led the group to the mountain"

Er hat die Gruppe geführt
He has the group led
"He led the group"

Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
He has the group to the VERY BEAUTIFUL mountain led
"He led the group to the very beautiful mountain"
Locality predictions and empirical results
• Locality-based models (Gibson 1998) predict difficulty for longer clauses
• But Konieczny found that final verbs were read faster in longer clauses
No PP:    Er hat die Gruppe geführt ("He led the group"); locality prediction: easy; observed: slow
Short PP: Er hat die Gruppe auf den Berg geführt ("He led the group to the mountain"); locality prediction: hard; observed: fast
Long PP:  ...die Gruppe auf den sehr schönen Berg geführt ("He led the group to the very beautiful mountain"); locality prediction: harder; observed: fastest
[Figure: reading time (ms) at the final verb for the No PP, Short PP, and Long PP conditions; y-axis range roughly 450 to 520 ms]
Predictions of surprisal
[Figure: negative log probability of the final verb, plotted alongside reading time at the final verb, for Er hat die Gruppe (auf den (sehr schönen) Berg) geführt, i.e. the No PP, Short PP, and Long PP conditions]
• Locality-based models (e.g., Gibson 1998, 2000) would violate monotonicity: locality-based difficulty (ordinal: 1, 2, 3) increases with the length of the clause, opposite to the observed reading times
Levy 2008
• Once we've seen a PP goal, we're unlikely to see another
• So the expectation of seeing anything else goes up
• Pi(w) obtained via a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
• Seeing more = having more information
• More information = more accurate expectations
Deriving Konieczny’s results
[Tree diagram: incremental parse of Er hat die Gruppe auf den Berg ..., with S spanning NP (Er), Vfin (hat), and a VP containing NP (die Gruppe) and PP (auf den Berg); open predictions for what follows include NP?, PP-goal?, PP-loc?, Verb?, ADVP?, with geführt as the final verb]
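The intuition can be sketched numerically (the numbers below are assumptions for illustration, not NEGRA-derived estimates): once the goal PP has been seen, it drops out of the set of likely continuations and the remaining expectations, including the verb, are renormalized upward.

# Assumed distribution over the next constituent after "Er hat die Gruppe ..."
before_pp = {"PP-goal": 0.4, "PP-loc": 0.2, "ADVP": 0.1, "Verb": 0.3}

# After "auf den Berg", assume a second goal PP is essentially ruled out,
# so the remaining probability mass is renormalized over the other options.
after_pp = {cat: p for cat, p in before_pp.items() if cat != "PP-goal"}
total = sum(after_pp.values())
after_pp = {cat: p / total for cat, p in after_pp.items()}

print(before_pp["Verb"], after_pp["Verb"])  # 0.3 -> 0.5: the verb is now more expected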
Study 2: Final verbs, effect of dative
...daß der Freund DEM Kunden das Auto verkaufte
...that the friend the client the car sold
'...that the friend sold the client a car...'

...daß der Freund DES Kunden das Auto verkaufte
...that the friend the client the car sold
'...that the friend of the client sold a car...'

• Locality: final verb predicted to be read faster in the DES condition
• Observed: final verb read faster in the DEM condition
(Konieczny & Döring 2003)
[Tree diagrams: incremental parses of daß der Freund DEM Kunden das Auto ... (NPnom, NPdat, NPacc as separate clause arguments) versus daß der Freund DES Kunden das Auto ... (NPgen inside the subject NPnom, then NPacc), each annotated with the predicted continuations (Next: NPnom, NPacc, NPdat, PP, ADVP, Verb) and the final verb verkaufte]
Model results

dem Kunden (dative):   reading time 555 ms; P(wi) = 8.38 × 10^-8; locality-based prediction: slower
des Kunden (genitive): reading time 793 ms; P(wi) = 6.35 × 10^-8; locality-based prediction: faster

• ~30% greater expectation (word probability at the final verb) in the dative condition
• Once again, locality-based predictions have the wrong monotonicity
Theoretical bases for surprisal
• So far, we have simply stipulated that complexity ~ surprisal
• To a mathematician, surprisal is a natural cost metric
• But a cognitive scientist would like to derive surprisal from prior principles
• I'll present three derivations of surprisal in this section
(1) Surprisal as relative entropy
• Relative entropy: a fundamental information-theoretic measure of the distance between two probability distributions
• Intuitively, the penalty paid by encoding one distribution with a different one
• It turns out that the relative entropy between the interpretation distributions after and before wi is exactly log 1/P_{i-1}(wi), i.e., the surprisal
• Surprisal can thus be thought of as reranking cost
• Relative entropy has independently been proposed as a measure of surprise in visual scene perception (Itti & Baldi 2005)
Levy 2008
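Spelling the identity out (a standard derivation along the lines of Levy 2008; each interpretation T is taken to fully specify the string, so conditioning on w_i just renormalizes over the interpretations consistent with it):

\begin{aligned}
D\bigl(P_i \,\|\, P_{i-1}\bigr)
  &= \sum_{T} P_i(T)\,\log\frac{P_i(T)}{P_{i-1}(T)}
   = \sum_{T\ \text{consistent with}\ w_i} P_i(T)\,\log\frac{P_{i-1}(T)/P_{i-1}(w_i)}{P_{i-1}(T)} \\
  &= \log\frac{1}{P_{i-1}(w_i)}\,\sum_{T} P_i(T)
   = \log\frac{1}{P_{i-1}(w_i)},
\end{aligned}

that is, exactly the surprisal of w_i.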
(2) Surprisal as optimal discrimination
• Many theories of reading posit lexical access as the key bottleneck
  • E-Z Reader (Reichle et al., 1998); SWIFT (Engbert et al., 2005)
  • The same bottleneck should hold for auditory comprehension as well
• Norris (2006)'s Bayesian Reader: lexical access involves a probabilistic judgment about the word's identity from noisy input
• Certainty takes a "random walk" in probability space toward a decision threshold, and surprisal determines the starting point of the walk
• Connections with the diffusion model (Ratcliff 1978) and MSPRT (Baum & Veeravalli 1994)
• Also connections with cortical decision-process models (e.g., Usher & McClelland 2001)
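A rough simulation of the idea (a sketch only, not Norris's actual model; the drift, noise, and threshold values are arbitrary assumptions): the walk starts at the word's prior log-odds, so low-probability words start farther from the decision threshold and need more evidence samples on average, roughly in proportion to their surprisal.

import math
import random

def steps_to_threshold(prior_prob, threshold=3.0, drift=0.2, noise=0.5):
    """Random walk in log-odds space: start at the word's prior log-odds and
    accumulate noisy evidence until the decision threshold is crossed.
    The number of steps is a crude proxy for recognition time."""
    x = math.log(prior_prob / (1 - prior_prob))  # starting point set by the prior
    steps = 0
    while x < threshold:
        x += drift + random.gauss(0, noise)      # noisy evidence favoring the word
        steps += 1
    return steps

for p in (0.1, 0.01, 0.001):
    mean_steps = sum(steps_to_threshold(p) for _ in range(2000)) / 2000
    print(p, round(mean_steps, 1))  # lower prior probability -> longer walk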
(3) Surprisal as optimal preparation
• Are all RT differences best modeled as discrimination?
• Intuitively, it makes sense to prepare for events you expect to happen
• Such preparation allows increased average response speed
• Smith & Levy (2008) formalize this intuition as an optimization of response speed against (fixed) preparation costs:
  • Let the brain choose response times, but faster is costlier
  • + scale-freeness: a unit's processing cost is the sum of the costs of its subunits
  • = surprisal, under very general conditions
Smith & Levy, 2008
Is probabilistic facilitation logarithmic?
• What I've shown you so far: more expected = faster
• What the theoretical derivations promise: more expected = faster on a logarithmic scale
• Established for frequency, but not for probability
• A focused look at the subtleties of specific constructions may not be the best way to investigate this issue:
  • highly refined probability distributions are challenging to estimate
  • we need a lot of data to get a good view of the picture
• Solution: broad-coverage model, reading over free text
Smith & Levy, 2008
Log-probability: methods
• Dataset
  • the Dundee Corpus (Kennedy et al., 2003)
  • 50K words of British newspaper text, read by 10 speakers
• Measures of interest:
  • "Frontier" fixations (all fixations beyond the farthest fixation thus far)
  • First fixations (frontier fixations falling on a new word)
[Illustration: fixation sequence over "fox jumped over the lazy dog", distinguishing frontier fixations from first fixations]
Deconfounding frequency & probability
• Major confound: log-frequency, widely recognized to have a linear effect on RT
• Unfortunately, frequency and probability are heavily correlated (r ≈ 0.8)
• Fortunately, there's still a big cloud of data to help us discriminate between the two (N ≈ 200,000)
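A sketch of the kind of regression this involves, run on simulated stand-in data (the coefficients, predictors, and use of ordinary least squares are illustrative assumptions; the actual analyses use nonparametric fits and control for spillover): with a cloud of this size, correlated predictors such as log-frequency and log-probability can still be teased apart.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000                                      # roughly the size of the data cloud
log_freq = rng.normal(-9.0, 2.0, n)              # simulated log word frequency
log_prob = 0.8 * log_freq + rng.normal(0, 1, n)  # simulated log P(word | context), r ~ 0.8 with frequency
length = rng.integers(2, 12, n)                  # simulated word length
# Simulated reading times with distinct linear effects of each predictor
rt = 250 - 8 * log_prob - 3 * log_freq + 4 * length + rng.normal(0, 40, n)

X = np.column_stack([np.ones(n), log_prob, log_freq, length])
beta, *_ = np.linalg.lstsq(X, rt, rcond=None)
print(beta)  # recovers separate contributions of log-probability and log-frequency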
Log-probability: results
• Facilitation is essentially linear in log-probability
• True even after controlling conservatively for frequency and word-length effects
[Figure: binned median log-probabilities and frontier-fixation RTs, with a nonparametric regression fit]
Aggregation across words & spillover
[Figures: eye-tracking and self-paced reading panels]
When ambiguity facilitates comprehension
• Sometimes, ambiguity seems to facilitate processing:
• Argued to be problematic for parallel constraint-based competition models (MacDonald, Pearlmutter, & Seidenberg 1994)
  • (though see rebuttal by Green & Mitchell 2006)
The daughter_i of the colonel_j who shot himself_{*i/j}    slower
The daughter_i of the colonel_j who shot herself_{i/*j}    slower
The son_i of the colonel_j who shot himself_{i/j}          faster
(Traxler et al. 1998; Van Gompel et al. 2001, 2005)
• Sometimes the reader attaches the RC low... and everything's OK
• But sometimes the reader attaches the RC high... and the continuation (himself) is anomalous
• So we're seeing garden-pathing 'some' of the time
Traditional account: stochastic race model
[Tree diagram: NP [NP the daughter] [PP of [NP the colonel]], with the RC who shot… able to attach either high (to the daughter) or low (to the colonel)]
(Traxler et al. 1998; Van Gompel et al. 2001, 2005)
Surprisal as a parallel alternative
• Assume a generative model where the choice between herself and himself is determined only by the antecedent's gender
[Tree diagrams: the two attachments of the RC who shot… in the daughter of the colonel, each generating a reflexive (him/her + self) whose form depends on the antecedent's gender]
• Surprisal marginalizes over possible syntactic structures:

P_i(w) = Σ_T P_i(T) P(w | T)

P_i(himself) = P_i(RC_low) P(self | RC_low) P(himself | self, RC_low)
             + P_i(RC_high) P(self | RC_high) P(himself | self, RC_high)
where x_high = P_i(RC_high) and x_low = P_i(RC_low) are the probabilities of high and low attachment, y_high and y_low are the probabilities of a reflexive (self) under each attachment (with y'_high ≈ y_high), and the final factor P(himself | self, RC) is 1 when the antecedent is masculine and 0 when it is feminine.
Ambiguity reduces the surprisal
p_i(himself | daughter) = x_high · y_high · 0 + x_low · y_low · 1
p_i(himself | son)      = x_high · y'_high · 1 + x_low · y_low · 1

• son…who shot… can contribute probability mass to himself, but daughter…who shot… can't
⇒ p_i(himself | daughter) < p_i(himself | son)
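The same computation in code (the attachment probabilities x_high and x_low and the reflexive probability y are illustrative assumptions; the gender-match factors follow the slide): ambiguity lets both attachments contribute probability mass to himself, so its surprisal drops.

import math

def p_himself(x_high, x_low, y, match_high, match_low):
    """P_i(himself), marginalizing over high and low RC attachment.
    y stands in for P(reflexive | RC); match_* is P(himself | reflexive, attachment),
    i.e. 1 if the antecedent is masculine and 0 otherwise."""
    return x_high * y * match_high + x_low * y * match_low

x_high, x_low, y = 0.4, 0.6, 0.1   # assumed attachment preferences and reflexive rate

# "The daughter of the colonel who shot himself...": only low attachment (the colonel) matches
p_daughter = p_himself(x_high, x_low, y, match_high=0.0, match_low=1.0)
# "The son of the colonel who shot himself...": both attachments match
p_son = p_himself(x_high, x_low, y, match_high=1.0, match_low=1.0)

print(-math.log2(p_daughter), -math.log2(p_son))  # himself is less surprising after "son"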
Ambiguity/surprisal conclusion
• Cases where ambiguity reduces difficulty aren't problematic for parallel constraint satisfaction
  • Although they may be problematic for competition
• Surprisal can be thought of as a revision of constraint-based theories with competition
  • Same: a variety of constraints immediately brought to bear on syntactic comprehension
  • Different: linking hypothesis from probabilistic constraints to behavioral observables
Competition versus surprisal: speculation
• Swets et al. (submitted): question type can affect behavioral responses to ambiguous RCs:
“Did the colonel get shot?”
• Asking about the RC slowed RC reading time across the board
• And speed of response interacted with question type
  • RC questions were answered slowest in the ambiguous condition
• Speculation:
  • Comprehension is generally parallel & surprisal-based
  • Competition emerges when the comprehender is forced into a serial channel
Memory constraints: a theoretical puzzle
• The number of logically possible analyses grows (at best) exponentially in sentence length
• Exact probabilistic inference with context-free grammars can be done efficiently in O(n³)
• But…
  • this requires probabilistic locality, limiting conditioning context
  • human parsing is linear, i.e. O(n), anyway
• So we must be restricting attention to some subset of analyses
• Puzzle: how do we choose and manage this subset?
  • Previous efforts: k-best beam search
• Here, we'll explore the particle filter as a model of limited-parallel approximate inference
Levy, Reali, & Griffiths, 2009, NIPS
The particle filter: general picture
• Sequential Monte Carlo for incremental observations
• Let x_i be the observed data and z_i the unobserved states
  • For parsing: the x_i are words, the z_i are incremental structures
• Suppose that after n−1 observations we have the distribution over interpretations P(z_{n−1} | x_{1…n−1})
• After the next observation x_n, represent the next distribution P(z_n | x_{1…n}) inductively:
  • Approximate P(z_i | x_{1…i}) by samples
  • Sample z_n from P(z_n | z_{n−1}), and reweight by P(x_n | z_n)
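A generic sketch of one such update in Python (the transition and likelihood functions are placeholders to be supplied by the model, e.g. incremental PCFG expansions and lexical probabilities; this is not the authors' implementation):

def particle_filter_step(particles, weights, x_n, sample_transition, likelihood):
    """One sequential Monte Carlo update: propose z_n ~ P(z_n | z_{n-1}) for each
    particle, then reweight by P(x_n | z_n) and renormalize."""
    new_particles, new_weights = [], []
    for z_prev, w in zip(particles, weights):
        z_n = sample_transition(z_prev)               # extend the incremental structure
        new_particles.append(z_n)
        new_weights.append(w * likelihood(x_n, z_n))  # weight by fit to the observed word
    total = sum(new_weights) or 1.0                   # guard against all-zero weights
    return new_particles, [w / total for w in new_weights]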
Particle filter with probabilistic grammars
S    → NP VP      1.0        V    → brought  0.4
NP   → N          0.8        V    → broke    0.3
NP   → N RRC      0.2        V    → tripped  0.3
RRC  → Part N     1.0        Part → brought  0.1
VP   → V N        1.0        Part → broken   0.7
N    → women      0.7        Part → tripped  0.2
N    → sandwiches 0.3        Adv  → quickly  1.0
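The same grammar written as a data structure that the particle-filter sketches in this section could consume (rule probabilities exactly as in the table above):

# PCFG as a mapping from left-hand side to (right-hand side, probability) pairs
PCFG = {
    "S":    [(("NP", "VP"), 1.0)],
    "NP":   [(("N",), 0.8), (("N", "RRC"), 0.2)],
    "RRC":  [(("Part", "N"), 1.0)],
    "VP":   [(("V", "N"), 1.0)],
    "N":    [(("women",), 0.7), (("sandwiches",), 0.3)],
    "V":    [(("brought",), 0.4), (("broke",), 0.3), (("tripped",), 0.3)],
    "Part": [(("brought",), 0.1), (("broken",), 0.7), (("tripped",), 0.2)],
    "Adv":  [(("quickly",), 1.0)],
}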
[Figure: particles (*) over incremental parses of "women brought sandwiches": one set analyzes brought as a main verb (S → NP VP, rule/lexical weights 0.7 · 0.4 · 0.3), the other as a participle in a reduced relative (NP → N RRC, weights 0.7 · 0.1 · 0.3); when tripped arrives (V, 0.3), only the reduced-relative particles can be extended]
Resampling in the particle filter
• With the naïve particle filter, inferences are highly dependent on initial choices
  • Most particles wind up with small weights
  • The region of dense posterior is poorly explored
• Especially bad for parsing
  • The space of possible parses grows (at best) exponentially with input length
• We handle this by resampling at each input word
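A sketch of the resampling move added at each word (plain multinomial resampling; systematic or residual resampling are common alternatives):

import random

def resample(particles, weights):
    """Draw a new particle set of the same size, with probability proportional
    to weight; weights are then reset to uniform."""
    chosen = random.choices(particles, weights=weights, k=len(particles))
    return chosen, [1.0 / len(particles)] * len(particles)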
Simple garden-path sentences
The woman brought the sandwich from the kitchen tripped
• The posterior is initially misled away from the ultimately correct interpretation
• With a finite number of particles, recovery is not always successful
MAIN VERB (it was the woman who brought the sandwich)
REDUCED RELATIVE (the woman was brought the sandwich)
Solving a puzzle
A-S  Tom heard the gossip wasn't true.
A-L  Tom heard the gossip about the neighbors wasn't true.
U-S  Tom heard that the gossip wasn't true.
U-L  Tom heard that the gossip about the neighbors wasn't true.

• Previous empirical finding: ambiguity induces difficulty…
• …but so does the length of the ambiguous region
• Our linking hypothesis: the proportion of parse failures at the disambiguating region should increase with sentence difficulty
(Frazier & Rayner, 1982; Tabor & Hutchins, 2004)
Another example (Tabor & Hutchins 2004)
As the author wrote the essay the book grew.
As the author wrote the book grew.
As the author wrote the essay the book describing Babylon grew.
As the author wrote the book describing Babylon grew.
Resampling-induced drift
• In the ambiguous region, the observed words aren't strongly informative (P(x_i | z_i) is similar across different z_i)
• But due to resampling, the sampled approximation to P(z_i | x_{1…i}) will drift
  • One of the interpretations may be lost entirely
  • The longer the ambiguous region, the more likely this is
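A small simulation of this drift (a sketch under the simplifying assumption that the two interpretations are equally consistent with every word of the ambiguous region, so resampling amounts to pure drift):

import random

def p_interpretation_lost(n_particles=20, region_length=5, runs=5000):
    """Estimate the probability that one of two equally weighted interpretations
    disappears from the particle set, resampling once per word of the region."""
    lost = 0
    for _ in range(runs):
        particles = ["A", "B"] * (n_particles // 2)
        for _ in range(region_length):
            particles = random.choices(particles, k=n_particles)  # uniform-weight resample
            if len(set(particles)) == 1:   # one interpretation has been lost
                lost += 1
                break
    return lost / runs

for length in (2, 4, 8, 16):
    print(length, p_interpretation_lost(region_length=length))  # longer region -> more loss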
Model Results
Ambiguity matters…
But the length of the ambiguous region also matters!
Human results (offline rating study)