Sentence processing
Emily Morgan
LSA 2019 Summer Institute, UC Davis
Some sentences are harder to process than others.
• More predictable words are:
  • read faster
  • more likely to be skipped over in reading
  • less likely to evoke regressions in reading
• They also evoke distinctive neural responses, measured with e.g. Event-Related Potentials (ERPs)
2
He bought her a pearl necklace for her… birthday / collection
ERP responses to more/less predictable words
3
N400
(Kutas & Hillyard, 1984)
Processing difficulty arises from syntactic structure as well as word choice
• The complex houses married children and their families.
• The warehouse fires a dozen employees every year.
• The old man the boat.
• The prime number few.
• The horse raced past the barn fell.
(These are all garden path sentences, of varying severity.)
4
• This is the cat that the dog chased.
• This is the rat that the cat killed. (Which cat?)
• This is the rat that the cat that the dog chased killed.
• This is the cheese that the rat ate. (Which rat?)
• This is the cheese that the rat that the cat that the dog chased killed ate.
5
Sentences are full of ambiguities
One morning I shot an elephant in my pajamas.
How he got in my pajamas I’ll never know.
(Groucho Marx)
(Ford et al., 1982)
The woman discussed the dogs on the beach.
• What does “on the beach” modify?
  • dogs (90%); discussed (10%)
The woman kept the dogs on the beach.
• What does “on the beach” modify?
  • kept (95%); dogs (5%)
6
Sentence processing is incremental
• i.e. comprehenders don’t wait until they have the full sentence to process it
7
The boy will eat… vs. The boy will move…
Theoretical questions
• Why are some sentences easier/harder to process than others?
• How does a comprehender rapidly disambiguate between possible interpretations?
• (Noting that both processing difficulty and disambiguation occur incrementally/with incomplete input)
8
Possible factors influencing both processing difficulty and ambiguity resolution
• Memory constraints
  • This is the cheese that the rat that the cat that the dog chased killed ate.
  • Whoᵢ did you hope that the candidate said that he admired _____ᵢ?
• Expectations
  • i.e. how predictable is a word/grammatical structure, given the context (preceding words, real-world context, etc.)
  • He gave her a pearl necklace for her birthday/collection.
In order to test/disentangle these possibilities, we need to be able to model the predictability of words and syntactic structures in sentences
9
Outline
• Different models of sentence probability
  • n-grams
  • Probabilistic Context-Free Grammars
  • Recurrent Neural Networks
• Applying these models to psycholinguistic questions
  • How does a comprehender rapidly disambiguate between possible interpretations?
  • Why are some sentences easier/harder to process than others?
10
Introduction to N-grams
Slides from Dan Jurafsky
(Stanford University Natural Language Processing group)
Language Modeling
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence. Why?
  • Machine Translation:
    • P(high winds tonite) > P(large winds tonite)
  • Spell Correction
    • The office is about fifteen minuets from my house
    • P(about fifteen minutes from) > P(about fifteen minuets from)
  • Speech Recognition
    • P(I saw a van) >> P(eyes awe of an)
  • + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5, …, wn)
• Related task: probability of an upcoming word:
  P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2, …, wn−1), is called a language model.
• Better: the grammar. But “language model” or “LM” is standard.
How to compute P(W)
• How to compute this joint probability:
• P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probabilities:
  P(B | A) = P(A, B) / P(A)    Rewriting: P(A, B) = P(A) P(B | A)
• More variables:
  P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
• The Chain Rule in general:
  P(x1, x2, x3, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … P(xn | x1, …, xn−1)
The Chain Rule applied to compute joint probability of words in sentence
P(“its water is so transparent”) =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
How to estimate these probabilities
• Could we just count and divide?
  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)
• No! Too many possible sentences!
• We’ll never see enough data for estimating these
Markov Assumption
• Simplifying assumption:
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
(Photo: Andrei Markov)
Markov Assumption
• In other words, we approximate each component in the product
P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)

P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
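As a rough illustration of the Markov assumption (not from the original slides), the sketch below scores a sentence by conditioning each word on only the previous k words; conditional_prob is a hypothetical placeholder for whatever estimates a trained model would provide.

```python
# Sketch: scoring a sentence under a k-th order Markov assumption.
# `conditional_prob` is a hypothetical stand-in for a trained model's
# estimate of P(word | context); here it is just uniform over a toy vocabulary.

VOCAB = ["its", "water", "is", "so", "transparent", "that", "the"]

def conditional_prob(word, context):
    # Placeholder: a real model would look up an estimated probability here.
    return 1.0 / len(VOCAB)

def markov_sentence_prob(words, k=2):
    """P(w1 ... wn) ~= product over i of P(wi | wi-k ... wi-1)."""
    prob = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - k):i])  # keep only the last k words
        prob *= conditional_prob(word, context)
    return prob

print(markov_sentence_prob("its water is so transparent that".split(), k=2))
```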
Simplest case: Unigram model
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Some automatically generated sentences from a unigram model
P(w1 w2 … wn) ≈ ∏i P(wi)
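A minimal sketch of how strings like the ones above can be generated: sample each word independently from a unigram distribution. The relative frequencies here are made up for illustration; a real model would estimate them from corpus counts.

```python
import random

# Hypothetical unigram relative frequencies over a toy vocabulary; a real
# model would estimate these from corpus counts (count(w) / N).
unigram_probs = {"the": 0.30, "of": 0.20, "a": 0.15, "inflation": 0.10,
                 "dollars": 0.10, "said": 0.10, "quarter": 0.05}

def generate_unigram(length=10):
    # Each word is sampled independently; no context is used at all,
    # which is why unigram "sentences" look like word salad.
    words = random.choices(list(unigram_probs),
                           weights=list(unigram_probs.values()),
                           k=length)
    return " ".join(words)

print(generate_unigram())
```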
Condition on the previous word:
Bigram model
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”
• But we can often get away with N-gram models
Estimating N-gram Probabilities
Language Modeling
Estimating bigram probabilities
• Relative frequency estimation
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
             = c(wi−1, wi) / c(wi−1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
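A minimal sketch of relative frequency estimation on the three-sentence toy corpus above (the function and variable names are my own, not from the slides):

```python
from collections import Counter

# The toy corpus from the slide, with sentence-boundary markers.
corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(w, prev):
    """Relative frequency estimate: P(w | prev) = c(prev, w) / c(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("I", "<s>"))   # 2/3: "I" follows "<s>" in two of three sentences
print(bigram_prob("Sam", "am"))  # 1/2: "am" occurs twice, once followed by "Sam"
```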
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences
Raw bigram probabilities
• Normalize by unigrams:
• Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
  P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
= .000031
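Products of many probabilities get tiny very quickly, so in practice log probabilities are summed instead of multiplied. A minimal sketch below; only P(i | <s>) = .25 and P(english | want) = .0011 come from the following slide, and the remaining bigram values are hypothetical placeholders rather than the Berkeley Restaurant estimates.

```python
import math

# Hypothetical bigram probabilities: P(I|<s>) = .25 and P(english|want) = .0011
# appear on the next slide; the remaining values are made-up placeholders.
bigram_probs = {("<s>", "I"): 0.25,
                ("I", "want"): 0.30,
                ("want", "english"): 0.0011,
                ("english", "food"): 0.50,
                ("food", "</s>"): 0.60}

def sentence_logprob(words):
    """Sum log P(wi | wi-1); adding logs avoids numerical underflow
    when many small probabilities are multiplied together."""
    return sum(math.log(bigram_probs[(prev, w)])
               for prev, w in zip(words, words[1:]))

lp = sentence_logprob(["<s>", "I", "want", "english", "food", "</s>"])
print(lp, math.exp(lp))
```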
What kinds of knowledge?
• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25
Google N-Gram Release, August 2006
…
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
N-grams as language models for computational psycholinguistics
• Advantages
  • Relatively easy to calculate
  • Do a surprisingly good job (given how simple they are) at predicting empirical behavior (as we’ll see later)
• Disadvantages
  • Can’t capture long-distance dependencies
  • Don’t represent any underlying linguistic structure, e.g. syntax
38
Grammars
• A grammar is a structured set of production rules
• Most commonly used for syntactic description, but also used in semantics, phonology, etc.
  • e.g. Context-Free Grammars/Phrase Structure Rules
• A grammar licenses a derivation if all the derivation’s rules are present in the grammar
39
Context-Free Grammars (CFGs)
• Formally, a Context-Free Grammar (CFG) consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; N → dog; etc.
40
Context-Free Grammars (CFGs)
• A CFG derivation starts with the start symbol (e.g. S) and recursively expands the non-terminal categories using rules in the grammar
• The resulting tree is called the derivation tree
41
CFG example
S → NP VP
NP → Det N
NP → NP PP
PP → P NP
VP → V
Det → the
N → dog
N → cat
P → near
V → growled
Here is a derivation and the resulting derivation tree:
[derivation tree not shown]
42
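A minimal sketch (my own encoding, not from the slides) of how a derivation with this grammar proceeds: start from S and recursively expand non-terminals using any rule the grammar licenses.

```python
import random

# The example CFG, encoded as non-terminal -> list of possible right-hand sides.
cfg = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["NP", "PP"]],
    "PP":  [["P", "NP"]],
    "VP":  [["V"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "P":   [["near"]],
    "V":   [["growled"]],
}

def derive(symbol="S"):
    """Recursively expand `symbol`; anything without a rule is a terminal."""
    if symbol not in cfg:
        return [symbol]
    expansion = random.choice(cfg[symbol])  # any rule in the grammar is licensed
    return [word for child in expansion for word in derive(child)]

print(" ".join(derive()))  # e.g. "the dog near the cat growled"
```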
Context-Free Grammars (CFGs)
43
• CFGs can tell us which trees and sentences are/are not licensed by the grammar
• But they don’t tell us anything about which trees and sentences are more probable
• So we augment them with probabilities
  → Probabilistic Context-Free Grammars (PCFGs)
Probabilistic Context-Free Grammars (PCFGs)
• Formally, a PCFG consists of:
  • a set of non-terminal symbols (e.g. S, NP, VP, N, V, etc.)
    • i.e. symbols from which further derivation will occur
    • These represent phrasal or lexical categories
  • a set of terminal symbols (e.g. the, dog, chase, etc.)
    • i.e. symbols from which no further derivation will occur
    • These represent lexical items
  • a start symbol (e.g. S)
    • i.e. the non-terminal symbol that starts every tree
  • a set of rules of the form X → Y1 Y2 … Yn
    • where X is a non-terminal and the Yi are either non-terminal or terminal symbols
    • e.g. S → NP VP; NP → Det N; etc.
  • probabilities for each rule such that, for each non-terminal X, the sum of the probabilities of all rules with X on the left-hand side = 1, i.e.:
    ∑(X → Y1…Yn ∈ Rules) P(X → Y1 … Yn) = 1   (for each non-terminal X)
44
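A minimal sketch (hypothetical encoding) of this constraint, using the example grammar from the next slide: store each rule with its probability and check that the probabilities of each non-terminal's rules sum to 1.

```python
# Each non-terminal maps to a list of (right-hand side, probability) pairs,
# using the example grammar from the next slide.
pcfg = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.8), (("NP", "PP"), 0.2)],
    "PP":  [(("P", "NP"), 1.0)],
    "VP":  [(("V",), 1.0)],
    "Det": [(("the",), 1.0)],
    "N":   [(("dog",), 0.5), (("cat",), 0.5)],
    "P":   [(("near",), 1.0)],
    "V":   [(("growled",), 1.0)],
}

# The defining constraint: for each non-terminal X, its rule probabilities sum to 1.
for lhs, rules in pcfg.items():
    total = sum(p for _, p in rules)
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}, not 1"
print("every non-terminal's rule probabilities sum to 1")
```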
Example PCFG
1     S → NP VP
0.8   NP → Det N
0.2   NP → NP PP
1     PP → P NP
1     VP → V
1     Det → the
0.5   N → dog
0.5   N → cat
1     P → near
1     V → growled
[derivation tree for “the dog near the cat growled” not shown]
P(T) = P(S → NP VP) × P(NP → NP PP) × P(NP → Det N) × P(Det → the) × P(N → dog) × P(PP → P NP) × P(P → near) × P(NP → Det N) × P(Det → the) × P(N → cat) × P(VP → V) × P(V → growled)
     = 1 × 0.2 × 0.8 × 1 × 0.5 × 1 × 1 × 0.8 × 1 × 0.5 × 1 × 1
     = 0.032
45
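A minimal sketch of the computation above: multiply the probabilities of the rules used in the derivation (the rule list and probabilities are copied from the slide).

```python
import math

# Probabilities of the rules used in the derivation of
# "the dog near the cat growled", in the order they apply (from the slide).
rules_used = [
    ("S -> NP VP",   1.0),
    ("NP -> NP PP",  0.2),
    ("NP -> Det N",  0.8),
    ("Det -> the",   1.0),
    ("N -> dog",     0.5),
    ("PP -> P NP",   1.0),
    ("P -> near",    1.0),
    ("NP -> Det N",  0.8),
    ("Det -> the",   1.0),
    ("N -> cat",     0.5),
    ("VP -> V",      1.0),
    ("V -> growled", 1.0),
]

p_tree = math.prod(p for _, p in rules_used)
print(p_tree)  # 0.032 (modulo floating-point rounding)
```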
Estimating PCFG probabilities
• Relative frequency estimation
  P(LHS → RHS) = count(LHS → RHS) / count(LHS)
• We need a syntactically annotated dataset, aka a Treebank
• Fortunately, these exist for English and various other languages
• But constructing these datasets is much more difficult/time-consuming than simply collecting a corpus of unannotated text
48
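A minimal sketch with made-up treebank rule counts (the counts are hypothetical, purely for illustration): divide each rule's count by the total count of its left-hand side.

```python
from collections import Counter

# Hypothetical rule counts extracted from a toy treebank (made up for
# illustration; real treebank counts would be far larger).
rule_counts = Counter({
    ("NP", ("Det", "N")): 800,
    ("NP", ("NP", "PP")): 200,
    ("VP", ("V", "NP")):  600,
    ("VP", ("V",)):       400,
})

# count(LHS) is the sum of the counts of all rules expanding that LHS.
lhs_counts = Counter()
for (lhs, _rhs), c in rule_counts.items():
    lhs_counts[lhs] += c

# P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
rule_probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

print(rule_probs[("NP", ("Det", "N"))])  # 0.8 = 800 / (800 + 200)
```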
An example
49
Recurrent Neural Networks (RNNs)
• A type of connectionist model
• Rely on emergent representations
  • Unlike symbolic models (including n-grams and PCFGs), where we define the symbols and the rules,
  • RNNs are trained on huge amounts of data and develop their own representations that maximize their fit to the training data
• Today’s state-of-the-art language models
  • Used in machine translation, speech-to-text, etc.
50
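A minimal sketch (NumPy, random untrained weights, toy vocabulary and dimensions of my own choosing) of the core computation in a simple recurrent-network language model: the hidden state is updated word by word, and next-word probabilities are read out with a softmax. Real RNN language models are trained on large corpora and are far larger; this only shows the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dog", "cat", "growled", "near", "</s>"]
V, H = len(vocab), 16            # vocabulary size, hidden-layer size

# Untrained (random) parameters; training on a corpus would adjust these.
E   = rng.normal(scale=0.1, size=(V, H))   # word embeddings
W_h = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_o = rng.normal(scale=0.1, size=(H, V))   # hidden-to-output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_word_probs(words):
    """Run a simple (Elman-style) recurrent network over `words` and return a
    probability distribution over which word comes next."""
    h = np.zeros(H)
    for w in words:
        h = np.tanh(E[vocab.index(w)] + W_h @ h)   # update the hidden state
    return softmax(h @ W_o)                        # distribution over the vocabulary

probs = next_word_probs(["the", "dog"])
print(dict(zip(vocab, probs.round(3))))
```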
Recurrent Neural Networks (RNNs)
• They’re a black box: we don’t understand their internal representations or why they work
  • Harder to use them for computational psycholinguistics
• If we understood them better, maybe they would tell us something about human language processing
  • We’ll return to this in the last week of the course
51
So far: Language models
• n-grams
• Probabilistic Context-Free Grammars (PCFGs)
• (Recurrent Neural Networks; RNNs)
How can we use these models to investigate:
• How do comprehenders rapidly disambiguate ambiguous sentences?
• What makes words and sentences easier/more difficult to process?
52