Probabilistic Context Free Grammars, Chris Brew, Ohio State University (684.02, 22/02/2016)
684.02 05/05/23 1
Probabilistic Context Free Grammars
Chris Brew
Ohio State University
Context Free Grammars
HMMs are sophisticated tools for language modelling based on finite state machines.
Context-free grammars go beyond FSMs
They can encode longer range dependencies than FSMs
They too can be made probabilistic
An example
s -> np vp
s -> np vp pp
np -> det n
np -> np pp
vp -> v np
pp -> p np
n -> girl
n -> boy
n -> park
n -> telescope
v -> saw
p -> with
p -> in
Sample sentence: “The boy saw the girl in the park with the telescope”
Multiple analyses
2 of the 5 are:
[S [NP [DET the] [N boy]] [VP [V saw] [NP [NP [DET the] [N girl]] [PP [P in] [NP [DET the] [N park]]]]] [PP [P with] [NP [DET the] [N telescope]]]]
[S [NP [DET the] [N boy]] [VP [V saw] [NP [NP [DET the] [N girl]] [PP [P in] [NP [NP [DET the] [N park]] [PP [P with] [NP [DET the] [N telescope]]]]]]]]
How serious is this ambiguity?
Very serious: ambiguities in different places multiply
Easy to get millions of analyses for simple-seeming sentences
Maybe we can use probabilities to disambiguate, just as we chose from exponentially many paths through an FSM
Fortunately, similar techniques apply
Probabilistic Context Free Grammars
Same as context free grammars, with one extension
– Where there is a choice of productions for a non-terminal, give each alternative a probability.
– For each choice point, sum of probabilities of available options is 1
– i.e. Production probability is p(rhs|lhs)
An example
s -> np vp : 0.8
s -> np vp pp : 0.2
np -> det n : 0.5
np -> np pp : 0.5
vp -> v np : 1.0
pp -> p np : 1.0
n -> girl : 0.25
n -> boy : 0.25
n -> park : 0.25
n -> telescope : 0.25
v -> saw : 1.0
p -> with : 0.5
p -> in : 0.5
Sample sentence: “The boy saw the girl in the park with the telescope”
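The constraint that the options at each choice point sum to 1 can be checked mechanically. The following is a minimal sketch: the list-of-triples encoding and the name `RULES` are illustrative choices, not part of the slides, and only the rules listed on the slide are included.

```python
import math
from collections import defaultdict

# (lhs, rhs, probability) triples copied from the example grammar.
RULES = [
    ("s", "np vp", 0.8), ("s", "np vp pp", 0.2),
    ("np", "det n", 0.5), ("np", "np pp", 0.5),
    ("vp", "v np", 1.0), ("pp", "p np", 1.0),
    ("n", "girl", 0.25), ("n", "boy", 0.25),
    ("n", "park", 0.25), ("n", "telescope", 0.25),
    ("v", "saw", 1.0), ("p", "with", 0.5), ("p", "in", 0.5),
]

# Sum p(rhs|lhs) over the alternatives for each left-hand side.
totals = defaultdict(float)
for lhs, rhs, prob in RULES:
    totals[lhs] += prob

# Every choice point should carry total probability 1.
for lhs, total in sorted(totals.items()):
    print(lhs, total, math.isclose(total, 1.0))
```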
The “low” attachment
[S [NP [DET the] [N boy]] [VP [V saw] [NP [NP [DET the] [N girl]] [PP [P in] [NP [NP [DET the] [N park]] [PP [P with] [NP [DET the] [N telescope]]]]]]]]
p(“np vp”|s) * p(“det n”|np) * p(“the”|det) * p(“boy”|n) * p(“v np”|vp) * p(“det n”|np) * p(“the”|det) * ...
The “high” attachment
p(“np vp pp”|s) * p(“det n”|np) * p(“the”|det) * p(“boy”|n) * p(“v np”|vp) * p(“det n”|np) * p(“the”|det) * ...
[S [NP [DET the] [N boy]] [VP [V saw] [NP [NP [DET the] [N girl]] [PP [P in] [NP [DET the] [N park]]]]] [PP [P with] [NP [DET the] [N telescope]]]]
Note: I’m not claiming that this matches any particular set of psycholinguistic claims, only that the formalism allows such distinctions to be made.
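The two products can be evaluated numerically. This is a sketch under stated assumptions: trees are encoded as nested tuples (an illustrative choice), and the rule det -> the is given probability 1.0, which the slides use in the trees but do not list in the grammar. The helpers `np_` and `pp_` are hypothetical conveniences for building the trees.

```python
# Rule probabilities p(rhs|lhs) from the example grammar.
RULES = {
    ("s", ("np", "vp")): 0.8, ("s", ("np", "vp", "pp")): 0.2,
    ("np", ("det", "n")): 0.5, ("np", ("np", "pp")): 0.5,
    ("vp", ("v", "np")): 1.0, ("pp", ("p", "np")): 1.0,
    ("det", ("the",)): 1.0,  # ASSUMPTION: not listed on the slides
    ("n", ("girl",)): 0.25, ("n", ("boy",)): 0.25,
    ("n", ("park",)): 0.25, ("n", ("telescope",)): 0.25,
    ("v", ("saw",)): 1.0, ("p", ("with",)): 0.5, ("p", ("in",)): 0.5,
}

def tree_prob(tree):
    """Multiply the probabilities of all rules used in a (label, child, ...) tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES[label, rhs]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob

def np_(noun):
    return ("np", ("det", "the"), ("n", noun))

def pp_(prep, noun_phrase):
    return ("pp", ("p", prep), noun_phrase)

# "Low": the "with" PP attaches inside the NP headed by "park".
low = ("s", np_("boy"),
       ("vp", ("v", "saw"),
        ("np", np_("girl"),
         pp_("in", ("np", np_("park"), pp_("with", np_("telescope")))))))

# "High": the "with" PP attaches directly under s.
high = ("s", np_("boy"),
        ("vp", ("v", "saw"), ("np", np_("girl"), pp_("in", np_("park")))),
        pp_("with", np_("telescope")))

print(tree_prob(low), tree_prob(high))
```

Under these numbers the low attachment comes out twice as probable as the high one; as the note above says, this reflects the toy probabilities, not any psycholinguistic claim.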
Generating from Probabilistic Context Free Grammars
Start with the distinguished symbol “s”
Choose a way of expanding “s”
– This introduces new non-terminals (eg. “np” “vp”)
Choose ways of expanding these
Carry on until no more non-terminals
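The generation procedure can be sketched as follows. Assumptions are labelled in the code: det -> the with probability 1.0 is added so that the grammar can produce the sample sentence, and the `max_words` cap (with resampling when it is exceeded) is a practical guard that is not part of the slides. The cap slightly distorts the distribution, since very long sentences are never returned.

```python
import random

# lhs -> list of (rhs, probability); symbols absent from the keys are words.
PCFG = {
    "s": [(("np", "vp"), 0.8), (("np", "vp", "pp"), 0.2)],
    "np": [(("det", "n"), 0.5), (("np", "pp"), 0.5)],
    "vp": [(("v", "np"), 1.0)],
    "pp": [(("p", "np"), 1.0)],
    "det": [(("the",), 1.0)],  # ASSUMPTION: not listed on the slides
    "n": [(("girl",), 0.25), (("boy",), 0.25),
          (("park",), 0.25), (("telescope",), 0.25)],
    "v": [(("saw",), 1.0)],
    "p": [(("with",), 0.5), (("in",), 0.5)],
}

def generate(rng, start="s", max_words=200):
    """Expand non-terminals left to right until only words remain."""
    while True:  # resample if one attempt grows past max_words
        words, stack = [], [start]
        while stack and len(words) + len(stack) <= max_words:
            symbol = stack.pop()
            if symbol not in PCFG:  # a word: emit it
                words.append(symbol)
                continue
            options = PCFG[symbol]
            rhs = rng.choices([r for r, _ in options],
                              weights=[p for _, p in options])[0]
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        if not stack:
            return words

rng = random.Random(0)
for _ in range(3):
    print(" ".join(generate(rng)))
```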
Issues
The space of possible trees is infinite.
– But the sum of probabilities for all trees is 1 (provided the grammar is consistent, i.e. no probability mass is lost to derivations that never terminate)
There is a strong assumption built in to the model
– Expansion probability is independent of position of non-terminal within tree
– This assumption is questionable.
Training for Probabilistic Context Free Grammars
Supervised: you have a treebank
Unsupervised: you have only words
In between: Pereira and Schabes
Supervised Training
Look at the trees in your corpus
Count the number of times each lhs -> rhs occurs
Divide these counts by number of times each lhs occurs
Maybe smooth as described in the lecture on probability estimation from counts
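The counting recipe above can be sketched in a few lines. The two-tree treebank here is invented purely for illustration, trees are nested tuples (an assumed encoding), and no smoothing is applied.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for every internal node of a (label, child, ...) tree."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {(lhs, rhs): count / lhs_counts[lhs]
            for (lhs, rhs), count in rule_counts.items()}

def np_(noun):  # hypothetical helper for building toy trees
    return ("np", ("det", "the"), ("n", noun))

# HYPOTHETICAL toy treebank, not from any real corpus.
TREEBANK = [
    ("s", np_("boy"), ("vp", ("v", "saw"), np_("girl"))),
    ("s", np_("girl"),
     ("vp", ("v", "saw"),
      ("np", np_("boy"), ("pp", ("p", "in"), np_("park"))))),
]

probs = estimate(TREEBANK)
for (lhs, rhs), p in sorted(probs.items()):
    print(lhs, "->", " ".join(rhs), round(p, 3))
```

In this toy treebank np expands six times, five times as det n and once as np pp, so the estimates are 5/6 and 1/6.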
Unsupervised Training
These are Rabiner’s problems, but for PCFGs
– Calculate the probability of a corpus given a model
– Guess the sequence of states passed through
– Adapt the model to the corpus
Hidden Trees
All you see is the output:
– “The boy saw the girl in the park”
But you can’t tell which of several trees led to that sentence
Each tree may have a different probability.
Trees which use the same rules the same number of times must, however, have the same probability.
Don’t know which state you are in.
The three problems
Probability estimation
– Given a sequence of observations O and a grammar G, find P(O|G)
Best tree estimation
– Given a sequence of observations O and a grammar G, find a Tree which maximizes P(O,Tree|G).
The third problem
Training
– Adjust the model parameters so that P(O|G) is as large as possible for given O. Hard problem because there are so many adjustable parameters which could vary. Worse than for HMMs. More local maxima.
Probability estimation
$$P(O \mid G) = \sum_{\mathrm{Tree}} P(O, \mathrm{Tree} \mid G)$$
Easy in principle. Marginalize out the trees, leaving probability of strings.
But this involves sum over exponentially many trees.
Efficient algorithm keeps track of inside and outside probabilities.
Inside Probability
The probability that non-terminal NT expands to the words between i and j
[Figure: example string “... i SENT A LETTER j ...” with node labels RR, NP, NP above it]
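Outside probability
Dual of inside probability: the probability of generating all the words outside the span between i and j, leaving a hole for non-terminal NT.
[Figure: an NP node over the span “... i SENT A LETTER j ...”, with surrounding words such as “A MAN” outside it]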
Corpus probability
Inside probability of S node and entire string is probability of all ways of making sentences over that string
Product over all strings in corpus is corpus probability
Can also get corpus probability from outside probabilities
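These statements can be written out as equations. The notation is an assumption, not taken from the slides: $\beta_A(i,j)$ is the inside probability of non-terminal $A$ over the span from position $i$ to position $j$, $\alpha_A(i,j)$ is the corresponding outside probability, positions are numbered $1 \ldots n+1$ around the $n$ words $w_1 \ldots w_n$, and the grammar is assumed to contain only binary rules $A \to B\,C$ and lexical rules $A \to w$.

$$\beta_A(i,i+1) = p(w_i \mid A)$$

$$\beta_A(i,j) = \sum_{B,C} \; \sum_{k=i+1}^{j-1} p(B\,C \mid A)\, \beta_B(i,k)\, \beta_C(k,j)$$

$$P(O \mid G) = \beta_s(1,n+1), \qquad P(\mathrm{corpus} \mid G) = \prod_{O \in \mathrm{corpus}} P(O \mid G)$$

$$P(O \mid G) = \sum_{A} \alpha_A(i,i+1)\, \beta_A(i,i+1) \quad \text{for any fixed word position } i$$

The last line is one way to read "from outside probabilities": under the assumed rule shapes, each word is covered by exactly one width-one constituent, so summing over its possible labels recovers the string probability.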
Training
Uses inside and outside probabilities
Starts from an initial guess
Improves the initial guess using data
Stops at a (locally) best model
Specialization of the EM algorithm
Expected rule counts
Consider p(uses rule lhs -> rhs to cover i through j)
Four things need to happen
– Generate outside words leaving hole for lhs
– Choose correct rhs
– Generate words seen between i and k from first item in rhs (inside probability)
– Generate words seen between k and j using other items in rhs (more inside probabilities)
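For a binary rule $A \to B\,C$ the four factors combine as below. The symbols are assumed notation rather than the slides' own: $\alpha$ is the outside probability, $\beta$ the inside probability, and $k$ the split point between the two right-hand-side items.

$$P(A \to B\,C \text{ covers } i..j \mid O, G) = \frac{\alpha_A(i,j)\; p(B\,C \mid A) \sum_{k=i+1}^{j-1} \beta_B(i,k)\, \beta_C(k,j)}{P(O \mid G)}$$

Summing this quantity over all spans $i < j$ gives the expected number of times the rule is used in analysing $O$; dividing by the corresponding total for all rules with the same lhs gives a re-estimated rule probability.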
Refinements
In practice there are very many local maxima, so strategies which involve generating hundreds of thousands of rules may fail badly.
Pereira and Schabes discovered that giving the system limited information about bracketing is enough to guide it to correct answers
Different grammar formalisms (TAGs, Categorial Grammars...)
A basic parsing algorithm
The simplest statistical parsing algorithm is called CYK or CKY.
It is a statistical variant of a bottom-up tabular parsing algorithm that you should have seen in 684.01
It (somewhat surprisingly) turns out to be closely related to the problem of multiplying matrices.
Basic CKY (review)
Assume we have organized the lexicon as a function
lexicon: string -> nonterminal set
Organize these nonterminals into the relevant parts of a two dimensional array indexed by left and right end of the item
for i = 1 to length(sentence) do
  chart[i,i+1] = lexicon(sentence[i])
endfor
Basic CKY
Assume we have organized the grammar as a function
grammar: nonterminal -> nonterminal -> nonterminal set
Basic CKY
Build up new entries from existing entries, working from shorter entries to longer ones
for l = 2 to length(sentence) do // l is length of constituent
  for s = 1 to length(sentence) - l + 1 do // s is start of rhs1
    chart[s,s+l] = empty
    for t = 1 to l-1 do // t is length of rhs1
      (left,mid,right) = (s,s+t,s+l)
      // accumulate over all split points rather than overwriting
      chart[left,right] = union chart[left,right] (combine(chart[left,mid],chart[mid,right]))
    endfor
  endfor
endfor
Basic CKY
Combine is
fun combine(set1,set2)
  result = empty
  for item1 in set1 do
    for item2 in set2 do
      result = union result (grammar item1 item2)
    endfor
  endfor
  return result
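The pseudocode on the last few slides can be put together as a small runnable recognizer. This is a sketch, not the course's reference implementation: it uses 0-based positions instead of the slides' 1-based ones, takes only the binary rules of the earlier example grammar (the ternary rule s -> np vp pp is left out because the grammar function here is binary), and assumes det -> the in the lexicon.

```python
# lexicon: word -> set of non-terminals
LEXICON = {
    "the": {"det"},  # ASSUMPTION: not listed on the slides
    "girl": {"n"}, "boy": {"n"}, "park": {"n"}, "telescope": {"n"},
    "saw": {"v"}, "with": {"p"}, "in": {"p"},
}

# grammar: (left non-terminal, right non-terminal) -> set of parents
GRAMMAR = {
    ("np", "vp"): {"s"},
    ("det", "n"): {"np"},
    ("np", "pp"): {"np"},
    ("v", "np"): {"vp"},
    ("p", "np"): {"pp"},
}

def combine(set1, set2):
    """All labels that can be built from a left item and a right item."""
    result = set()
    for item1 in set1:
        for item2 in set2:
            result |= GRAMMAR.get((item1, item2), set())
    return result

def cky(words):
    """Fill chart[left, right] with the labels that can span words[left:right]."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[i, i + 1] = set(LEXICON.get(word, ()))
    for length in range(2, n + 1):           # length of constituent
        for left in range(n - length + 1):   # start of constituent
            right = left + length
            cell = set()
            for mid in range(left + 1, right):  # every split point
                cell |= combine(chart[left, mid], chart[mid, right])
            chart[left, right] = cell
    return chart

sentence = "the boy saw the girl in the park".split()
print("s" in cky(sentence)[0, len(sentence)])
```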
Going statistical
The basic algorithm tracks labels for each substring of the input
The cell contents are sets of labels
A statistical version keeps track of labels and their probabilities
Now the cell contents must be weighted sets
Going statistical
Make the grammar and lexicon produce weighted sets.
lexicon: word -> real*nt set
grammar: real*nt -> real*nt -> real*nt set
We now need an operation corresponding to set union for weighted sets.
{s:0.1,np:0.2} WU {s:0.2,np:0.1} = ???
Going statistical (one way)
{s:0.1,np:0.2} WU {s:0.2,np:0.1} = {s:0.3,np:0.3}
If we implement this, we get a parser that calculates the inside probability for each label on each span.
Going statistical (another way)
{s:0.1,np:0.2} WU {s:0.2,np:0.1} = {s:0.2,np:0.2}
If we implement this, we get a parser that calculates the best parse probability for each label on each span.
The difference is that in one case we are combining weights with +, while in the second we use max
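Both variants can be obtained from one chart parser by passing in the operation used to merge weights. The sketch below makes several assumptions that are not in the slides: weighted sets are plain dicts from label to weight, det -> the has probability 1.0, and the ternary rule s -> np vp pp is binarized through a helper symbol `vp_pp` (s -> np vp_pp : 0.2, vp_pp -> vp pp : 1.0), which leaves tree probabilities unchanged.

```python
import operator

# (left, right) -> {parent: rule probability}
BINARY = {
    ("np", "vp"): {"s": 0.8},
    ("np", "vp_pp"): {"s": 0.2},   # ASSUMPTION: binarized s -> np vp pp
    ("vp", "pp"): {"vp_pp": 1.0},  # helper symbol, not in the slides
    ("det", "n"): {"np": 0.5},
    ("np", "pp"): {"np": 0.5},
    ("v", "np"): {"vp": 1.0},
    ("p", "np"): {"pp": 1.0},
}
LEXICON = {
    "the": {"det": 1.0},  # ASSUMPTION: not listed on the slides
    "girl": {"n": 0.25}, "boy": {"n": 0.25},
    "park": {"n": 0.25}, "telescope": {"n": 0.25},
    "saw": {"v": 1.0}, "with": {"p": 0.5}, "in": {"p": 0.5},
}

def weighted_union(a, b, plus):
    """Merge two weighted sets, combining weights of shared labels with plus."""
    out = dict(a)
    for label, weight in b.items():
        out[label] = plus(out[label], weight) if label in out else weight
    return out

def weighted_cky(words, plus):
    """Return the weighted set for the whole input; plus is operator.add or max."""
    n = len(words)
    chart = {(i, i + 1): dict(LEXICON[w]) for i, w in enumerate(words)}
    for length in range(2, n + 1):
        for left in range(n - length + 1):
            right = left + length
            cell = {}
            for mid in range(left + 1, right):
                for l_lab, l_w in chart[left, mid].items():
                    for r_lab, r_w in chart[mid, right].items():
                        for parent, p in BINARY.get((l_lab, r_lab), {}).items():
                            cell = weighted_union(cell, {parent: p * l_w * r_w}, plus)
            chart[left, right] = cell
    return chart[0, n]

words = "the boy saw the girl in the park with the telescope".split()
inside = weighted_cky(words, operator.add)["s"]  # sum over all parses
best = weighted_cky(words, max)["s"]             # probability of the best parse
print(inside, best)
```

Swapping `operator.add` for `max` is the only change between the two runs, which is the point made on the slide.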
Building trees
Make the cell contents be sets of trees
Make the lexicon be a function from words to little trees
Make the grammar be a function from pairs of trees to sets of newly created (bigger) trees
Set union is now over sets of trees
Nothing else needs to change
Building weighted trees
Make the cell contents be sets of trees, labelled with probabilities
Make the lexicon be a function from words to weighted (little) trees
Make the grammar be a function from pairs of weighted trees to sets of newly created (bigger) trees
Set union is now over sets of weighted trees
Again we have a choice of + or max, to get either the parse forest or just the best parse
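The max variant with trees can be sketched by storing, for each label in a cell, the best probability together with the tree that achieves it. As before, the dict encoding, the rule det -> the : 1.0 and the helper symbol `vp_pp` used to binarize s -> np vp pp are assumptions for illustration. Ties are broken in favour of the first tree found.

```python
# (left, right) -> {parent: rule probability}
BINARY = {
    ("np", "vp"): {"s": 0.8},
    ("np", "vp_pp"): {"s": 0.2},   # ASSUMPTION: binarized s -> np vp pp
    ("vp", "pp"): {"vp_pp": 1.0},  # helper symbol, not in the slides
    ("det", "n"): {"np": 0.5},
    ("np", "pp"): {"np": 0.5},
    ("v", "np"): {"vp": 1.0},
    ("p", "np"): {"pp": 1.0},
}
LEXICON = {
    "the": {"det": 1.0},  # ASSUMPTION: not listed on the slides
    "girl": {"n": 0.25}, "boy": {"n": 0.25},
    "park": {"n": 0.25}, "telescope": {"n": 0.25},
    "saw": {"v": 1.0}, "with": {"p": 0.5}, "in": {"p": 0.5},
}

def best_parse(words):
    """Return (probability, tree) of the best s over the input, or None."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        # Little weighted trees: (probability, (label, word))
        chart[i, i + 1] = {lab: (p, (lab, word))
                           for lab, p in LEXICON[word].items()}
    for length in range(2, n + 1):
        for left in range(n - length + 1):
            right = left + length
            cell = {}
            for mid in range(left + 1, right):
                for l_lab, (l_p, l_tree) in chart[left, mid].items():
                    for r_lab, (r_p, r_tree) in chart[mid, right].items():
                        for parent, p in BINARY.get((l_lab, r_lab), {}).items():
                            cand = p * l_p * r_p
                            # max: keep only the best tree per label
                            if parent not in cell or cand > cell[parent][0]:
                                cell[parent] = (cand, (parent, l_tree, r_tree))
            chart[left, right] = cell
    return chart[0, n].get("s")

prob, tree = best_parse("the boy saw the girl with the telescope".split())
print(prob)
print(tree)
```

With the example probabilities the best parse of this shorter sentence attaches the PP inside the object NP; keeping every candidate instead of only the maximum would give the full set of weighted trees.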
Where to get more information
Roark and Sproat ch 7
Charniak chapters 5 and 6
Allen Natural Language Understanding ch 7
Lisp code associated with Natural Language Understanding
Goodman: Semiring parsing (http://www.aclweb.org/anthology/J99-1004)