CS60057: Speech & Natural Language Processing
Autumn 2007
Lecture 16
5 September 2007
Parsing with features
We need to constrain the rules in CFGs: for example, to coerce agreement within and between constituents, to pass features around, and to enforce subcategorisation constraints.
Features can easily be added to our grammars, and later we'll see that feature bundles can completely replace constituents.
Parsing with features
Rules can stipulate values, or placeholders (variables) for values. Features can be used within the rule, or passed up via the mother nodes.
Example: subject-verb agreement
S → NP VP [only if NP and VP agree in number]: the number of the NP depends on the noun and/or determiner; the number of the VP depends on the verb.
S → NP(num=X) VP(num=X)
NP(num=X) → det(num=X) n(num=X)
VP(num=X) → v(num=X) NP(num=?)
Declarative nature of features
The rules can be used in various ways:
- to build an NP only if the det and n agree (bottom-up)
- when generating an NP, to choose an n which agrees with the det (if working left-to-right) (top-down)
- to show that the num value for an NP comes from its components (percolation)
- to ensure that the num value is correctly set when generating an NP (inheritance)
- to block ill-formed input
NP(num=X) → det(num=X) n(num=X)
this → det(num=sg)
these → det(num=pl)
the → det(num=?)
man → n(num=sg)
men → n(num=pl)
Example: det(num=sg) "this" + n(num=sg) "man" unify to give NP(num=sg); n(num=pl) "men" cannot combine with det(num=sg) "this".
Use of variables
Unbound (unassigned) variables (i.e. variables with a free value): "the" can combine with any value for num. Unification means that the num value for "the" is set to sg.
NP(num=X) → det(num=X) n(num=X)
this → det(num=sg)
these → det(num=pl)
the → det(num=?)
man → n(num=sg)
men → n(num=pl)
Example: det(num=?) "the" + n(num=sg) "man" unify to give NP(num=sg).
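As a concrete illustration of these rules and this lexicon, here is a minimal Python sketch (my own, not part of the lecture), where None plays the role of the unbound value num=? for "the":

def unify_num(a, b):
    """Return the unified num value, or 'FAIL' if the values clash."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else "FAIL"

LEXICON = {            # word -> (category, num); None = unspecified
    "this": ("det", "sg"), "these": ("det", "pl"), "the": ("det", None),
    "man": ("n", "sg"), "men": ("n", "pl"),
}

def build_np(det_word, n_word):
    """NP(num=X) -> det(num=X) n(num=X): succeed only if the num values unify."""
    _, det_num = LEXICON[det_word]
    _, n_num = LEXICON[n_word]
    num = unify_num(det_num, n_num)
    return None if num == "FAIL" else {"cat": "NP", "num": num}

print(build_np("this", "man"))   # {'cat': 'NP', 'num': 'sg'}
print(build_np("the", "men"))    # {'cat': 'NP', 'num': 'pl'}  ("the" is unspecified)
print(build_np("this", "men"))   # None: agreement failure blocks the NP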
Parsing with features
Features must be compatible. The formalism should allow features to remain unspecified. Feature mismatch can be used to block false analyses and to disambiguate, e.g. they can fish ~ he can fish ~ he cans fish.
The formalism may have attribute-value pairs, or rely on argument position, e.g.:
NP(_num,_sem) → det(_num) n(_num,_sem)
an = det(sing)
the = det(_num)
man = n(sing,hum)
Parsing with features
Using features to impose subcategorization constraints
VP → v           e.g. dance
VP → v NP        e.g. eat
VP → v NP NP     e.g. give
VP → v PP        e.g. wait (for)
VP(_num) → v(_num,intr)
VP(_num) → v(_num,trans) NP
VP(_num) → v(_num,ditrans) NP NP
VP(_num) → v(_num,prepobj(_case)) PP(_case)
PP(_case) → prep(_case) NP
dance = v(plur,intr)
dances = v(sing,intr)
danced = v(_num,intr)
waits = v(sing,prepobj(for))
for = prep(for)
Parsing with features (top-down)
Grammar:
S → NP(_num) VP(_num)
NP(_num) → det(_num) n(_num)
VP(_num) → v(_num,intrans)
VP(_num) → v(_num,trans) NP(_1)
Parsing the man shot those elephants:
the = det(_num); man = n(sing), so _num is bound to sing and NP(sing) is built.
shot = v(sing,trans), so the intransitive VP rule is rejected and VP(sing) → v(sing,trans) NP(_1) is used.
those = det(pl); elephants = n(pl), so the object NP(_1) is built with _1 = pl.
The subject NP(sing) and VP(sing) agree (_num = sing), so the S is accepted.
Feature structures
Instead of attaching features to the symbols, we can parse with symbols made up entirely of attribute-value pairs: “feature structures”
They can be used in the same way as seen previously. Values can be atomic, or embedded feature structures:
[CAT NP, NUMBER SG, PERSON 3]
[ATTR1 VAL1, ATTR2 VAL2, ATTR3 VAL3]
[CAT NP, AGR [NUM SG, PERS 3]]
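A feature structure of this kind can be sketched as a nested Python dictionary; this is only an illustration of the attribute-value idea, not a full implementation:

flat = {"CAT": "NP", "NUMBER": "SG", "PERSON": 3}     # atomic values only

nested = {"CAT": "NP",                                 # an embedded feature
          "AGR": {"NUM": "SG", "PERS": 3}}             # structure as a value

# Feature paths index into the structure, e.g. AGR.NUM:
print(nested["AGR"]["NUM"])   # SG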
Unification
Probabilistic CFG
Feature Structures
A feature structure is a set of feature-value pairs in which no feature occurs in more than one pair (i.e. a partial function from features to values).
Circular structures are prohibited.
Structured Feature Structure
Part of a third-person singular NP: [CAT NP, AGR [NUM SG, PERS 3]]
Reentrant Feature Structure
Two features can share a feature structure as their value.
Not the same thing as them having equivalent values!
Two distinct feature structure values:
One shared value (reentrant feature structure):
They can be coindexed:
[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]
Parsing with feature structures
Grammar rules can specify assignments to or equations between feature structures
Expressed as “feature paths”
e.g. HEAD.AGR.NUM = SG
[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]
Subsumption (⊑)
A (partial) ordering of feature structures, based on relative specificity.
The second structure carries less information and is more general: it subsumes the first.
Subsumption
A more abstract (less specific) feature structure subsumes an equally or more specific one.
A feature structure F subsumes a feature structure G (F ⊑ G) if and only if:
- for every feature x in F, F(x) ⊑ G(x) (where F(x) means the value of the feature x in the feature structure F);
- for all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q).
An atomic feature structure neither subsumes nor is subsumed by another (different) atomic feature structure. Variables subsume all other feature structures. A feature structure F subsumes a feature structure G (F ⊑ G) if all parts of F subsume the corresponding parts of G.
Subsumption Example
Consider the following feature structures:
(1) [NUMBER SG]
(2) [PERSON 3]
(3) [NUMBER SG, PERSON 3]
(1) ⊑ (3) and (2) ⊑ (3), but there is no subsumption relation between (1) and (2).
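The first clause of the definition can be sketched for dictionary-encoded feature structures as below; this is a simplification (it ignores the reentrancy/path-equality clause and variables):

def subsumes(f, g):
    """True if feature structure f subsumes g (f is more general)."""
    if not isinstance(f, dict) or not isinstance(g, dict):
        return f == g                      # atomic values only subsume themselves
    return all(feat in g and subsumes(val, g[feat]) for feat, val in f.items())

fs1 = {"NUMBER": "SG"}
fs3 = {"NUMBER": "SG", "PERSON": 3}
print(subsumes(fs1, fs3))   # True:  (1) subsumes (3)
print(subsumes(fs3, fs1))   # False: (3) is more specific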
Feature Structures in The Grammar
We will incorporate the feature structures and the unification process as follows:
All constituents (non-terminals) will be associated with feature structures.
Sets of unification constraints will be associated with grammar rules, and these constraints must be satisfied for the rule to be applied.
These attachments accomplish the following goals:
To associate feature structures with both lexical items and instances of grammatical categories.
To guide the composition of feature structures for larger grammatical constituents based on the feature structures of their component parts.
To enforce compatibility constraints between specified parts of grammatical constructions.
Feature unification
Feature structures can be unified if:
They have like-named attributes with the same value:
[NUM SG] ⊔ [NUM SG] = [NUM SG]
Like-named attributes that are "open" get the value assigned:
[CAT NP, NUMBER ??, PERSON 3] ⊔ [NUMBER SG, PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]
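A rough sketch of this unification operation for dictionary-encoded feature structures, without reentrancy ("??" stands for an open value; the symbols and encoding are my own, not the lecture's):

def unify(f, g):
    """Return the unification of f and g, or 'FAIL' if they clash."""
    if f in ("??", None):
        return g
    if g in ("??", None):
        return f
    if not isinstance(f, dict) or not isinstance(g, dict):
        return f if f == g else "FAIL"     # atoms must match exactly
    result = dict(f)
    for feat, val in g.items():
        merged = unify(result.get(feat), val)
        if merged == "FAIL":
            return "FAIL"                  # clash on this feature
        result[feat] = merged
    return result

a = {"CAT": "NP", "NUMBER": "??", "PERSON": 3}
b = {"NUMBER": "SG", "PERSON": 3}
print(unify(a, b))                           # {'CAT': 'NP', 'NUMBER': 'SG', 'PERSON': 3}
print(unify({"NUM": "SG"}, {"NUM": "PL"}))   # FAIL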
Feature unification
Complementary features are brought together:
[CAT NP, NUMBER SG] ⊔ [PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]
Unification is recursive:
[CAT NP, AGR [NUM SG]] ⊔ [CAT NP, AGR [PERS 3]] = [CAT NP, AGR [NUM SG, PERS 3]]
Coindexed structures are identical (not just copies): assignment to one affects all.
Example
Rule:
[CAT NP, AGR _1 ⊔ _2, SEM _3] → [CAT DET, AGR _1] [CAT N, AGR _2, SEM _3]
Lexicon:
a: [CAT DET, AGR [VAL INDEF, NUM SG]]
the: [CAT DET, AGR [VAL DEF]]
man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]
the man
the: [CAT DET, AGR [VAL DEF]]
man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]
Applying the NP rule, _1 = [VAL DEF], _2 = [NUM SG], _3 = HUM, so:
the man: [CAT NP, AGR [VAL DEF, NUM SG], SEM HUM]
a man
a: [CAT DET, AGR [VAL INDEF, NUM SG]]
man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]
Applying the NP rule, _1 = [VAL INDEF, NUM SG], _2 = [NUM SG], _3 = HUM, and _1 ⊔ _2 = [VAL INDEF, NUM SG], so:
a man: [CAT NP, AGR [VAL INDEF, NUM SG], SEM HUM]
Types and inheritance
Feature typing allows us to constrain the possible values a feature can have, e.g. num = {sing, plur}. It allows grammars to be checked for consistency, and can make parsing easier.
We can express general "feature co-occurrence conditions" and "feature inheritance rules"; both allow us to make the grammar more compact.
Co-occurrence conditions and Inheritance rules
General rules (beyond simple unification) which apply automatically, and so do not need to be stated (and repeated) in each rule or lexical entry
Examples:
[cat=np] ⇒ [num=??, gen=??, case=??]
[cat=v, num=sg] ⇒ [tns=pres]
[attr1=val1] ⇒ [attr2=val2]
Inheritance rules
Inheritance rules can be over-ridden
e.g. [cat=n] ⇒ [gen=??, sem=??]
sex = {male, female}
gen = {masc, fem, neut}
[cat=n, gen=fem, sem=hum] ⇒ [sex=female]
uxor: [cat=n, gen=fem, sem=hum]
agricola: [cat=n, gen=fem, sem=hum, sex=male]
Unification in Linguistics
Lexical Functional Grammar (if interested, see the PARGRAM project)
GPSG, HPSG
Construction Grammar
Categorial Grammar
Unification
Unification joins the contents of two feature structures into one new structure (the union of the two originals).
The union is the most general feature structure subsumed by both.
The union of two contradictory feature structures is undefined (unification fails).
Unification Constraints
Each grammar rule will be associated with a set of unification constraints.
β0 → β1 … βn   {set of unification constraints}
Each unification constraint will be in one of the following forms:
<βi feature path> = Atomic value
<βi feature path> = <βj feature path>
Unification Constraints -- Example
For example, the rule
S → NP VP
(only if the number of the NP is equal to the number of the VP)
will be represented as follows:
S → NP VP
<NP NUMBER> = <VP NUMBER>
Agreement Constraints
S → NP VP
<NP NUMBER> = <VP NUMBER>
S → Aux NP VP
<Aux AGREEMENT> = <NP AGREEMENT>
NP → Det NOMINAL
<Det AGREEMENT> = <NOMINAL AGREEMENT>
<NP AGREEMENT> = <NOMINAL AGREEMENT>
NOMINAL → Noun
<NOMINAL AGREEMENT> = <Noun AGREEMENT>
VP → Verb NP
<VP AGREEMENT> = <Verb AGREEMENT>
Agreement Constraints -- Lexicon Entries
Aux → does    <Aux AGREEMENT NUMBER> = SG
              <Aux AGREEMENT PERSON> = 3
Aux → do      <Aux AGREEMENT NUMBER> = PL
Det → these   <Det AGREEMENT NUMBER> = PL
Det → this    <Det AGREEMENT NUMBER> = SG
Verb → serves <Verb AGREEMENT NUMBER> = SG
              <Verb AGREEMENT PERSON> = 3
Verb → serve  <Verb AGREEMENT NUMBER> = PL
Noun → flights <Noun AGREEMENT NUMBER> = PL
Noun → flight  <Noun AGREEMENT NUMBER> = SG
Head Features
Certain features are copied from children to parent in feature structures; for example, the AGREEMENT feature of a NOMINAL is copied into its NP. The features of most grammatical categories are copied from one of the children to the parent. The child that provides the features is called the head of the phrase, and the copied features are referred to as head features. A verb is the head of a verb phrase, and a nominal is the head of a noun phrase. We may reflect these constructs in feature structures as follows:
NP → Det NOMINAL
<Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
<NP HEAD> = <NOMINAL HEAD>
VP → Verb NP
<VP HEAD> = <Verb HEAD>
SubCategorization Constraints
For verb phrases, we can represent subcategorization constraints using three techniques:
atomic subcat symbols; encoding subcat lists as feature structures; the minimal rule approach (using lists directly).
We may use any of these representations.
Atomic Subcat Symbols
VP → Verb
<VP HEAD> = <Verb HEAD>
<VP HEAD SUBCAT> = INTRANS
VP → Verb NP
<VP HEAD> = <Verb HEAD>
<VP HEAD SUBCAT> = TRANS
VP → Verb NP NP
<VP HEAD> = <Verb HEAD>
<VP HEAD SUBCAT> = DITRANS
Verb → slept   <Verb HEAD SUBCAT> = INTRANS
Verb → served  <Verb HEAD SUBCAT> = TRANS
Verb → gave    <Verb HEAD SUBCAT> = DITRANS
Encoding Subcat Lists as Features
Verb → gave
<Verb HEAD SUBCAT FIRST CAT> = NP
<Verb HEAD SUBCAT SECOND CAT> = NP
<Verb HEAD SUBCAT THIRD> = END
VP → Verb NP NP
<VP HEAD> = <Verb HEAD>
<VP HEAD SUBCAT FIRST CAT> = <NP CAT>
<VP HEAD SUBCAT SECOND CAT> = <NP CAT>
<VP HEAD SUBCAT THIRD> = END
We are only encoding lists using positional features
Minimal Rule Approach
In fact, we do not really want symbols like SECOND and THIRD; they are just there to encode lists. We can use lists directly (as in LISP):
<SUBCAT FIRST CAT> = NP
<SUBCAT REST FIRST CAT> = NP
<SUBCAT REST REST> = END
Subcategorization Frames for Lexical Entries
We can use two different notations to represent subcategorization frames for lexical entries (verbs).
Verb → want
<Verb HEAD SUBCAT FIRST CAT> = NP
Verb → want
<Verb HEAD SUBCAT FIRST CAT> = VP
<Verb HEAD SUBCAT FIRST FORM> = INFINITIVE
In AVM notation:
[ORTH WANT,
 CAT VERB,
 HEAD [SUBCAT < [CAT NP], [CAT VP, HEAD [VFORM INFINITIVE]] >]]
Implementing Unification
The representation we have used so far cannot accommodate the destructive merger aspect of the unification algorithm.
For this reason, we add additional fields (additional edges in the DAGs) to our feature structures.
Each feature structure will consist of two fields:
Content field -- this field can be NULL or may contain an ordinary feature structure.
Pointer field -- this field can be NULL or may contain a pointer to another feature structure.
If the pointer field of a DAG is NULL, the content field of the DAG contains the actual feature structure to be processed.
If the pointer field of a DAG is not NULL, the destination of that pointer represents the actual feature structure to be processed.
Extended Feature Structures
The ordinary feature structure
[NUMBER SG, PERSON 3]
becomes, in extended form:
[CONTENT [NUMBER [CONTENT SG, POINTER NULL],
          PERSON [CONTENT 3, POINTER NULL]],
 POINTER NULL]
Extended DAG
[Diagram: the same feature structure drawn as a DAG, each node split into a content (C) field and a pointer (P) field; the Num and Per arcs lead to SG and 3, and all pointer fields are Null.]
Unification of Extended DAGs
[NUMBER SG] ⊔ [PERSON 3]
[Diagram: the two extended DAGs for [NUMBER SG] and [PERSON 3] before unification; all pointer fields are Null.]
Unification of Extended DAGs (cont.)
[Diagram: after unification, the pointer field of one DAG is set to point at the other, and a new Per arc is added, so the merged structure is equivalent to [NUMBER SG, PERSON 3].]
Unification Algorithm
function UNIFY(f1, f2) returns fstructure or failure
  f1real ← real contents of f1   /* dereference f1 */
  f2real ← real contents of f2   /* dereference f2 */
  if f1real is Null then { f1.pointer ← f2; return f2 }
  else if f2real is Null then { f2.pointer ← f1; return f1 }
  else if f1real and f2real are identical then { f1.pointer ← f2; return f2 }
  else if f1real and f2real are complex feature structures then {
    f2.pointer ← f1
    for each feature in f2real do {
      otherfeature ← find or create a feature corresponding to feature in f1real
      if UNIFY(feature.value, otherfeature.value) returns failure then
        return failure
    }
    return f1
  }
  else return failure
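A runnable sketch of the same idea in Python, assuming a simple Node class with content and pointer fields (an illustration of the algorithm above, not the lecture's own code; like the pseudocode, it mutates its arguments, which is why UNIFY-STATES later copies DAGs before unifying):

class Node:
    """A feature-structure node with a content field and a pointer field."""
    def __init__(self, content=None):
        self.content = content   # None, an atomic value (str), or a dict of feature -> Node
        self.pointer = None      # forwarding pointer, set during unification

    def deref(self):
        """Follow pointer fields to the node's current representative."""
        node = self
        while node.pointer is not None:
            node = node.pointer
        return node

def unify(f1, f2):
    """Destructively unify two nodes; return the result, or None on failure."""
    f1, f2 = f1.deref(), f2.deref()
    if f1 is f2:
        return f1
    if f1.content is None:                 # f1 is an unfilled node
        f1.pointer = f2
        return f2
    if f2.content is None:
        f2.pointer = f1
        return f1
    if isinstance(f1.content, dict) and isinstance(f2.content, dict):
        f2.pointer = f1                    # merge f2's features into f1
        for feat, value in f2.content.items():
            other = f1.content.setdefault(feat, Node())   # find or create in f1
            if unify(value, other) is None:
                return None
        return f1
    if f1.content == f2.content:           # identical atomic values
        f1.pointer = f2
        return f2
    return None                            # contradictory values: unification fails

# [NUMBER SG] unified with [PERSON 3]
a = Node({"NUMBER": Node("SG")})
b = Node({"PERSON": Node("3")})
result = unify(a, b)
print(sorted(result.deref().content))      # ['NUMBER', 'PERSON']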
Example - Unification of Complex Structures
[AGREEMENT (1) [NUMBER SG], SUBJECT [AGREEMENT (1)]]
⊔ [SUBJECT [AGREEMENT [PERSON 3]]]
= [AGREEMENT (1) [NUMBER SG, PERSON 3], SUBJECT [AGREEMENT (1)]]
Example - Unification of Complex Structures (cont.)
[Diagram: the extended DAGs for the two structures above, showing the Sub, Agr, Num and Per arcs and the pointer assignments made during unification.]
Parsing with Unification Constraints
Let us assume that we have augmented our grammar with sets of unification constraints.
What changes do we need to make to a parser to make use of them?
Building feature structures and associating them with sub-trees.
Unifying feature structures when sub-trees are created.
Blocking ill-formed constituents.
Earley Parsing with Unification Constraints
What do we have to do to integrate unification constraints with the Earley parser?
Building feature structures (represented as DAGs) and associating them with states in the chart.
Unifying feature structures as states are advanced in the chart.
Blocking ill-formed states from entering the chart.
The main change will be in the COMPLETER function of the Earley parser: this routine will invoke the unifier to unify two feature structures.
Building Feature Structures
NP → Det NOMINAL
<Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
<NP HEAD> = <NOMINAL HEAD>
corresponds to
[NP      [HEAD (1)],
 Det     [HEAD [AGREEMENT (2)]],
 NOMINAL [HEAD (1) [AGREEMENT (2)]]]
Augmenting States with DAGs
Each state will have an additional field to contain the DAG representing the feature structure corresponding to the state.
When a rule is first used by PREDICTOR to create a state, the DAG associated with the state will simply consist of the DAG retrieved from the rule.
For example,
S → • NP VP, [0,0], [], Dag1
where Dag1 is the feature structure corresponding to S → NP VP.
NP → • Det NOMINAL, [0,0], [], Dag2
where Dag2 is the feature structure corresponding to NP → Det NOMINAL.
What does COMPLETER do?
When COMPLETER advances the dot in a state, it should unify the feature structure of the newly completed state with the appropriate part of the feature structure being advanced.
If this unification process is successful, the new state gets the result of the unification as its DAG, and this new state is entered into the chart. If it fails, nothing is entered into the chart.
A Completion Example
Parsing the phrase "that flight", after "that" has been processed, the chart contains the state
NP → Det • NOMINAL, [0,1], [SDet], Dag1
Dag1:
[NP      [HEAD (1)],
 Det     [HEAD [AGREEMENT (2) [NUMBER SG]]],
 NOMINAL [HEAD (1) [AGREEMENT (2)]]]
A newly completed state: NOMINAL → Noun •, [1,2], [SNoun], Dag2
Dag2:
[NOMINAL [HEAD (1)],
 Noun    [HEAD (1) [AGREEMENT [NUMBER SG]]]]
To advance the NP state, the parser unifies the feature structure found under the NOMINAL feature of Dag2 with the feature structure found under the NOMINAL feature of Dag1.
Earley Parse
function EARLEY-PARSE(words, grammar) returns chart
  ENQUEUE((γ → • S, [0,0], dagγ), chart[0])
  for i from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return(chart)
Predictor and Scanner
procedure PREDICTOR((A → α • B β, [i,j], dagA))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j], dagB), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j], dagA))
  if B ∈ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j] •, [j,j+1], dagB), chart[j+1])
  end
Completer and UnifyStates
procedure COMPLETER((B → γ •, [j,k], dagB))
  for each (A → α • B β, [i,j], dagA) in chart[j] do
    if newdag ← UNIFY-STATES(dagB, dagA, B) does not fail then
      ENQUEUE((A → α B • β, [i,k], newdag), chart[k])
  end

procedure UNIFY-STATES(dag1, dag2, cat)
  dag1cp ← CopyDag(dag1)
  dag2cp ← CopyDag(dag2)
  UNIFY(FollowPath(cat, dag1cp), FollowPath(cat, dag2cp))
  end
Enqueue
procedure ENQUEUE(state,chart-entry)
if state is not subsumed by a state in chart-entry then
Add state at the end of chart-entry
end
Probabilistic Parsing
Slides by Markus Dickinson, Georgetown University
Motivation and Outline
Previously, we used CFGs to parse with, but some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse.
How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus ...
Where we're going: Probabilistic Context-Free Grammars (PCFGs), lexicalized PCFGs, dependency grammars.
Statistical Parsing
Basic idea: start with a treebank, a collection of sentences with syntactic annotation, i.e., already-parsed sentences.
Examine which parse trees occur frequently.
Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule based on its frequency.
That is, we’ll have a CFG augmented with probabilities
Using Probabilities to Parse
P(T): probability of a particular parse tree
P(T) = Π_{n ∈ T} p(r(n))
i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree
Example: given the probabilities on p. 449, compute the probability of the parse tree on the right
Computing probabilities
We have the following rules and probabilities (adapted from Figure 12.1):
S → VP          .05
VP → V NP       .40
NP → Det N      .20
V → book        .30
Det → that      .05
N → flight      .25
P(T) = P(S→VP) * P(VP→V NP) * ... * P(N→flight) = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10⁻⁵
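A quick check of this arithmetic in Python (the rule probabilities are the ones listed on this slide):

from functools import reduce

rule_prob = {
    ("S", ("VP",)): 0.05,
    ("VP", ("V", "NP")): 0.40,
    ("NP", ("Det", "N")): 0.20,
    ("V", ("book",)): 0.30,
    ("Det", ("that",)): 0.05,
    ("N", ("flight",)): 0.25,
}

# Rules used at each node of the parse tree for "book that flight"
tree_rules = [("S", ("VP",)), ("VP", ("V", "NP")), ("NP", ("Det", "N")),
              ("V", ("book",)), ("Det", ("that",)), ("N", ("flight",))]

p_tree = reduce(lambda p, r: p * rule_prob[r], tree_rules, 1.0)
print(p_tree)   # approximately 1.5e-05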
Using probabilities
So, the probability for that parse is 0.000015. What’s the big deal?
Probabilities are useful for comparing with other probabilities
Whereas we couldn’t decide between two parses using a regular CFG, we now can.
For example, "TWA flights" is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:
A: [book [TWA] [flights]]
B: [book [TWA flights]]
Probabilities allow us to choose reading B (see Figure 12.2).
Obtaining the best parse
Call the best parse T(S), where S is your sentence: the tree with the highest probability, i.e. T(S) = argmax_{T ∈ parse-trees(S)} P(T)
Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate best parse
CYK is a form of dynamic programming CYK is a chart parser, like the Earley parser
The CYK algorithm
Base case: add the words to the chart, storing P(A → w_i) for every category A.
Recursive case: this is dynamic programming because we only calculate each B and C once.
Rules must be of the form A → B C, i.e., exactly two items on the RHS (we call this Chomsky Normal Form, CNF).
Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C):
P(B) * P(C) * P(A → B C)
For a given A, only keep the maximum probability (again, this is dynamic programming)
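A compact sketch of probabilistic CYK along these lines; the toy CNF grammar and its probabilities are invented for illustration (not the textbook's figure), so treat it only as a shape of the algorithm:

from collections import defaultdict

unary = {("V", "book"): 0.3, ("Det", "that"): 0.05, ("N", "flight"): 0.25,
         ("NP", "flight"): 0.02}              # lexical rules A -> w
binary = {("S", "V", "NP"): 0.05, ("NP", "Det", "N"): 0.20,
          ("VP", "V", "NP"): 0.40}            # binary rules A -> B C

def cyk(words):
    n = len(words)
    # chart[i][j]: best probability of each category over words[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):             # base case: the words themselves
        for (cat, word), p in unary.items():
            if word == w:
                chart[i][i + 1][cat] = p
    for span in range(2, n + 1):              # recursive case: longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # split point
                for (a, b, c), p in binary.items():
                    cand = chart[i][k][b] * chart[k][j][c] * p
                    if cand > chart[i][j][a]: # keep only the max for each A
                        chart[i][j][a] = cand
    return chart[0][n]

print(dict(cyk(["book", "that", "flight"])))  # best probability for each category over the sentence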
Problems with PCFGs
It’s still only a CFG, so dependencies on non-CFG info not captured
e.g., pronouns are more likely to be subjects than objects: P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]
Ignores lexical information (statistics), which is usually crucial for disambiguation
(T1) America sent [[250,000 soldiers] [into Iraq]]
(T2) America sent [250,000 soldiers] [into Iraq]
send with an into-PP is always attached high (T2) in the PTB!
To handle lexical information, we’ll turn to lexicalized PCFGs
Lexicalized Grammars
Head information is passed up in a syntactic analysis, e.g., VP[head *1] → V[head *1] NP
Well, if you follow this down all the way to the bottom of a tree, you wind up with a head word
In some sense, we can say that Book that flight is not just an S, but an S rooted in book
Thus, book is the headword of the whole sentence
By adding headword information to nonterminals, we wind up with a lexicalized grammar
Lexicalized PCFGs
Lexicalized parse trees: each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter.
The headword for a node is set to the head word of its head daughter.
[Tree diagram: a lexicalized parse of "book that flight", with headwords book and flight annotated on the nodes.]
Incorporating Head Probabilities: Wrong Way
Simply adding the headword w to the node won't work. The node A becomes A[w], e.g., P(A[w] → β | A) = Count(A[w] → β) / Count(A)
The probabilities are too small, i.e., we don't have a big enough corpus to calculate these probabilities:
VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3 × 10⁻¹⁰
VP(dumped) → VBD(dumped) NP(cats) PP(into)    8 × 10⁻¹¹
These probabilities are tiny, and others will never occur
Incorporating head probabilities: Right way
Previously, we conditioned on the mother node A: P(A → β | A)
Now, we can condition on the mother node and the headword of A, h(A):
P(A → β | A, h(A))
We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head,
e.g., P(VP → VBD NP PP | VP, dumped)
Calculating rule probabilities
We’ll write the probability more generally as: P(r(n) | n, h(n)) where n = node, r = rule, and h = headword
We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appear in total:
P(VP → VBD NP PP | VP, dumped)
= C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β)
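A sketch of this relative-frequency estimate, assuming we have counts of lexicalized expansions from a treebank (the counts below are invented for illustration):

from collections import Counter

# counts of lexicalized expansions: (mother, headword) -> Counter of RHS tuples
expansions = {
    ("VP", "dumped"): Counter({("VBD", "NP", "PP"): 6, ("VBD", "NP"): 3, ("VBD",): 1}),
}

def rule_prob(mother, head, rhs):
    """P(mother -> rhs | mother, head) = C(mother(head) -> rhs) / sum over beta of C(mother(head) -> beta)."""
    counts = expansions[(mother, head)]
    return counts[rhs] / sum(counts.values())

print(rule_prob("VP", "dumped", ("VBD", "NP", "PP")))   # 0.6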
Adding info about word-word dependencies
We want to take into account one other factor: the probability of being a head word (in a given context)
P(h(n) = word | ...). We condition this probability on two things: (1) the category of the node n, and (2) the headword of the mother, h(m(n)):
P(h(n) = word | n, h(m(n))), shortened as P(h(n) | n, h(m(n)))
e.g. P(sacks | NP, dumped)
What we’re really doing is factoring in how words relate to each other
We will call this a dependency relation later: sacks is dependent on dumped, in this case
Putting it all together
See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin
For rules r, category n, head h, mother m
P(T) = Π_{n ∈ T} p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))
p(r(n) | n, h(n)): subcategorization info, e.g. P(VP → VBD NP PP | VP, dumped)
p(h(n) | n, h(m(n))): dependency info between words, e.g. P(sacks | NP, dumped)
Dependency Grammar
Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG)
In DG, there is no such thing as constituency. The structure of a sentence is purely the binary relations between words.
John loves Mary is represented as:
LOVE → JOHN, LOVE → MARY
where A → B means that B depends on A
Dependency parsing
Dependency Grammar/Parsing
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The idea of dependency structure goes back a long way, to Pāṇini's grammar (c. 5th century BCE).
Constituency is a new-fangled, 20th-century invention.
Modern work is often linked to the work of L. Tesnière (1959). It is the dominant approach in the "East" (Eastern bloc/East Asia).
Among the earliest kinds of parsers in NLP, even in the US: David Hays, one of the founders of computational linguistics, built an early (first?) dependency parser (Hays 1962).
Dependency structure
Words are linked from head (regent) to dependent. Warning: some people draw the arrows one way, some the other (Tesnière has them point from head to dependent). Usually we add a fake ROOT so every word is a dependent.
Shaw Publishing acquired 30% of American City in March
Relation between a CFG parse and a dependency parse
A dependency grammar has a notion of a head; officially, CFGs don't. But modern linguistic theory and all modern statistical parsers (Charniak, Collins, Stanford, ...) do, via hand-written phrasal "head rules": the head of a Noun Phrase is a noun/number/adj/...; the head of a Verb Phrase is a verb/modal/...
The head rules can be used to extract a dependency parse from a CFG parse (follow the heads), as sketched below.
A phrase structure tree can be obtained from a dependency tree, but the dependents are flat (no VP!)
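A rough sketch of that extraction, with a toy tree and toy head rules (not any parser's actual head-rule table):

HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNS", "NNP"]}

def head_word(tree):
    """tree is (label, children...) or a (POS, word) leaf; return its head word."""
    if isinstance(tree[1], str):                     # leaf: (POS, word)
        return tree[1]
    label, children = tree[0], tree[1:]
    for cand in HEAD_RULES.get(label, []):           # pick the head child
        for child in children:
            if child[0] == cand:
                return head_word(child)
    return head_word(children[0])                    # fallback: first child

def dependencies(tree, deps=None):
    """Each non-head child's head word depends on the phrase's head word."""
    if deps is None:
        deps = []
    if isinstance(tree[1], str):
        return deps
    h = head_word(tree)
    for child in tree[1:]:
        c = head_word(child)
        if c != h:
            deps.append((h, c))                      # head -> dependent
        dependencies(child, deps)
    return deps

tree = ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "ate"), ("NP", ("NN", "cake"))))
print(dependencies(tree))   # [('ate', 'John'), ('ate', 'cake')]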
Propagating head words
Small set of rules propagate heads
Example (heads propagated), for John Smith, the president of IBM, announced his resignation yesterday:
[S(announced)
  [NP(Smith)
    [NP(Smith) [NNP John] [NNP Smith]]
    [NP(president) [NP [DT the] [NN president]] [PP(of) [IN of] [NP [NNP IBM]]]]]
  [VP(announced)
    [VBD announced]
    [NP(resignation) [PRP$ his] [NN resignation]]
    [NP [NN yesterday]]]]
Extracted structure
NB. Not all dependencies shown here
Dependencies are inherently untyped, though some work like Collins (1996) types them using the phrasal categories
[Figure: dependencies extracted from the tree above via CFG configurations such as NP, NP → NP PP, S → NP VP, and VP → VBD NP, e.g. [John Smith], [the president] of [IBM], announced [his resignation] [yesterday].]
Dependency Conditioning Preferences
Sources of information: bilexical dependencies; distance of dependencies; valency of heads (number of dependents).
A word's dependents (adjuncts, arguments) tend to fall near it in the string.
These next 6 slides are based on slides by Jason Eisner and Noah Smith
Probabilistic dependency grammar: generative model
1. Start with left wall $
2. Generate root w0
3. Generate left children w-1, w-2, ..., w-ℓ from the FSA λw0
4. Generate right children w1, w2, ..., wr from the FSA ρw0
5. Recurse on each wi for i in {-ℓ, ..., -1, 1, ..., r}, sampling αi (steps 2-4)
6. Return α_{-ℓ} ... α_{-1} w0 α_1 ... α_r
[Diagram: root w0 attached to the left wall $, with left children w-1 ... w-ℓ generated by the FSA λ_w0 and right children w1 ... wr generated by the FSA ρ_w0; each child recursively generates its own subtree.]
Naïve Recognition/Parsing
Example: It takes two to tango
[Diagram: a naive item combines a head p at position i with a dependent c, over spans bounded by positions i, j, k; the goal item spans positions 0 to n with root r.]
O(n^5) combinations; O(n^5 N^3) if there are N nonterminals.
Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)
Triangles: spans over words, where the tall side of the triangle is the head, the other side is the dependent, and no non-head words are expecting more dependents.
Trapezoids: spans over words, where the larger side is the head, the smaller side is the dependent, and the smaller side is still looking for dependents on its side of the trapezoid.
Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)
Example: It takes two to tango
[Diagram: the sentence decomposed into triangles and trapezoids.]
One trapezoid per dependency.
A triangle is a head with some left (or right) subtrees.
Cubic Recognition/Parsing (Eisner & Satta, 1999)
[Diagrams: triangles and trapezoids over spans (i, j) and (j, k) combine to form larger trapezoids and triangles over (i, k), giving O(n^3) combinations; the goal item spans 0 to n, giving O(n) combinations.]
Gives O(n^3) dependency grammar parsing.
Evaluation of Dependency Parsing: Simply use (labeled) dependency accuracy
GOLD:
1 2 We SUBJ
2 0 eat ROOT
3 5 the DET
4 5 cheese MOD
5 2 sandwich SUBJ

PARSED:
1 2 We SUBJ
2 0 eat ROOT
3 4 the DET
4 2 cheese OBJ
5 2 sandwich PRED

Accuracy = (number of correct dependencies) / (total number of dependencies) = 2 / 5 = 0.40 = 40%
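The computation as a few lines of Python (gold and parsed entries copied from the table above; each entry is (dependent index, head index, word, label)):

gold   = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 5, "the", "DET"),
          (4, 5, "cheese", "MOD"), (5, 2, "sandwich", "SUBJ")]
parsed = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 4, "the", "DET"),
          (4, 2, "cheese", "OBJ"), (5, 2, "sandwich", "PRED")]

correct = sum(g == p for g, p in zip(gold, parsed))   # labeled: head and label must match
print(correct / len(gold))                            # 0.4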
McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers
Builds a discriminative dependency parser; can condition on rich features in that context.
Best-known recent dependency parser; lots of recent dependency parsing activity connected with the CoNLL 2006/2007 shared tasks.
Doesn't/can't report constituent LP/LR, but evaluating dependencies correct: accuracy is similar to, but a fraction below, dependencies extracted from Collins: 90.9% vs. 91.4%; combining them gives 92.2% [all lengths].
Stanford parser on lengths up to 40: pure generative dependency model 85.0%; lexicalized factored parser 91.0%.
McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers
The score of a parse is the sum of the scores of its dependencies.
Each dependency score is a linear function of features times weights.
Feature weights are learned by MIRA, an online large-margin algorithm (but you could think of it as using a perceptron or maxent classifier).
Features cover: head and dependent word and POS separately; head and dependent word and POS bigram features; words between head and dependent; length and direction of the dependency.
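An illustrative sketch of this arc-factored scoring; the feature templates and weights below are invented, and in the real parser the weights come from MIRA training rather than being set by hand:

weights = {"head_pos=VBD&dep_pos=NN": 1.2, "dist=1": 0.4,
           "head_pos=NN&dep_pos=VBD": -0.8}

def arc_features(head, dep):
    """Features for one dependency arc; head and dep are (word, POS, position) triples."""
    return [f"head_pos={head[1]}&dep_pos={dep[1]}",
            f"dist={abs(head[2] - dep[2])}"]

def arc_score(head, dep):
    return sum(weights.get(f, 0.0) for f in arc_features(head, dep))

def parse_score(arcs):
    """Score of a parse = sum of the scores of its dependencies."""
    return sum(arc_score(h, d) for h, d in arcs)

ate, cake = ("ate", "VBD", 2), ("cake", "NN", 3)
print(parse_score([(ate, cake)]))   # 1.2 + 0.4 (approximately 1.6)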
Extracting grammatical relations from statistical constituency parsers
[de Marneffe et al., LREC 2006]
Exploit the high-quality syntactic analysis done by statistical constituency parsers to get the grammatical relations [typed dependencies].
Dependencies are generated by pattern-matching rules.
Example: Bills on ports and immigration were submitted by Senator Brownback
[Phrase-structure tree omitted.]
Extracted typed dependencies:
nsubjpass(submitted, Bills)
auxpass(submitted, were)
agent(submitted, Brownback)
nn(Brownback, Senator)
prep_on(Bills, ports)
cc_and(ports, immigration)
Evaluating Parser Output
Dependency relations are also useful for comparing parser output to a treebank
Traditional measures of parser accuracy:
Labeled bracketing precision: # correct constituents in parse / # constituents in parse
Labeled bracketing recall: # correct constituents in parse / # constituents in treebank parse
There are known problems with these measures, so people are trying to use dependency-based measures instead
How many dependency relations did the parse get correct?