Probabilistic and Lexicalized Parsing


Page 1: Probabilistic and Lexicalized Parsing


Page 2: Probabilistic CFGs

• Weighted CFGs
  – Attach weights to the rules of a CFG
  – Compute weights of derivations
  – Use weights to pick preferred parses
• Utility: pruning and ordering the search space, disambiguation, language model for ASR
• Parsing with weighted grammars (like weighted FAs): T* = argmax_T W(T, S)
• Probabilistic CFGs are one form of weighted CFG.

Page 3: Probability Model

• Rule probability:
  – Attach probabilities to grammar rules
  – Expansions for a given non-terminal sum to 1:
      R1: VP → V         .55
      R2: VP → V NP      .40
      R3: VP → V NP NP   .05
  – Estimate the probabilities from annotated corpora: P(R1) = count(R1) / count(VP)
• Derivation probability (sketched in code below):
  – Derivation T = {R1, …, Rn}
  – Probability of a derivation: P(T) = ∏_{i=1..n} P(R_i)
  – Most likely parse: T* = argmax_T P(T)
  – Probability of a sentence: P(S) = Σ_T P(T, S), summing over all possible derivations of the sentence
• Note the independence assumption: a rule's probability does not change based on where in the derivation the rule is expanded.
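A minimal sketch of this model in Python (toy grammar; the probabilities are the R1-R3 values above, everything else is a hypothetical encoding):

    import math

    # P(R) = count(R) / count(LHS), here hard-coded with the R1-R3 values above.
    rule_prob = {
        ("VP", ("V",)): 0.55,             # R1: VP -> V
        ("VP", ("V", "NP")): 0.40,        # R2: VP -> V NP
        ("VP", ("V", "NP", "NP")): 0.05,  # R3: VP -> V NP NP
    }

    def derivation_log_prob(rules):
        # P(T) = product of P(R_i); summing logs avoids underflow on long derivations.
        return sum(math.log(rule_prob[r]) for r in rules)

    def best_parse(candidate_derivations):
        # T* = argmax_T P(T)
        return max(candidate_derivations, key=derivation_log_prob)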

Page 4: Structural Ambiguity

• S → NP VP
• VP → V NP
• NP → NP PP
• VP → VP PP
• PP → P NP
• NP → John | Mary | Denver
• V → called
• P → from

John called Mary from Denver

Two parses, shown as trees on the slide (and scored below):
  PP attached to the VP: [S [NP John] [VP [VP [V called] [NP Mary]] [PP [P from] [NP Denver]]]]
  PP attached to the NP: [S [NP John] [VP [V called] [NP [NP Mary] [PP [P from] [NP Denver]]]]]
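Connecting this to the probability model of page 3: with invented probabilities for these rules (the lexical rules are shared by both parses and omitted), the two attachments score differently:

    # Hypothetical probabilities for the page-4 rules, invented for illustration.
    P = {"S -> NP VP": 1.0, "VP -> V NP": 0.6, "VP -> VP PP": 0.3,
         "NP -> NP PP": 0.2, "PP -> P NP": 1.0}

    vp_attach = ["S -> NP VP", "VP -> VP PP", "VP -> V NP", "PP -> P NP"]
    np_attach = ["S -> NP VP", "VP -> V NP", "NP -> NP PP", "PP -> P NP"]

    def prob(rules):
        out = 1.0
        for r in rules:
            out *= P[r]
        return out

    print(prob(vp_attach), prob(np_attach))  # 0.18 vs. 0.12: VP attachment wins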

Page 5: Cocke-Younger-Kasami Parser

• Bottom-up parser with top-down filtering
• Start state(s): (A, i, i+1) for each A → w_{i+1}
• End state: (S, 0, n), where n is the input size
• Next-state rule: (B, i, k), (C, k, j) ⇒ (A, i, j) if A → B C (see the recognizer sketch below)
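A compact recognizer built from these states (a sketch; the grammar is assumed to be in Chomsky normal form, and the lexicon/binary_rules encodings are my own):

    from collections import defaultdict

    def cky_recognize(words, lexicon, binary_rules):
        # lexicon: word -> set of A with A -> word   (start states (A, i, i+1))
        # binary_rules: (B, C) -> set of A with A -> B C
        n = len(words)
        chart = defaultdict(set)              # (i, j) -> non-terminals over words[i:j]
        for i, w in enumerate(words):
            chart[i, i + 1] |= lexicon[w]
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):     # (B, i, k), (C, k, j) => (A, i, j)
                    for B in chart[i, k]:
                        for C in chart[k, j]:
                            chart[i, j] |= binary_rules.get((B, C), set())
        return "S" in chart[0, n]             # end state (S, 0, n)

Encoded this way, the page-4 grammar accepts "John called Mary from Denver".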

Page 6: Example

John called Mary from Denver

Page 7: Base Case: A → w

[Chart diagonal filled from the lexicon: NP (John), V (called), NP (Mary), P (from), NP (Denver).]

Page 8: Recursive Cases: A → B C

[Pages 8 through 20 fill the chart one span at a time; X marks a span for which no constituent can be built:
  "John called" (NP V): no rule, X
  "called Mary" (V NP): VP
  "Mary from": no rule, X
  "from Denver" (P NP): PP
  "John called Mary" (NP VP): S
  "called Mary from": no rule, X
  "Mary from Denver" (NP PP): NP
  "John called Mary from": no rule, X
  "called Mary from Denver": two entries, VP1 (VP → VP PP) and VP2 (VP → V NP)
  "John called Mary from Denver" (NP VP): S, reachable through either VP1 or VP2, giving the two parses of page 4.]

Page 21: Probabilistic CKY

• Assign probabilities to constituents as they are completed and placed in the table
• Computing the probability:

  P(A → BC, i, j) = P(B, i, k) · P(C, k, j) · P(A → BC)
  P(A, i, j) = max over rules A → BC and split points k of P(A → BC, i, j)

• Since we are interested in the max P(S, 0, n), use the max probability for each constituent
• Maintain back-pointers to recover the parse (see the sketch below)
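The recurrence drops into the recognizer from page 5 almost unchanged; a sketch, with the same assumed encodings but now carrying probabilities:

    from collections import defaultdict

    def pcky(words, lexicon, binary_rules):
        # lexicon: word -> {A: P(A -> word)}
        # binary_rules: (B, C) -> {A: P(A -> B C)}
        n = len(words)
        prob = defaultdict(dict)   # (i, j) -> {A: max probability over derivations}
        back = defaultdict(dict)   # (i, j) -> {A: (k, B, C)} back-pointers
        for i, w in enumerate(words):
            prob[i, i + 1] = dict(lexicon[w])
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for B, pb in prob[i, k].items():
                        for C, pc in prob[k, j].items():
                            for A, pr in binary_rules.get((B, C), {}).items():
                                p = pb * pc * pr  # P(B,i,k) * P(C,k,j) * P(A -> BC)
                                if p > prob[i, j].get(A, 0.0):
                                    prob[i, j][A] = p
                                    back[i, j][A] = (k, B, C)
        return prob[0, n].get("S", 0.0), back

Following the back-pointers down from ("S", 0, n) recovers the most probable parse.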

Page 22: Problems with PCFGs

• The probability model we're using is based only on the rules in the derivation.
• Lexical insensitivity:
  – Doesn't use the words in any real way
  – But structural disambiguation is lexically driven
    • PP attachment often depends on the verb, its object, and the preposition
    • I ate pickles with a fork.
    • I ate pickles with relish.
• Context insensitivity of the derivation:
  – Doesn't take into account where in the derivation a rule is used
    • Pronouns are more often subjects than objects
    • She hates Mary.
    • Mary hates her.
• Solution: lexicalization
  – Add lexical information to each rule

Page 23: An Example of Lexical Information: Heads

• Make use of the notion of the head of a phrase (a toy head-finder is sketched below)
  – The head of an NP is its noun
  – The head of a VP is its main verb
  – The head of a PP is its preposition
• Each LHS of a rule in the PCFG carries a lexical item
• Each RHS non-terminal carries a lexical item
  – One of the lexical items is shared with the LHS
• If |R| is the number of binary-branching rules in the CFG, the lexicalized CFG has O(2 · |Σ| · |R|) rules, where Σ is the vocabulary
• Unary rules: O(|Σ| · |R|)
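A toy head-finder consistent with the table above (the tree encoding and the HEAD_CHILD table are assumptions made for illustration):

    # Head child per phrase type, per the slide: the head of an NP is a noun,
    # of a VP the main verb, of a PP its preposition.
    HEAD_CHILD = {"NP": "N", "VP": "V", "PP": "P"}

    def head_word(tree):
        # A tree is (label, word) at a leaf and (label, [children]) elsewhere.
        label, rest = tree
        if isinstance(rest, str):                  # leaf: the word itself
            return rest
        for child in rest:
            if child[0] == HEAD_CHILD.get(label):  # descend into the head child
                return head_word(child)
        return head_word(rest[0])                  # fallback: first child

    # head_word(("VP", [("V", "called"), ("NP", [("N", "Mary")])])) == "called"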

Page 24: Example (Correct Parse)

[Lexicalized parse tree, annotated attribute-grammar style, shown on the slide.]

Page 25: Example (Less Preferred)

[The competing lexicalized parse tree shown on the slide.]

Page 26: Computing Lexicalized Rule Probabilities

• We started with rule probabilities:
  – VP → V NP PP        P(rule | VP)
    • E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities:
  – VP(dumped) → V(dumped) NP(sacks) PP(in)
  – P(rule | VP ∧ dumped is the verb ∧ sacks is the head of the NP ∧ in is the head of the PP)
  – Not likely to have significant counts in any treebank (spelled out below)
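To see why the counts vanish, here is the maximum-likelihood estimate written out (counts is a hypothetical treebank count table; the key layout is my own):

    def lexicalized_rule_prob(lhs, rhs, heads, counts):
        # MLE: count(rule with these head words) / count(LHS with its head word).
        # heads = (lhs_head, child_1_head, ...), e.g. ("dumped", "dumped", "sacks", "in").
        num = counts.get((lhs, rhs, heads), 0)
        den = counts.get((lhs, heads[0]), 0)
        # With full lexicalization, num is almost always 0 or 1 in any treebank.
        return num / den if den else 0.0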

Page 27: Another Example

• Consider the VPs:
  – ate spaghetti with gusto
  – ate spaghetti with marinara
• The dependency that disambiguates is not between mother and child.

[Two lexicalized trees shown on the slide:
  PP attached to the VP: [VP(ate) [VP(ate) [V ate] [NP spaghetti]] [PP(with) with gusto]]
  PP attached to the NP: [VP(ate) [V ate] [NP(spag) [NP spaghetti] [PP(with) with marinara]]]]

Page 28: Log-linear Models for Parsing

• Why restrict the conditioning to the elements of a rule?
  – Use an even larger context
  – Word sequence, word types, sub-tree context, etc.
• In general, compute P(y|x), where the f_i(x, y) test properties of the context and λ_i is the weight of feature i:

  P(y|x) = exp(Σ_i λ_i · f_i(x, y)) / Σ_{y'∈Y} exp(Σ_i λ_i · f_i(x, y'))

• Use these as scores in the CKY algorithm to find the best-scoring parse (transcribed in code below).
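A direct transcription of the formula (a sketch; the feature functions, weights, and candidate set Y are whatever you plug in):

    import math

    def log_linear_prob(x, y, Y, features, weights):
        # P(y|x) = exp(sum_i w_i * f_i(x, y)), normalized over all y' in Y.
        def score(cand):
            return math.exp(sum(w * f(x, cand) for f, w in zip(features, weights)))
        return score(y) / sum(score(cand) for cand in Y)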

Page 29: Supertagging: Almost Parsing

Poachers now control the underground trade

[The slide pairs each word with its set of candidate supertags, elementary trees that encode the word's local syntactic context: noun and clause trees for "poachers" and "trade", S- and VP-modifying adverb trees for "now", transitive-clause trees for "control", a determiner tree for "the", and adjective-modifier trees for "underground". Choosing the right supertag for each word resolves most of the parse, hence "almost parsing".]

Page 30: Summary

• Parsing context-free grammars
  – Top-down and bottom-up parsers
  – Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
  – Parsing with the PCFG and probabilistic CKY algorithms
• Enriching the probability model
  – Lexicalization
  – Log-linear models for parsing