
A Knowledge-Intensive Model for Prepositional Phrase Attachment

Ndapandula Nakashole
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
[email protected]

Tom M. Mitchell
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
[email protected]

Abstract

Prepositional phrases (PPs) express crucial information that knowledge base construction methods need to extract. However, PPs are a major source of syntactic ambiguity and still pose problems in parsing. We present a method for resolving ambiguities arising from PPs, making extensive use of semantic knowledge from various resources. As training data, we use both labeled and unlabeled data, utilizing an expectation maximization algorithm for parameter estimation. Experiments show that our method yields improvements over existing methods, including a state-of-the-art dependency parser.

1 Introduction

Machine reading and information extraction (IE) projects have produced large resources with many millions of facts (Suchanek et al., 2007; Mitchell et al., 2015). This wealth of knowledge creates a positive feedback loop for automatic knowledge base construction efforts: the accumulated knowledge can be leveraged to improve machine reading; in turn, improved reading methods can be used to better extract knowledge expressed using complex and potentially ambiguous language. For example, prepositional phrases (PPs) express crucial information that IE methods need to extract. However, PPs are a major source of syntactic ambiguity. In this paper, we propose to use semantic knowledge to improve PP attachment disambiguation. PPs headed by prepositions such as "in", "at", and "for" express details about the where, when, and why of relations and events. PPs also state attributes of nouns.

As an example, consider the following sentences:
S1.) Alice caught the butterfly with the spots.
S2.) Alice caught the butterfly with the net.

Figure 1: Parse trees where the prepositional phrase (PP) attaches to the noun (S1: noun attachment) and to the verb (S2: verb attachment).

Relations: Noun-Noun binary relations, e.g., (Paris, located in, France); (net, caught, butterfly)
Nouns: Noun semantic categories, e.g., (butterfly, isA, animal)
Verbs: Verb roles, e.g., caught(agent, patient, instrument)
Prepositions: Preposition definitions, e.g., f(for) = used for, has purpose, ...; f(with) = has, contains, ...
Discourse: Context, e.g., n0 ∈ {n0, v, n1, p, n2}

Table 1: Types of background knowledge used in this paper to determine PP attachment.

S1 and S2 are syntactically different, as is evident from their corresponding parse trees in Figure 1. Specifically, S1 and S2 differ in where their PPs attach. In S1, the butterfly has spots and therefore the PP, "with the spots", attaches to the noun. For relation extraction, we obtain a binary relation of the form: ⟨Alice⟩ caught ⟨butterfly with spots⟩. However, in S2, the net is the instrument used for catching and therefore the PP, "with the net", attaches to the verb. For relation extraction, we get a ternary extraction of the form: ⟨Alice⟩ caught ⟨butterfly⟩ with ⟨net⟩.

The PP attachment problem is often defined as follows: given a PP occurring within a sentence with multiple possible attachment sites, choose the most plausible attachment site for the PP.


Figure 2: Dependency parser PP attachment accuracy for various frequent prepositions (WITH, AT, FROM, FOR, AS, IN, ON).

In the literature, prior work going as far back as Brill and Resnik (1994), Ratnaparkhi et al. (1994), and Collins and Brooks (1995) has focused on the language pattern that causes most PP ambiguities: the 4-word sequence {v, n1, p, n2} (e.g., {caught, butterfly, with, spots}). The task is to determine whether the prepositional phrase (p, n2) attaches to the verb v or to the first noun n1. Following common practice, we focus on PPs occurring as {v, n1, p, n2} quadruples; we shall refer to these as PP quads.
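To make this formulation concrete, here is a minimal sketch of the data structure the task operates on; it is our illustration, not code from the paper, and the names PPQuad, VERB, and NOUN are ours.

```python
from dataclasses import dataclass

# Attachment labels: the PP (p, n2) attaches either to the verb v
# or to the first noun n1.
VERB, NOUN = 1, 0

@dataclass(frozen=True)
class PPQuad:
    """A PP attachment instance {v, n1, p, n2}."""
    v: str    # verb, e.g. "caught"
    n1: str   # first noun, e.g. "butterfly"
    p: str    # preposition, e.g. "with"
    n2: str   # object of the preposition, e.g. "net" or "spots"

# The disambiguation task maps a quad to VERB or NOUN:
q1 = PPQuad("caught", "butterfly", "with", "net")    # gold label: VERB
q2 = PPQuad("caught", "butterfly", "with", "spots")  # gold label: NOUN
```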

The approach we present here differs from prior work in two main ways. First, we make extensive use of semantic knowledge about nouns, verbs, prepositions, pairs of nouns, and the discourse context in which a PP quad occurs. Table 1 summarizes the types of knowledge we considered in our work. Second, in training our model, we rely on both labeled and unlabeled data, employing an expectation maximization (EM) algorithm (Dempster et al., 1977).

Contributions. In summary, our main contributions are:

1) Semantic Knowledge: Previous methods largely rely on corpus statistics. Our approach draws upon diverse sources of background knowledge, leading to performance improvements.

2) Unlabeled Data: In addition to training on labeled data, we also make use of a large amount of unlabeled data. This enhances our method's ability to generalize to diverse data sets.

3) Datasets: In addition to the standard Wall Street Journal corpus (WSJ) (Ratnaparkhi et al., 1994), we labeled two new datasets for testing purposes, one from Wikipedia (WKP) and another from the New York Times Corpus (NYTC). We make these datasets freely available for future research.

Figure 3: Noun vs. verb attachment proportions for frequent prepositions (IN, FROM, WITH, FOR, OF, AS, AT, ON) in the labeled NYTC dataset.

In addition, we have applied our model to over 4 million 5-tuples of the form {n0, v, n1, p, n2}, and we also make this dataset available [1] for research into ternary relation extraction beyond spatial and temporal scoping.

2 State of the Art

To quantitatively assess existing tools, we analyzed the performance of the widely used Stanford parser [2], as of 2014, and the established baseline algorithm (Collins and Brooks, 1995), which has stood the test of time. We first manually labeled PP quads from the NYTC dataset, then prepended the noun phrase appearing before the quad, effectively creating sentences made up of 5 lexical items (n0 v n1 p n2). We then applied the Stanford parser, obtaining the results summarized in Figure 2. The parser performs well on some prepositions, for example "of", which tends to occur with noun-attaching PPs, as can be seen in Figure 3. However, for prepositions with an even distribution over verb and noun attachments, such as "on", precision is as low as 50%. The Collins baseline achieves 84% accuracy on the benchmark Wall Street Journal PP dataset. However, drawing a distinction between the precision on different prepositions provides useful insights into its performance. We re-implemented this baseline and found that when we remove the trivial preposition "of", whose PPs are by default attached to the noun by this baseline, precision drops to 78%. This analysis suggests there is substantial room for improvement.

[1] http://rtw.ml.cmu.edu/resources/ppa
[2] http://nlp.stanford.edu:8080/parser/


3 Related Work

Statistics-based Methods. Prominent prior methods learn to perform PP attachment based on corpus co-occurrence statistics, gathered either from manually annotated training data (Collins and Brooks, 1995; Brill and Resnik, 1994) or from automatically acquired training data that may be noisy (Ratnaparkhi, 1998; Pantel and Lin, 2000). These models collect statistics on how often a given quadruple, {v, n1, p, n2}, occurs in the training data as a verb attachment as opposed to a noun attachment. The issue with this approach is sparsity: many quadruples occurring in the test data might not have been seen in the training data. Smoothing techniques are often employed to overcome sparsity. For example, Collins and Brooks (1995) proposed a back-off model that uses subsets of the words in the quadruple, by also keeping frequency counts of triples, pairs, and single words (see the sketch below). Another approach to overcoming sparsity has been to use WordNet (Fellbaum, 1998) classes, by replacing nouns with their WordNet classes (Stetina and Nagao, 1997; Toutanova et al., 2004) to obtain less sparse corpus statistics. Corpus-derived clusters of similar nouns and verbs have also been used (Pantel and Lin, 2000).
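To illustrate the back-off idea, the following sketch (ours, not the authors' implementation) decides an attachment from the most specific preposition-containing context that has training counts; verb_counts and noun_counts are assumed Counters built from labeled quads, and the simple majority vote is a simplification of the original probabilistic model.

```python
from collections import Counter

def backoff_predict(quad, verb_counts, noun_counts):
    """Back-off decision in the spirit of Collins & Brooks (1995):
    fall back from the full quadruple to triples, pairs, and finally
    the preposition alone, always keeping the preposition."""
    v, n1, p, n2 = quad
    backoff_levels = [
        [(v, n1, p, n2)],                        # full quadruple
        [(v, n1, p), (v, p, n2), (n1, p, n2)],   # triples containing p
        [(v, p), (n1, p), (p, n2)],              # pairs containing p
        [(p,)],                                  # preposition alone
    ]
    for level in backoff_levels:
        nv = sum(verb_counts[t] for t in level)
        nn = sum(noun_counts[t] for t in level)
        if nv + nn > 0:            # first level with any evidence decides
            return "verb" if nv > nn else "noun"
    return "noun"                  # default when nothing was ever seen

# verb_counts / noun_counts would be filled from labeled training quads:
verb_counts, noun_counts = Counter(), Counter()
```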

Hindle and Rooth (1993) proposed a lexical association approach based on how words are associated with each other. Lexical preference is used by computing co-occurrence frequencies (lexical associations) of verbs and nouns with prepositions. In this manner, they would discover that, for example, the verb "send" is highly associated with the preposition "from", indicating that in this case the PP is likely to be a verb attachment.

Structure-based Methods. These methods are based on high-level observations that are then generalized into heuristics for PP attachment decisions. Kimball (1988) proposed a right association method, whose premise is that a word tends to attach to another word immediately to its right. Frazier (1978) introduced a minimal attachment method, which posits that words attach to an existing non-terminal word using the fewest additional syntactic nodes. While simple, in practice these methods have been found to perform poorly (Whittemore et al., 1990).

Rule-based Methods. Brill and Resnik (1994)

proposed methods that learn a set of transformation rules from a corpus. The rules can be too specific to have broad applicability, resulting in low recall. To address low recall, knowledge about nouns, as found in WordNet, is used to replace certain words in rules with their WordNet classes.

Parser Correction Methods. The quadruples formulation of the PP problem can be seen as a simplified setting. This is because, with quadruples, there is no need to deal with complex sentences but only well-defined quadruples of the form {v, n1, p, n2}. Thus, in the quadruples setting, there are only two possible attachment sites for the PP: the v and the n1. An alternative setting is to work in the context of full sentences. In this setting, the problem is cast as a dependency parse correction problem (Atterer and Schutze, 2007; Agirre et al., 2008; Anguiano and Candito, 2011). That is, given a dependency parse of a sentence, with potentially incorrect PP attachments, rectify it such that the prepositional phrases attach to the correct sites. Unlike our approach, these methods do not take semantic knowledge into account.

Sense Disambiguation. In addition to prior work on prepositional phrase attachment, a highly related problem is preposition sense disambiguation (Hovy et al., 2011; Srikumar and Roth, 2013). Even a syntactically correctly attached PP can still be semantically ambiguous with respect to machine reading questions such as where, when, and why. Therefore, when extracting information from prepositions, the problem of preposition sense disambiguation (semantics) has to be addressed in addition to prepositional phrase attachment disambiguation (syntax). In this paper, our focus is on the latter.

4 Methodology

Our approach consists of first generating features from background knowledge and then training a model to learn with these features. The types of features considered in our experiments are summarized in Table 2. The choice of features was motivated by our empirically driven characterization of the problem as follows:

(Verb attach) → v ⟨has-slot-filler⟩ n2
(Noun attach a.) → n1 ⟨described-by⟩ n2
(Noun attach b.) → n2 ⟨described-by⟩ n1


Noun-Noun Binary Relations (Source: SVOs)
  F1. svo(n2, v, n1): for q1, (net, caught, butterfly)
  F2. ∀i : ∃svo_i; svo(n1, v_i, n2): for q2, (butterfly, has, spots), (butterfly, can see, spots)
Noun Semantic Categories (Source: T)
  F3. ∀t_i ∈ T; isA(n1, t_i): for q1, isA(butterfly, animal)
  F4. ∀t_i ∈ T; isA(n2, t_i): for q1, isA(net, device)
Verb Role Fillers (Source: VerbNet)
  F5. hasRole(n2, r_i): for q1, (net, instrument)
Preposition Relational Definitions (Source: M)
  F6. def(prep, v_i), ∀i : ∃svo_i; v_i ∈ M ∧ svo(n1, v_i, n2): for q2, def(with, has)
Discourse Features (Source: sentence(s), T)
  F7. ∀t_i ∈ T; isA(n0, t_i): n0 ∈ {n0, v, n1, p, n2}
Lexical Features (Source: PP quads)
  F8. (v, n1, p, n2): for q1, (caught, butterfly, with, net)
  F9. (v, n1, p): (caught, butterfly, with)
  F10. (v, p, n2): (caught, with, net)
  F11. (n1, p, n2): (butterfly, with, net)
  F12. (v, p): (caught, with)
  F13. (n1, p): (butterfly, with)
  F14. (p, n2): (with, net)
  F15. (p): (with)

Table 2: Types of features considered in our experiments. All features have values of 1 or 0. The PP quads used as running examples are: q1 = {caught, butterfly, with, net} : V, and q2 = {caught, butterfly, with, spots} : N.

That is, we found that for verb-attaching PPs, n2 is usually a role filler for the verb, e.g., the net fills the role of an instrument for the verb catch. On the other hand, for noun-attaching PPs, one noun describes or elaborates on the other. In particular, we found two kinds of noun attachments. For the first kind of noun attachment, the second noun n2 describes the first noun n1; for example, n2 might be an attribute or property of n1, as in: the spots (n2) are an attribute of the butterfly (n1). For the second kind of noun attachment, the first noun n1 describes the second noun n2, as in the PP quad {expect, decline, in, rates}, where the PP "in rates" attaches to the noun: the decline (n1) that is expected (v) is in the rates (n2). We sampled 50 PP quads from the WSJ dataset and found that every labeling could be explained using our characterization. We make this labeling available with the rest of the datasets.

We next describe in more detail how each type of feature is derived from the background knowledge in Table 1.

4.1 Feature Generation

We generate boolean-valued features for all the feature types we describe in this section.

4.1.1 Noun-Noun Binary Relations

The noun-noun binary relation features, F1-2 in Table 2, are boolean features svo(n1, v_i, n2) (where v_i is any verb) and svo(n2, v, n1) (where v is the verb in the PP quad, and the roles of n2 and n1 are reversed). These features describe diverse semantic relations between pairs of nouns (e.g., butterfly-has-spots, clapton-played-guitar). To obtain this type of knowledge, we dependency parsed all sentences in the 500 million English web pages of the ClueWeb09 corpus, then extracted subject-verb-object (SVO) triples from these parses, along with the frequency of


each SVO triple in the corpus. The value of any given feature svo(n1, v_i, n2) is defined to be 1 if that SVO triple was found at least 3 times in these SVO triples, and 0 otherwise. To see why these relations are relevant, suppose we have the knowledge that butterfly-has-spots, i.e., svo(n1, v_i, n2). From this, we can infer that the PP in {caught, butterfly, with, spots} is likely to attach to the noun. Similarly, suppose we know that net-caught-butterfly, i.e., svo(n2, v, n1). The fact that a net can be used to catch a butterfly can be used to predict that the PP in {caught, butterfly, with, net} is likely to attach to the verb.
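As an illustration, features F1 and F2 could be computed from such a frequency table as follows; this is our sketch, with svo_freq (a dict from SVO triples to corpus counts) and the function names as assumptions, while the threshold of 3 comes from the text above.

```python
MIN_FREQ = 3  # an SVO triple counts as known if seen at least 3 times

def f1_svo_n2_v_n1(quad, svo_freq):
    """F1: 1 if the corpus contains (n2, v, n1), e.g. (net, caught, butterfly);
    evidence that n2 fills a role of v, i.e. verb attachment."""
    v, n1, p, n2 = quad
    return 1 if svo_freq.get((n2, v, n1), 0) >= MIN_FREQ else 0

def f2_svo_n1_any_n2(quad, svo_freq):
    """F2: 1 if some verb vi relates n1 to n2, e.g. (butterfly, has, spots);
    evidence that n2 describes n1, i.e. noun attachment."""
    v, n1, p, n2 = quad
    return 1 if any(s == n1 and o == n2 and c >= MIN_FREQ
                    for (s, _vi, o), c in svo_freq.items()) else 0
```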

4.1.2 Noun Semantic Categories

Noun semantic type features, F3-4, are boolean features isA(n1, t_i) and isA(n2, t_i), where t_i is a noun category in a noun categorization scheme T, such as WordNet classes. Knowledge about semantic types of nouns, for example that a butterfly is an animal, enables extrapolating predictions to other PP quads that contain nouns of the same type. We ran experiments with several noun categorizations, including WordNet classes, knowledge base ontological types, and an unsupervised noun categorization produced by clustering nouns based on the verbs and adjectives with which they co-occur (distributional similarity).

4.1.3 Verb Role Fillers

The verb role feature, F5, is a boolean feature hasRole(n2, r_i), where r_i is a role that n2 can fulfill for the verb v in the PP quad, according to background knowledge. Notice that if n2 fills a role for the verb, then the PP is a verb attachment. Consider the quad {caught, butterfly, with, net}: if we know that a net can play the role of an instrument for the verb catch, this suggests a likely verb attachment. We obtained background knowledge of verbs and their possible roles from the VerbNet lexical resource (Kipper et al., 2008). From VerbNet we obtained 2,573 labeled sentences containing PP quads (verbs in the same VerbNet group are considered synonymous), along with the labeled semantic roles filled by the second noun n2 in the PP quad. We use these example sentences to label similar PP quads, where similarity of PP quads is defined by verbs from the same VerbNet group.

4.1.4 Preposition Definitions

The preposition definition feature, F6, is a boolean feature def(prep, v_i) = 1 if ∃v_i ∈ M ∧ svo(n1, v_i, n2) = 1, where M is a definition mapping of prepositions to verb phrases. This mapping defines prepositions using verbs in our ClueWeb09-derived SVO corpus, in order to capture their senses; it contains definitions such as def(with, *) = contains, accompanied by, ... . If "with" is used in the sense of "contains", then the PP is a likely noun attachment, as in n1 contains n2 in the quad {ate, cookies, with, cranberries}. However, if "with" is used in the sense of "accompanied by", then the PP is a likely verb attachment, as in the quad {visited, Paris, with, Sue}. To obtain the mapping, we took the labeled PP quads (WSJ, (Ratnaparkhi et al., 1994)) and computed a ranked list of verbs from SVOs that appear frequently between pairs of nouns for a given preposition. Other sample mappings are: def(for, *) = used for, and def(in, *) = located in. Notice that this feature F6 is a selective, more targeted version of F2.

4.1.5 Discourse and Lexical Features

The discourse feature, F7, is a boolean feature isA(n0, t_i), for each noun category t_i found in a noun category ontology T, such as WordNet semantic types. The context of the PP quad can contain relevant information for attachment decisions. We take into account the noun preceding a PP quad, in particular its semantic type. This in effect turns the PP quad into a PP 5-tuple, {n0, v, n1, p, n2}, where n0 provides additional context.

Finally, we use lexical features in the form of PP quads, features F8-15. To overcome sparsity of occurrences of PP quads, we also use counts of shorter sub-sequences, including triples, pairs, and single words. We only use sub-sequences that contain the preposition, as the preposition has been found to be highly crucial in PP attachment decisions (Collins and Brooks, 1995). A sketch of this feature generation follows.
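The sketch below (our illustration; the "lex=" naming convention is an assumption) enumerates exactly the preposition-containing sub-sequences F8-F15 for a quad.

```python
def lexical_features(quad):
    """F8-F15: the quad plus every sub-sequence that contains the
    preposition. Returns the feature names whose value is 1; all
    other lexical features are implicitly 0."""
    v, n1, p, n2 = quad
    subseqs = [
        (v, n1, p, n2),            # F8
        (v, n1, p), (v, p, n2),    # F9, F10
        (n1, p, n2),               # F11
        (v, p), (n1, p), (p, n2),  # F12, F13, F14
        (p,),                      # F15
    ]
    return {"lex=" + "_".join(t): 1 for t in subseqs}

# e.g. lexical_features(("caught", "butterfly", "with", "net"))
# -> {"lex=caught_butterfly_with_net": 1, ..., "lex=with": 1}
```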

4.2 Disambiguation Algorithm

We use the described features to train a model for making PP attachment decisions. Our goal is to compute P(y|x), the probability that the PP (p, n2) in the tuple {v, n1, p, n2} attaches to the verb v (y = 1) or to the noun n1 (y = 0), given


a feature vector x describing that tuple. As input to training the model, we are given a collection of PP quads D, where d_i ∈ D : d_i = {v, n1, p, n2}. A small subset, D_l ⊂ D, is labeled data; thus for each d_i ∈ D_l we know the corresponding y_i. The rest of the quads, D_u, are unlabeled, hence their corresponding y_i are unknown. From each PP quad d_i, we extract a feature vector x_i according to the feature generation process discussed in Section 4.1.

4.2.1 Model

To model P(y|x), there are various possibilities. One could use a generative model (e.g., Naive Bayes) or a discriminative model (e.g., logistic regression). In our experiments we used both kinds of models, but found that the discriminative model performed better. Therefore, we present details only for our discriminative model. We use the logistic function:

P(y = 1 | x, θ) = exp(θᵀx) / (1 + exp(θᵀx))

where θ is a vector of model parameters. To estimate these parameters, we could use the labeled data as training data and use standard gradient descent to minimize the logistic regression cost function. However, we also leverage the unlabeled data.

4.2.2 Parameter Estimation

To estimate model parameters based on both labeled and unlabeled data, we use an Expectation Maximization (EM) algorithm. EM estimates model parameters that maximize the expected log likelihood of the full (observed and unobserved) data. Since we are using a discriminative model, our likelihood function is a conditional likelihood function:

L(θ) = Σ_{i=1}^{N} ln P(y_i | x_i) = Σ_{i=1}^{N} [ y_i θᵀx_i − ln(1 + exp(θᵀx_i)) ]    (1)

where i indexes over the N training examples. The EM algorithm produces parameter estimates that correspond to a local maximum in the expected log likelihood of the data under the posterior distribution of the labels, given by argmax_θ E_{P(y|x,θ)}[ln P(y|x, θ)]. In the E-step, we use the current parameters θ^{t−1} to compute the posterior distribution over the y labels, given by P(y|x, θ^{t−1}). We then use this posterior distribution to find the expectation of the log of the complete-data conditional likelihood; this expectation is given by Q(θ, θ^{t−1}), defined as:

Q(θ, θ^{t−1}) = Σ_{i=1}^{N} E_{θ^{t−1}}[ln P(y|x, θ)]    (2)

In the M-step, a new estimate θ^t is then produced by maximizing this Q function with respect to θ:

θ^t = argmax_θ Q(θ, θ^{t−1})    (3)

EM iteratively computes parameters θ^0, θ^1, ..., θ^t, using the above update rule at each iteration t, halting when there is no further improvement in the value of the Q function. Our algorithm is summarized in Algorithm 1. The M-step solution for θ^t is obtained using gradient ascent to maximize the Q function.

Algorithm 1: The EM algorithm for PP attachment
Input: X, D = D_l ∪ D_u
Output: θ^T

for t = 1 ... T do
  E-Step: compute p(y | x_i, θ^{t−1})
    for x_i : d_i ∈ D_u: p(y = 1 | x_i, θ) = exp(θᵀx_i) / (1 + exp(θᵀx_i))
    for x_i : d_i ∈ D_l: p(y | x_i) = 1 if y = y_i, else 0
  M-Step: compute new parameters
    θ^t = argmax_θ Q(θ, θ^{t−1})
    where Q(θ, θ^{t−1}) = Σ_{i=1}^{N} Σ_{y∈{0,1}} p(y | x_i, θ^{t−1}) × (y θᵀx_i − ln(1 + exp(θᵀx_i)))
  if convergence(L(θ^t), L(θ^{t−1})) then
    break
  end if
end for
return θ^T
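The following is a minimal numpy sketch of Algorithm 1 under simplifying assumptions: dense 0/1 feature matrices, a fixed number of EM iterations and gradient steps, and no convergence test on Q. The function and variable names are ours, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_train(X_l, y_l, X_u, n_iters=20, m_steps=200, lr=0.1):
    """EM for logistic regression over labeled (X_l, y_l) and unlabeled X_u."""
    X = np.vstack([X_l, X_u])      # all feature vectors, labeled first
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # E-step: soft labels q_i = p(y=1 | x_i, theta). Labeled quads keep
        # their observed labels (their posterior is 0 or 1); unlabeled quads
        # get the model's current posterior.
        q = np.concatenate([y_l.astype(float), sigmoid(X_u @ theta)])
        # M-step: gradient ascent on Q. The gradient of
        # sum_i [q_i * theta^T x_i - ln(1 + exp(theta^T x_i))]
        # is X^T (q - sigmoid(X theta)), as in weighted logistic regression.
        for _ in range(m_steps):
            theta += lr * X.T @ (q - sigmoid(X @ theta)) / X.shape[0]
    return theta
```

A learned θ then scores a new quad as a verb attachment when sigmoid(x · θ) exceeds 0.5.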

5 Experimental Evaluation

We evaluated our method on several datasets containing PP quads of the form {v, n1, p, n2}. The task is to predict whether the PP (p, n2) attaches to the verb v or to the first noun n1.

5.1 Experimental Setup

Datasets. Table 3 shows the datasets used in our experiments. As labeled training data, we used the Wall Street Journal (WSJ) dataset.


DataSet: # Training quads / # Test quads
Labeled data
  WSJ: 20,801 / 3,097
  NYTC: 0 / 293
  WKP: 0 / 381
Unlabeled data
  WKP: 100,000 / 4,473,072

Table 3: Training and test datasets used in our experiments.

          PPAD   PPAD-NB  Collins  Stanford
WKP       0.793  0.740    0.727    0.701
WKP\of    0.759  0.698    0.683    0.652
NYTC      0.843  0.792    0.809    0.679
NYTC\of   0.815  0.754    0.774    0.621
WSJ       0.843  0.816    0.841    N/A
WSJ\of    0.779  0.741    0.778    N/A

Table 4: PPAD vs. baselines.

For the unlabeled training data, we extracted PP quads from Wikipedia (WKP) and randomly selected 100,000, which we found to be a sufficient amount of unlabeled data. The largest labeled test dataset is WSJ, but a large fraction of it, 30%, consists of "of" PP quads, which trivially attach to the noun, as already seen in Figure 3. The New York Times (NYTC) and Wikipedia (WKP) datasets are smaller but contain smaller proportions of "of" PP quads: 15% and 14%, respectively. Additionally, we applied our model to over 4 million unlabeled 5-tuples from Wikipedia. We make this data available for download, along with our manually labeled NYTC and WKP datasets. For the WKP & NYTC corpora, each quad has a preceding noun, n0, as context, resulting in PP 5-tuples of the form {n0, v, n1, p, n2}. The WSJ dataset was only available to us in the form of PP quads with no other sentence information.

Methods Under Comparison. 1) PPAD (Prepositional Phrase Attachment Disambiguator) is our proposed method. It uses diverse types of semantic knowledge, a mixture of labeled and unlabeled training data, a logistic regression

Figure 4: PPAD variations (WordNet types, KB types, unsupervised types, WordNet verbs, Naive Bayes) vs. the Collins baseline and the Stanford parser, on WKP, WKP\of, NYTC, NYTC\of, WSJ, and WSJ\of.

classifier, and expectation maximization (EM) for parameter estimation. 2) Collins is the established baseline among PP attachment algorithms (Collins and Brooks, 1995). 3) Stanford Parser is a state-of-the-art dependency parser, the 2014 online version. 4) PPAD Naive Bayes (NB) is the same as PPAD but uses a generative model, as opposed to the discriminative model used in PPAD.

5.2 PPAD vs. Baselines

Comparison results of our method against the three baselines are shown in Table 4. For each dataset, we also show results when the "of" quads are removed, shown as "WKP\of", "NYTC\of", and "WSJ\of". Our method yields improvements over the baselines. Improvements are especially significant on the datasets for which no labeled data was available (NYTC and WKP). On WKP, our method is 7% and 9% ahead of the Collins baseline and the Stanford parser, respectively. On NYTC, our method is 4% and 6% ahead of the Collins baseline and the Stanford parser, respectively. On WSJ, which is the source of the labeled data, our method is not significantly better than the Collins baseline. We could not evaluate the Stanford parser on the WSJ dataset: the parser requires well-formed sentences, which we could not generate from the WSJ dataset as it was only available to us in the form of PP quads with no other sentence information. For the same reason, we could not generate discourse features, F7, for the WSJ PP quads. For the NYTC and WKP datasets, we generated well-formed short sentences containing only the PP quad and the noun preceding it.


Feature Type                        Precision  Recall  F1
Noun-Noun Binary Relations (F1-2)   low        high    low
Noun Semantic Categories (F3-4)     high       high    high
Verb Role Fillers (F5)              high       low     low
Preposition Definitions (F6)        low        low     low
Discourse Features (F7)             high       low     high
Lexical Features (F8-15)            high       high    high

Table 5: An approximate characterization of feature knowledge sources in terms of precision, recall, and F1.

5.3 Feature Analysis

We found that features F2 and F6 did not improve performance, so we excluded them from the final model, PPAD. This means that binary noun-noun relations were not useful when used permissively (feature F2), but when used selectively (feature F1) we found them to be useful. Our attempt at mapping prepositions to verb definitions produced some noisy mappings, resulting in feature F6 producing mixed results. To analyze the impact of the unlabeled data, we inspected the features and their weights as produced by the PPAD model. From the unlabeled data, new lexical features were discovered that were not in the original labeled data. Some sample new features with high weights for verb attachments are: (perform, song, for, *), (lose, *, by, *), (buy, property, in, *). And for noun attachments: (*, conference, on, *), (obtain, degree, in, *), (abolish, taxes, on, *).

We evaluated several variations of PPAD; the results are shown in Figure 4. For "PPAD-WordNet Verbs", we expanded the data by replacing verbs in PP quads with synonymous WordNet verbs, ignoring verb senses. This resulted in more instances of features F1, F8-10, & F12.

We also used different types of noun categorizations: WordNet classes, semantic types from the NELL knowledge base (Mitchell et al., 2015), and unsupervised types. The KB types and the unsupervised types did not perform well, possibly due to the noise found in these categorizations. WordNet classes showed the best results, hence they were used in the final PPAD model for features F3-4 & F7. In Section 5.1, PPAD corresponds to the best model.

5.4 Discussion: The F1 Score of Knowledge

Why did we not reach 100% accuracy? Should relational knowledge not be providing a much bigger performance boost than we have seen in the results?

To answer these questions, we characterize our features in terms of the precision, recall, and F1 measure of their knowledge sources in Table 5. A low recall feature means that the feature does not fire on many examples: the feature's knowledge source suffers from low coverage. A low precision feature means that when it fires, the feature could be incorrect: the feature's knowledge source contains a lot of errors.

From Table 5, the noun-noun binary relation features (F1-2) have low precision but high recall. This is because the SVO data extracted from the ClueWeb09 corpus, which we used as our relational knowledge source, is very noisy but has high coverage. The low precision of the SVO data causes these features to be detrimental to performance. Notice that when we used a filtered version of the data, in feature F2, the data was no longer detrimental to performance. However, the F2 feature has low recall, and therefore its impact on performance is also limited. The noun semantic category features (F3-4) have high recall and precision, hence it is to be expected that their impact on performance is significant. The verb role filler features (F5), obtained from VerbNet, have high precision but low recall, hence their marginal impact on performance is also to be expected. The preposition definition features (F6) had poor precision, which made them unusable. The discourse features (F7) are based on noun semantic types and lexical features (F8-15), both of which have high recall and precision, hence their useful impact on performance.

In summary, low precision in knowledge is detrimental to performance. In order for knowledge to make even more significant contributions to language understanding, high precision, high recall knowledge sources are required for all feature types. Success in ongoing knowledge base construction efforts will make the performance of our algorithm better.


Relation         Prep.  Attachment accuracy  Example(s)
acquired         from   99.97                BNY Mellon acquired Insight from Lloyds.
hasSpouse        in     91.54                David married Victoria in Ireland.
worksFor         as     99.98                Shubert joined CNN as reporter.
playsInstrument  with   98.40                Kushner played guitar with rock band Weezer.

Table 6: Binary relations extended to ternary relations by mapping to verb-preposition pairs in PP 5-tuples. PPAD predicted verb attachments with accuracy >90% in all relations.

5.5 Application to Ternary Relations

Through the application of ternary relation extraction, we further tested PPAD's PP disambiguation accuracy and illustrated its usefulness for knowledge base population. Recall that a PP 5-tuple of the form {n0, v, n1, p, n2}, whose enclosed PP attaches to the verb v, denotes a ternary relation with arguments n0, n1, & n2. Therefore, we can extract a ternary relation from every 5-tuple for which our method predicts a verb attachment. If we have a mapping between verbs and binary relations from a knowledge base (KB), we can extend KB relations to ternary relations by augmenting the KB relations with a third argument n2.

We considered four KB binary relations and their instances, such as worksFor(TimCook, Apple), from the NELL KB. We then took the collection of 4 million 5-tuples that we extracted from Wikipedia. We mapped verbs in 5-tuples to KB relations, based on significant overlaps of the instances of the KB relations, noun pairs such as (TimCook, Apple), with the n0, n1 pairs in the Wikipedia PP 5-tuple collection. We found that, for example, instances of the noun-noun KB relation "worksFor" match n0, n1 pairs in tuples where v = joined and p = as, with n2 referring to the job title. Other binary relations extended are: "hasSpouse", extended by "in" with the wedding location, and "acquired", extended by "from" with the seller of the company being acquired. Examples are shown in Table 6. In all these mappings, the proportion of verb attachments in the corresponding PP quads is significantly high (> 90%). PPAD is overwhelmingly making the right attachment decisions in this setting. A sketch of this extension step follows.
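The sketch below is our illustration of the extension step, with hypothetical names such as predict_attachment: a KB relation's verb-preposition pattern selects matching 5-tuples, and predicted verb attachments yield ternary facts.

```python
def extend_to_ternary(tuples5, kb_pairs, verb, prep, predict_attachment):
    """For 5-tuples {n0, v, n1, p, n2} whose (v, p) matches a KB relation's
    verb-preposition pattern and whose (n0, n1) pair is a known KB instance,
    a predicted verb attachment yields a ternary fact (n0, n1, n2)."""
    ternary = []
    for (n0, v, n1, p, n2) in tuples5:
        if v == verb and p == prep and (n0, n1) in kb_pairs:
            # Only verb attachments denote a ternary relation.
            if predict_attachment((v, n1, p, n2)) == "verb":
                ternary.append((n0, n1, n2))  # e.g. (Shubert, CNN, reporter)
    return ternary

# e.g. worksFor extended by "joined ... as": verb="joined", prep="as",
# kb_pairs={("TimCook", "Apple"), ...}
```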

Efforts in temporal and spatial relation extraction have shown that higher N-ary relation extraction is challenging. Since prepositions specify details that transform binary relations into higher N-ary relations, our method can be used to read information that can augment binary relations already in KBs. As future work, we would like to incorporate our method into a pipeline for reading beyond binary relations. One possible direction is to read details about the where, why, and who of events and relations, effectively moving from extracting only binary relations to reading at a more general level.

6 Conclusion

We have presented a knowledge-intensive approach to prepositional phrase (PP) attachment disambiguation, which is a type of syntactic ambiguity. Our method incorporates knowledge about verbs, nouns, discourse, and noun-noun binary relations. We trained a model using labeled data and unlabeled data, making use of expectation maximization for parameter estimation. Our method can be seen as an example of tapping into a positive feedback loop for machine reading, which has only become possible in recent years due to the progress made by information extraction and knowledge base construction techniques: using background knowledge from existing resources to read better, in order to further populate knowledge bases with otherwise difficult-to-extract knowledge. As future work, we would like to use our method to extract more than just binary relations.

Acknowledgments

We thank Shashank Srivastava and members of the NELL team at CMU for helpful comments. This research was supported by DARPA under contract number FA8750-13-2-0005.


References

Eneko Agirre, Timothy Baldwin, and David Martinez. 2008. Improving parsing and PP attachment performance with sense information. In Proceedings of ACL-08: HLT, pages 317–325.

Gerry Altmann and Mark Steedman. 1988. Interaction with context during human sentence processing. Cognition, 30:191–238.

Enrique Henestroza Anguiano and Marie Candito. 2011. Parse correction with specialized models for difficult attachment types. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1222–1233.

Michaela Atterer and Hinrich Schutze. 2007. Prepositional phrase attachment without oracles. Computational Linguistics, 33(4):469–476.

Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In IJCAI, volume 7, pages 2670–2676.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250.

Eric Brill and Philip Resnik. 1994. A rule-based approach to prepositional phrase attachment disambiguation. In 15th International Conference on Computational Linguistics, COLING, pages 1198–1204.

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 101–110.

Michael Collins and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 27–38.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the International Conference on Language Resources and Evaluation, LREC, pages 449–454.

Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 355–366.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1535–1545.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Lyn Frazier. 1978. On Comprehending Sentences: Syntactic Parsing Strategies. Ph.D. thesis, University of Connecticut.

Sanda M. Harabagiu and Marius Pasca. 1999. Integrating symbolic and statistical methods for prepositional phrase attachment. In Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, FLAIRS, pages 303–307.

Donald Hindle and Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19(1):103–120.

Dirk Hovy, Ashish Vaswani, Stephen Tratz, David Chiang, and Eduard Hovy. 2011. Models and training for unsupervised preposition sense disambiguation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 323–328.

John Kimball. 1988. Seven principles of surface structure parsing in natural language. Cognition, 2:15–47.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21–40.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, ACL, pages 423–430.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL, pages 236–244.

Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha Pratim Talukdar, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil Antonios Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry Tanti Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. 2015. Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2302–2310.

Ndapandula Nakashole and Tom M. Mitchell. 2014. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 1009–1019.

Ndapandula Nakashole and Gerhard Weikum. 2012. Real-time population of knowledge bases: opportunities and challenges. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 41–45.

Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. 2011. Scalable knowledge harvesting with high precision and high recall. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 227–236.

Ndapandula Nakashole, Tomasz Tylenda, and Gerhard Weikum. 2013. Fine-grained semantic typing of emerging entities. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL, pages 1488–1497.

Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134.

Patrick Pantel and Dekang Lin. 2000. An unsupervised approach to prepositional phrase attachment using contextually similar words. In 38th Annual Meeting of the Association for Computational Linguistics, ACL.

Adwait Ratnaparkhi, Jeff Reynar, and Salim Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 250–255.

Adwait Ratnaparkhi. 1998. Statistical models for unsupervised prepositional phrase attachment. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL, pages 1079–1085.

Vivek Srikumar and Dan Roth. 2013. Modeling semantic relations expressed by prepositions. TACL, 1:231–242.

Jiri Stetina and Makoto Nagao. 1997. Corpus based PP attachment ambiguity resolution with a semantic dictionary. In Proceedings of the Fifth Workshop on Very Large Corpora, pages 66–80.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW, pages 697–706.

Kristina Toutanova, Christopher D. Manning, and Andrew Y. Ng. 2004. Learning random walk models for inducing word dependency distributions. In Machine Learning, Proceedings of the Twenty-First International Conference, ICML.

Olga van Herwijnen, Antal van den Bosch, Jacques M. B. Terken, and Erwin Marsi. 2003. Learning PP attachment for filtering prosodic phrasing. In 10th Conference of the European Chapter of the Association for Computational Linguistics, EACL, pages 139–146.

Greg Whittemore, Kathleen Ferrara, and Hans Brunner. 1990. Empirical study of predictive powers of simple attachment schemes for post-modifier prepositional phrases. In 28th Annual Meeting of the Association for Computational Linguistics, ACL, pages 23–30.

Derry Wijaya, Ndapandula Nakashole, and Tom Mitchell. 2014. CTPs: Contextual temporal profiles for time scoping facts via entity state change detection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP.

Shaojun Zhao and Dekang Lin. 2004. A nearest-neighbor method for resolving PP-attachment ambiguity. In Natural Language Processing - First International Joint Conference, IJCNLP, pages 545–554.