
Combining Distant and Partial Supervision for Relation Extraction

Gabor Angeli, Julie Tibshirani, Jean Y. Wu, Christopher D. Manning
Stanford University
Stanford, CA 94305

{angeli, jtibs, jeaneis, manning}@stanford.edu

Abstract

Broad-coverage relation extraction either requires expensive supervised training data, or suffers from drawbacks inherent to distant supervision. We present an approach for providing partial supervision to a distantly supervised relation extractor using a small number of carefully selected examples. We compare against established active learning criteria and propose a novel criterion to sample examples which are both uncertain and representative. In this way, we combine the benefits of fine-grained supervision for difficult examples with the coverage of a large distantly supervised corpus. Our approach gives a substantial increase of 3.9% end-to-end F1 on the 2013 KBP Slot Filling evaluation, yielding a net F1 of 37.7%.

1 Introduction

Fully supervised relation extractors are limited to relatively small training sets. While able to make use of much more data, distantly supervised approaches either make dubious assumptions in order to simulate fully supervised data, or make use of latent-variable methods which get stuck in local optima easily. We hope to combine the benefits of supervised and distantly supervised methods by annotating a small subset of the available data using selection criteria inspired by active learning.

To illustrate, our training corpus contains 1 208 524 relation mentions; annotating all of these mentions for a fully supervised classifier, at an average of $0.13 per annotation, would cost approximately $160 000. Distant supervision allows us to make use of this large corpus without requiring costly annotation. The traditional approach is based on the assumption that every mention of an entity pair (e.g., Obama and USA) participates in the known relation between the two (i.e., born in). However, this introduces noise, as not every mention expresses the relation we are assigning to it.

We show that by providing annotations for only 10 000 informative examples, combined with a large corpus of distantly labeled data, we can yield notable improvements in performance over the distantly supervised data alone. We report results on three criteria for selecting examples to annotate: a baseline of sampling examples uniformly at random, an established active learning criterion, and a new metric incorporating both the uncertainty and the representativeness of an example. We show that the choice of metric is important – yielding as much as a 3% F1 difference – and that our new proposed criterion outperforms the standard method in many cases. Lastly, we train a supervised classifier on these collected examples, and report performance comparable to distantly supervised methods. Furthermore, we notice that initializing the distantly supervised model using this supervised classifier is critical for obtaining performance improvements.

This work makes a number of concrete contributions. We propose a novel application of active learning techniques to distantly supervised relation extraction. To the best of the authors' knowledge, we are the first to apply active learning to the class of latent-variable distantly supervised models presented in this paper. We show that annotating a proportionally small number of examples yields improvements in end-to-end accuracy. We compare various selection criteria, and show that this decision has a notable impact on the gain in performance. In many ways this reconciles our results with the negative results of Zhang et al. (2012), who show limited gains from naïvely annotating examples. Lastly, we make our annotations available to the research community.1

1 http://nlp.stanford.edu/software/mimlre.shtml


2 Background

2.1 Relation Extraction

We are interested in extracting a set of relations y1 . . . yk from a fixed set of possible relations R, given two entities e1 and e2. For example, we would like to extract that Barack Obama was born in Hawaii. The task is decomposed into two steps: First, sentences containing mentions of both e1 and e2 are collected. The set of these sentences x, marked with the entity mentions for e1 and e2, becomes the input to the relation extractor, which then produces a set of relations which hold between the mentions. We are predominantly interested in the second step – classifying a set of pairs of entity mentions into the relations they express. Figure 1 gives the general setting for relation extraction, with entity pairs Barack Obama and Hawaii, and Barack Obama and president.

Traditionally, relation extraction has fallen into one of four broad approaches: supervised classification, as in the ACE task (Doddington et al., 2004; GuoDong et al., 2005; Surdeanu and Ciaramita, 2007); distant supervision (Craven and Kumlien, 1999; Wu and Weld, 2007; Mintz et al., 2009; Sun et al., 2011; Roth and Klakow, 2013); deterministic rule-based systems (Soderland, 1997; Grishman and Min, 2010; Chen et al., 2010); and translation from open domain information extraction schema (Riedel et al., 2013). We focus on the first two of these approaches.

2.2 Supervised Relation Extraction

Relation extraction can be naturally cast as a supervised classification problem. A corpus of relation mentions is collected, and each mention x is annotated with the relation y, if any, it expresses. The classifier's output is then aggregated to decide the relations between the two entities.
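As a concrete (and simplified) illustration of this aggregation step – ours, not the authors' implementation; the function name, relation labels, and confidence threshold are purely illustrative – one might keep any relation predicted for some mention with sufficient confidence:

```python
from collections import defaultdict

def aggregate_mentions(mention_predictions, threshold=0.5):
    """Aggregate mention-level (relation, confidence) predictions for one
    entity pair into a set of entity-level relations.

    mention_predictions: list of (relation, confidence) tuples, one per
    mention; relation may be None for "no relation".
    """
    best = defaultdict(float)
    for relation, confidence in mention_predictions:
        if relation is not None:
            # Keep the most confident mention-level prediction per relation.
            best[relation] = max(best[relation], confidence)
    return {r for r, conf in best.items() if conf >= threshold}

# Example: three mentions of the pair (Barack Obama, Hawaii).
print(aggregate_mentions([("state_of_birth", 0.9),
                          ("state_of_residence", 0.7),
                          (None, 0.8)]))
# -> {'state_of_birth', 'state_of_residence'}
```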

However, annotating supervised training data is generally expensive to perform at large scale. Although resources such as Freebase or the TAC KBP knowledge base have on the order of millions of training tuples over entities, it is not feasible to manually annotate the corresponding mentions in the text. This has led to the rise of distantly supervised methods, which make use of this indirect supervision, but do not necessitate mention-level supervision.

[Figure 1: The relation extraction setup. For a pair of entities, we collect sentences which mention both entities. These sentences are then used to predict one or more relations between those entities. For instance, the sentences containing both Barack Obama and Hawaii should support the state of birth and state of residence relation. (Figure content: "Barack Obama was born in Hawaii.", "Barack Obama visited Hawaii.", and "The president grew up in Hawaii." linked to state of birth and state of residence; "Barack Obama met former president Clinton." and "Obama became president in 2008." linked to title.)]

2.3 Distant Supervision

Traditional distant supervision makes the assumption that for every triple (e1, y, e2) in a knowledge base, every sentence containing mentions of e1 and e2 expresses the relation y. For instance, taking Figure 1, we would create a datum for each of the three sentences containing Barack Obama and Hawaii labeled with state of birth, and likewise with state of residence, creating 6 training examples overall. Similarly, both sentences involving Barack Obama and president would be marked as expressing the title relation.
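To make this labeling scheme concrete, the following sketch (ours; the knowledge base and sentences mirror Figure 1, but the code is illustrative rather than the paper's actual pipeline) generates one noisy training example per sentence and known relation:

```python
def distantly_label(kb, sentences_by_pair):
    """Create one (sentence, relation) training example for every sentence
    mentioning an entity pair, for every relation the KB lists for that pair."""
    examples = []
    for (e1, e2), relations in kb.items():
        for sentence in sentences_by_pair.get((e1, e2), []):
            for relation in relations:
                examples.append((sentence, relation))
    return examples

kb = {("Barack Obama", "Hawaii"): ["state_of_birth", "state_of_residence"],
      ("Barack Obama", "president"): ["title"]}
sentences_by_pair = {
    ("Barack Obama", "Hawaii"): ["Barack Obama was born in Hawaii.",
                                 "Barack Obama visited Hawaii.",
                                 "The president grew up in Hawaii."],
    ("Barack Obama", "president"): ["Barack Obama met former president Clinton.",
                                    "Obama became president in 2008."],
}
# 3 sentences x 2 relations = 6 noisy examples for (Obama, Hawaii),
# plus 2 sentences x 1 relation for (Obama, president).
print(len(distantly_label(kb, sentences_by_pair)))  # -> 8
```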

While this allows us to leverage a large database effectively, it nonetheless makes a number of naïve assumptions. First – explicit in the formulation of the approach – it assumes that every mention expresses some relation, and furthermore expresses the known relation(s). For instance, the sentence Obama visited Hawaii would be erroneously treated as a positive example of the born in relation. Second, it implicitly assumes that our knowledge base is complete: entity mentions with no known relation are treated as negative examples.

The first of these assumptions is addressed by multi-instance multi-label (MIML) learning, described in Section 2.4. Min et al. (2013) address the second assumption by extending the MIML model with additional latent variables, while Xu et al. (2013) allow feedback from a coarse relation extractor to augment labels from the knowledge base. These latter two approaches are compatible with but are not implemented in this work.

2.4 Multi-Instance Multi-Label Learning

The multi-instance multi-label (MIML-RE) model of Surdeanu et al. (2012), which builds upon work by Hoffmann et al. (2011) and Riedel et al. (2010), addresses the assumptions of distantly supervised relation extractors in a principled way by positing a latent mention-level annotation.

[Figure 2: The MIML-RE model, as shown in Surdeanu et al. (2012). The outer plate corresponds to each of the n entity pairs in our knowledge base. Each entity pair has a set of mention pairs Mi, and a corresponding plate in the diagram for each mention pair in Mi. The variable x represents the input mention pair, whereas y represents the positive and negative relations for the given pair of entities. The latent variable z denotes a mention-level prediction for each input. The weight vector for the multinomial z classifier is given by wz, and there is a weight vector wj for each binary y classifier.]

The model groups mentions according to their entity pair – for instance, every mention pair with Obama and Hawaii would be grouped together. A latent variable zi is created for every mention i, where zi ∈ R ∪ {None} takes a single relation label, or a no relation marker. We create |R| binary variables y representing the known positive and negative relations for the entity pair. A set of binary classifiers (log-linear factors in the graphical model) links the latent predictions z1 . . . z|Mi| and each yj. These classifiers include two classes of features: first, a binary feature which fires if at least one of the mentions expresses a known relation between the entity pair, and second, a feature for each co-occurrence of relations for a given entity pair. Figure 2 describes the model.
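A rough sketch of these two feature classes (ours, heavily simplified relative to the actual MIML-RE implementation; the function and feature names are illustrative) might compute the features for the binary y classifier of one relation from the current latent assignments as follows:

```python
def y_features(z_labels, relation_j):
    """Features for the binary classifier deciding whether relation_j holds
    for an entity pair, given the current latent mention labels z_labels
    (a list of relation names or None for "no relation")."""
    features = {}
    # "At least once": some mention is labeled with the relation in question.
    features["at_least_once"] = 1.0 if relation_j in z_labels else 0.0
    # Co-occurrence: relation_j considered alongside another relation that is
    # predicted for the same entity pair.
    for other in set(z_labels):
        if other is not None and other != relation_j:
            features["co-occurs_with=" + other] = 1.0
    return features

print(y_features(["state_of_birth", None, "state_of_residence"],
                 "state_of_birth"))
```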

2.5 Background on Active Learning

We describe preliminaries and prior work on active learning; we use this framework to propose two sampling schemes in Section 3 which we use to annotate mention-level labels for MIML-RE.

One way of expressing the generalization error of a learned hypothesis $\hat{h}$ is through its mean-squared error with respect to the true hypothesis $h$:

$$E[(\hat{h}(x) - h(x))^2] = E\big[E[(\hat{h}(x) - h(x))^2 \mid x]\big] = \int_x E[(\hat{h}(x) - h(x))^2 \mid x]\, p(x)\, dx.$$

The integrand can be further broken into bias and variance terms:

$$E[(\hat{h}(x) - h(x))^2] = \big(E[\hat{h}(x)] - h(x)\big)^2 + E\big[(\hat{h}(x) - E[\hat{h}(x)])^2\big]$$

where for simplicity we've dropped the conditioning on x.

Many traditional sampling strategies, such as query-by-committee (QBC) (Freund et al., 1992; Freund et al., 1997) and uncertainty sampling (Lewis and Gale, 1994), work by decreasing the variance of the learned model. In QBC, we first create a 'committee' of classifiers by randomly sampling their parameters from a distribution based on the training data. These classifiers then make predictions on the unlabeled examples, and the examples on which there is the most disagreement are selected for labeling. This strategy can be seen as an attempt to decrease the version space – the set of classifiers that are consistent with the labeled data. Decreasing the version space should lower variance, since variance is inversely related to the size of the hypothesis space.

In most scenarios, active learning does not concern itself with the bias term. If a model is fundamentally misspecified, then no amount of additional training data can lower its bias. However, our paradigm differs from the traditional setting, in that we are annotating latent variables in a model with a non-convex objective. These annotations may help increase the convexity of our objective, leading us to a more accurate optimum and thereby lowering bias.

The other component to consider is $\int_x \cdots p(x)\, dx$. This suggests that it is important to choose examples that are representative of the underlying distribution p(x), as we want to label points that will improve the classifier's predictions on as many and as high-probability examples as possible. Incorporating a representativeness metric has been shown to provide a significant improvement over plain QBC or uncertainty sampling (McCallum and Nigam, 1998; Settles, 2010).

2.6 Active Learning for Relation Extraction

Several papers have explored active learning for relation extraction. Fu and Grishman (2013) employ active learning to create a classifier quickly for new relations, simulated from the ACE corpus. Finn and Kushmerick (2003) compare a number of selection criteria – including QBC – for a supervised classifier. To the best of our knowledge, we are the first to apply active learning to distantly supervised relation extraction. Furthermore, we evaluate our selection criteria live in a real-world setting, collecting new sentences and evaluating on an end-to-end task.

For latent variable models, McCallum and Nigam (1998) apply active learning to semi-supervised document classification. We take inspiration from their use of QBC and the choice of metric for classifier disagreement. However, their model assumes a fully Bayesian set-up, whereas ours does not require strong assumptions about the parameter distributions.

Settles et al. (2008) use active learning to improve a multiple-instance classifier. Their model is simpler in that it does not allow for unobserved variables or multiple labels, and the authors only evaluate on image retrieval and synthetic text classification datasets.

3 Example Selection

We describe three criteria for selecting examples to annotate. The first – sampling uniformly – is a baseline for our hypothesis that intelligently selecting examples is important. For this criterion, we select mentions uniformly at random from the training set to annotate. This is the approach used in Zhang et al. (2012). The other two criteria rely on a metric for disagreement provided by QBC; we describe our adaptation of QBC for MIML-RE as a preliminary to introducing these criteria.

3.1 QBC For MIML-RE

We use a version of QBC based on bootstrapping (Saar-Tsechansky and Provost, 2004). To create the committee of classifiers, we re-sample the training set with replacement 7 times and train a model over each sampled dataset. We measure disagreement on z-labels among the classifiers using a generalized Jensen-Shannon divergence (McCallum and Nigam, 1998), taking the average KL divergence of all classifier judgments.
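A minimal sketch of this bootstrapping scheme (ours; `train_model` is a placeholder standing in for fitting MIML-RE, or any classifier, on a resampled dataset):

```python
import random

def bootstrap_committee(dataset, train_model, k=7, seed=0):
    """Train a committee of k models, each on a resample (with replacement)
    of the training set, as in QBC via bootstrapping."""
    rng = random.Random(seed)
    committee = []
    for _ in range(k):
        resampled = [rng.choice(dataset) for _ in range(len(dataset))]
        committee.append(train_model(resampled))
    return committee
```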

We first calculate the mention-level confidences. Note that $z_i^{(m)} \in M_i$ denotes the latent variable in entity pair i with index m; $z_i^{(-m)}$ denotes the set of all latent variables except $z_i^{(m)}$:

$$p(z_i^{(m)} \mid y_i, x_i) = \frac{p(y_i, z_i^{(m)} \mid x_i)}{p(y_i \mid x_i)} = \frac{\sum_{z_i^{(-m)}} p(y_i, z_i \mid x_i)}{\sum_{z_i^{(m)}} p(y_i, z_i^{(m)} \mid x_i)}.$$

Notice that the denominator just serves to normalize the probability within a sentence group. We can rewrite the numerator as follows:

$$\sum_{z_i^{(-m)}} p(y_i, z_i \mid x_i) = \sum_{z_i^{(-m)}} p(y_i \mid z_i)\, p(z_i \mid x_i) = p(z_i^{(m)} \mid x_i) \sum_{z_i^{(-m)}} p(y_i \mid z_i)\, p(z_i^{(-m)} \mid x_i).$$

For computational efficiency, we approximate $p(z_i^{(-m)} \mid x_i)$ with a point mass at its maximum. Next, we calculate the Jensen-Shannon (JS) divergence from the k bootstrapped classifiers:

$$\frac{1}{k} \sum_{c=1}^{k} \mathrm{KL}\big(p_c(z_i^{(m)} \mid y_i, x_i)\, \big\|\, p_{\mathrm{mean}}(z_i^{(m)} \mid y_i, x_i)\big) \quad (1)$$

where $p_c$ is the probability assigned by each of the k classifiers to the latent $z_i^{(m)}$, and $p_{\mathrm{mean}}$ is the average of these probabilities. We use this metric to capture the disagreement of our model with respect to a particular latent variable. This is then used to inform our selection criteria.
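Concretely, the disagreement score in (1) can be computed as in the following sketch (ours; each committee member supplies a distribution over relation labels for one latent variable, and all distributions are assumed to share the same, smoothed, label set):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as label -> prob dicts."""
    return sum(p[label] * math.log(p[label] / q[label])
               for label in p if p[label] > 0.0)

def js_disagreement(committee_distributions):
    """Generalized Jensen-Shannon disagreement: the average KL divergence of
    each committee member's distribution from the mean distribution."""
    k = len(committee_distributions)
    labels = committee_distributions[0].keys()
    mean = {label: sum(d[label] for d in committee_distributions) / k
            for label in labels}
    return sum(kl_divergence(d, mean) for d in committee_distributions) / k

# Two committee members that disagree sharply yield a high score.
print(js_disagreement([{"title": 0.9, "no_relation": 0.1},
                       {"title": 0.1, "no_relation": 0.9}]))
```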

We note that QBC may be especially useful in our situation as our objective is highly nonconvex. If two committee members disagree on a latent variable, it is likely because they converged to different local optima; annotating that example could help bring the classifiers into agreement.

The second selection criterion we consider is the most straightforward application of QBC – selecting the examples with the highest JS disagreement. This allows us to compare our criterion, described next, against an established criterion from the active learning literature.


3.2 Sample by JS Disagreement

We propose a novel active learning sampling criterion that incorporates not only disagreement but also representativeness in selecting examples to annotate. Prior work has taken a weighted combination of an example's disagreement and a score corresponding to whether the example is drawn from a dense portion of the feature space (e.g., McCallum and Nigam (1998)). However, this requires both selecting a criterion for defining density (e.g., distance metric in feature space), and tuning a parameter for the relative weight of disagreement versus representativeness.

Instead, we account for choosing representative examples by sampling without replacement proportional to the example's disagreement. Formally, we define the probability of selecting an example $z_i^{(m)}$ to be proportional to the Jensen-Shannon divergence in (1). Since the training set is an approximation to the prior distribution over examples, sampling uniformly over the training set is an approximation to sampling from the prior probability of seeing an input x. We can view our criterion as an approximation to sampling proportional to the product of two densities: a prior over examples x, and the JS divergence mentioned above.
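One simple way to realize this criterion (our sketch, not the authors' code; the function name and the fallback to uniform sampling are our own choices) is sequential weighted sampling without replacement, with each example's weight set to its JS disagreement:

```python
import random

def sample_proportional_without_replacement(examples, weights, n, seed=0):
    """Draw n distinct examples; each round, pick one with probability
    proportional to its weight (here, its JS disagreement)."""
    rng = random.Random(seed)
    pool = list(zip(examples, weights))
    selected = []
    for _ in range(min(n, len(pool))):
        total = sum(w for _, w in pool)
        if total <= 0:                      # all remaining weights are zero:
            idx = rng.randrange(len(pool))  # fall back to uniform sampling
        else:
            r = rng.uniform(0, total)
            cumulative, idx = 0.0, len(pool) - 1
            for i, (_, w) in enumerate(pool):
                cumulative += w
                if r <= cumulative:
                    idx = i
                    break
        example, _ = pool.pop(idx)
        selected.append(example)
    return selected

print(sample_proportional_without_replacement(
    ["m1", "m2", "m3", "m4"], [0.9, 0.05, 0.5, 0.0], n=2))
```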

4 Incorporating Sentence-Level Annotations

Following Surdeanu et al. (2012), MIML-RE is trained through hard discriminative Expectation Maximization, inferring the latent z values in the E-step and updating the weights for both the z and y classifiers in the M-step. During the E-step, we constrain the latent z to match our sentence-level annotations when available.

It is worth noting that even in the hard-EM regime, we can in principle incorporate annotator uncertainty elegantly into the model. At each E-step, each $z_i$ is set according to

$$z_i^{(m)*} \approx \arg\max_{z \in R} \Big[ p(z \mid x_i^{(m)}, w_z) \times \prod_r p(y_i^{(r)} \mid z'_i, w_y^{(r)}) \Big]$$

where $z'_i$ contains the inferred labels from the previous iteration, but with its mth component replaced by $z_i^{(m)}$.

By setting the distribution $p(z \mid x_i^{(m)}, w_z)$ to reflect uncertainty among annotators, we can leave open the possibility for the model to choose a relation which annotators deemed unlikely, but the model nonetheless prefers. For simplicity, however, we treat our annotations as a hard assignment.
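The constrained hard E-step can be sketched as follows (ours, heavily simplified: `p_z` and `p_y` are placeholder callables standing in for the z and y classifiers, annotated mentions are simply clamped to their labels, and we also allow the no-relation label as a candidate):

```python
def hard_e_step(mentions, relations, current_z, annotations, p_z, p_y):
    """One hard E-step for a single entity pair.

    mentions:    list of mention inputs x.
    relations:   the relation label set R.
    current_z:   latent labels from the previous iteration (None = no relation).
    annotations: dict mapping mention index -> human label; these latent
                 variables are clamped rather than inferred.
    p_z(z, x):   mention-level classifier score p(z | x, w_z).
    p_y(r, zs):  probability of the observed value of y_r given latent labels zs.
    """
    new_z = list(current_z)
    for m, x in enumerate(mentions):
        if m in annotations:                 # sentence-level supervision
            new_z[m] = annotations[m]
            continue
        best_label, best_score = current_z[m], float("-inf")
        for z in list(relations) + [None]:
            proposal = list(current_z)
            proposal[m] = z                  # replace only the m-th component
            score = p_z(z, x)
            for r in relations:
                score *= p_y(r, proposal)
            if score > best_score:
                best_label, best_score = z, score
        new_z[m] = best_label
    return new_z
```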

In addition to incorporating annotations during training, we can also use this data to intelligently initialize the model. Since the MIML-RE objective is non-convex, the initialization of the classifier weights wy and wz is important. The y classifiers are initialized with the "at-least-once" assumption of Hoffmann et al. (2011); wz can be initialized either using traditional distant supervision or from a supervised classifier trained on the annotated sentences. If initialized with a supervised classifier, the model can be viewed as augmenting this supervised model with a large distantly labeled corpus, providing both additional entity pairs to train from, and additional mentions for an annotated entity pair.

5 Crowdsourced Example Annotation

Most prior work on active learning is done by simulation on a fully labeled dataset; such a dataset doesn't exist for our case. Furthermore, a key aim of this paper is to practically improve state-of-the-art performance in relation extraction in addition to evaluating active learning criteria. Therefore, we develop and execute an annotation task for collecting labels for our selected examples.

We utilize Amazon Mechanical Turk to crowdsource annotations. For each task, the annotator (Turker) is presented with the task description, followed by 15 questions, 2 of which are randomly placed controls. For each question, we present Turkers with a relation mention and the top 5 relation predictions from our classifier. The Turker also has an option to freely specify a relation not presented in the first five options, or mark that there is no relation. We attempt to heuristically match common free-form answers to official relations.

[Figure 3: The task shown to Amazon Mechanical Turk workers. A sentence along with the top 5 relation predictions from our classifier are shown to Turkers, as well as an option to specify a custom relation or manually enter "no relation." The correct response for this example should be either no relation or a custom relation.]

To maintain the quality of the results, we discard all submissions in which both controls were answered incorrectly, and additionally discard all submissions from Turkers who failed the controls on more than 1/3 of their submissions. Rejected tasks were republished for other workers to complete. We collect 5 annotations for each example, and use the most commonly agreed answer as the ground truth. Ties are broken arbitrarily, except in the case of deciding between a relation and no relation, in which case the relation was always chosen.
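The vote aggregation can be summarized by the following sketch (ours; it implements the majority vote with relation-versus-no-relation ties resolved in favor of the relation, and the label strings are illustrative):

```python
from collections import Counter

def aggregate_votes(votes):
    """Pick the most common answer among (typically 5) Turker votes.
    Ties are broken arbitrarily, except that a relation always wins a tie
    against the 'no_relation' label."""
    counts = Counter(votes)
    top_count = max(counts.values())
    tied = [label for label, c in counts.items() if c == top_count]
    if len(tied) > 1:
        non_none = [label for label in tied if label != "no_relation"]
        if non_none:
            return non_none[0]      # arbitrary choice among tied relations
    return tied[0]

print(aggregate_votes(["title", "no_relation", "title",
                       "no_relation", "employee_of"]))   # -> 'title'
```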

A total of 23 725 examples were annotated, covering 10 000 examples for each of the three selection criteria. Note that there is overlap between the examples selected for the three criteria. In addition, 10 023 examples were annotated during development; these are included in the set of all annotated examples, but excluded from any of the three criteria. The compensation per task was 23 cents; the total cost of annotating examples was $3156, in addition to $204 spent on developing the task. Informally, Turkers achieved an accuracy of around 75%, as evaluated by a paper author, performing disproportionately well on identifying the no relation label.

6 Experiments

We evaluate the three high-level research contributions of this work: we show that we improve the accuracy of MIML-RE, we validate the effectiveness of our selection criteria, and we provide a corpus of annotated examples, evaluating a supervised classifier trained on this corpus. The training and testing methodology for evaluating these contributions is given in Sections 6.1 and 6.2; experiments are given in Section 6.3.

6.1 Training Setup

We adopt the setup of Surdeanu et al. (2012) for training the MIML-RE model, with minor modifications. We use both the 2010 and 2013 KBP official document collections, as well as a July 2013 dump of Wikipedia as our text corpus. We subsample negatives such that 1/3 of our dataset consists of entity pairs with no known relations. In all experiments, MIML-RE is trained for 7 iterations of EM; for efficiency, the z classifier is optimized using stochastic gradient descent;2 the y classifiers are optimized using L-BFGS.

Similarly to Surdeanu et al. (2011), we assign negative relations which are either incompatible with the known positive relations (e.g., relations whose co-occurrence would violate type constraints), or are functional relations in which another entity already participates. For example, if we know that Obama was born in the United States, we could add born in as a negative relation to the pair Obama and Kenya.
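A rough sketch of this negative-label heuristic (ours; the functional and incompatibility sets passed in are illustrative placeholders, not the actual KBP constraints):

```python
def negative_relations(e1, e2, kb, all_relations, functional, incompatible):
    """Heuristically choose negative relation labels for the pair (e1, e2).

    kb:           dict mapping (entity, relation) -> set of known filler entities.
    functional:   set of relations that admit only one filler per entity.
    incompatible: dict mapping a relation to the set of relations it cannot
                  co-occur with (e.g., due to type constraints).
    """
    positives = {r for (ent, r), fillers in kb.items()
                 if ent == e1 and e2 in fillers}
    negatives = set()
    for r in all_relations:
        if r in positives:
            continue
        # Incompatible with a known positive relation.
        if any(r in incompatible.get(p, set()) for p in positives):
            negatives.add(r)
        # Functional relation already filled by a different entity: knowing
        # born_in(Obama, United States) makes born_in negative for (Obama, Kenya).
        elif r in functional and kb.get((e1, r)) and e2 not in kb[(e1, r)]:
            negatives.add(r)
    return negatives

kb = {("Obama", "born_in"): {"United States"}}
print(negative_relations("Obama", "Kenya", kb,
                         all_relations=["born_in", "employee_of"],
                         functional={"born_in"},
                         incompatible={}))   # -> {'born_in'}
```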

Our dataset consists of 325 891 entity pairs with at least one positive relation, and 158 091 entity pairs with no positive relations. Pairs with at least one known relation have an average of 4.56 mentions per group; groups with no known relations have an average of 1.55 mentions per group. In total, 1 208 524 distinct mentions are considered; the annotated examples are selected from this pool.

6.2 Testing Methodology

We compare against the original MIML-RE model using the same dataset and evaluation methodology as Surdeanu et al. (2012). This allows for an evaluation where the only free variable between this and prior work is the predictions of the relation extractor.

Additionally, we evaluate the relation extractors in the context of Stanford's end-to-end KBP system (Angeli et al., 2014) using the NIST TAC-KBP 2013 English Slotfilling evaluation. In the end-to-end framework, the input to the system is a query entity and a set of articles, and the output is a set of slot fills – each slot fill is a candidate triple in the knowledge base, the first element of which is the query entity. This amounts to roughly populating a data structure like Wikipedia infoboxes automatically from a large corpus of text.

Importantly, an end-to-end evaluation in a top-performing full system gives a more accurate idea of the expected real-world gain from each model. Both the information retrieval component providing candidates to the relation extractor, as well as the consistency and inference performed on the classifier output, introduce bias in this evaluation's sensitivity to particular types of errors. Mistakes which are easy to filter, or are difficult to retrieve using IR, are less important in this evaluation; in contrast, factors such as providing good confidence scores for consistency become more important.

2 For the sake of consistency, the supervised classifiers and those in Mintz++ are trained identically to the z classifiers in MIML-RE.

Method      Init  |    Not Used     |     Uniform     |     High JS     |    Sample JS    |  All Available
                  |  P     R    F1  |  P     R    F1  |  P     R    F1  |  P     R    F1  |  P     R    F1
Mintz++      —    | 41.3  28.2  33.5|   —    —    —   |   —    —    —   |   —    —    —   |   —    —    —
MIML-RE     Dist  | 38.0  30.5  33.8| 39.2  30.4  34.2| 41.7  28.9  34.1| 36.6  31.1  33.6| 37.5  30.6  33.7
MIML-RE     Sup   | 35.1  35.6  35.4| 34.4  35.0  34.7| 46.2  30.8  37.0| 39.4  36.2  37.7| 36.0  37.1  36.5
Supervised   —    |   —    —    —   | 35.5  28.9  31.9| 31.3  33.2  32.2| 33.5  35.0  34.2| 32.9  33.4  33.2

Table 1: A summary of results on the end-to-end KBP 2013 evaluation for various experiments. The first column denotes the algorithm used: either traditional distant supervision (Mintz++), MIML-RE, or a supervised classifier. In the case of MIML-RE, the model may be initialized either using Mintz++, or the corresponding supervised classifier (the "Not Used" column is initialized with the "All" supervised classifier). One of five active learning scenarios is evaluated: no annotated examples provided, the three active learning criteria, and all available examples used. The entry in blue denotes the basic MIML-RE model; entries in gray perform worse than this model. The bold items denote the best performance among selection criteria.

For the end-to-end evaluation, we use the official evaluation script with two changes: First, all systems are evaluated with provenance ignored, so as not to penalize any system for finding a new provenance not validated in the official evaluation key. Second, each system reports its optimal F1 along its P/R curve, yielding results which are optimistic when compared against other systems entered into the competition. However, this also yields results which are invariant to threshold tuning, and is therefore more appropriate for comparing between systems in this paper.
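For instance, the optimal F1 along a precision/recall curve can be computed by sweeping a confidence threshold over the returned slot fills, as in this sketch (ours; scoring against the gold key is abstracted into per-prediction correctness flags):

```python
def optimal_f1(predictions, num_gold):
    """Best F1 over all confidence thresholds.

    predictions: list of (confidence, is_correct) pairs for returned slot fills.
    num_gold:    number of gold slot fills in the evaluation key.
    """
    predictions = sorted(predictions, key=lambda p: -p[0])
    best_f1, correct = 0.0, 0
    for returned, (_, is_correct) in enumerate(predictions, start=1):
        correct += int(is_correct)
        precision = correct / returned
        recall = correct / num_gold
        if precision + recall > 0:
            best_f1 = max(best_f1, 2 * precision * recall / (precision + recall))
    return best_f1

print(optimal_f1([(0.9, True), (0.8, False), (0.7, True), (0.4, False)], 5))
# -> 0.5 (achieved by keeping the top three predictions)
```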

Development was done on the KBP 2010–2012 queries, and results are reported using the 2013 queries as a simulated test set. Our best system achieves an F1 of 37.7; the top two teams at KBP 2013 (of 18 entered) achieved F1 scores of 40.2 and 37.1 respectively, ignoring provenance.

6.3 Results

Table 1 summarizes all results for the end-to-end task; relevant features of the table are copied in subsequent sections to illustrate key trends. Models which perform worse than the original MIML-RE model (MIML-RE, initialized with "Dist," under "Not Used") are denoted in gray. The best performing model improves on the base model by 3.9 F1 points on the end-to-end task.

System                     P     R     F1
Mintz++                   41.3  28.2  33.5
MIML + Dist               38.0  30.5  33.8
MIML + Sup                35.1  35.6  35.4
MIML + Dist + SampleJS    36.6  31.1  33.6
MIML + Sup + SampleJS     39.4  36.2  37.7

Table 2: A summary of improvements to MIML-RE on the end-to-end slotfilling task, copied from Table 1. Mintz++ is the traditional distantly supervised model. The second row corresponds to the unmodified MIML-RE model. The third row corresponds to MIML-RE initialized with a supervised classifier (trained on all examples). The fourth row is MIML-RE with annotated examples incorporated during training (but not initialization). The last row shows the best results obtained by our model.

We evaluate each of the individual contributions of the paper: improving the accuracy of the MIML-RE relation extractor, evaluating our example selection criteria, and demonstrating the annotated examples' effectiveness for a fully-supervised relation extractor.

Improve MIML-RE Accuracy   A key goal of this work is to improve the accuracy of the MIML-RE model; we show that we improve the model both on the end-to-end slotfilling task (Table 2) as well as on a standard evaluation (Figure 5). Similar to our work, recent work by Pershina et al. (2014) incorporates labeled data to guide MIML-RE during training. They make use of labeled data to extract training guidelines, which are intended to generalize across many examples. We show that we can match or outperform their improvements with our best criterion.

System                     P     R     F1
MIML + Sup                35.1  35.6  35.4
MIML + Sup + Uniform      34.4  35.0  34.7
MIML + Sup + HighJS       46.2  30.8  37.0
MIML + Sup + SampleJS     39.4  36.2  37.7
MIML + Sup + All          36.0  37.1  36.5

Table 3: A summary of the performance of each example selection criterion. In each case, the model was initialized with a supervised classifier. The first row corresponds to the MIML-RE model initialized with a supervised classifier. The middle three rows show performance for the three selection criteria, used both for initialization and during training. The last row shows results if all available annotations are used, independent of their source.

System                     P     R     F1
Mintz++                   41.3  28.2  33.5
MIML + Dist               38.0  30.5  33.8
Supervised + SampleJS     33.5  35.0  34.2
MIML + Sup                35.1  35.6  35.5
MIML + Sup + SampleJS     39.4  36.2  37.7

Table 4: A comparison of the best performing supervised classifier with other systems. The top section compares the supervised classifier with prior work. The lower section highlights the improvements gained from initializing MIML-RE with a supervised classifier.

A few interesting trends emerge from the end-to-end results in Table 2. Using annotated sentences during training alone did not improve performance consistently, even hurting performance when the SampleJS criterion was used. This supports an intuition that the initialization of the model is important, and that it is relatively difficult to coax the model out of a local optimum if it is initialized poorly. This is further supported by the improvement in performance when the model is initialized with a supervised classifier, even when no examples are used during training. Similar trends are reported in prior work, e.g., Smith and Eisner (2007), Section 4.4.6.

[Figure 4: MIML-RE and Mintz++ evaluated according to Surdeanu et al. (2012); precision/recall curves for MIML-RE, Surdeanu et al. (2012), and Mintz++. The original model from the paper is plotted for comparison, as our training methodology is somewhat different.]

[Figure 5: Our best active learning criterion (Sample JS) evaluated against our version of MIML-RE, alongside the best system of Pershina et al. (2014); precision/recall curves.]

Also interesting is the relatively small gain MIML-RE provides over traditional distant supervision (Mintz++) in this setting. We conjecture that the mistakes made by Mintz++ are often relatively easily filtered by the downstream consistency component. This is supported by Figure 4; we evaluate our trained MIML-RE model against Mintz++ and the results reported in Surdeanu et al. (2012). We show that our model performs as well as or better than the original implementation, and consistently outperforms Mintz++.

Evaluate Selection Criteria   A key objective of this work is to evaluate how much of an impact careful selection of annotated examples has on the overall performance of the system. We evaluate the three selection criteria from Section 3.2, showing the results for MIML-RE in Table 3; results for the supervised classifier are given in Table 1. In both cases, we show that the sampled JS criterion performs comparably to or better than the other criteria.

At least two interesting trends can be noted from these results: First, the uniformly sampled criterion performed worse than MIML-RE initialized with a supervised classifier. This may be due to noise in the annotation: a small number of annotation errors on entity pairs with only a single corresponding mention could introduce dangerous noise into training. These singleton mentions will rarely have disagreement between the committee of classifiers, and therefore will generally only be selected in the uniform criterion.

Second, adding in the full set of examples did not improve performance – in fact, performance generally dropped in this scenario. We conjecture that this is due to the inclusion of the uniformly sampled examples, with performance dropping for the same reasons as above.

Both of these results can be reconciled with the results of Zhang et al. (2012); like this work, they annotated examples to analyze the trade-off between adding more data to a distantly supervised system, and adding more direct supervision. They conclude that annotations provide only a relatively small improvement in performance. However, their examples were uniformly selected from the training corpus, and did not make use of the structure provided by MIML-RE. Our results agree in that neither the uniform selection criterion nor the supervised classifier significantly outperformed the unmodified MIML-RE model; nonetheless, we show that if care is taken in selecting these labeled examples we can achieve noticeable improvements in accuracy.

We also evaluate our selection criteria on the evaluation of Surdeanu et al. (2012), both initialized with Mintz++ (Figure 7) and with the supervised classifier (Figure 6). These results mirror those in the end-to-end evaluation; when initialized with the supervised classifier, the high disagreement (High JS) and sampling proportional to disagreement (Sample JS) criteria clearly outperform both the base MIML-RE model as well as the uniform sampling criterion. Using the annotated examples only during training yielded no perceivable benefit over the base model (Figure 7).

[Figure 6: A comparison of models trained with various selection criteria on the evaluation of Surdeanu et al. (2012), all initialized with the corresponding supervised classifier; precision/recall curves for Sample JS, High JS, Uniform, and MIML-RE.]

[Figure 7: A comparison of models trained with various selection criteria on the evaluation of Surdeanu et al. (2012), all initialized with Mintz++; precision/recall curves for Sample JS, High JS, Uniform, and MIML-RE.]

Supervised Relation Extractor   The examples collected can be used to directly train a supervised classifier, with results summarized in Table 4. The most salient insight is that the performance of the best supervised classifier is similar to that of the MIML-RE model, despite being trained on nearly two orders of magnitude less training data.

More interestingly, however, the supervised classifier provides a noticeably better initialization for MIML-RE than Mintz++, yielding better results even without enforcing the labels during EM. These results suggest that the power gained from the more sophisticated MIML-RE model is best used in conjunction with a small amount of training data. That is, using MIML-RE as a principled model for combining a large distantly labeled corpus and a small number of careful annotations yields significant improvement over using either of the two data sources alone.


Relation                   #     P (base/best)   R (base/best)   F1 (base/best)
no relation              3073        —                —                —
employee of              1978     29 / 32          33 / 46          31 / 38
countries of res.        1061     30 / 42           7 / 40          11 / 41
states of residence       427     57 / 33          14 /  7          23 / 12
cities of residence       356     31 / 52           9 / 30          14 / 38
(org:)member of           290      0 /  0           0 /  0           0 /  0
country of hq             280     63 / 62          65 / 62          64 / 62
top members               221     36 / 26          50 / 60          42 / 36
country of birth          205     22 /  0          40 /  0          29 /  0
parents                   196     10 / 26          31 / 54          15 / 35
city of hq                194     46 / 52          57 / 61          51 / 56
(org:)alt names           184     52 / 48          39 / 39          45 / 43
founded by                180    100 / 89          29 / 38          44 / 53
city of birth             145     17 / 50           8 / 17          11 / 25
state of hq               132     50 / 64          30 / 35          38 / 45
title                     121     20 / 26          28 / 35          23 / 30
subsidiaries              105     33 / 25           6 /  3          10 /  5
founded                    90     62 / 82          62 / 69          62 / 75
spouse                     88     37 / 54          85 / 85          51 / 66
origin                     86     42 / 43          68 / 70          51 / 53
state of birth             83      0 / 50           0 / 10           0 / 17
charges                    69     54 / 54          16 / 16          24 / 24
cause of death             69     93 / 93          39 / 39          55 / 55
(per:)alt names            69      9 / 20           2 /  3           3 /  6
country of death           65    100 / 100         10 / 10          18 / 18
members                    54      0 /  0           0 /  0           0 /  0
children                   52     53 / 62          14 / 18          22 / 27
parents                    50     64 / 64          28 / 28          39 / 39
city of death              38     42 / 75          16 / 19          23 / 30
dissolved                  38      0 /  0           0 /  0           0 /  0
date of death              33     64 / 64          44 / 39          52 / 48
political affiliation      23      7 / 25         100 / 100         13 / 40
state of death             19      0 /  0           0 /  0           0 /  0
shareholders               19      0 /  0           0 /  0           0 /  0
siblings                   16     50 / 50          33 / 33          40 / 40
schools attended           14     80 / 78          41 / 48          54 / 60
date of birth              11    100 / 100         85 / 85          92 / 92
other family                9      0 /  0           0 /  0           0 /  0
age                         4     94 / 97          94 / 90          94 / 93
# of employees              3      0 /  0           0 /  0           0 /  0
religion                    2    100 / 100         29 / 29          44 / 44
website                     0     25 /  0           3 /  0           6 /  0

Table 5: A summary of relations annotated, and end-to-end slotfilling performance by relation. The first column gives the relation; the second shows the number of examples annotated. The subsequent columns show the performance of the unmodified MIML-RE model and our best performing model (SampleJS), given here as base / best pairs. Changes in values are bolded; positive changes are shown in green and negative changes in red. The most frequent 10 relations in the evaluation are likewise bolded.

6.4 Analysis By Relation

In this section, we explore which of the KBP relations were shown to Turkers, and whether the improvements in accuracy correspond to these relations. We compare only the unmodified MIML-RE model, and our best model (MIML-RE initialized with the supervised classifier, under the SampleJS criterion). Results are shown in Table 5.

A few interesting trends emerge from this analysis. We note that annotating even 80+ examples for a relation seems to provide a consistent boost in accuracy, whereas relations with fewer annotated examples tended to show little or no change. However, the gains of our model are not universal across relation types, even dropping noticeably on some – for instance, F1 drops on both state of residence and country of birth. This could suggest systematic noise from Turker judgments; e.g., for foreign geography (state of residence) or ambiguous relations (top members).

An additional insight from the table is the mismatch between examples chosen to be annotated, and the most popular relations in the KBP evaluation. For instance, by far the most popular KBP relation (title) had only 121 examples annotated.

7 Conclusion

We have shown that providing a relatively small number of mention-level annotations can improve the accuracy of MIML-RE, yielding an end-to-end improvement of 3.9 F1 on the KBP task. Furthermore, we have introduced a new active learning criterion, and shown both that the choice of criterion is important, and that our new criterion performs well. Lastly, we make available a dataset of mention-level annotations for constructing a traditional supervised relation extractor.

Acknowledgements

We thank the anonymous reviewers for their thoughtful comments, and Dan Weld for his feedback on additional experiments and analysis. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.


References

Gabor Angeli, Arun Chaganty, Angel Chang, Kevin Reschke, Julie Tibshirani, Jean Y. Wu, Osbert Bastani, Keith Siilats, and Christopher D. Manning. 2014. Stanford's 2013 KBP system. In TAC-KBP.

Zheng Chen, Suzanne Tamang, Adam Lee, Xiang Li, Wen-Pin Lin, Matthew Snover, Javier Artiles, Marissa Passantino, and Heng Ji. 2010. CUNY-BLENDER. In TAC-KBP.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In AAAI.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In LREC.

Aidan Finn and Nicolas Kushmerick. 2003. Active learning selection strategies for information extraction. In International Workshop on Adaptive Text Extraction and Mining.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. 1992. Information, prediction, and query by committee. In NIPS.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168.

Lisheng Fu and Ralph Grishman. 2013. An efficient active learning framework for new relation types. In IJCNLP.

Ralph Grishman and Bonan Min. 2010. New York University KBP 2010 slot-filling system. In Proc. TAC 2010 Workshop.

Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In ACL.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL-HLT.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR.

Andrew McCallum and Kamal Nigam. 1998. Employing EM and pool-based active learning for text classification. In ICML.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In NAACL-HLT.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Maria Pershina, Bonan Min, Wei Xu, and Ralph Grishman. 2014. Infusion of labeled data into distant supervision for relation extraction. In ACL.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases. Springer.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL-HLT.

Benjamin Roth and Dietrich Klakow. 2013. Feature-based models for improving the quality of noisy training data for relation extraction. In CIKM.

Maytal Saar-Tsechansky and Foster Provost. 2004. Active sampling for class probability estimation and ranking. Machine Learning, 54(2):153–178.

Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Advances in Neural Information Processing Systems, pages 1289–1296.

Burr Settles. 2010. Active learning literature survey. University of Wisconsin Madison Technical Report 1648.

Noah Smith and Jason Eisner. 2007. Novel estimation methods for unsupervised discovery of latent structure in natural language text. Ph.D. thesis, Johns Hopkins.

Stephen G. Soderland. 1997. Learning text analysis rules for domain-specific natural language processing. Ph.D. thesis, University of Massachusetts.

Ang Sun, Ralph Grishman, Wei Xu, and Bonan Min. 2011. New York University 2011 system for KBP slot filling. In Proceedings of the Text Analytics Conference.

Mihai Surdeanu and Massimiliano Ciaramita. 2007. Robust information extraction with perceptrons. In ACE07 Proceedings.

Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, and Christopher D. Manning. 2011. Stanford's distantly-supervised slot-filling system. In Proceedings of the Text Analytics Conference.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. ACM.

Wei Xu, Le Zhao, Raphael Hoffman, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In ACL.

Ce Zhang, Feng Niu, Christopher Re, and Jude Shavlik. 2012. Big data versus the crowd: Looking for relationships in all the right places. In ACL.