arXiv:1504.05929v2 [cs.CL] 25 Sep 2015

A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution

Bishan Yang, Claire Cardie
Department of Computer Science
Cornell University
{bishan, cardie}@cs.cornell.edu

Peter Frazier
School of Operations Research and Information Engineering
Cornell University
[email protected]

Abstract

We present a novel hierarchical distance-dependent Bayesian model for event coreference resolution. While existing generative models for event coreference resolution are completely unsupervised, our model allows for the incorporation of pairwise distances between event mentions — information that is widely used in supervised coreference models — to guide the generative clustering process for better event clustering both within and across documents. We model the distances between event mentions using a feature-rich learnable distance function and encode them as Bayesian priors for nonparametric clustering. Experiments on the ECB+ corpus show that our model outperforms state-of-the-art methods for both within- and cross-document event coreference resolution.

1 Introduction

The task of event coreference resolution consists of identifying text snippets that describe events, and then clustering them such that all event mentions in the same partition refer to the same unique event. Event coreference resolution can be applied within a single document or across multiple documents and is crucial for many natural language processing tasks including topic detection and tracking, information extraction, question answering and textual entailment (Bejan and Harabagiu, 2010). More importantly, event coreference resolution is a necessary component in any reasonable, broadly applicable computational model of natural language understanding (Humphreys et al., 1997).

In comparison to entity coreference resolution (Ng, 2010), which deals with identifying and grouping noun phrases that refer to the same discourse entity, event coreference resolution has not been extensively studied. This is, in part, because events typically exhibit a more complex structure than entities: a single event can be described via multiple event mentions, and a single event mention can be associated with multiple event arguments that characterize the participants in the event as well as spatio-temporal information (Bejan and Harabagiu, 2010). Hence, the coreference decisions for event mentions usually require the interpretation of event mentions and their arguments in context. See, for example, Figure 1, in which five event mentions across two documents all refer to the same underlying event: Plane bombs Yida camp.

Figure 1: Examples of event coreference. Mutually coreferent event mentions are underlined and in boldface; participant and spatio-temporal information for the highlighted event is marked by curly brackets.

Event: Plane bombs Yida camp (described in Document 1 and Document 2)

- The {Yida refugee camp} {in South Sudan} was bombed {on Thursday}.
- The {Yida refugee camp} was the target of an air strike {in South Sudan} {on Thursday}.
- {Two bombs} fell {within the Yida camp}, including {one} {close to the school}.
- {At least four bombs} were reportedly dropped.
- {Four bombs} were dropped within just a few moments - {two} {inside the camp itself}, while {the other two} {near the airstrip}.

Most previous approaches to event coreference resolution (e.g., Ahn (2006), Chen et al. (2009)) operated by extending the supervised pairwise classification model that is widely used in entity coreference resolution (e.g., Ng and Cardie (2002)). In this framework, pairwise distances between event mentions are modeled via event-related features (e.g., that indicate event argument compatibility), and agglomerative clustering is applied to greedily merge event mentions into clusters. A major drawback of this general approach is that it makes hard decisions on the merging and splitting of clusters based on heuristics derived from the pairwise distances. In addition, it only captures pairwise coreference decisions within a single document and cannot account for signals that commonly appear across documents. More recently, Bejan and Harabagiu (2010; 2014) proposed several nonparametric Bayesian models for event coreference resolution that probabilistically infer event clusters both within a document and across multiple documents. Their method, however, is completely unsupervised, and thus cannot encode any readily available supervisory information to guide the model toward better event clustering.

To address these limitations, we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages supervised feature-rich modeling of pairwise coreference relations and generative modeling of cluster distributions, and thus allows for both probabilistic inference over event clusters and easy incorporation of pairwise linking preferences. Our model builds on the framework of the distance-dependent Chinese restaurant process (DDCRP) (Blei and Frazier, 2011), which was introduced to incorporate data dependencies into nonparametric clustering models. Here, however, we extend the DDCRP to allow the incorporation of feature-based, learnable distance functions as clustering priors, thus encouraging event mentions that are close in meaning to belong to the same cluster. In addition, we introduce to the DDCRP a representational hierarchy that allows event mentions to be grouped within a document and within-document event clusters to be grouped across documents.

To investigate the effectiveness of our approach, we conduct extensive experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), an extension to EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and the largest corpus available that contains event coreference annotations within and across documents. We show that integrating pairwise learning of event coreference relations with unsupervised hierarchical modeling of event clustering achieves promising improvements over state-of-the-art approaches for within- and cross-document event coreference resolution.

2 Related Work

Coreference resolution in general is a difficult natural language processing (NLP) task and typically requires sophisticated, inferentially-based, knowledge-intensive models (Kehler, 2002). Extensive work in the literature focuses on the problem of entity coreference resolution, and many techniques have been developed, including rule-based deterministic models (e.g., Cardie and Wagstaff (1999), Raghunathan et al. (2010), Lee et al. (2011)) that traverse over mentions in certain orderings and make deterministic coreference decisions based on all available information at the time; supervised learning-based models (e.g., Stoyanov et al. (2009), Rahman and Ng (2011), Durrett and Klein (2013)) that make use of rich linguistic features and annotated corpora to learn more powerful coreference functions; and finally, unsupervised models (e.g., Bhattacharya and Getoor (2006), Haghighi and Klein (2007, 2010)) that successfully apply generative modeling to the coreference resolution problem.

Event coreference resolution is a more complex task than entity coreference resolution (Humphreys et al., 1997) and has also been relatively less studied. Existing work has adapted ideas similar to those used in entity coreference. Humphreys et al. (1997) first proposed a deterministic clustering mechanism to group event mentions of pre-specified types based on hard constraints. Later approaches (Ahn, 2006; Chen et al., 2009) applied learning-based pairwise classification decisions using event-specific features to infer event clustering. Bejan and Harabagiu (2010; 2014) proposed several unsupervised generative models for event mention clustering based on the hierarchical Dirichlet process (HDP) (Teh et al., 2006). Our approach is related to both supervised clustering and generative clustering approaches: it is a nonparametric Bayesian model in nature but encodes rich linguistic features in clustering priors. More recent work modeled both entity and event information in event coreference. Lee et al. (2012) showed that iteratively merging entity and event clusters can boost the clustering performance. Liu et al. (2014) demonstrated the benefits of propagating information between event arguments and event mentions during a post-processing step. Other work modeled event coreference as a predicate argument alignment problem between pairs of sentences, and trained classifiers for making alignment decisions (Roth and Frank, 2012; Wolfe et al., 2015). Our model also leverages event argument information in its event coreference decisions but incorporates it into Bayesian clustering priors.

Most existing coreference models, both for events and entities, focus on solving the within-document coreference problem. Cross-document coreference has attracted less attention due to the lack of annotated corpora and the requirement for larger model capacity. Hierarchical models (Singh et al., 2010; Wick et al., 2012; Haghighi and Klein, 2007) have been popular choices for cross-document coreference as they can capture coreference at multiple levels of granularity. Our model is also hierarchical, capturing both within- and cross-document coreference.

Our model is also closely related to the distance-dependent Chinese Restaurant Process (DDCRP) (Blei and Frazier, 2011). The DDCRP is an infinite clustering model that can account for data dependencies (Ghosh et al., 2011; Socher et al., 2011), but it is a flat clustering model and thus cannot capture the hierarchical structure that usually exists in large data collections. Very little work has explored the use of the DDCRP in hierarchical clustering models. Kim and Oh (2011) and Ghosh et al. (2011) combined a DDCRP with a standard CRP in a two-level hierarchy analogous to the HDP with restricted distance functions. Ghosh et al. (2014) proposed a two-level DDCRP with data-dependent distance-based priors at both levels. Our model is also a two-level DDCRP model but differs in that its distance function is learned using a feature-rich log-linear model. We also derive an effective Gibbs sampler for posterior inference.

Table 1: Mentions of event components

Action: bombs
Participant: Sudan, Yida refugee camp
Time: Thursday, Nov 10, 2011
Location: South Sudan

3 Problem Formulation

We adopt the terminology from ECB+ (Cybulska and Vossen, 2014b), a corpus that extends the widely used EventCorefBank (ECB) (Bejan and Harabagiu, 2010). An event is something that happens or a situation that occurs (Cybulska and Vossen, 2014a). It consists of four components: (1) an Action: what happens in the event; (2) Participants: who or what is involved; (3) a Time: when the event happens; and (4) a Location: where the event happens. We assume that each document in the corpus consists of a set of mentions — text spans — that describe event actions, their participants, times, and locations. Table 1 shows examples of these in the sentence "Sudan bombs Yida refugee camp in South Sudan on Thursday, Nov 10th, 2011."

In this paper, we also use the term event mention to refer to the mention of an event action, and event arguments to refer collectively to mentions of the participants, times and locations involved in the event. Event mentions are usually noun phrases or verb phrases that clearly describe events. Two event mentions are considered coreferent if they refer to the same actual event, i.e., a situation involving a particular combination of action, participants, time and location. Note that in text, not all event arguments are always present for an event mention; they may even be distributed over different sentences. Thus whether two event mentions are coreferential should be determined based on the context. For example, in Figure 1, the event mention dropped in DOCUMENT 1 corefers with air strike in the same document as they describe the same event, Plane bombs Yida camp, in the discourse context; it also corefers with dropped in DOCUMENT 2 based on the contexts of both documents.
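As a concrete illustration of this terminology, an event mention together with its arguments can be represented as a small record. The class and field names below are our own choices for illustration, not a schema from the ECB+ corpus:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EventMention:
    """Illustrative container for an event mention (the action span)
    and its associated event arguments."""
    action: str                                        # e.g. "bombs"
    participants: List[str] = field(default_factory=list)
    times: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)

# The example from Table 1: "Sudan bombs Yida refugee camp
# in South Sudan on Thursday, Nov 10th, 2011."
m = EventMention(action="bombs",
                 participants=["Sudan", "Yida refugee camp"],
                 times=["Thursday, Nov 10, 2011"],
                 locations=["South Sudan"])
```

Note that, as the text observes, any of the argument lists may be empty for a given mention, since arguments are often elided or stated in other sentences.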

The problem of event coreference resolution can be divided into two sub-problems: (1) event extraction: extracting event mentions and event arguments, and (2) event clustering: grouping event mentions into clusters according to their coreference relations. We consider both within- and cross-document event coreference resolution and hypothesize that leveraging context information from multiple documents will improve both within- and cross-document coreference resolution. In the following, we first describe the event extraction step and then focus on the event clustering step.

4 Event Extraction

The goal of event extraction is to extract from a text all event mentions (actions) and event arguments (the associated participants, times and locations). One might expect that event actions could be extracted reasonably well by identifying verb groups, and event arguments by applying semantic role labeling (SRL) to identify, for example, the Agent and Patient of each predicate. Unfortunately, most SRL systems only handle verbal predicates and so would miss event mentions described via noun phrases. In addition, SRL systems are not designed to capture event-specific arguments. Accordingly, we found that a state-of-the-art SRL system (SwiRL (Surdeanu et al., 2007)) extracted only 56% of the actions, 76% of the participants, 65% of the times and 13% of the locations for events in a development set of ECB+ based on a head word matching evaluation measure. (We provide dataset details in Section 6.)

To produce higher recall, we adopt a supervised approach and train an event extractor using sentences from ECB+, which are annotated for event actions, participants, times and locations. Because these mentions vary widely in their length and grammatical type, we employ semi-Markov CRFs (Sarawagi and Cohen, 2004) using the loss-augmented objective of Yang and Cardie (2014), which provides more accurate detection of mention boundaries. We make use of a rich feature set that includes word-level features such as unigrams, bigrams, POS tags, WordNet hypernyms, synonyms and FrameNet semantic roles, and phrase-level features such as phrasal syntax (e.g., NP, VP) and phrasal embeddings (constructed by averaging word embeddings produced by word2vec (Mikolov et al., 2013)). Our experiments on the same (held-out) development data show that the semi-CRF-based extractor correctly identifies 95% of actions, 90% of participants, 94% of times and 74% of locations, again based on head word matching.

Note that the semi-CRF extractor identifies event mentions and event arguments but not relationships among them, i.e., it does not associate arguments with an event mention. Lacking supervisory data in the ECB+ corpus for training an event action-argument relation detector, we assume that all event arguments identified by the semi-CRF extractor are related to all event mentions in the same sentence, and then apply SRL-based heuristics to augment and further disambiguate intra-sentential action-argument relations (using the SwiRL SRL). More specifically, we link each verbal event mention to the participants that match its ARG0, ARG1 or ARG2 semantic role fillers; similarly, we associate with the event mention the times and locations that match its AM-TMP and AM-LOC role fillers, respectively. For each nominal event mention, we associate those participants that match the possessor of the mention, since these were suggested in Lee et al. (2012) as playing the ARG0 role for nominal predicates.
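The SRL-based linking heuristic for verbal mentions can be sketched as a role-to-slot mapping. The mapping of ARG0/ARG1/ARG2 to participants and AM-TMP/AM-LOC to times/locations follows the text above, but the function name and data representation are our own illustration, not the authors' code:

```python
# Hypothetical sketch: attach the role fillers of one verbal predicate
# to event-argument slots, following the Section 4 heuristics.
ROLE_TO_SLOT = {
    "ARG0": "participants", "ARG1": "participants", "ARG2": "participants",
    "AM-TMP": "times", "AM-LOC": "locations",
}

def link_arguments(srl_frame):
    """srl_frame maps SRL role labels to filler spans for one predicate;
    returns the argument slots for the corresponding event mention."""
    slots = {"participants": [], "times": [], "locations": []}
    for role, span in srl_frame.items():
        slot = ROLE_TO_SLOT.get(role)
        if slot:
            slots[slot].append(span)
    return slots

slots = link_arguments({"ARG0": "Sudan", "ARG1": "Yida refugee camp",
                        "AM-TMP": "Thursday"})
```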

5 Event Clustering

Now we describe our proposed Bayesian model for event clustering. Our model is a hierarchical extension of the distance-dependent Chinese Restaurant Process (DDCRP). It first groups event mentions within a document to form within-document event clusters and then groups these event clusters across documents to form global clusters. The model can account for the similarity between event mentions during the clustering process, putting a bias toward clusters comprised of event mentions that are similar to each other based on the context. To capture event similarity, we use a log-linear model with rich syntactic and semantic features, and learn the feature weights using gold-standard data.

5.1 Distance-dependent Chinese Restaurant Process

The distance-dependent Chinese Restaurant Process (DDCRP) is a generalization of the Chinese Restaurant Process (CRP), which models distributions over partitions. In a CRP, the generative process can be described by imagining data points as customers in a restaurant and the partitioning of data as tables at which the customers sit. The process randomly samples the table assignment for each customer sequentially: the probability of a customer sitting at an existing table is proportional to the number of customers already sitting at that table, and the probability of sitting at a new table is proportional to a scaling parameter. For each customer sitting at the same table, an observation can be drawn from a distribution determined by the parameter associated with that table. Despite the sequential sampling process, the CRP makes the assumption of exchangeability: permuting the customer ordering does not change the probability of the partitions.

The exchangeability assumption may not be reasonable for clustering data that has clear inter-dependencies. The DDCRP allows the incorporation of data dependencies in infinite clustering, encouraging data points that are closer to each other to be grouped together. In the generative process, instead of directly sampling a table assignment for each customer, it samples a customer link, linking the customer to another customer or to itself. The clustering can be uniquely constructed once the customer links are determined for all customers: two customers belong to the same cluster if and only if one can reach the other by traversing the customer links (treating these links as undirected).

More formally, consider a sequence of customers 1, ..., n, and denote a = (a_1, ..., a_n) as the assignments of the customer links. Each a_i ∈ {1, ..., n} is drawn from

    p(a_i = j | F, α) ∝ { F(i, j),  j ≠ i
                        { α,        j = i        (1)

where F is a distance function and F(i, j) is a value that measures the distance between customers i and j, and α is a scaling parameter measuring self-affinity. For each customer, the observation is generated by the per-table parameters as in the CRP. A DDCRP is said to be sequential if F(i, j) = 0 when i < j, so customers may link only to themselves and to previous customers.
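Sampling a customer link under Equation (1) for a sequential DDCRP (where F(i, j) = 0 for later customers) can be sketched as follows; this is a minimal illustration under our own function names, not the authors' implementation:

```python
import random

def sample_customer_link(i, F, alpha, rng=random):
    """Sample customer link a_i under Eq. (1) of a sequential DDCRP:
    weight F(i, j) for each previous customer j < i, alpha for a self-link.
    Customers are 0-indexed here for convenience."""
    candidates = list(range(i)) + [i]               # previous customers, then self
    weights = [F(i, j) for j in range(i)] + [alpha]
    r = rng.random() * sum(weights)
    for j, w in zip(candidates, weights):
        if w <= 0:
            continue                                # zero-weight links are impossible
        r -= w
        if r <= 0:
            return j
    return candidates[-1]                           # guard against float rounding
```

Note that the first customer always links to itself, and that setting F to zero everywhere reduces every draw to a self-link, i.e. every mention starts its own cluster.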

5.2 A Hierarchical Extension of the DDCRP

We can model within-document coreference resolution using a sequential DDCRP. Imagining customers as event mentions and the restaurant as a document, each mention can either refer to an antecedent mention in the document or to no other mention, starting the description of a new event. However, coreference relations may also exist across documents: the same event may be described in multiple documents. Thus it is ideal to have a two-level clustering model that can group event mentions within a document and further group them across documents. Therefore we propose a hierarchical extension of the DDCRP (HDDCRP) that employs a DDCRP twice: the first-level DDCRP links mentions based on within-document distances, and the second-level DDCRP links the within-document clusters based on cross-document distances, forming larger clusters in the corpus.

The generative process of an HDDCRP can be described using the same "Chinese Restaurant" metaphor. Imagine a collection of documents as a collection of restaurants, and the event mentions in each document as customers entering a restaurant. The local (within-document) event clusters correspond to tables. The global (within-corpus) event clusters correspond to menus (tables that serve the same menu belong to the same cluster). The hidden variables are the customer links and the table links. Figure 2 shows a configuration of these variables and the corresponding clustering structure.

Figure 2: A cluster configuration generated by the HDDCRP. Each restaurant is represented by a rectangle. The small green circles represent customers. The ovals represent tables, and the colors reflect the clustering. Each customer is assigned a customer link (a solid arrow), linking to itself or another customer in the same restaurant. The customer who first sits at a table is assigned a table link (a dashed arrow), linking to itself or another customer in a different restaurant, resulting in the linking of two tables.

More formally, the generative process for the HDDCRP can be described as follows:

1. For each restaurant d ∈ {1, ..., D}, for each customer i ∈ {1, ..., n_d}, sample a customer link using a sequential DDCRP:

    p(a_{i,d} = (j, d)) ∝ { F_d(i, j),  j < i
                          { α_d,        j = i        (2)
                          { 0,          j > i

2. For each restaurant d ∈ {1, ..., D}, for each table t, sample a table link for the customer (i, d) who first sits at t using a DDCRP:

    p(c_{i,d} = (j, d')) ∝ { F_0((i, d), (j, d')),  j ∈ {1, ..., n_{d'}}, d' ≠ d
                           { α_0,                   j = i, d' = d        (3)

3. Calculate the clusters z(a, c) by traversing all the customer links a and the table links c. Two customers are in the same cluster if and only if there is a path from one to the other along the links, where we treat both table and customer links as undirected.

4. For each cluster k ∈ z(a, c), sample parameters φ_k ∼ G_0(λ).

5. For each customer i in cluster k, sample an observation x_i ∼ p(·|φ_{z_i}), where z_i = k.

F_{1:D} and F_0 are distance functions that map a pair of customers to a distance value. We discuss them in detail in Section 5.4.
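Step 3, computing z(a, c) by treating customer and table links as undirected edges, amounts to finding connected components, for example with union-find. A minimal sketch; identifying customers by (restaurant, index) pairs is our own encoding:

```python
def clusters_from_links(customer_links, table_links):
    """Compute the partition z(a, c): connected components over all links,
    treating customer and table links as undirected edges.
    Both arguments map a customer (restaurant, index) to a linked customer."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for src, dst in customer_links.items():
        union(src, dst)
    for src, dst in table_links.items():
        union(src, dst)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

For instance, if mention 1 of document d1 links to mention 0 (a within-document table), and mention 0 of document d2 table-links to mention 0 of d1, all three mentions end up in one global cluster.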

5.3 Posterior Inference with Gibbs Sampling

The central computational problem for the HDDCRP model is posterior inference: computing the conditional distribution of the hidden variables given the observations, p(a, c | x, α_0, F_0, α_{1:D}, F_{1:D}). The posterior is intractable due to a combinatorial number of possible link configurations. Thus we approximate the posterior using Markov Chain Monte Carlo (MCMC) sampling, specifically a Gibbs sampler.

In developing this Gibbs sampler, we first observe that the generative process is equivalent to one that, in step 2, samples a table link for all customers, and then in step 3, when calculating z(a, c), includes only those table links c_{i,d} originating at customers (i, d) that started a new table, i.e., that chose a_{i,d} = (i, d).

The Gibbs sampler for the HDDCRP iteratively samples a customer link for each customer (i, d) from

    p(a*_{i,d} | a_{−(i,d)}, c, x, λ) ∝ p(a*_{i,d}) H_a(x, z, λ)        (4)

where

    H_a(x, z, λ) = p(x | z(a_{−(i,d)} ∪ a*_{i,d}, c), λ) / p(x | z(a_{−(i,d)}, c), λ)

After sampling all the customer links, it samples a table link for each customer (i, d) according to

    p(c*_{i,d} | a, c_{−(i,d)}, x, λ) ∝ p(c*_{i,d}) H_c(x, z, λ)        (5)

where

    H_c(x, z, λ) = p(x | z(a, c_{−(i,d)} ∪ c*_{i,d}), λ) / p(x | z(a, c_{−(i,d)}), λ)

For those customers (i, d) that did not start a new table, i.e. with a_{i,d} ≠ (i, d), the table link c*_{i,d} does not affect the clustering, and so H_c(x, z, λ) = 1 in this case.

Referring back to the event coreference example in Figure 1, Figure 3 shows an example variable configuration for the HDDCRP model and the corresponding coreference clusters.

Figure 3: An example of event clustering and the corresponding variable assignments (a_1 = 1, a_2 = 2, a_3 = 3, a_4 = 4, a_5 = 4; c_1 = 3, c_2 = 2, c_3 = 2, c_4 = 2, c_5 = 5 [ina]). The assignments of a induce tables, or within-document (WD) clusters, and the assignments of c induce menus, or cross-document (CD) clusters. [ina] denotes that the variable is inactive and will not affect the clustering.

In implementation, we can simplify the computations of both H_a(x, z, λ) and H_c(x, z, λ) by using the fact that the likelihood under clustering z(a, c) factorizes as

    p(x | z(a, c), λ) = ∏_{k ∈ z(a,c)} p(x_{z=k} | λ)

where x_{z=k} denotes all customers that belong to the global cluster k, and p(x_{z=k} | λ) is the marginal probability. It can be computed as

    p(x_{z=k} | λ) = ∫ p(φ | λ) ∏_{i : z_i = k} p(x_i | φ) dφ

where x_i is the observation associated with customer i. In our problem, the observation corresponds to the lemmatized words in the event mention. We model the observed word counts using cluster-specific multinomial distributions with symmetric Dirichlet priors.
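With cluster-specific multinomials under a symmetric Dirichlet(λ) prior, the integral above has a standard closed form, the Dirichlet-multinomial: log p(x_{z=k} | λ) = log Γ(Vλ) − log Γ(Vλ + N) + Σ_w [log Γ(λ + n_w) − log Γ(λ)], where V is the vocabulary size, N the total word count in cluster k, and n_w the count of word w. A minimal sketch of this standard formula (not the authors' code):

```python
from math import lgamma

def log_marginal(counts, lam, vocab_size):
    """Collapsed log marginal likelihood of the word counts of one cluster
    under a multinomial with a symmetric Dirichlet(lam) prior.
    counts maps word -> count; words with zero count contribute nothing."""
    n = sum(counts.values())
    out = lgamma(vocab_size * lam) - lgamma(vocab_size * lam + n)
    for c in counts.values():
        out += lgamma(lam + c) - lgamma(lam)
    return out
```

As a sanity check, with a two-word vocabulary and λ = 1 (a uniform prior), a single observed word has marginal probability 1/2.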

5.4 Feature-based Distance Functions

The distance functions F_{1:D} and F_0 encode the priors for the clustering distribution, preferring to cluster data points that are close to each other. We consider event mentions as the data points and encode the similarity (or compatibility) between event mentions as priors for event clustering. Specifically, we use a log-linear model to estimate the similarity between a pair of event mentions (x_i, x_j):

    f_θ(x_i, x_j) ∝ exp{θ^T ψ(x_i, x_j)}        (6)

where ψ(x_i, x_j) is a feature vector containing a rich set of features based on event mentions i and j: (1) head word string match; (2) head POS pair; (3) cosine similarity between the head word embeddings (we use the pre-trained 300-dimensional word embeddings from word2vec1); (4) similarity between the words in the event mentions (based on term frequency (TF) vectors); (5) the Jaccard coefficient between the WordNet synonyms of the head words; and (6) similarity between the context words (a window of three words before and after each event mention). If both event mentions involve participants, we consider the similarity between the words in the participant mentions based on the TF vectors, and similarly for the time mentions and the location mentions. If SRL role information is available, we also consider the similarity between words in each SRL role, i.e., Arg0, Arg1, Arg2.

1 https://code.google.com/p/word2vec/

Table 2: Statistics of the ECB+ corpus

                            Train    Dev    Test   Total
# Documents                   462     73     447     982
# Sentences                 7,294    649   7,867  15,810
# Annotated event mentions  3,555    441   3,290   7,286
# Cross-document chains       687     47     486   1,220
# Within-document chains    2,499    316   2,137   4,952

Training. We train the parameter θ using logistic regression with an L2 regularizer. We construct the training data by considering all ordered pairs of event mentions within a document, and also all pairs of event mentions across similar documents. To measure document similarity, we collect all mentions of events, participants, times and locations in each document and compute the cosine similarity between the TF vectors constructed from all the event-related mentions. We consider two documents to be similar if their TF-based similarity is above a threshold σ (we set it to 0.4 in our experiments).
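The document-similarity test can be sketched as a cosine over term-frequency vectors built from each document's event-related mention words. A minimal illustration; how the mention words are tokenized is our assumption:

```python
from collections import Counter
from math import sqrt

def tf_cosine(mentions_a, mentions_b):
    """Cosine similarity between term-frequency vectors built from the
    event-related mention words of two documents."""
    ta, tb = Counter(mentions_a), Counter(mentions_b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = sqrt(sum(v * v for v in ta.values()))
    nb = sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similar(mentions_a, mentions_b, sigma=0.4):
    # sigma = 0.4 as in the experiments described above
    return tf_cosine(mentions_a, mentions_b) >= sigma
```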

After learning θ, we set the within-document distances as F_d(i, j) = f_θ(x_i, x_j), and the cross-document distances as F_0((i, d), (j, d')) = w(d, d') f_θ(x_{i,d}, x_{j,d'}), where w(d, d') = exp(γ sim(d, d')) captures document similarity, sim(d, d') is the TF-based similarity between documents d and d', and γ is a weight parameter. Higher γ leads to a higher effect of document-level similarities on the linking probabilities. We set γ = 1 in our experiments.

6 Experiments

We conduct experiments using the ECB+ corpus (Cybulska and Vossen, 2014b), the largest available dataset with annotations of both within-document (WD) and cross-document (CD) event coreference. It extends ECB 0.1 (Lee et al., 2012) and ECB (Bejan and Harabagiu, 2010) by adding event argument and argument type annotations as well as more news documents. The cross-document coreference annotations only exist in documents that describe the same seminal event (the event that triggers the topic of the document and has interconnections with the majority of events from its surrounding textual context (Bejan and Harabagiu, 2014)). We divide the dataset into a training set (topics 1-20), a development set (topics 21-23), and a test set (topics 24-43). Table 2 shows the statistics of the data.

We performed event coreference resolution on all possible event mentions that are expressed in the documents. Using the event extraction method described in Section 4, we extracted 53,429 event mentions, 43,682 participant mentions, 5,791 time mentions and 3,836 location mentions in the test data, covering 93.5%, 89.0%, 95.0%, and 72.8% of the annotated event mentions, participants, times and locations, respectively.

We evaluate both within- and cross-document event coreference resolution. As in previous work (Bejan and Harabagiu, 2010), we evaluate cross-document coreference resolution by merging all documents from the same seminal event into a meta-document and then evaluating the meta-document as in within-document coreference resolution. At inference time, however, we do not assume knowledge of the mapping from documents to seminal events.

We consider three widely used coreference resolution metrics: (1) MUC (Vilain et al., 1995), which measures how many gold (predicted) cluster merging operations are needed to recover each predicted (gold) cluster; (2) B3 (Bagga and Baldwin, 1998), which measures the proportion of overlap between the predicted and gold clusters for each mention and computes the average scores; and (3) CEAF (Luo, 2005) (CEAFe), which measures the best alignment of the gold-standard and predicted clusters. We also report the CoNLL F1, the average of the F1 scores of the above three measures. All scores are computed using the latest version (v8.01) of the official CoNLL scorer (Pradhan et al., 2014).
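As an illustration of one of these metrics, B3 can be computed directly from its definition. This is a simplified sketch (it assumes every gold mention also appears in some predicted cluster), not the official CoNLL scorer:

```python
def b_cubed(pred_clusters, gold_clusters):
    """B3 (Bagga and Baldwin, 1998): for each mention, precision is the
    fraction of its predicted cluster that is correct and recall is the
    fraction of its gold cluster that is recovered; both are averaged
    over mentions.  Clusters are lists of hashable mention ids."""
    pred_of = {m: frozenset(c) for c in pred_clusters for m in c}
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    mentions = list(gold_of)
    p = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    r = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, predicting one merged cluster {1, 2, 3} against gold clusters {1, 2} and {3} gives perfect recall but reduced precision, since each mention's predicted cluster contains mentions from the other gold cluster.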

6.1 Baselines

We compare our proposed HDDCRP model (HDDCRP) to five baselines:

• LEMMA: a heuristic method that groups all event mentions, either within or across documents, that have the same lemmatized head word. It is usually considered a strong baseline for event coreference resolution.

• AGGLOMERATIVE: a supervised clustering method for within-document event coreference (Chen et al., 2009). We extend it to within- and cross-document event coreference by performing single-link clustering in two phases: first grouping mentions within documents, and then grouping the within-document clusters into larger clusters across documents. We compute the pairwise-linkage scores using the log-linear model described in Section 5.4.

• HDP-LEX: an unsupervised Bayesian clustering model for within- and cross-document event coreference (Bejan and Harabagiu, 2010)2. It is a hierarchical Dirichlet process (HDP) model whose likelihood is over all the lemmatized words observed in the event mentions. In general, the HDP can be formulated as a two-level sequential CRP. Our HDDCRP model is a two-level DDCRP that generalizes the HDP to allow data dependencies to be incorporated at both levels3.

• DDCRP: a DDCRP model we develop for event coreference resolution. It applies the distance prior in Equation 1 to all pairs of event mentions in the corpus, ignoring document boundaries. It uses the same likelihood function and the same log-linear model to learn the distance values as HDDCRP, but it has fewer link variables than HDDCRP and does not distinguish between within-document and cross-document link variables. For the same clustering structure, HDDCRP can generate more possible link configurations than DDCRP.

• HDDCRP∗: a variant of the proposed HDDCRP that incorporates only the within-document dependencies but not the cross-document dependencies. The generative process of HDDCRP∗ is similar to the one described in Section 5.2, except that in step 2, for each table t, we sample

2 We re-implemented the proposed HDP-based models: HDP1f, HDPflat (including HDPflat (LF), (LF+WF), and (LF+WF+SF)) and HDPstruct, but found that HDPflat with lexical features (LF) performs the best in our experiments. We refer to it as HDP-LEX.

3 Note that HDP-LEX is not a special case of HDDCRP because we define the table-level distance function as the distances between customers instead of between tables. In our model, the probability of linking a table t to another table s depends on the distance between the head customer at table t and all other customers who sit at table s. Defining the table-level distance function this way allows us to derive a tractable inference algorithm using Gibbs sampling.


a cluster assignment ct according to

p(ct = k) ∝ nk for an existing cluster k ≤ K, and p(ct = K + 1) ∝ α0 for a new cluster,

where K is the number of existing clusters, nk is the number of existing tables that belong to cluster k, and α0 is the concentration parameter. In step 3, the clusters z(a, c) are constructed by traversing the customer links and looking up the cluster assignments for the obtained tables. We also use Gibbs sampling for inference.
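The cluster-assignment step above is an ordinary Chinese-restaurant-process draw; a minimal sketch (the function and variable names are ours, not from the paper):

```python
import random

def sample_cluster(cluster_sizes, alpha0, rng=random.Random(0)):
    """One draw of a table's cluster assignment under the CRP prior:
    p(c_t = k) is proportional to n_k for an existing cluster k
    (n_k = number of tables already in cluster k), and to alpha0
    for opening a new cluster K + 1."""
    weights = list(cluster_sizes) + [alpha0]  # last slot = new cluster
    u = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        u -= w
        if u <= 0:
            return k  # k == len(cluster_sizes) means "new cluster"
    return len(cluster_sizes)
```

A small α0 (e.g. the α0 = 0.001 used for HDDCRP) makes opening a new cluster rare, so tables tend to join large existing clusters.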

6.2 Parameter settings

For all the Bayesian models, the reported results are averaged over five MCMC runs, each run for 500 iterations. By monitoring the joint log-likelihood, we found that mixing occurs within 500 iterations for all models. For DDCRP, HDDCRP∗ and HDDCRP, we randomly initialized the link variables; before initialization, we assume that each mention belongs to its own cluster. We assume mentions are ordered according to their appearance within a document, but we do not assume any particular ordering of documents. We also truncated the pairwise mention similarity to zero when it is below 0.5, as we found this leads to better performance on the development set. We set α1 = ... = αD = 0.5 and α0 = 0.001 for HDDCRP, α0 = 1 for HDDCRP∗, α = 0.1 for DDCRP, and λ = 10−7. All hyperparameters were set based on the development data.

6.3 Main Results

Table 3 shows the event coreference results. LEMMA matching is a strong baseline for event coreference resolution. HDP-LEX provides noticeable improvements, suggesting the benefit of using an infinite mixture model for event clustering. AGGLOMERATIVE further improves the performance over HDP-LEX for WD resolution; however, it fails to improve CD resolution. We conjecture that this is due to the combination of ineffective thresholding and prediction errors on the pairwise distances between mention pairs across documents. Overall, HDDCRP∗ outperforms all the baselines in CoNLL F1 for both WD and CD evaluation. The clear performance gains over HDP-LEX demonstrate that it is important to account for pairwise mention dependencies in the generative modeling of event clustering. The improvements over AGGLOMERATIVE indicate that it is more effective to model mention-pair dependencies as clustering priors than as heuristics for deterministic clustering.

Comparing among the HDDCRP-related models, HDDCRP clearly outperforms DDCRP, demonstrating the benefit of incorporating the hierarchy into the model. HDDCRP also performs better than HDDCRP∗ in WD CoNLL F1, indicating that incorporating cross-document information helps within-document clustering. HDDCRP performs similarly to HDDCRP∗ in CD CoNLL F1 due to a lower B3 F1, in particular a decrease in B3 recall. This is because applying the DDCRP prior at both the within- and cross-document levels results in more conservative clustering and produces smaller clusters. This could potentially be improved by employing more accurate similarity priors.

To further understand the effect of modeling mention-pair dependencies, we analyze the impact of the features in the mention-pair similarity model. Table 4 lists the learned weights of some of the top features (sorted by weight). These features mainly serve to discriminate event mentions based on head word similarity (especially embedding-based similarity) and context word similarity. Event argument information such as SRL Arg1, SRL Arg0, and Participant is also indicative of coreferential relations.

6.4 Discussion

We found that HDDCRP corrects many errors made by the traditional agglomerative clustering model (AGGLOMERATIVE) and the unsupervised generative model (HDP-LEX). AGGLOMERATIVE easily suffers from error propagation, as errors made by the supervised distance learner cannot be corrected. HDP-LEX often mistakenly groups mentions together based on word co-occurrence statistics rather than the apparent similarity features in the mentions. In contrast, HDDCRP avoids such errors by performing probabilistic modeling of clustering and making use of rich linguistic features trained on available annotated data. For example, HDDCRP correctly groups the event mention "unveiled" in "Apple's Phil Schiller unveiled a revamped MacBook Pro today" together with the event mention "announced" in "this notebook isn't the only laptop Apple announced for the MacBook Pro lineup today", while both the HDP-LEX and AGGLOMERATIVE models fail to make this connection.

                     MUC              B3               CEAFe           CoNLL
                     P    R    F1     P    R    F1     P    R    F1    F1

Cross-document Event Coreference Resolution (CD)
LEMMA              75.1 55.4 63.8   71.7 39.6 51.0   36.2 61.1 45.5   53.4
HDP-LEX            75.5 63.5 69.0   65.6 43.7 52.5   34.8 60.2 44.1   55.2
AGGLOMERATIVE      78.3 59.2 67.4   73.2 40.2 51.9   30.2 65.6 41.4   53.6
DDCRP              79.6 58.2 67.1   78.1 39.6 52.6   31.8 69.4 43.6   54.4
HDDCRP∗            77.5 66.4 71.5   69.0 48.1 56.7   38.2 63.0 47.6   58.6
HDDCRP             80.3 67.1 73.1   78.5 40.6 53.5   38.6 68.9 49.5   58.7

Within-document Event Coreference Resolution (WD)
LEMMA              60.9 30.2 40.4   78.9 57.3 66.4   63.6 69.0 66.2   57.7
HDP-LEX            50.0 39.1 43.9   74.7 67.6 71.0   66.2 71.4 68.7   61.2
AGGLOMERATIVE      61.9 39.2 48.0   80.7 67.6 73.5   65.6 76.0 70.4   63.9
DDCRP              71.2 36.4 48.2   85.4 64.9 73.8   61.8 76.1 68.2   63.4
HDDCRP∗            58.1 42.8 49.3   78.4 68.7 73.2   67.6 74.5 70.9   64.5
HDDCRP             74.3 41.7 53.4   85.6 67.3 75.4   65.1 79.8 71.7   66.8

Table 3: Within- and cross-document coreference results on the ECB+ corpus

Looking further into the errors, we found that many of the mistakes made by HDDCRP are due to errors in event extraction and pairwise linkage prediction. The event extraction errors include false positive and false negative event mentions and event arguments, boundary errors for the extracted mentions, and argument association errors. The pairwise linking errors often stem from a lack of semantic and world knowledge; this applies to both event mentions and event arguments, especially time and location arguments, which are less likely to be repeatedly mentioned and in many cases require external knowledge to resolve their meanings, e.g., that "May 3, 2013" is a "Friday" and "Mount Cook" is "New Zealand's highest peak".

Features             Weight
Head Embedding sim   4.5
String match         2.77
Context sim          1.75
Synonym sim          1.56
TF sim               1.17
SRL Arg1 sim         1.10
SRL Arg0 sim         0.89
Participant sim      0.68

Table 4: Learned weights for selected features

7 Conclusion

In this paper we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages the advantages of generative modeling of coreference resolution and of feature-rich discriminative modeling of mention reference relations. We have shown its power in resolving event coreference by comparing it to a traditional agglomerative clustering approach and a state-of-the-art unsupervised generative clustering approach. It is worth noting that our model is general and can easily be applied to other clustering problems involving feature-rich objects and cluster sharing across data groups. While the model can effectively cluster objects of a single type, it would be interesting to extend it to allow joint clustering of objects of different types, e.g., events and entities.

Acknowledgments

We thank Cristian Danescu-Niculescu-Mizil, Igor Labutov, Lillian Lee, Moontae Lee, Jon Park, Chenhao Tan, and other Cornell NLP seminar participants and the reviewers for their helpful comments. This work was supported in part by NSF grant IIS-1314778 and DARPA DEFT Grant FA8750-13-2-0015. The third author was supported by NSF CAREER CMMI-1254298, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and the ACSF AVF. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government.

References

[Ahn2006] David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

[Bagga and Baldwin1998] Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563–6.

[Bejan and Harabagiu2010] Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In ACL, pages 1412–1422.

[Bejan and Harabagiu2014] Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised event coreference resolution. Computational Linguistics, 40(2):311–347.

[Bhattacharya and Getoor2006] Indrajit Bhattacharya and Lise Getoor. 2006. A latent Dirichlet model for unsupervised entity resolution. In SDM, volume 5, page 59.

[Blei and Frazier2011] David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research, 12:2461–2488.

[Cardie and Wagstaff1999] Claire Cardie and Kiri Wagstaff. 1999. Noun phrase coreference as clustering. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 82–89.

[Chen et al.2009] Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types, pages 17–22.

[Cybulska and Vossen2014a] Agata Cybulska and Piek Vossen. 2014a. Guidelines for ECB+ annotation of events and their coreference. Technical report, NWR-2014-1, VU University Amsterdam.

[Cybulska and Vossen2014b] Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), pages 26–31.

[Durrett and Klein2013] Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP, pages 1971–1982.

[Ghosh et al.2011] Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, and David M. Blei. 2011. Spatial distance dependent Chinese restaurant processes for image segmentation. In Advances in Neural Information Processing Systems, pages 1476–1484.

[Ghosh et al.2014] Soumya Ghosh, Michalis Raptis, Leonid Sigal, and Erik B. Sudderth. 2014. Nonparametric clustering with distance dependent hierarchies.

[Haghighi and Klein2007] Aria Haghighi and Dan Klein. 2007. Unsupervised coreference resolution in a nonparametric Bayesian model. In ACL, volume 45, page 848.

[Haghighi and Klein2010] Aria Haghighi and Dan Klein. 2010. Coreference resolution in a modular, entity-centered model. In NAACL, pages 385–393.

[Humphreys et al.1997] Kevin Humphreys, Robert Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 75–81.

[Kehler2002] Andrew Kehler. 2002. Coherence, Reference, and the Theory of Grammar. CSLI Publications, Stanford, CA.

[Kim and Oh2011] Dongwoo Kim and Alice Oh. 2011. Accounting for data dependencies within a hierarchical Dirichlet process mixture model. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 873–878.

[Lee et al.2011] Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34.

[Lee et al.2012] Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500.

[Liu et al.2014] Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Proceedings of the International Conference on Language Resources and Evaluation.

[Luo2005] Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In EMNLP, pages 25–32.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

[Ng and Cardie2002] Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In ACL, pages 104–111.

[Ng2010] Vincent Ng. 2010. Supervised noun phrase coreference research: The first fifteen years. In ACL, pages 1396–1411.

[Pradhan et al.2014] Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In ACL, pages 22–27.

[Raghunathan et al.2010] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi-pass sieve for coreference resolution. In EMNLP, pages 492–501.

[Rahman and Ng2011] Altaf Rahman and Vincent Ng. 2011. Coreference resolution with world knowledge. In ACL, pages 814–824.

[Roth and Frank2012] Michael Roth and Anette Frank. 2012. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In SemEval, pages 218–227.

[Sarawagi and Cohen2004] Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems, pages 1185–1192.

[Singh et al.2010] Sameer Singh, Michael Wick, and Andrew McCallum. 2010. Distantly labeling data for large scale cross-document coreference. arXiv:1005.4298.

[Socher et al.2011] Richard Socher, Andrew L. Maas, and Christopher D. Manning. 2011. Spectral Chinese restaurant processes: Nonparametric clustering based on similarities. In International Conference on Artificial Intelligence and Statistics, pages 698–706.

[Stoyanov et al.2009] Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 656–664.

[Surdeanu et al.2007] Mihai Surdeanu, Lluís Màrquez, Xavier Carreras, and Pere R. Comas. 2007. Combination strategies for semantic role labeling. Journal of Artificial Intelligence Research, pages 105–151.

[Teh et al.2006] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476).

[Vilain et al.1995] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding, pages 45–52.

[Wick et al.2012] Michael Wick, Sameer Singh, and Andrew McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. In ACL, pages 379–388.

[Wolfe et al.2015] Travis Wolfe, Mark Dredze, and Benjamin Van Durme. 2015. Predicate argument alignment using a global coherence model. In NAACL, pages 11–20.

[Yang and Cardie2014] Bishan Yang and Claire Cardie. 2014. Joint modeling of opinion expression extraction and attribute classification. Transactions of the Association for Computational Linguistics, 2:505–516.