
CohEEL: Coherent and Efficient Named Entity Linking through Random Walks

Toni Gruetze, Gjergji Kasneci, Zhe Zuo, Felix Naumann

Hasso Plattner Institute, Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam, Germany. Email address: {firstname.lastname}@hpi.de

Abstract

In recent years, the ever-growing amount of documents on the Web as well as in digital libraries led to a considerable increase of valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods.

Keywords: Entity Linking, Named Entity Disambiguation, Random Walk, Machine Learning

1. Named Entity Linking

Semi-structured and collaboratively created Web platforms, such as Wikipedia, have motivated a wide variety of research projects aiming at knowledge harvesting. For instance, large semantic knowledge bases, such as DBpedia [1] and YAGO [2], are based on the structured assets of Wikipedia, such as infoboxes and categories. However, harvesting implicit knowledge about entities in text without consistent structure, i.e., on websites or in digital libraries, is still a major challenge. To address it, reliable techniques for Named Entity Detection and Disambiguation in natural language text are needed. Especially the disambiguation process is a critical step; it involves the correct grouping of textual mentions of the same real-world entity. If these groups are linked to corresponding entities in a knowledge base, the process is referred to as Named Entity Linking (NEL). Quote 1

U.S. District Court, New York: Judge Richard Berman expects to rule this week on Brady's four-game suspension appeal. Will the NFL finally prevail over Brady in the Deflategate saga?

Quote 1: Running example of a Named Entity Linking task with emphasized entity mentions.

Preprint submitted to Elsevier January 31, 2016


shows an example text of a news article about the Deflategate scandal. The mentions of different entities within the text (e.g., U.S. District Court) are emphasized. Note that only named entities have been highlighted, whereas general entities, such as concepts (e.g., judge) or temporal expressions (e.g., week), have not.

The task is to link these highlighted mentions to corresponding knowledge base entities. Dredze et al. identify three key challenges for NEL [3]:

(i) Name variations occur for various reasons, for instance to reduce the length of the actual mention, such as "NFL" as an abbreviation for "National Football League".

(ii) The absence of entities is another important challenge. For instance, the mentioned "Deflategate" is not covered by Wikipedia versions from 2014 or earlier. A linking algorithm has to cautiously handle such mentions and should not link them to similarly named entities, such as the data compression algorithm DEFLATE.

(iii) Furthermore, the entity ambiguity concerns cases where different entities are referred to by the same name, such as "Brady". An algorithm with the goal to link this mention to an entity from Wikipedia has to choose between various geographical entities, such as a city in Texas and a village in Nebraska, and hundreds of persons (e.g., the famous American football quarterback and four-time Super Bowl winner, the award-winning film director and producer, the American judge and Associate Justice, ...).

There are various application areas that could benefit from reliable NEL techniques. Digital libraries are usually managed by information systems that enable textual keyword search. However, state-of-the-art keyword-based search engines are not able to deal with name variations or ambiguity. Hence, previously identified entity mentions might enable users to identify documents containing information about specific entities. Furthermore, relationships between entities can be derived from co-occurrences in documents and aggregated into large semantic relationship graphs. The identified entity mentions as well as the relationships between them might serve as a starting point for topic-based clustering and document classification to enable a categorization of the document collection.

For Web retrieval tasks, such as person or product search, reliable disambiguation tools could enable the clustering of results, so that each cluster represents only one real-world entity, allowing the user to focus on the documents of interest [4]. Encyclopedic and scholarly search as well as question answering approaches would immensely benefit from reliable NEL techniques, by using the entity links in a document corpus to identify relevant texts [5]. Several academic projects in the realm of artificial intelligence, e.g., Read the Web [6] or YAGO-NAGA [7], as well as different scientific workshops and benchmarks, such as the TAC knowledge-base-population track, the ERD challenge [8], or the GERBIL benchmark repository [9], have highlighted the importance of NEL for bootstrapping relationship extraction from natural language texts.

The effectiveness of those applications highly depends on the quality of the named entity detection and disambiguation step. This quality is in turn strongly influenced by the type of collected documents. For instance, digital libraries containing scientific publications have different text characteristics than collections containing news articles, social network posts, music reviews, or even entire books of fiction. Text structure, length, and topic variability are decisive for the vocabulary and the contextual information contained in the text. In our experimental evaluation, we show that most state-of-the-art algorithms lack precision, with values between 30% and 80% depending on the text type; only expensive coherence reasoning strategies lead to more reliable results. The approach presented in this paper focuses on the efficient retrieval of reliable, i.e., high-precision, alignments with a precision of approximately 90%.

Furthermore, the efficiency of NEL algorithms is a decisive factor that is usually neglected by current state-of-the-art algorithms. For instance, the runtime of NEL algorithms is important for use cases dealing with large text corpora, such as digital libraries with millions of texts, e.g., the Internet Archive, arXiv.org, or Google Books. NEL algorithms with a runtime of minutes per document would take two years to annotate such a corpus. Another efficiency-bound use case is streaming applications. Given an infinite stream of incoming documents to be annotated, e.g., blog posts, news articles, or short messages, an NEL system has to process each text approximately within the time window between two consecutive texts. Our contributions are the following:

1. We propose CohEEL, a supervised two-step model that enables reliable NEL results. The model is designed to incorporate a knowledge base and (i) automatically adapts to different input text types, (ii) incorporates arbitrary mention-scoring functions, and (iii) considers known relations between mentioned entities within documents (i.e., coherence).

2. Based on the CohEEL model, we discuss a concrete configuration that provides reliable and coherent entity alignments to the knowledge bases YAGO and Wikipedia with a precision of approximately 90%.

3. Finally, we provide an exhaustive experimental comparison of our algorithm with state-of-the-art methods with respect to (i) linking quality and (ii) runtime.

2. Problem and Prior Work

In this section we introduce basic terms and discuss the related research for the Named Entity Linking problem. A specific discussion of the competitors used in the evaluation can be found in Section 5.

2.1. The Named Entity Linking Problem

The term named entity stands for a real-world instance and is distinct from a concept, which represents a category of named entities. Hence, NEL differs from the Wikification process, introduced in [10], which deals with the automatic identification of links to Wikipedia articles (i.e., both concepts and instances). It is also distinct from word-sense disambiguation, which identifies the sense of a word in a sentence and is not focused on named entities [11, 12]. A prerequisite for NEL is the discovery of textual mentions that might refer to named entities.

Definition 1. A mention m = (D, p) is a textual reference occurring in position p within document D and referring to some named entity.

In the following we assume that such Named Entity Recognition (NER) techniques are readily available; we employ the Stanford NER tool [13] to discover mentions. Note that some related work follows another definition of NEL that includes the NER step [14].

The subsequent NEL task for a given set M(D) of mentions in document D and a set E(K) of named entities in knowledge base K is to find a partial mapping fa : M(D) → E(K), such that if m ∈ M(D) is mapped to e ∈ E(K), then indeed m refers to the real-world entity represented by e. For instance, the mention "Richard Berman" shall be mapped to the Wikipedia article of Richard M. Berman. The alignment function fa remains partial, because in some cases K might not contain the entity that is referred to by a mention. For a total function, fa can map the mention to a designated entity NIL in such cases. For better readability we use E′ to refer to the extended knowledge base entity set E(K) ∪ {NIL}, and E to refer to E(K). Furthermore, we use M to address the set of mentions of a document M(D) and refer to a mapping from mention m to entity e as an alignment, denoted (m, e).

So far, mentions are defined only as the occurrences of named entities in a document by means of unique positions. However, each mention m within a document covers different aspects of information about the entity it refers to, most importantly the surface and the context. In related research, these aspects are commonly referred to as local ranking features, because they are independent of other entity mentions in the document [15].

The surface srfc(m) of a mention m is the actual word or phrase used to identify the referenced real-world instance in the text. For instance, surfaces in Quote 1 are srfc(m1) = "U.S. District Court" or srfc(m2) = "New York". The context ctxn(m) of mention m is a multi-set of n consecutive terms surrounding the mention (including its surface). For efficiency, and because the semantic relatedness between a term and a mention decreases with increasing distance, the size n of the context ctxn is typically much smaller than the document size (n ≪ |D|). The context is usually implemented as a sliding window over the document; the mention typically occurs in the center of the window. For instance, the context of size n = 12 for the mention of "Richard Berman" is ctx12(m4) = "District", "Court", "New", "York", "Judge", "Richard", "Berman", "expects", "to", "rule", "this", "week".
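As a small illustration of this sliding-window definition, the following Python sketch extracts ctxn(m) from a tokenized document. It is not the paper's implementation; the tokenization and the handling of window boundaries are simplifying assumptions of this example.

    def ctx(tokens, mention_index, n=12):
        """Return n consecutive terms around the mention (including its surface)."""
        # Center the window on the mention and clip it at the document start;
        # the exact boundary handling is a choice of this sketch.
        start = max(0, mention_index - n // 2)
        return tokens[start:start + n]

    # Example: a 12-term context for the "Richard Berman" mention in Quote 1.
    quote1 = ("U.S. District Court New York Judge Richard Berman expects to rule "
              "this week on Brady four-game suspension appeal").split()
    print(ctx(quote1, quote1.index("Richard"), n=12))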

2.2. Related Research

Traditional NEL approaches build on different combinations of surface and context feature scores to find the most appropriate entity for each mention in the knowledge base [3, 15-19]. Zuo et al. extend this method by aggregating decisions from an ensemble of different ranking functions based on sampled subsets of the context as well as the surface ranks [20].

In addition to these mention-local characteristics, many NEL approaches are based on document-global features [15]. Because named entities within documents are usually mentioned in the context of other, often related entities, the alignments of mentions to entities in a knowledge base K should not be performed independently. Assuming that K covers salient relationships between entities, reasoning about coherent groups of entities may help resolve ambiguities in a joint entity linking process [21]. For instance, Quote 1 combines two topics, one surrounding the United States District Court for the Southern District of New York and Judge Richard M. Berman, and another one surrounding the American football fraud scandal "Deflategate" caused by New England Patriots quarterback Tom Brady. However, finding the most coherent joint alignments of the |M| mentions that co-occur in a document can lead to an NP-hard problem [21]. Therefore, state-of-the-art approaches resort to a pair-wise coherence-based reasoning and alignment strategy. Milne and Witten identify coherent entity sets by computing the relatedness of ambiguous candidates (i.e., several entities for one mention) to other unambiguous entity candidates in the document [22]. This is done by reasoning about the relatedness of candidate entities based on Wikipedia articles that are hyperlinked to the candidates' articles. Because the relatedness score is based only on the coherence with respect to unambiguous entities, the approach does not require exhaustive reasoning over the quadratic problem space (i.e., all combinations of candidate pairs). Further related work aims to improve the performance of the relatedness score by applying the Jaccard index, cosine similarity, or pointwise mutual information [15, 19]. Ratinov et al. introduce a model that is based on a linear combination of 4 local and 9 pairwise global features to rank mention candidates [15]. In a second step, the top-ranked candidate per mention is verified by using additional features retrieved from the scored candidate list, for instance the improvement of the ranker confidence with respect to the second-best candidate or the entropy of the surface probability score. Du et al. employ similarity measures that capture the average pair-wise proximity between candidate entities in the knowledge graph, as well as their average pair-wise conceptual similarity by means of the lowest-common-ancestor classes [23]. Kulkarni et al. attempt an efficient solution for finding collective annotations of entities in documents by transforming the problem into a local hill climbing algorithm [24]. Hoffart et al. exploit the key-phrase- and hypernymy-based relatedness between the candidate entities in the knowledge base to jointly link mentions occurring in the same document to the candidate entities [21, 25].

In contrast to the pairwise coherence reasoning, Han et al. approach the problem by means of random walks on the "referent graph", where each entity is assumed to share portions of its alignment probability with its knowledge base neighbors [26]. Thus, the algorithm estimates the coherence of the entire solution at once and exploits the complete graph structure, instead of aggregating the values over different candidate pairs. However, due to the exhaustive candidate selection in the algorithm, the holistic graph reasoning has to be performed over a large set of candidate entities, yielding millions of candidates for mid-sized documents and high runtimes. Agirre et al. tackle this issue by applying a more conservative candidate selection strategy (using a surface dictionary) and a sparser knowledge base graph [27]. By selecting only "reciprocal links" from Wikipedia, the number of edges can be reduced by an order of magnitude, while improving the NEL accuracy significantly. Moro et al. introduce a greedy algorithm that is based on the knowledge base BabelNet and simultaneously disambiguates named entities as well as word senses [28]. The knowledge base is curated using a random walk approach that removes seldomly hit neighbors (i.e., hit fewer than 100 times). A document is initially represented as a graph containing all possible interpretations of entity mentions as well as word references and the direct relationships among them. To disambiguate text fragments mentioning entities and word senses, the algorithm iteratively removes the most weakly connected interpretation of the most ambiguous mention to build a dense subgraph representation of the document. Following this iterative process, the densest subgraph is defined as the graph with the highest average degree. The remaining candidates of the densest subgraph are scored based on their degree in the dense graph (w.r.t. fragments as well as other candidates). The disambiguation result is retrieved by selecting the highest-scored meaning for each fragment. The authors furthermore suggest the application of a threshold to remove uncertain meanings. In contrast to this procedure, Guo and Barbosa introduce a greedy approach that iteratively increases the solution set of the disambiguation task [29]. The algorithm represents entities and documents as "semantic signatures". A signature is defined as the stationary probability distribution of a random walk. For entities, the random walks are performed on the first- and second-degree neighbors, where the neighborhood size is narrowed by removing neighbors with a degree below 200. The document neighborhood is built according to the already disambiguated entities as well as the candidate entities in the document. To improve the runtime of the algorithm, the candidate set per mention is limited to 20 entities that are selected based on a surface and a context compatibility score. Mention candidates are disambiguated iteratively: starting with the mention with the smallest candidate set, the algorithm selects an entity based on a similarity measure that builds on the KL-Divergence between document and candidate signatures. For the next iteration, the selected candidate is added to the set of disambiguated entities.

The named entity linking model introduced in this work is based on a two-phase strategy: a classification phase and a random-walk-based reasoning phase. Due to its supervised learning approach, our model is able to adequately prune the candidate space and to adapt to different types of texts to provide steadily high-precision results. The approach does not rely on specific attributes of any knowledge base and is neither based on handcrafted thresholds nor on manually optimized combinations of different scores (e.g., neighborhood relevance degree or alignment uncertainty thresholds). The initial classification enables efficient coherence reasoning over a strongly reduced entity set in the iterative exploitation phase using random walks.

3. Identifying Reliable and Coherent Entity Alignments

CohEEL (Coherent and Efficient Named Entity Linking) is an abstract model for reliable and coherent entity linking. It makes use of a knowledge base that provides the set of entities to be discovered in arbitrary documents. We therefore assume that the knowledge base covers relevant inter-entity relations that allow reasoning about entity interdependencies (such as links in Wikipedia, or co-author relationships and citations in bibliographic knowledge bases).

Figure 1 gives an overview of our approach. The first step, Candidate Classification, is based on different supervised learning methods and is responsible for an efficient mention-wise proposal of alignments by:

• proposing high-precision seed alignments S, i.e., the corresponding mentions are linked to these entities, and by

• proposing high-coverage candidate alignments C, i.e., more reasoning is needed about the linking of the mentions with respect to these entities.

The candidate classification is essential to prune the candidate space and to perform coherence reasoning efficiently.



[Figure 1: The mentions M of a document enter the Candidate Classification step, where the Seed Finder and the Candidate Finder rate each mention-entity pair using a feature combination Φ = f ⋈ f′ ⋈ ... ⋈ λf ⋈ ... and emit the seed alignments S and the candidate alignments C; the Judicious Neighborhood Exploration then combines S and C into the final alignments A.]

Figure 1: Two-phase model of CohEEL creating a set of alignments A for a given set of mentions M from one document.

The second step, Judicious Neighborhood Exploration (JNE), cautiously extends the seed set with high-coverage candidate entities (from the first step) that are related to the seed entities, i.e., candidates contained in the seed communities within the knowledge base. Hence, this step improves the recall significantly while providing a comparable precision of the found alignments. Both steps are explained in detail in the following sections.

3.1. Candidate Classification

The candidate classification step is an important step towards efficient coherence reasoning. In theory, a mention can be linked to every entity in the knowledge base. However, only a small number of entities have features that make them promising candidates. A candidate classifier is meant to prune the candidate space to a viable size. For the mention-wise proposal of entities, the rating for the match of a mention and a knowledge base entity is computed by means of various scoring functions.

3.1.1. Scoring Functions

Definition 2. A scoring function f : M × E′ → R quantifies the affinity between a mention m ∈ M and a knowledge base entity e ∈ E′.

Intuitively, the outcome of every scoring function f(m, e) should be correlated with the actual probability that m represents the real-world entity e in the given text. First-order scores are derived from various features of mentions and entities, e.g., the compatibility of the mention surface and the entity name, or the similarity between the mention context and the entity information (see the discussion of mention-local and document-global characteristics in Section 2). The set of actually used scoring functions mainly depends on the information provided by the knowledge base. For instance, contextual similarity functions can be applied only for knowledge bases containing textual entity features. For instance, the context ctx12(m4) of the mention "Richard Berman" might yield important indications that the referred entity is judge Richard M. Berman and not the more prominent lobbyist Richard B. Berman. Many scoring functions have been discussed in previous work. Ratinov et al. introduce thirteen different scoring functions based on mention-local as well as document-global features [15]. Piccinno and Ferragina provide more than 70 features in their WAT framework [19]. Because the candidate classification step is performance-oriented, scores have to be retrieved efficiently (i.e., without complex iterative coherence reasoning phases). For instance, in our experiments, we limit the basic scoring functions to a surface and a context compatibility score, as well as an efficient relatedness measure (see Section 4.2).

Additionally, CohEEL is able to incorporate higher-order scoring functions that quantify the affinity between mention m ∈ M and knowledge base entity e ∈ E′ based on a scoring function f, i.e., λ : ((m, e), f) ↦ y ∈ R, for short: λf(m, e). An example of a higher-order score is the function λrank_f. It is defined as the rank of entity e in the candidate list for m that is ordered by the scores of a first-order function f. In this case, the rank of a candidate entity is a feature with additional implicit mention-local information, i.e., how many other entities have higher f scores, which can be useful for the linking decision. Ratinov et al. apply two higher-order scores to remove uncertain alignments from the solution set [15]: the entropy of the prior probability (surface compatibility) and the confidence gain of the best candidate compared to the second match.
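To illustrate, such a rank-based higher-order score can be derived from any first-order score in a few lines. The sketch below is our own minimal example; the candidate names and score values are invented:

    def rank_score(f_scores, entity):
        """lambda^rank_f: rank of `entity` among m's candidates, ordered by f."""
        ordered = sorted(f_scores, key=f_scores.get, reverse=True)
        return ordered.index(entity) + 1  # rank 1 = highest first-order score

    # Hypothetical surface prominence scores for the mention "Brady".
    f_prom = {"Brady,_Texas": -0.7, "Tom_Brady": -1.4, "Brady,_Nebraska": -2.3}
    print(rank_score(f_prom, "Tom_Brady"))  # -> 2, i.e., one candidate scores higher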

To not restrict the model to a specific knowledge base, we have to make sure that CohEEL relies neither on specific attributes of the knowledge base, nor on specific scoring functions. Instead, it learns to incorporate a combination of arbitrary functions (such as the ones above) and automatically derives a good feature combination based on a representative training sample to judge further alignments. Hence, in contrast to other approaches, CohEEL is not based on handcrafted rules or thresholds that combine a specific set of scoring functions. This is important, because a one-size-fits-all solution is unrealistic for the broad variety of available document types. The diversity of the employed scoring functions allows CohEEL to judge different relatedness aspects between mentions and entities.

As stated earlier, CohEEL applies candidate selection to prune the alignment candidate space for the JNE phase. For this, we cast the selection into a binary classification problem. Based on a training set, two selection models are learned: a cautious one, which finds high-precision seed alignments, and a brave one, which produces high-coverage lists of candidate alignments. Note that the evidence gained by various scoring functions enables the training of classifiers for well-founded decisions that are superior to thresholds or rules based on single features.

3.1.2. Seed Finder

With the help of different first-order scoring functions f, f′, ... and higher-order scoring functions λ, λ′, ..., CohEEL is able to generate a labeled feature vector φ per alignment:

    φ(m, e) ↦ ⟨ f(m, e),  λf(m, e),  λ′f(m, e),  ...,
                f′(m, e), λf′(m, e), λ′f′(m, e), ...,
                f′′(m, e), ... ⟩

Based on a set of training documents (including correct alignments), a binary classifier is trained that assigns class labels to an arbitrary alignment (m, e), namely the label correct (positive) or incorrect (negative). To influence how cautiously or bravely the classifier assigns class labels to alignments, the concept of loss matrices is applied. During training, the classification model is adjusted to minimize the expected loss, defined as the sum of costs produced over all four classes of the confusion matrix [30]. In CohEEL, proper classifications are not punished: the loss matrix L is used to weight the different types of misclassifications (type I/II errors), and the costs for true positives (Ltp) and true negatives (Ltn) are set to zero.
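A compact way to read the feature vector definition is as nested loops over the scoring functions; the following sketch assembles φ(m, e) in exactly that order. The function names are placeholders for whatever scoring functions a concrete configuration provides:

    def feature_vector(m, e, candidates, first_order, higher_order):
        """Build phi(m, e): each first-order score f(m, e) is followed by all
        higher-order scores lambda_f(m, e) derived from it."""
        phi = []
        for f in first_order:
            phi.append(f(m, e))
            for lam in higher_order:
                phi.append(lam(m, e, candidates, f))
        return phi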

The first classifier in CohEEL is the Seed Finder. It is configured to identify at most one entity per mention with high precision.

Definition 3. Seed alignments are reliable, non-contradicting alignments, i.e., alignments with a high probability of being correct.

The entities of seed alignments (seeds for short) are subsequently used to deduce further correct alignments. For instance, the mention "Deflategate" in Quote 1 might be easily linked with the eponymous Wikipedia article, because no other articles are related to this name. Furthermore, the Seed Finder might be able to detect that "Richard Berman" refers to the same entity as the article of U.S. District Court judge Richard M. Berman but not the disputed lobbyist Richard B. Berman. This can be achieved by considering surface and specifically context compatibility scores (e.g., the context ctx12(m4) mentioned in Section 2.1 is very similar to the judge's Wikipedia article). Note that this definition of seeds strongly differs from previous work where unambiguous entities were used instead (e.g., for the relatedness definition of Milne and Witten [22], see Section 4.2). For instance, there is only one matching Wikipedia article available for the mention "Deflategate". Following our definition, an ambiguous entity might be chosen as a seed if the scores indicate one entity being a good choice, e.g., for the mention "Richard Berman". Furthermore, unambiguous entities might be discarded as seeds if the contextual score is very low.

The transformation of the original problem into a classification problem allows us to configure CohEEL such that it yields high precision. This is achieved by applying the loss matrix Ls, which defines higher costs for false positive alignments (Ls_fp) than for false negatives (Ls_fn). In this way we achieve a bias of the classifier decisions towards higher precision. For instance, a cost model with a ratio of Ls_fp / Ls_fn = 9 punishes false positives nine times more strongly than false negatives and guides the learner to create more reliable classifiers with an expected precision E(Pmin) of around 90% or more. Hence, we can define the cost model as follows:

    Ls_fp = E(Pmin) / (1 − E(Pmin)),  and  Ls_fn = 1

However, the classification results (i.e., in terms of the confusion matrix) are not directly transferable to the Seed Finder performance, because the classifier might identify several candidates per mention as correct. These contradicting alignments are not permitted; hence, the Seed Finder is required to select at most one entity per mention. This problem becomes more severe in case of lower cost ratios Ls_fp / Ls_fn, which let the classifier make braver decisions. To resolve these contradicting cases, we decided to reject all of the involved entities. There are various other strategies (e.g., randomly pick one entity, select the entity with the highest scores, ...); however, our implementation prevents a decrease of precision of the seed alignments due to the addition of uncertain alignments.

3.1.3. Candidate Finder

In contrast to the Seed Finder, the goal of the Candidate Finder is to identify a set of entities Cm per mention m.

Definition 4. A candidate alignment set Cm for a mention m contains with high probability the alignment with the actual entity e referred to by m.

The Candidate Finder is configured to create a candidate list with a high coverage rate (recall). A straightforward implementation would be an "always correct" classifier that declares all entities from the knowledge base as candidates (Cm = E). However, for practical reasons such a classifier would be useless, because the following JNE phase would be costly and inaccurate. Hence, another important goal is to prune the candidate lists per mention without losing correct entities. Given a specific maximal expected candidate set cardinality E(|Cm|) for the Candidate Finder, the appropriate loss matrix Lc is given by

    Lc_fp = 1,  and  Lc_fn = E(|Cm|) − 1

and maximizes the number of true positives while limiting the average candidate set size to E(|Cm|). The intuition behind this cost model is simple: to minimize the risk of missing a correct candidate (fn), the classifier accepts n times more incorrect candidates (fp), where n = E(|Cm|) − 1. Note that the maximum expected candidate set cardinality E(|Cm|) also has a large influence on the runtime behavior of the second phase of the algorithm (i.e., the neighborhood exploration). The influence of E(|Cm|) on the runtime of CohEEL is further discussed in the next section.
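In practice, such loss matrices can be realized with a cost-sensitive learner. The sketch below maps both cost models to class weights of a scikit-learn decision tree; the choice of classifier and the value E(|Cm|) = 20 are assumptions of this illustration, not the configuration evaluated in the paper:

    from sklearn.tree import DecisionTreeClassifier

    def seed_finder_weights(expected_precision=0.9):
        # L^s_fp = E(P_min) / (1 - E(P_min)),  L^s_fn = 1
        return {0: expected_precision / (1.0 - expected_precision), 1: 1.0}

    def candidate_finder_weights(expected_candidates=20):
        # L^c_fp = 1,  L^c_fn = E(|C_m|) - 1
        return {0: 1.0, 1: expected_candidates - 1.0}

    def train(X, y, weights):
        # X: feature vectors phi(m, e); y: 1 for correct, 0 for incorrect alignments.
        # Misclassifying a class-0 sample (a false positive) costs weights[0], and vice versa.
        return DecisionTreeClassifier(class_weight=weights).fit(X, y)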

For instance, given the "U.S. District Court" mention surface in Quote 1, the number of compatible Wikipedia articles exceeds 90. Considering the mention context, containing the terms "New" and "York", the Candidate Finder might be empowered to prune the candidate set to the four courts of New York, namely the courts for the Northern, Eastern, Southern, and Western District of New York.

In the next step, CohEEL leverages the seeds to establish new alignments to candidate entities such that the coherence (in terms of semantic relations) between all the entities in the alignments is maximized.

3.2. Judicious Neighborhood Exploration

Typically, documents contain only a few distinct groups of topically related entities. The goal of the neighborhood exploration is to identify candidates that are semantically related to the identified seeds for a given document. The prerequisite for this step is a knowledge base that covers relevant inter-entity relations. NEL algorithms are able to deduce further alignments from a given set of entity assignments by exploiting such knowledge. However, there is a trade-off between quality and completeness: the more aggressive the deduction process is, the more additional alignments can be found, but the higher the risk of finding incorrect alignments. Furthermore, reasoning over a set of entity candidates for a large document might be computationally expensive.

To efficiently find coherent candidate entities, we apply an iterative exploration model that in each iteration uses the current set of seeds to efficiently discover the most coherent candidate and adds it to the seed set for the next iteration. To measure the coherence between seeds and candidates, we apply a Random Walk with Restarts (RWR) algorithm [31]. Random walks on graphs estimate the probability ps^(t)(n) of finding a random walker at vertex n at time t after starting from vertex s. The steady-state probability ps(n) can be viewed as a measure of affinity between s and n. For us, ps(n) represents the probability of finishing at entity n after following knowledge base links starting from entity s, and thus represents the proximity between s and n. In other words, we define coherence as the probability that a random walker that starts at a seed entity (e.g., a Wikipedia article) and follows knowledge base links (hyperlinks to other articles) finishes at a candidate. The application of random walks has the following benefits: (i) it captures the structure of the complete neighborhood graph, and (ii) in comparison to traditional graph distances, it captures all facets of relationships between two nodes in the graph. The steady-state probabilities of a random walk correspond to the eigenvector p of the neighborhood transition matrix and can be efficiently approximated using the power method [32].

Algorithm 1 outlines the details of our approach. Given a set S of seed alignments and a set C of candidate alignments, the approach tries to identify a set of coherent alignments A. The functions M(A) and E(A) retrieve the union of all mentions/entities within one set of alignments, and NG(A) retrieves the transition matrix of the neighborhood of already disambiguated entities in the knowledge base. The neighborhood graph is a subgraph of the knowledge base, containing all first- and second-order neighbors of disambiguated entities. More details about the construction are given in the next subsection.

Between Lines 4 and 22 the iterative exploration is executed by extending A with the alignments of the most promising candidate entity ei. It finishes if no additional alignment could be added.

Algorithm 1: Judicious Neighborhood Exploration

Data:   seed alignments S = {s1, s2, ...}, candidate alignments C = {c1, c2, ...},
        where S, C ⊂ P(M × E)
Result: alignments A = {a1, a2, ...}, where A ⊆ S ∪ C

 1  A := S;
 2  A′ := ∅;
 3  /* running iterative exploration */
 4  while A′ ≠ A do
 5      /* remove candidates with covered mentions */
 6      C := C \ (M(A) × E);
 7      /* transition matrix of the knowledge base neighborhood of E(A) */
 8      N := NG(A);
 9      /* get starting vector with the same start probability for each entity in A */
10      si := |E(A)|^−1 if ei ∈ E(A),  si := 0 otherwise;
11      p := s;
12      it := 0;
13      /* power iterations until convergence */
14      repeat
15          p′ := p;
16          p := (1 − α) N p + α s;
17          it := it + 1;
18      until fθ(‖p − p′‖, it);
19      A′ := A;
20      /* add all alignments of the candidate entity e with highest affinity (max pe, pe > 0) */
21      A := A ∪ (C ∩ (M × argmax_{e ∈ E(C)} pe));
22  end

In Lines 6-12 all necessary variables for the RWR are initialized, and in Lines 11 to 18 the power iterations are executed to identify the eigenvector p of the neighborhood graph until the convergence criterion is fulfilled. As parameters for the random walk, we suggest using the established values α = 0.15 and θc = 10^−8. While the former parameter influences the skew of the steady-state probabilities towards the seed entities, the latter mainly influences the runtime of the random walk. We discuss the convergence separately.

The eigenvector of the transition matrix represents the steady-state probabilities of all candidates in the neighborhood after starting an RWR from A. Consequently, Line 21 extends the set of seeds for the next RWR by the most probable candidate alignments, i.e., those with the highest affinity from the prior alignments. This step can be understood as a maximum a posteriori estimation of the candidate entities. Note that we do not use any confidence threshold to narrow the alignments, but add each valid candidate having a non-zero steady-state probability (pe > 0). A probability of zero means that the entity cannot be reached during a random walk and is thus unrelated.
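The inner loop of Algorithm 1 (Lines 10-21) can be condensed into a short NumPy sketch: a power iteration with restarts followed by selecting the candidate with the highest steady-state probability. This is an illustrative re-statement under our own conventions (column-stochastic matrix, integer node indices), not the paper's code:

    import numpy as np

    def rwr_top_candidate(N, seed_idx, cand_idx, alpha=0.15, theta_c=1e-8, theta_it=None):
        """Random Walk with Restarts from the seed entities E(A) on matrix N."""
        s = np.zeros(N.shape[0])
        s[seed_idx] = 1.0 / len(seed_idx)         # uniform start probability over E(A)
        p, it = s.copy(), 0
        while True:                               # power iterations (Lines 14-18)
            p_prev = p
            p = (1 - alpha) * N @ p + alpha * s
            it += 1
            if np.linalg.norm(p - p_prev) <= theta_c or (theta_it and it > theta_it):
                break
        best = max(cand_idx, key=lambda e: p[e])  # highest-affinity candidate (Line 21)
        return best if p[best] > 0 else None      # unreachable candidates stay unlinked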

To finally create a complete result set for a document, the alignments A are extended by additional NIL alignments if (i) neither the Seed Finder nor the Candidate Finder is able to identify a promising entity for mention m, or (ii) the JNE phase did not identify any relevant connection from any alignment entity in E(A) to an entity in the candidate list Cm for mention m. Specifically, (ii) is rather conservative, but it ensures a high linking quality (we do not link candidates without strong indication) and underlines the focus on high-precision results. A braver version could choose the candidate based on candidate classifier scores to increase recall while adding potential false positive alignments. In the following, NIL alignments are treated as alignments to entities with no representation in the knowledge base.

3.2.1. Neighborhood construction

The neighborhood graph is defined as the subgraph of the knowledge base containing all direct neighbors of all relevant entities (i.e., E(A ∪ C)) as well as the second-order neighbors (neighbors of neighbors) of disambiguated entities in the document. Figure 2a depicts the reduced neighborhood of the legal topic discussed in Quote 1. The mention "Richard Berman" is represented by the dark gray entity Richard M. Berman aligned by the Seed Finder (see discussion in Section 3.1.2). The mentions "U.S. District Court" and "New York" are not disambiguated yet. However, as discussed in Section 3.1.3, the Candidate Finder identified four potential alignments for the U.S. Court mention (out of 94 courts listed in Wikipedia): the U.S. District Courts for the Northern, Eastern, Southern, and Western District of New York, as well as two candidates for the mention New York: the U.S. state and the populous city. Because two of the four U.S. District Court entities (Northern and Western District) are not part of the second-order neighborhood of Richard M. Berman, we do not show them in the graph. These two candidates would end up with a steady-state probability of zero. Furthermore, for brevity and clarity of the figure, we added only a subset (at most two) of neighborhood entities (white nodes) for each seed or candidate.

To be able to perform the random walk more efficiently, we compress the neighborhood graph without influencing the results of the RWR ranking (see Figure 2b). Because the ranking is only relevant for candidate entities, we can merge all "dead ends" in the graph into a new default node K. Dead ends are vertices that do not refer (directly or indirectly) to any relevant entity. They are either sinks or refer exclusively to other dead ends.

To retain the degree of the remaining edges, we represent the neighborhood as a multigraph where the number of edges to the dead-end node corresponds to the number of merged neighbors. This is important to maintain the structural properties of the graph for the random walk. The K node can be interpreted as the state of a surfer who has left the direct neighborhood graph and is browsing the rest of the knowledge base. We are aware of the fact that the probability of a random walker returning from K to the graph is not zero, but would be correlated to the (global) in-degree of the neighborhood entities. However, because we do not want to favor hub entities in the neighborhood (i.e., New York has a much higher in-degree than the U.S. Courts), we decided to ignore this fact. The return of a surfer to the neighborhood is modeled as a restart of the surfer from one of the already aligned entities E(A).

In a final step, we represent the neighborhood graph as a stochastic transition matrix N. It is built based on the matrix representation of the graph. The matrix contains integer elements that represent the number of edges between two nodes.


[Figure 2: (a) Knowledge base neighborhood — seed Richard M. Berman; candidates District Court NY South, District Court NY East, New York City, and New York State; and neighbor entities such as Davis Polk & Wardwell, United States federal judge, Manhattan, Tucker Act, Mid-Atlantic region, United States, and Brooklyn. (b) Compressed neighborhood — the same seeds and candidates with all dead ends merged into the node K and edge multiplicities retained.]

Figure 2: Two graph representations during neighborhood graph construction of the JNE for entities forming the legal topic subgraph of Quote 1.

To guarantee convergence of a random walk, we have to prevent the graph from containing dangling nodes or being periodic. For now, at least K is a dangling node. The easiest way to ensure both properties is by adding "stalling edges". Broadly speaking, a stalling edge models a pausing of the surfer on the same article. In a final step, the matrix is normalized such that the entries contain actual transition probabilities. An executed random walk on N would yield a new alignment from "U.S. District Court" to the entity of the court for the Southern District of New York (pe = 5.8%), which is ranked above the Eastern District court (pe = 1.0%), as well as above the City of New York (pe = 5.7%) and New York state (pe = 4.4%).
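A minimal sketch of this construction, under the assumption that the compressed multigraph is given as an integer edge-count matrix, could look as follows; it adds the stalling self-loop to the dead-end node K and normalizes the columns into transition probabilities:

    import numpy as np

    def transition_matrix(counts, k_index):
        """counts[i, j] = number of edges from node j to node i; k_index = node K."""
        N = counts.astype(float)
        # K has no outgoing edges; a stalling self-loop removes the dangling node.
        if N[:, k_index].sum() == 0:
            N[k_index, k_index] = 1.0
        col_sums = N.sum(axis=0)
        col_sums[col_sums == 0] = 1.0             # guard against other isolated nodes
        return N / col_sums                       # column-stochastic transition matrix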

3.2.2. Convergence

The convergence of an RWR is usually modeled by observing the change of the probability vector in the last power iteration, Δp = ‖p − p′‖. If this change between iterations falls below a tolerance level (Δp ≤ θc), the RWR is defined to be converged.

Lemma (Rate of Convergence [32]). Given a non-bipartite graph, a Random Walk with Restart - with a restart probability α and a convergence tolerance level θc - converges approximately after log_{1−α}(θc) power iterations.
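The bound follows from the fact that the approximation error shrinks roughly geometrically with factor 1 − α per iteration; the following is a sketch of the standard argument rather than a formal proof:

    (1 − α)^k ≤ θc  ⟺  k ≥ log_{1−α}(θc) = ln(θc) / ln(1 − α).

For α = 0.15 and θc = 10^−8 this yields k ≥ ln(10^−8) / ln(0.85) ≈ 113 iterations, matching the figure quoted below.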

For the common values α = 0.15 and θc = 10^−8, an RWR converges on average after roughly 113 iterations. To further reduce the number of matrix operations, we use the following trick: because we employ the steady-state probabilities of the random walk to rank candidate entities, the random walk can already be terminated once the candidate ranking is stable. However, for different document types, the number of iterations leading to a stable selection of the most probable candidate entity might differ (i.e., due to the varying number of ambiguous mentions and candidates per mention). Therefore, we use characteristics derived from the data used to train the Seed and Candidate Finder (see Section 4.4). We deduce the number of iterations θit after which the top-ranked candidate entity is assumed to be stable. To learn the threshold, we propose the following heuristic: the JNE phase is executed for all documents of the training set until convergence. For each document, we derive the iteration it_s at which the top-ranked entity is stable. The value of θit is then set to the 99th percentile over all it_s of the training set documents. The 99th percentile is used to find a reliable threshold that does not depend on outliers. We are aware of the fact that this early termination might decrease the linking quality of our approach, but we clearly favor the runtime benefits. We define the convergence criterion as:

    fθ(Δp, it) := true,  if Δp ≤ θc or it > θit
                  false, otherwise

Given a reasonably low threshold θit (see also Section 4.4), the runtime of each RWR is driven by the dimensionality of the compressed transition matrix N. The number of seed entities is limited by the number of mentions |M|. Assuming a fixed entity candidate set size E(|Cm|) provided by the Candidate Finder, the entity count is bound by the mention count instead of the knowledge base size. Hence, the dimensionality of N is quadratically limited by the number of mentions. Furthermore, the number of steps for the iterative exploration is also constrained by the number of mentions, because each iteration adds at least one alignment.

3.2.3. Discussion

In this section we showed how to identify the most promising candidate entities by applying an iterative exploration model. The model efficiently discovers the most coherent candidate with respect to a set of seeds in each iteration. This is a rather conservative procedure in comparison to exploring the complete knowledge base graph for any connections between candidates, as proposed by Han et al. and Agirre et al. [26, 27]. CohEEL surveys only the first- and second-order seed communities to find matching candidates.

Guo and Barbosa follow a similar incremental strategy by selecting one entity at a time and re-running the random walk over the new seed set. Their selection strategy differs in two aspects [29]: first, their algorithm selects the most appropriate candidate from the least ambiguous mention per iteration, instead of the most coherent candidate in the document. Second, instead of picking the entity with the highest random walker visiting probability, the approach measures the compatibility of document and candidates based on semantic signatures. A signature is the steady-state probability vector of a random walk starting from the candidate entity (SSe) or from the disambiguated entities of the document (SSd, similar to our seeds). Hence, these two distributions represent the probability that a random walker ends at a specific entity in the neighborhood of e or d. The signatures are then compared using the Zero-KL Divergence, a non-symmetric measure to estimate the information loss when SSd is used to approximate SSe. Thus, the approach selects the candidate whose neighborhood distribution is best represented by the neighborhood of the already disambiguated entities. Practically speaking, Guo and Barbosa rank candidates based on the likelihood that a random walker starting from the seeds ends with the same probability at an entity as a walker starting from the candidate, whereas we rank based on the likelihood of hitting a candidate during a random walk starting from the seeds. Note that we do not claim that our interpretation of coherence yields better results; however, it is more conservative and specifically considers directed knowledge base relationships. This interpretation furthermore yields an advantage in efficiency. By applying our neighborhood compression without affecting the steady-state probabilities of any candidate, we are able to shrink the random walk graphs. Because CohEEL executes at most |M| random walks per document, each linking one mention per iteration, the runtime depends only on the compressed neighborhood of the already found alignments. By applying θit, CohEEL is able to minimize the runtime of each individual random walk.

4. Applying CohEEL

We demonstrate the value of CohEEL by applying it, as an example, to a combination of the open knowledge bases YAGO and Wikipedia. Because YAGO is based on Wikipedia and contains links to the Wikipedia entities, an integration of both knowledge bases is straightforward. Note that because we solve the NEL problem, ontological entities, such as <yago:person>, and facts and relationships, such as <rdf:type> or <rdfs:subClassOf>, are ignored; previous work discusses the application of ontological constraints, as presented by Dalvi et al. [33]. Most of the discussed parameters are generic and could be applied to other knowledge bases. However, some parameters also depend on specific knowledge base features (i.e., scoring functions). Furthermore, we show that the CohEEL model is able to autonomously adapt to different text types, providing a steady linking quality.

We now provide a discussion of all parameters of CohEEL, namely Scoring Functions (Section 4.2), Candidate Classification models (Section 4.3), and Neighborhood Exploration strategies (Section 4.4). Within each section, we provide experiments that compare the influence of the individual parameters and provide a configuration used for the subsequent experiments. A detailed comparison of the discussed configuration with state-of-the-art algorithms is presented in Section 5.

4.1. Experimental Setup

The experiments in the following sections are based on ground truth alignments of text mentions to the knowledge base YAGO [2], which were also used in the AIDA project [21, 25]. The aligned entities comprise all instances from YAGO, but do not include the concepts of the knowledge base.

4.1.1. Datasets

As discussed earlier, different types of document collections exhibit different text characteristics. Table 1 shows the properties of the three datasets used to evaluate CohEEL's adaptability:

The news article dataset contains 100 randomly picked Reuters articles from the CoNLL-YAGO dataset [20]. The articles were manually aligned with entities from YAGO.

The encyclopedic text corpus consists of 335 Wikipedia articles selected in 2006 [17]. The annotated named entities in this dataset are retrieved by translating the linked Wikipedia pages within each article to the corresponding YAGO entities.

The synthetic micro corpus consists of 50 short text snippets and was introduced in the AIDA project [25]. Every text snippet consists of a few (usually one) hand-crafted sentences about different ambiguous mentions of named entities and has properties similar to the content of microblogging platforms, such as Twitter.

4.1.2. Quality Measures

We measure precision P, recall R, specificity S, and the Fβ-measure to investigate the varying linking quality of the tested approaches.

Table 1: Dataset overview.

                   texts   mentions per text   words per text
  news               100                24.4            171.4
  encyclopedic       335                15.1            365.5
  micro               50                 3               12.6

Given the gold standard alignments G for a dataset, we first distinguish between entity alignments (Ge) and NIL alignments (GNIL). Entity alignments link to an actual entity e ∈ E in the knowledge base, whereas mentions linking to NIL indicate that the actual entity meaning is not present in the knowledge base. Besides the gold standard, each algorithm performs the NEL job and generates two corresponding sets (Ae and ANIL). The quality measures can be calculated as follows:

    P = |Ae ∩ Ge| / |Ae|            R = |Ae ∩ Ge| / |Ge|

    S = |ANIL ∩ GNIL| / |GNIL|      Fβ = (1 + β²) · P · R / (β² · P + R)

These definitions are basically the micro-averaged versions over all corpus documents. Whereas precision measures the fraction of correctly linked entities in the set of found alignments (Ae), the recall (also known as sensitivity) measures the proportion of correctly linked entities of the gold standard (Ge). A high sensitivity indicates a low number of type II errors. Furthermore, specificity measures the proportion of correctly identified NIL alignments; a low specificity reveals a high number of type I errors. Thus, it is easy to increase the performance with respect to one measure by lowering another one. For instance, replacing all alignments in ANIL with links to a prominent entity might yield an increase of sensitivity (recall) but a decrease of specificity. Furthermore, this strategy might decrease the precision of the result. In addition to these three measures, we show F-measure figures. F1 is defined as the harmonic mean of precision and recall, whereas F.25 favors precision over recall. To compare the actual linking quality and eliminate the influence of the NER quality, all results are based on an identical set of mentions over the tested documents.
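The measures can be computed directly from the alignment sets; the following sketch is a straightforward micro-averaged implementation of the formulas above (it treats alignments as (mention, entity) pairs and is not the evaluation script used for the experiments):

    def quality(A_e, A_nil, G_e, G_nil, beta=1.0):
        """Precision, recall, specificity, and F_beta from predicted and gold sets."""
        P = len(A_e & G_e) / len(A_e) if A_e else 0.0
        R = len(A_e & G_e) / len(G_e) if G_e else 0.0
        S = len(A_nil & G_nil) / len(G_nil) if G_nil else 0.0
        F = (1 + beta**2) * P * R / (beta**2 * P + R) if (P + R) else 0.0
        return P, R, S, F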



4.2. Scoring Functions

As discussed in Section 3.1.1, CohEEL applies a set of scoring functions to qualify the compatibility between mentions and entities.

4.2.1. First-Order Scores

Based on the knowledge base features, we apply three well-known scoring functions that are based on different entity features and have proved successful in various related work: (i) the surface prominence function fprom relies on the entity name variations, (ii) the context scoring function fctx is based on the textual knowledge about the entities, and (iii) the relatedness scoring function frel qualifies the compatibility to other aligned entities in the document.

The surface prominence score fprom(m, e) is based on the compatibility of the mention surface srfc(m) and entity e. In related work it is commonly referred to as commonness, mention-entity prior, surface form, or anchor-title compatibility [15, 17, 18, 20, 22, 34]. It computes a score for the mention surface srfc(m) and entity e by leveraging the information contained in the Wikipedia link labels. A link label is a textual mention that is hyperlinked to a Wikipedia article describing a named entity. To derive candidate entities for a given surface, we compute the relative frequency by which a Wikipedia article is hyperlinked from any occurrence of this label. Hence, the surface prominence score can be seen as an estimate of the conditional log-probability of entity e given the surface of mention m: fprom(m, e) = log P(e | srfc(m)). Note that entities never referred to by m would end up with a log-probability of −∞; these entities are not considered as candidates and are ignored by default.
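As an illustration, the following sketch estimates such a prominence score from pre-aggregated link-label counts; the count table and its toy values are hypothetical and only chosen to mirror the 4% vs. 20% situation for "Brady" discussed below.

```python
import math
from collections import defaultdict

# Hypothetical aggregated statistics: label_counts[surface][entity] holds how often the
# Wikipedia label `surface` is hyperlinked to the article of `entity` (toy numbers only).
label_counts = defaultdict(lambda: defaultdict(int))
label_counts["Brady"]["Brady,_Texas"] = 200
label_counts["Brady"]["Tom_Brady"] = 40
label_counts["Brady"]["The_Brady_Bunch"] = 760

def f_prom(surface, entity):
    """Surface prominence score: log of the relative link frequency P(e | srfc(m))."""
    total = sum(label_counts[surface].values())
    count = label_counts[surface][entity]
    if total == 0 or count == 0:
        return float("-inf")  # entity never referred to by this surface -> not a candidate
    return math.log(count / total)

print(f_prom("Brady", "Tom_Brady"))     # log(0.04): low prominence for the quarterback
print(f_prom("Brady", "Brady,_Texas"))  # log(0.20): the Texan city is more prominent
```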

The above scoring function considers only the "prominence" of entities with respect to the mention surfaces. For the running example of mention "Brady" in Quote 1, the prominence score yields a relatively low value for the New England Patriots quarterback: the quarterback is linked in only 4% of all "Brady" mentions, while the Texan city is referred to in 20% of the cases. By additionally exploiting contextual information, CohEEL is able to gain evidence that the mention should nevertheless be aligned to the football player.

Second, to efficiently exploit the context of mentions, this configuration builds on statistical language models, which have been widely used in information retrieval for ranking result documents for keyword queries [35]. For all knowledge base entities, a language model is built based on the relative frequency of terms that occur in their respective Wikipedia articles. The statistical language model L(θe) of an entity e represents a probability distribution over terms (word unigrams) occurring in the context of the entity. Given the textual context c = ctxn(m) of a mention, for all instances e that have s = srfc(m) as a label, CohEEL estimates the probability that c is generated by L(θe), i.e., fctx(m, e) = log P(c | L(θe)), and scores alignments based on these log-probabilities. The context of a mention is represented by the terms that surround it in the document. As proposed by Pedersen et al. [16], we set the range of terms surrounding the mention to n = 50. In previous research, similar approaches were used to measure the context compatibility in terms of cosine similarity between mention context and entity texts [15, 29] or as a probability score denoting that the context is generated by a candidate article [18–20, 26].
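A rough sketch of such an entity language model is given below; the add-one smoothing and the whitespace tokenization are assumptions made for illustration only, as the configuration above merely prescribes unigram log-probabilities over the n = 50 surrounding terms.

```python
import math
from collections import Counter

def build_language_model(article_text):
    """Unigram model L(theta_e) over an entity's (toy) Wikipedia article text."""
    tokens = article_text.lower().split()
    return Counter(tokens), len(tokens)

def f_ctx(context_terms, model, vocab_size=100_000):
    """log P(c | L(theta_e)); the add-one smoothing is our assumption for the sketch."""
    counts, total = model
    return sum(math.log((counts[t.lower()] + 1) / (total + vocab_size))
               for t in context_terms)

# Toy entity descriptions; in the real setting the n = 50 surrounding terms are used.
beckham = build_language_model("David Beckham footballer married Victoria children "
                               "Brooklyn Romeo Cruz Harper")
king_david = build_language_model("David king of Israel biblical figure Jerusalem")

context = ["Victoria", "children", "Brooklyn", "Romeo"]
print(f_ctx(context, beckham) > f_ctx(context, king_david))  # True: Beckham fits better
```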

Third, we exploit the context of the mention by considering the relations between entity candidates within a document. This is done based on the relatedness measure introduced by Milne and Witten [22]. The score calculates the pairwise relatedness between two entities based on the sets of Wikipedia articles referring to the two entities. To avoid a cyclic definition of the relatedness measure, Milne and Witten propose to calculate frel as the weighted average of the relatedness between a candidate and all unambiguous entity candidates of the other mentions in the document (E^u_D). Due to this simplified definition of seed entities, the score can be calculated efficiently without an extensive coherence reasoning phase. Similar relatedness measures have been shown to outperform the measure of Milne and Witten [15, 19]; however, for this work we limit ourselves to this proven score.
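The following sketch illustrates a relatedness computation in the style of Milne and Witten [22] and its aggregation over the unambiguous entities E^u_D of a document; the inlink sets are toy data, and the unweighted average is a simplification of the weighted average used above.

```python
import math

def mw_relatedness(inlinks_a, inlinks_b, num_articles):
    """Relatedness based on the sets of Wikipedia articles linking to the two entities."""
    overlap = len(inlinks_a & inlinks_b)
    if overlap == 0:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    return 1.0 - (math.log(big) - math.log(overlap)) / (math.log(num_articles) - math.log(small))

def f_rel(candidate, unambiguous_entities, inlinks, num_articles):
    """Average relatedness of a candidate to the unambiguous seed entities E^u_D
    (simplified to an unweighted average for this sketch)."""
    if not unambiguous_entities:
        return 0.0
    return sum(mw_relatedness(inlinks[candidate], inlinks[e], num_articles)
               for e in unambiguous_entities) / len(unambiguous_entities)

# Toy inlink sets (article ids) for illustration only.
inlinks = {
    "Tom_Brady": {1, 2, 3, 4},
    "Brady,_Texas": {7, 8},
    "New_England_Patriots": {2, 3, 4, 5},
}
print(f_rel("Tom_Brady", ["New_England_Patriots"], inlinks, num_articles=5_000_000))
```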

Figure 3 compares the alignment performance to depict the differences between the scoring functions with respect to different text types. An alignment is derived by selecting the entity ea for mention m that maximizes the value of the scoring function f (i.e., ea = arg max_{e∈K} f(m, e)). In addition to the introduced first-order scoring functions, we further show the performance of a fourth score, fcomb, which is a linear combination of the other scores. The weights for the individual scores are equal for all datasets and are derived without any holdout strategies. This is similar to the strategy of manually finding a good combination of scores for these three kinds of texts to improve the individual score-based algorithms with the aforementioned features.

Figure 3: Performance of different first-order functions and a linear combination (fcomb) of them.

The performance ratios between the four scoring functions are similar for the news and the encyclopedic dataset. The superior first-order score for these two datasets is fprom. This is not surprising, given that the selection of prominent entities is appropriate in many cases for news or encyclopedic texts. For instance, given the mention "Michael Jordan", the entity of the famous NBA player is usually a good alignment choice, and only about 25% of the mentions in these datasets should be linked differently. Ignoring the prominence of the candidate, the other two first-order scores cannot compete; especially the context score shows a low performance. However, the performance of fcomb shows that a combination of all three scores yields an increase in alignment quality. The linear combination improves the best first-order score (fprom) by 4-7% in precision and recall (3-5 percentage points). The improvement of the linear combination in comparison to frel and fctx is 12-18% and 30-60%, respectively.

On the micro dataset, however, divergent results can be identified. Here, the context score outperforms the other first-order scores. This is due to the synthetic nature of the dataset, which contains documents with mainly ambiguous mentions, meaning that there are only seldom obvious entity alignments per document (E^u_D = ∅). For instance, the dataset contains the document: "David and Victoria named their children Brooklyn, Romeo, Cruz, and Harper Seven." It is obvious that the members of the Beckham family cannot be easily aligned, because the first-name mentions are very ambiguous: many similarly named, prominent entities can be found, e.g., David (the second king of Israel), Victoria (the state in the south-east of Australia), Brooklyn (the borough of New York City), and so on. Focusing on the contextual information, the candidate ranking yields better results; for instance, the Wikipedia article of David Beckham contains textual information about the other family members. As for the other two datasets, the linear combination improves all three first-order scores. Whereas the increase in comparison to fctx is 3%, fprom is outperformed by 75-90% and frel is beaten by more than 500%. This shows that a combination of different scores can lead to an improvement of linking results and underlines the independent operating principles of the scores.
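A minimal sketch of the fcomb baseline alignment rule is shown below; the concrete weights and candidate scores are placeholders, since only the fact that one fixed weight setting is shared across all datasets is stated above.

```python
# Sketch of the f_comb baseline: align each mention to the candidate that maximizes a
# linear combination of the first-order scores. The weights below are placeholders.
WEIGHTS = {"prom": 1.0, "ctx": 1.0, "rel": 1.0}

def f_comb(scores):
    """scores: dict with keys 'prom', 'ctx', 'rel' holding the first-order values."""
    return sum(WEIGHTS[name] * value for name, value in scores.items())

def align(mention_candidates):
    """Pick e_a = argmax_e f_comb(m, e) over the candidate entities of a mention."""
    return max(mention_candidates, key=lambda item: f_comb(item[1]))[0]

candidates = [
    ("Brady,_Texas", {"prom": -1.6, "ctx": -9.0, "rel": 0.1}),
    ("Tom_Brady",    {"prom": -3.2, "ctx": -4.5, "rel": 0.7}),
]
print(align(candidates))  # 'Tom_Brady' under these toy scores
```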

4.2.2. Higher-Order Scores

In addition to first-order scoring functions, CohEEL enables the application of higher-order scoring functions. Higher-order scores cover proportions between first-order scores of the candidates for one mention. For the remainder of this work, we introduce three higher-order scores that cover different aspects of the candidate list of a mention:

λ^rank_f is the rank of e in a list ordered by the first-order score f (see also Section 3.1). It covers the global position of e among all candidates for mention m and thus correlates with the cardinality of the set of entities providing higher f-scores with respect to mention m.

λ^∆top_f is the decrease in base score f in comparison to the highest-scored entity e′ for the same mention m. It indicates the loss of f-score that is encountered by aligning e with m instead of e′.

λ^∆succ_f is the f-score increase in comparison to the next lower-scored entity e′ for mention m. This score provides evidence by how much e outperforms each lower-scored entity at the minimum.

In an evaluation of the alignment performance, the former two higher-order scores would perform exactly as well as the applied first-order function, because they retain the order of the candidates. The latter one does not provide a clear candidate ranking per mention and would thus fail in such an evaluation.
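To make the three higher-order scores concrete, the following sketch derives them from a mention's scored candidate list; the data structures and score values are our own assumptions.

```python
def higher_order_scores(candidates):
    """candidates: list of (entity, f_score) pairs for one mention under one base score f."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_score = ranked[0][1]
    result = {}
    for position, (entity, score) in enumerate(ranked, start=1):
        # The lowest-ranked candidate has no successor; we assign a margin of 0 there.
        succ_score = ranked[position][1] if position < len(ranked) else score
        result[entity] = {
            "lambda_rank": position,                  # rank among the mention's candidates
            "lambda_delta_top": top_score - score,    # f-score loss w.r.t. the top candidate
            "lambda_delta_succ": score - succ_score,  # f-score margin over the next candidate
        }
    return result

scores = [("Brady,_Texas", -1.6), ("Tom_Brady", -3.2), ("The_Brady_Bunch", -3.4)]
print(higher_order_scores(scores)["Tom_Brady"])
# {'lambda_rank': 2, 'lambda_delta_top': 1.6..., 'lambda_delta_succ': 0.2...}
```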

4.3. Candidate Classification

We now discuss a configuration for the candidate classification step of CohEEL. We compare the influence of different classification models for both modules, the Seed Finder and the Candidate Finder. Furthermore, we discuss the influence of different higher-order score combinations.

4.3.1. Seed Finder

The Seed Finder has to be configured to achieve reliable alignments, i.e., a high precision. Out of the various existing classification models, we decided to apply the entropy-based decision tree learner C4.5 [36]. Decision trees have already proved successful in similar research fields [37, 38]. In comparison to the other models, decision trees can explicitly model the dependencies between attributes without assumptions such as linearity in the data. Because the C4.5 learner applies a reduced error pruning strategy, it is less sensitive to outliers and reduces the danger of overfitting, given a representative training set. Furthermore, the learner implicitly performs a feature selection, i.e., only features with significant information gain are considered.

We now show the influence of the higher-order scores on the seed selector. Our classifier implementations are based on the machine learning algorithms provided by the Weka framework [39]. To this end, we use configurations that apply all three previously defined first-order functions. Furthermore, the influence of different higher-order score combinations is compared:

noλ: a configuration without any higher-order functions: φ_noλ = ⟨fprom, fctx, frel⟩

∆top: a configuration with only one higher-order score for each first-order score: φ_∆top = ⟨fprom, λ^∆top_fprom, fctx, λ^∆top_fctx, frel, λ^∆top_frel⟩

λ*: a configuration with all three higher-order functions, resulting in all twelve scores: φ_λ* = ⟨fprom, λ^rank_fprom, λ^∆top_fprom, λ^∆succ_fprom, fctx, ..., frel, ..., λ^∆succ_frel⟩

Figure 4 shows the performance for different feature vector configurations φ based on C4.5. To measure how accurately the models perform on unknown data, the depicted precision and recall values were measured using 10-fold cross-validation. The applied loss matrix with L^s_fp / L^s_fn = 9 punishes false positives nine times more strongly than false negatives to guide the learner towards more reliable classifiers with a precision of around 90%. The three main observations are: (i) The additional higher-order scores allow C4.5 to construct more intricate decision rules. This influence might vary for other classification models (not shown in the figure); for instance, a Naive Bayes classifier follows a feature independence assumption, so introducing more strongly correlated features skews the model, leading to lower precision. (ii) The classification quality of the model on the micro dataset is lower than on the other datasets. This is expected, because the first-order scores showed decreased performance as well. Furthermore, the models for noλ and ∆top are too cautious and do not identify any entity alignment as correct (i.e., as seed), leading to a theoretical precision of 1 and a recall of 0. Only a combination of higher-order scores enables the model to generate rules for positive classifications; especially λ^∆succ_f seems to help the classifier, although on its own it provides only insufficient indications for certain decisions. (iii) The C4.5 decision tree with λ* outperforms the other configurations with a precision always around 95% while still supplying competitive recall values. For the following experiments, we use the C4.5 decision tree algorithm with λ* to model the Seed Finder.

Figure 4: Performance of different scoring functions and the C4.5 classification model for the Seed Finder.

Figure 5: Performance of different scoring functions, expected candidate set cardinality (E(|Cm|)), and classification model combinations for the Candidate Finder.
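For illustration, the following is a rough analogue of this Seed Finder setup. The original configuration uses Weka's C4.5 with an explicit loss matrix, whereas this sketch approximates the cost ratio with class weights in a scikit-learn decision tree and uses synthetic stand-in features.

```python
# Rough analogue of the Seed Finder: a decision tree over the twelve lambda* features
# with a cost ratio that punishes false positives about nine times more than false
# negatives. Not the original Weka C4.5 setup; features and labels are synthetic.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))                   # stand-in for the lambda* feature vector
y = (X[:, 0] + 0.5 * X[:, 3] > 1.2).astype(int)   # 1 = "correct alignment" (synthetic rule)

# Weighting the negative class ~9x emulates L_fp / L_fn = 9 (false positives cost more).
seed_finder = DecisionTreeClassifier(class_weight={0: 9.0, 1: 1.0}, min_samples_leaf=5)

cv = cross_validate(seed_finder, X, y, cv=10, scoring=("precision", "recall"))
print("precision %.2f  recall %.2f" % (cv["test_precision"].mean(), cv["test_recall"].mean()))
```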

4.3.2. Candidate Finder

As discussed in Section 3.1.3, the Candidate Finder is meant to provide a list of candidates per mention. Such a list should contain the correct entity with a high probability; it is important that the correct candidate is contained in the list, since incorrect entities in the list are removed in the JNE phase. For instance, the classifier should add all entities that have a high value for one score, even if other candidates might have better overall scores. Therefore, we propose to use Naive Bayes, a probabilistic classification model. Due to the independence assumption, the model labels an alignment as correct as soon as there are sufficient indications, and it does not revise its decision based on the other scores. For instance, given an alignment with a first-order score f that is low in comparison to the correct alignments in the training set but a high ranking (e.g., λ^rank_f = 1) of the same alignment, Naive Bayes would still use the latter feature as an indicator of the alignment being correct. Other models, such as C4.5, would instead reason over these two features holistically and probably label such an instance as incorrect.

Next, we compare the performance of Naive Bayes (NB) with two established classification models: the entropy-based decision tree learner C4.5 and support vector machines (SVM), a linear classification model. Besides the classification model, the selection of an appropriate expected candidate set cardinality E(|Cm|) is important to enable an efficient JNE phase. For the knowledge bases Wikipedia and YAGO, the average number of entities per ambiguous name is 4.57. Hence, we argue that a value of E(|Cm|) ≈ 4.57 is sufficient for most linking tasks on various datasets. Figure 5 compares the performance of different classification and cost model combinations on the introduced datasets. To measure the performance of the configurations, we employ the coverage measure. It corresponds to the recall of an optimal selector that always picks the correct entity (e ∈ Cm ∩ Ge, or NIL if Cm ∩ Ge = ∅) from the candidate set. Hence, the coverage is the maximum possible recall value CohEEL can achieve in subsequent steps (i.e., during JNE). In addition to the classification and cost model, we evaluate the influence of the different φ configurations introduced in the previous section. As expected, the overall trend is that the Naive Bayes classifier outperforms the other two models in terms of coverage. The performance of the algorithms for the news and encyclopedic datasets is comparable, whereas the performance of the different strategies varies more on the micro dataset. The expected candidate set cardinality E(|Cm|) clearly influences the candidate coverage: an increase in E(|Cm|) increases the coverage. Furthermore, the addition of higher-order scores enables a large increase of coverage for all classification models. The difference between the ∆top and λ* configurations is not definite, but in general the λ* feature vector performs better.

Hence, we argue that Naive Bayes is a good choice for the Candidate Finder; it is applied to all introduced features (λ*) in the following experiments. Furthermore, we apply the rather conservative loss matrix derived from E(|Cm|) = 6 to support datasets with more ambiguous mentions without subverting the JNE performance.
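The coverage measure used above to evaluate the Candidate Finder can be sketched as follows; the candidate sets and gold alignments are toy data.

```python
# Sketch of the coverage measure: the recall an optimal selector would achieve if it
# always picked the correct entity from the candidate set (or NIL for an empty overlap).
def coverage(candidate_sets, gold):
    """candidate_sets: mention -> set of candidates; gold: mention -> entity or None (NIL)."""
    gold_entity_mentions = [m for m, e in gold.items() if e is not None]
    covered = sum(1 for m in gold_entity_mentions
                  if gold[m] in candidate_sets.get(m, set()))
    return covered / len(gold_entity_mentions) if gold_entity_mentions else 0.0

gold = {"m1": "Tom_Brady", "m2": "Richard_M._Berman", "m3": None}
cands = {"m1": {"Tom_Brady", "Brady,_Texas"}, "m2": {"Berman"}, "m3": set()}
print(coverage(cands, gold))  # 0.5: only m1's candidate set contains the gold entity
```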

4.4. Judicious Neighborhood Exploration

In the JNE phase, CohEEL combines the alignments identified by the Seed and Candidate Finder. The relationship graph from the knowledge base is used to identify coherent sets of entities within a document. Due to the directed nature of the inter-instance relations in YAGO, we construct directed neighborhood graphs. However, other knowledge bases might contain undirected relations, for instance a bibliographic knowledge base with co-author relationships.

As stated earlier, the RWR is known to converge depending on the restart probability α and the convergence tolerance level θc. We apply the common values of α = 0.15 and θc = 10^−8, implying that the power method for the RWR converges after roughly 113 iterations. To further reduce the number of iterations and thus the runtime of the JNE phase, we introduced θit. This parameter depends on the dataset, i.e., the number of ambiguous mentions and candidates per mention. As stated earlier, the value of θit can be learned from the training set that was used to train the Seed and Candidate Finder by observing the iteration it_s at which the top-ranked entity is stable for all documents in the training set.
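A minimal sketch of a random walk with restart via power iteration, including the restart probability α, the convergence tolerance θc, and the iteration cap θit, is shown below; the toy graph, the restart vector, and the dangling-node handling are illustrative assumptions, not the CohEEL implementation.

```python
import numpy as np

def rwr(adjacency, restart, alpha=0.15, theta_c=1e-8, theta_it=None):
    """Random walk with restart via power iteration on a (small) neighborhood graph.

    adjacency: square matrix of the directed neighborhood graph (row i -> column j).
    restart:   restart distribution, e.g., uniform over the seed entities.
    """
    n = adjacency.shape[0]
    out_deg = adjacency.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes fall back to the restart vector.
    transition = np.divide(adjacency, out_deg,
                           out=np.tile(restart, (n, 1)), where=out_deg > 0)
    p = restart.copy()
    iteration = 0
    while True:
        iteration += 1
        p_next = alpha * restart + (1 - alpha) * transition.T @ p
        if np.abs(p_next - p).sum() < theta_c or (theta_it and iteration >= theta_it):
            return p_next, iteration
        p = p_next

adj = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 0]], dtype=float)
seeds = np.array([1.0, 0.0, 0.0])     # restart mass on one seed entity
probs, its = rwr(adj, seeds, theta_it=9)
print(probs.round(3), its)
```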

Figure 6 compares the convergence of the steady state probability vector p and the iteration it_s after which the most promising candidate was steady (i.e., finally selected) during the power iterations. The statistics are exemplarily collected over all RWRs for the encyclopedic dataset. What surprises at first is that the number of necessary power iterations is approximately 80 (instead of 113 as expected). The theoretic convergence bounds are defined for any graph size. However, in our setup the document size (hundreds of mentions) limits the graph size, leading to thousands of entities and a faster convergence in comparison to the theoretic bounds, which are observed for graphs that are several orders of magnitude larger (e.g., PageRank with billions of nodes).

Figure 6: Convergence behavior of steady state probabilities and top-scored candidate entities based on the encyclopedic dataset.

In 62% of the RWR executions the most promising candidate is already steady after one power iteration, and it changes only in 3% of the cases after the fifth iteration. For the encyclopedic dataset, the value of θit = 9 was derived. This reduces the number of necessary iterations by nearly an order of magnitude. On the news dataset, the iteration bound was set to θit = 16, whereas on the micro dataset the value was θit = 5. This emphasizes the variance of the parameter across datasets and shows the runtime efficiency gain of the RWR due to the determination of the convergence criterion.

Next, the linking quality of the previously discussed configurations for the Seed and Candidate Finder is compared with the performance of the combined CohEEL model after applying the Judicious Neighborhood Exploration (JNE). Table 2 shows precision and recall results for the CohEEL configuration after (i) applying the Judicious Neighborhood Exploration (JNE), based on the alignments of (ii) the Seed Finder, and (iii) the Candidate Finder for the three different datasets. The candidate coverage is the maximum possible recall value the model can achieve (Section 4.3.2). The 95% confidence intervals were determined by executing all experiments using a 10-fold cross validation and incorporating the variance between the different folds.

Table 2: Performance of the CohEEL modules and their 95% confidence intervals over a 10-fold cross validation.

                        news               ency.              micro
  JNE     Prec.    91.09% +.07/–.14   89.19% +.06/–.09   90.48% +.10/–.32
          Recall   73.99% +.10/–.16   72.59% +.03/–.10   40.43% +.14/–.38
  Seeds   Prec.    97.36% +.03/–.19   94.66% +.04/–.04   95.35% +.05/–.27
          Recall   57.05% +.17/–.16   63.00% +.09/–.08   26.95% +.11/–.16
  Candidate
  Coverage         94.57% +.04/–.03   96.95% +.02/–.02   96.63% +.03/–.07

As can be seen, the precision of the Seed Finder reaches around 95% for all three datasets. This emphasizes the effectiveness of this classifier. Furthermore, the Candidate Finder supports a high coverage of correct entities of around 95% for all three datasets. This shows that the Seed and Candidate Finders do not corrupt the reasoning process in the Judicious Neighborhood Exploration. Due to the limited size of the micro dataset, its confidence intervals are larger in comparison to the other datasets.

It is apparent from the results that the neighborhood exploration phase is able to significantly increase the number of alignments while only slightly decreasing the precision provided by the Seed Finder. While the decrease in precision is only around 5-6%, the JNE is able to increase the recall by 15-50% for all three datasets. For instance, the candidate classifier provided 10,215 alignments for the news dataset. These comprise 666 of the 773 correct alignments missing in the seed set. During the JNE phase, CohEEL additionally selects 280 correct and 116 incorrect candidates. Recalling that these are the difficult mentions for which all of the tested classification models failed to provide reliable alignments, this shows the quality of the Judicious Neighborhood Exploration. It remains to say that the recall is nowhere near the theoretically possible value given by the candidate set coverage. This is due to the cautious nature of the neighborhood exploration phase, which uses only a two-step neighborhood expansion. Under this restriction, the knowledge base does not always indicate a relevant relationship; for instance, the algorithm is not intended to identify relationships between two distinct topics like the legal and the American football topic presented in Quote 1.

5. Comparative Evaluation

Next, we provide an evaluation of the discussed CohEEL configuration based on the introduced datasets (Section 4.1.1) and compare it with state-of-the-art approaches.

Besides CohEEL, we evaluate a strong scoring function baseline as well as nine state-of-the-art approaches. The baseline fcomb derives alignments by selecting the entity ea for mention m that maximizes the value of the scoring function fcomb (i.e., ea = arg max_{e∈K} fcomb(m, e)). The scoring function is based on a linear combination of the surface, context, and relatedness scoring functions (for details of the weights see Section 4.2) and outperforms the individual scores in our tests.

The nine state-of-the-art approaches, all of which achieve a high quality in the disambiguation and linking task, are the following:

First, the approach of Cucerzan (referred to as LED) extends the term-based feature vectors of Wikipedia entities by information, such as key phrases and categories, from other articles that link to them [17]. The project was conducted at Microsoft and the code is proprietary; hence, we re-implemented the algorithm according to the descriptions in the paper.

Second, we compare with an approach that is based on the algorithm of Milne and Witten (referred to as L2LW) [22]. We implemented the algorithm according to the descriptions in the paper based on the Weka toolkit. In contrast to the evaluation of the prior work, which was based on a knowledge base consisting of only 700 Wikipedia articles, we applied the algorithm to the complete knowledge bases YAGO and Wikipedia. For the commonness and relatedness measures, we used the scores fprom and frel introduced in Section 4.2. Subsequently, we trained the algorithm based on all three datasets.


Third, we evaluate BEL, an approach that builds on a majority vote over multiple ranking classifiers [20]. BEL is based on the idea of bagging the contextual information of each mention. We used the original implementation, which operates on the YAGO knowledge base.

Fourth, UKB, the algorithm of Agirre et al., runs a personalized PageRank approach over the Wikipedia graph built over reciprocal article links [27]. We used the implementation as well as the knowledge base graph provided on the project website.

Fifth, AIDA (also known as GRAPH-KORE) provides an algorithm that uses graph-based connectivity between candidate entities of multiple mentions (e.g., derived from the type and subclassOf edges of the knowledge graph or from the incoming links of Wikipedia articles) to determine the most promising linking of the mentions [25]. For the experiments, we used the implementation provided on the project website.

Sixth, we provide the results of the approach of Guo and Barbosa (REL-RW), which applies an iterative algorithm based on semantic entity signatures derived from the stationary random walk distribution over the knowledge base neighborhood [29]. Unfortunately, the source code was not available as open source at the time of the evaluation; however, the authors generously executed the algorithm on their hardware to produce the alignments for the discussed datasets.

Finally, we used the General Entity Annotation Benchmark Framework GERBIL [9] to retrieve the results for three algorithms via their external Web services: DBpedia Spotlight disambiguates entities using a generative probabilistic model that is based on surface and context features [18]. WAT applies different mention-local features to build an entity graph for each document and uses link-based ranking algorithms on this entity graph, whose edges are weighted by the relatedness between the entities [19]. Babelfy identifies a densest subgraph by iteratively removing the most weakly connected entity from a semantic document representation, which is derived from the multilingual knowledge base BabelNet [28].

To enable a fair comparison of the linking quality of approaches not based on the same knowledge base version (i.e., LED, Spotlight, UKB, WAT, Babelfy, and REL-RW), we replaced the alignments of entities without representation in our knowledge base (e.g., concepts like summer) by NIL links.

Table 3 provides a performance comparison over all eleven algorithms. The six rightmost algorithms are the approaches that perform a coherence reasoning to improve the linking quality. It is apparent that these approaches produce the best results for all of the applied quality measures. Furthermore, it is visible that the performance of all competitors is lower for the synthetic micro dataset than for the two long text datasets.

All competitors of CohEEL provide balanced precision-recall values per dataset, leading to large numbers of incorrect entity alignments of up to 71%. In contrast, CohEEL consistently outperforms all the other approaches in terms of precision and is able to reduce the false positive rate to around 10%. Only one competitor is able to achieve comparable precision values on the two long text corpora (news and encyclopedic): REL-RW. Whereas it performs well for long text, it suffers from a low precision of only 60% on the micro dataset. For this dataset, all competitors suffer from low precision: for the best competitor, Babelfy, one in four entity alignments is incorrect, whereas the other competitors even provide a third or more incorrect entities. Only CohEEL is able to produce a higher precision. This emphasizes the effectiveness of the model with its focus on producing reliable alignments. In terms of recall, CohEEL is mostly on par with the non-coherence-based approaches (the five leftmost algorithms) but is outperformed by the coherence-based approaches. This leads to an upper midfield position of our approach when ranked by the harmonic mean between precision and recall (F1 measure), which favors balanced precision-recall performance. However, in comparison to the strongest competitor REL-RW, CohEEL loses only 2 to 8 percentage points in F1 on all datasets.

Table 3: Performance comparison of state-of-the-art algorithms and CohEEL. Right side: approaches that apply a graph-based coherence reasoning strategy.

  news          fcomb   LED    L2LW   Spotlight  BEL    UKB    WAT    AIDA   Babelfy  REL-RW  CohEEL
  P             80.6%   63.4%  72.1%  79.3%      80.1%  83.3%  84.8%  78.7%  76.6%    88.1%   91.1%
  R             78.2%   64.9%  69.7%  73.8%      80.1%  82.8%  83.1%  80.4%  76.5%    85.4%   74.0%
  S             78.4%   61.2%  71.2%  81.0%      73.1%  75.2%  74.9%  66.0%  70.2%    96.6%   88.7%
  F1            79.4%   64.1%  70.9%  76.5%      80.1%  83.0%  84.0%  79.5%  76.5%    86.8%   81.7%
  F.25          80.1%   63.7%  71.6%  78.1%      80.1%  83.2%  84.5%  79.0%  76.6%    87.6%   87.1%

  encyclopedic  fcomb   LED    L2LW   Spotlight  BEL    UKB    WAT    AIDA   Babelfy  REL-RW  CohEEL
  P             82.0%   63.5%  72.3%  84.2%      82.2%  87.0%  81.6%  79.5%  77.3%    88.0%   89.2%
  R             82.2%   68.8%  70.4%  70.9%      82.8%  58.6%  73.7%  85.6%  77.0%    89.0%   72.6%
  S             65.3%   50.1%  67.6%  76.5%      68.6%  79.0%  72.1%  52.2%  65.5%    80.8%   83.3%
  F1            82.1%   66.0%  71.4%  77.0%      82.5%  70.0%  77.4%  82.4%  77.1%    88.5%   80.0%
  F.25          82.0%   64.5%  71.9%  81.2%      82.3%  79.3%  79.9%  80.6%  77.2%    88.2%   85.3%

  micro         fcomb   LED    L2LW   Spotlight  BEL    UKB    WAT    AIDA   Babelfy  REL-RW  CohEEL
  P             56.2%   40.1%  29.4%  48.8%      54.8%  63.7%  59.6%  64.4%  72.7%    60.0%   90.5%
  R             58.2%   41.8%  24.8%  44.7%      48.2%  61.0%  59.6%  66.7%  73.8%    55.3%   40.4%
  S             28.6%   14.3%  57.1%  42.9%      42.9%  42.9%  42.9%  28.6%  28.6%    57.1%   85.7%
  F1            57.1%   41.0%  26.9%  46.7%      51.3%  62.3%  59.6%  65.5%  73.2%    57.6%   55.9%
  F.25          56.6%   40.5%  28.4%  47.9%      53.4%  63.1%  59.6%  64.8%  72.9%    59.0%   72.5%

Recalling that the focus of our system lies on producing reliable alignments with high precision, a comparison based on the F.25-measure, which favors precision over recall, is more reasonable. The performance of REL-RW is impressive on the long text corpora, and only CohEEL is able to produce comparable results. For the micro dataset, however, CohEEL improves the F.25 of REL-RW by 14 percentage points and is on par with the best competitor Babelfy. All other algorithms are outperformed by 8–44 percentage points.

The previous discussion covered only the detection of knowledge base entities. Specifically, most competitors show better results in terms of sensitivity (recall). However, in terms of specificity, i.e., the proportion of NIL alignments that are correctly identified, the algorithms show a different behavior. CohEEL favors cautiousness when linking knowledge base entities and thus achieves high specificity values; it is the only approach that detects around 85% of the out-of-knowledge-base entities for all three datasets.

Overall, the three iterative graph-based coherence reasoning approaches Babelfy, REL-RW, and CohEEL outperform the other systems. Depending on the dataset, either Babelfy or REL-RW provides the best F1 results, because of their stronger precision-recall balance in comparison to CohEEL; REL-RW in particular excels on the news and encyclopedic datasets. This is probably due to the focus of its configuration towards longer texts: the evaluation in the original publication is performed on news corpora, namely MSNBC, AQUAINT, and ACE2004 [29]. The performance drop of REL-RW on the micro dataset, which exhibits other text characteristics, is drastic. On this dataset, Babelfy outperforms all other competitors by nearly ten percentage points in F1. In contrast, CohEEL is able to consistently provide high precision and specificity values. This underscores the advantage of our supervised approach, which is able to automatically adapt to different input text types and assures high specificity and precision values.

Figure 7: Runtime comparison of state-of-the-art algorithms and CohEEL.

Figure 7 depicts that, executed on an Intel Xeon 2.67GHz machine with 32 CPU cores and 256GB RAM, the runtime per document of most of the tested algorithms varies in the seconds range (between 1 and 15s). This is due to the fact that fcomb, LED, and BEL are not based on an exhaustive coherence reasoning. Furthermore, the relatedness score of L2LW is based on a greedy approach that calculates coherence based on unambiguous entities and does not consider different combinations of ambiguous entities. Hence, the runtime of all four approaches is mainly determined by the calculation of the scoring functions.

AIDA applies an expensive reasoning over all candidate combinations, leading to a document processing time more than one order of magnitude higher than the other competitors. The measured runtimes of AIDA for the micro dataset are surprisingly high (16min 50s), since each document contains only three mentions on average. However, these mentions are highly ambiguous, yielding a large number of potential entities, e.g., all persons named "David" in combination with all persons named "Victoria". UKB shows slightly better performance than AIDA. As discussed in Section 2.2, this can be explained by an efficient RWR implementation that is based on a sparser knowledge base graph containing only relevant (reciprocal) Wikipedia links. However, depending on the dataset, UKB analyzes each document within 0.5 to 3.5min on average. This shows that the coherence reasoning phase is a costly operation, even on thoroughly curated knowledge base graphs.

In contrast to AIDA and UKB, the coherence reasoning of CohEEL, i.e., the JNE phase, is only a small factor and occupies between 5% and 20% of the runtime. Furthermore, the candidate classification phase has negligibly low runtimes of only 1-5ms per mention. Thus, the overall runtime of CohEEL lies between 2 and 18s per document, depending on the dataset, and is therefore comparable with the approaches without coherence reasoning. This is important for different scenarios: if large text collections (millions of documents) have to be processed, texts have to be analyzed in seconds to annotate the whole corpus in acceptable time (days). Furthermore, in online scenarios, such as an entity linking web service where links have to be retrieved per document immediately, runtimes larger than several seconds are not acceptable.

Please note that we cannot supply runtime measurements for DBpedia Spotlight, WAT, Babelfy, and REL-RW, because we did not execute these programs on our hardware. However, as discussed in Section 3.2.3, we argue that CohEEL is able to outperform REL-RW: assuming a document neighborhood graph size similar to that of UKB (both algorithms are based on the Wikipedia link graph), we expect a comparable runtime for REL-RW, i.e., in the order of minutes per document. This assumption is due to the complexity of the RWR over a large (uncompressed) neighborhood graph.

6. Conclusion

We introduced CohEEL, a novel and efficient named entity linking model that uses a random walk strategy to combine the results of a precision-oriented and a recall-oriented classifier while maintaining a high precision and elevating the recall to practically viable levels. We introduced a configuration of this model based on YAGO and Wikipedia and compared it with state-of-the-art NEL approaches. While showing superior behavior in terms of precision, CohEEL also provides competitive F-measure values. Furthermore, we showed that CohEEL's efficiency is practical and thus enables online scenarios, where entity links have to be retrieved in near-real-time.

We are working on a distributed version of CohEEL based on Apache Flink to help process and annotate huge text collections, i.e., dozens of millions of documents (see https://hpi.de/naumann/projects/coheel). We plan to extend CohEEL with the ability to automatically detect possible mention surfaces in texts, and thus reduce its dependency on a named entity recognizer. Furthermore, we plan to test the influence of the entity alignments found by CohEEL on different information retrieval tasks, such as document clustering or question answering. Another goal is to investigate CohEEL's performance on other types of texts, such as scientific texts (e.g., life sciences, social sciences, etc.) or user-generated content (e.g., blog articles, product reviews, etc.) based on appropriate knowledge bases. Future research not covered by this work targets improving knowledge base graphs for the coherence reasoning to solve the NEL task [27–29, 33]. Specifically, the quality as well as the efficiency of CohEEL might be improved by gathering dense subgraph structure information, such as k-trusses [40]. CohEEL provides a strong basis for further investigations in this important research area.

Acknowledgments

This research was funded by the German Research Society, DFG grant no. FOR 1306.

7. References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Vol. 4825 of Lecture Notes in Computer Science, Springer, 2007, pp. 722–735. doi:10.1007/978-3-540-76298-0_52.

[2] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, G. Weikum, YAGO2: Exploring and querying world knowledge in time, space, context, and many languages, in: Proceedings of the International Conference on World Wide Web (WWW), 2011, pp. 229–232. doi:10.1145/1963192.1963296.

[3] M. Dredze, P. McNamee, D. Rao, A. Gerber, T. Finin, Entity disambiguation for knowledge base population, in: Proceedings of the International Conference on Computational Linguistics (COLING), 2010, pp. 277–285.

[4] T. Gruetze, G. Kasneci, Z. Zuo, F. Naumann, Bootstrapping Wikipedia to answer ambiguous person name queries, in: International Workshop on Information Integration on the Web (IIWeb), 2014, pp. 56–61. doi:10.1109/ICDEW.2014.6818303.

[5] M. Khalid, V. Jijkoun, M. de Rijke, The impact of named entity normalization on information retrieval for question answering, in: Advances in Information Retrieval, Vol. 4956 of Lecture Notes in Computer Science, Springer, 2008, pp. 705–710. doi:10.1007/978-3-540-78646-7_83.

[6] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., T. M. Mitchell, Toward an architecture for never-ending language learning, in: Proceedings of the National Conference on Artificial Intelligence (AAAI), 2010, pp. 1306–1313.

[7] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, G. Weikum, NAGA: Searching and ranking knowledge, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2008, pp. 953–962. doi:10.1109/ICDE.2008.4497504.

[8] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu, K. Wang, ERD 2014: Entity recognition and disambiguation challenge, SIGIR Forum 48 (2) (2014) 63–77. doi:10.1145/2701583.2701591.

[9] R. Usbeck, M. Röder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. Brümmer, D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, P. Ferragina, C. Lemke, A. Moro, R. Navigli, F. Piccinno, G. Rizzo, H. Sack, R. Speck, R. Troncy, J. Waitelonis, L. Wesemann, GERBIL: General entity annotator benchmarking framework, in: Proceedings of the International Conference on World Wide Web (WWW), 2015, pp. 1133–1143.

[10] R. Mihalcea, A. Csomai, Wikify!: Linking documents to encyclopedic knowledge, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2007, pp. 233–242. doi:10.1145/1321440.1321475.

[11] R. Sinha, R. Mihalcea, Unsupervised graph-based word sense disambiguation using measures of word semantic similarity, in: Proceedings of the International Conference on Semantic Computing (ICSC), 2007, pp. 363–369. doi:10.1109/ICSC.2007.107.

[12] E. Agirre, A. Soroa, Personalizing PageRank for word sense disambiguation, in: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2009, pp. 33–41.

[13] M.-C. d. Marneffe, B. MacCartney, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2006, pp. 449–454.

[14] B. Hachey, W. Radford, J. Nothman, M. Honnibal, J. R. Curran, Evaluating entity linking with Wikipedia, Artificial Intelligence 194 (2013) 130–150. doi:10.1016/j.artint.2012.04.005.

[15] L. Ratinov, D. Roth, D. Downey, M. Anderson, Local and global algorithms for disambiguation to Wikipedia, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT), 2011, pp. 1375–1384.

[16] T. Pedersen, A. Purandare, A. Kulkarni, Name discrimination by clustering similar contexts, in: Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), 2005, pp. 226–237. doi:10.1007/978-3-540-30586-6_24.

[17] S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, in: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 708–716.

[18] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes, Improving efficiency and accuracy in multilingual entity extraction, in: Proceedings of the International Conference on Semantic Systems (I-SEMANTICS), 2013, pp. 121–124. doi:10.1145/2506182.2506198.

[19] F. Piccinno, P. Ferragina, From TagME to WAT: A new entity annotator, in: Proceedings of the International Workshop on Entity Recognition & Disambiguation (ERD), 2014, pp. 55–62. doi:10.1145/2633211.2634350.

[20] Z. Zuo, G. Kasneci, T. Gruetze, F. Naumann, BEL: Bagging for entity linking, in: Proceedings of the International Conference on Computational Linguistics (COLING), 2014, pp. 2075–2086.

[21] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, G. Weikum, Robust disambiguation of named entities in text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011, pp. 782–792.

[22] D. Milne, I. H. Witten, Learning to link with Wikipedia, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2008, pp. 509–518. doi:10.1145/1458082.1458150.

[23] F. Du, Y. Chen, X. Du, Linking entities in unstructured texts with RDF knowledge bases, in: Proceedings of the Asia-Pacific Web Conference (APWeb), Vol. 7808 of Lecture Notes in Computer Science, Springer, 2013, pp. 240–251. doi:10.1007/978-3-642-37401-2_25.


[24] S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti, Collective annotation of Wikipedia entities in web text, in: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2009, pp. 457–466. doi:10.1145/1557019.1557073.

[25] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, G. Weikum, KORE: Keyphrase overlap relatedness for entity disambiguation, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2012, pp. 545–554. doi:10.1145/2396761.2396832.

[26] X. Han, L. Sun, J. Zhao, Collective entity linking in web text: A graph-based method, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 765–774. doi:10.1145/2009916.2010019.

[27] E. Agirre, A. Barrena, A. Soroa, Studying the Wikipedia hyperlink graph for relatedness and disambiguation, CoRR abs/1503.01655. URL: http://arxiv.org/abs/1503.01655

[28] A. Moro, A. Raganato, R. Navigli, Entity linking meets word sense disambiguation: A unified approach, Transactions of the Association for Computational Linguistics 2 (2014) 231–244.

[29] Z. Guo, D. Barbosa, Robust entity linking via random walks, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2014, pp. 499–508. doi:10.1145/2661829.2661887.

[30] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[31] H. Tong, C. Faloutsos, J.-Y. Pan, Random walk with restart: Fast solutions and applications, Knowledge and Information Systems 14 (3) (2008) 327–346. doi:10.1007/s10115-007-0094-2.

[32] A. N. Langville, C. D. Meyer, Deeper inside PageRank, Internet Mathematics 1 (3) (2004) 335–380. doi:10.1080/15427951.2004.10129091.

[33] B. Dalvi, E. Minkov, P. P. Talukdar, W. W. Cohen, Automatic gloss finding for a knowledge base using ontological constraints, in: Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2015, pp. 277–285.

[34] P. N. Mendes, M. Jakob, A. García-Silva, C. Bizer, DBpedia Spotlight: Shedding light on the web of documents, in: Proceedings of the International Conference on Semantic Systems (I-SEMANTICS), 2011, pp. 1–8. doi:10.1145/2063518.2063519.

[35] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems 22 (2). doi:10.1145/984321.984322.

[36] J. R. Quinlan, C4.5: Programs for Machine Learning, Vol. 1, Morgan Kaufmann, 1993.

[37] J. F. McCarthy, W. G. Lehnert, Using decision trees for coreference resolution, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 1050–1055.

[38] V. Ng, C. Cardie, Improving machine learning approaches to coreference resolution, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 104–111. doi:10.3115/1073083.1073102.

[39] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explorations 11 (1) (2009) 10–18.

[40] J. Cohen, Graph twiddling in a MapReduce world, Computing in Science and Engineering 11 (4) (2009) 29–41. doi:10.1109/MCSE.2009.120.
