Binary Vector Approach to Entity Linking: TALP in TAC-KBP 2014

A. Naderi, H. Rodríguez, J. Turmo
TALP Research Center, UPC, Spain.

{anaderi,horacio,turmo}@lsi.upc.edu

Abstract

This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its third participation at TAC-KBP 2014, in the Mono-Lingual (English) Entity Linking task.

1 Introduction

EL is the task of linking a named entity mention occurring in a text document (henceforth, background document) to the unique corresponding entry within a reference knowledge base (KB). The TAC-KBP track defines the task of EL as follows: given a set of queries, each one consisting of a query name along with a background document in which the query name can be found (the query name is located by its begin and end offsets in the background document), and a source document collection from which systems can learn, the EL system is required to select the relevant KB entry. Queries sometimes consist of the same name string drawn from different documents. The system is expected to disambiguate such names (e.g., Barcelona could refer to the sports team, the university, the city, the state, or a person). An EL system is also required to cluster those named entity mentions referring to the same Not-in-KB (NIL) entity. This paper proposes a Vector Space Model (VSM) to rank a set of candidates which are possibly the correct answer for a query.

2 Literature Review

Recent work on EL is inspired by the older history of Word Sense Disambiguation (WSD), where this challenge first arose. Many studies carried out on WSD are quite relevant to EL. Disambiguation methods in the state of the art can be classified into supervised, unsupervised, and knowledge-based methods (Navigli, 2009).

Supervised Disambiguation (SD). The first category applies machine-learning techniques to infer a classifier from training (manually annotated) data sets in order to classify new examples. Researchers have proposed different methods for SD. A Decision List (Rivest, 1987) is an SD method containing a set of if-then-else rules to classify the samples. Later, Klivans and Servedio (2006) applied the learning of decision lists to attribute-efficient learning. Magee (1964) introduced another SD method, the Decision Tree, which has a tree-like structure of decisions and their possible consequences. C4.5 (Quinlan, 1993), a common algorithm for learning decision trees, was outperformed by other supervised methods (Mooney, 1996). John and Langley (1995) studied the Naive Bayes classifier. This classifier is a supervised method based on Bayes' theorem and belongs to the family of simple probabilistic classifiers; the model computes the conditional probability of each class membership given a set of features. Mooney (1996) demonstrated good performance of this classifier compared with other supervised methods. McCulloch and Pitts (1943) introduced Neural Networks, a computational model inspired by the central nervous system of organisms, presented as a system of interconnected neurons. Although Towell and Voorhees (1998) showed good performance with this model, the experiment was carried out on a small amount of data, and the dependency on large amounts of training data remains a major drawback (Navigli, 2009). Recently, different combinations of supervised approaches have been proposed. Combination methods are highly interesting since they can compensate for the weaknesses of stand-alone SD methods (Navigli, 2009).

Unsupervised Disambiguation (UD). The underlying hypothesis of UD is that each word is correlated with its neighboring context, and co-located words generate a cluster tending to the same sense or topic. No labeled training data set or machine-readable resources (e.g. dictionaries, ontologies, thesauri, etc.) are required by this approach (Navigli, 2009). Context Clustering (Schütze, 1992) is a UD method in which each occurrence of a target word in a corpus is represented as a context vector. The vectors are then gathered into clusters, each indicating a sense of the target word. A drawback of this method is that a large amount of unlabeled training data is required. Lin (1998) studied Word Clustering, a UD method based on clustering words which are semantically similar. Later on, Pantel and Lin (2002) proposed a word clustering approach called Clustering By Committee (CBC). Widdows and Dorow (2002) described another UD method, Co-occurrence Graphs, assuming that co-occurring words and their relations generate a co-occurrence graph. In this graph, the vertices are co-occurrences and the edges are the relations between co-occurrences.

Knowledge-based Disambiguation (KD). The goal of this approach is to apply knowledge resources (such as dictionaries, thesauri, ontologies, collocations, etc.) for disambiguation (Lesk, 1986; Banerjee and Pedersen, 2003; Fu, 1982; Bunke and Sanfeliu, 1990; Mihalcea, 2004). Although these methods have lower performance compared with supervised techniques, they have wider coverage (Navigli, 2009).

Recently, collective efforts have been made in this field in the form of challenging competitions. The advantage of such competitions is that the performance of systems is more comparable, since all participants assess their systems on the same testbed, including the same resources and the same training and evaluation corpora. To this end, the Knowledge Base Population (KBP) EL track at the Text Analysis Conference (TAC)¹ is the most important such competition, and it has been the subject of significant study since 2009. The task is organized annually, and many teams present their proposed systems.

3 Methodology and Contribution

The method proposed in this paper follows the typical architecture in the state of the art (Figure 1). Briefly, given a query, the system preprocesses the background document (Document Pre-processing step). Subsequently, those KB nodes which are potential candidates to be the correct entity are selected (Candidate Generation step). Then, a binary vector is generated from the contexts between the two components of a set of pairs ⟨query name, named entity mention⟩ occurring in the sentence containing the begin and end offsets of the query name in the background document (henceforth, offset sentence), and also from all sentences of each candidate document's wikitext (Binary Vector Generation step). Next, a set of graphs is generated in order to represent the relations between the components of the pairs (Graph Generation step). Finally, all candidates are ranked in a list and the candidate with the highest score is selected. In addition, all queries belonging to the same Not-In-KB (NIL) entity are clustered together and assigned the same NIL id (Candidate Ranking and NIL Clustering step). This final step is the most challenging and crucial of the steps above.

Details of each step are provided next.

3.1 Document Pre-processing

Initially, the background document has to be normalized so that it can be used by the other components. To this end, the system pre-processes the document in the following way.

Document Partitioning and Text Cleaning. This component separates the textual and non-textual parts of the document. The subsequent steps are then applied only to the textual part.

In addition, each document might contain HTML tags and noise (e.g., in Web documents), which are removed by the system.

¹ The TAC is organized and sponsored by the U.S. National Institute of Standards and Technology (NIST) and the U.S. Department of Defense.

Figure 1: General architecture of an EL system.


Sentence Segmentation and Text Normalization. This module operates on the content of the documents as follows:

• Sentence Segmentation. The documents are split by discovering sentence boundaries.

• Soft Mention (SM). Entity mentions represented with abbreviations are expanded, e.g. “Tech. Univ. of Texas” is replaced with “Technical University of Texas”. To this end, a dictionary-based mapping was manually generated and applied (a minimal sketch follows this list).
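The mapping dictionary itself is not published, so the following is only a rough Python sketch with invented entries (SM_DICT and expand_soft_mentions are our names):

```python
# Hypothetical dictionary entries; the actual mapping in the paper was built manually.
SM_DICT = {
    "Tech.": "Technical",
    "Univ.": "University",
    "Inst.": "Institute",
}

def expand_soft_mentions(sentence: str) -> str:
    """Replace known abbreviated tokens with their expanded forms."""
    return " ".join(SM_DICT.get(tok, tok) for tok in sentence.split())

print(expand_soft_mentions("Tech. Univ. of Texas"))  # Technical University of Texas
```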

3.2 Query Expansion

In most queries, the query name might be ambiguous, or the background document contains poor and sparse information about the query. In these cases, query expansion can reduce the ambiguity of the query name and enrich the content of documents by finding name variants of the query name, integrating more discriminative information, and tagging metadata onto the content of documents.

To do so, we apply the following steps:

Query Classification. Query type recognition helps to filter out those KB entities whose type differs from the query type. Our system classifies queries into 3 entity types: PER (e.g. “George Washington”), ORG (e.g. “Microsoft”) and GPE (GeoPolitical Entity, e.g. “Heidelberg city”). We proceed under the assumption that a longer mention (e.g. “George Washington”) tends to be less ambiguous than a shorter one (e.g. “Washington”) and, thus, that the type of the longest query mention tends to be the correct one. The query classification is performed in three steps. First, we use the Illinois Named Entity Recognizer and Classifier (NERC) (Ratinov and Roth, 2009) to tag the types of the named entity mentions occurring in the background document. Second, we find the set of mentions in the background document referring to the query. More concretely, we take the mention m1 defined by the query offsets within the background document (e.g. “Bush”) and find the set of mentions occurring in that document that include m1 (e.g. “Bush”, “G. Bush”, “George W. Bush”). Finally, we select the longest mention from the resulting set and take its type as the query type.
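A minimal sketch of this longest-mention heuristic, assuming the NERC output is available as (mention, type) tuples (function and variable names are ours):

```python
def classify_query(query_mention: str,
                   nerc_mentions: list[tuple[str, str]]) -> str | None:
    """Return the PER/ORG/GPE type of the longest document mention that
    contains the query mention (longer mentions are assumed less ambiguous)."""
    # e.g. "Bush" is contained in "G. Bush" and "George W. Bush"
    containing = [(m, t) for m, t in nerc_mentions if query_mention in m]
    if not containing:
        return None
    longest_mention, entity_type = max(containing, key=lambda mt: len(mt[0]))
    return entity_type

mentions = [("Bush", "PER"), ("G. Bush", "PER"), ("George W. Bush", "PER")]
print(classify_query("Bush", mentions))  # PER
```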

Alternate Name Generation. Generating Alternate Names (AN) for each query can effectively reduce the ambiguity of the mention and improve recall, under the assumption that two name variants in the same document can refer to the same entity. We follow the techniques below to generate ANs:

• Acronym Expansion. Acronyms form a major part of ORG queries and can be highly ambiguous, e.g. “ABC” can refer to around 100 entities. The purpose of acronym expansion is to reduce this ambiguity. The system scans the background document to gather any sequence of consecutive capitalized tokens whose initial letters match the letters of the acronym in order. Expansions are also acquired before or inside parentheses, e.g. “Congolese National Police (PNC)” or “PNC (Congolese National Police)” (a sketch of this matching appears after this list).

• Gazetteer-based AN Generation. Sometimes query names are abbreviations. In these cases, auxiliary gazetteers are useful for mapping pairs of ⟨abbreviation, expansion⟩, such as US states (e.g. the pairs ⟨CA, California⟩ or ⟨MD, Maryland⟩) and country abbreviations (e.g. the pairs ⟨UK, United Kingdom⟩, ⟨US, United States⟩ or ⟨UAE, United Arab Emirates⟩).

• Google API. In more challenging cases, some query names contain grammatical irregularities or only a partial form of the entity name. Using the Google API, more complete forms of the query name are probed across the Web. To do so, the component captures the title of the first (top-ranked) result of the Google search engine as a possibly better form of the query name. For instance, for the query name “Man U”, the complete form “Manchester United F.C.” is obtained with this method.
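The exact matching rules for acronym expansion are not fully specified; the sketch below approximates the first bullet with an initial-letter window scan plus the “expansion (ACRONYM)” parenthesis pattern (all names and regexes are ours):

```python
import re

def expand_acronym(acronym: str, document: str) -> set[str]:
    """Gather candidate expansions of an acronym from the background document."""
    expansions = set()
    tokens = document.split()
    n = len(acronym)
    # (a) consecutive capitalized tokens whose initials spell the acronym in order
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(w[:1].isupper() for w in window) and \
                "".join(w[0] for w in window).upper() == acronym.upper():
            expansions.add(" ".join(window))
    # (b) capitalized phrase immediately before the parenthesized acronym;
    #     the symmetric "ACRO (expansion)" pattern would be handled analogously
    pattern = r"((?:[A-Z]\w+\s+){1,6})\(" + re.escape(acronym) + r"\)"
    for m in re.finditer(pattern, document):
        expansions.add(m.group(1).strip())
    return expansions

doc = "Forces of the Congolese National Police (PNC) were deployed."
print(expand_acronym("PNC", doc))  # {'Congolese National Police'}
```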

3.3 Candidate Generation

Given a particular query, q, a set of candidates, C = {c_1, ..., c_n}, is found by retrieving those entries from the KB whose names are similar enough, using the Dice measure, to one of the alternate names of q found by the query expansion. In our experiments we used similarity thresholds of 0.9, 0.8 and 1 for PER, ORG and GPE respectively. By comparing the candidate entity type extracted from the corresponding KB page with the query type obtained by the NERC, we filter out those candidates having a different type, so as to obtain more discriminative candidates.
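The paper does not say over which units the Dice measure is computed for names; the sketch below assumes character bigrams, a common choice, and applies the per-type thresholds quoted above (all names are ours):

```python
def char_bigrams(name: str) -> set[str]:
    """Character bigrams of a lowercased name."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_string(a: str, b: str) -> float:
    """Dice coefficient between two names, computed over character bigrams."""
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0

# Per-type similarity thresholds from the paper's experiments.
THRESHOLDS = {"PER": 0.9, "ORG": 0.8, "GPE": 1.0}

def generate_candidates(alt_names: list[str], query_type: str, kb_entries):
    """kb_entries: iterable of (kb_id, name, entity_type); schema is illustrative."""
    return [
        (kb_id, name)
        for kb_id, name, etype in kb_entries
        if etype == query_type  # type filter against the NERC query type
        and any(dice_string(an, name) >= THRESHOLDS[query_type] for an in alt_names)
    ]
```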

3.4 Binary Vector Generation

In this step, we exploit the context between the components of each pair ⟨query name, named entity mention⟩ occurring in the offset sentence of the background document and in all sentences of each candidate wikitext. Our hypothesis is that the context between the components of each pair contains discriminative information and can be used to rank candidates. In the case of the background document, the system uses only the offset sentence rather than all sentences, since in most cases the named entity mentions occurring in other sentences of the background document are far from the main topic and may have a negative influence on ranking candidates.

Consider a query name q along with its background document d_q in which the query name occurs, the offset sentence s_o, and the set of named entity mentions occurring in the offset sentence, M_{s_o} = {m_q^1, ..., m_q^r}. The system extracts each pair λ_i composed of the query name q and a named entity mention m_q^i ∈ M_{s_o}:

\lambda_i = \langle q, m_q^i \rangle \qquad (1)

where i ∈ {1, ..., |M_{s_o}|}.

Consider the set of candidates C = {c_1, ..., c_n}. Let S_c = {s_c^1, ..., s_c^t} be the set of all sentences in each candidate's wikitext and M_c = {m_c^1, ..., m_c^u} the set of named entity mentions in each candidate's wikitext. The system extracts each pair λ consisting of the query name and a mention, both occurring in the same sentence:

\lambda_{i,j,k} = \langle q_{i,j}, m_{i,j,k} \rangle \qquad (2)

where i ∈ {1, ..., |C|}, j ∈ {1, ..., |S_c|}, and k ∈ {1, ..., |M_j|}, i.e. m_{i,j,k} is the k-th mention in the j-th sentence of the i-th candidate. The context between the occurrences of q and m is defined as the word sequence

W_\lambda = w_1 w_2 \dots w_n \qquad (3)

The word sequence W_λ is then lemmatized and all stopwords are removed. The system gathers the lemmas of all pairs together to create a bag of lemmas:

L_\lambda = \{l_1, l_2, \dots, l_y\} \qquad (4)

L_T = \bigcup_{\lambda \in \Lambda} L_\lambda \qquad (5)

where L_λ is the bag of lemmas of a pair λ, Λ is the set of all existing pairs, and L_T is the bag of lemmas of all pairs in Λ. Next, the system generates for each pair a vector of features over the bag of lemmas (a binary vector). To do so, the system generates a binary vector (a row matrix) ϕ_i assigned to pair λ_i (Equation 6). The value of each element of the vectors is initially set to zero. The number of vectors is equal to the number of pairs (|Λ|), and the number of elements of each vector is equal to the number of lemmas in L_T (|L_T|). For each vector, if the system finds a lemma from the bag of lemmas (L_T) in the relevant L_λ, the corresponding element of that vector is set to one (Equation 6).

\varphi_i = \begin{bmatrix} l_1 & l_2 & \dots & l_d \\ b_i^1 & b_i^2 & \dots & b_i^d \end{bmatrix} \qquad (6)

Each element of the vectors is given by:

b_i^j = \begin{cases} 0 & \text{if } l_j \notin L_{\lambda_i} \\ 1 & \text{if } l_j \in L_{\lambda_i} \end{cases} \qquad (7)
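As a minimal sketch of Equations 4–7 (assuming lemmatization and stopword removal have already produced one lemma bag per pair; the function name is ours):

```python
def binary_vectors(lemma_bags: list[set[str]]) -> tuple[list[str], list[list[int]]]:
    """Build one binary vector per pair from its bag of lemmas (Equations 4-7):
    lemma_bags[i] is L_lambda_i, and the union of all bags is L_T."""
    vocab = sorted(set().union(*lemma_bags))  # L_T, sorted for a stable order
    # b_i^j = 1 iff lemma l_j occurs in L_lambda_i
    vectors = [[1 if lemma in bag else 0 for lemma in vocab] for bag in lemma_bags]
    return vocab, vectors

bags = [{"win", "match", "league"}, {"study", "campus", "league"}]
vocab, vecs = binary_vectors(bags)
print(vocab)  # ['campus', 'league', 'match', 'study', 'win']
print(vecs)   # [[0, 1, 1, 0, 1], [1, 1, 0, 1, 0]]
```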

Figure 2: A sample graph structure.

3.5 Graph Generation

In order to rank the candidates, the system generates a star graph for the query, G_q = (V_q, E_q), and for each candidate, G_c = (V_c, E_c), in which V and E are the sets of vertices and edges respectively (Figure 2). The aim of this step is to produce a graph structure with which to measure the similarity between G_q and each G_c, in order to select the candidate most similar to the query. The central vertex of G_q is labeled with the query name, and the central vertex of each G_c is labeled with the candidate name. The other vertices in G_q and G_c are labeled with the named entity mentions occurring in the set of pairs λ ∈ Λ. Each edge is labeled with the semantic relation existing between the linked entities, which is represented by the binary vector ϕ corresponding to each pair λ. Given that we are dealing with star graphs, each graph G_i with central vertex i can be represented as the list of pairs (e_j, v_j) over all outgoing edges of vertex i (Equation 8).

G_i := \{\langle e_j, v_j \rangle\}_i = \{\langle \varphi_j, m_j \rangle\}_i \qquad (8)
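A star graph of this kind can be stored simply as its central label plus the list of ⟨ϕ_j, m_j⟩ pairs of Equation 8; a minimal sketch (field names are ours):

```python
from dataclasses import dataclass

@dataclass
class StarGraph:
    """A star graph as in Equation 8: the central vertex label plus one
    (binary edge vector phi_j, mention m_j) pair per outgoing edge."""
    center: str                          # query name or candidate name
    edges: list[tuple[list[int], str]]   # [(phi_j, m_j), ...]

g_q = StarGraph("Barcelona", [([0, 1, 1, 0, 1], "Camp Nou")])
```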

3.6 Candidate Ranking and NIL Clustering

For ranking the candidates, each G_c, c ∈ C, is scored based on the similarity between G_q and G_c, which is equal to the degree of similarity between the outgoing edges of the two graphs G_q and G_c (Equation 9).

Sim(G_q, G_c) = Sim(\{\langle \varphi_i, m_i \rangle\}_q, \{\langle \varphi_j, m_j \rangle\}_c) \qquad (9)

In order to calculate Sim(G_q, G_c), the system first computes the similarity β between each vertex m_i ∈ G_q and each vertex m_j ∈ G_c (except q and c_i) using the Levenshtein distance metric (Equation 10).

\beta_{m_i,m_j} = 1 - \frac{lev_{m_i,m_j}}{|m_i| + |m_j|} \qquad (10)

where β_{m_i,m_j} is the degree of similarity between m_i ∈ G_q and m_j ∈ G_c, lev_{m_i,m_j} is the Levenshtein metric measuring the difference between the two strings m_i and m_j, and |m_i| and |m_j| are the lengths (numbers of characters) of m_i and m_j respectively. For instance, if m_i = “Barcelona” and m_j = “F.C. Barcelona”, then lev_{m_i,m_j} = 5 and |m_i| + |m_j| = 23, and therefore β_{m_i,m_j} = 1 − 5/23 = 0.78.

In addition, the system computes the similarity β between each edge ϕ_i ∈ G_q and each edge ϕ_j ∈ G_c using the Dice metric (Equation 11).

\beta_{\varphi_i,\varphi_j} = dice_{\varphi_i,\varphi_j} = \frac{2\,T_{i,j}}{T_i + T_j} \qquad (11)

where β_{ϕ_i,ϕ_j} is the degree of similarity between ϕ_i ∈ G_q and ϕ_j ∈ G_c, dice_{ϕ_i,ϕ_j} is the function calculating the Dice coefficient between ϕ_i and ϕ_j, T_{i,j} is the number of positive matches between the vectors ϕ_i and ϕ_j, and T_i and T_j are the total numbers of positive entries in the vectors ϕ_i and ϕ_j respectively. For instance, in Equation 11, suppose that ϕ_i = [1110001010] and ϕ_j = [0010001011]²; then dice_{ϕ_i,ϕ_j} = 2(3)/9 = 0.66.

² In this example we assume |ϕ| = 10; in real samples, |ϕ| is much larger (usually |ϕ| > 100).

Furthermore, for each G_c the system generates a set of links H_{q,c} = {h_1, ..., h_f}, each link h holding between vertices m_q and m_c. As shown in Figure 3, each link h has an attached weight α. To calculate the value of each α, the system combines the similarities β_{m_i,m_j} and β_{ϕ_i,ϕ_j} (Equation 12).

\alpha_{h \in H_{q,c}} = \beta_{m_i,m_j} + (1 - \beta_{m_i,m_j})\,\beta_{\varphi_i,\varphi_j} \qquad (12)

Subsequently, to score each candidate, the average of all α values for each G_c is computed:

Sim(G_q, G_c) = X_{c \in C} = \frac{\sum_{h \in H} \alpha_h}{|H|} \qquad (13)

where X_{c∈C} indicates the score obtained by the candidate c ∈ C. The system then selects the candidate having the highest score as the correct reference of the query (Equation 14).

answer_q := \{ z \in C \mid \forall c \in C : X_c \le X_z \} \qquad (14)

where answer_q indicates the entry in the KB related to the query.

Figure 3: A sample graph structure with α relations.
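Putting Equations 10–14 together, the following sketch scores a candidate against the query. How the link set H_{q,c} is formed is not fully specified in the paper, so we assume each query edge links to its best-matching candidate edge; levenshtein is a textbook dynamic-programming implementation, and the edge lists match the StarGraph structure above:

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def beta_mention(mi: str, mj: str) -> float:
    """Equation 10: vertex (mention) similarity."""
    return 1 - levenshtein(mi, mj) / (len(mi) + len(mj))

def beta_edge(phi_i: list[int], phi_j: list[int]) -> float:
    """Equation 11: Dice coefficient between two binary edge vectors."""
    matches = sum(x & y for x, y in zip(phi_i, phi_j))
    positives = sum(phi_i) + sum(phi_j)
    return 2 * matches / positives if positives else 0.0

def score(q_edges, c_edges):
    """Equations 12-13: link each query edge to its best-matching candidate
    edge (our assumption), then average the alpha weights over all links."""
    alphas = []
    for phi_q, m_q in q_edges:
        best = 0.0
        for phi_c, m_c in c_edges:
            b_m = beta_mention(m_q, m_c)
            best = max(best, b_m + (1 - b_m) * beta_edge(phi_q, phi_c))  # Eq. 12
        alphas.append(best)
    return sum(alphas) / len(alphas) if alphas else 0.0  # Equation 13

# Worked values from the text:
print(round(beta_mention("Barcelona", "F.C. Barcelona"), 2))    # 0.78
print(round(beta_edge([1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]), 2))      # 0.67
```

Equation 14 then amounts to max(C, key=lambda c: score(q_edges, c.edges)) over the candidate graphs.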

For those queries referring to entities which are not in the KB (NIL queries), the system should cluster them into groups, each referring to the same Not-In-KB entity (NIL Clustering). To this end, a term clustering method is applied to such queries. Each initial NIL query forms a cluster and is assigned a NIL id. The module compares each new NIL query with each existing cluster (its initial NIL query) using the Dice coefficient similarity between all ANs (including the query name) of both queries; in the comparison between the new NIL query and the cluster, we consider the maximum Dice score. If the similarity is higher than a predefined NIL threshold, the new NIL query obtains the id of this cluster; otherwise, it forms a new NIL cluster and obtains a new NIL id. In our experiments we manually selected 0.8 as the NIL threshold (Equation 15).

id_{nil} = \begin{cases} id_{clr} & \text{if } dice_{nil,clr} \ge 0.8 \\ id_{nclr} & \text{otherwise} \end{cases} \qquad (15)

where id_{nil} is the id of the NIL query, id_{clr} is the id of an existing cluster, dice_{nil,clr} is the Dice function applied to the NIL query and an existing cluster, and id_{nclr} is the id of a new cluster.
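A sketch of this threshold-based clustering (Equation 15), assuming the Dice score between two queries is the maximum pairwise character-bigram Dice over their alternate-name sets (helper names are ours):

```python
def dice_bigrams(a: str, b: str) -> float:
    """Dice coefficient over character bigrams of two strings."""
    x = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    y = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0

NIL_THRESHOLD = 0.8  # selected manually in the paper's experiments

def cluster_nil(queries: list[set[str]]) -> list[int]:
    """Assign a NIL cluster id to each query, given its set of alternate
    names (query name included). Implements Equation 15."""
    clusters: list[set[str]] = []  # AN set of each cluster's initial query
    ids = []
    for ans in queries:
        best_id, best_score = -1, 0.0
        for cid, cans in enumerate(clusters):
            # maximum pairwise Dice score between the two AN sets
            s = max(dice_bigrams(a, b) for a in ans for b in cans)
            if s > best_score:
                best_id, best_score = cid, s
        if best_score >= NIL_THRESHOLD:
            ids.append(best_id)            # id_clr: join the existing cluster
        else:
            clusters.append(ans)           # id_nclr: open a new cluster
            ids.append(len(clusters) - 1)
    return ids

qs = [{"Manchester United"}, {"Manchester United F.C."}, {"ABC Corp"}]
print(cluster_nil(qs))  # [0, 0, 1]
```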

4 Evaluation Framework

We have participated in the framework of the TAC-KBP 2014 Mono-Lingual (English) EL evaluation track³.

Given a list of queries, the system should provide the identifier of the KB entry to which the name refers, if it exists, or a NIL id if there is no such KB entry. The EL system is also required to cluster together queries referring to the same non-KB (NIL) entities and to provide a unique id for each cluster. The reference KB used in this track includes hundreds of thousands of entities based on articles from an October 2008 dump of English Wikipedia (WP), comprising 818,741 nodes. The evaluation query set in the 2014 experiment contains 5,234 queries. Some entities share confusable names, which is especially challenging in the case of acronyms.

5 Results and Analysis

We have participated in the TAC-KBP 2014 Mono-Lingual (English) EL track. In total, 17 teams submitted 55 runs for this track. Table 1 shows our official results, measured by the B-cubed+ metric (F1), for our two submitted runs: the first run without access to the Web, and the second run using Web access. The table splits the results by those query answers existing in the reference KB (in-KB) and those not in the KB (NIL). The evaluation corpus in TAC-KBP 2014 includes three genres: News Wires (NW), Web Documents (WB), and Discussion Fora (DF). DF is highly challenging since it contains many grammatical irregularities extracted from fora, blogs, etc. Furthermore, the table breaks down the results by the three query types, PER, ORG, and GPE. It also shows the median of all participants and the results of the team with the highest score in TAC-KBP 2014.

Although our assumption was that run 2 (with Web access) would clearly obtain better results, the overall results of the two runs are very close. Comparatively, the system obtained a higher score for linking in-KB queries and a lower score for NIL queries in run 2 than in run 1. Our results are below the median of all participants in this track: our results for NIL queries are near the median, but the result for in-KB queries is far from the median score. In addition, among the query types, the worst result belongs to GPE.

³ http://www.nist.gov/tac/

System    All   in-KB  NIL   NW    WB    DF    PER   ORG   GPE
Run 1     0.56  0.36   0.79  0.56  0.54  0.58  0.65  0.49  0.35
Run 2     0.57  0.43   0.73  0.57  0.53  0.59  0.63  0.54  0.41
Median    0.70  0.65   0.76  0.67  0.68  0.76  0.72  0.70  0.72
Highest   0.82  0.79   0.85  0.83  0.78  0.84  0.84  0.82  0.83

Table 1: TALP official results (B³+ F1) obtained in the TAC-KBP 2014 Mono-Lingual (English) EL evaluation track, submitted in two runs: run 1 without Web access and run 2 with Web access.

6 Conclusions and Future Work

In our third participation in the TAC-KBP 2014 EL evaluation track, we applied a VSM model in order to rank the candidates of each query. The results were compared with those obtained by all participants, using the B-cubed+ F1 metric; we obtained results below the median of all participants in TAC-KBP 2014. In order to improve the system's accuracy, we aim to combine it with another, graph-based ranker. This ranker will measure the semantic similarity between the query and each candidate, which should be beneficial when queries are highly ambiguous.

Acknowledgements

This work has been produced with the support of the SKATER project (TIN2012-38584-C06-01).

References

Satanjeev Banerjee and Ted Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI, volume 3.

Horst Bunke and Alberto Sanfeliu, editors. 1990. Syntactic and structural pattern recognition: theory and applications. World Scientific.

King Sun Fu. 1982. Syntactic pattern recognition and applications. Prentice-Hall.

George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.

Adam R. Klivans and Rocco A. Servedio. 2006. Toward attribute efficient learning of decision lists and parities. The Journal of Machine Learning Research, 7:587–602.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In the 5th Annual International Conference on Systems Documentation. ACM.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In the 17th International Conference on Computational Linguistics, volume 2. Association for Computational Linguistics.

John F. Magee. 1964. Decision trees for decision making. Graduate School of Business Administration, Harvard University.

Warren S. McCulloch and Walter Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133.

Rada Mihalcea. 2004. Co-training and self-training for word sense disambiguation. In the Conference on Natural Language Learning.

Raymond J. Mooney. 1996. Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 82–91.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2).

Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

John Ross Quinlan. 1993. C4.5: programs for machine learning, volume 1. Morgan Kaufmann.

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.

Ronald L. Rivest. 1987. Learning decision lists. Machine Learning, 2(3):229–246.

Hinrich Schütze. 1992. Dimensions of meaning. In Supercomputing '92: ACM/IEEE Conference on Supercomputing, pages 787–796. IEEE Computer Society Press.

Geoffrey Towell and Ellen M. Voorhees. 1998. Disambiguating highly ambiguous words. Computational Linguistics, 24(1):125–145.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In the 19th International Conference on Computational Linguistics, volume 1. Association for Computational Linguistics.