Named Entity Disambiguation Based on Explicit Semantics

Martin Jačala and Jozef Tvarožek, Slovak University of Technology, Bratislava, Slovakia. Špindlerův Mlýn, Czech Republic, January 23, 2012.


Transcript of Named Entity Disambiguation Based on Explicit Semantics

Page 1: Named Entity Disambiguation Based on Explicit Semantics

Named Entity Disambiguation Based on Explicit Semantics

Martin Jačala and Jozef Tvarožek

Špindlerův Mlýn, Czech Republic, January 23, 2012

Slovak University of Technology, Bratislava, Slovakia

Page 2: Named Entity Disambiguation Based on Explicit Semantics


Problem
• Given an input text, detect named entities and decide on the correct meaning of each

"The jaguar is a big cat, a feline in the Panthera genus, and is the only Panthera species found in the Americas." → Jaguar (animal)

"Jaguar cars today are designed in Jaguar Land Rover's engineering centres at the Whitley plant in Coventry." → Jaguar Cars

Page 3: Named Entity Disambiguation Based on Explicit Semantics

Current solutions
• Charles-Miller hypothesis
  – Similar words occur in semantically similar contexts
  – Hence, we should be able to distinguish the correct sense
• Dictionary approach, sense clustering
• Semantic similarity
• Use of big data: Wikipedia, Open Directory Project

Page 4: Named Entity Disambiguation Based on Explicit Semantics

General scheme
• In detecting the most probable sense
  – we rank the possible senses, and
  – pick the best


[Diagram blocks: Entity boundary detection, Document transformation, Candidate meanings selection, Similarity computation, Semantic space, Disambiguation dictionary]

Page 5: Named Entity Disambiguation Based on Explicit Semantics


Entity boundary detection
• Stanford named entity recognizer
  – Conditional random fields
  – F1 score 88 to 93% (CoNLL 2003 dataset)
  – Used only for detection of surface forms


Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

Page 6: Named Entity Disambiguation Based on Explicit Semantics


Disambiguation dictionary
• Built from explicit semantics in Wikipedia
  – Disambiguation and redirect pages
  – Hyperlinks in articles’ text
• Candidate meanings for each surface form, e.g.


Jaguar = {Jaguar, Jaguar Cars, Mac OS X 10.2, ...}
Big Blue = {IBM, Pacific Ocean, New York Giants, ...}
Jordan = {Jordan, Jordan River, Michael Jordan, ...}
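One way such a dictionary could be assembled is sketched below. It only illustrates merging the three sources named on the slide; the toy data, the redirect title "Jaguar Cars Ltd" and the function name `build_dictionary` are assumptions made for the example, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy extracts; a real system would parse a Wikipedia dump.
REDIRECTS = {"Jaguar Cars Ltd": "Jaguar Cars"}  # redirect title -> target article
DISAMBIGUATION = {"Jaguar": ["Jaguar", "Jaguar Cars", "Mac OS X 10.2"]}  # page -> listed articles
ANCHORS = [  # (link text, target article) pairs taken from article bodies
    ("Big Blue", "IBM"),
    ("Big Blue", "New York Giants"),
    ("Jordan", "Michael Jordan"),
]


def build_dictionary(redirects, disambiguation, anchors):
    """Map each surface form to the set of candidate article titles."""
    candidates = defaultdict(set)
    for source, target in redirects.items():
        candidates[source].add(target)
    for surface, targets in disambiguation.items():
        candidates[surface].update(targets)
    for text, target in anchors:
        candidates[text].add(target)
    return dict(candidates)


dictionary = build_dictionary(REDIRECTS, DISAMBIGUATION, ANCHORS)
print(sorted(dictionary["Jaguar"]))  # ['Jaguar', 'Jaguar Cars', 'Mac OS X 10.2']
```

Using sets keeps a meaning that is reachable through several sources (for example both a redirect and a hyperlink) from being listed twice.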

Page 7: Named Entity Disambiguation Based on Explicit Semantics


Vector space
• Concept space = Wikipedia’s articles
  – Explicit (human-defined) concepts
• We construct a term-concept matrix
  – terms = rows, concepts = columns
  – tf-idf values
• For an input text, the “semantic” vector is
  – the sum of term vectors for all terms in the input
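The construction above can be sketched in a few lines. The slide says only "tf-idf values", so the weighting used here (raw term frequency times log(N/df)), the whitespace tokenizer and the three-article toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical miniature "Wikipedia": concept title -> article text.
CONCEPTS = {
    "Mammal": "cat feline species animal",
    "Sports Car": "car engine racing car",
    "Britain": "coventry plant car",
}


def build_term_concept_matrix(concepts):
    """Return {term: {concept: tf-idf weight}} (rows = terms, columns = concepts)."""
    counts = {name: Counter(text.split()) for name, text in concepts.items()}
    n_concepts = len(concepts)
    # Document frequency: in how many concept articles each term occurs.
    doc_freq = Counter(term for tf in counts.values() for term in tf)
    matrix = {}
    for name, tf in counts.items():
        for term, freq in tf.items():
            idf = math.log(n_concepts / doc_freq[term])
            matrix.setdefault(term, {})[name] = freq * idf
    return matrix


def semantic_vector(text, matrix):
    """Sum the term vectors of all known terms in the input text."""
    vector = Counter()
    for term in text.lower().split():
        for concept, weight in matrix.get(term, {}).items():
            vector[concept] += weight
    return dict(vector)


matrix = build_term_concept_matrix(CONCEPTS)
print(semantic_vector("The jaguar is a big cat", matrix))
```

Storing each row as a sparse dictionary avoids materializing a column for every Wikipedia article, which matters once the concept space has millions of entries.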

Page 8: Named Entity Disambiguation Based on Explicit Semantics

Example


Concepts: Mammal, Britain, Technology, Racing, Wilderness, Amazonia, Sports Car, Leopard

Jaguar cars today are designed in Jaguar Land Rover's engineering centres at the Whitley plant in Coventry.

(0, 20, 473, 845, 0, 0, 304, 0)

The jaguar is a big cat, a feline in the Panthera genus, and is the only Panthera species found in the Americas.

(30, 0, 0, 0, 430, 320, 0, 100)

Page 9: Named Entity Disambiguation Based on Explicit Semantics


Ranking
• Candidate meaning description = wiki page
  – Description is transformed to the vector space
• Ranking algorithm: given a surface form, entity meanings are ranked according to the cosine similarity between the input text and each of the candidate meaning descriptions
• Extension: add entity occurrence context as input to similarity computation
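The ranking step can be sketched as follows, assuming vectors are stored sparsely as concept-to-weight dictionaries. The function names and all numeric weights are hypothetical and chosen only to make the example readable.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_candidates(input_vector, candidate_vectors):
    """Return candidate meanings sorted by decreasing similarity to the input."""
    scored = [(cosine(input_vector, vec), name) for name, vec in candidate_vectors.items()]
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical concept-space vectors, for illustration only.
text_vector = {"Technology": 4.0, "Racing": 8.0, "Sports Car": 3.0}
candidates = {
    "Jaguar": {"Mammal": 5.0, "Wilderness": 4.0, "Amazonia": 3.0},
    "Jaguar Cars": {"Sports Car": 5.0, "Racing": 3.0, "Britain": 2.0},
}
print(rank_candidates(text_vector, candidates)[0])  # Jaguar Cars
```

Cosine similarity normalizes by vector length, so a long candidate article does not win merely because its description vector has larger weights.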

Page 10: Named Entity Disambiguation Based on Explicit Semantics

Baseline
• Dictionary contains surface forms with context
  – extracted from Wikipedia using letter capitalization
• Context = sliding window of 50 words around each occurrence of a hypertext link
• Article names are the entity labels
  – Description is the average of contexts in Wikipedia
• Disambiguation result is the best similarity match of the input text to the entity descriptions


Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL. pp. 9-16 (2006)
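The baseline's context extraction can be illustrated with a small sketch. It assumes MediaWiki-style `[[Target]]` and `[[Target|anchor text]]` links, whitespace tokenization, and a window split evenly before and after the start of the link; the slide does not say how the 50 words are positioned, so that split and the name `link_contexts` are assumptions.

```python
import re

# Matches [[Target]] or [[Target|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def link_contexts(wikitext, window=50):
    """Yield (target, context_words) for each wiki link in the text."""
    tokens = []  # words of the text, with each link replaced by its anchor text
    links = []   # (token index of the anchor's first word, target article)
    pos = 0
    for match in LINK.finditer(wikitext):
        tokens.extend(wikitext[pos:match.start()].split())
        target = match.group(1)
        anchor = match.group(2) or target
        links.append((len(tokens), target))
        tokens.extend(anchor.split())
        pos = match.end()
    tokens.extend(wikitext[pos:].split())
    half = window // 2
    for index, target in links:
        yield target, tokens[max(0, index - half): index + half]


sample = "The [[Jaguar Cars|Jaguar]] plant is in [[Coventry]] today"
for target, context in link_contexts(sample, window=4):
    print(target, context)
```

Averaging the vectors of all contexts collected for one target article would then give the entity description used by the baseline.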

Page 11: Named Entity Disambiguation Based on Explicit Semantics


Datasets
• Three manually disambiguated datasets
  – Development, Evaluation1, Evaluation2
• Each contains 20 “random” newswire articles containing ambiguous entities
  – Columbia, Texas, Jaguar, etc.
• Discarded references to entities not in Wikipedia
  – Each entity has 18 meanings on average
  – 78% of entities have more than two meanings

Page 12: Named Entity Disambiguation Based on Explicit Semantics

Evaluation results


Success rate [%]

Method     devel   eval1   eval2
ESA        91.93   90.25   86.56
LSA        81.41   81.05   82.33
Baseline   76.28   85.36   82.38

Page 13: Named Entity Disambiguation Based on Explicit Semantics


Evaluation discussion
• Baseline method fails to capture the Wikipedia content due to the way it is written
  – Hyperlinks are usually early in the article
• Latent semantic analysis (250 dims)
  – Speeds up the process
  – Decreased accuracy
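The slide names latent semantic analysis with 250 dimensions but does not say how it was applied. In its standard formulation, LSA replaces a matrix with its rank-k truncated singular value decomposition. Assuming that A is the term-concept matrix from the vector space slide and q is the concept-space row vector of an input text, the reduction can be sketched as:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad k = 250, \qquad \hat{q} = q\, V_k
```

Cosine similarity is then computed between k-dimensional vectors rather than vectors with one component per Wikipedia article, which is consistent with the reported speed-up. Discarding the smaller singular values is a plausible, though unconfirmed, source of the lower accuracy.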

Page 14: Named Entity Disambiguation Based on Explicit Semantics

Summary
• Named entity disambiguation based on explicit semantics
• Explicit semantics within Wikipedia
  – Link descriptions
  – Redirect and disambiguation pages
• 7 to 8% improvement
  – 86 to 92% accuracy (up from 76 to 85%)
