Exploiting Wikipedia for Entity Name Disambiguation in Tweets


Muhammad Atif Qureshi, Colm O'Riordan, Gabriella Pasi

06/16/14 2

Contents

● Introduction
● Related Work
● Methodology
● Evaluation
● Conclusion


Motivation

● Social media users voice their opinions about various entities/brands (e.g., musicians, movies, companies)

● These opinions serve as implicit feedback on the entity/brand

● This has recently given birth to a new area within the marketing domain known as “online reputation management”


Problem Statement

● Given a set of tweets collected by issuing a query for an entity (brand) name, the task is to determine which tweets are related to the entity and which are not

● Decide whether a tweet is related to Apple Inc.
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”


Wikipedia Graph Structure

[Figure: Wikipedia graph structure. Nodes are categories (C1, C2, ...) and articles (A1 to A4). Edge types: category edge, article belonging to category, article link.]


Related Work

● Entity linking: linking an entity to its correct sense
– Ferragina and Scaiella (2010) and Meij et al. (2012) have proposed strategies for tweets
    ● They use the hyperlink structure of Wikipedia and the anchor texts of the links to those Wikipedia pages
    ● Disambiguation is performed by applying a voting function among all senses associated with the detected anchors
– Meij et al. (2012) employ supervised machine learning techniques for further improvement


Methodology

● Chunking Strategy
● Entity Phrases & Categories
● Features Based on Wikipedia Articles' Hyperlinks
● Relatedness Score Based on Wikipedia Category-Article Structure


Chunking Strategy

Tweet:
I prefer Samsung over HTC, Apple, Nokia because it is economical and good

Phrase chunks with boundaries:
i prefer samsung over htc apple nokia because it is economical and good

Stopwords removed, longest phrase matched over Wikipedia as article:
prefer, samsung, htc, apple, nokia, economical
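The chunking steps on this slide can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stopword list, the `TITLES` set standing in for Wikipedia article titles, the `max_len` limit, and the use of punctuation as chunk boundaries are all assumptions made for the example.

```python
import re

# Hypothetical stopword list and article-title set, for illustration only.
STOPWORDS = {"i", "over", "because", "it", "is", "and"}
TITLES = {"prefer", "samsung", "htc", "apple", "nokia", "economical",
          "new york", "new york city"}


def chunk_tweet(tweet, titles, max_len=3):
    """Return the phrases of a tweet that match an article title.

    Assumed steps: lowercase, split at punctuation (chunk boundaries),
    drop stopwords, then greedily take the longest matching phrase.
    """
    segments = re.split(r"[^\w\s]+", tweet.lower())
    phrases = []
    for segment in segments:
        words = [w for w in segment.split() if w not in STOPWORDS]
        i = 0
        while i < len(words):
            # Try the longest candidate first, then shorter ones.
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + n])
                if candidate in titles:
                    phrases.append(candidate)
                    i += n
                    break
            else:
                # No title starts at this word; skip it.
                i += 1
    return phrases


tweet = "I prefer Samsung over HTC, Apple, Nokia because it is economical and good"
print(chunk_tweet(tweet, TITLES))
```

Matching the longest phrase first keeps multi-word titles intact: with the toy set above, "new york city" is returned as one phrase rather than as "new york".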


Entity Phrases & Categories

[Diagram: deriving entity phrases and entity categories for an entity E1]

● Entity E1 has a Wikipedia article A_E1
● A_E1 has a list of Wikipedia categories CL_E1
● Sub-categories SCL_E1 of CL_E1 are taken up to a depth of 2
● Entity categories, or RC: CL_E1 and SCL_E1
● List of entity phrases of E1, or articles of RC: mentions inside A_E1, Wikipedia articles in CL_E1, and Wikipedia articles in SCL_E1
● Categories, or WC (i.e., RC ⊂ WC)
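Collecting the entity categories (RC) can be sketched as a breadth-first walk over the category graph. This is only an illustration: the `subcategories` mapping and the category names are hypothetical, and numbering the article's own categories as depth 0 is an assumption.

```python
from collections import deque


def related_categories(article_categories, subcategories, max_depth=2):
    """Map each entity category to its depth.

    article_categories: categories of the entity's article (assumed depth 0).
    subcategories: dict from a category to its direct sub-categories.
    Sub-categories are followed down to max_depth and no further.
    """
    depth = {cat: 0 for cat in article_categories}
    queue = deque(article_categories)
    while queue:
        cat = queue.popleft()
        if depth[cat] == max_depth:
            continue  # do not expand below the depth limit
        for child in subcategories.get(cat, ()):
            if child not in depth:  # keep the shallowest depth seen
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth


# Hypothetical category graph, for illustration only.
toy_graph = {
    "Apple Inc.": ["Apple Inc. hardware"],
    "Apple Inc. hardware": ["iPhone"],
    "iPhone": ["iPhone models"],
}
rc = related_categories(["Apple Inc."], toy_graph)
print(rc)
```

Storing the depth alongside each category is convenient because the relatedness score later weights categories by depth.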


Features Based on Wikipedia Articles' Hyperlinks

For entity Apple Inc.

Chunked tweet: apple (entity phrase), doctor, fruit (context phrases)

Entity phrase sense   Context phrase senses                        Avg. max. sense score
                      doctor                   fruit
                      phd    band   medical    album   plant
apple (fruit)         80     45     230        6       532         381
apple (film)          10     50     0          9       0           29.5
apple (inc.)          83     20     10         5       0           44

● Feature values are generated using inlinks, outlinks, and inlinks + outlinks
● Sense apple (inc.) is related to the entity while the others are not
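The last column of the table is consistent with taking, for each context phrase, the maximum score over its senses and then averaging those maxima. The sketch below reproduces only that aggregation from the numbers on the slide; how the individual link-based scores are computed is not shown here, and the function and variable names are illustrative.

```python
def avg_max_sense_score(scores_by_context):
    """Average, over context phrases, of the best score among each phrase's senses."""
    maxima = [max(scores) for scores in scores_by_context.values()]
    return sum(maxima) / len(maxima)


# Scores copied from the slide: doctor -> (phd, band, medical), fruit -> (album, plant).
table = {
    "apple (fruit)": {"doctor": [80, 45, 230], "fruit": [6, 532]},
    "apple (film)": {"doctor": [10, 50, 0], "fruit": [9, 0]},
    "apple (inc.)": {"doctor": [83, 20, 10], "fruit": [5, 0]},
}

for sense, scores in table.items():
    print(sense, avg_max_sense_score(scores))
```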


Relatedness Score Based on Wikipedia Category-Article Structure

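\[ \mathrm{DepthSignificance}(p) = \sum_{cat \in RC \cap p_{cat}} \frac{1}{depth_{cat} + 1} \]

\[ \mathrm{CatSignificance}(p) = \frac{|RC \cap p_{cat}|}{|WC \cap p_{cat}|} \times \log\left(|RC \cap p_{cat}| + 1\right) \]

\[ \mathrm{PhraseSignificance}(p) = \log\left(\mathrm{wordlen}(p) + 1\right) \times p_{frequency} \]

\[ \mathrm{RelatednessScore} = \sum_{p \in MatchedPhrases} \mathrm{DepthSignificance}(p) \times \mathrm{CatSignificance}(p) \times \mathrm{PhraseSignificance}(p) \]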
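A minimal sketch of how the relatedness score on this slide could be computed. Several points are assumptions rather than details from the slides: the natural logarithm is used, wordlen(p) is taken as the number of words in the phrase, the depth term is read as 1/(depth + 1), RC is passed as a mapping from category to depth, and a phrase with no categories in WC contributes zero. All names and the toy data are hypothetical.

```python
import math


def depth_significance(p_cat, rc_depth):
    """Sum of 1 / (depth + 1) over the phrase's categories that are in RC."""
    return sum(1.0 / (rc_depth[c] + 1) for c in p_cat if c in rc_depth)


def cat_significance(p_cat, rc_depth, wc):
    """(|RC ∩ p_cat| / |WC ∩ p_cat|) * log(|RC ∩ p_cat| + 1)."""
    in_rc = sum(1 for c in p_cat if c in rc_depth)
    in_wc = sum(1 for c in p_cat if c in wc)
    if in_wc == 0:
        return 0.0  # assumption: no known categories means no contribution
    return (in_rc / in_wc) * math.log(in_rc + 1)


def phrase_significance(phrase, frequency):
    """log(wordlen(p) + 1) * frequency, with wordlen taken as the word count."""
    return math.log(len(phrase.split()) + 1) * frequency


def relatedness_score(matched_phrases, rc_depth, wc):
    """matched_phrases maps a phrase to (its categories, its frequency)."""
    total = 0.0
    for phrase, (p_cat, frequency) in matched_phrases.items():
        total += (depth_significance(p_cat, rc_depth)
                  * cat_significance(p_cat, rc_depth, wc)
                  * phrase_significance(phrase, frequency))
    return total


# Hypothetical toy data, for illustration only.
rc_depth = {"Apple Inc.": 0, "Smartphones": 1}
wc = {"Apple Inc.", "Smartphones", "Fruit"}
matched = {
    "iphone": ({"Apple Inc.", "Smartphones"}, 1),
    "fruit": ({"Fruit"}, 1),
}
score = relatedness_score(matched, rc_depth, wc)
print(score)
```

In this toy case only "iphone" contributes, since "fruit" has no category in RC and therefore a depth significance of zero.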


Dataset

● Multilingual tweets of 61 entities (25% Spanish, 75% English)
– Training: ~749 tweets for each entity
– Testing: ~1481 tweets for each entity

Domain       No. of Entities   Training                          Testing
                               Non     Rev     Orig    Trans     Non     Rev     Orig    Trans
Music        20                1461    14353   12518   3296      1998    28137   23442   6693
University   10                3548    3412    6569    391       6760    7387    13060   1087
Banking      11                2021    5753    5327    2447      4335    11635   10918   5052
Automotive   20                3767    11356   12585   2538      6851    23253   24690   5414
Total        61                10797   34874   36999   8672      19944   70412   72110   18246


Measure

● Reliability is the product of precision in both classes (i.e., true positives and true negatives)

● Sensitivity is the product of recall of both classes

\[ \mathrm{Reliability} = \frac{TP}{TP + FP} \times \frac{TN}{TN + FN} \]

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP} \]
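As a small worked example, the two measures can be computed directly from confusion-matrix counts. The counts below are made up for illustration, and returning 0.0 when a denominator is zero is an assumption, not something stated on the slide.

```python
def _ratio(numerator, denominator):
    # Assumption: treat an undefined ratio (empty denominator) as 0.0.
    return numerator / denominator if denominator else 0.0


def reliability(tp, fp, tn, fn):
    """Product of the precision of the positive and the negative class."""
    return _ratio(tp, tp + fp) * _ratio(tn, tn + fn)


def sensitivity(tp, fp, tn, fn):
    """Product of the recall of the positive and the negative class."""
    return _ratio(tp, tp + fn) * _ratio(tn, tn + fp)


# Made-up counts: 8 true positives, 2 false positives, 6 true negatives, 4 false negatives.
r = reliability(tp=8, fp=2, tn=6, fn=4)
s = sensitivity(tp=8, fp=2, tn=6, fn=4)
print(r, s)
```

Because each measure is a product over both classes, a system that labels every tweet as related scores zero on both: it has no true negatives.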


Settings

● Classifier: Random Forest

Setting     Features Based on      Relatedness Score Based on     Domain Level   Entity Level
            Wikipedia Articles'    Wikipedia Category-Article     Training       Training
            Hyperlinks             Structure
hrdomain    x                      x                              x
hrentity    x                      x                                             x
rdomain                            x                              x
rentity                            x                                             x


Results

Performance Comparison with Other Systems

Team           Reliability   Sensitivity   F(R,S)
POPSTAR        0.73          0.45          0.49
OUR APPROACH   0.67          0.42          0.45
SZTE NLP       0.60          0.44          0.44
LIA            0.66          0.36          0.38
BASELINE       0.49          0.32          0.33
UvA UNED       0.68          0.22          0.21

Evaluation Results on Test Set by Domain

Domain       Setting    Reliability   Sensitivity   F(R,S)
Automotive   hrdomain   0.54          0.47          0.47
Banking      hrentity   0.75          0.58          0.49
University   hrdomain   0.71          0.44          0.49
Music        rentity    0.83          0.34          0.39


Conclusion

● The experimental evaluations establish Wikipedia’s strength as a significant encyclopaedic resource for the task of entity name disambiguation in tweets.

● The relatedness score defined using the Wikipedia category-article structure introduces a powerful semantic notion of linking n-grams in a tweet with the information relevant to an entity.

● As future work, we aim to combine our Wikipedia-based features with text-based techniques to further improve performance.


References

● E. Amigo, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martin, E. Meij, M. de Rijke, and D. Spina. Overview of RepLab 2013: Evaluating on-line reputation monitoring systems. In CLEF 2013 Labs and Workshop Notebook Papers, Springer LNCS, 2013.

● P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.

● E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In WSDM ’12, pages 563–572, New York, NY, USA, 2012. ACM.

● M.-H. Peetz, D. Spina, J. Gonzalo, and M. de Rijke. Towards an active learning system for company name disambiguation in microblog streams. In CLEF (Online Working Notes/Labs/Workshop), 2013.


Questions

???
