Exploiting Wikipedia for Entity Name Disambiguation in Tweets
“Exploiting Wikipedia for Entity Name Disambiguation in Tweets”
Muhammad Atif Qureshi, Colm O'Riordan, Gabriella Pasi
06/16/14 2
Contents
● Introduction
● Related Work
● Methodology
● Evaluation
● Conclusion
Motivation
● Social media users voice their opinions about various entities/brands (e.g., musicians, movies, companies)
● These opinions act as implicit feedback about the entity/brand
● This has recently given birth to a new area within the marketing domain known as “online reputation management”
Problem Statement
● Given a set of tweets collected by issuing a query for an entity (brand) name, the task is to determine which of the tweets are related to the entity and which are not
● Decide whether a tweet is related to Apple Inc.
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
Wikipedia Graph Structure
[Figure: Wikipedia graph with category nodes (C) and article nodes (A). Legend: category edge, article belonging to category, article link.]
Related Work
● Entity Linking: to link an entity to its correct sense
– Ferragina and Scaiella 2010 and Meij et al. 2012 have proposed strategies over tweets
  ● Use the hyperlink structure of Wikipedia and the anchor texts of the links to those Wikipedia pages
  ● Disambiguation is performed by applying a voting function among all senses associated with the detected anchors
– Meij et al. 2012 employ supervised machine learning techniques for further improvement
Methodology
● Chunking Strategy
● Entity Phrases & Categories
● Features Based on Wikipedia Articles' Hyperlinks
● Relatedness Score Based on Wikipedia Category-Article Structure
Chunking Strategy
I prefer Samsung over HTC, Apple, Nokia because it is economical and good
i prefer samsung over htc apple nokia because it is economical and good
Phrase chunks with boundaries
prefer | samsung | htc | apple | nokia | economical
Stopwords removed, longest phrase matched over Wikipedia as article
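The chunking steps on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list and the set of Wikipedia article titles are tiny hypothetical stand-ins chosen to reproduce the chunks shown here, the function name `chunk_tweet` and the `max_len` limit are made up for the sketch, and punctuation is treated as a phrase boundary.

```python
import re

# Hypothetical stand-ins: a real system would use a full stopword list
# and the complete set of Wikipedia article titles.
STOPWORDS = {"i", "over", "because", "it", "is", "and"}
WIKI_TITLES = {"prefer", "samsung", "htc", "apple", "nokia", "economical", "it"}


def chunk_tweet(tweet, titles, stopwords, max_len=3):
    """Lowercase the tweet, split it at punctuation (phrase boundaries),
    and greedily keep the longest n-gram that matches an article title."""
    chunks = []
    for segment in re.split(r"[^\w\s]+", tweet.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            step = 1  # move one token ahead when nothing matches
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + n])
                if n == 1 and phrase in stopwords:
                    continue  # drop single-word stopwords
                if phrase in titles:
                    chunks.append(phrase)
                    step = n
                    break
            i += step
    return chunks


tweet = "I prefer Samsung over HTC, Apple, Nokia because it is economical and good"
print(chunk_tweet(tweet, WIKI_TITLES, STOPWORDS))
# ['prefer', 'samsung', 'htc', 'apple', 'nokia', 'economical']
```

Trying the longest n-gram first means a multi-word title such as "new york city" is preferred over its shorter sub-phrases.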
Entity Phrases & Categories
[Diagram: deriving entity phrases and categories for an entity E1]
● Entity E1 has a Wikipedia article A_E1
● A_E1 has a list of Wikipedia categories CL_E1
● Sub-categories SCL_E1 of CL_E1 are taken up to a depth of 2
● CL_E1 and SCL_E1 form the Entity Categories, or RC
● Wikipedia articles in CL_E1 and SCL_E1, and the mentions inside them, give the List of Entity Phrases of E1, or Articles RC
● Entity phrases have Categories, or WC (i.e., RC ⊂ WC)
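Collecting the entity categories RC (the article's categories plus sub-categories up to depth 2) amounts to a bounded breadth-first traversal of the category graph. A minimal sketch, assuming the graph is available as two plain dictionaries; the function name and the toy category names are hypothetical and not taken from the slides:

```python
from collections import deque


def collect_entity_categories(article_categories, subcategories, article, max_depth=2):
    """Return {category: depth}: the article's own categories at depth 0
    and their sub-categories down to max_depth."""
    depths = {}
    queue = deque((cat, 0) for cat in article_categories.get(article, []))
    while queue:
        cat, depth = queue.popleft()
        if cat in depths:
            continue  # breadth-first order: first visit has the smallest depth
        depths[cat] = depth
        if depth < max_depth:
            for sub in subcategories.get(cat, []):
                queue.append((sub, depth + 1))
    return depths


# Toy category graph (hypothetical names).
article_categories = {"Apple Inc.": ["Technology companies"]}
subcategories = {
    "Technology companies": ["Computer hardware companies"],
    "Computer hardware companies": ["Apple Inc. hardware"],
    "Apple Inc. hardware": ["iPhone"],
}

rc = collect_entity_categories(article_categories, subcategories, "Apple Inc.")
print(rc)
```

In this toy graph the category "iPhone" sits at depth 3 and is therefore left out of RC.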
Features Based on Wikipedia Articles' Hyperlinks
Chunked tweet: doctor | apple | fruit
Entity phrase: apple
Context phrases: doctor, fruit
Entity Phrase Sense | doctor: phd | doctor: band | doctor: medical | fruit: album | fruit: plant | Avg. Max. Sense Score
apple (fruit) | 80 | 45 | 230 | 6 | 532 | 381
apple (film) | 10 | 50 | 0 | 9 | 0 | 29.5
apple (inc.) | 83 | 20 | 10 | 5 | 0 | 44
Feature values are generated using inlinks, outlinks, and inlinks+outlinks
For the entity Apple Inc., the sense apple (inc.) is related to the entity while the others are not
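The "Avg. Max. Sense Score" column appears to be the mean, over the context phrases, of the highest score among each context phrase's senses; this reading reproduces all three values in the table. A small sketch using the table's numbers (the function name and the nested-dictionary layout are assumptions made for illustration):

```python
def avg_max_sense_score(scores_by_context):
    """Take the best-scoring sense of each context phrase, then average."""
    maxima = [max(senses.values()) for senses in scores_by_context.values()]
    return sum(maxima) / len(maxima)


# Scores of each entity phrase sense against the senses of the context phrases.
table = {
    "apple (fruit)": {
        "doctor": {"phd": 80, "band": 45, "medical": 230},
        "fruit": {"album": 6, "plant": 532},
    },
    "apple (film)": {
        "doctor": {"phd": 10, "band": 50, "medical": 0},
        "fruit": {"album": 9, "plant": 0},
    },
    "apple (inc.)": {
        "doctor": {"phd": 83, "band": 20, "medical": 10},
        "fruit": {"album": 5, "plant": 0},
    },
}

for sense, scores in table.items():
    print(sense, avg_max_sense_score(scores))
# apple (fruit) 381.0
# apple (film) 29.5
# apple (inc.) 44.0
```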
Relatedness Score Based on Wikipedia Category-Article Structure
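$$\mathrm{DepthSignificance}(p) = \sum_{cat \in RC \cap p_{cat}} \frac{1}{depth_{cat} + 1}$$
$$\mathrm{CatSignificance}(p) = \frac{|RC \cap p_{cat}|}{|WC \cap p_{cat}|} \times \log(|RC \cap p_{cat}| + 1)$$
$$\mathrm{PhraseSignificance}(p) = \log(wordlen(p) + 1) \times p_{frequency}$$
$$\mathrm{RelatednessScore} = \sum_{p \in MatchedPhrases} \mathrm{DepthSignificance}(p) \times \mathrm{CatSignificance}(p) \times \mathrm{PhraseSignificance}(p)$$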
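The relatedness score can be sketched in a few lines. Several details are assumptions, since the slides do not state them: the logarithm is taken as natural, wordlen(p) is read as the number of words in the phrase, depth is 0 for the entity's own categories, and a phrase with no category in RC contributes nothing. The function name and the input layout are hypothetical.

```python
import math


def relatedness_score(matched_phrases, rc_depth, wc):
    """Sum DepthSignificance x CatSignificance x PhraseSignificance
    over all matched phrases.

    matched_phrases: list of dicts with 'text', 'categories' (set), 'frequency'
    rc_depth: {category: depth} for the entity categories RC
    wc: set of categories WC (RC is a subset of WC)
    """
    total = 0.0
    for p in matched_phrases:
        rc_hits = p["categories"] & set(rc_depth)
        wc_hits = p["categories"] & wc
        if not rc_hits or not wc_hits:
            continue  # assumed: no overlap with RC means no contribution
        depth_sig = sum(1.0 / (rc_depth[c] + 1) for c in rc_hits)
        cat_sig = len(rc_hits) / len(wc_hits) * math.log(len(rc_hits) + 1)
        phrase_sig = math.log(len(p["text"].split()) + 1) * p["frequency"]
        total += depth_sig * cat_sig * phrase_sig
    return total


# Toy input (hypothetical categories A to D).
rc_depth = {"A": 0, "B": 1}
wc = {"A", "B", "C", "D"}
phrases = [{"text": "iphone", "categories": {"A", "B", "C"}, "frequency": 2}]
print(relatedness_score(phrases, rc_depth, wc))
```

For the toy phrase, DepthSignificance is 1 + 1/2, CatSignificance is (2/3) log 3, and PhraseSignificance is 2 log 2, so the score is their product.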
Dataset
● Multilingual tweets of 61 entities (25% Spanish, 75% English)
– Training ~749 tweets for each entity
– Testing ~1481 tweets for each entity
Domain | No. of Entities | Training Non | Training Rev | Training Orig | Training Trans | Testing Non | Testing Rev | Testing Orig | Testing Trans
Music | 20 | 1461 | 14353 | 12518 | 3296 | 1998 | 28137 | 23442 | 6693
University | 10 | 3548 | 3412 | 6569 | 391 | 6760 | 7387 | 13060 | 1087
Banking | 11 | 2021 | 5753 | 5327 | 2447 | 4335 | 11635 | 10918 | 5052
Automotive | 20 | 3767 | 11356 | 12585 | 2538 | 6851 | 23253 | 24690 | 5414
Total | 61 | 10797 | 34874 | 36999 | 8672 | 19944 | 70412 | 72110 | 18246
Measure
● Reliability is the product of precision in both classes (i.e., true positives and true negatives)
● Sensitivity is the product of recall of both classes
$$\mathrm{Reliability} = \frac{TP}{TP + FP} \times \frac{TN}{TN + FN}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP}$$
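Both measures follow directly from the four confusion-matrix counts. A minimal sketch with made-up counts; F(R,S) is taken here to be the harmonic mean of the two, which the slides do not spell out, and the zero-denominator handling is likewise an assumption:

```python
def _ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def reliability(tp, fp, tn, fn):
    """Product of the precision of the positive and the negative class."""
    return _ratio(tp, tp + fp) * _ratio(tn, tn + fn)


def sensitivity(tp, fp, tn, fn):
    """Product of the recall of the positive and the negative class."""
    return _ratio(tp, tp + fn) * _ratio(tn, tn + fp)


def f_measure(r, s):
    """Harmonic mean of reliability and sensitivity (assumed definition)."""
    return _ratio(2 * r * s, r + s)


# Made-up counts for one entity.
tp, fp, tn, fn = 80, 20, 60, 40
r = reliability(tp, fp, tn, fn)   # 0.8 * 0.6
s = sensitivity(tp, fp, tn, fn)   # (80/120) * (60/80)
print(r, s, f_measure(r, s))
```

If scores are averaged per entity before reporting, the F value in a results row need not equal the harmonic mean of that row's reliability and sensitivity.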
Settings
● Classifier: Random Forest
Setting | Features Based on Wikipedia Articles' Hyperlinks | Relatedness Score Based on Wikipedia Category-Article Structure | Domain Level Training | Entity Level Training
hrdomain | x | x | x | -
hrentity | x | x | - | x
rdomain | - | x | x | -
rentity | - | x | - | x
Results
Performance Comparison with Other Systems
Team | Reliability | Sensitivity | F(R,S)
POPSTAR | 0.73 | 0.45 | 0.49
OUR APPROACH | 0.67 | 0.42 | 0.45
SZTE NLP | 0.60 | 0.44 | 0.44
LIA | 0.66 | 0.36 | 0.38
BASELINE | 0.49 | 0.32 | 0.33
UvA UNED | 0.68 | 0.22 | 0.21
Evaluation Results on Test Set by Domain
Domain | Setting | Reliability | Sensitivity | F(R,S)
Automotive | hrdomain | 0.54 | 0.47 | 0.47
Banking | hrentity | 0.75 | 0.58 | 0.49
University | hrdomain | 0.71 | 0.44 | 0.49
Music | rentity | 0.83 | 0.34 | 0.39
Conclusion
● The experimental evaluations establish Wikipedia’s strength as a significant encyclopaedic resource for the task of entity name disambiguation in tweets.
● The relatedness score defined using the Wikipedia category-article structure introduces a powerful semantic notion of linking n-grams in a tweet with the information relevant to an entity.
● As future work, we aim to combine our Wikipedia-based features with text-based techniques to further improve performance.
References
● E. Amigo, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martin, E. Meij, M. de Rijke, and D. Spina. Overview of RepLab 2013: Evaluating on-line reputation monitoring systems. In CLEF 2013 Labs and Workshop Notebook Papers, Springer LNCS, 2013.
● P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.
● E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In WSDM ’12, pages 563–572, New York, NY, USA, 2012. ACM.
● M.-H. Peetz, D. Spina, J. Gonzalo, and M. de Rijke. Towards an active learning system for company name disambiguation in microblog streams. In CLEF (Online Working Notes/Labs/Workshop), 2013.
Questions
???