A Framework for Collecting, Extracting and Managing Event Identity Information from Twitter
Debanjan Mahata, John R. Talburt
Department of Information Science, University of Arkansas at Little Rock
Vivek Kumar [email protected]
Department of Computer Science, South Asian University, New Delhi, India
Social Media
On average, 58 million tweets are posted on Twitter daily. Source: http://goo.gl/Oz5sIZ
An average of 60 million photos are shared on Instagram daily. Source: http://instagram.com/press
Facebook stores 300 petabytes of data related to its users from all over the world. Source: http://goo.gl/XxEfeX
72% of all internet users are now active on social media. Source: http://goo.gl/qAuIoe
46% of adult Internet users post original photos or videos online that they themselves have created. Source: http://goo.gl/iQ06Ix
EIIM in MDM
Zhou, Yinle, and John Talburt. "Entity identity information management (EIIM)." International Conference on Information Quality (ICIQ-11), Adelaide, Australia. 2011.
Challenges
• Volume and Velocity
• Veracity
• Informal Text
• Variety
• Searching the Long Tail
• Sampling Bias
• Sparse Link Structure Between Content in Social Media
• Lack of Evaluation Datasets
Example tweet: "New post: Sochi Was For Suckers - Laugh Studios/ http://t.co/cWQJCBp3Ow #lol #funny #rofl #funnypic #fail #wtf"
EIIM Life Cycle in Twitter
Mahata, Debanjan, and John Talburt. "A Framework for Collecting and Managing Entity Identity Information from Social Media." 19th International Conference on Information Quality, Xi'an, China.
Identity Integrity
The framework assigns a unique identifier to each real-life event it tracks and maintains the same identifier for newly collected event references.
Identity integrity requires:
• Each real-world event in the domain has one and only one representation in the information system.
• Distinct real-world events have distinct representations in the information system.
The framework allocates an individual EIIS to each real-life event being tracked.
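The two identity integrity requirements can be illustrated with a small registry. This is a minimal sketch and not the framework's implementation: the class name `EventIdentityRegistry`, the `EVT-0001` identifier format, and the use of a plain string key (such as a hashtag) to decide which event a reference belongs to are all assumptions made for illustration. The slides do not describe how references are resolved to events.

```python
import itertools


class EventIdentityRegistry:
    """Hypothetical registry: one identifier and one structure per tracked event."""

    def __init__(self):
        self._ids = {}          # event key -> unique identifier
        self._references = {}   # identifier -> collected event references
        self._counter = itertools.count(1)

    def register_event(self, event_key):
        # Requirement 1: an event already known keeps its single representation.
        if event_key not in self._ids:
            # Requirement 2: a new event gets a fresh, distinct identifier.
            event_id = f"EVT-{next(self._counter):04d}"
            self._ids[event_key] = event_id
            self._references[event_id] = []
        return self._ids[event_key]

    def add_reference(self, event_key, reference):
        # Newly collected references are filed under the existing identifier.
        event_id = self.register_event(event_key)
        self._references[event_id].append(reference)
        return event_id

    def references(self, event_id):
        return list(self._references[event_id])


registry = EventIdentityRegistry()
first = registry.add_reference("#HighParkFire", "tweet collected on day 1")
second = registry.add_reference("#HighParkFire", "tweet collected on day 2")
other = registry.add_reference("#Sochi2014", "tweet about a different event")
```

Here `first` and `second` are the same identifier because both references belong to the same event, while `other` is a different identifier.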
Event Reference Preparation
• Parts-of-Speech Tagging
• Special Character Detection
• Data Cleansing
• Duplicate Detection
• Stop Word Detection and Elimination
• Slang Word Extraction
• Feeling Word Extraction
• Tokenization
• Stemming
• Tweet Meta-Data: Expanded URLs, User Information, Verification, Favorite Count, Retweet Count, User Mentions
• Entity Extraction
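A few of the preparation steps listed above (tokenization, stop word elimination, special character detection, duplicate detection) can be sketched with the Python standard library. This is an illustrative sketch only: the function names, the regular expressions, and the tiny stop word list are assumptions, and the slides do not say which tools the framework uses. Parts-of-speech tagging, stemming, and entity extraction would need an NLP library and are left out.

```python
import re

# Tiny illustrative stop word list (an assumption, not the framework's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "for"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")


def prepare_reference(text):
    """Split one tweet into URLs, hashtags, mentions, words and special characters."""
    urls = URL_RE.findall(text)
    body = URL_RE.sub(" ", text)  # remove URLs before tokenizing
    tokens = [t.lower() for t in TOKEN_RE.findall(body)]
    return {
        "urls": urls,
        "hashtags": [t for t in tokens if t.startswith("#")],
        "mentions": [t for t in tokens if t.startswith("@")],
        # Stop word elimination applies to plain words only.
        "words": [t for t in tokens if t[0] not in "#@" and t not in STOP_WORDS],
        # Anything that is not a letter, digit, space, '#' or '@'.
        "special_characters": [
            ch for ch in body
            if not ch.isalnum() and not ch.isspace() and ch not in "#@"
        ],
    }


def drop_duplicates(texts):
    """Keep the first copy of each tweet, ignoring case and extra whitespace."""
    seen, unique = set(), []
    for text in texts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


prepared = prepare_reference("RT @LarimerSheriff: #HighParkFire update http://t.co/hBy5shen")
```

Exact-match duplicate detection after normalization is the simplest possible choice; near-duplicate detection (for example on retweets with added comments) would need a similarity measure.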
Potential Applications
• Event Monitoring and Analysis
• Event Information Retrieval
• Opinion and Review Mining
• Recommender Systems
• Event Management and Marketing
• Social Media Data Integration
• Many More
Future Directions
• Summarizing Event Content
• Identification of Insightful Opinionated Content
• Event Topic Modeling
• Event-specific Recommendations
• Distributed Processing of TwitterEventInfoGraph
• Ontology for Event Content in Social Media
• Many More
Tweet Features
No. of Unigram Tokens, No. of Stop Words, No. of Slang Words, No. of Feeling Words, No. of Hashtags, Has URL, Is Verified, No. of User Mentions, Length of Post, No. of Unique Characters, No. of Special Characters, Favorite Count, Retweet Count, Formality, No. of Nouns, No. of Adjectives, No. of Verbs, No. of Adverbs.
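Several of these features can be computed directly from the tweet text and its metadata. The sketch below is a hedged illustration: the function name `tweet_features`, the dictionary keys, and the three small word lists are hypothetical, and the lexicons actually used by the authors are not given in the slides. Formality and the part-of-speech counts (nouns, adjectives, verbs, adverbs) require a tagger and are omitted.

```python
import re

# Illustrative lexicons only (assumptions); real lexicons would be far larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "of", "to", "and", "for"}
SLANG_WORDS = {"lol", "rofl", "wtf"}
FEELING_WORDS = {"happy", "sad", "angry"}

URL_PATTERN = r"https?://\S+"


def tweet_features(text, is_verified=False, favorite_count=0, retweet_count=0):
    """Return a subset of the listed tweet features as a dictionary."""
    has_url = bool(re.search(URL_PATTERN, text))
    body = re.sub(URL_PATTERN, " ", text)  # URLs are not counted as tokens
    # Hashtag and mention words are also counted as unigram tokens here.
    tokens = [t.lower() for t in re.findall(r"\w+", body)]
    return {
        "n_unigram_tokens": len(tokens),
        "n_stop_words": sum(t in STOP_WORDS for t in tokens),
        "n_slang_words": sum(t in SLANG_WORDS for t in tokens),
        "n_feeling_words": sum(t in FEELING_WORDS for t in tokens),
        "n_hashtags": len(re.findall(r"#\w+", body)),
        "has_url": int(has_url),
        "is_verified": int(is_verified),          # from the user metadata
        "n_user_mentions": len(re.findall(r"@\w+", body)),
        "length_of_post": len(text),
        "n_unique_characters": len(set(text)),
        "n_special_characters": sum(
            1 for ch in text if not ch.isalnum() and not ch.isspace()
        ),
        "favorite_count": favorite_count,         # from the tweet metadata
        "retweet_count": retweet_count,           # from the tweet metadata
    }


example_text = "Sochi was fun #lol http://t.co/x"
features = tweet_features(example_text, retweet_count=3)
```

The resulting dictionary can be turned into a fixed-order numeric vector before it is passed to a classifier.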
Logistic Regression Model Performance
Class                 Precision   Recall   F-1 Score
Non-informative (0)   0.70        0.49     0.57
Informative (1)       0.78        0.90     0.84
Avg/Total             0.76        0.77     0.75
Accuracy = 76.64%
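As a quick consistency check on the reported scores, F-1 is the harmonic mean of precision and recall. The sketch below recomputes it from the per-class precision and recall values given above; the helper name `f1_score` is chosen here for illustration. Because the inputs are already rounded to two decimals, the recomputed values can only be expected to match the reported F-1 scores to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-1) as given for each class.
reported = {
    "Non-informative (0)": (0.70, 0.49, 0.57),
    "Informative (1)": (0.78, 0.90, 0.84),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: recomputed F-1 = {f1_score(p, r):.3f}, reported = {f1:.2f}")
```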
Olteanu, Alexandra, et al. "CrisisLex: A lexicon for collecting and filtering microblogged communications in crises." In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM'14). No. EPFL-CONF-203561. 2014.
Event Information Quality
28,000 annotated tweets
26 Events
Related and Informative: “#Media Large wildfire in N. Colorado prompts Evacuation: Crews are battling a fast-Moving wildfire http://t.co/ju1BGTKH #Politics #News”
Related but not Informative: “RT @LarimerSheriff: #HighParkFire update http://t.co/hBy5shen”
Not Related: “#Intern #US #TATTOO #Wisconsin #Ohio #NC #PA #Florida #Colorado #Iowa #Nevada #Virginia #NV #mlb Travel Destinations; http://t.co/TIHBJKF2”
Baselines
• SeenRank (http://seen.co/about)
• TextRank (Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing order into texts." Association for Computational Linguistics, 2004.)
• LexRank (Erkan, Günes, and Dragomir R. Radev. "LexRank: graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (2004): 457-479.)
• RTRank
• Centroid (Becker, Hila, Mor Naaman, and Luis Gravano. "Selecting Quality Twitter Content for Events." ICWSM 11 (2011).)
• Logistic Regression
Evaluation Metrics
DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
nDCG_p = \frac{DCG_p}{IDCG_p}
\text{Precision at } n = \frac{\text{Number of relevant references at } n}{n}
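The metrics can be sketched in a few lines of Python. The sketch uses the common gain form (2^rel - 1) / log2(i + 1) for discounted cumulative gain, normalizes by the ideal ordering to get nDCG, and computes precision at n as the share of relevant references in the top n. Function names are chosen for illustration, and two simplifying assumptions are made: the ideal ranking is built from the relevance labels of the ranked list itself, and any relevance grade above zero counts as relevant for precision.

```python
import math


def dcg(relevances, p):
    """Discounted cumulative gain over the top p ranked items."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:p], start=1)
    )


def ndcg(relevances, p):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    # Assumption: the ideal ordering is derived from the same label list.
    ideal = dcg(sorted(relevances, reverse=True), p)
    return dcg(relevances, p) / ideal if ideal > 0 else 0.0


def precision_at_n(relevances, n):
    """Fraction of the top n items that are relevant (grade > 0)."""
    return sum(1 for rel in relevances[:n] if rel > 0) / n


# Relevance grades of a ranked list, best-ranked item first.
ranked = [1, 0, 1]
scores = (dcg(ranked, 3), ndcg(ranked, 3), precision_at_n(ranked, 3))
```

A ranking that already places items in descending order of relevance scores an nDCG of 1, which makes the metric comparable across events with different numbers of relevant tweets.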
Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information retrieval. Vol. 463. New York: ACM press, 1999.
Järvelin, Kalervo, and Jaana Kekäläinen. "Cumulated gain-based evaluation of IR techniques." ACM Transactions on Information Systems (TOIS) 20.4 (2002): 422-446.