A Framework for Collecting, Extracting and Managing Event Identity Information from Twitter

Debanjan Mahata, John R. Talburt
Department of Information Science, University of Arkansas at Little Rock

Vivek Kumar Singh
Department of Computer Science, South Asian University, New Delhi, India


Social Media

A daily average of 58 million tweets are posted on Twitter. Source: http://goo.gl/Oz5sIZ

An average of 60 million photos are shared on Instagram daily. Source: http://instagram.com/press

Facebook stores 300 petabytes of data related to its users from all over the world. Source: http://goo.gl/XxEfeX

72% of all internet users are now active on social media. Source: http://goo.gl/qAuIoe

46% of adult Internet users post original photos or videos online that they themselves have created. Source: http://goo.gl/iQ06Ix


Real-life Events

EIIM in MDM

Zhou, Yinle, and John Talburt. "Entity identity information management (EIIM)." International Conference on Information Quality (ICIQ-11), Adelaide, Australia. 2011.

Problem Definition

Challenges

Volume and Velocity

Veracity

“New post: Sochi Was For Suckers - Laugh Studios http://t.co/cWQJCBp3Ow #lol #funny #rofl #funnypic #fail #wtf”

Informal Text

Variety

Searching the Long Tail

Sampling Bias

Sparse Link Structure Between Content in Social Media

Lack of Evaluation Datasets

EIIM Life Cycle in Twitter

Mahata, Debanjan, and John Talburt. "A Framework for Collecting and Managing Entity Identity Information from Social Media." 19th International Conference on Information Quality, Xi'an, China.

Identity Integrity

Assigns a unique identifier to each real-life event being tracked by the framework and maintains the same identifier for newly collected event references

Identity Integrity Requires

• Each real-world event in the domain has one and only one representation in the information system.

• Distinct real-world events have distinct representations in the information system.

Allocates an individual EIIS to each real-life event being tracked by the framework
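
The two identity integrity requirements can be illustrated with a minimal sketch. The class name `EventRegistry`, its methods, and the use of a lower-cased event name as the matching key are all assumptions made for illustration; they are not taken from the framework, and real event resolution would need far more than string normalization.

```python
import uuid


class EventRegistry:
    """Hypothetical registry: one identifier per tracked real-life event."""

    def __init__(self):
        self._ids = {}         # normalized event key -> identifier
        self._references = {}  # identifier -> collected event references

    def register(self, event_key):
        # Assumption: events are matched by a normalized name.
        key = event_key.strip().lower()
        if key not in self._ids:
            identifier = uuid.uuid4().hex
            self._ids[key] = identifier
            self._references[identifier] = []
        return self._ids[key]

    def add_reference(self, event_key, tweet_text):
        # Newly collected references keep the identifier already assigned.
        identifier = self.register(event_key)
        self._references[identifier].append(tweet_text)
        return identifier

    def references(self, identifier):
        return list(self._references[identifier])
```

Registering the same event twice returns the same identifier, while distinct events receive distinct identifiers.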

More than 8 million tweets collected for experiments

Event Reference Preparation

• Parts-of-Speech Tagging
• Special Character Detection
• Data Cleansing
• Duplicate Detection
• Stop Word Detection and Elimination
• Slang Word Extraction
• Feeling Word Extraction
• Tokenization
• Stemming
• Tweet Meta-Data
    • Expanded URLs
    • User Information
    • Verification
    • Favorite Count
    • Retweet Count
    • User Mentions
• Entity Extraction
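
A few of the preparation steps above (tokenization, stop word elimination, duplicate detection, and pulling URLs and user mentions out of the text) can be sketched with the Python standard library. The function names, the tiny stop word list, and the regular expressions are illustrative assumptions; the framework's actual tools are not described on the slide.

```python
import re

# Illustrative stop word list only; a real system would use a full lexicon.
STOP_WORDS = {"a", "an", "the", "is", "in", "for", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def prepare_reference(text):
    """Split a tweet into content tokens and simple meta-data fields."""
    urls = URL_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    # Remove the extracted items before tokenizing the remaining text.
    stripped = URL_RE.sub(" ", text)
    stripped = MENTION_RE.sub(" ", stripped)
    stripped = HASHTAG_RE.sub(" ", stripped)
    tokens = TOKEN_RE.findall(stripped.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    return {"tokens": content, "urls": urls,
            "mentions": mentions, "hashtags": hashtags}


def drop_duplicates(texts):
    """Keep the first copy of each tweet, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Part-of-speech tagging, stemming, and entity extraction are left out here because they normally rely on an external NLP library.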

Event Related Content Analysis

Event Identity Information Processing

EventIdentityInfoGraph, processed using EventIdentityInfoRank


NDCG Curves for Millions March NYC

NDCG Values for Millions March NYC

Precision Values for Millions March NYC

Potential Applications

• Event Monitoring and Analysis
• Event Information Retrieval
• Opinion and Review Mining
• Recommender Systems
• Event Management and Marketing
• Social Media Data Integration
• Many More

Future Directions

• Summarizing Event Content
• Identification of Insightful Opinionated Content
• Event Topic Modeling
• Event-specific Recommendations
• Distributed Processing of TwitterEventInfoGraph
• Ontology for Event Content in Social Media
• Many More

Additional Slides

Tweet Features

No. of Unigram Tokens, No. of Stop Words, No. of Slang Words, No. of Feeling Words, No. of Hashtags, Has URL, Is Verified, No. of User Mentions, Length of Post, No. of Unique Characters, No. of Special Characters, Favorite Count, Retweet Count, Formality, No. of Nouns, No. of Adjectives, No. of Verbs, No. of Adverbs.
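
As a rough illustration, several of the surface features listed above can be computed directly from the tweet text and its meta-data. The function name `tweet_features`, the dictionary keys, and the whitespace tokenization are assumptions for this sketch; lexicon-based features (slang, feeling words), formality, and part-of-speech counts are omitted because they need external resources.

```python
def tweet_features(text, is_verified=False, favorite_count=0, retweet_count=0):
    """Compute a subset of the listed tweet features (hypothetical helper)."""
    tokens = text.split()  # assumption: whitespace tokenization
    return {
        "n_unigram_tokens": len(tokens),
        "n_hashtags": sum(1 for t in tokens if t.startswith("#")),
        "has_url": any(t.startswith(("http://", "https://")) for t in tokens),
        "is_verified": is_verified,
        "n_user_mentions": sum(1 for t in tokens if t.startswith("@")),
        "length_of_post": len(text),
        "n_unique_characters": len(set(text)),
        # Special characters: anything that is neither alphanumeric nor space.
        "n_special_characters": sum(
            1 for c in text if not c.isalnum() and not c.isspace()
        ),
        "favorite_count": favorite_count,
        "retweet_count": retweet_count,
    }
```

The resulting dictionary can be turned into a numeric vector and passed to a classifier such as logistic regression.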

Logistic Regression Model Performance

                      Precision   Recall   F-1 Score
Non-informative (0)   0.70        0.49     0.57
Informative (1)       0.78        0.90     0.84
Avg/Total             0.76        0.77     0.75

Accuracy = 76.64%
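
For reference, the F-1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from a precision and recall pair; the function name `f1_score` is chosen here for illustration. Because the reported precision and recall are rounded to two decimals, a recomputed value may differ from the reported F-1 score in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Informative class: precision 0.78, recall 0.90.
informative_f1 = f1_score(0.78, 0.90)
```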

Olteanu, Alexandra, et al. "CrisisLex: A lexicon for collecting and filtering microblogged communications in crises." In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM '14). No. EPFL-CONF-203561. 2014.

Event Information Quality

28000 annotated tweets

26 Events

Related and Informative – “#Media Large wildfire in N. Colorado prompts Evacuation : Crews are battling a fast-Moving wildfire http://t.co/ju1BGTKH #Politics #News”

Related but not Informative – “RT @LarimerSheriff: #HighParkFire update http://t.co/hBy5shen”

Not Related – “#Intern #US #TATTOO #Wisconsin #Ohio #NC #PA #Florida #Colorado #Iowa #Nevada #Virginia #NV #mlb Travel Destinations; http://t.co/TIHBJKF2”

Event Related Content Analysis

EventIdentityInfoRank

NDCG Values for Millions March NYC

NDCG Curves for Millions March NYC

Precision Values for Millions March NYC

NDCG Values for Sydney Siege Crisis

NDCG Curves for Sydney Siege Crisis

Precision Values for Sydney Siege Crisis

Baselines

• SeenRank (http://seen.co/about)
• TextRank (Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing order into texts." Association for Computational Linguistics, 2004.)
• LexRank (Erkan, Günes, and Dragomir R. Radev. "LexRank: graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (2004): 457-479.)
• RTRank
• Centroid (Becker, Hila, Mor Naaman, and Luis Gravano. "Selecting Quality Twitter Content for Events." ICWSM 11 (2011).)
• Logistic Regression

Evaluation Metrics

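$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$

$$\mathrm{Precision\ at\ } n = \frac{\text{Number of relevant references at } n}{n}$$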
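
The metrics can be sketched in a few lines of Python. This is a minimal illustration that assumes the exponential-gain form of DCG with a base-2 logarithm, and assumes the ideal ranking (IDCG) is obtained by sorting the same list of relevance grades in descending order; the function names are chosen here and are not from the original work.

```python
import math


def dcg(relevances, p):
    """DCG at rank p: sum of (2^rel_i - 1) / log2(i + 1) for i = 1..p."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:p], start=1)
    )


def ndcg(relevances, p):
    """nDCG at rank p: DCG divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), p)
    return dcg(relevances, p) / ideal if ideal > 0 else 0.0


def precision_at_n(relevant_flags, n):
    """Fraction of the top n ranked references that are relevant."""
    return sum(1 for flag in relevant_flags[:n] if flag) / n
```

Here `relevances` is the list of graded relevance judgments in ranked order, and `relevant_flags` is the corresponding list of binary judgments.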

Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information retrieval. Vol. 463. New York: ACM press, 1999.

Järvelin, Kalervo, and Jaana Kekäläinen. "Cumulated gain-based evaluation of IR techniques." ACM Transactions on Information Systems (TOIS) 20.4 (2002): 422-446.