
Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010)

Padmini Srinivasan

Computer Science Department
Department of Management Sciences

http://cs.uiowa.edu/[email protected]

Twitter

• 2 to 3 new users/second!!
• 175 million users as of today
• 95 million tweets written each day
• 300 employees and hiring!
• “what are you doing” “what’s happening”
• Daily chatter, URL sharing, news
  – Distributed/local reporting; no filters; no editing
  – Fastest, lowest level information service

• Not a social network but an information network

Aggregators & others

• Plenty:
  – Tweetmeme
    • Separates the kinds of media linked to (video, image, ...)
  – Twitter search
  – Twistori
    • Aggregates emotions
  – Tweetsentiments.com
  – Retail Twitter Aggregation
  – TweetTabs
  – Where, what, when

Finding ‘Emergent’ Topics

• What is it?
  – A topic popular in current time and not in the past
• 5 steps:
  – Represent tweets as vectors of terms (language independent)
  – Graph active authors’ social relationships (PageRank)
  – Model each term’s life cycle: “novel aging theory”
  – Rank terms based on their “energy”, select top few
  – Create a navigable topic graph linking emerging terms with co-occurring ones: Emerging Topics.
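The term life-cycle and ranking steps can be sketched in a few lines. This is a simplified illustration loosely modeled on the nutrition/energy idea, not the paper's exact formulas: the function names, the augmented term-frequency weighting, and the 1/age discount over past time slots are all assumptions made for this example.

```python
from collections import Counter

def term_weights(tokens):
    """Weight each term in one tweet as 0.5 + 0.5 * tf / max_tf (assumed scheme)."""
    counts = Counter(tokens)
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}

def nutrition(tweets, authority):
    """Sum term weights over one time slot, scaled by each author's authority.

    tweets: list of (author, tokens) pairs; authority: dict author -> score.
    """
    totals = Counter()
    for author, tokens in tweets:
        if not tokens:
            continue  # skip empty tweets
        for term, weight in term_weights(tokens).items():
            totals[term] += weight * authority.get(author, 0.0)
    return totals

def energy(history, term):
    """Compare current nutrition with past slots; older slots count less.

    history: list of nutrition dicts, oldest first, current slot last.
    """
    current = history[-1].get(term, 0.0)
    total = 0.0
    for age, past in enumerate(reversed(history[:-1]), start=1):
        total += (current ** 2 - past.get(term, 0.0) ** 2) / age
    return total

def top_emerging(history, k=3):
    """Return the k terms of the current slot with the highest energy."""
    scores = {term: energy(history, term) for term in history[-1]}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy data: "quake" appears only in the latest slot, "today" is fading.
authority = {"a": 1.0, "b": 0.5}
history = [
    nutrition([("a", ["rain", "today"]), ("b", ["rain"])], authority),
    nutrition([("a", ["quake", "quake", "today"]), ("b", ["quake"])], authority),
]
```

With this toy data, `top_emerging(history, 1)` picks "quake": it has high nutrition now and none before, while "today" has negative energy because its nutrition dropped.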

Cool points

• Nice overview with enough details about text representation, processing etc.
• Hypothesis: flow of ideas from geographical origin of event to outside
  – So find the starting tweets and you find the locale
  – Always so? Global event? Disasters?
• See PageRank in action
  – Author network in particular
    • SCC
  – Term networks
• Biological metaphor
  – Lots of different terms: nutrition, calories, energy…
• Reasonable case study
• Language – humour
  – heartquake
  – Not dumping factor
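To make "PageRank in action" on an author network concrete, here is a minimal power-iteration sketch. The edge direction (an author points to the authors they follow, so authority flows to the followed), the damping value 0.85, and the author names are assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links: dict mapping each author to the list of authors they follow.
    Returns a dict author -> rank; ranks sum to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical follower graph: alice and bob follow carol, carol follows alice.
ranks = pagerank({"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]})
```

Here carol ends up with the highest rank because two authors point to her, and bob, whom nobody follows, keeps only the random-jump share.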

Some trends

Last Class Continuation

Crawler Evaluation

• What are good pages?
• Web scale is daunting
• User based crawls are short, but web agents?
• Page importance assessed
  – Presence of query keywords
  – Similarity of page to query/description
  – Similarity to seed pages (held out sample)
  – Use a classifier – not the same as used in crawler
  – Link-based popularity (but within topic?)
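One simple way to score "similarity of page to query/description" is cosine similarity over word counts. The sketch below assumes whitespace tokenization and raw term counts; a real evaluation would more likely use TF-IDF weights and proper text cleaning.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using lowercase word counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical topic description and two crawled page snippets.
query = "focused web crawler evaluation"
on_topic = cosine_similarity(query, "evaluation of a focused web crawler")
off_topic = cosine_similarity(query, "easy pasta recipes")
```

The same function can compare a crawled page against held-out seed pages instead of the query text.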

Summarizing Performance

• Precision
  – Relevance is Boolean: yes/no
    • Harvest rate: # of good pages / total # of pages
  – Relevance is continuous
    • Average relevance over crawled set
• Recall
  – Target recall: held out seed pages (H)
    • |H ∩ pages crawled| / |H|
• Robustness
  – Start same crawler on disjoint seed sets. Examine overlap of fetched pages
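The summary measures are short enough to write out directly. In this sketch the function names and the toy URLs are made up, and target recall is taken as the fraction of held-out pages H that the crawl reached, which is the usual recall definition.

```python
def harvest_rate(crawled, is_relevant):
    """Boolean relevance: fraction of crawled pages judged good."""
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if is_relevant(page)) / len(crawled)

def average_relevance(scores):
    """Continuous relevance: mean relevance score over the crawled set."""
    return sum(scores) / len(scores) if scores else 0.0

def target_recall(crawled, held_out):
    """Fraction of held-out target pages H that the crawler fetched."""
    targets = set(held_out)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)

# Toy crawl: four pages fetched, two judged relevant, one of four targets found.
crawled = ["a", "b", "c", "d"]
relevant = {"a", "c"}
rate = harvest_rate(crawled, lambda page: page in relevant)
recall = target_recall(crawled, ["c", "x", "y", "z"])
```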

Sample Performance Graph

Summary

• Crawler architecture
• Crawler algorithms
• Crawler evaluation
• Assignment 1
  – Run two crawlers for 5000 pages.
  – Start with the same set of seed pages for a topic.
  – Look at overlap and report this over time (robustness)
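For the overlap report, one possible approach is to compare the two crawls at regular checkpoints. The slides do not say how overlap should be measured; this sketch assumes Jaccard overlap (shared pages divided by all pages seen by either crawler) and a checkpoint size chosen by the caller.

```python
def overlap_over_time(crawl_a, crawl_b, step=1000):
    """Jaccard overlap of two crawls after every `step` fetched pages.

    crawl_a, crawl_b: lists of URLs in the order each crawler fetched them.
    Returns a list of (pages_fetched, overlap) pairs.
    """
    points = []
    limit = min(len(crawl_a), len(crawl_b))
    for n in range(step, limit + 1, step):
        seen_a, seen_b = set(crawl_a[:n]), set(crawl_b[:n])
        points.append((n, len(seen_a & seen_b) / len(seen_a | seen_b)))
    return points

# Tiny made-up crawls; for the assignment, step=1000 over 5000 pages gives 5 points.
points = overlap_over_time(["u1", "u2", "u3", "u4"],
                           ["u1", "u5", "u2", "u3"], step=2)
```

Plotting the resulting pairs gives overlap as a function of pages fetched.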