Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010)
Padmini Srinivasan
Computer Science Department
Department of Management Sciences
http://cs.uiowa.edu/[email protected]
• 2 to 3 new users/second!!
• 175 million users as of today
• 95 million tweets written each day
• 300 employees and hiring!
• “what are you doing” “what’s happening”
• Daily chatter, URL sharing, news
  – Distributed/local reporting; no filters; no editing
  – Fastest, lowest level information service
• Not a social network but an information network
Aggregators & others
• Plenty:
  – Tweetmeme
    • Separates the kinds of media linked to (video, image, ...)
  – Twitter search
  – Twistori
    • Aggregates emotions
  – Tweetsentiments.com
  – Retail Twitter Aggregation
  – TweetTabs
  – Where, what, when
Finding ‘Emergent’ Topics
• What is it?
  – A topic popular in the current time and not in the past
• 5 steps:
  – Represent tweets as vectors of terms (language independent)
  – Graph active authors’ social relationships (PageRank)
  – Model each term’s life cycle: “novel aging theory”
  – Rank terms based on their “energy”, select top few
  – Create a navigable topic graph linking emerging terms with co-occurring ones: Emerging Topics.
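The term-ranking steps above can be sketched roughly as follows. This is a simplified stand-in, not the paper's formulation: it assumes whitespace tokenization, takes a term's "nutrition" in an interval to be the summed authority of the authors who used it, and takes "energy" to be current nutrition minus the mean nutrition over past intervals. The function names, the sample tweets, and the authority values are all hypothetical.

```python
from collections import Counter

def nutrition(tweets, authority):
    """Sum author authority over the tweets in which each term appears.

    tweets: list of (author, text) pairs for one time interval.
    authority: dict author -> score (e.g. from PageRank); assumed given.
    """
    scores = Counter()
    for author, text in tweets:
        # Each term counts once per tweet (binary term vector).
        for term in set(text.lower().split()):
            scores[term] += authority.get(author, 0.0)
    return scores

def energy(current, history):
    """Simplified energy: current nutrition minus mean past nutrition."""
    result = {}
    for term, now in current.items():
        if history:
            past = sum(h.get(term, 0.0) for h in history) / len(history)
        else:
            past = 0.0
        result[term] = now - past
    return result

def top_emerging(current, history, k=3):
    """Return the k terms with the highest energy (ties broken by name)."""
    scores = energy(current, history)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical data: one past interval and the current one.
authority = {"a": 0.5, "b": 0.3, "c": 0.2}
past = [nutrition([("a", "coffee time"), ("b", "coffee again")], authority)]
now = nutrition(
    [("a", "quake downtown"), ("b", "quake felt"), ("c", "coffee")],
    authority,
)
emerging = top_emerging(now, past, k=2)
```

Here "coffee" is frequent in both intervals and so gets negative energy, while "quake" appears only in the current interval and ranks first.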
Cool points
• Nice overview with enough details about text representation, processing etc.
• Hypothesis: flow of ideas from geographical origin of event to outside
  – So find the starting tweets and you find the locale
  – Always so? Global event? Disasters?
• See PageRank in action
  – Author network in particular
    • SCC (strongly connected components)
  – Term networks
• Biological metaphor
  – Lots of different terms: nutrition, calories, energy…
• Reasonable case study
• Language – humour
  – heartquake
  – Not dumping factor
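To make "PageRank in action" on an author network concrete, here is a minimal power-iteration sketch. It assumes edges point from a follower to the author being followed, uses the commonly quoted damping factor of 0.85, and spreads the rank of authors with no outgoing edges evenly over all nodes. The function name and the three-author graph are hypothetical, and this is not necessarily how the paper computes author authority.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is shared by everyone.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / count + damping * dangling / count
        new = {n: base for n in nodes}
        for n in nodes:
            for dst in out[n]:
                new[dst] += damping * rank[n] / len(out[n])
        rank = new
    return rank

# Hypothetical follower graph: bob and carol follow alice, alice follows bob.
ranks = pagerank([("bob", "alice"), ("carol", "alice"), ("alice", "bob")])
most_authoritative = max(ranks, key=ranks.get)
```

In this toy graph alice collects rank from two followers and ends up with the highest score; carol, whom nobody follows, keeps only the baseline share.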
Some trends
Last Class Continuation
Crawler Evaluation
• What are good pages?
• Web scale is daunting
• User based crawls are short, but web agents?
• Page importance assessed
  – Presence of query keywords
  – Similarity of page to query/description
  – Similarity to seed pages (held out sample)
  – Use a classifier – not the same as used in crawler
  – Link-based popularity (but within topic?)
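Two of the page-importance measures listed above, keyword presence and similarity of a page to the query, could be sketched as below. The sketch assumes raw term counts over whitespace tokens and cosine similarity; the slides do not say which weighting or similarity measure is meant, and the function names are hypothetical.

```python
import math
from collections import Counter

def has_query_keywords(page_text, query):
    """True if every query keyword occurs in the page (Boolean relevance)."""
    page_terms = set(page_text.lower().split())
    return all(term in page_terms for term in query.lower().split())

def cosine_similarity(page_text, query):
    """Cosine similarity between raw term-count vectors (continuous relevance)."""
    page = Counter(page_text.lower().split())
    q = Counter(query.lower().split())
    dot = sum(page[t] * q[t] for t in page)
    norm = math.sqrt(sum(c * c for c in page.values())) * math.sqrt(
        sum(c * c for c in q.values())
    )
    return dot / norm if norm else 0.0
```

The same `cosine_similarity` function can score a page against a topic description or against the text of held-out seed pages instead of the query.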
Summarizing Performance
• Precision
  – Relevance is Boolean: yes/no
    • Harvest rate: # of good pages / total # of pages crawled
  – Relevance is continuous
    • Average relevance over crawled set
• Recall
  – Target recall: held out seed pages (H)
    • |H ∩ pages crawled| / |H|
• Robustness
  – Start same crawler on disjoint seed sets. Examine overlap of fetched pages.
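The summary measures above are simple ratios. The sketch below assumes a crawl is a list of URLs, that relevance is supplied by a caller-provided function, and that target recall is the fraction of the held-out targets H that the crawler actually reached. All names are hypothetical.

```python
def harvest_rate(crawled, is_relevant):
    """Boolean relevance: share of crawled pages judged good."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if is_relevant(url)) / len(crawled)

def average_relevance(crawled, relevance):
    """Continuous relevance: mean relevance score over the crawled set."""
    if not crawled:
        return 0.0
    return sum(relevance(url) for url in crawled) / len(crawled)

def target_recall(crawled, held_out):
    """Share of held-out target pages H that appear among the crawled pages."""
    if not held_out:
        return 0.0
    return len(set(crawled) & set(held_out)) / len(held_out)
```

For example, a crawl of four pages that contains one of two held-out targets has a target recall of 0.5, whatever the crawl's length.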
Sample Performance Graph
Summary
• Crawler architecture
• Crawler algorithms
• Crawler evaluation
• Assignment 1
  – Run two crawlers for 5000 pages.
  – Start with the same set of seed pages for a topic.
  – Look at overlap and report this over time (robustness)
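One possible way to report overlap over time for the assignment is sketched below. It assumes each crawler logs its fetched URLs in order, and it normalizes the number of shared URLs among the first t fetches by t; the assignment does not prescribe a normalization, so that choice, the function name, and the `step` parameter are assumptions.

```python
def overlap_over_time(crawl_a, crawl_b, step=1000):
    """Return (t, overlap) points, where overlap is the share of the first
    t pages fetched by crawler A that crawler B has also fetched by t."""
    points = []
    limit = min(len(crawl_a), len(crawl_b))
    for t in range(step, limit + 1, step):
        common = set(crawl_a[:t]) & set(crawl_b[:t])
        points.append((t, len(common) / t))
    return points
```

With two 5000-page crawls and `step=1000` this yields five points that can be plotted as overlap against pages fetched.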