Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News...
MJ no more: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection
Thomas Steiner ([email protected], @tomayac)
Seth van Hooland ([email protected], @sethvanhooland)
Ed Summers ([email protected], @edsu)
Increasingly, news does not break on the newswire
First Story Detection on Realtime Social Networks
Typically based on Twitter because of its Streaming API [Twitter2012].
These approaches try to detect spikes in time, locality, and text (often restricted to a single domain, e.g., earthquake prediction).
A typical representative of this kind of approach is [Petrović2010].
High recall, low precision
[Twitter2012] https://dev.twitter.com/docs/streaming-apis/streams/public
[Petrović2010] Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 181–189.
Curation based on Wikipedia
Wikipedia page view logs are publicly available [Wikipedia2012] and are updated hourly.
Osborne et al. have shown that there is a relation between Wikipedia page views and news events [Osbourne2012].
Their work improves the approach of [Petrović2010] by using Wikipedia logs.
Key findings:
● Wikipedia lags about 2 h behind the news.
● Newly created pages add noise.
[Wikipedia2012] http://dumps.wikimedia.org/other/pagecounts-raw/
[Osbourne2012] M. Osborne, S. Petrović, R. McCreadie, C. Macdonald, I. Ounis. 2012. Bieber no more: First Story Detection using Twitter and Wikipedia. In SIGIR 2012 Workshop on Time-aware Information Access (#TAIA2012), Portland, Oregon, USA.
Key idea: invert the process
Use Wikipedia's live IRC stream of recent changes [WikipediaIRC2012] as the trigger, then do a sanity check on social networks.
[WikipediaIRC2012] http://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds
Introducing Wikipedia Live Monitor
Hooks into the Wikipedia recent changes IRC channels for all Wikipedia locales.
Channel names follow the pattern #language.project, e.g., #de.wikipedia
When an article gets edited, retrieve all language versions and treat them as a cluster.
E.g., en:Albert_Einstein is in the same cluster as de:Albert_Einstein.
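The channel naming pattern described above can be sketched in a few lines of Python. This is only an illustration of the #language.project convention; the function names `channel_name` and `parse_channel` are hypothetical and not part of Wikipedia Live Monitor.

```python
def channel_name(language, project="wikipedia"):
    """Build an IRC channel name following the #language.project pattern."""
    return f"#{language}.{project}"


def parse_channel(channel):
    """Split a channel name such as '#de.wikipedia' into (language, project)."""
    language, _, project = channel.lstrip("#").partition(".")
    return language, project


# Channels for a few locales (the language list here is just an example).
channels = [channel_name(lang) for lang in ("en", "de", "fr")]
print(channels)
print(parse_channel("#de.wikipedia"))
```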
Breaking News Conditions
1) ≥5 Occurrences: An article cluster must have at least n edits (here n = 5) before it is considered a breaking news candidate.
2) ≤60 Seconds Between Edits: An article cluster may have at most n seconds (here n = 60) between edits in order to be regarded as a breaking news candidate.
3) ≥2 Concurrent Editors: An article cluster must be edited by at least n concurrent editors (here n = 2) before it is considered a breaking news candidate.
4) ≤240 Seconds Since Last Edit: An article cluster is thrown out of the monitoring loop if its last edit is more than n seconds ago (here n = 240).
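The four conditions can be sketched as a small Python class. This is a minimal sketch, not the actual Wikipedia Live Monitor code: the class and constant names are hypothetical, "concurrent editors" is assumed to mean distinct editor names seen in the cluster, and the 60-second rule is assumed to apply to every pair of consecutive edits.

```python
from dataclasses import dataclass, field

# Thresholds taken from the slide; the constant names are illustrative.
MIN_OCCURRENCES = 5
MAX_SECONDS_BETWEEN_EDITS = 60
MIN_CONCURRENT_EDITORS = 2
MAX_SECONDS_SINCE_LAST_EDIT = 240


@dataclass
class ArticleCluster:
    """Edits to one article, merged across all its language versions."""
    timestamps: list = field(default_factory=list)  # edit times in seconds
    editors: set = field(default_factory=set)       # distinct editor names

    def record_edit(self, timestamp, editor):
        self.timestamps.append(timestamp)
        self.editors.add(editor)

    def is_stale(self, now):
        """Condition 4: drop the cluster if its last edit is too long ago."""
        if not self.timestamps:
            return True
        return now - self.timestamps[-1] > MAX_SECONDS_SINCE_LAST_EDIT

    def is_candidate(self):
        """Conditions 1 to 3: enough edits, enough editors, short gaps."""
        if len(self.timestamps) < MIN_OCCURRENCES:
            return False
        if len(self.editors) < MIN_CONCURRENT_EDITORS:
            return False
        # Assumption: every gap between consecutive edits must be short.
        gaps = [b - a for a, b in zip(self.timestamps, self.timestamps[1:])]
        return all(gap <= MAX_SECONDS_BETWEEN_EDITS for gap in gaps)


cluster = ArticleCluster()
for t, who in [(0, "a"), (30, "b"), (60, "a"), (100, "c"), (150, "b")]:
    cluster.record_edit(t, who)
print(cluster.is_candidate())  # five edits, three editors, gaps <= 60 s
print(cluster.is_stale(400))   # last edit was 250 s ago
```

Keeping the staleness check separate from the candidate check mirrors the slide: condition 4 controls when a cluster leaves the monitoring loop, while conditions 1 to 3 decide whether it is reported.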
Evaluation: Does it work at all?
● Koninginnedag (http://twitpic.com/cn1vgf/full)
● Champions League Semi Final BVB vs. RMD with Lewandowski (http://twitpic.com/clo0s0)
● Boston Bombings (https://twitter.com/jason_koebler/statuses/323892465545388033, http://www.usnews.com/news/articles/2013/04/15/is-wikipedia-better-for-breaking-news-than-twitter)
Evaluation: How well does it work?
Lag time for global events: <5 min
Resignation of Pope Benedict XVI (http://en.wikipedia.org/wiki/Resignation_of_Pope_Benedict_XVI)
First three edit times (UTC) after the news broke on Feb 11, 2013:
● English Wikipedia article: 10:58, 10:59, 11:02
● French Wikipedia article: 11:00, 11:00, 11:01
This implies that by looking at only two language versions of the Pope article (the actual number of monitored versions is 42), the system would have reported the news at 11:01, the time of the fifth edit across both versions.
The Twitter account of Reuters announced the news at 10:59.
Vatican Radio's announcement was made at 10:57:47.
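The 11:01 figure can be checked with a short calculation, sketched here in Python. It merges the listed English and French edit times and picks the fifth edit, assuming the five-edit threshold from the breaking news conditions. The edit times are only known to the minute, so the computed lag relative to Vatican Radio is a lower bound (the true value is less than one minute larger).

```python
from datetime import datetime

DAY = "2013-02-11"
MIN_OCCURRENCES = 5  # threshold from the breaking news conditions

# Edit times (UTC) as listed on the slide, minute resolution only.
edits = {
    "en": ["10:58", "10:59", "11:02"],
    "fr": ["11:00", "11:00", "11:01"],
}


def parse(hhmm):
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Merge both language versions into one cluster, ordered by time.
merged = sorted(parse(t) for times in edits.values() for t in times)
fifth_edit = merged[MIN_OCCURRENCES - 1]
print(fifth_edit.strftime("%H:%M"))  # 11:01

vatican_radio = datetime.strptime(f"{DAY} 10:57:47", "%Y-%m-%d %H:%M:%S")
lag = fifth_edit - vatican_radio
print(lag.total_seconds())  # 193.0 seconds, i.e. under 5 minutes
```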
Future Work
● Work with realtime page view logs in addition to page edit logs (API format currently being defined by Wikimedia)
● News categorization and classification, e.g., Category Living-Persons removed from a person implies (sad) news
● Improve the false-positive rate and strengthen the connection between social networks and actual article edits
● Auto notification system upon breaking news candidates. Pre-announcement: follow @WikiLiveMon
Demo and thank you
Play with the system at http://wikipedia-irc.herokuapp.com/
Read the paper at http://arxiv.org/abs/1303.4702
Ask questions here or via [email protected] and @tomayac