MJ no more: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection Thomas Steiner ([email protected], @tomayac) Seth van Hooland ([email protected], @sethvanhooland) Ed Summers ([email protected], @edsu)


Page 1: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

MJ no more: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection
Thomas Steiner ([email protected], @tomayac)
Seth van Hooland ([email protected], @sethvanhooland)
Ed Summers ([email protected], @edsu)

Page 2: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

More and more, news no longer breaks on the newswire

Page 3: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

First Story Detection on Realtime Social Networks

Typically based on Twitter because of its Streaming API [Twitter2012].

Try to detect spikes in time, locality, text (oftentimes restricted domain, e.g., earthquake prediction).

A typical representative of this kind of approach is [Petrović2010].

High recall, low precision

[Twitter2012] https://dev.twitter.com/docs/streaming-apis/streams/public

[Petrović2010] Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 181–189.

Page 4: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Curation based on Wikipedia

Wikipedia page view logs are publicly available [Wikipedia2012]. Updated on an hourly basis.

Osborne et al. have shown that there is a relation between Wikipedia page views and news events [Osborne2012].

Their work improves on the approach of [Petrović2010] by using Wikipedia logs.

Key findings:
● Wikipedia lags about 2h behind the news.
● Newly created pages add noise.

[Wikipedia2012] http://dumps.wikimedia.org/other/pagecounts-raw/

[Osborne2012] M. Osborne, S. Petrović, R. McCreadie, C. Macdonald, and I. Ounis. 2012. Bieber no more: First Story Detection using Twitter and Wikipedia. In SIGIR 2012 Workshop on Time-aware Information Access (#TAIA2012), Portland, Oregon, USA.

Page 5: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Key idea: invert the process

Use Wikipedia's live IRC stream of recent changes [WikipediaIRC2012] first, then do a sanity check on social networks.

[WikipediaIRC2012] http://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds

Page 6: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Introducing Wikipedia Live Monitor

Hooks into the Wikipedia recent changes IRC channels for all Wikipedia locales.

Channel names follow the pattern #language.project, e.g., #de.wikipedia

When an article gets edited, retrieve all language versions and treat them as a cluster.

E.g., en:Albert_Einstein is in the same cluster as de:Albert_Einstein.
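A minimal Python sketch of the two conventions on this slide, not the actual Wikipedia Live Monitor code. `channel_name` builds a `#language.project` channel name, and `cluster_key` maps an article and its language versions to one shared key. The function names and the hard-coded `LANGUAGE_VERSIONS` table are hypothetical; a real system would retrieve the language versions rather than hard-code them.

```python
def channel_name(language: str, project: str = "wikipedia") -> str:
    """Build a recent-changes channel name of the form #language.project."""
    return f"#{language}.{project}"


def cluster_key(article: str, language_versions: list[str]) -> str:
    """Return one key shared by an article and all its language versions.

    Assumption: the lexicographically smallest "lang:Title" string is used
    only because it is a convenient, order-independent cluster name.
    """
    return min({article, *language_versions})


# Hypothetical, hard-coded lookup table for illustration only.
LANGUAGE_VERSIONS = {
    "en:Albert_Einstein": ["de:Albert_Einstein"],
    "de:Albert_Einstein": ["en:Albert_Einstein"],
}

# Group incoming edits by cluster, regardless of the language edited.
clusters = {}
for edited in ["en:Albert_Einstein", "de:Albert_Einstein"]:
    key = cluster_key(edited, LANGUAGE_VERSIONS[edited])
    clusters.setdefault(key, []).append(edited)

print(channel_name("de"))  # prints: #de.wikipedia
print(clusters)
```

Both edits end up in the same list, so an edit to either language version counts toward the same cluster.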

Page 7: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

1) ≥5 Occurrences: An article cluster must have at least 5 edits before it is considered a breaking news candidate.

2) ≤60 Seconds Between Edits: An article cluster may have at most 60 seconds between edits to be regarded as a breaking news candidate.

3) ≥2 Concurrent Editors: An article cluster must be edited by at least 2 concurrent editors before it is considered a breaking news candidate.

4) ≤240 Seconds Since Last Edit: An article cluster is dropped from the monitoring loop if its last edit was more than 240 seconds ago.

Breaking News Conditions
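The four conditions can be sketched as a small Python check. This is an illustration under stated assumptions, not the authors' implementation: the `ArticleCluster` structure and function names are hypothetical, "concurrent editors" is approximated as distinct editor names seen in the cluster, and the 60-second rule is assumed to apply to the gaps between the five most recent edits.

```python
from dataclasses import dataclass, field

# Thresholds taken from the four conditions on this slide.
MIN_OCCURRENCES = 5
MAX_SECONDS_BETWEEN_EDITS = 60
MIN_CONCURRENT_EDITORS = 2
MAX_SECONDS_SINCE_LAST_EDIT = 240


@dataclass
class ArticleCluster:
    """Edits observed for one article cluster (hypothetical structure)."""
    edit_times: list[float] = field(default_factory=list)  # seconds, arrival order
    editors: set[str] = field(default_factory=set)

    def add_edit(self, timestamp: float, editor: str) -> None:
        self.edit_times.append(timestamp)
        self.editors.add(editor)


def is_stale(cluster: ArticleCluster, now: float) -> bool:
    """Condition 4: the last edit was more than 240 seconds ago."""
    return bool(cluster.edit_times) and (
        now - cluster.edit_times[-1] > MAX_SECONDS_SINCE_LAST_EDIT
    )


def is_breaking_news_candidate(cluster: ArticleCluster, now: float) -> bool:
    """Check conditions 1 to 3 on a cluster that is still being monitored."""
    if is_stale(cluster, now):
        return False
    if len(cluster.edit_times) < MIN_OCCURRENCES:  # condition 1
        return False
    if len(cluster.editors) < MIN_CONCURRENT_EDITORS:  # condition 3
        return False
    # Condition 2, assumed to apply to the most recent MIN_OCCURRENCES edits.
    recent = cluster.edit_times[-MIN_OCCURRENCES:]
    return all(
        later - earlier <= MAX_SECONDS_BETWEEN_EDITS
        for earlier, later in zip(recent, recent[1:])
    )
```

A monitoring loop would call `add_edit` for each incoming edit, test `is_breaking_news_candidate`, and periodically discard clusters for which `is_stale` is true.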

Page 8: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Koninginnedag (http://twitpic.com/cn1vgf/full)

Evaluation—Does it work at all?

Page 9: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Champions League Semi Final BVB vs. RMD with Lewandowski (http://twitpic.com/clo0s0)

Evaluation—Does it work at all?

Page 11: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Lag time for global events: <5 min

Resignation of Pope Benedict XVI (http://en.wikipedia.org/wiki/Resignation_of_Pope_Benedict_XVI)

First three edit times (UTC) after the news broke on Feb 11, 2013:
● English Wikipedia article: 10:58, 10:59, 11:02
● French Wikipedia article: 11:00, 11:00, 11:01

This implies that by looking at only two language versions of the Pope article (the actual number of monitored versions is 42), the system would have reported the news at 11:01, the time of the fifth edit across both versions.

Twitter account of Reuters announced the news at 10:59

Vatican Radio’s announcement was made at 10:57:47

Evaluation—How well does it work?
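The 11:01 figure can be reproduced with a short Python calculation. This is a sketch under assumptions: the slide lists edit times to the minute only, so seconds are taken as :00; the helper names `at` and `first_detection` are hypothetical; and the editor-count condition is assumed to be met, because the slide does not list editors.

```python
from datetime import datetime

DAY = "2013-02-11"


def at(clock: str) -> datetime:
    """Parse an HH:MM or HH:MM:SS clock time on Feb 11, 2013."""
    fmt = "%Y-%m-%d %H:%M:%S" if clock.count(":") == 2 else "%Y-%m-%d %H:%M"
    return datetime.strptime(f"{DAY} {clock}", fmt)


# First three edit times per language version, as listed on the slide.
english = [at("10:58"), at("10:59"), at("11:02")]
french = [at("11:00"), at("11:00"), at("11:01")]
edits = sorted(english + french)


def first_detection(edits, min_occurrences=5, max_gap_seconds=60):
    """Return the time of the first edit that completes a qualifying run."""
    for end in range(min_occurrences, len(edits) + 1):
        window = edits[end - min_occurrences:end]
        gaps = [(b - a).total_seconds() for a, b in zip(window, window[1:])]
        if all(gap <= max_gap_seconds for gap in gaps):
            return window[-1]
    return None


detected = first_detection(edits)
lag = detected - at("10:57:47")  # relative to Vatican Radio's announcement
print(detected.time(), lag)  # prints: 11:01:00 0:03:13
```

Under these assumptions the lag behind Vatican Radio's announcement is 3 minutes 13 seconds, consistent with the stated lag of under 5 minutes.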

Page 12: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Work with realtime page view logs in addition to page edit logs (API format currently being defined by Wikimedia)

News categorization and classification, e.g., Category Living-Persons removed from a person article implies (sad) news

Reduce the false-positive rate; strengthen the connection between social networks and actual article edits

Automatic notification system for breaking news candidates. Pre-announcement: follow @WikiLiveMon

Future Work
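The category-based classification idea could look roughly like the following Python sketch. It is purely illustrative of a future-work item: the function names are hypothetical, the category sets are made-up inputs, and the marker string `Living-Persons` is copied from the slide rather than from any real category listing.

```python
def removed_categories(before: set[str], after: set[str]) -> set[str]:
    """Categories present before an edit but missing after it."""
    return before - after


def suggests_sad_news(before: set[str], after: set[str],
                      marker: str = "Living-Persons") -> bool:
    """True if the marker category was removed by the edit.

    Assumption: removal of the marker category from a person article is
    treated as a hint only, not as proof of anything.
    """
    return marker in removed_categories(before, after)


# Made-up category sets for one hypothetical edit.
before_edit = {"Living-Persons", "Physicists"}
after_edit = {"Physicists"}
print(suggests_sad_news(before_edit, after_edit))  # prints: True
```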

Page 13: Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection

Play with the system at http://wikipedia-irc.herokuapp.com/

Read the paper at http://arxiv.org/abs/1303.4702

Ask questions here or [email protected] & @tomayac

Demo and thank you