Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
-
Upload
eswcsummerschool -
Category
Documents
-
view
356 -
download
3
description
Transcript of Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Global Media Monitoring h0p://eventregistry.org/
Marko Grobelnik Jozef Stefan Ins4tute Ljubljana, Slovenia
Contribu4ons from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic
ESWC Summer School, Sep 2nd 2014, Kalamaki
Outline
• Introduc4on • Collec4ng Media Data • Document Enrichment • Cross-‐linguality • News Repor4ng Bias • Event Representa4on • Event Tracking • Future Projects
Introduc=on
What ques=ons we’ll try to answer?
• Where to get global media data? • What is extractable from media documents? • How to connect informa4on across languages? • What is an event? • How to approach diversity in news repor4ng? • How to visualize global event dynamics?
Systems/Demos used within the presenta=on
• NewsFeed (hWp://newsfeed.ijs.si/) • News and social media crawler
• Enrycher (hWp://enrycher.ijs.si/) • Language and Seman4c annota4on
• XLing (hWp://xling.ijs.si/ • Cross-‐lingual document linking and categoriza4on
• DiversiNews (hWp://aidemo.ijs.si/diversinews/) • News Diversity Explorer
• Event Registry (hWp://eventregistry.org/) • Event detec4on and topic tracking
The overall goal
• The goal is to establish a real-‐4me system • …to collect data from global media in real-‐4me • …to iden4fy events and track evolving topics • …to assign stable iden4fiers to events • …to iden4fy events across languages • …to detect diversity of repor4ng along several dimensions • …to provide rich exploratory visualiza4ons • …to provide interoperable data export
Main stream news
Blogs
Cross-‐lingual ar4cle matching
Ar4cle seman4c annota4on
Event forma4on
Cross-‐lingual cluster matching Event registry
API Interface
Event info. extrac4on
Input data
Pre-‐processing steps
Event construc4on
Event storage & maintenance
Extrac4on of date references
Ar4cle clustering
Iden4fying related events
Detec4on of ar4cle duplicates
Global Media Monitoring pipeline
GUI/Visualiza4ons
hWp://EventRegistry.org
Collec=ng Media Data hWp://newsfeed.ijs.si/
Where to get references to news publishers? • Good start is Wikipedia list of newspapers: • hWp://en.wikipedia.org/wiki/Lists_of_newspapers
From a newspaper home-‐page to an ar=cle hWp://www.ny4mes.com/ HTML
RSS Feed (list of ar4cles)
Ar4cle to be retreived
Collec=ng global media data
• Data collec4on service News-‐Feed • hWp://newsfeed.ijs.si/ • …crawling global main-‐stream and social media
• Monitoring • ~60k main-‐stream publishers (RSS feeds+special feeds) • ~250k most influen4al blogs (RSS feeds) • free TwiWer feed
• Data volume: ~350k ar4cles & blogs per day (+5M tweets) • Languages: eng (50%), ger (10%), spa (8%), fra (5%)
Downloading the news stream (1/2)
• The stream is accessible at hWp://newsfeed.ijs.si/stream/ • To download the whole stream con4nuously, you can use the python script (hWp://newsfeed.ijs.si/hWp2fs.py) • The script does the following:
Downloading the news stream (2/2) • News Stream Contents and Format
• The root element, <ar4cle-‐set>, contains zero or more ar4cles in the following XML format:
• …more details: • Trampus, Mitja and Novak, Blaz: The Internals Of An Aggregated Web News Feed. Proceedings of 15th Mul4conference on Informa4on Society 2012 (IS-‐2012). [PDF]
Document Enrichment hWp://enrycher.ijs.si/
What can extracted from a document? • Lexical level
• Tokeniza4on – extrac4ng tokens from a document (words, separators, …) • Sentence spli<ng – set of sentences to be further processed
• Linguis4c level • Part-‐of-‐Speech – assigning word types (nouns, verbs, adjec4ves, …) • Deep Parsing – construc4ng parse trees from sentences • Triple extrac4on – subject-‐predicate-‐object triple extrac4on • Name en4ty extrac4on – iden4fying names of people, places, organiza4ons
• Seman4c level • Co-‐reference resolu4on – replacing pronouns with corresponding names; merging different surface forms of names into single en4ty
• Seman4c labeling – assigning seman4c iden4fiers to names (e.g. LOD/DBpedia/Freebase) including disambigua4on
• Topic classifica4on – assigning topic categories to a document (e.g. DMoz) • Summariza4on – assigning importance to parts of a document • Fact extrac4on – extrac4ng relevant facts from a document
Enrycher (h0p://enrycher.ijs.si/) Plain text
Text Enrichment
Diego Maradona Seman4cs: owl:sameAs: hKp://dbpedia.org/resource/Diego_Maradona owl:sameAs: hKp://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA rdf:type: hWp://dbpedia.org/class/yago/Argen4naInterna4onalFootballers rdf:type: hWp://dbpedia.org/class/yago/Argen4neExpatriatesInItaly rdf:type: hWp://dbpedia.org/class/yago/Argen4neFootballManagers rdf:type: hWp://dbpedia.org/class/yago/Argen4neFootballers
Robbie Keane Seman4cs: owl:sameAs: hKp://dbpedia.org/resource/Robbie_Keane rdf:type: hWp://dbpedia.org/class/yago/CoventryCityF.C.Players rdf:type: hWp://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly rdf:type: hWp://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers
Extracted graph of triples from text
“Enrycher” is available as as a web-‐service genera4ng Seman4c Graph, LOD links, En44es, Keywords, Categories, Text Summariza4on, Sen4ment
Enrycher Architecture
• Enrycher is a web service consis4ng of a set of interlinked modules…
• …covering lexical, linguis4c and seman4c annota4ons
• …expor4ng data in XML or RDF • To execute the service, one should send an HTTP POST request, with the raw text in the body: • curl -d “Enrycher was developed at JSI, a research institute in Ljubljana. Ljubljana is the capital of Slovenia.” http://enrycher.ijs.si/run!
Plain text
Annotated document
Cross-‐linguality hWp://xling.ijs.si/
• Cross-‐linguality is a set of func4ons on how to transfer informa4on across the languages • …having this, we can track informa4on independent of the language borders • Machine Transla4on is expensive and slow, so the goal is to avoid machine transla4on to gain speed and scale
• The key building block is the func4on for comparing and categoriza4on of documents in different languages • XLing.ijs.si is an open web service to bridge informa4on across 100 languages
Cross-‐linguality How to operate in many languages?
Languages covered by XLing (top 100 Wikipedia languages)
XLing (XLing.ijs.si) service for comparing and categoriza=on of documents across 100 languages
Chinese Text
English Text
Automa4cally Extracted Keywords
Automa4cally Extracted Keywords
Similarity Between Two Documents
Selec4on Of 100 Languages
News Repor=ng Bias hWp://aidemo.ijs.si/diversinews/
News Repor=ng Bias example
Detec=ng News Repor=ng Bias
• The task: • Given a news story, are we able to say from which news source it came?
• We compared CNN and Aljazeera reports about the same events from the war in Iraq • …300 aligned ar4cles describing the same story from both sources
• The same topics are expressed in both sources with the following keywords: • CNN with:
• Insurgents, Troops, Baghdad, Iran, Militant, Police, Suicide, Terrorist, United, Na4onal, Hussein, Alleged, Israeli, Syria, Terrorism…
• Aljazeera with: • AWacks, Claims, Rebels, Withdrawing, Report, Fighters, President, Resistance, Occupa4on, Injured, Army, Demanded, Hit, Muslim, …
DiversiNews iPad App (1/2)
• DiversiNews iPad App is using newsfeed.ijs.si and enrycher.ijs.si services
• …in its ini4al screen is shows list of current hot topics and current trending events
Hot Topics
Trending Events
DiversiNews iPad App (2/2) • DiversiNews “diversity search” screen allows dynamic reranking of ar4cles describing an event along three dimensions: • Geography – where is a content being published from • Subtopics – what are subtopics of an event • Sen4ment – what are good and what are bad news
• For each query it provides • Automa4cally generated summary • List of corresponding ar4cles
Geography
Subtopics
Sen4ment
Summary Ar4cles
Event Representa=on hWp://eventregistry.org/
What is an event? (abstract descrip=on) • …more prac4cal ques4on: what defini4on of is computa4onally feasible?
• In general, an event is something which “s4cks out” of the average in some kind of (high dimensional) data space • …could be interpreted as an “anomaly” • …densifica4on of data points (e.g. many similar documents) • …significant change of distribu4on (e.g. a trend on TwiWer)
• In prac4ce, the event could be: • A cluster od documents / change of a distribu4on in data
• Detected in an unsupervised way • A fit to a pre-‐built model
• Detected in a supervised way
How to represent an event?
• Baseline data for a news event is usually a cluster of documents • …with some preprocessing we extract linguis4c and seman4c annota4ons • …seman4c annota4ons are linked to ontologies providing possibility for mul4resolu4on annota4ons
• Three levels of event representa4on: • Feature vector event representa4on:
• …light weight representa4on that can be easily represented as a set of feature vectors augmented with external ontologies – suitable for scalable ML analysis
• Structured event representa4on: • Infobox representa4on (slots filling) using open schema or event taxonomy
• Deep event representa4on • Seman4c representa4on linked to a world-‐model (e.g. CycKB common sense knowledge) – suitable for reasoning and diagnos4cs
Feature vector event representa=on
• Feature vectors easily extractable from news documents: • Topical dimension – what is being talked about? (keywords) • Social dimension – which en44es are men4oned? (named en44es) • Temporal aspect – what is the 4me of an event? (temporal distribu4on) • Geographical aspect – where an event is taking place? (loca4on) • Publisher aspect – who is repor4ng? (publisher iden4fiers) • Sen4ment/bias aspect – emo4onal signals (numeric es4mates)
• Scalable Machine Learning techniques can easily deal with such representa4on • …in “Event Registry” system we use this representa4on to describe events
Example of “feature vector” event representa=on: Event Registry “Chicago” related events
Where? (geography)
When? (temporal distribu4on)
Who? (named en44es)
What? (keyword/ topics)
Query: “Chicago”
Structured event representa=on
• Structured event representa4on describes an event by its “Event Type” and corresponding informa4on slots to be filled • Event Types should be taken from “Event Taxonomy” • …at this stage of development this level of representa4on s4ll requires human interven4on to achieve high accuracy (Precision/Recall) extrac4on
• Example on the right – Wikipedia event infobox: • 2011 Tōhoku earthquake and tsunami
“Event Taxonomy” – preview to the current development
Prototype for event Infobox extrac=on: XLike annota=on service
• The goal is to build a system for economically viable extrac4on of event infoboxes • …using crowd-‐sourcing • …aiming at high Precision & Recall for a small cost
Event sequences & Hierarchical events
• Once having events iden4fies and represented we can connect events into “event sequences” (also called story-‐lines) • “Event sequences” include events which are supposedly related and cons4tute larger story • Collec4on of interrelated events can be also organized in hierarchies (e.g. World Cup event consists from a series of smaller events)
An example event: Microsoa Windows 9
Similar events example: similar events to Microsoa Windows 9 event
Event sequence iden=fica=on
Hierarchy of events
Example Microsoa hierarchy of events
Zoom-‐in Example Microsoa hierarchy of events
Event Tracking hWp://eventregistry.org/
Live Event tracking with h0p://EventRegistry.org/
Event descrip4on through en44es and Seman4c keywords
Collec4on of events described through En4ty relatedness
Collec4on of events described through trending concepts
Collec4on of events described through three level categoriza4on
Events iden4fied across languages
Collec4on of events described through Repor4ng dynamics
Collec4on of events described through a story-‐line of related events
Event Registry exports event data through API and RDF/Storyline ontology • API to search and export event informa4on
• Export of all the system data in JSON
• Event data is exported in a structured form • BBC Storyline ontology • hWp://www.bbc.co.uk/ontologies/storyline/2013-‐05-‐01.html
• SPARQL endpoint: • hWp://eventregistry.org/rdf/search
• hWp://eventregistry.org/rdf/event/{eventID} • hWp://eventregistry.org/rdf/ar4cle/{ar4cleD} • hWp://eventregistry.org/rdf/storyline/{storylineID} • Example: hWp://eventregistry.org/rdf/event/1234
Future / Follow-‐up projects
Some of the follow-‐up projects
• Understanding global social dynamics • How global society func4ons?
• Integra4ng text-‐based media with TV channels • …requires speech recogni4on, video processing, visual object recogni4on, face recogni4on, …
• Event predic4on / Event-‐Consequence predic4on • …requires understanding of causality in the social dynamics and much more
• Micro-‐reading / Machine-‐reading • …full understanding of individual documents – the goal for 10+ years