Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

53
Global Media Monitoring h0p://eventregistry.org/ Marko Grobelnik Jozef Stefan Ins4tute Ljubljana, Slovenia Contribu4ons from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic ESWC Summer School, Sep 2 nd 2014, Kalamaki

description

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Transcript of Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Page 1: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Global  Media  Monitoring  h0p://eventregistry.org/

Marko  Grobelnik  Jozef  Stefan  Ins4tute  Ljubljana,  Slovenia  

 Contribu4ons  from  Gregor  Leban,  Blaz  Fortuna,  Janez  Brank,  Jan  Rupnik,  Andrej  Muhic  

ESWC  Summer  School,  Sep  2nd  2014,  Kalamaki  

Page 2: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Outline

•  Introduc4on  • Collec4ng  Media  Data  • Document  Enrichment  • Cross-­‐linguality  • News  Repor4ng  Bias  •  Event  Representa4on  •  Event  Tracking  •  Future  Projects    

Page 3: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Introduc=on

Page 4: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

What  ques=ons  we’ll  try  to  answer?

• Where  to  get  global  media  data?  • What  is  extractable  from  media  documents?  • How  to  connect  informa4on  across  languages?  • What  is  an  event?  • How  to  approach  diversity  in  news  repor4ng?  • How  to  visualize  global  event  dynamics?  

Page 5: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Systems/Demos  used  within  the  presenta=on

• NewsFeed  (hWp://newsfeed.ijs.si/)  •  News  and  social  media  crawler  

•  Enrycher  (hWp://enrycher.ijs.si/)  •  Language  and  Seman4c  annota4on  

• XLing  (hWp://xling.ijs.si/  •  Cross-­‐lingual  document  linking  and  categoriza4on  

• DiversiNews  (hWp://aidemo.ijs.si/diversinews/)  •  News  Diversity  Explorer  

•  Event  Registry  (hWp://eventregistry.org/)  •  Event  detec4on  and  topic  tracking  

Page 6: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

The  overall  goal

•  The  goal  is  to  establish  a  real-­‐4me  system  •  …to  collect  data  from  global  media  in  real-­‐4me  •  …to  iden4fy  events  and  track  evolving  topics  •  …to  assign  stable  iden4fiers  to  events  •  …to  iden4fy  events  across  languages  •  …to  detect  diversity  of  repor4ng  along  several  dimensions  •  …to  provide  rich  exploratory  visualiza4ons  •  …to  provide  interoperable  data  export  

Page 7: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Main  stream  news  

Blogs  

Cross-­‐lingual  ar4cle  matching  

Ar4cle  seman4c  annota4on  

Event  forma4on  

Cross-­‐lingual  cluster  matching   Event  registry  

API  Interface  

Event  info.  extrac4on  

Input  data  

Pre-­‐processing  steps  

Event  construc4on  

Event  storage  &  maintenance  

Extrac4on  of  date  references  

Ar4cle  clustering  

Iden4fying  related  events  

Detec4on  of  ar4cle  duplicates  

Global  Media  Monitoring  pipeline

GUI/Visualiza4ons  

hWp://EventRegistry.org  

Page 8: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec=ng  Media  Data hWp://newsfeed.ijs.si/  

Page 9: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Where  to  get  references  to  news  publishers? • Good  start  is  Wikipedia  list  of  newspapers:  •  hWp://en.wikipedia.org/wiki/Lists_of_newspapers  

Page 10: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

From  a  newspaper  home-­‐page  to  an  ar=cle hWp://www.ny4mes.com/   HTML  

RSS  Feed  (list  of  ar4cles)  

Ar4cle  to  be  retreived  

Page 11: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec=ng  global  media  data

• Data  collec4on  service  News-­‐Feed  •  hWp://newsfeed.ijs.si/  •  …crawling  global  main-­‐stream  and  social  media  

• Monitoring    •  ~60k  main-­‐stream  publishers  (RSS  feeds+special  feeds)  •  ~250k  most  influen4al  blogs  (RSS  feeds)  •  free  TwiWer  feed  

• Data  volume:  ~350k  ar4cles  &  blogs  per  day  (+5M  tweets)  •  Languages:  eng  (50%),  ger  (10%),  spa  (8%),  fra  (5%)  

Page 12: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Downloading  the  news  stream  (1/2)

•  The  stream  is  accessible  at  hWp://newsfeed.ijs.si/stream/  •  To  download  the  whole  stream  con4nuously,  you  can  use  the  python  script  (hWp://newsfeed.ijs.si/hWp2fs.py)    •  The  script  does  the  following:    

Page 13: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Downloading  the  news  stream  (2/2) •  News  Stream  Contents  and  Format  

•  The  root  element,  <ar4cle-­‐set>,  contains  zero  or  more  ar4cles  in  the  following  XML  format:    

• …more  details:  •  Trampus,  Mitja  and  Novak,  Blaz:  The  Internals  Of  An  Aggregated  Web  News  Feed.  Proceedings  of  15th  Mul4conference  on  Informa4on  Society  2012  (IS-­‐2012).  [PDF]  

Page 14: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Document  Enrichment hWp://enrycher.ijs.si/  

Page 15: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

What  can  extracted  from  a  document? •  Lexical  level  

•  Tokeniza4on  –  extrac4ng  tokens  from  a  document  (words,  separators,  …)  •  Sentence  spli<ng  –  set  of  sentences  to  be  further  processed  

•  Linguis4c  level  •  Part-­‐of-­‐Speech  –  assigning  word  types  (nouns,  verbs,  adjec4ves,  …)  •  Deep  Parsing  –  construc4ng  parse  trees  from  sentences  •  Triple  extrac4on  –  subject-­‐predicate-­‐object  triple  extrac4on  •  Name  en4ty  extrac4on  –  iden4fying  names  of  people,  places,  organiza4ons  

•  Seman4c  level  •  Co-­‐reference  resolu4on  –  replacing  pronouns  with  corresponding  names;  merging  different  surface  forms  of  names  into  single  en4ty  

•  Seman4c  labeling  –  assigning  seman4c  iden4fiers  to  names  (e.g.  LOD/DBpedia/Freebase)  including  disambigua4on  

•  Topic  classifica4on  –  assigning  topic  categories  to  a  document  (e.g.  DMoz)  •  Summariza4on  –  assigning  importance  to  parts  of  a  document  •  Fact  extrac4on  –  extrac4ng  relevant  facts  from  a  document  

Page 16: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Enrycher  (h0p://enrycher.ijs.si/) Plain  text  

Text  Enrichment  

Diego  Maradona  Seman4cs:  owl:sameAs:  hKp://dbpedia.org/resource/Diego_Maradona  owl:sameAs:  hKp://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA  rdf:type:  hWp://dbpedia.org/class/yago/Argen4naInterna4onalFootballers  rdf:type:  hWp://dbpedia.org/class/yago/Argen4neExpatriatesInItaly  rdf:type:  hWp://dbpedia.org/class/yago/Argen4neFootballManagers  rdf:type:  hWp://dbpedia.org/class/yago/Argen4neFootballers  

Robbie  Keane  Seman4cs:  owl:sameAs:  hKp://dbpedia.org/resource/Robbie_Keane  rdf:type:  hWp://dbpedia.org/class/yago/CoventryCityF.C.Players  rdf:type:  hWp://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly  rdf:type:  hWp://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers  

Extracted  graph  of  triples  from  text  

“Enrycher”  is  available  as    as  a  web-­‐service  genera4ng  Seman4c  Graph,  LOD  links,    En44es,  Keywords,  Categories,  Text  Summariza4on,  Sen4ment  

Page 17: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Enrycher  Architecture

•  Enrycher  is  a  web  service  consis4ng  of  a  set  of  interlinked  modules…  

•  …covering  lexical,  linguis4c  and  seman4c  annota4ons  

•  …expor4ng  data  in  XML  or  RDF  •  To  execute  the  service,  one  should  send  an  HTTP  POST  request,  with  the  raw  text  in  the  body:  •  curl -d “Enrycher was developed at JSI, a research institute in Ljubljana. Ljubljana is the capital of Slovenia.” http://enrycher.ijs.si/run!

Plain  text  

Annotated  document  

Page 18: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Cross-­‐linguality hWp://xling.ijs.si/  

Page 19: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

• Cross-­‐linguality  is  a  set  of  func4ons  on  how  to  transfer  informa4on  across  the  languages  •  …having  this,  we  can  track  informa4on  independent  of  the  language  borders  •  Machine  Transla4on  is  expensive  and  slow,  so  the  goal  is  to  avoid  machine  transla4on  to  gain  speed  and  scale  

•  The  key  building  block  is  the  func4on  for  comparing  and  categoriza4on  of  documents  in  different  languages  •  XLing.ijs.si  is  an  open  web  service  to  bridge  informa4on  across  100  languages  

Cross-­‐linguality  How  to  operate  in  many  languages?

Page 20: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Languages  covered  by  XLing  (top  100  Wikipedia  languages)

Page 21: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

XLing  (XLing.ijs.si)  service  for  comparing  and  categoriza=on  of  documents  across  100  languages

Chinese  Text  

English  Text  

Automa4cally  Extracted  Keywords  

Automa4cally  Extracted  Keywords  

Similarity  Between  Two  Documents  

Selec4on  Of  100  Languages  

Page 22: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

News  Repor=ng  Bias hWp://aidemo.ijs.si/diversinews/  

Page 23: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

News  Repor=ng  Bias  example

Page 24: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Detec=ng  News  Repor=ng  Bias

•  The  task:    •  Given  a  news  story,  are  we  able  to  say  from  which  news  source  it  came?  

• We  compared  CNN  and  Aljazeera  reports  about  the  same  events  from  the  war  in  Iraq  •  …300  aligned  ar4cles  describing  the  same  story  from  both  sources  

•  The  same  topics  are  expressed  in  both  sources  with  the  following  keywords:  •  CNN  with:  

•  Insurgents,  Troops,  Baghdad,  Iran,  Militant,  Police,  Suicide,  Terrorist,  United,  Na4onal,  Hussein,  Alleged,  Israeli,  Syria,  Terrorism…  

•  Aljazeera  with:  •  AWacks,  Claims,  Rebels,  Withdrawing,  Report,  Fighters,  President,  Resistance,  Occupa4on,  Injured,  Army,  Demanded,  Hit,  Muslim,  …  

Page 25: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

DiversiNews  iPad  App  (1/2)

• DiversiNews  iPad  App  is  using  newsfeed.ijs.si  and  enrycher.ijs.si  services  

• …in  its  ini4al  screen  is  shows  list  of  current  hot  topics  and  current  trending  events  

Hot  Topics  

Trending  Events  

Page 26: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

DiversiNews  iPad  App  (2/2) • DiversiNews  “diversity  search”  screen  allows  dynamic  reranking  of  ar4cles  describing  an  event  along  three  dimensions:  •  Geography  –  where  is  a  content  being  published  from  •  Subtopics  –  what  are  subtopics  of  an  event  •  Sen4ment  –  what  are  good  and  what  are  bad  news  

•  For  each  query  it  provides  •  Automa4cally  generated  summary  •  List  of  corresponding  ar4cles  

Geography  

Subtopics  

Sen4ment  

Summary  Ar4cles  

Page 27: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  Representa=on hWp://eventregistry.org/  

Page 28: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

What  is  an  event?  (abstract  descrip=on) • …more  prac4cal  ques4on:  what  defini4on  of  is  computa4onally  feasible?  

•  In  general,  an  event  is  something  which  “s4cks  out”  of  the  average  in  some  kind  of  (high  dimensional)  data  space  •  …could  be  interpreted  as  an  “anomaly”  •  …densifica4on  of  data  points  (e.g.  many  similar  documents)  •  …significant  change  of  distribu4on  (e.g.  a  trend  on  TwiWer)  

•  In  prac4ce,  the  event  could  be:  •  A  cluster  od  documents  /  change  of  a  distribu4on  in  data  

•  Detected  in  an  unsupervised  way  •  A  fit  to  a  pre-­‐built  model  

•  Detected  in  a  supervised  way  

Page 29: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

How  to  represent  an  event?

• Baseline  data  for  a  news  event  is  usually  a  cluster  of  documents  •  …with  some  preprocessing  we  extract  linguis4c  and  seman4c  annota4ons  •  …seman4c  annota4ons  are  linked  to  ontologies  providing  possibility  for  mul4resolu4on  annota4ons  

•  Three  levels  of  event  representa4on:  •  Feature  vector  event  representa4on:  

•  …light  weight  representa4on  that  can  be  easily  represented  as  a  set  of  feature  vectors  augmented  with  external  ontologies  –  suitable  for  scalable  ML  analysis  

•  Structured  event  representa4on:  •  Infobox  representa4on  (slots  filling)  using  open  schema  or  event  taxonomy  

•  Deep  event  representa4on  •  Seman4c  representa4on  linked  to  a  world-­‐model  (e.g.  CycKB  common  sense  knowledge)  –  suitable  for  reasoning  and  diagnos4cs  

Page 30: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Feature  vector  event  representa=on

•  Feature  vectors  easily  extractable  from  news  documents:  •  Topical  dimension  –  what  is  being  talked  about?  (keywords)  •  Social  dimension  –  which  en44es  are  men4oned?  (named  en44es)  •  Temporal  aspect  –  what  is  the  4me  of  an  event?  (temporal  distribu4on)  •  Geographical  aspect  –  where  an  event  is  taking  place?  (loca4on)  •  Publisher  aspect  –  who  is  repor4ng?  (publisher  iden4fiers)  •  Sen4ment/bias  aspect  –  emo4onal  signals  (numeric  es4mates)  

•  Scalable  Machine  Learning  techniques  can  easily  deal  with  such  representa4on  •  …in  “Event  Registry”  system  we  use  this  representa4on  to  describe  events  

Page 31: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Example  of  “feature  vector”  event  representa=on:  Event  Registry  “Chicago”  related  events

Where?  (geography)  

When?  (temporal  distribu4on)  

Who?  (named    en44es)  

What?  (keyword/  topics)  

Query:  “Chicago”  

Page 32: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Structured  event  representa=on

•  Structured  event  representa4on  describes  an  event  by  its  “Event  Type”  and  corresponding  informa4on  slots  to  be  filled  •  Event  Types  should  be  taken  from  “Event  Taxonomy”  • …at  this  stage  of  development  this  level  of  representa4on  s4ll  requires  human  interven4on  to  achieve  high  accuracy  (Precision/Recall)  extrac4on  

•  Example  on  the  right  –  Wikipedia  event  infobox:    •  2011  Tōhoku  earthquake  and  tsunami  

Page 33: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

“Event  Taxonomy”  –  preview  to  the  current  development

Page 34: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Prototype  for  event  Infobox  extrac=on:    XLike  annota=on  service

•  The  goal  is  to  build  a  system  for  economically  viable  extrac4on  of  event  infoboxes  •  …using  crowd-­‐sourcing  •  …aiming  at  high  Precision  &  Recall  for  a  small  cost  

Page 35: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  sequences  &  Hierarchical  events

• Once  having  events  iden4fies  and  represented  we  can  connect  events  into  “event  sequences”  (also  called  story-­‐lines)  •  “Event  sequences”  include  events  which  are  supposedly  related  and  cons4tute  larger  story  • Collec4on  of  interrelated  events  can  be  also  organized  in  hierarchies  (e.g.  World  Cup  event  consists  from  a  series  of  smaller  events)  

Page 36: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

An  example  event:  Microsoa  Windows  9

Page 37: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Similar  events  example:  similar  events  to  Microsoa  Windows  9  event

Page 38: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  sequence  iden=fica=on

Page 39: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Hierarchy  of  events

Page 40: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Example  Microsoa  hierarchy  of  events

Page 41: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Zoom-­‐in  Example    Microsoa    hierarchy  of  events

Page 42: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  Tracking hWp://eventregistry.org/  

Page 43: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Live  Event  tracking  with  h0p://EventRegistry.org/  

Page 44: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  descrip4on  through  en44es  and  Seman4c  keywords  

Page 45: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec4on  of  events    described  through    En4ty  relatedness  

Page 46: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec4on  of  events    described  through    trending  concepts  

Page 47: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec4on  of  events    described  through    three  level  categoriza4on  

Page 48: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Events  iden4fied  across  languages  

Page 49: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec4on  of  events    described  through    Repor4ng  dynamics  

Page 50: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Collec4on  of  events    described  through    a  story-­‐line  of  related  events  

Page 51: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Event  Registry  exports  event  data  through  API  and  RDF/Storyline  ontology •  API  to  search  and  export  event  informa4on  

•  Export  of  all  the  system  data  in  JSON  

•  Event  data  is  exported  in  a  structured  form  •  BBC  Storyline  ontology  •  hWp://www.bbc.co.uk/ontologies/storyline/2013-­‐05-­‐01.html  

•  SPARQL  endpoint:  •  hWp://eventregistry.org/rdf/search  

•  hWp://eventregistry.org/rdf/event/{eventID}  •  hWp://eventregistry.org/rdf/ar4cle/{ar4cleD}  •  hWp://eventregistry.org/rdf/storyline/{storylineID}  •  Example:  hWp://eventregistry.org/rdf/event/1234  

Page 52: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Future  /  Follow-­‐up  projects

Page 53: Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

Some  of  the  follow-­‐up  projects

• Understanding  global  social  dynamics  •  How  global  society  func4ons?    

•  Integra4ng  text-­‐based  media  with  TV  channels  •  …requires  speech  recogni4on,  video  processing,  visual  object  recogni4on,  face  recogni4on,  …  

•  Event  predic4on  /  Event-­‐Consequence  predic4on  •  …requires  understanding  of  causality  in  the  social  dynamics  and  much  more  

• Micro-­‐reading  /  Machine-­‐reading  •  …full  understanding  of  individual  documents  –  the  goal  for  10+  years