Entities, Time and Events in BiographyNet and NewsReader
Entities, Time and Events in BiographyNet & NewsReader
Antske Fokkens, VU University
Monday, November 11, 13
Acknowledgement (people)
The work in this presentation was carried out by or with:
Agata Cybulska, Marieke van Erp and Piek Vossen
Niels Ockeloen, Serge ter Braake, Willem Robert van Hage, Jesper Hoeksema, Sara Tonelli, Rachele Sprugnoli, Luciano Serafini, Aitor Soroa, German Rigau and others
Overview
mini introduction to BiographyNet
mini introduction to NewsReader
representing entities and events
BiographyNet
An interdisciplinary project involving history, computer science and computational linguistics
Goal: inspire new historical research by identifying relations between people and events in biographical dictionaries
NLP in BiographyNet
The Biography Portal of the Netherlands
125,000 biographies from 23 sources describing 76,000 people
Text and metadata
Role of NLP:
Identify information in text
Study differences in style and focus
BiographyNet use cases
Analysis of groups of individuals (e.g. who were the governors-general of the Dutch Indies?)
More complex questions, e.g. the relation between influential people in the Dutch colonies and the current Dutch elite
Perspectives: how are people and events judged in different sources?
BiographyNet data
Biographical text in Dutch
Heterogeneous corpus: 23 sources, texts from the 17th century to the present
Metadata about basic facts:
high quality (few errors)
completeness varies
BiographyNet text mining
First step: fill gaps in the metadata
Basic supervised machine learning system
Next steps:
Create timelines for individuals
Identify relations between people
Identify events and relations between them
BiographyNet methodology
The output of NLP tools is used by other researchers
They should have insight into the performance of the tools and the approaches that are used
Provenance information plays a vital role
NewsReader
Automatically process massive streams of daily news from thousands of sources in 4 different languages
Project Partners:
VU University Amsterdam, LexisNexis, Synerscope (the Netherlands)
University of the Basque Country (Spain)
ScraperWiki (UK)
Fondazione Bruno Kessler (Italy)
NewsReader
What happened, where, when and who was involved?
Which temporal and causal relations hold between events, and what does that tell us about the people involved?
Place the accumulated result in a knowledge store that can handle dynamic growth of information: a history recorder
NewsReader Big Data
Focus: The financial crisis
E.g. What is the impact of the financial crisis on the car industry?
Big Data: LexisNexis estimates:
1-2 million news articles per day
10 million English news articles about the car industry from the last 10 years in their archive
NewsReader narratives
What are the stories that are being told by all this data?
Challenges:
Duplicates, overlap and repetitions: how to distinguish old from new?
Single results tell only parts of the story
Results can be inconsistent
News is opinionated and colored
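The first challenge above (telling repeated news from new news) can be illustrated with a simple near-duplicate check. This is a minimal sketch using word shingles and Jaccard similarity, a generic technique chosen for illustration; the slides do not say how NewsReader handles duplicates, and the function names and the 0.8 threshold are assumptions.

```python
import re

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a lowercased text (punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(new_text: str, seen_texts, threshold: float = 0.8) -> bool:
    """True if new_text overlaps strongly with any previously seen text.

    The threshold is an illustrative assumption, not a tuned value.
    """
    new_sh = shingles(new_text)
    return any(jaccard(new_sh, shingles(old)) >= threshold for old in seen_texts)
```

A pairwise comparison like this does not scale to millions of articles per day; at that volume an index-based scheme would be needed, which is outside this sketch.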
NewsReader overall approach
Resolve all mentions of events, their participants, locations and time in texts and other resources
Determine coreference and other relations between them
Combine all information from coreferring event mentions around a hypothetical event instance (independent from text)
Combine instances into storylines
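The four steps above can be sketched on toy data as follows. This is only an illustration under strong assumptions: the `EventMention` fields, the rule that mentions with the same type, place and year corefer, and the chronological ordering used as a "storyline" are all hypothetical simplifications, not the NewsReader implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class EventMention:
    source: str                          # hypothetical document id
    event_type: str                      # e.g. "tsunami"
    place: str
    year: int
    participants: frozenset = frozenset()

def group_into_instances(mentions):
    """Naive coreference: mentions sharing type, place and year denote one instance."""
    clusters = defaultdict(list)
    for m in mentions:
        clusters[(m.event_type, m.place, m.year)].append(m)
    instances = []
    for (etype, place, year), ms in clusters.items():
        instances.append({
            "type": etype,
            "place": place,
            "year": year,
            # Merge what all coreferring mentions say about the instance.
            "participants": sorted(set().union(*(m.participants for m in ms))),
            "sources": sorted({m.source for m in ms}),
        })
    return instances

def storyline(instances):
    """Order instances chronologically (ties broken by event type)."""
    return sorted(instances, key=lambda i: (i["year"], i["type"]))
```

The instance is kept separate from the texts: it stores the merged information plus the list of sources, so each mention remains traceable.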
NLP pipeline
[Figure: NewsReader pipeline architecture. Input: LexisNexis documents, with storage of the original input data. Processing modules: tokenizer + sentence splitter, POS-tagger, parser, NER, time expressions, WSD (client/server), NED (client/server), coreference resolution, SRL, event detection, opinion detection, factuality, event coreference, event relations, story understanding. One label marks processes that can be carried out in any order at this stage. Further labels: EHU, VUA, FBK; runs in virtual machine.]
[Knowledge Store: KS frontend (API implementation over layers; replicated for scalability and fault tolerance); HBase + Hadoop (distributed and replicated for scalability and fault tolerance); triple store ((possibly) distributed) with RDF triples + named graphs; resource, mention, entity, statement + context; management scripts (start/stop, backup/restore, configuration, statistics gathering); partial replication; inference. Visualisation (Synerscope).]
Both Projects
Accumulate information about the same entities and events from various sources
Must deal with different perspectives, contradicting and partial information
Grounded Annotation Framework (GAF)
Sources report on events and entities: event mentions and entity mentions
URIs represent instances of these entities and events in reality
GAF links instances to mentions
Information from mentions in other sources is merged with known information around the instance
a GAF example
[Figure: two timelines running from 2004 to 2013, one for changes in the world and one for the publication of sources. SEM events (temblor, tsunami, future tsunami) are linked by annotations to sources of different kinds: sensor data, direct event report, delayed event report, future event report, tsunami alert system.]
Example source texts:
"The catastrophe four years ago devastated Indian Ocean community and killed more than 230,000 people, over 170,000 of them in Aceh at northern tip of Sumatra Island of Indonesia."
"..., the vessel is the party responsible for the 2004 Indian Ocean tsunami that killed 230,000 people. Apparently, the submarine was able to trigger seismic activity via some kind of directed energy weapon." (SEM event: USS Jimmy Carter energy weapon)
Linguistic information in GAF
The NLP Annotation Format (NAF)
Knowledge Annotation Format (KAF)
stand-off layered annotation (LAF compatible)
separating mentions from instances
NLP Interchange Format (NIF)
RDF and URIs, inline annotation
Compatible with PROV-DM
Events in GAF
extended Simple Event Model (SEM):
RDF representations of event instances with participant, location and time
can represent contradictory information
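To make the RDF representation concrete, here is a minimal sketch that writes an event instance with participant, location and time as N-Triples lines. The SEM namespace and the names `sem:Event`, `sem:hasActor`, `sem:hasPlace` and `sem:hasTime` are given as recalled from the SEM publication and should be checked against the SEM specification; the `http://example.org/` URIs and the helper functions are hypothetical.

```python
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"   # assumed SEM namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"                         # hypothetical instance namespace

def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one triple of URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."

def event_triples(event_id: str, actors, place: str, time: str):
    """Describe an event instance with its participants, location and time."""
    event = EX + event_id
    lines = [triple(event, RDF_TYPE, SEM + "Event")]
    lines += [triple(event, SEM + "hasActor", EX + actor) for actor in actors]
    lines.append(triple(event, SEM + "hasPlace", EX + place))
    lines.append(triple(event, SEM + "hasTime", EX + time))
    return lines

if __name__ == "__main__":
    for line in event_triples("tsunami_2004", ["Aceh_population"],
                              "Indian_Ocean", "year_2004"):
        print(line)
```

Plain string formatting is used to keep the sketch dependency-free; a real system would use an RDF library and named graphs to attach provenance to each statement.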
GAF from NAF + SEM
Can accumulate information from different sources
Can represent repeated information as a single relation (with links to all sources that provided this information)
Can represent contradicting information
Is compatible with the PROV-DM
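The second and third properties can be illustrated with a small in-memory store: a repeated statement is stored once with the set of sources that made it, and competing statements are kept side by side. The class, its method names and the toy identifiers are hypothetical; the sketch also assumes a predicate should have a single value, so that several values signal a contradiction, which does not hold for every relation.

```python
from collections import defaultdict

class StatementStore:
    """Keeps each (subject, predicate, object) once, with all sources that state it."""

    def __init__(self):
        # (subject, predicate, object) -> set of source ids
        self._sources = defaultdict(set)

    def add(self, subject, predicate, obj, source):
        """Record that `source` states the relation; repeats only add a source link."""
        self._sources[(subject, predicate, obj)].add(source)

    def sources(self, subject, predicate, obj):
        """Return the sorted sources that provided this statement."""
        return sorted(self._sources.get((subject, predicate, obj), set()))

    def conflicts(self, subject, predicate):
        """Return {object: sources} if more than one object is asserted, else {}.

        Assumes the predicate is single-valued (illustrative simplification).
        """
        values = {
            o: sorted(srcs)
            for (s, p, o), srcs in self._sources.items()
            if s == subject and p == predicate
        }
        return values if len(values) > 1 else {}
```

For example, if two sources attribute the tsunami to a temblor and a third attributes it to an energy weapon, `conflicts` returns both values with their sources, leaving the judgement to the user.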
Acknowledgements
Supported by the European Union’s 7th Framework program via the NewsReader Project (ICT-316404)
Supported by the BiographyNet project (nr. 660.011.308) funded by the Netherlands eScience center (http://escience.center.nl)
References
GAF:
Fokkens, Antske, Marieke van Erp, Piek Vossen, Sara Tonelli, Willem Robert van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. GAF: A Grounded Annotation Framework for Events. Proceedings of the first Workshop on EVENTS: Definition, Detection, Coreference and Representation. Atlanta USA.
Marieke Van Erp, Antske Fokkens, Piek Vossen, Sara Tonelli, Willem Robert Van Hage, Luciano Serafini, Rachele Sprugnoli and Jesper Hoeksema. 2013. Denoting Data in the Grounded Annotation Framework. ISWC 2013 Posters and Demos. Sydney Australia, 21-25 October 2013
References
SEM:
Van Hage, Willem Robert, Véronique Malaisé, Roxane Segers, Laura Hollink, and Guus Schreiber. "Design and use of the Simple Event Model (SEM)." Web Semantics: Science, Services and Agents on the World Wide Web 9, no. 2 (2011): 128-136.
Cross-document coreference:
Cybulska, Agata, and Piek Vossen. “Semantic Relations between Events and their Time, Locations and Participants for Event Coreference Resolution.” In: Proceedings of RANLP 2013.
References
Named Entity Recognition:
Marieke van Erp, Giuseppe Rizzo and Raphaël Troncy (2013) Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning. #MSM2013 Concept Extraction Challenge. Rio de Janeiro, Brazil, May 2013.
Provenance:
Niels Ockeloen, Antske Fokkens, Serge Ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber and Susan Legêne. 2013. BiographyNet: Managing Provenance at multiple levels and from different perspectives. In: Proceedings of the Workshop on Linked Science 2013 (LISC2013).