NERD: an open source platform for extracting and
disambiguating named entities in very diverse documents
Raphaël Troncy <raphael.troncy@eurecom.fr> Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>
What is a Named Entity recognition task?
A task that aims to locate and classify, in a textual document, the names of persons, organizations, locations, brands and products, as well as numeric expressions such as times, dates, monetary amounts and percentages.
22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 2
Example
“I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower”
Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question”
Part of Speech
Token | POS
I | PRP
want | VBP
to | TO
book | VB
a | DT
room | NN
in | IN
… | …
Paris | NNP
NER: What is Paris?
NEL: Which Paris are we talking about?
Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding”
What is Paris? Type Ambiguity
dbpedia-owl:Asteroid schema:City schema:Movie dbpedia-owl:Film
Named Entity Recognition (NER)
Token | POS | NER
I | PRP | O
want | VBP | O
to | TO | O
book | VB | O
a | DT | O
room | NN | O
in | IN | O
… | … | …
Paris | NNP | LOC
What is Paris? Name Ambiguity
Paris, Kentucky Paris, Maine Paris, Tennessee
Paris, France Paris, Idaho Paris, Ontario
Named Entity Linking (NEL)
Token | POS | NER | NEL
I | PRP | O | O
want | VBP | O | O
to | TO | O | O
book | VB | O | O
a | DT | O | O
room | NN | O | O
in | IN | O | O
… | … | … | …
Paris | NNP | LOC | http://dbpedia.org/resource/Paris
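The three annotation layers shown in these slides (POS tag, NER label, NEL link) can be kept in one record per token. The following is a minimal sketch, not part of NERD: the `Token` record and the `linked_entities` helper are hypothetical names introduced for illustration, and the tokens elided with "…" on the slide are simply left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    text: str
    pos: str                    # part-of-speech tag
    ner: str = "O"              # entity type, "O" when outside any entity
    link: Optional[str] = None  # NEL result: a URI, or None when unlinked


def linked_entities(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (text, type, URI) for every token that carries a link."""
    return [(t.text, t.ner, t.link) for t in tokens if t.link is not None]


# Rows copied from the slide; the elided middle of the sentence is omitted.
sentence = [
    Token("I", "PRP"),
    Token("want", "VBP"),
    Token("to", "TO"),
    Token("book", "VB"),
    Token("a", "DT"),
    Token("room", "NN"),
    Token("in", "IN"),
    Token("Paris", "NNP", ner="LOC", link="http://dbpedia.org/resource/Paris"),
]

print(linked_entities(sentence))
```

NER fills the `ner` field (what is Paris?), while NEL fills the `link` field (which Paris?).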
NER Tools and Web APIs
Standalone software: GATE, Stanford CoreNLP, Temis
Web APIs
http://nerd.eurecom.fr/
NERD: Named Entity Recognition and Disambiguation
Compare performances of NER and NEL tools
Understand strengths and weaknesses of different Web APIs
Adapt NER processing to different contexts
(Learn how to) Combine NER (/ NEL) tools
Participate in various benchmarks
What is NERD?
Ontology: http://nerd.eurecom.fr/ontology
REST API: http://nerd.eurecom.fr/api/application.wadl
UI: http://nerd.eurecom.fr
Factual comparison of 10 Web NER tools
Tool | Language | Granularity | Entity position | Classification schema | Number of classes | Response format | Quota (calls/day)
Alchemy API | EN, FR, GR, IT, PT, RU, SP, SW | OEN | N/A | Alchemy | 324 | JSON, Microformat, XML, RDF | 30000
DBpedia Spotlight | EN, GR*, PT*, SP* | OEN | char offset | DBpedia, FreeBase, Schema.org | 320 | HTML, JSON, RDF, XML | unlimited
Evri | EN, IT | OED | N/A | Evri | 5 | HTML, JSON, RDF | 3000
Extractiv | EN | OEN | word offset | DBpedia | 34 | HTML, JSON, RDF, XML | 3000
Lupedia | EN, FR, IT | OEN | range of chars | DBpedia, LinkedMDB | 319 | HTML, JSON, RDFa, XML | unlimited
Open Calais | EN, FR, SP | OEN | char offset | Open Calais | 95 | JSON, Microformat | 50000
Saplo | EN, SW | OED | N/A | N/A | 5 | JSON | 1333
Wikimeta | EN, FR, SP | OEN | POS offset | ESTER | 7 | JSON, XML | unlimited
Yahoo! | EN | OEN | range of chars | Yahoo | 13 | JSON, XML | 5000
Zemanta | EN | OED | N/A | FreeBase | 81 | XML, JSON, RDF | 10000
NERD Ontology
Aligned the taxonomies used by the extractors
Building the NERD Ontology
NERD type | Occurrence
Person | 10
Organization | 10
Country | 6
Company | 6
Location | 6
Continent | 5
City | 5
RadioStation | 5
Album | 5
Product | 5
... | ...
NERD REST API
HTTP methods: GET, POST, PUT, DELETE
Resources: /document, /user, /annotation/{extractor}, /extraction, /evaluation, ...
JSON
    "entities": [{
        "entity": "Tim Berners-Lee",
        "type": "Person",
        "uri": "http://dbpedia.org/resource/Tim_berners_lee",
        "nerdType": "http://nerd.eurecom.fr/ontology#Person",
        "startChar": 30,
        "endChar": 45,
        "confidence": 1,
        "relevance": 0.5
    }]
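A client consuming this JSON output might read the entity list as follows. This is a minimal sketch using only the fields shown on the slide; wrapping the `entities` fragment in an enclosing JSON object is an assumption made so that the sample parses, and `summarize` is a hypothetical helper, not part of any NERD client.

```python
import json

# The slide shows only the "entities" member; the enclosing object is assumed.
payload = """
{"entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
}]}
"""


def summarize(raw):
    """Return (surface form, NERD class name, start, end) for each entity."""
    rows = []
    for ent in json.loads(raw)["entities"]:
        # Keep only the local name after '#' in the NERD ontology URI.
        nerd_class = ent["nerdType"].rsplit("#", 1)[-1]
        rows.append((ent["entity"], nerd_class, ent["startChar"], ent["endChar"]))
    return rows


print(summarize(payload))
```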
Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.
RDF
NERD meets NIF
Model documents through a set of strings dereferenceable on the Web
    :offset_23107_23110 a str:String ;
        str:referenceContext :offset_0_26546 .
Map string to entity
    :offset_23107_23110 sso:oen dbpedia:W3C .
Classification
    dbpedia:W3C rdf:type nerd:Organization .
Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France.
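The offset-based string identifiers in this slide follow a regular pattern, so the three statements can be generated from a span and an entity. The sketch below is illustrative only: `offset_id` and `nif_statements` are hypothetical helpers, prefix declarations are omitted, and the output is plain text in the slide's notation rather than validated RDF.

```python
def offset_id(start, end):
    """Build the local name used on the slide for a character span."""
    return f"offset_{start}_{end}"


def nif_statements(doc_length, start, end, entity, nerd_class):
    """Return the three statements from the slide as Turtle-like strings."""
    string_node = f":{offset_id(start, end)}"
    context_node = f":{offset_id(0, doc_length)}"  # the whole document
    return [
        # 1. Model the span as a string anchored in the document context.
        f"{string_node} a str:String ; str:referenceContext {context_node} .",
        # 2. Map the string to the entity it denotes.
        f"{string_node} sso:oen {entity} .",
        # 3. Classify the entity with a NERD ontology class.
        f"{entity} rdf:type {nerd_class} .",
    ]


statements = nif_statements(26546, 23107, 23110, "dbpedia:W3C", "nerd:Organization")
for line in statements:
    print(line)
```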
NERD User Dashboard
NERD User Interface
History of NER benchmarks
CoNLL 2003 and CoNLL 2005
    schema (4 types): person, organization, location and miscellaneous
ACE 2004, ACE 2005 and ACE 2007
    schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity
    entity recognition, co-ref, find relationships among entities extracted
TAC 2009 (Knowledge Base Track)
    schema (3 types): person, organization and location
    create a knowledge base from the named entities extracted
ETAPE 2012 (Named Entity Task)
    schema: Quaero (7 main types, 32 sub-types)
MSM 2013: tweet corpus!
    schema (4 types): person, organization, location, miscellaneous
ETAPE 2012 challenge
genre | train | dev | test | sources
TV news | 7h 40m | 1h 40m | 1h 40m | BFM Story, Top Questions (LCP)
TV debates | 10h 30m | 5h 10m | 5h 10m | Pile et Face, Ça vous regarde, Entre les lignes (LCP)
TV amusements | - | 1h 05m | 1h 05m | La place du village (TV8)
Item length: Train 26h, Dev 10h 55m, Eval 10h 55m
Nb files: Train 44, Dev 15, Eval 15
Nb words: Train 290517, Dev 91656, Eval 115511
Nb Named Entities: Train 46763, Dev 14398, Eval 13055
Nb unique categories: Train 33, Dev 33, Eval 33
NERD @ ETAPE (naïve combined strategy)
Extractor outputs: (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), (eA3,tA3,URIA3,siA3,eiA3), ..., (eN1,tN1,URIN1,siN1,eiN1), (eN2,tN2,URIN2,siN2,eiN2)
Steps: extraction, cleaning, fusion
Fusion: when at least 2 extractors classify the same entity with different types, a preferred selection order (empirically defined) is applied: Wikimeta, AlchemyAPI, OpenCalais, Lupedia
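The priority-based fusion step can be sketched in a few lines of Python. Several points here are assumptions rather than the NERD implementation: the tuple fields are read as (entity, type, URI, start index, end index), "the same entity" is approximated by an identical character span, and the sample annotations are toy data. Only the extractor order comes from the slide.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    extractor: str
    entity: str
    type: str
    uri: Optional[str]
    start: int
    end: int


# Preferred selection order from the slide (highest priority first).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


def fuse(annotations: List[Annotation]) -> List[Annotation]:
    """Keep, for each span, the annotation of the highest-priority extractor."""
    rank = {name: i for i, name in enumerate(PRIORITY)}
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann.start, ann.end)].append(ann)
    fused = []
    for span in sorted(by_span):
        # Unknown extractors rank after all listed ones.
        best = min(by_span[span], key=lambda a: rank.get(a.extractor, len(PRIORITY)))
        fused.append(best)
    return fused


# Toy input: two extractors disagree on the type of the same span.
sample = [
    Annotation("lupedia", "Paris", "Person", None, 10, 15),
    Annotation("wikimeta", "Paris", "Location", "http://dbpedia.org/resource/Paris", 10, 15),
    Annotation("opencalais", "Eiffel Tower", "Facility", None, 40, 52),
]
fused = fuse(sample)
for ann in fused:
    print(ann.extractor, ann.entity, ann.type)
```

For the conflicting span, the Wikimeta annotation wins because Wikimeta comes first in the selection order; spans seen by a single extractor pass through unchanged.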
Participation at ETAPE (combined+ strategy)
Extractor outputs: (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), ..., (eN1,tN1,URIN1,siN1,eiN1), (eN2,tN2,URIN2,siN2,eiN2)
Rule-based annotation (diagram labels): ETAPE Train & Dev, learned model, created static rules, POS tagger, apply rules, output (e1,t1,URI1,si1,ei1)
Fusion: conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
NERD Global results
Strategy | SLR | Precision | Recall | F-measure | %correct
combined | 86.85% | 35.31% | 17.69% | 23.44% | 17.69%
combined+ | 188.81% | 15.13% | 28.40% | 19.45% | 28.40%
Combined+: the Eval corpus differs substantially from the Train & Dev corpora. The static rules do not fit the Eval corpus well and introduce classification noise.
Per-extractor results
Extractor | SLR | Precision | Recall | F-measure | %correct
alchemyapi | 37.71% | 47.95% | 5.45% | 9.68% | 5.45%
lupedia | 39.49% | 22.87% | 1.56% | 2.91% | 1.56%
opencalais | 37.47% | 41.69% | 3.53% | 6.49% | 3.53%
wikimeta | 36.67% | 19.40% | 4.25% | 6.95% | 4.25%
combined (nerd) | 86.85% | 35.31% | 17.69% | 23.44% | 17.69%
combined+ (nerd+) | 188.81% | 15.13% | 28.40% | 19.45% | 28.40%
Learning How to Combine NER Extractors
NERD on CoNLL 2003 (NER task)
NERD on MSM 2013 (NER task)
NERD on MSM 2013 (NEL task)
Media Fragment Enricher: http://mfe.synote.org/mfe/
Linking pieces of knowledge
Named Entities for Video Classification
Workflow
Components: Media Fragment Enricher Services, Media Fragment Enricher UI, Metadata & timed-text, NERD Client, RDFizator, Triple Store, Categorization
UI views: video and metadata preview; video replay with subtitles and aligned NEs
Steps:
1: Video URL
2: Metadata
3: Metadata
4: NERDify
5: Timed text
6: NEs with time alignment (json)
7: RDFize (ttl)
8: Generate category
9: SPARQL query
Channel signature based on NE distribution
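One simple way to read "channel signature based on NE distribution" is as the relative frequency of entity types over a channel's videos. The sketch below makes that assumption explicit: the `signature` function and the toy list of types are illustrative and are not taken from the Media Fragment Enricher code.

```python
from collections import Counter


def signature(entity_types):
    """Map each entity type to its share of all extracted entities."""
    counts = Counter(entity_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: count / total for etype, count in counts.items()}


# Toy input: NERD types of the entities found in one channel's subtitles.
toy_types = ["Person", "Person", "Location", "Organization"]
toy_signature = signature(toy_types)
print(toy_signature)
```

Two channels could then be compared by comparing these dictionaries, for example with a distance between the two distributions.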
LinkedTV: automatic annotations ...
... and enrichment for hypervideos
Cubism Expressionism
Fauvism
FACETS / PROPERTIES OF CONCEPT
CONCEPT IN PLAYER
CONTENT ENRICHMENT
Media Fragments and Annotations
nerd:Location Cafe Rick
nerd:Person H. Bogart
nerd:Person I. Bergman
nerd:Location Casablanca
Media Fragment URI 1.0: chapters, scenes, shots, etc…
http://data.linkedtv.eu/media/e2899e7f#t=840,900
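The `#t=840,900` suffix in this URI is a temporal fragment: it addresses the part of the video between second 840 and second 900. A minimal parsing sketch with the Python standard library follows; `temporal_fragment` is a hypothetical helper that handles only plain seconds (with an optional `npt:` prefix), not the clock formats such as `hh:mm:ss` that Media Fragments URI 1.0 also allows.

```python
from urllib.parse import parse_qs, urlparse


def temporal_fragment(uri):
    """Return (start, end) in seconds from a '#t=start,end' fragment.

    Returns None when the URI has no temporal dimension. A missing start
    defaults to 0.0; a missing end is returned as None.
    """
    params = parse_qs(urlparse(uri).fragment)
    if "t" not in params:
        return None
    value = params["t"][0]
    if value.startswith("npt:"):
        value = value[len("npt:"):]
    start_text, _, end_text = value.partition(",")
    start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else None
    return start, end


span = temporal_fragment("http://data.linkedtv.eu/media/e2899e7f#t=840,900")
print(span)
```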
Enrichment and Hypervideos
nerd:Location Cafe Rick
nerd:Person H. Bogart
nerd:Person I. Bergman
nerd:Location Casablanca
nerd:Person E. Tierney
nerd:Location China
Locator
MediaResource
MediaFragment Annotation
Entity
URL (hyperlink)
Type
Media Fragment + Open Annotation + NERD
OffsetBasedString
Towards a Linked Media Layer
Enriching media with media from a closed collection (e.g. BBC archive)
    The MediaEval scenario (~1697 hours of archived BBC video)
    http://www.multimediaeval.org/mediaeval2013/hyper2013/
Enriching media with content from the open web
    LinkedTV scenarios: white-listed web sites for each program
    Media Collector for Social Media
Seed video enriched with web content rbbaktuell_20120809
nerd:Location Brandenburg
oa
Enrichments are Annotations too
Media Finder (named entities clustering)
Media Finder (zooming in a cluster)
Media Finder: http://mediafinder.eurecom.fr/
Live Topic Generation from Event Streams (WWW 2013 Demo Session): http://www.youtube.com/watch?v=8iRiwz7cDYY
Credits
Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)
Thomas Steiner (Google Inc.)
Marieke van Erp (Free University of Amsterdam)
Yunjia Li (University of Southampton)
… and many other students
http://www.slideshare.net/troncy