NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud
Integrating NLP using Linked Data
ISWC – 2013/10/23 – Page 1 http://lod2.eu
Creating Knowledge out of Interlinked Data
AKSW, Universität Leipzig
Integrating NLP using Linked Data
Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer
http://nlp2rdf.org
http://lod2.eu
http://slideshare.net/kurzum
Introduction
Core problems in integrating NLP:
1. Too much heterogeneity
2. Almost no open standards available
3. Lack of open collaboration
4. Difficult and large domain
Problem analysis
Hardly any reusability in NLP
• Free software (as in free beer), but no open licenses
• Few standards and few mappings
• Integration is hard-wired (you have to write software)
– for each tool, for each framework
Main benefits of using RDF, OWL and Linked Data are:
• lower entry barrier (as a client / user)
• easy data integration (linking, mapping)
• reusability of tools and conceptualisations (ontologies)
• off-the-shelf solutions for common tasks
The Semantic Gap
NLP2RDF project
NLP2RDF (http://nlp2rdf.org)
- community project bootstrapped by LOD2
- develops NLP Interchange Format (NIF)
- umbrella project to combine (and consolidate) existing work
NIF Overview
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
→ to create an ecosystem of interoperable web services
• Reuse of existing standards such as RDF, OWL2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
• Standardize access parameters, annotations (e.g. tokenization), validation and log messages
• Reuse of existing ontologies:
Example NIF Workflow
A NIF workflow cannot, however, provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
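What a NIF workflow exchanges between components is RDF about strings. As a rough sketch (not the reference implementation), the Python below serializes one document and one annotated substring as N-Triples. The helper names, the example sentence and the `http://example.org/doc` URI are assumptions for illustration. The property names and the `xsd:nonNegativeInteger` datatype follow my reading of the NIF 2.0 Core ontology and should be checked against the specification.

```python
import json
import unicodedata

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: NIF 2.0 types its offsets as xsd:nonNegativeInteger.
XSD_NNI = "http://www.w3.org/2001/XMLSchema#nonNegativeInteger"


def uri(value):
    """Wrap an IRI in angle brackets (N-Triples syntax)."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Quote a literal; json.dumps only approximates N-Triples escaping."""
    quoted = json.dumps(str(value), ensure_ascii=False)
    return f"{quoted}^^{uri(datatype)}" if datatype else quoted


def nif_triples(doc_uri, text, begin, end):
    """Describe the whole text (context) and one substring of it."""
    text = unicodedata.normalize("NFC", text)
    context = f"{doc_uri}#char=0,{len(text)}"
    mention = f"{doc_uri}#char={begin},{end}"
    triples = [
        (context, RDF_TYPE, uri(NIF + "Context")),
        (context, NIF + "isString", literal(text)),
        (context, NIF + "beginIndex", literal(0, XSD_NNI)),
        (context, NIF + "endIndex", literal(len(text), XSD_NNI)),
        (mention, RDF_TYPE, uri(NIF + "String")),
        (mention, NIF + "referenceContext", uri(context)),
        (mention, NIF + "beginIndex", literal(begin, XSD_NNI)),
        (mention, NIF + "endIndex", literal(end, XSD_NNI)),
        (mention, NIF + "anchorOf", literal(text[begin:end])),
    ]
    return [f"{uri(s)} {uri(p)} {o} ." for s, p, o in triples]


for line in nif_triples("http://example.org/doc",
                        "My favourite actress is Natalie Portman.", 24, 39):
    print(line)
```

Any tool that can read these triples can add its own annotations to the same string URIs, which is where the integration benefit comes from.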
Use Cases
• Internationalization Tag Set 2.0 (ITS 2.0)
• Part of Speech Tagging
• Wikifier API access via RDFaCE (Entity Linking)
UC1 - Internationalization Tag Set 2.0
• NIF will be the recommended RDF conversion of the W3C Internationalization Tag Set 2.0 (ITS 2.0) - http://www.w3.org/TR/its20/
• NIF turns out to have a unique selling proposition regarding NLP and RDF
• No suitable alternative RDF vocabulary was available for this conversion.
RDFa parsers lose all provenance information:
<http://examples.com/books/wikinomics> dc:title "Wikinomics" .
Source: https://en.wikipedia.org/wiki/RDFa
ITS 2.0
Source: http://www.w3.org/TR/its20/#EX-HTML-whitespace-normalization
String offsets are based on:
• Unicode NFC, code points
• ISO 24612
• RFC 5147
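To illustrate these offset rules, the following Python sketch normalizes text to NFC, counts code points, and builds an RFC 5147-style `#char=begin,end` fragment. The function name `nif_char_uri` and the example URI are assumptions made for this sketch and are not part of NIF or its tooling.

```python
import unicodedata


def nif_char_uri(document_uri, text, substring):
    """Return a '#char=begin,end' URI for the first match of substring.

    Offsets are counted in Unicode code points after NFC normalization,
    which is what Python's str indexing gives us once both strings are
    normalized.
    """
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_sub = unicodedata.normalize("NFC", substring)
    begin = nfc_text.index(nfc_sub)  # raises ValueError if not found
    end = begin + len(nfc_sub)
    return f"{document_uri}#char={begin},{end}"


# "e" followed by a combining acute accent is two code points before NFC
# and one code point after it, so the offset of "Leipzig" shifts by one.
decomposed = "Cafe\u0301 Leipzig"
print(decomposed.index("Leipzig"))
print(nif_char_uri("http://example.org/doc", decomposed, "Leipzig"))
```

Without the normalization step, two tools that receive the same visible text in different Unicode forms would compute different offsets for the same word.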
UC2 – Part of Speech Tagging
Please see the paper:
http://purl.org/olia
UC3 – Wikifier API access via RDFaCE
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
http://rdface.aksw.org/
Evaluation
Please see the paper!
1) Quantitative Analysis with Google Wikilinks Corpus as NIF RDF
• Crawl of 3 million web sites, 40 million Wikipedia links
• ~ 477 million triples in NIF
2) Questionnaire and Developers Study for NIF 1.0
• NIF 1.0 was released in November 2011
• Over 30 known implementations (22 not from authors)
• 14 developers participated in the study
• Minimal NIF implementation requires less than 500 LoC
3) Qualitative Comparison with other Frameworks and Formats
State of NIF 2.0
Corpora as Linked Data
• Wikilinks corpus - http://wiki-link.nlp2rdf.org
• KORE 50 - http://www.yovisto.com/labs/ner-benchmarks/
• DBpedia Spotlight dataset
Tools
• entityclassifier.eu – http://entityclassifier.eu
• Spotlight - http://spotlight.dbpedia.org
• Open NLP
• Stanford CoreNLP - https://github.com/NLP2RDF/software
• Validator - https://github.com/NLP2RDF/software
State of NIF 2.0
• Rollout is in progress
• Distributed implementation at different speed and quality
• Software lifecycle:
– Implementation
– Testing/Validation
– Integration in the main software
– Deployment as a web service
• Hosted web services are often not up to date even when the code base is
For ontology creators
NLP2RDF provides infrastructure for your NLP ontologies:
• Redundant, persistent hosting
• Maven packages
• Code and documentation generation
• Continuous Integration (planned)
• Indexing
• Validation of instance data
Please write to me or the mailing list: [email protected]
Take home message
• Early industrial uptake
– OpenLink, Vistatech.ie, Zemanta, Tenforce, Unister
• ITS 2.0 W3C standard was driven by the localization industry
• NIF is open and free (CC0 planned)
• NIF is designed to be a cost-saver
Not primarily aimed at increasing features or performance (F-measure)
Open Community – All feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://nlp2rdf.org
http://lod2.eu
Thanks for your attention
Annotations
NIF
Scalability - Salzburg Research KMT
https://bitbucket.org/srfgkmt/stanbol-nlp
Unicode Normal Form C
• Recommendation for RDF Literals
• http://unicode.org/reports/tr15/#Norm_Forms
Tokenization
Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74 (2012)
Validation over specification
• SPARQL queries produce (find) errors
• http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl
• RLOG – An RDF Logging Ontology
• ./validate.jar -i nif-erroneous-model.ttl -t file
• Demo → character count
• Demo → all errors
ALL DEMOS ARE AVAILABLE AT:
http://nlp2rdf.org/leipzig-24-9-2013
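The NIF validator expresses its checks as SPARQL queries over the RDF model. As a hedged stand-in, the plain-Python sketch below shows the kind of consistency rule a "character count" test presumably enforces: the offsets must lie inside the context string and must select exactly the anchored text. The function `check_annotation` and its error messages are hypothetical and are not taken from the validator or from RLOG.

```python
def check_annotation(context_text, begin, end, anchor_of):
    """Return a list of human-readable errors for one string annotation."""
    errors = []
    if not (0 <= begin <= end <= len(context_text)):
        errors.append(
            f"offsets {begin},{end} fall outside the context "
            f"(length {len(context_text)})"
        )
    elif context_text[begin:end] != anchor_of:
        errors.append(
            f"anchorOf {anchor_of!r} does not match "
            f"context[{begin}:{end}] = {context_text[begin:end]!r}"
        )
    return errors


text = "My favourite actress is Natalie Portman."
print(check_annotation(text, 24, 39, "Natalie Portman"))  # consistent
print(check_annotation(text, 25, 39, "Natalie Portman"))  # offset is off by one
print(check_annotation(text, 24, 99, "Natalie Portman"))  # end beyond the text
```

An empty list means the annotation passed. In the real validator, each failing query would instead yield a log message in RDF.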
NIF
Demo: http://nlp2rdf.lod2.eu/demo.php
OLiA
http://purl.org/olia
NIF