Introduction to LDL 2012
Linked Data in Linguistics
Representing and Connecting Language Data and Language Metadata
34th Annual Meeting of the German Linguistic Society (DGfS), AG 2, Frankfurt/M., Germany, March 7th-9th, 2012
Sebastian Hellmann, Christian Chiarcos, Sebastian Nordhoff
If not otherwise noted, content is cc-by
NB: An asterisk (*) in these notes indicates a transition on the slide.
Overview
Technological Background (SH)
Linked Open Data and Collaborative Research (SH)
Linked Data for Linguistics (CC)
Building a Linguistic Linked Open Data Cloud
Prospects of Linked Data in Linguistics (CC)
Annotated Corpora (CC)
Lexical-Semantic Resource (SH)
Linguistic Databases (SN)
What to Expect from LDL-2012
From Excel to RDF and Linked Data
A data collection about sailing ships:
Source http://en.wikipedia.org/wiki/File:Bounty_modified_photo.jpg
From Excel to RDF and Linked Data
Add the Gorch Fock
Source http://en.wikipedia.org/wiki/File:Gorch_Fock_unter_Segeln_Kieler_Foerde_2006.jpg
From Excel to RDF and Linked Data
Add the auxiliary propulsion of the Gorch Fock
The field is now irregular
From Excel to RDF and Linked Data
A first empty field is introduced
From Excel to RDF and Linked Data
Entity Attribute Value, data represented in triples
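A minimal sketch of the step from an irregular table to Entity-Attribute-Value triples. The column names and cell values are hypothetical placeholders, since the table shown on the slide is not part of this transcript.

```python
# Hypothetical rows of the ship table; None marks an empty cell.
rows = [
    {"name": "Bounty", "masts": 3, "engine": None},
    {"name": "Gorch Fock", "masts": 3, "engine": "diesel"},
]


def to_triples(rows, key="name"):
    """Turn every non-empty cell into an (entity, attribute, value) triple."""
    triples = []
    for row in rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute == key or value is None:
                continue  # an empty cell simply produces no triple
            triples.append((entity, attribute, value))
    return triples


triples = to_triples(rows)
for triple in triples:
    print(triple)
```

The empty field of the Bounty row disappears: a missing value is just a missing triple, so the representation stays regular however irregular the table becomes.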
From Excel to RDF and Linked Data
XML also avoids sparsity and anomalies, but what about:
1. Automatically infer rows (reduces size)
2. Check consistency (not validity)
3. Merge two tables (not only syntactically, but semantically)
4. Enrich with external data (also retrieve updates)
5. Query
From Excel to RDF and Linked Data
Description Logic (DL) is a family of formal knowledge representation languages
fragments of first order logic
usually decidable inference problems
Well researched complexity
Basis for the Web Ontology Language (OWL)
Reasoner implementations available
Franz Baader, Ian Horrocks, and Ulrike Sattler. Chapter 3: Description Logics. In Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter, editors, Handbook of Knowledge Representation. Elsevier, 2007.
From Excel to RDF and Linked Data
Description Logic inference
From Excel to RDF and Linked Data
Description Logic constraints
Constraints make it possible to detect inconsistencies, e.g. that the Gorch Fock must not be a SailingShip
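Inference and consistency checking can be illustrated with a toy sketch. It is not a Description Logic reasoner: it only handles subclass and disjointness axioms, and the class names (Ship, SailingShip, MotorShip) and the axioms themselves are assumptions chosen for illustration, not the ones shown on the slides.

```python
# Toy axioms (assumed for illustration only).
subclass_of = {"SailingShip": "Ship", "MotorShip": "Ship"}
disjoint = {frozenset({"SailingShip", "MotorShip"})}


def inferred_types(asserted):
    """Close a set of asserted classes under the subclass axioms."""
    types = set(asserted)
    frontier = list(asserted)
    while frontier:
        parent = subclass_of.get(frontier.pop())
        if parent is not None and parent not in types:
            types.add(parent)
            frontier.append(parent)
    return types


def is_consistent(types):
    """An individual must not belong to two disjoint classes."""
    return not any(pair <= types for pair in disjoint)


bounty = inferred_types({"SailingShip"})
clash = inferred_types({"SailingShip", "MotorShip"})
print(bounty, is_consistent(bounty))
print(clash, is_consistent(clash))
```

The first call adds a statement that was never entered by hand (inference); the second individual violates the disjointness axiom, which is the kind of inconsistency a reasoner reports.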
Uniform Resource Identifiers (URIs)
Agree on a common vocabulary and names for entities
On the schema level, coherence of properties and types is required for data integration
URIs allow for globally unique identifiers:
Gorch Fock
vs. http://en.wikipedia.org/wiki/Gorch_Fock_(1958)
vs. http://dbpedia.org/resource/Gorch_Fock_(1958)
dbpedia:Gorch_Fock_(1958)
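The prefixed form is only an abbreviation of the full URI. A small sketch of the expansion, assuming the usual binding of the dbpedia: prefix to http://dbpedia.org/resource/ (the function name expand is hypothetical):

```python
# Prefix table; the binding mirrors the two DBpedia forms shown on the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as dbpedia:X into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local_name


uri = expand("dbpedia:Gorch_Fock_(1958)")
print(uri)
```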
From Excel to RDF and Linked Data
Last table before we get more technical
4 Types of Object
From Excel to RDF and Linked Data
[Figure: RDF graph. Node and edge labels: my:Gorch_Fock, dbpedia:Gorch_Fock_(1958), my:German_Navy, dbpedia:German_Navy, my:owner, dbprop:shipLength, 81.2^^xsd:double, owl:sameAs; annotations: More data, Other datasets]
RDF and OWL - recap
RDF: Resource Description Framework
Entity Attribute Value + URIs
Triples
Shared Vocabularies
Graphs
OWL: Web Ontology Language
Based on Description Logic and extends RDF
OWL-DL Reasoning
Consistency checks
Both are W3C standards
Syntax training
Presenters will probably show you some code over the next few days
On the next slide you will see some syntax examples
Serialization: Turtle and XML
SPARQL
Ability to merge data and query it using the W3C standard SPARQL (SPARQL Protocol and RDF Query Language)
SPARQL is the SQL of the Semantic Web
SELECT ?ship WHERE {
  ?ship rdf:type my:SailingShip .
  ?ship my:propulsion ?engine .
  ?engine my:fuelType my:Diesel .
  ?ship dbprop:shipLength ?length .
  FILTER (xsd:double(?length) >= 80.0)
}
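To make the query's meaning concrete, here is a rough plain-Python sketch of what it asks for, evaluated over a handful of in-memory triples. This is not a SPARQL engine; the triples are hypothetical (only the length 81.2 appears on the slides), and unlike SPARQL the sketch returns each matching ship once.

```python
# Hypothetical data; prefixed names are kept as plain strings.
triples = {
    ("my:Gorch_Fock", "rdf:type", "my:SailingShip"),
    ("my:Gorch_Fock", "my:propulsion", "my:engine1"),
    ("my:engine1", "my:fuelType", "my:Diesel"),
    ("my:Gorch_Fock", "dbprop:shipLength", 81.2),
    ("my:Bounty", "rdf:type", "my:SailingShip"),
    ("my:Bounty", "dbprop:shipLength", 27.7),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def long_diesel_sailing_ships(min_length=80.0):
    """Sailing ships with a diesel engine and a length >= min_length."""
    ships = sorted(s for s, p, o in triples
                   if p == "rdf:type" and o == "my:SailingShip")
    result = []
    for ship in ships:
        has_diesel = any("my:Diesel" in objects(engine, "my:fuelType")
                         for engine in objects(ship, "my:propulsion"))
        long_enough = any(float(length) >= min_length
                          for length in objects(ship, "dbprop:shipLength"))
        if has_diesel and long_enough:
            result.append(ship)
    return result


print(long_diesel_sailing_ships())
```

Each triple pattern in the query corresponds to one lookup here, and the FILTER corresponds to the length comparison.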
Linked Data
Linked Open Data cloud
Image of a table with some data
Source http://lod-cloud.net
4 Rules of Linked Data
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
Include links to other URIs, so that they can discover more things.
http://www.w3.org/DesignIssues/LinkedData.html
Linked Data - Content Negotiation
Different views for different data consumers: Browser
Linked Data - Content Negotiation
Different views for different data consumers: Applications
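A simplified sketch of the server-side decision behind content negotiation: the same URI is answered with a different representation depending on the Accept header of the request. The set of offered formats and the function name negotiate are assumptions for illustration, and quality values (q=...) are deliberately ignored.

```python
# Representations a hypothetical Linked Data server could offer for one URI.
REPRESENTATIONS = {
    "text/html": "HTML page for browsers",
    "application/rdf+xml": "RDF/XML document for applications",
    "text/turtle": "Turtle document for applications",
}


def negotiate(accept_header, default="text/html"):
    """Return the first offered media type listed in the Accept header."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return media_type
    return default


print(negotiate("text/html,application/xhtml+xml;q=0.9"))
print(negotiate("application/rdf+xml"))
```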
Linked Data
A dataset is a set of RDF triples that is published, maintained or aggregated by a single provider.
An RDF link is an RDF triple whose subject and object are described in different datasets
A linkset is a collection of such RDF links between two datasets
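These three definitions can be sketched in a few lines. The triples below are assumptions modelled on the labels of the earlier ship example; which resource is described in which dataset is likewise assumed.

```python
# Resources described by each (hypothetical) dataset.
my_dataset = {"my:Gorch_Fock", "my:German_Navy"}
dbpedia_dataset = {"dbpedia:Gorch_Fock_(1958)", "dbpedia:German_Navy"}

triples = [
    ("my:Gorch_Fock", "owl:sameAs", "dbpedia:Gorch_Fock_(1958)"),
    ("my:German_Navy", "owl:sameAs", "dbpedia:German_Navy"),
    ("my:Gorch_Fock", "my:owner", "my:German_Navy"),  # stays inside one dataset
]


def linkset(triples, source, target):
    """RDF links: subject described in source, object described in target."""
    return [(s, p, o) for s, p, o in triples if s in source and o in target]


links = linkset(triples, my_dataset, dbpedia_dataset)
for link in links:
    print(link)
```

The my:owner triple is not an RDF link in this sense, because its subject and object are described in the same dataset.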
Why go for the fifth star?
Source: http://webofdata.wordpress.com/2011/05/22/why-we-link/
Geonames
Central Contractor Registration (CCR)
Open licences allow republishing and reuse
Motivation for collaboration: high potential that invested efforts can be reused, i.e. data, links, vocabularies, schemas
(Effortful) feedback: Users complement data, extend vocabularies and contribute changes. VoCamps for achieving coherence.
Source: Chiarcos, Hellmann, Nordhoff, Towards a Linguistic Linked Open Data cloud: The Open Linguistics Working Group, Traitement Automatique des Langues, to appear
Example DBpedia
Data is extracted from Wikipedia
Wikipedia just publishes the unstructured data
Small DBpedia team creates RDF
Community of stakeholders cleans the data and creates links
Estimate: 10-20% to consolidate community effort
Scalability
Golden Hammer Anti-Pattern
Adequacy
Linked Data for Linguistics
Representation and modelling
Structural interoperability
Integrating distributed resources
Conceptual interoperability
Dynamic Import
Representation and modelling
Structural interoperability
Representation and modelling
Different linguistic subcommunities have developed representation standards, e.g.,
LMF: Lexical Markup Framework (Francopoulo et al. 2009) for lexical-semantic resources
GrAF: Graph Annotation Framework (Ide and Suderman 2007) for annotated corpora
based on labelled directed acyclic graphs (feature structures)
RDF data model: labelled directed (multi-)graphs
Uniform formalism for different resource types
Sublanguages (e.g., RDFS, OWL) make it possible to define domain-specific vocabularies
Structural interoperability
With different language resources represented in RDF, we can combine both sources of information freely
cross-resource queries with RDF query languages (e.g., SPARQL)
Given a corpus with WordNet sense annotations (e.g., the Manually Annotated Sub-Corpus MASC) (Ide et al. 2010)
Retrieve all sentences that describe locations
i.e., sentences containing a token annotated with a WordNet sense that is a hyponym of location
Difficult to realize with GrAF or LMF
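The cross-resource query described above can be sketched in plain Python once both resources share one graph-like representation. The corpus fragment, the sense labels and the hypernym chain are all hypothetical; real data would come from MASC and WordNet. For simplicity the sketch also counts the target sense itself as a match.

```python
# Lexical resource (hypothetical fragment): sense -> direct hypernym.
hypernym = {"city": "region", "region": "location", "dog": "animal"}

# Corpus (hypothetical fragment): sentence id -> [(token, sense or None)].
corpus = {
    "s1": [("Berlin", None), ("is", None), ("a", None), ("city", "city")],
    "s2": [("The", None), ("dog", "dog"), ("barks", None)],
}


def is_hyponym_of(sense, target):
    """Follow hypernym links upwards until target is found or the chain ends."""
    seen = set()
    while sense is not None and sense not in seen:
        if sense == target:
            return True
        seen.add(sense)
        sense = hypernym.get(sense)
    return False


def sentences_about(target):
    """Sentences containing a token whose sense falls under target."""
    return sorted(sentence_id for sentence_id, tokens in corpus.items()
                  if any(is_hyponym_of(sense, target) for _, sense in tokens))


print(sentences_about("location"))
```

The point of the example is that the lookup crosses from corpus annotations into the lexical resource without converting either one into the other's format.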
Integrating distributed resources
SPARQL supports nested subqueries to run on different repositories
No physical integration of resources in a single database required
Easy to link to centralized repositories of reference terminology, etc.
Conceptual interoperability
Resources should specify which vocabulary (e.g., for annotation) they use and how it is defined
By reference to community-maintained terminology repositories, e.g.,
GOLD (Farrar and Langendoen 2010)
ISOcat (Windhouwer and Wright @ LDL-2012)
Can be used, e.g., for disambiguation
If a lexeme in a lexicon has a certain morphosyntactic categorization, we can retrieve all sentences from a corpus with corresponding annotations
e.g., land as a noun, but not as a verb
Dynamic import
Linking resources is implemented with URIs, which can be resolved on the fly to update and enrich data sets
For a token in a corpus, additional information can be aggregated from different repositories by resolving links (retrieving senses from a lexical-semantic repository or concepts from a terminology repository)
If the information in the target resource was updated since the original annotation was performed, then the updates are available at query time
Inconsistencies can be avoided through versioning
Ecosystem, infrastructure and community
RDF and related standards are maintained by an active and relatively large community
Different fields of application: Libraries, GeoData, BioMed, ...
Established W3C standard and technological infrastructure
Linguistically relevant resources already provided: lexical-semantic resources (e.g., WordNet)
RDF facilitates distributed development, re-using data, and, indirectly, interdisciplinary cooperation
Building a Linguistic Linked Open Data cloud
In the LOD cloud:
Lexical-semantic resources
Linguistic metadata
Further relevant types for linguistic research:
Annotated corpora
Input and output of NLP tools
Linguistic databases
Repositories of linguistic terminology
Building a Linguistic Linked Open Data cloud
Each single provider has different incentives to use Linked Data and/or RDF
Concepts of RDF and Linked Data have been brought up to solve open problems in different subcommunities of linguistics and neighboring fields
As an illustration, we briefly introduce three examples
Building a Linguistic Linked Open Data cloud
Annotated corpora
Underlying problem: structural and conceptual interoperability
Natural Language Processing for the Semantic Web
Underlying problem: NLP output represented in idiosyncratic formalisms, results to be represented in RDF
Typological databases
Underlying problem: globally unique identifiers (not just for languages, but for dialects, language families, etc.)
Annotated corpora
Linked Data and Corpus Interoperability
Linked Data can be used to address interoperability issues of annotated corpora
Corpus: collection of texts developed to analyze language and to develop tools for this purpose => Annotated corpora
Different types of annotations, different communities involved, different languages => Interoperability challenge
Structural interoperability: interoperable representation formalisms
Conceptual interoperability: reference definitions for linguistic categories and features
References:
McEnery, T. and Wilson, A. (2001). Corpus Linguistics: An Introduction. Edinburgh University Press.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work, vol. 6. John Benjamins.
Brewster, C., Alani, H., Dasmahapatra, S. and Wilks, Y. (2004). Data driven ontology evaluation. In Proceedings of LREC.
Mahesh, K. and Nirenburg, S. (1995). Semantic classification for practical natural language processing. In Proceedings of the Sixth ASIS SIG/CR Classification Research Workshop: An Interdisciplinary Meeting, Chicago, IL.
Structural Interoperability
word annotations (tokens)
span annotations (markables)
tree-like annotations
relational annotations
Analyses produced by different researchers / NLP tools use different representation formalisms
State-of-the-art approaches:
Graph-based data model
Represent data in standoff XML (Ide and Suderman 2007, Chiarcos et al. 2008, Eckart et al. @ LDL)
Presentation of Nancy Ide @ LDL 2012
Structural Interoperability
XML standoff
MASC corpus, GrAF format
Working with XML standoff
How to store, retrieve and query XML standoff data efficiently?
Direct use with XML databases inefficient (Eckart 2008)
Inline XML (e.g., Dipper et al. 2007)
Relational DB formats (e.g., Eckart et al. @ LDL)
RDF as another possibility (e.g., Chiarcos 2012)
Databases are optimized for graph querying
Extensive (open source) infrastructure available
Conceptual interoperability
Integration with Linked Data resources
Corpus Interoperability with RDF
Structural Interoperability
e.g. POWLA - http://purl.org/powla
lossless transformation to RDF from standoff XML
Linking to lexical-semantic resources (WordNet)
Conceptual Interoperability
Cross-Linking to terminology repositories (OLiA, GOLD, ISOcat)
Entity-Linking to metadata (Geodata, LOD cloud)
Natural Language Processing Interchange Format (NIF)
NLP Interchange Format (NIF)
NIF is an RDF/OWL-based format
Achieve interoperability for:
Output of NLP tools
Linguistic data in RDF
Text documents
Web of Data (LOD cloud)
A Transparent Formalization of Text for Machines
Opaque to machines
A Transparent Formalization of Text for Machines
The city Berlin is the capital of Germany.
URI: http://example.org/sample#offset_0_42
Universe of discourse is defined as the words over the alphabet of Unicode characters
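A sketch of how an offset-based URI like the one above can be derived from a document URI and a character span. The function name offset_uri is hypothetical, and the fragment layout is modelled only on the example shown here; the NIF 1.0 specification defines the actual URI recipes and may require additional parts.

```python
def offset_uri(document_uri, text, begin, end):
    """Build a fragment URI addressing the characters text[begin:end]."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets out of range")
    return f"{document_uri}#offset_{begin}_{end}"


text = "The city Berlin is the capital of Germany."
document = "http://example.org/sample"

# The whole sentence, then the substring "Berlin".
whole = offset_uri(document, text, 0, len(text))
begin = text.index("Berlin")
berlin = offset_uri(document, text, begin, begin + len("Berlin"))

print(whole)
print(berlin)
```

Because every substring gets its own URI, annotations produced by different NLP tools can attach to the same piece of text and be merged as ordinary RDF.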
NLP Interchange Format
Specification for NIF 1.0 (http://nlp2rdf.org/nif-1-0/)
Different implementations (alpha/beta) are available as open source (UIMA, GATE ANNIE, Stanford Parser, DBpedia Spotlight)
Mailing list available at http://nlp2rdf.org
Demo: http://nlp2rdf.lod2.eu/demo.php
Poster during the poster session Thursday 13:00-14:30
Typological databases: Glottolog/Langdoc
Glottolog/Langdoc
Two subprojects:
Glottolog provides identifiers and additional information for 100k languoids (languages, dialects, families)
main competitor projects: ISO 639-3/Ethnologue
Multitree
Langdoc provides identifiers and additional information for 180k references
main competitor project: OLAC
Problems to address:
existing identifiers are not granular enough (ISO 639-3: 7k)
existing identifiers have unclear reference (Multitree altc refers to both Micro-Altaic and Macro-Altaic)
existing identifiers have no verifiable empirical basis
Solutions:
21k identifiers for main tree
total of 104k identifiers for all nodes of Multitree trees
RDF
glottolog:12345 gl:sublanguoid glottolog:41202 .
glottolog:12345 gl:superlanguoid glottolog:94211 .
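A sketch of how such triples can be traversed to collect the chain of superlanguoids of a node. It assumes that "X gl:superlanguoid Y" means that Y is the next higher node above X; the third triple and its identifier are hypothetical additions that only serve to make the chain longer.

```python
triples = [
    ("glottolog:12345", "gl:sublanguoid", "glottolog:41202"),
    ("glottolog:12345", "gl:superlanguoid", "glottolog:94211"),
    # Hypothetical extra triple, not taken from the slides.
    ("glottolog:94211", "gl:superlanguoid", "glottolog:99999"),
]


def ancestors(languoid):
    """Follow gl:superlanguoid links upwards and return the nodes passed."""
    chain = []
    current = languoid
    while True:
        parents = [o for s, p, o in triples
                   if s == current and p == "gl:superlanguoid"]
        if not parents or parents[0] in chain:
            return chain  # top of the tree reached (or a cycle detected)
        current = parents[0]
        chain.append(current)


print(ancestors("glottolog:12345"))
```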
Langdoc
180k references to literature treating (mostly) lesser-known languages
annotated for language, document type, macro-area
limited full text indexing
Example query: give me any grammar or grammar sketch of an Afro-Asiatic language spoken in Eurasia where the word 'dual' occurs in the text
RDF
glottolog:12345 gl:immediatelydescribedin langdoc:23456 .
Position of G/L in the LLOD cloud
Availability
XHTML: http://glottolog.livingsources.org
RDF: http://glottolog.livingsources.org/sparql
Outlook
From OWLG to DGfS
The Open Knowledge Foundation Working Group for Open Data in Linguistics (OKFN-OWLG) was founded in late 2010
We first established a series of meetings and a mailing list
Build the structure, create momentum
Two workshops: OKCon 2011 in Berlin, this workshop
This afternoon: Christian Kreutz presents the OKFN
Building the Linguistic Linked Data Cloud
This workshop
Exploratory workshop
Chart domains as to the amount and kind of data which can be integrated into the LLOD-cloud
increase coverage: more domains
increase density: more links between resources
increase discussion between independent subcommunities
This workshop
Spread the word
http://linguistics.okfn.org/
poster at DGfS-CL session on Thursday
Start of this workshop: first talk by Declerck et al.
Towards Linked Language Data (LLD) for Digital Humanities
We would like to thank
MPI
Springer
LOD2