Introduction to LDL 2012

Linked Data in Linguistics
Representing and Connecting Language Data and Language Metadata

34th Annual Meeting of the German Linguistic Society (DGfS), AG 2, Frankfurt/M., Germany, March 7th-9th, 2012

Sebastian Hellmann, Christian Chiarcos, Sebastian Nordhoff

If not otherwise noted, content is cc-by

NB: An asterisk (*) in these notes indicates a transition on the slide.

Overview

Technological Background (SH)

Linked Open Data and Collaborative Research (SH)

Linked Data for Linguistics (CC)

Building a Linguistic Linked Open Data Cloud

Prospects of Linked Data in Linguistics (CC)

Annotated Corpora (CC)

Lexical-Semantic Resource (SH)

Linguistic Databases (SN)

What to Expect from LDL-2012

From Excel to RDF and Linked Data


A data collection about sailing ships:

Source http://en.wikipedia.org/wiki/File:Bounty_modified_photo.jpg


Add the Gorch Fock

Source http://en.wikipedia.org/wiki/File:Gorch_Fock_unter_Segeln_Kieler_Foerde_2006.jpg


Add the auxiliary propulsion of the Gorch Fock

The field is now irregular


A first empty field is introduced


Entity Attribute Value, data represented in triples
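The step from a sparse table to Entity Attribute Value triples can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the slides: the column names, the `my:` prefix and the engine identifier are assumptions, and only the Gorch Fock length (81.2) appears later in the deck.

```python
# Toy table: one dict per ship; None marks an empty cell.
# Column names, the "my:" prefix and "my:Engine1" are assumptions for illustration.
rows = [
    {"id": "my:Bounty", "type": "my:SailingShip", "length": None, "propulsion": None},
    {"id": "my:Gorch_Fock", "type": "my:SailingShip", "length": 81.2, "propulsion": "my:Engine1"},
]


def table_to_triples(rows, key="id"):
    """Turn each non-empty cell into an (entity, attribute, value) triple."""
    triples = []
    for row in rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute == key or value is None:
                continue  # empty cells produce no triple, so nothing sparse is stored
            triples.append((entity, attribute, value))
    return triples


triples = table_to_triples(rows)
for triple in triples:
    print(triple)
```

Empty cells simply vanish in this representation, which is why adding an irregular attribute for one ship does not affect the others.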


XML does not produce sparsity or anomalies either, but what about:

1. Automatically infer rows (reduces size)
2. Check consistency (not validity)
3. Merge two tables (not only syntactically, but semantically)
4. Enrich with external data (also retrieve updates)
5. Query





Description Logic (DL) is a family of formal knowledge representation languages

fragments of first order logic

usually decidable inference problems

Well researched complexity

Basis for the Web Ontology Language (OWL)

Reasoner implementations available

Franz Baader, Ian Horrocks, and Ulrike Sattler. Chapter 3: Description Logics. In Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter, editors, Handbook of Knowledge Representation. Elsevier, 2007.


Description Logic inference


Description Logic constraints

Possible to detect inconsistencies, e.g., that the Gorch Fock must not be a SailingShip
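What inference and consistency checking mean can be illustrated with a toy example in Python. It is not a Description Logic reasoner, and the axioms below (a subclass hierarchy and one disjointness statement) are hypothetical; they are not the axioms used on the slides.

```python
# Hypothetical toy axioms; not the axioms from the slides.
SUBCLASS_OF = {"my:SailingShip": "my:Ship", "my:MotorShip": "my:Ship"}
DISJOINT = [("my:SailingShip", "my:MotorShip")]


def inferred_types(asserted):
    """Close a set of asserted classes under the subclass axioms."""
    result = set(asserted)
    frontier = list(asserted)
    while frontier:
        parent = SUBCLASS_OF.get(frontier.pop())
        if parent is not None and parent not in result:
            result.add(parent)
            frontier.append(parent)
    return result


def is_consistent(asserted):
    """False if the individual falls into two classes declared disjoint."""
    types = inferred_types(asserted)
    return not any(a in types and b in types for a, b in DISJOINT)


print(inferred_types({"my:SailingShip"}))
print(is_consistent({"my:SailingShip"}))
print(is_consistent({"my:SailingShip", "my:MotorShip"}))
```

The first call shows inference (a type that was never stated explicitly is derived), the last one shows a detected inconsistency. A real reasoner handles far richer axioms than this sketch.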




Uniform Resource Identifiers (URIs)

Agree on a common vocabulary and names for entities

On the schema level, coherence of properties and types is required for data integration

URIs allow for globally unique identifiers:

Gorch Fock
vs.
http://en.wikipedia.org/wiki/Gorch_Fock_(1958)
vs.
http://dbpedia.org/resource/Gorch_Fock_(1958)
dbpedia:Gorch_Fock_(1958) (prefixed form)
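The relation between the prefixed name and the full URI is plain string expansion, sketched here in Python. The `dbpedia` namespace is the one visible in the URI above; the `my` namespace is a placeholder assumed for this example.

```python
# Prefix table; the "my" namespace is a placeholder for this example.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "my": "http://example.org/ships/",
}


def expand(curie):
    """Expand prefix:local into a full URI using PREFIXES."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("dbpedia:Gorch_Fock_(1958)"))
```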


Last table before we get more technical

4 Types of Object


[Figure: RDF graph around my:Gorch_Fock. Labels in the figure: my:owner, dbprop:shipLength, 81.2^^xsd:double, my:German_Navy, dbpedia:German_Navy, dbpedia:Gorch_Fock_(1958), owl:sameAs. The owl:sameAs links lead to "More data" and "Other datasets".]

RDF and OWL - recap

RDF: Resource Description Framework

Entity Attribute Value + URIs

Triples

Shared Vocabularies

Graphs

OWL: Web Ontology Language

Based on Description Logic and extends RDF

OWL-DL Reasoning

Consistency checks

Both are W3C standards

Syntax training

Presenters will probably show you some code over the next few days

On the next slide you will see some syntax examples

Serialization: Turtle and XML


SPARQL

Ability to merge data and query it using the W3C standard SPARQL (SPARQL Protocol and RDF Query Language)

SPARQL is the SQL of the Semantic Web

SELECT ?ship WHERE {
  ?ship rdf:type my:SailingShip .
  ?ship my:propulsion ?engine .
  ?engine my:fuelType my:Diesel .
  ?ship dbprop:shipLength ?length .
  FILTER (xsd:double(?length) >= 80.0)
}
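To see what the query asks for, its triple patterns can be emulated over a handful of in-memory triples in Python. This is a sketch only, not a SPARQL engine; the engine node `my:Engine1` is invented for illustration, and a real setup would send the query to a SPARQL endpoint instead.

```python
# Toy graph; the engine node "my:Engine1" is invented for illustration.
triples = {
    ("my:Gorch_Fock", "rdf:type", "my:SailingShip"),
    ("my:Gorch_Fock", "my:propulsion", "my:Engine1"),
    ("my:Engine1", "my:fuelType", "my:Diesel"),
    ("my:Gorch_Fock", "dbprop:shipLength", 81.2),
    ("my:Bounty", "rdf:type", "my:SailingShip"),
}


def objects(subject, predicate):
    """All o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def long_diesel_sailing_ships(min_length=80.0):
    """Sailing ships with a diesel engine and a length of at least min_length."""
    ships = sorted(s for s, p, o in triples if p == "rdf:type" and o == "my:SailingShip")
    result = []
    for ship in ships:
        has_diesel = any(
            "my:Diesel" in objects(engine, "my:fuelType")
            for engine in objects(ship, "my:propulsion")
        )
        long_enough = any(
            float(length) >= min_length for length in objects(ship, "dbprop:shipLength")
        )
        if has_diesel and long_enough:
            result.append(ship)
    return result


print(long_diesel_sailing_ships())
```

Each condition in the function corresponds to one line of the WHERE clause; the length comparison plays the role of the FILTER.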

Linked Data

Linked Open Data cloud










Source http://lod-cloud.net

4 Rules of Linked Data

Use URIs as names for things

Use HTTP URIs so that people can look up those names.

When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

Include links to other URIs, so that they can discover more things.

http://www.w3.org/DesignIssues/LinkedData.html

Linked Data - Content Negotiation

Different views for different data consumers: Browser

Linked Data - Content Negotiation

Different views for different data consumers: Applications
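Content negotiation boils down to the HTTP Accept header: the same URI is requested, but the client states which representation it prefers. The Python sketch below only builds the two requests and does not send them; that the server offers `text/turtle` for this URI is an assumption.

```python
from urllib.request import Request


def negotiated_request(uri, media_type):
    """Build (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


uri = "http://dbpedia.org/resource/Gorch_Fock_(1958)"
browser = negotiated_request(uri, "text/html")         # human-readable view
application = negotiated_request(uri, "text/turtle")   # RDF view (assumed to be offered)
print(browser.get_header("Accept"), application.get_header("Accept"))
```

Passing one of these objects to `urllib.request.urlopen` would perform the actual lookup.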

Linked Data

A dataset is a set of RDF triples that is published, maintained or aggregated by a single provider.

An RDF link is an RDF triple whose subject and object are described in different datasets

A linkset is a collection of such RDF links between two datasets
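These three definitions can be made concrete with a small Python sketch. It assumes, for simplicity, that a dataset "describes" exactly the resources under its own prefix; the triples reuse names from the ship example.

```python
# Simplifying assumption: each dataset describes the resources under its prefix.
DATASETS = {"my": "my:", "dbpedia": "dbpedia:"}

triples = [
    ("my:Gorch_Fock", "my:owner", "my:German_Navy"),
    ("my:Gorch_Fock", "owl:sameAs", "dbpedia:Gorch_Fock_(1958)"),
    ("my:German_Navy", "owl:sameAs", "dbpedia:German_Navy"),
]


def dataset_of(term):
    """Name of the dataset whose prefix the term starts with, else None."""
    for name, prefix in DATASETS.items():
        if isinstance(term, str) and term.startswith(prefix):
            return name
    return None


def linkset(triples, source, target):
    """RDF links whose subject is in `source` and whose object is in `target`."""
    return [
        (s, p, o) for s, p, o in triples
        if dataset_of(s) == source and dataset_of(o) == target
    ]


links = linkset(triples, "my", "dbpedia")
print(links)
```

The `my:owner` triple stays inside one dataset and is therefore not an RDF link; the two `owl:sameAs` triples form the linkset.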

Why go for the fifth star?

Source: http://webofdata.wordpress.com/2011/05/22/why-we-link/

Geonames

Central Contractor Registration (CCR)

Open licences allow republishing and reuse

Motivation for collaboration: high potential that invested effort (data, links, vocabularies, schemas) can be reused

(Effortful) feedback: Users complement data, extend vocabularies and contribute changes. VoCamps for achieving coherence.

Source: Chiarcos, Hellmann, Nordhoff, Towards a Linguistic Linked Open Data cloud: The Open Linguistics Working Group, Traitement Automatique des Langues, to appear

Example DBpedia

Data is extracted from Wikipedia

Wikipedia just publishes the unstructured data

Small DBpedia team creates RDF

Community of stakeholders cleans the data and creates links

Estimate: 10-20% to consolidate community effort

Scalability

Golden Hammer Anti-Pattern

Adequacy

Linked Data for Linguistics


Representation and modelling

Structural interoperability

Integrating distributed resources

Conceptual interoperability

Dynamic Import




Different linguistic subcommunities have developed representation standards, e.g.,

LMF: Lexical Markup Framework (Francopoulo et al. 2009) for lexical-semantic resources

GrAF: Graph Annotation Framework (Ide and Suderman 2007) for annotated corpora

based on labelled directed acyclic graphs (feature structures)

RDF data model: labelled directed (multi-)graphs

Uniform formalism for different resource types

Sublanguages (e.g., RDFS, OWL) allow domain-specific vocabularies to be defined


With different language resources represented in RDF, we can combine both sources of information freely

Cross-resource queries with RDF query languages (e.g., SPARQL)

Given a corpus with WordNet sense annotations (e.g., the Manually Annotated Sub-Corpus MASC; Ide et al. 2010):

Retrieve all sentences that describe locations

i.e., sentences containing a token annotated with a WordNet sense that is a hyponym of location

Difficult to realize with GrAF or LMF
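Once corpus annotations and the lexical resource live in one graph, the location query is a simple traversal. The Python sketch below uses invented miniature data (sense identifiers, hypernym links and sentence IDs are all hypothetical) and assumes a single hypernym per sense, which real WordNet does not guarantee.

```python
# Hypothetical miniature data: sense IDs, hypernym links and annotations are invented.
HYPERNYM = {"wn:city": "wn:region", "wn:region": "wn:location", "wn:dog": "wn:animal"}

# (sentence id, token, WordNet sense of the token)
ANNOTATIONS = [
    ("s1", "Berlin", "wn:city"),
    ("s2", "dog", "wn:dog"),
    ("s3", "area", "wn:region"),
]


def is_hyponym_of(sense, ancestor):
    """Follow hypernym links upwards until `ancestor` or the root is reached."""
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
        if sense == ancestor:
            return True
    return False


def sentences_about(ancestor):
    """IDs of sentences with a token whose sense is a hyponym of `ancestor`."""
    return sorted({sid for sid, _, sense in ANNOTATIONS if is_hyponym_of(sense, ancestor)})


print(sentences_about("wn:location"))
```

In SPARQL the upward walk would typically be expressed in the query itself, so no resource-specific code like this is needed.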

Integrating distributed resources

SPARQL 1.1 federated queries (the SERVICE keyword) allow nested subqueries to run on different repositories

No physical integration of resources in a single database required

Easy to link to centralized repositories of reference terminology, etc.


Resources should specify which vocabulary (e.g., for annotation) they use and how it is defined

By reference to community-maintained terminology repositories, e.g.,

GOLD (Farrar and Langendoen 2010)

ISOcat (Windhouwer and Wright @ LDL-2012)

Can be used, e.g., for disambiguation

If a lexeme in a lexicon has a certain morphosyntactic categorization, we can retrieve all sentences from a corpus with corresponding annotations

e.g., land as a noun, but not as a verb

Dynamic import

Linking resources implemented with URIs, which can be resolved on-the-fly to update and enrich data sets

For a token in a corpus, additional information can be aggregated from different repositories by resolving links (retrieving senses from a lexical-semantic repository or concepts from a terminology repository)

If the information in the target resource was updated since the original annotation was performed, then the updates are available at query time

Inconsistencies can be avoided through versioning

Ecosystem, infrastructure and community

RDF and related standards are maintained by an active and relatively large community

Different fields of application

Libraries, GeoData, BioMed, ...

Established W3C standard and technological infrastructure

Linguistically relevant resources already provided

lexical-semantic resources (e.g., WordNet)

RDF facilitates distributed development, re-using data, and, indirectly, interdisciplinary cooperation

Building a Linguistic Linked Open Data cloud


In LOD cloud

Lexical-semantic resources

Linguistic metadata

Further relevant types for linguistic research:

Annotated corpora

Input and output of NLP tools

Linguistic data bases

Repositories of linguistic terminology


Each single provider has different incentives to use Linked Data and/or RDF

Concepts of RDF and Linked Data have been brought up to solve open problems in different subcommunities of linguistics and neighboring fields

As an illustration, we briefly introduce three examples


Annotated corpora

Underlying problem: structural and conceptual interoperability

Natural Language Processing for the Semantic Web

Underlying problem: NLP output represented in idiosyncratic formalisms, results to be represented in RDF

Typological databases

Underlying problem: globally unique identifiers (not just for languages, but for dialects, language families, etc.)

Annotated corpora

Linked Data and Corpus Interoperability


Linked Data can be used to address interoperability issues of annotated corpora

Corpus: collection of texts developed to analyze language and to develop tools for this purpose
=> Annotated corpora

Different types of annotations, different communities involved, different languages
=> Interoperability challenge




Structural interoperability: interoperable representation formalisms

Conceptual interoperability: reference definitions for linguistic categories and features


Structural Interoperability

Analyses produced by different researchers / NLP tools use different representation formalisms

word annotations (tokens)

span annotations (markables)

tree-like annotations

relational annotations




State-of-the-art approaches

Graph-based data model

Represent data in standoff XML (Ide and Suderman 2007, Chiarcos et al. 2008, Eckart et al. @ LDL)

Presentation of Nancy Ide @ LDL 2012


XML standoff

MASC corpus, GrAF format

Working with XML standoff

How to store, retrieve and query XML standoff data efficiently?

Direct use with XML databases inefficient (Eckart 2008)

Inline XML (e.g., Dipper et al. 2007)

Relational DB formats (e.g., Eckart et al. @ LDL)

RDF as another possibility (e.g., Chiarcos 2012)

Databases are optimized for graph querying

Extensive (open source) infrastructure available


Integration with Linked Data resources

Corpus Interoperability with RDF

Structural Interoperability

e.g. POWLA - http://purl.org/powla

lossless transformation to RDF from standoff XML

Linking to lexical-semantic resources (WordNet)

Conceptual Interoperability

Cross-Linking to terminology repositories (OLiA, GOLD, ISOcat)

Entity-Linking to metadata (Geodata, LOD cloud)

Natural Language Processing Interchange Format: NIF

NLP Interchange Format (NIF)

NIF is an RDF/OWL-based format

Achieve interoperability for:

Output of NLP tools

Linguistic data in RDF

Text documents

Web of Data (LOD cloud)

A Transparent Formalization of Text for Machines


Opaque to machines


The city Berlin is the capital of Germany.

URI: http://example.org/sample#offset_0_42

Universe of discourse is defined as the words over the alphabet of Unicode characters
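The offset-based URI for the example sentence can be reproduced with a short Python sketch. It follows only the pattern visible on the slide (`#offset_begin_end`); the URI recipes in the NIF 1.0 specification may carry more information, so treat the helper `offset_uri` as a hypothetical illustration.

```python
def offset_uri(document_uri, begin, end):
    """Offset-based fragment URI following the pattern shown on the slide."""
    return f"{document_uri}#offset_{begin}_{end}"


text = "The city Berlin is the capital of Germany."
doc = "http://example.org/sample"

# The whole sentence: character offsets 0 to len(text).
whole = offset_uri(doc, 0, len(text))

# A substring gets its own URI in the same way.
begin = text.index("Berlin")
berlin = offset_uri(doc, begin, begin + len("Berlin"))

print(whole)
print(berlin)
```

Every substring of the document thus becomes an addressable resource that annotations from different NLP tools can attach to.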

NLP Interchange Format

Specification for NIF 1.0 (http://nlp2rdf.org/nif-1-0/)

Different implementations (alpha/beta) are available as Open Source (UIMA, GATE ANNIE, Stanford Parser, DBpedia Spotlight)

Mailing list available at http://nlp2rdf.org

Demo: http://nlp2rdf.lod2.eu/demo.php

Poster during the poster session Thursday 13:00-14:30

Typological databases: Glottolog/Langdoc

Glottolog/Langdoc

Two subprojects

Glottolog provides identifiers and additional information for 100k languoids (languages, dialects, families)

main competitor projects: ISO 639-3/Ethnologue, Multitree

Langdoc provides identifiers and additional information for 180k references

main competitor project: OLAC

Problems to address

existing identifiers are not granular enough (ISO 639-3: 7k)

existing identifiers have unclear reference (Multitree altc refers to both Micro-Altaic and Macro-Altaic)

existing identifiers have no verifiable empirical basis

Solutions

21k identifiers for main tree

total of 104k identifiers for all nodes of Multitree trees

RDF

glottolog:12345 gl:sublanguoid glottolog:41202 .
glottolog:12345 gl:superlanguoid glottolog:94211 .
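How such statements can be consumed is sketched below in Python. The parser handles only the simple "subject predicate object ." form shown on the slide, not full Turtle, and the helper names are hypothetical.

```python
# The two statements from the slide, one per line.
DATA = """\
glottolog:12345 gl:sublanguoid glottolog:41202 .
glottolog:12345 gl:superlanguoid glottolog:94211 .
"""


def parse(text):
    """Parse whitespace-separated 'subject predicate object .' lines."""
    triples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[3] == ".":
            triples.append(tuple(parts[:3]))
    return triples


def objects(triples, subject, predicate):
    """All o such that (subject, predicate, o) was parsed."""
    return [o for s, p, o in triples if s == subject and p == predicate]


triples = parse(DATA)
print(objects(triples, "glottolog:12345", "gl:sublanguoid"))
print(objects(triples, "glottolog:12345", "gl:superlanguoid"))
```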

Langdoc

180k references to literature treating (mostly) lesser-known languages

annotated for language, document type, macro-area

limited full text indexing

Example query: give me any grammar or grammar sketch of an Afro-Asiatic language spoken in Eurasia where the word 'dual' occurs in the text

RDF

glottolog:12345 gl:immediatelydescribedin langdoc:23456 .

Position of G/L in the LLOD cloud

Availability

XHTML: http://glottolog.livingsources.org

RDF: http://glottolog.livingsources.org/sparql

Outlook


From OWLG to DGfS

The Open Knowledge Foundation Working Group for Open Data in Linguistics (OKFN-OWLG) was founded in late 2010

We first established a series of meetings and a mailing list

Build the structure, create momentum

Two workshops: OKCon 2011 in Berlin, this workshop

This afternoon: Christian Kreutz presents the OKFN

Building the Linguistic Linked Data Cloud

This workshop

Exploratory workshop

Chart domains as to the amount and kind of data which can be integrated into the LLOD-cloud

increase coverage: more domains

increase density: more links between resources

increase discussion between independent subcommunities

This workshop

Spread the word

http://linguistics.okfn.org/

[email protected]

poster at DGfS-CL session on Thursday

Start of this workshop, first talk: Declerck et al., Towards Linked Language Data (LLD) for Digital Humanities

We would like to thank

MPI

Springer

LOD2
