Introduction to Linked Data
sadl.kuleuven.be/docs/smeSpire_training...
2 Modules
1. Introduction
2. From data to information to knowledge
3. Towards the semantic web
4. Linked Data basics
5. Publishing Linked Data
6. Linked Data usage
7. Linked Data vs. Open Data
3 After the training you will be able to:
• Identify and describe the concepts of semantic web and linked data
• Identify and describe the building blocks of Linked Data
• Identify the different steps in publishing linked data
• Understand how linked data can be consumed
• Explain the difference between linked and open data
• Understand the benefits of linked data
4 Target Audience
This seminar is aimed at:
• managers, ICT strategists and professionals who need a basic understanding of Linked Data.
Prior knowledge:
• no explicit prerequisites.
5
Part 1
Introduction
6 Search for data - Music
7 Search for data - Sports
8 Search for data – Emergency Response
9 Search for data – Emergency Response
10 Search for data – Emergency Response
11 How to build such an application?
Approach 1: Site editors roam the Web for new facts
• They update the site manually
• And the site soon gets out of date
Approach 2: “Scrape” the sites with a program to extract the information
• i.e. write some code to incorporate the new data
• Easily gets out of date again…
Approach 3: Write some code to incorporate the new data via APIs
• Easily gets out of date again…
12 How to build such a site?
Use external, public datasets • Wikipedia, MusicBrainz, …
They are available as data
• not API-s or hidden on a Web site
• data can be extracted using, e.g., HTTP requests or standard queries
13
Part 2
From data to information to knowledge
14 Proper meaning of terms
Data, information & knowledge
Is there any distinction? What do these terms mean and how do they relate to each other?
15 Data
Data are a kind of raw material and can be considered as facts of the world.
Data can be a group of symbols, numbers, or writing.
16 Information
When data have been processed, they become information. Information is basically a framework of data that has a useful meaning for someone who reads it.
Trendanalyzer, Hans Rosling
17 Knowledge
Knowledge is information that has been transformed into something that prompts people to act. It is the understanding of the rules needed to interpret information.
Using the previous example:
Hans Rosling could use his information to provide new insights on population growth
18 Distribution of data
The web as it is today:
• Data is delivered to us in the form of web pages - HTML with separate download links or web applications.
• Documents that are linked to each other through the use of hyperlinks.
• Humans or machines can read these documents, but machines have difficulty extracting any meaning from these documents themselves.
19 Use of applications
20 Data on the web
• There are more and more data on the Web
government data, health related data, general knowledge, company information, flight information, restaurants,…
• More and more applications rely on the availability of that data
21 But… data are often in isolation, “silos”
22
Part 3
Towards the semantic web
23 Imagine…
A “Web” where
• documents are available for download on the Internet
• but there would be no hyperlinks among them
The problem is real !!!
24 Data on the web is not enough…
We need a proper infrastructure for a real Web of Data:
• data is available on the Web, accessible via standard Web technologies
• data are interlinked over the Web, i.e. data can be integrated over the Web
This is where Semantic Web technologies come in !
25 I.e.,… connect the silos
26 Semantic web
The semantic web is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language, but also in a format that can be read and used by software agents, thus permitting them to find, share and integrate information more easily.
27 Web of data
The Web of Data is about enabling the access to this data, by making it available in machine-readable formats and connecting it using Uniform Resource Identifiers (URIs), thus enabling people and machines to collect the data, and put it together to do all kinds of things with it (permitted by the licence).
Machine-readable data (or metadata) is data in a format that can be interpreted by a computer.
2 types of machine-readable data:
• Human-readable data that is marked up so that it can also be understood by computers, e.g. microformats, RDFa;
• Data formats intended principally for computers, e.g. RDF, XML and JSON.
See also:
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
http://linkeddatabook.com/editions/1.0/
28 Semantic web stack
29 Linked Data
• “A way of making the Semantic Web happen” (it is hoped)
• Key concept: leverage the existence of structured data and combine it with the languages and infrastructures of the Web and the Semantic Web
https://en.wikipedia.org/wiki/Marie_Curie
http://dbpedia.org/page/Marie_Curie
30 Linked Data
31 Linked Data
“Linked data is a set of design principles for sharing machine-readable data on the Web for use by public administrations, business and citizens.”
The four design principles of Linked Data (by Tim Berners-Lee):
1. Use Uniform Resource Identifiers (URIs) as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs so that they can discover more things.
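Principles 2 and 3 can be sketched as an HTTP lookup that asks for RDF rather than HTML. The snippet below is only an illustration: it prepares the request without sending it, the helper name `build_lookup_request` is hypothetical, the DBpedia resource URI for Marie Curie is assumed from the URI pattern used elsewhere in these slides, and whether a given server honours the `Accept` header depends on that server.

```python
from urllib.request import Request


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET for a thing's URI, asking for RDF."""
    # The Accept header tells the server which representation the client prefers.
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Assumed URI, following the http://dbpedia.org/resource/... pattern.
req = build_lookup_request("http://dbpedia.org/resource/Marie_Curie")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup; that step is left out here so the sketch runs without network access.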
32 The four principles in practice
33 Linked Open Data Cloud
34
Part 4
Linked data basics
35 Core components
Semantic technologies:
• URIs for naming things
• RDF for modelling data
• SPARQL for querying
• OWL for modelling concepts or ontologies
36
Linked data basics: URIs
37 Uniform Resource Identifier (URI)
“A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.”
• A person, e.g. Albert Einstein
http://dbpedia.org/resource/Albert_Einstein
• A country, e.g. Belgium
http://dbpedia.org/resource/Belgium
• A world heritage site, e.g. the Acropolis of Athens
http://dbpedia.org/resource/Acropolis_of_Athens
• A dataset, e.g. Fertility Indicators
http://open-data.europa.eu/en/data/dataset/03YMULVqadXL7IO6JZiBkQ
See also: http://www.slideshare.net/OpenDataSupport/design-and-manage-persitent-uris
38 Identify data items
[Figure: RDF graph. pd:cygri is a foaf:Person with foaf:name “Richard Cyganiak” and is foaf:based_near dbpedia:Berlin.]
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
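The prefixed names above are shorthand for full URIs. A minimal sketch of that expansion follows; the `pd` and `dbpedia` prefixes come from the slide, the `foaf` namespace is the standard FOAF one, and the function name `expand` is an illustrative choice.

```python
PREFIXES = {
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(name: str, prefixes: dict = PREFIXES) -> str:
    """Turn a prefixed name such as 'pd:cygri' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


print(expand("pd:cygri"))
print(expand("dbpedia:Berlin"))
```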
39 Resolving URIs over the web
[Figure: the graph of pd:cygri (a foaf:Person, foaf:name “Richard Cyganiak”, foaf:based_near dbpedia:Berlin) extended by resolving dbpedia:Berlin, which adds dp:population 3.405.259 and skos:subject dp:Cities_in_Germany.]
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
40 Dereferencing URIs over the web
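[Figure: the graph extended further by dereferencing dp:Cities_in_Germany. pd:cygri has rdf:type foaf:Person, foaf:name “Richard Cyganiak” and foaf:based_near dbpedia:Berlin; dbpedia:Berlin has dp:population 3.405.259; dbpedia:Berlin, dbpedia:Hamburg and dbpedia:Muenchen each have skos:subject dp:Cities_in_Germany.]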
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
41
Linked data basics: RDF
42 Resource Description Framework (RDF and RDFS)
RDF stands for:
– Resource: everything that can have a unique identifier (URI), e.g. pages, places, people, dogs, products...
– Description: attributes, features, and relations of the resources
– Framework: model, languages and syntaxes for these descriptions
• RDF was published as a W3C recommendation in 1999.
• RDF was originally introduced as a data model for metadata.
• RDF was generalised to cover knowledge of all kinds.
43 Resource Description Framework (RDF and RDFS)
• The model is domain-neutral, application-neutral
• The model can be viewed as directed, labeled graphs or as an object-oriented model (object/attribute/value)
RDF data model is an abstract, conceptual layer independent of XML
consequently, XML is a transfer syntax for RDF, not a component of RDF
RDF data might never occur in XML form
44 Resource Description Framework (RDF and RDFS)
RDF breaks every piece of information down in triples:
• Subject – a resource, which may be identified with a URI.
• Predicate – a URI that identifies the relationship, ideally reused from an existing vocabulary.
• Object – a resource or literal to which the subject is related.
SPARQL is a standardised language for querying RDF data.
http://dbpedia.org/resource/Brussels is the capital of “Belgium”. OR
http://dbpedia.org/resource/Brussels is the capital of http://dbpedia.org/resource/Belgium.
Subject Predicate Object
See also: http://www.slideshare.net/OpenDataSupport/introduction-to-rdf-sparql
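The two Brussels statements above differ only in their object: one ends in a literal, the other in a URI that can itself be looked up. A minimal sketch of that distinction, assuming a hypothetical predicate URI (`http://example.org/vocab/isCapitalOf`) since the slide gives the predicate only in words:

```python
from typing import NamedTuple, Union


class URI(str):
    """A resource identified by a URI."""


class Literal(str):
    """A plain value such as a name written as text."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


# Hypothetical predicate URI, used only for illustration.
CAPITAL_OF = URI("http://example.org/vocab/isCapitalOf")
BRUSSELS = URI("http://dbpedia.org/resource/Brussels")

as_literal = Triple(BRUSSELS, CAPITAL_OF, Literal("Belgium"))
as_link = Triple(BRUSSELS, CAPITAL_OF, URI("http://dbpedia.org/resource/Belgium"))


def can_follow(triple: Triple) -> bool:
    """Only a URI object can be dereferenced to discover more data."""
    return isinstance(triple.obj, URI)


print(can_follow(as_literal), can_follow(as_link))
```

The second form is what makes the data "linked": the object is another resource with its own description.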
45 RDF model example
[Figure: RDF graph describing the resource http://www.w3.org/TR/REC-rdf-syntax/]
dc:Creator “Ora Lassila”
dc:Date “1999-02-22”
dc:Publisher “W3C”
46 RDF model example
• Nike, Dahliastraat 24, 2160 Wommelgem
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rov="http://www.w3.org/ns/regorg#"
    xmlns:org="http://www.w3.org/ns/org#"
    xmlns:locn="http://www.w3.org/ns/locn#">
  <rov:RegisteredOrganization rdf:about="http://example.com/org/2172798119">
    <rov:legalName>Nike</rov:legalName>
    <org:hasRegisteredSite rdf:resource="http://example.com/site/1234"/>
  </rov:RegisteredOrganization>
  <locn:Address rdf:about="http://example.com/site/1234">
    <locn:fullAddress>Dahliastraat 24, 2160 Wommelgem</locn:fullAddress>
  </locn:Address>
</rdf:RDF>
47 RDF serialization formats
• RDF/XML: the XML syntax for RDF and the first serialisation standardised by W3C. Turtle, N-Triples and JSON-LD have since become W3C Recommendations as well (2014).
• N3 (Notation 3): a non-XML serialisation of RDF models designed to be easier to write by hand, and in some cases easier to follow.
• Turtle (Terse RDF Triple Language): a format for expressing data in the RDF data model with a syntax similar to SPARQL.
• N-Triples: a line-based, plain text serialisation format for RDF graphs, and a subset of the Turtle format.
• JSON-LD: an RDF serialisation based on JSON (JavaScript Object Notation), an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
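To make the line-based nature of N-Triples concrete, here is a minimal sketch that writes the “Ora Lassila” example graph as N-Triples. It is not a complete serialiser (no language tags, datatypes or blank nodes), and the Dublin Core property URIs under `http://purl.org/dc/elements/1.1/` are an assumption about what the `dc:` prefix on the slide stands for.

```python
DC = "http://purl.org/dc/elements/1.1/"  # assumed expansion of the dc: prefix
DOC = ("uri", "http://www.w3.org/TR/REC-rdf-syntax/")

GRAPH = [
    (DOC, ("uri", DC + "creator"), ("literal", "Ora Lassila")),
    (DOC, ("uri", DC + "date"), ("literal", "1999-02-22")),
    (DOC, ("uri", DC + "publisher"), ("literal", "W3C")),
]


def nt_term(term):
    """Render one term: URIs in angle brackets, literals in escaped quotes."""
    kind, value = term
    if kind == "uri":
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(triples):
    """One triple per line, each terminated by ' .'."""
    return "\n".join(
        f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ." for s, p, o in triples
    )


print(to_ntriples(GRAPH))
```

Because every line is a complete statement, N-Triples files can be split, concatenated and streamed without a surrounding document structure.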
48
Linked data basics: SPARQL
49 SPARQL
SPARQL is the standard language to query graph data represented as RDF triples.
• SPARQL Protocol and RDF Query Language
• One of the three core standards of the Semantic Web, along with RDF and OWL.
• SPARQL can be used to query and update RDF data.
• Became a W3C Recommendation in January 2008.
• SPARQL 1.1 became a W3C Recommendation in March 2013.
50 Query types
• SELECT: Return a table of all X, Y, etc. satisfying the following conditions ...
• CONSTRUCT: Find all X, Y, etc. satisfying the following conditions ... and substitute them into the following template in order to generate (possibly new) RDF statements, creating a new graph.
• DESCRIBE: Find all statements in the dataset that provide information about the following resource(s) ... (identified by name or description)
• ASK: Are there any X, Y, etc. satisfying the following conditions ...
51 Query example
Return the legal name of the organisation whose registered site has a particular URI.
Prefixes (used by both the sample data and the query)
PREFIX comp: <http://example.org/org/>
PREFIX site: <http://example.org/site/>
PREFIX rov: <http://www.w3.org/ns/regorg#>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX locn: <http://www.w3.org/ns/locn#>
Sample data
comp:A rov:legalName "Niké" .
comp:A org:hasRegisteredSite site:1234 .
comp:B rov:legalName "BARCO" .
site:1234 locn:fullAddress "Dahliastraat 24, 2160 Wommelgem" .
Query
SELECT ?name
WHERE {
  ?x org:hasRegisteredSite site:1234 .
  ?x rov:legalName ?name .
}
Result
name
"Niké"
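What a SPARQL engine does with this query can be sketched as pattern matching over the triples: each pattern either binds a variable or must agree with an earlier binding. The code below is a toy illustration, not a SPARQL implementation; prefixed names are kept as plain strings, the property name `rov:legalName` follows the Registered Organization Vocabulary, and `match`, `select` and `ask` are hypothetical helper names.

```python
TRIPLES = [
    ("comp:A", "rov:legalName", "Niké"),
    ("comp:A", "org:hasRegisteredSite", "site:1234"),
    ("comp:B", "rov:legalName", "BARCO"),
    ("site:1234", "locn:fullAddress", "Dahliastraat 24, 2160 Wommelgem"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def select(variables, patterns, triples):
    """Return one row per way of satisfying all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return [tuple(b[v] for v in variables) for b in solutions]


def ask(patterns, triples):
    """ASK form: is there at least one solution?"""
    return bool(select([], patterns, triples))


WHERE = [
    ("?x", "org:hasRegisteredSite", "site:1234"),
    ("?x", "rov:legalName", "?name"),
]
rows = select(["?name"], WHERE, TRIPLES)
print(rows)
```

The shared variable `?x` is what joins the two patterns: only `comp:A` has both a registered site `site:1234` and a legal name.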
52
Linked data basics: RDF Schema
53 RDF schema
• First step towards the “extra knowledge”: terms, restrictions, relationships…
• Defines a small vocabulary for RDF:
o Class, subClassOf, type
o Property, subPropertyOf
o domain, range
• Vocabulary can be used to define other vocabularies for your application domain
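[Figure: class hierarchy in which Student and Researcher are each subClassOf Person; a property hasSuperVisor with a domain and a range; two instances, Jeen and Frank, each with a type and connected by hasSuperVisor.]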
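The practical effect of subClassOf is that type information propagates up the hierarchy. A minimal sketch, assuming (the extracted figure does not make this explicit) that Jeen is typed as Student and Frank as Researcher, and assuming a hierarchy with a single parent per class and no cycles:

```python
# Class hierarchy from the slide: Student and Researcher are subclasses of Person.
SUBCLASS_OF = {"Student": "Person", "Researcher": "Person"}

# Assumed instance typing; the figure text does not state which is which.
TYPES = {"Jeen": "Student", "Frank": "Researcher"}


def all_types(individual):
    """Asserted type first, then every superclass reached via subClassOf."""
    types = []
    cls = TYPES.get(individual)
    while cls is not None:
        types.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return types


print(all_types("Jeen"))
print(all_types("Frank"))
```

Nobody stated that Jeen is a Person; the schema lets a program conclude it.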
54 RDF schema in XML
55 RDFS constraints
• RDFS is a framework allowing:
– typing and subtyping
– properties to be put in a hierarchy
– datatypes to be defined
• RDFS is sufficient for many vocabularies, but not for all!
• Complex applications may want more possibilities. Can a program reason about some terms?
E.g.: “if «Person» resources «A» and «B» have the same «foaf:email» property, then «A» and «B» are identical”
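The e-mail rule quoted above goes beyond what RDFS can express, but the reasoning itself is simple to sketch: group resources by the value of the identifying property and treat any group with more than one member as the same individual. The URIs and addresses below are made up for illustration.

```python
from collections import defaultdict

# (person URI, e-mail) pairs; all values are hypothetical examples.
PEOPLE = [
    ("http://example.org/people/a", "jane@example.org"),
    ("http://example.org/staff/17", "jane@example.org"),
    ("http://example.org/people/b", "joe@example.org"),
]


def identical_by_email(people):
    """Return groups of URIs that share an e-mail and are therefore identical."""
    by_email = defaultdict(list)
    for uri, email in people:
        by_email[email].append(uri)
    return [tuple(uris) for uris in by_email.values() if len(uris) > 1]


print(identical_by_email(PEOPLE))
```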
56
Linked data basics: OWL and ontologies
57 OWL and ontologies
• OWL = Web Ontology Language
• An ontology is a conceptual model.
• An Ontology is the collection of semantic definitions for a domain.
– Example: an Aircraft Ontology is the set of semantic definitions for the Aircraft domain, e.g.
• Predator is a subClassOf Aircraft.
• sensorID is a FunctionalProperty.
• Platform is an equivalentClass to Aircraft.
– Predator, Aircraft etc. are concepts.
• OWL is complex. It is a large set of additional terms and allows for logical axioms.
58 Basic idea of conceptual modeling
The semiotic triangle (not only in Semantic Web)
59 Ontologies
Communities of users (application builders, ...) can
• Search for ontologies
• Re-use existing ontologies
– Established domain-specific ontologies (e.g., real-estate, medicine, bioinformatics)
– “The big one”: Cyc, see www.cyc.com
• Link to existing ontologies
• Extend existing ontologies
60 Ontologies as conceptual model
Or Database (knowledge base) = Ontology + Instances
BookCatalogue
title                  | author          | date
My Life and Times      | Paul McCartney  | June, 1998
Illusions              | Richard Bach    | 1972
First and Last Freedom | J. Krishnamurti | 1974
<owl:Class rdf:ID="BookCatalogue"/>
<owl:DatatypeProperty rdf:ID="title">
<rdfs:domain rdf:resource="#BookCatalogue"/>
<rdfs:range rdf:resource="&xsd;#string"/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="author">
<rdfs:domain rdf:resource="#BookCatalogue"/>
<rdfs:range rdf:resource="&xsd;#string"/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="date">
<rdfs:domain rdf:resource="#BookCatalogue"/>
<rdfs:range rdf:resource="&xsd;#date"/>
</owl:DatatypeProperty>
<?xml version="1.0"?>
<BookCatalogue>
<title>My Life and Times</title>
<author>Paul McCartney</author>
<date>June, 1998</date>
</BookCatalogue>
61 Popular ontologies or vocabularies
Friend-of-a-Friend (FOAF): vocabulary for describing people
Core Person Vocabulary: vocabulary to describe the fundamental characteristics of a person, e.g. the name, the gender, the date of birth...
DOAP: vocabulary for describing projects
ADMS.SW: vocabulary for describing open source software projects
ADMS: vocabulary for describing interoperability assets
Dublin Core: defines general metadata attributes
Registered Organisation Vocabulary: vocabulary for describing organizations, typically in a national or regional register
Organization Ontology: for describing the structure of organizations
Core Location Vocabulary: vocabulary capturing the fundamental characteristics of a location
Core Public Service Vocabulary: vocabulary capturing the fundamental characteristics of a service offered by public administration
schema.org: agreed vocabularies for publishing structured data on the Web, elaborated by Google, Yahoo and Microsoft
See also: http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/CommonVocabularies
62 Typical usage of owl:sameAs
Linking from one data set (DBpedia) to the other (Geonames):
This is a major mechanism of “Linking” in the Linked Open Data project
<http://dbpedia.org/resource/Amsterdam>
owl:sameAs <http://sws.geonames.org/2759793>;
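An application consuming both data sets can use such a link to pool what each source says about the same thing. The sketch below groups URIs connected by owl:sameAs (a small union-find) and merges their property dictionaries; the property names and values are placeholders, not real DBpedia or Geonames data.

```python
DBPEDIA = "http://dbpedia.org/resource/Amsterdam"
GEONAMES = "http://sws.geonames.org/2759793"

SAME_AS = [(DBPEDIA, GEONAMES)]

# Placeholder descriptions; property names and values are illustrative only.
DESCRIPTIONS = {
    DBPEDIA: {"ex:fromDBpedia": "a value only DBpedia has"},
    GEONAMES: {"ex:fromGeonames": "a value only Geonames has"},
}


def find(parent, uri):
    """Follow parent links to the representative URI of a sameAs group."""
    while parent.get(uri, uri) != uri:
        uri = parent[uri]
    return uri


def merge_same_as(descriptions, same_as):
    """Merge the descriptions of all URIs connected by sameAs links."""
    parent = {}
    for a, b in same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    merged = {}
    for uri, props in descriptions.items():
        merged.setdefault(find(parent, uri), {}).update(props)
    return merged


merged = merge_same_as(DESCRIPTIONS, SAME_AS)
print(merged)
```

If two sources give different values for the same property, this sketch simply keeps the last one seen; a real application would need a policy for such conflicts.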
63
Part 5
Publishing Linked Data
64 5-star schema of Linked (Open) Data
• Make your stuff available on the Web (whatever format) under an open license.
• Make it available as structured data (e.g., Excel instead of image scan of a table)
• Use non-proprietary formats (e.g., CSV instead of Excel)
• Use URIs to denote things, so that people can point at your stuff
• Link your data to other data to provide context
65
★ Make your stuff available on the Web under an open licence
66
Pros & cons of ★ open data
As a consumer...
• You can look at it.
• You can store it locally.
• You can enter the data into any other system.
• You can change the data.
• You can share the data with anyone.
As a publisher...
• It is simple to publish.
• You do not have to explain repeatedly to others that they can use your data.
67
★ ★ Make it available as structured data
68
Pros & cons of ★ ★ open data
All the benefits of ★ open data; plus
As a consumer...
• You can directly process it with proprietary software to aggregate it, perform calculations, visualise it, etc.
• You can export it into another (structured) format.
As a publisher...
• It is still simple to publish.
69
★ ★ ★ Use non-proprietary formats
• Proprietary: Excel, Word, PDF...
• Non-proprietary: XML, CSV, RDF, JSON, ODF...
• Example: Road safety - Accidents 2006
70
Pros & cons of ★ ★ ★ open data
All the benefits of ★ ★ open data; plus
As a consumer...
• You can manipulate the data in any way you like, without being confined by the capabilities of any particular software.
As a publisher...
• It is still simple to publish.
- But you do need converters or plug-ins to export the data from the proprietary format.
71
★ ★ ★ ★ Use URIs to denote things
For example, creating a URI for one of the units of the Greek Ministry of Administrative Reform and e-Governance.
See also: http://www.slideshare.net/OpenDataSupport/design-and-manage-persitent-uris
http://org.testproject.eu/id/office/office-of-the-deputy-minister-for-administrative-reform-and-e-governance
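A URI like the one above can be minted mechanically from a name. The sketch below shows one possible convention (lower-case, hyphen-separated) that reproduces the example; the office name is inferred from the URI itself, the function name `mint_uri` is hypothetical, and a real scheme would also need rules for non-ASCII characters and for name clashes, both of which this sketch ignores.

```python
import re

BASE = "http://org.testproject.eu/id/office/"


def mint_uri(name: str, base: str = BASE) -> str:
    """Build a URI from a name: lower-case it and join the words with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return base + slug


# Office name inferred from the example URI on the slide.
print(mint_uri("Office of the Deputy Minister for Administrative Reform and e-Governance"))
```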
72
Pros & cons of ★ ★ ★ ★ open data
All the benefits of ★ ★ ★ open data; plus
As a consumer...
• You can link to it from any other place.
• You can bookmark it.
• You can reuse parts of the data.
• You may be able to reuse existing tools and libraries.
• You can combine the data safely with other data.
As a publisher...
• You have fine-grained control over the data items and can optimise their access.
• Other data publishers can now link into your data, promoting it to 5 star.
• You will be able to reuse vocabularies, data and metadata, and URI design patterns instead of creating them from scratch.
- But you typically need to invest some time in slicing and dicing your data.
- But understanding the technology requires effort and can have a steep learning curve.
73
★ ★ ★ ★ ★ Link your data to other data to provide context
74
Pros & cons of ★ ★ ★ ★ ★ open data
All the benefits of ★ ★ ★ ★ open data; plus
As a consumer...
• You can discover more (related) data while consuming the data.
• You can directly learn about the data schema.
• You can combine data from different sources, be innovative, gain new knowledge, be an entrepreneur...
As a publisher...
• You make your data discoverable.
• You increase the context, expressivity, quality and value of your data (and consequently you give visibility to your organisation).
- This requires an investment in time, money, technology and competencies/skills.
- But you now have to deal with broken data links. Not all publishers/data sources will be reliable.
75
Part 6
Linked Data usage
Storing, accessing, combining and inferencing SW data
76 How is LD stored?
• Simple standalone RDF files
• In “Semantic Web / LOD databases”: triplestores
– A triplestore is a purpose-built database for the storage and retrieval of Resource Description Framework (RDF) metadata.
– A triplestore can store many RDF triples (up to billions).
– For a list of implementations, see http://en.wikipedia.org/wiki/Triplestore
77 How is LD accessed?
– By search engines that can extract the markup from Web pages
• e.g., Google
– By search engines that directly access triplestores
• e.g. Sindice
– By your own applications that directly access triplestores
• Obviously, data can then also be transformed into RDF (e.g. RDFa) or into human-readable web pages, see the following for an example
78 Example RDF->HTML
79 How can LD be combined?
What does the combination/integration of this information require?
– “Linkability” at the technical level: see Linked Data principles
– “Linkability” at the semantic level of identity: sameAs
– “Linkability” at the semantic level of more complex relationships: schema / ontology matching
80 Inferencing
Inference is the act or process of deriving logical conclusions from premises known or assumed to be true.
Deductive reasoning
• All swans are white.
• Tilly is a swan.
Tilly is white.
• Truth-preserving!
Inductive reasoning
• Tilly and Edda and Edwin and … are swans.
• Tilly and Edda and Edwin and … are white.
All swans are white.
• “Bringing new knowledge into the world”
81 OWL properties can be…
– Functional
– Inverse functional
– The inverse of another property (inverseOf)
– Transitive
– Symmetric
– Asymmetric
– Reflexive
– Irreflexive
… and this allows for inferences on individuals
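How such property characteristics lead to inferences on individuals can be sketched for two of them. Given the asserted pairs of a property, a symmetric property also holds in the reverse direction, and a transitive property also holds across chains. The property names and individuals below are hypothetical.

```python
def symmetric_closure(pairs):
    """If (a, b) holds for a symmetric property, (b, a) holds too."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """If (a, b) and (b, c) hold for a transitive property, so does (a, c)."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


# Hypothetical transitive property "partOf" and symmetric property "adjacentTo".
part_of = transitive_closure({("room", "building"), ("building", "campus")})
adjacent_to = symmetric_closure({("room", "corridor")})
print(sorted(part_of))
print(sorted(adjacent_to))
```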
82 Inference example
Class(a:bus_driver complete intersectionOf(
a:person
restriction(a:drives someValuesFrom (a:bus))))
Class(a:driver complete intersectionOf(
a:person
restriction(a:drives someValuesFrom(a:vehicle))))
Class(a:bus partial a:vehicle)
Conclusion: bus drivers are drivers.
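The reasoning behind this conclusion can be sketched for this one pattern of definition: a class defined as "a base class that has some value of a property in a filler class". One such class is subsumed by another when its base and filler are subclasses of the other's and the property is the same. This is a toy check for that single pattern, not an OWL reasoner, and the function names are illustrative.

```python
# Class(a:bus partial a:vehicle): every bus is a vehicle.
SUBCLASS_OF = {"bus": "vehicle"}

# Complete definitions: (base class, property, filler of someValuesFrom).
DEFINITIONS = {
    "bus_driver": ("person", "drives", "bus"),
    "driver": ("person", "drives", "vehicle"),
}


def is_named_subclass(sub, sup):
    """True if sub equals sup or reaches it through subclass links."""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUBCLASS_OF.get(sub)
    return False


def is_subsumed_by(sub, sup):
    """Compare two defined classes component by component."""
    base1, prop1, filler1 = DEFINITIONS[sub]
    base2, prop2, filler2 = DEFINITIONS[sup]
    return (
        is_named_subclass(base1, base2)
        and prop1 == prop2
        and is_named_subclass(filler1, filler2)
    )


print(is_subsumed_by("bus_driver", "driver"))
print(is_subsumed_by("driver", "bus_driver"))
```

The reverse check fails because a vehicle is not necessarily a bus, so not every driver is a bus driver.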
83 Use of ontology concepts in LOD
84
Part 7
Linked Data vs. Open Data
85 Opening Data
• Increasing trend for all things ‘open’ in Europe.
• Driven by two very different motives:
– Need for transparency of administration
– Fuel for economic growth
• Removes many obstacles to sharing data
• Inspired by USA
86 What is Open Data (OD)?
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” --opendefinition.org
In summary, this means the following:
• Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
• Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets.
• Universal Participation: everyone must be able to use, reuse and redistribute - there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
87 What is Open Government Data (OGD)?
Open government data means:
– Data produced or commissioned by government or government controlled entities.
– Data which is open as defined in the Open Definition – that is, it can be freely used, reused and redistributed by anyone.
– Data that is not sensitive or private.
Source:[http://data.gov.uk/data]
Source:[http://publicdata.eu/]
88 Expected benefits of OGD
Transparency. Citizens need to know what their government is doing. They need to be able freely to access government data and information and to share that information with other citizens. Sharing and reuse allows analysing and visualising to create more understanding.
Releasing social and commercial value. Data is a key resource for social and commercial activities. Government creates or holds a large amount of information. Open government data can help drive the creation of innovative business and services that deliver social and commercial value.
Participatory governance. Open Data enables citizens to be much more directly informed and involved in decision-making, and facilitates their contribution to the process of governance.
Reducing government costs. Open Data enables the sharing of information within governments in machine-readable interoperable formats, hence reducing costs of information exchange and data integration. Governments themselves are the biggest reusers of Open Government Data.
89 Linked Data vs Open Data
Open data
Data can be published and be publicly available under an open licence without linking to other data sources.
Linked data
Data can be linked to URIs from other data sources, using open standards such as RDF without being publicly available under an open licence.
See also: Cobden et al., A research agenda for Linked Closed Data http://ceur-ws.org/Vol-782/CobdenEtAl_COLD2011.pdf
90
Examples of Linked data initiatives
91
92
DE – Bibliotheksverbund Bayern
Linked data from 180 academic libraries in Bavaria, Berlin and Brandenburg.
IT – Agenzia per l’Italia Digitale
Three datasets published as linked data: the Index of Public Administration, the SPC contracts for web services and conduction systems and the Classifications for the data in Public Administration.
NL – Building and address register
The Dutch Address and Buildings base register published as linked data.
UK – Ordnance Survey
Three OS Open Data products published as linked data: the 1:50 000 Scale Gazetteer, Code-Point Open and the administrative geography taken from Boundary Line.
UK – Companies House
Publishing basic company details as linked data using a simple URI for each company in their database.
93
Non-governmental applications
94 Conclusions
• Linked data is a set of design principles for sharing machine-readable data on the Web.
• Linked data and open data are not the same.
• URIs, RDF, OWL and SPARQL form the foundational layer for Linked data.
• Linked data offers a number of advantages:
o Data integration with small impact on legacy systems;
o Semantic interoperability;
o Creativity and innovation through context and knowledge creation.
95 Discussion time
96 Grateful thanks and acknowledgements to
• W3C
– Introduction to the Semantic Web (2011 Semantic Technologies Conference, 6th of June, 2011, San Francisco, CA, USA Ivan Herman)
• European Commission
– Open Data Support training modules (https://joinup.ec.europa.eu/community/ods/document/online-training-material)
• Bettina Berendt, KU Leuven
– Course Knowledge and the Web, 1st semester 2013/2014, http://www.cs.kuleuven.be/~berendt/teaching/
• Bart van Leeuwen, Netage.NL
– http://www.slideshare.net/semanticfire