Consuming Linked Data SemTech2010
Consuming Linked Data
Juan F. Sequeda
Department of Computer Science
University of Texas at Austin
SemTech 2010
How many people are familiar with
• RDF
• SPARQL
• Linked Data
• Web Architecture (HTTP, etc.)
History
• Linked Data Design Issues by TimBL July 2006
• Linked Open Data Project WWW2007
• First LOD Cloud May 2007
• 1st Linked Data on the Web Workshop WWW2008
• 1st Triplification Challenge 2008
• How to Publish Linked Data Tutorial ISWC2008
• BBC publishes Linked Data 2008
• 2nd Linked Data on the Web Workshop WWW2009
• NY Times announcement SemTech2009 - ISWC09
• 1st Linked Data-a-thon ISWC2009
• 1st How to Consume Linked Data Tutorial ISWC2009
• Data.gov.uk publishes Linked Data 2010
• 2nd How to Consume Linked Data Tutorial WWW2010
• 1st International Workshop on Consuming Linked Data COLD2010
• …
[LOD cloud diagrams showing growth: May 2007, Oct 2007, Nov 2007, Feb 2008, Mar 2008, Sept 2008, Mar 2009, July 2009, June 2010]
YOU GET THE PICTURE
IT'S BIG and getting BIGGER and BIGGER
Now what can we do with this data?
Let’s consume it!
The Modigliani Test
• Show me all the locations of all the original paintings of Modigliani
• Daniel Koller (@dakoller) showed that you can find this with a SPARQL query on DBpedia
Thanks Richard MacManus - ReadWriteWeb
Results of the Modigliani Test
• Atanas Kiryakov from Ontotext
• Used LDSR – Linked Data Semantic Repository
  – DBpedia
  – Freebase
  – Geonames
  – UMBEL
  – WordNet
Published April 26, 2010: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
SPARQL Query

  PREFIX fb: <http://rdf.freebase.com/ns/>
  PREFIX dbpedia: <http://dbpedia.org/resource/>
  PREFIX dbp-prop: <http://dbpedia.org/property/>
  PREFIX dbp-ont: <http://dbpedia.org/ontology/>
  PREFIX umbel-sc: <http://umbel.org/umbel/sc/>
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ot: <http://www.ontotext.com/>

  SELECT DISTINCT ?painting_l ?owner_l ?city_fb_con ?city_db_loc ?city_db_cit
  WHERE {
    ?p fb:visual_art.artwork.artist dbpedia:Amedeo_Modigliani ;
       fb:visual_art.artwork.owners [ fb:visual_art.artwork_owner_relationship.owner ?ow ] ;
       ot:preferredLabel ?painting_l .
    ?ow ot:preferredLabel ?owner_l .
    OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] }
    OPTIONAL { ?ow dbp-prop:location ?loc .
               ?loc rdf:type umbel-sc:City ;
                    ot:preferredLabel ?city_db_loc }
    OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
  }
Let’s start by making sure that we understand what Linked Data is…
Do you SEARCH or do you FIND?
Search for
Football players who went to the University of Texas at Austin and played for the Dallas Cowboys as cornerback
Why can’t we just FIND it…
Guess how I FOUND out?
I’ll tell you how I did NOT find it
Current Web = internet + links + docs
So what is the problem?
• We aren't always interested in documents
  – We are interested in THINGS
  – These THINGS might be in documents
• We can read an HTML document rendered in a browser and find what we are searching for
  – This is hard for computers
  – Computers have to guess (even though they are pretty good at it)
What do we need to do?
• Make it easy for computers/software to find THINGS
How can we do that?
• Besides publishing documents on the web
  – which computers can't understand easily
• Let’s publish something that computers can understand
RAW DATA!
But wait… don’t we do that already?
Current Data on the Web
• Relational Databases
• APIs
• XML
• CSV
• XLS
• …
• Can't computers and applications already consume that data on the web?
True! But it is all in different formats and data models!
This makes it hard to integrate data
The data in different data sources aren’t linked
For example, how do I know that the Juan Sequeda on Facebook is the same as the Juan Sequeda on Twitter?
Or if I create a mashup from different services, I have to learn different APIs and I get back different formats of data.
Wouldn’t it be great if we had a standard way of publishing data on the Web?
We have a standardized way of publishing documents on the web, right?
HTML
Then why can’t we have a standard way of publishing data on the Web?
Good question! And the answer is YES. There is!
Resource Description Framework (RDF)
• A data model
  – A way to model data
  – e.g., relational databases use the relational data model
• RDF is a triple data model
• Labeled graph
• Subject, Predicate, Object
  – <Juan> <was born in> <California>
  – <California> <is part of> <the USA>
  – <Juan> <likes> <the Semantic Web>
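The triple model above can be illustrated with a short Python sketch (illustrative only: plain tuples stand in for what a real RDF library would provide):

```python
# Illustrative sketch of the RDF triple data model: a graph is just a
# set of (subject, predicate, object) tuples. Real applications would
# use an RDF library instead.
graph = {
    ("Juan", "was born in", "California"),
    ("California", "is part of", "the USA"),
    ("Juan", "likes", "the Semantic Web"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return {t for t in graph
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])}

# Everything the graph says about Juan:
for s, p, o in sorted(match(graph, s="Juan")):
    print(s, p, o)
```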
RDF can be serialized in different ways
• RDF/XML
• RDFa (RDF in HTML)
• N3
• Turtle
• JSON
So does that mean that I have to publish my data in RDF now?
You don’t have to… but we would like you to
An example
Document on the Web
Databases back up documents
Isbn              | Title                        | Author       | PublisherID | ReleasedDate
978-0-596-15381-6 | Programming the Semantic Web | Toby Segaran | 1           | July 2009
…                 | …                            | …            | …           | …

PublisherID | PublisherName
1           | O'Reilly Media
…           | …
This is a THING: a book titled "Programming the Semantic Web" by Toby Segaran, …
THINGS have PROPERTIES: a Book has a Title, an author, …
Let's represent the data in RDF
[Graph diagram of the table data as RDF:]
  book --title--> "Programming the Semantic Web"
  book --isbn--> "978-0-596-15381-6"
  book --author--> "Toby Segaran"
  book --publisher--> Publisher --name--> "O'Reilly"
Remember that we are on the web
Everything on the web is identified by a URI
And now let’s link the data to other data
[Graph diagram, now with URIs:]
  http://…/isbn978 --title--> "Programming the Semantic Web"
  http://…/isbn978 --isbn--> "978-0-596-15381-6"
  http://…/isbn978 --author--> "Toby Segaran"
  http://…/isbn978 --publisher--> http://…/publisher1 --name--> "O'Reilly"
And now consider the data from Revyu.com
[Graph diagram of Revyu.com data:]
  http://…/isbn978 --hasReview--> http://…/review1
  http://…/review1 --description--> "Awesome Book"
  http://…/review1 --reviewer--> http://…/reviewer
  http://…/reviewer --name--> "Juan Sequeda"
Let’s start to link data
[Graph diagram: both datasets, linked:]
  http://…/isbn978 --title--> "Programming the Semantic Web"
  http://…/isbn978 --isbn--> "978-0-596-15381-6"
  http://…/isbn978 --author--> "Toby Segaran"
  http://…/isbn978 --publisher--> http://…/publisher1 --name--> "O'Reilly"
  http://…/isbn978 --sameAs--> http://…/isbn978 (Revyu's URI)
  http://…/isbn978 --hasReview--> http://…/review1
  http://…/review1 --description--> "Awesome Book"
  http://…/review1 --hasReviewer--> http://…/reviewer
  http://…/reviewer --name--> "Juan Sequeda"
Juan Sequeda publishes data too
[Graph diagram:]
  http://juansequeda.com/id --name--> "Juan Sequeda"
  http://juansequeda.com/id --livesIn--> http://dbpedia.org/Austin
Let's link more data

[Graph diagram:]
  http://…/isbn978 --hasReview--> http://…/review1
  http://…/review1 --description--> "Awesome Book"
  http://…/review1 --hasReviewer--> http://…/reviewer
  http://…/reviewer --name--> "Juan Sequeda"
  http://…/reviewer --sameAs--> http://juansequeda.com/id
  http://juansequeda.com/id --name--> "Juan Sequeda"
  http://juansequeda.com/id --livesIn--> http://dbpedia.org/Austin
And more

[Graph diagram: all three datasets, linked:]
  http://…/isbn978 --title--> "Programming the Semantic Web"
  http://…/isbn978 --isbn--> "978-0-596-15381-6"
  http://…/isbn978 --author--> "Toby Segaran"
  http://…/isbn978 --publisher--> http://…/publisher1 --name--> "O'Reilly"
  http://…/isbn978 --sameAs--> http://…/isbn978 (Revyu's URI)
  http://…/isbn978 --hasReview--> http://…/review1
  http://…/review1 --description--> "Awesome Book"
  http://…/review1 --hasReviewer--> http://…/reviewer
  http://…/reviewer --name--> "Juan Sequeda"
  http://…/reviewer --sameAs--> http://juansequeda.com/id
  http://juansequeda.com/id --name--> "Juan Sequeda"
  http://juansequeda.com/id --livesIn--> http://dbpedia.org/Austin
Data on the Web that is in RDF and is linked to other RDF data is LINKED DATA
Linked Data Principles
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up (dereference) those names.
3. When someone looks up a URI, provide useful information.
4. Include links to other URIs so that they can discover more things.
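Principles 2 and 3 in practice: looking up a URI is an ordinary HTTP request, with an Accept header asking for data instead of an HTML page. A minimal Python sketch (it only builds the request, since actually sending it needs network access; the media type shown is one common choice):

```python
import urllib.request

def make_deref_request(uri, media_type="application/rdf+xml"):
    """Build an HTTP request that dereferences a Linked Data URI,
    asking for RDF via content negotiation."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

req = make_deref_request("http://dbpedia.org/resource/Linked_Data")
print(req.full_url)              # the name we are looking up
print(req.get_header("Accept"))  # tells the server we want data, not HTML
```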
Linked Data makes the web appear as ONE GIANT HUGE GLOBAL DATABASE!
I can query a database with SQL. Is there a way to query Linked Data with a query language?
Yes! There is actually a standardized language for that
SPARQL
FIND all the reviews on the book "Programming the Semantic Web" by people who live in Austin
[Graph diagram: the linked data the query traverses:]
  http://…/isbn978 --title--> "Programming the Semantic Web"
  http://…/isbn978 --isbn--> "978-0-596-15381-6"
  http://…/isbn978 --author--> "Toby Segaran"
  http://…/isbn978 --publisher--> http://…/publisher1 --name--> "O'Reilly"
  http://…/isbn978 --sameAs--> http://…/isbn978 (Revyu's URI)
  http://…/isbn978 --hasReview--> http://…/review1
  http://…/review1 --description--> "Awesome Book"
  http://…/review1 --hasReviewer--> http://…/reviewer
  http://…/reviewer --name--> "Juan Sequeda"
  http://…/reviewer --sameAs--> http://juansequeda.com
  http://juansequeda.com --name--> "Juan Sequeda"
  http://juansequeda.com --livesIn--> http://dbpedia.org/Austin
This looks cool, but let’s be realistic. What is the incentive to publish Linked Data?
What was your incentive to publish an HTML page in 1990?
1) Share data in documents
2) Because your neighbor was doing it
So why should we publish Linked Data in 2010?
1) Share data as data
2) Because your neighbor is doing it
And guess who is starting to publish Linked Data now?
Linked Data Publishers
• UK Government
• US Government
• BBC
• Open Calais – Thomson Reuters
• Freebase
• NY Times
• Best Buy
• CNET
• DBpedia
• Are you?
How can I publish Linked Data?
Publishing Linked Data
• Legacy Data in Relational Databases
  – D2R Server
  – Virtuoso
  – Triplify
  – Ultrawrap
• CMS
  – Drupal 7
• Native RDF Stores
  – Databases for RDF (Triple Stores): AllegroGraph, Jena, Sesame, Virtuoso
  – Talis Platform (Linked Data in the Cloud)
• In HTML with RDFa
Consuming Linked Data by Humans
HTML Browsers
Links to other URIs
<span rel="foaf:interest">
  <a href="http://dbpedia.org/resource/Database" property="dcterms:title">Database</a>,
  <a href="http://dbpedia.org/resource/Data_integration" property="dcterms:title">Data Integration</a>,
  <a href="http://dbpedia.org/resource/Semantic_Web" property="dcterms:title">Semantic Web</a>,
  <a href="http://dbpedia.org/resource/Linked_Data" property="dcterms:title">Linked Data</a>, etc.
</span>
HTML Browsers
• RDF can be serialized in RDFa
• Have you heard of
  – Yahoo's Search Monkey
  – Google Rich Snippets?
• They are consuming RDFa
• But WHY?
Because there is life beyond ten blue links
Google and Yahoo are starting to crawl RDFa!
The Semantic Web is a reality!
The Reality
• Yahoo is crawling data that is in RDFa and Microformats and uses specific vocabularies
  – FOAF
  – GoodRelations
  – …
• Google is crawling RDFa and Microformats that use the Google vocabulary
Linked Data Browsers
Linked Data Browsers
• Not actually separate browsers. Run inside of HTML browsers
• View the data that is returned after looking up a URI in tabular form
• (IMO) UI lacks usability
Linked Data Browsers
• Tabulator– http://www.w3.org/2005/ajar/tab
• OpenLink– http://ode.openlinksw.com/
• Zitgist DataViewer– http://dataviewer.zitgist.com/
• Marbles– http://www5.wiwiss.fu-berlin.de/marbles/
• Explorator– http://www.tecweb.inf.puc-rio.br/explorator
Faceted Browsers
http://dbpedia.neofonie.de
http://dev.semsol.com/2010/semtech/
On-the-fly Mashups
http://sig.ma
What’s next?
Time to create new and innovative ways to interact with Linked Data
This may be one of the Killer Apps that we have all been waiting for
http://en.wikipedia.org/wiki/File:Mosaic_browser_plaque_ncsa.jpg
It’s time to partner with HCI community
Semantic Web UIs don’t have to be ugly
Consume Linked Data with SPARQL
SPARQL Endpoints
• Linked Data sources usually provide a SPARQL endpoint for their dataset(s)
• SPARQL endpoint: SPARQL query processing service that supports the SPARQL protocol*
• Send your SPARQL query, receive the result
* http://www.w3.org/TR/rdf-sparql-protocol/
Where can I find SPARQL Endpoints?
• DBpedia: http://dbpedia.org/sparql
• Musicbrainz: http://dbtune.org/musicbrainz/sparql
• U.S. Census: http://www.rdfabout.com/sparql
• Semantic Crunchbase: http://cb.semsol.org/sparql
• http://esw.w3.org/topic/SparqlEndpoints
Accessing a SPARQL Endpoint
• SPARQL endpoints: RESTful Web services
• Issuing SPARQL queries to a remote SPARQL endpoint is basically an HTTP GET request to the SPARQL endpoint with parameter query

  GET /sparql?query=PREFIX+rd... HTTP/1.1
  Host: dbpedia.org
  User-agent: my-sparql-client/0.1

The query parameter carries the URL-encoded string with the SPARQL query.
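The URL encoding can be done with a standard library; a Python sketch building such a request URL (the endpoint address is DBpedia's, as above; the query is a placeholder):

```python
from urllib.parse import urlencode

# Sketch: building the GET request URL for a SPARQL endpoint.
# The 'query' parameter carries the URL-encoded SPARQL text.
endpoint = "http://dbpedia.org/sparql"
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

url = endpoint + "?" + urlencode({"query": query})
print(url)
```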
Query Results Formats
• SPARQL endpoints usually support different result formats:
  – XML, JSON, plain text (for ASK and SELECT queries)
  – RDF/XML, N-Triples, Turtle, N3 (for DESCRIBE and CONSTRUCT queries)
Query Results Formats
  PREFIX dbp: <http://dbpedia.org/ontology/>
  PREFIX dbpprop: <http://dbpedia.org/property/>
  SELECT ?name ?bday
  WHERE {
    ?p dbp:birthplace <http://dbpedia.org/resource/Berlin> .
    ?p dbpprop:dateOfBirth ?bday .
    ?p dbpprop:name ?name .
  }
Query Result Formats
• Use the ACCEPT header to request the preferred result format:
  GET /sparql?query=PREFIX+rd... HTTP/1.1
  Host: dbpedia.org
  User-agent: my-sparql-client/0.1
  Accept: application/sparql-results+json
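The JSON that comes back for application/sparql-results+json follows the W3C SPARQL results format and is easy to process; a Python sketch over a made-up sample response (the binding values are invented for illustration):

```python
import json

# Sketch: processing the SPARQL JSON results format for a SELECT
# query. The structure follows the W3C results-JSON layout; the
# binding values below are made-up sample data.
raw = """{
  "head": { "vars": ["name", "bday"] },
  "results": { "bindings": [
    { "name": {"type": "literal", "value": "Alice Example"},
      "bday": {"type": "literal", "value": "1901-12-27"} }
  ] }
}"""

doc = json.loads(raw)
rows = [(b["name"]["value"], b["bday"]["value"])
        for b in doc["results"]["bindings"]]
print(rows)
```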
Query Result Formats
• As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out

  GET /sparql?out=json&query=... HTTP/1.1
  Host: dbpedia.org
  User-agent: my-sparql-client/0.1
Accessing a SPARQL Endpoint
• More convenient: use a library
• SPARQL JavaScript Library
– http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
• ARC for PHP– http://arc.semsol.org/
• RAP – RDF API for PHP– http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html
Accessing a SPARQL Endpoint
• Jena / ARQ (Java)– http://jena.sourceforge.net/
• Sesame (Java)– http://www.openrdf.org/
• SPARQL Wrapper (Python)– http://sparql-wrapper.sourceforge.net/
• PySPARQL (Python)– http://code.google.com/p/pysparql/
Accessing a SPARQL Endpoint
Example with Jena/ARQ

  import com.hp.hpl.jena.query.*;

  String service = "..."; // address of the SPARQL endpoint
  String query = "SELECT ..."; // your SPARQL query
  QueryExecution e =
      QueryExecutionFactory.sparqlService(service, query);
  ResultSet results = e.execSelect();
  while ( results.hasNext() ) {
      QuerySolution s = results.nextSolution();
      // ...
  }
  e.close();
• Querying a single dataset is quite boring compared to:
• Issuing SPARQL queries over multiple datasets
• How can you do this?
  1. Issue follow-up queries to different endpoints
  2. Query a central collection of datasets
  3. Build a store with copies of relevant datasets
  4. Use a query federation system
Follow-up Queries
• Idea: issue follow-up queries over other datasets based on results from previous queries
• Substituting placeholders in query templates
  String s1 = "http://cb.semsol.org/sparql";
  String s2 = "http://dbpedia.org/sparql";
  String qTmpl = "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

  // q1 finds a list of companies filtered by some criteria and
  // returns their DBpedia URIs
  String q1 = "SELECT ?s WHERE { ...";

  QueryExecution e1 = QueryExecutionFactory.sparqlService(s1, q1);
  ResultSet results1 = e1.execSelect();
  while ( results1.hasNext() ) {
      QuerySolution sol = results1.nextSolution();
      String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
      QueryExecution e2 = QueryExecutionFactory.sparqlService(s2, q2);
      ResultSet results2 = e2.execSelect();
      while ( results2.hasNext() ) {
          // ...
      }
      e2.close();
  }
  e1.close();
Follow-up Queries
• Advantage
  – Queried data is up-to-date
• Drawbacks
  – Requires the existence of a SPARQL endpoint for each dataset
  – Requires program logic
  – Very inefficient
Querying a Collection of Datasets
• Idea: Use an existing SPARQL endpoint that provides access to a set of copies of relevant datasets
• Example: SPARQL endpoint over a majority of datasets from the LOD cloud at:
  http://lod.openlinksw.com/sparql
http://uberblic.org
Querying a Collection of Datasets
• Advantage:– No need for specific program logic
• Drawbacks:
  – Queried data might be out of date
  – Not all relevant datasets may be in the collection
Own Store of Dataset Copies
• Idea: Build your own store with copies of relevant datasets and query it
• Possible stores:
  – Jena TDB http://jena.hpl.hp.com/wiki/TDB
  – Sesame http://www.openrdf.org/
  – OpenLink Virtuoso http://virtuoso.openlinksw.com/
  – 4store http://4store.org/
  – AllegroGraph http://www.franz.com/agraph/
  – etc.
Populating Your Store
• Get RDF dumps provided for the datasets
• (Focused) Crawling
  – ldspider http://code.google.com/p/ldspider/
  – Multithreaded API for focused crawling
  – Crawling strategies (breadth-first, load-balancing)
  – Flexible configuration with callbacks and hooks
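At its core, the crawling idea is a breadth-first traversal of RDF links. A toy Python sketch, where MOCK_WEB and its URIs are invented stand-ins for real HTTP lookups (a crawler like ldspider does this over the network, multithreaded):

```python
from collections import deque

# Toy sketch of breadth-first Linked Data crawling. MOCK_WEB is a
# stand-in for the real web: each URI maps to the URIs that its
# RDF description links to. All names are invented for illustration.
MOCK_WEB = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": [],
}

def crawl(seed, max_uris=100):
    """Breadth-first traversal from a seed URI, up to max_uris lookups."""
    seen, queue = set(), deque([seed])
    while queue and len(seen) < max_uris:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        # A real crawler would do an HTTP lookup and parse RDF here:
        queue.extend(MOCK_WEB.get(uri, []))
    return seen

print(sorted(crawl("http://example.org/a")))
```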
Own Store of Dataset Copies
• Advantages:
  – No need for specific program logic
  – Can include all datasets
  – Independent of the existence, availability, and efficiency of SPARQL endpoints
• Drawbacks:
  – Requires effort to set up and to operate the store
  – Ideally, data sources provide RDF dumps; if not?
  – How to keep the copies in sync with the originals?
  – Queried data might be out of date
Federated Query Processing
• Idea: Querying a mediator which distributes sub-queries to relevant sources and integrates the results
Federated Query Processing
• Instance-based federation
  – Each thing described by only one data source
  – Untypical for the Web of Data
• Triple-based federation
  – No restrictions
  – Requires more distributed joins
• Statistics about datasets required (both cases)
Federated Query Processing
• DARQ (Distributed ARQ)
  – http://darq.sourceforge.net/
  – Query engine for federated SPARQL queries
  – Extension of ARQ (query engine for Jena)
  – Last update: June 28, 2006
• Semantic Web Integrator and Query Engine (SemWIQ)
  – http://semwiq.sourceforge.net/
  – Actively maintained
Federated Query Processing
• Advantages:
  – No need for specific program logic
  – Queried data is up to date
• Drawbacks:
  – Requires the existence of a SPARQL endpoint for each dataset
  – Requires effort to set up and configure the mediator
In any case:
• You have to know the relevant data sources
  – When developing the app using follow-up queries
  – When selecting an existing SPARQL endpoint over a collection of dataset copies
  – When setting up your own store with a collection of dataset copies
  – When configuring your query federation system
• You restrict yourself to the selected sources
There is an alternative:
Remember, URIs link to data
Automated Link Traversal
• Idea: Discover further data by looking up relevant URIs in your application
• Can be combined with the previous approaches
Link Traversal Based Query Execution
• Applies the idea of automated link traversal to the execution of SPARQL queries
• Idea:
  – Intertwine query evaluation with traversal of RDF links
  – Discover data that might contribute to query results during query execution
• Alternate between:
  – Evaluating parts of the query
  – Looking up URIs in intermediate solutions
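The alternation can be sketched in Python over a mocked web of triples (the URIs and the two-pattern query are invented for illustration; real engines generalize this to arbitrary SPARQL patterns):

```python
# Toy sketch of link traversal based query execution: alternate between
# evaluating triple patterns over data fetched so far and looking up
# URIs from intermediate solutions. MOCK_WEB stands in for dereferencing
# a URI and parsing the RDF it returns; all names are invented.
MOCK_WEB = {
    "http://ex.org/book1": [
        ("http://ex.org/book1", "author", "http://ex.org/juan"),
    ],
    "http://ex.org/juan": [
        ("http://ex.org/juan", "livesIn", "http://ex.org/austin"),
    ],
}

def lookup(uri, fetched, graph):
    """Dereference uri (mocked) and add its triples to the local graph."""
    if uri not in fetched:
        fetched.add(uri)
        graph.extend(MOCK_WEB.get(uri, []))

def where_does_the_author_live(book_uri):
    graph, fetched = [], set()
    lookup(book_uri, fetched, graph)                # look up the seed URI
    authors = [o for (s, p, o) in graph
               if s == book_uri and p == "author"]  # evaluate first pattern
    for a in authors:
        lookup(a, fetched, graph)                   # follow links in solutions
    return [o for (s, p, o) in graph
            if s in authors and p == "livesIn"]     # evaluate second pattern

print(where_does_the_author_live("http://ex.org/book1"))
```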
Link Traversal Based Query Execution

[Animated walk-through of link traversal based query execution, shown over several slides]
• Advantages:
  – No need to know all data sources in advance
  – No need for specific programming logic
  – Queried data is up to date
  – Does not depend on the existence of SPARQL endpoints provided by the data sources
• Drawbacks:
  – Not as fast as a centralized collection of copies
  – Unsuitable for some queries
  – Results might be incomplete (do we care?)
Implementations
• Semantic Web Client Library (SWClLib) for Java
  – http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
• SWIC for Prolog
  – http://moustaki.org/swic/
Implementations
• SQUIN http://squin.org
  – Provides SWClLib functionality as a Web service
  – Accessible like a SPARQL endpoint
  – Install package: unzip and start (less than 5 mins!)
  – Convenient access with SQUIN PHP tools:

  $s = 'http:// ...'; // address of the SQUIN service
  $q = new SparqlQuerySock( $s, '... SELECT ...' );
  $res = $q->getJsonResult(); // or getXmlResult()
Real World Example
Getting Started
• Finding URIs• Finding Additional Data• Finding SPARQL Endpoints
What is a Linked Data application?
• Software system that makes use of data on the web from multiple datasets and that benefits from links between the datasets
Characteristics of Linked Data Applications
• Consume data that is published on the web following the Linked Data principles: an application should be able to request, retrieve and process the accessed data
• Discover further information by following the links between different data sources: the fourth principle enables this.
• Combine the consumed Linked Data with data from other sources (not necessarily Linked Data)
• Expose the combined data back to the web following the Linked Data principles
• Offer value to end-users
Examples
• http://data-gov.tw.rpi.edu/wiki
• http://dbrec.net/
• http://fanhu.bz/
• http://data.nytimes.com/schools/schools.html
• http://sig.ma
• http://visinav.deri.org/semtech2010/
Hot Research Topics
• Interlinking Algorithms
• Provenance and Trust
• Dataset Dynamics
• UI
• Distributed Query
• Evaluation
  – "You want a good thesis? IR is based on precision and recall. The minute you add semantics, it is a meaningless feature. Logic is based on soundness and completeness. We don't want soundness and completeness. We want a few good answers quickly." – Jim Hendler at WWW2009 during the LOD gathering
Thanks Michael Hausenblas
THANKS
Juan Sequeda
www.juansequeda.com
@juansequeda
#cold
www.consuminglinkeddata.org
Acknowledgements: Olaf Hartig, Patrick Sinclair, Jamie Taylor
Slides for Consuming Linked Data with SPARQL by Olaf Hartig