balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information

Presentation for the 5th International Workshop on Data Engineering meets the Semantic Web (DESWeb), in conjunction with ICDE 2014, Chicago, IL, USA, March 31, 2014, presented by Kai Schlegel

DESWeb 2014, ICDE 2014, Chicago IL, USA, March 31, 2014

balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information

Kai Schlegel (kai.schlegel@googlemail.com), Florian Stegmaier, Sebastian Bayerl, Michael Granitzer, Harald Kosch

2

Outline

Motivation

SPARQL Rewriting & Federation

Intermediate Results

supported by the European Commission under the Seventh Framework Program

3

“Linked Data is the heart of Semantic Web” - W3C Semantic Web Group

4

Huge Potential!

5

Developing with Linked Open Data

6

Motivation

• Easy access to Linked Data
  • Query Linked Open Data with SPARQL
  • Plethora of tools available
• Problems:
  • Business oriented
  • Complex setup
  • Maintenance
  • “Paper-only”
  • Not developer friendly
• Simple and “instant” SPARQL Query Federation (-as-a-Service)

Nothing-as-a-Service

7

Querying LOD

• How to get information about the German city “Passau”?
• Problem: LOD is not a single database!

[Figure: several SPARQL endpoints and RDF sources; the query below is sent to de.dbpedia.org]

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

de.dbpedia.org: Relations, Coordinates, Leader, etc.

What about the population?

8

Distributed Querying!

• Problem: Selection of appropriate endpoints
• Send query to some endpoints and aggregate the results?

[Figure: the same query is sent to the de.dbpedia.org and linkedgeodata.org SPARQL endpoints]

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

de.dbpedia.org

linkedgeodata.org: “WHAT?”

9

Misunderstanding: Co-Referencing

• Problem: Different identifiers for the same semantic concept

[Figure: the same query is sent to the de.dbpedia.org and linkedgeodata.org SPARQL endpoints]

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

de.dbpedia.org

linkedgeodata.org: “WHAT?”

Known problem in linguistics:

“It’s a spud!”
“What?”
“I mean potato!”

Co-Referencing: Multiple expressions refer to the same thing.

10

Problem = Solution?

SPARQL-based crawling of co-reference information

Exploit co-reference information for
• accomplishing immediate SPARQL rewriting
• performing endpoint selection
• executing automatic query federation

Basic idea: Consolidate distributed co-reference information

Main principle: Semantic entities over identifiers!

11

Components

balloon toolsuite

12

balloon Overflight

• SPARQL-based crawling of LOD endpoints
  • Query: Ask for subjects and objects that are related by a specific predicate
• Simplified global view on:
  • Equivalence: owl:sameAs, skos:exactMatch, coref:coreferenceData, ...
• Graph database: Neo4j
  • Equivalence cluster: Multiple synonym URIs representing the same semantic entity, including provenance
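
A minimal sketch of how crawled co-reference pairs could be grouped into equivalence clusters, using a union-find structure in Python. The function names, the (subject, object, source endpoint) input format and the in-memory representation are assumptions made for illustration; they do not reflect how balloon stores clusters in Neo4j.

```python
from collections import defaultdict

# Two of the equivalence predicates named on the slide, as full IRIs.
EQUIVALENCE_PREDICATES = [
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
]


def crawl_query(predicate):
    """Build a SPARQL query asking for all pairs linked by one predicate."""
    return f"SELECT ?s ?o WHERE {{ ?s <{predicate}> ?o . }}"


def build_clusters(statements):
    """Group URIs into equivalence clusters using union-find.

    statements: iterable of (subject_uri, object_uri, source_endpoint).
    Returns a list of (set_of_uris, set_of_source_endpoints).
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    statements = list(statements)
    for subject, obj, _ in statements:
        parent[find(subject)] = find(obj)  # merge the two clusters

    members = defaultdict(set)
    for uri in list(parent):
        members[find(uri)].add(uri)

    # Provenance: which endpoints contributed statements to each cluster.
    provenance = defaultdict(set)
    for subject, _, source in statements:
        provenance[find(subject)].add(source)

    return [(members[root], provenance[root]) for root in members]
```

Each returned pair holds the synonym URIs of one semantic entity together with the endpoints that asserted the links, which is the provenance information used later for endpoint selection.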

13

balloon Fusion

SPARQL Federation setup using co-reference information

SPARQL Transformation for each BGP (basic graph pattern):

1. Determine synonym URIs

2. Select suitable endpoints

3. Adapt sub-queries to endpoints

4. Federated querying

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

14

1. Determine synonym URIs

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
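
A rough sketch of step 1 in Python: find the URIs in a query and look up their synonyms in a cluster index. The regex-based extraction is a simplification (it also picks up predicate and object IRIs, and a real implementation would more likely work on the parsed query), and `build_index` and `synonyms_in_query` are hypothetical helper names.

```python
import re

# Matches anything written as <...> in the query text.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def build_index(clusters):
    """Map every URI to the set of URIs in its equivalence cluster."""
    index = {}
    for cluster in clusters:
        for uri in cluster:
            index[uri] = cluster
    return index


def synonyms_in_query(query, index):
    """Return {uri: sorted synonym URIs} for each query IRI that has a cluster."""
    found = {}
    for uri in IRI_PATTERN.findall(query):
        if uri in index:
            found[uri] = sorted(index[uri])
    return found


# The Passau cluster, using the three synonym URIs shown in the slides.
PASSAU_CLUSTER = frozenset({
    "http://de.dbpedia.org/resource/Passau",
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
})

QUERY = "SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }"
RESULT = synonyms_in_query(QUERY, build_index([PASSAU_CLUSTER]))
```

Here `RESULT` maps the DBpedia URI from the query to all three members of its cluster, including itself.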

15

2. Select suitable endpoints

• Provenance-based selection (PBS)
  • Endpoints that are involved in cluster composition
• Namespace-based selection (NBS)
  • Prefix and namespace matching of synonym URIs

Summarized: origin of co-reference information and origin of synonym URIs

16

2. Select suitable endpoints (2)

Assumption:
• Provenance information only contains “linkedgeodata.org” as co-reference origin
• Namespaces for Freebase and DBpedia available (datahub.io)

Selected endpoints:
• PBS: LinkedGeoData endpoint
• NBS: DBpedia endpoint
• NBS: Freebase endpoint
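
The two selection strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not balloon's implementation: the namespace prefixes are guessed from the synonym URIs, the endpoint URLs are taken from the federated query shown later in the slides, and `select_endpoints` is a hypothetical name.

```python
def select_endpoints(synonyms, provenance, namespaces):
    """Return {endpoint_url: set of selection reasons} for one cluster.

    synonyms:   synonym URIs of the equivalence cluster
    provenance: endpoints that contributed co-reference statements (PBS)
    namespaces: {uri_prefix: endpoint_url}, e.g. from a dataset catalog (NBS)
    """
    selected = {}
    # Provenance-based selection: endpoints involved in building the cluster.
    for endpoint in provenance:
        selected.setdefault(endpoint, set()).add("PBS")
    # Namespace-based selection: endpoints whose namespace matches a synonym.
    for uri in synonyms:
        for prefix, endpoint in namespaces.items():
            if uri.startswith(prefix):
                selected.setdefault(endpoint, set()).add("NBS")
    return selected


SYNONYMS = [
    "http://de.dbpedia.org/resource/Passau",
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
]
# Assumption from the slide: only linkedgeodata.org appears as co-reference origin.
PROVENANCE = ["http://linkedgeodata.org/sparql/"]
# Assumed namespace-to-endpoint mapping for DBpedia and Freebase.
NAMESPACES = {
    "http://de.dbpedia.org/resource/": "http://dbpedia.org/sparql",
    "http://rdf.freebase.com/ns/": "http://www.freebase.com/base/sparql",
}

SELECTION = select_endpoints(SYNONYMS, PROVENANCE, NAMESPACES)
```

With these inputs, `SELECTION` marks the LinkedGeoData endpoint as PBS and the DBpedia and Freebase endpoints as NBS, matching the example on the slide.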

17

3. Adapt sub-queries to endpoints

Original query:

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

NBS: Freebase endpoint

SELECT ?p ?o WHERE { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. }

NBS: DBpedia endpoint

SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }

PBS: LinkedGeoData endpoint

SELECT ?p ?o WHERE { { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } UNION { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. } UNION { <http://de.dbpedia.org/resource/Passau> ?p ?o. } }
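
A small Python sketch of step 3, assuming a single triple pattern with the URI in subject position: an NBS endpoint receives the pattern with the synonym from its own namespace, while a PBS endpoint receives a UNION over all synonyms. The helper names and the namespace prefix are hypothetical, and plain string replacement stands in for proper query rewriting.

```python
def substitute(pattern, original_uri, synonym):
    """Swap the original URI for one synonym inside a triple pattern."""
    return pattern.replace(f"<{original_uri}>", f"<{synonym}>")


def union_of(patterns):
    """Join patterns with UNION; a single pattern is returned unchanged."""
    if len(patterns) == 1:
        return patterns[0]
    return " UNION ".join(f"{{ {p} }}" for p in patterns)


def adapt_for_nbs(pattern, original_uri, synonyms, prefix):
    """Sub-pattern for a namespace-selected endpoint: only its own synonyms."""
    local = [s for s in synonyms if s.startswith(prefix)]
    return union_of([substitute(pattern, original_uri, s) for s in local])


def adapt_for_pbs(pattern, original_uri, synonyms):
    """Sub-pattern for a provenance-selected endpoint: all synonyms."""
    return union_of([substitute(pattern, original_uri, s) for s in synonyms])


ORIGINAL = "http://de.dbpedia.org/resource/Passau"
PATTERN = f"<{ORIGINAL}> ?p ?o ."
SYNONYMS = [
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
    "http://de.dbpedia.org/resource/Passau",
]

FREEBASE_PATTERN = adapt_for_nbs(PATTERN, ORIGINAL, SYNONYMS, "http://rdf.freebase.com/ns/")
LGD_PATTERN = adapt_for_pbs(PATTERN, ORIGINAL, SYNONYMS)
```

`FREEBASE_PATTERN` contains only the Freebase URI, and `LGD_PATTERN` is the three-way UNION shown for the LinkedGeoData endpoint.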

18

4. Federated Querying

• W3C SPARQL 1.1 Federated Query Extension (SERVICE)
  • (Partial) query can be executed against a remote SPARQL endpoint
• Distributed sub-queries don’t contain SPARQL 1.1 features

SELECT ?p ?o WHERE {
  { SERVICE <http://dbpedia.org/sparql> {
      <http://de.dbpedia.org/resource/Passau> ?p ?o.
  } }
  UNION
  { SERVICE <http://www.freebase.com/base/sparql> {
      <http://rdf.freebase.com/ns/m.01h5td> ?p ?o.
  } }
  UNION
  { SERVICE <http://linkedgeodata.org/sparql/> {
      { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. }
      UNION
      { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. }
      UNION
      { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
  } }
}
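
A sketch of how such a federated query could be assembled in Python from per-endpoint sub-patterns. `federated_query` is a hypothetical helper; the endpoint URLs and patterns are the ones from the slides.

```python
def federated_query(variables, subpatterns):
    """Combine per-endpoint graph patterns into one SPARQL 1.1 query.

    variables:   projected variables, e.g. ["?p", "?o"]
    subpatterns: {endpoint_url: graph pattern text for that endpoint}
    """
    # Each SERVICE clause sits in its own group so the groups can be UNIONed.
    blocks = [
        f"{{ SERVICE <{endpoint}> {{ {pattern} }} }}"
        for endpoint, pattern in subpatterns.items()
    ]
    body = " UNION ".join(blocks)
    projection = " ".join(variables)
    return f"SELECT {projection} WHERE {{ {body} }}"


SUBPATTERNS = {
    "http://dbpedia.org/sparql":
        "<http://de.dbpedia.org/resource/Passau> ?p ?o .",
    "http://www.freebase.com/base/sparql":
        "<http://rdf.freebase.com/ns/m.01h5td> ?p ?o .",
    "http://linkedgeodata.org/sparql/":
        "{ <http://rdf.freebase.com/ns/m.01h5td> ?p ?o . } UNION "
        "{ <http://linkedgeodata.org/triplify/node240057351> ?p ?o . } UNION "
        "{ <http://de.dbpedia.org/resource/Passau> ?p ?o . }",
}

QUERY = federated_query(["?p", "?o"], SUBPATTERNS)
```

Wrapping every SERVICE clause in braces keeps each UNION operand a group graph pattern, which is what the SPARQL grammar expects on both sides of UNION.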

19

Optimizations (ongoing)

• Endpoint status check
  • Check routine in terms of availability and latency
• Minimize sub-queries
  • Group sub-queries with common endpoint
  • Push join to endpoint
• SPARQL features
  • Condense PBS UNION construct of synonym URIs
  • SPARQL 1.1 VALUES or FILTER with IN operator
  • Not well implemented in Linked Data endpoints
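
To illustrate the last optimization, a Python sketch that condenses the UNION over synonym URIs into a single pattern with either a VALUES block or a FILTER with IN. The helper names and the variable `?entity` are assumptions; the variable must not already occur in the query.

```python
def condense_with_values(pattern, original_uri, synonyms, var="?entity"):
    """Replace the URI by a variable bound through a SPARQL 1.1 VALUES block."""
    generic = pattern.replace(f"<{original_uri}>", var)
    values = " ".join(f"<{s}>" for s in synonyms)
    return f"VALUES {var} {{ {values} }} {generic}"


def condense_with_filter(pattern, original_uri, synonyms, var="?entity"):
    """Replace the URI by a variable restricted through FILTER ... IN."""
    generic = pattern.replace(f"<{original_uri}>", var)
    candidates = ", ".join(f"<{s}>" for s in synonyms)
    return f"{generic} FILTER ({var} IN ({candidates}))"


ORIGINAL = "http://de.dbpedia.org/resource/Passau"
PATTERN = f"<{ORIGINAL}> ?p ?o ."
SYNONYMS = [
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://de.dbpedia.org/resource/Passau",
]

VALUES_PATTERN = condense_with_values(PATTERN, ORIGINAL, SYNONYMS)
FILTER_PATTERN = condense_with_filter(PATTERN, ORIGINAL, SYNONYMS)
```

Both forms rely on SPARQL 1.1 features, so, as the slide notes, they only help where the target endpoint actually supports them.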

20

balloon Overflight Results

21

Results from a sounding balloon

22

balloon toolsuite

23

Statistics

• Datahub.io: Linked Open Data Cloud catalog
  • 337 datasets in total
  • 237 expose a SPARQL endpoint
  • 112 successfully queried for co-reference information
• Balloon dataset (first run)
  • 17.6M co-reference statements
  • 22.4M distinct URLs
  • 8.4M equivalence clusters (~2.68 identifiers per cluster)
• Pending analysis
  • Distribution of cluster sizes, number of different hosts per cluster
  • Main representative per cluster & false friends

24

Open Source:

• Demo, information and sources available (MIT License)
• X as a Service
  • SPARQL Rewriting (HTTP API)
  • Query Federation (SPARQL)

http://schlegel.github.io/balloon

25

Summary:

• SPARQL-based crawling of distributed co-reference information
• Exploit co-reference information for SPARQL federation

Single Point of Access

26

Any questions?

“Research is formalized curiosity. It is poking and prying with a purpose.” - Zora Neale Hurston