balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
DESWeb 2014, ICDE 2014, Chicago, IL, USA, March 3
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
Kai Schlegel ([email protected]), Florian Stegmaier, Sebastian Bayerl, Michael Granitzer, Harald Kosch
2
Outline
• Motivation
• SPARQL Rewriting & Federation
• Intermediate Results
supported by the European Commission under the Seventh Framework Program
3
“Linked Data is the heart of Semantic Web” - W3C Semantic Web Group
4
Huge Potential!
5
Developing with Linked Open Data
6
Motivation
• Easy access to Linked Data
  • Query Linked Open Data with SPARQL
  • Plethora of tools available
• Problems:
  • Business oriented
  • Complex setup
  • Maintenance
  • „Paper-only“
  • Not developer friendly
• Simple and „instant“ SPARQL Query Federation (-as-a-Service)
Nothing-as-a-Service
7
Querying LOD
• How to get information about the German city „Passau“?
• Problem: LOD is not a single database!
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
[Figure: the SPARQL query is sent to de.dbpedia.org, which answers with RDF: relations, coordinates, leader, etc.]
What about the population?
8
Distributed Querying!
• Problem: Selection of appropriate endpoints
• Send query to some endpoints and aggregate the results?
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
[Figure: the same SPARQL query is sent to de.dbpedia.org and linkedgeodata.org; linkedgeodata.org responds with „WHAT?“]
9
Misunderstanding: Co-Referencing
• Problem: Different identifiers for the same semantic concept
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
[Figure: the query uses the de.dbpedia.org identifier; linkedgeodata.org does not know it and responds with „WHAT?“]
Known problem in linguistics:
“It’s a spud!” “What?” “I mean potato!”
Co-Referencing: Multiple expressions refer to the same thing.
10
Problem = Solution?
• SPARQL-based crawling of co-reference information
• Exploit co-reference information for
  • accomplishing immediate SPARQL rewriting
  • performing endpoint selection
  • executing automatic query federation
Basic idea: Bundle distributed co-reference information into a unified view
Main principle: Semantic entities over identifiers!
11
Components
balloon toolsuite
12
balloon Overflight
• SPARQL-based crawling of LOD endpoints
  • Query: Ask for subjects and objects that are related by a special predicate
• Simplified global view on
  • Equivalence: owl:sameAs, skos:exactMatch, coref:coreferenceData, ...
• Graph database Neo4j
  • Equivalence cluster: Multiple synonym URIs representing the same semantic entity, including provenance
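The crawling and clustering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the balloon implementation: the function `crawl_query`, the class `EquivalenceClusters`, and the paging parameters are hypothetical names, and an in-memory union-find stands in for the Neo4j graph database mentioned on the slide. Only the two predicates whose URIs are standard are listed.

```python
from collections import defaultdict

# Equivalence predicates named on the slide (full URIs of owl:sameAs and skos:exactMatch).
EQUIVALENCE_PREDICATES = [
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
]


def crawl_query(predicate, limit=1000, offset=0):
    """Build a paged query asking for all subject/object pairs linked by `predicate`."""
    return (
        f"SELECT ?s ?o WHERE {{ ?s <{predicate}> ?o . }} "
        f"LIMIT {limit} OFFSET {offset}"
    )


class EquivalenceClusters:
    """Union-find over URIs; remembers which endpoint contributed each statement."""

    def __init__(self):
        self.parent = {}   # URI -> parent URI in the union-find forest
        self.sources = []  # (URI, endpoint) pairs used to derive provenance

    def find(self, uri):
        self.parent.setdefault(uri, uri)
        while self.parent[uri] != uri:
            self.parent[uri] = self.parent[self.parent[uri]]  # path halving
            uri = self.parent[uri]
        return uri

    def add_statement(self, subject, obj, endpoint):
        """Merge the clusters of subject and object; record the contributing endpoint."""
        root_s, root_o = self.find(subject), self.find(obj)
        if root_s != root_o:
            self.parent[root_o] = root_s
        self.sources.append((subject, endpoint))

    def clusters(self):
        """Return a list of (set of synonym URIs, set of contributing endpoints)."""
        members = defaultdict(set)
        for uri in list(self.parent):
            members[self.find(uri)].add(uri)
        provenance = defaultdict(set)
        for uri, endpoint in self.sources:
            provenance[self.find(uri)].add(endpoint)
        return [(uris, provenance[root]) for root, uris in members.items()]
```

Feeding every crawled (subject, object) pair through `add_statement` yields one cluster per semantic entity, together with the endpoints that contributed to it.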
13
balloon Fusion
SPARQL Federation setup using co-reference information
SPARQL transformation for each basic graph pattern (BGP):
1. Determine synonym URIs
2. Select suitable endpoints
3. Adapt sub-queries to endpoints
4. Federated querying
Example query:
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
14
1. Determine synonym URIs
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
15
2. Select suitable endpoints
• Provenance-based selection (PBS)
  • Endpoints that are involved in cluster composition
• Namespace-based selection (NBS)
  • Prefix and namespace matching of synonym URIs
Summarized: origin of co-reference information (PBS) and origin of synonym URIs (NBS)
16
2. Select suitable endpoints (2)
Assumption:
• Provenance information only contains „linkedgeodata.org“ as co-reference origin
• Namespaces for Freebase and DBpedia available (datahub.io)
Selected endpoints:
• PBS: LinkedGeoData endpoint
• NBS: DBpedia endpoint
• NBS: Freebase endpoint
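The two selection strategies can be illustrated with a short Python sketch. It assumes two hypothetical lookup tables, `PROVENANCE_TO_ENDPOINT` and `NAMESPACE_TO_ENDPOINT`, filled by hand with the endpoint URLs that appear in the federated query of this talk. In the real system such metadata would come from the crawl and from datahub.io; the function name `select_endpoints` is likewise only illustrative.

```python
# Hypothetical registry: co-reference origin -> SPARQL endpoint (used for PBS).
PROVENANCE_TO_ENDPOINT = {
    "linkedgeodata.org": "http://linkedgeodata.org/sparql/",
}

# Hypothetical registry: URI namespace -> SPARQL endpoint (used for NBS).
NAMESPACE_TO_ENDPOINT = {
    "http://de.dbpedia.org/resource/": "http://dbpedia.org/sparql",
    "http://rdf.freebase.com/ns/": "http://www.freebase.com/base/sparql",
}


def select_endpoints(synonyms, provenance):
    """Return (PBS endpoints, NBS endpoint -> matching synonym URIs)."""
    # PBS: every endpoint that contributed co-reference statements to the cluster.
    pbs = {
        PROVENANCE_TO_ENDPOINT[origin]
        for origin in provenance
        if origin in PROVENANCE_TO_ENDPOINT
    }
    # NBS: every endpoint whose namespace is a prefix of a synonym URI.
    nbs = {}
    for uri in synonyms:
        for namespace, endpoint in NAMESPACE_TO_ENDPOINT.items():
            if uri.startswith(namespace):
                nbs.setdefault(endpoint, []).append(uri)
    return pbs, nbs


passau_cluster = [
    "http://de.dbpedia.org/resource/Passau",
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
]
pbs_endpoints, nbs_endpoints = select_endpoints(passau_cluster, {"linkedgeodata.org"})
```

For the Passau cluster this selects the LinkedGeoData endpoint via PBS and the DBpedia and Freebase endpoints via NBS, matching the assumption stated on the slide.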
17
3. Adapt sub-queries to endpoints
Original query:
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
NBS: Freebase endpoint
SELECT ?p ?o WHERE { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. }
NBS: DBpedia endpoint
SELECT ?p ?o WHERE { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
PBS: LinkedGeoData endpoint
SELECT ?p ?o WHERE { { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } UNION { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. } UNION { <http://de.dbpedia.org/resource/Passau> ?p ?o. } }
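The adaptation step can be sketched as plain string rewriting in Python. This is an illustration under simplifying assumptions: the helper names are hypothetical, the projection is fixed to `?p ?o` as in the running example, and the URI is swapped by textual replacement rather than by operating on a parsed query as a real rewriter would. An NBS endpoint receives the pattern with its own synonym; a PBS endpoint receives a UNION over all synonyms.

```python
def rewrite_bgp(bgp, original_uri, synonym):
    """Replace the original identifier in the pattern with one synonym URI."""
    return bgp.replace(f"<{original_uri}>", f"<{synonym}>")


def nbs_subquery(bgp, original_uri, synonym):
    """Sub-query for a namespace-selected endpoint: one synonym only."""
    return f"SELECT ?p ?o WHERE {{ {rewrite_bgp(bgp, original_uri, synonym)} }}"


def pbs_subquery(bgp, original_uri, synonyms):
    """Sub-query for a provenance-selected endpoint: UNION over all synonyms."""
    branches = " UNION ".join(
        f"{{ {rewrite_bgp(bgp, original_uri, s)} }}" for s in synonyms
    )
    return f"SELECT ?p ?o WHERE {{ {branches} }}"


original = "http://de.dbpedia.org/resource/Passau"
pattern = f"<{original}> ?p ?o."
freebase_query = nbs_subquery(pattern, original, "http://rdf.freebase.com/ns/m.01h5td")
lgd_query = pbs_subquery(
    pattern,
    original,
    [
        "http://rdf.freebase.com/ns/m.01h5td",
        "http://linkedgeodata.org/triplify/node240057351",
        original,
    ],
)
```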
18
4. Federated Querying
• W3C SPARQL 1.1 Federated Query Extension (SERVICE)
  • (Partial) query can be executed against a remote SPARQL endpoint
• Distributed sub-queries don't contain SPARQL 1.1 features
SELECT ?p ?o WHERE {
  { SERVICE <http://dbpedia.org/sparql> { <http://de.dbpedia.org/resource/Passau> ?p ?o. } }
  UNION
  { SERVICE <http://www.freebase.com/base/sparql> { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. } }
  UNION
  { SERVICE <http://linkedgeodata.org/sparql/> {
      { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o. }
      UNION { <http://linkedgeodata.org/triplify/node240057351> ?p ?o. }
      UNION { <http://de.dbpedia.org/resource/Passau> ?p ?o. }
  } }
}
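Assembling the federated query from per-endpoint patterns might look like the following Python sketch. The function names `service_block` and `federated_query` are hypothetical. Each SERVICE clause is wrapped in its own group so that the blocks can be combined with UNION, which takes a braced group graph pattern on both sides.

```python
def service_block(endpoint, pattern):
    """Wrap a graph pattern in a braced SERVICE clause for one remote endpoint."""
    return f"{{ SERVICE <{endpoint}> {{ {pattern} }} }}"


def federated_query(projection, patterns_by_endpoint):
    """Combine one SERVICE block per endpoint with UNION into a single query."""
    blocks = " UNION ".join(
        service_block(endpoint, pattern)
        for endpoint, pattern in patterns_by_endpoint.items()
    )
    return f"SELECT {projection} WHERE {{ {blocks} }}"


query = federated_query(
    "?p ?o",
    {
        "http://dbpedia.org/sparql": "<http://de.dbpedia.org/resource/Passau> ?p ?o .",
        "http://www.freebase.com/base/sparql": "<http://rdf.freebase.com/ns/m.01h5td> ?p ?o .",
    },
)
```

The resulting string can be sent to any engine that supports the SPARQL 1.1 SERVICE keyword, which then dispatches the inner patterns to the remote endpoints.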
19
Optimizations (ongoing)
• Endpoint status check
  • Check routine in terms of availability and latency
• Minimize sub-queries
  • Group sub-queries with common endpoint
  • Push join to endpoint
• SPARQL features
  • Condense PBS UNION construct of synonym URIs
  • SPARQL 1.1 VALUES or FILTER with IN operator
  • Not well implemented in Linked Data endpoints
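To make the condensing idea concrete, the following Python sketch generates the two SPARQL 1.1 alternatives to the UNION construct. It is illustrative only: the function names are hypothetical, and it introduces a helper variable `?s` that is assumed not to occur elsewhere in the query.

```python
def values_pattern(synonyms):
    """Bind the subject to all synonym URIs with a SPARQL 1.1 VALUES block."""
    uris = " ".join(f"<{uri}>" for uri in synonyms)
    return f"VALUES ?s {{ {uris} }} ?s ?p ?o ."


def filter_in_pattern(synonyms):
    """Restrict the subject to the synonym URIs with FILTER and the IN operator."""
    uris = ", ".join(f"<{uri}>" for uri in synonyms)
    return f"?s ?p ?o . FILTER (?s IN ({uris}))"


synonyms = [
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://de.dbpedia.org/resource/Passau",
]
condensed_values = values_pattern(synonyms)
condensed_filter = filter_in_pattern(synonyms)
```

Both forms keep the pattern size constant as the cluster grows, but as the slide notes, they rely on SPARQL 1.1 support that many endpoints lack. The FILTER variant can also be expensive if an endpoint evaluates the unrestricted `?s ?p ?o` pattern first.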
20
balloon Overflight Results
21
Results from a sounding balloon
22
balloon toolsuite
23
Statistics
• Datahub.io: Linked Open Data Cloud catalog
  • 337 datasets in total
  • 237 expose a SPARQL endpoint
  • 112 successfully queried for co-reference information
• Balloon dataset (first run)
  • 17.6M co-reference statements
  • 22.4M distinct URLs
  • 8.4M equivalence clusters (~2.68 identifiers per cluster)
• Pending analysis
  • Distribution of cluster sizes, number of different hosts per cluster
  • Main representative per cluster & false friends
24
Open Source:
• Demo, information and sources available (MIT License)
• X as a Service
  • SPARQL Rewriting (HTTP API)
  • Query Federation (SPARQL)
http://schlegel.github.io/balloon
25
Summary:
• SPARQL-based crawling of distributed co-reference information
• Exploit co-reference information for SPARQL federation
• Single Point of Access
26
Any questions?
“Research is formalized curiosity. It is poking and prying with a purpose.” - Zora Neale Hurston