Semantics and optimisation of the SPARQL 1.1 federation extension
Semantics and optimization of the SPARQL 1.1 federation extension
Carlos Buil Aranda (1), Marcelo Arenas (2), Oscar Corcho (1)
11th November, 2010, Madrid, Spain
(1) Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid
(2) Departamento de Ciencias de la Computación, Pontificia Universidad Católica de Chile
Introduction
• How many of you have needed to query distributed SPARQL endpoints?
Example
• Using the Pubmed references obtained from the Geneid gene dataset, retrieve information about genes and their references in the Pubmed dataset.
• From Pubmed we access the information in the National Library of Medicine’s controlled vocabulary thesaurus, stored at the MeSH endpoint, so we have more complete information about such genes.
• Finally, we also access the HHPID endpoint, which is the knowledge base for the HIV-1 protein.
Introduction
[Figure: the four datasets and the triple patterns used to link them]
• HHPID: ?int <hhpid:elementGene2> ?gene1
• GeneID: ?gene1 <geneid:pubmed_xref> ?pubmed
• Pubmed: {?pubmed <pubmed:meshref> ?mesh . ?mesh <pubmed:descriptor> ?descriptor .}
• MeSH: ?meshReference <owl:sameAs> ?descriptor
Introduction
• Given SPARQL 1.0, how do you run such queries?
• Option 1: Make local copies of all those graphs in your favourite triple store, separated into different named graphs / contexts, and evaluate a single query over the whole set of graphs.
• Option 2: Send individual queries to each SPARQL endpoint, and join the results programmatically on the client side. Highly inefficient.
• Option 3: Use one of the existing distributed query processing extensions: DARQ, NetworkedGraphs, ARQ, etc.
SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
  ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
}
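Option 2 above (query each endpoint separately, then join on the client) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `join`, the sample bindings and the identifiers in them are all hypothetical, and the sketch assumes every solution in a result list binds the same variables (as for a basic graph pattern without OPTIONAL).

```python
from collections import defaultdict


def join(left, right):
    """Hash-join two lists of solution mappings (dicts) on their shared variables.

    Assumption: all mappings within one list bind the same set of variables.
    """
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Index the right-hand results by the values of the shared variables.
    index = defaultdict(list)
    for nu in right:
        index[tuple(nu[v] for v in shared)].append(nu)
    # Probe the index with each left-hand mapping and merge compatible pairs.
    result = []
    for mu in left:
        for nu in index.get(tuple(mu[v] for v in shared), []):
            result.append({**mu, **nu})
    return result


# Hypothetical results, as if fetched separately from the HHPID and GeneID endpoints.
hhpid_rows = [{"interaction": "hhpid:i1", "gene1": "geneid:155030"}]
geneid_rows = [
    {"gene1": "geneid:155030", "pubmed": "pubmed:111"},
    {"gene1": "geneid:999", "pubmed": "pubmed:222"},
]
joined = join(hhpid_rows, geneid_rows)
```

Every intermediate result has to travel to the client before it can be filtered by the join, which is one reason this option scales poorly.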
Introduction: SPARQL 1.1 Federation Extension
• Allows specifying queries over distributed SPARQL endpoints
• New operator: SERVICE a P
• We may combine local and remote SPARQL endpoints, depending on the characteristics of the data that we are handling
SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  SERVICE <http://quebec.hhpid.bio2rdf.org/sparql> {
    ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  }
  SERVICE <http://127.0.0.1:2020/sparql-geneid> {
    ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  }
  SERVICE <http://pubmed.bio2rdf.org/sparql> {
    ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
    ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  }
  SERVICE <http://127.0.0.1:2021/sparql-mesh> {
    ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
  }
}
Table of Contents
• Introduction
• Syntax and Semantics
• SERVICE evaluation
• Optimising Federated Queries
• Implementation
• Conclusions
Syntax & Semantics Preliminaries
SPARQL 1.1 SERVICE syntax
• Queries are of the form:
SELECT * WHERE {
  P2 . P3 . SERVICE a {P1} ...
}
• “a” is an IRI or a variable, so it can be:
  • A predefined service endpoint, e.g., <http://quebec.hhpid.bio2rdf.org/sparql>
  • A variable: SERVICE ?X {P1}
SPARQL 1.1 SERVICE Semantics
• We extend [PAG09] with the semantics for SERVICE:
[PAG09] J. Pérez, M. Arenas and C. Gutiérrez. Semantics and complexity of SPARQL. TODS 34(3), 2009
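The formula that accompanied this slide is not part of the transcript. As a hedged reconstruction, assuming it matches the definition given in the authors' accompanying paper, the semantics can be written as below. Here ep is taken to be a partial function from IRIs to the RDF graphs served at those endpoints, I the set of IRIs, V the set of variables, and \mu_\emptyset the mapping with empty domain; these symbols are assumptions of this sketch rather than text recovered from the slide.

```latex
\[
[\![(\mathrm{SERVICE}\ c\ P_1)]\!]_G =
\begin{cases}
[\![P_1]\!]_{ep(c)} & \text{if } c \in \mathrm{dom}(ep),\\[4pt]
\{\mu_\emptyset\} & \text{if } c \in I \setminus \mathrm{dom}(ep),\\[4pt]
\{\mu \cup \mu_{?X \to s} \mid s \in \mathrm{dom}(ep),\ \mu \in [\![P_1]\!]_{ep(s)},\ {?X} \notin \mathrm{dom}(\mu)\} & \text{if } c = {?X} \in V.
\end{cases}
\]
```

Under this reading, the third case ranges over every endpoint in dom(ep) whenever the variable has no value yet.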
SPARQL 1.1 SERVICE Semantics
• So, if we find SERVICE ?X P1, do we have to send queries to every single endpoint in the world?
SERVICE Evaluation
• What happens when there is a variable ?X in SERVICE ?X P?
  • ?X must be bound in order to evaluate the pattern
  • That is, ?X needs to have a value when the SERVICE operator is evaluated
• Examples:
P1 = (SELECT {?X, ?N, ?E} WHERE {(?X, service_address, ?Y) AND (SERVICE ?Y {?N, email, ?E})})
P2 = (SELECT {?X, ?N, ?E} WHERE {((?X, service_description, ?Z) UNION (?X, service_address, ?Y)) AND ((SERVICE ?Z {?N, email, ?E}) UNION (SERVICE ?Y {?N, email, ?E}))})
• To evaluate P1 and P2 safely, we must enforce a specific evaluation order
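The evaluation order required for P1 can be sketched as follows: the local triple pattern is evaluated first so that ?Y has a value, and only then is the SERVICE pattern sent to that endpoint. This is an illustrative sketch, not the authors' implementation. The in-memory `endpoints` dictionary stands in for remote SPARQL endpoints, and all names and data are hypothetical. As a simplification, addresses that are not known endpoints are skipped.

```python
def match(graph, predicate):
    """Return (subject, object) pairs of all triples in `graph` with this predicate."""
    return [(s, o) for (s, p, o) in graph if p == predicate]


def evaluate_p1(local_graph, endpoints):
    """Evaluate P1 in a safe order: bind ?Y locally, then call SERVICE ?Y."""
    results = []
    # Step 1: evaluate (?X, service_address, ?Y) so that ?Y is bound.
    for x, y in match(local_graph, "service_address"):
        remote_graph = endpoints.get(y)
        if remote_graph is None:
            continue  # simplification: ignore addresses that are not endpoints
        # Step 2: only now evaluate SERVICE ?Y (?N, email, ?E) at that endpoint.
        for n, e in match(remote_graph, "email"):
            # Projection onto {?X, ?N, ?E}, as in the SELECT clause of P1.
            results.append({"X": x, "N": n, "E": e})
    return results


# Hypothetical data: a local directory of services and two "remote" graphs.
local_graph = [
    ("dir1", "service_address", "http://example.org/a"),
    ("dir2", "service_address", "http://example.org/b"),
]
endpoints = {
    "http://example.org/a": [("alice", "email", "alice@example.org"),
                             ("alice", "name", "Alice")],
    "http://example.org/b": [("bob", "email", "bob@example.org")],
}
p1_results = evaluate_p1(local_graph, endpoints)
```

Evaluating the SERVICE part first would leave ?Y without a value, so there would be no single endpoint to send the pattern to.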
SERVICE Evaluation – Ingredients and Informal Definitions
• Boundedness (of a variable in a query)
  • ?Y is bound in P1 = ((?X, service_address, ?Y) AND (SERVICE ?Y (?N, email, ?E)))
  • ?Y is not bound in P1' = (((?X, service_address, ?Z) OPT (?X, service_desc, ?Y)) AND (SERVICE ?Y (?N, email, ?E)))
  • However, checking this is undecidable
• Strong Boundedness (of a variable in a query)
  • We impose some syntactic conditions (details in the paper)
• Service Boundedness
  • Based on the parse tree of the query: if a node SERVICE ?X P has an ancestor in which ?X is bound, the service is bound
  • However, checking this is undecidable
• Service Safeness
  • Hence we impose syntactic conditions: we check whether variable ?X is strongly bound
SERVICE Evaluation - Boundedness
SERVICE Evaluation - Strong Boundedness
SERVICE Evaluation - Safeness
• Given a SPARQL query P, define T(P) as the parse tree of P. In this tree, every node corresponds to a sub-pattern of P.
• Definition (service-boundedness): A SPARQL query P is service-bound if for every node u of T(P) with label (SERVICE ?X P1), it holds that:
  • There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is bound in P2
  • P1 is service-bound
• Theorem: The problem of verifying, given a SPARQL query P, whether P is service-bound is undecidable.
• Definition (service-safeness): A SPARQL query P is service-safe if for every node u of T(P) with label (SERVICE ?X P1), it holds that:
  • There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is strongly bound in P2
  • P1 is service-safe
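The service-safeness definition above can be checked with a walk over the parse tree. The sketch below is only an approximation under stated assumptions: the slides defer the exact conditions for strong boundedness to the paper, so the rules in `strongly_bound` (union for AND, intersection for UNION, left operand only for OPT, nothing contributed by SERVICE) are assumed here, and the tuple-based pattern encoding is hypothetical.

```python
def is_var(term):
    """Variables are written as strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def strongly_bound(p):
    """Assumed syntactic approximation of the variables that are always bound in p."""
    kind = p[0]
    if kind == "triple":
        return {t for t in p[1:] if is_var(t)}
    if kind == "and":
        return strongly_bound(p[1]) | strongly_bound(p[2])
    if kind == "union":
        return strongly_bound(p[1]) & strongly_bound(p[2])
    if kind == "opt":
        return strongly_bound(p[1])  # the optional side may not match
    if kind == "service":
        return set()  # assumption: a SERVICE node contributes no guaranteed bindings
    raise ValueError(f"unknown pattern kind: {kind}")


def is_service_safe(p, ancestors=()):
    """Check the two conditions of service-safeness on the parse tree of p."""
    kind = p[0]
    if kind == "triple":
        return True
    if kind == "service":
        c, inner = p[1], p[2]
        # Condition 1: some ancestor pattern strongly binds the service variable.
        if is_var(c) and not any(c in strongly_bound(a) for a in ancestors):
            return False
        # Condition 2: the inner pattern is service-safe as a query of its own.
        return is_service_safe(inner)
    return all(is_service_safe(child, ancestors + (p,)) for child in p[1:])


email = ("triple", "?N", "email", "?E")
p_and = ("and", ("triple", "?X", "service_address", "?Y"), ("service", "?Y", email))
p_opt = ("and",
         ("opt", ("triple", "?X", "service_address", "?Z"),
                 ("triple", "?X", "service_desc", "?Y")),
         ("service", "?Y", email))
```

With these assumed rules, `p_and` passes the check because the enclosing AND strongly binds ?Y, while `p_opt` fails because ?Y occurs only on the optional side.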
Optimising Federated Queries
• Well-designed patterns [PAG09]
Optimising federated queries with well-designed patterns
• We extended the notion of well-designed patterns to SPARQL 1.1 Federated Query
• The following rules (from [PAG09]) also hold for SERVICE:
• Proposition: If P is a well-designed pattern and Q is obtained from P by applying either (1) or (2) or (3), then Q is a well-designed pattern equivalent to P.
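The rules (1) to (3) mentioned above were shown as formulas that are not in the transcript. Assuming they are the rewriting rules for well-designed patterns given in [PAG09], they would read as follows, where R is a built-in filter condition. The numbering is an assumption and may not match the order on the original slide.

```latex
\begin{align}
((P_1\ \mathrm{OPT}\ P_2)\ \mathrm{FILTER}\ R) &\longrightarrow ((P_1\ \mathrm{FILTER}\ R)\ \mathrm{OPT}\ P_2) \tag{1}\\
(P_1\ \mathrm{AND}\ (P_2\ \mathrm{OPT}\ P_3)) &\longrightarrow ((P_1\ \mathrm{AND}\ P_2)\ \mathrm{OPT}\ P_3) \tag{2}\\
((P_1\ \mathrm{OPT}\ P_2)\ \mathrm{AND}\ P_3) &\longrightarrow ((P_1\ \mathrm{AND}\ P_3)\ \mathrm{OPT}\ P_2) \tag{3}
\end{align}
```

Read this way, each rule moves a FILTER or an AND inside the mandatory side of an OPT, so the more selective part of the query can be evaluated before the optional part.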
Implementation: OGSA-DAI
• Implemented on top of OGSA-DAI/OGSA-DQP
  • Extensible framework to access, integrate, transform and deliver distributed and heterogeneous sources of data
  • Implements part of the WS-DAI specification
  • Service Oriented Data Access (direct and indirect access)
  • Distributed Query Processing
• Features
  • RDF extension (available in the official OGSA-DAI release)
  • We process
    • SERVICE IRI P
    • AND, OPTIONAL, UNION
    • Most common FILTERs (<, =, >)
    • Solution modifiers
• Upcoming features (currently missing)
  • SERVICE ?X P
Implementation: Evaluation
• Benchmark test
  • Existing benchmarks (Berlin SPARQL Benchmark and SP2Bench) were not suitable (no distributed queries), and other benchmarks were at an early stage
  • Focus on life sciences queries: bio2rdf.org project
  • Seven queries of increasing complexity
  • http://www.oeg-upm.net/files/sparql-dqp/
• Bio2rdf datasets
  • bio2rdf.org: 2.3 billion triples
  • Used Entrez Gene (13 million triples), Pubmed (797 million triples), HHPID (244,091 triples) and MeSH (689,542 triples)
  • Downloaded some datasets (HHPID and Pubmed) and divided them into several endpoints of 300,000 triples
• Hardware used:
  • Intel Core 2 Duo, 2.50 GHz, 3 GB RAM
Results
Conclusions
• Formalisation of the SPARQL 1.1 Basic Federation Extension syntax and semantics
• Safeness conditions in the evaluation of SERVICE in the presence of variables
• Simple query optimisation based on an extension of well-designed patterns
• Implementation based on a robust data-access system like OGSA-DAI
  • Focused on large-scale data sources
  • More optimisations can be easily included
  • Indirect data access mode (you send the query, it sends you a handler to where the result will be placed, and you can use that resource).
Acknowledgements
• Implementation:
  • OGSA-DAI team (especially Ally Hume)
• Query generation:
  • Bio2RDF project team (especially Marc-Alexandre Nolin)
• Heavy discussions on syntax and semantics
  • Jorge Pérez
• Funding
  • ADMIRE Project (FP7-ICT-215024)
  • FONDECYT grant 1090565