Page 1: Semantics and optimisation of the SPARQL1.1 federation extension

Semantics and optimization of the SPARQL 1.1 federation extension

Carlos Buil Aranda (1), Marcelo Arenas (2), Oscar Corcho (1)

11th November, 2010, Madrid, Spain

(1) Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid

(2) Departamento Ciencias de la Computacion, Pontificia Universidad Católica de Chile

Page 2: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

• How many of you have needed to query distributed SPARQL endpoints?

Example

• Using the Pubmed references obtained from the Geneid gene dataset, retrieve information about genes and their references in the Pubmed dataset.

• From Pubmed we access the information in the National Library of Medicine’s controlled vocabulary thesaurus, stored at the MeSH endpoint, so we have more complete information about such genes.

• Finally, we also access the HHPID endpoint, which is the knowledge base for the HIV-1 protein.

Page 3: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

[Figure: the example query split across four endpoints]

• HHPID: ?int <hhpid:elementGene2> ?gene1

• GeneID: ?gene1 <geneid:pubmed_xref> ?pubmed

• Pubmed: {?pubmed <pubmed:meshref> ?mesh . ?mesh <pubmed:descriptor> ?descriptor .}

• MeSH: ?meshReference <owl:sameAs> ?descriptor

Page 4: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction

Given SPARQL 1.0, how do you run those queries?

• Option 1: Make local copies of all those graphs in your favourite triple store, separated into different named graphs / contexts, and evaluate a single query over the whole set of graphs.

• Option 2: Send individual queries to each SPARQL endpoint and join the results programmatically on the client side. Highly inefficient.

• Option 3: Use one of the existing distributed query processing extensions: DARQ, NetworkedGraphs, ARQ, etc.
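Option 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the result sets are hypothetical stand-ins for what each endpoint might return, and the identifiers in them are invented. The client joins solution mappings on their shared variables, which is exactly the work a federated query engine would otherwise do.

```python
from typing import Dict, List

Binding = Dict[str, str]  # variable name -> value


def compatible(a: Binding, b: Binding) -> bool:
    """Two solutions are compatible if they agree on every shared variable."""
    return all(b[var] == value for var, value in a.items() if var in b)


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Nested-loop join of two result sets on their shared variables."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


# Hypothetical per-endpoint results (all identifiers are made up).
hhpid = [{"interaction": "hhpid:i1", "gene1": "geneid:g1"}]
geneid = [
    {"gene1": "geneid:g1", "pubmed": "pubmed:p1"},
    {"gene1": "geneid:g2", "pubmed": "pubmed:p2"},
]
pubmed = [{"pubmed": "pubmed:p1", "mesh": "mesh:m1", "descriptor": "mesh:d1"}]

result = join(join(hhpid, geneid), pubmed)
print(result)
```

Every intermediate result set has to be shipped to the client before it can be joined, which is one reason this option scales poorly.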

SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
  ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
}

Page 5: Semantics and optimisation of the SPARQL1.1 federation extension

Introduction: SPARQL 1.1 Federation Extension

• Allows specifying queries over distributed SPARQL endpoints

• New operator: SERVICE a P

• We may combine local and remote SPARQL endpoints, depending on the characteristics of the data that we are handling

SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference
WHERE {
  SERVICE <http://quebec.hhpid.bio2rdf.org/sparql> {
    ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .
  }
  SERVICE <http://127.0.0.1:2020/sparql-geneid> {
    ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .
  }
  SERVICE <http://pubmed.bio2rdf.org/sparql> {
    ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh .
    ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor .
  }
  SERVICE <http://127.0.0.1:2021/sparql-mesh> {
    ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .
  }
}

Page 6: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 7: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 8: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 9: Semantics and optimisation of the SPARQL1.1 federation extension

Syntax & Semantics Preliminaries

Page 10: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL1.1 SERVICE syntax

• Queries are of the form:

SELECT * WHERE {
  P2 . P3 . SERVICE a {P1} ...
}

• “a” is an IRI or a variable, so it can be:

• A predefined service endpoint, e.g., <http://quebec.hhpid.bio2rdf.org/sparql>

• A variable: SERVICE ?X {P1}

Page 11: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL 1.1 SERVICE Semantics

• We extend [PAG09] with the semantics for SERVICE:

[PAG09] J. Pérez, M. Arenas and C. Gutiérrez. Semantics and complexity of SPARQL. TODS 34(3), 2009

[Equation: formal semantics of (SERVICE a P1); not preserved in this transcript]

Page 12: Semantics and optimisation of the SPARQL1.1 federation extension

SPARQL 1.1 SERVICE Semantics

• So, if we find SERVICE ?X P1, do we have to send queries to every single endpoint in the world?
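The question above can be made concrete with a toy evaluator. This is a sketch under stated assumptions, not the definition from the slide (which is not preserved in this transcript): `ep` is assumed to be a partial function from endpoint IRIs to graphs, modelled as a dictionary; an IRI outside its domain is assumed to yield the single empty mapping; and an unbound variable is assumed to range over every known endpoint. The endpoint IRIs and data are invented.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Graph = List[Triple]
Mapping = Dict[str, str]


def eval_service(c: str,
                 eval_p1: Callable[[Graph], List[Mapping]],
                 ep: Dict[str, Graph]) -> List[Mapping]:
    """Toy evaluation of (SERVICE c P1) over the endpoints in ep."""
    if c.startswith("?"):
        # Variable case: try every endpoint and bind the variable to its IRI.
        solutions = []
        for iri, graph in ep.items():
            for mu in eval_p1(graph):
                if mu.get(c, iri) == iri:  # keep only compatible mappings
                    solutions.append({**mu, c: iri})
        return solutions
    if c in ep:
        # Known endpoint: evaluate P1 over that endpoint's graph.
        return eval_p1(ep[c])
    # Unknown endpoint (assumption): the single empty mapping.
    return [{}]


def emails(graph: Graph) -> List[Mapping]:
    """Evaluates the triple pattern (?N, email, ?E) over one graph."""
    return [{"?N": s, "?E": o} for (s, p, o) in graph if p == "email"]


ep = {
    "http://a.example/sparql": [("alice", "email", "alice@a.example")],
    "http://b.example/sparql": [("bob", "email", "bob@b.example")],
}

everywhere = eval_service("?X", emails, ep)
one_endpoint = eval_service("http://a.example/sparql", emails, ep)
unknown = eval_service("http://nowhere.example/sparql", emails, ep)
print(len(everywhere), len(one_endpoint), unknown)
```

With only two endpoints in `ep` the variable case is harmless, but on the open Web the loop over `ep` is exactly the "every single endpoint in the world" problem.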

Page 13: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 14: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation

• What happens when there is a variable ?X in SERVICE ?X P?

• ?X must be bound in order to evaluate the pattern

• That is, ?X needs to have a value when the SERVICE operator is evaluated

• Examples:

P1 = (SELECT {?X, ?N, ?E} WHERE {(?X, service_address, ?Y) AND (SERVICE ?Y {?N, email, ?E})})

P2 = (SELECT {?X, ?N, ?E} WHERE {((?X, service_description, ?Z) UNION (?X, service_address, ?Y)) AND ((SERVICE ?Z {?N, email, ?E}) UNION (SERVICE ?Y {?N, email, ?E}))})

• To evaluate P1 and P2 safely, the evaluation order must guarantee that the SERVICE variable already has a value when the SERVICE operator is evaluated.
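For P1, one such evaluation order can be sketched as below: evaluate the triple pattern locally first, then contact only the endpoints that ?Y was actually bound to. This is an illustrative toy, not the OGSA-DAI implementation; the graphs, the `evaluate_p1` helper and the endpoint IRIs are hypothetical, and addresses that are not known endpoints are simply skipped here for brevity.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Mapping = Dict[str, str]


def evaluate_p1(local_graph: List[Triple],
                ep: Dict[str, List[Triple]]) -> Tuple[List[Mapping], List[str]]:
    """Evaluates P1 left to right so that ?Y is bound before SERVICE runs."""
    contacted: List[str] = []
    solutions: List[Mapping] = []
    # Step 1: (?X, service_address, ?Y) over the local graph binds ?Y.
    for x, p, y in local_graph:
        if p != "service_address":
            continue
        if y not in ep:
            continue  # simplification: ignore addresses that are not endpoints
        # Step 2: SERVICE ?Y {?N, email, ?E} is sent only to the bound endpoint.
        contacted.append(y)
        for n, q, e in ep[y]:
            if q == "email":
                # Projection onto {?X, ?N, ?E}, as in the SELECT clause.
                solutions.append({"?X": x, "?N": n, "?E": e})
    return solutions, contacted


local_graph = [
    ("registry:svc1", "service_address", "http://a.example/sparql"),
    ("registry:svc1", "label", "Service one"),
]
ep = {
    "http://a.example/sparql": [("alice", "email", "alice@a.example")],
    "http://b.example/sparql": [("bob", "email", "bob@b.example")],
}

solutions, contacted = evaluate_p1(local_graph, ep)
print(solutions, contacted)
```

The second endpoint is never contacted, because no solution of the first pattern binds ?Y to it.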

Page 15: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation – Ingredients and Informal Definitions

• Boundedness (of a variable in a query)

• ?Y is bound in P1 = ((?X, service_address, ?Y) AND (SERVICE ?Y (?N, email, ?E)))

• ?Y is not bound in P1' = (((?X, service_address, ?Z) OPT (?X, service_desc, ?Y)) AND (SERVICE ?Y (?N, email, ?E)))

• However, checking boundedness is undecidable

• Strong Boundedness (of a variable in a query)

• We impose some syntactic conditions (details in the paper)

• Service Boundedness

• Based on the parse tree of the query: if we find a SERVICE ?X P that has an ancestor in which ?X is bound, the service is bound

• However, checking this is undecidable

• Service Safeness

• Hence we impose syntactic conditions: we check whether variable ?X is strongly bound

Page 16: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Boundedness

Page 17: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Strong Boundedness

Page 18: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Safeness

• Given a SPARQL query P, define T(P) as the parse tree of P. In this tree, every node corresponds to a sub-pattern of P.

Page 19: Semantics and optimisation of the SPARQL1.1 federation extension

SERVICE Evaluation - Safeness

• Definition (service-boundedness): A SPARQL query P is service-bound if for every node u of T(P) with label (SERVICE ?X P1), it holds that:

• There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is bound in P2

• P1 is service-bound

• Theorem: The problem of verifying, given a SPARQL query P, whether P is service-bound is undecidable.

• Definition (service-safeness): A SPARQL query P is service-safe if for every node u of T(P) with label (SERVICE ?X P1), it holds that:

• There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is strongly bound in P2

• P1 is service-safe
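The service-safeness definition above is purely syntactic, so it can be checked by walking the parse tree. The sketch below is hedged in two ways. First, the slides defer the syntactic conditions for strong boundedness to the paper, so the rules in `sb` are an assumption made for illustration: all variables of a triple pattern, union for AND, intersection for UNION, left side only for OPT and FILTER, projection for SELECT, and nothing for SERVICE. Second, the tuple-based pattern encoding is invented for this example.

```python
from typing import Set, Tuple


def is_var(term: object) -> bool:
    return isinstance(term, str) and term.startswith("?")


def sb(p: tuple) -> Set[str]:
    """Assumed set of strongly bound variables of pattern p."""
    kind = p[0]
    if kind == "triple":                      # ("triple", s, p, o)
        return {t for t in p[1:] if is_var(t)}
    if kind == "and":                         # ("and", P1, P2)
        return sb(p[1]) | sb(p[2])
    if kind == "union":                       # ("union", P1, P2)
        return sb(p[1]) & sb(p[2])
    if kind in ("opt", "filter"):             # ("opt", P1, P2), ("filter", P1, R)
        return sb(p[1])
    if kind == "select":                      # ("select", W, P1)
        return set(p[1]) & sb(p[2])
    if kind == "service":                     # ("service", c, P1)
        return set()
    raise ValueError(f"unknown pattern kind: {kind}")


def service_safe(p: tuple, ancestors: Tuple[tuple, ...] = ()) -> bool:
    """Checks both conditions of the definition on the parse tree of p."""
    kind = p[0]
    if kind == "triple":
        return True
    if kind == "service":
        c, p1 = p[1], p[2]
        if is_var(c):
            # Condition 1: some ancestor strongly binds the variable.
            if not any(c in sb(a) for a in ancestors):
                return False
            # Condition 2: P1 must be service-safe as a query of its own.
            return service_safe(p1)
        return service_safe(p1, ancestors + (p,))
    if kind == "filter":
        children = [p[1]]
    elif kind == "select":
        children = [p[2]]
    else:
        children = [p[1], p[2]]
    return all(service_safe(ch, ancestors + (p,)) for ch in children)


email = ("triple", "?N", "email", "?E")
p1 = ("select", ["?X", "?N", "?E"],
      ("and", ("triple", "?X", "service_address", "?Y"),
              ("service", "?Y", email)))
p_opt = ("and",
         ("opt", ("triple", "?X", "service_address", "?Z"),
                 ("triple", "?X", "service_desc", "?Y")),
         ("service", "?Y", email))

print(service_safe(p1), service_safe(p_opt))
```

In `p1` the AND node is an ancestor of the SERVICE node and strongly binds ?Y. In `p_opt` the only occurrence of ?Y outside SERVICE is on the optional side, so no ancestor strongly binds it.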

Page 20: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 21: Semantics and optimisation of the SPARQL1.1 federation extension

Optimising Federated Queries

• Well-designed patterns [PAG09]

Page 22: Semantics and optimisation of the SPARQL1.1 federation extension

Optimising federated queries with well-designed patterns

• We extended the notion of well-designed patterns for SPARQL1.1 Federated Query

• The following rules (from [PAG09]) also hold for SERVICE:

• Proposition: If P is a well-designed pattern and Q is obtained from P by applying either (1), (2) or (3), then Q is a well-designed pattern equivalent to P.
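The rules (1) to (3) referred to in the proposition are not preserved in this transcript. The rewriting rules for well-designed patterns usually associated with [PAG09] are the following; the numbering shown here is an assumption and may not match the original slide.

```latex
\begin{align*}
((P_1 \ \mathrm{OPT}\ P_2) \ \mathrm{FILTER}\ R) &\longrightarrow ((P_1 \ \mathrm{FILTER}\ R) \ \mathrm{OPT}\ P_2) \tag{1}\\
(P_1 \ \mathrm{AND}\ (P_2 \ \mathrm{OPT}\ P_3)) &\longrightarrow ((P_1 \ \mathrm{AND}\ P_2) \ \mathrm{OPT}\ P_3) \tag{2}\\
((P_1 \ \mathrm{OPT}\ P_2) \ \mathrm{AND}\ P_3) &\longrightarrow ((P_1 \ \mathrm{AND}\ P_3) \ \mathrm{OPT}\ P_2) \tag{3}
\end{align*}
```

Each rule pushes OPT outwards, so that the mandatory part of the query (AND and FILTER) is evaluated first and produces fewer intermediate solutions before the optional parts are added.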

Page 23: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 24: Semantics and optimisation of the SPARQL1.1 federation extension

Implementation: OGSA-DAI

• Implemented on top of OGSA-DAI/OGSA-DQP

• Extensible framework to access, integrate, transform and deliver distributed and heterogeneous sources of data

• Implements part of the WS-DAI specification

• Service Oriented Data Access (direct and indirect access)

• Distributed Query Processing

• Features

• RDF extension (available in the official OGSA-DAI release)

• We process

• SERVICE IRI P

• AND, OPTIONAL, UNION

• Most common FILTERs (<, =, >)

• Solution modifiers

• Features still missing (coming)

• SERVICE ?X P

Page 25: Semantics and optimisation of the SPARQL1.1 federation extension

Implementation: Evaluation

• Benchmark test

• Existing benchmarks (Berlin SPARQL Benchmark and SP2Bench) were not suitable (no distributed queries), and other benchmarks were at an early stage

• Focus on life sciences queries: bio2rdf.org project

• Seven queries of increasing complexity

• http://www.oeg-upm.net/files/sparql-dqp/

• Bio2rdf datasets

• bio2rdf.org: 2.3 billion triples

• Used Entrez Gene (13 million triples), Pubmed (797 million triples), HHPID (244,091 triples) and MeSH (689,542 triples)

• Downloaded some datasets (HHPID and Pubmed) and divided them into several endpoints of 300,000 triples each

• Hardware used:

• Intel Core 2 Duo, 2.50 GHz, 3 GB RAM

Page 26: Semantics and optimisation of the SPARQL1.1 federation extension

Results

Page 27: Semantics and optimisation of the SPARQL1.1 federation extension

Table of Contents

• Introduction

• Syntax and Semantics

• SERVICE evaluation

• Optimising Federated Queries

• Implementation

• Conclusions

Page 28: Semantics and optimisation of the SPARQL1.1 federation extension

Conclusions

• Formalisation of the SPARQL 1.1 Basic Federation Extension syntax and semantics

• Safeness conditions in the evaluation of SERVICE in the presence of variables

• Simple query optimisation based on an extension of well-designed patterns

• Implementation based on a robust data-access system like OGSA-DAI

• Focused on large-scale data sources

• More optimisations can be easily included

• Indirect data access mode (you send the query, it sends you a handler to where the result will be placed, and you can use that resource).

Page 29: Semantics and optimisation of the SPARQL1.1 federation extension

Semantics and optimization of the SPARQL 1.1 federation extension

Acknowledgements

• Implementation: OGSA-DAI team (especially Ally Hume)

• Query generation: Bio2RDF project team (especially Marc-Alexandre Nolin)

• Heavy discussions on syntax and semantics: Jorge Pérez

• Funding: ADMIRE Project (FP7-ICT-215024), FONDECYT grant 1090565
