Strategies for executing federated queries in SPARQL 1.1
Carlos Buil-Aranda (1), Axel Polleres (2) and Jürgen Umbrich (2)
(1) Center for Semantic Web Research (CIWS), DCC, PUC, Chile
(2) WU Wien (Vienna University of Economics & Business)
SPARQL Federated Query, i.e. query all these databases as if they were a single one
How can I query all these data?
I want a list of mouse phenotypes and their symbols
Now I want to combine the symbols with standard scientific terminology
SPARQL 1.1 Federated Query Extension
Use of the SERVICE keyword
VALUES operator for shipping data to the remote endpoint
SELECT ?mgi ?symbol ?status WHERE {
  SERVICE <http://mgi.bio2rdf.org/sparql> {
    ?mgi xSymbol ?symbol .
    ?mgi xHGNX ?xhgnc
  }
  SERVICE <http://hgnc.bio2rdf.org/sparql> {
    ?xhgnc Status ?status
  }
}
Some problems related to query federation:
Timeouts
Result set incompleteness
We will try to find algorithms for dealing with these problems in SPARQL query federation
How is the federation implemented? (general idea)
The systems basically want to reduce the amount of data transmitted and the amount of processing time needed by the remote server
As users, we want sound and complete results
Two ideas for the query federation algorithms:
• Query one dataset and use its results to constrain the query to the next dataset
• Query both datasets and join their results locally
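The two ideas can be compared with a small in-memory sketch. Everything here is an assumption made for illustration: the `MGI` and `HGNC` lists are toy stand-ins for the remote endpoints (the `hgnc:99999` row is invented), and `query_hgnc` only simulates a remote call. The point is that both plans return the same join, while the constrained plan asks the second endpoint for fewer rows.

```python
# Toy stand-ins for two remote endpoints (hypothetical data, not real Bio2RDF content).
MGI = [
    {"mgi": "mgi:1913386", "symbol": "Znrd1", "xhgnc": "hgnc:13182"},
    {"mgi": "mgi:3039616", "symbol": "Znrf3", "xhgnc": "hgnc:18126"},
]
HGNC = [
    {"xhgnc": "hgnc:13182", "status": "Approved"},
    {"xhgnc": "hgnc:18126", "status": "Approved"},
    {"xhgnc": "hgnc:99999", "status": "Withdrawn"},  # invented row, never referenced by MGI
]


def query_hgnc(allowed=None):
    """Simulate the second endpoint; `allowed` restricts ?xhgnc when given."""
    return [row for row in HGNC if allowed is None or row["xhgnc"] in allowed]


def join(left, right, key):
    """Plain equi-join on a single shared variable."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def constrained_plan():
    """Idea 1: use the first result set to constrain the second query."""
    right = query_hgnc(allowed={row["xhgnc"] for row in MGI})
    return join(MGI, right, "xhgnc"), len(right)


def local_join_plan():
    """Idea 2: fetch both result sets in full and join them locally."""
    right = query_hgnc()
    return join(MGI, right, "xhgnc"), len(right)


constrained_rows, constrained_shipped = constrained_plan()
local_rows, local_shipped = local_join_plan()
```

In this toy setup the constrained plan receives two rows from the second endpoint and the local plan receives three, for the same final answer.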
SPARQL Query Federation Algorithms
Use combinations of SPARQL operators
VALUES (implicitly used)
FILTER
UNION (a variant used in FedX)
Use well-known database algorithms
Nested Loop Join (used in Virtuoso, Sesame and Jena-Fuseki)
Hash Join (used in SIHJoin)
How federated queries are executed (general case)
[Figure: a Local Federation Implementation connected to SPARQL Endpoint 1 and SPARQL Endpoint 2, each one of Virtuoso, Jena or Sesame. Labels in the first step: SERVICE query 1, SERVICE query 2. Labels in the second step: SERVICE queries 1 and 2, SERVICE query 2.]
How federated queries are executed (only using SERVICE)
SERVICE implementation using VALUES
SELECT * (
  service (mgi) ( (?X xHGNX ?xhgnc) AND (?X xSymbol ?symbol) )
  AND
  service (hgnc) ( (?xhgnc status ?status) )
)
Results of service (mgi):
  ?X           ?symbol  ?xhgnc
  mgi:1913386  Znrd1    hgnc:13182
  mgi:3039616  Znrf3    hgnc:18126
  mgi:2443415  Zpld1    hgnc:27022
service (hgnc) rewritten with VALUES:
  service (hgnc) ( (?xhgnc status ?status) VALUES { (?xhgnc) (hgnc:13182, hgnc:18126, hgnc:27022) } )
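The VALUES rewriting above can be sketched as a small query builder. This is only an illustration under assumptions: `values_service_query` is a hypothetical helper, the `hgnc:` prefix is assumed to be declared elsewhere in the query, and the predicate is written as `hgnc:status` as in the Bio2RDF query later in this deck.

```python
def values_service_query(endpoint, pattern, var, bindings):
    """Build a SERVICE block that ships `bindings` for `?var` through VALUES."""
    # One parenthesised row per binding, as SPARQL 1.1 VALUES expects.
    rows = " ".join(f"({term})" for term in bindings)
    return f"SERVICE <{endpoint}> {{ {pattern} VALUES (?{var}) {{ {rows} }} }}"


# Bindings for ?xhgnc obtained from the first (mgi) service.
query = values_service_query(
    "http://hgnc.bio2rdf.org/sparql",
    "?xhgnc hgnc:status ?status .",
    "xhgnc",
    ["hgnc:13182", "hgnc:18126", "hgnc:27022"],
)
```

All bindings travel in a single request, so the number of remote calls does not grow with the size of the first result set.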
SERVICE implementation using FILTER
SELECT * (
  service (mgi) ( (?X xHGNX ?xhgnc) AND (?X xSymbol ?symbol) )
  AND
  service (hgnc) ( (?xhgnc status ?status) )
)
Results of service (mgi):
  ?X           ?symbol  ?xhgnc
  mgi:1913386  Znrd1    hgnc:13182
  mgi:3039616  Znrf3    hgnc:18126
  mgi:2443415  Zpld1    hgnc:27022
service (hgnc) rewritten with FILTER:
  service (hgnc) ( ?xhgnc status ?status FILTER(?xhgnc = hgnc:13182 || ?xhgnc = hgnc:18126 || ?xhgnc = hgnc:27022) )
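A minimal sketch of the FILTER rewriting above, assuming a hypothetical helper `filter_service_query`, a declared `hgnc:` prefix and the `hgnc:status` predicate. Each binding becomes one equality test in a disjunction.

```python
def filter_service_query(endpoint, pattern, var, bindings):
    """Build a SERVICE block that restricts `?var` with a FILTER disjunction."""
    if not bindings:
        # An empty FILTER() is not valid SPARQL, so refuse to build one.
        raise ValueError("at least one binding is required")
    disjunction = " || ".join(f"?{var} = {term}" for term in bindings)
    return f"SERVICE <{endpoint}> {{ {pattern} FILTER({disjunction}) }}"


query = filter_service_query(
    "http://hgnc.bio2rdf.org/sparql",
    "?xhgnc hgnc:status ?status .",
    "xhgnc",
    ["hgnc:13182", "hgnc:18126", "hgnc:27022"],
)
```

The filter expression grows with the number of bindings, so a real implementation would probably have to split long lists into several requests.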
SERVICE implementation using FILTER + UNION
SELECT * (
  service (mgi) ( (?X xHGNX ?xhgnc) AND (?X xSymbol ?symbol) )
  AND
  service (hgnc) ( (?xhgnc status ?status) )
)
Results of service (mgi):
  ?X           ?symbol  ?xhgnc
  mgi:1913386  Znrd1    hgnc:13182
  mgi:3039616  Znrf3    hgnc:18126
  mgi:2443415  Zpld1    hgnc:27022
service (hgnc) rewritten with FILTER + UNION:
  (service (hgnc) ( ?xhgnc status ?status FILTER(?xhgnc = hgnc:13182) ))
  UNION
  (service (hgnc) ( ?xhgnc status ?status FILTER(?xhgnc = hgnc:18126) ))
  UNION
  (service (hgnc) ( ?xhgnc status ?status FILTER(?xhgnc = hgnc:27022) ))
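The FILTER + UNION rewriting above can be sketched in the same way: one SERVICE branch per binding, glued together with UNION. The helper name `union_service_query`, the declared `hgnc:` prefix and the `hgnc:status` predicate are assumptions for this example; it is not meant to reproduce the FedX variant.

```python
def union_service_query(endpoint, pattern, var, bindings):
    """Build one filtered SERVICE branch per binding and UNION them."""
    branches = [
        f"{{ SERVICE <{endpoint}> {{ {pattern} FILTER(?{var} = {term}) }} }}"
        for term in bindings
    ]
    return " UNION ".join(branches)


query = union_service_query(
    "http://hgnc.bio2rdf.org/sparql",
    "?xhgnc hgnc:status ?status .",
    "xhgnc",
    ["hgnc:13182", "hgnc:18126", "hgnc:27022"],
)
```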
SERVICE implementation using a Nested Loop Join
SELECT * (
  service (mgi) ( (?X xHGNX ?xhgnc) AND (?X xSymbol ?symbol) )
  AND
  service (hgnc) ( (?xhgnc status ?status) )
)
Results of service (mgi):
  ?X           ?symbol  ?xhgnc
  mgi:1913386  Znrd1    hgnc:13182
  mgi:3039616  Znrf3    hgnc:18126
  mgi:2443415  Zpld1    hgnc:27022
service (hgnc) is called once per result, with ?xhgnc substituted:
  service (hgnc) ( (hgnc:13182 status ?status) )
  service (hgnc) ( (hgnc:18126 status ?status) )
  service (hgnc) ( (hgnc:27022 status ?status) )
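The nested loop strategy above can be simulated with a few lines. This is a sketch under assumptions: `ask_hgnc` stands in for one remote request per substituted value, `HGNC_STATUS` is toy data, and every outer row is assumed to have `?xhgnc` bound (an unbound variable would need separate handling).

```python
# Toy stand-in for the hgnc endpoint: subject -> status (hypothetical data).
HGNC_STATUS = {"hgnc:13182": "Approved", "hgnc:18126": "Approved"}
requests_sent = []  # records one entry per simulated remote call


def ask_hgnc(subject):
    """Simulate the remote pattern (<subject> status ?status)."""
    requests_sent.append(subject)
    status = HGNC_STATUS.get(subject)
    return [] if status is None else [{"status": status}]


def nested_loop_join(outer_rows, var):
    """For each outer row, substitute `var` and query the inner service."""
    results = []
    for row in outer_rows:
        for inner in ask_hgnc(row[var]):
            results.append({**row, **inner})
    return results


MGI_ROWS = [
    {"X": "mgi:1913386", "symbol": "Znrd1", "xhgnc": "hgnc:13182"},
    {"X": "mgi:3039616", "symbol": "Znrf3", "xhgnc": "hgnc:18126"},
    {"X": "mgi:2443415", "symbol": "Zpld1", "xhgnc": "hgnc:27022"},
]
results = nested_loop_join(MGI_ROWS, "xhgnc")
```

Three outer rows cause three remote calls here, which suggests why this strategy can become slow when the first result set is large.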
SERVICE implementation using a Symmetric Hash Join
SELECT * (
  service (mgi) ( (?X xHGNX ?xhgnc) AND (?X xSymbol ?symbol) )
  AND
  service (hgnc) ( (?xhgnc status ?status) )
)
Results of service (hgnc):
  ?status     ?xhgnc
  "Approved"  hgnc:13182
  "Approved"  hgnc:18126
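A symmetric hash join keeps one hash table per input and lets each arriving tuple probe the other side's table, so results can be emitted while both services are still answering. The sketch below is a generic textbook version, not the SIHJoin implementation; the interleaving with `zip_longest` only imitates two streams arriving in parallel, and the row data is taken from the toy tables on the slides.

```python
from collections import defaultdict
from itertools import zip_longest


def symmetric_hash_join(left_stream, right_stream, key):
    """Join two streams on `key`, emitting matches as soon as both sides are seen."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for left, right in zip_longest(left_stream, right_stream):
        if left is not None:
            left_table[left[key]].append(left)           # build own table
            for match in right_table.get(left[key], ()):  # probe the other one
                yield {**left, **match}
        if right is not None:
            right_table[right[key]].append(right)
            for match in left_table.get(right[key], ()):
                yield {**match, **right}


mgi_rows = [
    {"X": "mgi:1913386", "symbol": "Znrd1", "xhgnc": "hgnc:13182"},
    {"X": "mgi:3039616", "symbol": "Znrf3", "xhgnc": "hgnc:18126"},
    {"X": "mgi:2443415", "symbol": "Zpld1", "xhgnc": "hgnc:27022"},
]
hgnc_rows = [
    {"status": "Approved", "xhgnc": "hgnc:13182"},
    {"status": "Approved", "xhgnc": "hgnc:18126"},
]
joined = list(symmetric_hash_join(iter(mgi_rows), iter(hgnc_rows), "xhgnc"))
```

Each matching pair is emitted exactly once, when the later of its two tuples arrives.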
Strategies Evaluation - Example
Local data:
  :bob foaf:knows :peter .
Remote data:
  :bob :works :riva .
  :bob :works :italy .
Query:
  SELECT * WHERE {
    SERVICE <example_server1> { ?X foaf:knows :peter } .
    SERVICE <example_server2> { { ?Y :works :italy } UNION { ?X :works :riva } }
  }
Results are:
  {?X -> :bob}, {?Y -> :bob, ?X -> :bob}
Strategies Evaluation - Example (using FILTER & UNION)
However, if we evaluate that query over these data with each strategy, we obtain the following numbers of results:
           Jena-Fuseki  Sesame  Virtuoso
  SERVICE  2            2       2
  VALUES   2            2       2
  FILTER   2            1       1
  UNION    2            1       1
  NESTED   2            2       2
  SYMHASH  1            1       1
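One way to see how a FILTER-based rewriting can lose a result in this example is to evaluate the join over solution mappings by hand. The sketch below assumes standard SPARQL mapping compatibility (two mappings join when they agree on every shared variable) and assumes that a filter on an unbound `?X` discards the mapping; it does not model any particular engine.

```python
def compatible(m1, m2):
    """Two mappings are compatible when they agree on all shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


# Answers of each SERVICE pattern over the example data.
server1 = [{"X": ":bob"}]                  # ?X foaf:knows :peter
server2 = [{"Y": ":bob"}, {"X": ":bob"}]   # {?Y :works :italy} UNION {?X :works :riva}

expected = join(server1, server2)

# FILTER strategy: server 2 is asked to keep only mappings with ?X = :bob.
# The mapping {?Y -> :bob} leaves ?X unbound, so the filter drops it.
server2_filtered = [m for m in server2 if m.get("X") == ":bob"]
with_filter = join(server1, server2_filtered)
```

Here `expected` holds the two mappings listed in the example, while `with_filter` keeps only `{?X -> :bob}`.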
VALUES example (using OPTIONAL)
Remote Query:
  SELECT * WHERE {
    ?person1 foaf:knows ?person2
    OPTIONAL { ?person1 :works ?place }
  } VALUES (?place) { (:riva) (:rome) }
Remote Data:
  :bob foaf:knows :peter .
  :peter foaf:knows :alice .
  :bob :works :puc .
Shipped VALUES:
  ?place
  :riva
  :rome
Tables shown on the slide:
  ?person1  ?person2  ?place
  :bob      :peter    :riva
  :peter    :alice    :rome
  :peter    :alice    :riva

  ?person1  ?person2  ?place
  :bob      :peter    :riva
  :peter    :mark
There are (at least) two problems we may find
With FILTER and UNION strategies we may miss results
With VALUES we get mixed data plus "unwanted" results
There are other problems with NESTED (related to variable substitution)
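The VALUES problem can be reproduced by evaluating the OPTIONAL example over its remote data with plain mapping semantics. This is a sketch under assumptions: `left_join` implements OPTIONAL without filters, the trailing VALUES block is joined with the result of the WHERE clause, and the variable names mirror the example query.

```python
def compatible(m1, m2):
    """Two mappings are compatible when they agree on all shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


def left_join(left, right):
    """OPTIONAL: keep the left mapping unchanged when nothing on the right matches."""
    out = []
    for a in left:
        matches = [{**a, **b} for b in right if compatible(a, b)]
        out.extend(matches or [a])
    return out


# Remote data: :bob knows :peter, :peter knows :alice, :bob works at :puc.
knows = [
    {"person1": ":bob", "person2": ":peter"},
    {"person1": ":peter", "person2": ":alice"},
]
works = [{"person1": ":bob", "place": ":puc"}]
values = [{"place": ":riva"}, {"place": ":rome"}]  # shipped with VALUES

optional_part = left_join(knows, works)
result = join(optional_part, values)
```

The `:bob` row is rejected because `:puc` conflicts with both shipped places, while the `:peter` row has `?place` unbound and therefore joins with every shipped value, producing the two `:peter :alice` rows.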
Strategies Evaluation - Bio2RDF Example
Consider now this query with real data:
SELECT * WHERE {
  SERVICE <http://localhost:3030/mgi/sparql> {   # 250923 results
    ?s rdf:type ?type .
    OPTIONAL { ?s mgi:xHGNC ?hgnc_link }
  }
  SERVICE <http://localhost:3130/hgnc/sparql> {  # 23 results
    ?hgnc_link hgnc:status "Approved" .
    ?hgnc_link hgnc:date_modified ?date .
    FILTER(?date < "1995-01-01")
  }
}
Evaluating that query over these data with the same configurations as before, we obtain the following numbers of results (to = timeout):
           Jena-Fuseki  Sesame     Virtuoso
  SERVICE  5,361,421    5,361,421  to
  VALUES   6            6          23
  FILTER   6            6          6
  UNION    6            6          6
  NESTED   5,361,421    5,361,421  to
  SYMHASH  6            6          6
Why do all these inconsistent results happen?
The empty mapping is the join identity: it is compatible with every other mapping
It is not null rejecting!
How can we fix it? By preventing the injection of null (unbound) values into the remote query
Besides, substituting a variable for a value that does not belong to the database is not a good idea
One possible fix: Strongly Bound syntactic restriction [1,2]
(?mgi, type, ?type)
  ?mgi         ?type
  mgi:4420798  mgi:Marker
  …
(?mgi, type, ?type) OPTIONAL (?mgi, works, ?link_hgnc)
  ?mgi        ?link_hgnc
  mgi:99926   hgnc:7415
  mgi:102492
  …           …
(?mgi1, pos, "-1") UNION (?mgi2, pos, "74.83")
  ?mgi1  ?mgi2
  mgi:99926
  mgi:1916948
  …  …
(Fixed) Strategies evaluation with real data: which one is best? (data installed in a local network)
Evaluation Query Set
      CARD Q1  CARD Q2  #triple patterns Q1  #triple patterns Q2  CARD Q
  B0  27       1        1                    1                    1
  B1  27       33562    1                    1                    1
  B2  17817    33562    2                    1                    17547
  B3  16753    2        2                    3                    2
  B4  250924   23       2                    2                    6
  B5  16753    8771     2                    2                    3274
  B6  268743   27132    3                    7                    23873
  B7  35636    33134    3                    4                    17545
(Time) Results on Jena-Fuseki (in milliseconds)
           B0    B1     B2     B3     B4     B5     B6     B7
  SERVICE  1436  642    31620  39119  *      *      *      to
  FILTER   439   360    7235   6022   13633  3638   26577  62947
  UNION    730   678    10657  15033  22269  7732   60814  63335
  VALUES   211   227    8247   5223   error  error  error  11646
  NESTED   758   643    52160  55445  *      *      *      to
  SYMHASH  403   16462  17563  1937   9494   13082  53591  28759
Results on Sesame
           B0   B1     B2     B3     B4     B5     B6     B7
  SERVICE  77   320    1555   73     *      *      *      1968
  FILTER   149  596    4340   4788   7815   3230   18746  7032
  UNION    260  645    6415   10807  12502  5443   43648  13673
  VALUES   167  153    4289   2893   error  error  error  7042
  NESTED   443  385    52051  62083  to     to     to     89487
  SYMHASH  101  14520  15037  1203   4662   5409   20915  21242
Results on Virtuoso
           B0     B1     B2     B3      B4     B5     B6     B7
  SERVICE  error  error  error  error   error  error  error  error
  FILTER   159    159    1731   31720   26329  31324  68749  37444
  UNION    267    237    30904  72495   55321  3737   7134   11990
  VALUES   137    117    6611   7561    error  error  error  13291
  NESTED   559    500    88240  128399  *      to     to     196280
  SYMHASH  102    1905   2733   2205    5525   2306   8149   3407
Conclusions
Algorithms for SPARQL query federation are not as straightforward as they seem
It is important to check their correctness
Which strategy/algorithm should I use?
Depends on the remote server
SYMHASH and FILTER tend to perform well (real data, controlled scenario)
If these two do not work well for the setup, use UNION/VALUES (when possible)
Thanks.
Any questions?
References
[1] Semantics and optimization of the SPARQL 1.1 federation extension. C Buil-Aranda, M Arenas, O Corcho. The Semantic Web: Research and Applications, 1-15.
[2] Federating queries in SPARQL 1.1: Syntax, semantics and evaluation. C Buil-Aranda, M Arenas, O Corcho, A Polleres. Web Semantics: Science, Services and Agents on the World Wide Web 18 (1), 1-17.
[3] SPARQL Web-Querying Infrastructure: Ready for Action? C Buil-Aranda, A Hogan, J Umbrich, PY Vandenbussche. The Semantic Web - ISWC 2013, 277-293.
[4] Strategies for executing federated queries in SPARQL 1.1. C Buil-Aranda, A Polleres, J Umbrich. The Semantic Web - ISWC 2014, to appear.