Scaling out federated queries for Life Sciences Data In Production
SCALING OUT FEDERATED QUERIES
FOR LIFE SCIENCES DATA IN PRODUCTION
Dieter De Witte, Laurens De Vocht, et al.
• IMEC – IDLAB – GHENT UNIVERSITY
• ONTOFORCE
Catch 22!?
A. No Semantic Web Applications
because no Semantic Data
B. No Semantic Data
because no applications
A. The LOD Cloud for Life Sciences...
Ontoforce’s DISQOVER covers > 110 Life Sciences datasets
B. DISQOVER is an Exploratory Semantic Search UI (faceted browsing)
To Click = To SPARQL
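As an illustration of "To Click = To SPARQL", a minimal sketch of how a set of facet selections could be translated into a query. This is not DISQOVER's actual implementation; the function name `facets_to_sparql`, the `?item` variable, and the example IRIs are all hypothetical.

```python
def facets_to_sparql(rdf_type, facets, limit=100):
    """Build a SELECT query from a type IRI and {predicate IRI: literal value}.

    Hypothetical helper: each selected facet becomes one triple pattern.
    """
    lines = ["SELECT DISTINCT ?item WHERE {", f"  ?item a <{rdf_type}> ."]
    for predicate, value in sorted(facets.items()):
        # Escape backslashes and double quotes inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?item <{predicate}> "{escaped}" .')
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


# Example with made-up IRIs: one click on a "status" facet.
query = facets_to_sparql(
    "http://example.org/Drug",
    {"http://example.org/status": "approved"},
)
```

Every extra click adds a triple pattern, which is one way faceted browsing can end up producing large generated queries.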
The missing link in our catch 22? “How to run federated queries?”
Direct ETL
• Cloud Instances
• PAGO (pay-as-you-go) AMIs:
Scientific Benchmark = Reproducible Benchmark
Benchmark Client
• 1 single-threaded warm-up run (all 1,223 queries)
• 1 multi-threaded run (8 threads)
• (each of the 8 threads uses its own randomized query order)
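The run schedule above (one single-threaded warm-up pass, then 8 threads each replaying the queries in a shuffled order) can be sketched as follows. This is an assumption-laden outline, not the benchmark software used in the talk: `run_query` is a hypothetical callable that sends one query to the database node and raises on error or timeout.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_pass(queries, run_query):
    """Run queries sequentially; return (query, seconds, ok) tuples."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        try:
            run_query(q)
            ok = True
        except Exception:
            # Errors and timeouts are recorded rather than aborting the pass.
            ok = False
        timings.append((q, time.perf_counter() - start, ok))
    return timings


def benchmark(queries, run_query, threads=8, seed=0):
    """Warm-up pass, then `threads` concurrent passes in randomized order."""
    warmup = run_pass(queries, run_query)
    rng = random.Random(seed)  # fixed seed keeps the orders reproducible
    orders = [rng.sample(queries, len(queries)) for _ in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        stress = list(pool.map(lambda order: run_pass(order, run_query), orders))
    return warmup, stress
```

Seeding the shuffles is a design choice made here to keep the run reproducible, in line with the "Scientific Benchmark = Reproducible Benchmark" slide.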
Database Node(s)
How to evaluate an RDF Database solution?
Performance (Data store, Dataset, Configuration, Number of nodes, Hardware (RAM))
Performance (NoSQL Triple stores; WatDiv 10M, 100M, 1000M; Standard Configs; Single Node; 32 GB RAM)
SIGMOD 2016: Single Node SOTA on artificial data
More data, more problems
Query performance: Virtuoso leads, Blazegraph follows
SWAT4LS 2016: Multi-node SOTA on real data
Performance (Scale-out systems; DISQOVER data; Optimized Configs; Multi-Node, Compression; 64 GB RAM)
How to deal with Big Linked Data?
1. Vertical Scaling: bigger box
2. Compression: smaller content
3. Horizontal Scaling: more boxes, 1 location
4. Federation: more boxes, more locations
V1, Bla1 (single node Virtuoso, Blazegraph)
V1_32 (32GB Virtuoso)
Fu1 (Fuseki + HDT)
V3 (Virtuoso cluster 3 nodes)
Fl3 (FluidOps, aka FedX)
DISQOVER dataset ...
and queries
Count, Union, Sort, Aggregations
Example Query 1: Nesting, FILTERs, unbound triples
Example Query 4: Aggregations, Optionals
Initial performance results were counter-intuitive... and incorrect!!!
Worse hardware, better performance?
Only Virtuoso-backed systems survive multi-threaded benchmark
[marker] marks the last successful query (no timeout)
1 x 1,223 queries (single-threaded) | 8 x 1,223 queries (multi-threaded)
No errors but incorrect #results!!!
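One simple way to catch "no errors but incorrect #results" is to compare result counts for the same query across systems. The sketch below assumes the counts have already been collected into a nested dict; the function name and the numbers in the example are hypothetical, and only the system labels (V1, Fl3) come from the slides.

```python
def flag_count_mismatches(result_counts):
    """Return queries whose completed runs disagree on the number of results.

    result_counts: {query_id: {system: result_count, or None if it timed out}}
    """
    suspicious = {}
    for query_id, per_system in result_counts.items():
        # Ignore systems that did not finish; they cannot be compared.
        completed = {s: n for s, n in per_system.items() if n is not None}
        if len(set(completed.values())) > 1:
            suspicious[query_id] = completed
    return suspicious


# Made-up counts for illustration only.
counts = {
    "q1": {"V1": 10, "Fl3": 10},
    "q2": {"V1": 10, "Fl3": 7},
    "q3": {"V1": None, "Fl3": 3},
}
mismatches = flag_count_mismatches(counts)
```

Agreement on counts does not prove correctness, and a mismatch does not say which system is wrong; it only marks queries that need a closer look.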
FILTERs, UNIONs are challenging but ORDER + GROUP + OPTIONAL dominate
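To relate runtimes to query features such as these, each query first has to be tagged with the constructs it uses. A rough sketch using keyword matching follows; it is an assumption that this is good enough for generated queries, since it does not parse SPARQL and would also match keywords inside string literals or IRIs.

```python
import re

# Keyword patterns for the query features named on the slides.
FEATURES = {
    "FILTER": r"\bFILTER\b",
    "UNION": r"\bUNION\b",
    "OPTIONAL": r"\bOPTIONAL\b",
    "ORDER BY": r"\bORDER\s+BY\b",
    "GROUP BY": r"\bGROUP\s+BY\b",
    "COUNT DISTINCT": r"\bCOUNT\s*\(\s*DISTINCT\b",
}


def query_features(sparql):
    """Return the set of feature names whose keyword occurs in the query text."""
    return {
        name
        for name, pattern in FEATURES.items()
        if re.search(pattern, sparql, re.IGNORECASE)
    }


example = (
    "SELECT ?p (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o "
    "OPTIONAL { ?s ?q ?r } } GROUP BY ?p ORDER BY ?n"
)
features = query_features(example)
```

The resulting feature sets can then be joined with per-query timings to see which constructs coincide with slow or failing queries.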
COUNT DISTINCT
600 – 1,223 BGPs
Conclusions & Future Work
• Additional diagnostics for RDF solutions!
• Extend benchmarking software with query correctness assessment!
• Multi-node RDF solutions???
• Towards Full paper:
– NoSQL for Ontoforce Data
– Scale-out approaches for WatDiv + test Linked Data Fragments (LDF)
– Release reusable end-to-end benchmark software:
• Setup AND Postprocessing
Thanks for your attention!!