Scaling out federated queries for Life Sciences Data In Production

Dieter De Witte, Laurens De Vocht, et al.

• IMEC – IDLAB – GHENT UNIVERSITY

• ONTOFORCE

Catch 22!?

A. No Semantic Web Applications, because no Semantic Data

B. No Semantic Data, because no Applications

A. The LOD Cloud for Life Sciences...

Ontoforce’s DISQOVER covers > 110 Life Sciences Datasets

B. DISQOVER is an Exploratory Semantic Search UI (faceted browsing)

To Click = To SPARQL

The missing link in our catch 22? “How to run federated queries?”

Direct ETL

• Cloud Instances

• PAGO (pay-as-you-go) AMIs:

Scientific Benchmark = Reproducible Benchmark

Benchmark Client

• 1 single-threaded warm-up run (all 1,223 queries)

• 1 multi-threaded run (8 threads, each running the queries in its own randomized order)
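
The two benchmark phases above can be sketched roughly as follows. This is only an illustration, not the benchmark software used for the talk: `run_query` is a hypothetical callable standing in for whatever sends a SPARQL query to the database node, and the seed and the timing bookkeeping are assumptions.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Timing = Tuple[str, float]


def run_pass(queries: List[str], run_query: Callable[[str], object]) -> List[Timing]:
    """Run the queries in the given order and record the wall-clock time of each."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical: would send the query to the store
        timings.append((query, time.perf_counter() - start))
    return timings


def benchmark(queries: List[str], run_query: Callable[[str], object],
              threads: int = 8, seed: int = 0) -> Dict[str, object]:
    # Phase 1: one single-threaded warm-up run over all queries.
    warmup = run_pass(queries, run_query)

    # Phase 2: one multi-threaded run; every thread gets its own shuffled copy.
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the orders reproducible
    orders = [rng.sample(queries, len(queries)) for _ in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        stress = list(pool.map(lambda order: run_pass(order, run_query), orders))
    return {"warmup": warmup, "stress": stress}


if __name__ == "__main__":
    demo_queries = [f"ASK {{ ?s ?p {i} }}" for i in range(5)]  # placeholder queries
    results = benchmark(demo_queries, run_query=lambda q: len(q))
    print(len(results["warmup"]), "warm-up timings,", len(results["stress"]), "threads")
```

Seeding the shuffles is one possible way to keep a randomized run repeatable, in line with the "reproducible benchmark" goal.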

Database Node(s)

How to evaluate an RDF Database solution?

Performance (Data store, Dataset, Configuration, Number of nodes, Hardware (RAM))

Performance (NoSQL Triple stores, WatDiv 10M, 100M, 1000M, Standard Configs, Single Node, 32 GB RAM)

SIGMOD 2016: Single Node SOTA on artificial data

More data, more problems

Timeout

Query performance: Virtuoso leads, Blazegraph follows

Timeout

SWAT4LS 2016: Multi-node SOTA on real data

Performance (Scale out systems, DISQOVER data, Optimized Configs, Multi-Node + Compression, 64 GB RAM)

How to deal with Big Linked Data?

1. Vertical Scaling: bigger box

2. Compression: smaller content

3. Horizontal Scaling: more boxes, 1 location

4. Federation: more boxes, more locations

V1, Bla1 (single node Virtuoso, Blazegraph)

V1_32 (32GB Virtuoso)

Fu1 (Fuseki + HDT)

V3 (Virtuoso cluster 3 nodes)

Fl3 (FluidOps, aka FedX)
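
As a reading aid, the setup labels above can be written down as small records. This is a sketch under assumptions: the `Setup` class and its field names are hypothetical, the node counts for Fu1 and Fl3 are inferred from the numeric suffix of the label, and RAM is filled in only where the label itself states it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Setup:
    label: str
    store: str
    nodes: int
    ram_gb: Optional[int] = None  # None: not stated in the label itself


SETUPS = [
    Setup("V1", "Virtuoso", 1),
    Setup("Bla1", "Blazegraph", 1),
    Setup("V1_32", "Virtuoso", 1, ram_gb=32),
    Setup("Fu1", "Fuseki + HDT", 1),     # node count assumed from the suffix
    Setup("V3", "Virtuoso cluster", 3),
    Setup("Fl3", "FluidOps (FedX)", 3),  # node count assumed from the suffix
]


def multi_node(setups: List[Setup]) -> List[str]:
    """Labels of the setups that run on more than one node."""
    return [s.label for s in setups if s.nodes > 1]


if __name__ == "__main__":
    print(multi_node(SETUPS))
```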

DISQOVER dataset ...

and queries

Count, Union, Sort, Aggregations

Example Query 1: Nesting, FILTERs, unbound triples

Example Query 4: Aggregations, Optionals
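
One rough way to tag queries with the features named above is to count SPARQL keywords. The sketch below is a crude approximation, not the analysis behind the slides: it matches keywords by regular expression rather than parsing the query, so keywords inside string literals or comments are also counted. `TOY_QUERY` is a made-up illustration, not one of the DISQOVER queries.

```python
import re
from typing import Dict

# Feature name -> regular expression (matched case-insensitively).
FEATURES = {
    "FILTER": r"\bFILTER\b",
    "UNION": r"\bUNION\b",
    "OPTIONAL": r"\bOPTIONAL\b",
    "ORDER BY": r"\bORDER\s+BY\b",
    "GROUP BY": r"\bGROUP\s+BY\b",
    "COUNT": r"\bCOUNT\s*\(",
    "DISTINCT": r"\bDISTINCT\b",
    "NESTED SELECT": r"\{\s*SELECT\b",
}


def query_features(sparql: str) -> Dict[str, int]:
    """Count how often each feature keyword occurs in a SPARQL query string."""
    return {
        name: len(re.findall(pattern, sparql, flags=re.IGNORECASE))
        for name, pattern in FEATURES.items()
    }


# Made-up query for illustration only.
TOY_QUERY = """
SELECT ?type (COUNT(DISTINCT ?s) AS ?n) WHERE {
  { ?s a ?type } UNION { ?s ?p ?type }
  OPTIONAL { ?s <http://example.org/label> ?label }
  FILTER(BOUND(?label))
}
GROUP BY ?type
ORDER BY DESC(?n)
"""

if __name__ == "__main__":
    for name, count in query_features(TOY_QUERY).items():
        print(f"{name}: {count}")
```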

Initial performance results were counter-intuitive... and incorrect!!!

Worse hardware, better performance?

Only Virtuoso-backed systems survive multi-threaded benchmark

Marker: last successful query (no timeout)

Panels: 1 x 1,223 queries (single-threaded run), 8 x 1,223 queries (multi-threaded run)

No errors, but an incorrect number of results!!!
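
A correctness check of this kind could compare result counts per query against a reference system. The sketch below is an assumption about how such a check might look, not the authors' tooling: the function name, the data layout and the system names are hypothetical, and choosing a single reference system is only one possible design.

```python
from typing import Dict, List

# counts[system][query_id] = number of results returned (hypothetical layout)
Counts = Dict[str, Dict[str, int]]


def mismatched_queries(counts: Counts, reference: str) -> Dict[str, List[str]]:
    """For each non-reference system, list the query ids whose result count
    differs from the reference. A query missing for a system (for example
    after a timeout) is reported as a mismatch too."""
    ref = counts[reference]
    report = {}
    for system, per_query in counts.items():
        if system == reference:
            continue
        report[system] = sorted(
            qid for qid, n in ref.items() if per_query.get(qid) != n
        )
    return report


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    demo = {
        "system_a": {"q1": 10, "q2": 0, "q3": 42},
        "system_b": {"q1": 10, "q2": 5, "q3": 42},
        "system_c": {"q1": 10, "q2": 0},
    }
    print(mismatched_queries(demo, reference="system_a"))
```

A check like this flags runs that finish without errors yet return a different number of results. It cannot tell which system is right when the reference itself is wrong.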

FILTERs, UNIONs are challenging but ORDER + GROUP + OPTIONAL dominate

COUNT DISTINCT

600 – 1,223 BGPs

Conclusions & Future Work

• Additional diagnostics for RDF solutions!

• Extend benchmarking software with query correctness assessment!

• Multi-node RDF solutions???

• Towards a full paper:

– NoSQL for Ontoforce data

– Scale-out approaches for WatDiv + test LDF (Linked Data Fragments)

– Release reusable end-to-end benchmark software:

• Setup AND Postprocessing

Thanks for your attention!!
