Scaling out federated queries for Life Sciences Data In Production

Post on 16-Apr-2017

SCALING OUT FEDERATED QUERIES

FOR LIFE SCIENCES DATA IN PRODUCTION

Dieter De Witte, Laurens De Vocht, et al.

dieter.dewitte@ugent.be

• IMEC – IDLAB – GHENT UNIVERSITY

• ONTOFORCE

Catch 22!?

A. No Semantic Web Applications

because no Semantic Data

B. No Semantic Data

because no applications

A. The LOD Cloud for Life Sciences...

Ontoforce’s

DISQOVER

covers

> 110

Life Sciences Datasets

B. DISQOVER is an Exploratory Semantic Search UI (faceted browsing)

To Click = To SPARQL
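
The "To Click = To SPARQL" idea can be illustrated with a minimal sketch in which each selected facet becomes one triple pattern. This is not DISQOVER's actual query generator; the function name `facets_to_sparql`, the `?item` variable, and the `example.org` IRIs are all hypothetical.

```python
def facets_to_sparql(facets, limit=20):
    """Build a SELECT query from {predicate IRI: value IRI} facet selections.

    Hypothetical helper: every clicked facet adds one triple pattern on ?item.
    """
    patterns = [
        f"  ?item <{pred}> <{value}> ."
        for pred, value in sorted(facets.items())
    ]
    # With no facet selected, fall back to an unconstrained pattern.
    body = "\n".join(patterns) if patterns else "  ?item ?p ?o ."
    return f"SELECT DISTINCT ?item WHERE {{\n{body}\n}}\nLIMIT {limit}"


# Two facet clicks (made-up IRIs) translate into two triple patterns.
query = facets_to_sparql({
    "http://example.org/vocab/type": "http://example.org/vocab/Drug",
    "http://example.org/vocab/targets": "http://example.org/gene/BRCA1",
})
print(query)
```

Each additional click narrows the result set by adding a pattern, which is why a faceted browsing session can produce long, deeply constrained queries.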

The missing link in our catch 22? “How to run federated queries?”

Direct ETL

• Cloud Instances

• PAGO AMIs:

Scientific Benchmark = Reproducible Benchmark

Benchmark Client

• 1 single-threaded warm-up run (all 1,223 queries)

• 1 multi-threaded (8) run

• (8 x randomized order)
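
The two-phase protocol above could be sketched roughly as follows. This is not the authors' benchmark software: `run_query` is a hypothetical stand-in for a real SPARQL client call, the query strings are placeholders, and the reading that each of the 8 threads runs the full query set in its own random order is an assumption.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query):
    """Hypothetical stand-in for sending one SPARQL query to the store."""
    start = time.perf_counter()
    # A real client would issue an HTTP request to the SPARQL endpoint here.
    return query, time.perf_counter() - start


def run_in_order(queries):
    """Run the queries sequentially and collect (query, runtime) pairs."""
    return [run_query(q) for q in queries]


def benchmark(queries, threads=8, seed=42):
    # Phase 1: one single-threaded warm-up pass over all queries.
    warmup = run_in_order(queries)
    # Phase 2: every thread gets the full query set in its own random order.
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    orders = [rng.sample(queries, len(queries)) for _ in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        stress = list(pool.map(run_in_order, orders))
    return warmup, stress


queries = [f"query-{i}" for i in range(1223)]  # placeholder query strings
warmup, stress = benchmark(queries)
```

Seeding the shuffles is one way to keep the randomized orders identical across repeated runs, in line with the reproducibility goal.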

Database Node(s)

How to evaluate an RDF Database solution?

Performance(Data store, Dataset, Configuration, Number of nodes, Hardware (RAM))

Performance(NoSQL Triple stores, WatDiv 10M, 100M, 1000M, Standard Configs, Single Node, 32 GB RAM)

SIGMOD 2016: Single Node SOTA on artificial data

More data, more problems

[Chart annotation: Timeout]

Query performance: Virtuoso Leads, Blazegraph follows

[Chart annotation: Timeout]

SWAT4LS 2016: Multi-node SOTA on real data

Performance(Scale out systems, DISQOVER data, Optimized Configs, Multi-Node, Compression, 64 GB RAM)

How to deal with Big Linked Data?

1. Vertical Scaling: bigger box

2. Compression: smaller content

3. Horizontal Scaling: more boxes, 1 location

4. Federation: more boxes, more locations

V1, Bla1 (single node Virtuoso, Blazegraph)

V1_32 (32GB Virtuoso)

Fu1 (Fuseki + HDT)

V3 (Virtuoso cluster 3 nodes)

Fl3 (FluidOps, aka FedX)

DISQOVER dataset ...

and queries

Count, Union, Sort, Aggregations

Example Query 1: Nesting, FILTERs, unbound triples

Example Query 4: Aggregations, Optionals

Initial performance results were counter-intuitive... and incorrect!!!

Worse hardware, better performance?

Only Virtuoso-backed systems survive multi-threaded benchmark

[Chart legend: marker indicates the last successful query (no timeout)]

[Chart panels: 1 x 1,223 queries | 8 x 1,223 queries]

No errors but incorrect #results!!!
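
One simple way to catch this kind of silent failure is to compare result counts per query across stores. The sketch below is an illustration under assumptions, not the authors' tooling: the function name `flag_count_mismatches`, the store names, and all counts are made up.

```python
def flag_count_mismatches(counts_by_store):
    """Return the queries whose result counts differ between stores.

    counts_by_store: {store: {query_id: result_count}}
    returns:         {query_id: {store: count or None if missing}}
    """
    query_ids = set().union(*(c.keys() for c in counts_by_store.values()))
    suspicious = {}
    for qid in sorted(query_ids):
        counts = {store: c.get(qid) for store, c in counts_by_store.items()}
        # More than one distinct value means at least one store disagrees.
        if len(set(counts.values())) > 1:
            suspicious[qid] = counts
    return suspicious


# Made-up counts: q2 disagrees, q3 is missing for one store.
observed = {
    "store_a": {"q1": 10, "q2": 5, "q3": 0},
    "store_b": {"q1": 10, "q2": 7, "q3": 0},
    "store_c": {"q1": 10, "q2": 5},
}
mismatches = flag_count_mismatches(observed)
print(mismatches)
```

A count comparison only shows that stores disagree, not which one is right; deciding that would need a trusted reference result for each query.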

FILTERs, UNIONs are challenging but ORDER + GROUP + OPTIONAL dominate

COUNT DISTINCT

600 – 1,223 BGPs

Conclusions & Future Work

• Additional diagnostics for RDF solutions!

• Extend benchmarking software with query correctness assessment!

• Multi-node RDF solutions???

• Towards Full paper:

– NoSQL for Ontoforce Data

– Scale out approaches for WatDiv + test LDF (Linked Data Fragments)

– Release reusable end-to-end benchmark software:

• Setup AND Postprocessing

Thanks for your attention!!
