Scaling out federated queries for Life Sciences Data In Production


Page 1: Scaling out federated queries for Life Sciences Data In Production

SCALING OUT FEDERATED QUERIES

FOR LIFE SCIENCES DATA IN PRODUCTION

Dieter De Witte, Laurens De Vocht, et al.

[email protected]

• IMEC– IDLAB – GHENT UNIVERSITY

• ONTOFORCE

Page 2: Scaling out federated queries for Life Sciences Data In Production

Catch 22!?

A. No Semantic Web Applications

because no Semantic Data

B. No Semantic Data

because no applications

Page 3: Scaling out federated queries for Life Sciences Data In Production

A. The LOD Cloud for Life Sciences...

Ontoforce’s

DISQOVER

covers

> 110

Life Sciences Datasets

Page 4: Scaling out federated queries for Life Sciences Data In Production

B. DISQOVER is an Exploratory Semantic Search UI (faceted browsing)

To Click = To SPARQL
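
A minimal sketch of the "To Click = To SPARQL" idea: each facet the user clicks becomes one more triple pattern in the generated query. The function name `facets_to_sparql`, the `example.org` IRIs and the query shape are assumptions made for illustration; the slides do not show how DISQOVER actually builds its queries.

```python
def facets_to_sparql(type_iri, facets, limit=20):
    """Build a SELECT query for entities of one type, narrowed by clicked facets.

    `facets` maps a predicate IRI to the value IRI the user selected.
    All names here are hypothetical, not taken from DISQOVER.
    """
    patterns = [f"?entity a <{type_iri}> ."]
    # One triple pattern per selected facet; sorted for a stable query text.
    for predicate, value in sorted(facets.items()):
        patterns.append(f"?entity <{predicate}> <{value}> .")
    body = "\n  ".join(patterns)
    return f"SELECT DISTINCT ?entity WHERE {{\n  {body}\n}} LIMIT {limit}"


# Example: the user picks the type "Drug", then clicks the facet value "Asthma".
query = facets_to_sparql(
    "http://example.org/Drug",
    {"http://example.org/treats": "http://example.org/Asthma"},
)
print(query)
```

Every additional click adds a pattern, so long browsing sessions naturally produce large, heavily constrained queries.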

Page 5: Scaling out federated queries for Life Sciences Data In Production

The missing link in our catch 22? “How to run federated queries?”

Direct ETL

Page 6: Scaling out federated queries for Life Sciences Data In Production

• Cloud Instances

• PAGO (pay-as-you-go) AMIs:

Scientific Benchmark = Reproducible Benchmark

Benchmark Client

• 1 single-threaded warm-up run (all 1,223 queries)

• 1 multi-threaded run (8 threads, each running all 1,223 queries in randomized order)

Database Node(s)
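
The benchmark client procedure on this slide can be sketched roughly as follows. This is an illustration only: `run_query` stands in for whatever call sends a query to the database node, and the function names, the seed parameter and the timing method are assumptions rather than the authors' benchmark software.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_sequence(queries, run_query):
    """Run queries one after another; return (query, elapsed_seconds) pairs."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        timings.append((q, time.perf_counter() - start))
    return timings


def benchmark(queries, run_query, threads=8, seed=0):
    """Warm-up pass, then one multi-threaded pass with per-thread random order."""
    # Warm-up: a single-threaded pass over all queries.
    warmup = run_sequence(queries, run_query)

    # Stress run: every thread gets its own shuffled copy of the full query list.
    rng = random.Random(seed)
    orders = [rng.sample(queries, len(queries)) for _ in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        stress = list(pool.map(lambda order: run_sequence(order, run_query), orders))
    return warmup, stress


# Example with placeholder queries and a no-op in place of a real SPARQL call.
demo_queries = [f"query-{i}" for i in range(5)]
demo_warmup, demo_stress = benchmark(demo_queries, lambda q: None)
print(len(demo_warmup), len(demo_stress))
```

Fixing the seed keeps the randomized orders repeatable between runs, which fits the slide's point that a scientific benchmark has to be reproducible.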

Page 7: Scaling out federated queries for Life Sciences Data In Production

How to evaluate an RDF database solution?

Performance(Data store, Dataset, Configuration, Number of nodes, Hardware (RAM))

Page 8: Scaling out federated queries for Life Sciences Data In Production

Performance(NoSQL triple stores; WatDiv 10M, 100M, 1000M; standard configs; single node; 32 GB RAM)

SIGMOD 2016: Single Node SOTA on artificial data

Page 9: Scaling out federated queries for Life Sciences Data In Production

More data, more problems

Timeout

Page 10: Scaling out federated queries for Life Sciences Data In Production

Query performance: Virtuoso Leads, Blazegraph follows

Timeout

Page 11: Scaling out federated queries for Life Sciences Data In Production

SWAT4LS 2016: Multi-node SOTA on real data

Performance(Scale-out systems; DISQOVER data; optimized configs; multi-node, compression; 64 GB RAM)
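
One way to read the "Performance(...)" notation is as a record with five fields, filled in once per study. The sketch below uses only the values given on the SIGMOD 2016 and SWAT4LS 2016 slides; the class name `BenchmarkSetup`, the field names and the helper `changed_dimensions` are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkSetup:
    """The five dimensions performance is said to depend on (names are assumptions)."""
    data_store: str
    dataset: str
    configuration: str
    nodes: str
    ram_gb: int


sigmod_2016 = BenchmarkSetup(
    "NoSQL triple stores", "WatDiv 10M, 100M, 1000M", "standard", "single node", 32
)
swat4ls_2016 = BenchmarkSetup(
    "scale-out systems", "DISQOVER data", "optimized", "multi-node, compression", 64
)


def changed_dimensions(a, b):
    """List the field names whose values differ between two setups."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(changed_dimensions(sigmod_2016, swat4ls_2016))
```

Written this way, it is easy to see that all five dimensions change between the two studies.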

Page 12: Scaling out federated queries for Life Sciences Data In Production

How to deal with Big Linked Data?

1. Vertical Scaling: bigger box

2. Compression: smaller content

3. Horizontal Scaling: more boxes, 1 location

4. Federation: more boxes, more locations

V1, Bla1 (single node Virtuoso, Blazegraph)

V1_32 (32GB Virtuoso)

Fu1 (Fuseki + HDT)

V3 (Virtuoso cluster 3 nodes)

Fl3 (FluidOps, aka FedX)

Page 13: Scaling out federated queries for Life Sciences Data In Production

DISQOVER dataset ...

and queries

Page 14: Scaling out federated queries for Life Sciences Data In Production

Count, Union, Sort, Aggregations

Page 15: Scaling out federated queries for Life Sciences Data In Production

Example Query 1: Nesting, FILTERs, unbound triples

Page 16: Scaling out federated queries for Life Sciences Data In Production

Example Query 4: Aggregations, Optionals

Page 17: Scaling out federated queries for Life Sciences Data In Production

Initial performance results were counter-intuitive... and incorrect!!!

Worse hardware, better performance?

Page 18: Scaling out federated queries for Life Sciences Data In Production

Only Virtuoso-backed systems survive multi-threaded benchmark

Marker indicates the last successful query (no timeout)

Panels: 1 x 1,223 queries | 8 x 1,223 queries
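
The "last successful query" shown in this figure could be derived from a run log along these lines. The input format (elapsed seconds per query, `None` for a timeout) and the function name are assumptions; the sample values are made up and are not results from the benchmark.

```python
def last_successful_query(outcomes):
    """Return the 1-based position of the last query that finished, or 0 if none did.

    `outcomes` holds one entry per query in execution order:
    elapsed seconds for a completed query, or None for a timeout.
    """
    last = 0
    for position, elapsed in enumerate(outcomes, start=1):
        if elapsed is not None:
            last = position
    return last


# Made-up log: the system stops answering after the fourth query.
sample_log = [0.2, 1.5, None, 0.9, None, None]
print(last_successful_query(sample_log))  # prints 4
```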

Page 19: Scaling out federated queries for Life Sciences Data In Production

No errors but incorrect #results!!!
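
A simple way to catch this kind of silent failure is to compare result counts across systems for the same query. The sketch below flags every system whose count differs from the most common one. It is an assumption about how such a check might look, not the authors' tooling, and the system names and counts are invented.

```python
from collections import Counter


def flag_suspect_counts(result_counts):
    """Flag systems whose result count disagrees with the most common count.

    `result_counts` maps query_id -> {system_name: number_of_results}.
    Returns query_id -> sorted list of disagreeing systems.
    On a tie, the count seen first is treated as the reference.
    """
    suspects = {}
    for query_id, per_system in result_counts.items():
        majority, _ = Counter(per_system.values()).most_common(1)[0]
        outliers = sorted(s for s, n in per_system.items() if n != majority)
        if outliers:
            suspects[query_id] = outliers
    return suspects


# Invented counts for three hypothetical systems A, B and C.
counts = {
    "q1": {"A": 10, "B": 10, "C": 10},
    "q2": {"A": 25, "B": 25, "C": 0},
}
print(flag_suspect_counts(counts))  # prints {'q2': ['C']}
```

Agreement between systems does not prove that a count is correct, so a flagged query is a prompt for manual inspection rather than a verdict.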

Page 20: Scaling out federated queries for Life Sciences Data In Production

FILTERs, UNIONs are challenging but ORDER + GROUP + OPTIONAL dominate

COUNT DISTINCT

600 – 1,223 BGPs

Page 21: Scaling out federated queries for Life Sciences Data In Production

Conclusions & Future Work

• Additional diagnostics for RDF solutions!

• Extend benchmarking software with query correctness assessment!

• Multi-node RDF solutions???

• Towards Full paper:

– NoSQL for Ontoforce Data

– Scale-out approaches for WatDiv + test LDF (Linked Data Fragments)

– Release reusable end-to-end benchmark software:

• Setup AND Postprocessing

Page 22: Scaling out federated queries for Life Sciences Data In Production

Thanks for your attention!!

SCALING OUT FEDERATED QUERIES

FOR LIFE SCIENCES DATA IN PRODUCTION

Dieter De Witte, Laurens De Vocht, et al.

contact: [email protected]

slideshare:

• IMEC– IDLAB – GHENT UNIVERSITY

• ONTOFORCE