BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

EndoMine System, Jewish General Hospital, by David Lauzon and Anton Zakharov, Big Data Montreal #9, February 5th 2013

description

High-level description of a use case from one hospital department, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.

Transcript of BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Page 1: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

EndoMine System
Jewish General Hospital

by David Lauzon and Anton Zakharov
Big Data Montreal #9
February 5th 2013

Page 2: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Presentation

• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
  – Big Data Solution Overview
  – Hive Table Schema
  – Compression Performance
  – Data Architecture in Hadoop
  – Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?

Page 3: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Our Objectives

• Lead an end-of-study project in an industrial context
  – Requirements elicitation
  – Implement a « proof-of-concept » prototype
• Experiment with big data technologies
  – Compare with RDBMS

Page 4: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Requirements and context

• Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
  – Dr. Shaun Eintracht
    • « ad hoc » Query
    • ETL Query
  – Dr. Elizabeth Mac Namara
    • « business intelligence » requirements
    • Realtime Dashboard
• Department of Endocrinology
  – Dr. Mark Trifiro
    • Data mining

Page 5: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Project scope

• First iteration = improve ad-hoc queries
  – Slow analytical queries and ETL (MS Access)
  – Risk of « crashing » production DB
  – Some queries impossible to process

Page 6: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Production DB (Oracle)

Page 7: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Solutions

• Solution 1: Hadoop + Impala
• Solution 2: Tune the existing Oracle RDBMS

Page 8: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Big Data Solution Overview

Page 9: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Hive Table Schema

Page 10: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Compression Performance

[Chart. Bar labels: Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy. Legend: Impala, Hive, Oracle. Axis range: 0 to 250.]

Page 11: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Data Architecture in Hadoop

• All big tables are pre-joined
  – With specimen (1)
  – Without specimen (2)
• Partitioned using two schemes
  – Year-month (3)
  – Year and Test (4)
• 4 different versions of the same data:
  – stay_order_results_yearmonth
  – stay_order_results_year_and_test
  – stay_order_results_specimen_yearmonth
  – stay_order_results_specimen_year_and_test
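To illustrate why the same data is kept under two partitioning schemes, here is a minimal Python sketch. It is not the authors' code: the sample rows, the test codes (GLU, TSH) and the key formats are all hypothetical, and a real Hive/Impala table would store each partition as a directory of files rather than an in-memory list. The point is only that a query filtering on a given test reads fewer rows when the partition key includes the test.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (result_date, test_code, value).
ROWS = [
    (date(2012, 1, 15), "GLU", 5.4),
    (date(2012, 1, 20), "TSH", 2.1),
    (date(2012, 2, 3), "GLU", 6.0),
    (date(2013, 1, 9), "GLU", 5.1),
]


def yearmonth_key(row):
    """Partition key for the year-month scheme (format is an assumption)."""
    d, _test, _value = row
    return f"yearmonth={d.year:04d}-{d.month:02d}"


def year_and_test_key(row):
    """Partition key for the year-and-test scheme (format is an assumption)."""
    d, test, _value = row
    return f"year={d.year:04d}/test={test}"


def partition(rows, key_fn):
    """Group rows by partition key, mimicking one directory per partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_fn(row)].append(row)
    return dict(parts)


def partitions_scanned(parts, wanted):
    """Return the sorted partition keys that satisfy the key predicate."""
    return sorted(k for k in parts if wanted(k))


def rows_read(parts, keys):
    """Count the rows stored in the given partitions."""
    return sum(len(parts[k]) for k in keys)


by_month = partition(ROWS, yearmonth_key)
by_year_test = partition(ROWS, year_and_test_key)

# Query: all glucose (GLU) results in 2012.
# Year-month scheme: every 2012 partition is read, then filtered by test.
month_scan = partitions_scanned(by_month, lambda k: k.startswith("yearmonth=2012-"))
# Year-and-test scheme: only the matching partition is read.
test_scan = partitions_scanned(by_year_test, lambda k: k == "year=2012/test=GLU")

print(month_scan, rows_read(by_month, month_scan))
print(test_scan, rows_read(by_year_test, test_scan))
```

The trade-off is the one the slide implies: each extra scheme is another full copy of the data that the ETL has to produce and keep in sync.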

Page 12: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Hadoop Prototype Demo

Page 13: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Oracle Solution

• Same tables as source DB
  – A big pre-joined table is not a good solution
• Techniques explored:
  – Partitioning
    • Partitions automatically created
  – Compression
    • Inefficient for joins
  – Clustering
  – Join multiple partitioned tables

Page 14: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Oracle Solution (continued)

• Avoid too many indexes on the big tables:
  – They take a lot of memory
  – They are slow to create
  – They may not be used if the query touches more than 5% of the rows
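The 5% figure above is a rule of thumb, not a documented Oracle setting; the actual optimizer decides from table statistics and its cost model. As a rough sketch of the arithmetic only, the hypothetical helper below (name and default threshold are assumptions) checks whether a query's selectivity falls under that threshold.

```python
def index_likely_used(matching_rows, total_rows, threshold=0.05):
    """Apply the slide's rule of thumb: an index tends to pay off only
    when the query touches a small fraction of the table.

    The 5% default is the heuristic quoted in the slide, not a value
    taken from any database's documentation.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if matching_rows < 0:
        raise ValueError("matching_rows must not be negative")
    return matching_rows / total_rows <= threshold


# Hypothetical figures: a lookup of 1,000 rows in a 1,000,000-row table
# (0.1% of rows) versus a report covering 200,000 rows (20% of rows).
print(index_likely_used(1_000, 1_000_000))
print(index_likely_used(200_000, 1_000_000))
```

For analytical queries that scan a large share of a table, this is why partitioning tends to matter more than adding indexes.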

Page 15: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Comparison: Hadoop Solution

• Pros
  – Crunches massive amounts of data
  – Scalability
  – Free software
• Cons
  – Needs a better UI and tune-ups
  – Maintenance cost
  – Requires ETL time to merge data into one table
  – BIG joins should be avoided

Page 16: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Comparison: Oracle Solution

• Pros
  – Just need to create a slave DB (just?)
  – Faster random lookups
  – Easier to find expertise
• Cons
  – Scalability only up to a certain point
  – Synchronisation with master DB:
    • Rebuilding indexes would take hours

Page 17: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

What are expensive queries?

• If possible, avoid these constructs on large result sets
  – SELECT DISTINCT
  – ORDER BY
  – GROUP BY
  – JOIN big table with another big table
    • JOIN big table with multiple small tables should be OK
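One common reason a join between a big table and small tables stays cheap is that the small side fits in memory as a hash table, so the big side is streamed once without sorting or shuffling. The sketch below shows that idea in plain Python. It is an illustration under assumptions, not how Impala or Oracle implement joins internally, and the table contents and column names are hypothetical.

```python
def hash_join(big_rows, small_rows, big_key, small_key):
    """Inner-join a large row stream against a small in-memory table.

    The small table is loaded into a dict once; each big row then costs
    one dictionary lookup, and the big side is never held in memory.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield {**row, **match}


# Hypothetical "big" results table (only three rows here for brevity).
results = [
    {"result_id": 1, "test_code": "GLU", "value": 5.4},
    {"result_id": 2, "test_code": "TSH", "value": 2.1},
    {"result_id": 3, "test_code": "XYZ", "value": 0.0},  # no matching test
]

# Hypothetical small lookup table of test definitions.
tests = [
    {"test_code": "GLU", "test_name": "Glucose"},
    {"test_code": "TSH", "test_name": "Thyroid-stimulating hormone"},
]

joined = list(hash_join(results, tests, "test_code", "test_code"))
for row in joined:
    print(row["result_id"], row["test_name"], row["value"])
```

When both sides are big, neither fits in memory as a lookup table, which is where the cost the slide warns about comes from.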

Page 18: BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Conclusion

• Recommendation to use a “classic” RDBMS
  – The database fits on a single node
  – Existing expertise in-house
  – Acceptable performance with appropriate tune-ups
  – Stop using MS Access
• Disadvantage: limited scalability