Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Using RDBMS, NoSQL or Hadoop?
Jean-Pierre Dijcks, Big Data Product Management, Server Technologies
DOAG Conference 2015
Data Ingest
Ingest
[Diagram: HDFS ingest - a 256 MB "chunk" is written synchronously across Node 1, Node 2 and Node 3.]
Ingest
[Diagram: HDFS ingest (a 256 MB "chunk" written synchronously across Nodes 1-3) next to NoSQL ingest (<key><value> "records" written to a Master Node and propagated to Replica Node 1 and Replica Node 2 at t, t+1, t+2: eventually consistent).]
Ingest
[Diagram: all three ingest paths - HDFS (synchronous 256 MB chunks across Nodes 1-3), NoSQL (eventually consistent <key><value> records from a Master Node to Replica Nodes 1 and 2 at t, t+1, t+2), and RDBMS ("transactions" ingested via SQL, parsed/validated in a compute layer and written, optimized, to ASM-managed database files).]
High-level Comparison

                    HDFS          NoSQL                   RDBMS
  Data Type         Chunk         Record                  Transaction
  Write Type        Synchronous   Eventually Consistent   ACID Compliant
  Data Preparation  No Parsing    No Parsing              Parsing and Validation
Disaster Recovery: High Availability Across Geography
Disaster Recovery: Hadoop
[Diagram: a Hadoop cluster in Data Center 1 (Nodes 1-3, 256 MB chunks written synchronously between nodes) alongside a second cluster in Data Center 2 (Nodes 1-3).]
• For disaster recovery, data MUST be duplicated to a separate cluster in Data Center 2
Disaster Recovery: RDBMS
[Diagram: an RDBMS in Data Center 1 ("transactions" ingested via SQL, parsed/validated, and written, optimized, to ASM-managed database files), replicated to Data Center 2 via Data Guard or GoldenGate.]
Disaster Recovery: NoSQL Option
[Diagram: a NoSQL Master Node in Data Center 1 (or Region 1) ingesting <key><value> "records", propagated eventually consistently (t, t+1, t+2) to Replica Node 1 and Replica Node 2 in Data Center 2 (or Region 2).]
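The replication timeline in this diagram can be sketched in a few lines of Python. This is an illustrative toy, not the Oracle NoSQL Database API: a master applies writes immediately, while a replica applies the replication log one step behind, so a read from the replica can return a stale value until it catches up.

```python
# Illustrative sketch of eventual consistency (not the Oracle NoSQL API).
# The master applies writes immediately; replicas replay its log later.

class Master:
    def __init__(self):
        self.store = {}
        self.log = []          # ordered replication log of (key, value)

    def put(self, key, value):
        self.store[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, master):
        self.master = master
        self.store = {}
        self.applied = 0       # how far into the master's log we have caught up

    def catch_up(self, n=1):
        # Apply the next n log entries, mimicking the asynchronous
        # propagation at t, t+1, t+2 from the diagram.
        for key, value in self.master.log[self.applied:self.applied + n]:
            self.store[key] = value
        self.applied = min(self.applied + n, len(self.master.log))

master = Master()
replica = Replica(master)

master.put("k1", "v1")              # t:   the write lands on the master only
stale = replica.store.get("k1")     # the replica has not applied it yet
replica.catch_up()                  # t+1: replication catches up
fresh = replica.store.get("k1")

print(stale, fresh)                 # None v1
```

A read routed to the replica between the put and the catch-up sees no value at all; that is the "may not be consistent (yet)" behavior the slides describe.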
Note on HBase
• Don't confuse HBase with NoSQL in terms of geographical distribution options
• HBase is based on HDFS storage
• Consequently, HBase cannot span data centers the way other NoSQL systems can
Big Data Appliance includes Cloudera BDR
• Oracle packages this on BDA as a Cloudera option called BDR (Backup and Disaster Recovery)
• BDR = DistCp
 – A distributed copy method leveraging parallel compute (MapReduce-like)
• BDR is NOT as proven as Data Guard or GoldenGate
 – Most importantly, think of BDR as a batch process, not a trickle feed
• BDR is included (no extra charge) on BDA
High-level Comparison

                    HDFS             NoSQL                   RDBMS
  Data Type         Chunk            Record                  Transaction
  Write Type        Synchronous      Eventually Consistent   ACID Compliant
  Data Preparation  No Parsing       No Parsing              Parsing and Validation
  DR Type           Second Cluster   Node Replica            Second RDBMS
  DR Unit           File             Record                  Transaction
  DR Timing         Batch            Record                  Transaction
Access Paths to Data Sets
Data Sets and Analytical SQL Queries: Hadoop
[Diagram: a JSON file is ingested, unparsed, into a 256 MB HDFS block.]
INSERT
• Data is NOT parsed on ingest; for example, a JSON document is not parsed. Original files are loaded and broken into chunks
• Does not like small files
UPDATE
• NO update allowed (append only)
• Expensive to update a few KB, since it means replacing a 256 MB chunk (as part of a file replacement)
SELECT
• Scans ALL data, even for a single-"row" answer
• The SQL access path is a "full table scan"
• Hive optimizes this with:
 – Metadata (Partition Pruning)
 – MapReduce in parallel to achieve scale
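The contrast between a full table scan and partition pruning can be sketched in plain Python. This is an illustration of the access pattern, not Hive or HDFS code; the partition names and records are invented:

```python
import json

# Illustrative sketch: unparsed JSON lines, laid out in partitions by
# date, the way Hive partitioning might organize files on HDFS.
partitions = {
    "2015-11-01": ['{"id": 1, "amt": 10}', '{"id": 2, "amt": 20}'],
    "2015-11-02": ['{"id": 3, "amt": 30}'],
}

def full_scan(pred):
    # No index, no usable metadata: every record in every partition
    # must be parsed and tested, even for a single-"row" answer.
    return [rec for part in partitions.values()
                for line in part
                for rec in [json.loads(line)] if pred(rec)]

def pruned_scan(day, pred):
    # Partition pruning: metadata (the partition key) eliminates whole
    # partitions before a single record inside them is parsed.
    return [rec for line in partitions.get(day, [])
                for rec in [json.loads(line)] if pred(rec)]

all_hits = full_scan(lambda r: r["id"] == 3)                      # parses 3 records
pruned_hits = pruned_scan("2015-11-02", lambda r: r["id"] == 3)   # parses 1 record
```

Both calls return the same answer; pruning only shrinks how much raw data has to be read and parsed to get it.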
Data Sets and Analytical SQL Queries: NoSQL
[Diagram: a JSON document stored as a <key><value> pair; access is by index lookup through the API: get, put, mget, mput.]
PUT (Insert)
• Data is typically not parsed; for example, a JSON document is loaded as a value with a key
GET (Select)
• Data retrieval through an "index lookup" based on the KEY
• The only access path is an index (primary or secondary)
• If retrieving from a replica, the result may not be consistent (yet)
• Generally, you do not issue SQL over NoSQL
• Not used for large analyses; instead used to receive and provide single records or small sets based on the same key
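The get/put/mget/mput access pattern can be sketched with a simple in-memory store. This is an illustrative toy, not the actual Oracle NoSQL Database API; the class and method names are assumptions for the sketch:

```python
# Minimal sketch of the get/put/mget/mput access pattern (illustrative
# only, not the Oracle NoSQL Database API). The only access path is the key.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def mput(self, pairs):
        # Bulk write of several key/value pairs.
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys):
        # Bulk read; missing keys come back as None.
        return [self._data.get(k) for k in keys]

store = KVStore()
store.put("user:1", '{"name": "Ann"}')      # JSON stored unparsed, as a value
store.mput([("user:2", '{"name": "Bob"}'), ("user:3", '{"name": "Cas"}')])

one = store.get("user:1")                   # index lookup by key
many = store.mget(["user:2", "user:9"])     # ['{"name": "Bob"}', None]
```

Note there is no way to ask "which users are named Bob?" without fetching and inspecting every value yourself, which is exactly why the slide says NoSQL is not used for large analyses.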
Data Sets and Analytical SQL Queries: RDBMS
[Diagram: data is parsed on write, metadata is available on write, and auxiliary structures (TABLE, INDEX) enable a myriad of access paths.]
INSERT
• Parses data and optimizes it for retrieval
• Adheres to transaction consistency
UPDATE
• Enables individual records to be updated
• With read consistency
• Row-level locking, etc.
SELECT
• Read consistency guaranteed
• SQL optimization:
 – Chooses the best access path based on the question and database statistics
 – Optimized joins, spills to disk, etc.
 – Supports very complex questions
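The parse-and-validate-on-ingest behavior, and the optimizer's choice of an index access path, can be demonstrated with SQLite from the Python standard library standing in for a full RDBMS. The table, column and index names are invented for the example:

```python
import sqlite3

# Sketch using SQLite as a stand-in for a full RDBMS: data is validated
# on INSERT, and an auxiliary index gives SELECT an access path other
# than a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        amount REAL NOT NULL CHECK (amount >= 0)   -- validated on write
    )
""")
conn.execute("CREATE INDEX idx_amount ON orders (amount)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 250.5), (3, 99.9)])

# Validation happens at ingest time, not at query time: a bad record
# is rejected before it ever lands on disk.
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (4, -5)")
except sqlite3.IntegrityError:
    rejected = True

# The optimizer can pick the index instead of scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM orders WHERE amount = 250.5"
).fetchall()
uses_index = any("idx_amount" in row[-1] for row in plan)
```

This is the trade the deck keeps returning to: work (parsing, validation, index maintenance) is paid at INSERT so that SELECT can be cheap.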
High-level Comparison

                            HDFS                    NoSQL                       RDBMS
  Data Type                 Chunk                   Record                      Transaction
  Write Type                Synchronous             Eventually Consistent       ACID Compliant
  Data Preparation          No Parsing              No Parsing                  Parsing and Validation
  DR Type                   Second Cluster          Node Replica                Second RDBMS
  DR Unit                   File                    Record                      Transaction
  DR Timing                 Batch                   Record                      Transaction
  Complex Analytics?        Yes                     No                          Yes
  Query Speed               Slow                    Fast for simple questions   Fast
  # of Data Access Methods  One (full table scan)   One (index lookup)          Many (Optimized)
  Summary                   Affordable Scale        Low Predictable Latency     Flexible Performance
Performance Aspects: Ingest and Query
A simple set of criteria
[Chart: RDBMS, NoSQL DB and Hadoop rated (scale 0-5) across: Concurrency, Complex Query Response Times, Single Record Read/Write Performance, Bulk Write Performance, Privileged User Security, General User Security, Governance Tools, System per TB Cost, Backup per TB Cost, Skills Acquisition Cost.]
Ingest Performance Considerations
HDFS
• No parsing
• Fastest for ingesting large data sets, like log files
• Bulk write of sets of data, not individual records
• Slower write on a record-by-record basis than NoSQL
NoSQL
• No parsing
• Fastest for writing individual records to the master
• Direct writes on a per-record basis
• Best for millisecond writes
RDBMS
• The RDBMS does work on the data (parsing) at ingest
• Ingest speed is slower than NoSQL on a record-by-record basis due to parsing
• Ingest speed is slower than Hadoop on a file-by-file basis due to parsing
• The benefits of RDBMS parsing become evident in query performance
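The ingest trade-off in these bullets can be sketched in Python: an HDFS-style ingest appends raw bytes untouched, while an RDBMS-style ingest parses and validates every record on the way in. The record format here is invented for the illustration:

```python
import json

# 1000 fake JSON records, as they might arrive in a log feed.
raw_lines = ['{"id": %d, "amt": %d}' % (i, i * 10) for i in range(1000)]

def hdfs_style_ingest(lines):
    # No parsing: the bytes are appended as-is. This is the fast path,
    # but all the parsing work is deferred to every future query.
    return "\n".join(lines)

def rdbms_style_ingest(lines):
    # Parse and validate every record on the way in: slower ingest,
    # but queries later read typed, validated rows.
    rows = []
    for line in lines:
        rec = json.loads(line)            # parse
        assert rec["amt"] >= 0            # validate
        rows.append((rec["id"], rec["amt"]))
    return rows

blob = hdfs_style_ingest(raw_lines)       # one opaque byte blob
rows = rdbms_style_ingest(raw_lines)      # typed (id, amt) tuples
```

Per record the parsed path does strictly more work, which is the whole point: HDFS and NoSQL win at ingest precisely because they skip it.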
Query Performance Considerations
HDFS
• Requires parsing
• Slowest on query time
 – Due to full table scans on top of parsing
 – No query caching
• Able to run complex queries
 – Requires use of MR or Spark programs
 – SQL immature (SQL-92)
NoSQL
• No parsing
• Fastest on query time for "get"
• Consistently fast is the goal
• No syntax available to run complex queries
RDBMS
• Fastest SQL query times because of the parsing work done on ingest
 – Completely optimized storage formats for I/O
 – Oracle knows how to pick data out; metadata is kept on the files
 – Least amount of I/O to retrieve records
 – Advanced caching
• An RDBMS can run more complex SQL queries than NoSQL or Hadoop
SQL on Hadoop vs RDBMS
Hadoop
• Pay on QUERY for GAINS on INGEST
• Full table scan
 – Shorten the full table scan by adding more nodes to scan through the data
• The problem remains that the data isn't parsed
 – All raw data must be read for EACH and EVERY read
RDBMS
• Pay on INGEST (parsing) for GAINS on QUERY
• Optimized query path
Benefits Compared
Hadoop
• "Schema on Read" flexibility
• Write files to disk without error checking
• Programmatically stitch disparate data together
• Affordable scale - HOWEVER, no API to run analytical queries
NoSQL
• Low latency for simple actions
• High performance for lots of workloads without the end user knowing
RDBMS
• Parsed data
• Multiple access paths
• Advanced query caching
• In-memory formats
• Consistency: the exact same result at the exact same time
• Traceability of transactions
• Applications become simpler
Concurrency
Concurrency Comparisons
HDFS
• Most Hadoop systems have a small number of users running a large number of jobs
 – Batch or very frequent micro-batch calculations
• Customers report Impala struggles with concurrency (much like Netezza, or worse)
• Immature resource management; not all solutions work nicely together
NoSQL
• Very high concurrency
• Geographically distributed
• Must deal with consistency issues
• A great reader farm for publishing data to, for example, web apps
• Load balancing built in
RDBMS
• Hundreds of users, hundreds of queries running at the same time
• Queries are balanced over the system
• Resource management is able to control the whole system
• Proven in many large organizations
Cost
Cost
• Customers want to reduce cost by moving from RDBMS to Hadoop
 – Hadoop delivers the lowest cost per TB
• But which workloads can move from RDBMS to Hadoop?
 – Dealing with transactions? Stay in the RDBMS
 – Running advanced SQL queries? Stay in the RDBMS
 – Large numbers of concurrent users? Stay in the RDBMS
• Moving a DW to Hadoop incurs a high performance penalty because of full table scans
Can SQL on Hadoop solve the problems without sacrifice?
No... 1½ Attempts
Impala
• A solution to speed up Hive and MR
 – Wrote its own optimizer and query engine
• Large speedup, but at a cost:
 – No MR, so it loses scalability
 – Fewer SQL capabilities, so it loses BI transparency
 – System cost increases to run Impala (it needs additional memory), so it loses some of the cost advantage
Impala with Parquet
• Converts files into columnar data blocks (the Parquet file format)
• Speedup because of columnar formats, etc.
• But at a cost:
 – Must run ETL into Parquet (comparable to ingest into an RDBMS)
 – Loses schema-on-read flexibility
 – Doubles the data
• Essentially a new columnar DB on HDFS
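Why a columnar layout such as Parquet speeds up analytical queries can be sketched in plain Python. This illustrates the idea, not the Parquet format itself; the row schema is invented:

```python
# Illustrative sketch of row vs columnar layout (not the Parquet format).
rows = [{"id": i, "amt": i * 10, "note": "x" * 50} for i in range(100)]

def sum_amt_row_layout(rows):
    # Row layout: summing one field still drags every field of every
    # row through the scan, including the wide "note" strings.
    return sum(r["amt"] for r in rows)

# "ETL to Parquet": reorganize the same data as one list per column.
# This is the extra ingest work (and the second copy of the data) the
# slide warns about.
columns = {
    "id":   [r["id"] for r in rows],
    "amt":  [r["amt"] for r in rows],
    "note": [r["note"] for r in rows],
}

def sum_amt_columnar(columns):
    # Columnar layout: the query touches only the column it needs.
    return sum(columns["amt"])

total = sum_amt_columnar(columns)
```

Both layouts give the same answer; the columnar one just reads far fewer bytes per analytical query, which is the speedup Impala-with-Parquet buys at the price of an ETL step and doubled storage.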
Conclusion: Choose the Right Tool for the Right Job. Don't use a tool for something it's not meant to do.