Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Using RDBMS, NoSQL or Hadoop?
Jean-Pierre Dijcks, Big Data Product Management, Server Technologies
DOAG Conference 2015
Data Ingest
Ingest
[Diagram: HDFS ingest - a 256 MB "chunk" is written synchronously across Node 1, Node 2 and Node 3.]
Ingest
[Diagram: HDFS ingest (a 256 MB "chunk" written synchronously across Nodes 1-3) next to NoSQL ingest (<key><value> "records" written to a Master Node and propagated to Replica Node 1 and Replica Node 2 at t, t+1, t+2: eventually consistent).]
Ingest
[Diagram: all three ingest paths - HDFS (synchronous 256 MB chunks across Nodes 1-3), NoSQL (eventually consistent <key><value> records from a Master Node to Replica Nodes 1 and 2 at t, t+1, t+2), and RDBMS ("transactions" ingested via SQL, parsed/validated in a compute layer and written, optimized, to ASM-managed database files).]
High-level Comparison

                    HDFS          NoSQL                   RDBMS
  Data Type         Chunk         Record                  Transaction
  Write Type        Synchronous   Eventually Consistent   ACID Compliant
  Data Preparation  No Parsing    No Parsing              Parsing and Validation
Disaster Recovery: High Availability Across Geography
Disaster Recovery: Hadoop
[Diagram: a Hadoop cluster in Data Center 1 (Nodes 1-3, 256 MB chunks written synchronously between nodes) alongside a second cluster in Data Center 2 (Nodes 1-3).]
• For disaster recovery, data MUST be duplicated to a separate cluster in Data Center 2
Disaster Recovery: RDBMS
[Diagram: an RDBMS in Data Center 1 ("transactions" ingested via SQL, parsed/validated, and written, optimized, to ASM-managed database files), replicated to Data Center 2 via Data Guard or GoldenGate.]
Disaster Recovery: NoSQL Option
[Diagram: a NoSQL Master Node in Data Center 1 (or Region 1) ingesting <key><value> "records", propagated eventually consistently (t, t+1, t+2) to Replica Node 1 and Replica Node 2 in Data Center 2 (or Region 2).]
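The replication timeline in this diagram can be sketched in a few lines of Python. This is an illustrative toy, not the Oracle NoSQL Database API: a master applies writes immediately, while a replica applies the replication log one step behind, so a read from the replica can return a stale value until it catches up.

```python
# Illustrative sketch of eventual consistency (not the Oracle NoSQL API).
# The master applies writes immediately; replicas replay its log later.

class Master:
    def __init__(self):
        self.store = {}
        self.log = []          # ordered replication log of (key, value)

    def put(self, key, value):
        self.store[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, master):
        self.master = master
        self.store = {}
        self.applied = 0       # how far into the master's log we have caught up

    def catch_up(self, n=1):
        # Apply the next n log entries, mimicking the asynchronous
        # propagation at t, t+1, t+2 from the diagram.
        for key, value in self.master.log[self.applied:self.applied + n]:
            self.store[key] = value
        self.applied = min(self.applied + n, len(self.master.log))

master = Master()
replica = Replica(master)

master.put("k1", "v1")              # t:   the write lands on the master only
stale = replica.store.get("k1")     # the replica has not applied it yet
replica.catch_up()                  # t+1: replication catches up
fresh = replica.store.get("k1")

print(stale, fresh)                 # None v1
```

A read routed to the replica between the put and the catch-up sees no value at all; that is the "may not be consistent (yet)" behavior the slides describe.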
Note on HBase
• Don't confuse HBase with NoSQL in terms of geographical distribution options
• HBase is based on HDFS storage
• Consequently, HBase cannot span data centers the way other NoSQL systems can
Big Data Appliance includes Cloudera BDR
• Oracle packages this on BDA as a Cloudera option called BDR (Backup and Disaster Recovery)
• BDR = DistCp
 – A distributed copy method leveraging parallel compute (MapReduce-like)
• BDR is NOT as proven as Data Guard or GoldenGate
 – Most importantly, think of BDR as a batch process, not a trickle feed
• BDR is included (no extra charge) on BDA
High-level Comparison

                    HDFS             NoSQL                   RDBMS
  Data Type         Chunk            Record                  Transaction
  Write Type        Synchronous      Eventually Consistent   ACID Compliant
  Data Preparation  No Parsing       No Parsing              Parsing and Validation
  DR Type           Second Cluster   Node Replica            Second RDBMS
  DR Unit           File             Record                  Transaction
  DR Timing         Batch            Record                  Transaction
Access Paths to Data Sets
Data Sets and Analytical SQL Queries: Hadoop
[Diagram: a JSON file is ingested, unparsed, into a 256 MB HDFS block.]
INSERT
• Data is NOT parsed on ingest; for example, a JSON document is not parsed. Original files are loaded and broken into chunks
• Does not like small files
UPDATE
• NO update allowed (append only)
• Expensive to update a few KB, since it means replacing a 256 MB chunk (as part of a file replacement)
SELECT
• Scans ALL data, even for a single-"row" answer
• The SQL access path is a "full table scan"
• Hive optimizes this with:
 – Metadata (Partition Pruning)
 – MapReduce in parallel to achieve scale
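The contrast between a full table scan and partition pruning can be sketched in plain Python. This is an illustration of the access pattern, not Hive or HDFS code; the partition names and records are invented:

```python
import json

# Illustrative sketch: unparsed JSON lines, laid out in partitions by
# date, the way Hive partitioning might organize files on HDFS.
partitions = {
    "2015-11-01": ['{"id": 1, "amt": 10}', '{"id": 2, "amt": 20}'],
    "2015-11-02": ['{"id": 3, "amt": 30}'],
}

def full_scan(pred):
    # No index, no usable metadata: every record in every partition
    # must be parsed and tested, even for a single-"row" answer.
    return [rec for part in partitions.values()
                for line in part
                for rec in [json.loads(line)] if pred(rec)]

def pruned_scan(day, pred):
    # Partition pruning: metadata (the partition key) eliminates whole
    # partitions before a single record inside them is parsed.
    return [rec for line in partitions.get(day, [])
                for rec in [json.loads(line)] if pred(rec)]

all_hits = full_scan(lambda r: r["id"] == 3)                      # parses 3 records
pruned_hits = pruned_scan("2015-11-02", lambda r: r["id"] == 3)   # parses 1 record
```

Both calls return the same answer; pruning only shrinks how much raw data has to be read and parsed to get it.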
Data Sets and Analytical SQL Queries: NoSQL
[Diagram: a JSON document stored as a <key><value> pair; access is by index lookup through the API: get, put, mget, mput.]
PUT (Insert)
• Data is typically not parsed; for example, a JSON document is loaded as a value with a key
GET (Select)
• Data retrieval through an "index lookup" based on the KEY
• The only access path is an index (primary or secondary)
• If retrieving from a replica, the result may not be consistent (yet)
• Generally, you do not issue SQL over NoSQL
• Not used for large analyses; instead used to receive and provide single records or small sets based on the same key
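The get/put/mget/mput access pattern can be sketched with a simple in-memory store. This is an illustrative toy, not the actual Oracle NoSQL Database API; the class and method names are assumptions for the sketch:

```python
# Minimal sketch of the get/put/mget/mput access pattern (illustrative
# only, not the Oracle NoSQL Database API). The only access path is the key.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def mput(self, pairs):
        # Bulk write of several key/value pairs.
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys):
        # Bulk read; missing keys come back as None.
        return [self._data.get(k) for k in keys]

store = KVStore()
store.put("user:1", '{"name": "Ann"}')      # JSON stored unparsed, as a value
store.mput([("user:2", '{"name": "Bob"}'), ("user:3", '{"name": "Cas"}')])

one = store.get("user:1")                   # index lookup by key
many = store.mget(["user:2", "user:9"])     # ['{"name": "Bob"}', None]
```

Note there is no way to ask "which users are named Bob?" without fetching and inspecting every value yourself, which is exactly why the slide says NoSQL is not used for large analyses.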
Data Sets and Analytical SQL Queries: RDBMS
[Diagram: data is parsed on write, metadata is available on write, and auxiliary structures (TABLE, INDEX) enable a myriad of access paths.]
INSERT
• Parses data and optimizes it for retrieval
• Adheres to transaction consistency
UPDATE
• Enables individual records to be updated
• With read consistency
• Row-level locking, etc.
SELECT
• Read consistency guaranteed
• SQL optimization:
 – Chooses the best access path based on the question and database statistics
 – Optimized joins, spills to disk, etc.
 – Supports very complex questions
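The parse-and-validate-on-ingest behavior, and the optimizer's choice of an index access path, can be demonstrated with SQLite from the Python standard library standing in for a full RDBMS. The table, column and index names are invented for the example:

```python
import sqlite3

# Sketch using SQLite as a stand-in for a full RDBMS: data is validated
# on INSERT, and an auxiliary index gives SELECT an access path other
# than a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        amount REAL NOT NULL CHECK (amount >= 0)   -- validated on write
    )
""")
conn.execute("CREATE INDEX idx_amount ON orders (amount)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 250.5), (3, 99.9)])

# Validation happens at ingest time, not at query time: a bad record
# is rejected before it ever lands on disk.
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (4, -5)")
except sqlite3.IntegrityError:
    rejected = True

# The optimizer can pick the index instead of scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM orders WHERE amount = 250.5"
).fetchall()
uses_index = any("idx_amount" in row[-1] for row in plan)
```

This is the trade the deck keeps returning to: work (parsing, validation, index maintenance) is paid at INSERT so that SELECT can be cheap.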
High-level Comparison

                            HDFS                    NoSQL                       RDBMS
  Data Type                 Chunk                   Record                      Transaction
  Write Type                Synchronous             Eventually Consistent       ACID Compliant
  Data Preparation          No Parsing              No Parsing                  Parsing and Validation
  DR Type                   Second Cluster          Node Replica                Second RDBMS
  DR Unit                   File                    Record                      Transaction
  DR Timing                 Batch                   Record                      Transaction
  Complex Analytics?        Yes                     No                          Yes
  Query Speed               Slow                    Fast for simple questions   Fast
  # of Data Access Methods  One (full table scan)   One (index lookup)          Many (Optimized)
  Summary                   Affordable Scale        Low Predictable Latency     Flexible Performance
Performance Aspects: Ingest and Query
A simple set of criteria
[Chart: RDBMS, NoSQL DB and Hadoop rated (scale 0-5) across: Concurrency, Complex Query Response Times, Single Record Read/Write Performance, Bulk Write Performance, Privileged User Security, General User Security, Governance Tools, System per TB Cost, Backup per TB Cost, Skills Acquisition Cost.]
Ingest Performance Considerations
HDFS
• No parsing
• Fastest for ingesting large data sets, like log files
• Bulk write of sets of data, not individual records
• Slower write on a record-by-record basis than NoSQL
NoSQL
• No parsing
• Fastest for writing individual records to the master
• Direct writes on a per-record basis
• Best for millisecond writes
RDBMS
• The RDBMS does work on the data (parsing) at ingest
• Ingest speed is slower than NoSQL on a record-by-record basis due to parsing
• Ingest speed is slower than Hadoop on a file-by-file basis due to parsing
• The benefits of RDBMS parsing become evident in query performance
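The ingest trade-off in these bullets can be sketched in Python: an HDFS-style ingest appends raw bytes untouched, while an RDBMS-style ingest parses and validates every record on the way in. The record format here is invented for the illustration:

```python
import json

# 1000 fake JSON records, as they might arrive in a log feed.
raw_lines = ['{"id": %d, "amt": %d}' % (i, i * 10) for i in range(1000)]

def hdfs_style_ingest(lines):
    # No parsing: the bytes are appended as-is. This is the fast path,
    # but all the parsing work is deferred to every future query.
    return "\n".join(lines)

def rdbms_style_ingest(lines):
    # Parse and validate every record on the way in: slower ingest,
    # but queries later read typed, validated rows.
    rows = []
    for line in lines:
        rec = json.loads(line)            # parse
        assert rec["amt"] >= 0            # validate
        rows.append((rec["id"], rec["amt"]))
    return rows

blob = hdfs_style_ingest(raw_lines)       # one opaque byte blob
rows = rdbms_style_ingest(raw_lines)      # typed (id, amt) tuples
```

Per record the parsed path does strictly more work, which is the whole point: HDFS and NoSQL win at ingest precisely because they skip it.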
Query Performance Considerations
HDFS
• Requires parsing
• Slowest on query time
 – Due to full table scans on top of parsing
 – No query caching
• Able to run complex queries
 – Requires use of MR or Spark programs
 – SQL immature (SQL-92)
NoSQL
• No parsing
• Fastest on query time for "get"
• Consistently fast is the goal
• No syntax available to run complex queries
RDBMS
• Fastest SQL query times because of the parsing work done on ingest
 – Completely optimized storage formats for I/O
 – Oracle knows how to pick data out; metadata is kept on the files
 – Least amount of I/O to retrieve records
 – Advanced caching
• An RDBMS can run more complex SQL queries than NoSQL or Hadoop
SQL on Hadoop vs RDBMS
Hadoop
• Pay on QUERY for GAINS on INGEST
• Full table scan
 – Shorten the full table scan by adding more nodes to scan through the data
• The problem remains that the data isn't parsed
 – All raw data must be read for EACH and EVERY read
RDBMS
• Pay on INGEST (parsing) for GAINS on QUERY
• Optimized query path
Benefits Compared
Hadoop
• "Schema on Read" flexibility
• Write files to disk without error checking
• Programmatically stitch disparate data together
• Affordable scale - HOWEVER, no API to run analytical queries
NoSQL
• Low latency for simple actions
• High performance for lots of workloads without the end user knowing
RDBMS
• Parsed data
• Multiple access paths
• Advanced query caching
• In-memory formats
• Consistency: the exact same result at the exact same time
• Traceability of transactions
• Applications become simpler
Concurrency
Concurrency Comparisons
HDFS
• Most Hadoop systems have a small number of users running a large number of jobs
 – Batch or very frequent micro-batch calculations
• Customers report Impala struggles with concurrency (much like Netezza, or worse)
• Immature resource management; not all solutions work nicely together
NoSQL
• Very high concurrency
• Geographically distributed
• Must deal with consistency issues
• A great reader farm for publishing data to, for example, web apps
• Load balancing built in
RDBMS
• Hundreds of users, hundreds of queries running at the same time
• Queries are balanced over the system
• Resource management is able to control the whole system
• Proven in many large organizations
Cost
Cost
• Customers want to reduce cost by moving from RDBMS to Hadoop
 – Hadoop delivers the lowest cost per TB
• But which workloads can move from RDBMS to Hadoop?
 – Dealing with transactions? Stay in the RDBMS
 – Running advanced SQL queries? Stay in the RDBMS
 – Large numbers of concurrent users? Stay in the RDBMS
• Moving a DW to Hadoop incurs a high performance penalty because of full table scans
Can SQL on Hadoop solve the problems without sacrifice?
No... 1½ Attempts
Impala
• A solution to speed up Hive and MR
 – Wrote its own optimizer and query engine
• Large speedup, but at a cost:
 – No MR, so it loses scalability
 – Fewer SQL capabilities, so it loses BI transparency
 – System cost increases to run Impala (it needs additional memory), so it loses some of the cost advantage
Impala with Parquet
• Converts files into columnar data blocks (the Parquet file format)
• Speedup because of columnar formats, etc.
• But at a cost:
 – Must run ETL into Parquet (comparable to ingest into an RDBMS)
 – Loses schema-on-read flexibility
 – Doubles the data
• Essentially a new columnar DB on HDFS
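Why a columnar layout such as Parquet speeds up analytical queries can be sketched in plain Python. This illustrates the idea, not the Parquet format itself; the row schema is invented:

```python
# Illustrative sketch of row vs columnar layout (not the Parquet format).
rows = [{"id": i, "amt": i * 10, "note": "x" * 50} for i in range(100)]

def sum_amt_row_layout(rows):
    # Row layout: summing one field still drags every field of every
    # row through the scan, including the wide "note" strings.
    return sum(r["amt"] for r in rows)

# "ETL to Parquet": reorganize the same data as one list per column.
# This is the extra ingest work (and the second copy of the data) the
# slide warns about.
columns = {
    "id":   [r["id"] for r in rows],
    "amt":  [r["amt"] for r in rows],
    "note": [r["note"] for r in rows],
}

def sum_amt_columnar(columns):
    # Columnar layout: the query touches only the column it needs.
    return sum(columns["amt"])

total = sum_amt_columnar(columns)
```

Both layouts give the same answer; the columnar one just reads far fewer bytes per analytical query, which is the speedup Impala-with-Parquet buys at the price of an ETL step and doubled storage.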
Conclusion: Choose the Right Tool for the Right Job. Don't use a tool for something it's not meant to do.