Hadoop databases for Oracle DBAs

Hadoop databases: Hive, Impala, Spark, Presto
For Oracle DBAs

Session ID: 557

Prepared by: Maxym Kharchenko, Gluent
@maxymkh

April 2-6, 2017 in Las Vegas, NV USA #C17LV

Whoami
• Database kernel developer -> Oracle DBA -> Database Hadoop/Cloud developer
• Worked with Oracle for the last 15 years
• OCM, Oracle ACE alumnus, Amazon alumnus
• Last year: OLTP -> Hadoop

Shameless plug about my company

(diagram: Gluent links existing systems - Oracle, Teradata, MSSQL, NoSQL, App X, App Y, App Z - with Big Data sources)

Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. Oracle (hopefully, with a demo)
• Best ways to start

What is Hadoop:
• For "Big data"
• Can deal with "unstructured" data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs ("NoSQL")

Yes, but what does it all mean?

Imagine that you are Google in the early 2000s

Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use

Let’s build a Data Warehouse

(Traditional) Data Warehouse
• Been around for years
• Mature and (relatively) advanced
• SQL !!!

Data Warehouse scorecard
Requirements (vs. RDBMS):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data ¯\_(ツ)_/¯

Scaling up: "Big data" ain't cheap
• Can't fit all of the data on a single box
• Cost is quickly getting out of hand

(Cheap) commodity systems make "big data" feasible

Solution = commodity systems

(diagram: one $$$$$ server = many $$ commodity boxes)

Commodity systems scorecard
Requirements (vs. commodity systems):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data

All your queries are Java Classes

Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper

Hadoop
• 2006: Hadoop

”Traditional Data Warehouse” vs. Hadoop

Requirements (Hadoop vs. Data Warehouse):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data ¯\_(ツ)_/¯

SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!

Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!

And then, it exploded …

"Hadoop" vs. "Relational" databases
Demo ... hopefully

This is not about NoSQL :-)

Same: Tables

sql> describe sh.products;
+-----------------------+----------------+---------+
| name                  | type           | comment |
+-----------------------+----------------+---------+
| prod_id               | bigint         |         |
| prod_name             | string         |         |
| prod_desc             | string         |         |
| prod_category_id      | bigint         |         |
| prod_category_desc    | string         |         |
| supplier_id           | bigint         |         |
| prod_total_id         | decimal(38,18) |         |
| prod_src_id           | decimal(38,18) |         |
| prod_eff_from         | timestamp      |         |
| prod_eff_to           | timestamp      |         |
| prod_valid            | string         |         |
+-----------------------+----------------+---------+
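A table like this is typically just metadata layered over files in HDFS or S3. A minimal sketch of how it might be declared (the Parquet format and the file location are assumptions for illustration, not taken from the demo environment):

CREATE EXTERNAL TABLE sh.products (
  prod_id             BIGINT,
  prod_name           STRING,
  prod_desc           STRING,
  prod_category_id    BIGINT,
  prod_category_desc  STRING,
  supplier_id         BIGINT,
  prod_total_id       DECIMAL(38,18),
  prod_src_id         DECIMAL(38,18),
  prod_eff_from       TIMESTAMP,
  prod_eff_to         TIMESTAMP,
  prod_valid          STRING
)
STORED AS PARQUET                        -- open, columnar file format
LOCATION 'hdfs://clust1/sh/products';    -- the engine does not "own" these files

Dropping an external table drops only the metadata; the data files stay where they are.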

Same: Running SQL queries

sql> select prod_id, count(1)
     from   sh.sales s, sh.channels c
     where  c.channel_id = s.channel_id
     and    c.channel_desc = 'Catalog'
     group by prod_id
     order by 2 desc
     limit 5;

+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s

Same: Queries are optimized

sql> explain select count(1) from sh.times;
+-----------------------------------------------------------+
| Explain String                                             |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1   |
|                                                            |
| 03:AGGREGATE [FINALIZE]                                    |
| |  output: count:merge(1)                                  |
| |                                                          |
| 02:EXCHANGE [UNPARTITIONED]                                |
| |                                                          |
| 01:AGGREGATE                                               |
| |  output: count(1)                                        |
| |                                                          |
| 00:SCAN HDFS [sh.times]                                    |
|    partitions=16/16 files=32 size=500.45KB                 |
+-----------------------------------------------------------+

Different: What gets optimized
• No "regular" indexes
• But many operations are distributed (see the partitioning sketch below)

(diagram: SALES 1 + TIMES 1, SALES 2 + TIMES 2, SALES 3 + TIMES 3 co-located on separate nodes)
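Instead of indexes, partitioning (plus per-file min/max statistics) limits how much data gets scanned. A hedged sketch, reusing the time_id partition key that appears in the partition listing later on (the sh.sales_part table name is purely illustrative):

CREATE TABLE sh.sales_part (
  prod_id    BIGINT,
  channel_id BIGINT,
  amount     DECIMAL(10,2)
)
PARTITIONED BY (time_id STRING)    -- the partition column plays the role of an index
STORED AS PARQUET;

-- only the files under time_id=2011-07 are read; all other partitions are pruned
SELECT count(1)
FROM   sh.sales_part
WHERE  time_id = '2011-07';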

Different: Native cloud filesystem support

sql> show partition sh.sales;

s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
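Mixing HDFS and S3 locations in one table works because each partition records its own location in the metastore. A hedged sketch of pointing a new partition at S3 (the 2012-01 value is purely illustrative):

ALTER TABLE sh.sales
  ADD PARTITION (time_id = '2012-01')
  LOCATION 's3a://bucket1/sh/sales/time_id=2012-01';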

Database engine does NOT ”own” data

Different: Different engines can work with the same data files (even at the same time)

(diagram: Oracle datafiles such as example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf belong to a single instance; Parquet files such as a01_data.parq ... a06_data.parq are just files that any engine can read)
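For example, the table registered in the metastore and a Spark SQL session can read the very same Parquet files at the same time. A hedged sketch (the path is reused from the partition listing above):

-- Impala or Hive reads the files through the table defined in the metastore
SELECT count(1) FROM sh.sales WHERE time_id = '2011-09';

-- Spark SQL can read the same Parquet files directly, with no extra copy or DDL
SELECT count(1) FROM parquet.`hdfs://clust1/sh/sales/time_id=2011-09`;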

Different: ... or copies of the data files

(diagram: the same set of files, a.parq through f.parq, copied across hdfs://prod/, hdfs://adhoc/ and s3://backup/)

Different: Open data formats
• Not proprietary - many tools can read/write them
• No additional $$ for "advanced features" (see the sketch below):
  • Columnar storage
  • Storage indexes
  • Compression
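These "advanced features" come from the file format itself rather than from a licensed option. A hedged sketch of writing a compressed, columnar copy of a table from Impala (the target table name is an assumption):

-- choose the Parquet compression codec for files written by this session
SET COMPRESSION_CODEC=snappy;

-- a plain CTAS then produces columnar Parquet files that carry their own
-- per-block min/max "storage index" statistics
CREATE TABLE sh.sales_parquet
STORED AS PARQUET
AS SELECT * FROM sh.sales;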

Same: "sqlplus-like" clients

> impala-shell -i 10.0.0.1

[10.0.0.1:21000] > select prod_id, count(1)
                   from sh.sales
                   group by prod_id
                   order by 2 desc
                   limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+

> beeline -u 'jdbc:hive2://10.0.0.1:10000'

0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
                            from sh.sales
                            group by prod_id
                            order by 2 desc
                            limit 1;

Different: External dictionary

(diagram: in Oracle, user data and the dictionary (SYS) live together inside the database; in Hadoop, user data lives in plain files while the dictionary lives in an external Hive Metastore shared by every engine)
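The metastore holds what Oracle keeps in SYS: column definitions, file locations, formats and statistics, and every engine attached to the same metastore sees the same picture. One way to peek at it from any SQL client:

-- DESCRIBE FORMATTED (Hive and Impala) dumps the metastore's view of a table:
-- data file location, file format / SerDe, table type, statistics
DESCRIBE FORMATTED sh.sales;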

Different: Append-only, "ETL-like" DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read

UPDATE t SET a=12 WHERE b=1;

(diagram: before the update, table T has a single base file a_data.orc; the update adds a delta file b_data.orc next to the base; once the compactor runs, base and delta are merged into a new base file c_data.orc - see the sketch below)
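On Hive, this base/delta/compaction behavior needs an ORC table flagged as transactional. A minimal sketch, assuming a cluster with ACID support enabled (the table and columns are taken from the UPDATE above):

-- ACID tables must be stored as ORC; older Hive versions also require bucketing
CREATE TABLE t (
  a INT,
  b INT
)
CLUSTERED BY (b) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE t SET a = 12 WHERE b = 1;   -- writes a delta file next to the base file

ALTER TABLE t COMPACT 'major';     -- merges base + deltas into a new base file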

Databases

Apache Hive
• "Designed" for "batch" queries (*)
• Runs on top of the standard Hadoop RM: YARN
• Supports multiple "engines": MR, TEZ, Spark (see the sketch below)
• SerDes

(diagram: the Master node runs Hiveserver2, the YARN RM and the HDFS namenode; each Slave node runs a YARN NM and an HDFS datanode)
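The execution engine is picked per session, and the same SQL compiles to jobs for whichever engine is selected. A hedged sketch from beeline (which engines are actually available depends on how the cluster was built):

SET hive.execution.engine=tez;   -- or mr, or spark

SELECT prod_id, count(1)
FROM   sh.sales
GROUP BY prod_id;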

Apache Impala
• Designed for "quick interactive" queries
• "Data-local" execution
• In-memory processing

(diagram: each Slave node runs an impalad next to the HDFS datanode; the Master node runs statestored, catalogd and the HDFS namenode)
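Because Impala keeps its own cached copy of the metastore and leans heavily on statistics, two statements come up constantly in day-to-day use; shown here as a hedged sketch:

-- refresh Impala's cached catalog after another engine (e.g. Hive) changes a table
INVALIDATE METADATA sh.sales;

-- gather table and column statistics so the planner can pick sensible join plans
COMPUTE STATS sh.sales;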

Apache Spark
• "Better Hadoop" with "native" SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: "native" (standalone), YARN, Mesos
• Flexible programming model

(diagram: a Driver on the Master node coordinates Executors running on each Slave node)
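When Spark is pointed at the same Hive Metastore, the warehouse tables are immediately queryable; the spark-sql shell is the quickest way to try it. A hedged sketch, assuming Spark was built with Hive support and runs on YARN:

> spark-sql --master yarn

spark-sql> select prod_id, count(1)
         > from sh.sales
         > group by prod_id
         > order by 2 desc
         > limit 1;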

Presto
• Designed for "interactive" queries
• In-memory processing
• Custom storage "plugins": Hive, Kafka, MySQL, Postgres, ... (see the sketch below)

(diagram: a coordinator on the Master node distributes work to worker processes on each Slave node)
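The storage plugins are what make Presto interesting next to Hive and Impala: a single query can join tables that live in different systems. A hedged sketch with the presto CLI (the mysql catalog and the crm.products table are assumptions):

> presto --server 10.0.0.1:8080 --catalog hive --schema sh

presto> select s.prod_id, count(1)
     -> from sales s
     ->      join mysql.crm.products p on p.prod_id = s.prod_id
     -> group by s.prod_id
     -> order by 2 desc
     -> limit 1;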

How to start

Step 1: Google “Hadoop ecosystem”

Step 2: Try to install the simplest thing

Step 3

Step 4

Hint: Nobody builds their own Linux anymore

Choose a Hadoop distribution that suits you

Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own "philosophy" ...
• ... as well as a preferred Hadoop database

So what's in it for me?
• It's interesting (cool technology that hits many recent buzzwords)
• If you know Oracle, it's close to your skill set
• It's promising and future-oriented

Q&A

Please Complete Your Session Evaluation
Evaluate this session in your COLLABORATE app. Pull up this session and tap "Session Evaluation" to complete the survey.

Session ID: 557