Hadoop databases for Oracle DBAs

Hadoop databases: Hive, Impala, Spark, Presto
For Oracle DBAs

Session ID: 557

Prepared by: Maxym Kharchenko, Gluent
@maxymkh

April 2-6, 2017 in Las Vegas, NV USA #C17LV

Whoami
• Database kernel developer -> Oracle DBA -> Database Hadoop/Cloud developer
• Worked with Oracle for the last 15 years
• OCM, Oracle ACE alumnus, Amazon alumnus
• Last year: OLTP -> Hadoop

Shameless plug about my company

(diagram: Gluent links existing systems - Oracle, Teradata, MSSQL, NoSQL, App X, App Y, App Z - with Big Data sources)

Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. Oracle (hopefully, with a demo)
• Best ways to start

What is Hadoop:
• For "Big data"
• Can deal with "unstructured" data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs ("NoSQL")

Yes, but what does it all mean?

Imagine that you are Google in the early 2000s

Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use

Let’s build a Data Warehouse

(Traditional) Data Warehouse
• Been around for years
• Mature and (relatively) advanced
• SQL !!!

Data Warehouse scorecard
Requirements (vs. RDBMS):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data ¯\_(ツ)_/¯

Scaling up: "Big data" ain't cheap
• Can't fit all of the data on a single box
• Cost is quickly getting out of hand

(Cheap) commodity systems make "big data" feasible

Solution = commodity systems

(diagram: one $$$$$ server = many $$ commodity boxes)

Commodity systems scorecard
Requirements (vs. commodity systems):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data

All your queries are Java Classes

Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper

Hadoop
• 2006: Hadoop

”Traditional Data Warehouse” vs. Hadoop

Requirements (Hadoop vs. Data Warehouse):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data ¯\_(ツ)_/¯

SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!

Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!

And then, it exploded …

"Hadoop" vs. "Relational" databases
Demo ... hopefully

This is not about NoSQL :-)

Same: Tables

sql> describe sh.products;
+-----------------------+----------------+---------+
| name                  | type           | comment |
+-----------------------+----------------+---------+
| prod_id               | bigint         |         |
| prod_name             | string         |         |
| prod_desc             | string         |         |
| prod_category_id      | bigint         |         |
| prod_category_desc    | string         |         |
| supplier_id           | bigint         |         |
| prod_total_id         | decimal(38,18) |         |
| prod_src_id           | decimal(38,18) |         |
| prod_eff_from         | timestamp      |         |
| prod_eff_to           | timestamp      |         |
| prod_valid            | string         |         |
+-----------------------+----------------+---------+
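A table like this is typically just metadata layered over files in HDFS or S3. A minimal sketch of how it might be declared (the Parquet format and the file location are assumptions for illustration, not taken from the demo environment):

CREATE EXTERNAL TABLE sh.products (
  prod_id             BIGINT,
  prod_name           STRING,
  prod_desc           STRING,
  prod_category_id    BIGINT,
  prod_category_desc  STRING,
  supplier_id         BIGINT,
  prod_total_id       DECIMAL(38,18),
  prod_src_id         DECIMAL(38,18),
  prod_eff_from       TIMESTAMP,
  prod_eff_to         TIMESTAMP,
  prod_valid          STRING
)
STORED AS PARQUET                        -- open, columnar file format
LOCATION 'hdfs://clust1/sh/products';    -- the engine does not "own" these files

Dropping an external table drops only the metadata; the data files stay where they are.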

Same: Running SQL queries

sql> select prod_id, count(1)
     from   sh.sales s, sh.channels c
     where  c.channel_id = s.channel_id
     and    c.channel_desc = 'Catalog'
     group by prod_id
     order by 2 desc
     limit 5;

+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s

Same: Queries are optimized

sql> explain select count(1) from sh.times;
+-----------------------------------------------------------+
| Explain String                                             |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1   |
|                                                            |
| 03:AGGREGATE [FINALIZE]                                    |
| |  output: count:merge(1)                                  |
| |                                                          |
| 02:EXCHANGE [UNPARTITIONED]                                |
| |                                                          |
| 01:AGGREGATE                                               |
| |  output: count(1)                                        |
| |                                                          |
| 00:SCAN HDFS [sh.times]                                    |
|    partitions=16/16 files=32 size=500.45KB                 |
+-----------------------------------------------------------+

Different: What gets optimized
• No "regular" indexes
• But many operations are distributed (see the partitioning sketch below)

(diagram: SALES 1 + TIMES 1, SALES 2 + TIMES 2, SALES 3 + TIMES 3 co-located on separate nodes)
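Instead of indexes, partitioning (plus per-file min/max statistics) limits how much data gets scanned. A hedged sketch, reusing the time_id partition key that appears in the partition listing later on (the sh.sales_part table name is purely illustrative):

CREATE TABLE sh.sales_part (
  prod_id    BIGINT,
  channel_id BIGINT,
  amount     DECIMAL(10,2)
)
PARTITIONED BY (time_id STRING)    -- the partition column plays the role of an index
STORED AS PARQUET;

-- only the files under time_id=2011-07 are read; all other partitions are pruned
SELECT count(1)
FROM   sh.sales_part
WHERE  time_id = '2011-07';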

Different: Native cloud filesystem support

sql> show partition sh.sales;

s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
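Mixing HDFS and S3 locations in one table works because each partition records its own location in the metastore. A hedged sketch of pointing a new partition at S3 (the 2012-01 value is purely illustrative):

ALTER TABLE sh.sales
  ADD PARTITION (time_id = '2012-01')
  LOCATION 's3a://bucket1/sh/sales/time_id=2012-01';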

Database engine does NOT ”own” data

Different: Different engines can work with the same data files (even at the same time)

(diagram: Oracle datafiles such as example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf belong to a single instance; Parquet files such as a01_data.parq ... a06_data.parq are just files that any engine can read)
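For example, the table registered in the metastore and a Spark SQL session can read the very same Parquet files at the same time. A hedged sketch (the path is reused from the partition listing above):

-- Impala or Hive reads the files through the table defined in the metastore
SELECT count(1) FROM sh.sales WHERE time_id = '2011-09';

-- Spark SQL can read the same Parquet files directly, with no extra copy or DDL
SELECT count(1) FROM parquet.`hdfs://clust1/sh/sales/time_id=2011-09`;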

Different: ... or copies of the data files

(diagram: the same set of files, a.parq through f.parq, copied across hdfs://prod/, hdfs://adhoc/ and s3://backup/)

Different: Open data formats
• Not proprietary - many tools can read/write them
• No additional $$ for "advanced features" (see the sketch below):
  • Columnar storage
  • Storage indexes
  • Compression
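These "advanced features" come from the file format itself rather than from a licensed option. A hedged sketch of writing a compressed, columnar copy of a table from Impala (the target table name is an assumption):

-- choose the Parquet compression codec for files written by this session
SET COMPRESSION_CODEC=snappy;

-- a plain CTAS then produces columnar Parquet files that carry their own
-- per-block min/max "storage index" statistics
CREATE TABLE sh.sales_parquet
STORED AS PARQUET
AS SELECT * FROM sh.sales;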

Same: "sqlplus-like" clients

> impala-shell -i 10.0.0.1

[10.0.0.1:21000] > select prod_id, count(1)
                   from sh.sales
                   group by prod_id
                   order by 2 desc
                   limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+

> beeline -u 'jdbc:hive2://10.0.0.1:10000'

0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
                            from sh.sales
                            group by prod_id
                            order by 2 desc
                            limit 1;

Different: External dictionary

(diagram: in Oracle, user data and the dictionary (SYS) live together inside the database; in Hadoop, user data lives in plain files while the dictionary lives in an external Hive Metastore shared by every engine)
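The metastore holds what Oracle keeps in SYS: column definitions, file locations, formats and statistics, and every engine attached to the same metastore sees the same picture. One way to peek at it from any SQL client:

-- DESCRIBE FORMATTED (Hive and Impala) dumps the metastore's view of a table:
-- data file location, file format / SerDe, table type, statistics
DESCRIBE FORMATTED sh.sales;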

Different: Append-only, "ETL-like" DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read

UPDATE t SET a=12 WHERE b=1;

(diagram: before the update, table T has a single base file a_data.orc; the update adds a delta file b_data.orc next to the base; once the compactor runs, base and delta are merged into a new base file c_data.orc - see the sketch below)
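On Hive, this base/delta/compaction behavior needs an ORC table flagged as transactional. A minimal sketch, assuming a cluster with ACID support enabled (the table and columns are taken from the UPDATE above):

-- ACID tables must be stored as ORC; older Hive versions also require bucketing
CREATE TABLE t (
  a INT,
  b INT
)
CLUSTERED BY (b) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE t SET a = 12 WHERE b = 1;   -- writes a delta file next to the base file

ALTER TABLE t COMPACT 'major';     -- merges base + deltas into a new base file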

Databases

Apache Hive
• "Designed" for "batch" queries (*)
• Runs on top of the standard Hadoop RM: YARN
• Supports multiple "engines": MR, TEZ, Spark (see the sketch below)
• SerDes

(diagram: the Master node runs Hiveserver2, the YARN RM and the HDFS namenode; each Slave node runs a YARN NM and an HDFS datanode)
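The execution engine is picked per session, and the same SQL compiles to jobs for whichever engine is selected. A hedged sketch from beeline (which engines are actually available depends on how the cluster was built):

SET hive.execution.engine=tez;   -- or mr, or spark

SELECT prod_id, count(1)
FROM   sh.sales
GROUP BY prod_id;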

Apache Impala
• Designed for "quick interactive" queries
• "Data-local" execution
• In-memory processing

(diagram: each Slave node runs an impalad next to the HDFS datanode; the Master node runs statestored, catalogd and the HDFS namenode)
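Because Impala keeps its own cached copy of the metastore and leans heavily on statistics, two statements come up constantly in day-to-day use; shown here as a hedged sketch:

-- refresh Impala's cached catalog after another engine (e.g. Hive) changes a table
INVALIDATE METADATA sh.sales;

-- gather table and column statistics so the planner can pick sensible join plans
COMPUTE STATS sh.sales;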

Apache Spark
• "Better Hadoop" with "native" SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: "native" (standalone), YARN, Mesos
• Flexible programming model

(diagram: a Driver on the Master node coordinates Executors running on each Slave node)
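When Spark is pointed at the same Hive Metastore, the warehouse tables are immediately queryable; the spark-sql shell is the quickest way to try it. A hedged sketch, assuming Spark was built with Hive support and runs on YARN:

> spark-sql --master yarn

spark-sql> select prod_id, count(1)
         > from sh.sales
         > group by prod_id
         > order by 2 desc
         > limit 1;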

Presto
• Designed for "interactive" queries
• In-memory processing
• Custom storage "plugins": Hive, Kafka, MySQL, Postgres, ... (see the sketch below)

(diagram: a coordinator on the Master node distributes work to worker processes on each Slave node)
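The storage plugins are what make Presto interesting next to Hive and Impala: a single query can join tables that live in different systems. A hedged sketch with the presto CLI (the mysql catalog and the crm.products table are assumptions):

> presto --server 10.0.0.1:8080 --catalog hive --schema sh

presto> select s.prod_id, count(1)
     -> from sales s
     ->      join mysql.crm.products p on p.prod_id = s.prod_id
     -> group by s.prod_id
     -> order by 2 desc
     -> limit 1;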

How to start

Step 1: Google “Hadoop ecosystem”

Step 2: Try to install the simplest thing

Step 3

Step 4

Hint: Nobody builds their own Linux anymore

Choose a Hadoop distribution that suits you

Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own "philosophy" ...
• ... as well as a preferred Hadoop database

So what's in it for me?
• It's interesting (cool technology that hits many recent buzzwords)
• If you know Oracle, it's close to your skill set
• It's promising and future-oriented

Q&A

Please Complete Your Session Evaluation
Evaluate this session in your COLLABORATE app. Pull up this session and tap "Session Evaluation" to complete the survey.

Session ID: 557