Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem


Zohar Elkayam

www.realdbamagic.com

Twitter: @realmgic

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem

Who am I?

• Zohar Elkayam, CTO at Brillix

• DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years

• Oracle ACE Associate

• Involved with Big Data projects since 2011

• Blogger – www.realdbamagic.com and www.ilDBA.co.il


About Brillix

• Brillix is a leading company specializing in Data Management

• We provide professional services, training, and consulting for Databases, Security, NoSQL, and Big Data solutions

• We also run the Brillix Big Data Experience Center


Agenda

• What is the Big Data challenge?

• A Big Data solution: Hadoop

• HDFS

• MapReduce and YARN

• Hadoop ecosystem: HBase, Sqoop, Hive, Pig, and other tools

• Where does the DBA fit in?


The Challenge


The Big Data Challenge


Volume

• Big Data comes in one size: big.

• Size is measured in terabytes (10¹²), petabytes (10¹⁵), exabytes (10¹⁸), and zettabytes (10²¹)

• Storing and handling the data becomes an issue

• Producing value out of the data in a reasonable time is an issue


Variety

• Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio, and video

• A wide variety of rapidly evolving data types requires highly flexible storage and handling


Un-Structured        | Structured
---------------------|----------------------
Objects              | Tables
Flexible             | Columns and Rows
Structure Unknown    | Predefined Structure
Textual and Binary   | Mostly Textual

Velocity

• The speed at which the data is generated and collected

• Streaming data and large-volume data movement

• High velocity of data capture requires rapid ingestion

• Might cause a backlog problem


Okay, So What Defines Big Data?

• When the data is too big or moves too fast to handle in a sensible amount of time

• When the data doesn’t fit any conventional database structure

• When the solution to the business need becomes part of the problem

• When we think that we can still produce value from that data and want to handle it


Value

Big Data is not about the size of the data; it's about the value within the data


How to do Big Data


Big Data in Practice

• Big Data is big: technological infrastructure solutions are needed

• Big Data is complicated:

• We need developers to manage the handling of the data

• We need DevOps engineers to manage the clusters

• We need data analysts and data scientists to produce value


Infrastructure Challenges

• Infrastructure that is built for:

• Large scale

• Distributed / scaled-out workloads

• Data-intensive jobs that spread the problem across clusters of server nodes


Infrastructure Challenges (cont.)

• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data

• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing

• Security capabilities that protect a highly distributed infrastructure and its data


A Big Data Solution: Apache Hadoop


Apache Hadoop

• An open-source project run by the Apache Software Foundation (since 2006)

• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure

• It has been the driving force behind the growth of the Big Data industry

• Get the public release from:

• http://hadoop.apache.org/core/
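A minimal getting-started sketch; the version number and mirror path are illustrative and change over time:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar -xzf hadoop-3.3.6.tar.gz
$ cd hadoop-3.3.6
$ bin/hadoop version    # verify the unpacked release runs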


Original Hadoop Components

• HDFS: Hadoop Distributed File System – a distributed file system that runs in a clustered environment

• MapReduce – a programming paradigm for running processes over clustered environments.

Main idea: bring the program to the data


Hadoop Benefits

• Reliable solution based on unreliable hardware

• Designed for large files

• Load data first, structure later

• Designed to maximize throughput of large scans

• Designed to leverage parallelism

• Designed for scale out

• Flexible development platform

• Solution Ecosystem


What Hadoop Is Not?

• Hadoop is not a database – it is not a replacement for a data warehouse or for other relational databases

• Hadoop is not for OLTP/real-time systems

• Very good for large amounts of data, not so much for smaller sets

• Designed for clusters – there is no Hadoop monster server (single server)


Hadoop Limitations

• Hadoop is scalable but it’s not fast

• Some assembly is required

• Batteries are not included

• DIY mindset

• Open source limitations apply

• Technology is changing very rapidly


Hadoop under the Hood


Original Hadoop 1.0 Components

• HDFS: Hadoop Distributed File System – a distributed file system that runs in a clustered environment

• MapReduce – a programming paradigm for running processes over clustered environments


Hadoop 2.0

• Hadoop 2.0 changed the Hadoop architecture and introduced better resource management and speed:

• Hadoop Common

• HDFS

• YARN

• Multiple data processing frameworks, including MapReduce, Spark, and others


HDFS is...

• A distributed file system

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures and stay resilient

• Intended for large files

• Designed for batch inserts


Files and Blocks

• Files are split into blocks (single unit of storage)

• Managed by Namenode and stored on Datanodes

• Transparent to users

• Replicated across machines at load time

• Same block is stored on multiple machines

• Good for fault-tolerance and access

• Default replication factor is 3
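A sketch of what this looks like in practice (file and directory names are illustrative): load a file, then inspect how it was split into blocks and where the replicas landed.

$ hdfs dfs -mkdir -p /user/demo
$ hdfs dfs -put bigfile.csv /user/demo/
$ hdfs fsck /user/demo/bigfile.csv -files -blocks -locations   # show blocks and their Datanodes
$ hdfs dfs -setrep -w 2 /user/demo/bigfile.csv                 # change the replication factor for one file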


HDFS Node Types

HDFS has three types of nodes:

• Datanodes

• Responsible for the actual file storage

• Serve file data to clients

• Namenode (MasterNode)

• Distributes files across the cluster

• Responsible for replication between the Datanodes and for tracking file block locations

• BackupNode

• A backup node for the Namenode


HDFS is Good for...

• Storing large files

• Terabytes, Petabytes, etc...

• Millions rather than billions of files

• 128MB or more per file

• Streaming data

• Write-once, read-many access patterns

• Optimized for streaming reads rather than random reads


HDFS is Not So Good For...

• Low-latency reads / real-time applications

• High throughput rather than low latency for small chunks of data

• HBase addresses this issue

• Large amounts of small files

• Better for millions of large files instead of billions of small files

• Multiple writers

• Single writer per file

• Writes go at the end of the file; no support for writes at arbitrary offsets


MapReduce is...

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• Bring the code to the data, not the data to the code


The MapReduce Paradigm

• Imposes key-value input/output

• We implement two functions:

• MAP – takes a large problem, divides it into sub-problems, and performs the same function on each sub-problem

Map(k1, v1) -> list(k2, v2)

• REDUCE – combines the output from all sub-problems (each key goes to the same reducer)

Reduce(k2, list(v2)) -> list(v3)

• Framework handles everything else (almost)
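For intuition, here is word count sketched first as a plain Unix pipeline (map | shuffle/sort | reduce), then submitted to a cluster with Hadoop Streaming; file paths and the streaming jar location are illustrative and vary by version:

# Local analogy of the paradigm: tr is the mapper, sort is the shuffle, uniq -c is the reducer
$ cat input.txt | tr -s ' ' '\n' | sort | uniq -c

# The same idea on a cluster with Hadoop Streaming
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/demo/input -output /user/demo/wordcount \
    -mapper 'tr -s " " "\n"' \
    -reducer 'uniq -c'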


Divide and Conquer


MapReduce is Good For...

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset


MapReduce is NOT Good For...

• Jobs that need shared state or coordination

• Tasks are shared-nothing

• Shared state requires a scalable state store

• Low-latency jobs

• Jobs on smaller datasets

• Finding individual records


YARN

• Takes care of distributed processing and coordination

• Scheduling

• Jobs are broken down into smaller chunks called tasks

• These tasks are scheduled to run on data nodes

• Task Localization with Data

• Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task

• Code is moved to where the data is


YARN

• Error Handling

• Failures are an expected behavior, so tasks are automatically retried on other machines

• Data Synchronization

• The Shuffle and Sort barrier re-arranges and moves data between machines

• Input and output are coordinated by the framework
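A sketch of the same flow from the command line: submitting one of the example jobs that ship with Hadoop and watching YARN run it (the jar path and the application id are illustrative):

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/demo/input /user/demo/output
$ yarn application -list                                      # running applications and their state
$ yarn logs -applicationId application_1500000000000_0001     # aggregated task logs (id is illustrative)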


Extending Hadoop

Improving Hadoop

• Core Hadoop is complicated, so tools and solution frameworks were added to make things easier

• There are over 80 different Apache projects for Big Data solutions that use Hadoop

• Hadoop distributions collect some of these tools and release them as a complete package

• Cloudera

• HortonWorks

• MapR

• Amazon EMR


Common Hadoop 2.0 Technology Ecosystem


Databases and DB Connectivity

• HBase: a NoSQL key/value, wide-column datastore that is native to HDFS

• Sqoop: a tool designed to import data from relational databases into HDFS, HBase, or Hive, and to export it back out


HBase

• HBase is the closest thing we had to a database in the early Hadoop days

• A distributed key/value, wide-column oriented database built on top of HDFS, providing BigTable-like capabilities

• Does not have a query language: only get, put, and scan commands (see the shell sketch below)

• Usually compared with Cassandra (a non-Hadoop-native Apache project)
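A minimal sketch of that command set in the HBase shell (table, column family, and values are made up for illustration):

$ hbase shell
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Zohar'   # write a cell
get 'users', 'row1'                         # read a row by key
scan 'users', {LIMIT => 10}                 # range scan over row keys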


When do we use HBase?

• Huge volumes of randomly accessed data

• HBase is at its best when it is accessed in a distributed fashion by many clients (high consistency)

• Consider HBase when loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn't conform well to a schema


When NOT to use HBase

• HBase doesn't use SQL, doesn't have an optimizer, and doesn't support transactions or joins

• HBase doesn't have data types

• See the Apache Phoenix project for a better data structure and query language when using HBase


Sqoop2

• Sqoop is a command-line tool for moving data from an RDBMS to Hadoop

• Uses MapReduce jobs or Hive to load the data

• Can also export data from Hadoop back to an RDBMS

• Comes with connectors for MySQL, PostgreSQL, Oracle, SQL Server, and DB2

• Example:

$ bin/sqoop import \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import

$ bin/sqoop export \
    --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemData
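Since the audience is Oracle DBAs, here is the same kind of import sketched against Oracle (host, service name, credentials, and table are illustrative; the Oracle JDBC driver must be available to Sqoop):

$ bin/sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott --password tiger \
    --table EMP --hive-import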


Improving MapReduce Programmability

• Pig: a programming language that simplifies Hadoop actions: loading, transforming, and sorting data

• Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax


Pig

• Pig is an abstraction on top of Hadoop

• Provides a high-level programming language (Pig Latin) designed for data processing

• Scripts are converted into MapReduce code and executed on the Hadoop cluster

• Makes ETL processing and other simple MapReduce jobs easier without writing MapReduce code (see the sketch below)

• Often replaced by more up-to-date tools like Apache Spark
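A small illustration of the kind of script Pig turns into MapReduce jobs: word count in Pig Latin, run from the command line (input and output paths are illustrative):

$ pig -e "
    lines  = LOAD '/user/demo/input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group, COUNT(words);
    STORE counts INTO '/user/demo/pig_wordcount';
"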


Hive

• Data warehousing solution built on top of Hadoop

• Provides SQL-like query language named HiveQL

• Minimal learning curve for people with SQL expertise

• Data analysts are target audience

• Early Hive development work started at Facebook in 2007


Hive Provides

• Ability to bring structure to various data formats

• A simple interface for ad hoc querying, analyzing, and summarizing large amounts of data

• Access to files on various data stores such as HDFS and HBase

• Also see: Apache Impala (mainly in Cloudera)
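A sketch of what "bringing structure" means in practice: an external table declared over files already sitting in HDFS, then queried with familiar SQL (table, columns, and path are illustrative):

$ hive -e "
    CREATE EXTERNAL TABLE access_log (ip STRING, ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/demo/logs';

    SELECT url, COUNT(*) AS hits
    FROM access_log
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;
"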


Improving Hadoop – More useful tools

• For improving coordination: ZooKeeper

• For improving log collection: Flume

• For improving scheduling/orchestration: Oozie

• For improving UI: Hue/Ambari


ZooKeeper

• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

• It allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organized similarly to a standard file system

• ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions
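A minimal sketch using the CLI that ships with ZooKeeper (server address and znode names are illustrative):

$ zkCli.sh -server zk1:2181   # connect to a ZooKeeper server
create /app1 ""               # znodes form a file-system-like tree
create /app1/config "v1"      # store a small piece of shared state
get /app1/config              # returns "v1"
ls /app1                      # lists child znodes: [config]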


Flume

• Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS

• Flume does for files what Sqoop does for RDBMS data

• Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper
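A hedged sketch of a single-agent configuration that tails a log file into HDFS (the agent name, file paths, and NameNode address are all illustrative):

# agent.conf
a1.sources  = src1
a1.channels = ch1
a1.sinks    = sink1

a1.sources.src1.type     = exec
a1.sources.src1.command  = tail -F /var/log/app.log
a1.sources.src1.channels = ch1

a1.channels.ch1.type = memory

a1.sinks.sink1.type      = hdfs
a1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
a1.sinks.sink1.channel   = ch1

Run the agent with:

$ flume-ng agent --conf conf --conf-file agent.conf --name a1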


Is Hadoop the Only Big Data Solution?

• No – there are other solutions:

• Apache Spark and Apache Mesos frameworks

• NoSQL systems (Apache Cassandra, Couchbase, MongoDB, and many others)

• Stream analysis (Apache Kafka, Apache Flink)

• Machine learning (Apache Mahout, Spark MLlib)

• Some can be integrated with Hadoop, but some are independent


Where Does the DBA Fit In?

• Big Data solutions are not databases. Databases are probably not going to disappear, but we feel the change even today: DBAs must be ready for it

• DBAs are the perfect candidates to transition into Big Data experts:

• They have system (OS, disk, memory, hardware) experience

• They can understand data easily

• DBAs are used to working with developers and other data users


What Do DBAs Need Now?

• DBAs will need to know more programming: Java, Scala, Python, R, or any other popular language in the Big Data world will do

• DBAs need to understand the position shifts and the introduction of DevOps, Data Scientists, CDOs, etc.

• Big Data is changing daily: we need to learn, read, and be involved before we are left behind…


Q&A


Summary

• Big Data is here – it's complicated, and the RDBMS alone no longer fits

• Big Data solutions are evolving; Hadoop is one example of such a solution

• Spark is a very popular Big Data solution

• DBAs need to be ready for the change: Big Data solutions are not databases, and we must make ourselves ready


Thank You

Zohar Elkayam

twitter: @realmgic

[email protected]

www.realdbamagic.com
