Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL @arsenyspb [email protected] Singapore University of Technology & Design 2016-11-09

Transcript of Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Page 1: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL

@arsenyspb

[email protected]

Singapore University of Technology & Design, 2016-11-09

Page 2: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Thank You For Inviting! My special kind regards to:

Professor Meihui Zhang

Associate Director Hou Liang Seah

Industry Outreach Manager Robin Soo

Page 3: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

🤔 What am I supposed to do?..

Please raise hand if you…

…want to learn about modern data analytics ?..

…are OK if I use words like “Java” or “Command Line” or “Port”?..

…got enough kopi / teh / red bull for next 1 hour?..

…have hands-on experience with Hadoop, Spark, Hive?..

Page 4: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Shameless Self-Intro

Page 5: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


Hi, My Name Is Arseny, And I’m…

2011

Page 6: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Hadoop In A 🌰 Nutshell

Page 7: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


1998

2016

It All Started At Google

Page 8: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


2003

2004

2006

Hadoop is Google’s Tech in Open Source

2006

Page 9: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Hadoop Originates From Hyperscale Approach

However, in 2016 big data & Hadoop don’t need a hyperscale datacenter

Page 10: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Closer Look, i.e. Hortonworks Data Platform (HDP)

Hortonworks Data Platform 2.3 (architecture diagram)

GOVERNANCE & INTEGRATION
– Data Workflow, Data Lifecycle & Governance: Falcon, Atlas, Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory: Spark
– Others: ISV Engines
– Tez and Slider appear beneath several of these engines

DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System, nodes 1 … N

SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie

We will “compress” all these topics into the next hour

Page 11: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Quick demo

Page 12: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

HDFS In A 🌰 Nutshell

Hadoop Distributed File System

Page 13: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

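© 2015 Pivotal Software, Inc. All rights reserved.

Reading Data From HDFS

Components in the diagram:
– Client node, client JVM: HDFS Client, DistributedFileSystem, FSDataInputStream
– NameNode (namenode JVM)
– Three DataNodes (one datanode JVM each)

Steps:
1: open (HDFS Client → DistributedFileSystem)
2: Request file block locations (DistributedFileSystem → NameNode)
3: read (HDFS Client → FSDataInputStream)
4: read from block (FSDataInputStream → first DataNode)
5: read from block (FSDataInputStream → next DataNode)
6: close (HDFS Client → FSDataInputStream)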
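The read steps above can be sketched as a toy model in Python. This is not the Hadoop client API: the `NameNode`, `DataNode` and `read_file` names, the block ids and the file path are all made up for illustration. The only point is that the NameNode hands out block locations while the bytes come from the DataNodes.

```python
# Toy model of the HDFS read path; hypothetical classes, not the Hadoop API.
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    # file path -> ordered list of (block id, [DataNodes holding a replica])
    block_map: dict = field(default_factory=dict)

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(namenode, path):
    """Open, ask the NameNode for block locations, read each block, close."""
    locations = namenode.get_block_locations(path)  # step 2
    chunks = []
    for block_id, replicas in locations:            # steps 3-5
        datanode = replicas[0]  # assumption: simply take the first replica
        chunks.append(datanode.read_block(block_id))
    return b"".join(chunks)                         # step 6: stream is done


dn1 = DataNode("dn1", {"blk_1": b"hello "})
dn2 = DataNode("dn2", {"blk_2": b"hdfs"})
nn = NameNode({"/demo/file.txt": [("blk_1", [dn1]), ("blk_2", [dn2])]})
print(read_file(nn, "/demo/file.txt"))
```

Note that the NameNode in this sketch never touches file contents, which mirrors the split between metadata and data in the diagram.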

Page 14: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

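Writing Data to HDFS

Components in the diagram:
– Client node, client JVM: HDFS Client, DistributedFileSystem, FSDataOutputStream, DataStreamer
– NameNode (namenode JVM)
– Three DataNodes (one datanode JVM each); the diagram shows 3x replication

Steps:
1: create (HDFS Client → DistributedFileSystem)
2: create (DistributedFileSystem → NameNode)
3a: write (HDFS Client → FSDataOutputStream)
3b: Request allocation, as new blocks are required (DataStreamer → NameNode)
3c: Three data-node, data-block pairs returned
4a, 4b, 4c: write packet (client → first DataNode → second DataNode → third DataNode)
5a, 5b, 5c: ack packet (third DataNode → second DataNode → first DataNode → client)
6: close (HDFS Client → FSDataOutputStream)
7: complete (DistributedFileSystem → NameNode)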
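A minimal sketch of the write pipeline, assuming 3x replication as in the diagram. Again this is a toy model rather than Hadoop code: `DataNode`, `write_block`, the packet size and the node names are hypothetical. Each packet travels down the chain of DataNodes and the acknowledgement travels back up.

```python
# Toy model of the HDFS write pipeline; hypothetical classes, not the Hadoop API.
REPLICATION = 3


class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> list of packets received

    def write_packet(self, block_id, packet, downstream):
        """Store the packet, forward it down the pipeline, then ack upstream."""
        self.blocks.setdefault(block_id, []).append(packet)
        if downstream:
            return downstream[0].write_packet(block_id, packet, downstream[1:])
        return True  # the last DataNode in the pipeline starts the ack chain


def write_block(datanodes, block_id, data, packet_size=4):
    # In the diagram the NameNode picks the pipeline; here we take the first 3.
    pipeline = datanodes[:REPLICATION]
    for i in range(0, len(data), packet_size):
        packet = data[i:i + packet_size]
        acked = pipeline[0].write_packet(block_id, packet, pipeline[1:])
        assert acked, "packet was not acknowledged by the whole pipeline"
    return [dn.name for dn in pipeline]


nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
written_to = write_block(nodes, "blk_1", b"compressed intro")
print(written_to)
print([b"".join(dn.blocks.get("blk_1", [])) for dn in nodes])
```

The fourth node stays empty because only three replicas are requested.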

Page 15: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Quick demo

Page 16: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

YARN In A 🌰 Nutshell

Yet Another Resource Negotiator

Page 17: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Legacy SQL Is All Structured

Traditional SQL databases: structured, Schema-on-Write

row key   color   shape    timestamp
first     red     square   HH:MM:SS
second    blue    round    HH:MM:SS
...       ...

1. Create schema on file or block storage
2. Load data
3. Query data: select ROW KEY, COLOR from … where

Can’t add data before the schema is created. To change the schema, drop and reload the entire table. Dropping a TB-size table with foreign keys could take days.
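The three steps can be tried with Python's built-in `sqlite3` module, used here only as a stand-in for a traditional RDBMS. The table name `shapes`, the column names and the sample values are assumptions for illustration. The last part shows the schema-on-write constraint: a record that does not fit the schema is rejected at load time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Create the schema before any data can be loaded.
conn.execute(
    "CREATE TABLE shapes (row_key TEXT PRIMARY KEY, color TEXT, shape TEXT, ts TEXT)"
)

# 2. Load data that matches the schema.
conn.executemany(
    "INSERT INTO shapes VALUES (?, ?, ?, ?)",
    [("first", "red", "square", "10:00:00"), ("second", "blue", "round", "10:00:01")],
)

# 3. Query data.
rows = conn.execute(
    "SELECT row_key, color FROM shapes WHERE shape = ?", ("round",)
).fetchall()
print(rows)

# A record with an extra field does not fit the schema and is refused.
try:
    conn.execute(
        "INSERT INTO shapes VALUES (?, ?, ?, ?, ?)",
        ("third", "green", "oval", "10:00:02", "XXL"),
    )
    rejected = False
except sqlite3.Error:
    rejected = True
print("rejected:", rejected)
```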

Page 18: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

MapReduce In Color

Unstructured Schema-on-Read Query

Input: file.csv & other.txt

1. Load data straight from HDFS
2. Query data: map, shuffle, reduce
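The map, shuffle and reduce phases can be sketched in plain Python on a single machine. The input lines and the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration; a real job would distribute each phase across the cluster.

```python
from collections import defaultdict

# Stand-in for lines read straight from files on HDFS.
lines = ["red square", "blue round", "red round"]


def map_phase(line):
    """Emit a (key, 1) pair for every word in the line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the grouped values for one key into a single result."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```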

Page 19: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


MapReduce In Process Diagram

Page 20: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

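Starting Job – MapReduce v2.0

Components in the diagram:
– Client node, client JVM: MapReduce program, Job
– ResourceManager (JVM on the node the slide labels “Jobtracker Node”)
– Node Manager node running MRAppMaster
– Node Manager node running a child JVM: YarnChild with a Mapper or Reducer
– Shared File-System (e.g. HDFS)

Steps:
1: initiate job (MapReduce program → Job)
2: request new application (Job → ResourceManager)
3: copy job jars, config (Job → shared file system)
4: submit job (Job → ResourceManager)
5a: start container (ResourceManager → Node Manager)
5b: launch (Node Manager → MRAppMaster)
5c: initialize job (MRAppMaster)
6: determine input splits (MRAppMaster ← shared file system)
7a: allocate task resources (MRAppMaster → ResourceManager)
7b: start container (MRAppMaster → Node Manager)
8: launch (Node Manager → YarnChild)
9: retrieve job jars, data (YarnChild ← shared file system)
10: run (YarnChild → Mapper or Reducer)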

Page 21: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Quick demo

Page 22: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Hive In A 🌰 Nutshell

SQL interface to MapReduce Jobs

Page 23: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Relational DB

Relational DB and SQL were conceived to:
– Remove repeated data, replace it with tabular structure & relationships
  ▪ DRY – Don’t Repeat Yourself
  ▪ Provide efficient & robust structure for data storage
– Exploit regular structure with a declarative query language
  ▪ Structured Query Language

Page 24: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

What Hive Is…

A SQL-like processing capability based on Hadoop

Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data

Built on HQL, a SQL-like query language
– Statements run as MapReduce jobs
– Also allows MapReduce programmers to plug in custom mappers and reducers

• Works with plain text, HBase, ORC, Parquet and other formats

• Metadata is stored in a relational database (MySQL in this setup)

Page 25: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Hive Schemas

Hive is schema-on-read
– Schema is only enforced when the data is read (at query time)
– Allows greater flexibility: same data can be read using multiple schemas

Contrast with RDBMSes, which are schema-on-write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
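To make "same data, multiple schemas" concrete, here is a small sketch in plain Python rather than Hive. The raw text, the schema lists and the helper `read_with_schema` are hypothetical; the idea is only that the stored bytes never change and the schema is applied at read time.

```python
import csv
import io

# Raw data as it sits in storage; nothing was validated when it was written.
raw = "first,red,square\nsecond,blue,round\n"


def read_with_schema(text, schema):
    """Apply a schema, a list of (column name, converter), while reading."""
    for record in csv.reader(io.StringIO(text)):
        # zip stops at the shorter input, so a narrower schema ignores extra fields.
        yield {name: convert(value) for (name, convert), value in zip(schema, record)}


full_schema = [("row_key", str), ("color", str), ("shape", str)]
keys_only_schema = [("row_key", str)]

full_rows = list(read_with_schema(raw, full_schema))
key_rows = list(read_with_schema(raw, keys_only_schema))
print(full_rows)
print(key_rows)
```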

Page 26: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


Hive Architecture

Hive Metastore + MySQL

Page 27: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

What Hive Is Not…

Hive, like Hadoop, is designed for batch processing of large datasets

Not a real-time system, not fully SQL-92 compliant
– “Sibling” solutions like Tez, Impala and HAWQ offer more compliance

Latency and throughput are both high compared to a traditional RDBMS
– Even when dealing with relatively small data (<100 MB)

Page 28: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Quick demo

Page 29: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

HBase In A 🌰 Nutshell

Non-relational distributed database

Page 30: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

ACID is Business Requirement for RDBMS

Traditional DBs have excellent support for ACID transactions

– Atomic: All write operations succeed, or nothing is written

– Consistent: Integrity rules guaranteed at commit

– Isolated: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)

– Durable: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
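Atomicity and consistency can be seen in a few lines with Python's built-in `sqlite3` module, used here as a stand-in for a traditional RDBMS. The `accounts` table, the names and the amounts are invented for illustration. A transfer that would break the integrity rule is rolled back as a whole, so the credit that already succeeded is undone too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


failed = transfer(conn, "alice", "bob", 500)   # would make alice negative
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
print(failed, after_failed)
print(succeeded, after_ok)
```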

Page 31: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Scale RDBMS?..

RDBMS is a bad fit for huge-scale, online applications
– How to do Sharding?..
– Unlimited but Scaling up?..
– Maybe give up on Joins for latency and do Master-Slave?..

Big Data describes the problem; Not Only SQL defines the general approach to the solution:
– Emphasis on scale, distributed processing, use of commodity hardware

Page 32: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Business Needs for “Not Only SQL”

Not Only SQL DBs evolved from web-scale use-cases

– Google, Amazon, Facebook, Twitter, Yahoo, …
  ▪ “Google Cache” = Entire page saved into a cell of a BigTable database
  ▪ Columnar layout preferred
  ▪ Filters that reduce disk lookups for non-existent rows or columns increase the performance of a database query operation

– Requirement for massive scale, relational fits badly
  ▪ Queries relatively simple
  ▪ Direct interaction with online customers

– Cost-effective, dynamic horizontal scaling required
  ▪ Many nodes based on inexpensive (commodity) hardware
  ▪ Must manage frequent node failures & addition of nodes at any time

Page 33: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

🤔 But how to build such DB?..

Page 34: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Reminder: The CAP Theorem (2 not 3)

Consistency: “Once a writer has written, all readers will see that write”
– Single Version of Truth?

Availability: “System is Available to serve 100% of requests and complete them successfully.”
– No SPOF?..

Partition tolerance: “A system can continue to operate in the presence of network partitions”
– Replicas?..

Page 35: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Eventually Consistent vs. ACID

An artificial acronym you may see is BASE

– Basically Available
  ▪ System seems to work all the time

– Soft State
  ▪ Not wholly consistent all the time, but…

– Eventual Consistency
  ▪ After a period with no updates, a given dataset will be consistent

Resulting systems characterized as “eventually consistent”
– Overbooking an airline or hotel and passing risk to customer
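A toy illustration of eventual consistency, with made-up names (`EventuallyConsistentStore`, `propagate`) and no relation to any particular database. A write lands on one replica immediately and reaches the others only when the pending updates are propagated, so a read in between can return stale data.

```python
class EventuallyConsistentStore:
    """Toy replicated key-value store with delayed replication."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) not yet applied

    def write(self, key, value):
        # The write is accepted by replica 0 right away (basically available)...
        self.replicas[0][key] = value
        # ...and queued for the other replicas (soft state).
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def propagate(self):
        # With no further updates, every replica converges.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("seat-12A", "booked")
stale_read = store.read("seat-12A", replica=2)   # replica 2 has not heard yet
store.propagate()
fresh_read = store.read("seat-12A", replica=2)
print(stale_read, fresh_read)
```

The stale read is the window in which the overbooking example from the slide can happen.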

Page 36: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Non-relational distributed database

HBase is a database: it has a schema, but it’s non-relational

row key   column family “color”                         column family “shape”
first     “red”: #F00, “blue”: #00F, “yellow”: #F0F     “square”:
second                                                  “round”:, “size”: XXL

1.) Create column families
2.) Load data; multiples of rows form region files on HDFS
3.) Query data

hbase> get '<table>', 'first', 'color:yellow'
COLUMN    CELL
 yellow   timestamp=1295774833226, value=#F0F

hbase> get '<table>', 'second', 'shape:size'
COLUMN    CELL
 size     timestamp=1295723467122, value=XXL
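The data model on this slide can be approximated with nested dictionaries: row key, then column family, then column qualifier, then a timestamped cell. This is only a sketch of the model, not the HBase client API; the `put` and `get` helpers are hypothetical, and the timestamp used for the "red" cell is a placeholder because the slide does not give one.

```python
# Toy model of the HBase data model; not the HBase API.
table = {}


def put(row_key, family, qualifier, value, timestamp):
    """Store one cell under row key -> column family -> qualifier."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = (timestamp, value)


def get(row_key, family, qualifier):
    """Return (timestamp, value) for a cell, or None if the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


# Cells whose timestamps appear in the shell output on the slide.
put("first", "color", "yellow", "#F0F", 1295774833226)
put("second", "shape", "size", "XXL", 1295723467122)
# Placeholder timestamp: the slide shows no timestamp for this cell.
put("first", "color", "red", "#F00", 1)

print(get("first", "color", "yellow"))
print(get("second", "shape", "size"))
print(get("second", "color", "yellow"))  # rows are sparse: no such cell
```

Rows do not need the same columns, which is why the second row can have a "size" qualifier that the first row lacks.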

Page 37: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Column Oriented Storage

Page 38: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

HBase Architecture, Read & Write (diagram)

Clients:
– HBase API Client
– SQL JDBC Client, via Apache Phoenix HBase Client
– SQL ODBC Client, via Pivotal HAWQ PXF HBase Client

Zookeeper

Region Server: Write-Ahead Log (WAL), Memstore, HFile

(1) Put/Delete
(2.0) Write to WAL
(2.1) Write to MemStore
(3) Flush to HDFS
(4) Get/Scan Read Request

Diagram annotations:
– Client RAM Pre-Fetch
– Sequential HDFS Write & L2 Read
– Adaptive Pre Fetch & L2 Reads, Sequential Writes
– Memstore = Eventual Consistency
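The numbered write and read path can be sketched as a toy region server. Everything here is an assumption made for illustration: the `RegionServer` class, the flush threshold of two rows, and plain Python lists and dicts standing in for the WAL and for HFiles on HDFS.

```python
class RegionServer:
    """Toy write/read path: WAL first, then MemStore, flushed to 'HFiles'."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log, written before the MemStore
        self.memstore = {}   # recent writes held in memory
        self.hfiles = []     # flushed, sorted snapshots (stand-in for HDFS files)
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # (2.0) write to WAL
        self.memstore[row] = value      # (2.1) write to MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # (3) write the MemStore contents out as one sorted file
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        # (4) newest data first: MemStore, then files from newest to oldest
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            if row in hfile:
                return hfile[row]
        return None


server = RegionServer()
server.put("a", 1)
server.put("b", 2)   # reaches the threshold and triggers a flush
server.put("a", 3)   # newer value for "a" stays in the MemStore
print(server.get("a"), server.get("b"), server.get("missing"))
print(len(server.wal), len(server.hfiles))
```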

Page 39: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


HBase namespace layout

Page 40: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL


From “Hbase Definitive Guide”

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

Compression (HBase and others)

Page 41: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Q&A?..

http://bit.ly/isilonhbase

@arsenyspb

Page 42: Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL