The State of BigData - meetup bigdata @ovh

@PennAnData

Transcript of The State of BigData - meetup bigdata @ovh

Page 1: The State of BigData  -  meetup bigdata @ovh

@PennAnData

Page 2: The State of BigData  -  meetup bigdata @ovh

The State of Big Data 2016

Page 3: The State of BigData  -  meetup bigdata @ovh

Summary

1 Data Facts

2 Hadoop Basics

3 Beyond Batch : Streaming

4 Columnar Storage

5 Ecosystem

Page 4: The State of BigData  -  meetup bigdata @ovh

Big Data Facts

PART 1

Page 5: The State of BigData  -  meetup bigdata @ovh

3V's

Volume Velocity Variety

Page 6: The State of BigData  -  meetup bigdata @ovh

Volume ...

Data Production constantly growing

Data Retention widely increasing

Extract Value from your data

Storage Cost decreasing

Page 7: The State of BigData  -  meetup bigdata @ovh

Velocity ...

Data produced Faster

Get Real Time insight

Move from capture to Analysis

Get Actionable insight

Page 8: The State of BigData  -  meetup bigdata @ovh

Variety ...

Not only Structured Data

Toward mostly Unstructured

Text (articles, comments, tweets,...)

Images (id cards, bills,...)

Logs, metrics,...

Page 9: The State of BigData  -  meetup bigdata @ovh
Page 10: The State of BigData  -  meetup bigdata @ovh

Seek Time

• 5-10 ms

• up to 200 moves/s
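
As a rough sketch of where these figures come from, assuming every random access costs one full seek and nothing else (the function name is made up for illustration), the number of head moves per second is the inverse of the seek time:

```python
def seeks_per_second(seek_time_ms: float) -> float:
    """Upper bound on random accesses per second for a given seek time."""
    return 1000.0 / seek_time_ms


for ms in (5, 10):
    print(f"{ms} ms per seek -> {seeks_per_second(ms):.0f} moves/s")
```

With a 5 ms seek the disk manages 200 moves per second; at 10 ms it drops to 100.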

Page 11: The State of BigData  -  meetup bigdata @ovh

Data Transfer Rate

Link speed        100 Mbps      1000 Mbps     10000 Mbps
Throughput        12.5 MB/s     125 MB/s      1250 MB/s

1 MB              80 ms         8 ms          0.8 ms
1 CD (700 MB)     56 s          5.6 s         0.56 s
1 GB (1000 MB)    1 min 20 s    8 s           0.8 s
1 DVD (4700 MB)   6 min 16 s    37.6 s        3.76 s
1 TB (1000 GB)    22 h 13 min   2 h 13 min    13 min
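
The transfer times above can be recomputed with a small sketch, assuming 8 bits per byte, decimal units (1 GB = 1000 MB) and no protocol overhead. The function name is illustrative only:

```python
def transfer_seconds(size_mb: float, link_mbps: float) -> float:
    """Seconds needed to move size_mb megabytes over a link of link_mbps megabits/s."""
    return size_mb * 8 / link_mbps


# (label, size in MB) for the payloads shown on the slide
payloads = [("1 MB", 1), ("1 CD", 700), ("1 GB", 1000), ("1 DVD", 4700), ("1 TB", 1_000_000)]

for label, size_mb in payloads:
    times = [transfer_seconds(size_mb, mbps) for mbps in (100, 1000, 10000)]
    print(label, ["%.2f s" % t for t in times])
```

For example, 1 TB over a 100 Mbps link takes 80,000 s, which is the 22 h 13 min of the slide.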

Page 12: The State of BigData  -  meetup bigdata @ovh

Data Transfer Rate

Link speed   100 Mbps    1000 Mbps   10000 Mbps
Throughput   12.5 MB/s   125 MB/s    1250 MB/s

1 min        750 MB      7.5 GB      75 GB
15 min       11 GB       112 GB      1 TB
1 hour       45 GB       450 GB      4.5 TB
1 day        1 TB        10.8 TB     108 TB
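
Read the other way round, the same arithmetic gives the volume a saturated link can carry in a given time. This is a minimal sketch under the same assumptions (8 bits per byte, decimal units, no overhead); the slide rounds the results:

```python
def volume_gb(duration_s: float, link_mbps: float) -> float:
    """Gigabytes carried in duration_s seconds by a link of link_mbps megabits/s."""
    return duration_s * link_mbps / 8 / 1000


durations = [("1 min", 60), ("15 min", 900), ("1 hour", 3600), ("1 day", 86400)]

for label, seconds in durations:
    print(label, [volume_gb(seconds, mbps) for mbps in (100, 1000, 10000)], "GB")
```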

Page 13: The State of BigData  -  meetup bigdata @ovh
Page 14: The State of BigData  -  meetup bigdata @ovh

Payload

Page 15: The State of BigData  -  meetup bigdata @ovh

Definition

“Big Data really is about having insights and making an impact on your business. If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data, you don’t have Big Data.”

#BigData

Page 16: The State of BigData  -  meetup bigdata @ovh
Page 17: The State of BigData  -  meetup bigdata @ovh

Introducing Hadoop

PART 2

Page 18: The State of BigData  -  meetup bigdata @ovh
Page 19: The State of BigData  -  meetup bigdata @ovh

#DougCutting

Page 20: The State of BigData  -  meetup bigdata @ovh

#Tools

Page 21: The State of BigData  -  meetup bigdata @ovh

Timeline

Page 22: The State of BigData  -  meetup bigdata @ovh

#HDFS

Page 23: The State of BigData  -  meetup bigdata @ovh

#Blocks

HDFS

Page 24: The State of BigData  -  meetup bigdata @ovh

/ HDFS

File

Blocks

DataNodes

Page 25: The State of BigData  -  meetup bigdata @ovh

File

Blocks

DataNodes

/ HDFS / Replication
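
The File / Blocks / DataNodes relationship can be sketched as follows. This is a toy model, not HDFS code: the 128 MB block size and the replication factor of 3 are assumed common defaults, and the round-robin placement is a simplification that ignores the rack-aware policy HDFS actually applies. All names are made up for illustration.

```python
BLOCK_SIZE_MB = 128  # assumption: a common HDFS default block size
REPLICATION = 3      # assumption: a common default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Sizes of the blocks a file is cut into; only the last one may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Toy placement: every block is copied to `replication` distinct DataNodes."""
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


blocks = split_into_blocks(300)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(f"block {block} ({blocks[block]} MB) -> {nodes}")
```

Losing any single DataNode in this sketch still leaves two copies of every block, which is the point of replication.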

Page 26: The State of BigData  -  meetup bigdata @ovh

DataNodes

NameNode

/ HDFS / NameNode

Page 27: The State of BigData  -  meetup bigdata @ovh

DataNodes

NameNodes

/ HDFS / Namespace #Federation

Page 28: The State of BigData  -  meetup bigdata @ovh

/ HDFS / HA

#HighAvailability

NN1 NN2

Page 29: The State of BigData  -  meetup bigdata @ovh

/ HDFS / HA

Failover Controller

● NameNode side
  ● Health monitor
  ● Manage HA state

● ZooKeeper side
  ● Monitor state
  ● Maintain or try to get the Active lock

#Five9rulez
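
The two sides of the failover controller can be illustrated with a toy simulation. This is a sketch only: `ActiveLock` is a plain Python stand-in for the ZooKeeper lock, and both class names and the `tick` method are hypothetical, not Hadoop or ZooKeeper APIs.

```python
class ActiveLock:
    """Stand-in for the ZooKeeper lock: at most one holder at a time."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class FailoverController:
    """One controller per NameNode: watches its health and manages its HA state."""

    def __init__(self, name, lock):
        self.name, self.lock = name, lock
        self.healthy = True
        self.state = "standby"

    def tick(self):
        """One monitoring round: check health, then keep or try to get the lock."""
        if not self.healthy:
            self.lock.release(self.name)
            self.state = "standby"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"


lock = ActiveLock()
zkfc1 = FailoverController("NN1", lock)
zkfc2 = FailoverController("NN2", lock)

for controller in (zkfc1, zkfc2):
    controller.tick()
print(zkfc1.state, zkfc2.state)  # NN1 got the lock first

zkfc1.healthy = False  # NN1 fails its health check
for controller in (zkfc1, zkfc2):
    controller.tick()
print(zkfc1.state, zkfc2.state)  # NN2 takes over
```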

Page 30: The State of BigData  -  meetup bigdata @ovh

/ HDFS / Client #Read

#DataLocality

Page 31: The State of BigData  -  meetup bigdata @ovh

/ HDFS / Client #Write

#ReplicationFactor3

Page 32: The State of BigData  -  meetup bigdata @ovh

#MapReduce

Page 33: The State of BigData  -  meetup bigdata @ovh

MapReduce

#MAP

Page 34: The State of BigData  -  meetup bigdata @ovh

MapReduce

#SHUFFLE

Page 35: The State of BigData  -  meetup bigdata @ovh

MapReduce

#REDUCE

Page 36: The State of BigData  -  meetup bigdata @ovh

MapReduce

[Diagram: MapReduce data flow. Input pairs <key1, val1> ... <key502, val502> each go through a map task, the resulting intermediate pairs <ikey1, ival1> ... <ikey150, ival522> are grouped by key, and each reduce task emits output pairs <okey1, oval1> ... <okey150, oval150>.]

Input → Input Pairs → Intermediate Pairs → Output Pairs → Output

Step 1: Split

Step 2: Map

Step 3: Shuffle / Sort

Step 4: Reduce

Step 5: Store
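
The five steps can be sketched in plain Python with a word count, the usual MapReduce example. This runs in a single process and only mimics the data flow; it does not use Hadoop, and the function names are made up for illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: one input pair <key, val> -> zero or more intermediate pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: one intermediate key and all its values -> one output pair."""
    return word, sum(counts)


def map_reduce(lines):
    # Step 1: Split, here every line becomes one input pair <line number, text>
    input_pairs = enumerate(lines)
    # Step 2: Map
    intermediate = [pair for key, val in input_pairs for pair in map_fn(key, val)]
    # Step 3: Shuffle / Sort, group the intermediate values by key
    groups = defaultdict(list)
    for ikey, ival in intermediate:
        groups[ikey].append(ival)
    # Step 4: Reduce, one call per distinct key, in sorted key order
    output_pairs = [reduce_fn(ikey, groups[ikey]) for ikey in sorted(groups)]
    # Step 5: Store, here simply returned as a dict
    return dict(output_pairs)


result = map_reduce(["big data big insight", "data data"])
print(result)
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves the intermediate pairs across the network.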

Page 37: The State of BigData  -  meetup bigdata @ovh

MapReduce

Page 38: The State of BigData  -  meetup bigdata @ovh

MapReduce

Page 39: The State of BigData  -  meetup bigdata @ovh

#Pig & #Hive

Page 40: The State of BigData  -  meetup bigdata @ovh

Hive

● Tez

● Impala

● Presto.io

Page 41: The State of BigData  -  meetup bigdata @ovh

#HBase

Page 42: The State of BigData  -  meetup bigdata @ovh

HBase

#Model

Page 43: The State of BigData  -  meetup bigdata @ovh

HBase

#Model

Page 44: The State of BigData  -  meetup bigdata @ovh

HBase

#Model

Page 45: The State of BigData  -  meetup bigdata @ovh

HBase

#Model

Page 46: The State of BigData  -  meetup bigdata @ovh

HBase

#PhysicalStorage

Page 47: The State of BigData  -  meetup bigdata @ovh

HBase

#Scale

Page 48: The State of BigData  -  meetup bigdata @ovh

HBase

#Scale

Page 49: The State of BigData  -  meetup bigdata @ovh

HBase

#Meta

Page 50: The State of BigData  -  meetup bigdata @ovh

#HBaseArch

Page 51: The State of BigData  -  meetup bigdata @ovh

HBase #SQL

Page 52: The State of BigData  -  meetup bigdata @ovh

HBase

#Features

● Coprocessor
● Auto-sharding
● Scan (full, range)
● Schemaless
● Cell versioning
● Battle tested
● Compactions
● Replication
● Custom filters
● Transactional
● Low latency
● Active community
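
A few of these features (schemaless columns, cell versioning, range scans over sorted row keys) can be illustrated with a toy in-memory table. This is a sketch of the ideas only: `ToyTable` is hypothetical, it is not the HBase client API, and the limit of 3 versions is an arbitrary choice for the example.

```python
class ToyTable:
    """Rows sorted by key; each cell keeps its newest versions."""

    def __init__(self, max_versions=3):
        self.rows = {}  # row key -> {column -> [(timestamp, value), ...], newest first}
        self.max_versions = max_versions

    def put(self, row, column, value, ts):
        # Schemaless: any column name can be written to any row.
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, row, column):
        """Newest value of a cell, or None if the cell does not exist."""
        versions = self.rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start=None, stop=None):
        """Range scan over sorted row keys, start inclusive, stop exclusive."""
        for key in sorted(self.rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, {col: vers[0][1] for col, vers in self.rows[key].items()}


table = ToyTable()
table.put("row1", "cf:a", "v1", ts=1)
table.put("row1", "cf:a", "v2", ts=2)  # newer version of the same cell
table.put("row2", "cf:b", "x", ts=1)
table.put("row3", "cf:a", "y", ts=1)

print(table.get("row1", "cf:a"))
print(list(table.scan("row2", "row3")))
```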

Page 53: The State of BigData  -  meetup bigdata @ovh

Beyond Batch : Streaming

PART 3

Page 54: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Data Platform #Transport

Page 55: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Data Platform / Kafka

Page 56: The State of BigData  -  meetup bigdata @ovh

+ =

/ Streaming / Frameworks

Page 57: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Storm / Topology #Storm

Page 58: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Storm / Topology #Parallelism

Page 59: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Flink #Job

Page 60: The State of BigData  -  meetup bigdata @ovh

/ Streaming / Flink

#DataSet API #DataStream API

Page 61: The State of BigData  -  meetup bigdata @ovh

OK Steven, but a new DSL for each new hype tool? Come on...

Page 62: The State of BigData  -  meetup bigdata @ovh
Page 63: The State of BigData  -  meetup bigdata @ovh
Page 64: The State of BigData  -  meetup bigdata @ovh

Apache Beam

#Features

● Open-sourced Google Dataflow
● Unifies big data development
● Beam Model (from the Dataflow model)
● Parallel data processing pipelines
● Pluggable runners: Flink or Google Cloud Dataflow
● Portability
● SDKs: Java / Python

Page 65: The State of BigData  -  meetup bigdata @ovh

#Architecture

Page 66: The State of BigData  -  meetup bigdata @ovh

Lambda Architecture

Page 67: The State of BigData  -  meetup bigdata @ovh

Drawbacks

• Hard to merge for the serving layer

• Hard to maintain and operate both realtime and batch code in sync
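
Both drawbacks show up even in a toy page-view counter. The sketch below assumes each layer exposes per-page counts and that the batch view and the realtime view never cover the same events; all names are made up for illustration.

```python
from collections import Counter


def batch_count(events):
    """Batch layer: recompute the whole view from the stored history."""
    return Counter(event["page"] for event in events)


def realtime_update(view, event):
    """Speed layer: the same counting logic, written a second time, incrementally."""
    view[event["page"]] += 1


def serve(page, batch_view, realtime_view):
    """Serving layer: merge both views to answer a query."""
    return batch_view[page] + realtime_view[page]


history = [{"page": "/a"}, {"page": "/a"}, {"page": "/b"}]
batch_view = batch_count(history)

realtime_view = Counter()
realtime_update(realtime_view, {"page": "/a"})  # arrived after the last batch run

print(serve("/a", batch_view, realtime_view))
```

The counting rule lives in two places that must stay in sync, and the merge is only correct while the two views do not overlap.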

Page 68: The State of BigData  -  meetup bigdata @ovh

Kappa Architecture

Page 69: The State of BigData  -  meetup bigdata @ovh

From Storm to Flink

Page 70: The State of BigData  -  meetup bigdata @ovh

#Yarn

Page 71: The State of BigData  -  meetup bigdata @ovh

Yarn

#MapReduce

Page 72: The State of BigData  -  meetup bigdata @ovh

Yarn

#MessagePassing

Page 73: The State of BigData  -  meetup bigdata @ovh

Yarn

#StreamProcessing

Page 74: The State of BigData  -  meetup bigdata @ovh

Yarn

#DistributedLoadTest

Page 75: The State of BigData  -  meetup bigdata @ovh

Yarn

#ResourceManagement

Page 76: The State of BigData  -  meetup bigdata @ovh

Yarn Frameworks

Page 77: The State of BigData  -  meetup bigdata @ovh

#Mesos

Page 78: The State of BigData  -  meetup bigdata @ovh
Page 79: The State of BigData  -  meetup bigdata @ovh

Columnar Storage

PART 4

Page 80: The State of BigData  -  meetup bigdata @ovh

Columnar Storage

#ORC #Parquet

Page 81: The State of BigData  -  meetup bigdata @ovh

Ecosystem

PART 5

Page 82: The State of BigData  -  meetup bigdata @ovh
Page 83: The State of BigData  -  meetup bigdata @ovh

Vendors

Page 84: The State of BigData  -  meetup bigdata @ovh

Integration

Page 85: The State of BigData  -  meetup bigdata @ovh

? @StevenLeRoux

2016