Big Data Bootcamp 2016

Day 1 - Hadoop

Introduction
- Name
- Role
- Expectation

Agenda - Day 1 / Day 2
- Hadoop Architecture
- HDFS / Demo / Lab
- YARN / MapReduce / Lab
- Job Scheduler / Preemption
- Capacity Scheduler Lab
- SQL DataFrames
- Catalyst Optimizer
- Spark Streaming / Lab
- Hive / Demo
- Security / Lab
- Lucene / Solr / ElasticSearch
Agenda - Day 2 and Day 3:
- Field Engagement
- Positioning
- Architecture
- Setup
- Troubleshooting
- Administration
- Best Practices
- Archiving
- ERPs
- MongoDB
- Assessment
Hadoop Architecture

Hadoop 1.x vs. Hadoop 2.x

HADOOP 1.x (single-use system: batch apps):
- HDFS (redundant, reliable storage)
- MapReduce (cluster resource management & data processing)

HADOOP 2.x (multi-purpose platform: batch, interactive, online, streaming, ...):
- HDFS2 (redundant, reliable storage)
- YARN (cluster resource management)
- MapReduce and others (data processing)
Overview of a Hadoop Cluster

A Hadoop cluster is made up of master and slave (worker) nodes:
- Master nodes manage the infrastructure
- Worker nodes contain the distributed data and perform processing

Master nodes run the NameNode, ResourceManager, Standby NameNode, and HBase Master:
- Master Node 1: NameNode, Oozie Server, ZooKeeper
- Master Node 2: ResourceManager, Standby NameNode, HBase Master, HiveServer2, ZooKeeper
- Management Node: Ambari Server, WebHCat Server, JobHistoryServer, ZooKeeper

Worker nodes run the NodeManager, DataNode, and HBase RegionServer:
- Worker Nodes 1-4: DataNode, NodeManager, HBase RegionServer
Standalone Deployment Mode
- Runs in a single JVM (Java Virtual Machine)
- Uses the local file system
- Suitable for testing during development

(Diagram: Hadoop, Pig, and Hive all run inside one JVM on a single machine's CPU, memory, and disk.)
Pseudo-Distributed Deployment Mode
- Each daemon runs in its own JVM
- The Hortonworks Sandbox runs in pseudo-distributed mode
- All daemons run on a single machine using HDFS
- Appropriate for QA and development environments

(Diagram: the NameNode, ResourceManager, DataNode, and NodeManager each run in their own JVM, sharing one machine's CPU, memory, and HDFS.)
Distributed Deployment Mode
- Each daemon runs on its own machine
- Data is maintained with local HDFS
- Best for production environments

(Diagram: a master node with CPU, memory, and local disk, plus multiple worker nodes, each with CPU, memory, and data stored in HDFS.)
Small Cluster with HA

(Diagram: six worker nodes plus four master nodes and two utility nodes.)
- Master nodes: NameNode 1 and 2, ResourceManager 1 and 2, HBase Master 1 and 2, three ZooKeeper servers, three JournalNodes, Oozie, History Server, Timeline Server, Falcon
- Utility nodes: Ambari (monitoring & metrics), Client Gateways, HiveServer, WebHCat, Hive Metastore, Knox
- Worker Nodes 1-6: NodeManager, DataNode, HBase RegionServer
HDFS
HDFS Architecture

NOTE: HDFS is the most commonly used filesystem implementation. Other available implementations include Tachyon, GlusterFS, S3, and more.

The NameNode and DataNodes are daemon JVMs:
- NameNode: namespace, block map, metadata, journaling
- DataNodes: storage
(Diagram: a client, the NameNode, and DataNodes 1-3, with the write steps numbered 1-5.)

1. The client sends a request to the NameNode to add a file to HDFS.
2. The NameNode gives the client a lease to the file path.
3. For every block, the client requests a new block ID and a list of destination DataNodes from the NameNode.
4. The client writes the block directly to the first DataNode in the list.
5. The first DataNode pipelines the replication to the next DataNode in the list.
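The five-step write path above can be condensed into a toy sketch. Everything here is illustrative: the node names and block IDs are invented, and this is not the real HDFS client API.

```java
import java.util.*;

/** Toy model of the HDFS write pipeline described above. */
public class HdfsWriteSketch {
    /** The NameNode's role: hand out an ordered list of target DataNodes for a block. */
    public static List<String> allocateTargets(List<String> dataNodes, int replication) {
        return dataNodes.subList(0, Math.min(replication, dataNodes.size()));
    }

    /** The client writes only to the first DataNode; each node then forwards
     *  ("pipelines") the block to the next node in the list. */
    public static Map<String, Set<Integer>> pipelineWrite(int blockId, List<String> targets) {
        Map<String, Set<Integer>> storage = new LinkedHashMap<>();
        for (String node : targets) { // each hop in the pipeline stores the block, then forwards
            storage.computeIfAbsent(node, n -> new HashSet<>()).add(blockId);
        }
        return storage;
    }

    public static void main(String[] args) {
        List<String> targets = allocateTargets(List.of("dn1", "dn2", "dn3", "dn4"), 3);
        System.out.println(pipelineWrite(1, targets).keySet()); // [dn1, dn2, dn3]
    }
}
```

Note that the client only ever talks to the first DataNode; replication to the rest of the pipeline is the DataNodes' job, which is what keeps client bandwidth requirements low.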
The NameNode

The NameNode maintains the namespace, block map, metadata, and journaling (the fsimage and edits files).

1. When the NameNode starts, it reads the fsimage_N and edits_N files.
2. The transactions in edits_N are merged with fsimage_N.
3. A newly created fsimage_N+1 is written to disk, and a new, empty edits_N+1 is created.
4. Now a client application can create a new file in HDFS.
5. The NameNode journals that create transaction in the edits_N+1 file.

During startup, the NameNode is in safemode, a read-only mode.

Note: The block map is constructed during safemode. The block map is not persisted in fsimage or edits, and must be dynamically reconstructed when the NameNode starts.
The DataNodes

(Diagram: DataNodes 1-4 send heartbeats and block reports to the NameNode. DataNode 3 has stopped heartbeating: "Sorry, DataNode 3, but I'm going to assume you are dead.")
Heterogeneous Storage

Before:
- The DataNode is a single storage
- Storage is uniform - only one storage type (disk)
- Storage types are hidden from the file system
- All disks act as a single storage

New architecture:
- The DataNode is a collection of storages
- Supports different types of storages: disk, SSDs, memory
- A collection of tiered storages: S3, Swift, SAN, filers
HDFS Storage Architecture

(Diagram: DataNodes report to the NameNode per device; clients read/write to a specific storage.)
Understanding NameNode HA

(Diagram: a Hadoop HA cluster.)
- The Active NameNode handles namespace and block management, and serves both writes and reads
- The Standby NameNode maintains the same namespace and block management state in standby
- A ZKFC (ZooKeeper Failover Controller) runs alongside each NameNode; a three-node ZooKeeper ensemble coordinates automatic failover
- DataNodes 1..n hold the distributed blocks (BLK1, BLK2, ...) and send heartbeats to both NameNodes
Configuration Hierarchy

Default configuration (packaged in the jars):
- hadoop-common.jar, hadoop-hdfs.jar, hadoop-mapreduce-client-core.jar, hadoop-yarn-common.jar
- core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml

Inherits / extends:

Site/node configuration (/etc/hadoop/conf):
- core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

Inherits / extends:

Application/job configuration:
- yarn jar -D prop=value
- Configuration.set(...)
HDFS Demo

HDFS Lab
YARN
What is YARN?
- Yet Another Resource Negotiator
- Part of core Hadoop
- A multi-node cluster has an aggregate pool of compute resources (CPU and memory)
- These are the kinds of resources an operating system would manage - AKA "the Hadoop operating system"
- Disk resources are managed by HDFS

Hadoop now provides infrastructure resource management.
5 Key Benefits of YARN
1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java
YARN Architecture

(Diagram: a ResourceManager with a Scheduler oversees a grid of NodeManagers; ApplicationMasters AM 1 and AM 2 run in containers on NodeManagers, alongside their task containers 1.1-1.3 and 2.1-2.4.)
Lifecycle of a YARN Application

1. The client submits an application request to the ApplicationsManager (in the ResourceManager).
2. The ApplicationsManager responds with an Application ID.
3. The client sends the Application Submission Context (application ID, queue, application name, application type, AM resource, priority) plus the Container Launch Context (launch commands, LocalResources, delegation tokens).
4. The ResourceManager starts the ApplicationMaster in a container on a NodeManager.
5. The ApplicationMaster gets cluster capabilities from the ApplicationMasterService.
6. The ApplicationMaster requests and receives containers from the Scheduler.
7. The ApplicationMaster issues container launch requests to the NodeManagers.

(Diagram: an MR job's tasks - Job Map 1-8 and Job Reducer 1-2 - running in containers spread across NodeManagers, with remaining free space.)
ResourceManager HA

(Diagram sequence, frames 1-9: a YARN cluster with a three-node ZooKeeper ensemble, an Active and a Standby ResourceManager, several NodeManagers, and a client node running the YARN client.)

1. The YARN client on the client node talks to the Active ResourceManager.
2. A batch application starts: its ApplicationMaster and map tasks (Map 1.1, Map 1.2) run on NodeManagers.
3. The tasks read and write HDFS, and application state is persisted.
4. The Active ResourceManager goes down, along with the NodeManager running Map 1.1; only Map 1.2 survives.
5. ZooKeeper-based leader election promotes the Standby ResourceManager to Active.
6-7. The new Active ResourceManager recovers application state; the NodeManagers and the YARN client re-register and reconnect to it.
8. The former Active ResourceManager comes back as the new Standby.
9. The lost map task (Map 1.1) is rerun and the application continues.
MapReduce
MapReduce Phases

(Diagram: three NodeManager + DataNode hosts run Mappers; two run Reducers.)

Map Phase -> Shuffle/Sort -> Reduce Phase

During shuffle/sort, data is shuffled across the network and sorted.
Inverted Index: Map

Map tasks read pages from the Hadoop Distributed File System (HDFS) and write key/value pairs to the worker node's local file system, sorted by key:
- Map 1 (Amazon.com): P1 [books, amazon.com], P2 [movies, amazon.com]
- Map 2 (Walmart.com): P2 [automotive, walmart.com], P1 [books, walmart.com]
Inverted Index: Shuffle

- Reduce 1 receives P1 [books, amazon.com] from Map 1 and P1 [books, walmart.com] from Map 2
- Reduce 2 receives P2 [movies, amazon.com] from Map 1 and P2 [automotive, walmart.com] from Map 2

Inverted Index: Reduce

- Reduce 1 takes [books, amazon.com] and [books, walmart.com] and writes to HDFS: books -> amazon.com, walmart.com
- Reduce 2 takes [movies, amazon.com] and [automotive, walmart.com] and writes to HDFS: movies -> amazon.com; automotive -> walmart.com
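The whole map/shuffle/reduce flow above can be condensed into plain Java collections, with stream grouping playing the role of the shuffle. The page contents here are a tiny invented sample, not the course dataset.

```java
import java.util.*;
import java.util.stream.*;

/** Plain-Java condensation of the inverted-index MapReduce example above. */
public class InvertedIndexSketch {
    /** Map phase emits (category, site) pairs; grouping is the shuffle;
     *  the reduce phase collects the sorted site list per category. */
    public static Map<String, List<String>> invert(Map<String, List<String>> pages) {
        return pages.entrySet().stream()
            // Map phase: emit one (category, site) pair per category on each page
            .flatMap(e -> e.getValue().stream().map(cat -> Map.entry(cat, e.getKey())))
            // Shuffle/sort + reduce: group by category and collect sites, sorted
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue,
                    Collectors.collectingAndThen(
                        Collectors.toCollection(ArrayList::new),
                        l -> { Collections.sort(l); return l; }))));
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = Map.of(
            "amazon.com", List.of("books", "movies"),
            "walmart.com", List.of("automotive", "books"));
        System.out.println(invert(pages).get("books")); // [amazon.com, walmart.com]
    }
}
```

In a real cluster, each flatMap pair would be written to a mapper's local disk and fetched over the network; here the grouping happens in one process, but the dataflow is the same.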
Within a Single Mapper

- The RecordReader reads an input split and passes deserialized key/value records to the user's map method.
- Output records go into an in-memory sort buffer (sized by mapreduce.task.io.sort.mb); record info (key/value offsets and partition number) is maintained alongside the serialized key/value data and partition #.
- When the buffer fills past mapreduce.map.sort.spill.percent, records are sorted (and optionally combined) and spilled to a spill file on local disk, organized by partition (P0-P3).
- An in-memory spill index tracks each partition's offset and its uncompressed and compressed lengths.
- The user's Partitioner/Comparator/Combiner are pluggable, as are the shuffle and sort implementations.
Map-side Merging of Spill Files

- Each spill file (Spill File 1-3) contains sorted runs for Partitions 0-3, with an in-memory spill index (offset, length) per spill file.
- The spill files are merged (and optionally combined) into a final spill file with one region per partition, plus a final spill index (offset, length).
- The outputs are file.out and file.out.index, written to a separate directory for each map task; the NodeManager serves requests from reducers out of them.
Shuffle Dataflow

(Diagram: input flows into Mappers 0-3; intermediate files are transferred to Reducers 0-1, which write the output.)
Reducer - Shuffle Received Data

- Fetches happen in parallel via copier threads (mapreduce.reduce.shuffle.parallelcopies).
- Small map outputs that can fit in memory are merged/combined there, and the merged data is spilled to disk as needed.
- Large map outputs that cannot fit in memory are merged on disk, spilling merged data.
- mapreduce.reduce.shuffle.input.buffer.percent = % of JVM heap to use for buffering.
Compression

Compression Needs and Tradeoffs
- Compressing data can speed up data-intensive I/O operations
  - MapReduce jobs are almost always I/O bound
- Compressed data can save storage space and speed up data transfers across the network
  - Capital allocation for hardware can go further
- Reduced I/O and network load can result in significant performance improvements
  - MapReduce jobs can finish faster overall
- But CPU utilization and processing time increase during compression and decompression
- Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
Data Compression in Hadoop's MR Pipeline

(Diagram: three compression points in the pipeline.)
1. Map input: input splits may arrive compressed; the mapper decompresses them.
2. Intermediate (map) output: map output is compressed before the sort & shuffle; reducers decompress the fetched map output during merge and sort (merge on disk, data also flows to/from other maps and other reducers).
3. Final (reduce) output: the reducer's output is compressed before being written.
Using Data Compression in Hadoop

Phase 1) Input data to Map:
- The file extension is recognized automatically for decompression (file extensions for supported formats).
- Note: for SequenceFile, headers carry the information [compression (boolean), block compression (boolean), and compression codec].
- Codec: one of the supported codecs defined in io.compression.codecs.

Phase 2) Intermediate (Map) output:
- mapreduce.map.output.compress = false (default) or true
- mapreduce.map.output.compress.codec = one defined in io.compression.codecs

Phase 3) Final (Reduce) output:
- mapreduce.output.fileoutputformat.compress = false (default) or true
- mapreduce.output.fileoutputformat.compress.codec = one defined in io.compression.codecs
- mapreduce.output.fileoutputformat.compress.type = type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
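Putting the phase 2 and phase 3 properties together, a mapred-site.xml fragment might look like the sketch below. The property names come from the table above; the codec class shown is the stock Hadoop Snappy codec, chosen here only as an example.

```xml
<!-- Example: compress intermediate map output and final job output with Snappy. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Intermediate-output compression (phase 2) is usually the cheapest win, since that data crosses the network during the shuffle.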
Common File Formats

Block compressed:
- SequenceFile, RCFile, ORC File, Avro File, Parquet

Other formats:
- Text, JSON, XML
Compression Options in Hadoop

- zlib: uses DEFLATE (LZ77 and Huffman coding); dictionary-based, API
- gzip: wrapper around zlib; dictionary-based, standard compression utility
- bzip2: Burrows-Wheeler transform; transform-based, block-oriented
- LZO: variant of LZ77; dictionary-based, block-oriented, API
- LZ4: simplified variant of LZ77; fast scan, API
- Snappy: LZ77; block-oriented, API
- LZOP: LZO; splittable with preprocessing to generate an index file
MapReduce Demo

MapReduce Lab
Job Scheduler

YARN Capacity Scheduler

The Capacity Scheduler serves three functions: administer, isolate, and schedule.
Capacity Scheduler
- Distributes jobs to queues based on capacity
- Capacity is determined by a percentage of available memory
- The percentages assigned across all queues must add up to 100%
- Individual user limits can be set per queue
- Extra available capacity is given to other queues evenly

Example: queues configured for 50% / 30% / 20% (Queue1, Queue2, Queue3) may actually run at 40% / 35% / 25%.
Defining Queues:

yarn.scheduler.capacity.root.queues=Marketing,Sales,Engineering
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Engineering.capacity=20

Configuring Capacity Limits:

yarn.scheduler.capacity.root.queues=Marketing,Marketing-longrunning
yarn.scheduler.capacity.root.Marketing.capacity=70
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
yarn.scheduler.capacity.root.Marketing-longrunning.capacity=30
yarn.scheduler.capacity.root.Marketing-longrunning.maximum-capacity=30

Configuring Permissions:

yarn.scheduler.capacity.root.Engineering.acl_submit_applications=George,Tom developer,admin
yarn.scheduler.capacity.root.Engineering.acl_administer_queue=Tom admin
Preemption

YARN Capacity Scheduler (administer / schedule / isolate)
- Queues with guaranteed capacity
- Queue hierarchy
- Job submission Access Control Lists
- Max capacity per queue
- User limits within a queue
- Preemption across queues
- Management/admin Access Control Lists
- Configured in capacity-scheduler.xml
Preemption: Flow

1. Gather queue state:
   - Current capacity
   - Guaranteed capacity
   - Pending requests
2. Identify the set of preemptions:
   - Figure out what is needed to achieve capacity balance
   - Select applications to preempt: over-capacity queues, in FIFO order
   - Respect bounds on the amount of preemption allowed for each round
3. Preempt application(s):
   - Remove reservations from the most recently assigned app
   - Issue preemptions for containers of the same app (reverse chronological order, last assigned container first)
   - ApplicationMaster preemption is a last resort
4. Kill containers:
   - Track those containers that have been issued, but have not yet executed, a preemption
   - After a set of execution periods, forcibly kill those containers
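Steps 1-2 of the flow above can be sketched as a toy calculation: given each queue's current vs. guaranteed capacity, find the over-capacity queues and how much must be taken back. The queue names and numbers are invented, and the per-round preemption bounds are ignored for simplicity.

```java
import java.util.*;

/** Toy sketch of the "gather queue state / identify preemptions" steps above. */
public class PreemptionSketch {
    public record Queue(String name, int guaranteedPct, int currentPct) {}

    /** Queues running over their guarantee are the candidates to preempt from;
     *  the amount is simply the overage. */
    public static Map<String, Integer> overages(List<Queue> queues) {
        Map<String, Integer> over = new TreeMap<>();
        for (Queue q : queues) {
            if (q.currentPct() > q.guaranteedPct()) {
                over.put(q.name(), q.currentPct() - q.guaranteedPct());
            }
        }
        return over;
    }

    public static void main(String[] args) {
        List<Queue> qs = List.of(new Queue("Product", 20, 30),
                                 new Queue("Marketing", 45, 35),
                                 new Queue("Finance", 35, 35));
        System.out.println(overages(qs)); // {Product=10}
    }
}
```

The real scheduler additionally orders candidate containers within each over-capacity queue (FIFO across apps, last-assigned container first), which this sketch leaves out.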
Preemption in Action

(Diagram sequence across three queues - Product [min 20%, max 30%], Marketing [min 45%, max 75%], Finance [min 35%, max 35%]:)

2. Job 1 runs in Marketing and Job 2 runs in Finance.
3. Job 3 is submitted to Finance while Job 1 has grown to 65% of the cluster.
4. Job 2 is preempted (JOB2 PREEMPTED) so Job 3 can run.
5. Job 1 finishes (JOB1 FINISHES); Jobs 2 and 3 continue with the freed capacity.
Capacity Scheduler Lab

Kafka
What is Kafka?

According to the Kafka website: "Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design."

- Distributed: horizontally scalable (just like Hadoop)
- Partitioned: the data is split up and distributed across the brokers
- Replicated: allows for automatic failover
- Unique: Kafka does not track the consumption of messages (the consumers do)
- Fast: designed from the ground up with a focus on performance and throughput
Kafka Terminology

Kafka is a publish/subscribe messaging system comprised of the following components:
- Topic: a message feed
- Producer: a process that publishes messages to a topic
- Consumer: a process that subscribes to a topic and processes its messages
- Broker: a server in a Kafka cluster

Kafka Components

(Diagram: producers publish to a cluster of brokers; consumers read from it. Kafka uses ZooKeeper to coordinate brokers with consumers.)
Publishing Messages

1. A producer publishes messages (message_a, message_b, ...) to a topic.
2. The producer decides which partition to send each message to.
3. New messages are written to the end of the partition.
4. A consumer fetches messages from a partition by specifying an offset.

(Diagram: offsets 0-4 across Partition 0 [message_b, message_f], Partition 1 [message_a, message_c, message_e], and Partition 2 [message_d], ordered old to new.)
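Steps 3 and 4 above amount to an append-only log indexed by offset, which can be modeled in a few lines. This is illustrative only - not the real Kafka client API - and the message names are just the ones from the diagram.

```java
import java.util.*;

/** Toy model of a Kafka partition: append to the tail, fetch by offset. */
public class KafkaLogSketch {
    public static class Partition {
        private final List<String> log = new ArrayList<>();

        /** Step 3: new messages are written to the end of the partition;
         *  the message's offset is its position in the log. */
        public long append(String msg) {
            log.add(msg);
            return log.size() - 1;
        }

        /** Step 4: a consumer fetches messages starting at an offset that
         *  it tracks itself - the broker does not track consumption. */
        public List<String> fetch(long offset) {
            return List.copyOf(log.subList((int) offset, log.size()));
        }
    }

    public static void main(String[] args) {
        Partition p0 = new Partition();
        p0.append("message_b");
        long off = p0.append("message_f");
        System.out.println(off + " " + p0.fetch(1)); // 1 [message_f]
    }
}
```

Because the consumer owns its offset, re-reading a partition is just fetching from an earlier offset - which is why Kafka's "the consumers track consumption" design is cheap for the brokers.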
Leader and Followers

(Diagram: for my_topic Partition 1, Broker 2 holds the leader replica while Brokers 1 and 3 hold follower replicas.)
Spark

Hadoop Ecosystem
Introducing Spark

(Diagram: Spark SQL/DataFrames, Spark Streaming, MLlib, and GraphX sit on Spark Core/Tungsten, which reads from HDFS, Hive, HBase, S3, Cassandra, and others.)
Spark Features

Supported:
- Spark Core
- MLlib
- Spark on YARN / Kerberos
- Spark SQL
- SparkR
- Spark Streaming
- DataFrame API
- Spark ML Pipeline API in PySpark

Unsupported / Tech Preview:
- GraphX
- Spark Standalone / Mesos

(A second support matrix: Spark Core, MLlib, Spark on YARN / Kerberos, Spark SQL / DataFrame, Spark Streaming, and Spark Standalone supported; GraphX, SparkR, Mesos, and the Spark ML Pipeline API unsupported.)

2015 Spark Survey
What is Spark?

Spark is an open-source software solution that performs rapid calculations on in-memory datasets.

- Open source (Apache hosted & licensed):
  - Free to download and use in production
  - Developed by a community of developers
- In-memory datasets:
  - The RDD (Resilient Distributed Dataset) is the basis for what Spark enables
  - Resilient: the models can be recreated on the fly from known state
  - Distributed: the dataset is often partitioned across multiple nodes for increased scalability and parallelism
Write Less Code: WordCount

Java MapReduce:

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String currentLine = value.toString();
  String[] words = StringUtils.split(currentLine, '\\', ' ');
  for (String word : words) {
    outputKey.set(word);
    context.write(outputKey, ONE);
  }
}

protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable count : values) {
    sum += count.get();
  }
  outputValue.set(sum);
  context.write(key, outputValue);
}

Scala Spark Core:

val words = input.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1))
                  .reduceByKey { case (x, y) => x + y }
counts.saveAsTextFile(outputFile)
Write Less Code: Calculate an Average
Spark Core

Spark Core contains the basic functionality of Spark:
- Task scheduling
- Memory management
- Fault recovery
- Interacting with storage systems

It is home to Resilient Distributed Datasets (RDDs): collections of immutable partitions of data. Spark Core provides many APIs for building and manipulating RDDs.

- Tungsten (version 1.5)
- Spark Datasets (SPARK-9999, target version 1.6/2.0)
File Formats

- Text: splittable, unstructured. Plain old text files; splittable provided records are one per line.
- JSON: splittable, semi-structured. Common text-based format; splittable if one record per line.
- CSV: splittable, structured. Very common text-based format, often used with spreadsheet applications.
- SequenceFile: splittable, structured. A common Hadoop file format used for key-value data.
- Protocol Buffers: splittable, structured. A fast, space-efficient multi-language format.
- Object Files: splittable, structured. Useful for saving data from a Spark job to be consumed by shared code. Breaks if you change your classes, as it relies on Java serialization.
Compression

- spark.rdd.compress (default false): whether to compress serialized RDD partitions
- spark.broadcast.compress (default true): whether to compress broadcast variables before sending them
- spark.io.compression.codec (default snappy): the codec used to compress internal data such as RDD partitions, broadcast variables, and shuffle outputs. By default, Spark provides three codecs: lz4, lzf, and snappy
Create RDD

Spark provides two ways to create RDDs:

1. Loading an external dataset:
   sc.textFile("data/salarydata.txt")

2. Parallelizing a collection in your driver program:
   sc.parallelize(List("M",14,0,95102))
RDD Operations

Two types of operations on RDDs: transformations and actions.
- Transformations are lazy (not computed immediately); the transformed RDD gets recomputed when an action is run on it (default)
- An RDD can be persisted into storage, in memory or on disk
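The laziness described above has a close analogue in plain Java streams, which can be demonstrated without a Spark cluster: the map below does not run until a terminal ("action"-like) operation forces it. The counter is just instrumentation for the demo.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

/** Demonstrates lazy transformations, in the same spirit as Spark's RDD laziness. */
public class LazySketch {
    public static void main(String[] args) {
        AtomicInteger evaluated = new AtomicInteger();
        // A "transformation": declared, but nothing is computed yet.
        IntStream doubled = IntStream.rangeClosed(1, 5)
            .map(x -> { evaluated.incrementAndGet(); return x * 2; });
        int before = evaluated.get();   // still 0: the map has not run
        int sum = doubled.sum();        // the terminal op (the "action") runs the map
        System.out.println(before + "," + evaluated.get() + "," + sum); // 0,5,30
    }
}
```

Spark takes the same idea further: because transformations are only a plan, the scheduler can pipeline them and recompute lost partitions from the lineage when needed.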
Transformations

- map()
- flatMap()
- filter()
- distinct()
- sample()
- union()
- intersection()
- subtract()
- cartesian()

Transformations build a new RDD (and its physical plan):

// Transformed RDD
val genderSalaryData = salaryData
  .map(line => line.split(','))
  .map(line => (line(0), line(2).toInt))
Actions

Transformations build an RDD and its physical plan; an action executes the plan and returns a value:

// takeOrdered() action
val distinctFemaleSalaries = genderSalaryData
  .filter { case (gender, salary) => gender == "F" }
  .map { case (gender, salary) => salary }
  .distinct()
distinctFemaleSalaries.takeOrdered(10)(Ordering[Int].reverse)
Pair RDD

- Key-value RDDs are commonly used to perform aggregations
- They expose new operations
- They act on each key in parallel or regroup data across the network

Create a pair RDD using the first field as the key:
salaryData.map(line => line.split(',')).map(line => (line(0), line(2).toInt))

Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)}):
- combineByKey(), groupByKey(), reduceByKey(func), mapValues(func), flatMapValues(func), keys(), values(), sortByKey()

Transformations on two pair RDDs:
- join, cogroup, aggregateByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin

Actions on a pair RDD (example: {(1, 2), (3, 4), (3, 6)}):
- countByKey(), collectAsMap(), lookup(key)
Storage Levels: cache(), persist()

Full set of storage levels:
- MEMORY_ONLY (default): deserialized. Partitions that don't fit are not cached and are recomputed on demand.
- MEMORY_AND_DISK: deserialized. Partitions that don't fit in memory are stored on disk and read on demand.
- MEMORY_ONLY_SER: serialized. More space-efficient than deserialized objects, but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: spill partitions that don't fit in memory to disk instead of recomputing them on demand.
- DISK_ONLY: store the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
- OFF_HEAP: reduces GC overhead; allows executors to be smaller and to share a pool of memory.

- Storage levels are set by passing a StorageLevel object (Scala, Java, Python, R) to persist()
- The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (store deserialized objects in memory)
Partitions
- Each RDD is split into multiple partitions, computed on different nodes of the cluster
- More partitions in an RDD can provide more parallelism
- Tasks are executed on a partition
- Each HDFS block will require one partition
- Repartitioning can improve performance
Fault Tolerance and Recovery

- Spark automatically recovers from failed or slow nodes by re-executing failed or slow tasks
- Partitions are rebuilt on a different node if a node failure occurs
- Speculative execution can be utilized for slower nodes
- If RDD data is lost, Spark re-computes the missing partitions, transparently to the user
Step 1: Create RDDs

sc.textFile("data/salarydata.txt")
map(line => line.split(','))
map(line => (line(0), line(2).toInt))
reduceByKey { _ + _ }
collect()

This produces a chain of RDDs: HadoopRDD -> map() -> map() -> map() -> reduceByKey() -> collect()

- Pipeline as much as possible
- Split into "stages" based on the need to reorganize data
Step 2: Create Execution Plan

Stage 1 covers HadoopRDD -> map() -> map() -> map(): rows from salarydata.txt (Gender, Age, Salary, Zip Code - e.g. M/14/0/95102, M/3/0/95101, M/25/26000/94040, ...) are projected down to (Gender, Salary) pairs.

- Pipeline as much as possible
- Split into "stages" based on the need to reorganize data
Stage 2 covers reduceByKey() -> collect(): the (Gender, Salary) pairs are reduced to per-gender totals, e.g. F total salaries = 196171000 and M total salaries = 191825000.
Step 3: Schedule Tasks
- Split each stage into tasks
- A task is data + computation
- Execute all tasks within a stage before moving on

(Diagram: Stage 1's computation - the HadoopRDD plus the three map()s - is paired with each data partition: Task0 runs against host1:/partition0, Task1 against host2:/partition1.)

The default number of partitions is determined by the number of CPU cores.
The Shuffle

The reduceByKey() between Stage 1 and Stage 2 requires a shuffle:
- Spark creates intermediate buffers for storing shuffle output data
- The shuffle is sort-based: every map task writes out data to local disk, and reduce tasks make remote requests to fetch that data
- Buffers are used to store intermediate results of aggregations
- Spark attempts to limit the total amount of memory used in the shuffle

(Diagram: an unsorted in-memory table of (Name, EmpID) records - Susan/3230, Jeff/4432, Barbera/5643, ... - is written by map() as a partition-sorted file - Barbera/5643, Charles/6346, David/2346, ... - which the reduces then fetch.)
Execute reduceByKey() and collect()

reduceByKey() is similar to reduce: use it to combine values.
- Runs several parallel reduce operations, one per key
- Each operation combines values that have the same key
- Returns a new RDD consisting of each key and its reduced value
- Aggregates data separately for each key

The collect() function retrieves the entire RDD: the (Gender, Salary) pairs reduce to F total salaries = 196171000 and M total salaries = 191825000.
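The reduceByKey { _ + _ } step above has a direct analogue in plain Java: group the records by key, then fold each group's values with addition. The rows here are a small invented sample, not the course's salarydata.txt.

```java
import java.util.*;
import java.util.stream.*;

/** Plain-Java analogue of the per-gender salary reduceByKey above. */
public class ReduceByKeySketch {
    /** reduceByKey: group records by key, then fold each group's values with (+). */
    public static Map<String, Integer> totalsByKey(List<Map.Entry<String, Integer>> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = List.of(
            Map.entry("M", 0), Map.entry("M", 26000), Map.entry("F", 61000),
            Map.entry("M", 65000), Map.entry("F", 96000));
        Map<String, Integer> totals = totalsByKey(rows);
        System.out.println(totals.get("F") + " " + totals.get("M")); // 157000 91000
    }
}
```

The difference in Spark is that the grouping happens across the cluster via the shuffle described earlier, with per-key reduction running in parallel on each partition before and after the data moves.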
spark-shell and pyspark shell

Apache Zeppelin

Spark SQL

DataFrames
Interfaces to Spark SQL

(Diagram: JDBC, the console, and user programs in Java, Scala, Python, and R all go through Spark SQL - the DataFrame API and the Catalyst Optimizer - which runs on Spark's Resilient Distributed Datasets.)
SQLContext

The most powerful way to use Spark SQL is inside a Spark application: load data and query it with SQL while simultaneously combining it with "regular" program code, utilizing a SQLContext or HiveContext.

// Import Spark SQL (use this if you can't have the Hive dependencies)
import org.apache.spark.sql.SQLContext

// Construct a SQL context
val sqlContext = new SQLContext(…)

// Import Spark SQL with Hive support
import org.apache.spark.sql.hive.HiveContext

// Construct a Hive context
val hiveContext = new HiveContext(…)
HiveContext (Recommended)

- Provides a superset of the functionality of the basic SQLContext
- Write queries using the more complete HiveQL parser
- Access to Hive UDFs and the ability to read data from Hive tables
- Build DataFrames (representing structured data), and operate on them with SQL or with normal RDD operations like map
- Does not require an existing Hive setup
- Future releases will focus on bringing SQLContext up to feature parity with HiveContext
DataFrames

DataFrames offer rich relational/procedural integration within Spark programs:
- Collections of structured records that can be manipulated using Spark's procedural API or the new relational API
- Perform relational operations on DataFrames using a domain-specific language (DSL) similar to R data frames and Python pandas
- Pass Scala, Java, or Python functions through DataFrames to build a logical plan
- Create directly from Spark's distributed objects
- Enable relational processing in existing Spark programs

DataFrames automatically store data in a columnar format, go through a relational optimizer (Catalyst), and serve as the standard data representation in the new "ML pipeline" API for machine learning.
Create DataFrames
• With a HiveContext or SQLContext, applications can create DataFrames from:
  • An existing RDD
  • A Hive table
  • Data Sources
• Create a DataFrame from the content of a CSV file:

// case class is used here to define a schema/structure for the underlying data
case class SalaryData(gender: String, age: Int, salary: Int, zip: Int)

val rawSalaryData = sc.textFile("data/salarydata.txt")

val salaryData = rawSalaryData.map(line => {
  val fields = line.split(',')
  SalaryData(fields(0), fields(1).toInt, fields(2).toInt, fields(3).toInt)
}).toDF()
Operations on DataFrames
Common DataFrame functions:
df.show()
df.printSchema()
df.select()
df.filter()
df.groupBy().count()

Example:
salaryData.printSchema()
salaryData.show()

val genderSalaryData = salaryData.select(salaryData("gender"), salaryData("salary")).cache()

genderSalaryData.groupBy("gender").sum("salary").show()
genderSalaryData.groupBy("gender").avg("salary").show()
114
Tungsten
The largest change to Spark's execution engine since the project's inception
Focuses on substantially improving the efficiency of memory and CPU for Spark
applications
Includes three initiatives:
1. Memory Management and Binary Processing: leveraging application semantics to manage
memory explicitly and eliminate the overhead of the JVM object model and garbage collection
2. Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
3. Code generation: using code generation to exploit modern compilers and CPUs
115
1. Memory Management and Binary Processing
Considerations:
– Object overhead
– GC's inefficiency

Consider a simple string "Mike":
Size  Description
4     (object header)
4     (object header)
4     (object header)
4     char[] String.value
4     int String.hash
4     int String.hash2
12-byte object header
20 bytes of overhead + 8 bytes of chars
8-byte hashcode
Total: 48 bytes

Solutions:
– Explicit memory manager: sun.misc.Unsafe
– Tungsten's UnsafeRow format
– Spark's new binary map
116
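The slide's accounting can be checked with quick arithmetic; the grouping of the terms below follows the slide, while the exact split inside the char[] overhead is an assumption about the JVM's usual layout:

```python
# Overhead of the 4-character JVM string "Mike", per the slide's accounting
object_header  = 12                # three 4-byte (object header) words
hash_fields    = 4 + 4             # int String.hash + int String.hash2
array_overhead = 20                # char[] reference + array header + length (assumed split)
char_data      = 2 * len("Mike")   # UTF-16: 2 bytes per character
total = object_header + hash_fields + array_overhead + char_data
print(total)  # 48
```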
3. Code Generation
Uses the Janino compiler to reduce code generation time
Speeds up the conversion of data from the in-memory binary format to the wire protocol for
shuffle
Increases the throughput of serialization
Generates specialized bytecode for the schema

2x faster shuffle than Kryo
Catalyst Optimizer
118
Based on functional programming constructs in Scala
Supports both rule-based and cost-based optimization
Extensible for two reasons:
1. Easy to add new optimization techniques and features to Spark SQL
2. Enables external developers to extend the optimizer
Contains a general library for representing trees and applying rules to manipulate them
Libraries specific to relational query processing and sets of rules that handle query execution
Offers several public extension points:
– External Data Sources
– User-Defined Types
119
Query Planning
Catalyst turns a SQL query or DataFrame into RDDs in four phases:
1. Analysis: the unresolved logical plan is resolved against the Catalog, producing a logical plan
2. Logical Optimization: the logical plan is rewritten into an optimized logical plan
3. Physical Planning: candidate physical plans are generated, and a cost model selects one
4. Code Generation: the selected physical plan is compiled into code that runs on RDDs

SQL Query / DataFrame -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan
-> Physical Plans -> Selected Physical Plan -> RDDs
Query Federation to External Databases
• Data pipelines often combine data from heterogeneous sources
• Spark SQL data sources leverage Catalyst to push predicates down into the data sources whenever possible

Example: Use the JDBC data source and the JSON data source to join two tables together

CREATE TEMPORARY TABLE users
USING jdbc
OPTIONS(driver "mysql", url "jdbc:mysql://userDB/users")

CREATE TEMPORARY TABLE logs
USING json
OPTIONS(path "logs.json")

SELECT users.id, users.name, logs.message
FROM users JOIN logs
WHERE users.id = logs.userId
  AND users.registrationDate > "2015-01-01"
User-Defined Functions
• Register custom functions in Python, Java, or Scala
• Supports native UDFs and existing Apache Hive UDFs
• The standard Hive UDFs are already automatically included

val udf = (text: String, topic: String) => {
  text match {
    case s if s matches s".*(?i)$topic.*" => true
    case _ => false
  }
}
hiveContext.udf.register("contains_topic", udf)
hiveContext.sql("select count(*) from newsgroups where contains_topic(subject, 'alaska')").show()
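A plain-Python equivalent of the Scala UDF above, using the same case-insensitive regex idea (`contains_topic` here is just a local function, not registered with Hive):

```python
import re

def contains_topic(text, topic):
    """Case-insensitive check that `topic` occurs somewhere in `text`,
    mirroring the Scala pattern s".*(?i)$topic.*" from the slide."""
    return re.search(topic, text, re.IGNORECASE) is not None

print(contains_topic("Trip report: ALASKA ferries", "alaska"))  # True
print(contains_topic("Texas barbecue", "alaska"))               # False
```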
Spark SQL DataFrames Demo
Spark Streaming
Spark Streaming
• A Spark component that enables processing live streams of data
• Streams can be created from various input sources: Flume, Kafka, HDFS/S3
• Integrates with MLlib, SQL, DataFrames, and GraphX
• Provides an abstraction called DStreams: a sequence of data arriving over time
• Provides an API for manipulating data streams that closely matches Spark Core's RDD API
• Designed to provide the same degree of fault tolerance, throughput, and scalability that Spark Core
provides

Input sources (Kafka, Flume, Kinesis, HDFS/S3, Twitter) -> Spark Streaming -> File Systems, Databases, Dashboards
Discretized Streams (DStreams)
• Treat a streaming computation as a series of deterministic
batch computations on small time intervals
• The basic abstraction provided by Spark Streaming:
1. Input data stream received from the source
2. Processed data stream generated by transforming the input stream
• Represented by a continuous series of RDDs

Data Streams -> batches of input data (x seconds) -> batches of processed data
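The micro-batch model can be sketched in plain Python; each batch stands in for the RDD produced in one time interval (the page-view data is hypothetical):

```python
def process_dstream(batches, transform):
    """Sketch of Spark Streaming's DStream model: a streaming
    computation is a series of deterministic batch computations,
    one per small time interval."""
    return [transform(batch) for batch in batches]

# Three x-second batches of page-view events (hypothetical)
batches = [["home", "about"], ["home"], ["home", "pricing", "about"]]
view_counts = process_dstream(batches, len)
print(view_counts)  # [2, 1, 3]
```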
Example: View Count Lineage Graph
Create a DStream called rsvps:

rsvps = meetupReceiver("http://...")
ones = rsvps.map(event => (event.event_url, 1))
counts = ones.updateStateByKey((a, b) => a + b)

Lineage: rsvps -> map -> ones -> reduce -> counts
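updateStateByKey is what makes counts like these cumulative across intervals; a minimal plain-Python sketch (the event URLs are made up):

```python
def update_state_by_key(batches, update):
    """Sketch of updateStateByKey: fold each batch's (key, value) pairs
    into running per-key state, yielding the state after every interval."""
    state = {}
    history = []
    for batch in batches:
        for key, value in batch:
            state[key] = update(state[key], value) if key in state else value
        history.append(dict(state))
    return history

batches = [
    [("meetup.com/a", 1), ("meetup.com/b", 1)],   # interval 1
    [("meetup.com/a", 1)],                        # interval 2
]
print(update_state_by_key(batches, lambda a, b: a + b)[-1])
# {'meetup.com/a': 2, 'meetup.com/b': 1}
```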
Input DStreams and Receivers
• Input DStreams
– DStreams representing the stream of input data received from streaming
sources
– In the previous example, rsvps was an input DStream, as it represented the stream
of data received over HTTP
• Receiver Object
– Receives the data from a source and stores it in Spark's memory for processing
– Associated with every input DStream (except file streams)
Transformation Operators on DStreams
• Produce a new DStream from one or more parent streams. Can be
either:
– Stateless: act independently on each interval
– Stateful: share data across intervals

Common Operations:
Stateless: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey()
Stateful: updateStateByKey(), window(), reduceByWindow(), reduceByKeyAndWindow()
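The stateless/stateful split can be illustrated with a windowed reduce in plain Python, a simplified stand-in for reduceByWindow (real Spark windows are expressed in time, not batch counts):

```python
from functools import reduce

def reduce_by_window(batches, func, window_len):
    """Sketch of a windowed reduce over a DStream: at each interval,
    combine the values of the last `window_len` batches."""
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_len + 1): i + 1]
        results.append(reduce(func, [v for batch in window for v in batch]))
    return results

print(reduce_by_window([[1], [2, 3], [4]], lambda a, b: a + b, window_len=2))
# [1, 6, 9]
```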
Output Operators on DStreams
• Let the program write data to external systems, e.g. save each RDD
to HDFS

Current Output Operators:
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the
streaming application. This is useful for development and debugging.
saveAsTextFiles(): Saves this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(): Saves this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(): Saves this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(): The most generic output operator; applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database.
DataFrame and SQL Operations
• Use DataFrames and SQL operations on streaming data
• Create a SQLContext using the SparkContext that the StreamingContext is using
• Each RDD is converted to a DataFrame, registered as a temporary table, and then queried
using SQL
• Run SQL queries on tables defined on streaming data from a different thread
MLlib Operations
• Use machine learning algorithms provided by MLlib
– Streaming machine learning algorithms can simultaneously learn from the streaming data
as well as apply the model on the streaming data:
  • Streaming Linear Regression
  • Streaming K-Means
• For a much larger class of machine learning algorithms:
– learn a model offline (i.e. using historical data)
– apply the model online on streaming data
Caching and Persistence
• Use the persist() method on a DStream to automatically persist every RDD of
that DStream in memory
• Useful if the data in the DStream will be computed multiple times:
– Window-based operations:
  • reduceByWindow
  • reduceByKeyAndWindow
– State-based operations:
  • updateStateByKey
Checkpointing
• Metadata checkpointing:
– Saving the information defining the streaming computation to fault-tolerant storage like HDFS
– Used to recover from failure of the node running the driver of the streaming application
– Metadata includes:
  • Configuration: the configuration that was used to create the streaming application
  • DStream operations: the set of DStream operations that define the streaming application
  • Incomplete batches: batches whose jobs are queued but have not completed yet
• Data checkpointing:
– Saving the generated RDDs to reliable storage
– Necessary in some stateful operations that combine data across multiple batches
Sources

Basic Sources:
Source: Usage
File Streams: streamingContext.fileStream()
Streams based on Custom Actors: streamingContext.actorStream()
Queue of RDDs as Stream: streamingContext.queueStream()
Socket Stream: socketTextStream()
Simple Text Files: textFileStream()

Advanced Sources:
Source: Artifact
Kafka: spark-streaming-kafka_2.10
Flume: spark-streaming-flume_2.10
Kinesis: spark-streaming-kinesis-asl_2.10 [Amazon Software License]
Twitter: spark-streaming-twitter_2.10
ZeroMQ: spark-streaming-zeromq_2.10
MQTT: spark-streaming-mqtt_2.10
Kafka Direct Stream API
• At the beginning of every batch interval, identify the range of offsets to consume
• When each batch's jobs are executed, the data corresponding to the offset ranges is
read from Kafka for processing
• Offsets are saved reliably (with checkpoints) and used to recompute the data to
recover from failures

Driver: queries the latest Kafka offsets and decides the offset ranges for the batch, then launches jobs using those offset ranges
Executors: read data from Kafka using the offset ranges in the jobs
140
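The driver-side decision above can be sketched in plain Python (partition names and offset numbers are made up; in real Spark the consumed offsets live in the direct stream's checkpoints):

```python
def next_offset_ranges(latest_offsets, committed_offsets):
    """Sketch of the Kafka direct approach: at the start of each batch,
    decide, per partition, the (start, end) range of offsets to consume."""
    return {partition: (committed_offsets.get(partition, 0), latest)
            for partition, latest in latest_offsets.items()}

ranges = next_offset_ranges({"topic-0": 120, "topic-1": 80},
                            {"topic-0": 100, "topic-1": 80})
print(ranges)  # {'topic-0': (100, 120), 'topic-1': (80, 80)}
```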
Spark Streaming Lab
Hive
Hive Architecture
Hive components: HiveServer2, Compiler, Optimizer, Executor, and a Metastore DB,
accessed via the CLI, JDBC/ODBC, or the Web UI.
1. User issues a SQL query
2. Hive parses and plans the query
3. The query is converted to MapReduce
and executed on Hadoop, as a YARN job across the NodeManagers
143
Comparing: Hive/MR vs. Hive/Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Hive/MR runs this as a chain of MapReduce jobs (SELECT b.id; JOIN(a, b); SELECT c.price;
JOIN(a, c); GROUP BY a.state with COUNT(*) and AVERAGE(c.price)), writing intermediate
results to HDFS between jobs.
Hive/Tez runs the same plan as a single DAG of map and reduce stages: Tez avoids unneeded writes
to HDFS.
144
Defining a Hive-Managed Table

CREATE TABLE customer (
  customerID INT,
  firstName STRING,
  lastName STRING,
  birthday TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
145
Defining an External Table

CREATE EXTERNAL TABLE salaries (
  gender string,
  age int,
  salary double,
  zip int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
146
Defining a Table LOCATION

CREATE EXTERNAL TABLE SALARIES (
  gender string,
  age int,
  salary double,
  zip int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';
147
Loading Data into Hive

LOAD DATA LOCAL INPATH '/tmp/customers.csv'
OVERWRITE INTO TABLE customers;

LOAD DATA INPATH '/user/train/customers.csv'
OVERWRITE INTO TABLE customers;

INSERT INTO TABLE birthdays
  SELECT firstName, lastName, birthday
  FROM customers
  WHERE birthday IS NOT NULL;
148
Performing Queries

SELECT * FROM customers;

FROM customers
  SELECT firstName, lastName, address, zip
  WHERE orderID > 0
  ORDER BY zip;

SELECT customers.*, orders.*
  FROM customers
  JOIN orders ON
  (customers.customerID = orders.customerID);
149
Hive Partitions
• Use the partitioned by clause to define a partition when creating a table:

create table employees (id int, name string, salary double)
partitioned by (dept string);

• Subfolders are created based on the partition values:
/apps/hive/warehouse/employees
    /dept=hr/
    /dept=support/
    /dept=engineering/
    /dept=training/
150
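The subfolder layout above is purely mechanical; a small sketch (the warehouse path and partition values match the slide, the helper function itself is hypothetical):

```python
def partition_path(warehouse, table, partitions):
    """Sketch of how Hive lays out partitioned data: one subfolder per
    partition column value (col=value) under the table directory."""
    suffix = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"{warehouse}/{table}/{suffix}"

print(partition_path("/apps/hive/warehouse", "employees", {"dept": "hr"}))
# /apps/hive/warehouse/employees/dept=hr
```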
Hive Join Strategies

Shuffle Join
  Approach: Join keys are shuffled using MapReduce, and joins are performed on the reduce side.
  Pros: Works regardless of data size or layout.
  Cons: Most resource-intensive and slowest join type.

Map (Broadcast) Join
  Approach: Small tables are loaded into memory on all nodes; the mapper scans through the large table and joins.
  Pros: Very fast; a single scan through the largest table.
  Cons: All but one table must be small enough to fit in RAM.

Sort-Merge-Bucket Join
  Approach: Mappers take advantage of co-location of keys to do efficient joins.
  Pros: Very fast for tables of any size.
  Cons: Data must be sorted and bucketed ahead of time.
151
Columnar Storage for Hive
A large block size is ideal for MapReduce, and the columnar format enables
high compression and high performance.

An ORC file is a series of ~250 MB stripes, each containing Index Data, Row Data
(stored column by column: Column 1, Column 2, ... Column 8), and a Stripe Footer;
the file ends with a File Footer and a Postscript.
Query Planner
• The Cost-Based Optimizer (CBO) uses statistics from Hive tables to produce
optimal query plans

Why cost-based optimization?
– Ease of use: join reordering
– Reduces the need for specialists to tune queries
– More efficient query plans lead to better cluster utilization

set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
Transactions
Tasks write changes to Delta Files alongside read-optimized ORC files;
a merge compacts the delta files into a new read-optimized ORC file.
154
Hive Demo
155
Security

Authentication: Who am I? Can I prove it? Control access to the cluster.
  Mechanisms: Kerberos, Knox Gateway
Authorization: What am I allowed to do? Restrict access to explicit data.
  Mechanisms: Core Hadoop (MR ACLs, HDFS perms, Hive & HBase authz), Apache Ranger
Data Protection: Protect sensitive data at rest & in-motion.
  Mechanisms: Core Hadoop, OS-level encryption
Audit: Who did what, when & with what results?
Kerberos Component Architecture
Use existing directory tools (AD/LDAP) to manage users. The KDC contains host,
service, and user principals, and ACLs for these principals.

Step 1: The client logs in and gets a TGT ticket from the KDC.
Step 2: The TGT is stored in the client's Kerberos ticket cache.
Step 3: The client requests access to the NameNode service with the TGT, and the KDC returns an NN-ST (service ticket).
Step 4: The NN-ST is stored in the cache.
Step 5: A read/write action on a file, with the NN-ST, is sent to the NameNode. If access is authorized, the NameNode returns block locations, block IDs, and block access tokens.
Step 6: The client reads/writes the block, given the block access token and block ID.

Services: NN, HIVE, JT, DN
Authorization & Audit
Controlling Who Can Do What
• Access control services exist for each of the Hadoop components
– HDFS has file permissions
– Higher-level services provide access control rules
  • MapReduce, HBase, and Hive all have Access Control Lists (ACLs)
• Improvements to these services are being led by the Hortonworks team:
– HDFS improvements: more flexible permissions for HDFS via multiple policies on
the same file or directory
– Hive improvements: a Hortonworks initiative called Hive ATZ-NG allows familiar
SQL/database syntax (GRANT/REVOKE) and allows more clients to be secure.
HDFS File Permissions Example
• Authorization requirements:
  • In a sales department, a single user, Maya (Department Manager), should control all
modifications to sales data.
  • Other members of the sales department need to view the data, but cannot modify it.
  • Everyone else in the company must not be allowed to view the data.
• Can be implemented with standard HDFS file permissions: Maya owns the data with
read/write access, the sales group has read-only access, and others have no access.
HDFS Extended ACLs
• The Problem
  • It is no longer feasible for Maya to control all modifications to the file
    - New requirement: Maya, Diane and Clark are allowed to make modifications
    - New requirement: a new group called executives should be able to read the sales data
  • The current permissions model only allows permissions for 1 group and 1 user
• HDFS Extended ACLs solve this issue
  • Now assign different permissions to different users and groups
    (an HDFS directory can carry entries for its owner, group, and others, plus additional named users and groups)
HDFS Extended ACLs (1 of 2)
New tools for ACL management (setfacl, getfacl):

hdfs dfs -setfacl -m group:execs:r-- /sales-data
hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: bruce
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---

How do you know if a directory has ACLs set? The '+' after the permissions flags it:

hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+  3 bruce sales          0 2014-03-04 16:31 /sales-data
HDFS Extended ACLs (2 of 2)
Default ACLs:

hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data
hdfs dfs -mkdir /monthly-sales-data/JAN
hdfs dfs -getfacl /monthly-sales-data/JAN

# file: /monthly-sales-data/JAN
# owner: bruce
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
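A simplified plain-Python sketch of how an extended ACL like the /sales-data one grants read access; real HDFS evaluation has more cases (default ACLs, execute bits, superuser), and the dictionary ACL representation here is purely illustrative:

```python
def can_read(acl, user, user_groups):
    """Simplified ACL check: owner entry first, then named users,
    then owning/named groups (both limited by the mask), then other."""
    if user == acl["owner"]:
        return "r" in acl["user"]
    if user in acl["named_users"]:
        return "r" in acl["named_users"][user] and "r" in acl["mask"]
    for g in user_groups:
        if g == acl["owning_group"] and "r" in acl["group"] and "r" in acl["mask"]:
            return True
        if g in acl["named_groups"] and "r" in acl["named_groups"][g] and "r" in acl["mask"]:
            return True
    return "r" in acl["other"]

# Mirrors the /sales-data ACL shown on the slide
sales_acl = {"owner": "bruce", "owning_group": "sales",
             "user": "rw-", "group": "r--", "other": "---",
             "named_users": {}, "named_groups": {"execs": "r--"},
             "mask": "r--"}
print(can_read(sales_acl, "bruce", ["sales"]))      # True (owner)
print(can_read(sales_acl, "diane", ["execs"]))      # True (named group)
print(can_read(sales_acl, "clark", ["marketing"]))  # False (falls through to other)
```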
Encryption
• Data at Rest
– Option 1: Transparent Encryption (built-in)
– Option 2: OS-level Encryption (out of the box)
– Option 3: Custom development (experience doing it)
– Option 4: Certified partners (e.g. Protegrity, Voltage Security)
• Data on the Wire
– All wire protocols can be encrypted by the HDP platform (2.0+). Wire-level
encryption enhancements are led by the Hortonworks team.
• Column Level
– No current out-of-the-box support in Hadoop
– Certified partners like Voltage Security provide these capabilities
Encrypting Data at Rest in HDFS

• Transparent Encryption (HDFS): provides a solution for data encryption at a
higher level than the OS, while remaining native and transparent to Hadoop;
no third-party solution required.
• LUKS (file system/volume): Linux Unified Key Setup-on-disk-format (LUKS) allows you to
encrypt partitions on Linux; a volume encryption technology.
Advantage: reduced complexity, since encryption is done seamlessly by the OS.
• Voltage Security (HDFS): encryption vendor/partner provides seamless encryption of
data at the MR level via Pig and Hive UDFs.
Advantage: format-preserving encryption.
• Protegrity (HDFS): provides encryption for the full stack.
Advantage: key management, centralized authorization policies.
Data In-Motion: Points of Communication

Client to cluster:
1. WebHDFS (REST)
2. Data Transfer (TCP/IP)
3. Hadoop client (RPC)
4. Web UI (HTTP)

Within the cluster (node to node):
2. Data Transfer (TCP/IP)
3. Node to Node (RPC)
4. M/R Shuffle (HTTP)
Encrypting Data In-Motion

• REST: WebHDFS (client to cluster), client to Knox.
Mechanisms: REST over SSL; Knox Gateway SSL; SPNEGO provides a mechanism for extending Kerberos
to web applications through the standard HTTP protocol.
• HTTP: NameNode/ResourceManager UI, MapReduce shuffle.
Mechanisms: HTTPS; Encrypted MapReduce Shuffle (MAPREDUCE-4117).
• RPC: Hadoop client (client to cluster, intra-cluster).
Mechanism: SASL; the Hadoop RPC system implements SASL, which provides different QoP options including encryption.
• TCP/IP: data transfer (client to cluster, intra-cluster).
Mechanisms: Encrypted Data Transfer Protocol, available in HDP 2 (HDFS-3637); adding SASL support to the DataTransferProtocol.
170
Security Lab