Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda...

57
© 2015 MapR Technologies 1 ® © 2015 MapR Technologies Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03

Transcript of Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda...

Page 1: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2015 MapR Technologies 1

®

© 2015 MapR Technologies

Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03

Page 2: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2015 MapR Technologies 2 © 2014 MapR Technologies ®

Polyglot Processing

Page 3: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Polyglot Processing

•  Combination of different

processing engines over

DFS/NoSQL stores

•  Lambda and Kappa

architectures are two

prominent examples http://datadventures.ghost.io/2014/07/06/polyglot-processing/

Page 4: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Fault tolerance

hardware

software

developer

?

Page 5: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2014 MapR Technologies 5 © 2014 MapR Technologies ®

Let’s talk about developers…

Page 6: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

http://xkcd.com/327/

Page 7: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the
Page 8: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the
Page 9: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2015 MapR Technologies 9 © 2014 MapR Technologies ®

human fault tolerance Let’s talk about developers…

Page 10: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Human fault tolerance

Page 11: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

When things go wrong …

http

://al

lface

book

.com

/the-

real

-rea

son-

face

book

-wen

t-dow

n-ye

ster

day-

its-c

ompl

icat

ed_b

1936

6

2010 unfortunate handling of error condition

Page 12: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

When things go wrong …

2012 cascaded bug

http

://m

oney

.cnn

.com

/201

2/06

/21/

tech

nolo

gy/tw

itter

-dow

n/in

dex.

htm

Page 13: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

When things go wrong …

http

://w

ww

.v3.

co.u

k/v3

-uk/

new

s/21

9657

7/rb

s-ta

kes-

gbp1

25m

-hit-

over

-it-o

utag

e

2012 upgrade of batch processing

Page 14: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

When things go wrong …

http

://w

ww

.and

roid

cent

ral.c

om/g

oogl

e-ex

plai

ns-r

easo

ns-b

ehin

d-to

day-

s-30

-min

ute-

serv

ice-

outa

ge

2014 bug/bad config

Page 15: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2014 MapR Technologies 15 © 2014 MapR Technologies ®

Lambda Architecture to the rescue!

Page 16: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Let’s step back a bit …

•  Nathan Marz (Backtype, Twitter, stealth startup)

•  Creator of … –  Storm –  Cascalog –  ElephantDB

http://manning.com/marz/

Page 17: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Requirements •  Fault-tolerant against both hardware failures and human errors

•  Support variety of use cases that include low latency querying as well as updates

•  Linear scale-out capabilities

•  Extensible, so that the system is manageable and can accommodate newer features easily

Page 18: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Concept

•  Latency—the time it takes to run a query

•  Timeliness—how up to date the query results are (! consistency)

•  Accuracy—tradeoff between performance and scalability

(! approximations)

query = function(all data)

Page 19: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture

NEW DATA STREAM QUERY

BATCH VIEWS

√ View 1 View 2 View N

REAL-TIME VIEWS

BATCH LAYER

SERVINGLAYER

SPEED LAYER

MERGE

IMMUTABLE MASTER DATA

PRECOMPUTE VIEWS BATCH

RECOMPUTE

PROCESS STREAM

INCREMENT VIEWS REAL-TIME

INCREMENT

View 1 View 2 View N

Page 20: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Layers

•  Batch layer –  managing the master dataset, an immutable, append-only set of raw data –  pre-computing arbitrary query functions, called batch views

•  Serving layer indexes batch views so that they can be queried in ad hoc with low latency

•  Speed layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, deals with recent data only

Page 21: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Compensate Batch

time

not absorbed

now

Page 22: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Immutable Data + Views

http://openflights.org

Page 23: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Immutable Data + Views

timestamp airport flight action timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off 2014-01-01T10:05:00 HEL SAS45 take-off

timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off 2014-01-01T10:05:00 HEL SAS45 take-off 2014-01-01T10:07:00 AMS BA99 take-off

timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off 2014-01-01T10:05:00 HEL SAS45 take-off 2014-01-01T10:07:00 AMS BA99 take-off 2014-01-01T10:09:00 LHR LH17 landing

timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off 2014-01-01T10:05:00 HEL SAS45 take-off 2014-01-01T10:07:00 AMS BA99 take-off 2014-01-01T10:09:00 LHR LH17 landing 2014-01-01T10:10:00 CDG AF03 landing

timestamp airport flight action 2014-01-01T10:00:00 DUB EI123 take-off 2014-01-01T10:05:00 HEL SAS45 take-off 2014-01-01T10:07:00 AMS BA99 take-off 2014-01-01T10:09:00 LHR LH17 landing 2014-01-01T10:10:00 CDG AF03 landing 2014-01-01T10:10:00 FCO AZ501 take-off

immutable master dataset

Page 24: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Lambda Architecture—Immutable Data + Views

timestamp airport flight action

2014-01-01T10:00:00 DUB EI123 take-off

2014-01-01T10:05:00 HEL SAS45 take-off

2014-01-01T10:07:00 AMS BA99 take-off

2014-01-01T10:09:00 LHR LH17 landing

2014-01-01T10:10:00 CDG AF03 landing

2014-01-01T10:10:00 FCO AZ501 take-off

immutable master dataset

views

airport planes

AMS 69

CDG 44

DUB 31

FCO 10

HEL 17

LHR 101

airport load: airline planes

AF 59

AZ 23

BA 167

EI 19

LH 201

SAS 28

air-borne per airline:

air-borne: 2307

Page 25: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2014 MapR Technologies 25 © 2014 MapR Technologies ®

Implementing the Lambda Architecture

Page 26: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®© 2014 MapR Technologies 26

Page 27: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Page 28: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

How about an integrated approach?

•  Twitter Summingbird •  Lambdoop •  Apache Spark

Page 29: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Apache Spark

Page 30: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Apache Spark •  Originally developed in 2009 in UC Berkeley’s AMP Lab •  As part of BDAS stack open sourced in 2010 •  Top-level Apache Project as of 2014

http://spark.apache.org/

Page 31: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Apache Spark—a unified platform …

Continued innovation bringing new functionality, such as:

•  Tachyon (Shared RDDs, off-heap solution) •  BlinkDB (approximate queries) •  SparkR (R wrapper for Spark)

Spark SQL (SQL/HQL)

Spark Streaming (stream processing)

MLlib (machine learning)

Spark (core execution engine)

GraphX (graph processing)

Mesos

Distributed File System (local FS, HDFS, S3, …)

YARN

Page 32: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Easy and fast Big Data

•  Easy to Develop –  Rich APIs available through

Java, Scala, Python –  Interactive shell

•  Fast to Run –  Advanced data storage model

(automated optimization between memory and disk)

–  General execution graphs 2-5× less code up to 10× faster on disk,

100× in memory

https://amplab.cs.berkeley.edu/benchmark/

Page 33: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

… for complex workloads …

•  Iterative Algorithms –  machine learning –  graph processing beyond DAG

•  Interactive Data Mining

•  Streaming Applications

Page 34: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

… across multiple datasources

•  Local Files –  file:///opt/httpd/logs/access_log

•  Object Stores (e.g. Amazon S3)

•  HDFS –  text files, sequence files, any other Hadoop InputFormat

•  Key-Value datastores (e.g. Apache HBase)

Page 35: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Easy: expressive API

map reduce

Page 36: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Easy: expressive API

map filter groupBy sort union join leftOuterJoin rightOuterJoin

reduce count fold reduceByKey groupByKey cogroup cross zip

sample take first

partitionBy

mapWith

pipe save ...

Page 37: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Easy: get started immediately

Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count()

Scala val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count()

Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count();

Page 38: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

… and scale as you go (mentally and physically)

YARN

Standalone

Page 39: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Resilient Distributed Datasets (RDD)

•  RDDs are the core of the Spark execution engine

•  Collections of elements that can be operated on in parallel

•  Persistent in memory between operations

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 40: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

RDD Operations

•  Lazy evaluation is key to Spark

•  Transformations –  Creation of a new dataset from an existing:

map, filter, distinct, union, sample, groupByKey, join, etc.

•  Actions –  Return a value after running a computation:

collect, count, first, takeSample, foreach, etc.

Page 41: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

RDD persistence

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence

Page 42: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Spark Streaming

•  High-level language operators for streaming data

•  Fault-tolerant semantics

•  Support for merging streaming data with historical data

http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 43: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs

43

Spark&

Spark&Streaming&

batches(of(X(seconds(

live(data(stream(

processed(results(

•  Chop up live stream into batches of X seconds

•  Spark treats each batch of data as RDDs and processes them using RDD operations

•  Finally, processed results of the RDD operations are returned in batches

Page 44: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs

44

•  Batch sizes as low as ½ second, latency of about 1 second

•  Potential for combining batch processing and streaming processing in the same system

Spark&

Spark&Streaming&

batches(of(X(seconds(

live(data(stream(

processed(results(

Page 45: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Spark Streaming Comparison •  Spark Streaming: 670k records/sec/node •  Storm: 115k records/sec/node •  Commercial systems: 100-500k records/sec/node

0(

10(

20(

30(

100( 1000(

Throughp

ut&per&nod

e&(M

B/s)&

Record&Size&(bytes)&

WordCount&

Spark(

Storm(

0(

20(

40(

60(

100( 1000(

Throughp

ut&per&nod

e&(M

B/s)&

Record&Size&(bytes)&

Grep&

Spark(

Storm(

Page 46: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

And in the real world?

Page 47: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Spotify: Collaborative Filtering with Spark

•  collaborative filter (ALS) for

music recommendation •  Hadoop suffers from I/O

overhead •  show a progression of code

rewrites, converting a Hadoop-based app into efficient use of Spark

www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark

Page 48: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Cisco: Security Intelligence Operations

Sensor data lands in MapR Spark Streaming on MapR for first check on known threats Data next processed on GraphX and Mahout Additional SQL querying done via Shark and Impala

Page 49: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Leading Pharma Company: NextGen Genomics

Existing process takes several weeks to align chemical compounds with genes

ADAM on Spark allows

realignment in a few hours

Geneticists can minimize engineering dependency

Page 50: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Where to go from here

Page 51: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

The book: Learning Spark

http://shop.oreilly.com/product/0636920028512.do

Page 52: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Apache Spark developer certificate program

http://oreilly.com/go/sparkcert

Page 53: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

http://lambda-architecture.net

Page 54: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

http://spark-stack.org

Page 55: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

MapR Sandbox with Spark

mapr.com/blog/getting-started-spark-mapr-sandbox

Page 56: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Conclusion

•  Let’s scale systems and humans

•  How? Lambda Architecture!

•  Apache Spark is an efficient way to implement Lambda Architecture

Page 57: Lambda Architecture with Apache Spark - Meetupfiles.meetup.com/18245106/Galway Data Meetup - Lambda Architectu… · Lambda Architecture—Layers • Batch layer – managing the

®

Q & A

@mhausenblas maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies