Transcript of "Spark Engine: In-Memory Hadoop Cloud Computing" (Artjom Lind, 22.10.2013)

Page 1

Artjom Lind (artjom.lind@ut.ee), 22.10.2013, https://ds.cs.ut.ee
Hajussüsteemide Uurimislabor / Distributed Systems Research Lab

Spark Engine: In-Memory Hadoop Cloud Computing (speaker: Artjom Lind)
(slides adapted from http://spark.incubator.apache.org)

Page 2

Motivation

● Hadoop programming is based on storage-to-storage data flow (acyclic data flow).
● Advantage: automatic decisions about where to run tasks, and recovery from failures.

[Figure: acyclic data flow, Input Data feeding parallel Map tasks whose outputs feed Reduce tasks that produce the Output Data]

Page 3

Motivation

● … which becomes an I/O overhead for applications that reuse some part of the input data:
– Iterative algorithms (machine learning, graph algorithms)
– Interactive data mining tools, such as Matlab, R, or Python based
● All of the above do the same thing (from Hadoop's perspective):
– Reload the data from storage on each query

Page 4

Iterative algorithms

● Using the common data source:
● Passing the result to the next iteration by offloading intermediate results to storage:

[Figure: Iter1..Iter3 each re-reading Input Data from storage (top); Iter1 and Iter2 writing intermediate results back to storage on the way to Output Data (bottom)]

Page 5

Taking advantage of RAM

● Keeping the "currently in process" set in distributed RAM:
● Passing intermediate results along via distributed RAM:

[Figure: Input Data loaded once into DRAM and shared by Iter1..Iter3 (top); intermediate results passed from Iter1 to Iter2 through DRAM (bottom)]

Page 6

Distributed RAM features

● It has to be:
– Fault-tolerant
– Efficient in large commodity clusters
● Lots of nodes, each with local RAM
● How to aggregate it into a distributed RAM?
● How can the API provide these features?

Page 7

Distributed RAM features

● Existing distributed storage abstractions offer an interface based on fine-grained updates
– Reads and writes to cells in a table
– E.g. key-value stores, databases, distributed memory
● This requires replicating data or update logs across nodes for fault tolerance
– Which is expensive for data-intensive applications

Page 8

RDD: Resilient Distributed Datasets

● Offer an interface based on coarse-grained transformations (e.g. map, group-by, join)
● Allow efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions of a dataset on failure
– No cost if nothing fails
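The recompute-on-failure idea can be sketched in a few lines of plain Scala. This is a toy model, not Spark's actual classes: a partition remembers its parent data and the logged coarse-grained operation, so a lost copy is rebuilt by replaying the operation rather than restored from a replica.

```scala
// Toy lineage-based partition: keeps the parent data and the single
// logged operation, recomputes on demand instead of replicating.
case class LineagePartition[A, B](parent: Seq[A], op: A => B) {
  private var cached: Option[Seq[B]] = None

  def data: Seq[B] = cached.getOrElse {
    val recomputed = parent.map(op)   // replay the logged operation
    cached = Some(recomputed)
    recomputed
  }

  def lose(): Unit = cached = None    // simulate losing the partition
}

val part = LineagePartition(Seq(1, 2, 3), (x: Int) => x * 10)
val first = part.data     // computed once
part.lose()               // node failure: in-memory copy gone
val rebuilt = part.data   // recomputed from lineage, no replica needed
```

Note that the lineage log is tiny (one function per transformation, not one entry per element), which is why this is cheap when nothing fails.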

Page 9

Recovery

● In case of a common data source:
– … we go to the storage directly if an RDD failed
● In case of a sequential bypass:
– … there is nothing else to do, we have to recompute everything

[Figure: Iter1..Iter3 sharing a common DRAM working set (top); Iter1 and Iter2 chained through per-step DRAM (bottom)]

Page 10

Recovery

● In case of a common data source:
– … we go to the storage directly if an RDD failed
● In case of a sequential bypass:
– … there is nothing else to do, we have to recompute everything

[Figure: the same two DRAM diagrams as on the previous slide]

However, we win in performance if nothing failed.

Page 11

Generality of RDDs

● Despite the coarse-grained interface, RDDs can express surprisingly many parallel algorithms
– These naturally apply the same operation to many items
● They unify many current programming models
– Data flow models: MapReduce, Dryad, SQL, …
– Specialized tools for iterative apps: BSP (Pregel), iterative MapReduce (Twister), incremental processing (CBP)
● They also support new apps that these models don't

Page 12

Spark API

● Language-integrated API in Scala
● Can be used interactively from the Scala interpreter
● Provides:
– Resilient distributed datasets (RDDs)
– Transformations to build RDDs (map, join, filter, …)
– Actions on RDDs (return a result to the program or write data to storage: reduce, count, save, …)
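The transformation/action split can be illustrated with a small stand-in for the API (a hypothetical MiniRDD backed by a local collection, not Spark's real classes): transformations only record a computation, while actions force it to run.

```scala
// Toy model of the Spark API shape: map/filter are lazy
// (they build a new description), count/reduce are eager actions.
class MiniRDD[A](compute: () => Seq[A]) {
  def map[B](f: A => B): MiniRDD[B] =
    new MiniRDD(() => compute().map(f))            // lazy: nothing runs yet
  def filter(p: A => Boolean): MiniRDD[A] =
    new MiniRDD(() => compute().filter(p))         // lazy
  def count(): Int = compute().size                // action: runs the chain
  def reduce(f: (A, A) => A): A = compute().reduce(f)  // action
}

val nums = new MiniRDD(() => Seq(1, 2, 3, 4, 5))
val evens = nums.filter(_ % 2 == 0).map(_ * 10)    // still nothing computed
val n = evens.count()                              // evaluation happens here
val sum = evens.reduce(_ + _)
```

In real Spark the deferred chain is a lineage graph over distributed partitions, but the user-facing shape is the same.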

Page 13

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count

[Figure: a Driver coordinating three Workers; each Worker caches one block of the source (Block1..Block3) in its DRAM]

Page 14

Fault tolerance

RDDs preserve lineage information that can be used to reconstruct lost partitions:

errors = lines.filter(_.startsWith("ERROR"))
              .map(_.split('\t')(2))

HDFS File -> [_.startsWith("ERROR")] -> Filtered RDD -> [_.split('\t')(2)] -> Mapped RDD

Page 15

Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Figure: scatter plot of + and - points, with a random initial line being moved toward the target separating line]

Page 16

Code

val data = spark.textFile(...).map(readPoint).persist()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce((a, b) => a + b)
  w -= gradient
}

println("Final w: " + w)

Page 17

Performance

[Figure: performance chart (not captured in the transcript)]

Page 18

RDDs in detail

● RDDs additionally provide:
– Control over partitioning (e.g. hash-based), which can be used to optimize data placement across queries
– Control over persistence (e.g. store on disk vs. in RAM)
– Fine-grained reads (treat an RDD as a big table)
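Hash-based partitioning, for instance, just maps each key deterministically to a partition id; a simplified plain-Scala version of what a hash partitioner does (not Spark's exact implementation) shows why it helps: equal keys always land in the same partition, so two datasets partitioned the same way can be joined without a shuffle.

```scala
// Simplified hash partitioner over local data.
def hashPartition[K](key: K, numPartitions: Int): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h   // hashCode may be negative
}

val pairs = Seq("spark" -> 1, "hadoop" -> 2, "mesos" -> 3, "spark" -> 4)

// Group records by their assigned partition id:
val placed = pairs.groupBy { case (k, _) => hashPartition(k, 4) }

// Every "spark" record gets the same partition id:
val sparkIds = pairs.collect { case (k, _) if k == "spark" => hashPartition(k, 4) }
```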

Page 19

RDDs vs. Distributed Shared Memory

Concern              | RDDs                                         | Distr. Shared Mem.
---------------------|----------------------------------------------|-----------------------------------------------
Reads                | Fine-grained                                 | Fine-grained
Writes               | Bulk transformations                         | Fine-grained
Consistency          | Trivial (immutable)                          | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage  | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution         | Difficult
Work placement       | Automatic based on data locality             | Up to app (but runtimes aim for transparency)

Page 20

Spark Applications

● EM algorithm for traffic prediction (Mobile Millennium)
● In-memory OLAP & anomaly detection (Conviva)
● Twitter spam classification (Monarch)
● Alternating least squares matrix factorization
● Pregel on Spark (Bagel)
● SQL on Spark (Shark)

Page 21

Mobile Millennium App

Estimate city traffic using position observations from "probe vehicles" (e.g. GPS-carrying taxis).

Page 22

Implementation

Expectation maximization (EM) algorithm using iterated map and groupByKey on the same data.

3x faster with in-memory RDDs.

Page 23

Conviva GeoReport

Aggregations on many keys with the same WHERE clause.

The 40x gain comes from:
– Not re-reading unused columns or filtered records
– Avoiding repeated decompression
– In-memory storage of deserialized objects

[Figure: bar chart of query response time, Spark: 0.5 vs. Hive: 20]

Page 24

Implementation

Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps.

It can read from any Hadoop input source (HDFS, S3, …).

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster nodes]

~10,000 lines of code, no changes to Scala.

Page 25

RDD Representation

Simple common interface:
– Set of partitions
– Preferred locations for each partition
– List of parent RDDs
– Function to compute a partition given its parents
– Optional partitioning info

This captures a wide range of transformations, and users can easily add new ones.
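The interface listed above is small enough to sketch directly. A hypothetical distillation in plain Scala (the names mirror the listed components, not Spark's exact signatures) also shows why new transformations are easy: each one only has to say how a single partition is computed from its parents.

```scala
// Hypothetical distillation of the common RDD interface.
trait SimpleRDD[T] {
  def partitions: Seq[Int]                             // set of partitions
  def preferredLocations(partition: Int): Seq[String]  // locality hints
  def parents: Seq[SimpleRDD[_]]                       // lineage
  def compute(partition: Int): Seq[T]                  // derive one partition
}

// A leaf RDD backed by pre-placed blocks of data.
class SourceRDD(data: Map[Int, Seq[Int]]) extends SimpleRDD[Int] {
  def partitions = data.keys.toSeq.sorted
  def preferredLocations(p: Int) = Seq(s"node$p")
  def parents = Seq.empty
  def compute(p: Int) = data(p)
}

// A new transformation: only compute() does real work.
class MappedRDD[A, B](parent: SimpleRDD[A], f: A => B) extends SimpleRDD[B] {
  def partitions = parent.partitions
  def preferredLocations(p: Int) = parent.preferredLocations(p)
  def parents = Seq(parent)
  def compute(p: Int) = parent.compute(p).map(f)
}

val src = new SourceRDD(Map(0 -> Seq(1, 2), 1 -> Seq(3, 4)))
val doubled = new MappedRDD(src, (x: Int) => x * 2)
val part0 = doubled.compute(0)
```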

Page 26

Language integration

● Scala closures are Serializable Java objects
– Serialize on the driver, load & run on workers
● Not quite enough:
– Nested closures may reference the entire outer scope
– They may pull in non-Serializable variables not used inside
● Solution: bytecode analysis + reflection
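The outer-scope problem can be demonstrated with plain Scala and Java serialization (an illustrative sketch, not Spark's actual closure cleaner): a closure that reads a field of a non-serializable enclosing object drags the whole object along, while copying the field into a local variable first keeps the closure shippable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Driver { // deliberately NOT Serializable
  val factor = 3
  def badClosure: Int => Int = x => x * factor    // captures `this` (whole Driver)
  def goodClosure: Int => Int = {
    val f = factor                                // copy the field to a local
    x => x * f                                    // captures only the Int
  }
}

// True if the object survives Java serialization.
def serializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }

val d = new Driver
val badOk = serializable(d.badClosure)    // fails: drags in Driver
val goodOk = serializable(d.goodClosure)  // succeeds
```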

Page 27

Interactive Spark

● Modified the Scala interpreter to allow Spark to be used interactively from the command line
– Altered code generation so that each "line" typed holds references to the objects it depends on
– Added a facility to ship generated classes to workers
● Enables in-memory exploration of big data

Page 28

Spark Debugger

● Debugging general distributed apps is very hard
● Idea: leverage the structure of computations in Spark, MapReduce, Pregel, and other systems
● These split jobs into independent, deterministic tasks for fault tolerance; use this for debugging:
– Log lineage for all RDDs created (small)
– Let the user replay any task in jdb, or rebuild any RDD

Page 29

Conclusion

● Spark's RDDs offer a simple and efficient programming model for a broad range of apps
● They achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage
● They can express many current programming models

www.spark-project.org