Artjom Lind ([email protected]), 22.10.2013, https://ds.cs.ut.ee
Hajussüsteemide Uurimislabor (Distributed Systems Research Lab)
Spark Engine: In-Memory Hadoop Cloud Computing (speaker: Artjom Lind)
(slides adapted from http://spark.incubator.apache.org)
Motivation
● Hadoop programming is based on storage-to-storage data flow (acyclic data flow).
● Advantage: the framework automatically decides where to run tasks and recovers from failures.

[Diagram: Input Data → Map tasks → Reduce tasks → Output Data]
Motivation
● … which becomes an I/O overhead for applications that reuse part of the input data:
– Iterative algorithms (machine learning, graph algorithms)
– Interactive data mining tools, such as those based on Matlab, R, or Python
● All of the above do the same thing (from Hadoop's perspective):
– Reload the data from storage on each query
Iterative algorithms
● Using a common data source: each iteration (Iter 1, Iter 2, Iter 3) re-reads the input data from storage.
● Passing results to the next iteration by offloading intermediate results to storage: Iter 1 → storage → Iter 2 → …

[Diagram: Input Data feeding Iter 1–3 and Output Data; intermediate results written back to storage between iterations]
Taking advantage of RAM
● Keep the “currently in process” working set in distributed RAM.
● Pass intermediate results between iterations through distributed RAM instead of storage.

[Diagram: Input Data loaded into DRAM once; Iter 1 → DRAM → Iter 2 → DRAM → Iter 3]
Distributed RAM Features
● It has to be:
– Fault-tolerant
– Efficient in large commodity clusters
● A cluster has lots of nodes, each with local RAM
● How do we aggregate it into a distributed RAM?
● How can the API provide these features?
Distributed RAM Features
● Existing distributed storage abstractions offer an interface based on fine-grained updates
– Reads and writes to cells in a table
– E.g. key-value stores, databases, distributed memory
● This requires replicating data or update logs across nodes for fault tolerance
– Which is expensive for data-intensive applications
RDD: Resilient Distributed Datasets
● Offer an interface based on coarse-grained transformations (e.g. map, group-by, join)
● Allow efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions of a dataset on failure
– No cost if nothing fails
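Lineage-based recovery can be sketched in a few lines of plain Python (an illustrative simulation, not Spark code; all class and method names here are made up): each dataset records only its parent and the one coarse-grained operation that produced it, so lost in-memory data can be recomputed instead of replicated.

```python
# Minimal sketch of lineage-based recovery (illustrative only, not Spark code).
# Each dataset remembers its parent and the single coarse-grained operation
# that produced it, so lost data can be recomputed rather than replicated.

class LineageRDD:
    def __init__(self, parent=None, op=None, source=None):
        self.parent, self.op, self.source = parent, op, source
        self.cache = None  # simulated in-memory partition

    def map(self, f):
        return LineageRDD(parent=self, op=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return LineageRDD(parent=self, op=lambda data: [x for x in data if pred(x)])

    def compute(self):
        if self.cache is not None:
            return self.cache
        # Walk the lineage: recompute from the parent (or the source).
        data = self.source if self.parent is None else self.op(self.parent.compute())
        self.cache = data
        return data

    def lose_partition(self):
        self.cache = None  # simulate a node failure dropping cached data

base = LineageRDD(source=[1, 2, 3, 4])
evens = base.map(lambda x: x * 2).filter(lambda x: x > 4)
print(evens.compute())   # [6, 8]
evens.lose_partition()
print(evens.compute())   # recomputed from lineage: [6, 8]
```

Note the key property from the slide: the lineage log stores one operation per dataset (small), not per-element update logs, and recomputation costs nothing unless a failure actually occurs.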
Recovery
● In the common-data-source case:
– … we go to storage directly if an RDD is lost
● In the sequential-bypass case:
– … there is no shortcut: we have to recompute everything up to the lost stage

[Diagram: Iter 1 → DRAM → Iter 2 → DRAM → Iter 3]
Recovery
● The failure cases are the same as on the previous slide; however, we win in performance if nothing fails.
Generality of RDDs
● Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms
– These naturally apply the same operation to many items
● They unify many current programming models
– Data flow models: MapReduce, Dryad, SQL, …
– Specialized tools for iterative apps: BSP (Pregel), iterative MapReduce (Twister), incremental processing (CBP)
● They also support new apps that these models don't
Spark API
● Language-integrated API in Scala
● Can be used interactively from the Scala interpreter
● Provides:
– Resilient distributed datasets (RDDs)
– Transformations to build RDDs (map, join, filter, …)
– Actions on RDDs (return a result to the program or write data to storage: reduce, count, save, …)
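The split between transformations and actions hinges on laziness: transformations only describe a new dataset, while actions force evaluation. A minimal Python sketch of that idea (illustrative only; `MiniRDD` is made up, not Spark's API):

```python
from functools import reduce as functools_reduce

# Illustrative sketch of lazy transformations vs. eager actions (not Spark code).
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute  # thunk: nothing runs until an action is called

    # Transformations: return a new MiniRDD; no computation happens yet.
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)])

    # Actions: force evaluation and return a plain value to the program.
    def count(self):
        return len(self._compute())

    def reduce(self, f):
        return functools_reduce(f, self._compute())

nums = MiniRDD(lambda: list(range(10)))
evens = nums.filter(lambda x: x % 2 == 0)            # nothing computed yet
print(evens.count())                                  # action triggers evaluation: 5
print(evens.map(lambda x: x * x)
           .reduce(lambda a, b: a + b))               # 0 + 4 + 16 + 36 + 64 = 120
```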
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count

[Diagram: the Driver ships tasks to Workers; each Worker reads its block (Block 1–3) from the source and caches its partition of messages in DRAM]
Fault tolerance
RDDs preserve lineage information that can be used to reconstruct lost partitions:

errors = lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2))

[Lineage: HDFS File → Filtered RDD (_.startsWith("ERROR")) → Mapped RDD (_.split('\t')(2))]
Example: Logistic Regression
Goal: find best line separating two sets of points
[Plot: two classes of points (+ and –) in the plane; a random initial line converges toward the target separating line]
Code
val data = spark.textFile(...).map(readPoint).persist()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce((a, b) => a + b)
  w -= gradient
}

println("Final w: " + w)
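The gradient step above can be checked in a self-contained Python sketch, with plain lists standing in for the RDD (the sample points below are made up for illustration):

```python
import math

# Self-contained check of the logistic-regression gradient step above,
# with plain Python lists in place of an RDD (sample data is made up).
points = [((1.0, 2.0), 1.0), ((2.0, 1.0), -1.0),
          ((0.5, 1.5), 1.0), ((1.5, 0.5), -1.0)]  # (x, y) pairs, y in {+1, -1}
w = [0.0, 0.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(100):  # ITERATIONS
    # gradient = sum over points of (1/(1+exp(-y*(w.x))) - 1) * y * x
    gradient = [0.0, 0.0]
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        gradient = [g + scale * xi for g, xi in zip(gradient, x)]
    w = [wi - gi for wi, gi in zip(w, gradient)]

# w should now separate the two classes: w.x > 0 for y = +1, w.x < 0 for y = -1
print(all((dot(w, x) > 0) == (y > 0) for x, y in points))  # True
```

In Spark, the inner sum over points becomes the map/reduce over the persisted dataset, so each iteration reads the points from distributed RAM rather than from storage.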
RDD in detail
● RDDs additionally provide:
– Control over partitioning (e.g. hash-based), which can be used to optimize data placement across queries
– Control over persistence (e.g. store on disk vs. in RAM)
– Fine-grained reads (treat the RDD as a big table)
RDD vs. Distributed Shared Memory

Concern              | RDDs                                        | Distr. Shared Mem.
---------------------|---------------------------------------------|----------------------------------------------
Reads                | Fine-grained                                | Fine-grained
Writes               | Bulk transformations                        | Fine-grained
Consistency          | Trivial (immutable)                         | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution        | Difficult
Work placement       | Automatic based on data locality            | Up to app (but runtime aims for transparency)
Spark Applications
● EM algorithm for traffic prediction (Mobile Millennium)
● In-memory OLAP & anomaly detection (Conviva)
● Twitter spam classification (Monarch)
● Alternating least squares matrix factorization
● Pregel on Spark (Bagel)
● SQL on Spark (Shark)
Mobile Millennium App.
Estimate city traffic using position observations from “probe vehicles” (e.g. GPS-carrying taxis)
Implementation
Expectation maximization (EM) algorithm using iterated map and groupByKey on the same data.
3× faster with in-memory RDDs.
Conviva GeoReport
Aggregations on many keys with the same WHERE clause. The 40× gain comes from:
– Not re-reading unused columns or filtered records
– Avoiding repeated decompression
– In-memory storage of de-serialized objects

[Chart: query response time, Hive ≈ 20 s vs. Spark ≈ 0.5 s]
Implementation
Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps.
It can read from any Hadoop input source (HDFS, S3, …).

[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes]

~10,000 lines of code, no changes to Scala.
RDD Representation
Simple common interface:
– Set of partitions
– Preferred locations for each partition
– List of parent RDDs
– Function to compute a partition given its parents
– Optional partitioning info

This allows capturing a wide range of transformations; users can easily add new transformations.
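The five-item interface above can be sketched as an abstract base class. This is a hedged Python approximation of what the slide describes (Spark's real interface is internal Scala code; every name below is illustrative):

```python
from abc import ABC, abstractmethod

# Illustrative approximation of the common RDD interface described above
# (names are made up; Spark's actual interface lives in its Scala internals).
class AbstractRDD(ABC):
    @abstractmethod
    def partitions(self):
        """Set of partitions making up the dataset."""

    @abstractmethod
    def compute(self, partition):
        """Compute one partition, given the parents' data."""

    def parents(self):
        """List of parent RDDs (empty for a source RDD)."""
        return []

    def preferred_locations(self, partition):
        """Hosts where this partition is cheap to compute (e.g. HDFS block nodes)."""
        return []

    def partitioner(self):
        """Optional partitioning info (e.g. a hash partitioner), or None."""
        return None

# A source RDD over an in-memory list, split into fixed-size partitions.
class ListRDD(AbstractRDD):
    def __init__(self, data, num_partitions=2):
        step = max(1, len(data) // num_partitions)
        self._parts = [data[i:i + step] for i in range(0, len(data), step)]

    def partitions(self):
        return list(range(len(self._parts)))

    def compute(self, partition):
        return self._parts[partition]

rdd = ListRDD([1, 2, 3, 4], num_partitions=2)
print([rdd.compute(p) for p in rdd.partitions()])  # [[1, 2], [3, 4]]
```

A new transformation only has to say who its parents are and how to compute one partition from them, which is why users can add transformations easily.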
Language integration
● Scala closures are serializable Java objects
● Serialize on the driver, load & run on workers
● Not quite enough:
– Nested closures may reference the entire outer scope
– They may pull in non-serializable variables that are not used inside
● Solution: bytecode analysis + reflection
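The over-capture problem has a close analogue in Python, shown below for illustration (Spark's actual fix analyzes Scala bytecode; this sketch only demonstrates why naive closure capture drags in too much state):

```python
# Conceptual analogue of the Scala closure-capture problem (illustrative only).
class Driver:
    def __init__(self):
        self.threshold = 10
        self.big_unneeded_state = list(range(1_000_000))  # we don't want to ship this

    def make_bad_filter(self):
        # Referencing self.threshold captures *self*, dragging the whole
        # Driver (including big_unneeded_state) into the closure.
        return lambda x: x > self.threshold

    def make_good_filter(self):
        # Copy only the needed value into a local variable first.
        t = self.threshold
        return lambda x: x > t

d = Driver()
bad, good = d.make_bad_filter(), d.make_good_filter()
print(type(bad.__closure__[0].cell_contents).__name__)  # Driver (whole object captured)
print(good.__closure__[0].cell_contents)                # 10 (just the value)
```

Spark automates the "good" version: bytecode analysis finds which outer variables a closure actually uses, and reflection rebuilds the closure with only those.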
Interactive Spark
● Modified the Scala interpreter to allow Spark to be used interactively from the command line
– Altered code generation so that each “line” typed has references to the objects it depends on
– Added a facility to ship generated classes to workers
● Enables in-memory exploration of big data
Spark Debugger
● Debugging general distributed apps is very hard
● Idea: leverage the structure of computations in Spark, MapReduce, Pregel, and other systems
● These systems split jobs into independent, deterministic tasks for fault tolerance; use this for debugging:
– Log lineage for all RDDs created (small)
– Let the user replay any task in jdb, or rebuild any RDD
Conclusion
● Spark’s RDDs offer a simple and efficient programming model for a broad range of apps
● Achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage
● Can express many current programming models
www.spark-project.org