Spark

Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. UC Berkeley.


Transcript of Spark

Slide 1

Spark: In-Memory Cluster Computing for Iterative and Interactive Applications

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC BERKELEY

Slide 2

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

(Diagram: Input flows through Map and Reduce stages to Output; the flow is acyclic.)

Slide 3

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Speaker notes: Also applies to Dryad, SQL, etc. Benefits: easy to do fault tolerance and ...

Slide 4

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query.

Speaker notes: Point out that both types of apps are actually quite common / desirable in data analytics.

Slide 5

Example: Iterative Apps

(Diagram: Input feeds iteration 1, iteration 2, iteration 3, ..., producing result 1, result 2, result 3; each iteration reads Input again.)

Each iteration is, for example, a MapReduce job.

Slide 6

Goal: Keep Working Set in RAM

(Diagram: one-time processing loads Input into distributed memory; iteration 1, iteration 2, ... then operate on the in-memory working set.)

Slide 7

Challenge

A distributed memory abstraction must be:
- Fault-tolerant
- Efficient in large commodity clusters

How do we design a programming interface that can provide fault tolerance efficiently?

Slide 8

Challenge

Existing distributed storage abstractions offer an interface based on fine-grained updates:
- Reads and writes to cells in a table
- E.g. key-value stores, databases, distributed memory

This requires replicating data or update logs across nodes for fault tolerance, which is expensive for data-intensive apps.

Slide 9

Solution: Resilient Distributed Datasets (RDDs)

RDDs offer an interface based on coarse-grained transformations (e.g. map, group-by, join), which allows for efficient fault recovery using lineage:
- Log one operation to apply to many elements
- Recompute lost partitions of the dataset on failure
- No cost if nothing fails

Slide 10

RDD Recovery

(Diagram: as on Slide 6, one-time processing loads Input into distributed memory and iterations read from RAM; lost partitions are rebuilt from lineage.)
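To make the lineage idea from the last two slides concrete, here is a minimal sketch (hedged: the variable names, the sample operations, and the SparkContext handle `sc` are assumptions, not content from the talk):

    // Lineage is the chain of coarse-grained operations, not the data itself.
    val input  = sc.textFile("hdfs://...")             // stable storage
    val errors = input.filter(_.contains("ERROR"))     // one logged operation
    val pairs  = errors.map(line => (line.length, 1))  // another logged operation
    // Lineage of `pairs`: textFile -> filter -> map. If a partition of
    // `pairs` is lost, Spark recomputes just that partition by re-running
    // filter and map on the matching input block; no data is replicated
    // up front, so there is no cost when nothing fails.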

Slide 11

Generality of RDDs

Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms, since these naturally apply the same operation to many items. They unify many current programming models:
- Data flow models: MapReduce, Dryad, SQL, ...
- Specialized tools for iterative apps: BSP (Pregel), iterative MapReduce (Twister), incremental processing (CBP)

They also support new apps that these models don't.

Speaker notes: Key idea: add "variables" to the "functions" in functional programming.

Slide 12

Outline

- Programming model
- Applications
- Implementation
- Demo
- Current work

Slide 13

Spark Programming Interface

Language-integrated API in Scala, which can be used interactively from the Scala interpreter. It provides:
- Resilient distributed datasets (RDDs)
- Transformations to build RDDs (map, join, filter, ...)
- Actions on RDDs that return a result to the program or write data to storage (reduce, count, save, ...)

You write a single program, similar to DryadLINQ.

Speaker notes: Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back, and cached variables are an optimization. Mention cached variables are useful for some workloads that won't be shown here. Mention it's all designed to be easy to distribute in a fault-tolerant fashion.
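As a hedged illustration of the transformation/action split on this slide (the dataset and names are invented for the example, not from the talk):

    // Transformations are lazy: they only define new RDDs.
    val nums    = sc.parallelize(1 to 1000000)
    val squares = nums.map(x => x.toLong * x)   // transformation
    val evens   = squares.filter(_ % 2 == 0)    // transformation
    // Actions run the computation and return a result to the driver.
    val total = evens.reduce(_ + _)             // action
    val n     = evens.count()                   // action
    println(s"count=$n total=$total")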

Slide 14

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
messages.persist()

(Diagram: the driver sends tasks to three workers, each reading one block of the log from HDFS; after the first query each worker keeps its partition of messages, Msgs. 1-3, in RAM and returns results to the driver. Labels: lines is the base RDD, messages a transformed RDD, count an action.)

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

Slide 17

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce((a, b) => a + b)
  w -= gradient
}

println("Final w: " + w)

Slide 18

Logistic Regression Performance

(Chart: running time per iteration. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.)

This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

Slide 19

RDDs in More Detail

RDDs additionally provide:
- Control over partitioning (e.g. hash-based), which can be used to optimize data placement across queries (see the sketch after the next slide)
- Control over persistence (e.g. store on disk vs. in RAM)
- Fine-grained reads (treat the RDD as a big table)

Slide 20

RDDs vs Distributed Shared Memory

Concern              | RDDs                                        | Distr. Shared Mem.
Reads                | Fine-grained                                | Fine-grained
Writes               | Bulk transformations                        | Fine-grained
Consistency          | Trivial (immutable)                         | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution        | Difficult
Work placement       | Automatic based on data locality            | Up to app (but runtime aims for transparency)

Slide 21

Spark Applications

- EM algorithm for traffic prediction (Mobile Millennium)
- In-memory OLAP & anomaly detection (Conviva)
- Twitter spam classification (Monarch)
- Alternating least squares matrix factorization
- Pregel on Spark (Bagel)
- SQL on Spark (Shark)

Slide 22

Mobile Millennium App

Estimate city traffic using position observations from probe vehicles (e.g. GPS-carrying taxis).

Slide 23

Sample Data

(Image: one day's worth of SF taxi measurements.)

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Slide 24

Implementation

Expectation maximization (EM) algorithm using iterated map and groupByKey on the same data: 3x faster with in-memory RDDs. (A sketch of this pattern follows.)
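A hedged sketch of the iterated map + groupByKey pattern named on the slide above (the data layout, the toy E-step statistic, and all names are invented placeholders, not the Mobile Millennium code):

    // Toy E-step statistic; the real EM update is application-specific.
    def expectedStat(params: Map[String, Double], link: String, t: Double): Double =
      t - params.getOrElse(link, 0.0)

    // (linkId, travelTime) observations, persisted once and reused
    // by every EM iteration.
    val observations = sc.textFile("hdfs://...")
      .map { line =>
        val f = line.split(",")
        (f(0), f(1).toDouble)
      }
      .persist()

    var params = Map.empty[String, Double]
    for (i <- 1 to 10) {
      val p = params                    // capture current parameters by value
      params = observations
        .map { case (link, t) => (link, expectedStat(p, link, t)) } // E-step
        .groupByKey()                                               // gather per link
        .mapValues(stats => stats.sum / stats.size)                 // toy M-step
        .collect().toMap
    }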

Slide 25

Conviva GeoReport

Aggregations on many keys with the same WHERE clause.

(Chart: query time in hours.)

The roughly 40x gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects

Slide 26

Implementation

Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps. It can read from any Hadoop input source (HDFS, S3, ...).

(Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes.)

~10,000 lines of code, no changes to Scala.

Speaker notes: Mention it's designed to be fault-tolerant.

Slide 27

RDD Representation

Simple common interface:
- Set of partitions
- Preferred locations for each partition
- List of parent RDDs
- Function to compute a partition given its parents
- Optional partitioning info

This allows capturing a wide range of transformations, and users can easily add new transformations. (See the interface sketch after the Conclusion slide.)

Slide 28

Scheduler

- Dryad-like task DAG
- Pipelines functions within a stage
- Reuses previously computed data
- Partitioning-aware to avoid shuffles

(Diagram: a DAG of RDDs A-G connected by map, union, groupBy, and join, cut into Stages 1-3; shaded partitions mark previously computed data.)

Speaker notes: NOT a modified version of Hadoop.

Slide 29

Language Integration

Scala closures are Serializable Java objects: serialize them on the driver, then load and run them on workers. This is not quite enough:
- Nested closures may reference the entire outer scope
- That may pull in non-Serializable variables not used inside

Solution: bytecode analysis + reflection.

Slide 30

Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line:
- Altered code generation so that each line typed has references to the objects it depends on
- Added a facility to ship generated classes to workers

This enables in-memory exploration of big data.

Slide 31

Outline

- Spark programming model
- Applications
- Implementation
- Demo
- Current work

Slide 32

Spark Debugger

Debugging general distributed apps is very hard. Idea: leverage the structure of computations in Spark, MapReduce, Pregel, and other systems. These systems split jobs into independent, deterministic tasks for fault tolerance; use this for debugging:
- Log lineage for all RDDs created (small)
- Let the user replay any task in jdb, or rebuild any RDD

Slide 33

Conclusion

Spark's RDDs offer a simple and efficient programming model for a broad range of apps. They achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage, and they can express many current programming models.
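Returning to the RDD Representation slide, here is a minimal hedged sketch of that five-part common interface as a Scala trait (the type and member names are illustrative assumptions, not Spark's actual source):

    // Illustrative shape of the common RDD interface; not Spark's code.
    trait Partition { def index: Int }

    trait SimpleRDD[T] {
      def partitions: Seq[Partition]                     // set of partitions
      def preferredLocations(p: Partition): Seq[String]  // e.g. HDFS block hosts
      def parents: Seq[SimpleRDD[_]]                     // lineage: parent RDDs
      def compute(p: Partition): Iterator[T]             // build a partition from parents
      def partitioner: Option[Int => Int]                // optional partitioning info
    }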

www.spark-project.org

Slide 34

Related Work

- DryadLINQ, FlumeJava: similar distributed collection API, but cannot reuse datasets efficiently across queries
- Relational databases: lineage/provenance, logical logging, materialized views
- GraphLab, Piccolo, BigTable, RAMCloud: fine-grained writes, similar to distributed shared memory
- Iterative MapReduce (e.g. Twister, HaLoop): implicit data sharing for a fixed computation pattern
- Caching systems (e.g. Nectar): store data in files, with no explicit control over what is cached

Slide 35

Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

Slide 36

Fault Recovery Results

(Chart: per-iteration running times, showing recovery after a failure.)

Speaker notes: These are for k-means.

Slide 37

Behavior with Not Enough RAM

(Chart: iteration time as the fraction of the working set held in memory varies.)

Slide 38

PageRank Results

(Chart: PageRank running times.)

Slide 39

Pregel

Graph processing system based on the BSP model:
- Vertices in the graph have states
- At each superstep, each vertex can update its state and send messages to vertices in the next step

Slide 40

Pregel Using RDDs

(Diagram: the input graph yields vertex states 0 and messages 0; superstep 1 groups them by vertex ID and maps to produce vertex states 1 and messages 1, superstep 2 produces vertex states 2 and messages 2, and so on.)

Slide 41

Pregel Using RDDs

(Same diagram as the previous slide, now annotated with code.)

verts = // RDD of (ID, State) pairs
msgs = // RDD of (ID, Message) pairs

newData = verts.cogroup(msgs).mapValues(
  (id, vert, msgs) => userFunc(id, vert, msgs)
  // gives (id, newState, outgoingMsgs)
).persist()

newVerts = newData.mapValues((v, ms) => v)
newMsgs = newData.flatMap((id, (v, ms)) => ms)
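A hedged sketch tying the three fragments above into one runnable superstep loop (the State and Message types, the userFunc body, and the fixed superstep count are assumptions, and it assumes each vertex has exactly one state; this is not Bagel's actual API):

    // Placeholder types and vertex program; the real ones are app-specific.
    case class State(rank: Double, halted: Boolean)

    def userFunc(id: Long, s: State, ms: Iterable[Double]): (State, Seq[(Long, Double)]) =
      (State(s.rank + ms.sum, halted = ms.isEmpty), Seq.empty) // toy update, sends nothing

    var verts = sc.parallelize(Seq((1L, State(1.0, false)), (2L, State(1.0, false))))
    var msgs  = sc.parallelize(Seq((1L, 0.5), (2L, 0.25)))

    for (superstep <- 1 to 10) {
      val newData = verts.cogroup(msgs).map { case (id, (vs, ms)) =>
        (id, userFunc(id, vs.head, ms))    // vs.head: one state per vertex assumed
      }.persist()
      verts = newData.mapValues { case (v, _) => v }          // new vertex states
      msgs  = newData.flatMap { case (_, (_, out)) => out }   // outgoing messages
    }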

Slide 42

Placement Optimizations

- Partition vertex RDDs the same way across supersteps, so that states never need to be sent across nodes
- Use a custom partitioner to minimize traffic (e.g. group pages by domain)
- Easily expressible with RDD partitioning (a sketch follows)
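As a hedged illustration of the custom-partitioner idea (the class, the domain-hashing scheme, and the partition count are assumptions; only the Partitioner base class is Spark's real API):

    import org.apache.spark.Partitioner

    // Send all pages from one domain to the same partition, so cross-node
    // traffic between supersteps stays low.
    class DomainPartitioner(val numPartitions: Int) extends Partitioner {
      private def domain(url: String): String =
        url.stripPrefix("http://").stripPrefix("https://").takeWhile(_ != '/')
      def getPartition(key: Any): Int = {
        val h = domain(key.toString).hashCode
        ((h % numPartitions) + numPartitions) % numPartitions  // non-negative index
      }
      override def equals(other: Any): Boolean = other match {
        case p: DomainPartitioner => p.numPartitions == numPartitions
        case _ => false
      }
      override def hashCode: Int = numPartitions
    }

    // Usage (hypothetical vertex RDD keyed by URL):
    // val placed = verts.partitionBy(new DomainPartitioner(64)).persist()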