Spark Summit EU talk by Nimbus Goehausen

Transcript of Spark Summit EU talk by Nimbus Goehausen

Page 1

Automatic Checkpointing in Spark

Nimbus Goehausen, Spark Platform Engineer

[email protected]

Copyright 2016 Bloomberg L.P. All rights reserved.

Page 2

A Data Pipeline

source → Process → output

output = Process(source)

Page 3

Spark development cycle

Make code change

Deploy jar (1m)

Start spark-submit / spark-shell / notebook (1m)

Wait for execution (1m – several hours)

Observe results

Page 4

This should work, right?

source → Process → output

output = Process(source)

Page 5

What have I done to deserve this?

source → Process → output

java.lang.HotGarbageException: You messed up
    at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
    at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
    at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)

Page 6

Split it up

source → Stage1 → Stage1 Output → Stage2 → output

Main1:

stage1Out = stage1(source)
stage1Out.save("stage1Path")

Main2:

stage1Out = load("stage1Path")
output = stage2(stage1Out)

Page 7

Automatic checkpoint

source → Stage1 → Stage1 Output → Stage2 → output

stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)

Page 8

First run

source → Stage1 → Stage1 Output → Stage2 → output

stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)

Page 9

Second run

source → Stage1 → Stage1 Output → Stage2 → output

stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)

Page 10

Change logic

source → modified Stage1 → modified Stage1 Output → Stage2 → output

stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)

Stage1 Output (previous run)

Page 11

How do we do this?

• Need some way of automatically generating a path to save to or load from
• Should stay the same when we want to reload the same data
• Should change when we want to generate new data
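One way to satisfy all three requirements, sketched below, is to derive the path from a logical signature of the data. This layout is illustrative, not spark-flow's actual one: as long as the signature is stable for unchanged logic and changes with it, the path behaves as the bullets require.

```scala
// Derive the checkpoint location from the dataset's logical signature
// (illustrative layout, not spark-flow's actual one): same logic, same
// path; changed logic, new path.
def checkpointPath(baseDir: String, signature: String): String =
  s"$baseDir/checkpoints/$signature"
```

The interesting work is computing a signature with those properties, which the next slides develop.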

Page 12

Refresher: what defines an RDD

Dependencies
Logic
Result

Page 13

Refresher: what defines an RDD

Dependencies: [1, 2, 3, 4, 5 …]
Logic: map(x => x * 2)
Result: [2, 4, 6, 8, 10 …]

Page 14

A logical signature

result = logic(dependencies)

signature(result) = hash(signature(dependencies) + hash(logic))
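The formula above can be sketched directly. The helper names below are illustrative, not spark-flow's API; the point is only that signatures compose: a derived dataset's signature hashes together its dependencies' signature and a hash of its logic.

```scala
import java.security.MessageDigest

// A minimal sketch of the signature scheme (helper names are assumptions,
// not spark-flow's API). Signatures are hex digests.
object Signature {
  def hash(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // signature(result) = hash(signature(dependencies) + hash(logic))
  def derived(depSignature: String, logicHash: String): String =
    hash(depSignature + logicHash)

  // A source dataset is signed by how it is read, e.g. "textFile:" + uri.
  def source(description: String): String = hash(description)
}
```

Re-running with the same source and the same logic reproduces the same signature, and hence the same checkpoint path; changing either produces a new one.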

Page 15

A logical signature

val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))

val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash("textFile:" + uri)

Page 16

Wait, how do you hash a function?

val f = (x: Int) => x * 2

f compiles to a class file of JVM bytecode. Hash those bytes.

Change f to val f = (x: Int) => x * 3 and the class file will change.

Page 17

What if the function references other functions or static methods?

• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part
• Must follow dependencies recursively, identifying all class files that the function depends on
• Make a combined hash of the class files and any runtime values
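The final "combined hash" step can be sketched as below. This is an assumed structure, not spark-flow's code: given the bytecode of every class file the function transitively depends on, plus any captured runtime values in serialized form, fold everything into one digest. Discovering that set of class files, the hard part the slide mentions, is elided here.

```scala
import java.security.MessageDigest

// Fold the bytecode of all dependent class files and any serialized runtime
// values into a single digest (sketch; the recursive class-file discovery
// is elided).
def combinedHash(classFiles: Seq[Array[Byte]],
                 runtimeValues: Seq[Array[Byte]]): String = {
  val md = MessageDigest.getInstance("SHA-256")
  classFiles.foreach(b => md.update(b))
  runtimeValues.foreach(b => md.update(b))
  md.digest().map("%02x".format(_)).mkString
}
```

A change to any one class file, or to any captured runtime value, changes the combined hash, which is exactly what invalidates a stale checkpoint.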

Page 18

Putting this together

source → Stage1 → Stage1 Output → Stage2 → output

stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
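With a signature-derived path, checkpoint() can behave as sketched below. The names are illustrative, not spark-flow's API: if output for this signature already exists, load it and skip the stage; otherwise compute, save, and return it.

```scala
// Load-or-compute skeleton behind an automatic checkpoint (hypothetical
// sketch, not spark-flow's API). The storage operations are passed in so
// the logic is independent of any particular filesystem.
def loadOrCompute[T](path: String,
                     exists: String => Boolean,
                     load: String => T,
                     compute: () => T,
                     save: (T, String) => Unit): T =
  if (exists(path)) load(path)
  else {
    val result = compute()
    save(result, path)
    result
  }
```

On the first run the stage is computed and saved; on the second run it is loaded; after a logic change the signature, and hence the path, differs, so the stage is recomputed, matching pages 8 to 10.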

Page 19

How to switch in checkpoints

• One option: change Spark.
• Other option: build on top of Spark.

Page 20

Distributed Collection (DC)

• Mostly the same API as RDD: maps, flatMaps, and so on
• Can be instantiated without a SparkContext
• Pass in a SparkContext to get the corresponding RDD
• Computes its logical signature upon definition, before materialization
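The key property in the last bullet can be sketched with plain collections. This is an assumed design, not spark-flow's implementation, and a Seq stands in for an RDD; an explicit opId string stands in for the bytecode hash of the earlier slides. What matters is that the signature is fixed when the DC is defined, while the data is only produced on materialization.

```scala
// A DC-like wrapper (assumed design; Seq stands in for RDD, opId stands in
// for a bytecode hash): signature is computed at definition time, data only
// when materialize() is called.
final case class DC[T](signature: String, private val gen: () => Seq[T]) {
  def map[U](opId: String)(f: T => U): DC[U] =
    DC(s"hash($signature+$opId)", () => gen().map(f))
  def filter(opId: String)(p: T => Boolean): DC[T] =
    DC(s"hash($signature+$opId)", () => gen().filter(p))
  def materialize(): Seq[T] = gen() // in spark-flow: pass a SparkContext, get an RDD
}
```

Because the signature exists before any data does, a checkpoint path can be chosen, and a hit detected, without running the pipeline at all.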

Page 21

RDD vs DC

val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)

val doubledRDD: RDD[Int] = doubled.getRDD(sc)

val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)

Page 22

Easy checkpoint

val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)

Page 23

Problem: non-lazy definition

val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)

We can't get a signature of `normalized` without computing the sum!

Page 24

Solution: DeferredResult

• Defined by a function on an RDD
• Like a DC, but contains only one element
• Pass in a SparkContext to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
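A DeferredResult-like wrapper can be sketched as below. This is an assumed design, not the spark-flow implementation, with an explicit opId string standing in for a bytecode hash: the value carries a signature plus a thunk, so anything built from it gets a signature before any computation runs.

```scala
// A DR-like wrapper (assumed design): one deferred value with a logical
// signature. The thunk runs only when get() is called, so signatures of
// downstream datasets never force the computation.
final case class DR[T](signature: String, private val thunk: () => T) {
  def map[U](opId: String)(f: T => U): DR[U] =
    DR(s"hash($signature+$opId)", () => f(thunk()))
  def get(): T = thunk() // in spark-flow: pass a SparkContext to compute
}
```

This is what lets the next slide's `normalized` be signed without computing the sum: the signature of the DR participates in the hash, while the sum itself stays unevaluated.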

Page 25

DeferredResult example

val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map { case (x, s) => x / s }

We can compute a signature for `normalized` without computing a sum.

Page 26

Improving the workflow

• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.

Page 27

Can I use this?

Yes! Check it out: https://github.com/bloomberg/spark-flow

• requires Spark 2.0
• still early, must currently be built from source
• hopefully available on Spark Packages soon

Page 28

Questions?

[email protected]
@nimbusgo on Twitter / Github

https://github.com/bloomberg/spark-flow
techatbloomberg.com