Spark Summit EU talk by Nimbus Goehausen
Transcript of Spark Summit EU talk by Nimbus Goehausen
Automatic Checkpointing in Spark
Nimbus Goehausen, Spark Platform Engineer
Copyright 2016 Bloomberg L.P. All rights reserved.
A Data Pipeline
source → Process → output
output = Process(source)
Spark development cycle
• Make code change
• Deploy jar (1m)
• Start spark-submit / spark-shell / notebook (1m)
• Wait for execution (1m – several hours)
• Observe results
This should work, right?
source → Process → output
output = Process(source)
What have I done to deserve this?
source → Process → output
java.lang.HotGarbageException: You messed up
  at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
  at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
  at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
  …
Split it up
source → Stage1 → Stage1Output → Stage2 → output
Main1:
stage1Out = stage1(source)
stage1Out.save("stage1Path")
Main2:
stage1Out = load("stage1Path")
output = stage2(stage1Out)
Automatic checkpoint
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
First run
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Second run
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Change logic
source → modified Stage1 → modified Stage1Output → Stage2 → output
(the original Stage1Output is left untouched)
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
How do we do this?
• Need some way of automatically generating a path to save to or load from
• Should stay the same when we want to reload the same data
• Should change when we want to generate new data
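A minimal sketch of the behavior these bullets describe, assuming a Hadoop-compatible filesystem; `signature` stands in for the hashing scheme introduced on the following slides, and `baseDir` is a hypothetical root directory (this is illustrative, not the spark-flow internals):

import scala.reflect.ClassTag
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Derive the save path from the computation's logical signature, then either
// reload existing data or compute and save it.
def checkpoint[A: ClassTag](sc: SparkContext, signature: String, baseDir: String)
                           (compute: => RDD[A]): RDD[A] = {
  val path = s"$baseDir/$signature"
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    // same inputs and logic -> same signature -> reload the saved output
    sc.objectFile[A](path)
  } else {
    // new inputs or logic -> new signature -> compute and save under a new path
    val rdd = compute
    rdd.saveAsObjectFile(path)
    rdd
  }
}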
Refresher: what defines an RDD
• Dependencies
• Logic
• Result
Refresher: what defines an RDD
• Dependencies: [1, 2, 3, 4, 5, …]
• Logic: map(x => x * 2)
• Result: [2, 4, 6, 8, 10, …]
A logical signature
result = logic(dependencies)
signature(result) = hash(signature(dependencies) + hash(logic))
A logical signature
val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))
val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash("textFile:" + uri)
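One concrete (purely illustrative) reading of `hash` in these formulas is a stable digest over strings; the URI below is a hypothetical example:

import java.security.MessageDigest

// Illustrative only: SHA-1 over UTF-8 bytes, hex-encoded.
def hash(s: String): String =
  MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// A source's signature depends only on how it was defined:
val uri = "hdfs:///data/input.txt"   // hypothetical input path
val sourceSignature = hash("textFile:" + uri)
// A derived dataset folds in a hash of the function applied to it:
// signature(rdd2) = hash(signature(rdd1) + hash(f))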
Wait, how do you hash a function?
val f = (x: Int) => x * 2
f compiles to a class file of JVM bytecode. Hash those bytes.
Change f to val f = (x: Int) => x * 3 and the class file will change.
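A rough sketch of that idea, assuming Scala 2.11-style compilation where each anonymous function gets its own class file that can be found as a classpath resource (not the spark-flow implementation):

import java.security.MessageDigest

// Hash the bytecode of the class that implements a function value.
// getResourceAsStream returns null if the class file is not on the classpath
// (e.g. REPL-defined functions), which this sketch does not handle.
def hashFunctionBytecode(f: AnyRef): String = {
  val cls = f.getClass
  val in = cls.getResourceAsStream("/" + cls.getName.replace('.', '/') + ".class")
  try {
    val bytes = Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
    MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString
  } finally in.close()
}

val f = (x: Int) => x * 2
hashFunctionBytecode(f)   // changes if f becomes (x: Int) => x * 3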
What if the function references other functions or static methods?
• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part
• Must follow dependencies recursively, identify all class files that the function depends on
• Make a combined hash of the class files and any runtime values
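One way such a recursive walk could look, sketched with the ASM bytecode library (an assumption; spark-flow's actual approach may differ), restricted to the application's own packages and ignoring runtime values for brevity:

import java.security.MessageDigest
import org.objectweb.asm.{ClassReader, ClassVisitor, MethodVisitor, Opcodes}

// Classes referenced by method calls and field accesses in the given class file.
def referencedClasses(bytes: Array[Byte]): Set[String] = {
  val found = scala.collection.mutable.Set.empty[String]
  new ClassReader(bytes).accept(new ClassVisitor(Opcodes.ASM5) {
    override def visitMethod(access: Int, name: String, desc: String,
                             sig: String, exceptions: Array[String]): MethodVisitor =
      new MethodVisitor(Opcodes.ASM5) {
        override def visitMethodInsn(op: Int, owner: String, name: String,
                                     desc: String, itf: Boolean): Unit =
          found += owner.replace('/', '.')
        override def visitFieldInsn(op: Int, owner: String, name: String,
                                    desc: String): Unit =
          found += owner.replace('/', '.')
      }
  }, 0)
  found.toSet
}

// Assumes the class file is available on the classpath.
def classBytes(name: String): Array[Byte] = {
  val in = Thread.currentThread.getContextClassLoader
    .getResourceAsStream(name.replace('.', '/') + ".class")
  try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  finally in.close()
}

// Walk references transitively within `ownPackage`, hashing every class file seen.
// Runtime values (e.g. Values.v above) would also need to be captured and hashed,
// which this sketch omits.
def dependencySignature(f: AnyRef, ownPackage: String): String = {
  val digest = MessageDigest.getInstance("SHA-1")
  var seen = Set.empty[String]
  var queue = List(f.getClass.getName)
  while (queue.nonEmpty) {
    val name = queue.head; queue = queue.tail
    if (!seen(name) && name.startsWith(ownPackage)) {
      seen += name
      val bytes = classBytes(name)
      digest.update(bytes)
      queue = referencedClasses(bytes).toList.sorted ++ queue
    }
  }
  digest.digest().map("%02x".format(_)).mkString
}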
Putting this together
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Howtoswitchincheckpoints
• Oneoption:changespark.• Otheroption:buildontopofspark.
Distributed Collection (DC)
• Mostly the same API as RDD: map, flatMap, and so on
• Can be instantiated without a SparkContext
• Pass in a SparkContext to get the corresponding RDD
• Computes its logical signature upon definition, before materialization
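To make the idea concrete, here is a stripped-down sketch of what such a DC could look like; the names and structure are illustrative, not the spark-flow API, and the signature of a function is approximated by its class name rather than a bytecode hash:

import java.security.MessageDigest
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Sig {
  def hash(s: String): String =
    MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
}

// A lazy description of a computation: it knows its logical signature at
// definition time and only produces an RDD once given a SparkContext.
abstract class DC[A](val signature: String) extends Serializable {
  protected def compute(sc: SparkContext): RDD[A]
  def getRDD(sc: SparkContext): RDD[A] = compute(sc)

  def map[B: ClassTag](f: A => B): DC[B] = {
    val parent = this
    // child signature = hash(parent signature + identity of f); a real
    // implementation would hash f's bytecode as described earlier
    new DC[B](Sig.hash(parent.signature + f.getClass.getName)) {
      protected def compute(sc: SparkContext): RDD[B] = parent.getRDD(sc).map(f)
    }
  }

  def filter(p: A => Boolean): DC[A] = {
    val parent = this
    new DC[A](Sig.hash(parent.signature + p.getClass.getName)) {
      protected def compute(sc: SparkContext): RDD[A] = parent.getRDD(sc).filter(p)
    }
  }
}

object DC {
  // A source defined without any SparkContext.
  def parallelize[A: ClassTag](data: Seq[A]): DC[A] =
    new DC[A](Sig.hash("parallelize:" + data.toString)) {
      protected def compute(sc: SparkContext): RDD[A] = sc.parallelize(data)
    }
}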
RDD vs DC
DC:
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)
val doubledRDD: RDD[Int] = doubled.getRDD(sc)
RDD:
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)
Easy checkpoint
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)
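Tying the two sketches together: a checkpoint on a DC could keep the DC's signature but route materialization through the signature-derived path of the earlier `checkpoint` sketch. This is again illustrative, not the library's code, and assumes the imports and definitions from the sketches above:

// Builds on the DC and checkpoint sketches above.
def checkpointed[A: ClassTag](dc: DC[A], baseDir: String): DC[A] =
  new DC[A](dc.signature) {
    protected def compute(sc: SparkContext): RDD[A] =
      checkpoint(sc, dc.signature, baseDir) { dc.getRDD(sc) }
  }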
Problem: Non-lazy definition
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)
We can't get a signature of `normalized` without computing the sum!
Solution: DeferredResult
• Defined by a function on an RDD
• Like a DC but contains only one element
• Pass in a SparkContext to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
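Continuing the DC sketch from above (and reusing its Sig and DC definitions), a DeferredResult could look roughly like this; the names are illustrative, not the spark-flow API, and `sum` and `withResult` are written as standalone helpers rather than methods on DC:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A single deferred value derived from a DC, with its own logical signature.
abstract class DR[A](val signature: String) extends Serializable {
  def get(sc: SparkContext): A
}

def sum(dc: DC[Double]): DR[Double] =
  new DR[Double](Sig.hash(dc.signature + ":sum")) {
    def get(sc: SparkContext): Double = dc.getRDD(sc).sum()
  }

// Pair every element with the deferred value; the value is only computed
// when the pipeline is actually materialized with a SparkContext.
def withResult[A, B](dc: DC[A], dr: DR[B]): DC[(A, B)] =
  new DC[(A, B)](Sig.hash(dc.signature + dr.signature)) {
    protected def compute(sc: SparkContext): RDD[(A, B)] = {
      val b = dr.get(sc)
      dc.getRDD(sc).map(a => (a, b))
    }
  }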
DeferredResult example
val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map { case (x, s) => x / s }
We can compute a signature for `normalized` without computing a sum.
Improving the workflow
• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
Can I use this?
Yes! Check it out: https://github.com/bloomberg/spark-flow
• requires Spark 2.0
• still early, must build from source currently
• hopefully available on Spark Packages soon
Questions?
[email protected]
@nimbusgo on Twitter / Github
https://github.com/bloomberg/spark-flow
techatbloomberg.com