Spark Summit EU talk by Nimbus Goehausen
Transcript of Spark Summit EU talk by Nimbus Goehausen
Automatic Checkpointing in Spark
Nimbus Goehausen, Spark Platform Engineer
Copyright 2016 Bloomberg L.P. All rights reserved.
A Data Pipeline
source → Process → output
output = Process(source)
Spark development cycle
• Make code change
• Deploy jar (1m)
• Start spark-submit / spark-shell / notebook (1m)
• Wait for execution (1m – several hours)
• Observe results
This should work, right?
source → Process → output
output = Process(source)
What have I done to deserve this?
source → Process → output
java.lang.HotGarbageException: You messed up
  at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
  at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
  at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
  …
Split it up
source → Stage1 → Stage1Output → Stage2 → output
Main1:
stage1Out = stage1(source)
stage1Out.save("stage1Path")
Main2:
stage1Out = load("stage1Path")
output = stage2(stage1Out)
Automatic checkpoint
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
First run
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Second run
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Change logic
source → modified Stage1 → modified Stage1Output → Stage2 → output
(the original Stage1Output is left untouched)
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
How do we do this?
• Need some way of automatically generating a path to save to or load from
• Should stay the same when we want to reload the same data
• Should change when we want to generate new data
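A minimal sketch of the behavior these bullets describe, assuming a Hadoop-compatible filesystem; `signature` stands in for the hashing scheme introduced on the following slides, and `baseDir` is a hypothetical root directory (this is illustrative, not the spark-flow internals):

import scala.reflect.ClassTag
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Derive the save path from the computation's logical signature, then either
// reload existing data or compute and save it.
def checkpoint[A: ClassTag](sc: SparkContext, signature: String, baseDir: String)
                           (compute: => RDD[A]): RDD[A] = {
  val path = s"$baseDir/$signature"
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    // same inputs and logic -> same signature -> reload the saved output
    sc.objectFile[A](path)
  } else {
    // new inputs or logic -> new signature -> compute and save under a new path
    val rdd = compute
    rdd.saveAsObjectFile(path)
    rdd
  }
}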
Refresher: what defines an RDD
• Dependencies
• Logic
• Result
Refresher: what defines an RDD
• Dependencies: [1, 2, 3, 4, 5, …]
• Logic: map(x => x * 2)
• Result: [2, 4, 6, 8, 10, …]
A logical signature
result = logic(dependencies)
signature(result) = hash(signature(dependencies) + hash(logic))
A logical signature
val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))
val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash("textFile:" + uri)
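One concrete (purely illustrative) reading of `hash` in these formulas is a stable digest over strings; the URI below is a hypothetical example:

import java.security.MessageDigest

// Illustrative only: SHA-1 over UTF-8 bytes, hex-encoded.
def hash(s: String): String =
  MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// A source's signature depends only on how it was defined:
val uri = "hdfs:///data/input.txt"   // hypothetical input path
val sourceSignature = hash("textFile:" + uri)
// A derived dataset folds in a hash of the function applied to it:
// signature(rdd2) = hash(signature(rdd1) + hash(f))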
Wait, how do you hash a function?
val f = (x: Int) => x * 2
f compiles to a class file of JVM bytecode. Hash those bytes.
Change f to val f = (x: Int) => x * 3 and the class file will change.
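A rough sketch of that idea, assuming Scala 2.11-style compilation where each anonymous function gets its own class file that can be found as a classpath resource (not the spark-flow implementation):

import java.security.MessageDigest

// Hash the bytecode of the class that implements a function value.
// getResourceAsStream returns null if the class file is not on the classpath
// (e.g. REPL-defined functions), which this sketch does not handle.
def hashFunctionBytecode(f: AnyRef): String = {
  val cls = f.getClass
  val in = cls.getResourceAsStream("/" + cls.getName.replace('.', '/') + ".class")
  try {
    val bytes = Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
    MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString
  } finally in.close()
}

val f = (x: Int) => x * 2
hashFunctionBytecode(f)   // changes if f becomes (x: Int) => x * 3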
What if the function references other functions or static methods?
• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part
• Must follow dependencies recursively, identify all class files that the function depends on
• Make a combined hash of the class files and any runtime values
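One way such a recursive walk could look, sketched with the ASM bytecode library (an assumption; spark-flow's actual approach may differ), restricted to the application's own packages and ignoring runtime values for brevity:

import java.security.MessageDigest
import org.objectweb.asm.{ClassReader, ClassVisitor, MethodVisitor, Opcodes}

// Classes referenced by method calls and field accesses in the given class file.
def referencedClasses(bytes: Array[Byte]): Set[String] = {
  val found = scala.collection.mutable.Set.empty[String]
  new ClassReader(bytes).accept(new ClassVisitor(Opcodes.ASM5) {
    override def visitMethod(access: Int, name: String, desc: String,
                             sig: String, exceptions: Array[String]): MethodVisitor =
      new MethodVisitor(Opcodes.ASM5) {
        override def visitMethodInsn(op: Int, owner: String, name: String,
                                     desc: String, itf: Boolean): Unit =
          found += owner.replace('/', '.')
        override def visitFieldInsn(op: Int, owner: String, name: String,
                                    desc: String): Unit =
          found += owner.replace('/', '.')
      }
  }, 0)
  found.toSet
}

// Assumes the class file is available on the classpath.
def classBytes(name: String): Array[Byte] = {
  val in = Thread.currentThread.getContextClassLoader
    .getResourceAsStream(name.replace('.', '/') + ".class")
  try Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  finally in.close()
}

// Walk references transitively within `ownPackage`, hashing every class file seen.
// Runtime values (e.g. Values.v above) would also need to be captured and hashed,
// which this sketch omits.
def dependencySignature(f: AnyRef, ownPackage: String): String = {
  val digest = MessageDigest.getInstance("SHA-1")
  var seen = Set.empty[String]
  var queue = List(f.getClass.getName)
  while (queue.nonEmpty) {
    val name = queue.head; queue = queue.tail
    if (!seen(name) && name.startsWith(ownPackage)) {
      seen += name
      val bytes = classBytes(name)
      digest.update(bytes)
      queue = referencedClasses(bytes).toList.sorted ++ queue
    }
  }
  digest.digest().map("%02x".format(_)).mkString
}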
Putting this together
source → Stage1 → Stage1Output → Stage2 → output
stage1Out = stage1(source).checkpoint()
output = stage2(stage1Out)
Howtoswitchincheckpoints
• Oneoption:changespark.• Otheroption:buildontopofspark.
Distributed Collection (DC)
• Mostly the same API as RDD: map, flatMap, and so on
• Can be instantiated without a SparkContext
• Pass in a SparkContext to get the corresponding RDD
• Computes its logical signature upon definition, before materialization
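To make the idea concrete, here is a stripped-down sketch of what such a DC could look like; the names and structure are illustrative, not the spark-flow API, and the signature of a function is approximated by its class name rather than a bytecode hash:

import java.security.MessageDigest
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Sig {
  def hash(s: String): String =
    MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
}

// A lazy description of a computation: it knows its logical signature at
// definition time and only produces an RDD once given a SparkContext.
abstract class DC[A](val signature: String) extends Serializable {
  protected def compute(sc: SparkContext): RDD[A]
  def getRDD(sc: SparkContext): RDD[A] = compute(sc)

  def map[B: ClassTag](f: A => B): DC[B] = {
    val parent = this
    // child signature = hash(parent signature + identity of f); a real
    // implementation would hash f's bytecode as described earlier
    new DC[B](Sig.hash(parent.signature + f.getClass.getName)) {
      protected def compute(sc: SparkContext): RDD[B] = parent.getRDD(sc).map(f)
    }
  }

  def filter(p: A => Boolean): DC[A] = {
    val parent = this
    new DC[A](Sig.hash(parent.signature + p.getClass.getName)) {
      protected def compute(sc: SparkContext): RDD[A] = parent.getRDD(sc).filter(p)
    }
  }
}

object DC {
  // A source defined without any SparkContext.
  def parallelize[A: ClassTag](data: Seq[A]): DC[A] =
    new DC[A](Sig.hash("parallelize:" + data.toString)) {
      protected def compute(sc: SparkContext): RDD[A] = sc.parallelize(data)
    }
}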
RDD vs DC
DC:
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)
val doubledRDD: RDD[Int] = doubled.getRDD(sc)
RDD:
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)
Easy checkpoint
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)
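Tying the two sketches together: a checkpoint on a DC could keep the DC's signature but route materialization through the signature-derived path of the earlier `checkpoint` sketch. This is again illustrative, not the library's code, and assumes the imports and definitions from the sketches above:

// Builds on the DC and checkpoint sketches above.
def checkpointed[A: ClassTag](dc: DC[A], baseDir: String): DC[A] =
  new DC[A](dc.signature) {
    protected def compute(sc: SparkContext): RDD[A] =
      checkpoint(sc, dc.signature, baseDir) { dc.getRDD(sc) }
  }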
Problem: Non-lazy definition
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)
We can't get a signature of `normalized` without computing the sum!
Solution: DeferredResult
• Defined by a function on an RDD
• Like a DC but contains only one element
• Pass in a SparkContext to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
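Continuing the DC sketch from above (and reusing its Sig and DC definitions), a DeferredResult could look roughly like this; the names are illustrative, not the spark-flow API, and `sum` and `withResult` are written as standalone helpers rather than methods on DC:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A single deferred value derived from a DC, with its own logical signature.
abstract class DR[A](val signature: String) extends Serializable {
  def get(sc: SparkContext): A
}

def sum(dc: DC[Double]): DR[Double] =
  new DR[Double](Sig.hash(dc.signature + ":sum")) {
    def get(sc: SparkContext): Double = dc.getRDD(sc).sum()
  }

// Pair every element with the deferred value; the value is only computed
// when the pipeline is actually materialized with a SparkContext.
def withResult[A, B](dc: DC[A], dr: DR[B]): DC[(A, B)] =
  new DC[(A, B)](Sig.hash(dc.signature + dr.signature)) {
    protected def compute(sc: SparkContext): RDD[(A, B)] = {
      val b = dr.get(sc)
      dc.getRDD(sc).map(a => (a, b))
    }
  }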
DeferredResult example
val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map { case (x, s) => x / s }
We can compute a signature for `normalized` without computing a sum.
Improving the workflow
• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
Can I use this?
Yes! Check it out: https://github.com/bloomberg/spark-flow
• requires Spark 2.0
• still early, must build from source currently
• hopefully available on Spark Packages soon
Questions?
[email protected]
@nimbusgo on Twitter / Github
https://github.com/bloomberg/spark-flow
techatbloomberg.com