Interactive Debugging for Big Data Analytics
Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
University of California, Los Angeles
Debugging Big Data Analytics
• Today’s platforms lack debugging support
– Programs (i.e., queries, jobs) are batch executed / black boxes
– Errors reflect low-level details (e.g., task id?!) not relevant to the logical bug
– Long program execution time => long development cycles
• What do programmers do?
– Trial-and-error debugging on sample data
– Post-mortem analysis of error logs
– Analyze the physical view of the execution (a job id, failed node, etc.)
Trying to debug a Spark Application on a cluster…
“I would like to understand the flow of control through the Spark source code on the worker nodes when I submit my application… I am assuming I should set up Spark on Eclipse… to enable stepping through Spark source code on the worker nodes.”
After a year, still no good answers!
BigDebug Project Overview
BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark [ICSE 2016]
– Simulated Breakpoint
– On-Demand Watchpoint
– Crash Culprit
– Remediation
– Forward Backward Tracing
Titian: Data Provenance for Fine-Grained Tracing [PVLDB 2016]
Vega: Incremental Computation for Interactive Debugging [Under Review]
Example Query Development Session
• Dataset: NYC Open Data Project
– Calls to non-emergency service centers
– Dataset contains call records for 2010-2015
• Record contents: call time, agency, caller location, etc.
• Query: Identify the agencies that received the most calls during busy hours
– E.g., busy hour if number of calls > 10,000
Spark Program
case class Calls(id: String, hour: Int, agency: String, ...)
val format = new SimpleDateFormat("M/d/y h:m:s a")
val input = sc.textFile("hdfs://...")
val calls = input.map(_.split(","))
  .map(r => Calls(r(0), format.parse(r(1)).getHours, r(2), ...))
calls.registerTempTable("calls")
val hist = sqlContext.sql("""
  SELECT agency, count(*)
  FROM calls JOIN (
    SELECT hour FROM calls GROUP BY hour HAVING count(*) > 100000) counts
  ON calls.hour = counts.hour
  GROUP BY agency""")
hist.show()
Extract dataset from HDFS
Transform it into a DataFrame (i.e., table)
Load it into Spark SQL
Express query in Spark SQL
Identify the busy hours, i.e., #calls > 10,000
Join busy hours with calls, then group by agency and count the number of “calls” received by each agency
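The two steps of the query can be tried out on a small in-memory collection before running on the cluster. The following is only a sketch in plain Scala collections, not the Spark program from the slides: `Call`, `BusyHours`, and both method names are hypothetical, and the busy-hour threshold is passed as a parameter.

```scala
// Sketch only: plain Scala collections stand in for the Spark SQL query.
// `Call` and `BusyHours` are illustrative names, not part of the original program.
final case class Call(id: String, hour: Int, agency: String)

object BusyHours {
  // Step 1: hours whose call count exceeds the threshold (the HAVING subquery).
  def busyHours(calls: Seq[Call], threshold: Int): Set[Int] =
    calls.groupBy(_.hour).collect {
      case (hour, group) if group.size > threshold => hour
    }.toSet

  // Step 2: keep calls made in busy hours (the JOIN), then count per agency (GROUP BY).
  def callsPerAgency(calls: Seq[Call], threshold: Int): Map[String, Int] = {
    val busy = busyHours(calls, threshold)
    calls.filter(c => busy.contains(c.hour))
      .groupBy(_.agency)
      .map { case (agency, group) => agency -> group.size }
  }
}
```

With a made-up sample of three calls at hour 9 and one at hour 14, `BusyHours.callsPerAgency(sample, threshold = 2)` counts only the hour-9 calls.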
Debugging Query Results
• Analyst observes some unexpected results
– Agencies that should not appear
  • e.g., Brooklyn Public Library
– Expected agencies that should appear
  • e.g., NYPD, NYFD
• Titian support for query triage
– Analyst can trace back from outlier results to contributing data at some intermediate stage
– Analyst can execute queries against intermediate data leading to outlier results
Query Triage with Titian
• Intermediate results for subquery
– Trace back to subquery and show distribution of calls per hour
– On intermediate data leading to outlier results
SELECT hour, count(*) FROM calls GROUP BY hour
Significant skew in the midnight hour = 0!
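The slides do not show Titian's API, so the following is a hypothetical plain-Scala sketch of the idea rather than Titian itself. It assumes each aggregate row carries the ids of the input records that produced it, so an outlier row can be traced back and the per-hour distribution computed only on the traced records. All names (`CallRecord`, `AgencyCount`, `Traceback`) are illustrative.

```scala
// Sketch only: NOT Titian's API. Lineage is modelled by carrying input ids
// alongside each aggregate so an output row can be traced back to its inputs.
final case class CallRecord(id: String, hour: Int, agency: String)
final case class AgencyCount(agency: String, count: Int, lineage: Set[String])

object Traceback {
  // Aggregate per agency while remembering which input ids contributed.
  def countWithLineage(calls: Seq[CallRecord]): Seq[AgencyCount] =
    calls.groupBy(_.agency).toSeq.map { case (agency, group) =>
      AgencyCount(agency, group.size, group.map(_.id).toSet)
    }

  // Backward trace: recover the input records behind one output row.
  def traceBack(row: AgencyCount, calls: Seq[CallRecord]): Seq[CallRecord] =
    calls.filter(c => row.lineage.contains(c.id))

  // Triage query on the traced records: distribution of calls per hour.
  def callsPerHour(traced: Seq[CallRecord]): Map[Int, Int] =
    traced.groupBy(_.hour).map { case (hour, group) => hour -> group.size }
}
```

Tracing back from a suspicious agency row and calling `callsPerHour` on the result is the in-memory analogue of running the per-hour query on the intermediate data behind the outlier.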
Identify Bug and Revise the Query
• The Bug
– System assigns default value hour = 0 for…
– Calls that did not log a time
• Possible course of action
– Filter out calls assigned to hour = 0
SELECT agency, count(*)
FROM calls JOIN (
  SELECT hour FROM calls WHERE hour != 0 GROUP BY hour HAVING count(*) > 100000) counts
ON calls.hour = counts.hour
GROUP BY agency
Introduce predicate that filters out midnight hour
Vega: Re-execute Revised Query
• Vega materializes intermediate stage results
– i.e., the previous subquery result is saved
• Vega Query Rewriter leverages this to rewrite the query into…
SELECT agency, count(*)
FROM calls JOIN counts
ON calls.hour = counts.hour
WHERE counts.hour != 0
GROUP BY agency
counts: materialized result from previous execution
WHERE counts.hour != 0: rewritten filter that removes hour 0 from joining records
Vega: Modified Query Evaluation
• Execute an incremental join
– “Diff” records specify changes in the (join) result
– For this example, we incrementally remove all records for hour 0 from join and final aggregation results
• Vega Optimizer Results
Consequence: over an order-of-magnitude runtime improvement
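The incremental step can be illustrated with a small sketch. This is not Vega's implementation; it assumes a "diff" is simply the set of join records invalidated by the new predicate, and that the previous per-agency counts are patched by subtraction instead of being recomputed. `Joined` and `IncrementalUpdate` are hypothetical names.

```scala
// Sketch only: NOT Vega's implementation. A "diff" is modelled as the join
// records removed by the new predicate; counts are patched, not recomputed.
final case class Joined(hour: Int, agency: String)

object IncrementalUpdate {
  // Diff records: join output rows invalidated by the new filter hour != 0.
  def removedByFilter(joined: Seq[Joined]): Seq[Joined] =
    joined.filter(_.hour == 0)

  // Apply the diff to the previous aggregation by subtracting removed rows.
  // Agencies whose count drops to zero disappear from the result.
  def applyDiff(previous: Map[String, Int], removed: Seq[Joined]): Map[String, Int] =
    removed.foldLeft(previous) { (counts, r) =>
      val updated = counts.getOrElse(r.agency, 0) - 1
      if (updated > 0) counts.updated(r.agency, updated) else counts - r.agency
    }
}
```

The work done by `applyDiff` is proportional to the number of removed records, not to the size of the full join, which is the intuition behind evaluating only the changes.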
Automated Isolation of Failure-Inducing Inputs for Big Data Analytics
• When a program fails, a user may want to investigate a subset of the original input inducing a crash, a failure, or a wrong outcome.
• Delta Debugging [Zeller 1999]
– Well-known debugging algorithm for minimizing failure-inducing inputs
– Requires multiple runs to isolate failure-inducing inputs
Background: Delta Debugging [Zeller, FSE 1999]
First, we run the test to find the failure-inducing input dataset
Second, we split the failing input data
[Figure: Test Fails → Split → Test Passes / Test Fails → Split → …]
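The split-and-retest loop can be sketched as a ddmin-style minimizer in plain Scala. This is a simplified illustration in the spirit of Zeller's algorithm, not the implementation used in this work; `fails` is a hypothetical test oracle that returns true when the program fails on the given input.

```scala
import scala.annotation.tailrec

// Sketch of ddmin-style input minimisation (after Zeller's delta debugging).
// `fails` is a hypothetical oracle: true when the program fails on the input.
object DeltaDebug {
  def ddmin[A](input: Vector[A], fails: Vector[A] => Boolean): Vector[A] = {
    require(fails(input), "the full input must fail")

    @tailrec
    def loop(current: Vector[A], n: Int): Vector[A] =
      if (current.size < 2) current
      else {
        // Split the failing input into (at most) n chunks.
        val chunkSize = math.ceil(current.size.toDouble / n).toInt
        val chunks = current.grouped(chunkSize).toVector
        // Complement i: the input with chunk i removed.
        def complement(i: Int): Vector[A] =
          chunks.zipWithIndex.collect { case (c, j) if j != i => c }.flatten

        chunks.find(fails) match {
          case Some(chunk) => loop(chunk, 2) // a single chunk still fails
          case None =>
            chunks.indices.map(complement).find(c => c.nonEmpty && fails(c)) match {
              case Some(rest) => loop(rest, math.max(n - 1, 2)) // a complement still fails
              case None if n < current.size =>
                loop(current, math.min(current.size, 2 * n)) // split more finely
              case None => current // nothing smaller fails at the finest split
            }
        }
      }

    loop(input, 2)
  }
}
```

Every call to `fails` stands for one full run of the program on a candidate input, which is why the number of runs matters so much at big-data scale.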
Scalable Automated Isolation of Failure-Inducing Inputs
• Leverage data provenance to reduce search space
– Avoid costly executions on data not relevant to the bug
• Leverage Vega to optimize subsequent runs
[Figure labels: Delta Debugging, Titian]
Conclusion
• BigDebug Project
– Debugging Primitives for Interactive Big Data Processing in Apache Spark
– https://sites.google.com/site/sparkbigdebug/
• Titian: Interactive Data Provenance
– Supports traceback queries from a set of results
– Execution replay from an intermediate point
• Vega: Optimizing modified query execution
– Novel query rewrite mechanism that pushes changes backwards to save work
– Incremental evaluation that operates on data changes induced by query modifications