Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Uploaded by spark-summit
Transcript of Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Introduction
• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop, etc.
• Multiple telco use cases lately, with challenging data volume and distribution
Agenda
• Our data-skew story
• Problem definition & aims
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Visualization
• Conclusion
Motivation
• We developed an application aggregating telco data that tested well on toy data
• When deployed against the real dataset, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Our data-skew story
• We have use cases where map-side combine is not an option: groupBy, join
• 80% of the traffic is generated by 20% of the communication towers
• Most of our data follows this 80–20 rule
The problem
• Default hashing is not going to distribute the data uniformly
• Some unknown partition(s) on the reducer side will contain a lot of records
• Slow tasks will appear
• The data distribution is not known in advance
• "Concept drifts" are common

(Figure: at the stage boundary & shuffle, even partitioning of skewed data produces slow tasks.)
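The effect on this slide can be illustrated with a small, Spark-free Python sketch (the key names and counts are made up): because every record of a key hashes to the same partition, one heavy key inflates a single reducer-side partition, and hence a single task.

```python
from collections import Counter
import zlib  # crc32 gives a deterministic hash, unlike Python's seeded hash()

num_partitions = 4
# 80/20-style skew: one "heavy" tower produces most of the records
records = ["towerA"] * 80 + [f"tower{i}" for i in range(1, 21)]

# Default-style hash partitioning: partition = hash(key) mod numPartitions
sizes = Counter(zlib.crc32(key.encode()) % num_partitions for key in records)

# One partition holds at least the 80 "towerA" records, far above the
# uniform share of 25 out of 100
print(sorted(sizes.values(), reverse=True))
```

Whatever partition `towerA` lands in becomes the slow task, no matter how many partitions are configured.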
Aim
Generally, make Spark a data-aware distributed data-processing framework
• Collect information about the data characteristics on-the-fly
• Partition the data as uniformly as possible on-the-fly
• Handle arbitrary data distributions
• Mitigate the problem of slow tasks
• Be lightweight and efficient
• Should not require user guidance

spark.repartitioning = true
Architecture
(Diagram: each slave computes an approximate local key distribution & statistics (1), which are collected to the master (2); the master merges them into a global key distribution & statistics (3) and redistributes a new hash function to the slaves (4).)
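A minimal sketch of the master-side merge in steps 2–3 (the function and key names are hypothetical, not the talk's actual API): the global key distribution is simply the per-key sum of the workers' local histograms.

```python
from collections import Counter

def merge_local_histograms(local_histograms):
    """Master-side merge: sum the per-key counts reported by the workers."""
    merged = Counter()
    for hist in local_histograms:
        merged.update(hist)
    return merged

# Two workers report approximate local key distributions
worker1 = Counter({"towerA": 50, "towerB": 5})
worker2 = Counter({"towerA": 45, "towerC": 3})

global_hist = merge_local_histograms([worker1, worker2])
print(global_hist.most_common(1))  # [('towerA', 95)] - the globally heavy key
```

The global histogram is what the master then uses to construct the new hash function in step 4.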
Driver perspective
• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
Executor perspective
• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM
Scalable sampling
• Key distributions are approximated with a strategy that is
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable by using a back-off strategy

(Figure: key-frequency counters sampled at rate s_i; when the counters outgrow the bound T_B, they are truncated and the sampling rate backs off from s_i to s_i/b, then to s_{i+1}/b.)
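A hedged Python sketch of the back-off sampling strategy above (class and parameter names are assumptions, and the tiny bound is for illustration only): records are counted at the current sampling rate; whenever the counter map outgrows its size bound, the smallest counters are truncated and the sampling rate is divided by the back-off factor b, so the sampler stays lightweight on heavy-tailed key streams.

```python
import random

class BackoffSampler:
    """Approximate key-frequency sampler with truncation and rate back-off."""

    def __init__(self, initial_rate=1.0, backoff=2.0, max_keys=4, seed=0):
        self.rate = initial_rate   # current sampling rate s_i
        self.backoff = backoff     # back-off factor b
        self.max_keys = max_keys   # size bound on the counter map
        self.counters = {}
        self.rng = random.Random(seed)

    def record(self, key):
        if self.rng.random() < self.rate:
            self.counters[key] = self.counters.get(key, 0) + 1
            if len(self.counters) > self.max_keys:
                self._back_off()

    def _back_off(self):
        # Keep only the heaviest keys and sample more sparsely from now on.
        kept = sorted(self.counters.items(), key=lambda kv: -kv[1])[: self.max_keys]
        self.counters = dict(kept)
        self.rate /= self.backoff

sampler = BackoffSampler()
for key in ["a"] * 90 + list("bcdefghij"):
    sampler.record(key)
print(sampler.counters)  # the heavy key "a" survives every truncation
```

Light keys come and go from the map, but a genuinely heavy key is never truncated, which is exactly what the repartitioning decision needs.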
Complexity-sensitive sampling
• Reducer run-time can correlate strongly with the computational complexity of the values for a given key
• Calculating object size is costly on the JVM and in some cases shows little correlation with computational complexity (of the function on the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific value, consider its complexity
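The idea above can be sketched in a few lines of Python (the `weight` function is a hypothetical stand-in for a Weightable's complexity measure): instead of incrementing a key's counter by 1 per record, it is increased by the value's complexity, so keys with expensive values look proportionally heavier.

```python
from collections import Counter

def weight(value):
    # Assumed complexity measure: here, the number of elements the next
    # stage would have to process for this value.
    return len(value)

hist = Counter()
stream = [("k1", [1, 2, 3, 4, 5, 6, 7, 8]), ("k2", [1]), ("k2", [2])]
for key, value in stream:
    hist[key] += weight(value)  # complexity-sensitive, not count-sensitive

# k1 outweighs k2 despite having fewer records
print(hist)  # Counter({'k1': 8, 'k2': 2})
```

A plain record count would rank `k2` above `k1` here, even though `k1` is the key that will actually make its reducer slow.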
Scalable sampling in numbers
• In Spark, it has been implemented with Accumulators
  • Not the most efficient option, but we wanted minimal impact on the existing code
• Optimized with micro-benchmarking
• Main factors:
  • Sampling-strategy aggressiveness (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current (mapper) stage
• Impact on the current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5–15% to runtime
  • After repartitioning, we cut out the sampler's code path; in practice it then adds 0.2–1.5% to runtime
Main new task-level metrics
• TaskMetrics
  • RepartitioningInfo – new hash function, repartitioning strategy
• ShuffleReadMetrics
  • (Weightable)DataCharacteristics – used for testing the correctness of the repartitioning & for execution-visualization tools
• ShuffleWriteMetrics
  • (Weightable)DataCharacteristics – scanned periodically by Scanners based on the ScannerStrategy
  • insertionTime
  • repartitioningTime
  • inMemory
  • ofSpills
Scanner
• Instantiated for each task before the executor starts it
• Different implementations: Throughput, Timed, Histogram
• The ScannerStrategy defines:
  • when to send to the RTM,
  • the histogram-compaction level.
Decision to repartition
• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
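One of the decision inputs named above, the global histogram's distance from the uniform distribution, can be sketched as follows (the distance measure and the threshold are illustrative assumptions, not the talk's exact strategy): repartition only when some key's share exceeds the uniform per-partition share by more than a threshold.

```python
def distance_from_uniform(hist, num_partitions):
    """Excess of the heaviest key's share over the uniform 1/p share."""
    total = sum(hist.values())
    uniform_share = 1.0 / num_partitions
    return max(count / total for count in hist.values()) - uniform_share

def should_repartition(hist, num_partitions, threshold=0.1):
    return distance_from_uniform(hist, num_partitions) > threshold

skewed = {"a": 80, "b": 10, "c": 10}      # one key holds 80% of the records
balanced = {"a": 34, "b": 33, "c": 33}    # close to uniform

print(should_repartition(skewed, 4))      # True
print(should_repartition(balanced, 4))    # False
```

Gating the decision this way keeps the mechanism from repartitioning (and paying the repartitioning overhead) on workloads that are already well balanced.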
Construction of the new hash function

(Figure: the new hash function is built from three parts. cut_1 is a single-key cut: the top prominent keys are hashed to dedicated partitions. cut_p is a probabilistic cut: a key is hashed to a partly filled partition with probability proportional to the remaining space needed to reach the level l. cut_u is a uniform distribution over the remaining partitions.)
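A hedged Python reconstruction of this construction (the probabilistic cut is omitted for brevity, and all names are assumptions rather than the talk's actual code): the prominent keys each get a dedicated partition, and everything else is hashed uniformly over the remaining ones.

```python
def build_partitioner(heavy_keys, num_partitions):
    # Single-key cut: each prominent key is isolated in its own partition.
    isolated = {key: i for i, key in enumerate(heavy_keys)}
    # Uniform cut: the remaining partitions serve all other keys.
    rest = list(range(len(heavy_keys), num_partitions))

    def partition(key):
        if key in isolated:
            return isolated[key]
        return rest[hash(key) % len(rest)]

    return partition

partition = build_partitioner(["towerA", "towerB"], num_partitions=8)
print(partition("towerA"))                 # 0: isolated in its own partition
print(partition("towerX") in range(2, 8))  # True: light keys avoid 0 and 1
```

Isolating the heavy keys caps the size of the biggest partition, which is exactly the quantity the benchmark slides below measure.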
New hash function in numbers
• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example, on Strings):
  • Number of partitions: 512
  • HashPartitioner: average time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: average time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%
Repartitioning
• Reorganizes the previous naive hashing
• Usually happens in-memory
• In practice it adds an additional 1–8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
More numbers (groupBy)
MusicTimeseries – groupBy on tags from a listenings-stream

(Chart: size of the biggest partition vs. number of partitions, for naive hashing and Dynamic Repartitioning. Dynamic Repartitioning cuts the biggest partition from 134M to 83M records, a 38% reduction, and reaches a 64% reduction against the over-partitioning overhead & blind-scheduling case (92M), with an observed 3% map-side overhead.)
More numbers (join)
• Joining tracks with tags
• The tracks dataset is skewed
• Number of partitions is set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer stage runtime = 124 seconds
  • heavy map, only 0.9% overhead
Spark REST API
• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics
Execution visualization of Spark jobs
Repartitioning in the visualization
(Visualization: heavy keys create heavy partitions, i.e. slow tasks; after repartitioning, the heavy key is isolated and the size of the biggest partition is minimized.)
Conclusion
• Our Dynamic Repartitioning can handle data skew dynamically, on-the-fly, on any workload and with arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Visualizations can help developers better understand issues and bottlenecks of certain workloads
• Making Spark data-aware pays off