On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman
12th Workflows in Support of Large-Scale Science (WORKS), SuperComputing’17
November 13, 2017
Funded by the US Department of Energy under Grant #DE-SC0012636
OUTLINE
Introduction: Next-generation Supercomputers, Data-intensive Workflows, In-transit / In-situ
Burst Buffers: Overview, Node-local / Remote-shared, NERSC BB
Model and Design: BB Reservations, Execution Directory, I/O Read Operations
Workflow Application: CyberShake, Workflow Implementation, I/O Performance: Darshan
Experimental Results: Overall Write/Read Operations, I/O Performance per Process, Cumulative CPU Time, Rupture Files
Conclusion: Summary of Findings, Future Directions
A Brief Introduction and Contextualization
Traditionally, workflows have used the file system to communicate data between tasks.
To cope with increasing application demands on I/O operations, solutions targeting in situ and/or in transit processing have become mainstream approaches to attenuate I/O performance bottlenecks.
Next-generation of Exascale Supercomputers
Increased processing capabilities to over 10^18 Flop/s
Memory and disk capacity will also be significantly increased
Power consumption management
I/O performance of the parallel file system (PFS) is not expected to improve much
Data-Intensive Scientific Workflows
While in situ is well adapted for computations that conform with the data distribution imposed by simulations, in transit processing targets applications where intensive data transfers are required.
1000Genome Workflow: consumes/produces over 4.4 TB of data, and requires over 24 TB of memory across all tasks
CyberShake Workflow: consumes/produces over 700 GB of data
Improving I/O Performance with Burst Buffers
Burst buffers have emerged as a non-volatile storage solution that is positioned between the processors' memory and the PFS, buffering the large volume of data produced by the application at a higher rate than the PFS, while seamlessly draining the data to the PFS asynchronously.
In Transit Processing
[Figure: Placement of the burst buffer nodes within the Cori system (NERSC)]
A burst buffer consists of the combination of rapidly accessed persistent memory with its own processing power (e.g., DRAM), and a block of symmetric multi-processor compute accessible through high-bandwidth links (e.g., PCI Express)
Node-local vs Remote-shared
First Burst Buffers Use at Scale
Cori System (NERSC)
Cori is a petascale HPC system and #6 on the June 2017 Top500 list
NERSC BB is based on Cray DataWarp (Cray's implementation of the BB concept)
Each BB node contains a Xeon processor, 64 GB of DDR3 memory, and two 3.2 TB NAND flash SSD modules attached over two PCIe gen3 x8 interfaces, which is attached to a Cray Aries network interconnect over a PCIe gen3 x16 interface
~6.5 GB/sec of sequential read and write bandwidth
[Figure: Architectural overview of a burst-buffer node on Cori at NERSC. Labels: compute nodes (CN), Aries, burst buffer node (Xeon processor / DRAM, two flash SSD cards over x8 PCIe, x16 PCIe link), storage servers, storage area network.]
Model and Design: Enabling BB in Workflow Systems
Automated BB Reservations
BB reservation operations (either persistent or scratch) consist of creation and release, as well as stage-in and stage-out operations
Transient reservations: need to implement stage-in/out operations at the beginning/end of each job execution
Execution Directory
Automated mapping between the workflow execution directory and the BB reservation
No changes to the application code are necessary, and the application job directly writes its output to the BB reservation
I/O Read Operations
Read operations from the BB should be transparent to the applications
Approaches: point the execution directory to the BB reservation, or create symbolic links to data endpoints into the BB
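The symbolic-link approach for read operations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper name link_inputs_into_exec_dir, the directory names, and the use of temporary directories in place of a real BB mount point are all assumptions made for illustration.

```python
import os
import tempfile


def link_inputs_into_exec_dir(bb_dir, exec_dir, filenames):
    """Create symbolic links in exec_dir that point at files stored in bb_dir.

    Both arguments are plain paths; on a real system bb_dir would be the
    mount point of the burst buffer reservation (an assumption here).
    """
    os.makedirs(exec_dir, exist_ok=True)
    links = []
    for name in filenames:
        target = os.path.join(bb_dir, name)
        link = os.path.join(exec_dir, name)
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link or file from a previous run
        os.symlink(target, link)
        links.append(link)
    return links


def demo():
    """Exercise the helper with temporary directories standing in for the BB."""
    with tempfile.TemporaryDirectory() as root:
        bb_dir = os.path.join(root, "bb_reservation")  # hypothetical BB mount
        exec_dir = os.path.join(root, "exec_dir")      # hypothetical workflow dir
        os.makedirs(bb_dir)
        for name in ("fx.sgt", "fy.sgt"):
            with open(os.path.join(bb_dir, name), "w") as handle:
                handle.write("data for " + name)
        links = link_inputs_into_exec_dir(bb_dir, exec_dir, ["fx.sgt", "fy.sgt"])
        # The job opens the path in its execution directory and is served
        # from the (stand-in) BB location without any code change.
        with open(links[0]) as handle:
            return handle.read()
```

The application keeps opening files by their usual names in the execution directory, which is what makes the read path transparent.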
Workflow Application: CyberShake Workflow
CyberShake is high-performance computing software that uses 3D waveform modeling to calculate probabilistic seismic hazard analysis (PSHA) estimates for populated areas of California
Constructs and populates a 3D mesh of ~1.2 billion elements with seismic velocity data to compute Strain Green Tensors (SGTs)
Post-processing: SGTs are convolved with slip time histories for each of about 500,000 different earthquakes to generate synthetic seismograms for each event
We focus on the two CyberShake job types which together account for 97% of the compute time: the wave propagation code AWP-ODC-SGT, and the post-processing code DirectSynth
[Figure: CyberShake Workflow. CyberShake hazard map for Southern California, showing the spectral accelerations at a 2-second period exceeded with a probability of 2% in 50 years.]
Burst Buffers: Workflow Implementation
Workflow is composed of two tightly-coupled parallel jobs (SGT_generator and direct_synth), and two system jobs (bb_setup and bb_delete)
Generates/consumes about 550 GB of data
[Figure: A general representation of the CyberShake test workflow (Pegasus WMS). Jobs: bb_setup, SGT_generator, direct_synth (several instances), bb_delete. Files: fx.sgt, fx.sgtheader, fy.sgt, fy.sgtheader, seismogram, rotd, peakvals. Edge types: control flow, data flow.]
https://github.com/rafaelfsilva/bb-workflow
#SBATCH -p regular
#SBATCH -N 64
#SBATCH -C haswell
#SBATCH -t 05:00:00
#DW persistentdw name=csbb

bb_setup job (goes through the regular queuing process):
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=csbb capacity=700GB access=striped type=scratch
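As a rough illustration of how a workflow system might automate these job headers, the Python sketch below renders the directive lines shown on this slide from a few parameters. The function names (render_batch_header, setup_header, compute_header) are hypothetical, and only directives that appear on the slide are emitted; this is a sketch, not the Pegasus implementation.

```python
def render_batch_header(partition, nodes, constraint, walltime, bb_directive):
    """Return the directive lines of a batch script header as one string."""
    lines = [
        f"#SBATCH -p {partition}",
        f"#SBATCH -N {nodes}",
        f"#SBATCH -C {constraint}",
        f"#SBATCH -t {walltime}",
        bb_directive,
    ]
    return "\n".join(lines) + "\n"


def setup_header(name, capacity):
    """Header for the system job that creates the persistent reservation."""
    directive = (
        f"#BB create_persistent name={name} capacity={capacity} "
        "access=striped type=scratch"
    )
    return render_batch_header("debug", 1, "haswell", "00:05:00", directive)


def compute_header(name, nodes, walltime):
    """Header for a parallel job that attaches to an existing reservation."""
    directive = f"#DW persistentdw name={name}"
    return render_batch_header("regular", nodes, "haswell", walltime, directive)


if __name__ == "__main__":
    print(setup_header("csbb", "700GB"))
    print(compute_header("csbb", 64, "05:00:00"))
```

Keeping the reservation name as a single parameter means the setup job and the compute jobs cannot drift apart when the workflow is regenerated.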
Collecting I/O Performance Data with Darshan
Darshan: HPC I/O Characterization Tool
HPC lightweight I/O profiling tool that captures an accurate picture of I/O behavior (including POSIX IO, MPI-IO, and HDF5 IO) in MPI applications
Darshan is part of the default software stack on Cori
Experimental Results: Overall Write Operations
• Overall, write operations to the PFS (No-BB) have nearly constant I/O performance
• No-BB: ~900 MiB/s regardless of the number of nodes used
• Base values obtained for the BB executions (1 node, 32 cores) are over 4,600 MiB/s, and peak values scale up to ~8,200 MiB/s for 32 nodes (1,024 cores)
• Slight drop in the I/O performance (# nodes ≥ 64): large number of concurrent write operations
[Figure: Average I/O performance estimate for write operations at the MPI-IO layer (left), and average I/O write performance gain (right) for the SGT_generator job. x-axis: # Nodes (1 to 313); y-axes: MiB/s (left), Performance Gain (right); series: BB, No-BB.]
Experimental Results: Overall Read Operations
• I/O read operations from the PFS yield similar performance regardless of the number of nodes used: ~500 MiB/s
• BB: single-node performance of 4,000 MiB/s, peak values up to about 8,000 MiB/s
• Small drop in the performance for runs using 64 nodes or above, which may indicate an I/O bottleneck when draining the data to/from the underlying parallel file system
[Figure: I/O performance estimate for read operations at the MPI-IO layer (left), and average I/O read performance gain (right) for the direct_synth job. x-axis: # Nodes (1 to 128); y-axes: MiB/s (left), Performance Gain (right); series: BB, No-BB.]
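To put the quoted bandwidths side by side, the short Python sketch below computes the ratio between the BB and No-BB figures for write and read operations. The inputs are the rounded values quoted in the bullets, so the ratios are approximate; the dictionary layout and function names are illustrative assumptions, and the ratio computed here is not necessarily the quantity plotted as "Performance Gain".

```python
# Approximate bandwidths in MiB/s, as quoted in the slide bullets (rounded).
REPORTED = {
    "write": {"no_bb": 900, "bb_base": 4600, "bb_peak": 8200},
    "read": {"no_bb": 500, "bb_base": 4000, "bb_peak": 8000},
}


def bb_to_pfs_ratio(bb_bandwidth, pfs_bandwidth):
    """Ratio between the bandwidth observed with and without the BB."""
    return bb_bandwidth / pfs_bandwidth


def summarize(reported):
    """Return the single-node and peak BB/No-BB ratios for each operation."""
    summary = {}
    for operation, values in reported.items():
        summary[operation] = {
            "base_ratio": round(bb_to_pfs_ratio(values["bb_base"], values["no_bb"]), 1),
            "peak_ratio": round(bb_to_pfs_ratio(values["bb_peak"], values["no_bb"]), 1),
        }
    return summary


if __name__ == "__main__":
    for operation, ratios in summarize(REPORTED).items():
        print(operation, ratios)
```

With these rounded inputs the peak ratios come out at about 9 for writes and about 16 for reads, in the same range as the factors of 9 and 15 quoted in the concluding slide.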
Experimental Results: I/O Performance per Process
• POSIX operations (left) represent buffering and synchronization operations with the system
• POSIX values are negligible when compared to the job's total runtime (~8h for 64 nodes)
• MPI-IO: BB accelerates I/O read operations up to 10 times on average
• For larger configurations (≥ 32 nodes), the average time is nearly the same as when running with 16 nodes for the No-BB
[Figure: POSIX module data: average time consumed in I/O read operations per process for the direct_synth job. Panels: fx.sgt, fy.sgt; x-axis: # Nodes (1 to 128); y-axis: Time (seconds); series: BB, No-BB.]
[Figure: MPI-IO module data: average time consumed in I/O read operations per process for the direct_synth job. Panels: fx.sgt, fy.sgt; x-axis: # Nodes (1 to 128); y-axis: Time (seconds); series: BB, No-BB.]
Experimental Results: Cumulative CPU Time
• Averaged values (for up to thousands of cores) may mask slower processes
In some cases, e.g. 64 nodes, the slowest time consumed in I/O read operations can slow down the application up to 12 times the averaged value
• Ratio between the time spent in the user (utime) and kernel (stime) spaces; kernel time includes handling I/O-related interruptions, etc.
• Performance at 64 nodes is similar to 32 nodes, suggesting gains in application parallel efficiency would outweigh a slight I/O performance hit at 64 nodes and lead to decreased overall runtime
[Figure: Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job, for different numbers of nodes. Panels: BB, No-BB; x-axis: # Nodes (1 to 128); y-axis: Cumulative CPU time usage (%); series: stime, utime.]
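The point that averaged values may mask slower processes can be shown with a tiny Python sketch. The per-process times below are synthetic and purely illustrative (they are not measurements from the experiments), and the function name is hypothetical.

```python
from statistics import mean


def slowdown_vs_mean(per_process_seconds):
    """Return (mean, slowest, slowest / mean) for per-process I/O times."""
    average = mean(per_process_seconds)
    slowest = max(per_process_seconds)
    return average, slowest, slowest / average


# Synthetic data: eleven fast processes and one straggler.
SAMPLE = [1.0] * 11 + [13.0]

if __name__ == "__main__":
    average, slowest, ratio = slowdown_vs_mean(SAMPLE)
    print(f"mean={average:.1f}s slowest={slowest:.1f}s ratio={ratio:.1f}x")
```

In a tightly-coupled parallel job the slowest process tends to set the pace, so reporting the maximum alongside the mean gives a fuller picture than the mean alone.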
Experimental Results: Rupture Files
• A typical execution of the CyberShake workflow for a selected site in our experiment processes about 5,700 rupture files
• The processing of rupture files drives most of the CPU (user space) activities for the direct_synth job
• The use of a BB reduces the I/O processing time of the workflow jobs by about 15%, for both read and write operations
[Figure: Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job for different numbers of rupture files (workflow runs with 64 nodes). Panels: BB, No-BB; x-axis: # Rupture Files (1 to 5,700); y-axis: Cumulative CPU time usage (%); series: stime, utime.]
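For readers unfamiliar with the utime/stime split, the Python sketch below shows how user and kernel CPU time can be turned into the percentage shares used in this kind of plot. It relies on the standard os.times() call for the current process; how the authors actually collected these values is not stated on the slides, so this is only an assumed, illustrative approach with hypothetical function names.

```python
import os


def cpu_time_shares(utime, stime):
    """Return (user %, kernel %) of the cumulative CPU time."""
    total = utime + stime
    if total == 0:
        return 0.0, 0.0  # no CPU time recorded yet
    return 100.0 * utime / total, 100.0 * stime / total


def current_process_shares():
    """Shares for this Python process, based on os.times()."""
    times = os.times()
    return cpu_time_shares(times.user, times.system)


if __name__ == "__main__":
    user_share, kernel_share = current_process_shares()
    print(f"utime={user_share:.1f}% stime={kernel_share:.1f}%")
```

A larger kernel share typically points at time spent servicing I/O and other system calls rather than application computation.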
Conclusion and Future Work
Major Findings
• I/O write performance was improved by a factor of 9, and I/O read performance by a factor of 15
• Performance decreased slightly at node counts above 64 (potential I/O ceiling)
• I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications
• I/O contention may limit the broad applicability of burst buffers for all workflow applications (e.g., in situ processing)
What's Next?
• Solutions such as I/O-aware scheduling or in situ processing may also not fulfill all application requirements
We intend to investigate the use of combined in situ and in transit analysis
• Development of a production solution for the Pegasus workflow management system
Rafael Ferreira da Silva, Ph.D.
Research Assistant Professor, Computer Science Department
Computer Scientist, USC Information Sciences Institute
[email protected] – http://rafaelsilva.com
Thank You
Questions?