Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters


Transcript of Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters


[Figure: streamed I/O patterns (PFS-side) under I/O-ignorant vs. I/O-aware scheduling]

Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters


I/O-aware scheduling keeps allocated nodes in computation 100% of the time

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. National Science Foundation CCF-1318445/1318417.

Stephen Herbein¹, Dong H. Ahn², Don Lipari², Tom Scogland², Marc Stearman², Jim Garlick², Mark Grondona², Becky Springmeyer², Michela Taufer¹ (¹University of Delaware, ²Lawrence Livermore National Laboratory)

[Figure: system architecture with compute nodes (CN), burst buffer SSDs (BB), and the parallel file system; Peak FLOPS grows while PFS BW (1s-10s GB/s) lags far behind BB BW (100s GB/s)]

Scheduler Decision Time / Efficiency vs. Turnaround
• I/O-aware scheduling eliminates variability in job performance due to I/O contention
• I/O-aware scheduling is still viable for online batch job scheduling
• I/O-aware scheduling increases science output (>1.29x) in exchange for increased turnaround time (<1.52x)

Making the Scheduler I/O-aware

Modeling the I/O Contention
• Two scenarios are modeled (see the sketch after this list):
  § All jobs get their requested BW and extra BW remains
  § Smaller I/O requests are satisfied, larger requests contend for BW; no extra BW remains
• Contention occurs in case two and is modeled using an Interference Factor defined in [2]
• Four levels of PFS provisioning:
  § 0% (70 GB/s), 10% (63 GB/s), 20% (56 GB/s), and 30% (49 GB/s)
  § Simulates a small PFS or a reservation of BW for external sources of I/O
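A minimal sketch of the two scenarios, assuming a single shared PFS capacity and per-job bandwidth requests. The water-filling split below (smaller requests satisfied first, the remainder divided evenly among larger requests) is an illustrative stand-in; the poster's actual slowdown model is the Interference Factor of Dorier et al. [2].

```python
def allocate_bandwidth(requests, capacity):
    """Return the bandwidth each job actually receives."""
    if sum(requests) <= capacity:
        # Scenario 1: every job gets its requested BW; extra BW remains.
        return list(requests)
    # Scenario 2: satisfy smaller requests fully, then split what is left
    # evenly among the larger ones; no extra BW remains.
    alloc = [0.0] * len(requests)
    remaining = float(capacity)
    pending = sorted(range(len(requests)), key=lambda i: requests[i])
    while pending:
        fair_share = remaining / len(pending)
        i = pending[0]
        if requests[i] <= fair_share:
            alloc[i] = requests[i]       # small request: fully satisfied
            remaining -= requests[i]
            pending.pop(0)
        else:
            for j in pending:            # large requests: contend for BW
                alloc[j] = fair_share
            break
    return alloc

# Example: a 70 GB/s PFS with three jobs asking for 20, 30, and 40 GB/s.
print(allocate_bandwidth([20.0, 30.0, 40.0], 70.0))  # [20.0, 25.0, 25.0]
```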

Critical Questions
• Does I/O-aware scheduling:
  § Impact the percentage of time that nodes spend in computation?
  § Impact the variability of each individual job's performance?
  § Affect the time to make a scheduling decision?
• What is the trade-off between system efficiency and turnaround time?

Motivation
• The FLOPS vs. I/O imbalance can cause I/O contention
• Burst buffers (BB) and smart staging postpone contention
• Parallel file systems (PFSes) remain the main bottleneck

We propose a novel, I/O-aware batch scheduling algorithm that can manage I/O contention at the PFS level [1]

• Job 1, by itself, can be scheduled on the system
• Job 2 requests too much BW and can cause contention with Job 1
  § Job 2 is delayed until more BW is available (i.e., when Job 1 completes), as in the toy walkthrough below
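A toy walkthrough of this scenario, assuming a single 256 MB/s shared link and simple grant/release bookkeeping. The job names and rates mirror the scenario trees later in the poster, but the code itself is illustrative only.

```python
# Illustrative only: one shared 256 MB/s link and the grant/release
# bookkeeping an I/O-aware scheduler would keep per resource.
available_bw = 256.0  # MB/s still unallocated on the link

def try_start(job, bw_request):
    """Start the job only if its BW request fits; otherwise delay it."""
    global available_bw
    if bw_request <= available_bw:
        available_bw -= bw_request
        print(f"{job} started ({bw_request:.0f} MB/s granted)")
        return True
    print(f"{job} delayed (needs {bw_request:.0f}, only {available_bw:.0f} free)")
    return False

try_start("Job 1", 192.0)   # fits: granted
try_start("Job 2", 128.0)   # 192 + 128 > 256: delayed
available_bw += 192.0       # Job 1 completes and releases its BW
try_start("Job 2", 128.0)   # now fits: granted
```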

References:
[1] S. Herbein, D. H. Ahn, D. Lipari, T. R. Scogland, M. Stearman, M. Grondona, J. Garlick, B. Springmeyer, and M. Taufer, "Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters," in Proc. of the 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
[2] M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, "CALCioM: Mitigating I/O Interference in HPC Systems Through Cross-Application Coordination," in Proc. of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), May 2014.

Growing computational capability, stagnating I/O capabilities
• Without BBs, the bursty I/O goes straight to the PFS
• With BBs, the application sees much higher I/O BWs
• With BBs, the I/O to the PFS is a constant stream
• The PFS is now provisioned for the avg. I/O load (not the max load)

Modeling the I/O Subsystem

[Figure: the complex resource graph: 12 scalable units (SUs) of low-level switches with 18 nodes each, high-level switches, a core switch pool, a gateway node pool, and the PFS]

• Modeled system:
  § A 1,944 node / 12 SU cluster
  § I/O routed round-robin across core switches and gateway nodes
• Key simplifications:
  § Merge core switches and gateway nodes
  § Leverage BBs to model I/O as a constant stream rather than variable bursts

[Figure: the simplified resource tree: Lustre/parallel file system at the root, then core switches, low-level switches, and scalable units (SUs) of 18 nodes each]

From a complex resource graph… to a simple resource tree
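A minimal sketch of such a resource tree, assuming each vertex carries a bandwidth limit and aggregates its children's requests. The Resource class and fits() helper are hypothetical names for illustration, not Flux's actual API.

```python
# Hypothetical sketch of the simplified resource tree: every vertex has a
# BW limit, leaves carry per-job BW requests, inner vertices aggregate.
class Resource:
    def __init__(self, name, bw_limit, children=()):
        self.name, self.bw_limit = name, bw_limit
        self.bw_request = 0.0            # set on leaves by running jobs
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def request(self):
        """Total BW currently requested at or below this vertex."""
        return self.bw_request + sum(c.request() for c in self.children)

def fits(leaf, extra_bw):
    """Would granting extra_bw at this leaf violate any limit on the path
    up to the root? Adding BW at a leaf raises every ancestor's aggregate
    by the same amount."""
    node = leaf
    while node is not None:
        if node.request() + extra_bw > node.bw_limit:
            return False
        node = node.parent
    return True
```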

Test Configuration
• 2,500 jobs sampled from LLNL's workloads
  § Constant job I/O rate of 18 MB/s
• 3,888 node system model from LLNL's CTS-1
• I/O-aware/ignorant versions of EASY backfilling
  § Emulated using the Flux framework emulator

I/O-aware Scheduling Scenarios

Based on: N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, "On the Role of Burst Buffers in Leadership-class Storage Systems," MSST/SNAPI 2012.

From a talk by Lucy Nowell, DOE Program Director (DOE Workflow Workshop, Rockville, MD, April 20-21, 2015)

• I/O-aware means using I/O as a key constraint when scheduling jobs
  § Jobs are delayed if they would cause contention in the I/O subsystem
• I/O-aware schedulers keep track of I/O allocations and predict potential I/O contention using both the I/O subsystem and I/O contention models (a backfilling sketch follows)
• The Flux framework's global system view and resource description language enable the use of the I/O subsystem and contention models in a scheduler

Total System Performance / Individual Job Performance

I/O-ignorant schedule (Job 2 started immediately; two limits are overcommitted):
  Parallel File System: Limit 1024 MB/s, Request 576 MB/s
    Core Network Switch: Limit 512 MB/s, Request 576 MB/s (overcommitted)
      Lowest Level Switch: Limit 256 MB/s, Request 320 MB/s (overcommitted)
        Compute Node + Burst Buffer, Job1_0: Limit 192 MB/s, Request 192 MB/s
        Compute Node + Burst Buffer, Job2_0: Limit 192 MB/s, Request 128 MB/s
      Lowest Level Switch: Limit 256 MB/s, Request 256 MB/s
        Compute Node + Burst Buffer, Job2_1: Limit 192 MB/s, Request 128 MB/s
        Compute Node + Burst Buffer, Job2_2: Limit 192 MB/s, Request 128 MB/s

I/O-aware schedule (Job 2 delayed; every request stays within its limit):
  Parallel File System: Limit 1024 MB/s, Request 192 MB/s
    Core Network Switch: Limit 512 MB/s, Request 192 MB/s
      Lowest Level Switch: Limit 256 MB/s, Request 192 MB/s
        Compute Node + Burst Buffer, Job1_0: Limit 192 MB/s, Request 192 MB/s
        Compute Node + Burst Buffer, idle: Limit 192 MB/s, Request 0 MB/s
      Lowest Level Switch: Limit 256 MB/s, Request 0 MB/s
        Compute Node + Burst Buffer, idle: Limit 192 MB/s, Request 0 MB/s
        Compute Node + Burst Buffer, idle: Limit 192 MB/s, Request 0 MB/s
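The trees above can be reproduced with the hypothetical Resource/fits() sketch from the "Modeling the I/O Subsystem" section; the topology and numbers below mirror this scenario.

```python
# Rebuild the scenario with the Resource sketch (illustrative names;
# limits and requests are taken from the trees above).
cn = [Resource(f"cn{i}", 192.0) for i in range(4)]
switch_a = Resource("lowest-level switch A", 256.0, cn[:2])
switch_b = Resource("lowest-level switch B", 256.0, cn[2:])
core = Resource("core network switch", 512.0, [switch_a, switch_b])
pfs = Resource("parallel file system", 1024.0, [core])

cn[0].bw_request = 192.0                 # Job1_0 is already running
# Job 2 asks for 128 MB/s on each of cn1, cn2, cn3:
print([fits(c, 128.0) for c in cn[1:]])  # [False, True, True]
# cn1 fails because its switch would carry 192 + 128 = 320 > 256 MB/s, so
# the I/O-aware scheduler delays Job 2. (A full check would also aggregate
# all three requests at the core switch: 576 > 512 MB/s, as shown above.)
```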

[Figures: peak I/O bandwidth of Application 1, Application 2, and Application 3 under I/O-ignorant vs. I/O-aware scheduling; bursty I/O patterns vs. streamed I/O patterns (App-side)]

LLNL-POST-690319