Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC … salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/2016... · 2019-11-18


[Figure labels: I/O-ignorant; I/O-aware; Streamed I/O patterns (PFS-side)]

Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters

Motivation

Critical Questions

I/O-aware scheduling keeps allocated nodes in computation 100% of the time

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. National Science Foundation CCF-1318445/1318417.

Stephen Herbein¹, Dong H. Ahn², Don Lipari², Tom Scogland², Marc Stearman², Jim Garlick², Mark Grondona², Becky Springmeyer², Michela Taufer¹
¹University of Delaware, ²Lawrence Livermore National Laboratory

[Figure labels: Peak FLOPS; BB SSD; CN; Parallel File System; PFS BW (10s GB/s); BB BW (100s GB/s); PFS BW (1s GB/s)]

Scheduler Decision Time
Efficiency vs. Turnaround

I/O-aware scheduling eliminates variability in job performance due to I/O contention

I/O-aware scheduling is still viable for online batch job scheduling

I/O-aware scheduling increases science (>1.29x) in exchange for increasing turnaround time (<1.52x)

Making the Scheduler I/O-aware

Modeling the I/O Contention
• Two scenarios are modeled:
  § All jobs get their requested BW and extra BW remains
  § Smaller I/O requests are satisfied, larger requests contend for BW; no extra BW remains
• Contention occurs in case two and is modeled using an Interference Factor defined in [2]
• Four levels of PFS provisioning
  § 0% (70 GB/s), 10% (63 GB/s), 20% (56 GB/s), and 30% (49 GB/s)
  § Simulates a small PFS or a reservation of BW for external sources of I/O
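The two modeled scenarios can be illustrated with a small allocation sketch. It uses max-min fair sharing (water-filling) as a stand-in for the contention model; the Interference Factor from [2] is not reproduced here. The function name `allocate_bandwidth` and the sample request values are hypothetical.

```python
def allocate_bandwidth(requests, capacity):
    """Split `capacity` among `requests` (same units) by max-min fairness.

    Assumption: water-filling stands in for the Interference Factor of [2].
    Returns (allocations in the original order, leftover bandwidth).
    """
    alloc = [0.0] * len(requests)
    remaining = float(capacity)
    # Visit requests from smallest to largest so small ones are met first.
    order = sorted(range(len(requests)), key=lambda i: requests[i])
    for pos, i in enumerate(order):
        fair_share = remaining / (len(order) - pos)
        alloc[i] = min(requests[i], fair_share)
        remaining -= alloc[i]
    return alloc, remaining


# Scenario one: every request fits and extra bandwidth remains.
print(allocate_bandwidth([10, 20, 30], 70))
# Scenario two: small requests are met, the largest is cut back, nothing remains.
print(allocate_bandwidth([10, 20, 40], 56))
```

With a 70 GB/s limit the three sample requests are met in full and 10 GB/s is left over. With a 56 GB/s limit the 40 GB/s request receives only 26 GB/s and no bandwidth remains.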

• Does I/O-aware scheduling:
  § Impact percentage of time that nodes spend in computation?
  § Impact the variability of each individual job's performance?
  § Affect the time to make a scheduling decision?
• What is the trade-off between system efficiency and turnaround time?

• The FLOPS vs. I/O imbalance can cause I/O contention
• Burst buffers (BB) and smart staging postpone contention
• Parallel file systems (PFSes) remain the main bottleneck

We propose a novel, I/O-aware batch scheduling algorithm that can manage I/O contention at the PFS level [1]

• Job 1, by itself, can be scheduled on the system
• Job 2 requests too much BW and can cause contention with Job 1
  § Job 2 is delayed until more BW is available (i.e., when Job 1 completes)
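The delay decision in this scenario can be sketched as a search for the earliest time at which a job's bandwidth request fits. This is a minimal sketch over a single bandwidth pool, assuming each running job streams at a constant rate until it ends; the function name `earliest_start`, the completion time of 100, and the sample bandwidth values are hypothetical.

```python
def earliest_start(now, running, request, limit):
    """Return the earliest time at which `request` MB/s fits under `limit`.

    `running` is a list of (end_time, bandwidth) pairs for started jobs.
    Assumption: a job holds its bandwidth at a constant rate until it ends.
    """
    if request > limit:
        raise ValueError("request can never fit under the limit")
    in_use = sum(bw for _, bw in running)
    start = now
    # Release bandwidth in order of job completion until the request fits.
    for end_time, bw in sorted(running):
        if in_use + request <= limit:
            break
        in_use -= bw
        start = max(now, end_time)
    return start


# One running job uses 192 MB/s and ends at t=100; the limit is 512 MB/s.
running = [(100, 192)]
print(earliest_start(0, running, 128, 512))  # fits next to the running job
print(earliest_start(0, running, 384, 512))  # must wait for the running job
```

The first call returns the current time because 192 + 128 stays under the limit. The second returns 100, the completion time of the running job, because 192 + 384 would exceed it.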

References:
[1] S. Herbein, D. H. Ahn, D. Lipari, T. R. Scogland, M. Stearman, M. Grondona, J. Garlick, B. Springmeyer, and M. Taufer, "Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters," in Proc. of the 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
[2] M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, "CALCioM: Mitigating I/O Interference in HPC Systems Through Cross-Application Coordination," in Proc. of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), May 2014.

Growing computational capability; stagnating I/O capabilities
• Without BBs, the bursty I/O goes straight to the PFS
• With BBs, the application sees much higher I/O BWs
• With BBs, the I/O to the PFS is a constant stream
• PFS is now provisioned for avg. I/O load (not max load)
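The smoothing effect of a burst buffer can be illustrated with a toy simulation: the application writes in bursts, and the buffer drains to the PFS at a fixed rate. This is a sketch under simple assumptions (discrete time steps, unlimited buffer capacity); the function name `drain_through_burst_buffer` and the burst pattern are invented for illustration.

```python
def drain_through_burst_buffer(bursts, drain_rate):
    """Simulate a burst buffer that absorbs writes and drains at a fixed rate.

    `bursts[t]` is the data (GB) the application writes in time step t.
    Assumption: the buffer has unlimited capacity.
    Returns (PFS traffic per time step, peak buffer occupancy).
    """
    if drain_rate <= 0:
        raise ValueError("drain_rate must be positive")
    occupancy = 0.0
    peak = 0.0
    pfs_traffic = []
    for written in bursts:
        occupancy += written
        peak = max(peak, occupancy)
        drained = min(occupancy, drain_rate)
        occupancy -= drained
        pfs_traffic.append(drained)
    # Keep draining after the last burst until the buffer is empty.
    while occupancy > 0:
        drained = min(occupancy, drain_rate)
        occupancy -= drained
        pfs_traffic.append(drained)
    return pfs_traffic, peak


bursts = [8, 0, 0, 0, 8, 0, 0, 0]       # bursty application-side writes
average = sum(bursts) / len(bursts)      # 2 GB per step
traffic, peak = drain_through_burst_buffer(bursts, average)
print(max(bursts), max(traffic), peak)
```

Without the buffer the PFS would see a peak of 8 GB per step. Draining at the average rate gives the PFS a constant 2 GB per step, which is why the PFS can be provisioned for the average load.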

Modeling the I/O Subsystem

[Figure labels: Core Switch Pool; Gateway Node Pool; PFS; High Level Switches; Low Level Switches; Scalable Units (SUs): SU0, SU1, …, SU12; numeric labels: 1 to 9, 107, 108; 1 to 3; 18; 6]

• Modeled system:
  § A 1944 node / 12 SU cluster
  § I/O routed round-robin across core switches and gateway nodes
• Key simplifications:
  § Merge core switches and gateway nodes
  § Leverage BBs to model I/O as a constant stream rather than variable bursts
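The simplified resource tree can be sketched as nested levels. The counts of 12 SUs and 1944 nodes come from the text. Reading the figure's numeric labels as 108 low-level switches with 18 nodes each is an interpretation (108 × 18 = 1944 matches the stated node count), and 9 switches per SU follows from 108 / 12. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One vertex of the resource tree (hypothetical structure)."""
    name: str
    children: list = field(default_factory=list)

    def count_leaves(self):
        if not self.children:
            return 1
        return sum(child.count_leaves() for child in self.children)


def build_tree(num_sus=12, switches_per_su=9, nodes_per_switch=18):
    """Build PFS -> merged core level -> SUs -> low-level switches -> nodes."""
    pfs = Resource("pfs")
    # Core switches and gateway nodes are merged into a single level.
    core = Resource("core-switches")
    pfs.children.append(core)
    for su in range(num_sus):
        su_res = Resource(f"su{su}")
        core.children.append(su_res)
        for sw in range(switches_per_su):
            sw_res = Resource(f"su{su}-switch{sw}")
            su_res.children.append(sw_res)
            for n in range(nodes_per_switch):
                sw_res.children.append(Resource(f"su{su}-switch{sw}-node{n}"))
    return pfs


tree = build_tree()
print(tree.count_leaves())  # number of compute nodes in the tree
```

Because every level has a single parent, a bandwidth check for one node only needs to walk from that node up to the root.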

[Figure labels: Lustre / Parallel File System; Core Switches; High Level Switches; Low Level Switches; Scalable Units (SUs): SU0, SU1, …, SU12; numeric labels: 1 to 9, 107, 108; 1 to 3; 18; 6; 2]

From a complex resource graph… to a simple resource tree

Test Configuration
• 2,500 jobs sampled from LLNL's workloads
  § Constant job I/O rate of 18 MB/s
• 3,888 node system model from LLNL's CTS-1
• I/O-aware/ignorant versions of EASY backfilling
  § Emulated using the Flux framework emulator
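A quick calculation relates this configuration to the PFS provisioning levels listed on the poster. It is a sketch under two assumptions that the poster does not state: the 18 MB/s rate applies per node, and each percentage is a reduction from the 70 GB/s level. Decimal units (1 GB = 1000 MB) and all function names are also assumptions.

```python
NODES = 3888            # from the test configuration
RATE_MB_S = 18          # assumption: the 18 MB/s rate applies per node
FULL_PFS_GB_S = 70      # the 0% level from the provisioning list


def aggregate_demand_gb_s(nodes=NODES, rate_mb_s=RATE_MB_S):
    """Total streamed I/O if every node is busy (decimal GB/s)."""
    return nodes * rate_mb_s / 1000


def provisioned_bw_gb_s(percent_reduction):
    """PFS bandwidth left after removing `percent_reduction` percent."""
    return FULL_PFS_GB_S * (100 - percent_reduction) / 100


def nodes_at_full_rate(bw_gb_s, rate_mb_s=RATE_MB_S):
    """How many nodes can stream at the full rate without contention."""
    return min(NODES, int(bw_gb_s * 1000 // rate_mb_s))


print(aggregate_demand_gb_s())
for percent in (0, 10, 20, 30):
    bw = provisioned_bw_gb_s(percent)
    print(percent, bw, nodes_at_full_rate(bw))
```

Under these assumptions a fully busy system streams just under 70 GB/s, so the 0% level covers every node, while each lower level leaves part of the machine unable to stream at the full rate at the same time.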

I/O-aware Scheduling Scenarios

Based on: Liu, N., Cope, J., Carns, P., Carothers, C., Ross, R., Grider, G., Crume, A., Maltzahn, C., “On the Role of Burst Buffers in Leadership-class Storage Systems,” MSST/SNAPI 2012

From a talk of Lucy Nowell, DoE Program Director (DoE Workflow Workshop, Rockville, MD, April 20-21, 2015)

• I/O-aware means using I/O as a key constraint when scheduling jobs
  § Jobs are delayed if they would cause contention in the I/O subsystem
• I/O-aware schedulers keep track of I/O allocations and predict potential I/O contention using both the I/O subsystem and I/O contention models

The Flux framework's global system view and resource description language enable the use of I/O subsystem and contention models in a scheduler
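Tracking allocations against a resource tree can be sketched as an all-or-nothing admission check: a job is admitted only if every level between its burst buffers and the PFS has headroom. The limits below are the ones annotated in the poster's scenario figure (192, 256, 512, and 1024 MB/s). The placement of jobs on nodes, the `Level` class, and `try_allocate` are hypothetical and do not represent the Flux API.

```python
class Level:
    """One vertex of the resource tree with a bandwidth limit in MB/s."""

    def __init__(self, name, limit, parent=None):
        self.name, self.limit, self.parent = name, limit, parent
        self.allocated = 0

    def path_to_root(self):
        node = self
        while node is not None:
            yield node
            node = node.parent


def try_allocate(placements):
    """Allocate all (leaf, MB/s) placements of one job, or none of them."""
    demand = {}
    for leaf, bw in placements:
        for level in leaf.path_to_root():
            demand[level] = demand.get(level, 0) + bw
    # Reject the job if any level on any path would exceed its limit.
    if any(lvl.allocated + bw > lvl.limit for lvl, bw in demand.items()):
        return False
    for lvl, bw in demand.items():
        lvl.allocated += bw
    return True


pfs = Level("pfs", 1024)
core = Level("core-switch", 512, pfs)
switch_a = Level("switch-a", 256, core)
switch_b = Level("switch-b", 256, core)
bbs = [Level(f"bb{i}", 192, switch_a if i < 2 else switch_b) for i in range(4)]

first_ok = try_allocate([(bbs[0], 192)])
second_ok = try_allocate([(bbs[1], 128), (bbs[2], 128), (bbs[3], 128)])
print(first_ok, second_ok)
```

The first job is admitted. The second is rejected because it would push one low-level switch to 320 MB/s and the core switch to 576 MB/s, both above their limits, so an I/O-aware scheduler would delay it.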

Total System Performance
Individual Job Performance

[Figure: two resource trees annotated with bandwidth limits and requests]

Tree with Job10, Job20, Job21, and Job22:
• Parallel File System: Limit 1024 MB/s, Request 576 MB/s
• Core Network Switch: Limit 512 MB/s, Request 576 MB/s
• Lowest Level Switches: Limit 256 MB/s each; Requests 256 MB/s and 320 MB/s
• Compute Node Burst Buffers: Limit 192 MB/s each; Requests 192, 128, 128, and 128 MB/s
• Job requests: Job10 192 MB/s; Job20, Job21, and Job22 128 MB/s each

Tree with Job10 only:
• Parallel File System: Limit 1024 MB/s, Request 192 MB/s
• Core Network Switch: Limit 512 MB/s, Request 192 MB/s
• Lowest Level Switches: Limit 256 MB/s each; Requests 192 MB/s and 0 MB/s
• Compute Node Burst Buffers: Limit 192 MB/s each; Requests 192, 0, 0, and 0 MB/s
• Job requests: Job10 192 MB/s

[Figure labels: Peak I/O Bandwidth; I/O-ignorant; I/O-aware; Bursty I/O patterns; Streamed I/O patterns (App-side); Application 1, Application 2, Application 3]

LLNL-POST-690319