Edison and Cori: User Update
Zhengji Zhao, Helen He, Wahid Bhimji
NERSC User Group Meeting, Berkeley, CA, March 24, 2016
Edison Update
Zhengji Zhao
Edison upgrades (11/30/2015-3/15)
• Edison move, 11/30-12/23/2015
  – Edison disassembled, reassembled, integrated, reconfigured, and tested at CRT
  – 1/4/2016: users were enabled
  – Free charging period: 1/4-1/10/2016
• Switch to Slurm
  – Slurm configuration has been under continuous improvement and adjustment
  – Users needed a lot of help with running jobs and the workflow switch
  – Scheduling largely favors big jobs
  – Major issue is the slow queue turnaround; we are working on it
• /scratch3 upgrade to GridRaid
  – I/O performance issue is still under investigation
Edison upgrades
• Host IP change
  – Users had ssh login issues
• New SSH authentication mechanism (1/12/2016)
  – Caused login issues as well
• Edison experienced multiple planned and unplanned downtimes (power outage) during Jan-Mar 2016.
  – User jobs affected
• CDT upgrades on 12/23/2015 (15.12), 2/3/2016 (16.01), 3/22/2016 (16.03)
  – Encountered a few major bugs; workarounds were provided for all bugs, and major bugs were fixed as of 1/15. A remaining bug will be fixed in CDT 16.03. Fixes were in place on 3/22/2016.
  – Extended the CDT testing script to include more tests
  – Default option --craype-buildtools-check
Edison upgrades
• A shorter purging period will be in place effective 4/1/2016
  – Purging period will be 8 weeks (from 12 weeks)
  – 84%, 81% full on /scratch1 and /scratch2 file systems currently
• /scratch3 quota in place as of 3/17/2016
  – Quota: 100 TB disk space, 50,000,000 inodes
  – Quota check will be in place in the job submission filter; the submission fails if over quota
  – 74% full
• /scratch1 and /scratch2 will be upgraded to GridRaid, time TBD
  – Depending on when the current /scratch3 performance bug is resolved
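The quota check in the job submission filter can be pictured with a minimal sketch. It assumes the filter can obtain a user's current /scratch3 usage in bytes and inodes; the names `Usage` and `check_scratch3_quota`, and the reading of 1 TB as 10**12 bytes, are illustrative assumptions and not NERSC's actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Limits quoted on the slide; treating 1 TB as 10**12 bytes is an assumption.
SPACE_LIMIT_BYTES = 100 * 10**12
INODE_LIMIT = 50_000_000


@dataclass
class Usage:
    """Hypothetical snapshot of one user's /scratch3 usage."""
    bytes_used: int
    inodes_used: int


def check_scratch3_quota(usage: Usage) -> Tuple[bool, str]:
    """Return (accept, reason); reject the submission if over either limit."""
    if usage.bytes_used > SPACE_LIMIT_BYTES:
        return False, "over disk space quota"
    if usage.inodes_used > INODE_LIMIT:
        return False, "over inode quota"
    return True, "within quota"
```

A filter built this way would reject the job at submission time rather than letting it start and fail on its first write.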
SSP benchmark performance after the move
[Figure: NERSC6 SSP application runs under Slurm and Torque/Moab. x-axis: Applications (CAM, GAMESS, GTC, IMPACT-T, MAESTRO, MILC, PARATEC); y-axis: Time (s), 0 to 1200. Series: Slurm - Dedicated (12/23/2015, 1/1/2016); Torque/Moab - Production (11/24/2015); Torque/Moab - Dedicated (Acceptance).]
Edison performance monitoring: https://my.nersc.gov/benchmarks-cs.php
FCT performance regression is resolved
[Figure: FCT performance on Edison before and after the move to the CRT building. x-axis: Run date (12/27/15 to 1/31/16); y-axis: MPI_Alltoall time (sec), 0 to 80.]
Before the move:

Run date  output/KEY:tag     ntasks  MPI_Alltoall time (sec)
2/19/14   fct99p1.o785758    132367  40.94
1/15/15   fct100p1.o2274657  133296  33.36
I/O performance degradation after the Grid Raid upgrade is still in investigation
• This is roughly a 3x performance degradation.
• NERSC needs the recommendation from Cray and Seagate about how to run the IOR benchmark to compare with the MDGRID performance.
Date      Time  Job description: file-per-process IOR  Write (MB/s)  Read (MB/s)  Comment
12/24/15  2:22  FS3 1m2 1152ranks 8fpo 24ppn 144osts   21718         23952        GRIDRAID
12/24/15  7:48  FS3 1m2 288ranks 8fpo 24ppn 36osts     17602         22911        GRIDRAID
3/26/15   2:39  FS3 1m2 1152ranks 8fpo 24ppn 144osts   62663         54416        MDRAID
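The "roughly 3 times" figure can be checked against the table with a few lines of arithmetic. This sketch compares only the two matching 1152-rank, 144-OST runs; the variable and function names are illustrative.

```python
# Write/read bandwidths (MB/s) from the file-per-process IOR table,
# for the matching 1152-rank, 144-OST runs.
mdraid = {"write": 62663, "read": 54416}      # 3/26/15, before the upgrade
gridraid = {"write": 21718, "read": 23952}    # 12/24/15, after the upgrade


def slowdown(before: float, after: float) -> float:
    """Factor by which bandwidth dropped (before / after)."""
    return before / after


write_factor = slowdown(mdraid["write"], gridraid["write"])
read_factor = slowdown(mdraid["read"], gridraid["read"])
print(f"write: {write_factor:.2f}x, read: {read_factor:.2f}x")
# prints "write: 2.89x, read: 2.27x"
```

On these two runs the write path accounts for most of the quoted degradation, while reads dropped by a somewhat smaller factor.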
Cori Update
Helen He and Wahid Bhimji
Cori Usage Info
• 11/12/2015: All users enabled
• 11/30/2015-1/4/2016: Edison offline
• 12/15/2015: Hopper retired
• 1/12/16: Cori started charging when AY16 began
More large jobs during the free time
Cori Usage Info: Free Period and AY16
162M MPP hours used (10/29/15-1/11/16)
75.8M MPP hours used (1/12/16-3/22/16)
• Early users were enabled in 7 phases
  – Phasing allowed the Cori system to become ready in various aspects (networking, programming environment, batch system, etc.)
Cori Phase 1 Data Features
• File Systems
  – Burst Buffer for high bandwidth, low latency I/O
  – High performance Lustre file system: 28 PB of disk, >700 GB/s I/O bandwidth
  – Cross mounting of file systems (Cori scratch on Edison and DTNs) (TBA)
  – Large amount of memory per compute node (128 GB) as well as some high memory login nodes (775 GB)
• Networking
  – Improved outbound Internet connections (e.g. to access a database in another center)
  – Software Defined Networking R&D for high bandwidth transfers in and out of the compute node (TBA)
• On-node software
  – Improved shared library performance
  – User-defined images/Shifter
Cori Phase 1 Data Features (SLURM)
• Cori Phase 1 is also known as the “Cori Data Partition”
• Designed to accelerate data-intensive applications, with high throughput and “realtime” needs.
  – “shared” partition. Multiple jobs on the same node. Larger submit and run limits. 40 nodes set aside
  – The 1-2 node bin in “regular” for high throughput jobs. Large submit and run limits.
  – “realtime” partition for jobs requiring realtime data analysis. Highest queue priority. Special permission only.
  – Internal sshd (CCM mode) in any queue
  – Large number of login/interactive nodes to support applications with advanced workflows
  – “burst buffer” usage integrated in SLURM, in early user period.
  – Encourage users to run jobs using 683+ nodes on Edison, with queue priority boost and 40% charging discount there.
Transition from Hopper/Edison to Cori
• Programming environment is very similar to Hopper/Edison. Porting to Cori is straightforward with regard to software building.
• The aspect that users need to adjust to the most is the transition from Torque/Moab to SLURM.
• Provided detailed documentation: a SLURM transition guide, example batch scripts, and tutorials.
• Worked with some specific applications and users on porting. CESM is one such example. It is a new machine port, with bl required.
SLURM Batch Scheduler Adoption
• Overall SLURM adoption is smooth.
• Easy to use “premium”, “ccm”; good support and usage for “shared” and “realtime”.
• A few traps (with user education):
  – Hyperthreading is on by default
    • SLURM sees 64 CPUs per node
    • Asking for nodes with “#SBATCH -n”, but without “#SBATCH -N”, may get half the nodes desired
    • Need to set OMP_NUM_THREADS=1 explicitly to run with pure MPI (for hybrid MPI/OpenMP programs compiled with OpenMP enabled)
  – Automatic process and thread affinity is good. Can explore advanced settings for more complicated binding options.
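The “-n without -N” trap follows from simple arithmetic. The sketch below assumes 2 hardware threads per physical core (so 32 physical cores behind the 64 CPUs SLURM sees) and that, given only a task count, SLURM packs one task per logical CPU; the function names are illustrative.

```python
import math

LOGICAL_CPUS_PER_NODE = 64   # what SLURM sees with hyperthreading on (from the slide)
THREADS_PER_CORE = 2         # assumption: 2 hyperthreads per physical core
PHYSICAL_CORES_PER_NODE = LOGICAL_CPUS_PER_NODE // THREADS_PER_CORE


def nodes_allocated(ntasks: int) -> int:
    """Nodes granted when only '#SBATCH -n ntasks' is given (one task per logical CPU)."""
    return math.ceil(ntasks / LOGICAL_CPUS_PER_NODE)


def nodes_desired(ntasks: int) -> int:
    """Nodes needed to place one MPI task per physical core."""
    return math.ceil(ntasks / PHYSICAL_CORES_PER_NODE)


ntasks = 128
print(nodes_allocated(ntasks), nodes_desired(ntasks))  # prints "2 4"
```

Requesting the node count explicitly with “#SBATCH -N” avoids the mismatch.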
Batch Job Wait Time
• Users reported LONG wait times for jobs
• Monitoring and tuning the SLURM configuration is an ongoing task
• Changes made on Jan 15
  – Added max number of backfill jobs per partition (on top of max number of backfill jobs per user)
  – Decreased max size of debug from 128 to 112.
  – Communicated with individual users to use the “shared” partition, job arrays, and bundling jobs.
  – Jobs not planned to run in AY16 were deleted
  – Most debug jobs then started within 30 min instead of hours; many now start in a few min.
  – Regular job wait times are significantly shorter too
• Changes made on Mar 22 to the scheduling algorithm greatly increased system utilization (keep watching)
NERSC Custom Queue Monitoring Script
• Original “sqs” provides basic batch job info plus the job ranking based on start time provided by the backfill scheduler.
• A new version of “sqs” was deployed on Jan 19 with two columns of ranking values to give users more perspective on their jobs in the queue.
  – Added job priority ranking with absolute priority value (a function of partition, QOS, job wait time, and fairshare)
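To illustrate how a priority ranking could be derived from those four factors, here is a minimal sketch using a weighted sum in the style of multifactor priority. The weights, the `Job` type, and the function names are hypothetical; the real weights live in the site's SLURM configuration and this is not the “sqs” implementation.

```python
from dataclasses import dataclass

# Hypothetical weights; real values are set in the site's SLURM configuration.
WEIGHTS = {"partition": 1000, "qos": 2000, "age": 500, "fairshare": 1500}


@dataclass
class Job:
    job_id: int
    partition: float   # each factor is assumed normalized to 0.0-1.0
    qos: float
    age: float         # grows with job wait time
    fairshare: float


def priority(job: Job) -> float:
    """Weighted sum of normalized factors (multifactor-style priority)."""
    return (WEIGHTS["partition"] * job.partition
            + WEIGHTS["qos"] * job.qos
            + WEIGHTS["age"] * job.age
            + WEIGHTS["fairshare"] * job.fairshare)


def rank_by_priority(jobs):
    """Return (rank, job_id, priority) tuples, highest priority first."""
    ordered = sorted(jobs, key=priority, reverse=True)
    return [(rank, j.job_id, priority(j)) for rank, j in enumerate(ordered, start=1)]
```

A priority rank computed this way can differ from the backfill start-time rank, which is why showing both columns gives a fuller picture.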
A Few Tips to Get Faster Job Turnaround
• Request a shorter wall time; do not use the allowed max wall time.
• Use the “shared” partition for serial jobs or very small parallel jobs.
• Bundle jobs (multiple “srun”s in one script, sequentially or simultaneously)
• Use Job Arrays (better for managing jobs, not necessarily faster turnaround). Each array task is considered a single job for scheduling.
• Use the job dependency feature for managing workflows.
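As an illustration of the bundling tip, the sketch below generates a batch script that runs several “srun” commands in one job, either simultaneously or sequentially. The helper name `bundled_script`, the default wall time, and the executable names are hypothetical; only the standard `-N` and `-t` options are used.

```python
def bundled_script(commands, nodes_per_cmd=1, walltime="00:30:00", simultaneous=True):
    """Build a batch script that runs several srun commands in one job."""
    # Simultaneous runs need nodes for every command; sequential runs reuse them.
    total_nodes = nodes_per_cmd * (len(commands) if simultaneous else 1)
    lines = [
        "#!/bin/bash",
        f"#SBATCH -N {total_nodes}",
        f"#SBATCH -t {walltime}",   # request a short wall time, not the max
    ]
    for cmd in commands:
        suffix = " &" if simultaneous else ""
        lines.append(f"srun -N {nodes_per_cmd} {cmd}{suffix}")
    if simultaneous:
        lines.append("wait")        # keep the job alive until all sruns finish
    return "\n".join(lines) + "\n"


# Hypothetical executables, for illustration only.
print(bundled_script(["./a.out", "./b.out"]))
```

Job arrays and dependencies are requested at submission time instead, for example with sbatch's `--array` and `--dependency` options.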
Resolved: Cray HDF5 with Intel16
• Internal compiler error for Fortran codes when using cray-hdf5 and cray-hdf5-parallel/1.8.14 with intel/16.0.0.109
• Two workarounds:
  – Use NERSC built hdf5/1.8.14 and hdf5-parallel/1.8.14 with the Intel/16.0.0.109 compiler
  – Use cray-hdf5/1.8.14, but swap the intel compiler version from 16.0.0.109 to 15.0.1.133.
• cray-hdf5/1.8.16 has been installed and set to default, which resolved this issue (Feb 27, 2016)
Workaround: Node Voltage Fault
• Compute node voltage fault only seen with one specific Quantum Espresso application, “pw.x”.
• By default, hyperthreading is used, and the application generates a very close sequence of current spikes that may cause the Voltage Converter to self-protect and shut down.
• Workaround by user education to use 1 thread per MPI task. Also modified the NERSC provided module file to set OMP_NUM_THREADS=1. (Jan 16, 2016)
Resolved: /project IO performance
• Two applications reported 10x parallel IO performance slowdown in /project, seen after Dec 25, 2015.
• Fixed during system reboot with scheduled maintenance on Jan 20, 2016.
• Exact cause of slowdown unknown
  – Unlikely due to “Cori DVS nodes GPFS IB cable not used”
Current Issues
• Login nodes crash when hitting a Lustre file system bug
• Compute nodes stuck in completing state from certain Burst Buffer jobs
• Compute nodes went down with out-of-memory errors from certain applications
• Burst Buffer still in early user period
Cori Phase 1 SSP Performance
Committed SSP: 68.2; Measured SSP: 83.0
[Figure: run time (sec), 0 to 1200, for MiniFE, MiniGhost, AMG, UMT, SNAP, MiniDFT, GTC, and MILC; series: Committed and Measured.]
Peak Cori Scratch Lustre I/O Performance
[Figure panels: POSIX, File-Per-Process; MPI-IO, Single Shared File.]
Thank you.