Edison and Cori: User Update

Zhengji Zhao, Helen He, Wahid Bhimji
NERSC User Group Meeting, Berkeley, CA, March 24, 2016


Edison Update

Zhengji Zhao

Edison upgrades (11/30/2015-3/15)

•  Edison move, 11/30-12/23/2015
   –  Edison was disassembled, reassembled, integrated, reconfigured and tested at CRT
   –  Users were enabled on 1/4/2016
   –  Free charging period 1/4-1/10/2016
•  Switch to Slurm
   –  The Slurm configuration has been under continuous improvement and adjustment
   –  Users needed a lot of help with running jobs and with the workflow switch
   –  Scheduling largely favors big jobs
   –  The major issue is slow queue turnaround; we are working on it
•  /scratch3 upgrade to GridRaid
   –  The I/O performance issue is still under investigation

Edison upgrades

•  Host IP change
   –  Users had ssh issues logging in
•  New SSH authentication mechanism (1/12/2016)
   –  Caused login issues as well
•  Edison experienced multiple planned and unplanned downtimes (power outage) during Jan-Mar 2016
   –  User jobs were affected
•  CDT upgrades on 12/23/2015 (15.12), 2/3/2016 (16.01), 3/22/2016 (16.03)
   –  Encountered a few major bugs. Workarounds were provided for all bugs, and the major bugs were fixed as of 1/15. The one remaining bug was to be fixed in CDT 16.03; the fixes were in place on 3/22/2016.
   –  Extended the CDT testing script to include more tests
   –  Default option --craype-buildtools-check

Edison upgrades

•  A shorter purging period will be in place effective 4/1/2016
   –  Purging period will be 8 weeks (down from 12 weeks)
   –  The /scratch1 and /scratch2 file systems are currently 84% and 81% full
•  /scratch3 quota in place as of 3/17/2016
   –  Quota: 100 TB disk space, 50,000,000 inodes
   –  A quota check will be placed in the job submission filter, failing the submission if the user is over quota
   –  74% full
•  /scratch1 and /scratch2 will be upgraded to GridRaid, time TBD
   –  Depending on when the current /scratch3 performance bug is resolved
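
The quota check in the submission filter could look roughly like the following Python sketch. The 100 TB and 50,000,000-inode limits come from the slide; the function and class names, the decimal definition of a terabyte, and the way usage is passed in are assumptions for illustration, not NERSC's actual filter.

```python
from dataclasses import dataclass

TB = 10**12  # assumption: decimal terabytes; the slide does not say TB vs TiB


@dataclass(frozen=True)
class Quota:
    max_bytes: int = 100 * TB      # /scratch3 disk quota from the slide
    max_inodes: int = 50_000_000   # /scratch3 inode quota from the slide


def check_submission(used_bytes, used_inodes, quota=Quota()):
    """Return (accepted, reason) for a hypothetical job submission filter."""
    if used_bytes > quota.max_bytes:
        return False, "rejected: disk usage over quota"
    if used_inodes > quota.max_inodes:
        return False, "rejected: inode usage over quota"
    return True, "accepted"


print(check_submission(80 * TB, 10_000_000))   # under both limits
print(check_submission(120 * TB, 10_000_000))  # over the disk quota
```

In this sketch usage exactly at the limit is still accepted; whether the real filter treats the boundary that way is not stated in the slides.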


SSP benchmark performance after the move

[Figure: NERSC6 SSP application runs under Slurm and Torque/Moab. Y-axis: time (s), 0 to 1200. X-axis: applications (CAM, GAMESS, GTC, IMPACT-T, MAESTRO, MILC, PARATEC). Series: Slurm, dedicated (12/23/2015, 1/1/2016); Torque/Moab, production (11/24/2015); Torque/Moab, dedicated (acceptance).]

Edison performance monitoring: https://my.nersc.gov/benchmarks-cs.php

FCT performance regression is resolved

[Figure: FCT performance on Edison before and after the move to the CRT building. Y-axis: MPI_Alltoall time (sec), 0 to 80. X-axis: run date, 12/27/15 to 1/31/16.]

Before the move:

Run date   output / KEY:tag     ntasks   MPI_Alltoall time (sec)
2/19/14    fct99p1.o785758      132367   40.94
1/15/15    fct100p1.o2274657    133296   33.36

I/O performance degradation after the Grid Raid upgrade is still in investigation

•  This is roughly a 3x performance degradation.
•  NERSC needs a recommendation from Cray and Seagate on how to run the IOR benchmark to compare with the MDRAID performance.

Job description: file-per-process IOR

Date       Time   Job description                               Write (MB/s)   Read (MB/s)   Comment
12/24/15   2:22   FS3 1m2 1152ranks 8fpo 24ppn 144osts          21718          23952         GRIDRAID
12/24/15   7:48   FS3 1m2 288ranks 8fpo 24ppn 36osts            17602          22911         GRIDRAID
3/26/15    2:39   FS3 1m2 1152ranks 8fpo 24ppn 144osts          62663          54416         MDRAID
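
The "roughly 3 times" figure can be checked against the two 1152-rank, 144-OST rows of the table. The short Python calculation below uses only the bandwidths listed there; the variable names are illustrative. The write slowdown comes out close to 3x, while the read slowdown is nearer 2.3x.

```python
# Bandwidths in MB/s from the 1152-rank, 144-OST runs in the table above.
mdraid = {"write": 62663, "read": 54416}     # 3/26/15, before the upgrade
gridraid = {"write": 21718, "read": 23952}   # 12/24/15, after the upgrade

# Slowdown factor = bandwidth before / bandwidth after, per operation.
slowdown = {op: mdraid[op] / gridraid[op] for op in mdraid}

for op, factor in slowdown.items():
    print(f"{op}: {factor:.2f}x slower under GridRaid")
```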


Cori Update

Helen He and Wahid Bhimji

Cori Usage Info

•  11/12/2015: All users enabled
•  11/30/2015-1/4/2016: Edison offline
•  12/15/2015: Hopper retired
•  1/12/16: Cori started charging when AY16 began

More large jobs during free time :)

Cori Usage Info: Free Period and AY16

162M MPP hours used (10/29/15-1/11/16)

75.8M MPP hours used (1/12/16-3/22/16)

•  Early users were enabled in 7 phases
•  Phasing allowed the Cori system to become ready in various aspects (networking, programming environment, batch system, etc.)

Cori Phase 1 Data Features

•  File systems
   –  Burst Buffer for high-bandwidth, low-latency I/O
   –  High-performance Lustre file system: 28 PB of disk, >700 GB/s I/O bandwidth
   –  Cross-mounting of file systems (Cori scratch on Edison and DTNs) (TBA)
   –  Large amount of memory per compute node (128 GB), as well as some high-memory login nodes (775 GB)
•  Networking
   –  Improved outbound Internet connections (e.g. to access a database in another center)
   –  Software Defined Networking R&D for high-bandwidth transfers in and out of the compute node (TBA)
•  On-node software
   –  Improved shared library performance
   –  User-defined images / Shifter

Cori Phase 1 Data Features (SLURM)

•  Cori Phase 1 is also known as the "Cori Data Partition"
•  Designed to accelerate data-intensive applications with high-throughput and "realtime" needs
   –  "shared" partition: multiple jobs on the same node. Larger submit and run limits. 40 nodes set aside.
   –  The 1-2 node bin in the "regular" partition for high-throughput jobs. Large submit and run limits.
   –  "realtime" partition for jobs requiring real-time data analysis. Highest queue priority. Special permission only.
   –  Internal sshd (CCM mode) in any queue
   –  Large number of login/interactive nodes to support applications with advanced workflows
   –  "burst buffer" usage integrated in SLURM, in early user period
   –  Encourage users to run jobs using 683+ nodes on Edison, with a queue priority boost and a 40% charging discount there

Transition from Hopper/Edison to Cori

•  The programming environment is very similar to Hopper/Edison. Porting to Cori is straightforward with regard to software building.
•  The aspect users need to adjust to the most is the transition from Torque/Moab to SLURM.
•  Provided detailed documentation on SLURM: transition guide, example batch scripts, and tutorials.
•  Worked with some specific applications and users on porting. CESM is one such example. It is a new machine port, with bl required.

SLURM Batch Scheduler Adoption

•  Overall SLURM adoption is smooth.
•  "premium" and "ccm" are easy to use; good support and usage for "shared" and "realtime".
•  A few traps (addressed with user education):
   –  Hyperthreading is on by default
      •  SLURM sees 64 CPUs per node
      •  Requesting nodes with "#SBATCH -n" but without "#SBATCH -N" may get half the number of nodes desired
      •  Need to set OMP_NUM_THREADS=1 explicitly to run with pure MPI (for a hybrid MPI/OpenMP program compiled with OpenMP enabled)
   –  Automatic process and thread affinity is good. Advanced settings can be explored for more complicated binding options.
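
The "half the nodes" trap follows from simple arithmetic, sketched below in Python. The 64 CPUs per node figure is from the slide; the two-threads-per-core value and the function names are assumptions, and the calculation is a simplified model of how tasks are packed, not SLURM's actual allocation logic.

```python
import math

LOGICAL_CPUS_PER_NODE = 64   # what SLURM sees with hyperthreading on (from the slide)
THREADS_PER_CORE = 2         # assumption: two hardware threads per physical core
PHYSICAL_CORES_PER_NODE = LOGICAL_CPUS_PER_NODE // THREADS_PER_CORE


def nodes_if_only_n(ntasks):
    """Nodes needed when tasks are packed onto every logical CPU."""
    return math.ceil(ntasks / LOGICAL_CPUS_PER_NODE)


def nodes_for_one_task_per_core(ntasks):
    """Nodes to request with -N for one task per physical core."""
    return math.ceil(ntasks / PHYSICAL_CORES_PER_NODE)


ntasks = 128
print(nodes_if_only_n(ntasks), nodes_for_one_task_per_core(ntasks))
```

For 128 tasks the model gives 2 nodes when only the task count is specified, against the 4 nodes a user expecting one task per physical core would want, which is why stating the node count explicitly avoids the surprise.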


Batch Job Wait Time

•  Users reported LONG wait times for jobs
•  Monitoring and tuning the SLURM configuration is an ongoing task
•  Changes made on Jan 15
   –  Added a max number of backfill jobs per partition (on top of the max number of backfill jobs per user)
   –  Decreased max size of debug from 128 to 112
   –  Communicated with individual users to use the "shared" partition, job arrays, and bundling jobs
   –  Jobs not planned to run in AY16 were deleted
   –  Most debug jobs then started within 30 min instead of hours; many now start in a few min
   –  Regular job wait times are significantly shorter too
•  Changes made on Mar 22 to the scheduling algorithm greatly increased system utilization (keep watching :)

NERSC Custom Queue Monitoring Script

•  The original "sqs" provides basic batch job info plus the job ranking based on the start time provided by the backfill scheduler.
•  A new version of "sqs" was deployed on Jan 19 with two columns of ranking values to give users more perspective on their jobs in the queue.
   –  Added job priority ranking with absolute priority value (a function of partition, QOS, job wait time, and fair share)

A Few Tips to Get Faster Job Turnaround

•  Request a shorter wall time; do not use the allowed max wall time.
•  Use the "shared" partition for serial jobs or very small parallel jobs.
•  Bundle jobs (multiple "srun"s in one script, sequential or simultaneous).
•  Use job arrays (better for managing jobs, not necessarily faster turnaround). Each array task is considered a single job for scheduling.
•  Use the job dependency feature for managing workflows.
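
The bundling tip can be sketched as a small Python helper that writes a batch script with several simultaneous "srun" steps. The -p, -N and -t directives and srun's -N/-n flags are standard SLURM options, and "regular" is a partition named in these slides; the helper itself, the executable names, and the node and task counts are hypothetical.

```python
def bundled_script(commands, nodes_per_step, tasks_per_step, walltime,
                   partition="regular"):
    """Render a batch script that runs several srun steps simultaneously."""
    total_nodes = nodes_per_step * len(commands)
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition}",
        f"#SBATCH -N {total_nodes}",   # enough nodes for all steps at once
        f"#SBATCH -t {walltime}",
        "",
    ]
    for cmd in commands:
        # Trailing '&' puts each step in the background so they overlap.
        lines.append(f"srun -N {nodes_per_step} -n {tasks_per_step} {cmd} &")
    lines.append("wait")               # keep the job alive until all steps end
    return "\n".join(lines) + "\n"


# Hypothetical executables; two 2-node steps bundled into one 4-node job.
print(bundled_script(["./a.out", "./b.out"], 2, 64, "00:30:00"))
```

Dropping the "&" and the final "wait" from the generated script would run the same steps sequentially instead, in which case the job only needs the nodes of a single step.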


Resolved: Cray HDF5 with Intel16

•  Internal compiler error for Fortran codes when using cray-hdf5 and cray-hdf5-parallel/1.8.14 with intel/16.0.0.109
•  Two workarounds:
   –  Use the NERSC-built hdf5/1.8.14 and hdf5-parallel/1.8.14 with the intel/16.0.0.109 compiler
   –  Use cray-hdf5/1.8.14, but swap the Intel compiler version from 16.0.0.109 to 15.0.1.133
•  cray-hdf5/1.8.16 has been installed and set as the default, which resolved this issue (Feb 27, 2016)

Workaround: Node Voltage Fault

•  Compute node voltage fault, only seen with one specific Quantum Espresso application, "pw.x".
•  By default, hyperthreading is used, and the application generates a very close sequence of current spikes that may cause the voltage converter to self-protect and shut down.
•  Workaround: user education to use 1 thread per MPI task. The NERSC-provided module file was also modified to set OMP_NUM_THREADS=1. (Jan 16, 2016)

Resolved: /project IO performance

•  Two applications reported a 10x parallel I/O performance slowdown in /project, seen after Dec 25, 2015.
•  Fixed during the system reboot at the scheduled maintenance on Jan 20, 2016.
•  Exact cause of the slowdown is unknown
   –  Unlikely to be due to "Cori DVS nodes GPFS IB cable not used"

Current Issues

•  Login nodes crash when hitting a Lustre file system bug
•  Compute nodes get stuck in the completing state after certain Burst Buffer jobs
•  Compute nodes went down with out-of-memory errors from certain applications
•  Burst Buffer is still in its early user period

Cori Phase 1 SSP Performance

Committed SSP: 68.2; Measured SSP: 83.0

[Figure: run time (sec), 0 to 1200, committed vs. measured, for MiniFE, MiniGhost, AMG, UMT, SNAP, MiniDFT, GTC, MILC.]

Peak Cori Scratch Lustre I/O Performance

[Figures: POSIX, file-per-process; MPI-IO, single shared file.]

Thank you.
