vasp-gpu on Balena: Usage and Some Benchmarks

Posted on 15-Apr-2017


Transcript of vasp-gpu on Balena: Usage and Some Benchmarks

Balena User Group Meeting

3rd February 2017

vasp-gpu on Balena: Usage and Some Benchmarks

Ø The VASP SCF cycle in a nutshell

Ø Parallelisation in VASP

o Workload and data distribution

o Parallelisation control parameters

o Some rules of thumb for optimising parallel scaling

Ø The GPU (CUDA) port of VASP

o Compiling and running

o Features

o Some initial benchmarks

Ø Thoughts and discussion points

Balena User Group Meeting, February 2017 | Slide 2

Overview

http://www.iue.tuwien.ac.at/phd/goes/dissse14.html
S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011)

Balena User Group Meeting, February 2017 | Slide 3

The VASP SCF cycle in a nutshell

Ø The newest versions of VASP implement four levels of parallelism (a minimal INCAR sketch follows after this list):

o k-point parallelism: KPAR

o Band parallelism and data distribution: NCORE and NPAR

o Parallelisation and data distribution over plane-wave coefficients (= FFTs; done over planes along NGZ): LPLANE

o Parallelisation of some linear-algebra operations using ScaLAPACK (notionally set at compile time, but can be controlled at run time using LSCALAPACK)

Ø Effective parallelisation will…:

o …minimise (relatively slow) communication between MPI processes,…

o …distribute data to reduce memory requirements,…

o …and make sure the MPI processes have enough work to keep them busy
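As a concrete illustration of the four levels, a minimal INCAR fragment might look like the following (the values are placeholders, not recommendations; see the rules of thumb later in the talk):

KPAR = 2             ! number of k-point groups
NCORE = 16           ! cores per band group (equivalently, NPAR = <#procs>/NCORE)
LPLANE = .TRUE.      ! distribute plane-wave (FFT) data plane-by-plane along NGZ (the default)
LSCALAPACK = .TRUE.  ! use ScaLAPACK for the dense linear algebra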

Balena User Group Meeting, February 2017 | Slide 4

Parallelisation in VASP

[Diagram: MPI processes divided into KPAR k-point groups, NPAR band groups and NGZ FFT groups (?)]

Ø Workload distribution over KPAR k-point groups, NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [not 100% sure how this works…]

Balena User Group Meeting, February 2017 | Slide 5

Parallelisation: Workload distribution

[Diagram: data distribution across KPAR k-point groups, NPAR band groups and NGZ FFT groups (?)]

Ø Data distribution over NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [also not 100% sure how this works…]

Balena User Group Meeting, February 2017 | Slide 6

Parallelisation: Data distribution

Ø During a standard DFT calculation, k-points are independent -> k-point parallelism should be linearly scaling, although perhaps not in practice: https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores

Ø WARNING: <#procs> must be divisible by KPAR, but the parallelisation is via a round-robin algorithm, so <#k-points> does not need to be divisible by KPAR -> check how many irreducible k-points you have (IBZKPT file) and set KPAR accordingly

[Diagram: round-robin distribution of 3 k-points (k1, k2, k3) over rounds R1-R3. KPAR = 1: t = 3 rounds [OK]; KPAR = 2: t = 2 rounds, with idle processes in the second round [Bad]; KPAR = 3: t = 1 round [Good]]
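To make the round-robin behaviour concrete, here is a small Python sketch (illustrative only, not VASP code) that deals irreducible k-points out to KPAR groups and counts the number of rounds needed:

# Hypothetical illustration of round-robin k-point distribution over KPAR groups.
import math

def kpoint_rounds(n_kpoints, kpar):
    """Return the round-robin assignment and the number of rounds it takes."""
    groups = {g: [] for g in range(kpar)}
    for k in range(n_kpoints):
        groups[k % kpar].append(f"k{k + 1}")
    rounds = math.ceil(n_kpoints / kpar)
    return groups, rounds

for kpar in (1, 2, 3):
    groups, rounds = kpoint_rounds(3, kpar)
    print(f"KPAR = {kpar}: {rounds} round(s), groups = {groups}")

# For 3 k-points, KPAR = 2 still needs 2 rounds, and half the processes idle in
# the second round; KPAR = 3 finishes in a single round.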

Balena User Group Meeting, February 2017 | Slide 7

Parallelisation: KPAR

NCORE: number of cores in each band group
NPAR: number of bands treated simultaneously

NCORE = <#procs> / NPAR

Ø For NCORE = 1 / NPAR = <#procs> (the default), the larger number of band groups appears to increase memory pressure and incur a substantial communication overhead

[Plot annotations: 7.08x, 6.41x and 6.32x]

Balena User Group Meeting, February 2017 | Slide 8

Parallelisation: NCORE and NPAR

Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of groups

Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)

Cores    NBANDS (default)    NBANDS (adjusted)
96       455                 480
128      455                 512
192      455                 576
256      455                 512
384      455                 768
512      455                 512

Default NBANDS (non-spin-polarised): NBANDS = NELECT/2 + NIONS/2

Default NBANDS (spin-polarised): NBANDS = (3/5)·NELECT + NMAG

Example system:

• 238 atoms w/ 672 electrons

• Default NBANDS = 455
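A short Python sketch (assuming NPAR = <#procs>, i.e. one band group per core) reproduces the default and adjusted NBANDS values in the table above:

# Sketch of how the default NBANDS is chosen and then rounded up to a multiple
# of the number of band groups (here NPAR = <#procs>); not VASP source code.
import math

def default_nbands(nelect, nions):
    # Non-spin-polarised default: NBANDS = NELECT/2 + NIONS/2
    return math.ceil(nelect / 2 + nions / 2)

def adjusted_nbands(nbands, n_band_groups):
    # VASP rounds NBANDS up to the nearest multiple of the number of band groups.
    return math.ceil(nbands / n_band_groups) * n_band_groups

nbands = default_nbands(672, 238)            # -> 455 for the example system
for cores in (96, 128, 192, 256, 384, 512):
    print(cores, nbands, adjusted_nbands(nbands, cores))
# Reproduces the adjusted column above: 480, 512, 576, 512, 768, 512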

Balena User Group Meeting, February 2017 | Slide 9

Parallelisation: NCORE and NPAR

Ø The RMM-DIIS (ALGO = VeryFast | Fast) algorithm involves three steps:

o EDDIAG: subspace diagonalisation

o RMM-DIIS: electronic minimisation

o ORTHCH: wavefunction orthogonalisation

Routine     312 atoms         624 atoms         1,248 atoms        1,872 atoms
EDDIAG      2.90 (18.64%)     12.97 (22.24%)    75.26 (26.38%)     208.29 (31.31%)
RMM-DIIS    12.39 (79.63%)    42.73 (73.27%)    187.62 (65.78%)    379.80 (57.10%)
ORTHCH      0.27 (1.74%)      2.62 (4.49%)      22.36 (7.84%)      77.11 (11.59%)

Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations

Ø A good ScaLAPACK library can improve the performance of these routines in massively-parallel calculations

See also: https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k

Balena User Group Meeting, February 2017 | Slide 10

Parallelisation: ScaLAPACK

Ø KPAR: the current implementation does not distribute data over k-point groups -> KPAR = N will use N× more memory than KPAR = 1

Ø NPAR/NCORE: data is distributed over band groups -> decreasing NPAR/increasing NCORE will considerably reduce memory requirements

Ø NPAR takes precedence over NCORE - if you use “master” INCAR files, make sure you don’t define both

Ø The defaults for NPAR/NCORE (NPAR = <#procs>, NCORE = 1) are usually a poor choice for both memory requirements and performance

Ø Band parallelism for hybrid functionals has been supported since VASP 5.3.5; for memory-intensive calculations, it is a good alternative to underpopulating nodes

Ø LPLANE: distributes data over plane-wave coefficients, and speeds things up by reducing communication during FFTs - the default is LPLANE = .TRUE., and should only need to be changed for massively-parallel architectures (e.g. BlueGene/Q)
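To illustrate the precedence pitfall above, a minimal (hypothetical) INCAR fragment:

NCORE = 16      ! one band group per 16 cores; data is distributed within each group
! NPAR = 4      ! do not also set NPAR: if both tags are present, NPAR takes
                ! precedence and silently overrides the NCORE setting above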

Balena User Group Meeting, February 2017 | Slide 11

Parallelisation: Memory

Ø For x86_64 IB systems (e.g. Balena, Archer…):

o Try KPAR for heavy calculations (e.g. hybrids)

o Set NPAR = (<#procs>/KPAR) or NCORE = <#procs/node> (see the sketch after this list)

o 1 node/band group per 50 atoms; you may want to use 2 nodes/50 atoms for hybrids, or decrease to ½ node per band group for <10 atoms

o Leave LPLANE at the default (.TRUE.)

o WARNING: In my experience of Cray systems (Archer/XC30, SiSu/XC40), using KPAR sometimes causes VASP to hang during multi-step calculations (e.g. optimisations)

Ø For the IBM BlueGene/Q (STFC Hartree Centre):

o Last time I used it, the Hartree machine only had VASP 5.2.x -> no KPAR

o Try to choose a square number of cores, and set NPAR = sqrt(<#procs>)

o Consider setting LPLANE = .FALSE. if <#procs> ≥ NGZ
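The sketch below encodes the x86_64 rules of thumb in a few lines of Python. It is a rough guide only; 16 cores per node is an assumption about the standard Balena compute nodes and should be adjusted to the machine in question.

# Rough, illustrative helper for the x86_64 rules of thumb above.
def suggest_settings(n_atoms, n_irreducible_kpoints, cores_per_node=16):
    """Suggest a node count, KPAR and NCORE for a standard (non-hybrid) calculation."""
    nodes = max(1, round(n_atoms / 50))      # ~1 node (band group) per 50 atoms
    procs = nodes * cores_per_node
    # Largest KPAR that divides <#procs>, capped at the number of irreducible
    # k-points and (as a simple heuristic) at one k-point group per node.
    kpar = 1
    for candidate in range(min(n_irreducible_kpoints, nodes), 0, -1):
        if procs % candidate == 0:
            kpar = candidate
            break
    return {"nodes": nodes, "KPAR": kpar, "NCORE": cores_per_node}

print(suggest_settings(n_atoms=238, n_irreducible_kpoints=4))
# -> {'nodes': 5, 'KPAR': 4, 'NCORE': 16}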

Balena User Group Meeting, February 2017 | Slide 12

Parallelisation: Some rules of thumb

Ø GPU computing works in an offload model

Ø Programming models such as CUDA and OpenCL provide APIs for:

o Copying memory to and from the GPU

o Compiling kernel programs to run on the GPU

o Setting up and running kernels on input data

Ø Porting codes for GPUs involves identifying routines that can be efficiently mapped to the GPU architecture, writing kernels, and interfacing them to the CPU code

[Diagram: the offload model - data and kernel programs are copied from the CPU to the GPU, the kernels are run on the GPU, and the results are copied back to the CPU]
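To make the offload model concrete, here is a minimal Python/CuPy sketch (purely illustrative; it has nothing to do with how VASP itself is ported). It copies data to the GPU, compiles and runs a trivial kernel, and copies the result back:

# Minimal offload example using CuPy: host -> device copy, runtime-compiled
# kernel, kernel launch, device -> host copy.
import numpy as np
import cupy as cp

kernel_src = r'''
extern "C" __global__ void scale(const double* x, double* y, double a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
'''
scale = cp.RawKernel(kernel_src, "scale")    # compile the kernel for the GPU

x_cpu = np.random.rand(1 << 20)              # data starts on the CPU (host)
x_gpu = cp.asarray(x_cpu)                    # copy host -> device
y_gpu = cp.empty_like(x_gpu)

threads = 256
blocks = (x_gpu.size + threads - 1) // threads
scale((blocks,), (threads,), (x_gpu, y_gpu, np.float64(2.0), np.int32(x_gpu.size)))

y_cpu = cp.asnumpy(y_gpu)                    # copy device -> host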

Balena User Group Meeting, February 2017 | Slide 13

GPU computing

Balena User Group Meeting, February 2017 | Slide 14

vasp-gpu

Ø Starting from the February 2016 release of VASP 5.4.1, the distribution includes a CUDA port that offloads some of the core DFT routines onto NVIDIA GPUs

Ø A culmination of research at the University of Chicago, Carnegie Mellon and ENS-Lyon, and a healthy dose of optimisation by NVIDIA

Ø Three papers covering the implementation and testing:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Commun. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Balena User Group Meeting, February 2017 | Slide 15

Because sharing is caring...

https://github.com/JMSkelton/VASP-GPU-Benchmarking

Ø Easy(ish) with the VASP 5.4.1 build system:

o Load cuda/toolkit (along with intel/compiler, intel/mkl, etc.)

o Modify the arch/makefile.include.linux_intel_cuda example

o Make the gpu and/or gpu_ncl targets (a sketch of the steps follows below)

intel/compiler/64/15.0.0.090
intel/mkl/64/11.2
openmpi/intel/1.8.4
cuda/toolkit/7.5.18

FC = mpif90
FCL = mpif90 -mkl -lstdc++
...
CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
...
MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation
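Putting the above together, the build might look roughly like this (a sketch only; module names are those listed above, and the makefile.include edits are the ones shown on the slide):

module load intel/compiler/64/15.0.0.090 intel/mkl/64/11.2 openmpi/intel/1.8.4 cuda/toolkit/7.5.18
cp arch/makefile.include.linux_intel_cuda makefile.include
# edit makefile.include: set FC/FCL, CUDA_ROOT and MPI_INC as above
make gpu        # GPU build of the standard version
make gpu_ncl    # GPU build of the non-collinear version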

Balena User Group Meeting, February 2017 | Slide 16

vasp-gpu: Compilation

Ø Available as a module on Balena: module load untested vasp/intel/5.4.1

Ø To use vasp-gpu on Balena, you need to request a GPU-equipped node and perform some basic setup tasks in your SLURM scripts

#SBATCH --partition=batch-acc

# Node w/ 1 k20x card.

#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x

# Node w/ 4 k20x cards.

##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x

if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi

export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi

export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts
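The launch line itself is not shown above; as a hedged sketch, the end of such a script might look something like this (the binary name vasp_gpu and the use of mpirun are assumptions, not taken from the slides):

module load untested vasp/intel/5.4.1
# e.g. 4 MPI processes sharing the node's GPU(s) via the MPS daemon started above
mpirun -np 4 vasp_gpu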

Balena User Group Meeting, February 2017 | Slide 17

vasp-gpu: Running jobs

Ø Uses cuFFT and CUDA ports of compute-heavy parts of the SCF cycle

Ø ALGO = Normal | VeryFast (+ Fast) w/ LREAL = Auto fully supported, along with KPAR, exact exchange and non-collinear spin

Ø ALGO = All | Damped and the GW routines work, but are not optimised (“passively supported”)

Ø LREAL = .FALSE., NCORE > 1 (NPAR != N) and electric fields are not supported (will crash with an error)

Ø Currently no Gamma-only version

Ø Future roadmap: Γ-point optimisations and support for LREAL = .FALSE., vdW functionals, RPA/GW calculations and band parallelism
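For reference, a minimal INCAR fragment that stays within the fully supported feature set above (the values are illustrative only):

ALGO = VeryFast     ! RMM-DIIS; Normal and Fast are also fully supported
LREAL = Auto        ! real-space projection (LREAL = .FALSE. is not supported)
KPAR = 2            ! k-point parallelism is supported
NSIM = 16           ! size of the GPU work batches (see the load-balancing slide)
! Leave NCORE/NPAR at the defaults: NCORE > 1 will crash the GPU build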

Balena User Group Meeting, February 2017 | Slide 18

vasp-gpu: Features

Ø Each MPI process allocates its own set of cuFFT plans and CUDA kernels, distributing round-robin among the available GPUs

Ø The size of the CUDA kernels is controlled by NSIM: broadly, NSIM ↑ = better GPU utilisation but higher memory requirements

Ø <#procs> should be a multiple of <#GPUs>, and for most systems you will probably end up underpopulating the CPUs

[Diagram: round-robin mapping of MPI processes to GPUs - e.g. four processes sharing two GPUs (two per card), or four processes across four GPUs (one per card)]
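A small Python sketch of the round-robin mapping illustrated above (the real assignment happens inside the CUDA port; this is only to show the pattern):

# Illustrative round-robin assignment of MPI ranks to GPUs.
def assign_gpus(n_procs, n_gpus):
    if n_procs % n_gpus != 0:
        print("Warning: <#procs> should be a multiple of <#GPUs>")
    return {rank: rank % n_gpus for rank in range(n_procs)}

print(assign_gpus(4, 2))   # {0: 0, 1: 1, 2: 0, 3: 1} - two processes per GPU
print(assign_gpus(4, 4))   # {0: 0, 1: 1, 2: 2, 3: 3} - one process per GPU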

Balena User Group Meeting, February 2017 | Slide 19

vasp-gpu: Load balancing

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 20

vasp-gpu: Benchmarking

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

                    NSIM
#MPI processes      1       2      4      8      12     16     24     32     48     64
1                   13.52   8.88   8.15   7.82   7.77   7.76   7.72   7.74   7.81   7.89
2                   9.11    6.75   6.34   6.21   6.23   6.21   6.23   6.25   6.32   OOM
4                   6.72    5.57   5.33   5.24   5.29   5.30   OOM    OOM    OOM    OOM
8                   6.01    5.26   5.14   OOM    OOM    OOM    OOM    OOM    OOM    OOM
12                  OOM     OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
16                  OOM     OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM

(OOM = out of memory)
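Given a grid of timings like the one above, choosing the fastest configuration that does not run out of memory is straightforward; a short sketch (the numbers are a hand-typed subset of the table, with None marking OOM entries):

# Pick the fastest (#MPI processes, NSIM) combination that did not run OOM.
timings = {
    (1, 8): 7.82, (1, 24): 7.72,
    (2, 8): 6.21, (2, 64): None,
    (4, 8): 5.24, (4, 24): None,
    (8, 4): 5.14, (8, 8): None,
}

valid = {k: v for k, v in timings.items() if v is not None}
(procs, nsim), best = min(valid.items(), key=lambda kv: kv[1])
print(f"Best: {procs} MPI processes, NSIM = {nsim} ({best})")
# -> Best: 8 MPI processes, NSIM = 4 (5.14)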

Balena User Group Meeting, February 2017 | Slide 21

vasp-gpu: Benchmarking

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 22

vasp-gpu: Benchmarking

[Plots: speedup relative to vasp_gam and to vasp_std vs. number of atoms (64 to 512), for 1 GPU and 4 GPUs; y-axis from 0.0 to 5.0]

                    NSIM
#MPI processes      1           2         4         8         16
1                   -14131.52   -158.39   -158.39   -158.39   -158.39
2                   -14131.52   -158.39   -158.39   -158.39   -158.39
4                   -14131.52   -158.39   -158.39   -158.39   -158.39
8                   -14131.52   -158.39   -158.39   -         -
12                  -           -         -         -         -
16                  -           -         -         -         -

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 23

vasp-gpu: Benchmarking

Ø Three papers covering the implementation and testing…:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Commun. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Ø …and a couple of other links:

o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-1-with-gpu-support

o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/

o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html

Balena User Group Meeting, February 2017 | Slide 24

Further reading

Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use fewer resources (the default settings aren’t great...)

Ø At the moment, running VASP on GPUs is mostly for interest:

o Does not benefit all types of job

o Requires some fiddly testing to get the best performance

o If you will be running a lot of a suitable workload on Balena (e.g. large MD jobs), it could be worth the effort

Ø Aims for further benchmark tests:

o What types of job benefit from GPU acceleration?

o What is the most “balanced” configuration (1/2/4 GPUs/node)?

o Is it possible to run over multiple GPU nodes?

o Can GPUs be a cost/power-efficient way to run certain VASP jobs?

Balena User Group Meeting, February 2017 | Slide 25

Thoughts and discussion points

Balena User Group Meeting, February 2017 | Slide 26

Acknowledgements