vasp-gpu on Balena: Usage and Some Benchmarks

Posted on 15-Apr-2017


Transcript of vasp-gpu on Balena: Usage and Some Benchmarks

Balena User Group Meeting

3rd February 2017

vasp-gpu on Balena: Usage and Some Benchmarks

Ø The VASP SCF cycle in a nutshell

Ø Parallelisation in VASP

o Workload and data distribution

o Parallelisation control parameters

o Some rules of thumb for optimising parallel scaling

Ø The GPU (CUDA) port of VASP

o Compiling and running

o Features

o Some initial benchmarks

Ø Thoughts and discussion points

Balena User Group Meeting, February 2017 | Slide 2

Overview

http://www.iue.tuwien.ac.at/phd/goes/dissse14.html
S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011)

Balena User Group Meeting, February 2017 | Slide 3

The VASP SCF cycle in a nutshell

Ø The newest versions of VASP implement four levels of parallelism (a minimal INCAR sketch follows after this list):

o k-point parallelism: KPAR

o Band parallelism and data distribution: NCORE and NPAR

o Parallelisation and data distribution over plane-wave coefficients (= FFTs; done over planes along NGZ): LPLANE

o Parallelisation of some linear-algebra operations using ScaLAPACK (notionally set at compile time, but can be controlled at run time using LSCALAPACK)

Ø Effective parallelisation will…:

o …minimise (relatively slow) communication between MPI processes,…

o …distribute data to reduce memory requirements,…

o …and make sure the MPI processes have enough work to keep them busy
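As a concrete illustration of the four levels, a minimal INCAR fragment might look like the following (the values are placeholders, not recommendations; see the rules of thumb later in the talk):

KPAR = 2             ! number of k-point groups
NCORE = 16           ! cores per band group (equivalently, NPAR = <#procs>/NCORE)
LPLANE = .TRUE.      ! distribute plane-wave (FFT) data plane-by-plane along NGZ (the default)
LSCALAPACK = .TRUE.  ! use ScaLAPACK for the dense linear algebra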

Balena User Group Meeting, February 2017 | Slide 4

Parallelisation in VASP

[Diagram: MPI processes divided into KPAR k-point groups, NPAR band groups and NGZ FFT groups (?)]

Ø Workload distribution over KPAR k-point groups, NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [not 100% sure how this works…]

Balena User Group Meeting, February 2017 | Slide 5

Parallelisation: Workload distribution

[Diagram: data distribution across KPAR k-point groups, NPAR band groups and NGZ FFT groups (?)]

Ø Data distribution over NBANDS band groups and NGZ plane-wave coefficient (FFT) groups [also not 100% sure how this works…]

Balena User Group Meeting, February 2017 | Slide 6

Parallelisation: Data distribution

Ø During a standard DFT calculation, k-points are independent -> k-point parallelism should be linearly scaling, although perhaps not in practice: https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores

Ø WARNING: <#procs> must be divisible by KPAR, but the parallelisation is via a round-robin algorithm, so <#k-points> does not need to be divisible by KPAR -> check how many irreducible k-points you have (IBZKPT file) and set KPAR accordingly

[Diagram: round-robin distribution of 3 k-points (k1, k2, k3) over rounds R1-R3. KPAR = 1: t = 3 rounds [OK]; KPAR = 2: t = 2 rounds, with idle processes in the second round [Bad]; KPAR = 3: t = 1 round [Good]]
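To make the round-robin behaviour concrete, here is a small Python sketch (illustrative only, not VASP code) that deals irreducible k-points out to KPAR groups and counts the number of rounds needed:

# Hypothetical illustration of round-robin k-point distribution over KPAR groups.
import math

def kpoint_rounds(n_kpoints, kpar):
    """Return the round-robin assignment and the number of rounds it takes."""
    groups = {g: [] for g in range(kpar)}
    for k in range(n_kpoints):
        groups[k % kpar].append(f"k{k + 1}")
    rounds = math.ceil(n_kpoints / kpar)
    return groups, rounds

for kpar in (1, 2, 3):
    groups, rounds = kpoint_rounds(3, kpar)
    print(f"KPAR = {kpar}: {rounds} round(s), groups = {groups}")

# For 3 k-points, KPAR = 2 still needs 2 rounds, and half the processes idle in
# the second round; KPAR = 3 finishes in a single round.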

Balena User Group Meeting, February 2017 | Slide 7

Parallelisation: KPAR

NCORE: number of cores in each band group
NPAR: number of bands treated simultaneously

NCORE = <#procs> / NPAR

Ø For NCORE = 1 / NPAR = <#procs> (the default), the larger number of band groups appears to increase memory pressure and incur a substantial communication overhead

[Plot annotations: 7.08x, 6.41x and 6.32x]

Balena User Group Meeting, February 2017 | Slide 8

Parallelisation: NCORE and NPAR

Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of groups

Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)

Cores    NBANDS (default)    NBANDS (adjusted)
96       455                 480
128      455                 512
192      455                 576
256      455                 512
384      455                 768
512      455                 512

Default NBANDS (non-spin-polarised): NBANDS = NELECT/2 + NIONS/2

Default NBANDS (spin-polarised): NBANDS = (3/5)·NELECT + NMAG

Example system:

• 238 atoms w/ 672 electrons

• Default NBANDS = 455
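A short Python sketch (assuming NPAR = <#procs>, i.e. one band group per core) reproduces the default and adjusted NBANDS values in the table above:

# Sketch of how the default NBANDS is chosen and then rounded up to a multiple
# of the number of band groups (here NPAR = <#procs>); not VASP source code.
import math

def default_nbands(nelect, nions):
    # Non-spin-polarised default: NBANDS = NELECT/2 + NIONS/2
    return math.ceil(nelect / 2 + nions / 2)

def adjusted_nbands(nbands, n_band_groups):
    # VASP rounds NBANDS up to the nearest multiple of the number of band groups.
    return math.ceil(nbands / n_band_groups) * n_band_groups

nbands = default_nbands(672, 238)            # -> 455 for the example system
for cores in (96, 128, 192, 256, 384, 512):
    print(cores, nbands, adjusted_nbands(nbands, cores))
# Reproduces the adjusted column above: 480, 512, 576, 512, 768, 512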

Balena User Group Meeting, February 2017 | Slide 9

Parallelisation: NCORE and NPAR

Ø The RMM-DIIS (ALGO = VeryFast | Fast) algorithm involves three steps:

o EDDIAG: subspace diagonalisation

o RMM-DIIS: electronic minimisation

o ORTHCH: wavefunction orthogonalisation

Routine     312 atoms         624 atoms         1,248 atoms        1,872 atoms
EDDIAG      2.90 (18.64%)     12.97 (22.24%)    75.26 (26.38%)     208.29 (31.31%)
RMM-DIIS    12.39 (79.63%)    42.73 (73.27%)    187.62 (65.78%)    379.80 (57.10%)
ORTHCH      0.27 (1.74%)      2.62 (4.49%)      22.36 (7.84%)      77.11 (11.59%)

Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations

Ø A good ScaLAPACK library can improve the performance of these routines in massively-parallel calculations

See also: https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k

Balena User Group Meeting, February 2017 | Slide 10

Parallelisation: ScaLAPACK

Ø KPAR: the current implementation does not distribute data over k-point groups -> KPAR = N will use N× more memory than KPAR = 1

Ø NPAR/NCORE: data is distributed over band groups -> decreasing NPAR/increasing NCORE will considerably reduce memory requirements

Ø NPAR takes precedence over NCORE - if you use “master” INCAR files, make sure you don’t define both

Ø The defaults for NPAR/NCORE (NPAR = <#procs>, NCORE = 1) are usually a poor choice for both memory requirements and performance

Ø Band parallelism for hybrid functionals has been supported since VASP 5.3.5; for memory-intensive calculations, it is a good alternative to underpopulating nodes

Ø LPLANE: distributes data over plane-wave coefficients, and speeds things up by reducing communication during FFTs - the default is LPLANE = .TRUE., and should only need to be changed for massively-parallel architectures (e.g. BlueGene/Q)
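To illustrate the precedence pitfall above, a minimal (hypothetical) INCAR fragment:

NCORE = 16      ! one band group per 16 cores; data is distributed within each group
! NPAR = 4      ! do not also set NPAR: if both tags are present, NPAR takes
                ! precedence and silently overrides the NCORE setting above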

Balena User Group Meeting, February 2017 | Slide 11

Parallelisation: Memory

Ø For x86_64 IB systems (e.g. Balena, Archer…):

o Try KPAR for heavy calculations (e.g. hybrids)

o Set NPAR = (<#procs>/KPAR) or NCORE = <#procs/node> (see the sketch after this list)

o 1 node/band group per 50 atoms; you may want to use 2 nodes/50 atoms for hybrids, or decrease to ½ node per band group for <10 atoms

o Leave LPLANE at the default (.TRUE.)

o WARNING: In my experience of Cray systems (Archer/XC30, SiSu/XC40), using KPAR sometimes causes VASP to hang during multi-step calculations (e.g. optimisations)

Ø For the IBM BlueGene/Q (STFC Hartree Centre):

o Last time I used it, the Hartree machine only had VASP 5.2.x -> no KPAR

o Try to choose a square number of cores, and set NPAR = sqrt(<#procs>)

o Consider setting LPLANE = .FALSE. if <#procs> ≥ NGZ
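The sketch below encodes the x86_64 rules of thumb in a few lines of Python. It is a rough guide only; 16 cores per node is an assumption about the standard Balena compute nodes and should be adjusted to the machine in question.

# Rough, illustrative helper for the x86_64 rules of thumb above.
def suggest_settings(n_atoms, n_irreducible_kpoints, cores_per_node=16):
    """Suggest a node count, KPAR and NCORE for a standard (non-hybrid) calculation."""
    nodes = max(1, round(n_atoms / 50))      # ~1 node (band group) per 50 atoms
    procs = nodes * cores_per_node
    # Largest KPAR that divides <#procs>, capped at the number of irreducible
    # k-points and (as a simple heuristic) at one k-point group per node.
    kpar = 1
    for candidate in range(min(n_irreducible_kpoints, nodes), 0, -1):
        if procs % candidate == 0:
            kpar = candidate
            break
    return {"nodes": nodes, "KPAR": kpar, "NCORE": cores_per_node}

print(suggest_settings(n_atoms=238, n_irreducible_kpoints=4))
# -> {'nodes': 5, 'KPAR': 4, 'NCORE': 16}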

Balena User Group Meeting, February 2017 | Slide 12

Parallelisation: Some rules of thumb

Ø GPU computing works in an offload model

Ø Programming models such as CUDA and OpenCL provide APIs for:

o Copying memory to and from the GPU

o Compiling kernel programs to run on the GPU

o Setting up and running kernels on input data

Ø Porting codes for GPUs involves identifying routines that can be efficiently mapped to the GPU architecture, writing kernels, and interfacing them to the CPU code

[Diagram: the offload model - data and kernel programs are copied from the CPU to the GPU, the kernels are run on the GPU, and the results are copied back to the CPU]
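To make the offload model concrete, here is a minimal Python/CuPy sketch (purely illustrative; it has nothing to do with how VASP itself is ported). It copies data to the GPU, compiles and runs a trivial kernel, and copies the result back:

# Minimal offload example using CuPy: host -> device copy, runtime-compiled
# kernel, kernel launch, device -> host copy.
import numpy as np
import cupy as cp

kernel_src = r'''
extern "C" __global__ void scale(const double* x, double* y, double a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
'''
scale = cp.RawKernel(kernel_src, "scale")    # compile the kernel for the GPU

x_cpu = np.random.rand(1 << 20)              # data starts on the CPU (host)
x_gpu = cp.asarray(x_cpu)                    # copy host -> device
y_gpu = cp.empty_like(x_gpu)

threads = 256
blocks = (x_gpu.size + threads - 1) // threads
scale((blocks,), (threads,), (x_gpu, y_gpu, np.float64(2.0), np.int32(x_gpu.size)))

y_cpu = cp.asnumpy(y_gpu)                    # copy device -> host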

Balena User Group Meeting, February 2017 | Slide 13

GPU computing

Balena User Group Meeting, February 2017 | Slide 14

vasp-gpu

Ø Starting from the February 2016 release of VASP 5.4.1, the distribution includes a CUDA port that offloads some of the core DFT routines onto NVIDIA GPUs

Ø A culmination of research at the University of Chicago, Carnegie Mellon and ENS-Lyon, and a healthy dose of optimisation by NVIDIA

Ø Three papers covering the implementation and testing:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Commun. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Balena User Group Meeting, February 2017 | Slide 15

Because sharing is caring...

https://github.com/JMSkelton/VASP-GPU-Benchmarking

Ø Easy(ish) with the VASP 5.4.1 build system:

o Load cuda/toolkit (along with intel/compiler, intel/mkl, etc.)

o Modify the arch/makefile.include.linux_intel_cuda example

o Make the gpu and/or gpu_ncl targets (a sketch of the steps follows below)

intel/compiler/64/15.0.0.090
intel/mkl/64/11.2
openmpi/intel/1.8.4
cuda/toolkit/7.5.18

FC = mpif90
FCL = mpif90 -mkl -lstdc++
...
CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
...
MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation
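Putting the above together, the build might look roughly like this (a sketch only; module names are those listed above, and the makefile.include edits are the ones shown on the slide):

module load intel/compiler/64/15.0.0.090 intel/mkl/64/11.2 openmpi/intel/1.8.4 cuda/toolkit/7.5.18
cp arch/makefile.include.linux_intel_cuda makefile.include
# edit makefile.include: set FC/FCL, CUDA_ROOT and MPI_INC as above
make gpu        # GPU build of the standard version
make gpu_ncl    # GPU build of the non-collinear version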

Balena User Group Meeting, February 2017 | Slide 16

vasp-gpu: Compilation

Ø Available as a module on Balena: module load untested vasp/intel/5.4.1

Ø To use vasp-gpu on Balena, you need to request a GPU-equipped node and perform some basic setup tasks in your SLURM scripts

#SBATCH --partition=batch-acc

# Node w/ 1 k20x card.

#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x

# Node w/ 4 k20x cards.

##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x

if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi

export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi

export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d

https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts
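The launch line itself is not shown above; as a hedged sketch, the end of such a script might look something like this (the binary name vasp_gpu and the use of mpirun are assumptions, not taken from the slides):

module load untested vasp/intel/5.4.1
# e.g. 4 MPI processes sharing the node's GPU(s) via the MPS daemon started above
mpirun -np 4 vasp_gpu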

Balena User Group Meeting, February 2017 | Slide 17

vasp-gpu: Running jobs

Ø Uses cuFFT and CUDA ports of compute-heavy parts of the SCF cycle

Ø ALGO = Normal | VeryFast (+ Fast) w/ LREAL = Auto fully supported, along with KPAR, exact exchange and non-collinear spin

Ø ALGO = All | Damped and the GW routines work, but are not optimised (“passively supported”)

Ø LREAL = .FALSE., NCORE > 1 (NPAR != N) and electric fields are not supported (will crash with an error)

Ø Currently no Gamma-only version

Ø Future roadmap: Γ-point optimisations and support for LREAL = .FALSE., vdW functionals, RPA/GW calculations and band parallelism
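For reference, a minimal INCAR fragment that stays within the fully supported feature set above (the values are illustrative only):

ALGO = VeryFast     ! RMM-DIIS; Normal and Fast are also fully supported
LREAL = Auto        ! real-space projection (LREAL = .FALSE. is not supported)
KPAR = 2            ! k-point parallelism is supported
NSIM = 16           ! size of the GPU work batches (see the load-balancing slide)
! Leave NCORE/NPAR at the defaults: NCORE > 1 will crash the GPU build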

Balena User Group Meeting, February 2017 | Slide 18

vasp-gpu: Features

Ø Each MPI process allocates its own set of cuFFT plans and CUDA kernels, distributing round-robin among the available GPUs

Ø The size of the CUDA kernels is controlled by NSIM: broadly, NSIM ↑ = better GPU utilisation but higher memory requirements

Ø <#procs> should be a multiple of <#GPUs>, and for most systems you will probably end up underpopulating the CPUs

[Diagram: round-robin mapping of MPI processes to GPUs - e.g. four processes sharing two GPUs (two per card), or four processes across four GPUs (one per card)]
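A small Python sketch of the round-robin mapping illustrated above (the real assignment happens inside the CUDA port; this is only to show the pattern):

# Illustrative round-robin assignment of MPI ranks to GPUs.
def assign_gpus(n_procs, n_gpus):
    if n_procs % n_gpus != 0:
        print("Warning: <#procs> should be a multiple of <#GPUs>")
    return {rank: rank % n_gpus for rank in range(n_procs)}

print(assign_gpus(4, 2))   # {0: 0, 1: 1, 2: 0, 3: 1} - two processes per GPU
print(assign_gpus(4, 4))   # {0: 0, 1: 1, 2: 2, 3: 3} - one process per GPU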

Balena User Group Meeting, February 2017 | Slide 19

vasp-gpu: Load balancing

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 20

vasp-gpu: Benchmarking

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

                    NSIM
#MPI processes      1       2      4      8      12     16     24     32     48     64
1                   13.52   8.88   8.15   7.82   7.77   7.76   7.72   7.74   7.81   7.89
2                   9.11    6.75   6.34   6.21   6.23   6.21   6.23   6.25   6.32   OOM
4                   6.72    5.57   5.33   5.24   5.29   5.30   OOM    OOM    OOM    OOM
8                   6.01    5.26   5.14   OOM    OOM    OOM    OOM    OOM    OOM    OOM
12                  OOM     OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM
16                  OOM     OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM    OOM

(OOM = out of memory)
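Given a grid of timings like the one above, choosing the fastest configuration that does not run out of memory is straightforward; a short sketch (the numbers are a hand-typed subset of the table, with None marking OOM entries):

# Pick the fastest (#MPI processes, NSIM) combination that did not run OOM.
timings = {
    (1, 8): 7.82, (1, 24): 7.72,
    (2, 8): 6.21, (2, 64): None,
    (4, 8): 5.24, (4, 24): None,
    (8, 4): 5.14, (8, 8): None,
}

valid = {k: v for k, v in timings.items() if v is not None}
(procs, nsim), best = min(valid.items(), key=lambda kv: kv[1])
print(f"Best: {procs} MPI processes, NSIM = {nsim} ({best})")
# -> Best: 8 MPI processes, NSIM = 4 (5.14)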

Balena User Group Meeting, February 2017 | Slide 21

vasp-gpu: Benchmarking

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 22

vasp-gpu: Benchmarking

[Plots: speedup relative to vasp_gam and to vasp_std vs. number of atoms (64 to 512), for 1 GPU and 4 GPUs; y-axis from 0.0 to 5.0]

                    NSIM
#MPI processes      1           2         4         8         16
1                   -14131.52   -158.39   -158.39   -158.39   -158.39
2                   -14131.52   -158.39   -158.39   -158.39   -158.39
4                   -14131.52   -158.39   -158.39   -158.39   -158.39
8                   -14131.52   -158.39   -158.39   -         -
12                  -           -         -         -         -
16                  -           -         -         -         -

Ø 64 to 1,024 atoms in a random cubic arrangement; ALGO = VeryFast w/ LREAL = Auto, k = Γ; 1 GPU node w/ 1 or 4 Tesla K20x cards vs. 1 compute node

Balena User Group Meeting, February 2017 | Slide 23

vasp-gpu: Benchmarking

Ø Three papers covering the implementation and testing…:

o M. Hacene et al., J. Comput. Chem. 33, 2581 (2012), 10.1002/jcc.23096

o M. Hutchinson and M. Widom, Comput. Phys. Commun. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017

o S. Maintz et al., Comput. Phys. Commun. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010

Ø …and a couple of other links:

o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-1-with-gpu-support

o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/

o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html

Balena User Group Meeting, February 2017 | Slide 24

Further reading

Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use fewer resources (the default settings aren’t great...)

Ø At the moment, running VASP on GPUs is mostly for interest:

o Does not benefit all types of job

o Requires some fiddly testing to get the best performance

o If you will be running a lot of a suitable workload on Balena (e.g. large MD jobs), it could be worth the effort

Ø Aims for further benchmark tests:

o What types of job benefit from GPU acceleration?

o What is the most “balanced” configuration (1/2/4 GPUs/node)?

o Is it possible to run over multiple GPU nodes?

o Can GPUs be a cost/power-efficient way to run certain VASP jobs?

Balena User Group Meeting, February 2017 | Slide 25

Thoughts and discussion points

Balena User Group Meeting, February 2017 | Slide 26

Acknowledgements