Post on 16-Mar-2020
Porting VASP to GPU using OpenACC
Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse
SC19, Denver, 19th Nov. 2019
The Vienna Ab-initio Simulation Package: VASP
Electronic structure from first principles:
𝐻𝜓 = 𝐸𝜓
• Approximations:
• Density Functional Theory (DFT)
• Hartree-Fock/DFT-HF hybrid functionals
• Random-Phase Approximation (GW, ACFDT)
• 3500+ licensed academic and industrial groups worldwide.
• 10k+ publications in 2015 (Google Scholar), and rising.
• Developed in the group of Prof. G. Kresse at the University of Vienna.
NVIDIA BOOTH @ SC19
VASP: Computational Characteristics
VASP does:
• Lots of “smallish” FFTs (e.g. 100⨉100⨉100)
• Matrix-matrix multiplication (DGEMM and ZGEMM)
• Matrix diagonalization: 𝒪(𝑁³) (𝑁 ≈ number of electrons)
• All-to-all communication
Using:
• FFTW3 (or the FFTW wrappers to MKL FFTs)
• BLAS3/LAPACK (MKL, OpenBLAS)
• ScaLAPACK (or ELPA)
• MPI (Open MPI, Intel MPI, …) [+ OpenMP]
VASP is pretty well characterized by the SPECfp2006 benchmark
VASP on GPU
• VASP has grown organically over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)
• Current release: some features were ported with CUDA C (DFT and hybrid functionals)
• Upcoming VASP6 release: re-ported to GPU using OpenACC
• The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)
Porting VASP to GPU using OpenACC
• Compiler-directive based: single source, readability, maintainability, …
• cuFFT, cuBLAS, cuSOLVER, CUDA aware MPI, NCCL
• Some dedicated kernel versions: e.g. batched FFTs, loop re-ordering
• “Manual” deep copies of derived types (nested and/or with pointer members)
• Multiple MPI ranks sharing a GPU (using MPS)
• Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues)
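The interoperability points above can be sketched in a few lines. This is a minimal, hypothetical fragment (the names `buf`, `n` and `comm` are illustrative, not VASP identifiers) showing how `!$acc host_data use_device` hands a device address to a CUDA-aware MPI library, so collectives operate on GPU-resident data without a host staging copy:

```fortran
! Sketch: GPU-resident reduction through CUDA-aware MPI.
! "buf", "n" and "comm" are illustrative names, not VASP code.
real(8) :: buf(n)
integer :: ierr
!$acc data copyin(buf)
!$acc host_data use_device(buf)
   ! MPI sees the *device* address of buf; a CUDA-aware MPI
   ! implementation can transfer directly from GPU memory
   call MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_REAL8, MPI_SUM, comm, ierr)
!$acc end host_data
!$acc end data
```

The same `host_data` mechanism is how device pointers are passed to cuFFT, cuBLAS and cuSOLVER routines.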
OpenACC directives: data directives are designed to be optional
• Manage data movement
• Initiate parallel execution
• Optimize loop mappings

!$acc data copyin(a,b) copyout(c)
...
!$acc parallel
!$acc loop gang vector
do i=1, n
   c(i) = a(i) + b(i)
   ...
enddo
!$acc end parallel
...
!$acc end data
MANAGING VASP AGGREGATE DATA STRUCTURES
• OpenACC + Unified Memory is not an option today: some aggregates have static members
• OpenACC 2.6 manual deep copy was key
• Requires large numbers of directives in some cases, but well encapsulated (107 lines for COPYIN)
• Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives
• When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all
The aggregate hierarchy and the data directives each type required:
• Derived Type 1 (members: 3 dynamic, 1 of Derived Type 2): +12 lines of code
• Derived Type 2 (members: 21 dynamic, 1 of Derived Type 3, 1 of Derived Type 4): +48 lines of code
• Derived Type 3 (members: only static)
• Derived Type 4 (members: 8 dynamic, 4 of Derived Type 5, 2 of Derived Type 6): +26 lines of code
• Derived Type 5 (members: 3 dynamic): +8 lines of code
• Derived Type 6 (members: 8 dynamic): +13 lines of code
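The directive counts above come from manual deep copies like the following sketch. The type and member names are hypothetical stand-ins, not VASP's actual aggregates; the point is that every dynamic member must be copied (and attached to the device copy of its parent) explicitly:

```fortran
! Sketch of an OpenACC 2.6 manual deep copy for a nested derived type.
! Types and members are illustrative, not VASP code.
type inner_t
   real(8), pointer :: vals(:)
end type inner_t
type outer_t
   real(8), pointer :: a(:)
   type(inner_t)    :: child
end type outer_t

type(outer_t) :: w
! 1) shallow copy of the top-level object (pointer members still
!    reference host memory at this point)
!$acc enter data copyin(w)
! 2) copy each dynamic member; OpenACC 2.6 attaches each one to the
!    device copy of its parent
!$acc enter data copyin(w%a)
!$acc enter data copyin(w%child%vals)
```

Wrapping such sequences in per-type COPYIN routines is what keeps the directives "well encapsulated" despite their number.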
VASP on GPU benchmarks
CuC_vdW
• C@Cu surface (Ω ≅ 2800 Å³)
• 96 Cu + 2 C atoms (1064 e−)
• vdW-DFT
• RMM-DIIS
• OpenACC port outperforms the previous CUDA port …
Speedup vs. CPU:

        | VASP 5 | VASP 6 | VASP 6+
CPU     | 1.0    | 1.0    | 1.0
1 V100  | 1.7    | 2.3    | 2.5
2 V100  | 2.2    | 3.3    | 3.7
4 V100  | 2.9    | 4.1    | 4.7
8 V100  | 3.3    | 5.4    | 6.6
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
CUDA C vs. OpenACC port
• Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:
• Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs differences
• OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version
• The OpenACC version uses GPU-aware MPI to help the more communication-heavy parts, like orthonormalization
• The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older
Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …
Iterative diagonalization: RMM-DIIS (EDDRMM)
• The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions
• The CUDA version uses kernel fusing; the OpenACC version uses two refactored kernels
• Minimal amount of MPI communication
• The OpenACC version improves scaling with the number of GPUs
[Figure: EDDRMM section (silica_IFPEN), speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8), VASP 5.4.4 vs. dev_OpenACC]
VASP OPENACC PERFORMANCE: EDDRMM section of silica_IFPEN on V100
• EDDRMM takes 17% of the total runtime
• Benefits for expectation values are included
• These high speedups are not the only factor in the overall improvement, but an important contribution
• OpenACC improves scaling yet again
• MPS always helps here, but does not pay off in total time due to start-up overhead
CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1. GPU: VASP 5.4.4 compiled with Intel 17.0.1; dev_OpenACC compiled with PGI 18.3 (CUDA 9.1)
Orthonormalization
• GPU-aware MPI benefits from NVLink latency and B/W
• Data remains on GPU, CUDA port streamed data for GEMMs
• Cholesky on CPU saves a (smaller) mem-transfer
• 180 ms (40%) are saved by GPU-aware MPI alone
• 33 ms (7.5%) by others
VASP OPENACC PERFORMANCE: section-level comparison for orthonormalization
Section                      | CUDA C port            | OpenACC port
Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (110 ms)
Matrix-matrix muls           | Streamed data (19 ms)  | GPU-local data (15 ms)
Cholesky decomposition       | CPU-only (24 ms)       | cuSolver (12 ms)
Matrix-matrix muls           | Default scheme (30 ms) | Better blocking (13 ms)
Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (80 ms)
VASP on GPU benchmarks
Si256_VJT_HSE06
• Vacancy in Si (Ω ≅ 5200 Å³)
• 255 Si atoms (1020 e−)
• DFT/HF-hybrid functional
• Conjugate gradient
• Batched FFTs
• Explicit overlap of computation and communication using non-blocking collectives (NCCL)
• CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores
Speedup vs. CPU:

        | VASP 6 | VASP NV
CPU     | 1.0    | 1.0
1 V100  | 4.7    | 4.7
2 V100  | 8.8    | 9.0
4 V100  | 15.7   | 15.9
8 V100  | 28.1   | 28.7
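The explicit compute/communication overlap used in this benchmark can be sketched with OpenACC async queues. VASP uses NCCL non-blocking collectives here; in this hypothetical fragment, non-blocking MPI stands in, and all array and variable names are illustrative:

```fortran
! Sketch: overlap GPU compute (async queue 1) with a non-blocking
! collective on other data. Names are illustrative, not VASP code.
!$acc parallel loop async(1) present(work)
do i = 1, n
   work(i) = 0.5d0 * work(i)
end do

!$acc host_data use_device(buf)
! collective proceeds while the kernel above runs on queue 1
call MPI_Iallreduce(MPI_IN_PLACE, buf, m, MPI_REAL8, MPI_SUM, comm, req, ierr)
!$acc end host_data

!$acc wait(1)                                 ! GPU kernel done
call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)   ! collective done
```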
The OpenACC port: current limitations
• Some bottlenecks must still be addressed: computation of the local potential is still done CPU-side.
• Not all features are ported yet: currently we are porting the linear-response solvers and the cubic-scaling ACFDT (RPA total energies)
• Some features of VASP, e.g. the cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd
• PGI compilers only
New Release: VASP6
• …
• Cubic-scaling RPA (ACFDT, GW)
• On-the-fly machine-learned force fields
• Electron-phonon coupling
• MPI+OpenMP
• OpenACC port
• …
• Caveat: the OpenACC port is still regarded as “experimental” at this stage
• Actively gathering feedback (from HPC sites)
• Intensive support effort
https://www.vasp.at/wiki/index.php/Category:VASP6
THE END
Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!
And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!
And to you for listening!