
Transcript
  • Porting VASP to GPU using OpenACC

    Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse

    SC19, Denver, 19th Nov. 2019

  • The Vienna Ab-initio Simulation Package: VASP

    Electronic structure from first principles:

    𝐻𝜓 = 𝐸𝜓

    • Approximations:

    • Density Functional Theory (DFT)

    • Hartree-Fock/DFT-HF hybrid functionals

    • Random-Phase-Approximation (GW, ACFDT)

    • 3500+ licensed academic and industrial groups worldwide.

    • 10k+ publications in 2015 (Google Scholar), and rising.

    • Developed in the group of Prof. G. Kresse at the University of Vienna.


  • NVIDIA BOOTH @ SC19

    VASP: Computational Characteristics

    VASP does:

    • Lots of “smallish” FFTs (e.g. 100 ⨉ 100 ⨉ 100)

    • Matrix-matrix multiplication (DGEMM and ZGEMM)

    • Matrix diagonalization: 𝒪(N³) (N ≈ number of electrons)

    • All-to-all communication

    Using:

    • fftw3d (or fftw-wrappers to mkl-ffts)

    • LAPACK BLAS3 (mkl, OpenBLAS)

    • scaLAPACK (or ELPA)

    • MPI (OpenMPI, impi, …) [+ OpenMP]

    VASP is pretty well characterized by the SPECfp2006 benchmark
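    To make the BLAS-3 workload concrete, here is a minimal, self-contained sketch (not VASP source; matrix names and sizes are illustrative only) of the kind of ZGEMM call such codes issue against MKL or OpenBLAS:

    program zgemm_demo
      ! Illustrative only: a dense complex matrix-matrix multiplication
      ! C := alpha*A*B + beta*C via the standard BLAS ZGEMM interface.
      implicit none
      integer, parameter :: m = 512, n = 256, k = 256
      complex(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(m,k), b(k,n), c(m,n))
      a = (1.0d0, 0.0d0)
      b = (0.5d0, 0.0d0)
      c = (0.0d0, 0.0d0)
      ! external BLAS routine: link against any BLAS3 library listed above
      call zgemm('N', 'N', m, n, k, (1.0d0, 0.0d0), a, m, b, k, (0.0d0, 0.0d0), c, m)
      print *, 'c(1,1) =', c(1,1)
    end program zgemm_demo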

  • NVIDIA BOOTH @ SC19

    VASP on GPU

    • VASP has organically grown over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)

    • Current release: some features were ported with CUDA C (DFT and hybrid functionals)

    • Upcoming VASP6 release: re-ported to GPU using OpenACC

    • The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)

  • NVIDIA BOOTH @ SC19

    Porting VASP to GPU using OpenACC

    • Compiler-directive based: single source, readability, maintainability, …

    • cuFFT, cuBLAS, cuSOLVER, CUDA aware MPI, NCCL

    • Some dedicated kernel versions: e.g. batching FFTs, loop re-ordering

    • “Manual” deep copies of derived types (nested and/or with pointer members)

    • Multiple MPI ranks sharing a GPU (using MPS)

    • Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues)
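    As a rough illustration of the last point, the sketch below (not VASP code; the array and loop names are invented) shows OpenMP threads each submitting their block of work to a separate asynchronous OpenACC queue, so kernels from different threads can overlap on the GPU:

    program omp_acc_queues
      ! Illustrative only: each OpenMP thread launches its block of work
      ! on its own OpenACC async queue (queue id = block index).
      implicit none
      integer, parameter :: nb = 8, n = 100000
      real(8) :: work(n, nb)
      integer :: ib, i
      work = 1.0d0
    !$acc enter data copyin(work)
    !$omp parallel do private(ib, i)
      do ib = 1, nb
    !$acc parallel loop async(ib) present(work)
         do i = 1, n
            work(i, ib) = 2.0d0*work(i, ib)
         end do
      end do
    !$omp end parallel do
    !$acc wait
    !$acc exit data copyout(work)
      print *, 'sum =', sum(work)
    end program omp_acc_queues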

  • NVIDIA BOOTH @ SC19

    OpenACC directives

    Data directives are designed to be optional:

    • Manage data movement

    • Initiate parallel execution

    • Optimize loop mappings

    !$acc data copyin(a,b) copyout(c)
    ...
    !$acc parallel
    !$acc loop gang vector
    do i = 1, n
       c(i) = a(i) + b(i)
       ...
    enddo
    !$acc end parallel
    ...
    !$acc end data

  • NVIDIA BOOTH @ SC19

    Nested derived types


    Managing VASP aggregate data structures

    • OpenACC + Unified Memory is not an option today; some aggregates have static members

    • OpenACC 2.6 manual deep copy was key (sketched at the end of this slide)

    • Requires large numbers of directives in some cases, but well encapsulated (107 lines for COPYIN)

    • Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives

    • When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all

    [Figure: hierarchy of nested derived types handled by the manual deep copy]

    • Derived type 1 members: 3 dynamic, 1 of derived type 2

    • Derived type 2 members: 21 dynamic, 1 of derived type 3, 1 of derived type 4

    • Derived type 3 members: only static

    • Derived type 4 members: 8 dynamic, 4 of derived type 5, 2 of derived type 6

    • Derived type 5 members: 3 dynamic

    • Derived type 6 members: 8 dynamic

    Additional directive lines for the manual deep copy: +12, +48, +26, +8, and +13 lines of code (107 in total)
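    To make the manual deep copy concrete, here is a minimal sketch (not VASP code; the type and member names are invented) of the OpenACC 2.6 pattern for a derived type with one dynamic member: copy the shell, then copy the member so the runtime attaches it to the device copy of the parent.

    module wave_mod
      ! Invented stand-in for a VASP aggregate with a dynamic member
      type :: wavedes_t
         integer :: n
         real(8), allocatable :: coeff(:)
      end type
    end module

    program deepcopy_demo
      use wave_mod
      implicit none
      type(wavedes_t) :: w
      integer :: i
      w%n = 1000
      allocate(w%coeff(w%n))
      w%coeff = 1.0d0
      ! manual deep copy: parent shell first, then the dynamic member
    !$acc enter data copyin(w)
    !$acc enter data copyin(w%coeff)
    !$acc parallel loop present(w)
      do i = 1, w%n
         w%coeff(i) = 2.0d0*w%coeff(i)
      end do
      ! reverse order on the way out: member first, then the shell
    !$acc exit data copyout(w%coeff)
    !$acc exit data delete(w)
      print *, 'coeff(1) =', w%coeff(1)
    end program deepcopy_demo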

  • NVIDIA BOOTH @ SC19

    VASP on GPU benchmarks

    CuC_vdW

    • C@Cu surface (Ω ≅ 2800 Å³)

    • 96 Cu + 2 C atoms (1064 e−)

    • vdW-DFT

    • RMM-DIIS

    • The OpenACC port outperforms the previous CUDA port …

    Speedup vs. CPU:

                VASP 5   VASP 6   VASP 6+
    CPU            1.0      1.0      1.0
    1 V100         1.7      2.3      2.5
    2 V100         2.2      3.3      3.7
    4 V100         2.9      4.1      4.7
    8 V100         3.3      5.4      6.6

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores

  • NVIDIA BOOTH @ SC19

    CUDA C vs. OpenACC port

    • Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:

    • Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs the differences

    • Using OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version

    • The OpenACC version uses GPU-aware MPI to speed up the more communication-heavy parts, like orthonormalization

    • The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older

    Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …

  • NVIDIA BOOTH @ SC19

    Iterative diagonalization: RMM-DIIS (EDDRMM)

    • The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions

    • The CUDA version uses kernel fusing; the OpenACC version uses two refactored kernels (see the sketch at the end of this slide)

    • minimal amount of MPI communication

    • OpenACC version improves scaling with number of GPUs

    [Chart: EDDRMM section (silica_IFPEN), speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8), for VASP 5.4.4 and dev_OpenACC]

    VASP OpenACC performance: EDDRMM section of silica_IFPEN on V100

    • EDDRMM takes 17% of total runtime

    • benefits for expectation values included

    • These high speedups are not the sole source of the overall improvement, but an important contribution

    • OpenACC improves scaling yet again

    • MPS always helps, but does not pay off in total time due to start-up overhead

    CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1; GPU: 5.4.4 compiler Intel 17.0.1, dev_OpenACC compiler PGI 18.3 (CUDA 9.1)
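    To illustrate what “kernel fusing” versus separate refactored kernels means in OpenACC terms, here is a hedged, generic sketch (not the actual EDDRMM kernels; the arrays and the update formula are invented):

    program fuse_demo
      ! Illustrative only: the same residual update and norm written either
      ! as two separate OpenACC kernels or as one fused kernel.
      implicit none
      integer, parameter :: n = 100000
      real(8), parameter :: lam = 0.1d0
      real(8) :: r(n), p(n), e
      integer :: i
      r = 1.0d0; p = 0.5d0
    !$acc data copy(r) copyin(p)
      ! variant 1: two separate kernels
      e = 0.0d0
    !$acc parallel loop present(r, p)
      do i = 1, n
         r(i) = r(i) - lam*p(i)
      end do
    !$acc parallel loop present(r) reduction(+:e)
      do i = 1, n
         e = e + r(i)*r(i)
      end do
      ! variant 2: one fused kernel doing both steps in a single pass
      e = 0.0d0
    !$acc parallel loop present(r, p) reduction(+:e)
      do i = 1, n
         r(i) = r(i) - lam*p(i)
         e = e + r(i)*r(i)
      end do
    !$acc end data
      print *, 'e =', e
    end program fuse_demo

    In general, fusing saves kernel launches and passes over memory, while keeping two simpler loops is easier to express and maintain with directives.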

  • NVIDIA BOOTH @ SC19

    Orthonormalization

    • GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch at the end of this slide)

    • Data remains on the GPU; the CUDA port streamed data for the GEMMs

    • Cholesky on the GPU (cuSolver) saves a (smaller) memory transfer

    • 180 ms (40%) are saved by GPU-aware MPI alone

    • 33 ms (7.5%) by others

    VASP OpenACC performance: section-level comparison for orthonormalization

    Step                             CUDA C port               OpenACC port
    Redistributing wavefunctions     Host-only MPI (185 ms)    GPU-aware MPI (110 ms)
    Matrix-matrix multiplications    Streamed data (19 ms)     GPU-local data (15 ms)
    Cholesky decomposition           CPU-only (24 ms)          cuSolver (12 ms)
    Matrix-matrix multiplications    Default scheme (30 ms)    Better blocking (13 ms)
    Redistributing wavefunctions     Host-only MPI (185 ms)    GPU-aware MPI (80 ms)

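    Concretely, the GPU-aware MPI path relies on OpenACC's host_data directive to hand device buffers straight to the MPI library. A minimal sketch follows (not VASP code; the buffer names and the use of MPI_Allreduce are illustrative only):

    program gpu_aware_mpi_demo
      ! Illustrative only: expose device addresses to a CUDA-aware MPI,
      ! so data moves GPU-to-GPU (e.g. over NVLink) without host staging.
      use mpi
      implicit none
      integer, parameter :: n = 100000
      real(8) :: sendbuf(n), recvbuf(n)
      integer :: ierr, i
      call MPI_Init(ierr)
      sendbuf = 1.0d0
    !$acc data copyin(sendbuf) copyout(recvbuf)
      ! some device-side work producing sendbuf
    !$acc parallel loop present(sendbuf)
      do i = 1, n
         sendbuf(i) = 2.0d0*sendbuf(i)
      end do
      ! host_data passes the *device* addresses to MPI
    !$acc host_data use_device(sendbuf, recvbuf)
      call MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)
    !$acc end host_data
    !$acc end data
      call MPI_Finalize(ierr)
    end program gpu_aware_mpi_demo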

  • NVIDIA BOOTH @ SC19

    VASP on GPU benchmarks

    Si256_VJT_HSE06

    • Vacancy in Si (Ω ≅ 5200 Å³)

    • 255 Si atoms (1020 e−)

    • DFT/HF-hybrid functional

    • Conjugate gradient

    • Batched FFTs

    • Explicit overlap of computation and communication using non-blocking collectives (NCCL); sketched below

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz: 40 physical cores

    Speedup vs. CPU:

               VASP 6   VASP NV
    CPU           1.0       1.0
    1 V100        4.7       4.7
    2 V100        8.8       9.0
    4 V100       15.7      15.9
    8 V100       28.1      28.7
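    The sketch below illustrates the overlap idea mentioned above. It is not VASP code: VASP6 uses NCCL for this, whereas here a plain non-blocking MPI_Iallreduce on device buffers (via CUDA-aware MPI) stands in, and the buffer names are invented.

    program overlap_demo
      ! Illustrative only: start a non-blocking collective on one block
      ! while the GPU works on the next block asynchronously.
      use mpi
      implicit none
      integer, parameter :: n = 100000
      real(8) :: block_a(n), block_b(n), red_a(n)
      integer :: ierr, req, i
      call MPI_Init(ierr)
      block_a = 1.0d0; block_b = 1.0d0
    !$acc data copyin(block_a, block_b) copyout(red_a)
      ! start reducing block_a across ranks ...
    !$acc host_data use_device(block_a, red_a)
      call MPI_Iallreduce(block_a, red_a, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                          MPI_COMM_WORLD, req, ierr)
    !$acc end host_data
      ! ... while the GPU processes the next block on an async queue
    !$acc parallel loop async(1) present(block_b)
      do i = 1, n
         block_b(i) = 2.0d0*block_b(i)
      end do
      call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
    !$acc wait(1)
    !$acc end data
      call MPI_Finalize(ierr)
    end program overlap_demo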

  • NVIDIA BOOTH @ SC19

    The OpenACC port: current limitations

    • Some bottlenecks must be addressed: computation of the local potential is still done CPU-side.

    • Not all features are ported yet: currently we are porting the linear-response solvers and cubic-scaling ACFDT (RPA total energies)

    • Some features of VASP, e.g. cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd

    • PGI compilers only

  • NVIDIA BOOTH @ SC19

    New Release: VASP6

    • …

    • Cubic-scaling RPA (ACFDT, GW)

    • On-the-fly machine-learned force fields

    • Electron-phonon coupling

    • MPI+OpenMP

    • OpenACC port

    • …

    • Caveat: the OpenACC port is still regarded as “experimental” at this stage

    • Actively gathering feedback (from HPC sites)

    • Intensive support effort

    https://www.vasp.at/wiki/index.php/Category:VASP6


  • NVIDIA BOOTH @ SC19

    THE END

    Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!

    And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!

    And to you for listening!