PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
Transcript of “Performance Optimization for Scientific Applications”
Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29 2018
PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
2
AGENDA
• Introduction
• Single-GPU
• Exposing parallelism
• Memory Coalescing
• Data Management Optimization
• Interoperability
• ESCAPE performance results
• Multi-GPU
• CUDA-aware MPI
• AlltoAll optimization
• DGX-2 (with NVSwitch) vs DGX-1V
3
INTRODUCTION
4
ESCAPE
• NVIDIA’s role in the ESCAPE project is to take existing GPU-enabled codes and optimize them.
5
ECMWF
• The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
• Global forecasts:
• used by more than 22 countries to drive their regional forecasts.
• provide accurate weekly, monthly and seasonal predictions, including early warnings of
severe events.
• In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall
of Hurricane Sandy: information that undoubtedly saved lives.
• We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and
improve the algorithms, techniques and software on GPUs.
• This is done through use of dwarves: mini-apps designed to represent the key
properties of the real simulation applications.
ESCAPE Project Leaders
6
ESCAPE DWARVES
Spherical Harmonics (SH) Dwarf
• ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented on a spherical grid.
• Information in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.
• IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
• This dwarf represents the spectral transforms from IFS.
• NB. The number of points varies (e.g. most around the equator, fewest at the poles). Additionally, there are multiple altitude “levels” in a third dimension away from the surface of the earth, each with 3 “fields”.
7
ESCAPE DWARVES
• Advection: horizontal transport
• Uses unstructured grid with nearest-neighbour stencils
• MPDATA scheme already used within COSMO-EULAG (PSNC), and of interest to ECMWF for future developments
• Both the SH and MPDATA dwarves are Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.
• Many of the optimizations I will present are transferable to other applications/languages etc.
MPDATA Dwarf
8
SINGLE-GPU: EXPOSING PARALLELISM
9
EXPOSING PARALLELISM
Loop over timesteps
  …
  Loop over 1st dimension
    …
    Loop over 2nd dimension
      …
      Loop over fields
        …
        Operation (involving multidimensional arrays)
        …
  Another loop over dimensions…
  …
OpenACC Loop Mapping
Typical structure of the application, shown above (usually spanning multiple routines/files).
The aim is to expose as much parallelism as possible within these spatial/field loops (the red box on the slide), as flexibly as possible.
10
EXPOSING PARALLELISM
Loop over 1st dimension
  …
  Loop over 2nd dimension
    …
    Loop over fields
      …
      Operation
OpenACC Loop Mapping
Before Optimization:
SH
• Loop over 1st dimension was sequential: parallelism not exposed to GPU
MPDATA
• Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
• Resulting decomposition chosen by compiler did not work well, since runtime loop extents didn’t map well to GPU architecture.
11
EXPOSING PARALLELISM
!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation
OpenACC Loop Mapping
• Assuming the loops are independent, it is better to restructure so that the loops are tightly nested, and use the “collapse” clause.
• The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
• This can require extensive restructuring, depending on how the application is originally written (a runnable sketch follows below).
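A minimal runnable sketch of the collapsed form (the array names, sizes and the update formula here are purely illustrative, not taken from the dwarf code):

! illustrative only: three independent, tightly nested loops collapsed into
! one large parallel iteration space
program collapse_example
  implicit none
  integer, parameter :: n1=64, n2=64, nf=6
  real(8) :: a(n1,n2,nf), rhs(n1,n2,nf), dt
  integer :: i, j, f
  a = 1.0d0; rhs = 2.0d0; dt = 0.1d0
!$ACC parallel loop collapse(3) copy(a) copyin(rhs)
  do f = 1, nf
     do j = 1, n2
        do i = 1, n1
           a(i,j,f) = a(i,j,f) + dt*rhs(i,j,f)
        end do
     end do
  end do
  print *, a(1,1,1)   ! expect 1.2
end program collapse_example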
12
EXPOSING PARALLELISM
!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
!$ACC loop seq
    Loop with dependencies
      …
      Operation
OpenACC Loop Mapping
• Sometimes we have loop-carried dependencies.
• These can be performed sequentially by each thread at the innermost level.
• This can still perform well if there is enough parallelism in the outermost loops (see the sketch below).
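A minimal runnable sketch of this pattern (illustrative names; the vertical recurrence is a simple prefix sum, not the dwarf's actual operation):

program loop_seq_example
  implicit none
  integer, parameter :: ni=128, nj=128, nk=64
  real(8) :: q(ni,nj,nk)
  integer :: i, j, k
  q = 1.0d0
!$ACC parallel loop collapse(2) copy(q)
  do j = 1, nj
     do i = 1, ni
!$ACC loop seq
        do k = 2, nk
           ! loop-carried dependence in k: handled sequentially by each thread
           q(i,j,k) = q(i,j,k) + q(i,j,k-1)
        end do
     end do
  end do
  print *, q(1,1,nk)   ! expect 64.0
end program loop_seq_example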
13
EXPOSING PARALLELISM
do i=1, N
  do j=1, i
    do k=1, P
      …
      Operation
OpenACC Loop Mapping
!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      if(j .le. i) then
        …
        Operation
• Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
• The extent can be replaced with its maximum value, with a conditional statement in the loop body (a runnable sketch follows below).
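A compact runnable version of this pattern (names and the operation are illustrative; MAX_J is simply the largest possible extent of the inner loop):

program triangular_collapse
  implicit none
  integer, parameter :: n=64, max_j=64, p=8
  real(8) :: a(n,max_j,p)
  integer :: i, j, k
  a = 0.0d0
!$ACC parallel loop collapse(3) copy(a)
  do i = 1, n
     do j = 1, max_j
        do k = 1, p
           if (j <= i) then   ! guard reproduces the original triangular extent
              a(i,j,k) = dble(i+j+k)
           end if
        end do
     end do
  end do
  print *, sum(a)
end program triangular_collapse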
14
SINGLE-GPU: MEMORY COALESCING
15
KERNEL OPTIMIZATION
!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      …
      Array(i,j,k)=…
Memory Coalescing: data layout
• For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index (see the sketch after the fragments below).
!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
!$ACC loop seq
    do p=1, …
      …
      Array(n,m,p)=…
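A small runnable illustration of the layout rule (illustrative arrays): in "good" the innermost collapsed index i is also the fastest-moving array index, so consecutive threads touch consecutive memory; in "bad" it is the slowest-moving index, so accesses are strided:

program coalescing_layout
  implicit none
  integer, parameter :: nk=32, nj=64, ni=128
  real(8) :: good(ni,nj,nk), bad(nk,nj,ni)
  integer :: i, j, k
!$ACC parallel loop collapse(3) copyout(good, bad)
  do k = 1, nk
     do j = 1, nj
        do i = 1, ni
           good(i,j,k) = 1.0d0   ! coalesced: i varies fastest in memory
           bad(k,j,i)  = 1.0d0   ! strided: i is the slowest-moving index
        end do
     end do
  end do
  print *, good(1,1,1), bad(1,1,1)
end program coalescing_layout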
16
KERNEL OPTIMIZATION
!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    …
    array_t(i,j)=array(j,i)
Memory Coalescing: transposing with tile clause
• If you need to transpose, either the read or the write will be in the wrong layout for coalescing.
• But you can use the OpenACC “tile” clause to improve performance:
• Compiler tiles the operation by generating new innermost loops
• For each tile, data is staged on-chip
• Results in better global memory access patterns
• Experiment with tile sizes
17
KERNEL OPTIMIZATION
Memory Coalescing: transposing within CUDA BLAS
• In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.
• Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent C^T = B^T A^T.
• This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (see the sketch below).
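A sketch of the identity itself, using Fortran intrinsics so it runs without any BLAS library (sizes are arbitrary). In the dwarf, the right-hand side becomes a cuBLAS DGEMM call with the transpose flags set, instead of an explicit transpose kernel followed by an untransposed DGEMM:

program gemm_transpose_identity
  implicit none
  integer, parameter :: m=3, k=4, n=2
  real(8) :: A(m,k), B(k,n), Ct(n,m)
  call random_number(A); call random_number(B)
  ! if C = A*B, then C^T = B^T * A^T:
  Ct = matmul(transpose(B), transpose(A))
  ! equivalent BLAS-style call (cublasDgemm takes the same argument pattern):
  !   call dgemm('T', 'T', n, m, k, 1.0d0, B, k, A, m, 0.0d0, Ct, n)
  print *, maxval(abs(Ct - transpose(matmul(A,B))))   ! ~0: the identity holds
end program gemm_transpose_identity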
18
SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION
19
DATA MANAGEMENT OPTIMIZATION
Minimizing data allocation and movement
Data allocation and movement is expensive. Many codes have:

Loop over timesteps:
  …
  loops over spatial dims
  …

• Important to keep as much data as possible resident on the GPU within the timestep loop.
• All allocations/frees should be outside the timestep loop.
• Copies of constant data should be outside the main timestep loop.
• Re-use temporary scratch arrays on the device.
• For any necessary repeated copies (e.g. halo regions in MPI code), the volume copied should be minimized.
A sketch of this pattern using an OpenACC data region follows below.
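A minimal sketch of this structure with an OpenACC data region (array names and the arithmetic are illustrative): everything is allocated and copied in once, scratch space lives only on the device, and results come back once at the end:

program data_region_example
  implicit none
  integer, parameter :: n=1024, nsteps=100
  real(8), allocatable :: state(:), coeff(:), scratch(:)
  integer :: step, i
  allocate(state(n), coeff(n), scratch(n))   ! allocate once, outside the timestep loop
  state = 0.0d0; coeff = 1.0d0
!$ACC data copy(state) copyin(coeff) create(scratch)
  do step = 1, nsteps
!$ACC parallel loop
     do i = 1, n
        scratch(i) = coeff(i)*dble(step)     ! scratch re-used every step, never copied
     end do
!$ACC parallel loop
     do i = 1, n
        state(i) = state(i) + scratch(i)
     end do
  end do
!$ACC end data
  print *, state(1)   ! expect 5050.0
end program data_region_example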
20
SINGLE-GPU: INTEROPERABILITY
21
INTEROPERABILITY
Simple example: Calling C/CUDA from PGI Fortran
!main.f90
program main
interface
subroutine kernel_from_f(arg) &
bind(C,name='kernel_wrapper')
use iso_c_binding
integer(c_int),value :: arg
end subroutine kernel_from_f
end interface
call kernel_from_f(1)
end program main
//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
if (threadIdx.x==0)
printf("hello from kernel\n");
return;
}
extern "C" void kernel_wrapper(int arg){
kernel <<<1,1>>> (arg);
cudaDeviceSynchronize();
return;
}
$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel
CUDA libraries can be called from C code in the usual manner.
22
INTEROPERABILITY AND LIBRARIES: SH DWARF
Base language Fortran, with MPI for multi-GPU communications.
[Slide diagram: structure of the SH dwarf, with OpenACC compute regions interleaved with cuFFT and cublasDgemm library calls.]
23
BLAS/FFT LIBRARY CALLS IN SH DWARF
• At each timestep, the SH dwarf performs transforms using matrix multiplications and FFTs.
• Multiple operations - one for each:
• Field (associated with vertical levels)
• Longitude (Matmult) / Latitude (FFT)
• We can batch over fields, since the sizes are the same. But different longitudes/latitudes have different sizes, which is not supported by the batched versions of cublasDgemm/cuFFT.
• So, originally we had many small calls: low parallelism exposure and sensitivity to launch latency.
• For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library, and the extra operations do not contribute to the result (illustrated in the sketch below).
• FFT does not allow padding in the same way. We worked around the launch-latency problem by removing the sync after each call, which allows launch latency to be hidden behind execution.
• As will be seen, however, this is the only part of the dwarf that remains suboptimal. In future, a batched FFT allowing differing sizes within a batch should improve performance.
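A small stand-alone illustration of the zero-padding idea (plain Fortran with matmul, so it runs without cuBLAS; sizes are arbitrary): embedding a matrix product in zero-padded matrices leaves the "real" sub-block of the result unchanged, which is what allows all longitudes and fields to be padded to the largest size and handed to a single batched DGEMM call:

program zero_padding_example
  implicit none
  integer, parameter :: m=3, k=2, n=4      ! true sizes for one (longitude, field)
  integer, parameter :: mp=5, kp=6, np=7   ! padded (largest) sizes
  real(8) :: A(m,k), B(k,n), C(m,n)
  real(8) :: Ap(mp,kp), Bp(kp,np), Cp(mp,np)
  call random_number(A); call random_number(B)
  C = matmul(A, B)
  Ap = 0.0d0; Bp = 0.0d0        ! zero padding
  Ap(1:m,1:k) = A
  Bp(1:k,1:n) = B
  Cp = matmul(Ap, Bp)           ! in the dwarf: one batched cuBLAS call over all matrices
  print *, maxval(abs(Cp(1:m,1:n) - C))   ! ~0: padding does not change the result
end program zero_padding_example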
24
SINGLE-GPU: ESCAPE RESULTS
25
MPDATA OPTIMIZATION: P100
[Slide figure: before vs. after optimization]
26
OPTIMIZED MPDATA: P100 VS V100
[Slide figure: P100 vs. V100]
27
ESCAPE DWARF V100 PERFORMANCE
28
MPDATA KERNEL PERFORMANCE
• The 100% roofline is the STREAM benchmark throughput, since all kernels are memory-bandwidth bound.
29
ESCAPE DWARF V100 PERFORMANCE
30
SH KERNEL PERFORMANCE
• The 100% roofline is peak DP performance (compute-bound kernels) or STREAM benchmark throughput (memory-bandwidth-bound kernels).
Experiments with batching (using the average size) show a 4.6X speedup.
31
MULTI-GPU: CUDA-AWARE MPI
32
CUDA-AWARE MPI
• Modern MPI implementations are CUDA-aware. This means that pointers to GPU memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).
!non CUDA-aware:
!$ACC update host(array,…)
call MPI_Alltoallv(array,…)
!$ACC update device(array,…)
!CUDA-aware:
!$ACC host_data use_device(array,…)
call MPI_Alltoallv(array,…)
!$ACC end host_data
• The same applies to point-to-point communication etc. Note that if explicit buffer packing is involved, the packing also needs to be ported from CPU to GPU (see the sketch below).
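A minimal sketch of the point-to-point case with device-side buffer packing (the halo layout, array names and neighbour pattern are hypothetical; it assumes a CUDA-aware MPI build): the pack/unpack loops run on the GPU, and the device buffers are handed straight to MPI inside a host_data region:

program halo_exchange
  use mpi
  implicit none
  integer, parameter :: n = 256
  real(8) :: field(n,n), sendbuf(n), recvbuf(n)
  integer :: rank, nprocs, right, left, ierr, i
  integer :: reqs(2)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  right = mod(rank+1, nprocs); left = mod(rank-1+nprocs, nprocs)
  field = dble(rank)
!$ACC data copy(field) create(sendbuf, recvbuf)
!$ACC parallel loop
  do i = 1, n
     sendbuf(i) = field(n,i)          ! pack the boundary column on the device
  end do
!$ACC host_data use_device(sendbuf, recvbuf)
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(2), ierr)
!$ACC end host_data
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
!$ACC parallel loop
  do i = 1, n
     field(1,i) = recvbuf(i)          ! unpack into the halo column on the device
  end do
!$ACC end data
  call MPI_Finalize(ierr)
end program halo_exchange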
33
MULTI-GPU: ALLTOALL OPTIMIZATION
34
ALLTOALL: 4 GPUS ON DGX-1
• The SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies.
• We can instead use CUDA IPC directly with CUDA streams to overlap.
• Future MPI implementations are expected to take care of this.
35
CUDA IPC ALLTOALL
As a starting point, we used the 0_Simple/simpleIPC code from the CUDA samples.
Setup:
• On each MPI rank, get IPC memory handle for array locally
• Share IPC memory handles via MPI
• Set up CUDA streams
AlltoAll:
• On each rank, loop over all target ranks (including its own)
• targetrank = (loop index + rank) % number_of_gpus, so that targets are staggered across ranks for better balance (illustrated below)
• Push message to targetrank (in stream[targetrank])
• Sync all streams
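An illustration of the staggered target ordering only (plain Fortran, no CUDA): at every step of the loop each rank pushes to a different target, so no single GPU is hit by all the others at once. In the dwarf, the "push" is an asynchronous copy into the target's IPC-mapped buffer, issued in stream(targetrank):

program alltoall_ordering
  implicit none
  integer, parameter :: ngpus = 4
  integer :: rank, step, targetrank
  do rank = 0, ngpus-1
     write(*,'(a,i0,a)',advance='no') 'rank ', rank, ' targets:'
     do step = 0, ngpus-1
        targetrank = mod(step + rank, ngpus)   ! staggered start per rank
        write(*,'(1x,i0)',advance='no') targetrank
     end do
     write(*,*)
  end do
end program alltoall_ordering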
36
SH RESULTS ON 4 GPUS
37
MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V
38
SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS
• When using all 8 GPUs in a DGX-1V:
• There is no all-to-all NVLink connectivity: some messages go through PCIe and system memory
• This limits performance
• When using 16 GPUs across 2 DGX-1V servers:
• Some messages go across the InfiniBand network
• This is a further bottleneck
39
DGX-2 WITH NVSWITCH
• AlltoAll network architecture with NVSwitch maps perfectly to the problem.
• Full bandwidth between each GPU pair.
40
SPHERICAL HARMONICS: DGX-2 VS DGX-1V
41
SUMMARY AND FUTURE WORK
• Optimizing the exposure of parallelism, memory coalescing and data management can have dramatic effects on performance.
• MPDATA single-GPU performance is now optimal.
• MPDATA multi-GPU: the halo exchange is currently implemented via the CPU-based ATLAS library (also developed by ECMWF), which abstracts the MPI. To be fully optimized, ATLAS needs to be made CUDA-aware, and this is currently being worked on by others.
• SH single-GPU performance is vastly improved, but the FFT part remains sub-optimal.
• An implementation of batching that allows different sizes within each batch is expected to fix this.
• DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.
• These results indicate that multi-GPU systems can be effectively exploited, allowing forecasting agencies to continue to improve weather predictions.