PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
Transcript of “Performance Optimization for Scientific Applications”
Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29 2018
PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
2
AGENDA
• Introduction
• Single-GPU
• Exposing parallelism
• Memory Coalescing
• Data Management Optimization
• Interoperability
• ESCAPE performance results
• Multi-GPU
• CUDA-aware MPI
• AlltoAll optimization
• DGX-2 (with NVSwitch) vs DGX-1V
3
INTRODUCTION
4
ESCAPE
• NVIDIA’s role in the ESCAPE project is to take existing GPU-enabled codes and optimize them.
5
ECMWF
• The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
• Global forecasts:
• used by more than 22 countries to drive their regional forecasts.
• provide accurate weekly, monthly and seasonal predictions, including early warnings of
severe events.
• In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall
of Hurricane Sandy: information that undoubtedly saved lives.
• We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and
improve the algorithms, techniques and software on GPUs.
• This is done through use of dwarves: mini-apps designed to represent the key
properties of the real simulation applications.
ESCAPE Project Leaders
6
ESCAPE DWARVES
Spherical Harmonics (SH) Dwarf
• ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented on a spherical grid.
• Information in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.
• IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
• This dwarf represents the spectral transforms from IFS.
• NB. The number of points varies (e.g. most around the equator, fewest at the poles). Additionally, there are multiple altitude “levels” in a third dimension away from the surface of the earth, each with 3 “fields”.
7
ESCAPE DWARVES
• Advection: horizontal transport
• Uses unstructured grid with nearest-neighbour stencils
• MPDATA scheme already used within COSMO-EULAG (PSNC), and of interest to ECMWF for future developments
• Both the SH and MPDATA dwarves are Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.
• Many of the optimizations I will present are transferable to other applications/languages etc.
MPDATA Dwarf
8
SINGLE-GPU: EXPOSING PARALLELISM
9
EXPOSING PARALLELISM
Loop over timesteps
  …
  Loop over 1st dimension
    …
    Loop over 2nd dimension
      …
      Loop over fields
        …
        Operation (involving multidimensional arrays)
        …
  Another loop over dimensions…
  …
OpenACC Loop Mapping
Typical structure of the application, shown above (usually spanning multiple routines/files).
The aim is to expose as much parallelism as possible within these spatial/field loops (the red box on the slide), as flexibly as possible.
10
EXPOSING PARALLELISM
Loop over 1st dimension
  …
  Loop over 2nd dimension
    …
    Loop over fields
      …
      Operation
OpenACC Loop Mapping
Before Optimization:
SH
• Loop over 1st dimension was sequential: parallelism not exposed to GPU
MPDATA
• Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
• Resulting decomposition chosen by compiler did not work well, since runtime loop extents didn’t map well to GPU architecture.
11
EXPOSING PARALLELISM
!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation
OpenACC Loop Mapping
• Assuming the loops are independent, it is better to restructure so that the loops are tightly nested, and use the “collapse” clause.
• The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
• This can require extensive restructuring, depending on how the application is originally written (a runnable sketch follows below).
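A minimal runnable sketch of the collapsed form (the array names, sizes and the update formula here are purely illustrative, not taken from the dwarf code):

! illustrative only: three independent, tightly nested loops collapsed into
! one large parallel iteration space
program collapse_example
  implicit none
  integer, parameter :: n1=64, n2=64, nf=6
  real(8) :: a(n1,n2,nf), rhs(n1,n2,nf), dt
  integer :: i, j, f
  a = 1.0d0; rhs = 2.0d0; dt = 0.1d0
!$ACC parallel loop collapse(3) copy(a) copyin(rhs)
  do f = 1, nf
     do j = 1, n2
        do i = 1, n1
           a(i,j,f) = a(i,j,f) + dt*rhs(i,j,f)
        end do
     end do
  end do
  print *, a(1,1,1)   ! expect 1.2
end program collapse_example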
12
EXPOSING PARALLELISM
!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
!$ACC loop seq
    Loop with dependencies
      …
      Operation
OpenACC Loop Mapping
• Sometimes we have loop-carried dependencies.
• These can be performed sequentially by each thread at the innermost level.
• This can still perform well if there is enough parallelism in the outermost loops (see the sketch below).
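A minimal runnable sketch of this pattern (illustrative names; the vertical recurrence is a simple prefix sum, not the dwarf's actual operation):

program loop_seq_example
  implicit none
  integer, parameter :: ni=128, nj=128, nk=64
  real(8) :: q(ni,nj,nk)
  integer :: i, j, k
  q = 1.0d0
!$ACC parallel loop collapse(2) copy(q)
  do j = 1, nj
     do i = 1, ni
!$ACC loop seq
        do k = 2, nk
           ! loop-carried dependence in k: handled sequentially by each thread
           q(i,j,k) = q(i,j,k) + q(i,j,k-1)
        end do
     end do
  end do
  print *, q(1,1,nk)   ! expect 64.0
end program loop_seq_example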
13
EXPOSING PARALLELISM
do i=1, N
  do j=1, i
    do k=1, P
      …
      Operation
OpenACC Loop Mapping
!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      if(j .le. i) then
        …
        Operation
• Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
• The extent can be replaced with its maximum value, with a conditional statement in the loop body (a runnable sketch follows below).
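A compact runnable version of this pattern (names and the operation are illustrative; MAX_J is simply the largest possible extent of the inner loop):

program triangular_collapse
  implicit none
  integer, parameter :: n=64, max_j=64, p=8
  real(8) :: a(n,max_j,p)
  integer :: i, j, k
  a = 0.0d0
!$ACC parallel loop collapse(3) copy(a)
  do i = 1, n
     do j = 1, max_j
        do k = 1, p
           if (j <= i) then   ! guard reproduces the original triangular extent
              a(i,j,k) = dble(i+j+k)
           end if
        end do
     end do
  end do
  print *, sum(a)
end program triangular_collapse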
14
SINGLE-GPU: MEMORY COALESCING
15
KERNEL OPTIMIZATION
!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      …
      Array(i,j,k)=…
Memory Coalescing: data layout
• For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index (see the sketch after the fragments below).
!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
!$ACC loop seq
    do p=1, …
      …
      Array(n,m,p)=…
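A small runnable illustration of the layout rule (illustrative arrays): in "good" the innermost collapsed index i is also the fastest-moving array index, so consecutive threads touch consecutive memory; in "bad" it is the slowest-moving index, so accesses are strided:

program coalescing_layout
  implicit none
  integer, parameter :: nk=32, nj=64, ni=128
  real(8) :: good(ni,nj,nk), bad(nk,nj,ni)
  integer :: i, j, k
!$ACC parallel loop collapse(3) copyout(good, bad)
  do k = 1, nk
     do j = 1, nj
        do i = 1, ni
           good(i,j,k) = 1.0d0   ! coalesced: i varies fastest in memory
           bad(k,j,i)  = 1.0d0   ! strided: i is the slowest-moving index
        end do
     end do
  end do
  print *, good(1,1,1), bad(1,1,1)
end program coalescing_layout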
16
KERNEL OPTIMIZATION
!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    …
    array_t(i,j)=array(j,i)
Memory Coalescing: transposing with tile clause
• If you need to transpose, either the read or the write will be in the wrong layout for coalescing.
• But you can use the OpenACC “tile” clause to improve performance:
• Compiler tiles the operation by generating new innermost loops
• For each tile, data is staged on-chip
• Results in better global memory access patterns
• Experiment with tile sizes
17
KERNEL OPTIMIZATION
Memory Coalescing: transposing within CUDA BLAS
• In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.
• Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent C^T = B^T A^T.
• This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (see the sketch below).
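A sketch of the identity itself, using Fortran intrinsics so it runs without any BLAS library (sizes are arbitrary). In the dwarf, the right-hand side becomes a cuBLAS DGEMM call with the transpose flags set, instead of an explicit transpose kernel followed by an untransposed DGEMM:

program gemm_transpose_identity
  implicit none
  integer, parameter :: m=3, k=4, n=2
  real(8) :: A(m,k), B(k,n), Ct(n,m)
  call random_number(A); call random_number(B)
  ! if C = A*B, then C^T = B^T * A^T:
  Ct = matmul(transpose(B), transpose(A))
  ! equivalent BLAS-style call (cublasDgemm takes the same argument pattern):
  !   call dgemm('T', 'T', n, m, k, 1.0d0, B, k, A, m, 0.0d0, Ct, n)
  print *, maxval(abs(Ct - transpose(matmul(A,B))))   ! ~0: the identity holds
end program gemm_transpose_identity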
18
SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION
19
DATA MANAGEMENT OPTIMIZATION
Minimizing data allocation and movement
Data allocation and movement is expensive. Many codes have:

Loop over timesteps:
  …
  loops over spatial dims
  …

• Important to keep as much data as possible resident on the GPU within the timestep loop.
• All allocations/frees should be outside the timestep loop.
• Copies of constant data should be outside the main timestep loop.
• Re-use temporary scratch arrays on the device.
• For any necessary repeated copies (e.g. halo regions in MPI code), the volume copied should be minimized.
A sketch of this pattern using an OpenACC data region follows below.
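A minimal sketch of this structure with an OpenACC data region (array names and the arithmetic are illustrative): everything is allocated and copied in once, scratch space lives only on the device, and results come back once at the end:

program data_region_example
  implicit none
  integer, parameter :: n=1024, nsteps=100
  real(8), allocatable :: state(:), coeff(:), scratch(:)
  integer :: step, i
  allocate(state(n), coeff(n), scratch(n))   ! allocate once, outside the timestep loop
  state = 0.0d0; coeff = 1.0d0
!$ACC data copy(state) copyin(coeff) create(scratch)
  do step = 1, nsteps
!$ACC parallel loop
     do i = 1, n
        scratch(i) = coeff(i)*dble(step)     ! scratch re-used every step, never copied
     end do
!$ACC parallel loop
     do i = 1, n
        state(i) = state(i) + scratch(i)
     end do
  end do
!$ACC end data
  print *, state(1)   ! expect 5050.0
end program data_region_example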
20
SINGLE-GPU: INTEROPERABILITY
21
INTEROPERABILITY
Simple example: Calling C/CUDA from PGI Fortran
!main.f90
program main
interface
subroutine kernel_from_f(arg) &
bind(C,name='kernel_wrapper')
use iso_c_binding
integer(c_int),value :: arg
end subroutine kernel_from_f
end interface
call kernel_from_f(1)
end program main
//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
if (threadIdx.x==0)
printf("hello from kernel\n");
return;
}
extern "C" void kernel_wrapper(int arg){
kernel <<<1,1>>> (arg);
cudaDeviceSynchronize();
return;
}
$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel
CUDA libraries can be called from C code in the usual manner.
22
INTEROPERABILITY AND LIBRARIES: SH DWARF
Base language Fortran, with MPI for multi-GPU communications.
[Slide diagram: structure of the SH dwarf, with OpenACC compute regions interleaved with cuFFT and cublasDgemm library calls.]
23
BLAS/FFT LIBRARY CALLS IN SH DWARF
• At each timestep, the SH dwarf performs transforms using matrix multiplications and FFTs.
• Multiple operations - one for each:
• Field (associated with vertical levels)
• Longitude (Matmult) / Latitude (FFT)
• We can batch over fields, since the sizes are the same. But different longitudes/latitudes have different sizes, which is not supported by the batched versions of cublasDgemm/cuFFT.
• So, originally we had many small calls: low parallelism exposure and sensitivity to launch latency.
• For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library, and the extra operations do not contribute to the result (illustrated in the sketch below).
• FFT does not allow padding in the same way. We worked around the launch-latency problem by removing the sync after each call, which allows launch latency to be hidden behind execution.
• As will be seen, however, this is the only part of the dwarf that remains suboptimal. In future, a batched FFT allowing differing sizes within a batch should improve performance.
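A small stand-alone illustration of the zero-padding idea (plain Fortran with matmul, so it runs without cuBLAS; sizes are arbitrary): embedding a matrix product in zero-padded matrices leaves the "real" sub-block of the result unchanged, which is what allows all longitudes and fields to be padded to the largest size and handed to a single batched DGEMM call:

program zero_padding_example
  implicit none
  integer, parameter :: m=3, k=2, n=4      ! true sizes for one (longitude, field)
  integer, parameter :: mp=5, kp=6, np=7   ! padded (largest) sizes
  real(8) :: A(m,k), B(k,n), C(m,n)
  real(8) :: Ap(mp,kp), Bp(kp,np), Cp(mp,np)
  call random_number(A); call random_number(B)
  C = matmul(A, B)
  Ap = 0.0d0; Bp = 0.0d0        ! zero padding
  Ap(1:m,1:k) = A
  Bp(1:k,1:n) = B
  Cp = matmul(Ap, Bp)           ! in the dwarf: one batched cuBLAS call over all matrices
  print *, maxval(abs(Cp(1:m,1:n) - C))   ! ~0: padding does not change the result
end program zero_padding_example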
24
SINGLE-GPU: ESCAPE RESULTS
25
MPDATA OPTIMIZATION: P100
[Slide figure: before vs. after optimization]
26
OPTIMIZED MPDATA: P100 VS V100
[Slide figure: P100 vs. V100]
27
ESCAPE DWARF V100 PERFORMANCE
28
MPDATA KERNEL PERFORMANCE
• The 100% roofline is the STREAM benchmark throughput, since all kernels are memory-bandwidth bound.
29
ESCAPE DWARF V100 PERFORMANCE
30
SH KERNEL PERFORMANCE
• The 100% roofline is peak DP performance (compute-bound kernels) or STREAM benchmark throughput (memory-bandwidth-bound kernels).
Experiments with batching (using the average size) show a 4.6X speedup.
31
MULTI-GPU: CUDA-AWARE MPI
32
CUDA-AWARE MPI
• Modern MPI implementations are CUDA-aware. This means that pointers to GPU memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).
!non CUDA-aware:
!$ACC update host(array,…)
call MPI_Alltoallv(array,…)
!$ACC update device(array,…)
!CUDA-aware:
!$ACC host_data use_device(array,…)
call MPI_Alltoallv(array,…)
!$ACC end host_data
• The same applies to point-to-point communication etc. Note that if explicit buffer packing is involved, the packing also needs to be ported from CPU to GPU (see the sketch below).
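A minimal sketch of the point-to-point case with device-side buffer packing (the halo layout, array names and neighbour pattern are hypothetical; it assumes a CUDA-aware MPI build): the pack/unpack loops run on the GPU, and the device buffers are handed straight to MPI inside a host_data region:

program halo_exchange
  use mpi
  implicit none
  integer, parameter :: n = 256
  real(8) :: field(n,n), sendbuf(n), recvbuf(n)
  integer :: rank, nprocs, right, left, ierr, i
  integer :: reqs(2)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  right = mod(rank+1, nprocs); left = mod(rank-1+nprocs, nprocs)
  field = dble(rank)
!$ACC data copy(field) create(sendbuf, recvbuf)
!$ACC parallel loop
  do i = 1, n
     sendbuf(i) = field(n,i)          ! pack the boundary column on the device
  end do
!$ACC host_data use_device(sendbuf, recvbuf)
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(2), ierr)
!$ACC end host_data
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
!$ACC parallel loop
  do i = 1, n
     field(1,i) = recvbuf(i)          ! unpack into the halo column on the device
  end do
!$ACC end data
  call MPI_Finalize(ierr)
end program halo_exchange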
33
MULTI-GPU: ALLTOALL OPTIMIZATION
34
ALLTOALL: 4 GPUS ON DGX-1
• The SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies.
• We can instead use CUDA IPC directly with CUDA streams to overlap.
• Future MPI implementations are expected to take care of this.
35
CUDA IPC ALLTOALL
As a starting point, we used the 0_Simple/simpleIPC code from the CUDA samples.
Setup:
• On each MPI rank, get IPC memory handle for array locally
• Share IPC memory handles via MPI
• Set up CUDA streams
AlltoAll:
• On each rank, loop over all target ranks (including its own)
• targetrank = (loop index + rank) % number_of_gpus, so that targets are staggered across ranks for better balance (illustrated below)
• Push message to targetrank (in stream[targetrank])
• Sync all streams
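An illustration of the staggered target ordering only (plain Fortran, no CUDA): at every step of the loop each rank pushes to a different target, so no single GPU is hit by all the others at once. In the dwarf, the "push" is an asynchronous copy into the target's IPC-mapped buffer, issued in stream(targetrank):

program alltoall_ordering
  implicit none
  integer, parameter :: ngpus = 4
  integer :: rank, step, targetrank
  do rank = 0, ngpus-1
     write(*,'(a,i0,a)',advance='no') 'rank ', rank, ' targets:'
     do step = 0, ngpus-1
        targetrank = mod(step + rank, ngpus)   ! staggered start per rank
        write(*,'(1x,i0)',advance='no') targetrank
     end do
     write(*,*)
  end do
end program alltoall_ordering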
36
SH RESULTS ON 4 GPUS
37
MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V
38
SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS
• When using all 8 GPUs in a DGX-1V:
• There is no all-to-all NVLink connectivity: some messages go through PCIe and system memory
• This limits performance
• When using 16 GPUs across 2 DGX-1V servers:
• Some messages go across the InfiniBand network
• This is a further bottleneck
39
DGX-2 WITH NVSWITCH
• AlltoAll network architecture with NVSwitch maps perfectly to the problem.
• Full bandwidth between each GPU pair.
40
SPHERICAL HARMONICS: DGX-2 VS DGX-1V
41
SUMMARY AND FUTURE WORK
• Optimizing the exposure of parallelism, memory coalescing and data management can have dramatic effects on performance.
• MPDATA single-GPU performance is now optimal.
• MPDATA multi-GPU: the halo exchange is currently implemented via the CPU-based ATLAS library (also developed by ECMWF), which abstracts the MPI. To be fully optimized, ATLAS needs to be made CUDA-aware, and this is currently being worked on by others.
• SH single-GPU performance is vastly improved, but the FFT part remains sub-optimal.
• An implementation of batching that allows different sizes within each batch is expected to fix this.
• DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.
• These results indicate that multi-GPU systems can be effectively exploited, allowing forecasting agencies to continue to improve weather predictions.