April 4-7, 2016 | Silicon Valley
Jiri Kraus, Senior Devtech Compute, April 7th 2016
HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY
OPENACC DIRECTIVES
#pragma acc data copyin(x,y) copyout(z)
{
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}
OpenACC is incremental, single source, interoperable, and performance portable (CPU, GPU, MIC).

The three directives in the example above:
- #pragma acc data: manage data movement
- #pragma acc parallel: initiate parallel execution
- #pragma acc loop: optimize loop mappings
UNIFIED MEMORY
[Figure: in the traditional developer view, System Memory and GPU Memory are separate; in the developer view with Unified Memory, there is a single Unified Memory. Dramatically lower developer effort.]
UNIFIED MEMORY
Traditional Developer View:

void foo(FILE *fp, int N) {
  float *x, *y, *z;
  x = (float *)malloc(N*sizeof(float));
  y = (float *)malloc(N*sizeof(float));
  z = (float *)malloc(N*sizeof(float));
  fread(x, sizeof(float), N, fp);
  fread(y, sizeof(float), N, fp);
  #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
  for (int i=0; i<N; ++i)
    z[i] = x[i] + y[i];
  use_data(z);
  free(z); free(y); free(x);
}

Developer View With Unified Memory:

void foo(FILE *fp, int N) {
  float *x, *y, *z;
  x = (float *)malloc(N*sizeof(float));
  y = (float *)malloc(N*sizeof(float));
  z = (float *)malloc(N*sizeof(float));
  fread(x, sizeof(float), N, fp);
  fread(y, sizeof(float), N, fp);
  #pragma acc kernels
  for (int i=0; i<N; ++i)
    z[i] = x[i] + y[i];
  use_data(z);
  free(z); free(y); free(x);
}
The incremental acceleration cycle:
1. Identify available parallelism
2. Express parallelism
3. Express data movement
4. Optimize loop performance
OPENACC AND UNIFIED MEMORY
PGI Support for Unified Memory with OpenACC
- All heap allocations are placed in managed memory (Unified Memory heap)
- Pointers can be used on both GPU and CPU
- Enabled with the compiler switch -ta=tesla:managed,...
- More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm
OPENACC AND UNIFIED MEMORY
Advantages
- Unified Memory can be used in CPU and GPU code
- No need for any data clauses
- No need to fully understand the data flow and allocation logic of the application
- Simplifies handling of complex data structures
- Incremental, profiler-driven acceleration: data movement becomes just another optimization
OPENACC AND UNIFIED MEMORY
Implementation Details on Kepler and Maxwell
- Applies only to heap data, not to stack, static, or global data
- Limits allocatable memory to the available device memory, even on the host: all heap allocations are placed in device memory, including those never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
- Data is coherent only at kernel launches and synchronization points. It is not allowed to access unified memory from host code while a kernel is running; doing so may result in a segmentation fault (see the sketch below).
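A minimal sketch of this coherence rule, assuming compilation with -ta=tesla:managed on a Kepler- or Maxwell-class GPU (function and variable names are illustrative; a must come from a heap allocation so that it lands in managed memory):

void scale_async(float *a, int n)  /* a from malloc -> managed memory */
{
    #pragma acc kernels async(1)
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;

    /* a[0] += 1.0f;  <-- NOT allowed here: the async kernel may still be
       running, and host access to managed memory can segfault on
       Kepler/Maxwell */

    #pragma acc wait(1)   /* synchronization point: data is coherent again */
    a[0] += 1.0f;         /* safe: the kernel has completed */
}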
LBM D2Q37
Lattice Boltzmann Method (LBM)
- D2Q37 model
- Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
- Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate (a generic form of the update is sketched below)
- Simulation of large systems requires double-precision computation and many GPUs
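As background, a generic lattice Boltzmann update with a BGK collision operator is sketched below; the D2Q37 model uses 37 populations per site and its own collision terms, detailed in the paper cited on the next slide:

f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau} \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right], \qquad i = 1, \ldots, 37

where f_i are the particle populations, \mathbf{c}_i the lattice velocities, and \tau the relaxation time. The left-hand side corresponds to the propagate step and the right-hand side to the collide step profiled below.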
LBM D2Q37
Versions
- MPI + OpenMP + vector intrinsics using an AoS data layout
- MPI + OpenACC using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without OpenACC directives, was used for the following)
- MPI + CUDA C using an SoA data layout
- OpenCL
A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC" by E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione.
LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) – 1 MPI rank
Rank  Method     Time (s), Initial
1     collide    17.01
2     propagate  10.71
3     other       2.26
4     bc          0.17

Application-reported solver time: 27.85 s. Profiler, total time for process: 30.15 s.
[Pie chart: share of runtime for collide, propagate, other, bc]
LBM D2Q37 – INITIAL VERSION
Change build environment (a sample compile line follows):
- Enable OpenACC and managed memory: -acc -ta=tesla:managed,...
- Enable accelerator information: -Minfo=accel
- Enable CPU profiling information: -Mprof=func
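For example, a hypothetical compile line combining these flags with the PGI C compiler (source and binary names are illustrative):

pgcc -acc -ta=tesla:managed -Minfo=accel -Mprof=func -o lbm lbm.c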
LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), UM   Time (s), Initial
1     propagate  41.18          10.71
2     collide    16.82          17.01
3     other       6.58           2.26
4     bc          0.17           0.17

Application-reported solver time: 62.96 s (initial: 27.85 s). Profiler, total time for process: 64.75 s (initial: 30.15 s).
[Pie chart: share of runtime for collide, propagate, other, bc]
LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.]
LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank
[NVVP timeline annotation: propagate is slowed down by unified memory page migrations.]
LBM D2Q37 – ACCELERATING PROPAGATE
inline void propagate(const data_t* restrict prv, data_t* restrict nxt) {
  int ix, iy, site_i;
  /* outer loop over x maps to gangs, inner loop over y to vectors */
  #pragma acc kernels
  #pragma acc loop independent device_type(NVIDIA) gang
  for (ix = HX; ix < (HX+SIZEX); ix++) {
    #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
    for (iy = HY; iy < (HY+SIZEY); iy++) {
      site_i = (ix*NY) + iy;
      /* move each of the 37 populations along its lattice velocity */
      nxt[          site_i] = prv[          site_i - 3*NY + 1];
      nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
      //...
      nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY    ];
      nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];
    }
  }
}
LBM D2Q37 – PROPAGATE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), +propagate   Time (s), UM   Time (s), Initial
1     bc         34.59                   0.17           0.17
2     collide    16.75                  16.82          17.01
3     other       6.94                   6.58           2.26
4     propagate   2.14                  41.18          10.71

Application-reported solver time: 57.65 s (UM: 62.96 s). Profiler, total time for process: 60.42 s (UM: 64.75 s).
[Pie chart: share of runtime for collide, propagate, other, bc; propagate now on GPU]
LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank
[NVVP timeline annotation: bc is slowed down by unified memory page migrations.]
LBM D2Q37 – BC ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), +bc   Time (s), +propagate   Time (s), UM   Time (s), Initial
1     collide    49.99           16.75                  16.82          17.01
2     other       7.61            6.94                   6.58           2.26
3     propagate   2.15            2.14                  41.18          10.71
4     bc          0.11           34.59                   0.17           0.17

Application-reported solver time: 55.74 s (+propagate: 57.65 s). Profiler, total time for process: 59.86 s (+propagate: 60.42 s).
[Pie chart: share of runtime for collide, propagate, other, bc; propagate and bc now on GPU]
LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: collide is slowed down by unified memory page migrations.]
LBM D2Q37 – COLLIDE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method      Time (s), Final   Time (s), UM+propagate+bc   Time (s), Initial
0     main        7.69              2.39                        1.89
1     collide     0.52             49.99                       17.01
2     lbm         0.41              4.72                        0.06
3     init        0.19              0.19                        0.04
4     printMass   0.15              0.17                        0.01
5     propagate   0.13              2.15                       10.71
6     bc          0.09              0.11                        0.17
7     projection  0.05              0.05                        0.06

Application-reported solver time: 0.96 s (+bc: 55.74 s; initial: 27.85 s). Profiler, total time for process: 9.33 s (+bc: 59.86 s; initial: 30.15 s).
LBM D2Q37 – COLLIDE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: data stays on the GPU while the simulation is running.]
LBM D2Q37 – MULTI GPU
Requirements
- CUDA-aware MPI with support for Unified Memory, e.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0
- Start one MPI rank per GPU (a sample launch line follows)
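For example, a hypothetical launch line for a node with two GPUs (binary name is illustrative):

mpirun -np 2 ./lbm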
LBM D2Q37 – MULTI GPU Handling GPU AFFINITY
int rank = 0; int size = 1;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
#if _OPENACC
/* map ranks round-robin to the GPUs visible on the node */
int ngpus = acc_get_num_devices(acc_device_nvidia);
int devicenum = rank % ngpus;
acc_set_device_num(devicenum, acc_device_nvidia);
acc_init(acc_device_nvidia);
#endif /*_OPENACC*/
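Note that rank % ngpus assumes ranks are placed densely per node. A hedged variant (not from the original slides) selects the GPU by node-local rank using MPI-3, independent of the global rank ordering; it reuses ngpus from the snippet above:

#if _OPENACC
/* Hypothetical alternative: derive the node-local rank and use it
   for GPU selection */
MPI_Comm local_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &local_comm);
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);
acc_set_device_num(local_rank % ngpus, acc_device_nvidia);
MPI_Comm_free(&local_comm);
#endif /*_OPENACC*/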
LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory – 2 MPI ranks
LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory - Zoom – 2 MPI ranks
[NVVP timeline annotation: callout marks the MPI communication between the two ranks.]
LBM D2Q37 – MULTI GPU Strong Scaling
[Figure: strong-scaling runtime (s), 1000 steps on a 1440x10240 grid, for 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), and 8 GPUs (4 K80); measured Tesla K80 results vs. linear scaling; y-axis 0-400 s.]
LBM D2Q37 – MULTI GPU
Overlapping Communication and Computation
- Possible, but one must be careful not to use unified memory pointers in host code while kernels are running asynchronously.
- All kernel launches with -ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1.
- Set PGI_ACC_SYNCHRONOUS=0 to allow overlap (a sketch follows).
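A minimal sketch of one possible overlap scheme, assuming PGI_ACC_SYNCHRONOUS=0, -ta=tesla:managed, and a CUDA-aware MPI that accepts managed pointers; buffer names and the ring exchange are illustrative, not the application's actual scheme:

#include <mpi.h>

void step(double *bulk, double *halo, int n, int nh, int rank, int size)
{
    /* 1. update the halo synchronously so it is coherent for MPI */
    #pragma acc kernels
    for (int i = 0; i < nh; ++i)
        halo[i] = 0.5 * halo[i];

    /* 2. launch the bulk update asynchronously ... */
    #pragma acc kernels async(1)
    for (int i = 0; i < n; ++i)
        bulk[i] = 2.0 * bulk[i];

    /* 3. ... and overlap it with the halo exchange; the CUDA-aware MPI
       accesses the managed buffer itself, so host code never dereferences
       a unified memory pointer while the kernel runs */
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;
    MPI_Sendrecv_replace(halo, nh, MPI_DOUBLE, next, 0, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma acc wait(1)  /* bulk is coherent on the host again */
}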
LBM D2Q37 – MULTI GPU Overlapping Communication and Computation
Grid size: 1920x2048
LBM D2Q37 – MULTI GPU Overlapping Communication and Computation
Grid size: 1920x2048
CONCLUSIONS
- Unified Memory support for OpenACC makes GPU acceleration even more productive
- Profiler-guided incremental acceleration
- No need to insert any data clauses or to change allocation code
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
BACKUP
LBM D2Q37
Setup
- PGI 15.5
- CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled)
- CUDA 6.5
- Intel Xeon E5-2698 v3 @ 2.30 GHz
- Problem size: 100 iterations on a 480x512 grid
- A GPU-optimized SoA data layout is used, so the CPU runtime is not representative.