April 4-7, 2016 | Silicon Valley
Jiri Kraus, Senior Devtech Compute, April 7th 2016
HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY
OPENACC DIRECTIVES
#pragma acc data copyin(x,y) copyout(z)
{
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}
OpenACC is incremental, single source, interoperable, and performance portable (CPU, GPU, MIC).

The three directives in the example above:
- #pragma acc data: manage data movement
- #pragma acc parallel: initiate parallel execution
- #pragma acc loop: optimize loop mappings
UNIFIED MEMORY
[Figure: in the traditional developer view, System Memory and GPU Memory are separate; in the developer view with Unified Memory, there is a single Unified Memory. Dramatically lower developer effort.]
UNIFIED MEMORY
Traditional Developer View:

void foo(FILE *fp, int N) {
  float *x, *y, *z;
  x = (float *)malloc(N*sizeof(float));
  y = (float *)malloc(N*sizeof(float));
  z = (float *)malloc(N*sizeof(float));
  fread(x, sizeof(float), N, fp);
  fread(y, sizeof(float), N, fp);
  #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
  for (int i=0; i<N; ++i)
    z[i] = x[i] + y[i];
  use_data(z);
  free(z); free(y); free(x);
}

Developer View With Unified Memory:

void foo(FILE *fp, int N) {
  float *x, *y, *z;
  x = (float *)malloc(N*sizeof(float));
  y = (float *)malloc(N*sizeof(float));
  z = (float *)malloc(N*sizeof(float));
  fread(x, sizeof(float), N, fp);
  fread(y, sizeof(float), N, fp);
  #pragma acc kernels
  for (int i=0; i<N; ++i)
    z[i] = x[i] + y[i];
  use_data(z);
  free(z); free(y); free(x);
}
The incremental acceleration cycle:
1. Identify available parallelism
2. Express parallelism
3. Express data movement
4. Optimize loop performance
OPENACC AND UNIFIED MEMORY
PGI Support for Unified Memory with OpenACC
- All heap allocations are placed in managed memory (Unified Memory heap)
- Pointers can be used on both GPU and CPU
- Enabled with the compiler switch -ta=tesla:managed,...
- More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm
OPENACC AND UNIFIED MEMORY
Advantages
- Unified Memory can be used in CPU and GPU code
- No need for any data clauses
- No need to fully understand the data flow and allocation logic of the application
- Simplifies handling of complex data structures
- Incremental, profiler-driven acceleration: data movement becomes just another optimization
OPENACC AND UNIFIED MEMORY
Implementation Details on Kepler and Maxwell
- Applies only to heap data, not to stack, static, or global data
- Limits allocatable memory to the available device memory, even on the host: all heap allocations are placed in device memory, including those never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
- Data is coherent only at kernel launches and synchronization points. It is not allowed to access unified memory from host code while a kernel is running; doing so may result in a segmentation fault (see the sketch below).
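A minimal sketch of this coherence rule, assuming compilation with -ta=tesla:managed on a Kepler- or Maxwell-class GPU (function and variable names are illustrative; a must come from a heap allocation so that it lands in managed memory):

void scale_async(float *a, int n)  /* a from malloc -> managed memory */
{
    #pragma acc kernels async(1)
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;

    /* a[0] += 1.0f;  <-- NOT allowed here: the async kernel may still be
       running, and host access to managed memory can segfault on
       Kepler/Maxwell */

    #pragma acc wait(1)   /* synchronization point: data is coherent again */
    a[0] += 1.0f;         /* safe: the kernel has completed */
}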
LBM D2Q37
Lattice Boltzmann Method (LBM)
- D2Q37 model
- Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
- Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate (a generic form of the update is sketched below)
- Simulation of large systems requires double-precision computation and many GPUs
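As background, a generic lattice Boltzmann update with a BGK collision operator is sketched below; the D2Q37 model uses 37 populations per site and its own collision terms, detailed in the paper cited on the next slide:

f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau} \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right], \qquad i = 1, \ldots, 37

where f_i are the particle populations, \mathbf{c}_i the lattice velocities, and \tau the relaxation time. The left-hand side corresponds to the propagate step and the right-hand side to the collide step profiled below.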
LBM D2Q37
Versions
- MPI + OpenMP + vector intrinsics using an AoS data layout
- MPI + OpenACC using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without OpenACC directives, was used for the following)
- MPI + CUDA C using an SoA data layout
- OpenCL
A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC" by E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione.
LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) – 1 MPI rank
Rank  Method     Time (s), Initial
1     collide    17.01
2     propagate  10.71
3     other       2.26
4     bc          0.17

Application-reported solver time: 27.85 s. Profiler, total time for process: 30.15 s.
[Pie chart: share of runtime for collide, propagate, other, bc]
LBM D2Q37 – INITIAL VERSION
Change build environment (a sample compile line follows):
- Enable OpenACC and managed memory: -acc -ta=tesla:managed,...
- Enable accelerator information: -Minfo=accel
- Enable CPU profiling information: -Mprof=func
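For example, a hypothetical compile line combining these flags with the PGI C compiler (source and binary names are illustrative):

pgcc -acc -ta=tesla:managed -Minfo=accel -Mprof=func -o lbm lbm.c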
LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), UM   Time (s), Initial
1     propagate  41.18          10.71
2     collide    16.82          17.01
3     other       6.58           2.26
4     bc          0.17           0.17

Application-reported solver time: 62.96 s (initial: 27.85 s). Profiler, total time for process: 64.75 s (initial: 30.15 s).
[Pie chart: share of runtime for collide, propagate, other, bc]
LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.]
LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank
[NVVP timeline annotation: propagate is slowed down by unified memory page migrations.]
LBM D2Q37 – ACCELERATING PROPAGATE
inline void propagate(const data_t* restrict prv, data_t* restrict nxt) {
  int ix, iy, site_i;
  /* outer loop over x maps to gangs, inner loop over y to vectors */
  #pragma acc kernels
  #pragma acc loop independent device_type(NVIDIA) gang
  for (ix = HX; ix < (HX+SIZEX); ix++) {
    #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
    for (iy = HY; iy < (HY+SIZEY); iy++) {
      site_i = (ix*NY) + iy;
      /* move each of the 37 populations along its lattice velocity */
      nxt[          site_i] = prv[          site_i - 3*NY + 1];
      nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
      //...
      nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY    ];
      nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];
    }
  }
}
LBM D2Q37 – PROPAGATE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), +propagate   Time (s), UM   Time (s), Initial
1     bc         34.59                   0.17           0.17
2     collide    16.75                  16.82          17.01
3     other       6.94                   6.58           2.26
4     propagate   2.14                  41.18          10.71

Application-reported solver time: 57.65 s (UM: 62.96 s). Profiler, total time for process: 60.42 s (UM: 64.75 s).
[Pie chart: share of runtime for collide, propagate, other, bc; propagate now on GPU]
LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank
[NVVP timeline annotation: bc is slowed down by unified memory page migrations.]
LBM D2Q37 – BC ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method     Time (s), +bc   Time (s), +propagate   Time (s), UM   Time (s), Initial
1     collide    49.99           16.75                  16.82          17.01
2     other       7.61            6.94                   6.58           2.26
3     propagate   2.15            2.14                  41.18          10.71
4     bc          0.11           34.59                   0.17           0.17

Application-reported solver time: 55.74 s (+propagate: 57.65 s). Profiler, total time for process: 59.86 s (+propagate: 60.42 s).
[Pie chart: share of runtime for collide, propagate, other, bc; propagate and bc now on GPU]
LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: collide is slowed down by unified memory page migrations.]
LBM D2Q37 – COLLIDE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method      Time (s), Final   Time (s), UM+propagate+bc   Time (s), Initial
0     main        7.69              2.39                        1.89
1     collide     0.52             49.99                       17.01
2     lbm         0.41              4.72                        0.06
3     init        0.19              0.19                        0.04
4     printMass   0.15              0.17                        0.01
5     propagate   0.13              2.15                       10.71
6     bc          0.09              0.11                        0.17
7     projection  0.05              0.05                        0.06

Application-reported solver time: 0.96 s (+bc: 55.74 s; initial: 27.85 s). Profiler, total time for process: 9.33 s (+bc: 59.86 s; initial: 30.15 s).
LBM D2Q37 – COLLIDE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
[NVVP timeline annotation: data stays on the GPU while the simulation is running.]
LBM D2Q37 – MULTI GPU
Requirements
- CUDA-aware MPI with support for Unified Memory, e.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0
- Start one MPI rank per GPU (a sample launch line follows)
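For example, a hypothetical launch line for a node with two GPUs (binary name is illustrative):

mpirun -np 2 ./lbm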
LBM D2Q37 – MULTI GPU Handling GPU AFFINITY
int rank = 0; int size = 1;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
#if _OPENACC
/* map ranks round-robin to the GPUs visible on the node */
int ngpus = acc_get_num_devices(acc_device_nvidia);
int devicenum = rank % ngpus;
acc_set_device_num(devicenum, acc_device_nvidia);
acc_init(acc_device_nvidia);
#endif /*_OPENACC*/
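Note that rank % ngpus assumes ranks are placed densely per node. A hedged variant (not from the original slides) selects the GPU by node-local rank using MPI-3, independent of the global rank ordering; it reuses ngpus from the snippet above:

#if _OPENACC
/* Hypothetical alternative: derive the node-local rank and use it
   for GPU selection */
MPI_Comm local_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &local_comm);
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);
acc_set_device_num(local_rank % ngpus, acc_device_nvidia);
MPI_Comm_free(&local_comm);
#endif /*_OPENACC*/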
LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory – 2 MPI ranks
LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory - Zoom – 2 MPI ranks
[NVVP timeline annotation: callout marks the MPI communication between the two ranks.]
LBM D2Q37 – MULTI GPU Strong Scaling
[Figure: strong-scaling runtime (s), 1000 steps on a 1440x10240 grid, for 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), and 8 GPUs (4 K80); measured Tesla K80 results vs. linear scaling; y-axis 0-400 s.]
LBM D2Q37 – MULTI GPU
Overlapping Communication and Computation
- Possible, but one must be careful not to use unified memory pointers in host code while kernels are running asynchronously.
- All kernel launches with -ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1.
- Set PGI_ACC_SYNCHRONOUS=0 to allow overlap (a sketch follows).
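A minimal sketch of one possible overlap scheme, assuming PGI_ACC_SYNCHRONOUS=0, -ta=tesla:managed, and a CUDA-aware MPI that accepts managed pointers; buffer names and the ring exchange are illustrative, not the application's actual scheme:

#include <mpi.h>

void step(double *bulk, double *halo, int n, int nh, int rank, int size)
{
    /* 1. update the halo synchronously so it is coherent for MPI */
    #pragma acc kernels
    for (int i = 0; i < nh; ++i)
        halo[i] = 0.5 * halo[i];

    /* 2. launch the bulk update asynchronously ... */
    #pragma acc kernels async(1)
    for (int i = 0; i < n; ++i)
        bulk[i] = 2.0 * bulk[i];

    /* 3. ... and overlap it with the halo exchange; the CUDA-aware MPI
       accesses the managed buffer itself, so host code never dereferences
       a unified memory pointer while the kernel runs */
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;
    MPI_Sendrecv_replace(halo, nh, MPI_DOUBLE, next, 0, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma acc wait(1)  /* bulk is coherent on the host again */
}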
LBM D2Q37 – MULTI GPU Overlapping Communication and Computation
Grid size: 1920x2048
LBM D2Q37 – MULTI GPU Overlapping Communication and Computation
Grid size: 1920x2048
CONCLUSIONS
- Unified Memory support for OpenACC makes GPU acceleration even more productive
- Profiler-guided incremental acceleration
- No need to insert any data clauses or to change allocation code
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
BACKUP
LBM D2Q37
Setup
- PGI 15.5
- CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled)
- CUDA 6.5
- Intel Xeon E5-2698 v3 @ 2.30 GHz
- Problem size: 100 iterations on a 480x512 grid
- A GPU-optimized SoA data layout is used, so the CPU runtime is not representative.