April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND...

36
April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute, April 7th 2016 HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

Transcript of April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND...

Page 1: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

April 4-7, 2016 | Silicon Valley

Jiri Kraus, Senior Devtech Compute, April 7th 2016

HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

Page 2: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

2

OPENACC DIRECTIVES

#pragma acc data copyin(a,b) copyout(c)

{

#pragma acc parallel

{

#pragma acc loop gang vector

for (i = 0; i < n; ++i) {

z[i] = x[i] + y[i];

...

}

}

...

}

4/11/2016

Incremental

Single source

Interoperable

Performance portable

CPU, GPU, MIC

Manage

Data

Movement

Initiate

Parallel

Execution

Optimize

Loop

Mappings

Page 3: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

3

UNIFIED MEMORY

Traditional Developer View Developer View With Unified Memory

Unified Memory System Memory

GPU Memory

Dramatically Lower Developer Effort

Page 4: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

4

UNIFIED MEMORY

Traditional Developer View Developer View With

Unified Memory void foo(FILE *fp, int N) {

float *x, *y, *z;

x = (float *)malloc(N*sizeof(float));

y = (float *)malloc(N*sizeof(float));

z = (float *)malloc(N*sizeof(float));

fread(x, sizeof(float), N, fp);

fread(y, sizeof(float), N, fp);

#pragma acc kernels copy(x[0:N],y[0:N],z[0:N])

for (int i=0; i<N; ++i)

z[i] = x[i] + y[i];

use_data(z);

free(z); free(y); free(x);

}

void foo(FILE *fp, int N) {

float *x, *y, *z;

x = (float *)malloc(N*sizeof(float));

y = (float *)malloc(N*sizeof(float));

z = (float *)malloc(N*sizeof(float));

fread(x, sizeof(float), N, fp);

fread(y, sizeof(float), N, fp);

#pragma acc kernels

for (int i=0; i<N; ++i)

z[i] = x[i] + y[i];

use_data(z);

free(z); free(y); free(x);

}

Page 5: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

5

Identify Available

Parallelism

Express Parallelism

Express Data Movement

Optimize Loop

Performance

Page 6: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

6

OPENACC AND UNIFIED MEMORY

All heap allocations are in managed memory (Unified Memory Heap)

Pointers can be used on GPU and CPU

Enabled with compiler switch –ta=tesla:managed,…

More Info at „OpenACC and CUDA Unified Memory”, by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm

PGI Support for Unified Memory with OpenACC

Page 7: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

7

OPENACC AND UNIFIED MEMORY

Unified Memory can be used in CPU and GPU code

No need for any data clauses

No need to fully understand data flow and allocation logic of application

Simplifies handling of complex data structures

Incremental profiler driven acceleration -> Data movement is just another optimization

Advantages

Page 8: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

8

OPENACC AND UNIFIED MEMORY

Does not apply for stack, static or global data (only heap data)

Limits allocatable memory to available device memory even on the host

Because all heap allocations are placed in device memory even the ones never needed on the GPU. This can (depending on application) significantly limit the maximal problem size.

Data is coherent only at kernel launch and sync points.

Its not allowed to access unified memory in host code while a kernel is running. Doing so may result in a segmentation fault.

Implementations Details on Kepler and Maxwell

Page 9: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

9

LBM D2Q37

D2Q37 model

Application developed at U Rome Tore Vergata/INFN,U Ferrara/INFN, TU Eindhoven

Reproduce dynamics of fluid by simulating virtual particles which collide and propagate

Simulation of large systems requires double precision computation and many GPUs

Lattice Boltzmann Method (LBM)

Page 10: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

10

LBM D2Q37

MPI + OpenMP + vector intrinsics using AoS data layout

MPI + OpenACC using SoA data layout and traditional data staging with data regions and data clauses (this version, starting without OpenACC directives, was used for the following)

MPI + CUDA C using SoA data layout

OpenCL

Paper comparing these variants have been presented at EUROPAR 2015: „Accelerating Lattice Boltzmann Applications with OpenACC“ – E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione

Versions

Page 11: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

11

LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) – 1 MPI rank

Rank Method Time (s)

Initial

1 collide 17.01

2 propagate 10.71

3 other 2.26

4 bc 0.17

Application Reported Solvertime: 27.85 s Profiler: Total Time for Process: 30.15 s

collide

propagate

other

bc

Page 12: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

12

LBM D2Q37 – INITIAL VERSION

Enable OpenACC and Managed Memory

-acc -ta=tesla:managed,…

Enable Accelerator Information

-Minfo=accel

Enable CPU Profiling information

-Mprof=func

Change build environment

Page 13: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

13

LBM D2Q37 – INITIAL VERSION CPU Profile (480x512) using Unified Memory – 1 MPI rank

Rank Method Time (s)

UM

Time (s)

Initial

1 propagate 41.18 10.71

2 collide 16.82 17.01

3 other 6.58 2.26

4 bc 0.17 0.17

Application Reported Solvertime: 62.96 s (Initial: 27.85 s) Profiler: Total Time for Process: 64.75 s (Initial: 30.15 s)

collide

propagate

other

bc

Page 14: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

14

LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

MPI handling periodic

boundary conditions –

causes flush of data to

GPU in every iteration

Page 15: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

15

LBM D2Q37 – INITIAL VERSION NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

Propagate slowed down

due to unified memory

page migrations

Page 16: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

16

LBM D2Q37 – ACCELERATING PROPAGATE

inline void propagate(const data_t* restrict prv, data_t* restrict nxt) {

int ix, iy, site_i;

#pragma acc kernels

#pragma acc loop independent device_type(NVIDIA) gang

for ( ix=HX; ix < (HX+SIZEX); ix++) {

#pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)

for ( iy=HY; iy < (HY+SIZEY); iy++) {

site_i = (ix*NY) + iy;

nxt[ site_i] = prv[ site_i - 3*NY + 1];

nxt[ NX*NY + site_i] = prv[ NX*NY + site_i - 3*NY ];

//...

nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY ];

nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];

}

}

}

Page 17: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

17

LBM D2Q37 – PROPAGATE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank

Rank Method Time (s)

+propagate

Time (s)

UM

Time (s)

Initial

1 bc 34.59 0.17 0.17

2 collide 16.75 16.82 17.01

3 other 6.94 6.58 2.26

4 propagate 2.14 41.18 10.71

Application Reported Solvertime: 57.65 s (UM: 62.96 s) Profiler: Total Time for Process: 60.42 s (UM: 64.75 s)

collide

propagate

other

bc

Propagate

on GPU

Page 18: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

18

LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Page 19: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

19

LBM D2Q37 – PROPAGATE ACCELERATED NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

BC slowed down due to

unified memory page

migrations

Page 20: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

20

LBM D2Q37 – BC ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank

Application Reported Solvertime: 55.74 s (propagate: 57.65 s) Profiler: Total Time for Process: 59.86 s (propagate: 60.42 s)

Propagate

on GPU

collide

propagate

other

bc

Propagate

on GPU

bc on GPU

Rank Method Time (s)

+bc

Time (s)

+propagate

Time (s)

UM

Time (s)

Initial

1 collide 49.99 16.75 16.82 17.01

2 other 7.61 6.94 6.58 2.26

3 propagate 2.15 2.14 41.18 10.71

4 bc 0.11 34.59 0.17 0.17

Page 21: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

21

LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Page 22: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

22

LBM D2Q37 – BC ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

collide slowed down due

to unified memory page

migrations

Page 23: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

23

LBM D2Q37 – COLLIDE ACCELERATED CPU Profile (480x512) using Unified Memory – 1 MPI rank

Rank Method Time (s)

Final

Time (s)

UM+propagate+bc

Time (s)

Initial

0 main 7.69 2.39 1.89

1 collide 0.52 49.99 17.01

2 lbm 0.41 4.72 0.06

3 init 0.19 0.19 0.04

4 printMass 0.15 0.17 0.01

5 propagate 0.13 2.15 10.71

6 bc 0.09 0.11 0.17

7 projection 0.05 0.05 0.06

Application Reported Solvertime: 0.96 s (bc: 55.74 s, Initial: 27.85 s) Profiler: Total Time for Process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)

Page 24: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

24

LBM D2Q37 – COLLIDE ACCELERATED NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Data stays on GPU while

simulation is running

Page 25: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

25

LBM D2Q37 – MULTI GPU

CUDA-aware MPI with support for Unified Memory

E.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0

Start one MPI rank per GPU

Requirements

Page 26: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

26

LBM D2Q37 – MULTI GPU Handling GPU AFFINITY

int rank = 0; int size = 1;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm_size(MPI_COMM_WORLD, &size);

#if _OPENACC

int ngpus=acc_get_num_devices(acc_device_nvidia);

int devicenum=rank%ngpus;

acc_set_device_num(devicenum,acc_device_nvidia);

acc_init(acc_device_nvidia);

#endif /*_OPENACC*/

Page 27: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

27

LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory – 2 MPI ranks

Page 28: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

28

LBM D2Q37 – MULTI GPU NVVP Timeline (480x512) using Unified Memory - Zoom – 2 MPI ranks

MPI

Page 29: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

29

LBM D2Q37 – MULTI GPU Strong Scaling

0

50

100

150

200

250

300

350

400

1 GPUs (1/2 K80) 2 GPUs (1 K80) 4 GPUs (2 K80) 8 GPUs (4 K80)

Runti

me (

s)

1000 Steps - 1440x10240 Grid

Tesla K80 Linear

Page 30: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

30

LBM D2Q37 – MULTI GPU

Possible but need to be careful not to use unified memory pointers in host code while kernels are running asynchronously.

All kernel launches when using –ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1

Set PGI_ACC_SYNCHRONOUS=0 to allow overlap

Overlapping Communication and Computation

Page 31: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

31

LBM D2Q37 – MULTI GPU Overlapping Communication and Computation

Grid size: 1920x2048

Page 32: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

32

LBM D2Q37 – MULTI GPU Overlapping Communication and Computation

Grid size: 1920x2048

Page 33: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

33

CONCLUSIONS

Unified Memory for OpenACC support makes GPU acceleration even more productive

Profiler guided incremental acceleration

No need to insert any data clauses or to change allocation code

Page 34: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

Page 35: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

35

BACKUP

Page 36: April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND ...on-demand.gputechconf.com/gtc/2016/presentation/s... · April 4-7, 2016 | Silicon Valley Jiri Kraus, Senior Devtech Compute,

36

LBM D2Q37

PGI 15.5

CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled)

CUDA 6.5

Intel Xeon E5-2698 v3 @ 2.30GHz

Problem size: 100 Iterations on 480x512

GPU optimized SoA data layout is used so CPU runtime is not representative.

Setup