Page 1:

MULTI GPU PROGRAMMING (WITH MPI)

Massimo Bernaschi National Research Council of Italy

[email protected]

Page 2:

SOME OF THE SLIDES COME FROM A TUTORIAL PRESENTED DURING THE 2014 GTC CONFERENCE BY Jiri Kraus and Peter Messmer (NVIDIA)

Download the sample codes from: http://twin.iac.rm.cnr.it/cccsample.tgz
To unpack the codes: tar zxvf cccsample.tgz
Get http://twin.iac.rm.cnr.it/CUDA_C_QuickRef.pdf

Page 3:

WHAT YOU WILL LEARN
§ How to use more than one GPU for the same task
§ How to use MPI for inter-GPU communication with CUDA
§ What CUDA-aware MPI is
§ What the Multi Process Service is and how to use it
§ How to use NVIDIA tools in an MPI environment
§ How to hide MPI communication times

Page 4:

MULTI-GPU MEMORY
• GPUs do not share global memory
• But starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
• Inter-GPU communication
  — Application code is responsible for copying/moving data between GPUs
  — Data travel across the PCIe bus, even when the GPUs are connected to the same PCIe switch!

Page 5:

SHARING DATA BETWEEN GPUS
§ Options
  — Explicit copies via host
  — Zero-copy shared host array
  — Per-device arrays with peer-to-peer exchange transfers
  — Peer-to-peer memory access

Page 6:

MULTI-GPU ENVIRONMENT
• GPUs have consecutive integer IDs, starting with 0
• Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
• cudaSetDevice() changes the "active" GPU (and so the context)
• Device 0 is chosen when cudaSetDevice() is not called
• GPUs are ordered by decreasing performance
• Remember that multiple host threads can establish contexts with the same GPU
  — The driver handles time-sharing and resource partitioning, unless the GPU is in exclusive mode

Page 7:

MULTI-GPU ENVIRONMENT: A SIMPLE APPROACH

// Run an independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}

Exercise: compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run. Attention! Add the -arch=sm_20 option:
nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_20
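A minimal sketch (not from the sample codes) that extends the loop above with per-device allocation and a final synchronization; the names d_data, N, blocks, threads and kernel are illustrative:

    // Sketch: one buffer and one asynchronous kernel launch per device
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    float *d_data[16];                          /* one device pointer per GPU (assumes <= 16 GPUs) */
    for (int d = 0; d < numDevs; d++) {
        cudaSetDevice(d);                       /* make GPU d the current device */
        cudaMalloc(&d_data[d], N * sizeof(float));
        kernel<<<blocks, threads>>>(d_data[d]); /* launch is asynchronous, the loop does not wait */
    }
    for (int d = 0; d < numDevs; d++) {         /* wait for every GPU to finish */
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }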

Page 8:

CUDA FEATURES USEFUL FOR MULTI-GPU

• Control multiple GPUs with a single CPU thread
  — Simpler coding: no need for CPU multithreading
• Streams:
  — Enable executing kernels and memcopies concurrently
  — Up to 2 concurrent memcopies: to/from the GPU
• Peer-to-Peer (P2P) GPU memory copies
  — Transfer data between GPUs using PCIe P2P support
  — Done by the GPU DMA hardware – the host CPU is not involved
    • Data traverses PCIe links without touching CPU memory
  — Disjoint GPU pairs can communicate simultaneously

Page 9:

PEER-TO-PEER GPU COMMUNICATION
• Up to CUDA 4.0, GPUs could not exchange data directly.
• With CUDA >= 4.0, two GPUs connected to the same PCI Express switch can exchange data directly.

Example: read p2pcopy.cu and compile it with
nvcc -o p2pcopy p2pcopy.cu -arch=sm_30

cudaStreamCreate(&stream_on_gpu_0);
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);
cudaMalloc(&d_0, num_bytes);  /* create array on GPU 0 */
cudaSetDevice(1);
cudaMalloc(&d_1, num_bytes);  /* create array on GPU 1 */
/* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */
cudaMemcpyPeerAsync(d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0);
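A small hedged addition: before enabling peer access it is worth querying whether the pair of devices actually supports it, and falling back to host staging otherwise.

    int can_access = 0;                      /* sketch; error checking omitted */
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    /* second argument (flags) must be 0 */
    } else {
        /* fall back to staging the copy through host memory */
    }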

Page 10:

INTER-GPU COMMUNICATION WITH MPI
• Example: simpleMPI.c
  — Generate some random numbers on one node.
  — Dispatch them to all nodes.
  — Compute their square root on each node's GPU.
  — Compute the average of the results by using MPI.

To compile:
$ nvcc -c simpleCUDAMPI.cu
$ mpic++ -o simpleMPI simpleMPI.c simpleCUDAMPI.o -L/usr/local/cuda/lib64/ -lcudart   (on a single line!)

To run:
$ mpirun -np 2 simpleMPI
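For orientation, here is a hedged sketch of what the host side of such a program can look like; the kernel wrapper computeGPU() and the problem size are assumptions, not the actual simpleMPI.c code (includes omitted):

    /* Sketch of a simpleMPI-style host flow (names are illustrative) */
    int rank, size, n_per_rank = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *all = NULL;
    float *part = (float *)malloc(n_per_rank * sizeof(float));
    if (rank == 0) {                           /* generate random numbers on one rank */
        all = (float *)malloc((size_t)size * n_per_rank * sizeof(float));
        for (int i = 0; i < size * n_per_rank; i++)
            all[i] = rand() / (float)RAND_MAX;
    }
    /* dispatch them to all ranks */
    MPI_Scatter(all, n_per_rank, MPI_FLOAT, part, n_per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD);
    computeGPU(part, n_per_rank);              /* assumed wrapper: sqrt of each element on the GPU */

    float local_sum = 0.f, global_sum = 0.f;   /* average of the results via MPI */
    for (int i = 0; i < n_per_rank; i++) local_sum += part[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("average = %f\n", global_sum / ((float)size * n_per_rank));
    MPI_Finalize();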

Page 11:

MPI+CUDA

[Diagram, repeated on page 12: nodes 0 … n-1, each with a GPU and its GDDR5 memory attached via PCI-e to a CPU with system memory and a network card; the nodes communicate over the network.]

Page 13:

MPI+CUDA

//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

Page 14:

MESSAGE PASSING INTERFACE - MPI
§ Standard to exchange data between processes via messages
  — Defines an API to exchange messages
    § Point-to-point: e.g. MPI_Send, MPI_Recv
    § Collectives: e.g. MPI_Reduce
§ Multiple implementations (open source and commercial)
  — Bindings for C/C++, Fortran, Python, …
  — E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …

Page 15:

MPI – A MINIMAL PROGRAM

#include <mpi.h>

int main(int argc, char *argv[]) {

int rank,size;

/* Initialize the MPI library */

MPI_Init(&argc,&argv);

/* Determine the calling process rank and total number of ranks */

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

MPI_Comm_size(MPI_COMM_WORLD,&size);

/* Call MPI routines like MPI_Send, MPI_Recv, ... */

...

/* Shutdown MPI library */

MPI_Finalize();

return 0;

}

Page 16:

MPI – COMPILING AND LAUNCHING

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

[Diagram: mpirun starts four instances of myapp.]

Page 17:

A SIMPLE EXAMPLE

Page 18:

Solving the Laplace equation with the Jacobi method

The Laplace equation in 2D (elliptic PDE).
The Jacobi method: each point is updated as the average of its nearest neighbours.

[Figure: 5-point stencil – T(i,j) and its neighbours T(i-1,j), T(i+1,j), T(i,j-1), T(i,j+1).]

Page 19:

EXAMPLE: JACOBI SOLVER – SINGLE GPU

While not converged:
§ Do Jacobi step:
  for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
      Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j] + T[i][j-1] + T[i][j+1]);
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration
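As a hedged sketch of the corresponding CUDA kernel (one thread per interior point; the row-major layout and the names T_d and Tnew_d are assumptions, not the exercise code):

    __global__ void jacobi_step(const float *T_d, float *Tnew_d, int n, int m)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row index    */
        int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
        if (i > 0 && i < n - 1 && j > 0 && j < m - 1) {
            /* average of the four nearest neighbours */
            Tnew_d[i * m + j] = 0.25f * (T_d[(i - 1) * m + j] + T_d[(i + 1) * m + j]
                                       + T_d[i * m + (j - 1)] + T_d[i * m + (j + 1)]);
        }
    }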

Page 20:

Exercise: LAPLACE CUDA (serial)

• wget http://twin.iac.rm.cnr.it/eserlaplace.tgz
• tar zxvf eserlaplace.tgz
• cd LAPLACE_CNAF/SERIAL
• C version in the laplace_serial.c, laplace_serial_cuda.c and cuda_tools.c files
• CUDA version activated by using -DCUDA at compile time (otherwise code within #ifdef CUDA is not compiled!)
• Use the Makefile:
  make c_serial
  make c_cuda

Exercise: complete the CUDA parts indicated by EXERCISE in laplace_serial_cuda.c

Page 21:

EXAMPLE: JACOBI SOLVER – MULTI GPU

While not converged:
§ Do Jacobi step:
  for (int i=1; i < n-1; i++)
    for (int j=1; j < m-1; j++)
      Tnew[i][j] = 0.25f*(T[i-1][j] + T[i+1][j] + T[i][j-1] + T[i][j+1]);
§ Exchange halo with neighbors
§ Copy Tnew to T, or swap Tnew and T
§ Next iteration

Page 22:

EXAMPLE: JACOBI SOLVER
§ Solves the 2D Laplace equation on a rectangle
  Δu(x,y) = 0   ∀ (x,y) ∈ Ω \ δΩ
  — Dirichlet boundary conditions (constant values on the boundaries)
  u(x,y) = f(x,y)   ∀ (x,y) ∈ δΩ
§ 2D domain decomposition with n x k domains

Rank (0,0)    Rank (0,1)    …   Rank (0,n-1)
…
Rank (k-1,0)  Rank (k-1,1)  …   Rank (k-1,n-1)

Page 23:

EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE (plain C on the CPU)

MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Page 24:

EXAMPLE: JACOBI TOP/BOTTOM HALO UPDATE (CUDA, device pointers)

MPI_Sendrecv(Tnew_d+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew_d+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv(Tnew_d+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew_d+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Depending on the CUDA (and MPI) version, an MPI_Sendrecv on these device pointers may work or fail. We may need to copy data from the GPU to the CPU to exchange them with other MPI tasks.

Page 25:

EXAMPLE: JACOBI – TOP/BOTTOM HALO UPDATE – WITHOUT CUDA-AWARE MPI

//send to bottom and receive from top – the other direction is omitted
cudaMemcpy(Tnew+1, Tnew_d+1, (m-2)*sizeof(double), cudaMemcpyDeviceToHost);

MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

cudaMemcpy(Tnew_d, Tnew, (m-2)*sizeof(double), cudaMemcpyHostToDevice);

Page 26:

EXAMPLE: JACOBI – LEFT/RIGHT HALO UPDATE (CUDA)

//right neighbor omitted
cuda_copy_to_buffer<<<gs,bs,0,s>>>(to_left_d, u_new_d, n, m);
cudaStreamSynchronize(s);

MPI_Sendrecv(to_left_d, n-2, MPI_DOUBLE, l_nb, 0,
             from_left_d, n-2, MPI_DOUBLE, l_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

cuda_copy_from_buffer<<<gs,bs,0,s>>>(u_new_d, from_left_d, n, m);
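The pack/unpack kernels are left as an exercise in the sample code; the following is only a hedged sketch of what they might look like, assuming a row-major n x m layout with the halo in rows/columns 0 and n-1/m-1:

    /* Gather the first interior column of u_new_d into a contiguous send buffer */
    __global__ void cuda_copy_to_buffer(double *to_left_d, const double *u_new_d, int n, int m)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;   /* skip halo row 0 */
        if (i < n - 1)
            to_left_d[i - 1] = u_new_d[i * m + 1];           /* first interior column */
    }

    /* Scatter the received halo values into column 0 */
    __global__ void cuda_copy_from_buffer(double *u_new_d, const double *from_left_d, int n, int m)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
        if (i < n - 1)
            u_new_d[i * m] = from_left_d[i - 1];
    }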

Page 27:

Jacobi method with MPI Domain Decomposition

T = T0
until (deltaT_max < tol):
  1. Update the halo regions (MPI_Sendrecv)
  2. Tnew(i,j) = 0.25*(T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1))
  3. deltaT(i,j) = abs(Tnew(i,j) - T(i,j))
  4. deltaT_max_loc = max(deltaT)
  5. compute deltaT_max (MPI_Allreduce)
  6. T = Tnew

• The exercise/sample makes use of the MPI Cartesian decomposition.
• Each MPI process is in charge of a subdomain and, before executing the update, needs to receive data on the boundaries of its subdomain.
• For the evaluation of the global error it is also necessary to carry out a distributed reduction operation (MPI_Allreduce).
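A hedged sketch of the two MPI ingredients mentioned above (a 2D Cartesian communicator and the distributed maximum of the residual); variable names are illustrative, not the sample code:

    /* 2D Cartesian communicator for the k x n process grid */
    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cart_comm;
    MPI_Dims_create(size, 2, dims);                 /* pick a balanced k x n grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
    MPI_Cart_coords(cart_comm, rank, 2, coords);

    int t_nb, b_nb, l_nb, r_nb;                     /* neighbor ranks for the halo exchange */
    MPI_Cart_shift(cart_comm, 0, 1, &t_nb, &b_nb);
    MPI_Cart_shift(cart_comm, 1, 1, &l_nb, &r_nb);

    /* global error: maximum of the local residuals over all ranks */
    double deltaT_max_loc = /* max |Tnew - T| on this subdomain */ 0.0, deltaT_max;
    MPI_Allreduce(&deltaT_max_loc, &deltaT_max, 1, MPI_DOUBLE, MPI_MAX, cart_comm);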

Page 28:

Exercise: LAPLACE CUDA MPI

• wget http://twin.iac.rm.cnr.it/eserlaplace.tgz
• tar zxvf eserlaplace.tgz
• cd LAPLACE/MPI
• C version in the laplace_cuda_mpi.c and cuda_tools.c files
• CUDA version activated by using -DCUDA at compile time (otherwise code within #ifdef CUDA is not compiled!)
• Use the Makefile:
  make c_mpi
  make c_mpi_cuda

Exercise: complete the CUDA parts indicated by EXERCISE. The copies of the MPI buffers correspond to kernels that must be implemented.

Page 29:

THE DETAILS

Page 30:

UNIFIED VIRTUAL ADDRESSING

[Diagram: without UVA, CPU system memory and GPU memory are two separate address spaces (each 0x0000–0xFFFF) connected by PCI-e; with UVA, CPU and GPU memory live in a single address space.]

Page 31:

UNIFIED VIRTUAL ADDRESSING

§ One address space for all CPU and GPU memory
  — The physical memory location can be determined from the pointer value
  — Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
§ Supported on devices with compute capability 2.0 and higher, for
  — 64-bit applications on Linux, and on Windows in TCC mode

[Diagram: the staging path GPU buffer → cudaMemcpy → host buffer → memcpy → pinned fabric buffer.]

Page 32:

MPI+CUDA

With UVA and CUDA-aware MPI:
//MPI rank 0
MPI_Send(s_buf_d, size, …);
//MPI rank n-1
MPI_Recv(r_buf_d, size, …);

Without UVA, with regular MPI:
//MPI rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, …);
MPI_Send(s_buf_h, size, …);
//MPI rank n-1
MPI_Recv(r_buf_h, size, …);
cudaMemcpy(r_buf_d, r_buf_h, size, …);

Page 33:

NVIDIA GPUDIRECT™
ACCELERATED COMMUNICATION WITH NETWORK & STORAGE DEVICES

[Diagram, shown as two animation frames on pages 33–34: GPU1 and GPU2 with their memories, the CPU, chipset, system memory and an InfiniBand adapter, illustrating the data path between GPU memory and the network/storage device through system memory.]

Page 35:

NVIDIA GPUDIRECT™
PEER TO PEER TRANSFERS

[Diagram, shown as two animation frames on pages 35–36: GPU1 and GPU2 on the same PCI-e fabric; data moves directly between the two GPU memories without passing through system memory.]

Page 37:

NVIDIA GPUDIRECT™
SUPPORT FOR RDMA

[Diagram, shown as two animation frames on pages 37–38: the InfiniBand adapter reads from and writes to GPU memory directly over PCI-e, without staging through system memory.]

Page 39:

CUDA-AWARE MPI

Example: MPI rank 0 does an MPI_Send from a GPU buffer, MPI rank 1 an MPI_Recv to a GPU buffer.
§ Shows how CUDA+MPI works in principle
  — Depending on the MPI implementation, message size, system setup, …, the situation might be different
§ Two GPUs in two nodes

Page 40:

CUDA-AWARE MPI

[Diagram: the message path from the GPU buffer through a pinned CUDA buffer (PCI-E DMA), a host buffer / pinned fabric buffer (memcpy), and RDMA to the network.]

Page 41:

MPI GPU TO REMOTE GPU
GPUDIRECT SUPPORT FOR RDMA

//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

[Diagram: GPU and host timelines for MPI rank 0 and MPI rank 1.]

Page 42:

MPI GPU TO REMOTE GPU
GPUDIRECT SUPPORT FOR RDMA

[Timeline: the whole exchange appears as a single MPI_Sendrecv.]

Page 43:

REGULAR MPI GPU TO REMOTE GPU

//MPI rank 0
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);

[Diagram: GPU and host timelines for MPI rank 0 and MPI rank 1.]

Page 44:

REGULAR MPI GPU TO REMOTE GPU

[Timeline: memcpy D->H, then MPI_Sendrecv, then memcpy H->D.]

Page 45:

MPI GPU TO REMOTE GPU WITHOUT GPUDIRECT

//MPI rank 0
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

[Diagram: GPU and host timelines for MPI rank 0 and MPI rank 1.]

Page 46:

MPI GPU TO REMOTE GPU WITHOUT GPUDIRECT

[Timeline: the exchange appears as a single MPI_Sendrecv.]

More details on pipelines: S4158 – CUDA Streams: Best Practices and Common Pitfalls (Tuesday 03/27, 210A)

Page 47:

PERFORMANCE RESULTS, TWO NODES

[Figure: bandwidth (MB/s) vs. message size (1 byte – 4 MB) for CUDA-aware MPI with GPUDirect RDMA, CUDA-aware MPI, and regular MPI; OpenMPI 1.7.4, Mellanox FDR IB (4X), Tesla K40.]

Latency (1 byte): regular MPI 19.04 us, CUDA-aware MPI 16.91 us, CUDA-aware MPI with GPUDirect RDMA 5.52 us.

Page 48:

PERFORMANCE RESULTS, TWO NODES

[Figure: latency (us, log scale) vs. message size (1 byte – 4 MB) for CUDA-aware MPI with GPUDirect RDMA, CUDA-aware MPI, and regular MPI; OpenMPI 1.7.4, Mellanox FDR IB (4X), Tesla K40.]

Page 49:

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

Page 50:

GPU ACCELERATION OF LEGACY MPI APPLICATION

§ Typical legacy application
  — MPI parallel
  — Single or few threads per MPI rank (e.g. OpenMP)
§ Running with multiple MPI ranks per node
§ GPU acceleration in phases
  — Proof of concept prototype, …
  — Great speedup at kernel level
§ Application performance misses expectations

Page 51:

[Figure, built up over pages 51–56: runtime bars split into a serial part, a CPU-parallel part and a GPU-parallelizable part for N = 1, 2, 4, 8 ranks, comparing "Multicore CPU only", "GPU accelerated CPU", and "With Hyper-Q/MPS (available on K20, K40)".]

Page 57:

PROCESSES SHARING GPU WITHOUT MPS: NO OVERLAP

[Diagram: Process A and Process B each hold their own context (Context A, Context B) on the GPU; their work does not overlap.]

Page 58:

GPU SHARING WITHOUT MPS

[Timeline: time-sliced use of the GPU, with a context switch between the processes.]

Page 59:

PROCESSES SHARING GPU WITH MPS: MAXIMUM OVERLAP

[Diagram: Process A and Process B submit work through the MPS process; kernels from Process A and kernels from Process B run on the GPU concurrently.]

Page 60:

GPU SHARING WITH MPS

Page 61:

CASE STUDY: HYPER-Q/MPS FOR UMT

Enables overlap between copy and compute of different processes.
Sharing the GPU between multiple MPI ranks increases GPU utilization.

Page 62:

USING MPS

- No application modifications necessary
- Not limited to MPI applications
- The MPS control daemon spawns the MPS server upon CUDA application startup
- Typical setup:
  export CUDA_VISIBLE_DEVICES=0
  nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  nvidia-cuda-mps-control -d
- On Cray XK/XC systems:
  export CRAY_CUDA_MPS=1

Page 63:

USING MPS ON MULTI-GPU SYSTEMS

- The MPS server only supports a single GPU
- Use one MPS server per GPU
- Target a specific GPU by setting CUDA_VISIBLE_DEVICES
- Adjust the pipe/log directories:

  export DEVICE=0
  export CUDA_VISIBLE_DEVICES=${DEVICE}
  export CUDA_MPS_PIPE_DIRECTORY=${HOME}/mps${DEVICE}/pipe
  export CUDA_MPS_LOG_DIRECTORY=${HOME}/mps${DEVICE}/log
  nvidia-cuda-mps-control -d

  export DEVICE=1
  …

- More at http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html

Page 64:

MPS SUMMARY
§ Easy path to get GPU acceleration for legacy applications
§ Enables overlapping of memory copies and compute between different MPI ranks

Page 65:

DEBUGGING AND PROFILING

Page 66:

TOOLS FOR MPI+CUDA APPLICATIONS
§ Memory checking: cuda-memcheck
§ Debugging: cuda-gdb
§ Profiling: nvprof and the NVIDIA Visual Profiler

Page 67:

MEMORY CHECKING WITH CUDA-MEMCHECK

§ cuda-memcheck is a functional correctness checking suite, similar to the valgrind memcheck tool
§ Can be used in an MPI environment:
  mpirun -np 2 cuda-memcheck ./myapp <args>
§ Problem: the output of different processes is interleaved
  — Use the save / log-file command line options and a launcher script:

#!/bin/bash
LOG=$1.$OMPI_COMM_WORLD_RANK
#LOG=$1.$MV2_COMM_WORLD_RANK
cuda-memcheck --save $LOG.memcheck $*

mpirun -np 2 cuda-memcheck-script.sh ./myapp <args>

Page 68:

MEMORY CHECKING WITH CUDA-MEMCHECK

Page 69:

MEMORY CHECKING WITH CUDA-MEMCHECK
Read the output files with cuda-memcheck --read

Page 70:

DEBUGGING MPI+CUDA APPLICATIONS: USING CUDA-GDB WITH MPI APPLICATIONS

§ You can use cuda-gdb just like gdb, with the same tricks
§ For smaller applications, just launch xterms and cuda-gdb:
  > mpirun -np 4 xterm -e cuda-gdb --cuda-use-lockfile=0 ./myapp

Page 71:

DEBUGGING MPI+CUDA APPLICATIONS: CUDA-GDB ATTACH

§ CUDA 5.0 and later have the ability to attach to a running process:

if ( rank == 0 ) {

int i=0;

printf("rank %d: pid %d on %s ready for attach\n.", rank, getpid(),name);

while (0 == i) {

sleep(5);

}

}

> mpiexec -np 2 ./jacobi_mpi+cuda

Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one Tesla M2070 for each process (2049 rows per process).

rank 0: pid 30034 on judge107 ready for attach

> ssh judge107

jkraus@judge107:~> cuda-gdb --pid 30034

Page 72:

DEBUGGING MPI+CUDA APPLICATIONS: CUDA_DEVICE_WAITS_ON_EXCEPTION

Page 73:

DEBUGGING MPI+CUDA APPLICATIONS: THIRD PARTY TOOLS
§ Allinea DDT debugger
§ TotalView

Page 74:

PROFILING MPI+CUDA APPLICATIONS: USING NVPROF+NVVP

3 usage modes:
1. Embed the pid in the output filename:
   mpirun -np 2 nvprof --output-profile profile.out.%p
2. Only save the textual output:
   mpirun -np 2 nvprof --log-file profile.out.%p
3. Collect profile data on all processes that run on a node:
   nvprof --profile-all-processes -o profile.out.%p

EXERCISE: profile your laplace_mpi_cuda code using the second mode.

Page 75:

PROFILING MPI+CUDA APPLICATIONS USING NVPROF+NVVP

Page 76:

PROFILING MPI+CUDA APPLICATIONS USING NVPROF+NVVP

Page 77:

PROFILING MPI+CUDA APPLICATIONS USING NVPROF+NVVP

Page 78:

PROFILING MPI+CUDA APPLICATIONS: THIRD PARTY TOOLS
§ Multiple parallel profiling tools are CUDA-aware
  — Score-P
  — Vampir
  — TAU
§ These tools are good for discovering MPI issues as well as basic CUDA performance inhibitors

Page 79:

ADVANCED MPI ON GPUS

Page 80:

BEST PRACTICE: USE NON-BLOCKING MPI

NON-BLOCKING:
MPI_Request t_b_req[4];
MPI_Irecv(Tnew+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 0, MPI_COMM_WORLD, t_b_req);
MPI_Irecv(Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 1, MPI_COMM_WORLD, t_b_req+1);
MPI_Isend(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 0, MPI_COMM_WORLD, t_b_req+2);
MPI_Isend(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 1, MPI_COMM_WORLD, t_b_req+3);
MPI_Waitall(4, t_b_req, MPI_STATUSES_IGNORE);

BLOCKING:
MPI_Sendrecv(Tnew+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             Tnew+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(Tnew+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             Tnew+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Non-blocking calls give MPI more opportunities to build efficient pipelines.

Page 81:

OPTIMIZED CODE WITH "OLD STYLE" MPI

• Pipelining at user level with non-blocking MPI and CUDA functions
• Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_buf + j*block_sz, s_device + j*block_sz, …);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_buf + j*block_sz, block_sz, MPI_CHAR, 1, 1, …);
}
MPI_Waitall(…);
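The matching receiver can pipeline in the same spirit; this is only a hedged sketch (the buffer names r_buf, r_device and the stream are illustrative):

    MPI_Request req[MAX_PIPELINE_LEN];           /* assumed upper bound */
    for (j = 0; j < pipeline_len; j++)           /* post all receives up front */
        MPI_Irecv(r_buf + j*block_sz, block_sz, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req[j]);
    for (j = 0; j < pipeline_len; j++) {
        MPI_Wait(&req[j], MPI_STATUS_IGNORE);    /* as each block arrives ...    */
        cudaMemcpyAsync(r_device + j*block_sz, r_buf + j*block_sz, block_sz,
                        cudaMemcpyHostToDevice, stream);   /* ... start its H2D copy */
    }
    cudaStreamSynchronize(stream);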

Page 82:

GENERAL MULTI-GPU PROGRAMMING PATTERN

• The goal is to hide the communication cost
  — Overlap it with computation
• So, every time step, each GPU should:
  — Compute the parts (halos) to be sent to the neighbors
  — Compute the internal region (bulk), overlapped with the halo exchange
  — Exchange the halos with the neighbors
• Linear scaling as long as the internal computation takes longer than the halo exchange
  — Actually, the separate halo computation adds some overhead

Page 83:

OVERLAPPING COMMUNICATION AND COMPUTATION

[Diagram: without overlap, each iteration processes the whole domain and then runs MPI; with overlap, the boundary domain is processed first, then MPI runs concurrently with the inner-domain processing (MPI depends on the boundary processing), giving a possible speedup.]

Page 84:

PAGE-LOCKED DATA TRANSFERS

• cudaMallocHost() allows allocation of page-locked ("pinned") host memory
  — Same syntax as cudaMalloc(), but it works with CPU pointers and memory
• Enables the highest cudaMemcpy performance
  — 3.2 GB/s on PCI-e x16 Gen1
  — 6.5 GB/s on PCI-e x16 Gen2
• Use with caution!
  — Allocating too much page-locked memory can reduce overall system performance

See and test the bandwidth.cu sample.

Page 85:

OVERLAPPING DATA TRANSFERS AND COMPUTATION

• The async copy and stream APIs allow overlap of H2D or D2H data transfers with computation
  — CPU computation can overlap data transfers on all CUDA-capable devices
  — Kernel computation can overlap data transfers on devices with "concurrent copy and execution" (compute capability >= 1.1)
• Stream = a sequence of operations that execute in order on the GPU
  — Operations from different streams can be interleaved
  — The stream ID is used as an argument to async calls and kernel launches

Page 86:

ASYNCHRONOUS DATA TRANSFERS

• An asynchronous host-device memory copy returns control immediately to the CPU
  — cudaMemcpyAsync(dst, src, size, dir, stream);
  — Requires pinned host memory (allocated with cudaMallocHost)
• Overlap CPU computation with the data transfer
  — 0 = default stream

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();   // overlapped with the transfer and the kernel

Page 87:

OVERLAPPING KERNEL AND DATA TRANSFER

• Requires:
  — Kernel and transfer use different, non-zero streams
  — For the creation of a stream: cudaStreamCreate
  — Remember that a CUDA call to stream 0 blocks until all previous calls complete, and cannot be overlapped
• Example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);   // overlapped with the copy

Exercise: read, compile and run simpleStreams.cu

Page 88:

OVERLAPPING COMMUNICATION AND COMPUTATION

process_boundary_and_pack<<<gs_b,bs_b,0,s1>>>(u_new_d,u_d,to_left_d,to_right_d,n,m);

process_inner_domain<<<gs_id,bs_id,0,s2>>>(u_new_d, u_d,to_left_d,to_right_d,n,m);

cudaStreamSynchronize(s1); //wait for boundary

MPI_Request req[8];

//Exchange halo with left, right, top and bottom neighbor

// Using non-blocking MPI primitives
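/* Hedged expansion of the elided exchange (buffer and offset names assumed
   from the earlier slides): post the four receives, then the four sends.
   Top/bottom rows are contiguous and sent in place; left/right columns use
   the packed to_left_d/to_right_d and from_left_d/from_right_d buffers. */
MPI_Irecv(u_new_d + offset_top_boundary,    m-2, MPI_DOUBLE, t_nb, 0, MPI_COMM_WORLD, req + 0);
MPI_Irecv(u_new_d + offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 1, MPI_COMM_WORLD, req + 1);
MPI_Irecv(from_left_d,  n-2, MPI_DOUBLE, l_nb, 2, MPI_COMM_WORLD, req + 2);
MPI_Irecv(from_right_d, n-2, MPI_DOUBLE, r_nb, 3, MPI_COMM_WORLD, req + 3);
MPI_Isend(u_new_d + offset_last_row,  m-2, MPI_DOUBLE, b_nb, 0, MPI_COMM_WORLD, req + 4);
MPI_Isend(u_new_d + offset_first_row, m-2, MPI_DOUBLE, t_nb, 1, MPI_COMM_WORLD, req + 5);
MPI_Isend(to_right_d, n-2, MPI_DOUBLE, r_nb, 2, MPI_COMM_WORLD, req + 6);
MPI_Isend(to_left_d,  n-2, MPI_DOUBLE, l_nb, 3, MPI_COMM_WORLD, req + 7);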

MPI_Waitall(8, req, MPI_STATUSES_IGNORE);

unpack<<<gs_s,bs_s>>>(u_new_d, from_left_d, from_right_d, n, m);

cudaDeviceSynchronize(); //wait for iteration to finish


Page 89:

HANDLING MULTI GPU NODES

§ Multi-GPU nodes and GPU affinity:
  — Use the local rank:

    int local_rank = //determine local rank
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

  — Look at the function assignDeviceToProcess in the file assign_process_gpu.c (directory LAPLACE_CNAF/MPI)
  — Use exclusive process mode + cudaSetDevice(0)

Page 90:

HANDLING MULTI GPU NODES

§ How to determine the local rank:
  — Rely on process placement (with one rank per GPU):

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices); // num_devices == ranks per node
    int local_rank = rank % num_devices;

  — Use environment variables provided by the MPI launcher
    § e.g. for OpenMPI:
      int local_rank = atoi(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));
    § e.g. for MVAPICH2:
      int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
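A hedged helper combining the two approaches (the function name and the fallback policy are assumptions, not part of the sample code):

    #include <stdlib.h>

    static int get_local_rank(void)
    {
        const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");    /* OpenMPI  */
        if (!env) env = getenv("MV2_COMM_WORLD_LOCAL_RANK");       /* MVAPICH2 */
        if (env) return atoi(env);

        int rank = 0, num_devices = 0;                             /* fallback: process placement */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&num_devices);
        return rank % num_devices;
    }

The result can then be used for the GPU affinity shown on the previous slide, e.g. cudaSetDevice(get_local_rank() % num_devices).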

Page 91:

HOMEWORK

§  Implement the Jacobi method for the solution of the 2D Laplace equation by using CUDA streams and MPI non-blocking point-to-point primitives to overlap computation and communication.

§ Develop a 3D solution by using a decomposition along one axis (e.g., the Z direction)


Page 92:

CONCLUSIONS

§ Using MPI as the abstraction layer for multi-GPU programming allows multi-GPU programs to scale beyond a single node
  — CUDA-aware MPI delivers ease of use, reduced network latency and increased bandwidth
§ All NVIDIA tools are usable, and third party tools are available
§ Multiple CUDA-aware MPI implementations are available
  — OpenMPI, MVAPICH2, Cray, IBM Platform MPI
§ With CUDA streams it is possible to use the CPU as a GPU network co-processor

Page 93:

OVERLAPPING COMMUNICATION AND COMPUTATION – TIPS AND TRICKS

§ CUDA-aware MPI might use the default stream
  — Allocate streams with the non-blocking flag (cudaStreamNonBlocking)
§ In case of multiple kernels for boundary handling, the kernel processing the inner domain might sneak in between them
  — Use a single stream, or events for inter-stream dependencies via cudaStreamWaitEvent – this disables overlapping of the boundary and inner-domain kernels
  — Use high-priority streams for the boundary-handling kernels – this allows overlapping of the boundary and inner-domain kernels
§ As of CUDA 6.0, GPUDirect P2P in multi-process mode can overlap; disable it for older releases
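A hedged sketch of the stream setup these tips suggest (stream names and the kernel launches in the comments are illustrative):

    /* Non-blocking streams: one high priority for the boundary work,
       one low priority for the inner domain. */
    int least_prio, greatest_prio;                 /* smaller value = higher priority */
    cudaDeviceGetStreamPriorityRange(&least_prio, &greatest_prio);

    cudaStream_t s_boundary, s_inner;
    cudaStreamCreateWithPriority(&s_boundary, cudaStreamNonBlocking, greatest_prio);
    cudaStreamCreateWithPriority(&s_inner,    cudaStreamNonBlocking, least_prio);

    /* boundary blocks are then scheduled ahead of queued inner-domain blocks:
       process_boundary_and_pack<<<gs_b, bs_b, 0, s_boundary>>>(...);
       process_inner_domain<<<gs_id, bs_id, 0, s_inner>>>(...); */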

Page 94:

HIGH PRIORITY STREAMS

§ Improve scalability with high-priority streams (cudaStreamCreateWithPriority)

[Diagram: two scenarios with two streams each – computing local forces, computing non-local forces, exchanging non-local atom positions and non-local forces; in the second scenario stream 2 is high priority (HP) and stream 1 low priority (LP), giving a possible speedup.]