Transcript of S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT (GTC 2017)
Davide Rossetti, Elena Agostini
S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT
GPUDIRECT ELSEWHERE AT GTC2017
H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D.Rossetti @Mellanox)
S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S.Jeaugey @NVIDIA)
S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S.Potluri, J.Krauss @NVIDIA)
S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F.Osman @Broadcom)
S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K.Panda @OSU)
AGENDA
GPUDirect technologies
NVLINK-enabled multi-GPU systems
GPUDirect P2P
GPUDirect RDMA
GPUDirect Async
Async benchmarks & application results
INTRODUCTION TO GPUDIRECT TECHNOLOGIES
GPUDIRECT FAMILY [1]
Technologies enabling products!

GPUDIRECT SHARED GPU-SYSMEM
GPU pinned memory shared with other RDMA-capable devices; avoids intermediate copies

GPUDIRECT P2P
Accelerated GPU-GPU memory copies; inter-GPU direct load/store access

GPUDIRECT RDMA [2]
Direct GPU to 3rd party device transfers, e.g. direct I/O, optimized inter-node communication

GPUDIRECT ASYNC
Direct GPU to 3rd party device synchronizations, e.g. optimized inter-node communication

[1] https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma
GPUDIRECT SCOPES
• GPUDirect P2P → data
• Intra-node
• GPUs both master and slave
• Over PCIe or NVLink
• GPUDirect RDMA → data
• Inter-node
• GPU slave, 3rd party device master
• Over PCIe
• GPUDirect Async → control
• GPU & 3rd party device, master & slave
• Over PCIe
[Diagram: GPUDirect Async handles the control plane between host and GPU; GPUDirect RDMA/P2P handle the data plane between GPUs.]
NVLINK-enabled Multi-GPU servers
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W
DGX-1 SYSTEM TOPOLOGY
GPU – CPU link:
PCIe
12.5+12.5 GB/s eff BW
GPUDirect P2P:
GPU – GPU link is NVLink
Cube mesh topology
not all-to-all
GPUDirect RDMA:
GPU – NIC link is PCIe
IBM MINSKY
2 POWER8 with NVLink
4 NVIDIA Tesla P100 GPUs
256 GB System Memory
2 SSD storage devices
High-speed interconnect: IB or Ethernet
Optional: up to 1 TB system memory, PCIe-attached NVMe storage
[Diagram: IBM Minsky topology. Each POWER8 CPU connects to two P100 GPUs over NVLink at 80 GB/s; each GPU pair is also linked by NVLink; each CPU's system memory delivers 115 GB/s.]
IBM MINSKY SYSTEM TOPOLOGY
GPU – CPU link:
2x NVLINK
40+40 GB/s raw BW
GPUDirect P2P:
GPU – GPU link is 2x NVLink
Two-clique topology
GPUDirect RDMA:
Not supported
GPUDIRECT AND MULTI-GPU SYSTEMS: THE CASE OF DGX-1
HOW TO’S
Device topology, link type and capabilities
GPUa - GPUb link: P2P over NVLINK vs PCIe, speed, etc
Same for CPU – GPU link: NVLINK or PCIe
Same for NIC – GPU link (HWLOC)
Select an optimized GPU/CPU/NIC combination in MPI runs
Enable GPUDirect RDMA
CUDA LINK CAPABILITIES
// CUDA driver API
typedef enum CUdevice_P2PAttribute_enum {
    CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK = 0x01,
    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED = 0x02,
    CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03
} CUdevice_P2PAttribute;

CUresult cuDeviceGetP2PAttribute(int *value, CUdevice_P2PAttribute attrib,
                                 CUdevice srcDevice, CUdevice dstDevice);

// CUDA runtime API
cudaError_t cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr,
                                      int srcDevice, int dstDevice);

Basic info, GPU-GPU links only.
PERFORMANCE_RANK: a relative value indicating the performance of the link between two GPUs (NVLINK ranks higher than PCIe).
NATIVE_ATOMIC_SUPPORTED: whether GPU kernels can do remote native atomics over the link.
GPUDIRECT P2P: NVLINK VS PCIE
Note: some GPUs are not connected, e.g. GPU0-GPU7.
Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!!!

NVLINK transparently picked if available

cudaSetDevice(0);
cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);

cudaSetDevice(1);
cudaMalloc(&buf1, size);
…

cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);
MULTI GPU RUNS ON DGX-1
Create wrapper script
Use local MPI rank (MPI implementation dependent)
Don't use CUDA_VISIBLE_DEVICES, hurts P2P!!!
Environment variables to pass selection down to MPI and app
In application: cudaSetDevice(atoi(getenv("USE_GPU")))
Run wrapper script
Select best GPU/CPU/NIC for each MPI rank
$ cat wrapper.sh
if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$MV2_COMM_WORLD_LOCAL_RANK
fi
if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
case ${HOSTNAME} in
*dgx*)
    USE_GPU=$((2*($lrank%4)+$lrank/4))   # 0,2,4,6,1,3,5,7
    export USE_SOCKET=$(($USE_GPU/4))    # 0,0,1,1,0,0,1,1
    HCA=mlx5_$(($USE_GPU/2))             # 0,1,2,3,0,1,2,3
    export OMPI_MCA_btl_openib_if_include=${HCA}
    export MV2_IBA_HCA=${HCA}
    export USE_GPU;;
…
esac
numactl --cpunodebind=${USE_SOCKET} -l "$@"

$ mpirun -np N wrapper.sh myapp param1 param2 …
NVML [1] NVLINK
nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version)
nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive)
nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult)
nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci)
Link discovery and info APIs
[1] http://docs.nvidia.com/deploy/nvml-api/

nvmlDevice handles are separate from CUDA GPU IDs (NVML enumerates all devices; CUDA honors CUDA_VISIBLE_DEVICES)
NVML_NVLINK_MAX_LINKS = 6
See later for capabilities
domain:bus:device.function PCI identifier of the device on the other side of the link; can be the socket PCIe bridge (IBM POWER8)
NVLINK CAPABILITIES
$ nvidia-smi nvlink -i <GPU id> -c

typedef enum nvmlNvLinkCapability_enum {
    NVML_NVLINK_CAP_P2P_SUPPORTED = 0,
    NVML_NVLINK_CAP_SYSMEM_ACCESS = 1,
    NVML_NVLINK_CAP_P2P_ATOMICS = 2,
    NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
    NVML_NVLINK_CAP_SLI_BRIDGE = 4,
    NVML_NVLINK_CAP_VALID = 5,
} nvmlNvLinkCapability_t;
On DGX-1
NVLINK COUNTERS
Per GPU (-i 0), per link (-l <0..3>)
Two sets of counters (-g <0|1>)
Per set counter types: cycles,packets,bytes (-sc xyz)
Reset individually (-r <0|1>)
On DGX-1
NVML TOPOLOGY
nvmlDeviceGetTopologyNearestGpus( nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int* count,nvmlDevice_t* deviceArray )
nvmlDeviceGetTopologyCommonAncestor( nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t* pathInfo )
nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int* count, nvmlDevice_t* deviceArray )
nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet);
GPU-GPU & GPU-CPU topology query APIs [1]

[1] http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html
NVML_TOPOLOGY_INTERNAL,
NVML_TOPOLOGY_SINGLE,
NVML_TOPOLOGY_MULTIPLE,
NVML_TOPOLOGY_HOSTBRIDGE,
NVML_TOPOLOGY_CPU,
NVML_TOPOLOGY_SYSTEM,
SYSTEM TOPOLOGY
$ nvidia-smi topo -m
On DGX-1
SYSTEM TOPOLOGY
$ nvidia-smi topo -mp
On DGX-1, PCIe only
GPUDIRECT P2P
DGX-1 P2P PERFORMANCE
in CUDA toolkit samples
Sources: samples/1_Utilities/p2pBandwidthLatencyTest
Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
p2pBandwidthLatencyTest
DGX-1 P2P PERFORMANCE
In CUDA toolkit demo suite:
/usr/local/cuda-8.0/extras/demo_suite/busGrind -h
Usage: -h: print usage
-p [0,1] enable or disable pinned memory tests (default on)
-u [0,1] enable or disable unpinned memory tests (default off)
-e [0,1] enable or disable p2p enabled memory tests (default on)
-d [0,1] enable or disable p2p disabled memory tests (default off)
-a enable all tests
-n disable all tests
busGrind
Intra-node MPI BW
8/3/16
GPU-aware MPI running over GPUDirect P2P. Dual IVB Xeon 2U server (K40 PCIe) vs DGX-1 (P100 NVLink).

[Chart: bandwidth (GB/sec) vs message size (1K-16M bytes) for k40-pcie, k40-pcie-bidir, p100-nvlink, p100-nvlink-bidir; P100/NVLink reaches ~35 GB/sec bi-directional and ~17 GB/sec uni-directional.]
GPUDIRECT RDMA
GPUDirect RDMA over RDMA networks

For the Linux rdma subsystem:
open-source nvidia_peer_memory kernel module [1]
important bug fix in ver 1.0-3!!!
enables NVIDIA GPUDirect RDMA on the OpenFabrics stack

Multiple vendors:
Mellanox [2]: ConnectX-3 to ConnectX-5, IB/RoCE
Chelsio [3]: T5, iWARP
Others to come

For better network communication latency.

[1] https://github.com/Mellanox/nv_peer_memory
[2] http://www.mellanox.com/page/products_dyn?product_family=116
[3] http://www.chelsio.com/gpudirect-rdma
GPUDirect RDMA over Infiniband
For bandwidth:
$ git clone git://git.openfabrics.org/~grockah/perftest.git
$ cd perftest
$ ./autogen.sh
$ export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
$ ./configure --prefix=$HOME/test
$ make all install

E.g. host to GPU memory (H-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a server.name.org

GPU to GPU memory (G-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda server.name.org

Benchmarking bandwidth
DGX-1 GPUDirect RDMA uni-dir BW
IB message size, 5000 iterations, RC protocol
[Chart: RDMA BW on DGX-1, ConnectX-4, P100-SXM, socket 0 to socket 1; uni-directional bandwidth (MB/s, up to ~14000) vs IB message size (2 B to 8 MB) for the HH, HG, GH, GG combinations (H = host memory, G = GPU memory).]
GPUDIRECT ASYNC
GPUDIRECT ASYNC
Communications prepared by CPU
• hardly parallelizable, branch intensive
• GPU orchestrates flow
Run by GPU front-end unit
• Same one scheduling GPU work
• Now also scheduling network communications
leverage GPU front-end unit
[Diagram: GPU front-end unit feeding the compute engines]
GPUDIRECT ASYNC OVER OFA VERBS
• Prerequisites: nvidia_peer_memory driver, GDRcopy [1] library
• CUDA 8.0+ Stream Memory Operations (MemOps) APIs
• MLNX OFED 4.0+ Peer-Direct Async Verbs APIs
• libgdsync [2]: bridging CUDA & IB Verbs
• MPI: experimental support in MVAPICH2-GDS [3]
• libmp: lightweight, MPI-like stream-sync primitives, internal benchmarking

[1] http://github.com/NVIDIA/gdrcopy
[2] http://github.com/gpudirect/libgdsync, devel branch
[3] see DK Panda's talk
SW stack bits
SW STACK
[Diagram: SW stack. User mode: applications and benchmarks on top of MVAPICH2 (RDMA extensions for Async), Open MPI, and libmp; these sit on libgdsync, which bridges the CUDA runtime/driver and IB Verbs (extensions for Async). Kernel mode: NV display driver, nv_peer_mem, IB core with mlx5/cxgb4 drivers (extensions for RDMA/Async). HW: HCA and GPU. Components are a mix of open-source, proprietary, and mixed licensing.]
GPUDIRECT ASYNC OVER INFINIBAND
May need special HCA configuration on Kepler/Maxwell GPUs, e.g. on Mellanox:
$ mlxconfig -d /dev/mst/mtxxx_pciconf0 set NON_PREFETCHABLE_PF_BAR=1
$ reboot
Enable GPU peer mappings:
$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1"
Requirements
CUDA STREAM MemOps

// polling on a 32/64-bit word
CU_STREAM_WAIT_VALUE_GEQ = 0x0,
CU_STREAM_WAIT_VALUE_EQ = 0x1,
CU_STREAM_WAIT_VALUE_AND = 0x2,
CU_STREAM_WAIT_VALUE_NOR = 0x3,
CU_STREAM_WAIT_VALUE_FLUSH = 1<<30  // guarantees memory consistency for RDMA
CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWaitValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// 32/64-bit word write
CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWriteValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// lower-overhead batched work submission
CU_STREAM_MEM_OP_WAIT_VALUE_32 = 1,
CU_STREAM_MEM_OP_WRITE_VALUE_32 = 2,
CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3,
CU_STREAM_MEM_OP_WAIT_VALUE_64 = 4,
CU_STREAM_MEM_OP_WRITE_VALUE_64 = 5
CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);
CUDA STREAM MemOps: API features

• batching multiple consecutive MemOps saves ~1.5 us per op
  • use cuStreamBatchMemOp()
• APIs accept device pointers
  • memory needs registration (cuMemHostRegister)
  • device pointer retrieval (cuMemHostGetDevicePointer)
• 3rd party device PCIe resources (aka BARs)
  • assumed physically contiguous & uncached
  • special flag needed in cuMemHostRegister
ASYNC: LIBMP
LIBMP
Lightweight message passing library
thin layer on top of IB Verbs
in-order receive buffer matching
no tags, no wildcards, no data types
zero copy transfers, uses flow control of IB RC transport
Eases benchmarking of multi-GPU applications
Not released (yet)
LIBMP: Prototype message passing library

                     PREPARED BY   TRIGGERED BY
CPU synchronous      CPU           CPU
Stream synchronous   CPU           GPU front-end unit
Kernel initiated     CPU           GPU SMs
LIBMP CPU-SYNC COMMUNICATION

// Send/Recv
int mp_irecv(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);
int mp_isend(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Put/Get
int mp_window_create(void *addr, size_t size, mp_window_t *window_t);
int mp_iput(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ,
            mp_window_t *dst_window_t, mp_request_t *req);
int mp_iget(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ,
            mp_window_t *src_window_t, mp_request_t *req);

// Wait
int mp_wait(mp_request_t *req);
int mp_wait_all(uint32_t count, mp_request_t *req);
LIBMP STREAM-SYNC COMMUNICATION

// Send
int mp_isend_on_stream(void *buf, int size, int peer, mp_reg_t *mp_reg,
                       mp_request_t *req, cudaStream_t stream);

// Put/Get
int mp_iput_on_stream(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ,
                      mp_window_t *dst_window_t, mp_request_t *req, cudaStream_t stream);
int mp_iget_on_stream(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ,
                      mp_window_t *src_window_t, mp_request_t *req, cudaStream_t stream);

// Wait
int mp_wait_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_wait_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);
CUDA stream-sync communications
Loop {
    mp_irecv(…);
    compute<<<…, stream>>>(buf);
    mp_isend_on_stream(…);
    mp_wait_all_on_stream(…);
}

[Diagram: CPU / stream / HCA timeline]
LIBMP STREAM-SYNC BATCH SUBMISSION

// Prepare single requests, IB Verbs side only
int mp_send_prepare(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Post set of prepared requests
int mp_isend_post_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_isend_post_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);
LIBMP KERNEL-SYNC COMMUNICATION
// Host side code
// extract descriptors from prepared requests
int mp::mlx5::get_descriptors(send_desc_t *send_info, mp_request_t *req);
int mp::mlx5::get_descriptors(wait_desc_t *wait_info, mp_request_t *req);

// Device side code
__device__ int mp::device::mlx5::send(send_desc_t *send_info);
__device__ int mp::device::mlx5::wait(wait_desc_t *wait_info);
Kernel-sync communication
Loop {
    mp_irecv(…, rreq);
    mp_get_descriptors(wi, rreq);
    mp_send_prepare(…, sreq);
    mp_get_descriptors(si, sreq);

    compute_and_communicate<<<…, stream>>>(buf, wi, si) {
        do_something(buf);
        mlx5::send(si);
        do_something_else();
        mlx5::wait(wi);
        keep_working();
    }
}

[Diagram: CPU / stream / HCA timeline]
ASYNC: experimental MPI BINDINGS
CODE SAMPLE
MPI_Comm comm = MPI_COMM_WORLD;
cudaStreamCreate (&stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    cudaStreamSynchronize(stream);
    MPI_Irecv(buf2, count, MPI_INT, peer, 0, comm, &req[0]);
    MPI_Isend(buf1, count, MPI_INT, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, statuses);
}
Standard GPU-aware MPI
STREAM-SYNCHRONOUS VERSION
MPI_Comm stream_comm;
cudaStreamCreate (&stream);
MPIX_Comm_dup_with_stream( MPI_COMM_WORLD, &stream_comm, &stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    MPI_Recv(buf2, count, MPI_BYTE, peer, 0, stream_comm);
    MPI_Send(buf1, count, MPI_BYTE, peer, 0, stream_comm);
}
MPIX_Wait_stream_completion(stream_comm);
Experimental MPI extensions
ASYNC BENCHMARKS
LibMP models
HPGMG-FV
CoMD
Lulesh2
LIBMP MODELS
Three different execution models:
• CPU-synchronous (default)
• CUDA stream-synchronous, CPU-asynchronous (SA): communications triggered by a CUDA stream
• CUDA kernel-synchronous (or Kernel-Initiated, KI): communications triggered by CUDA kernel threads
To evaluate performance, three multi-process applications were ported from MPI to LibMP.
Summary
54
HPGMG-CUDA
55
HPGMG-FV Overview
High Performance Geometric Multigrid (Lawrence Berkeley National Laboratory):
• Geometric multigrid solver for variable-coefficient elliptic problems on isotropic Cartesian grids using the Finite Volume method
• F-cycle: multiple V-cycles in which a finer grid is smoothed and restricted onto a coarser grid. A direct solver is applied to the coarsest grid and the solution is then iteratively interpolated back to the finer grids.
Multi-process execution:
• Each grid is divided into several same-size boxes
• Workload is fairly distributed among processes
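The V-cycle at the heart of the F-cycle described above can be sketched with a minimal 1D geometric multigrid solver. This is an illustration of the algorithm only, not HPGMG-FV code: the weighted Jacobi smoother, the 1D Poisson model problem, and the grid sizes are assumptions.

```python
import numpy as np

def smooth(u, f, h, iters=3):
    # weighted Jacobi relaxation for -u'' = f with Dirichlet boundaries
    for _ in range(iters):
        u[1:-1] += (2.0 / 3.0) * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict_fw(r):
    # full-weighting restriction onto the next coarser grid (sizes 2^k + 1)
    rc = np.zeros((len(r) - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def interpolate(ec, n):
    # linear interpolation of the coarse-grid correction back to the fine grid
    e = np.zeros(n)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) <= 3:
        # "direct solve" on the coarsest grid: a single interior unknown
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = smooth(u, f, h)                       # pre-smooth the finer grid
    r = residual(u, f, h)                     # compute the residual
    ec = v_cycle(np.zeros((len(u) - 1) // 2 + 1), restrict_fw(r), 2.0 * h)
    u += interpolate(ec, len(u))              # correct with the coarse solution
    return smooth(u, f, h)                    # post-smooth

# model problem: -u'' = pi^2 sin(pi x) on [0, 1], exact solution sin(pi x)
n = 2**6 + 1
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, h)
```

In the multi-process setting, the restrict/interpolate steps become the inter-level communications and the smoother's halo reads become the intra-level boundary exchanges.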
56
HPGMG-FV CUDA Overview
GPU acceleration
Multi-process and multi-GPU, MPI
Level size threshold:
• Lower levels computed on CPU
• Higher levels computed on GPU
57
HPGMG-FV CUDA Multi-process execution: MPI communications
[Diagram: a chain of levels from coarser CPU levels to finer GPU levels; each level runs Smoother, Residual, and Exchange boundary, with Restriction moving toward coarser levels and Interpolation moving back toward finer ones.]
Inter-level communication: restriction (moving from a finer level to a coarser level) and
interpolation (moving from a coarser level to a finer level)
Intra-level communication: exchange_boundary function (boundary region exchange between
processes)
59
EXCHANGE BOUNDARY IMPLEMENTATION
Code comparison: MPI vs SA-Model

MPI Version:
```cpp
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Isend(sBuf, …, sReqs[p]);
}
local_kernel<<<…, stream>>>();
MPI_Waitall(rReqs, …);
unpack_kernel<<<…, stream>>>(rBuf, …);
MPI_Waitall(sReqs, …);
```

LibMP SA-Model Version:
```cpp
for (p = 0; p < num_peers; p++) {
    mp_irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
for (p = 0; p < num_peers; p++) {
    mp_isend_on_stream(sBuf, …, sReqs[p], stream);
}
local_kernel<<<…, stream>>>();
mp_wait_all_on_stream(…, rReqs, stream);
unpack_kernel<<<…, stream>>>(rBuf, …);
mp_wait_all_on_stream(…, sReqs, stream);
```
60
MPI IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: Host, GPU Stream, and Network rows, with synchronization, CUDA kernels, and communications (Isend, WaitAll) marked.]
61
MPI IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: GPU idle time is visible on the GPU Stream row while the Host performs synchronization and communication.]
63
SA MODEL IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: Host, GPU Stream, and Network rows; CUDA kernels and communications (Isend, WaitRecv, WaitIsend) are enqueued through CUDA 8.0 driver functions.]
64
SA MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB, OFED 3.2
• OpenMPI
• Up to 24% time improvement wrt MPI
• 4 processes, single GPU level:
• 64% CPU work time reduction
• 28% GPU activity time reduction
• The bigger the box size, the larger the messages: communication overhead becomes less important
65
KI MODEL IMPLEMENTATION: How to
For each exchange_boundary() execution:
• Kernel fusion: a single CUDA kernel instead of three
• Move communications inside the fused kernel
• Respect mutual dependencies (hard!)
[Diagram: the CPU launches a single fused kernel; the entire exchange_boundary runs inside that kernel on the GPU.]
67
KI MODEL IMPLEMENTATION: exchange_boundary() fused kernel – CUDA blocks organization
[Diagram: the fused kernel's CUDA blocks are partitioned into three sets (Set 1, Set 2, Set 3), coordinated via the CUB library and a GPU global-memory variable.]
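The coordination among block sets can be thought of as groups of workers synchronizing through a flag in global memory: one set packs and signals, another spins on the signal before unpacking. Below is a host-side Python emulation of that pattern; the two-thread structure, the names, and the "pack = multiply by 2" placeholder are illustrative assumptions, not the talk's kernel code.

```python
import threading
import time

# Stand-ins for GPU global memory: a flag plus the packed payload.
flag = {"packed": False}
result = {}

def set1_pack_and_signal(data):
    # "Set 1" blocks: pack the send buffer, then raise the global flag
    result["packed"] = [x * 2 for x in data]   # placeholder packing step
    flag["packed"] = True

def set3_wait_and_unpack():
    # "Set 3" blocks: spin on the global flag, then unpack the receive buffer
    while not flag["packed"]:
        time.sleep(0.001)
    result["unpacked"] = list(result["packed"])

t1 = threading.Thread(target=set1_pack_and_signal, args=([1, 2, 3],))
t3 = threading.Thread(target=set3_wait_and_unpack)
t3.start()
t1.start()
t1.join()
t3.join()
```

In the real fused kernel the sets are ranges of block indices and the flag lives in GPU global memory, with the CUB library helping with block-level cooperation.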
68
KI MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Up to 26% time improvement wrt MPI
• 4 processes, single GPU level:
• 77% CPU work time reduction
• 32% GPU activity time reduction
69
DGX-1 CLUSTER BENCHMARKS: Performance gain
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 2 processes only: gain about 8% lower than on Wilkes
• [Chart: gain wrt MPI for the CPU-synchronous default, the SA model, and the KI model]
70
COMD-CUDA
71
COMD-CUDA Implementation
• CoMD (www.exmatex.org project) is a classical molecular dynamics proxy application
• O(N²) within cutoff
• It considers materials where the interatomic potentials are short range and the
simulation requires the evaluation of all forces between atom pairs within the cutoff
distance
• Dependency between CUDA kernels and communications
• Explored SA-model only
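The O(N²)-within-cutoff interaction pattern can be sketched as a direct double loop over atom pairs. This is an illustrative Python sketch, not CoMD code: the Lennard-Jones potential and the toy atom positions are assumptions (CoMD additionally uses cell lists to avoid scanning all N² pairs).

```python
import math

def pair_energy(r, sigma=1.0, eps=1.0):
    # Lennard-Jones pair potential, a common short-range interatomic model
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def total_energy(atoms, cutoff):
    """O(N^2) sum over atom pairs, skipping pairs beyond the cutoff distance."""
    e = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i], atoms[j])
            if r < cutoff:
                e += pair_energy(r)
    return e

# toy system: only the first pair falls inside the cutoff,
# and it sits at the LJ minimum distance 2^(1/6) * sigma
atoms = [(0.0, 0.0, 0.0), (2.0 ** (1.0 / 6.0), 0.0, 0.0), (10.0, 0.0, 0.0)]
e = total_energy(atoms, cutoff=2.5)
```

The GPU kernels parallelize this pair evaluation, and the boundary atoms exchanged between processes are what the load/unload kernels below pack and unpack.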
72
SA MODEL IMPLEMENTATION
Communication periods – code comparison

MPI Version:
```cpp
Repeat for 3 times {
    load_kernel<<<…, stream>>>(sBufA, …);
    load_kernel<<<…, stream>>>(sBufB, …);
    cudaDeviceSynchronize();
    MPI_Sendrecv(sBufA, rBufA, …);
    MPI_Sendrecv(sBufB, rBufB, …);
    cuda_tasks(stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
}
```

LibMP SA-Model Version:
```cpp
for (p = 0; p < (3 * 2); p++) {
    mp_irecv(…, rReqs[p]);
}
for (p = 0; p < (3 * 2); p += 2) {
    load_kernel<<<…, stream>>>(sBufA, …);
    mp_isend_on_stream(…, sReqs[p], stream);
    load_kernel<<<…, stream>>>(sBufB, …);
    mp_isend_on_stream(…, sReqs[p+1], stream);
    cuda_tasks(stream);
    mp_wait_all_on_stream(…, rReqs+p, stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
    mp_wait_all_on_stream(…, sReqs+p, stream);
}
```
73
SA MODEL PERFORMANCE GAIN: Wilkes cluster – communications gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• 25% ~ 35% time gain
• Communication periods only
• Gain SA model wrt MPI
74
SA MODEL PERFORMANCE GAIN: DGX-1 cluster – communications
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 4 processes only: gain 13% lower than on Wilkes
• Communication periods only
• Gain SA model wrt MPI
75
LULESH2-CUDA
76
LULESH2-CUDA Implementation
• Proxy application developed at Lawrence Livermore National Laboratory
• “… It approximates the hydrodynamics equations discretely by partitioning the
spatial problem domain into a collection of volumetric elements defined by a
mesh.” ( https://codesign.llnl.gov/lulesh.php )
• Dependency between CUDA kernels and communications
• Explored SA-model only
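The "up to 26" exchanges in LULESH's communication code correspond to the neighbors of a subdomain in a 3D Cartesian decomposition: 6 faces, 12 edges, and 8 corners. A quick Python sketch of that enumeration (illustrative, not LULESH code):

```python
from itertools import product

def neighbor_offsets():
    """The 26 neighbor offsets of a subdomain in a 3D Cartesian
    decomposition: 6 faces + 12 edges + 8 corners."""
    return [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

offs = neighbor_offsets()
faces = [d for d in offs if sum(x != 0 for x in d) == 1]
edges = [d for d in offs if sum(x != 0 for x in d) == 2]
corners = [d for d in offs if sum(x != 0 for x in d) == 3]
```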
77
SA MODEL IMPLEMENTATION
Communication periods – code comparison

MPI Version:
```cpp
Repeat up to 26 times {
    MPI_Irecv(rBuf, …);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(sBuf, …);
    cudaDeviceSynchronize();
    cudaMemcpyAsync(…, stream);
    MPI_Isend(sBuf, …);
}
MPI_Waitall(…);
……
Repeat up to 26 times {
    MPI_Wait(…);
    cuda_kernel<<<…, stream>>>(rBuf, …);
    cudaMemcpyAsync(…, stream);
    cudaDeviceSynchronize();
}
```

LibMP SA-Model Version:
```cpp
Repeat up to 26 times {
    mp_irecv(rBuf, …, rReqs);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(…);
    cudaMemcpyAsync(…, stream);
    mp_isend_on_stream(…, sReqs, stream);
}
mp_wait_all_on_stream(…, sReqs, stream);
……
Repeat up to 26 times {
    mp_wait_on_stream(…, rReqs, stream);
    cudaMemcpyAsync(…, stream);
    cuda_kernel<<<…, stream>>>(…);
}
```
78
SA MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Progressively increasing the number
of iterations to intensify the GPU
workload
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Average gain of 13%
• 27 and 64 processes
79
GAME OVER