Transcript of S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT (GTC 2017)
Davide Rossetti, Elena Agostini
S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT
GPUDIRECT ELSEWHERE AT GTC2017
H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D.Rossetti @Mellanox)
S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S.Jeaugey @NVIDIA)
S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S.Potluri, J.Krauss @NVIDIA)
S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F.Osman @Broadcom)
S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K.Panda @OSU)
AGENDA
GPUDirect technologies
NVLINK-enabled multi-GPU systems
GPUDirect P2P
GPUDirect RDMA
GPUDirect Async
Async benchmarks & application results
INTRODUCTION TO GPUDIRECT TECHNOLOGIES
GPUDIRECT FAMILY [1]
Technologies enabling products!

GPUDIRECT SHARED GPU-SYSMEM
GPU pinned memory shared with other RDMA-capable devices; avoids intermediate copies

GPUDIRECT P2P
Accelerated GPU-GPU memory copies; inter-GPU direct load/store access

GPUDIRECT RDMA [2]
Direct GPU to 3rd party device transfers, e.g. direct I/O, optimized inter-node communication

GPUDIRECT ASYNC
Direct GPU to 3rd party device synchronizations, e.g. optimized inter-node communication

[1] https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma
GPUDIRECT SCOPES
• GPUDirect P2P → data
• Intra-node
• GPUs both master and slave
• Over PCIe or NVLink
• GPUDirect RDMA → data
• Inter-node
• GPU slave, 3rd party device master
• Over PCIe
• GPUDirect Async → control
• GPU & 3rd party device, master & slave
• Over PCIe
[Diagram: GPUDirect Async handles the control plane between host and GPU; GPUDirect RDMA/P2P handle the data plane between GPUs.]
NVLINK-enabled Multi-GPU servers
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W
DGX-1 SYSTEM TOPOLOGY
GPU – CPU link:
PCIe
12.5+12.5 GB/s eff BW
GPUDirect P2P:
GPU – GPU link is NVLink
Cube mesh topology
not all-to-all
GPUDirect RDMA:
GPU – NIC link is PCIe
IBM MINSKY
2 POWER8 with NVLink
4 NVIDIA Tesla P100 GPUs
256 GB System Memory
2 SSD storage devices
High-speed interconnect: IB or Ethernet
Optional: up to 1 TB system memory, PCIe-attached NVMe storage
[Diagram: IBM Minsky topology. Each POWER8 CPU connects to two P100 GPUs over NVLink at 80 GB/s; each GPU pair is also linked by NVLink; each CPU's system memory delivers 115 GB/s.]
IBM MINSKY SYSTEM TOPOLOGY
GPU – CPU link:
2x NVLINK
40+40 GB/s raw BW
GPUDirect P2P:
GPU – GPU link is 2x NVLink
Two-clique topology
GPUDirect RDMA:
Not supported
GPUDIRECT AND MULTI-GPU SYSTEMS: THE CASE OF DGX-1
HOW TO’S
Device topology, link type and capabilities
GPUa - GPUb link: P2P over NVLINK vs PCIe, speed, etc
Same for CPU – GPU link: NVLINK or PCIe
Same for NIC – GPU link (HWLOC)
Select an optimized GPU/CPU/NIC combination in MPI runs
Enable GPUDirect RDMA
CUDA LINK CAPABILITIES
// CUDA driver API
typedef enum CUdevice_P2PAttribute_enum {
    CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK = 0x01,
    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED = 0x02,
    CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03
} CUdevice_P2PAttribute;

CUresult cuDeviceGetP2PAttribute(int *value, CUdevice_P2PAttribute attrib,
                                 CUdevice srcDevice, CUdevice dstDevice);

// CUDA runtime API
cudaError_t cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr,
                                      int srcDevice, int dstDevice);

Basic info, GPU-GPU links only.
PERFORMANCE_RANK: a relative value indicating the performance of the link between two GPUs (NVLINK ranks higher than PCIe).
NATIVE_ATOMIC_SUPPORTED: whether GPU kernels can do remote native atomics over the link.
GPUDIRECT P2P: NVLINK VS PCIE
Note: some GPUs are not connected, e.g. GPU0-GPU7.
Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!!!

NVLINK transparently picked if available

cudaSetDevice(0);
cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);

cudaSetDevice(1);
cudaMalloc(&buf1, size);
…

cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);
MULTI GPU RUNS ON DGX-1
Create wrapper script
Use local MPI rank (MPI implementation dependent)
Don't use CUDA_VISIBLE_DEVICES, hurts P2P!!!
Environment variables to pass selection down to MPI and app
In application: cudaSetDevice(atoi(getenv("USE_GPU")))
Run wrapper script
Select best GPU/CPU/NIC for each MPI rank
$ cat wrapper.sh
if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$MV2_COMM_WORLD_LOCAL_RANK
fi
if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
case ${HOSTNAME} in
*dgx*)
    USE_GPU=$((2*($lrank%4)+$lrank/4))   # 0,2,4,6,1,3,5,7
    export USE_SOCKET=$(($USE_GPU/4))    # 0,0,1,1,0,0,1,1
    HCA=mlx5_$(($USE_GPU/2))             # 0,1,2,3,0,1,2,3
    export OMPI_MCA_btl_openib_if_include=${HCA}
    export MV2_IBA_HCA=${HCA}
    export USE_GPU;;
…
esac
numactl --cpunodebind=${USE_SOCKET} -l "$@"

$ mpirun -np N wrapper.sh myapp param1 param2 …
NVML [1] NVLINK
nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version)
nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive)
nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult)
nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci)
Link discovery and info APIs
[1] http://docs.nvidia.com/deploy/nvml-api/

nvmlDevice handles are separate from CUDA GPU IDs (NVML enumerates all devices; CUDA honors CUDA_VISIBLE_DEVICES)
NVML_NVLINK_MAX_LINKS = 6
See later for capabilities
domain:bus:device.function PCI identifier of the device on the other side of the link; can be the socket PCIe bridge (IBM POWER8)
NVLINK CAPABILITIES
$ nvidia-smi nvlink -i <GPU id> -c

typedef enum nvmlNvLinkCapability_enum {
    NVML_NVLINK_CAP_P2P_SUPPORTED = 0,
    NVML_NVLINK_CAP_SYSMEM_ACCESS = 1,
    NVML_NVLINK_CAP_P2P_ATOMICS = 2,
    NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
    NVML_NVLINK_CAP_SLI_BRIDGE = 4,
    NVML_NVLINK_CAP_VALID = 5,
} nvmlNvLinkCapability_t;
On DGX-1
NVLINK COUNTERS
Per GPU (-i 0), per link (-l <0..3>)
Two sets of counters (-g <0|1>)
Per set counter types: cycles,packets,bytes (-sc xyz)
Reset individually (-r <0|1>)
On DGX-1
NVML TOPOLOGY
nvmlDeviceGetTopologyNearestGpus( nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int* count,nvmlDevice_t* deviceArray )
nvmlDeviceGetTopologyCommonAncestor( nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t* pathInfo )
nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int* count, nvmlDevice_t* deviceArray )
nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet);
GPU-GPU & GPU-CPU topology query APIs [1]

[1] http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html
NVML_TOPOLOGY_INTERNAL,
NVML_TOPOLOGY_SINGLE,
NVML_TOPOLOGY_MULTIPLE,
NVML_TOPOLOGY_HOSTBRIDGE,
NVML_TOPOLOGY_CPU,
NVML_TOPOLOGY_SYSTEM,
SYSTEM TOPOLOGY
$ nvidia-smi topo -m
On DGX-1
SYSTEM TOPOLOGY
$ nvidia-smi topo -mp
On DGX-1, PCIe only
GPUDIRECT P2P
DGX-1 P2P PERFORMANCE
in CUDA toolkit samples
Sources: samples/1_Utilities/p2pBandwidthLatencyTest
Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
p2pBandwidthLatencyTest
DGX-1 P2P PERFORMANCE
In CUDA toolkit demo suite:
/usr/local/cuda-8.0/extras/demo_suite/busGrind -h
Usage: -h: print usage
-p [0,1] enable or disable pinned memory tests (default on)
-u [0,1] enable or disable unpinned memory tests (default off)
-e [0,1] enable or disable p2p enabled memory tests (default on)
-d [0,1] enable or disable p2p disabled memory tests (default off)
-a enable all tests
-n disable all tests
busGrind
Intra-node MPI BW
8/3/16
GPU-aware MPI running over GPUDirect P2P. Dual IVB Xeon 2U server (K40 PCIe) vs DGX-1 (P100 NVLink).

[Chart: bandwidth (GB/sec) vs message size (1K-16M bytes) for k40-pcie, k40-pcie-bidir, p100-nvlink, p100-nvlink-bidir; P100/NVLink reaches ~35 GB/sec bi-directional and ~17 GB/sec uni-directional.]
GPUDIRECT RDMA
GPUDirect RDMA over RDMA networks

For the Linux rdma subsystem:
open-source nvidia_peer_memory kernel module [1]
important bug fix in ver 1.0-3!!!
enables NVIDIA GPUDirect RDMA on the OpenFabrics stack

Multiple vendors:
Mellanox [2]: ConnectX-3 to ConnectX-5, IB/RoCE
Chelsio [3]: T5, iWARP
Others to come

For better network communication latency.

[1] https://github.com/Mellanox/nv_peer_memory
[2] http://www.mellanox.com/page/products_dyn?product_family=116
[3] http://www.chelsio.com/gpudirect-rdma
GPUDirect RDMA over Infiniband
For bandwidth:
$ git clone git://git.openfabrics.org/~grockah/perftest.git
$ cd perftest
$ ./autogen.sh
$ export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
$ ./configure --prefix=$HOME/test
$ make all install

E.g. host to GPU memory (H-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a server.name.org

GPU to GPU memory (G-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda server.name.org

Benchmarking bandwidth
DGX-1 GPUDirect RDMA uni-dir BW
IB message size, 5000 iterations, RC protocol
[Chart: RDMA BW on DGX-1, ConnectX-4, P100-SXM, socket 0 to socket 1; uni-directional bandwidth (MB/s, up to ~14000) vs IB message size (2 B to 8 MB) for the HH, HG, GH, GG combinations (H = host memory, G = GPU memory).]
GPUDIRECT ASYNC
GPUDIRECT ASYNC
Communications prepared by CPU
• hardly parallelizable, branch intensive
• GPU orchestrates flow
Run by GPU front-end unit
• Same one scheduling GPU work
• Now also scheduling network communications
leverage GPU front-end unit
[Diagram: GPU front-end unit feeding the compute engines]
GPUDIRECT ASYNC OVER OFA VERBS
• Prerequisites: nvidia_peer_memory driver, GDRcopy [1] library
• CUDA 8.0+ Stream Memory Operations (MemOps) APIs
• MLNX OFED 4.0+ Peer-Direct Async Verbs APIs
• libgdsync [2]: bridging CUDA & IB Verbs
• MPI: experimental support in MVAPICH2-GDS [3]
• libmp: lightweight, MPI-like stream-sync primitives, internal benchmarking

[1] http://github.com/NVIDIA/gdrcopy
[2] http://github.com/gpudirect/libgdsync, devel branch
[3] see DK Panda's talk
SW stack bits
SW STACK
[Diagram: SW stack. User mode: applications and benchmarks on top of MVAPICH2 (RDMA extensions for Async), Open MPI, and libmp; these sit on libgdsync, which bridges the CUDA runtime/driver and IB Verbs (extensions for Async). Kernel mode: NV display driver, nv_peer_mem, IB core with mlx5/cxgb4 drivers (extensions for RDMA/Async). HW: HCA and GPU. Components are a mix of open-source, proprietary, and mixed licensing.]
GPUDIRECT ASYNC OVER INFINIBAND
May need special HCA configuration on Kepler/Maxwell GPUs, e.g. on Mellanox:
$ mlxconfig -d /dev/mst/mtxxx_pciconf0 set NON_PREFETCHABLE_PF_BAR=1
$ reboot
Enable GPU peer mappings:
$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1"
Requirements
CUDA STREAM MemOps

// polling on a 32/64-bit word
CU_STREAM_WAIT_VALUE_GEQ = 0x0,
CU_STREAM_WAIT_VALUE_EQ = 0x1,
CU_STREAM_WAIT_VALUE_AND = 0x2,
CU_STREAM_WAIT_VALUE_NOR = 0x3,
CU_STREAM_WAIT_VALUE_FLUSH = 1<<30  // guarantees memory consistency for RDMA
CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWaitValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// 32/64-bit word write
CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWriteValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// lower-overhead batched work submission
CU_STREAM_MEM_OP_WAIT_VALUE_32 = 1,
CU_STREAM_MEM_OP_WRITE_VALUE_32 = 2,
CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3,
CU_STREAM_MEM_OP_WAIT_VALUE_64 = 4,
CU_STREAM_MEM_OP_WRITE_VALUE_64 = 5
CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);
CUDA STREAM MemOps: API features

• batching multiple consecutive MemOps saves ~1.5 us per op
  • use cuStreamBatchMemOp()
• APIs accept device pointers
  • memory needs registration (cuMemHostRegister)
  • device pointer retrieval (cuMemHostGetDevicePointer)
• 3rd party device PCIe resources (aka BARs)
  • assumed physically contiguous & uncached
  • special flag needed in cuMemHostRegister
ASYNC: LIBMP
LIBMP
Lightweight message passing library
thin layer on top of IB Verbs
in-order receive buffer matching
no tags, no wildcards, no data types
zero copy transfers, uses flow control of IB RC transport
Eases benchmarking of multi-GPU applications
Not released (yet)
LIBMP: Prototype message passing library

                     PREPARED BY   TRIGGERED BY
CPU synchronous      CPU           CPU
Stream synchronous   CPU           GPU front-end unit
Kernel initiated     CPU           GPU SMs
LIBMP CPU-SYNC COMMUNICATION

// Send/Recv
int mp_irecv(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);
int mp_isend(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Put/Get
int mp_window_create(void *addr, size_t size, mp_window_t *window_t);
int mp_iput(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ,
            mp_window_t *dst_window_t, mp_request_t *req);
int mp_iget(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ,
            mp_window_t *src_window_t, mp_request_t *req);

// Wait
int mp_wait(mp_request_t *req);
int mp_wait_all(uint32_t count, mp_request_t *req);
LIBMP STREAM-SYNC COMMUNICATION

// Send
int mp_isend_on_stream(void *buf, int size, int peer, mp_reg_t *mp_reg,
                       mp_request_t *req, cudaStream_t stream);

// Put/Get
int mp_iput_on_stream(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ,
                      mp_window_t *dst_window_t, mp_request_t *req, cudaStream_t stream);
int mp_iget_on_stream(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ,
                      mp_window_t *src_window_t, mp_request_t *req, cudaStream_t stream);

// Wait
int mp_wait_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_wait_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);
CUDA stream-sync communications
Loop {
    mp_irecv(…);
    compute<<<…, stream>>>(buf);
    mp_isend_on_stream(…);
    mp_wait_all_on_stream(…);
}

[Diagram: CPU / stream / HCA timeline]
LIBMP STREAM-SYNC BATCH SUBMISSION

// Prepare single requests, IB Verbs side only
int mp_send_prepare(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Post set of prepared requests
int mp_isend_post_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_isend_post_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);
LIBMP KERNEL-SYNC COMMUNICATION
// Host side code
// extract descriptors from prepared requests
int mp::mlx5::get_descriptors(send_desc_t *send_info, mp_request_t *req);
int mp::mlx5::get_descriptors(wait_desc_t *wait_info, mp_request_t *req);

// Device side code
__device__ int mp::device::mlx5::send(send_desc_t *send_info);
__device__ int mp::device::mlx5::wait(wait_desc_t *wait_info);
Kernel-sync communication
Loop {
    mp_irecv(…, rreq);
    mp_get_descriptors(wi, rreq);
    mp_send_prepare(…, sreq);
    mp_get_descriptors(si, sreq);

    compute_and_communicate<<<…, stream>>>(buf, wi, si) {
        do_something(buf);
        mlx5::send(si);
        do_something_else();
        mlx5::wait(wi);
        keep_working();
    }
}

[Diagram: CPU / stream / HCA timeline]
ASYNC: experimental MPI BINDINGS
CODE SAMPLE
MPI_Comm comm = MPI_COMM_WORLD;
cudaStreamCreate (&stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    cudaStreamSynchronize(stream);
    MPI_Irecv(buf2, count, MPI_INT, peer, 0, comm, &req[0]);
    MPI_Isend(buf1, count, MPI_INT, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, statuses);
}
Standard GPU-aware MPI
STREAM-SYNCHRONOUS VERSION
MPI_Comm stream_comm;
cudaStreamCreate (&stream);
MPIX_Comm_dup_with_stream( MPI_COMM_WORLD, &stream_comm, &stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    MPI_Recv(buf2, count, MPI_BYTE, peer, 0, stream_comm);
    MPI_Send(buf1, count, MPI_BYTE, peer, 0, stream_comm);
}
MPIX_Wait_stream_completion(stream_comm);
Experimental MPI extensions
ASYNC BENCHMARKS
LibMP models
HPGMG-FV
CoMD
Lulesh2
LIBMP MODELS
Three different execution models:
• CPU-synchronous (default)
• CUDA stream-synchronous, CPU-asynchronous (SA): communications triggered by a CUDA stream
• CUDA kernel-synchronous (or Kernel-Initiated, KI): communications triggered by CUDA kernel threads
To evaluate performance, three multi-process applications were ported from MPI to LibMP.
Summary
54
HPGMG-CUDA
55
HPGMG-FV Overview
High Performance Geometric Multigrid (Lawrence Berkeley National Laboratory):
• Geometric multigrid solver for variable-coefficient elliptic problems on isotropic Cartesian grids using the Finite Volume method
• F-cycle: multiple V-cycles in which a finer grid is smoothed and restricted onto a coarser grid. A direct solver is applied to the coarsest grid and the solution is then iteratively interpolated back to the finer grids.
Multi-process execution:
• Each grid is divided into several same-size boxes
• Workload is fairly distributed among processes
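The V-cycle at the heart of the F-cycle described above can be sketched with a minimal 1D geometric multigrid solver. This is an illustration of the algorithm only, not HPGMG-FV code: the weighted Jacobi smoother, the 1D Poisson model problem, and the grid sizes are assumptions.

```python
import numpy as np

def smooth(u, f, h, iters=3):
    # weighted Jacobi relaxation for -u'' = f with Dirichlet boundaries
    for _ in range(iters):
        u[1:-1] += (2.0 / 3.0) * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict_fw(r):
    # full-weighting restriction onto the next coarser grid (sizes 2^k + 1)
    rc = np.zeros((len(r) - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def interpolate(ec, n):
    # linear interpolation of the coarse-grid correction back to the fine grid
    e = np.zeros(n)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) <= 3:
        # "direct solve" on the coarsest grid: a single interior unknown
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = smooth(u, f, h)                       # pre-smooth the finer grid
    r = residual(u, f, h)                     # compute the residual
    ec = v_cycle(np.zeros((len(u) - 1) // 2 + 1), restrict_fw(r), 2.0 * h)
    u += interpolate(ec, len(u))              # correct with the coarse solution
    return smooth(u, f, h)                    # post-smooth

# model problem: -u'' = pi^2 sin(pi x) on [0, 1], exact solution sin(pi x)
n = 2**6 + 1
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, h)
```

In the multi-process setting, the restrict/interpolate steps become the inter-level communications and the smoother's halo reads become the intra-level boundary exchanges.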
56
HPGMG-FV CUDA Overview
GPU acceleration
Multi-process and multi-GPU, MPI
Level size threshold:
• Lower levels computed on CPU
• Higher levels computed on GPU
57
HPGMG-FV CUDA Multi-process execution: MPI communications
[Diagram: a chain of levels from coarser CPU levels to finer GPU levels; each level runs Smoother, Residual, and Exchange boundary, with Restriction moving toward coarser levels and Interpolation moving back toward finer ones.]
Inter-level communication: restriction (moving from a finer level to a coarser level) and
interpolation (moving from a coarser level to a finer level)
Intra-level communication: exchange_boundary function (boundary region exchange between
processes)
59
EXCHANGE BOUNDARY IMPLEMENTATION
Code comparison: MPI vs SA-Model

MPI Version:
```cpp
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Isend(sBuf, …, sReqs[p]);
}
local_kernel<<<…, stream>>>();
MPI_Waitall(rReqs, …);
unpack_kernel<<<…, stream>>>(rBuf, …);
MPI_Waitall(sReqs, …);
```

LibMP SA-Model Version:
```cpp
for (p = 0; p < num_peers; p++) {
    mp_irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
for (p = 0; p < num_peers; p++) {
    mp_isend_on_stream(sBuf, …, sReqs[p], stream);
}
local_kernel<<<…, stream>>>();
mp_wait_all_on_stream(…, rReqs, stream);
unpack_kernel<<<…, stream>>>(rBuf, …);
mp_wait_all_on_stream(…, sReqs, stream);
```
60
MPI IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: Host, GPU Stream, and Network rows, with synchronization, CUDA kernels, and communications (Isend, WaitAll) marked.]
61
MPI IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: GPU idle time is visible on the GPU Stream row while the Host performs synchronization and communication.]
63
SA MODEL IMPLEMENTATION: Consecutive exchange_boundary() calls, CUDA Visual Profiler
[Profiler timeline: Host, GPU Stream, and Network rows; CUDA kernels and communications (Isend, WaitRecv, WaitIsend) are enqueued through CUDA 8.0 driver functions.]
64
SA MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB, OFED 3.2
• OpenMPI
• Up to 24% time improvement wrt MPI
• 4 processes, single GPU level:
• 64% CPU work time reduction
• 28% GPU activity time reduction
• The bigger the box size, the larger the messages: communication overhead becomes less important
65
KI MODEL IMPLEMENTATION: How to
For each exchange_boundary() execution:
• Kernel fusion: a single CUDA kernel instead of three
• Move communications inside the fused kernel
• Respect mutual dependencies (hard!)
[Diagram: the CPU launches a single fused kernel; the entire exchange_boundary runs inside that kernel on the GPU.]
67
KI MODEL IMPLEMENTATION: exchange_boundary() fused kernel – CUDA blocks organization
[Diagram: the fused kernel's CUDA blocks are partitioned into three sets (Set 1, Set 2, Set 3), coordinated via the CUB library and a GPU global-memory variable.]
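The coordination among block sets can be thought of as groups of workers synchronizing through a flag in global memory: one set packs and signals, another spins on the signal before unpacking. Below is a host-side Python emulation of that pattern; the two-thread structure, the names, and the "pack = multiply by 2" placeholder are illustrative assumptions, not the talk's kernel code.

```python
import threading
import time

# Stand-ins for GPU global memory: a flag plus the packed payload.
flag = {"packed": False}
result = {}

def set1_pack_and_signal(data):
    # "Set 1" blocks: pack the send buffer, then raise the global flag
    result["packed"] = [x * 2 for x in data]   # placeholder packing step
    flag["packed"] = True

def set3_wait_and_unpack():
    # "Set 3" blocks: spin on the global flag, then unpack the receive buffer
    while not flag["packed"]:
        time.sleep(0.001)
    result["unpacked"] = list(result["packed"])

t1 = threading.Thread(target=set1_pack_and_signal, args=([1, 2, 3],))
t3 = threading.Thread(target=set3_wait_and_unpack)
t3.start()
t1.start()
t1.join()
t3.join()
```

In the real fused kernel the sets are ranges of block indices and the flag lives in GPU global memory, with the CUB library helping with block-level cooperation.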
68
KI MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Up to 26% time improvement wrt MPI
• 4 processes, single GPU level:
• 77% CPU work time reduction
• 32% GPU activity time reduction
69
DGX-1 CLUSTER BENCHMARKS: Performance gain
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 2 processes only: gain about 8% lower than on Wilkes
• [Chart: gain wrt MPI for the CPU-synchronous default, the SA model, and the KI model]
70
COMD-CUDA
71
COMD-CUDA Implementation
• CoMD (www.exmatex.org project) is a classical molecular dynamics proxy application
• O(N²) within cutoff
• It considers materials where the interatomic potentials are short range and the
simulation requires the evaluation of all forces between atom pairs within the cutoff
distance
• Dependency between CUDA kernels and communications
• Explored SA-model only
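The O(N²)-within-cutoff interaction pattern can be sketched as a direct double loop over atom pairs. This is an illustrative Python sketch, not CoMD code: the Lennard-Jones potential and the toy atom positions are assumptions (CoMD additionally uses cell lists to avoid scanning all N² pairs).

```python
import math

def pair_energy(r, sigma=1.0, eps=1.0):
    # Lennard-Jones pair potential, a common short-range interatomic model
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def total_energy(atoms, cutoff):
    """O(N^2) sum over atom pairs, skipping pairs beyond the cutoff distance."""
    e = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i], atoms[j])
            if r < cutoff:
                e += pair_energy(r)
    return e

# toy system: only the first pair falls inside the cutoff,
# and it sits at the LJ minimum distance 2^(1/6) * sigma
atoms = [(0.0, 0.0, 0.0), (2.0 ** (1.0 / 6.0), 0.0, 0.0), (10.0, 0.0, 0.0)]
e = total_energy(atoms, cutoff=2.5)
```

The GPU kernels parallelize this pair evaluation, and the boundary atoms exchanged between processes are what the load/unload kernels below pack and unpack.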
72
SA MODEL IMPLEMENTATION
Communication periods – code comparison

MPI Version:
```cpp
Repeat for 3 times {
    load_kernel<<<…, stream>>>(sBufA, …);
    load_kernel<<<…, stream>>>(sBufB, …);
    cudaDeviceSynchronize();
    MPI_Sendrecv(sBufA, rBufA, …);
    MPI_Sendrecv(sBufB, rBufB, …);
    cuda_tasks(stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
}
```

LibMP SA-Model Version:
```cpp
for (p = 0; p < (3 * 2); p++) {
    mp_irecv(…, rReqs[p]);
}
for (p = 0; p < (3 * 2); p += 2) {
    load_kernel<<<…, stream>>>(sBufA, …);
    mp_isend_on_stream(…, sReqs[p], stream);
    load_kernel<<<…, stream>>>(sBufB, …);
    mp_isend_on_stream(…, sReqs[p+1], stream);
    cuda_tasks(stream);
    mp_wait_all_on_stream(…, rReqs+p, stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
    mp_wait_all_on_stream(…, sReqs+p, stream);
}
```
73
SA MODEL PERFORMANCE GAIN: Wilkes cluster – communications gain
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• 25% ~ 35% time gain
• Communication periods only
• Gain SA model wrt MPI
74
SA MODEL PERFORMANCE GAIN: DGX-1 cluster – communications
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 4 processes only: gain 13% lower than on Wilkes
• Communication periods only
• Gain SA model wrt MPI
75
LULESH2-CUDA
76
LULESH2-CUDA Implementation
• Proxy application developed at Lawrence Livermore National Laboratory
• “… It approximates the hydrodynamics equations discretely by partitioning the
spatial problem domain into a collection of volumetric elements defined by a
mesh.” ( https://codesign.llnl.gov/lulesh.php )
• Dependency between CUDA kernels and communications
• Explored SA-model only
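The "up to 26" exchanges in LULESH's communication code correspond to the neighbors of a subdomain in a 3D Cartesian decomposition: 6 faces, 12 edges, and 8 corners. A quick Python sketch of that enumeration (illustrative, not LULESH code):

```python
from itertools import product

def neighbor_offsets():
    """The 26 neighbor offsets of a subdomain in a 3D Cartesian
    decomposition: 6 faces + 12 edges + 8 corners."""
    return [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

offs = neighbor_offsets()
faces = [d for d in offs if sum(x != 0 for x in d) == 1]
edges = [d for d in offs if sum(x != 0 for x in d) == 2]
corners = [d for d in offs if sum(x != 0 for x in d) == 3]
```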
77
SA MODEL IMPLEMENTATION
Communication periods – code comparison

MPI Version:
```cpp
Repeat up to 26 times {
    MPI_Irecv(rBuf, …);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(sBuf, …);
    cudaDeviceSynchronize();
    cudaMemcpyAsync(…, stream);
    MPI_Isend(sBuf, …);
}
MPI_Waitall(…);
……
Repeat up to 26 times {
    MPI_Wait(…);
    cuda_kernel<<<…, stream>>>(rBuf, …);
    cudaMemcpyAsync(…, stream);
    cudaDeviceSynchronize();
}
```

LibMP SA-Model Version:
```cpp
Repeat up to 26 times {
    mp_irecv(rBuf, …, rReqs);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(…);
    cudaMemcpyAsync(…, stream);
    mp_isend_on_stream(…, sReqs, stream);
}
mp_wait_all_on_stream(…, sReqs, stream);
……
Repeat up to 26 times {
    mp_wait_on_stream(…, rReqs, stream);
    cudaMemcpyAsync(…, stream);
    cuda_kernel<<<…, stream>>>(…);
}
```
78
SA MODEL IMPLEMENTATION: Wilkes cluster performance gain
• Progressively increasing the number
of iterations to intensify the GPU
workload
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Average gain of 13%
• 27 and 64 processes
79
GAME OVER