Transcript of "HPC, GPUS AND NETWORKING", MUG 2014 (mug.mvapich.cse.ohio-state.edu)
Davide Rossetti, Senior Engineer, CUDA Driver
HPC, GPUS AND NETWORKING
INDEX
GPU-friendly networks
history: APEnet+
research: GGAS, PEACH2,
corporate: w/ MLNX OFED 1.5.4.1+GDR, OFED 2.1/2.2
GDR enabler for low-perf CPUs
current limitations and WARs (workarounds)
accelerate eager for GPU
perspective: I/O bus limited -> NVLINK
dream: Frankenstein board
SCALING PROBLEM
many interesting apps bang on the scaling wall
single big problem = capability computing
fixed physical problem size
parallel decomposition
lots of GPUs O(10^2-10^3)
for different reasons:
load imbalance
poor coding
overheads in the GPU<->network interaction, even in optimized apps
let’s move the wall further away…
GPU-friendly networks
RESEARCH
APEnet+ card:
Network Processor, SoC design
FPGA based
8-ports switch
6 bidirectional links (max 34 Gbps)
PCIe X8 Gen2 in X16 (4+4 GB/s)
32bit RISC
Accelerators:
Zero-copy RDMA host interface
GPU P2P support
P2P/GPUDirectRDMA on APEnet+
[*] R.Ammendola et al, “GPU peer-to-peer techniques applied to a cluster interconnect”, CASS 2013
DIRECT VS (PIPELINED) STAGING

Staging to host memory:
14us latency
7-10us penalty for each cudaMemcpy
< 1us NIC HW latency
SW complexity
pipelined algorithm
MVAPICH2, OpenMPI
Direct path:
7us latency (11/2011)
GPU<->GPU path
both NV P2P and GPUDirectRDMA
GPU<->3rd party devices
HPC scaling
Other applications: HEP, Astronomy
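The direct-vs-staging trade-off above can be put in a back-of-envelope model. This is my own sketch, not from the slide: `staged_time_us` and its parameters are illustrative, modeling a message staged through host memory in chunks, where the copy of chunk i overlaps the send of chunk i-1.

```c
/* Hypothetical model of pipelined staging: each chunk pays a fixed
 * cudaMemcpy-style overhead `per_copy_us`, but copies overlap sends,
 * so after the first chunk only the slower stage is on the critical
 * path. Times in microseconds; GB/s == 1e3 bytes/us. */
double staged_time_us(double bytes, int chunks,
                      double per_copy_us,
                      double copy_bw_gbs, double net_bw_gbs)
{
    double chunk  = bytes / chunks;
    double t_copy = per_copy_us + chunk / (copy_bw_gbs * 1e3);
    double t_send = chunk / (net_bw_gbs * 1e3);
    double t_max  = t_copy > t_send ? t_copy : t_send; /* pipeline bottleneck */
    return t_copy + (chunks - 1) * t_max + t_send;     /* fill + steady + drain */
}
```

With a 7us per-copy penalty, chunking pays off for large messages but hurts small ones, which is exactly why the direct path matters at low message sizes.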
APENET
GPU-NETWORK FUSION CARD
Frankenstein board:
take half of a "GTX 590"
add a dedicated APEnet NIC
tune the network bandwidth
put 1, 2, ... per server
perfect for HPC
[diagram: GPU and NIC behind a PLX PCIe switch]
RESEARCH
PEACH2 @tsukuba
GGAS=Global GPU Address Spaces @Heidelberg
PEACH2 & GGAS
Center for Computational Sciences, Univ. of Tsukuba
PEACH2 board (production version for HA-PACS/TCA)
PCI Express Gen2 x8 peripheral board
Compatible with PCIe Spec.
[photos: top and side views of the PEACH2 board]
GPUDIRECTRDMA ON MELLANOX IB BOARDS
GPUDirect RDMA support
early prototype started in Feb 2013
presented at GTC 2013 on MVAPICH2 1.9
rework for Mellanox SW stack
May 2013 - Dec 2013
showcased at SC’13 at Mellanox booth
shipping since 1/2014 MOFED 2.1
OpenMPI 1.7.4
MVAPICH2 2.0b
GPUDIRECT RDMA + INFINIBAND + CUDA-AWARE MPI
Less overhead = lower latency and more effective bandwidth
When GDRDMA is effective:
3x more BW
~70% less latency
Not always good:
Experimentally, cross-over behavior at 8-16KB
GDRDMA read bandwidth capped at 800MB/s on SNB
GDRDMA write bandwidth saturates the link
2 nodes MPI benchmarks
with and without GPUDirect RDMA
Intel SNB, K20c, MLNX FDR PCIe X8 gen3
ON IVY BRIDGE XEON
MPI USING GPUDIRECT RDMA
any space for improvement ?
[charts: GPU-GPU internode MPI latency with MVAPICH2-GDR 2.0b, from a June '14 webinar. Curves: 1-Rail, 2-Rail, 1-Rail-GDR, 2-Rail-GDR. Small-message chart (1 B - 4 KB): 5.49 usec with GDR, a 67% improvement; large-message chart (8 KB - 2 MB): about 10% improvement. Setup: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in.]
Performance of MVAPICH2 with GPUDirect-RDMA: latency back in GTC14 time…
BENCHMARK RESULTS
MV2-GDR 2.0b original version
MV2-GDR-New experimental version
latency
[chart: GPU-GPU internode MPI small-message latency (1 B - 4 KB), from a June '14 webinar ("Performance of MVAPICH2 with GPUDirect-RDMA: Latency"). MV2-GDR 2.0b: 7.3 usec; MV2-GDR-New (Loopback): 5.0 usec; MV2-GDR-New (Fast Copy): 3.5 usec, a 52% improvement. Setup: Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in.]
RESTORING EAGER
Eager protocol on IB:
sender:
copy on pre-registered TX buffers
ibv_post_send, on UD,UC,RC,…
receiver:
pre-post temp RX buffers, use credits for proper bookkeeping
RX matching then copy to final destination
recalling Eager protocol
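The credit bookkeeping above can be sketched in a few lines. This is an illustrative sketch of the mechanism, not any real MPI's implementation: `eager_state` and the function names are mine; in practice credits are returned piggybacked on reverse traffic.

```c
/* Sketch of eager-protocol credit bookkeeping: the receiver pre-posts
 * NCRED temporary RX buffers; the sender may only post an eager send
 * while it holds a credit (one pre-posted RX buffer is guaranteed),
 * and gets the credit back once the receiver has matched the message
 * and copied it out of the temp buffer. */
#define NCRED 16

typedef struct { int credits; } eager_state;

void eager_init(eager_state *s) { s->credits = NCRED; }

/* returns 1 if an eager ibv_post_send may fire now, 0 -> stall/fallback */
int eager_try_send(eager_state *s)
{
    if (s->credits == 0) return 0;
    s->credits--;              /* consumes one pre-posted RX buffer */
    return 1;
}

/* receiver matched + copied the message and reposted its temp buffer */
void eager_credit_return(eager_state *s) { s->credits++; }
```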
RESTORING EAGER
get rid of rendezvous because of:
bad interplay with apps
excessive sync among nodes
more round-trips
Problem:
staging: moving data from host temp buf to GPU final dest
cudaMemcpy has >4us overhead (~24KB threshold at 6GB/s)
Eager essential tool for low-latency / small msgs
RESTORING EAGER
IB loopback
use NIC as low-latency copy engine
GDRCopy
CPU-driven zero-latency BAR1 copy
Two tricks
RX IB LOOPBACK FLOW
[diagram: the TX side sends from GPU MEM src_buf through its Mellanox Connect-IB NIC; on the RX side the message lands in a posted CPU MEM temp buffer, and the local NIC is then used as a loopback copy engine to move the data into the GPU MEM dst_buf.]
EAGER WITH GDRCOPY
experimental super low-latency copy library
three sets of primitives:
Pin/Unpin, setup/tear down BAR1 mappings of GPU memory
Map/Unmap, memory-map BAR1 on user-space CPU address range
CPU can use standard load/store instructions (MMIO) to access the GPU memory.
Copy to/from PCIe BAR, highly tuned R/W functions
(ab)use CPU to copy data to/from GPU mem
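Once BAR1 is mapped, the Copy primitives boil down to the CPU moving data with plain load/store instructions. A host-only stand-in (ordinary host memory plays the role of the mapped BAR1 window here, so no GPU is needed to run it; on a real mapping, access width and alignment matter a lot):

```c
#include <stdint.h>
#include <stddef.h>

/* MMIO-style copy: the CPU itself issues the stores instead of
 * programming a DMA engine. On a real BAR1 mapping each 64-bit
 * store becomes a PCIe write to GPU memory. */
void mmio_style_copy(volatile uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}
```

This is the sense in which the library "(ab)uses" the CPU: zero setup latency per copy, at the cost of CPU cycles and sensitivity to MMIO performance.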
MPI RX GDRCOPY FLOW
[diagram: TX side as before; on the RX side the message lands in a posted CPU MEM temp buffer and the CPU copies it into the GPU MEM dst_buf over the GDRCopy data path, i.e. through the mapped BAR1 window, with no DMA engine involved.]
GDRCOPY
zero latency vs ~1us for loopback vs 4-6us for cudaMemcpy
D-H: 20-30MB/s
H-D: 6GB/s vs 9GB/s for cudaMemcpy
sensitive to
CPU MMIO performance
NUMA effects, eg on IVB D-H=3GB/s when on wrong socket
available soon on https://github.com/drossetti/gdrcopy
performance
BENCHMARK RESULTS
on IVB Xeon + K40m + MLNX C-IB
Latency in us by message size:

| Size (B) | MV-2.0b | MV2-GDR-2.0b | MV2-GDR | loopback | gdrcopy |
|---------:|--------:|-------------:|--------:|---------:|--------:|
| 0 | 1.27 | 1.31 | 1.16 | 1.22 | 1.21 |
| 1 | 19.23 | 7.03 | 5.29 | 4.77 | 3.19 |
| 2 | 19.26 | 7.02 | 5.28 | 4.78 | 3.18 |
| 4 | 19.35 | 7.01 | 5.26 | 4.77 | 3.17 |
| 8 | 19.45 | 7.00 | 5.26 | 4.79 | 3.17 |
* with MVAPICH2 GDR pre-release
BENCHMARK RESULTS
bandwidth
[charts: GPU-GPU internode MPI bandwidth and bi-bandwidth, from a June '14 webinar ("Performance of MVAPICH2 with GPUDirect-RDMA: BW & BiBW"). Curves: MV2-GDR-2.0b, MV2-GDR-New (Loopback), MV2-GDR-New (Fast Copy), message sizes 1 B - 4 KB. Small-message bandwidth improves 2.2x, bi-bandwidth 2.1x. Setup: Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in.]
coming soon…
MULTI-GPU APPS SCALING PROBLEM
The Glotzer Group @ University of Michigan
Time Spent in Subroutines
[chart: time per step (ms) vs number of GPUs P = 1..16; HOOMD-blue with MVAPICH2 2.0b, K20X GPU, 64000 particles. Subroutines: LJ Pair, Ghost Update, Neighbor List, NVT1, NVT2, NetForce. Compute kernels scale; communication cost is constant.]
HOOMD-BLUE PROFILING
MV2-GDR 2.0b
[profiler timeline: MPI launch latency visible; GPU BAR1 traffic happens here but is invisible to the profiler; CPU idle.]
LIMITED SCALING
CPU mgmt, eg cudaStreamQuery() & MPI_Test():
poll GPU kernels and trigger NIC comm
poll NIC comm then schedule GPU kernels
Today scaling needs fast CPU
implications for ARM
slow CPU not enough at ~ 10^2 GPUs
how it works today
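The loop above, caricatured in host-only C: the two flags stand in for the results of cudaStreamQuery()/MPI_Test(), so the control structure is testable without a GPU or NIC. The struct and names are illustrative, not any library's.

```c
/* One step of a CPU-driven progress engine: poll kernel completion
 * and trigger the NIC, poll NIC completion and schedule the next
 * kernel. The CPU must spin here, which is why a slow CPU becomes
 * the bottleneck at scale. */
typedef struct {
    int kernel_done;    /* stand-in for cudaStreamQuery() == cudaSuccess */
    int comm_done;      /* stand-in for MPI_Test() completing */
    int comm_started;
    int next_launched;
} node;

void progress(node *n)
{
    if (n->kernel_done && !n->comm_started)
        n->comm_started = 1;                 /* trigger NIC comm */
    if (n->comm_started && n->comm_done && !n->next_launched)
        n->next_launched = 1;                /* schedule next GPU kernel */
}
```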
1D STENCIL COMPUTATION
```c
// 1D stencil step using hypothetical stream-aware MPI calls (MPIX_*)
halo<<<stream0>>>(h0, p0);
MPIX_Isend(p0, stream0, &req0[0]);
halo<<<stream1>>>(h1, p1);
MPIX_Isend(p1, stream1, &req1[0]);
bulk<<<stream2>>>(d);
MPIX_Irecv(h0, stream0, &req0[1]);
MPIX_Wait(stream0, req0, 2);       // waits for both the Isend and the Irecv
cudaEventRecord(e0, stream0);      // event record will wake the stream-wait below
MPIX_Irecv(h1, stream1, &req1[1]);
MPIX_Wait(stream1, req1, 2);
cudaEventRecord(e1, stream1);
cudaStreamWaitEvent(stream2, e0);
cudaStreamWaitEvent(stream2, e1);
Energy<<<stream2>>>(d);
cudaStreamSynchronize(stream2);
```
expand into CUDA + IBverbs + verbs/CUDA interop
DEPENDENCY GRAPH
[diagram: dependency graph across stream0, stream1 and stream2, split between the GPU command engine and compute engine. stream0 and stream1 each run: halo kernel -> Isend -> Irecv -> Wait -> event record; stream2 runs the bulk kernel, then waits on both recorded events (stream event wait) before running the energy kernel.]
NEXT GEN I/O
NVlink 1.0 in Pascal
beyond PCIe
additional data path, 80-200GB/s
NVlink attached network ?
SUMMARY
P2P: platforms (IVB) still limited; Haswell?
challenge: manage multiple data paths and topologies, eventually NVLINK
reduce overheads: use the GPU scheduler
Thank You !!!
Questions ?
PCI BAR?
BAR = Base Address Register
PCI resource:
up to six 32-bit BAR registers
a 64-bit BAR uses two registers
a physical address range
prefetchable ~ cacheable [danger here]
most GPUs expose 3 BARs
BAR1 space varies:
K20/40c = 256MB
K40m = 16GB [new!]
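Decoding the two-register 64-bit BAR mentioned above, as a sketch: the low register carries the flag bits (bit 0 = I/O, bits 2:1 = type, bit 3 = prefetchable) plus the low address bits, the next register carries the high 32 address bits. The helper name is mine; the masking follows the PCI config-space layout.

```c
#include <stdint.h>

/* Assemble the base address of a 64-bit memory BAR from its pair of
 * consecutive 32-bit config-space registers, masking off the low
 * flag bits. */
uint64_t bar64_base(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & ~0xFULL);
}
```

For example, the Region 1 shown in the lspci output below (prefetchable 64-bit memory at b0000000) would have a low register reading like 0xb000000c: base 0xb0000000 plus flags 0xc (64-bit type + prefetchable).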
```
$ lspci -vv -s 1:0.0
01:00.0 3D controller: NVIDIA Corporation Device 1024 (rev a1)
        Subsystem: NVIDIA Corporation Device 0983
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 86
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at c0000000 (64-bit, prefetchable) [size=32M]

$ nvidia-smi
…
BAR1 Memory Usage
```
GPU BAR
BAR1 = aperture on GPU memory
GPUDirectRDMA == GPU BAR1 as used by 3rd party devices:
network cards
storage devices
SSD
reconfigurable devices
FPGAs
[diagram: CPU with SYSMEM, GPU, and a 3rd-party device on PCIe; the device's DMA engine reaches GPU device memory through the BAR1 aperture.]
PLATFORM-RELATED LIMITATIONS
PCIE topology and NUMA effects:
number and type of traversed chipsets/bridges/switches
GPU memory reading BW most affected
Sandy Bridge Xeon severely limited
On older chipsets, writing BW affected too
crossing inter-CPU bus huge bottleneck
currently allowed though
NUMA EFFECTS AND PCIE TOPOLOGY
GPU BAR1 reading is half PCIE peak, by design
GPUDirect RDMA performance may suffer from bottlenecks
GPU memory reading most affected
old chipsets, writing too
IVB Xeon PCIE much better
MPI STRATEGY
use GPUDirectRDMA for small/mid size buffers
threshold depends on platform
and on NUMA (e.g. crossing QPI)
don’t tax BAR1 too much
bookkeeping needed [being implemented]
revert to pipelined staging through SYSMEM
GPU CE latency >= 6us
no Eager-send for G-to-G/H
excessive inter-node sync
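The strategy above as a tiny selector. The threshold values here are assumed for illustration only, not MVAPICH2's actual tuning; the point is the shape of the decision: size-dependent, NUMA-aware, and bounded by BAR1 bookkeeping.

```c
/* Choose the transfer path per the strategy: GPUDirect RDMA for
 * small/mid buffers (smaller cutoff when the path crosses QPI),
 * pipelined staging through SYSMEM otherwise or when BAR1 space
 * is exhausted. Thresholds are placeholders. */
enum path { PATH_GDRDMA, PATH_STAGED };

enum path pick_path(long bytes, int crosses_qpi, long bar1_free)
{
    long limit = crosses_qpi ? 8192 : 32768;  /* assumed platform tuning */
    if (bytes > limit || bytes > bar1_free)
        return PATH_STAGED;
    return PATH_GDRDMA;
}
```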
PLATFORM & BENCHMARKS
SMC 2U SYS-2027GR-TRFH
IVB Xeon
PEX 8747 PCIE switch, on a riser card
SMC 1U SYS-1027GR-TRF
NVIDIA K40m
Mellanox dual-port Connect-IB
Benchmarks:
GPU-extended ibv_ud_pingpong
GPU-extended ib_rdma_bw
HOST TO GPU (IVY BRIDGE)
[diagram: two nodes, each an IVB Xeon with SYSMEM, a GK110B GPU and a Mellanox Connect-IB HCA, both on x16 Gen3, linked by single-rail FDR at 6.1 GB/s; TX side sends from host memory, RX side receives into GPU memory.]
BW: 6.1 GB/s, lat: 1.7 us
GPU TO HOST (IVY BRIDGE)
[diagram: same two-node IVB Xeon + GK110B + Mellanox Connect-IB setup on x16 Gen3, single-rail FDR at 6.1 GB/s; TX side sends from GPU memory, RX side receives into host memory.]
BW: 3.4/3.7* GB/s, lat: 1.7** us
GPU TO GPU (IVY BRIDGE)
[diagram: same setup; TX side sends from GPU memory, RX side receives into GPU memory.]
BW: 3.4/3.7* GB/s, lat: 1.9 us
GPU TO HOST (SANDY BRIDGE)
[diagram: two nodes, each an SNB Xeon with a GK110 GPU on x16 Gen2 and a Mellanox ConnectX-3 HCA on x8 Gen3, single-rail FDR at 6.1 GB/s; TX side sends from GPU memory, RX side receives into host memory.]
BW: 800 MB/s
GPU TO GPU, TX ACROSS QPI
[diagram: dual-socket IVB Xeon nodes; on the TX side the GPU and the NIC hang off different sockets, so the data crosses QPI; single-rail FDR at 6.1 GB/s.]
BW: 1.1 GB/s, lat: 1.9 us
GPU TO GPU, RX ACROSS QPI
[diagram: dual-socket IVB Xeon nodes; on the RX side the GPU and the NIC hang off different sockets, so the data crosses QPI; single-rail FDR at 6.1 GB/s.]
BW: 0.25 GB/s, lat: 2.2 us
HOST TO GPU, RX ACROSS PLX
[diagram: each node has its GK110B GPU and Mellanox Connect-IB HCA behind a PLX PCIe switch, x16 Gen3 on both switch ports, single-rail FDR at 6.1 GB/s; TX side sends from host memory, RX side receives into GPU memory across the PLX.]
BW: 6.1 GB/s, lat: 1.9 us
GPU TO GPU, TX/RX ACROSS PLX
[diagram: same PLX topology on both sides; TX sends from GPU memory and RX receives into GPU memory, each crossing its PLX switch; single-rail FDR at 6.1 GB/s.]
BW: 5.8 GB/s, lat: 1.9 us