NCCL 2 - GTC On-Demand Featured Talks (on-demand.gputechconf.com/gtc/2017/presentation/s...)
Sylvain Jeaugey
NCCL 2.0
DEEP LEARNING ON GPUS - Making DL training times shorter
Multi-core CPU → GPU (CUDA) → Multi-GPU (NCCL 1) → Multi-GPU, multi-node (NCCL 2)
Deeper neural networks, larger data sets… training is a very, very long operation!
NCCL - A multi-GPU communication library
Within a system: PCIe, NVLink, GPU Direct P2P
To other systems: Sockets (Ethernet), InfiniBand with GPU Direct RDMA
NCCL Architecture
[Stack: Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) → NCCL / cuBLAS / cuDNN → CUDA → NVIDIA GPUs]
AGENDA
NCCL: History, Design
NCCL 2.0: New features, API changes
Performance
Future
HISTORY
Q4 2015: NCCL 1.x
Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations.
Limited to intra-node.
Q2 2017: NCCL 2.x and beyond
NVIDIA library with multi-node support and an improved API.
DESIGN - What is NCCL?
Optimized collective communication library between CUDA devices.
Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
Operates on CUDA pointers. Operations are tied to a CUDA stream.
Uses as few threads as possible to let other computation progress simultaneously.
DESIGN Rings
NCCL uses rings to move data across all GPUs and perform reductions.
PCIe / QPI : 1 unidirectional ring
DGX-1: 4 unidirectional rings
DESIGN - Kernels
[Diagram: the reduction kernel combines data from sendbuff with chunks arriving from the previous GPU in the ring through a FIFO, stores the result in recvbuff, and forwards it to the next GPU in the ring]
NCCL 2.0
NCCL 2.0 - Inter-node communication
Inter-node communication using Sockets or InfiniBand verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.
Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.
PCIe, InfiniBand / DGX-1: NVLink, 4x InfiniBand
NCCL 2.0 - Processes, threads and GPUs
Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.
[Diagram: n nodes, 2 sockets per node (CPU0, CPU1), 4 GPUs per socket, GPUs 0-7 on each node]
Example: 1 process per GPU (processes P0 … P8n-1, 8 per node).
Example: 1 process per socket, 1 thread per GPU (processes 0 … 2n-1, each running threads t0-t3).
Example: 1 process per node, 8 GPUs per process (processes 0 … n-1).
NCCL 2.0 API - Group calls
NCCL 2.0 introduces new mandatory verbs, ncclGroupStart/ncclGroupEnd, when managing multiple devices from a single thread:

NCCL 1.x:
for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}

NCCL 2.0:
ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();
[Diagram: a single process (Process 0) driving all 8 GPUs of a node]
NCCL 2.0 API - Integration with parallel environments
Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();
[Diagrams: one process per GPU (P0-P7) on an 8-GPU node, and a single process (Process 0) driving all 8 GPUs of a node]
NCCL 2.0 API
Other small API adjustments over the NCCL 1.x API :
Counts are now of type size_t instead of int
allGather arguments order has been fixed to be similar to other operations
Additions/clarification on datatypes : integral : int8 = char, uint8, int32 = int, uint32, int64, uint64 floating point : float16 = half, float32 = float, float64 = double
Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations
Others
PERFORMANCE
PERFORMANCE - Intra-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) for 4 QPI, 4 CPU, 4 PCI and DGX-1 configurations; y-axis 0-60 GB/s]
PERFORMANCE - Inter-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); y-axis 0-45 GB/s]
PERFORMANCE - Deep Learning - CNTK
[Chart: CNTK ResNet50 scaling, images/s vs GPU count (0-32), comparing Ideal, MPI and NCCL; reported points include 217, 1645, 1684, 1744, 3281, 3360 and 6569 images/s]
FUTURE
Top asked features
Additional communication primitives: point-to-point communication; scatter (1 to N), gather (N to 1), alltoall (N to N); neighbor collectives (send/receive in multiple dimensions)
User-defined reduction operations; also, trying to better merge computation and communication
Windows support
Please let us know your needs! Connect with experts / NCCL session: Wed Apr 10, 4pm