Transcript of 217-f19/lectures/Multi-GPU.pdf
Multi-GPU Systems
GPUs becoming more specialized
Modern GPU “Processing Block”
● 32 Threads
● 16 INT
● 16 single-precision FP
● 8 double-precision FP
● 4 SFU (sin, cos, log)
● 2 Tensor units for DNN
● 64KB RF
GPU Streaming Multiprocessor
● Contains 4 “Processing Blocks”
● Each independently schedules a set of 32 threads called a warp
● Blocks share an L1 cache
GPU Hardware
● V100 has 80 SMs
● 5376 FPUs
● Peak 15.7 TFLOPS
GPU “Data center in a box”
› DGX
› A Multi-GPU “Node”
› 300 GB/s NVLink 2.0 cube mesh
› 1 PFLOPS
› Faster Machine Learning
DGX Data Center
GPU Support in Cloud Computing Stack
GPUs in the Cloud
› Exponential demand for more compute power
GPU interconnects are getting complex
Picture source: NVLink whitepaper
[Figure: four NVLink links at 20 GB/s each — 80 (in) + 80 (out) = 160 GB/s total]
GPU interconnects are getting complex
Picture source: NVLink whitepaper
[Figure: NVLink 2.0 links at 25–50 GB/s — 150 (in) + 150 (out) = 300 GB/s total]
GPU interconnects are getting complex
How can we make efficient use of GPU interconnects?
NVLink: fast communication among multiple GPUs
Challenges of complex GPU interconnects
› Programming Multi-GPU applications is hard
Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA
NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS
BACKGROUND
Efficiency of parallel computation tasks
• Amount of exposed parallelism
• Amount of work assigned to each processor
Expense of communications among tasks
• Amount of communication
• Degree of overlap of communication with computation
What limits the scalability of parallel applications?
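The tradeoff above can be put in a toy model. This is my own illustration, not from the slides: compute time shrinks with processor count, but any communication that is not overlapped with computation does not, and eventually caps the speedup.

```python
# Toy scaling model (illustrative): work/p is the parallel compute time,
# comm * (1 - overlap) is the non-overlapped communication per step.

def speedup(work, comm, p, overlap=0.0):
    """Speedup over one processor; `overlap` is the fraction of
    communication hidden behind computation."""
    parallel_time = work / p + comm * (1.0 - overlap)
    return work / parallel_time

# 100 units of work, 1 unit of communication: speedup flattens far below p
for p in (1, 4, 16, 64):
    print(p, round(speedup(100.0, 1.0, p), 1))
```

With full overlap the communication term disappears and the model recovers ideal speedup, which is why the degree of overlap matters as much as the amount of communication.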
COMMON COMMUNICATION PATTERNS
COMMUNICATION AMONG TASKS
Point-to-point communication
• Single sender, single receiver
• Relatively easy to implement efficiently
Collective communication
• Multiple senders and/or receivers
• Patterns include broadcast, scatter, gather, reduce, all-to-all, …
• Difficult to implement efficiently
What are common communication patterns?
POINT-TO-POINT COMMUNICATION
Most common pattern in HPC, where communication is usually to nearest neighbors
Single-sender, single-receiver per instance
2D Decomposition
Boundary exchanges
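In one dimension, the boundary exchange above reduces to each rank trading edge cells with its two ring neighbors. A rough sketch of that exchange (the function name and ghost-cell layout are mine, not from the slides):

```python
# Hypothetical sketch: 1D halo exchange on a ring of ranks. Each rank's
# block has one ghost cell at each end; the interior is its own data.

def exchange_boundaries(blocks):
    """Fill each block's ghost cells from its ring neighbors."""
    k = len(blocks)
    for r in range(k):
        left, right = (r - 1) % k, (r + 1) % k
        # Reads touch only interior cells ([1] and [-2]), which are
        # never written here, so in-place updates are safe.
        blocks[r][0] = blocks[left][-2]    # left neighbor's edge cell
        blocks[r][-1] = blocks[right][1]   # right neighbor's edge cell
    return blocks

# Two ranks; 0s are ghost cells around interior data [1, 2] and [3, 4]
blocks = exchange_boundaries([[0, 1, 2, 0], [0, 3, 4, 0]])
# -> [[4, 1, 2, 3], [2, 3, 4, 1]]
```

Each instance is a single-sender, single-receiver transfer, which is why this nearest-neighbor pattern is comparatively easy to implement efficiently.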
COLLECTIVE COMMUNICATION
Multiple senders and/or receivers
BROADCAST
One sender, multiple receivers
GPU0 GPU1 GPU2 GPU3
A
GPU0 GPU1 GPU2 GPU3
A A A A
broadcast
SCATTER
One sender; data is distributed among multiple receivers
scatter
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A
B
C
D
GATHER
Multiple senders, one receiver
GPU0 GPU1 GPU2 GPU3
A
B
C
D
GPU0 GPU1 GPU2 GPU3
A B C D
gather
ALL-GATHER
Gather messages from all; deliver gathered data to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A A A A
B B B B
C C C C
D D D D
all-gather
REDUCE
Combine data from all senders; deliver the result to one receiver
GPU0 GPU1 GPU2 GPU3
A B C D
reduce
GPU0 GPU1 GPU2 GPU3
A+B+C+D
ALL-REDUCE
Combine data from all senders; deliver the result to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
all-reduce
GPU0 GPU1 GPU2 GPU3
A+B+C+D
A+B+C+D
A+B+C+D
A+B+C+D
REDUCE-SCATTER
Combine data from all senders; distribute result across participants

GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3

reduce-scatter

GPU0 GPU1 GPU2 GPU3
A0+B0+C0+D0
A1+B1+C1+D1
A2+B2+C2+D2
A3+B3+C3+D3
ALL-TO-ALL
Scatter/gather distinct messages from each participant to every other
GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3
all-to-all
GPU0 GPU1 GPU2 GPU3
A0 A1 A2 A3
B0 B1 B2 B3
C0 C1 C2 C3
D0 D1 D2 D3
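The data movement of the patterns above can be written out as plain functions over per-GPU buffers. This is an illustrative model of the semantics only (the names and list-based representation are mine, not NCCL's API):

```python
# Each pattern mapped onto per-GPU buffers; bufs[g] is GPU g's data,
# chunked[g][i] is chunk i held by GPU g.

def broadcast(bufs, root=0):
    return [bufs[root] for _ in bufs]          # everyone gets root's data

def scatter(bufs, root=0):
    return list(bufs[root])                    # chunk i goes to GPU i

def gather(bufs, root=0):
    return list(bufs)                          # root collects every chunk

def all_gather(bufs):
    return [list(bufs) for _ in bufs]          # everyone collects everything

def reduce(bufs, root=0):
    return sum(bufs)                           # result lands on root only

def all_reduce(bufs):
    s = sum(bufs)
    return [s for _ in bufs]                   # result lands everywhere

def reduce_scatter(chunked):
    k = len(chunked)                           # GPU i keeps the sum of chunk i
    return [sum(chunked[g][i] for g in range(k)) for i in range(k)]

def all_to_all(chunked):
    k = len(chunked)                           # a transpose of the chunks
    return [[chunked[g][i] for g in range(k)] for i in range(k)]

all_reduce([1, 2, 3, 4])        # -> [10, 10, 10, 10]
all_to_all([[0, 1], [2, 3]])    # -> [[0, 2], [1, 3]]
```

Note that all-reduce is exactly reduce-scatter followed by all-gather, a decomposition the ring algorithm later in the deck exploits.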
THE CHALLENGE OF COLLECTIVES
THE CHALLENGE OF COLLECTIVES
Having multiple senders and/or receivers compounds communication inefficiencies
• For small transfers, latencies dominate; more participants increase latency
• For large transfers, bandwidth is key; bottlenecks are easily exposed
• May require topology-aware implementation for high performance
• Collectives are often blocking/non-overlapped
Collectives are often avoided because they are expensive. Why?
THE CHALLENGE OF COLLECTIVES
Collectives are central to scalability in a variety of key applications:
• Deep Learning (All-reduce, broadcast, gather)
• Parallel FFT (Transposition is all-to-all)
• Molecular Dynamics (All-reduce)
• Graph Analytics (All-to-all)
• …
If collectives are so expensive, do they actually get used? YES!
THE CHALLENGE OF COLLECTIVES
Scaling requires efficient communication algorithms and careful implementation
Communication algorithms are topology-dependent
Topologies can be complex – not every system is a fat tree
Most collectives are amenable to bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan]
Many implementations seen in the wild are suboptimal
RING-BASED COLLECTIVES: A PRIMER
BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Step 1: ∆𝑡 = 𝑁/𝐵
Step 2: ∆𝑡 = 𝑁/𝐵
Step 3: ∆𝑡 = 𝑁/𝐵
Total time: (𝑘 − 1)𝑁/𝐵

𝑁: bytes to broadcast
𝐵: bandwidth of each link
𝑘: number of GPUs
BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)
...
Total time: 𝑆·𝑁/(𝑆𝐵) + (𝑘 − 2)·𝑁/(𝑆𝐵) = 𝑁(𝑆 + 𝑘 − 2)/(𝑆𝐵) → 𝑁/𝐵
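The two broadcast cost models are easy to check numerically. A small sketch (the parameter values are illustrative, not from the slides):

```python
# Cost of broadcasting N bytes around a k-GPU ring with per-link
# bandwidth B: unsplit vs. split into S pipelined messages.

def naive_broadcast_time(N, B, k):
    return (k - 1) * N / B

def pipelined_broadcast_time(N, B, k, S):
    return N * (S + k - 2) / (S * B)

N, B, k = float(1 << 30), 12e9, 4       # 1 GiB, ~12 GB/s link, 4 GPUs
for S in (1, 8, 64, 1024):
    t = pipelined_broadcast_time(N, B, k, S)
    # ratio to the single-link lower bound N/B approaches 1 as S grows
    print(S, t / (N / B))
```

With S = 1 the pipelined formula collapses to the naive (k − 1)N/B; as S grows, the startup term (k − 2)N/(SB) vanishes and the total approaches the single-link time N/B.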
ALL-REDUCE
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

[Animation frames: chunk 1 over steps 0–7, chunk 2 over steps 0–2, … — each chunk circulates the ring until every GPU holds the reduced result, then done]
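The animation above can be simulated step by step. This is my own minimal sketch of a ring all-reduce (not NCCL's kernel): each GPU's data is split into k chunks, a reduce-scatter phase of k−1 steps leaves each GPU with one fully reduced chunk, and an all-gather phase of k−1 steps circulates the reduced chunks, with only one chunk crossing each link per step.

```python
# Ring all-reduce simulation: data[g][c] is chunk c on GPU g.

def ring_all_reduce(data):
    k = len(data)
    chunks = [list(row) for row in data]
    # Reduce-scatter: in step s, GPU g forwards chunk (g - s) mod k to
    # GPU g+1, which accumulates it. Values are read before any write
    # to model simultaneous transfers.
    for s in range(k - 1):
        sends = [(g, (g - s) % k) for g in range(k)]
        vals = [chunks[g][c] for g, c in sends]
        for (g, c), v in zip(sends, vals):
            chunks[(g + 1) % k][c] += v
    # GPU g now owns the complete sum for chunk (g + 1) mod k.
    # All-gather: circulate the completed chunks, overwriting.
    for s in range(k - 1):
        sends = [(g, (g + 1 - s) % k) for g in range(k)]
        vals = [chunks[g][c] for g, c in sends]
        for (g, c), v in zip(sends, vals):
            chunks[(g + 1) % k][c] = v
    return chunks

ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# -> every GPU ends with [12, 15, 18]
```

Because every link carries data in every step, this uses the full ring bandwidth, which is the sense in which ring collectives are bandwidth-optimal.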
RING-BASED COLLECTIVES
A primer

[Diagram: 4-GPU-PCIe — GPU0–GPU3 share a PCIe switch attached to the CPU; PCIe Gen3 x16 ~12 GB/s]
RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Diagram: two-socket node — GPU0–GPU3 behind one CPU's PCIe switch, GPU4–GPU7 behind the other; PCIe Gen3 x16 ~12 GB/s; CPUs linked by an SMP connection (e.g., QPI)]
RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Diagram: 4-GPU-FC (fully connected) and 4-GPU-Ring topologies — GPU0–GPU3 connected by NVLink (~16 GB/s per link; 32 GB/s labels in the figure), each also attached to the CPU through a PCIe switch (PCIe Gen3 x16 ~12 GB/s)]
INTRODUCING NCCL (“NICKEL”):
ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS
INTRODUCING NCCL
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top of, to handle inter-node communication
Accelerating multi-GPU collective communications
NCCL FEATURES AND FUTURES
Collectives
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood

Key Features
• Single-node, up to 8 GPUs
• Host-side API
• Asynchronous/non-blocking interface
• Multi-thread, multi-process support
• In-place and out-of-place operation
• Integration with MPI
• Topology Detection
• NVLink & PCIe/QPI* support
(Green = Currently available)
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining the following:
• GPUDirect P2P Direct Access
• Three primitive operations: Copy, Reduce, ReduceAndCopy
• Intra-kernel synchronization between GPUs
• One CUDA thread block per ring-direction
NCCL EXAMPLE
#include <nccl.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);   // one communicator per device

foreach g in (GPUs) { // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}
All-reduce
NCCL PERFORMANCE
Bandwidth at different problem sizes (4 Maxwell GPUs)

[Chart: bandwidth vs. problem size for All-Gather, All-Reduce, Reduce-Scatter, and Broadcast]
AVAILABLE NOW
github.com/NVIDIA/nccl
THANKS TO MY COLLABORATORS
Nathan Luehr
Jonas Lippuner
Przemek Tredak
Sylvain Jeaugey
Natalia Gimelshein
Simon Layton
This research is funded in part by the U.S. DOE DesignForward program