Transcript of 217-f19/lectures/Multi-GPU.pdf
Multi-GPU Systems
GPUs becoming more specialized
Modern GPU “Processing Block”
● 32 Threads
● 16 INT
● 16 single-precision FP
● 8 double-precision FP
● 4 SFU (sin, cos, log)
● 2 Tensor units for DNN
● 64KB RF
GPU Streaming Multiprocessor
● Contains 4 “Processing Blocks”
● Each independently schedules a set of 32 threads called a warp
● Blocks share an L1 cache
GPU Hardware
● V100 has 80 SMs
● 5376 FPUs
● Peak 15.7 TFLOPS
GPU “Data center in a box”
› DGX
› A Multi-GPU “Node”
› 300 GB/s NVLink 2.0 cube mesh
› 1 PFLOPS
› Faster Machine Learning
DGX Data Center
GPU Support in Cloud Computing Stack
GPUs in the Cloud
› Exponential demand for more compute power
GPU interconnects are getting complex
Picture source: NVLink whitepaper
[Figure: four NVLink links at 20 GB/s each — 80 (in) + 80 (out) = 160 GB/s total]
GPU interconnects are getting complex
Picture source: NVLink whitepaper
[Figure: NVLink 2.0 links at 25–50 GB/s — 150 (in) + 150 (out) = 300 GB/s total]
GPU interconnects are getting complex
How can we make efficient use of GPU interconnects?
NVLink: fast communication among multiple GPUs
Challenges of complex GPU interconnects
› Programming Multi-GPU applications is hard
Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA
NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS
BACKGROUND
Efficiency of parallel computation tasks
• Amount of exposed parallelism
• Amount of work assigned to each processor
Expense of communications among tasks
• Amount of communication
• Degree of overlap of communication with computation
What limits the scalability of parallel applications?
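The tradeoff above can be put in a toy model. This is my own illustration, not from the slides: compute time shrinks with processor count, but any communication that is not overlapped with computation does not, and eventually caps the speedup.

```python
# Toy scaling model (illustrative): work/p is the parallel compute time,
# comm * (1 - overlap) is the non-overlapped communication per step.

def speedup(work, comm, p, overlap=0.0):
    """Speedup over one processor; `overlap` is the fraction of
    communication hidden behind computation."""
    parallel_time = work / p + comm * (1.0 - overlap)
    return work / parallel_time

# 100 units of work, 1 unit of communication: speedup flattens far below p
for p in (1, 4, 16, 64):
    print(p, round(speedup(100.0, 1.0, p), 1))
```

With full overlap the communication term disappears and the model recovers ideal speedup, which is why the degree of overlap matters as much as the amount of communication.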
COMMON COMMUNICATION PATTERNS
COMMUNICATION AMONG TASKS
Point-to-point communication
• Single sender, single receiver
• Relatively easy to implement efficiently
Collective communication
• Multiple senders and/or receivers
• Patterns include broadcast, scatter, gather, reduce, all-to-all, …
• Difficult to implement efficiently
What are common communication patterns?
POINT-TO-POINT COMMUNICATION
Most common pattern in HPC, where communication is usually to nearest neighbors
Single-sender, single-receiver per instance
2D Decomposition
Boundary exchanges
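In one dimension, the boundary exchange above reduces to each rank trading edge cells with its two ring neighbors. A rough sketch of that exchange (the function name and ghost-cell layout are mine, not from the slides):

```python
# Hypothetical sketch: 1D halo exchange on a ring of ranks. Each rank's
# block has one ghost cell at each end; the interior is its own data.

def exchange_boundaries(blocks):
    """Fill each block's ghost cells from its ring neighbors."""
    k = len(blocks)
    for r in range(k):
        left, right = (r - 1) % k, (r + 1) % k
        # Reads touch only interior cells ([1] and [-2]), which are
        # never written here, so in-place updates are safe.
        blocks[r][0] = blocks[left][-2]    # left neighbor's edge cell
        blocks[r][-1] = blocks[right][1]   # right neighbor's edge cell
    return blocks

# Two ranks; 0s are ghost cells around interior data [1, 2] and [3, 4]
blocks = exchange_boundaries([[0, 1, 2, 0], [0, 3, 4, 0]])
# -> [[4, 1, 2, 3], [2, 3, 4, 1]]
```

Each instance is a single-sender, single-receiver transfer, which is why this nearest-neighbor pattern is comparatively easy to implement efficiently.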
COLLECTIVE COMMUNICATION
Multiple senders and/or receivers
BROADCAST
One sender, multiple receivers
GPU0 GPU1 GPU2 GPU3
A
GPU0 GPU1 GPU2 GPU3
A A A A
broadcast
SCATTER
One sender; data is distributed among multiple receivers
scatter
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A
B
C
D
GATHER
Multiple senders, one receiver
GPU0 GPU1 GPU2 GPU3
A
B
C
D
GPU0 GPU1 GPU2 GPU3
A B C D
gather
ALL-GATHER
Gather messages from all; deliver gathered data to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
GPU0 GPU1 GPU2 GPU3
A A A A
B B B B
C C C C
D D D D
all-gather
REDUCE
Combine data from all senders; deliver the result to one receiver
GPU0 GPU1 GPU2 GPU3
A B C D
reduce
GPU0 GPU1 GPU2 GPU3
A+B+C+D
ALL-REDUCE
Combine data from all senders; deliver the result to all participants
GPU0 GPU1 GPU2 GPU3
A B C D
all-reduce
GPU0 GPU1 GPU2 GPU3
A+B+C+D
A+B+C+D
A+B+C+D
A+B+C+D
REDUCE-SCATTER
Combine data from all senders; distribute result across participants

GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3

reduce-scatter

GPU0 GPU1 GPU2 GPU3
A0+B0+C0+D0
A1+B1+C1+D1
A2+B2+C2+D2
A3+B3+C3+D3
ALL-TO-ALL
Scatter/gather distinct messages from each participant to every other
GPU0 GPU1 GPU2 GPU3
A0 B0 C0 D0
A1 B1 C1 D1
A2 B2 C2 D2
A3 B3 C3 D3
all-to-all
GPU0 GPU1 GPU2 GPU3
A0 A1 A2 A3
B0 B1 B2 B3
C0 C1 C2 C3
D0 D1 D2 D3
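The data movement of the patterns above can be written out as plain functions over per-GPU buffers. This is an illustrative model of the semantics only (the names and list-based representation are mine, not NCCL's API):

```python
# Each pattern mapped onto per-GPU buffers; bufs[g] is GPU g's data,
# chunked[g][i] is chunk i held by GPU g.

def broadcast(bufs, root=0):
    return [bufs[root] for _ in bufs]          # everyone gets root's data

def scatter(bufs, root=0):
    return list(bufs[root])                    # chunk i goes to GPU i

def gather(bufs, root=0):
    return list(bufs)                          # root collects every chunk

def all_gather(bufs):
    return [list(bufs) for _ in bufs]          # everyone collects everything

def reduce(bufs, root=0):
    return sum(bufs)                           # result lands on root only

def all_reduce(bufs):
    s = sum(bufs)
    return [s for _ in bufs]                   # result lands everywhere

def reduce_scatter(chunked):
    k = len(chunked)                           # GPU i keeps the sum of chunk i
    return [sum(chunked[g][i] for g in range(k)) for i in range(k)]

def all_to_all(chunked):
    k = len(chunked)                           # a transpose of the chunks
    return [[chunked[g][i] for g in range(k)] for i in range(k)]

all_reduce([1, 2, 3, 4])        # -> [10, 10, 10, 10]
all_to_all([[0, 1], [2, 3]])    # -> [[0, 2], [1, 3]]
```

Note that all-reduce is exactly reduce-scatter followed by all-gather, a decomposition the ring algorithm later in the deck exploits.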
THE CHALLENGE OF COLLECTIVES
THE CHALLENGE OF COLLECTIVES
Having multiple senders and/or receivers compounds communication inefficiencies
• For small transfers, latencies dominate; more participants increase latency
• For large transfers, bandwidth is key; bottlenecks are easily exposed
• May require topology-aware implementation for high performance
• Collectives are often blocking/non-overlapped
Collectives are often avoided because they are expensive. Why?
THE CHALLENGE OF COLLECTIVES
Collectives are central to scalability in a variety of key applications:
• Deep Learning (All-reduce, broadcast, gather)
• Parallel FFT (Transposition is all-to-all)
• Molecular Dynamics (All-reduce)
• Graph Analytics (All-to-all)
• …
If collectives are so expensive, do they actually get used? YES!
THE CHALLENGE OF COLLECTIVES
Scaling requires efficient communication algorithms and careful implementation
Communication algorithms are topology-dependent
Topologies can be complex – not every system is a fat tree
Most collectives are amenable to bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan]
Many implementations seen in the wild are suboptimal
RING-BASED COLLECTIVES: A PRIMER
BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Step 1: ∆𝑡 = 𝑁/𝐵
Step 2: ∆𝑡 = 𝑁/𝐵
Step 3: ∆𝑡 = 𝑁/𝐵
Total time: (𝑘 − 1)𝑁/𝐵

𝑁: bytes to broadcast
𝐵: bandwidth of each link
𝑘: number of GPUs
BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Split data into 𝑆 messages
Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)
Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)
...
Total time: 𝑆·𝑁/(𝑆𝐵) + (𝑘 − 2)·𝑁/(𝑆𝐵) = 𝑁(𝑆 + 𝑘 − 2)/(𝑆𝐵) → 𝑁/𝐵
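The two broadcast cost models are easy to check numerically. A small sketch (the parameter values are illustrative, not from the slides):

```python
# Cost of broadcasting N bytes around a k-GPU ring with per-link
# bandwidth B: unsplit vs. split into S pipelined messages.

def naive_broadcast_time(N, B, k):
    return (k - 1) * N / B

def pipelined_broadcast_time(N, B, k, S):
    return N * (S + k - 2) / (S * B)

N, B, k = float(1 << 30), 12e9, 4       # 1 GiB, ~12 GB/s link, 4 GPUs
for S in (1, 8, 64, 1024):
    t = pipelined_broadcast_time(N, B, k, S)
    # ratio to the single-link lower bound N/B approaches 1 as S grows
    print(S, t / (N / B))
```

With S = 1 the pipelined formula collapses to the naive (k − 1)N/B; as S grows, the startup term (k − 2)N/(SB) vanishes and the total approaches the single-link time N/B.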
ALL-REDUCE
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

[Animation frames: chunk 1 over steps 0–7, chunk 2 over steps 0–2, … — each chunk circulates the ring until every GPU holds the reduced result, then done]
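The animation above can be simulated step by step. This is my own minimal sketch of a ring all-reduce (not NCCL's kernel): each GPU's data is split into k chunks, a reduce-scatter phase of k−1 steps leaves each GPU with one fully reduced chunk, and an all-gather phase of k−1 steps circulates the reduced chunks, with only one chunk crossing each link per step.

```python
# Ring all-reduce simulation: data[g][c] is chunk c on GPU g.

def ring_all_reduce(data):
    k = len(data)
    chunks = [list(row) for row in data]
    # Reduce-scatter: in step s, GPU g forwards chunk (g - s) mod k to
    # GPU g+1, which accumulates it. Values are read before any write
    # to model simultaneous transfers.
    for s in range(k - 1):
        sends = [(g, (g - s) % k) for g in range(k)]
        vals = [chunks[g][c] for g, c in sends]
        for (g, c), v in zip(sends, vals):
            chunks[(g + 1) % k][c] += v
    # GPU g now owns the complete sum for chunk (g + 1) mod k.
    # All-gather: circulate the completed chunks, overwriting.
    for s in range(k - 1):
        sends = [(g, (g + 1 - s) % k) for g in range(k)]
        vals = [chunks[g][c] for g, c in sends]
        for (g, c), v in zip(sends, vals):
            chunks[(g + 1) % k][c] = v
    return chunks

ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# -> every GPU ends with [12, 15, 18]
```

Because every link carries data in every step, this uses the full ring bandwidth, which is the sense in which ring collectives are bandwidth-optimal.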
RING-BASED COLLECTIVES
A primer

[Diagram: 4-GPU-PCIe — GPU0–GPU3 share a PCIe switch attached to the CPU; PCIe Gen3 x16 ~12 GB/s]
RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Diagram: two-socket node — GPU0–GPU3 behind one CPU's PCIe switch, GPU4–GPU7 behind the other; PCIe Gen3 x16 ~12 GB/s; CPUs linked by an SMP connection (e.g., QPI)]
RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Diagram: 4-GPU-FC (fully connected) and 4-GPU-Ring topologies — GPU0–GPU3 connected by NVLink (~16 GB/s per link; 32 GB/s labels in the figure), each also attached to the CPU through a PCIe switch (PCIe Gen3 x16 ~12 GB/s)]
INTRODUCING NCCL (“NICKEL”):
ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS
INTRODUCING NCCL
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top of, to handle inter-node communication
Accelerating multi-GPU collective communications
NCCL FEATURES AND FUTURES
Collectives
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood

Key Features
• Single-node, up to 8 GPUs
• Host-side API
• Asynchronous/non-blocking interface
• Multi-thread, multi-process support
• In-place and out-of-place operation
• Integration with MPI
• Topology Detection
• NVLink & PCIe/QPI* support
(Green = Currently available)
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining the following:
• GPUDirect P2P Direct Access
• Three primitive operations: Copy, Reduce, ReduceAndCopy
• Intra-kernel synchronization between GPUs
• One CUDA thread block per ring-direction
NCCL EXAMPLE
#include <nccl.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);   // one communicator per device

foreach g in (GPUs) { // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}
All-reduce
NCCL PERFORMANCE
Bandwidth at different problem sizes (4 Maxwell GPUs)

[Chart: bandwidth vs. problem size for All-Gather, All-Reduce, Reduce-Scatter, and Broadcast]
AVAILABLE NOW
github.com/NVIDIA/nccl
THANKS TO MY COLLABORATORS
Nathan Luehr
Jonas Lippuner
Przemek Tredak
Sylvain Jeaugey
Natalia Gimelshein
Simon Layton
This research is funded in part by the U.S. DOE DesignForward program