Performance of 3-D FFT Using Multiple GPUs with CUDA 4
Akira Nukada, Tokyo Institute of Technology

Transcript of a GTC 2012 presentation (on-demand.gputechconf.com/gtc/2012/presentations/S0209...). Presented at GPU Technology Conference 2012. Keywords: 3-D fast Fourier transform, CUDA 4, FFT, P2P, peer to peer.

Page 1:

Performance of 3-D FFT

Using Multiple GPUs with CUDA 4

Akira Nukada

Tokyo Institute of Technology

Page 2:

GPGPU

• High performance, high memory bandwidth

• High efficiency in power, space, and price

• Complicated programming

• Limited device memory capacity

• Requires parallelism

• Needs optimization of algorithms, data placement, …

Page 3:

Next stage of GPGPU

We need multiple GPUs for

• More computing power

• More device memory

There is no SLI for GPGPU, so we have to manage data distribution, load balancing, and synchronization between GPUs ourselves.

Page 4:

Multi-GPU programming

• Similar to MPI applications (distributed memory)

– Easy for developers with MPI experience

– Not easy to achieve expected performance

Page 5:

CUDA 4.0

• New features for multi-GPU applications

– Direct P2P via PCI-E network

– NIC interoperability with InfiniBand HCAs

These features improve the data transfer speed between GPUs (a minimal sketch of the P2P calls follows below).
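As an illustration of the first feature, here is a minimal sketch (not from the talk) of the CUDA 4.x runtime calls that enable direct P2P copies between two GPUs; error handling is omitted and the buffer size is arbitrary.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Check that the two GPUs can reach each other over PCI-E.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("P2P not supported between GPU 0 and GPU 1\n");
        return 1;
    }

    const size_t bytes = 1 << 20;       // 1 MB, arbitrary
    void *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // allow GPU 0 to access GPU 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);   // allow GPU 1 to access GPU 0
    cudaMalloc(&buf1, bytes);

    // Direct GPU-to-GPU copy over PCI-E, without staging in host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    printf("peer copy done\n");
    return 0;
}
```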

Page 6:

Intra-node

[Diagram: a node with CPU 0 (Westmere), a Tylersburg IO-Hub, DDR3 RAM, and two Fermi GPUs (GPU 0 and GPU 1) each attached via PCI-E x16, with QPI linking the CPU and the IO-Hub. The same topology is drawn twice to contrast the GPU-to-GPU data path in CUDA 3.X (staged through host memory) with the direct P2P path over PCI-E in CUDA 4.X.]

Page 7:

Inter-node : CUDA 3.X

[Diagram: two nodes, each with CPU 0, a Fermi GPU 0, and a QDR x4 IB HCA. Each host has separate CUDA-pinned and IB-pinned buffers. With CUDA 3.X, DMA access by the IB HCA to the CUDA-pinned buffer is not possible, so an extra memory copy on the host is required between the two buffers.]

Page 8:

Inter-node : CUDA 4.X

[Diagram: the same node (CPU 0, Fermi GPU 0, QDR x4 IB HCA). CUDA 4.X allows DMA access by the IB HCA to the CUDA-pinned buffer, removing the host-side copy between CUDA-pinned and IB-pinned memory.]
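As a rough sketch of what this enables (not the talk's code): with CUDA 4.x the same page-locked host buffer allocated with cudaHostAlloc() can be handed directly to an IB-capable MPI library, so the device-to-host copy and the MPI send operate on one buffer without an extra host memcpy. Buffer names and sizes below are assumptions.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t bytes = 4 << 20;       // 4 MB block, arbitrary
    void *d_buf, *h_buf;
    cudaMalloc(&d_buf, bytes);
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);  // CUDA-pinned host buffer

    if (rank == 0) {
        // Copy from the GPU into the CUDA-pinned buffer ...
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
        // ... and let the IB stack DMA directly from that same buffer.
        MPI_Send(h_buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    MPI_Finalize();
    return 0;
}
```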

Page 9:

3-D FFT

Performs a 1-D FFT along each dimension of the 3-D data.

[Figure: 3-D data cube with x, y, and z axes.]

Memory-intensive computation
• O(N log N) ops for O(N) data

Several FFT implementations for CUDA
• NVIDIA CUFFT library
• Our NukadaFFT library
• others…

Page 10:

3-D FFT on distributed memory

[Figure: the 3-D data cube is partitioned across devices. 1-D FFTs for dimensions X and Y are performed on each device, an all-to-all communication redistributes the data, and then 1-D FFTs for dimension Z are performed on each device. A sketch of the per-device plans for these stages follows below.]
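As an illustration only (not the talk's implementation, which also uses the NukadaFFT library), the two per-device stages could be set up with cuFFT batched plans roughly like this; NX, NY, NZ, P, and double-precision complex data laid out x-fastest are assumptions for the sketch.

```cuda
#include <cufft.h>

// Assumed problem and partition sizes (illustrative only).
#define NX 256
#define NY 256
#define NZ 256
#define P  4            // number of GPUs / processes

// Stage 1: on each device, 2-D FFTs over X & Y for the local NZ/P planes.
cufftHandle make_xy_plan() {
    cufftHandle plan;
    int n[2] = { NY, NX };                    // 2-D transform size (slowest first)
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, NX * NY,           // packed input layout
                  NULL, 1, NX * NY,           // packed output layout
                  CUFFT_Z2Z, NZ / P);         // one batch per local Z-plane
    return plan;
}

// Stage 2 (after the all-to-all): 1-D FFTs along Z, now local to each device.
// Data is stored x-fastest as (NX, NY/P, NZ), so the Z elements are strided.
cufftHandle make_z_plan() {
    cufftHandle plan;
    int n[1] = { NZ };
    int batch = NX * (NY / P);                // one transform per (x, y) line
    int embed[1] = { NZ };
    cufftPlanMany(&plan, 1, n,
                  embed, batch, 1,            // istride = batch, idist = 1
                  embed, batch, 1,
                  CUFFT_Z2Z, batch);
    return plan;
}

// Usage: cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
```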

Page 11:

All-to-all comm. in FFT

A parallel FFT typically requires 1 to 3 all-to-all communications. The number depends on the data distribution before and after the transform.

before: (NX, NY, NZ/P)    after: (NX, NY/P, NZ)

With this pair of distributions, only 1 all-to-all comm. is required.

Many applications allow this:

• 3D-RISM, molecular simulation (the motivation of this work)

• Convolution-based computations

Page 12:

TSUBAME 2.0 Compute Node

Thin node (HP ProLiant SL390s G7)

• Intel Xeon X5670 (Westmere-EP) x2

• NVIDIA Tesla M2050 x3

• Mellanox QDR x4 InfiniBand HCA x2

Medium node (HP ProLiant DL580 G7)

• Intel Xeon X7550 (Nehalem-EX) x4

• NVIDIA Tesla M2070 x4

(NextIO vCORE Express 2070)

• Mellanox QDR x4 InfiniBand HCA x1

The medium node is used for the single-node multi-GPU experiments; the thin nodes are used for the multi-node experiments.

Page 13:

SINGLE NODE MULTI GPUS….

Page 14:

Medium Node

Page 15:

Four scheduling policies

Via host memory:

1. Default

2. Exclusive

3. Overlapped

Direct P2P:

4. P2P

PCI-E and QPI are bi-directional; it is important to fill both directions (see the sketch below).
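Here is a minimal sketch (mine, not the talk's code) of how both PCI-E directions can be filled at once: a device-to-host copy and a host-to-device copy issued in separate CUDA streams with pinned host buffers, so the two DMA engines run concurrently. Buffer names and sizes are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;            // 64 MB per direction, arbitrary
    void *h_in, *h_out, *d_in, *d_out;

    // Pinned host buffers are required for truly asynchronous copies.
    cudaHostAlloc(&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Issue the two copies in different streams so the H2D and D2H
    // DMA engines run at the same time, filling both PCI-E directions.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);

    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);
    printf("both directions overlapped\n");

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(d_in);  cudaFree(d_out);
    cudaFreeHost(h_in);  cudaFreeHost(h_out);
    return 0;
}
```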

Page 16:

Performance of 3-D FFT using different scheduling policies (256x256x256, DP)

[Bar chart: performance in GFLOPS (y-axis, 0 to 100) for 1, 2, 3, and 4 GPUs, comparing the default, exclusive, +overlap, and +P2P scheduling policies.]

Page 17:

Performance of 3-D FFT using P2P

[Bar chart: performance in GFLOPS (y-axis, 0 to 120) for CUFFT (1 GPU), NukadaFFT (1 GPU), and 2, 3, and 4 GPUs, for problem sizes 256, 384, and 512.]

Page 18:

Breakdown of 256³ 3-D FFT

[Bar chart: time in msec. (y-axis, 0 to 40) for CUFFT (1 GPU), NukadaFFT (1 GPU), and 2, 3, and 4 GPUs, showing the total time, the 1-D FFT for dim. Z, and the 2-D FFT & transfer. The single-GPU 2-D FFT time is marked as the minimum time for the 2-D FFT stage.]

Page 19:

MULTIPLE NODES….

Page 20:

Thin Node

[Diagram: thin-node topology. CPU 0 and CPU 1 (both Westmere-EP) are linked by QPI, each with DDR3 RAM, and connect via QPI to two Tylersburg IO-Hubs. Three Fermi GPUs (GPU 0, GPU 1, GPU 2) attach over PCI-E x16, and two QDR x4 IB HCAs attach over PCI-E x8.]

Page 21:

Comm. between GPUs on multi-node

Three steps of DMA transfers:

(1) Device-to-host by the CUDA API (cuMemcpyDtoH())

(2) Between nodes by MPI (MPI_Alltoall())

(3) Host-to-device by the CUDA API (cuMemcpyHtoD())

An all-to-all between P nodes performs (P – 1) GPU-to-GPU communications per node.

These steps should be pipelined, because all of the DMA controllers can work simultaneously. A simplified sketch of such a pipeline follows below.
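The following is a simplified sketch (mine, not the talk's implementation) of how the three stages can be pipelined by replacing MPI_Alltoall() with per-peer MPI_Sendrecv() calls, so the host-staging copy for one peer overlaps with the network transfer for another; buffer names and the exact schedule are assumptions.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Pipelined all-to-all between GPUs on different nodes (illustrative sketch).
// d_send/d_recv are device buffers, h_send/h_recv pinned host buffers,
// 'bytes' is the per-peer block size. The local rank->rank block would be
// copied device-to-device; it is omitted here.
void gpu_alltoall(char *d_send, char *d_recv,
                  char *h_send, char *h_recv,
                  size_t bytes, int rank, int nprocs)
{
    cudaStream_t d2h, h2d;
    cudaStreamCreate(&d2h);
    cudaStreamCreate(&h2d);

    // Stage the first outgoing block before entering the loop.
    int first = (rank + 1) % nprocs;
    cudaMemcpyAsync(h_send + first * bytes, d_send + first * bytes,
                    bytes, cudaMemcpyDeviceToHost, d2h);

    for (int step = 1; step < nprocs; ++step) {
        int to   = (rank + step) % nprocs;            // peer we send to
        int from = (rank - step + nprocs) % nprocs;   // peer we receive from

        // Make sure the block for 'to' has reached the host.
        cudaStreamSynchronize(d2h);

        // Start staging the next outgoing block while MPI is in flight.
        if (step + 1 < nprocs) {
            int next = (rank + step + 1) % nprocs;
            cudaMemcpyAsync(h_send + next * bytes, d_send + next * bytes,
                            bytes, cudaMemcpyDeviceToHost, d2h);
        }

        // Inter-node exchange through the IB network.
        MPI_Sendrecv(h_send + to   * bytes, (int)bytes, MPI_BYTE, to,   0,
                     h_recv + from * bytes, (int)bytes, MPI_BYTE, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Upload the received block; overlaps with the next iteration's work.
        cudaMemcpyAsync(d_recv + from * bytes, h_recv + from * bytes,
                        bytes, cudaMemcpyHostToDevice, h2d);
    }

    cudaStreamSynchronize(h2d);
    cudaStreamDestroy(d2h);
    cudaStreamDestroy(h2d);
}
```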

Page 22:

MPI-CUDA all-to-all comm.

[Timeline on node #0: device-to-host copies and MPI_Send calls for outgoing blocks (0->1, 0->2, 0->3, …) are issued one after another, while MPI_Recv calls and host-to-device copies for incoming blocks (…, 97->0, 98->0, 99->0) proceed at the same time, so the D2H, MPI, and H2D stages for different peers overlap in a pipeline.]

Page 23:

3-D FFT using MPI (1GPU/node)

[Line chart, log scale on both axes: performance in GFLOPS (50 to 1600) versus the number of nodes (4 to 128), for problem sizes 256, 384, 512, and 1024.]

Page 24:

Multi-GPU per node, multiple nodes.

(1) Hybrid: multiple GPUs per MPI process

[Diagram: one process per node drives several GPUs; configurations labelled 1GPU (0), 2GPU (0), and 3GPU (0). The node shows GPU0, GPU1, GPU2, CPU0, CPU1, and a QDR x4 IB HCA.]

(2) Flat: one GPU per MPI process

[Diagram: one process per GPU; configurations labelled 2GPU (0-0), 2GPU (0-1), 3GPU (0-0-0), 3GPU (0-0-1), and 3GPU (0-1-1).]
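For the flat mapping, here is a minimal sketch (an assumption, not the talk's code) of how each MPI process might select its own GPU on a node, assuming ranks are placed node by node with as many ranks per node as GPUs.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Flat mapping sketch: one MPI process per GPU, device chosen by the
// rank's position within the node.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, ngpus;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ngpus);

    int ranks_per_node = ngpus;          // assumption: #ranks per node == #GPUs
    int local_rank = rank % ranks_per_node;
    cudaSetDevice(local_rank);           // each process drives exactly one GPU

    // ... per-GPU FFT stages and the MPI all-to-all would go here ...

    MPI_Finalize();
    return 0;
}
```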

Page 25:

Performance with 64 nodes

[Bar chart: performance in GFLOPS (0 to 1400) with 64 nodes, for problem sizes 256, 384, 512, and 1024.]

Page 26:

Summary

• Speed-up of 3-D FFT using multiple GPUs

– Up to 2X using 4 GPUs (single node)

– Up to 4X using 32 GPUs (32 nodes)

• Small messages in the all-to-all limit scalability

• Using multiple GPUs per node

– Decreases performance for small data sizes

– Increases performance for large data sizes