
Performance of 3-D FFT

Using Multiple GPUs with CUDA 4

Akira Nukada

Tokyo Institute of Technology

GPGPU

• High performance, high memory bandwidth

• High efficiency in power, space, and price

• Complicated programming

• Limited capacity of device memory

• Requires parallelism

• Need optimizations in algorithm, data placement, ….

Next stage of GPGPU

We need multiple GPUs for

• More computing power

• More device memory

There is no SLI for GPGPU, so we have to manage data distribution, load balancing, and synchronization between GPUs ourselves (a minimal sketch follows below).
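As a rough illustration of what managing data distribution across GPUs looks like, the following minimal sketch (array size and the 16-GPU bound are illustrative, not from the talk) splits one array across all visible devices with the CUDA runtime API:

```c
/* Split one logical array into per-GPU chunks: each device gets its own allocation,
 * selected with cudaSetDevice (assumes at most 16 GPUs for the fixed-size array). */
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 1 << 24;                  /* total number of elements (example) */
    int ngpus;
    cudaGetDeviceCount(&ngpus);

    double *d_chunk[16];
    size_t per_gpu = (N + ngpus - 1) / ngpus;  /* elements owned by each GPU */
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);                      /* allocate on device g */
        cudaMalloc((void **)&d_chunk[g], per_gpu * sizeof(double));
    }

    /* ... launch kernels on each device, synchronize, exchange data between GPUs ... */

    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaFree(d_chunk[g]);
    }
    return 0;
}
```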

Multi-GPU programming

• Similar to MPI applications (distributed memory)

– Easy for developers with MPI experience

– Not easy to achieve expected performance

CUDA 4.0

• New features for multi-GPU applications

– Direct P2P via PCI-E network

– NIC interoperability with InfiniBand HCAs

These improve the data transfer speed between GPUs
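A minimal sketch of the CUDA 4.0 peer-to-peer path between two GPUs in one node (device numbers and the transfer size are illustrative):

```c
/* Check whether GPU 0 can reach GPU 1 over PCI-E, enable peer access, and copy
 * a buffer directly from device to device without staging through host memory. */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64UL << 20;            /* 64 MB example transfer */
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);      /* is direct P2P possible? */

    double *d0, *d1;
    cudaSetDevice(0); cudaMalloc((void **)&d0, bytes);
    cudaSetDevice(1); cudaMalloc((void **)&d1, bytes);

    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       /* GPU 0 may now access GPU 1 directly */
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);    /* direct GPU 1 -> GPU 0 copy over PCI-E */
    } else {
        /* fall back to copies staged through (pinned) host memory */
    }
    return 0;
}
```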

Intra-node

[Diagram, shown for CUDA 3.X and CUDA 4.X: CPU 0 (Westmere) with DDR3 RAM connects over QPI to an IO-Hub (Tylersburg); GPU 0 and GPU 1 (both Fermi) attach to the IO-Hub over PCI-E x16. With CUDA 3.X, copies between the two GPUs are staged through host memory; with CUDA 4.X, they can go directly over PCI-E (P2P).]

Inter-node : CUDA 3.X

[Diagram: with CUDA 3.X, data from GPU 0 (Fermi) is first copied into CUDA-pinned host memory, then copied again on the host into IB-pinned memory before the QDR x4 InfiniBand HCA can send it; CUDA pinning disables DMA access by the IB HCA, forcing the extra host-side memory copy.]

Inter-node : CUDA 4.X

[Diagram: with CUDA 4.X, the CUDA-pinned host buffer also allows DMA access by the IB HCA, so the QDR x4 InfiniBand HCA reads it directly and the extra host-side copy is removed.]
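A minimal sketch of this path, assuming an MPI/InfiniBand stack that can register CUDA-pinned host memory (the helper name and buffer handling are illustrative, not from the talk):

```c
/* Sketch: one page-locked buffer serves both as the target of the CUDA
 * device-to-host copy and, without any further host-side copy, as the MPI send
 * buffer that the IB HCA DMAs from (assumes GPUDirect-style pinned-memory sharing). */
#include <mpi.h>
#include <cuda_runtime.h>

void send_gpu_block(const double *d_buf, size_t count, int dest, MPI_Comm comm)
{
    double *h_buf;
    cudaMallocHost((void **)&h_buf, count * sizeof(double));   /* CUDA-pinned */

    cudaMemcpy(h_buf, d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, (int)count, MPI_DOUBLE, dest, 0, comm);    /* IB reads h_buf directly */

    cudaFreeHost(h_buf);
}
```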

3-D FFT

Performs 1-D FFT for each dimension of 3-D data.

[Figure: 3-D data volume with x, y, and z axes.]

Memory-intensive computation

• O(N log N) ops for O(N) data

Several FFT implementations for CUDA

• NVIDIA CUFFT library

• Our NukadaFFT library

• others…
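For reference, a single-GPU 3-D transform with the CUFFT library is set up roughly as follows (grid size and in-place execution are illustrative choices, not taken from the talk):

```c
/* Minimal single-GPU 3-D FFT with CUFFT: double-precision complex-to-complex,
 * in place, on a 256x256x256 grid. */
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 256;
    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftDoubleComplex) * N * N * N);

    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_Z2Z);              /* plan the 3-D transform */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);   /* execute it in place    */
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```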

3-D FFT on distributed memory

[Figure: the 3-D volume is split across devices; 1-D FFTs for dims. X & Y run on each device, an all-to-all communication redistributes the data, and 1-D FFTs for dim. Z run on each device.]

All-to-all comm. in FFT

Parallel FFT typically requires 1 to 3 all-to-all communications.

The number depends on the data distribution before and after the transform.

before: (NX, NY, NZ/P)    after: (NX, NY/P, NZ)

In such a case, only 1 all-to-all comm. is required.

Many applications allow this:

• 3D-RISM, molecular simulation (the motivation of this work)

• Convolution-based computations

TSUBAME 2.0 Compute Node

Thin node (HP ProLiant SL390s G7)

• Intel Xeon X5670 (Westmere-EP) x2

• NVIDIA Tesla M2050 x3

• Mellanox QDR x4 InfiniBand HCA x2

Medium node (HP ProLiant DL580 G7)

• Intel Xeon X7550 (Nehalem-EX) x4

• NVIDIA Tesla M2070 x4 (NextIO vCORE Express 2070)

• Mellanox QDR x4 InfiniBand HCA x1

Single-node Multi-GPU Multi-node

SINGLE NODE, MULTIPLE GPUS….

Medium Node

Four scheduling policies

Via host memory:

1. Default

2. Exclusive

3. Overlapped

Direct P2P:

4. P2P

PCI-E & QPI are bi-directional; it is important to fill both directions.
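A minimal sketch of the "fill both directions" idea: the outgoing device-to-host copy and the incoming host-to-device copy are issued on separate streams with page-locked host buffers so both PCI-E directions stay busy (function and buffer names are illustrative, not the talk's implementation):

```c
/* Exchange one block through host memory while keeping both PCI-E directions busy.
 * h_out and h_in must be page-locked (cudaMallocHost) for the copies to overlap. */
#include <cuda_runtime.h>

void exchange_via_host(const double *d_out, double *h_out,  /* block leaving this GPU      */
                       const double *h_in,  double *d_in,   /* block arriving for this GPU */
                       size_t count)
{
    cudaStream_t s_out, s_in;
    cudaStreamCreate(&s_out);
    cudaStreamCreate(&s_in);

    cudaMemcpyAsync(h_out, d_out, count * sizeof(double),
                    cudaMemcpyDeviceToHost, s_out);          /* outgoing direction */
    cudaMemcpyAsync(d_in, h_in, count * sizeof(double),
                    cudaMemcpyHostToDevice, s_in);           /* incoming direction */

    cudaStreamSynchronize(s_out);
    cudaStreamSynchronize(s_in);
    cudaStreamDestroy(s_out);
    cudaStreamDestroy(s_in);
}
```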

Performance of 3-D FFT using different scheduling policies (256x256x256, DP)

[Chart: performance in GFLOPS (0 to 100) for 1, 2, 3, and 4 GPUs, comparing the default, exclusive, "+overlap", and "+P2P" policies.]

Performance of 3-D FFT using P2P

[Chart: performance in GFLOPS (0 to 120) for CUFFT (1 GPU), NukadaFFT (1 GPU), and 2, 3, and 4 GPUs, for transform sizes 256, 384, and 512.]

Breakdown of 256³ 3-D FFT

[Chart: time in msec (0 to 40) for CUFFT (1 GPU), NukadaFFT (1 GPU), and 2, 3, and 4 GPUs, broken down into total, 1-D FFT for dim. Z, and 2-D FFT & transfer; the minimum time for the 2-D FFT on a single GPU is marked.]

MULTIPLE NODES….

Thin Node

[Diagram: thin-node topology. CPU 0 and CPU 1 (Westmere-EP), each with DDR3 RAM, are linked by QPI to two IO-Hubs (Tylersburg); GPU 0, GPU 1, and GPU 2 (Fermi) attach over PCI-E x16, and the two QDR x4 InfiniBand HCAs attach over PCI-E x8.]

Comm. between GPUs on multi-node

Three steps of DMA transfers

(1) Device-to-host by CUDA API

(2) Between nodes by MPI

(3) Host-to-device by CUDA API

An all-to-all between P nodes performs (P – 1) communications between GPUs.

Those steps should be pipelined, because all the DMA controllers can work simultaneously (see the sketch after the timeline below).

cuMemcpyDtoH() -> MPI_Alltoall() -> cuMemcpyHtoD()

MPI-CUDA all-to-all comm.

[Timeline on node #0: the three stages D2H (cuMemcpyDtoH), MPI_Send/MPI_Recv, and H2D (cuMemcpyHtoD) are pipelined across peers; while the D2H copy for one destination (0->5, 0->6, …) is in flight, the MPI transfer for the previous destination and the H2D copies for blocks already received (95->0, 96->0, …) proceed concurrently in both directions.]
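A simplified sketch of such a pipelined MPI+CUDA all-to-all is shown below. It is not the author's implementation; the helper name, the one-GPU-per-rank assumption, and the buffer layout (contiguous blocks of 'count' doubles per peer in page-locked staging buffers) are illustrative. Receives are pre-posted so that arriving blocks can start their host-to-device copies while later blocks are still being staged and sent:

```c
/* Pipelined all-to-all between GPUs on different nodes: blocks are staged through
 * page-locked host buffers, and the D2H, MPI, and H2D stages for different peers
 * overlap rather than running strictly one after another. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void gpu_alltoall(const double *d_send, double *d_recv,
                  double *h_send, double *h_recv,   /* page-locked, nprocs*count each */
                  size_t count, int rank, int nprocs, cudaStream_t stream)
{
    size_t bytes = count * sizeof(double);
    MPI_Request *rreq = (MPI_Request *)malloc(sizeof(MPI_Request) * nprocs);
    MPI_Request *sreq = (MPI_Request *)malloc(sizeof(MPI_Request) * nprocs);

    /* Pre-post all receives so the IB HCA can DMA incoming blocks at any time. */
    for (int p = 0; p < nprocs; ++p) {
        rreq[p] = sreq[p] = MPI_REQUEST_NULL;
        if (p != rank)
            MPI_Irecv(h_recv + (size_t)p * count, (int)count, MPI_DOUBLE,
                      p, 0, MPI_COMM_WORLD, &rreq[p]);
    }

    /* Send side: as soon as a peer's block reaches the host, hand it to MPI. */
    for (int step = 1; step < nprocs; ++step) {
        int to = (rank + step) % nprocs;
        cudaMemcpyAsync(h_send + (size_t)to * count, d_send + (size_t)to * count,
                        bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);             /* block for 'to' is now on the host */
        MPI_Isend(h_send + (size_t)to * count, (int)count, MPI_DOUBLE,
                  to, 0, MPI_COMM_WORLD, &sreq[to]);
    }

    /* The local block never leaves the GPU. */
    cudaMemcpyAsync(d_recv + (size_t)rank * count, d_send + (size_t)rank * count,
                    bytes, cudaMemcpyDeviceToDevice, stream);

    /* Receive side: start each H2D copy as soon as that peer's block has arrived. */
    for (int done = 0; done < nprocs - 1; ++done) {
        int p;
        MPI_Waitany(nprocs, rreq, &p, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(d_recv + (size_t)p * count, h_recv + (size_t)p * count,
                        bytes, cudaMemcpyHostToDevice, stream);
    }

    MPI_Waitall(nprocs, sreq, MPI_STATUSES_IGNORE);
    cudaStreamSynchronize(stream);
    free(rreq);
    free(sreq);
}
```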

3-D FFT using MPI (1 GPU/node)

[Chart: performance in GFLOPS (log scale, 50 to 1600) versus the number of nodes (4 to 128) for transform sizes 256, 384, 512, and 1024.]

Multi-GPU per node, multiple nodes.

(1) Hybrid: multiple GPUs per process

(2) Flat: one GPU per process

[Diagram: a node with GPU0, GPU1, GPU2, CPU0, CPU1, and a QDR x4 IB HCA; the labels 1GPU (0), 2GPU (0), 3GPU (0), 2GPU (0-0), 2GPU (0-1), 3GPU (0-0-0), 3GPU (0-0-1), and 3GPU (0-1-1) denote the process-to-GPU placements tested for each configuration.]
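For the flat approach, selecting the GPU per MPI process can be as simple as the following sketch (the consecutive-ranks-per-node placement is an assumption, not stated in the talk):

```c
/* Flat mapping: each MPI process drives exactly one GPU, chosen from its rank. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);   /* e.g. 3 GPUs per TSUBAME 2.0 thin node */

    /* ... per-process FFT work and the MPI all-to-all go here ... */

    MPI_Finalize();
    return 0;
}
```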

Performance with 64 nodes

[Chart: performance in GFLOPS (0 to 1400) for transform sizes 256, 384, 512, and 1024.]

Summary

• Speed-up of 3-D FFT using multiple GPUs

– Up to 2X using 4 GPUs (single node)

– Up to 4X using 32 GPUs (32 nodes)

• Small messages in the all-to-all limit the scalability

• Using multiple GPUs per node

– Decreases performance for small data sizes

– Increases performance for large data sizes