Pak Lui
March 2013, GTC
GPU-InfiniBand Accelerations for Hybrid Compute Systems
© 2013 Mellanox Technologies - Mellanox Confidential -
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive end-to-end InfiniBand and Ethernet portfolio: ICs, adapter cards, switches/gateways, cables, and host/fabric software
Virtual Protocol Interconnect spans server/compute, switch/gateway, and storage front/back-end
• Server/compute: 56Gb/s InfiniBand and FCoIB
• Switch/gateway: 56Gb/s InfiniBand, 10/40/56GbE
• Storage: 10/40/56GbE, FCoE, and Fibre Channel
Virtual Protocol Interconnect (VPI) Technology
VPI Adapter: each port runs InfiniBand (10/20/40/56 Gb/s) or Ethernet (10/40/56 Gb/s); available as LOM, adapter card, or mezzanine card with PCIe 3.0
• Acceleration engines for networking, storage, clustering, and management applications
VPI Switch: switch OS layer supports up to 8 VPI subnets
• Configurations: 64 ports 10GbE; 36 ports 40/56GbE; 48 ports 10GbE + 12 ports 40/56GbE; 36 ports InfiniBand at up to 56Gb/s
Managed by the Unified Fabric Manager; extends from the data center to campus and metro connectivity
Connect-IB: The Exascale Foundation
A new interconnect architecture for compute-intensive applications
World's fastest server and storage interconnect solution
Enables unlimited clustering (compute and storage) scalability
Accelerates compute-intensive and parallel-intensive applications
Optimized for multi-tenant environments with hundreds of virtual machines per server
Connect-IB Performance Highlights
World's first 100Gb/s InfiniBand interconnect adapter
• PCIe 3.0 x16 with dual FDR 56Gb/s InfiniBand ports delivers >100Gb/s
Highest InfiniBand message rate: 137 million messages per second
• 4X higher than other InfiniBand solutions
Unparalleled throughput and message injection rates
GPUDirect History
The GPUDirect project was announced in November 2009
• "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks", http://www.nvidia.com/object/io_1258539409179.html
GPUDirect was developed jointly by Mellanox and NVIDIA
• A new interface (API) within the Tesla GPU driver
• A new interface within the Mellanox InfiniBand drivers
• A Linux kernel modification allowing direct communication between the two drivers
GPUDirect 1.0 was announced in Q2'10
• "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
• "Mellanox was the lead partner in the development of NVIDIA GPUDirect"
GPUDirect RDMA will be released in Q2'13
GPU-InfiniBand Bottleneck (pre-GPUDirect)
GPU communication uses "pinned" buffers for data movement
• A section of host memory dedicated to the GPU
• Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
InfiniBand also uses "pinned" buffers for efficient RDMA transactions
• Zero-copy data transfers, kernel bypass
• Reduces CPU overhead
[Diagram: CPU, chipset, GPU and GPU memory, InfiniBand HCA, and system memory; the GPU and the HCA each pin their own separate region of system memory]
GPUDirect 1.0
[Diagram: transmit and receive paths through CPU, chipset, GPU memory, system memory, and InfiniBand. Without GPUDirect, sending takes two steps: (1) the GPU copies data into its pinned region of system memory, and (2) the CPU copies it into the InfiniBand pinned region before the HCA transmits; the receive path mirrors this. With GPUDirect 1.0, the GPU driver and the InfiniBand driver share one pinned region, eliminating the CPU copy on both the transmit and receive sides.]
GPUDirect 1.0 – Application Performance
LAMMPS: 3 nodes, 10% gain (configurations: 3 nodes with 1 GPU per node, and 3 nodes with 3 GPUs per node)
Amber (Cellulose): 8 nodes, 32% gain
Amber (FactorIX): 8 nodes, 27% gain
GPUDirect RDMA
[Diagram: transmit and receive paths. With GPUDirect 1.0, data still stages through the shared pinned buffer in system memory. With GPUDirect RDMA, the InfiniBand HCA reads from and writes to GPU memory directly across PCIe, bypassing system memory entirely on both sides.]
Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA
• A preliminary driver for GPU-Direct-RDMA is under development by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
– A hybrid design
– Takes advantage of GPU-Direct-RDMA for short messages
– Uses the host-based buffered design in current MVAPICH2 for large messages
– Alleviates the Sandy Bridge chipset bottleneck
DK-OSU-MVAPICH2-GPU-Direct-RDMA
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
GPU-GPU Internode MPI Latency
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message latency (1 B–4 KB, 0–40 µs) and large-message latency (16 KB–4 MB, 0–1200 µs); latency in µs vs. message size in bytes.]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA (same test configuration)
GPU-GPU Internode MPI Uni-directional Bandwidth
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message bandwidth (1 B–4 KB, 0–900 MB/s) and large-message bandwidth (16 KB–4 MB, 0–7000 MB/s); bandwidth in MB/s vs. message size in bytes.]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA (same test configuration)
GPU-GPU Internode MPI Bi-directional Bandwidth
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message bi-directional bandwidth (1 B–4 KB, 0–1200 MB/s) and large-message bi-directional bandwidth (16 KB–4 MB, 0–12000 MB/s); bandwidth in MB/s vs. message size in bytes.]
Remote GPU Access through rCUDA
rCUDA provides remote access from every node to any GPU in the system, turning GPU servers into GPU as a Service.
[Diagram: on the client side, the CUDA application calls the rCUDA library, which forwards requests over the network interface; on the server side, the rCUDA daemon receives them and invokes the local CUDA driver and runtime.]
GPU as a Service
GPUs as a network-resident service
• Little to no overhead when using FDR InfiniBand
Virtualize and decouple GPU services from CPU services
• A new paradigm in cluster flexibility
• Lower cost, lower power, and ease of use with shared GPU resources
• Removes the difficult physical requirements of hosting GPUs in standard compute servers
[Diagram: instead of a GPU in every server, servers run virtual GPUs (VGPUs) backed by a shared pool of physical GPUs reached over the network.]
Thank You