Pak Lui
March 2013, GTC
GPU-InfiniBand Accelerations for Hybrid Compute Systems
© 2013 Mellanox Technologies - Mellanox Confidential -
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive end-to-end InfiniBand and Ethernet portfolio: ICs, adapter cards, switches/gateways, cables, and host/fabric software
Virtual Protocol Interconnect spans server/compute, switch/gateway, and storage front/back-end
• Server/compute: 56Gb/s InfiniBand and FCoIB
• Switch/gateway: 56Gb/s InfiniBand, 10/40/56GbE
• Storage: 10/40/56GbE, FCoE, and Fibre Channel
Virtual Protocol Interconnect (VPI) Technology
VPI Adapter: each port runs InfiniBand (10/20/40/56 Gb/s) or Ethernet (10/40/56 Gb/s); available as LOM, adapter card, or mezzanine card with PCIe 3.0
• Acceleration engines for networking, storage, clustering, and management applications
VPI Switch: switch OS layer supports up to 8 VPI subnets
• Configurations: 64 ports 10GbE; 36 ports 40/56GbE; 48 ports 10GbE + 12 ports 40/56GbE; 36 ports InfiniBand at up to 56Gb/s
Managed by the Unified Fabric Manager; extends from the data center to campus and metro connectivity
Connect-IB: The Exascale Foundation
A new interconnect architecture for compute-intensive applications
World's fastest server and storage interconnect solution
Enables unlimited clustering (compute and storage) scalability
Accelerates compute-intensive and parallel-intensive applications
Optimized for multi-tenant environments with hundreds of virtual machines per server
Connect-IB Performance Highlights
World's first 100Gb/s InfiniBand interconnect adapter
• PCIe 3.0 x16 with dual FDR 56Gb/s InfiniBand ports delivers >100Gb/s
Highest InfiniBand message rate: 137 million messages per second
• 4X higher than other InfiniBand solutions
Unparalleled throughput and message injection rates
GPUDirect History
The GPUDirect project was announced in November 2009
• "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks", http://www.nvidia.com/object/io_1258539409179.html
GPUDirect was developed jointly by Mellanox and NVIDIA
• A new interface (API) within the Tesla GPU driver
• A new interface within the Mellanox InfiniBand drivers
• A Linux kernel modification allowing direct communication between the two drivers
GPUDirect 1.0 was announced in Q2'10
• "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
• "Mellanox was the lead partner in the development of NVIDIA GPUDirect"
GPUDirect RDMA will be released in Q2'13
GPU-InfiniBand Bottleneck (pre-GPUDirect)
GPU communication uses "pinned" buffers for data movement
• A section of host memory dedicated to the GPU
• Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
InfiniBand also uses "pinned" buffers for efficient RDMA transactions
• Zero-copy data transfers, kernel bypass
• Reduces CPU overhead
[Diagram: CPU, chipset, GPU and GPU memory, InfiniBand HCA, and system memory; the GPU and the HCA each pin their own separate region of system memory]
GPUDirect 1.0
[Diagram: transmit and receive paths through CPU, chipset, GPU memory, system memory, and InfiniBand. Without GPUDirect, sending takes two steps: (1) the GPU copies data into its pinned region of system memory, and (2) the CPU copies it into the InfiniBand pinned region before the HCA transmits; the receive path mirrors this. With GPUDirect 1.0, the GPU driver and the InfiniBand driver share one pinned region, eliminating the CPU copy on both the transmit and receive sides.]
GPUDirect 1.0 – Application Performance
LAMMPS: 3 nodes, 10% gain (configurations: 3 nodes with 1 GPU per node, and 3 nodes with 3 GPUs per node)
Amber (Cellulose): 8 nodes, 32% gain
Amber (FactorIX): 8 nodes, 27% gain
GPUDirect RDMA
[Diagram: transmit and receive paths. With GPUDirect 1.0, data still stages through the shared pinned buffer in system memory. With GPUDirect RDMA, the InfiniBand HCA reads from and writes to GPU memory directly across PCIe, bypassing system memory entirely on both sides.]
Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA
• A preliminary driver for GPU-Direct-RDMA is under development by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
– A hybrid design
– Takes advantage of GPU-Direct-RDMA for short messages
– Uses the host-based buffered design in current MVAPICH2 for large messages
– Alleviates the Sandy Bridge chipset bottleneck
DK-OSU-MVAPICH2-GPU-Direct-RDMA
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch
GPU-GPU Internode MPI Latency
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message latency (1 B–4 KB, 0–40 µs) and large-message latency (16 KB–4 MB, 0–1200 µs); latency in µs vs. message size in bytes.]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA (same test configuration)
GPU-GPU Internode MPI Uni-directional Bandwidth
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message bandwidth (1 B–4 KB, 0–900 MB/s) and large-message bandwidth (16 KB–4 MB, 0–7000 MB/s); bandwidth in MB/s vs. message size in bytes.]
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA (same test configuration)
GPU-GPU Internode MPI Bi-directional Bandwidth
[Charts: MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid. Small-message bi-directional bandwidth (1 B–4 KB, 0–1200 MB/s) and large-message bi-directional bandwidth (16 KB–4 MB, 0–12000 MB/s); bandwidth in MB/s vs. message size in bytes.]
Remote GPU Access through rCUDA
rCUDA provides remote access from every node to any GPU in the system, turning GPU servers into GPU as a Service.
[Diagram: on the client side, the CUDA application calls the rCUDA library, which forwards requests over the network interface; on the server side, the rCUDA daemon receives them and invokes the local CUDA driver and runtime.]
GPU as a Service
GPUs as a network-resident service
• Little to no overhead when using FDR InfiniBand
Virtualize and decouple GPU services from CPU services
• A new paradigm in cluster flexibility
• Lower cost, lower power, and ease of use with shared GPU resources
• Removes the difficult physical requirements of hosting GPUs in standard compute servers
[Diagram: instead of a GPU in every server, servers run virtual GPUs (VGPUs) backed by a shared pool of physical GPUs reached over the network.]
Thank You