Paving the Road to Exascale Computing
Gilad Shainer, VP Marketing – February 2014
Leading Supplier of End-to-End Interconnect Solutions
• MXM – Mellanox Messaging Acceleration
• FCA – Fabric Collectives Acceleration
Management
• UFM – Unified Fabric Management
Storage and Data
• VSA – Storage Accelerator (iSCSI)
• UDA – Unstructured Data Accelerator
Comprehensive End-to-End Software Accelerators and Management
Host/Fabric Software | ICs | Switches/Gateways | Adapter Cards | Cables/Modules | Metro/WAN
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
Mellanox InfiniBand Paves the Road to Exascale Computing
Accelerating Half of the World’s Petascale Systems
Mellanox Connected Petascale System Examples
NASA Ames Research Center – Pleiades
• 20K InfiniBand nodes
• Mellanox end-to-end FDR and QDR InfiniBand
• Supports a variety of scientific and engineering projects: coupled atmosphere-ocean models, future space vehicle design, large-scale dark matter halos and galaxy evolution
• Asian Monsoon Water Cycle High-Resolution Climate Simulations
Helping to Make the World a Better Place
SANGER – Sequence Analysis and Genomics Research
• Genomic analysis for pediatric cancer patients
• Challenge: an individual patient’s RNA analysis took 7 days; goal: reduce it to 5 days
• InfiniBand reduced the RNA-sequence data analysis time per patient to only 1 hour!
Fast interconnect for fighting pediatric cancer
Business Success Depends on Fast Interconnect
• Real-time fraud detection: 13 million financial transactions per day, 4 billion database inserts
• 235 supermarkets across 8 states, USA: reacting to customers’ needs in real time, reducing data queries from 20 minutes to 20 seconds
• Microsoft Bing Maps: accuracy, details, fast response – 10X higher performance, 50% CAPEX reduction
• Tier-1 Fortune 100 company, Web 2.0 application: 97% reduction in database recovery time – from 7 days to 4 hours!
InfiniBand Enables Lowest Application Cost in the Cloud (Examples)
• Microsoft Windows Azure: 90.2% cloud efficiency, 33% lower cost per application
• Cloud application performance improved up to 10X
• 3X increase in VMs per physical server with consolidation of network and storage I/O, 32% lower cost per application
• 694% higher network performance
InfiniBand’s Unsurpassed System Efficiency
• TOP500 systems listed according to their efficiency
• InfiniBand is the key element responsible for the highest system efficiency
• Mellanox delivers efficiencies of up to 96% with InfiniBand
FDR InfiniBand Delivers Highest Return on Investment
[Application benchmark charts – higher is better]
Source: HPC Advisory Council
Technology Roadmap
[Roadmap graphic, 2000 to 2020: InfiniBand speeds rising from 10Gb/s through 20Gb/s, 40Gb/s, 56Gb/s and 100Gb/s toward 200Gb/s, spanning the Terascale, Petascale and Mega Supercomputers eras on the way to Exascale; milestone systems include Virginia Tech (Apple), 3rd on the TOP500 in 2003, and “Roadrunner” (Mellanox Connected), 1st on the TOP500]
Architectural Foundation for Exascale Computing
Connect-IB Interconnect Adapter
Mellanox Connect-IB – The World’s Fastest Adapter
• The 7th generation of Mellanox interconnect adapters
• World’s first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand)
• Delivers 137 million messages per second – 4X higher than the competition
• Supports the new innovative InfiniBand scalable transport – Dynamically Connected
Connect-IB Provides Highest Interconnect Throughput
Source: Prof. DK Panda
Gain Your Performance Leadership With Connect-IB Adapters
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size, 4 bytes to 1 MByte – higher is better. Peak values:]
• ConnectX2-PCIe2-QDR: 3385 MB/s unidirectional, 6521 MB/s bidirectional
• ConnectX3-PCIe3-FDR: 6343 MB/s unidirectional, 11643 MB/s bidirectional
• Sandy-ConnectIB-DualFDR: 12485 MB/s unidirectional, 21025 MB/s bidirectional
• Ivy-ConnectIB-DualFDR: 12810 MB/s unidirectional, 24727 MB/s bidirectional
Memory Scalability
[Chart: host memory consumption (MB) for connection state, log scale from 1 to 1,000,000,000 MB, across cluster sizes from 8 nodes to 100K nodes, for successive adapter/transport generations: InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), Connect-IB with DCT (2012)]
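To make the transport-scalability difference concrete, the back-of-the-envelope sketch below (not from the slide) compares per-host connection memory for a fully connected Reliable Connection (RC) transport against a fixed pool of Dynamically Connected (DCT) objects. The per-QP size, processes per node and DCT pool size are illustrative assumptions, not measured Mellanox figures.

```c
#include <stdio.h>

/* Illustrative assumptions (not vendor figures):
 *  - each reliable-connected QP costs ~4 KB of host memory
 *  - 16 MPI processes per host
 *  - DCT keeps only a small, fixed pool of DC objects per process,
 *    independent of cluster size                                    */
#define QP_BYTES        (4 * 1024)
#define PROCS_PER_NODE  16
#define DCT_OBJECTS     32   /* fixed pool per process */

static double rc_mem_mb(long nodes)
{
    /* RC: every process keeps one QP per remote process */
    long peers = nodes * PROCS_PER_NODE - 1;
    return (double)PROCS_PER_NODE * peers * QP_BYTES / (1024.0 * 1024.0);
}

static double dct_mem_mb(long nodes)
{
    (void)nodes;  /* the DCT pool does not grow with the cluster */
    return (double)PROCS_PER_NODE * DCT_OBJECTS * QP_BYTES / (1024.0 * 1024.0);
}

int main(void)
{
    long sizes[] = {8, 2000, 10000, 100000};
    printf("%10s %16s %16s\n", "nodes", "RC (MB/host)", "DCT (MB/host)");
    for (int i = 0; i < 4; i++)
        printf("%10ld %16.1f %16.1f\n",
               sizes[i], rc_mem_mb(sizes[i]), dct_mem_mb(sizes[i]));
    return 0;
}
```

With these illustrative numbers, RC state grows from under 10 MB per host at 8 nodes to roughly 100 GB per host at 100K nodes, while the DCT pool stays around 2 MB regardless of scale – the scaling behavior the chart above conveys.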
Accelerator and GPU Offloads
GPUDirect 1.0
[Diagrams: transmit and receive data paths through CPU, chipset, GPU memory, system memory and the InfiniBand adapter. Without GPUDirect, data is copied through system memory twice (steps 1 and 2); with GPUDirect 1.0, the GPU and InfiniBand drivers share a single system-memory buffer (step 1).]
GPUDirect RDMA
[Diagrams: transmit and receive paths with GPUDirect 1.0 vs. GPUDirect RDMA. With GPUDirect 1.0, data still passes through a shared buffer in system memory; with GPUDirect RDMA, the InfiniBand adapter accesses GPU memory directly over PCIe, bypassing system memory.]
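The usage model that GPUDirect RDMA accelerates is CUDA-aware MPI: ranks hand device pointers directly to MPI calls and the library moves the data, using GPUDirect RDMA when the hardware and drivers support it. The sketch below assumes a CUDA-aware MPI build (for example MVAPICH2-GDR); the buffer size and tag are arbitrary, and it expects at least two ranks.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats – arbitrary message size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));  /* GPU memory */

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        /* With a CUDA-aware MPI, the device pointer is passed directly;
         * GPUDirect RDMA lets the adapter read GPU memory without
         * staging the data through host memory. */
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", N);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```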
Performance of MVAPICH2 with GPUDirect RDMA
[Charts: GPU-GPU internode MPI latency (lower is better) and bandwidth (higher is better), message sizes 1 byte to 4 KBytes]
• 67% lower latency – 5.49 usec
• 5X increase in throughput
Source: Prof. DK Panda
Remote GPU Access through rCUDA
[Diagram: GPU as a Service. Client side: the CUDA application runs on top of the rCUDA library and the network interface. Server side (GPU servers): the rCUDA daemon runs on top of the CUDA driver and runtime. Each client CPU sees virtual GPUs backed by the pool of physical GPUs.]
rCUDA provides remote access from every node to any GPU in the system
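As a rough illustration of the model (an assumption-laden sketch, not rCUDA documentation): the program below is ordinary CUDA runtime code; under rCUDA the same API calls are intercepted by the client-side rCUDA library and executed by the rCUDA daemon on a remote GPU server, so the application source does not change.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Plain CUDA runtime code – nothing here is rCUDA-specific; rCUDA
 * forwards these calls over the interconnect to a GPU server. */
__global__ void scale(float *v, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main(void)
{
    const int n = 1024;
    float h[1024], *d;

    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc((void **)&d, n * sizeof(float));   /* a remote GPU under rCUDA */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[10] = %f\n", h[10]);   /* expect 20.0 */
    return 0;
}
```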
rCUDA Performance Comparison
Solutions for MPI/SHMEM/PGAS
Fabric Collective Accelerations
Collective Operation Challenges at Large Scale
• Collective algorithms are not topology aware and can be inefficient
• Congestion due to many-to-many communications
• Slow nodes and OS jitter affect scalability and increase variability
[Chart: ideal vs. actual collective performance]
Mellanox Collectives Acceleration Components
CORE-Direct
• US Department of Energy (DOE) funded project – ORNL and Mellanox
• Adapter-based hardware offloading for collective operations
• Includes floating-point capability on the adapter for data reductions
• CORE-Direct API is exposed through the Mellanox drivers
FCA
• FCA is a software plug-in package that integrates into available MPIs
• Provides scalable, topology-aware collective operations
• Utilizes powerful InfiniBand multicast and QoS capabilities
• Integrates CORE-Direct collective hardware offloads
The Effects of System Noise on Applications Performance
• Minimizing the impact of system noise on applications – critical for scalability
[Chart: ideal vs. with system noise vs. with CORE-Direct offload]
CORE-Direct Enables Computation and Communication Overlap
• Provides support for overlapping computation and communication
[Chart: synchronous execution vs. CORE-Direct asynchronous execution]
Nonblocking Alltoall (Overlap-Wait) Benchmark
CORE-Direct offload allows the Alltoall benchmark to run with almost 100% of CPU time available for computation, as sketched below.
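For reference, here is a minimal sketch (not the actual benchmark code) of the overlap-wait pattern using the standard MPI-3 nonblocking MPI_Ialltoall, which implementations integrated with FCA/CORE-Direct can offload to the adapter; the message size and the dummy compute loop are arbitrary placeholders.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Overlap-wait pattern: post a nonblocking Alltoall, do useful
 * computation while the collective progresses, then wait.  With
 * CORE-Direct the collective is progressed by the adapter, so
 * nearly all CPU time remains available for the compute phase. */
int main(int argc, char **argv)
{
    int size, count = 4096;   /* elements per peer – arbitrary choice */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t)size * count * sizeof(double));
    double *recvbuf = malloc((size_t)size * count * sizeof(double));
    for (long i = 0; i < (long)size * count; i++)
        sendbuf[i] = (double)i;

    double acc = 0.0;
    for (int iter = 0; iter < 100; iter++) {
        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req);

        /* placeholder compute overlapped with the collective */
        for (int i = 0; i < 1000000; i++)
            acc += (double)i * 1e-9;

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    printf("compute result: %f\n", acc);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```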
Summary
The Only Provider of End-to-End 40/56Gb/s Solutions
From Data Center to Metro and WAN
X86, ARM and Power based Compute and Storage Platforms
The Interconnect Provider For 10Gb/s and Beyond
Host/Fabric Software | ICs | Switches/Gateways | Adapter Cards | Cables/Modules | Metro/WAN
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
Thank You