© 2018 Mellanox Technologies | Confidential
Paving the Road to Exascale
November 2018
Interconnect Your Future
Highest-Performance 200Gb/s Interconnect Solutions
- Transceivers, Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
- 40 HDR (200Gb/s) InfiniBand ports or 80 HDR100 InfiniBand ports; throughput of 16Tb/s, <90ns latency
- 200Gb/s adapter, 0.6us latency, 215 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
- 16 400GbE, 32 200GbE, or 128 25/50GbE ports (10 / 25 / 40 / 50 / 100 / 200 GbE); throughput of 6.4Tb/s
- MPI, SHMEM/PGAS, and UPC for commercial and open-source applications; leverages hardware accelerations
- System-on-Chip and SmartNIC: programmable adapter, smart offloads
The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload) Data-Centric (Offload)
Must Wait for the Data
Creates Performance Bottlenecks
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
[Diagram: "Onload Network" vs. "In-Network Computing" systems built from CPU and GPU nodes]
Analyze Data as it Moves!
Higher Performance and Scale
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload) Data-Centric (Offload)
Communications Latencies of 30-40us
Intelligent Interconnect Paves the Road to Exascale Performance
[Diagram: CPU-centric and data-centric systems built from CPU and GPU nodes]
Communications Latencies of 3-4us
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Reliable, Scalable, General-Purpose Primitive
- In-network tree-based aggregation mechanism
- Large number of groups
- Multiple simultaneous outstanding operations

Applicable to Multiple Use Cases
- HPC applications using MPI / SHMEM
- Distributed machine learning applications

Scalable High-Performance Collective Offload
- Barrier, Reduce, All-Reduce, Broadcast and more
- Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
- Integer and floating-point, 16/32/64 bits
[Diagram: SHARP tree with a root, aggregation nodes, and endnodes (processes running on HCAs)]
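The tree-based aggregation described above can be sketched in a few lines of Python. This is a toy model of the idea only, not Mellanox's implementation: each aggregation node combines the partial results of its children, so only one value per subtree travels toward the root instead of one message per process.

```python
# Toy model of in-network tree aggregation (SHARP-style idea): each
# aggregation node sums its children's partial results, so only one
# value per subtree moves up the tree instead of N values to one root.

def aggregate(tree, values, node):
    """Recursively reduce the subtree rooted at `node` with a sum."""
    children = tree.get(node, [])
    if not children:
        # Endnode: contributes its own local value
        return values[node]
    # Aggregation node: combine the partial sums of its children
    return sum(aggregate(tree, values, c) for c in children)

# A 2-level tree: a root switch over two leaf switches and four host processes
tree = {"root": ["sw1", "sw2"], "sw1": ["h1", "h2"], "sw2": ["h3", "h4"]}
values = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}

print(aggregate(tree, values, "root"))  # same result as a flat sum: 10
```

In the real protocol the reduction runs inside the switch ASICs; the point of the sketch is only the data movement pattern.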
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables 75% Reduction in Latency
Providing Scalable Flat Latency
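Why the offloaded latency stays nearly flat can be illustrated with a toy cost model. The per-hop numbers below are assumptions for illustration, not measurements: a host-based allreduce pays a software processing cost at every level of its reduction tree, while switch-based aggregation pays only a small hardware cost per level.

```python
# Toy latency model (assumed per-hop costs, illustration only): both
# schemes traverse log2(N) tree levels, but the software cost per level
# dominates host-based allreduce, while in-network aggregation adds only
# a small hardware cost per level, so its curve is nearly flat.

import math

SW_PER_HOP_US = 1.5   # assumed software processing cost per tree level
HW_PER_HOP_US = 0.1   # assumed switch aggregation cost per tree level
BASE_US = 1.0         # assumed fixed wire/NIC latency

def host_based_allreduce_us(nodes):
    return BASE_US + SW_PER_HOP_US * math.ceil(math.log2(nodes))

def in_network_allreduce_us(nodes):
    return BASE_US + HW_PER_HOP_US * math.ceil(math.log2(nodes))

for n in (128, 1024):
    print(n, host_based_allreduce_us(n), in_network_allreduce_us(n))
```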
SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology
SHARP Enables Highest Performance
SHARP Performance – Application (OSU)
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Source: Prof. DK Panda, Ohio State University
Performs the Gradient Averaging
Replaces All Physical Parameter Servers
Accelerates AI Performance
SHARP Accelerates AI Performance
The CPU in a parameter server becomes the bottleneck
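The collective being offloaded here is an allreduce over gradients: every worker ends up with the average of all workers' gradients, with no central parameter server in the path. A minimal plain-Python stand-in (no MPI, illustrative only):

```python
# Sketch of data-parallel gradient averaging, the collective SHARP offloads:
# instead of shipping every worker's gradients to a central parameter
# server, an allreduce averages them element-wise across all workers.

def allreduce_average(worker_grads):
    """Average gradient vectors element-wise across all workers."""
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Four workers, each holding a local gradient vector after backprop
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_average(grads))  # every worker ends up with [4.0, 5.0]
```

With SHARP, the element-wise sum runs in the switches while the gradients are in flight, which is why the CPU-bound parameter server disappears from the critical path.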
SHARP Performance Advantage for AI
SHARP provides a 16% performance increase for deep learning (initial results): TensorFlow with Horovod running the ResNet50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)
SHIELD - Self-Healing Interconnect Technology
- The ability to overcome network failures locally, in the switches
- Software-based solutions suffer from long delays in detecting network failures: 5-30 seconds for clusters of 1K to 10K nodes
- Accelerates network recovery time by 5000X
- The higher the speed or scale, the greater the recovery value
- Available with EDR and HDR switches and beyond
- Enables Unbreakable Data Centers
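The claimed 5000X speedup can be sanity-checked with simple arithmetic: the slide's 5-30 second software recovery window, divided by 5000, lands in the low-millisecond range for hardware-based local rerouting. The input numbers are from the slide; the division is ours.

```python
# Back-of-the-envelope check of the 5000X recovery speedup: software-based
# recovery of 5-30 s, divided by 5000, gives 1-6 ms in hardware.

SPEEDUP = 5000

def hw_recovery_ms(sw_recovery_s):
    """Hardware recovery time in ms given software recovery time in seconds."""
    return sw_recovery_s * 1000.0 / SPEEDUP

for sw_s in (5, 30):
    print(f"{sw_s}s software recovery -> {hw_recovery_ms(sw_s):.0f} ms in hardware")
```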
SHIELD: Consider a Flow From A to B
[Diagram: data flowing from Server A to Server B]
SHIELD: The Simple Case: Local Fix
[Diagram: data from Server A to Server B rerouted locally around the failure]
SHIELD: The Remote Case - Using Fault Recovery Notifications
[Diagram: a Fault Recovery Notification (FRN) travels back toward Server A, and data is rerouted to Server B along an alternate path]
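The reroute the FRN triggers can be illustrated with a toy graph model. This is our own sketch, not SHIELD's actual algorithm: on a link failure, traffic falls back to the next-shortest surviving path, found here with a plain BFS.

```python
# Toy illustration of rerouting around a failed link: the primary path
# A -> sw1 -> sw2 -> B breaks when link sw1-sw2 fails, and traffic falls
# back to the surviving path through sw3. BFS finds the shortest path.

from collections import deque

def shortest_path(links, src, dst):
    """BFS shortest path over an undirected link set; None if disconnected."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    prev, seen, q = {}, {src}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                prev[v] = u
                q.append(v)
    return None

links = {("A", "sw1"), ("sw1", "sw2"), ("sw2", "B"),
         ("sw1", "sw3"), ("sw3", "sw2")}
print(shortest_path(links, "A", "B"))                     # primary path
print(shortest_path(links - {("sw1", "sw2")}, "A", "B"))  # reroute via sw3
```

The key SHIELD property is that this fallback decision is made by the switch adjacent to the failure, in hardware, rather than by a centralized software manager.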
Network Topologies
Supporting Variety of Topologies
Torus, Dragonfly, Fat Tree, Hypercube
Traditional Dragonfly vs Dragonfly+
[Diagram: traditional Dragonfly groups compared with Dragonfly+ groups]
Dragonfly+ Topology
- Several "groups", connected using all-to-all links
- The topology inside each group can be any topology
- Reduces total cost of network (fewer long cables)
- Utilizes Adaptive Routing for efficient operation
- Simplifies future system expansion

[Diagram: full graph connecting every group to all other groups; Group 1 holds hosts 1..H, Group 2 holds H+1..2H, up to Group G holding ..GH]

1200-Node Dragonfly+ System Example
[Diagram: groups G1, G2, G3, each built from switches with 20 HCAs each]
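The group structure described above is easy to model. A small hypothetical sketch (the group sizes match the slide's 1200-node example: 3 groups of 20 leaf switches with 20 HCAs each, but the helper names are ours):

```python
# Sketch of Dragonfly+ structure: a full graph between groups (one global
# link per group pair) and a simple host count. Helper names are ours;
# sizes below reproduce the slide's 1200-node example.

import itertools

def dragonfly_plus_global_links(num_groups):
    """Full graph between groups: one global link per pair of groups."""
    return list(itertools.combinations(range(num_groups), 2))

def total_hosts(num_groups, leaves_per_group, hosts_per_leaf):
    return num_groups * leaves_per_group * hosts_per_leaf

print(len(dragonfly_plus_global_links(6)))  # 6*5/2 = 15 group-pair links
print(total_hosts(3, 20, 20))               # 3 x 20 x 20 = 1200 hosts
```

The number of long (global) cables grows with the number of group pairs, not with the host count inside each group, which is where the cost saving over a full fat tree comes from.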
Future Expansion of Dragonfly+ Based System
- Topology expansion of a Fat Tree, or a regular / Aries-like Dragonfly, requires one of the following:
  - Reduction of early-phase bisection bandwidth due to reservation of ports on the network switches
  - Re-cabling the long cables
- Dragonfly+ is the only topology that allows system expansion at zero cost:
  - Maintains bisection bandwidth
  - No port reservation
  - No re-cabling

Phase 1: 11 x 400 = 4400 hosts
[Diagram: Phase 1 Dragonfly+ system]
Future Expansion of Dragonfly+ Based System
- Re-cable the central racks: a change local to the rack
- Phase 1: 11 x 400 = 4400 hosts
- Phase 2: +10 x 400, for 8400 hosts in total

[Diagram: Phase 2 expansion of the Dragonfly+ system]
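The phase arithmetic above checks out with a two-line model, using the 400 hosts per group shown in the slides:

```python
# Checking the expansion arithmetic: Phase 1 deploys 11 groups of 400
# hosts (4400); Phase 2 adds 10 more groups of 400, reaching 8400 hosts.

HOSTS_PER_GROUP = 400

def hosts(groups):
    return groups * HOSTS_PER_GROUP

phase1 = hosts(11)            # 4400
phase2 = phase1 + hosts(10)   # 8400
print(phase1, phase2)
```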
Thank You