Boosting Scalability of InfiniBand-based HPC Clusters
© 2010 Voltaire Inc.
Asaf Wachtel, Senior Product Manager
InfiniBand-based HPC Clusters: Scalability Challenges
► Cluster TCO Scalability
• Hardware costs
• Software license costs
• Space, power & cooling
► Communication Scalability
• Handle increasing compute power
• Multi-core, GPUs
► Utilization Scalability
• Many jobs & users
• Varying sizes, traffic patterns & QoS
► Application Scalability
• Home-grown or ISVs
• MPI collectives
© 2010 Voltaire Inc. | SC10
Voltaire 40Gb/s InfiniBand Portfolio
Fabric provisioning and performance monitoring; application acceleration
40Gb/s InfiniBand Switching Platforms
HSSM, SSI Blade Switch
4700: 324/648 x IB ports
4200: 162 x IB ports
4036: 36 x IB ports
4036E: 34 x IB ports + 2 x 1/10GbE
Scalable Architectures
► Fat Tree
• Full bisectional bandwidth at any node count
• Uniform oversubscription options
► HyperScale
• Scales to thousands of nodes with linear performance
• Large non-blocking islands (more than 2,000 cores)
• 4-hop maximum latency to any port
• Lowest number of switches and cables
► Torus
• Lowest-cost solution
• Built entirely with edge switches and copper cables
• Optimized support by Voltaire software, including Torus2QoS routing
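The fat-tree option above follows directly from switch radix. As a minimal sketch of the standard two-tier fat-tree arithmetic (illustrative only, not Voltaire tooling; the function name is hypothetical), a fabric built from radix-r switches supports r²/2 hosts at full bisectional bandwidth, and proportionally more with oversubscription:

```python
def fat_tree_max_hosts(radix: int, oversub: float = 1.0) -> int:
    """Max hosts of a two-tier fat tree built from radix-`radix` switches.

    `oversub` is the per-leaf downlink:uplink ratio (1.0 = full
    bisectional bandwidth). Each leaf splits its ports between hosts
    and spine uplinks; at most `radix` leaves can hang off the spine.
    """
    uplinks = radix / (1 + oversub)       # ports toward the spine
    downlinks = radix - uplinks           # ports toward hosts
    return int(radix * downlinks)         # leaves * hosts-per-leaf

# Radix-36 edge switches (like the 4036 above) give a 648-port
# non-blocking two-tier fabric, matching the largest director
# port count in the portfolio:
print(fat_tree_max_hosts(36))          # 648
print(fat_tree_max_hosts(36, 2.0))     # 864 with 2:1 oversubscription
```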
HyperScale in the Top500
► Large, low-latency, non-blocking islands
► Lowest number of switches & cables
► Scales to thousands of nodes with linear performance
► 1,200-node interconnect in only 2 racks
8:1 Oversubscribed Core
13 x non-blocking HyperScale islands
1.05 PFLOPs, 83.7% efficiency
The Challenge: Static Routing Inefficiency
► The Challenge: One-Size Routing Does Not Fit All
• Static routing assumes uniform traffic across the entire fabric
• Real life is different
• Most jobs use a small portion of the cluster
• Different traffic patterns for different jobs
• Different requirements for different traffic types (e.g. storage)
► The Solution: Voltaire TARA™ (Traffic Aware Routing Algorithm)
• A new routing algorithm on top of OpenSM
• Dynamically optimizes routing according to defined traffic patterns: fabric topology, job-specific communication patterns, symmetric/asymmetric communication, traffic load/QoS
• Fully integrated with leading job schedulers
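The difference between traffic-blind and traffic-aware placement can be sketched in a few lines. This is illustrative only, not the actual TARA algorithm (whose internals the slides do not disclose); both function names and the toy workload are hypothetical. Static routing spreads destinations over uplinks round-robin regardless of how much traffic each flow carries, while a traffic-aware scheme places each flow on the currently least-loaded uplink:

```python
def static_route(flows, uplinks):
    """Traffic-blind: destinations assigned to uplinks round-robin."""
    load = {u: 0 for u in uplinks}
    for i, (dst, traffic) in enumerate(flows):
        load[uplinks[i % len(uplinks)]] += traffic
    return load

def traffic_aware_route(flows, uplinks):
    """Greedy traffic-aware placement: heaviest flows first,
    each onto the uplink with the least accumulated load."""
    load = {u: 0 for u in uplinks}
    for dst, traffic in sorted(flows, key=lambda f: -f[1]):
        best = min(load, key=load.get)
        load[best] += traffic
    return load

# Four uplinks; two heavy flows and six light ones. Round-robin
# happens to stack both heavy flows on the same uplink:
flows = [("n1", 100), ("n2", 10), ("n3", 10), ("n4", 10),
         ("n5", 100), ("n6", 10), ("n7", 10), ("n8", 10)]
links = ["u1", "u2", "u3", "u4"]
print(max(static_route(flows, links).values()))         # hottest link: 200
print(max(traffic_aware_route(flows, links).values()))  # hottest link: 100
```

The hot-spot on the static side is exactly the uneven per-port weight the following chart illustrates; balancing by expected load flattens it.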
TARA – Traffic Aware Routing Algorithm: Maximizing Cluster Utilization
[Figure: per-port traffic weight by switch.port for the internal ports on the line cards, comparing OpenSM without UFM TARA against UFM TARA ON]
The Challenge: Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to "all-to-all" communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the more poorly collectives scale in the fabric
[Figure: three panels vs. # ranks: % of collectives out of total run time, total run time, and run-time variance]
Significant Inhibitor to MPI Application Scalability
Introducing: Voltaire Fabric Collective Accelerator
► Grid Director Switches: collective operations offloaded to switch CPUs (fabric processing power)
► Unified Fabric Manager (UFM): topology-aware orchestrator
► FCA Manager: topology-based collective tree; separate virtual network; IB multicast for result distribution
► FCA Agent: inter-core processing, localized & optimized
Breakthrough performance with no additional hardware
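The scaling benefit of a topology-based collective tree can be shown with a toy model. This is a sketch of the general idea only, not the FCA implementation: ranks under each switch reduce locally first, per-switch partials are combined up the tree, and the result fans back out (FCA uses IB multicast for that last step). The function name and message-count model are illustrative assumptions:

```python
def tree_allreduce(groups, op=sum):
    """Hierarchical allreduce over `groups`, a list of per-switch
    lists of rank values. Returns the global result plus inter-switch
    message counts for the tree scheme vs. a flat all-to-all exchange."""
    partials = [op(g) for g in groups]   # step 1: reduce under each switch
    result = op(partials)                # step 2: combine partials up the tree
    n = sum(len(g) for g in groups)
    tree_msgs = 2 * len(groups)          # one up + one multicast down per switch
    flat_msgs = n * (n - 1)              # every rank exchanges with every other
    return result, tree_msgs, flat_msgs

# 12 ranks spread across 3 switches:
groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
total, tree_msgs, flat_msgs = tree_allreduce(groups)
print(total, tree_msgs, flat_msgs)   # 78 6 132
```

The gap between the two message counts grows quadratically with rank count, which is why flat all-to-all collectives congest the fabric as clusters scale while a per-switch tree does not.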
FCA – Fabric Collective Accelerator: Unmatched Application Scalability
► First and only system-wide solution for offloading MPI collectives
► Accelerates MPI collective computation by as much as 100X
► 10-40% improvement in application runtime
► Integrated with leading MPI implementations
[Figure: Fluent truck_111m benchmark, 192 cores: run time with PMPI vs. PMPI + FCA]
Summary
► Reduce total cost of ownership via scalable topologies (HyperScale)
► Increase cluster utilization via Traffic Aware Routing (TARA)
► Boost application scalability using Fabric Collective Acceleration (FCA)
More Performance for each $ Spent