Boosting Scalability of InfiniBand-based HPC Clusters
© 2010 Voltaire Inc.
Asaf Wachtel, Senior Product Manager
InfiniBand-based HPC Clusters: Scalability Challenges
► Cluster TCO Scalability
• Hardware costs
• Software license costs
• Space, power & cooling
► Communication Scalability
• Handle increasing compute power
• Multi-core, GPUs
► Utilization Scalability
• Many jobs & users
• Varying sizes, traffic patterns & QoS
► Application Scalability
• Home-grown or ISVs
• MPI collectives
© 2010 Voltaire Inc. | SC10
Voltaire 40Gb/s InfiniBand Portfolio
Fabric provisioning and performance monitoring; application acceleration
40Gb/s InfiniBand Switching Platforms
HSSM, SSI Blade Switch
4700: 324/648 x IB ports
4200: 162 x IB ports
4036: 36 x IB ports
4036E: 34 x IB ports + 2 x 1/10GbE
Scalable Architectures
► Fat Tree
• Full bisectional bandwidth at any node count
• Uniform oversubscription options
► HyperScale
• Scales to thousands of nodes with linear performance
• Large non-blocking islands (more than 2,000 cores)
• 4-hop maximum latency to any port
• Lowest number of switches and cables
► Torus
• Lowest-cost solution
• Built entirely with edge switches and copper cables
• Optimized support by Voltaire software, including Torus2QoS routing
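The fat-tree option above follows directly from switch radix. As a minimal sketch of the standard two-tier fat-tree arithmetic (illustrative only, not Voltaire tooling; the function name is hypothetical), a fabric built from radix-r switches supports r²/2 hosts at full bisectional bandwidth, and proportionally more with oversubscription:

```python
def fat_tree_max_hosts(radix: int, oversub: float = 1.0) -> int:
    """Max hosts of a two-tier fat tree built from radix-`radix` switches.

    `oversub` is the per-leaf downlink:uplink ratio (1.0 = full
    bisectional bandwidth). Each leaf splits its ports between hosts
    and spine uplinks; at most `radix` leaves can hang off the spine.
    """
    uplinks = radix / (1 + oversub)       # ports toward the spine
    downlinks = radix - uplinks           # ports toward hosts
    return int(radix * downlinks)         # leaves * hosts-per-leaf

# Radix-36 edge switches (like the 4036 above) give a 648-port
# non-blocking two-tier fabric, matching the largest director
# port count in the portfolio:
print(fat_tree_max_hosts(36))          # 648
print(fat_tree_max_hosts(36, 2.0))     # 864 with 2:1 oversubscription
```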
HyperScale in the Top500
► Large, low-latency, non-blocking islands
► Lowest number of switches & cables
► Scales to thousands of nodes with linear performance
► 1,200-node interconnect in only 2 racks
8:1 Oversubscribed Core
13 x non-blocking HyperScale islands
1.05 PFLOPs, 83.7% efficiency
The Challenge: Static Routing Inefficiency
► The Challenge: One-Size Routing Does Not Fit All
• Static routing assumes uniform traffic across the entire fabric
• Real life is different
• Most jobs use a small portion of the cluster
• Different traffic patterns for different jobs
• Different requirements for different traffic types (e.g. storage)
► The Solution: Voltaire TARA™ (Traffic Aware Routing Algorithm)
• A new routing algorithm on top of OpenSM
• Dynamically optimizes routing according to defined traffic patterns: fabric topology, job-specific communication patterns, symmetric/asymmetric communication, traffic load/QoS
• Fully integrated with leading job schedulers
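The difference between traffic-blind and traffic-aware placement can be sketched in a few lines. This is illustrative only, not the actual TARA algorithm (whose internals the slides do not disclose); both function names and the toy workload are hypothetical. Static routing spreads destinations over uplinks round-robin regardless of how much traffic each flow carries, while a traffic-aware scheme places each flow on the currently least-loaded uplink:

```python
def static_route(flows, uplinks):
    """Traffic-blind: destinations assigned to uplinks round-robin."""
    load = {u: 0 for u in uplinks}
    for i, (dst, traffic) in enumerate(flows):
        load[uplinks[i % len(uplinks)]] += traffic
    return load

def traffic_aware_route(flows, uplinks):
    """Greedy traffic-aware placement: heaviest flows first,
    each onto the uplink with the least accumulated load."""
    load = {u: 0 for u in uplinks}
    for dst, traffic in sorted(flows, key=lambda f: -f[1]):
        best = min(load, key=load.get)
        load[best] += traffic
    return load

# Four uplinks; two heavy flows and six light ones. Round-robin
# happens to stack both heavy flows on the same uplink:
flows = [("n1", 100), ("n2", 10), ("n3", 10), ("n4", 10),
         ("n5", 100), ("n6", 10), ("n7", 10), ("n8", 10)]
links = ["u1", "u2", "u3", "u4"]
print(max(static_route(flows, links).values()))         # hottest link: 200
print(max(traffic_aware_route(flows, links).values()))  # hottest link: 100
```

The hot-spot on the static side is exactly the uneven per-port weight the following chart illustrates; balancing by expected load flattens it.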
TARA – Traffic Aware Routing Algorithm: Maximizing Cluster Utilization
[Figure: per-port traffic weight by switch.port for the internal ports on the line cards, comparing OpenSM without UFM TARA against UFM TARA ON]
The Challenge: Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to "all-to-all" communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the more poorly collectives scale in the fabric
[Figure: three panels vs. # ranks: % of collectives out of total run time, total run time, and run-time variance]
Significant Inhibitor to MPI Application Scalability
Introducing: Voltaire Fabric Collective Accelerator
► Grid Director Switches: collective operations offloaded to switch CPUs (fabric processing power)
► Unified Fabric Manager (UFM): topology-aware orchestrator
► FCA Manager: topology-based collective tree; separate virtual network; IB multicast for result distribution
► FCA Agent: inter-core processing, localized & optimized
Breakthrough performance with no additional hardware
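The scaling benefit of a topology-based collective tree can be shown with a toy model. This is a sketch of the general idea only, not the FCA implementation: ranks under each switch reduce locally first, per-switch partials are combined up the tree, and the result fans back out (FCA uses IB multicast for that last step). The function name and message-count model are illustrative assumptions:

```python
def tree_allreduce(groups, op=sum):
    """Hierarchical allreduce over `groups`, a list of per-switch
    lists of rank values. Returns the global result plus inter-switch
    message counts for the tree scheme vs. a flat all-to-all exchange."""
    partials = [op(g) for g in groups]   # step 1: reduce under each switch
    result = op(partials)                # step 2: combine partials up the tree
    n = sum(len(g) for g in groups)
    tree_msgs = 2 * len(groups)          # one up + one multicast down per switch
    flat_msgs = n * (n - 1)              # every rank exchanges with every other
    return result, tree_msgs, flat_msgs

# 12 ranks spread across 3 switches:
groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
total, tree_msgs, flat_msgs = tree_allreduce(groups)
print(total, tree_msgs, flat_msgs)   # 78 6 132
```

The gap between the two message counts grows quadratically with rank count, which is why flat all-to-all collectives congest the fabric as clusters scale while a per-switch tree does not.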
FCA – Fabric Collective Accelerator: Unmatched Application Scalability
► First and only system-wide solution for offloading MPI collectives
► Accelerates MPI collective computation by as much as 100X
► 10-40% improvement in application runtime
► Integrated with leading MPI implementations
[Figure: Fluent truck_111m benchmark, 192 cores: run time with PMPI vs. PMPI + FCA]
Summary
► Reduce total cost of ownership via scalable topologies (HyperScale)
► Increase cluster utilization via Traffic Aware Routing (TARA)
► Boost application scalability using Fabric Collective Acceleration (FCA)
More Performance for each $ Spent