![Page 1: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/1.jpg)
Feroz Zahid, Ernst Gunnar Gran, Tor Skeie, Simula Research Laboratory, Norway
Bartosz Bogdański, Bjørn Dag Johnsen, Oracle Corporation
PDP 2015, Turku, Finland
March 5, 2015
A weighted fat-tree routing algorithm for efficient load-balancing in InfiniBand clusters
![Page 2: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/2.jpg)
InfiniBand (IB) is a popular interconnect for HPC systems
44.8% share in the November 2014 Top500 supercomputers list
Source: Top500 Supercomputers List, http://top500.org/
![Page 3: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/3.jpg)
Network performance in HPC systems depends on three important factors
Routing
Network Topology
Traffic Patterns
![Page 4: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/4.jpg)
Many different topologies are found in real-world clusters: Ring, Kautz, Torus, Clos, Fat-trees
Fat-tree and its variants are very common in IB networks
• k-ary-n-tree
  • n levels, $k^n$ nodes, $n \cdot k^{n-1}$ switches
  • 2k ports on each switch
  • Each switch has an equal number of up and down connections
  • Only half of the ports of the root switches are used
• XGFTs
  • More generalized
  • Allow a different number of up and down connections on switches
  • Also allow a different number of connections at each level
• PGFTs
  • Allow multiple connecting links between switches
• RLFTs
  • Restrictions on PGFTs
  • Switches with the same port count at all levels
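The k-ary-n-tree counts above are simple enough to check numerically. The following is a minimal sketch; the function name `kary_ntree_size` and the dictionary keys are illustrative choices, not taken from the slides.

```python
def kary_ntree_size(k: int, n: int) -> dict:
    """Return the element counts of a k-ary-n-tree.

    Uses the formulas from the slide: k^n end nodes,
    n * k^(n-1) switches, and 2k ports per switch.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive integers")
    return {
        "levels": n,
        "nodes": k ** n,
        "switches": n * k ** (n - 1),
        "ports_per_switch": 2 * k,
    }


if __name__ == "__main__":
    for k, n in [(4, 2), (8, 3), (16, 3)]:
        print(f"{k}-ary-{n}-tree:", kary_ntree_size(k, n))
```

The node counts this produces can be compared against the "No. of End Nodes" column of the execution-time table later in the talk.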
![Page 5: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/5.jpg)
Fat-trees have nice properties that make them popular
• Maintenance of full-bisection bandwidth
• Easy deadlock-free routing
• Fault tolerance
[Figure labels: nodes A and B; Up and Down directions]
![Page 6: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/6.jpg)
Routing in IB networks is generally deterministic
Based on linear forwarding tables (LFTs) stored in the switches
Deterministic routing is traffic oblivious!
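To make "deterministic and traffic oblivious" concrete, here is a minimal sketch of a forwarding-table lookup. It assumes a toy table modeled as a Python dictionary from destination LID to output port; the table contents and the function name `forward` are hypothetical and only illustrate that the decision depends on the destination alone.

```python
from typing import Dict

# Hypothetical forwarding table for one switch:
# destination LID -> output port number.
LinearForwardingTable = Dict[int, int]


def forward(lft: LinearForwardingTable, dlid: int) -> int:
    """Return the output port for a packet, using only its destination LID.

    No load or traffic information is consulted, so the same destination
    always leaves through the same port.
    """
    try:
        return lft[dlid]
    except KeyError:
        raise LookupError(f"no route for destination LID {dlid}") from None


# Toy table: LIDs 1 and 2 share port 3, LID 3 uses port 4.
lft = {1: 3, 2: 3, 3: 4}

if __name__ == "__main__":
    for dlid in (1, 2, 3):
        print(f"LID {dlid} -> port {forward(lft, dlid)}")
```

Because the lookup ignores current load, two busy destinations mapped to the same port keep sharing it until the tables are recomputed.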
![Page 7: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/7.jpg)
Routing in fat-tree networks can be source based or destination based, and can be closed form or iterative
• Source-based
  • Out-port for a packet at a switch is based on the source node identifier
• Destination-based
  • Out-port for a packet at a switch is based on the destination node identifier
• Closed form
  • D-mod-K, S-mod-K
• Iterative

    for each leaf switch lf
        for each node connected to lf
            id <= node identifier
            route_downgoing_go_up(id)
            ...
        end for
    end for
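The closed-form schemes named above can be sketched in a few lines. This is one common textbook formulation for k-ary-n-trees, not something spelled out on the slide: it assumes nodes are numbered 0 to k^n - 1, that level 0 denotes the leaf switches, and that up ports are numbered 0 to k - 1. The function names are illustrative.

```python
def dmodk_up_port(dest: int, k: int, level: int) -> int:
    """D-mod-K: choose the up port from the destination identifier.

    Assumed convention: level 0 is the leaf level, and the port index
    is the level-th base-k digit of the destination identifier.
    """
    return (dest // k ** level) % k


def smodk_up_port(src: int, k: int, level: int) -> int:
    """S-mod-K: the same digit selection, applied to the source identifier."""
    return (src // k ** level) % k


if __name__ == "__main__":
    k = 4
    for dest in (1, 5, 9, 13):
        # These destinations share the same lowest base-4 digit, so they
        # map to the same up port at the leaf level.
        print(dest, "->", dmodk_up_port(dest, k, level=0))
```

Under this formulation, destinations whose identifiers agree in the relevant digit always map to the same up port, which is what makes the choice closed form rather than iterative.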
![Page 8: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/8.jpg)
OFED’s fat-tree routing algorithm tends to spread the routes across the tree using counters
Ref: Zahavi, Eitan, et al. "Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns." Concurrency and Computation: Practice and Experience 22.2 (2010): 217-231.
OFED is the de-facto standard software stack for building and deploying IB based applications
• Deterministic
  • High performance, avoids out-of-order packet delivery
• Destination-based
  • Direct realization in IB networks
• Iterative
  • Better route balancing
• Maintains counters on ports
  • A port's counter is incremented by 1 when a new route is added
• Supports XGFTs, PGFTs, RLFTs
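The counter idea can be illustrated with a small sketch. This is not OFED's implementation; it only assumes, as the slide states, that each port keeps a counter that grows by one per assigned route, and it additionally assumes that the port with the smallest counter is chosen, with ties broken by the lowest port number. The function name is hypothetical.

```python
from typing import Dict


def pick_least_loaded_port(counters: Dict[int, int]) -> int:
    """Pick the port with the fewest routes and record the new route.

    Ties are broken by the lowest port number (an assumption made here
    so that the choice stays deterministic).
    """
    port = min(counters, key=lambda p: (counters[p], p))
    counters[port] += 1  # one more route now uses this port
    return port


if __name__ == "__main__":
    counters = {1: 0, 2: 0, 3: 0}  # three up ports, no routes yet
    chosen = [pick_least_loaded_port(counters) for _ in range(4)]
    print("ports chosen:", chosen)
    print("final counters:", counters)
```

Note that the counters only count routes. They say nothing about how much traffic each routed destination actually receives.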
![Page 9: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/9.jpg)
“Multi-stage switches are not cross-bars!”
The effective bisection-bandwidth depends on the traffic pattern
Ref: Hoefler, Torsten, Timo Schneider, and Andrew Lumsdaine. "Multistage switches are not crossbars: Effects of static routing in high-performance networks." Cluster Computing, 2008
Nodes 1 and 4 share the same index position in their leaf switches
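A toy model shows why a shared index position matters under static routing. The sketch below assumes a hypothetical two-level tree with k nodes per leaf switch, nodes numbered from 0, and an up port at the source leaf switch chosen by the destination's index (`dst % k`). None of these details come from the slide's figure; they are stand-ins chosen so that nodes 1 and 4 share an index.

```python
from collections import Counter
from typing import Iterable, Tuple


def uplink_loads(flows: Iterable[Tuple[int, int]], k: int) -> Counter:
    """Count flows per (source leaf switch, up port) in a toy two-level tree.

    Assumptions: node i sits on leaf switch i // k, and the up port used
    for a destination is its index within its leaf switch, dst % k.
    """
    loads: Counter = Counter()
    for src, dst in flows:
        src_leaf, dst_leaf = src // k, dst // k
        if src_leaf == dst_leaf:
            continue  # traffic stays inside the leaf switch
        loads[(src_leaf, dst % k)] += 1
    return loads


if __name__ == "__main__":
    k = 3
    # Destinations 1 and 4 have the same index (1), so both flows
    # leave leaf switch 2 through the same up link.
    print(uplink_loads([(6, 1), (7, 4)], k))
    # Destinations 1 and 5 have different indices and use separate links.
    print(uplink_loads([(6, 1), (7, 5)], k))
```

The same pair of senders gets either a shared or a dedicated up link depending only on which destinations they address, which is the sense in which effective bisection bandwidth depends on the traffic pattern.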
![Page 12: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/12.jpg)
We identify two important issues with the fat-tree routing algorithm as implemented by OFED’s subnet manager
• Node traffic oblivious routing
  • All nodes are treated equally
  • Node roles are ignored
• Non-predictable performance
  • Nodes are routed in an order that depends on the port numbers
  • Port numbering is hard to set
    • Sysadmins do not care about it
    • Addition of new nodes
  • Which nodes share links?
    • Depends on the indexing sequence!
![Page 13: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/13.jpg)
Some nodes tend to receive more traffic than others, so routes towards those nodes are more likely to be congested
Nodes 4 and 5 are more likely to receive traffic, e.g. storage nodes
We call these nodes receiver nodes!
![Page 16: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/16.jpg)
648-port fat-tree is a common building block for HPC systems
![Page 17: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/17.jpg)
Result: The probability of index collision for receiver nodes is very high for node-oblivious routing
With 2 receiver nodes per switch, there is a probability of about 90% that two receiver nodes will share the same index!
![Page 18: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/18.jpg)
The weighted fat-tree routing algorithm (wFatTree) assigns weights to the nodes
The algorithm is still deterministic!
• All compute nodes are assigned a new parameter
  • Receive weight
• Weights can be assigned based on
  • Known node roles, e.g. storage nodes
  • Known traffic priorities, e.g. following QoS levels
  • Traffic profiling
• Nodes are routed in decreasing order of their weights
  • Not based on port numbering
  • Predictable
• Port selection is based on both
  • Downward weight
  • Upward weight
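The routing order described above can be sketched as a sort. The slide only says that nodes are routed in decreasing order of weight; the tie-break on node name below is an assumption added so that the order is fully deterministic, and the node names and weights are made up for illustration.

```python
from typing import Dict, List


def routing_order(weights: Dict[str, int]) -> List[str]:
    """Return node names in decreasing order of receive weight.

    Nodes with equal weight are ordered by name (an assumed tie-break),
    so the result does not depend on port numbering or insertion order.
    """
    return sorted(weights, key=lambda node: (-weights[node], node))


if __name__ == "__main__":
    # Hypothetical weights: two storage nodes weigh more than compute nodes.
    weights = {"n1": 1, "n4": 10, "n5": 10, "n2": 1}
    print(routing_order(weights))
```

Routing the heaviest nodes first means they are placed while the port counters are still empty, so they are the least likely to end up sharing a link with each other.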
![Page 19: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/19.jpg)
Port selection in wFatTree uses both downward and upward weights
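One way to combine the two weights is a lexicographic comparison. The sketch below is an assumption about how such a selection could work, not the authors' exact rule: it picks the port with the least accumulated downward weight and uses the upward weight, then the port number, to break ties. The `Port` class and `select_port` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Port:
    number: int
    down_weight: int = 0  # accumulated weight of routes going down via this port
    up_weight: int = 0    # accumulated weight of routes going up via this port


def select_port(ports: List[Port], node_weight: int) -> Port:
    """Choose a port for a new downward route and charge the node's weight.

    Assumed rule: least downward weight first, then least upward weight,
    then lowest port number.
    """
    best = min(ports, key=lambda p: (p.down_weight, p.up_weight, p.number))
    best.down_weight += node_weight
    return best


if __name__ == "__main__":
    ports = [Port(1, down_weight=10), Port(2, up_weight=5), Port(3, up_weight=1)]
    first = select_port(ports, node_weight=10)
    second = select_port(ports, node_weight=1)
    print("first route uses port", first.number)
    print("second route uses port", second.number)
```

Compared with a plain route counter, charging the node's weight means one heavy receiver can occupy a port as much as many light ones.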
![Page 20: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/20.jpg)
Result: Evaluation on 648-port fat-tree shows substantial improvements in total network bandwidth
18 Switches with receiver nodes
27 Switches with receiver nodes
All 36 Switches with receiver nodes
![Page 22: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/22.jpg)
Result: wFatTree minimizes the total contention on the links by balancing routes
![Page 24: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/24.jpg)
Result: The wFatTree execution time is competitive with the original fat-tree routing
| Topology | No. of End Nodes | Fat Tree Routing | wFatTree Routing |
| --- | --- | --- | --- |
| 4-ary-2-tree | 16 | 0.167 | 0.255 |
| 8-ary-2-tree | 64 | 0.318 | 0.365 |
| 16-ary-2-tree | 256 | 1.686 | 2.268 |
| 8-ary-3-tree | 512 | 16.386 | 19.657 |
| 12-ary-3-tree | 1728 | 188.856 | 230.639 |
| 16-ary-3-tree | 4096 | 1029.369 | 1434.287 |
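To put the two timing columns side by side, the relative overhead of wFatTree can be computed directly from the table. The sketch below copies the values as given; the slide does not state the time unit, so the result is reported only as a unitless ratio. The names `TIMINGS` and `relative_overhead` are illustrative.

```python
# (topology, end nodes, fat-tree routing time, wFatTree routing time),
# copied from the table above. The time unit is not stated on the slide.
TIMINGS = [
    ("4-ary-2-tree", 16, 0.167, 0.255),
    ("8-ary-2-tree", 64, 0.318, 0.365),
    ("16-ary-2-tree", 256, 1.686, 2.268),
    ("8-ary-3-tree", 512, 16.386, 19.657),
    ("12-ary-3-tree", 1728, 188.856, 230.639),
    ("16-ary-3-tree", 4096, 1029.369, 1434.287),
]


def relative_overhead(baseline: float, new: float) -> float:
    """Return (new - baseline) / baseline, the extra time as a fraction."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    for topology, nodes, ftree, wftree in TIMINGS:
        overhead = relative_overhead(ftree, wftree)
        print(f"{topology:>14} ({nodes:>4} nodes): {overhead:6.1%} extra time")
```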
![Page 25: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/25.jpg)
Future Work: Enable smart network provisioning – Four important components
[Diagram labels: Nodes with weights; Balanced traffic, better routes; Optimized algorithms; Smart routing, reconfiguration, load balancing, congestion control; IB congestion control; Performance; Adjusting to load; Optimization; Monitor -> Optimize -> Execute loop]
![Page 26: A weighted fat-tree routing algorithm for efficient load ...](https://reader031.fdocuments.us/reader031/viewer/2022012514/618d674936d92477104e4546/html5/thumbnails/26.jpg)
In summary, weighted fat-tree routing improves actual load-balancing in IB based fat-tree networks
[Figure panels: State-of-the-art fat-tree routing with oblivious path assignment; The weighted fat-tree routing with better load-balancing]
Questions?