Distributed Adaptive Routing for Big-Data Applications Running on Data Center Networks
Transcript of Distributed Adaptive Routing for Big-Data Applications Running on Data Center Networks
Slide 1: Distributed Adaptive Routing for Big-Data Applications Running on Data Center Networks
Eitan Zahavi*+, Isaac Keslassy+, Avinoam Kolodny+
* Mellanox Technologies Ltd., + Technion - EE Department
ANCS 2012
Slide 2: Big Data – Larger Flows
- Data-set sizes keep rising
- Web2 and Cloud Big-Data applications
- Data center traffic changes to: longer, higher-BW, and fewer flows
Slide 3: Static Routing of Big-Data = Low BW
- Static routing cannot balance a small number of flows
- Congestion: when the BW of a link's flows > the link capacity
- When longer and higher-BW flows contend:
  - On a lossy network: packet drop → BW drop
  - On a lossless network: congestion spreading → BW drop
[Figure: a data flow crossing self-routing (SR) switches]
Slide 4: Traffic-Aware Load Balancing Systems
- Adaptive Routing adjusts routing to the network load
- Centralized: flows are routed according to "global" knowledge by a Central Routing Control
- Distributed: each flow is routed by its input switch with "local" knowledge (a Self-Routing (SR) unit)
[Figure: a centralized system with one routing controller vs. a distributed system with SR units at the input switches]
Slide 5: Central vs. Distributed Adaptive Routing
Distributed Adaptive Routing is scalable but has only local knowledge, and it is reactive.

| Property     | Central Adaptive Routing | Distributed Adaptive Routing |
|--------------|--------------------------|------------------------------|
| Scalability  | Low                      | High                         |
| Knowledge    | Global                   | Local (to keep scalability)  |
| Non-Blocking | Yes                      | Unknown                      |
Slide 6: Research Question
Can a scalable Distributed Adaptive Routing system perform like a centralized system and produce non-blocking routing assignments in a reasonable time?
Slide 7: Trial and Error Is Fundamental to Distributed AR
- Randomize output port (Trial 1); send the traffic → Contention 1; un-route the contending flow
- Randomize a new output port (Trial 2); send the traffic → Contention 2; un-route the contending flow
- Randomize a new output port (Trial 3); send the traffic → Convergence!
[Figure: SR switches re-routing flows across the trials]
Slide 8: Routing Trials Cause BW Loss
- Packet simulation: R1 is delivered, followed by G1; R2 is stuck behind G1; after a re-route, R3 arrives before R2
- Out-of-order packet delivery!
- The implication is a significant drop in flow BW: TCP* sees out-of-order delivery as packet drop and throttles the senders (see the "Incast" papers...)
- * Or any other reliable transport
[Figure: packets R1, R2, R3 and G1 traversing SR switches]
Slide 9: Research Plan
1. Analyze Distributed Adaptive Routing systems
2. Find how many routing trials are required to converge
3. Find conditions that make the system reach a non-blocking assignment in a reasonable time
[Figure: event timeline – new traffic, Trial 1, Trial 2, ..., Trial N, no contention]
Slide 10: A Simple Policy for Selecting a Flow to Re-Route
- At each time step, each output switch requests a re-route of its single worst contending flow
- At t=0: a new traffic pattern is applied; randomize output ports and send the flows
- At t=0.5: request re-routes
- Repeat for t=t+1 until there is no contention
[Figure: a Clos-like network with m middle switches, r output switches, and n ports per input/output switch]
Slide 11: Evaluation
- Measure the average number of iterations I to convergence
- I is exponential in the system size!
Slide 12: A Balls-and-Bins Representation
- Each output switch is a "balls and bins" system: the bins are the switch's input links, and the balls are the flows on those links
- Assume 1 ball (= flow) is allowed in each bin (= link)
- A "good" bin has ≤ 1 ball; bins are either "empty" (no ball), "good" (one ball), or "bad" (more than one ball)
[Figure: middle switches 1..m feeding an output switch, with its bins marked empty, bad, and good]
Slide 13: System Dynamics
- There are two reasons for ball moves: an Improvement or an Induced move
- Balls are numbered by their input-switch number
[Figure: output switches 1 and 2, each with bins for middle switches 1-4; an Improvement move of a ball at one output switch induces a move of the same-numbered ball at the other]
Slide 14: The "Last" Step Governs Convergence
- Estimated Markov chain models
- What is the probability that the required last Improvement does not cause a bad Induced move?
- Each one of the r output switches must complete that step
- Therefore the convergence time is exponential in r
[Figure: per-output-switch Markov chains with states A, B, C, D between the absorbing "Good" end and the "Bad" end, replicated for output switches 1 through r]
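The exponential dependence can be made concrete with a back-of-the-envelope bound (a sketch under simplifying assumptions, not the paper's estimated chain): suppose that when an output switch is one Improvement away from its absorbing all-good state, that last step completes without triggering a bad Induced move elsewhere with some probability q < 1, independently across switches. Then

```latex
\Pr[\text{all } r \text{ output switches take their last step cleanly}] = q^{r},
\qquad
\mathbb{E}[\text{rounds to converge}] \gtrsim q^{-r},
```

which grows exponentially with r, consistent with the measured iteration counts.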
Slide 15: Introducing p
- Assume a symmetrical system: all flows have the same BW
- What if Flow_BW < Link_BW? The network load is Flow_BW/Link_BW
- p = how many balls (flows) are allowed in one bin (link)
[Figure: SR switches with links annotated p=1 and p=2]
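As a quick numeric illustration of the definition (the bandwidth values are assumed for the example, not taken from the slides):

```python
def flows_per_link(link_bw_gbps, flow_bw_gbps):
    """p = number of equal-BW flows ("balls") one link ("bin") can carry."""
    return int(link_bw_gbps // flow_bw_gbps)

# Full-link-BW flows give p = 1; half-link-BW flows give p = 2.
print(flows_per_link(40, 40))  # -> 1
print(flows_per_link(40, 20))  # -> 2
```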
Slide 16: p has a Great Impact on Convergence
- Measure the average number of iterations I to convergence
- I shows a very strong dependency on p
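A small Monte Carlo sketch (illustrative parameters, not the paper's experiment) hints at why p matters so much: with the same number of flows, the expected number of "bad" bins after a single random routing trial falls sharply when each bin tolerates p = 2 balls instead of p = 1, so far fewer re-routes are needed per trial.

```python
import random
from collections import Counter

def avg_bad_bins(m, n, p, trials=2000, seed=0):
    """Throw n balls (flows) uniformly at random into m bins (links),
    where a bin is "bad" when it holds more than p balls; return the
    average number of bad bins over many independent trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        loads = Counter(rng.randrange(m) for _ in range(n))
        total += sum(1 for load in loads.values() if load > p)
    return total / trials

# Same offered traffic (n = m flows); only the per-link allowance differs.
full_bw = avg_bad_bins(m=36, n=36, p=1)  # full-link-BW flows
half_bw = avg_bad_bins(m=36, n=36, p=2)  # half-link-BW flows
print(full_bw, half_bw)
```

Halving the flow BW does not merely halve the contention; it collapses the number of bad bins a random trial produces, which is consistent with the very strong dependency of I on p.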
Slide 17: Implementable Distributed System
- Replace congestion detection by flow-count with QCN, detected at the middle-switch output rather than at the output-switch input
- Replace "worst flow selection" with congested-flow sampling
- Implemented as an extension to a detailed InfiniBand flit-level model

Results on a 1152-node, 2-level fat-tree:

| Parameter | SNB (m=2n=2r), full BW: 40Gbps | RNB (m=n=r), full BW: 40Gbps | RNB (m=n=r), 48% BW: 19.2Gbps |
|---|---|---|---|
| Avg of Avg Throughput | 3,770MB/s | 2,302MB/s | 1,890MB/s |
| Min of Avg Throughput | 3,650MB/s | 2,070MB/s | 1,890MB/s |
| Avg of Avg Pkt Network Latency | 17.03usec | 32usec | 0.56usec |
| Max of Max Pkt Network Latency | 877.5usec | 656usec | 11.68usec |
| Avg of Avg Out-of-order/In-order Ratio | 1.87 / 1 | 5.2 / 1 | 0.0023 / 1 |
| Max of Avg Out-of-order/In-order Ratio | 3.25 / 1 | 7.4 / 1 | 0.0072 / 1 |
| Avg of Avg Out-of-order Window | 6.9pkts | 5.1pkts | 2.28pkts |
| Max of Avg Out-of-order Window | 8.9pkts | 5.7pkts | 5pkts |
Slide 18: 52% Load on a 1152-node Fat-Tree
- No change in the number of adaptations over time!
- No convergence
Slide 19: 48% Load on a 1152-node Fat-Tree
[Figure: switch routing adaptations per 10usec vs. t [sec]]
Slide 20: Conclusions
- Study: Distributed Adaptive Routing of Big-Data flows
- Focus: time to convergence to a non-blocking routing
- Learning: the cause of the slow convergence
- Corollary: half-link-BW flows converge in a few iterations
- Evaluation: 1152-node fat-tree simulations reproduce these results

Distributed Adaptive Routing of half-Link_BW flows is both Non-Blocking and Scalable.