EE382C Final Project
Crouching Tiger, Hidden Dragonfly
Alexander Neckar
Camilo Moreno
Matthew Murray
Ziyad Abdel Khaleq
Outline
• Topology considerations and layout
• Routing solution
• Mirroring and simulation
• Results and conclusion
Dragonfly Topology
Fully-connected local groups
Low hop count
Fast access to global links
Dragonfly Topology
Load balance:
Global links per router >= endpoints per router
~All traffic is bound for other groups, so global BW should fit.
Local links per router ~= endpoints + global links per router (or more)
~All traffic traverses a local link before and after its global hop (see the load-check sketch after this slide).
Adaptive Routing helps deal with adversarial traffic.
As long as overall BW is sufficient
And we have good backpressure
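A minimal back-of-envelope check of these balance rules (a sketch, not the project code), using the standard uniform-random channel-load estimates for a dragonfly with p endpoints per router, a routers per group, and h global links per router. The configurations are the ones from the cost slides later in the deck:

    // Per-channel load estimates under uniform random, minimal routing:
    //   global load ~ p/h     (essentially all injected traffic leaves the group)
    //   local load  ~ (p+h)/a (one local hop before and one after the global hop)
    // A load above 1.0 means that link class is oversubscribed.
    #include <cstdio>

    static void check_balance(int p, int a, int h) {
      double global_load = double(p) / h;
      double local_load  = double(p + h) / a;
      std::printf("%2dx%2dx%2d: global load %.2f, local load %.2f\n",
                  p, a, h, global_load, local_load);
    }

    int main() {
      check_balance(13, 26, 13);  // "Basic": both ~1.0, balanced for uniform random
      check_balance(10, 32, 10);  // overprovisioned in-group links
      check_balance(10, 45,  5);  // half the "necessary" global links: global load ~2
      return 0;
    }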
Considerations
Costs
Optical links drive cost
Minimize number, good utilization
Local links much cheaper
Overprovisioning helps feed global links
Physical layout
fully-connected group size limit (5m cables)
Considerations
Power
Links dominate power
Traffic
Mostly limited in throughput by send window (RPC).
Some (RDMA) traffic uses very large packets.
Hotspots.
So... what?
Layout Considerations
Maybe as many as 60 racks per group!
Layout Considerations
Realistically, 34ish
Layout Considerations
Maximize racks per group?
routers on bottom slots, wire diagonally
Actually not a constraint
Balance / cost issues with very large groups.
100m optical cables
~70m square: 147 x 50 racks: >200K rack slots
Chips
Channels:
5 GB/s = 4 diff. pairs @ 10 Gb/s
1 optical cable
4 elec. cable pairs each direction
Chip size is perimeter-driven
Buffers + crossbar are only a few mm².
High-radix requires large perimeter for I/O
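As a rough illustration (assuming the radix-51 router from the cost tables below and the 4 differential pairs per direction per channel above): 51 ports x 4 pairs x 2 directions ≈ 408 differential pairs, i.e. over 800 high-speed signal pins, so the I/O perimeter sets the die size long before the few mm² of buffers and crossbar do.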
Exploring options
Lots of guesstimation!
Basic
Topology: 13x26x13
Cost: 6.16M
Power: 68 kW
Router radix: 51
Optical links: 57,291
Electrical links: 110,175
Groups: 339
Endpoints/group: 338
>114K nodes
Balanced for uniform random
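The derived figures in this and the following tables follow from (p, a, h) alone; a minimal sketch of that arithmetic (cost and power need per-link prices that are not listed here, so they are left out):

    // Derived quantities for a p x a x h dragonfly: p endpoints/router,
    // a routers/group (fully connected locally), h global links/router.
    #include <cstdio>

    int main() {
      int p = 13, a = 26, h = 13;                    // the "Basic" 13x26x13 design
      int radix = p + (a - 1) + h;                   // 51
      long groups = (long)a * h + 1;                 // 339
      long endpts_per_group = (long)a * p;           // 338
      long nodes = groups * endpts_per_group;        // 114,582 (>114K)
      long opt_links  = groups * a * h / 2;          // 57,291 global (optical) links
      long elec_links = groups * a * (a - 1) / 2;    // 110,175 local (electrical) links
      std::printf("radix %d, %ld groups, %ld nodes, %ld optical, %ld electrical\n",
                  radix, groups, nodes, opt_links, elec_links);
      return 0;
    }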
Cheaper, better?
Topology: 10x32x10
Cost: 5.64M
Power: 70.7 kW
Router radix: 51
Optical links: 51,360
Electrical links: 159,216
Groups: 321
Endpoints/group: 320
Fewer optical cables
Overprovisioned in-group links
8.5% cheaper
4% higher power
A little more savings
Topology: 10x34x9
Cost: 5.22M
Power: 70.5 kW
Router radix: 52
Optical links: 46,971
Electrical links: 172,227
Groups: 307
Endpoints/group: 340
90% of the normal global links
Overprovisioned in-group links
Even cheaper
Any good?
What if...?
Topology: 10x45x5
Cost: 3.11M
Power: 65.9 kW
Router radix: 59
Optical links: 25,425
Electrical links: 223,740
Groups: 226
Endpoints/group: 450
Half the "necessary" global links
Very overprovisioned in-group links
Otherwise we would fall short of 100K nodes
Almost half the price!
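For scale: 226 groups x 450 endpoints/group = 101,700 nodes, so the larger groups are what keep this design above the 100K-node target; with p = 10 injecting endpoints but only h = 5 global links per router, the global links carry roughly twice their balanced load (p/h = 2) under uniform traffic.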
Improving Global Adaptive Routing
I feel the need…the need for speed.
Challenges
Quick congestion detection
Quick and accurate return to minimal
Tricks with credits, etc., can provide stiff backpressure
How do we avoid incorrectly taking the non-minimal route?
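For reference, the baseline ("old") decision these algorithms build on is, we assume, BookSim's UGAL-style queue comparison: take the minimal route when

    min_queue_len * min_hops <= nonmin_queue_len * nonmin_hops + threshold

where the queue lengths are the candidate output queues, the hop counts are the lengths of the two paths, and the threshold biases the choice toward minimal.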
Solution idea
Use the rate of change of the queue to provide quick congestion detection and quick return to minimal
Potential advantages:
More accurate representation of network performance
Rapid detection
Potential problems:
Sensitivity to burstiness
Our Work
ROC (smoothed queue rate of change): ROC = 0.99*prev_ROC + 0.01*cur_ROC
Developed two new routing algorithms (sketch below):
"ROC": take minimal if min_queue_rate < 2*nonmin_queue_rate || min_queue_rate < 0
"Combo": take minimal if the old algorithm says so || min_queue_rate < 0
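A minimal sketch of these decisions (names and structure are illustrative, not the project's actual BookSim changes; each condition is read as "take the minimal route when it holds", by analogy with UGAL):

    #include <cstdio>

    // Exponentially weighted rate of change (ROC) of one output queue's
    // occupancy, updated every cycle.
    struct QueueRoc {
      int    prev_len = 0;
      double roc      = 0.0;                  // smoothed rate of change, flits/cycle
      void update(int cur_len) {
        int cur_roc = cur_len - prev_len;     // instantaneous change this cycle
        roc = 0.99 * roc + 0.01 * cur_roc;    // ROC = 0.99*prev_ROC + 0.01*cur_ROC
        prev_len = cur_len;
      }
    };

    // "ROC": minimal wins if its queue grows less than twice as fast as the
    // non-minimal queue, or is actually draining (quick return to minimal).
    bool roc_take_minimal(double min_rate, double nonmin_rate) {
      return (min_rate < 2.0 * nonmin_rate) || (min_rate < 0.0);
    }

    // "Combo": the original UGAL-style test, OR-ed with "minimal queue is draining".
    bool combo_take_minimal(bool ugal_take_minimal, double min_rate) {
      return ugal_take_minimal || (min_rate < 0.0);
    }

    int main() {
      QueueRoc q;
      int lens[] = {0, 3, 7, 12, 10, 8};      // example queue occupancies per cycle
      for (int len : lens) q.update(len);
      std::printf("smoothed ROC = %.3f, ROC alg takes minimal vs 0.05: %d\n",
                  q.roc, roc_take_minimal(q.roc, 0.05));
      return 0;
    }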
Results
1024 nodes, 2*p = 2*h = a = 8, injection
Uniform:
2% increase in average latency, 5% increase in max latency for both ROC and Combo
Bad_dragon:
ROC = 69% of the original's average latency, 82% of its max
Combo = 72% of average, 90% of max
Bad Dragon Results
[Bar chart, normalized to Original = 100: average latency, max latency, and hop count for Original, ROC, and Combo under bad_dragon traffic]
Simulation Challenge
Booksim's cycle-accurate nature is at odds with simulating our very large system
std::bad_alloc...
Solution: Slicing
Do a fraction of the work and get all of the results!
How do we leave components out of the simulation and still effectively model the entire network?
Slicing idea 1: Scaledown
a = 8, h = 2
Idea: Relationships
Forget about hotspots for a minute...
Slicing Idea 2: Mirroring
Routing
Mirroring with Hotspots
Results for Different Topologies
p/a/h
p: Endpoints per switch
a: Switches per group
h: Global links per switch
100,000 nodes with “Project Traffic”
Best from 10/32/10 @ 3.0277 Million Cycles
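For reference, a p/a/h dragonfly has p * a * (a*h + 1) nodes; for 10/32/10 that is 10 * 32 * 321 = 102,720, just above the 100,000-node target.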
Simulation Results For 13 / 26 / 13
[Bar chart, normalized to Original = 100 — average latency: 100 / 97 / 97.4; hops: 100 / 108.14 / 106.43 (Original / ROC / Combo). Runtimes: 3,217,516 / 3,209,757 / 3,247,934 cycles.]
Simulation Results For 10 / 32 / 10
[Bar chart, normalized to Original = 100 — average latency: 100 / 99.3 / 97.44; hops: 100 / 112.89 / 110.9 (Original / ROC / Combo). Runtimes: 3,064,421 / 3,027,714 / 3,054,955 cycles.]
Simulation Results For 10 / 32 / 10 with 10 Hotspots
[Bar chart, normalized to Original = 100 — average latency: 100 / 97.37 / 98.58; hops: 100 / 113.071 / 111.1 (Original / ROC / Combo). Runtimes: 3,057,401 / 3,025,221 / 3,063,628 cycles.]
Other Simulation Results
16 / 28 / 8:
Runtime: 4,130,224 cycles
Average latency: 519.74 (too big)
10 / 45 / 5 (half the global links):
Runtime: 4,190,192 cycles
Average latency: 528.51
Conclusion
ROC always wins in average latency and runtime cycles.
At a small cost in additional power (4%) over the basic 13 / 26 / 13, the 10 / 32 / 10 topology gives higher performance at lower cost.
The simulated hotspot scenario is pessimistic; even so, the numbers are fine.
Questions