EE382C Final Project
Crouching Tiger, Hidden Dragonfly
Alexander Neckar
Camilo Moreno
Matthew Murray
Ziyad Abdel Khaleq
Outline
• Topology considerations and layout
• Routing solution
• Mirroring and simulation
• Results and conclusion
Dragonfly Topology
Fully-connected local groups
Low hop count
Fast access to global links
Dragonfly Topology
Load balance:
Global links per router >= endpoints per router
~All traffic is bound for other groups, so global BW should fit.
Local links per router ~= endpoints + global links per router (or more)
~All traffic traverses a local link before and after its global hop (see the load-check sketch after this slide).
Adaptive Routing helps deal with adversarial traffic.
As long as overall BW is sufficient
And we have good backpressure
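A minimal back-of-envelope check of these balance rules (a sketch, not the project code), using the standard uniform-random channel-load estimates for a dragonfly with p endpoints per router, a routers per group, and h global links per router. The configurations are the ones from the cost slides later in the deck:

    // Per-channel load estimates under uniform random, minimal routing:
    //   global load ~ p/h     (essentially all injected traffic leaves the group)
    //   local load  ~ (p+h)/a (one local hop before and one after the global hop)
    // A load above 1.0 means that link class is oversubscribed.
    #include <cstdio>

    static void check_balance(int p, int a, int h) {
      double global_load = double(p) / h;
      double local_load  = double(p + h) / a;
      std::printf("%2dx%2dx%2d: global load %.2f, local load %.2f\n",
                  p, a, h, global_load, local_load);
    }

    int main() {
      check_balance(13, 26, 13);  // "Basic": both ~1.0, balanced for uniform random
      check_balance(10, 32, 10);  // overprovisioned in-group links
      check_balance(10, 45,  5);  // half the "necessary" global links: global load ~2
      return 0;
    }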
Considerations
Costs
Optical links drive cost
Minimize number, good utilization
Local links much cheaper
Overprovisioning helps feed global links
Physical layout
fully-connected group size limit (5m cables)
Considerations
Power
Links dominate power
Traffic
Mostly limited in throughput by send window (RPC).
Some (RDMA) traffic uses very large packets.
Hotspots.
So... what?
Layout Considerations
Maybe as many as 60 racks per group!
Layout Considerations
Realistically, 34ish
Layout Considerations
Maximize racks per group?
routers on bottom slots, wire diagonally
Actually not a constraint
Balance / cost issues with very large groups.
100m optical cables
~70m square: 147 x 50 racks: >200K rack slots
Chips
Channels:
5 GB/s = 4 diff. pairs @ 10 Gb/s
1 optical cable
4 elec. cable pairs each direction
Chip size is perimeter-driven
Buffers + crossbar are only a few mm².
High-radix requires large perimeter for I/O
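As a rough illustration (assuming the radix-51 router from the cost tables below and the 4 differential pairs per direction per channel above): 51 ports x 4 pairs x 2 directions ≈ 408 differential pairs, i.e. over 800 high-speed signal pins, so the I/O perimeter sets the die size long before the few mm² of buffers and crossbar do.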
Exploring options
Lots of guesstimation!
Basic
Topology: 13x26x13
Cost: 6.16M
Power: 68 kW
Router radix: 51
Optical links: 57,291
Electrical links: 110,175
Groups: 339
Endpoints/group: 338
>114K nodes
Balanced for uniform random
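The derived figures in this and the following tables follow from (p, a, h) alone; a minimal sketch of that arithmetic (cost and power need per-link prices that are not listed here, so they are left out):

    // Derived quantities for a p x a x h dragonfly: p endpoints/router,
    // a routers/group (fully connected locally), h global links/router.
    #include <cstdio>

    int main() {
      int p = 13, a = 26, h = 13;                    // the "Basic" 13x26x13 design
      int radix = p + (a - 1) + h;                   // 51
      long groups = (long)a * h + 1;                 // 339
      long endpts_per_group = (long)a * p;           // 338
      long nodes = groups * endpts_per_group;        // 114,582 (>114K)
      long opt_links  = groups * a * h / 2;          // 57,291 global (optical) links
      long elec_links = groups * a * (a - 1) / 2;    // 110,175 local (electrical) links
      std::printf("radix %d, %ld groups, %ld nodes, %ld optical, %ld electrical\n",
                  radix, groups, nodes, opt_links, elec_links);
      return 0;
    }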
Cheaper, better?
Topology: 10x32x10
Cost: 5.64M
Power: 70.7 kW
Router radix: 51
Optical links: 51,360
Electrical links: 159,216
Groups: 321
Endpoints/group: 320
Fewer optical cables
Overprovisioned in-group links
8.5% cheaper
4% higher power
A little more savings
Topology: 10x34x9
Cost: 5.22M
Power: 70.5 kW
Router radix: 52
Optical links: 46,971
Electrical links: 172,227
Groups: 307
Endpoints/group: 340
90% of the normal global links
Overprovisioned in-group links
Even cheaper
Any good?
What if...?
Topology: 10x45x5
Cost: 3.11M
Power: 65.9 kW
Router radix: 59
Optical links: 25,425
Electrical links: 223,740
Groups: 226
Endpoints/group: 450
Half the "necessary" global links
Very overprovisioned in-group links
Otherwise we would fall short of 100K nodes
Almost half the price!
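For scale: 226 groups x 450 endpoints/group = 101,700 nodes, so the larger groups are what keep this design above the 100K-node target; with p = 10 injecting endpoints but only h = 5 global links per router, the global links carry roughly twice their balanced load (p/h = 2) under uniform traffic.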
Improving Global Adaptive Routing
I feel the need…the need for speed.
Challenges
Quick congestion detection
Quick and accurate return to minimal
Tricks with credits, etc., can provide stiff backpressure
How do we avoid incorrectly taking the non-minimal route?
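For reference, the baseline ("old") decision these algorithms build on is, we assume, BookSim's UGAL-style queue comparison: take the minimal route when

    min_queue_len * min_hops <= nonmin_queue_len * nonmin_hops + threshold

where the queue lengths are the candidate output queues, the hop counts are the lengths of the two paths, and the threshold biases the choice toward minimal.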
Solution idea
Use the rate of change of the queue to provide quick congestion detection and quick return to minimal
Potential advantages:
More accurate representation of network performance
Rapid detection
Potential problems:
Sensitivity to burstiness
Our Work
ROC (smoothed queue rate of change): ROC = 0.99*prev_ROC + 0.01*cur_ROC
Developed two new routing algorithms (sketch below):
"ROC": take minimal if min_queue_rate < 2*nonmin_queue_rate || min_queue_rate < 0
"Combo": take minimal if the old algorithm says so || min_queue_rate < 0
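A minimal sketch of these decisions (names and structure are illustrative, not the project's actual BookSim changes; each condition is read as "take the minimal route when it holds", by analogy with UGAL):

    #include <cstdio>

    // Exponentially weighted rate of change (ROC) of one output queue's
    // occupancy, updated every cycle.
    struct QueueRoc {
      int    prev_len = 0;
      double roc      = 0.0;                  // smoothed rate of change, flits/cycle
      void update(int cur_len) {
        int cur_roc = cur_len - prev_len;     // instantaneous change this cycle
        roc = 0.99 * roc + 0.01 * cur_roc;    // ROC = 0.99*prev_ROC + 0.01*cur_ROC
        prev_len = cur_len;
      }
    };

    // "ROC": minimal wins if its queue grows less than twice as fast as the
    // non-minimal queue, or is actually draining (quick return to minimal).
    bool roc_take_minimal(double min_rate, double nonmin_rate) {
      return (min_rate < 2.0 * nonmin_rate) || (min_rate < 0.0);
    }

    // "Combo": the original UGAL-style test, OR-ed with "minimal queue is draining".
    bool combo_take_minimal(bool ugal_take_minimal, double min_rate) {
      return ugal_take_minimal || (min_rate < 0.0);
    }

    int main() {
      QueueRoc q;
      int lens[] = {0, 3, 7, 12, 10, 8};      // example queue occupancies per cycle
      for (int len : lens) q.update(len);
      std::printf("smoothed ROC = %.3f, ROC alg takes minimal vs 0.05: %d\n",
                  q.roc, roc_take_minimal(q.roc, 0.05));
      return 0;
    }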
Results
1024 nodes, 2*p = 2*h = a = 8, injection
Uniform:
2% increase in average latency, 5% increase in max latency for both ROC and Combo
Bad_dragon:
ROC = 69% of the original's average latency, 82% of its max
Combo = 72% of average, 90% of max
Bad Dragon Results
[Bar chart, normalized to Original = 100: average latency, max latency, and hop count for Original, ROC, and Combo under bad_dragon traffic]
Simulation Challenge
Booksim's cycle-accurate nature is at odds with simulating our very large system
std::bad_alloc...
Solution: Slicing
Do a fraction of the work and get all of the results!
How do we leave components out of the simulation and still effectively model the entire network?
Slicing idea 1: Scaledown
a = 8, h = 2
Idea: Relationships
Forget about hotspots for a minute...
Slicing Idea 2: Mirroring
Routing
Mirroring with Hotspots
Results for Different Topologies
p/a/h
p: Endpoints per switch
a: Switches per group
h: Global links per switch
100,000 nodes with “Project Traffic”
Best from 10/32/10 @ 3.0277 Million Cycles
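For reference, a p/a/h dragonfly has p * a * (a*h + 1) nodes; for 10/32/10 that is 10 * 32 * 321 = 102,720, just above the 100,000-node target.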
Simulation Results For 13 / 26 / 13
[Bar chart, normalized to Original = 100 — average latency: 100 / 97 / 97.4; hops: 100 / 108.14 / 106.43 (Original / ROC / Combo). Runtimes: 3,217,516 / 3,209,757 / 3,247,934 cycles.]
Simulation Results For 10 / 32 / 10
[Bar chart, normalized to Original = 100 — average latency: 100 / 99.3 / 97.44; hops: 100 / 112.89 / 110.9 (Original / ROC / Combo). Runtimes: 3,064,421 / 3,027,714 / 3,054,955 cycles.]
Simulation Results For 10 / 32 / 10 with 10 Hotspots
[Bar chart, normalized to Original = 100 — average latency: 100 / 97.37 / 98.58; hops: 100 / 113.071 / 111.1 (Original / ROC / Combo). Runtimes: 3,057,401 / 3,025,221 / 3,063,628 cycles.]
Other Simulation Results
16 / 28 / 8:
Runtime: 4,130,224 cycles
Average latency: 519.74 (too big)
10 / 45 / 5 (half the global links):
Runtime: 4,190,192 cycles
Average latency: 528.51
Conclusion
ROC always wins in average latency and runtime cycles.
At a small cost in additional power (4%) over the basic 13 / 26 / 13, the 10 / 32 / 10 topology gives higher performance at lower cost.
The simulated hotspot scenario is pessimistic; even so, the numbers are fine.
Questions