A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Slide 1
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$
(ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Slide 2
Alpha 21364 Network
[Figure: 21364 chips (each including a router) connected in a network; each chip attaches to Rambus memory (M) and I/O (IO). Chip labels: 21264 core, L2 cache tags, L2 cache data (two blocks), router, MC1, MC2.]
Slide 3
The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed arbitration algorithm controls the crossbar
• 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 output ports: 4 network, 2 memory/cache, 1 I/O
• Router pipeline length = 13/14 cycles
• Virtual cut-through
Slide 4
Problem: Maximize # Matches
Input Port 0    1  2  3
Input Port 1    1  2  3
Input Port 2    1  2  3
Input Port 3    1  2  3
Input Port 4    1  6  3
Input Port 5    0  2  3
Input Port 6    4  2  3
Input Port 7    5  2  3
Numbers in table cells: destination output port. An arrow in the figure indicates the older packets at each input port.
• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
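The gap between the two bullets can be checked with a short sketch. It assumes that the stray 3 in the transcript belongs to Input Port 0's row and that the last entry in each row is the oldest packet (the slide marks age only with an arrow). The names `oldest_first_matches` and `max_matches` are illustrative, and the augmenting-path search is a standard maximum bipartite matching routine, not necessarily the "smarter algorithm" the slide has in mind.

```python
# Destination output port of each queued packet, per input port (from the table).
# Assumption: the last entry in each row is the oldest packet.
REQUESTS = {
    0: [1, 2, 3],  # third entry assumed from the stray "3" in the transcript
    1: [1, 2, 3],
    2: [1, 2, 3],
    3: [1, 2, 3],
    4: [1, 6, 3],
    5: [0, 2, 3],
    6: [4, 2, 3],
    7: [5, 2, 3],
}


def oldest_first_matches(requests):
    """Each input offers only its oldest packet; each output accepts one."""
    return len({dests[-1] for dests in requests.values()})


def max_matches(requests):
    """Maximum bipartite matching via augmenting paths."""
    match = {}  # output port -> input port currently holding it

    def try_assign(inp, seen):
        for out in requests[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its holder if the holder can move.
            if out not in match or try_assign(match[out], seen):
                match[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in requests)


print(oldest_first_matches(REQUESTS), max_matches(REQUESTS))  # 1 7
```

Under these assumptions every oldest packet targets output port 3, so oldest-first yields a single match, while the seven distinct destinations in the table allow seven.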
Slide 5
Simpler Algorithms Have Fewer Matches
[Figure: # arbitration matches per cycle (0 to 7) versus % occupied input packet buffers in a 21364 router (0 to 30). Curves, listed along an arrow labelled "complexity": Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA).]
Assumes all output ports are free
Slide 6
Complexity may not pay off
[Figure: # arbitration matches per cycle (0 to 7) versus fraction of output ports occupied (0 to 0.75), @ 30% input buffer occupancy. Curves, listed along an arrow labelled "complexity": Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364).]
Slide 7
Key Results
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load
Slide 8
Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch
Implement via “connection” matrix
[Figure: connection matrix with one row per input port (input ports 0 to 3 shown) and one column per output port, annotated with the numbers 1 to 7. Cell (i,j) has signals Request, Grant, N, S, E, W.]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
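The three cell equations can be exercised in software with a minimal sketch. It assumes that N is the token arriving from the cell above (the output port is still free) and W is the token arriving from the cell to the left (the input port is still unmatched), and it always starts the sweep at cell (0,0). A hardware arbiter evaluates cells along diagonals in parallel and changes the starting cell over time; neither is modelled here. The function name `wavefront_arbitrate` is illustrative.

```python
def wavefront_arbitrate(request):
    """request[i][j] is True if input port i has a packet for output port j.

    Returns the granted (input, output) pairs, applying per cell:
        Grant = Request & N & W
        S = N & NOT(Grant)
        E = W & NOT(Grant)
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols  # N token entering the top of each column
    grants = []
    for i in range(rows):
        west = True  # W token entering the left edge of row i
        for j in range(cols):
            grant = request[i][j] and north[j] and west
            if grant:
                grants.append((i, j))
            north[j] = north[j] and not grant  # S output feeds the cell below
            west = west and not grant          # E output feeds the cell to the right
    return grants


example = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
print(wavefront_arbitrate(example))  # [(0, 0), (2, 1)]
```

Because each cell depends only on its north and west neighbours, a row-by-row sweep gives the same grants as a diagonal wave that starts from the same corner.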
Slide 9
WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles]
Slide 10
WFA Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – micropipelining waves (2) is difficult because initial cell changes every cycle
  – restarting (1) before (2) completes is complex
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]
Slide 11
Parallel Iterative Matching (PIM)
Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: unmatched output port selects an input port packet randomly
– Accept: unselected input port selects a grant randomly
[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0 and 1 and output ports 0 and 1. Output port 0 is unused in this arbitration round.]
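The three steps of one PIM iteration can be sketched as follows. This is an illustrative model, not router code: `requests` maps each input port to the set of output ports it holds packets for, and the function name `pim1` and the explicit `rng` argument are assumptions made so the example is reproducible.

```python
import random


def pim1(requests, rng):
    """One iteration of Parallel Iterative Matching.

    requests: dict mapping input port -> set of requested output ports.
    Returns a dict mapping each matched input port -> its output port.
    """
    # Nominate: every input nominates to every output it has a packet for.
    nominations = {}
    for inp in sorted(requests):
        for out in sorted(requests[inp]):
            nominations.setdefault(out, []).append(inp)

    # Grant: each output picks one nominating input at random.
    grants = {}
    for out in sorted(nominations):
        chosen = rng.choice(nominations[out])
        grants.setdefault(chosen, []).append(out)

    # Accept: each input that received grants picks one at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Input 0 wants outputs 0 and 1; input 1 wants only output 1.
reqs = {0: {0, 1}, 1: {1}}
print(pim1(reqs, random.Random(0)))
```

When both outputs grant to input 0, it can accept only one of them, so the other output stays idle and input 1 stays unmatched for that round. Further iterations exist to recover such lost matches, and PIM1 stops after the first.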
Slide 12
PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles]
Slide 13
PIM1 Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – restarting (1) before (2) completes is complex
  – same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]
Slide 14
Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha 21364 Router
Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects an input port packet based on the least-recently selected one
– Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: panels (Nominate, Grant, Accept/Reset), each showing input ports 0 and 1 and output ports 0 and 1]
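A minimal sketch of one SPAA round follows, under stated assumptions: each input port has already chosen the single packet it nominates, each output port keeps a list of input ports ordered from least to most recently selected, and the function name `spaa_round` and both data structures are hypothetical stand-ins for the hardware state.

```python
def spaa_round(nominations, lru):
    """One Nominate/Grant/Reset round.

    nominations: dict input port -> the one output port it nominates to.
    lru: dict output port -> list of input ports, least recently selected first.
         Updated in place when an output grants.
    Returns (grants, losers): grants maps output port -> winning input port;
    losers lists input ports that must renominate in a later cycle.
    """
    # Nominate: group the single nomination of each input by target output.
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    # Grant: each output picks its least-recently selected nominating input.
    grants = {}
    for out, inputs in by_output.items():
        order = lru[out]
        winner = min(inputs, key=order.index)
        grants[out] = winner
        order.remove(winner)
        order.append(winner)  # the winner becomes most recently selected

    # Reset: unselected inputs clear their state and renominate later.
    winners = set(grants.values())
    losers = [inp for inp in nominations if inp not in winners]
    return grants, losers


lru = {0: [0, 1], 1: [0, 1]}
print(spaa_round({0: 1, 1: 1}, lru))  # ({1: 0}, [1])
print(spaa_round({0: 1, 1: 1}, lru))  # ({1: 1}, [0])
```

Since every input nominates to exactly one output, no Accept step is needed: a grant is final as soon as the output port makes it.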
Slide 15
SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3) take 1 cycle each]
Slide 16
SPAA’s Advantages
+ Fewer cycles
  – 3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
  – because only one packet is nominated to one output port
+ Easier to pipeline
  – restart (1) for free input ports before (2) completes
  – only one packet nominated to one output port
  – small number (16) of in-flight packets
  – avoids any centralized matrix
  – speculative read allows data flits to follow header flits
[Pipeline diagram: stages (1), (2), (3) take 1 cycle each; the next arbitration starts 1 cycle after the previous one]
Slide 17
Summary: Simpler is Better
                          WFA              PIM1             SPAA (Alpha 21364)
# Matches Per Cycle       High             Medium           Lower
# cycles (0.18 microns)   4                4                3
Restart Rate              Every 3 cycles   Every 3 cycles   Every cycle
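The effect of the restart rate can be illustrated with simple arithmetic. The sketch below is idealized: it assumes a new arbitration starts at every restart interval with no stalls, and it counts how many arbitrations finish within a fixed window. The latency and restart values come from the table; the function name and the 12-cycle window are arbitrary choices for illustration.

```python
# (arbitration latency in cycles, restart interval in cycles), from the table
ALGORITHMS = {"WFA": (4, 3), "PIM1": (4, 3), "SPAA": (3, 1)}


def arbitrations_completed(latency, restart_interval, cycles):
    """Arbitrations that start at 0, r, 2r, ... and finish within `cycles`."""
    if cycles < latency:
        return 0
    return (cycles - latency) // restart_interval + 1


for name, (latency, restart) in ALGORITHMS.items():
    print(name, arbitrations_completed(latency, restart, 12))
# WFA 3
# PIM1 3
# SPAA 10
```

In this idealized window SPAA completes roughly three times as many arbitrations, which is how a lower match count per arbitration can still translate into higher overall throughput.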
Slide 18
Saturation Behavior
• Reasons: Hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
[Figure: 64 Node Network, Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curve: SPAA-base, with the saturation point marked.]
Slide 19
Rotary Rule
21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
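As a rough illustration of SPAA+Rotary, the sketch below changes only the output port's selection priority: nominations from network input ports win over nominations from local ports, with least-recently-selected order as the tie-break. The port numbering (inputs 0 to 3 as network ports), the function name `rotary_grant`, and the tie-break are assumptions made for the example, and the anti-starvation mechanism the slide mentions is not modelled.

```python
# Assumption: input ports 0-3 are the network ports; 4-7 are local
# (memory, cache, I/O). The numbering is hypothetical.
NETWORK_INPUTS = {0, 1, 2, 3}


def rotary_grant(nominating_inputs, lru_order):
    """Pick the input port an output port grants to under the Rotary Rule.

    nominating_inputs: input ports nominating to this output port this cycle.
    lru_order: all input ports, least recently selected first.
    """
    network = [i for i in nominating_inputs if i in NETWORK_INPUTS]
    # Network traffic already in flight takes priority over local injection.
    candidates = network if network else list(nominating_inputs)
    return min(candidates, key=lru_order.index)


# Local port 5 is least recently selected, but network port 2 still wins.
print(rotary_grant([5, 2], [5, 2, 0, 1, 3, 4, 6, 7]))  # 2
```

Holding back local ports limits how much new traffic enters the network while packets already in it drain, which is the throttling effect the slide describes.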
Slide 20
Simulation Methodology
Asim modeling infrastructure
– detailed timing model of 21364 network
– selected design points validated against RTL
Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
Slide 21
64 Node Network: Base Case
[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, SPAA-base, with the knee marked.]
• SPAA outperforms WFA & PIM1
  – 24% higher throughput at knee
Slide 22
64 Node Network: With Rotary Rule
[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]
• Rotary Rule helps both SPAA & WFA
Slide 23
Summary & Conclusions
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load