A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router

Slide 1

A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router

Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$
(ack: Richard Kessler)

Intel*, UPV!, & HP$

Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002


Slide 2

Alpha 21364 Network

[Figure: a network of 12 Alpha 21364 chips (each including a router), each with attached Rambus memory (M) and I/O (IO). Chip floorplan labels: 21264 core, L2 cache tags, L2 cache data (two blocks), router, MC1, MC2.]

Slide 3

The Alpha 21364 8x7 Router

[Figure: crossbar connecting the input ports to the output ports.]

Distributed Arbitration Algorithm Controls the Crossbar

• 8 Input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 Output ports: 4 network, 2 memory/cache, 1 I/O
• Router Pipeline Length = 13/14 cycles
• Virtual Cut-Through

Slide 4

Problem: Maximize # Matches

Numbers in table cells: destination output port. Older packets at each input port are to the right.

Input Port 0    1  2  3
Input Port 1    1  2  3
Input Port 2    1  2  3
Input Port 3    1  2  3
Input Port 4    1  6  3
Input Port 5    0  2  3
Input Port 6    4  2  3
Input Port 7    5  2  3

• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
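The gap between the two policies on this slide can be checked with a short sketch. It is not the 21364's arbiter: it only compares an oldest-packet-first nomination with a maximum bipartite matching found by augmenting paths. The table layout (oldest packet in the last column, input port 0 holding destinations 1, 2, 3) and the function names are assumptions made for illustration.

```python
# Destination output port of each queued packet, per input port.
# Assumption: the last entry in each row is the oldest packet, and
# input port 0 holds packets for output ports 1, 2 and 3.
QUEUES = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 6, 3],
    [0, 2, 3],
    [4, 2, 3],
    [5, 2, 3],
]


def oldest_first_matches(queues):
    """Each input offers only its oldest packet; each output accepts one."""
    return len({q[-1] for q in queues if q})


def max_matches(queues):
    """Maximum bipartite matching (inputs to outputs) via augmenting paths."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or re-route the input that holds it.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(queues)))


print(oldest_first_matches(QUEUES))  # 1
print(max_matches(QUEUES))           # 7
```

Every oldest packet targets output port 3, so oldest-first yields a single match, while the full matching uses all seven output ports.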

Slide 5

Simpler Algorithms Have Fewer Matches

[Figure: # Arbitration Matches Per Cycle (0 to 7) vs. % Occupied Input Packet Buffers in a 21364 router (0 to 30). Legend, annotated with a "complexity" arrow: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA).]

Assumes all output ports are free

Slide 6

Complexity may not pay off

[Figure: # Arbitration Matches Per Cycle (0 to 7) vs. Fraction of Output Ports Occupied (0 to 0.75), at 30% input buffer occupancy. Legend, annotated with a "complexity" arrow: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364).]

Slide 7

Key Results

Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)

SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively

Rotary Rule
+ avoids network saturation under very heavy load

Slide 8

Wave Front Arbiter (WFA)

Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch

Implement via “connection” matrix

[Figure: connection matrix with one row per input port (input ports 0 to 3 shown) and one column per output port (labeled 1 to 7). Cell (i,j) takes Request, N and W as inputs and produces Grant, S and E.]

Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
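The three cell equations can be exercised in software with a minimal sketch. This is a sequential evaluation, not the hardware wave: it walks the matrix row by row, carrying one N token per column and one W token per row. The function name, the `first_row` parameter (a simplified stand-in for the rotating first cell) and the sample request matrix are assumptions for illustration.

```python
def wavefront_arbitrate(request, first_row=0):
    """Return the list of granted (input, output) pairs.

    request[i][j] is truthy when input port i has a packet for output
    port j. Rows are visited starting at first_row, so that row has the
    highest priority (an assumed simplification of the rotating first cell).
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols              # N token entering each column
    grants = []
    for step in range(rows):
        i = (first_row + step) % rows
        west = True                    # W token entering this row
        for j in range(cols):
            grant = bool(request[i][j]) and north[j] and west
            if grant:
                grants.append((i, j))
            north[j] = north[j] and not grant   # S = N & NOT(Grant)
            west = west and not grant           # E = W & NOT(Grant)
    return grants


REQUEST = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
print(wavefront_arbitrate(REQUEST))               # [(0, 0), (2, 1)]
print(wavefront_arbitrate(REQUEST, first_row=1))  # [(1, 0), (2, 1)]
```

A grant consumes both tokens, so no row or column can be granted twice; moving the starting row changes which input wins a contested output.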

Slide 9

WFA Advantage & Pipeline

+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches

Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)

[Figure: pipeline stages (1) 1.5 cycles, (2) 1.5 cycles, (3) 1 cycle.]

Slide 10

WFA Limitations

- Higher number of estimated cycles
  – 4 cycles in 0.18 micron

- Harder to pipeline effectively
  – micropipelining waves (2) is difficult because initial cell changes every cycle
  – restarting (1) before (2) completes is complex
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

[Figure: two successive arbitrations, each with stages (1) 1.5 cycles, (2) 1.5 cycles, (3) 1 cycle; the second starts 3 cycles after the first.]

Slide 11

Parallel Iterative Matching (PIM)

Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: unmatched output port selects an input port packet randomly
– Accept: unselected input port selects a grant randomly

[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0 and 1 and output ports 0 and 1.]

Output Port 0 unused in this arbitration round
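A single nominate/grant/accept iteration can be sketched as follows. This is an illustrative model only, assuming a boolean request matrix and a caller-supplied random number generator; the function name `pim1` and the sample matrix are not taken from the slides.

```python
import random


def pim1(request, rng):
    """One PIM iteration; returns a dict mapping input port -> output port.

    request[i][j] is truthy when input port i has a packet for output j.
    """
    n_in, n_out = len(request), len(request[0])
    # Nominate + Grant: every input nominates to each output it has a
    # packet for, and each output grants one nominating input at random.
    grants = {}  # input port -> list of output ports that granted it
    for out in range(n_out):
        nominators = [i for i in range(n_in) if request[i][out]]
        if nominators:
            grants.setdefault(rng.choice(nominators), []).append(out)
    # Accept: an input holding several grants keeps one at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Input 0 wants both outputs, input 1 wants only output 1.
REQUEST = [
    [True, True],
    [False, True],
]
print(pim1(REQUEST, random.Random(0)))
```

When both outputs grant input 0, it can accept only one of them, which leaves the other output idle for the round, the kind of wasted slot the slide points out.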

Slide 12

PIM1 Advantage & Pipeline

+ High interaction between input and output ports reduces arbitration collisions & improves # of matches

Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)

[Figure: pipeline stages (1) 1.5 cycles, (2) 1.5 cycles, (3) 1 cycle.]

Slide 13

PIM1 Limitations

- Higher number of estimated cycles
  – 4 cycles in 0.18 micron

- Harder to pipeline effectively
  – restarting (1) before (2) completes is complex
  – same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

[Figure: two successive arbitrations, each with stages (1) 1.5 cycles, (2) 1.5 cycles, (3) 1 cycle; the second starts 3 cycles after the first.]

Slide 14

Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha 21364 Router

Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects an input port packet based on the least-recently selected one
– Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles

[Figure: panels showing input ports 0 and 1 and output ports 0 and 1, with step labels Nominate, Grant, Accept and Reset.]
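The nominate/grant/reset steps can be sketched in a few lines. This is a simplified model, not the 21364 implementation: it assumes each input port nominates its oldest queued packet, and it tracks least-recently-selected order with one list per output port. The names `spaa_round`, `queues` and `lru` are hypothetical.

```python
def spaa_round(queues, lru):
    """Run one arbitration round; returns {input port: output port}.

    queues[i] lists the destination output ports of input i's packets,
    oldest first. lru[out] lists input ports, least-recently selected
    first. Both arguments are updated in place.
    """
    # Nominate: each input nominates one packet (assumed: its oldest)
    # to exactly one output port.
    nominations = {}  # output port -> nominating input ports
    for inp, queue in enumerate(queues):
        if queue:
            nominations.setdefault(queue[0], []).append(inp)
    # Grant: each output picks its least-recently selected nominator.
    matches = {}
    for out, inputs in nominations.items():
        winner = min(inputs, key=lru[out].index)
        matches[winner] = out
        lru[out].remove(winner)
        lru[out].append(winner)     # winner becomes most recently selected
        queues[winner].pop(0)       # granted packet leaves the input buffer
    # Reset: unselected packets stay queued and are renominated next round.
    return matches


queues = [[0, 0], [0, 0]]   # both inputs keep targeting output port 0
lru = {0: [0, 1]}
print(spaa_round(queues, lru))  # {0: 0}
print(spaa_round(queues, lru))  # {1: 0}
```

Because each input nominates to only one output, an output needs no accept step; the least-recently-selected order makes the two contending inputs alternate.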

Slide 15

SPAA’s Simplicity

Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity

Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)

[Figure: pipeline stages (1) 1 cycle, (2) 1 cycle, (3) 1 cycle.]

Slide 16

SPAA’s Advantages

+ Fewer cycles
  – 3 cycles in 0.18 micron

+ Speculatively read out input buffer prior to output port arbitration because only one packet is nominated to one output port

+ Easier to pipeline
  – restart (1) for free input ports before (2) completes
  – only one packet nominated to one output port
  – small number (16) of in-flight packets
  – avoids any centralized matrix
  – speculative read allows data flits to follow header flits

[Figure: two successive arbitrations, each with stages (1), (2), (3) of 1 cycle each; the second starts 1 cycle after the first.]

Slide 17

Summary: Simpler is Better

                          WFA              PIM1             SPAA (Alpha 21364)
# Matches Per Cycle       High             Medium           Lower
# cycles (0.18 microns)   4                4                3
Restart Rate              Every 3 cycles   Every 3 cycles   Every cycle
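The effect of the cycle counts and restart rates in this summary can be worked through with a small calculation. It is a back-of-the-envelope model that treats each arbiter as a single pipeline and ignores how many matches each arbitration produces; the function name and the 12-cycle window are assumptions chosen for illustration.

```python
ARBITERS = {
    # name: (arbitration latency in cycles, cycles between restarts)
    "WFA": (4, 3),
    "PIM1": (4, 3),
    "SPAA": (3, 1),
}


def arbitrations_completed(latency, restart_interval, cycles):
    """Count arbitrations that start at cycle 0, restart_interval, ...
    and finish within the first `cycles` cycles."""
    if cycles < latency:
        return 0
    return (cycles - latency) // restart_interval + 1


for name, (latency, interval) in ARBITERS.items():
    print(name, arbitrations_completed(latency, interval, cycles=12))
# WFA 3
# PIM1 3
# SPAA 10
```

Under this model SPAA completes roughly three times as many arbitrations in the same window, which is how it can offset a lower match count per arbitration.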

Slide 18

Saturation Behavior

• Reasons: Hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load

[Figure: 64 Node Network, Random Traffic. Average Packet Latency (nanoseconds, 0 to 300) vs. Delivered flits/router/nanosecond (0 to 0.8). Curve: SPAA-base, with the saturation point marked.]

Slide 19

Rotary Rule

21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16

Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism

WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
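One way to picture the SPAA+Rotary change is as a different selection key at each output port. The sketch below is only an illustration: the port numbering (inputs 0 to 3 as the network ports), the least-recently-selected tie-break and the function name are assumptions, and the anti-starvation mechanism the rule relies on is left out.

```python
# Assumed numbering: input ports 0-3 are the four network ports; the
# remaining inputs (memory, cache, I/O) are local ports.
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_grant(nominators, lru_order):
    """Pick one nominating input port for an output port, or None.

    Network input ports beat local ports; among equals, the input that
    appears earliest in lru_order (least-recently selected) wins.
    """
    if not nominators:
        return None
    return min(
        nominators,
        key=lambda inp: (inp not in NETWORK_INPUTS, lru_order.index(inp)),
    )


lru_order = [6, 2, 0, 1, 3, 4, 5, 7]
print(rotary_grant([6, 2, 0], lru_order))  # 2
print(rotary_grant([6, 7], lru_order))     # 6
```

Local port 6 is the least-recently selected nominator in the first call, yet network port 2 wins; local ports are served only when no network port is contending.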

Slide 20

Simulation Methodology

Asim modeling infrastructure
– detailed timing model of 21364 network
– selected design points validated against RTL

Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
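The traffic mix can be made concrete with a short sketch. It is not the Asim model: it only computes the average number of coherence hops implied by the 70/30 split and draws a hop count plus a uniformly random destination. The function names and the single-destination transaction shape are assumptions for illustration.

```python
import random

# (coherence hops, fraction of transactions)
HOP_MIX = ((3, 0.7), (2, 0.3))


def expected_hops(mix=HOP_MIX):
    """Average coherence hops per transaction for the given mix."""
    return sum(hops * share for hops, share in mix)


def make_transaction(source, num_nodes, rng):
    """Draw (hops, destination) for one transaction issued by `source`.

    Assumption: the destination is uniform over all other nodes.
    """
    hops = rng.choices(
        [h for h, _ in HOP_MIX], weights=[s for _, s in HOP_MIX]
    )[0]
    dest = rng.choice([n for n in range(num_nodes) if n != source])
    return hops, dest


print(round(expected_hops(), 2))  # 2.7
print(make_transaction(source=0, num_nodes=64, rng=random.Random(1)))
```

The mix averages 0.7 × 3 + 0.3 × 2 = 2.7 coherence hops per transaction.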

Slide 21

64 Node Network: Base Case

[Figure: Random Traffic. Average Packet Latency (nanoseconds, 0 to 300) vs. Delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, SPAA-base; the knee is marked.]

• SPAA outperforms WFA & PIM1
  – 24% higher throughput at knee

Slide 22

64 Node Network: With Rotary Rule

[Figure: Random Traffic. Average Packet Latency (nanoseconds, 0 to 300) vs. Delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]

• Rotary Rule helps both SPAA & WFA

Slide 23

Summary & Conclusions

Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)

SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively

Rotary Rule
+ avoids network saturation under heavy load