
Allocator Implementations for Network-on-Chip Routers

Daniel U. Becker and William J. Dally

Concurrent VLSI Architecture Group

Stanford University

Allocator Implementations for NoC Routers 2

Overview

• Allocators have major impact on router performance
  – Zero-load latency, throughput under load, cycle time
• On-chip environment imposes stringent constraints
  – Cycle time, power, no iterative / multi-cycle allocators
• Main contributions:
  – RTL-based performance & cost evaluation of virtual channel and switch allocators for NoC routers
  – Sparse VC allocation scheme reduces delay, area & power
  – Pessimistic speculation scheme minimizes delay penalty

11/18/09

Separable Allocators

• Implement allocation as two phases
  – Local arbitration at each input
  – Global arbitration at each output
• Pros:
  – Straightforward implementation
  – Delay scales logarithmically
• Cons:
  – Arbiters within each phase are independent
  – Bad choice in first phase can limit matching
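The two-phase scheme can be sketched in a few lines of Python. This is a behavioral illustration only, not the RTL evaluated in the talk: the function names are hypothetical, and a fixed-priority arbiter stands in for whatever arbiter type a real design would use (typically round-robin). The example request matrix is chosen to show how a poor first-phase choice limits the matching.

```python
def fixed_priority_arbiter(requests):
    """Return the index of the first asserted request, or None.

    Assumption: fixed priority is used only to keep the sketch short.
    """
    for index, requested in enumerate(requests):
        if requested:
            return index
    return None


def separable_input_first(req):
    """req[i][o] is True if input i requests output o.

    Returns a dict mapping each granted input to its output.
    """
    num_in = len(req)
    num_out = len(req[0]) if req else 0

    # Phase 1: each input independently picks one of its requested outputs.
    choice = [fixed_priority_arbiter(row) for row in req]

    # Phase 2: each output independently picks one input that chose it.
    grants = {}
    for o in range(num_out):
        contenders = [choice[i] == o for i in range(num_in)]
        winner = fixed_priority_arbiter(contenders)
        if winner is not None:
            grants[winner] = o
    return grants


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
req = [[True, True],
       [True, False]]
grants = separable_input_first(req)
print(grants)
```

Both inputs pick output 0 in the first phase, so only input 0 is granted, even though the matching 0→1, 1→0 would have served both requests.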

[Figure: input-first and output-first separable allocation; axes: inputs, outputs]

Wavefront Allocator [Tamir’93]

• Consider inputs and outputs together
  – Grant requests on diagonal, kill conflicts
  – Repeat for other diagonals
• Pros:
  – Tends to generate better matchings
  – Tiled design facilitates full-custom implem.
• Cons:
  – Delay scales linearly
  – Orig. design has (false) combinational loops
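A minimal behavioral sketch of the diagonal sweep, assuming a square request matrix and wrapped diagonals. The function name and the `start` parameter (the diagonal with highest priority) are illustrative choices, and the sequential loop only models what the hardware evaluates combinationally.

```python
def wavefront(req, start=0):
    """req[i][o] is True if input i requests output o (n x n matrix).

    Returns a dict mapping each granted input to its output.
    """
    n = len(req)
    row_free = [True] * n  # input has not been granted yet
    col_free = [True] * n  # output has not been granted yet
    grants = {}
    for d in range(n):
        diag = (start + d) % n
        # Cells on one wrapped diagonal share no row or column,
        # so every surviving request on it can be granted together.
        for i in range(n):
            o = (diag - i) % n
            if req[i][o] and row_free[i] and col_free[o]:
                grants[i] = o
                # "Kill" the remaining requests in this row and column.
                row_free[i] = False
                col_free[o] = False
    return grants


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
req = [[True, True],
       [True, False]]
grants_diag1 = wavefront(req, start=1)
grants_diag0 = wavefront(req, start=0)
print(grants_diag1, grants_diag0)
```

Starting on diagonal 1 grants both inputs (0→1 and 1→0), while starting on diagonal 0 grants only 0→0, so the result depends on which diagonal has priority. The outer loop over diagonals is also why delay grows linearly with the number of ports.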

[Figure: wavefront allocator; axes: inputs, outputs]


Evaluation Methodology

• Analytical models useful for developing intuition
• But becoming increasingly inaccurate
  – Wire delay impact, synthesized vs. full-custom logic, …
• Use two-pronged evaluation approach:
  – Delay & cost via detailed RTL-based evaluation
    • Synthesized using Synopsys Design Compiler in topo mode
    • Commercial 45nm low power library @ worst case
  – Network-level performance via simulation
    • Cycle-oriented interconnection network simulator
    • 64-node networks: 2D mesh & 2D flattened butterfly
    • Request-reply traffic, synthetic traffic patterns


Virtual Channel Allocation

• Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)

• Before packets can proceed through router, need to claim ownership of VC buffer at next router

• VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
  – P×V inputs (input VCs), P×V outputs (output VCs)
  – Once assigned, VC is used for entire packet’s duration


Sparse VC Allocation (1)

• VCs are used for a variety of purposes:
  – Deadlock avoidance
    • Break cyclic dependencies
    • Routing deadlock (within network)
    • Protocol deadlock (at network boundary)
  – Flow control
    • Decouple buffers and channels to avoid head-of-line blocking
• Idea: Partition set of VCs to restrict legal requests
  – Significantly reduces VC allocator logic complexity
  – Delay/area/power savings of up to 41%/90%/83%


Sparse VC Allocation (2)

[Figure: IVC-to-OVC requests as the set of VCs is partitioned; labels: REQ, REP, NM, MIN; per-IVC annotations: P×8, P×4 and P×2 requests]

• 8 VCs: 64 requests
• 2×4 VCs: 32 requests
• 2×2×2 VCs: 24 requests
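The request counts on this slide (64, 32 and 24 for 8, 2×4 and 2×2×2 VCs) can be reproduced with a short counting sketch. The legality rule below is an assumption chosen because it matches those numbers, not something the slide states: a packet stays within its message class (REQ or REP), and within a message class the first resource class (taken here to be NM) may request either class while the second (MIN) may only request itself. The function and parameter names are hypothetical.

```python
def legal_requests(num_msg_classes, num_res_classes, vcs_per_class):
    """Count legal (input VC, output VC) pairs for one input/output port pair."""
    count = 0
    for m_in in range(num_msg_classes):
        for r_in in range(num_res_classes):
            for m_out in range(num_msg_classes):
                for r_out in range(num_res_classes):
                    # Assumed rule: same message class, and never move
                    # back to an earlier resource class.
                    if m_out == m_in and r_out >= r_in:
                        count += vcs_per_class * vcs_per_class
    return count


counts = [
    legal_requests(1, 1, 8),  # 8 VCs, no partitioning
    legal_requests(2, 1, 4),  # 2x4 VCs, split by message class
    legal_requests(2, 2, 2),  # 2x2x2 VCs, also split by resource class
]
print(counts)
```

Under these assumptions the sketch yields 64, 32 and 24 legal pairs, which is the reduction in allocator inputs that sparse VC allocation exploits.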

VC Allocator Performance

[Figure: saturation rate (flits/cycle) for sep/if, sep/of and wf across traffic patterns bitcomp, bitrev, neighbor, shuffle, tornado, transpose and uniform; FBfly, 2×2×2 VCs]

VC Allocator Delay

[Figure: delay (ns) for sep_if, sep_of and wf; x-axis: VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2, 2x2x4, grouped by mesh and fbfly]

VC Allocator Cost

[Figure, two panels: area (sq. um) and power (mW) for sep_if, sep_of and wf; x-axis: VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2, 2x2x4, grouped by mesh and fbfly]


Switch Allocation

• Flits require crossbar access to traverse router
• VCs at each input port share crossbar input
• Switch allocator generates crossbar schedule
  – Allocation performed on cycle-by-cycle basis
  – P×V inputs (input VCs), P outputs (output ports)
  – At most one VC per input can be granted in each cycle
• Speculative allocation reduces zero-load latency
  – Start switch allocation before VC allocation completes


Pessimistic Speculation (1)

• Conventional approach:
  – Separate allocators for spec. and non-spec. requests
  – Non-spec. grants mask conflicting spec. grants
  – Conflict detection is on critical path
• At low load, most requests are granted
• Idea: Assume all requests will be granted
  – Mask spec. grants with non-spec. requests
  – Overlap conflict detection and allocation
  – Sacrifice speculation accuracy for lower delay
  – But preserve zero-load latency improvement
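The difference between the two masking schemes can be illustrated with a small behavioral sketch. It assumes requests and grants are modeled as sets of (input port, output port) pairs and that a conflict means sharing an input or an output port; the helper name and the example values are hypothetical.

```python
def mask_spec_grants(spec_grants, nonspec):
    """Drop speculative grants that share an input or output port
    with any (input, output) pair in `nonspec`."""
    busy_inputs = {i for i, _ in nonspec}
    busy_outputs = {o for _, o in nonspec}
    return {(i, o) for i, o in spec_grants
            if i not in busy_inputs and o not in busy_outputs}


# Inputs 0 and 1 both issue non-speculative requests for output 2;
# only input 0 wins non-speculative allocation.
nonspec_requests = {(0, 2), (1, 2)}
nonspec_grants = {(0, 2)}
spec_grants = {(1, 3), (2, 0)}

# Conventional: mask with non-speculative grants (must wait for them).
conventional = mask_spec_grants(spec_grants, nonspec_grants)

# Pessimistic: mask with non-speculative requests (available up front).
pessimistic = mask_spec_grants(spec_grants, nonspec_requests)
print(conventional, pessimistic)
```

In this example the conventional scheme keeps both speculative grants, while the pessimistic scheme drops (1, 3) because input 1 had a non-speculative request that ultimately lost. That lost opportunity is the accuracy traded for taking conflict detection off the critical path. When there are no non-speculative requests at all, as at zero load, both schemes keep every speculative grant.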


Pessimistic Speculation (2)

[Figure: block diagram with nonspec. allocator, spec. allocator, conflict detection and mask blocks; signals: nonspec. requests, spec. requests, nonspec. grants, spec. grants]

Switch Allocator Performance (1)

[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for sep_if, sep_of and wf; Mesh, 2×1×1 VCs]

Switch Allocator Performance (2)

[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for sep_if, sep_of and wf; FBfly, 2×2×4 VCs; plot annotation: >20%]

Switch Allocator Delay

[Figure: delay (ns) for sep_if, sep_of and wf; x-axis: VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2, 2x2x4, grouped by mesh and fbfly]

Switch Allocator Cost

[Figure, two panels: area (sq. um) and power (mW) for sep_if, sep_of and wf; x-axis: VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2, 2x2x4, grouped by mesh and fbfly]

Speculation Performance (1)

[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for nonspec, spec req and spec gnt; Mesh, 2×1×1 VCs]

Speculation Performance (2)

[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for nonspec, spec req and spec gnt; FBfly, 2×2×4 VCs]

Speculation Implementation

[Figure: delay (ns) for nonspec, spec_req and spec_gnt; x-axis: VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2, 2x2x4, grouped by mesh and fbfly]


Conclusions

• Network-level performance is largely insensitive to VC allocator implementation
  – Light effective load facilitates near-ideal matchings
• Sparse VC allocation can greatly reduce delay & cost
  – Partition set of VCs based on functionality
  – Restrict possible requests allocator must handle
• For switch allocation, wavefront allocator produces better matchings but increases delay & cost
  – Difference increases with number of ports, VCs
• Pessimistic speculation reduces switch allocator delay
  – Trade for some performance degradation near saturation
