Allocator Implementations for Network-on-Chip Routers
Daniel U. Becker and William J. Dally
Concurrent VLSI Architecture Group
Stanford University
Allocator Implementations for NoC Routers 2
Overview
• Allocators have major impact on router performance
  – Zero-load latency, throughput under load, cycle time
• On-chip environment imposes stringent constraints
  – Cycle time, power, no iterative / multi-cycle allocators
• Main contributions:
  – RTL-based performance & cost evaluation of virtual channel and switch allocators for NoC routers
  – Sparse VC allocation scheme reduces delay, area & power
  – Pessimistic speculation scheme minimizes delay penalty
11/18/09
Separable Allocators
• Implement allocation as two phases
  – Local arbitration at each input
  – Global arbitration at each output
• Pros:
  – Straightforward implementation
  – Delay scales logarithmically
• Cons:
  – Arbiters within each phase are independent
  – Bad choice in first phase can limit matching
[Figure: request matrices (inputs × outputs) illustrating input-first and output-first separable allocation]
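The two-phase scheme can be sketched in software as follows. This is a minimal model of the input-first variant, not the RTL evaluated in the talk. The function names and the use of fixed-priority arbiters are assumptions made for brevity; practical designs typically use arbiters that rotate priority for fairness.

```python
from typing import List


def fixed_priority_arbiter(requests: List[bool]) -> int:
    """Return the index of the first asserted request, or -1 if there is none.

    Hypothetical helper: a fixed-priority arbiter keeps the sketch short.
    """
    for index, requested in enumerate(requests):
        if requested:
            return index
    return -1


def separable_input_first(req: List[List[bool]]) -> List[List[bool]]:
    """Input-first separable allocation on a request matrix req[input][output]."""
    n_in, n_out = len(req), len(req[0])

    # Phase 1: local arbitration. Each input independently picks one output.
    chosen = [fixed_priority_arbiter(row) for row in req]

    # Phase 2: global arbitration. Each output picks one input that chose it.
    grant = [[False] * n_out for _ in range(n_in)]
    for out in range(n_out):
        winner = fixed_priority_arbiter([chosen[i] == out for i in range(n_in)])
        if winner >= 0:
            grant[winner][out] = True
    return grant


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
requests = [[True, True], [True, False]]
grants = separable_input_first(requests)
print(sum(map(sum, grants)))  # prints 1
```

Both inputs pick output 0 in the first phase, so only one grant is issued even though a matching of size two exists (input 0 to output 1, input 1 to output 0). This is the "bad choice in first phase" limitation listed under Cons.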
Wavefront Allocator [Tamir’93]
• Consider inputs and outputs together
  – Grant requests on diagonal, kill conflicts
  – Repeat for other diagonals
• Pros:
  – Tends to generate better matchings
  – Tiled design facilitates full-custom implementation
• Cons:
  – Delay scales linearly
  – Original design has (false) combinational loops
[Figure: wavefront allocation on a request matrix (inputs × outputs)]
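The diagonal-by-diagonal procedure can be modeled sequentially as below. This is a software sketch under stated assumptions, not the tiled hardware design: the function name, the square request matrix, and the `start_diag` parameter (which diagonal gets top priority) are illustrative choices. Hardware evaluates the diagonals as a combinational wave, which is why delay grows linearly with matrix size.

```python
from typing import List


def wavefront_allocate(req: List[List[bool]], start_diag: int = 0) -> List[List[bool]]:
    """Sequential model of wavefront allocation on an n x n request matrix.

    Cells with (row + col) % n == d form wrapped diagonal d. Cells on one
    diagonal share no row or column, so they can all be granted together.
    """
    n = len(req)
    row_free = [True] * n  # input not yet matched
    col_free = [True] * n  # output not yet matched
    grant = [[False] * n for _ in range(n)]

    for step in range(n):
        diag = (start_diag + step) % n
        for row in range(n):
            col = (diag - row) % n
            # Grant the request unless an earlier diagonal already took
            # this row or column (the conflict is "killed").
            if req[row][col] and row_free[row] and col_free[col]:
                grant[row][col] = True
                row_free[row] = False
                col_free[col] = False
    return grant


requests = [
    [True, True, False],
    [True, False, False],
    [False, False, True],
]
print(sum(map(sum, wavefront_allocate(requests, start_diag=0))))  # prints 2
print(sum(map(sum, wavefront_allocate(requests, start_diag=1))))  # prints 3
```

The result is always a maximal matching (no further request could be granted without a conflict), but its size depends on which diagonal has priority, as the two calls show.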
Evaluation Methodology
• Analytical models useful for developing intuition
• But becoming increasingly inaccurate
  – Wire delay impact, synthesized vs. full-custom logic, …
• Use two-pronged evaluation approach:
  – Delay & cost via detailed RTL-based evaluation
    • Synthesized using Synopsys Design Compiler in topo mode
    • Commercial 45nm low power library @ worst case
  – Network-level performance via simulation
    • Cycle-oriented interconnection network simulator
    • 64-node networks: 2D mesh & 2D flattened butterfly
    • Request-reply traffic, synthetic traffic patterns
Virtual Channel Allocation
• Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
• Before packets can proceed through router, need to claim ownership of VC buffer at next router
• VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
  – P×V inputs (input VCs), P×V outputs (output VCs)
  – Once assigned, VC is used for entire packet’s duration
Sparse VC Allocation (1)
• VCs are used for a variety of purposes:
  – Deadlock avoidance
    • Break cyclic dependencies
    • Routing deadlock (within network)
    • Protocol deadlock (at network boundary)
  – Flow control
    • Decouple buffers and channels to avoid head-of-line blocking
• Idea: Partition set of VCs to restrict legal requests
  – Significantly reduces VC allocator logic complexity
  – Delay/area/power savings of up to 41%/90%/83%
Sparse VC Allocation (2)
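[Figure: input VC (IVC) to output VC (OVC) request diagram with message classes REQ and REP and resource classes NM and MIN; depending on the partitioning, an input VC issues P×8, P×4 or P×2 requests]
VC configuration: 8 VCs | 2×4 VCs | 2×2×2 VCs
Requests: 64 | 32 | 24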
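The request counts in the figure can be reproduced with a short calculation. The sketch below counts legal (input VC, output VC) pairs between one input port and one output port. The transition rule is an assumption inferred from the figure rather than stated on the slide: a packet stays within its message class, and its resource class may stay the same or move forward but never back. The function and parameter names are hypothetical.

```python
from itertools import product


def count_requests(msg_classes: int, res_classes: int, vcs_per_class: int) -> int:
    """Count legal (input VC, output VC) pairs for one input/output port pair."""
    vcs = list(product(range(msg_classes), range(res_classes), range(vcs_per_class)))
    count = 0
    for (m_in, r_in, _), (m_out, r_out, _) in product(vcs, vcs):
        # Assumed rule: same message class, resource class never decreases.
        if m_in == m_out and r_out >= r_in:
            count += 1
    return count


print(count_requests(1, 1, 8))  # prints 64 (8 VCs, no partitioning)
print(count_requests(2, 1, 4))  # prints 32 (2×4 VCs)
print(count_requests(2, 2, 2))  # prints 24 (2×2×2 VCs)
```

Under this rule, the 2×2×2 configuration needs 24 of the 64 request signals of the unpartitioned case, which illustrates why the sparse allocator's logic shrinks.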
VC Allocator Performance
[Figure: saturation rate (flits/cycle) per traffic pattern (bitcomp, bitrev, neighbor, shuffle, tornado, transpose, uniform) for the sep/if, sep/of and wf VC allocators; FBfly, 2×2×2 VCs]
VC Allocator Delay
[Figure: VC allocator delay (ns) for sep_if, sep_of and wf; VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2 and 2x2x4 on mesh and fbfly]
VC Allocator Cost
[Figure, two panels: VC allocator area (sq. um) and power (mW) for sep_if, sep_of and wf; VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2 and 2x2x4 on mesh and fbfly]
Switch Allocation
• Flits require crossbar access to traverse router
• VCs at each input port share crossbar input
• Switch allocator generates crossbar schedule
  – Allocation performed on cycle-by-cycle basis
  – P×V inputs (input VCs), P outputs (output ports)
  – At most one VC per input can be granted in each cycle
• Speculative allocation reduces zero-load latency
  – Start switch allocation before VC allocation completes
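The constraints on a crossbar schedule can be written down as a small check. This is an illustrative sketch only: the function name and the representation of grants as (input port, VC, output port) tuples are assumptions, not part of the evaluated design. The second condition, that each output port accepts at most one flit per cycle, follows from how a crossbar works.

```python
from typing import Iterable, Tuple

Grant = Tuple[int, int, int]  # (input_port, vc, output_port), hypothetical encoding


def is_valid_switch_schedule(grants: Iterable[Grant]) -> bool:
    """Check one cycle's grants against the crossbar constraints."""
    grants = list(grants)
    input_ports = [in_port for in_port, _, _ in grants]
    output_ports = [out_port for _, _, out_port in grants]
    # At most one VC per input port, and at most one input per output port.
    one_vc_per_input = len(set(input_ports)) == len(input_ports)
    one_input_per_output = len(set(output_ports)) == len(output_ports)
    return one_vc_per_input and one_input_per_output


print(is_valid_switch_schedule([(0, 1, 2), (1, 0, 3)]))  # prints True
print(is_valid_switch_schedule([(0, 0, 2), (0, 1, 3)]))  # prints False
```

The second call fails because two VCs at input port 0 are granted in the same cycle, although they share a single crossbar input.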
Pessimistic Speculation (1)
• Conventional approach:
  – Separate allocators for spec. and non-spec. requests
  – Non-spec. grants mask conflicting spec. grants
  – Conflict detection is on critical path
• At low load, most requests are granted
• Idea: Assume all requests will be granted
  – Mask spec. grants with non-spec. requests
  – Overlap conflict detection and allocation
  – Sacrifice speculation accuracy for lower delay
  – But preserve zero-load latency improvement
Pessimistic Speculation (2)
[Figure: block diagram of speculative switch allocation with nonspec. allocator, spec. allocator, conflict detection and mask; inputs: nonspec. requests and spec. requests; outputs: nonspec. grants and spec. grants]
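The difference between the two masking schemes can be sketched at port level. The code below is a behavioral model under stated assumptions, not the evaluated RTL: requests are (input port, output port) pairs, a conflict is taken to mean sharing an input or output port, and `greedy_allocate` is a toy stand-in for whichever allocator is actually used. All names are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (input_port, output_port)


def greedy_allocate(requests: Set[Pair]) -> Set[Pair]:
    """Toy allocator: grant in sorted order, one per input and one per output."""
    used_in, used_out, grants = set(), set(), set()
    for in_port, out_port in sorted(requests):
        if in_port not in used_in and out_port not in used_out:
            grants.add((in_port, out_port))
            used_in.add(in_port)
            used_out.add(out_port)
    return grants


def mask(spec_grants: Set[Pair], blockers: Set[Pair]) -> Set[Pair]:
    """Drop speculative grants that share an input or output port with a blocker."""
    busy_in = {in_port for in_port, _ in blockers}
    busy_out = {out_port for _, out_port in blockers}
    return {(i, o) for i, o in spec_grants if i not in busy_in and o not in busy_out}


def speculative_allocate(nonspec_req: Set[Pair], spec_req: Set[Pair], pessimistic: bool):
    nonspec_gnt = greedy_allocate(nonspec_req)
    spec_gnt = greedy_allocate(spec_req)
    # Conventional: mask with non-spec. grants (must wait for allocation).
    # Pessimistic: mask with non-spec. requests (available immediately).
    blockers = nonspec_req if pessimistic else nonspec_gnt
    return nonspec_gnt, mask(spec_gnt, blockers)


nonspec = {(0, 2), (1, 2)}  # both inputs want output 2; only one can win
spec = {(1, 3)}
print(speculative_allocate(nonspec, spec, pessimistic=False)[1])  # prints {(1, 3)}
print(speculative_allocate(nonspec, spec, pessimistic=True)[1])   # prints set()
```

In the example, input 1 loses non-speculative arbitration, so the conventional scheme lets its speculative grant through, while the pessimistic scheme discards it because input 1 had a non-speculative request. With no non-speculative requests, as at zero load, both schemes return the same speculative grants.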
Switch Allocator Performance (1)
[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for the sep_if, sep_of and wf switch allocators; Mesh, 2×1×1 VCs]
Switch Allocator Performance (2)
[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for the sep_if, sep_of and wf switch allocators; FBfly, 2×2×4 VCs; annotation: >20%]
Switch Allocator Delay
[Figure: switch allocator delay (ns) for sep_if, sep_of and wf; VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2 and 2x2x4 on mesh and fbfly]
Switch Allocator Cost
[Figure, two panels: switch allocator area (sq. um) and power (mW) for sep_if, sep_of and wf; VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2 and 2x2x4 on mesh and fbfly]
Speculation Performance (1)
[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for nonspec, spec req and spec gnt; Mesh, 2×1×1 VCs]
Speculation Performance (2)
[Figure: avg. packet latency (cycles) vs. injection rate (flits/cycle) for nonspec, spec req and spec gnt; FBfly, 2×2×4 VCs]
Speculation Implementation
[Figure: switch allocator delay (ns) for nonspec, spec_req and spec_gnt; VC configurations 2x1x1, 2x1x2, 2x1x4, 2x2x1, 2x2x2 and 2x2x4 on mesh and fbfly]
Conclusions
• Network-level performance is largely insensitive to VC allocator implementation
  – Light effective load facilitates near-ideal matchings
• Sparse VC allocation can greatly reduce delay & cost
  – Partition set of VCs based on functionality
  – Restrict possible requests allocator must handle
• For switch allocation, wavefront allocator produces better matchings but increases delay & cost
  – Difference increases with number of ports, VCs
• Pessimistic speculation reduces switch allocator delay
  – Trade for some performance degradation near saturation