Trace-Driven Optimization of Networks-on-Chip Configurations
Andrew B. Kahng†‡
Bill Lin‡
Kambiz Samadi‡
Rohit Sunkam Ramanujam‡
University of California, San DiegoCSE† and ECE‡ Departments
June 16, 2010
Outline
- Motivation
- Trace-Driven Problem Formulation
- Greedy Addition VC Allocation
- Greedy Deletion VC Allocation
- Runtime Analysis
- Experimental Setup
- Evaluation
- Experimental Results
- Power Impact
- Conclusions
Motivation
- NoCs are needed to interconnect many-core chips
- A scalable on-chip communication fabric
- An emerging interconnection paradigm for building complex VLSI systems
- NoCs can be used to interconnect general-purpose chip multiprocessors (CMPs) or application-specific multiprocessor systems-on-chip (MPSoCs)
CMPs vs. MPSoCs
- Traditional application domains: MPSoCs target embedded domains; CMPs target general-purpose computing
- Common: need for high memory bandwidth, power efficiency, system control, etc.
- Different: CMPs must run a wide range of applications; MPSoCs have more irregularities and tighter cost and time-to-market constraints
- Conclusion: application-specific optimization is required for MPSoCs
Trace-Driven vs. Average-Rate Driven
- Actual traffic behavior of two PARSEC benchmark traces
- Actual traffic tends to be very bursty, with substantial fluctuations over time
- Average-rate driven approaches are misled by the average traffic characteristics → poor design choices
- Our approach: trace-driven NoC configuration optimization
[Figure: Traffic (flits/10,000 cycles) vs. time (cycles) for the vips and bs traces, with each trace's mean shown]
Head-of-Line (HOL) Blocking Problem
- HOL blocking happens in input-buffered routers
- Flits are blocked if the head flit is blocked → significantly increases latency and reduces throughput
- Virtual channels overcome this problem by multiplexing the input buffers

[Figure: two routers with Outputs 1-3; in the first, a blocked head flit stalls the flits queued behind it]
Average-Rate Driven Shortcoming
- 3 packets with the following (source, destination) pairs: (A, G), (B, E), (F, E)
- Suppose all 3 packets are 10 flits in size and all are injected at t = 0
- Channels 2 and 3 will each carry two packets, from (A, G) and (B, E); Channel 4 will also carry two packets, from (B, E) and (F, E)
- Average-rate analysis concludes that adding an additional VC to Channels 2 and 3 is as good as adding a VC to Channel 4, since all 3 channels have the same "load"
- Average-rate driven approaches lead to poor design choices
[Figure: example topology with nodes A-G, channels 1-7, and packets (A, G), (B, E), (F, E)]
Wormhole Configuration
- At t = 1, each packet holds the channels shown color-coded in the figure, assuming a single VC (i.e., wormhole routing)
- At t = 2, Packet (A, G) is blocked from proceeding because Channel 2 is already held by Packet (B, E)
- At t = 12 (= 3 + 9), Packet (B, E) can proceed to Channel 4, since it has already been released by Packet (F, E)
- At t = 20, Packet (A, G) acquires Channel 3
- At t = 21, Packet (A, G) acquires Channel 6 as well, and Packet (B, E) completes
- Packet (A, G) will complete at t = 35
Latency Reduction via VC Allocation
- Now assume Channels 2 and 3 each have 2 VCs
- In this case, Packet (A, G) can bypass Packet (B, E) while Packet (B, E) is being blocked by Packet (F, E) at Channel 4
- At t = 12, Packet (F, E) completes, and Packet (B, E) can proceed on Channel 4
- At t = 13, the last flit of Packet (A, G) is at Channel 6
- At t = 22, the last flit of Packet (B, E) is at Channel 4, and Packet (A, G) has already completed
- With 2 VCs at Channels 2 and 3, the completion time is 23 cycles, vs. 35 cycles without these VCs
- The main reason for the improvement is that we prevented Channels 2 and 3 from being idle
Problem Formulation
Given:
- Application communication trace, C_trace
- Network topology, T(P, L)
- Deterministic routing algorithm, R
- Target latency, D_target

Determine:
- A mapping n_VC from the set of links L to the set of positive integers, i.e., n_VC : L → Z+, where for any l ∈ L, n_VC(l) gives the number of VCs associated with link l

Objective: Minimize Σ_{l∈L} n_VC(l)
Subject to: D(n_VC, R) ≤ D_target
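The formulation can be captured concretely in code; a minimal Python sketch, where the dict-based link mapping is an illustrative assumption and `simulate_apl` is a hypothetical stand-in for a full trace simulation of D(n_VC, R):

```python
def cost(n_vc):
    """Objective: total number of VCs allocated across all links."""
    return sum(n_vc.values())

def is_feasible(n_vc, d_target, simulate_apl):
    """Constraint: the trace-simulated APL D(n_VC, R) must not exceed D_target."""
    return simulate_apl(n_vc) <= d_target

# A candidate mapping n_VC : L -> Z+ for three links
n_vc = {"l1": 1, "l2": 2, "l3": 1}
print(cost(n_vc))  # -> 4
```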
Greedy Addition VC Allocation Heuristic (1)
Inputs:
- Communication traffic trace, C_trace
- Network topology, T(P, L)
- Routing algorithm, R
- Target latency, D_target

Output:
- Vector n_VC, which contains the number of VCs associated with each link

Example trace:
time  src    dest   packet size
1     (1,0)  (0,2)  4
2     (2,2)  (3,1)  4
5     (1,3)  (3,1)  4
7     (2,1)  (3,2)  4
…

[Figure: 4x4 mesh with nodes (0,0)-(3,3)]
Greedy Addition VC Allocation Heuristic (2)
- The algorithm initializes every link with one VC
- The algorithm proceeds in a greedy fashion
- In each iteration, the performance of all VC perturbations is evaluated
- Each perturbation consists of adding exactly one VC to one link
- The average packet latency (APL) of the perturbed VC configurations is evaluated → the configuration with the smallest APL is chosen for the next iteration
- The algorithm stops if either (1) the total number of allocated VCs exceeds the VC budget, or (2) a configuration with a better APL than the target latency is achieved
Greedy Addition VC Allocation Heuristic (3)

for l = 1 to NL
    nVC_current(l) = 1;                     // initialize to wormhole configuration
end for
nVC_best = nVC_current;
N_VC = NL;
while (N_VC <= budget_VC)                   // check the VC budget
    for l = 1 to NL
        nVC_new = nVC_current;
        nVC_new(l) = nVC_current(l) + 1;
        run trace simulation on nVC_new and record D(nVC_new, R)
                                            // VC perturbations evaluated in parallel in each iteration
    end for
    find nVC_best;                          // find the best configuration of the current iteration
    nVC_current = nVC_best;
    if (D(nVC_best, R) <= D_target)
        break;
    end if
    N_VC++;
end while
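The pseudocode above can be sketched as runnable Python. Here `simulate_apl` is a hypothetical stand-in for the trace simulator D(·, R), and the dict-based link representation is an assumption for illustration, not the authors' implementation:

```python
def greedy_addition(links, budget_vc, d_target, simulate_apl):
    """Greedy addition VC allocation (sketch).

    links: list of link identifiers.
    budget_vc: total VC budget over all links.
    d_target: target average packet latency.
    simulate_apl: callable mapping a {link: num_vcs} dict to its APL.
    """
    n_vc = {l: 1 for l in links}            # start from wormhole configuration
    total = len(links)
    while total < budget_vc:                # respect the VC budget
        best_link, best_apl = None, float("inf")
        for l in links:                     # evaluate every single-VC addition
            trial = dict(n_vc)
            trial[l] += 1
            apl = simulate_apl(trial)
            if apl < best_apl:
                best_link, best_apl = l, apl
        n_vc[best_link] += 1                # keep the best perturbation
        total += 1
        if best_apl <= d_target:            # target latency achieved
            break
    return n_vc

# Toy APL model: adding VCs to link "A" helps twice as much as to "B"
apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(greedy_addition(["A", "B"], 4, 60, apl_model))  # -> {'A': 3, 'B': 1}
```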
Greedy Addition VC Allocation Heuristic: Drawback
- Packets (A, F) and (A, E) share links A→B and B→C, both of which have only one VC
- (A, F) turns west and (A, E) turns east at Node C
- Adding a VC to either link A→B or link B→C alone may not have a significant impact on APL
- If VCs are added to both links A→B and B→C, the APL may be significantly reduced
- The greedy VC addition approach may fail to realize the benefit of these combined additions and not pick either of the links

[Figure: nodes A-F, with packets (A, F) and (A, E) sharing links A→B and B→C]
Greedy Deletion VC Allocation Heuristic

nVC_current = nVC_initial;                  // start with a given VC configuration
nVC_best = nVC_current;
N_VC = Σ_{l∈L} nVC_current(l);
while (N_VC >= budget_VC)
    for l = 1 to NL
        nVC_new = nVC_current;
        if (nVC_current(l) > 1)             // each link keeps at least 1 VC, i.e., wormhole configuration
            nVC_new(l) = nVC_current(l) - 1;
            run trace simulation on nVC_new and record D(nVC_new, R)
        end if
    end for
    find nVC_best;                          // find the best configuration of the current iteration,
                                            // i.e., the one with the least degradation in APL
    nVC_current = nVC_best;
    if (D(nVC_best, R) > D_target)          // stop once the target latency can no longer be met
        break;
    end if
    N_VC--;
end while
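The deletion heuristic can likewise be sketched in Python. As above, `simulate_apl` stands in for the trace simulator, and the reading of the stopping condition (stop once the best remaining deletion would violate the latency target) is our interpretation of the pseudocode:

```python
def greedy_deletion(n_vc_initial, budget_vc, d_target, simulate_apl):
    """Greedy deletion VC allocation (sketch).

    Starts from a given (e.g. uniform) {link: num_vcs} configuration and
    repeatedly removes the VC whose deletion degrades APL the least,
    keeping at least one VC per link (wormhole minimum).
    """
    n_vc = dict(n_vc_initial)               # start with the given configuration
    total = sum(n_vc.values())
    while total > budget_vc:
        best_link, best_apl = None, float("inf")
        for l, v in n_vc.items():
            if v > 1:                       # each link keeps at least 1 VC
                trial = dict(n_vc)
                trial[l] -= 1
                apl = simulate_apl(trial)
                if apl < best_apl:          # least APL degradation
                    best_link, best_apl = l, apl
        if best_link is None:               # already wormhole everywhere
            break
        if best_apl > d_target:             # target would be violated; stop
            break
        n_vc[best_link] -= 1
        total -= 1
    return n_vc

# Toy APL model: VCs on link "A" matter more, so "B" is deleted first
apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(greedy_deletion({"A": 2, "B": 2}, 2, 80, apl_model))  # -> {'A': 2, 'B': 1}
```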
Addition and Deletion Heuristics Comparison
- APL decreases as VCs are added (addition heuristic)
- APL increases as VCs are removed (deletion heuristic)
- Adding a single VC to a link may not have a significant impact on APL
- The APL change is much smoother in the deletion heuristic
Runtime Analysis
- Let m be the number of VCs added to (deleted from) an initial VC configuration
- T_heuristic = m × NL × T(trace simulation): the time to run trace simulations on all VC configurations explored by the algorithm, where T(trace simulation) is the average time of one trace simulation
- Our heuristics can easily be parallelized by evaluating all VC configurations of an iteration in parallel: T_heuristic = m × T(trace simulation)_max, where T(trace simulation)_max represents the average of the maximum trace simulation runtimes at each iteration
- For larger networks, to maintain a reasonable runtime we need O(|L|) processing nodes, trace compression, or other metrics that more efficiently capture the impact of VC allocation on APL
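The perturbations within one iteration are independent, so they can be evaluated concurrently; a self-contained sketch using a thread pool (a real deployment would dispatch full trace simulations to separate processes or machines, and `simulate_apl` is again a hypothetical stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_perturbations(n_vc, links, simulate_apl, max_workers=4):
    """Evaluate all single-VC-addition perturbations of one iteration in parallel.

    With one worker per perturbation, the iteration's wall-clock time is
    bounded by the slowest single simulation, T(trace simulation)_max,
    rather than by the sum over all NL simulations.
    """
    trials = []
    for l in links:
        trial = dict(n_vc)
        trial[l] += 1                       # add one VC to link l
        trials.append(trial)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        apls = list(pool.map(simulate_apl, trials))
    # (best APL, link whose single-VC addition achieves it)
    return min(zip(apls, links))

apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(evaluate_perturbations({"A": 1, "B": 1}, ["A", "B"], apl_model))  # -> (75, 'A')
```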
Experimental Setup (1)
- We use Popnet for trace simulation
- Popnet models a typical four-stage router pipeline; the head flit of a packet traverses all four stages, while body flits bypass the first stage
- The number of VCs at each input port can be individually configured, allowing a nonuniform VC configuration at a router
- The latency of a packet is measured as the delay between the time the head flit is injected into the network and the time the tail flit is consumed at the destination
- The reported APL value is the average latency over all packets in the input traffic trace
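The latency and APL definitions above reduce to a one-line computation over per-packet timestamps; a minimal sketch (the record format is an assumption for illustration, not Popnet's actual output):

```python
def average_packet_latency(records):
    """Compute APL over all packets in a trace.

    Each record is (inject_time, consume_time): the cycle the head flit
    enters the network and the cycle the tail flit is consumed at the
    destination. Packet latency is their difference.
    """
    latencies = [consume - inject for inject, consume in records]
    return sum(latencies) / len(latencies)

# Three packets with latencies 20, 30, and 40 cycles
print(average_packet_latency([(0, 20), (5, 35), (10, 50)]))  # -> 30.0
```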
[Figure: tool flow — workload → Simics/GEMS → communication trace → network configuration]
Experimental Setup (2)
- To evaluate our VC allocation heuristics, we use seven different applications from the PARSEC benchmark suite
- Network traffic traces are generated by running the above applications on Virtutech Simics
- The GEMS toolset is used for accurate timing simulation
- We simulate a 16-core, 4x4 mesh CMP:

Cores:            16
Private L1 Cache: 32KB
Shared L2:        1MB distributed over 16 banks
Memory Latency:   170 cycles
Network:          4x4 mesh
Packet Sizes:     72B data packets, 8B control packets
Comparison vs. Uniform-2VC
- The average-rate driven method is outperformed by uniform VC allocation
- Our addition and deletion heuristics achieve up to 36% and 34% reduction in the number of VCs, respectively (w.r.t. the uniform-2VC configuration)
- On average, both of our heuristics reduce the number of VCs by around 21% across all traces (w.r.t. the uniform-2VC configuration)
[Figure: Average Packet Latency (cycles) per trace for uniform-2VC, addition, deletion, and average-rate configurations]
Comparison vs. Uniform-3VC
- Our addition and deletion heuristics achieve up to 48% and 51% reduction in the number of VCs, respectively
- On average, our addition and deletion heuristics achieve 31% and 41% reduction in the number of VCs across all traces
- We observe up to 35% reduction in the number of VCs compared against an existing average-rate driven approach
[Figure: Average Packet Latency (cycles) per trace for uniform-3VC, addition, deletion, and average-rate configurations]
Latency and #VC Reductions
- With #VC = 128, our greedy deletion heuristic improves the APL by 32% and 74% for the fluidanimate and vips traces, respectively, compared with the uniform-2VC configuration
- Our deletion heuristic also achieves 50% and 42% reduction in the number of VCs for these traces, respectively, compared with the uniform-4VC configuration
- Our proposed trace-driven approach can potentially be used to (1) improve performance within a given power constraint, and (2) reduce power within a given performance constraint

[Figure: latency and VC reductions for the fluidanimate and vips traces]
Impact on Power
- We use ORION 2.0 to assess the impact of our approach on power consumption
- ORION 2.0 assumes the same number of VCs at every port of the router, so we need to compute router power for nonuniform VC configurations:
  - Estimate the power overhead of adding a single VC to all router ports
  - From it, estimate the power overhead of adding a single VC to just one port
- A similar approach is used to estimate the area overhead of adding a single VC to one router port
- We observe that our proposed approach achieves up to 7% and 14% reduction in power compared against the uniform-2VC and uniform-3VC configurations (without any performance degradation), respectively
- Similarly, we observe up to 9% and 16% reduction in area compared against the uniform-2VC and uniform-3VC configurations, respectively
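The per-port overhead estimation described above can be sketched as follows. The function names and the even split of the uniform +1-VC overhead across ports are our illustrative assumptions, and `power_uniform` is a stand-in for a uniform-VC power model such as ORION 2.0, not its actual API:

```python
def router_power_nonuniform(vcs_per_port, power_uniform):
    """Estimate router power for a nonuniform VC configuration.

    power_uniform(n): power of a router with n VCs at *every* port, as a
    uniform-VC model would report it. The overhead of one extra VC at a
    single port is approximated as the uniform +1-VC overhead divided
    evenly across the ports.
    """
    ports = len(vcs_per_port)
    base = min(vcs_per_port)                # uniform baseline: smallest per-port VC count
    per_port_overhead = (power_uniform(base + 1) - power_uniform(base)) / ports
    extra_vcs = sum(v - base for v in vcs_per_port)
    return power_uniform(base) + extra_vcs * per_port_overhead

# Toy linear model: 10 units base + 2 units per uniform VC; one extra VC on port 2
print(router_power_nonuniform([1, 1, 2, 1], lambda n: 10 + 2 * n))  # -> 12.5
```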
Conclusions
- Proposed a trace-driven method for optimizing NoC configurations
- Considered the problem of application-specific VC allocation
- Showed that existing average-rate driven VC allocation approaches fail to capture the application-specific characteristics needed to further improve performance and reduce power
- In comparison with uniform VC allocation, our approaches achieve up to 51% and 74% reduction in the number of VCs and in average packet latency, respectively
- In comparison with an existing average-rate driven approach, we observe up to 35% reduction in the number of VCs
- Ongoing work:
  - New metrics to more efficiently capture the impact of VC allocation on average packet latency
  - New metaheuristics to further improve our performance improvement and VC reduction gains
Thank You