Trace-Driven Optimization of Networks-on-Chip Configurations
Andrew B. Kahng†‡
Bill Lin‡
Kambiz Samadi‡
Rohit Sunkam Ramanujam‡
University of California, San DiegoCSE† and ECE‡ Departments
June 16, 2010
Outline
- Motivation
- Trace-Driven Problem Formulation
- Greedy Addition VC Allocation
- Greedy Deletion VC Allocation
- Runtime Analysis
- Experimental Setup
- Evaluation
- Experimental Results
- Power Impact
- Conclusions
Motivation
- NoCs are needed to interconnect many-core chips
- A scalable on-chip communication fabric
- An emerging interconnection paradigm for building complex VLSI systems
- NoCs can be used to interconnect general-purpose chip multiprocessors (CMPs) or application-specific multiprocessor systems-on-chip (MPSoCs)
CMPs vs. MPSoCs
- Traditional application domains: MPSoCs target embedded domains; CMPs target general-purpose computing
- Common: need for high memory bandwidth, power efficiency, system control, etc.
- Different: CMPs must run a wide range of applications; MPSoCs have more irregularities and tighter cost and time-to-market constraints
- Conclusion: application-specific optimization is required for MPSoCs
Trace-Driven vs. Average-Rate Driven
- Actual traffic behavior of two PARSEC benchmark traces
- Actual traffic tends to be very bursty, with substantial fluctuations over time
- Average-rate driven approaches are misled by the average traffic characteristics → poor design choices
- Our approach: trace-driven NoC configuration optimization
[Figure: Traffic (flits/10,000 cycles) vs. time (cycles) for the vips and bs traces, with each trace's mean shown]
Head-of-Line (HOL) Blocking Problem
- HOL blocking happens in input-buffered routers
- Flits are blocked if the head flit is blocked → significantly increases latency and reduces throughput
- Virtual channels overcome this problem by multiplexing the input buffers

[Figure: two routers with Outputs 1-3; in the first, a blocked head flit stalls the flits queued behind it]
Average-Rate Driven Shortcoming
- 3 packets with the following (source, destination) pairs: (A, G), (B, E), (F, E)
- Suppose all 3 packets are 10 flits in size and all are injected at t = 0
- Channels 2 and 3 will each carry two packets, from (A, G) and (B, E); Channel 4 will also carry two packets, from (B, E) and (F, E)
- Average-rate analysis concludes that adding an additional VC to Channels 2 and 3 is as good as adding a VC to Channel 4, since all 3 channels have the same "load"
- Average-rate driven approaches lead to poor design choices
[Figure: example topology with nodes A-G, channels 1-7, and packets (A, G), (B, E), (F, E)]
Wormhole Configuration
- At t = 1, each packet holds the channels shown color-coded in the figure, assuming a single VC (i.e., wormhole routing)
- At t = 2, Packet (A, G) is blocked from proceeding because Channel 2 is already held by Packet (B, E)
- At t = 12 (= 3 + 9), Packet (B, E) can proceed to Channel 4, since it has already been released by Packet (F, E)
- At t = 20, Packet (A, G) acquires Channel 3
- At t = 21, Packet (A, G) acquires Channel 6 as well, and Packet (B, E) completes
- Packet (A, G) will complete at t = 35
Latency Reduction via VC Allocation
- Now assume Channels 2 and 3 each have 2 VCs
- In this case, Packet (A, G) can bypass Packet (B, E) while Packet (B, E) is being blocked by Packet (F, E) at Channel 4
- At t = 12, Packet (F, E) completes, and Packet (B, E) can proceed on Channel 4
- At t = 13, the last flit of Packet (A, G) is at Channel 6
- At t = 22, the last flit of Packet (B, E) is at Channel 4, and Packet (A, G) has already completed
- With 2 VCs at Channels 2 and 3, the completion time is 23 cycles, vs. 35 cycles without these VCs
- The main reason for the improvement is that we prevented Channels 2 and 3 from being idle
Problem Formulation
Given:
- Application communication trace, C_trace
- Network topology, T(P, L)
- Deterministic routing algorithm, R
- Target latency, D_target

Determine:
- A mapping n_VC from the set of links L to the set of positive integers, i.e., n_VC : L → Z+, where for any l ∈ L, n_VC(l) gives the number of VCs associated with link l

Objective: Minimize Σ_{l∈L} n_VC(l)
Subject to: D(n_VC, R) ≤ D_target
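The formulation can be captured concretely in code; a minimal Python sketch, where the dict-based link mapping is an illustrative assumption and `simulate_apl` is a hypothetical stand-in for a full trace simulation of D(n_VC, R):

```python
def cost(n_vc):
    """Objective: total number of VCs allocated across all links."""
    return sum(n_vc.values())

def is_feasible(n_vc, d_target, simulate_apl):
    """Constraint: the trace-simulated APL D(n_VC, R) must not exceed D_target."""
    return simulate_apl(n_vc) <= d_target

# A candidate mapping n_VC : L -> Z+ for three links
n_vc = {"l1": 1, "l2": 2, "l3": 1}
print(cost(n_vc))  # -> 4
```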
Greedy Addition VC Allocation Heuristic (1)
Inputs:
- Communication traffic trace, C_trace
- Network topology, T(P, L)
- Routing algorithm, R
- Target latency, D_target

Output:
- Vector n_VC, which contains the number of VCs associated with each link

Example trace:
time  src    dest   packet size
1     (1,0)  (0,2)  4
2     (2,2)  (3,1)  4
5     (1,3)  (3,1)  4
7     (2,1)  (3,2)  4
…

[Figure: 4x4 mesh with nodes (0,0)-(3,3)]
Greedy Addition VC Allocation Heuristic (2)
- The algorithm initializes every link with one VC
- The algorithm proceeds in a greedy fashion
- In each iteration, the performance of all VC perturbations is evaluated
- Each perturbation consists of adding exactly one VC to one link
- The average packet latency (APL) of the perturbed VC configurations is evaluated → the configuration with the smallest APL is chosen for the next iteration
- The algorithm stops if either (1) the total number of allocated VCs exceeds the VC budget, or (2) a configuration with a better APL than the target latency is achieved
Greedy Addition VC Allocation Heuristic (3)

for l = 1 to NL
    nVC_current(l) = 1;                     // initialize to wormhole configuration
end for
nVC_best = nVC_current;
N_VC = NL;
while (N_VC <= budget_VC)                   // check the VC budget
    for l = 1 to NL
        nVC_new = nVC_current;
        nVC_new(l) = nVC_current(l) + 1;
        run trace simulation on nVC_new and record D(nVC_new, R)
                                            // VC perturbations evaluated in parallel in each iteration
    end for
    find nVC_best;                          // find the best configuration of the current iteration
    nVC_current = nVC_best;
    if (D(nVC_best, R) <= D_target)
        break;
    end if
    N_VC++;
end while
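The pseudocode above can be sketched as runnable Python. Here `simulate_apl` is a hypothetical stand-in for the trace simulator D(·, R), and the dict-based link representation is an assumption for illustration, not the authors' implementation:

```python
def greedy_addition(links, budget_vc, d_target, simulate_apl):
    """Greedy addition VC allocation (sketch).

    links: list of link identifiers.
    budget_vc: total VC budget over all links.
    d_target: target average packet latency.
    simulate_apl: callable mapping a {link: num_vcs} dict to its APL.
    """
    n_vc = {l: 1 for l in links}            # start from wormhole configuration
    total = len(links)
    while total < budget_vc:                # respect the VC budget
        best_link, best_apl = None, float("inf")
        for l in links:                     # evaluate every single-VC addition
            trial = dict(n_vc)
            trial[l] += 1
            apl = simulate_apl(trial)
            if apl < best_apl:
                best_link, best_apl = l, apl
        n_vc[best_link] += 1                # keep the best perturbation
        total += 1
        if best_apl <= d_target:            # target latency achieved
            break
    return n_vc

# Toy APL model: adding VCs to link "A" helps twice as much as to "B"
apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(greedy_addition(["A", "B"], 4, 60, apl_model))  # -> {'A': 3, 'B': 1}
```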
Greedy Addition VC Allocation Heuristic: Drawback
- Packets (A, F) and (A, E) share links A→B and B→C, both of which have only one VC
- (A, F) turns west and (A, E) turns east at Node C
- Adding a VC to either link A→B or link B→C alone may not have a significant impact on APL
- If VCs are added to both links A→B and B→C, the APL may be significantly reduced
- The greedy VC addition approach may fail to realize the benefit of these combined additions and not pick either of the links

[Figure: nodes A-F, with packets (A, F) and (A, E) sharing links A→B and B→C]
Greedy Deletion VC Allocation Heuristic

nVC_current = nVC_initial;                  // start with a given VC configuration
nVC_best = nVC_current;
N_VC = Σ_{l∈L} nVC_current(l);
while (N_VC >= budget_VC)
    for l = 1 to NL
        nVC_new = nVC_current;
        if (nVC_current(l) > 1)             // each link keeps at least 1 VC, i.e., wormhole configuration
            nVC_new(l) = nVC_current(l) - 1;
            run trace simulation on nVC_new and record D(nVC_new, R)
        end if
    end for
    find nVC_best;                          // find the best configuration of the current iteration,
                                            // i.e., the one with the least degradation in APL
    nVC_current = nVC_best;
    if (D(nVC_best, R) > D_target)          // stop once the target latency can no longer be met
        break;
    end if
    N_VC--;
end while
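The deletion heuristic can likewise be sketched in Python. As above, `simulate_apl` stands in for the trace simulator, and the reading of the stopping condition (stop once the best remaining deletion would violate the latency target) is our interpretation of the pseudocode:

```python
def greedy_deletion(n_vc_initial, budget_vc, d_target, simulate_apl):
    """Greedy deletion VC allocation (sketch).

    Starts from a given (e.g. uniform) {link: num_vcs} configuration and
    repeatedly removes the VC whose deletion degrades APL the least,
    keeping at least one VC per link (wormhole minimum).
    """
    n_vc = dict(n_vc_initial)               # start with the given configuration
    total = sum(n_vc.values())
    while total > budget_vc:
        best_link, best_apl = None, float("inf")
        for l, v in n_vc.items():
            if v > 1:                       # each link keeps at least 1 VC
                trial = dict(n_vc)
                trial[l] -= 1
                apl = simulate_apl(trial)
                if apl < best_apl:          # least APL degradation
                    best_link, best_apl = l, apl
        if best_link is None:               # already wormhole everywhere
            break
        if best_apl > d_target:             # target would be violated; stop
            break
        n_vc[best_link] -= 1
        total -= 1
    return n_vc

# Toy APL model: VCs on link "A" matter more, so "B" is deleted first
apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(greedy_deletion({"A": 2, "B": 2}, 2, 80, apl_model))  # -> {'A': 2, 'B': 1}
```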
Addition and Deletion Heuristics Comparison
- APL decreases as VCs are added (addition heuristic)
- APL increases as VCs are removed (deletion heuristic)
- Adding a single VC to a link may not have a significant impact on APL
- The APL change is much smoother in the deletion heuristic
Runtime Analysis
- Let m be the number of VCs added to (deleted from) an initial VC configuration
- T_heuristic = m × NL × T(trace simulation): the time to run trace simulations on all VC configurations explored by the algorithm, where T(trace simulation) is the average time of one trace simulation
- Our heuristics can easily be parallelized by evaluating all VC configurations of an iteration in parallel: T_heuristic = m × T(trace simulation)_max, where T(trace simulation)_max represents the average of the maximum trace simulation runtimes at each iteration
- For larger networks, to maintain a reasonable runtime we need O(|L|) processing nodes, trace compression, or other metrics that more efficiently capture the impact of VC allocation on APL
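The perturbations within one iteration are independent, so they can be evaluated concurrently; a self-contained sketch using a thread pool (a real deployment would dispatch full trace simulations to separate processes or machines, and `simulate_apl` is again a hypothetical stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_perturbations(n_vc, links, simulate_apl, max_workers=4):
    """Evaluate all single-VC-addition perturbations of one iteration in parallel.

    With one worker per perturbation, the iteration's wall-clock time is
    bounded by the slowest single simulation, T(trace simulation)_max,
    rather than by the sum over all NL simulations.
    """
    trials = []
    for l in links:
        trial = dict(n_vc)
        trial[l] += 1                       # add one VC to link l
        trials.append(trial)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        apls = list(pool.map(simulate_apl, trials))
    # (best APL, link whose single-VC addition achieves it)
    return min(zip(apls, links))

apl_model = lambda cfg: 100 - 10 * cfg["A"] - 5 * cfg["B"]
print(evaluate_perturbations({"A": 1, "B": 1}, ["A", "B"], apl_model))  # -> (75, 'A')
```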
Experimental Setup (1)
- We use Popnet for trace simulation
- Popnet models a typical four-stage router pipeline; the head flit of a packet traverses all four stages, while body flits bypass the first stage
- The number of VCs at each input port can be individually configured, allowing a nonuniform VC configuration at a router
- The latency of a packet is measured as the delay between the time the head flit is injected into the network and the time the tail flit is consumed at the destination
- The reported APL value is the average latency over all packets in the input traffic trace
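The latency and APL definitions above reduce to a one-line computation over per-packet timestamps; a minimal sketch (the record format is an assumption for illustration, not Popnet's actual output):

```python
def average_packet_latency(records):
    """Compute APL over all packets in a trace.

    Each record is (inject_time, consume_time): the cycle the head flit
    enters the network and the cycle the tail flit is consumed at the
    destination. Packet latency is their difference.
    """
    latencies = [consume - inject for inject, consume in records]
    return sum(latencies) / len(latencies)

# Three packets with latencies 20, 30, and 40 cycles
print(average_packet_latency([(0, 20), (5, 35), (10, 50)]))  # -> 30.0
```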
[Figure: tool flow — workload → Simics/GEMS → communication trace → network configuration]
Experimental Setup (2)
- To evaluate our VC allocation heuristics, we use seven different applications from the PARSEC benchmark suite
- Network traffic traces are generated by running the above applications on Virtutech Simics
- The GEMS toolset is used for accurate timing simulation
- We simulate a 16-core, 4x4 mesh CMP:

Cores:            16
Private L1 Cache: 32KB
Shared L2:        1MB distributed over 16 banks
Memory Latency:   170 cycles
Network:          4x4 mesh
Packet Sizes:     72B data packets, 8B control packets
Comparison vs. Uniform-2VC
- The average-rate driven method is outperformed by uniform VC allocation
- Our addition and deletion heuristics achieve up to 36% and 34% reduction in the number of VCs, respectively (w.r.t. the uniform-2VC configuration)
- On average, both of our heuristics reduce the number of VCs by around 21% across all traces (w.r.t. the uniform-2VC configuration)
[Figure: Average Packet Latency (cycles) per trace for uniform-2VC, addition, deletion, and average-rate configurations]
Comparison vs. Uniform-3VC
- Our addition and deletion heuristics achieve up to 48% and 51% reduction in the number of VCs, respectively
- On average, our addition and deletion heuristics achieve 31% and 41% reduction in the number of VCs across all traces
- We observe up to 35% reduction in the number of VCs compared against an existing average-rate driven approach
[Figure: Average Packet Latency (cycles) per trace for uniform-3VC, addition, deletion, and average-rate configurations]
Latency and #VC Reductions
- With #VC = 128, our greedy deletion heuristic improves the APL by 32% and 74% for the fluidanimate and vips traces, respectively, compared with the uniform-2VC configuration
- Our deletion heuristic also achieves 50% and 42% reduction in the number of VCs for these traces, respectively, compared with the uniform-4VC configuration
- Our proposed trace-driven approach can potentially be used to (1) improve performance within a given power constraint, and (2) reduce power within a given performance constraint

[Figure: latency and VC reductions for the fluidanimate and vips traces]
Impact on Power
- We use ORION 2.0 to assess the impact of our approach on power consumption
- ORION 2.0 assumes the same number of VCs at every port of the router, so we need to compute router power for nonuniform VC configurations:
  - Estimate the power overhead of adding a single VC to all router ports
  - From it, estimate the power overhead of adding a single VC to just one port
- A similar approach is used to estimate the area overhead of adding a single VC to one router port
- We observe that our proposed approach achieves up to 7% and 14% reduction in power compared against the uniform-2VC and uniform-3VC configurations (without any performance degradation), respectively
- Similarly, we observe up to 9% and 16% reduction in area compared against the uniform-2VC and uniform-3VC configurations, respectively
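The per-port overhead estimation described above can be sketched as follows. The function names and the even split of the uniform +1-VC overhead across ports are our illustrative assumptions, and `power_uniform` is a stand-in for a uniform-VC power model such as ORION 2.0, not its actual API:

```python
def router_power_nonuniform(vcs_per_port, power_uniform):
    """Estimate router power for a nonuniform VC configuration.

    power_uniform(n): power of a router with n VCs at *every* port, as a
    uniform-VC model would report it. The overhead of one extra VC at a
    single port is approximated as the uniform +1-VC overhead divided
    evenly across the ports.
    """
    ports = len(vcs_per_port)
    base = min(vcs_per_port)                # uniform baseline: smallest per-port VC count
    per_port_overhead = (power_uniform(base + 1) - power_uniform(base)) / ports
    extra_vcs = sum(v - base for v in vcs_per_port)
    return power_uniform(base) + extra_vcs * per_port_overhead

# Toy linear model: 10 units base + 2 units per uniform VC; one extra VC on port 2
print(router_power_nonuniform([1, 1, 2, 1], lambda n: 10 + 2 * n))  # -> 12.5
```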
Conclusions
- Proposed a trace-driven method for optimizing NoC configurations
- Considered the problem of application-specific VC allocation
- Showed that existing average-rate driven VC allocation approaches fail to capture the application-specific characteristics needed to further improve performance and reduce power
- In comparison with uniform VC allocation, our approaches achieve up to 51% and 74% reduction in the number of VCs and in average packet latency, respectively
- In comparison with an existing average-rate driven approach, we observe up to 35% reduction in the number of VCs
- Ongoing work:
  - New metrics to more efficiently capture the impact of VC allocation on average packet latency
  - New metaheuristics to further improve our performance improvement and VC reduction gains
Thank You