
Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip

Hyungjun Kim, Boris Grot, Member, IEEE, Paul V. Gratz, Member, IEEE, and

Daniel A. Jiménez, Member, IEEE

Abstract—As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent, on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.

Index Terms—Power management, cache memory, spatial locality, interconnections


1 INTRODUCTION

While process technology scaling continues providing more transistors, the transistor performance and power gains that accompany process scaling have largely ceased [1]. Chip-multiprocessor (CMP) designs achieve greater efficiency than traditional uniprocessors through concurrent parallel execution of multiple programs or threads. As the core count in chip-multiprocessor systems increases, networks-on-chip (NoCs) present a scalable alternative to traditional, bus-based designs for interconnection between processor cores [2]. As in most current VLSI designs, power efficiency has also become a first-order constraint in NoC design. The energy consumed by the NoC itself is 28 percent of the per-tile power in the Intel Teraflop chip [3] and 36 percent of the total chip power in the MIT RAW chip [4]. In this paper, we present a novel technique to reduce energy consumption for CMP core interconnect leveraging spatial locality speculation to identify unused cache block words. In particular, we propose to predict which words in each cache block fetch will be used and leverage that prediction to reduce dynamic energy consumption in the NoC channels and routers through diminished switching activity.

1.1 Motivation

Current CMPs employ cache hierarchies of multiple levels prior to main memory [5], [6]. Caches organize data into blocks containing multiple contiguous words in an effort to capture spatial locality and reduce the likelihood of subsequent misses. Unfortunately, applications often do not fully utilize all the words fetched for a given cache block, as recently noted by Pujara and Aggarwal [7].

Fig. 1 shows the percentage of words utilized in the PARSEC multithreaded benchmark suite [8]. On average, 61 percent of cache block words in the PARSEC suite benchmarks will never be referenced and represent energy wasted in transference through the memory hierarchy.¹ In this work, we focus on the waste associated with traditional approaches to spatial locality, in particular the wasted energy and power caused by large cache blocks containing unused data.

1.2 CMP Interconnect

Networks-on-chip purport to be a scalable interconnect to meet the increasing bandwidth demands of future CMPs [2]. NoCs must be carefully designed to meet many constraints. Energy efficiency, in particular, is a challenge in future NoCs, as the energy consumed by the NoC itself is a significant fraction of the total chip power [3], [4]. The NoC packet datapath, consisting of the links, crossbars, and FIFOs, consumes a significant portion of interconnect power: 55 percent of network power in the Intel Teraflop chip [3].


. H. Kim is with the Department of Electrical and Computer Engineering, Texas A&M University, 801 Spring Loop Apt 1608, College Station, TX 77840. E-mail: [email protected].

. B. Grot is with the Institute of Computing and Multimedia Systems, EPFL IC ISIM PARSA, INJ 238 (Bâtiment INJ), Station 14, CH-1015, Lausanne, Switzerland. E-mail: [email protected].

. P.V. Gratz is with the Department of Electrical and Computer Engineering, Texas A&M University, 3259 TAMU, College Station, TX 77843-3259. E-mail: [email protected], [email protected].

. D.A. Jiménez is with The University of Texas at San Antonio, San Antonio, TX 78249, and at 14781 Memorial Drive #1529, Houston, TX 77079. E-mail: [email protected], [email protected].

Manuscript received 31 Oct. 2011; revised 1 Apr. 2012; accepted 13 Sept. 2012. Recommended for acceptance by R. Ginosar and K.S. Chatha. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2011-10-0783. Digital Object Identifier no. 10.1109/TC.2012.238.

1. Versus 64-byte lines, 32-byte lines reduce unused words to 45 percent; however, they increase AMAT by 11 percent (see Section 5.3).


Existing NoCs implement channels with relatively large link bit-widths (≥128 bits) [9], [10], a trend expected to continue as more wire density becomes available in future process technologies. These high-bandwidth link wires reduce the latency of cache block transmission by allowing more words to be transferred in each cycle, minimizing serialization latency for large packets. In some cases, however, not all the words in a flit are useful to the processor. In particular, unused cache block words represent wasted power and energy. We propose to use spatial locality speculation to leverage the unused words of block transfers between the lower and upper cache levels to save energy.

1.3 Proposed Technique

The goal of the proposed technique is to reduce dynamic energy in CMP interconnect by leveraging spatial locality speculation on the expected used words in fetched cache blocks in CMP processor memory systems.

The paper makes the following contributions:

. A novel intracache-block spatial locality predictor, to identify words unlikely to be used before the block is evicted.

. A static packet encoding technique which leverages spatial locality prediction to reduce the network activity factor, and hence dynamic energy, in the NoC routers and links. The static encoding requires no modification to the NoC and minimal additions to the processor caches to achieve significant energy savings with negligible performance overhead.

. A complementary dynamic packet encoding technique which facilitates additional energy savings in NoC links and routers via lightweight microarchitectural enhancements.

In a 16-core CMP implemented in a 45-nm process technology, the proposed technique achieves an average of ~35 percent savings in total dynamic interconnect energy at the cost of less than 1 percent increase in memory system latency.

The rest of this paper is organized as follows. Section 2 discusses the related work and background in caches, NoCs, and on-chip power efficiency to provide the intuition behind our power-saving flit-encoding technique. Section 3 discusses our proposed technique in detail, including the proposed spatial locality predictor and the proposed packet encoding schemes. Sections 4 and 5 present the experimental setup and the results. Finally, we conclude in Section 6.

2 BACKGROUND AND RELATED WORK

2.1 Dynamic Power Consumption

When a bit is transmitted over an interconnect wire or stored in an SRAM cell, dynamic power is consumed as a result of a capacitive load being charged up, and also due to transient currents during the momentary short from Vdd to Gnd while transistors are switching. Dynamic power is not consumed in the absence of switching activity. Equation (1) shows the dynamic and short-circuit components of power consumption in a CMOS circuit:

P = α · C · V² · f + t · α · V · I_short · f.   (1)

In the equation, P is the power consumed, C is the switched capacitance, V is the supplied voltage, and f is the clock frequency. α represents the activity factor, which is the probability that the capacitive load C is charged in a given cycle; t is the transition time during which the short-circuit current I_short flows. C, V, and f are a function of technology and design parameters. In systems that support dynamic voltage-frequency scaling (DVFS), V and f might be tunable at runtime; however, dynamic voltage and frequency adjustments typically cannot be done at a fine spatial or temporal granularity [11]. In this work, we target the activity factor, α, as it enables dynamic energy reduction at a very fine granularity.
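As a concrete illustration of Equation (1), the short Python sketch below evaluates the two terms and shows the lever this work targets: halving the activity factor α halves dynamic power. All numeric values are illustrative, not measurements from the paper.

    def cmos_power(alpha, C, V, f, t_sc=0.0, I_short=0.0):
        """Equation (1): dynamic plus short-circuit power."""
        return alpha * C * V**2 * f + t_sc * alpha * V * I_short * f

    # Illustrative values: 1 pF switched capacitance, 1.0 V supply, 1 GHz clock.
    p_full  = cmos_power(alpha=0.50, C=1e-12, V=1.0, f=1e9)   # 0.5 mW
    p_gated = cmos_power(alpha=0.25, C=1e-12, V=1.0, f=1e9)   # 0.25 mW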

2.2 NoC Power and Energy

Researchers have recently begun focusing on the energy and power in NoCs, which have been shown to be significant contributors to overall chip power and energy consumption [3], [4], [12], [13].

One effective way to reduce NoC power consumption is to reduce the amount of data sent over the network. To that end, recent work has focused on compression at the cache and network levels [14], [15] as an effective power-reduction technique. In general, however, compression techniques have latency overheads for compression and decompression. The technique we present is orthogonal to, and could potentially be used in conjunction with, these lossless compression techniques to further reduce power. Our work seeks to reduce the amount of data transmitted through identification and removal of useless words; traditional compression could be used to more densely pack the remaining data.

Researchers have also proposed a variety of techniques to reduce interconnect energy consumption through reduced voltage swing [16]. Schinkel et al. propose a scheme which uses a capacitive transmitter to lower the signal swing to 125 mV without the use of an additional low-voltage power supply [17]. In this work, we evaluate our prediction and packet encoding techniques for links composed of both full-signal-swing and low-signal-swing wires.

NoC router microarchitectures for low power have also been explored, to reduce power in the transmission of data which is much smaller than a flit. Das et al. propose a novel crossbar and arbiter design that supports concurrent transfers of multiple flits on a single link to improve bandwidth utilization [18].

Finally, static power consumption due to leakage currents is also a significant contributor to total system power. However, researchers have shown that power-gating techniques can be comprehensively applied at the NoC level and are highly effective at reducing leakage power during periods of low network activity [19].

Fig. 1. Percentage of 64-byte cache block words utilized per block in the PARSEC multithreaded benchmarks.

2.3 Spatial Locality and Cache Block Utilization

Spatial and temporal locality have been studied extensively since caches came into wide use in the early 1980s [20]. Several works in the 1990s and early 2000s focused on indirectly improving spatial locality through compile-time and runtime program and data transformations which improve the utilization of cache lines [21], [22], [23], [24]. While these techniques are promising, they either require compiler transformations or program changes and cannot be retrofitted onto existing code. Our proposed approach relies on low-overhead hardware mechanisms and is completely transparent to software.

Hardware techniques to minimize data transfer among caches and main memory have also been explored in the literature. Sectored caches were proposed to reduce the data transmitted for large cache block sizes while keeping overhead minimal [25]. With a sectored cache, only a portion of the block (a sector) is fetched, significantly reducing both the miss time and the bus traffic. The proposed technique builds upon this idea by speculatively fetching not just the missing sector but also the sectors (words, in this case) which have predicted spatial locality with the miss.

Prefetching is a technique where cache lines expected to be used in the future are fetched prior to their demand request, to improve performance by reducing misses. This may come at the cost of more power spent in the interconnect between caches when inaccurate prefetches lead to unused cache block fetches. Our technique is complementary and can be used to compensate for prefetch energy overhead by gating unused words in prefetches.

Pujara and Aggarwal examined the utilization of cache lines and showed that only 57 percent of words are actually used by the processor and that the usage pattern is quite predictable [7]. They leverage this information to lower power in the cache itself by reducing the number of words read from the lower level cache and written to the upper level cache. This mechanism is orthogonal and potentially complementary to our technique, as we focus primarily on achieving lower energy consumption in the interconnect. Yoon et al. proposed an architecture that adaptively chooses memory system granularity based on spatial locality and error-tolerance tradeoffs [26]. While this work focuses on contention for off-chip memory bandwidth, our work targets on-chip interconnect energy consumption by observing spatial locality. Qureshi et al. suggested a method to pack the used words into a part of the cache after a line is evicted from the normal cache, thus increasing performance and reducing misses [27]. Their work thus focuses on performance rather than energy efficiency and targets the effectiveness of the second-level cache.

Spatial locality prediction is similar to dead block prediction [28]. A dead block predictor predicts whether a cache block will be used again before it is evicted. The spatial locality predictor introduced in this paper can be thought of as a similar device at a finer granularity; unlike dead block predictors, however, it takes into account locality relative to the critical word offset. Chen et al. predicted a spatial pattern using a pattern history table which can be referenced by the PC appended with the critical offset [29]. The number of entries in the pattern history table, as well as the number of indexes, increases the memory requirement of that technique. Unlike these schemes, our predictor uses a different mechanism for managing prediction thresholds in the face of mispredictions. Kim et al. proposed spatial locality speculation to reduce energy in the interconnect [30]; we present here an extended journal version of that earlier work.

3 DESCRIPTION

Our goal is to save dynamic energy in the memory system interconnect by eliminating switching activity associated with unused words in cache blocks transferred between the different levels of the on-chip cache hierarchy. To this end, we developed a simple, low-complexity spatial locality predictor, which identifies the words expected to be used in each cache block. A used-word prediction is made on an L1 cache miss, before generating a request to the L2. This prediction is used to generate the response packet, eliding the unused words with the proposed flit encoding schemes described below.

Fig. 2 depicts the general baseline architecture, representing a 16-node NoC-connected CMP. A tile consists of a processor, a portion of the cache hierarchy, and a Network Interface Controller (NIC), and is bound to a router in the interconnection network. Each processor tile contains private L1 instruction and data caches. We assume that the L2 is organized as a shared S-NUCA cache [31], with each tile containing one bank of the L2. The chip integrates two memory controllers, accessed via the east port of node 7 and the west port of node 8. Caches have a 64-byte block size. The NoC link width is 16 bytes, discounting flow-control overheads. Thus, cache-block-bearing packets are five flits long, with one header flit and four data flits. Each data flit contains four 32-bit words, as shown in Fig. 5b.

3.1 Spatial Locality Prediction

3.1.1 Prediction Overview

Our predictor leverages the history of use patterns within cache blocks brought in by a given instruction. The intuition behind our predictor is that a given set of instructions may access multiple different memory address regions in a similar manner. In fact, we have observed that patterns of spatial locality are highly correlated to the address of the instruction responsible for filling the cache (the fill PC). The literature also shows that a small number of instructions cause the most cache misses [32]. Moreover, a given sequence of memory instructions accesses the same fields of data structures throughout memory [29]. Data structure instances are unfortunately not aligned to the cache block; this misalignment can be adjusted by using the offset of the word which causes the cache miss (the critical word offset) while accessing the prediction table, as in [7].

Fig. 2. General CMP architecture.

3.1.2 Predictor Implementation

Our prediction table is composed of rows of four-bit saturating counters, where each counter corresponds to a word in the cache block. The table is accessed such that the fill PC picks the row of the table, and then n consecutive counters starting from the critical word offset are chosen, where n is the number of words in a cache block. (Thus, there are 2n − 1 counters per row to account for all n possible offsets.) These counters represent the history of word usage in cache blocks brought in by a certain memory instruction.

The value of a saturating counter relates to the probability that the corresponding word is used: the lower the counter, the higher the confidence that the word will not be used. Initially, all counters are set to their maximum value, representing a prediction that all words are used. As cache lines are evicted with unused words, counters associated with those unused words are decremented, while counters associated with used words are incremented. If a given word counter is equal to or greater than a fixed threshold (configured at design time), then the word is predicted to be used; otherwise, it is predicted not used. We define the used-vector as a bit vector which identifies the words predicted used by the predictor in the cache block to be filled. A used-vector of 0xFFFF represents a prediction that all 16 words will be used, while a used-vector of 0xFF00 signifies that only the first eight words will be used.

Fig. 3 shows the steps of the prediction. In this example, the number of words in a block is assumed to be 4 and the threshold is 1, for simplicity. In the figure, a cache miss occurs on an instruction accessing the second word in a given block (critical word offset = 1). The lower order bits of the fill PC select a row in the prediction table. Among the counters in the row, the selection window of four counters, which initially includes the four rightmost counters, moves to the left by the number of the critical word offset. The selected counters are translated into a predicted used-vector based on the threshold value. The used-vector, 1100, indicates that the first and the second words in this block will be used.

The L1 cache keeps track of the actual used-vector while the block is live, as well as the lower order bits of the fill PC and the critical word offset. When the block is evicted from the L1, the predictor is updated with the actual used-vector: if a word was used, then the corresponding counter is incremented; otherwise, it is decremented. While updating, the predictor locates the corresponding counters with the fill PC and the critical word offset, as it does for prediction. In the event a word is falsely predicted "unused," the counters for the entire row are reset to 0xF to reduce the likelihood of future mispredictions. This form of resetting has been shown to improve confidence over up/down counters for branch predictors [33]; in initial development of the predictor, we found a considerable improvement in accuracy using this technique as well. Resetting the counters allows the predictor to quickly react in the event of destructive interference and/or new program phases.
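To make the lookup and update paths concrete, the following Python sketch is a minimal behavioral model of the predictor as described above (4-bit saturating counters, 2n − 1 counters per row, row selection by low-order fill-PC bits, a selection window positioned by the critical word offset, and a full-row reset on a false-unused prediction). The class and method names are ours, not the paper's.

    class SpatialLocalityPredictor:
        def __init__(self, rows=256, words_per_block=16, threshold=1, ctr_bits=4):
            self.n = words_per_block
            self.max = (1 << ctr_bits) - 1              # 0xF
            self.threshold = threshold
            self.rows = rows
            # 2n-1 counters per row, all initialized to max ("all words used").
            self.table = [[self.max] * (2 * self.n - 1) for _ in range(rows)]

        def _window(self, fill_pc, offset):
            row = self.table[fill_pc % self.rows]       # low-order fill-PC bits pick the row
            return row, (self.n - 1) - offset           # window slides left by the offset

        def predict(self, fill_pc, offset):
            """Return the used-vector: entry j is 1 if word j is predicted used."""
            row, start = self._window(fill_pc, offset)
            return [int(row[start + j] >= self.threshold) for j in range(self.n)]

        def update(self, fill_pc, offset, actually_used, predicted):
            """Called on L1 eviction with the block's actual use bits."""
            row, start = self._window(fill_pc, offset)
            # A false-unused prediction resets the whole row to max confidence.
            if any(u and not p for u, p in zip(actually_used, predicted)):
                row[:] = [self.max] * len(row)
                return
            for j, used in enumerate(actually_used):
                c = row[start + j]
                row[start + j] = min(c + 1, self.max) if used else max(c - 1, 0)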

3.1.3 Impact on Energy

We model a cache with 64 B blocks and 4 B words. Each row of the predictor is composed of 31 four-bit saturating counters, all initialized to 0xF. The predictor table has 256 rows of 31 × 4 bits each, thus requiring ~4 KB of storage. We note that although the "word" size here is 4 B, this represents the prediction granularity; it does not preclude use in a 64-bit (8 B) word architecture.
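For concreteness, the storage figure follows directly from the stated parameters: 256 rows × (2 × 16 − 1) counters × 4 bits = 256 × 31 × 4 = 31,744 bits ≈ 3.9 KB, which rounds to the ~4 KB quoted above.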

In addition to the 4 KB prediction table, our scheme requires extended metadata in each cache tag. In the L1, 16 bits (one per word) are necessary to record which words have been accessed so that we can update the predictor; 8 bits for the fill PC and 4 bits for the critical word offset are required to access the prediction table as well. We also replace the single valid bit with a vector of 16 bits in the L1 and L2 caches. Although this metadata increases the power per access of the L1 and L2 caches by 0.35 and 0.72 percent, respectively, we also reduce the number of words read from the lower level caches and written to the upper level cache by ~30 percent, and the number of words written back into the lower level cache by ~40 percent. Altogether, this results in an average ~20 percent reduction in dynamic energy consumption per cache access. The dynamic energy consumed by the prediction tables is discussed in Section 4.

Although not the focus of this work, leakage energy dissipation can also be optimized with the help of spatial locality speculation. Chen et al. achieved a 41 percent leakage energy reduction with their proposed spatial locality predictor and a circuit-level selective sub-blocking technique [29]. We expect a better reduction could be achieved with our technique (as our predictor accuracy is higher); we plan to explore this in future work on used-word prediction for cache power reduction.

3.1.4 Impact on Performance

When an L1 cache miss is discovered, the predictor supplies a prediction to inform flit composition. The prediction takes two cycles: one for the table access and one for thresholding and shifting. The fastest approach would be to speculatively assume that every cache access will result in a miss and begin the prediction simultaneously with address translation; thus, the latency can be completely hidden. A more energy-efficient approach is to begin the prediction as soon as the tag mismatch is discovered, simultaneously with victim selection in the L1 cache. While this approach would add a cycle to the L1 miss time, no time would be added to the more performance-critical L1 hit time. The latter approach was used in our experiments. If a word predicted unused actually is used, it is treated as a miss, and all words initially predicted as unused are brought into the upper level cache in order to correct the misprediction. The performance impact of these extra misses is discussed in Section 5.1.

Fig. 3. Prediction example for a four-word-per-block cache model.

On eviction of an L1 cache block, the used-vector and fill PC collected for that block are used to update the predictor. This process is less latency sensitive than prediction, since the predictor does not need to be updated immediately to provide good accuracy.

3.2 Packet Composition

Once we have predicted the application's expected spatial locality to determine the unused words in a missing cache block, we employ a flit encoding technique which leverages unused words to reduce dynamic link and router energy in the interconnect between the L1, directory, L2 cache banks, and memory controllers. We propose two complementary means to leverage spatial locality prediction to reduce α, the activity factor, in the NoC, thereby directly reducing dynamic energy: 1) remove flits from NoC packets (flit-drop); 2) keep unused interconnect wires at fixed polarity during packet traversal (word-repeat). For example, if two flits must be transmitted and all the words in the second flit are predicted unused, our flit-drop scheme would discard the unused flit to reduce the number of flits transmitted over the wire. In contrast, our word-repeat scheme would retransmit the first flit, keeping the wires at fixed polarity to reduce gate switching. These encoding schemes are also used for writeback packets, to include dirty words only.

Packet composition may be implemented either "statically," whereby packet encoding occurs at packet generation time, or "dynamically," in which the unused words in each flit are gated within the router FIFOs, crossbars, and links to avoid causing bit transitions, regardless of the flits which precede or follow. We will first discuss the "static" packet composition techniques, including flit-drop, static-word-repeat, and their combination. We then discuss the "dynamic" packet composition techniques, which allow greater reductions in activity factor at the cost of a small increase in logical complexity in the routers and a slight increase in link bit-width.

3.2.1 Static Packet Composition

Fig. 4 depicts the format of cache request and reply packet flits in our design. A packet is composed either of a head flit and a number of body flits (when the packet contains a cache block), or it consists of one atomic flit, as in the case of a request packet or a coherence protocol message. The head/atomic flit contains a used-vector. The head flit also contains source and destination node identifiers and the physical memory address of the cache block. The remaining bytes in the head/atomic flit are unused. We assume a flow-control overhead of three bits: 1 bit for virtual channel id (VC) and 2 bits for flit type (FT). As each body/tail flit contains four words of data (16 bytes), a flit is 16 bytes plus 3 bits wide, including flow-control overhead.

Fig. 5a depicts an example of a read request (L1 fill). In this example, tile #1 requests a block at address 0x00001200 which resides in the S-NUCA L2 cache bank in tile #8. The used-vector is 1111 1100 0000 1010, indicating that word0 through word5, word12, and word14 are predicted used. The corresponding response packet must contain at least those words. Since the baseline architecture sends the whole block as is, the baseline packet contains all of the words from word0 to word15, as shown in Fig. 5b.

Flit-drop. In the flit-drop technique, flits which are predicted to contain only unused words are dropped from the packet, and only those flits which contain one or more used words are transmitted. The reduction in the number of flits per packet reduces the number of bit transitions over interconnect wires and, therefore, the energy consumed. Latency due to packet serialization and NoC bandwidth consumption will also be reduced. Although a read request packet may have an arbitrary used-vector, the response packet must contain all flits which have any words predicted used, leading to some lost opportunity for packets in which used and unused words are intermingled.

Fig. 4. Flit format for static and dynamic encoding. (Shaded portion not present in static encoding.)

Fig. 5. Read request and corresponding response packets (VC is not shown in this figure).

Fig. 5c depicts the response packet to the request shown in Fig. 5a for the flit-drop scheme. The first body flit, containing word0-word3, must be in the packet, as all of these words are used. The second body flit, with word4-word7, also contains all valid words, despite the prediction that word6 and word7 will not be used. These extra words are overhead in the flit-drop scheme because they are not predicted used but must be sent nevertheless. Although these words waste dynamic power when the prediction is correct, they may reduce the misprediction probability.
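A minimal Python sketch of the flit-drop decision, under the packet format described above (16-word blocks, four words per data flit; the helper name is ours): a data flit is transmitted only if its slice of the used-vector contains at least one predicted-used word.

    def flits_to_send(used_vector, words_per_flit=4):
        """used_vector: one 0/1 entry per word (16 entries for a 64 B block)."""
        kept = []
        for i in range(0, len(used_vector), words_per_flit):
            if any(used_vector[i:i + words_per_flit]):
                kept.append(i // words_per_flit)
        return kept

    # Used-vector 1111 1100 0000 1010 from Fig. 5a: flit 2 (word8-word11)
    # is all-unused and is dropped; flits 0, 1, and 3 are sent.
    print(flits_to_send([1,1,1,1, 1,1,0,0, 0,0,0,0, 1,0,1,0]))  # [0, 1, 3]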

Static-word-repeat. The static-word-repeat scheme reduces the activity factor of flits containing unused words by repeating the contents of the previous flit in place of the unused words. Flits with fewer used words consume less power because there are fewer bit transitions between flits. Words marked as "used" in the used-vector contain real, valid data. Words marked as "unused" in the used-vector contain repeats of the word in the same location in the previous flit. For instance, if word(4x+1) is predicted unused, the NIC places word(4(x−1)+1) in its place. As the bit lines repeat the same bits, there are no transitions on those wires and no dynamic energy consumption. A buffer retaining the four words previously fetched by the NIC is placed between the cache and the NIC and helps the NIC in repeating words. An extra mux and logic gates are also necessary in the NIC to encode repeated words.

Fig. 5d depicts the response packet for the request in Fig. 5a using the static-word-repeat scheme. In body1, word6 and word7 are unused and thus replaced with word2 and word3, which are at the same locations in the previous flit. All of the words in body2 are repeats of the words in body1; thus it carries virtually nothing but flow-control overhead. We also encode the unused header words, if possible.
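The encoding itself can be modeled in a few lines of Python. This is a behavioral sketch under the same assumptions as the previous example (the function name is ours, and the initial "previous flit" is simplified to zeros); it shows how each unused slot reuses whatever word last occupied the same wire positions, so those wires see no transitions.

    def word_repeat_encode(flits, used_vector, words_per_flit=4):
        """flits: list of flits, each a list of words; returns the encoded flits."""
        prev = [0] * words_per_flit            # what the wires last carried (simplified)
        encoded = []
        for i, flit in enumerate(flits):
            out = []
            for j, word in enumerate(flit):
                if used_vector[i * words_per_flit + j]:
                    out.append(word)           # valid data word
                else:
                    out.append(prev[j])        # repeat the previous flit's word
            encoded.append(out)
            prev = out
        return encoded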

3.2.2 Dynamic Packet Composition

The effectiveness of static packet composition schemes is reduced in two commonly occurring scenarios: 1) when single-flit, atomic packets are being transmitted, and 2) when flits from multiple packets are interleaved in the channel. In both cases, repeated words in the flits cannot be statically leveraged to eliminate switching activity in the corresponding parts of the datapath. In response, we propose dynamic packet composition to reduce NoC switching activity by taking advantage of invalid words on a flit-by-flit basis. The difference between the dynamic and static composition schemes resides primarily in how word-repeat treats unused words. In static composition, the unused portion of a flit is statically set at packet injection by the NIC to minimize interflit switching activity, requiring no changes to the router datapath. In dynamic composition, portions of the router datapath are dynamically enabled and disabled based on the validity of each word in the flit. In effect, an invalid word causes the corresponding portion of the datapath to hold its previous value, creating the illusion of word repeat.

To facilitate dynamic composition, the used-vector is distributed into each flit as shown in Fig. 4b. As a result, the link width must be widened by four bits to accommodate the new "valid-word-vector," in which each bit indicates whether the corresponding word in that flit is valid. As the figure shows, the head flit's valid-word-vector is always set to 1100, because the portion which corresponds to word2 and word3 of a body/tail flit is always unused.

Dynamic packet composition requires some modifications to a standard NoC router to enable datapath gating in response to per-flit valid bits. Fig. 6 depicts the microarchitecture of our dynamic packet composition router. Assuming that a whole cycle is required for a flit to traverse a link, latches are required on both sides of each link. The additional logic required for link encoding is shaded in the magnified output port. Plain D flip-flops are replaced with enable-D flip-flops to force the repeat of the previous flit's word when the valid-word-vector bit for that word is set to zero, indicating that the word is not used. Alternately, if the valid-word-vector bit for the given word is one, the word is propagated onto the link in the following cycle, as it would be in a traditional NoC router. In cases where the link traversal consumes less than a full cycle, this structure could be replaced with a tristate buffer to similar effect.

We further augment the router's input FIFO buffers with per-word write enables connected to the valid-word-vector, as shown in Fig. 6. In our design, the read and write pointer control logic in the router's input FIFOs remains unmodified; however, the SRAM array storage used to hold the flits is broken into four banks, each one word in width. The valid-word-vector bits gate the write enables going to each of the word-wide banks, disabling writes associated with unused words in incoming flits and saving the energy associated with those word writes. The combination of these techniques for dynamic packet composition will reduce the power and energy consumption of the NoC links and router datapath proportional to the reduction in activity factor due to the word-repeat and flit-drop of unused words.

Fig. 6. Dynamic packet composition router. (Shaded portion not present in baseline router.)
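As an illustration, the Python sketch below models the behavior of the enable-D flip-flop link stage (the per-word FIFO write enables behave analogously): a word position updates only when its valid-word-vector bit is set, and otherwise the corresponding wires hold their previous value. This is a behavioral model written for exposition, not the paper's RTL.

    class GatedLinkStage:
        def __init__(self, words_per_flit=4):
            self.state = [0] * words_per_flit  # words currently driven on the link

        def clock(self, flit_words, valid_word_vector):
            for j, valid in enumerate(valid_word_vector):
                if valid:                      # enable asserted: latch the new word
                    self.state[j] = flit_words[j]
                # else: hold the previous value, so those wires do not toggle
            return list(self.state)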

As flit-drop and word-repeat are complementary, we will also examine their combination in the evaluation section. One alternative technique we explored packs used words together into a minimal-size packet. Experimentally, we found this approach produces latency and power benefits negligibly different from the combination of flit-drop and word-repeat, while our technique requires less additional hardware for packet composition, so those results are not presented. These encoding schemes are also used for writebacks, by marking clean words as unused.

4 EVALUATION

4.1 Baseline Architecture and Physical Implementation

Fig. 2 depicts the baseline architecture, representing a 16-node NoC-connected CMP. A tile consists of a processor, a portion of the cache hierarchy, and a Network Interface Controller, and is bound to a router in the interconnection network. The baseline architecture employs a 4 × 4 2D mesh topology with X-Y routing and wormhole flow control. Each router contains 2 VCs, and each input buffer is four flits deep. In our baseline configuration, we assume the tiles are 36 mm² with 6 mm-long links between nodes. Our target technology is 45 nm.

Processor tiles. Each 36 mm² tile contains an in-order processor core similar to an Intel Atom Z510 (26 mm²) [34], a 512 KB L2 cache slice (4 mm²), two 32 KB L1 caches (0.65 mm² each), and an interconnect router (0.0786 mm²). The remaining area is devoted to a directory cache and an NIC. Our system is composed of 16 tiles and results in 576 mm², approximately the size of an IBM Power7 die [35]. We used CACTI 6.0 [36] to estimate cache parameters.

The L1 caches are two-way set-associative with a two-cycle access latency. The L2 banks are eight-way set-associative with a 15-cycle access time. The 16 L2 banks spread across the chip comprise an 8-MB S-NUCA L2 [31]. Cache lines in both L1 and L2 caches are 64 B wide (16 four-byte words), except where otherwise noted. Each node also contains a slice of the directory cache, interleaved the same as the L2; its latency is 2 cycles. The number of entries in each directory cache is equal to the number of sets in an L2 bank. We assume the latency of the main memory is 100 cycles. The MESI protocol is used by the directory to maintain cache coherence. The predictor's performance is examined with a threshold value of 1 unless stated otherwise. The NoC link width is assumed to be 128 bits, discounting flow-control overheads.

NoC link wires. NoC links require repeaters to improve delay in the presence of the growing wire RC delays due to diminishing interconnect dimensions [1]. These repeaters are major sources of channel power and area overhead. Equally problematic is their disruptive effect on floorplanning, as large swaths of space must be allocated for each repeater stage. Our analysis shows that a single, energy-optimized 6 mm link in 45 nm technology requires 13 repeater stages and dissipates over 42 mW of power for 128 bits of data at 1 GHz.

In this work, we consider both full-swing repeated interconnects (full-swing links) and an alternative design that lowers the voltage swing to reduce link power consumption (low-swing links). We adopt a scheme by Schinkel et al. [17] which uses a capacitive transmitter to lower the signal swing to 125 mV without the use of an additional low-voltage power supply. The scheme requires differential wires, doubling the NoC wire requirements. Our analysis shows a 3.5× energy reduction with low-swing links. However, low-swing links are not as friendly to static-word-repeat as full-swing links are: there is energy dissipation on low-swing links even when a bit repeats the bit ahead of it, because of leakage currents and high sense-amplifier power consumption on the receiver side. Thus, repeated unused words consume ~18 percent of the energy that used words do. The dynamic encoding technique fully shuts down those portions of the link by power-gating all components using the valid-word-vector bits.

Router implementation. We synthesized both the baseline router and our dynamic encoding router with a TSMC 45 nm library at an operating frequency of 1 GHz. Table 1 shows the area and power of the different router designs. Note that the baseline router and the one used in the static encoding scheme are identical. The table shows the average power consumed under PARSEC traffic, simulated with the methodology described in Section 4.2. The dynamic power for each benchmark is computed by dividing the total dynamic energy consumption by the execution time, then by the number of routers. Summarizing the data, a router design supporting the proposed dynamic composition technique requires ~7 percent more area while reducing dynamic power by 46 percent under the loads examined, at the cost of 4.3 percent more leakage power over the baseline.

Table 2 shows the dynamic energy consumed by a flit with a given number of words encoded as used, traversing a router and a link, for the three flit composition schemes: baseline (base), static (sta), and dynamic (dyn) encoding. In baseline, a flit always consumes energy as if it carries four used words. In static encoding, as the number of used words decreases, flits consume less energy in routers and on full-swing links; static encoding reduces NoC energy by minimizing the number of transitions on the wires in the links and in the routers' crossbars. Dynamic encoding further reduces router energy by gating flit buffer accesses: the four-bit valid-word-vector in each flit controls the write enable signals of each word buffer, disabling writes associated with unused words. Similarly, it also gates low-swing links, shutting down the transceiver pair on wires associated with the unused words. CACTI 6.0 [36] was used to measure the energy consumption due to accessing the predictor, which is 10.9 pJ per access.

TABLE 1. Area and Power.

TABLE 2. Per-Flit Dynamic Energy (pJ). (n: number of used words.)

4.2 Simulation Methodology

We used the M5 full-system simulator to generate CMP cache block utilization traces for multithreaded applications [37]. Details of the system configuration are presented in Section 4.1. Our workload consists of the PARSEC shared-memory multiprocessor benchmarks [8], cross-compiled using the methodology described by Gebhart et al. [38]. All applications in the suite currently supported by M5 were used. Traces were taken from the "region of interest." Each trace contains up to a billion memory operations, fewer if the end of the application was reached. Cycle-accurate timing estimation was performed using Netrace, a memory system dependence tracking methodology [39].

The total network energy consumption for each benchmark is measured by summing the energy of all L1 and L2 cache fill, spill, and coherence packets as they traverse routers and links in the network. In effect, Table 2 is consulted whenever a flit with a certain number of used words traverses a router and a link. Note that even for the same flit, the used word count may vary according to the encoding scheme in use; for example, for an atomic flit, n = 4 in static encoding while n = 2 in dynamic. The predictor's energy is also added whenever the predictor is accessed.
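The accounting loop can be summarized in a few lines of Python. This is our sketch of the methodology just described; the per-flit energy list mirrors the shape of Table 2, but its numbers are illustrative placeholders, not the paper's measured values (only the 10.9 pJ predictor access energy comes from the text above).

    def network_energy_pj(hops_used_words, per_flit_energy_pj,
                          predictor_accesses, predictor_pj=10.9):
        """hops_used_words: one used-word count n per (router, link) hop of a flit."""
        return (sum(per_flit_energy_pj[n] for n in hops_used_words)
                + predictor_accesses * predictor_pj)

    # Placeholder per-flit energies indexed by n = number of used words (0..4).
    dyn_energy = [4.0, 8.0, 12.0, 16.0, 20.0]
    total = network_energy_pj([4, 2, 3], dyn_energy, predictor_accesses=1)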

4.3 Energy Consumption

Fig. 7 shows the breakdown of dynamic energy consumption. For each benchmark, we conducted energy simulations for three configurations, each represented by one stacked bar for that benchmark: 1) baseline, 2) s-combo—static-word-repeat and flit-drop combined, and 3) d-combo—dynamic-word-repeat and flit-drop combined. We also show the average energy consumption with pure flit-drop (flitdrop), static-word-repeat (s-wr), and dynamic-word-repeat (d-wr). The bars are normalized against the energy consumed by baseline. Each bar is subdivided into up to four components: "read," the energy consumed by cache fills; "write," the energy consumed by writebacks; "coh," the energy due to cache coherence packets; and "pred," the energy consumed by the predictors. The figure shows data for both full-swing and low-swing links.

In the baseline configuration we see that, on average, read communication consumes the most dynamic energy, with ~59 percent of the total. Coherence traffic consumes the second most, with ~28 percent of the total energy, followed by write communication with ~13 percent of the total energy. This breakdown follows the intuition that reads are more frequent than writes; thus, techniques which only focus on writebacks will miss much potential gain. It is also interesting to note that cache coherence traffic makes a very significant contribution to overall cache interconnect energy. Similarly, work which does not consider atomic packets may miss significant gain.

The figure also shows that among the flit encoding schemes, d-combo shows the greatest improvement, with ~36 percent dynamic energy savings, on average, when full-swing links are used. If low-swing links are used, the savings become ~34 percent. Pure dynamic-word-repeat is the second best, resulting in ~1 percent additional energy consumption. This implies that dropping flits that carry only flow-control bits does not significantly contribute to energy reduction when dynamic-word-repeat is used; however, combining flit-drop is still beneficial to reduce latency. The combined static encoding (s-combo) provides an energy savings of only ~17 and ~15 percent of baseline, under full-swing and low-swing links, respectively. This indicates the significant gains that dynamic encoding provides, primarily in the cache coherence traffic, which is predominately made up of single-flit packets. We find the predictor contributes merely 1.5 percent of the total energy when full-swing links are used, and 4.1 percent with low-swing links.

Table 1 shows the average power with either type of link. It reveals that despite the increased static power, the dynamic encoding scheme still outperforms both the baseline and the static encoding, regardless of link type.

In the following sections, we will examine each of the traffic types in detail to gain a deeper understanding of the performance of our proposed technique. As full-swing and low-swing links show similar trends, only graphs for full-swing links will be shown hereafter.

Fig. 7. Dynamic energy breakdown.

4.3.1 Read Energy Discussion

Fig. 8 shows the breakdown of dynamic energy consumption for reads. Each bar is subdivided into five components and also normalized against baseline. The first component, "l2," depicts the energy consumed by L2 cache fills and spills. Although the prediction actually occurs on L1 cache misses, the encoding schemes are also used for the transactions between the L2 and the memory controller, based upon the used-vector generated on the L1 cache miss that led to the L2 miss.

The second component shows the "used" energy, the energy consumed by the words which will be referenced by the program; hence the "used" portions are nearly equal across schemes, with the exception of a slight increase in energy due to router overheads in the dynamic scheme. The third component, "unused," shows the energy consumed to bring in words which will not be referenced prior to eviction. This also includes the energy consumed by words which result from false-positive predictions, i.e., an incorrect prediction that a word will be used. The fourth component, "overhead," shows the energy for NoC packet overheads, including header information and flow control bits. The fifth component, "extra," shows the energy consumed by the packet overhead due to extra cache line fills to correct false-negative mispredictions. Our goal is to remove, as much as possible, the dynamic datapath energy consumed by unused words, denoted by unused, and, where possible, the packet overheads in overhead, while minimizing redundant misses due to mispredictions in extra. Unused words consume an average of 33 percent of total dynamic datapath energy, and up to 53 percent in the case of blackscholes (shown as Black in the graphs).

The d-combo scheme, on average, reduces total dynamic datapath energy by ~32 percent. Our prediction mechanism combined with the encoding schemes approximately halves the "unused" portion, on average. In the case of Black, where the predictor performs best, the speculation mechanism removes 90 percent of the "unused" portion, resulting in a 66 percent energy savings for cache fills when combined dynamic encoding is used. The extra transmission due to mispredictions, shown as "extra" in the stack, contributes less than 1 percent of the energy consumption for cache fills.

4.3.2 Coherence Energy Discussion

In the simulated system, coherence is maintained via the MESI protocol. Coherence protocol messages and responses represent a significant fraction of the network traffic. Those protocol messages are composed primarily of single-flit packets, and contribute ~28 percent of total network energy consumption. Fig. 9 shows the breakdown of dynamic energy consumption for coherence packets. Although these single-flit packets contain ~50 percent unused data, as discussed in Section 3.2.1, static encoding cannot be used to reduce their energy dissipation. Dynamic encoding, however, reduces it by up to 45.5 percent.

4.3.3 Write Energy Discussion

Fig. 7 shows that writebacks consume an average of 13 percent of total dynamic energy. Upon dirty line writeback, we find that, on average, 40 percent of the words in the block are clean, and those words contribute 23 percent of the total energy consumed by writebacks.

Fig. 10 shows the dynamic energy breakdown caused by writebacks. The first component, "dirty," shows the energy consumed by the dirty words in cache lines. The second component, "overhead," shows the energy consumed by NoC packet overheads. The third component, "clean," includes the link energy consumed by sending clean words in writeback packets. Our goal is to remove the portion of the energy consumption associated with transmitting clean words. On average, the s-combo scheme reduces the energy consumption due to writebacks by 29 percent. Further savings are achieved by d-combo: it encodes not only body/tail flits but also head flits of the writeback packets, resulting in a 40 percent savings. When full-swing links are used, it is possible to remove all of the energy dissipation due to clean words with the static flit-encoding scheme. However, when static word repeat is used with low-swing links, although clean words are encoded to repeat the words in the flit ahead, those words still cause energy dissipation due to leakage currents and high sense-amplifier power consumption on the receiver side.


Fig. 8. Dynamic energy breakdown for reads.

Fig. 9. Dynamic energy breakdown for coherent packets.

Fig. 10. Dynamic energy breakdown for writes.


5 ANALYSIS

In this section, we analyze the performance impact of the proposed energy reduction technique, explore predictor training, and compare against a 32-byte cache line baseline design.

5.1 Performance

The performance impact of the proposed technique is governed by two divergent effects. First, the scheme should improve performance because flit-drop reduces the number of flits injected into the network; decreased flit count improves performance through less serialization latency and reduced congestion due to lower load. Second, offsetting the benefit from flit-drop, incorrectly predicting a word unused can lead to more misses, increasing network load and average memory access time (AMAT). To quantify the impact of the proposed technique, Fig. 11 shows the reduction in flits injected into the network, the reduction in individual packet latency, and the AMAT for each benchmark, all normalized against baseline. As the figure shows, although flit count and individual packet latency decrease significantly, AMAT is essentially flat across the benchmarks. In this section, we examine the relationship between network performance and system performance.

Fig. 11. Flit count, packet latency, and AMAT normalized against baseline.

5.1.1 Network Performance

As shown in Fig. 11, the flit count is reduced by 12 percent, on average. Optimally, the flit count reduction should be directly proportional to the block utilization. From Figs. 1, 7, and 11, we see that Blackscholes, which has the lowest block utilization and the greatest portion of read energy consumption, has the greatest reduction in flits across the PARSEC benchmarks. Alternately, Bodytrack, which also shows one of the lowest block utilizations, removes merely 1.8 percent of the injected flits. This is because flit count reduction is related not only to block utilization, but also to prediction accuracy, the proportion of single-flit packets, and the used-unused pattern within the packet. In Bodytrack, flit reduction is low because single-flit coherence packets make up a larger portion of the injected packets, and the predictor is less accurate than for Blackscholes.

Lowered flit count should be correlated with reduced packet latency. Fig. 11 shows normalized packet latency. On average, the network latency is reduced by ~6 percent as the number of flits decreases. In Blackscholes, with the greatest reduction in flit count, the packet latency is reduced by 9 percent, one of the best network performance improvements across the PARSEC benchmarks. Alternately, Bodytrack's network performance is improved by only 1 percent. Interestingly, flit count reduction and network latency are not always strictly proportional to each other; X264, counter-intuitively, shows a greater improvement in packet latency than its reduction in flit count, warranting further analysis.

Flit-drop improves network performance not only by reducing the serialization latency but also by avoiding network congestion. Fig. 12 shows the breakdown of average packet latency. Each bar consists of two components: 1) zero load shows the packet latency due to static hop count and serialization latencies, and 2) congestion shows the latency due to resource conflicts. While the two components contribute 67 and 33 percent of the average latency, respectively, the greater impact of reduced flit count lies in congestion. This effect is illustrated by X264, which has the second greatest congestion latency; as a result, a relatively small reduction in flits translates into a greater reduction in packet latency. The overall average packet latency is reduced by 5.7 percent.
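As a rough illustration of this two-way split, the sketch below attributes the hop and serialization delays to zero load and treats the remainder of an observed latency as congestion; the per-hop router delay and the sample numbers are hypothetical inputs, not values from our simulation infrastructure:

```python
def latency_breakdown(measured_latency, hops, flits, router_delay=3):
    """Split an observed packet latency (in cycles) into a zero-load
    component (hop traversal plus serialization of the remaining flits)
    and a congestion component (whatever the zero-load model does not
    explain). router_delay is a hypothetical per-hop pipeline depth."""
    zero_load = hops * router_delay + (flits - 1)
    return zero_load, measured_latency - zero_load

# Illustrative example: a 5-flit packet over 4 hops observed at 25 cycles.
zl, cong = latency_breakdown(25, hops=4, flits=5)
print(f"zero-load = {zl} cycles, congestion = {cong} cycles")
# -> zero-load = 16 cycles, congestion = 9 cycles
```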

5.1.2 Overall System Performance Discussion

As a proxy for overall system performance, we examine the technique's impact on average memory access time. Despite an improvement in packet latency, Fig. 11 shows that AMAT is unchanged on average, with some benchmarks showing a slight improvement and others a slight degradation. To explore this counter-intuitive result, we examine how AMAT relates to packet latency and L1 miss rate. AMAT in this work can be estimated as

AMAT = Latency(L1) + (1 - HitRate(L1)) × Latency(L2+).   (2)

In this equation, Latency(L1) is the constant L1 latency for the system, and HitRate(L1) is the hit rate of L1 accesses, which varies with benchmark locality and is affected by false unused-word predictions. Latency(L2+) is the latency of memory accesses served by L2 and beyond, also known as the L1 miss latency. It is a function of the constant L2 cache access time, the L2 miss rate, the network latency, and the constant memory access time.
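A direct transcription of (2) follows as a minimal sketch; the parameter values below are illustrative only, not measurements from our system:

```python
def amat(l1_latency, l1_hit_rate, l2plus_latency):
    """Average memory access time per (2): the constant L1 latency plus
    the L1 miss fraction weighted by the latency of accesses served by
    L2 and beyond (the L1 miss latency)."""
    return l1_latency + (1.0 - l1_hit_rate) * l2plus_latency

# Illustrative numbers only: a 3-cycle L1, 95% hit rate, 40-cycle miss.
print(amat(3, 0.95, 40))   # -> 5.0 cycles
```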


Fig. 11. Flit count, packet latency, and AMAT normalized against baseline.

Fig. 12. Packet latency breakdown.


Assuming Latency(L2+) is fixed, AMAT is a linear function of HitRate(L1). Fig. 13 visualizes AMAT as f(r), where r is the L1 hit rate.

In Fig. 13, let f1(r) be the AMAT characteristic for a benchmark, where T1 is the L1 latency and T2 is the L1 latency plus the L1 miss latency. Say, with the baseline scheme, the L1 hit rate is r0 and AMAT becomes f1(r0). If our prediction mechanism drops the L1 hit rate to r1 without changing the L1 miss latency, the AMAT becomes f1(r1). In such a case, f1(r1) - f1(r0) represents the performance loss due to mispredictions. However, thanks to our packet composition technique, the L1 miss latency is, in general, lower than that of the baseline. Thus, the AMAT characteristic function should be redrawn as f2(r), and the AMAT at r1 is f2(r1). If f2(r1) < f1(r0), as in this example, the difference f1(r0) - f2(r1) denotes the performance benefit from our prediction technique.

In this example, we can also see that as long as the predictor drops the L1 hit rate no lower than r2, a performance improvement is expected. We define the safe range as the range of L1 hit rates over which our prediction scheme shows equal or better AMAT than the baseline design. In this particular example, the safe range is [r2, r0]. To generalize, the width of the safe range, Δr, is calculated as

Δr = (ΔT / (To - ΔT)) (1 - ro),   (3)

where To and ro are the original L1 miss latency and L1 hit rate, respectively, and ΔT is the reduction in L1 miss latency. The wider the safe range, the better the chance of achieving a performance improvement.
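The following sketch evaluates (3) with illustrative numbers to show how a modest reduction in L1 miss latency widens the tolerable hit-rate drop; the inputs are made up for the example:

```python
def safe_range(t_o, r_o, delta_t):
    """Width of the safe range per (3): the maximum L1 hit-rate drop
    that a reduction delta_t in L1 miss latency can absorb without
    AMAT exceeding the baseline. t_o is the original L1 miss latency
    and r_o the original L1 hit rate."""
    return delta_t / (t_o - delta_t) * (1.0 - r_o)

# Illustrative: a 40-cycle miss latency reduced by 4 cycles at a 95%
# baseline hit rate tolerates a hit-rate drop of ~0.56 percentage points.
dr = safe_range(t_o=40, r_o=0.95, delta_t=4)
print(f"safe range width: {dr:.4f}")   # -> safe range width: 0.0056
```

Consistent with the discussion above, a lower original hit rate ro or a larger latency reduction ΔT widens the safe range.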

Fig. 14a shows the average L1 miss latency for each benchmark. The bars marked base show the L1 miss latency for the baseline design, while pred shows the latency for the proposed scheme. In every case, a reduction in L1 miss latency is observed. This reduction is closely related to the reduction in interconnect network latency shown in Fig. 12. Although the packet latency in that figure is normalized, the L1 miss latency shows a similar trend: the greater the reduction in network latency, the greater the reduction in L1 miss latency. According to (3), a wider safe range, and, in turn, no performance degradation, is expected for benchmarks with a lower base and a bigger gap between base and pred. However, the safe range also depends on the original L1 hit rate.

Fig. 14b shows the original L1 hit rate ("l1 hit base"), the new L1 hit rate ("l1 hit pred"), and the change in average memory access time ("norm lat"). On average, the L1 hit rate is decreased by 0.15 percent. Blackscholes ("Black") has the third greatest improvement in L1 miss latency, the lowest L1 miss latency, and the second lowest L1 hit rate, resulting in the greatest overall performance improvement of 4.2 percent. By contrast, "X264," while having the greatest improvement in L1 miss latency, also has one of the highest L1 miss latencies and the highest L1 hit rate, and therefore achieves the worst overall system performance. Although "Canneal" shows the worst impact on L1 hit rate, due to its low original L1 hit rate the reduced hit rate is still within its safe range; thus, no performance penalty for mispredictions is observed.

5.2 Predictor Tuning

As with many speculative techniques, our scheme incurs a performance penalty when mis-speculation occurs. In this case, the penalty manifests as increased L1D misses. As Fig. 14b shows, L1D hit rates are barely impacted by our technique. One possible interpretation of this data is that our predictor is overly conservative and that further energy gains could be achieved by tuning it more aggressively; in this section, we explore predictor tuning to this end.

Our prediction model requires a threshold value to be configured at design time. As described in Section 3.1.1, the threshold determines whether a certain word will be used or not according to its usage history counter. If the counter value is less than the threshold, the word is predicted to be unused. The smaller the threshold value, the more the predictor is biased toward predicting that a word will be used. Thus, the threshold value tunes the tradeoff between energy consumption and memory access time.
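A minimal sketch of this thresholding rule follows; the counter width, table size, and training policy here are illustrative assumptions, not the exact predictor of Section 3.1.1:

```python
class WordUsePredictor:
    """Sketch of the threshold rule: one saturating usage-history
    counter per word position; a word whose counter is below the
    threshold is predicted unused."""

    def __init__(self, words_per_block=16, threshold=1, counter_max=15):
        # Start counters saturated so the predictor is initially biased
        # toward "used" (a conservative assumption for this sketch).
        self.counters = [counter_max] * words_per_block
        self.threshold = threshold
        self.counter_max = counter_max

    def predict_used(self, word_idx):
        # Counter below threshold -> predict unused, so the word need
        # not be transmitted with the block fetch.
        return self.counters[word_idx] >= self.threshold

    def train(self, word_idx, was_used):
        # Saturating update once actual word usage is known.
        c = self.counters[word_idx]
        if was_used:
            self.counters[word_idx] = min(c + 1, self.counter_max)
        else:
            self.counters[word_idx] = max(c - 1, 0)

p = WordUsePredictor(threshold=1)
for _ in range(16):            # word 3 is repeatedly observed unused
    p.train(3, was_used=False)
print(p.predict_used(3), p.predict_used(0))   # -> False True
```

With a threshold of 1, only a fully decayed counter yields an unused prediction, reflecting the conservative bias discussed later in this section.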

Fig. 15 shows the prediction outcomes with respect to various threshold values (numbers along the bottom) for each of the benchmarks examined. Each bar is broken into components to show the average number of words in a cache line with each of the following characteristics.


Fig. 13. AMAT graph.

Fig. 14. Overall system performance.


The bars marked true_pos show the fraction of true positives: words predicted used and actually used. The bars marked true_neg show the portion of true negatives: words predicted unused and actually not used. The words in this category are the source of the energy reduction in our design. These two categories constitute the words whose spatial locality the predictor speculates correctly. The bars marked false_pos show the fraction of false positives: words predicted used but actually unused. The words in this category do not cause any miss penalty but represent a lost opportunity for energy reduction. Finally, the bars marked false_neg show the portion of false negatives: words predicted unused but actually used. These words result in additional memory system packets, potentially increasing both energy and latency. The threshold value "0" in this figure represents the baseline configuration, where all words are assumed used. In general, as the threshold value increases, the portions of true_neg and false_neg increase while true_pos and false_pos decrease. This implies that the higher the threshold chosen, the lower the energy consumption (due to true_neg predictions) but also the higher the latency (due to false_neg predictions). We also note that even with the most aggressive threshold setting, a significant number of false_pos predictions remain, despite significant increases in false_neg predictions, implying that headroom for improvement via a more accurate prediction mechanism exists.
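These four categories form a standard confusion matrix over predicted and actual word use. A minimal sketch of the classification behind such a breakdown, with made-up usage vectors:

```python
from collections import Counter

def classify(predicted_used, actually_used):
    """Label one word with the outcome category plotted in Fig. 15.
    true_neg words are the source of energy savings; false_neg words
    cost additional memory-system packets."""
    if predicted_used:
        return "true_pos" if actually_used else "false_pos"
    return "false_neg" if actually_used else "true_neg"

# Tally outcomes over one illustrative 8-word cache line.
preds  = [1, 1, 0, 0, 1, 0, 0, 1]
actual = [1, 0, 0, 1, 1, 0, 0, 1]
print(Counter(classify(p, a) for p, a in zip(preds, actual)))
# -> Counter({'true_pos': 3, 'true_neg': 3, 'false_pos': 1, 'false_neg': 1})
```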

Fig. 16 depicts the normalized energy consumption and normalized average memory access time for threshold values from 1 to 15. For this experiment, we use low-swing links; a similar trend in energy consumption and AMAT was found for full-swing links. Fig. 16a shows a modest downward trend in energy consumption as the threshold value increases, with the largest change between thresholds of 8 and 12. This is the expected outcome of the growth in true_neg predictions with higher threshold values in Fig. 15. In some benchmarks, such as "Black," "Fluid," and "X264," there is a slight increase at the highest threshold value. The main reason for this increase is the energy required to service the increased L1D misses, which overcomes the benefit of transmitting fewer words in the initial request.

Fig. 16b shows the normalized AMAT with varying threshold. In general, the latency shows a modest upward trend as the threshold grows. The higher the threshold, the more words are speculated as unused by the predictor, leading to increased L1 miss rates and degrading the overall memory system latency. This trend becomes dramatic at a threshold of 15, increasing AMAT by up to 23 percent for one application; we find that thresholds below 4 have a minimal negative impact on AMAT.

Given that our goal is to decrease energy with minimal impact on performance, we use the Energy × Delay² metric as a figure of merit for our design. Experimentally, we determined that Energy × Delay² is approximately equal across thresholds between 1 and 8; however, it increases considerably beyond a threshold of 12. This result validates our choice of a threshold value of 1 in our experiments. We find the performance impact with this bias is negligible: on average, with this threshold, the additional latency of each operation is ~0.6 percent. These results show that further energy savings could be achieved through improved predictor accuracy, which we leave to future work.
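For reference, this figure of merit is simply energy multiplied by the square of delay, so delay increases are penalized more heavily than energy savings are rewarded. A minimal sketch with hypothetical normalized values:

```python
def energy_delay_squared(energy, delay):
    """Energy x Delay^2: lower is better. Squaring the delay term
    weights performance loss more heavily than energy spent."""
    return energy * delay ** 2

# Illustrative comparison: a configuration that saves 10% energy but
# adds 6% delay scores slightly worse under ED^2 than the baseline.
base = energy_delay_squared(1.00, 1.00)
cand = energy_delay_squared(0.90, 1.06)
print(base, round(cand, 4))   # -> 1.0 1.0112
```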


Fig. 15. Breakdown of prediction outcomes.

Fig. 16. Normalized energy and AMAT for different threshold values.


5.3 Case Study: Comparison with Smaller Lines

Naïvely, one might conclude that the low cache block utilization shown in Fig. 1 indicates that the cache line size is in fact too large, and that utilization could be improved by implementing a smaller cache line size. To explore this concern, we evaluate our technique against a baseline with 32-byte cache lines.

Fig. 17 shows results for three different configurations: 1) the baseline design with 64-byte cache lines (64 base), 2) a baseline design with 32-byte lines (32 base), keeping the cache size and associativity the same as 64 base, and 3) 64-byte lines with our prediction and dynamic packet composition technique (64 pred). Fig. 17a shows the arithmetic average of block utilization for each configuration across the PARSEC benchmarks. Fig. 17b shows the geometric mean of AMAT, and Fig. 17c depicts the geometric mean of total energy consumption. These figures show that 32-byte lines have better utilization than the 64-byte baseline. Compared with 64 base, however, only a marginal energy reduction is achieved, at the cost of considerable performance loss. The smaller cache block size results in increased L1 misses and, thereby, increased memory access latency. 64 pred shows even greater block utilization than 32 base while maintaining the performance of 64 base and consuming much less energy than either. Hence, the proposed technique is a better design choice than shrinking cache lines to reduce power.

6 CONCLUSIONS

In this paper, we introduce a simple, yet powerful mechanism using spatial locality speculation to identify unused cache block words. We also propose a set of static and dynamic methods of packet composition, leveraging spatial locality speculation to reduce energy consumption in CMP interconnect. These techniques combine to reduce the dynamic energy of the NoC datapath through a reduction in the number of bit transitions, reducing the activity factor of the network.

Our results show that with only simple static packet encoding, requiring no change to typical NoC routers and very little overhead in the cache hierarchy, we achieve an average 17 percent reduction in the dynamic energy of the network when full-swing links are used. Our dynamic composition technique, requiring a small amount of logic overhead in the routers, enables deeper energy savings of 36 and 34 percent for full-swing and low-swing links, respectively.



Fig. 17. Comparison to a smaller cache line.



Hyungjun Kim received the BS degree from Yonsei University in 2002, and the MS degree in electrical engineering from the University of Southern California in 2008. He is working toward the PhD degree in the Department of Electrical and Computer Engineering at Texas A&M University. He is currently working with professor Paul V. Gratz. His research interests include low-power memory systems and on-chip interconnection networks.

Boris Grot received the PhD degree in computer science from The University of Texas at Austin. He is a postdoctoral researcher at EPFL, Switzerland. His research interests include processor architectures, memory systems, and interconnection networks for high-throughput, energy-aware computing. He is a member of the IEEE and the ACM.

Paul V. Gratz received the BS and MS degrees in electrical engineering from the University of Florida in 1994 and 1997, respectively, and the PhD degree in electrical and computer engineering from The University of Texas at Austin in 2008. He is an assistant professor in the Department of Electrical and Computer Engineering at Texas A&M University. His research interests include high-performance computer architecture, processor memory systems, and on-chip interconnection networks. He is a member of the IEEE and the ACM.

Daniel A. Jimenez received the PhD degree in computer sciences from The University of Texas at Austin. He is a professor and chair of the Department of Computer Science at The University of Texas at San Antonio. His research interests include microarchitecture and low-level compiler optimizations. He is a member of the IEEE and a senior member of the ACM.

