
102 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under Write-Through Policy

Jianwei Dai and Lei Wang, Senior Member, IEEE

Abstract—Many high-performance microprocessors employ the cache write-through policy for performance improvement while at the same time achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this paper, we propose a new cache architecture, referred to as way-tagged cache, to improve the energy efficiency of write-through caches. By maintaining the way tags of the L2 cache in the L1 cache during read operations, the proposed technique enables the L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation. Simulation results on the SPEC CPU2000 benchmarks demonstrate that the proposed technique achieves 65.4% energy savings in L2 caches on average with only 0.02% area overhead and no performance degradation. Similar results are also obtained under different L1 and L2 cache configurations. Furthermore, the idea of way tagging can be applied to existing low-power cache design techniques to further improve energy efficiency.

Index Terms—Cache, low power, write-through policy.

I. INTRODUCTION

MULTI-LEVEL on-chip cache systems have been widely adopted in high-performance microprocessors [1]–[3]. To keep data consistent throughout the memory hierarchy, write-through and write-back policies are commonly employed. Under the write-back policy, a modified cache block is copied back to its corresponding lower-level cache only when the block is about to be replaced. Under the write-through policy, by contrast, all copies of a cache block are updated immediately after the cache block is modified at the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their lifetime of execution. This feature is important as CMOS technology is scaled into the nanometer range, where soft errors have emerged as a major reliability issue in on-chip cache systems. It has been reported

Manuscript received March 28, 2011; revised July 15, 2011 and October 08, 2011; accepted December 07, 2011. Date of publication January 26, 2012; date of current version December 19, 2012. This work was supported by the National Science Foundation under Grant CNS-0954037, Grant CNS-1127084, and in part by the University of Connecticut Faculty Research Grant 443874.

J. Dai is with Intel Corporation, Hillsboro, OH 97124 USA (e-mail: jianwei.[email protected]).

L. Wang is with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2011.2181879

that single-event multi-bit upsets are getting worse in on-chip memories [7]–[9]. Currently, this problem has been addressed at different levels of the design abstraction. At the architecture level, an effective solution is to keep data consistent among different levels of the memory hierarchy to prevent the system from collapsing due to soft errors [10]–[12]. Benefiting from its immediate updates, the cache write-through policy is inherently tolerant to soft errors because the data at all related levels of the cache hierarchy are always kept consistent. Due to this feature, many high-performance microprocessor designs have adopted the write-through policy [13]–[15].

While enabling better tolerance to soft errors, the write-through policy also incurs large energy overhead. This is because under the write-through policy, caches at the lower level experience more accesses during write operations. Consider a two-level (i.e., Level-1 and Level-2) cache system for example. If the L1 data cache implements the write-back policy, a write hit in the L1 cache does not need to access the L2 cache. In contrast, if the L1 cache is write-through, then both L1 and L2 caches need to be accessed for every write operation. Obviously, the write-through policy incurs more write accesses in the L2 cache, which in turn increases the energy consumption of the cache system. Power dissipation is now considered one of the critical issues in cache design. Studies have shown that on-chip caches can consume about 50% of the total power in high-performance microprocessors [4]–[6].

In this paper, we propose a new cache architecture, referred to as way-tagged cache, to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy, where the L1 data cache is write-through and the L2 cache is inclusive for high performance. It is observed that all the data residing in the L1 cache will have copies in the L2 cache. In addition, the locations of these copies in the L2 cache will not change until they are evicted from the L2 cache. Thus, we can attach a tag to each way in the L2 cache and send this tag information to the L1 cache when the data is loaded to the L1 cache. By doing so, for all the data in the L1 cache, we will know exactly the locations (i.e., ways) of their copies in the L2 cache. During the subsequent accesses when there is a write hit in the L1 cache (which also initiates a write access to the L2 cache under the write-through policy), we can access the L2 cache in an equivalent direct-mapping manner because the way tag of the data copy in the L2 cache is available. As this operation accounts for the majority of L2 cache accesses in most applications, the energy consumption of the L2 cache can be reduced significantly.

The basic idea of way-tagged cache was initially proposed in our past work [26] with some preliminary results. In this paper,

1063-8210/$26.00 © 2012 IEEE


we extend this work by making the following contributions. First, a detailed VLSI architecture of the proposed way-tagged cache is developed, where various design issues regarding timing, control logic, operating mechanisms, and area overhead have been studied. Second, we demonstrate that the idea of way tagging can be extended to many existing low-power cache design techniques so that better tradeoffs of performance and energy efficiency can be achieved. Third, a detailed energy model is developed to quantify the effectiveness of the proposed technique. Finally, a comprehensive suite of simulations is performed with new results covering the effectiveness of the proposed technique under different cache configurations. It is also shown that the proposed technique can be integrated with existing low-power cache design techniques to further improve energy efficiency.

The rest of this paper is organized as follows. In Section II, we provide a review of related low-power cache design techniques. In Section III, we present the proposed way-tagged cache. In Section IV, we discuss the detailed VLSI architecture of the way-tagged cache. Section V extends the idea of way tagging to existing cache design techniques to further improve energy efficiency. An energy model is presented in Section VI to study the effectiveness of the proposed technique. Simulation results are given in Section VII.

II. RELATED WORK

Many techniques have been developed to reduce cache power dissipation. In this section, we briefly review some existing work related to the proposed technique.

In [16], Su et al. partitioned cache data arrays into subbanks. During each access, only the subbank containing the desired data is activated. Ghose et al. further divided cache bitlines into small segmentations [17]. When a memory cell is accessed, only the associated bitline segmentations are evaluated. By modifying the structure of cache systems, these techniques effectively reduce the energy per access without changing cache architectures. At the architecture level, most work focuses on set-associative caches due to their low miss rates. In conventional set-associative caches, all tag and data arrays are accessed simultaneously for performance consideration. This, however, comes at the cost of energy overhead. Many techniques have been proposed to reduce the energy consumption of set-associative caches. The basic idea is to activate fewer tag and data arrays during an access, so that cache power dissipation can be reduced. In the phased cache [18] proposed by Hasegawa et al., one cache access is divided into two phases. Cache tag arrays are accessed in the first phase, while in the second phase only the data array corresponding to the matched tag, if any, is accessed. Energy consumption can be reduced because at most one data array is accessed, as compared to all N data arrays in a conventional N-way set-associative cache. Due to the increase in access cycles, phased caches are usually employed in the lower-level memory to minimize the performance impact. Another technique, referred to as way concatenation, was proposed by Zhang et al. [19] to reduce the cache energy in embedded systems. With the necessary software support, this cache can be configured as direct-mapping, two-way, or four-way set-associative according to the system requirement. By accessing fewer tag and data arrays, better energy efficiency is attained. Although effective for embedded systems, this technique may not be suitable for high-performance general-purpose microprocessors due to the induced performance overhead. Other techniques include way-predicting set-associative caches, proposed by Inoue et al. [20]–[22], which make a prediction on the ways of both tag and data arrays in which the desired data might be located. If the prediction is correct, only one way is accessed to complete the operation; otherwise, the remaining ways of the cache are accessed to collect the desired data. Because of the improved energy efficiency, many way-prediction based techniques are employed in microprocessor designs. Another similar approach proposed by Min et al. [23] employs a redundant cache (referred to as location cache) to predict the incoming cache references. The location cache needs to be triggered for every operation in the L1 cache (including both read and write accesses), which wastes energy if the hit rate of the L1 cache is high.

Among the above related work, phased caches and way-predicting caches are commonly used in high-performance microprocessors. Compared with these techniques, the proposed way-tagged cache achieves better energy efficiency with no performance degradation. Specifically, the basic idea of way-predicting caches is to keep a small number of the most recently used (MRU) addresses and make a prediction based on these stored addresses. Since L2 caches are usually unified caches, the MRU-based prediction has a poor prediction rate [24], [25], and mispredictions introduce performance degradation. In addition, applying way prediction to L2 caches introduces large overheads in timing and area [23]. For phased caches, the energy consumption of accessing tag arrays accounts for a significant portion of total L2 cache energy. As shown in Section V, applying the proposed technique of way tagging can reduce this energy consumption. Section VII-D provides more details comparing the proposed technique with these related works.

III. WAY-TAGGED CACHE

In this section, we propose a way-tagged cache that exploits the way information in the L2 cache to improve energy efficiency. We consider a conventional set-associative cache system in which, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance consideration at the cost of energy overhead. In Section V, we will extend this technique to L2 caches with phased tag-data accesses.

Fig. 1 illustrates the architecture of the two-level cache. Only the L1 data cache and the L2 unified cache are shown, as the L1 instruction cache only reads from the L2 cache. Under the write-through policy, the L2 cache always maintains the most recent copy of the data. Thus, whenever a data item is updated in the L1 cache, the L2 cache is updated with the same data as well. This results in an increase in the write accesses to the L2 cache and consequently more energy consumption.

Here we examine some important properties of write-through caches through statistical characterization of cache accesses. Fig. 2 shows the simulation results of L2 cache accesses based


Fig. 1. Illustration of the conventional two-level cache architecture.

Fig. 2. Read and write accesses in the L2 cache running SPEC CPU2000 benchmarks.

on the SPEC CPU2000 benchmarks [30]. These results are obtained from SimpleScalar for the cache configuration given in Section VII-A. Unlike the L1 cache, where read operations account for a large portion of total memory accesses, write operations are dominant in the L2 cache for all but three benchmarks (galgel, ammp, and art). This is because read accesses in the L2 cache are initiated by the read misses in the L1 cache, which typically occur much less frequently (the miss rate is less than 5% on average [27]). For galgel, ammp, and art, L1 read miss rates are high, resulting in more read accesses than write accesses. Nevertheless, write accesses still account for about 20%–40% of the total accesses in the L2 cache. From the results in Section VII, each L2 read or write access consumes roughly the same amount of energy on average. Thus, reducing the energy consumption of L2 write accesses is an effective way for memory power management.

As explained in the introduction, the locations (i.e., way tags) of L1 data copies in the L2 cache will not change until the data are evicted from the L2 cache. The proposed way-tagged cache exploits this fact to reduce the number of ways accessed during L2 cache accesses. When the L1 data cache loads a data item from

[Online]. Available: http://www.simplescalar.com/

TABLE I
EQUIVALENT L2 ACCESS MODES UNDER DIFFERENT OPERATIONS IN THE L1 CACHE

the L2 cache, the way tag of the data in the L2 cache is also sent to the L1 cache and stored in a new set of way-tag arrays (see details of the implementation in Section IV). These way tags provide the key information for the subsequent write accesses to the L2 cache.

In general, both write and read accesses in the L1 cache may need to access the L2 cache. These accesses lead to different operations in the proposed way-tagged cache, as summarized in Table I. Under the write-through policy, all write operations of the L1 cache need to access the L2 cache. In the case of a write hit in the L1 cache, only one way in the L2 cache will be activated because the way-tag information of the L2 cache is available, i.e., from the way-tag arrays we can obtain the L2 way of the accessed data. For a write miss in the L1 cache, on the other hand, the requested data is not stored in the L1 cache. As a result, its corresponding L2 way information is not available in the way-tag arrays. Therefore, all ways in the L2 cache need to be activated simultaneously. Since a write hit/miss is not known a priori, the way-tag arrays need to be accessed simultaneously with all L1 write operations in order to avoid performance degradation. Note that the way-tag arrays are very small and the involved energy overhead can be easily compensated for (see Section VII). For L1 read operations, neither read hits nor misses need to access the way-tag arrays. This is because read hits do not need to access the L2 cache, while for read misses the corresponding way-tag information is not available in the way-tag arrays. As a result, all ways in the L2 cache are activated simultaneously under read misses.

As shown in Fig. 2, write accesses account for the majority of L2

cache accesses in most applications. In addition, write hits are dominant among all write operations. Therefore, by activating fewer ways in most of the L2 write accesses, the proposed way-tagged cache is very effective in reducing memory energy consumption.

Fig. 3 shows the system diagram of the proposed way-tagged cache. We introduce several new components: way-tag arrays, a way-tag buffer, a way decoder, and a way register, all shown within the dotted line. The way tags of each cache line in the L2 cache are maintained in the way-tag arrays, located with the L1 data cache. Note that write buffers are commonly employed in write-through caches (and even in many write-back caches) to improve performance. With a write buffer, the data to be written into the L1 cache is also sent to the write buffer. The operations stored in the write buffer are then sent to the L2 cache in sequence. This avoids write stalls when the processor waits for write operations to be completed in the L2 cache. In the proposed technique, we also need to send the way tags stored in the way-tag arrays to the L2 cache along with the operations in the write buffer. Thus, a small way-tag buffer is introduced to buffer the way tags read from the way-tag arrays. A way decoder is employed to decode way tags and generate the enable


Fig. 3. Proposed way-tagged cache.

signals for the L2 cache, which activate only the desired ways in the L2 cache. Each way in the L2 cache is encoded into a way tag. A way register stores the way tags and provides this information to the way-tag arrays.
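The access modes of Table I can be captured in a few lines. This is an illustrative sketch, not the authors' hardware: the function name and the 4-way default are our assumptions, and the returned list of way indices stands in for the way-enable signals.

```python
def l2_ways_to_activate(op, l1_hit, way_tag=None, n_l2_ways=4):
    """Which L2 ways must be activated for a given L1 operation (Table I)."""
    if op == "write":
        if l1_hit:
            # Write hit: the stored way tag pinpoints the L2 copy, so the
            # L2 cache behaves like a direct-mapped cache (one way active).
            return [way_tag]
        # Write miss: no way tag available; all ways are activated.
        return list(range(n_l2_ways))
    # Read hit: served entirely by the L1 cache; the L2 cache is not accessed.
    if l1_hit:
        return []
    # Read miss: way tag unknown; all ways are activated.
    return list(range(n_l2_ways))
```

For example, a write hit with way tag 2 activates only way 2 of a 4-way L2 cache, while a read miss activates all four ways.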

IV. IMPLEMENTATION OF WAY-TAGGED CACHE

In this section, we discuss the implementation of the proposed way-tagged cache.

A. Way-Tag Arrays

In the proposed way-tagged cache, each cache line in the L1 cache keeps its L2 way-tag information in the corresponding entry of the way-tag arrays, as shown in Fig. 4, where only one L1 data array and the associated way-tag array are shown for simplicity. When a data item is loaded from the L2 cache to the L1 cache, the way tag of the data is written into the way-tag array. At a later time, when this data is updated in the L1 data cache, the corresponding copy in the L2 cache needs to be updated as well under the write-through policy. The way tag stored in the way-tag array is read out and forwarded to the way-tag buffer (see Section IV-B) together with the data from the L1 data cache. Note that the data arrays in the L1 data cache and the way-tag arrays share the same address, as the mapping between the two is exclusive. The write/read signal of the way-tag arrays, WRITEH_W, is generated from the write/read signal of the data arrays in the L1 data cache, as shown in Fig. 4. A control signal referred to as UPDATE is obtained from the cache controller. When the write access to the L1 data cache is caused by an L1 cache miss, UPDATE is asserted and allows WRITEH_W to enable the write operation to the way-tag arrays (see Table II). If a STORE instruction accesses the L1 data cache, UPDATE remains deasserted and WRITEH_W indicates a read operation to the way-tag arrays. During the read operations of the L1 cache, the way-tag arrays do not need to be accessed and thus are deactivated to reduce energy overhead. To achieve this,

Fig. 4. Way-tag arrays.

TABLE II
OPERATIONS OF WAY-TAG ARRAYS

the wordline selection signals generated by the decoder are disabled by WRITEH through AND gates. The above operations are summarized in Table II.

Note that the proposed technique does not change the cache replacement policy. When a cache line is evicted from the L2 cache, the status of the cache line changes to “invalid” to avoid future fetching and thus prevent cache coherence issues. A read or write operation to this cache line will lead to a miss, which can be handled by the proposed way-tagged cache (see Section III). Since the way-tag arrays will be accessed only when a data item is written into the L1 data cache (either when the CPU updates a data item in the L1 data cache or when a data item is loaded from the L2 cache), they are not affected by cache misses.

It is important to minimize the overhead of the way-tag arrays.
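Before quantifying that overhead, the operations of Table II can be illustrated behaviorally. The signal names follow the text; the active-high interpretation and the returned operation labels are assumptions made for this sketch.

```python
def way_tag_array_op(l1_write, update, l1_read):
    """Decide the way-tag array operation for one L1 access (Table II).

    l1_write: the L1 data cache is being written.
    update:   the write was caused by an L1 miss (data loaded from the L2
              cache), so the incoming way tag must be stored.
    l1_read:  an ordinary L1 read; the way-tag arrays stay deactivated.
    """
    if l1_read:
        return "idle"      # arrays deactivated to save energy
    if l1_write and update:
        return "write"     # store the way tag arriving from the L2 cache
    if l1_write:
        return "read"      # STORE instruction: read the tag for the L2 write
    return "idle"
```

In words: the arrays are written only on L1 fills, read only on STOREs, and idle otherwise.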

The size of each way-tag array can be expressed as

size of each way-tag array = [S_L1 / (L_L1 × N_L1)] × log2(N_L2) bits    (1)

where S_L1, L_L1, and N_L1 are the size of the L1 data cache, the cache line size, and the number of ways in the L1 data cache, respectively. Each way in the L2 cache is represented by log2(N_L2) bits, assuming a binary code is applied. As shown in (1), the overhead increases linearly with the size of the L1 data cache S_L1 and sublinearly with the number of ways in the L2 cache N_L2. In addition, since log2(N_L2) is very small compared with the cache line size L_L1, the overhead accounts for a very small portion of the L1 data cache. Clearly, the proposed technique shows good scalability trends with the increasing sizes of L1 and L2 caches.

As an example, consider a two-level cache hierarchy where

the L1 data cache and instruction cache are each 16 kB 2-way set-associative with a cache line size of 32 B. The L2 cache is 4-way set-associative with 32 kB, and each cache line has 64 B. Thus, S_L1 = 16 kB, L_L1 = 32 B, N_L1 = 2, and N_L2 = 4. The size of each way-tag array is 16 K/(32 × 2) × log2(4) = 512 bits, and two way-tag arrays are needed for the L1 data cache. This introduces an overhead of only 1 K/128 K ≈ 0.8% of the L1 data cache, or 1 K/(128 K + 128 K + 256 K) ≈ 0.2% of the entire L1 and L2 caches (sizes in bits).

Fig. 5. Way-tag buffer.

To avoid performance degradation, the way-tag arrays are

operated in parallel with the L1 data cache. Due to their small size, their access delay is much smaller than that of the L1 cache. On the other hand, the way-tag arrays share the address lines with the L1 data cache, so the fan-out of the address lines will increase slightly. This effect can be well managed via careful floorplanning and layout during physical design. Thus, the way-tag arrays will not create new critical paths in the L1 cache. Note that accessing the way-tag arrays will also introduce a small amount of energy overhead. However, the energy savings achieved by the proposed technique can offset this overhead, as shown in Section VII.
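The overhead arithmetic follows directly from (1); a short script makes it explicit. The function and variable names are ours, chosen for the example, and the numbers reproduce the 16 kB/2-way configuration above.

```python
from math import log2

def way_tag_array_bits(s_l1_bytes, line_bytes, n_l1_ways, n_l2_ways):
    """Size of one way-tag array per (1): one log2(N_L2)-bit tag for
    every cache line held in one way of the L1 data cache."""
    lines_per_way = s_l1_bytes // (line_bytes * n_l1_ways)
    return lines_per_way * int(log2(n_l2_ways))

# The example from the text: 16 kB 2-way L1 (32 B lines), 4-way L2.
bits_per_array = way_tag_array_bits(16 * 1024, 32, 2, 4)  # 512 bits
total_bits = bits_per_array * 2          # two arrays, one per L1 way
l1_fraction = total_bits / (16 * 1024 * 8)  # share of the L1 data cache
```

This reproduces the 512 bits per array quoted above; the 1 Kb total is well under 1% of the 16 kB L1 data cache.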

B. Way-Tag Buffer

The way-tag buffer temporarily stores the way tags read from the way-tag arrays. The implementation of the way-tag buffer is shown in Fig. 5. It has the same number of entries as the write buffer of the L2 cache and shares its control signals. Each entry of the way-tag buffer holds one way tag, whose width equals the line size of the way-tag arrays, plus an additional status bit that indicates whether the operation in the current entry is a write miss on the L1 data cache. When a write miss occurs, all the ways in the L2 cache need to be activated, as the way information is not available. Otherwise, only the desired way is activated. The status bit is updated with the read operations of the way-tag arrays at the same clock cycle.

Similar to the write buffer of the L2 cache, the way-tag buffer has separate write and read logic in order to support parallel write and read operations. The write operations in the way-tag buffer always occur one clock cycle later than the corresponding write operations in the write buffer. This is because the write buffer, the L1 cache, and the way-tag arrays are all updated at the same clock cycle when a STORE instruction accesses the L1 data cache (see Fig. 4). Since the way tag to be sent to the way-tag buffer comes from the way-tag arrays, this tag will be written into the way-tag buffer one clock cycle later. Thus, the write

Fig. 6. Timing diagram of way-tag buffer.

signal of the way-tag buffer can be generated by delaying the write signal of the write buffer by one clock cycle, as shown in Fig. 5.

The proposed way-tagged cache needs to send each operation stored in the write buffer along with its way tag to the L2 cache. This requires sending the data in the write buffer and its way tag in the way-tag buffer at the same time. However, simply using the same read signal for both the write buffer and the way-tag buffer might cause write/read conflicts in the way-tag buffer. This problem is shown in Fig. 6. Assume that at one clock cycle an operation is stored into the write buffer while the way-tag buffer is empty. At the next clock cycle, a read signal is sent to the write buffer to retrieve the operation while its way tag is only just being written into the way-tag buffer. If the same read signal is used by the way-tag buffer, then read and write will target the same location of the way-tag buffer at the same time, causing a data hazard.

One way to fix this problem is to insert one cycle of delay into the write buffer. This, however, would introduce a performance penalty. In this paper, we propose to use a bypass multiplexer (MUX in Fig. 5) between the way-tag arrays and the L2 cache. If an operation in the write buffer is ready to be processed while the way-tag buffer is still empty, we bypass the way-tag buffer and send the way tag directly to the L2 cache. The EMPTY signal of the way-tag buffer is employed as the enable signal for read operations; i.e., when the way-tag buffer is empty, a read operation is not allowed. During normal operation, the write operation and the way tag are written into the write buffer and the way-tag buffer, respectively. Thus, when a write operation is ready to be sent to the L2 cache, the corresponding way tag is also available in the way-tag buffer, and both can be sent together, as shown in Fig. 6. With this bypass multiplexer, no performance overhead is incurred.
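The bypass behavior can be modeled in a few lines. This is a functional sketch under our own naming; the hardware makes the same choice with the buffer's EMPTY flag driving the MUX select.

```python
def select_way_tag(way_tag_buffer, direct_tag):
    """Bypass MUX of Fig. 5: when the write buffer issues an operation
    but the way-tag buffer is empty (its tag is still one cycle behind),
    forward the tag straight from the way-tag arrays; otherwise pop the
    oldest buffered tag so data and tag reach the L2 cache together."""
    if not way_tag_buffer:            # EMPTY asserted: take the bypass path
        return direct_tag
    return way_tag_buffer.pop(0)      # normal path: tag read from the buffer
```

For instance, `select_way_tag([], 3)` returns 3 via the bypass, whereas `select_way_tag([1, 2], 3)` returns the buffered tag 1.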

C. Way Decoder

The function of the way decoder is to decode way tags and activate only the desired ways in the L2 cache. As a binary code is employed, the line size of the way-tag arrays is log2 N bits, where N is the number of ways in the L2 cache. This minimizes the energy overhead from the additional wires, and the impact on chip area is negligible. For an L2 write access caused by a write hit in the L1 cache, the way decoder works as a log2 N-to-N decoder that selects just one way-enable signal. The technique proposed in [19] can be employed to utilize the way-enable signal to activate the corresponding way in the L2 cache. The way decoder


DAI AND WANG: ENERGY-EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION 107

Fig. 7. Implementation of the way decoder.

operates simultaneously with the decoders of the tag and data arrays in the L2 cache. For a write miss or a read miss in the L1 cache, we need to assert all way-enable signals so that all ways in the L2 cache are activated. To achieve this, the way decoder can be implemented by the circuit shown in Fig. 7. Two signals, read and write miss, determine the operation mode of the way decoder. Signal read will be “1” when a read access is sent to the L2 cache. Signal write miss will be “1” if the write operation accessing the L2 cache is caused by a write miss in the L1 cache.
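A behavioral sketch of this decoding logic follows (Python, with illustrative names; the paper's Fig. 7 is a gate-level circuit, not this code):

```python
def way_decode(way_tag_bits, read, write_miss, n_ways=4):
    """Decode a binary way tag into one-hot way-enable signals.

    On a read access or an L1 write miss, all ways are enabled because
    the destination way is unknown; on an L1 write hit, only the tagged
    way is enabled, giving a direct-mapped-style access.
    """
    if read or write_miss:
        return [1] * n_ways            # assert all way-enable signals
    way = int(way_tag_bits, 2)         # binary tag -> way index
    return [1 if i == way else 0 for i in range(n_ways)]

print(way_decode("10", read=False, write_miss=False))  # [0, 0, 1, 0]
print(way_decode("10", read=True,  write_miss=False))  # [1, 1, 1, 1]
```

The one-hot output corresponds to the way-enable signals that gate the tag and data arrays of each way.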

D. Way Register

The way register provides way tags for the way-tag arrays. For a 4-way L2 cache, labels “00”, “01”, “10”, and “11” are stored in the way register, each tagging one way in the L2 cache. When the L1 cache loads data from the L2 cache, the corresponding way tag in the way register is sent to the way-tag arrays.

With these new components, the proposed way-tagged cache

operates under different modes during read and write operations (see Table I). Only the way containing the desired data is activated in the L2 cache for a write hit in the L1 cache, making the L2 cache equivalent to a direct-mapped cache and reducing energy consumption without introducing performance overhead.

V. APPLICATION OF WAY TAGGING IN PHASED ACCESS CACHES

In this section, we will show that the idea of way tagging can be extended to other low-power cache design techniques such as the phased access cache [18]. Note that since the processor performance is less sensitive to the latency of L2 caches, many processors employ phased accesses of tag and data arrays in L2 caches to reduce energy consumption. By applying the idea of way tagging, further energy reduction can be achieved without introducing performance degradation.

In phased caches, all ways in the cache tag arrays need to be

activated to determine which way in the data arrays contains the desired data (as shown in the solid-line part of Fig. 8). In the past, the energy consumption of cache tag arrays has been ignored due to their relatively small sizes. Recently, Min et al. show that this energy consumption has become significant [33]. As high-performance microprocessors start to utilize longer addresses, cache tag arrays become larger. Also, high associativity

Fig. 8. Architecture of the WT-based phased access cache.

Fig. 9. Operation modes of the WT-based phased access cache.

is important for L2 caches in certain applications [34]. These factors lead to higher energy consumption in accessing cache tag arrays [35]. Therefore, it has become important to reduce the energy consumption of cache tag arrays.

The idea of way tagging can be applied to the tag arrays of

a phased access cache used as an L2 cache. Note that the tag arrays do not need to be accessed for a write hit in the L1 cache (as shown in the dotted-line part of Fig. 9). This is because the destination way of the data arrays can be determined directly from the output of the way decoder shown in Fig. 7. Thus, by accessing fewer ways in the cache tag arrays, the energy consumption of phased access caches can be further reduced.

Fig. 8 shows the architecture of the phased access L2 cache

with way-tagging (WT) enhancement. The operation of this cache is summarized in Fig. 9. Multiplexer M1 is employed to generate the enable signal for the tag arrays of the L2 cache. When the status bit in the way-tag buffer indicates a write hit, M1 outputs “0” to disable all the ways in the tag arrays. As mentioned before, the destination way of the access can be obtained from the way decoder, and thus no tag comparison is needed in this case. Multiplexer M2 chooses the output from the way decoder as the selection signal for the data arrays. If, on the other hand, the access is caused by a write miss or a read miss from the L1 cache, all ways are enabled by the tag array decoder, and the result of the tag comparison is selected by M2 as the selection signal for the data arrays. Overall, fewer ways in the tag arrays are activated, thereby reducing the energy consumption of the phased access cache.

Note that the phased access cache divides an access into two

phases; thus, M2 is not on the critical path. Applying way tagging does not introduce performance overhead in comparison with the conventional phased cache.
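The M1/M2 selection logic above can be summarized behaviorally. This Python sketch uses our own illustrative names and is not the paper's circuit:

```python
def phased_access_control(write_hit, way_from_decoder, way_from_tag_compare):
    """Sketch of the M1/M2 selection in the WT-based phased access cache.

    Returns (tag_arrays_enabled, data_way_select).
    """
    if write_hit:
        # M1 outputs "0": all ways of the tag arrays are disabled and no
        # tag comparison is performed; M2 selects the way decoder output
        # as the data-array way selector.
        return False, way_from_decoder
    # Write miss or read miss: tag arrays are enabled and M2 selects
    # the tag-comparison result instead.
    return True, way_from_tag_compare

print(phased_access_control(True, 2, None))   # (False, 2)
print(phased_access_control(False, None, 5))  # (True, 5)
```

The key saving is the `False` branch: on a write hit, no tag-array way is activated at all.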


VI. ENERGY MODEL

To study the effectiveness of the proposed way-tagged cache, we utilize an energy model that describes the major components of cache energy consumption. In general, a cache system consists of address decoders and data multiplexers, which are shared by all ways. Each way contains several components such as tag arrays, data arrays, precharging circuits, way comparators, and sense amplifiers. Thus, the energy consumption per access of a conventional N-way set-associative L2 cache can be expressed as

E_con = E_dec + E_mux + N · E_way    (2)

where E_dec, E_mux, and E_way denote the energy consumption of the address decoders, the data multiplexers, and one way in the cache, respectively, and N is the number of ways. Note that in the conventional L2 cache, all ways are activated during each access.

Given the number of accesses N_acc, the total energy consumption can be determined as

E_con,total = N_acc · E_con.    (3)

Different from conventional caches, the proposed way-tagged cache activates different components depending on the type of cache access. As shown in Table I, if the access is caused by a read miss or a write miss in the L1 cache, the L2 cache works as a conventional cache, whose energy consumption can be obtained from (2). On the other hand, if the access is caused by a write hit in the L1 cache, only one way in the L2 cache is activated, and the energy consumption is given by

E_hit = E_dec + E_mux + E_way.    (4)

Assuming the numbers of read misses, write misses, and write hits in the L1 cache are N_rmiss, N_wmiss, and N_whit, respectively, the energy consumption of the proposed way-tagged L2 cache can be expressed as

E_proposed = (N_rmiss + N_wmiss) · E_con + N_whit · E_hit + E_overhead    (5)

where

E_overhead = E_wt + E_others    (6)

E_wt = N_write · E_wt_rd + N_read · E_wt_wr.    (7)

Note that read hits in the L1 cache do not need to access the L2 cache, and thus they are not included in (5). The energy overheads introduced by accessing (read and write) the way-tag arrays and by the other components (including the bypass multiplexer, way-tag buffer, way decoder, and way register) are denoted as E_wt_rd, E_wt_wr, and E_others, respectively. N_write and N_read are the numbers of write and read accesses, respectively, to the L2 cache.

Since the proposed way-tagged cache does not affect the

cache miss rate, the energy consumption related to cache misses, such as replacement, off-chip memory accesses, and microprocessor stalls, will be the same as that of the conventional cache. Therefore, we do not include these components in (5). Note that the energy components in (5) represent the switching power. Leakage power reduction is an important topic for our future study.

We define the efficiency of the proposed way-tagged cache as

η = (E_con,total − E_proposed) / E_con,total.    (8)

Substituting (2)–(7) into (8), we obtain

η = [r · (N − 1) · E_way − E_overhead / N_acc] / (E_dec + E_mux + N · E_way)    (9)

where

r = N_whit / N_acc.    (10)

From (9), it is clear that the efficiency of the proposed way-tagged cache is affected by a number of factors, such as the number of ways in the L2 cache and the configuration of the L1 cache [e.g., its size and number of ways, which affect the write hit ratio in (9)]. The impact of these factors will be evaluated in the next section. Note that in this paper we choose to evaluate the energy efficiency at the cache level, as the proposed technique focuses exclusively on cache energy reduction. Smaller energy savings will be expected at the processor level because the L2 cache only consumes a portion of the total energy, e.g., around 12% in the Pentium Pro CPU [31]. Reducing the total power of a processor that consists of different components (ALU, memory, busses, I/O, etc.) is an important research topic but is beyond the scope of this work.
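The energy model above can be exercised numerically. In the Python sketch below, all energy values and access counts are invented placeholders for illustration, not measurements from the paper:

```python
def efficiency(n_ways, e_dec, e_mux, e_way, n_acc, n_whit, e_overhead):
    """Energy reduction of the way-tagged cache relative to a
    conventional set-associative cache, following (2)-(5) and (8).

    All numeric inputs are arbitrary placeholders for illustration.
    """
    e_con = e_dec + e_mux + n_ways * e_way      # (2): per access, conventional
    e_hit = e_dec + e_mux + e_way               # (4): per access, L1 write hit
    n_miss = n_acc - n_whit                     # L1 read misses + write misses
    e_proposed = n_miss * e_con + n_whit * e_hit + e_overhead   # (5)
    e_conv_total = n_acc * e_con                # (3)
    return (e_conv_total - e_proposed) / e_conv_total           # (8)

# Example: 8-way L2, write hits dominating L2 accesses, overhead neglected.
eta = efficiency(n_ways=8, e_dec=1.0, e_mux=1.0, e_way=10.0,
                 n_acc=1000, n_whit=900, e_overhead=0.0)
print(round(eta, 3))  # -> 0.768
```

As the model predicts, savings grow with both the write-hit share of L2 accesses and the number of L2 ways.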

VII. EVALUATION AND DISCUSSION

In this section, we evaluate the proposed technique by comparing its energy savings, area overhead, and performance with existing cache design techniques.

A. Simulation Setup

We consider separate L1 instruction and data caches, both 16 kB 4-way set-associative with a cache line size of 64 B. The L2 cache is a unified 8-way set-associative cache with a size of 512 kB and a cache line size of 128 B. There are eight banks in the L2 cache. The L1 data cache utilizes the write-through policy, and the L2 cache is inclusive. This cache configuration, used in the Pentium 4 [23], will serve as the baseline system for comparison with the proposed technique under different cache configurations.
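As a rough cross-check of the way-tag array overhead reported later, the added storage can be computed from this baseline configuration. The sketch below (Python; the function name and the one-tag-per-L1-line assumption are ours) keeps one log2(N)-bit tag per L1 data-cache line, with N the number of L2 ways:

```python
import math

def waytag_overhead(l1_size, l1_line, l2_ways, l2_size):
    """Storage added by the way-tag arrays as a fraction of the L2 cache.

    One way tag of ceil(log2(l2_ways)) bits is stored per L1 data-cache
    line; the L2 size is converted to bits for the ratio.
    """
    l1_lines = l1_size // l1_line                  # number of L1 cache lines
    tag_bits = math.ceil(math.log2(l2_ways))       # bits per way tag
    waytag_bits = l1_lines * tag_bits              # total way-tag storage
    return waytag_bits / (l2_size * 8)

# Baseline: 16 kB L1 data cache, 64 B lines; 8-way, 512 kB L2.
frac = waytag_overhead(16 * 1024, 64, 8, 512 * 1024)
print(f"{frac:.5%}")  # prints "0.01831%"
```

This back-of-the-envelope figure is consistent with the roughly 0.02% area overhead cited in the text.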


TABLE III
ENERGY CONSUMPTION PER READ AND WRITE ACCESS OF THE CONVENTIONAL SET-ASSOCIATIVE L2 CACHE AND THE PROPOSED L2 CACHE

In these simulations, Simplescalar is employed to obtain the cache access statistics and performance. The energy consumption is estimated by CACTI 5.3 for a 90-nm CMOS process. All the simulations are based on the SPEC CPU2000 benchmarks collected from the stream-based trace compression (SBC) [32], where trace files of 23 benchmarks are available. All the benchmarks were simulated for at least two billion memory references.

B. Results of Baseline Cache Configuration

1) Energy Efficiency: Table III compares the average energy consumption of a read access and a write access in the conventional 8-way set-associative L2 cache and the proposed way-tagged L2 cache. Due to fewer activated ways, the average energy consumption of the proposed way-tagged L2 cache during a write access under a write hit is only about 12.9% of that of the conventional L2 cache. Since the way-tag arrays are very small, they introduce only 0.01% energy overhead per read and write access. The energy overheads due to the way-tag buffer, bypass multiplexer, way decoder, and way register are much smaller and thus are not shown in Table III.

Based on the cache access statistics obtained from Simplescalar, we estimate the numbers of L1 read misses, write misses, and write hits, as well as the numbers of L2 write and read accesses, used in (5). Employing (2)–(8), we can determine the energy efficiency of the proposed way-tagged cache. Fig. 10 shows that the energy reduction achieved by the proposed technique ranges from 83.4% (mesa) to 13.1% (ammp) as compared to the conventional L2 cache. On average, the proposed technique reduces the L2 energy consumption by 65.4%, or equivalently 7.5% of the total processor power if applied to the Pentium Pro CPU [31], where the L2 cache consumes about 12% of the total processor power. These results demonstrate that by reducing unnecessary way accesses in the L2 cache, our technique is very effective in reducing L2 cache power consumption. It is noted that the energy reduction achieved across the different applications is not uniform. This is because different applications have different write hit rates, which affect the write hit ratio in (9) and in turn the energy savings.

2) Area Overhead and Performance Impact: The area overhead of the proposed technique comes mainly from four components: the way-tag arrays, way-tag buffer, bypass multiplexer, and way decoder. As discussed in Section IV, these components are very small. For example, the size of the way-tag arrays, the largest component, is only about 0.02% of the whole cache system. Also, only three additional wires are introduced for way

CACTI is available online at http://www.hpl.hp.com/research/cacti/

Fig. 10. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache.

tag delivery between the L1 and L2 caches. Thus, we expect that the area overhead can be easily accommodated.

The proposed way-tagged cache does not affect the hit

rate, i.e., there is no performance degradation, as it does not change the cache placement policy. Furthermore, the way-tag arrays, way-tag buffer, and way decoder operate in parallel with the L1 data cache, the write buffer, and the decoders of the tag and data arrays in the L2 cache, respectively. Due to their small sizes, their access delay can be fully covered by the delay of the L1 data cache, i.e., no new critical paths are created. As a result, the proposed technique does not introduce any performance overhead at the architecture and circuit levels.

C. Energy Reduction Under Different Cache Configurations

As discussed in Section VI, the effectiveness of the proposed technique varies with the configurations of the L1 and L2 caches. In this subsection, we will study this effect by assessing the energy reduction achieved under different cache configurations in terms of the associativity and the sizes of the L1 and L2 caches.

Fig. 11 shows the energy reduction of the proposed technique

for a 4-way set-associative L1 cache with cache sizes of 8, 16, and 32 kB, while the size of the L2 cache is 512 kB. The block size in these L1 caches is 64 B, while that of the L2 cache is 128 B. A larger energy reduction in the L2 cache is observed as the size of the L1 cache increases. This is because increasing the size of the L1 cache decreases the miss rate, which leads to a larger number of write hits. This enables a larger energy reduction according to (9). We also performed simulations varying the size of the L2 cache from 256 to 1024 kB. Since the proposed technique does not target misses in the L2 cache, changing the L2 cache size has little effect on the relative energy reduction (i.e., both energy consumption and energy savings change proportionally).

Fig. 12 shows the energy reduction under the 16 kB L1 cache

and 512 kB L2 cache, where the number of ways in the L1 cache varies from 2 and 4 to 8. The L2 cache is 8-way set-associative. It is shown that the proposed technique becomes more effective as the associativity of the L1 cache increases. This comes from the fact that a higher associativity in general results in a smaller miss rate, which enables better energy efficiency in the L2 cache (i.e., more write hits and thus fewer accessed ways in the L2 cache). A similar trend can also be found in Fig. 13, which demonstrates


Fig. 11. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L1 sizes.

Fig. 12. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L1 set associativity.

Fig. 13. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L2 set associativity.

the effectiveness of the proposed technique for a 512 kB L2 cache with the associativity ranging from 4 and 8 to 16, while the 16 kB L1 cache is 4-way set-associative. As the number of ways in the L2 cache increases, the corresponding factor in (9) becomes larger and thus enables better energy efficiency. In other words, each time there is a write hit in the L1 cache, a smaller fraction of the L2 cache is activated as the number of ways in the L2 cache increases.

Fig. 14. Comparison of the MRU-based way-predicting cache and the proposed cache.

Fig. 15. Energy reduction of the WT-based phased access L2 cache compared with the conventional phased access L2 cache.

D. Comparison With Existing Low-Power Cache Techniques

In this subsection, we compare the proposed way-tagged cache with two existing low-power cache design techniques: the phased access cache and the MRU-based way-predicting cache.

Figs. 14 and 15 show the energy reduction achieved by the

three techniques for a 256 kB 16-way set-associative L2 cache. It is shown that the proposed way-tagged cache is much more effective than the way-predicting cache in energy reduction. Specifically, our technique achieves 32.6% more energy reduction on average. This is because the L2 cache is a unified cache, which in general leads to a poor prediction rate in the way-predicting cache. To compare with the phased cache, we employ the proposed WT-based phased access cache (see Section V), as it has the same number of access cycles as the phased cache. As shown in Fig. 15, the proposed technique achieves energy savings ranging from 45.1% to 8.1%, with an average of 34.9%, for the whole L2 cache at the same performance level. These results indicate that the energy consumption of the tag arrays accounts for a significant portion of the total L2 cache energy. Thus, applying the technique of way tagging in the phased cache is quite effective.

We also study the performance of these three cache design

techniques. As discussed before, the proposed way-tagged cache in Section III has no performance degradation compared with the conventional set-associative L2 cache with simultaneous tag-data accesses. Using Simplescalar, we observed that the


performance degradation of the phased cache is very small for most applications, below 0.5% in terms of instructions per cycle (IPC). This is expected, as L2 cache latency is not very critical to processor performance. However, nontrivial performance degradation was observed in some applications. For example, benchmark perlbmk sees a 3.7% decrease in IPC, while the IPC of gzip decreases by 1.7%. The performance degradation may be more significant for other applications that are sensitive to L2 cache latency, such as TPC-C, as indicated in [29]. As a result, L2 caches with simultaneous tag-data accesses are still preferred in some high-performance microprocessors [23], [28]. A similar trend was also observed in the way-predicting cache.

VIII. CONCLUSION

This paper presents a new energy-efficient cache technique for high-performance microprocessors employing the write-through policy. The proposed technique attaches a tag to each way in the L2 cache. This way tag is sent to the way-tag arrays in the L1 cache when the data is loaded from the L2 cache into the L1 cache. Utilizing the way tags stored in the way-tag arrays, the L2 cache can be accessed as a direct-mapped cache during subsequent write hits, thereby reducing cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal area overhead and no performance degradation. Furthermore, the idea of way tagging can be applied to many existing low-power cache techniques, such as the phased access cache, to further reduce cache energy consumption. Future work is being directed towards extending this technique to other levels of the cache hierarchy and reducing the energy consumption of other cache operations.

REFERENCES

[1] G. Konstadinidis, K. Normoyle, S. Wong, S. Bhutani, H. Stuimer, T. Johnson, A. Smith, D. Cheung, F. Romano, S. Yu, S. Oh, V. Melamed, S. Narayanan, D. Bunsey, C. Khieu, K. J. Wu, R. Schmitt, A. Dumlao, M. Sutera, J. Chau, and K. J. Lin, “Implementation of a third-generation 1.1-GHz 64-bit microprocessor,” IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1461–1469, Nov. 2002.

[2] S. Rusu, J. Stinson, S. Tam, J. Leung, H. Muljono, and B. Cherkauer, “A 1.5-GHz 130-nm Itanium 2 processor with 6-MB on-die L3 cache,” IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1887–1895, Nov. 2003.

[3] D. Wendell, J. Lin, P. Kaushik, S. Seshadri, A. Wang, V. Sundararaman, P. Wang, H. McIntyre, S. Kim, W. Hsu, H. Park, G. Levinsky, J. Lu, M. Chirania, R. Heald, and P. Lazar, “A 4 MB on-chip L2 cache for a 90 nm 1.6 GHz 64 bit SPARC microprocessor,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2004, pp. 66–67.

[4] S. Segars, “Low power design techniques for microprocessors,” in Proc. Int. Solid-State Circuits Conf. Tutorial, 2001, pp. 268–273.

[5] A. Malik, B. Moyer, and D. Cermak, “A low power unified cache architecture providing power and performance flexibility,” in Proc. Int. Symp. Low Power Electron. Design, 2000, pp. 241–243.

[6] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A framework for architectural-level power analysis and optimizations,” in Proc. Int. Symp. Comput. Arch., 2000, pp. 83–94.

[7] J. Maiz, S. Hareland, K. Zhang, and P. Armstrong, “Characterization of multi-bit soft error events in advanced SRAMs,” in Proc. Int. Electron Devices Meeting, 2003, pp. 21.4.1–21.4.4.

[8] K. Osada, K. Yamaguchi, and Y. Saitoh, “SRAM immunity to cosmic-ray-induced multierrors based on analysis of an induced parasitic bipolar effect,” IEEE J. Solid-State Circuits, pp. 827–833, 2004.

[9] F. X. Ruckerbauer and G. Georgakos, “Soft error rates in 65 nm SRAMs: Analysis of new phenomena,” in Proc. IEEE Int. On-Line Test. Symp., 2007, pp. 203–204.

[10] G. H. Asadi, V. Sridharan, M. B. Tahoori, and D. Kaeli, “Balancing performance and reliability in the memory hierarchy,” in Proc. Int. Symp. Perform. Anal. Syst. Softw., 2005, pp. 269–279.

[11] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft error and energy consumption interactions: A data cache perspective,” in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 132–137.

[12] X. Vera, J. Abella, A. Gonzalez, and R. Ronen, “Reducing soft error vulnerability of data caches,” presented at the Workshop System Effects Logic Soft Errors, Austin, TX, 2007.

[13] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way multithreaded Sparc processor,” IEEE Micro, vol. 25, no. 2, pp. 21–29, Mar. 2005.

[14] J. Mitchell, D. Henderson, and G. Ahrens, “IBM POWER5 processor-based servers: A highly available design for business-critical applications,” IBM, Armonk, NY, White Paper, 2005. [Online]. Available: http://www03.ibm.com/systems/p/hardware/whitepapers/power5_ras.pdf

[15] N. Quach, “High availability and reliability in the Itanium processor,” IEEE Micro, pp. 61–69, 2000.

[16] C. Su and A. Despain, “Cache design tradeoffs for power and performance optimization: A case study,” in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63–68.

[17] K. Ghose and M. B. Kamble, “Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation,” in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70–75.

[18] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and P. Biswas, “SH3: High code density, low power,” IEEE Micro, vol. 15, no. 6, pp. 11–19, Dec. 1995.

[19] C. Zhang, F. Vahid, and W. Najjar, “A highly-configurable cache architecture for embedded systems,” in Proc. Int. Symp. Comput. Arch., 2003, pp. 136–146.

[20] K. Inoue, T. Ishihara, and K. Murakami, “Way-predicting set-associative cache for high performance and low energy consumption,” in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.

[21] A. Ma, M. Zhang, and K. Asanovic, “Way memoization to reduce fetch energy in instruction caches,” in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.

[22] T. Ishihara and F. Fallah, “A way memoization technique for reducing power consumption of caches in application specific integrated processors,” in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.

[23] R. Min, W. Jone, and Y. Hu, “Location cache: A low-power L2 cache system,” in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120–125.

[24] B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative cache,” in Proc. 2nd IEEE Symp. High-Perform. Comput. Arch., 1996, pp. 244–254.

[25] T. N. Vijaykumar, “Reactive-associative caches,” in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2011, p. 4961.

[26] J. Dai and L. Wang, “Way-tagged cache: An energy efficient L2 cache architecture under write through policy,” in Proc. Int. Symp. Low Power Electron. Design, 2009, pp. 159–164.

[27] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed. New York: Elsevier Science & Technology Books, 2006.

[28] B. Brock and M. Exerman, “Cache Latencies of the PowerPC MPC7451,” Freescale Semiconductor, Austin, TX, 2006. [Online]. Available: cache.freescale.com

[29] T. Lyon, E. DeLano, C. McNairy, and D. Mulla, “Data cache design considerations for Itanium 2 processor,” in Proc. IEEE Int. Conf. Comput. Design, 2002, pp. 356–362.

[30] Standard Performance Evaluation Corporation, Gainesville, VA, “SPEC CPU2000,” 2006. [Online]. Available: http://www.spec.org/cpu

[31] “Pentium Pro Family Developer’s Manual,” Intel, Santa Clara, CA, 1996.

[32] A. Milenkovic and M. Milenkovic, “Exploiting streams in instruction and data address trace compression,” in Proc. IEEE 6th Annu. Workshop Workload Characterization, 2003, pp. 99–107.

[33] R. Min, W. Jone, and Y. Hu, “Phased tag cache: An efficient low power cache system,” in Proc. Int. Symp. Circuits Syst., 2004, pp. 23–26.


[34] M. K. Qureshi, D. Thompson, and Y. N. Patt, “The V-way cache: Demand based associativity via global replacement,” in Proc. Int. Symp. Comput. Arch., 2005, pp. 544–555.

[35] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Englewood Cliffs, NJ: Prentice-Hall, 1996.

Jianwei Dai received the B.S. degree from Beijing University of Chemical Technology, Beijing, China, in 2002, the M.Eng. degree from Beihang University, Beijing, China, in 2005, and the Ph.D. degree from the University of Connecticut, Storrs, in 2011.

Currently, he is with Intel Corporation, Hillsboro,

OR, where he is participating in designing next-generation processors. His research interests include low-power VLSI design, error- and reliability-centric statistical modeling for emerging technology, and nano-computing.

Lei Wang (M’01–SM’11) received the B.Eng. degree and the M.Eng. degree from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, and the Ph.D. degree from the University of Illinois at Urbana-Champaign in 2001.

During the summer of 1999, he worked with

Microprocessor Research Laboratories, Intel Corporation, Hillsboro, OR, where his work involved the development of high-speed and noise-tolerant VLSI circuits and design methodologies. From December 2001 to July 2004, he was with Microprocessor

Technology Laboratories, Hewlett-Packard Company, Fort Collins, CO, where he participated in the design of the first dual-core multi-threaded Itanium Architecture Processor, a joint project between Intel and Hewlett-Packard. Since August 2004, he has been with the Department of Electrical and Computer Engineering, University of Connecticut, where he is presently an Associate Professor.

Dr. Wang was a recipient of the National Science Foundation CAREER

Award in 2010. He is a member of the IEEE Signal Processing Society Technical Committee on Design and Implementation of Signal Processing Systems. He currently serves as an Associate Editor for the IEEE TRANSACTIONS ON COMPUTERS. He has served on the Technical Program Committees of various international conferences.