Implementing Branch Predictor Decay Using Quasi-Static Memory Cells
Philo Juang, Kevin Skadron, Margaret Martonosi, Zhigang Hu, Douglas W. Clark, Philip W. Diodato, Stefanos Kaxiras
Departments of Electrical Engineering and Computer Science, Princeton University; Department of Computer Science, University of Virginia; IBM T.J. Watson Research Center; Agere Systems; University of Patras
1. INTRODUCTION
Power and energy consumption are a major concern in general-purpose and high-performance processors. Power dissipation can be separated into two parts. Dynamic power dissipation is due to switching activity in the processors and, in ideal devices, is the only source of dissipation, as current only flows during switching. Static or leakage power is consumed regardless of activity, especially through subthreshold leakage currents [Kesharvarzi et al. 1997] that flow even when the transistor is nominally off.
As progress in fabrication processes has led to steady reductions in supply voltages, threshold voltages are being lowered to the point where leakage has become an important and growing fraction of total power dissipation in high-performance CMOS CPUs. If it is not addressed through fabrication or circuit-level changes, some forecasts predict as much as a five-fold increase in leakage energy per technology generation [Semiconductor Industry Association 2001]. At such rates, the current leakage component, roughly 5% of total chip power now, would balloon to 50% or more in just a few generations.
In response to this trend, researchers have proposed circuit- and architecture-level mechanisms for managing leakage energy [J. A. Butts and G. Sohi 2000; Kaxiras et al. 2001; Yang et al. 2001; Flautner et al. 2002]. Using supply-gating techniques, Kaxiras et al. [Kaxiras et al. 2001] showed that turning cache lines off if they have not been used in a long time can be effective in reducing leakage energy with little performance impact. Li et al. [Li et al. 2004] examined several circuit-level mechanisms for leakage control, showing that in cases such as fast on-chip caches, non-state-preserving techniques such as supply gating are superior to drowsy caches. After caches, branch predictors are among the largest and most power-consuming array structures in current CPUs. Current predictors and BTBs are 4–8 Kbytes in size, already the size of a small cache. They account for 7–10% of the processor's total dynamic power dissipation [Parikh et al. 2002].
However, as pipelines grow deeper and the demand on the branch predictor increases, branch predictors may increase in size faster than caches, raising their contribution to the chip's total energy budget. For example, in the proposed Alpha EV8 [Seznec et al. 2002], the branch predictor was 352 Kbits (44 KBytes) in size, over 12 times that of the 21264. In addition, Jimenez et al. [Jimenez et al. 2000] pointed out that hierarchical predictors can avoid cycle-time constraints and that large second-level predictors can give substantial increases in prediction accuracy, resulting in predictors that could be as large and have the same substantial leakage as first-level caches. The increased popularity of tournament predictors would rapidly increase the size of the branch predictor as well. As leakage energy consumption increasingly becomes a concern relative to dynamic energy consumption, how to conserve leakage energy in a large array structure such as the branch predictor becomes a significant problem, and one that will likely grow in importance in the future.
For questions or comments: [email protected]
ACM Transactions on Architecture and Computer Organization, Vol. V, No. N, Month 20YY, Pages 1–0??.
As decay techniques for caches have proven effective, branch predictors are a natural next step. However, there are several important distinctions. First, it is less obvious when a branch predictor entry may be considered "dead" and can therefore be turned off with little performance impact, because many branches may map to the same predictor entry. Since this sharing is sometimes beneficial, notions of cache conflicts and eviction do not translate directly into the branch prediction world.
Second, a branch predictor entry is not simply valid or invalid, as in a cache. A branch predictor entry may have reached the "strongly not taken" state due to the effects of several different branches and may be useful to the next branch that accesses it, even if this branch has never been executed before.
Third, branch predictor entries are too small to deactivate individually, so one must consider some larger collection, such as a row of predictor entries in the square array in which the predictor is likely implemented. The challenge here is that unlike the grouping of data into a cache line, the grouping of branch predictor entries in a row is not something for which application programmers and systems builders have a sense of spatial locality. This paper evaluates design options related to these questions.
Last, branch predictors can be logically organized in a number of ways, from simple bimodal branch predictors [Smith 1981], which keep one two-bit counter per predictor entry, to multi-table predictors like hybrid predictors [McFarling 1993], which operate several prediction structures in parallel. For example, hybrid predictors may encounter instances when one of the predictor components has been deactivated but the other has not. The chooser might be designed to pick the still-active component in such situations. For other branches, the chooser may exhibit a strong bias for one predictor component over the other. In this case, predictor entries that are not being selected might be deactivated.
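The chooser behavior suggested above can be illustrated with a small sketch. This is not the policy evaluated in the paper; the function name `hybrid_predict`, the use of `None` to mean "this component's row is decayed", and the not-taken default are all assumptions made for illustration.

```python
from typing import Optional


def hybrid_predict(global_pred: Optional[bool],
                   local_pred: Optional[bool],
                   chooser_prefers_global: bool,
                   default_taken: bool = False) -> bool:
    """Pick a prediction from two components, either of which may be decayed.

    A component prediction of None means its row is currently deactivated
    (an illustrative convention, not taken from the paper).
    """
    if global_pred is None and local_pred is None:
        # Both components decayed: fall back to a static default.
        return default_taken
    if global_pred is None:
        # Only the local component is still active, so use it.
        return local_pred
    if local_pred is None:
        # Only the global component is still active, so use it.
        return global_pred
    # Both active: defer to the chooser as in an ordinary hybrid predictor.
    return global_pred if chooser_prefers_global else local_pred
```

With this convention, the chooser's learned preference matters only when both components are active; otherwise the surviving component wins.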
Traditionally, state-preserving structures have been designed with six-transistor (6T) SRAM cells; unlike DRAM cells, SRAM cells retain their value for as long as they are powered. Thus, caches need not have built-in circuitry to refresh the cells as in main memory. With this in mind, decay and similar techniques focus on shutting down these cells, for example by gating either the source or drain voltage with a signal that is enabled whenever the cell needs to be turned off. By attaching decay counters to each cell, one can detect if the cell has gone "long enough" without activity. If this is the case, the cell is assumed to be unused and can be deactivated.
Traditional techniques of deactivating cells involve attaching a gate transistor to the supply voltage Vdd; to deactivate a cell, the gate transistor disconnects the supply from the rest of the cell, and thus the cell is effectively turned off [Yang et al. 2001]. Gated-Vdd based cache decay techniques have a few problems, however. First, attaching a gate transistor to each cache line might have an adverse impact on the pitch of the cache line, increasing the overall size of the cache. In addition, the counters themselves consume dynamic and static power, which must be taken into account as well. One cannot assume that powering a transistor up or down comes for free; instead, there is some dynamic cost to the process, roughly of the same order as a cache line access.
An alternative to traditional decay is to use a quasi-static, 4-transistor (4T) memory cell. 4T cells are approximately as fast as 6T SRAM cells, but do not have connections to the supply voltage (Vdd). Rather, the 4T cell is charged upon each access, whether read or write, and slowly leaks the charge over time. When the charge in the cell has been depleted, the value stored is lost. Thus, they may be referred to as quasi-static; like conventional DRAM, they leak their charge, but do not require a full read-modify-write operation to accomplish a cell refresh.
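The behavior of a quasi-static cell described above can be captured in a toy behavioral model. This is only a sketch: the class name `QuasiStaticCell` and the single `retention_cycles` parameter are hypothetical, and a real cell's hold time depends on process and temperature rather than on a fixed cycle count.

```python
class QuasiStaticCell:
    """Toy model of a 4T cell: any access recharges it, idleness loses the value."""

    def __init__(self, retention_cycles: int):
        # Hypothetical hold time, in cycles, after the most recent access.
        self.retention_cycles = retention_cycles
        self.value = None
        self.last_access = None

    def write(self, value: int, now: int) -> None:
        # A write stores the value and fully charges the cell.
        self.value = value
        self.last_access = now

    def read(self, now: int):
        # If the charge has leaked away, the stored value is gone.
        if self.last_access is None or now - self.last_access > self.retention_cycles:
            self.value = None
            self.last_access = None
            return None
        # A read also recharges the cell, so no separate refresh step is needed.
        self.last_access = now
        return self.value
```

For example, with `retention_cycles=100`, a cell written at cycle 0 and read at cycle 50 is still readable at cycle 150, but a cell left untouched from cycle 150 to cycle 300 has lost its value.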
4T cells typically appear in embedded processors, where low power consumption is the most important aspect. 4T cells have the speed of SRAM but in terms of power behave more like DRAM. They have been largely overlooked in general-purpose processors because without periodic refresh they are not truly static. However, in decay techniques, where we exploit the transient nature of short-term storage by turning cells on and off depending on usage, one can take advantage of 4T cells' dynamic nature. Decaying a 6T cell seems somewhat like squashing its greatest asset: that it retains its value for as long as it is powered up. Implementing decay with 4T cells is therefore convenient; because a 4T cell decays naturally over time if not accessed, this distinctive behavior fits well with the transient behavior of elements in a cache or branch predictor.
Decay with quasi-static cells has advantages over conventional decay in several respects. First, by virtue of having two fewer transistors and by eliminating the need to run power rails lengthwise throughout the array, 4T-based structures are smaller than equivalent 6T structures. As 4T cells decay and revive naturally through the read-write process, some of the costs behind activating decayed cells are eliminated. For example, a write to a 4T cell automatically reactivates the cell, whereas reactivating a 6T cell is somewhat more complex, requiring extra hardware involved in gating the supply voltage.
Contributions
Because branch predictors have behavior more nuanced than caches, cache decay methods cannot be directly applied. This paper shows that effective decay techniques can nevertheless be devised for branch predictors. The paper then goes on to explore the interaction of decay policies with some of the wealth of branch predictor design parameters. We show that:
- Branch predictors, like caches, exhibit a significant amount of spatial and temporal locality, and thus decay techniques can be applied.
- Unlike in caches, a decayed prediction need not impact performance as much as a decayed cache line. A decayed prediction, whether a static fallback or a logically random value, may still be a correct prediction.
- Decay can reduce net leakage energy in the conditional branch predictor by 40–60% and in the branch target buffer (BTB) by 90%.
- In hybrid predictors, decay policies can achieve 50% higher reductions in leakage energy if the decay policy takes advantage of the hybrid predictor organization to boost decay opportunities.
- Decay is most effective for intervals of 64K cycles or larger. If decay is applied too aggressively, extra mispredictions result in significant costs in both performance and dynamic power.
- Quasi-static, four-transistor (4T) cells are a natural fit for decay, as they gradually leak charge over time. Moreover, they suffer from less leakage overall than 6T cells, making them an attractive choice for managing leakage energy.
The paper is organized as follows. First, we discuss the organization and layout of branch predictors, and establish that spatial and temporal locality exists. We then show results for basic decay in (6T) branch predictors. Next, we discuss the structure and behavior of the 4T quasi-static cell, followed by the results of the decay implementation. Last, we discuss related work and offer our conclusions.
2. DECAY WITH BASIC BRANCH PREDICTORS
2.1 Logical Structure of Branch Predictors
Although a wealth of dynamic branch predictors have been proposed, we focus on the effects of decay for the gshare and hybrid predictors, because they present the most interesting tradeoffs.
The gshare predictor, shown in the left half of Figure 1, tries to detect and predict sequences of correlated branches by tracking a global history (the global branch history register, or GBHR) of the outcomes of the n most recent branches. In gshare, the global branch history and the branch address are XOR'd to reduce aliasing. This paper models a 16K-entry gshare predictor in which 12 bits of history are XOR'd with 14 bits of branch address. This is the configuration that appears in the Sun UltraSPARC-III [Song 1997].
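As a rough illustration of the gshare indexing just described, the sketch below combines 12 history bits with 14 address bits. The text does not say which address bits are used or how the shorter history is aligned against the index, so dropping the two low PC bits and aligning the history with the high end of the index are assumptions, as are the function names.

```python
HISTORY_BITS = 12   # global history length, from the text
INDEX_BITS = 14     # log2 of the 16K PHT entries, from the text


def gshare_index(pc: int, ghr: int) -> int:
    """Return a PHT index for the branch at address pc under global history ghr."""
    # Assumption: instructions are 4-byte aligned, so the two low PC bits are dropped.
    addr_bits = (pc >> 2) & ((1 << INDEX_BITS) - 1)
    # Assumption: the 12 history bits are XOR'd into the upper bits of the index.
    hist_bits = (ghr & ((1 << HISTORY_BITS) - 1)) << (INDEX_BITS - HISTORY_BITS)
    return addr_bits ^ hist_bits


def update_history(ghr: int, taken: bool) -> int:
    """Shift the newest branch outcome into the global branch history register."""
    return ((ghr << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
```

Because the history perturbs the index, the same static branch can land in different PHT entries (and different rows) as the history changes, which matters for the locality discussion later in the paper.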
Instead of using global history, a two-level predictor can track local branch history on a per-branch basis. Local history is effective at exposing patterns in the behavior of individual branches. Because most programs have some branches that perform better with global history and others that perform better with local history, a hybrid predictor [Chang et al. 1995; McFarling 1993] combines the two. It operates two independent branch predictor components in parallel and uses a third predictor, the selector or chooser, to learn for each branch which of the components is more accurate and to choose its prediction. This paper models a hybrid predictor with a 4K-entry selector that uses only 12 bits of global history to index its PHT, a global-history component predictor of the same configuration, and a 1K-entry local-history predictor with a 10-bit-wide BHT and 1K-entry PHT. This configuration appears in the Alpha 21264 [Gwennap 1996] and is depicted on the right side of Figure 1.
[Figure 1 not recovered. Left: a 14-bit branch address XOR'd with a 12-bit GBHR indexes a 16K-entry PHT (taken/not-taken). Right: component #1 (global): 12-bit GBHR, 4K-entry PHT; component #2 (local): 1K x 10 BHT, 1K-entry PHT; selector (uses global history): 4K-entry PHT.]
Fig. 1. Gshare predictor in the Sun UltraSPARC-III (Left) and 21264-style hybrid predictor (Right).
2.2 Proposed Decay Implementation
Logically, branch predictors are arrays of counters that are typically just two bits wide. Like caches, however, branch predictors are physically implemented as square or nearly square array structures. This helps minimize the complexity of the row and column decoders and balance wordline and bitline length and delay. The predictor array is thus similar to a cache array, except that it needs no tags. For example, the 16K-entry gshare predictor discussed above can be laid out as a 128 × 128 array of 2-bit counters. Alternatively, it can be divided into 4 banks, each a 64 × 64 counter array. We refer to these two organizations as "unbanked" and "banked" respectively.
When first considering decay techniques with branch predictors, it might be tempting to consider deactivating individual predictor entries. This is untenable, however, because our methods require two bits of state per independently activated unit; such overhead would be excessive if applied to individual two-bit predictor entries. Thus, a cost-effective choice for turning off these counters is at the granularity of rows in the array structure rather than individual entries. This requires only two bits of state per row (the reference bit and the active bit), or a total overhead of 2 × (number of rows) bits.
Our techniques have the following general structure. At regular intervals, all rows of predictor entries that have not been used during the interval are assumed to have decayed and are therefore deactivated. The interval, called the decay interval, is measured in processor cycles and is a critical parameter for these schemes. The shorter the interval, the more opportunities there are for rows to be deactivated, but the more likely it is that rows are deactivated prematurely, inducing extra mispredictions. Intervals long enough to minimize extra mispredictions, on the other hand, result in the deactivation of fewer entries.
Figure 2 shows a picture of a typical branch predictor arranged as a square memory array. A group of entries, then, is a row of the square memory array. One bit per row, the reference bit, indicates whether any predictor entry in that row has been accessed within the last decay interval. All reference bits are cleared at the end of each interval. A second bit for each row, the active bit, indicates whether that row is currently active (i.e., not decayed). If a predictor lookup tries to access a decayed row, the predictor signals that a prediction cannot be made; the row is re-activated and possibly initialized to some desired starting state; in the meantime, a default prediction is made. Upon activation, our experiments use a default of not taken and initialize all the counters to 01. Thus, subsequent branches using the re-activated line start in the weakly-not-taken state.
[Figure 2 not recovered. A b-bit PHT index is split between a row decoder (b/2 bits), which asserts a wordline, and a column decoder (b/2 bits), which chooses the bitlines of one two-bit counter.]
Fig. 2. A schematic of a squarified branch predictor table of two-bit counters (the PHT).
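The row-level mechanism described above (a reference bit and an active bit per row, a periodic sweep, and reset to 01 on reactivation) can be sketched in software as follows. This is a behavioral illustration only, not the authors' simulator code; the class and method names are hypothetical, and the mapping of an index to a row by integer division is an assumption.

```python
class DecayingPredictor:
    """Table of two-bit counters with row-granularity decay (illustrative sketch)."""

    def __init__(self, rows: int, cols: int, decay_interval: int):
        self.rows, self.cols = rows, cols
        self.decay_interval = decay_interval               # in cycles
        self.counters = [[1] * cols for _ in range(rows)]  # 01 = weakly not taken
        self.referenced = [False] * rows                   # reference bit per row
        self.active = [True] * rows                        # active bit per row
        self.cycle = 0

    def tick(self, cycles: int = 1) -> None:
        """Advance time; at each interval boundary, deactivate unreferenced rows."""
        for _ in range(cycles):
            self.cycle += 1
            if self.cycle % self.decay_interval == 0:
                for r in range(self.rows):
                    if not self.referenced[r]:
                        self.active[r] = False
                    self.referenced[r] = False  # all reference bits are cleared

    def _touch(self, row: int) -> bool:
        """Mark a row referenced; reactivate it if decayed. Returns True if it was decayed."""
        self.referenced[row] = True
        if self.active[row]:
            return False
        self.active[row] = True
        self.counters[row] = [1] * self.cols  # reinitialize counters to 01
        return True

    def predict(self, index: int) -> bool:
        row, col = divmod(index, self.cols)
        if self._touch(row):
            return False  # default prediction (not taken) for a decayed row
        return self.counters[row][col] >= 2

    def update(self, index: int, taken: bool) -> None:
        row, col = divmod(index, self.cols)
        self._touch(row)
        c = self.counters[row][col]
        self.counters[row][col] = min(3, c + 1) if taken else max(0, c - 1)

    def active_ratio(self) -> float:
        return sum(self.active) / self.rows
```

Sampling `active_ratio()` over a run and averaging gives the kind of active-ratio figure used throughout the paper as a proxy for leakage.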
The active ratio is the average percentage of predictor rows found to be active (not decayed); it is a proxy for the actual leakage energy consumed by the predictor. Shorter decay intervals yield smaller active ratios (and larger leakage energy savings), but performance may suffer, since useful predictor entries are sometimes deactivated. Exploring this power-performance tradeoff is a key objective of this paper.
To evaluate the net effectiveness, we combine the reduced value of leakage energy that was observed with decay and the overhead energy associated with the decay technique. For each of the predictor types we study, we then plot the normalized leakage energy for different decay intervals against the original value for leakage energy. This approach for measuring the net reduction in leakage energy is similar to the techniques used by Kaxiras et al. in their work on cache decay [Kaxiras et al. 2001].
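One way to read the normalized leakage energy metric is as the sum of three terms, each divided by the leakage of the undecayed predictor. The sketch below follows that reading; the function and parameter names are hypothetical, and it assumes that status bits leak like ordinary predictor bits and are never turned off.

```python
def normalized_leakage(active_ratio: float,
                       status_bits: int,
                       array_bits: int,
                       dynamic_overhead_nj: float,
                       baseline_leakage_nj: float) -> float:
    """Net leakage energy with decay, relative to leakage without decay."""
    # Leakage of the predictor rows that remain powered on.
    array_part = active_ratio
    # Assumption: decay status bits are always on and leak like predictor bits.
    status_part = status_bits / array_bits
    # Extra dynamic energy from added mispredictions and longer run time.
    dynamic_part = dynamic_overhead_nj / baseline_leakage_nj
    return array_part + status_part + dynamic_part
```

A value below 1.0 means decay is a net win; for instance, an active ratio of 0.5 with 256 status bits on a 32768-bit array and no dynamic overhead gives about 0.508. These inputs are illustrative, not measurements from the paper.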
2.3 Spatial and Temporal Locality in Branch Predictors
In order to apply decay to branch predictors, we must first establish the existence of spatial and temporal locality. In today's machines, branch predictor rows typically include 32–256 counter entries. Fortunately, program branches are clustered rather than random, and across all the predictor organizations we examine, our experiments consistently show that some rows have heavy activity while others are idle and can be deactivated.
In the instruction cache, programs clearly exhibit spatial locality. Over a short period of time, only one or a few small contiguous regions of the program are likely to be active, so branch instructions are likely to be close in terms of their PC. This also translates into spatial locality in branch-predictor accesses. For branch predictors, spatial locality means that at any point in the program, active rows are likely to have many counters active and idle rows are likely to be entirely idle. This is most true for the bimodal predictor, which is indexed only by PC.
If branches were uniformly distributed, we would expect the probability of two successive conditional branches falling into the same row to be close to 1/rows, or about 2% for a 4K-entry predictor array. To establish spatial locality, we must show that the observed probability is much higher than this. Indeed, the probability that two successive conditional branches fall into the same row in a 4K-entry bimodal predictor is greater than 40% for all our benchmarks, and greater than 50% for all but five.
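The uniform baseline and the measured same-row probability can be computed along the following lines. This is a sketch under stated assumptions: a square layout (so a 4K-entry array has 64 rows), a PC-indexed bimodal table that drops the two low PC bits, and rows selected by the high bits of the index. The function names and the example trace are hypothetical.

```python
import math
from typing import Sequence


def uniform_same_row_probability(entries: int) -> float:
    """Chance that two independent, uniformly placed branches share a row."""
    rows = math.isqrt(entries)  # assumption: square array of counters
    return 1.0 / rows


def same_row_fraction(branch_pcs: Sequence[int], entries: int) -> float:
    """Fraction of successive branch pairs that index the same predictor row."""
    if len(branch_pcs) < 2:
        return 0.0
    cols = entries // math.isqrt(entries)

    def row_of(pc: int) -> int:
        # Assumption: bimodal indexing by PC with the two low bits dropped.
        return ((pc >> 2) % entries) // cols

    pairs = list(zip(branch_pcs, branch_pcs[1:]))
    hits = sum(1 for a, b in pairs if row_of(a) == row_of(b))
    return hits / len(pairs)
```

Under these assumptions the baseline is 1/64 (about 1.6%) for a 4K-entry array and 1/128 (about 0.8%) for a 16K-entry array, in line with the rounded figures quoted in the text.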
For gshare or hybrid predictors, spatial locality is not as pronounced as for bimodal, since these other predictors are indexed by a combination of the branch address and global history. Thus, depending on the global history, a single branch can index into many different PHT entries, across many different rows. Nevertheless, in half of the benchmarks the probability of hitting the same row as the previous branch is above 10%. This is much higher than a random distribution (about 1% for the large, 16K-entry gshare), and thus even these more complex predictors also exhibit some spatial locality.
Benchmark  1K cycles  10K cycles  100K cycles  1M cycles  Overall
gzip              24          32           45        103      281
vpr               31          45           58         65      742
gcc                2           9           79        193      512
mcf               65          83           92        116      565
crafty           104         305          592        855     1701
parser            53          90          157        294     2265
eon               81         289          357        415      652
perlbmk           90         453          631       1112     1541
gap               62         281          325        576      745
vortex           124         502         1227       1642     1996
bzip2             22          33           45         56      460
twolf             48         210          300        334      351
wupwise           42          52           53         55      193
swim               3           6           11         15      687
mgrid              3           6            9         25      500
applu              1           2            4          7      579
mesa              83         114          139        267      697
galgel             2           6            8         10      508
art                2           2            5         18      109
equake           167         192          193        202      226
facerec            7          24           25         39      144
ammp              11          26          105        230      794
lucas              3           3            3          4      242
fma3d             80         450          452        465      499
sixtrack          39          49           55         99      734
apsi              14          85          117        125      342
geomean           20          46           70        109      529
max        167 (equake)  502 (vortex)  1227 (vortex)  1642 (vortex)  2265 (parser)
min        1 (applu)  2 (applu)  3 (lucas)  4 (lucas)  109 (art)
Table I. Average number of static branches touched every decay interval for SPEC2000. The rightmost column, labeled 'Overall', gives the static branch footprint for the whole simulation period.
Yet this is not useful if the active rows change rapidly, so we must also establish temporal locality. One immediate factor that creates temporal locality is the fact that many benchmarks have small static branch footprints (the number of unique branch instruction sites that are executed), as seen in Table I. Decay will therefore clearly help bimodal predictors, as each static branch touches only one predictor entry. From Table I we know that they are clustered.
3. EXPERIMENTAL SETUP
This section describes our simulation technique and benchmarks, and our method for evaluating energy savings.
3.1 Simulation Setup
Simulations in this paper are based on the SimpleScalar 3.0 toolkit [Burger and Austin 1997]. Our model processor has microarchitectural parameters that resemble in most respects the Intel P-III processor [Diefendorff 1999]. The main processor and memory hierarchy parameters are shown in Table II. For performance estimates and behavioral statistics, we use SimpleScalar's sim-outorder simulator. For energy estimates, we use the Wattch simulator [Brooks et al. 2000]. Wattch uses SimpleScalar's sim-outorder cycle-accurate model and adds cycle-by-cycle tracking of power dissipation by estimating unit capacitances and activity factors. Because most processors today have pipelines longer than 5 stages, both our architectural and power models extend the sim-outorder pipeline by adding three additional stages between decode and issue.
We tried to match the performance characteristics of a P-III class processor, a 4-issue-width machine, as closely as possible. As for the branch predictors, we selected predictors found in other 4-issue-width machines, such as the hybrid found in the 21264 and the 16K gshare found in the UltraSPARC-III.
Processor Core
  Instruction Window   40-RUU, 16-LSQ
  Issue width          4 instructions per cycle
  Functional Units     4 IntALU, 1 IntMult/Div, 4 FPALU, 1 FPMult/Div, 2 MemPorts
Memory Hierarchy
  L1 D-cache Size      32KB, 1-way, 32B blocks, 3-cycle latency
  L1 I-cache Size      16KB, 4-way, 32B blocks, 3-cycle latency
  L2                   Unified, 256KB, 8-way LRU, 32B blocks, 8-cycle latency, WB
  Memory               100 cycles
  TLB Size             128-entry, 30-cycle miss penalty
Table II. Configuration of simulated processor.
3.2 Technology Assumptions and Circuit Simulations
The results of this research, while broadly applicable, are evaluated based on particular technology assumptions. We chose a feature size of 0.16 µm, a Vdd of 1.5V and a clock speed of 1 GHz. We use 6T and 4T cells from cell libraries; no custom designs are assumed. 4T cells are simple to design, do not require special process technology, and can be included in libraries with other SRAM memory cells.
As process technology primarily determines leakage currents, leakage increases substantially with each successive generation. Figure 3 illustrates this progression by showing leakage currents (in nA) for four industry process technologies from Agere Systems. At 2.5V and 0.25 µm, COM1 is largely out of date; we do not consider it further. COM2 is a reasonably current CMOS process, with a 1.5V supply voltage and 0.16 µm feature size. COM3 and COM4 are slightly more forward-looking CMOS processes. COM3 is a 1.0V, 0.12 µm process, and COM4 is a 1.0V, 0.1 µm process.
[Figure 3 not recovered. Plot titled "Transistor Leakage Current at 25C"; y-axis: transistor leakage current (nA), logarithmic, 0.01 to 100; x-axis: technology node (COM1, COM2, COM3, COM4).]
Fig. 3. Leakage current per transistor for a range of technologies in use at Agere.
The leakage currents are shown for room temperature (25°C). Although this understates the magnitude of the leakage seen in operating chips, where the temperature is likely to be much higher, our data agrees with other predictions [Semiconductor Industry Association 2001] that leakage is growing exponentially with successive technology generations.
Our transistor models are based on the currently-in-production COM2 process. We use Agere's SPICE-like simulators for very detailed circuit simulations with a 25-picosecond resolution. Although leakage savings with COM2 are small, COM2 is a process for which we can gather verifiable simulation results. As such, it is a useful vehicle for detailed circuit studies of hold times and temperature effects. Leakage current will be a serious problem in subsequent generations, and this paper shows substantial savings for the COM3 and COM4 processes.
Fig. 4. Current-Voltage curves for COM2 transistors at varying temperatures.
[Figure 5 not recovered. y-axis: transistor leakage (nA), logarithmic, 0.01 to 1000; x-axis: transistor junction temperature (C), -25 to 125.]
Fig. 5. Transistor leakage current (nA) for varying temperatures for the COM2 process.
Figure 5 shows the exponential relationship of temperature to transistor leakage currents as temperature is varied for 1.5V, 0.16 µm COM2 transistors. Variations in temperature result in large variations in leakage, and these will impact design tradeoffs. Our designs target an operational temperature of 85°C (appropriate for high-performance or mobile processors), but we also simulated very high temperature (125°C) operation, appropriate in hard-to-control environments like automobiles.
3.3 Benchmarks
We evaluate our results using benchmarks from the SPEC CPU2000 suite [The Standard Performance Evaluation Corporation 2000]. The benchmarks are compiled and statically linked for the Alpha instruction set using the Compaq Alpha compiler with SPEC peak settings and include all linked libraries. For each program, we skip the first 1 billion instructions to avoid unrepresentative behavior at the beginning of the program's execution. We then simulate 500M (committed) instructions using the reference input set. Simulation is conducted using SimpleScalar's EIO traces, which include all external inputs and outputs of the program, to ensure reproducible results for each benchmark across multiple simulations.
Table III summarizes the benchmarks used and their prediction accuracies with several predictors.
Columns: dynamic conditional branch frequency, followed by the prediction rate with each predictor.
Benchmark  Branch freq.  Bimod 4K  Gshare 16K  Hybrid 21264-style
gzip       9.40%         87.58%    90.67%      92.40%
vpr        11.08%        90.00%    97.68%      97.03%
gcc        2.56%         99.61%    99.74%      99.75%
mcf        19.40%        98.42%    99.27%      98.94%
crafty     11.13%        91.76%    93.33%      94.38%
parser     15.60%        91.25%    94.31%      94.89%
eon        11.08%        81.03%    89.71%      91.76%
perlbmk    12.43%        95.15%    97.29%      97.60%
gap        6.62%         90.10%    96.50%      96.96%
vortex     16.00%        97.85%    97.61%      97.87%
bzip2      12.13%        94.06%    94.05%      94.12%
twolf      12.24%        86.44%    88.53%      88.48%
wupwise    10.50%        92.05%    97.54%      96.66%
swim       1.35%         99.36%    99.54%      99.55%
mgrid      0.33%         92.57%    97.77%      97.95%
applu      0.28%         93.07%    98.70%      98.21%
mesa       8.73%         94.09%    96.98%      96.31%
galgel     6.09%         99.13%    99.27%      99.29%
art        11.29%        92.99%    99.02%      96.40%
equake     17.13%        98.04%    99.39%      99.28%
facerec    3.49%         97.92%    99.04%      99.21%
ammp       21.67%        98.77%    99.27%      99.23%
lucas      8.67%         90.84%    98.70%      98.84%
fma3d      18.09%        95.93%    98.39%      97.68%
sixtrack   8.08%         89.58%    99.72%      99.78%
apsi       3.51%         86.22%    96.52%      97.12%
Table III. Benchmark summary.
3.4 Energy Evaluation
The net energy savings from branch predictor decay must account for two opposing effects resulting from decay. Turning off Vdd in idle counters or rows can prevent them from leaking; however, decaying the counters causes them to lose their stored history information, possibly degrading the performance of the branch predictor and resulting in more energy spent on mis-speculation and a longer execution time.
We use Wattch [Brooks et al. 2000] to measure the total processor dynamic energy with and without applying decay, and subtract them to obtain the dynamic energy overhead. For leakage power, Yang et al. [Yang et al. 2001] estimate that with a 110°C operating temperature, a 1 GHz operating frequency, a supply voltage of 1.0V and a threshold voltage of 0.2V, an SRAM cell consumes about 1.74 × 10^-6 nJ leakage energy per cycle. This is roughly in line with industry data that Agere has found with their COM3 and COM4 process generations [Diodato 2001].
Based on the above data, we estimate that the 4K-entry bimodal predictor, the 16K-entry gshare predictor, and the 21264-style hybrid predictor consume about 0.014 nJ, 0.057 nJ, and 0.0517 nJ leakage energy each cycle, respectively. In general, the overall leakage energy for the branch predictor can be calculated as (leakage energy per bit per cycle) × #bits × #cycles. The leakage energy of the extra status bits can be calculated similarly. Since these status bits switch only infrequently (at most once every decay interval, e.g. every 64K cycles), we ignore their dynamic energy overhead in our calculation.
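The per-cycle estimates above can be reproduced with a few lines of arithmetic. In this sketch the per-bit figure of 1.74e-6 nJ per cycle is an assumption, chosen because it reproduces the three quoted totals; the 3-bit local counters in the hybrid predictor are likewise an assumption about the 21264-style configuration, and all names are hypothetical.

```python
# Assumed per-bit leakage energy per cycle (nJ); consistent with the quoted totals.
LEAK_NJ_PER_BIT_CYCLE = 1.74e-6

PREDICTOR_BITS = {
    "bimodal-4K": 4096 * 2,                  # 4K two-bit counters
    "gshare-16K": 16384 * 2,                 # 16K two-bit counters
    # Selector PHT + global PHT + local BHT + local PHT.
    # Assumption: the local PHT uses 3-bit counters.
    "hybrid-21264": 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3,
}


def leakage_per_cycle_nj(bits: int) -> float:
    """Leakage energy per cycle for a structure with the given number of bits."""
    return bits * LEAK_NJ_PER_BIT_CYCLE


def total_leakage_nj(bits: int, cycles: int) -> float:
    """(leakage energy per bit per cycle) x #bits x #cycles, as in the text."""
    return leakage_per_cycle_nj(bits) * cycles


def status_bits(rows_per_bank: int, banks: int = 1) -> int:
    """Two status bits (reference and active) per row, summed over all banks."""
    return 2 * rows_per_bank * banks
```

Under these assumptions, an unbanked 128-row gshare needs 256 status bits while four 64-row banks need 512, so the banked layout roughly doubles the status-bit leakage, which `total_leakage_nj` can price in the same way as the array itself.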
4. DECAY WITH SIX-TRANSISTOR STATIC CELLS
To establish the effectiveness of basic branch predictor decay, we first need to examine how decay interacts with basic branch predictors. Using conventional 6T cells and gated-Vss/Vdd techniques, we simulated three basic branch predictors: a bimodal predictor, a gshare predictor, and a hybrid predictor.
4.1 Bimodal Predictors
Figure 6 shows the active ratio and direction-prediction accuracy for 4K-entry bimodal branch predictors with different decay intervals. We see from the active ratio graph that decay is very effective at shutting off idle counters in bimodal predictors. The geometric mean active ratio, which is the percentage of predictor entries powered on (so smaller values are better), is 37%, 28%, 22% and 18% for decay intervals of 4096K, 512K, 64K and 8K cycles, respectively. This means that decay techniques have the potential to reduce branch predictor leakage by 2X or more. Since bimodal predictors are indexed by PC, benchmarks with large static branch footprints tend to have higher active ratios, as shown in crafty, perlbmk, gap and vortex. Other benchmarks typically leave more than half of their counters deactivated due to their small footprints.
Furthermore, the prediction rate graph shows that shutting off these idle counters comes with minimal loss in performance except for the apparently too-aggressive decay interval of 8K cycles. With a 64K-cycle decay interval, the overall loss in prediction accuracy is about 0.14%, with only one benchmark (mgrid) over 1%.
Interestingly, in some benchmarks (wupwise and facerec) we actually observed tiny improvements in prediction accuracy. We attribute this to the effect of removing some destructive interference. Interference occurs when two different branches map to the same counter. The interference is destructive when the branches are biased in opposite directions, for example, when one of the conflicting branches is strongly taken and the other is strongly not taken. Deactivation resets the counter to the neutral, weakly-not-taken value, which effectively isolates the two conflicting branches.
Figure 7 compares the leakage energy before and after applying decay. When decay is enabled, we add to the leakage energy the extra leakage energy from the decay status bits as well as any dynamic-energy overhead due to extra mispredictions or longer run time. As the decay interval lengthens, fewer mispredictions are induced, the extra dynamic energy incurred becomes minimal, and over 60% of the original leakage energy can be saved.
4.2 Gshare Predictors
To evaluate whether decay will be effective for a gshare predictor, we must first look at the active ratio over various interval lengths. Figure 8 shows the geometric mean of the active ratios across the benchmarks for both banked and unbanked 16K-entry gshare predictors and, as a reference, the 4K-entry bimodal predictor. As expected, the active ratios are quite small (i.e. good from a decay point of view) for the bimodal predictor. Because gshare is designed to spread branch state across the predictor, its active ratio is larger. Yet significant numbers of rows remain untouched. This indicates that decay-based techniques still show significant promise for addressing leakage concerns.
Figure 8 shows data for an unbanked as well as a banked version of gshare. Breaking the predictor into banks makes the active ratio smaller (better for decay) by reducing the granularity over which activity is measured. Indeed, the active ratio for gshare is 15–35% smaller if it is broken into four banks of 4K entries each. Overall, these active ratios yield leakage energy savings of 40% for unbanked gshare and about 50% for banked gshare. Even greater savings can be achieved for structures directly indexed by PC: about 65% for bimodal predictors and 90% for the BTB. For more details, refer to [Hu et al. 2001]. Note that energy savings are presented in terms of net leakage savings in the branch predictor, not the total processor, to separate net leakage energy savings in the branch predictor from potential net leakage savings independent of the branch predictor.
Figure 9 shows the active ratio and branch misprediction rates for an unbanked gshare predictor on a per-benchmark basis. The geometric mean active ratio across the 26 benchmarks is 46% for a 64K-cycle decay interval, with a negligible drop in predictor accuracy. While its decay benefits are not as pronounced as for a bimodal predictor, decay still produces substantial reductions in leakage power with minimal performance impact. Figure 10 shows that the overall reduction in leakage energy is 41% for a 64K-cycle decay interval.
Further energy savings can be realized using a banked organization, which gives a reduction in leakage energy of 51%. Note that with the banked predictor, the reduction in leakage energy will be somewhat less than the reduction in
ACM Transactions on Architecture and Computer Organization, Vol. V, No. N, Month 20YY.
Fig. 6. Active ratio (top) and prediction success rate (bottom) for a 4K-entry bimodal predictor, shown per benchmark for the original predictor and for decay intervals of 4096K, 512K, 64K, and 8K cycles.
active ratio for two reasons. First, the banked organization has smaller rows and hence incurs the overhead of more decay status bits. Second, the banked organization is more aggressive than unbanked gshare in deactivating rows for the same decay interval. This makes it more likely that the banked organization deactivates a row whose loss harms prediction accuracy, which costs extra dynamic power. This extra dynamic-power overhead, however, decreases with longer decay intervals because, for very long intervals, even a banked row that is idle is extremely likely to have genuinely decayed. This is why the difference in energy savings is larger for 512K cycles and 4096K cycles than for 64K cycles.
4.3 Hybrid Predictors
With two competing components (the global component and the local component), hybrid predictors exhibit many interesting design choices when implementing decay. In this section we will explore these design choices as well as
Fig. 7. Normalized leakage energy for the 4K bimodal predictor.
Fig. 8. Mean active ratio for unbanked and banked gshare predictors and a bimodal predictor with different decay intervals. The rightmost label “orig” corresponds to non-decaying predictors.
their effect on decay in an Alpha 21264-style hybrid predictor.
4.3.1 Selection Policy. The selection policy refers to the policy for choosing a prediction from one of the two component predictors. In a non-decaying 21264-style hybrid predictor, the chooser makes this decision using the global history; see Figure 1. However, when decay is enabled, it may happen that only one of the two components is active while the other is decayed. In this case, since the decayed component has lost its information, it is intuitively appealing to use the prediction from the active component, no matter what the chooser suggests. This policy is called “believe the active component,” and is implemented in all our experiments. It may also happen that both components are decayed, in which case all components are reactivated and the branch is predicted as “weakly not taken,” as in bimodal and gshare predictors.
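The selection rule just described can be sketched in a few lines of Python. This is an illustrative reading of the policy, not the simulator code: the function name, the boolean encoding of predictions (True for taken), and the returned reactivation flag are assumptions made for the example.

```python
WEAKLY_NOT_TAKEN = False  # direction implied by a counter reset to the weak not-taken state


def select_prediction(local_active, global_active,
                      local_pred, global_pred, chooser_picks_global):
    """'Believe the active component' selection (hypothetical interface).

    Returns (prediction, reactivate_all), where prediction is True for taken.
    """
    if local_active and global_active:
        # Both components hold valid state: defer to the chooser as usual.
        return (global_pred if chooser_picks_global else local_pred), False
    if local_active:
        # Only the local component is alive: use it regardless of the chooser.
        return local_pred, False
    if global_active:
        # Only the global component is alive: use it regardless of the chooser.
        return global_pred, False
    # Both decayed: reactivate everything and predict weakly not taken.
    return WEAKLY_NOT_TAKEN, True
```

For example, with the local component active and the global component decayed, the local prediction is returned even when the chooser points at the global component.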
4.3.2 Wakeup Policy. The wakeup policy refers to the decision of whether to reactivate a decayed row when it is accessed by a branch instruction. A naive policy would always wake up any rows that are accessed. In a hybrid predictor, a more elegant policy is possible: the decayed component will be reactivated only if the chooser wants to select it. We refer to this policy as “believe the chooser.”
In the situation when the accessed row in the chooser is decayed, we know that the global component is also decayed in the 21264-style predictor. This is because the chooser and global predictor are indexed, accessed and thus decayed in exactly the same way; see Figure 1. In this case, the chooser has no useful information. If the local
Fig. 9. Active ratio (top) and misprediction rate (bottom) for the unbanked gshare predictor, shown per benchmark for the original predictor and for decay intervals of 4096K, 512K, 64K, and 8K cycles.
component is active, then we leave the chooser and the global component inactive and return the prediction from the local component. Otherwise (when the local component is also inactive), we reactivate all components and return a prediction of “weakly not taken.”
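The two wakeup policies can be contrasted with a short Python sketch. It is an interpretation of the text rather than the authors' implementation: the function names, the string labels for the tables, and the decision to return only the set of tables to reactivate are assumptions. Following the text, the chooser and the global component are treated as decaying together.

```python
ALL_TABLES = {"chooser", "global", "local"}


def naive_wakeup(chooser_active, local_active):
    """Naive policy: wake every decayed row that the branch accesses."""
    woken = set()
    if not chooser_active:
        woken |= {"chooser", "global"}  # these two decay in lockstep
    if not local_active:
        woken.add("local")
    return woken


def believe_the_chooser_wakeup(chooser_active, local_active, chooser_picks_local):
    """'Believe the chooser': wake a decayed component only if it would be selected."""
    if not chooser_active:
        # Chooser (and therefore global) decayed: the chooser has no information.
        if local_active:
            return set()          # rely on the local component, wake nothing
        return set(ALL_TABLES)    # everything decayed: reactivate all components
    # Chooser and global are active; only the local row can be decayed here.
    if not local_active and chooser_picks_local:
        return {"local"}
    return set()
```

When the chooser row is active, points at the global component, and the local row is decayed, the naive policy wakes the local row while “believe the chooser” leaves it off, which is where the additional leakage savings come from.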
4.3.3 Results. Figure 11 details the active ratio and branch misprediction rate for naive decay, which always wakes up any rows that are accessed, with a 21264-style hybrid predictor. We see that even though the active ratios are higher (worse) than for bimodal or gshare predictors, decay has a negligible impact on the misprediction rate for intervals of 64K cycles or larger. Note that in order to compute the active ratio sensibly on a multi-table structure, we compute it over all prediction and chooser bits in the structure. Overall, as Figure 12 shows, naive decay realizes strong reductions in leakage energy: 40% for a 64K-cycle interval.
Fig. 10. Normalized leakage energy for the gshare branch predictor for both unbanked and banked predictors.
Fig. 11. Active ratio (top) and misprediction rate (bottom) for the 21264’s hybrid predictor.
We can obtain even better energy savings by taking advantage of the “believe the chooser” wakeup policy. As shown in Figure 12, this more sophisticated policy leads to leakage power reductions about 50% better than for the naive policy. We achieve this by exploiting information from the chooser. The chooser points to a particular component for a reason: one component is performing better than the other, and thus we only need information from the superior component.
Fig. 12. Normalized leakage energy of the 21264-style hybrid predictor with naive and “believe the chooser” wakeup policies.
4.4 Decay with the Branch Target Buffer
In addition to the structure for predicting the direction of conditional branches, a branch target buffer (BTB) [Holgate and Ibbett 1980; Losq 1982] is commonly used to store the targets of taken branches. The BTB is typically organized like a cache, either direct-mapped or set associative, but tagged in order to identify hits and misses. Since it is solely indexed by PC, it has locality characteristics similar to those of instruction caches and bimodal predictors. However, the granularity in the BTB is single branch instructions, instead of instruction blocks as in instruction caches. This allows even finer control for decay. Decay is therefore especially effective for BTBs.
We evaluated a 2048-entry, 4-way associative BTB, which appears in the Intel PIII processor [Diefendorff 1999]. Except for vortex, perlbmk and crafty, most benchmarks have a very low active ratio, below 20% or even below 10%. This is no surprise, considering that the static footprints of most benchmarks in Table I are very small compared to our BTB capacity. With decay, these static footprints are automatically tracked and all other idle slots are turned off to save leakage energy. On the other hand, BTB hit rates show only negligible degradation until the decay interval drops to 8K cycles or less. Overall, with decay intervals of 64K cycles or more, about 90% of the BTB leakage energy can be saved.
4.5 Summary and Discussion
In this section, we have explored the application of decay techniques to reduce leakage energy in branch predictor structures. The particular policies that we have developed (e.g., “believe the chooser”) apply to all non-state-preserving leakage-control mechanisms. The key contributions are more general, giving useful guidance to any work on leakage control. In particular, our locality analysis shows that all predictor organizations we examined have significant numbers of rows that are inactive for long periods of time. This makes leakage control in the branch predictor and BTB profitable, regardless of mechanism. Additionally, prior work by Jimenez et al. [Jimenez et al. 2000] and Parikh et al. [Parikh et al. 2002] suggests that branch predictors should be larger for both performance and power-saving considerations, but that localized-heating concerns may be an obstacle. This makes leakage control in the branch predictor and BTB particularly important.
Over all the configurations we explored, the best reductions in leakage power were achieved with the bimodal predictor, but the power savings achieved with the “believe the chooser” policy for hybrid prediction were nearly as good. Since hybrid prediction has superior prediction accuracy and hence performance and overall energy, hybrid prediction remains the best choice from both a performance and energy standpoint [Parikh et al. 2002]. Overall, as branch prediction accuracy and execution time were affected only negligibly, the reduction in leakage energy consumed by the branch predictor unit causes the processor to save energy as a whole.
Some of our interesting results were specifically due to the fact that there are two independent component predictors
Fig. 13. Active ratio (top) and accuracy (bottom) for the BTB with different decay intervals.
in a hybrid predictor. In the 21264’s hybrid predictor, for example, the chooser and global component were organized in the same way. This let us deduce useful facts about one based on state in the other. More generally, the influence of indexing on decay may suggest that the choice of predictor index is worth revisiting (yet again). In the past, indexing functions have been chosen to minimize aliasing, usually by spreading the state from branches in the same working set as widely as possible. In contrast, decay can be more effective by clustering similar states into rows, with the goal of making as many counters in a row as possible idle or active at any point in time. This observation therefore sets up a tradeoff between reducing aliasing and increasing decay opportunities. Although beyond the scope of this paper, seeking index functions that can balance the two factors is an interesting area for future work. It may be possible to develop tractable index functions that cluster similar state into rows with minimal increase in aliasing. In addition, large predictors in particular may be able to accommodate index functions that boost row clustering.
Fig. 14. Normalized leakage energy for the BTB with different decay intervals.
As for the actual decay intervals, this paper applies only fixed decay intervals. Kaxiras et al. [Kaxiras et al. 2001] show that for caches, even better decay rates can be achieved with decay intervals that adapt to changing program behavior. Other work by Zhou et al. [Zhou et al. 2001], Velusamy et al. [Velusamy et al. 2002], and Sankaranarayanan and Skadron [Sankaranarayanan and Skadron 2004] also improved on cache decay by selecting better policies for adapting the decay interval. While studying adaptive branch predictor decay was beyond the scope of this paper, it remains worth investigating since our data showed that some benchmarks suffered no performance loss even with a short, 8K-cycle decay interval (and would get even more benefit from decay), while others suffered drastically.
A further consideration is that banked and multi-table organizations provide substantial benefits in reduced dynamic energy, by reducing word- and bit-line lengths. Furthermore, when decay is applied to multi-table predictors, some tables can be left inactive, as in the “believe the chooser” policy. Another example arises in local-history prediction, where timing issues may require caching the predicted direction for each BHT entry in the BHT. As with “believe the chooser”, decay might permit the PHT update and lookup in such a local-history organization to be omitted if that PHT entry is inactive and the cached direction was correct. Such a policy might be called “believe the BHT”. Not only do these “believe” policies reduce leakage energy, they also avoid the dynamic power associated with accessing that table. Our work here did not model these extra savings, but doing so would only make decay more valuable.
Another issue that warrants further study is interference in branch predictors. In some experiments we observed mild but interesting improvements in prediction rate with decay. This suggests that decay (by setting the two-bit counters to a weak state) may have the effect of reducing destructive interference, something we plan to quantify in our future work. Lastly, since many other prediction structures such as value predictors [Lipasti et al. 1996] and prefetch address predictors [Lai et al. 2001] are organized similarly to branch predictors, an interesting future effort will be to apply strategies found in this paper to these structures.
5. DECAY WITH FOUR-TRANSISTOR RAM CELLS
Thus far we have explored decay implemented with traditional SRAM cells. Traditional decay techniques involve manipulating the power supply to create the illusion of dynamic behavior on top of a cell originally designed and selected for its ability to hold its value indefinitely. Viewed in this light, it seems natural to use quasi-static 4T cells to implement decaying branch predictors.
5.1 The Quasi-Static 4T Cell
Basic four-transistor (4T) DRAM cells are well established and described in introductory VLSI textbooks and various articles [Lyons et al. 1987; Noda et al. 1998; Wolf 1998]. 4T cells are similar to ordinary 6T cells but lack two transistors connected to Vdd that replenish the charge that is lost via leakage (Figure 15). Using exactly the same transistors as an optimized 6T design, the 4T cell requires approximately 2/3 of the area. For example, under a 0.18µm process, an ordinary 6T cell occupies an area of 5.51 µm², whereas a 4T cell would occupy an area of 3.8 µm² (i.e., 69% of the area [Diodato et al. 1998]).
Essentially, the 4T cell is just as fast as a 6T cell. While our data demonstrate a slight speed disadvantage, the difference is so small that, coupled with the smaller amount of parasitic interconnect, it essentially disappears. More importantly, 4T DRAM cells naturally decay over time (without the need to switch them off); once they lose their charge they leak very little since there is no connection to Vdd. However, some secondary leakage via the access transistors still remains due to bit-line precharging, which we do take into account in our transistor-level simulations.
Precharging, in both SRAM and DRAM designs, saps some of the power savings; any charge loss, power dissipation, or cell behavior that is a result of precharging is accounted for in our simulation. There are various forms of precharge (full-Vdd, half-Vdd, 2/3-Vdd) and they vary from those that are normally on to those that are normally off. We use full-Vdd in our simulations. The precharge control circuitry in the simulation is normally off; the precharge is on for 33% of the time and then off at the specific moment of a read or write. Whether the precharge transistors are on or off, residual charge on the bit-line does leak through the access transistors and eventually is lost to the ground node. Even when all residual charge is removed from the bit-line, the leakage of the precharge transistors supplies a low level of current flow through the access transistors.
Fig. 15. Circuit diagrams of the 6T SRAM cell (left) and the 4T quasi-static RAM cell (right).
In addition, 4T cells are automatically refreshed from the precharged bit lines whenever they are accessed. When a 4T cell is accessed, its internal high node is restored to high potential, refreshing the logical value stored in it; there is no need for a read-write cycle as in 1T DRAM. As the cell decays and leaks charge, the voltage difference of its internal nodes drops to the point where the sense amps cannot distinguish its logical value. Conservatively, this occurs when the node voltage differential drops below a threshold of the order of 100 mV (for 1.5V designs). Below this threshold we have a decayed state, where reading a 4T DRAM cell may produce a random value (not necessarily a zero). Over a long time the cell reaches a steady state where both the high node and the low node of the cell “float” at about 30 mV (for 1.5V designs).
Thus 4T cells possess two characteristics fitting for decay: they are refreshed upon access, and decay over time if not accessed. In the rest of this section we discuss extensively the 4T decay design, including decay or data retention times, dynamic energy and other considerations.
5.2 Hold Times In 4T Cells
The critical parameter for a 4T design is retention time. When implemented in 4T cells, decay techniques have the cell’s retention time as their natural decay interval. We define retention time to be the time from the last access to the time when the internal differential voltage of the cell drops below the detection threshold. Retention time (also called hold time) depends on the leakage currents present in the 4T cell, which in turn depend on process technology variations and temperature.
To study retention times for the 4T cache decay we chose Agere’s COM2 0.16µm CMOS process, for which we have accurate transistor models. We also use COM2 for this study because it is the most modern of the four COM processes that is available in-house, in real silicon, to validate our models against real measurements. Although this particular technology does not suffer excessively from leakage, our analysis scales to future generations. Subsequent sections will discuss how to manipulate retention times to match application needs, so although our initial retention time numbers are for COM2, we feel they are achievable in COM3 and COM4 as well.
Variations in the process technology significantly affect leakage currents and therefore retention times. In our studies we use three reference points in the process technology: (i) Nominal transistors (NOM) represent the expected behavior and constitute the bulk of the production; (ii) Worst Case Fast (WCF) transistors are transistors that are six standard deviations (sigma) from the process mean in terms of speed. They are very fast but very leaky transistors; (iii) Worst Case Slow (WCS) transistors are the opposite six-sigma point in the process distribution, where transistors are quite slow but leak far less.
The WCF and WCS are extreme cases that by definition rarely appear in production; they are essentially theoretical constructs used as bounds. In general, fast chips are likely to leak more and thus have shorter retention times. However, fast chips are also clocked at higher speeds, which means that even their short retention times correspond to many cycles. It is the cycle count that matters, not real time, since decay is an architectural phenomenon characterized by cycle counts and largely independent of cycle duration.
As mentioned previously, variations in temperature also result in large variations in retention times. Our designs target an operational temperature of 85°C (appropriate, for example, for mobile processors) but we also discuss mechanisms to adjust to very high temperatures (125°C).
Finally, selecting 3.3V I/O transistors (readily available in COM2 technology) to replace the 1.5V transistors while maintaining 1.5V signaling in the 4T cells significantly extends retention times at the expense of increased cell area. Even in this case, the 4T cell area is still less than that of the 6T cell. Building 4T cells out of 3.3V transistors while operating them at 1.5V is feasible because of the nature of the 4T cell, which acts as a placeholder for charge. The same cannot be done for an active 6T circuit (two cross-coupled inverters), which requires its transistors to be fully biased to work correctly.
        1.5V RETENTION TIME             3.3V RETENTION TIME
        25°C      85°C     125°C        25°C         85°C      125°C
NOM     18,000    1700     560          1,040,000    57,200    9400
Table IV. Hold times in nanoseconds for 1.5V and 3.3V versions of 4T cells at different operating temperatures. For a 1 GHz (1 ns cycle time) processor, one can also consider these retention times as cycle counts.
Based on these assumptions, we determined retention times for our technology through detailed transistor-level simulations. We simulated an access to a cell, followed by a long period in which the cell was left unread. During this time, leakage causes the cell’s internal nodes to lose charge. We defined the hold time to be the duration between an access and the point at which the differential voltage of the 4T cell’s internal nodes lapsed to a value less than 100 mV. We used 100 mV as our criterion for the minimum voltage at which we would expect the sense amps to distinguish the logical state. If the 4T cell is read after it has decayed beyond this point, it still operates safely and does not raise metastability or other concerns; it just does not have the data it had before. Reading a decayed cell, as shown and
verified in our circuit models, produces an electrically stable, though logically random, value. It is essential therefore to know well in advance when data have decayed in a 4T cache. This necessitates decay counters, not to turn the branch predictor row on and off as in previous work, but just to signal the decay of a branch predictor row. Table IV gives the cell retention times in nanoseconds for the COM2 technology. These retention times are for a 4T cell using the same highly optimized 6T cell transistors.
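The role of a decay counter that only signals decay, rather than gating power, can be illustrated with a small Python model. This is a behavioral sketch under stated assumptions, not the circuit or counter design used in this work: the class name, the exact per-row timestamps (real hardware would more plausibly use coarse counters), and the treatment of never-accessed rows as decayed are all hypothetical.

```python
class RowDecaySignal:
    """Tracks, per row, whether 4T data can still be trusted (hypothetical model)."""

    def __init__(self, n_rows, retention_cycles):
        self.retention_cycles = retention_cycles
        self.last_access = [None] * n_rows  # cycle of the last access, per row

    def is_valid(self, row, now):
        """True if the row was accessed less than one retention time ago."""
        last = self.last_access[row]
        return last is not None and self.retention_cycles > now - last

    def access(self, row, now):
        """Record an access; return whether the row's contents were still valid.

        The access itself refreshes the row, so the timestamp is always updated.
        When False is returned, the caller should ignore the value that was read.
        """
        valid = self.is_valid(row, now)
        self.last_access[row] = now
        return valid
```

For instance, with a retention of 57,200 cycles (the 3.3V, 85°C entry of Table IV at 1 GHz), a row accessed at cycle 1,000 is reported valid on any access before cycle 58,200 and decayed from then on.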
5.3 Sense Amplifier Considerations
One implication of the 4T cell is that it becomes progressively more expensive to read as it decays (similar to the behavior of DRAM). As the cell decays, the sense amp must work harder to detect and amplify the differential signal on the bit lines. Besides the sense amp energy there is also energy expended to refresh the cell as we read it. We have studied both of these effects for all the types of 4T cells (with 1.5V and 3.3V devices) and for various stages of decay. Table V gives the read energy for the 4T cell at various internal voltages.
Internal     4T Read Energy (fJ)      6T Read Energy (fJ)
voltage      1.5V       3.3V          1.5V
1.5V         na         149           156
1.0V         166        151           na
0.7V         171        153           na
0.5V         175        155           na
0.1V         219        199           na
Table V. Read energies in fJ for nominal 1.5V and 3.3V 4T cells at 85°C for different states of decay (i.e., different internal voltages). Values include energy dissipated in the cell, precharge circuits, and sense amplifier. The 3.3V 4T cells are operated at 1.5V. An internal voltage of 1.5V is not available in the 1.5V 4T because the 4T cell internal node voltages are constrained by the cut-off region of the MOSFETs used as access transistors. This effect could be overcome by overdriving the wordlines, as in conventional DRAM design.
5.4 Locality Considerations
Granularity is also relevant in the 4T design but here it stems from the way 4T cells are refreshed. As reading a single cell asserts the wordline and refreshes all the cells in the accompanying row, cells that would have decayed if left alone get refreshed coincidentally by nearby active cells. This leads to the observation that even 4T cells with short retention times may not actually lose data as quickly if the row size is long enough. Thus retention time selection goes together with locality granularity because large row granularities make the apparent rate of refresh much higher. (Segmented wordlines would allow more selective refresh but these designs are outside the scope of this paper.)
In contrast, in a design with very fine row granularity, one would opt for 4T cells with very long retention times. Fine granularity leads to a very good decay ratio but the important cells must remain alive on their own (without the benefit of accidental refreshes) for considerable time.
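The interaction between row size and retention time can be illustrated with a toy Python model in which any access refreshes the whole row. The function name, the (cycle, cell) trace format, and the parameter values are assumptions chosen for the example; the model ignores everything except last-refresh times.

```python
def decayed_reads(accesses, cells_per_row, retention):
    """Count accesses that find their row already decayed.

    accesses: (cycle, cell_index) pairs in increasing cycle order (hypothetical format).
    Any access to a row refreshes every cell in that row. The first touch of a
    row is not counted, since there is no earlier state to lose.
    """
    last_refresh = {}  # row index -> cycle of the most recent access to the row
    lost = 0
    for cycle, cell in accesses:
        row = cell // cells_per_row
        prev = last_refresh.get(row)
        if prev is not None and cycle - prev >= retention:
            lost += 1  # the row sat idle for at least one retention time
        last_refresh[row] = cycle
    return lost
```

With two cells accessed alternately every 60 cycles, i.e. [(0, 0), (60, 1), (120, 0), (180, 1)], and a retention of 100 cycles, one-cell rows see 2 decayed reads, because each cell is revisited only every 120 cycles. Placing both cells in one row gives 0, because the neighbor's accesses refresh the row every 60 cycles.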
5.5 Metastability
A key issue regarding 4T structures is the fact that metastability problems may be a concern when the cell’s internal differential voltage is too small. To avoid metastability, it is tempting to use refresh, but this would negate the savings of our approach. Instead, one can detect the small differential voltages and avoid relying on array data at these points.
As an example of the latter, we propose that one could avoid metastability in a 4T branch predictor by adding a reference column whose sole purpose is to detect low differential voltage and to prevent the sense amp output from propagating further into the circuit. In this column, instead of a sense amplifier we have a voltage comparator. When the voltage difference is too small, the comparator output forces the reference cell to read as a logical zero (otherwise it reads as a logical one). The output of the comparator qualifies the result. A small differential is therefore prevented from inducing metastability in the subsequent circuit.
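Logically, the reference column acts as a valid bit that gates the sensed data. The Python sketch below models only that gating behavior. The function name and interface are hypothetical, and the 100 mV comparator threshold is an assumption borrowed from the sense-amp detection threshold quoted earlier; the text does not specify the comparator's trip point.

```python
def qualified_read(sensed_bits, ref_differential_mv, threshold_mv=100.0):
    """Gate a row read with the reference-column comparator (behavioral sketch).

    ref_differential_mv: differential voltage seen on the reference cell.
    Returns (valid, bits). When the differential is too small the reference
    reads as logical zero, and the sensed bits are not propagated.
    """
    reference_reads_one = ref_differential_mv >= threshold_mv
    if not reference_reads_one:
        return False, None
    return True, list(sensed_bits)
```

A caller that receives an invalid result would need some default behavior, for example treating the counter as being in a weak state; that choice is outside what this sketch models.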
Other approaches are possible, such as the use of decay counters [Kaxiras et al. 2001], but the one we have proposed, the use of a single comparator in a reference column, is appealing because it prevents metastability while requiring minimal extra area or power.
5.6 Reliability, Determinism, and Controlled Operation
In some cases it is necessary for 4T memory structures to appear static. As in any other DRAM, this is done by periodically refreshing the dynamic cells. Refresh can be accomplished in a way that is largely transparent to the normal operation of the memory array. Hanamura et al. [Hanamura et al. 1987] describe a refresh mechanism that on a periodic basis momentarily strobes the address lines after the bit lines have been precharged. Sense amps are idle at this point, resulting in low power consumption. The short strobe pulse allows charge to flow from the bit lines to the cells of the selected line and restore their contents. Ordinary reads and writes take precedence over refresh, resulting in minimal performance impact.
Such refresh circuitry (whose area cost is negligible) is likely to be included with a 4T data array. In normal operation the refresh circuit is inactive and the 4T structure operates in decay (low-leakage) mode. However, under special circumstances, refresh provides indispensable functionality. These include: (i) chip testing, (ii) deterministic operation for real-time applications, (iii) possible operation beyond the recommended temperature range. While the first two should be apparent, the third one arises because at high temperatures, leakage may be so high that branch predictors decay too quickly. In such situations, the system will want to re-enable refresh, reverting to non-decay mode. One method to achieve this is by using on-chip sensors to automatically trigger the refresh when necessary.
5.7 Trends in Process Technology
With the advent of new process technology, transistor leakage will increase at a tremendous rate, making 6T solutions somewhat less attractive due to power dissipation concerns. This same transistor leakage will of course lower the data retention of 4T cells; however, since access times will also improve, there would appear to be some kind of balance between the number of machine cycles that the 4T cell can hold its data and the number of machine cycles that the 4T cell is required to hold its data in cache applications. For example, simulating under SPICE using a future technology (65nm), at 85°C a 3.3V 4T cell has a 1.4 µs retention time, translating into 1400 processor cycles at 1 GHz. However, according to the ITRS roadmap, by the time 65nm is reached the processor clock should be approximately 7.5 GHz, translating into an effective retention time of just over 10,000 cycles. Coupled with architectural techniques to extend the retention time, even with the impact of process technology, we can still field an effective retention time for decay. Therefore, 4T still wins.
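The cycle counts quoted in this example follow from multiplying the retention time by the clock frequency. A minimal Python sketch of that arithmetic (the function and variable names are hypothetical):

```python
def retention_cycles(retention_ns, clock_ghz):
    """Convert a retention time in nanoseconds to processor cycles.

    One GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    """
    return retention_ns * clock_ghz


# The 65nm example from the text: 1.4 microseconds = 1400 ns.
at_1_ghz = retention_cycles(1400, 1.0)    # 1400 cycles
at_7_5_ghz = retention_cycles(1400, 7.5)  # 10500 cycles, i.e. just over 10,000
```

The same physical retention time therefore covers 7.5 times as many cycles at the projected clock rate, which is the balance the paragraph above describes.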
As process technology advances:
- Transistor leakage increases for all devices, especially for those types of transistors used in 6T cells, but not so rapidly for the longer-channel transistors used in 4T cells.
- Data access time decreases in all cases.
- Data retention time decreases in 4T cells.
- Alpha particle immunity (and soft errors in general) will get worse for both types of cells.
- Precharge and access transistor leakage is a mechanism of pure charge loss and complete power waste in a 6T SRAM.
- Precharge and access transistor leakage tends to replenish the internal node charge lost due to the leakage of other transistors in the 4T DRAM.
The last point is a key point because 6T cells in this era have shown soft error problems owing to their larger source-drain area. 1T cells (previously notorious for their soft error sensitivity) are not getting as bad as once thought because their source-drain area is so small that the offending particle has a smaller target to hit, and when it hits there is
less of a charge imbalance because the cross-sectional area of the depletion region is so small. Once again the 4T cell seems to be a reasonable middle ground between 6T and 1T.
5.8 Controlling Retention Times
The success of a 4T decay design depends on matching retention times to access (i.e., “refresh”) intervals. Thus, the ability to control retention times could give us a new degree of freedom in designing 4T decay structures. One way to affect retention times is to add devices such as resistors or capacitors to the basic 4T cell [Diodato et al. 2001]. Such devices can be used to slowly replenish the lost charge. If the rate of replenishment is less than the leakage, the cell will still decay, albeit much more slowly; thus retention time can be extended significantly.
The ability to control retention times is key to the applicability of 4T cells to branch predictor designs, as the retention time is dependent on temperature. Thus, one can adapt the natural decay interval (in cycles) more closely to the desired decay interval in response to the given temperature range. As described in Section 5.6, refresh circuitry can be included to be used as a fail-safe if the temperature exceeds a certain threshold.
In addition, retention times can be controlled using architectural techniques, such as using multiple refreshes to prolong retention time, accomplished either statically or adaptively. Moreover, as described above, the structure of the cache may make the apparent rate of refresh much higher; longer row granularities can compensate for a shorter decay interval. One thus can design the structure to orient towards longer rows for an increased decay interval (and a larger safeguard against performance loss) or shorter rows for greater leakage energy savings.
6. RESULTS FOR 4T-BASED BRANCH PREDICTORS
To illustrate the effects of the increase in the misprediction rate due to decay, we simulated a processor with the direction predictor and BTB built with 4T structures. We simulated under the worst case scenario: small static leakage in comparison to dynamic energy, which correlates closely with the COM2 process. Thus we use slow-decaying 3.3V 4T cells in our design, operating at 85°C, leaving us a decay interval of 57,200 cycles for both the direction predictor and the BTB.
These values can be adjusted using the techniques described in Section 5; for example, if the decay interval of 57,200 cycles is too long, we can instead use 1.5V 4T cells, which have a shorter decay interval (8000 cycles), and extend that retention time appropriately.
6.1 4T Decay Results
6.1.1 Gshare. Figure 16 shows the normalized execution time and misprediction rate of the 4T branch predictor
versus a standard (non-decaying) branch predictor. We see that the misprediction rate and thus the execution time
are virtually unchanged. In fact, due to the random effects of reading the decayed values, a few benchmarks actually
improve 2-3%. Overall, the accuracy was affected less than 5%.
Figure 17 shows the active ratio of the direction counters under various SPEC benchmarks. On average, we see a
28.2% active ratio, which directly translates into over 70% savings on leakage power over a traditional, non-decaying
predictor. These savings are not without cost, however: additional read energy is required every time a nearly decayed or
completely decayed counter is read. Under our current process, this penalty is an additional 27.5% of the read energy
of a non-decayed counter.
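The relationship between active ratio, leakage savings, and the read penalty can be written out as a short calculation. This is a sketch under the simplifying assumption that decayed cells leak nothing (the text only says they leak very little); the function names are illustrative.

```python
READ_PENALTY = 0.275  # extra read energy for a (nearly) decayed counter, from the text


def leakage_savings(active_ratio):
    """Fraction of leakage saved, assuming decayed cells leak nothing."""
    return 1.0 - active_ratio


def read_energy(base_energy, decayed):
    """Energy of one counter read; decayed counters pay the read penalty."""
    return base_energy * (1.0 + READ_PENALTY) if decayed else base_energy


print(leakage_savings(0.282))          # average active ratio from Figure 17
print(read_energy(1.0, decayed=True))  # in units of a normal read
```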
Moreover, this penalty is applied also when nearly decayed counters on the same row are refreshed accidentally due
to a nearby read; fully decayed counters on the same row, however, are left decayed and therefore not refreshed. We
implement this by detecting the voltage differential on the bitlines; if this voltage swing is close to zero, we can tie
the signal into the refresh lines and avoid refreshing an unused cell. The sum total is that we retain the effective long
decay interval of low row granularity but achieve much lower active ratios.
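The row-level behavior described above can be sketched as a small state model. This is an illustration, not the circuit itself: the three cell states, the assumption that the accessed counter always ends up live, and the energy accounting (one unit for the read plus one penalty per refreshed or decayed counter) are all simplifications introduced here.

```python
from enum import Enum

READ_PENALTY = 0.275  # extra energy, relative to a normal read (from the text)


class Cell(Enum):
    LIVE = "live"
    NEAR = "nearly decayed"
    DEAD = "fully decayed"


def read_row(row, target):
    """Read counter `target`; return (new_row, energy in normal-read units)."""
    energy = 1.0
    new_row = list(row)
    for i, state in enumerate(row):
        if i == target:
            # The accessed counter pays the penalty if it had (nearly) decayed.
            if state is not Cell.LIVE:
                energy += READ_PENALTY
            new_row[i] = Cell.LIVE
        elif state is Cell.NEAR:
            # Nearly decayed neighbours are refreshed accidentally.
            energy += READ_PENALTY
            new_row[i] = Cell.LIVE
        # Fully decayed neighbours show ~zero bitline swing and stay decayed.
    return new_row, energy


row = [Cell.NEAR, Cell.DEAD, Cell.LIVE, Cell.NEAR]
new_row, energy = read_row(row, target=2)
print([c.value for c in new_row], energy)
```

In this example the two nearly decayed neighbours are refreshed and charged the penalty, while the fully decayed one stays off and keeps the active ratio low.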
[Figure 16 plots: bars for the standard and decayed [4T] predictors over the SPEC benchmarks (gzip through apsi) and their geometric mean. Top y-axis: execution time (%); bottom y-axis: misprediction rate.]
Fig. 16. Normalized execution time (Top) and misprediction rate (Bottom) of standard and 4T predictors. 4T predictors produce minimal
performance losses.
[Figure 17 plot: active ratio (%) over the SPEC benchmarks (gzip through apsi) and their geometric mean.]
Fig. 17. Active ratio of a 4T based predictor.
[Figure 18 plot: normalized branch predictor leakage energy versus decay interval (original, 4096K, 512K, 64K, 8K) for COM2, COM3, COM4, 4T COM2, 4T COM3 and 4T COM4; the 4T points are marked at 85 °C.]
Fig. 18. Normalized leakage energy for the 4T branch predictor at 57.2k and 8k cycles.
Finally, the normal dynamic energy overhead of additional mispredictions must be included in our results. Figure
18 shows the 4T based branch predictor in comparison with the previous 6T designs. While the 4T decay interval is
set at 57.2k, at equivalent processes, the 4T usually outperforms the 6T. For instance, a 6T based decaying predictor
on a COM3 process would actually consume more energy than a standard, non-decaying predictor, whereas the 4T
version of the same predictor, with a shorter decay interval, does better.
In addition, the data also show that it is possible with 4T cells to use very aggressive decay intervals in the direction
counter tables. Under the COM4 process, we see leakage power savings even when the decay interval is 8000 cycles.
As mentioned above, we are seeing the effects of the lower active ratios at the same row granularity that the 4T cells
allow.
6.1.2 Bi-mode Results. Figure 19 shows the normalized execution time and misprediction rate of a more complex
predictor, the bi-mode. In this configuration we used an aggressive bi-mode predictor composed of two 16K PHTs
activated by an 8K selector. We see that execution time changes very little; misprediction rate is increased roughly
0.3%, translating into a performance loss of 0.1%. The complexity of the predictor helps alleviate the performance
loss due to the misprediction, as a single static branch covers a larger area than in, for example, a bimodal predictor,
which raises the apparent refresh rate. This is confirmed by the fact that bi-mode predictors suffer less as the decay
interval is decreased compared to the gshare. The results for varying the decay interval of gshare will be discussed in
Section 6.2.
Figure 20 shows the active ratio of the bi-mode predictor. The active ratio appears to be smaller than that of the
gshare because the active ratio is an average of all three components of the bi-mode, one of which is a very simple
bimodal predictor. While the bimodal component is half the size of the PHTs, its active ratio is significantly smaller
than that of the PHTs and thus lowers the overall active ratio.
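To see how a small but mostly idle component pulls the overall figure down, the average can be sketched as follows. The per-component active ratios below are hypothetical placeholders, not measured values, and the size-weighted form of the average is an assumption; only the component sizes (two 16K PHTs and an 8K selector) come from the text.

```python
def overall_active_ratio(components):
    """Size-weighted average active ratio over (entries, active_ratio) pairs."""
    total_entries = sum(entries for entries, _ in components)
    live_entries = sum(entries * ratio for entries, ratio in components)
    return live_entries / total_entries


bi_mode = [
    (16 * 1024, 0.30),  # PHT 0 (hypothetical active ratio)
    (16 * 1024, 0.30),  # PHT 1 (hypothetical active ratio)
    (8 * 1024, 0.10),   # bimodal selector (hypothetical active ratio)
]
print(overall_active_ratio(bi_mode))
```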
6.1.3 BTB Results. The results for the 4T BTB are similar to the 6T BTB. Because each BTB target is much
larger than the two-bit counter, we can afford to attach counters to each BTB target and thus achieve the optimum
granularity; those counters that are used are refreshed, and those that decay are not accidentally refreshed. As a result,
the active ratios of the 6T BTB are identical to those of the 4T BTB.
6.2 Sensitivity to Decay Interval
Figure 21 shows the impact of decay interval on execution time and predictor accuracy. Shorter (aggressive) decay
intervals increase the decay ratio, but at the cost of performance. For example, at a 256 cycle decay interval, the active
ratio is less than 1%, but the execution time is increased 11% due to the branch predictor suffering a 10% drop in
accuracy. Overall, the normalized leakage energy for the branch predictor is increased 7 times.
[Figure 19 plots: bars for the standard and decayed [4T] bi-mode predictors over the SPEC benchmarks (gzip through apsi) and their geometric mean. Top y-axis: execution time (%); bottom y-axis: misprediction rate.]
Fig. 19. Normalized execution time (Top) and misprediction rate (Bottom) of standard and 4T bi-mode predictors. 4T predictors produce minimal
performance losses.
While temperature can cause the 4T cell to leak more rapidly (and thus shorten the decay interval), it is important
to restate that a randomized value in the directional counters may still be correct, and thus the effect of accidentally
decaying branch predictors is not as severe as decaying cache lines. Moreover, as mentioned earlier, retention time can
be controlled using the techniques mentioned in Section 5. For example, if the temperature is beyond some threshold
such that cells are leaking too quickly, a stand-by refresh circuit can be enabled to return the branch predictor into a
non-decaying mode.
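The trade-off between decay interval and active ratio can be illustrated with a toy model. The sketch below is not the simulator used for these results: the access trace, table size, and function name are hypothetical, and it assumes that every row starts decayed, that a row stays live for exactly one decay interval after each access, and that accesses are sorted by cycle.

```python
def active_ratio(accesses, n_rows, decay_interval, total_cycles):
    """Fraction of (row, cycle) slots in which a row still holds live state.

    `accesses` is a list of (cycle, row) pairs sorted by cycle.
    """
    live_until = [0] * n_rows  # cycle at which each row finishes decaying
    live_cycles = 0
    for cycle, row in accesses:
        # Only count time not already covered by an earlier access.
        start = max(cycle, live_until[row])
        end = min(cycle + decay_interval, total_cycles)
        if end > start:
            live_cycles += end - start
            live_until[row] = end
    return live_cycles / (n_rows * total_cycles)


trace = [(0, 0), (5, 0), (50, 1)]  # hypothetical (cycle, row) accesses
short = active_ratio(trace, n_rows=2, decay_interval=10, total_cycles=100)
longer = active_ratio(trace, n_rows=2, decay_interval=20, total_cycles=100)
print(short, longer)
```

Shortening the interval lowers the active ratio in this model, but it says nothing about the accuracy lost when a row decays before it is reused; that cost is what Figure 21 measures.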
6.3 Summary and Discussion
We see that using quasi-static 4T cells allows us to build naturally decaying branch predictors with minimal impact on
performance. Furthermore, predictors built using 4T-based decay offer additional benefits over those already shown
with decay with 6T cells. We show that 4T RAM cells are as much as one-third smaller in area and are comparable in
terms of read energy when accessed frequently. Most importantly, they reduce leakage energy compared to 6T cells
and are comparable in terms of program performance.
[Figure 20 plot: active ratio (%) over the SPEC benchmarks (gzip through apsi) and their geometric mean.]
Fig. 20. Active ratio of a 4T based bi-mode predictor.
[Figure 21 plot: normalized execution time and branch prediction accuracy versus decay interval (in cycles), from 256 to 131072.]
Fig. 21. Normalized execution time and branch predictor accuracy for the 16K gshare predictor at varying decay intervals.
Thus our proposal to use 4T cells for predictor structures is driven by the following observations.
—Branch predictor entries exhibit locality. Data arrays are approximately square, and decay techniques are most
easily applied to rows in these arrays. This helps 4T structures, because an access to anything along a row boosts
the voltages of all cells in that row. Locality means that active code regions have their predictors refreshed while
inactive regions decay. Once decayed, they leak very little, essentially capping the energy dissipated by idle entries.
—The retention time for 4T cells (the amount of time it takes for a 4T cell to leak enough charge that the value it
stores is corrupted) is long enough that 4T cells' decaying behavior is naturally suited to exploiting decay in large
structures over our benchmarks: it is rare that a value leaks away before it is needed again.
—Retention times vary with design style, fabrication technology, and temperature. Nonetheless, we have discussed
techniques that allow one to modulate the retention times sufficiently to guarantee good performance. We have also
shown that there exist several techniques that will prevent small voltage differentials in decayed cells from inducing
metastability. Our results in Section 5.7 show that our technique will be viable even in future technologies.
These insights suggest that 4T cells are inherently useful for managing leakage in predictive structures. The fact that
4T cells decay naturally provides leakage-energy savings without either the overhead of gated-Vdd techniques or the
overhead of maintaining decay counter bits. Finally, 4T designs are smaller, saving die area and possibly permitting
faster access times for a given number of bits.
7. RELATED WORK
Much research has recently been done in the field of static leakage, both at the architectural level and at the device
level. At the architectural level, prior work focused on improving the static leakage performance by shutting off
(decaying) unused portions of large array-like structures, for example, in caches. [Yang et al. 2001] described an
instruction cache organization that disables ways of a set-associative cache to match the capacity currently needed by
the executing program.
Several papers [Powell et al. 2000], [Kaxiras et al. 2001], and [Hu et al. 2002], describe techniques for disabling
individual cache lines by inferring that lines which have not been used in a long time have decayed: they probably
contain data that is not likely to be used again before replacement. Building on the cache decay work, [Hu et al.
2002a] extended decay techniques to the branch predictor. Most of the above techniques assume technologies such as
gated-Vdd or supply-voltage gating to control whether or not a portion of the structure is enabled or disabled. More
recently, Hanson et al. presented a comparison of leakage-control techniques [Hanson et al. 2001], and concluded that
gated-Vdd may not be the best approach for controlling static power in large data arrays. Instead, they find that dual-Vt
techniques [Roy 1998] save more overall energy for structures that should have fast lookup times (like first-level
caches).
[Zhou et al. 2003] proposed turning off only the data portion of the cache, leaving the tag portion active in order
to monitor and exploit miss information. Furthermore, using adaptive cache decay as an example, [Velusamy et al.
2002] applied formal feedback control mechanisms to improve leakage savings by implementing an adaptive controller
which varied decay intervals dynamically. Sankaranarayanan and Skadron [Sankaranarayanan and Skadron 2004] also
improved on cache decay by selecting better policies for adapting the decay interval. [Flautner et al. 2002] presented a
technique that improves leakage savings not by fully decaying cache lines, but rather by putting them in a low-power
"drowsy" state, which maintains the cache state rather than discarding it as in decay techniques. Extending gated-Vdd,
the DRG-Caches of [Agarwal et al. 2002] use data-retentive leakage-aware cells to reduce leakage energy. Alternatively,
[Azizi et al. 2002] introduced an asymmetric dual-Vt SRAM cell. Another leakage control technique saving the cache
state was described by [Heo et al. 2002]. Combining state-preserving and state-destroying techniques, [Li et al. 2002]
demonstrated a technique in which leakage energy was further reduced by exploiting data duplication across the cache
hierarchy. Lastly, focusing instead on the compiler, [Zhang et al. 2002] presented techniques using the compiler to
improve instruction cache leakage.
A separate, architectural way of improving the energy consumption of structures such as the cache was
proposed by Balasubramonian and Albonesi [Balasubramonian et al. 2000] and Yang et al. [Yang et al. 2002], in
which the cache is reconfigured in order to best support the working set of the data with as little wasted cache as
possible. Resizable caches are indeed orthogonal to decay techniques and also take advantage of the fact that some
applications require very little from either the cache or the branch predictor. However, the decay approach is simple,
requires very little hardware, and can potentially use very small operating intervals. Still, there is nothing preventing
a resizable cache from being built with decay, and combining the benefits of both.
Taking a different approach, [Juang et al. 2002] looked at eliminating Vdd altogether by replacing the standard SRAM
cell with a dynamic, quasi-static four-transistor cell. Using the quasi-static cell allows for an easily built, naturally
decaying cache. In addition to caches, [Hu et al. 2002b] described ways of saving leakage in branch predictors by
implementing them with quasi-static cells. Finally, [Hu et al. 2003] included cache decay as an example of improving
power dissipation and performance by looking at more than just simple time-independent behavior, such as event
ordering, and exploiting the generational behavior of structures such as the cache and prefetch unit.
8. CONCLUSIONS
In this paper, we examine implementations of decay-based leakage control using quasi-static memory cells. Cache
decay, first proposed in [Kaxiras et al. 2000; Kaxiras et al. 2001], aimed to turn off unused portions of caches with
long idle times to reduce cache leakage energy.
We start by evaluating decay implementations based on coarse-grained counters appended to traditional 6T SRAM
arrays. While previously considered for caches, this paper shows that such techniques can be effective for branch
predictors as well. Inherent in this success is the observation that branch predictors exhibit row-based spatial locality
that allows some of the rows to be deactivated due to long idle times, while other rows are heavily accessed.
We see that, much like caches, exploiting this spatial and temporal locality allows us to save significant amounts of
leakage energy, especially in simpler branch predictors. In more complex branch predictors we show that intelligent
policies can save significant amounts of leakage energy over naive decay.
Prior implementations of decay were based on gated-Vdd or gated-Vss [Yang et al. 2001], where power and ground
are gated by a transistor. This implementation, however, not only leads to about 5% area overhead but also brings
about complex design issues, especially for turning on a decayed cache line.
A closer examination of gated-Vdd based decay reveals an intrinsic conflict between the standard 6-transistor cell
design and gated-Vdd. Specifically, while gated-Vdd tries to shut off cache lines, the two load transistors in 6-transistor
cells are merely wasting charge. To resolve this conflict, we proposed using 4-transistor quasi-static memory cells
to implement cache decay. By removing the two transistors that connect the cell to Vdd, quasi-static cells naturally
implement the two key functions for decay: their charge is lost over time and is replenished during each cache access.
Compared to a gated-Vdd implementation, this method avoids the area overhead and the design issues related to the gate
transistor.
Thus, we also examine the use of quasi-static 4T cells for implementing branch predictor decay. Such cells have
been proposed to implement on-chip embedded DRAM in a fairly traditional style with refresh circuitry. In our
work, we examine using the natural decay of the 4T cells to implement decay for leakage control in branch predictors.
Because branch predictors are performance hints, not correctness-critical, lost entries do not cause incorrect execution.
Moreover, we show that 4T cells can be built with sufficient natural retention times to implement useful decay-based
predictors with negligible impact on prediction rate.
Using a combination of transistor-level simulation and instruction-level simulation, we show that a 4-transistor
decay implementation achieves significant savings in leakage power. While 4-transistor decay typically offers greater
savings than 6-transistor solutions, there are many places where it may not be feasible to implement a 4T branch
predictor. 6T decay is more flexible in the selection of retention time, and offers a finer granularity in selecting
dynamic retention times. Furthermore, it may not always be possible, either through device or layout concerns, to
wholesale replace a 6T solution with a 4T solution.
Thus, by presenting both traditional (6T) and quasi-static (4T) solutions, we show that decay can be easily applied
either architecturally on top of a given design, or as an integral part of the design from the beginning. While branch
predictor leakage is now only 7-10% of total CPU leakage, we are able to reduce it by a significant fraction, sometimes
90% or more. This reduces overall chip leakage by 5-7%. Furthermore, we can reduce leakage with essentially no
performance cost. In future processors, as branch predictors potentially will grow in size very quickly, outpacing the
growth of the cache, the power consumption of the branch predictor will be an important problem. Most broadly, the
paper prompts a rethinking of how transient data can best be exploited in designing power-efficient processors.
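The chip-level figure can be related to the predictor-level savings by multiplying the predictor's share of leakage by the fraction of predictor leakage removed. The calculation below is a sketch: the function name is illustrative, and the 70% reduction is an assumed typical value (in line with the gshare savings in Section 6.1.1), chosen here because it roughly reproduces the 5-7% range.

```python
def chip_leakage_reduction(predictor_share, predictor_reduction):
    """Fraction of total chip leakage removed by reducing predictor leakage."""
    return predictor_share * predictor_reduction


# Predictor share of CPU leakage: 7-10% (from the text).
# Assumed typical reduction of predictor leakage: 70%.
low = chip_leakage_reduction(0.07, 0.70)
high = chip_leakage_reduction(0.10, 0.70)
print(f"{low:.1%} to {high:.1%}")

# Best case quoted in the text: 90% reduction of predictor leakage.
best_low = chip_leakage_reduction(0.07, 0.90)
best_high = chip_leakage_reduction(0.10, 0.90)
print(f"{best_low:.1%} to {best_high:.1%}")
```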
Acknowledgements
We would like to thank Polychronis Xekalakis and Yan Zhang for their help in simulating the 4T cell under future
technologies in SPICE.
Margaret Martonosi and Doug Clark's research on energy-efficient processors is supported in part by NSF ITR grant
CCR-0086031. In addition, Margaret Martonosi gratefully acknowledges funding and equipment donations from Intel
and IBM.
Kevin Skadron's research was supported in part by NSF grant CCR-0105626, an NSF CAREER award (CCR-
0133634), and a grant from Intel MRL.
Stefanos Kaxiras would like to thank Intel for equipment donations.
We also thank the anonymous reviewers for their detailed, helpful suggestions.
REFERENCES
AGARWAL, A. ET AL. 2002. DRG-Cache: a data retention gated-ground cache for low power. In Proceedings of the 39th Design Automation
Conference.
AZIZI, N. ET AL. 2002. Low-leakage asymmetric-cell SRAM. In Proceedings of the International Symposium on Low Power Electronics and
Design.
BALASUBRAMONIAN, R. ET AL. 2000. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures.
In Proceedings of the 33rd International Symposium on Microarchitecture.
BROOKS, D., TIWARI, V., AND MARTONOSI, M. 2000. Wattch: A Framework for Architecture-Level Power Analysis and Optimizations. In
Proc. ISCA-27.
BURGER, D. C. AND AUSTIN, T. M. 1997. The SimpleScalar tool set, version 2.0. Computer Architecture News 25, 3 (June), 13–25.
CHANG, P.-Y., HAO, E., AND PATT, Y. N. 1995. Alternative implementations of hybrid branch predictors. In Proc. Micro-28. 252–57.
DIEFENDORFF, K. 1999. Pentium III = Pentium II + SSE. Microprocessor Report.
DIODATO, P. ET AL. 1998. Merged DRAM-LOGIC in the Year 2001. Proc. of the IEEE International Workshop on Memory, Technology, Design,
and Testing.
DIODATO, P. ET AL. 2001. Embedded DRAM: An Element and Circuit Evaluation. In IEEE Custom Integrated Circuits Conference.
DIODATO, P. W. 2001. Personal communication.
FLAUTNER, K. ET AL. 2002. Drowsy Caches: Simple Techniques for Reducing Leakage Power. In Proceedings of the 29th International
Symposium on Computer Architecture.
GWENNAP, L. 1996. Digital 21264 sets new standard. Microprocessor Report, 11–16.
HANAMURA, S. ET AL. 1987. A 256K CMOS SRAM with Internal Refresh. In The 1987 IEEE International Solid-State Circuits Conference.
HANSON, H. ET AL. 2001. Static energy reduction techniques for microprocessor caches. In Proceedings of the 2001 International Conference on
Computer Design. 276–83.
HEO, S. ET AL. 2002. Dynamic Fine-Grain Leakage Reduction using Leakage-Biased Bitlines. In Proceedings of the 29th International Symposium
on Computer Architecture.
HOLGATE, R. W. AND IBBETT, R. N. 1980. An analysis of instruction fetching strategies in pipelined computers. IEEE Transactions on
Computers C-29, 4 (Apr.), 325–329.
HU, Z. ET AL. 2002a. Applying Decay Strategies to Branch Predictors for Leakage Energy Savings. In Proceedings of the 2002 International
Conference on Computer Design.
HU, Z. ET AL. 2002b. Managing Leakage for Transient Data: Decay and Quasi-Static 4T Memory Cells. In Proceedings of the 2002 International
Symposium on Low Power Electronics and Design.
HU, Z., JUANG, P., SKADRON, K., CLARK, D., AND MARTONOSI, M. 2001. Applying decay strategies to branch predictors for leakage energy
savings. Tech. Report CS-2001-24, Univ. of Virginia.
HU, Z., KAXIRAS, S., AND MARTONOSI, M. 2002. Let caches decay: Reducing leakage energy via exploitation of cache generational behavior.
ACM Transactions on Computer Systems.
HU, Z., KAXIRAS, S., AND MARTONOSI, M. 2003. Timekeeping techniques for predicting and optimizing memory behavior. In The 2003 IEEE
International Solid-State Circuits Conference.
J. A. BUTTS AND G. SOHI. 2000. A Static Power Model for Architects. In Proceedings of the 33rd International Symposium on Microarchitecture.
JIMENEZ, D. A., KECKLER, S. W., AND LIN, C. 2000. The impact of delay on the design of branch predictors. In Proceedings of the 33rd
International Symposium on Microarchitecture. 67–77.
JUANG, P. ET AL. 2002. Implementing decay techniques using 4T quasi-static memory cells. Computer Architecture Letters.
KAXIRAS, S., HU, Z., ET AL. 2000. Cache-Line Decay: A Mechanism to Reduce Cache Leakage Power. In Workshop on Power-Aware Computer
Systems (PACS), in conjunction with ASPLOS-IX.
KAXIRAS, S., HU, Z., AND MARTONOSI, M. 2001. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. In
Proceedings of the 28th International Symposium on Computer Architecture.
KESHARVARZI, A. ET AL. 1997. Intrinsic iddq: Origins, reduction, and applications in deep sub-m low-power cmos ic's. In Proceedings of the
IEEE International Test Conference. 146–155.
LAI, A., FIDE, C., AND FALSAFI, B. 2001. Dead-Block Prediction and Dead-Block Correlating Prefetchers. In Proceedings of the 28th
International Symposium on Computer Architecture.
LI, L. ET AL. 2002. Leakage energy management in cache hierarchies. In Proceedings of the 2002 International Conference on Parallel
Architectures and Compilation Techniques.
LI, Y. ET AL. 2004. State-preserving vs. non-state-preserving leakage control in caches. In Proceedings of the 2004 Design, Automation and
Test in Europe (DATE).
LIPASTI, M., WILKERSON, C. B., AND SHEN, J. P. 1996. Value locality and load value prediction. In Proc. ASPLOS-VII. 138–47.
LOSQ, J. J. 1982. Generalized history table for branch prediction. IBM Technical Disclosure Bulletin 25, 1 (June), 99–101.
LYONS, R. ET AL. 1987. CMOS Static Memory With a New Four-Transistor Memory Cell. Proc. of the 1987 Stanford Conf. On Advanced
Research in VLSI, 111–132.
MCFARLING, S. 1993. Combining branch predictors. Tech. Note TN-36, DEC WRL. June.
NODA, K. ET AL. 1998. A 1.9 um2 loadless CMOS Four-Transistor SRAM Cell in a 0.18 um logic technology. IEDM Tech. Dig., 847–850.
PARIKH, D., SKADRON, K., ZHANG, Y., BARCELLA, M., AND STAN, M. 2002. Power issues related to branch prediction. In Proc. HPCA-8.
233–44.
POWELL, M. ET AL. 2000. Gated-Vdd: A circuit technique to reduce leakage in cache memories. In Proceedings of the International Symposium
on Low Power Electronics and Design.
ROY, K. 1998. Leakage power reduction in low-voltage CMOS designs. In Proceedings of the International Conference on Electronics, Circuits,
and Systems. 167–73.
SANKARANARAYANAN, K. AND SKADRON, K. 2004. Profile-based adaptation for cache decay. ACM Transactions on Architecture and Code
Optimization. To appear.
SCHUSTER, S., TERMAN, L., AND FRANCH, R. 1987. A 4-Device CMOS Static RAM Cell Using Sub-Threshold Conduction. In Symposium on
VLSI Technology, Systems, and Applications.
SEMICONDUCTOR INDUSTRY ASSOCIATION. 2001. The international technology roadmap for semiconductors.
http://public.itrs.net/Files/2001ITRS/Home.htm.
SEZNEC, A. ET AL. 2002. Design tradeoffs for the alpha ev8 conditional branch predictor. In Proceedings of the 2002 International Symposium
on Computer Architecture.
SMITH, J. E. 1981. A study of branch prediction strategies. In Proceedings of the 8th International Symposium on Computer Architecture. 135–48.
SONG, P. 1997. UltraSparc-3 aims at MP servers. Microprocessor Report, 29–34.
THE STANDARD PERFORMANCE EVALUATION CORPORATION. 2000. WWW Site. http://www.spec.org.
VELUSAMY, S. ET AL. 2002. Adaptive cache decay using formal feedback control. In Proceedings of the 2002 Workshop on Memory Performance
Issues (in conjunction with ISCA-29).
WOLF, W. 1998. Modern VLSI Design: Systems on Silicon. Prentice-Hall.
YANG, S. ET AL. 2002. Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay. In Proceedings of the
8th International Symposium on High-Performance Computer Architecture.
YANG, S.-H. ET AL. 2001. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches.
In Proceedings of the 7th International Symposium on High-Performance Computer Architecture.
ZHANG, W. ET AL. 2002. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual IEEE/ACM International
Symposium on Microarchitecture.
ZHOU, H. ET AL. 2003. Adaptive mode control: A static-power-efficient cache design. In ACM Transactions on Embedded Computing Systems
(TECS), Special issue on power-aware embedded computing.
ZHOU, H., TOBUREN, M., ROTENBERG, E., AND CONTE, T. 2001. Adaptive mode control: A static-power-efficient cache design. In Proceedings
2001 International Conference on Parallel Architectures and Compilation.
ACM Transactions on Architecture and Computer Organization, Vol. V, No. N, Month 20YY.