Implementing Branch Predictor Decay Using Quasi …mrmgroup.cs.princeton.edu/papers/taco03.pdfAn...

ImplementingBranchPredictorDecayUsingQuasi-StaticMemoryCells1

Philo Juang�, Kevin Skadron

�, Margaret Martonosi

�,

Zhigang Hu�, Douglas W. Clark

�, Philip W. Diodato

�, Stefanos Kaxiras

�

�Departments of Electrical Engineering and Computer Science, Princeton University�Department of Computer Science, University of Virginia�IBM T.J. Watson Research Center, � Agere Systems, University of Patras

1. INTRODUCTION

Power and energy consumptionare a major concernin general-purposeand high-performanceprocessors.Power

dissipationcanbeseparatedinto two parts.Dynamic powerdissipationis dueto switchingactivity in theprocessors,

andin idealdevices,is theonly sourceof dissipation,ascurrentonly flowsduringswitching.Static or leakage power

is consumedregardlessof activity, especiallythroughsubthreshold leakagecurrents[Kesharvarzietal. 1997]thatflow

evenwhenthetransistoris nominallyoff.

As progressin fabricationprocesseshasled to steadyreductionsin supplyvoltages,thresholdvoltagesarebeing

loweredto thepoint whereleakagehasbecomeanimportantandgrowing fractionof total powerdissipationin high-

performanceCMOSCPUs.If it is notaddressedthroughfabricationor circuit-level changes,someforecastspredictas

muchasafive-fold increasein leakageenergy pertechnologygeneration[SemiconductorIndustryAssociation2001].

At suchrates,thecurrentleakagecomponent,roughly5% of total chip power now, would balloonto 50%or morein

just a few generations.

In responseto thistrend,researchershaveproposedcircuit- andarchitecture-levelmechanismsfor managingleakage

energy [J. A. Butts andG. Sohi 2000;Kaxiraset al. 2001;Yanget al. 2001;Flautneret al. 2002]. Using gated-�� techniques,Kaxiraset al. [Kaxirasetal. 2001]showedthatturningcachelinesoff if they havenotbeenusedin a long

time canbeeffective in reducingleakageenergy with little performanceimpact. Li et al. [Li et al. 2004]examined

several circuit-level mechanismsfor managingleakagecontrol, showing that in casessuchas faston-chipcaches,

non-state-preservingtechniquessuchasgated- � � aresuperiorto drowsy caches.After caches,branchpredictorsare

amongthe largestandmostpower-consumingarraystructuresin currentCPUs. CurrentpredictorsandBTBs are

4–8Kbytesin size,alreadythesizeof a small cache.They dissipate7-10%of theprocessor’s total dynamicpower

dissipation[Parikhet al. 2002].

However, aspipelinesgrow deeperandthe demandon the branchpredictorincreases,the branchpredictormay

increasein sizefasterthancaches,raisingtheir contribution to the chip’s total energy budget. For example,in the

proposedAlpha EV8 [Seznecet al. 2002], the branchpredictorwas352 Kbits (44 KBytes) in size,over 12 times

thatof the21264.In addition,Jimenezet al. [Jimenezet al. 2000]pointedout thathierarchicalpredictorscanavoid

cycle-time constraintsand that large second-level predictorscan give substantialincreasesin predictionaccuracy,

resultingin predictorsthatcouldbeaslargeandhavethesamesubstantialleakageasfirst-level caches.Theincreased

popularityof tournamentpredictorswouldrapidly increasethesizeof thebranchpredictoraswell. As leakageenergy

consumptionincreasinglybecomesa concernwith respectto dynamicenergy consumption,how to conserve leakage

�For questionsor comments,[email protected]

ACM TransactionsonArchitectureandComputerOrganization,Vol. V, No. N, Month20YY, Pages1–0??.

2 �

energy in a largearraystructuresuchasthebranchpredictorbecomesasignificantandimportantproblem.Therefore,

energy consumptionproblemin thebranchpredictorwill likely grow in importancein thefuture,especiallywith the

increasingimportanceof leakageenergy.

As decaytechniquesfor cacheshaveproveneffective,branchpredictorsarea naturalnext step.However, thereare

several importantdistinctions.First, it is lessobviouswhena branchpredictorentrymaybeconsidered“dead” and

canthereforebeturnedoff with little performanceimpact.First,many branchesmaymapto thesamepredictorentry.

Sincethis sharingis sometimesbeneficial,notionsof cacheconflictsandeviction do not translatedirectly into the

branchpredictionworld.

Second,a branchpredictorentry is not simply valid or invalid, asin a cache.A branchpredictorentrymayhave

reachedthe “strongly not taken” statedueto the effectsof several differentbranchesandmay be usefulto the next

branchthataccessesit, evenif this branchhasneverbeenexecutedbefore.

Third, branchpredictorentriesaretoosmallto deactivateindividually, soonemustconsidersomelargercollection,

suchasa row of predictorentriesin thesquarearrayin which thepredictoris likely implemented.Thechallengehere

is thatunlike thegroupingof datainto acacheline, thegroupingof branchpredictorentriesin a row is not something

for which applicationprogrammersandsystemsbuildershave a senseof spatiallocality. This paperevaluatesdesign

optionsrelatedto thesequestions.

Last, branchpredictorscanbe logically organizedin a numberof ways, from simplebimodal branchpredictors

[Smith1981],whichkeeponetwo-bit counterperpredictorentry, to multi-tablepredictorslikehybrid predictors[Mc-

Farling 1993],which operateseveralpredictionstructuresin parallel. For example,hybrid predictorsmayencounter

instanceswhenoneof the predictorcomponentshasbeendeactivatedbut the otherhasnot. The choosermight be

designedto pick the still active componentin suchsituations.For otherbranches,the choosermayexhibit a strong

biasfor onepredictorcomponentover the other. In this case,predictorentriesthat arenot beingselectedmight be

deactivated.

Traditionally, statepreservingstructureshave beendesignedwith six transistor(6T) SRAM cells; unlike DRAM

cells,SRAM cells retaintheir valuewhile charged. Thus,cachesneednot have built in circuitry to refreshthecells

asin mainmemory. With this in mind,decayandsimilar techniquesfocusonsomehow shuttingdown thesecells,for

exampleeithergatingthesourceor drainvoltagesto a signalthatis enabledwhenever thecell needsto beturnedoff.

By attachingdecaycountersto eachcell, onecandetectif thecell hasgone“long enough”without activity. If this is

thecase,thecell is assumedto beunusedandcanbedeactivated.

Traditionaltechniquesof deactivatingcellsinvolveattachingagatetransistorto thesupplyvoltage�� ; to deactivate

a cell, thegatetransistordisconnectsthesupplyto therestof thecell andthusthecell is effectively turnedoff [Yang

et al. 2001].Gated-�� basedcachedecaytechniqueshavea few problems,however. First,attachinga gatetransistor

to eachcacheline mighthaveanadverseimpacton thepitchof thecacheline, increasingtheoverallsizeof thecache.

In addition,the countersthemselvesconsumedynamicandstaticpower, which mustbe taken into accountaswell.

Onecannotassumethatpoweringup or poweringdown a transistorcomesfor free; instead,thereis somedynamic

costto theprocessroughlyof thesameorderasa cacheline access.

An alternativeto traditionaldecayis to useaquasi-static,4-transistor(4T) memorycell. 4T cellsareapproximately

asfastas6T SRAM cells, but do not have connectionsto the sourcevoltage( �� ). Rather, the 4T cell is charged

uponeachaccess,whetherreador write, andslowly leaksthechargeover time. Whenthechargein thecell hasbeen

depleted,the valuestoredis lost. Thus,they maybe referredto asquasi-static;like conventionalDRAM, they leak

their charge,but do not requirea full read-modify-writeoperationto accomplishacell refresh.

4T cellstypically appearin embeddedprocessors,wherelow-powerconsumptionis themostimportantaspect.4T

cellshave thespeedof SRAM but in termsof powerbehavemorelike DRAM. They have beenlargely overlookedin

general-purposeprocessorsbecausewithout periodicrefreshthey arenot truly static. However, in decaytechniques,

wherewe exploit the transientnatureof shorttermstorageby turning themon andoff dependingon usage,onecan

ACM Transactionson ArchitectureandComputerOrganization,Vol. V, No. N, Month20YY.

� 3

take advantageof 4T cells’ dynamicnature.Decayinga 6T cell seemssomewhat like squashingits greatestasset–

that it retainsits valuefor as long as it is poweredup. Implementingdecaywith 4T cells is thereforeconvenient;

asit decaysnaturallyover time if not accessed,this distinctive behavior fits perfectlywith the transientbehavior of

elementsin a cacheor branchpredictor.

Decaywith quasi-staticcellsisadvantageousto conventionaldecayin severalrespects.First,by virtueof having two

fewer transistorsandby eliminatingtheneedto run power rails lengthwisethroughoutthearray, 4T basedstructures

aresmallerthanequivalent6T structures.As 4T cellsdecayandrevivenaturallythroughtheread-writeprocess,some

of thecostsbehindactivatingdecayedcellsareeliminated.For example,awrite to a 4T cell automaticallyreactivates

the cell, whereasreactivatinga 6T cell is somewhat morecomplex, requiringextra hardwareinvolved in gatingthe

supplyvoltage.

Contributions

Becausebranchpredictorshave behavior morenuancedthancaches,cachedecaymethodscannotbe directly ap-

plied. This papershows thateffectivedecaytechniquescanneverthelessbedevisedfor branchpredictors.Thepaper

thengoesonto exploretheinteractionof decaypolicieswith someof thewealthof branchpredictordesignparameters.

We show that:

—Branchpredictors,like caches,exhibit a significantamountof spatialandtemporallocality, andthusdecaytech-

niquescanbeapplied.

—Unlike caches,a decayedpredictionneednot impactperformanceasmuchasa decayedcacheline. A decayed

prediction,whethera staticfallbackor a logically randomvalue,maystill bea correctprediction.

—Decaycanreducenet leakageenergy in theconditionalbranchpredictorby 40–60%andthe branchtargetbuffer

(BTB) by 90%.

—In hybrid predictors,decaypoliciescanachieve 50%higherreductionsin leakageenergy if thedecaypolicy takes

advantageof thehybrid predictororganizationto boostdecayopportunities.

—Decayis mosteffective for intervalsof 64K cyclesor larger. If decayis appliedtoo aggressively, extra mispredic-

tionsresultin significantcostsin bothperformanceanddynamicpower.

—Quasi-static,four transistor(4T) cellsareanaturalfit for decay, asthey graduallyleakchargeovertime. Moreover,

they suffer from lessleakageoverall than6T cells,makingthemanattractivechoicefor managingleakageenergy.

Thepaperis organizedasfollows. First,we discusstheorganizationandlayoutof branchpredictors,andestablish

that spatialandtemporallocality exists. We thenshow resultsfor basicdecayin (6T) branchpredictors.Next, we

discussthe structureandbehavior of the 4T quasi-staticcell, followed by the resultsof the decayimplementation.

Last,we discussrelatedwork andoffer our conclusions.

2. DECAY WITH BASIC BRANCH PREDICTORS

2.1 Logical Structure of Branch Predictors

Althougha wealthof dynamicbranchpredictorshave beenproposed,we focuson theeffectsof decayfor thegshare

andhybridpredictors,becausethey presentthemostinterestingtradeoffs.

Thegsharepredictor, shown in theleft half of Figure1, triesto detectandpredictsequencesof correlatedbranches

by tracking a global history (the global branchhistory register or GBHR) of the outcomesof the � most recent

branches.In gshare,the global branchhistory and the branchaddressare XOR’d to reducealiasing. This paper

modelsa 16K-entrygsharepredictorin which 12bits of historyareXOR’d with 14bitsof branchaddress.This is the

configurationthatappearsin theSunUltraSPARC-III [Song1997].

Insteadof usingglobal history, a two-level predictorcantrack local branchhistory on a per-branchbasis. Local

history is effective at exposingpatternsin the behavior of individual branches.Becausemostprogramshave some

ACM Transactionson ArchitectureandComputerOrganization,Vol. V, No. N, Month 20YY.

4 �

taken/not−taken

PHT(16K)

xor

branch address

GBHR

14

1412

component #1 (global)

component #2 (local)

taken/not−taken

GBHR (12)

PHT(4K)

selector(uses global hist)

PHT(1K)PHT

(4K)

BHT (1K x10)

Fig. 1. Gsharepredictorin theSunUltraSPARC-III (Left) and21264-stylehybridpredictor(Right).

branchesthatperformbetterwith globalhistoryandothersthatperformbetterwith local history, a hybrid predictor

[Changet al. 1995; McFarling 1993] combinesthe two. It operatestwo independentbranchpredictorcomponents

in parallelandusesa third predictor—theselector or chooser—to learnfor eachbranchwhich of thecomponentsis

moreaccurateandchoosesits prediction. This papermodelsa hybrid predictorwith a 4K-entry selectorthat only

uses12 bits of globalhistoryto index its PHT, a global-historycomponentpredictorof thesameconfiguration,anda

1K-entry local historypredictorwith a 10-bit wide BHT and1K-entryPHT. This configurationappearsin theAlpha

21264[Gwennap1996]andis depictedon theright sideof Figure1.

2.2 Proposed Decay Implementation

Logically, branchpredictorsarearraysof countersthataretypically just two bits wide. Like caches,however, branch

predictorsarephysicallyimplementedassquareor nearly-squarearraystructures.Thishelpsminimizethecomplexity

of therow andcolumndecodersandbalancewordlineandbitline lengthanddelay. Thepredictorarrayis thussimilar

to acachearray, exceptthatit needsno tags.For example,the16K-entrygsharepredictordiscussedabovecanbelaid

out asa �� arrayof 2-bit counters.Alternatively, it canbedividedinto 4 banks,eacha ! "#�$! " counterarray.

We referto thesetwo organizationsas“unbanked” and“banked” respectively.

Whenfirst consideringdecaytechniqueswith branchpredictors,it mightbetemptingto considerdeactivatingindi-

vidual predictorentries.This is untenable,however, becauseour methodsrequiretwo bits of stateperindependently-

activatedunit; suchoverheadwould be excessive if appliedto individual two-bit predictorentries. Thus, a cost-

effectivechoicefor turningoff thesecountersis at thegranularityof rows in thearraystructureratherthanindividual

entries. This requiresonly two bits of stateper row (the referencebit and the active bit), or a total overheadof

�&%(' )�*,+.-�/.)10 bits.

Our techniqueshave the following generalstructure. At regular intervals, all rows of predictorentriesnot been

usedduring the interval areassumedto have decayed andarethereforedeactivated. The interval, calledthe decay

interval, is measuredin processorcyclesandis a critical parameterfor theseschemes.The shorterthe interval, the

moreopportunitiesfor rows to bedeactivatedbut themorelikely it is to deactivaterowsprematurelyandinduceextra

mispredictions.Intervalslong enoughto minimizeextra mispredictions,on theotherhand,resultin thedeactivation

of fewerentries.

Figure2 showsapictureof a typical branchpredictorarrangedasa squarememoryarray. A groupof entries,then,

is a row of thesquarememoryarray. Onebit per row, thereference bit, indicateswhetherany predictorentry in that

row hasbeenaccessedwithin the last decayinterval. All referencebits areclearedat the endof eachinterval. A

secondbit for eachrow, theactive bit, indicateswhetherthatrow is currentlyactive (i.e., not decayed).If a predictor

lookup tries to accessa decayedrow, the predictorsignalsthata predictioncannotbe made;the row is re-activated

andpossiblyinitialized to somedesiredstartingstate;in themeantime,a default predictionis made.Uponactivation,


2 5

PHT index

two−bit counter

rowdecoder

columndecoder

assertedwordline

chosenbitlines

b b/2

b/2

Fig. 2. A schematicof asquarifiedbranchpredictortableof two-bit counters(thePHT).

our experimentsusea default of not takenandinitialize all thecountersto 01. Thus,subsequentbranchesusingthe

re-activatedline startin theweakly-not-takenstate.

Theactive ratio is theaveragepercentageof predictorrows foundto beactive (not decayed);it is a proxy for the

actualleakageenergyconsumedby thepredictor. Shorterdecayintervalsyield smalleractiveratios(andlargerleakage

energy savings),but performancemaysuffer, sinceusefulpredictorentriesaresometimesdeactivated.Exploringthis

power-performancetradeoff is a key objectiveof this paper.

To evaluatetheneteffectiveness,wecombinethereducedvalueof leakageenergy thatwasobservedwith decayand

theoverheadenergy associatedwith thedecaytechnique.For eachof thepredictortypeswe study, we thenplot the

normalized leakage energy for differentdecayintervalsagainsttheoriginal valuefor leakageenergy. This approach

for measuringthenetreductionin leakageenergy is similar to thetechniquesusedby Kaxiraset al. in their work on

cachedecay[Kaxirasetal. 2001].

2.3 Spatial and Temporal Locality in Branch Predictors

In orderto applydecayto branchpredictors,we mustfirst establishtheexistenceof spatialandtemporallocality. In

today’s machines,branchpredictorrows typically include32-256counterentries.Fortunately, programbranchesare

clusteredratherthanrandom,andacrossall thepredictororganizationsweexamineourexperimentsconsistentlyshow

thatsomerowshaveheavy activity while othersareidle andcanbedeactivated.

In the instructioncache,programsclearly exhibit spatiallocality. Over a shortperiodof time, only oneor a few

smallcontiguousregionsof theprogramarelikely to beactive,sobranchinstructionsarelikely to beclosein termsof

their PC.This alsotranslatesinto spatiallocality in branch-predictoraccesses.For branchpredictors,spatiallocality

meansthatat any point in theprogram,active rowsarelikely to havemany countersactiveandidle rowsarelikely to

beentirelyidle. This is mosttruefor thebimodalpredictor, which is indexedonly by PC.

If brancheswereuniformly distributed,we would expect the probability of two successive conditionalbranches

falling into thesamerow to becloseto 1/rows, or about2%for a4K-entrypredictorarray. To establishspatiallocality,

we mustshow that this probability is muchhigherthanthis. Indeed,the probability that two successive conditional

branchesfall into thesamerow in a4K-entrybimodalpredictoris greaterthan40%for all ourbenchmarks,andgreater

than50%for all but five.

For gshareor hybrid predictors,spatiallocality is not aspronouncedasfor bimodal,sincetheseotherpredictors

areindexedby a combinationof thebranchindex andglobalhistory. Thus,dependingon theglobalhistory, a single

branchcanindex into many differentPHTentries,acrossmany differentrows. Nevertheless,in half of thebenchmarks


6 31K cycles 10K cycles 100Kcycles 1M cycles Overall

gzip 24 32 45 103 281vpr 31 45 58 65 742gcc 2 9 79 193 512mcf 65 83 92 116 565crafty 104 305 592 855 1701parser 53 90 157 294 2265eon 81 289 357 415 652perlbmk 90 453 631 1112 1541gap 62 281 325 576 745vortex 124 502 1227 1642 1996bzip2 22 33 45 56 460twolf 48 210 300 334 351wupwise 42 52 53 55 193swim 3 6 11 15 687mgrid 3 6 9 25 500applu 1 2 4 7 579mesa 83 114 139 267 697galgel 2 6 8 10 508art 2 2 5 18 109equake 167 192 193 202 226facerec 7 24 25 39 144ammp 11 26 105 230 794lucas 3 3 3 4 242fma3d 80 450 452 465 499sixtrack 39 49 55 99 734apsi 14 85 117 125 342geomean 20 46 70 109 529max 167(equake) 502(vortex) 1227(vortex) 1642(vortex) 2265(parser)min 1 (applu) 2 (applu) 3 (lucas) 4 (lucas) 109(art)

TableI. Averagenumberof staticbranchestouchedevery decayinterval for SPEC2000.The rightmostcolumnlabeled‘Overall’ givesthe staticbranchfootprint for thewholesimulationperiod.

the probability of hitting the samerow as the previous branchis above 10%. This is muchhigher thana random

distribution (about1% for the large, 16K-entrygshare),and thuseven thesemorecomplex predictorsalsoexhibit

somespatiallocality.

Yet this is not usefulif theactive rows changerapidly, sowe mustalsoestablishtemporallocality. Oneimmediate

factorthatcreatestemporallocality is thefactthatmany benchmarkshave smallstaticbranchfootprints(thenumber

of uniquebranchinstructionsitesthat areexecuted),asseenin TableI. Decaywill thereforeclearly help bimodal

predictors,aseachstaticbranchtouchesonly onepredictorentry. FromTableI weknow thatthey areclustered.

3. EXPERIMENTAL SETUP

Thissectiondescribesoursimulationtechniqueandbenchmarks,andour methodfor evaluatingenergy savings.

3.1 Simulation Setup

Simulationsin this paperarebasedon theSimpleScalar3.0 toolkit [BurgerandAustin 1997]. Our modelprocessor

hasmicroarchitecturalparametersthat resemblein mostrespectsthe Intel P-III processor[Diefendorff 1999]. The

mainprocessorandmemoryhierarchyparametersareshown in TableII. For performanceestimatesandbehavioral

statistics,we useSimpleScalar’s sim-outorder simulator. For energy estimates,we usetheWattchsimulator[Brooks

etal.2000].WattchusesSimpleScalar’ssim-outorder cycle-accuratemodelandaddscycle-by-cycletrackingof power

dissipationby estimatingunit capacitancesandactivity factors.Becausemostprocessorstodayhavepipelineslonger

than5 stages,both our architecturalandpower modelsextendthe sim-outorderpipelineby addingthreeadditional

stagesbetweendecodeandissue.

We tried to matchascloselyin performancerespectsascloseto a P-III classprocessoraspossible,a 4-issuewidth

machine.As for theproposedbranchpredictors,weselectedbranchpredictorsfoundin other4-issuewidth machines,

suchasthehybrid foundin the21264andthe16K Gsharefoundin theUltraSparc-III.


4 7

ProcessorCoreInstructionWindow 40-RUU, 16-LSQIssuewidth 4 instructionspercycleFunctionalUnits 4 IntALU,1 IntMult/Div,

4 FPALU,1 FPMult/Div,2 MemPortsMemoryHierarchy

L1 D-cacheSize 32KB, 1-way, 32B blocks,3-cycle latencyL1 I-cacheSize 16KB, 4-way, 32B blocks,3-cycle latencyL2 Unified,256KB,8-way LRU,

32B blocks,8-cycle latency, WBMemory 100cyclesTLB Size 128-entry, 30-cycle misspenalty

TableII. Configurationof simulatedprocessor.

3.2 Technology Assumptions and Circuit Simulations

Theresultsof this research,while broadlyapplicable,areevaluatedbasedon particulartechnologyassumptions.We

chosea featuresizeof 57698�: ; , a <�=�= of 1.5V anda clock speedof 1 GHz. We use6T and4T cells from cell libraries;

no customdesignsareassumed.4T cellsaresimpleto design,do not requirespecialprocesstechnology, andcanbe

includedin librarieswith otherSRAM memorycells.

As processtechnologyprimarily determinesleakagecurrents,with eachsuccessive generationleakageincreases

substantially. In particular, Figure3 illustratesthisprogressionby illustratingleakagecurrents(in nA) for four industry

processtechnologiesfrom at AgereSystems.At 2.5V and0.25; m, COM1 is largely out of date;we do not consider

it further. COM2 is a reasonablycurrentCMOSprocess,with a 1.5V supplyvoltageand0.16; m featuresize.COM3

andCOM4 areslightly moreforward-lookingCMOSprocesses.COM3 is a 1.0V, 0.12; m process,andCOM4 is a

1.0V, 0.1; m process.

Transistor Leakage Current at 25C

0.1

3

10

100

0.01

0.1

1

10

100

COM1 COM2 COM3 COM4

Technology Node

Tra

nsi

sto

r L

ea

kag

e C

urr

en

t (n

A)

Fig. 3. Leakagecurrentpertransistorfor a rangeof technologiesin useat Agere.

Theleakagecurrentsareshown for roomtemperature(25> C). Althoughthis understatesthemagnitudeof theleak-

ageseenin operatingchipswherethetemperatureis likely to bemuchhigher, our dataagreeswith otherpredictions

[SemiconductorIndustryAssociation2001]that leakageis growing exponentiallywith successive technologygener-

ations.

Our transistormodelsarebasedon thecurrently-in-productionCOM2 process.We useAgere’s SPICE-like simu-

latorsfor very detailedcircuit simulationswith a 25 pico-secondresolution.Although leakagesavings with COM2

aresmall,COM2 is a processfor which we cangatherverifiablesimulationresults.As such,it is a usefulvehiclefor

detailedcircuit studiesof hold timesandtemperatureeffects.Leakagecurrentwill beaseriousproblemin subsequent

generations,andthis papershowssubstantialsavingsfor theCOM3andCOM4processes.


8 ?

Fig. 4. Current-Voltagecurvesfor COM2transistorsatvaryingtemperatures.

0.01

0.1

1

10

100

1000

-25 0 25 50 75 100 125

Transistor Junction Temperature (C)

Tra

nsi

sto

r L

ea

kag

e (

na

no

-a

mp

ere

s)

Fig. 5. Transistorleakagecurrent(nA) for varyingtemperaturesfor theCOM2process.

Figure5 shows the exponentialrelationshipof temperatureto transistorleakagecurrentsastemperatureis varied

for 1.5V 0.16uCOM2transistors.Variationsin temperatureresultin largevariationsin leakage,andthesewill impact

designtradeoffs. Our designstargetanoperationaltemperatureof 85 C (appropriatefor high-performanceor mobile

processors)but wealsosimulatedveryhightemperature(125C) operation,appropriatein hardto controlenvironments

likeautomobiles.

3.3 Benchmarks

We evaluateour resultsusingbenchmarksfrom the SPECCPU2000suite [The StandardPerformanceEvaluation

Corporation2000].Thebenchmarksarecompiledandstaticallylinkedfor theAlpha instructionsetusingtheCompaq

Alpha compilerwith SPECpeak settingsandincludeall linkedlibraries.For eachprogram,we skip thefirst 1 billion

instructionsto avoid unrepresentativebehavior at thebeginningof theprogram’sexecution.We thensimulate500M

(committed)instructionsusing the referenceinput set. Simulationis conductedusing SimpleScalar’s EIO traces,

which include all external inputsand outputsinto the programto ensurereproducibleresultsfor eachbenchmark

acrossmultiple simulations.


@ 9

TableIII summarizesthebenchmarksusedandtheir predictionaccuracieswith severalpredictors.

DynamicConditional PredictionRate PredictionRate PredictionRateBranchFrequency w/ Bimod4K w/ Gshare16K w/ Hybrid 21264-style

gzip 9.40% 87.58% 90.67% 92.40%vpr 11.08% 90.00% 97.68% 97.03%gcc 2.56% 99.61% 99.74% 99.75%mcf 19.40% 98.42% 99.27% 98.94%crafty 11.13% 91.76% 93.33% 94.38%parser 15.60% 91.25% 94.31% 94.89%eon 11.08% 81.03% 89.71% 91.76%perlbmk 12.43% 95.15% 97.29% 97.60%gap 6.62% 90.10% 96.50% 96.96%vortex 16.00% 97.85% 97.61% 97.87%bzip2 12.13% 94.06% 94.05% 94.12%twolf 12.24% 86.44% 88.53% 88.48%wupwise 10.50% 92.05% 97.54% 96.66%swim 1.35% 99.36% 99.54% 99.55%mgrid 0.33% 92.57% 97.77% 97.95%applu 0.28% 93.07% 98.70% 98.21%mesa 8.73% 94.09% 96.98% 96.31%galgel 6.09% 99.13% 99.27% 99.29%art 11.29% 92.99% 99.02% 96.40%equake 17.13% 98.04% 99.39% 99.28%facerec 3.49% 97.92% 99.04% 99.21%ammp 21.67% 98.77% 99.27% 99.23%lucas 8.67%% 90.84% 98.70% 98.84%fma3d 18.09% 95.93% 98.39% 97.68%sixtrack 8.08% 89.58% 99.72% 99.78%apsi 3.51% 86.22% 96.52% 97.12%

TableIII. Benchmarksummary.

3.4 Energy Evaluation

The net energy savings from branchpredictordecaymust accountfor two oppositeeffects resultingfrom decay.

Turningoff A�B�B in idle countersor rowscanpreventthemfrom leaking;however, decayingthecounterscausesthem

to losetheirstoredhistoryinformationstored,possiblydegradingtheperformanceof thebranchpredictorandresulting

in moreenergy spenton mis-speculationanda longerexecutiontime.

WeuseWattch[Brooksetal.2000]to measurethetotalprocessordynamicenergywith andwithoutapplyingdecay,

andsubtractthemto obtainthedynamicenergy overhead.For leakagepower, Yanget al. [Yanget al. 2001]estimate

that with a C�C�DFE C operatingtemperature,a 1 GHz operatingfrequency, a supply voltageof 1.0V anda threshold

voltageof 0.2V, a SRAM cell consumesabout C1G�HFDJIKCLDNM�O nJleakageenergy percycle. This is roughly in line with

industrydatathatAgerehasfoundwith theirCOM3andCOM4processgenerations[Diodato2001].

Basedon theabovedata,we estimatethat the4K-entrybimodalpredictor, the16K-entrygsharepredictor, andthe

21264stylehybrid predictorconsumeabout0.014nJ,0.057nJ,and0.0517nJleakageenergy eachcycle, respectively.

In general,the overall leakageenergy for thebranchpredictorcanbecalculatedasleakage energy per bit per cycle

* #bits * #cycles. Theleakageenergy of theextra statusbits canbecalculatedsimilarly. Sincethesestatusbits only

switch infrequently(at most onceevery decayinterval – e.g. every 64K cycles),we ignore their dynamicenergy

overheadin ourcalculation.

4. DECAY WITH SIX-TRANSISTOR STATIC CELLS

To establishthe effectivenessof basicbranchpredictordecay, we first needto examinehow decayinteractswith

basicbranchpredictor. Using conventional6T cells andgated-A�PQP / A B�B techniques,we simulatedthreebasicbranch

predictors– a bimodalpredictor, a gsharepredictor, anda hybridpredictor.


10 R

4.1 Bimodal Predictors

Figure6 showstheactiveratioanddirectionpredictionaccuracy for 4K-entrybimodalbranchpredictorswith different

decayintervals.We seefrom theactiveratiographthatdecayis veryeffectiveat shuttingoff idle countersin bimodal

predictors. The geometricmeanactive ratio, which is the percentageof predictorentriespoweredon (so smaller

valuesarebetter),is 37%,28%,22%and18%for decayintervalsof 4096K,512K,64K and8K cycles,respectively.

Thismeansthatdecaytechniqueshavethepotentialto reducebranchpredictorleakageby 2X or more.Sincebimodal

predictorsareindexedby PC,benchmarkswith largestaticbranchfootprintstendto havehigheractiveratios,asshown

in crafty, perlbmk, gap and vortex. Otherbenchmarkstypically leave morethanhalf of their countersdeactivateddue

to their smallfootprints.

Furthermore,the predictionrate graphshows that shuttingoff theseidle counterscomeswith minimal loss in

performanceexceptfor the apparentlytoo-aggressive decayinterval of 8K cycles. With a 64K cycle decayinterval,

theoverall lossin predictionaccuracy is about0.14%,with only onebenchmark(mgrid) over1%.

Interestingly, in somebenchmarks(wupwise and facerec) we actuallyobserved tiny improvementsin prediction

accuracy. We attribute this to the effect of removing somedestructive interference.Interferenceoccurswhentwo

differentbranchesmapto thesamecounter. Theinterferenceis destructive whenthebranchesarebiasedin opposite

directions,for example,whenoneof the conflicting branchesis strongly taken andthe other is stronglynot taken.

Deactivation resetsthe counterto the neutral,weakly not taken value,which effectively isolatesthe two conflicting

branches.

Figure 7 comparesthe leakageenergy beforeandafter applying decay. Whendecayis enabled,we add to the

leakageenergy the extra leakageenergy from the decaystatusbits aswell asany dynamic-energy overheaddueto

extra mispredictionsor longerrun time. As thedecayinterval lengthens,fewer mispredictionsareinduced,theextra

dynamicenergy incurredbecomesminimal,andover60%of theoriginal leakageenergy canbesaved.

4.2 Gshare Predictors

To evaluateif decaywill beeffective for a gsharepredictor, we mustfirst look at theactive ratio overvariousinterval

lengths.Figure8 shows thegeometricmeanof theactive ratiosacrossthebenchmarksfor bothbankedandunbanked

16K-entrygsharepredictorsand,asa reference,the 4K-entry bimodalpredictor. As expected,the active ratiosare

quite small (i.e. goodfrom a decaypoint of view) for the bimodalpredictor. Becausegshareis designedto spread

branchstateacrossthepredictor, the active ratio will be larger. Yet, significantnumbersof rows remainuntouched.

This indicatesthatdecay-basedtechniquesstill show significantpromisefor addressingleakageconcerns.

Figure8 shows datafor an unbanked aswell asa banked versionof gshare. Breakingthe predictorinto banks

makestheactive ratio smaller(betterfor decay)by reducingthegranularityoverwhich activity is measured.Indeed,

theactive ratio for gshareis 15–35%smallerif it is brokeninto four banksof 4K entrieseach.Overall, theseactive

ratios yield leakageenergy savings of 40% for unbanked gshareandabout50% for banked gshare. Even greater

savings canbe achieved for structuresdirectly indexed by PC: about65% for bimodalpredictorsand90% for the

BTB. For moredetails,referto [Hu etal. 2001].Notethatenergy savingsarepresentedin termsof netleakagesavings

in the branchpredictor, not the total processor, to separatenet leakageenergy savings in the branchpredictorfrom

potentialnetleakagesavingsindependentof thebranchpredictor.

Figure9 showstheactiveratioandbranchmispredictionratesfor anunbankedgsharepredictoronaper-benchmark

basis. The geometricmeanactive ratio acrossthe 26 benchmarksis 46% for a 64K-cycle decayinterval, with a

negligible drop in predictoraccuracy. While its decaybenefitsare not as pronouncedas for a bimodal predictor,

neverthelessdecaystill producessubstantialreductionsin leakagepower with minimal performanceimpact. Figure

10 shows thattheoverall reductionin leakageenergy is 41%for a 64K-cycledecayinterval.

Furtherenergy savingscanberealizedusinga bankedorganization,which givesa reductionin leakageenergy of

51%.Notethatwith thebankedpredictor, thereductionin leakageenergy will besomewhatlessthanthereductionin


S 11

0

10

20

30

40

50

60

70

80

90

100

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

acti

ve r

atio

(%)

orig 4096K cycle decay interval 512Kc 64Kc 8Kc

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

bp

red

dir

rat

e


Fig. 6. Active ratio (Top)andpredictionsuccessrate(Bottom)for a4K-entrybimodalpredictor

active ratio for two reasons.First, becausethebankedorganizationhassmallerrowsandhencetheoverheadof more

decaystatusbits. Second,thebankedorganizationis moreaggressive thanunbankedgsharein deactivatingrows for

thesamedecayinterval. Thismakesit morelikely thatthebankedorganizationdeactivatesarow thatharmsprediction

accuracy. This extra dynamic-power overhead,however, decreaseswith longerdecayinterval becausefor very long

intervals,evena bankedrow that is idle is extremelylikely to have genuinelydecayed.This is why thedifferencein

energy savingsis largerfor 512Kcyclesand4096Kcyclesthanfor 64K cycles.

4.3 Hybrid Predictors

With two competingcomponents(the global componentandthe local component),hybrid predictorsexhibit many

interestingdesignchoiceswhenimplementingdecay. In this sectionwe will explore thesedesignchoicesaswell as


12 T

0.0

0.2

0.4

0.6

0.8

1.0

1.2

orig 4096K�cycle 512Kc 64Kc 8Kc

decay interval

no

rmal

ized

bp

red

leak

age

ener

gy

Fig. 7. Normalizedleakageenergy for the4K bimodalpredictor

0

20

40

60

80

100

1000 10000 100000 1000000 orig

decay interval (cycles)

acti

ve r

atio

(%

)

gshare unbanked gshare banked bimod

Fig. 8. Meanactive ratio for unbanked andbanked gsharepredictorsanda bimodalpredictorwith differentdecayintervals. Therightmostlabel

“orig” correspondsto non-decayingpredictors.

their effectondecayin anAlpha 21264-stylehybridpredictor.

4.3.1 Selection Policy. The selectionpolicy refersto the policy for choosinga predictionfrom oneof the two

componentpredictors. In a non-decaying21264-stylehybrid predictor, the choosermakes this decisionusing the

globalhistory;seeFigure1. However, whendecayis enabled,it mayhappenthatonly oneof thetwo componentsis

active while theotheris decayed.In this case,sincethedecayedcomponenthaslost its information,it is intuitively

appealingto usethepredictionfrom theactive component,no matterwhatthechoosersuggests.This policy is called

“believe the active component,” and is implementedin all our experiments. It may also happenthat both the two

componentsaredecayed,in which caseall componentsarereactivatedandthe branchis predictedas“weakly not

taken,” asin bimodalandgsharepredictors.

4.3.2 Wakeup Policy. The wakeuppolicy refersto the decisionof whetherto reactivatea decayedrow whenit

is accessedby a branchinstruction. A naive policy would alwayswake up any rows that areaccessed.In a hybrid

predictor, a moreelegantpolicy is possible:thedecayedcomponentwill be reactivatedonly if thechooserwantsto

selectit. We referto thispolicy as“believe thechooser.”

In the situationwhen the accessedrow in the chooseris decayed,we know that the global componentis also

decayedin the 21264-stylepredictor. This is becausethe chooserandglobal predictorare indexed, accessedand

thusdecayedin exactly thesameway; seeFigure1. In this case,thechooserhasno usefulinformation. If the local


U 13

0

10

20

30

40

50

60

70

80

90

100

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

acti

ve r

atio

(%)


0.00

0.05

0.10

0.15

0.20

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

mis

pre

dic

tio

n r

ate


Fig. 9. Active ratio (Top)andmispredictionrate(Bottom)for unbankedgsharepredictor

componentis active, thenwe leave thechooserandtheglobalcomponentinactive andreturnthepredictionfrom the

local component.Otherwise(whenthe local componentis alsoinactive), we reactivateall componentsandreturna

predictionof “weakly not taken.”

4.3.3 Results. Figure11detailstheactiveratioandbranchmispredictionratefor naivedecay, whichalwayswake

upany rowsthatareaccessed,with a21264-stylehybridpredictor. Weseethateventhoughtheactiveratiosarehigher

(worse)thanfor bimodalor gsharepredictors,decayhasa negligible impacton themispredictionratefor intervalsof

64K cyclesor larger. Notethatin orderto computeactiveratiosensiblyon amulti-tablestructure,wecomputeit over

all predictionandchooserbits in thestructure.Overall,asFigure12 shows,naive decayrealizesstrongreductionsin

energy savings—40%for a 64K-cycle interval.


14 V

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6


decay interval

no

rmal

ized

bp

red

leak

age

ener

gy

banked unbanked

Fig. 10. Normalizedleakageenergy for gsharebranchpredictorfor bothunbankedandbankedpredictors.

0102030405060708090

100

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

acti

ve r

atio

(%

)


0.00

0.02

0.04

0.06

0.08

0.10

0.12

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

mis

pre

dic

tio

n r

ate


Fig. 11. Active ratio (Top)andmispredictionrate(Bottom)for 21264’shybridpredictor

We canobtain even betterenergy savings by taking advantageof the “believe the chooser”wakeuppolicy. As

shown in Figure12, this moresophisticatedpolicy leadsto leakagepower reductionsabout50% betterthanfor the

naivepolicy. Weachievethisby exploiting informationfrom thechooser. Thechooserpointsto aparticularcomponent

for a reason– thatis, onecomponentis performingbetterthantheother, andthuswe only needinformationfrom the

superiorcomponent.


W 15

0.0

0.2

0.4

0.6

0.8

1.0

1.2

orig 4096K�cycledecay�interval

512Kc 64Kc 8Kc

decay interval

no

rmal

ized

bp

red

leak

age

ener

gy

naive believe the chooser

Fig. 12. Normalizedleakageenergy of 21264-styleHybrid predictorwith naive and“believe thechooser”wake uppolicies

4.4 Decay with the Branch Target Buffer

In additionto thestructurefor predictingthedirectionof conditionalbranches,a branchtargetbuffer (BTB) [Holgate

andIbbett1980;Losq1982]is commonlyusedto storethetargetsof takenbranches.TheBTB is typically organized

like a cache,eitherdirect-mappedor setassociative,but taggedin orderto identify hits andmisses.Sinceit is solely

indexed by PC, it hassimilar locality characteristicsas instructioncachesand bimodal predictors. However, the

granularityin theBTB is singlebranchinstructions,insteadof instructionblocksasin instructioncaches.Thisallows

evenfinercontrolfor decay. Decayis thereforeespeciallyeffective for BTBs.

We evaluateda 2048-entry, 4-way associative BTB, which appearsin the Intel PIII processor[Diefendorff 1999].

Exceptfor vortex, perlbmk andcrafty, mostbenchmarkshavea very low active ratio,below 20%or evenbelow 10%.

This is no surprise,consideringthat the staticfootprintsof mostbenchmarksin TableI arevery small comparedto

ourBTB capacity. With decay, thesestaticfootprintsareautomaticallytrackedandall otheridle slotsareturnedoff to

save leakageenergy. On theotherhand,BTB hit ratesobserve only negligible degradationuntil decayinterval drops

to 8K cyclesor less.Overall,with decayintervalsof 64K cyclesor more,about90%of theBTB leakageenergy can

besaved.

4.5 Summary and Discussion

In this section,we have exploredthe applicationof decaytechniquesto reduceleakageenergy in branchpredictor

structures.Theparticularpoliciesthatwehavedeveloped(e.g., “believethechooser”)applyto all non-state-preserving

leakage-controlmechanisms.Thekey contributionsaremoregeneral,giving usefulguidanceto any work on leakage

control. In particular, ourlocalityanalysisshowsthatall predictororganizationsweexaminedhavesignificantnumbers

of rows that are inactive for long periodsof time. This makes leakagecontrol in the branchpredictorand BTB

profitable,regardlessof mechanism.Additionally, prior work by Jimenezet al. [Jimenezet al. 2000], andParikh

et al. [Parikh et al. 2002] suggeststhat branchpredictorsshouldbe larger for both performanceandpower-saving

considerations,but that localized-heatingconcernsmay be an obstacle. This makes leakagecontrol in the branch

predictorandBTB particularlyimportant.

Over all the configurationswe explored, the bestreductionsin leakagepower were achieved with the bimodal

predictor, but thepower savingsachievedwith the “believe the chooser”policy for hybrid predictionwerenearlyas

good. Sincehybrid predictionhassuperiorpredictionaccuracy andhenceperformanceandoverall energy, hybrid

predictionremainsthe bestchoicefrom both a performanceandenergy standpoint[Parikh et al. 2002]. Overall, as

branchpredictionaccuracy andexecutiontime wereincreasednegligibly, thereductionin leakageenergy consumed

by thebranchpredictorunit causestheprocessorto saveenergy asa whole.

Someof our interestingresultswerespecificallydueto thefactthattherearetwo independentcomponentpredictors


16 X

0

10

20

30

40

50

60

70

80

90

100

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

acti

ve r

atio

(%)

orig 4096Kc decay interval 512Kc 64Kc 8Kc

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

gzi

p

vpr

gcc

mcf

craf

ty

par

ser

eon

per

lbm

k

gap

vort

ex

bzi

p2

two

lf

wu

pw

ise

swim

mg

rid

app

lu

mes

a

gal

gel ar

t

equ

ake

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

bp

red

ad

dr

rate


Fig. 13. Active ratio (Top)andaccuracy (Bottom)for theBTB with differentdecayintervals.

in a hybridpredictor. In the21264’shybridpredictor, for example,thechooserandglobalcomponentwereorganized

in thesameway. This let usdeduceusefulfactsaboutonebasedonstatein theother. Moregenerally, theinfluenceof

indexing on decaymaysuggestthatthechoiceof predictorindex is worth revisiting (yet again).In thepast,indexing

functionshave beenchosento minimizealiasing,usuallyby spreadingthestatefrom branchesin the sameworking

setaswidely aspossible.In contrast,decaycanbemoreeffectiveby clusteringsimilar statesinto rows,with thegoal

of makingasmany countersin a row aspossibleidle or active at any point in time. This observationthereforesets

up a tradeoff betweenreducingaliasingandincreasingdecayopportunities.Althoughbeyondthescopeof this paper,

seekingindex functionsthatcanbalancethetwo factorsis an interestingareafor futurework. It maybepossibleto

develop tractableindex functionsthat clustersimilar stateinto rows with minimal increasein aliasing. In addition,

largepredictorsin particularmaybeableto accommodateindex functionsthatboostrow clustering.


Y 17

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0


decay interval

no

rmal

ized

bp

red

leak

age

ener

gy

Fig. 14. Normalizedleakageenergy for theBTB with differentdecayintervals

As for theactualdecayintervals,this paperappliesonly fixeddecayintervals. Kaxiraset al. [Kaxiraset al. 2001]

show that for caches,even betterdecayratescanbe achieved with decayintervals that adaptto changingprogram

behavior. Otherwork by Zhouet al. [Zhouetal.2001],Velusamyet al. [Velusamyetal. 2002],andSankaranarayanan

andSkadron[SankaranarayananandSkadron2004] also improved on cachedecayby selectingbetterpolicies for

adaptingthe decayinterval. While studyingadaptive branchpredictordecaywasbeyond the scopeof this paper, it

remainsworth investigatingsinceour datashowed that somebenchmarkssufferedno performancelossevenwith a

short,8K cycledecayinterval (andwould getevenmorebenefitfrom decay),while otherssuffereddrastically.

A furtherconsiderationis thatbankedandmulti-tableorganizationsprovidesubstantialbenefitsin reduceddynamic

energy, by reducingword- andbit-line lengths.Furthermore,whendecayis appliedto multi-tablepredictors,some

tablescanbe left inactive, asin the“believe thechooser”policy. Anotherexamplearisesin local-historyprediction,

wheretiming issuesmayrequirecachingthepredicteddirectionfor eachBHT entryin theBHT. As with “believe the

chooser”,decaymight permit the PHT updateandlookup in sucha local-historyorganizationto be omittedif that

PHTentryis inactiveandthecacheddirectionwascorrect.Suchapolicy mightbecalled“believetheBHT”. Not only

do these“believe” policiesreduceleakageenergy, they avoid thedynamicpowerassociatedwith accessingthattable.

Our work heredid not modeltheseextra savings,but doingsowould only makedecaymorevaluable.

Anotherissuethatwarrantsstudyfurtheris interferencein branchpredictors.In someexperimentsweobservedmild

but interestingimprovementsin predictionratewith decay. This shows thatdecay(by settingthetwo-bit countersto

a weakstate)may have the effect of reducingdestructive interference,somethingwe plan to quantify in our future

work. Lastly, sincemany otherpredictionstructuressuchasvaluepredictors[Lipasti etal. 1996]andprefetchaddress

predictors[Lai et al. 2001]areorganizedsimilarly to branchpredictors,an interestingfuture effort will be to apply

strategiesfoundin this paperto thesestructures.

5. DECAY WITH FOUR-TRANSISTOR RAM CELLS

Thusfar we have exploredthe useof traditionalSRAM cells andtheir usein decay. Traditionaldecaytechniques

involvemanipulatingthepower supplyto createtheillusion of dynamicbehavior on top of a cell originally designed

andselectedfor its ability to hold its valueindefinitely. Viewed in this light, it seemsnaturalto usequasi-static4T

cellsto implementdecayingbranchpredictors.


18 Z

5.1 The Quasi-Static 4T Cell

Basicfour-transistor(4T) DRAM cellsarewell establishedanddescribedin introductoryVLSI textbooksandvarious

articles[Lyons et al. 1987; Noda et al. 1998; Wolf 1998]. 4T cells are similar to ordinary6T cells but lack two

transistorsconnectedto [�\�\ that replenishthe charge that is lost via leakage(Figure15). Using exactly the same

transistorsasanoptimized6T design,the4T cell requiresapproximately2/3 of thearea.For example,undera 0.18]process,an ordinary6T cell occupiesan areaof 5.51] m , whereasa 4T cell would occupy anareaof 3.8] m (i.e.,

69%of thearea[Diodatoet al. 1998]).

Essentially, the4T cell is justasfastasa6T cell. While ourdatademonstrateaslightspeeddisadvantage,thediffer-

enceis sosmall thatcoupledwith thesmalleramountof parasiticinterconnect,thedifferenceessentiallydisappears.

More importantly, 4T DRAM cells naturallydecayover time (without the needto switch themoff); oncethey lose

their chargethey leakvery little sincethereis no connectionto [�\�\ . However, somesecondaryleakagevia theaccess

transistorsstill remainsdueto bit-line precharging,which wedo take into accountin our transistor-level simulations.

Precharging, in both SRAM andDRAM designs,sapssomeof the power savings; any charge loss,power dissi-

pation,or cell behavior that is a resultof precharging is accountedfor in our simulation.Therearevariousformsof

precharge(full- [ \�\ , half-[ \�\ , 2/3-[ \�\ ) andthey vary from thosethatarenormallyon to thosethatarenormallyoff.

We usefull- [ \�\ in our simulations.Theprechargecontrolcircuitry in thesimulationis normallyoff; theprechargeis

on for 33%of thetimeandthenoff at thespecificmomentof areador write. Whethertheprechargetransistorsareon

or off, residualchargeonthebit-line doesleakthroughtheaccesstransistorsandeventuallyis lost to thegroundnode.

Evenwhenall residualchargeis removedfrom thebit-line, theleakageof theprechargetransistorssupplya low level

of currentflow throughtheaccesstransistors.

B B

WW

VddB B

WW

Fig. 15. Circuit diagramsof the6T SRAM cell (left) andthe4T quasi-staticRAM cell (right).

In addition,4T cellsareautomaticallyrefreshedfrom theprechargedbit lineswhenever they areaccessed.Whena

4T cell is accessed,its internalhigh nodeis restoredto high potential,refreshingthe logical valuestoredin it; there

is no needfor a read-writecycle asin 1T DRAM. As the cell decaysandleakscharge, the voltagedifferenceof its

internalnodesdropsto thepointwherethesenseampscannotdistinguishits logicalvalue.Conservatively, thisoccurs

whenthe nodevoltagedifferentialdropsbelow a thresholdof the orderof 100 mV (for 1.5V designs).Below this

thresholdwe have a decayedstate,wherereadinga 4T DRAM cell mayproducea randomvalue(not necessarilya

zero).Overa long time thecell reachesa steadystatewhereboththehighnodeandthelow nodeof thecell “float” at

about30mV (for 1.5V designs).

Thus4T cellspossesstwo characteristicsfitting for decay:they arerefresheduponaccess,anddecayover time if

not accessed.In therestof this sectionwe discussextensively the4T decaydesign,includingdecayor dataretention

times,dynamicenergy andotherconsiderations.


_ 19

5.2 Hold Times In 4T Cells

The critical parameterfor a 4T designis retention time. Whenimplementedin 4T cells,decaytechniqueshave the

cell’s retentiontime astheir naturaldecayinterval. We defineretentiontime to bethetime from thelastaccessto the

time whentheinternaldifferentialvoltageof thecell dropsbelow thedetectionthreshold.Hold time dependson the

leakagecurrentspresentin the4T cell which in turn dependon processtechnologyvariationsandtemperature.

To study retentiontimes for the 4T cachedecaywe choseAgere’s COM2 0.16uCMOS processfor which we

have accuratetransistormodels. We alsouseCOM2 for this studybecauseit is the mostmodernof the four COM

processesthat is availablein-house,in realsilicon, to validateour modelsagainstrealmeasurements.Althoughthis

particulartechnologydoesnot suffer excessively from leakage,our analysisscalesto futuregenerations.Subsequent

sectionswill discusshow to manipulateretentiontimesto matchapplicationneeds,soalthoughour initial retention

timenumbersarefor COM2,we feel they areachievablein COM3 andCOM4 aswell.

Variationsin the processtechnologysignificantly affect leakagecurrentsand thereforeretentiontimes. In our

studiesweusethreereferencepointsin theprocesstechnology:(i) Nominaltransistors(NOM) representtheexpected

behavior andconstitutethebulk of theproduction;(ii) WorstCaseFast(WCF) transistorsaretransistorsthataresix

standarddeviations(sigma)( `K`baFaNc a�aed ) from theprocessmeanin termsof speed.They arevery fastbut very leaky

transistors;(iii) WorstCaseSlow (WCS)transistorsaretheoppositesix-sigmapoint in theprocessdistribution,where

transistorsarequiteslow but leakfar less.

TheWCFandWCSareextremecasesthatby definitionrarelyappearin production;they areessentiallytheoretical

constructsusedasbounds.In general,fastchipsarelikely to leakmoreandthushaveshorterretentiontimes.However,

fastchipsarealsoclocked at higherspeedswhich meansthat even their short retentiontimescorrespondto many

cycles. It is thecycle countthatmatters,not real time, sincedecayis anarchitecturalphenomenoncharacterizedby

cyclecountsandlargely independentof cycleduration.

As mentionedpreviously, variationsin temperaturealsoresult in large variationsin retentiontimes. Our designs

targetanoperationaltemperatureof 85C (appropriatefor examplefor mobileprocessors)but wealsodiscussmecha-

nismsto adjustin veryhigh temperature(125C).

Finally, selecting3.3V I/O transistors— readilyavailablein COM2 technology— to replacethe1.5V transistors

while maintaining1.5V signalingin the4T cellssignificantlyextendsretentiontimesat theexpenseof increasedcell

area.Evenin this casethe4T cell, areais still lessthanthatof the6T cell. Building 4T cellsout of 3.3V transistors

while operatingthemat 1.5V is feasiblebecauseof thenatureof the4T cell which actsasa placeholderfor charge.

Thesamecannotbedonefor anactive6T circuit (two cross-coupledinverters)whichrequiresits transistorsto befully

biasedto work correctly.

1.5V 3.3VRETENTIONTIME RETENTIONTIMEf�gih

C j gih C k f�gih Cf�gih

C j gih C k f�gih CNOM 18,000 1700 560 1,040,000 57,200 9400

TableIV. Hold timesin nanosecondsfor 1.5V and3.3V versionsof 4T cells at differentoperatingtemperatures.For a 1 GHz (1nscycle time)processor, onecanalsoconsidertheseretentiontimesascycle counts.

Basedon theseassumptions,we determinedretentiontimes for our technologythroughdetailedtransistor-level

simulations.We simulatedanaccessto acell, followedby a longperiodin which thecell wasleft unread.During this

time, leakagecausesthecell’s internalnodesto losecharge. We definedthehold time to bethedurationbetweenan

accessandthepointatwhich thedifferentialvoltagesof the4T cellsinternalnodeslapsedto avaluelessthan100mV.

We used100mVasour criteria for theminimumvoltageat which we would expectthesenseampsto distinguishthe

logical state. If the 4T cell is readafter it hasdecayedbeyond this point, it still operatessafelyanddoesnot raise

metastabilityor otherconcerns;it just doesnot have the datait hadbefore. Readinga decayedcell, asshown and


20 l

verifiedin our circuit models,producesanelectricallystable,thoughlogically random,value.It is essentialtherefore

to know well in advancewhendatahavedecayedin a4T cache.Thisnecessitatesdecaycountersnot to turnthebranch

predictorrow on andoff asin previouswork but to just signalthedecayof a branchpredictorrow. TableIV givesthe

cell retentiontimesin nanosecondsfor theCOM2 technology. Theseretentiontimesarefor a 4T cell usingthesame

highly optimized6T cell transistors.

5.3 Sense Amplifier Considerations

Oneimplication of the 4T cell is that it becomesprogressively moreexpensive to readas it decays(similar to the

behavior of DRAM). As thecell decays,thesenseampmustwork harderto detectandamplify thedifferentialsignal

on thebit lines.Besidesthesenseampenergy thereis alsoenergy expendedto refreshthecell aswe readit. We have

studiedbothof theseeffectsfor all thetypesof 4T cells(with 1.5V and3.3V devices)andfor variousstagesof decay.

TableV givesthereadenergy for the4T cell at variousinternalvoltages.

4T 6TInternal ReadEnergy (fJ) ReadEnergy (fJ)voltage 1.5V 3.3V 1.5V1.5V na 149 1561.0V 166 151 na0.7V 171 153 na0.5V 175 155 na0.1V 219 199 na

TableV. Readenergies in fJ for nominal1.5V and3.3V 4T cells at m�nio C for differentstatesof decay(i.e., different internalvoltages).Valuesincludeenergy dissipatedin thecell, precharge circuits,andsenseamplifier. The3.3V 4T cellsareoperatedat 1.5V. An internalvoltageof 1.5Vis not availablein the 1.5V 4T becausethe 4T cell internalnodevoltagesareconstrainedby the cut-off region of the MOSFETSusedasaccesstransistors.This affect couldbeovercomeby overdriving thewordlines,asin conventionalDRAM design”

5.4 Locality Considerations

Granularityis alsorelevantin the4T designbut hereit stemsfrom theway4T cellsarerefreshed.As readinga single

cell assertsthewordlineandrefreshesall thecellsin theaccompanying row, cellsthatwouldhavedecayedif left alone

getrefreshedcoincidentallyby nearbyactivecells.This leadsto theobservationthateven4T cellswith shortretention

timesmaynotactuallylosedataasquickly if therow sizeis longenough.Thusretentiontimeselectiongoestogether

with locality granularitybecauselargerow granularitiesmake theapparentrateof refreshmuchhigher. (Segmented

wordlineswould allow moreselective refreshbut thesedesignsareoutsidethescopeof thispaper.)

In contrast,in a designwith very fine row granularity, onewould opt for 4T cellswith very long retentiontimes.

Finegranularityleadsto a very gooddecayratio but theimportantcellsmustremainalive on their own (without the

benefitof accidentalrefreshes)for considerabletime.

5.5 Metastability

A key issueregarding4T structuresis thefact thatmetastabilityproblemsmaybea concernwhenthecell’s internal

differentialvoltageis toosmall.To avoid metastability, it is temptingto userefresh,but thiswouldobviatethesavings

of ourapproach.Instead,onecandetectthesmalldifferentialvoltagesandavoid relyingon arraydataat thesepoints.

As an exampleof the latter, we proposethat one could avoid metastabilityin a 4T branchpredictorby addinga

referencecolumnwhosesolepurposeis to detectlow differentialvoltageandto prevent the senseampoutputfrom

propagatingfurther into thecircuit. In this column,insteadof a senseamplifierwe have a voltagecomparator. When

thevoltagedifferenceis too small,thecomparatoroutputforcesthereferencecell to readasa logical zero(otherwise

it readsasa logical one).Theoutputof thecomparatorqualifiestheresult.A smalldifferentialis thereforeprevented

from inducingmetastabilityin thesubsequentcircuit.


p 21

Otherapproachesarepossible,suchasthe useof decaycounters[Kaxiras et al. 2001],but the onewe have pro-

posed— theuseof a singlecomparatorin a referencecolumn— is appealingbecauseit preventsmetastabilitywhile

requiringminimal extraareaor power.

5.6 Reliability, Determinism, and Controlled Operation

In somecasesit is necessaryfor 4T memorystructuresto appearstatic. As in any otherDRAM, this is doneby

periodically refreshingthe dynamiccells. Refreshcanbe accomplishedin a way that is largely transparentto the

normaloperationof thememoryarray. Hanamuraet al. [Hanamuraet al. 1987]describea refreshmechanismthaton

a periodicbasismomentarilystrobestheaddresslinesafterthebit lineshavebeenprecharged.Senseampsareidle at

this point, resultingin low power consumption.Theshortstrobepulseallows chargeto flow from thebit linesto the

cellsof theselectedline andrestoretheir contents.Ordinaryreadsandwritestake precedenceover refresh,resulting

in minimal performanceimpact.

Suchrefreshcircuitry (whosearea-costis negligible) is likely to be includedwith a 4T dataarray. In normal

operationthe refreshcircuit is inactive andthe 4T structureoperatesin decay(low-leakage)mode. However, under

specialcircumstances,refreshprovidesindispensablefunctionality. Theseinclude: (i) chip testing,(ii) deterministic

operationfor real-timeapplications,(iii) possibleoperationbeyondrecommendedtemperaturerange.While thefirst

two shouldbeapparent,thethirdonearisesbecauseathightemperatures,leakagemaybesohighthatbranchpredictors

decaytoo quickly. In suchsituations,the systemwill want to re-enablerefresh,reverting to non-decaymode. One

methodto achievethis is by usingon-chipsensorsto automaticallytriggertherefreshwhennecessary.

5.7 Trends in Process Technology

With theadventof new processtechnology, transistorleakagewill increaseat a tremendousrate,making6T solutions

somewhatlessattractivedueto powerdissipationconcerns.Thissametransistorleakagewill of courselower thedata

retentionof 4T cells; however sinceaccesstimeswill alsoimprove, therewould appearto besomekind of balance

betweenthenumberof machinecyclesthatthe4T cell canhold its dataandthenumberof machinescyclesthatthe4T

cell is requiredto hold its datain cacheapplications.For example,simulatingunderSPICEusinga futuretechnology

(65nm),at 85Ca 3.3V 4T cell hasa 1.4q s retentiontime, translatinginto 1400processorcyclesat 1 GHz. However,

accordingto theITRS roadmap,by thetime 65nmis reachedtheprocessorclock shouldbeapproximately7.5 GHz,

translatinginto aneffective retentiontime of just over10000cycles.Coupledwith architecturaltechniquesto extend

theretentiontime,evenwith theimpactof processtechnology, we canstill field aneffectiveretentiontime for decay.

Therefore,4T still wins.

As processtechnologyadvances:

—transistorleakageincreasesfor all devices,especiallyfor thosetypesof transistorsusedin 6T cells, but not so

rapidly for thoselongerchanneltransistorsusedin 4T cells

—Dataaccesstime decreasesin all cases

—Dataretentiontime decreasesin 4T cells

—Alpha particleimmunity (andsoft errorsin general)will getworsefor bothtypesof cells.

—Precharge andaccesstransistorsleakageis a mechanismof purecharge lossandcompletepower wastein a 6T-

SRAM

—Prechargeandaccesstransistorleakagetendsto replenishthe internalnodechargelost dueto the leakageof other

transistorsin the4T-DRAM

Thelastpoint is akey pointbecause6T cellsin thiserahaveshown softerrorproblemsowing to their largersource-

drainarea.1T cells(previouslynotoriousfor theirsofterrorsensitivity) arenotgettingasbadasoncethoughtbecause

their source-drainareais so small that the offendingparticlehasa smallertarget to hit it, andwhenit hits thereis


22 r

lessof achargeimbalancebecausethecross-sectionalareaof thedepletionregion is sosmall.Onceagainthe4T cell

seemsto bea reasonablemiddlegroundbetween6T and1T.

5.8 Controlling Retention Times

Thesuccessof a 4T decaydesigndependson matchingretentiontimesto access(i.e., “refresh”) intervals. Thus,the

ability to controlretentiontimescouldgiveusanew degreeof freedomin designing4T decaystructures.Oneway to

affect retentiontimesis to adddevicessuchasresistorsor capacitorsto thebasic4T cell [Diodatoet al. 2001]. Such

devicescanbeusedto slowly replenishthe lost charge. If the rateof replenishmentis lessthanthe leakagethecell

will still decayalbeitmuchmoreslowly, thusretentiontimecanbeextendedsignificantly.

The ability to control retentiontimes is key to the applicability of 4T cells to branchpredictordesigns,as the

retentiontime is dependenton temperature.Thus,onecanadaptthe naturaldecayinterval (in cycles)moreclosely

to thedesireddecayinterval in responseto thegiventemperaturerange.As describedbelow, refreshcircuitry canbe

includedto beusedasa fail-safeif thetemperatureexceedsacertainthreshold.

In addition, retentiontimescanbe controlledusingarchitecturaltechniques,suchasusingmultiple refreshesto

prolongretentiontime,accomplishedeitherstaticallyor adaptively. Moreover, asdescribedabove,thestructureof the

cachemaymaketheapparentrateof refreshmuchhigher;longerrow granularitiescancompensatefor ashorterdecay

interval. Onethuscandesignthestructureto orient towardslongerrows for anincreaseddecayinterval (anda larger

safeguardagainstperformanceloss)or shorterrows for greaterleakageenergy savings.

6. RESULTS FOR 4T-BASED BRANCH PREDICTORS

To illustrate the effects of the increasein the mispredictionrate due to decay, we simulateda processorwith the

directionpredictorandBTB built with 4T structures.Wesimulatedundertheworstcasescenario:smallstaticleakage

in comparisonto dynamicenergy, whichcorrelatescloselywith theCOM2process.Thusweuseslow-decaying3.3V

4T cells in our design,operatingat 85 C, leaving usa decayinterval of 57,200cyclesfor boththedirectionpredictor

andtheBTB.

Thesevaluescanbeadjustedusingthe techniquesasdescribedin Section5; for example,if thedecayinterval of

57,200cyclesis too long, we caninsteaduse1.5V 4T cells,which have a shorterdecayinterval (8000cycles),and

extendthatretentiontimeappropriately.

6.1 4T Decay Results

6.1.1 Gshare. Figure16 shows thenormalizedexecutiontime andmispredictionrateof the4T branchpredictor

versusa standard(non-decaying)branchpredictor. We seethat the mispredictionrateandthus the executiontime

arevirtually unchanged.In fact,dueto therandomeffectsof readingthedecayedvalues,a few benchmarksactually

improve2-3%.Overall, theaccuracy wasaffectedlessthan5%.

Figure17 shows theactive ratio of thedirectioncountersundervariousSPECbenchmarks.On average,we seea

28.2%active ratio,which directly translatesinto over70%savingson leakagepowerovera traditional,non-decaying

predictor. Thesavingsis not without cost,however: additionalreadenergy is requiredevery time anearlydecayedor

completelydecayedcounteris read.Underourcurrentprocess,this penaltyis anadditional27.5%of thereadenergy

of a non-decayedcounter.

Moreover, thispenaltyis appliedalsowhennearlydecayedcountersonthesamerow arerefreshedaccidentallydue

to a nearbyread;fully decayedcounterson thesamerow, however, areleft decayedandthereforenot refreshed.We

implementthis by detectingthe voltagedifferentialon the bitlines; if this voltageswing is closeto zero,we cantie

thesignalinto therefreshlinesandavoid refreshinganunusedcell. Thesumtotal is thatwe retaintheeffective long

decayinterval of low row granularitybut achievemuchloweractiveratios.

Finally, thenormaldynamicenergy overheadof additionalmispredictionsmustbe includedin our results.Figure


s 23

95.0

96.0

97.0

98.0

99.0

100.0

101.0

gzip

vpr

gcc

mcf

craf

ty

pars

er

eon

perlb

mk

gap

vort

ex

bzip

2

twol

f

wup

wis

e

swim

mgr

id

appl

u

mes

a

galg

el

equa

ke

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

exec

utio

n tim

e (%

)

standard decayed [4T]

0.00

0.05

0.10

0.15

gzip

vpr

gcc

mcf

craf

ty

pars

er

eon

perlb

mk

gap

vort

ex

bzip

2

twol

f

wup

wis

e

swim

mgr

id

appl

u

mes

a

galg

el

equa

ke

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

mis

pred

ictio

n ra

te


Fig. 16. Normalizedexecutiontime (Top) andmispredictionrate(Bottom)of standardand4T predictors.4T predictorsproduceminimal perfor-

mancelosses.

0

10

20

30

40

50

60

70

80

90

100

gzip

vpr

gcc

mcf

craf

ty

pars

er

eon

perlb

mk

gap

vort

ex

bzip

2

twol

f

wup

wis

e

swim

mgr

id

appl

u

mes

a

galg

el

equa

ke

face

rec

amm

p

luca

s

fma3

d

sixt

rack

apsi

[geo

mea

n]

activ

e ra

tio (

%)

Fig. 17. Active ratio of a 4T basedpredictor.


24 t

4T COM3 [85 C]

4T COM3 [85 C]

4T COM4 [85 C]4T COM4 [85 C]

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

original 4096K 512K 64K 8K

Decay Interval

No

rmal

ized

Bra

nch

Pre

dic

tor

Lea

kag

e E

ner

gy

COM2 COM3 COM4 4T COM2 4T COM3 4T COM4

4T COM2 [85 C]

Fig. 18. Normalizedleakageenergy for the4T branchpredictorat 57.2kand8k cycles.

18 shows the4T basedbranchpredictorin comparisonwith theprevious6T designs.While the4T decayinterval is

setat 57.2k,at equivalentprocesses,the4T usuallyoutperformsthe6T. For instance,a 6T baseddecayingpredictor

on a COM3 processwould actuallyconsumemoreenergy thana standard,non-decayingpredictor, whereasthe 4T

versionof thesamepredictor, with ashorterdecayinterval, doesbetter.

In addition,thedataalsoshowsthatit is possiblewith 4T cellsto useveryaggressivedecayintervalsin thedirection

countertables.UndertheCOM4 process,we seeleakagepowersavingsevenwhenthedecayinterval is 8000cycles.

As mentionedabove,we areseeingtheeffectsof thelower active ratiosat thesamerow granularitythat the4T cells

allow.

6.1.2 Bi-mode Results. Figure19showsthenormalizedexecutiontimeandmispredictionrateof amorecomplex

predictor, the bi-mode. In this configurationwe usedan aggressive bi-modepredictorcomposedof two 16K PHTs

activatedby an8K selector. We seethatexecutiontime is movedvery little; mispredictionrateis increasedroughly

0.3%,translatinginto a performancelossof 0.1%. The complexity of the predictorhelpsalleviate the performance

lossdueto the misprediction,asa singlestaticbranchcoversa larger areathan,for example,a bimodalpredictor,

which raisestheapparentrefreshrate. This is confirmedby the fact thatbi-modepredictorssuffer lessasthedecay

interval is decreasedcomparedto thegshare.Theresultsfor varyingthedecayinterval of gsharewill bediscussedin

Section6.2.

Figure20 shows the active ratio of the bi-modepredictor. The active ratio appearsto be smallerthanthat of the

gsharebecausetheactive ratio is anaverageof all threecomponentsof thebi-mode– oneof which is a very simple

bimodalpredictor. While thebimodalcomponentis half thesizeof thePHTs,its active ratio is significantlysmaller

thanthatof thePHTsandthuslowerstheoverallactive ratio.

6.1.3 BTB Results. The resultsfor the 4T BTB aresimilar to the 6T BTB. BecauseeachBTB target is much

larger thanthe two-bit counter, we canafford to attachcountersto eachBTB target andthusachieve the optimum

granularity;thosecountersthatareusedarerefreshed,andthosethatdecayarenotaccidentallyrefreshed.As aresult,

theactive ratiosof the6T BTB areidenticalto the4T BTB.

6.2 Sensitivity to Decay Interval

Figure21 shows theimpactof decayinterval over executiontime andpredictoraccuracy. Shorter(aggressive)decay

intervalsincreasethedecayratio,but at thecostof performance.For example,ata256cycledecayinterval, theactive

ratio is lessthan1%, but the executiontime is increased11% dueto the branchpredictorsuffering a 10% drop in


u 25

0.99

0.992

0.994

0.996

0.998

1

1.002

1.004

1.006

1.008

1.01

gzip vp

rgc

cm

cf

craf

ty

pars

er eon

perlb

mk

gap

vorte

xbz

ip2 twolf

wupwise

swim

mgr

idap

plum

esa

galge

lar

t

equa

ke

face

rec

amm

pluc

as

fma3

d

sixtra

ckap

si

geom

ean

exec

uti

on

tim

e(%

)


0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

gzip vp

rgc

cm

cf

craf

ty

pars

er eon

perlb

mk

gap

vorte

xbz

ip2 twolf

wupwise

swim

mgr

idap

plum

esa

galge

lar

t

equa

ke

face

rec

amm

pluc

as

fma3

dap

si

sixtra

ck

geom

ean

mis

pre

dic

tio

n r

ate

standard decayed[4T]

Fig. 19. Normalizedexecutiontime (Top)andmispredictionrate(Bottom)of standardand4T bi-modepredictors.4T predictorsproduceminimal

performancelosses.

accuracy. Overall, thenormalizedleakageenergy for thebranchpredictoris increased7 times.

While temperaturecancausethe4T cell to leakmorerapidly (andthusshortenthedecayinterval), it is important

to restatethata randomizedvaluein thedirectionalcountersmaystill becorrect,andthustheeffect of accidentally

decayingbranchpredictorsis not assevereasdecayingcachelines. More, asmentionedearlier, retentiontime can

becontrolledusingthetechniquesmentionedin Section5. For example,if thetemperatureis beyondsomethreshold

suchthatcellsareleakingtoo quickly, a stand-byrefreshcircuit canbeenabledto returnthebranchpredictorinto a

non-decayingmode.

6.3 Summary and Discussion

Weseethatusingquasi-static4T cellsallowsusto build naturallydecayingbranchpredictorswith minimal impacton

performance.Furthermore,predictorsbuilt using4T-baseddecayoffer additionalbenefitsover thosealreadyshown

with decaywith 6T cells.We show that4T RAM cellsareasmuchasone-thirdsmallerin areaandarecomparablein

termsof readenergy whenaccessedfrequently. Most importantly, they reduceleakageenergy comparedto 6T cells

andarecomparablein termsof programperformance.

Thusourproposalto use4T cellsfor predictorstructuresis drivenby thefollowing observations.


26 v

0

10

20

30

40

50

60

70

80

90

100

gzip vp

rgc

cm

cf

craf

ty

pars

er eon

perlb

mk

gap

vorte

xbz

ip2 twolf

wupwise

swim

mgr

idap

plum

esa

amm

p

galge

lar

t

equa

ke

face

rec

lucas

fma3

d

sixtra

ckap

si

geom

ean

acti

ve r

atio

(%)

Fig. 20. Active ratio of a 4T basedbi-modepredictor.

0.8

0.85

0.9

0.95

1

1.05

1.1

1.15

256 512 1024 2048 4096 8192 16384 32768 32768 131072

Decay interval (in cycles)

Normalized execution time Branch prediction accuracy

Fig. 21. Normalizedexecutiontime andbranchpredictoraccuracy for the16K gsharepredictoratvaryingdecayintervals.

—Branchpredictorentriesexhibit locality. Data arraysareapproximatelysquare,anddecaytechniquesare most

easilyappliedto rows in thesearrays.This helps4T structures,becauseanaccessto anything alonga row boosts

the voltagesof all cells in that row. Locality meansthatactive coderegionshave their predictorsrefreshedwhile

inactiveregionsdecay. Oncedecayed,they leakvery little, essentiallycappingtheenergy dissipatedby idle entries.

—Theretentiontime for 4T cells— theamountof time it takesfor a 4T cell to leakenoughchargethat thevalueit

storesis corrupted— is longenoughthat4T cells’ decayingbehavior is naturallysuitedto exploiting decayin large

structuresoverour benchmarks:it is rarethatavalueleaksawaybeforeit is neededagain.

—Retentiontimesvary with designstyle, fabricationtechnology, andtemperature.Nonetheless,we have discussed

techniquesthatallow oneto modulatetheretentiontimessufficiently to guaranteegoodperformance.We havealso

shown thatthereexist severaltechniquesthatwill preventsmallvoltagedifferentialsin decayedcellsfrom inducing

metastability. Our resultsin Section5.7show thatour techniquewill beviableevenin futuretechnologies.

Theseinsightssuggestthat4T cellsareinherentlyusefulfor managingleakagein predictivestructures.Thefactthat

4T cellsdecaynaturallyprovidesleakage-energy savingswithout eithertheoverheadof gated-w�x�x techniquesor the

overheadof maintainingdecaycounterbits. Finally, 4T designsaresmaller, saving die areaandpossiblypermitting


y 27

fasteraccesstimesfor a givennumberof bits.

7. RELATED WORK

Much researchhasrecentlybeendonein the field of static leakage,both at the architecturallevel andat the device

level. At the architecturallevel, prior work focusedon improving the static leakageperformanceby shuttingoff

(decaying)unusedportionsof large array-like structures,for example,in caches. [Yanget al. 2001] describedan

instructioncacheorganizationthatdisableswaysof a set-associativecacheto matchthecapacitycurrentlyneededby

theexecutingprogram.

Severalpapers[Powell et al. 2000], [Kaxiraset al. 2001],and[Hu et al. 2002],describetechniquesfor disabling

individual cachelines by inferring that lines which have not beenusedin a long time have decayed: they probably

containdatathat is not likely to be usedagainbeforereplacement.Building on the cachedecaywork, [Hu et al.

2002a]extendeddecaytechniquesto thebranchpredictor. Most of theabovetechniquesassumetechnologiessuchas

gated-z�{�{ or supply-voltagegatingto controlwhetheror not a portionof thestructureis enabledor disabled.More

recently, Hansonet al. presentedacomparisonof leakage-controltechniques[Hansonetal. 2001],andconcludedthat

gated-z�{�{ maynot bethebestapproachfor controllingstaticpower in largedataarrays.Instead,they find thatdual-

z7| techniques[Roy 1998]save moreoverall energy for structuresthat shouldhave fastlookup times(like first-level

caches).

[Zhou et al. 2003] proposedturningoff only the dataportion of the cache,leaving the tag portionactive in order

to monitor andexploit missinformation. Furthermore,usingadaptive cachedecayasan example,[Velusamyet al.

2002]appliedformalfeedbackcontrolmechanismsto improveleakagesavingsby implementinganadaptivecontroller

whichvarieddecayintervalsdynamically. SankaranarayananandSkadron[SankaranarayananandSkadron2004]also

improvedoncachedecayby selectingbetterpoliciesfor adaptingthedecayinterval. [Flautneretal. 2002]presenteda

techniquein which to improve leakagesavingsnot by fully decayingcachelines,but ratherto put thein a low-power

”drowsy” state,whichmaintainsthecachestateratherthandiscardingit asin decaytechniques.Extendinggated-z�{�{ ,[Agarwal et al. 2002] DRG-Cachesusedata-retentive leakage-awarecells to reduceleakageenergy. Alternatively,

[Azizi etal. 2002]introducedanasymmetricdual-VtSRAM cell. Anotherleakagecontroltechniquesaving thecache

statewasdescribedby [Heo etal. 2002].Combiningstate-preservingandstate-destroying techniques,[Li etal. 2002]

demonstratedatechniquein which leakageenergy wasfurtherreducedby exploiting dataduplicationacrossthecache

hierarchy. Lastly, focusinginsteadon the compiler, [Zhanget al. 2002] presentedtechniquesusingthe compilerto

improveinstructioncacheleakage.

A separate,architecturalway of improving energy consumptionperformanceof structuressuchasthe cachewas

proposedby BalasubramonianandAlbonesi [Balasubramonianet al. 2000] andYanget al. [Yanget al. 2002], in

which the cacheis reconfiguredin order to bestsupportthe working setof the datawith as little wastedcacheas

needed.Resizablecachesareindeedorthogonalto decaytechniquesandalsotake advantageof the fact that some

applicationsrequirevery little eitherthecacheor from thebranchpredictor. However, thedecayapproachis simple,

requiresvery little hardware,andcanpotentiallyusevery smalloperatingintervals. Still, thereis nothingpreventing

a resizablecachefrom beingbuilt with decay, andcombiningthebenefitsof both.

Takingadifferentapproach,[Juangetal.2002]lookedateliminating z�{�{ altogetherin replacingthestandardSRAM

cell with a dynamic,quasi-staticfour transistorcell. Using the quasi-staticcell allows for an easilybuilt, naturally

decayingcache. In additionto caches,[Hu et al. 2002b]describedwaysof saving leakagein branchpredictorsby

implementingthemwith quasi-staticcells.Finally, [Hu etal. 2003]includedcachedecayasanexampleof improving

power dissipationandperformanceby looking at more than just simple time-independentbehavior, suchas event

ordering,andexploiting thegenerationalbehavior of structuressuchasthecacheandprefetchunit.


28 }

8. CONCLUSIONS

In this paper, we examineimplementationsof decay-basedleakagecontrol usingquasi-staticmemorycells. Cache

decay, first proposedin [Kaxiras et al. 2000;Kaxiraset al. 2001],aimedto turn off unusedportionsof cacheswith

long idle timesto reducecacheleakageenergy.

We startby evaluatingdecayimplementationsbasedon coarse-grainedcountersappendedto traditional6T SRAM

arrays. While previously consideredfor caches,this papershows that suchtechniquescanbe effective for branch

predictorsaswell. Inherentin this successis theobservationthatbranchpredictorsexhibit row-basedspatiallocality

thatallowssomeof therows to bedeactivateddueto long idle times,while otherrowsareheavily accessed.

We seethat,muchlike caches,exploiting this spatialandtemporallocality allowsusto save significantamountsof

leakageenergy, especiallyin simplerbranchpredictors.In morecomplex branchpredictorswe show that intelligent

policiescansavesignificantamountsof leakageenergy overnaivedecay.

Prior implementationsof decaywerebasedon gated-~�� or gated-~�� [Yanget al. 2001],wherepowerandground

aregatedby a transistor. This implementation,however, not only leadsto about5% areaoverheadbut alsobrings

aboutcomplex designissues,especiallyfor turningon adecayedcacheline.

A closerexaminationof gated-~ �� baseddecayrevealsan intrinsic conflict betweenthe standard6-transistorcell

designandgated-~ �� . Specifically, while gated-~ �� triesto shutoff cachelines,thetwo loadtransistorsin 6-transistor

cells aremerelywastingcharge. To resolve this conflict, we proposedusing4-transistorquasi-staticmemorycells

to implementcachedecay. By removing the two transistorsthat connectthe cell to ~�� , quasi-staticcells naturally

implementsthetwo key functionsfor decay:theirchargesloseovertimeandarereplenishedduringeachcacheaccess.

Comparedto gated-~�� implementation,thismethodavoidstheareaoverheadandthedesignissuesrelatedto thegate

transistor.

Thus,we alsoexaminethe useof quasi-static4T cells for implementingbranchpredictordecay. Suchcells have

beenproposedto implementon-chip embeddedDRAM in a fairly-traditional style with refreshcircuitry. In our

work, weexamineusingthenaturaldecayof the4T cellsto implementdecayfor leakage-controlin branchpredictors.

Becausebranchpredictorsareperformancehints,notcorrectness-critical,lostentriesdonotcauseincorrectexecution.

Moreover, we show that4T cellscanbebuilt with sufficient naturalretentiontimesto implementusefuldecay-based

predictorswith negligible impacton predictionrate.

Using a combinationof transistor-level simulationand instruction-level simulation,we show that a 4-transistor

decayimplementationachievessignificantsavingsin leakagepower. While 4-transistordecaytypically offersgreater

savings than 6-transistorsolutions,thereare many placeswere it may not be feasibleto implementa 4T branch

predictor. 6T decayis more flexible in the selectionof retentiontime, as well as a finer granularity in selecting

dynamicretentiontimes. Furthermore,it may not alwaysbe possible,either throughdevice or layout concerns,to

wholesalereplacea 6T solutionwith a4T solution.

Thus,by presentingbothtraditional(6T) andquasi-static(4T) solutions,we show thatdecaycanbeeasilyapplied

eitherarchitecturallyon top of a givendesign,or bean integral partof thedesignfrom thebeginning. While branch

predictorleakageis now only 7-10%of totalCPUleakage,weareableto reduceit by asignificantfraction,sometimes

90% or more. This reducesoverall chip leakageby 5-7%. Furthermorewe canreduceleakagewith essentiallyno

performancecost. In futureprocessors,asbranchpredictorspotentiallywill grow in sizevery quickly, outpacingthe

growth of thecache,thepower consumptionof thebranchpredictorwill beanimportantproblem.Most broadly, the

paperpromptsa rethinkingof how transientdatacanbestbeexploitedin designingpower-efficientprocessors.

Acknowledgements

We would like to thankPolychronisXekalakisandYanZhangfor their help in simulatingthe4T cell underfuture

technologiesin SPICE.

MargaretMartonosiandDougClark’sresearchonenergy-efficientprocessorsis supportedin partby NSFITR grant


� 29

CCR-0086031.In addition,MargaretMartonosigratefullyacknowledgesfundingandequipmentdonationsfrom Intel

andIBM.

Kevin Skadron’s researchwassupportedin part by NSF grantsCCR-0105626,an NSF CAREERaward (CCR-

0133634),anda grantfrom Intel MRL.

StefanosKaxiraswould like to thankIntel for equipmentdonations.

We alsothanktheanonymousreviewersfor their detailed,helpful suggestions.

REFERENCES

AGARWAL , A . ET AL . 2002.DRG-Cache:adataretentiongated-groundcachefor low power. In In the Proceedings of the 39th Design Automation

Conference.

AZIZI , N. ET AL . 2002. Low-leakageasymmetric-cellSRAM. In In Proceedings of the International Symposium on Low Power Electronics and

Design.

BALASUBRAMONIAN, R. ET AL . 2000.Memoryhierarchyreconfigurationfor energy andperformancein general-purposeprocessorarchitectures.

In In Proceedings of the 33rd International Symposium on Microarchitecture.

BROOKS, D., T IWARI , V., AND MARTONOSI , M. 2000. Wattch:A Framework for Architecture-Level Power AnalysisandOptimizations. In

Proc. ISCA-27.

BURGER, D. C. AND AUSTIN, T. M. 1997.TheSimpleScalartool set,version2.0. Computer Architecture News 25, 3 (June),13–25.

CHANG, P.-Y., HAO, E., AND PATT, Y. N. 1995.Alternative implementationsof hybridbranchpredictors.In Proc. Micro-28. 252–57.

DIEFENDORFF, K . 1999.PentiumIII = PentiumII + SSE.Microprocessor Report.

DIODATO, P. ET AL . 1998.MergedDRAM-LOGIC in theYear2001.Proc. of the IEEE International Workshop on Memory, Technology, Design,

and Testing.

DIODATO, P. ET AL . 2001.EmbeddedDRAM: An ElementandCircuit Evaluation.In IEEE Custom Integrated Circuits Conference.

DIODATO, P. W. 2001.Personalcommunication.

FLAUTNER, K . ET AL . 2002. Drowsy Caches:Simple Techniquesfor ReducingLeakagePower. In Proceedings of the 29th International

Symposium on Computer Architecture.

GWENNAP, L . 1996.Digital 21264setsnew standard.Microprocessor Report, 11–16.

HANAMURA , S. ET AL . 1987.A 256KCMOSSRAM with InternalRefresh.In The 1987 IEEE International Solid-State Circuits Conference.

HANSON, H. ET AL . 2001.Staticenergy reductiontechniquesfor microprocessorcaches.In Proceedings of the 2001 International Conference on

Computer Design. 276–83.

HEO, S. ET AL . 2002.DynamicFine-GrainLeakageReductionusingLeakage-BiasedBitlines. In Proceedings of the 29th International Symposium

on Computer Architecture.

HOLGATE, R. W. AND IBBETT, R. N. 1980. An analysisof instruction fetching strategies in pipelinedcomputers. IEEE Transactions on

Computers C-29, 4 (Apr.), 325–329.

HU, Z. ET AL . 2002a. Applying DecayStrategies to BranchPredictorsfor LeakageEnergy Savings. In Proceedings of the 2002International

Conference on Computer Design.

HU, Z. ET AL . 2002b. ManagingLeakagefor TransientData: DecayandQuasi-Static4T MemoryCells. In Proceedings of the 2002 International

Symposium on International Symposium on Low Power Electronics and Design.

HU, Z., JUANG, P., SKADRON, K ., CLARK , D., AND MARTONOSI , M. 2001. Applying decaystrategiesto branchpredictorsfor leakageenergy

savings. Tech.ReportCS-2001-24,Univ. of Virginia.

HU, Z., KAXIRAS, S., AND MARTONOSI , M. 2002. Let cachesdecay:Reducingleakageenergy via exploitationof cachegenerationalbehavior.

ACM Transactions on Computer Systems.

HU, Z., KAXIRAS, S., AND MARTONOSI , M. 2003.Timekeepingtechniquesfor predictingandoptimizingmemorybehavior. In The 2003 IEEE

International Solid-State Circuits Conference.

J. A. BUTTS AND G. SOHI. 2000.A StaticPowerModelfor Architects.In Proceedings of the 33rd International Symposium on Microarchitecture.

JIMENEZ, D. A., KECKLER, S. W., AND L IN, C. 2000. The impactof delayon the designof branchpredictors. In Proceedings of the 33rd

International Symposium on Microarchitecture. 67–77.

JUANG, P. ET AL . 2002.Implementingdecaytechniquesusing4T quasi-staticmemorycells. Computer Architecture Letters.

KAXIRAS, S., HU, Z., ET AL . 2000.Cache-LineDecay:A Mechanismto ReduceCacheLeakagePower. In Workshop on Power-Aware Computer

Systems (PACS, In conjunction with ASPLOS-IX.

KAXIRAS, S., HU, Z., AND MARTONOSI , M. 2001. CacheDecay:Exploiting GenerationalBehavior to ReduceCacheLeakagePower. In

Proceedings of the 28th International Symposium on Computer Architecture.


30 �

KESHARVARZI , A . ET AL . 1997. Intrinsic iddq: Origins,reduction,andapplicationsin deepsub-mlow-power cmosic’s. In Proceedings of the

IEEE International Test Conference. 146–155.

LAI , A ., FIDE, C., AND FALSAFI, B. 2001.Dead-BlockPredictionandDead-BlockCorrelatingPrefetchers.In Proceedings of the 28th Interna-

tional Symposium on Computer Architecture.

L I , L . ET AL . 2002.Leakageenergy managementin cachehiearchies.In Proceedings of the 2002 International Conference on Parallel Architec-

tures and Compilation Techniques.

L I , Y. ET AL . 2004. State-preservingvs. non-state-preservingleakagecontrol in caches.In In Proceedings of the 2004 Design, Automation and

Test in Europe (DATE).

L IPASTI , M., WI LKERSON, C. B., AND SHEN, J. P. 1996.Valuelocality andloadvalueprediction.In Proc. ASPLOS-VII. 138–47.

LOSQ, J. J. 1982.Generalizedhistorytablefor branchprediction.IBM Technical Disclosure Bulletin 25, 1 (June),99–101.

LYONS, R. ET AL . 1987. CMOS StaticMemory With a New Four-TransistorMemory Cell. Proc. of the 1987 Stanford Conf. On Advanced

Research in VLSI, 111–132.

MCFARLING, S. 1993.Combiningbranchpredictors.Tech.NoteTN-36,DECWRL. June.

NODA , K . ET AL . 1998.A 1.9um2loadlessCMOSFour-TransistorSRAM Cell in a0.18umlogic technology.IEDM Tech. Dig., 847–850.

PARIKH, D., SKADRON, K ., ZHANG, Y., BARCELLA , M., AND STAN, M. 2002. Power issuesrelatedto branchprediction. In Proc. HPCA-8.

233–44.

POWELL , M. ET AL . 2000.Gated-Vdd:A circuit techniqueto reduceleakagein cachememories.In In Proceedings of the International Symposium

on Low Power Electronics and Design.

ROY, K . 1998. Leakagepower reductionin low-voltageCMOSdesigns.In Proceedings of the International Conference on Electronics, Circuits,

and Systems. 167–73.

SANKARANARAYANAN, K . AND SKADRON, K . 2004. Profile-basedadaptationfor cachedecay. ACM Transactions on Architecture and Code

Optimization. To appear.

SCHUSTER, S., TERMAN, L ., AND FRANCH, R. 1987.A 4-Device CMOSStaticRAM Cell UsingSub-ThresholdConduction.In Symposium on

VLSI Technology, Systems, and Applications.

SEMICONDUCTOR INDUSTRY ASSOCIATION, . 2001. From website : The international technology roadmapfor semiconductors. In

http://public.itrs.net/Files/2001ITRS/Home.htm.

SEZNEC, A . ET AL . 2002.Designtradeoffs for thealphaev8 conditionalbranchpredictor. In In Proceedings of the 2002 International Symposium

on Computer Architecture.

SMITH, J. E. 1981.A studyof branchpredictionstrategies.In Proceedings of the 8th International Symposium on Computer Architecture. 135–48.

SONG, P. 1997.UltraSparc-3aimsatMP servers.Microprocessor Report, 29–34.

THE STANDARD PERFORMANCE EVALUATION CORPORATION. 2000.WWW Site. http://www.spec.org.

VELUSAMY, S. ET AL . 2002.Adaptivecachedecayusingformal feedbackcontrol. In Proceedings of the 2002 Workshop on Memory Performance

Issues (in conjunction with ISCA-29).

WOLF, W. 1998.Modern VLSI Design: Systems on Silicon. PrenticeHall. Prentice-Hall.

YANG, S. ET AL . 2002.Exploitingchoicein resizablecachedesignto optimizedeep-submicronprocessorenergy-delay. In In Proceedings of the

8th International Symposium on High-Performance Computer Architecture.

YANG, S.-H. ET AL . 2001. An IntegratedCircuit/Architecture Approachto ReducingLeakagein Deep-SubmicronHigh-PerformanceI-Caches.

In Proceedings of the 7th International Symposium on High-Performance Computer Architecture.

ZHANG, W. ET AL . 2002.Compiler-directedinstructioncacheleakageoptimization.In Proceedings of the 35th Annual IEEE/ACM International

Symposium on Microarchitecture.

ZHOU, H. ET AL . 2003. Adaptive modecontrol:A static-power-efficient cachedesign. In ACM Transactions on Embedded Computing Systems

(TECS), Special issue on power-aware embedded computing.

ZHOU, H., TOBUREN, M., ROTENBERG, E., AND CONTE, T. 2001.Adaptivemodecontrol:A static-power-efficient cachedesign.In Proceedings

2001 International Conference on Parallel Architectures and Compilation.


Implementing Branch Predictor Decay Using Quasi …mrmgroup.cs.princeton.edu/papers/taco03.pdfAn...

Documents

Transcript of Implementing Branch Predictor Decay Using Quasi …mrmgroup.cs.princeton.edu/papers/taco03.pdfAn...