Chapter 4
Optimizing Power @ Design Time – Circuit-Level Techniques
Jan M. Rabaey, Dejan Markovic, Borivoje Nikolic

Slide 4.1
With the sources of power dissipation in modern integrated circuits well understood, we can start to explore the various sorts of power reduction techniques. As is made clear in the beginning of the chapter, power or energy minimization can be performed at many stages in the design process and may address different targets such as dynamic or static power. This chapter focuses on techniques for power reduction at design time and at circuit level. Practical questions often expressed by designers are addressed: whether gate sizing or choice of supply voltage yields larger returns in terms of power–delay; how many supplies are needed; what the preferred ratio of discrete supplies to thresholds is; etc. As was made clear at the end of the previous chapter, all optimizations should be seen in the broader light of an energy–delay trade-off. To help guide this process, we introduce a unified sensitivity-based optimization framework. The availability of such a framework makes it possible to compare in an unbiased way the impact of various parameters such as gate size and supply and threshold voltages on a given design topology. The results serve as the foundation for optimization at the higher levels of abstraction, which is the focus of later chapters.
J. Rabaey, Low Power Design Essentials, Series on Integrated Circuits and Systems, DOI 10.1007/978-0-387-71713-5_4, © Springer Science+Business Media, LLC 2009
Chapter Outline

Optimization framework for energy–delay trade-off
Dynamic-power optimization
– Multiple supply voltages
– Transistor sizing
– Technology mapping
Static-power optimization
– Multiple thresholds
– Transistor stacking
Slide 4.2
The chapter starts with the introduction of a unified energy–delay optimization framework, constructed as an extension of the powerful logical-effort approach, which originally targeted performance optimization. The developed techniques are then used to evaluate the effectiveness and applicability of design-time power reduction techniques at the circuit level. Strategies to address both dynamic and static power are considered.
Energy/Power Optimization Strategy

For a given function and activity, an optimal operation point can be derived in the energy–performance space
Time of optimization depends upon activity profile
Different optimizations apply to active and static power

[Figure: activity regimes mapped to optimization strategies – Fixed Activity: design time (active power); Variable Activity: run time (active power); No Activity/Standby: sleep (static power)]
Slide 4.3
Before embarking on any optimization, we should recall that the power and energy metrics are related, but that they are by no means identical. The link between the two is the activity, which changes the ratio between the dynamic and static power components, and which may vary dynamically between operational states. Take, for instance, the example of an adder.

When the circuit is operated at its maximum speed and inputs are changing constantly and randomly, the dynamic power component dominates. On the other hand, when the activity is low, static power rules. In addition, the desired performance of the adder may very well vary over time as well, further complicating the optimization trajectory.

It will become apparent in this chapter that different design techniques apply to the minimization of dynamic and static power. Hence it is worth classifying power reduction techniques based on the activity level, which is a dynamically varying parameter as discussed before. Fortunately, there exists a broad spectrum of optimizations that can be readily applied at design time, either because they are independent of the activity level or because the module activity is fixed and known in advance. These "design-time" design techniques are the topic of the next four chapters. In general though, activity and performance requirements vary over time, and the minimization of power/energy under these circumstances requires techniques that adapt to the prevailing conditions. These are called "run-time" optimizations. Finally, one operational condition requires special attention: the case where the system is idle (or is in "standby"). Under such circumstances, the dynamic power component approaches zero, and
leakage power dominates. Keeping the static power within bounds under such conditions requires dedicated design techniques.
Energy–Delay Optimization and Trade-off

Maximize throughput for given energy, or
Minimize energy for given throughput
Other important metrics: Area, Reliability, Reusability

[Figure: trade-off space in the energy/op versus delay plane – an unoptimized design point is bounded below by the optimal energy–delay curve running between (Dmax, Emin) and (Dmin, Emax)]
Slide 4.4
At the end of the previous chapter, it was argued that design optimization for power and/or energy requires trade-offs, and that energy and delay represent the major axes of the trade-off space. (Other metrics such as area or reliability play a role as well, but are only considered as secondary factors in this book.) This naturally motivates the use of energy–delay (E–D) space as the coordinate system in which designers evaluate the effectiveness of their techniques.

By changing the various independent design parameters, each design maps onto a constrained region of the energy–delay plane. Starting from a non-optimized design, we want to either speed up the system while keeping the design under the power cap (indicated by Emax), or minimize energy while satisfying the throughput constraint (Dmax). The optimization space is bounded by the optimal energy–delay curve. This curve is optimal (for the given set of design parameters), because all other achievable points either consume more energy for the same delay or have a longer delay for the same energy. Although finding the optimal curve seems quite simple in this slide, in real life it is far more complex. Observe also that any optimal energy–delay curve assumes a given activity level, and that changes in activity may cause the curve to shift.
Slide 4.5
The problem is that there are many sets of parameters to adjust. Some of these variables arecontinuous, like transistor sizes, and supply and threshold voltages. Others are discrete, likedifferent logic styles, topologies, and micro-architectures. In theory, it should be possible toconsider all parameters at the same time, and to define a single optimization problem. In practice,we have learned that the complexity of the problem becomes overwhelming, and that the resultingdesigns (if the process ever converges) are very often sub-optimal.
Hence, design methodologies for integrated circuits rely on some important concepts to helpmanage complexity: abstraction (hiding the details) and hierarchy (building larger entities througha composition of smaller ones). The two most often go hand-in-hand. The abstraction stack of atypical digital IC design flow is shown in this slide. Most design parameters are, in general,confined to and selected in a single layer of the stack only. For instance, the choice betweendifferent instruction sets is a typical micro-architecture optimization, while the choice betweendevices with different threshold voltages is best performed at the circuit layer.
Layering, hence, is the preferred technique to manage complexity in the design optimization process.
Optimization Can/Must Span Multiple Levels

Design optimization combines top-down and bottom-up: "meet-in-the-middle"

[Figure: optimization spanning the Architecture, Micro-Architecture, and Circuit (Logic & FFs) layers]
Slide 4.6
The layered approach may give the false impression that optimizations within different layers are independent of each other. This is definitely not the case. For instance, the choice of the threshold voltages at the circuit layer changes the shape of the optimization space at the logical or architectural layers. Similarly, introducing architectural transformations such as pipelining may increase the size of the optimization space at the circuit level, thus leading to larger potential gains. Hence, optimizations may and must span the layers.

Design optimization in general follows a "meet-in-the-middle" formulation: specifications and requirements are propagated from the highest abstraction layer downward (top-down), and constraints are propagated upward from the lowest abstraction layer (bottom-up).
Slide 4.7
Continuous design parameters such as supply voltages and transistor sizes give rise to a continuousoptimization space and a single optimal energy–delay curve. Discrete parameters, such as the choicebetween different adder topologies, result in a set of optimal boundary curves. The overall optimumis then defined by their composite.
For example, topology B is better in the energy-performance sense for large target delays,whereas topology A is more effective for shorter delays.
The Design Abstraction Stack

A very rich set of design parameters to consider!
It helps to consider options in relation to their abstraction layer:

System/Application – Choice of algorithm
Software – Amount of concurrency
(Micro-)Architecture – Parallel versus pipelined, general purpose versus application-specific
Logic/RT – logic family, standard cell versus custom
Circuit – sizing, supply, thresholds
Device – Bulk versus SOI

(This chapter focuses on the circuit layer.)
One of the goals of this chapter is to demonstrate how we can quickly search for this global optimum, and based on that, build an understanding of the scope and effectiveness of the different design parameters.
Some Optimization Observations

Energy–Delay Sensitivities

S_A = (∂E/∂A) / (∂D/∂A) evaluated at A = A0

[Figure: energy–delay curves f(A, B0) and f(A0, B) through the operating point (A0, B0) at delay D0; the slopes S_A and S_B of the curves are the sensitivities to the respective variables]

[Ref: V. Stojanovic, ESSCIRC'02]
Slide 4.8
Given an appropriate formulation of the energy and delay as a function of the design parameters, any optimization program can be used to derive the optimal energy–delay curve. Most of the optimizations and design explorations in this text were performed using various modules of the MATLAB program [Mathworks].

Yet, though relying on automated optimization is very useful to address large problems or to get precise results quickly, some analytical techniques often come in handy to judge the effectiveness of a given parameter, or to come to a closed-form solution. The energy–delay sensitivity is a tool that does just that: it presents an effective way to evaluate the effectiveness of changes in various design variables. It relies on simple gradient expressions that quantify the profitability of a design modification: how much change in energy and delay results from tuning one of the design variables. Consider, for instance, the operation point (A0, B0), where A and B are the design variables being studied. The sensitivity to each of the variables is simply the slope of the curve obtained by a small change in that variable. Observe that the sensitivities are negative owing to the nature of the energy–delay trade-off (when we compare sensitivities in the rest of the text, we will use their absolute values – a larger absolute value indicates a higher potential for energy reduction). For example, variable B has a higher energy–delay sensitivity at point (A0, B0) than variable A. Changing B hence yields a larger potential gain.

Energy–Delay Optimization

Globally optimal energy–delay curve for a given function

[Figure: energy/op versus delay curves for topologies A and B; their composite forms the globally optimal curve, with topology A better at short delays and topology B at long delays]
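To make the sensitivity definition concrete, here is a minimal numerical sketch in Python (the explorations in the book itself used MATLAB). The energy and delay functions E(A, B) and D(A, B) are invented toy models, not expressions from the text; the point is only the mechanics of S_A = (∂E/∂A)/(∂D/∂A).

```python
# Numerical energy-delay sensitivities for a toy two-variable design model.
# energy() and delay() are illustrative stand-ins, not models from the text.

def energy(A, B):
    return A ** 2 + 2.0 * B ** 2      # energy grows with both variables

def delay(A, B):
    return 1.0 / A + 1.0 / B          # delay shrinks as either variable grows

def sensitivity(A, B, var, eps=1e-6):
    """S_x = (dE/dx) / (dD/dx) at (A, B), by central differences."""
    if var == "A":
        dE = (energy(A + eps, B) - energy(A - eps, B)) / (2 * eps)
        dD = (delay(A + eps, B) - delay(A - eps, B)) / (2 * eps)
    else:
        dE = (energy(A, B + eps) - energy(A, B - eps)) / (2 * eps)
        dD = (delay(A, B + eps) - delay(A, B - eps)) / (2 * eps)
    return dE / dD

S_A = sensitivity(1.0, 1.0, "A")      # -> about -2: 2 units of energy per unit of delay given up
S_B = sensitivity(1.0, 1.0, "B")      # -> about -4: B is the "higher-cost" variable here
# Both sensitivities are negative (energy up means delay down); the variable
# with the larger |S| offers the bigger energy reduction per unit of extra delay.
```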
Finding the Optimal Energy–Delay Curve

ΔE = S_A · (−ΔD) + S_B · ΔD

On the optimal curve, all sensitivities must be equal

Pareto-optimal: the best that can be achieved without disadvantaging at least one metric.

[Figure: starting from (A0, B0) on curve f(A, B0) at delay D0, tuning variable A creates timing slack ΔD (moving to curve f(A1, B)); variable B then trades the slack back for a net energy reduction]
Slide 4.9
The optimal energy–delay curve as defined earlier is a Pareto-optimal curve (a notion borrowed from economics). An assignment or operational point in a multi-dimensional search is Pareto-optimal if improving on one metric by necessity means hurting another.

An interesting property of a Pareto-optimal point is that the sensitivities to all design variables must be equal. This can be understood intuitively. If the sensitivities are not equal, the difference can be exploited to generate a no-loss improvement. Consider, for instance, the example presented here, where we strive to minimize the energy for a given delay D0. Using the "lower-energy-cost" variable A, we first create some timing slack ΔD at a small expense in energy ΔE (proportional to A's E–D sensitivity). From the new operation point (A1, B0), we can now use the "higher-energy-cost" variable B to achieve an overall energy reduction as indicated by the formula. The fixed point in the optimization is clearly reached when all sensitivities are equal.
Reducing Active Energy @ Design Time

E_active ~ α · C_L · V_swing · V_DD
P_active ~ α · C_L · V_swing · V_DD · f

Reducing voltages
– Lowering the supply voltage (VDD) at the expense of clock speed
– Lowering the logic swing (Vswing)
Reducing transistor sizes (CL)
– Slows down logic
Reducing activity (α)
– Reducing switching activity through transformations
– Reducing glitching by balancing logic
Slide 4.10
In the rest of the chapter, we primarily focus on the circuit and logic layers. Let us first focus on the active component of power dissipation, or, in light of the E–D trade-off perspective, active energy dissipation. The latter is a product of switching activity at the output of a gate, load capacitance at the output, logic swing, and supply voltage. The simple guideline for energy reduction is therefore to reduce each of the terms in the product expression. Some variables, however, are more efficient than others.

The largest impact on active energy seemingly comes from supply voltage scaling, because of its quadratic impact on power (we assume that the logic swing scales accordingly). All other terms have
linear impact. For example, smaller transistors have less capacitance. Switching activity mostly depends on the choice of circuit topology.
For a fixed circuit topology, the most interesting trade-off exists between supply voltage and gatesizing, as these tuning knobs affect both energy and performance. Threshold voltages play asecondary role in this discussion as they impact performance without influencing dynamic energy.
Observation

Downsizing and/or lowering the supply on the critical path lowers the operating frequency
Downsizing non-critical paths reduces energy for free, but
– Narrows down the path–delay distribution
– Increases impact of variations, impacts robustness

[Figure: histograms of the number of paths versus path delay tp(path), before and after downsizing; downsizing non-critical paths shifts the distribution toward the target delay, making many more paths critical]
Slide 4.11
Throughout this discussion, it is useful to keep in mind that the optimizations in the E–D space also impact other important design metrics that are not captured here, such as area or reliability. Take, for example, the relationship between transistor sizing and circuit reliability. Trimming the gates on the non-critical paths saves power without a performance penalty – and hence seems to be a win-win operation. Yet in the extreme case, this results in all paths becoming critical (unless a minimum gate size constraint is reached, of course). This effect is illustrated in the slide. The downsizing of non-critical gates narrows the delay distribution and moves the average closer to the maximum delay. This makes the design vulnerable to process-variation effects and degrades its reliability.
Circuit Optimization Framework

minimize   Energy(VDD, VTH, W)
subject to Delay(VDD, VTH, W) ≤ Dcon

Constraints:
VDD min < VDD < VDD max
VTH min < VTH < VTH max
Wmin < W

Reference case: Dmin sizing @ VDD max, VTH ref

[Ref: V. Stojanovic, ESSCIRC'02]
Slide 4.12
To evaluate fully the impact of the design variables in question, that of supply and threshold voltages and gate size on energy and performance, we need to construct a simple and effective, yet accurate, optimization framework. The search for a globally optimal energy–delay curve for a given circuit topology and activity level is formulated as an optimization problem:

Minimize energy subject to a delay constraint and bounds on the range of the optimization variables (VDD, VTH, and W).
Optimization is performed with respect to a reference design, sized for minimum delay at thenominal supply and threshold voltages as specified for the technology (e.g., VDD = 1.2V andVTH = 0.35V for a 90 nm process). This reference point is convenient, as it is well-defined.
Optimization Framework: Generic Network

Gate in stage i loaded by fan-out (stage i+1)

[Figure: generic two-stage network – gate i with input capacitance Ci and parasitic capacitance γCi drives the wire capacitance Cw and the input capacitance Ci+1 of stage i+1; the two stages may operate from supplies VDD,i and VDD,i+1]
Slide 4.13
The core of the framework consists of effective models of delay and energy as a function of the design parameters. To develop the expressions, we assume a generic circuit configuration as illustrated in the slide. The gate under study is at the i-th stage of a logical network, and is loaded by a number of gates in stage i+1, which we have lumped into a single equivalent gate. Cw represents the capacitance of the wire, which we will assume to be proportional to the fan-out (this is a reasonable assumption for a first-order model).
Alpha-Power Based Delay Model

tp = Kd · (VDD / (VDD − Von)^αd) · τnom · (1 + f′i/γ),   with f′i = (Ci+1 + Cw,i)/Ci

Fit parameters: Von, αd, Kd, γ
VDD ref = 1.2 V, 90 nm technology
Von = 0.37 V, αd = 1.53, τnom = 6 ps, γ = 1.35

[Figure: model versus simulation – propagation delay (ps) versus fan-out Ci+1/Ci at nominal supply, and normalized FO4 delay versus VDD/VDD ref; the fitted model tracks the simulated data closely in both sweeps]
Slide 4.14
The delay modeling of the complex gate i proceeds in two steps. First, we derive the delay of an inverter as a function of supply voltage, threshold, and fan-out; next, we expand this to more complex gates.

The delay of an inverter is expressed using a simple linear delay model, based on the alpha-power law for the drain current (see Chapter 2). Note that this model is based on curve-fitting. The parameters Von and αd are intrinsically related, yet not equal, to the transistor threshold and the velocity saturation index. Kd is another fit parameter and relates to the transconductance of the process (amongst others). The model fits SPICE-simulated data quite nicely, across a range of supply voltages, normalized to the nominal supply voltage (which is 1.2 V for our 90 nm CMOS technology). Observe that this model is only valid if the supply voltage exceeds the threshold voltage by a reasonable amount. (This constraint will be removed in Chapter 11, where we present a modified model that extends into the sub-threshold region.)
The fan-out f = Ci+1/Ci represents the ratio of the load capacitance divided by the gate capacitance. A small modification allows for the inclusion of the wire capacitance (f′). γ is another technology-dependent parameter, representing the ratio between the output and input capacitance of a minimum-sized unloaded inverter.
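As a quick sketch, the fitted delay model can be coded directly. The formula and fit values (Von = 0.37 V, αd = 1.53, τnom = 6 ps, γ = 1.35) follow the reconstruction above; Kd is simply normalized to 1 here, so absolute delays are illustrative only.

```python
# Sketch of the fitted alpha-power delay model:
#   tp = Kd * VDD / (VDD - Von)**alpha_d * tau_nom * (1 + f_prime / gamma)
# Fit values are the ones quoted for the 90 nm process; Kd assumed = 1.

def inverter_delay(vdd, f_prime, von=0.37, alpha_d=1.53,
                   tau_nom=6e-12, gamma=1.35, kd=1.0):
    """Propagation delay (s); valid only for VDD sufficiently above Von."""
    assert vdd > von, "model breaks down near/below the threshold"
    voltage_factor = kd * vdd / (vdd - von) ** alpha_d
    return voltage_factor * tau_nom * (1 + f_prime / gamma)

# Delay grows roughly linearly with fan-out at fixed VDD ...
d_fo1 = inverter_delay(1.2, 1.0)
d_fo4 = inverter_delay(1.2, 4.0)
# ... and rises steeply as VDD approaches Von.
d_low = inverter_delay(0.6, 4.0)
```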
Combined with Logical-Effort Formulation

For Complex Gates:

tp = τnom · (pi + fi·gi/γ)

Logical effort gi – depends upon gate topology
Electrical effort fi ≈ Si+1/Si
Parasitic delay pi – depends upon gate topology
Effective fan-out hi = fi·gi

[Ref: I. Sutherland, Morgan-Kaufman'99]
Slide 4.15
The other part of the model is based on the logical-effort formulation, which extends the notion to complex gates. Using the logical-effort notation, the delay can be expressed simply as a product of the process-dependent time constant τnom and a unitless delay, pi + fi·gi/γ, in which g is the logical effort that quantifies the relative ability of a gate to deliver current, f is the ratio of the total output to input capacitance of the gate, and p represents the delay component due to the self-loading of the gate. The product of the logical effort and the electrical effort is called the effective fan-out h. Gate sizing enters the equation through the fan-out factor f = Si+1/Si.
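The formulation above can be sketched for a small gate chain. The (g, p) values for the inverter and the 2-input NAND are the standard logical-effort estimates for static CMOS; the sizes and load are made-up numbers.

```python
# Logical-effort delay of a gate chain:
#   tp_i = tau_nom * (p_i + f_i * g_i / gamma),  h_i = f_i * g_i

TAU_NOM = 6e-12   # process time constant (s), from the 90 nm fit
GAMMA = 1.35

def stage_delay(g, p, f):
    """Delay of one stage: parasitic plus effort delay, scaled by tau_nom."""
    return TAU_NOM * (p + f * g / GAMMA)

def path_delay(gates, sizes, c_load):
    """gates: list of (g, p); sizes: input capacitance per stage (same units as c_load)."""
    total = 0.0
    for i, (g, p) in enumerate(gates):
        c_next = sizes[i + 1] if i + 1 < len(gates) else c_load
        f = c_next / sizes[i]          # electrical effort f_i = S_{i+1}/S_i
        total += stage_delay(g, p, f)
    return total

# inverter (g = 1), 2-input NAND (g = 4/3), inverter, driving a load of 16
chain = [(1.0, 1.0), (4.0 / 3.0, 2.0), (1.0, 1.0)]
d = path_delay(chain, [1.0, 2.0, 4.0], 16.0)
```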
Dynamic Energy

Edyn,i = (Cw,i + Ci+1 + γCi) · VDD,i² = (γ + f′i) · Ci · VDD,i²,   with Ci = Ke·Si

Ei = Ke·Si · (VDD,i−1² + γ·VDD,i²)
   = energy consumed by logic gate i

[Figure: the same generic two-stage network as before – gate i (capacitance Ci, supply VDD,i) driving the wire capacitance Cw and gate i+1 (capacitance Ci+1, supply VDD,i+1)]
Slide 4.16
For the time being, we only consider the switching energy of the gate. In this model, f′iCi is the total load at the output, including wire and gate loads, and γCi is the self-loading of the gate. The total energy stored on these capacitances is the energy taken out of the supply voltage in stage i.

Now, if we change the size of the gate in stage i, it affects only the energy stored on the input capacitance and parasitic capacitance of that gate. Ei hence is defined as the energy that the gate at stage i contributes to the overall energy dissipation.
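The per-stage energy expression can be sketched as follows; Ke is an assumed capacitance-per-unit-size constant chosen purely for illustration.

```python
# Per-stage switching energy (reconstruction from the slide):
#   E_i = Ke * S_i * (VDD_{i-1}**2 + gamma * VDD_i**2)
# i.e. the energy attributed to gate i is what sizing S_i actually affects:
# its input cap (charged by stage i-1) plus its parasitic cap (gamma * C_i).

GAMMA = 1.35
KE = 1e-15   # assumed capacitance per unit size (F), illustrative only

def gate_energy(s_i, vdd_prev, vdd_i, ke=KE, gamma=GAMMA):
    """Energy (J) attributed to the gate of size s_i."""
    return ke * s_i * (vdd_prev ** 2 + gamma * vdd_i ** 2)

# Doubling a gate's size doubles its attributed energy ...
e1 = gate_energy(1.0, 1.2, 1.2)
e2 = gate_energy(2.0, 1.2, 1.2)
# ... while lowering only this stage's supply scales just the parasitic term.
e_scaled = gate_energy(1.0, 1.2, 0.8)
```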
Slide 4.17
As mentioned, sensitivity analysis provides intuition about the profitability of optimization. Using the models developed in the previous slides, we can now derive expressions for the sensitivities to some of the key design parameters.

The formulas indicate that the largest potential for energy savings is at the minimum delay, Dmin, which is obtained by equalizing the effective fan-out of all stages, and setting the supply voltage at the maximum allowable value. This observation intuitively makes sense: at minimum delay, the delay cannot be reduced beyond the minimum achievable value, regardless of how much energy is spent. At the same time, the potential of energy savings through voltage scaling decreases with reducing supply voltages: E decreases, while D and the ratio Von/VDD increase.

The key point to realize is that optimization primarily exploits the tuning variable with the largest sensitivity, which ultimately leads to the solution where all sensitivities are equal. You will see this concept at work in a number of examples.
Example: Inverter Chain

Properties of inverter chain
– Single-path topology
– Energy increases geometrically from input to output

Goal
– Find optimal sizing S = [S1, S2, …, SN], supply voltage, and buffering strategy to achieve the best energy–delay trade-off

[Figure: chain of N inverters with sizes S1 = 1, S2, S3, …, SN driving a load CL]
Slide 4.18
We use a number of well-known circuit topologies to illustrate the concepts of circuit optimization for energy. The examples differ in the amount of off-path loading and path reconvergence. By analyzing how these properties affect the energy profile, we may come to some general principles related to the impact of the various design parameters. More precisely, we study the (well-understood) inverter chain and the tree adder – as these examples differ widely in the number of paths and path reconvergence.

Let us begin with the inverter chain. The goal is to find the optimal sizing, the supply voltages, and the number of stages that result in the best energy–delay trade-off.
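Before turning to the detailed results, the trade-off can be illustrated with a brute-force toy optimizer: a three-stage chain driving a fixed load, using the first-order unitless delay and energy models from the previous slides (τnom and Ke normalized away). The grid and load values are arbitrary choices for illustration.

```python
# Brute-force sketch of the inverter-chain trade-off: sweep intermediate
# stage sizes, then pick the minimum-energy sizing that meets a delay cap.

import itertools

GAMMA = 1.35
C_LOAD = 64.0

def chain_delay(sizes):
    """Sum of per-stage unitless delays p + f/gamma (inverters: g = 1, p = 1)."""
    caps = list(sizes) + [C_LOAD]
    return sum(1.0 + (caps[i + 1] / caps[i]) / GAMMA for i in range(len(sizes)))

def chain_energy(sizes):
    """Switching energy ~ sum of input plus parasitic caps (fixed VDD, Ke = 1)."""
    return sum(s * (1.0 + GAMMA) for s in sizes)

def optimize(delay_cap, grid):
    """Exhaustively search stage-2/3 sizes (stage 1 fixed at 1)."""
    best = None
    for s2, s3 in itertools.product(grid, repeat=2):
        sizes = (1.0, s2, s3)
        if chain_delay(sizes) <= delay_cap:
            e = chain_energy(sizes)
            if best is None or e < best[0]:
                best = (e, sizes)
    return best

grid = [x / 2.0 for x in range(2, 65)]        # candidate sizes 1.0 .. 32.0
d_min = chain_delay((1.0, 4.0, 16.0))         # equal fan-out of 4: minimum delay
e_ref = chain_energy((1.0, 4.0, 16.0))
e_opt, sizing = optimize(1.10 * d_min, grid)  # allow 10% extra delay
# The relaxed design tapers down the back-end stages, cutting energy well
# below the minimum-delay reference.
```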
Optimizing Return on Investment (ROI)

Depends on Sensitivity (∂E/∂D)

Gate Sizing:
(∂E/∂Si) / (∂D/∂Si) = −Ei / (τnom · (hi − hi−1))
→ ∞ for equal h (Dmin)

Supply Voltage:
(∂E/∂VDD) / (∂D/∂VDD) = −2 · (E/D) · (1 − Von/VDD) / (αd − 1 + Von/VDD)
→ max at VDD(max) (Dmin)
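A small numerical check of the two sensitivity expressions as reconstructed above (the operating-point numbers are made up): the sizing sensitivity blows up as neighboring effective fan-outs equalize (the Dmin condition), while the supply sensitivity shrinks as VDD is lowered.

```python
# Evaluating the reconstructed sensitivity formulas at sample points.

def sizing_sensitivity(e_i, h_i, h_prev, tau_nom=1.0):
    """S_size = -E_i / (tau_nom * (h_i - h_{i-1})); -> -inf as h_i -> h_{i-1}."""
    return -e_i / (tau_nom * (h_i - h_prev))

def supply_sensitivity(e, d, vdd, von=0.37, alpha_d=1.53):
    """S_vdd = -2 (E/D) (1 - Von/VDD) / (alpha_d - 1 + Von/VDD)."""
    x = von / vdd
    return -2.0 * (e / d) * (1.0 - x) / (alpha_d - 1.0 + x)

# Near minimum delay (almost-equal effective fan-outs) sizing sensitivity explodes:
s_near = abs(sizing_sensitivity(1.0, 4.0, 3.9))
s_far = abs(sizing_sensitivity(1.0, 4.0, 2.0))
# Supply sensitivity shrinks as VDD is lowered (E falls, D and Von/VDD grow):
s_hi = abs(supply_sensitivity(1.0, 1.0, 1.2))
s_lo = abs(supply_sensitivity(0.5, 2.0, 0.6))
```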
Slide 4.19
The inverter chain has been the focus of a lot of attention, as it is a critical component in digital design, and some clear guidelines about optimal design can be derived in closed form. For minimum delay, the fan-out of each stage is kept constant, and each subsequent stage is up-sized with a constant factor. This means that the energy stored per stage increases geometrically toward the output, with the largest energy stored in the final load.

In a first step, we consider solely transistor sizing. For a given delay increment, the optimum size of each stage, which minimizes energy, can be derived. The sensitivities derived in Slide 4.17 already give a first idea of what may unfold: the sensitivity to gate sizing is proportional to the energy stored on the gate, and is inversely proportional to the difference in effective fan-outs. What this means is that, for equal sensitivity in all stages, the difference in the effective fan-outs of a gate must increase in proportion to the energy stored on the gate, indicating that the difference in the effective fan-outs should increase exponentially toward the output.

This result was already analytically derived by Ma and Franzon [Ma, JSSC'94], who showed that a tapered staging is the best way to combine performance and energy efficiency. One caveat: at large delay increments, a more efficient solution can be found by reducing the number of stages – this was not included as a design parameter in this first-order optimization, in which the topology was kept unchanged.
Inverter Chain: VDD Optimization

VDD reduces energy of the final load first
Variable taper achieved by voltage scaling

[Figure: optimal per-stage VDD/VDD nom over stages 1–7 for delay increments Dinc = 0%, 1%, 10%, 30%, 50%; larger increments pull the supplies of the later stages down first]
Slide 4.20
Let us now consider the potential of supply voltage scaling. We assume that each stage can be run at a different voltage. As in sizing, the optimization tackles the largest consumers – the final stages – first by scaling their supply voltages. The net effect is similar to a "virtual" tapering. An important difference between sizing and supply reduction is that sizing does not affect the energy stored in the final output load CL. Supply reduction, on the other hand, lowers this source of energy consumption first, by reducing the supply voltage of the gate that drives the load. As (dis)charging CL is the largest source of energy consumption, the impact of this is quite profound.
Inverter Chain: Gate Sizing

Variable taper achieves minimum energy

∂E/∂Si ∝ Ei / (hi − hi−1): on the optimal curve the per-stage sensitivities are equal, so the fan-out difference hi − hi−1 must grow in proportion to the stage energy Ei

[Figure: optimal effective fan-out h per stage (1–7) for delay increments Dinc = 0%, 1%, 10%, 30%, 50%; the fan-out of the later stages grows with increasing delay slack]

[Ref: Ma, JSSC'94]
Inverter Chain: Optimization Results

Parameter with the largest sensitivity has the largest potential for energy reduction
Two discrete supplies mimic per-stage VDD

[Figure: normalized sensitivity and energy reduction (%) versus delay increment Dinc (0–50%) for sizing S, global supply gVDD, two discrete supplies 2VDD, and per-stage custom supply cVDD]
Slide 4.21
Now, how good can all this be in terms of energy reduction? In the graphs, we present the results of various optimizations performed on the inverter chain: sizing, reducing the global VDD, two discrete VDDs, and a customizable VDD per stage. For each of these cases, the sensitivity and the energy reduction are plotted as functions of the delay increment (over Dmin). The prime observation is that increasing the delay by 50% reduces the energy dissipation by more than 70%. Again, it is shown that for any value of the delay increment, the parameter with the largest sensitivity has the largest potential for energy reduction. For example, at small delay increments sizing has the largest sensitivity (initially infinity), so it offers the largest energy reduction. Its potential however quickly falls off. At large delay increments, it pays to scale the supply voltage of the entire circuit, achieving a sensitivity equal to that of sizing at around 25% excess delay. The largest reductions can be obtained by custom voltage scaling. Yet, two discrete voltages are almost as good, and are a lot simpler from an implementation perspective.
Example: Kogge–Stone Tree Adder

Tree adder
– Long wires
– Reconvergent paths
– Multiple active outputs

[Figure: Kogge–Stone adder slice with inputs (A0, B0) … (A15, B15) and Cin, producing sum outputs S0 … S15]

[Ref: P. Kogge, Trans. Comp'73]
Slide 4.22
An inverter chain has a particularly simple energy distribution, which grows geometrically until the final stage. This type of profile drives the optimization (for both sizing and supply) to focus on the final stages first. However, most practical circuits have a more complex energy profile.

An interesting counterpart is formed by the tree adder, which features long wires, large fan-out variations, reconvergent fan-out, and multiple active outputs qualified by paths of various logic depths. We have selected a popular instance of such an adder, the Kogge–Stone version, for our study [Kogge'73, Rabaey'03]. The overall architecture of the adder consists of a number of propagate/generate functions at the inputs (identified by the squares), followed by carry-merge operators
(circles). The final-sum outputs are generated through XOR functions (diamonds). To balance the delay paths, buffers (triangles) are inserted in many of the paths.
Tree Adder: Sizing vs. Dual-VDD Optimization

Reference design: all paths are critical
Internal energy ⇒ S more effective than VDD
– S: E(−54%), Dual VDD: E(−27%) at Dinc = 10%

[Figure: three energy maps over the 64 bit slices and logic stages 1–9 – the reference design at D = Dmin, the sized design at Dinc = 10% (E −54%), and the dual-VDD design at Dinc = 10% (E −27%)]
Slide 4.23
The adder topology is best understood in a two-dimensional plane. One axis is formed by the different bit slices N (we are considering a 64-bit adder for this example), whereas the other is formed by the consecutive gate stages. As befits a tree adder, the number of stages equals log2(N)+M, where M covers the extra stages for propagate/generate and the final XOR functionality. The energy of an internal node is best understood when plotted with respect to this two-dimensional topology.

As always, we start from a reference design that is optimized for minimum delay, and we explore how we can trade off energy and delay starting from that point. The initial sizing makes all paths in the adder equal to the critical path. The first figure shows the energy map for the minimum delay. Though the output nodes are responsible for a sizable fraction of the energy consumption, a number of internal nodes (around stage 5) dominate.

The large internal energy increases the potential for energy reduction through gate sizing. This is illustrated by the case where we allow for a 10% delay increase. We have plotted the energy distribution resulting from sizing, as well as from the introduction of two discrete supply voltages. The former results in a 54% reduction in overall energy, whereas the latter only (!) saves 27%.

This result can be explained as follows. Given the fact that the dominant energy nodes are internal, sizing allows each of these nodes to be attacked individually without too much of a global impact. In the case of dual supplies, one must be aware that driving a high-voltage node from a low-voltage node is hard. Hence the preferable assignment of low-voltage nodes is to start from the output nodes and to work one's way toward the input nodes. Under these conditions, we have already sacrificed a lot of delay slack on low-energy intermediate nodes before we reach the internal high-energy nodes. In summary, supply voltages cannot be randomly assigned to nodes. This makes the usage of discrete supply voltages less effective in modules with high internal energy.
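The assignment order argued here can be sketched as a tiny greedy pass. The netlist representation, the slack values, and the rule "a gate may move to the low supply only if all of its fanouts already have" are illustrative simplifications, not the book's procedure.

```python
# Sketch of output-first low-VDD assignment: a low-VDD gate must not drive a
# high-VDD gate (that would need a level converter), so assignment proceeds
# from the outputs backward toward the inputs.

def assign_low_vdd(gates, fanout, slack):
    """Greedily move gates to the low supply, outputs first, while they have
    timing slack and all of their fanouts are already low-VDD (or outputs)."""
    low = set()
    # process in reverse topological order: consumers before their drivers
    for g in reversed(gates):
        fans = fanout.get(g, [])
        if slack[g] > 0 and all(f in low for f in fans):
            low.add(g)
    return low

# tiny chain: g1 -> g2 -> g3 -> primary output (all names hypothetical)
gates = ["g1", "g2", "g3"]                      # topological order
fanout = {"g1": ["g2"], "g2": ["g3"], "g3": []}
slack = {"g1": 2.0, "g2": 0.0, "g3": 1.0}
# g3 qualifies; g2 has no slack, which also blocks g1 despite its own slack -
# exactly the "sacrificed slack on intermediate nodes" effect described above.
low_set = assign_low_vdd(gates, fanout, slack)
```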
Slide 4.24
We can now put it all together, and explore the tree adder in the energy–delay space. Each of the design parameters (VDD, VTH, S) is analyzed separately and in combination with the others. (Observe that inclusion of the threshold voltage as a design parameter only makes sense when the leakage energy is considered as well – how this is done is discussed later in the chapter.)
A couple of interesting conclusions can be drawn:
– Through circuit optimization, we can reduce the energy consumption of the adder by a factor of 10 by doubling the delay.
– Exploiting only two out of the three variables yields close to the optimal gain. For the adder, the most effective parameters are sizing and threshold selection. At the reference design point, sizing and threshold reduction feature the largest and the smallest sensitivities, respectively. Hence, this combination has the largest potential for energy reduction along the lines demonstrated in Slide 4.8.
– Finally, circuit optimization is most effective in a small region around the reference point. Expanding beyond that region typically becomes too expensive in terms of energy or delay cost for small gains, yielding a reduced return on investment.
Slide 4.25
So far, we have studied the theoretical impact of circuit optimization on energy and delay. In reality, the design space is more constrained. Choosing a different supply or threshold voltage for every gate is not a practical option. Transistor sizes come in discrete values, as determined by the available design library. One of the fortunate conclusions emerging from the preceding studies is that a couple of well-chosen discrete values for each of the design parameters can get us quite close to the optimum.

Let us first consider the practical issues related to the use of multiple supply voltages – a practice that until recently was not common in digital integrated circuit design at all. It impacts the layout strategy and complicates the verification process (as will be discussed in Chapter 12). In addition, generating, regulating, and distributing multiple supplies are non-trivial tasks.
A number of different design strategies exist with respect to the usage of multiple supply voltages. The first is to assign the voltage at the block/macro level (the so-called voltage island
Tree Adder: Multi-dimensional Search
• Can get pretty close to optimum with only two variables
• Getting the minimum speed or delay is very expensive
[Figure: Energy/Eref versus Delay/Dmin trade-off curves, relative to the reference design point, for the parameter combinations {S, VDD}, {VDD, VTH}, {S, VTH}, and {S, VDD, VTH}.]
Multiple Supply Voltages
• Block-level supply assignment
– Higher-throughput/lower-latency functions are implemented in higher VDD
– Slower functions are implemented with lower VDD
– This leads to so-called voltage islands with separate supply grids
– Level conversion performed at block boundaries
• Multiple supplies inside a block
– Non-critical paths moved to lower supply voltage
– Level conversion within the block
– Physical design challenging
approach). This makes particular sense in case some modules have higher performance/activity requirements than others (for instance, a processor's data path versus its memory). The second and more general approach is to allow for voltage assignment all the way down to the gate level ("custom voltage assignment"). In general, this means that gates on non-critical paths are assigned a lower supply voltage. Be aware that having signals at different voltage levels requires the insertion of level converters. It is preferable if these are limited in number (as they consume extra energy) and occur only at the boundaries of the modules.
Using Three VDD's
V1 = 1.5 V, VTH = 0.3 V
[Figure: power reduction ratio as a function of the second and third supplies V2 and V3 (each swept from 0.4 V to 1.4 V); the minimum occurs near V2 ≈ 1 V and V3 ≈ 0.7 V.]
[Ref: T. Kuroda, ICCAD'02]
Slide 4.26
With respect to multiple supply voltages, one cannot help wondering about the following question: if multiple supply voltages are employed, how many discrete levels are sufficient, and what are their values? This slide illustrates the potential of using three discrete voltage levels, as was studied by Tadahiro Kuroda [Kuroda, ICCAD'02]. Supply assignment to the individual logic gates is performed by an optimization routine that minimizes energy for a given clock period. With the main supply fixed at 1.5 V, providing a second and third supply yields a nearly twofold power reduction ratio.
A number of useful observations can be drawn from the graphs:
• The power minimum occurs for V2 ≈ 1 V and V3 ≈ 0.7 V.
• The minimum is quite shallow. This is good news, as it means that small deviations around this minimum (as caused, for instance, by IR drops) will not have a big impact.
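The kind of gate-level supply assignment used in this study can be approximated with a small brute-force search. The sketch below assumes a simple alpha-power delay model and V²-proportional dynamic power; V1 and VTH match the slide, but the slack distribution and model constants are illustrative, so the optimum it finds will not reproduce the published V2 ≈ 1 V, V3 ≈ 0.7 V exactly.

```python
# Brute-force exploration of a second and third supply, in the spirit of
# [Kuroda, ICCAD'02]. Delay and power models are simplified placeholders.

V1, VTH, ALPHA = 1.5, 0.3, 1.3

def gate_delay(v):
    return v / (v - VTH) ** ALPHA           # alpha-power delay, valid for v > VTH

def best_supply(slack_ratio, supplies):
    # lowest supply whose delay stays within slack_ratio times the delay at V1
    limit = slack_ratio * gate_delay(V1)
    ok = [v for v in supplies if gate_delay(v) <= limit]
    return min(ok)                          # V1 itself always qualifies

def total_power(slacks, supplies):
    return sum(best_supply(s, supplies) ** 2 for s in slacks)

def search(slacks, grid):
    """Sweep (V2, V3) over `grid`, assigning each gate (characterized by its
    delay slack ratio) the lowest feasible supply."""
    base = total_power(slacks, [V1])
    best = (1.0, V1, V1)
    for v2 in grid:
        for v3 in grid:
            if not (VTH < v3 <= v2 <= V1):
                continue
            ratio = total_power(slacks, [V1, v2, v3]) / base
            if ratio < best[0]:
                best = (ratio, v2, v3)
    return best                             # (power ratio, V2, V3)
```

The flat minimum observed in the slide shows up in such a search as many (V2, V3) pairs with nearly identical power ratios.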
The question now is how much impact on power each additional supply carries.
Optimum Number of VDDs
• The more VDDs, the lower the power, but the effect saturates
• The power reduction effect decreases with scaling of VDD
• Optimum V2/V1 is around 0.7
[Figure: optimal supply ratios (V2/V1, V3/V1, V4/V1) and power ratios (P2/P1, P3/P1, P4/P1) versus V1 (0.5–1.5 V) for the cases {V1, V2}, {V1, V2, V3}, and {V1, V2, V3, V4}.]
© IEEE 2001
[Ref: M. Hamada, CICC'01]
Slide 4.27
In fact, the marginal benefits of adding extra supplies quickly bottom out. Although adding a second supply yields big savings, the extra reductions obtainable by adding a third or a fourth are marginal. This makes sense, as the number of (non-critical) gates that can benefit from the additional supply shrinks with each iteration. For example, the fourth supply works only for non-critical-path gates close to the tail of the delay distribution. Another observation is that the power savings obtainable from using multiple supplies reduce with the scaling of the main supply voltage (for a fixed threshold).
Lessons: Multiple Supply Voltages
• Two supply voltages per block are optimal
• The optimal ratio between the supply voltages is 0.7
• Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)
• An option is to use an asynchronous level converter
– More sensitive to coupling and supply noise
Slide 4.28
Our discussion on multiple discrete supply voltages can be summarized with a number of rules-of-thumb:

• The largest benefit is obtained by adding a second supply.
• The optimal ratio between the discrete supplies is approximately 0.7.
• Adding a third supply provides an additional 5–10% incremental savings. Going beyond that does not make much sense.
Distributing Multiple Supply Voltages
[Figure: two layout styles for dual supplies – "Conventional", with VDDH and VDDL circuits placed in separate wells, and "Shared n-well", with both the VDDH and VDDL rails (and a common VSS) available to every cell.]
Slide 4.29
Distribution of multiple supply voltages requires careful examination of the floorplanning strategy. The conventional way to support multiple VDDs (two in this case) is to place gates with different supplies in different wells (e.g., low-VDD and high-VDD). This approach does not require a redesign of the standard cells, but comes with an area overhead owing to the necessary spacing between n-wells at different voltages. Another way to introduce the second supply is to provide two VDD rails for every standard cell, and selectively route the cells to the appropriate supply. This "shared n-well" approach also comes with an area overhead owing to the extra voltage rail. Let us further analyze both techniques to see what kind of system-level trade-offs they introduce.
Conventional
[Figure: two floorplans for the conventional dual-supply approach, with n-well isolation between VDDH and VDDL circuits – (a) dedicated rows, alternating VDDH and VDDL cell rows; (b) dedicated regions, with a VDDH region and a VDDL region.]
Slide 4.30
In the conventional dual-voltage approach, the most straightforward method is to cluster gates with the same supply (scheme b). This scheme works well for the "voltage island" model, where a single supply is chosen for a complete module. It does not fit the "custom voltage assignment" model very well, though. Logic paths consisting of both high-VDD and low-VDD cells incur additional overhead in wire delay due to long wires between the voltage clusters. The extra wire capacitance also reduces the power savings. Maintaining spatial locality of connected combinational logic gates is essential.

Another approach is to assign voltages per row of cells (scheme a). Both VDDL and VDDH are routed only to the edge of the rows, and a special standard cell is added that selects between the two voltages (obviously, this applies only to the standard-cell methodology). This approach suits the "custom voltage assignment" style better, as the per-row assignment provides a finer granularity and the overhead of moving between voltage domains is smaller.
Shared n-Well
[Figure: floorplan image of the shared n-well approach, with VDDH and VDDL circuits abutted and sharing the VDDH, VDDL, and VSS rails.]
[Shimazaki et al., ISSCC'03]
Slide 4.31
The most versatile approach is to redesign the standard cells, and have both VDDL and VDDH rails inside the cell ("shared n-well"). This approach is quite attractive, because we do not have to worry about area partitioning – both low-VDD and high-VDD cells can be abutted to each other. This approach was demonstrated on a high-speed adder/ALU circuit by Shimazaki et al. [Shimazaki, ISSCC'03]. However, it comes with a per-cell area overhead. Also, low-VDD cells experience reverse body biasing on the PMOS transistors, which degrades their performance.
Example: Multiple Supplies in a Block
• The lower-VDD portion is shared: "clustered voltage scaling"
[Figure: conventional design versus CVS structure – in both, flip-flops feed combinational paths, with the critical path highlighted; in the CVS structure, gates on non-critical paths operate at the lower supply, and level-shifting flip-flops restore the full swing.]
[Ref: M. Takahashi, ISSCC'98]
© IEEE 1998
Slide 4.32
Level conversion is another important issue in designing with multiple discrete supply voltages. It is easy to drive a low-voltage gate from a high-voltage one, but the opposite transition is hard owing to extra leakage, degraded signal slopes, and a performance penalty. It is hence worthwhile minimizing the number of low-to-high connections.

As we will see in the next few slides, low-to-high level conversion is best accomplished using positive feedback – which is naturally present in flip-flops and registers. This leads to the following strategy: every logical path starts at the high voltage level. Once a path transitions to the low voltage, it never switches back. The next up-conversion happens in flip-flops. Supply voltage assignment starts from the critical paths and works backward to find non-critical paths where the supply voltage can be reduced. This strategy is illustrated in the slide. The conventional design on the left has all gates operating at the nominal supply (the critical path is highlighted). Working backward from the flip-flops, non-critical paths are gradually converted to the low voltage until they become critical (gray-shaded gates operate at VDDL). This technique of grouping is called "clustered voltage scaling" (CVS).
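A minimal sketch of the CVS assignment rule follows: walk the gates backward from the flip-flops and demote a gate to VDDL only if all of its fanouts are already at VDDL (or it drives a level-converting flip-flop) and the path slack can absorb the delay penalty. The netlist representation, the fixed 1.5x delay penalty, and the purely local slack bookkeeping are simplifications for illustration; a real flow would re-run static timing analysis as gates are demoted.

```python
# Sketch of clustered voltage scaling (CVS): demote gates to VDDL working
# backward from the flip-flops, never letting a VDDL gate drive a VDDH gate.

PENALTY = 1.5  # assumed relative delay of a VDDL gate vs. a VDDH gate

def cvs_assign(gates, fanouts, slack, reverse_topo):
    """gates: name -> nominal delay; fanouts: name -> fanout gate names
    ([] means the gate drives a level-converting flip-flop); slack:
    name -> spare delay on the gate's worst path; reverse_topo: gate
    names ordered from the outputs back toward the inputs."""
    supply = {g: "VDDH" for g in gates}
    for g in reverse_topo:
        # CVS rule: a gate may go to VDDL only if nothing downstream is VDDH
        downstream_ok = all(supply[f] == "VDDL" for f in fanouts[g])
        extra = gates[g] * (PENALTY - 1.0)
        if downstream_ok and slack[g] >= extra:
            supply[g] = "VDDL"
            slack[g] -= extra          # local bookkeeping only (simplified)
    return supply
```

On a three-gate chain where the output gate has ample slack and the input gate has none, this yields exactly the CVS picture: the tail of the path runs at VDDL and the critical front end stays at VDDH.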
Level-Converting Flip-Flops (LCFFs)
• Pulsed half-latch versus master–slave LCFFs
• Smaller number of MOSFETs / less clock loading
• Faster level conversion using the half-latch structure
• Shorter D–Q path from the pulsed circuit
[Figure: schematics of a master–slave LCFF and a pulsed half-latch LCFF, with level conversion performed in the high-voltage latch stage (internal nodes mo and sf; clock signals ck, ckb, ckd; transistors MN1, MN2).]
© IEEE 2003
[Ref: F. Ishihara, ISLPED'03]
Slide 4.33
As the level-converting flip-flops play a crucial role in the CVS scheme, we present a number of flip-flops that can do level conversion while maintaining good speed.

The first circuit is based on the traditional master–slave scheme, with the master and slave stages operating at the low and high voltages, respectively. The positive feedback action in the slave latch ensures efficient low-to-high level conversion. The high-voltage node sf is isolated from the low-voltage node mo by the pass-transistor, gated by the low-voltage signal ck.

The same concept can also be applied in an edge-triggered flip-flop, as shown in the second circuit (called the pulse-based half-latch). A pulse generator derives a short pulse from the clock edge, ensuring that the latch is enabled only for a very short time. This circuit has the advantage of being simpler.
Dynamic Realization of Pulsed LCFF
• Pulsed precharge LCFF (PPR)
– Fast level conversion by the precharge mechanism
– Suppressed charge/discharge toggle by conditional capture
– Short D–Q path
[Figure: schematic of the pulsed precharge latch, with precharge transistor MP1, evaluation transistors MN1/MN2, and VDDH output stages performing the level conversion.]
[Ref: F. Ishihara, ISLPED'03] © IEEE 2003
Slide 4.34
Dynamic gates with NMOS-only evaluation transistors are naturally suited for operation with reduced logic swing, as the input signal does not need to develop a full high-VDD swing to drive the output node to logic zero. The reduced swing only results in a somewhat longer delay. A dynamic structure with implicit level conversion is shown in the figure.

Observe that level conversion is also possible in an asynchronous fashion. A number of such non-clocked converters will be presented in a later chapter on Interconnect (Chapter 6). Clocked circuits tend to be more reliable, however.
Case Study: ALU for 64-bit Microprocessor
[Figure: block diagram of the 64-bit ALU, comprising carry generation, gp generation, partial-sum generation, a logical unit, sum selection, input multiplexers (9:1, 5:1, 2:1), clock generation, and a long loop-back bus sumb (0.5 pF) driven through INV1/INV2. VDDH circuits (the carry tree) and VDDL circuits (partial-sum generator, logical unit, bus driver) are marked.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.35
A real-life example of a high-performance Itanium-class (Intel) data path helps to demonstrate the effective use of dual VDD. From the block diagram, it is apparent that the critical component from an energy perspective is the very large output capacitance of the ALU, which is due to its high fan-out. Hence, lowering the supply voltage on the output bus yields the largest potential for power reduction.

The shared-well technique was chosen for the implementation of this 64-bit ALU module, which is composed of the ALU, the loop-back bus driver, the input operand selectors, and the register files. For performance reasons, a domino circuit style was adopted. As carry generation is the most critical operation, circuits in the carry tree are assigned to the VDDH domain. On the other hand, the partial-sum generator and the logical unit are assigned to the VDDL domain. In addition, the bus driver, as the gate with the largest load, is also supplied from VDDL. The level conversion from the VDDL signal to the VDDH signal is performed by the sum selector and the 9:1 multiplexer.
Low-Swing Bus and Level Converter
• Level conversion is done by a domino 9:1 MUX
• INV2 is placed near the 9:1 MUX to increase noise immunity
[Figure: schematic of the low-swing loop-back bus sumb (driven by INV1 from VDDL) and the domino level converter (9:1 MUX) operating at VDDH, with keeper and precharge devices and the select signal ain0sel at VDDH.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.36
This schematic shows the low-swing loop-back bus and the domino-style level converter. Since the loop-back bus sumb has a large capacitive load, a low-voltage implementation is quite attractive. Some issues deserve special attention:

• One of the concerns of the shared-well approach is the reverse biasing of the PMOS transistor. As sum is a monotonically rising signal (the output of a domino stage), this does not impact the performance of the important gate INV1.
• In dynamic-logic designs, noise is one of the critical issues. To eliminate the effects of disturbances on the loop-back bus, the receiver INV2 is placed near the 9:1 multiplexer to increase noise immunity.

• The output of INV2, which is a VDDL signal, is converted to VDDH by the 9:1 multiplexer. The level conversion is fast, as the precharge levels are independent of the level of the input signal.
Measured Results: Energy and Delay
[Figure: measured energy (pJ) versus cycle time TCYCLE (ns) at room temperature for single-supply operation (VDDH = 1.8 V, reaching 1.16 GHz) and shared-well operation. With VDDL = 1.4 V, energy drops by 25.3% for a 2.8% delay increase; with VDDL = 1.2 V, energy drops by 33.3% for an 8.3% delay increase.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.37
This figure plots the familiar energy–delay curves of the ALU (as measured). The energy–delay curve for single-supply operation is drawn as a reference. At the nominal supply voltage of 1.8 V (for a 180 nm CMOS technology), the chip operates at 1.16 GHz. Introducing a second supply yields an energy saving of 33% at the small cost of an 8% increase in delay. This example demonstrates that the theoretical results derived earlier in this chapter hold up in actual implementations.
Practical Transistor Sizing
• Continuous sizing of transistors is only an option in custom design
• In ASIC design flows, the options are set by the available library
• Discrete sizing options are made possible in a standard-cell design methodology by providing multiple versions of the same cell
– Leads to larger libraries (> 800 cells)
– Easily integrated into technology mapping
Slide 4.38
Transistor sizing is the other high-impact design parameter we have explored at the circuit level so far. The theoretical analysis assumes a continuous sizing model, which is only possible in fully custom design. In ASIC design flows, transistor sizes are predetermined by the cell library. In the early days of application-specific integrated circuit (ASIC) design and automated synthesis, libraries used to be quite small, counting between 50 and 100 cells. Energy considerations have changed the picture substantially. With the need for various sizing options for each logical cell,
industrial libraries now count close to 1000 cells. As with supply voltages, it is necessary to move from a continuous sizing model to a discrete one. Fortunately, the overall impact on energy efficiency of doing so can be quite small.
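The cost of discretization can be illustrated with a toy logical-effort-style model: snap the continuous optimum size to the smallest library drive strength that still meets the delay target, and measure the extra switching energy. The delay model and the size set {1, 2, 4, 8, 16} are assumptions for illustration, not data from any particular library.

```python
# Effect of discretizing transistor sizes: snap a continuous size to the
# smallest available library size that still meets the delay target.
# Simple model: delay(s) = p + c_load / s, switching energy proportional to s.

LIB_SIZES = [1, 2, 4, 8, 16]   # assumed drive strengths offered by the library

def continuous_size(c_load, p, d_target):
    return c_load / (d_target - p)          # smallest s meeting the target

def discrete_size(c_load, p, d_target):
    for s in LIB_SIZES:                     # smallest library cell that fits
        if p + c_load / s <= d_target:
            return s
    return LIB_SIZES[-1]                    # fall back to the largest cell

def energy_overhead(c_load, p, d_target):
    """Fractional extra switching energy paid for rounding up."""
    return discrete_size(c_load, p, d_target) / continuous_size(c_load, p, d_target) - 1.0
```

For a load of 10 fF, parasitic delay 1, and target delay 4 (arbitrary units), the continuous optimum is s ≈ 3.33 while the library forces s = 4, a 20% energy overhead on that one gate; averaged over a whole netlist, with several gates landing close to a library size, the net penalty tends to be far smaller.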
Technology Mapping
• Larger (higher fan-in) gates reduce capacitance, but are slower
[Figure: a small logic network with inputs a, b, c, d, output f, and an annotated slack of 1 on a non-critical path, illustrating the mapping choices.]
Slide 4.39
In the ASIC design flow, it is in the "technology mapping" phase that the actual library cells are selected for the implementation of a given logical function. The logic network, resulting from "technology-independent" optimizations, is mapped onto the library cells such that performance constraints are met and energy is minimized. Hence, this is where the transistor (gate) sizing actually happens. Beyond choosing between identical cells with different sizes, technology mapping also gets to choose between different gate mappings: simple cells with small fan-in, or more complex cells with large fan-in. Over the last decades, it has been common understanding that simple gates are good from a performance perspective, as delay is a quadratic function of fan-in. From an energy perspective, complex gates are more attractive, as their intrinsic capacitance is substantially smaller than the inter-gate routing capacitance of a network of simple gates. Hence, it makes sense for complex gates to be used preferentially on non-critical paths.
Technology Mapping
Example: four-input AND
(a) Implemented using four-input NAND + INV
(b) Implemented using two-input NAND + two-input NOR

Gate type | Area (cell units) | Input cap. (fF) | Average delay (ps), Library 1: High-Speed | Average delay (ps), Library 2: Low-Power
INV | 3 | 1.8 | 7.0 + 3.8CL | 12.0 + 6.0CL
NAND2 | 4 | 2.0 | 10.3 + 5.3CL | 16.3 + 8.8CL
NAND4 | 5 | 2.0 | 13.6 + 5.8CL | 22.7 + 10.2CL
NOR2 | 3 | 2.2 | 10.7 + 5.4CL | 16.7 + 8.9CL

(Delay formulas: CL in fF; numbers calibrated for 90 nm.)
Slide 4.40
This argument is illustrated with an example. In this slide, we have summarized the area, delay, and energy properties of four cells (INV, NAND2, NOR2, NAND4) implemented in a 90 nm CMOS technology. Two different libraries are considered: a low-power and a high-performance version.
Technology Mapping – Example
Four-input AND:

 | (a) NAND4 + INV | (b) NAND2 + NOR2
Area | 8 | 11
HS: Delay (ps) | 31.0 + 3.8CL | 32.7 + 5.4CL
LP: Delay (ps) | 53.1 + 6.0CL | 52.4 + 8.9CL
Sw Energy (fF) | 0.1 + 0.06CL | 0.83 + 0.06CL

• Area
– The four-input version is more compact than the two-input one (two gates vs. three gates)
• Timing
– Both implementations are two-stage realizations
– The second-stage INV in (a) is a better driver than the NOR2 in (b)
– For more complex blocks, simpler gates will show better performance
• Energy
– Internal switching increases energy in the two-input case
– The low-power library has worse delay, but lower leakage (see later)
Slide 4.41
These libraries are used to map the same function, an AND4, using either two-input or four-input gates (NAND4 + INV or NAND2 + NOR2). The resulting metrics show that the complex-gate implementation yields a substantial reduction in energy and also reduces area. For this simple example, the complex-gate version is just as fast, if not faster. However, this is due to the somewhat simplistic nature of the example. The situation becomes even more pronounced if the library contains very complex gates (e.g., fan-in of 5 or 6).
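The composite numbers of this slide follow directly from the cell table of Slide 4.40: the first stage is evaluated with the second stage's input capacitance as load, and the second stage contributes its load-dependent term. A short script reproduces them:

```python
# Recomputing the two AND4 mappings of Slide 4.41 from the cell data of
# Slide 4.40 (delay formulas in ps, load CL in fF, 90 nm calibration).
# cells: name -> (input cap in fF, HS delay (a, b) meaning a + b*CL, LP delay)

cells = {
    "INV":   (1.8, (7.0, 3.8),  (12.0, 6.0)),
    "NAND2": (2.0, (10.3, 5.3), (16.3, 8.8)),
    "NAND4": (2.0, (13.6, 5.8), (22.7, 10.2)),
    "NOR2":  (2.2, (10.7, 5.4), (16.7, 8.9)),
}

def two_stage_delay(first, second, lib):
    """Path delay of `first` loaded by `second`'s input cap, plus `second`
    driving an external load CL; returns (constant term, CL coefficient)."""
    idx = 1 if lib == "HS" else 2
    cin2 = cells[second][0]
    a1, b1 = cells[first][idx]
    a2, b2 = cells[second][idx]
    return (a1 + b1 * cin2 + a2, b2)
```

Evaluating `two_stage_delay("NAND4", "INV", "HS")` gives approximately (31.0, 3.8), matching the 31.0 + 3.8CL entry of the table; the other three entries check out the same way.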
Slide 4.42
Technology mapping has brought us almost seamlessly to the next abstraction level in the design process – the logic level. Transistor sizes, voltage levels, and circuit style are the main optimization knobs at the circuit level. At the logic level, the gate-network topology to implement a given
function is chosen and fine-tuned. The link between the two is the already-discussed technology-mapping process. Beyond gate selection and transistor sizing, technology mapping also performs pin assignment. It is well known that, from a performance perspective, it is a good idea to connect the most critical signal to the input pin "closest" to the output node. For a CMOS NAND gate, for instance, this would be the top transistor of the NMOS pull-down chain. From a power reduction point of view, on the other hand, it is wise to connect the most active signal to that node, as this minimizes the switched capacitance.

The technology-independent part of the logic-synthesis process consists of a sequence of optimizations that manipulate the network topology to minimize delay, power, or area. As we have become used to, each such optimization represents a careful trade-off, not only between power and delay, but sometimes also between the different components of power, such as activity and capacitance. This is illustrated with a couple of examples in the following slides.
Logic Restructuring
• Logic restructuring to minimize spurious transitions
• Buffer insertion for path balancing
[Figure: an unbalanced logic network annotated with signal values and path depths (1, 2, 3), showing how unequal path lengths cause glitching, and how restructuring or buffer insertion balances the paths.]
Slide 4.43
In Chapter 3, we established that the occurrence of dynamic hazards in a logic network is minimized when the network is balanced from a timing perspective – that is, when most timing paths are of similar lengths. Paths of unequal length can be equalized in a number of ways: (1) through restructuring of the network, such that an equivalent network with balanced paths is obtained; (2) through the introduction of non-inverting buffers on the fastest paths. The attentive reader will realize that although the latter helps to minimize glitching, the buffers themselves add extra switching capacitance. Hence, as always, buffer insertion is a careful trade-off process. Analysis of circuits
Gate-Level Trade-offs for Power
• Technology mapping
– Gate selection
– Sizing
– Pin assignment
• Logical optimizations
– Factoring
– Restructuring
– Buffer insertion/deletion
– Don't-care optimization
generated by state-of-the-art synthesis tools have shown that simple buffers are responsible for a considerable part of the overall power budget of combinational modules.
Algebraic Transformations – Factoring
• Idea: modify the network to reduce capacitance
• Caveat: this may increase activity!
[Figure: the two factorings f = a·b + a·c and f = a·(b + c), annotated with input probabilities pa = 0.1, pb = 0.5, pc = 0.5 and node transition probabilities (p1 = 0.051, p2 = 0.051, p3 = 0.076 for the first form; p4 = 0.375, p5 = 0.076 for the second).]
Slide 4.44
Factoring is another transformation that may introduce unintended consequences. From a capacitance perspective, it seems obvious that a simpler logical expression would require less power as well. For instance, translating the function f = a·b + a·c into its equivalent f = a·(b + c) seems a no-brainer, as it requires one less gate. However, it may also introduce an internal node with a substantially higher transition probability, as annotated on the slide. This may actually increase the net power. The lesson to be drawn is that power-aware logic synthesis must not only be aware of network topology and timing, but should – to the best possible extent – incorporate parameters such as capacitance, activity, and glitching. In the end, the goal is again to derive the familiar pareto-optimal energy–delay curves, or to reformulate the synthesis process along the following lines: choose the network that minimizes power for a given maximum delay, or minimizes the delay for a maximum power.
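Under an independence assumption, the activity bookkeeping for the two factorings is easy to reproduce. The sketch below uses the per-node convention α = p(1 − p) for a 0→1 transition, which need not match the exact convention behind the slide's annotated numbers; the qualitative conclusion, that the factored form concentrates activity on the new internal node b + c, survives either way.

```python
# Switching activity of the two factorings of f = a*b + a*c, assuming
# independent inputs and alpha = p*(1 - p) per node (0->1 probability).
# Input probabilities as on Slide 4.44: p(a) = 0.1, p(b) = p(c) = 0.5.

def act(p):                                # 0->1 transition probability
    return p * (1.0 - p)

def activity_sum(pa, pb, pc):
    # version 1: f = (a AND b) OR (a AND c) -- two internal AND nodes
    p_ab, p_ac = pa * pb, pa * pc
    p_or = p_ab + p_ac - pa * pb * pc      # P(ab OR ac), shared literal a
    v1 = act(p_ab) + act(p_ac) + act(p_or)
    # version 2: f = a AND (b OR c) -- one internal OR node
    p_bc = pb + pc - pb * pc
    v2 = act(p_bc) + act(pa * p_bc)
    return v1, v2
```

With pa = 0.1 and pb = pc = 0.5, the node b + c switches with α = 0.1875, far more than either AND node (α = 0.0475), so the factored network can burn more power despite having one gate fewer once node capacitances are comparable.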
Lessons from Circuit Optimization
• Joint optimization over multiple design parameters is possible using a sensitivity-based optimization framework
– Equal marginal costs ⇔ energy-efficient design
• Peak performance is VERY power inefficient
– About 70% energy reduction for a 20% delay penalty
– Additional variables enable higher energy efficiency
• Two supply voltages are in general sufficient; three or more supply voltages offer only a small advantage
• The choice between sizing and supply-voltage parameters depends upon circuit topology
• But … leakage has not been considered so far

Slide 4.45
Based on the preceding discussions, we can now draw a clear set of guidelines for energy–delay optimization at the circuit and logical levels. An attempt at doing so is presented in this slide.

Yet, so far we have only addressed dynamic power. In the rest of the chapter we tackle the other important contributor to power in contemporary designs: leakage.
Considering Leakage at Design Time
• Considering leakage as well as dynamic power is essential in sub-100 nm technologies
• Leakage is not essentially a bad thing
– Increased leakage leads to improved performance, allowing for lower supply voltages
– Again a trade-off issue …
Slide 4.46
Leakage has so far been presented as an evil side effect of nanometer-scale technology scaling, something that should be avoided at all cost. However, within a given technology node, this may not necessarily be the case. For instance, a lower threshold (and increased leakage) allows for a lower supply voltage at the same delay – effectively trading off dynamic power for static power. This was already illustrated graphically in Slide 3.41, where power and delay of a logical function were plotted as functions of supply and threshold voltages. Once one realizes that allowing an amount of static power may actually be a good thing, the next question inevitably arises: is there an optimal balance between dynamic and static power, and if so, what is the "golden" ratio?
Leakage – Not Necessarily a Bad Thing
• Optimal designs have high leakage (Elk/Esw ≈ 0.5):

  (Elk/Esw)opt = K / (ln(Ld/αavg) − 2)

• Must adapt to process and activity variations

Topology | Inv | Add | Dec
(Elk/Esw)opt | 0.8 | 0.5 | 0.2

[Figure: normalized energy Enorm versus the static-to-dynamic energy ratio Estatic/Edynamic (log scale, 10^−2 to 10^1) for two versions of the same design (VTHref − 180 mV at 0.81 VDDmax, and VTHref − 140 mV at 0.52 VDDmax), both showing a shallow minimum.]
[Ref: D. Markovic, JSSC'04] © IEEE 2004
Slide 4.47
The answer is an unequivocal yes. This is best illustrated by the graph in this slide, which plots the normalized minimum energy per operation, for a given function and a given delay, as a function of the ratio between static and dynamic energy. The same curve is also plotted for a modified version of the same function.

A number of interesting observations can be drawn from this set of graphs:

• The most energy-efficient designs have a considerable amount of leakage energy.
• For both designs, the static energy at the optimum is approximately 50% of the dynamic energy (or one-third of the total energy), and does not vary very much between the different circuit topologies.
• The curves are fairly flat around the minimum, making the minimum energy somewhat insensitive to the precise ratio.

This ratio does not change much for different topologies unless activity changes by orders of magnitude, as the optimal ratio is a logarithmic function of activity and logic depth. Still, looking at significantly different circuit topologies, we found that the optimal
ratio of leakage-to-switching energy did not change much. Moreover, in the range defined by these extreme cases, the energy of adder-based implementations is still very close to the minimum, for leakage-to-switching ratios from 0.2 to 0.8, as shown in this graph. A similar situation occurs if we analyze inverter-chain and memory-decoder circuits assuming an optimal leakage-to-switching ratio of 0.5.

From this analysis, we can derive a very simple general result: energy is minimized when the leakage-to-switching ratio is about 0.5, regardless of logic topology or function. This is an important practical result. We can use this knowledge to determine the optimal VDD and VTH in a broad range of designs.
Refining the Optimization Model
• Switching energy:

  Edyn = α0→1 · Ke (S + f) · VDD²

• Leakage energy:

  Estat = S · I0(Ψ) · e^((λd·VDD − VTH)/(γ·kT/q)) · VDD · Tcycle

  with I0(Ψ): normalized leakage current with the inputs in state Ψ
Slide 4.48
The effect of leakage is easily introduced into our earlier-defined optimization framework. Remember that the leakage current of a module is a function of the state of its inputs. However, it is often acceptable to use the average leakage over the different states. Another observation is that the ratio between dynamic and static energy is a function of the cycle time and the average activity per cycle.
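A numeric companion to this model makes the trade-off concrete: sweep the threshold, re-solve the supply for constant delay with an alpha-power law, and total the two energy components. All constants below (ALPHA, NKT_Q, KE, I0_T) are illustrative placeholders, so the optimum this toy finds need not land at the ~0.5 ratio quoted for real designs.

```python
import math

# For each threshold, find the supply that keeps delay constant
# (alpha-power law), then sum switching and leakage energy.

ALPHA = 1.3            # velocity-saturation exponent in the delay model
NKT_Q = 0.04           # gamma * kT/q (V), subthreshold slope factor
KE, I0_T = 1.0, 2.0e3  # switching-energy scale, leakage scale (I0 * Tcycle)

def delay(vdd, vth):
    return vdd / (vdd - vth) ** ALPHA

def vdd_for_delay(vth, d_target):
    # delay decreases monotonically with vdd, so bisection works
    lo, hi = vth + 1e-4, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if delay(mid, vth) > d_target:
            lo = mid
        else:
            hi = mid
    return hi

def energies(vth, d_target):
    vdd = vdd_for_delay(vth, d_target)
    e_dyn = KE * vdd ** 2
    e_stat = I0_T * vdd * math.exp(-vth / NKT_Q)
    return e_dyn, e_stat

def sweep(d_target, vth_grid):
    """Return the energy-minimizing threshold and the leakage-to-switching
    ratio at that optimum."""
    best = min(vth_grid, key=lambda v: sum(energies(v, d_target)))
    e_dyn, e_stat = energies(best, d_target)
    return best, e_stat / e_dyn
```

The sweep exhibits the behavior described in the preceding slides: an interior minimum with a non-trivial leakage share, and a shallow energy bowl around it.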
Reducing Leakage @ Design Time
• Using longer transistors
– Limited benefit
– Increase in active current
• Using higher thresholds
– Channel doping
– Stacked devices
– Body biasing
• Reducing the voltage!!
Slide 4.49
When trying to manipulate the leakage current, the designer has a number of knobs at her disposal. In fact, they are quite similar to the ones we used for optimizing the dynamic power: transistor sizes, and threshold and supply voltages. How they influence the leakage current is substantially different, though. The choice of the threshold voltage is especially important.
Longer Channels
• 10% longer gates reduce leakage by 50%
– Increases switching power by 18% with W/L = constant
• Doubling L reduces leakage by 5x
– Impacts performance
– Attractive when not required to increase W (e.g., memory)
[Figure: normalized switching energy and normalized leakage power versus transistor length (100–200 nm) in a 90 nm CMOS process – leakage drops steeply as the channel gets longer, while switching energy rises gradually.]
Slide 4.50
While wider transistors obviously leak more, the chosen transistor length has an impact as well. As already shown in Slide 2.15, very short transistors suffer from a sharp reduction in threshold voltage, and hence an exponential increase in leakage current. In leakage-critical designs such as memory cells, it therefore makes sense to consider the use of transistors with channel lengths longer than those prescribed by the nominal process parameters. This comes at a penalty in dynamic power, though that increase is relatively small. For a 90 nm CMOS technology, it was shown that increasing the channel length by 10% reduces the leakage current by 50%, while raising the dynamic power by 18%. It may seem strange to deliberately forgo one of the key benefits of technology scaling – that is, smaller transistors – yet sometimes the penalty in area and performance is inconsequential, whereas the gain in overall power consumption is substantial.
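These two numbers fix the break-even point directly: drawing the channel 10% longer pays off whenever leakage is a large enough fraction of the total power. A few lines make the arithmetic explicit (the 1.18x and 0.5x factors are the ones quoted above; everything else follows from them):

```python
# Trade-off quoted for 90 nm: gates drawn 10% longer roughly halve leakage
# power but raise switching power by ~18% (constant W/L). Whether the swap
# pays off depends on how leakage-dominated the block is.

SW_UP, LK_DOWN = 1.18, 0.50    # x1.18 switching power, x0.5 leakage power

def power_ratio(leak_fraction):
    """Total power with 10% longer gates, relative to nominal, for a block
    whose nominal power is a fraction `leak_fraction` leakage."""
    sw = 1.0 - leak_fraction
    return sw * SW_UP + leak_fraction * LK_DOWN

def break_even():
    # sw * (1.18 - 1) = lk * (1 - 0.5) at equality; solve for lk fraction
    return (SW_UP - 1.0) / ((SW_UP - 1.0) + (1.0 - LK_DOWN))
```

With these factors, the longer channel wins once leakage exceeds roughly 26% of total power (0.18/0.68), and a block that is half leakage, such as a low-activity memory array, sees its total power drop to about 84% of nominal.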
Using Multiple Thresholds
• There is no need for level conversion
• Dual thresholds can be added to standard design flows
– High-VTH and low-VTH libraries are standard in sub-0.18 μm processes
– For example: synthesize using only high VTH, then simply swap in low-VTH cells to improve timing
– Second-VTH insertion can be combined with resizing
• Only two thresholds are needed per block
– Using more than two yields small improvements
Slide 4.51
Using multiple threshold voltages is an effective tool in the static-power optimization portfolio. In contrast to the use of multiple supply voltages, introducing multiple thresholds has relatively little impact on the design flow. No level converters are needed, and no special layout strategies are required. The real burden is the added cost to the manufacturing process. From a design perspective, the challenge is in the technology-mapping process, which is where the choice between cells with different thresholds is actually made.
Three VTH’s
VDD = 1.5 V, VTH.1 = 0.3 V
+
VTH.3(V)V
TH
.2(V
)0.4 0.6 0.8 1 1.2 1.4
0.4
0.6
0.8
1
1.2
1.4
Lea
kag
e R
edu
ctio
n R
atio
VTH.3(V)
VTH.2 (V )
00.5
11.5
0
11.5
0.5
0
0.2
0.4
0.6
0.8
1
Impact of third threshold very limited
[Ref: T. Kuroda, ICCAD’02]
Slide 4.52
The immediate question is how many threshold voltages are truly desirable. As with supply voltages, the addition of more levels comes at a substantial cost, and most likely yields diminishing returns. A number of studies have shown that although there is still some benefit in having three discrete threshold voltages for both NMOS and PMOS transistors, it is quite marginal. Hence, two thresholds for both devices have become the norm in sub-100 nm technologies.
Using Multiple Thresholds
[Figure: Logic network with flip-flops, showing a mix of low-VTH and high-VTH cells assigned on a cell-by-cell basis]
Cell-by-cell VTH assignment (not at block level)
Achieves all-low-VTH performance with substantial reduction in leakage
[Ref: S. Date, SLPE’94]
Slide 4.53
As was the case with dynamic power reduction, the strategy is to increase the threshold voltages in timing paths that are not critical, leading to static leakage power reduction at no performance and dynamic power cost. The appealing factor is that high-threshold cells can be introduced anywhere in the logic structure without major side effects. The burden is clearly on the tools, as timing slack can be used in a number of ways: reducing transistor sizes, supply voltages, or threshold voltages. The former two reduce both dynamic and static power, whereas the latter only influences the static component. Remember, however, that an optimal design carefully balances both components.
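The slack-driven cell swap described here can be sketched as a simple greedy pass. The sketch below is illustrative only: the cell delay and leakage numbers are hypothetical (not from any real library), and a single timing path stands in for full static timing analysis.

```python
# Greedy dual-VTH assignment sketch: start with all cells at low VTH
# (fastest), then swap cells with positive timing slack to high VTH,
# as long as the overall timing constraint still holds.
# Cell delay/leakage values are illustrative, not from a real library.

def assign_vth(cells, t_max):
    """cells: dicts with 'delay_lo', 'delay_hi', 'leak_lo', 'leak_hi'.
    Assumes a single path: path delay is the sum of cell delays."""
    for c in cells:
        c['vth'] = 'low'          # all-low-VTH meets timing by construction
    path_delay = sum(c['delay_lo'] for c in cells)
    # Swap the cells offering the largest leakage saving first
    order = sorted(cells, key=lambda c: c['leak_lo'] - c['leak_hi'], reverse=True)
    for c in order:
        extra = c['delay_hi'] - c['delay_lo']
        if path_delay + extra <= t_max:   # slack still available?
            c['vth'] = 'high'
            path_delay += extra
    leakage = sum(c['leak_hi'] if c['vth'] == 'high' else c['leak_lo']
                  for c in cells)
    return path_delay, leakage

cells = [dict(delay_lo=10, delay_hi=13, leak_lo=40, leak_hi=4) for _ in range(5)]
delay, leak = assign_vth(cells, t_max=56)  # 6 units of slack over all-low (50)
print(delay, leak)  # two cells swapped: delay 56, leakage 3*40 + 2*4 = 128
```

Real flows, of course, work on a timing graph with per-path slacks rather than a single path, but the principle – spend slack on the highest-leakage cells first – is the same.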
Optimizing Power @ Design Time – Circuit-Level Techniques 105
Dual-VTH Domino
Shaded transistors are low-threshold
Low-threshold transistors used only in critical paths
[Figure: Domino logic stage with clocked precharge transistor P1, pull-down network, and inverters Inv1–Inv3; low-threshold (shaded) transistors appear only along the critical path between Clkn and Clkn+1]
Slide 4.54
Most of the discussion on leakage so far has concentrated on static logic. Dynamic-circuit designers are arguably even more worried: for them, leakage means not only power dissipation but also a serious degradation in noise margin. Again, a careful selection between low- and high-threshold devices can go a long way. Low-threshold transistors are used in the timing-critical paths, such as the pull-down logic module. Yet even with these options, it is becoming increasingly apparent that dynamic logic is facing serious challenges in the extreme-scaling regimes.
Multiple Thresholds and Design Methodology
Easily introduced in standard-cell design methodology by extending cell libraries with cells with different thresholds
– Selection of cells during technology mapping
– No impact on dynamic power
– No interface issues (as was the case with multiple VDDs)
Impact: can reduce leakage power substantially
Slide 4.55
Repeating what was stated earlier, the concept of multiple thresholds is introduced quite easily into existing commercial design flows. In hindsight, this is clearly a no-brainer. The major impact is that the size of the cell library doubles (at least), which increases the cost of the characterization process. This, combined with the introduction of a range of size options for each cell, has led to an explosion in the size of a typical library. Libraries with more than 1000 cells are not an exception.
Dual-VTH for High-Performance Design

                 High-VTH only   Low-VTH only   Dual-VTH
Total slack          –53 ps          0 ps         0 ps
Dynamic power        3.2 mW         3.3 mW       3.2 mW
Static power         914 nW        3873 nW      1519 nW

All designs synthesized automatically using Synopsys flows
[Courtesy: Synopsys, Toshiba, 2004]
Slide 4.56
In this experiment, performed jointly by Toshiba and Synopsys, the impact of the introduction of cells with multiple thresholds in a high-performance design is analyzed. The dual-threshold strategy leaves timing and dynamic power unchanged, while reducing the leakage power by more than half relative to the all-low-VTH design.
Example: High- vs. Low-Threshold Libraries
[Figure: Leakage power (nW) of selected combinational benchmark circuits in a 130 nm CMOS process, comparing high-VTH, low-VTH, and dual-VTH implementations]
[Courtesy: Synopsys 2004]
Slide 4.57
A more detailed analysis is shown in this slide, which also illustrates the impact of the chosen design flow over a set of six benchmarks with varying complexity. It compares the high-VTH and low-VTH designs (the extremes) with a design starting from low-VTH transistors only, followed by a gradual introduction of high-VTH devices, and vice versa. It shows that the latter strategy – that is, starting exclusively with high-VTH transistors and introducing low-VTH transistors only in the critical paths to meet the timing constraints – yields better results from a leakage perspective.
Complex Gates Increase Ion/Ioff Ratio
Ion and Ioff of single NMOS versus stack of 10 NMOS transistors
Transistors in stack are sized up to give similar drive
[Figure: Two plots versus VDD (0 to 1 V) in a 90 nm technology: Ioff (nA) and Ion (μA), each comparing the no-stack and stack configurations]
Slide 4.58
In earlier chapters, we have already introduced the notion that stacking transistors reduces the leakage current super-linearly, primarily due to the DIBL effect. The stacking effect is an effective means of managing leakage current at design time. As illustrated in the graphs, the combination of stacking and transistor sizing allows us to maintain the on-current, while keeping the off-current in check, even for higher supply voltages.
Complex Gates Increase Ion/Ioff Ratio
Stacking transistors suppresses submicron effects
– Reduced velocity saturation
– Reduced DIBL effect
– Allows for operation at lower thresholds
[Figure: Ion/Ioff ratio versus VDD for stack and no-stack configurations in a 90 nm technology; at VDD = 1 V the stack improves the ratio by a factor of 10]
Slide 4.59
This combined effect is put in a clear perspective in this graph, which plots the Ion/Ioff ratio of a transistor stack of 10 versus a single transistor as a function of VDD. For a supply voltage of 1 V, the stacked transistor chain features an on- versus off-current ratio that is 10 times higher. This enables us to lower thresholds to values that would be prohibitive in simple gates. Overall, it also indicates that the usage of complex gates, already beneficial in the reduction of dynamic power, helps to reduce static power as well. From a power perspective, this is a win–win situation.
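The mechanism behind the stack effect can be reproduced with a toy subthreshold model. The sketch below uses a two-transistor stack and illustrative parameter values (not calibrated to the 90 nm data in the figures): the intermediate node settles where the series currents balance, giving the top device a negative VGS and a smaller VDS (hence less DIBL), which suppresses the leakage super-linearly.

```python
import math

# Toy subthreshold model (illustrative parameters, not 90 nm data):
#   I = I0 * exp((VGS - VTH + eta*VDS) / (n*vT)) * (1 - exp(-VDS/vT))
# where eta models DIBL, n is the slope factor, vT the thermal voltage.

VDD, VTH, N, VT, ETA, I0 = 1.0, 0.3, 1.4, 0.026, 0.1, 1.0

def isub(vgs, vds):
    """Subthreshold drain current, normalized to I0."""
    return I0 * math.exp((vgs - VTH + ETA * vds) / (N * VT)) \
              * (1.0 - math.exp(-vds / VT))

def stack_leakage():
    """Off-current of two stacked NMOS devices (both gates at 0 V)."""
    lo, hi = 0.0, VDD
    for _ in range(60):              # bisection for intermediate node Vx
        vx = 0.5 * (lo + hi)
        i_top = isub(-vx, VDD - vx)  # top device: source sits at Vx
        i_bot = isub(0.0, vx)        # bottom device: gate and source at 0
        if i_top > i_bot:            # top still stronger: Vx must rise
            lo = vx
        else:
            hi = vx
    return isub(0.0, 0.5 * (lo + hi))

i_single = isub(0.0, VDD)
i_stack = stack_leakage()
print(f"stack-of-2 leakage reduction: {i_single / i_stack:.1f}x")
```

With these parameters the two-device stack already suppresses leakage by roughly an order of magnitude; deeper stacks, as in the 10-transistor example of the figure, push the reduction further.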
Complex Gates Increase Ion/Ioff Ratio
Example: four-input NAND
Fan-in(4) versus Fan-in(2) implementation
With transistors sized for similar performance:
Leakage of Fan-in(2) = Leakage of Fan-in(4) × 3
(Averaged over all possible input patterns)
[Figure: Leakage current (nA) for each of the 16 input patterns, comparing the fan-in(2) and fan-in(4) implementations]
Slide 4.60
The advantage of using complex gates is illustrated with a simple example: a fan-in(4) NAND versus a fan-in(2) NAND/NOR implementation of the same function. The leakage current is analyzed over all 16 input combinations (remember that leakage is state-dependent). On average, the complex-gate topology has a leakage current that is three times smaller than that of the implementation employing simple gates. One way of looking at this is that, for the same functionality, complex gates come with fewer leakage paths. However, they also carry a performance penalty. For high-performance designs, simple gates are a necessity in the critical-timing paths.
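The state-dependent enumeration can be sketched with a crude counting model: an off device leaks I_OFF, each extra off device in a series stack suppresses leakage by roughly a factor of 10, and parallel off devices add. Sizing effects are ignored, so the result is qualitative only (the book reports about 3× with transistors sized for speed).

```python
from itertools import product

# Pattern-dependent leakage: one fan-in(4) NAND versus a two-level
# fan-in(2) network (two NAND2s feeding a NOR2, which realizes the same
# function up to the output polarity). Crude model: off device leaks
# I_OFF; each extra off device in series divides leakage by STACK.

I_OFF, STACK = 1.0, 10.0

def nand_leak(inputs):
    zeros = inputs.count(0)
    if zeros == 0:                       # output low: parallel PMOS all off
        return len(inputs) * I_OFF
    return I_OFF / STACK ** (zeros - 1)  # NMOS series stack, `zeros` off

def nor_leak(inputs):
    ones = inputs.count(1)
    if ones == 0:                        # output high: parallel NMOS all off
        return len(inputs) * I_OFF
    return I_OFF / STACK ** (ones - 1)   # PMOS series stack, `ones` off

complex_total = simple_total = 0.0
for a, b, c, d in product((0, 1), repeat=4):
    complex_total += nand_leak([a, b, c, d])
    x, y = 1 - (a & b), 1 - (c & d)      # internal NAND2 outputs
    simple_total += nand_leak([a, b]) + nand_leak([c, d]) + nor_leak([x, y])

print(f"avg leakage, fan-in(4): {complex_total/16:.2f}, "
      f"fan-in(2) network: {simple_total/16:.2f}")
```

Even this bare-bones model reproduces the trend: the simple-gate network, with its three gates and many shallow stacks, leaks several times more on average than the single complex gate.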
[Ref: S. Narendra, ISLPED’01]
Example: 32-bit Kogge–Stone Adder
[Figure: Histogram of standby leakage current (μA) over random input vectors, comparing high-VTH and low-VTH versions; the averages differ by a factor of 18]
Reducing the threshold by 150 mV increases the leakage of a single NMOS transistor by a factor of 60
© Springer 2001
Slide 4.61
The complex-versus-simple gate trade-off is illustrated with the example of a complex Kogge–Stone adder (from [Narendra, ISLPED’01]). This is the same circuit we studied earlier in this chapter. The histogram of the leakage currents over a large range of random input signals is plotted. It can be observed that the average leakage current of the low-VTH version is only 18 times larger than that of the high-VTH version, which is substantially smaller than what would be predicted by the threshold ratios. For a single NMOS transistor, reducing the threshold by 150 mV would cause the leakage current to go up by a factor of 60 (for a slope factor n = 1.4).
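The factor of 60 follows directly from the subthreshold swing S = n·(kT/q)·ln(10): with n = 1.4 at room temperature, S is about 84 mV/decade, so a 150 mV threshold reduction buys roughly 1.8 decades of leakage. A quick check:

```python
import math

# Subthreshold swing and the leakage increase for a 150 mV threshold drop.
n = 1.4                        # slope factor, as quoted in the text
kT_q = 0.0259                  # thermal voltage at 300 K, in volts
S = n * kT_q * math.log(10)    # subthreshold swing, V/decade (~0.084 V)
factor = 10 ** (0.150 / S)     # leakage increase for dVTH = -150 mV
print(f"S = {S*1000:.1f} mV/dec, leakage increase = {factor:.0f}x")
```

The computed factor comes out near 60, matching the figure quoted above; the gap to the measured adder-level factor of 18 reflects the averaging over stacks and input states.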
Summary
Circuit optimization can lead to substantial energy reduction at limited performance loss
Energy–delay plots are the perfect mechanism for analyzing energy–delay trade-offs
Well-defined optimization problem over W, VDD, and VTH parameters
Increasingly better support by today’s CAD flows
Observe: leakage is not necessarily bad – if appropriately managed

Slide 4.62
In summary, the energy–delay trade-off challenge can be redefined into a perfectly manageable optimization problem. Transistor sizing, multiple supply and threshold voltages, and circuit topology are the main knobs available to a designer. Also worth remembering is that energy-efficient designs carefully balance the dynamic and static power components, subject to the predicted activity level of the modules. The burden is now on the EDA companies to translate these concepts into generally applicable tool flows.
References

Books:
A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers, 1st ed., 1995.
D. Chinnery and K. Keutzer, Closing the Gap Between ASIC and Custom, Springer, 2002.
D. Chinnery and K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007.
J. Rabaey, A. Chandrakasan and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed., Prentice Hall, 2003.
I. Sutherland, B. Sproull and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1st ed., 1999.

Articles:
R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic and V. Stojanovic, “Methods for True Power Minimization,” Int. Conf. on Computer-Aided Design (ICCAD), pp. 35–42, Nov. 2002.
S. Date, N. Shibata, S. Mutoh and J. Yamada, “1-V 30-MHz Memory-Macrocell-Circuit Technology with a 0.5 μm Multi-Threshold CMOS,” Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp. 90–91, Oct. 1994.
M. Hamada, Y. Ootaguro and T. Kuroda, “Utilizing Surplus Timing for Power Reduction,” IEEE Custom Integrated Circuits Conf. (CICC), pp. 89–92, Sept. 2001.
F. Ishihara, F. Sheikh and B. Nikolic, “Level Conversion for Dual-Supply Systems,” Int. Conf. Low Power Electronics and Design (ISLPED), pp. 164–167, Aug. 2003.
P.M. Kogge and H.S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Trans. Comput., C-22(8), pp. 786–793, Aug. 1973.
T. Kuroda, “Optimization and Control of VDD and VTH for Low-Power, High-Speed CMOS Design,” Proceedings ICCAD 2002, San Jose, Nov. 2002.
Slides 4.63 and 4.64
Some references . . .
References

Articles (cont.):
H.C. Lin and L.W. Linholm, “An Optimized Output Stage for MOS Integrated Circuits,” IEEE Journal of Solid-State Circuits, SC-10(2), pp. 106–109, Apr. 1975.
S. Ma and P. Franzon, “Energy Control and Accurate Delay Estimation in the Design of CMOS Buffers,” IEEE Journal of Solid-State Circuits, 29(9), pp. 1150–1153, Sep. 1994.
D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz and R.W. Brodersen, “Methods for True Energy-Performance Optimization,” IEEE Journal of Solid-State Circuits, 39(8), pp. 1282–1293, Aug. 2004.
MathWorks, http://www.mathworks.com
S. Narendra, S. Borkar, V. De, D. Antoniadis and A. Chandrakasan, “Scaling of Stack Effect and its Applications for Leakage Reduction,” Int. Conf. Low Power Electronics and Design (ISLPED), pp. 195–200, Aug. 2001.
T. Sakurai and R. Newton, “Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of Solid-State Circuits, 25(2), pp. 584–594, Apr. 1990.
Y. Shimazaki, R. Zlatanovici and B. Nikolic, “A Shared-Well Dual-Supply-Voltage 64-bit ALU,” Int. Conf. Solid-State Circuits (ISSCC), pp. 104–105, Feb. 2003.
V. Stojanovic, D. Markovic, B. Nikolic, M.A. Horowitz and R.W. Brodersen, “Energy–Delay Tradeoffs in Combinational Logic Using Gate Sizing and Supply Voltage Optimization,” European Solid-State Circuits Conf. (ESSCIRC), pp. 211–214, Sep. 2002.
M. Takahashi et al., “A 60 mW MPEG Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 36–37, Feb. 1998.