Chapter 4
Optimizing Power @ Design Time – Circuit-Level Techniques
Jan M. Rabaey, Dejan Markovic, Borivoje Nikolic

Slide 4.1
With the sources of power dissipation in modern integrated circuits well understood, we can start to explore the various sorts of power reduction techniques. As is made clear in the beginning of the chapter, power or energy minimization can be performed at many stages in the design process and may address different targets such as dynamic or static power. This chapter focuses on techniques for power reduction at design time and at circuit level. Practical questions often expressed by designers are addressed: whether gate sizing or choice of supply voltage yields larger returns in terms of power–delay; how many supplies are needed; what the preferred ratio of discrete supplies to thresholds is; etc. As was made clear at the end of the previous chapter, all optimizations should be seen in the broader light of an energy–delay trade-off. To help guide this process, we introduce a unified sensitivity-based optimization framework. The availability of such a framework makes it possible to compare in an unbiased way the impact of various parameters such as gate size and supply and threshold voltages on a given design topology. The results serve as the foundation for optimization at the higher levels of abstraction, which is the focus of later chapters.
J. Rabaey, Low Power Design Essentials, Series on Integrated Circuits and Systems, DOI 10.1007/978-0-387-71713-5_4, © Springer Science+Business Media, LLC 2009
Chapter Outline

Optimization framework for energy–delay trade-off
Dynamic-power optimization
– Multiple supply voltages
– Transistor sizing
– Technology mapping
Static-power optimization
– Multiple thresholds
– Transistor stacking
Slide 4.2
The chapter starts with the introduction of a unified energy–delay optimization framework, constructed as an extension of the powerful logical-effort approach, which originally targeted performance optimization. The developed techniques are then used to evaluate the effectiveness and applicability of design-time power reduction techniques at the circuit level. Strategies to address both dynamic and static power are considered.
Energy/Power Optimization Strategy

For a given function and activity, an optimal operation point can be derived in the energy–performance space
Time of optimization depends upon activity profile
Different optimizations apply to active and static power

[Figure: activity regimes mapped to optimization strategies – Fixed Activity: design time (active power); Variable Activity: run time (active power); No Activity/Standby: sleep (static power)]
Slide 4.3
Before embarking on any optimization, we should recall that the power and energy metrics are related, but that they are by no means identical. The link between the two is the activity, which changes the ratio between the dynamic and static power components, and which may vary dynamically between operational states. Take, for instance, the example of an adder.

When the circuit is operated at its maximum speed and inputs are changing constantly and randomly, the dynamic power component dominates. On the other hand, when the activity is low, static power rules. In addition, the desired performance of the adder may very well vary over time as well, further complicating the optimization trajectory.

It will become apparent in this chapter that different design techniques apply to the minimization of dynamic and static power. Hence it is worth classifying power reduction techniques based on the activity level, which is a dynamically varying parameter as discussed before. Fortunately, there exists a broad spectrum of optimizations that can be readily applied at design time, either because they are independent of the activity level or because the module activity is fixed and known in advance. These "design-time" design techniques are the topic of the next four chapters. In general though, activity and performance requirements vary over time, and the minimization of power/energy under these circumstances requires techniques that adapt to the prevailing conditions. These are called "run-time" optimizations. Finally, one operational condition requires special attention: the case where the system is idle (or is in "standby"). Under such circumstances, the dynamic power component approaches zero, and
leakage power dominates. Keeping the static power within bounds under such conditions requires dedicated design techniques.
Energy–Delay Optimization and Trade-off

Maximize throughput for given energy, or
Minimize energy for given throughput
Other important metrics: Area, Reliability, Reusability

[Figure: trade-off space in the energy/op versus delay plane – an unoptimized design point is bounded below by the optimal energy–delay curve running between (Dmax, Emin) and (Dmin, Emax)]
Slide 4.4
At the end of the previous chapter, it was argued that design optimization for power and/or energy requires trade-offs, and that energy and delay represent the major axes of the trade-off space. (Other metrics such as area or reliability play a role as well, but are only considered as secondary factors in this book.) This naturally motivates the use of energy–delay (E–D) space as the coordinate system in which designers evaluate the effectiveness of their techniques.

By changing the various independent design parameters, each design maps onto a constrained region of the energy–delay plane. Starting from a non-optimized design, we want to either speed up the system while keeping the design under the power cap (indicated by Emax), or minimize energy while satisfying the throughput constraint (Dmax). The optimization space is bounded by the optimal energy–delay curve. This curve is optimal (for the given set of design parameters), because all other achievable points either consume more energy for the same delay or have a longer delay for the same energy. Although finding the optimal curve seems quite simple in this slide, in real life it is far more complex. Observe also that any optimal energy–delay curve assumes a given activity level, and that changes in activity may cause the curve to shift.
Slide 4.5
The problem is that there are many sets of parameters to adjust. Some of these variables arecontinuous, like transistor sizes, and supply and threshold voltages. Others are discrete, likedifferent logic styles, topologies, and micro-architectures. In theory, it should be possible toconsider all parameters at the same time, and to define a single optimization problem. In practice,we have learned that the complexity of the problem becomes overwhelming, and that the resultingdesigns (if the process ever converges) are very often sub-optimal.
Hence, design methodologies for integrated circuits rely on some important concepts to helpmanage complexity: abstraction (hiding the details) and hierarchy (building larger entities througha composition of smaller ones). The two most often go hand-in-hand. The abstraction stack of atypical digital IC design flow is shown in this slide. Most design parameters are, in general,confined to and selected in a single layer of the stack only. For instance, the choice betweendifferent instruction sets is a typical micro-architecture optimization, while the choice betweendevices with different threshold voltages is best performed at the circuit layer.
Layering, hence, is the preferred technique to manage complexity in the design optimization process.
Optimization Can/Must Span Multiple Levels

Design optimization combines top-down and bottom-up: "meet-in-the-middle"

[Figure: optimization spanning the Architecture, Micro-Architecture, and Circuit (Logic & FFs) layers]
Slide 4.6
The layered approach may give the false impression that optimizations within different layers are independent of each other. This is definitely not the case. For instance, the choice of the threshold voltages at the circuit layer changes the shape of the optimization space at the logical or architectural layers. Similarly, introducing architectural transformations such as pipelining may increase the size of the optimization space at the circuit level, thus leading to larger potential gains. Hence, optimizations may and must span the layers.

Design optimization in general follows a "meet-in-the-middle" formulation: specifications and requirements are propagated from the highest abstraction layer downward (top-down), and constraints are propagated upward from the lowest abstraction layer (bottom-up).
Slide 4.7
Continuous design parameters such as supply voltages and transistor sizes give rise to a continuousoptimization space and a single optimal energy–delay curve. Discrete parameters, such as the choicebetween different adder topologies, result in a set of optimal boundary curves. The overall optimumis then defined by their composite.
For example, topology B is better in the energy-performance sense for large target delays,whereas topology A is more effective for shorter delays.
The Design Abstraction Stack

A very rich set of design parameters to consider!
It helps to consider options in relation to their abstraction layer:

System/Application – Choice of algorithm
Software – Amount of concurrency
(Micro-)Architecture – Parallel versus pipelined, general purpose versus application-specific
Logic/RT – logic family, standard cell versus custom
Circuit – sizing, supply, thresholds
Device – Bulk versus SOI

(This chapter focuses on the circuit layer.)
One of the goals of this chapter is to demonstrate how we can quickly search for this global optimum, and based on that, build an understanding of the scope and effectiveness of the different design parameters.
Some Optimization Observations

Energy–Delay Sensitivities

S_A = (∂E/∂A) / (∂D/∂A) evaluated at A = A0

[Figure: energy–delay curves f(A, B0) and f(A0, B) through the operating point (A0, B0) at delay D0; the slopes S_A and S_B of the curves are the sensitivities to the respective variables]

[Ref: V. Stojanovic, ESSCIRC'02]
Slide 4.8
Given an appropriate formulation of the energy and delay as a function of the design parameters, any optimization program can be used to derive the optimal energy–delay curve. Most of the optimizations and design explorations in this text were performed using various modules of the MATLAB program [Mathworks].

Yet, though relying on automated optimization is very useful to address large problems or to get precise results quickly, some analytical techniques often come in handy to judge the effectiveness of a given parameter, or to come to a closed-form solution. The energy–delay sensitivity is a tool that does just that: it presents an effective way to evaluate the effectiveness of changes in various design variables. It relies on simple gradient expressions that quantify the profitability of a design modification: how much change in energy and delay results from tuning one of the design variables. Consider, for instance, the operation point (A0, B0), where A and B are the design variables being studied. The sensitivity to each of the variables is simply the slope of the curve obtained by a small change in that variable. Observe that the sensitivities are negative owing to the nature of the energy–delay trade-off (when we compare sensitivities in the rest of the text, we will use their absolute values – a larger absolute value indicates a higher potential for energy reduction). For example, variable B has a higher energy–delay sensitivity at point (A0, B0) than variable A. Changing B hence yields a larger potential gain.

Energy–Delay Optimization

Globally optimal energy–delay curve for a given function

[Figure: energy/op versus delay curves for topologies A and B; their composite forms the globally optimal curve, with topology A better at short delays and topology B at long delays]
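To make the sensitivity definition concrete, here is a minimal numerical sketch in Python (the explorations in the book itself used MATLAB). The energy and delay functions E(A, B) and D(A, B) are invented toy models, not expressions from the text; the point is only the mechanics of S_A = (∂E/∂A)/(∂D/∂A).

```python
# Numerical energy-delay sensitivities for a toy two-variable design model.
# energy() and delay() are illustrative stand-ins, not models from the text.

def energy(A, B):
    return A ** 2 + 2.0 * B ** 2      # energy grows with both variables

def delay(A, B):
    return 1.0 / A + 1.0 / B          # delay shrinks as either variable grows

def sensitivity(A, B, var, eps=1e-6):
    """S_x = (dE/dx) / (dD/dx) at (A, B), by central differences."""
    if var == "A":
        dE = (energy(A + eps, B) - energy(A - eps, B)) / (2 * eps)
        dD = (delay(A + eps, B) - delay(A - eps, B)) / (2 * eps)
    else:
        dE = (energy(A, B + eps) - energy(A, B - eps)) / (2 * eps)
        dD = (delay(A, B + eps) - delay(A, B - eps)) / (2 * eps)
    return dE / dD

S_A = sensitivity(1.0, 1.0, "A")      # -> about -2: 2 units of energy per unit of delay given up
S_B = sensitivity(1.0, 1.0, "B")      # -> about -4: B is the "higher-cost" variable here
# Both sensitivities are negative (energy up means delay down); the variable
# with the larger |S| offers the bigger energy reduction per unit of extra delay.
```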
Finding the Optimal Energy–Delay Curve

ΔE = S_A · (−ΔD) + S_B · ΔD

On the optimal curve, all sensitivities must be equal

Pareto-optimal: the best that can be achieved without disadvantaging at least one metric.

[Figure: starting from (A0, B0) on curve f(A, B0) at delay D0, tuning variable A creates timing slack ΔD (moving to curve f(A1, B)); variable B then trades the slack back for a net energy reduction]
Slide 4.9
The optimal energy–delay curve as defined earlier is a Pareto-optimal curve (a notion borrowed from economics). An assignment or operational point in a multi-dimensional search is Pareto-optimal if improving on one metric by necessity means hurting another.

An interesting property of a Pareto-optimal point is that the sensitivities to all design variables must be equal. This can be understood intuitively. If the sensitivities are not equal, the difference can be exploited to generate a no-loss improvement. Consider, for instance, the example presented here, where we strive to minimize the energy for a given delay D0. Using the "lower-energy-cost" variable A, we first create some timing slack ΔD at a small expense in energy ΔE (proportional to A's E–D sensitivity). From the new operation point (A1, B0), we can now use the "higher-energy-cost" variable B to achieve an overall energy reduction as indicated by the formula. The fixed point in the optimization is clearly reached when all sensitivities are equal.
Reducing Active Energy @ Design Time

E_active ~ α · C_L · V_swing · V_DD
P_active ~ α · C_L · V_swing · V_DD · f

Reducing voltages
– Lowering the supply voltage (VDD) at the expense of clock speed
– Lowering the logic swing (Vswing)
Reducing transistor sizes (CL)
– Slows down logic
Reducing activity (α)
– Reducing switching activity through transformations
– Reducing glitching by balancing logic
Slide 4.10
In the rest of the chapter, we primarily focus on the circuit and logic layers. Let us first focus on the active component of power dissipation, or, in light of the E–D trade-off perspective, active energy dissipation. The latter is a product of switching activity at the output of a gate, load capacitance at the output, logic swing, and supply voltage. The simple guideline for energy reduction is therefore to reduce each of the terms in the product expression. Some variables, however, are more efficient than others.

The largest impact on active energy seemingly comes from supply voltage scaling, because of its quadratic impact on power (we assume that the logic swing scales accordingly). All other terms have
linear impact. For example, smaller transistors have less capacitance. Switching activity mostly depends on the choice of circuit topology.
For a fixed circuit topology, the most interesting trade-off exists between supply voltage and gatesizing, as these tuning knobs affect both energy and performance. Threshold voltages play asecondary role in this discussion as they impact performance without influencing dynamic energy.
Observation

Downsizing and/or lowering the supply on the critical path lowers the operating frequency
Downsizing non-critical paths reduces energy for free, but
– Narrows down the path–delay distribution
– Increases impact of variations, impacts robustness

[Figure: histograms of the number of paths versus path delay tp(path), before and after downsizing; downsizing non-critical paths shifts the distribution toward the target delay, making many more paths critical]
Slide 4.11
Throughout this discussion, it is useful to keep in mind that the optimizations in the E–D space also impact other important design metrics that are not captured here, such as area or reliability. Take, for example, the relationship between transistor sizing and circuit reliability. Trimming the gates on the non-critical paths saves power without a performance penalty – and hence seems to be a win-win operation. Yet in the extreme case, this results in all paths becoming critical (unless a minimum gate size constraint is reached, of course). This effect is illustrated in the slide. The downsizing of non-critical gates narrows the delay distribution and moves the average closer to the maximum delay. This makes the design vulnerable to process-variation effects and degrades its reliability.
Circuit Optimization Framework

minimize   Energy(VDD, VTH, W)
subject to Delay(VDD, VTH, W) ≤ Dcon

Constraints:
VDD min < VDD < VDD max
VTH min < VTH < VTH max
Wmin < W

Reference case: Dmin sizing @ VDD max, VTH ref

[Ref: V. Stojanovic, ESSCIRC'02]
Slide 4.12
To evaluate fully the impact of the design variables in question, that of supply and threshold voltages and gate size on energy and performance, we need to construct a simple and effective, yet accurate, optimization framework. The search for a globally optimal energy–delay curve for a given circuit topology and activity level is formulated as an optimization problem:

Minimize energy subject to a delay constraint and bounds on the range of the optimization variables (VDD, VTH, and W).
Optimization is performed with respect to a reference design, sized for minimum delay at thenominal supply and threshold voltages as specified for the technology (e.g., VDD = 1.2V andVTH = 0.35V for a 90 nm process). This reference point is convenient, as it is well-defined.
Optimization Framework: Generic Network

Gate in stage i loaded by fan-out (stage i+1)

[Figure: generic two-stage network – gate i with input capacitance Ci and parasitic capacitance γCi drives the wire capacitance Cw and the input capacitance Ci+1 of stage i+1; the two stages may operate from supplies VDD,i and VDD,i+1]
Slide 4.13
The core of the framework consists of effective models of delay and energy as a function of the design parameters. To develop the expressions, we assume a generic circuit configuration as illustrated in the slide. The gate under study is at the i-th stage of a logical network, and is loaded by a number of gates in stage i+1, which we have lumped into a single equivalent gate. Cw represents the capacitance of the wire, which we will assume to be proportional to the fan-out (this is a reasonable assumption for a first-order model).
Alpha-Power Based Delay Model

tp = Kd · (VDD / (VDD − Von)^αd) · τnom · (1 + f′i/γ),   with f′i = (Ci+1 + Cw,i)/Ci

Fit parameters: Von, αd, Kd, γ
VDD ref = 1.2 V, 90 nm technology
Von = 0.37 V, αd = 1.53, τnom = 6 ps, γ = 1.35

[Figure: model versus simulation – propagation delay (ps) versus fan-out Ci+1/Ci at nominal supply, and normalized FO4 delay versus VDD/VDD ref; the fitted model tracks the simulated data closely in both sweeps]
Slide 4.14
The delay modeling of the complex gate i proceeds in two steps. First, we derive the delay of an inverter as a function of supply voltage, threshold, and fan-out; next, we expand this to more complex gates.

The delay of an inverter is expressed using a simple linear delay model, based on the alpha-power law for the drain current (see Chapter 2). Note that this model is based on curve-fitting. The parameters Von and αd are intrinsically related, yet not equal, to the transistor threshold and the velocity saturation index. Kd is another fit parameter and relates to the transconductance of the process (amongst others). The model fits SPICE-simulated data quite nicely, across a range of supply voltages, normalized to the nominal supply voltage (which is 1.2 V for our 90 nm CMOS technology). Observe that this model is only valid if the supply voltage exceeds the threshold voltage by a reasonable amount. (This constraint will be removed in Chapter 11, where we present a modified model that extends into the sub-threshold region.)
The fan-out f = Ci+1/Ci represents the ratio of the load capacitance divided by the gate capacitance. A small modification allows for the inclusion of the wire capacitance (f′). γ is another technology-dependent parameter, representing the ratio between the output and input capacitance of a minimum-sized unloaded inverter.
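As a quick sketch, the fitted delay model can be coded directly. The formula and fit values (Von = 0.37 V, αd = 1.53, τnom = 6 ps, γ = 1.35) follow the reconstruction above; Kd is simply normalized to 1 here, so absolute delays are illustrative only.

```python
# Sketch of the fitted alpha-power delay model:
#   tp = Kd * VDD / (VDD - Von)**alpha_d * tau_nom * (1 + f_prime / gamma)
# Fit values are the ones quoted for the 90 nm process; Kd assumed = 1.

def inverter_delay(vdd, f_prime, von=0.37, alpha_d=1.53,
                   tau_nom=6e-12, gamma=1.35, kd=1.0):
    """Propagation delay (s); valid only for VDD sufficiently above Von."""
    assert vdd > von, "model breaks down near/below the threshold"
    voltage_factor = kd * vdd / (vdd - von) ** alpha_d
    return voltage_factor * tau_nom * (1 + f_prime / gamma)

# Delay grows roughly linearly with fan-out at fixed VDD ...
d_fo1 = inverter_delay(1.2, 1.0)
d_fo4 = inverter_delay(1.2, 4.0)
# ... and rises steeply as VDD approaches Von.
d_low = inverter_delay(0.6, 4.0)
```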
Combined with Logical-Effort Formulation

For Complex Gates:

tp = τnom · (pi + fi·gi/γ)

Logical effort gi – depends upon gate topology
Electrical effort fi ≈ Si+1/Si
Parasitic delay pi – depends upon gate topology
Effective fan-out hi = fi·gi

[Ref: I. Sutherland, Morgan-Kaufman'99]
Slide 4.15
The other part of the model is based on the logical-effort formulation, which extends the notion to complex gates. Using the logical-effort notation, the delay can be expressed simply as a product of the process-dependent time constant τnom and a unitless delay, pi + fi·gi/γ, in which g is the logical effort that quantifies the relative ability of a gate to deliver current, f is the ratio of the total output to input capacitance of the gate, and p represents the delay component due to the self-loading of the gate. The product of the logical effort and the electrical effort is called the effective fan-out h. Gate sizing enters the equation through the fan-out factor f = Si+1/Si.
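The formulation above can be sketched for a small gate chain. The (g, p) values for the inverter and the 2-input NAND are the standard logical-effort estimates for static CMOS; the sizes and load are made-up numbers.

```python
# Logical-effort delay of a gate chain:
#   tp_i = tau_nom * (p_i + f_i * g_i / gamma),  h_i = f_i * g_i

TAU_NOM = 6e-12   # process time constant (s), from the 90 nm fit
GAMMA = 1.35

def stage_delay(g, p, f):
    """Delay of one stage: parasitic plus effort delay, scaled by tau_nom."""
    return TAU_NOM * (p + f * g / GAMMA)

def path_delay(gates, sizes, c_load):
    """gates: list of (g, p); sizes: input capacitance per stage (same units as c_load)."""
    total = 0.0
    for i, (g, p) in enumerate(gates):
        c_next = sizes[i + 1] if i + 1 < len(gates) else c_load
        f = c_next / sizes[i]          # electrical effort f_i = S_{i+1}/S_i
        total += stage_delay(g, p, f)
    return total

# inverter (g = 1), 2-input NAND (g = 4/3), inverter, driving a load of 16
chain = [(1.0, 1.0), (4.0 / 3.0, 2.0), (1.0, 1.0)]
d = path_delay(chain, [1.0, 2.0, 4.0], 16.0)
```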
Dynamic Energy

Edyn,i = (Cw,i + Ci+1 + γCi) · VDD,i² = (γ + f′i) · Ci · VDD,i²,   with Ci = Ke·Si

Ei = Ke·Si · (VDD,i−1² + γ·VDD,i²)
   = energy consumed by logic gate i

[Figure: the same generic two-stage network as before – gate i (capacitance Ci, supply VDD,i) driving the wire capacitance Cw and gate i+1 (capacitance Ci+1, supply VDD,i+1)]
Slide 4.16
For the time being, we only consider the switching energy of the gate. In this model, f′iCi is the total load at the output, including wire and gate loads, and γCi is the self-loading of the gate. The total energy stored on these capacitances is the energy taken out of the supply voltage in stage i.

Now, if we change the size of the gate in stage i, it affects only the energy stored on the input capacitance and parasitic capacitance of that gate. Ei hence is defined as the energy that the gate at stage i contributes to the overall energy dissipation.
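The per-stage energy expression can be sketched as follows; Ke is an assumed capacitance-per-unit-size constant chosen purely for illustration.

```python
# Per-stage switching energy (reconstruction from the slide):
#   E_i = Ke * S_i * (VDD_{i-1}**2 + gamma * VDD_i**2)
# i.e. the energy attributed to gate i is what sizing S_i actually affects:
# its input cap (charged by stage i-1) plus its parasitic cap (gamma * C_i).

GAMMA = 1.35
KE = 1e-15   # assumed capacitance per unit size (F), illustrative only

def gate_energy(s_i, vdd_prev, vdd_i, ke=KE, gamma=GAMMA):
    """Energy (J) attributed to the gate of size s_i."""
    return ke * s_i * (vdd_prev ** 2 + gamma * vdd_i ** 2)

# Doubling a gate's size doubles its attributed energy ...
e1 = gate_energy(1.0, 1.2, 1.2)
e2 = gate_energy(2.0, 1.2, 1.2)
# ... while lowering only this stage's supply scales just the parasitic term.
e_scaled = gate_energy(1.0, 1.2, 0.8)
```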
Slide 4.17
As mentioned, sensitivity analysis provides intuition about the profitability of optimization. Using the models developed in the previous slides, we can now derive expressions for the sensitivities to some of the key design parameters.

The formulas indicate that the largest potential for energy savings is at the minimum delay, Dmin, which is obtained by equalizing the effective fan-out of all stages, and setting the supply voltage at the maximum allowable value. This observation intuitively makes sense: at minimum delay, the delay cannot be reduced beyond the minimum achievable value, regardless of how much energy is spent. At the same time, the potential of energy savings through voltage scaling decreases with reducing supply voltages: E decreases, while D and the ratio Von/VDD increase.

The key point to realize is that optimization primarily exploits the tuning variable with the largest sensitivity, which ultimately leads to the solution where all sensitivities are equal. You will see this concept at work in a number of examples.
Example: Inverter Chain

Properties of inverter chain
– Single-path topology
– Energy increases geometrically from input to output

Goal
– Find optimal sizing S = [S1, S2, …, SN], supply voltage, and buffering strategy to achieve the best energy–delay trade-off

[Figure: chain of N inverters with sizes S1 = 1, S2, S3, …, SN driving a load CL]
Slide 4.18
We use a number of well-known circuit topologies to illustrate the concepts of circuit optimization for energy. The examples differ in the amount of off-path loading and path reconvergence. By analyzing how these properties affect the energy profile, we may come to some general principles related to the impact of the various design parameters. More precisely, we study the (well-understood) inverter chain and the tree adder – as these examples differ widely in the number of paths and path reconvergence.

Let us begin with the inverter chain. The goal is to find the optimal sizing, the supply voltages, and the number of stages that result in the best energy–delay trade-off.
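Before turning to the detailed results, the trade-off can be illustrated with a brute-force toy optimizer: a three-stage chain driving a fixed load, using the first-order unitless delay and energy models from the previous slides (τnom and Ke normalized away). The grid and load values are arbitrary choices for illustration.

```python
# Brute-force sketch of the inverter-chain trade-off: sweep intermediate
# stage sizes, then pick the minimum-energy sizing that meets a delay cap.

import itertools

GAMMA = 1.35
C_LOAD = 64.0

def chain_delay(sizes):
    """Sum of per-stage unitless delays p + f/gamma (inverters: g = 1, p = 1)."""
    caps = list(sizes) + [C_LOAD]
    return sum(1.0 + (caps[i + 1] / caps[i]) / GAMMA for i in range(len(sizes)))

def chain_energy(sizes):
    """Switching energy ~ sum of input plus parasitic caps (fixed VDD, Ke = 1)."""
    return sum(s * (1.0 + GAMMA) for s in sizes)

def optimize(delay_cap, grid):
    """Exhaustively search stage-2/3 sizes (stage 1 fixed at 1)."""
    best = None
    for s2, s3 in itertools.product(grid, repeat=2):
        sizes = (1.0, s2, s3)
        if chain_delay(sizes) <= delay_cap:
            e = chain_energy(sizes)
            if best is None or e < best[0]:
                best = (e, sizes)
    return best

grid = [x / 2.0 for x in range(2, 65)]        # candidate sizes 1.0 .. 32.0
d_min = chain_delay((1.0, 4.0, 16.0))         # equal fan-out of 4: minimum delay
e_ref = chain_energy((1.0, 4.0, 16.0))
e_opt, sizing = optimize(1.10 * d_min, grid)  # allow 10% extra delay
# The relaxed design tapers down the back-end stages, cutting energy well
# below the minimum-delay reference.
```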
Optimizing Return on Investment (ROI)

Depends on Sensitivity (∂E/∂D)

Gate Sizing:
(∂E/∂Si) / (∂D/∂Si) = −Ei / (τnom · (hi − hi−1))
→ ∞ for equal h (Dmin)

Supply Voltage:
(∂E/∂VDD) / (∂D/∂VDD) = −2 · (E/D) · (1 − Von/VDD) / (αd − 1 + Von/VDD)
→ max at VDD(max) (Dmin)
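A small numerical check of the two sensitivity expressions as reconstructed above (the operating-point numbers are made up): the sizing sensitivity blows up as neighboring effective fan-outs equalize (the Dmin condition), while the supply sensitivity shrinks as VDD is lowered.

```python
# Evaluating the reconstructed sensitivity formulas at sample points.

def sizing_sensitivity(e_i, h_i, h_prev, tau_nom=1.0):
    """S_size = -E_i / (tau_nom * (h_i - h_{i-1})); -> -inf as h_i -> h_{i-1}."""
    return -e_i / (tau_nom * (h_i - h_prev))

def supply_sensitivity(e, d, vdd, von=0.37, alpha_d=1.53):
    """S_vdd = -2 (E/D) (1 - Von/VDD) / (alpha_d - 1 + Von/VDD)."""
    x = von / vdd
    return -2.0 * (e / d) * (1.0 - x) / (alpha_d - 1.0 + x)

# Near minimum delay (almost-equal effective fan-outs) sizing sensitivity explodes:
s_near = abs(sizing_sensitivity(1.0, 4.0, 3.9))
s_far = abs(sizing_sensitivity(1.0, 4.0, 2.0))
# Supply sensitivity shrinks as VDD is lowered (E falls, D and Von/VDD grow):
s_hi = abs(supply_sensitivity(1.0, 1.0, 1.2))
s_lo = abs(supply_sensitivity(0.5, 2.0, 0.6))
```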
Slide 4.19
The inverter chain has been the focus of a lot of attention, as it is a critical component in digital design, and some clear guidelines about optimal design can be derived in closed form. For minimum delay, the fan-out of each stage is kept constant, and each subsequent stage is up-sized with a constant factor. This means that the energy stored per stage increases geometrically toward the output, with the largest energy stored in the final load.

In a first step, we consider solely transistor sizing. For a given delay increment, the optimum size of each stage, which minimizes energy, can be derived. The sensitivities derived in Slide 4.17 already give a first idea of what may unfold: the sensitivity to gate sizing is proportional to the energy stored on the gate, and is inversely proportional to the difference in effective fan-outs. What this means is that, for equal sensitivity in all stages, the difference in the effective fan-outs of a gate must increase in proportion to the energy stored on the gate, indicating that the difference in the effective fan-outs should increase exponentially toward the output.

This result was already analytically derived by Ma and Franzon [Ma, JSSC'94], who showed that a tapered staging is the best way to combine performance and energy efficiency. One caveat: at large delay increments, a more efficient solution can be found by reducing the number of stages – this was not included as a design parameter in this first-order optimization, in which the topology was kept unchanged.
Inverter Chain: VDD Optimization

VDD reduces energy of the final load first
Variable taper achieved by voltage scaling

[Figure: optimal per-stage VDD/VDD nom over stages 1–7 for delay increments Dinc = 0%, 1%, 10%, 30%, 50%; larger increments pull the supplies of the later stages down first]
Slide 4.20
Let us now consider the potential of supply voltage scaling. We assume that each stage can be run at a different voltage. As in sizing, the optimization tackles the largest consumers – the final stages – first by scaling their supply voltages. The net effect is similar to a "virtual" tapering. An important difference between sizing and supply reduction is that sizing does not affect the energy stored in the final output load CL. Supply reduction, on the other hand, lowers this source of energy consumption first, by reducing the supply voltage of the gate that drives the load. As (dis)charging CL is the largest source of energy consumption, the impact of this is quite profound.
Inverter Chain: Gate Sizing

Variable taper achieves minimum energy

∂E/∂Si ∝ Ei / (hi − hi−1): on the optimal curve the per-stage sensitivities are equal, so the fan-out difference hi − hi−1 must grow in proportion to the stage energy Ei

[Figure: optimal effective fan-out h per stage (1–7) for delay increments Dinc = 0%, 1%, 10%, 30%, 50%; the fan-out of the later stages grows with increasing delay slack]

[Ref: Ma, JSSC'94]
Inverter Chain: Optimization Results

Parameter with the largest sensitivity has the largest potential for energy reduction
Two discrete supplies mimic per-stage VDD

[Figure: normalized sensitivity and energy reduction (%) versus delay increment Dinc (0–50%) for sizing S, global supply gVDD, two discrete supplies 2VDD, and per-stage custom supply cVDD]
Slide 4.21
Now, how good can all this be in terms of energy reduction? In the graphs, we present the results of various optimizations performed on the inverter chain: sizing, reducing the global VDD, two discrete VDDs, and a customizable VDD per stage. For each of these cases, the sensitivity and the energy reduction are plotted as functions of the delay increment (over Dmin). The prime observation is that increasing the delay by 50% reduces the energy dissipation by more than 70%. Again, it is shown that for any value of the delay increment, the parameter with the largest sensitivity has the largest potential for energy reduction. For example, at small delay increments sizing has the largest sensitivity (initially infinity), so it offers the largest energy reduction. Its potential however quickly falls off. At large delay increments, it pays to scale the supply voltage of the entire circuit, achieving a sensitivity equal to that of sizing at around 25% excess delay. The largest reductions can be obtained by custom voltage scaling. Yet, two discrete voltages are almost as good, and are a lot simpler from an implementation perspective.
Example: Kogge–Stone Tree Adder

Tree adder
– Long wires
– Reconvergent paths
– Multiple active outputs

[Figure: Kogge–Stone adder slice with inputs (A0, B0) … (A15, B15) and Cin, producing sum outputs S0 … S15]

[Ref: P. Kogge, Trans. Comp'73]
Slide 4.22
An inverter chain has a particularly simple energy distribution, which grows geometrically until the final stage. This type of profile drives the optimization (for both sizing and supply) to focus on the final stages first. However, most practical circuits have a more complex energy profile.

An interesting counterpart is formed by the tree adder, which features long wires, large fan-out variations, reconvergent fan-out, and multiple active outputs qualified by paths of various logic depths. We have selected a popular instance of such an adder, the Kogge–Stone version, for our study [Kogge'73, Rabaey'03]. The overall architecture of the adder consists of a number of propagate/generate functions at the inputs (identified by the squares), followed by carry-merge operators
(circles). The final-sum outputs are generated through XOR functions (diamonds). To balance the delay paths, buffers (triangles) are inserted in many of the paths.
Tree Adder: Sizing vs. Dual-VDD Optimization

Reference design: all paths are critical
Internal energy ⇒ S more effective than VDD
– S: E(−54%), Dual VDD: E(−27%) at Dinc = 10%

[Figure: three energy maps over the 64 bit slices and logic stages 1–9 – the reference design at D = Dmin, the sized design at Dinc = 10% (E −54%), and the dual-VDD design at Dinc = 10% (E −27%)]
Slide 4.23
The adder topology is best understood in a two-dimensional plane. One axis is formed by the different bit slices N (we are considering a 64-bit adder for this example), whereas the other is formed by the consecutive gate stages. As befits a tree adder, the number of stages equals log2(N)+M, where M covers the extra stages for propagate/generate and the final XOR functionality. The energy of an internal node is best understood when plotted with respect to this two-dimensional topology.

As always, we start from a reference design that is optimized for minimum delay, and we explore how we can trade off energy and delay starting from that point. The initial sizing makes all paths in the adder equal to the critical path. The first figure shows the energy map for the minimum delay. Though the output nodes are responsible for a sizable fraction of the energy consumption, a number of internal nodes (around stage 5) dominate.

The large internal energy increases the potential for energy reduction through gate sizing. This is illustrated by the case where we allow for a 10% delay increase. We have plotted the energy distribution resulting from sizing, as well as from the introduction of two discrete supply voltages. The former results in a 54% reduction in overall energy, whereas the latter only (!) saves 27%.

This result can be explained as follows. Given the fact that the dominant energy nodes are internal, sizing allows each of these nodes to be attacked individually without too much of a global impact. In the case of dual supplies, one must be aware that driving a high-voltage node from a low-voltage node is hard. Hence the preferable assignment of low-voltage nodes is to start from the output nodes and to work one's way toward the input nodes. Under these conditions, we have already sacrificed a lot of delay slack on low-energy intermediate nodes before we reach the internal high-energy nodes. In summary, supply voltages cannot be randomly assigned to nodes. This makes the usage of discrete supply voltages less effective in modules with high internal energy.
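The assignment order argued here can be sketched as a tiny greedy pass. The netlist representation, the slack values, and the rule "a gate may move to the low supply only if all of its fanouts already have" are illustrative simplifications, not the book's procedure.

```python
# Sketch of output-first low-VDD assignment: a low-VDD gate must not drive a
# high-VDD gate (that would need a level converter), so assignment proceeds
# from the outputs backward toward the inputs.

def assign_low_vdd(gates, fanout, slack):
    """Greedily move gates to the low supply, outputs first, while they have
    timing slack and all of their fanouts are already low-VDD (or outputs)."""
    low = set()
    # process in reverse topological order: consumers before their drivers
    for g in reversed(gates):
        fans = fanout.get(g, [])
        if slack[g] > 0 and all(f in low for f in fans):
            low.add(g)
    return low

# tiny chain: g1 -> g2 -> g3 -> primary output (all names hypothetical)
gates = ["g1", "g2", "g3"]                      # topological order
fanout = {"g1": ["g2"], "g2": ["g3"], "g3": []}
slack = {"g1": 2.0, "g2": 0.0, "g3": 1.0}
# g3 qualifies; g2 has no slack, which also blocks g1 despite its own slack -
# exactly the "sacrificed slack on intermediate nodes" effect described above.
low_set = assign_low_vdd(gates, fanout, slack)
```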
Slide 4.24
We can now put it all together, and explore the tree adder in the energy–delay space. Each of the design parameters (VDD, VTH, S) is analyzed separately and in combination with the others. (Observe that inclusion of the threshold voltage as a design parameter only makes sense when the leakage energy is considered as well – how this is done is discussed later in the chapter.)
A couple of interesting conclusions can be drawn:
– Through circuit optimization, we can reduce the energy consumption of the adder by a factor of 10 by doubling the delay.
– Exploiting only two out of the three variables yields close to the optimal gain. For the adder, the most effective parameters are sizing and threshold selection. At the reference design point, sizing and threshold reduction feature the largest and the smallest sensitivities, respectively. Hence, this combination has the largest potential for energy reduction along the lines demonstrated in Slide 4.8.
– Finally, circuit optimization is most effective in a small region around the reference point. Expanding beyond that region typically becomes too expensive in terms of energy or delay cost for small gains, yielding a reduced return on investment.
Slide 4.25
So far, we have studied the theoretical impact of circuit optimization on energy and delay. In reality, the design space is more constrained. Choosing a different supply or threshold voltage for every gate is not a practical option. Transistor sizes come in discrete values, as determined by the available design library. One of the fortunate conclusions emerging from the preceding studies is that a couple of well-chosen discrete values for each of the design parameters can get us quite close to the optimum.

Let us first consider the practical issues related to the use of multiple supply voltages – a practice that until recently was not common in digital integrated circuit design at all. It impacts the layout strategy and complicates the verification process (as will be discussed in Chapter 12). In addition, generating, regulating, and distributing multiple supplies are non-trivial tasks.
A number of different design strategies exist with respect to the usage of multiple supply voltages. The first is to assign the voltage at the block/macro level (the so-called voltage island
Tree Adder: Multi-dimensional Search
• Can get pretty close to optimum with only two variables
• Getting the minimum speed or delay is very expensive
[Figure: Energy/Eref versus Delay/Dmin trade-off curves, relative to the reference design point, for the parameter combinations {S, VDD}, {VDD, VTH}, {S, VTH}, and {S, VDD, VTH}.]
Multiple Supply Voltages
• Block-level supply assignment
– Higher-throughput/lower-latency functions are implemented in higher VDD
– Slower functions are implemented with lower VDD
– This leads to so-called voltage islands with separate supply grids
– Level conversion performed at block boundaries
• Multiple supplies inside a block
– Non-critical paths moved to lower supply voltage
– Level conversion within the block
– Physical design challenging
approach). This makes particular sense in case some modules have higher performance/activity requirements than others (for instance, a processor's data path versus its memory). The second and more general approach is to allow for voltage assignment all the way down to the gate level ("custom voltage assignment"). In general, this means that gates on non-critical paths are assigned a lower supply voltage. Be aware that having signals at different voltage levels requires the insertion of level converters. It is preferable if these are limited in number (as they consume extra energy) and occur only at the boundaries of the modules.
Using Three VDD's
V1 = 1.5 V, VTH = 0.3 V
[Figure: power reduction ratio as a function of the second and third supplies V2 and V3 (each swept from 0.4 V to 1.4 V); the minimum occurs near V2 ≈ 1 V and V3 ≈ 0.7 V.]
[Ref: T. Kuroda, ICCAD'02]
Slide 4.26
With respect to multiple supply voltages, one cannot help wondering about the following question: if multiple supply voltages are employed, how many discrete levels are sufficient, and what are their values? This slide illustrates the potential of using three discrete voltage levels, as was studied by Tadahiro Kuroda [Kuroda, ICCAD'02]. Supply assignment to the individual logic gates is performed by an optimization routine that minimizes energy for a given clock period. With the main supply fixed at 1.5 V, providing a second and third supply yields a nearly twofold power reduction ratio.
A number of useful observations can be drawn from the graphs:
• The power minimum occurs for V2 ≈ 1 V and V3 ≈ 0.7 V.
• The minimum is quite shallow. This is good news, as it means that small deviations around this minimum (as caused, for instance, by IR drops) will not have a big impact.
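The kind of gate-level supply assignment used in this study can be approximated with a small brute-force search. The sketch below assumes a simple alpha-power delay model and V²-proportional dynamic power; V1 and VTH match the slide, but the slack distribution and model constants are illustrative, so the optimum it finds will not reproduce the published V2 ≈ 1 V, V3 ≈ 0.7 V exactly.

```python
# Brute-force exploration of a second and third supply, in the spirit of
# [Kuroda, ICCAD'02]. Delay and power models are simplified placeholders.

V1, VTH, ALPHA = 1.5, 0.3, 1.3

def gate_delay(v):
    return v / (v - VTH) ** ALPHA           # alpha-power delay, valid for v > VTH

def best_supply(slack_ratio, supplies):
    # lowest supply whose delay stays within slack_ratio times the delay at V1
    limit = slack_ratio * gate_delay(V1)
    ok = [v for v in supplies if gate_delay(v) <= limit]
    return min(ok)                          # V1 itself always qualifies

def total_power(slacks, supplies):
    return sum(best_supply(s, supplies) ** 2 for s in slacks)

def search(slacks, grid):
    """Sweep (V2, V3) over `grid`, assigning each gate (characterized by its
    delay slack ratio) the lowest feasible supply."""
    base = total_power(slacks, [V1])
    best = (1.0, V1, V1)
    for v2 in grid:
        for v3 in grid:
            if not (VTH < v3 <= v2 <= V1):
                continue
            ratio = total_power(slacks, [V1, v2, v3]) / base
            if ratio < best[0]:
                best = (ratio, v2, v3)
    return best                             # (power ratio, V2, V3)
```

The flat minimum observed in the slide shows up in such a search as many (V2, V3) pairs with nearly identical power ratios.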
The question now is how much impact on power each additional supply carries.
Optimum Number of VDDs
• The more VDDs, the lower the power, but the effect saturates
• The power reduction effect decreases with scaling of VDD
• Optimum V2/V1 is around 0.7
[Figure: optimal supply ratios (V2/V1, V3/V1, V4/V1) and power ratios (P2/P1, P3/P1, P4/P1) versus V1 (0.5–1.5 V) for the cases {V1, V2}, {V1, V2, V3}, and {V1, V2, V3, V4}.]
© IEEE 2001
[Ref: M. Hamada, CICC'01]
Slide 4.27
In fact, the marginal benefits of adding extra supplies quickly bottom out. Although adding a second supply yields big savings, the extra reductions obtainable by adding a third or a fourth are marginal. This makes sense, as the number of (non-critical) gates that can benefit from the additional supply shrinks with each iteration. For example, the fourth supply works only for non-critical-path gates close to the tail of the delay distribution. Another observation is that the power savings obtainable from using multiple supplies reduce with the scaling of the main supply voltage (for a fixed threshold).
Lessons: Multiple Supply Voltages
• Two supply voltages per block are optimal
• The optimal ratio between the supply voltages is 0.7
• Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)
• An option is to use an asynchronous level converter
– More sensitive to coupling and supply noise
Slide 4.28
Our discussion on multiple discrete supply voltages can be summarized with a number of rules-of-thumb:

• The largest benefit is obtained by adding a second supply.
• The optimal ratio between the discrete supplies is approximately 0.7.
• Adding a third supply provides an additional 5–10% incremental savings. Going beyond that does not make much sense.
Distributing Multiple Supply Voltages
[Figure: two layout styles for dual supplies – "Conventional", with VDDH and VDDL circuits placed in separate wells, and "Shared n-well", with both the VDDH and VDDL rails (and a common VSS) available to every cell.]
Slide 4.29
Distribution of multiple supply voltages requires careful examination of the floorplanning strategy. The conventional way to support multiple VDDs (two in this case) is to place gates with different supplies in different wells (e.g., low-VDD and high-VDD). This approach does not require a redesign of the standard cells, but comes with an area overhead owing to the necessary spacing between n-wells at different voltages. Another way to introduce the second supply is to provide two VDD rails for every standard cell, and selectively route the cells to the appropriate supply. This "shared n-well" approach also comes with an area overhead owing to the extra voltage rail. Let us further analyze both techniques to see what kind of system-level trade-offs they introduce.
Conventional
[Figure: two floorplans for the conventional dual-supply approach, with n-well isolation between VDDH and VDDL circuits – (a) dedicated rows, alternating VDDH and VDDL cell rows; (b) dedicated regions, with a VDDH region and a VDDL region.]
Slide 4.30
In the conventional dual-voltage approach, the most straightforward method is to cluster gates with the same supply (scheme b). This scheme works well for the "voltage island" model, where a single supply is chosen for a complete module. It does not fit the "custom voltage assignment" model very well, though. Logic paths consisting of both high-VDD and low-VDD cells incur additional overhead in wire delay due to long wires between the voltage clusters. The extra wire capacitance also reduces the power savings. Maintaining spatial locality of connected combinational logic gates is essential.

Another approach is to assign voltages per row of cells (scheme a). Both VDDL and VDDH are routed only to the edge of the rows, and a special standard cell is added that selects between the two voltages (obviously, this applies only to the standard-cell methodology). This approach suits the "custom voltage assignment" style better, as the per-row assignment provides a finer granularity and the overhead of moving between voltage domains is smaller.
Shared n-Well
[Figure: floorplan image of the shared n-well approach, with VDDH and VDDL circuits abutted and sharing the VDDH, VDDL, and VSS rails.]
[Shimazaki et al., ISSCC'03]
Slide 4.31
The most versatile approach is to redesign the standard cells, and have both VDDL and VDDH rails inside the cell ("shared n-well"). This approach is quite attractive, because we do not have to worry about area partitioning – both low-VDD and high-VDD cells can be abutted to each other. This approach was demonstrated on a high-speed adder/ALU circuit by Shimazaki et al. [Shimazaki, ISSCC'03]. However, it comes with a per-cell area overhead. Also, low-VDD cells experience reverse body biasing on the PMOS transistors, which degrades their performance.
Example: Multiple Supplies in a Block
• The lower-VDD portion is shared: "clustered voltage scaling"
[Figure: conventional design versus CVS structure – in both, flip-flops feed combinational paths, with the critical path highlighted; in the CVS structure, gates on non-critical paths operate at the lower supply, and level-shifting flip-flops restore the full swing.]
[Ref: M. Takahashi, ISSCC'98]
© IEEE 1998
Slide 4.32
Level conversion is another important issue in designing with multiple discrete supply voltages. It is easy to drive a low-voltage gate from a high-voltage one, but the opposite transition is hard owing to extra leakage, degraded signal slopes, and a performance penalty. It is hence worthwhile minimizing the number of low-to-high connections.

As we will see in the next few slides, low-to-high level conversion is best accomplished using positive feedback – which is naturally present in flip-flops and registers. This leads to the following strategy: every logical path starts at the high voltage level. Once a path transitions to the low voltage, it never switches back. The next up-conversion happens in flip-flops. Supply voltage assignment starts from the critical paths and works backward to find non-critical paths where the supply voltage can be reduced. This strategy is illustrated in the slide. The conventional design on the left has all gates operating at the nominal supply (the critical path is highlighted). Working backward from the flip-flops, non-critical paths are gradually converted to the low voltage until they become critical (gray-shaded gates operate at VDDL). This technique of grouping is called "clustered voltage scaling" (CVS).
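A minimal sketch of the CVS assignment rule follows: walk the gates backward from the flip-flops and demote a gate to VDDL only if all of its fanouts are already at VDDL (or it drives a level-converting flip-flop) and the path slack can absorb the delay penalty. The netlist representation, the fixed 1.5x delay penalty, and the purely local slack bookkeeping are simplifications for illustration; a real flow would re-run static timing analysis as gates are demoted.

```python
# Sketch of clustered voltage scaling (CVS): demote gates to VDDL working
# backward from the flip-flops, never letting a VDDL gate drive a VDDH gate.

PENALTY = 1.5  # assumed relative delay of a VDDL gate vs. a VDDH gate

def cvs_assign(gates, fanouts, slack, reverse_topo):
    """gates: name -> nominal delay; fanouts: name -> fanout gate names
    ([] means the gate drives a level-converting flip-flop); slack:
    name -> spare delay on the gate's worst path; reverse_topo: gate
    names ordered from the outputs back toward the inputs."""
    supply = {g: "VDDH" for g in gates}
    for g in reverse_topo:
        # CVS rule: a gate may go to VDDL only if nothing downstream is VDDH
        downstream_ok = all(supply[f] == "VDDL" for f in fanouts[g])
        extra = gates[g] * (PENALTY - 1.0)
        if downstream_ok and slack[g] >= extra:
            supply[g] = "VDDL"
            slack[g] -= extra          # local bookkeeping only (simplified)
    return supply
```

On a three-gate chain where the output gate has ample slack and the input gate has none, this yields exactly the CVS picture: the tail of the path runs at VDDL and the critical front end stays at VDDH.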
Level-Converting Flip-Flops (LCFFs)
• Pulsed half-latch versus master–slave LCFFs
• Smaller number of MOSFETs / less clock loading
• Faster level conversion using the half-latch structure
• Shorter D–Q path from the pulsed circuit
[Figure: schematics of a master–slave LCFF and a pulsed half-latch LCFF, with level conversion performed in the high-voltage latch stage (internal nodes mo and sf; clock signals ck, ckb, ckd; transistors MN1, MN2).]
© IEEE 2003
[Ref: F. Ishihara, ISLPED'03]
Slide 4.33
As the level-converting flip-flops play a crucial role in the CVS scheme, we present a number of flip-flops that can do level conversion while maintaining good speed.

The first circuit is based on the traditional master–slave scheme, with the master and slave stages operating at the low and high voltages, respectively. The positive feedback action in the slave latch ensures efficient low-to-high level conversion. The high-voltage node sf is isolated from the low-voltage node mo by the pass-transistor, gated by the low-voltage signal ck.

The same concept can also be applied in an edge-triggered flip-flop, as shown in the second circuit (called the pulse-based half-latch). A pulse generator derives a short pulse from the clock edge, ensuring that the latch is enabled only for a very short time. This circuit has the advantage of being simpler.
Dynamic Realization of Pulsed LCFF
• Pulsed precharge LCFF (PPR)
– Fast level conversion by the precharge mechanism
– Suppressed charge/discharge toggle by conditional capture
– Short D–Q path
[Figure: schematic of the pulsed precharge latch, with precharge transistor MP1, evaluation transistors MN1/MN2, and VDDH output stages performing the level conversion.]
[Ref: F. Ishihara, ISLPED'03] © IEEE 2003
Slide 4.34
Dynamic gates with NMOS-only evaluation transistors are naturally suited for operation with reduced logic swing, as the input signal does not need to develop a full high-VDD swing to drive the output node to logic zero. The reduced swing only results in a somewhat longer delay. A dynamic structure with implicit level conversion is shown in the figure.

Observe that level conversion is also possible in an asynchronous fashion. A number of such non-clocked converters will be presented in a later chapter on Interconnect (Chapter 6). Clocked circuits tend to be more reliable, however.
Case Study: ALU for 64-bit Microprocessor
[Figure: block diagram of the 64-bit ALU, comprising carry generation, gp generation, partial-sum generation, a logical unit, sum selection, input multiplexers (9:1, 5:1, 2:1), clock generation, and a long loop-back bus sumb (0.5 pF) driven through INV1/INV2. VDDH circuits (the carry tree) and VDDL circuits (partial-sum generator, logical unit, bus driver) are marked.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.35
A real-life example of a high-performance Itanium-class (Intel) data path helps to demonstrate the effective use of dual VDD. From the block diagram, it is apparent that the critical component from an energy perspective is the very large output capacitance of the ALU, which is due to its high fan-out. Hence, lowering the supply voltage on the output bus yields the largest potential for power reduction.

The shared-well technique was chosen for the implementation of this 64-bit ALU module, which is composed of the ALU, the loop-back bus driver, the input operand selectors, and the register files. For performance reasons, a domino circuit style was adopted. As carry generation is the most critical operation, circuits in the carry tree are assigned to the VDDH domain. On the other hand, the partial-sum generator and the logical unit are assigned to the VDDL domain. In addition, the bus driver, as the gate with the largest load, is also supplied from VDDL. The level conversion from the VDDL signal to the VDDH signal is performed by the sum selector and the 9:1 multiplexer.
Low-Swing Bus and Level Converter
• Level conversion is done by a domino 9:1 MUX
• INV2 is placed near the 9:1 MUX to increase noise immunity
[Figure: schematic of the low-swing loop-back bus sumb (driven by INV1 from VDDL) and the domino level converter (9:1 MUX) operating at VDDH, with keeper and precharge devices and the select signal ain0sel at VDDH.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.36
This schematic shows the low-swing loop-back bus and the domino-style level converter. Since the loop-back bus sumb has a large capacitive load, a low-voltage implementation is quite attractive. Some issues deserve special attention:

• One of the concerns of the shared-well approach is the reverse biasing of the PMOS transistor. As sum is a monotonically rising signal (the output of a domino stage), this does not impact the performance of the important gate INV1.
• In dynamic-logic designs, noise is one of the critical issues. To eliminate the effects of disturbances on the loop-back bus, the receiver INV2 is placed near the 9:1 multiplexer to increase noise immunity.

• The output of INV2, which is a VDDL signal, is converted to VDDH by the 9:1 multiplexer. The level conversion is fast, as the precharge levels are independent of the level of the input signal.
Measured Results: Energy and Delay
[Figure: measured energy (pJ) versus cycle time TCYCLE (ns) at room temperature for single-supply operation (VDDH = 1.8 V, reaching 1.16 GHz) and shared-well operation. With VDDL = 1.4 V, energy drops by 25.3% for a 2.8% delay increase; with VDDL = 1.2 V, energy drops by 33.3% for an 8.3% delay increase.]
[Ref: Y. Shimazaki, ISSCC'03] © IEEE 2003
Slide 4.37
This figure plots the familiar energy–delay curves of the ALU (as measured). The energy–delay curve for single-supply operation is drawn as a reference. At the nominal supply voltage of 1.8 V (for a 180 nm CMOS technology), the chip operates at 1.16 GHz. Introducing a second supply yields an energy saving of 33% at the small cost of an 8% increase in delay. This example demonstrates that the theoretical results derived earlier in this chapter hold up in actual implementations.
Practical Transistor Sizing
• Continuous sizing of transistors is only an option in custom design
• In ASIC design flows, the options are set by the available library
• Discrete sizing options are made possible in a standard-cell design methodology by providing multiple versions of the same cell
– Leads to larger libraries (> 800 cells)
– Easily integrated into technology mapping
Slide 4.38
Transistor sizing is the other high-impact design parameter we have explored at the circuit level so far. The theoretical analysis assumes a continuous sizing model, which is only possible in fully custom design. In ASIC design flows, transistor sizes are predetermined by the cell library. In the early days of application-specific integrated circuit (ASIC) design and automated synthesis, libraries used to be quite small, counting between 50 and 100 cells. Energy considerations have changed the picture substantially. With the need for various sizing options for each logical cell,
industrial libraries now count close to 1000 cells. As with supply voltages, it is necessary to move from a continuous sizing model to a discrete one. Fortunately, the overall impact on energy efficiency of doing so can be quite small.
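The cost of discretization can be illustrated with a toy logical-effort-style model: snap the continuous optimum size to the smallest library drive strength that still meets the delay target, and measure the extra switching energy. The delay model and the size set {1, 2, 4, 8, 16} are assumptions for illustration, not data from any particular library.

```python
# Effect of discretizing transistor sizes: snap a continuous size to the
# smallest available library size that still meets the delay target.
# Simple model: delay(s) = p + c_load / s, switching energy proportional to s.

LIB_SIZES = [1, 2, 4, 8, 16]   # assumed drive strengths offered by the library

def continuous_size(c_load, p, d_target):
    return c_load / (d_target - p)          # smallest s meeting the target

def discrete_size(c_load, p, d_target):
    for s in LIB_SIZES:                     # smallest library cell that fits
        if p + c_load / s <= d_target:
            return s
    return LIB_SIZES[-1]                    # fall back to the largest cell

def energy_overhead(c_load, p, d_target):
    """Fractional extra switching energy paid for rounding up."""
    return discrete_size(c_load, p, d_target) / continuous_size(c_load, p, d_target) - 1.0
```

For a load of 10 fF, parasitic delay 1, and target delay 4 (arbitrary units), the continuous optimum is s ≈ 3.33 while the library forces s = 4, a 20% energy overhead on that one gate; averaged over a whole netlist, with several gates landing close to a library size, the net penalty tends to be far smaller.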
Technology Mapping
• Larger (higher fan-in) gates reduce capacitance, but are slower
[Figure: a small logic network with inputs a, b, c, d, output f, and an annotated slack of 1 on a non-critical path, illustrating the mapping choices.]
Slide 4.39
In the ASIC design flow, it is in the "technology mapping" phase that the actual library cells are selected for the implementation of a given logical function. The logic network, resulting from "technology-independent" optimizations, is mapped onto the library cells such that performance constraints are met and energy is minimized. Hence, this is where the transistor (gate) sizing actually happens. Beyond choosing between identical cells with different sizes, technology mapping also gets to choose between different gate mappings: simple cells with small fan-in, or more complex cells with large fan-in. Over the last decades, it has been common understanding that simple gates are good from a performance perspective, as delay is a quadratic function of fan-in. From an energy perspective, complex gates are more attractive, as their intrinsic capacitance is substantially smaller than the inter-gate routing capacitance of a network of simple gates. Hence, it makes sense for complex gates to be used preferentially on non-critical paths.
Technology Mapping
Example: four-input AND
(a) Implemented using four-input NAND + INV
(b) Implemented using two-input NAND + two-input NOR

Gate type | Area (cell units) | Input cap. (fF) | Average delay (ps), Library 1: High-Speed | Average delay (ps), Library 2: Low-Power
INV | 3 | 1.8 | 7.0 + 3.8CL | 12.0 + 6.0CL
NAND2 | 4 | 2.0 | 10.3 + 5.3CL | 16.3 + 8.8CL
NAND4 | 5 | 2.0 | 13.6 + 5.8CL | 22.7 + 10.2CL
NOR2 | 3 | 2.2 | 10.7 + 5.4CL | 16.7 + 8.9CL

(Delay formulas: CL in fF; numbers calibrated for 90 nm.)
Slide 4.40
This argument is illustrated with an example. In this slide, we have summarized the area, delay, and energy properties of four cells (INV, NAND2, NOR2, NAND4) implemented in a 90 nm CMOS technology. Two different libraries are considered: a low-power and a high-performance version.
Technology Mapping – Example
Four-input AND:

 | (a) NAND4 + INV | (b) NAND2 + NOR2
Area | 8 | 11
HS: Delay (ps) | 31.0 + 3.8CL | 32.7 + 5.4CL
LP: Delay (ps) | 53.1 + 6.0CL | 52.4 + 8.9CL
Sw Energy (fF) | 0.1 + 0.06CL | 0.83 + 0.06CL

• Area
– The four-input version is more compact than the two-input one (two gates vs. three gates)
• Timing
– Both implementations are two-stage realizations
– The second-stage INV in (a) is a better driver than the NOR2 in (b)
– For more complex blocks, simpler gates will show better performance
• Energy
– Internal switching increases energy in the two-input case
– The low-power library has worse delay, but lower leakage (see later)
Slide 4.41
These libraries are used to map the same function, an AND4, using either two-input or four-input gates (NAND4 + INV or NAND2 + NOR2). The resulting metrics show that the complex-gate implementation yields a substantial reduction in energy and also reduces area. For this simple example, the complex-gate version is just as fast, if not faster. However, this is due to the somewhat simplistic nature of the example. The situation becomes even more pronounced if the library contains very complex gates (e.g., fan-in of 5 or 6).
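The composite numbers of this slide follow directly from the cell table of Slide 4.40: the first stage is evaluated with the second stage's input capacitance as load, and the second stage contributes its load-dependent term. A short script reproduces them:

```python
# Recomputing the two AND4 mappings of Slide 4.41 from the cell data of
# Slide 4.40 (delay formulas in ps, load CL in fF, 90 nm calibration).
# cells: name -> (input cap in fF, HS delay (a, b) meaning a + b*CL, LP delay)

cells = {
    "INV":   (1.8, (7.0, 3.8),  (12.0, 6.0)),
    "NAND2": (2.0, (10.3, 5.3), (16.3, 8.8)),
    "NAND4": (2.0, (13.6, 5.8), (22.7, 10.2)),
    "NOR2":  (2.2, (10.7, 5.4), (16.7, 8.9)),
}

def two_stage_delay(first, second, lib):
    """Path delay of `first` loaded by `second`'s input cap, plus `second`
    driving an external load CL; returns (constant term, CL coefficient)."""
    idx = 1 if lib == "HS" else 2
    cin2 = cells[second][0]
    a1, b1 = cells[first][idx]
    a2, b2 = cells[second][idx]
    return (a1 + b1 * cin2 + a2, b2)
```

Evaluating `two_stage_delay("NAND4", "INV", "HS")` gives approximately (31.0, 3.8), matching the 31.0 + 3.8CL entry of the table; the other three entries check out the same way.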
Slide 4.42
Technology mapping has brought us almost seamlessly to the next abstraction level in the design process – the logic level. Transistor sizes, voltage levels, and circuit style are the main optimization knobs at the circuit level. At the logic level, the gate-network topology to implement a given
function is chosen and fine-tuned. The link between the two is the already-discussed technology-mapping process. Beyond gate selection and transistor sizing, technology mapping also performs pin assignment. It is well known that, from a performance perspective, it is a good idea to connect the most critical signal to the input pin "closest" to the output node. For a CMOS NAND gate, for instance, this would be the top transistor of the NMOS pull-down chain. From a power reduction point of view, on the other hand, it is wise to connect the most active signal to that node, as this minimizes the switched capacitance.

The technology-independent part of the logic-synthesis process consists of a sequence of optimizations that manipulate the network topology to minimize delay, power, or area. As we have become used to, each such optimization represents a careful trade-off, not only between power and delay, but sometimes also between the different components of power, such as activity and capacitance. This is illustrated with a couple of examples in the following slides.
Logic Restructuring
• Logic restructuring to minimize spurious transitions
• Buffer insertion for path balancing
[Figure: an unbalanced logic network annotated with signal values and path depths (1, 2, 3), showing how unequal path lengths cause glitching, and how restructuring or buffer insertion balances the paths.]
Slide 4.43
In Chapter 3, we established that the occurrence of dynamic hazards in a logic network is minimized when the network is balanced from a timing perspective – that is, when most timing paths are of similar lengths. Paths of unequal length can be equalized in a number of ways: (1) through restructuring of the network, such that an equivalent network with balanced paths is obtained; (2) through the introduction of non-inverting buffers on the fastest paths. The attentive reader will realize that although the latter helps to minimize glitching, the buffers themselves add extra switching capacitance. Hence, as always, buffer insertion is a careful trade-off process. Analysis of circuits
Gate-Level Trade-offs for Power
• Technology mapping
– Gate selection
– Sizing
– Pin assignment
• Logical optimizations
– Factoring
– Restructuring
– Buffer insertion/deletion
– Don't-care optimization
generated by state-of-the-art synthesis tools have shown that simple buffers are responsible for a considerable part of the overall power budget of combinational modules.
Algebraic Transformations – Factoring
• Idea: modify the network to reduce capacitance
• Caveat: this may increase activity!
[Figure: the two factorings f = a·b + a·c and f = a·(b + c), annotated with input probabilities pa = 0.1, pb = 0.5, pc = 0.5 and node transition probabilities (p1 = 0.051, p2 = 0.051, p3 = 0.076 for the first form; p4 = 0.375, p5 = 0.076 for the second).]
Slide 4.44
Factoring is another transformation that may introduce unintended consequences. From a capacitance perspective, it seems obvious that a simpler logical expression would require less power as well. For instance, translating the function f = a·b + a·c into its equivalent f = a·(b + c) seems a no-brainer, as it requires one less gate. However, it may also introduce an internal node with a substantially higher transition probability, as annotated on the slide. This may actually increase the net power. The lesson to be drawn is that power-aware logic synthesis must not only be aware of network topology and timing, but should – to the best possible extent – incorporate parameters such as capacitance, activity, and glitching. In the end, the goal is again to derive the familiar pareto-optimal energy–delay curves, or to reformulate the synthesis process along the following lines: choose the network that minimizes power for a given maximum delay, or minimizes the delay for a maximum power.
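Under an independence assumption, the activity bookkeeping for the two factorings is easy to reproduce. The sketch below uses the per-node convention α = p(1 − p) for a 0→1 transition, which need not match the exact convention behind the slide's annotated numbers; the qualitative conclusion, that the factored form concentrates activity on the new internal node b + c, survives either way.

```python
# Switching activity of the two factorings of f = a*b + a*c, assuming
# independent inputs and alpha = p*(1 - p) per node (0->1 probability).
# Input probabilities as on Slide 4.44: p(a) = 0.1, p(b) = p(c) = 0.5.

def act(p):                                # 0->1 transition probability
    return p * (1.0 - p)

def activity_sum(pa, pb, pc):
    # version 1: f = (a AND b) OR (a AND c) -- two internal AND nodes
    p_ab, p_ac = pa * pb, pa * pc
    p_or = p_ab + p_ac - pa * pb * pc      # P(ab OR ac), shared literal a
    v1 = act(p_ab) + act(p_ac) + act(p_or)
    # version 2: f = a AND (b OR c) -- one internal OR node
    p_bc = pb + pc - pb * pc
    v2 = act(p_bc) + act(pa * p_bc)
    return v1, v2
```

With pa = 0.1 and pb = pc = 0.5, the node b + c switches with α = 0.1875, far more than either AND node (α = 0.0475), so the factored network can burn more power despite having one gate fewer once node capacitances are comparable.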
Lessons from Circuit Optimization
• Joint optimization over multiple design parameters is possible using a sensitivity-based optimization framework
– Equal marginal costs ⇔ energy-efficient design
• Peak performance is VERY power inefficient
– About 70% energy reduction for a 20% delay penalty
– Additional variables enable higher energy efficiency
• Two supply voltages are in general sufficient; three or more supply voltages offer only a small advantage
• The choice between sizing and supply-voltage parameters depends upon circuit topology
• But … leakage has not been considered so far

Slide 4.45
Based on the preceding discussions, we can now draw a clear set of guidelines for energy–delay optimization at the circuit and logical levels. An attempt at doing so is presented in this slide.

Yet, so far we have only addressed dynamic power. In the rest of the chapter we tackle the other important contributor to power in contemporary designs: leakage.
Considering Leakage at Design Time
• Considering leakage as well as dynamic power is essential in sub-100 nm technologies
• Leakage is not essentially a bad thing
– Increased leakage leads to improved performance, allowing for lower supply voltages
– Again a trade-off issue …
Slide 4.46
Leakage has so far been presented as an evil side effect of nanometer-scale technology scaling, something that should be avoided at all cost. However, within a given technology node, this may not necessarily be the case. For instance, a lower threshold (and increased leakage) allows for a lower supply voltage at the same delay – effectively trading off dynamic power for static power. This was already illustrated graphically in Slide 3.41, where power and delay of a logical function were plotted as functions of supply and threshold voltages. Once one realizes that allowing an amount of static power may actually be a good thing, the next question inevitably arises: is there an optimal balance between dynamic and static power, and if so, what is the "golden" ratio?
Leakage – Not Necessarily a Bad Thing
• Optimal designs have high leakage (Elk/Esw ≈ 0.5):

  (Elk/Esw)opt = K / (ln(Ld/αavg) − 2)

• Must adapt to process and activity variations

Topology | Inv | Add | Dec
(Elk/Esw)opt | 0.8 | 0.5 | 0.2

[Figure: normalized energy Enorm versus the static-to-dynamic energy ratio Estatic/Edynamic (log scale, 10^−2 to 10^1) for two versions of the same design (VTHref − 180 mV at 0.81 VDDmax, and VTHref − 140 mV at 0.52 VDDmax), both showing a shallow minimum.]
[Ref: D. Markovic, JSSC'04] © IEEE 2004
Slide 4.47
The answer is an unequivocal yes. This is best illustrated by the graph in this slide, which plots the normalized minimum energy per operation, for a given function and a given delay, as a function of the ratio between static and dynamic energy. The same curve is also plotted for a modified version of the same function.

A number of interesting observations can be drawn from this set of graphs:

• The most energy-efficient designs have a considerable amount of leakage energy.
• For both designs, the static energy at the optimum is approximately 50% of the dynamic energy (or one-third of the total energy), and does not vary very much between the different circuit topologies.
• The curves are fairly flat around the minimum, making the minimum energy somewhat insensitive to the precise ratio.

This ratio does not change much for different topologies unless activity changes by orders of magnitude, as the optimal ratio is a logarithmic function of activity and logic depth. Still, looking at significantly different circuit topologies, we found that the optimal
ratio of leakage-to-switching energy did not change much. Moreover, in the range defined by these extreme cases, the energy of adder-based implementations is still very close to the minimum, for leakage-to-switching ratios from 0.2 to 0.8, as shown in this graph. A similar situation occurs if we analyze inverter-chain and memory-decoder circuits assuming an optimal leakage-to-switching ratio of 0.5.

From this analysis, we can derive a very simple general result: energy is minimized when the leakage-to-switching ratio is about 0.5, regardless of logic topology or function. This is an important practical result. We can use this knowledge to determine the optimal VDD and VTH in a broad range of designs.
Refining the Optimization Model
• Switching energy:

  Edyn = α0→1 · Ke (S + f) · VDD²

• Leakage energy:

  Estat = S · I0(Ψ) · e^((λd·VDD − VTH)/(γ·kT/q)) · VDD · Tcycle

  with I0(Ψ): normalized leakage current with the inputs in state Ψ
Slide 4.48
The effect of leakage is easily introduced into our earlier-defined optimization framework. Remember that the leakage current of a module is a function of the state of its inputs. However, it is often acceptable to use the average leakage over the different states. Another observation is that the ratio between dynamic and static energy is a function of the cycle time and the average activity per cycle.
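A numeric companion to this model makes the trade-off concrete: sweep the threshold, re-solve the supply for constant delay with an alpha-power law, and total the two energy components. All constants below (ALPHA, NKT_Q, KE, I0_T) are illustrative placeholders, so the optimum this toy finds need not land at the ~0.5 ratio quoted for real designs.

```python
import math

# For each threshold, find the supply that keeps delay constant
# (alpha-power law), then sum switching and leakage energy.

ALPHA = 1.3            # velocity-saturation exponent in the delay model
NKT_Q = 0.04           # gamma * kT/q (V), subthreshold slope factor
KE, I0_T = 1.0, 2.0e3  # switching-energy scale, leakage scale (I0 * Tcycle)

def delay(vdd, vth):
    return vdd / (vdd - vth) ** ALPHA

def vdd_for_delay(vth, d_target):
    # delay decreases monotonically with vdd, so bisection works
    lo, hi = vth + 1e-4, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if delay(mid, vth) > d_target:
            lo = mid
        else:
            hi = mid
    return hi

def energies(vth, d_target):
    vdd = vdd_for_delay(vth, d_target)
    e_dyn = KE * vdd ** 2
    e_stat = I0_T * vdd * math.exp(-vth / NKT_Q)
    return e_dyn, e_stat

def sweep(d_target, vth_grid):
    """Return the energy-minimizing threshold and the leakage-to-switching
    ratio at that optimum."""
    best = min(vth_grid, key=lambda v: sum(energies(v, d_target)))
    e_dyn, e_stat = energies(best, d_target)
    return best, e_stat / e_dyn
```

The sweep exhibits the behavior described in the preceding slides: an interior minimum with a non-trivial leakage share, and a shallow energy bowl around it.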
Reducing Leakage @ Design Time
• Using longer transistors
– Limited benefit
– Increase in active current
• Using higher thresholds
– Channel doping
– Stacked devices
– Body biasing
• Reducing the voltage!!
Slide 4.49
When trying to manipulate the leakage current, the designer has a number of knobs at her disposal. In fact, they are quite similar to the ones we used for optimizing the dynamic power: transistor sizes, and threshold and supply voltages. How they influence the leakage current is substantially different, though. The choice of the threshold voltage is especially important.
Longer Channels
• 10% longer gates reduce leakage by 50%
– Increases switching power by 18% with W/L = constant
• Doubling L reduces leakage by 5x
– Impacts performance
– Attractive when not required to increase W (e.g., memory)
[Figure: normalized switching energy and normalized leakage power versus transistor length (100–200 nm) in a 90 nm CMOS process – leakage drops steeply as the channel gets longer, while switching energy rises gradually.]
Slide 4.50
While wider transistors obviously leak more, the chosen transistor length has an impact as well. As already shown in Slide 2.15, very short transistors suffer from a sharp reduction in threshold voltage, and hence an exponential increase in leakage current. In leakage-critical designs such as memory cells, it therefore makes sense to consider the use of transistors with channel lengths longer than those prescribed by the nominal process parameters. This comes at a penalty in dynamic power, though that increase is relatively small. For a 90 nm CMOS technology, it was shown that increasing the channel length by 10% reduces the leakage current by 50%, while raising the dynamic power by 18%. It may seem strange to deliberately forgo one of the key benefits of technology scaling – that is, smaller transistors – yet sometimes the penalty in area and performance is inconsequential, whereas the gain in overall power consumption is substantial.
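These two numbers fix the break-even point directly: drawing the channel 10% longer pays off whenever leakage is a large enough fraction of the total power. A few lines make the arithmetic explicit (the 1.18x and 0.5x factors are the ones quoted above; everything else follows from them):

```python
# Trade-off quoted for 90 nm: gates drawn 10% longer roughly halve leakage
# power but raise switching power by ~18% (constant W/L). Whether the swap
# pays off depends on how leakage-dominated the block is.

SW_UP, LK_DOWN = 1.18, 0.50    # x1.18 switching power, x0.5 leakage power

def power_ratio(leak_fraction):
    """Total power with 10% longer gates, relative to nominal, for a block
    whose nominal power is a fraction `leak_fraction` leakage."""
    sw = 1.0 - leak_fraction
    return sw * SW_UP + leak_fraction * LK_DOWN

def break_even():
    # sw * (1.18 - 1) = lk * (1 - 0.5) at equality; solve for lk fraction
    return (SW_UP - 1.0) / ((SW_UP - 1.0) + (1.0 - LK_DOWN))
```

With these factors, the longer channel wins once leakage exceeds roughly 26% of total power (0.18/0.68), and a block that is half leakage, such as a low-activity memory array, sees its total power drop to about 84% of nominal.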
Using Multiple Thresholds
• There is no need for level conversion
• Dual thresholds can be added to standard design flows
– High-VTH and low-VTH libraries are standard in sub-0.18 μm processes
– For example: synthesize using only high VTH, then simply swap in low-VTH cells to improve timing
– Second-VTH insertion can be combined with resizing
• Only two thresholds are needed per block
– Using more than two yields small improvements
Slide 4.51
Using multiple threshold voltages is an effective tool in the static-power optimization portfolio. In contrast to the use of multiple supply voltages, introducing multiple thresholds has relatively little impact on the design flow. No level converters are needed, and no special layout strategies are required. The real burden is the added cost to the manufacturing process. From a design perspective, the challenge is in the technology-mapping process, which is where the choice between cells with different thresholds is actually made.
Three VTH’s
VDD = 1.5 V, VTH.1 = 0.3 V
+
VTH.3(V)V
TH
.2(V
)0.4 0.6 0.8 1 1.2 1.4
0.4
0.6
0.8
1
1.2
1.4
Lea
kag
e R
edu
ctio
n R
atio
VTH.3(V)
VTH.2 (V )
00.5
11.5
0
11.5
0.5
0
0.2
0.4
0.6
0.8
1
Impact of third threshold very limited
[Ref: T. Kuroda, ICCAD’02]
Slide 4.52
The immediate question is how many threshold voltages are truly desirable. As with supply voltages, the addition of more levels comes at a substantial cost, and most likely yields diminishing returns. A number of studies have shown that although there is still some benefit in having three discrete threshold voltages for both NMOS and PMOS transistors, it is quite marginal. Hence, two thresholds for both devices have become the norm in sub-100 nm technologies.
Using Multiple Thresholds
[Figure: Logic network with flip-flops, showing a mix of low-VTH and high-VTH cells assigned on a cell-by-cell basis]
Cell-by-cell VTH assignment (not at block level)
Achieves all-low-VTH performance with substantial reduction in leakage
[Ref: S. Date, SLPE’94]
Slide 4.53
As was the case with dynamic power reduction, the strategy is to increase the threshold voltages in timing paths that are not critical, leading to static leakage power reduction at no performance and dynamic power cost. The appealing factor is that high-threshold cells can be introduced anywhere in the logic structure without major side effects. The burden is clearly on the tools, as timing slack can be used in a number of ways: reducing transistor sizes, supply voltages, or threshold voltages. The former two reduce both dynamic and static power, whereas the latter only influences the static component. Remember, however, that an optimal design carefully balances both components.
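The slack-driven cell swap described here can be sketched as a simple greedy pass. The sketch below is illustrative only: the cell delay and leakage numbers are hypothetical (not from any real library), and a single timing path stands in for full static timing analysis.

```python
# Greedy dual-VTH assignment sketch: start with all cells at low VTH
# (fastest), then swap cells with positive timing slack to high VTH,
# as long as the overall timing constraint still holds.
# Cell delay/leakage values are illustrative, not from a real library.

def assign_vth(cells, t_max):
    """cells: dicts with 'delay_lo', 'delay_hi', 'leak_lo', 'leak_hi'.
    Assumes a single path: path delay is the sum of cell delays."""
    for c in cells:
        c['vth'] = 'low'          # all-low-VTH meets timing by construction
    path_delay = sum(c['delay_lo'] for c in cells)
    # Swap the cells offering the largest leakage saving first
    order = sorted(cells, key=lambda c: c['leak_lo'] - c['leak_hi'], reverse=True)
    for c in order:
        extra = c['delay_hi'] - c['delay_lo']
        if path_delay + extra <= t_max:   # slack still available?
            c['vth'] = 'high'
            path_delay += extra
    leakage = sum(c['leak_hi'] if c['vth'] == 'high' else c['leak_lo']
                  for c in cells)
    return path_delay, leakage

cells = [dict(delay_lo=10, delay_hi=13, leak_lo=40, leak_hi=4) for _ in range(5)]
delay, leak = assign_vth(cells, t_max=56)  # 6 units of slack over all-low (50)
print(delay, leak)  # two cells swapped: delay 56, leakage 3*40 + 2*4 = 128
```

Real flows, of course, work on a timing graph with per-path slacks rather than a single path, but the principle – spend slack on the highest-leakage cells first – is the same.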
Optimizing Power @ Design Time – Circuit-Level Techniques 105
Dual-VTH Domino
Shaded transistors are low-threshold
Low-threshold transistors used only in critical paths
[Figure: Domino logic stage with clocked precharge transistor P1, pull-down network, and inverters Inv1–Inv3; low-threshold (shaded) transistors appear only along the critical path between Clkn and Clkn+1]
Slide 4.54
Most of the discussion on leakage so far has concentrated on static logic. Dynamic-circuit designers are arguably even more worried: for them, leakage means not only power dissipation but also a serious degradation in noise margin. Again, a careful selection between low- and high-threshold devices can go a long way. Low-threshold transistors are used in the timing-critical paths, such as the pull-down logic module. Yet even with these options, it is becoming increasingly apparent that dynamic logic is facing serious challenges in the extreme-scaling regimes.
Multiple Thresholds and Design Methodology
Easily introduced in standard-cell design methodology by extending cell libraries with cells with different thresholds
– Selection of cells during technology mapping
– No impact on dynamic power
– No interface issues (as was the case with multiple VDDs)
Impact: can reduce leakage power substantially
Slide 4.55
Repeating what was stated earlier, the concept of multiple thresholds is introduced quite easily into existing commercial design flows. In hindsight, this is clearly a no-brainer. The major impact is that the size of the cell library doubles (at least), which increases the cost of the characterization process. This, combined with the introduction of a range of size options for each cell, has led to an explosion in the size of a typical library. Libraries with more than 1000 cells are not an exception.
Dual-VTH for High-Performance Design

                 High-VTH only   Low-VTH only   Dual-VTH
Total slack          –53 ps          0 ps         0 ps
Dynamic power        3.2 mW         3.3 mW       3.2 mW
Static power         914 nW        3873 nW      1519 nW

All designs synthesized automatically using Synopsys flows
[Courtesy: Synopsys, Toshiba, 2004]
Slide 4.56
In this experiment, performed jointly by Toshiba and Synopsys, the impact of the introduction of cells with multiple thresholds in a high-performance design is analyzed. The dual-threshold strategy leaves timing and dynamic power unchanged, while reducing the leakage power by more than half relative to the all-low-VTH design.
Example: High- vs. Low-Threshold Libraries
[Figure: Leakage power (nW) of selected combinational benchmark circuits in a 130 nm CMOS process, comparing high-VTH, low-VTH, and dual-VTH implementations]
[Courtesy: Synopsys 2004]
Slide 4.57
A more detailed analysis is shown in this slide, which also illustrates the impact of the chosen design flow over a set of six benchmarks with varying complexity. It compares the high-VTH and low-VTH designs (the extremes) with a design starting from low-VTH transistors only, followed by a gradual introduction of high-VTH devices, and vice versa. It shows that the latter strategy – that is, starting exclusively with high-VTH transistors and introducing low-VTH transistors only in the critical paths to meet the timing constraints – yields better results from a leakage perspective.
Complex Gates Increase Ion/Ioff Ratio
Ion and Ioff of single NMOS versus stack of 10 NMOS transistors
Transistors in stack are sized up to give similar drive
[Figure: Two plots versus VDD (0 to 1 V) in a 90 nm technology: Ioff (nA) and Ion (μA), each comparing the no-stack and stack configurations]
Slide 4.58
In earlier chapters, we have already introduced the notion that stacking transistors reduces the leakage current super-linearly, primarily due to the DIBL effect. The stacking effect is an effective means of managing leakage current at design time. As illustrated in the graphs, the combination of stacking and transistor sizing allows us to maintain the on-current, while keeping the off-current in check, even for higher supply voltages.
Complex Gates Increase Ion/Ioff Ratio
Stacking transistors suppresses submicron effects
– Reduced velocity saturation
– Reduced DIBL effect
– Allows for operation at lower thresholds
[Figure: Ion/Ioff ratio versus VDD for stack and no-stack configurations in a 90 nm technology; at VDD = 1 V the stack improves the ratio by a factor of 10]
Slide 4.59
This combined effect is put in a clear perspective in this graph, which plots the Ion/Ioff ratio of a transistor stack of 10 versus a single transistor as a function of VDD. For a supply voltage of 1 V, the stacked transistor chain features an on- versus off-current ratio that is 10 times higher. This enables us to lower thresholds to values that would be prohibitive in simple gates. Overall, it also indicates that the usage of complex gates, already beneficial in the reduction of dynamic power, helps to reduce static power as well. From a power perspective, this is a win–win situation.
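The mechanism behind the stack effect can be reproduced with a toy subthreshold model. The sketch below uses a two-transistor stack and illustrative parameter values (not calibrated to the 90 nm data in the figures): the intermediate node settles where the series currents balance, giving the top device a negative VGS and a smaller VDS (hence less DIBL), which suppresses the leakage super-linearly.

```python
import math

# Toy subthreshold model (illustrative parameters, not 90 nm data):
#   I = I0 * exp((VGS - VTH + eta*VDS) / (n*vT)) * (1 - exp(-VDS/vT))
# where eta models DIBL, n is the slope factor, vT the thermal voltage.

VDD, VTH, N, VT, ETA, I0 = 1.0, 0.3, 1.4, 0.026, 0.1, 1.0

def isub(vgs, vds):
    """Subthreshold drain current, normalized to I0."""
    return I0 * math.exp((vgs - VTH + ETA * vds) / (N * VT)) \
              * (1.0 - math.exp(-vds / VT))

def stack_leakage():
    """Off-current of two stacked NMOS devices (both gates at 0 V)."""
    lo, hi = 0.0, VDD
    for _ in range(60):              # bisection for intermediate node Vx
        vx = 0.5 * (lo + hi)
        i_top = isub(-vx, VDD - vx)  # top device: source sits at Vx
        i_bot = isub(0.0, vx)        # bottom device: gate and source at 0
        if i_top > i_bot:            # top still stronger: Vx must rise
            lo = vx
        else:
            hi = vx
    return isub(0.0, 0.5 * (lo + hi))

i_single = isub(0.0, VDD)
i_stack = stack_leakage()
print(f"stack-of-2 leakage reduction: {i_single / i_stack:.1f}x")
```

With these parameters the two-device stack already suppresses leakage by roughly an order of magnitude; deeper stacks, as in the 10-transistor example of the figure, push the reduction further.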
Complex Gates Increase Ion/Ioff Ratio
Example: four-input NAND
Fan-in(4) versus Fan-in(2) implementation
With transistors sized for similar performance:
Leakage of Fan-in(2) = Leakage of Fan-in(4) × 3
(Averaged over all possible input patterns)
[Figure: Leakage current (nA) for each of the 16 input patterns, comparing the fan-in(2) and fan-in(4) implementations]
Slide 4.60
The advantage of using complex gates is illustrated with a simple example: a fan-in(4) NAND versus a fan-in(2) NAND/NOR implementation of the same function. The leakage current is analyzed over all 16 input combinations (remember that leakage is state-dependent). On average, the complex-gate topology has a leakage current that is three times smaller than that of the implementation employing simple gates. One way of looking at this is that, for the same functionality, complex gates come with fewer leakage paths. However, they also carry a performance penalty. For high-performance designs, simple gates are a necessity in the critical-timing paths.
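The state-dependent enumeration can be sketched with a crude counting model: an off device leaks I_OFF, each extra off device in a series stack suppresses leakage by roughly a factor of 10, and parallel off devices add. Sizing effects are ignored, so the result is qualitative only (the book reports about 3× with transistors sized for speed).

```python
from itertools import product

# Pattern-dependent leakage: one fan-in(4) NAND versus a two-level
# fan-in(2) network (two NAND2s feeding a NOR2, which realizes the same
# function up to the output polarity). Crude model: off device leaks
# I_OFF; each extra off device in series divides leakage by STACK.

I_OFF, STACK = 1.0, 10.0

def nand_leak(inputs):
    zeros = inputs.count(0)
    if zeros == 0:                       # output low: parallel PMOS all off
        return len(inputs) * I_OFF
    return I_OFF / STACK ** (zeros - 1)  # NMOS series stack, `zeros` off

def nor_leak(inputs):
    ones = inputs.count(1)
    if ones == 0:                        # output high: parallel NMOS all off
        return len(inputs) * I_OFF
    return I_OFF / STACK ** (ones - 1)   # PMOS series stack, `ones` off

complex_total = simple_total = 0.0
for a, b, c, d in product((0, 1), repeat=4):
    complex_total += nand_leak([a, b, c, d])
    x, y = 1 - (a & b), 1 - (c & d)      # internal NAND2 outputs
    simple_total += nand_leak([a, b]) + nand_leak([c, d]) + nor_leak([x, y])

print(f"avg leakage, fan-in(4): {complex_total/16:.2f}, "
      f"fan-in(2) network: {simple_total/16:.2f}")
```

Even this bare-bones model reproduces the trend: the simple-gate network, with its three gates and many shallow stacks, leaks several times more on average than the single complex gate.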
[Ref: S. Narendra, ISLPED’01]
Example: 32-bit Kogge–Stone Adder
[Figure: Histogram of standby leakage current (μA) over random input vectors, comparing high-VTH and low-VTH versions; the averages differ by a factor of 18]
Reducing the threshold by 150 mV increases the leakage of a single NMOS transistor by a factor of 60
© Springer 2001
Slide 4.61
The complex-versus-simple gate trade-off is illustrated with the example of a complex Kogge–Stone adder (from [Narendra, ISLPED’01]). This is the same circuit we studied earlier in this chapter. The histogram of the leakage currents over a large range of random input signals is plotted. It can be observed that the average leakage current of the low-VTH version is only 18 times larger than that of the high-VTH version, which is substantially smaller than what would be predicted by the threshold ratios. For a single NMOS transistor, reducing the threshold by 150 mV would cause the leakage current to go up by a factor of 60 (for a slope factor n = 1.4).
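The factor of 60 follows directly from the subthreshold swing S = n·(kT/q)·ln(10): with n = 1.4 at room temperature, S is about 84 mV/decade, so a 150 mV threshold reduction buys roughly 1.8 decades of leakage. A quick check:

```python
import math

# Subthreshold swing and the leakage increase for a 150 mV threshold drop.
n = 1.4                        # slope factor, as quoted in the text
kT_q = 0.0259                  # thermal voltage at 300 K, in volts
S = n * kT_q * math.log(10)    # subthreshold swing, V/decade (~0.084 V)
factor = 10 ** (0.150 / S)     # leakage increase for dVTH = -150 mV
print(f"S = {S*1000:.1f} mV/dec, leakage increase = {factor:.0f}x")
```

The computed factor comes out near 60, matching the figure quoted above; the gap to the measured adder-level factor of 18 reflects the averaging over stacks and input states.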
Summary
Circuit optimization can lead to substantial energy reduction at limited performance loss
Energy–delay plots are the perfect mechanism for analyzing energy–delay trade-offs
Well-defined optimization problem over W, VDD, and VTH parameters
Increasingly better support by today’s CAD flows
Observe: leakage is not necessarily bad – if appropriately managed

Slide 4.62
In summary, the energy–delay trade-off challenge can be redefined into a perfectly manageable optimization problem. Transistor sizing, multiple supply and threshold voltages, and circuit topology are the main knobs available to a designer. Also worth remembering is that energy-efficient designs carefully balance the dynamic and static power components, subject to the predicted activity level of the modules. The burden is now on the EDA companies to translate these concepts into generally applicable tool flows.
References

Books:
A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers, 1st ed., 1995.
D. Chinnery and K. Keutzer, Closing the Gap Between ASIC and Custom, Springer, 2002.
D. Chinnery and K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007.
J. Rabaey, A. Chandrakasan and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed., Prentice Hall, 2003.
I. Sutherland, B. Sproull and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1st ed., 1999.

Articles:
R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic and V. Stojanovic, “Methods for True Power Minimization,” Int. Conf. on Computer-Aided Design (ICCAD), pp. 35–42, Nov. 2002.
S. Date, N. Shibata, S. Mutoh and J. Yamada, “1-V 30-MHz Memory-Macrocell-Circuit Technology with a 0.5 μm Multi-Threshold CMOS,” Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp. 90–91, Oct. 1994.
M. Hamada, Y. Ootaguro and T. Kuroda, “Utilizing Surplus Timing for Power Reduction,” IEEE Custom Integrated Circuits Conf. (CICC), pp. 89–92, Sept. 2001.
F. Ishihara, F. Sheikh and B. Nikolic, “Level Conversion for Dual-Supply Systems,” Int. Conf. Low Power Electronics and Design (ISLPED), pp. 164–167, Aug. 2003.
P.M. Kogge and H.S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Trans. Comput., C-22(8), pp. 786–793, Aug. 1973.
T. Kuroda, “Optimization and Control of VDD and VTH for Low-Power, High-Speed CMOS Design,” Proceedings ICCAD 2002, San Jose, Nov. 2002.
Slides 4.63 and 4.64
Some references . . .
References

Articles (cont.):
H.C. Lin and L.W. Linholm, “An Optimized Output Stage for MOS Integrated Circuits,” IEEE Journal of Solid-State Circuits, SC-10(2), pp. 106–109, Apr. 1975.
S. Ma and P. Franzon, “Energy Control and Accurate Delay Estimation in the Design of CMOS Buffers,” IEEE Journal of Solid-State Circuits, 29(9), pp. 1150–1153, Sep. 1994.
D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz and R.W. Brodersen, “Methods for True Energy-Performance Optimization,” IEEE Journal of Solid-State Circuits, 39(8), pp. 1282–1293, Aug. 2004.
MathWorks, http://www.mathworks.com
S. Narendra, S. Borkar, V. De, D. Antoniadis and A. Chandrakasan, “Scaling of Stack Effect and its Applications for Leakage Reduction,” Int. Conf. Low Power Electronics and Design (ISLPED), pp. 195–200, Aug. 2001.
T. Sakurai and R. Newton, “Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of Solid-State Circuits, 25(2), pp. 584–594, Apr. 1990.
Y. Shimazaki, R. Zlatanovici and B. Nikolic, “A Shared-Well Dual-Supply-Voltage 64-bit ALU,” Int. Conf. Solid-State Circuits (ISSCC), pp. 104–105, Feb. 2003.
V. Stojanovic, D. Markovic, B. Nikolic, M.A. Horowitz and R.W. Brodersen, “Energy–Delay Tradeoffs in Combinational Logic Using Gate Sizing and Supply Voltage Optimization,” European Solid-State Circuits Conf. (ESSCIRC), pp. 211–214, Sep. 2002.
M. Takahashi et al., “A 60 mW MPEG Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 36–37, Feb. 1998.