Active mode leakage reduction using fine-grained forward body biasing strategy

ARTICLE IN PRESS

0167-9260/$ - se

doi:10.1016/j.vl

�CorrespondE-mail addr

[email protected]

INTEGRATION, the VLSI journal 40 (2007) 561–570

www.elsevier.com/locate/vlsi

Active mode leakage reduction using fine-grained forwardbody biasing strategy

Vishal Khandelwal�, Ankur Srivastava

Department of Electrical and Computer Engineering, University of Maryland, College Park, USA

Received 7 March 2006; received in revised form 19 August 2006; accepted 12 December 2006

Abstract

Leakage power minimization has become an important issue with technology scaling. Variable threshold voltage schemes have become

popular for standby power reduction. In this work we look at another emerging aspect of this potent problem which is leakage power

reduction in active mode of operation. In gate level circuits, a large number of gates are not switching in active mode at any given point in

time but nevertheless are consuming leakage power. We propose a fine-grained forward body biasing (FBB) scheme for active mode

leakage power reduction in gate level circuits without any delay penalty. Our results show that our optimal polynomial time FBB

allocation algorithm results in 70.2% reduction in leakage currents. We also present an exact standard-cell placement driven FBB

allocation algorithm that effectively reduces the area penalty using the post-placement area slack and results in 56.5%, 62.8% and 66.1%

reduction in leakage currents for 0%, 4% and 8% area slack, respectively. Furthermore, we present a heuristic to solve the standard-cell

placement driven FBB allocation problem that is computationally efficient and results in leakage within 2% of that from the exact

formulation.

r 2007 Elsevier B.V. All rights reserved.

Keywords: Leakage power optimization; Forward body biasing; Algorithm; Design

1. Introduction

In recent years, technology scaling has increased the roleof leakage power in the overall power consumption ofcircuits. Supply voltage reduction is a widely acceptedmethodology for reducing dynamic power, but it has anadverse effect on circuit performance. To maintain highperformance, the threshold voltage V th must also be scaleddown which causes an exponential increase in the sub-threshold leakage currents. Threshold voltage controlthrough body biasing has been proposed in [1–3] as aneffective technique to reduce leakage currents in deep sub-micron technologies.

In this paper, we propose a fine-grained forward bodybiasing (FBB) scheme for leakage power reduction in theactive mode for gate level circuits. Previous literaturediscusses the use of high V th devices and the application of

e front matter r 2007 Elsevier B.V. All rights reserved.

si.2006.12.003

ing author.

esses: [email protected] (V. Khandelwal),

du (A. Srivastava).

FBB to achieve high performance in active mode [1–3]. Butno existing literature discusses the application of fine-grained FBB methodology for leakage power minimizationof gate level circuits in active mode. Furthermore, it isimportant to point out that this scheme is orthogonal toother standby mode leakage reduction schemes like inputvector control. Since all gates are fabricated at high V th, weget the maximum benefits that can be obtained from multi-threshold voltage schemes as far as standby mode leakagereduction is concerned. However, in this work we attemptto address active mode leakage reduction.We propose to fabricate all devices using the 2-D halo

doping profile to obtain super high V th. In active mode,high performance is obtained by applying FBB to thesehigh V th devices as opposed to using dual-V th technology.We propose a polynomial time optimal FBB allocationformulation. The key idea is to identify gates that are non-critical and increase their delay more than that of thecritical gates (using appropriate FBB values) such that theoverall leakage is minimized and the delay constraint issatisfied. In order to limit the number of distinct FBB

www.elsevier.com/locate/vlsi

dx.doi.org/10.1016/j.vlsi.2006.12.003

mailto:[email protected]

mailto:[email protected]

ARTICLE IN PRESSV. Khandelwal, A. Srivastava / INTEGRATION, the VLSI journal 40 (2007) 561–570562

values because of physical limitations, we also propose aclustering-based FBB allocation algorithm. Furthermore,in order to minimize the area penalty associated with thescheme, we present a novel standard-cell placement drivenfine-grained FBB allocation algorithm that utilizes theavailable area slack in each placement row for clustering.We propose an exact ILP solution for this formulation aswell as an efficient and accurate heuristic to solve theproblem.

In this work we address the following problems:

(1)
Given a gate level circuit fabricated with high V th
transistors, allocate optimum FBB values to each of thegates such that the active mode leakage power isminimized under a delay constraint.

(2)
Given a gate level circuit along with its standard-cellplacement information, utilize the available spatialslack in each row to cluster the gates such that thetotal number of distinct FBB values are limited whileminimizing the leakage power under a delay constraint.
Our experimental results show that the optimal fine-grained FBB allocation scheme can reduce the active modeleakage by 70.2%. Furthermore, the placement driven FBBallocation problem solved using the exact ILP formulationon average gave 56.5%, 62.8% and 66.1% reduction inleakage current for 0%, 4% and 8% area slack, respec-tively. The heuristic formulation results in average leakagesavings of 55.8%, 61.6% and 65.2%, respectively. Theleakage values obtained from the heuristic formulation arewithin 2% of those obtained from the exact ILP formula-tion while being significantly efficient in runtime.

The rest of the paper is organized as follows: Section 2discusses the preliminaries associated with this work,Section 3 describes the fine-grained gate level FBBallocation scheme, Section 4 describes the clustering-basedFBB allocation algorithm and Section 5 presents our novelstandard-cell placement driven FBB allocation problemsolved using the exact ILP formulation as well as aheuristic formulation. Section 6 discusses our experimentalresults and Section 7 contains the conclusions drawn fromthis work.

2. Preliminaries

2.1. Device equations

Let us understand the relevant gate level equations usedin this approach. The delay di of a gate i can be expressedas

di ¼KiCLi

V dd

ðV dd � V thÞa , (1)

where Ki is the proportionality constant, CLiis the load

capacitance at the gate output, V th is the threshold voltage,Vdd ¼ 1:8V and a is the velocity saturation index (� 1:3 in0.18-mm CMOS technology).

The sub-threshold leakage current I leak of a gate isexpressed as [4]

I leak ¼ mnCoxW eff

Leffe1:8V2

TeðVgs�V thÞ=nVTð1� e�Vds=VTÞ, (2)

where mn is the N-mobility, Cox is the oxide capacitance,V th is the threshold voltage, VT is the thermal voltage ¼26mV and n is the sub-threshold swing parameter.The dependence of threshold voltage on the FBB voltage

for a transistor can be expressed as [5]

V th ¼ V th0 þ g½ð2ff þ VSBÞ1=2� 2f1=2

f �, (3)

where V th0 is the threshold voltage at zero reverse biasvoltage. VSB is the reverse bias voltage between the sourceand the substrate (body) of the transistor, g is a processparameter and ff is a physical parameter.Eq. (1) establishes a relation between delay of a gate di

and V th. By replacing V th in Eq. (2) in terms of di (usingEq. (1)), we get a dependence between the gate delay andgate leakage. Thus, a range of gate delays wouldcorrespond to a range of gate leakage values. The finalrelation between leakage and delay can be expressed as


Leff

� �e1:8V2

Tð1� e�Vds=VTÞeVgs=nVT

�eKCLV1=add=nVTd

1=ai�Vdd=nVT . ð4Þ

Eq. (3) establishes the relation between the FBB and thethreshold voltage V th for a gate. We can utilize this relationand fabricate all devices at a high threshold voltage. Duringactive mode of operation, FBB is applied to reduce theirthreshold voltage to meet the performance constraints.Hence, the devices are inherently in low leakage state andduring active operation, their leakage is increased to theminimal possible level while meeting the performanceconstraints. This can be achieved by controlling theforward body bias of the corresponding delay-critical gates.

2.2. Body biasing schemes and device considerations

There are two popular body biasing schemes, namelyFBB and reverse body biasing (RBB). RBB schemes havebeen used to reduce sub-threshold leakage through body-effect while meeting the performance constraints in activemode by switching to zero body bias (ZBB) [6–8]. In thiswork we have considered the FBB scheme since it has beenseen that the BTBT component of leakage increases due todecrease in the body coefficient in scaled technologies withshrinking dimensions when using the RBB scheme. In [2],the authors show that FBB can be used to improveperformance and robustness of leakage-sensitive circuits.VT roll-off, drain induced barrier lowering (DIBL) as wellas short channel effects which are more prominent in lowV th devices, are countered by using high V th devices withFBB [2]. Previous work [2,3] has shown that a high V th

device can achieve high drive current in active mode usinga FBB.

ARTICLE IN PRESSV. Khandelwal, A. Srivastava / INTEGRATION, the VLSI journal 40 (2007) 561–570 563

In [1], various device optimization considerations havebeen discussed. Device engineering and FBB can be used tosimultaneously achieve low leakage power and high drivecurrents. We propose to use 2-D super-halo doping profiledevices [9] for fabricating high V th devices in our circuits.There are three main components of leakage current,namely sub-threshold leakage, gate leakage and BTBTleakage. As a result of increasing the peak halo doping,there is an exponential increase in the BTBT leakagecurrents. Ref. [1] shows that gate work function engineer-ing can be used to obtain super high V th devices withoutaffecting the BTBT leakage. We will consider only sub-threshold leakage as our optimization objective as [1]shows that gate leakage forms only a small part of the totalleakage in this case. We have also assumed that for everygate in the design, both N-MOS and P-MOS transistorsoperate at the same threshold voltage after applying theassigned FBB values.

2.3. Motivation

It is known that in active mode, most of the gates in anycircuit are not switching for a significant period of time.These gates are continuously leaking power even in activemode. Hence, in order to minimize the power consumptionof circuits in active mode we also need to address theproblem of leakage power minimization. In this work wetry to address this problem by using super high V th devicesduring circuit fabrication and then applying appropriateFBB values to meet the active mode performanceconstraints.

Let us consider a motivational example as shown inFig. 1. The nodes represent the gates in the circuit. Thedelay-critical nodes are represented by solid circles. We cansee from Fig. 1 that there are a large number of non-criticalnodes which can bear more delay penalty while stillsatisfying the required time constraint. Therefore, we canuse this extra slack at non-critical nodes during FBBallocation and increase the savings in leakage power. Ourstrategy proposes that all devices are fabricated using superhigh V th doping profiles. These devices have low leakagebut higher delays. Depending on the delay criticality ofeach gate, we can allocate appropriate FBB values to lowerits threshold voltage and therefore its delay. Hence, we canallocate appropriate FBB values to each gate such that the

Non Critical Gates

Non CriticalGates

Fig. 1. Motivational example.

delay constraints are met while keeping the penalty inleakage power as low as possible.

3. Fine-grained gate level FBB scheme

In this section we present our fine-grained gate level FBBscheme.

3.1. Modeling

Let us first consider the gate level delay and leakagemodels used in this work.

3.1.1. Gate delay model

The propagation delay and output transition formuladerived from the short channel MOSFET model [10] for aCMOS inverter is accurate in predicting the circuitbehavior of deep sub-micron designs. It has been foundthat N series-connected MOSFET (SCMS) would showless than N times the delay of a single MOSFET for deepsub-micron designs. We can represent this as

delayðSCMSÞ

delayðInvertorÞ¼ 1þ zðN � 1Þ, (5)

where z is a technology dependent parameter (0ozo1 fordeep sub-micron technologies). We use this model as ourgate delay model. We take the 0.18 micron inverter as thebase case and use Eq. (5) to calculate the delays of all othergates in the technology library.

3.1.2. Sub-threshold leakage model

The standby current of a CMOS network can beexpressed as a function of the current of a single CMOStransistor [11]. The ratio of the standby currents for singlestacked transistor I s1, double stacked transistor I s2 andtriple stacked transistor I s3 can be expressed as

I s1: I s2: I s3 ¼ 1:8 expðZV dd=nVTÞ: 1:8: 1, (6)

where 0:04pZp1, 1:4pnp1:5, V dd ¼ 1:8V andVT ¼ 26mV. Putting these values in Eq. (6), we get

I s1: I s2: I s3 ¼ 1: 0:234: 0:13. (7)

We take the 0.18 micron inverter as the base case and usethe ratio from Eq. (7) to calculate the leakage current of allother gates in the technology library. This defines our gatesub-threshold leakage estimation model.

3.2. Optimal fine-grained FBB allocation

We address the problem of fine-grained FBB allocationby presenting a novel polynomial time optimal algorithmthat tries to maximize the utilization of the existing slack inthe circuit for savings in leakage power in active mode. TheFBB allocation problem can be defined as follows:

Given a gate level circuit, the arrival time at eachprimary input and a required time constraint at each ofthe primary outputs, the problem is to optimally allocate


FBB values to each gate for minimal leakage in activemode while satisfying the overall delay constraints onthe circuit.

We have a linear-programming formulation that per-forms the optimal FBB allocation in polynomial time. Weexploit the dependence of gate delay and gate leakage inEqs. (1) and (2) on the threshold voltage of the gate.Essentially, we budget the delay of each gate such that thetotal leakage is minimized and the delay constraint issatisfied. The dependence of FBB and the effective thresh-old voltage at each gate has been discussed in Section 2.1 asgiven by Eq. (3). A range of possible FBB values for eachgate in the circuit would impose a range on the thresholdvoltage as well as the delay of each gate. Hence allocating aparticular delay value to a gate implies a threshold voltageassignment which in turn implies a FBB allocation to thatgate. We have assumed that it is possible for us to assignFBB values to each gate independently. We optimallyassign delay budgets to each gate in the circuit such thatour objective function, which is the sum of the leakagepower over all gates is minimized.

We represent our gate level circuit as a DAG, GðV ;EÞ asshown in Fig. 2. Each node in the DAG represents a gate inthe circuit. We add a dummy IN node before each of theprimary inputs which are shown as the black nodes markedIN in Fig. 2. We also add a similar dummy node OUT aftereach of the primary outputs which are shown as the blacknodes marked OUT in Fig. 2. For each node u, we associatea variable du which represents the delay of that node. Wealso associate another variable Du with each node whichrepresents the arrival time at the output of node u. Now weconsider two nodes u and v as shown in Fig. 2. Theircorresponding variables have also been shown in the figure.The timing constraints on GðV ;EÞ can be modeled as

dv �Dv þDup0 8eðu; vÞ 2 E, ð8Þ

diminpdipdi

max 8vertex i 2 V ; i 2 S, ð9Þ

DiIN ¼ Ti

arrival 8vertex i 2 IN, ð10Þ

DiOUTpTi

con 8vertex i 2 OUT , ð11Þ

diIN ¼ 0 8vertex i 2 IN, ð12Þ

diOUT ¼ 0 8vertex i 2 OUT . ð13Þ

For all the IN nodes (representing the primary inputs),the DIN values have been set to the corresponding arrival

IN

IN

OUT

INU

Dv

dv

Du

duOUT

V

Fig. 2. DAG representation.

time values Tarrival for the signals. The delay of the IN

nodes denoted by dIN have been set to zero. Similarly forthe OUT nodes (representing the primary outputs), theirdelay dOUT values have been set to zero and thecorresponding DOUT values have been set to be less thanor equal to the required time constraint Tcon at thecorresponding primary output node. The above LPformulation assigns delay budgets to all the gates of thecircuit such that the utilization of the available slack ismaximum. The range of possible FBB values that can beassigned is imposed as the corresponding range of delaybudgets for each gate, which is denoted by ½dmin; dmax�. Theobjective of optimal FBB allocation is to minimize the totalleakage power of the circuit in active mode which can berepresented as

minðSi2V PiÞ ¼ minðSi2V V ddI ileakÞ. (14)

As illustrated before the dependence between gate leakageand gate delay is given as follows:


Leff

� �e1:8V2

Tð1� e�Vds=VTÞeVgs=nVT

�eKCLV1=add=nVTd

1=ai�Vdd=nVT . ð15Þ

Theorem. Optimal FBB allocation algorithm is polynomial

time solvable.

Proof. A LP formulation is polynomial time solvable if theobjective function is a separable convex function under aset of linear constraints [12]. Let us consider the minimiza-tion objective function in Eq. (14). We can see that it is aseparable function since each term Pi depends only on thevariable di as seen from Eq. (15). Hence, grouping togetherall the other constant symbols into constants K1 and K2,we can represent Pi from Eqs. (14) and (15) as

Pi ¼ K1eK2=d

1=ai , (16)

where K1 and K2 are positive constants. We now try toprove the convexity of the objective function Pi. Since1=a ¼ 0:77, d

1=ai is a concave function. Also, the delay di is

positive by definition, hence d1=ai is a positive concave

function. Therefore, 1=d1=ai is a convex function (inverse of

a positive concave function). Since K2 is a positiveconstant, K2=d

1=ai is also a positive convex function. This

implies that eK2=d1=ai is also a convex function. Since K1 is a

positive constant, Pi ¼ K1eK2=d

1=ai is a convex function.

Thereby we have shown that the objective function isconvex separable. Hence, according to the result from [12],optimal FBB allocation algorithm is polynomial timesolvable. &

4. Optimal cluster-based FBB allocation

We now consider another variation of the fine-grainedFBB allocation problem. In Section 3.2, we assumed thatwe could assign each gate an independent FBB value. Froma fabrication point of view, routing these large number of

ARTICLE IN PRESS

Standard Cell Placement

Spatial Slack

after Placementin each row

Fig. 3. Standard-cell placement.

V. Khandelwal, A. Srivastava / INTEGRATION, the VLSI journal 40 (2007) 561–570 565

voltage supply lines to the substrate of the transistorsmight not be feasible. Let us suppose that there is a limit tothe maximum number of distinct FBB values that can beallocated to the gate in the circuit. The problem can bedefined as follows:

Given a gate level circuit, a clustering of gates such thateach gate within the same cluster has the same FBBvalue, the arrival time at each primary input and arequired time constraint at each of the primary outputs,the problem is to optimally allocate FBB values to eachcluster of gates for minimal leakage in active mode whilesatisfying the overall delay constraints on the circuit.

Since the gates within the same cluster are assigned thesame FBB value, their threshold voltages are the same.From delay equation (1), we can see that if two gateshave the same threshold voltage, then their delays arein a constant ratio. This is an additional constraint thatneeds to be added to the LP formulations proposed inSection 3.2. Let there be nk gates in cluster k. The followingequations show that the delays of these gates will be in afixed ratio since there threshold voltages are the same.

d1k¼

K1kCL1k

Vdd

ðV dd � V thÞa , ð17Þ

d2k¼

K2kCL2k

Vdd


..

.¼ ...

ð19Þ

dnk¼

KnkCLnk

Vdd


d1k

K1kCL1k

¼d2k

K2kCL2k

¼ � � � ¼dnk

KnkCLnk

. (21)

Thus we can add the following additional constraints foreach cluster k in the circuit to the LP formulation:

d1k

K1kCL1k

¼d2k

K2kCL2k

, ð22Þ

d2k

K2kCL2k

¼d3k

K3kCL3k

, ð23Þ

..

.¼ ...

ð24Þ

dn�1k

Kn�1kCLn�1k

¼dnk

KnkCLnk

. ð25Þ

We note that these are linear constraints as well. Hence,cluster-based FBB allocation is optimally solvable inpolynomial time.

5. Placement driven FBB allocation

In this section, we present a novel placement driven FBBallocation algorithm. The clustering scheme discussed inthe previous section does not take into account the areaoverhead in fabrication associated with clustering gates.Let us suppose that we assume that the N-MOS and

P-MOS transistors used in our designs are fabricated usingp-wells and n-wells on a silicon substrate, respectively. Iftwo gates are in the same cluster, the N-MOS transistors(or the P-MOS) have the same FBB value and can befabricated in the same p-well (or n-well). If two gates arenot in the same cluster, the N-MOS transistors (or theP-MOS) could have different FBB value and cannot befabricated in the same p-well (or n-well). We would ideallylike to cluster together gates that have been placedadjacently, such that the area overhead with creating thesenew n=p wells is minimum. If we consider clustering gatesbased on their placement information, we will clusterneighboring gates whose transistors can be fabricated inthe same n=p wells. If gates that have been placed far apartare clustered together to get the same FBB value, they mayneed to be fabricated in different n=p wells because theirneighbors might have different FBB values. Hence there isa large area cost associated with such a clustering scheme.Therefore, we should consider the post-placement informa-tion of the gates and cluster them taking into account theirphysical proximity. The additional area overhead that isassociated in making clusters of gates in each placementrow is accounted for using the available area slack in thecorresponding row. Hence, we minimize the penalty in theincrease in the overall area of the chip.Let us consider a standard cell-based placement scheme

that places the circuit into rows as shown in Fig. 3. Allstandard cells have the same height but vary in width. Wenote here that there is some unused area slack in each rowwhich is shown by the shaded areas in Fig. 3. Hence thisavailable extra area can be used to cluster the standard cellspost-placement such that they can be allocated one value ofFBB. We know that since these cells in each placement roware fabricated as neighbors, if two neighbors need to begiven a different FBB value, then their N-MOS (andP-MOS) need to be fabricated in different n-well (p-well).This essentially amounts to an area overhead every time theFBB values are changed within the same placement row.Hence we can formally define this problem as

Given a standard cell row-based placement of the gatelevel circuit, the arrival time at each primary input and arequired time constraint at each of the primary outputs,


each row has to be split into clusters of adjacentstandard cells such that the area overhead is within thespatial slack available for that row and each cluster isassigned a FBB value such that the leakage of the circuitis minimized and the overall delay constraints aresatisfied.

We will first present an exact formulation to solve theplacement driven FBB allocation problem. This is an exactformulation that is an integer linear program (ILP). In thenext subsection we will relax the ILP formulation andpresent an efficient heuristic to solve the problem. Ourexperimental results show that the results from theheuristic formulation does very close to the exact ILPsolution.

5.1. Exact formulation

The problem can be modeled by adding a set of extraedges E0 to the original DAG GðV ;EÞ shown in Fig. 4(a).Each row can be represented by a path of new edgesdenoting the adjacency in their placement of the standardcells as shown in Fig. 4(b). An edge eði; jÞ is added to E0 forevery pair of adjacently placed cells i and j. We note herethat if we have a placement with N rows of standardcells, we will now have N additional disjoint paths in theDAG. For any two adjacent nodes on any path p, as shownby the shaded nodes i and j in Fig. 4(b), we associate onebinary variable X

pij with edge eði; jÞ. The following

equations are added to our earlier LP formulation fromSection 3.2 to impose the additional placement-basedconstraints for clustering and FBB allocation under adelay constraint:

ðXpijÞM þ

di

KiCLi

�dj

KjCLj

X0

8eði; jÞ on path p; 8path p, ð26Þ

ij

jkX

X

j

kiIN

IN OUT

OUT

IN

dv

Dv

VU

Du

OUT

IN

IN

IN OUTdu

a

b

Fig. 4. Modified DAG representation.

� ðXpijÞM þ

di

KiCLi

�dj

KjCLj

p0


1þX

8eði;jÞ on path p

XpijpMaxp 8path p, (28)

Xpij 2 ½0; 1� 8eði; jÞ on path p; 8path p, (29)

where M is a very large positive number. We know fromSection 4 that if two gates are in the same cluster, they havethe same FBB value and their delays are in a fixed ratio.We have associated a variable X

pij with every pair of

adjacently placed nodes i; j in row p of the placement. IfX

pij ¼ 0, then nodes i and j are in the same cluster and have

the same FBB value. From a physical point of view, thiswould imply that they are fabricated in the same well. IfX

pij ¼ 1, then nodes i and j are in different clusters and do

not have the same FBB value. These two gates wouldtherefore be fabricated in different wells even though theyhave been placed adjacent to each other. This results in anextra area overhead. The spatial slack in each row obtainedafter placement is used to determine the maximum numberof allowed clusters in each placement row as denoted byMaxp for each row p in Eq. (28).If X

pij is zero (nodes i and j are in the same cluster), Eqs.

(26) and (27) impose the condition on the ratio of thedelays of two nodes. On the other hand if X

pij is one (nodes i

and j are in different clusters), then since M is a largepositive number, we do not impose any condition on theratio of the delays of the two nodes. Therefore theseadditional constraints along with the formulations pro-posed in Section 3.2, we have a placement driven FBBclustering and allocation algorithm under a delay con-straint for leakage minimization. The additional con-straints added to the LP formulation are linear and hencethe optimality is retained. The new formulation is now anILP.Furthermore, if we also have an upper limit LIMIT on

the total number of FBB values that can be allocated to thecircuit due to routing and fabrication constraints, we canadditionally impose this limit as

N þX

8eði;jÞ in E0

X ijpLIMIT , (30)

where N is the total number of standard cell rows in theplacement information of the circuit.This completes the description of the fine-grained

standard-cell placement driven simultaneous clusteringand FBB allocation scheme. We have presented an integerlinear programming formulation in this subsection. This isan exact formulation but due to their inherent computa-tional complexity, ILP formulations are not suitable forscaling to larger designs. In the next subsection, we willpresent a heuristic which is a relaxation of this formulationinto a LP.

ARTICLE IN PRESSV. Khandelwal, A. Srivastava / INTEGRATION, the VLSI journal 40 (2007) 561–570 567

5.2. Heuristic formulation

In this subsection we will present an efficient andaccurate heuristic to solve the placement driven FBBallocation problem. The exact ILP formulation presentedabove is not suitable for scaling and hence it is importantfor us to develop a heuristic for this problem. Using theproblem modeling that has been developed so far in thispaper, we will now present a three step heuristic. Theproblem of creating clusters in placement rows in order toget the best delay versus leakage trade-off under theavailable area is discrete in nature. This is the source ofcomplexity in the problem. In the first step of the heuristic,we try to relax this constraint (by making the X ij variablesfrom ILP formulation to be non-integers) and generate asolution where the clusters are not clearly defined (X ij isnot strictly 0 or 1). This step helps us identify the points inthe placement row where it is most beneficial to create aclustering boundary. The second step of the algorithmtakes this preliminary solution and performs an intelligentclustering decision on each row. We perform lp roundingon each clustering variable X ij and generate a validclustering solution for each placement row. After this step,the original problem reduces to the clustering driven FBBallocation problem that has been presented in Section 4.We use this formulation as the third step to complete theheuristic solution. Let us now understand each step of theheuristic in more detail.

Algorithm 1. HEURISTIC

Step 1: Solve the LP formulation after relaxing the integer

constraint on Xpij .

Step 2: Round all Xpij to 0 or 1 and decide the clustering of

gates.Step 3: Solve the clustering-based LP formulation for FBBallocation.

The first step of the heuristic solves the problemformulation developed in the earlier sections of the paper.Unlike the exact ILP formulation, we relax the integerconstraints on the X ij variables. From a physical perspec-tive, a non-0/1 value of this variable does not give a correctinterpretation, but the first step of the heuristic tries toget an indicator about all the points along the placementrow where clustering benefits leakage savings. At the endof the day, our proposed algorithm exploits the leakageversus delay trade-off constrained by an in-place areaconstraint. Given a gate level circuit represented by a graphGðV ;EÞ, the equations for the complete formulation can bewritten as

Objective:

minðSi2V VddI ileakÞ 8i 2 V . (31)

Delay constraints:


diminpdipdi


DiIN ¼ Ti


DiOUTpTi




Placement constraints:

ðXpijÞM þ

di

KiCLi

�dj

KjCLj

X0


� ðXpijÞM þ

di

KiCLi

�dj

KjCLj

p0


1þX

8eði;jÞ on path p

XpijpMaxp 8path p, (40)

0pXpijp1 8eði; jÞ on path p. (41)

The important thing to note here is that we have relaxedthe integer constraints on X ij variables. We solve thisformulation to get a preliminary solution to the problem.The second step takes as input the solution obtained

from the first step of the heuristic. The main objective ofthe second step is to perform intelligent rounding on the X ij

variables for each placement row such that the areaconstraints are satisfied. This is a heuristic step and maynot give us the optimal clustering solution that can beobtained from the exact ILP proposed in the previoussubsection. We have adopted a greedy rounding strategy inour heuristic. For each placement row, we first sort all theX ij variables in decreasing order of magnitude. We thentake the top jMaxpj elements from this sorted list andround them up to 1 while rounding the rest of X ij variablesto 0. This gives us a clustering solution for each placementrow that is valid (in terms of area constraints).The third step of the heuristic solve the FBB allocation

problem given the clusters that have been decided in thesecond step of the heuristic. We note here that now we donot need to consider the placement constraints explicitly.We can use the clustering model developed in Section 4 tosolve the problem. The final equations to solve the FBBallocation can be written as

Objective:

minðSi2V VddI ileakÞ 8i 2 V . (42)

Delay constraints:


diminpdipdi


DiIN ¼ Ti


DiOUTpTi





Clustering constraints: For each cluster k with n gates,the delays are constrained through the following linearequations:

d1k

K1kCL1k

¼d2k

K2kCL2k

, ð49Þ

d2k

K2kCL2k

¼d3k

K3kCL3k

, ð50Þ

..

.¼ ...

ð51Þ

dn�1k

Kn�1kCLn�1k

¼dnk

KnkCLnk

. ð52Þ

This completes the heuristic to solve the FBB allocationproblem. Using a three step scheme, we have avoided theexact ILP formulation that is not scalable to large designs.The proposed heuristic is efficient and our experimentsshow that this heuristic gives us results that are within 2%of those obtained from the exact ILP formulation.

6. Experimental results

The objective of these experiments was to prove threemain points:

(1)

Tab

Resu

Benc

C43

C49

C88

C19

x1

x3

x4

i5

i6

i8

Avg

There is inherent slack available in circuits which canbe used to slow down non-critical gates and getconsiderable savings in the leakage power withoutany delay penalty.

(2)
Placement driven clustering for FBB allocation usingthe exact ILP formulation is effective in reducing theleakage power without any significant area penalty.
(3)
The heuristic scheme for placement driven clusteringfor FBB allocation gives results that are very compar-able to those obtained from the exact ILP formulation.
We have implemented our placement driven fine-grainedFBB allocation schemes (exact ILP and heuristic) in SIS[13]. We have built an integrated software interface that

le 1

lts from exact ILP formulation

hmark Initial Optimal Savings 0% Area

(10�12 A) (10�12 A) (%) (10�12 A)

2 13 507 4312 68.1 8071

9 27 442 9979 63.6 18 073

0 19 472 5936 69.5 10 649

08 31 432 9721 69.1 17 784

16 871 4703 72.1 5709

45 734 12 632 72.4 14 515

27 416 7940 71.0 12 352

19 566 5481 71.9 5481

50 571 13 938 72.4 14 725

66 516 18 580 72.0 19 872

. 70.2

generates a placement for a benchmark from the SISlibrary using the CAPO placement tool [14]. We then usethis placement information to generate the constraints asexplained in the formulations in Section 5. The delayconstraint on the circuit is the best case delay of the circuitwith all gates at low threshold voltages (minimum delay),thereby imposing no additional delay penalty on thetiming. This illustrates our proposition of using theinherent slack available in the benchmark to assign FBBvalues under a delay constraint. Every gate is allocated athreshold voltage in the range 0.3–0.5V for the N-MOSand �0:3 to �0:5V for the P-MOS transistors. Forsimplification, we have used piece-wise linearization ofour convex objective function and used CPLEX toimplement our LP formulation. All experiments were doneon a Sunblade 150 workstation.Table 1 shows the results from the experiments over a

large range of benchmarks. Column 1 shows the variousbenchmarks from SIS that have been used for theexperiments. Column 2 gives the initial leakage currentvalues for the benchmarks using low threshold gates. Thisis the best case possible delay for the circuit. However, wewill show that we can still maintain this delay constraint byfabricating high V th gates (minimum leakage and max-imum delay) and allocating appropriate FBB values tospeed up timing critical gates to meet the delay constraints.Thus, utilizing the available delay slack in the circuit, wecan reduce the leakage power of the gates. Column 3 inTable 1 shows the result from our optimal FBB allocationformulation discussed in Section 3.2. Here we haveassumed that we can independently control the FBB valueof each gate and do not take into account the areaoverhead associated with this scheme. Column 4 showsthat there is an average 70.2% reduction in the totalleakage current across the benchmarks using the optimalFBB allocation scheme. This proves our first claim thatthere is available delay slack in circuits which can be usedfor significant savings in leakage without any extra delaypenalty.

Savings 4% Area Savings 8% Area Savings

(%) (10�12 A) (%) (10�12 A) (%)

40.2 7090 47.5 6059 55.1

34.1 15 868 42.1 13 881 49.4

45.3 8507 56.3 7443 61.7

43.4 13 934 55.6 11 485 63.4

66.1 4746 71.8 4706 72.1

68.2 12 771 72.0 12 632 72.4

54.9 9279 66.1 8170 70.1

71.9 5481 71.9 5481 71.9

70.8 13 938 72.4 13 938 72.4

70.1 18 622 72.0 18 580 72.0

56.5 62.8 66.1

ARTICLE IN PRESS

Table 2

Results from heuristic

Benchmark Initial Optimal Savings 0% Area Savings 4% Area Savings 8% Area Savings

(10�12 A) (10�12 A) (%) (10�12 A) (%) (10�12 A) (%) (10�12 A) (%)

C432 13 507 4312 68.1 8349 38.1 7723 42.8 6487 51.9

C499 27 442 9979 63.6 18 935 30.9 17 056 37.8 14 006 48.9

C880 19 472 5936 69.5 10 649 45.3 8687 55.3 8156 58.1

C1908 31 432 9721 69.1 17 988 42.7 14 512 53.8 11 702 62.7

x1 16 871 4703 72.1 5709 66.1 4746 71.8 4706 72.1

x3 45 734 12 632 72.4 14 587 68.2 12 874 71.8 12 632 72.3

x4 27 416 7940 71.0 12 613 53.9 9279 66.1 8177 70.1

i5 19 566 5481 71.9 5481 71.9 5481 71.9 5481 71.9

i6 50 571 13 938 72.4 14 725 70.8 13 938 72.4 13 938 72.4

i8 66 516 18 580 72.0 19 872 70.1 18 622 72.0 18 580 72.0

Avg. 70.2 55.8 61.6 65.2

Table 3

Comparison between heuristic and optimal FBB allocation schemes

Benchmark 0% Area 4% Area 8% Area

(% Inc. in

leakage)

(% Inc. in

leakage)

(% Inc. in

leakage)

C432 3.4 8.9 7.0

C499 4.7 7.4 0.9

C880 0 2.1 9.5

C1908 1.1 4.1 1.8

x1 0 0 0

x3 0.5 0.8 0

x4 2.1 0 0.1

i5 0 0 0

i6 0 0 0

i8 0 0 0

Avg. 1.1 2.3 1.9

V. Khandelwal, A. Srivastava / INTEGRATION, the VLSI journal 40 (2007) 561–570 569

In the first set of experiments, we present the resultsobtained from our exact ILP-based formulation. We nowconsider our placement driven clustering strategy thatconsiders the area penalty as a constraint. We first considerthe worst-case scenario when there is no available areaslack in each of the placement rows. This essentiallymeans that all gates in a particular placement row havebeen clustered together. Columns 5 and 6 show that there isan average 56.5% reduction in leakage for this case. Wenow consider the scenario with available area slack tocompensate for the area penalty from clustering. We haveassumed in our experiments, that each new cluster (new n

and p well) formed requires an area overhead equal tothat of an inverter. Columns 7–10 show that for 4%and 8% available area slack, there is an average 62.8%and 66.1% reduction in leakage current, respectively.Hence, we have shown the effectiveness of our placementdriven fine-grained FBB scheme through these experi-ments. The available area slack after placement can beeffectively used to cluster together gates in each of theplacement rows and then we can optimally allocate FBBvalues to these clusters using our formulation as presentedin this work.

The ILP-based formulation gives us an optimal solutionto the placement driven FBB allocation problem. But ILPformulations are computationally complex and not suitablefor larger designs. In order to ensure a polynomial timeformulation that can be scaled to larger designs, we nowperform the next set of experiments using the heuristicdescribed in Section 5.2. The constraints for delay areexactly the same as that in first experiment. From Table 2we can see that for 0%, 4% and 8% area penalty, we get anaverage leakage savings of 55.8%, 61.6% and 65.2%,respectively. These results are very close to that obtainedfrom the exact ILP formulation, thereby showing theaccuracy of the proposed heuristic formulation.

On a more direct comparison between the optimal ILP-based scheme versus the heuristic, it is clear that theheuristic is computationally efficient. From our experi-ments, we observed that the runtime of the heuristic is

significantly lower than that of the exact ILP formulation.Table 3 shows a more direct comparison of the leakagenumbers obtained by the heuristic. We compare theheuristic leakage numbers compared directly with theoptimal solution. Across the three sets of area constraints,on an average the heuristic results are within 2% of theoptimal ILP solution.The runtime of the exact ILP formulation was high as

can be expected. On an average, each benchmark took inthe order of a few hours while the proposed heuristic tookin the order of few seconds for each benchmark across allexperiments. Therefore, the heuristic is able to overcomethe computational complexity of the exact ILP formulationwhile retaining the quality of the result.These results prove the objectives of our experiments.

We have proposed optimal FBB allocation schemes aswell as efficient and accurate heuristic to solve the problem.The leakage savings obtained through this techniqueare very large and the proposed algorithms give thedesigner the opportunity to trade-off delay and area versusleakage savings. This is a very useful tool for low-powerapplications.


7. Conclusion and future work

In this work we have proposed a novel fine-grained FBBallocation algorithm for reducing leakage power of circuitsin active mode of operation. We have presented an optimalpolynomial time FBB allocation algorithm along with astandard-cell placement driven FBB allocation algorithmwhich tries to utilize the available area slack in each of theplacement rows for clustering. Our results have shown thatthe proposed schemes are effective in reducing the activemode leakage power in gate level circuits with no delaypenalty. We have further proposed a heuristic formulationfor placement driven FBB allocation that is computation-ally efficient as compared with the exact ILP formulationand also shows results that are within 2% of the optimal.We can extend this scheme to consider clustering acrossplacement rows and propose other heuristics to solve thisproblem. Furthermore, to apply this scheme to largedesigns, we need to come up with intelligent designpartitioning technique such that each partition can besolved separately and still get good overall results. Thiswork shows that FBB has the potential to providesignificant leakage savings with minimal area and delaypenalties.

References

[1] C.H. Kim, et al., A forward body-biased low-leakage SRAM cache:

device and architecture considerations, in: Proceedings of ISLPED,

August 2003.

[2] Ali Keshavarzi, et al., Forward body bias for microprocessors in

130 nm technology generation and beyond, in: VLSI Circuits

Symposium, 2002, pp. 312–315.

[3] S. Narendra, et al., 1.1V 1GHz communications router with on-chip

body bias in 150 nm CMOS, in: ISSCC, 2002.

[4] S. Mukhopadhyay, K. Roy, Modeling and estimation of total leakage

current in nano-scaled CMOS devices considering the effect of

parameter variation, in: Proceedings of ISLPED 2003, August 2003.

[5] A. Sedra, K. Smith, Microelectronic Circuits, Oxford University

Press, Oxford, 1997.

[6] A.J. Bhavnagarwala, et al., Dynamic-threshold CMOS SRAM

cells for fast, portable applications, in: ASIC/SOC Conference,

2000, pp. 359–363.

[7] H. Kawaguchi, et al., Dynamic leakage cut-off scheme for low-

voltage SRAM’s, in: VLSI Circuits Symposium, 1998, pp. 140–141.

[8] C.H. Kim, et al., Dynamic Vt SRAM: a leakage tolerant cache

memory for low voltage microprocessors, in: ISLPED, 2002,

pp. 251–254.

[9] Y. Taur, et al., 25 nm CMOS design considerations, in: IEDM, 1998,

pp. 789–792.

[10] T. Sakurai, A.R. Newton, A simple MOSFET model for circuit

analysis, IEEE Trans. Electron Devices 38 (4) (1991) 887–894.

[11] R. Gu, M. Elmasry, Power dissipation analysis and optimization of

deep submicron CMOS digital circuits, in: IEEE JSSC, May 1996.

[12] D. Hochbaum, J. Shanthikumar, Convex separable optimization is

not much harder than linear optimization, J. ACM 37 (4) (1974).

[13] E.M. Sentovich, K.J. Singh, L. Lavagno, C. Moon, R. Murgai, A.

Saldanha, H. Savoj, P.R. Stephan, R.K. Brayton, A.L. Sangiovanni-

Vincentelli, SIS: a system for sequential circuit synthesis, Memor-

andum No. UCB/ERL M92/41, Department of EECS, UC Berkeley,

May 1992.

[14] A. Caldwell, et al., Can recursive bisection alone produce routable

placements?, in: Proceedings of DAC, 2000.

Vishal Khandelwal (S‘03) received his B.Tech.

degree in Electrical Engineering from the Indian

Institute of Technology, Kanpur and his M.S. in

Electrical and Computer Engineering from the

University of Maryland, College Park in 2002

and 2005 respectively. He is currently a Ph.D.

candidate at the Electrical and Computer En-

gineering Department, University of Maryland,

College Park, under the guidance of Prof. Ankur

Srivastava.

His research interests include various aspects of deep submicron VLSI-

CAD, fabrication variability, statistical and probabilistic design automa-

tion, low-power design methodologies, logic and physical synthesis. He is

a Member of the IEEE.

Ankur Srivastava received his B.S. degree from

the Indian Institute of Technology, Delhi, his

M.S. degree from the Northwestern University,

Evanston, IL, and his Ph.D. degree from the

University of California Los Angeles (UCLA) in

1998, 2000, and 2002. Since fall 2002 he has been

an Assistant Professor with the University of

Maryland, College Park. His research interests

include many aspects of design automation

including logic and high-level synthesis, low-

power issues, fabrication variability and low-power issues pertaining to

sensor networks.

Dr. Srivastava received the outstanding Ph.D. Award from the

Computer Science Department of UCLA in 2002. He is a Member of

ACM and IEEE.

Active mode leakage reduction using fine-grained forward body biasing strategy

Documents

Transcript of Active mode leakage reduction using fine-grained forward body biasing strategy