
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 3, MARCH 2004 235

Power-Delay Product Minimization in High-Performance 64-bit Carry-Select Adders

Amaury Nève, Member, IEEE, Helmut Schettler, Thomas Ludwig, Member, IEEE, and Denis Flandre, Senior Member, IEEE

Abstract: This paper analyzes methods to minimize the power-delay product of 64-bit carry-select adders intended for high-performance and low-power applications. A first realization in 0.18-µm partially depleted (PD) silicon-on-insulator (SOI), using complex branch-based logic (BBL) cells, results in a delay of 720 ps and a power dissipation of 96 mW at 1.5 V. The reduction of the stack height in the critical path, combined with the optimization of the global carry network with cell sharing and the selection of 8-bit pre-sums, leads to a reduction of the power-delay product by 75%. The automatic tuning of the transistor widths in 0.13-µm PD SOI produces an energy-efficient 64-bit adder which has a delay of 326 ps and a power dissipation of only 23 mW at 1.1 V.

Index Terms: Adder, digital CMOS, high performance, low power, power-delay product, silicon-on-insulator technology.

I. INTRODUCTION

TODAY, one of the major challenges for high-performance microelectronic systems is the power dissipation, both static and dynamic [1]–[3]. The circuit designer must, therefore, find an optimum between power and speed, instead of targeting them independently. This trade-off is captured by the power-delay product, which represents the average energy dissipated for one switching event [4].
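As a worked illustration of this metric, the short sketch below multiplies the delay and power figures quoted in the abstract. The function name `power_delay_product_pj` is an illustrative choice and is not defined in the paper.

```python
def power_delay_product_pj(delay_ps: float, power_mw: float) -> float:
    """Return the power-delay product in picojoules.

    1 ps * 1 mW = 1e-12 s * 1e-3 W = 1e-15 J = 1e-3 pJ.
    """
    return delay_ps * power_mw * 1e-3


# Figures taken from the abstract.
first_pdp = power_delay_product_pj(720, 96)   # first realization, 0.18 um, 1.5 V
tuned_pdp = power_delay_product_pj(326, 23)   # tuned adder, 0.13 um, 1.1 V
print(f"{first_pdp:.1f} pJ, {tuned_pdp:.1f} pJ")
```

The second value agrees with the 7.5 pJ reported for the 0.13-µm adder in Section V-E and in the conclusion.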

In this study, we investigate design methods to minimize the power-delay product of 64-bit adders in partially depleted (PD) silicon-on-insulator (SOI) technology. Addition is used as a benchmark here since it is one of the important tasks performed by the CPU: adders are needed in the arithmetic and logic units, for memory address generation, and for floating-point calculations [5], [6]. The power-delay product is improved at the different hierarchical levels of the design: circuit design style, cell decomposition, and global architecture. Section II discusses design styles that can be used for low-power and high-performance VLSI systems in SOI. In Section III, we compare two possible implementations of branch-based cells used for the 64-bit adder, which has a classical carry-select architecture. The experimental results of this realization are discussed in Section IV and are compared with a complementary pass-gate adder. The optimization of the adder structure, the global carry network, and the cells is presented in Section V, where the proposed adder is compared with the original carry-select adder. Section V also gives the results for the optimized implementation of the 64-bit adder in 0.13-µm PD SOI, and compares them with state-of-the-art 64-bit adders from the literature.

Manuscript received February 28, 2003; revised June 30, 2003. The work of A. Nève was supported in part by the Walloon Region of Belgium.

A. Nève was with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium. He is now with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]).

H. Schettler and T. Ludwig are with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]; [email protected]).

D. Flandre is with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2004.824305

II. LOGIC CIRCUIT DESIGN IN SOI

The use of SOI technology instead of bulk CMOS opens new possibilities in the choice of the design style and in the design space itself. In Section II-A, we discuss some possible design styles for PD SOI. Among them, branch-based logic, a restricted version of static CMOS logic, seems very promising and is presented in Section II-B.

A. Circuit Design Styles

In this study, we concentrate on static design styles, since the performance advantage of both dynamic logic styles and pass-gate design is expected to decrease in future deep-submicron technologies [3], [7]. The features of lower dynamic power consumption and higher noise margin make static CMOS particularly attractive [8], [9]. Moreover, the activation of the parasitic bipolar transistor in PD SOI is reported to result in fatal erroneous states in dynamic logic and to make circuit design with pass-gates more difficult [10]. The renewed interest in static design styles like pseudo-NMOS [11] and ratioed CMOS [12] shows that alternative design styles are investigated in SOI in order to reduce the power dissipation while still maintaining high-speed performance.

B. Branch-Based Circuit Design Style

In the branch-based logic (BBL) design style, a logic cell is only made of branches that contain a few transistors in series [13]. The branches are connected in parallel between the power supply lines and the common output node. Many usual static CMOS gates already have a branch structure, such as the inverter and the NAND and NOR gates. By using the branch-based concept, it is possible to minimize the number of internal connections, and thus, the parasitic capacitances associated with the diffusions, interconnections, and contacts. As it belongs to the family of static CMOS, BBL presents high noise margins and robustness to device scaling and voltage scaling.

The optimal design point will depend on whether the emphasis is placed on low power, high speed, or a compromise between the two.

1063-8210/04$20.00 © 2004 IEEE


Fig. 1. 64-bit carry-select adder.

TABLE I. LOGIC EQUATIONS AND CIRCUIT BLOCKS USED FOR THE INTERMEDIATE CARRY SIGNALS

III. DESIGN OF THE CARRY-SELECT ADDER

A. Adder Structure

In our first design, the 64-bit adder was classically divided into four 16-bit sections as shown in Fig. 1 [14]. In each section, two 16-bit adders generate the sum outputs with, respectively, the carry-in at 0 and at 1. The true sum is selected by a multiplexer. The control signals for the multiplexers are the carry-in and the three intermediate carry signals. The intermediate carry signals and the carry-out are produced by the carry-select boxes (CS-boxes). The inputs of the CS-boxes are the carry-in and the conditional carry signals of the four sections (bit positions 0 to 15, 16 to 31, 32 to 47, and 48 to 63), which are all generated simultaneously. Each section provides two conditional carry signals: its block carry assuming that the carry-in is at 0, and its block carry assuming that the carry-in is at 1. For the sake of clarity, the indexes of the conditional carry signals have been simplified in Fig. 1. The carry-out and the intermediate carry signals are computed according to the equations presented in Table I. To compute the carry-out, the last intermediate carry is combined with the two conditional carry signals of the most significant section in one CS-C0 stage, which avoids a complex CS-C3 stage.

In turn, the 16-bit adder blocks are implemented as carry-select adders, with 4-bit adder blocks having a carry-in either at 0 or at 1. At the 16-bit level, the same CS-boxes can be used as at the 64-bit level, with the sizing of the transistors adapted to the particular load conditions.

Finally, the 4-bit adder blocks can be implemented as ripple-carry adders or as carry-select adders; the latter option was chosen here. The carry-select architecture is thus used at three different levels: in the 64-bit, the 16-bit, and the 4-bit adders.
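The selection principle described above can be sketched as a small behavioral model. It is only an illustration of one carry-select level, not the gate-level design: the function name `carry_select_add` and its parameters are hypothetical, and the loop models the chain of select decisions sequentially, whereas in hardware all pre-sums are computed in parallel.

```python
def carry_select_add(a: int, b: int, cin: int = 0,
                     width: int = 64, block: int = 16) -> tuple[int, int]:
    """Behavioral model of a one-level carry-select adder.

    Each block computes two pre-sums (carry-in 0 and carry-in 1);
    the carry arriving from the lower blocks selects one of them.
    Returns (sum modulo 2**width, carry-out).
    """
    mask = (1 << block) - 1
    total, carry = 0, cin
    for lo in range(0, width, block):
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        pre0 = x + y          # pre-sum assuming carry-in = 0
        pre1 = x + y + 1      # pre-sum assuming carry-in = 1
        chosen = pre1 if carry else pre0   # multiplexer
        total |= (chosen & mask) << lo
        carry = chosen >> block            # block carry for the next section
    return total, carry


print(carry_select_add(0xFFFF_FFFF_FFFF_FFFF, 1))
```

Calling the same function with `block=4` or `block=8` models the other section sizes discussed in the paper.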

B. Design of the Carry-Select Boxes

A CS-box can be implemented in different ways, depending on its number of inputs and complexity. The CS-C0 gate is designed starting from the logic equation presented in Table I. The P-part is common to the BBL version and the CMOS OR-AND-INVERT (OAI) gate, which is further referenced as the X-gate. It can immediately be designed, just observing that the input variables must be complemented in the logic equation (Fig. 2).

Fig. 2. One-stage complex CS-C0: (a) BBL and (b) X-gate.

Fig. 3. Two-stage decomposed CS-C0 cell.

For the NMOS part, the complement is taken:

(1)

(2)

Equation (2) defines the N-part of the X-gate represented in Fig. 2(b). Normally, the N-part of the BBL cell would require four transistors. However, to reduce the input capacitance and the internal parasitic capacitances, (2) can be simplified by noticing that

(3)

This is true since the input combination in which the conditional carry for a carry-in at 0 is at 1 while the conditional carry for a carry-in at 1 is at 0 never happens, as a result of the properties of the conditional carry signals. The resulting equation is

(4)

CS-C0 can thus be implemented as a complex cell in one stage, obeying the BBL principle [Fig. 2(a)], or as an X-gate, allowing for internal connections between branches [Fig. 2(b)]. These X-gates will be used in the optimized version of the adder described in Section V.
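The effect of the "never happens" condition can be checked exhaustively. Since Table I and the equations are not reproduced in this text, the sketch below assumes the usual carry-select relation cout = c0 OR (c1 AND cin), where `c0` and `c1` are hypothetical names for the conditional carries computed for a carry-in of 0 and 1. Under that assumption, it shows that a pull-down condition with three literals matches the complement of cout on every legal input. It illustrates the kind of simplification described and is not necessarily the exact form of (3) and (4).

```python
from itertools import product


def cs_c0(c0: int, c1: int, cin: int) -> int:
    """Assumed carry-select relation: take c1 if cin is 1, else c0."""
    return c0 | (c1 & cin)


def pulldown_full(c0: int, c1: int, cin: int) -> int:
    """Complement of cs_c0 by De Morgan: NOT c0 AND (NOT c1 OR NOT cin)."""
    return (1 - c0) & ((1 - c1) | (1 - cin))


def pulldown_simplified(c0: int, c1: int, cin: int) -> int:
    """Candidate reduced network: NOT c1 OR (NOT c0 AND NOT cin)."""
    return (1 - c1) | ((1 - c0) & (1 - cin))


# A carry computed with carry-in 1 is never smaller than with carry-in 0,
# so (c0, c1) = (1, 0) cannot occur and is skipped.
legal = [(c0, c1, cin) for c0, c1, cin in product((0, 1), repeat=3)
         if not (c0 == 1 and c1 == 0)]

mismatches = [v for v in legal if pulldown_full(*v) != pulldown_simplified(*v)]
print(mismatches)
```

The two functions differ only on the excluded combination, which is why the reduced form is safe for conditional carry inputs but not for arbitrary signals.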

CS-C0 can also be implemented in two stages, using two classical NANDs and one inverter (Fig. 3). Notice that the latter is a two-stage BBL implementation of CS-C0, since NAND gates can be considered as elementary BBL cells. This will be referenced as decomposed-cell design in the remainder of this paper.
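One decomposition that fits the description "two NANDs and one inverter" can be verified exhaustively. It assumes the relation cout = c0 OR (c1 AND cin), with `c0`, `c1`, and `cin` as illustrative names; the actual gate arrangement of Fig. 3 may differ.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def inv(a: int) -> int:
    return 1 - a


def cs_c0_two_stage(c0: int, c1: int, cin: int) -> int:
    """Stage 1: NAND(c1, cin) and INV(c0) side by side; stage 2: one NAND."""
    return nand(inv(c0), nand(c1, cin))


# Compare against the assumed single-stage relation on all eight inputs.
for c0, c1, cin in product((0, 1), repeat=3):
    assert cs_c0_two_stage(c0, c1, cin) == (c0 | (c1 & cin))
```

Unlike the simplified one-stage cell, this form is valid for every input combination, at the cost of two internal nodes.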

Figs. 4 and 5 represent, respectively, the complex BBL one-stage implementation of CS-C1 and the complex BBL two-stage implementation of CS-C2. Using one stage only for the latter requires stacks of four transistors, which appeared to be too slow. The carry indexes have been simplified for the sake of clarity in the figures. In Section III-C, these complex cells are compared with equivalent decomposed-cell designs, using NOR and NAND gates.

Fig. 4. One-stage complex BBL CS-C1 cell.

Fig. 5. Two-stage complex BBL CS-C2 cell.

C. Results for the Single Cells

The complex and decomposed-cell BBL gates have been optimized for speed as follows. For the optimization process, each cell is loaded with one CMOS inverter (fan-out of 1). The nominal gate length is chosen at the minimum allowed drawing dimension. We carefully optimized the gate widths by hand, using an iterative process. In the first step, the critical branch is identified; this is the branch that has the longest delay when the cell switches. The speed of a gate is indeed limited by the slowest branch [15]. Often this corresponds to the branch with the largest number of transistors in series. The input pattern that activates the top transistor of the critical branch in one particular cell is applied at the input. With the AS/X circuit simulator [16], a sweep is made on the ratio of this branch, all other ratios remaining constant. Thereafter, the ratio of the other branches is further tuned to lower the capacitance of the output node, to which all the branches are connected. After this first step, we can turn back to the critical branch and refine the choice of its gate widths. Most of the time, the second step leads only to minor changes of the ratios of the branches.
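The alternating procedure above resembles a coordinate-wise search, which can be sketched on a toy problem. Everything in this example is an assumption made for illustration: `toy_delay` is an invented analytic model in arbitrary units, not an AS/X simulation, and the two width variables stand in for the sizing of the critical branch and of the remaining branches.

```python
def toy_delay(w_crit: float, w_other: float) -> float:
    """Toy delay model (arbitrary units), NOT a circuit simulation.

    Both branches drive a shared output node whose capacitance grows
    with every width; wider inputs also load the previous stage.
    """
    c_out = 4.0 + 1.0 * (w_crit + w_other)          # load plus diffusion capacitance
    d_crit = 3.0 * c_out / w_crit + 0.5 * w_crit    # taller stack: weaker per unit width
    d_other = 1.0 * c_out / w_other + 0.5 * w_other
    return max(d_crit, d_other)                     # the slowest branch limits the gate


def sweep(values, cost):
    """One 1-D sweep: return the candidate with the lowest cost."""
    return min(values, key=cost)


grid = [0.5 * k for k in range(1, 41)]   # candidate widths 0.5 .. 20.0
w_crit, w_other = 1.0, 1.0               # starting point (on the grid)
history = [toy_delay(w_crit, w_other)]

for _ in range(3):  # a few alternations; later passes change little
    w_crit = sweep(grid, lambda w: toy_delay(w, w_other))    # tune the critical branch
    w_other = sweep(grid, lambda w: toy_delay(w_crit, w))    # then the other branches
    history.append(toy_delay(w_crit, w_other))

print(w_crit, w_other, history)
```

Because each sweep keeps the current value among its candidates, the recorded delay never increases from one pass to the next, which mirrors the observation that the second pass brings only minor changes.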

We evaluated the performance of the basic building blocks of the adder by using circuit simulations. The simulations are based on the schematic specification of the circuit. The device models of the 0.18-µm PD SOI process include the parasitic capacitances of source, drain, and gate. The model also accounts for the parasitic capacitances associated with the contacts.





TABLE II. SIMULATED DELAY. V = 1.5 V; T = 25 °C; FO = 1

TABLE III. SIMULATED STATIC POWER DISSIPATION. V = 1.5 V; T = 25 °C; FO = 1

TABLE IV. SIMULATED DYNAMIC POWER DISSIPATION. V = 1.5 V; T = 25 °C; FO = 1; F = 1 GHz

The results for the delay, static power consumption, and dynamic power consumption are presented, respectively, in Tables II–IV. We compare two possible BBL implementations of the cells: complex-cell design (COMPLEX) and decomposed-cell design (DEC), using only NAND, NOR, and INV gates. Two categories clearly appear. Circuits with few inputs and few branches, like the half-adder and CS-C0, perform much better in their complex form, with a speed increase of 8% and 20%. Circuits with more inputs, like CS-C1 and CS-C2, perform worse in their complex form than in their decomposed form. The poor performance of these two complex cells can be explained by the combination of two factors: the presence of branches with a stack of three transistors and a high number of branches connected to the output node, which increases the parasitic capacitance at the output. Moreover, the critical path in the two decomposed circuits includes a stack of three NMOS devices, whereas in the two complex-cell circuits, a stack of three PMOS devices is activated in the worst case.

Floating-body effects are known to produce uncertainties in the circuit results. In particular, the hysteresis effect is associated with the dependence of the body potential on the switching history of the gates and appears as an additional delay variation [17]. With a tool based on the methodology presented in [18], we evaluated that the impact of the hysteresis effect on these cells is less than 5%.

The static power dissipation of the complex and decomposed versions is very close for the half-adder and CS-C2, since these BBL cells comprise multiple stages, like the decomposed versions, and hence multiple leakage paths between the power supply and ground. The complex CS-C0 and CS-C1 cells benefit from the design in one stage compared to their decomposed counterparts.

Fig. 6. Image of the test chip. The locations of the BBL adder (BBL ADD) and the CPL adder (CPL ADD) are highlighted.

The dynamic power dissipation is compared for the same logic cells in Table IV. The last column presents the dynamic power reduction when comparing the complex implementation and the decomposed implementation. The complex cells dissipate between 30% and 43% less dynamic power than the decomposed cells. The main reason is the lower number of internal nodes. In the CS-C0 cell, for example, there is only one highly capacitive node at the output in the complex cell [see Fig. 2(a)]. In the decomposed cell (Fig. 3), there are two internal nodes and one output node, though with a lower parasitic capacitance than in the case of the complex cell. When the output of the gate switches, at least one of the internal nodes also switches, with a full-swing charge/discharge of the associated parasitic capacitance.

IV. FIRST REALIZATION OF THE 64-BIT ADDER

The design and layout of the 64-bit carry-select adder presented in Section III has been realized for the CMOS 0.18-µm PD SOI process [19]. It is composed of 18 k devices and occupies an area of 735 µm × 280 µm (Fig. 6, BBL ADD). The same layout can be used for each 16-bit section, as they are repeated four times.

All the cells are implemented using the BBL design style, with the exception of the multiplexers, which are transmission-gate multiplexers. For this version of the adder, we fixed the maximum stack height at three MOSFETs, which enables us to use the complex BBL gates discussed in the previous section.

In the remainder of this paper, the adder implemented here is referenced as Adder4×16b. This notation refers to the fact that we make a selection of 16-bit pre-sums at the 64-bit adder level.

A. Critical Path

The critical path to the carry-out and to the sum outputs (see Fig. 1) is described here. At the 4-bit adder level, the critical path involves one NAND/NOR gate and one CS-C2 cell, which generates the block carry. A second CS-C2 cell at the 16-bit adder level generates four intermediate carry signals. These signals are then fed into the CS-C1 and CS-C2 cells of the 64-bit adder. Routed in this way, however, the delay on the critical carry-out path could be too high for two reasons. First, as seen previously, the delay in a CS-C2 cell is about 30% higher than in a CS-C1 cell. Second, the cells generating these four intermediate carry signals in the 16-bit adders see a high capacitive load at the inputs of the CS-C1 and CS-C2 cells at the 64-bit level. In our design, we favor the carry-out generation in two ways at the 64-bit level (Fig. 7). First, since the delay in CS-C2 is the largest, the four intermediate carry signals are fed directly into CS-C2. To compute the carry-out, only one additional stage is necessary, i.e., CS-C0. Second, buffers are added on the signal path to the inputs of CS-C1 in order to reduce the capacitive load seen by the outputs of the cells generating these signals. In this way, these signals do not see the high capacitive load of the CS-C1 cell.

Fig. 7. Detail of the critical path. Buffers are added on the path to CS-C1 in order to reduce the capacitive load seen by the previous stage.

B. Simulation Results

A second 64-bit carry-select adder has been designed with the decomposed cells for comparison purposes. AS/X simulations of these adders based on the cell schematics are used to compare both versions. At 1.5 V, 25 °C, and with a capacitive output load of 150 fF, the BBL complex-cell adder features a dynamic power consumption which is reduced by 10% compared to the decomposed-cell adder, with random input patterns applied at a rate of 1 GHz. The delay increase associated with the complex-cell design is less than 2% for supply voltages up to 1.5 V. For higher supply voltages, the speed difference between the two increases slightly, but remains lower than 5%. For worst case input patterns, the peak dynamic power consumption is reduced by 16% in the complex-cell adder compared to the decomposed-cell implementation. The overall reduction of dynamic power dissipation is lower than for the individual cells, because the complete adder also involves inverters and multiplexers, which are similar in both realizations.

An equivalent bulk realization of the adder consumes 15% more dynamic power and is 29% slower than the PD SOI version. This is associated with the lower junction capacitances in SOI.

A complete netlist has been extracted from the 64-bit complex-cell adder layout in order to precisely determine the static and dynamic power consumption, taking all the interconnections and parasitic elements into account. To compute the dynamic power, we apply random patterns at the inputs of the extracted netlist. Integration between 5 and 45 ns, with a bit rate of 1 GHz, a supply voltage of 1.5 V, and a temperature of 25 °C, results in a dynamic power consumption of 96 mW. The peak dynamic power consumption occurs when all the inputs switch at the same time, and rises up to 177 mW. The static power consumption is about 500 µW at 1.5 V and 25 °C.

Fig. 8. Experimental delay of the 64-bit BBL and CPL adders for different supply voltages at +25 °C. The critical delay is obtained for SUM24..31 for the BBL adder and for the carry-out in the case of the CPL adder.

C. Experimental Results

The experimental realization of the 64-bit adder has been tested under different voltage and temperature conditions and operates successfully. Worst case input patterns have been applied and the propagation delays to the sum and carry outputs measured.

In the worst case situation and at 1.5 V, the final carry is produced after 600 ps, and thanks to the independent carry network composed of the CS-boxes, it arrives earlier than the last sum outputs, which arrive after 720 ps in the worst case.

On the same chip as the BBL adder, a complementary pass-gate logic (CPL) 64-bit carry-select adder has been realized (CPL ADD in Fig. 6). The CPL adder makes the selection of 16-bit pre-sums and uses CPL cells based on the work presented in [20]. Fig. 8 shows the critical delay for the CPL adder and for the BBL adder. For low operating voltages, the BBL adder is clearly better than the CPL adder. Indeed, BBL cells do not suffer from the voltage drop due to the threshold voltage in single-rail pass-gates. This voltage drop increases in relative terms when moving to lower supply voltages. For high voltages, CPL and BBL are able to achieve similar performance. These results confirm the statement of [7] that the performance of CPL-like logic styles degrades much faster than other design styles due to the decreasing ratio of supply voltage to threshold voltage in deep-submicron technologies. The dynamic power consumption of the CPL adder is about 50 mW. This is one half of the power consumption of Adder4×16b. Two factors explain the power advantage of the CPL adder: a different structure of the 4-bit adders, implying the use of a lower number of cells, and a lower switched capacitance thanks to the low number of NMOS and especially PMOS devices in the design (2999 PMOS devices and 4992 NMOS transistors in CPL, versus 9077 PMOS and 9142 NMOS devices in the BBL adder).
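The device counts quoted above can be totaled with a few lines of arithmetic. The dictionary names are illustrative; the numbers are those given in the text.

```python
cpl = {"pmos": 2999, "nmos": 4992}   # CPL adder
bbl = {"pmos": 9077, "nmos": 9142}   # BBL adder (Adder4x16b)

cpl_total = sum(cpl.values())
bbl_total = sum(bbl.values())        # consistent with the "18 k devices" of Section IV
device_ratio = cpl_total / bbl_total
pmos_ratio = cpl["pmos"] / bbl["pmos"]

print(cpl_total, bbl_total, round(device_ratio, 2), round(pmos_ratio, 2))
```

The CPL adder thus uses well under half the devices of the BBL adder, and about one third of its PMOS devices, which is in line with its roughly halved dynamic power.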





Fig. 9. Four-bit ripple-carry adder (RCA) with carry-in = 0.

TABLE V. SIMULATION RESULTS FOR THE 4-BIT ADDER WITH THE ASSOCIATED CARRY-LOOKAHEAD CIRCUIT. IN FA+X-BBL, THE CARRY-LOOKAHEAD CIRCUIT IS IMPLEMENTED WITH BBL CS-C0 CELLS AND X-GATES; IN FA+CS-C2, THE CARRY-LOOKAHEAD CIRCUIT IS COMPOSED OF JUST ONE CS-C2 CELL; IN FA+CMOS, THE CARRY PATH IS IMPLEMENTED WITH CONVENTIONAL NAND AND NOR GATES; CSA+CS-C2 IS THE 4-BIT CARRY-SELECT ADDER WITH THE CARRY PATH CONSISTING OF CS-C2. V = 1.5 V; T = 25 °C; F = 1 GHz; Fan-out = 4 inverters

V. OPTIMIZATION OF THE 64-BIT ADDER

In the first part of this section, we will revisit the choices made for the maximum stack height of the BBL cells. Second, concerning the architecture, two elements must be considered. The carry-selection can occur with either 8-bit or 16-bit pre-sums. Moreover, the structure of the carry network itself can be further optimized with regard to the constraints of power and performance.

A. Stack Height

The power-delay product of the adder can be improved by a better balance of the cells with a stack height of two and the cells with a stack height of three. In the 4-bit adder cells, the stack height can be increased from two to three, since this part is not in the critical path in the carry-select structure. Instead of using half-adders (HAs) and multiplexers, efficient 28-transistor 1-bit CMOS full adders (FAs) are used in a 4-bit ripple-carry configuration [21]. To avoid the long delay of the block carry, which ripples through all the FA cells (Fig. 9), a carry-lookahead circuit is used. We implemented four versions and compared them with the original 4-bit carry-select adder (CSA) of Section III, which is referenced as CSA+CS-C2 in Table V. In each case, the carry network has NOR and NAND gates in the first stage to produce the conditional carry signals. The first way to generate the block carry is to use the complex BBL CS-C2 cell, represented in Fig. 5, which is referenced as FA+CS-C2 in Table V. The carry network with branch-based carry-select boxes can be redesigned to avoid high stacks, which favors speed. If we limit the height of the branches to two devices only, we propose another decomposition of the equation of the block carry:

(5)

(6)

(7)

By using the theorem of De Morgan, this expression becomes

(8)

The resulting circuit is shown in Fig. 10. Two complex BBL CS-C0 cells [see Fig. 2(a)], one two-input NOR, and one complementary X-gate are combined. It has the advantage of having a maximum of two PMOSFETs in the stack. Notice that we cannot use a complementary CS-C0 cell in the last stage. Indeed, the "never happens" condition that enabled the simplification of the BBL cells is not fulfilled here: one input of this stage being at 1 does not imply that the other is at 1. The circuit is combined with the 4-bit ripple-carry adder (RCA) and is referenced as FA+X-BBL in Table V. We can also replace the two remaining BBL CS-C0 gates by X-gates. This circuit is referenced as FA+X-gates. Finally, we can further decompose the cells so that this stage is designed using only inverters, NAND, and NOR gates. This case is FA+CMOS.

Fig. 10. Carry logic for the 4-bit adder with C = 0.

Fig. 11. Block diagram of the 8-bit CLA adder with carry-in. The intermediate carry signal C5 (interrupted line) was added in order to enhance the speed of the 8-bit adder blocks. C3 and C5 are intermediate results produced by CS-C7.

Table V presents the simulation results for the 4-bit adders with the carry-lookahead circuits. FA+X-BBL and FA+X-gates are almost similar in all respects and have the best energy efficiency. FA+CMOS, FA+CS-C2, and CSA+CS-C2 have a power-delay product which is about 25%–30% higher. The reduction of the stack height from three to two devices reduces the delay by 18% between FA+CS-C2 and FA+X-BBL.
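To make the idea of a 4-bit carry-lookahead with two-input terms concrete, the sketch below uses the textbook generate/propagate formulation. It is not a transcription of (5)-(8), which are not reproduced in this text; the function names and the grouping into a low and a high pair of bits are assumptions chosen so that every AND term has at most two inputs.

```python
from itertools import product


def ripple_carry_out(a: list[int], b: list[int], cin: int) -> int:
    """Carry out of a 4-bit ripple chain of full adders (bit 0 first)."""
    c = cin
    for ai, bi in zip(a, b):
        c = (ai & bi) | (c & (ai ^ bi))
    return c


def lookahead_carry_out(a: list[int], b: list[int], cin: int) -> int:
    """Same carry from generate/propagate terms, without rippling."""
    g = [ai & bi for ai, bi in zip(a, b)]   # generate: both inputs at 1
    p = [ai | bi for ai, bi in zip(a, b)]   # propagate: at least one input at 1
    # Pairwise grouping keeps every AND term to two inputs.
    g_lo, p_lo = g[1] | (p[1] & g[0]), p[1] & p[0]
    g_hi, p_hi = g[3] | (p[3] & g[2]), p[3] & p[2]
    return g_hi | (p_hi & (g_lo | (p_lo & cin)))


# Exhaustive comparison over all 4-bit operands and both carry-in values.
for bits in product((0, 1), repeat=9):
    a, b, cin = list(bits[0:4]), list(bits[4:8]), bits[8]
    assert ripple_carry_out(a, b, cin) == lookahead_carry_out(a, b, cin)
```

With the carry-in at 0, the last line reduces to `g_hi | (p_hi & g_lo)`. Note that `g_hi` being 1 does not force `p_hi` to be 1 (for example when only bit 3 generates), which is the kind of situation in which the "never happens" simplification cannot be applied.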

B. Structural Optimization

A 64-bit carry-select adder can make the selection either of 8-bit pre-sums [22], [23] or of 16-bit pre-sums [14]. The use of 8-bit adders allows the use of one carry-selection level instead of three as in the first design. There are multiple ways to combine two 4-bit adders to form the 8-bit adders. In the first possibility, we use C3, the 4-bit block carry from the first 4-bit adder, computed with the carry-lookahead circuit of the first stage. C3 will be re-used in the cell CS-C7, producing the carry for the entire 8-bit adder block (Fig. 11). However, the speedup was not sufficient to produce the SUM signals on time, even in the carry-select architecture. Therefore, C5, the result of the carry from the six lowest-order bits, will be fed directly into the FA for bit position 6, thus forming a second carry-lookahead path (Fig. 11). C5 is further used in CS-C7 to generate the carry of the 8-bit block. By using intermediate results and sharing circuit blocks, the duplication of logic cells is avoided. This contributes to a lower area and lower power consumption.

Adder8×8b is composed of eight 8-bit blocks, the global carry network, and seven multiplexers (Fig. 12). The first adder block contains the 8-bit adder generating SUM0..7 and CS-C7, the circuit generating the first intermediate carry signal. The seven other blocks are identical and are composed of four circuits. Two of these are 8-bit adders, used to produce the pre-sums, one assuming that the carry-in is at 0, the other assuming that the carry-in is at 1. A 2 × 8-input multiplexer selects the final sum output. The control signals for the multiplexers are generated by the global carry network. The two other circuits in the seven identical adder blocks generate the conditional carry signals which are fed into the global carry network. These two circuits are referenced as CS-cin0 and CS-cin1 in Fig. 12. By using 8-bit blocks which are repeated seven times, the design and layout time is made shorter than if sections of different lengths were used.

Fig. 12. 64-bit adder based on the selection of 8-bit pre-sums.

C. Global Carry Network

The global carry network generates the final carry-out and six intermediate carry signals. These signals command the multiplexers which select the appropriate 8-bit pre-sums.

In order to minimize the delay in the critical path, the following elements are taken into consideration: the number of successive stages, the input load presented to the CS-cin0 and CS-cin1 blocks, and the fan-out of each stage. In the decomposition that we propose below, the fan-out is limited to three, both for the conditional carry signals and for the intermediate carry signals that are re-used at different places in the global carry network.

The hot carry is the one that commands the multiplexer selecting the highest order sum signals. It is generated using all the intermediate carry signals except two:

(9)

Since in the global carry network the stack height is limited to two devices to favor speed, the equation of the hot carry is further decomposed in order to be able to implement this function with complex BBL CS-C0 gates and, where needed, X-gates.

The second control signal to generate is expressed as

(10)

which can be decomposed as

(11)

One term of this decomposition is already one of the intermediate carry signals, needed to command the multiplexer selecting SUM24..31. The expressions of the complements of two further terms are

(12)

(13)

and these terms are shared with another carry signal of the network. The next control signal is expressed as follows:

(14)

where some terms are again shared, and one term is the intermediate carry signal commanding the multiplexer selecting SUM16..23.

Finally, the last control signal re-uses an already computed signal in its expression

(15)

The low-order intermediate signals are re-used as inputs for cells generating higher order carry signals. In this way, we avoid the duplication of parts of the carry logic, which favors low power, and we keep the fan-out to a maximum of three, which is beneficial for speed.
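The role of the global carry network can be illustrated with a behavioral sketch that derives the carry at every 8-bit boundary from the pairs of conditional carries. The function `block_carries` is hypothetical and applies the carry-select step one block after the other; the actual network flattens these expressions into a few gate stages with shared terms, which this model does not attempt to reproduce.

```python
def block_carries(a: int, b: int, width: int = 64, block: int = 8) -> list[int]:
    """Carry out of each block boundary, built from conditional carries.

    For the default sizes the list holds the carries out of bit
    positions 7, 15, ..., 63 (the last entry is the final carry-out).
    """
    mask = (1 << block) - 1
    carries = []
    carry = 0                        # carry-in of the whole adder assumed at 0
    for lo in range(0, width, block):
        x, y = (a >> lo) & mask, (b >> lo) & mask
        c0 = (x + y) >> block        # block carry if the carry-in is 0
        c1 = (x + y + 1) >> block    # block carry if the carry-in is 1
        carry = c0 | (c1 & carry)    # carry-select step
        carries.append(carry)
    return carries


print(block_carries(0x00FF_0000_0000_00FF, 0x0001_0000_0000_0001))
```

Each entry except the last would drive one of the seven multiplexers that pick between the two pre-sums of the next 8-bit block.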

D. Simulation Results

This adder has been simulated using the parameters of the 0.18-µm PD SOI CMOS process and is compared with the experimental and simulation results of the first adder, Adder4×16b, which has a classical structure (Table VI). The layout of Adder4×16b is taken as a reference to estimate the lengths of the wires for the schematic simulations. For Adder8×8b, the length of the long wires has been reduced by 25% to account for the lower die area thanks to the lower device count. Considering the reduction in the number of devices (67%), this is a rather conservative value.

The optimized version of the 64-bit adder shows a reduction of about 20% of the maximum delay, thanks to three factors. First, the lower number of buffers on the signal path accounts for 6% of the delay reduction. Second, the use of stacks with a maximum of two PMOSFETs further improves the speed by 6%. The critical path in Adder8×8b includes a two-input NOR, one CS-C0 stage, and five X-gates to produce the hot carry. Third, thanks to a more efficient architecture in Adder8×8b, the capacitive load seen by the carry-select cells on the critical path is reduced compared to Adder4×16b.

TABLE VI. DELAY, POWER CONSUMPTION, POWER-DELAY PRODUCT, AND DEVICE COUNT FOR THE REALIZED AND SIMULATED IMPLEMENTATIONS OF ADDER4×16b AND THE SIMULATED IMPLEMENTATION OF ADDER8×8b IN THE 0.18-µm CMOS TECHNOLOGY. V = 1.5 V; T = 25 °C; F = 1 GHz; FO = 4

Fig. 13. Layout of the 64-bit adder based on the selection of 8-bit pre-sums.

To estimate the dynamic power consumption, random input patterns are applied at the inputs at a rate of 1 GHz. In Adder8×8b, the dynamic power consumption is reduced by a factor of three compared to the classical adder. This improvement is associated with the higher stacks in the noncritical path and the reduction of the number of cells in the proposed architecture.

Adder8×8b shows a 75% lower power-delay product than Adder4×16b. About 20% of this improvement is associated with the cell level, the remainder coming from the modifications of the architecture. Adder8×8b has a power-delay product that is 60% lower than the CPL adder. In this case, the architecture accounts for 10% of the improvement, the remainder coming from the different logic design styles.
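The reported improvement can be cross-checked from the rounded figures given in this section. The calculation below is only a consistency check: it combines the "about 20%" delay reduction with the factor-of-three power reduction, and the variable names are illustrative.

```python
delay_ratio = 1 - 0.20     # "about 20%" lower maximum delay
power_ratio = 1 / 3        # dynamic power "reduced by a factor of three"

# The power-delay product scales with the product of both ratios.
pdp_ratio = delay_ratio * power_ratio
print(f"PDP ratio: {pdp_ratio:.3f}, reduction: {1 - pdp_ratio:.0%}")
```

The rounded inputs give a reduction slightly below the reported 75%, which is plausible since the exact values of Table VI are not reproduced here.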

E. Optimization and Results in 0.13- m PD SOI

Adder8×8b has finally been optimized with Einstuner in a 0.13-μm PD SOI CMOS technology. Einstuner is a circuit optimization package that automatically resizes the transistors [24]. The final layout area is 151 μm × 461 μm (Fig. 13). A netlist has been extracted from the layout and is used to determine the main features of the adder. The critical delay is 326 ps at 1.1 V and 85 °C. Random input patterns at a rate of 2 GHz are used to calculate the dynamic power, which is found to be as low as 23 mW. The static power is evaluated to be 380 μW. Our realization is compared with state-of-the-art 64-bit adders published in previous work in Fig. 14. The adder proposed here is faster than the other realizations, even those realized in finer technologies. Our adder features the extremely low power-delay product of 7.5 pJ.
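As a quick arithmetic check, the power-delay product can be recomputed from the delay and dynamic power quoted for the 0.13-μm adder. The sketch below is illustrative only: the function and constant names are assumptions, and the share of static power in the total is a derived figure that the paper does not report.

```python
def power_delay_product_pj(delay_ps: float, power_mw: float) -> float:
    """Power-delay product in picojoules.

    1 ps * 1 mW = 1e-12 s * 1e-3 W = 1e-15 J = 1e-3 pJ.
    """
    return delay_ps * power_mw * 1e-3


# Figures quoted in the text for the 0.13-um PD SOI adder (1.1 V, 85 C).
DELAY_PS = 326.0          # critical delay
DYNAMIC_POWER_MW = 23.0   # dynamic power with random inputs at 2 GHz
STATIC_POWER_MW = 0.380   # static power, 380 uW

pdp_pj = power_delay_product_pj(DELAY_PS, DYNAMIC_POWER_MW)

# Derived for illustration only: fraction of total power that is static.
static_share = STATIC_POWER_MW / (STATIC_POWER_MW + DYNAMIC_POWER_MW)

print(f"Power-delay product: {pdp_pj:.1f} pJ")
print(f"Static share of total power: {static_share:.1%}")
```

The product of 326 ps and 23 mW is 7.498 pJ, which rounds to the 7.5 pJ stated above.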





Fig. 14. Our 64-bit adder, based on the selection of 8-bit pre-sums and designed in 0.13-μm PD SOI, is compared with recent 64-bit adder realizations. Load = 40 fF; T = 85 °C. ISSCC00 is a domino carry-select/carry-lookahead adder in 0.18-μm PD SOI [25]. VLSI01 is a race logic carry-lookahead/carry-select adder in 0.18-μm bulk CMOS [22]. ESSCIRC02 is a Brent-Kung adder in 100-nm bulk CMOS [26]. VLSI02 is a carry-select/carry-lookahead adder in 0.1-μm PD SOI [23].

VI. CONCLUSION

In this paper, we presented a methodology to minimize the power-delay product of 64-bit carry-select adders for high-performance and low-power applications by working at three levels of abstraction: design style, cell arrangement, and adder structure. We demonstrated that a complex-cell branch-based design reduces the dynamic power consumption by 10% compared to the design with decomposed cells, with a minimal impact on the delay. The reduction of the CMOS cell stack height from three to two devices in the critical path proved to be beneficial for speed. By avoiding the duplication of cells, without increasing the fan-out on the critical path, we were able to improve the speed while maintaining low power consumption. The structural optimization, namely the selection of 8-bit pre-sums, allowed the use of only one carry-select level, thus further contributing to the reduction of the power dissipation thanks to a lower number of cells. Compared to the classical design, the power-delay product of the optimized adder has been reduced by a factor of four. Compared with an equivalent CPL 64-bit adder, our realization shows a 60% improvement of the power-delay product. Finally, an automatic tuning tool allowed the design of an energy-efficient adder in 0.13-μm PD SOI, with a power-delay product as low as 7.5 pJ. The approach presented in this paper can be extended toward n-bit carry-select adders. However, the number of carry-select levels and the adder architecture might be different in order to obtain an efficient realization.

ACKNOWLEDGMENT

The authors express their sincere thanks to G. Hellner, J. Keinert, V. Gernhoeffer, and U. Krauch for the help they provided when using the simulation, layout, and extraction tools. The authors also wish to thank R. Sautter and W. Haller for useful discussions, and J. Appinger for his helpful assistance in obtaining the experimental data.

REFERENCES

[1] T. H. Ning, "CMOS in the new millennium," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2000, pp. 49–56.

[2] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.

[3] European Semiconductor Industry Association, Japan Electronics and Information Industries Association, Korea Semiconductor Industry Association, Taiwan Semiconductor Industry Association, and Semiconductor Industry Association, International Technology Roadmap for Semiconductors, System Drivers, 2001.

[4] C. Nagendra, R. M. Owens, and M. J. Irwin, "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Syst., vol. 2, pp. 377–381, Sept. 1994.

[5] S. Naffziger, "A subnanosecond 0.5 μm 64b adder," in Proc. IEEE Int. Solid-State Circuits Conf., Slide Supplement, 1996.

[6] J. Hennessy, D. Patterson, and D. Goldberg, Computer Architecture, A Quantitative Approach, 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996.

[7] M. Allam, M. Anis, and M. Elmasry, "Effect of technology scaling on digital CMOS logic styles," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2000, pp. 19-1-1–19-1-8.

[8] R. Yung, S. Rusu, and K. Shoemaker, "Future trend of microprocessor design," in Proc. ESSCIRC 2002, pp. 43–46.

[9] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.

[10] P.-F. Lu, C.-T. Chuang, J. Ji, L. F. Wagner, C.-M. Hsieh, J. B. Kuang, L. L.-C. Hsu, M. M. Pelella, S.-F. Sanford, and C. J. Anderson, "Floating-body effects in partially depleted SOI CMOS circuits," IEEE J. Solid-State Circuits, vol. 32, pp. 1241–1253, Aug. 1997.

[11] N. Subba, A. Salman, S. Mitra, D. E. Ioannou, and C. Tretz, "Pseudo-NMOS revisited: Impact of SOI on low power, high speed circuit design," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 26–27.

[12] C. R. Tretz, R. K. Montoye, and W. Reohr, "Ratioed CMOS: A low power high speed design choice in SOI technologies," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 28–29.

[13] J. M. Masgonty, C. Arm, and C. Piguet, "Technology- and power-supply-independent cell library," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 1991, pp. 25.5/1–25.5/4.

[14] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. New York: Wiley, 1979.

[15] S. Zaker and J. Zahnd, "OPTIMOS: a branch-level digital circuit optimizer," in Proc. EURO ASIC, 1993, pp. 563–572.

[16] G. A. Katopis, W. D. Becker, T. R. Mazzawy, H. H. Smith, C. K. Vakirtzis, S. A. Kuppinger, B. Singh, P. C. Lin, J. Bartells Jr., G. V. Kihlmire, P. N. Venkatachalam, H. I. Stoller, and J. L. Frankel, "MCM technology and design for the S/390 G5 system," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 621–650, Sept.–Nov. 1999.

[17] G. G. Shahidi, "SOI technology for the GHz era," IBM J. Res. Develop., vol. 46, no. 2/3, pp. 121–131, Mar./May 2002.

[18] I. Aller and K. E. Kroell, "Detailed analysis of the gate delay variability in partially depleted SOI CMOS circuits," in Proc. IEEE Int. SOI Conf., Oct. 1999, pp. 40–41.

[19] A. Nève, D. Flandre, H. Schettler, T. Ludwig, and G. Hellner, "Design of a branch-based 64-bit carry-select adder in 0.18 μm partially-depleted SOI CMOS," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 108–111.

[20] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8 ns CMOS 16 × 16-b multiplier using complementary pass-transistor logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388–395, Apr. 1990.

[21] K. Martin, Digital Integrated Circuit Design. Oxford, U.K.: Oxford Univ. Press, 2000.

[22] S. J. Lee, R. Woo, and H. J. Yoo, "480 ps 64-bit race logic adder," in Symp. VLSI Circuits Dig. Tech. Papers, 2001, pp. 27–28.

[23] J. J. Kim, R. Joshi, C.-T. Chuang, and K. Roy, "SOI-optimized 64-bit high-speed CMOS adder design," in Symp. VLSI Circuits Dig. Tech. Papers, 2002, pp. 122–125.

[24] X. Bai, C. Visweswariah, P. Strenski, and D. Hathaway, "Uncertainty-aware circuit optimization," in Proc. 39th Design Automation Conf., June 2002, pp. 58–63.

[25] D. Stasiak, J. Tran, F. Mounes-Toussi, and S. Storino, "A 2nd generation 440 ps SOI 64 b adder," in IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp. 288–289.

[26] M. Garg and A. Katoch, "Evaluation of skew tolerance in delayed clocking scheme for dynamic circuits," in Proc. ESSCIRC, Sept. 2001, pp. 396–399.





Amaury Nève (M'96) received the electrical engineering and the Ph.D. degrees from the Université catholique de Louvain, Louvain-la-Neuve, Belgium, in 1998 and 2004, respectively.

From 1998 to 2003, he was Research Assistant with the Microelectronics Laboratory of the Université catholique de Louvain. He was involved in the development of design techniques for high-speed and low-power digital circuits in advanced silicon-on-insulator (SOI) processes. In May 2003, he joined the IBM Research and Development Laboratory in Böblingen, Germany, where he is working on circuit design for advanced CMOS processes.

Helmut Schettler received the Dipl.-Ing. degree in electrical engineering from the University of Stuttgart, Stuttgart, Germany, in 1969.

In 1969, he joined the IBM Laboratory, Böblingen, Germany. He worked for three years in the IBM Laboratories, East Fishkill, NY, and Burlington, VT, where he was involved in bipolar and CMOS circuit and chip design for memory and μ-processor applications. He holds 23 patents and was the leading circuit designer when IBM moved its server technology from bipolar to CMOS. Presently, he is also involved in lecturing at a university of applied science in Stuttgart.

Thomas Ludwig (M'90) was born in Sindelfingen, Germany, in 1957. He received the Master of Electrical Engineering from the Technische Universität, Berlin, Germany, in 1983.

He joined IBM in 1984, at the German Research and Development Laboratory, Böblingen, working on high-speed digital driver/receiver circuits. In 1992, he was on an assignment with the joint IBM/Intel Noyce Development Center, Boca Raton, FL, working on technology conversion of a μ-processor. He is currently Senior Engineer and Leader of the Future Product Technology Team, IBM Systems Group, Böblingen, Germany, responsible for future silicon technologies in the field of high-speed server processors. His current research interests are in the areas of silicon-on-insulator technology, especially FinFET circuit design, modeling and the influence on CAD tools.

Denis Flandre (M'86–SM'03) was born in Charleroi, Belgium, in 1964. He received the Electrical Engineer degree, the Ph.D. degree, and the Postdoctoral thesis degree from the Université catholique de Louvain (UCL), Louvain-la-Neuve, Belgium, in 1986, 1990, and 1999, respectively. His doctoral research was on the modeling of silicon-on-insulator (SOI) MOS devices for characterization and circuit simulation, and his Postdoctoral thesis was on a systematic and automated synthesis methodology for MOS analog circuits.

In 1985, he was a summer Student Trainee at NTT Headquarters, Tokyo, Japan. From October 1990 to September 1991, he was with the Centro Nacional de Microelectrónica, Barcelona, Spain, working on the characterization and numerical simulation of SOI MOS process and devices. He was then at the Laboratoire de Microélectronique (DICE), Louvain-la-Neuve, Belgium, as a Senior Research Associate of the National Fund for Scientific Research (FNRS, Belgium). Since 2001, he has been a full-time Professor at UCL giving courses on integrated analog circuit design, device physics, etc. He is currently involved in the research and development of digital and analog SOI MOS circuits for special applications, more specifically high-speed, low-voltage low-power, microwave, rad-hard, and high-temperature electronics.

Prof. Flandre has been the recipient of the 1992 Biennial Siemens-FNRS Award for an original contribution in the fields of electricity and electronics, the 1997 Wernaers Prize for innovation in pedagogical presentation of advanced research work, and the 1999 CEN SCK Prize for innovation in nuclear science instrumentation. He has authored or coauthored more than 160 technical papers or conference contributions. He is a member of the Advisory Board of the EU Network of Excellence for High-Temperature Electronics (HITEN), of the Scientific Board of the Microserv large infrastructure EU program of the CNM-Barcelona and of the Director Board of the Cyclotron Research Center (CRC, Louvain-la-Neuve, Belgium). He is a founding member of the CERMIN (Centre de Recherche en Dispositifs et Matériaux Electroniques Micro- et Nanoscopiques of UCL). He is a cofounder of CISSOID S.A., a startup company, spun off of UCL in July 2000, focusing on SOI circuit design services.
