Power-Delay Product Minimization
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 3, MARCH 2004 235
Power-Delay Product Minimization in High-Performance 64-bit Carry-Select Adders
Amaury Nève, Member, IEEE, Helmut Schettler, Thomas Ludwig, Member, IEEE, and Denis Flandre, Senior Member, IEEE
Abstract: This paper analyzes methods to minimize the power-delay product of 64-bit carry-select adders intended for high-performance and low-power applications. A first realization in 0.18-µm partially depleted (PD) silicon-on-insulator (SOI), using complex branch-based logic (BBL) cells, results in a delay of 720 ps and a power dissipation of 96 mW at 1.5 V. The reduction of the stack height in the critical path, combined with the optimization of the global carry network with cell sharing and the selection of 8-bit pre-sums, leads to a reduction of the power-delay product by 75%. The automatic tuning of the transistor widths in 0.13-µm PD SOI produces an energy-efficient 64-bit adder which has a delay of 326 ps and a power dissipation of only 23 mW at 1.1 V.
Index Terms: Adder, digital CMOS, high performance, low power, power-delay product, silicon-on-insulator technology.
I. INTRODUCTION
TODAY, one of the major challenges for high-performance microelectronic systems is power dissipation, both static and dynamic [1]–[3]. The circuit designer must therefore find an optimum between power and speed instead of targeting them independently. This trade-off is captured by the power-delay product, which represents the average energy dissipated for one switching event [4].
In this study, we investigate design methods to minimize the power-delay product of 64-bit adders in partially depleted (PD) silicon-on-insulator (SOI) technology. Addition is used as a benchmark here since it is one of the important tasks performed by the CPU: adders are needed in the arithmetic and logic units, for memory address generation, and for floating-point calculations [5], [6]. The power-delay product is improved at the different hierarchical levels of the design: circuit design style, cell decomposition, and global architecture. Section II discusses design styles that can be used for low-power and high-performance VLSI systems in SOI. In Section III, we compare two possible implementations of branch-based cells used for the 64-bit adder, which has a classical carry-select architecture. The experimental results of this realization are discussed in Section IV and are compared with a complementary pass-gate adder. The optimization of the adder structure, the global carry network, and the cells is presented in Section V, where the proposed adder is compared with the original carry-select adder. Section V also gives the results for the optimized implementation of the 64-bit adder in 0.13-µm PD SOI and compares them with state-of-the-art 64-bit adders from the literature.
Manuscript received February 28, 2003; revised June 30, 2003. The work of A. Nève was supported in part by the Walloon Region of Belgium.
A. Nève was with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium. He is now with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]).
H. Schettler and T. Ludwig are with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]; [email protected]).
D. Flandre is with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2004.824305
II. LOGIC CIRCUIT DESIGN IN SOI
The use of SOI technology instead of bulk CMOS opens new possibilities in the choice of the design style and in the design space itself. In Section II-A, we discuss some possible design styles for PD SOI. Among them, branch-based logic, a restricted version of static CMOS logic, seems very promising and is presented in Section II-B.
A. Circuit Design Styles
In this study, we concentrate on static design styles, since the performance advantage of both dynamic logic styles and pass-gate design is expected to decrease in future deep-submicron technologies [3], [7]. The features of lower dynamic power consumption and higher noise margin make static CMOS particularly attractive [8], [9]. Moreover, the activation of the parasitic bipolar transistor in PD SOI is reported to result in fatal erroneous states in dynamic logic and to make circuit design with pass-gates more difficult [10]. The renewed interest in static design styles like pseudo-NMOS [11] and ratioed CMOS [12] shows that alternative design styles are investigated in SOI in order to reduce the power dissipation while still maintaining high-speed performance.
B. Branch-Based Circuit Design Style
In the branch-based logic (BBL) design style, a logic cell is made only of branches that contain a few transistors in series [13]. The branches are connected in parallel between the power supply lines and the common output node. Many common static CMOS gates, such as the inverter and the NAND and NOR gates, already have a branch structure. By using the branch-based concept, it is possible to minimize the number of internal connections and thus the parasitic capacitances associated with the diffusions, interconnections, and contacts. As BBL belongs to the family of static CMOS, it presents high noise margins and robustness to device scaling and voltage scaling.
The optimal design point will depend on whether the emphasis is placed on low power, high speed, or a compromise between the two.
1063-8210/04$20.00 2004 IEEE
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on April 24,2010 at 06:30:22 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. 64-bit carry-select adder.
TABLE I. LOGIC EQUATIONS AND CIRCUIT BLOCKS USED FOR THE INTERMEDIATE CARRY SIGNALS
III. DESIGN OF THE CARRY-SELECT ADDER
A. Adder Structure
In our first design, the 64-bit adder was classically divided into four 16-bit sections, as shown in Fig. 1 [14]. In each section, two 16-bit adders generate the sum outputs with the carry-in at 0 and at 1, respectively. The true sum is selected by a multiplexer. The control signals for the multiplexers are the carry-in and the intermediate carry signals C15, C31, and C47. The intermediate carry signals and the carry-out C63 are produced by the carry-select boxes (CS-boxes). The inputs of the CS-boxes are the carry-in and the conditional carry signals C^0_{i–j} and C^1_{i–j} (with i–j = 0–15, 16–31, 32–47, or 48–63), which are all generated simultaneously. The notation C^0_{0–15} refers to the block carry signal for bit positions 0 to 15, assuming that the carry-in is at 0. For the sake of clarity, the indexes of the conditional carry signals have been simplified in Fig. 1. The carry-out and the intermediate carry signals are computed according to the equations presented in Table I. To compute the carry-out, the intermediate carry C47 is combined with C^0_{48–63} and C^1_{48–63} in one CS-C0 stage to avoid a complex CS-C3 stage.
In turn, the 16-bit adder blocks are implemented as carry-select adders, with 4-bit adder blocks having a carry-in either at 0 or at 1. At the 16-bit level, the same CS-boxes can be used as at the 64-bit level, with the sizing of the transistors adapted to the particular load conditions.
Finally, the 4-bit adder blocks can be implemented as ripple-carry adders or also as carry-select adders; the latter option was chosen here. The carry-select architecture is thus used at three different levels: in the 64-bit, the 16-bit, and the 4-bit adders.
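The three-level structure described above can be illustrated with a small functional model. The sketch below is not the authors' circuit: the function names and the `levels` parameter are hypothetical, the 1-bit leaf size is an assumption, and the carries are evaluated one block after another, whereas the hardware generates all conditional carries simultaneously. It only shows that selecting between pre-sums computed for carry-in 0 and carry-in 1 yields the correct sum.

```python
def ripple_add(a, b, cin, width):
    """Reference addition of two `width`-bit operands: returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, cin, width, levels=(16, 4, 1)):
    """Hierarchical carry-select addition (functional model only).

    `levels` lists the block size used at each level; the default
    (16-bit sections, 4-bit blocks, 1-bit leaves) is an assumption.
    """
    if not levels:
        return ripple_add(a, b, cin, width)
    block, deeper = levels[0], levels[1:]
    mask = (1 << block) - 1
    result, carry = 0, cin
    for pos in range(0, width, block):
        xa, xb = (a >> pos) & mask, (b >> pos) & mask
        # Both pre-sums are computed without knowing the real carry-in.
        s0, c0 = carry_select_add(xa, xb, 0, block, deeper)
        s1, c1 = carry_select_add(xa, xb, 1, block, deeper)
        # Multiplexer: the incoming carry selects the true pre-sum.
        result |= (s1 if carry else s0) << pos
        # CS-box: next carry from the two conditional carries.
        carry = c0 | (c1 & carry)
    return result, carry
```

Calling `carry_select_add(a, b, cin, 64)` returns the same pair as `ripple_add(a, b, cin, 64)` for any 64-bit operands.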
B. Design of the Carry-Select Boxes
A CS-box can be implemented in different ways, depending on its number of inputs and complexity. The CS-C0 gate is designed starting from the logic equation presented in Table I. The P-part is common to the BBL version and the CMOS OR-AND-INVERT (OAI) gate, which is referred to as the X-gate in the following. It can be designed directly by observing that the input variables must be complemented in the logic equation (Fig. 2).
Fig. 2. One-stage complex CS-C0: (a) BBL and (b) X-gate.
Fig. 3. Two-stage decomposed CS-C0 cell.
For the NMOS part, the complement is taken:
(1)
(2)
Equation (2) defines the N-part of the X-gate represented in Fig. 2(b). Normally, the N-part of the BBL cell would require four transistors. However, to reduce the input capacitance and the internal parasitic capacitances, (2) can be simplified by noticing that
(3)
This is true since the input combination AND
never happens, resulting from the properties of
the conditional carry signals. The resulting equation is
(4)
CS-C0 can thus be implemented as a complex cell in one stage, obeying the BBL principle [Fig. 2(a)], or as an X-gate, allowing for internal connections between branches [Fig. 2(b)]. These X-gates will be used in the optimized version of the adder described in Section V.
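The simplification argument can be checked exhaustively. The sketch below assumes that CS-C0 implements the usual carry-select function C = C0 + C1·Cin, where C0 and C1 are the conditional carries for carry-in 0 and 1; the exact form of (1)–(4) is not reproduced here, and all function names are hypothetical. Under that assumption, the full complement needs four N-transistors in two branches, while the three-transistor form agrees with it on every input combination that can occur, because C0 = 1 with C1 = 0 never happens.

```python
from itertools import product


def cs_c0(c0, c1, cin):
    """Assumed carry-select function: C = C0 + C1*Cin."""
    return c0 | (c1 & cin)


def n_part_full(c0, c1, cin):
    """Exact complement in branch form: /C0*/C1 + /C0*/Cin (four transistors)."""
    return ((1 - c0) & (1 - c1)) | ((1 - c0) & (1 - cin))


def n_part_simplified(c0, c1, cin):
    """Simplified branch form: /C1 + /C0*/Cin (three transistors)."""
    return (1 - c1) | ((1 - c0) & (1 - cin))


def nand(x, y):
    return 1 - (x & y)


def cs_c0_two_stage(c0, c1, cin):
    """Decomposed version: one inverter and two NAND gates."""
    return nand(1 - c0, nand(c1, cin))


def reachable_inputs():
    """Input combinations that can occur: C0 = 1 implies C1 = 1."""
    return [(c0, c1, cin)
            for c0, c1, cin in product((0, 1), repeat=3)
            if not (c0 == 1 and c1 == 0)]
```

The simplified N-part differs from the exact complement only for the two excluded combinations, which is why the shortcut is safe for conditional carry signals but not for arbitrary inputs.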
CS-C0 can also be implemented in two stages, using two classical NANDs and one inverter (Fig. 3). Notice that the latter is a two-stage BBL implementation of CS-C0, since NAND gates can be considered as elementary BBL cells. This will be referred to as decomposed-cell design in the remainder of this paper.
Figs. 4 and 5 represent, respectively, the complex BBL one-stage implementation of CS-C1 and the complex BBL two-stage implementation of CS-C2. Using one stage only for the latter requires stacks of four transistors, which appeared to be too slow. The carry indexes have been simplified for the sake of clarity in the figures. In Section III-C, these complex cells are compared with equivalent decomposed-cell designs, using NOR and NAND gates.
Fig. 4. One-stage complex BBL CS-C1 cell.
Fig. 5. Two-stage complex BBL CS-C2 cell.
C. Results for the Single Cells
The complex and decomposed-cell BBL gates have been optimized for speed as follows. For the optimization process, each cell is loaded with one CMOS inverter (fan-out of 1). The nominal gate length is chosen at the minimum allowed drawing dimension, i.e., m. We carefully optimized the gate widths by hand, using an iterative process. In the first step, the critical branch is identified; this is the branch that has the longest delay when the cell switches. The speed of a gate is indeed limited by the slowest branch [15]. Often this corresponds to the branch with the largest number of transistors in series. The input pattern that activates the top transistor of the critical branch in one particular cell is applied at the input. With the AS/X circuit simulator [16], a sweep is made on the W/L ratio of this branch, all other ratios remaining constant. Thereafter, the W/L ratio of the other branches is further tuned to lower the capacitance of the output node, to which all the branches are connected. After this first step, we return to the critical branch and refine the choice of its gate widths. Most of the time, the second step leads only to minor changes in the W/L ratios of the branches.
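The iterative procedure can be pictured with a deliberately crude model. The sketch below is not the AS/X flow used in the paper: it replaces circuit simulation with a toy RC estimate in which a branch of `n` series devices of width `w` has resistance proportional to `n / w`, and every branch adds diffusion capacitance proportional to its width to the shared output node. All names, the unit constants, and the candidate width grid are assumptions chosen only to show the alternation between the critical branch and the other branches.

```python
def cell_delay(widths, stacks, r_unit=1.0, c_load=4.0, c_diff=0.5):
    """Toy estimate: resistance of the slowest branch times output capacitance."""
    c_out = c_load + c_diff * sum(widths)
    return max(n * r_unit / w * c_out for n, w in zip(stacks, widths))


def tune_widths(stacks, candidates, start=1.0, passes=2):
    """Sweep one branch width at a time, most resistive branch first."""
    widths = [start] * len(stacks)
    for _ in range(passes):
        # Critical branch = largest stack-to-width ratio in this toy model.
        order = sorted(range(len(stacks)),
                       key=lambda i: stacks[i] / widths[i], reverse=True)
        for i in order:
            def delay_with(w, i=i):
                # Delay of the cell if branch i alone took width w.
                trial = widths[:i] + [w] + widths[i + 1:]
                return cell_delay(trial, stacks)
            widths[i] = min(candidates, key=delay_with)
    return widths
```

With `stacks = [3, 2, 1]` and a grid such as `[0.5 * k for k in range(1, 17)]`, each sweep can only keep or lower the estimated delay, provided the starting width is one of the candidates. As in the text, the second pass typically changes little.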
We evaluated the performance of the basic building blocks of the adder by using circuit simulations. The simulations are based on the schematic specification of the circuit. The device models of the 0.18-µm PD SOI process include the parasitic capacitances of source, drain, and gate. The models also account for the parasitic capacitances associated with the contacts.
TABLE II. SIMULATED DELAY. V_DD = 1.5 V; T = 25 °C; FO = 1
TABLE III. SIMULATED STATIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1
TABLE IV. SIMULATED DYNAMIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1; F = 1 GHz
The results for the delay, static power consumption, and dynamic power consumption are presented in Tables II–IV, respectively. We compare two possible BBL implementations of the cells: complex-cell design (COMPLEX) and decomposed-cell design (DEC), using only NAND, NOR, and INV gates. Two categories clearly appear. Circuits with few inputs and few branches, like the half-adder and CS-C0, perform much better in their complex form, with a speed increase of 8% and 20%, respectively. Circuits with more inputs, like CS-C1 and CS-C2, perform worse in their complex form than in their decomposed form. The poor performance of these two complex cells can be explained by the combination of two factors: the presence of branches with a stack of three transistors and a high number of branches connected to the output node, which increases the parasitic capacitance at the output. Moreover, the critical path in the two decomposed circuits includes a stack of three NMOS devices, whereas in the two complex-cell circuits, a stack of three PMOS devices is activated in the worst case.
Floating-body effects are known to produce uncertainties in the circuit results. In particular, the hysteresis effect is associated with the dependence of the body potential on the switching history of the gates and appears as an additional delay variation [17]. With a tool based on the methodology presented in [18], we evaluated that the impact of the hysteresis effect on these cells is less than 5%.
The static power dissipation of the complex and decomposed versions is very close for the half-adder and CS-C2, since these BBL cells comprise multiple stages, like the decomposed versions, and hence multiple leakage paths between the power supply and ground. The complex CS-C0 and CS-C1 cells benefit from the design in one stage compared to their decomposed counterparts.
The dynamic power dissipation is compared for the same logic cells in Table IV. The last column presents the dynamic power reduction obtained with the complex implementation relative to the decomposed implementation. The complex cells dissipate between 30% and 43% less dynamic power than the decomposed cells. The main reason is the lower number of internal nodes. In the CS-C0 cell, for example, there is only one highly capacitive node, at the output, in the complex cell [see Fig. 2(a)]. In the decomposed cell (Fig. 3), there are two internal nodes and one output node, though with a lower parasitic capacitance than in the case of the complex cell. When the output of the gate switches, at least one of the internal nodes also switches, with a full-swing charge or discharge of the associated parasitic capacitance.
Fig. 6. Image of the test chip. The locations of the BBL adder (BBL ADD) and the CPL adder (CPL ADD) are highlighted.
IV. FIRST REALIZATION OF THE 64-BIT ADDER
The design and layout of the 64-bit carry-select adder presented in Section III have been realized in the CMOS 0.18-µm PD SOI process [19]. The adder is composed of 18 k devices and occupies an area of 735 µm × 280 µm (Fig. 6, BBL ADD). The same layout can be used for each 16-bit section, as they are repeated four times.
All the cells are implemented using the BBL design style, with the exception of the multiplexers, which are transmission-gate multiplexers. For this version of the adder, we fixed the maximum stack height at three MOSFETs, which enables us to use the complex BBL gates discussed in the previous section.
In the remainder of this paper, the adder implemented here is referred to as Adder4×16b. This notation refers to the fact that we make a selection of 16-bit pre-sums at the 64-bit adder level.
A. Critical Path
The critical path to the carry-out (referenced as C63 in Fig. 1) and to the sum outputs is described here. In the 4-bit adder level, the critical path involves one NAND/NOR and one CS-C2 cell, which generates . A second CS-C2 cell in the 16-bit adder level generates the intermediate carry signals , , , and . These signals are then fed into the CS-C1 and CS-C2 cells of the 64-bit adder for generation of C31 and C47, respectively. In this arrangement, however, the delay on the critical carry-out path could be too high for two reasons. First, as seen previously, the delay in a CS-C2 cell is about 30% higher than in a CS-C1 cell. Second, the cells generating the intermediate carry signals , , , and in the 16-bit adders see a high capacitive load at the inputs of the CS-C1 and CS-C2 cells at the 64-bit level. In our design, we favor the carry-out generation in two ways at the 64-bit level (Fig. 7). First, since the delay in CS-C2 is the largest, the intermediate carry signals , , , and are fed directly into CS-C2 for the generation of C47. To compute the carry-out C63, only one additional stage is necessary, i.e., CS-C0. Second, buffers are added on the signal path to the inputs of CS-C1 in order to reduce the capacitive load seen by the outputs of cells generating the signals , , , and . In this way, these signals do not see the high capacitive load of the CS-C1 cell.
Fig. 7. Detail of the critical path. Buffers are added on the path to CS-C1 in order to reduce the capacitive load seen by the previous stage.
B. Simulation Results
A second 64-bit carry-select adder has been designed with the decomposed cells for comparison purposes. AS/X simulations of these adders based on the cell schematics are used to compare both versions. At 1.5 V, 25 °C, and with a capacitive output load of 150 fF, the BBL complex-cell adder features a dynamic power consumption that is 10% lower than that of the decomposed-cell adder, with random input patterns applied at a rate of 1 GHz. The delay increase associated with the complex-cell design is less than 2% for supply voltages up to 1.5 V. For higher supply voltages, the speed difference between the two increases slightly but remains lower than 5%. For worst case input patterns, the peak dynamic power consumption is reduced by 16% in the complex-cell adder compared to the decomposed-cell implementation. The overall reduction of dynamic power dissipation is lower than for the individual cells, because the complete adder also involves inverters and multiplexers, which are similar in both realizations.
An equivalent bulk realization of the adder consumes 15% more dynamic power and is 29% slower than the PD SOI version. This is associated with the lower junction capacitances in SOI.
A complete netlist has been extracted from the 64-bit complex-cell adder layout in order to precisely determine the static and dynamic power consumption, taking all the interconnections and parasitic elements into account. To compute the dynamic power, we apply random patterns at the inputs of the extracted netlist. Integration between 5 and 45 ns, with a bit rate of 1 GHz, a supply voltage of 1.5 V, and a temperature of 25 °C, results in a dynamic power consumption of 96 mW. The peak dynamic power consumption occurs when all the inputs switch at the same time, and rises to 177 mW. The static power consumption is about 500 µW at 1.5 V and 25 °C.
Fig. 8. Experimental delay of the 64-bit BBL and CPL adders for different V_DD values at +25 °C. The critical delay is obtained for SUM24..31 for the BBL adder and for the carry-out in the case of the CPL adder.
C. Experimental Results
The experimental realization of the 64-bit adder has been tested under different voltage and temperature conditions and operates successfully. Worst case input patterns have been applied and the propagation delays to the sum and carry outputs measured.
In the worst case situation and at 1.5 V, the final carry is produced after 600 ps. Thanks to the independent carry network composed of the CS-boxes, it arrives earlier than the last sum outputs, which arrive after 720 ps in the worst case.
On the same chip as the BBL adder, a complementary pass-gate logic (CPL) 64-bit carry-select adder has been realized (CPL ADD in Fig. 6). The CPL adder makes the selection of 16-bit pre-sums and uses CPL cells based on the work presented in [20]. Fig. 8 shows the critical delay for the CPL adder and for the BBL adder. For low operating voltages V , the BBL adder is clearly better than the CPL adder. Indeed, BBL cells do not suffer from the voltage drop due to the threshold voltage in single-rail pass-gates. This voltage drop increases in relative terms when moving to lower supply voltages. For high voltages V , CPL and BBL are able to achieve similar performance. These results confirm the statement of [7] that the performance of CPL-like logic styles degrades much faster than that of other design styles due to the decreasing V_DD/V_T ratio in deep-submicron technologies. The dynamic power consumption of the CPL adder is about 50 mW. This is about one half of the power consumption of Adder4×16b. Two factors explain the power advantage of the CPL adder: a different structure of the 4-bit adders, implying the use of a lower number of cells, and a lower switched capacitance thanks to the low number of NMOS and especially PMOS devices in the design (2999 PMOS devices and 4992 NMOS transistors in CPL, versus 9077 PMOS and 9142 NMOS devices in the BBL adder).
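The device counts quoted above can be totalled to see how they relate to the power figures. The short calculation below uses only numbers stated in the text (the device counts, 50 mW for the CPL adder, and 96 mW for the BBL adder); the variable names are illustrative. It is a consistency check and not a power model, since switched capacitance depends on device sizes and activity as well as on device count.

```python
# Device counts as quoted in the text.
cpl_devices = {"PMOS": 2999, "NMOS": 4992}
bbl_devices = {"PMOS": 9077, "NMOS": 9142}

cpl_total = sum(cpl_devices.values())   # all CPL transistors
bbl_total = sum(bbl_devices.values())   # all BBL transistors, the "18 k devices"

# Ratios of CPL to BBL for device count and dynamic power (mW values from the text).
device_ratio = cpl_total / bbl_total
power_ratio = 50 / 96

print(f"CPL devices: {cpl_total}, BBL devices: {bbl_total}")
print(f"device ratio: {device_ratio:.2f}, power ratio: {power_ratio:.2f}")
```

The BBL total agrees with the 18 k devices reported for the first realization, and the CPL adder uses fewer than half as many transistors, which is in line with its roughly halved dynamic power.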
Fig. 9. Four-bit ripple-carry adder (RCA) with carry-in = 0.
TABLE V. SIMULATION RESULTS FOR THE 4-BIT ADDER WITH THE ASSOCIATED CARRY-LOOKAHEAD CIRCUIT. IN FA+X-BBL, THE CARRY-LOOKAHEAD CIRCUIT IS IMPLEMENTED WITH BBL CS-C0 CELLS AND X-GATES; IN FA+CS-C2, THE CARRY-LOOKAHEAD CIRCUIT IS COMPOSED OF JUST ONE CS-C2 CELL; IN FA+CMOS, THE CARRY PATH IS IMPLEMENTED WITH CONVENTIONAL NAND AND NOR GATES; CSA+CS-C2 IS THE 4-BIT CARRY-SELECT ADDER WITH THE CARRY PATH CONSISTING OF CS-C2. V_DD = 1.5 V; T = 25 °C; F = 1 GHz; FanOut = 4 inverters
V. OPTIMIZATION OF THE 64-BIT ADDER
In the first part of this section, we will revisit the choices made for the maximum stack height of the BBL cells. Second, concerning the architecture, two elements must be considered. The carry-selection can occur with either 8-bit or 16-bit pre-sums. Moreover, the structure of the carry network itself can be further optimized with regard to the constraints of power and performance.
A. Stack Height
The power-delay product of the adder can be improved by a better balance between the cells with a stack height of two and the cells with a stack height of three. In the 4-bit adder cells, the stack height can be increased from two to three, since this part is not in the critical path in the carry-select structure. Instead of using half-adders (HAs) and multiplexers, efficient 28-transistor 1-bit CMOS full adders (FAs) are used in a 4-bit ripple-carry configuration [21]. To avoid the long delay for C3, which ripples through all the FA cells (Fig. 9), a carry-lookahead circuit is used. We implemented four versions and compared them with the original 4-bit carry-select adder (CSA) of Section III, which is referenced as CSA+CS-C2 in Table V. In each case, the carry network has NOR and NAND gates in the first stage to produce the conditional carry signals and (with or ). The first way to generate C3 is to use the complex BBL CS-C2 cell represented in Fig. 5, which is referenced as FA+CS-C2 in Table V.
Fig. 10. Carry logic for the 4-bit adder with C_in = 0.
The carry network with branch-based carry-select boxes can be redesigned to avoid high stacks, which favors speed. If we limit the height of the branches to two devices only, we propose another decomposition of the equation of C3:
(5)
(6)
(7)
By using the theorem of De Morgan, this expression becomes
(8)
The resulting circuit is shown in Fig. 10. Two complex BBL CS-C0 cells [see Fig. 2(a)], one two-input NOR, and one complementary X-gate are combined. It has the advantage of having a maximum of two PMOSFETs in the stack. Notice that we cannot use a complementary CS-C0 cell in the last stage. Indeed, the "never happens" condition that enabled the simplification of the BBL cells is not fulfilled here. If is at 1, this does not imply that is at 1. The circuit is combined with the 4-bit ripple-carry adder (RCA) and is referenced as FA+X-BBL in Table V. We can also replace the two remaining BBL CS-C0 gates by X-gates. This circuit is referenced as FA+X-gates. Finally, we can further decompose the cells so that this stage is designed using only inverters, NAND, and NOR gates. This case is FA+CMOS.
Table V presents the simulation results for the 4-bit adders with the carry-lookahead circuits. FA+X-BBL and FA+X-gates are almost identical in all respects and have the best energy efficiency. FA+CMOS, FA+CS-C2, and CSA+CS-C2 have a power-delay product which is about 25%–30% higher. The reduction of the stack height from three to two devices reduces the delay by 18% between FA+CS-C2 and FA+X-BBL.
Fig. 11. Block diagram of the 8-bit CLA adder with carry-in. The intermediate carry signal C5 (interrupted line) was added in order to enhance the speed of the 8-bit adder blocks. C3 and C5 are intermediate results produced by CS-C7.
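The role of the conditional carries in a 4-bit block can be made concrete with a small functional model. The sketch below does not reproduce the decomposition (5)–(8) or the circuit of Fig. 10; it uses the textbook recurrence in which each bit contributes a·b (carry for carry-in 0, the complement of a NAND output) and a+b (carry for carry-in 1, the complement of a NOR output). The function name is hypothetical.

```python
def conditional_carries(a, b, width=4):
    """Return (C0, C1): the block carry-out for carry-in 0 and for carry-in 1."""
    c0, c1 = 0, 1
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g = ai & bi   # bit carry when its carry-in is 0
        p = ai | bi   # bit carry when its carry-in is 1
        # Same CS-box function applied to both assumptions.
        c0 = g | (p & c0)
        c1 = g | (p & c1)
    return c0, c1
```

For every pair of 4-bit operands, the two results match the carries of `a + b` and `a + b + 1`, and C0 = 1 always implies C1 = 1. This is the "never happens" property used in Section III-B; as noted above, it does not automatically carry over to intermediate signals of a different decomposition.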
B. Structural Optimization
A 64-bit carry-select adder can make the selection either of 8-bit pre-sums [22], [23] or of 16-bit pre-sums [14]. The use of 8-bit adders allows the use of one carry-selection level instead of three as in the first design. There are multiple ways to combine two 4-bit adders to form the 8-bit adders. In the first possibility, we use C3, the 4-bit block carry from the first 4-bit adder, computed with the carry-lookahead circuit of the first stage. C3 is re-used in the cell CS-C7, producing the carry for the entire 8-bit adder block (Fig. 11). However, the speedup was not sufficient to produce the SUM signals on time, even in the carry-select architecture. Therefore, C5, the carry resulting from the six lowest-order bits, is fed directly into the FA for bit position 6, thus forming a second carry-lookahead path (Fig. 11). C5 is further used in CS-C7 to generate C7. By using intermediate results and sharing circuit blocks, the duplication of logic cells is avoided. This contributes to a lower area and lower power consumption.
Adder8×8b is composed of eight 8-bit blocks, the global carry network, and seven multiplexers (Fig. 12). The first adder block contains the 8-bit adder generating SUM0-7 and CS-C7, the circuit generating the intermediate carry signal C7. The seven other blocks are identical and are composed of four circuits. Two of these are 8-bit adders, used to produce the pre-sums, one assuming that the carry-in is at 0, the other assuming that the carry-in is at 1. A 2×8-input multiplexer selects the final sum output. The control signals for the multiplexers are generated by the global carry network. The two other circuits in the seven identical adder blocks generate the conditional carry signals which are fed into the global carry network. These two circuits are referenced as CS-cin0 and CS-cin1 in Fig. 12. By using 8-bit blocks which are repeated seven times, the design and layout time is made shorter than if sections of different lengths were used.
Fig. 12. 64-bit adder based on the selection of 8-bit pre-sums.
C. Global Carry Network
The global carry network generates the final carry-out, C63, and the intermediate carry signals C15, C23, C31, C39, C47, and C55. These signals command the multiplexers which select the appropriate 8-bit pre-sums.
In order to minimize the delay in the critical path, the following elements are taken into consideration: the number of successive stages, the input load presented to the CS-cin0 and CS-cin1 blocks, and the fan-out of each stage. In the decomposition that we propose below, the fan-out is limited to three, both for the conditional carry signals and for the intermediate carry signals that are re-used at different places in the global carry network.
The hot carry is C55, since it commands the multiplexer selecting the highest order sum signals. It is generated using all the intermediate carry signals except and :
(9)
Since in the global carry network the stack height is limited to two devices to favor speed, the equation of C55 is further decomposed in order to be able to implement this function with complex BBL CS-C0 gates and, where needed, X-gates.
The second control signal to generate is :
(10)
which can be decomposed as
(11)
C23 is already one of the intermediate carry signals needed to command the multiplexer selecting SUM24..31. The expressions of the complements of and are
(12)
(13)
and are shared with .
 is expressed as follows:
(14)
where are also shared with and C15 is the intermediate carry signal commanding the multiplexer selecting SUM16..23.
Finally, re-uses in its expression
(15)
The low-order intermediate signals and are re-used as inputs for cells generating higher order carry signals. In this way, we avoid the duplication of parts of the carry logic, which favors low power, and we keep the fan-out to a maximum of three, which is beneficial for speed.
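The idea of re-using lower-order results in the global carry network can be sketched functionally. The model below is an assumption-laden illustration, not the decomposition (9)–(15): it merges the conditional carry pairs of adjacent 8-bit blocks with the carry-select function and keeps each merged group for the next block, whereas the actual network is arranged to limit the number of stages and the fan-out. The names `combine` and `global_carries` are hypothetical, and the carries are indexed by the highest bit position they cover.

```python
def combine(low, high):
    """Merge the conditional carry pairs (C0, C1) of two adjacent groups."""
    l0, l1 = low
    h0, h1 = high
    return (h0 | (h1 & l0), h0 | (h1 & l1))


def global_carries(a, b, cin, width=64, block=8):
    """Return {bit index: carry out of that bit} at every block boundary."""
    mask = (1 << block) - 1
    pairs = []
    for pos in range(0, width, block):
        xa, xb = (a >> pos) & mask, (b >> pos) & mask
        # Conditional carries of one block, for carry-in 0 and 1.
        pairs.append(((xa + xb) >> block, (xa + xb + 1) >> block))
    carries, group = {}, None
    for k, pair in enumerate(pairs):
        # The group covering bits 0..8k+7 re-uses the previous group.
        group = pair if group is None else combine(group, pair)
        c0, c1 = group
        carries[block * (k + 1) - 1] = c0 | (c1 & cin)
    return carries
```

For 64-bit operands the returned dictionary has the keys 7, 15, ..., 63; the entries up to 55 correspond to the multiplexer control signals and the last one to the carry-out.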
D. Simulation Results
This adder has been simulated using the parameters of
the 0.18- m PD SOI CMOS process and is compared with
the experimental and simulation results of the first adder,Adder4 16b, which has a classical structure (Table VI). The
layout of Adder4 16b is taken as a reference to estimate
the lengths of the wires for the schematic simulations. For
Adder8 8b, the length of the long wires has been reduced by
25% to account for the lower die area thanks to the lower device
count. Considering the reduction in the number of devices
(67%), this is a rather conservative value.
The optimized version of the 64-bit adder shows a reduction
of about 20% of the maximum delay, thanks to three factors.
First, the lower number of buffers on the signal path accounts
for 6% of the delay reduction. Second, the use of stacks with
a maximum of two PMOSFETs further improves the speed by
TABLE VIDELAY, POWER CONSUMPTION, POWER-DELAY PRODUCT, AND DEVICE COUNTFOR THE REALIZED AND SIMULATED IMPLEMENTATIONS OF ADDER42 16b AND
THE SIMULATED IMPLEMENTATION OF ADDER8 2 8b IN THE 0.18- m CMOSTECHNOLOGY. V = 1 : 5 V; T = 2 5 C ; F = 1 GHz; F O = 4
Fig. 13. Layout of the 64-bit adder based on the selection of 8-bit pre-sums.
6%. The critical path in Adder8 8b includes a two-input NOR,
one CS-C0 stage and five X-gates produce , the hot carry.
Third, thanks to a more efficient architecture in Adder8 8b, the
capacitive load seen by the carry-select cells on the critical path
is reduced compared to Adder4 16b.
To estimate the dynamic power consumption, random
input patterns are applied at the inputs at a rate of 1 GHz. In
Adder8×8b, the dynamic power consumption is reduced by a
factor of three compared to the classical adder. This improvement
is associated with the higher stacks in the noncritical
path and the reduction of the number of cells in the proposed architecture.
Adder8×8b shows a 75% lower power-delay product than
Adder4×16b. About 20% of this improvement is associated
with the cell level, the remainder coming from the modifications
of the architecture. Adder8×8b has a power-delay product that
is 60% lower than that of the CPL adder. In this case, the architecture
accounts for 10% of the improvement, the remainder coming
from the different logic design styles.
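As a rough consistency check on these figures, the relative power-delay product can be estimated from the two reductions quoted above (about 20% in delay, a factor of three in dynamic power). The short Python sketch below uses a hypothetical helper, `pdp_ratio`, and the rounded values from the text, so the result is only approximate.

```python
def pdp_ratio(delay_reduction, power_factor):
    """Ratio of new to old power-delay product.

    delay_reduction: fractional delay reduction (0.20 means 20% faster)
    power_factor:    factor by which the power is divided
    """
    return (1.0 - delay_reduction) / power_factor


# Rounded values quoted in the text: ~20% lower delay, 3x lower power.
ratio = pdp_ratio(0.20, 3.0)
reduction_percent = 100.0 * (1.0 - ratio)
print(f"PDP ratio: {ratio:.3f}, reduction: {reduction_percent:.0f}%")
# prints "PDP ratio: 0.267, reduction: 73%"
```

With these rounded inputs the estimate is about 73%, which agrees with the reported 75% (roughly a factor of four) to within the rounding of the quoted delay and power figures.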
E. Optimization and Results in 0.13-µm PD SOI
Adder8×8b has finally been optimized with Einstuner in a
0.13-µm PD SOI CMOS technology. Einstuner is a circuit optimization package that automatically resizes the transistors [24].
The final layout area is 151 µm × 461 µm (Fig. 13). A netlist
has been extracted from the layout and is used to determine the
main features of the adder. The critical delay is 326 ps at 1.1 V
and 85 °C. Random input patterns at a rate of 2 GHz are used to
calculate the dynamic power, which is found to be as low as
23 mW. The static power is evaluated to be 380 µW. Our realization
is compared with state-of-the-art 64-bit adders published in
previous work in Fig. 14. The adder proposed here is faster than
the other realizations, even those realized in finer technologies.
Our adder features the extremely low power-delay product of
7.5 pJ.
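The quoted power-delay product follows directly from the delay and dynamic power reported above. A minimal Python sketch of the arithmetic is given below; the helper name `power_delay_product_pj` is an assumption made for this example.

```python
def power_delay_product_pj(delay_ps, power_mw):
    """Power-delay product in picojoules from delay (ps) and power (mW)."""
    delay_s = delay_ps * 1e-12      # picoseconds to seconds
    power_w = power_mw * 1e-3       # milliwatts to watts
    return delay_s * power_w * 1e12  # joules to picojoules


# Values reported for the 0.13-um adder: 326 ps critical delay, 23 mW dynamic power.
pdp = power_delay_product_pj(326.0, 23.0)
print(f"PDP = {pdp:.2f} pJ")  # 326 ps x 23 mW = 7.498 pJ, i.e. about 7.5 pJ
```

Only the dynamic power is used here, as in the text; the static power is not included in this product.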
Fig. 14. Our 64-bit adder, based on the selection of 8-bit pre-sums designed in 0.13-µm PD SOI, is compared with recent 64-bit adder realizations. Load = 40 fF; T = 85 °C. ISSCC00 is a domino carry-select/carry-lookahead adder in 0.18-µm PD SOI [25]. VLSI01 is a race logic carry-lookahead/carry-select adder in 0.18-µm bulk CMOS [22]. ESSCIRC02 is a Brent-Kung adder in 100-nm bulk CMOS [26]. VLSI02 is a carry-select/carry-lookahead adder in 0.1-µm PD SOI [23].
VI. CONCLUSION
In this paper, we presented a methodology to minimize the
power-delay product of 64-bit carry-select adders for high-performance
and low-power applications, by working at three
levels of abstraction: design style, cell arrangement, and adder
structure. We demonstrated that a complex-cell branch-based design
reduces the dynamic power consumption by 10% compared
to the design with decomposed cells, with a minimal impact on
the delay. The reduction of the CMOS cell stack height from three to two devices in the critical path proved to be beneficial for
speed. By avoiding the duplication of cells, without increasing
the fan-out on the critical path, we were able to improve the
speed while maintaining low power consumption. The structural
optimization, based on the selection of 8-bit
pre-sums, allowed the use of only one carry-select level, thus further
contributing to the reduction of the power dissipation thanks
to a lower number of cells. Compared to the classical design, the
power-delay product of the optimized adder has been reduced by
a factor of four. Compared with an equivalent CPL 64-bit adder,
our realization shows a 60% improvement of the power-delay
product. Finally, an automatic tuning tool allowed the design of
an energy-efficient adder in 0.13-µm PD SOI, with a power-delay product as low as 7.5 pJ. The approach presented in this paper
can be extended toward n-bit carry-select adders. However, the
number of carry-select levels and the adder architecture might
be different in order to obtain an efficient realization.
ACKNOWLEDGMENT
The authors express their sincere thanks to G. Hellner,
J. Keinert, V. Gernhoeffer, and U. Krauch for the help they
provided when using the simulation, layout, and extraction
tools. The authors also wish to thank R. Sautter and W. Haller for
useful discussions, and J. Appinger for his helpful assistance in
obtaining the experimental data.
REFERENCES
[1] T. H. Ning, "CMOS in the new millennium," in Proc. IEEE Custom Integrated Circuit Conf. (CICC), 2000, pp. 49–56.
[2] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.
[3] European Semiconductor Industry Association, Japan Electronics and Information Industries Association, Korea Semiconductor Industry Association, Taiwan Semiconductor Industry Association, and Semiconductor Industry Association, International Technology Roadmap for Semiconductors, System Drivers, 2001.
[4] C. Nagendra, R. M. Owens, and M. J. Irwin, "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Syst., vol. 2, pp. 377–381, Sept. 1994.
[5] S. Naffziger, "A subnanosecond 0.5 µm 64b adder," in Proc. IEEE Int. Solid-State Circuits Conf., Slide Supplement, 1996.
[6] J. Hennessy, D. Patterson, and D. Goldberg, Computer Architecture, A Quantitative Approach, 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996.
[7] M. Allam, M. Anis, and M. Elmasry, "Effect of technology scaling on digital CMOS logic styles," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2000, pp. 19-1-1–19-1-8.
[8] R. Yung, S. Rusu, and K. Shoemaker, "Future trend of microprocessor design," in Proc. ESSCIRC 2002, pp. 43–46.
[9] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.
[10] P.-F. Lu, C.-T. Chuang, J. Ji, L. F. Wagner, C.-M. Hsieh, J. B. Kuang, L. L.-C. Hsu, M. M. Pelella, S.-F. Sanford, and C. J. Anderson, "Floating-body effects in partially depleted SOI CMOS circuits," IEEE J. Solid-State Circuits, vol. 32, pp. 1241–1253, Aug. 1997.
[11] N. Subba, A. Salman, S. Mitra, D. E. Ioannou, and C. Tretz, "Pseudo-NMOS revisited: Impact of SOI on low power, high speed circuit design," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 26–27.
[12] C. R. Tretz, R. K. Montoye, and W. Reohr, "Ratioed CMOS: A low power high speed design choice in SOI technologies," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 28–29.
[13] J. M. Masgonty, C. Arm, and C. Piguet, "Technology- and power-supply-independent cell library," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 1991, pp. 25.5/1–25.5/4.
[14] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. New York: Wiley, 1979.
[15] S. Zaker and J. Zahnd, "OPTIMOS: a branch-level digital circuit optimizer," in Proc. EURO ASIC, 1993, pp. 563–572.
[16] G. A. Katopis, W. D. Becker, T. R. Mazzawy, H. H. Smith, C. K. Vakirtzis, S. A. Kuppinger, B. Singh, P. C. Lin, J. Bartells Jr., G. V. Kihlmire, P. N. Venkatachalam, H. I. Stoller, and J. L. Frankel, "MCM technology and design for the S/390 G5 system," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 621–650, Sept.–Nov. 1999.
[17] G. G. Shahidi, "SOI technology for the GHz era," IBM J. Res. Develop., vol. 46, no. 2/3, pp. 121–131, Mar./May 2002.
[18] I. Aller and K. E. Kroell, "Detailed analysis of the gate delay variability in partially depleted SOI CMOS circuits," in Proc. IEEE Int. SOI Conf., Oct. 1999, pp. 40–41.
[19] A. Nève, D. Flandre, H. Schettler, T. Ludwig, and G. Hellner, "Design of a branch-based 64-bit carry-select adder in 0.18 µm partially-depleted SOI CMOS," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 108–111.
[20] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8 ns CMOS 16×16-b multiplier using complementary pass-transistor logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388–395, Apr. 1990.
[21] K. Martin, Digital Integrated Circuit Design. Oxford, U.K.: Oxford Univ. Press, 2000.
[22] S. J. Lee, R. Woo, and H. J. Yoo, "480 ps 64-bit race logic adder," in Symp. VLSI Circuits Dig. Tech. Papers, 2001, pp. 27–28.
[23] J. J. Kim, R. Joshi, C.-T. Chuang, and K. Roy, "SOI-optimized 64-bit high-speed CMOS adder design," in Symp. VLSI Circuits Dig. Tech. Papers, 2002, pp. 122–125.
[24] X. Bai, C. Visweswariah, P. Strenski, and D. Hathaway, "Uncertainty-aware circuit optimization," in Proc. 39th Design Automation Conf., June 2002, pp. 58–63.
[25] D. Sastiak, J. Tran, F. Mounes-Toussi, and S. Storino, "A 2nd generation 440 ps SOI 64 b adder," in IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp. 288–289.
[26] M. Garg and A. Katoch, "Evaluation of skew tolerance in delayed clocking scheme for dynamic circuits," in Proc. ESSCIRC, Sept. 2001, pp. 396–399.
Amaury Nève (M'96) received the electrical engineering and the Ph.D. degrees from the Université catholique de Louvain, Louvain-la-Neuve, Belgium, in 1998 and 2004, respectively.
From 1998 to 2003, he was Research Assistant with the Microelectronics Laboratory of the Université Catholique de Louvain. He was involved in the development of design techniques for high-speed and low-power digital circuits in advanced silicon-on-insulator (SOI) processes. In May 2003, he joined the IBM Research and Development Laboratory in Böblingen, Germany, where he is working on circuit design for advanced CMOS processes.
Helmut Schettler received the Dipl.-Ing. degree in electrical engineering from the University of Stuttgart, Stuttgart, Germany, in 1969.
In 1969, he joined the IBM Laboratory, Böblingen, Germany. He worked for three years in the IBM Laboratories, East Fishkill, NY, and Burlington, VT, where he was involved in bipolar and CMOS circuit and chip design for memory and µ-processor applications. He holds 23 patents and was the leading circuit designer when IBM moved its server technology from bipolar to CMOS. Presently, he is also involved in lecturing at a university of applied science in Stuttgart.
Thomas Ludwig (M'90) was born in Sindelfingen, Germany, in 1957. He received the Master of Electrical Engineering from the Technische Universität, Berlin, Germany, in 1983.
He joined IBM in 1984, at the German Research and Development Laboratory, Böblingen, working on high-speed digital driver/receiver circuits. In 1992, he was on an assignment with the joint IBM/Intel Noyce Development Center, Boca Raton, FL, working on technology conversion of a µ-processor. He is currently Senior Engineer and Leader of the Future Product Technology Team, IBM Systems Group, Böblingen, Germany, responsible for future silicon technologies in the field of high-speed server processors. His current research interests are in the areas of silicon-on-insulator technology, especially FinFET circuit design, modeling and the influence on CAD tools.
Denis Flandre (M'86–SM'03) was born in Charleroi, Belgium, in 1964. He received the Electrical Engineer degree, the Ph.D. degree, and the Postdoctoral thesis degree from the Université Catholique de Louvain (UCL), Louvain-la-Neuve, Belgium, in 1986, 1990, and 1999, respectively. His doctoral research was on the modeling of silicon-on-insulator (SOI) MOS devices for characterization and circuit simulation, and his Postdoctoral thesis was on a systematic and automated synthesis methodology for MOS analog circuits.
In 1985, he was a summer Student Trainee at NTT Headquarters, Tokyo, Japan. From October 1990 to September 1991, he was with the Centro Nacional de Microelectrónica, Barcelona, Spain, working on the characterization and numerical simulation of SOI MOS process and devices. He was then at the Laboratoire de Microélectronique (DICE), Louvain-la-Neuve, Belgium, as a Senior Research Associate of the National Fund for Scientific Research (FNRS, Belgium). Since 2001, he has been a full-time Professor at UCL giving courses on integrated analog circuit design, device physics, etc. He is currently involved in the research and development of digital and analog SOI MOS circuits for special applications, more specifically high-speed, low-voltage low-power, microwave, rad-hard, and high-temperature electronics.
Prof. Flandre has been the recipient of the 1992 Biennial Siemens-FNRS Award for an original contribution in the fields of electricity and electronics, the 1997 Wernaers Prize for innovation in pedagogical presentation of advanced research work, and the 1999 CEN SCK Prize for innovation in nuclear science instrumentation. He has authored or coauthored more than 160 technical papers or conference contributions. He is a member of the Advisory Board of the EU Network of Excellence for High-Temperature Electronics (HITEN), of the Scientific Board of the Microserv large infrastructure EU program of the CNM-Barcelona and of the Director Board of the Cyclotron Research Center (CRC, Louvain-la-Neuve, Belgium). He is a founding member of the CERMIN (Centre de Recherche en Dispositifs et Matériaux Electroniques Micro- et Nanoscopiques of UCL). He is a cofounder of CISSOID S.A., a startup company, spun off of UCL in July 2000, focusing on SOI circuit design services.