Power-Delay Product Minimization


    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 3, MARCH 2004 235

    Power-Delay Product Minimization in High-Performance 64-bit Carry-Select Adders

    Amaury Nève, Member, IEEE, Helmut Schettler, Thomas Ludwig, Member, IEEE, and Denis Flandre, Senior Member, IEEE

    Abstract: This paper analyzes methods to minimize the power-delay product of 64-bit carry-select adders intended for high-performance and low-power applications. A first realization in 0.18-µm partially depleted (PD) silicon-on-insulator (SOI), using complex branch-based logic (BBL) cells, results in a delay of 720 ps and a power dissipation of 96 mW at 1.5 V. The reduction of the stack height in the critical path, combined with the optimization of the global carry network with cell sharing and the selection of 8-bit pre-sums, leads to a reduction of the power-delay product by 75%. The automatic tuning of the transistor widths in 0.13-µm PD SOI produces an energy-efficient 64-bit adder which has a delay of 326 ps and a power dissipation of only 23 mW at 1.1 V.

    Index Terms: Adder, digital CMOS, high performance, low power, power-delay product, silicon-on-insulator technology.

    I. INTRODUCTION

    Today, one of the major challenges for high-performance microelectronic systems is power dissipation, both static and dynamic [1]-[3]. The circuit designer must, therefore, find an optimum between power and speed instead of targeting them independently. This optimum is captured by the power-delay product, which represents the average energy dissipated for one switching event [4].
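As a worked illustration of the metric, the following minimal sketch multiplies the delay and power figures quoted in the abstract. The function name `power_delay_product` is an assumption introduced here. The two designs use different technologies and supply voltages, so the two results are not a like-for-like comparison.

```python
def power_delay_product(delay_s: float, power_w: float) -> float:
    """Return the power-delay product (an energy) in joules."""
    return delay_s * power_w


# Figures quoted in the abstract.
pdp_first = power_delay_product(720e-12, 96e-3)   # 0.18-um realization, 1.5 V
pdp_tuned = power_delay_product(326e-12, 23e-3)   # 0.13-um realization, 1.1 V

print(f"first realization: {pdp_first * 1e12:.2f} pJ")
print(f"tuned realization: {pdp_tuned * 1e12:.2f} pJ")
```

The products come out at roughly 69 pJ and 7.5 pJ, respectively.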

    In this study, we investigate design methods to minimize the power-delay product of 64-bit adders in partially depleted (PD) silicon-on-insulator (SOI) technology. Addition is used as a benchmark here since it is one of the important tasks performed by the CPU, considering that adders are needed in the arithmetic and logic units, for memory address generation, and for floating-point calculations [5], [6]. The power-delay product is improved at the different hierarchical levels of the design: circuit design style, cell decomposition, and global architecture. Section II discusses design styles that can be used for low-power and high-performance VLSI systems in SOI. In Section III, we compare two possible implementations of branch-based cells used for the 64-bit adder, which has a classical carry-select architecture. The experimental results of this realization are discussed in Section IV and are compared with a complementary pass-gate adder. The optimization of the adder structure, the global carry network, and the cells is presented in Section V. The proposed adder is compared with the original carry-select adder. Section V also gives the results for the optimized implementation of the 64-bit adder in 0.13-µm PD SOI, and compares them with state-of-the-art 64-bit adders from the literature.

    Manuscript received February 28, 2003; revised June 30, 2003. The work of A. Nève was supported in part by the Walloon Region of Belgium.

    A. Nève was with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium. He is now with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]).

    H. Schettler and T. Ludwig are with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: [email protected]; [email protected]).

    D. Flandre is with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TVLSI.2004.824305

    II. LOGIC CIRCUIT DESIGN IN SOI

    The use of SOI technology instead of bulk CMOS opens new possibilities in the choice of the design style and in the design space itself. In Section II-A, we discuss some possible design styles for PD SOI. Among them, branch-based logic, a restricted version of static CMOS logic, seems very promising and is presented in Section II-B.

    A. Circuit Design Styles

    In this study, we concentrate on static design styles, since the performance advantage of both dynamic logic styles and pass-gate design is expected to decrease in future deep-submicron technologies [3], [7]. The features of lower dynamic power consumption and higher noise margin make static CMOS particularly attractive [8], [9]. Moreover, the activation of the parasitic bipolar transistor in PD SOI is reported to result in fatal erroneous states in dynamic logic and to make circuit design with pass-gates more difficult [10]. The renewed interest in static design styles like pseudo-NMOS [11] and ratioed CMOS [12] shows that alternative design styles are investigated in SOI in order to reduce the power dissipation while still maintaining high-speed performance.

    B. Branch-Based Circuit Design Style

    In the branch-based logic (BBL) design style, a logic cell is made only of branches that contain a few transistors in series [13]. The branches are connected in parallel between the power supply lines and the common output node. Many common static CMOS gates, such as the inverter and the NAND and NOR gates, already have a branch structure. By using the branch-based concept, it is possible to minimize the number of internal connections, and thus the parasitic capacitances associated with the diffusions, interconnections, and contacts. As BBL belongs to the family of static CMOS, it presents high noise margins and robustness to device scaling and voltage scaling.

    The optimal design point will depend on whether the emphasis is placed on low power, high speed, or a compromise between the two.

    1063-8210/04$20.00 2004 IEEE

    Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on April 24,2010 at 06:30:22 UTC from IEEE Xplore. Restrictions apply.


    Fig. 1. 64-bit carry-select adder.

    TABLE I. LOGIC EQUATIONS AND CIRCUIT BLOCKS USED FOR THE INTERMEDIATE CARRY SIGNALS

    III. DESIGN OF THE CARRY-SELECT ADDER

    A. Adder Structure

    In our first design, the 64-bit adder was classically divided into four 16-bit sections, as shown in Fig. 1 [14]. In each section, two 16-bit adders generate the sum outputs with the carry-in at 0 and at 1, respectively. The true sum is selected by a multiplexer. The control signals for the multiplexers are the carry-in and the intermediate carry signals C15, C31, and C47. The intermediate carry signals and the carry-out are produced by the carry-select boxes (CS-boxes). The inputs of the CS-boxes are the carry-in and the conditional carry signals of the four sections (bit positions 0-15, 16-31, 32-47, and 48-63), which are all generated simultaneously. Each conditional carry signal is the block carry signal of one section, for example for bit positions 0 to 15, computed assuming that the carry-in is at 0 or at 1. For the sake of clarity, the indexes of the conditional carry signals have been simplified in Fig. 1. The carry-out and the intermediate carry signals are computed according to the equations presented in Table I. To compute the carry-out, the intermediate carry C47 is combined with the two conditional carry signals of bit positions 48-63 in one CS-C0 stage to avoid a complex CS-C3 stage.

    In turn, the 16-bit adder blocks are implemented as carry-select adders, with 4-bit adder blocks having a carry-in at either 0 or 1. At the 16-bit level, the same CS-boxes can be used as at the 64-bit level, with the sizing of the transistors adapted to the particular load conditions.

    Finally, the 4-bit adder blocks can be implemented as ripple-carry adders or as carry-select adders; the latter was chosen here. The carry-select architecture is thus used at three different levels: in the 64-bit, the 16-bit, and the 4-bit adders.
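The selection principle described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit of Fig. 1: the function names `block_add` and `carry_select_add` are hypothetical, every section (including the first) computes both pre-sums, and the carries are resolved in a software loop, whereas the hardware produces them in parallel with CS-boxes.

```python
def block_add(a: int, b: int, carry_in: int, width: int) -> tuple[int, int]:
    """Add two width-bit operands; return (sum modulo 2**width, carry-out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, carry_in: int = 0,
                     width: int = 64, block: int = 16) -> tuple[int, int]:
    """Behavioral carry-select addition with `block`-bit sections."""
    mask = (1 << block) - 1
    result = 0
    carry = carry_in
    for pos in range(0, width, block):
        a_blk = (a >> pos) & mask
        b_blk = (b >> pos) & mask
        # Two pre-sums per section: carry-in assumed at 0 and at 1.
        sum0, c0 = block_add(a_blk, b_blk, 0, block)
        sum1, c1 = block_add(a_blk, b_blk, 1, block)
        # Multiplexer: the true carry selects the pre-sum.
        result |= (sum1 if carry else sum0) << pos
        # Conditional carries resolve the carry into the next section.
        carry = c0 | (c1 & carry)
    return result, carry


print(carry_select_add(0xFFFF, 1))        # carry crosses the first section
print(carry_select_add(2**64 - 1, 1))     # carry ripples out of bit 63
```

The update `carry = c0 | (c1 & carry)` is the standard conditional-carry relation; it is used here as an assumption, since the entries of Table I are not available in this transcript.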

    B. Design of the Carry-Select Boxes

    A CS-box can be implemented in different ways, depending on its number of inputs and complexity. The CS-C0 gate is designed starting from the logic equation presented in Table I. The P-part is common to the BBL version and the CMOS OR-AND-INVERT (OAI) gate, which is further referenced as the X-gate. It can be designed immediately, just by observing that the input variables must be complemented in the logic equation (Fig. 2).

    Fig. 2. One-stage complex CS-C0: (a) BBL and (b) X-gate.

    Fig. 3. Two-stage decomposed CS-C0 cell.

    For the NMOS part, the complement is taken:

    (1)

    (2)

    Equation (2) defines the N-part of the X-gate represented in Fig. 2(b). Normally, the N-part of the BBL cell would require four transistors. However, to reduce the input capacitance and the internal parasitic capacitances, (2) can be simplified by noticing that

    (3)

    This is true since the input combination in which the conditional carry for a carry-in of 0 is at 1 AND the conditional carry for a carry-in of 1 is at 0 never happens, as a result of the properties of the conditional carry signals. The resulting equation is

    (4)

    CS-C0 can thus be implemented as a complex cell in one stage, obeying the BBL principle [Fig. 2(a)], or as an X-gate, allowing for internal connections between branches [Fig. 2(b)]. These X-gates will be used in the optimized version of the adder described in Section V.
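The "never happens" argument can be checked exhaustively. The sketch below assumes the standard conditional-carry relation C = C0 OR (C1 AND Cin); the exact forms of equations (1)-(4) are not available in this transcript, so the two pull-down expressions, the function names, and the variable names are assumptions for illustration only.

```python
from itertools import product


def n_part_full(c0: bool, c1: bool, cin: bool) -> bool:
    """Assumed complement of C = C0 or (C1 and Cin): an OR-AND structure."""
    return (not c0) and ((not c1) or (not cin))


def n_part_simplified(c0: bool, c1: bool, cin: bool) -> bool:
    """Assumed simplified pull-down: one single device plus one stack of two."""
    return (not c1) or ((not c0) and (not cin))


def reachable(c0: bool, c1: bool) -> bool:
    """A block that carries with carry-in 0 also carries with carry-in 1."""
    return not (c0 and not c1)


mismatches = [
    (c0, c1, cin)
    for c0, c1, cin in product([False, True], repeat=3)
    if n_part_full(c0, c1, cin) != n_part_simplified(c0, c1, cin)
]

# Every mismatch lies on the input combination that never happens.
print(all(not reachable(c0, c1) for c0, c1, _ in mismatches))
```

Under these assumptions, the two expressions differ only when C0 is at 1 while C1 is at 0, so the simplified network is functionally equivalent for all inputs that can occur.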

    CS-C0 can also be implemented in two stages, using two classical NANDs and one inverter (Fig. 3). Notice that the latter is a two-stage BBL implementation of CS-C0, since NAND gates can be considered as elementary BBL cells. This will be referenced as decomposed-cell design in the remainder of this paper.

    Figs. 4 and 5 represent, respectively, the complex BBL one-stage implementation of CS-C1 and the complex BBL two-stage implementation of CS-C2. Using only one stage for the latter requires stacks of four transistors, which appeared to be too slow. The carry indexes have been simplified for the sake of clarity in the figures. In Section III-C, these complex cells are compared with equivalent decomposed-cell designs, using NOR and NAND gates.

    Fig. 4. One-stage complex BBL CS-C1 cell.

    Fig. 5. Two-stage complex BBL CS-C2 cell.

    C. Results for the Single Cells

    The complex and decomposed-cell BBL gates have been optimized for speed as follows. For the optimization process, each cell is loaded with one CMOS inverter (fan-out of 1). The nominal gate length is chosen at the minimum allowed drawing dimension. We carefully optimized the gate widths by hand, using an iterative process. In the first step, the critical branch is identified; this is the branch that has the longest delay when the cell switches. The speed of a gate is indeed limited by the slowest branch [15]. Often this corresponds to the branch with the largest number of transistors in series. The input pattern that activates the top transistor of the critical branch in one particular cell is applied at the input. With the AS/X circuit simulator [16], a sweep is made on the ratio of this branch, all other ratios remaining constant. Thereafter, the ratio of the other branches is further tuned to lower the capacitance of the output node, to which all the branches are connected. After this first step, we can turn back to the critical branch and refine the choice of its gate widths. Most of the time, the second step leads only to minor changes in the ratios of the branches.
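The hand-tuning loop described above resembles a coordinate-descent search: one branch is swept while the others are held constant, and the process is repeated. The sketch below illustrates that idea on a toy model only. The delay function `worst_delay`, its constants, the width grid, and the branch resistances are all hypothetical and are not taken from the paper or from the AS/X simulator.

```python
def worst_delay(widths, stack_res, c_load=4.0, c_par=1.0):
    """Toy RC model: every branch loads the output node; the slowest branch sets the delay."""
    c_out = c_load + c_par * sum(widths)
    return max(r / w * c_out for r, w in zip(stack_res, widths))


def tune(widths, stack_res, grid, passes=3):
    """Sweep one branch width at a time over `grid`, holding the others constant."""
    widths = list(widths)
    for _ in range(passes):
        for i in range(len(widths)):
            widths[i] = min(
                grid,
                key=lambda w: worst_delay(widths[:i] + [w] + widths[i + 1:], stack_res),
            )
    return widths


# Hypothetical cell with three branches of stack height 1, 2, and 3.
stack_res = [1.0, 2.0, 3.0]
grid = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
initial = [1.0, 1.0, 1.0]
tuned = tune(initial, stack_res, grid)

print(worst_delay(initial, stack_res), worst_delay(tuned, stack_res))
```

Because each sweep includes the current width in the grid, the worst-case delay of this toy model can never increase from one step to the next.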

    We evaluated the performance of the basic building blocks of the adder by using circuit simulations. The simulations are based on the schematic specification of the circuit. The device models of the 0.18-µm PD SOI process include the parasitic capacitances of source, drain, and gate. The model also accounts for the parasitic capacitances associated with the contacts.


    TABLE II. SIMULATED DELAY. V_DD = 1.5 V; T = 25 °C; FO = 1

    TABLE III. SIMULATED STATIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1

    TABLE IV. SIMULATED DYNAMIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1; F = 1 GHz

    The results for the delay, static power consumption, and dynamic power consumption are presented, respectively, in Tables II-IV. We compare two possible BBL implementations of the cells: complex-cell design (COMPLEX) and decomposed-cell design (DEC), using only NAND, NOR, and INV gates. Two categories clearly appear. Circuits with few inputs and few branches, like the half-adder and CS-C0, perform much better in their complex form, with speed increases of 8% and 20%, respectively. Circuits with more inputs, like CS-C1 and CS-C2, perform worse in their complex form than in their decomposed form. The poor performance of these two complex cells can be explained by the combination of two factors: the presence of branches with a stack of three transistors, and a high number of branches connected to the output node, which increases the parasitic capacitance at the output. Moreover, the critical path in the two decomposed circuits includes a stack of three NMOS devices, whereas in the two complex-cell circuits, a stack of three PMOS devices is activated in the worst case.

    Floating-body effects are known to produce uncertainties in the circuit results. In particular, the hysteresis effect is associated with the dependence of the body potential on the switching history of the gates and appears as an additional delay variation [17]. With a tool based on the methodology presented in [18], we evaluated that the impact of the hysteresis effect on these cells is less than 5%.

    For the half-adder and CS-C2, the static power dissipation of the complex and decomposed versions is very close, since these complex BBL cells comprise multiple stages, like the decomposed versions, and hence multiple leakage paths between the power supply and ground. The complex CS-C0 and CS-C1 cells benefit from the design in one stage compared to their decomposed counterparts.

    Fig. 6. Image of the test chip. The locations of the BBL adder (BBL ADD) and the CPL adder (CPL ADD) are highlighted.

    The dynamic power dissipation is compared for the same logic cells in Table IV. The last column presents the dynamic power reduction when comparing the complex implementation and the decomposed implementation. The complex cells dissipate between 30% and 43% less dynamic power than the decomposed cells. The main factor that explains why the complex cells have a lower power dissipation than the decomposed cells is the lower number of internal nodes. In the CS-C0 cell, for example, there is only one highly capacitive node at the output in the complex cell [see Fig. 2(a)]. In the decomposed cell (Fig. 3), there are two internal nodes and one output node, though with a lower parasitic capacitance than in the case of the complex cell. When the output of the gate switches, this means that at least one of the internal nodes also switches, with a full-swing charge/discharge of the associated parasitic capacitance.

    IV. FIRST REALIZATION OF THE 64-BIT ADDER

    The design and layout of the 64-bit carry-select adder presented in Section III have been realized for the CMOS 0.18-µm PD SOI process [19]. The adder is composed of 18 k devices and occupies an area of 735 µm × 280 µm (Fig. 6, BBL ADD). The same layout can be used for each 16-bit section, as they are repeated four times.

    All the cells are implemented using the BBL design style, with the exception of the multiplexers, which are transmission-gate multiplexers. For this version of the adder, we fixed the maximum stack height at three MOSFETs, which enables us to use the complex BBL gates discussed in the previous section.

    In the remainder of this paper, the adder implemented here is referenced as Adder4×16b. This notation refers to the fact that we make a selection of 16-bit pre-sums at the 64-bit adder level.

    A. Critical Path

    The critical path to the carry-out (see Fig. 1) and to the sum outputs is described here. At the 4-bit adder level, the critical path involves one NAND/NOR and one CS-C2 cell, which generates the carry of the 4-bit block. A second CS-C2 cell at the 16-bit adder level generates the intermediate carry signals of the 16-bit blocks. These signals would then be fed into the CS-C1 and CS-C2 cells of the 64-bit adder to generate the respective intermediate carries. Implemented this way, however, the delay on the critical carry-out path could be too high for two reasons. First, as seen previously, the delay in a CS-C2 cell is about 30% higher than in a CS-C1 cell. Second, the cells generating these intermediate carry signals in the 16-bit adders see a high capacitive load at the inputs of the CS-C1 and CS-C2 cells at the 64-bit level.

    In our design, we favor the carry-out generation in two ways at the 64-bit level (Fig. 7). First, since the delay in CS-C2 is the largest, these intermediate carry signals are fed directly into CS-C2 for the generation of the last intermediate carry. To compute the carry-out, only one additional stage is then necessary, i.e., CS-C0. Second, buffers are added on the signal path to the inputs of CS-C1 in order to reduce the capacitive load seen by the outputs of the cells generating these signals. In this way, these signals do not see the high capacitive load of the CS-C1 cell.

    Fig. 7. Detail of the critical path. Buffers are added on the path to CS-C1 in order to reduce the capacitive load seen by the previous stage.

    B. Simulation Results

    A second 64-bit carry-select adder has been designed with the decomposed cells for comparison purposes. AS/X simulations of these adders based on the cell schematics are used to compare both versions. At 1.5 V, 25 °C, and with a capacitive output load of 150 fF, the BBL complex-cell adder features a dynamic power consumption which is reduced by 10% compared to the decomposed-cell adder, with random input patterns applied at a rate of 1 GHz. The delay increase associated with the complex-cell design is less than 2% for supply voltages up to 1.5 V. For higher supply voltages, the speed difference between the two increases slightly, but remains lower than 5%. For worst case input patterns, the peak dynamic power consumption is reduced by 16% in the complex-cell adder compared to the decomposed-cell implementation. The overall reduction of dynamic power dissipation is lower than for the individual cells, because the complete adder also involves inverters and multiplexers, which are similar in both realizations.

    An equivalent bulk realization of the adder consumes 15% more dynamic power and is 29% slower than the PD SOI version. This is associated with the lower junction capacitances in SOI.

    A complete netlist has been extracted from the 64-bit complex-cell adder layout in order to precisely determine the static and dynamic power consumption, taking all the interconnections and parasitic elements into account. To compute the dynamic power, we apply random patterns at the inputs of the extracted netlist. Integration between 5 and 45 ns, with a bit rate of 1 GHz, a supply voltage of 1.5 V, and a temperature of 25 °C, results in a dynamic power consumption of 96 mW. The peak dynamic power consumption occurs when all the inputs switch at the same time, and rises up to 177 mW. The static power consumption is about 500 µW at 1.5 V and 25 °C.

    Fig. 8. Experimental delay of the 64-bit BBL and CPL adders for different V_DD values at +25 °C. The critical delay is obtained for SUM24..31 for the BBL adder and for the carry-out in the case of the CPL adder.
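As a rough worked example of what these figures imply, the sketch below converts the reported average dynamic power into an energy over the integration window and an energy per applied input pattern. It assumes one input pattern per nanosecond (the 1 GHz bit rate) and treats 96 mW as the average over the whole 5-45 ns window; the variable names are introduced here for illustration.

```python
# Figures reported for the extracted netlist at 1.5 V and 25 degrees C.
avg_dynamic_power_w = 96e-3
peak_dynamic_power_w = 177e-3
t_start_s, t_stop_s = 5e-9, 45e-9
bit_rate_hz = 1e9

window_s = t_stop_s - t_start_s                   # integration window
energy_j = avg_dynamic_power_w * window_s         # energy over the window
patterns = window_s * bit_rate_hz                 # input patterns applied
energy_per_pattern_j = energy_j / patterns
peak_to_average = peak_dynamic_power_w / avg_dynamic_power_w

print(f"energy over window: {energy_j * 1e9:.2f} nJ")
print(f"energy per pattern: {energy_per_pattern_j * 1e12:.1f} pJ")
print(f"peak/average power: {peak_to_average:.2f}")
```

Under these assumptions the window holds about 40 patterns, for about 3.84 nJ in total and about 96 pJ per pattern.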

    C. Experimental Results

    The experimental realization of the 64-bit adder has been tested under different voltage and temperature conditions and operates successfully. Worst case input patterns have been applied and the propagation delays to the sum and carry outputs measured.

    In the worst case situation and at 1.5 V, the final carry is produced after 600 ps. Thanks to the independent carry network composed of the CS-boxes, it arrives earlier than the last sum outputs, which arrive after 720 ps in the worst case.

    On the same chip as the BBL adder, a complementary pass-gate logic (CPL) 64-bit carry-select adder has been realized (CPL ADD in Fig. 6). The CPL adder makes the selection of 16-bit pre-sums and uses CPL cells based on the work presented in [20]. Fig. 8 shows the critical delay for the CPL adder and for the BBL adder. For low operating voltages, the BBL adder is clearly better than the CPL adder. Indeed, BBL cells do not suffer from the voltage drop due to the threshold voltage in single-rail pass-gates. This voltage drop increases in relative terms when moving to lower supply voltages. For high voltages, CPL and BBL are able to achieve similar performance. These results confirm the statement of [7] that the performance of CPL-like logic styles degrades much faster than that of other design styles due to the decreasing ratio of supply voltage to threshold voltage in deep-submicron technologies. The dynamic power consumption of the CPL adder is about 50 mW. This is about one half of the power consumption of Adder4×16b. Two factors explain the power advantage of the CPL adder: a different structure of the 4-bit adders, implying the use of a lower number of cells, and a lower switched capacitance thanks to the low number of NMOS and especially PMOS devices in the design (2999 PMOS devices and 4992 NMOS transistors in CPL, versus 9077 PMOS and 9142 NMOS devices in the BBL adder).
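The device counts quoted above can be tallied to make the comparison concrete. This is plain arithmetic on the reported numbers; the dictionary layout and variable names are assumptions made for the illustration.

```python
# Device counts reported for the two 64-bit adders on the test chip.
cpl = {"pmos": 2999, "nmos": 4992}
bbl = {"pmos": 9077, "nmos": 9142}

cpl_total = sum(cpl.values())
bbl_total = sum(bbl.values())

device_ratio = cpl_total / bbl_total           # CPL devices relative to BBL
cpl_pmos_share = cpl["pmos"] / cpl_total       # fraction of PMOS in CPL
bbl_pmos_share = bbl["pmos"] / bbl_total       # fraction of PMOS in BBL

print(cpl_total, bbl_total)                    # 7991 18219
print(f"{device_ratio:.2f} {cpl_pmos_share:.2f} {bbl_pmos_share:.2f}")
```

The BBL total of 18 219 devices is consistent with the "18 k devices" stated for the first realization, and the CPL adder uses well under half as many devices.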


    Fig. 9. Four-bit ripple-carry adder (RCA) with carry-in = 0.

    TABLE V. SIMULATION RESULTS FOR THE 4-BIT ADDER WITH THE ASSOCIATED CARRY-LOOKAHEAD CIRCUIT. IN FA+X-BBL, THE CARRY-LOOKAHEAD CIRCUIT IS IMPLEMENTED WITH BBL CS-C0 CELLS AND X-GATES; IN FA+CS-C2, THE CARRY-LOOKAHEAD CIRCUIT IS COMPOSED OF JUST ONE CS-C2 CELL; IN FA+CMOS, THE CARRY PATH IS IMPLEMENTED WITH CONVENTIONAL NAND AND NOR GATES; CSA+CS-C2 IS THE 4-BIT CARRY-SELECT ADDER WITH THE CARRY PATH CONSISTING OF CS-C2. V_DD = 1.5 V; T = 25 °C; F = 1 GHz; FAN-OUT = 4 INVERTERS

    V. OPTIMIZATION OF THE 64-BIT ADDER

    In the first part of this section, we will revisit the choices made for the maximum stack height of the BBL cells. Second, concerning the architecture, two elements must be considered. The carry-selection can occur with either 8-bit or 16-bit pre-sums. Moreover, the structure of the carry network itself can be further optimized with regard to the constraints of power and performance.

    A. Stack Height

    The power-delay product of the adder can be improved by a better balance of the cells with a stack height of two and the cells with a stack height of three. In the 4-bit adder cells, the stack height can be increased from two to three, since this part is not in the critical path in the carry-select structure. Instead of using half-adders (HAs) and multiplexers, efficient 28-transistor 1-bit CMOS full adders (FAs) are used in a 4-bit ripple-carry configuration [21]. To avoid the long delay for the block carry, which ripples through all the FA cells (Fig. 9), a carry-lookahead circuit is used. We implemented four versions and compared them with the original 4-bit carry-select adder (CSA) of Section III, which is referenced as CSA+CS-C2 in Table V. In each case, the carry network has NOR and NAND gates in the first stage to produce the conditional carry signals. The first way to generate the block carry is to use the complex BBL CS-C2 cell, represented in Fig. 5, which is referenced as FA+CS-C2 in Table V. The carry network with branch-based carry-select boxes can be redesigned to avoid high stacks, which will favor speed.

    Fig. 10. Carry logic for the 4-bit adder with C = 0.

    If we limit the height of the branches to two devices only, we propose another decomposition of the block carry equation:

    (5)

    (6)

    (7)

    By using the theorem of De Morgan, this expression becomes

    (8)

    The resulting circuit is shown in Fig. 10. Two complex BBL CS-C0 cells [see Fig. 2(a)], one two-input NOR, and one complementary X-gate are combined. It has the advantage of having a maximum of two PMOSFETs in the stack. Notice that we cannot use a complementary CS-C0 cell in the last stage. Indeed, the "Never Happens" condition that enabled the simplification of the BBL cells is not fulfilled here: one of the two inputs concerned being at 1 does not imply that the other is at 1. The circuit is combined with the 4-bit ripple-carry adder (RCA) and is referenced as FA+X-BBL in Table V. We can also replace the two remaining BBL CS-C0 gates by X-gates. This circuit is referenced as FA+X-gates. Finally, we can further decompose the cells so that this stage is designed using only inverters, NAND, and NOR gates. This case is FA+CMOS.

    Table V presents the simulation results for the 4-bit adders with the carry-lookahead circuits. FA+X-BBL and FA+X-gates are nearly identical in all respects and have the best energy efficiency. FA+CMOS, FA+CS-C2, and CSA+CS-C2 have a power-delay product which is about 25%-30% higher. The reduction of the stack height from three to two devices reduces the delay by 18% between FA+CS-C2 and FA+X-BBL.

    Fig. 11. Block diagram of the 8-bit CLA adder with carry-in. The intermediate carry signal C5 (interrupted line) was added in order to enhance the speed of the 8-bit adder blocks. C3 and C5 are intermediate results produced by CS-C7.

    B. Structural Optimization

    A 64-bit carry-select adder can make the selection either of 8-bit pre-sums [22], [23] or of 16-bit pre-sums [14]. The use of 8-bit adders allows the use of one carry-selection level instead of three as in the first design. There are multiple ways to combine two 4-bit adders to form the 8-bit adders. In the first possibility, we use C3, the 4-bit block carry from the first 4-bit adder, computed with the carry-lookahead circuit of the first stage. C3 is re-used in the cell CS-C7, producing the carry for the entire 8-bit adder block (Fig. 11). However, the speedup was not sufficient to produce the SUM signals on time, even in the carry-select architecture. Therefore, C5, the carry resulting from the six lowest-order bits, is fed directly into the FA for bit position 6, thus forming a second carry-lookahead path (Fig. 11). C5 is further used in CS-C7 to generate C7. By using intermediate results and sharing circuit blocks, the duplication of logic cells is avoided. This contributes to a lower area and lower power consumption.

    Adder8×8b is composed of eight 8-bit blocks, the global carry network, and seven multiplexers (Fig. 12). The first adder block contains the 8-bit adder generating SUM0-7 and CS-C7, the circuit generating the intermediate carry signal C7. The seven other blocks are identical and are composed of four circuits. Two of these are 8-bit adders, used to produce the pre-sums, one assuming that the carry-in is at 0, the other assuming that the carry-in is at 1. A 2×8-input multiplexer selects the final sum output. The control signals for the multiplexers are generated by the global carry network. The two other circuits in the seven identical adder blocks generate the conditional carry signals which are fed into the global carry network. These two circuits are referenced as CS-cin0 and CS-cin1 in Fig. 12. By using 8-bit blocks which are repeated seven times, the design and layout time is made shorter than if sections of different lengths were used.

    Fig. 12. 64-bit adder based on the selection of 8-bit pre-sums.

    C. Global Carry Network

    The global carry network generates the final carry-out, C63, and the intermediate carry signals C15, C23, C31, C39, C47, and C55. These signals command the multiplexers which select the appropriate 8-bit pre-sums.

    In order to minimize the delay in the critical path, the following elements are taken into consideration: the number of successive stages, the input load presented to the CS-cin0 and CS-cin1 blocks, and the fan-out of each stage. In the decomposition that we propose below, the fan-out is limited to three, both for the conditional carry signals and for the intermediate carry signals that are re-used at different places in the global carry network.

    The hot carry is C55, since it commands the multiplexer selecting the highest order sum signals. It is generated using all

    the intermediate carry signals except and :

    (9)

    Since the stack height in the global carry network is limited to two devices to favor speed, the equation of the hot carry is further decomposed in order to be able to implement this function with complex BBL CS-C0 gates and, where needed, X-gates.


    The second control signal to generate is :

    (10)

    which can be decomposed as

    (11)

    is already one of the intermediate carry signals needed

    to command the multiplexer selecting SUM24..31. The expres-

    sions of the complements of and are

    (12)

    (13)

    and are shared with .

    is expressed as follows:

    (14)

    where are also shared with and is the inter-

    mediate carry signal commanding the multiplexer selecting

    SUM16..23.

    Finally, re-uses in its expression

    (15)

    The low-order intermediate signals and are re-used

    as inputs for cells generating higher order carry signals. In this

    way, we avoid the duplication of parts of the carry logic, which

    favors low power, and we keep the fan-out to a maximum of

    three, which is beneficial for speed.
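    The role of the global carry network can be illustrated with a small behavioural model. The sketch below is not the authors' circuit: it only assumes that each 8-bit section produces two pre-sums and two conditional carries (for a carry-in of 0 and of 1), and that the resolved carries drive the selecting multiplexers. All function and variable names are hypothetical, and the carries are resolved sequentially here, whereas the network described above is a parallel structure with shared cells.

```python
import random

WIDTH = 64            # word length of the adder
BLOCK = 8             # width of one pre-sum section
SECTIONS = WIDTH // BLOCK
MASK = (1 << BLOCK) - 1


def section_presums(a, b, i):
    """Return (sum0, c0, sum1, c1) for section i.

    sum0/c0 assume a carry-in of 0, sum1/c1 assume a carry-in of 1.
    """
    xa = (a >> (i * BLOCK)) & MASK
    xb = (b >> (i * BLOCK)) & MASK
    t0 = xa + xb
    t1 = xa + xb + 1
    return t0 & MASK, t0 >> BLOCK, t1 & MASK, t1 >> BLOCK


def global_carries(cond, cin=0):
    """Resolve the carry into every section plus the final carry-out.

    cond[i] = (c0, c1) are the conditional carries of section i.
    Evaluated sequentially here; this is a functional model only.
    """
    carries = [cin]
    for c0, c1 in cond:
        # c0 implies c1, so the carry-out is c0 OR (c1 AND carry-in).
        carries.append(c0 | (c1 & carries[-1]))
    return carries


def carry_select_add(a, b, cin=0):
    """Behavioural 64-bit carry-select addition; returns (sum, carry_out)."""
    pre = [section_presums(a, b, i) for i in range(SECTIONS)]
    carries = global_carries([(p[1], p[3]) for p in pre], cin)
    total = 0
    for i, (s0, _, s1, _) in enumerate(pre):
        # Each resolved carry acts as the select input of one multiplexer.
        chosen = s1 if carries[i] else s0
        total |= chosen << (i * BLOCK)
    return total, carries[-1]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        a, b = rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)
        s, cout = carry_select_add(a, b)
        assert (cout << WIDTH) | s == a + b
    print("carry-select model matches integer addition")
```

    The model checks functionality only; it says nothing about stack height, fan-out, or delay, which depend on how the carry expressions are decomposed into gates.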

    D. Simulation Results

    This adder has been simulated using the parameters of the 0.18-μm PD SOI CMOS process and is compared with the experimental and simulation results of the first adder, Adder4×16b, which has a classical structure (Table VI). The layout of Adder4×16b is taken as a reference to estimate the lengths of the wires for the schematic simulations. For Adder8×8b, the length of the long wires has been reduced by 25% to account for the lower die area thanks to the lower device count. Considering the reduction in the number of devices (67%), this is a rather conservative value.

    The optimized version of the 64-bit adder shows a reduction of about 20% of the maximum delay, thanks to three factors. First, the lower number of buffers on the signal path accounts for 6% of the delay reduction. Second, the use of stacks with a maximum of two PMOSFETs further improves the speed by 6%. The critical path in Adder8×8b includes a two-input NOR, one CS-C0 stage, and five X-gates to produce the hot carry. Third, thanks to a more efficient architecture in Adder8×8b, the capacitive load seen by the carry-select cells on the critical path is reduced compared to Adder4×16b.

    TABLE VI. DELAY, POWER CONSUMPTION, POWER-DELAY PRODUCT, AND DEVICE COUNT FOR THE REALIZED AND SIMULATED IMPLEMENTATIONS OF ADDER4×16b AND THE SIMULATED IMPLEMENTATION OF ADDER8×8b IN THE 0.18-μm CMOS TECHNOLOGY. V = 1.5 V; T = 25 °C; F = 1 GHz; FO = 4

    Fig. 13. Layout of the 64-bit adder based on the selection of 8-bit pre-sums.

    To estimate the dynamic power consumption, random input patterns are applied at the inputs at a rate of 1 GHz. In Adder8×8b, the dynamic power consumption is reduced by a factor of three compared to the classical adder. This improvement is associated with the higher stacks in the noncritical path and the reduction of the number of cells in the proposed architecture.

    Adder8×8b shows a 75% lower power-delay product than Adder4×16b. About 20% of this improvement is associated with the cell level, the remainder coming from the modifications of the architecture. Adder8×8b has a power-delay product that is 60% lower than that of the CPL adder. In this case, the architecture accounts for 10% of the improvement, the remainder coming from the different logic design styles.
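    As a rough consistency check, the delay and power figures quoted in this section can be combined into a power-delay product ratio. The sketch below assumes that the "about 20%" delay reduction and the "factor of three" dynamic power reduction can simply be multiplied; both inputs are rounded values from the text, so the result is only expected to land near the reported 75%. The function name is hypothetical.

```python
def pdp_ratio(delay_reduction, power_factor):
    """Ratio of the new to the old power-delay product.

    delay_reduction: fractional delay reduction (0.20 for 20%).
    power_factor: factor by which the power is divided.
    """
    return (1.0 - delay_reduction) / power_factor


# Rounded figures quoted in the text for the 0.18-um comparison.
ratio = pdp_ratio(0.20, 3.0)
reduction_percent = (1.0 - ratio) * 100.0
print(f"PDP ratio: {ratio:.3f}, reduction: {reduction_percent:.1f}%")
# prints "PDP ratio: 0.267, reduction: 73.3%"
```

    The estimate of about 73% agrees with the reported 75% reduction to within the rounding of the two inputs.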

    E. Optimization and Results in 0.13-μm PD SOI

    Adder8×8b has finally been optimized with Einstuner in a 0.13-μm PD SOI CMOS technology. Einstuner is a circuit optimization package that automatically resizes the transistors [24]. The final layout area is 151 μm × 461 μm (Fig. 13). A netlist has been extracted from the layout and is used to determine the main features of the adder. The critical delay is 326 ps at 1.1 V and 85 °C. Random input patterns at a rate of 2 GHz are used to calculate the dynamic power, which is found to be as low as 23 mW. The static power is evaluated to be 380 μW. Our realization is compared with state-of-the-art 64-bit adders published in previous work in Fig. 14. The adder proposed here is faster than the other realizations, even those realized in finer technologies. Our adder features the extremely low power-delay product of 7.5 pJ.
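    The quoted power-delay product follows directly from the delay and dynamic power given above. The short sketch below redoes that arithmetic; the helper name is hypothetical, and only the dynamic power of 23 mW is used, on the assumption that the 380 μW of static power is not part of the quoted figure.

```python
def power_delay_product_pj(delay_ps, power_mw):
    """Power-delay product in picojoules.

    1 ps * 1 mW = 1e-12 s * 1e-3 W = 1e-15 J = 1e-3 pJ.
    """
    return delay_ps * power_mw / 1000.0


pdp = power_delay_product_pj(326, 23)  # 0.13-um adder at 1.1 V
print(f"{pdp:.2f} pJ")                 # prints "7.50 pJ"
```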

    Fig. 14. Our 64-bit adder, based on the selection of 8-bit pre-sums designed in 0.13-μm PD SOI, is compared with recent 64-bit adder realizations. Load = 40 fF; T = 85 °C. ISSCC00 is a domino carry-select/carry-lookahead adder in 0.18-μm PD SOI [25]. VLSI01 is a race logic carry-lookahead/carry-select adder in 0.18-μm bulk CMOS [22]. ESSCIRC02 is a Brent–Kung adder in 100-nm bulk CMOS [26]. VLSI02 is a carry-select/carry-lookahead adder in 0.1-μm PD SOI [23].

    VI. CONCLUSION

    In this paper, we presented a methodology to minimize the power-delay product of 64-bit carry-select adders for high-performance and low-power applications, by working at three levels of abstraction: design style, cell arrangement, and adder structure. We demonstrated that a complex-cell branch-based design reduces the dynamic power consumption by 10% compared to the design with decomposed cells, with a minimal impact on the delay. The reduction of the CMOS cell stack height from three to two devices in the critical path proved to be beneficial for speed. By avoiding the duplication of cells, without increasing the fan-out on the critical path, we were able to improve the speed while maintaining low power consumption. The structural optimization, by making the choice of the selection of 8-bit pre-sums, allowed the use of only one carry-select level, thus further contributing to the reduction of the power dissipation thanks to a lower number of cells. Compared to the classical design, the power-delay product of the optimized adder has been reduced by a factor of four. Compared with an equivalent CPL 64-bit adder, our realization shows a 60% improvement of the power-delay product. Finally, an automatic tuning tool allowed the design of an energy-efficient adder in 0.13-μm PD SOI, with a power-delay product as low as 7.5 pJ. The approach presented in this paper can be extended toward n-bit carry-select adders. However, the number of carry-select levels and the adder architecture might be different in order to obtain an efficient realization.

    ACKNOWLEDGMENT

    The authors express their sincere thanks to G. Hellner,

    J. Keinert, V. Gernhoeffer, and U. Krauch for the help they

    provided when using the simulation, layout, and extraction

    tools. The authors also wish to thank R. Sautter and W. Haller for

    useful discussions, and J. Appinger for his helpful assistance in

    obtaining the experimental data.

    REFERENCES

    [1] T. H. Ning, "CMOS in the new millennium," in Proc. IEEE Custom Integrated Circuit Conf. (CICC), 2000, pp. 49–56.

    [2] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.

    [3] European Semiconductor Industry Association, Japan Electronics and Information Industries Association, Korea Semiconductor Industry Association, Taiwan Semiconductor Industry Association, and Semiconductor Industry Association, International Technology Roadmap for Semiconductors, System Drivers, 2001.

    [4] C. Nagendra, R. M. Owens, and M. J. Irwin, "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Syst., vol. 2, pp. 377–381, Sept. 1994.

    [5] S. Naffziger, "A subnanosecond 0.5 μm 64b adder," in Proc. IEEE Int. Solid-State Circuits Conf., Slide Supplement, 1996.

    [6] J. Hennessy, D. Patterson, and D. Goldberg, Computer Architecture, A Quantitative Approach, 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996.

    [7] M. Allam, M. Anis, and M. Elmasry, "Effect of technology scaling on digital CMOS logic styles," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2000, pp. 19-1-1–19-1-8.

    [8] R. Yung, S. Rusu, and K. Shoemaker, "Future trend of microprocessor design," in Proc. ESSCIRC, 2002, pp. 43–46.

    [9] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.

    [10] P.-F. Lu, C.-T. Chuang, J. Ji, L. F. Wagner, C.-M. Hsieh, J. B. Kuang, L. L.-C. Hsu, M. M. Pelella, S.-F. Sanford, and C. J. Anderson, "Floating-body effects in partially depleted SOI CMOS circuits," IEEE J. Solid-State Circuits, vol. 32, pp. 1241–1253, Aug. 1997.

    [11] N. Subba, A. Salman, S. Mitra, D. E. Ioannou, and C. Tretz, "Pseudo-NMOS revisited: Impact of SOI on low power, high speed circuit design," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 26–27.

    [12] C. R. Tretz, R. K. Montoye, and W. Reohr, "Ratioed CMOS: A low power high speed design choice in SOI technologies," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 28–29.

    [13] J. M. Masgonty, C. Arm, and C. Piguet, "Technology- and power-supply-independent cell library," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 1991, pp. 25.5/1–25.5/4.

    [14] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. New York: Wiley, 1979.

    [15] S. Zaker and J. Zahnd, "OPTIMOS: a branch-level digital circuit optimizer," in Proc. EURO ASIC, 1993, pp. 563–572.

    [16] G. A. Katopis, W. D. Becker, T. R. Mazzawy, H. H. Smith, C. K. Vakirtzis, S. A. Kuppinger, B. Singh, P. C. Lin, J. Bartells Jr., G. V. Kihlmire, P. N. Venkatachalam, H. I. Stoller, and J. L. Frankel, "MCM technology and design for the S/390 G5 system," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 621–650, Sept.–Nov. 1999.

    [17] G. G. Shahidi, "SOI technology for the GHz era," IBM J. Res. Develop., vol. 46, no. 2/3, pp. 121–131, Mar./May 2002.

    [18] I. Aller and K. E. Kroell, "Detailed analysis of the gate delay variability in partially depleted SOI CMOS circuits," in Proc. IEEE Int. SOI Conf., Oct. 1999, pp. 40–41.

    [19] A. Nève, D. Flandre, H. Schettler, T. Ludwig, and G. Hellner, "Design of a branch-based 64-bit carry-select adder in 0.18 μm partially-depleted SOI CMOS," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 108–111.

    [20] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8 ns CMOS 16 × 16-b multiplier using complementary pass-transistor logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388–395, Apr. 1990.

    [21] K. Martin, Digital Integrated Circuit Design. Oxford, U.K.: Oxford Univ. Press, 2000.

    [22] S. J. Lee, R. Woo, and H. J. Yoo, "480 ps 64-bit race logic adder," in Symp. VLSI Circuits Dig. Tech. Papers, 2001, pp. 27–28.

    [23] J. J. Kim, R. Joshi, C.-T. Chuang, and K. Roy, "SOI-optimized 64-bit high-speed CMOS adder design," in Symp. VLSI Circuits Dig. Tech. Papers, 2002, pp. 122–125.

    [24] X. Bai, C. Visweswariah, P. Strenski, and D. Hathaway, "Uncertainty-aware circuit optimization," in Proc. 39th Design Automation Conf., June 2002, pp. 58–63.

    [25] D. Stasiak, J. Tran, F. Mounes-Toussi, and S. Storino, "A 2nd generation 440 ps SOI 64 b adder," in IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp. 288–289.

    [26] M. Garg and A. Katoch, "Evaluation of skew tolerance in delayed clocking scheme for dynamic circuits," in Proc. ESSCIRC, Sept. 2001, pp. 396–399.

    Amaury Nève (M'96) received the electrical engineering and the Ph.D. degrees from the Université catholique de Louvain, Louvain-la-Neuve, Belgium, in 1998 and 2004, respectively.

    From 1998 to 2003, he was Research Assistant with the Microelectronics Laboratory of the Université catholique de Louvain. He was involved in the development of design techniques for high-speed and low-power digital circuits in advanced silicon-on-insulator (SOI) processes. In May 2003, he joined the IBM Research and Development Laboratory in Böblingen, Germany, where he is working on circuit design for advanced CMOS processes.

    Helmut Schettler received the Dipl.-Ing. degree in electrical engineering from the University of Stuttgart, Stuttgart, Germany, in 1969.

    In 1969, he joined the IBM Laboratory, Böblingen, Germany. He worked for three years in the IBM Laboratories, East Fishkill, NY, and Burlington, VT, where he was involved in bipolar and CMOS circuit and chip design for memory and μ-processor applications. He holds 23 patents and was the leading circuit designer when IBM moved its server technology from bipolar to CMOS. Presently, he is also involved in lecturing at a university of applied science in Stuttgart.

    Thomas Ludwig (M'90) was born in Sindelfingen, Germany, in 1957. He received the Master of Electrical Engineering from the Technische Universität, Berlin, Germany, in 1983.

    He joined IBM in 1984, at the German Research and Development Laboratory, Böblingen, working on high-speed digital driver/receiver circuits. In 1992, he was on an assignment with the joint IBM/Intel Noyce Development Center, Boca Raton, FL, working on technology conversion of a μ-processor. He is currently Senior Engineer and Leader of the Future Product Technology Team, IBM Systems Group, Böblingen, Germany, responsible for future silicon technologies in the field of high-speed server processors. His current research interests are in the areas of silicon-on-insulator technology, especially FinFET circuit design, modeling, and the influence on CAD tools.

    Denis Flandre (M'86–SM'03) was born in Charleroi, Belgium, in 1964. He received the Electrical Engineer degree, the Ph.D. degree, and the Postdoctoral thesis degree from the Université Catholique de Louvain (UCL), Louvain-la-Neuve, Belgium, in 1986, 1990, and 1999, respectively. His doctoral research was on the modeling of silicon-on-insulator (SOI) MOS devices for characterization and circuit simulation, and his Postdoctoral thesis was on a systematic and automated synthesis methodology for MOS analog circuits.

    In 1985, he was a summer Student Trainee at NTT Headquarters, Tokyo, Japan. From October 1990 to September 1991, he was with the Centro Nacional de Microelectrónica, Barcelona, Spain, working on the characterization and numerical simulation of SOI MOS process and devices. He was then at the Laboratoire de Microélectronique (DICE), Louvain-la-Neuve, Belgium, as a Senior Research Associate of the National Fund for Scientific Research (FNRS, Belgium). Since 2001, he has been a full-time Professor at UCL giving courses on integrated analog circuit design, device physics, etc. He is currently involved in the research and development of digital and analog SOI MOS circuits for special applications, more specifically high-speed, low-voltage low-power, microwave, rad-hard, and high-temperature electronics.

    Prof. Flandre has been the recipient of the 1992 Biennial Siemens–FNRS Award for an original contribution in the fields of electricity and electronics, the 1997 Wernaers Prize for innovation in pedagogical presentation of advanced research work, and the 1999 CEN SCK Prize for innovation in nuclear science instrumentation. He has authored or coauthored more than 160 technical papers or conference contributions. He is a member of the Advisory Board of the EU Network of Excellence for High-Temperature Electronics (HITEN), of the Scientific Board of the Microserv large infrastructure EU program of the CNM-Barcelona, and of the Director Board of the Cyclotron Research Center (CRC, Louvain-la-Neuve, Belgium). He is a founding member of the CERMIN (Centre de Recherche en Dispositifs et Matériaux Electroniques Micro- et Nanoscopiques of UCL). He is a cofounder of CISSOID S.A., a startup company, spun off of UCL in July 2000, focusing on SOI circuit design services.