IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION …€¦ · partial remainders, and redundant...

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Energy-Efficient VLSI Realization of Binary64Division With Redundant Number Systems

Saba Amanollahi and Ghassem Jaberipur

Abstract— VLSI realizations of digit-recurrence binarydivision usually use redundant representation of partial remain-ders and quotient digits. The former allows for fast carry-freecomputation of the next partial remainder, and the latter leadsto less number of the required divisor multiples. In studying theprevious relevant works, we have noted that the binary carry-save (CS) number system is prevalent in the representation ofpartial remainders, and redundant high radix representationof quotient digits is popular in order to reduce the cyclecount. In this paper, we explore a design space containingfour division architectures. These are based on binary CS orradix-16 signed digit (SD) representations of partial remainders.On the other hand, they use full or partial precomputationof divisor multiples. The latter uses smaller multiplexer at thecost two extra adders, where one of the operands is constantwithin all cycles. The quotient digits are represented by radix-16[−9, 9] SDs. Our synthesis-based evaluation of VLSI realizationsof the best previous relevant work and the four proposed designsshow reduced power and energy figures in the proposed designsat the cost of more silicon area and delay measures. However, ourenergy-delay product is 26%–35% less than that of the referencework.

Index Terms— Carry save (CS), digit recurrence binary divi-sion, energy efficiency, radix-16 signed digit (SD), redundantnumber system.

I. INTRODUCTION

D IVISION is the less frequent operation among the fourbasic arithmetic operations that are carried out within the

execution of a typical task on digital processors. On the otherhand, it is the most complex and time consuming operation.VLSI realization of dividers is generally based on two classesof algorithms, namely, subtractive (aka digit recurrence) andmultiplicative (aka functional) [1]. The former is less costly,but slower, such that obtaining each digit of the quotientrequires a separate recurrence cycle, while the number ofiterations in the latter is logarithmically proportional to thenumber of quotient digits. Quotient digit selection (QDS) isvery simple in conventional binary division algorithms, such asnonrestoring division scheme [1], where the next quotient bitis obtained just by examining the sign of partial remainder.However, in order to reduce the number of recurrences,radix-2h (e.g., h = 4) division schemes have been proposed atthe cost of more complex QDS, since one out of 2h possible

Manuscript received May 6, 2016; revised July 16, 2016; acceptedAugust 24, 2016. This work was supported in part by IPM underGrant CS1395-2-03, and in part by Shahid Beheshti University.

The authors are with the Department of Computer Science andEngineering, Shahid Beheshti University, Tehran 1983969411, Iran (e-mail:[email protected]; [email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2016.2604346

digit values is to be selected. This is undertaken via comparingthe partial remainder with a set of divisor multiples [2].

In order to reduce the generation cost of such multiples,the quotient digits are often selected from a signed digit (SD)set [3] (e.g., [−2h−1, 2h−1]), and converted on the fly intothe desired binary output. Another cost reducing technique isto perform the comparison only via the few most significantdigits, whose number is determined via a fairly complexerror analysis. On the other hand, the SD representationof partial remainders has been employed to enable borrow-free subtraction that shortens the cycle time. However, signdetection of SD numbers, which is required in QDS, is not atrivial operation.

Despite the less frequent occurrence of division in compari-son with other basic operations, the several addition operationsthat are embedded within a digit recurrence division contributeto extra energy consumption. On the other hand, given thepopularity of portable devices and prominence of green com-puting, the current trend in VLSI design favors low-powerproducts that are often more reliable due to the reduction ofdie temperature. In addition, the reduction of power dissipationis considerably influential within the arithmetic/logic units,which is often recognized as the hot spot of digital processors.

The commonly used redundant number system that isrequired for the representation of partial remainders is thebinary carry save (CS), which roughly doubles the requirednumber of bits for representing the same value. Although theextra register cost might be affordable, the corresponding extrapower dissipation could be in question. On the other hand,there are less costly redundant number systems that representthe same range of numbers via exploiting less number of addi-tional bits. For example, the case of redundant digit floatingpoint arithmetic [4] uses the radix-16 maximally redundantSD (MRSD) representation that requires only 25% extra bits.Efficient radix-16 MRSD addition and subtraction have beenoffered in [5]. The choice of SD set for the representationof quotient digits is critical, since narrower digit sets (e.g.,the minimally redundant digit set [−2h−1, 2h−1]) that aredesirable for reducing the cost of divisor multiple genera-tion (DMG) tend to require more number of digits for com-paring them with partial remainders [1]. Given the aforemen-tioned design space parameters (i.e., the choice of SD or CSfor partial remainders and quotient digits), we are motivatedto study and explore the low power possibilities in the designand implementation of new binary digit recurrence dividers.

The remaining sections of this paper are organized asfollows. Section II provides background materials on binarydigit recurrence division schemes, and the relevant previous

1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

works. The proposed architectures are offered in Section III.Figures of merit of the best previous relevant divider andthe proposed ones are evaluated, via synthesis, and comparedwith Section IV. Finally, we provide our concluding remarksin Section V.

II. BACKGROUND AND THE RELATED WORKS

The division operation can be defined as X = QD + R,where X and D represent the dividend and divisor, respec-tively, as the input operands. The results are denoted as thequotient Q and the remainder R. In hardware realization ofdigit recurrence division algorithms, it is often postulated thatX < D, and the divisor is a normalized fraction, such that inradix-2h division 1/2h ≤ D < 1. Equation (1) describes thej th recurrence, where W [ j ], q j+1, and Q[ j ], represent thej th partial remainder, the next quotient digit, and the partialquotient, respectively. In addition, 0 ≤ j < n, W [0] = X ,Q[0] = 0, and n denotes the precision of D and Q

W [ j + 1] = 2h W [ j ] − q j+1 D

Q[ j + 1] = Q[ j ] + 2−h( j+1)q j+1. (1)

As was mentioned in Section I, QDS can be simplifiedvia selecting q j+1 from an SD set [−α, α]. For example,Liu and Nannarelli [6] use the radix-16 digit set [−10, 10],where the next quotient digit is obtained as q j+1 = 4qh

j+1 +ql

j+1, with qhj+1, ql

j+1 ∈ [−2, 2]. The common QDS approachis via comparing the shifted partial remainder with a set ofcomparison constants (i.e., a set of selected divisor multiples),such that q j+1 is selected based on (2), where Mk andMk+1 represent two consecutive comparison constants, andk ∈ [−α, α]. In practice, all the required comparisons takeplace in parallel. On the other hand, in order to speedup the QDS, truncated (to some t � n fractional digits)partial remainders and comparison constants are used,where t is so determined to guarantee that the convergencecondition (defined by (3) [1]) is not violated

Mk ≤ 2h W [ j ] < Mk+1 �⇒ q j+1 = k (2)

−ρD ≤ W [ j + 1] ≤ ρD, ρ = (α/(2h − 1)). (3)

To speed up the partial remainder computation (PRC),the partial remainders are often represented via a redundantnumber system. For example, [6]–[11] use binary CS, and[7] and [13] employ the binary SD (BSD) representations.However, we have not encountered any relevant work withhigh radix redundant partial remainders.

A. Previous Relevant Works

We briefly describe some representative binary divisionhardware designs, but mainly focus on those that internallyuse radix-16, for better comparison with our radix-16 designs.Taylor [12] offered the aforementioned idea of decompos-ing the radix-16 quotient digit to two minimally redundantradix-4 digits that are obtained in a time-overlapping manner.However, the gained speed is at the cost of more silicon areaand more power dissipation. This overlapping technique hasbeen later used in a number of publications [6], [10]–[13],

TABLE I

TAXONOMY OF BINARY DIGIT RECURRENCE DIVIDERS

Fig. 1. Two digit slices of CFA required for PRC.

where the main difference is in partial remainder representa-tion (e.g., CS or BSD). This is summarized in Table I, whereother design particularities, such as the employed quotient digitset and the use of overlapping, are also indicated. Nevertheless,each design may have its particular additional contribution(e.g., retiming in [6] and [10] and some low-power techniquesin [8]).

B. Use of Signed Digit Number Systems

Utilization of a high radix SD number system (for partialremainder representation) entails particular carry-free addition,based on (1). This regards a previous work [13] and two ofthe proposed designs in this paper, which use radix-16 SDrepresentation of partial remainders. The digit set of theformer is not indicated, while those of the proposed designsare MRSD. Therefore, an addition of a radix-16 MRSDnumber (i.e., the shifted partial remainder) with a binarynumber (i.e., the divisor multiple) is in order. In the previousworks that use radix-16 MRSD number system (see [4], [5]),each radix-16 MRSD digit is represented by a 5-bit two’scomplement encoding. In [5], the most significant bit is anegabit with arithmetic value −1(0) corresponding to logicalstatus 0(1) [14]. Fig. 1 shows two digit slices of the requiredcarry-free addition, for the latter so-called negabit two’s com-plement representation, where white (black) dots representnegabits (posibits). One 4-bit adder per radix-16 position,whose input bits are distinguished within dashed boxes, is usedto produce the sum digits (surrounded by dashed curves). The


AMANOLLAHI AND JABERIPUR: ENERGY-EFFICIENT VLSI REALIZATION OF BINARY64 DIVISION 3

Fig. 2. General division architecture.

4-bit adders produce the negabit and three most significantposibits of the sum digit, in the same radix-16 position, andthe carry-out rests at the least significant position of the nextsum digit.

The use of the aforementioned adder is expected to leadto less power dissipation with respect to designs that useCS adder, which is mainly due to the use of 5 bits per digitin the former compared with 8 bits in the latter.

III. PROPOSED ARCHITECTURES

The general digit recurrence division architecture is shownin Fig. 2, where the pertinent discussion regarding the pro-posed designs will be provided as appropriate. Double preci-sion operands (i.e., binary64 floating point) is assumed for allthe proposed architectures as is the case for the main referencework [6].

Within the framework of Fig. 2, static and semidynamicDMGs and two different representations for partial remain-ders provide us with a design space based on the followingoptions.

1) Radix-16 Quotient Digit Set: This choice, as in theprevious relevant works, leads to the reduced numberof cycles versus the direct generation of quotient bits.

2) SD Representation of Quotient Digits: We use [−9, 9]radix-16 SD set for the intermediate representation ofquotient digits. The 5-bit representation of such digits isthe same as the MRSD of Fig. 1. The primary choicewould be the minimally redundant [−8, 8] digit setthat requires the minimum number of divisor multiples,while the other extreme choice would be the maximallyredundant [−15, 15] digit set. However, another influen-tial measure is the number of fractional digits that aresufficient for truncated comparison of partial remainderswith divisor multiples. This is later shown to be 2 in thecase of digit sets [−α, α] (α ∈ [9, 15]), and 3 for α = 8.

3) Semidynamic DMG: The [−9, 9] multiples of divisorthat are needed in the PRC are normally obtainedwithin the initialization cycle, where ten-way multi-plexer is required for selecting q j+1D. However, besidesimplementing the latter conventional method, we pro-pose the following method that uses a four-way mul-tiplexer. In the initialization cycle, we generate only{2, 3, 6}D, and dynamically obtain ±{4, 5, 7, 8, 9}D,

Fig. 3. Overlapped zones for specific quotient digit values.

as ±{6D − 2D, 6D ∓ D, 6D + 2D, 6D + 3D}, respec-tively, within each recurrence.

4) Use of Redundant Number Systems for PRC: The pre-vious relevant works have opted for CS representationof partial remainders. To be able to independently showthe advantage of aforementioned semidynamic DMG,we also use CS as one option, which due to doublingthe representation storage does not seem to be a properchoice when lower power dissipation is desirable. There-fore, our other choice is to use higher radix redundantnumber systems for partial remainder representation.For example, radix-16 maximally redundant numbersystem (MRSD) [5] is a viable choice, with only 25%extra representation storage.

Based on the above discussion, our design space includesfour architectures that are designated as CS-10, CS-4,MRSD-10, and MRSD-4, where CS and MRSD regard thepartial remainder representation, and 4 and 10 refer to thesizes of utilized multiplexers, corresponding to [0, 3] × D or[0, 9]× D multiples, respectively. In Section III-A, we discussthe QDS and capture the main differences of the four proposedarchitectures in Section III-B.

A. Quotient Digit Selection

As was argued above, the quotient digit set of our choiceis [−9, 9]. In general, the next quotient digit q j+1 should beselected from [−α, α], such that the convergence conditionthat is described by (3) (above) holds [1], where in this case(i.e., α = 9), ρ = (a/(16 − 1)) = (9/15) = 0.6.

The convergence condition (−0.6D ≤ W [ j + 1] ≤ 0.6D)is partially unfolded as in Fig. 3, which suggests the compar-ison of the partial remainder with a set of divisor multiples(e.g., −1.6D and −0.4D, for q j+1 = −1) in order to decidethe value of the next quotient digit. However, for some W [ j ]values (e.g., 16W [ j ] = −.5 D), there may be more than onevalid q j+1. For example, see the overlapped zone between thedashed lines for q j+1 = −1 and q j+1 = 0.

To ease the QDS process and reduce the number of com-parisons, it is common to precompute a set of comparisonconstants, as fixed multiples of divisor [as in (2)], to be usedfor exact QDS. As such, (4) provides the proper range ofMks, for k ∈ [−9, 9]. Therefore, an ease to compute choiceis Mk = (k + 0.5)D, which leads to the case that the exactinterval of 16W [ j ] (for a particular value of q j+1 = −1)falls between M−2 = −1.5D and M−1 = −.5D (see thecorresponding bold arrows in Fig. 3)

D(k + 0.4) ≤ Mk ≤ D(k + 0.6). (4)



Fig. 4. QDS architecture.

On the other hand, the comparison of an SD or CSnumber (i.e., partial remainder) with a nonredundant one(i.e., comparison constants) cannot be trusted to a digit by digitcomparator (most significant digits first). For example, con-sider the first two digits of a hexadecimal MRSD operand asX = x1x0, and nonredundant number Y = y1 y0, and assumex1 = 8, x0 = −2, y1 = 7, and y0 = 15. A normal comparatorrules X > Y , since x1 > y1. However, it turns out that X canbe equivalently represented as x1 = 7 and x0 = 14. There-fore, the same trivial comparison scheme leads to X < Y ,since x0 = 14 < 15 = y0. A straightforward way out of thisproblem is to convert the SD or CS operand into their two’scomplement equivalents and use a conventional comparator,which is of course counterproductive with respect to losing thecarry-free advantage of SD or CS number systems. However,it is possible to undertake the required comparisons basedon the few most significant fractional digits, such that theconvergence condition (3) holds. It turns out, as is proved inSection III-A1 that t = 2 fractional digits are sufficient, wherefor the sake of similar treatment of SD and CS representation,we consider four CS digits as one radix-16 fractional digit.The overall QDS architecture is shown in Fig. 4, whereWk[ j + 1] = (16W [ j ])t − (Mk)t , and the index t denotestfractional digits.

1) Proof of t = 2: Assuming t fractional digits, for Mk andW , let the truncated operands be denoted as Mk − 16−t <(Mk)t ≤ Mk and 16W [ j ] − 16−t < (16W [ j ])t < 16W [ j ] +16−t , respectively. Note that the discarded digits in the lattercase can assume both positive and negative values due to SDnature of the partial remainders. We follow a similar procedureas in the decimal division schemes of [15] and [16], and adaptit for radix-16 division. Replacing the operands of (2) withthe corresponding truncated operands, we get at (5) and (6),respectively

(Mk)t ≤ (16W [ j ])t , for q j+1 = k (5)

(16W [ j ])t < (Mk)t , for q j+1 = k − 1. (6)

After some elaborations on (5) and (6), as follows, we getat Mk−2×16−t − k D < W [ j + 1] and Mk + 16−t −(k − 1)D > w[ j + 1], respectively. Applying these resultson the convergence condition in (3) (i.e., −ρD < W [ j + 1]

Fig. 5. PRC for designs CS-10 and MRSD-10.

and W [ j + 1] < ρD) leads to (7) and (8), respectively

(5) �⇒ Mk − 16−t < (Mk)t ≤ (16W [ j ])t

< 16W [ j ] + 16−t �⇒ Mk − 2

×16−t − k D < 16W [ j ] − k D

= W [ j + 1](6) �⇒ 16W [ j ] − 16−t < (16W [ j ])t < (Mk)t ≤ Mk �⇒Mk + 16−t − (k − 1)D > 16W [ j ] − (k − 1)D

= W [ j + 1]−ρD < Mk − 2

×16−t − k D �⇒ 2 × 16−t

+ k DρD < Mk (7)

Mk + 16−t − (k − 1)D < ρD �⇒ Mk

< ρD − 16−t + (k − 1)D. (8)

Recalling that ρ = 0.6 and 1/16 ≤ D < 1, and combiningthe inequalities (7) and (8), we get at the following:ρD + 2 × 16−t + k D < ρD − 16−t + (k − 1)D

�⇒ 3 × 16−t < (2ρ − 1)D = 0.2 × D �⇒ 16t >15

D.

Since the latter should hold for all the values of D (inparticular D = 16−1), we need to assert 16t > 240, whichleads to t > 1. Therefore, we trust on t = 2. However, giventhe existence of a nonfractional integer digit of the operands,three-digit comparisons are required, for which the cost ofconverting into two’s complement can be afforded. Note thatin the cases of α ∈ [9, 15], the same analysis leads to t > 1.However, α = 8 �⇒ t > 2.

B. Partial Remainder Computation

The aforementioned four proposed architectures are mainlydescribed within the PRC discussion as follows.

1) CS-10 and MRSD-10 Designs: The straightforward PRCwould use a unified carry-free adder/subtractor (CFA/S)and a 10:1 multiplexer, for the quotient digit set [−9, 9].The multiplexer selects one precomputed divisor multiplefrom [0, 9] × D, which are obtained in the initializationphase. Fig. 5 shows the required architecture, for designsCS-10 and MRSD-10, where a digit is meant to denote eitheran MRSD or four CS digits, and QDS box represents Fig. 4,whose output control signals are shown as “MUX Selector”



Fig. 6. PRC for designs CS-4 and MRSD-4.

and “ADD/SUB.” The ten-way multiplexing is expected tobe highly influential in prolonging the latency. Recalling thePRC, as W [ j + 1] = 16W [ j ] − q j+1D, note that W [ j ] ismaintained either in radix-2 CS or radix-16 MRSD, and themultiples of D are represented in binary. Therefore, the PRCis simply done via a CS addition (for the CS-10 case), and by a4-bit adder per radix-16 digit (for the MRSD-10 case), as wasdescribed in Section II-B. The nonredundant binary dividendX must be converted into redundant representation to providefor W [0]. This is discussed in Section III-C (initialization).

2) CS-4 and MRSD-4 Designs: In order to utilize a smallerselector of divisor multiples, we propose the architectureof Fig. 6, where again QDS box represents Fig. 4, whoseoutput control signals are shown as “MUX Selector” and“ADD/SUB.” Let q j+1 = 6α + β, where α ∈ [−1, 1] andβ ∈ [0, 3]. The overlapped PRC computes W [ j + 1]′ =16W [ j ]+6αD via a CFA and a CFS, since W [ j ] is representedin binary CS or radix-16 MRSD formats. The operation of thispart overlaps with the QDS. The rest of PRC is carried out afterQDS, where β D is selected via the four-input multiplexer,followed by computing W [ j + 1] = W [ j + 1]′ + β D, viaanother CFA.

C. Initialization and Clean-Up

The initialization cycle is responsible for the followingtasks.

1) Converting X to W[0]: This is a cost- and delay-free operation. In case of CS designs, zero-valued bitsare inserted as the second bits of CS representation.However, in MRSD cases, a zero-valued negabit (withlogical status 1) is added per each four bits of X to leadto the radix-16 MRSD representation of W [0].

2) Parallel Computation of the Divisor Multiples:a) CS-10 and MRSD-10 Designs: There are four

shifters (for 2D, 4D, 6D, and 8D) and four paralleladders (for 3D = 2D + D, 5D = 4D + D,7D = 8D − D, and 9D = 8D + D).

b) CS-4 and MRSD-4 Designs: There are only twoshift operations (for 2D and 6D), and one addi-tion (for 3D = 2D + D).

TABLE II

POWER DISSIPATION (mW ) AT MAXIMUM FREQUENCY

3) Parallel Precomputation of the Truncated ComparisonConstants: This is done, as in the following expressionsfor M0 to M8.

a) CS-10 and MRSD-10 Designs:M1 = D/2, M2 = 3D/2, M3 = 5D/2

M4 = 7D/2, M5 = 9D/2, M6 = 5D + M1

M7 = 6D + M1, M8 = 7D + M1

M9 = 8D + M1.

b) CS-4 and MRSD-4 Designs:M1 = D/2, M2 = 3D/2, M3 = 2D + M1

M4 = 3D + M1, M5 = 4D + M1

M6 = 4D + M2, M7 = 6D + M1

M8 = 6D + M2, M9 = 8D + M1.

Other comparison constants are obtained as M−k =−Mk+1, for 0 ≤ k ≤ 8. Note that the implementation ofinitialize block uses only one register for storing the D value,while the generated constants are directly passed to the PRCunit.

The terminal cycle is responsible for the conversion ofthe obtained quotient and remainder to nonredundant binaryformat. Such conversion entails word wide binary addition thatis undertaken via a parallel prefix binary adder in both CS andMRSD cases, which also takes care of the required roundingaddition.

IV. EVALUATION AND COMPARISON

We have described all components of the proposed archi-tectures with hardware description language (HDL) code,tested them for correctness with 30 000 random input data,and synthesized the corresponding circuits with SynopsysDesign Compiler. In addition, Synopsys Power Compilerhas been exploited to extract the power consumption ofthe designs under the tested random input data. The STM90-nm technology was used in order to compare the fig-ures of merit with those of the main reference work [6](i.e., apparently the best amongst the previous works). On theother hand, comparison of synthesis results requires the useof the same technology and the corresponding libraries, whichwe found it possible only in case of [6].

Tables II–IV contain the power (at maximum frequency),area, and delay figures for the components of the proposedarchitectures, respectively, where XFF (Xsetup) refers to theclock-to-Q (register) time.



TABLE III

AREA USAGE (μm2) FOR THE FOUR PROPOSED DIVIDERS

TABLE IV

DELAY ( ps) OF COMPONENTS IN CRITICAL PATH

Some noteworthy conclusions from the contents ofTables II–IV are as follows.

1) The MRSD designs, which use smaller registers, showless power figures than the CS ones (10%–12%).In addition, the QDS of the MRSD-10 (-4), dissipate23% (10%) less power, where the similar reductionmeasure for the PRC is around 20%.

2) The CS designs are slightly faster than their MRSDcounterparts, while the latter is less area consuming.

3) The CS-4 and MRSD-4 designs exhibit 7% and 8%lower delay figures, as compared with CS-10 andMRSD-10 designs, while those of the latter two dissipate12% and 10% less power, respectively.

4) The CS-4 and MRSD-4 designs are less area con-suming (16%), which is mainly due 41% less areaconsumption of the initialization phase. On the otherhand, area figures of the active phases (i.e., QDS andPRC) of CS-10 and MRSD-10 designs consume lessarea, which explains their less power dissipation, despitethat their overall area figures are more.

The total figures of merits on delay, power, energy, energy-delay-product (EDP), total area, area without that of theinitialize stage, E × area, and EDP × area for the proposeddesigns and that of the reference work are compiled in Table V,where the area of a NAND2 is equal to 4.4 μm2, and energy isobtained as the power × 16 × delay in maximum frequency.The power measure at maximum frequency for [6] is estimatedto be 28 mW, based on the reported average power at 100 MHZ(i.e., 2.73 mW per Table VI of [6]). This estimation resultseems to be viable, since the reported power measure in1.2-ns delay is 27.47 mW. In addition, illustrative comparisonof such figures of merit is offered by Fig. 7, where the

superiority of the proposed design in power and energy mea-sures is evident, but at the cost of lower speed and additionalsilicon area, which is mostly due to once executed initializationcircuits. The lowered EDP measures display that the lost speedwithin the proposed design has paid off as the saved energy.More detailed conclusions from the contents of Table V andthorough justifications for improvement in power and EDPmeasures also follow the following.

1) The power dissipation of [6], in comparison to theaverage of the power measures of the four proposeddesigns, is about 62% more. The area consumption of[6] is, on the average, 35% less than that of the proposeddesigns. However, that of the former is more whenexcluding the area measure of the initialization stages.

2) The least achieved delay of [6] is, on the average,26% less than that the proposed designs.

3) The energy figures of the proposed designs are, on theaverage, 48.5% less than that of [6].

4) The EDP of the proposed designs is, on the average,about 30% less than that of [6].

5) The EDP of the MRSD-4 design is, on the average,15.5% less than that of our other three designs.

6) The energy-per division of the MRSD designs are, on theaverage, 10% less than those of the CS designs.

7) The E × Area of all the proposed designs and theEDP × Area of MRSD-4 are less than that of [6].

The considerable power saving (see Table V) of theproposed designs versus [6], despite the considerable lessarea consumption of the latter, can be justified as fol-lows, where implementation details of [6] is obtainedfrom [10].

The additional area of the proposed designs is mostlyconsumed in the initialization logic, which is executed onlyonce per division operation versus the very active recurrenceunit that is executed 14 times. Therefore, such an extra areais not expected to significantly contribute in the overall powerdissipation. On the other hand, although the silicon area con-tributes to power measure via the switching capacitance, otherfactors, such as switching activity, are also influential on thedynamic power dissipation. The latter factor is convincinglyless in our design, as is discussed below.

Other grounds for extra power dissipation in [6] can berecognized as follows.

1) The constant nature of one of the addition operands(i.e., 6D), in our PRC is expected to lead to lessswitching activity.

2) The employed overlapping of [6] leads to 24 signdetection, per cycle, of the partial remainder, while thecorresponding figure in our designs is 18.

3) The sign detection logic in [6] utilizes two 7-bit CSAs,where 1 out of 4 (25%) inputs is constant, while theproposed designs use only one 12-bit adder, with oneconstant operand (50%).

4) The PRC of [6] uses two 57–bit CSAs, while the cor-responding unit in the proposed CS-10 design containsonly one 56-bit CSA. Since the PRC power measureof [6] is not available, regarding the other three proposeddesigns, we can also justify their less power dissipation



Fig. 7. Illustrative comparison of composite figures of merit based on Table V.

TABLE V

COMPOSITE FIGURES OF MERIT

TABLE VI

FIGURES OF MERIT FOR DIFFERENT FO4 CASES

via the following argument. The PRC power dissipationof the CS-10, with almost 50% reduced adder hardware,is 3.27 mW (see Table II). The PRC power measuresof the other three proposed designs (see Table II) areless for MRSD-10 (2.62), MRSD-4 (3.6), and only27% more for CS-4 (4.48). Therefore, our PRC is notexpected to dissipate more power than that of [6], whilewe could not offer a sounder evaluation due to lack ofseparate PRC power measures in [6].

Table VI compiles additional figures of merits (i.e., through-put, average power, energy per division, and performance

per watt) of the reference work [6] and the proposed onesin terms of different target FO4 measures that are obtainedvia synthesizing the divisions under different delay con-straints (based on 45 ps per FO4). The least FO4 met bythe proposed MRSD-4 is 28.5, where that of the referencework is 25. However, for FO4 measures between 30 and 50,the figures for power and energy per division of the proposeddesigns are 27% (averaged on the four designs and fiveFO4 figures) less than those of [6], respectively.

The contents of Table VI reveal that, for all the FO4 con-straints, the power and energy consumptions of the MRSD-4 is



Fig. 8. GFLOPs versus FO4 curves of the proposed and two reference works.

9% less than our other three designs, which is 31% less thanthat of the corresponding figures of [6].

The performance per watt of the proposed design and themain reference [6] that are computed for different FO4 mea-sures are illustrated via the corresponding curves of Fig. 8.In addition, the curve of an energy-efficient radix-8 divider [9]is added, which was not included in our study reports ofprevious works, since its figures of merit are not better thanthose of [6], within our working FO4 delays.

The improved performance of the proposed architectures isevident in Fig. 8. In particular, the MRSD-4 design is superiorcompared with other designs in terms of giga floating-pointoperations per second (GFLOPs) per watt, for delays morethan 28.5 FO4, but that of [6] exhibits less GFLOP measurefor delays less than 28.5 FO4. In addition, the MRSD designsare showing more GFLOP measures than the CS ones.

V. CONCLUSION

We studied (via analytical and synthesis-based evaluationof the corresponding VLSI realizations) the impact of thefollowing two design options on the figures of merit of binarydigit-recurrence division hardware.

1) Representation of partial remainders via high radixredundant number systems. Our representation choice ismaximally redundant radix-16 SD number system withdigit set [−15, 15].

2) Dynamic generation of some divisor multiples in[−9, 9] × D around the precomputed multiple 6D.

We also studied the relevant previous designs, which haveopted for binary CS representation of partial remainders andrepresentation of radix-16 quotient digits via minimally redun-dant radix-4 [−2, 2] digits, which leads to partial dynamicgeneration of divisor multiples.

For better evaluation of the above options, we explored adesign space containing four architectures based on precom-putation of all or part of divisor multiples and CS or MRSDrepresentation of partial remainders. The HDL simulations andsynthesis showed low-power and low-energy advantages of allthe four designs as compared with the best previous work.However, while our designs do not operate as fast as thereference one, the EDP of the proposed designs is 26%–35%less than that of the reference work.

REFERENCES

[1] M. Ercegovac and T. Lang, Digital Arithmetic. San Mateo, CA, USA:Morgan Kaufmann, 2003.

[2] D. L. Harris, S. F. Oberman, and M. A. Horowitz, “SRT divisionarchitectures and implementations,” in Proc. 13th IEEE Symp. Comput.Arithmetic, Jul. 1997, pp. 18–25.

[3] A. Avizienis, “Signed-digit number representations for fast parallelarithmetic,” IRE Trans. Electron. Comput., vol. 10, no. 3, pp. 389–400,Sep. 1961.

[4] H. A. H. Fahmy and M. J. Flynn, “The case for a redundant format infloating point arithmetic,” in Proc. 16th IEEE Symp. Comput. Arithmetic,Santiago de Compostela, Spain, Jun. 2003, pp. 95–102.

[5] S. Gorgin and G. Jaberipur, “A family of high radix signed digit adders,”in Proc. 20th IEEE Symp. Comput. Arithmetic, Tübingen, Germany,Jul. 2011, pp. 112–120.

[6] W. Liu and A. Nannarelli, “Power efficient division and square rootunit,” IEEE Trans. Comput., vol. 61, no. 8, pp. 1059–1070, Aug. 2012.

[7] J. Ebergen and N. Jamadagni, “Radix-2 division algorithms with an over-redundant digit set,” IEEE Trans. Comput., vol. 64, no. 9, pp. 2652–2663,Sep. 2015.

[8] A. Nannarelli and T. Lang, “Low-power divider,” IEEE Trans. Comput.,vol. 48, no. 1, pp. 2–14, Jan. 1999.

[9] A. Nannarelli, “Performance/power space exploration for binary64 divi-sion units,” IEEE Trans. Comput., vol. 65, no. 5, pp. 1671–1677,May 2016.

[10] E. Antelo, T. Lang, P. Montuschi, and A. Nannarelli, “Digit-recurrencedividers with reduced logical depth,” IEEE Trans. Comput., vol. 54,no. 7, pp. 837–851, Jul. 2005.

[11] A. Nannarelli and T. Lang, “Low-power division: Comparison amongimplementations of radix 4, 8 and 16,” in Proc. 14th IEEE Symp.Comput. Arithmetic, Apr. 1999, pp. 60–67.

[12] G. S. Taylor, “Radix 16 SRT dividers with overlapped quotientselection stages: A 225 nanosecond double precision divider for theS-1 Mark IIB,” in Proc. IEEE 7th Symp. Comput. Arithmetic (ARITH),Jun. 1985, pp. 64–71.

[13] T. M. Carter and J. E. Robertson, “Radix-16 signed-digit division,” IEEETrans. Comput., vol. 39, no. 12, pp. 1424–1433, Dec. 1990.

[14] G. Jaberipur and B. Parhami, “Efficient realisation of arithmetic algo-rithms with weighted collection of posibits and negabits,” IET Comput.Digit. Techn., vol. 6, no. 5, pp. 259–268, Sep. 2012.

[15] H. Nikmehr, B. Phillips, and C.-C. Lim, “Fast decimal floating-pointdivision,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14,no. 9, pp. 951–961, Sep. 2006.

[16] A. Kaivani, A. Hosseiny, and G. Jaberipur, “Improving the speedof decimal division,” IET Comput. Digit. Techn., vol. 5, no. 5,pp. 393–404, Sep. 2011.

Saba Amanollahi received the B.Sc. degree incomputer engineering from the University of Tehran,Tehran, Iran, in 2007, and the M.Sc. degree in com-puter architecture from Shahid Beheshti University,Tehran, in 2011, where she is currently pursuing thePh.D. degree in computer architecture.

Her current research interests include computerarithmetic, low power design, and embedded systemdesign.

Ghassem Jaberipur received the B.S. degree inelectrical engineering from the Sharif University ofTechnology, Tehran, Iran, in 1974, the M.S. degreein engineering from the University of California atLos Angeles, Los Angeles, CA, USA, in 1976, theM.S. degree in computer science from the Universityof Wisconsin–Madison, Madison, WI, USA, in 1979,and the Ph.D. degree in computer engineering fromthe Sharif University of Technology in 2004.

He is currently an Associate Professor of computerengineering with the Department of Computer Sci-

ence and Engineering, Shahid Beheshti University, Tehran. He is also with theSchool of Computer Science, Institute for Research in Fundamental Sciences(IPM), Tehran. His current research interests include computer arithmetic.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION …€¦ · partial remainders, and redundant...

Documents

Transcript of IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION …€¦ · partial remainders, and redundant...