Author's personal copy - Πανεπιστήμιο Πατρών

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright


Efficient modulo $2^n \pm 1$ squarers

D. Bakalis a, H.T. Vergos b,*, A. Spyrou b

a Physics Department, University of Patras, 26 500, Greece
b Computer Engineering & Informatics Department, University of Patras, 26 500, Greece

Article info

Article history:
Received 5 October 2010
Received in revised form 8 March 2011
Accepted 8 March 2011
Available online 21 March 2011

Keywords:

Squarers

Modulo arithmetic

Residue number system

Computer arithmetic

Booth encoding

Abstract

Modulo $2^n \pm 1$ squarers are useful components for designing special purpose digital signal processors that internally use a residue number system and for implementing the modulo exponentiators and multiplicative inverses required in cryptographic algorithms. In this paper we propose, in a unified way, architectures for their design that are based on the radix-4 modified Booth encoding. For the modulo $2^n+1$ case, both the normal and the diminished-one representations are considered. Experimental results show that the proposed squarers offer significant savings in implementation area over previous proposals, reaching up to 38% for sufficiently large operand widths, while in many cases a small improvement in execution delay can also be achieved.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Special-purpose digital signal processors (DSPs) and co-processors often adopt a residue number system (RNS) for achieving high operation speeds [1,2]. A non-positional RNS is defined by a set of $L$ moduli, say $\{d_1, d_2, \ldots, d_L\}$, which are pair-wise relatively prime. Let $|A|_M$ denote the modulo $M$ residue of an integer $A$, that is, the least non-negative remainder of the division of $A$ by $M$. $A$ has a unique representation in the RNS, given by the set $\{a_1, a_2, \ldots, a_L\}$ of residues, where $a_i = |A|_{d_i}$ if $A \ge 0$ and $a_i = |D + A|_{d_i}$ if $A < 0$, with $D = d_1 \times d_2 \times \cdots \times d_L$. An operation $\diamond$ over the RNS is defined as $(z_1, z_2, \ldots, z_L) = (a_1, a_2, \ldots, a_L) \diamond (b_1, b_2, \ldots, b_L)$, where $z_i = |a_i \diamond b_i|_{d_i}$. That is, the computation of $z_i$ depends only on $a_i$, $b_i$ and $d_i$, and therefore each $z_i$ can be computed in a separate arithmetic unit, often called a channel. Since each channel operates on small residues instead of large numbers and since all required channels operate in parallel, significant speedup over the binary system may be achieved.
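The channel-wise nature of RNS arithmetic can be illustrated with a small software model. The following is only a sketch: the function names, the example moduli set {255, 256, 257} and the use of Chinese-remainder reconstruction for checking are illustrative assumptions, not part of the architectures discussed in this paper.

```python
from math import prod

# Example moduli of the forms 2^n - 1, 2^n, 2^n + 1 for n = 8 (assumed for illustration).
MODULI = (255, 256, 257)

def to_rns(a, moduli=MODULI):
    """Residues of a; Python's % already yields the least non-negative remainder,
    which also covers the |D + A| rule for negative A."""
    return tuple(a % d for d in moduli)

def rns_op(x, y, op, moduli=MODULI):
    """Apply op independently in every channel: z_i = |x_i op y_i|_{d_i}."""
    return tuple(op(xi, yi) % d for xi, yi, d in zip(x, y, moduli))

def from_rns(residues, moduli=MODULI):
    """Chinese-remainder reconstruction into the range [0, D)."""
    D = prod(moduli)
    total = 0
    for r, d in zip(residues, moduli):
        Di = D // d
        total += r * Di * pow(Di, -1, d)  # modular inverse of Di modulo d
    return total % D

if __name__ == "__main__":
    x, y = 1234, 5678
    z = rns_op(to_rns(x), to_rns(y), lambda a, b: a * b)
    print(z, from_rns(z))
```

Each channel touches only its own residues, which is why the hardware channels can run in parallel.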

RNSs built on moduli of the $2^n$, $2^n-1$ or $2^n+1$ forms have received significant attention [3–7] because the required circuits for the $2^n$ moduli can be straightforwardly derived from the corresponding integer circuits by limiting the result to $n$ bits, while adders [8,9] and multipliers [10–13] have been proposed for the $2^n \pm 1$ moduli that can operate as fast as the integer ones. The $2^n+1$ channel has to deal with operands that are $(n+1)$ bits wide, compared to the $n$-bit wide operands of the modulo $2^n$ and the modulo $2^n-1$ channels. Leibowitz [14] introduced the diminished-one representation, where each number is represented decreased by one compared to its normal representation and all arithmetic operations are inhibited for a zero input operand, since the result can be straightforwardly derived. The diminished-one representation has the advantage that the computations in the modulo $2^n+1$ channel are also restricted to $n$ bits.

Squaring in DSPs can always be performed by driving both inputs of the multiplier with the same operand. However, there are several applications that require a very large number of squaring operations and therefore profit from the use of a dedicated squaring circuit, capable of performing hundreds of millions of squaring operations per second. Such applications include adaptive filtering [15,16], vector quantization, pattern recognition and image compression [17–20], calculation of Euclidean branch metrics [21,22] and function evaluation [23].

Efficient designs for modulo squarers are welcome for special purpose DSPs and for cryptographic algorithm implementations that adopt an RNS [24–26]. Furthermore, modulo squarers can be used in applications where multiplicative inverses or modular exponentiations have to be computed, since, according to Fermat's little theorem, it holds that $|x^{-1}|_p = |x^{p-2}|_p$ when $p$ is prime and $x$ is coprime to $p$. For example, a modulo $2^{16}+1$ squarer can benefit the implementation of the multiplicative inverses modulo $2^{16}+1$ in the international data encryption algorithm (IDEA) when the square-and-multiply algorithm is utilized [27]. Finally, modulo squarers have been proposed for implementing modulo multipliers based on the quarter-square principle [28,29].
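To make the role of the squarer in such an inversion concrete, the following sketch computes $|x^{p-2}|_p$ with a left-to-right square-and-multiply loop for $p = 2^{16}+1$. The function names are hypothetical and `mod_square` merely stands in, as a software placeholder, for the operation a hardware modulo squarer would perform.

```python
def mod_square(x, p):
    """Software stand-in for a dedicated modulo squarer."""
    return (x * x) % p

def mod_inverse_fermat(x, p):
    """Compute |x^(p-2)|_p by left-to-right square-and-multiply.
    Valid when p is prime and x is coprime to p (Fermat's little theorem)."""
    result = 1
    for bit in bin(p - 2)[2:]:          # exponent bits, most significant first
        result = mod_square(result, p)  # one squaring per exponent bit
        if bit == "1":
            result = (result * x) % p   # one multiplication per set bit
    return result

if __name__ == "__main__":
    p = 2**16 + 1  # 65537, which is prime
    x = 12345
    inv = mod_inverse_fermat(x, p)
    print(inv, (x * inv) % p)
```

The loop performs one squaring for every bit of the exponent, so the squaring operation dominates the operation count.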

A modulo squarer can be designed using: (a) a modulo multiplier with both inputs driven by the same operand, (b) minimized logic functions [29], (c) look-up tables [30], or (d) a combinational approach for generating and adding the various partial products [29,31–34]. The first approach is unnecessarily complex and slow. The second approach can be used only for small input operands. The third approach requires unacceptably large memories as the input operand width increases and is more suitable for generic moduli values. The last approach is the most appropriate for modulo $2^n \pm 1$ squarers with medium and large input operand widths.

INTEGRATION, the VLSI journal 44 (2011) 163–174
journal homepage: www.elsevier.com/locate/vlsi
0167-9260/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.vlsi.2011.03.006
* Corresponding author.
E-mail addresses: [email protected] (D. Bakalis), [email protected] (H.T. Vergos), [email protected] (A. Spyrou).

In this paper we present novel architectures for designing modulo $2^n \pm 1$ squarers. For the modulo $2^n+1$ case, we consider both the normal and the diminished-one representation. This paper extends the methodology that was originally presented in [35] for modulo $2^n-1$ squarers to the design of efficient normal and diminished-one modulo $2^n+1$ squarers as well, and provides a methodology that covers, in a unified way, all three moduli cases. Furthermore, in the current state of the art, the diminished-one modulo $2^n+1$ squarers require more area and delay than the corresponding normal modulo $2^n+1$ squarers of the same size. We resolve this inconsistency by proposing diminished-one modulo $2^n+1$ squarers that are faster and smaller than the normal ones. The proposed circuits utilize the modified Booth encoding of the input operand for efficiently generating the partial products, along with a Dadda tree architecture for their summation, and are based on the Booth-folding technique, which was introduced in [36] for the design of binary squarers. Applying the Booth-folding technique to the design of modulo squarers is not straightforward, since one has to: (a) suitably Booth-encode the modulo $2^n-1$ or $2^n+1$ input operand, (b) confine the resulting partial product bits within $n$ bits and (c) derive the correction terms that are required in the modulo $2^n+1$ case. Even though (a) can be based on the Booth encoding already reported in the open literature for modulo $2^n \pm 1$ multipliers [10,11], (b) and (c) are presented in this paper for the first time in the open literature and form its main contributions. We prove that a constant correction term is required in both the normal and the diminished-one modulo $2^n+1$ squarer cases and analytically derive its value in each case. Our experimental results indicate that the proposed squarers offer reduced implementation area compared to previously proposed ones, for sufficiently wide operands, while a small improvement in execution delay is also offered in many cases.

The rest of the paper is organized as follows. The optimization techniques suggested so far for the design of modulo $2^n \pm 1$ squarers and the current state of the art are reviewed in the next section. Efficient modified Booth-encoded architectures are derived in Section 3. Section 4 presents comparative results against previously proposed architectures. Conclusions are drawn in the last section.

2. Background and previous work

A modulo $2^n-1$ squarer accepts an $n$-bit operand $A = a_{n-1}a_{n-2}\ldots a_0$, with $0 \le A = \sum_{i=0}^{n-1} a_i \times 2^i \le 2^n-1$ (assuming a double representation of zero), and produces the $n$-bit result $R = |A^2|_{2^n-1}$. The normal modulo $2^n+1$ squarer accepts an $(n+1)$-bit operand $A = a_n a_{n-1}\ldots a_0$, with $0 \le A = \sum_{i=0}^{n} a_i \times 2^i < 2^n+1$, and computes the $(n+1)$-bit result $R = |A^2|_{2^n+1}$. Finally, denoting by $A_{-1} = a_{n-1}a_{n-2}\ldots a_0$ the $n$-bit diminished-one representation of an $(n+1)$-bit operand $A$, a diminished-one modulo $2^n+1$ squarer accepts $A_{-1}$, with $A_{-1} = \sum_{i=0}^{n-1} a_i \times 2^i = A-1$, $0 < A < 2^n+1$, and computes the $n$-bit diminished-one representation $R_{-1}$ of $|A^2|_{2^n+1}$.
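The three input/output conventions can be captured by simple behavioural reference models, useful for checking any gate-level design against. This is a sketch with hypothetical function names; returning `None` when the diminished-one result would correspond to zero is an assumption made here for illustration, since zero is handled separately in diminished-one arithmetic.

```python
def square_mod_2n_minus_1(a, n):
    """n-bit operand in, n-bit result out. With a double representation of zero,
    the all-ones input also stands for zero and maps to 0."""
    m = (1 << n) - 1
    assert 0 <= a <= m
    return (a * a) % m

def square_mod_2n_plus_1(a, n):
    """Normal representation: (n+1)-bit operand with 0 <= A < 2^n + 1."""
    m = (1 << n) + 1
    assert 0 <= a < m
    return (a * a) % m

def square_diminished_one(a_dim1, n):
    """Diminished-one representation: the n-bit input holds A - 1 for 0 < A < 2^n + 1.
    Returns R - 1 for R = |A^2|_{2^n+1}, or None if R = 0 (assumed convention)."""
    m = (1 << n) + 1
    assert 0 <= a_dim1 < (1 << n)
    r = ((a_dim1 + 1) ** 2) % m
    return None if r == 0 else r - 1

if __name__ == "__main__":
    n = 8
    print(square_mod_2n_minus_1(200, n),
          square_mod_2n_plus_1(200, n),
          square_diminished_one(199, n))
```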

Assuming that a combinational approach is used, a squaring circuit usually consists of three stages, namely the partial product generation, the reduction stage that derives two final addends from the generated partial products and the final two-input parallel adder. The partial product matrix of a modulo $2^n \pm 1$ squarer can be optimized by the following procedures:

(i) Logic and arithmetic simplifications, since $a_i a_i = a_i$ and $a_n a_i = 0$, $\forall i$, $0 \le i < n$.

(ii) Merging pairs of equal partial product bits having the same weight, since $a_i a_j \times 2^{i+j} + a_j a_i \times 2^{i+j} = a_i a_j \times 2^{i+j+1}$.

(iii) Repositioning all partial product bits residing in columns with weights greater than $2^{n-1}$ within the rightmost $n$ columns of the matrix. This is accomplished by taking into account the periodic properties of the powers of 2 taken modulo $2^n \pm 1$ [29]. More specifically,

(a) Modulo $2^n-1$: All partial product bits with weight $2^{n+i}$, $i \ge 0$, can be moved to the column with weight $2^i$, since for every bit $z$ it holds that $|z \times 2^{n+i}|_{2^n-1} = |z \times 2^i|_{2^n-1}$.

(b) Modulo $2^n+1$: All partial product bits with weight $2^{n+i}$, $0 \le i \le n-1$, can be inverted and repositioned in the column with weight $2^i$, since it holds that $|z \times 2^{n+i}|_{2^n+1} = |2^{n+i} + \bar{z} \times 2^i|_{2^n+1}$ ($\bar{z}$ denotes the complement of $z$). An additive correction term equal to $|2^{n+i}|_{2^n+1}$ should be taken into account for each such move. Finally, the $a_n$ partial product bit, which has a weight equal to $2^{2n}$, can be moved to the column with weight $2^0$, since $|a_n \times 2^{2n}|_{2^n+1} = a_n \times 2^0$. Since $a_n$ and $a_0$ cannot both be 1 simultaneously, we can combine them in a logic OR gate instead of adding them.

(c) Diminished-one modulo $2^n+1$: The same repositioning procedure as in (iiib) can be used for deriving the partial product matrix of $A_{-1}^2$, without of course having to reposition the $a_n$ partial product bit, since it holds that $R_{-1} = |A^2 - 1|_{2^n+1} = |(A_{-1}+1)^2 - 1|_{2^n+1} = |A_{-1}^2 + 2 \times A_{-1}|_{2^n+1}$. The $2 \times A_{-1}$ term can be represented, according to (iiib), by the $n$-bit vector $a_{n-2}a_{n-3}\ldots a_0\bar{a}_{n-1}$, as long as an additive correction term of $|2^n|_{2^n+1} = -1$ is also considered.
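The repositioning identities of procedure (iii) are easy to confirm exhaustively for a given $n$. The sketch below does so in software; the function names are hypothetical, and `double_dim1_vector` is only an illustration of the $n$-bit vector used for the $2 \times A_{-1}$ term in (iiic).

```python
def check_repositioning(n):
    """Exhaustively check the identities of (iiia) and (iiib) for every bit value z."""
    m_minus, m_plus = (1 << n) - 1, (1 << n) + 1
    for i in range(n):
        for z in (0, 1):
            # (iiia): a bit of weight 2^(n+i) folds onto the column of weight 2^i.
            assert (z << (n + i)) % m_minus == (z << i) % m_minus
            # (iiib): the bit is inverted, moved to weight 2^i, and 2^(n+i) is added.
            z_bar = 1 - z
            assert (z << (n + i)) % m_plus == ((1 << (n + i)) + (z_bar << i)) % m_plus
    # The a_n bit of weight 2^(2n) folds onto weight 2^0.
    assert pow(2, 2 * n, m_plus) == 1
    return True

def double_dim1_vector(a, n):
    """n-bit vector a_{n-2} ... a_0 ~a_{n-1}: rotate left by one position and
    complement the bit that wraps around."""
    msb = (a >> (n - 1)) & 1
    return ((a << 1) & ((1 << n) - 1)) | (1 - msb)

if __name__ == "__main__":
    n = 8
    print(check_repositioning(n))
    # 2 * A_{-1} equals the vector plus the correction term -1 (modulo 2^n + 1).
    print(all((2 * a) % 257 == (double_dim1_vector(a, n) - 1) % 257 for a in range(256)))
```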

The modulo $2^n-1$ squarers proposed at first in [29] and later in [31] apply procedures (i), (ii) and (iiia) for attaining an $n$-bit wide matrix of partial product bits. For example, Fig. 1(a) presents the partial product matrix of a modulo $2^8-1$ squarer according to [29,31]. The partial products are then reduced to two final addends by an end-around carry (EAC) carry save adder (CSA) tree, composed of full adders. The use of EAC is justified by the fact that the carry-out at the most significant bit (MSB) position of any stage of the adder tree has a weight equal to $2^n$. According to procedure (iiia), the carry output can be added at the least significant bit (LSB) position of the next stage. The two final addends are driven to a fast parallel modulo $2^n-1$ adder that provides the output of the squarer.

For the normal modulo $2^n+1$ squarers case, [32,33] refined the architecture originally proposed by [29]. An $n$-bit wide matrix of partial product bits, such as the one presented in Fig. 1(b) for the $n=8$ case, is derived by applying procedures (i), (ii) and (iiib) ($\vee$ is used to denote the logical OR). A CSA tree with inverted EACs is then used for reducing the partial products to the two final addends. Since the carry output at any stage of the adder tree has a weight equal to $2^n$, it can be inverted and driven to the LSB position of the next stage, according to procedure (iiib), provided that a correction term of $2^n$ is also taken into account. Although a normal modulo $2^n+1$ parallel adder can be used for providing the result, in [32,33] a slightly modified diminished-one modulo $2^n+1$ adder has been used as the final adder. This adder offers the same execution speed as the fastest diminished-one adders [9] but also checks whether its inputs are bit-wise complementary, in which case it produces a 1 at the MSB position. It should be noted that the correction term $T = t_{n-1}\ldots t_0$ used in the partial product matrix accounts for all corrections required, that is, those due to partial product bit repositioning, inverted EACs and the use of a diminished-one adder as the final adder. In [32,33] it has been shown that $T$ has a constant value equal to $T = 2$.

Finally, the modulo $2^n+1$ squarers for diminished-one operands that have been proposed in [34] also derive an $n$-bit wide matrix of partial product bits by applying procedures (i), (ii), (iiib) and (iiic). They also utilize a CSA tree with inverted EACs for partial product reduction and a final diminished-one adder for producing the result. The partial product matrix of a diminished-one modulo $2^8+1$ squarer is shown in Fig. 1(c). The required correction term $T$ is constant in this case too and its value is equal to 0 [34]. However, its addition cannot be omitted, since that would alter the number of EACs in the CSA tree that are accounted for in the correction term.

The architectures presented in [29,32–34] are considered the current state of the art. It should be noted that none of them uses Booth encoding for reducing the number of partial products.

3. Proposed modified Booth-encoded modulo squarers

Modified Booth encoding is a commonly used technique for reducing the number of partial products in binary multipliers and squarers, resulting in more compact implementations than the corresponding non-encoded ones for sufficiently large operands. Since the reduction of the partial products makes the adder tree used to add them shallower, modified Booth-encoded multipliers and squarers may also offer faster implementations, provided that the partial product bits are derived fast enough. In this section, we propose modified Booth-encoded modulo $2^n \pm 1$ squarers. In the following we concentrate only on the partial product matrix of each proposed architecture, since the adder tree that reduces the partial products and the final adder follow the same schemes as those used in the previous proposals. At first we consider the diminished-one representation for the modulo $2^n+1$ case, since the normal modulo $2^n+1$ squarers can be designed as a straightforward extension of the corresponding diminished-one ones. For simplicity, in the following we assume that $n$ is even and use the $n=8$ case as an example.

According to the radix-4 modified Booth encoding, an $n$-bit operand $A = a_{n-1}a_{n-2}\ldots a_0$ can be rewritten as $A = \sum_{i=0}^{n/2-1} 2^{2i} \times (-2a_{2i+1} + a_{2i} + a_{2i-1}) = \sum_{i=0}^{n/2-1} (2^{2i} \times A_i)$, where $a_{-1} = 0$ and $A_i$ is a Booth-encoded digit, with $A_i = (-2a_{2i+1} + a_{2i} + a_{2i-1}) \in [-2, +2]$. It has been shown in the past that the same encoding can be applied to modulo $r$ operands, where $r$ denotes either the modulo $2^n-1$ or the diminished-one modulo $2^n+1$ case. Specifically, it holds that $|A|_r = \left|\sum_{i=0}^{n/2-1} (2^{2i} \times A_i)\right|_r$, with $a_{-1} = a_{n-1}$ and $a_{-1} = \bar{a}_{n-1}$ in the modulo $2^n-1$ [11] and diminished-one modulo $2^n+1$ [10] cases, respectively.
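The modulo variants of the encoding differ only in the bit that plays the role of $a_{-1}$, which can be checked with a short behavioural model. This is a sketch under stated assumptions: the function names and the `wrap_bit` parameter are hypothetical. In the diminished-one case the digit sum comes out congruent to the stored word plus one, that is, to the value the word represents.

```python
def booth_digits(a, n, wrap_bit):
    """Radix-4 modified Booth digits A_i = -2*a_{2i+1} + a_{2i} + a_{2i-1}
    of an n-bit word (n even); wrap_bit plays the role of a_{-1}."""
    bits = [(a >> j) & 1 for j in range(n)]
    digits, prev = [], wrap_bit
    for i in range(n // 2):
        digits.append(-2 * bits[2 * i + 1] + bits[2 * i] + prev)
        prev = bits[2 * i + 1]
    return digits

def booth_value(digits):
    """Value of the digit string: sum of 2^(2i) * A_i."""
    return sum(d * 4**i for i, d in enumerate(digits))

if __name__ == "__main__":
    n = 8
    for a in range(1 << n):
        msb = (a >> (n - 1)) & 1
        # Modulo 2^n - 1: a_{-1} = a_{n-1}.
        assert booth_value(booth_digits(a, n, msb)) % 255 == a % 255
        # Diminished-one modulo 2^n + 1: a_{-1} is the complement of a_{n-1};
        # the digits then encode the represented value a + 1.
        assert booth_value(booth_digits(a, n, 1 - msb)) % 257 == (a + 1) % 257
    print(booth_digits(0b10110110, n, 1))
```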

The square of $A$ in modulo $r$ arithmetic can be computed using the modified Booth-encoded digits of $A$, instead of its bits, as follows [36]:

\[
\begin{aligned}
|A^2|_r &= \left|\, \left|\sum_{i=0}^{n/2-1} 2^{2i} \times A_i\right|_r \times \left|\sum_{i=0}^{n/2-1} 2^{2i} \times A_i\right|_r \,\right|_r \\
&= \left|\left(\sum_{i=0}^{n/2-1} 2^{2i} \times A_i\right) \times \left(\sum_{i=0}^{n/2-1} 2^{2i} \times A_i\right)\right|_r \\
&= \left|\sum_{i=0}^{n/2-1} 2^{4i} \times (A_i \times A_i) + \sum_{i=0}^{n/2-2} 2^{4i+3} \left(\sum_{k=i+1}^{n/2-1} 2^{2(k-i-1)} \times A_i \times A_k\right)\right|_r \\
&= \left|\sum_{i=0}^{n/2-1} 2^{4i} \times C_i + \sum_{i=0}^{n/2-2} 2^{4i+3} \times P_i\right|_r ,
\end{aligned}
\]

where $C_i = A_i \times A_i$ and $P_i = \sum_{k=i+1}^{n/2-1} (2^{2(k-i-1)} \times A_i \times A_k)$. The $C_i$ terms, $0 \le i < n/2$, are unsigned numbers that can only assume the values 0, 1 or 4. Each one can therefore be represented by 3 bits. However, since the middle bit is always equal to 0, it need not be included in the partial product matrix. On the other hand, the $P_i$ terms are signed two's complement numbers. Each $P_i$ term, $0 \le i < n/2-1$, requires $(n-1-2i)$ bits for its representation. Let $C_{i,j}$ and $P_{i,j}$ denote the $j$-th bit of $C_i$ and $P_i$, respectively. Fig. 2 presents the partial product matrix that can be derived for the square of an 8-bit operand $A$ in modulo $r$ arithmetic according to the above analysis.

Fig. 1. Partial product matrix for (a) $|A^2|_{2^8-1}$, (b) $|A^2|_{2^8+1}$ in the normal representation and (c) $|A^2|_{2^8+1}$ in the diminished-one representation.

For attaining an $n$-bit wide matrix, we need to reposition all partial product bits with weights greater than $2^{n-1}$. Repositioning the $C_{i,j}$ bits can be accomplished by the procedures (iiia) and (iiib) described in Section 2. The $P_{i,j}$ bits, however, need to be treated differently, since the $P_i$ terms represent signed two's complement numbers. Let the most significant bit of each $P_i$ term be denoted as $P_{i,MSB}$. It then holds that:

• Modulo $2^n-1$: Procedure (iiia) can be applied for all $P_{i,j}$ bits except the $P_{i,MSB}$ bits. Each $P_{i,MSB}$ bit represents the sign of a $P_i$ term and has a weight equal to $-2^{n+2i+1}$. Therefore:

\[
\left|\sum_{i=0}^{n/2-2} (-2^{n+2i+1} \times P_{i,MSB})\right|_{2^n-1}
= \left|-\sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,MSB}\right|_{2^n-1}
= \left|(2^n-1) - \sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,MSB}\right|_{2^n-1} .
\]

Taking into account that for every bit $z$ it holds that $1 - z = \bar{z}$, we have that:

\[
\left|\sum_{i=0}^{n/2-2} (-2^{n+2i+1} \times P_{i,MSB})\right|_{2^n-1}
= 1\;1\;\bar{P}_{n/2-2,MSB}\;1\;\ldots\;\bar{P}_{1,MSB}\;1\;\bar{P}_{0,MSB}\;1 .
\]

Hence, we invert every $P_{i,MSB}$ bit, move it to the column with weight $2^{2i+1}$ and fill the remaining columns with 1s.

• Diminished-one modulo $2^n+1$: We can similarly apply procedure (iiib) for all $P_{i,j}$ bits except the MSB of each $P_i$. For the $P_{i,MSB}$ bits we have that:

\[
\left|\sum_{i=0}^{n/2-2} (-2^{n+2i+1} \times P_{i,MSB})\right|_{2^n+1}
= \left|\sum_{i=0}^{n/2-2} (-1) \times (-2^{2i+1} \times P_{i,MSB})\right|_{2^n+1}
= \left|\sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,MSB}\right|_{2^n+1} .
\]

Hence, each $P_{i,MSB}$ can be simply moved to the column with weight $2^{2i+1}$ and no further action is required.
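The decomposition of the square into $C_i$ and $P_i$ terms can be checked numerically before any bit-level repositioning is considered. The sketch below is a behavioural model only; the function names are hypothetical and the Booth digit routine is repeated so that the block is self-contained.

```python
def booth_digits(a, n, wrap_bit):
    """Radix-4 Booth digits of an n-bit word (n even); wrap_bit acts as a_{-1}."""
    bits = [(a >> j) & 1 for j in range(n)]
    digits, prev = [], wrap_bit
    for i in range(n // 2):
        digits.append(-2 * bits[2 * i + 1] + bits[2 * i] + prev)
        prev = bits[2 * i + 1]
    return digits

def square_terms(digits):
    """C_i = A_i * A_i and P_i = sum over k > i of 2^(2(k-i-1)) * A_i * A_k."""
    h = len(digits)
    C = [d * d for d in digits]
    P = [sum(4 ** (k - i - 1) * digits[i] * digits[k] for k in range(i + 1, h))
         for i in range(h - 1)]
    return C, P

def square_from_terms(C, P):
    """Recombine: sum of 2^(4i) * C_i plus sum of 2^(4i+3) * P_i."""
    return (sum(16 ** i * c for i, c in enumerate(C))
            + sum(2 ** (4 * i + 3) * p for i, p in enumerate(P)))

if __name__ == "__main__":
    n = 8
    for a in range(1 << n):
        msb = (a >> (n - 1)) & 1
        C, P = square_terms(booth_digits(a, n, msb))          # modulo 2^n - 1 case
        assert square_from_terms(C, P) % 255 == (a * a) % 255
        C, P = square_terms(booth_digits(a, n, 1 - msb))      # diminished-one case
        assert square_from_terms(C, P) % 257 == ((a + 1) ** 2) % 257
    print("decomposition verified for n =", n)
```

In the diminished-one case the recombined value is congruent to the square of the represented operand $A = A_{-1} + 1$.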

Fig. 3(a) and (b) present the derived $n$-bit wide partial product matrices for the proposed modified Booth-encoded modulo $2^8-1$ and diminished-one modulo $2^8+1$ squarers. The last partial product $T = t_7\ldots t_0$ of Fig. 3(b) incorporates all correction terms required. It can be shown that $T$ has a constant value, which is equal to $T_A = 88\ldots8_{16}$ for even values of $n$ that are multiples of 4 (a proof is provided in A.1) and equal to $T_B = 22\ldots2_{16}$ for even values of $n$ that are not multiples of 4 (a proof is provided in A.2). Subscript 16 denotes a hexadecimal value. Fig. 3(c) and (d) present the corresponding circuit implementations. HA, FA and FA+ denote half adders, full adders and simplified full adders with an input connected to logic 1, respectively. The final adders can be designed according to any desirable architecture. The $C_{i,j}$ and $P_{i,j}$ bits can be derived using the simple circuits proposed in [36]. For example, the circuits for deriving the $C_i$ and $P_i$ terms, assuming a four-bit coding, for the modulo squarer of Fig. 3(a) are presented in Fig. 4.

Let us now consider the case of modified Booth-encoded modulo $2^n+1$ squarers for an $(n+1)$-bit operand $A$ in the normal representation. At first we assume that $A < 2^n$, that is, $a_n = 0$. If the $n$ LSBs of $A$ are driven to the diminished-one modulo $2^n+1$ squarer derived before, it will compute the $n$ LSBs of $|(A+1)^2 - 1|_{2^n+1}$, that is, it will compute $|A^2 + 2A|_{2^n+1}$. As a result, we can use an extra partial product equal to $|-2A|_{2^n+1}$ in the partial product matrix of the diminished-one squarer for deriving the $n$ LSBs of $|A^2|_{2^n+1}$. The $|-2A|_{2^n+1}$ term can be expressed as the $n$-bit vector $\bar{a}_{n-2}\bar{a}_{n-3}\ldots\bar{a}_0 a_{n-1}$, provided that an additional correction term equal to 3 is also taken into account. The most significant bit of the result can be derived by using the slightly modified diminished-one adder [32,33,37] as the final adder, as explained in Section 2. Finally, when $A = 2^n$, $a_n = 1$ and all the remaining bits are 0. Since in this case $|A^2|_{2^n+1} = |2^{2n}|_{2^n+1} = 1$, we can just position $a_n$ at the column with weight $2^0$ and logically OR it with the $a_{n-1}$ bit of the partial product added for $|-2A|_{2^n+1}$. Fig. 5(a) and (b) present the partial product matrix and the circuit implementation of the proposed normal modulo $2^8+1$ squarer. All required correction terms are merged into an $n$-bit vector $T$ whose value is constant and is equal to the value of the corresponding diminished-one correction term increased by 2 (see Appendix B for a proof). Note that the proposed architectures for the modulo $2^n+1$ case lead to modulo squarers for the diminished-one representation that are smaller and faster than the corresponding squarers for the normal representation, whereas in the previously reported architectures [32–34] the diminished-one squarers are less efficient than the normal ones.

Fig. 2. Partial product matrix for the modulo $2^8 \pm 1$ Booth-encoded squarer.
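The extra partial product used for the normal representation can be checked in isolation. The sketch below builds the $n$-bit vector with the $n-1$ low-order bits complemented and shifted up by one position and with $a_{n-1}$ in the least significant position, then confirms that adding 3 yields $|-2A|_{2^n+1}$ for every $A < 2^n$. The function names are hypothetical, and the diminished-one squarer is modelled behaviourally rather than structurally.

```python
def neg_double_vector(a, n):
    """n-bit vector ~a_{n-2} ... ~a_0 a_{n-1} for an operand with a_n = 0."""
    mask = (1 << n) - 1
    msb = (a >> (n - 1)) & 1
    inverted_low = (~a) & (mask >> 1)   # complements of a_{n-2} .. a_0
    return (inverted_low << 1) | msb

def normal_square_via_dim1(a, n):
    """Model of the idea in the text for A < 2^n: the diminished-one squarer fed
    with the n LSBs of A yields |(A+1)^2 - 1|; adding |-2A| gives |A^2|."""
    m = (1 << n) + 1
    dim1_result = ((a + 1) ** 2 - 1) % m          # behavioural stand-in
    return (dim1_result + neg_double_vector(a, n) + 3) % m

if __name__ == "__main__":
    n, m = 8, 257
    print(all((-2 * a) % m == (neg_double_vector(a, n) + 3) % m for a in range(1 << n)))
    print(all(normal_square_via_dim1(a, n) == (a * a) % m for a in range(1 << n)))
```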

For odd values of $n$, the previously described modified Booth-encoded modulo $2^n \pm 1$ architectures can also be used, as long as the input operand is augmented by one bit, by adding a 0 to the MSB position. Then $(n+1)/2$ $C_i$ terms and $(n+1)/2 - 1$ $P_i$ terms can be derived and the procedures (iiia) and (iiib) of Section 2 can be applied as before in order to derive an $n$-bit wide partial product matrix. As an example, Fig. 6 presents the partial product matrix and the circuit implementation of a modulo $2^7-1$ Booth-encoded squarer, according to the proposed architecture.

Table 1 presents closed forms for the total number of partial product bits as well as the maximum height among all columns of the partial product matrices, for both the proposed architectures and those of [29,32–34] and for the three examined modulo cases. The closed forms of Table 1, as well as the partial product matrices of Figs. 1, 3 and 5 for the $n=8$ case, indicate that the proposed architectures achieve considerable reductions in both the total number of bits and the maximum height of the partial product matrices. The number of partial product bits provides an indication of the total number of FAs that will be required for reducing the partial products into the two final summands. Assuming that this reduction is performed by an adder tree, the logarithm of the maximum height of the matrix provides an indication of the number of FA stages that will be required. It should, however, be kept in mind that the actual delay savings also depend on the delay of the partial product generation circuits and on whether this delay can be hidden by driving the partial product bits that are derived later to the next stages of the adder tree.

Table 1. Number of bits and maximum height of the partial product matrix.

Quantity | Modulo $2^n-1$ [29] | Modulo $2^n-1$ Proposed | Modulo $2^n+1$ Normal [32,33] | Modulo $2^n+1$ Normal Proposed | Modulo $2^n+1$ Diminished-one [34] | Modulo $2^n+1$ Diminished-one Proposed
Number of bits | $\frac{n^2+n}{2}$ | $\frac{n^2+6n}{4}$ | $\frac{n^2+3n}{2}$ | $\frac{n^2+12n-4}{4}$ | $\frac{n^2+5n}{2}$ | $\frac{n^2+8n-4}{4}$
Maximum height | $\frac{n}{2}+2$ | $\lfloor\frac{n}{4}\rfloor+3$ | $\frac{n}{2}+3$ | $\lfloor\frac{n}{4}\rfloor+4$ | $\frac{n}{2}+4$ | $\lfloor\frac{n}{4}\rfloor+3$

Fig. 3. Partial product matrices and circuit implementations of the proposed (a), (c) modulo $2^8-1$ and (b), (d) diminished-one modulo $2^8+1$ squarers.

Fig. 4. Radix-4 Booth encoding and $C_i$, $P_i$ generation circuits.

Fig. 5. (a) Partial product matrix and (b) circuit implementation of the proposed normal modulo $2^8+1$ squarer.

Fig. 6. (a) Partial product matrix and (b) circuit implementation of the proposed modulo $2^7-1$ squarer.
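The closed forms of Table 1 can be tabulated for concrete operand widths to see how the gap between the previous and the proposed matrices grows with $n$. The following sketch simply evaluates those expressions for even $n$; the function name and dictionary keys are hypothetical labels.

```python
def table1(n):
    """Evaluate the Table 1 closed forms for even n.
    Returns {design: (number of partial product bits, maximum column height)}."""
    assert n % 2 == 0, "the closed forms are stated for even n"
    return {
        "mod 2^n-1, [29]":                  ((n * n + n) // 2,          n // 2 + 2),
        "mod 2^n-1, proposed":              ((n * n + 6 * n) // 4,      n // 4 + 3),
        "mod 2^n+1 normal, [32,33]":        ((n * n + 3 * n) // 2,      n // 2 + 3),
        "mod 2^n+1 normal, proposed":       ((n * n + 12 * n - 4) // 4, n // 4 + 4),
        "mod 2^n+1 dim-one, [34]":          ((n * n + 5 * n) // 2,      n // 2 + 4),
        "mod 2^n+1 dim-one, proposed":      ((n * n + 8 * n - 4) // 4,  n // 4 + 3),
    }

if __name__ == "__main__":
    for n in (8, 16, 32):
        print("n =", n)
        for design, (bits, height) in table1(n).items():
            print(f"  {design:32s} bits = {bits:4d}  max height = {height}")
```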

4. Evaluation and comparisons

In this section we compare the proposed modulo squarers with those presented in [29,32–34]. We assume even values for $n$. We do not consider small values of $n$ ($n < 8$) in our comparisons, since efficient designs for modulo squarers for these cases can be derived by the minimized logic equations given in [29] or by using look-up tables. In the following we assume that the reduction of the partial products into the two final summands is performed by an adder tree [38]. For the final two-operand adders we assume the implementations reported in [8,9,37].

At first, we consider the unit-gate model [39], which provides estimations independent of any implementation technology. Table 2 presents the area and delay of logic gates and basic cells in gate equivalents according to this model. For the proposed modulo squarers, we assume the four-bit Booth coding implementation of Fig. 4.

The area (in gate equivalents) of a modulo squarer consists of the area required for deriving the partial product bits, the area required for reducing the partial products to two final $n$-bit vectors and the area required for the final adder. Table 3 lists the basic cells, the unit-gate area of each basic cell, the number of basic cells and the total unit-gate area for each modulo squarer. Regarding the area required for the addition of the partial products, let $B$ denote the number of bits in the partial product matrix. Since a FA is equivalent to a 3:2 compressor, we have to use $(B-2n)$ FAs for reducing the $B$ bits of the partial product matrix to the two $n$-bit vectors that will then be added by the final adder. Furthermore, if $x$ denotes the number of the constant bits that are present in the partial product matrix, then $x$ FAs can be replaced by $x$ HAs or FA+s, according to each constant value.

The delay (in gate equivalents) of a modulo squarer depends on the delay for deriving the partial product bits, the delay for reducing the partial products to two final $n$-bit vectors and the delay of the final adder. Table 4 lists the unit-gate delay of each step and the total delay of each modulo squarer for various values of $n$. From Figs. 1, 3 and 5 it is evident that the partial product matrices are not fully balanced, that is, they do not have the same number of partial product bits in every column. Hence, a series of HAs, FA+s or FAs have to be initially used for balancing the partial product matrix. Their delay may or may not be on the critical path of the corresponding modulo squarers, depending on the specific modulo squarer architecture and the specific value of $n$. Once fully balanced, a matrix with $k$ bits in every column requires $\theta(k)$ levels of FAs assuming a Dadda architecture, or equivalently $4 \times \theta(k)$ gate equivalents, where $\theta(k)$ denotes the minimum number of levels of an adder tree that processes $k$ input operands [29]. In cases where a constant bit is present in every column of the fully balanced partial product matrix, the corresponding FA that processes it can be replaced by a HA or a FA+, depending on the specific constant value. For some modulo squarers and some values of $n$, this reduces the delay of the adder tree to $4 \times \theta(k-1) + 2$ gate equivalents instead of $4 \times \theta(k)$.
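The level count $\theta(k)$ can be computed from the usual height sequence of carry-save trees built from 3:2 compressors (2, 3, 4, 6, 9, 13, ...). The sketch below assumes that this standard sequence is what $\theta(k)$ refers to and uses the unit-gate FA delay of 4 from Table 2; the function names are hypothetical.

```python
def adder_tree_levels(k):
    """Minimum number of full-adder levels needed to reduce k operands to 2,
    using the capacity recurrence c -> floor(3c / 2) starting from c = 2."""
    levels, capacity = 0, 2
    while capacity < k:
        capacity = capacity * 3 // 2
        levels += 1
    return levels

def tree_delay(k, constant_bit_in_every_column=False):
    """Unit-gate delay of the reduction tree: 4 * theta(k), or 4 * theta(k-1) + 2
    when one operand is a constant row handled by HAs or FA+s."""
    if constant_bit_in_every_column:
        return 4 * adder_tree_levels(k - 1) + 2
    return 4 * adder_tree_levels(k)

if __name__ == "__main__":
    for k in range(3, 11):
        print(k, adder_tree_levels(k), tree_delay(k), tree_delay(k, True))
```

For instance, five rows normally need three FA levels (12 gate equivalents), but if one of the five rows is constant the formula gives $4 \times \theta(4) + 2 = 10$.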

Table 5 lists the area and delay estimates derived for the various architectures under comparison and for several values of $n$. Area savings up to 17%, 15% and 29% can be observed for the three modulo cases, respectively. The only exception is the normal modulo $2^n+1$ squarer, where for small values of $n$ (8 and 12) the cost of the Booth encoding logic, together with the fact that the proposed squarer is derived by adding an additional partial product to the corresponding diminished-one Booth-encoded modulo squarer, outweighs the savings offered by the reduction of the partial products. The larger the width of the input operand, the larger the area savings. It should be noted, however, that the unit-gate area savings can only be considered as lower bounds, since the $P_i$ terms in the proposed Booth-encoded squarers can be implemented much more efficiently in CMOS VLSI technologies than predicted by the unit-gate model, by the use of OR-AND-INVERT compound gates. In the majority of the examined cases, the proposed squarers also offer a small delay reduction, equal to the delay of one HA or FA+ or FA. In most of the remaining cases they offer the same delay as the previous proposals. The only exception appears in the $n=8$ case, where the reduced maximum height of the partial product matrix is offset by the increased delay for generating the partial product bits.

Finally, the previously reported and the proposed squarer architectures were described in HDL for $n$ = 8, 12, 16, 20 and 32. After simulating the resulting descriptions, the designs were mapped to a 90 nm CMOS standard cell library [40] that provides a single poly-layer and up to nine metal layers, using the Synopsys Design Compiler tool. A NAND gate in this technology with a 1X drive strength requires an implementation area of 3.136 μm². Typical process parameters (1.0 V, 25 °C) were used. A bottom-up approach was followed during mapping. Each basic cell (for example, the half and full adder cell) was iteratively optimized until no further delay savings were possible. The cells then underwent successive area recovery steps. Finally, "do not touch" primitives were applied to them. Then, the same optimization procedure was applied to every subcomponent and successively to the whole circuit. In this way, every design was mapped in the target technology as an interconnection of already optimized blocks and the architecture in each description was preserved as much as possible. All constraints, such as maximum fan-out, output capacitance, and available input drive strength, were kept constant for all architectures. Table 6 lists the attained area and delay results. Parentheses in the rightmost column indicate the theoretical delay savings that are expected. The results validate the estimations and conclusions drawn before. The proposed modulo squarers, for medium and large values of $n$, achieve significant area savings (up to 38%), while in most cases they also offer small delay improvements. The small deviations between estimations and results are attributed to the optimization algorithm of the synthesis tool.

5. Conclusions

Efficient modulo squarers are highly appreciated in special-purpose digital signal processors that use a residue number

Table 2. Unit gate model.

| Logic gate / basic cell | Area | Delay |
|---|---|---|
| NOT, BUF | 0 | 0 |
| AND, OR, NAND, NOR (two-input) | 1 | 1 |
| XOR, XNOR (two-input) | 2 | 2 |
| HA/FA+ | 3 | 2 |
| FA | 7 | 4 |

D. Bakalis et al. / INTEGRATION, the VLSI journal 44 (2011) 163–174 169


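Table 3. Unit-gate area analysis. The first three data columns are previously proposed squarers; the last three are the proposed Booth-encoded squarers.

| Basic cells | Modulo $2^n-1$ [29] | Modulo $2^n+1$ [32,33] | Modulo $2^n+1$ [34] | Proposed modulo $2^n-1$ | Proposed modulo $2^n+1$ (normal) | Proposed modulo $2^n+1$ (diminished-one) |
|---|---|---|---|---|---|---|
| Bit generation | | | | | | |
| # AND/NAND (1 eq. gate) | $n^2/2 - n/2$ | $n^2/2 - n/2$ | $n^2/2 - n/2$ | – | – | – |
| # Booth encoders (6 eq. gates) | – | – | – | $n/2$ | $n/2$ | $n/2$ |
| # XNOR ($b_i$ bits) (2 eq. gates) | – | – | – | $n^2/4 - n/2$ | $n^2/4 - n/2$ | $n^2/4 - n/2$ |
| # NOR ($C_{i,2}$ bits) (1 eq. gate) | – | – | – | $n/2$ | $n/2$ | $n/2$ |
| # NOR ($P_{i,0}$ bits) (1 eq. gate) | – | – | – | $n/2 - 1$ | $n/2 - 1$ | $n/2 - 1$ |
| # NOR–OR/NOR–NOR ($P_{i,j}$ bits) (4 eq. gates) | – | – | – | $n^2/4 - n/2$ | $n^2/4 - n/2$ | $n^2/4 - n/2$ |
| # OR (normal modulo $2^n+1$) (1 eq. gate) | – | 1 | – | – | 1 | – |
| Bit reduction | | | | | | |
| # FA (7 eq. gates) | $n^2/2 - 3n/2$ | $n^2/2 - 3n/2$ | $n^2/2 - n/2$ | $n^2/4 - 1$ | $n^2/4 - 1$ | $n^2/4 - n - 1$ |
| # HA/FA+ (3 eq. gates) | – | $n$ | $n$ | $n/2 + 1$ | $n$ | $n$ |
| Final addition | | | | | | |
| # Modulo $2^n-1$ adder [8] ($3n\log n + 4n$ eq. gates) | 1 | – | – | 1 | – | – |
| # Normal modulo $2^n+1$ adder [38] ($\tfrac{9}{2}n\log n + \tfrac{n}{2} + 7$ eq. gates) | – | 1 | – | – | 1 | – |
| # Diminished-one modulo $2^n+1$ adder [38] ($\tfrac{9}{2}n\log n + \tfrac{n}{2} + 5$ eq. gates) | – | – | 1 | – | – | 1 |
| Total area (eq. gates) | $4n^2 + 3n\log n - 7n$ | $4n^2 + \tfrac{9}{2}n\log n - \tfrac{15}{2}n + 8$ | $4n^2 + \tfrac{9}{2}n\log n - \tfrac{1}{2}n + 5$ | $\tfrac{13}{4}n^2 + 3n\log n - \tfrac{1}{2}n - 5$ | $\tfrac{13}{4}n^2 + \tfrac{9}{2}n\log n + \tfrac{9}{2}n$ | $\tfrac{13}{4}n^2 + \tfrac{9}{2}n\log n - \tfrac{5}{2}n - 3$ |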


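Table 4. Unit-gate delay analysis (assuming even values of n for the previously reported squarers and values of n that are multiples of four for the proposed squarers).

| Squarer | Bit generation (eq. gates) | Bit reduction (adder tree) (eq. gates) | Final addition (eq. gates) | Total delay (eq. gates) |
|---|---|---|---|---|
| Modulo $2^n-1$ | | | | |
| [29] | 1 | $4\theta(n/2)$ or $4\theta(n/2)+4$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2)+2\lceil\log n\rceil+4$ (n = 8, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64) or $4\theta(n/2)+2\lceil\log n\rceil+8$ (n = 12) |
| Proposed | 2, 3, 5 | $4\theta(n/4+1)$ or $4\theta(n/4+1)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+1)+2\lceil\log n\rceil+8$ (n = 16, 24, 28, 36, 40, 44, 52, 56, 60, 64) or $4\theta(n/4+1)+2\lceil\log n\rceil+10$ (n = 8, 12, 20, 32, 48) |
| Normal modulo $2^n+1$ | | | | |
| [32,33] | 1 | $4\theta(n/2)+2$ or $4\theta(n/2)+1$ or $4\theta(n/2+1)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2)+2\lceil\log n\rceil+6$ (n = 8, 56) or $4\theta(n/2+1)+2\lceil\log n\rceil+4$ (n = 12, 20, 24, 28, 32, 36, 40, 44, 48, 52, 60, 64) or $4\theta(n/2+1)+2\lceil\log n\rceil+6$ (n = 16) |
| Proposed | 2, 3, 5 | $4\theta(n/4+2)+2$ or $4\theta(n/4+3)$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+2)+2\lceil\log n\rceil+10$ (n = 8, 16, 28, 44) or $4\theta(n/4+3)+2\lceil\log n\rceil+8$ (n = 12, 20, 24, 32, 36, 40, 48, 52, 56, 60, 64) |
| Diminished-one modulo $2^n+1$ | | | | |
| [34] | 1 | $4\theta(n/2+2)$ or $4\theta(n/2+2)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2+2)+2\lceil\log n\rceil+3$ (n = 16, 24, 36) or $4\theta(n/2+2)+2\lceil\log n\rceil+4$ (n = 12, 20, 28, 32, 40, 44, 48, 52, 56, 60, 64) or $4\theta(n/2+2)+2\lceil\log n\rceil+5$ (n = 8) |
| Proposed | 2, 3, 5 | $4\theta(n/4+1)+2$ or $4\theta(n/4+2)$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+1)+2\lceil\log n\rceil+10$ (n = 8, 12, 20, 32, 48) or $4\theta(n/4+2)+2\lceil\log n\rceil+8$ (n = 16, 24, 28, 36, 40, 44, 52, 56, 60, 64) |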
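The total-delay expressions of Table 4 can be evaluated numerically. The sketch below does so for the two modulo $2^n - 1$ rows, under the assumption that the adder-tree depth function appearing in Table 4 (written θ here) is the number of carry-save levels of a Dadda tree, following the height sequence 2, 3, 4, 6, 9, 13, ... The function names are hypothetical, and the sets of n that select each additive constant are copied from the table.

```python
def csa_levels(k):
    """Carry-save adder levels needed to reduce k operands to two.

    Assumption: theta(k) follows the Dadda height sequence 2, 3, 4, 6, 9, 13, ...
    """
    levels, capacity = 0, 2
    while capacity < k:
        capacity = capacity * 3 // 2  # next Dadda height: floor(1.5 * current)
        levels += 1
    return levels


def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, without floating point."""
    return (n - 1).bit_length()


def delay_ref29(n):
    """Total unit-gate delay of the modulo 2^n-1 squarer of [29] (Table 4)."""
    extra = 8 if n == 12 else 4
    return 4 * csa_levels(n // 2) + 2 * ceil_log2(n) + extra


def delay_proposed(n):
    """Total unit-gate delay of the proposed modulo 2^n-1 squarer (Table 4)."""
    extra = 10 if n in (8, 12, 20, 32, 48) else 8
    return 4 * csa_levels(n // 4 + 1) + 2 * ceil_log2(n) + extra


widths = range(8, 65, 4)
print([delay_ref29(n) for n in widths])
print([delay_proposed(n) for n in widths])
```

Under this assumption the two printed lists coincide with the modulo $2^n - 1$ delay rows of Table 5.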

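Table 5. Unit gate area and delay estimates in equivalent gates.

| n | 8 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Modulo $2^n-1$ squarers | | | | | | | | | | | | | | | |
| Area [29] | 272 | 621 | 1104 | 1719 | 2466 | 3344 | 4352 | 5490 | 6759 | 8157 | 9684 | 11,341 | 13,128 | 15,043 | 17,088 |
| Area, proposed | 271 | 586 | 1011 | 1544 | 2185 | 2933 | 3787 | 4747 | 5814 | 6986 | 8263 | 9646 | 11,135 | 12,728 | 14,427 |
| Area savings | 1 | 35 | 93 | 175 | 281 | 411 | 565 | 743 | 945 | 1171 | 1421 | 1695 | 1993 | 2315 | 2661 |
| Delay [29] | 18 | 28 | 28 | 34 | 34 | 38 | 38 | 40 | 44 | 44 | 44 | 44 | 44 | 48 | 48 |
| Delay, proposed | 20 | 26 | 28 | 32 | 34 | 34 | 36 | 40 | 40 | 40 | 42 | 44 | 44 | 44 | 44 |
| Delay savings | −2 | 2 | 0 | 2 | 0 | 4 | 2 | 0 | 4 | 4 | 2 | 0 | 0 | 4 | 4 |
| Normal modulo $2^n+1$ squarers | | | | | | | | | | | | | | | |
| Area [32,33] | 312 | 688 | 1200 | 1847 | 2627 | 3540 | 4584 | 5760 | 7066 | 8503 | 10,070 | 11,768 | 13,595 | 15,553 | 17,640 |
| Area, proposed | 352 | 716 | 1192 | 1779 | 2475 | 3280 | 4192 | 5212 | 6338 | 7571 | 8910 | 10,356 | 11,907 | 13,565 | 15,328 |
| Area savings | −40 | −28 | 8 | 68 | 152 | 260 | 392 | 548 | 728 | 932 | 1160 | 1412 | 1688 | 1988 | 2312 |
| Delay [32,33] | 20 | 28 | 30 | 34 | 34 | 38 | 38 | 40 | 44 | 44 | 44 | 44 | 46 | 48 | 48 |
| Delay, proposed | 24 | 28 | 30 | 34 | 34 | 36 | 38 | 40 | 40 | 42 | 44 | 44 | 44 | 44 | 44 |
| Delay savings | −4 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 2 | 0 | 0 | 2 | 4 | 4 |
| Diminished-one $2^n+1$ squarers | | | | | | | | | | | | | | | |
| Area [34] | 365 | 769 | 1309 | 1984 | 2792 | 3733 | 4805 | 6009 | 7343 | 8808 | 10,403 | 12,129 | 13,984 | 15,970 | 18,085 |
| Area, proposed | 293 | 629 | 1077 | 1636 | 2304 | 3081 | 3965 | 4957 | 6055 | 7260 | 8571 | 9989 | 11,512 | 13,142 | 14,877 |
| Area savings | 72 | 140 | 232 | 348 | 488 | 652 | 840 | 1052 | 1288 | 1548 | 1832 | 2140 | 2472 | 2828 | 3208 |
| Delay [34] | 23 | 28 | 31 | 34 | 37 | 38 | 38 | 43 | 44 | 44 | 44 | 44 | 48 | 48 | 48 |
| Delay, proposed | 20 | 26 | 28 | 32 | 34 | 34 | 36 | 40 | 40 | 40 | 42 | 44 | 44 | 44 | 44 |
| Delay savings | 3 | 2 | 3 | 2 | 3 | 4 | 2 | 3 | 4 | 4 | 2 | 0 | 4 | 4 | 4 |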
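The area entries of Table 5 follow from the closed-form totals in the last row of Table 3. The sketch below evaluates those totals; it assumes that log n is the real-valued base-2 logarithm and that results are rounded to the nearest integer, a reading that reproduces the entries checked here but is not stated explicitly in the text. The dictionary keys and the function name `unit_gate_area` are illustrative.

```python
from math import log2

# Closed-form unit-gate area totals, transcribed from the last row of Table 3.
AREA_FORMULAS = {
    "mod 2^n-1 [29]": lambda n: 4 * n**2 + 3 * n * log2(n) - 7 * n,
    "mod 2^n-1 proposed": lambda n: 13 / 4 * n**2 + 3 * n * log2(n) - n / 2 - 5,
    "normal 2^n+1 [32,33]": lambda n: 4 * n**2 + 9 / 2 * n * log2(n) - 15 / 2 * n + 8,
    "normal 2^n+1 proposed": lambda n: 13 / 4 * n**2 + 9 / 2 * n * log2(n) + 9 / 2 * n,
    "dim-1 2^n+1 [34]": lambda n: 4 * n**2 + 9 / 2 * n * log2(n) - n / 2 + 5,
    "dim-1 2^n+1 proposed": lambda n: 13 / 4 * n**2 + 9 / 2 * n * log2(n) - 5 / 2 * n - 3,
}


def unit_gate_area(design, n):
    """Unit-gate area estimate in equivalent gates, rounded to an integer."""
    return round(AREA_FORMULAS[design](n))


for design in AREA_FORMULAS:
    print(f"{design:24s}", [unit_gate_area(design, n) for n in (8, 12, 16, 32)])
```

For n = 8 and n = 12 the proposed normal modulo $2^n + 1$ squarer comes out larger than the design of [32,33], matching the negative savings reported for those widths.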

Table 6. CMOS VLSI implementation results.

Modulo $2^n-1$ squarers

| n | Area [29] (μm²) | Area proposed (μm²) | Area savings (%) | Delay [29] (ns) | Delay proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 3399 | 3323 | 2.2 | 0.545 | 0.576 | −5.7 (−1 HA/FA+) |
| 12 | 8448 | 7025 | 16.8 | 0.808 | 0.751 | 7.1 (+1 HA/FA+) |
| 16 | 15,373 | 12,030 | 21.7 | 0.826 | 0.801 | 3.0 (0) |
| 20 | 24,158 | 18,413 | 23.8 | 1.015 | 0.940 | 7.4 (+1 HA/FA+) |
| 32 | 61,471 | 41,731 | 32.1 | 1.136 | 1.064 | 6.3 (+1 HA/FA+) |

Normal modulo $2^n+1$ squarers

| n | Area [32,33] (μm²) | Area proposed (μm²) | Area savings (%) | Delay [32,33] (ns) | Delay proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 4106 | 4696 | −14.4 | 0.660 | 0.746 | −13.0 (−1 FA) |
| 12 | 9907 | 9664 | 2.4 | 0.868 | 0.856 | 1.4 (0) |
| 16 | 16,850 | 14,700 | 12.8 | 0.902 | 0.916 | −1.6 (0) |
| 20 | 26,485 | 22,611 | 14.6 | 1.053 | 1.030 | 2.2 (0) |
| 32 | 64,336 | 47,053 | 26.9 | 1.196 | 1.196 | 0.0 (0) |


system and in the implementation of modulo exponentiators and multiplicative inverses. We have proposed modified Booth-encoded squarer architectures for modulo $2^n \pm 1$ arithmetic. For the modulo $2^n + 1$ case, we have considered both the normal and the diminished-one representations. All proposed architectures lead to shallower partial product matrices with fewer bits to be added, at the cost of more complexity in the generation of the partial product bits. Experimental results have shown that the proposed modified Booth-encoded squarers offer up to 38% less implementation area than the previous proposals, while in most cases they also offer a small delay improvement equal to the delay of a half adder or a full adder.

Acknowledgement

This research was partially supported by the Caratheodory Programme of the University of Patras (D.178).

Appendix A. Calculation of T for the proposed diminished-one modulo $2^n + 1$ Booth-encoded squarer

A.1. n is even and a multiple of four

The squarer has $n/2$ $C_i$ terms, with 2 bits each. Half of them are inverted (see Fig. 3b as an example for the n = 8 case). Hence, according to procedure (iiib) in Section 2, the correction term for repositioning the $C_i$ terms, $T_{C_i}$, is equal to:

\[
T_{C_i} = -\left(2^0 + 2^2 + \cdots + 2^{n-2}\right) = -\frac{2^n - 1}{3}. \tag{A.1}
\]

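The squarer also has $(n/2) - 1$ $P_i$ terms that also undergo repositioning. The correction term required for the first $n/4$ of them, $T_{PA}$, is equal to:

\[
\begin{aligned}
T_{PA} &= T_{P_0} + T_{P_1} + \cdots + T_{P_{n/4-1}} \\
&= -2^0 - \left(2^0 + 2^1 + 2^2\right) - \cdots - \left(2^0 + 2^1 + \cdots + 2^{(n/2)-2}\right) \\
&= -\left(2^1 - 1\right) - \left(2^3 - 1\right) - \cdots - \left(2^{(n/2)-1} - 1\right) \\
&= \frac{n}{4} - \frac{2}{3}\left(2^{n/2} - 1\right).
\end{aligned} \tag{A.2}
\]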

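Similarly, the correction term required for the remaining $(n/4) - 1$ $P_i$ terms, $T_{PB}$, is equal to:

\[
\begin{aligned}
T_{PB} &= T_{P_{n/4}} + T_{P_{(n/4)+1}} + \cdots + T_{P_{(n/2)-2}} \\
&= -\left(2^3 + 2^4 + \cdots + 2^{n/2}\right) - \left(2^7 + 2^8 + \cdots + 2^{(n/2)+2}\right) - \cdots - \left(2^{n-5} + 2^{n-4}\right) \\
&= -\left(2^{(n/2)+1} - 2^3\right) - \left(2^{(n/2)+3} - 2^7\right) - \cdots - \left(2^{n-3} - 2^{n-5}\right) \\
&= -\left(2^{(n/2)+1} + 2^{(n/2)+3} + \cdots + 2^{n-3}\right) + \left(2^3 + 2^7 + \cdots + 2^{n-5}\right) \\
&= -2^{(n/2)+1}\,\frac{2^{(n/2)-2} - 1}{3} + \left(2^3 + 2^7 + \cdots + 2^{n-5}\right).
\end{aligned} \tag{A.3}
\]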

The reduction of the partial products into two n-bit final addends requires $n/4$ CSAs with EACs. Hence the correction term due to the CSA tree is equal to:

\[
T_{CSA,A} = -\frac{n}{4}. \tag{A.4}
\]

Therefore, the total correction term is equal to:

\[
T_A = \left| T_{C_i} + T_{PA} + T_{PB} + T_{CSA,A} - 2 \right|_{2^n+1}, \tag{A.5}
\]

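where the last term $(-2)$ accounts for the fact that we want the diminished-one representation of the result and for the fact that the total correction term also has to be in diminished-one representation. Substituting (A.1)–(A.4) in (A.5), we get that:

\[
\begin{aligned}
T_A &= \left| -\frac{2^n - 1}{3} + \frac{n}{4} - \frac{2}{3}\left(2^{n/2} - 1\right) - 2^{(n/2)+1}\,\frac{2^{(n/2)-2} - 1}{3} + \left(2^3 + 2^7 + \cdots + 2^{n-5}\right) - \frac{n}{4} - 2 \right|_{2^n+1} \\
&= \left| -2^{n-1} + \left(2^3 + 2^7 + \cdots + 2^{n-5}\right) - 1 \right|_{2^n+1} \\
&= 2^n + 1 - 2^{n-1} + \left(2^3 + 2^7 + \cdots + 2^{n-5}\right) - 1 \\
&= \sum_{i=0}^{(n/4)-1} 2^{4i+3} = 88\ldots8_{16}.
\end{aligned}
\]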
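The closed forms (A.1)–(A.4) can be checked numerically against the final hexadecimal pattern. The following sketch evaluates each term with exact integer arithmetic and reduces the sum modulo $2^n + 1$ as in (A.5); the function name `correction_term_a` and the variable names are illustrative only.

```python
def correction_term_a(n):
    """T_A from (A.1)-(A.5); valid for n a multiple of four, n >= 8."""
    assert n % 4 == 0 and n >= 8
    half = n // 2
    # All divisions by 3 below are exact because half is even.
    t_ci = -((2**n - 1) // 3)                                 # (A.1)
    t_pa = n // 4 - (2 * (2**half - 1)) // 3                  # (A.2)
    tail = sum(2**(4 * i + 3) for i in range(n // 4 - 1))     # 2^3 + 2^7 + ... + 2^(n-5)
    t_pb = -((2**(half + 1) * (2**(half - 2) - 1)) // 3) + tail  # (A.3)
    t_csa = -(n // 4)                                         # (A.4)
    return (t_ci + t_pa + t_pb + t_csa - 2) % (2**n + 1)      # (A.5)


for n in (8, 12, 16, 32):
    print(n, hex(correction_term_a(n)))
```

For n = 8 the individual terms are −85, −8, −24 and −2, so the sum with the final −2 is −121, which is 136 = 88 in hexadecimal modulo 257.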

A.2. n is even but not a multiple of four

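The correction term required due to the inversion of the $C_i$ terms, $T_{C_i}$, is also given in this case by (A.1). The squarer also in this case has $(n/2) - 1$ $P_i$ terms. The correction term required for repositioning the first $\lfloor n/4 \rfloor = ((n/2) - 1)/2$ of them, $T_{PC}$, is equal to:

\[
\begin{aligned}
T_{PC} &= T_{P_0} + T_{P_1} + \cdots + T_{P_{\lfloor n/4 \rfloor - 1}} \\
&= -2^0 - \left(2^0 + 2^1 + 2^2\right) - \cdots - \left(2^0 + 2^1 + \cdots + 2^{(n/2)-3}\right) \\
&= -\left(2^1 - 1\right) - \left(2^3 - 1\right) - \cdots - \left(2^{(n/2)-2} - 1\right) \\
&= \left\lfloor \frac{n}{4} \right\rfloor - \frac{2}{3}\left(2^{2\lfloor n/4 \rfloor} - 1\right).
\end{aligned} \tag{A.6}
\]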

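The correction term required for the remaining $\lfloor n/4 \rfloor = ((n/2) - 1)/2$ $P_i$ terms, $T_{PD}$, is equal to:

\[
\begin{aligned}
T_{PD} &= T_{P_{\lfloor n/4 \rfloor}} + T_{P_{\lfloor n/4 \rfloor + 1}} + \cdots + T_{P_{(n/2)-2}} \\
&= -\left(2^1 + 2^2 + \cdots + 2^{(n/2)-1}\right) - \left(2^5 + 2^6 + \cdots + 2^{(n/2)+1}\right) - \cdots - \left(2^{n-5} + 2^{n-4}\right) \\
&= -\left(2^{n/2} - 2^1\right) - \left(2^{(n/2)+2} - 2^5\right) - \cdots - \left(2^{n-3} - 2^{n-5}\right) \\
&= -\left(2^{n/2} + 2^{(n/2)+2} + \cdots + 2^{n-3}\right) + \left(2^1 + 2^5 + \cdots + 2^{n-5}\right) \\
&= -2^{n/2}\,\frac{2^{2\lfloor n/4 \rfloor} - 1}{3} + \left(2^1 + 2^5 + \cdots + 2^{n-5}\right).
\end{aligned} \tag{A.7}
\]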

Table 6 (continued)

Diminished-one modulo $2^n+1$ squarers

| n | Area [34] (μm²) | Area proposed (μm²) | Area savings (%) | Delay [34] (ns) | Delay proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 5029 | 3766 | 25.1 | 0.692 | 0.627 | 9.4 (+1 HA/FA+ +1 gate) |
| 12 | 10,963 | 8113 | 26.0 | 0.849 | 0.812 | 4.4 (+1 HA/FA+) |
| 16 | 18,038 | 12,985 | 28.0 | 0.946 | 0.840 | 11.2 (+1 FA −1 gate) |
| 20 | 29,048 | 19,664 | 32.3 | 1.062 | 1.036 | 2.4 (+1 HA/FA+) |
| 32 | 68,480 | 42,699 | 37.6 | 1.186 | 1.150 | 3.0 (+1 HA/FA+) |


The reduction of the partial products into two n-bit final addends requires $\lfloor n/4 \rfloor$ CSAs with EACs. Hence the correction term due to the CSA tree is equal to:

\[
T_{CSA,B} = -\left\lfloor \frac{n}{4} \right\rfloor \tag{A.8}
\]

and the total correction term is equal to:

\[
T_B = \left| T_{C_i} + T_{PC} + T_{PD} + T_{CSA,B} - 2 \right|_{2^n+1} \tag{A.9}
\]

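and by substituting (A.1), (A.6)–(A.8) in (A.9), we get that:

\[
\begin{aligned}
T_B &= \left| -\frac{2^n - 1}{3} + \left\lfloor \frac{n}{4} \right\rfloor - \frac{2}{3}\left(2^{2\lfloor n/4 \rfloor} - 1\right) - 2^{n/2}\,\frac{2^{2\lfloor n/4 \rfloor} - 1}{3} + \left(2^1 + 2^5 + \cdots + 2^{n-5}\right) - \left\lfloor \frac{n}{4} \right\rfloor - 2 \right|_{2^n+1} \\
&= \left| -2^{n-1} + \left(2^1 + 2^5 + \cdots + 2^{n-5}\right) - 1 \right|_{2^n+1} \\
&= 2^n + 1 - 2^{n-1} + \left(2^1 + 2^5 + \cdots + 2^{n-5}\right) - 1 \\
&= \sum_{i=0}^{\lfloor n/4 \rfloor} 2^{4i+1} = 22\ldots2_{16}.
\end{aligned}
\]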
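For n even but not a multiple of four, the result can be cross-checked without relying on the closed forms. The sketch below sums the bit weights listed in the expanded lines of (A.1), (A.6) and (A.7) directly and then applies (A.8) and (A.9). The function name `correction_term_b` is hypothetical, and the index ranges are read off the first and last run of each expansion.

```python
def correction_term_b(n):
    """T_B by direct summation; valid for n even, not a multiple of four, n >= 6."""
    assert n % 4 == 2 and n >= 6
    m = n // 4                                    # floor(n/4) = ((n/2) - 1) / 2
    # (A.1): 2^0 + 2^2 + ... + 2^(n-2)
    t_ci = -sum(4**i for i in range(n // 2))
    # (A.6): runs 2^0 + ... + 2^(2k), each equal to 2^(2k+1) - 1, for k = 0 .. m-1
    t_pc = -sum(2**(2 * k + 1) - 1 for k in range(m))
    # (A.7): runs 2^(4k+1) + ... + 2^((n/2) - 1 + 2k), for k = 0 .. m-1
    t_pd = -sum(2**j
                for k in range(m)
                for j in range(4 * k + 1, n // 2 + 2 * k))
    t_csa = -m                                    # (A.8)
    return (t_ci + t_pc + t_pd + t_csa - 2) % (2**n + 1)   # (A.9)


for n in (6, 10, 14, 18):
    print(n, hex(correction_term_b(n)))
```

For n = 10 the terms are −341, −8, −126 and −2, giving −479 after the final −2, which is 546 = 222 in hexadecimal modulo 1025.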

Appendix B. Calculation of $T_{NORMAL}$ for the proposed normal modulo $2^n + 1$ Booth-encoded squarers

The proposed normal modulo $2^n + 1$ Booth-encoded squarers utilize one partial product more than the proposed diminished-one squarers (see Figs. 3b and 5a for the n = 8 case). This corresponds to $|-2A|_{2^n+1}$ and can be expressed in n bits provided that a further correction equal to 3 is taken into account. This extra partial product has to be added along with the remaining partial products. Therefore, one more CSA is required and an additional correction equal to −1 has to be considered to account for the inverted EAC. Based on the above, $T_{NORMAL} = |T + 3 - 1|_{2^n+1} = |T + 2|_{2^n+1}$, where T represents the correction term required in the corresponding diminished-one squarer.
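As a worked instance of this relation, take the diminished-one correction term derived in Appendix A.1 for n a multiple of four. The numerical values below are obtained only by substituting that result into the formula above; they are an illustration, not figures quoted from the paper.

\[
\begin{aligned}
n = 8:&\quad T = 88_{16} = 136, & T_{NORMAL} &= |136 + 2|_{257} = 138 = 8\mathrm{A}_{16}, \\
n = 16:&\quad T = 8888_{16} = 34952, & T_{NORMAL} &= |34952 + 2|_{65537} = 34954 = 888\mathrm{A}_{16}.
\end{aligned}
\]

In both cases $T + 2$ is already smaller than $2^n + 1$, so the modulo reduction leaves the sum unchanged.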

References

[1] P.V. Ananda Mohan, Residue Number Systems: Algorithms and Architectures, Springer-Verlag, 2002.

[2] A. Omondi, B. Premkumar, Residue Number Systems: Theory and Implementations, Imperial College Press, 2007.

[3] B. Cao, C.H. Chang, T. Srikanthan, An efficient reverse converter for the 4-moduli set $\{2^n - 1, 2^n, 2^n + 1, 2^{2n} + 1\}$ based on the new Chinese remainder theorem, IEEE Trans. Circuits Syst. I 50 (10) (2003) 1296–1303.

[4] R. Conway, J. Nelson, Improved RNS FIR filter architectures, IEEE Trans. Circuits Syst. II 51 (1) (2004) 26–28.

[5] B. Cao, T. Srikanthan, C.H. Chang, Efficient reverse converters for the four-moduli sets $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$ and $\{2^n - 1, 2^n, 2^n + 1, 2^{n-1} - 1\}$, IEE Proc. Comput. Digital Tech. 152 (5) (2005) 687–696.

[6] B. Cao, C.H. Chang, T. Srikanthan, A residue-to-binary converter for a new 5-moduli set, IEEE Trans. Circuits Syst. I 54 (5) (2007) 1041–1049.

[7] K. Navi, A. Molahosseini, M. Esmaeildoust, How to teach residue number system to computer scientists and engineers, IEEE Trans. Educ. 54 (1) (2011) 156–163.

[8] L. Kalampoukas, et al., High-speed parallel-prefix modulo $2^n - 1$ adders, IEEE Trans. Comput. 49 (7) (2000) 673–680.

[9] H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-one modulo $2^n + 1$ adder design, IEEE Trans. Comput. 51 (12) (2002) 1389–1399.

[10] Y. Ma, A simplified architecture for modulo $(2^n + 1)$ multiplication, IEEE Trans. Comput. 47 (3) (1998) 333–337.

[11] C. Efstathiou, H.T. Vergos, D. Nikolos, Modified booth modulo $2^n - 1$ multipliers, IEEE Trans. Comput. 53 (3) (2004) 370–374.

[12] C. Efstathiou, H.T. Vergos, G. Dimitrakopoulos, D. Nikolos, Efficient diminished-1 modulo $2^n + 1$ multipliers, IEEE Trans. Comput. 54 (4) (2005) 491–496.

[13] L. Sousa, R. Chaves, A universal architecture for designing efficient modulo $2^n + 1$ multipliers, IEEE Trans. Circuits Syst. I 52 (6) (2005) 1166–1178.

[14] L.M. Leibowitz, A simplified binary arithmetic for the fermat number transform, IEEE Trans. Acoust. Speech Signal Process. 24 (5) (1976) 356–359.

[15] T. Kwan, T. Martin, Adaptive detection and enhancement of multiple sinusoids using a cascade IIR filter, IEEE Trans. Circuits Syst. 36 (7) (1989) 937–945.

[16] R.H. Strandberg, et al., Efficient realizations of squaring circuit and reciprocal used in adaptive sample rate notch filters, J. VLSI Signal Process. 14 (3) (1996) 303–309.

[17] R. Jain, A. Madisetti, R.L. Baker, An integrated circuit design for pruned tree-search vector quantization encoding with an off-chip controller, IEEE Trans. Circuits Syst. Video Technol. 2 (2) (1992) 147–158.

[18] J. Pihl, E.J. Aas, A multiplier and squarer generator for high performance DSP applications, Proceedings of 39th Midwest Symposium on Circuits and Systems, vol. I, 1996, pp. 109–112.

[19] J.-T. Yoo, K.F. Smith, G. Gopalakrishnan, A fast parallel squarer based on divide-and-conquer, IEEE J. Solid-State Circuits 32 (6) (1997) 909–912.

[20] Y. Yu Fengqi, A.N. Wilson, Multirate digital squarer architectures, in: Proceedings of the 8th IEEE International Conference on Electronics Circuits & Systems, 2001, pp. 177–180.

[21] Lucent Technologies, DSP 1628 Datasheet, Murray Hill, NJ, 1997.

[22] R.K. Kolagotla, W.R. Griesbach, H.R. Srinivas, VLSI implementation of 350 MHz 0.35 μm 8 bit merged squarer, Electron. Lett. 34 (1) (1998) 47–48.

[23] A.A. Liddicoat, M.J. Flynn, Parallel square and cube computations, Proceedings of the 34th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2000, pp. 1325–1329.

[24] M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with RNS, Proceedings of the 46th IEEE Midwest Symposium on Circuits and Systems, vol. II, 2003, pp. 806–810.

[25] J.C. Bajard, L.S. Didier, P. Kornerup, An RNS montgomery modular multiplication algorithm, IEEE Trans. Comput. 47 (7) (1998) 766–776.

[26] J.C. Bajard, L. Imbert, A Full RNS implementation of RSA, IEEE Trans. Comput. 53 (6) (2004) 769–774.

[27] R. Zimmermann, et al., A 177 Mb/s VLSI implementation of the international data encryption algorithm, IEEE J. Solid-State Circuits 29 (3) (1994) 303–307.

[28] U. Meyer-Base, A. Garcia, F. Taylor, Implementation of a communications Channelizer using FPGAs and RNS arithmetic, J. VLSI Signal Process. 28 (1–2) (2001) 115–128.

[29] S.J. Piestrak, Design of squarers modulo A with low-level pipelining, IEEE Trans. Comput. 49 (1) (2002) 31–41.

[30] P.B. Rao, A. Skavantzos, ROM based methods for computing the squaring operation in modular rings, J. VLSI Signal Process. 7 (3) (1994) 199–211.

[31] B. Cao, T. Srikanthan, C-H. Chang, A new design method to Modulo $2^n - 1$ squaring, in: Proceedings of the IEEE International Symposium on Circuits and Systems, May 2005, pp. 664–667.

[32] H.T. Vergos, C. Efstathiou, Efficient modulo $2^k + 1$ squarers, in: Proceedings of the XXI Conference on Design of Circuits and Integrated Systems, November 2006.

[33] R. Muralidharan, C.H. Chang, C.C. Jong, A low complexity modulo $2^n + 1$ squarer design, in: Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, 2008, pp. 1296–1299.

[34] H.T. Vergos, C. Efstathiou, Diminished-1 modulo $2^n + 1$ squarer design, IEE Proc. Comput. Digital Tech. 152 (5) (2005) 561–566.

[35] A. Spyrou, D. Bakalis, H.T. Vergos, Efficient architectures for modulo $2^n - 1$ squarers, in: Proceedings of the 16th IEEE International Conference on Digital Signal Processing (DSP 2009), July 2009, pp. 1–6.

[36] A. Strollo, D. Caro, Booth folding encoding for high performance squarer circuits, IEEE Trans. Circuits Syst. II 50 (5) (2003) 250–254.

[37] H.T. Vergos, D. Bakalis, On implementing efficient modulo $2^n + 1$ arithmetic components, J. Circuits Syst. Comput. 19 (5) (2010) 911–930.

[38] L. Dadda, Some schemes for parallel multipliers, Alta Frequenza 34 (1965) 349–356.

[39] A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput. 42 (10) (1993) 1163–1170.

[40] Faraday Technology Corp., 90 nm Standard Cell, Faraday ASIC Cell Library FSD0A_A, September 2004.

Dimitris Bakalis received the Diploma degree in 1995, the M.Sc. degree in 2000 and the Ph.D. degree in 2001 in Computer Engineering, all from the Department of Computer Engineering and Informatics at the University of Patras in Greece. He currently holds a Lecturer position in the Physics Department at the same university. His main research interests include VLSI design and test, digital system design and test, embedded systems, computer arithmetic, low power design and test.


Haridimos Vergos received his Diploma in Computer Engineering in 1991, and his Ph.D. in 1996 from the Department of Computer Engineering and Informatics, University of Patras, Greece, where he currently holds an Associate Professor position. He was a member of Atmel Multimedia & Communications Group and worked on the development of the first worldwide IEEE 802.11 compliant wireless MAC processor. His research interests include computer arithmetic and architecture, dependable system architectures and low power design and test.

Anastasia Spyrou received her Diploma Degree in Computer Engineering and Informatics in 2007 and the Masters Degree in Integrated Software and Hardware Systems in 2009 from the University of Patras, Greece. Her research interests focus on the area of VLSI design.
