Author's personal copy - Πανεπιστήμιο Πατρών
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the author's institution and sharing with colleagues.
Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited.
In most cases authors are permitted to post their version of the article (e.g. in Word or TeX form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier's archiving and manuscript policies are encouraged to visit:
http://www.elsevier.com/copyright
Efficient modulo $2^n \pm 1$ squarers
D. Bakalis a, H.T. Vergos b,*, A. Spyrou b
a Physics Department, University of Patras, 26 500, Greece
b Computer Engineering & Informatics Department, University of Patras, 26 500, Greece
Article info
Article history:
Received 5 October 2010
Received in revised form 8 March 2011
Accepted 8 March 2011
Available online 21 March 2011
Keywords:
Squarers
Modulo arithmetic
Residue number system
Computer arithmetic
Booth encoding
Abstract
Modulo $2^n \pm 1$ squarers are useful components for designing special-purpose digital signal processors that internally use a residue number system and for implementing the modulo exponentiators and multiplicative inverses required in cryptographic algorithms. In this paper we propose, in a unified way, architectures for their design that are based on radix-4 modified Booth encoding. For the modulo $2^n + 1$ case, both the normal and the diminished-one representations are considered. Experimental results show that the proposed squarers offer significant savings in implementation area over previous proposals, reaching up to 38% for sufficiently large operand widths, while in many cases a small improvement in execution delay is also achieved.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Special-purpose digital signal processors (DSPs) and co-processors often adopt a residue number system (RNS) to achieve high operation speeds [1,2]. A non-positional RNS is defined by a set of $L$ moduli, say $\{d_1, d_2, \ldots, d_L\}$, which are pairwise relatively prime. Let $|A|_M$ denote the modulo $M$ residue of an integer $A$, that is, the least non-negative remainder of the division of $A$ by $M$. $A$ has a unique representation in the RNS, given by the set $\{a_1, a_2, \ldots, a_L\}$ of residues, where $a_i = |A|_{d_i}$ if $A \ge 0$ and $a_i = |D + A|_{d_i}$ if $A < 0$, with $D = d_1 \times d_2 \times \cdots \times d_L$. An operation $\circ$ over the RNS is defined as $(z_1, z_2, \ldots, z_L) = (a_1, a_2, \ldots, a_L) \circ (b_1, b_2, \ldots, b_L)$, where $z_i = |a_i \circ b_i|_{d_i}$. That is, the computation of $z_i$ depends only on $a_i$, $b_i$ and $d_i$, and therefore each $z_i$ can be computed in a separate arithmetic unit, often called a channel. Since each channel operates on small residues instead of large numbers, and since all required channels operate in parallel, a significant speedup over the binary system may be achieved.
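The channel-wise definition above can be illustrated with a minimal Python sketch. The helper names `to_rns` and `rns_op` and the moduli set {255, 256, 257} are assumptions chosen for illustration only; they are not taken from the paper.

```python
from math import prod

def to_rns(a, moduli):
    """Residues of integer a. Python's % returns the least non-negative
    remainder, so a negative a is mapped exactly like |D + A|."""
    return tuple(a % d for d in moduli)

def rns_op(x, y, moduli, op):
    """Apply op channel by channel: z_i = |a_i op b_i| modulo d_i."""
    return tuple(op(a, b) % d for a, b, d in zip(x, y, moduli))

# Hypothetical moduli set: 2^8 - 1, 2^8, 2^8 + 1 (pairwise relatively prime).
moduli = (255, 256, 257)
D = prod(moduli)

a, b = 1234, 5678
z = rns_op(to_rns(a, moduli), to_rns(b, moduli), moduli, lambda p, q: p * q)
```

Each entry of `z` is computed from the residues of one channel only, which is the property that lets the channels work in parallel.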
RNSs built on moduli of the $2^n$, $2^n - 1$ or $2^n + 1$ forms have received significant attention [3–7], because the circuits required for the $2^n$ moduli can be derived directly from the corresponding integer circuits by limiting the result to $n$ bits, while adders [8,9] and multipliers [10–13] have been proposed for the $2^n \pm 1$ moduli that can operate as fast as the integer ones. The $2^n + 1$ channel has to deal with operands that are $(n+1)$ bits wide, compared to the $n$-bit wide operands of the modulo $2^n$ and the modulo $2^n - 1$ channels. Leibowitz [14] introduced the diminished-one representation, in which each number is represented decreased by one compared to its normal representation and all arithmetic operations are inhibited for a zero input operand, since the result can then be derived directly. The diminished-one representation has the advantage that the computations in the modulo $2^n + 1$ channel are also restricted to $n$ bits.
Squaring in DSPs can always be performed by driving both inputs of the multiplier with the same operand. However, several applications require a very large number of squaring operations and therefore profit from a dedicated squaring circuit capable of performing hundreds of millions of squaring operations per second. Such applications include adaptive filtering [15,16], vector quantization, pattern recognition and image compression [17–20], calculation of Euclidean branch metrics [21,22] and function evaluation [23].
Efficient designs for modulo squarers are welcome for special-purpose DSPs and for cryptographic algorithm implementations that adopt an RNS [24–26]. Furthermore, modulo squarers can be used in applications where multiplicative inverses or modular exponentiations have to be computed, since, according to Fermat's little theorem, $|x^{-1}|_p = |x^{p-2}|_p$ when $p$ is prime and $x$ is coprime to $p$. For example, a modulo $2^{16} + 1$ squarer can speed up the computation of the multiplicative inverses modulo $2^{16} + 1$ in the International Data Encryption Algorithm (IDEA) using the square-and-multiply algorithm [27]. Finally, modulo squarers have been proposed for implementing modulo multipliers based on the quarter-square principle [28,29].
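As a rough software illustration of why inversion leans so heavily on squaring, the following Python sketch computes $|x^{p-2}|_p$ with a left-to-right square-and-multiply loop. The function names are hypothetical, and the sketch models only the arithmetic, not the hardware squarer discussed in the paper.

```python
def mod_square(x, p):
    """Stand-in for a dedicated modulo squarer."""
    return (x * x) % p

def mod_inverse_fermat(x, p):
    """Compute x^(p-2) mod p by scanning the exponent bits from the MSB.
    Every exponent bit costs one squaring; only 1-bits cost a multiplication."""
    result = 1
    for bit in bin(p - 2)[2:]:
        result = mod_square(result, p)
        if bit == "1":
            result = (result * x) % p
    return result

P = 2**16 + 1  # 65537 is prime, so Fermat's little theorem applies
```

For example, `mod_inverse_fermat(3, P)` returns a value whose product with 3 leaves a residue of 1 modulo `P`.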
A modulo squarer can be designed using: (a) a modulo multiplier with both inputs driven by the same operand, (b) minimized logic functions [29], (c) look-up tables [30], or (d) a combinational approach for generating and adding the various partial products [29,31–34]. The first approach is unnecessarily complex and slow. The second approach can be used only for small input operands. The third approach requires unacceptably large memories as the input operand width increases and is more suitable for generic moduli values. The last approach is the most appropriate for modulo $2^n \pm 1$ squarers with medium and large input operand widths.
INTEGRATION, the VLSI journal 44 (2011) 163–174
journal homepage: www.elsevier.com/locate/vlsi
0167-9260/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.vlsi.2011.03.006
* Corresponding author.
E-mail addresses: [email protected] (D. Bakalis), [email protected] (H.T. Vergos), [email protected] (A. Spyrou).
In this paper we present novel architectures for designing modulo $2^n \pm 1$ squarers. For the modulo $2^n + 1$ case, we consider both the normal and the diminished-one representation. This paper extends the methodology originally presented in [35] for modulo $2^n - 1$ squarers to the design of efficient normal and diminished-one modulo $2^n + 1$ squarers, and provides a methodology that covers all three moduli cases in a unified way. Furthermore, in the current state of the art, the diminished-one modulo $2^n + 1$ squarers require more area and delay than the corresponding normal modulo $2^n + 1$ squarers of the same size. We resolve this inconsistency by proposing diminished-one modulo $2^n + 1$ squarers that are faster and smaller than the normal ones. The proposed circuits use the modified Booth encoding of the input operand to generate the partial products efficiently, along with a Dadda tree architecture for their summation, and are based on the Booth-folding technique, which was introduced in [36] for the design of binary squarers. Applying the Booth-folding technique to the design of modulo squarers is not straightforward, since one has to: (a) suitably Booth-encode the modulo $2^n - 1$ or $2^n + 1$ input operand, (b) confine the resulting partial product bits within $n$ bits and (c) derive the correction terms that are required in the modulo $2^n + 1$ case. Even though (a) can be based on the Booth encoding already reported in the open literature for modulo $2^n \pm 1$ multipliers [10,11], (b) and (c) are presented in this paper for the first time in the open literature and form its main contributions. We prove that a constant correction term is required in both the normal and the diminished-one modulo $2^n + 1$ squarer cases and derive their values analytically. Our experimental results indicate that, for sufficiently wide operands, the proposed squarers offer reduced implementation area compared to previously proposed ones, while a small improvement in execution delay is also offered in many cases.
The rest of the paper is organized as follows. The optimization techniques suggested so far for the design of modulo $2^n \pm 1$ squarers and the current state of the art are reviewed in the next section. Efficient modified Booth-encoded architectures are derived in Section 3. Section 4 presents comparative results against previously proposed architectures. Conclusions are drawn in the last section.
2. Background and previous work
A modulo $2^n - 1$ squarer accepts an $n$-bit operand $A = a_{n-1} a_{n-2} \ldots a_0$, with $0 \le A = \sum_{i=0}^{n-1} a_i \times 2^i \le 2^n - 1$ (assuming a double representation of zero), and produces the $n$-bit result $R = |A^2|_{2^n-1}$. The normal modulo $2^n + 1$ squarer accepts an $(n+1)$-bit operand $A = a_n a_{n-1} \ldots a_0$, with $0 \le A = \sum_{i=0}^{n} a_i \times 2^i < 2^n + 1$, and computes the $(n+1)$-bit result $R = |A^2|_{2^n+1}$. Finally, denoting by $A_{-1} = a_{n-1} a_{n-2} \ldots a_0$ the $n$-bit diminished-one representation of an $(n+1)$-bit operand $A$, a diminished-one modulo $2^n + 1$ squarer accepts $A_{-1}$, with $A_{-1} = \sum_{i=0}^{n-1} a_i \times 2^i = A - 1$, $0 < A < 2^n + 1$, and computes the $n$-bit diminished-one representation $R_{-1}$ of $|A^2|_{2^n+1}$.
Assuming that a combinational approach is used, a squaring circuit usually consists of three stages, namely the partial product generation, the reduction stage that derives two final addends from the generated partial products and the final two-input parallel adder. The partial product matrix of a modulo $2^n \pm 1$ squarer can be optimized by the following procedures:
(i) Logic and arithmetic simplifications, since $a_i a_i = a_i$ and $a_n a_i = 0$, $\forall i$, $0 \le i < n$.
(ii) Merging pairs of equal partial product bits having the same weight, since $a_i a_j \times 2^{i+j} + a_j a_i \times 2^{i+j} = a_i a_j \times 2^{i+j+1}$.
(iii) Repositioning all partial product bits residing in columns with weights greater than $2^{n-1}$ within the rightmost $n$ columns of the matrix. This is accomplished by taking into account the periodic properties of the powers of 2 taken modulo $2^n \pm 1$ [29]. More specifically,
(a) Modulo $2^n - 1$: All partial product bits with weight $2^{n+i}$, $i \ge 0$, can be moved to the column with weight $2^i$, since for every bit $z$ it holds that $|z \times 2^{n+i}|_{2^n-1} = |z \times 2^i|_{2^n-1}$.
(b) Modulo $2^n + 1$: All partial product bits with weight $2^{n+i}$, $0 \le i \le n-1$, can be inverted and repositioned in the column with weight $2^i$, since it holds that $|z \times 2^{n+i}|_{2^n+1} = |2^{n+i} + \bar{z} \times 2^i|_{2^n+1}$ ($\bar{z}$ denotes the complement of $z$). An additive correction term equal to $|2^{n+i}|_{2^n+1}$ should be taken into account for each such move. Finally, the $a_n$ partial product bit, which has a weight equal to $2^{2n}$, can be moved to the column with weight $2^0$, since $|a_n \times 2^{2n}|_{2^n+1} = a_n \times 2^0$. Since $a_n$ and $a_0$ cannot both be 1 simultaneously, we can combine them in a logic OR gate instead of adding them.
(c) Diminished-one modulo $2^n + 1$: The same repositioning procedure as that of (iiib) can be used for deriving the partial product matrix of $A_{-1}^2$, without of course having to reposition the $a_n$ partial product bit, since it holds that $R_{-1} = |A^2 - 1|_{2^n+1} = |(A_{-1} + 1)^2 - 1|_{2^n+1} = |A_{-1}^2 + 2 \times A_{-1}|_{2^n+1}$. The $2 \times A_{-1}$ term can be represented, according to (iiib), by the $n$-bit vector $a_{n-2} a_{n-3} \ldots a_0 \bar{a}_{n-1}$, as long as an additive correction term of $|2^n|_{2^n+1} = -1$ is also considered.
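The repositioning identities of procedure (iii) can be checked numerically. The Python sketch below is illustrative only: `check_folding` and `double_dim1` are hypothetical helper names, and the bit vector built in `double_dim1` follows the reading of (iiic) in which only the wrapped bit $a_{n-1}$ is complemented.

```python
def check_folding(n):
    """Exhaustively check the weight-folding identities (iiia) and (iiib)."""
    m_minus, m_plus = 2**n - 1, 2**n + 1
    for i in range(n):
        for z in (0, 1):
            # (iiia): weight 2^(n+i) folds onto weight 2^i modulo 2^n - 1
            assert (z << (n + i)) % m_minus == (z << i) % m_minus
            # (iiib): invert the bit, move it to 2^i, keep 2^(n+i) as correction
            zbar = 1 - z
            assert (z << (n + i)) % m_plus == ((1 << (n + i)) + (zbar << i)) % m_plus
    # the a_n bit of weight 2^(2n) folds onto weight 2^0 modulo 2^n + 1
    assert (1 << (2 * n)) % m_plus == 1
    return True

def double_dim1(a, n):
    """n-bit vector a_{n-2} ... a_0 ~a_{n-1}; together with the correction
    term -1 it stands for 2*a modulo 2^n + 1, as described in (iiic)."""
    msb = (a >> (n - 1)) & 1
    return ((a << 1) & ((1 << n) - 1)) | (1 - msb)
```

With $n = 8$, `double_dim1(a, 8) - 1` is congruent to `2 * a` modulo 257 for every 8-bit `a`.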
The modulo $2^n - 1$ squarers, proposed first in [29] and later in [31], apply procedures (i), (ii) and (iiia) to attain an $n$-bit wide matrix of partial product bits. For example, Fig. 1(a) presents the partial product matrix of a modulo $2^8 - 1$ squarer according to [29,31]. The partial products are then reduced to two final addends by an end-around carry (EAC) carry-save adder (CSA) tree composed of full adders. The use of EAC is justified by the fact that the carry-out at the most significant bit (MSB) position of any stage of the adder tree has a weight equal to $2^n$. According to procedure (iiia), this carry output can be added at the least significant bit (LSB) position of the next stage. The two final addends are driven to a fast parallel modulo $2^n - 1$ adder that provides the output of the squarer.
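A behavioural Python sketch of one EAC carry-save level and of the final modulo $2^n - 1$ addition is given below. It models bit vectors as integers, and the names `eac_csa` and `eac_add` are assumptions made for illustration rather than the circuits of [29,31].

```python
def eac_csa(x, y, z, n):
    """One carry-save level modulo 2^n - 1: three n-bit inputs, two n-bit outputs."""
    mask = (1 << n) - 1
    s = x ^ y ^ z                             # bitwise sum
    c = (x & y) | (x & z) | (y & z)           # bitwise carries, one column higher
    c = ((c << 1) & mask) | (c >> (n - 1))    # MSB carry (weight 2^n) re-enters at the LSB
    return s, c

def eac_add(x, y, n):
    """Final addition modulo 2^n - 1 with end-around carry.
    May return 2^n - 1, the second representation of zero."""
    mask = (1 << n) - 1
    t = x + y
    return (t & mask) + (t >> n)
```

Rotating the carry vector instead of shifting it is what keeps every level of the tree exactly $n$ bits wide.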
For the normal modulo $2^n + 1$ squarers case, [32,33] refined the architecture originally proposed by [29]. An $n$-bit wide matrix of partial product bits, such as the one presented in Fig. 1(b) for the $n = 8$ case, is derived by applying procedures (i), (ii) and (iiib) ($\vee$ is used to denote the logical OR). A CSA tree with inverted EACs is then used for reducing the partial products to the two final addends. Since the carry output at any stage of the adder tree has a weight equal to $2^n$, it can be inverted and driven to the LSB position of the next stage, according to procedure (iiib), provided that a correction term of $2^n$ is also taken into account. Although a normal modulo $2^n + 1$ parallel adder can be used for providing the result, in [32,33] a slightly modified diminished-one modulo $2^n + 1$ adder has been used as the final adder. This adder offers the same execution speed as the fastest diminished-one adders [9] but also checks whether its inputs are bit-wise complementary, in which case it produces a 1 at the MSB position. It should be noted that the correction term $T = t_{n-1} \ldots t_0$ used in the partial product matrix accounts for all corrections required, that is, those due to partial product bit repositioning, inverted EACs and the use of a diminished-one adder as the final adder. In [32,33] it has been shown that $T$ has a constant value equal to $T = 2$.
Finally, the modulo $2^n + 1$ squarers for diminished-one operands proposed in [34] also derive an $n$-bit wide matrix of partial product bits by applying procedures (i), (ii), (iiib) and (iiic). They also use a CSA tree with inverted EACs for partial product reduction and a final diminished-one adder for producing the result. The partial product matrix of a diminished-one modulo $2^8 + 1$ squarer is shown in Fig. 1(c). The required correction term $T$ is constant in this case too, and its value is equal to 0 [34]. However, its addition cannot be omitted, since that would alter the number of EACs in the CSA tree that are accounted for in the correction term.
The architectures presented in [29,32–34] are considered the current state of the art. It should be noted that none of them uses Booth encoding for reducing the number of partial products.
3. Proposed modified Booth-encoded modulo squarers
Modified Booth encoding is a commonly used technique for reducing the number of partial products in binary multipliers and squarers, resulting in more compact implementations than the corresponding non-encoded ones for sufficiently large operands. Since the reduction of the partial products makes the adder tree used to add them shallower, modified Booth-encoded multipliers and squarers may also offer faster implementations, provided that the partial product bits are derived fast enough. In this section, we propose modified Booth-encoded modulo $2^n \pm 1$ squarers. In the following we concentrate only on the partial product matrix of each proposed architecture, since the adder tree that reduces the partial products and the final adder follow the same schemes as those used in the previous proposals. We first consider the diminished-one representation for the modulo $2^n + 1$ case, since the normal modulo $2^n + 1$ squarers can be designed as a straightforward extension of the corresponding diminished-one ones. For simplicity, in the following we assume that $n$ is even and use the $n = 8$ case as an example.
According to the radix-4 modified Booth encoding, an $n$-bit operand $A = a_{n-1} a_{n-2} \ldots a_0$ can be rewritten as $A = \sum_{i=0}^{n/2-1} 2^{2i} \times (-2a_{2i+1} + a_{2i} + a_{2i-1}) = \sum_{i=0}^{n/2-1} (2^{2i} \times A_i)$, where $a_{-1} = 0$ and $A_i$ is a Booth-encoded digit, with $A_i = (-2a_{2i+1} + a_{2i} + a_{2i-1}) \in [-2, +2]$. It has been shown in the past that the same encoding can be applied to modulo $r$ operands, where $r$ denotes either the modulo $2^n - 1$ or the diminished-one modulo $2^n + 1$ case. Specifically, it holds that $|A|_r = \left| \sum_{i=0}^{n/2-1} (2^{2i} \times A_i) \right|_r$, with $a_{-1} = a_{n-1}$ and $a_{-1} = \bar{a}_{n-1}$ in the modulo $2^n - 1$ [11] and diminished-one modulo $2^n + 1$ [10] cases, respectively.
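The encoding can be explored with a small Python model. The sketch below is an illustration under stated assumptions: `booth_digits` and `booth_value` are hypothetical names, operands are plain integers, and the comments record what the digit sum evaluates to in each case as checked by the sketch itself.

```python
def booth_digits(a, n, a_minus1):
    """Radix-4 modified Booth digits A_i = -2*a_{2i+1} + a_{2i} + a_{2i-1}
    of the n-bit vector a (n even); a_minus1 supplies the bit a_{-1}."""
    def bit(j):
        return a_minus1 if j < 0 else (a >> j) & 1
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n // 2)]

def booth_value(digits):
    """Sum of 2^(2i) * A_i as an ordinary (possibly negative) integer."""
    return sum(d * 4**i for i, d in enumerate(digits))

n = 8
a = 0b10110101
msb = a >> (n - 1)
plain = booth_value(booth_digits(a, n, 0))          # a - 2^n * a_{n-1}
minus = booth_value(booth_digits(a, n, msb))        # congruent to a modulo 2^n - 1
plus = booth_value(booth_digits(a, n, 1 - msb))     # congruent to a + 1 modulo 2^n + 1
```

In the diminished-one case the digit sum is congruent to the stored vector plus one, that is, to the number the vector represents.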
The square of $A$ in modulo $r$ arithmetic can be computed using the modified Booth-encoded digits of $A$, instead of its bits, as follows [36]:
$$
\begin{aligned}
|A^2|_r &= \left| \left| \sum_{i=0}^{n/2-1} 2^{2i} \times A_i \right|_r \times \left| \sum_{i=0}^{n/2-1} 2^{2i} \times A_i \right|_r \right|_r \\
&= \left| \left( \sum_{i=0}^{n/2-1} 2^{2i} \times A_i \right) \times \left( \sum_{i=0}^{n/2-1} 2^{2i} \times A_i \right) \right|_r \\
&= \left| \sum_{i=0}^{n/2-1} 2^{4i} \times (A_i \times A_i) + \sum_{i=0}^{n/2-2} 2^{4i+3} \left( \sum_{k=i+1}^{n/2-1} 2^{2(k-i-1)} \times A_i \times A_k \right) \right|_r \\
&= \left| \sum_{i=0}^{n/2-1} 2^{4i} \times C_i + \sum_{i=0}^{n/2-2} 2^{4i+3} \times P_i \right|_r ,
\end{aligned}
$$
where $C_i = A_i \times A_i$ and $P_i = \sum_{k=i+1}^{n/2-1} (2^{2(k-i-1)} \times A_i \times A_k)$. The $C_i$ terms, $0 \le i < n/2$, are unsigned numbers that can only assume the values 0, 1 or 4. Each one can therefore be represented by 3 bits. However, since the middle bit is always equal to 0, it does not need to be included in the partial product matrix. On the other hand, the $P_i$ terms are signed two's complement numbers. Each $P_i$ term, $0 \le i < n/2 - 1$, requires $(n - 1 - 2i)$ bits for its representation. Let $C_{i,j}$ and $P_{i,j}$ denote the $j$-th bit of $C_i$ and $P_i$, respectively. Fig. 2 presents the partial product matrix that can be derived for the square of an 8-bit operand $A$ in modulo $r$ arithmetic according to the above analysis.
Fig. 1. Partial product matrix for (a) $|A^2|_{2^8-1}$, (b) $|A^2|_{2^8+1}$ in the normal representation and (c) $|A^2|_{2^8+1}$ in the diminished-one representation.
For attaining an $n$-bit wide matrix, we need to reposition all partial product bits with weights greater than $2^{n-1}$. Repositioning the $C_{i,j}$ bits can be accomplished by the procedures (iiia) and (iiib) described in Section 2. The $P_{i,j}$ bits, however, need to be treated differently, since the $P_i$ terms represent signed two's complement numbers. Let the most significant bit of each $P_i$ term be denoted as $P_{i,\mathrm{MSB}}$. It then holds that:
• Modulo $2^n - 1$: Procedure (iiia) can be applied for all $P_{i,j}$ bits except the $P_{i,\mathrm{MSB}}$ bits. Each $P_{i,\mathrm{MSB}}$ bit represents the sign of a $P_i$ term and has a weight equal to $-2^{n+2i+1}$. Therefore:
$$
\left| \sum_{i=0}^{n/2-2} \left( -2^{n+2i+1} \times P_{i,\mathrm{MSB}} \right) \right|_{2^n-1}
= \left| -\sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,\mathrm{MSB}} \right|_{2^n-1}
= \left| (2^n - 1) - \sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,\mathrm{MSB}} \right|_{2^n-1} .
$$
Taking into account that for every bit $z$ it holds that $1 - z = \bar{z}$, we have that:
$$
\left| \sum_{i=0}^{n/2-2} \left( -2^{n+2i+1} \times P_{i,\mathrm{MSB}} \right) \right|_{2^n-1}
= 1\; 1\; \bar{P}_{n/2-2,\mathrm{MSB}}\; 1\; \ldots\; \bar{P}_{1,\mathrm{MSB}}\; 1\; \bar{P}_{0,\mathrm{MSB}}\; 1 .
$$
Hence, we invert every $P_{i,\mathrm{MSB}}$ bit, move it to the column with weight $2^{2i+1}$ and fill the remaining columns with 1s.
• Diminished-one modulo $2^n + 1$: We can similarly apply procedure (iiib) for all $P_{i,j}$ bits except the MSB of each $P_i$. For the $P_{i,\mathrm{MSB}}$ bits we have that:
$$
\left| \sum_{i=0}^{n/2-2} \left( -2^{n+2i+1} \times P_{i,\mathrm{MSB}} \right) \right|_{2^n+1}
= \left| \sum_{i=0}^{n/2-2} (-1) \times \left( -2^{2i+1} \times P_{i,\mathrm{MSB}} \right) \right|_{2^n+1}
= \left| \sum_{i=0}^{n/2-2} 2^{2i+1} \times P_{i,\mathrm{MSB}} \right|_{2^n+1} .
$$
Hence, each $P_{i,\mathrm{MSB}}$ can simply be moved to the column with weight $2^{2i+1}$ and no further action is required.
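The decomposition into $C_i$ and $P_i$ terms and the handling of the sign bits can be checked with a short Python model. This is a sketch under assumptions: all function names are hypothetical, operands are plain integers rather than bit-level circuits, and the checks cover $n = 8$ only.

```python
def booth_digits(a, n, a_minus1):
    """Radix-4 Booth digits A_i = -2*a_{2i+1} + a_{2i} + a_{2i-1}."""
    def bit(j):
        return a_minus1 if j < 0 else (a >> j) & 1
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n // 2)]

def square_terms(digits):
    """C_i = A_i * A_i and P_i = sum over k > i of 2^(2(k-i-1)) * A_i * A_k."""
    half = len(digits)
    c = [d * d for d in digits]
    p = [sum(4 ** (k - i - 1) * digits[i] * digits[k] for k in range(i + 1, half))
         for i in range(half - 1)]
    return c, p

def recombine(c, p):
    """sum 2^(4i) * C_i + sum 2^(4i+3) * P_i, before any modulo reduction."""
    return (sum(ci * 2 ** (4 * i) for i, ci in enumerate(c))
            + sum(pi * 2 ** (4 * i + 3) for i, pi in enumerate(p)))

def folded_sign_vector(msbs, n):
    """Modulo 2^n - 1 case: inverted P_i,MSB in column 2i+1, ones elsewhere."""
    vec = (1 << n) - 1
    for i, m in enumerate(msbs):
        if m:
            vec &= ~(1 << (2 * i + 1))
    return vec
```

For instance, `folded_sign_vector([1, 0, 1], 8)` gives `0b11011101`: the two set sign bits appear as zeros in columns 1 and 5.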
Fig. 3(a) and (b) present the derived $n$-bit wide partial product matrices for the proposed modified Booth-encoded modulo $2^8 - 1$ and diminished-one modulo $2^8 + 1$ squarers. The last partial product $T = t_7 \ldots t_0$ of Fig. 3(b) incorporates all correction terms required. It can be shown that $T$ has a constant value, which is equal to $T_A = 88\ldots8_{16}$ for even values of $n$ that are multiples of 4 (a proof is provided in Appendix A.1) and equal to $T_B = 22\ldots2_{16}$ for even values of $n$ that are not multiples of 4 (a proof is provided in Appendix A.2). Subscript 16 denotes a hexadecimal value. Fig. 3(c) and (d) present the corresponding circuit implementations. HA, FA and FA+ denote half adders, full adders and simplified full adders with an input connected to logic 1, respectively. The final adders can be designed according to any desirable architecture. The $C_{i,j}$ and $P_{i,j}$ bits can be derived using the simple circuits proposed in [36]. For example, the circuits for deriving the $C_i$ and $P_i$ terms, assuming a four-bit coding, for the modulo squarer of Fig. 3(a) are presented in Fig. 4.
Let us now consider the case of modified Booth-encoded modulo $2^n + 1$ squarers for an $(n+1)$-bit operand $A$ in the normal representation. At first we assume that $A < 2^n$, that is, $a_n = 0$. If the $n$ LSBs of $A$ are driven to the diminished-one modulo $2^n + 1$ squarer derived before, it will compute the $n$ LSBs of $|(A+1)^2 - 1|_{2^n+1}$, that is, it will compute $|A^2 + 2A|_{2^n+1}$. As a result, we can use an extra partial product equal to $|-2A|_{2^n+1}$ in the partial product matrix of the diminished-one squarer for deriving the $n$ LSBs of $|A^2|_{2^n+1}$. The $|-2A|_{2^n+1}$ term can be expressed as the $n$-bit vector $\bar{a}_{n-2} \bar{a}_{n-3} \ldots \bar{a}_0 a_{n-1}$, provided that an additional correction term equal to 3 is also taken into account. The most significant bit of the result can be derived by using the slightly modified diminished-one adder [32,33,37] as the final adder, as explained in Section 2. Finally, when $A = 2^n$, $a_n = 1$ and all the remaining bits are 0. Since in this case $|A^2|_{2^n+1} = |2^{2n}|_{2^n+1} = 1$, we can just position $a_n$ at the column with weight $2^0$ and logically OR it with the $a_{n-1}$ bit of the partial product added for $|-2A|_{2^n+1}$. Fig. 5(a) and (b) present the partial product matrix and the circuit implementation of the proposed normal modulo $2^8 + 1$ squarer. All required correction terms are merged into an $n$-bit vector $T$ whose value is constant and equal to the value of the corresponding diminished-one correction term increased by 2 (see Appendix B for a proof). Note that the proposed architectures for the modulo $2^n + 1$ case lead to modulo squarers for the diminished-one representation that are smaller and faster than the corresponding squarers for the normal representation, whereas in the previously reported architectures [32–34] the diminished-one squarers are less efficient than the normal ones.
Fig. 2. Partial product matrix for the modulo $2^8 \pm 1$ Booth-encoded squarer.
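The extra partial product used in the normal representation can be checked with a few lines of Python. The sketch assumes the reading in which the bits $a_{n-2} \ldots a_0$ enter the vector complemented and $a_{n-1}$ enters uncomplemented; `minus_two_a_vector` and `dim1_square_model` are hypothetical names, and the latter is only a behavioural stand-in for the diminished-one squarer.

```python
def minus_two_a_vector(a, n):
    """n-bit vector ~a_{n-2} ... ~a_0 a_{n-1}, built from the n LSBs of A."""
    low_mask = (1 << (n - 1)) - 1
    inverted_low = ~a & low_mask                 # ~a_{n-2} ... ~a_0
    return (inverted_low << 1) | ((a >> (n - 1)) & 1)

def dim1_square_model(x, n):
    """Behavioural model of what the diminished-one squarer computes
    when fed the n-bit vector x: |(x + 1)^2 - 1| modulo 2^n + 1."""
    return ((x + 1) ** 2 - 1) % (2**n + 1)

n = 8
a = 200
# vector + 3 plays the role of -2A modulo 2^n + 1
restored = (dim1_square_model(a, n) + minus_two_a_vector(a, n) + 3) % (2**n + 1)
```

Here `restored` equals `a * a` reduced modulo 257, which is the behaviour the extra partial product is meant to provide for $A < 2^n$.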
For odd values of $n$, the previously described modified Booth-encoded modulo $2^n \pm 1$ architectures can also be used, as long as the input operand is augmented by one bit, by adding a 0 at the MSB position. Then $(n+1)/2$ $C_i$ terms and $(n+1)/2 - 1$ $P_i$ terms can be derived, and the procedures (iiia) and (iiib) of Section 2 can be applied as before in order to derive an $n$-bit wide partial product matrix. As an example, Fig. 6 presents the partial product matrix and the circuit implementation of a modulo $2^7 - 1$ Booth-encoded squarer, according to the proposed architecture.
Table 1 presents closed forms for the total number of partial product bits as well as the maximum height among all columns of the partial product matrices, for both the proposed architectures and those of [29,32–34], for the three examined modulo cases. The closed forms of Table 1, as well as the partial product matrices of Figs. 1, 3 and 5 for the $n = 8$ case, indicate that the proposed architectures achieve considerable reductions in both the total number of bits and the maximum height in the partial product matrices. The number of partial product bits provides an indication of the total number of FAs that will be required for reducing the partial products into the two final summands. Assuming that this reduction is performed by an adder tree, the logarithm of the maximum height of the matrix provides an indication of the number of FA stages that will be required. It should, however, be kept in mind that the actual delay savings also depend on the delay of the partial product generation circuits and on whether this delay can be hidden by driving the partial product bits that are derived later to the next stages of the adder tree.
Fig. 3. Partial product matrices and circuit implementations of the proposed (a), (c) modulo $2^8 - 1$ and (b), (d) diminished-one modulo $2^8 + 1$ squarers.
Fig. 4. Radix-4 Booth encoding and $C_i$, $P_i$ generation circuits.
Fig. 5. (a) Partial product matrix and (b) circuit implementation of the proposed normal modulo $2^8 + 1$ squarer.
Fig. 6. (a) Partial product matrix and (b) circuit implementation of the proposed modulo $2^7 - 1$ squarer.
Table 1
Number of bits and maximum height of the partial product matrix.
Quantity | Modulo $2^n - 1$, [29] | Modulo $2^n - 1$, proposed | Normal modulo $2^n + 1$, [32,33] | Normal modulo $2^n + 1$, proposed | Diminished-one modulo $2^n + 1$, [34] | Diminished-one modulo $2^n + 1$, proposed
Number of bits | $(n^2 + n)/2$ | $(n^2 + 6n)/4$ | $(n^2 + 3n)/2$ | $(n^2 + 12n - 4)/4$ | $(n^2 + 5n)/2$ | $(n^2 + 8n - 4)/4$
Maximum height | $n/2 + 2$ | $\lfloor n/4 \rfloor + 3$ | $n/2 + 3$ | $\lfloor n/4 \rfloor + 4$ | $n/2 + 4$ | $\lfloor n/4 \rfloor + 3$
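To get a feel for the closed forms of Table 1, the following Python sketch evaluates them for a given even $n$. The function name and the dictionary keys are assumptions made for this illustration; the formulas are the ones read from Table 1.

```python
def table1(n):
    """Number of partial product bits and maximum column height,
    evaluated from the closed forms of Table 1 (n even)."""
    assert n % 2 == 0, "the closed forms are stated for even n"
    return {
        "mod 2^n-1 [29]":         ((n * n + n) // 2,          n // 2 + 2),
        "mod 2^n-1 proposed":     ((n * n + 6 * n) // 4,      n // 4 + 3),
        "normal [32,33]":         ((n * n + 3 * n) // 2,      n // 2 + 3),
        "normal proposed":        ((n * n + 12 * n - 4) // 4, n // 4 + 4),
        "diminished-1 [34]":      ((n * n + 5 * n) // 2,      n // 2 + 4),
        "diminished-1 proposed":  ((n * n + 8 * n - 4) // 4,  n // 4 + 3),
    }

rows = table1(8)
```

For $n = 8$, the modulo $2^8 - 1$ entries evaluate to 36 bits with height 6 for [29] and 28 bits with height 5 for the proposed matrix.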
4. Evaluation and comparisons
In this section we compare the proposed modulo squarers with those presented in [29,32–34]. We assume even values for $n$. We do not consider small values of $n$ ($n < 8$) in our comparisons, since efficient designs for modulo squarers for these cases can be derived from the minimized logic equations given in [29] or by using look-up tables. In the following we assume that the reduction of the partial products into the two final summands is performed by an adder tree [38]. For the final two-operand adders we assume the implementations reported in [8,9,37].
At first, we consider the unit-gate model [39], which provides estimates independent of any implementation technology. Table 2 presents the area and delay of logic gates and basic cells in gate equivalents according to this model. For the proposed modulo squarers, we assume the four-bit Booth coding implementation of Fig. 4.
The area (in gate equivalents) of a modulo squarer consists of the area required for deriving the partial product bits, the area required for reducing the partial products to two final $n$-bit vectors and the area required for the final adder. Table 3 lists the basic cells, the unit-gate area of each basic cell, the number of basic cells and the total unit-gate area for each modulo squarer. Regarding the area required for the addition of the partial products, let $B$ denote the number of bits in the partial product matrix. Since an FA is equivalent to a 3:2 compressor, we have to use $(B - 2n)$ FAs for reducing the $B$ bits of the partial product matrix to the two $n$-bit vectors that will then be added by the final adder. Furthermore, if $x$ denotes the number of constant bits present in the partial product matrix, then $x$ FAs can be replaced by $x$ HAs or FA+s, according to each constant value.
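The area bookkeeping can be reproduced for the modulo $2^n - 1$ case with the Python sketch below. It is an illustration under assumptions: the cell counts are the ones read from Tables 1 to 3, the logarithm is taken as base 2, $n$ is restricted to powers of two so that it is an integer, and all function names are hypothetical.

```python
from math import log2

def fa_count(total_bits, n, constant_bits):
    """FAs remaining after x of the (B - 2n) compressors become HAs or FA+s."""
    return total_bits - 2 * n - constant_bits

def area_prev_mod_minus(n):
    """Unit-gate area of the modulo 2^n - 1 squarer of [29]."""
    ands = n * n // 2 - n // 2                   # 1 eq. gate each
    fas = fa_count((n * n + n) // 2, n, 0)       # 7 eq. gates each
    adder = 3 * n * log2(n) + 4 * n              # modulo 2^n - 1 adder
    return ands + 7 * fas + adder

def area_prop_mod_minus(n):
    """Unit-gate area of the proposed Booth-encoded modulo 2^n - 1 squarer."""
    booth = n // 2                               # 6 eq. gates each
    xnor = n * n // 4 - n // 2                   # 2 eq. gates each
    nor_c = n // 2                               # 1 eq. gate each
    nor_p0 = n // 2 - 1                          # 1 eq. gate each
    nor_or = n * n // 4 - n // 2                 # 4 eq. gates each
    ha = n // 2 + 1                              # 3 eq. gates each
    fas = fa_count((n * n + 6 * n) // 4, n, ha)  # 7 eq. gates each
    adder = 3 * n * log2(n) + 4 * n
    return 6 * booth + 2 * xnor + nor_c + nor_p0 + 4 * nor_or + 7 * fas + 3 * ha + adder
```

Summing the cells this way reproduces the closed-form totals $4n^2 + 3n\log n - 7n$ and $\frac{13}{4}n^2 + 3n\log n - \frac{1}{2}n - 5$ for the sampled values of $n$.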
The delay (in gate equivalents) of a modulo squarer depends on the delay for deriving the partial product bits, the delay for reducing the partial products to two final $n$-bit vectors and the delay of the final adder. Table 4 lists the unit-gate delay of each step and the total delay of each modulo squarer for various values of $n$. From Figs. 1, 3 and 5 it is evident that the partial product matrices are not fully balanced, that is, they do not have the same number of partial product bits in every column. Hence, a series of HAs, FA+s or FAs has to be used initially for balancing the partial product matrix. Their delay may or may not be on the critical path of the corresponding modulo squarer, depending on the specific architecture and the specific value of $n$. Once fully balanced, a matrix with $k$ bits in every column requires $\theta(k)$ levels of FAs assuming a Dadda architecture, or equivalently $4 \times \theta(k)$ gate equivalents, where $\theta(k)$ denotes the minimum number of levels of an adder tree that processes $k$ input operands [29]. When a constant bit is present in every column of the fully balanced partial product matrix, the corresponding FA that processes it can be replaced by an HA or an FA+, depending on the specific constant value. For some modulo squarers and some values of $n$, this reduces the delay of the adder tree to $4 \times \theta(k-1) + 2$ gate equivalents instead of $4\theta(k)$.
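The level count $\theta(k)$ can be computed from the usual Dadda height sequence 2, 3, 4, 6, 9, 13, ... The Python sketch below is an assumption-labelled illustration: `dadda_levels` and `tree_delay` are hypothetical names, and the delay figure simply multiplies the level count by the FA delay of 4 gate equivalents from the unit-gate model.

```python
def dadda_levels(k):
    """Minimum number of 3:2 compressor (FA) levels that reduce
    k operands to 2, following the Dadda height sequence."""
    levels, height = 0, 2
    while height < k:
        height = (3 * height) // 2   # 2, 3, 4, 6, 9, 13, 19, 28, ...
        levels += 1
    return levels

def tree_delay(k, fa_delay=4):
    """Adder-tree delay in gate equivalents for a balanced k-high matrix."""
    return fa_delay * dadda_levels(k)
```

For example, six operands need three FA levels, so `tree_delay(6)` is 12 gate equivalents, while a seventh operand adds a fourth level.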
Table 5 lists the area and delay estimates derived for the various architectures under comparison and for several values of $n$. Area savings of up to 17%, 15% and 29% can be observed for the three modulo cases, respectively. The only exception is the normal modulo $2^n + 1$ squarer, where for small values of $n$ (8 and 12) the cost of the Booth encoding logic, together with the fact that the proposed squarer is derived by adding an extra partial product to the corresponding diminished-one Booth-encoded modulo squarer, outweighs the savings offered by the reduction of the partial products. The wider the input operand, the larger the area savings. It should be noted, however, that the unit-gate area savings can only be considered lower bounds, since the $P_i$ terms in the proposed Booth-encoded squarers can be implemented much more efficiently in CMOS VLSI technologies than predicted by the unit-gate model, by the use of OR-AND-INVERT compound gates. In the majority of the examined cases, the proposed squarers also offer a small delay reduction, equal to the delay of one HA, FA+ or FA. In most of the remaining cases they offer the same delay as the previous proposals. The only exception appears in the $n = 8$ case, where the reduced maximum height of the partial product matrix is offset by the increased delay for generating the partial product bits.
Finally, the previously reported and the proposed squarer architectures were described in HDL for $n$ = 8, 12, 16, 20 and 32. After simulating the resulting descriptions, the designs were mapped to a 90 nm CMOS standard cell library [40] that provides a single poly layer and up to nine metal layers, using the Synopsys® Design Compiler® tool. A NAND gate with a 1X drive strength in this technology requires an implementation area of 3.136 μm². Typical process parameters (1.0 V, 25 °C) were used. A bottom-up approach was followed during mapping. Each basic cell (for example the half and full adder cell) was iteratively optimized until no further delay savings were possible. The cells then underwent successive area recovery steps. Finally, "do not touch" primitives were applied to them. Then, the same optimization procedure was applied to every subcomponent and successively to the whole circuit. In this way, every design was mapped to the target technology as an interconnection of already optimized blocks, and the architecture in each description was preserved as much as possible. All constraints, such as maximum fan-out, output capacitance and available input drive strength, were kept constant for all architectures. Table 6 lists the attained area and delay results. Parentheses in the rightmost column indicate the theoretical delay savings that are expected. The results validate the estimates and conclusions drawn before. The proposed modulo squarers, for medium and large values of $n$, achieve significant area savings (up to 38%), while in most cases they also offer small delay improvements. The small deviations between estimates and results are attributed to the optimization algorithm of the synthesis tool.
5. Conclusions
Efficient modulo squarers are highly appreciated in special-purpose digital signal processors that use a residue number
Table 2
Unit gate model.
Logic gate/basic cell | Area | Delay
NOT, BUF | 0 | 0
AND, OR, NAND, NOR (two-input) | 1 | 1
XOR, XNOR (two-input) | 2 | 2
HA/FA+ | 3 | 2
FA | 7 | 4
Table 3. Unit-gate area analysis.

| Basic cells | Previously proposed: modulo $2^n-1$ [29] | Previously proposed: modulo $2^n+1$ [32,33] | Previously proposed: modulo $2^n+1$ [34] | Proposed Booth-encoded: modulo $2^n-1$ | Proposed Booth-encoded: modulo $2^n+1$ (normal) | Proposed Booth-encoded: modulo $2^n+1$ (diminished-one) |
|---|---|---|---|---|---|---|
| Bit generation | | | | | | |
| # AND/NAND (1 eq. gate) | $n^2/2 - n/2$ | $n^2/2 - n/2$ | $n^2/2 - n/2$ | – | – | – |
| # Booth encoders (6 eq. gates) | – | – | – | $n/2$ | $n/2$ | $n/2$ |
| # XNOR ($b_i$ bits) (2 eq. gates) | – | – | – | $n^2/4 - n/2$ | $n^2/4 - n/2$ | $n^2/4 - n/2$ |
| # NOR ($C_{i,2}$ bits) (1 eq. gate) | – | – | – | $n/2$ | $n/2$ | $n/2$ |
| # NOR ($P_{i,0}$ bits) (1 eq. gate) | – | – | – | $n/2 - 1$ | $n/2 - 1$ | $n/2 - 1$ |
| # NOR–OR/NOR–NOR ($P_{i,j}$ bits) (4 eq. gates) | – | – | – | $n^2/4 - n/2$ | $n^2/4 - n/2$ | $n^2/4 - n/2$ |
| # OR (normal modulo $2^n+1$) (1 eq. gate) | – | 1 | – | – | 1 | – |
| Bit reduction | | | | | | |
| # FA (7 eq. gates) | $n^2/2 - 3n/2$ | $n^2/2 - 3n/2$ | $n^2/2 - n/2$ | $n^2/4 - n - 1$ | $n^2/4 - 1$ | $n^2/4 - n - 1$ |
| # HA/FA+ (3 eq. gates) | – | $n$ | $n$ | $n/2 + 1$ | $n$ | $n$ |
| Final addition | | | | | | |
| # Modulo $2^n-1$ adder [8] ($3n\log n + 4n$ eq. gates) | 1 | – | – | 1 | – | – |
| # Normal modulo $2^n+1$ adder [38] ($\frac{9}{2}n\log n + \frac{n}{2} + 7$ eq. gates) | – | 1 | – | – | 1 | – |
| # Dim-1 modulo $2^n+1$ adder [38] ($\frac{9}{2}n\log n + \frac{n}{2} + 5$ eq. gates) | – | – | 1 | – | – | 1 |
| Total area (eq. gates) | $4n^2 + 3n\log n - 7n$ | $4n^2 + \frac{9}{2}n\log n - \frac{15}{2}n + 8$ | $4n^2 + \frac{9}{2}n\log n - \frac{1}{2}n + 5$ | $\frac{13}{4}n^2 + 3n\log n - \frac{1}{2}n - 5$ | $\frac{13}{4}n^2 + \frac{9}{2}n\log n + \frac{9}{2}n$ | $\frac{13}{4}n^2 + \frac{9}{2}n\log n - \frac{5}{2}n - 3$ |
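The total-area expressions of Table 3 can be evaluated numerically. The sketch below is only an illustration: the function and dictionary names are hypothetical, and it assumes that log n denotes the base-2 logarithm and that the results are rounded to the nearest integer, which appears to reproduce the area figures reported in Table 5.

```python
import math

# Total-area formulas from the last row of Table 3 (in equivalent gates).
# Assumption: "log n" is the base-2 logarithm, taken as a real number.
AREA_FORMULAS = {
    "ref29_minus":     lambda n, lg: 4 * n * n + 3 * n * lg - 7 * n,
    "ref32_33_normal": lambda n, lg: 4 * n * n + 4.5 * n * lg - 7.5 * n + 8,
    "ref34_dim1":      lambda n, lg: 4 * n * n + 4.5 * n * lg - 0.5 * n + 5,
    "proposed_minus":  lambda n, lg: 3.25 * n * n + 3 * n * lg - 0.5 * n - 5,
    "proposed_normal": lambda n, lg: 3.25 * n * n + 4.5 * n * lg + 4.5 * n,
    "proposed_dim1":   lambda n, lg: 3.25 * n * n + 4.5 * n * lg - 2.5 * n - 3,
}

def unit_gate_area(design, n):
    """Return the unit-gate area estimate of `design`, rounded to an integer."""
    return round(AREA_FORMULAS[design](n, math.log2(n)))

if __name__ == "__main__":
    for design in AREA_FORMULAS:
        print(design, [unit_gate_area(design, n) for n in (8, 12, 16)])
```

Comparing, for instance, `unit_gate_area("ref34_dim1", n)` with `unit_gate_area("proposed_dim1", n)` gives the estimated gate savings of the diminished-one case for a chosen n.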
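Table 4. Unit-gate delay analysis (assuming even values of n for the previously reported squarers and values of n that are multiples of four for the proposed squarers).

| Squarer | Bit generation (eq. gates) | Bit reduction (adder tree) (eq. gates) | Final addition (eq. gates) | Total delay (eq. gates) |
|---|---|---|---|---|
| Modulo $2^n-1$ | | | | |
| [29] | 1 | $4\theta(n/2)$ or $4\theta(n/2)+4$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2)+2\lceil\log n\rceil+4$ (n = 8, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64) or $4\theta(n/2)+2\lceil\log n\rceil+8$ (n = 12) |
| Proposed | 2, 3, 5 | $4\theta(n/4+1)$ or $4\theta(n/4+1)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+1)+2\lceil\log n\rceil+8$ (n = 16, 24, 28, 36, 40, 44, 52, 56, 60, 64) or $4\theta(n/4+1)+2\lceil\log n\rceil+10$ (n = 8, 12, 20, 32, 48) |
| Normal modulo $2^n+1$ | | | | |
| [32,33] | 1 | $4\theta(n/2)+2$ or $4\theta(n/2)+1$ or $4\theta(n/2+1)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2)+2\lceil\log n\rceil+6$ (n = 8, 56) or $4\theta(n/2+1)+2\lceil\log n\rceil+4$ (n = 12, 20, 24, 28, 32, 36, 40, 44, 48, 52, 60, 64) or $4\theta(n/2+1)+2\lceil\log n\rceil+6$ (n = 16) |
| Proposed | 2, 3, 5 | $4\theta(n/4+2)+2$ or $4\theta(n/4+3)$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+2)+2\lceil\log n\rceil+10$ (n = 8, 16, 28, 44) or $4\theta(n/4+3)+2\lceil\log n\rceil+8$ (n = 12, 20, 24, 32, 36, 40, 48, 52, 56, 60, 64) |
| Diminished-one modulo $2^n+1$ | | | | |
| [34] | 1 | $4\theta(n/2+2)$ or $4\theta(n/2+2)+2$ | $2\lceil\log n\rceil+3$ | $4\theta(n/2+2)+2\lceil\log n\rceil+3$ (n = 16, 24, 36) or $4\theta(n/2+2)+2\lceil\log n\rceil+4$ (n = 12, 20, 28, 32, 40, 44, 48, 52, 56, 60, 64) or $4\theta(n/2+2)+2\lceil\log n\rceil+5$ (n = 8) |
| Proposed | 2, 3, 5 | $4\theta(n/4+1)+2$ or $4\theta(n/4+2)$ | $2\lceil\log n\rceil+3$ | $4\theta(n/4+1)+2\lceil\log n\rceil+10$ (n = 8, 12, 20, 32, 48) or $4\theta(n/4+2)+2\lceil\log n\rceil+8$ (n = 16, 24, 28, 36, 40, 44, 52, 56, 60, 64) |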
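The total-delay expressions of Table 4 for the modulo 2^n − 1 squarers can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' own tooling: the function names are hypothetical, and θ(k) is assumed here to be the number of carry-save levels of a Dadda tree that reduces k operands to two (stage heights 2, 3, 4, 6, 9, 13, ...). Under that assumption the computed values coincide with the modulo 2^n − 1 delay rows of Table 5.

```python
def dadda_levels(k):
    """Assumed meaning of theta(k): CSA levels needed to reduce k operands to 2."""
    levels, height = 0, 2
    while height < k:
        height = (3 * height) // 2  # Dadda sequence: 2, 3, 4, 6, 9, 13, 19, ...
        levels += 1
    return levels

def ceil_log2(n):
    """Integer ceiling of log2(n) for n >= 1."""
    return (n - 1).bit_length()

def delay_ref29(n):
    """Total delay of the modulo 2^n - 1 squarer of [29], per Table 4."""
    constant = 8 if n == 12 else 4
    return 4 * dadda_levels(n // 2) + 2 * ceil_log2(n) + constant

def delay_proposed_minus(n):
    """Total delay of the proposed modulo 2^n - 1 squarer, per Table 4."""
    constant = 10 if n in (8, 12, 20, 32, 48) else 8
    return 4 * dadda_levels(n // 4 + 1) + 2 * ceil_log2(n) + constant

if __name__ == "__main__":
    for n in range(8, 65, 4):
        print(n, delay_ref29(n), delay_proposed_minus(n))
```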
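Table 5. Unit gate area and delay estimates in equivalent gates.

Modulo $2^n-1$ squarers

| n | 8 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area [29] | 272 | 621 | 1104 | 1719 | 2466 | 3344 | 4352 | 5490 | 6759 | 8157 | 9684 | 11,341 | 13,128 | 15,043 | 17,088 |
| Area, proposed | 271 | 586 | 1011 | 1544 | 2185 | 2933 | 3787 | 4747 | 5814 | 6986 | 8263 | 9646 | 11,135 | 12,728 | 14,427 |
| Area savings | 1 | 35 | 93 | 175 | 281 | 411 | 565 | 743 | 945 | 1171 | 1421 | 1695 | 1993 | 2315 | 2661 |
| Delay [29] | 18 | 28 | 28 | 34 | 34 | 38 | 38 | 40 | 44 | 44 | 44 | 44 | 44 | 48 | 48 |
| Delay, proposed | 20 | 26 | 28 | 32 | 34 | 34 | 36 | 40 | 40 | 40 | 42 | 44 | 44 | 44 | 44 |
| Delay savings | −2 | 2 | 0 | 2 | 0 | 4 | 2 | 0 | 4 | 4 | 2 | 0 | 0 | 4 | 4 |

Normal modulo $2^n+1$ squarers

| n | 8 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area [32,33] | 312 | 688 | 1200 | 1847 | 2627 | 3540 | 4584 | 5760 | 7066 | 8503 | 10,070 | 11,768 | 13,595 | 15,553 | 17,640 |
| Area, proposed | 352 | 716 | 1192 | 1779 | 2475 | 3280 | 4192 | 5212 | 6338 | 7571 | 8910 | 10,356 | 11,907 | 13,565 | 15,328 |
| Area savings | −40 | −28 | 8 | 68 | 152 | 260 | 392 | 548 | 728 | 932 | 1160 | 1412 | 1688 | 1988 | 2312 |
| Delay [32,33] | 20 | 28 | 30 | 34 | 34 | 38 | 38 | 40 | 44 | 44 | 44 | 44 | 46 | 48 | 48 |
| Delay, proposed | 24 | 28 | 30 | 34 | 34 | 36 | 38 | 40 | 40 | 42 | 44 | 44 | 44 | 44 | 44 |
| Delay savings | −4 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 2 | 0 | 0 | 2 | 4 | 4 |

Diminished-one modulo $2^n+1$ squarers

| n | 8 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area [34] | 365 | 769 | 1309 | 1984 | 2792 | 3733 | 4805 | 6009 | 7343 | 8808 | 10,403 | 12,129 | 13,984 | 15,970 | 18,085 |
| Area, proposed | 293 | 629 | 1077 | 1636 | 2304 | 3081 | 3965 | 4957 | 6055 | 7260 | 8571 | 9989 | 11,512 | 13,142 | 14,877 |
| Area savings | 72 | 140 | 232 | 348 | 488 | 652 | 840 | 1052 | 1288 | 1548 | 1832 | 2140 | 2472 | 2828 | 3208 |
| Delay [34] | 23 | 28 | 31 | 34 | 37 | 38 | 38 | 43 | 44 | 44 | 44 | 44 | 48 | 48 | 48 |
| Delay, proposed | 20 | 26 | 28 | 32 | 34 | 34 | 36 | 40 | 40 | 40 | 42 | 44 | 44 | 44 | 44 |
| Delay savings | 3 | 2 | 3 | 2 | 3 | 4 | 2 | 3 | 4 | 4 | 2 | 0 | 4 | 4 | 4 |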
Table 6. CMOS VLSI implementation results.

Modulo $2^n-1$ squarers

| n | Area [29] (μm²) | Area, proposed (μm²) | Area savings (%) | Delay [29] (ns) | Delay, proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 3399 | 3323 | 2.2 | 0.545 | 0.576 | −5.7 (−1 HA/FA+) |
| 12 | 8448 | 7025 | 16.8 | 0.808 | 0.751 | 7.1 (+1 HA/FA+) |
| 16 | 15,373 | 12,030 | 21.7 | 0.826 | 0.801 | 3.0 (0) |
| 20 | 24,158 | 18,413 | 23.8 | 1.015 | 0.940 | 7.4 (+1 HA/FA+) |
| 32 | 61,471 | 41,731 | 32.1 | 1.136 | 1.064 | 6.3 (+1 HA/FA+) |

Normal modulo $2^n+1$ squarers

| n | Area [32,33] (μm²) | Area, proposed (μm²) | Area savings (%) | Delay [32,33] (ns) | Delay, proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 4106 | 4696 | −14.4 | 0.660 | 0.746 | −13.0 (−1 FA) |
| 12 | 9907 | 9664 | 2.4 | 0.868 | 0.856 | 1.4 (0) |
| 16 | 16,850 | 14,700 | 12.8 | 0.902 | 0.916 | −1.6 (0) |
| 20 | 26,485 | 22,611 | 14.6 | 1.053 | 1.030 | 2.2 (0) |
| 32 | 64,336 | 47,053 | 26.9 | 1.196 | 1.196 | 0.0 (0) |
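The percentage columns of Table 6 follow from the raw area and delay figures. A minimal sketch of that arithmetic is given below, assuming savings are expressed relative to the previously reported design and rounded to one decimal place; the function name is hypothetical.

```python
def savings_percent(reference, proposed):
    """Relative savings of `proposed` over `reference`, in percent (one decimal)."""
    return round(100.0 * (reference - proposed) / reference, 1)

if __name__ == "__main__":
    # Modulo 2^n - 1 squarers, n = 8 and n = 32 (area in square micrometres).
    print(savings_percent(3399, 3323))      # area, n = 8
    print(savings_percent(61471, 41731))    # area, n = 32
    # A negative value means the proposed design is slower (delay in ns, n = 8).
    print(savings_percent(0.545, 0.576))
```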
system and in the implementation of modulo exponentiators and multiplicative inverses. We have proposed modified Booth-encoded squarer architectures for modulo $2^n \pm 1$ arithmetic. For the modulo $2^n + 1$ case, we have considered both the normal and the diminished-one representations. All proposed architectures lead to shallower partial product matrices with fewer bits that need to be added, at the cost of more complexity in the generation of the partial product bits. Experimental results have shown that the proposed modified Booth-encoded squarers offer up to 38% less implementation area than the previous proposals, while in most cases they also offer a small delay improvement equal to the delay of a half adder or a full adder.
Acknowledgement
This research was partially supported by the Caratheodory Programme of the University of Patras (D.178).
Appendix A. Calculation of T for the proposed diminished-one modulo $2^n + 1$ Booth-encoded squarer
A.1. n is even and a multiple of four
The squarer has $(n/2)$ $C_i$ terms, with 2 bits each. Half of them are inverted (see Fig. 3b as an example for the n = 8 case). Hence, according to procedure (iiib) in Section 2, the correction term for repositioning the $C_i$ terms, $T_{C_i}$, is equal to:

$$T_{C_i} = -(2^0 + 2^2 + \cdots + 2^{n-2}) = -\frac{2^n - 1}{3}. \tag{A.1}$$
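The squarer also has $((n/2)-1)$ $P_i$ terms that also undergo repositioning. The correction term required for the first $n/4$ of them, $T_{PA}$, is equal to:

$$\begin{aligned}
T_{PA} &= T_{P_0} + T_{P_1} + \cdots + T_{P_{n/4-1}} \\
&= -2^0 - (2^0 + 2^1 + 2^2) - \cdots - (2^0 + 2^1 + \cdots + 2^{(n/2)-2}) \\
&= -(2^1 - 1) - (2^3 - 1) - \cdots - (2^{(n/2)-1} - 1) \\
&= \frac{n}{4} - \frac{2}{3}\left(2^{n/2} - 1\right).
\end{aligned} \tag{A.2}$$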
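Similarly, the correction term required for the remaining $((n/4)-1)$ $P_i$ terms, $T_{PB}$, is equal to:

$$\begin{aligned}
T_{PB} &= T_{P_{n/4}} + T_{P_{(n/4)+1}} + \cdots + T_{P_{(n/2)-2}} \\
&= -(2^3 + 2^4 + \cdots + 2^{n/2}) - (2^7 + 2^8 + \cdots + 2^{(n/2)+2}) - \cdots - (2^{n-5} + 2^{n-4}) \\
&= -(2^{(n/2)+1} - 2^3) - (2^{(n/2)+3} - 2^7) - \cdots - (2^{n-3} - 2^{n-5}) \\
&= -(2^{(n/2)+1} + 2^{(n/2)+3} + \cdots + 2^{n-3}) + (2^3 + 2^7 + \cdots + 2^{n-5}) \\
&= -2^{(n/2)+1}\,\frac{2^{(n/2)-2} - 1}{3} + (2^3 + 2^7 + \cdots + 2^{n-5}).
\end{aligned} \tag{A.3}$$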
The reduction of the partial products into two n-bit final addends requires $n/4$ CSAs with EACs. Hence the correction term due to the CSA tree is equal to:

$$T_{CSA,A} = -\frac{n}{4}. \tag{A.4}$$
Therefore, the total correction term is equal to:

$$T_A = \left| T_{C_i} + T_{PA} + T_{PB} + T_{CSA,A} - 2 \right|_{2^n+1}, \tag{A.5}$$
where the last term $(-2)$ accounts for the fact that we want the diminished-one representation of the result and for the fact that the total correction term also has to be in diminished-one representation. Substituting (A.1)–(A.4) in (A.5), we get that:

$$\begin{aligned}
T_A &= \left| -\frac{2^n - 1}{3} + \frac{n}{4} - \frac{2}{3}\left(2^{n/2} - 1\right) - 2^{(n/2)+1}\,\frac{2^{(n/2)-2} - 1}{3} + (2^3 + 2^7 + \cdots + 2^{n-5}) - \frac{n}{4} - 2 \right|_{2^n+1} \\
&= \left| -2^{n-1} + (2^3 + 2^7 + \cdots + 2^{n-5}) - 1 \right|_{2^n+1} \\
&= 2^n + 1 - 2^{n-1} + (2^3 + 2^7 + \cdots + 2^{n-5}) - 1 \\
&= \sum_{i=0}^{(n/4)-1} 2^{4i+3} = 88\ldots8_{16}.
\end{aligned}$$
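The result of A.1 can be cross-checked numerically. The following sketch, with hypothetical function names, rebuilds each correction term from the explicit power-of-two sums that appear in (A.1)–(A.4), reduces the total modulo 2^n + 1 as in (A.5), and compares it with the closed form 88…8 (hexadecimal). It assumes n is a multiple of four, as in this subsection.

```python
def correction_term_a(n):
    """T_A of (A.5), built from the explicit sums of (A.1)-(A.4); n multiple of 4."""
    assert n >= 4 and n % 4 == 0
    modulus = 2**n + 1
    # (A.1): -(2^0 + 2^2 + ... + 2^(n-2))
    t_ci = -sum(2**(2 * i) for i in range(n // 2))
    # (A.2): the i-th of the first n/4 P_i terms contributes -(2^0 + ... + 2^(2i))
    t_pa = -sum(sum(2**j for j in range(2 * i + 1)) for i in range(n // 4))
    # (A.3): the k-th remaining term contributes -(2^(4k+3) + ... + 2^(n/2+2k))
    t_pb = -sum(sum(2**j for j in range(4 * k + 3, n // 2 + 2 * k + 1))
                for k in range(n // 4 - 1))
    # (A.4): one correction of -1 per CSA with EAC
    t_csa = -(n // 4)
    return (t_ci + t_pa + t_pb + t_csa - 2) % modulus

def closed_form_a(n):
    """Closed form of T_A: sum of 2^(4i+3) for i = 0 .. n/4 - 1."""
    return sum(2**(4 * i + 3) for i in range(n // 4))

if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        print(n, format(correction_term_a(n), "x"))
```

Python's `%` operator returns a non-negative residue for a positive modulus, so the negative intermediate sum maps directly to the residue used in (A.5).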
A.2. n is even but not a multiple of four
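The correction term required due to the inversion of the $C_i$ terms, $T_{C_i}$, is also given in this case by (A.1). The squarer also has $((n/2)-1)$ $P_i$ terms in this case. The correction term required for repositioning the first $\lfloor n/4 \rfloor = ((n/2)-1)/2$ of them, $T_{PC}$, is equal to:

$$\begin{aligned}
T_{PC} &= T_{P_0} + T_{P_1} + \cdots + T_{P_{\lfloor n/4 \rfloor - 1}} \\
&= -2^0 - (2^0 + 2^1 + 2^2) - \cdots - (2^0 + 2^1 + \cdots + 2^{(n/2)-3}) \\
&= -(2^1 - 1) - (2^3 - 1) - \cdots - (2^{(n/2)-2} - 1) \\
&= \left\lfloor \frac{n}{4} \right\rfloor - \frac{2}{3}\left(2^{2\lfloor n/4 \rfloor} - 1\right).
\end{aligned} \tag{A.6}$$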
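The correction term required for the remaining $\lfloor n/4 \rfloor = ((n/2)-1)/2$ $P_i$ terms, $T_{PD}$, is equal to:

$$\begin{aligned}
T_{PD} &= T_{P_{\lfloor n/4 \rfloor}} + T_{P_{\lfloor n/4 \rfloor + 1}} + \cdots + T_{P_{(n/2)-2}} \\
&= -(2^1 + 2^2 + \cdots + 2^{(n/2)-1}) - (2^5 + 2^6 + \cdots + 2^{(n/2)+1}) - \cdots - (2^{n-5} + 2^{n-4}) \\
&= -(2^{n/2} - 2^1) - (2^{(n/2)+2} - 2^5) - \cdots - (2^{n-3} - 2^{n-5}) \\
&= -(2^{n/2} + 2^{(n/2)+2} + \cdots + 2^{n-3}) + (2^1 + 2^5 + \cdots + 2^{n-5}) \\
&= -2^{n/2}\,\frac{2^{2\lfloor n/4 \rfloor} - 1}{3} + (2^1 + 2^5 + \cdots + 2^{n-5}).
\end{aligned} \tag{A.7}$$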
Table 6 (continued)

Diminished-one modulo $2^n+1$ squarers

| n | Area [34] (μm²) | Area, proposed (μm²) | Area savings (%) | Delay [34] (ns) | Delay, proposed (ns) | Delay savings (%) |
|---|---|---|---|---|---|---|
| 8 | 5029 | 3766 | 25.1 | 0.692 | 0.627 | 9.4 (+1 HA/FA+ +1 gate) |
| 12 | 10,963 | 8113 | 26.0 | 0.849 | 0.812 | 4.4 (+1 HA/FA+) |
| 16 | 18,038 | 12,985 | 28.0 | 0.946 | 0.840 | 11.2 (+1 FA −1 gate) |
| 20 | 29,048 | 19,664 | 32.3 | 1.062 | 1.036 | 2.4 (+1 HA/FA+) |
| 32 | 68,480 | 42,699 | 37.6 | 1.186 | 1.150 | 3.0 (+1 HA/FA+) |
The reduction of the partial products into two n-bit final addends requires $\lfloor n/4 \rfloor$ CSAs with EACs. Hence the correction term due to the CSA tree is equal to:

$$T_{CSA,B} = -\left\lfloor \frac{n}{4} \right\rfloor \tag{A.8}$$
and the total correction term is equal to:

$$T_B = \left| T_{C_i} + T_{PC} + T_{PD} + T_{CSA,B} - 2 \right|_{2^n+1} \tag{A.9}$$
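and by substituting (A.1), (A.6)–(A.8) in (A.9), we get that:

$$\begin{aligned}
T_B &= \left| -\frac{2^n - 1}{3} + \left\lfloor \frac{n}{4} \right\rfloor - \frac{2}{3}\left(2^{2\lfloor n/4 \rfloor} - 1\right) - 2^{n/2}\,\frac{2^{2\lfloor n/4 \rfloor} - 1}{3} + (2^1 + 2^5 + \cdots + 2^{n-5}) - \left\lfloor \frac{n}{4} \right\rfloor - 2 \right|_{2^n+1} \\
&= \left| -2^{n-1} + (2^1 + 2^5 + \cdots + 2^{n-5}) - 1 \right|_{2^n+1} \\
&= 2^n + 1 - 2^{n-1} + (2^1 + 2^5 + \cdots + 2^{n-5}) - 1 \\
&= \sum_{i=0}^{\lfloor n/4 \rfloor} 2^{4i+1} = 22\ldots2_{16}.
\end{aligned}$$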
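For the case where n is even but not a multiple of four, the substitution can be checked by evaluating the closed forms directly. The sketch below uses hypothetical function names and plugs the right-hand sides of (A.1) and (A.6)–(A.8) into (A.9); the result is then compared with the hexadecimal pattern 22…2. All divisions by 3 are exact, since 4^q − 1 and 2^n − 1 (n even) are multiples of 3.

```python
def correction_term_b(n):
    """T_B of (A.9) from the closed forms (A.1), (A.6)-(A.8); n = 2 (mod 4)."""
    assert n >= 6 and n % 4 == 2
    q = n // 4                                   # floor(n/4), so 2q = n/2 - 1
    modulus = 2**n + 1
    t_ci = -((2**n - 1) // 3)                    # (A.1)
    t_pc = q - 2 * (4**q - 1) // 3               # (A.6)
    tail = sum(2**(4 * k + 1) for k in range(q)) # 2^1 + 2^5 + ... + 2^(n-5)
    t_pd = -(2**(n // 2)) * ((4**q - 1) // 3) + tail   # (A.7)
    t_csa = -q                                   # (A.8)
    return (t_ci + t_pc + t_pd + t_csa - 2) % modulus

if __name__ == "__main__":
    for n in (6, 10, 14):
        print(n, format(correction_term_b(n), "x"))
```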
Appendix B. Calculation of $T_{NORMAL}$ for the proposed normal modulo $2^n + 1$ Booth-encoded squarers

The proposed normal modulo $2^n + 1$ Booth-encoded squarers utilize one more partial product than the proposed diminished-one squarers (see Figs. 3b and 5a for the n = 8 case). This corresponds to $|-2A|_{2^n+1}$ and can be expressed in n bits provided that a further correction equal to 3 is taken into account. This extra partial product has to be added along with the remaining partial products. Therefore, one more CSA is required and an additional correction equal to $-1$ has to be considered to account for the inverted EAC. Based on the above, $T_{NORMAL} = |T + 3 - 1|_{2^n+1} = |T + 2|_{2^n+1}$, where T represents the correction term required in the corresponding diminished-one squarer.
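As a small worked illustration of the relation above, the sketch below takes the closed forms of T from Appendix A and adds 2 modulo 2^n + 1. The function names are hypothetical and n is assumed to be even.

```python
def t_diminished_one(n):
    """Closed form of T from Appendix A (88...8 or 22...2 in hexadecimal)."""
    assert n >= 4 and n % 2 == 0
    if n % 4 == 0:
        return sum(2**(4 * i + 3) for i in range(n // 4))
    return sum(2**(4 * i + 1) for i in range(n // 4 + 1))

def t_normal(n):
    """T_NORMAL = |T + 2| modulo 2^n + 1, as derived in Appendix B."""
    return (t_diminished_one(n) + 2) % (2**n + 1)

if __name__ == "__main__":
    for n in (6, 8, 12):
        print(n, format(t_diminished_one(n), "x"), format(t_normal(n), "x"))
```

For n = 8, T is 88 in hexadecimal (136), so the normal squarer would use 138, that is 8A in hexadecimal.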
References
[1] P.V. Ananda Mohan, Residue Number Systems: Algorithms and Architectures, Springer-Verlag, 2002.
[2] A. Omondi, B. Premkumar, Residue Number Systems: Theory and Implementations, Imperial College Press, 2007.
[3] B. Cao, C.H. Chang, T. Srikanthan, An efficient reverse converter for the 4-moduli set $\{2^n - 1, 2^n, 2^n + 1, 2^{2n} + 1\}$ based on the new Chinese remainder theorem, IEEE Trans. Circuits Syst. I 50 (10) (2003) 1296–1303.
[4] R. Conway, J. Nelson, Improved RNS FIR filter architectures, IEEE Trans. Circuits Syst. II 51 (1) (2004) 26–28.
[5] B. Cao, T. Srikanthan, C.H. Chang, Efficient reverse converters for the four-moduli sets $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$ and $\{2^n - 1, 2^n, 2^n + 1, 2^{n-1} - 1\}$, IEE Proc. Comput. Digital Tech. 152 (5) (2005) 687–696.
[6] B. Cao, C.H. Chang, T. Srikanthan, A residue-to-binary converter for a new 5-moduli set, IEEE Trans. Circuits Syst. I 54 (5) (2007) 1041–1049.
[7] K. Navi, A. Molahosseini, M. Esmaeildoust, How to teach residue number system to computer scientists and engineers, IEEE Trans. Educ. 54 (1) (2011) 156–163.
[8] L. Kalampoukas, et al., High-speed parallel-prefix modulo $2^n - 1$ adders, IEEE Trans. Comput. 49 (7) (2000) 673–680.
[9] H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-one modulo $2^n + 1$ adder design, IEEE Trans. Comput. 51 (12) (2002) 1389–1399.
[10] Y. Ma, A simplified architecture for modulo $(2^n + 1)$ multiplication, IEEE Trans. Comput. 47 (3) (1998) 333–337.
[11] C. Efstathiou, H.T. Vergos, D. Nikolos, Modified Booth modulo $2^n - 1$ multipliers, IEEE Trans. Comput. 53 (3) (2004) 370–374.
[12] C. Efstathiou, H.T. Vergos, G. Dimitrakopoulos, D. Nikolos, Efficient diminished-1 modulo $2^n + 1$ multipliers, IEEE Trans. Comput. 54 (4) (2005) 491–496.
[13] L. Sousa, R. Chaves, A universal architecture for designing efficient modulo $2^n + 1$ multipliers, IEEE Trans. Circuits Syst. I 52 (6) (2005) 1166–1178.
[14] L.M. Leibowitz, A simplified binary arithmetic for the Fermat number transform, IEEE Trans. Acoust. Speech Signal Process. 24 (5) (1976) 356–359.
[15] T. Kwan, T. Martin, Adaptive detection and enhancement of multiple sinusoids using a cascade IIR filter, IEEE Trans. Circuits Syst. 36 (7) (1989) 937–945.
[16] R.H. Strandberg, et al., Efficient realizations of squaring circuit and reciprocal used in adaptive sample rate notch filters, J. VLSI Signal Process. 14 (3) (1996) 303–309.
[17] R. Jain, A. Madisetti, R.L. Baker, An integrated circuit design for pruned tree-search vector quantization encoding with an off-chip controller, IEEE Trans. Circuits Syst. Video Technol. 2 (2) (1992) 147–158.
[18] J. Pihl, E.J. Aas, A multiplier and squarer generator for high performance DSP applications, Proceedings of 39th Midwest Symposium on Circuits and Systems, vol. I, 1996, pp. 109–112.
[19] J.-T. Yoo, K.F. Smith, G. Gopalakrishnan, A fast parallel squarer based on divide-and-conquer, IEEE J. Solid-State Circuits 32 (6) (1997) 909–912.
[20] Y. Yu Fengqi, A.N. Wilson, Multirate digital squarer architectures, in: Proceedings of the 8th IEEE International Conference on Electronics Circuits & Systems, 2001, pp. 177–180.
[21] Lucent Technologies, DSP 1628 Datasheet, Murray Hill, NJ, 1997.
[22] R.K. Kolagotla, W.R. Griesbach, H.R. Srinivas, VLSI implementation of 350 MHz 0.35 μm 8 bit merged squarer, Electron. Lett. 34 (1) (1998) 47–48.
[23] A.A. Liddicoat, M.J. Flynn, Parallel square and cube computations, Proceedings of the 34th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2000, pp. 1325–1329.
[24] M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with RNS, Proceedings of the 46th IEEE Midwest Symposium on Circuits and Systems, vol. II, 2003, pp. 806–810.
[25] J.C. Bajard, L.S. Didier, P. Kornerup, An RNS Montgomery modular multiplication algorithm, IEEE Trans. Comput. 47 (7) (1998) 766–776.
[26] J.C. Bajard, L. Imbert, A full RNS implementation of RSA, IEEE Trans. Comput. 53 (6) (2004) 769–774.
[27] R. Zimmermann, et al., A 177 Mb/s VLSI implementation of the international data encryption algorithm, IEEE J. Solid-State Circuits 29 (3) (1994) 303–307.
[28] U. Meyer-Base, A. Garcia, F. Taylor, Implementation of a communications channelizer using FPGAs and RNS arithmetic, J. VLSI Signal Process. 28 (1–2) (2001) 115–128.
[29] S.J. Piestrak, Design of squarers modulo A with low-level pipelining, IEEE Trans. Comput. 49 (1) (2002) 31–41.
[30] P.B. Rao, A. Skavantzos, ROM based methods for computing the squaring operation in modular rings, J. VLSI Signal Process. 7 (3) (1994) 199–211.
[31] B. Cao, T. Srikanthan, C.-H. Chang, A new design method to modulo $2^n - 1$ squaring, in: Proceedings of the IEEE International Symposium on Circuits and Systems, May 2005, pp. 664–667.
[32] H.T. Vergos, C. Efstathiou, Efficient modulo $2^k + 1$ squarers, in: Proceedings of the XXI Conference on Design of Circuits and Integrated Systems, November 2006.
[33] R. Muralidharan, C.H. Chang, C.C. Jong, A low complexity modulo $2^n + 1$ squarer design, in: Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, 2008, pp. 1296–1299.
[34] H.T. Vergos, C. Efstathiou, Diminished-1 modulo $2^n + 1$ squarer design, IEE Proc. Comput. Digital Tech. 152 (5) (2005) 561–566.
[35] A. Spyrou, D. Bakalis, H.T. Vergos, Efficient architectures for modulo $2^n - 1$ squarers, in: Proceedings of the 16th IEEE International Conference on Digital Signal Processing (DSP 2009), July 2009, pp. 1–6.
[36] A. Strollo, D. Caro, Booth folding encoding for high performance squarer circuits, IEEE Trans. Circuits Syst. II 50 (5) (2003) 250–254.
[37] H.T. Vergos, D. Bakalis, On implementing efficient modulo $2^n + 1$ arithmetic components, J. Circuits Syst. Comput. 19 (5) (2010) 911–930.
[38] L. Dadda, Some schemes for parallel multipliers, Alta Frequenza 34 (1965) 349–356.
[39] A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput. 42 (10) (1993) 1163–1170.
[40] Faraday Technology Corp., 90 nm Standard Cell, Faraday ASIC Cell Library FSD0A_A, September 2004.
Dimitris Bakalis received the Diploma degree in 1995, the M.Sc. degree in 2000 and the Ph.D. degree in 2001 in Computer Engineering, all from the Department of Computer Engineering and Informatics at the University of Patras in Greece. He currently holds a Lecturer position in the Physics Department at the same university. His main research interests include VLSI design and test, digital system design and test, embedded systems, computer arithmetic, low power design and test.
Haridimos Vergos received his Diploma in Computer Engineering in 1991, and his Ph.D. in 1996 from the Department of Computer Engineering and Informatics, University of Patras, Greece, where he currently holds an Associate Professor position. He was a member of Atmel Multimedia & Communications Group and worked on the development of the first worldwide IEEE 802.11 compliant wireless MAC processor. His research interests include computer arithmetic and architecture, dependable system architectures and low power design and test.
Anastasia Spyrou received her Diploma Degree in Computer Engineering and Informatics in 2007 and the Masters Degree in Integrated Software and Hardware Systems in 2009 from the University of Patras, Greece. Her research interests focus on the area of VLSI design.