Reconfigurable Computing - Multipliers: Options in Circuit Design


John Morris

Chung-Ang University

The University of Auckland

‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia

Multipliers

‘Long’ multiplication

      x x x x      multiplicand
    × x x x x      multiplier
    ---------
      x x x x      partial products
    x x x x
  x x x x
x x x x
-------------
x x x x x x x      product

In binary, the partial products are trivial:

if multiplier bit = 1, copy the multiplicand

else 0

Use an ‘and’ gate!

Multipliers ‘Long’ multiplication

         a3 a2 a1 a0
       × b3 b2 b1 b0
       -------------
         x  x  x  x
      x  x  x  x
   x  x  x  x
x  x  x  x
--------------------
x  x  x  x  x  x  x


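[Figure: AND gates form the first row of partial products by combining multiplier bit b0 with a3, a2, a1 and a0; bits b1, b2 and b3 gate the following rows.]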
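The AND-gate rule can be checked with a small software model. This is only a sketch, not a circuit description: the function name `partial_products` and the default width `n = 4` are assumptions made for illustration.

```python
def partial_products(a, b, n=4):
    """Return the n partial-product rows of a * b, each shifted into place."""
    rows = []
    for j in range(n):
        b_j = (b >> j) & 1               # multiplier bit j
        row = 0
        for i in range(n):
            a_i = (a >> i) & 1           # multiplicand bit i
            row |= (a_i & b_j) << i      # one AND gate per partial-product bit
        rows.append(row << j)            # row j carries weight 2**j
    return rows


rows = partial_products(0b1011, 0b0110)
print(rows)        # rows for multiplier bits b0..b3
print(sum(rows))   # adding the rows gives the product
```

Summing the rows is the job of the adder array discussed next; here ordinary integer addition stands in for it.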

Multipliers – Simple binary multiplier

We can add the partial products with FA blocks

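[Figure: array multiplier. Rows of full adders (FA) add the partial products of a3 a2 a1 a0 with b0, b1 and b2; a 0 feeds the first carry input, and the product bits p0, p1, … emerge from the array.]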
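A bit-level model of this array may help. The sketch below is an assumption-laden illustration (the names `full_adder` and `array_multiply` are hypothetical): each row is a ripple chain of full adders that adds one partial-product row to the running sum.

```python
def full_adder(a, b, cin):
    """Add three bits; return (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(a, b, n=4):
    """Multiply two n-bit numbers with n ripple rows of full adders."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    acc = [0] * (2 * n)                    # running sum, least significant bit first
    for j in range(n):                     # one row of FAs per multiplier bit
        carry = 0                          # the 0 fed into the first FA of the row
        for i in range(n):
            pp = a_bits[i] & b_bits[j]     # partial-product bit from an AND gate
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
        acc[j + n] = carry                 # carry out of the row becomes the next sum bit
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(13, 11))
```

In this model the carry ripples along every row, which is what makes the worst-case delay of the simple array long.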

Parallel Array Adder - VHDL

We can build this adder in VHDL with two GENERATE loops

rows : FOR j IN 0 TO n-1 GENERATE        -- For each row
  cols : FOR k IN 0 TO n-1 GENERATE      -- Generate a row
    pjk : full_adder PORT MAP ( … );
  END GENERATE cols;
END GENERATE rows;

This part is straightforward!

TYPE bit_row  IS ARRAY (0 TO n-1) OF std_logic;
TYPE bit_grid IS ARRAY (0 TO n-1) OF bit_row;
SIGNAL pa, pb, cout : bit_grid;

… but you need to fill in the PORT MAP using internal signals!

Multipliers – Adding partial products


[Figure: the same full adder (FA) array, with inputs a3 a2 a1 a0 and b0, b1, b2, a 0 carry input, and product bits p0, p1, …]

Optimization 1: Replace this row of FAs

Time? What’s the worst case propagation delay?

Multipliers – Using carry save adders


[Figure: the same full adder (FA) array, with inputs a3 a2 a1 a0 and b0, b1, b2, a 0 carry input, and product bits p0, p1, …]

Try to use a more efficient adder in each row?

A simpler scheme uses a ‘carry save’ adder, which pushes the carry outs down to the next row!

Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above!

(Figure label: Carry select adder)
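The carry-save idea can be sketched in software by treating Python integers as bit vectors. The names `carry_save_add` and `carry_save_multiply` are hypothetical, and the final `s + c` merely stands in for whichever carry-propagate adder sits below the last row.

```python
def carry_save_add(x, y, z):
    """One row of independent full adders: three words in, (sum, carry) out."""
    s = x ^ y ^ z                                # sum bits, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # carry outs, pushed one column left
    return s, c


def carry_save_multiply(a, b, n=4):
    """Accumulate the partial products as a (sum, carry) pair."""
    s, c = 0, 0
    for j in range(n):
        pp = (a << j) if (b >> j) & 1 else 0     # partial product for multiplier bit j
        s, c = carry_save_add(s, c, pp)          # carries are saved for the next row
    return s + c                                 # the extra adder below the last row


print(carry_save_multiply(13, 11))
```

Only the last addition propagates carries; every earlier row settles in one full-adder delay.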

Multipliers - Tree

Chris Wallace discovered a way to build fast multipliers by reducing the number of carry propagations – and thus the delay

All the partial product bits can be generated directly from the operand bits

A full adder adds 3 input bits to produce a 2 bit result

Use it to add the bits in columns

Produce pairs of ‘first level’ sums

Combine bits in these sums vertically again

[Figure: dot diagram of the partial-product bits. Combine pp bits vertically, 3 at a time. First level results: pairs of bits from FA cells.]

Multipliers - Tree

Summing the partial products

[Figure: dot diagram of the partial-product rows.]

So combine them vertically!

[Figure: dot diagram of the first level results.]
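The column-wise reduction can be modelled as follows. This is a simplified sketch rather than Wallace's exact construction: it uses full adders only (no half adders), and the names `wallace_multiply` and `levels` are assumptions. Each pass replaces every group of three bits in a column with a sum bit in that column and a carry bit in the next.

```python
from collections import defaultdict


def full_adder(a, b, cin):
    """Add three bits; return (sum bit, carry out)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def wallace_multiply(a, b, n=4):
    """Return (product, number of full-adder levels used)."""
    cols = defaultdict(list)                 # column k holds all bits of weight 2**k
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    levels = 0
    while any(len(bits) > 2 for bits in cols.values()):
        nxt = defaultdict(list)
        for k, bits in cols.items():
            while len(bits) >= 3:            # combine bits vertically, 3 at a time
                s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                nxt[k].append(s)             # sum bit stays in the column
                nxt[k + 1].append(c)         # carry bit moves one column left
            nxt[k].extend(bits)              # one or two leftover bits pass through
        cols = nxt
        levels += 1

    # At most two bits per column remain: one carry-propagate addition finishes.
    row0 = sum(bits[0] << k for k, bits in cols.items() if len(bits) > 0)
    row1 = sum(bits[1] << k for k, bits in cols.items() if len(bits) > 1)
    return row0 + row1, levels


product, levels = wallace_multiply(13, 11)
print(product, levels)
```

The point of the model is that carries only propagate in the final two-row addition; the levels before it each cost a single full-adder delay.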

Signed digit arithmetic – Avoiding the carries!

Terminology

First, we need to distinguish carefully between digits of a number and bits used in representing the number

In the standard binary representations, one bit is used to represent each binary digit (0 or 1) of a number

However, we can use other representation schemes …

If we use more than one bit to represent each digit of an operand, then we have a redundant system

We’re using more bits than the minimum, $\log_2 n$, needed to represent a number of magnitude n

These redundant number systems generally have the ability to avoid carry propagation

This may be exploited in the addition of sequences of numbers

Carries are transferred to the following addition

Concept similar to that used in the carry-save multiplier, where carries are transferred to the following partial product addition

Booth Recoding

A binary number can be re-coded according to Booth’s scheme to reduce the number of partial products in a multiplier

Original idea

Early computers: shift much faster than add

Observe that when there is a 0 in the multiplier, you can skip the addition and just shift the multiplicand

In a synchronous computer, this doesn’t help: in the worst case, you still have to perform an add for each digit of the multiplier (all or most of them are 1’s)

but in an asynchronous computer, the ability to skip some additions reduces the average completion time

Booth observed that when there is a long sequence of 1s, eg digits j through (down to) k are 1s, then

$2^j + 2^{j-1} + \dots + 2^{k+1} + 2^k = 2^{j+1} - 2^k$

Booth Recoding


Booth recoding

Booth observed that when there is a long sequence of 1s, eg digits j through (down to) k are 1s, then

$2^j + 2^{j-1} + \dots + 2^{k+1} + 2^k = 2^{j+1} - 2^k$

Thus the sequence of additions can be replaced by

• an addition of the multiplicand shifted by j+1 positions and
• a subtraction of the multiplicand shifted by k positions

This is equivalent to recoding the multiplier from a representation using {0,1} to one using {-1,0,1}, corresponding to subtract, skip, add

The recoding can be done in O(1) time by inspecting neighbouring digits
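A quick numerical check of Booth's observation, as a sketch (the helper name `times_run_of_ones` is hypothetical): multiplying by a run of 1s in bit positions j down to k needs only one shifted addition and one shifted subtraction.

```python
def times_run_of_ones(a, j, k):
    """Multiply a by a multiplier whose bits j down to k are all 1 (j >= k)."""
    return (a << (j + 1)) - (a << k)    # add A shifted by j+1, subtract A shifted by k


a = 13
multiplier = 0b111100                   # bits 5 down to 2 are 1s
print(times_run_of_ones(a, 5, 2), a * multiplier)
```

Both printed values agree, whereas the straightforward method would perform four additions for this multiplier.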

Booth Recoding

Booth’s scheme: Radix-2 Booth recoding

For each position, j, inspect $x_j$ and $x_{j-1}$ to determine the bits (2 needed!) of $y_j$

x_j   x_{j-1}   y_j   Note
0     0          0    No 1’s
0     1          1    End of a string of 1’s - add
1     0         -1    Start of a string of 1’s - subtract
1     1          0    Middle of a string of 1’s - skip

Example

x:   1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0  (0)
y:  -1  0  1  0  0 -1  1  0 -1  1 -1  1  0  0 -1  0

In practice, this scheme is no use in a synchronous machine

Worst case: sequence of alternating 0 1

More additions than necessary!

but if we use a higher radix Booth recoding …
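The radix-2 table reduces to y_j = x_{j-1} - x_j, which the following sketch applies to the slide's example. The function names are hypothetical. Note that the recoded digits evaluate to the two's-complement value of x, so a leading 1 makes the result negative.

```python
def booth_radix2(x_bits):
    """Recode a list of bits (most significant first) into digits in {-1, 0, 1}."""
    padded = x_bits + [0]                          # the implied x_{-1} = 0
    return [padded[i + 1] - padded[i] for i in range(len(x_bits))]


def digits_value(digits, radix=2):
    """Evaluate a most-significant-first digit list in the given radix."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


x = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
y = booth_radix2(x)
print(y)
print(digits_value(y))    # two's-complement value of the 16-bit pattern x
```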

Higher Radix Multiplication

Radix-2 multiplier

Use 1 bit of the multiplier at a time

Form partial product with AND gates

Radix-4 multiplier

Use 2 bits of the multiplier at a time

If A is the multiplicand ..

Multiplier bits   Operation
00                none
01                +A
10                +2A (shift A)
11                +3A (precompute A+2A?)

Radix-4 Booth recoding …
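A minimal sketch of the plain radix-4 scheme in the table, assuming unsigned operands and a hypothetical function name `radix4_multiply`. The multiple 3A is precomputed once, so each step only selects one of four values.

```python
def radix4_multiply(a, b, n=8):
    """Multiply unsigned a by the n-bit unsigned multiplier b, 2 bits per step."""
    multiples = [0, a, a << 1, a + (a << 1)]     # none, +A, +2A (shift), +3A (precomputed)
    product = 0
    for step in range((n + 1) // 2):             # half as many steps as radix 2
        digit = (b >> (2 * step)) & 0b11         # next two multiplier bits
        product += multiples[digit] << (2 * step)
    return product


print(radix4_multiply(200, 123))
```

The extra addition needed to form 3A is the cost that Booth recoding removes.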

Radix-4 Booth Recoding

Recode multiplier into a signed digit form

Use 3 bits of the original multiplier at a time

Recoded multiplier has half the number of digits, but each digit is in [-2,2]

Operands to the adders are now formed by shifts alone

Recode: constant time

Partial products: shift, and, select

n/2 partial products generated

Potentially 2× speed!

x_{2j+1}  x_{2j}  x_{2j-1}   y_j   Operation   Note
0         0       0           0                No 1’s
0         0       1           1    +A          End of 1’s string
0         1       0           1    +A          Isolated 1
0         1       1           2    +2A         End of 1’s string
1         0       0          -2    -2A         Beginning of 1’s
1         0       1          -1    -A          End one string, start new one
1         1       0          -1    -A          Start of 1’s string
1         1       1           0                Middle of 1’s
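The recoding table can be exercised with a short sketch. The names `RADIX4_BOOTH`, `booth_radix4` and `booth_multiply` are hypothetical, and the multiplier is assumed to be an n-bit two's-complement pattern with n even.

```python
# (x_{2j+1}, x_{2j}, x_{2j-1}) -> y_j, copied from the table above
RADIX4_BOOTH = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4(x, n):
    """Recode the n-bit pattern x into n/2 digits in [-2, 2], least significant first."""
    def bit(p):
        return (x >> p) & 1 if p >= 0 else 0     # x_{-1} = 0
    return [RADIX4_BOOTH[(bit(2 * j + 1), bit(2 * j), bit(2 * j - 1))]
            for j in range(n // 2)]


def booth_multiply(a, x, n):
    """Multiply a by the two's-complement value of the n-bit pattern x."""
    # Each digit selects 0, +/-A or +/-2A; the operand is formed by shifts and negation.
    return sum(d * (a << (2 * j)) for j, d in enumerate(booth_radix4(x, n)))


print(booth_radix4(0b11111010, 8))           # the pattern for -6 in 8 bits
print(booth_multiply(13, 0b11111010, 8))     # 13 * (-6)
```

Only n/2 partial products are formed, and none of them needs the 3A multiple.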

No carries at all?

Residue Number Systems

Residue Arithmetic Residue Number Systems

A verse by the Chinese scholar, Sun Tsu, over 1500 years ago posed this problem

What number has remainders 2, 3 and 2 when divided by the numbers 7, 5 and 3, respectively?

This is probably the first documented use of number representations using multiple residues

In a residue number system, a number, x, is represented by the list of its residues (remainders) with respect to k pairwise relatively prime moduli,

$m_{k-1}, m_{k-2}, \dots, m_0$

Thus x is represented by

$(x_{k-1}, x_{k-2}, \dots, x_0)$ where $x_i = x \bmod m_i$

So the puzzle may be re-written:

What is the decimal representation of (2,3,2) in RNS(7,5,3)?
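The puzzle is small enough to settle by exhaustive search. The sketch below uses hypothetical helper names and simply tries every value in the range 0 to 7 × 5 × 3 - 1.

```python
import math


def to_rns(x, moduli):
    """Residues of x with respect to each modulus, in the order given."""
    return tuple(x % m for m in moduli)


def solve_by_search(residues, moduli):
    """All x in [0, M) whose residues match; M is the product of the moduli."""
    M = math.prod(moduli)
    return [x for x in range(M) if to_rns(x, moduli) == tuple(residues)]


print(solve_by_search((2, 3, 2), (7, 5, 3)))
```

Because the moduli are pairwise relatively prime, exactly one value in that range matches.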

Residue Number Systems

The dynamic range of an RNS is

$M = m_{k-1} \times m_{k-2} \times \dots \times m_0$

For example, in the system RNS(8,7,5,3), $M = 8 \times 7 \times 5 \times 3 = 840$

Thus we have

RNS(8,7,5,3)    Decimal
(0,0,0,0)       0 or 840 or -840 or …
(1,1,1,1)       1 or 841 or -839 or …
(2,2,2,2)       2 or 842 or …
(0,1,3,2)       8 or 848 or …

Any RNS can be viewed as a weighted representation

In RNS(8,7,5,3), the weights are:

105 120 336 280

Thus (1,2,4,0) represents

$\langle 105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0 \rangle_{840} = \langle 1689 \rangle_{840} = 9$
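The table and the weighted form can be reproduced with a few lines. This is a sketch with hypothetical names; the weights are taken directly from the slide rather than derived.

```python
MODULI = (8, 7, 5, 3)
WEIGHTS = (105, 120, 336, 280)      # position weights quoted for RNS(8,7,5,3)
M = 8 * 7 * 5 * 3                   # dynamic range, 840


def to_rns(x):
    """Residues of x for RNS(8,7,5,3); Python's % is non-negative for negative x."""
    return tuple(x % m for m in MODULI)


def weighted_value(digits):
    """Evaluate an RNS tuple as a weighted sum, reduced modulo M."""
    return sum(w * d for w, d in zip(WEIGHTS, digits)) % M


for x in (0, 840, -840, 1, 841, -839, 8, 848):
    print(x, to_rns(x))
print(weighted_value((1, 2, 4, 0)))
```

Values that differ by a multiple of 840 share one representation, which is why the dynamic range is limited to M distinct values.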

Residue Number Systems - Operations

Complement

To find –x, complement each of the digits with respect to the modulus for that digit (a zero digit stays zero)

21 = (5,0,1,0) so

-21 = (8-5, 0, 5-1, 0) = (3,0,4,0)

Addition or subtraction is performed on each digit

$(5, 5, 0, 2)_{RNS} = 5_{10}$

$(7, 6, 4, 2)_{RNS} = -1_{10}$

$(\langle 5+7 \rangle_8 = 4,\ \langle 5+6 \rangle_7 = 4,\ 4,\ \langle 2+2 \rangle_3 = 1)_{RNS} = 4_{10}$

$(4, 4, 4, 1)_{RNS} = 4_{10}$

Multiplication is also achieved by operations on each digit

$(5, 5, 0, 2)_{RNS} = 5_{10}$

$(7, 6, 4, 2)_{RNS} = -1_{10}$

$(\langle 5 \times 7 \rangle_8 = 3,\ \langle 5 \times 6 \rangle_7 = 2,\ 0,\ \langle 2 \times 2 \rangle_3 = 1)_{RNS} = -5_{10}$

$(3, 2, 0, 1)_{RNS} = -5_{10}$
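The digit-wise operations can be reproduced with a short sketch for RNS(8,7,5,3). The function names are hypothetical; each operation touches every digit independently, with no carries between digits.

```python
MODULI = (8, 7, 5, 3)


def to_rns(x):
    return tuple(x % m for m in MODULI)


def rns_neg(x):
    """Complement each digit with respect to its own modulus (0 stays 0)."""
    return tuple((m - d) % m for d, m in zip(x, MODULI))


def rns_add(x, y):
    return tuple((a + b) % m for a, b, m in zip(x, y, MODULI))


def rns_mul(x, y):
    return tuple((a * b) % m for a, b, m in zip(x, y, MODULI))


five, minus_one = to_rns(5), to_rns(-1)
print(rns_neg(to_rns(21)))        # representation of -21
print(rns_add(five, minus_one))   # representation of 4
print(rns_mul(five, minus_one))   # representation of -5
```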

Residue Arithmetic - Advantages

Parallel independent operations on small numbers of digits

Significant speed ups, especially for multiplication!

4 bit x 4 bit multiplier (moduli up to 15) much simpler than 16 bit x 16 bit one

Carries are strictly confined to small numbers of bits

Each modulus is only a small number of bits

Can be implemented in Look Up Tables (LUTs)

6 bit residues (moduli up to 64): 64 x 64 x 6 bits required (<4Kbytes)

Residue Arithmetic – Choosing the moduli

Largest modulus determines the overall speed, so try to make it as small as possible

Simple strategy

Choose sequence of prime numbers until the dynamic range, M, becomes large enough

eg Application requires a range of at least $10^5$, ie $M \ge 10^5$

For RNS(13,11,7,5,3,2), M = 30,030

Range is too low, so add one more modulus: RNS(17,13,11,7,5,3,2), M = 510,510

Now

• each modulus requires a separate circuit and
• our range is now ~5 times as large as needed, so remove 5:

RNS(17,13,11,7,3,2), M = 102,102

Six residues, requiring 5 + 4 + 4 + 3 + 2 + 1 = 19 bits

The largest modulus (17 requiring 5 bits) determines the speed, so …

Residue Arithmetic – Choosing the moduli

Application requires a range of at least $10^5$, ie $M \ge 10^5$

… RNS(17,13,11,7,3,2), M = 102,102

Six residues, requiring 5 + 4 + 4 + 3 + 2 + 1 = 19 bits

The largest modulus (17 requiring 5 bits) determines the speed, so combine some of the smaller moduli

(Remember the requirement is that they be relatively prime!)

Try to produce the largest modulus using only 5 bits: pair 2 and 13, 3 and 7

RNS(26,21,17,11), M = 102,102

Four residues, requiring 5 + 5 + 5 + 4 = 19 bits

(no improvement in total bit count, but 2 fewer ALUs!)

Better …?



… RNS(26,21,17,11), M = 102,102

Four residues, requiring 5 + 5 + 5 + 4 = 19 bits

(no improvement in total bit count, but 2 fewer ALUs!)

Include powers of smaller primes before primes, starting with

RNS(3,2), M = 6

Note that $2^2$ is smaller than the next prime, 5, so move to

RNS($2^2$,3), M = 12 (trying to minimize the size of the largest modulus)

After including 5 and 7, note that $2^3$ and $3^2$ are smaller than 11:

RNS($3^2$,$2^3$,7,5), M = 2,520

Add 11: RNS(11,$3^2$,$2^3$,7,5), M = 27,720

Add 13: RNS(13,11,$3^2$,$2^3$,7,5), M = 360,360



… Add 13: RNS(13,11,$3^2$,$2^3$,7,5), M = 360,360

M is now more than 3× larger than needed, so replace 9 with 3, then combine 5 and 3:

RNS(15,13,11,$2^3$,7), M = 120,120

5 moduli, 4 + 4 + 4 + 3 + 3 = 18 bits, largest modulus has 4 bits

You can actually do somewhat better than this!

Reference:

B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000
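The candidate moduli sets discussed above can be compared mechanically. The sketch below uses a hypothetical helper, `describe`, which returns the dynamic range, the total number of residue bits and the width of the widest residue; it assumes a residue for modulus m needs enough bits to hold m - 1.

```python
import math
from itertools import combinations


def describe(moduli):
    """Return (M, total bits, bits of the largest modulus) for a moduli set."""
    for p, q in combinations(moduli, 2):
        if math.gcd(p, q) != 1:
            raise ValueError(f"{p} and {q} are not relatively prime")
    M = math.prod(moduli)
    bits = [(m - 1).bit_length() for m in moduli]   # residues range over 0 .. m-1
    return M, sum(bits), max(bits)


for moduli in [(13, 11, 7, 5, 3, 2),
               (17, 13, 11, 7, 5, 3, 2),
               (17, 13, 11, 7, 3, 2),
               (26, 21, 17, 11),
               (13, 11, 9, 8, 7, 5),
               (15, 13, 11, 8, 7)]:
    print(moduli, describe(moduli))
```

For the last set the product is 15 × 13 × 11 × 8 × 7 = 120,120, which still exceeds the required $10^5$ while every residue fits in 4 bits.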

Residue Numbers - Conversion

Inputs and outputs will invariably be in standard binary or decimal representations, so conversion to and from them is required

Conversion from binary | decimal to RNS

Problem: Given a number, y, find its residues wrt moduli, $m_i$

Divisions would be too time-consuming!

Use this equality:

$\langle (y_{k-1} y_{k-2} \dots y_1 y_0)_2 \rangle_{m_i} = \langle \langle 2^{k-1} y_{k-1} \rangle_{m_i} + \dots + \langle 2 y_1 \rangle_{m_i} + \langle y_0 \rangle_{m_i} \rangle_{m_i}$

So we only need to precompute the residues $\langle 2^j \rangle_{m_i}$ for each of the moduli, $m_i$, used by the RNS


j    2^j    <2^j>_7   <2^j>_5   <2^j>_3
0    1      1         1         1
1    2      2         2         2
2    4      4         4         1
3    8      1         3         2
4    16     2         1         1
5    32     4         2         2
6    64     1         4         1
7    128    2         3         2
8    256    4         1         1
9    512    1         2         2

For RNS(8,7,5,3):

• $\langle y \rangle_8$ is trivially calculated (3 LSB bits)

• For 7, 5 and 3, we need the powers of 2 modulo 7, 5 and 3



Find $164_{10} = 1010\,0100_2 = 2^7 + 2^5 + 2^2$ in RNS(8,7,5,3):

• $\langle 164 \rangle_8$ is $100_2 = 4_{10}$

• $\langle 164 \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = \langle 10 \rangle_7 = 3$

Note that the additions are done in a modular adder!

Worst case: k additions for each residue for a k-bit number
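The table-driven conversion can be sketched as follows, assuming a 10-bit input to match the ten table rows. The names `POW2`, `mod_add` and `to_rns` are hypothetical; `mod_add` plays the role of the modular adder and no division of the input is performed.

```python
MODULI = (8, 7, 5, 3)
K = 10                                             # input width in bits (assumed)

# Precomputed residues of 2**j for each modulus, j = 0 .. K-1
POW2 = {m: [pow(2, j, m) for j in range(K)] for m in MODULI}


def mod_add(a, b, m):
    """Modular adder for operands already reduced modulo m."""
    s = a + b
    return s - m if s >= m else s


def to_rns(y):
    """Convert a K-bit number to RNS(8,7,5,3) using only table lookups and modular adds."""
    if not 0 <= y < 2 ** K:
        raise ValueError("input does not fit in K bits")
    residues = []
    for m in MODULI:
        r = 0
        for j in range(K):
            if (y >> j) & 1:                       # add the table entry for each 1 bit
                r = mod_add(r, POW2[m][j], m)
        residues.append(r)
    return tuple(residues)


print(POW2[7])
print(to_rns(164))
```

For the modulus 8 the table entries for j ≥ 3 are zero, which matches the remark that only the three least significant bits matter.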


Conversion from RNS to binary

Digits of an RNS representation can be shown to have position weightings, eg for RNS(8,7,5,3) the weightings are

105 120 336 280

The weightings may be calculated using the Chinese Remainder Theorem

$x = (x_{k-1} x_{k-2} \dots x_1 x_0)_{RNS} = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M$

where $M_i = M / m_i$ and $\alpha_i = \langle M_i^{-1} \rangle_{m_i}$ is the multiplicative inverse of $M_i$ wrt $m_i$

This means that

$(x_3, x_2, x_1, x_0)_{RNS} = \langle x_3 \times 105 + x_2 \times 120 + x_1 \times 336 + x_0 \times 280 \rangle_{840}$


Conversion from RNS to binary

Digits of an RNS representation can be shown to have position weightings, eg for RNS(8,7,5,3) the weightings are

105 120 336 280

Calculate position weights with CRT …

This means that

$(x_3, x_2, x_1, x_0)_{RNS} = \langle x_3 \times 105 + x_2 \times 120 + x_1 \times 336 + x_0 \times 280 \rangle_{840}$

This is most efficiently done through a LUT

Note that the table for RNS(8,7,5,3) requires only 8 + 7 + 5 + 3 = 23 entries

In general, this requires only

$\sum_{i=0}^{k-1} m_i$

words: a reasonable number!
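The look-up-table conversion can be sketched like this. The names `build_lut` and `from_rns` are hypothetical, and the three-argument `pow` with exponent -1 (Python 3.8 or later) supplies the multiplicative inverse. The table holds one word per possible digit value of each modulus.

```python
import math


def build_lut(moduli):
    """Return (M, tables); tables[i][d] is the contribution of digit value d at position i."""
    M = math.prod(moduli)
    tables = []
    for m in moduli:
        Mi = M // m
        alpha = pow(Mi, -1, m)                     # multiplicative inverse of Mi modulo m
        tables.append([(Mi * ((alpha * d) % m)) % M for d in range(m)])
    return M, tables


def from_rns(digits, M, tables):
    """Sum the looked-up contributions modulo M."""
    total = 0
    for d, table in zip(digits, tables):
        total = (total + table[d]) % M
    return total


M, tables = build_lut((8, 7, 5, 3))
print(sum(len(t) for t in tables))         # number of table entries
print([t[1] for t in tables])              # position weights (contribution of digit 1)
print(from_rns((1, 2, 4, 0), M, tables))
```

The final result needs only k look-ups and k modular additions, one per residue digit.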

Residue Arithmetic - Disadvantages

Range is limited

Division is hard!

Comparison (<, >) and sign detection (<0?) are hard

Still suitable for some DSP applications

Only use +, x

Range is limited

Result range is known

Examples: digital filters, Fourier transforms
