Post on 06-Jul-2018
8/16/2019 A 13.3ns Double-precision Floating-point ALU and Multiplier
A 13.3ns Double-precision Floating-point ALU and Multiplier
H. Yamada, T. Hotta†, T. Nishiyama, F. Murabayashi, T. Yamauchi, and H. Sawamoto
General Purpose Computer Division, Hitachi Ltd. †Hitachi Research Laboratory, Hitachi Ltd.
1 Horiyamashita, Hadano City, Kanagawa Prefecture, 259-13 Japan
Abstract
One-bit pre-shifting before alignment shift, normalization with anticipated leading '1' bit, and pre-rounding techniques have been developed for a floating-point arithmetic logic unit (ALU). In addition, carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A noise tolerant precharge (NTP) circuit was designed and applied to the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 µm, 2.5 V, four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.
1. Introduction
Scientific and engineering applications demand exceptionally high floating-point performance, which in turn requires high-speed floating-point ALUs and multipliers to reduce execution time. In recent years a number of high-speed floating-point execution units have been presented [1]-[6].
A floating-point ALU and multiplier were designed, each capable of 13.3 ns execution. The ALU and multiplier can each produce a result at a one-cycle pipeline pitch, achieving a peak execution rate of 300 MFLOPS at 150 MHz. The units are in full compliance with the IEEE Standard for Binary Floating-Point Arithmetic (Std. 754-1985) [7].
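The headline figures follow from the clock rate alone. The short sketch below redoes that arithmetic; the function name and its parameters are illustrative, and the factor of two for the peak rate assumes that the ALU and the multiplier each retire one result per cycle, as stated above.

```python
def pipeline_timing(clock_hz=150e6, latency_cycles=2, units=2):
    """Return (cycle time in ns, latency in ns, peak MFLOPS).

    units=2 models the ALU and the multiplier each delivering one
    result per cycle (an assumption taken from the text above).
    """
    cycle_ns = 1e9 / clock_hz
    latency_ns = latency_cycles * cycle_ns
    peak_mflops = units * clock_hz / 1e6
    return cycle_ns, latency_ns, peak_mflops


cycle_ns, latency_ns, peak_mflops = pipeline_timing()
print(f"{cycle_ns:.2f} ns cycle, {latency_ns:.1f} ns latency, "
      f"{peak_mflops:.0f} MFLOPS peak")
```

Two cycles of about 6.67 ns give the 13.3 ns latency, and two results per cycle at 150 MHz give the 300 MFLOPS peak.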
The ALU performs add, subtract, compare, convert to smaller/larger floating-point precision, and convert floating-point to/from integer instructions for both double- and single-precision operands. The ALU can produce a denormalized number without requiring an additional cycle.
The multiplier performs floating-point multiplication for both double- and single-precision operands and integer multiplication for single-precision operands. The multiplier is unable to produce a denormalized number, but it can optionally generate a correctly signed zero instead of a denormalized number to avoid the performance loss caused by a trap.
To accomplish the 13.3 ns execution time, these execution units were designed with several new arithmetic and circuit techniques and fabricated with the most advanced silicon technology. This paper describes the arithmetic and circuit techniques developed for the ALU and multiplier.
2. ALU
A block diagram of the floating-point ALU is shown in Figure 1. It is a two-stage pipelined machine. In the first stage, the exponent of the larger operand is selected as the common exponent, and the fraction of the operand with the smaller exponent is shifted to the right by the alignment shifter. In the second stage, addition/subtraction of the fraction of the larger-exponent operand and the right-shifted fraction, as well as normalization, IEEE rounding, and correction of the common exponent, are performed.
Three arithmetic techniques are used in the ALU. The first is one-bit pre-shifting of both fractions in effective addition cases. This technique simplifies the rounding process. The second is normalization with the anticipated leading '1' bit of the addition/subtraction result. This normalization is fast even if the anticipated bit is wrong, because the incorrectly shifted fraction can be adjusted by a simple one-bit right shift. The third technique is pre-rounding, which prepares all possible rounded results in parallel with the addition/subtraction of the aligned fractions and selects the correct one using the leading '1' bit of the add/subtract result. With this technique, the rounding process is accelerated by 51%.
1063-6404/95 $4.00 © 1995 IEEE
Figure 1. ALU block diagram
(Inputs S1, E1, S2, E2, F1, F2; blocks include SWAP, the alignment shift controlled by the shift number (Ediff), and SELECT; outputs ST, ET, FT(1:51), FT(52). The figure marks where one-bit pre-shifting, normalization with anticipated leading '1' bit, and pre-rounding are applied.)
2.1 One-bit pre-shifting
When effective addition is performed, both operand fractions are first shifted right by one bit, and then the shifted fraction of the smaller-exponent operand is shifted right by the operand exponent difference (Ediff). The sum of the aligned fractions lies between 0.1 and 1.111... (represented in binary) and may exceed the IEEE format bit length. A normalization left shift of zero or one bit position and rounding are performed if necessary.
When effective subtraction is performed, the fraction of the smaller-exponent operand is shifted right by Ediff. If Ediff = 0 or 1, the subtraction result of the aligned fractions is less than or equal to 1, so a large normalization shift can be necessary. However, the normalized result already fits within the IEEE format bit length, so rounding is not performed. If Ediff > 1, the subtraction result lies between 0.1 and 1.111... (represented in binary) and may exceed the IEEE format bit length. In such cases, a normalization left shift of zero or one bit position and rounding are performed if necessary.
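The ranges quoted above can be checked with exact rational arithmetic. The following is a minimal sketch, not the hardware datapath: significands are modelled as 53-bit integers scaled by 2^52, and the helper names (`effective_add`, `effective_sub`, `normalization_left_shift`) are hypothetical. It covers effective addition with the one-bit pre-shift and effective subtraction with Ediff > 1.

```python
from fractions import Fraction

FRAC_BITS = 52  # double-precision fraction width (hidden bit excluded)


def to_value(sig):
    """Interpret a 53-bit integer significand as a value in [1, 2)."""
    return Fraction(sig, 1 << FRAC_BITS)


def effective_add(sig_large, sig_small, ediff):
    """Pre-shift both operands right by one bit, then align by Ediff."""
    a = to_value(sig_large) / 2                  # now in [0.5, 1)
    b = to_value(sig_small) / 2 / (1 << ediff)   # pre-shift, then align
    return a + b


def effective_sub(sig_large, sig_small, ediff):
    """No pre-shift: only the smaller operand is aligned by Ediff."""
    return to_value(sig_large) - to_value(sig_small) / (1 << ediff)


def normalization_left_shift(value):
    """Left shifts needed to bring the leading 1 just left of the point."""
    if value <= 0:
        raise ValueError("sketch handles positive results only")
    shift = 0
    while value < 1:
        value *= 2
        shift += 1
    return shift


ONE = 1 << FRAC_BITS               # significand 1.000...0
MAX = (1 << (FRAC_BITS + 1)) - 1   # significand 1.111...1

# Effective addition stays in [0.5, 2): a shift of 0 or 1 is enough.
print(normalization_left_shift(effective_add(ONE, ONE, 0)))
print(normalization_left_shift(effective_add(ONE, ONE, 60)))
# Effective subtraction with Ediff > 1 also stays above 0.5.
print(normalization_left_shift(effective_sub(ONE, MAX, 2)))
# With Ediff = 0 massive cancellation is possible.
print(normalization_left_shift(effective_sub(ONE + 1, ONE, 0)))
```

The last call illustrates why the Ediff = 0 or 1 case needs the large normalization shifter, while the other cases need at most a one-bit shift.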
2.2 Normalization with anticipated leading '1' bit
The normalization process consists of the following four steps:
1) leading '1' bit anticipation of the add/subtract result;
2) shift control signal generation by priority encoding of the leading '1' bit anticipation result;
3) left shift of the add/subtract result by the shift control signal;
4) one-bit right shift adjustment if the anticipated leading '1' bit is incorrect.
The algorithm used for the leading '1' bit anticipation is as follows. The leading '1' bit anticipation signal Z is

$$Z = z_0 z_1 z_2 \cdots z_i \cdots z_{52} \qquad (1)$$

where the i-th bit of signal Z is defined as

$$z_i = (a_{i-1} \oplus b_{i-1}) \oplus (a_i \mid b_i) \qquad (2)$$

and a_i and b_i are the i-th bits of the fractions to be added (0 ≤ i ≤ 52), with index 0 at the most significant end as in Table 1. In equation (2), ⊕ represents an EXCLUSIVE-OR and | represents an OR. Producing signal Z requires a maximum of 2 gate delays (2 EXCLUSIVE-ORs), which is far smaller than the 7-8 gate delays necessary for a carry lookahead adder. The leading '1' bit position of signal Z is equal to, or only one bit lower than, that of the add/subtract result. If the anticipated bit is wrong, the normalization shift is incorrect by one bit position and can be adjusted by a simple one-bit right shift. If the anticipated bit is correct, no further shifting is required. Table 1 shows examples of the leading '1' bit anticipation.
Table 1. Examples of leading '1' bit anticipation

(a) Correct anticipation
    A      01.0100011000111
    B      11.0001101010001
    Z       0.0111000011100
    (sum    0.0110000011000)
    shift number = 2 (adjustment shift = 0)

(b) Incorrect anticipation
    A      01.0110011000111
    B      11.0101101010001
    Z       0.0110000011100
    (sum    0.1100000011000)
    shift number = 2 (adjustment shift = 1)
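The four normalization steps can be traced on the Table 1 operands. The sketch below is an illustration under stated assumptions rather than the circuit itself: bit strings are written most significant bit first, each Z bit is formed from the EXCLUSIVE-OR of the operand bits one position to the left and the OR of the operand bits at the same position (an index convention chosen because it reproduces both rows of Table 1), and the function names are hypothetical.

```python
def anticipate_leading_one(a_bits, b_bits):
    """Return Z as a bit string, one bit shorter than the operands.

    z at position i is (a[i-1] ^ b[i-1]) ^ (a[i] | b[i]); the top operand
    bit only serves as the left neighbour of the first Z bit.
    """
    a = [int(c) for c in a_bits]
    b = [int(c) for c in b_bits]
    return "".join(
        str((a[i - 1] ^ b[i - 1]) ^ (a[i] | b[i])) for i in range(1, len(a))
    )


def normalize(a_bits, b_bits):
    """Steps 1-4: anticipate, priority-encode, left shift, adjust."""
    n = len(a_bits)
    # Operands are added modulo 2**n (B is in two's complement form in Table 1).
    total = (int(a_bits, 2) + int(b_bits, 2)) % (1 << n)
    sum_bits = format(total, f"0{n}b")[1:]        # align with Z
    z_bits = anticipate_leading_one(a_bits, b_bits)
    shift = z_bits.index("1")                     # priority encoder on Z
    # Raises ValueError for an all-zero sum; the sketch ignores that case.
    adjust = 1 if sum_bits.index("1") < shift else 0
    net = shift - adjust                          # left shift, then 1-bit right fix
    normalized = sum_bits[net:] + "0" * net
    return z_bits, sum_bits, shift, adjust, normalized


# Table 1(a): anticipation is correct, no adjustment needed.
print(normalize("010100011000111", "110001101010001"))
# Table 1(b): anticipation is one bit low, one-bit right adjustment.
print(normalize("010110011000111", "110101101010001"))
```

In both cases the priority encoder runs on Z, which is available after two gate levels, instead of waiting for the carry-propagated sum.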
2.3 Pre-rounding
Figure 2 shows the pre-rounding scheme. The pre-rounding process of the ALU calculations consists of four steps.
The first step increments the add/subtract result by one at the 52nd binary place. This increment is performed in parallel with the addition/subtraction, and its result is ignored if no carry arises from rounding. In the second step, three independent pre-roundings are performed for the three possible positions of the leading '1' bit (type 1, type 2, and type 3). Types 1, 2, and 3 are the cases in which the leading '1' bit is located one bit left, one bit right, and two or more bits right of the binary point, respectively. Bits 52 to 55 of the add/subtract result, the sign bit, and the rounding mode signals are used to calculate the three rounding carries and the three least significant bits of the rounded results in pre-rounders R0, R1, and R2. In the third step, the correct pre-rounded result is selected according to the two most significant bits of the add/subtract result. If the two bits are '10' or '11', the results of R0 are used. If the bits are '01', the results of R1 are used. Otherwise, the results of R2 are used. In the fourth step, the selected carry is used to select either the incremented result calculated in the first step or the add/subtract result.
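A behavioural sketch of the four steps is given below. It is a simplified model under several assumptions that are not taken from the paper: the add/subtract result is a 56-bit integer with bits s0 (most significant) to s55, the word is split into an upper part s0 to s51 and a low field s52 to s55, only round-to-nearest-even is modelled (no sign or rounding-mode inputs), and type 3 results are assumed exact so that R2 passes them through. `reference_round` is a plain rounding routine used only to check the selection logic.

```python
def reference_round(s, precision=53):
    """Round integer s to `precision` significant bits, nearest-even."""
    drop = max(s.bit_length() - precision, 0)
    if drop == 0:
        return s
    keep, rem = s >> drop, s & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rem > half or (rem == half and keep & 1):
        keep += 1
    return keep << drop


def pre_round(s):
    """Pre-rounding of a 56-bit add/subtract result (s < 2**56)."""
    # Step 1: the incremented upper part is prepared alongside the plain one.
    upper = s >> 4
    upper_inc = upper + 1
    low = s & 0xF
    s52, s53, s54, s55 = (low >> 3) & 1, (low >> 2) & 1, (low >> 1) & 1, low & 1

    # Step 2: three pre-rounders, each giving (carry into upper, low field).
    inc0 = s53 & (s54 | s55 | s52)           # type 1: LSB is s52
    r0 = (s52 & inc0, (s52 ^ inc0) << 3)
    inc1 = s54 & (s55 | s53)                 # type 2: LSB is s53
    two = ((s52 << 1) | s53) + inc1
    r1 = (two >> 2, (two & 3) << 2)
    r2 = (0, low)                            # type 3: assumed exact

    # Step 3: select by the two most significant bits of the result.
    top2 = s >> 54
    carry, low_out = r0 if top2 >= 2 else r1 if top2 == 1 else r2

    # Step 4: the selected carry picks the incremented or the plain upper part.
    return ((upper_inc if carry else upper) << 4) | low_out


s = (1 << 55) | 0b1100        # type 1, round bit set, sticky zero, odd LSB
print(pre_round(s) == reference_round(s))
```

Because the increment and the three candidate roundings do not depend on the final carry-propagated sum bits other than s52 to s55 and the top two bits, they can be prepared in parallel with the addition, which is the point of the scheme.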
Calculation of the two most significant bits of the add/subtract result, followed by the selection of the rounding carry signal, is one of the most critical paths, so the normalization shifters were intentionally removed from the critical path. In this way they can operate in parallel with the rounding carry calculation.
Figure 2. Pre-rounding scheme
(For add/subtract result bits s0, s1, s2, ..., s52 to s55, the figure lists the three cases type 1 (R0), type 2 (R1), and type 3 (R2), where L is the least significant bit, R the round bit, and S the sticky bit, and shows how the leading bits select R0, R1, or R2 to produce the rounding carry.)
3. Multiplier
A block diagram of the floating-point multiplier is shown in Figure 3. Like the ALU, it is a two-stage pipelined machine. In the first stage, one of the fractions is encoded using a radix-4 Booth algorithm. The twenty-seven 54-bit partial products generated are summed by the Wallace tree [8]. The partial product array utilizes a tree of 4-2 compressors rather than 3-2 full adders in order to reduce tree depth and to simplify layout. Exponent addition and rebiasing are also performed in the first stage. In the second stage, carry propagate addition of the partial product sum (in carry-save form), as well as normalization, IEEE rounding, and exponent correction, are performed.
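The count of twenty-seven partial products follows from radix-4 Booth recoding of a 53-bit significand. The sketch below shows the standard recoding on Python integers; it is a generic illustration, not the encoder circuit of this design, and the function names are hypothetical.

```python
# Digit for each overlapping bit triple (y[2k+1], y[2k], y[2k-1]).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_digits(y, nbits=53):
    """Recode unsigned y (< 2**nbits) into radix-4 digits in {-2..2}.

    Digits are returned least significant first. An nbits-wide unsigned
    operand needs (nbits + 2) // 2 digits, i.e. 27 for nbits = 53.
    """
    ndigits = (nbits + 2) // 2
    y2 = y << 1                      # implicit 0 below the least significant bit
    return [BOOTH_DIGIT[(y2 >> (2 * k)) & 0b111] for k in range(ndigits)]


def booth_multiply(x, y, nbits=53):
    """Sum the partial products digit * x * 4**k (the tree's job in hardware)."""
    digits = booth_radix4_digits(y, nbits)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))


x = (1 << 52) | 0xABCDE12345
y = (1 << 53) - 1
print(len(booth_radix4_digits(y)), booth_multiply(x, y) == x * y)
```

Each digit selects 0, ±x, or ±2x, so a partial product of a 53-bit significand needs 54 bits, which matches the 54-bit width quoted above.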
Two arithmetic techniques are used in the multiplier. The first involves splitting the Wallace tree sum and performing the additions of the upper 52 bits and the lower 54 bits in parallel. The second technique is pre-rounding, which is similar to that of the floating-point ALU.
Figure 3. Multiplier block diagram
(Inputs E1, E2, F1, F2; blocks include the radix-4 Booth encoder, partial products, carry select addition, and pre-rounding; outputs ET, FT(1:51), FT(52).)
3.1 Carry select addition
The partial product sum (in carry-save form) is divided into two pairs: one pairs the upper 52 bits and the other pairs the lower 54 bits. With-carry and without-carry sums are calculated for the upper 52 bits, and the correct sum is selected by the carry from the lower 54-bit sum. Addition of the lower pair is performed in parallel with the upper pair calculation, and the signals P (propagate carry from the most significant bit), L (least bit), G (guard bit), R (round bit), and S (sticky bit) are output.
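The selection can be modelled on integers as follows. This is a minimal sketch assuming the carry-save pair is held as two 106-bit integers, `s_vec` and `c_vec` (hypothetical names); it reproduces only the sum and does not model the P, L, G, R, and S outputs.

```python
LOW_BITS, HIGH_BITS = 54, 52   # split used for the 106-bit product


def carry_select_add(s_vec, c_vec):
    """Add a carry-save pair modulo 2**106 using carry selection."""
    mask_lo = (1 << LOW_BITS) - 1
    mask_hi = (1 << HIGH_BITS) - 1

    # Lower pair: ordinary addition; its carry-out is the select signal.
    lo = (s_vec & mask_lo) + (c_vec & mask_lo)
    carry = lo >> LOW_BITS

    # Upper pair: both candidates are computed without waiting for `carry`.
    hi_s = (s_vec >> LOW_BITS) & mask_hi
    hi_c = (c_vec >> LOW_BITS) & mask_hi
    hi_without = hi_s + hi_c
    hi_with = hi_s + hi_c + 1

    hi = hi_with if carry else hi_without
    return ((hi & mask_hi) << LOW_BITS) | (lo & mask_lo)


s_vec = (0x123456789ABCD << LOW_BITS) | ((1 << LOW_BITS) - 1)
c_vec = (0x0FEDCBA987654 << LOW_BITS) | 1
print(carry_select_add(s_vec, c_vec) == (s_vec + c_vec) % (1 << 106))
```

In hardware the three additions run concurrently, so the upper result is available one multiplexer delay after the lower carry-out instead of after a full 106-bit carry chain.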
3.2 Pre-rounding
The pre-rounding of multiplication results consists of three steps. In the first step, the rounding carries C0 and C1 and the rounded results L0, G0, and L1 are calculated. C0, L0, and G0 are the results when G is the least significant bit, and C1 and L1 are the results when L is the least significant bit. In the second step, the correct rounding carry and rounded L and G results are selected from these candidates. The G result has no meaning when L is the least significant bit. Finally, either the added result or the incremented result is selected by the selected rounding carry as the upper portion of the rounded result.
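The three steps can be sketched on a 106-bit product as follows. Several points are assumptions made for illustration and are not stated in the paper: L, G, and R are taken as bits 53, 52, and 51 of the product with S the OR of all lower bits, the candidate is selected by the most significant product bit, and only round-to-nearest-even is modelled. `reference_round` is a plain rounding routine used only as a check.

```python
def reference_round(p, precision=53):
    """Round integer p to `precision` significant bits, nearest-even."""
    drop = max(p.bit_length() - precision, 0)
    if drop == 0:
        return p
    keep, rem = p >> drop, p & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rem > half or (rem == half and keep & 1):
        keep += 1
    return keep << drop


def round_product(p):
    """Pre-round p, the product of two 53-bit significands (2**104 <= p < 2**106)."""
    upper = p >> 54                 # upper 52 bits from the carry select adder
    upper_inc = upper + 1           # incremented result, prepared in parallel
    L, G, R = (p >> 53) & 1, (p >> 52) & 1, (p >> 51) & 1
    S = int(p & ((1 << 51) - 1) != 0)

    # Step 1: both candidates.
    inc0 = R & (S | G)              # G is the least significant bit
    two = ((L << 1) | G) + inc0
    c0, l0, g0 = two >> 2, (two >> 1) & 1, two & 1
    inc1 = G & (R | S | L)          # L is the least significant bit
    c1, l1 = L & inc1, L ^ inc1

    # Step 2: select by the top product bit (assumed select signal).
    if p >> 105:
        carry, low = c1, l1 << 1    # the G position carries no meaning here
    else:
        carry, low = c0, (l0 << 1) | g0

    # Step 3: the carry picks the added or the incremented upper part.
    return (((upper_inc if carry else upper) << 2) | low) << 52


x, y = (1 << 52) + 12345, (1 << 52) + 67891
print(round_product(x * y) == reference_round(x * y))
```

The result is returned at the scale of the unrounded product so that it can be compared directly with the reference.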
4. Design methodology
4.1 Circuit
The noise tolerant precharge (NTP) circuit, a high-speed and highly noise-tolerant CMOS circuit, was developed and adopted for the critical paths of the ALU and multiplier [9]. Figure 4 shows a block diagram of the NTP circuit. The NTP circuit has noise-tolerant PMOS logic, which provides high noise immunity. The NTP circuit is precharged when the clock is low and evaluated when the clock is high. The delay time of the circuit is determined by the NMOS logic. The NTP circuit has a 30-36% delay time advantage over a conventional CMOS circuit. Three types of NTP circuits were designed in order to accelerate the time-critical paths in the carry lookahead adders and the leading '1' bit anticipator.
Figure 4. NTP circuit block diagram
(Labels: noise-tolerant logic, discharge NMOS, inputs IN2 and IN3, clock CK, output OUT.)
4.2 Performance
Figure 5 shows the delay time of the floating-point ALU. Each delay time was calculated by circuit simulation. By using the above arithmetic techniques, the delay time of the maximum critical path is reduced by 15.4%. Moreover, by using the NTP circuit, the delay times of carry propagation in addition/subtraction and of leading '1' bit anticipation in normalization are reduced as well, reducing the total delay time by 24%.
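Figure 5. Delay time of the floating-point ALU
(Horizontal axis: delay time in ns, marked at 0, 5, 10, and 15. Bars compare the delay (1) without the techniques, 17.5 ns, (2) with the arithmetic techniques, and (3) with the circuit technique, broken down into format/align, alignment shift, add/sub, normalize, and round.)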
4.3 Floating-point unit
A floating-point unit utilizing the ALU and the multiplier was fabricated in 0.3 µm four-layer-metal CMOS technology. A block diagram of the floating-point unit is shown in Figure 6. The floating-point unit contains four major sub-units: a 128x64-bit register file, an ALU, a multiplier, and a divide/square root unit (Div/Sqrt). The register file has four write ports and four read ports, which allows parallel execution of a load, an ALU, and a multiply operation.
A microphotograph of the floating-point unit is shown in Figure 7. All of the cells were placed manually to shorten the wire length, and the routing of the macro was done automatically except for the critical parts. Table 2 summarizes the floating-point latency and throughput.
5. Conclusion
One-bit pre-shifting before alignment shift, normalization with the anticipated leading '1' bit, and pre-rounding techniques have been developed for a floating-point ALU. Carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A high-speed and highly noise-tolerant precharge (NTP) CMOS circuit was developed in order to accelerate the critical paths of the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 µm four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.
Acknowledgements
The authors would like to thank A. Anzai, M. Hashimoto, R. Yamagata, T. Kumagai, E. Kamada, T. Nakano, K. Kaneko, N. Ido, Y. Kiyoshige, S. Muto, S. Tanaka, K. Shimamura, K. Matsuo, T. Shimizu, and S. Nakahara of Hitachi Ltd. for their technical support, discussions, and guidance.
References
[1] R. K. Montoye et al., "Design of the IBM RISC System/6000 Floating-Point Execution Unit," IBM J. Res. Develop., Vol. 34, No. 1, pp. 59-70, January 1990.
[2] J. Yetter, "A 100-MHz Superscalar PA-RISC CPU/Coprocessor Chip," Digest of Technical Papers, Symp. VLSI Circuits, pp. 12-13, 1992.
[3] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, pp. 1555-1557, November 1992.
[4] L. Gwennap, "Digital Leads the Pack with 21164," Microprocessor Report, Vol. 8, No. 12, pp. 6-10, September 1994.
[5] L. Gwennap, "MIPS R10000 Uses Decoupled Architecture," Microprocessor Report, Vol. 8, No. 14, pp. 18-22, October 1994.
[6] L. Gwennap, "PA-8000 Combines Complexity and Speed," Microprocessor Report, Vol. 8, No. 15, pp. 6-9, November 1994.
[7] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard No. 754, 1988.
[8] C. S. Wallace, "A Suggestion for a Fast Multiplier," Trans. IEEE Electronic Computers, Vol. EC-13, pp. 14-17, February 1964.
[9] F. Murabayashi et al., "2.5V Novel CMOS Circuit Techniques for a 150MHz Superscalar RISC Processor," to be published in ESSCIRC'95, September 1995.
Figure 6. Floating-point unit block diagram
(Blocks: Register File, ALU, Multiplier, Div/Sqrt.)
Figure 7. Floating-point unit microphotograph
Table 2. Floating-point latency and throughput (double precision)

Operation        Latency (cycles/ns)    Throughput (cycles/ns)
ALU operation    2/13.3                 1/6.7
Multiply         2/13.3                 1/6.7
Divide           18/120.0               17/113.3
Square root      31/206.7               30/200.0