Arithmetic.1 2/15 Computer Arithmetic ALU Performance is critical ( App. C5, C6 4 th ed.)


arithmetic.22/15

Requirements: CPU needs a 32-bit ALU

(1) Functional Specification

inputs: 2 x 32-bit operands A, B; 4-bit mode m
outputs: 32-bit result S, 1-bit carry c, 1-bit overflow ovf
operations: add, addu, sub, subu, and, or, xor, nor, slt, sltu

(2) Block Diagram (schematic symbol/ Verilog description)

[Figure: ALU block symbol with 32-bit inputs A and B, 4-bit mode input m, 32-bit output S, and 1-bit outputs c and ovf]

arithmetic.32/15

1-bit adder Review (Appendix B.5, B.6)

A B C Co Sum
0 0 0 0  0
0 0 1 0  1
0 1 0 0  1
0 1 1 1  0
1 0 0 0  1
1 0 1 1  0
1 1 0 1  0
1 1 1 1  1

Sum = a!bc! + ab!c! + a!b!c + abc
    = a XOR b XOR c

CarryOut = a!bc + ab!c + abc! + abc

(x! denotes NOT x)

[Figure: 1-bit full adder block with inputs a, b, CarryIn and outputs Sum, CarryOut; gate-level sum circuit with inputs A, B, Cin]

2 units of delay from A/B to sum

1 unit of delay from Cin to sum
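The truth table and the two equations above can be checked with a short sketch. This is an illustration only, not part of the slides: the function name `full_adder` is made up here, and the carry is written in the equivalent "at least two inputs are 1" form rather than the four-minterm form.

```python
from itertools import product

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (carry_out, sum) for three 1-bit inputs."""
    s = a ^ b ^ cin                           # Sum = a XOR b XOR c
    cout = (a & b) | (a & cin) | (b & cin)    # 1 when at least two inputs are 1
    return cout, s

# Print the rows in the slide's column order: A B C Co Sum
for a, b, c in product((0, 1), repeat=3):
    cout, s = full_adder(a, b, c)
    print(a, b, c, cout, s)
```

Each row satisfies 2*Co + Sum = A + B + C, which is why the cell is also called a 3->2 element.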

arithmetic.42/15

Carry Out circuit

[Figure: gate-level CarryOut circuit with inputs a, b, CarryIn (Cin) and output CarryOut (Cout)]

2 units of delay from Cin to Cout

arithmetic.52/15

1-bit ALU cell: ADD, AND, OR

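[Figure: 1-bit ALU cell; inputs A and B feed a 1-bit full adder (with CarryIn, CarryOut), an AND gate and an OR gate; a mux controlled by S-select picks add, and, or as Result]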

A B C Co O
0 0 0 0  0
0 0 1 0  1
0 1 0 0  1
0 1 1 1  0
1 0 0 0  1
1 0 1 1  0
1 1 0 1  0
1 1 1 1  1

Full Adder (3->2 element)

arithmetic.62/15

Additional operations: Subtract, AND, OR

• A - B = A + (–B) = A + (NOT B) + 1
– form the two's complement by inverting every bit and adding one

[Figure: the same 1-bit ALU cell (full adder, AND, OR, mux with S-select choosing add, and, or) with an added invert control on input B]
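The invert-and-add-one identity can be sketched as follows. This is a minimal illustration assuming a 32-bit datapath; `MASK32`, `sub_via_add` and `to_signed32` are names invented for the example.

```python
MASK32 = 0xFFFFFFFF  # keep every value to 32 bits, like the ALU datapath

def sub_via_add(a: int, b: int) -> int:
    """Compute (a - b) mod 2**32 using only invert and add."""
    b_inverted = ~b & MASK32               # invert every bit of B
    return (a + b_inverted + 1) & MASK32   # a carry-in of 1 supplies the "+ 1"

def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x

print(to_signed32(sub_via_add(7, 3)))   # 7 - 3
print(to_signed32(sub_via_add(3, 7)))   # 3 - 7
```

In hardware the "+ 1" costs nothing extra: the invert control selects NOT B and the CarryIn of the least significant bit is set to 1.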

arithmetic.72/15

1-bit ALU: AND, OR, a+b, a+b!

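[Figure: two 1-bit ALU cells, labeled a. and b. Each has inputs a, b, Binvert, CarryIn and Less, and a 4-way multiplexer (inputs 0 to 3) selected by Operation that produces Result. One cell also outputs CarryOut; the cell for the most significant bit adds overflow detection and outputs Set and Overflow.]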

ALU Delays

Result = 1 gate delay

From a to result = 2

From b to Result = 2 (ignore b invert)

arithmetic.82/15

Final 32-bit ALU, including zero detect

[Figure: 32-bit ALU built from cells ALU0 through ALU31. Inputs a0..a31, b0..b31, Bnegate and Operation; each cell's CarryOut feeds the next cell's CarryIn; the Less input is 0 for every cell except ALU0, whose Less input comes from Set of ALU31; outputs Result0..Result31, Zero and Overflow.]

arithmetic.92/15

Behavioral Representation: Verilog, RTL (FYI)

    module ALU(A, B, m, S, c, ovf);
      input  [0:31] A, B;   // 32-bit operands
      input  [0:3]  m;      // 4-bit mode
      output [0:31] S;      // 32-bit result
      output c, ovf;        // carry and overflow flags

      reg [0:31] S;
      reg c, ovf;

      always @(A, B, m) begin
        case (m)
          0: S = A + B;
          . . .
        endcase
      end
    endmodule

• Code written, simulated & verified

• translated into hardware (mapped)

• How complex digital design is done

arithmetic.102/15

Overflow ?? - 4-bit example

• Examples: 7 + 3 = 10 but ...

• - 4 - 5 = - 9 but ...

Decimal  Binary      Decimal  2's Complement
   0     0000           0     0000
   1     0001          -1     1111
   2     0010          -2     1110
   3     0011          -3     1101
   4     0100          -4     1100
   5     0101          -5     1011
   6     0110          -6     1010
   7     0111          -7     1001
                       -8     1000

    0 1 1 1       7          1 1 0 0      -4
  + 0 0 1 1     + 3        + 1 0 1 1    + -5
    1 0 1 0      -6          0 1 1 1       7

The 4-bit results read as -6 and 7 instead of 10 and -9.

arithmetic.112/15

Overflow Detection

• Overflow: arithmetic result too large (or too small) to represent properly

– Example: -8 ≤ 4-bit binary number ≤ 7

• When adding operands with different signs, overflow cannot occur!

• Overflow occurs when adding:

– 2 positive numbers and sum is negative

– 2 negative numbers and the sum is positive

• On your own: Prove you can detect overflow by:

– Carry into MSB ≠ Carry out of MSB

    0 1 1 1       7          1 1 0 0      -4
  + 0 0 1 1     + 3        + 1 0 1 1    + -5
    1 0 1 0      -6          0 1 1 1       7

  carry into MSB = 1         carry into MSB = 0
  carry out of MSB = 0       carry out of MSB = 1

arithmetic.122/15

Overflow Detection Logic

• Carry into MSB ≠ Carry out of MSB
– For an N-bit ALU: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1]

[Figure: 4-bit ALU built from four 1-bit ALUs with inputs A0..A3, B0..B3 and outputs Result0..Result3; CarryIn3 and CarryOut3 of the last cell feed an XOR gate that produces Overflow]

X Y  X XOR Y
0 0  0
0 1  1
1 0  1
1 1  0
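The rule Overflow = CarryIn[N - 1] XOR CarryOut[N - 1] can be tried on the two 4-bit examples from the earlier slides. The sketch below is only an illustration; the function name and the way the carry into the MSB is recovered (by adding the low N - 1 bits separately) are choices made for this example.

```python
def add_with_overflow(a: int, b: int, n: int = 4) -> tuple[int, int, int]:
    """Add two n-bit patterns; return (result, carry_out, overflow)."""
    mask = (1 << n) - 1
    low_mask = (1 << (n - 1)) - 1
    # Carry into the MSB: add only the low n-1 bits and look at bit n-1.
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = (a & mask) + (b & mask)
    carry_out = total >> n                  # carry out of the MSB
    overflow = carry_into_msb ^ carry_out   # the XOR from the slide
    return total & mask, carry_out, overflow

print(add_with_overflow(0b0111, 0b0011))   # 7 + 3
print(add_with_overflow(0b1100, 0b1011))   # -4 + -5
print(add_with_overflow(0b0111, 0b1100))   # 7 + -4, different signs
```

The first two calls report overflow; the third does not, matching the claim that operands with different signs cannot overflow.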

arithmetic.132/15

MIPS ALU requirements

• Add, AddU, Sub, SubU, AddI, AddIU
– => 2's complement adder/subtractor with overflow detection

• And, Or, AndI, OrI, Xor, Xori, Nor
– => Logical AND, logical OR, XOR, NOR

• SLTI, SLTIU (set less than)
– => 2's complement adder with inverter, check sign bit of result

• ALU must support these ops

arithmetic.142/15

MIPS arithmetic instruction format - Review

• Signed arithmetic generates overflow, no carry

Bit boundaries: 31 25 20 15 5 0

R-type: op | Rs | Rt | Rd | funct
I-type: op | Rs | Rt | Immed 16

Type    op   funct
ADDI    10   xx
ADDIU   11   xx
SLTI    12   xx
SLTIU   13   xx
ANDI    14   xx
ORI     15   xx
XORI    16   xx
LUI     17   xx

Type    op   funct
ADD     00   40
ADDU    00   41
SUB     00   42
SUBU    00   43
AND     00   44
OR      00   45
XOR     00   46
NOR     00   47

Type    op   funct
        00   50
        00   51
SLT     00   52
SLTU    00   53

arithmetic.152/15

Ripple Adder Performance?

• Critical path of an n-bit ripple-carry adder is n*CP

[Figure: 4-bit ripple-carry adder; four 1-bit ALUs with inputs A0..A3, B0..B3 and outputs Result0..Result3; each CarryOut feeds the next CarryIn, from CarryIn0 through CarryOut3]

Very slow: must improve

Assume t = carry delay / bit
32-bit ALU needs 32 * t units of delay
64-bit ALU needs 64 * t units of delay

[Figure: 1-bit sum circuit (A, B, Cin to sum) and CarryOut circuit (a, b, CarryIn to CarryOut)]

2 units of delay from A/B to sum

1 unit of delay from Cin to sum
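A bit-level sketch of the ripple-carry adder makes the serial dependence visible: bit i cannot finish until the carry from bit i - 1 arrives. The code is an illustration, not the slides' design; the function names are invented, and the default t = 2 assumes the "2 units of delay from Cin to Cout" figure quoted earlier.

```python
def ripple_carry_add(a: int, b: int, n: int = 32, cin: int = 0) -> tuple[int, int]:
    """Add two n-bit values one bit at a time; return (result, carry_out)."""
    carry = cin
    result = 0
    for i in range(n):                       # bit 0 first: the carry ripples upward
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                  # full-adder sum
        carry = (ai & bi) | (ai & carry) | (bi & carry)   # full-adder carry
        result |= s << i
    return result, carry

def carry_chain_delay(n: int, t: int = 2) -> int:
    """Delay of the carry chain, n * t (t = carry delay per bit, assumed 2)."""
    return n * t

print(ripple_carry_add(0xFFFFFFFF, 1))   # the worst case: carry crosses all 32 bits
print(carry_chain_delay(32), carry_chain_delay(64))
```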

arithmetic.162/15

Fast Addition : Carry Lookahead

• Carry inputs can be precomputed by logic

g0 = a0 b0        p0 = a0 + b0
c1 = g0 + c0 p0 = a0 b0 + c0 (a0 + b0)

g1 = a1 b1        p1 = a1 + b1
c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
   = a1 b1 + c1 (a1 + b1)

c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0

c4 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0

c4 = func(a3, b3, a2, b2, a1, b1, a0, b0, c0)

1 unit of delay for each p, g

3 units of delay for each lookahead carry
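The carry equations above can be evaluated directly from the generate and propagate bits, with no carry waiting on another carry. The following is a sketch for illustration; `lookahead_carries` is a made-up name and the operands are assumed to be 4-bit values.

```python
def lookahead_carries(a: int, b: int, c0: int = 0) -> list[int]:
    """Return [c1, c2, c3, c4] for 4-bit operands using the flattened equations."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # gi = ai AND bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # pi = ai OR bi
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

print(lookahead_carries(0b0111, 0b0011))   # carries for 7 + 3
```

Every carry is a two-level AND-OR function of the p and g signals, which is where the fixed delay per carry comes from.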

arithmetic.172/15

Fast Addition: Carry Look Ahead – 4 bits

A B C-out
0 0 0      "kill"
0 1 C-in   "propagate"
1 0 C-in   "propagate"
1 1 1      "generate"

g = a and b, p = a or b (1 delay)

C0 = Cin

c1 = g0 + c0 p0

c2 = g1 + g0 p1 + c0 p0 p1

c3 = g2 + g1 p2 + g0 p1 p2 + c0 p0 p1 p2

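[Figure: 4-bit carry lookahead adder; each bit position takes ai, bi (a0 b0 through a3 b3), produces g and p for the lookahead logic, and outputs a sum bit S]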

G0=g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0

C4 = . . .

P0 = p3 p2 p1 p0

3 units of delay for G0

3 units of delay for c1, c2, c3, (c4)

4 units of delay for S1, S2, S3

Delay labels on the figure: 3, 3, 3, 4, 4, 4, 2

arithmetic.182/15

Carry Lookahead – 2nd level – 16 bits

Add a 2nd level of abstraction for more practical 4-bit units. Each Pi, Gi handles 4 bits at a time (0-3, 4-7, 8-11, ...)

P0 = p3 p2 p1 p0 ;     G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0

P1 = p7 p6 p5 p4 ;     G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4

P2 = p11 p10 p9 p8 ;   G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8

P3 = p15 p14 p13 p12 ; G3 = …….

3 units of delay for G0, G1, G2, G3

2 units of delay for P0, P1, P2, P3

arithmetic.192/15

Fast Addition: Cascaded Carry Look-ahead (16-bit):

[Figure: 16-bit adder built from 4-bit adders connected to a second-level CLA unit]

c4 = G0 + C0 P0

c8 = G1 + G0 P1 + C0 P0 P1

c12 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2

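c16 = . . .

5 units of delay for c8, c12, c16

c4 has 4 units of delay

[Figure labels: each 4-bit adder outputs G, P (G0, P0 for the first); C0 is the carry-in; c8 and c12 are annotated with 5, c4 with 4]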
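The second-level equations can be sketched the same way as the bit-level ones: each 4-bit group is summarized by one (P, G) pair, and the boundary carries are computed from those pairs. This is an illustration with invented function names. It covers only c4, c8 and c12, since the slides leave G3 and c16 elided.

```python
def group_pg(a: int, b: int, lo: int) -> tuple[int, int]:
    """Return (P, G) for the 4-bit group whose lowest bit is position lo."""
    g = [(a >> (lo + i)) & (b >> (lo + i)) & 1 for i in range(4)]
    p = [((a >> (lo + i)) | (b >> (lo + i))) & 1 for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G

def block_carries(a: int, b: int, c0: int = 0) -> tuple[int, int, int]:
    """Return (c4, c8, c12) from the group-level lookahead equations."""
    (P0, G0), (P1, G1), (P2, G2) = (group_pg(a, b, lo) for lo in (0, 4, 8))
    c4 = G0 | (c0 & P0)
    c8 = G1 | (G0 & P1) | (c0 & P0 & P1)
    c12 = G2 | (G1 & P2) | (G0 & P1 & P2) | (c0 & P0 & P1 & P2)
    return c4, c8, c12

print(block_carries(0x0FFF, 0x0001))   # a carry that crosses all three boundaries
```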

arithmetic.202/15

Carry Lookahead Homework

You are required to calculate the performance of a 16-bit carry lookahead adder similar to the one discussed in class. The design has 2 options:

1. Ripple carry is used inside each 4-bit cell
2. Carry lookahead is used inside each 4-bit cell

• Both cases use carry lookahead to predict the 4-bit boundary carries [c4, c8, c12]
• Draw a table showing the delay of each adder bit, i.e. Sum0 - Sum15, as well as the carry at each stage of the design, for the 2 designs

arithmetic.212/15

8-bit carry lookahead adder (4-bit block is also CLA)

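c5 = g4 + c4.p4
Delays: g4 = 1, c4 = 4, p4 = 1

c4 = G0 + c0 P0 (2nd level carry lookahead): 4 units of delay

[Figure: 8-bit adder built from two 4-bit CLA blocks with inputs a0b0..a7b7 and outputs S0..S7; the blocks pass G0, P0 and G1, P1 to the 2nd level carry lookahead. Other delay labels on the figure: 3, 3, 3, 6, 6, 6, 5, 6]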

arithmetic.222/15

8-bit CLA – uses ripple carry inside 4-bit block

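[Figure: 8-bit adder with inputs a0b0..a7b7 and outputs Result0..Result7; the carry ripples inside each 4-bit block and c4 comes from the 2nd level carry lookahead]

Delays read off the figure (in units):

Bit i            0  1  2  3  4  5  6  7
Carry into bit   0  2  4  6  4  6  8  10
Result i         2  3  5  7  5  7  9  11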
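The two rows of delay figures on this slide (0 2 4 6 4 6 8 10 and 2 3 5 7 5 7 9 11) are consistent with a simple timing model, sketched below. The model is an assumption made for illustration: 2 units from A/B to sum, 1 unit from Cin to sum, 2 units from A/B or Cin to Cout, and c4 arriving from the lookahead unit after 4 units. The function name is invented.

```python
def ripple_block_delays(cin_delay: int, width: int = 4) -> tuple[list[int], list[int]]:
    """Return (carry-in arrival times, sum arrival times) for one ripple block."""
    carry_in, sums = [], []
    c = cin_delay
    for _ in range(width):
        carry_in.append(c)
        sums.append(max(2, c + 1))   # 2 units from A/B, 1 unit after Cin arrives
        c = max(2, c + 2)            # 2 units from A/B or from Cin to Cout
    return carry_in, sums

low_c, low_s = ripple_block_delays(0)     # bits 0-3, c0 available at time 0
high_c, high_s = ripple_block_delays(4)   # bits 4-7, c4 from lookahead at time 4
print(low_c + high_c)
print(low_s + high_s)
```

Without the lookahead carry the upper block would start at time 8 instead of 4, so the last result bit would arrive at 15 rather than 11.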

arithmetic.232/15

Additional MIPS ALU requirements

• Mult, MultU, Div, DivU => Need 32-bit multiply and divide, signed and unsigned

• Sll, Srl, Sra => Need left shift, right shift, right shift arithmetic by 0 to 31 bits

• Nor (leave as exercise !)=> logical NOR or use 2 steps: (A OR B) XOR 1111....1111

arithmetic.242/15

Multiply, Divide & Shift

arithmetic.252/15

MIPS arithmetic instructions

Instruction          Example           Meaning                        Comments
add                  add $1,$2,$3      $1 = $2 + $3                   3 operands; exception possible
subtract             sub $1,$2,$3      $1 = $2 – $3                   3 operands; exception possible
add immediate        addi $1,$2,100    $1 = $2 + 100                  + constant; exception possible
add unsigned         addu $1,$2,$3     $1 = $2 + $3                   3 operands; no exceptions
subtract unsigned    subu $1,$2,$3     $1 = $2 – $3                   3 operands; no exceptions
add imm. unsign.     addiu $1,$2,100   $1 = $2 + 100                  + constant; no exceptions
multiply             mult $2,$3        Hi, Lo = $2 x $3               64-bit signed product
multiply unsigned    multu $2,$3       Hi, Lo = $2 x $3               64-bit unsigned product
divide               div $2,$3         Lo = $2 ÷ $3, Hi = $2 mod $3   Lo = quotient, Hi = remainder
divide unsigned      divu $2,$3        Lo = $2 ÷ $3, Hi = $2 mod $3   Unsigned quotient & remainder
Move from Hi         mfhi $1           $1 = Hi                        Used to get copy of Hi
Move from Lo         mflo $1           $1 = Lo                        Used to get copy of Lo

arithmetic.262/15

MULTIPLY (unsigned)

• Paper and pencil example (unsigned):

    Multiplicand        1000    A
    Multiplier        x 1001    B
                        1000
                       0000
                      0000
                     1000
    Product         01001000

• m bits x n bits = m+n bit product
• Binary makes it easy:
– 0 => place 0 (0 x multiplicand)
– 1 => place a copy (1 x multiplicand)

• 4 versions of multiply hardware & algorithm:
– successive refinement

arithmetic.272/15

Fast Multiply == Array Multiplier

• Stage i accumulates A * 2^i if Bi == 1

• Q: How much hardware for 32 bit multiplier? Critical path?

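[Figure: 4 x 4 array multiplier. Multiplicand bits A3..A0 enter every row; row i is gated by multiplier bit Bi (B0..B3); the first row's sum inputs are 0 0 0 0; product bits P7..P0 come out. Each cell is a full adder (FA) with inputs Aj, Bi, sum in, carry in and outputs sum out, carry out.]

Multiplicand A, Multiplier B, Product P

Cell delays ?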

arithmetic.282/15

Multiplier operation

• At each stage shift multiplicand left ( x 2)

• Multiplier bit Bi determines : add in shifted multiplicand

• Accumulate 2n bit partial product at each stage

[Figure: the same 4 x 4 array multiplier, with rows gated by B0..B3, multiplicand bits A3..A0 shifted left one position per row, zeros fed into the unused inputs, and product bits P7..P0]

arithmetic.29

Multiplication, using shift & Add

• long-multiplication approach

      1000      multiplicand
    × 1001      multiplier
      1000
     0000
    0000
   1000
   1001000      product

Length of product is the sum of operand lengths

arithmetic.30

Multiplication Hardware using shift & Add

[Figure: multiplication hardware; the product register is initially 0]

arithmetic.31

Optimized Multiplier using shift & Add

• Perform steps in parallel: add/shift

One cycle per partial-product addition is acceptable if multiplications are infrequent

[Figure: optimized multiplier datapath; labels: 32-bit ALU, multiplicand]

arithmetic.322/15

Multiply Algorithm

Start

1. Test Product0
   Product0 = 1: 1a. Add multiplicand to the left half of the Product register & place the result in the left half of the Product register
   Product0 = 0: no addition

2. Shift the Product register right 1 bit.

32nd repetition?
   No: < 32 repetitions, go back to step 1
   Yes: 32 repetitions, Done

Step   Product      Multiplicand
       0000 0011    0010
1:     0010 0011    0010
2:     0001 0001    0010
1:     0011 0001    0010
2:     0001 1000    0010
1:     0001 1000    0010
2:     0000 1100    0010
1:     0000 1100    0010
2:     0000 0110    0010
       0000 0110    0010
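The flowchart can be sketched in a few lines. This is an illustration rather than the slides' hardware: the function name and the parameter `n` (register half-width, 32 on the slide, 4 in the trace) are choices made here, and a Python integer stands in for the Product register, so the carry out of the left-half addition is simply kept and shifted back in.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n: int = 32) -> int:
    """Unsigned n-bit x n-bit multiply using one combined Product register."""
    # Right half starts as the multiplier, left half starts as 0.
    product = multiplier & ((1 << n) - 1)
    for _ in range(n):                      # n repetitions
        if product & 1:                     # 1. test Product0
            product += multiplicand << n    # 1a. add multiplicand to the left half
        product >>= 1                       # 2. shift the Product register right 1 bit
    return product

# The 4-bit trace from the slide: multiplicand 0010, multiplier 0011
print(format(shift_add_multiply(0b0010, 0b0011, n=4), "08b"))
```

The multiplier bits are consumed from the right end of the register as the product grows into it, which is why no separate multiplier register is needed.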

arithmetic.332/15

MIPS logical instructions

Instruction           Example           Meaning           Comment
and                   and $1,$2,$3      $1 = $2 & $3      3 reg. operands; Logical AND
or                    or $1,$2,$3       $1 = $2 | $3      3 reg. operands; Logical OR
xor                   xor $1,$2,$3      $1 = $2 ^ $3      3 reg. operands; Logical XOR
nor                   nor $1,$2,$3      $1 = ~($2 | $3)   3 reg. operands; Logical NOR
and immediate         andi $1,$2,10     $1 = $2 & 10      Logical AND reg, constant
or immediate          ori $1,$2,10      $1 = $2 | 10      Logical OR reg, constant
xor immediate         xori $1,$2,10     $1 = $2 ^ 10      Logical XOR reg, constant
shift left logical    sll $1,$2,10      $1 = $2 << 10     Shift left by constant
shift right logical   srl $1,$2,10      $1 = $2 >> 10     Shift right by constant
shift right arithm.   sra $1,$2,10      $1 = $2 >> 10     Shift right (sign extend)
shift left logical    sllv $1,$2,$3     $1 = $2 << $3     Shift left by variable
shift right logical   srlv $1,$2,$3     $1 = $2 >> $3     Shift right by variable
shift right arithm.   srav $1,$2,$3     $1 = $2 >> $3     Shift right arith. by variable

arithmetic.342/15

How shift instructions are implemented

Two kinds: logical-- value shifted in is always "0"

arithmetic-- on right shifts, sign extend

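[Figure: logical shift, "0" enters at the msb end on right shifts and at the lsb end on left shifts; arithmetic shift, the sign bit is repeated at the msb end on right shifts and "0" enters at the lsb end on left shifts]

instruction can request 0 to 31 bits to be shifted!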

1011 1110

shift right arithmetic by 2

1100 1011

shift right logical by 2
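The difference between the two kinds of shift shows up only on right shifts of a value whose sign bit is 1. The sketch below is an illustration on n-bit patterns; the function names follow the MIPS mnemonics but the implementations are assumptions made for this example.

```python
def sll(x: int, k: int, n: int = 32) -> int:
    """Shift left logical: zeros enter at the lsb end."""
    return (x << k) & ((1 << n) - 1)

def srl(x: int, k: int, n: int = 32) -> int:
    """Shift right logical: zeros enter at the msb end."""
    return (x & ((1 << n) - 1)) >> k

def sra(x: int, k: int, n: int = 32) -> int:
    """Shift right arithmetic: copies of the sign bit enter at the msb end."""
    mask = (1 << n) - 1
    x &= mask
    shifted = x >> k
    if x >> (n - 1):                         # sign bit is 1
        shifted |= (mask << (n - k)) & mask  # fill the vacated top k bits with 1s
    return shifted

for f in (sll, srl, sra):
    print(f.__name__, format(f(0b1011, 2, n=4), "04b"))
```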

arithmetic.35

ARM :: Barrel Shifter:

– Shift value can be either:
• a 5-bit unsigned integer
• specified in the bottom byte of another register

Example: ADD r0, r1, r2, LSL#7

• Semantics: r2 is shifted left by 7 & then added to r1

[Figure: Operand 2 passes through the barrel shifter before entering the ALU; Operand 1 enters the ALU directly; the ALU produces Result]

arithmetic.362/15

Barrel Shifter, used in ICs

Shift right using one transistor per switch

[Figure: barrel shifter switch matrix with lines D3..D0 and A6..A0 and shift-right controls SR0..SR3]

arithmetic.37

Barrel Shifter, used in ICs

Shift left & right

[Figure: barrel shifter switch matrix with lines D3..D0 and A5..A0, shift-right controls SR0..SR2 and shift-left controls SL1..SL3]

arithmetic.382/15

Summary: Multiply & Shift

• Multiply: successive refinement to see final design

– 32-bit Adder, 64-bit shift register, 32-bit Multiplicand Register

• Fast multiply: array multiplier

• Shifter: successive refinement, from a 1-bit-at-a-time shift register to a barrel shifter

arithmetic.392/15

Floating Point Arithmetic

• How to represent
– numbers with fractions, e.g., 3.1416
– very small numbers, e.g., .000000001
– very large numbers, e.g., 3.15576 × 10^9

• Fixed point
• Floating point: a number system with floating decimal point
• Normalized numbers: no leading 0's, single digit before decimal point

Examples shown on the slide: 1.0 × 10^-9, 3.1557 × 10^9, 350.03

arithmetic.402/15

Floating Point Notation – IEEE 754 FP

6.02 × 10^23        1.673 × 10^-24

Parts of the notation: mantissa (sign, magnitude), decimal point, radix (base), exponent (sign, magnitude)

IEEE F.P.: ± 1.M × 2^(e - 127)

• Issues:
– Arithmetic (+, -, *, /)
– Representation, Normal form
– Range and Precision, Single, Double
– Rounding
– Exceptions (e.g., divide by zero, overflow, underflow)

arithmetic.412/15

Floating-Point Arithmetic

Floating point numbers in IEEE 754 standard:

single precision: S (1 bit) | E (8 bits) | M (23 bits)

S: sign
E: exponent, excess-127 binary integer; actual exponent is e = E - 127
M: mantissa, sign + magnitude, normalized binary significand with hidden integer bit: 1.M

N = (-1)^S × 2^(E-127) × (1.M),   0 < E < 255

0 = 0 00000000 0 . . . 0
-1.5 = 1 01111111 10 . . . 0

Numbers that can be represented are in the range:

2^-126 × (1.0) to 2^127 × (2 - 2^-23)

Double Precision IEEE 754 [64 bits]

Exponent = 11 bits, Bias = 1023, Mantissa = 52 bits, Sign = 1 bit

arithmetic.422/15

Exponent Bias used to simplify comparisons

• If we use 2’s complement, not good for sorting and comparison

With a biased exponent: 0000 0000 = most negative exponent, 1111 1111 = most positive exponent

arithmetic.432/15

Floating Point – Example review

• Format: S | exponent | significand

• Represents (-1)^S × (1 + significand) × 2^(exponent - bias)
– bias = 127 for 32-bit word
– S = 1: negative; S = 0: positive or zero

• Example (from fraction to floating point representation): -0.75

arithmetic.442/15

Floating-Point Example - review

• Represent –0.75
– –0.75 = (–1)^1 × 1.1two × 2^–1
– S = 1
– Fraction = 1000…00two
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = 01111110two
• Double: –1 + 1023 = 1022 = 01111111110two

• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
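The encoding of –0.75 can be checked against the machine's own IEEE 754 representation using Python's standard `struct` module. The helper names `float_bits` and `double_bits` are invented for this sketch; they print the sign, exponent and fraction fields separated by spaces.

```python
import struct

def float_bits(x: float) -> str:
    """Sign, 8-bit exponent and 23-bit fraction of x as a single-precision value."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

def double_bits(x: float) -> str:
    """Sign, 11-bit exponent and 52-bit fraction of x as a double-precision value."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{word:064b}"
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"

print(float_bits(-0.75))
print(double_bits(-0.75))
```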

arithmetic.452/15

Addition – Multiply Algorithm issues

For addition (or subtraction):

(1) compute Ye - Xe (getting ready to align binary point)

(2) right shift Xm that many positions to form Xm × 2^(Xe-Ye)

(3) compute Xm × 2^(Xe-Ye) + Ym

(4) for multiply, the doubly biased exponent must be corrected with an extra subtraction step of the bias amount:

Excess 8, Xe = 7, Ye = -3

    Xe =  1111 = 15 =  7 + 8
    Ye =  0101 =  5 = -3 + 8
         10100 = 20 =  4 + 8 + 8
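The excess-8 example can be replayed in a few lines. The names `BIAS`, `encode` and `multiply_exponents` are invented for this sketch; the point is only that adding two biased exponents counts the bias twice, so one bias must be subtracted.

```python
BIAS = 8  # excess-8 exponent encoding, as in the example

def encode(e: int) -> int:
    """Biased (stored) form of a true exponent."""
    return e + BIAS

def multiply_exponents(xe_biased: int, ye_biased: int) -> int:
    """Biased exponent of a product: add the stored exponents, subtract one bias."""
    return xe_biased + ye_biased - BIAS

xe, ye = encode(7), encode(-3)
print(format(xe, "04b"), format(ye, "04b"), format(xe + ye, "05b"))   # doubly biased sum
print(multiply_exponents(xe, ye))   # corrected stored exponent for true exponent 4
```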

arithmetic.462/15

Floating Point Addition

• Step 1: align, round

• Step 2: add

• Step 3: normalize, check overflow or underflow

• Step 4: round

• Example: 9.999ten × 10^1 + 1.610ten × 10^-1

arithmetic.472/15

Floating Point Multiplication

• Step 1: add exponents, subtract bias, Mpy mantissas

• Step 2: normalize and check over/underflow

• Step 3: round

• Step 4: check sign

• Example: 0.5 × (–0.4375)
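The four steps can be followed on single-precision bit fields. What follows is a simplified sketch, not a full IEEE 754 multiplier: it assumes normalized inputs, truncates instead of rounding, and raises an error on exponent overflow or underflow. The function names are invented, and the example operands 0.5 and –0.4375 are what the slide's example appears to be.

```python
import struct

def fields(x: float) -> tuple[int, int, int]:
    """Split x (as single precision) into sign, biased exponent, 23-bit fraction."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF

def fp_multiply(x: float, y: float) -> float:
    sx, ex, mx = fields(x)
    sy, ey, my = fields(y)
    # Step 1: add exponents, subtract bias, multiply 24-bit significands (1.M)
    exp = ex + ey - 127
    sig = ((1 << 23) | mx) * ((1 << 23) | my)   # value scaled by 2**46
    # Step 2: normalize (product of two values in [1, 2) lies in [1, 4))
    if sig >> 47:
        sig >>= 1
        exp += 1
    if not 0 < exp < 255:
        raise OverflowError("exponent overflow or underflow")
    # Step 3: round (simplified here to truncation)
    frac = (sig >> 23) & 0x7FFFFF
    # Step 4: sign is negative when the operand signs differ
    sign = sx ^ sy
    word = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(fp_multiply(0.5, -0.4375))
```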

arithmetic.48

FP Adder Hardware

• more complex than integer adder

• Doing it in one clock cycle takes too long
– Much longer than integer operations
– Slower clock would penalize all instructions

• FP adder usually takes several cycles
– pipelined

arithmetic.49

FP Adder Hardware

[Figure: FP adder datapath, annotated with Step 1 through Step 4]

arithmetic.502/15

Floating Point: Overflow & Underflow

• Overflow: exponent too large to be represented

• Underflow: negative exponent too small to fit in exponent field

arithmetic.512/15

Summary of Floating Point Arithmetic

• IEEE floating point standard 32 bit and 64 bit

• Converting decimal numbers to floating point and vice versa

• Overflow and underflow

• Floating point add and multiply
