Chapter 7 Floating-Point Arithmetic


Representation of Floating-Point Numbers

A simple representation of a floating-point (or real) number (N) uses a fraction (F), base (B), and exponent (E), where $N = F \times B^{E}$.

The base can be any integer larger than 1 and can be implied or explicit.

The fraction and the exponent can be represented in many formats. For example, they can be represented in 2’s complement format, sign-magnitude form, or another number representation.

There are a variety of floating-point formats.


Representation of Floating-Point Numbers: 2’s Complement 1

The base for the exponent is 2. Hence, the value of the number is $N = F \times 2^{E}$.

In a typical floating-point number system, F is 16 to 64 bits long and E is 8 to 15 bits long.

The sign bit is 0 for positive numbers and 1 for negative numbers.


Representation of Floating-Point Numbers: 2’s Complement 2

Example: represent decimal 2.5 in 8-bit 2’s complement floating-point format:

$2.5 = 0010.1000_2 = 1.010_2 \times 2^{1}$ (normalized representation) $= 0.101_2 \times 2^{2}$ (4-bit 2’s complement fraction)

Thus, F = 0.101, E = 0010, $N = 5/8 \times 2^{2}$

If the number is -2.5, the same exponent can be used, but the fraction must have a negative sign. The 2’s complement representation for the fraction is 1.011.

Thus, F = 1.011, E = 0010, $N = -5/8 \times 2^{2}$


Representation of Floating-Point Numbers: 2’s Complement 3

Normalizing: In order to utilize all the bits in F and have the maximum number of significant figures, F should be normalized so that its magnitude is as large as possible.

If F is not normalized, normalize F by shifting it left until the sign bit and the next bit are different.

Shifting F left is equivalent to multiplying by 2, so for every shift, decrement E by 1 to keep N the same.

After normalization, the magnitude of F will be as large as possible, since any further shifting would change the sign bit.


Representation of Floating-Point Numbers: 2’s Complement 4

Examples:

Unnormalized: F = 0.0101, E = 0011, $N = 5/16 \times 2^{3} = 5/2$

Normalized: F = 0.101, E = 0010, $N = 5/8 \times 2^{2} = 5/2$

Unnormalized: F = 1.11011, E = 1100, $N = -5/32 \times 2^{-4} = -5 \times 2^{-9}$

Shift F left: F = 1.1011, E = 1011, $N = -5/16 \times 2^{-5} = -5 \times 2^{-9}$

Normalized: F = 1.011, E = 1010, $N = -5/8 \times 2^{-6} = -5 \times 2^{-9}$
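The shift-left normalization rule can be sketched in Python. This is only an illustration under stated assumptions: the fraction is held as a raw bit pattern `f` of width `bits` (sign bit first), the exponent is an ordinary signed integer, and the function name `normalize` is hypothetical.

```python
def normalize(f, e, bits=4):
    """Normalize a 2's complement fraction held as a `bits`-wide bit pattern.

    f is the raw pattern (sign bit first, e.g. 0b0101 for 0.101) and e is the
    exponent as a signed Python int. Returns the normalized (f, e) pair.
    """
    mask = (1 << bits) - 1
    f &= mask
    if f == 0:
        return f, e                      # zero cannot be normalized
    # Repeat while the sign bit and the bit to its right are equal.
    while ((f >> (bits - 1)) & 1) == ((f >> (bits - 2)) & 1):
        f = (f << 1) & mask              # shift left: the fraction is doubled
        e -= 1                           # so the exponent drops by 1
    return f, e


# Second example from the slide: F = 1.11011, E = 1100 (that is, -4).
f, e = normalize(0b111011, -4, bits=6)
print(format(f, "06b"), e)               # 101100 -6, i.e. F = 1.011(00), E = -6
```

The loop stops as soon as one more shift would change the sign bit, which is the condition described above.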


Representation of Floating-Point Numbers: 2’s Complement 5

Zero cannot be normalized, so F = 0.000 when N = 0.

Any exponent could then be used; however, it is best to have a uniform representation of 0.

In this format, the fraction 0 is paired with the negative exponent of largest magnitude.

In a 4-bit 2’s complement integer number system, the most negative number is 1000, which represents -8. Thus when F and E are 4 bits, 0 is represented by: F = 0.000, E = 1000, $N = 0.000 \times 2^{-8}$

Some floating-point systems use a biased exponent whereby E = 0 is associated with F = 0.


Representation of Floating-Point Numbers: IEEE 754 Standard 1

IEEE 754 is a floating-point standard established by the IEEE in 1985.

It contains two representations for floating-point numbers:

Single precision: uses 32 bits.

Double precision: uses 64 bits.

Designers of IEEE 754 desired a format that was easy to sort and hence adopted a sign-magnitude system for the fractional part and a biased notation for the exponent.


Representation of Floating-Point Numbers: IEEE 754 2

The IEEE 754 floating-point formats need three subfields: sign, fraction, and exponent.

The fractional part of the number is represented using a sign-magnitude representation in the IEEE floating-point formats.

The sign is 0 for positive numbers and 1 for negative numbers.


Representation of Floating-Point Numbers: IEEE 754 3

Form is: $N = (-1)^{S} \times (1 + F) \times 2^{E}$

S is the sign bit, F is the fractional part, and E is the exponent.

Base of the exponent is 2 and implied (not stored).

Magnitude (significand) of the fraction is 1 + F. Often the terms significand and fraction are used interchangeably.


Representation of Floating-Point Numbers: IEEE 754 4

IEEE Single Precision Floating-Point Format (32 bits):

Sign     Exponent   Fraction
1 bit    8 bits     23 bits

IEEE Double Precision Floating-Point Format (64 bits):

Sign     Exponent   Fraction
1 bit    11 bits    52 bits


Representation of Floating-Point Numbers: IEEE 754 5

The exponent in the IEEE floating-point formats uses a biased notation: the stored exponent contains the actual exponent plus 127 for single precision, or plus 1023 for double precision.

This converts all single-precision exponents from -126 to +127 into unsigned numbers from 1 to 254, and all double-precision exponents from -1022 to +1023 into unsigned numbers from 1 to 2046.
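The bias arithmetic is small enough to show directly. The sketch below is illustrative only; the names `BIAS` and `biased_exponent` are hypothetical, and the range check reflects the 1 to 254 and 1 to 2046 limits quoted above.

```python
BIAS = {"single": 127, "double": 1023}


def biased_exponent(e, precision="single"):
    """Return the stored exponent field for the actual exponent e."""
    bias = BIAS[precision]
    stored = e + bias
    # Normalized numbers use stored values 1 .. 2*bias (254 or 2046).
    if not 1 <= stored <= 2 * bias:
        raise ValueError("exponent outside the normalized range")
    return stored


print(biased_exponent(3))             # 130
print(biased_exponent(3, "double"))   # 1026
```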


Representation of Floating-Point Numbers: IEEE 754 6

Overflow: the positive exponent is too large to be represented in the exponent field.

Underflow: the negative exponent is too large in magnitude to be represented in the exponent field.


Representation of Floating-Point Numbers: IEEE 754-Example 1

13.45 in IEEE single precision floating-point format:

Converting to binary representation (.45 is a recurring binary fraction): 13.45 = 1101.01 1100 1100 1100 1100 …

Normalized: 13.45 = 1.10101 1100 1100 … $\times 2^{3}$

As the number is positive, the sign bit is 0.

Exponent in biased notation: 127 + 3 = 130, or 10000010 in binary.


Representation of Floating-Point Numbers: IEEE 754-Example 2

Fraction is 1.10101 1100 1100 … Omitting the leading 1, the 23 bits for the fractional part are: 10101 1100 1100 1100 1100 11

Thus, the 32 bits are: 0 10000010 10101 1100 1100 1100 1100 11

Summarized as:

Sign   Exponent   Fraction
0      10000010   10101 1100 1100 1100 1100 11

In hex format, the 32 bits are: 4157 3333


Representation of Floating-Point Numbers: IEEE 754-Example 3

The number -13.45 can be represented by changing only the sign bit (i.e., the first bit must be 1 instead of 0).

Hence, the hex number C157 3333 represents -13.45 in IEEE 754 single precision format.
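The single-precision encoding of 13.45 and -13.45 can be checked with Python's standard `struct` module. This is a quick verification sketch; the helper name `float32_fields` is hypothetical.

```python
import struct


def float32_fields(x):
    """Return (raw bits, sign, biased exponent, fraction) for x as a float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return bits, sign, exponent, fraction


bits, sign, exponent, fraction = float32_fields(13.45)
print(f"{bits:08X}")                                  # 41573333
print(sign, f"{exponent:08b}", f"{fraction:023b}")
print(f"{float32_fields(-13.45)[0]:08X}")             # C1573333
```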


Representation of Floating-Point Numbers: IEEE 754-Example 4

13.45 in IEEE double precision floating-point format:

Converting to binary representation: 13.45 = 1101.01 1100 1100 1100 …

Normalized: 13.45 = 1.10101 1100 1100 … $\times 2^{3}$

As the number is positive, the sign bit is 0.

Exponent in biased notation: 1023 + 3 = 1026, or 10000000010 in binary.


Representation of Floating-Point Numbers: IEEE 754-Example 5

Fraction is 1.10101 1100 1100 … Omitting the leading 1, the 52 bits of the fractional part are: 10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

Thus the 64 bits are: 0 10000000010 10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

Summarized as:

Sign   Exponent      Fraction
0      10000000010   10101 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110

In hex format, the 64 bits are: 402A E666 6666 6666


Representation of Floating-Point Numbers: IEEE 754-Example 6

The number -13.45 can be represented by changing only the sign bit (i.e., the first bit must be 1 instead of 0).

Hence, the hex number C02A E666 6666 6666 represents -13.45 in IEEE 754 double precision format.
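Going the other way, the double-precision hex pattern can be decoded by evaluating $(-1)^{S} \times (1 + F) \times 2^{E - 1023}$ with exact rational arithmetic. The sketch below assumes a normalized number (stored exponent between 1 and 2046); the name `decode_double` is hypothetical.

```python
from fractions import Fraction


def decode_double(bits):
    """Evaluate (-1)^S * (1 + F) * 2^(E - 1023) for a normalized 64-bit pattern."""
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = Fraction(bits & ((1 << 52) - 1), 1 << 52)
    value = (1 + fraction) * Fraction(2) ** (exponent - 1023)
    return -value if sign else value


print(float(decode_double(0x402AE66666666666)))   # 13.45
print(float(decode_double(0xC02AE66666666666)))   # -13.45
```

Because Python floats are IEEE 754 doubles on common platforms, the decoded value compares equal to the literal `13.45`, even though neither is exactly the decimal 13.45.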


Special Cases in IEEE 754 Standard

The smallest and highest exponents are used to denote these special cases.

Single Precision         Double Precision         Object Represented
Exponent   Fraction      Exponent   Fraction
0          0             0          0             0
0          Nonzero       0          Nonzero       ± denormalized number
255        0             2047       0             ± infinity
255        Nonzero       2047       Nonzero       NaN (not a number)
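The table maps directly onto a small classifier. The following sketch handles single precision only and takes the raw 32-bit pattern as an integer; the function name `classify_single` and the returned labels are hypothetical.

```python
def classify_single(bits):
    """Classify a 32-bit IEEE 754 single-precision pattern per the table above."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


print(classify_single(0x41573333))   # normalized (13.45)
print(classify_single(0x7F800000))   # infinity
```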


Guard, Round, and Sticky Bits

When the number of bits available is smaller than the number of bits required to represent a number, rounding is employed.

It is desirable to round to the nearest value.

Guard and round bits: the two extra bits that the IEEE standard requires in intermediate representations in order to facilitate better rounding.

Sticky bit: the third intermediate bit sometimes used in rounding. It is set whenever there are non-zero bits to the right of the round bit.


Round, Truncate, and Unbiased

The IEEE standard has 4 rounding modes:

Round up: round toward positive infinity; round up to the next higher number.

Round down: round toward negative infinity; round down to the nearest smaller number.

Truncate: round toward zero. Ignore bits beyond the allowable number of bits. Same as truncation in sign magnitude.

Unbiased: round to nearest. If the number falls halfway, round up half the time and round down half the time. In order to achieve rounding up half the time, add 1 if the lowest bit retained is 1, and truncate if it is 0.
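The unbiased mode can be sketched on a non-negative integer magnitude, which matches the sign-magnitude fraction used by IEEE 754. This is a minimal illustration; the name `round_unbiased` and the idea of passing the number of bits to drop are assumptions made for the example.

```python
def round_unbiased(value, drop):
    """Drop the low `drop` bits of a non-negative integer, rounding to nearest.

    In the halfway case, add 1 only if the lowest retained bit is 1.
    """
    if drop <= 0:
        return value
    kept = value >> drop
    remainder = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    return kept


print(bin(round_unbiased(0b10110, 2)))   # 0b110: halfway, lowest kept bit is 1
print(bin(round_unbiased(0b10010, 2)))   # 0b100: halfway, lowest kept bit is 0
```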


Floating-Point Multiplication 1

Given two floating-point numbers, $F_1 \times 2^{E_1}$ and $F_2 \times 2^{E_2}$, the product is:

$(F_1 \times 2^{E_1}) \times (F_2 \times 2^{E_2}) = (F_1 \times F_2) \times 2^{(E_1 + E_2)} = F \times 2^{E}$

The fraction part of the product is the product of the fractions, and the exponent part of the product is the sum of the exponents.

A floating-point multiplier consists of two major components:

1. A fraction multiplier

2. An exponent adder


Floating-Point Multiplication 2

Procedure for performing floating-point multiplication:

1. Add the two exponents.

2. Multiply the two fractions (significands).

3. If the product is 0, adjust the representation to the proper representation for 0.

4. a. If the product fraction is too big, normalize by shifting it right and incrementing the exponent.
   b. If the product fraction is too small, normalize by shifting left and decrementing the exponent.

5. If an exponent underflow or overflow occurs, generate an exception or error indicator.

6. Round to the appropriate number of bits. If rounding resulted in loss of normalization, go to step 4 again.
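The six steps can be traced in a short Python model. This is a sketch under stated assumptions, not the chapter's hardware design: fractions are exact `Fraction` values in the 2's complement range [-1, 1), the fraction is 4 bits wide (3 fractional bits), the exponent range is -8 to 7, zero uses the exponent -8 as described earlier, and rounding is to nearest with ties to even. The names `fp_multiply` and `is_normalized` are hypothetical.

```python
from fractions import Fraction

FRAC_BITS = 3            # assumed: 4-bit fraction (sign + 3 fractional bits)
E_MIN, E_MAX = -8, 7     # assumed: 4-bit 2's complement exponent range


def is_normalized(f):
    """True if the sign bit and the next bit of a 2's complement fraction differ."""
    return Fraction(1, 2) <= f < 1 or -1 <= f < Fraction(-1, 2)


def fp_multiply(f1, e1, f2, e2):
    e = e1 + e2                                  # step 1: add the exponents
    f = Fraction(f1) * Fraction(f2)              # step 2: multiply the fractions
    if f == 0:                                   # step 3: uniform zero
        return Fraction(0), E_MIN
    while True:
        while f >= 1 or f < -1:                  # step 4a: fraction too big
            f, e = f / 2, e + 1
        while not is_normalized(f):              # step 4b: fraction too small
            f, e = f * 2, e - 1
        if not E_MIN <= e <= E_MAX:              # step 5: exponent range check
            raise OverflowError("exponent overflow or underflow")
        scale = 1 << FRAC_BITS
        f = Fraction(round(f * scale), scale)    # step 6: round to FRAC_BITS
        if is_normalized(f):
            return f, e                          # otherwise repeat from step 4


# 2.5 x (-2.5) with F = 0.101 and 1.011, E = 2: exact product is -6.25.
print(fp_multiply(Fraction(5, 8), 2, Fraction(-5, 8), 2))   # (Fraction(-3, 4), 3)
```

With only 3 fractional bits the result -3/4 x 2^3 = -6 is the nearest representable value to the exact product.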


Flowchart for Floating-Point Multiplication


Hardware Required to Implement the Multiplier

Exponent adder: a 5-bit full adder is used.

Fraction multiplier: implements a shift-and-add multiplier algorithm.

Control unit: provides the signals to perform the appropriate operations of right shifting, left shifting, exponent incrementing/decrementing, and so forth.


SM Chart for Floating-Point Multiplication


The VHDL Behavioral Description for Floating-Point Multiplication 1

The VHDL behavioral description uses three processes:

The main process generates control signals based on the SM chart.

The second process generates the control signals for the fraction multiplier.

The third process tests the control signals and updates the appropriate registers on the rising edge of the clock.


The VHDL Behavioral Description for Floating-Point Multiplication 2

Testing the VHDL code for the floating-point multiplier must be done carefully to account for all the special cases in combination with positive and negative fractions, as well as positive and negative exponents.

When the VHDL code was synthesized for the Xilinx Spartan-3/Virtex-4 architectures using the Xilinx ISE tools, the result was 38 slices, 29 flip-flops, 72 4-input LUTs, 27 I/O blocks, and one global clock circuit.


Floating-Point Addition

Given two floating-point numbers, $F_1 \times 2^{E_1}$ and $F_2 \times 2^{E_2}$, the sum is:

$(F_1 \times 2^{E_1}) + (F_2 \times 2^{E_2}) = F \times 2^{E}$


Procedure for Performing Floating-Point Addition

Procedure for performing floating-point addition:

1. Compare exponents. If the exponents are not equal, shift the fraction with the smaller exponent right and add 1 to its exponent; repeat until the exponents are equal.

2. Add the fractions (significands).

3. If the result is 0, set the exponent to the appropriate representation for 0 and exit.

4. If fraction overflow occurs, shift right and add 1 to the exponent to correct the overflow.

5. If the fraction is unnormalized, shift left and subtract 1 from the exponent until the fraction is normalized.

6. Check for exponent overflow. Set overflow indicator, if necessary.

7. Round to the appropriate number of bits. Is it still normalized? If not, go back to step 4.


Floating-Point Addition- Example 1

Add $F_1 \times 2^{E_1} = 0.111 \times 2^{5}$ and $F_2 \times 2^{E_2} = 0.101 \times 2^{3}$.

Apply the aforementioned steps:

1. Compare exponents. Since E2 does not equal E1, unnormalize the smaller number F2 by shifting right 2 times and adding 2 to the exponent: $0.101 \times 2^{3} = 0.0101 \times 2^{4} = 0.00101 \times 2^{5}$

2. Add the fractions: $(0.111 \times 2^{5}) + (0.00101 \times 2^{5}) = 01.00001 \times 2^{5}$

3. If the result is 0, set the exponent to the appropriate representation for 0 and exit. Result is not 0.


Floating-Point Addition- Example 2

4. If fraction overflow occurs, shift right and add 1 to the exponent to correct the overflow:

This addition caused an overflow into the sign bit position. The final result is: $F \times 2^{E} = 0.100001 \times 2^{6}$

5. If the fraction is unnormalized (or negative), shift left and subtract 1 from the exponent until the fraction is normalized. Example:

$(1.100 \times 2^{-2}) + (0.100 \times 2^{-1})$

$= (1.110 \times 2^{-1}) + (0.100 \times 2^{-1})$ (after shifting F1)

$= 0.010 \times 2^{-1}$ (result of adding fractions unnormalized)

$= 0.100 \times 2^{-2}$ (normalized by shifting left and subtracting 1 from exponent)
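Both addition examples can be reproduced with a short Python model of the seven-step procedure. It is a sketch with several assumptions: fractions are exact `Fraction` values, so no bits are lost while aligning (real hardware would use a fixed-width shifter), the zero exponent and exponent limit follow the 4-bit format used earlier, and rounding is to nearest with ties to even. The names `fp_add` and `is_normalized` are hypothetical.

```python
from fractions import Fraction


def is_normalized(f):
    """True if the sign bit and the next bit of a 2's complement fraction differ."""
    return Fraction(1, 2) <= f < 1 or -1 <= f < Fraction(-1, 2)


def fp_add(f1, e1, f2, e2, frac_bits=3, e_min=-8, e_max=7):
    f1, f2 = Fraction(f1), Fraction(f2)
    while e1 < e2:                               # step 1: align the exponents
        f1, e1 = f1 / 2, e1 + 1
    while e2 < e1:
        f2, e2 = f2 / 2, e2 + 1
    e = e1
    f = f1 + f2                                  # step 2: add the fractions
    if f == 0:                                   # step 3: uniform zero
        return Fraction(0), e_min
    while True:
        if f >= 1 or f < -1:                     # step 4: fraction overflow
            f, e = f / 2, e + 1
        while not is_normalized(f):              # step 5: shift left
            f, e = f * 2, e - 1
        if e > e_max:                            # step 6: exponent overflow
            raise OverflowError("exponent overflow")
        scale = 1 << frac_bits
        f = Fraction(round(f * scale), scale)    # step 7: round
        if is_normalized(f):
            return f, e                          # otherwise back to step 4


# Example 1, keeping 6 fractional bits as the slide does: 0.100001 x 2^6
print(fp_add(Fraction(7, 8), 5, Fraction(5, 8), 3, frac_bits=6))
# Step 5 example: 1.100 x 2^-2 + 0.100 x 2^-1 = 0.100 x 2^-2
print(fp_add(Fraction(-1, 2), -2, Fraction(1, 2), -1))
```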


Hardware Units Required to Implement a Floating-Point Adder

Adder (subtractor) to compare exponents

Shift register to shift the smaller number to the right

ALU (adder) to add fractions

Bidirectional shifter, incrementer/decrementer

Overflow detector

Rounding hardware

Many of these components can be combined.


Overview of a Floating-Point Addition


Other Floating-Point Operations: Subtraction

The procedure is the same as addition, except you must subtract the fractions instead of adding them.

Other steps remain the same.


Other Floating-Point Operations: Division 1

The quotient of 2 floating-point numbers is:

$(F_1 \times 2^{E_1}) \div (F_2 \times 2^{E_2}) = (F_1 / F_2) \times 2^{(E_1 - E_2)} = F \times 2^{E}$

The basic procedure is to divide the fractions and subtract the exponents. In addition to considering the special cases already described, also test for divide by 0 before dividing.


Other Floating-Point Operations: Division 2

If F1 and F2 are normalized, then the largest positive quotient (F) will be: 0.1111 … / 0.1000 … = 01.111 …

This is less than $10_2$, so the fraction overflow is easily corrected. For example:

$(0.110101 \times 2^{2}) \div (0.101 \times 2^{-3}) = 01.010 \times 2^{5} = 0.101 \times 2^{6}$

Alternatively, if $F_1 \geq F_2$, we can shift F1 right before dividing and avoid fraction overflow in the first place.

In the IEEE format, dividing a nonzero number by 0 gives ± infinity, while 0 ÷ 0 gives NaN.
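The division example can be reproduced with a similar sketch. The assumptions here are that the operands are normalized, that fractions are exact `Fraction` values, that the quotient is truncated to 3 fractional bits (which is what the worked example appears to do), and that zero uses the exponent -8. The name `fp_divide` is hypothetical.

```python
import math
from fractions import Fraction


def fp_divide(f1, e1, f2, e2, frac_bits=3, e_zero=-8):
    f1, f2 = Fraction(f1), Fraction(f2)
    if f2 == 0:                                  # test for divide by 0 first
        raise ZeroDivisionError("floating-point divide by 0")
    if f1 == 0:
        return Fraction(0), e_zero
    f = f1 / f2                                  # divide the fractions
    e = e1 - e2                                  # subtract the exponents
    if f >= 1 or f < -1:                         # fraction overflow: one right
        f, e = f / 2, e + 1                      # shift is enough
    while Fraction(-1, 2) <= f < Fraction(1, 2): # unnormalized: shift left
        f, e = f * 2, e - 1
    scale = 1 << frac_bits
    f = Fraction(math.floor(f * scale), scale)   # truncate the extra bits
    return f, e


# (0.110101 x 2^2) / (0.101 x 2^-3) = 0.101 x 2^6
print(fp_divide(Fraction(53, 64), 2, Fraction(5, 8), -3))   # (Fraction(5, 8), 6)
```

The exact quotient is 1.325 x 2^5 = 42.4, so the 3-bit result 0.101 x 2^6 = 40 shows how much precision this small format loses.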
