Higham & Moler -- Three Measures of Precision


A discussion by two experts on the definitions of precision in floating point arithmetic.

Transcript of Higham & Moler -- Three Measures of Precision


Three Measures of Precision in Floating Point Arithmetic

Nick Higham

April 13, 1991

This note is about three quantities that relate to the precision of floating point arithmetic. For t-digit, rounded base b arithmetic the quantities are

1. machine epsilon εm, defined as the distance from 1.0 to the smallest floating point number bigger than 1.0 (and given by εm = b^(1−t), which is the spacing of the floating point numbers between 1.0 and b), and

2. µ = smallest floating point number x such that fl(1 + x) > 1,

3. unit roundoff u = (1/2) b^(1−t) (which is a bound for the relative error in rounding a real number to floating point form).
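These quantities can be checked directly in IEEE double precision (b = 2, t = 53), for instance in Python, whose floats are IEEE doubles; a small sketch (requires Python 3.9+ for math.nextafter; not part of the original note):

```python
import math

# machine epsilon: distance from 1.0 to the next larger floating point number
eps_m = math.nextafter(1.0, 2.0) - 1.0
print(eps_m == 2.0**-52)   # True: b^(1-t) with b = 2, t = 53

# unit roundoff: half the machine epsilon
u = eps_m / 2
print(u == 2.0**-53)       # True
```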

The terminology I have used is not an accepted standard; for example, the name machine epsilon is sometimes given to the quantity in (2). My definition of unit roundoff is as in Golub and Van Loan’s book Matrix Computations [1] and is widely used. I chose the notation eps in (1) because it conforms with MATLAB, in which the permanent variable eps is the machine epsilon. [Ed. note: Well, not quite. See my comments below. – Cleve]

The purpose of this note is to point out that it is not necessarily the case that µ = εm, or that µ = u, as is sometimes claimed in the literature, and that, moreover, the precise value of µ is difficult to predict.

It is helpful to consider binary arithmetic with t = 3. Using binary notation we have

1 + u = 1.00 + .001 = 1.001,

which is exactly half way between the adjacent floating point numbers 1.00 and 1.01. Thus fl(1 + u) = 1.01 if we round away from zero when there is a tie, while fl(1 + u) = 1.00 if we round to an even last digit on a tie. It follows that µ ≤ u with round away from zero (and it is easy to see that µ = u), whereas µ > u for round to even. I believe that round away from zero used to be the more common choice in computer arithmetic, and this may explain why some authors define or characterize u as in (2). However, the widely used IEEE standard 754 binary arithmetic uses round to even.
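The tie-breaking behaviour is easy to observe in IEEE double precision, where u = 2^−53 and round to even is the default; a Python sketch (assuming the platform performs a single round-to-nearest-even, with no double rounding):

```python
u = 2.0**-53   # unit roundoff for IEEE double precision (t = 53)

# 1 + u lies exactly halfway between 1.0 and 1 + 2^-52;
# round to even resolves the tie down to 1.0
print(1.0 + u == 1.0)        # True under round to even

# anything strictly above the halfway point rounds up
print(1.0 + u * 1.5 > 1.0)   # True
```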


So far, then, it is clear that the way in which ties are resolved in rounding affects the value of µ. Let us now try to determine the value of µ with round to even. A little thought may lead one to suspect that µ ≤ u(1 + εm). For in the b = 2, t = 3 case we have

x = u ∗ (1 + εm) = .001 ∗ (1 + .01) = .00101

⇒ fl(1 + x) = fl(1.00101) = 1.01,

assuming “perfect rounding”. I reasoned this way, and decided to check this putative value of µ in 386-MATLAB on my PC. MATLAB uses IEEE standard 754 binary arithmetic, which has t = 53 (taking into account the implicit leading bit of 1). Here is what I found:

>> format compact; format hex
>> x = 2^(-53)*(1+2^(-52)); y = [1+x 1 x]
y = 3ff0000000000000 3ff0000000000000 3ca0000000000001
>> x = 2^(-53)*(1+2^(-11)); y = [1+x 1 x]
y = 3ff0000000000000 3ff0000000000000 3ca0020000000000
>> x = 2^(-53)*(1+2^(-10)); y = [1+x 1 x]
y = 3ff0000000000001 3ff0000000000000 3ca0040000000000

Thus the guess is wrong, and it appears that µ = u(1 + 2^42 ∗ εm) = u(1 + 2^−10) in this environment! What is the explanation?

The answer is that we are seeing the effect of “double-rounding”, a phenomenon that I learned about from an article by Cleve Moler [2]. The Intel floating-point chips used on PCs implement internally the optional extended precision arithmetic described in the IEEE standard, with 64 bits in the mantissa [3]. What appears to be happening in the example above is that ‘1 + x’ is first rounded to 64 bits; if x = u ∗ (1 + 2^−i) and i > 10 then the least significant bit is lost in this rounding. The extended precision number is now rounded to 53-bit precision; but when i > 10 there is a rounding tie (since we have lost the original least significant bit) which is resolved to 1.0, which has an even last bit.
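The two-step rounding can be replayed exactly with rational arithmetic; a Python sketch in which rnd is a hypothetical helper implementing round-to-nearest-even at a given precision (a model of the two roundings only, not of the full x87 pipeline):

```python
from fractions import Fraction

def rnd(x: Fraction, t: int) -> Fraction:
    """Round positive x to t significant binary digits, ties to even."""
    e = 0                                   # find e with 2^e <= x < 2^(e+1)
    while Fraction(2) ** (e + 1) <= x:
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - t + 1)        # value of the last kept digit
    q = x / ulp
    n = q.numerator // q.denominator        # integer part of the significand
    r = q - n                               # fractional part to be rounded off
    if r > Fraction(1, 2) or (r == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return n * ulp

u = Fraction(1, 2**53)
x = u * (1 + Fraction(1, 2**11))            # i = 11 > 10

single = rnd(1 + x, 53)                     # one rounding, straight to 53 bits
double = rnd(rnd(1 + x, 64), 53)            # via 64-bit extended precision

print(single > 1)    # True: fl(1 + x) > 1 with a single rounding
print(double == 1)   # True: the low bit is lost at 64 bits, then the tie goes to 1.0
```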

The interesting fact, then, is that the value of µ can vary even between machines that implement IEEE standard arithmetic.

Finally, I’d like to stress an important point that I learned from the work of Vel Kahan: the relative error in addition and subtraction is not necessarily bounded by u. Indeed on machines such as Crays that lack a guard digit this relative error can be as large as 1. For example, if b = 2 and t = 3, then subtracting from 1.0 the next smaller floating point number we have

Exactly:       Computed, without a guard digit:
  1.00           1.00
 -0.111         -0.11    (the least significant bit is dropped)
 ------         -----
  0.001          0.01


The computed answer is too big by a factor 2 and so has relative error 1! According to Vel Kahan, the example I have given mimics what happens on a Cray X-MP or Y-MP, but the Cray 2 behaves differently and produces the answer zero. Although the relative error in addition/subtraction is not bounded by the unit roundoff u for machines without a guard digit, it is nevertheless true that

fl(a + b) = a(1 + e) + b(1 + f),

where e and f are bounded in magnitude by u.
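The b = 2, t = 3 subtraction above can be replayed exactly with rational arithmetic; a Python sketch in which the truncation step stands in for the missing guard digit (an illustration of the mechanism, not a model of actual Cray hardware):

```python
from fractions import Fraction

a = Fraction(1)        # 1.00  in binary
b = Fraction(7, 8)     # 0.111 in binary, the next smaller floating point number

exact = a - b          # 0.001 = 1/8

# without a guard digit, b is aligned to a's exponent in a 3-digit register,
# so its least significant bit is dropped: 0.111 -> 0.11
b_truncated = Fraction(3, 4)
computed = a - b_truncated   # 0.01 = 1/4

rel_err = abs(computed - exact) / exact
print(rel_err)   # 1: the computed answer is too big by a factor of 2
```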

[1] G. H. Golub and C. F. Van Loan, Matrix Computations, Second Edition, Johns Hopkins Press, Baltimore, 1989.

[2] C. B. Moler, Technical note: Double-rounding and implications for numeric computations, The MathWorks Newsletter, Vol. 4, No. 1 (1990), p. 6.

[3] R. Startz, 8087/80287/80387 for the IBM PC &amp; Compatibles, Third Edition, Brady, New York, 1988.


Editor’s addendum: [Cleve Moler]

I agree with everything Nick has to say, and have a few more comments.

MATLAB on a PC has IEEE floating point with extended precision implemented in an Intel chip. The C compiler generates code with double rounding. MATLAB on a Sun Sparc also has IEEE floating point with extended precision, but it is implemented in a Sparc chip. The C compiler generates code which avoids double rounding.

On both the PC and the Sparc

εm = 2^−52 = 3cb0000000000000 = 2.220446049250313e−16

However, on the PC

µ = 2^−53(1 + 2^−10) = 3ca0040000000000 = 1.111307226797642e−16

While on the Sparc

µ = 2^−53(1 + 2^−52) = 3ca0000000000001 = 1.110223024625157e−16

Note that µ is not 2 raised to a negative integer power.
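MATLAB’s format hex display of these values can be reproduced, for example in Python, by viewing the 64-bit pattern of the double (a sketch; hexfloat is a hypothetical helper):

```python
import struct

def hexfloat(x: float) -> str:
    """16-hex-digit big-endian bit pattern of an IEEE double, as in format hex."""
    return struct.pack('>d', x).hex()

print(hexfloat(2.0**-52))                    # 3cb0000000000000  (machine epsilon)
print(hexfloat(2.0**-53 * (1 + 2.0**-52)))   # 3ca0000000000001  (mu on the Sparc)
```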

MATLAB on a VAX usually uses “D” floating point (there is also a “G” version under VMS). Compared to IEEE floating point, the D format has 3 more bits in the fraction and 3 fewer bits in the exponent. So εm should be 2^−55, but MATLAB says εm is 2^−56. It is actually using the 1 + x > 1 trick to compute what we’re now calling µ. There is no extended precision or double rounding and ties between two floating point values are chopped, so we can find µ by just trying powers of 2.
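The “1 + x > 1 trick” over powers of 2 can be written as a short loop; note that on IEEE hardware with round to even it halts at 2^−52 = εm rather than at µ, because the tie at 1 + 2^−53 rounds back down to 1.0 (a sketch, assuming round-to-nearest-even with no double rounding):

```python
x = 1.0
while 1.0 + x / 2.0 > 1.0:
    x /= 2.0

print(x == 2.0**-52)   # True on IEEE double precision with round to even
```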

On the VAX with D float

εm = 2^−55 = 2.775557561562891e−17

µ = 2^−56 = 1.387778780781446e−17

The definition of εm as the distance from 1.0 to the next floating point number is a purely “geometric” quantity depending only on the structure of the floating point numbers. The point Nick is making is that the more common definition of what we here call µ involves a comparison between 1.0 + x and 1.0 and subtle rounding properties of floating point addition. I now much prefer the simple geometric definition, even though I’ve been as responsible as anybody for the popularity of the definition involving addition.

– Cleve

This is a LaTeXed version of the original post to NA Digest, 13 April 1991. Derek O’Connor,

www.derekroconnor.net

© DEREK O’CONNOR, JULY 29, 2011