What you should know about floating point

Consider using numerical analysis to properly understand the problem. This noncompliant code example takes the mean of 10 identical numbers and checks whether the mean matches that number. It should match, because the 10 numbers are all identical. Yet, because of the imprecision of floating-point arithmetic, the computed mean does not match this number.
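The original example is written in C; the following is a minimal Python sketch of the same noncompliant pattern (the literal 0.1 and the variable names are illustrative, not taken from the original):

```python
# Noncompliant sketch: a computed mean is compared to a float with ==.
values = [0.1] * 10          # ten identical floating-point numbers

total = 0.0
for v in values:
    total += v               # each addition may introduce rounding error

mean = total / len(values)
print(mean == 0.1)           # False on IEEE double precision
print(mean)                  # 0.09999999999999999
```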

The noncompliant code can be fixed by replacing the floating-point numbers with integers for the internal additions. Floats are used only when printing results and when doing the division to compute the mean (a sketch of this fix appears below). Automated static analysis can detect violations of this recommendation; in particular, a checker can test whether the arguments to an equality operator are of a floating-point type.
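A corresponding Python sketch of the fix, with the values accumulated exactly as integer tenths (again, the particular values are illustrative):

```python
# Compliant sketch: accumulate exactly in integer tenths; floating point is
# used only for the final division and for printing.
values_tenths = [101] * 10                 # ten copies of 10.1, stored as 101 tenths

total_tenths = sum(values_tenths)          # exact integer arithmetic
print(total_tenths == 101 * 10)            # True: integer comparison is exact

mean = total_tenths / (10 * len(values_tenths))   # float only for the final result
print(mean)                                # 10.1, for display purposes
```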

Search for vulnerabilities resulting from the violation of this recommendation on the CERT website.

The deeper issue behind examples like this one is the error introduced when two nearby floating-point quantities are subtracted. How bad can the error be? Without a guard digit, the relative error of such a subtraction can be as large as β - 1, where β is the base of the floating-point system.

That is, all of the p digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). With a guard digit, the subtraction is performed with one extra digit of precision before rounding, and the computed difference is then accurate to within a small relative error.

With a single guard digit, the relative error of the result may still be greater than the machine epsilon ε, as in 110 - 8.59 computed with β = 10 and p = 3: the computed difference rounds to 102, compared with the correct answer of 101.41, for a relative error of about .006, which is greater than ε = .005. In general, though, the relative error of the result can be only slightly larger than ε.

More precisely, if x and y are floating-point numbers and the subtraction is done with one guard digit, then the relative rounding error in the computed difference is less than 2ε (Theorem 2). This theorem will be proven in the section Rounding Error. Addition is included in the theorem since x and y can be positive or negative. The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1).

When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example, in the quadratic formula, the expression b^2 - 4ac occurs. The quantities b^2 and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error.

Hence the difference might have an error of many ulps. For example, with b = 3.34, a = 1.22, and c = 2.28 in a decimal format with p = 3, the exact value of b^2 - 4ac is .0292, but b^2 rounds to 11.2 and 4ac rounds to 11.1, so the computed difference is .1, an error of 70 ulps. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications. Benign cancellation occurs when subtracting exactly known quantities. If x and y have no rounding error, then by Theorem 2, if the subtraction is done with a guard digit, the difference x - y has a very small relative error (less than 2ε).

A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

r1 = (-b + sqrt(b^2 - 4ac)) / (2a),  r2 = (-b - sqrt(b^2 - 4ac)) / (2a).   (4)

When b^2 is much larger than 4ac, then b^2 - 4ac does not involve a cancellation and sqrt(b^2 - 4ac) is approximately |b|. But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of r1 by -b - sqrt(b^2 - 4ac) (and similarly for r2) to obtain

r1 = 2c / (-b - sqrt(b^2 - 4ac)),  r2 = 2c / (-b + sqrt(b^2 - 4ac)).   (5)

If b^2 is much larger than 4ac and b > 0, then computing r1 using formula (4) will involve a cancellation. Therefore, use formula (5) for computing r1 and (4) for r2; if b < 0, do the reverse.
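As a hedged illustration of this rearrangement (a Python sketch, not code from the original text), the following function chooses between the two algebraically equivalent expressions so that the addition never cancels:

```python
import math

def quadratic_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, avoiding the catastrophic cancellation
    that occurs in (-b + sqrt(b*b - 4*a*c)) / (2*a) when b*b >> 4*a*c."""
    d = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        r1 = (-b - d) / (2 * a)      # no cancellation: -b and -d have the same sign
        r2 = (2 * c) / (-b - d)      # rearranged formula for the other root
    else:
        r1 = (-b + d) / (2 * a)
        r2 = (2 * c) / (-b + d)
    return r1, r2

# With b*b >> 4*a*c the naive formula loses most digits in one root:
a, b, c = 1.0, 1e8, 1.0
naive_small_root = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
stable_small_root = quadratic_roots(a, b, c)[1]
print(naive_small_root)    # about -7.45e-09: most digits lost to cancellation
print(stable_small_root)   # about -1e-08: accurate
```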

The expression x^2 - y^2 is another formula that exhibits catastrophic cancellation; it is more accurate to evaluate it as (x - y)(x + y). By Theorem 2, the relative error in x - y is at most 2ε. Multiplying two quantities with a small relative error results in a product with a small relative error (see the section Rounding Error). In order to avoid confusion between exact and computed values, the following notation is used. Whereas x - y denotes the exact difference of x and y, x ⊖ y denotes the computed difference (i.e., with rounding error).

Similarly ⊕, ⊗, and ⊘ denote computed addition, multiplication, and division, respectively. Lowercase functions and traditional mathematical notation denote their exact values, as in ln(x) and sqrt(x). Although (x ⊖ y) ⊗ (x ⊕ y) is an excellent approximation to x^2 - y^2, the floating-point numbers x and y might themselves be approximations to some true quantities.

For example, the true quantities might be exactly known decimal numbers that cannot be expressed exactly in binary. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not always) an approximation.

But eliminating a cancellation entirely, as in the quadratic formula, is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation. The area of a triangle can be expressed directly in terms of the lengths of its sides a, b, and c as

A = sqrt(s(s - a)(s - b)(s - c)), where s = (a + b + c)/2.   (6)

Suppose the triangle is very flat; that is, a is approximately b + c. Then s is approximately a, and the term (s - a) in formula (6) subtracts two nearby numbers, one of which may have rounding error. In that case, even though the computed value of s is in error by only a few ulps, the computed area can be in error by many ulps. There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan].

It is

A = sqrt((a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))) / 4, where a >= b >= c.   (7)

If a, b, and c do not satisfy a >= b >= c, rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of a, b, and c above, formula (7) gives a computed area accurate to within a few ulps, far better than the result from (6). Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.
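The following Python sketch contrasts the two formulas on a nearly flat triangle (the side lengths are illustrative and are not the ones used in the text's example):

```python
import math

def area_heron(a, b, c):
    # Formula (6): direct use of Heron's formula.
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

def area_kahan(a, b, c):
    # Formula (7): Kahan's rearrangement; requires a >= b >= c.
    a, b, c = sorted((a, b, c), reverse=True)
    return math.sqrt((a + (b + c)) * (c - (a - b)) *
                     (c + (a - b)) * (a + (b - c))) / 4

# A very flat triangle: b + c barely exceeds a.
a, b, c = 1.0, 0.5000000000000001, 0.5000000000000001
print(area_heron(a, b, c))   # collapses to 0.0: (s - a) cancels completely
print(area_kahan(a, b, c))   # about 5.3e-09, close to the true area
```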

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error quoted is that of the expression as it is actually evaluated, with every operation replaced by its rounded counterpart (this fully written-out form is expression (8)). Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is in error by far less than its theoretical bound. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

One such expression is ln(1 + x), which arises in financial calculations such as compound interest and which is troublesome when x is small: if 1 ⊕ x rounds to 1, the computed logarithm is 0 even though the true value is close to x. The reason for the problem is easy to see: nearly all the information in x is lost when it is added to 1. The remedy is to compute the quantity as x · ln(1 ⊕ x) / ((1 ⊕ x) ⊖ 1), using x itself when 1 ⊕ x = 1. Although the formula may seem mysterious, there is a simple explanation for why it works. Write ln(1 + x) = x · mu(x), where mu(x) = ln(1 + x)/x. The function mu varies slowly near 0, so changing x slightly will not introduce much error. Is there a value close to x for which both it and its sum with 1 can be computed accurately? There is, namely (1 ⊕ x) ⊖ 1, and evaluating mu there yields the accurate formula above (Theorem 4).
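A small Python illustration of the same idea, using the standard library's log1p (which computes ln(1 + x) accurately); the particular value of x is arbitrary:

```python
import math

x = 1e-10   # a small value, as in an interest-rate calculation

naive    = math.log(1.0 + x)   # forming 1.0 + x discards most of x's digits
accurate = math.log1p(x)       # computes ln(1 + x) without that loss

print(naive)      # about 1.00000008e-10: wrong after the 8th significant digit
print(accurate)   # about 9.9999999995e-11: correct to full precision
```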

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not. When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number.

Operations performed in this manner will be called exactly rounded. The previous section gave several examples of algorithms that require a guard digit in order to work properly.

This section gives examples of algorithms that require exact rounding. So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or to 13? One school of thought says it should always round up, to 13. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half; one way to obtain this behavior is round to even, under which 12.5 rounds to 12 because the last digit of 12 is even. Which of these methods is best, round up or round to even? Reiser and Knuth offer the following reason for preferring round to even.

To see the effect, consider the sequence x_0 = x, x_n = (x_{n-1} ⊖ y) ⊕ y, which in exact arithmetic would leave x unchanged. When rounding up, each iteration can add a unit in the last place, so the sequence drifts steadily upward. Under round to even, the sequence stabilizes after the first step and never drifts (Theorem 5). This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language.

The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product x · y as a sum, where each summand has the same precision as x and y. This can be done by splitting x and y. When p is even, it is easy to find a splitting: a p-digit number can be written as the sum of its high-order p/2 digits and its low-order p/2 digits, and each half fits exactly in p/2 digits. When p is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. There is more than one way to split a number.

A splitting method that is easy to compute is due to Dekker, but it requires more than a single guard digit. With such a splitting, each of the products b ⊗ b and a ⊗ c in an expression like b^2 - ac can be written exactly as a sum of half-precision pieces; subtracting the two expansions term by term then gives an estimate of b^2 - ac that matches the exactly rounded result, whereas the directly computed difference can be in error by hundreds of ulps.
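As a hedged sketch of the general idea (this is the standard Veltkamp/Dekker-style splitting and Knuth's two-sum, written for IEEE double precision in Python; it is not the exact presentation used in the original text):

```python
SPLIT = 2.0 ** 27 + 1.0          # Veltkamp splitter for IEEE double (p = 53)

def split(x):
    """Split x into hi + lo, each fitting in about half the significand."""
    c = SPLIT * x
    hi = c - (c - x)
    lo = x - hi
    return hi, lo

def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def two_product(a, b):
    """Return (p, err) with p = fl(a * b) and a * b = p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

# The rounding errors of a sum and a product are recovered exactly:
print(two_sum(1e16, 3.0))                       # (1.0000000000000004e+16, -1.0)
print(two_product(134217729.0, 134217731.0))    # (1.80143990463529e+16, -1.0)
```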

As a final example of exact rounding, consider dividing m by 10. The result is a floating-point number that will, in general, not be exactly equal to m/10; yet if the computed quotient is multiplied back by 10, the original m is recovered. Actually, a more general fact (due to Kahan) is true: if the divisor n has the form 2^i + 2^j (as 10 = 2^3 + 2^1 does) and |m| < 2^(p-1), then (m ⊘ n) ⊗ n = m, provided the operations are exactly rounded.
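A quick empirical check of this fact in Python (IEEE double precision with exactly rounded operations); the range tested below is arbitrary:

```python
# For integers m well below 2**52, dividing by 10 and multiplying back
# recovers m exactly, even though m / 10 itself is generally inexact.
for m in range(1, 1_000_000):
    assert (m / 10) * 10 == m

print("(m / 10) * 10 == m held for every m tested")
```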

We are now in a position to answer the question, does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense that they have a small relative error. The section Cancellation discussed several algorithms that require guard digits to produce correct results in this sense. If the inputs to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting.

The reason is that the benign cancellation x - y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7.

These are useful even if every floating-point variable is only an approximation to some actual value. There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision, and p = 53 for double precision. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either β = 2 or β = 10 and, unlike 754, does not specify how floating-point numbers are encoded into bits. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use.

Base ten is how humans exchange and think about numbers, which is why IEEE 854 allows it as an alternative. There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2.

The section Relative Error and Ulps mentioned one reason: the results of error analyses are much tighter when β is 2, because a rounding error of .5 ulp then corresponds to a narrower range of relative error. A related reason has to do with the effective precision for large bases. Consider the two systems β = 16, p = 1 and β = 2, p = 4: both systems have 4 bits of significand.

In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p - 3 rather than 4p binary bits. Why, then, did IBM choose β = 16 for its System/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the System/370 has β = 16 and p = 6; hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation.

Since the most significant digit of a normalized binary significand is always 1, there is no need to store it explicitly; formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats that this requires a special convention for 0. The method given there was that an exponent of e_min - 1 and a significand of all zeros represents not 1.0 x 2^(e_min - 1), but rather 0.

IEEE single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In IEEE , single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words.
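The 1/8/23 layout can be inspected directly; here is a small Python sketch (the sample values are arbitrary):

```python
import struct

def single_fields(x):
    """Return (sign, biased exponent, significand bits) of an IEEE single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign        = bits >> 31
    exponent    = (bits >> 23) & 0xFF        # 8 bits, biased by 127
    significand = bits & 0x7FFFFF            # 23 explicitly stored bits
    return sign, exponent, significand

print(single_fields(1.0))    # (0, 127, 0): exponent 0 once the bias is removed
print(single_fields(-2.5))   # (1, 128, 2097152)
```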

The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits.

The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits. The standard puts the most emphasis on extended precision, making no recommendation concerning double precision but strongly recommending that implementations should support the extended format corresponding to the widest basic format supported. One motivation for extended precision comes from calculators, which will often display 10 digits but use 13 digits internally.

By displaying only 10 of the 13 digits, the calculator appears to the user as a "black box" that computes exponentials, cosines, etc. For the calculator to compute functions like exp, log, and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place.

Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator.

Extended precision in the IEEE standard serves a similar function. It enables libraries to compute quantities efficiently to within about .5 ulp in single (or double) precision. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered.

It turns out that 9 decimal digits are enough to recover a single precision binary number see the section Binary to Decimal Conversion. When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer.

Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. Next find the appropriate power 10^|P| necessary to scale N.

This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute 10^|P|, and then multiply (or divide, if P < 0) N by it. If this last operation is done exactly, then the closest binary number is recovered.

The section Binary to Decimal Conversion shows how to do the last multiply (or divide) exactly. Thus for |P| <= 13, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e., exactly rounded). If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17-digit decimal number and back would require the double-extended format.
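Python floats are IEEE doubles, so the analogous round-trip there needs 17 significant decimal digits; a quick sketch (the particular value of x is arbitrary):

```python
x = 0.1 + 0.2                          # stored as 0.30000000000000004...

print(float(format(x, ".16g")) == x)   # False: 16 digits round-trip to 0.3 instead
print(float(format(x, ".17g")) == x)   # True: 17 significant digits always suffice
print(float(repr(x)) == x)             # True: repr picks the shortest faithful form
```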

Since the exponent can be positive or negative, some method must be chosen to represent its sign. The two's complement representation is often used in integer arithmetic. In this scheme, a number in the range [-2^(p-1), 2^(p-1) - 1] is represented by the smallest nonnegative number that is congruent to it modulo 2^p.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is k - 127 (or k - 1023). This is often called the unbiased exponent to distinguish it from the biased exponent k.

Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number using round to even.

The section Guard Digits pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding.

By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg ].

Thus the standard can be implemented efficiently. One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic.

Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic.

Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Brown has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware.

Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded. There is not complete agreement on what operations a floating-point standard should cover.

In addition to the basic operations, the IEEE standard specifies that square root, remainder, and conversion between integer and floating-point formats be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker have proposed adding the inner product to the list of operations that are precisely specified.

They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch].

All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen ].

The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350, then 5.083500, and then 5.0835000.

Since exp is transcendental, this could go on arbitrarily long before it becomes clear whether exp(1.626) lies just above or just below the halfway point 5.0835, and hence whether it should round to 5.083 or 5.084. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures.

Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware. On some floating-point hardware every bit pattern represents a valid floating-point number. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands.

Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f).

Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search. That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information.

This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values.

Then when zero(f) probes outside the domain of f, expressions such as 0/0 or the square root of a negative number evaluate to NaN rather than halting the computation, so the code for f returns NaN and the zero finder can continue. That is, zero(f) is not "punished" for making an incorrect guess.
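A Python sketch of this idea (the function and probe points are made up for illustration; note that Python's math.sqrt raises an exception for negative arguments, so the sketch returns a NaN explicitly where raw IEEE arithmetic would produce one):

```python
import math

def f(x):
    # Defined only for x >= 1; outside that domain, return NaN instead of aborting.
    if x < 1.0:
        return math.nan
    return math.sqrt(x - 1.0) - 0.5

# A crude zero finder that probes f at several points and ignores NaN results.
best = None
for i in range(-20, 21):
    x = i / 4.0
    y = f(x)
    if math.isnan(y):
        continue                      # an out-of-domain probe is not "punished"
    if best is None or abs(y) < abs(best[1]):
        best = (x, y)

print(best)                           # (1.25, 0.0): the zero at x = 1.25
```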

With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be: if one operand of an addition or a division is a NaN, the result should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN. Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals. The zero finder could install a signal handler for floating-point exceptions.

Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability. Implementations are free to put system-dependent information into the significand of a NaN. Thus there is not a unique NaN, but rather a whole family of NaNs.

When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.

When a computation overflows, the IEEE default is to return infinity with the appropriate sign. This is much safer than simply returning the largest representable number: an overflowed intermediate quantity propagates as infinity, so the final result is infinity, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer. The division of 0 by 0 results in a NaN. You can distinguish between getting infinity because of overflow and getting infinity because of division by zero by checking the status flags (which will be discussed in detail in the section Flags). The overflow flag will be set in the first case, the division-by-zero flag in the second.

The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number x and take the limit as x goes to infinity. When a subexpression evaluates to a NaN, however, the value of the entire expression is also a NaN. Here is a practical example that makes use of the rules for infinity arithmetic: the expression x/(x^2 + 1) overflows prematurely for large x, but the equivalent expression 1/(x + 1/x) does not, and thanks to infinity arithmetic it even gives the correct value at x = 0, since 1/(0 + 1/0) = 1/infinity = 0.
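A Python sketch of how infinity arithmetic keeps an intermediate overflow from ruining the final answer (note that Python raises an exception for an explicit float division by zero, unlike raw IEEE hardware, so the sketch keeps x nonzero):

```python
x = 1e200

naive  = x / (x * x + 1.0)    # x*x overflows to inf, so the quotient collapses to 0.0
better = 1.0 / (x + 1.0 / x)  # stays in range: 1/x is tiny but harmless

print(naive)    # 0.0
print(better)   # 1e-200, the correct answer
```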

Zero is represented by the exponent e_min - 1 and a zero significand. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. Suppose that x represents a small negative number that has underflowed to zero.

Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return minus infinity. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number. Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation sqrt(1/z) = 1/sqrt(z).

This is certainly true when z >= 0, but it can fail for complex z: if z = -1, the obvious computation gives sqrt(1/(-1)) = sqrt(-1) = i, while 1/sqrt(-1) = 1/i = -i, so the two sides disagree. The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration.

Signed zero provides a perfect way to resolve this problem: complex numbers whose imaginary part is -0 lie on the other side of the branch cut from those whose imaginary part is +0, and the natural formulas for computing 1/z and sqrt(z) respect this distinction, so IEEE arithmetic preserves the identity sqrt(1/z) = 1/sqrt(z) for all z. Some more sophisticated examples are given by Kahan. Distinguishing +0 from -0 has occasional drawbacks; for example, it destroys the relation that x = y exactly when 1/x = 1/y, which fails when x = +0 and y = -0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.

How important is it to preserve the property that x = y exactly when x - y = 0 (call it property (10))? Consider the program fragment: if (x != y) then z = 1/(x - y). If x - y can flush to zero even though x != y, this fragment can fail with a division by zero. Tracking down bugs like this is frustrating and time consuming.

On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof.

Floating-point code is just like any other code: it helps to have provable facts on which to depend. Similarly, knowing that property (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything. The IEEE standard uses denormalized numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting the standard approved.

Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.

Denormals are signaled by the reserved exponent encoding e_min - 1 but are scaled by 2^(e_min): more formally, if the bits in the significand field are b_1, b_2, ..., b_{p-1} and the exponent field holds its reserved minimum value, the number represented is 0.b_1 b_2 ... b_{p-1} x 2^(e_min) rather than the usual 1.b_1 b_2 ... b_{p-1} x 2^e. With denormals, x - y does not flush to zero but is instead represented by the denormalized number closest to the exact difference. This behavior is called gradual underflow. It is easy to verify that property (10) always holds when using gradual underflow. The top number line in the figure shows normalized floating-point numbers; notice the gap between 0 and the smallest normalized number.

If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The "gulf" is filled in, and when the result of a calculation is less than the smallest normalized number, it is represented by the nearest denormal.

When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from β^(-p+1) β^(e_min) to β^(e_min), which is a factor of β^(p-1), rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.
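A quick Python check of property (10) near the bottom of the double-precision range (the particular values are arbitrary small numbers just above the smallest normalized double):

```python
import sys

x = 3.0e-308
y = 2.5e-308                 # both slightly above sys.float_info.min (~2.2e-308)

diff = x - y                 # about 5e-309: below the normalized range
print(diff)                  # a denormalized (subnormal) number, not 0.0
print(diff != 0.0)           # True: x != y still implies x - y != 0
print(sys.float_info.min)    # smallest positive normalized double
```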

Large relative errors can happen even without cancellation, as the following example shows [Demmel]. Consider dividing two complex numbers. The obvious formula, (a + ib)/(c + id) = ((ac + bd) + i(bc - ad))/(c^2 + d^2), can overflow or underflow in its intermediate products even when the true quotient is perfectly representable. A better method of computing the quotient is to use Smith's formula, which rearranges the computation around the ratio d/c or c/d (whichever has magnitude at most 1) so that the intermediate quantities stay in range; it yields an accurate answer where the obvious formula yields nonsense (a sketch appears below). It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to the smallest normalized number.
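The following Python sketch contrasts the obvious formula with Smith's formula (Python's own complex division already uses a Smith-style algorithm, so both are written out by hand here; the operands are chosen so that c^2 + d^2 overflows):

```python
def naive_div(a, b, c, d):
    # (a + ib) / (c + id) via the textbook formula; c*c + d*d can overflow.
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom

def smith_div(a, b, c, d):
    # Smith's formula: scale by the smaller ratio first to stay in range.
    if abs(c) >= abs(d):
        r = d / c
        denom = c + d * r
        return (a + b * r) / denom, (b - a * r) / denom
    else:
        r = c / d
        denom = d + c * r
        return (a * r + b) / denom, (b * r - a) / denom

# Divide (1 + 1j) by a huge complex number: the true quotient is about 1.25e-308.
print(naive_div(1.0, 1.0, 8e307, 8e307))   # (0.0, 0.0): c*c + d*d overflowed to inf
print(smith_div(1.0, 1.0, 8e307, 8e307))   # (1.25e-308, 0.0): correct, a denormal
```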

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are "sticky" in that once set, they remain set until explicitly cleared.

Sometimes continuing execution in the face of exception conditions is not appropriate. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag.

The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined. The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact.

There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident.

The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true: an operation that merely propagates an existing (quiet) NaN returns a NaN without raising the invalid exception. The inexact exception is raised when the result of a floating-point operation is not exact.

Binary to Decimal Conversion discusses an algorithm that uses the inexact exception.

These representation issues are also visible from Python. If Python were to print the true decimal value of the binary approximation stored for 0.1, it would have to display 0.1000000000000000055511151231257827021181583404541015625. That is more digits than most people find useful, so Python keeps the number of digits manageable by displaying a rounded value instead. Interestingly, there are many different decimal numbers that share the same nearest approximate binary fraction. For example, the numbers 0.1, 0.10000000000000001, and 0.1000000000000000055511151231257827021181583404541015625 are all approximated by 3602879701896397 / 2**55. Historically, the Python prompt and built-in repr() function would choose the one with 17 significant digits, 0.10000000000000001. Starting with Python 3.1, Python (on most systems) is now able to choose the shortest of these and simply display 0.1.

Note that this is in the very nature of binary floating-point: this is not a bug in Python, and it is not a bug in your code either. For more pleasant output, you may wish to use string formatting to produce a limited number of significant digits, as in the sketch below. One illusion may beget another: for example, since 0.1 is not exactly 1/10, summing three values of 0.1 may not yield exactly 0.3 either.
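A short Python sketch of both points (the displayed values use standard format() precision codes):

```python
import math

print(format(math.pi, ".12g"))    # 3.14159265359 (12 significant digits)
print(format(math.pi, ".2f"))     # 3.14 (2 digits after the decimal point)

print(0.1 + 0.1 + 0.1 == 0.3)     # False: the rounded display hides real differences
```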

Also, since 0.1 cannot get any closer to the exact value of 1/10 and 0.3 cannot get any closer to the exact value of 3/10, pre-rounding with the round() function cannot help. Though the numbers cannot be made closer to their intended exact values, the round() function can be useful for post-rounding so that results with inexact values become comparable to one another, as shown below. Binary floating-point arithmetic holds many surprises like this. See The Perils of Floating Point for a more complete account of other common surprises.
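A brief demonstration of both behaviors:

```python
print(round(0.1, 1) + round(0.1, 1) + round(0.1, 1) == round(0.3, 1))   # False: pre-rounding does not help
print(round(0.1 + 0.1 + 0.1, 10) == round(0.3, 10))                     # True: post-rounding makes results comparable
```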

For use cases which require exact decimal representation, try using the decimal module which implements decimal arithmetic suitable for accounting applications and high-precision applications.
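A small sketch of the decimal module in use (the quantities are illustrative):

```python
from decimal import Decimal

a = Decimal("0.10")
b = Decimal("0.20")

print(a + b)                       # 0.30, exactly
print(a + b == Decimal("0.30"))    # True: decimal arithmetic is exact here
```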

If you are a heavy user of floating point operations you should take a look at the NumPy package and many other packages for mathematical and statistical operations supplied by the SciPy project. Python provides tools that may help on those rare occasions when you really do want to know the exact value of a float.

The float.as_integer_ratio() method expresses the value of a float as an exact fraction, and the float.hex() method expresses a float in hexadecimal (base 16), again giving the exact value stored by the computer.
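A brief demonstration, applied to the usual example value 0.1:

```python
x = 0.1

print(x.as_integer_ratio())   # (3602879701896397, 36028797018963968): the exact stored fraction
print(x.hex())                # 0x1.999999999999ap-4: the exact bits, in hexadecimal
print(float.fromhex("0x1.999999999999ap-4") == x)   # True: the hex form round-trips exactly
```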


