The Art of Assembly Language
4.4 Rounding
During a calculation, as you have seen, floating-point arithmetic functions may produce a result with greater precision than the floating-point format supports (the guard bits in the calculation maintain this extra precision). When the calculation is complete and the code needs to store the result back into a floating-point variable, something must be done about those extra bits of precision. How the system uses those extra guard bits to affect the bits it does maintain is known as rounding, and how it is done can affect the accuracy of the computation. Traditionally, floating-point software and hardware use one of four different ways to round values: truncation, rounding up, rounding down, or rounding to nearest.
Truncation is easy (it simply discards the guard bits), but it generates the least accurate results in a chain of computations. Few modern floating-point systems use truncation except as a means for converting floating-point values to integers (truncation is the standard conversion when coercing a floating-point value to an integer).
Rounding up is another function that is useful on occasion. Rounding up leaves the value alone if the guard bits are all zero, but if the current mantissa does not exactly fit into the destination bits, then rounding up sets the result to the smallest possible larger value in the floating-point format. Like truncation, this is not a normal rounding mode. It is, however, useful for implementing functions like ceil (which rounds a floating-point value to the smallest possible larger integer).
Rounding down is just like rounding up, except it rounds the result to the largest possible smaller value. This may sound like truncation, but there is a subtle difference between truncation and rounding down. Truncation always rounds towards zero. For positive numbers, truncation and rounding down do the same thing. However, for negative numbers, truncation simply uses the existing bits in the mantissa, whereas rounding down will actually add a one bit to the LO position if the result is negative and the guard bits are not all zero. Like truncation, this is not a normal rounding mode. It is, however, useful for implementing functions like floor (which rounds a floating-point value to the largest possible smaller integer).
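The difference between truncation, rounding up, and rounding down is easiest to see on a sign-and-magnitude mantissa. The following Python fragment is only a minimal sketch under stated assumptions: the names truncate, round_up, round_down, and GUARD_BITS are hypothetical, the mantissa is modeled as a plain integer whose low four bits are the guard bits, and a carry out of the mantissa (which would require renormalizing the exponent) is not handled.

```python
import math

GUARD_BITS = 4                         # assumed guard-bit count for this sketch
GUARD_MASK = (1 << GUARD_BITS) - 1


def truncate(negative, mant):
    """Round toward zero: discard the guard bits, whatever the sign."""
    return mant >> GUARD_BITS


def round_up(negative, mant):
    """Round toward +infinity on a sign-and-magnitude mantissa."""
    kept = mant >> GUARD_BITS
    inexact = (mant & GUARD_MASK) != 0
    # A positive value must grow in magnitude; a negative value is truncated.
    return kept + 1 if (inexact and not negative) else kept


def round_down(negative, mant):
    """Round toward -infinity on a sign-and-magnitude mantissa."""
    kept = mant >> GUARD_BITS
    inexact = (mant & GUARD_MASK) != 0
    # A negative value must grow in magnitude (add one to the LO bit);
    # a positive value is truncated.
    return kept + 1 if (inexact and negative) else kept


# The same distinction, seen through the standard library's integer conversions.
assert math.trunc(-2.5) == -2          # toward zero
assert math.floor(-2.5) == -3          # toward -infinity
assert math.ceil(-2.5) == -2           # toward +infinity
```

For example, with the magnitude 0b1011_0110 (kept bits 1011, guard bits 0110), truncate yields 0b1011 for either sign, round_down yields 0b1011 for a positive value but 0b1100 for a negative one, and round_up does the opposite.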
Rounding to nearest is probably the most intuitive way to process the guard bits. If the value of the guard bits is less than half the value of the LO bit of the mantissa, then rounding to nearest truncates the result to the largest possible smaller value (ignoring the sign). If the guard bits represent some value that is greater than half of the value of the LO mantissa bit, then rounding to nearest rounds the mantissa to the smallest possible greater value (ignoring the sign). If the guard bits represent a value that is exactly half the value of the LO bit of the mantissa, then the IEEE floating-point standard says that half the time it should round up and half the time it should round down. You do this by rounding the mantissa to the value that has a zero in the LO bit position. That is, if the current mantissa already has a zero in its LO bit, you use the current mantissa value; if the current mantissa value has a one in the LO mantissa position, then you add one to the mantissa to round it up to the smallest possible larger value with a zero in the LO bit. This scheme, mandated by the IEEE floating-point standard, produces the best possible result when loss of precision occurs.
Here are some examples of rounding, using 24-bit mantissas, with 4 guard bits (that is, these examples round 28-bit numbers to 24-bit numbers using the rounding to nearest algorithm):
1.000_0100_1010_0100_1001_0101_ 0001 -> 1.000_0100_1010_0100_1001_0101 1.000_0100_1010_0100_1001_0101_ 1100 -> 1.000_0100_1010_0100_1001_0110 1.000_0100_1010_0100_1001_0101_ 1000 -> 1.000_0100_1010_0100_1001_0110 1.000_0100_1010_0100_1001_0100_ 0001 -> 1.000_0100_1010_0100_1001_0100 1.000_0100_1010_0100_1001_0100_ 1100 -> 1.000_0100_1010_0100_1001_0101 1.000_0100_1010_0100_1001_0100_ 1000 -> 1.000_0100_1010_0100_1001_0100
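The rounding-to-nearest rule described above can be expressed in a few lines of code. The following Python fragment is a sketch rather than a definitive implementation: the function names round_to_nearest_even and bits are hypothetical, the mantissa is treated as an unsigned integer whose low-order bits are the guard bits, and a carry out of the 24-bit mantissa (which would require adjusting the exponent) is not handled.

```python
def round_to_nearest_even(mant, guard_bits=4):
    """Drop guard_bits low-order bits from mant, rounding to nearest.

    An exact tie is resolved toward the value whose LO bit is zero.
    """
    kept = mant >> guard_bits                  # bits that survive
    guard = mant & ((1 << guard_bits) - 1)     # bits being discarded
    half = 1 << (guard_bits - 1)               # half the value of the LO kept bit

    if guard > half:
        kept += 1                              # more than half: round up
    elif guard == half and (kept & 1):
        kept += 1                              # tie with odd LO bit: round to even
    # Less than half, or a tie with an even LO bit: keep the truncated value.
    return kept


def bits(text):
    """Parse a string such as '1.000_0100' into an integer, ignoring '.' and '_'."""
    return int(text.replace(".", "").replace("_", ""), 2)


# The third example above: a tie with a one in the LO bit rounds up.
source = bits("1.000_0100_1010_0100_1001_0101_1000")
expected = bits("1.000_0100_1010_0100_1001_0110")
assert round_to_nearest_even(source) == expected
```

Comparing the guard bits against half the value of the LO kept bit covers all three cases in the text, and the test of the LO bit is what sends exact ties up about half the time and down the other half.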