FLOATING-POINT BINARY FORMATS
Floating-point binary formats allow us to overcome most of the limitations of precision and dynamic range mandated by fixed-point binary formats, particularly in reducing the ill effects of overflow [19]. Floating-point formats segment a data word into two parts: a mantissa m and an exponent e. Using these parts, the value of a binary floating-point number n is evaluated as

$$n = m \cdot 2^{e}$$

that is, the number's value is the product of the mantissa and 2 raised to the power of the exponent. (Mantissa is a somewhat unfortunate choice of terms because it has a meaning here very different from that in the mathematics of logarithms. Mantissa originally meant the decimal fraction of a logarithm.)
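As a quick illustration of the mantissa-times-power-of-two idea, here is a minimal Python sketch. The helper names `split_float` and `join_float` are hypothetical, and the example relies on the standard library's `math.frexp`, which returns a mantissa in the range 0.5 ≤ |m| < 1; other normalization conventions are equally valid.

```python
import math

def split_float(n):
    """Return (m, e) with n == m * 2**e and 0.5 <= |m| < 1 for nonzero n."""
    return math.frexp(n)

def join_float(m, e):
    """Rebuild the value from its mantissa and exponent: n = m * 2**e."""
    return math.ldexp(m, e)

m, e = split_float(10.0)
print(m, e)              # 0.625 4
print(join_float(m, e))  # 10.0
```

The round trip is exact because both steps only move the binary point; no mantissa bits are changed.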
Let's assume that a b-bit floating-point number will use $b_e$ bits for the fixed-point signed exponent and $b_m$ bits for the fixed-point signed mantissa. The greater the number of $b_e$ bits used, the larger the dynamic range of the number. The more bits used for $b_m$, the better the resolution, or precision, of the number. Early computer simulations conducted by the developers of b-bit floating-point formats indicated that the best trade-off occurred with $b_e$
Equation 12-28
The floating-point word above can be evaluated to retrieve our decimal number again as
Equation 12-29
After some experience using floating-point normalization, users soon realized that always having a one in the most significant bit of the fraction was wasteful. That redundant one was taking up a single bit position in all data words and serving no purpose. So practical implementations of floating-point formats discard that one, assume its existence, and increase the useful number of fraction bits by one. This is why the term hidden bit is used to describe some floating-point formats. While increasing the fraction's precision, this scheme uses less memory because the hidden bit is not stored; it is merely accounted for in the hardware arithmetic logic. Using a hidden bit, the fraction in Eq. (12-28)'s floating-point number is shifted to the left one place and would now be
Equation 12-30
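The precision gained from a hidden bit can be sketched numerically. The function below is a hypothetical helper, not part of any standard format: it rounds a positive value to a normalized fraction that occupies a fixed number of stored bits, either storing the leading one explicitly or treating it as hidden. The test value 0.1111011 (binary) is chosen only because it needs seven significant bits.

```python
import math

def quantize(value, stored_bits, hidden_bit):
    """Round a positive value to a normalized fraction held in `stored_bits` bits.

    With hidden_bit=True the leading one is implied rather than stored, so the
    stored bits all carry information and precision grows by one bit.
    """
    m, e = math.frexp(value)           # value == m * 2**e, 0.5 <= m < 1
    significant_bits = stored_bits + 1 if hidden_bit else stored_bits
    scaled = round(m * 2 ** significant_bits)
    return math.ldexp(scaled, e - significant_bits)

x = 0b1111011 / 2 ** 7                 # 0.1111011 binary = 0.9609375
print(quantize(x, 6, hidden_bit=False))  # 0.96875 (one bit short, so rounded)
print(quantize(x, 6, hidden_bit=True))   # 0.9609375 (exact)
```

Both calls store six bits; only the interpretation of the leading one differs.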
Recall that the exponent and mantissa bits were fixed-point signed binary numbers, and we've discussed several formats for representing signed binary numbers, i.e., sign magnitude, two's complement, and offset binary. As it turns out, all three signed binary formats are used in industry-standard floating-point formats. The most common floating-point formats, all using 32-bit words, are listed in Table 12-3.
The IEEE P754 floating-point format is the most popular because so many manufacturers of floating-point integrated circuits comply with this standard [8, 20–22]. Its exponent e is offset binary (biased exponent), and its fraction is a sign-magnitude binary number with a hidden bit that's assumed to be $2^{0}$. The decimal value of a normalized IEEE P754 floating-point number is evaluated as
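$$n = (-1)^{s} \cdot 1.f \cdot 2^{e-127} \tag{12-31}$$

where $1.f$ denotes the hidden one followed by the fraction bits f.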
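The evaluation of a normalized IEEE single-precision word can be sketched in Python as follows. The function names are hypothetical; the sketch assumes the standard single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits) and deliberately rejects the reserved exponent codes rather than handling them.

```python
import struct

def float_to_word(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode_ieee754_single(word):
    """Evaluate a normalized single-precision bit pattern (given as an int)."""
    s = (word >> 31) & 0x1        # sign bit
    e = (word >> 23) & 0xFF       # offset binary (biased) exponent
    f = word & 0x7FFFFF           # 23 stored fraction bits
    if e == 0 or e == 0xFF:
        raise ValueError("reserved exponent: not a normalized number")
    # Hidden bit of weight 2**0 plus the fraction, scaled by 2**(e - 127).
    return (-1.0) ** s * (1.0 + f / 2 ** 23) * 2.0 ** (e - 127)

print(hex(float_to_word(1.0)))            # 0x3f800000
print(decode_ieee754_single(0x3F800000))  # 1.0
print(decode_ieee754_single(0xBF800000))  # -1.0
```

Comparing `decode_ieee754_single(float_to_word(x))` against `x` for values exactly representable in single precision is a convenient sanity check.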
The IBM floating-point format differs somewhat from the other floating-point formats because it uses a base of 16 rather than 2. Its exponent is offset binary, and its fraction is sign magnitude with no hidden bit. The decimal value of a normalized IBM floating-point number is evaluated as
$$n = (-1)^{s} \cdot 0.f \cdot 16^{e-64} \tag{12-32}$$
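A comparable sketch for the IBM format is given below. It assumes the usual IBM single-precision layout (1 sign bit, a 7-bit exponent with an offset of 64, and a 24-bit fraction with no hidden bit); the function name is hypothetical and no check for normalization is made.

```python
def decode_ibm_single(word):
    """Evaluate a 32-bit IBM (base-16) floating-point pattern given as an int."""
    s = (word >> 31) & 0x1        # sign bit
    e = (word >> 24) & 0x7F       # 7-bit offset binary exponent, offset 64
    f = word & 0xFFFFFF           # 24-bit fraction, binary point at its left
    # No hidden bit: the fraction is 0.f, and the base is 16 rather than 2.
    return (-1.0) ** s * (f / 2 ** 24) * 16.0 ** (e - 64)

print(decode_ibm_single(0x42640000))  # 100.0
print(decode_ibm_single(0xC276A000))  # -118.625
```

Because the base is 16, each exponent step shifts the fraction by four bits, which is why a normalized IBM fraction may still have up to three leading zero bits.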
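Table 12-3. Floating-Point Number Formats

In each format below, the Field row marks exponent bits with e and fraction bits with f.

IEEE Standard P754 Format

| Bit | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | . . . | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | S | $2^{7}$ | $2^{6}$ | $2^{5}$ | $2^{4}$ | $2^{3}$ | $2^{2}$ | $2^{1}$ | $2^{0}$ | $2^{-1}$ | $2^{-2}$ | $2^{-3}$ | . . . | $2^{-21}$ | $2^{-22}$ | $2^{-23}$ |
| Field | Sign (s) | e | e | e | e | e | e | e | e | f | f | f | f | f | f | f |

IBM Format

| Bit | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | . . . | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | S | $2^{6}$ | $2^{5}$ | $2^{4}$ | $2^{3}$ | $2^{2}$ | $2^{1}$ | $2^{0}$ | $2^{-1}$ | $2^{-2}$ | $2^{-3}$ | $2^{-4}$ | . . . | $2^{-22}$ | $2^{-23}$ | $2^{-24}$ |
| Field | Sign (s) | e | e | e | e | e | e | e | f | f | f | f | f | f | f | f |

DEC (Digital Equipment Corp.) Format

| Bit | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | . . . | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | S | $2^{7}$ | $2^{6}$ | $2^{5}$ | $2^{4}$ | $2^{3}$ | $2^{2}$ | $2^{1}$ | $2^{0}$ | $2^{-2}$ | $2^{-3}$ | $2^{-4}$ | . . . | $2^{-22}$ | $2^{-23}$ | $2^{-24}$ |
| Field | Sign (s) | e | e | e | e | e | e | e | e | f | f | f | f | f | f | f |

MIL-STD 1750A Format

| Bit | 31 | 30 | 29 | . . . | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | $2^{0}$ | $2^{-1}$ | $2^{-2}$ | . . . | $2^{-20}$ | $2^{-21}$ | $2^{-22}$ | $2^{-23}$ | $2^{7}$ | $2^{6}$ | $2^{5}$ | $2^{4}$ | $2^{3}$ | $2^{2}$ | $2^{1}$ | $2^{0}$ |
| Field | f | f | f | f | f | f | f | f | e | e | e | e | e | e | e | e |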
The DEC floating-point format uses an offset binary exponent, and its fraction is sign magnitude with a hidden bit that's assumed to be $2^{-1}$. The decimal value of a normalized DEC floating-point number is evaluated as
$$n = (-1)^{s} \cdot 0.1f \cdot 2^{e-128} \tag{12-33}$$
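The DEC evaluation can be sketched the same way. This hypothetical function follows the logical bit layout shown in Table 12-3 (sign in bit 31, exponent in bits 30 to 23, fraction in bits 22 to 0) and assumes an exponent offset of 128; it does not model the byte ordering that DEC machines use in memory. Treating a zero exponent with a zero sign as the value zero is also an assumption of this sketch.

```python
def decode_dec_single(word):
    """Evaluate a 32-bit DEC floating-point pattern laid out as in Table 12-3."""
    s = (word >> 31) & 0x1        # sign bit
    e = (word >> 23) & 0xFF       # 8-bit offset binary exponent, offset 128
    f = word & 0x7FFFFF           # 23 stored fraction bits, weights 2**-2 ... 2**-24
    if e == 0:
        if s:
            raise ValueError("sign = 1 with a zero exponent is not a valid number")
        return 0.0
    # Hidden bit of weight 2**-1 plus the stored fraction bits.
    return (-1.0) ** s * (0.5 + f / 2 ** 24) * 2.0 ** (e - 128)

print(decode_dec_single(0x40800000))  # 1.0
print(decode_dec_single(0x40C00000))  # 1.5
```

Note that the same bit pattern 0x40800000 evaluates to 4.0 under the IEEE rules, which shows how much the hidden-bit position and exponent offset matter.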
MIL-STD 1750A is a United States Military Airborne floating-point standard. Its exponent e is a two's complement binary number residing in the least significant eight bits. MIL-STD 1750A's fraction is also a two's complement number (with no hidden bit), and that's why no sign bit is specifically indicated in Table 12-3. The decimal value of a MIL-STD 1750A floating-point number is evaluated as
$$n = f \cdot 2^{e} \tag{12-34}$$
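A minimal sketch of the MIL-STD 1750A evaluation follows, using the layout in Table 12-3: a 24-bit two's complement fraction in bits 31 to 8 and an 8-bit two's complement exponent in bits 7 to 0. The function name is hypothetical, and no normalization check is performed.

```python
def decode_1750a_single(word):
    """Evaluate a 32-bit MIL-STD 1750A floating-point pattern given as an int."""
    f_raw = (word >> 8) & 0xFFFFFF   # 24-bit two's complement fraction
    e_raw = word & 0xFF              # 8-bit two's complement exponent
    if f_raw & 0x800000:             # bit 31 set: the fraction is negative
        f_raw -= 1 << 24
    if e_raw & 0x80:                 # bit 7 set: the exponent is negative
        e_raw -= 1 << 8
    # Bit 31 carries weight -2**0, so the fraction lies in [-1, 1).
    return (f_raw / 2 ** 23) * 2.0 ** e_raw

print(decode_1750a_single(0x40000001))  # 1.0
print(decode_1750a_single(0x80000000))  # -1.0
```

No separate sign handling is needed because the two's complement fraction carries its own sign.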
Notice how the floating-point formats in Table 12-3 all have word lengths of 32 bits. This was not accidental. Using 32-bit words makes these formats easier to handle using 8-, 16-, and 32-bit hardware processors. That fact notwithstanding, and given the advantages afforded by floating-point number formats, these formats do require a significant number of logical comparisons and branching operations to perform arithmetic correctly. Reference [23] provides useful flow charts showing what procedural steps must be taken when floating-point numbers are added and multiplied.
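To give a feel for the comparisons and branching involved, here is a minimal sketch of floating-point addition. It is not the procedure from reference [23]: it assumes non-negative operands, represents each number as an integer mantissa `m` and exponent `e` with value m · 2^e, truncates rather than rounds, and ignores exponent overflow. The function name and the 8-bit mantissa width are illustrative choices.

```python
def fp_add(m1, e1, m2, e2, bits=8):
    """Add two non-negative values m * 2**e with `bits`-bit integer mantissas."""
    # Step 1: make operand 1 the one with the larger exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Step 2: align binary points by shifting the smaller operand right
    # (its low-order bits are lost).
    m2 >>= (e1 - e2)
    # Step 3: add the aligned mantissas.
    m, e = m1 + m2, e1
    # Step 4: on carry out of the mantissa, shift right and bump the exponent.
    if m >= (1 << bits):
        m >>= 1
        e += 1
    # Step 5: renormalize so the most significant mantissa bit is a one.
    while m and m < (1 << (bits - 1)):
        m <<= 1
        e -= 1
    return m, e

# 3.0 (192 * 2**-6) plus 1.0 (128 * 2**-7) gives 4.0 (128 * 2**-5).
print(fp_add(192, -6, 128, -7))  # (128, -5)
```

Even this stripped-down version needs a compare, two conditional shifts, and a normalization loop, whereas a fixed-point addition is a single operation.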
12.4.1 Floating-Point Dynamic Range
Attempting to determine the dynamic range of an arbitrary floating-point number format is a challenging exercise. We start by repeating the expression for a number system's dynamic range from Eq. (12-6) as
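$$\text{dynamic range}_{\text{dB}} = 20 \log_{10}\left(\frac{\text{largest possible word value}}{\text{smallest possible word value}}\right) \tag{12-35}$$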
When we attempt to determine the largest and smallest possible values for a floating-point number format, we quickly see that they depend on such factors as
- the position of the binary point
- whether a hidden bit is used or not (If used, its position relative to the binary point is important.)
- the base value of the floating-point number format
- the signed binary format used for the exponent and the fraction (For example, recall from Table 12-2 that the binary two's complement format can represent larger negative numbers than the sign-magnitude format.)
- how unnormalized fractions are handled, if at all. (Unnormalized, also called gradual underflow, means a nonzero number that's less than the minimum normalized format but can still be represented when the exponent and hidden bit are both zero.)
- how exponents are handled when they're either all ones or all zeros. (For example, the IEEE P754 format treats a number having an all ones exponent and a nonzero fraction as an invalid number, whereas the DEC format handles a number having a sign = 1 and a zero exponent as a special instruction instead of a valid number.)
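The last two factors can be made concrete for the IEEE single-precision case. The sketch below classifies a 32-bit pattern by its exponent and fraction fields and evaluates unnormalized (subnormal) words; the function names are hypothetical, and the bias of 127 and the subnormal scale of 2^-126 are those of the standard single-precision format.

```python
def classify_ieee754_single(word):
    """Name the category of a 32-bit IEEE 754 single-precision pattern."""
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    if e == 0xFF:                      # all-ones exponent
        return "nan" if f else "infinity"
    if e == 0:                         # all-zeros exponent
        return "subnormal" if f else "zero"
    return "normalized"

def decode_subnormal(word):
    """Evaluate a subnormal word: hidden bit is zero, exponent fixed at -126."""
    s = (word >> 31) & 0x1
    f = word & 0x7FFFFF
    return (-1.0) ** s * (f / 2 ** 23) * 2.0 ** -126

print(classify_ieee754_single(0x7FC00000))     # nan
print(classify_ieee754_single(0x00000001))     # subnormal
print(decode_subnormal(0x00000001) == 2.0 ** -149)  # True
```

The smallest subnormal, 2^-149, is far below the smallest normalized value, 2^-126, which is one reason a single dynamic range figure is hard to pin down.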
Trying to develop a dynamic range expression that accounts for all the possible combinations of the above factors is impractical. What we can do is derive a rule-of-thumb expression for dynamic range that's often used in practice [8,22,24].
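Let's assume the following for our derivation: the exponent is a $b_e$-bit offset binary number, the fraction is a normalized sign-magnitude number having a sign bit and $b_m$ magnitude bits, and a hidden bit is used just left of the binary point. Our hypothetical floating-point word takes the following form, where the Field row marks exponent bits with e and fraction bits with f:

| Bit | $b_m+b_e$ | $b_m+b_e-1$ | $b_m+b_e-2$ | · · · | $b_m+1$ | $b_m$ | $b_m-1$ | $b_m-2$ | . . . | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | S | $2^{b_e-1}$ | $2^{b_e-2}$ | · · · | $2^{1}$ | $2^{0}$ | $2^{-1}$ | $2^{-2}$ | . . . | $2^{-b_m+1}$ | $2^{-b_m}$ |
| Field | Sign (s) | e | e | e | e | e | f | f | f | f | f |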
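First we'll determine what the largest value can be for our floating-point word. The largest fraction is a one in the hidden bit, and the remaining $b_m$ fraction bits are all ones. This would make the fraction $1 + (1 - 2^{-b_m})$, where the leading 1 is the hidden bit and the term in parentheses is the value of $b_m$ ones to the right of the binary point. The largest value is this fraction times two raised to the exponent's most positive value,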
Equation 12-36
The smallest value we can represent with our floating-point word is a one in the hidden bit times two raised to the exponent's most negative value,
Equation 12-37
Plugging Eqs. (12-36) and (12-37) into Eq. (12-35),
Equation 12-38
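One way to write out these three steps is sketched below. It assumes that the offset binary exponent spans $-2^{b_e-1}$ to $2^{b_e-1}-1$, which is an assumption of this sketch; a different offset or reserved exponent codes would change the limits.

```latex
% Assumed exponent range: -2^{b_e-1} \le e \le 2^{b_e-1} - 1
\begin{align*}
\text{largest value}  &= \left[1 + \left(1 - 2^{-b_m}\right)\right] \cdot 2^{\,2^{b_e-1}-1} \\
\text{smallest value} &= 1 \cdot 2^{-2^{b_e-1}} \\
\text{dynamic range}_{\text{dB}}
  &= 20 \log_{10}\!\left(
       \frac{\left(2 - 2^{-b_m}\right) \cdot 2^{\,2^{b_e-1}-1}}{2^{-2^{b_e-1}}}
     \right)
   = 20 \log_{10}\!\left[\left(2 - 2^{-b_m}\right) \cdot 2^{\,2^{b_e}-1}\right]
\end{align*}
```

The last step simply adds the two exponents: $(2^{b_e-1}-1) + 2^{b_e-1} = 2^{b_e}-1$.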
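Now here's where the thumb comes in: when $b_m$ is large, say over seven, the $2^{-b_m}$ weight of the least significant fraction bit becomes negligible, so the all-ones fraction is very nearly one and the largest mantissa, hidden bit included, approaches 2. With that approximation,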
Equation 12-39
Using Eq. (12-39) we can estimate, for example, the dynamic range of the single-precision IEEE P754 standard floating-point format with its eight-bit exponent:
Equation 12-40
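The estimate can be checked numerically with a short sketch. It assumes the exponent spans $-2^{b_e-1}$ to $2^{b_e-1}-1$ with no reserved codes, under which the ratio of largest to smallest value approaches $2^{2^{b_e}}$ and the dynamic range is roughly $6.02 \cdot 2^{b_e}$ dB; both that range and the function names are assumptions of this example.

```python
import math

DB_PER_OCTAVE = 20 * math.log10(2)   # about 6.02 dB per factor of two

def exact_dynamic_range_db(b_e, b_m):
    """Dynamic range in dB for a hidden-bit format under the assumed exponent range."""
    max_exponent = 2 ** (b_e - 1) - 1
    min_exponent = -(2 ** (b_e - 1))
    # Work in log2 so very large exponents cannot overflow a float.
    log2_largest = math.log2(2 - 2.0 ** -b_m) + max_exponent
    log2_smallest = min_exponent     # hidden bit alone: 1 * 2**min_exponent
    return DB_PER_OCTAVE * (log2_largest - log2_smallest)

def rule_of_thumb_db(b_e):
    """Approximation for large b_m: 20*log10(2**(2**b_e)), about 6.02 * 2**b_e dB."""
    return DB_PER_OCTAVE * 2 ** b_e

print(round(rule_of_thumb_db(8), 2))            # 1541.27
print(round(exact_dynamic_range_db(8, 23), 2))  # 1541.27
```

With a 23-bit fraction the exact and approximate figures agree to well under a thousandth of a decibel. Formats that reserve exponent codes, as IEEE single precision does for its all-zeros and all-ones exponents, have slightly fewer usable exponent values, so their normalized range comes out somewhat lower than this estimate.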
Although we've introduced the major features of the most common floating-point formats, there are still more details to learn about floating-point numbers. For the interested reader, the references given in this section provide a good place to start.