SQL Performance Tuning
Numbers
There are three kinds of numbers: integers, floats (approximate numbers), and fixed-decimal numbers. (Technically, integers are a type of fixed-decimal number with a scale of zero, but storage methods for integers and decimal numbers differ so we'll talk about them separately.) There is also a number type that auto-incrementswe'll call this a serial. All DBMSs except Oracle use similar numeric formats, so in our discussion of numeric data types, we'll talk about the "normal" DBMSs first and relegate Oracle's treatment of numbers to a sidebar. Integers
The SQL Standard provides two data types for columns that contain integers: INTEGER (or INT) and SMALLINT. Table 7-5 shows the SQL Standard requirements and the level of support (data type and size in bytes) the Big Eight have for these data types. Table 7-5. ANSI/DBMS Integer Support
Notes on Table 7-5:
Any number that has a scale of zero (no digits after the decimal point) should be in an integer column, because integer arithmetic is faster than arithmetic with a decimal point. In fact, the fastest arithmetic occurs with the specific SQL data type INTEGER because on a 32-bit machine (such as a Pentium), 32 bits is the native word size. Thus, for example: CREATE TABLE Table1 ( column1 SMALLINT) INSERT INTO Table1 ... SELECT column1 * 1234 FROM Table1 is slower than: CREATE TABLE Table1 ( column1 INTEGER) INSERT INTO Table1 ... SELECT column1 * 1234 FROM Table1 GAIN: 5/8 although the difference is unnoticeable unless millions of calculations happen. The INTEGER data type is also the default data type of any literal that contains all digits. Consider the expression: ... WHERE column1 = 555 If column1 is a SMALLINT, the DBMS will have to cast its value to INTEGER before it can make the comparison with the integer literal 555 . The other integer data types have special advantages though:
Here's another consideration: At the beginning of this chapter, we said it's important to choose a data type that won't allow impossible values for a column. For example, if you have a list of mileages between world cities, using SMALLINT (and perhaps UNSIGNED if the DBMS allows it) gives you a free check constraintit should be impossible to insert the absurd value 35000 . We still hold to this advice, with one caveatIt Won't Always Work. That is: "On some systems, for example, the numeric operations for some data types may silently underflow or overflow." PostgreSQL Interactive Documentation "When asked to store a value in a numeric column that is outside the column type's allowable range, MySQL clips the value to the appropriate endpoint of the range and stores the resulting value instead. Conversions that occur due to clipping are reported as 'warnings' for ALTER TABLE, LOAD DATA INFILE, UPDATE, and multi-row INSERT [but not for single-row INSERT]." MySQL Reference Manual (Square bracket comment ours) If input errors are a serious worry, only an explicit CHECK constraint will do the job right. If you decide to define column1 as INTEGER, don't just consider whether the largest possible value is less than or equal to 214783647, but also whether SUM(column1) will be less than or equal to 214783647. A DBMS will normally decide that the sum of an INTEGER column is a BIGINT or a DECIMALbut it could just overflow instead. Why are telephone numbers stored in CHAR(12) columns instead of BIGINT columns? The answer is that, although a CHAR(12) column is longer, there are built-in functions for all the operations that are likely to be performed on telephone numbers: LIKE, SUBSTRING, SIMILAR, and so on. Meanwhile, all the built-in numeric operators (+ / *) are useless for telephone numbers. The general rule in such cases is that a column should have a non-numeric data type if all appropriate operations for it are non-numeric. A similar generality can be stated for the temporal data types. Recommendation: Prefer INTEGER for integers unless maximum valuesincluding those from arithmetic operationsexceed the INTEGER range. Floats
The SQL Standard provides three data types for columns that contain floats: REAL, FLOAT, and DOUBLE PRECISION (or DOUBLE). Table 7-6 shows the SQL Standard requirements and the level of support (data type and precision) the Big Eight have for these data types. Notes on Table 7-6:
Table 7-6. ANSI/DBMS Float Support
The official SQL term for FLOAT, REAL, and DOUBLE PRECISION values is "approximate numeric" but "floating point" or simply "float" is common. The key point for all such values is that the decimal point is floating. (If the decimal point is fixed, see the next section.) You should use float literals when working with floats. For example, use this type of expression: UPDATE Table1 SET float_column = 1.25E02 instead of: UPDATE Table1 SET float_column = 125
Portability MySQL won't accept a float literal unless it has a two-digit exponent. That is, the literal 1.25E02 is acceptable, but 1.25E2 is not. All other DBMSs allow you to drop the leading zero.
Floating-point operations are fast if they go through the computer's Floating Point Unit (FPU), but a compiler can make the cautious assumption that no FPU is present. In that case, floating-point operations are slow because they are emulated instead of performed with the FPU. When you install a DBMS, the installer should detect the FPU automatically and bring in the right code, whether FPU dependent or emulated, so make sure you rerun the install program after hardware upgrades or moves. Take another look at Table 7-6 and the precisions shown for the float data types. The range of IEEE single-precision float is from 1.175E-37 to +3.402E+38, to 7 decimal digits precision, although some DBMSs are more cautious in stating the actual range supported. The range of IEEE double-precision float is from 2.225E-307 to +1.798E+308, to 15 decimal digits precision. Again, some DBMSs give a slightly smaller range. These sizes are shown in Table 7-7. Table 7-7 shows that the range of a single-precision float is 1.175E-37 to +3.402E+38. In reality it isn't possible to store all the real numbers in that range in a four-byte space; it isn't even possible to store all the integers in that range in four bytes (the range of a four-byte INTEGER is from 2.14E9 to +2.14E9). So for most numbers in the single-precision range, you'll need to use whatever number is closest that can be represented in a single float. In other words, a floating-point number is exact in bit combinationsthat is, all bit combinations are exactbut it might not be exactly the same as the number that was inserted originally. Hence the name approximate. The question that arises from this isIs it better to use DOUBLE PRECISION for float columns or REAL? (We'll ignore FLOAT entirely because REAL and DOUBLE are just synonyms for predefined sizes of FLOAT.) We checked the Big Eight to see what happens if the same number is stored both ways. First, we created this table: CREATE TABLE Table1 ( real_column REAL, double_column DOUBLE PRECISION) Table 7-7. IEEE 754 Floats
Then we inserted the same number into both columns and selected: INSERT INTO Table1 VALUES (0.01, 0.01) SELECT * FROM Table1 WHERE real_column = 0.01 /* result is zero rows, "incorrect" */ SELECT * FROM Table1 WHERE double_column = 0.01 /* result is one row, "correct" */ Most DBMSs returned the "correct" result for the second SELECT, but were unable to find a row where real_column contained the value 0.01 . Due to their greater precision, DOUBLE PRECISION columns return the expected result more frequently than do REAL columns. Recommendation: Prefer DOUBLE PRECISION for floats. Decimals
The SQL Standard provides two data types for columns that contain fixed-decimal numbers: DECIMAL and NUMERIC. Table 7-8 shows the SQL Standard requirements and the level of support (data type and maximum precision in digits) the Big Eight have for these data types. Table 7-8. ANSI/DBMS Decimal Support
Notes on Table 7-8:
A fixed-decimal number is one that has a fixed-decimal precision. For example, DECIMAL(7,2) has a precision of seven digits, with a fixed scalethe number of digits after the decimal pointof two digits. If the data type of a column is DECIMAL, then the actual precision must be at least as big as the defined precision. If the data type of a column is NUMERIC, then the actual precision should be exactly the same as the defined precision, but in practice most DBMSs treat NUMERIC as a synonym for DECIMAL. Any numeric literal can take three forms, as shown in Table 7-9. Notice that Table 7-9 shows that the number 17.7 is a DECIMALnot a FLOAT. This can confuse C programmers because in C the literal value 17.7 is a float . All DBMSs except for InterBase and MySQL store DECIMAL values as a packed string of decimal digits, usually with two decimal digits per byte; see the sidebar "Oracle Numbers." InterBase, as mentioned earlier, stores decimal values inside integers, while MySQL stores them as unpacked strings. Table 7-9. Forms of a Numeric Literal
The primary advantage of DECIMAL is that it is easy to cast to CHAR, because the number isn't stored in binary form. The primary disadvantage is that DECIMAL values must be converted to binary form before some arithmetic operations can be performed on them. In general the conversion requires only your patiencethe DBMS handles multiple-digit arithmetic slowly, but it gets there. One exception exists, though: MySQL avoids the slowness of multidigit arithmetic by converting to floats when one of the operands is decimal. So this work around is primarily of interest to MySQL users:
To add 7.35 to such a divided number, add 35 to column1_post_decimal_point . If the result is greater than 100, subtract 100 and carry one, then add seven plus the carry to column1_pre_decimal_point . The work around sounds awfulbut it works more quickly than a decimal addition. We still live in an era where the questionShould I use floats for decimals?makes some sense, but the era is fast approaching its end. The fact is that most sensible bean-count values (i.e., money) can be stored and manipulated as integers now, and with 64-bit processors the values are trillions. The only reason to use floats for dollars is that some host languages still have no equivalent for DECIMAL. It is notable that all the DBMSs that support a MONEY data type store MONEY internally as a DECIMAL with a predefined scale. Recommendation: Prefer DECIMAL for fixed-decimal numbers and for most floats.
Serials
The SQL:1999 Standard does not provide a data type for columns that contain "serial" numbers but most DBMSs do provide support for auto-incremented, or monotonic, values. Table 7-10 shows the level of support the Big Eight have for serial data types. Table 7-10. ANSI/DBMS Serial Support
Notes on Table 7-10
The various "serial" data types are usually 4-byte integers, with values assigned by the DBMS. The idea is that the DBMS will increment the value for each row you INSERT. Serial data types are useful for ensuring that each row of a table has a unique identifier. For example, you could create a table with Informix's nonstandard SQL-extension serial data type: CREATE TABLE Table1 ( column1 SERIAL PRIMARY KEY, column2 INTEGER) Because the value of column1 increases each time you INSERT a value into column2 , the uniqueness of each row is guaranteed . The problem with serials is that they can cause trouble with concurrency control; see Chapter 15, "Locks." Recommendation: Prefer INTEGER for serials and do your own value assignments. The Bottom Line: Numbers
All DBMSs except Oracle use similar numeric formats, so look for our Oracle-specific recommendations only if you use Oracle. Any number with a scale of zero (no digits after the decimal point) should be in an INTEGER column. INTEGER is the default data type of any literal that contains all digits. SMALLINT is the only other SQL Standard data type and is usually just half the size of INTEGER. TINYINT is the smallest integer data type and is still large enough for most personal information. BIGINT is appropriate for numbers larger than two billion though working with compilers that can't handle such large numbers could be problematic . Floating-point operations are fast if they go through the computer's FPU. When you install a DBMS, the installer should detect the FPU automatically and bring in the right code, whether FPU dependent or emulated, so make sure you rerun the install program after hardware upgrades. The primary advantage of DECIMAL is that it is easy to cast to CHAR, because the number isn't stored in binary form. The primary disadvantage of DECIMAL is that it must be converted to binary form before some arithmetic operations can be performed on it. Serial data types are useful for ensuring each row of a table has a unique identifier. The problem with serials is that they can cause trouble with concurrency control. Recommendation if you don't use Oracle: Prefer INTEGER for integers unless maximum valuesincluding those from arithmetic operationsexceed the INTEGER range. Prefer DECIMAL for fixed-decimal numbers and for most floats. Prefer DOUBLE PRECISION for floats where DECIMAL isn't appropriate. Prefer INTEGER for serials and do your own value assignments. Recommendation if you use Oracle: Data type doesn't really matter because all numbers are stored the same. Avoid multiply and divide because the packed decimal numbers will have to be converted before they can be operated on. Allow free space on each page because rows are all variable-length. Don't worry about floating-point storage because it's not there. |