SQL Performance Tuning

2017-07-07 02:10:07

Numbers

There are three kinds of numbers: integers, floats (approximate numbers), and fixed-decimal numbers. (Technically, integers are a type of fixed-decimal number with a scale of zero, but storage methods for integers and decimal numbers differ so we'll talk about them separately.) There is also a number type that auto-incrementswe'll call this a serial. All DBMSs except Oracle use similar numeric formats, so in our discussion of numeric data types, we'll talk about the "normal" DBMSs first and relegate Oracle's treatment of numbers to a sidebar.

Integers

The SQL Standard provides two data types for columns that contain integers: INTEGER (or INT) and SMALLINT. Table 7-5 shows the SQL Standard requirements and the level of support (data type and size in bytes) the Big Eight have for these data types.

Table 7-5. ANSI/DBMS Integer Support

	INTEGER	SMALLINT	BIGINT/LONGINT	TINYINT/INTEGER1
ANSI SQL	Yes	Yes	No	No
IBM	4	2	8	No
Informix	4	2	8	No
Ingres	4	2	No	1
InterBase	4	2	No	No
Microsoft	4	2	8	1
MySQL	4	2	8	1
Oracle	Yes	Yes	No	No
Sybase	4	2	No	1

Notes on Table 7-5:

INTEGER column

A 32-bit binary (four bytes) allows for the range 2147483648 to +2147483647. ANSI and Informix disallow the value 2147483648 (hexadecimal 80000000).

SMALLINT column

A 16-bit binary (two bytes) allows for the range 32768 to +32767. ANSI and Informix disallow the value 32768 (hexadecimal 8000).

BIGINT/LONGINT column

A 64-bit binary (eight bytes) allows for the range 223372036854775808 to +9223372036854775807. Informix calls this data type INT8.

TINYINT/INTEGER1 column

An 8-bit binary (one byte) allows for the range 128 to +127. Microsoft and Sybase support unsigned TINYINT only, with a range from 0 to +255. MySQL supports both signed TINYINT (range 128 to +127) and unsigned TINYINT (range 0 to +255).

Oracle accepts columns defined as INTEGER or SMALLINT but treats all numbers differently from other DBMSs; see the sidebar "Oracle Numbers."

Any number that has a scale of zero (no digits after the decimal point) should be in an integer column, because integer arithmetic is faster than arithmetic with a decimal point. In fact, the fastest arithmetic occurs with the specific SQL data type INTEGER because on a 32-bit machine (such as a Pentium), 32 bits is the native word size. Thus, for example:

CREATE TABLE Table1 ( column1 SMALLINT) INSERT INTO Table1 ... SELECT column1 * 1234 FROM Table1

is slower than:

CREATE TABLE Table1 ( column1 INTEGER) INSERT INTO Table1 ... SELECT column1 * 1234 FROM Table1 GAIN: 5/8

although the difference is unnoticeable unless millions of calculations happen.

The INTEGER data type is also the default data type of any literal that contains all digits. Consider the expression:

... WHERE column1 = 555

If column1 is a SMALLINT, the DBMS will have to cast its value to INTEGER before it can make the comparison with the integer literal 555 .

The other integer data types have special advantages though:

SMALLINT is the only other SQL Standard data type and is just half the size of INTEGER.

TINYINT is the smallest and is still large enough for personal information, but beware: Even seemingly restricted data like "number of wives" or "height in centimeters" could cause overflow. (Solomon had 700 wives, and Goliath was six cubits and a span, or about 290 centimeters tall.)

BIGINT is appropriate for numbers larger than two billion, though working with compilers that can't handle such large numbers could be troublesome .

Here's another consideration: At the beginning of this chapter, we said it's important to choose a data type that won't allow impossible values for a column. For example, if you have a list of mileages between world cities, using SMALLINT (and perhaps UNSIGNED if the DBMS allows it) gives you a free check constraintit should be impossible to insert the absurd value 35000 . We still hold to this advice, with one caveatIt Won't Always Work. That is:

"On some systems, for example, the numeric operations for some data types may silently underflow or overflow."

PostgreSQL Interactive Documentation

"When asked to store a value in a numeric column that is outside the column type's allowable range, MySQL clips the value to the appropriate endpoint of the range and stores the resulting value instead. Conversions that occur due to clipping are reported as 'warnings' for ALTER TABLE, LOAD DATA INFILE, UPDATE, and multi-row INSERT [but not for single-row INSERT]."

MySQL Reference Manual (Square bracket comment ours)

If input errors are a serious worry, only an explicit CHECK constraint will do the job right.

If you decide to define column1 as INTEGER, don't just consider whether the largest possible value is less than or equal to 214783647, but also whether SUM(column1) will be less than or equal to 214783647. A DBMS will normally decide that the sum of an INTEGER column is a BIGINT or a DECIMALbut it could just overflow instead.

Why are telephone numbers stored in CHAR(12) columns instead of BIGINT columns? The answer is that, although a CHAR(12) column is longer, there are built-in functions for all the operations that are likely to be performed on telephone numbers: LIKE, SUBSTRING, SIMILAR, and so on. Meanwhile, all the built-in numeric operators (+ / *) are useless for telephone numbers. The general rule in such cases is that a column should have a non-numeric data type if all appropriate operations for it are non-numeric. A similar generality can be stated for the temporal data types.

Recommendation: Prefer INTEGER for integers unless maximum valuesincluding those from arithmetic operationsexceed the INTEGER range.

Floats

The SQL Standard provides three data types for columns that contain floats: REAL, FLOAT, and DOUBLE PRECISION (or DOUBLE). Table 7-6 shows the SQL Standard requirements and the level of support (data type and precision) the Big Eight have for these data types.

Notes on Table 7-6:

The letters IEEE mean "according to the IEEE 754 Standard for Binary Floating-Point Arithmetic." Informix and PostgreSQL use "the native C float," which coincidentally corresponds to IEEE 754 for any common C compiler.

IEEE single storage is 32 bits (four bytes). It allows for seven-digit precision, with a usual range from 3.402E+38 to 1.175E37 for negative numbers, zero, and from +1.175E-37 to +3.402E+38 for positive numbers.

REAL and FLOAT(n)where n <= 23 are usually synonymous and refer to a 32-bit floating-point number, IEEE single precision. InterBase, though, allows you to define a column with a specific precision for FLOAT, for example:

CREATE TABLE Table1 ( column1 FLOAT(20))
but ignores the precision specified. InterBase FLOAT(n) is always 64-bit IEEE double precision even though FLOAT alone is always 32-bit IEEE single precision.

IEEE double storage is 64 bits (eight bytes). It allows for 15-digit precision, with a usual range from 1.798E+308 to 2.225E-307 for negative numbers, zero, and from +2.225E-307 to +1.798E+308 for positive numbers.

DOUBLE PRECISION and FLOAT(n)where n BETWEEN 24 AND 53 are usually synonymous and refer to a 64-bit floating-point number, IEEE double precision. For MySQL, REAL is synonymous with DOUBLE, rather than with FLOAT.

Oracle accepts columns defined as REAL, FLOAT, or DOUBLE PRECISION, but treats all numbers differently from other DBMSs; see the sidebar "Oracle Numbers."

Table 7-6. ANSI/DBMS Float Support

	REAL	FLOAT	DOUBLE PRECISION
ANSI SQL	Yes	Yes	Yes
IBM	IEEE single	IEEE double	IEEE double
Informix	IEEE single	IEEE double	IEEE double
Ingres	IEEE single	IEEE double	IEEE double
InterBase	IEEE single	IEEE single	IEEE double
Microsoft	IEEE single	IEEE double	IEEE double
MySQL	IEEE double	IEEE single	IEEE double
Oracle	Yes	Yes	Yes
Sybase	IEEE single	IEEE double	IEEE double

The official SQL term for FLOAT, REAL, and DOUBLE PRECISION values is "approximate numeric" but "floating point" or simply "float" is common. The key point for all such values is that the decimal point is floating. (If the decimal point is fixed, see the next section.) You should use float literals when working with floats. For example, use this type of expression:

UPDATE Table1 SET float_column = 1.25E02

instead of:

UPDATE Table1 SET float_column = 125

Portability

MySQL won't accept a float literal unless it has a two-digit exponent. That is, the literal 1.25E02 is acceptable, but 1.25E2 is not. All other DBMSs allow you to drop the leading zero.

Floating-point operations are fast if they go through the computer's Floating Point Unit (FPU), but a compiler can make the cautious assumption that no FPU is present. In that case, floating-point operations are slow because they are emulated instead of performed with the FPU. When you install a DBMS, the installer should detect the FPU automatically and bring in the right code, whether FPU dependent or emulated, so make sure you rerun the install program after hardware upgrades or moves.

Take another look at Table 7-6 and the precisions shown for the float data types. The range of IEEE single-precision float is from 1.175E-37 to +3.402E+38, to 7 decimal digits precision, although some DBMSs are more cautious in stating the actual range supported. The range of IEEE double-precision float is from 2.225E-307 to +1.798E+308, to 15 decimal digits precision. Again, some DBMSs give a slightly smaller range. These sizes are shown in Table 7-7.

Table 7-7 shows that the range of a single-precision float is 1.175E-37 to +3.402E+38. In reality it isn't possible to store all the real numbers in that range in a four-byte space; it isn't even possible to store all the integers in that range in four bytes (the range of a four-byte INTEGER is from 2.14E9 to +2.14E9). So for most numbers in the single-precision range, you'll need to use whatever number is closest that can be represented in a single float. In other words, a floating-point number is exact in bit combinationsthat is, all bit combinations are exactbut it might not be exactly the same as the number that was inserted originally. Hence the name approximate.

The question that arises from this isIs it better to use DOUBLE PRECISION for float columns or REAL? (We'll ignore FLOAT entirely because REAL and DOUBLE are just synonyms for predefined sizes of FLOAT.)

We checked the Big Eight to see what happens if the same number is stored both ways. First, we created this table:

CREATE TABLE Table1 ( real_column REAL, double_column DOUBLE PRECISION)

Table 7-7. IEEE 754 Floats

	Precision in Bits (effective mantissa size)	Precision in Decimal Digits( inexact )	Min	Max
Single Precision	<=23	7	1.175E-37	+3.402E+38
Double Precision	>=24 AND <=53	15	2.225E-307	+1.798E+308

Then we inserted the same number into both columns and selected:

INSERT INTO Table1 VALUES (0.01, 0.01) SELECT * FROM Table1 WHERE real_column = 0.01 /* result is zero rows, "incorrect" */ SELECT * FROM Table1 WHERE double_column = 0.01 /* result is one row, "correct" */

Most DBMSs returned the "correct" result for the second SELECT, but were unable to find a row where real_column contained the value 0.01 . Due to their greater precision, DOUBLE PRECISION columns return the expected result more frequently than do REAL columns.

Recommendation: Prefer DOUBLE PRECISION for floats.

Decimals

The SQL Standard provides two data types for columns that contain fixed-decimal numbers: DECIMAL and NUMERIC. Table 7-8 shows the SQL Standard requirements and the level of support (data type and maximum precision in digits) the Big Eight have for these data types.

Table 7-8. ANSI/DBMS Decimal Support

	DECIMAL	DECIMAL Precision	NUMERIC	NUMERIC Precision
ANSI SQL	Yes	N/S	Yes	N/S
IBM	Yes	31	Yes	31
Informix	Yes	32	Yes	32
Ingres	Yes	31	Yes	31
InterBase	Yes	18	Yes	18
Microsoft	Yes	38	Yes	38
MySQL	Yes	254	Yes	254
Oracle	Yes	38	Yes	38
Sybase	Yes	38	Yes	38

Notes on Table 7-8:

IBM, Informix, Ingres, Microsoft, Oracle, and Sybase store decimal values as packed decimal values; see the sidebar "Oracle Numbers."

InterBase stores decimal values inside SMALLINTs, INTEGERs, BIGINTs, or FLOATs, depending on the defined size. The longest string of 9s that could fit in a BIGINT is 999999999999999999 (eighteen 9s) so DECIMAL(18) is the largest precise fixed-decimal number that such a trick can accommodate.

MySQL stores decimal values as unpacked floating-point numbersthat is, MySQL decimals are stored as strings, using one character for each digit of the value, plus another character for the sign and decimal point if applicable . The maximum number of digits that can be specified is 254, but the actual range for both DECIMAL and NUMERIC is the same as for a double-precision float (see DOUBLE PRECISION in Table 7-6).

A fixed-decimal number is one that has a fixed-decimal precision. For example, DECIMAL(7,2) has a precision of seven digits, with a fixed scalethe number of digits after the decimal pointof two digits. If the data type of a column is DECIMAL, then the actual precision must be at least as big as the defined precision. If the data type of a column is NUMERIC, then the actual precision should be exactly the same as the defined precision, but in practice most DBMSs treat NUMERIC as a synonym for DECIMAL.

Any numeric literal can take three forms, as shown in Table 7-9. Notice that Table 7-9 shows that the number 17.7 is a DECIMALnot a FLOAT. This can confuse C programmers because in C the literal value 17.7 is a float .

All DBMSs except for InterBase and MySQL store DECIMAL values as a packed string of decimal digits, usually with two decimal digits per byte; see the sidebar "Oracle Numbers." InterBase, as mentioned earlier, stores decimal values inside integers, while MySQL stores them as unpacked strings.

Table 7-9. Forms of a Numeric Literal

Form	Data type	Example
Digits and no decimal point	INTEGER	537
Digits with decimal point	DECIMAL	17.7
Exponential notation	FLOAT	1.32E15

The primary advantage of DECIMAL is that it is easy to cast to CHAR, because the number isn't stored in binary form. The primary disadvantage is that DECIMAL values must be converted to binary form before some arithmetic operations can be performed on them.

In general the conversion requires only your patiencethe DBMS handles multiple-digit arithmetic slowly, but it gets there. One exception exists, though: MySQL avoids the slowness of multidigit arithmetic by converting to floats when one of the operands is decimal. So this work around is primarily of interest to MySQL users:

Define two INTEGER columns instead of one DECIMAL column. For example, instead of Definition #1, use Definition #2:

Definition #1: CREATE TABLE Table1 ( column1 DECIMAL(7,5), ...) Definition #2: CREATE TABLE Table1 ( column1_pre_decimal_point INTEGER, column1_post_decimal_point INTEGER, ...)

To add 7.35 to such a divided number, add 35 to column1_post_decimal_point . If the result is greater than 100, subtract 100 and carry one, then add seven plus the carry to column1_pre_decimal_point . The work around sounds awfulbut it works more quickly than a decimal addition.

We still live in an era where the questionShould I use floats for decimals?makes some sense, but the era is fast approaching its end. The fact is that most sensible bean-count values (i.e., money) can be stored and manipulated as integers now, and with 64-bit processors the values are trillions. The only reason to use floats for dollars is that some host languages still have no equivalent for DECIMAL. It is notable that all the DBMSs that support a MONEY data type store MONEY internally as a DECIMAL with a predefined scale.

Recommendation: Prefer DECIMAL for fixed-decimal numbers and for most floats.

Oracle Numbers

Taking a different path from the other DBMSs, Oracle uses a single storage method packed decimal for all kinds of numbers. The format consists of two parts :

An exponent: size1 byte.

A mantissa: sizeup to 20 bytes, packed decimal, possibly includes a 1-byte sign.

Leading and trailing zeros are not stored.

Packed decimal is a number representation where each number is expressed as a sequence of decimal digits, with each decimal digit encoded as a 4-bit binary number (or nibble). In some cases, the right-most nibble contains the sign (positive or negative).

Oracle's system is good for up to 38 digits of decimal precision. The exponent represents the number of digits before the decimal point. Technically, the decimal point "floats," but all integers on the number line between 1.0E38 and +1.0E38 can be represented. Here's how Oracle stores the decimal number 523:

Exponent in byte #0: 2 Mantissa in byte #1: 5 2 Mantissa in byte #2: 3 filler: _

That is, 523 is stored in a format similar to 5.23 x 10 ² . One byte is used for the exponent (2), and two bytes are used for the three significant digits of the mantissa (5, 2, 3). Blank filler is added to the second byte to fill it completely.

The system's range is as big as the ranges for SMALLINT, BIGINT, and REAL, so Oracle can accept column definitions for any of those data types. But the storage is always the same. Because Oracle is only simulating a float by using a wide integer, though, details that are specific to the IEEE 754 specificationfor example, Not a Number (NaN) can't be represented.

The differences between Oracle and other DBMSs when it comes to numbers can be summarized as follows :

Number size

Oracle numbers are variable-length. Other DBMSs' numbers are fixed-length.

Storage format

Oracle uses one storage format for numbers. Other DBMSs use at least three storage formats for numbers.

Biggest integer supported

Oracle's biggest integer is 1.0E38. Other DBMSs' biggest integer is 9.2E18 (BIGINT).

Biggest float supported

Oracle's biggest "float" is 1.0E38. Other DBMSs' biggest float is 1.7E308 (DOUBLE).

The differences are so large, many of the recommendations that apply for other DBMSs do not apply to Oracle. Here are some Oracle-specific recommendations instead:

Avoid multiply and divide because the packed decimal numbers will have to be converted before they can be operated on.

Allow free space on each page because rows are all variable-length.

Don't worry about floating-point storage because it's not there.

It is easier to convert decimal numbers to and from packed decimal representation than binary representation, but packed decimal is often converted to binary for arithmetic processing.

Serials

The SQL:1999 Standard does not provide a data type for columns that contain "serial" numbers but most DBMSs do provide support for auto-incremented, or monotonic, values. Table 7-10 shows the level of support the Big Eight have for serial data types.

Table 7-10. ANSI/DBMS Serial Support

	Data Type	Sequence Generator
ANSI SQL	N/A	N/A
IBM	INTEGER AS IDENTITY	N/A
Informix	SERIAL, SERIAL8	N/A
Ingres	TABLE_KEY	N/A
InterBase	N/A	GEN_ID function
Microsoft	INTEGER IDENTITY	N/A
MySQL	INTEGER AUTO_INCREMENT	N/A
Oracle	N/A	CREATE SEQUENCE
Sybase	NUMERIC(10,0) IDENTITY	N/A

Notes on Table 7-10

Data Type column

What is the serial attribute and/or data type?

This column is "N/A" if no serial data type is supported and otherwise shows the name of the serial data type and necessary serial attribute, if required. For example, Informix supports serial data on columns defined with the SERIAL or SERIAL8 data types, while Sybase supports serial data on columns defined as NUMERIC(10,0) IDENTITY.

Sequence Generator column

What is the sequence generator?

This column is "N/A" if serial data is supported via the serial attribute and/or data type shown in the Data Type column and otherwise shows the name of the function or statement the DBMS uses to generate sequences. For example, Oracle's CREATE SEQUENCE statement creates a SEQUENCE object. The next version of the SQL Standard is expected to support a SEQUENCE object as well.

The various "serial" data types are usually 4-byte integers, with values assigned by the DBMS. The idea is that the DBMS will increment the value for each row you INSERT. Serial data types are useful for ensuring that each row of a table has a unique identifier. For example, you could create a table with Informix's nonstandard SQL-extension serial data type:

CREATE TABLE Table1 ( column1 SERIAL PRIMARY KEY, column2 INTEGER)

Because the value of column1 increases each time you INSERT a value into column2 , the uniqueness of each row is guaranteed .

The problem with serials is that they can cause trouble with concurrency control; see Chapter 15, "Locks."

Recommendation: Prefer INTEGER for serials and do your own value assignments.

The Bottom Line: Numbers

All DBMSs except Oracle use similar numeric formats, so look for our Oracle-specific recommendations only if you use Oracle.

Any number with a scale of zero (no digits after the decimal point) should be in an INTEGER column. INTEGER is the default data type of any literal that contains all digits. SMALLINT is the only other SQL Standard data type and is usually just half the size of INTEGER. TINYINT is the smallest integer data type and is still large enough for most personal information. BIGINT is appropriate for numbers larger than two billion though working with compilers that can't handle such large numbers could be problematic .

Floating-point operations are fast if they go through the computer's FPU. When you install a DBMS, the installer should detect the FPU automatically and bring in the right code, whether FPU dependent or emulated, so make sure you rerun the install program after hardware upgrades.

The primary advantage of DECIMAL is that it is easy to cast to CHAR, because the number isn't stored in binary form. The primary disadvantage of DECIMAL is that it must be converted to binary form before some arithmetic operations can be performed on it.

Serial data types are useful for ensuring each row of a table has a unique identifier. The problem with serials is that they can cause trouble with concurrency control.

Recommendation if you don't use Oracle: Prefer INTEGER for integers unless maximum valuesincluding those from arithmetic operationsexceed the INTEGER range. Prefer DECIMAL for fixed-decimal numbers and for most floats. Prefer DOUBLE PRECISION for floats where DECIMAL isn't appropriate. Prefer INTEGER for serials and do your own value assignments.

Recommendation if you use Oracle: Data type doesn't really matter because all numbers are stored the same. Avoid multiply and divide because the packed decimal numbers will have to be converted before they can be operated on. Allow free space on each page because rows are all variable-length. Don't worry about floating-point storage because it's not there.