SQL Performance Tuning

   

The advantages of stored procedures, in order by importance, are:

  • Procedures are on the server so messages don't need to go back and forth to the client during the time the procedure is executed.

  • Procedures are parsed once, and the result of the parsing is stored persistently, so there's no need to reparse for every execution.

  • Procedures are in the catalog so they are retrievable, and procedures are subject to security provisions, in the same way as other SQL data.

  • Procedures are in one place so code sharing is easy, and when changes happen there's no need to send code changes to clients .

Also, in theory, a procedural language is better adapted for procedural goals. So when you're working with scalar values rather than databases or sets, and one step happens after another, then constructs such as IF, DECLARE, BEGIN/END, FOR, WHILE, and so on, come into their own. Things get particularly interesting if the procedural language you're using is C or Java.

Less Traffic

Stored procedures mean less message traffic between clients and servers. The client must send some sort of message to initiate the procedure, and the procedure must return some sort of result when the procedure is over, but that's allno message passing occurs within the procedure. So a stored procedure that contains [n] statements will need only two messages, while an ODBC application that contains [n] statements will need (2 * n) messages. This factor is significant because a message takes at least a few milliseconds (on a TCP/IP connection to the same computer), and most likely a few centiseconds (on a LAN), or even a few deciseconds (on a WAN or Internet connection). Against this, you must set the cost of loading a stored procedure from disk the first time, which takes a few milliseconds . Calculating these factors together, we can say that stored procedures are faster than direct passing of SQL statements when (a) more than two SQL statements in the stored procedure are executed, and (b) the stored procedure is accessed so frequently that it might already be in the operating system's cache.

A stored procedure is not the only thing that leads to less traffic. You can sometimes achieve similar effects with views or constraints. And you must remember that not all application programs need messages over a networkthey could be Common Gateway Interface (CGI) programs called by the application server.

A message is not the same thing as a fetch. Beware of twaddle like "if a database has many rows, then stored procedures are good because the excess rows won't be transferred over the network for processing"the rows won't be transferred anyway if you use WHERE clauses in your SQL statements! However, data-change (INSERT/UPDATE/DELETE) statements can cause useless messages if the DBMS feels obliged to return the number of rows that were changed. That's an SQL Standard requirement, but only Microsoft does it, and the message can be suppressed by telling Microsoft: SET NOCOUNT ON .

At the start of this section, we gave you four advantages to using stored procedures. Traffic reduction is more important than the other three combined. If you're on a network, you need stored procedures, and traffic reduction is the reason.

Semiprecompilation

The second advantage of stored procedures is that they're precompiled. This means that the DBMS only has to prepare a statement once, instead of preparing a statement every time it executes. To avoid building false hopes, we should emphasize that the precompilation is only partial, is only temporary, and is not a free lunch .

Let's take Informix as an example. We know that other DBMSs operate a bit differently, but they all operate under the same constraints. In a general fashion, here is what actually goes on.

When a CREATE PROCEDURE statement (for example, CREATE PROCEDURE Sp_proc1 ) is executed, Informix parses the statement and stores two things in the database catalog: a list of the objects (tables or other stored procedures) on which the procedure depends, and a list of tokens from the parsing. The token list is called pcode. It's a step away from an ASCII command but a long way from executable code. Pcode is somewhat like Java bytecode it's interpretable, not executable. The pcode is kept in a BLOB field in the catalog's tables.

When EXECUTE PROCEDURE Sp_proc1 is run for the first time, Informix loads the procedure's pcode and makes a query plan. The query plan has to go in a cache, but the cache has only enough room for 16 query plans because query plans require a lot of RAM. If the cache already contains 16 query plans, Informix discards the least-recently-used query plan at this point. The precompilation advantage for stored procedures thus applies only to the last 16 procedures you've used. All other procedures must be reloaded and, in effect, precompiled again, because making the query plan is the bulk of the precompilation job.

When EXECUTE PROCEDURE Sp_proc1 is run and it's not the first time, Informix has a cached query plan ready to go. However, the DBMS must still check the dependency list because the stored procedure might refer to some object that has been altered or dropped since the first time the procedure was executed. The other thing that might have changed since the first time is the parameter values, so Informix reprocesses them too.

Now Informix locks the procedure. Usually stored procedures are not reentrant because some of the variable information is stored outside the user 's area. By ensuring that only one job at a time can execute, Informix ensures that executions of stored procedures are serializable.

And then Informix actually does the job.

Listing 11-2 shows a comparison test that we ran on a LAN, for a table with no rows in it. The point of the test was to see whether it was faster to SELECT data using direct SQL or via a stored procedure. The result of the test was that the "Direct SQL" execution was faster than the "stored procedure" execution. Our explanation for this is two-fold:

  • First, the SQL statement was cached, so precompilation is true for both cases.

  • Second, there is a call overhead for (a) finding Sp_proc1 , (b) checking EXECUTE privileges, (c) saving current state, and (d) returning the result set to the caller.

So stored procedures do not enhance performance in every circumstance.

The result of Listing 11-2's test might lead us to sayIf you're executing the same thing multiple times, then just PREPARE, don't use a stored procedure. But it's not that simplethe results of a PREPARE can disappear when a transaction ends. The results of a stored-procedure precompilation are more permanent.

Listing 11-2 Test: Direct SQL versus stored procedure

Stored Procedure CREATE PROCEDURE Sp_proc1 ... SELECT * FROM Table1; ... ... CALL Sp_proc1 repeat CALL 1000 times Direct SQL SELECT * FROM Table1 repeat 1000 times GAIN: 3/7

Because the DBMS remakes a plan the first time a procedure is called, you had better make sure the DBMS has all the necessary information when that first time happens. In particular:

  • If a table needs indexes, make them before the first call. Don't execute CREATE INDEX within the procedure.

  • Populate a table with realistic data before the first call, so the optimizer sees the histograms and knows the table sizes.

  • Pass parameter values to the first call if the parameters are likely to be passed again in subsequent calls.

With any DBMS there are ways to force re-precompilation of stored procedures, and doing so occasionally won't harm anything. In most offices, it's easy: turn the machine off once a day.

Parameters

We've already noted that the DBMS rechecks the parameter values every time it executes a stored procedure, and that the DBMS takes parameter values into account for its query plan when it executes a procedure for the first time. The same cannot be said for local-variable values. Therefore if a procedure has a condition like this somewhere:

... WHERE column1 = num1

it's better if num1 is a parameter, not a variable.

Procedure parameter passing is "by value" (although you can simulate "by reference" passing using BLOBs and REFs). When parameter passing is by value, the DBMS usually makes a local copy so that the called procedure can make changes to the parameter without mucking up the original. That leads inevitably to the bad idea of declaring parameters to be "input/output" even when they're merely "input" parameters. The trick behind the idea is that an "input/output" parameter value is not recopied down the line if stored procedure #1 calls stored procedure #2only the address is copiedso it saves you space and copying time.

The problem with the trick is that not all procedure calls are on the server's local stack. For example, if the server uses Remote Procedure Call (RPC) methods , then the parameter value is copied , and stored procedure #2 makes a second copy when it returns. Therefore you actually lose space and copying time by declaring the parameter to be an "input/output" parameter.

If you really want to avoid constant recopying of parameter values, you can either make judicious use of the SQL DEFAULT clause for column and domain definitions, or (more likely) you can use globals. Yes, there is no such thing (except in Microsoft's implementation), but you can simulate globals by storing values in temporary tables. People frequently do. Another form of global is one of the "global registers" that are accessible from any of the niladic functionsjust keep in mind that most register values can't be changed by the user within a transaction.

Other Tips

The first tip to keep in mind when you're working with stored procedures is not to think of a compiler when you think of the precompiler. (A precompiler is a utility that converts SQL statements in a host program to statements a compiler can understand.) You are probably working with a primitive pcode interpreter, and the tricks that you take for granted with Clike constant propagation and invariant hoistingare not supported here. For example, the stored procedure shown in Listing 11-3 is too optimistic.

Listing 11-3 Optimistic stored procedure

CREATE PROCEDURE Sp_proc1 (param1 INT) MODIFIES SQL DATA BEGIN DECLARE num1 INT; DECLARE num2 INT; SET num2 = 1; WHILE num2 <=3 DO SET num1 = param1; IF num1 > num2 AND param1 <> 1 THEN UPDATE Table1 SET column1 = 15 + num1 7; END IF; SET num2 = num2 +1; END WHILE; END

Listing 11-3's procedure can be made about 50% faster if we ourselvesrather than expecting the DBMS to do ittake things out of the loop, fold, put declarations together in one statement, and avoid the assumption (which is true only in C) that if num1 > num2 is false then param1 <> 1 won't be tested . The improved procedure is shown in Listing 11-4.

Listing 11-4 Improved stored procedure

CREATE PROCEDURE Sp_proc1 (param1 INT) MODIFIES SQL DATA BEGIN DECLARE num1 INT, num2 INT; SET num1 = param1; SET num2 = 1; WHILE num2 <=3 DO IF num1 > num2 THEN IF param1 <> 1 THEN UPDATE Table1 SET column1 = num1 + 8; END IF; END IF; SET num2 = num2 +1; END WHILE; END GAIN: 5/7

The second stored procedure tip to keep in mind is to shift logic from WHERE clauses to IF clauses whenever possible. For example, if your procedure contains this code:

BEGIN UPDATE Table1 SET column1 = num1 WHERE num1 = 2; END

change the code to:

BEGIN IF num1 = 2 THEN UPDATE Table1 SET column1 = 2; END IF; END GAIN: 5/7

Partly, the gain here happens because a falseness (when num1 <> 2 ) can be detected earlier. With IBM, the reason might be more subtleIBM can actually compile the IF block, by converting it to C (or Java) statements and then invoking the appropriate compiler. But IBM can't compile the UPDATE statement, so any logic taken out of the UPDATE will be evaluated faster.

Listing 11-5 shows an example of a stored procedure that contains an SQL SELECT statement. We used InterBase for this example. InterBase does not conform to Standard SQL PSM as well as other DBMSs, but the code should still be easy to follow. Note especially these four things, marked by the comments in the Listing:

  • Note 1: It is an InterBase requirement that all SQL statements be enclosed within a BEGIN END block.

  • Note 2: It's a good idea to put a qualifier such as " Schema1 ." in front of the table name . This removes confusion if some other user, with a different default schema, calls the Select_proc procedure. Also, some DBMSs will look up an object faster if it's qualified, because they know what container to begin the search with. It's recommended that all objects used in a procedure should belong to the same schema.

  • Note 3: The keyword SUSPEND is specific to InterBase. It means "return the fetched row to the caller."

  • Note 4: No explicit CLOSE is in this procedure. The code depends on the DBMS to close cursors automatically when it exits. This method is safe, but in a more complex procedure it's important to remember to close early.

Listing 11-5 An InterBase stored procedure using SELECT

CREATE PROCEDURE Select_proc (parameter1 INTEGER) RETURNS (var1 CHAR(2), var2 CHAR(2), var3 CHAR(2)) AS BEGIN /* Note 1 */ FOR SELECT column1, column2, column3 FROM Schema1.Table1 /* Note 2 */ WHERE column1 = :parameter1 INTO :var1, :var2, :var3 DO SUSPEND; /* Note 3 */ END /* Note 4 */

Once Listing 11-5's Select_proc procedure has been created, these two statements are equivalent:

SELECT column1, column2, column3 FROM Schema1.Table1 WHERE column1 = ? EXECUTE PROCEDURE Select_proc

The interesting thing is that the EXECUTE PROCEDURE statement (or CALL statement, in ANSI SQL) will be faster than the SELECT statement even if the stored procedure is not cached (GAIN: 1/1). That's because "SELECT procedures" don't return result-set metadata.

The Bottom Line: Stored Procedures

When you use a stored procedure, you are depending on cache (which is limited) and access to the procedure (which may be serialized). Therefore it is a mistake to use stored procedures too frequentlythat is, it is a mistake to use them except for the reasons discussed in this chapter. Stored procedures are like Santa Claus: You get presents for nothingbut you ought to be good anyway.

Don't declare parameters to be "input/output" when they're merely "input" parameters.

Result-Set Metadata

When you execute a SELECT statement, for example:

SELECT column1 FROM Table1

you get back the values for column1 in a result set. What you may not realize is that you also get back the metadata about column1: its name, length, and other descriptive data. The DBMS passes this information automatically; usually the application program doesn't need it, so the passing is a waste of network packets. The CALL statement, on the other hand, won't return metadata. So when you're not repeating the SELECT multiple times, you'll save time by using CALL with a stored procedure instead.

Don't write stored procedures as if the precompiler has the built-in skills of a compiler. Take things out of loops , fold, put declarations together in one statement, and avoid testing assumptions on your owndon't expect the DBMS to do these things for you.

Shift logic from WHERE clauses to IF clauses whenever possible.

If a stored procedure contains an IF statement and is inside a trigger, take the condition out of the IF and put it in the trigger's WHEN clause.

If a stored procedure contains conditions like this:

IF <search condition> THEN SELECT * FROM Table1; ELSE SELECT * FROM Table2; END IF;

then only one of the SELECTs (whichever SELECT is actually executed) will be precompiled and get a stored plan. To ensure that both plans are stored, put the SELECTs in procedures of their own and CALL as required, like this:

CREATE PROCEDURE Sp_proc1 ... SELECT * FROM Table1 ... CREATE PROCEDURE Sp_proc2 ... SELECT * FROM Table2 ... ... IF <search condition> THEN CALL Sp_proc1; ELSE CALL Sp_proc2; END IF;

Two different connections will not use the same in-memory copy of a stored procedure. That's an argument for connection pooling; see Chapter 13, "JDBC."

Use VARCHAR parameters rather than CHAR parameters. Some DBMSs will automatically add space padding if the data type is CHAR, and that slows network access. However, if you have a long IN parameter and a long OUT parameter, you should combine them as a single INOUT fixed- size CHAR parameter. Reusing the same buffer for input and output will save memory on the server.

IBM recommends that you avoid putting COMMIT or ROLLBACK in the stored procedure. The transaction-end statements are especially bad if you're using distributed transactions and TP monitors . However, doing so will save one network message (if auto-commit is off ).

Microsoft recommends that you avoid interleaving data-change statements with data definition statements. For example, do not mix CREATE and INSERT statements in a stored procedure.

Try to make sure that the stored procedure and the client application are operating with similar assumptions. For example, they should each have the same codepage.

Do not assume that frequently used stored procedures are cached. They can be invalidated due to many circumstancesfor example, if the DBMS automatically updates the database statistics every hour , or the stored procedure contains SQL statements which are not executed every time the procedure is run, or there are too many stored procedures for the DBMS's fixed cache size.

Begin a stored procedure by testing whether a parameter is NULL. Some DBMSs (not the SQL Standard) allow callers to pass two arguments to a procedure that was defined with three parameters. In such a case, the third parameter value will be NULL. Among the Big Eight, only Ingres allows such uneven passing. (Other DBMSs allow uneven parameter passing when a parameter was defined with a DEFAULT NULL clause, but this is not standard SQL.)

   

Категории