Moving Records Between Tables Safely

3.24.1 Problem

You're moving records by copying them from one table to another and then deleting them from the original table. But some records seem to be getting lost.

3.24.2 Solution

Be careful to delete exactly the same set of records from the source table that you copied to the destination table.

3.24.3 Discussion

Applications that copy rows from one table to another can do so with a single operation, such as INSERT ... SELECT to retrieve the relevant rows from the source table and add them to the destination table. If an application needs to move (rather than copy) rows, the procedure is a little more complicated: After copying the rows to the destination table, you must remove them from the source table. Conceptually, this is nothing more than INSERT ... SELECT followed by DELETE. In practice, the operation may require more care, because it's necessary to select exactly the same set of rows in the source table for both the INSERT and DELETE statements. If other clients insert new rows into the source table after you issue the INSERT and before you issue the DELETE, this can be tricky.

To illustrate, suppose you have an application that uses a working log table worklog into which records are entered on a continual basis, and a long-term repository log table repolog. Periodically, you move worklog records into repolog to keep the size of the working log small, and so that clients can issue possibly long-running log analysis queries on the repository without blocking processes that create new records in the working log.[3]

[3] If you use a MyISAM log table that you only insert into and never delete from or modify, you can run queries on the table without preventing other clients from inserting new log records at the end of the table.

How do you properly move records from worklog to repolog in this situation, given that worklog is subject to ongoing insert activity? The obvious (but incorrect) way is to issue an INSERT ... SELECT statement to copy all the worklog records into repolog, followed by a DELETE to remove them from worklog:

INSERT INTO repolog SELECT * FROM worklog;
DELETE FROM worklog;

This is a perfectly workable strategy when you're certain nobody else will insert any records into worklog during the time between the two statements. But if other clients insert new records in that period, they'll be deleted without ever having been copied, and you'll lose records. If the tables hold logs of web page requests, that may not be such a big deal, but if they're logs of financial transactions, you could have a serious problem.
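To make the failure concrete, the interleaving can be replayed in a single session. This is only a sketch: the two-column table layout, the msg column, and the sample values are assumptions made for illustration, and the statements attributed to "client B" stand in for whatever other clients happen to be logging at the time.

```sql
-- Hypothetical table layout, for illustration only.
CREATE TABLE worklog (t DATETIME NOT NULL, msg VARCHAR(255));
CREATE TABLE repolog LIKE worklog;

INSERT INTO worklog VALUES ('2024-01-01 10:00:00', 'first entry');

-- Client A begins the move by copying the rows that exist right now.
INSERT INTO repolog SELECT * FROM worklog;

-- Client B logs a new row before client A reaches its DELETE.
INSERT INTO worklog VALUES ('2024-01-01 10:00:05', 'second entry');

-- Client A finishes the move; this also removes client B's row.
DELETE FROM worklog;

-- 'first entry' is in repolog, but 'second entry' is in neither table.
SELECT COUNT(*) FROM repolog;   -- 1
SELECT COUNT(*) FROM worklog;   -- 0
```

The second entry was never copied, yet the unqualified DELETE removed it along with everything else.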

What can you do to keep from losing records? Two possibilities are to issue both statements within a transaction, or to lock both tables while you're using them. These techniques are covered in Chapter 15. However, either one might block other clients longer than you'd prefer, because you tie up the tables for the duration of both queries. An alternative strategy is to move only those records that are older than some cutoff point. For example, if the log records have a column t containing a timestamp, you can limit the scope of the selected records to all those created before today. Then it won't matter whether new records are added to worklog between the copy and delete operations. Be sure to specify the cutoff properly, though. Here's a method that fails under some circumstances:

INSERT INTO repolog SELECT * FROM worklog WHERE t < CURDATE( );
DELETE FROM worklog WHERE t < CURDATE( );

This won't work if you happen to issue the INSERT statement at one second before midnight and the DELETE statement one second later. The value of CURDATE( ) will differ for the two statements, so the DELETE operation also removes the records from the day that just ended, which the INSERT never copied. If you're going to use a cutoff, make sure it has a fixed value, not one that may change between statements. For example, a SQL variable can be used to save the value of CURDATE( ) in a form that won't change as time passes:

SET @cutoff = CURDATE( );
INSERT INTO repolog SELECT * FROM worklog WHERE t < @cutoff;
DELETE FROM worklog WHERE t < @cutoff;

This ensures that both statements use the same cutoff value so that the DELETE operation doesn't remove records that it shouldn't.
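For comparison, the table-locking alternative mentioned earlier might look something like the following. Treat it as a rough sketch rather than a recommendation; the discussion of locking in Chapter 15 is the place to look for the details.

```sql
-- Write locks keep other clients out of both tables until UNLOCK TABLES,
-- so no new rows can arrive between the copy and the delete.
LOCK TABLES worklog WRITE, repolog WRITE;
INSERT INTO repolog SELECT * FROM worklog;
DELETE FROM worklog;
UNLOCK TABLES;
```

The move is safe here without any cutoff, but clients trying to insert into worklog or query repolog must wait until the locks are released, which is the blocking cost that the cutoff approach avoids.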
