Identifying and Removing Unattached Records

12.22.1 Problem

You have tables that are related (for example, they have a master-detail relationship). But you suspect that some of the records are unattached and can be removed.

12.22.2 Solution

Use a LEFT JOIN to identify unmatched values and delete them by adapting the techniques shown in Recipe 12.21. Or use a table-replacement procedure that selects the matched records into a new table and replaces the original table with it.

12.22.3 Discussion

The previous section shows how to delete related records from multiple tables at once, using the relationship that exists between the tables. Sometimes the opposite problem presents itself, where you want to delete records based on the lack of relationship. Problems of this kind typically occur when you have tables that are supposed to match up, but some of the records are unattachedthat is, they are unmatched by any corresponding record in the other table.

This can occur by accident, such as when you delete a parent record but forget to delete the associated child records, or vice versa. It can also occur as an anticipated consequence of a deliberate action. Suppose an online discussion board uses a parent table that lists discussion topics and a child table that records the articles posted for each topic. If you purge the child table of old article records, that may result in any given topic record in the parent table no longer having any children. If so, the lack of recent postings for the topic indicates that it is probably dead and that the parent record in the topic table can be deleted, too. In such a situation, you delete a set of child records with the explicit recognition that the operation may strand parent records and cause them to become eligible for being deleted as well.

However you arrive at the point where related tables have unmatched records, restoring the tables to a consistent state is a matter of identifying the unattached records and then deleting them:

The examples here use the swdist_head and swdist_item software distribution tables that were used in Recipe 12.21. Create the tables in their initial state using the swdist_create.sql script in the joins directory of the recipes distribution. They'll look like this:

mysql> SELECT * FROM swdist_head; +---------+------------+---------+------------+ | dist_id | name | ver_num | rel_date | +---------+------------+---------+------------+ | 1 | DB Gadgets | 1.59 | 1996-03-25 | | 2 | NetGizmo | 3.02 | 1998-11-10 | | 3 | DB Gadgets | 1.60 | 1998-12-26 | | 4 | DB Gadgets | 1.61 | 1998-12-28 | | 5 | NetGizmo | 4.00 | 2001-08-04 | +---------+------------+---------+------------+ mysql> SELECT * FROM swdist_item; +---------+----------------+ | dist_id | dist_file | +---------+----------------+ | 1 | README | | 1 | db-gadgets.sh | | 3 | README | | 3 | README.linux | | 3 | db-gadgets.sh | | 4 | README | | 4 | README.linux | | 4 | README.solaris | | 4 | db-gadgets.sh | | 2 | README.txt | | 2 | NetGizmo.exe | | 5 | README.txt | | 5 | NetGizmo.exe | +---------+----------------+

The records in the tables are fully matched at this point: For every dist_id value in the parent table, there is at least one child record, and each child record has a parent. To "damage" the integrity of this relationship for purposes of illustration, remove a few records from each table:

mysql> DELETE FROM swdist_head WHERE dist_id IN (1,4); mysql> DELETE FROM swdist_item WHERE dist_id IN (2,5);

The result is that there are unattached records in both tables:

mysql> SELECT * FROM swdist_head; +---------+------------+---------+------------+ | dist_id | name | ver_num | rel_date | +---------+------------+---------+------------+ | 2 | NetGizmo | 3.02 | 1998-11-10 | | 3 | DB Gadgets | 1.60 | 1998-12-26 | | 5 | NetGizmo | 4.00 | 2001-08-04 | +---------+------------+---------+------------+ mysql> SELECT * FROM swdist_item; +---------+----------------+ | dist_id | dist_file | +---------+----------------+ | 1 | README | | 1 | db-gadgets.sh | | 3 | README | | 3 | README.linux | | 3 | db-gadgets.sh | | 4 | README | | 4 | README.linux | | 4 | README.solaris | | 4 | db-gadgets.sh | +---------+----------------+

A little inspection reveals that only distribution 3 has records in both tables. Distributions 2 and 5 in the swdist_head table are unmatched by any records in the swdist_item table. Conversely, distributions 1 and 4 in the swdist_item table are unmatched by any records in the swdist_head table.

The problem now is to identify the unattached records (by some means other than visual inspection), and then remove them. Identification is a matter of using a LEFT JOIN. For example, to find childless parent records in the swdist_head table, use the following query:

mysql> SELECT swdist_head.dist_id AS 'unmatched swdist_head IDs' -> FROM swdist_head LEFT JOIN swdist_item -> ON swdist_head.dist_id = swdist_item.dist_id -> WHERE swdist_item.dist_id IS NULL; +---------------------------+ | unmatched swdist_head IDs | +---------------------------+ | 2 | | 5 | +---------------------------+

Conversely, to find the IDs for orphaned children in the swdist_item table that have no parent, reverse the roles of the two tables:

mysql> SELECT swdist_item.dist_id AS 'unmatched swdist_item IDs' -> FROM swdist_item LEFT JOIN swdist_head -> ON swdist_item.dist_id = swdist_head.dist_id -> WHERE swdist_head.dist_id IS NULL; +---------------------------+ | unmatched swdist_item IDs | +---------------------------+ | 1 | | 1 | | 4 | | 4 | | 4 | | 4 | +---------------------------+

Note that in this case, an ID will appear more than once in the list if there are multiple children for a missing parent. Depending on how you choose to delete the unmatched records, you may want to use DISTINCT to select each unmatched child ID only once:

mysql> SELECT DISTINCT swdist_item.dist_id AS 'unmatched swdist_item IDs' -> FROM swdist_item LEFT JOIN swdist_head -> ON swdist_item.dist_id = swdist_head.dist_id -> WHERE swdist_head.dist_id IS NULL; +---------------------------+ | unmatched swdist_item IDs | +---------------------------+ | 1 | | 4 | +---------------------------+

After you identify the unattached records, the question becomes how to get rid of them. You can use either of the following techniques, which you'll recognize as similar to those discussed in Recipe 12.21:

To use a multiple-table DELETE statement for removing unmatched records, just take the SELECT statement that you use to identify those records and replace the stuff leading up to the FROM keyword with DELETE tbl_name. For example, the SELECT that identifies childless parents looks like this:

SELECT swdist_head.dist_id AS 'unmatched swdist_head IDs' FROM swdist_head LEFT JOIN swdist_item ON swdist_head.dist_id = swdist_item.dist_id WHERE swdist_item.dist_id IS NULL;

The corresponding DELETE looks like this:

DELETE swdist_head FROM swdist_head LEFT JOIN swdist_item ON swdist_head.dist_id = swdist_item.dist_id WHERE swdist_item.dist_id IS NULL;

Conversely, the query to identify parentless children is as follows:

SELECT swdist_item.dist_id AS 'unmatched swdist_item IDs' FROM swdist_item LEFT JOIN swdist_head ON swdist_item.dist_id = swdist_head.dist_id WHERE swdist_head.dist_id IS NULL;

And the corresponding DELETE statement removes them:

DELETE swdist_item FROM swdist_item LEFT JOIN swdist_head ON swdist_item.dist_id = swdist_head.dist_id WHERE swdist_head.dist_id IS NULL;

To remove unmatched records by writing a program, select the ID list and turn it into a set of DELETE statements. Here's a Perl program that does so, first for the parent table and then for the child table:

#! /usr/bin/perl -w use strict; use lib qw(/usr/local/apache/lib/perl); use Cookbook; my $dbh = Cookbook::connect ( ); # Identify the IDs of childless parent records my $ref = $dbh->selectcol_arrayref ( "SELECT swdist_head.dist_id FROM swdist_head LEFT JOIN swdist_item ON swdist_head.dist_id = swdist_item.dist_id WHERE swdist_item.dist_id IS NULL"); # selectcol_arrayref( ) returns a reference to a list. Convert the reference # to a list, which will be empty if $ref is undef or points to an empty list. my @val = ($ref ? @{$ref} : ( )); # Use the ID list to delete records for all IDs at once. If the list # is empty, don't bother; there's nothing to delete. if (@val) { # generate list of comma-separated "?" placeholders, one per value my $where = "WHERE dist_id IN (" . join (",", ("?") x @val) . ")"; $dbh->do ("DELETE FROM swdist_head $where", undef, @val); } # Repeat the procedure for the child table. Use SELECT DISTINCT so that # each ID is selected only once. $ref = $dbh->selectcol_arrayref ( "SELECT DISTINCT swdist_item.dist_id FROM swdist_item LEFT JOIN swdist_head ON swdist_item.dist_id = swdist_head.dist_id WHERE swdist_head.dist_id IS NULL"); @val = ($ref ? @{$ref} : ( )); if (@val) { # generate list of comma-separated "?" placeholders, one per value my $where = "WHERE dist_id IN (" . join (",", ("?") x @val) . ")"; $dbh->do ("DELETE FROM swdist_item $where", undef, @val); } $dbh->disconnect ( ); exit (0);

The program uses IN( ) to delete all the affected records in a given table at once. See Recipe 12.21 for other related approaches.

You can also use mysql to generate the DELETE statements; a script that shows how to do this can be found in the joins directory of the recipes distribution.

A different type of solution to the problem is to use a table-replacement procedure. This method comes at the problem in reverse. Instead of finding and removing unmatched records, find and keep matched records. For example, you can use a join to select matched records into a new table. Then replace the original table with it. Unattached records don't get carried along by the join, and so in effect are removed when the new table replaces the original one.

The table replacement procedure works as follows. For the swdist_head table, create a new table with the same structure:

CREATE TABLE tmp ( dist_id INT UNSIGNED NOT NULL AUTO_INCREMENT, # distribution ID name VARCHAR(40), # distribution name ver_num NUMERIC(5,2), # version number rel_date DATE NOT NULL, # release date PRIMARY KEY (dist_id) );

Then select into the tmp table those swdist_head records that have a match in the swdist_item table:

INSERT IGNORE INTO tmp SELECT swdist_head.* FROM swdist_head, swdist_item WHERE swdist_head.dist_id = swdist_item.dist_id;

Note that the query uses INSERT IGNORE; a parent record may be matched by multiple child records, but we want only one instance of its ID. (The symptom of failing to use IGNORE is that the query will fail with a "duplicate key" error.)

Finish by replacing the original table with the new one:

DROP TABLE swdist_head; ALTER TABLE tmp RENAME TO swdist_head;

The procedure for replacing the child table with a table containing only matched child records is similar, except that IGNORE is not neededeach child that is matched will be matched by only one parent:

CREATE TABLE tmp ( dist_id INT UNSIGNED NOT NULL, # parent distribution ID dist_file VARCHAR(255) NOT NULL # name of file in distribution ); INSERT INTO tmp SELECT swdist_item.* FROM swdist_head, swdist_item WHERE swdist_head.dist_id = swdist_item.dist_id; DROP TABLE swdist_item; ALTER TABLE tmp RENAME TO swdist_item;

Категории