Eliminating Duplicates from a Table

2017-11-03 09:05:05

14.7.1 Problem

You want to remove duplicate records from a table so that it contains only unique rows.

14.7.2 Solution

Select the unique rows from the table into a second table that you use to replace the original one. Or add a unique index to the table using ALTER TABLE, which will remove duplicates as it builds the index. Or use DELETE ... LIMIT n to remove all but one instance of a specific set of duplicate rows.

14.7.3 Discussion

If you forget to create a table with a unique index to prevent the occurrence of duplicates within the table, you may discover later that it's necessary to apply some sort of duplicate-removal technique. The cat_mailing table used in earlier sections is an example of this, because it contains several instances where the same person is listed multiple times.

The table contains redundant entries and it would be a good idea to remove them, to eliminate duplicate mailings and reduce postage costs. To do this, you have several options:

Select the table's unique rows into another table, then use that table to replace the original one. The result is to remove the table's duplicates. This works when "duplicate" means "the entire row is the same as another."

Add a unique index to the table using ALTER TABLE. This operation will remove duplicate rows based on the contents of the indexed columns.

You can remove duplicates for a specific set of duplicate rows by using DELETE ... LIMIT n to remove all but one of the rows.

This section discusses each of these duplicate-removal methods. When you consider which of them to choose under various circumstances, note that the applicability of a given method to a specific problem often will be determined by two factors:

Does the method require the table to have a unique index?

If the columns in which duplicate values occur may contain NULL, will the method remove duplicate NULL values?

14.7.4 Removing Duplicates Using Table Replacement

One way to eliminate duplicates from a table is to select its unique records into a new table that has the same structure. Then replace the original table with the new one. If a row is considered to duplicate another only if the entire row is the same, you can use SELECT DISTINCT to select the unique rows:

This method works in the absence of an index (though it might be slow for large tables), and for tables that contain duplicate NULL values, it will remove those duplicates. Note that this method considers the rows for Wallace Baxter that have slightly different street values to be distinct.

If duplicates are defined only with respect to a subset of the columns in the table, create a new table that has a unique index first, then select rows into it using INSERT IGNORE.

The index prevents records with duplicate key values from being inserted into tmp, and IGNORE tells MySQL not to stop with an error if a duplicate is found. One shortcoming of this method is that if the indexed columns can contain NULL values, you must use a UNIQUE index rather than a PRIMARY KEY, in which case the index will not remove duplicate NULL keys. (UNIQUE indexes allow multiple NULL values.)

After creating the new table tmp that contains unique rows, use it to replace the original cat_mailing table. The effective result is that cat_mailing no longer will contain duplicates:

mysql> DROP TABLE cat_mailing; mysql> ALTER TABLE tmp RENAME TO cat_mailing;

14.7.5 Removing Duplicates by Adding an Index

To remove duplicates from a table "in place," add a unique index to the table with ALTER TABLE, using the IGNORE keyword to tell it to discard records with duplicate key values during the index construction process. The original cat_mailing table looks like this without an index:

Add a unique index, then see what effect doing so has on the table contents:

If the indexed columns can contain NULL, you must use a UNIQUE index rather than a PRIMARY KEY. In that case, the index will not remove duplicate NULL key values.

14.7.6 Removing Duplicates of a Particular Row

As of MySQL 3.22.7, you can use LIMIT to restrict the effect of a DELETE statement to a subset of the rows that it otherwise would delete. This makes the statement applicable to removing duplicate records. Suppose you have a table t with the following contents:

+-------+ | color | +-------+ | blue | | green | | blue | | blue | | red | | green | | red | +-------+

The table lists blue three times, and green and red twice each. To remove the extra instances of each color, do this:

mysql> DELETE FROM t WHERE color = 'blue' LIMIT 2; mysql> DELETE FROM t WHERE color = 'green' LIMIT 1; mysql> DELETE FROM t WHERE color = 'red' LIMIT 1; mysql> SELECT * FROM t; +-------+ | color | +-------+ | blue | | green | | red | +-------+

This technique works in the absence of a unique index, and it will eliminate duplicate NULL values. It's handy if you want to remove duplicates only for a specific set of rows within a table. However, if there are many different sets of duplicates that you want to remove, this is not a procedure you'd want to carry out by hand. The process can be automated by using the techniques discussed earlier in Recipe 14.4 for determining which values are duplicated. Recall that in that recipe we wrote a make_dup_count_query( ) function to generate the query needed to count the number of duplicate values in a given set of columns in a table:

sub make_dup_count_query { my ($tbl_name, @col_name) = @_; return ( "SELECT COUNT(*)," . join (",", @col_name) . " FROM $tbl_name" . " GROUP BY " . join (",", @col_name) . " HAVING COUNT(*) > 1" ); }

We can write another function delete_dups( ) that uses make_dup_count_query( ) to find out which values in a table are duplicated and how often. From that information, we can figure out how many duplicates to remove with DELETE ... LIMIT n, so that only unique instances remain. The delete_dups( ) function looks like this:

sub delete_dups { my ($dbh, $tbl_name, @col_name) = @_; # Construct and run a query that finds duplicated values my $dup_info = $dbh->selectall_arrayref ( make_dup_count_query ($tbl_name, @col_name) ); return unless defined ($dup_info); # For each duplicated set of values, delete all but one instance # of the rows containing those values foreach my $row_ref (@{$dup_info}) { my ($count, @col_val) = @{$row_ref}; next unless $count > 1; # Construct condition string to match values, being # careful to match NULL with IS NULL my $str; for (my $i = 0; $i < @col_name; $i++) { $str .= " AND " if $str; $str .= defined ($col_val[$i]) ? "$col_name[$i] = " . $dbh->quote ($col_val[$i]) : "$col_name[$i] IS NULL"; } $str = "DELETE FROM $tbl_name WHERE $str LIMIT " . ($count - 1); $dbh->do ($str); } }

Suppose we have an employee table that contains the following records:

To use the delete_dups( ) function to eliminate duplicates on the name and department columns of the employee table, call it like this:

delete_dups ($dbh, "employee", "name", "department");

delete_dups( ) calls make_dup_count_query( ) and executes the SELECT query that it generates. For the employee table, that query produces the following results:

delete_dups( ) uses that information to generate the following DELETE statements:

DELETE FROM employee WHERE name = 'Boris' AND department IS NULL LIMIT 1 DELETE FROM employee WHERE name = 'Fred' AND department = 'accounting' LIMIT 3 DELETE FROM employee WHERE name = 'Mary Ann' AND department = 'sales' LIMIT 5 DELETE FROM employee WHERE name = 'Mary Ann' AND department = 'shipping' LIMIT 1

In general, using DELETE ... LIMIT n is likely to be slower than removing duplicates by using a second table or by adding a unique index. Those methods keep the data on the server side and let the server do all the work. DELETE ... LIMIT n involves a lot of client-server interaction because it uses a SELECT query to retrieve information about duplicates, followed by several DELETE statements to remove instances of duplicated rows.

When you issue DELETE ... LIMIT n statements from within a program, be sure to execute them only for values of n greater than zero. That is not only sensible (why waste bandwidth issuing a query to delete nothing?), but it's also necessary to avoid a bug that affects some versions of MySQL. Logically, one would expect that a statement of the form DELETE ... LIMIT 0 would delete no records, and that's what happens for current versions of MySQL. But versions prior to 3.23.40 have a bug such that LIMIT 0 is treated as though the LIMIT clause is not present at all; the result is that DELETE deletes all the selected rows! (For affected versions of MySQL, this problem also occurs for UPDATE ... LIMIT 0.)

Категории