Introduction

Tables or result sets sometimes contain duplicate records. In some cases this is acceptable. For example, if you conduct a web poll that records dates and client IP numbers along with the votes, duplicate records may be allowable, because it's possible for large numbers of votes to appear to originate from the same IP number for an Internet service that routes traffic from its customers through a single proxy host. In other cases, duplicates will be unacceptable, and you'll want to take steps to avoid them. Operations related to handling of duplicate records include the following:

Several tools are at your disposal for dealing with duplicate records. These can be chosen according to the objective you're trying to achieve:

This chapter describes how each of these techniques applies to duplicate identification and removal, but before proceeding further, I should define what "duplicate" means here. When people say "duplicate record," they may mean different things. For purposes of this chapter, one record is a duplicate of another if both rows contain the same values in columns that are supposed to distinguish them. Consider the following table:

mysql> SELECT * FROM person; +------+-----------+------------+---------------+------+ | id | last_name | first_name | address | age | +------+-----------+------------+---------------+------+ | 1 | Smith | Jim | 428 Mill Road | 36 | | 2 | Smith | Joan | 428 Mill Road | 36 | | 3 | Smith | Junior | 428 Mill Road | 12 | +------+-----------+------------+---------------+------+

None of these records are duplicates if you compare rows using all the columns, because then the records contain the id and first_name columns, each of which happen to contain only unique values. However, if you look only at the last_name or address columns, all the records contain duplicated values. Lying between these extremes, a result set consisting of the age column contains a mix of unique and duplicated values.

Scripts related to the examples shown in this chapter are located in the dups directory of the recipes distribution. For scripts that create the tables used here, look in the tables directory.

Категории