Comparing a Table to Itself
12.12.1 Problem
You want to compare records in a table to other records in the same table. For example, you want to find all paintings in your collection by the artist who painted "The Potato Eaters." Or you want to know which states listed in the states table joined the Union in the same year as New York. Or you want to know which of the people listed in the profile table have some favorite food in common.
12.12.2 Solution
Problems that require comparing a table to itself involve an operation known as a self-join. It's much like other joins, except that you must always use table aliases so that you can refer to the same table different ways within the query.
12.12.3 Discussion
A special case of joining one table to another occurs when both tables are the same. This is called a self-join. Although many people find the idea confusing or strange to think about at first, it's perfectly legal. Be assured that you'll get used to the concept, and more than likely will find yourself using self-joins quite often because they are so important.
A tip-off that you need a self-join is when you want to know which pairs of elements in a table satisfy some condition. For example, suppose your favorite painting is "The Potato Eaters" and you want to identify all the items in your collection that were done by the artist who painted it. You can do so as follows:
- Identify the row in the painting table that contains the title "The Potato Eaters," so that you can refer to its a_id value.
- Use the a_id value to match other rows in the table that have the same a_id value.
- Display the titles from those matching rows.
The artist ID and painting titles that we begin with look like this:
mysql> SELECT a_id, title FROM painting ORDER BY a_id; +------+-------------------+ | a_id | title | +------+-------------------+ | 1 | The Last Supper | | 1 | The Mona Lisa | | 3 | Starry Night | | 3 | The Potato Eaters | | 3 | The Rocks | | 5 | Les Deux Soeurs | +------+-------------------+
A two-step method for picking out the right titles without a join is to look up the artist's ID with one query, then use the ID in a second query that selects records that match it:
mysql> SELECT @id := a_id FROM painting WHERE title = 'The Potato Eaters'; +-------------+ | @id := a_id | +-------------+ | 3 | +-------------+ mysql> SELECT title FROM painting WHERE a_id = @id; +-------------------+ | title | +-------------------+ | Starry Night | | The Potato Eaters | | The Rocks | +-------------------+
Another solutionone that requires only a single queryis to use a self-join. The trick to this lies in figuring out the proper notation to use. The way many people first try to write a query that joins a table to itself looks something like this:
mysql> SELECT title FROM painting, painting -> WHERE title = 'The Potato Eaters' AND a_id = a_id; ERROR 1066 at line 1: Not unique table/alias: 'painting'
The problem with that query is that the column references are ambiguous. MySQL can't tell which instance of the painting table any given column name refers to. The solution is to give at least one instance of the table an alias so that you can distinguish column references by using different table qualifiers. The following query shows how to do this, using the aliases p1 and p2 to refer to the painting table different ways:
mysql> SELECT p2.title -> FROM painting AS p1, painting AS p2 -> WHERE p1.title = 'The Potato Eaters' -> AND p1.a_id = p2.a_id; +-------------------+ | title | +-------------------+ | Starry Night | | The Potato Eaters | | The Rocks | +-------------------+
The query output illustrates something typical of self-joins: when you begin with a reference value in one table instance ("The Potato Eaters") to find matching records in a second table instance (paintings by the same artist), the output includes the reference value. That makes senseafter all, the reference matches itself. If you want to find only other paintings by the same artist, explicitly exclude the reference value from the output:
mysql> SELECT p2.title -> FROM painting AS p1, painting AS p2 -> WHERE p1.title = 'The Potato Eaters' AND p2.title != 'The Potato Eaters' -> AND p1.a_id = p2.a_id; +--------------+ | title | +--------------+ | Starry Night | | The Rocks | +--------------+
A more general way to exclude the reference value without naming it literally is to specify that you don't want output rows to have the same title as the reference, whatever that title happens to be:
mysql> SELECT p2.title -> FROM painting AS p1, painting AS p2 -> WHERE p1.title = 'The Potato Eaters' AND p1.title != p2.title -> AND p1.a_id = p2.a_id; +--------------+ | title | +--------------+ | Starry Night | | The Rocks | +--------------+
The preceding queries use comparisons of ID values to match records in the two table instances, but any kind of value can be used. For example, to use the states table to answer the question "Which states joined the Union in the same year as New York?," perform a temporal pairwise comparison based on the year part of the dates in the table's statehood column:
mysql> SELECT s2.name, s2.statehood -> FROM states AS s1, states AS s2 -> WHERE s1.name = 'New York' -> AND YEAR(s1.statehood) = YEAR(s2.statehood) -> ORDER BY s2.name; +----------------+------------+ | name | statehood | +----------------+------------+ | Connecticut | 1788-01-09 | | Georgia | 1788-01-02 | | Maryland | 1788-04-28 | | Massachusetts | 1788-02-06 | | New Hampshire | 1788-06-21 | | New York | 1788-07-26 | | South Carolina | 1788-05-23 | | Virginia | 1788-06-25 | +----------------+------------+
Here again, the reference value (New York) appears in the output. If you want to prevent that, add an expression to the WHERE clause that explicitly excludes the reference:
mysql> SELECT s2.name, s2.statehood -> FROM states AS s1, states AS s2 -> WHERE s1.name = 'New York' AND s1.name != s2.name -> AND YEAR(s1.statehood) = YEAR(s2.statehood) -> ORDER BY s2.name; +----------------+------------+ | name | statehood | +----------------+------------+ | Connecticut | 1788-01-09 | | Georgia | 1788-01-02 | | Maryland | 1788-04-28 | | Massachusetts | 1788-02-06 | | New Hampshire | 1788-06-21 | | South Carolina | 1788-05-23 | | Virginia | 1788-06-25 | +----------------+------------+
Like the problem of finding other paintings by the painter of "The Potato Eaters," the statehood problem could have been solved by using a SQL variable and two queries. That will always be true when you're seeking matches for one particular row in your table. Other problems require finding matches between several pairs of rows, in which case the two-query method will not work. Suppose you want to determine which pairs of people listed in the profile table have favorite foods in common. In this case, the output potentially can include any pair of people in the table. There is no fixed reference value, so you cannot store the reference in a variable.
A self-join is perfect for this problem, although there is the question of how to identify which foods values share common elements. The foods column contains SET values, each of which may indicate multiple foods, so an exact comparison will not work:
- The comparison is true only if both foods values name an identical set of foods; this is unsuitable if you require only a common element.
- Two empty values will compare as equal, even though they have no foods in common.
To identify SET values that share common elements, use the fact that MySQL represents them as bit fields and perform the comparison using the & (bitwise AND) operator to look for pairs that have a non-zero intersection:
mysql> SELECT t1.name, t2.name, t1.foods, t2.foods -> FROM profile AS t1, profile AS t2 -> WHERE t1.id != t2.id AND (t1.foods & t2.foods) != 0 -> ORDER BY t1.name, t2.name; +------+------+----------------------+----------------------+ | name | name | foods | foods | +------+------+----------------------+----------------------+ | Alan | Brit | curry,fadge | burrito,curry,pizza | | Alan | Fred | curry,fadge | lutefisk,fadge,pizza | | Alan | Mara | curry,fadge | lutefisk,fadge | | Alan | Sean | curry,fadge | burrito,curry | | Brit | Alan | burrito,curry,pizza | curry,fadge | | Brit | Carl | burrito,curry,pizza | eggroll,pizza | | Brit | Fred | burrito,curry,pizza | lutefisk,fadge,pizza | | Brit | Sean | burrito,curry,pizza | burrito,curry | | Carl | Brit | eggroll,pizza | burrito,curry,pizza | | Carl | Fred | eggroll,pizza | lutefisk,fadge,pizza | | Fred | Alan | lutefisk,fadge,pizza | curry,fadge | | Fred | Brit | lutefisk,fadge,pizza | burrito,curry,pizza | | Fred | Carl | lutefisk,fadge,pizza | eggroll,pizza | | Fred | Mara | lutefisk,fadge,pizza | lutefisk,fadge | | Mara | Alan | lutefisk,fadge | curry,fadge | | Mara | Fred | lutefisk,fadge | lutefisk,fadge,pizza | | Sean | Alan | burrito,curry | curry,fadge | | Sean | Brit | burrito,curry | burrito,curry,pizza | +------+------+----------------------+----------------------+
Some self-join problems are of the "Which values have no match?" variety. An instance of this is the question, "Which message senders in the mail table didn't send any messages to themselves?" First, check who sent mail to who:
mysql> SELECT DISTINCT srcuser, dstuser FROM mail -> ORDER BY srcuser, dstuser; +---------+---------+ | srcuser | dstuser | +---------+---------+ | barb | barb | | barb | tricia | | gene | barb | | gene | gene | | gene | tricia | | phil | barb | | phil | phil | | phil | tricia | | tricia | gene | | tricia | phil | +---------+---------+
Of those pairs, several are for people that did send mail to themselves:
mysql> SELECT DISTINCT srcuser, dstuser FROM mail -> WHERE srcuser = dstuser; +---------+---------+ | srcuser | dstuser | +---------+---------+ | phil | phil | | barb | barb | | gene | gene | +---------+---------+
Finding people who didn't send mail to themselves is a "non-match" problem, which is the type of problem that typically involves a LEFT JOIN. In this case, the solution requires a LEFT JOIN of the mail table to itself:
mysql> SELECT DISTINCT m1.srcuser -> FROM mail AS m1 LEFT JOIN mail AS m2 -> ON m1.srcuser = m2.srcuser AND m2.srcuser = m2.dstuser -> WHERE m2.dstuser IS NULL; +---------+ | srcuser | +---------+ | tricia | +---------+
For each record in the mail table, the query selects matches where the sender and recipient are the same. For records having no such match, the LEFT JOIN forces the output to contain a row anyway, with all the m2 columns set to NULL. These rows identify the senders who sent no messages to themselves.
Using a LEFT JOIN to join a table to itself also provides another way to answer maximum-per-group questions of the sort discussed in Recipe 12.7, but without using a secondary temporary table. Recall that in that recipe we found the most expensive painting per artist as follows using a temporary table:
mysql> CREATE TABLE tmp -> SELECT a_id, MAX(price) AS max_price FROM painting GROUP BY a_id; mysql> SELECT artist.name, painting.title, painting.price -> FROM artist, painting, tmp -> WHERE painting.a_id = tmp.a_id -> AND painting.price = tmp.max_price -> AND painting.a_id = artist.a_id; +----------+-------------------+-------+ | name | title | price | +----------+-------------------+-------+ | Da Vinci | The Mona Lisa | 87 | | Van Gogh | The Potato Eaters | 67 | | Renoir | Les Deux Soeurs | 64 | +----------+-------------------+-------+
Another way to identify the paintings and then pull out values from each of those rows is with a LEFT JOIN. The following query identifies the paintings:
mysql> SELECT p1.a_id, p1.title, p1.price -> FROM painting p1 -> LEFT JOIN painting p2 -> ON p1.a_id = p2.a_id AND p1.price < p2.price -> WHERE p2.a_id IS NULL; +------+-------------------+-------+ | a_id | title | price | +------+-------------------+-------+ | 1 | The Mona Lisa | 87 | | 3 | The Potato Eaters | 67 | | 5 | Les Deux Soeurs | 64 | +------+-------------------+-------+
To display the artist names, join the result with the artist table:
mysql> SELECT artist.name, p1.title, p1.price -> FROM (painting p1 -> LEFT JOIN painting p2 -> ON p1.a_id = p2.a_id AND p1.price < p2.price), artist -> WHERE p2.a_id IS NULL AND p1.a_id = artist.a_id; +----------+-------------------+-------+ | name | title | price | +----------+-------------------+-------+ | Da Vinci | The Mona Lisa | 87 | | Van Gogh | The Potato Eaters | 67 | | Renoir | Les Deux Soeurs | 64 | +----------+-------------------+-------+
Note that a given "compare a table to itself" problem does not necessarily require a self-join, even if it's possible to solve it that way. The mail table serves to illustrate this. One way to determine which senders sent themselves a message is to use a self-join:
mysql> SELECT DISTINCT m1.srcuser, m2.dstuser -> FROM mail AS m1, mail AS m2 -> WHERE m1.srcuser = m2.srcuser AND m2.dstuser = m1.srcuser; +---------+---------+ | srcuser | dstuser | +---------+---------+ | phil | phil | | barb | barb | | gene | gene | +---------+---------+
But that's silly. The query doesn't need to compare records to each other. It needs only to compare different columns within each row, so a non-join query is sufficient, and simpler to write:
mysql> SELECT DISTINCT srcuser, dstuser FROM mail
-> WHERE srcuser = dstuser;
+---------+---------+
| srcuser | dstuser |
+---------+---------+
| phil | phil |
| barb | barb |
| gene | gene |
+---------+---------+
Категории