Dividing a Summary into Subgroups

7.8.1 Problem

You want to calculate a summary for each subgroup of a set of rows, not an overall summary value.

7.8.2 Solution

Use a GROUP BY clause to arrange rows into groups.

7.8.3 Discussion

The summary queries shown so far calculate summary values over all rows in the result set. For example, the following query determines the number of daily driving records in the driver_log table, and thus the total number of days that drivers were on the road:

mysql> SELECT COUNT(*) FROM driver_log;
+----------+
| COUNT(*) |
+----------+
|       10 |
+----------+

But sometimes it's desirable to break a set of rows into subgroups and summarize each group. This is done by using aggregate functions in conjunction with a GROUP BY clause. To determine the number of days driven by each driver, group the rows by driver name, count how many rows there are for each name, and display the names with the counts:

mysql> SELECT name, COUNT(name) FROM driver_log GROUP BY name;
+-------+-------------+
| name  | COUNT(name) |
+-------+-------------+
| Ben   |           3 |
| Henry |           5 |
| Suzi  |           2 |
+-------+-------------+
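One detail worth keeping in mind when choosing what to count: COUNT(name) counts only the non-NULL name values in each group, whereas COUNT(*) counts rows. The following sketch shows the two side by side. The column aliases are hypothetical names chosen for illustration. For driver_log, where every row has a name, both columns should come out the same.

```sql
-- n_rows and n_names are illustrative aliases, not names from the original text.
SELECT name,
       COUNT(*)    AS n_rows,   -- every row in the group
       COUNT(name) AS n_names   -- only rows whose name is not NULL
FROM driver_log
GROUP BY name;
```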

That query summarizes the same column used for grouping (name), but that's not always necessary. Suppose you want a quick characterization of the driver_log table, showing for each person listed in it the total number of miles driven and the average number of miles per day. In this case, you still use the name column to place the rows in groups, but the summary functions operate on the miles values:

mysql> SELECT name,
    -> SUM(miles) AS 'total miles',
    -> AVG(miles) AS 'miles per day'
    -> FROM driver_log GROUP BY name;
+-------+-------------+---------------+
| name  | total miles | miles per day |
+-------+-------------+---------------+
| Ben   |         362 |      120.6667 |
| Henry |         911 |      182.2000 |
| Suzi  |         893 |      446.5000 |
+-------+-------------+---------------+

Use as many grouping columns as necessary to achieve as fine-grained a summary as you require. The following query produces a coarse summary showing how many messages were sent by each message sender listed in the mail table:

mysql> SELECT srcuser, COUNT(*) FROM mail
    -> GROUP BY srcuser;
+---------+----------+
| srcuser | COUNT(*) |
+---------+----------+
| barb    |        3 |
| gene    |        6 |
| phil    |        5 |
| tricia  |        2 |
+---------+----------+

To be more specific and find out how many messages each sender sent from each host, use two grouping columns. This produces a result with nested groups (groups within groups):

mysql> SELECT srcuser, srchost, COUNT(*) FROM mail
    -> GROUP BY srcuser, srchost;
+---------+---------+----------+
| srcuser | srchost | COUNT(*) |
+---------+---------+----------+
| barb    | saturn  |        2 |
| barb    | venus   |        1 |
| gene    | mars    |        2 |
| gene    | saturn  |        2 |
| gene    | venus   |        2 |
| phil    | mars    |        3 |
| phil    | venus   |        2 |
| tricia  | mars    |        1 |
| tricia  | saturn  |        1 |
+---------+---------+----------+

Getting Distinct Values Without Using DISTINCT

If you use GROUP BY without selecting the value of any aggregate functions, you achieve the same effect as DISTINCT without using DISTINCT explicitly:

mysql> SELECT name FROM driver_log GROUP BY name;
+-------+
| name  |
+-------+
| Ben   |
| Henry |
| Suzi  |
+-------+

Normally with this kind of query you'd select a summary value (for example, by invoking COUNT(name) to count the instances of each name), but it's legal not to. The net effect is to produce a list of the unique grouped values. I prefer to use DISTINCT, because it makes the point of the query more obvious. (Internally, MySQL actually maps the DISTINCT form of the query onto the GROUP BY form.)
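For comparison, the DISTINCT form of the same query might look like the following. The ORDER BY clause is an addition for illustration, because DISTINCT by itself does not promise any particular row order.

```sql
-- Equivalent list of unique driver names, written with DISTINCT.
-- ORDER BY is added here only to make the row order explicit.
SELECT DISTINCT name FROM driver_log ORDER BY name;
```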

The preceding examples in this section have used COUNT( ), SUM( ) and AVG( ) for per-group summaries. You can use MIN( ) or MAX( ), too. With a GROUP BY clause, they will tell you the smallest or largest value per group. The following query groups mail table rows by message sender, displaying for each one the size of the largest message sent and the date of the most recent message:

mysql> SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser;
+---------+-----------+---------------------+
| srcuser | MAX(size) | MAX(t)              |
+---------+-----------+---------------------+
| barb    |     98151 | 2001-05-14 14:42:21 |
| gene    |    998532 | 2001-05-19 22:21:51 |
| phil    |     10294 | 2001-05-17 12:49:23 |
| tricia  |   2394482 | 2001-05-14 17:03:01 |
+---------+-----------+---------------------+

You can group by multiple columns and display a maximum for each combination of values in those columns. This query finds the size of the largest message sent between each pair of sender and recipient values listed in the mail table:

mysql> SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser;
+---------+---------+-----------+
| srcuser | dstuser | MAX(size) |
+---------+---------+-----------+
| barb    | barb    |     98151 |
| barb    | tricia  |     58274 |
| gene    | barb    |      2291 |
| gene    | gene    |     23992 |
| gene    | tricia  |    998532 |
| phil    | barb    |     10294 |
| phil    | phil    |      1048 |
| phil    | tricia  |      5781 |
| tricia  | gene    |    194925 |
| tricia  | phil    |   2394482 |
+---------+---------+-----------+

When using aggregate functions to produce per-group summary values, watch out for the following trap. Suppose you want to know the longest trip per driver in the driver_log table. That's produced by this query:

mysql> SELECT name, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+--------------+
| name  | longest trip |
+-------+--------------+
| Ben   |          152 |
| Henry |          300 |
| Suzi  |          502 |
+-------+--------------+

But what if you also want to show the date on which each driver's longest trip occurred? Can you just add trav_date to the output column list? Sorry, that won't work:

mysql> SELECT name, trav_date, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | trav_date  | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 |          152 |
| Henry | 2001-11-29 |          300 |
| Suzi  | 2001-11-29 |          502 |
+-------+------------+--------------+

The query does produce a result, but if you compare it to the full table (shown below), you'll see that although the dates for Ben and Henry are correct, the date for Suzi is not:

+--------+-------+------------+-------+
| rec_id | name  | trav_date  | miles |
+--------+-------+------------+-------+
|      1 | Ben   | 2001-11-30 |   152 |  <-- Ben's longest trip
|      2 | Suzi  | 2001-11-29 |   391 |
|      3 | Henry | 2001-11-29 |   300 |  <-- Henry's longest trip
|      4 | Henry | 2001-11-27 |    96 |
|      5 | Ben   | 2001-11-29 |   131 |
|      6 | Henry | 2001-11-26 |   115 |
|      7 | Suzi  | 2001-12-02 |   502 |  <-- Suzi's longest trip
|      8 | Henry | 2001-12-01 |   197 |
|      9 | Ben   | 2001-12-02 |    79 |
|     10 | Henry | 2001-11-30 |   203 |
+--------+-------+------------+-------+

So what's going on? Why does the summary query produce incorrect results? This happens because when you include a GROUP BY clause in a query, the only values you can meaningfully select are the grouped columns and summary values calculated over each group. If you display additional columns, they're not tied to the grouped columns and the values displayed for them are indeterminate. (For the query just shown, it appears that MySQL may simply be picking the first date for each driver, whether or not it matches the driver's maximum mileage value.)
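If your server supports the ONLY_FULL_GROUP_BY SQL mode (an assumption about your MySQL version; the mode is enabled by default as of MySQL 5.7.5), you can have MySQL refuse such queries rather than return indeterminate values. The following is a minimal sketch of that idea, not something the original recipe relies on:

```sql
-- Assumption: the server supports the ONLY_FULL_GROUP_BY SQL mode.
-- Note that this statement replaces any other modes set for the session.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- trav_date is neither a grouping column nor an argument to an aggregate
-- function, so with the mode enabled the server should reject this
-- statement with an error instead of choosing an arbitrary date per driver.
SELECT name, trav_date, MAX(miles) AS 'longest trip'
FROM driver_log GROUP BY name;
```

Seeing an error here is the intended outcome: it flags the trap at the time the query is written rather than leaving it to be discovered in the results.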

The general solution to the problem of displaying contents of rows associated with minimum or maximum group values involves a join. The technique is described in Chapter 12. If you don't want to read ahead, or you don't want to use another table, consider using the MAX-CONCAT trick described earlier. It produces the correct result, although the query is fairly ugly:

mysql> SELECT name,
    -> SUBSTRING(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),4) AS date,
    -> LEFT(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),3) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | date       | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 | 152          |
| Henry | 2001-11-29 | 300          |
| Suzi  | 2001-12-02 | 502          |
+-------+------------+--------------+
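For comparison, the join-based approach mentioned above can be sketched with a subquery in the FROM clause. This is only an illustration, not the technique as Chapter 12 presents it, and it assumes a MySQL version that supports subqueries (4.1 or later). The aliases d, m, and max_miles are hypothetical names introduced for the example.

```sql
-- Step 1 (inner query): find each driver's maximum mileage.
-- Step 2 (join): match those maximums back to the full rows
-- to recover the date on which each longest trip occurred.
SELECT d.name, d.trav_date, d.miles AS 'longest trip'
FROM driver_log AS d
INNER JOIN (
    SELECT name, MAX(miles) AS max_miles
    FROM driver_log
    GROUP BY name
) AS m
    ON d.name = m.name AND d.miles = m.max_miles
ORDER BY d.name;
```

If a driver has two trips tied for the longest distance, this form returns one row for each of them, whereas the MAX-CONCAT form returns only a single row per driver.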
