Calculating Differences Between Successive Rows
12.13.1 Problem
You have a table containing successive cumulative values in its rows and you want to compute the differences between pairs of successive rows.
12.13.2 Solution
Use a self-join that matches up pairs of adjacent rows and calculates the differences between members of each pair.
12.13.3 Discussion
Self-joins are useful when you have a set of absolute (or cumulative) values that you want to convert to relative values representing the differences between successive pairs of rows. For example, if you take an automobile trip and write down the total miles traveled at each stopping point, you can compute the difference between successive points to determine the distance from one stop to the next. Here is such a table that shows the stops for a trip from San Antonio, Texas to Madison, Wisconsin. Each row shows the total miles driven as of each stop:
mysql> SELECT seq, city, miles FROM trip_log ORDER BY seq; +-----+------------------+-------+ | seq | city | miles | +-----+------------------+-------+ | 1 | San Antonio, TX | 0 | | 2 | Dallas, TX | 263 | | 3 | Benton, AR | 566 | | 4 | Memphis, TN | 745 | | 5 | Portageville, MO | 878 | | 6 | Champaign, IL | 1164 | | 7 | Madison, WI | 1412 | +-----+------------------+-------+
A self-join can convert these cumulative values to successive differences that represent the distances from each city to the next. The following query shows how to use the sequence numbers in the records to match up pairs of successive rows and compute the differences between each pair of mileage values:
mysql> SELECT t1.seq AS seq1, t2.seq AS seq2, -> t1.city AS city1, t2.city AS city2, -> t1.miles AS miles1, t2.miles AS miles2, -> t2.miles-t1.miles AS dist -> FROM trip_log AS t1, trip_log AS t2 -> WHERE t1.seq+1 = t2.seq -> ORDER BY t1.seq; +------+------+------------------+------------------+--------+--------+------+ | seq1 | seq2 | city1 | city2 | miles1 | miles2 | dist | +------+------+------------------+------------------+--------+--------+------+ | 1 | 2 | San Antonio, TX | Dallas, TX | 0 | 263 | 263 | | 2 | 3 | Dallas, TX | Benton, AR | 263 | 566 | 303 | | 3 | 4 | Benton, AR | Memphis, TN | 566 | 745 | 179 | | 4 | 5 | Memphis, TN | Portageville, MO | 745 | 878 | 133 | | 5 | 6 | Portageville, MO | Champaign, IL | 878 | 1164 | 286 | | 6 | 7 | Champaign, IL | Madison, WI | 1164 | 1412 | 248 | +------+------+------------------+------------------+--------+--------+------+
The presence of the seq column in the trip_log table is important for calculating successive difference values. It's needed for establishing which row precedes another and matching each row n with row n+1. The implication is that a table should include a sequence column that has no gaps if you want to perform relative-difference calculations from absolute or cumulative values. If the table contains a sequence column but there are gaps, renumber it. If the table contains no such column, add one. Recipe 11.9 and Recipe 11.13 describe how to perform these operations.
A somewhat more complex situation occurs when you compute successive differences for more than one column and use the results in a calculation. The following table, player_stats, shows some cumulative numbers for a baseball player at the end of each month of his season. ab indicates the total at-bats and h the total hits the player has had as of a given date. (The first record indicates the starting point of the player's season, which is why the ab and h values are zero.)
mysql> SELECT id, date, ab, h, TRUNCATE(IFNULL(h/ab,0),3) AS ba -> FROM player_stats ORDER BY id; +----+------------+-----+----+-------+ | id | date | ab | h | ba | +----+------------+-----+----+-------+ | 1 | 2001-04-30 | 0 | 0 | 0.000 | | 2 | 2001-05-31 | 38 | 13 | 0.342 | | 3 | 2001-06-30 | 109 | 31 | 0.284 | | 4 | 2001-07-31 | 196 | 49 | 0.250 | | 5 | 2001-08-31 | 304 | 98 | 0.322 | +----+------------+-----+----+-------+
The last column of the query result also shows the player's batting average as of each date. This column is not stored in the table, but is easily computed as the ratio of hits to at-bats. The result provides a general idea of how the player's hitting performance changed over the course of the season, but it doesn't give a very informative picture of how the player did during each individual month. To determine that, it's necessary to calculate relative differences between pairs of rows. This is easily done with a self-join that matches each row n with row n+1, to calculate differences between pairs of at-bats and hits values. These differences allow batting average during each month to be computed:
mysql> SELECT -> t1.id AS id1, t2.id AS id2, -> t2.date, -> t1.ab AS ab1, t2.ab AS ab2, -> t1.h AS h1, t2.h AS h2, -> t2.ab-t1.ab AS abdiff, -> t2.h-t1.h AS hdiff, -> TRUNCATE(IFNULL((t2.h-t1.h)/(t2.ab-t1.ab),0),3) AS ba -> FROM player_stats AS t1, player_stats AS t2 -> WHERE t1.id+1 = t2.id -> ORDER BY t1.id; +-----+-----+------------+-----+-----+----+----+--------+-------+-------+ | id1 | id2 | date | ab1 | ab2 | h1 | h2 | abdiff | hdiff | ba | +-----+-----+------------+-----+-----+----+----+--------+-------+-------+ | 1 | 2 | 2001-05-31 | 0 | 38 | 0 | 13 | 38 | 13 | 0.342 | | 2 | 3 | 2001-06-30 | 38 | 109 | 13 | 31 | 71 | 18 | 0.253 | | 3 | 4 | 2001-07-31 | 109 | 196 | 31 | 49 | 87 | 18 | 0.206 | | 4 | 5 | 2001-08-31 | 196 | 304 | 49 | 98 | 108 | 49 | 0.453 | +-----+-----+------------+-----+-----+----+----+--------+-------+-------+
These results show much more clearly than the original table does that the player started off well, but had a slump in the middle of the season, particularly in July. They also indicate just how strong his performance was in August.