Hack 78. Break It Down by Range
Some reports work better if you can reduce precision. If your data has too much detail you can group values into buckets.
Suppose you have survey data showing consumer spending in some sector. You have the age and the amount spent for every individual surveyed, as shown in Table 10-3.
id | age | spend |
---|---|---|
1 | 34 | 100 |
2 | 31 | 110 |
3 | 24 | 140 |
... |
If you want to see how much different age groups spend, you have too much detail. For example, knowing the average spend for 34-year-olds is too fine-grained, but knowing the average spend for thirty-somethings may be useful.
To do this you need to group together all the individuals in specified ranges, and put each row into a "bucket."
10.2.1. Reduce the Precision of a Number
You can remove significant digits of a number using the ROUND function. ROUND allows you to specify the number of decimal places. When you specify a positive number of decimal places to the right of the decimal pointfor example, ROUND(3.14, 1)you get what you normally think of as a rounded number, 3.1. If you specify a negative number of decimal places it will round to the corresponding power of 10. For example, ROUND(1234, -2) gives 1200:
mysql> SELECT id, age, ROUND(age,-1) FROM population; +----+------+---------------+ | id | age | ROUND(age,-1) | +----+------+---------------+ | 1 | 34 | 30 | | 2 | 39 | 40 | | 3 | 35 | 40 | ...
Notice that the 34-year-old has been approximated to 30 and 39 has been put in the 40 bucket.
With every member of the population grouped by decade you can find the average across each age group:
mysql> SELECT ROUND(age,-1) AS low, -> AVG(spend) AS avgSpend -> FROM population -> GROUP BY ROUND(age,-1); +------+----------+ | low | avgSpend | +------+----------+ | 20 | 147.5000 | | 30 | 107.5000 | | 40 | 130.0000 | +------+----------+
The low value, 20, represents the ages from 1524. You can display the range by building up a string. In MySQL, you can use the CONCAT function:
mysql> SELECT CONCAT(low-5,'-',low+4) AS the_range -> ,avgSpend -> FROM(SELECT ROUND(age,-1) AS low, -> AVG(spend) AS avgSpend -> FROM population -> GROUP BY ROUND(age,-1)) t; +-------+----------+ | range | avgSpend | +-------+----------+ | 15-24 | 150.0000 | | 25-34 | 100.0000 | | 35-44 | 125.0000 | +-------+----------+
In Oracle and PostgreSQL, you use the || operator to concatenate strings:
SELECT (low-5) || '-' || (low+4) AS the_range ,avgSpend FROM(SELECT ROUND(age,-1) AS low, AVG(spend) AS avgSpend FROM population GROUP BY ROUND(age,-1)) t
In SQL Server, you should explicitly convert the numbers into strings with the STR function before you apply the + operator to concatenate:
SELECT STR(low-5) + '-' + STR(low+4) AS the_range ,avgSpend FROM(SELECT ROUND(age,-1) AS low, AVG(spend) AS avgSpend FROM population GROUP BY ROUND(age,-1)) t
This is fine if the range of each bucket is a power of 10, because you can round to the decimal places 1, 2, 3, and so on. You can get nondecimal-size buckets too. To see values to the nearest multiple of 5, for example, you can divide by 5, ROUND to an integer, and then multiply back by 5.
It's easier to work with the FLOOR function. FLOOR simply discards any fractional part, so numbers (including negative numbers) are always rounded down:
mysql> SELECT age, 5*FLOOR(age/5) AS valueBucket -> , CONCAT(5*FLOOR(age/5),'-',5*FLOOR(age/5)+4) -> AS the_range -> FROM population; +------+-------------+-------+ | age | valueBucket | range | +------+-------------+-------+ | 34 | 30 | 30-34 | | 39 | 35 | 35-39 | | 35 | 35 | 35-39 | | 24 | 20 | 20-24 |
10.2.2. Reduce the Precision of a Date
Reducing the precision of a date requires different techniques on different platforms. If you need to get the date of the first of the month from a given date (the whn column) you can use the following.
10.2.2.1. MySQL
Here's how to do this in MySQL:
mysql> SELECT whn, -> DATE_FORMAT(whn,'%Y-%m') AS YearAndMonth, -> CAST(DATE_FORMAT(whn,'%Y-%m-01') AS DATE) AS FirstOfMnth -> FROM dates; +------------+--------------+-------------+ | whn | YearAndMonth | FirstOfMnth | +------------+--------------+-------------+ | 2006-12-05 | 2006-12 | 2006-12-01 | | 2007-01-01 | 2007-01 | 2007-01-01 | | 2007-02-05 | 2007-02 | 2007-02-01 | +------------+--------------+-------------+
10.2.2.2. Oracle
For Oracle, you'd do the following:
SELECT whn, TO_CHAR(whn,'YYYY-MM') AS YearAndMonth, CAST(DATE_FORMAT(whn,'YYYY-MM-01') AS DATE) AS FirstOfMonth FROM dates;
10.2.2.3. SQL Server
In SQL Server, you can convert the date into a string using the CONVERT function. One of the most useful formats available is the ODBC canonical format: yyyy-mm-dd hh:mi:ss. The first seven characters of this format include only the year and month, so you could use CONVERT(CHAR(7),whn,120). This relies on the fact that 120 is the code to use for the ODBC canonical format:
SELECT whn, CONVERT(CHAR(7),whn,120) AS YearAndMonth, CONVERT(DATETIME,CONVERT(CHAR(7),whn,120)+'-01') AS FirstOfMonth FROM dates;
10.2.2.4. PostgreSQL
Here's how to issue the query in PostgreSQL:
SELECT whn, EXTRACT(YEAR FROM whn)||'-'||EXTRACT(MONTH FROM whn) AS YearAndMonth, DATE_TRUNC('month'whn) AS FirstOfMonth FROM dates;
10.2.3. Hacking the Hack
If you want more control over the buckets, you can create tables to represent custom ranges of any size.
Suppose that your source data includes sales with dates and values. If you want to see how the values of sales change over time, you can create custom buckets for dates and for values. These buckets do not have to be the same size: in many lines of business, the Christmas/New Year season is short but highly profitable. You might want to compare a few weeks' worth of sales in the holiday period against several months' worth of sales during a quieter spell.
You can name each range, and these names (see Table 10-4) will show up as the bucket names.
id | beginDate | endDate |
---|---|---|
Autumn 2006 | 2006-09-01 | 2006-12-10 |
Christmas 2006 | 2006-12-11 | 2007-01-05 |
You can decide when the seasons begin and end and you can decide what constitutes small, medium, and large, as shown in Table 10-5.
id | beginValue | endValue |
---|---|---|
Small (under 15) | 0.00 | 14.99 |
Medium (15 to 25) | 15.00 | 24.99 |
Large (over 25) | 25.00 | 1E6 |
You can now put the values into the value buckets and the dates into the season buckets. Suppose that your source data includes a row for every sale:
mysql> SELECT valueBucket.id AS vBucket, seasonBucket.id AS season, -> COUNT(value) AS NumSales -> FROM source -> JOIN valueBucket ON value BETWEEN beginVal AND endVal -> JOIN seasonBucket ON whn BETWEEN beginDate AND endDate -> GROUP BY valueBucket.id, seasonBucket.id; +-------------------+--------------+-----------+ | vBucket | season | NumSales | +-------------------+--------------+-----------+ | Large (over 25) | Autumn 2006 | 20 | | Large (over 25) | Winter 06/07 | 15 | | Medium (15 to 25) | Autumn 2006 | 50 | | Medium (15 to 25) | Winter 06/07 | 66 | | Small (under 15) | Autumn 2006 | 104 | | Small (under 15) | Winter 06/07 | 150 | +-------------------+--------------+-----------+
You need to perform this kind of quantization if you want to display your data effectively in a pivot table [Hack #77], in a bar chart [Hack #35], or in a pie chart [Hack #42].