Aggregates
PostgreSQL offers a number of aggregate functions. An aggregate is a collection of thingsyou can think of an aggregate as the set of rows returned by a query. An aggregate function is a function that operates on an aggregate (nonaggregate functions operate on a single row within an aggregate). Most of the aggregate functions operate on a single value extracted from each rowthis is called an aggregate expression.
COUNT()
COUNT() is probably the simplest aggregate function. COUNT() returns the number of objects in an aggregate. The COUNT() function comes in four forms:
- COUNT(*)
- COUNT( expression )
- COUNT( ALL expression )
- COUNT( DISTINCT expression )
In the first form, COUNT(*) returns the number of rows in an aggregate:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 5 | Funkmaster, Freddy | 555-FUNK | | 7 | Gull, Jonathon LC | 555-1111 | 1984-02-05 | 8 | Grumby, Jonas | 555-2222 | 1984-02-21 | (7 rows) movies=# SELECT COUNT(*) FROM customers; count ------- 7 (1 row) movies=# SELECT COUNT(*) FROM customers WHERE customer_id < 5; count ------- 4 (1 row)
You can see from this example that the COUNT(*) function pays attention to the WHERE clause. In other words, COUNT(*) returns the number of rows that filter through the WHERE clause; that is, the number of rows in the aggregate.
In the second form, COUNT( expression ) returns the number of non-NULL values in the aggregate. For example, you might want to know how many customers have a non-NULL balance:
movies=# SELECT COUNT( balance ) FROM customers; count ------- 4 (1 row) movies=# SELECT COUNT(*) - COUNT( balance ) FROM customers; ?column? ---------- 3 (1 row)
The first query returns the number of non-NULL balances in the customers table. The second query returns the number of NULL balances.
The third form, COUNT( ALL expression ) is equivalent to the second form. PostgreSQL includes the third form for completeness; it complements the fourth form.
COUNT( DISTINCT expression ) returns the number of distinct non-NULL values in the aggregate.
movies=# SELECT DISTINCT balance FROM customers; balance --------- 0.00 3.00 15.00 (4 rows) movies=# SELECT COUNT( DISTINCT balance ) FROM customers; count ------- 3 (1 row)
You might notice a surprising result in that last example. The first query returns the distinct balances in the customers table. Notice that PostgreSQL tells you that it returned four rowsthere are four distinct values. The second query returns a count of the distinct balancesit says that there are only three.
Is this a bug? No, both queries returned the correct information. The first query includes the NULL value in the result set. COUNT(), and in fact all the aggregate functions (except for COUNT(*)), ignore NULL values.
SUM()
The SUM( expression ) function returns the sum of all the values in the aggregate expression. Unlike COUNT(), you can't use SUM() on entire rows[12]. Instead, you usually specify a single column:
[12] Actually, you can SUM(*), but it probably doesn't do what you would expect. SUM(*) is equivalent to COUNT(*).
movies=# SELECT SUM( balance ) FROM customers; sum ------- 18.00 (1 row)
Notice that the SUM() function expects an expression. The name of a numeric column is a valid expression. You can also specify an arbitrarily complex expression as long as that expression results in a numeric value.
You can also SUM() an aggregate of intervals. For example, the following query tells you how long it would take to watch all the tapes in your video store:
movies=# SELECT SUM( duration ) FROM tapes; sum -------------- 4 days 03:12 (1 row)
AVG()
The AVG( expression ) function returns the average of an aggregate expression. Like SUM(), you can find the average of a numeric aggregate or an interval aggregate.
movies=# SELECT AVG( balance ) FROM customers; avg -------------- 4.5000000000 (1 row) movies=# SELECT AVG( balance ) FROM customers movies-# WHERE balance IS NOT NULL; avg -------------- 4.5000000000 (1 row)
These queries demonstrate an important point: the aggregate functions completely ignore rows where the aggregate expression evaluates to NULL. The aggregate produced by the second query explicitly omits any rows where the balance is NULL. The aggregate produced by the first query implicitly omits NULL balances. In other words, the following queries are equivalent:
SELECT AVG( balance ) FROM customers; SELECT AVG( balance ) FROM customers WHERE balance IS NOT NULL; SELECT SUM( balance ) / COUNT( balance ) FROM customers;
But these queries are not equivalent:
SELECT AVG( balance ) FROM customers; SELECT SUM( balance ) / COUNT( * ) FROM customers;
Why not? Because COUNT( * ) counts all rows whereas COUNT( balance ) omits any rows where the balance is NULL.
MIN() and MAX()
The MIN( expression ) and MAX( expression ) functions return the minimum and maximum values, respectively, of an aggregate expression. The MIN() and MAX() functions can operate on numeric, date/time, or string aggregates:
movies=# SELECT MIN( balance ), MAX( balance ) FROM customers; min | max ------+------- 0.00 | 15.00 (1 row) movies=# SELECT MIN( birth_date ), MAX( birth_date ) FROM customers; min | max ------------+------------ 1968-01-21 | 1984-02-21 (1 row) movies=# SELECT MIN( customer_name ), MAX( customer_name ) movies-# FROM customers; min | max --------------------+---------------------- Funkmaster, Freddy | Wonderland, Alice N. (1 row)
Other Aggregate Functions
In addition to COUNT(), SUM(), AVG(), MIN(), and MAX(), PostgreSQL also supports the STDDEV( expression ) and VARIANCE( expression ) aggregate functions. These last two aggregate functions compute the standard deviation and variance of an aggregate, two common statistical measures of variation within a set of observations.
Grouping Results
The aggregate functions are useful for summarizing information. The result of an aggregate function is a single value. Sometimes, you really want an aggregate function to apply to each of a number of subsets of your data. For example, you may find it interesting to compute some demographic information about your customer base. Let's first look at the entire customers table:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 5 | Funkmaster, Freddy | 555-FUNK | | 7 | Gull, Jonathon LC | 555-1111 | 1984-02-05 | 8 | Grumby, Jonas | 555-2222 | 1984-02-21 | (7 rows)
Look at the birth_date columnnotice that you have customers born in three distinct decades (four if you count NULL as a decade):
movies=# SELECT DISTINCT( EXTRACT( DECADE FROM birth_date )) movies-# FROM customers; date_part ----------- 196 197 198 (4 rows)
The EXtrACT() function extracts a date component from a date/time value. The DECADE component looks a little strange, but it makes sense to know whether the decade of the '60s refers to the 1960s or the 2060s, now that we are past Y2K.
Now that you know how many decades are represented in your customer base, you might next want to know how many customers were born in each decade. The GROUP BY clause helps answer this kind of question:
movies=# SELECT COUNT(*), EXTRACT( DECADE FROM birth_date ) movies-# FROM customers movies-# GROUP BY EXTRACT( DECADE FROM birth_date ); count | date_part -------+----------- 2 | 196 2 | 197 2 | 198 1 | (4 rows)
The GROUP BY clause is used with aggregate functions. PostgreSQL sorts the result set by the GROUP BY expression and applies the aggregate function to each group.
There is an easier way to build this query. The problem with this query is that you had to repeat the EXtrACT( DECADE FROM birth_date ) phrase. Instead, you can use the AS clause to name the decade field, and then you can refer to that field by name in the GROUP BY clause:
movies=# SELECT COUNT(*), EXTRACT( DECADE FROM birth_date ) AS decade movies-# FROM customers movies-# GROUP BY decade; count | decade -------+-------- 2 | 196 2 | 197 2 | 198 1 | (4 rows)
If you don't request an explicit ordering, the GROUP BY clause will cause the result set to be sorted by the GROUP BY fields. If you want a different ordering, you can use the ORDER BY clause with GROUP BY. The following query shows how many customers you have for each decade, sorted by the count:
movies=# SELECT movies-# COUNT(*) as "Customers", movies-# EXTRACT( DECADE FROM birth_date ) as "Decade" movies-# FROM customers movies-# GROUP BY "Decade" movies-# ORDER BY "Customers"; Customers | Decade -----------+-------- 1 | 2 | 196 2 | 197 2 | 198 (4 rows)
The NULL decade looks a little funny in this result set. You have one customer (Freddy Funkmaster) who was too vain to tell you when he was born. You can use the HAVING clause to eliminate aggregate groups:
movies=# SELECT COUNT(*), EXTRACT( DECADE FROM birth_date ) as decade movies-# FROM customers movies-# GROUP BY decade movies-# HAVING EXTRACT( DECADE FROM birth_date ) IS NOT NULL; count | decade -------+-------- 2 | 196 2 | 197 2 | 198 (3 rows)
You can see that the HAVING clause is similar to the WHERE clause. The WHERE clause determines which rows are included in the aggregate, whereas the HAVING clause determines which groups are included in the result set.