Hack 83. Find the Top n in Each Group

2017-11-03 09:05:03

Identify the top n rows in the table, or the top n in each group of rows, by using a correlated subquery with a COUNT and an inequality.

Many database systems implement a feature which allows you to limit the number of rows returned in a query. Some of these proprietary methods include LIMIT (MySQL), ROWNUM (Oracle), and TOP (SQL Server).

These methods work reasonably well (except for one small detail which is explained at the end of this hack), but there is another, "generic" method that will work correctly in all database systems. See "Extract a Subset of the Results" [Hack #72] for details on how to get the first 10 rows from a result set using proprietary methods.

To understand how this method works, imagine you are running a race. To determine your finish position, all you have to do is count how many racers finished ahead of you. So, if two people finished ahead of you, you finished third. Pretty simple.

The generic top n query works in a similar way. It finds each row's position and then eliminates all but the top n.

Before you look at the query, though, it's important to remember that "position" has meaning only with respect to the values in some column. There's no point in asking for the top n rows of some table, unless you state this request with specific regard to some column's values which can be compared in order to determine relative position. In a race, it's finish time, with the smallest value being the top finish time. If dates are the values which determine position, the latest rows are those with the highest dates.

10.7.1. Last Three Articles

Suppose you have a web site with articles, and you want to show on your home page a list of links to the three most recent articles. Here's the query:

SELECT pdate , title FROM articles A WHERE ( SELECT count(*) FROM articles WHERE pdate > A.pdate ) < 3 ORDER BY pdate DESC

This query has an outer query, and a correlated subquery. A correlated subquery is one which references a column in the outer query. In this case, the outer query returns rows from the articles table using the alias or correlation variable A, while the correlated subquery counts all rows in the same table which have a higher (later) date than the row from the outer query.

Each row in the outer query is evaluated as follows: in the subquery, count the number of rows that have a higher date than this row, and if that number is less than 3, this row must be in the top three. This works exactly like the race did; if you count how many people finished ahead of you, and only two people did, you must have finished in the top three. If for some article, the subquery counts five articles with a later date, that article can't be in the latest three.

10.7.2. Top n Rows in Each Group

So far, so good; now change your query slightly so that it's looking for the latest three articles in each category:

SELECT category , pdate , title FROM articles A WHERE ( SELECT count(*) FROM articles WHERE category = A.category AND pdate > A.pdate ) < 3 ORDER BY category , pdate DESC

This correlated subquery is similar to the preceding one, with the difference that the counting is done only within the same category.

10.7.3. Ties

In each of the preceding queries, ties are included. This is as it should be. If only two people finished ahead of you, you finished third, and it doesn't matter that someone else tied with you: you still finished third. Yes, the person who tied with you also finished third, so when the medals are handed out, four will be handed out in all.

This can be an issue (or fatal shortcoming, depending on your point of view, especially if you don't get a medal which you rightfully earned) in proprietary methods such as MySQL's LIMIT and Oracle's ROWNUM. If there's a tie for last place, some row is left out. SQL Server at least has the optional clause, TOP n [WITH TIES], but you have to remember to use it.

10.7.4. See Also

"Calculate Rank" [Hack #40]

Rudy Limeback

Категории