SQL Performance Tuning
When you write a join, you're letting the DBMS optimizer decide among a variety of plans, such as merge- sort , nested-loop, or hash; see Chapter 5, "Joins." But when you write a subquery, you are strongly hinting that you want a nested-loop. Both out-to-in and in-to-out queries are nested-loop plansthey merely differ in direction or driver choice. That's one point in favor of writing joins instead of subqueries, because it's better not to tell the DBMS what to do. The best option is to give the DBMS choices. Nested-loops for subqueries can be different from nested- loops for joins. Listing 6-1 shows a join nested-loop, while Listing 6-2 shows a subquery nested-loop. Listing 6-1 Nested-Loop for Joins
for (each row in driver) { for (each row in driven) { if "match" add to matchlist } } Listing 6-2 Nested-Loop for Subqueries
for (each row in driver) { for (each row in driven) { if "match" EXIT LOOP } } The difference between nested-loops for joins and nested-loops for subqueries is that a subquery needs to find only a single match, so it might be able to break out early from the inner loop. In contrast, a join needs all matches, so it must chug through all rows. (Because the subquery loop can abort early the result is sometimes called a "semijoin," but that's a rare word.) And that's one point in favor of writing subqueries, because a subquery loop will have fewer iterations. To test whether subqueries can beat joins, we made two one-column (INTEGER) tables. Table1 had 1,000 rows, and Table2 had 10,000 rows. Each table was populated with random numbers between 100 and 1,000. There was no clustering or indexing. Then we tested this join and a similar-looking subquery against Table1 and Table2 :
Join: SELECT MIN(Table1.column1) FROM Table1, Table2 WHERE Table1.column1 = Table2.column1 Subquery: SELECT MIN(Table1.column1) FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2) GAIN: 7/7
Portability MySQL doesn't support subqueries. The gain on all tests shown in this chapter is for only seven DBMSs.
The result of the test was that the subquery was faster with every DBMS tested. Here are some other arguments in the join-versus-subquery debate: Pro join arguments
People who prefer joins over subqueries pose the following arguments:
Pro subquery arguments
On the other side, people who prefer subqueries over joins argue these points:
Flattening
To flatten a query means to make everything one level. Based on our observations so far, there are clearly times when you should flatten a subquery. In this section, we'll discuss how the DBMS can flatten automatically, but first let's establish how you can do it yourself. To flatten IN
Here's an example of a subquery using IN, and the flattened analogous query: Subquery with IN: SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2 WHERE Table2.column1 = 5) Flattened: SELECT Table1.* FROM Table1, Table2 WHERE Table1.column1 = Table2.column1 AND Table2.column1 = 5 This simple flattening seems like it should be equivalent to the IN query, but it has a flaw that can be seen if Table1 has these three rows: {1 , 5 , 5} and Table2 has these three rows: {2 , 5 , 5} . Notice that the subquery-based SELECT would correctly return two rows: {5 , 5} , but the flattened SELECT would return four rows: {5 , 5 , 5 , 5} . Notice further that if the flattened SELECT is "corrected" by using SELECT DISTINCT instead of just SELECT, the join would return only one row: {5} . In other words, flattening can cause too many duplicate rowsand if you try to get rid of the duplicates, you can end up with too few duplicate rows. Sometimes you'll see this flaw if the DBMS flattens automatically. (By the way, DISTINCT can be expensive so don't use it if Table2.column1 is uniquely indexed.) To flatten NOT IN
Here's an example of a subquery using NOT IN, and the flattened equivalent query: Subquery with NOT IN: SELECT Table1.column1 FROM Table1 WHERE Table1.column1 NOT IN (SELECT Table2.column1 FROM Table2) Flattened equivalent: SELECT Table1.column1 FROM Table1 LEFT JOIN Table2 ON Table1.column1 = Table2.column1 WHERE Table2.column1 IS NULL This type of flattening is rare, and has two flaws: it assumes that ON happens before WHERE as in the SQL Standard, and it assumes that Table2.column1 has no NULLs (the WHERE Table2.column1 IS NULL clause is meant to find dummy rows produced by the outer join). When is flattening automatic? The rules vary slightly between DBMSs and often seem to reflect some foible in the DBMS rather than some strategy. Amalgamating the rules followed by the various DBMSs, we can say that a flattening is almost certain to happen to subqueries in this scenario:
If you think the DBMS is likely to flatten incorrectly, you can discourage it by changing the conditions. If you think the DBMS is likely to flatten correctly, then you face the questionWhy code with a subquery in the first place? If it's going to be a join anyway, it would be clearer to code it as a join. In other words, if flattening is automatic, that's a reason to flatten manually! We think that's taking things a bit too far. Better to let the DBMS decide, and worry yourself about the portability. Keep in mind that the original subquery code is there because somebody thought it was better. The Bottom Line: Join versus Subquery
The difference between nested-loops for joins and nested-loops for subqueries is that a subquery needs to find only a single match, so it might be able to break out early from the inner loop. In contrast, a join needs all matches, so it must chug through all rows. One major point in favor of writing joins instead of subqueries is that it's better not to tell the DBMS what to do. By writing joins, you're giving the DBMS choices instead of forcing it to follow a specific plan. One major point in favor of writing subqueries is that a subquery loop may have fewer iterations, so subqueries may be faster than joins. Encourage the DBMS to break out of subquery loops early by using syntax like IN (SELECT <query>) rather than = (SELECT <query>) . Generally joins are better, but if, on average, several rows are in the driven query for each row in the driver (for example, a primary-to-foreign or other one-to-many relation), and especially if matching rows come first in the driven query, then subqueries are better. |