Fortunately, MySQL 5.6 also introduced subquery materialization that can be used when semi-join does not apply; as long as the subquery is not dependent on the outer query. (There should be no references to tables of the outer query within the subquery). In an earlier blog post, I showed how the query execution time for DBT-3 Query 18 improved from over a month to a few seconds due to subquery materialization. In this blog post, I will show how we can further improve the performance of Query 18 by rewriting it.
Let’s first take a look at Query 18 in its original form:
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(l_quantity) FROM customer, orders, lineitem WHERE o_orderkey IN ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING SUM(l_quantity) > 313 ) AND c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate LIMIT 100;
This query is called Large Volume Customer Query since it finds large orders (of more than 313 items) and returns the top 100 orders; ordered by total price and date. In MySQL 5.5 and earlier, MySQL would execute the GROUP BY subquery once for every row of the outer query. In MySQL 5.6, subquery materialization made it possible to execute this subquery only once. This can be seen from the EXPLAIN output for this query (some of the columns have been removed to save space):
|1||PRIMARY||orders||ALL||NULL||1500000||Using where; Using temporary; Using filesort|
That select_type is SUBQUERY indicates that the subquery will be executed only once. (Otherwise, the select_type would be DEPENDENT SUBQUERY.) Here is the Visual EXPLAIN diagram for the same query plan:
A basic component of both semi-join execution strategies and subquery materialization is duplicate removal. Unlike a JOIN operation where the result will contain all combinations of rows that match the join criteria, an IN-expression should only produce one row regardless of the number of matches in the subquery. For semi-join there are multiple strategies for removing duplicates, and for subquery materialization, duplicates are removed from the temporary table.
Looking at QUERY 18, we realize that the subquery will never give any duplicates. The only column selected is the column that we are grouping on, l_orderkey, so all rows will be distinct. This means that we can replace the IN-expression with a join, and get this equivalent query:
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(lineitem.l_quantity) FROM customer, orders, lineitem, ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING SUM(l_quantity) > 313 ) l2 WHERE o_orderkey = l2.l_orderkey AND c_custkey = o_custkey AND o_orderkey = lineitem.l_orderkey GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate LIMIT 100;
The subquery is now a derived table; a subquery in the FROM clause. Take a look at the query plan for the rewritten query:
Notice that the subquery is materialized as before, but this time the plan is to start with scanning the temporary table instead of the orders table as for the original query. This means that instead of accessing all orders, MySQL will now only access the biggest orders (with more than 313 items). Given that most orders are smaller than that, this should give improved performance. Indeed, with MySQL 5.7.14 and using a scale factor 1 database, the query
execution time is slashed by one third by this rewrite:
In general, rewriting the query from an IN-expression to a JOIN is beneficial because the join optimizer may then be used to find the optimal order for accessing the tables. This is the idea behind the semi-join transformations that are done automatically since MySQL 5.6, and this is why our manual transformation gave a better query plan. Hopefully, some time in the future the MySQL Query Optimizer will be able to do such transformations automatically. Until then, watch out for opportunities to manually rewrite IN-subqueries where semi-join does not apply!