Performance, Troubleshooting

Improved JOIN Order by Taking Condition Filter Effect into Account

One of the major challenges of query optimizers is to correctly
estimate how many rows qualify from each table of a join. If the
estimates are wrong, the optimizer may choose a non-optimal join
order.

Before MySQL 5.7, the estimated number of rows from a table only took
into account the conditions from the WHERE clause that were used to set
up the access method (e.g., the size of an index range scan). This
often led to row estimates that were far too high, resulting in very wrong
cost estimates for join plans. To improve this issue, MySQL 5.7
introduced a cost model that considered the entire WHERE condition
when estimating the number of qualifying rows from each table. This
model estimates the filtering effect of the table’s conditions.

As shown in the above figure, the condition filter effect will reduce
the estimated number of rows from tx that will lead to an
index look-up on tx+1. For more details on how condition
filtering
works, see two earlier blog posts on this topic:
part1, part2.

Taking condition filtering into account, the join optimizer will in
many cases be able to find a more optimal join order. However, there
are cases where the optimizer will overestimate the filtering effect
and choose a non-optimal query plan. In this blog post, I will show an
example of a query that benefits from condition filtering, and in a
follow-up blog post I will discuss what could be done if condition
filtering does not have the desired effect.

Example: DBT-3 Query 8

To show the benefits of condition filtering, we will look at Query 8 in
the DBT-3 benchmark:

SELECT o_year,
SUM(CASE WHEN nation = 'FRANCE' THEN volume ELSE 0 END) / SUM(volume) AS mkt_share
FROM (
SELECT EXTRACT(YEAR FROM o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) AS volume, n2.n_name AS nation
FROM part
JOIN lineitem ON p_partkey = l_partkey
JOIN supplier ON s_suppkey = l_suppkey
JOIN orders ON l_orderkey = o_orderkey
JOIN customer ON o_custkey = c_custkey
JOIN nation n1 ON c_nationkey = n1.n_nationkey
JOIN region ON  n1.n_regionkey = r_regionkey   
JOIN nation n2 ON s_nationkey = n2.n_nationkey
WHERE r_name = 'EUROPE' AND o_orderdate BETWEEN '1995-01-01' AND '1996-12-31'
AND p_type = 'PROMO BRUSHED STEEL'
) AS all_nations GROUP BY o_year ORDER BY o_year;

Query 8 is called National Market Share Query, and it finds the
market share in Europe for French suppliers of a given part type. You
do not need to understand this query in detail. The main point is that
8 tables are joined, and that it is important to find an efficient
join order for the query to perform well.

In MySQL 5.6, Visual EXPLAIN shows the following query plan for Query 8:

Tables are joined from left-to-right, starting with a full table scan
of the region table, doing index look-ups into all other
tables, with the part table as the last table. The
execution time for this query is around 6 seconds on a scale factor 1
DBT-3 database.

If we look at the WHERE clause of the query, we see that there is one
condition with high selectivity: We are interested in just one
particular of many possible part types. That the region should be
Europe, and that the time period is restricted to two years, will
still leave many candidate rows, so these conditions have low
selectivity. In other words, a query plan that put the
part table at the end is not optimal. If
part was processed early, we would be able to filter out
many rows, and the number of index look-ups into other tables could be
significantly reduced.

In MySQL 5.7, the use of condition filtering estimates changes the
query plan for Query 8, and the new query plan looks as follows:

As you see, part is now processed early. (We see that
region is still processed first, but this is a small table with
only 5 rows of which only one row matches its condition. Hence, it
will not impact the fan-out of the join.) This new query plan takes
0.5 seconds to execute. In other words, execution time is reduced by
92% by changing the join order! The main saving is that, instead of
going through all orders for customers in Europe, one will only look
at orders of parts of a given type.

(The observant reader will have noticed another difference between the
Visual EXPLAIN diagrams: The box around the tables is no longer
present in the 5.7 diagram. If you look a Query 8, you notice that the
many-table join is in a sub-query in the FROM clause; aka derived
table
. In MySQL 5.6 and earlier, the result of derived tables
where materialized in a temporary table, before the main query was
executed. The extra box in the 5.6 diagram, represents such a
temporary table. In MySQL 5.7, derived
tables will be merged into the outer query if possible
, and the
materialization of the derived table is avoided.)

To see the optimizer’s estimates for condition filter effects, we can
look at the filtered column of tabular EXPLAIN (some of the
columns have been removed to save space):

id select_type table type key rows filtered Extra
1 SIMPLE region ALL NULL 5 20.00 Using where; Using temporary; Using filesort
1 SIMPLE part ALL NULL 200000 10.00 Using where; Using join buffer (Block Nested Loop)
1 SIMPLE lineitem ref i_l_partkey_suppkey 30 100.00 Using index condition
1 SIMPLE supplier eq_ref PRIMARY 1 100.00 Using where
1 SIMPLE n2 eq_ref PRIMARY 1 100.00 NULL
1 SIMPLE orders eq_ref PRIMARY 1 50.00 Using where
1 SIMPLE customer eq_ref PRIMARY 1 100.00 Using where
1 SIMPLE n1 eq_ref PRIMARY 1 20.00 Using where

We can see that there are 4 tables for which the optimizer assumes
some filtering; region, part,
orders, and n1 (nation). The
optimizer has three sources for its estimates:

  1. Range estimates:For indexed columns, the range optimizer will ask the storage engine
    for an estimate as to how many rows are within a given range. This
    estimate will be pretty accurate. For our query, this is the case for
    the region and orders table. Europe is 1 out
    of 5 regions and the requested two year period contains 50% of the
    orders in the database.
  2. Index statistics:For indexed columns, MySQL also keep statistics on the average number
    of rows per value
    (records_per_key). This can be used to
    estimate the filtering effect of conditions that refers to columns of
    preceeding tables in the join order. For our query this is the case
    for n1. The query has two conditions involving
    n1, but the condition on n_nationkey is used
    for the index look-up and will not contribute to extra filtering. On
    the other hand, the condition on n_regionkey will provide
    some extra filtering, and since records_per_key tells
    that there are 5 distinct values for this column, the estimated
    filtering effect will be 20%.

    The assumption is that values are evenly distributed. If the
    distribution is skewed, the filtering estimate will be less accurate.
    Another cause of estimation errors is that index statistics are based
    on sampling, so it may not precisely reflect the actual distribution
    of values.

  3. Guesstimates:For non-indexed columns, MySQL does not have any statistics. The
    optimizer will then resort to heuristics. For equality conditions,
    the filtering effect is assumed to be 10%. The DBT-3 database does
    not have an index on part_type. Hence, the filtering
    effect for the part table will be set to 10%. This is
    actually a bit off since there are 150 different parts. However, in
    this case, just assuming that there is some filtering, made the
    optimizer choose a better plan.

Caveat

As shown in this blog post, taking condition filtering into account
may give better query plans. However, as discussed, there is a risk
that the filtering estimate are inaccurate, especially for conditions
on non-indexed columns. Another cause of estimation errors is
that it is assumed that columns are not correlated. Hence, when here
are conditions on correlated columns, the optimizer will overestimate
the filtering effect of those conditions.

We have got a few bug reports on performance regressions when
upgrading from 5.6 to 5.7 that are caused by the optimizer
overestimating the filtering effect. In most cases, this is because
the query contains equality conditions on non-indexed columns with low
cardinality, so the guesstimate of 10% is too optimistic. In my next
blog post, I will discuss what can be done when this happens. We are
also working on adding histograms to MySQL. Histograms will give a
much better basis for estimating the filtering effects of conditions
on non-indexed columns.

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *