August 27, 2024

Reclaiming SQL’s Declarative Power

A good SQL optimizer can dramatically improve query performance, allowing you to focus on writing readable SQL instead of getting lost in the details of database optimization. This post discusses query optimizers, and query unnesting in particular, to emphasize the importance of query optimization.

Moritz Sichert
Moritz Sichert
Moritz Sichert
Moritz Sichert
Reclaiming SQL’s Declarative Power

Reclaiming SQL’s Declarative Power

You may have heard that SQL is a “declarative” programming language. In imperative languages such as Python, C++, and Rust, you, the programmer, tell the computer what to do and in what order: The program essentially runs from top to bottom, executing one statement at a time. In a declarative language like SQL, on the other hand, you only specify what the result of the query should look like, not how the database system should execute it.

Because SQL is declarative, a database system can efficiently execute your query regardless of how you write it. This gives you the freedom to focus on writing readable and understandable SQL without having to worry about efficiency. Your database system will just magically figure out how to execute any query you write as quickly as possible!

If that is the case, why do many different blog posts recommend writing your SQL queries in a certain way in order to achieve better performance? For Postgres, a common wisdom used to be to use WITH-statements (also called CTEs) to force the database system to execute a query in a specific order. There are more hacks like these such as adding OFFSET 0 to prevent a subquery from being optimized with other parts of a query.

Wasn’t the whole point of a declarative language to avoid having to make your program harder to read just to make it faster? The unfortunate truth is that most database systems are not capable of executing all queries efficiently unless you push them hard in the right direction. This often requires you to use certain combinations of SQL that don’t look natural and make your queries harder to read and understand.

Query Optimization

The component of a database system that takes care of finding the best way to execute a query written in declarative SQL is called the query optimizer. The query optimizer takes a SQL query and tries to find the most efficient algorithm that can calculate the result. It’s easier to understand what this means by looking at a short example. Consider this simple query which prints the titles of all blog posts and their authors:

select a.name, b.title
from authors a, blogposts b
where b.author_id = a.id

If a database system were to execute this query naively, it might, for example, iterate through all blog posts and, for each blog post, iterate through all authors until it finds the matching author. If the database contains n blog posts and m authors, this execution strategy will compute n × m combinations of blog posts and authors, even though most of them won’t match. If you are running a large micro-blogging site like Twitter with about 300 million users and 1 billion tweets, such a quadratic algorithm has to look at 300 × 1015 or 300 quadrillion combinations. As you can imagine, this is hardly reasonable.

You could also write this with a join in SQL:

select a.name, b.title
from authors a join blogposts b on (b.author_id = a.id)

A database system could execute a join like this using the hash join algorithm. This algorithm only looks at n + m values which for our example amounts to 1,3 billion. As you can see, just choosing the correct algorithm can drastically improve the execution speed even before we talk about what kind of server the database system runs on.

Of course, the whole point of a declarative language like SQL is that you can write the “unoptimized” query from above and still get the same performance as if you had used a join explicitly. This is exactly what the query optimizer does automatically for you.

This optimization that replaces a quadratic algorithm with a hash join is only the beginning of what a query optimizer can do. In fact, this transformation is so trivial that any query optimizer worth its name should be able to do it. In general, designing query optimizers is actually so complex that there has been little innovation in this area over the last few decades, and the subject is taught in-depth at only a handful of universities worldwide (two of them in Germany: TUM and University of Mannheim!).

In this post, we will take a look at a query optimization technique called “query unnesting” or “query decorrelation”. Query unnesting is not used effectively by most existing database systems, even in the simplest cases. However, as we will see, having a query optimizer that can unnest queries allows you to write much simpler SQL statements that are just as fast as equivalent hand-tuned complex SQL statements.

Correlated Subqueries in SQL

Query unnesting is a technique that allows to break up so-called correlated subqueries that you can write in SQL. To show what a correlated subquery is, let me first show you what a regular (i.e., uncorrelated) subquery looks like in SQL:

-- Find the names of all olympic athletes ...
select o.athlete_name
from olympics_results o
where
    -- ... competing in one particular style of swimming ...
    o.discipline = '100 metres freestyle swimming' and
    -- ... whose time is ...
    o.time = (
        -- ... the lowest among all times in the same discipline
        select min(o2.time)
        from olympics_results o2
        where o2.discipline = '100 metres freestyle swimming'
    )

This query returns the gold medal winners in 100 metres freestyle swimming. To find them, we use a subquery that finds the fastest time by any athlete in that event, and then prints only those athletes who had that time. This correctly accounts for ties where more than one athlete receives a gold medal.

Now we want to change the query to find the gold medal winners of all disciplines. Using a correlated subquery leads to a very elegant solution:

-- Find the names and the discipline of all gold medal olympic athletes
select o.athlete_name, o.discipline
from olympics_results o
where
    -- ... whose time is ...
    o.time = (
        -- ... the lowest among all times in the same discipline
        select min(o2.time)
        from olympics_results o2
        where o2.discipline = o.discipline
    )

We only changed the last line in the subquery to compare to the discipline of the outer query instead of using a fixed string. Because the subquery uses an attribute of the outer query, it is now called a correlated subquery.

Correlated Subqueries Are Slow

While correlated subqueries allow you to write concise SQL, unfortunately most database systems execute them very inefficiently. To understand why, you have to roughly understand how a database system executes the correlated query: On the top level, the database system goes through all rows of the olympics_results table. For every row it then executes the subquery with its current discipline. The subquery goes through the entire table again to find the lowest time of the given discipline.

So, in the worst-case, the entire query needs to look through the entire table olympics_results for every row in that table. This is a classic example of an algorithm with quadratic runtime which is something every database systems tries very hard to avoid. Roughly speaking, with a quadratic runtime, when you double the size of the table, the runtime of the query will increase by 4x.

Example Query on Benchmark Data Set TPC-H

Let’s look at a slightly more realistic example on the TPC-H benchmark dataset which is used to benchmark analytics on business data. The benchmark also comes with a set of queries that can be used to evaluate the performance of a database system.

One such query is Q17 which you can see here:

select sum(l_extendedprice) / 7.0 as avg_yearly
from lineitem, part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container = 'MED BOX'
    and l_quantity < (
        select 0.2 * avg(l_quantity)
        from lineitem
        where l_partkey = p_partkey
    )
Try it in CedarDB

The idea behind this query is to find all low-volume parts with a specific brand and container type that are sold by the company. The query returns the average annual revenue of these low-volume parts, which may help the business owner decide whether to discontinue these parts.

What’s more interesting about this query is its structure: Like the example query we showed in the beginning, Q17 has a correlated subquery. The subquery selects from the lineitem table and the outer query also scans through the entire lineitem table. In the benchmark data set with scale factor 1, lineitem contains 6 million rows, so if you execute Q17 in a database system that uses an algorithm with quadratic runtime, you will have a bad time.

Correlated Subqueries in Postgres

Running this query in Postgres on the TPC-H dataset with scale factor 1 (around 1 GB of CSV data) confirms our fears about quadratic runtime: Q17 runs for 26 minutes on my laptop! Postgres also allows you to inspect the query execution in more detail by prefixing the query with explain (analyze, verbose) which prints the following “optimized” query plan:


                           QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2070453.68..2070453.69 rows=1 width=32) (actual time=1577110.964..1577111.068 rows=1 loops=1)
   Output: (sum(lineitem.l_extendedprice) / 7.0)
   ->  Hash Join  (cost=6368.71..2070448.85 rows=1931 width=8) (actual time=596.474..1577110.309 rows=587 loops=1)
         Output: lineitem.l_extendedprice
         Inner Unique: true
         Hash Cond: (lineitem.l_partkey = part.p_partkey)
         Join Filter: (lineitem.l_quantity < (SubPlan 1))
         Rows Removed by Join Filter: 5501
         ->  Seq Scan on public.lineitem  (cost=0.00..172571.28 rows=6001628 width=17) (actual time=0.004..433.981 rows=6001215 loops=1)
               Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax, lineitem.l_returnflag, lineitem.l_linestatus, lineitem.l_shipdate, lineitem.l_commitdate, lineitem.l_receiptdate, lineitem.l_shipinstruct, lineitem.l_shipmode, lineitem.l_comment
         ->  Hash  (cost=6366.30..6366.30 rows=193 width=4) (actual time=93.251..93.353 rows=204 loops=1)
               Output: part.p_partkey
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               ->  Gather  (cost=1000.00..6366.30 rows=193 width=4) (actual time=93.209..93.323 rows=204 loops=1)
                     Output: part.p_partkey
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Seq Scan on public.part  (cost=0.00..5347.00 rows=80 width=4) (actual time=57.926..62.724 rows=68 loops=3)
                           Output: part.p_partkey
                           Filter: ((part.p_brand = 'Brand#23'::bpchar) AND (part.p_container = 'MED BOX'::bpchar))
                           Rows Removed by Filter: 66599
                           Worker 0:  actual time=40.292..47.555 rows=95 loops=1
                             JIT:
                               Functions: 4
                               Options: Inlining true, Optimization true, Expressions true, Deforming true
                               Timing: Generation 0.182 ms, Inlining 28.746 ms, Optimization 6.198 ms, Emission 5.118 ms, Total 40.244 ms
                           Worker 1:  actual time=40.423..47.553 rows=109 loops=1
                             JIT:
                               Functions: 4
                               Options: Inlining true, Optimization true, Expressions true, Deforming true
                               Timing: Generation 0.181 ms, Inlining 28.431 ms, Optimization 6.319 ms, Emission 5.310 ms, Total 40.241 ms
         SubPlan 1
           ->  Aggregate  (cost=187575.43..187575.45 rows=1 width=32) (actual time=258.900..258.900 rows=1 loops=6088)
                 Output: (0.2 * avg(lineitem_1.l_quantity))
                 ->  Seq Scan on public.lineitem lineitem_1  (cost=0.00..187575.35 rows=32 width=5) (actual time=7.144..258.839 rows=31 loops=6088)
                       Output: lineitem_1.l_orderkey, lineitem_1.l_partkey, lineitem_1.l_suppkey, lineitem_1.l_linenumber, lineitem_1.l_quantity, lineitem_1.l_extendedprice, lineitem_1.l_discount, lineitem_1.l_tax, lineitem_1.l_returnflag, lineitem_1.l_linestatus, lineitem_1.l_shipdate, lineitem_1.l_commitdate, lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct, lineitem_1.l_shipmode, lineitem_1.l_comment
                       Filter: (lineitem_1.l_partkey = part.p_partkey)
                       Rows Removed by Filter: 6001184
 Planning Time: 0.496 ms
 JIT:
   Functions: 32
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.268 ms, Inlining 68.352 ms, Optimization 62.986 ms, Emission 41.877 ms, Total 174.483 ms
 Execution Time: 1577112.054 ms

Explaining this entire output would span multiple blog posts, so I’ll keep it short: The two highlighted lines show why this query is slow. While we are using an efficient Hash Join, the join needs to evaluate the filter condition lineitem.l_quantity < (SubPlan 1). This filter condition can only be evaluated by calculating the result of the subquery which can be seen in the second highlighted line.

So, for each row in the lineitem table, Postgres executes the subplan that again scans through the entire lineitem table! In total, the query reads the entirety of all 6 million rows in lineitem over 6000 times.

Optimizing Correlated Subqueries

Now we know that correlated subqueries can lead to abysmal query performance. How can we can we avoid this? Ideally, you would use a database system with an optimizer that can automatically optimize correlated subqueries, such as CedarDB.

If not, all hope is not lost! You must now enter the land of DBA wizards and database whisperers who can hopefully help you make your query with correlated subqueries more efficient. Fortunately, at least for Q17 on TPC-H, you won’t have to resort to the evil magic of “optimizer hints” (which Postgres deliberately doesn’t allow!).

The subquery in Q17 calculates the average quantity of the given part in all orders. Writing this as a correlated subquery makes the intention of the SQL query very clear but unfortunately also very slow. We could also calculate the average quantity in all orders for every part in advance and only afterwards use the quantity to find the low-quantity parts. In SQL, calculating the average quantities in advance is very easy:

select l_partkey as a_partkey, 0.2 * avg(l_quantity) as a_avg
from lineitem
group by l_partkey

We can combine this into Q17 by removing the correlated subquery and instead joining with the pre-calculated quantities like this:

select sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part,
    (
        select l_partkey as a_partkey, 0.2 * avg(l_quantity) as a_avg
        from lineitem
        group by l_partkey
    ) avg_lineitem
where
    p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container = 'MED BOX'
    and l_partkey = a_partkey
    and l_quantity < a_avg
Try it in CedarDB

The new query is a bit more annoying to write, but it now runs in under 2 seconds in Postgres, giving us a performance improvement of almost 1000x! If you are interested in the details, you can run the query again with explain (analyze, verbose) and you will see that the query now doesn’t use any SubPlan and can even be parallelized automatically.

Complex Correlated Subqueries

In Q17, even the hand-optimized query is easy to understand, and you can get the best performance on systems like Postgres that do not automatically optimize the query. However, you can already see the benefit of using a database system with a better query optimizer like CedarDB: If you click on “Try it in CedarDB” for either version of Q17, you can see that both versions have the same runtime. Our query optimizer is able to do the optimization we did by hand automatically.

In fact, CedarDB can automatically decorrelate any SQL query. The more complex a query becomes, the more important it is to make it readable and understandable to yourself and your colleagues. So for more complex queries, you should probably use more correlated subqueries to manage the complexity for humans, not fewer. So, using a database system with a sophisticated query optimizer means that you as a programmer have to think a lot less about how the query is executed, and can focus on the problem you actually want to solve!

Summary

We hope you now understand the importance of a query optimizer: An optimized query can easily outperform a not-so-optimized query by 1000x. Especially correlated subqueries in SQL can be very hard to optimize by most database systems. However, we believe that being able to write correlated subqueries significantly improves the usability of a database system. With a good query optimizer, anyone can write SQL queries that will be executed efficiently, which previously only seasoned database wizards were able to do.

If you are interested in what other queries CedarDB can optimize, join our waitlist to be among the first to experience the benefits of CedarDB’s query optimizer!

Join our waitlist!