June 18, 2024

Why Your SSD (Probably) Sucks and What Your Database Can Do About It

SSDs have effectively replaced spinning disks as the go-to solution for persistent storage for database systems. While they offer amazing read and write throughput, they come with their own issues vendors don't like to talk about. We dive into those issues and explain how CedarDB overcomes them.

Lukas Vogel

Database system developers have a complicated relationship with storage devices: They can store terabytes of data cheaply, and everything is still there after a system crash. On the other hand, storage can be a spoilsport by being slow when it matters most.

This blog post shows

  • how SSDs are used in database systems,
  • where SSDs have limitations,
  • and how to get around them.

When are SSDs fast?

When we colloquially talk about speed, we usually think in terms of throughput, i.e., how much data we can store or retrieve per second. Let’s use the fantastic bench-fio tool to measure how a consumer-grade Crucial T700 SSD used in one of our build servers performs for random reads:

bench-fio --size=10G --target=./fiotest --type file --mode randread --output consumerSSD --destructive

Almost 6 GB/s! As you can see, modern SSDs are simply amazing when it comes to read throughput. Since analytical queries tend to read a lot of data, SSDs can make full use of their high read throughput as long as two conditions are fulfilled:

  1. We use multiple threads to access the SSD in parallel (numjobs).
  2. Each thread keeps multiple requests in flight to keep the SSD busy (iodepth).

CedarDB is designed to take advantage of this “happy place” when processing analytical queries: Through CedarDB’s morsel-based parallelism approach, each query is executed by many threads that can process read requests in parallel. To keep the read queue full, CedarDB also uses io_uring, a Linux syscall available in newer kernels specifically designed for more efficient I/O, to asynchronously queue multiple requests in parallel.
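To get a feel for why parallelism and queue depth matter, here is a rough sketch of the same idea in Python. It is not io_uring (Python has no built-in bindings for it) and certainly not CedarDB's implementation; it merely keeps many positional reads in flight through a thread pool, with the file name, block size, and worker count chosen arbitrarily:

import os
from concurrent.futures import ThreadPoolExecutor

# Approximation of "many read requests in flight": io_uring submits requests
# asynchronously from a few threads; here we emulate the queue depth with a
# thread pool issuing positional reads (os.pread).
PATH = "./fiotest"          # placeholder test file, e.g. created by bench-fio
BLOCK_SIZE = 64 * 1024      # 64 KiB per read request
NUM_WORKERS = 16            # roughly numjobs * iodepth in the fio run above

def read_block(fd: int, offset: int) -> bytes:
    # Positional read: no shared file offset, so threads do not serialize.
    return os.pread(fd, BLOCK_SIZE, offset)

def parallel_scan(path: str) -> int:
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        total = 0
        with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
            for block in pool.map(lambda off: read_block(fd, off),
                                  range(0, size, BLOCK_SIZE)):
                total += len(block)
        return total
    finally:
        os.close(fd)

if __name__ == "__main__":
    print(f"read {parallel_scan(PATH)} bytes")

With a single worker, the same scan leaves the SSD idle most of the time; with enough requests in flight, it gets much closer to the throughput numbers above.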

When are SSDs slow?

Unfortunately, not all database workloads require only high read throughput. After all, we also want to insert new data or update existing data, which incurs writes at a low queue depth. While SSDs also have acceptable write throughput (as long as their internal cache is not exhausted), database inserts and updates are bottlenecked by latency, i.e., how long it takes to store some bytes and tell the user when everything is done.

Take a look at the latency histogram below:

bench-fio --size=10G --target=./fiotest --type file --mode randwrite --output latency --numjobs=1 --iodepth=1 --destructive

Doesn’t look too bad, does it? Almost all write requests complete in 10 microseconds. So each thread could theoretically persist 100,000 inserts per second to the SSD. This is in line with SSD vendors who promise 1500k IOPS and more.

Unfortunately, your operating system is lying to you: When your database system does a write(), it marks the data to be persisted but there is no guarantee that it has actually reached the physical storage medium yet. If your system crashes at exactly the wrong time, data that you thought you had written to disk may not be there after a restart. However, your database system should be able to guarantee you that your committed data is durable (the “D” in ACID).

If you want your data to be stored persistently for real, you need to issue a sync command that makes the operating system block and only return control after the data has been persisted. Let's have a look at how this affects latency:

bench-fio --size=10G --target=./fiotest --type file --mode randwrite --output latencydsync --numjobs=1 --iodepth=1 --destructive --extra-opts sync=dsync

Oops! The latency is now around 500 microseconds with a tail of over a millisecond! If we actually care about durability, we’ve just been heavily downgraded. So, a simple

for value in values: 
  insert(value)

will at most insert ~2000 values per second!
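You can reproduce this gap without fio in a few lines of Python. This is only a sketch, with an arbitrary scratch file, payload size, and iteration count, and the absolute numbers depend on your SSD, but the difference between the two runs should mirror the histograms above:

import os
import time

PATH = "./synctest"        # placeholder scratch file
ITERATIONS = 1000
PAYLOAD = b"x" * 4096      # one 4 KiB "insert"

def bench(sync: bool) -> float:
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    start = time.perf_counter()
    for i in range(ITERATIONS):
        os.pwrite(fd, PAYLOAD, i * len(PAYLOAD))
        if sync:
            os.fsync(fd)   # block until the data is actually durable
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed / ITERATIONS * 1e6  # microseconds per write

if __name__ == "__main__":
    print(f"write only:  {bench(sync=False):8.1f} us")
    print(f"write+fsync: {bench(sync=True):8.1f} us")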

Solving the Mess

Such low throughput is simply unacceptable. What options do we have to mitigate the high latency of storage?

Group Commits

Instead of writing every single change to the SSD immediately, we can instead queue many such writes and persist the whole queue in one go. Instead of having one latency-sensitive round trip per commit, we now have one round trip per queue flush, let’s say every 100 commits. We just have to be careful not to tell the user that their data has been committed before the queue containing their data has been flushed.

Group commits really shine when there are many users and clients inserting data in parallel, so that the queue is never empty. On the other hand, when not much data is being written, group commits actually make the problem worse: If no new data is coming in, the database system has to bite the bullet and flush the nearly empty queue at some point, or else you will have to wait indefinitely. This will undoubtedly make you doubly unhappy: You have to endure the SSD latency and the group commit window.

Database systems can mitigate this problem somewhat by dynamically changing the group commit window based on load. However, group commits are still only advantageous if the load on the system is not negligible.
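To make the idea concrete, here is a minimal group commit sketch in Python. It is a toy, not CedarDB's implementation: commits go into a shared queue, a background thread flushes the whole queue with a single fsync per window, and each committer only returns once a flush covering its record has completed. The fixed one-millisecond window stands in for the adaptive window described above:

import os
import threading
import time

class GroupCommitLog:
    # Toy write-ahead log that batches commits into one fsync per window.

    def __init__(self, path: str, window: float = 0.001):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.window = window                  # group commit window in seconds
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.queue: list[bytes] = []
        self.enqueued = 0                     # commits handed to us so far
        self.durable = 0                      # commits already synced to disk
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record: bytes) -> None:
        # Returns only after the record has been synced as part of a batch.
        with self.lock:
            self.queue.append(record)
            self.enqueued += 1
            my_seq = self.enqueued
            while self.durable < my_seq:
                self.flushed.wait()

    def _flusher(self) -> None:
        while True:
            time.sleep(self.window)           # a real system adapts this to load
            with self.lock:
                batch, self.queue = self.queue, []
                upto = self.enqueued
            if batch:
                os.writev(self.fd, batch)     # one write ...
                os.fsync(self.fd)             # ... and one sync for the batch
            with self.lock:
                self.durable = upto
                self.flushed.notify_all()

Many client threads calling commit() now share one fsync per window instead of paying the full sync latency each, which is exactly the trade-off discussed above: great under load, pure added delay when a single client commits alone.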

Asynchronous Processing

Another solution is to use asynchronous processing and just never wait for the SSD. There are two ways to do this:

Adapting the Application

We could encourage the developer to rewrite their application code to never wait for the "OK" from the database system, e.g., by using pipeline mode, which keeps multiple transactions in flight at all times.

While we definitely recommend using pipeline mode whenever possible as it also hides network latency, it’s not possible if the application’s next decision depends on the result of the previous query.
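With psycopg 3 against a PostgreSQL-wire-compatible server, pipeline mode looks roughly like this; the connection string, table, and data are placeholders:

import psycopg  # psycopg 3, which supports the PostgreSQL pipeline protocol

values = ["alpha", "beta", "gamma"]  # placeholder data

with psycopg.connect("host=localhost dbname=test") as conn:
    with conn.pipeline():
        with conn.cursor() as cur:
            for value in values:
                # Queued and sent without waiting for each result round trip.
                cur.execute("INSERT INTO events (payload) VALUES (%s)", (value,))
    # Leaving the pipeline block syncs the pipeline and raises on any error;
    # leaving the connection block commits the transaction.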

Asynchronous Commits

Alternatively, we could enable "asynchronous commits". If enabled, the database system will give the application its OK without waiting for the disk to complete the sync. The downside of this approach is that we might lose a few milliseconds of data in the event of a crash, i.e., any data whose sync request was still in flight.

Is this trade-off worth it? Ultimately, it is up to you as a programmer to decide whether your use case can tolerate losing a small amount of data that the database system has told you has already been committed.

Case in point: macOS does not even do durable writes when you explicitly issue an fsync call. For durable writes, fcntl(F_FULLFSYNC) might work, but Apple's documentation still describes this as a "best effort guarantee". But are Apple products notorious for losing data? It certainly hasn't hurt their bottom line.
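For completeness, requesting the strongest flush macOS offers looks roughly like this; Python's fcntl module exposes F_FULLFSYNC there, and the file name is a placeholder:

import fcntl
import os

fd = os.open("./synctest", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"important bytes")

if hasattr(fcntl, "F_FULLFSYNC"):
    # macOS: ask the drive itself to flush its cache ("best effort", per Apple).
    fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
else:
    os.fsync(fd)  # elsewhere, fsync is the strongest portable option
os.close(fd)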

Server-Grade SSDs

Enterprise SSDs, which you are hopefully using in production environments, have a capacitor-backed write cache (marketed as Power-Loss Protection), and sync calls virtually become a “no-op”. Here we really do get durable writes with 10 microsecond latency!

If you really want durable writes with sub-millisecond latency, you might want to pay the ~20% premium that vendors charge for such enterprise SSDs.

Unfortunately, my laptop, and probably many other developers’ laptops, do not have an enterprise-grade SSD, or even the option to install one.

How CedarDB Mitigates SSD Bottlenecks

CedarDB combines all of the above approaches to give you the best of all worlds:

  • It uses group commits with an automatically adjusting time window to get the best performance under medium or high load.
  • It supports PostgreSQL’s pipeline mode to allow you (or your database driver) to make asynchronous requests whenever possible.
  • It automatically detects if your system’s SSD has a write cache. If you have such an SSD, CedarDB uses fully ACID-compliant synchronous commits at no extra cost. If not, CedarDB will fall back to asynchronous commits, assuming it is running in a non-production environment where the small chance of data loss during a system failure is acceptable. In this case, CedarDB will notify you with the following warning:
WARNING: Your storage device uses write back caching. Write durability might be delayed. 
See: https://cedardb.com/docs/references/writecache

You can of course instruct CedarDB to run in full synchronous commit mode anyway and even enable or disable asynchronous commits on a per-connection basis.

How Other Database Systems Deal With SSD Latency

Let’s look at two other popular database systems and how they deal with SSD latency bottlenecks.

MongoDB

MongoDB also uses write-ahead logging via a separate journal. Each write operation is appended to this journal, which is then flushed to disk every 100 milliseconds, taking aspects of the group commit and async commit approaches. If you cannot tolerate losing up to 100 milliseconds of data, you can optionally force an early flush of the journal when issuing a write, increasing the commit latency.
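With pymongo, this trade-off is exposed through the write concern: j=True only acknowledges the write after the journal entry is on disk. A small sketch with a placeholder connection, collection, and document:

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Default write concern: the insert can be acknowledged before the journal
# is flushed (the journal is synced periodically, as described above).
fast = db.get_collection("events")
fast.insert_one({"value": 42})

# j=True: acknowledged only after the journal entry is on disk, so each
# insert pays the full sync latency.
durable = db.get_collection("events", write_concern=WriteConcern(j=True))
durable.insert_one({"value": 43})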

PostgreSQL

As a truly battle-tested database system, PostgreSQL offers very strong durability guarantees. It flushes transactions to disk on commit, and only then acknowledges the commit to the client. PostgreSQL also offers a commit_delay configuration parameter, which delays the WAL flush by a configurable number of microseconds so that concurrently committing transactions can share it, acting as a simple group commit mechanism without variable time windows. In addition, PostgreSQL allows the user to globally enable asynchronous commits (via synchronous_commit = off), or the even more invasive option of simply disabling fsync. The latter option is strongly discouraged, however, as it can lead to an unrecoverable database state in the event of a crash, causing the user to lose all their data.
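In Python, switching a single session to asynchronous commits is a one-liner; the connection string and table are placeholders, and the trade-offs are exactly the ones described above:

import psycopg

with psycopg.connect("host=localhost dbname=test") as conn:
    with conn.cursor() as cur:
        # Asynchronous commits: COMMIT returns before the WAL flush. A crash
        # may lose the last few transactions, but the database stays
        # consistent, unlike fsync = off, which risks losing everything.
        cur.execute("SET synchronous_commit = off")
        cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))

# commit_delay and fsync are typically configured server-side in
# postgresql.conf rather than per connection.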

Does It Matter That Your SSD (Probably) Sucks?

Probably not.

If you run your database system on a consumer-grade desktop PC or laptop, you can get similar performance

  • at the cost of potentially losing a few milliseconds of data due to async commits, or
  • at the cost of reduced application expressiveness via transaction pipelining.

In this case, however, you probably have other, more catastrophic data loss vectors to worry about. In other words: You probably don’t care if the fsync call completed when your laptop burns down in a house fire or gets stolen - the data is gone anyway.

If you are running database systems in a production environment, you (or your cloud provider) are hopefully using enterprise-grade SSDs with capacitor-backed write cache that have great write latency.

If you want to see for yourself how the different trade-offs affect performance, or find out whether your SSD really sucks, why not join our waitlist and be among the first to get your hands on CedarDB?

Join our waitlist!

Supported by
The LunaBase project is funded as part of the EXIST program by the German Federal Ministry for Economic Affairs and Climate Action and the European Social Fund.