To B or not to B: B-Trees with Optimistic Lock Coupling
B-Trees are one of the few specialized data structures that have truly stood the test of time, being over 50 years old. They were invented in year zero of Unix time, on a system with a whopping 7.25MB of disk storage — not DRAM! They are only one month younger than the idea of relational databases itself and witnessed the birth of multi-core architectures, complex cache hierarchies, fast network-attached storage, and many more groundbreaking changes. However, while most technologies of that age would show their years and could comfortably retire, B-Trees are still ubiquitous for data storage.
Old but gold
Many of the tried and true file and database systems use B-Trees, and new systems still actively research them. In this blog post, we explore why I consider B-Trees the most practical data structure, and why they are an excellent fit for modern hardware despite predating it by decades.
CedarDB uses B-Trees as its main data storage for two particular properties that translate well onto modern systems:
- They are incredibly cache efficient, and make great use of large caches.
- B-Trees can be effectively synchronized between hundreds of threads.
These properties are important to efficiently store and retrieve data with low latency.
Pure data warehouse systems often forgo index structures since they typically scan most of the data sequentially.
However, for other systems, indexes are essential for low-latency point lookups, small joins, and to enforce foreign key
constraints.
In CedarDB, we combine data warehouse capabilities with short-running transactions.
For data warehouse-style analytics, we also often rely on full table scans with block-wise min-max bounds pruning.
This usually works great for implicitly clustered attributes like creation date but already fails for last access date,
as data isn’t always neatly clustered.
Especially for large data sets, a CREATE INDEX
can make the difference between your queries running in minutes or
milliseconds.
Therefore, we can’t go the easy route of skipping index structures. Instead, we revisited good old B-Trees and adapted them to work efficiently on modern highly parallel hardware.
Cache efficiency
One of the nice features of B-Trees is their great data locality. IMHO, B-Trees are in practice close enough to being cache oblivious, so they benefit greatly from the hardware trend of ever faster and larger memory without any changes to the code. To find a value in a B-Tree, we only need a few truly random accesses between the nodes of our tree.
To quantify the exact number of random accesses, let’s take a look at how many levels there are in a practical B-Tree. The real-world ClickBench dataset has about 100M rows and 70GB raw data. Its schema defines a primary key index over 5 columns, which results in a 4GB B-Tree index.
CedarDB uses 64KB pages, so a B-Tree with 32-byte keys has a fan-out of 1361. So, for each level (i.e., random access) in the B-Tree, we narrow down the search range in our dataset by three orders of magnitude. As a rule of thumb, a thousand rows need one level, a million rows two, a billion rows three, etc. The 100M rows of ClickBench therefore fit comfortably into three levels. And indeed, the root node of our B-Tree has 49 children, and the tree has a total height of only three. The following image shows a visualization of this massive fan-out.
Visualization of the ClickBench primary key B-Tree. The root node in the middle connects via 49 inner to a total of 66,689 leaf nodes.
This image illustrates just how quickly the fan-out of a B-Tree escalates. Here, we use a radial node layout, since its over 60k leaf nodes are far too many for a conventional tree visualization. While the root and the inner nodes are still pretty clear here, this is mainly because the root is severely underfull.
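As a quick sanity check of the fan-out and height numbers above, we can do the math in a few lines of code. The 16-byte child reference and the 256-byte page header below are illustrative assumptions rather than CedarDB's actual page layout, but they land in the same ballpark:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative constants: 64KB pages and 32-byte keys as in the post;
    // the child reference and page header sizes are assumptions.
    const double pageSize = 64 * 1024;
    const double header = 256;
    const double entry = 32 /* key */ + 16 /* child reference */;

    double fanout = std::floor((pageSize - header) / entry);
    double rows = 100e6;  // ClickBench: ~100M rows
    double height = std::ceil(std::log(rows) / std::log(fanout));

    std::printf("fan-out ~ %.0f, height for 100M rows: %.0f levels\n",
                fanout, height);
    // Prints roughly: fan-out ~ 1360, height for 100M rows: 3 levels
}
```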
The fan-out of the B-Tree also corresponds nicely to the cache hierarchy of modern CPUs. While each cache level does not grow quite as drastically, they follow the same trend: the 64KB root node of our B-Tree fits nicely into the L1/L2 cache of a modern CPU, the next-level inner nodes into L3, and the leaf nodes into main memory.
The cache topology of modern CPUs, which follows a similar hierarchical growth as B-Tree nodes.
Hot data automatically stays cached in the huge caches of modern CPUs because it is accessed so frequently. While B-Trees were initially designed for systems with small main memory, their cache locality nowadays maps well onto the cache hierarchy and still allows effortless scaling beyond caches or main memory. In-memory data structures like hash tables or radix trees can be faster, but they have their own set of challenges when scaling to larger-than-memory data sizes and parallel access.
Synchronization with Lock Coupling
Efficiently sharing the state of an index structure is a huge topic. Many systems can avoid synchronization between threads by distributing or partitioning their work. Database systems, however, are all about managing shared state. Therefore, they need to efficiently synchronize their data structures.
The easy solution of a global read/write lock seems simple enough, but servers now have 100+ cores, and taking a global lock is prohibitively expensive. While lock-free data structures would be ideal, those are usually copy-on-write, which doesn't really scale for large data sizes. Nevertheless, the ultimate goal is to avoid contention, so all parallel cores can do useful work. B-Trees solve this with lock coupling of tree nodes, which quickly makes the locking fine-grained. We start with a lock on the root node, which effectively locks everything, but we quickly narrow this down to the actual range of data accessed.
Animation of lock coupling in the B-Tree. We first take a red lock on the root node, before we lock the yellow lock on the inner node, and at last arrive at the fine-grained leaf lock in green.
While this naive implementation still has some contention, let's first look at the basics before we turn to optimistic lock coupling, which can make most operations lock-free.
The basic idea behind lock coupling is that we always hold a lock on the data we're interested in, first very coarsely, then increasingly fine-grained. So, to access a row in the B-Tree, we start by locking the root node before we descend towards the range we want to access. Then we lock the inner node and can unlock the root, always holding either one or two locks.
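A minimal sketch of this hand-over-hand descent could look as follows; the node layout and the per-node std::shared_mutex are simplified assumptions for illustration, not CedarDB's actual pages:

```cpp
#include <cstdint>
#include <shared_mutex>

// Simplified node layout for illustration only.
struct Node {
    std::shared_mutex latch;
    bool isLeaf = false;
    uint32_t count = 0;
    uint64_t keys[1361];
    Node* children[1362];

    Node* findChild(uint64_t key) {
        uint32_t i = 0;
        while (i < count && keys[i] <= key) ++i;  // linear scan for brevity
        return children[i];
    }
};

// Hand-over-hand descent: we always hold the latch of the node we are
// inspecting, and at most two latches at a time.
Node* descend(Node* root, uint64_t key) {
    Node* node = root;
    node->latch.lock_shared();           // coarse: effectively locks the tree
    while (!node->isLeaf) {
        Node* child = node->findChild(key);
        child->latch.lock_shared();      // first lock the child ...
        node->latch.unlock_shared();     // ... then release the parent
        node = child;                    // the lock becomes ever finer
    }
    return node;  // the caller reads the leaf and calls unlock_shared()
}
```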
While lock coupling profits greatly from the high fan-out of our B-Trees, this also highlights the bottleneck of the shared root node where all accesses start. While we can usually lock the root node in shared mode, this shared lock still needs to update the lock state to maintain a reference count. While modifying the refcount itself is lock-free with atomic operations, writing to shared state still causes contention due to the way CPUs implement this using the MESI cache coherency protocol.
Consider the following example of two concurrent threads reading and shared-locking the root node of a B-Tree. Note that this example explicitly writes out the cache coherency state transitions to illustrate the contention.
- Thread 1: lock_shared(root): get the root's cache line exclusive → increment refcount → cache line is now modified
- Thread 2: lock_shared(root): get the cache line exclusive → invalidate the remote copy → increment refcount → modified
- Thread 1: unlock_shared(root): get the cache line exclusive → invalidate the remote copy → decrement refcount → modified
- Thread 2: unlock_shared(root): get the cache line exclusive → invalidate the remote copy → decrement refcount → modified
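To make those writes concrete, here is roughly what a shared lock based on an atomic reference count looks like. Even the "read-only" path performs an atomic read-modify-write on shared memory, which is exactly what bounces the cache line between cores (a simplified sketch, not the actual std::shared_mutex implementation):

```cpp
#include <atomic>
#include <cstdint>

struct RefCountLock {
    std::atomic<uint64_t> refcount{0};

    void lock_shared() {
        // Atomic increment: the CPU must own the cache line exclusively,
        // invalidating every other core's copy, even though we only "read".
        refcount.fetch_add(1, std::memory_order_acquire);
    }
    void unlock_shared() {
        refcount.fetch_sub(1, std::memory_order_release);
    }
};
```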
Optimistic lock coupling
All this fighting for exclusive write access causes metadata contention between threads, even though we never modify the shared data structure and only ever take a shared lock. A neat locking alternative that avoids the refcount contention is a sequence lock, as used in the Linux kernel. Here, we don't actively protect against concurrent modifications. Instead, we introduce a sequence number to detect them after the fact.
The basic idea is to read the node without holding any (shared) locks to eliminate the contention. However, this risks a data race with concurrent modifications. Therefore, we need to validate that we read the correct data and retry the read otherwise.
The procedure to detect concurrent modifications relies on a sequence number, which writers need to increment for each modification of the B-Tree node. Readers can then compare the sequence number before and after reading to detect whether they observed any modifications. In the read-only case, this costs one extra load. However, since cache lines can now stay shared, this is a fast L1 cache read, which is practically free. Writers need to be a bit more careful: instead of simply unlocking, they increment the version when they finish, so readers can detect the modification.
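Sketched in code, the version word can double as the write lock, much like a Linux seqlock: an odd version means a writer is active, and every modification leaves the version at a new value. This is a simplified illustration of the idea, not CedarDB's exact locking code:

```cpp
#include <atomic>
#include <cstdint>

struct OptimisticLock {
    std::atomic<uint64_t> version{0};  // even: unlocked, odd: writer active

    void lock() {                       // writer side
        uint64_t v = version.load();
        // Spin until the version is even (no writer) and our CAS makes it odd.
        while ((v & 1) || !version.compare_exchange_weak(v, v + 1))
            v = version.load();
    }
    void unlock() {
        // Incrementing again makes the version even and, crucially, different
        // from what any concurrent reader saw before our modification.
        version.fetch_add(1, std::memory_order_release);
    }

    // Reader side: remember the version, and later check that no writer
    // was active in between.
    uint64_t read_begin() const { return version.load(std::memory_order_acquire); }
    bool read_validate(uint64_t v) const {
        return !(v & 1) && v == version.load(std::memory_order_acquire);
    }
};
```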
We call this optimistic locking, since we just assume that there's no writer, but have a backup plan. This cleverly exploits that B-Trees rarely update the root node. In our example of a three-level tree with a fan-out of 1361, the root is only written once every 1361 * 1361 ≈ 1.85M inserts. Of course, modifications of leaf nodes still need to lock those, but leaves are fine-grained anyway, so we get little contention that way.
While this sounds conceptually easy at first, the code to traverse a B-Tree node is still somewhat tricky. Consider the following pseudocode for a node traversal.
traverse_node(page, version, key):
  retry:
    # Optimistically read the page without holding any lock
    pageCopy = copy(page)
    if version != atomic_load(page.version):
        # A writer changed the page; pick up the new version and retry
        version = atomic_load(page.version)
        goto retry
    # Find the child pointer for the search key
    childPtr = binary_search(pageCopy, key)
    nextPage = load(childPtr)
    nextVersion = atomic_load(nextPage.version)
    # Validate that no writer overtook the reader
    if version != atomic_load(page.version):
        version = atomic_load(page.version)
        goto retry
    # Safely traverse to the next node
    page = nextPage
    version = nextVersion
Here, we assume that copy()
is a concurrency-safe copy operation.
After verifying the version sequence number, we can then be sure that we have a consistent view over the data, and we
can use regular functions to search that node.
However, copying a whole node of 64KB is rather inefficient.
In CedarDB, we use a carefully implemented binary_search that is safe under concurrent modifications, in particular one that tolerates violations of its sorted-data invariant.
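For illustration, here is a sketch of such a modification-tolerant search: it never reads out of bounds and always terminates, even if a writer temporarily leaves the keys unsorted; a wrong result is harmless because the caller re-validates the version and retries. The node layout is again an assumption, not CedarDB's real page format:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct InnerNode {
    std::atomic<uint64_t> version;
    uint32_t count;            // may change under a concurrent writer
    uint64_t keys[1361];
    uint64_t children[1362];   // child page ids
};

// Returns the child page id for `key`. Safe to call while a writer modifies
// the node: it never indexes out of bounds and always terminates. The result
// may be garbage in that case, but the caller detects this by re-checking
// the node's version and retrying.
uint64_t find_child(const InnerNode& node, uint64_t key) {
    size_t n = std::min<size_t>(node.count, 1361);  // clamp a torn count
    size_t lo = 0, hi = n;
    while (lo < hi) {          // always shrinks the range, even if unsorted
        size_t mid = lo + (hi - lo) / 2;
        if (node.keys[mid] <= key) lo = mid + 1;
        else hi = mid;
    }
    return node.children[lo];  // lo <= n <= 1361, still in bounds
}
```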
Performance of optimistic lock coupling. Optimistic locking of the B-Tree is almost as good as unsynchronized reads.
Optimistic locks add a bit of complexity, but honestly, I've seen worse for truly lock-free data structures. Nevertheless, the performance numbers reported in the 2019 paper are still impressive and show almost no overhead over unsynchronized lookups. Perhaps surprisingly, write operations are also faster, even though the write locks are more involved. Avoiding the contention on the read-mostly root node's lock state is also beneficial here.
B-Trees won’t become obsolete
While change is frequent, especially in computer science, B-Trees are here to stay for several reasons: Their locality translated seamlessly from small disks to the caches of modern systems. This also makes them close enough to cache-oblivious, adapting to whatever resources are available. The same goes for synchronization: with their high fan-out and optimistic lock coupling, they thrive in highly parallel environments. While they are technically not lock-free, being practically lock-free is good enough. There might be faster, more tailored data structures for special applications, but for general-purpose performance, B-Trees work so well that everything else has seriously diminishing returns.