Why you should care about strings
If dealing with strings wasn’t important, why bother thinking about compressing them? Whether you like it or not, “strings are everywhere”1. They are by far the most prominent data type in the real world. In fact, roughly 50% of data is stored as strings. This is largely because strings are flexible and convenient: you can store almost anything as a string. As a result, they’re often used even when better alternatives exist. For example, have you ever found yourself storing enum-like values in a text column? Or, even worse, using a text column for UUIDs? Guilty as charged, I’ve done it too, and it turns out this is actually very common.1
Recently, Snowflake published insights into one of their analytical workloads2. They found that string columns were not only the most common data type, but also the most frequently used in filters.
This means two things: First, it is important to store these strings efficiently, i.e., in a way that does not waste resources, including money. Second, they must be stored in a way that allows queries to be answered efficiently, because that is what users want: fast responses to their queries.
Why you want to compress (your strings)
Compression (usually) reduces the size of your data, and therefore the resources your data consumes. That can mean real money if you’re storing data in cloud object stores where you pay per GB stored, or simply less space used on your local storage device. However, in database systems, size reduction isn’t the only reason to compress. As Prof. Thomas Neumann always used to say in one of my foundational database courses (paraphrased): “In database systems, you don’t primarily compress data to reduce size, but to improve query performance.” As compression also reduces the data’s memory footprint, your data might all of a sudden fit into CPU caches where it previously only fit into RAM, cutting the access time by more than 10x. And because data must travel through bandwidth-limited physical channels from disk to RAM to the CPU registers, smaller data also means reading more information in the same amount of time, and thus better bandwidth utilization.
String Compression in CedarDB
Before diving deeper into FSST, it makes sense to take a brief detour and look at CedarDB’s current compression suite for text columns, as this helps explain some design decisions and better contextualize the impact of FSST. Until the 22nd of January 2026, CedarDB supported the following compression schemes for strings:
- Uncompressed
- Single Value
- Dictionary
The first two schemes are special cases: we choose them either when compression is not worth it because all the strings are very short (Uncompressed), or when the domain contains only a single value (Single Value).
Since dictionary compression will play an important role later in this post, let’s first take a closer look at how CedarDB builds dictionaries and the optimizations they enable.
To illustrate string compression in CedarDB, let’s consider the following string data throughout the remainder of this blog post:
https://www.cit.tum.de/
https://cedardb.com/
https://www.wikipedia.org/
https://www.vldb.org/
https://cedardb.com/
…
Dictionary Compression
Dictionary compression is a well-known and widely applied technique. The basic idea is to store all the unique input values in a dictionary, and then compress the input by substituting each value with a smaller, fixed-size integer key that references its string in the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
The attentive reader may have noticed two things. First, we store the offsets to the strings in our dictionary. This is necessary because we want to enable efficient random access to our dictionary. Efficient random access means that, given a key, we can directly jump to the correct string using the stored offset. Without storing the offset, we would need to read the dictionary from beginning to end until we find the desired string. Also, since strings have variable sizes, some kind of length information must be stored somewhere.
Second, our dictionary data is lexicographically ordered.
Performing insertions and deletions in such an ordered dictionary would be quite costly. CedarDB already treats compressed data as immutable3, sidestepping this issue.
An ordered dictionary provides interesting properties that will be useful when evaluating queries on the compressed representation.
For example, if str1 < str2, then key_str1 < key_str2 in the compressed representation. You’ll understand later why this comes in handy.
Additionally, we might be able to further compress the keys. Since the value range of the keys depends on the number of values in the dictionary, we might be able to represent the keys with one or two bytes instead of four bytes. This technique, called truncation, is part of our compression repertoire for integers and floats.
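To make all of this concrete, here is a minimal C++ sketch of building such an ordered dictionary. The names, layout, and 32-bit offsets are illustrative assumptions for this post, not CedarDB’s actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A minimal sketch of an ordered string dictionary with an offset array.
struct OrderedDictionary {
    std::string data;               // all distinct strings, concatenated in sorted order
    std::vector<uint32_t> offsets;  // offsets[i] = start of string i; offsets[i + 1] ends it

    size_t size() const { return offsets.size() - 1; }
    std::string_view at(uint32_t key) const {
        return std::string_view(data).substr(offsets[key], offsets[key + 1] - offsets[key]);
    }
};

std::pair<OrderedDictionary, std::vector<uint32_t>>
dictionaryCompress(const std::vector<std::string>& input) {
    // Collect, sort, and deduplicate the distinct values.
    std::vector<std::string> distinct = input;
    std::sort(distinct.begin(), distinct.end());
    distinct.erase(std::unique(distinct.begin(), distinct.end()), distinct.end());

    OrderedDictionary dict;
    dict.offsets.push_back(0);
    for (const auto& s : distinct) {
        dict.data += s;
        dict.offsets.push_back(static_cast<uint32_t>(dict.data.size()));
    }

    // A string's key is its rank in the sorted dictionary, so str1 < str2
    // implies key_str1 < key_str2.
    std::vector<uint32_t> keys;
    keys.reserve(input.size());
    for (const auto& s : input) {
        auto it = std::lower_bound(distinct.begin(), distinct.end(), s);
        keys.push_back(static_cast<uint32_t>(it - distinct.begin()));
    }
    return {dict, keys};
}
```

With only a handful of distinct URLs in our example, every key would even fit into a single byte, which is exactly what truncation exploits.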
Evaluating Queries on Dictionary-Compressed Data
No matter the scheme, decompressing data is always more CPU-intensive than not decompressing data. Thus, we want to keep the data compressed for as long as possible. CedarDB evaluates filters directly on the compressed representation when possible.
Consider the following query on our input data:
SELECT * FROM hits WHERE url='https://cedardb.com/';
Our ordered dictionary comes in handy now. Instead of comparing against every value in the dictionary, we can perform a binary search on the dictionary values to find the key representing our search string “https://cedardb.com/”. This is because the order of the keys matches the order of the uncompressed strings in the dictionary.
If we don’t find a match, we already know that none of the compressed data will equal our search string; therefore, we can stop here. If we find a match, then we know the compressed representation of the string, i.e., the index in the dictionary where the match was found. Note that we perform this procedure only once per compressed block, so the operation is amortized across 2¹⁸ tuples (the size of our compressed blocks).
Now that we have found the key, we can perform cheap integer comparisons on the compressed keys. These small, fixed-size integer keys allow us to perform comparisons more efficiently, since we can leverage modern processor features such as SIMD (vectorized instructions) that let us perform multiple comparisons with a single instruction. Using SIMD for string comparisons is also possible; however, the variable size of strings makes these comparisons more involved and thus less efficient.
The comparisons then produce the tuples (or rather, their indexes in our compressed block) that satisfy the predicate. We can then decompress only the values needed for the query output or use the qualifying tuples for further processing. For example, we can reduce the amount of work required by restricting other filters to only look at tuples 1 and 4, since the other tuples won’t make it into the result.
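Putting this together, a sketch of the whole equality filter on a dictionary-compressed block could look as follows. It reuses the illustrative OrderedDictionary from the sketch above and is, again, not CedarDB’s actual code:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Evaluate `url = needle` on one dictionary-compressed block. This runs once
// per block, so the binary search is amortized across all 2^18 tuples.
std::vector<uint32_t> evalEquals(const OrderedDictionary& dict,
                                 const std::vector<uint32_t>& keys,
                                 std::string_view needle) {
    // Binary search the sorted dictionary: key order matches string order.
    uint32_t lo = 0, hi = static_cast<uint32_t>(dict.size());
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (dict.at(mid) < needle) lo = mid + 1; else hi = mid;
    }
    if (lo == dict.size() || dict.at(lo) != needle)
        return {};  // needle not in the dictionary: no tuple can match

    // Cheap fixed-size integer comparisons over the key column; a tight loop
    // like this is easy for the compiler to vectorize with SIMD.
    std::vector<uint32_t> qualifying;
    for (uint32_t i = 0; i < keys.size(); ++i)
        if (keys[i] == lo) qualifying.push_back(i);
    return qualifying;
}
```

Note how no string in the block itself is ever touched: the dictionary is searched once, and everything afterwards is integer work.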
Why not stop here?
Hopefully, all of what I just described sounds interesting, and you now have an intuition for how this can accelerate query processing. If dictionaries were the best compression scheme for strings, an ultimate solution without any drawbacks, we would stop here. Unfortunately, dictionaries do have drawbacks. For one, they only perform well on data with few distinct values. After all, they force us to store every single distinct string in full. While real-world data is usually highly repetitive, we can’t rely on that: the number of possible distinct strings is boundless.
As you can see, and as the colors indicate, the strings share many common patterns. From an information-theoretic perspective, the strings have fairly low entropy, meaning they are more predictable than completely random strings. Another compression scheme could exploit this predictability, and this is where FSST comes in!
FSST
FSST (Fast Static Symbol Table)4 operates similarly to a tokenizer in that it replaces frequently occurring substrings with short, fixed-size tokens. In FSST lingo, substrings are called “symbols” and can be up to eight bytes long. Tokens are called “codes” and are one byte long, meaning there can be at most 256 of them. These 1-byte codes are useful because they allow working on byte boundaries during compression and decompression. Since there can be only 256 different symbols, all the symbols easily fit into the first-level (L1) cache of modern processors, allowing for very fast access (~1 ns). The table that stores the symbols is static, i.e., it won’t change after construction, and FSST’s design goals are fast compression and decompression; hence the name “Fast Static Symbol Table.” Let’s look at it in a bit more detail.
FSST compression
FSST compresses in two phases. First, it creates a symbol table using a sample of the input data. Then, it uses the symbol table to tokenize the input. Working on a sample accelerates the compression process. Intuitively, sampling should work well because symbols that appear frequently in the input data will also appear frequently in a randomly selected sample. To illustrate the compression process, consider the following sample of our input data:
https://www.wikipedia.org/
https://cedardb.com/
…
During construction of the symbol table, the algorithm iteratively modifies an initially empty table. First, it compresses the sample using the existing table and, while doing so, counts how often the table’s symbols and individual bytes appear in the sample. To extend existing symbols, it also counts the occurrences of combinations of two symbols. Then, in a second step, it selects the 255 symbols that yield the best compression gain. The compression gain of a symbol is the number of bytes that would be eliminated from the input by having this symbol in the table. It is calculated as follows: gain(s1) = frequency(s1) * size(s1). This process is illustrated below using our sample.
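To give a rough idea of the selection step, here is a heavily simplified sketch of a single construction round in C++. It only considers single bytes and 2-byte combinations as candidates, whereas the real algorithm runs several rounds, compressing the sample with the current table and growing symbols up to eight bytes:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Count candidate symbols in the sample, then keep the 255 candidates with
// the highest compression gain: gain(s) = frequency(s) * size(s).
std::vector<std::string> pickSymbols(const std::vector<std::string>& sample) {
    std::map<std::string, uint64_t> frequency;
    for (const auto& s : sample) {
        for (size_t i = 0; i < s.size(); ++i) {
            frequency[s.substr(i, 1)]++;                        // single byte
            if (i + 1 < s.size()) frequency[s.substr(i, 2)]++;  // 2-byte combination
        }
    }

    // Rank candidates by the number of input bytes they would eliminate.
    std::vector<std::pair<uint64_t, std::string>> ranked;
    for (const auto& [sym, freq] : frequency)
        ranked.push_back({freq * sym.size(), sym});
    std::sort(ranked.begin(), ranked.end(), std::greater<>());

    std::vector<std::string> symbols;  // at most 255: one code stays reserved
    for (const auto& [gain, sym] : ranked) {
        if (symbols.size() == 255) break;
        symbols.push_back(sym);
    }
    return symbols;
}
```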
If you’ve been paying close attention, you might have noticed that only 255 symbols are picked, even though a single byte can represent 256 values. The reason for this is that code 255 is reserved for the “escape code”. This code is necessary because the symbol table cannot (or rather, should not) hold all 256 individual byte values. Otherwise, FSST would not compress at all since one-byte symbols would be substituted by one-byte codes. Consequently, the input may contain a byte that cannot be represented by one of the codes. The “escape code” tells the decoder to interpret the next byte literally.
Now that the symbol table is constructed, it can be used to compress the input. This is done by scanning each string and looking for the longest symbol matching at the current string offset. When such a symbol is found, its code is written to the output buffer and the input is advanced by the length of the symbol. If no symbol matches the byte at the current offset, the escape code followed by the literal byte is written to the output buffer.
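A naive sketch of this greedy encoding loop might look like this. The linear scan over the table is for clarity only (the real implementation finds matches much faster), and kEscape is an assumption following the escape-code convention described above:

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

constexpr uint8_t kEscape = 255;  // assumed escape code, as described above

// Greedy longest-match encoding: symbols[code] holds the symbol for codes 0..254.
std::string fsstCompress(std::string_view input, const std::vector<std::string>& symbols) {
    std::string out;
    size_t pos = 0;
    while (pos < input.size()) {
        // Find the longest symbol (up to 8 bytes) matching at the current offset.
        int best = -1;
        size_t bestLen = 0;
        for (size_t code = 0; code < symbols.size(); ++code) {
            const std::string& sym = symbols[code];
            if (sym.size() > bestLen && input.substr(pos, sym.size()) == sym) {
                best = static_cast<int>(code);
                bestLen = sym.size();
            }
        }
        if (best >= 0) {
            out.push_back(static_cast<char>(best));  // emit the 1-byte code
            pos += bestLen;
        } else {
            out.push_back(static_cast<char>(kEscape));  // escape: next byte is literal
            out.push_back(input[pos++]);
        }
    }
    return out;
}
```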
FSST decompression
Decompression is straightforward. For each code in the compressed input, the symbol table is consulted to find the corresponding symbol, which is then written to the output buffer. The output buffer is advanced by the length of the symbol.
Note that the first “c” is different from the second “c.” The first “c” is prepended by the escape code “ESC” and is therefore interpreted literally. The second “c,” however, is a real code that refers to the symbol at index “c” (99 in ASCII) in the symbol table.
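A matching decompression sketch, reusing the definitions from the compression sketch above:

```cpp
// Decompression: one table lookup per code, one copy per symbol.
// Assumes well-formed input, i.e., an escape code is always followed by a byte.
std::string fsstDecompress(std::string_view compressed, const std::vector<std::string>& symbols) {
    std::string out;
    for (size_t i = 0; i < compressed.size(); ++i) {
        uint8_t code = static_cast<uint8_t>(compressed[i]);
        if (code == kEscape)
            out.push_back(compressed[++i]);  // escaped byte: copy literally
        else
            out += symbols[code];            // append the symbol, advancing by its length
    }
    return out;
}
```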
For more information on FSST compression and decompression and its design decisions, see the paper4.
Integrating FSST into a Database System
Until now, we have discussed FSST compression and decompression, but not how to integrate it into a modern database that optimizes for very low query latencies, like CedarDB. Specifically, we have not discussed what compressed FSST data looks like when written to persistent storage so that it can be decompressed efficiently upon re-reading, so let’s do that now.
Obviously, you want to serialize the symbol table with your data, i.e., the symbols and their lengths. Without it, you won’t be able to decode the compressed strings and will lose all your data. To support efficient random access to individual strings in our compressed corpus, we also store the offsets to the compressed strings. To decompress the third string in our input, for example, we first look at the third value in the offset array. Since both the symbol table and the offsets have a fixed size, we always know where the offset is in our compressed corpus. The offset tells us the location of our string in the compressed strings. Finally, we use the symbol table to decompress the string.
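In code, such a serialized block and its random-access path might look like this. This is an illustrative layout reusing fsstDecompress from above, not CedarDB’s actual storage format:

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// The symbol table and the offset array are fixed-size, so the start of every
// compressed string can be located without scanning.
struct FsstBlock {
    std::vector<std::string> symbols;  // serialized symbol table (symbols + lengths)
    std::vector<uint32_t> offsets;     // offsets[i] = start of compressed string i
    std::string compressed;            // all compressed strings, concatenated
};

// Random access: decompress only string i.
std::string readString(const FsstBlock& block, uint32_t i) {
    uint32_t begin = block.offsets[i];
    uint32_t end = (i + 1 < block.offsets.size())
                       ? block.offsets[i + 1]
                       : static_cast<uint32_t>(block.compressed.size());
    std::string_view slice(block.compressed.data() + begin, end - begin);
    return fsstDecompress(slice, block.symbols);  // from the decompression sketch above
}
```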
Initially, we planned to use exactly this layout. However, it has significant drawbacks when it comes to query processing, i.e., evaluating predicates on the data. To illustrate this, let’s evaluate our query on the above layout:
SELECT * FROM hits WHERE url='https://cedardb.com/';
One naive way to evaluate this query on the data would be to decompress each compressed string and then perform a string comparison. This process is illustrated below:
This is quite slow. Alternatively, one could be smarter and use the compressed representation of the search string as soon as the first match is found for further comparisons. Another option is to directly compress the search string and then compare the compressed strings. Note that this only works for equality comparisons, not for greater than or less than comparisons, as our symbols are not lexicographically sorted. Even if we sorted them, we’d ultimately end up with string comparisons because FSST-compressed strings are still strings, meaning they’re variable-size byte sequences. As already mentioned, this makes comparing them more difficult and slower than comparing small fixed-size integers, for example. Wait, does comparing small fixed-size integers ring a bell? This is exactly what dictionaries allowed us to do before! So is it possible to combine the advantages of a dictionary and FSST? Yes, it is!
Combining FSST with a Dictionary
The key idea is to create a dictionary from the input data and then use FSST to compress only the dictionary. This allows for efficient filtering of the compressed representation (i.e., the dictionary keys), as illustrated previously, while achieving better compression ratios than regular dictionary compression, as FSST allows us to leverage common patterns in the data for compression. Combining FSST and dictionary compression would look like this:
Evaluating predicates on this data works the same way as it does for dictionary compression. Decompression works the same way as it does for FSST. However, accessing the strings involves an additional level of indirection due to the dictionary keys.
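Continuing the sketches from above, the combined layout and its access path could look like this:

```cpp
// Sketch of the combined layout: dictionary keys index an FSST-compressed dictionary.
struct FsstDictBlock {
    FsstBlock dictionary;        // only the distinct strings, FSST-compressed
    std::vector<uint32_t> keys;  // one fixed-size key per tuple
};

// Accessing tuple t now goes through one extra indirection:
// key -> offset into the compressed dictionary -> FSST decompression.
std::string readTuple(const FsstDictBlock& block, uint32_t t) {
    return readString(block.dictionary, block.keys[t]);
}
```

Filters, meanwhile, still run on the keys alone, exactly as in the evalEquals sketch earlier, so the extra indirection is only paid for tuples that actually need to be materialized.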
Note that combining FSST with a dictionary is nothing new: DuckDB integrated DICT_FSST, a very similar approach, half a year ago5.
Deciding when to choose FSST
We’re almost there, but not quite yet. The open question is: How do we decide when to apply FSST to a text column? Before CedarDB supported FSST, it chose a scheme by compressing the input with each candidate and selecting the one that produced the smallest size (though there are some edge cases). However, applying FSST just to save one byte is not worth it, since decompressing an FSST-compressed string is much more expensive than decompressing a dictionary-compressed string. Thus, we introduced a penalty, X, such that the FSST-compressed data must be X% smaller than the second-smallest scheme to be chosen.
We evaluated multiple penalty values by contrasting storage size and query runtime on some popular benchmarks (see the Benchmarks section). In the end, we chose a penalty of 40%. A more detailed discussion of the trade-off will follow in the next section.
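As a sketch, the selection rule might look like this; the names and structure are illustrative, and only the 40% penalty comes from our evaluation:

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

struct Candidate { const char* name; size_t compressedSize; };

// Pick the smallest scheme, but demand that FSST beats the runner-up by at
// least `penaltyPercent` to justify its higher decompression cost.
// Assumes at least two candidates.
const char* chooseScheme(std::vector<Candidate> candidates, double penaltyPercent = 40.0) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.compressedSize < b.compressedSize;
              });
    const Candidate& best = candidates[0];
    if (std::string_view(best.name) != "fsst") return best.name;

    // FSST is smallest: only pick it if it is penaltyPercent smaller than the runner-up.
    const Candidate& runnerUp = candidates[1];
    if (best.compressedSize <= runnerUp.compressedSize * (1.0 - penaltyPercent / 100.0))
        return "fsst";
    return runnerUp.name;
}
```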
Benchmarks
As CedarDB is rooted in research, let’s stop talking and empirically validate my claims with some benchmark results. We’ll take a look at two popular analytical benchmarks, ClickBench and TPC-H. Why these two? TPC-H is an industry standard, but its data is artificially generated, while ClickBench is based on real-world data.
Storage Size
As one of the reasons for compressing data is reducing its storage size, let’s have a look at the impact of activating FSST in CedarDB on the data size:
Enabling FSST reduces the storage size of ClickBench by roughly 6GB, corresponding to a 20% reduction in total data size and a 35% reduction in string data size. For TPC-H, the effect is even more pronounced: total storage size is reduced by over 40%, while string data size shrinks by almost 60%. This is likely because the data is artificially generated and therefore contains patterns that are more easily captured by the symbol table.
Query Runtime
As previously mentioned, we also aim to compress data to improve query performance. To measure this, we ran all queries from ClickBench and TPC-H on CedarDB with FSST active, and compared them to runs without FSST, normalizing the results by the runtime without FSST. To provide a clearer picture, we differentiate between cold runs, where the data is stored on disk and the query is executed for the first time, and hot runs, where the data is already cached in memory.
Cold Runs
As shown, activating FSST has a positive effect on cold runs for both ClickBench and TPC-H. For ClickBench, query runtimes improve by up to 40% for queries that operate mainly on FSST-compressed data, such as queries 21, 22, 23, 24, and 28. On my machine, this 40% reduction corresponds to an absolute runtime decrease of over a second, which is a substantial impact. For TPC-H, the effect is less pronounced, likely because the queries are more complex, i.e., there is a lot of other work to do, so loading data from disk is less of a bottleneck than it is for the mostly simple ClickBench queries. Nevertheless, we still observe a speedup of up to 10% for query 13. Moreover, not only is the impact on individual TPC-H queries smaller, but activating FSST also affects fewer queries. This is because fewer filters are applied to FSST-compressed string columns compared to ClickBench.
Hot Runs
Looking at the hot runs, the effect is quite the opposite. For ClickBench queries 21, 22, 23, 24, and 28, which showed the largest runtime improvements in the cold runs, the hot runtimes are higher by up to 2.8x. Since the data is already cached in memory, loading it from disk is no longer the bottleneck; instead, decompressing the FSST-compressed strings becomes the limiting factor. All of these queries need to decompress most of the strings in the text columns to evaluate their LIKE (or, in the case of Q28, length) predicates, and since decompressing FSST is more expensive than simple dictionary lookups, execution becomes slower. Note that this is really the worst-case scenario for compression in database systems: very simple queries that require decompressing nearly all the data to produce a result. As they say, there’s no free lunch, and you can’t beat physics.
Queries that can operate on FSST-compressed columns without full decompression, such as query 31, continue to benefit in the hot runs, in this case achieving a notable 25% speedup.
Once again, the effect is less pronounced for TPC-H. For query 13, the runtime is 2.5× higher because a LIKE predicate is applied on almost all values of an FSST-compressed column.
Note that, although the relative runtime difference is higher than in the cold runs, the absolute difference is an order of magnitude smaller. For example, for query 22, being 2.8x slower corresponds to only about 100ms.
This is because reading data from memory is much faster than reading it from disk.
As another idea for those wanting to integrate this into their system, one way to improve the performance of simple, decompression-heavy queries might be to cache decompressed strings in the buffer manager. This way, subsequent queries touching the same data can read the decompressed strings directly without repeated decompression. However, decompressed strings take up more space than compressed strings, so there is less space in the buffer manager for other things that could help answer queries efficiently, ultimately making it a trade-off again. I don’t have any numbers on that yet because I haven’t had the chance to implement this. Let me know if you do.
Conclusions
As you might have noticed, compressing data is always a trade-off between storage size and decompression speed. After careful consideration, we decided to activate FSST to reduce CedarDB’s storage footprint and improve loading times.
While some queries may not benefit, or even slightly suffer from FSST, working on better-compressed strings improves overall resource usage and query performance, making it a net win.
If you’ve read this far, I hope I’ve given you some insight into string compression schemes and FSST, as well as what it takes to integrate such a scheme into a database system, along with the potential implications.
If you have any questions about the topics covered in this post, feel free to reach out at contact@cedardb.com or contact me directly at nicolas@cedardb.com.
To see how well CedarDB compresses your own data and how this affects query performance, download the community version from our website and examine the compression schemes applied to your data using the system table cedardb_compression_infos.


