An ode to PostgreSQL, and why it is still time to start over
No matter what your opinion is on the efficacy of the Modern Data Stack, one thing is for sure: there are now a metric ton of database systems out there. Whether you’re trying to do real-time streaming, analyze a graph, build a warehouse, power an app, or simply determine the sentiment of text… well, there’s a system out there just for that. Having a dedicated, use-case-specific tool for exactly what you need to do is great, until it’s not: you inevitably end up with major tool sprawl, maintaining many tools because companies rarely have just one task to solve. But this does not have to be the case.
Not too long ago, there was a single knight in shining armor handling all your data needs with ease, no matter what comprised your workload. That knight was, and for some still is, PostgreSQL. Because of this, people are loyal to PostgreSQL to a sometimes ridiculous extent. Companies resort to running hundreds or, as we have seen in some cases, thousands of read replicas because of how attached they are to PostgreSQL, despite the fact that they’d probably be better served by different technology at that point. The way we build software (and how it gets used) has radically changed since PostgreSQL was invented, and it’s simply not going to be the right tool for every job anymore. But is this really it? Do our only two options have to be either trying to drive in all our screws with a hammer, or 18 different databases haphazardly stitched together?
How we got here
Database infrastructure forms the backbone of most organizations, so the idea of radically overhauling this core functionality is understandably daunting - after all, the maxim “never change a running system” exists for good reasons. Because of this, databases tend to evolve very slowly: drastic changes to the codebase scare away users and maintainers. To this day, many assumptions made for computer hardware in the 1990s are still prevalent in PostgreSQL’s architecture, like access granularity in 8 KB pages optimized for spinning disks, and single-threaded query evaluation (although there has been some success in moving towards intra-query parallelism). Do such outdated assumptions make PostgreSQL generally a bad choice? Not in the slightest: it is the epitome of reliability exactly because it has seen little large-scale internal change for decades.
But this stability also comes at a cost. While PostgreSQL’s internals remained unchanged, the use cases for analyzing data and the underlying hardware changed drastically. Up until the mid-2000s, companies had few but valuable data points to store and analyze on expensive single-core machines. However, the spread of the internet into all aspects of people’s business and private lives suddenly spawned millions to billions of data points to analyze, from clickstreams to sensor and server logs. At the same time, hardware became cheaper and better. Instead of a single CPU core per machine and a few megabytes of RAM, many-core architectures with giga- to terabytes of main memory became affordable. And because PostgreSQL’s need for stability kept it from adapting to these trends, the only way for users to accommodate this data in PostgreSQL was excessive replication.
While PostgreSQL held true to its course, others stepped up to address these changing requirements. Spark was one of the first, and undoubtedly one of the most successful, to seize the opportunity of providing analytical solutions for this new big data, promising limitless scaling across many workers when aggregating and analyzing these high-volume data points to draw valuable insights. But it did just that, offering no solution for actually managing this data. For its big data analytics use case, that was fine: in contrast to the very valuable data points of the past, each entry now held little information, and therefore value, on its own. Transactional consistency, or even durability, of all the terabytes of data they now had to analyze was therefore no longer a top priority.
Combined with PostgreSQL’s inability to analyze these huge amounts of data, the emergence of Spark (and other tools like it) led people to move data out of the original transactional database system. This move started the split of transactional and analytical data processing into different processes altogether, a reality now taught as canon to undergrad CS students (OLTP vs. OLAP). While the former remained in database systems like PostgreSQL, the latter moved to systems like Spark, with data residing in largely data-agnostic stores like Hadoop’s HDFS, Parquet files, or simple CSV files. None of these new analytical systems was built to be a general-purpose solution; instead, each specialized in individual challenges. Consequently, users were suddenly facing countless systems, each providing functionality for one specific kind of data, like semi-structured data, graph data, time series data, and so on, with sometimes overlapping feature sets.
What is wrong with the new guy?
At first glance, the modern data stack as we know it today seems like the land of milk and honey. For every data challenge you have to solve, you get to pick your favorite out of countless solutions, each focusing on your specific use case. Looking to gain insights into relational data? Get BigQuery, Redshift, or Snowflake. Need to analyze streams? There’s Materialize and RisingWave for that. If you want to find graph-like connections in your data, then use Neo4j, TigerGraph, or Amazon Neptune. Too many systems to figure out how to move data between them? There’s a tool for that: Fivetran. Don’t want to recreate your data transformations for each new system? Just use dbt. Of course there is still a place for PostgreSQL in this data stack as well, but only as a solution for transactional challenges, one building block out of many.
Even specialized subtasks, like clickstream analysis, have dedicated solutions such as ClickHouse. And for the specific problem that ClickHouse is trying to solve, it offers a rich feature set and high performance (enough to earn a commendable place on the podium in its own custom-made clickstream benchmark, a small distance behind our underlying technology, Umbra). But only for one specific problem! The sheer number of tools for each task, such as stream or graph analytics, allows users to choose not only based on the provided functionality (which is almost identical across the tools in each category), but on all other aspects of these tools. For example, they can choose their relational data warehouse solely based on hosting and billing preferences: a serverless system like Google’s BigQuery, or Snowflake, which comes in T-shirt sizes to tailor it to your data size, all without affecting analytical expressiveness.
So where is the catch? When developers first start using the modern data stack, they encounter few issues in their data pipelines, as most solutions are a great fit for the specific problem they aim to solve. But even simple changes in the analytical tasks, such as needing to find graph patterns in relational data or performing text similarity search on string columns, will require reorganizing the entire data stack, switching systems, or adding new ones – your existing tools will always be worse at the “new thing” than a system tailor-made for it. Every new system you add increases complexity and maintenance overhead, as it brings along its own requirements, connectors, and query languages. The result is more inconsistencies, more reliability issues, and more cost. This array of issues may sound a bit generic, but that is intentional: because the possible combinations of database systems are endless, no two problems we have seen in our talks with data engineers and analysts are quite the same. However, we have seen two meta-patterns shared across companies:
Systems promise to be one-stop, but fail to deliver
Many systems available today promise to be a jack of all trades to attract users who want to avoid the tool sprawl of the modern data stack altogether. Take MongoDB, for example. Having started out as a document store with web-scale capabilities, it now promises users transactions, time series, AI, analytics, and more. But if you take a look under the hood, few of those promises hold true for intensive use. Complex analytical workloads, for example, are hard to express, and even simple analytical aggregations quickly run into limits. Starting from intermediate result sizes as small as 100 MB, MongoDB can start spilling to disk. If you want all aggregates in a single document, i.e., a single result, you are limited by the 16 MB size limit for BSON documents. These limits may be fine, albeit tight, for hobby projects, but they often make MongoDB unsuitable for more complex analytics. This, in turn, forces users either to pull complex aggregation logic out of the database system into their application code, which is error prone, costly, and slow, or to add yet another system to their data stack, which is exactly what they tried to avoid in the first place.
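To make this concrete, here is a minimal sketch, assuming a hypothetical events table with one row per user interaction. In SQL, this rollup is an unremarkable aggregation; the equivalent MongoDB aggregation pipeline has to keep each $group stage within its roughly 100 MB in-memory budget (or spill to disk via allowDiskUse), and returning the whole result as a single document runs into the 16 MB BSON cap:

-- hypothetical rollup: daily active users and revenue per country
select event_date, country,
       count(distinct user_id) as daily_active_users,
       sum(revenue) as daily_revenue
from events
group by event_date, country
order by event_date, country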
This all makes sense: databases take years (or decades) to build, and trying to take a database built for one thing many years ago (NoSQL), and simply “attach” new use cases to it, is bound to run into limitations. Our belief is that building a true all-in-one database requires rethinking everything from scratch. It cannot be retrofitted onto a system built for a single use case.
Problems require multiple views in different systems, deteriorating data quality
The second pattern is often a consequence of the first. Once analytics is moved away from the primary system, complex tasks require additional tools. Data warehouses excel at relational analysis, but users today often want multiple views on their data. Take fraud detection for online payments as an example: while data warehouses can detect suspicious and fraudulent activity after the fact in nightly jobs, companies have a strong interest in detecting it before any money is transferred. For this, they need stream processing capabilities, which requires an additional tool. And once they have found one malicious account, they want to hunt for potential accomplices, which is most easily expressed as clique-detection queries in a graph system. This forces architects and engineering teams to orchestrate pipelines across multiple systems, which is not only brittle but limited in expressiveness, forcing them to pull logic into the application layer. Furthermore, orchestrating multiple systems increases the likelihood of inconsistencies across the different views of the data, which can severely impact the quality of cross-system insights.
With a modern, truly general-purpose system, changes to the workload, whether in complexity or size, should cause no issues at all. Such a system would allow data to live in one place, be it the cloud or local storage, and know how to handle this data efficiently. With it, running new workloads on a new combination of your data, or with a new view on it, such as a graph view on relational data, requires no new connectors, no data transfers or transformations, no new tools, and definitely no re-implementation of database functionality in user logic.
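As a sketch of what that means in practice, assuming a hypothetical payments table (sender, receiver, amount) and one already-flagged account, the accomplice hunt from the fraud example can be phrased as a recursive SQL query over the relational data the warehouse already holds, with no separate graph system involved:

-- follow outgoing payments from a flagged account for up to three hops
with recursive reachable (account, hops) as (
    select receiver, 1
    from payments
    where sender = 'flagged_account_42'   -- hypothetical starting point
  union
    select p.receiver, r.hops + 1
    from payments p
    join reachable r on p.sender = r.account
    where r.hops < 3
)
select distinct account from reachable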
Why is everyone still talking about PostgreSQL?
Exactly this elegance and simplicity of one-stop solutions is the reason why, even with the dominant role of the modern data stack in today’s data landscape, PostgreSQL is still on everyone’s mind. Despite its limitations and performance bottlenecks, people still perceive it as eating the database world, and it remains the database most used and loved by developers. However, when people talk about PostgreSQL, they are increasingly not talking about the core system anymore, but rather about the possibilities of a true general-purpose single-stop solution and about its ecosystem of extensions and connectors. Because of PostgreSQL’s past and present popularity, there are few systems and programming languages that do not offer connectors to it. This has resulted in the notion that the most defining part of PostgreSQL is no longer the query engine, but rather its protocol.
If the defining factor is its protocol, why do so many new database vendors still start out with PostgreSQL’s codebase? Some claim that the PostgreSQL engine at its core is fine, and that it only needs one component changed or one extension added to fix the bottleneck preventing wider adoption. Take AlloyDB, for example. Google put undoubtedly brilliant engineers on the task of speeding up analytical query processing in PostgreSQL by removing bottlenecks in the row-major query engine. And they achieved just that, boasting a respectable 117x speedup over regular PostgreSQL. But, to their credit, they are very transparent about which queries this speedup can be expected for: simple filter-and-aggregate ones. Just adding group by keys and non-equality filters dampens the speedup to 20x. Throwing joins into the mix reduces the speedup to just 2.6x for a query that is still on the simpler side of what data analysts see daily:
select sum(lo_revenue) as revenue
from lineorder
left join date on lo_orderdate = d_datekey
where d_year <= 1997 and lo_discount between 1 and 3
and lo_quantity < 25
Maybe the resulting speedups are enough for the use case they had in mind in the first place, but this shows that simply removing one bottleneck will not be enough for all purposes. The only way to truly remove all bottlenecks and leverage all the performance that hardware engineers have worked so hard to achieve is a fresh start.
This brings us back to the question: if you won’t be able to eliminate all bottlenecks, why is everyone still starting out with PostgreSQL? The answer is simple, albeit a bit sad. Building a database system from scratch is a long and hard endeavor. Believe us, we know. And few are as lucky as we are to work on this task in the comfort of government-funded research, without having to think about generating revenue early on to keep investors happy. For everyone else, the PostgreSQL code base is a great accelerator to get to market quickly with improvements for one use case. But to tackle them all, you must start over.
Introducing CedarDB
To take advantage of everything that the database world has learned in the past few decades and build a system that is truly all in one, you can’t tack functionality onto existing systems: you need to start from scratch. Starting from scratch means keeping what’s best about PostgreSQL, its wire protocol and ecosystem, and changing the rest. To saturate the hundreds of CPU cores, hundreds of gigabytes of main memory, and NVMe SSDs available today, all components of PostgreSQL and other widely used tools have to be rethought entirely. With CedarDB, we did exactly that: CedarDB is a relational database system created to be that original knight in shining armor, one that can truly do everything.
Contrary to what marketing pieces would have you believe, the most important part of a general-purpose system is not having as many use-case-specific functions and operators as possible, but rather building a solid base layer. Despite their different appearances, the tools in the modern data stack perform the same basic tasks: loading data from storage and persisting it again, filtering data, aggregating it, and combining it in new ways (e.g., through joins or graph connection patterns). If you can build these components to be use-case agnostic and run them efficiently, then all you have to do to address new use cases is map the specific functionality onto those tasks. So while we could fill many pages describing the many technical innovations that make CedarDB the versatile and powerful general-purpose system that it is, and we certainly plan to do so in the future on this blog, we want to focus first on these core components.
Parallelization Framework
An important prerequisite for all these tasks is making use of modern multi-core CPUs through efficient parallelization. If any of the core tasks does not properly make use of all available CPUs, it immediately turns into a bottleneck for the whole system. Therefore, all algorithms, data structures, and tasks in CedarDB are designed to scale seamlessly to as many CPU cores as possible, so that we avoid the diminishing returns described by Amdahl’s law when adding more cores. The effect this has on execution becomes immediately clear when looking at CPU utilization during the execution of the same TPC-H query on PostgreSQL and CedarDB below.
Despite being configured to use as many CPU cores as it wants, PostgreSQL only makes use of 10 cores at a time. In contrast, CedarDB is able to utilize all CPU cores from the very first to the very last second of the query (even in cases where those are not the same second). Although important, parallel algorithms and data structures alone are not enough. The available CPUs also need to be distributed fairly between tasks. Our scheduler automatically adapts to new tasks coming in and scales the degree of parallelism of older tasks up or down as needed. This ensures that short-running and interactive tasks can complete quickly without starving complex long-running tasks.
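For reference, “configuring PostgreSQL to use as many CPU cores as it wants” roughly means raising its parallel query limits along these lines (illustrative values for a 64-core machine; even with these raised, the planner still decides per query how many workers to actually launch):

alter system set max_worker_processes = 64;             -- requires a restart
alter system set max_parallel_workers = 64;
alter system set max_parallel_workers_per_gather = 32;
alter system set parallel_setup_cost = 0;               -- nudge the planner towards parallel plans
alter system set parallel_tuple_cost = 0;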
Execution engine
In theory, achieving 100% CPU utilization as in the htop view above is easy: a simple while (true) {;} will do the trick. Doing it in a meaningful way to filter, aggregate, and combine data, on the other hand, is a completely different story. The vast majority of analytical systems in the modern data stack today interpret queries, deciding what to do with every data point at every step of the query at runtime. For a simple join, an interpreting system will look up the location of the values to compare within the data point, then their respective data types, then the kind of comparison, e.g., equality, and only then evaluate the join condition. Each of these steps involves at least one branch in the code, leading to many branches before the actual meaningful comparison is performed.
Aside from the final value comparison, the result of each of these steps can be inferred from the query alone. The CPU, however, can never be certain of them, which limits its ability to pipeline instructions. By compiling specialized code for each query that already includes this information, CedarDB is able to reduce branches during execution by orders of magnitude compared to systems like PostgreSQL. Even modern vectorized engines, like DuckDB, require 5x more branches to execute the same query.
Compiling code for queries also allows us to embed and optimize user- or use-case-specific logic deep in the query engine with ease. In addition to this advantage in code structure, our data-centric query processing approach avoids unnecessary materialization of tuples between operators, keeping them in CPU registers for as long as possible and thereby sidestepping delays otherwise introduced by cache or memory stalls. Combined, these techniques allow us not only to fully utilize CPUs, but also to achieve more query progress in less time while doing so.
Storage Engine
Deployment
Finally, a true general-purpose system should make your life easier not only when working with data, but also when working with the system itself. While systems like PostgreSQL require careful tuning to both the query workload and the underlying hardware to achieve their best performance, intelligent scheduling, algorithm selection, and data-centric code generation allow us to eliminate the need to tune for the workload. Tuning for the specific hardware is also no longer necessary with CedarDB: the operating system knows all relevant properties of the available hardware, giving the database system everything it needs to tune itself. After all, the database system knows the demands of its own algorithms best.
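For contrast, below is a sketch of the hardware- and workload-specific tuning a PostgreSQL deployment typically involves. The values are purely illustrative, here for a machine with 128 GB of RAM and NVMe storage, and every one of them shifts as the hardware or workload changes:

alter system set shared_buffers = '32GB';             -- buffer pool, sized to available RAM
alter system set effective_cache_size = '96GB';       -- planner hint for the OS page cache
alter system set work_mem = '256MB';                  -- per sort/hash operation
alter system set maintenance_work_mem = '2GB';
alter system set random_page_cost = 1.1;              -- SSDs, not spinning disks
alter system set effective_io_concurrency = 200;      -- deep NVMe queues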
Of course, users can still limit both CPU and buffer manager memory consumption if they want CedarDB to run alongside other applications on the same hardware without competing for all resources. But in contrast to PostgreSQL, such tuning is entirely optional, and CedarDB will deliver the best performance out of the box, whether you are running on a Raspberry Pi or a server with hundreds of cores.
Integrating new use cases
With these building blocks in place, enabling new use cases and views on the data does not require you to fight against the system, but rather lets you work with it. These components have allowed us to work with semi-structured data as if it were fully structured, to outperform both stream and graph processing engines in their respective domains, and to integrate arbitrarily complex user code directly into queries without slowing down existing components or bloating the code base. All these use cases deserve an extensive deep dive in their own right and are therefore outside the scope of this blog post. However, keep a keen eye on this blog if you want to learn more about how we pulled that off.
In the meantime, if you want to experience how a modern single-stop solution for data processing with support for PostgreSQL’s SQL dialect and tools can make your life easier, why not join our waitlist and be among the first to get your hands on CedarDB!