July 24, 2025 · 9 minutes

What It Takes to Get a Research Project Ready for Production

A year ago, we started CedarDB as a spin-off from the university research project Umbra. This post talks about what it takes to get a research project ready for production workloads and how building for industry is different from building for academia.

Christian Winter

At CedarDB, we set out to bring the fruits of the highly successful Umbra research project to a wider audience. While Umbra has undoubtedly always had the potential to be the foundation of a highly performant production-grade database system, getting a research project ready for production workloads and building a company at the same time is no trivial task.

When we launched a year ago, we were still figuring out the differences between building a research system at a university and building a system for widespread use. Since then, we have learned a lot.

In this post, we explore how we transformed our research into a publicly available community edition, what it took to get it ready for a public release, and what we learned along the way. If you’d like to jump right in and see how far we’ve come, you can use the link below to get started with our Community Edition and continue reading later.

Transitioning from a research mindset

The first major change we saw when transitioning out of the university was in the focus of our work. Research focus often lies outside the traditional database components, trying to make the most of new hardware trends and deployment environments [1, 2, 3], data models [4, 5], and even programming languages [6] and CPU architectures [7, 8].

Of course, there was also plenty of research on the more traditional components, such as the query optimizer [9, 10], transaction processing [11], more efficient database operators [12, 13, 14], and storage backends [15, 4], some of which we have already explored in previous blog posts. But the focus of these works was usually performance. Only a small part of our research addressed the other properties people expect from a database, such as reliability, fault tolerance, and compatibility with popular tools.

The “move fast and break things” mantra may be great for many startups, but it is not the best mantra when you are entrusted with user data. Thus, we turned our focus to stability. Luckily, the TUM research group already had a strong focus on doing things right and not cutting corners: contributions to the main branch of Umbra required passing a strict test suite, making sure that all corner cases and data types are handled correctly and that the code is well structured and documented. This no doubt made our transition easier. But it still did not make it easy.

That’s why a lot of our work has focused on expanding the test suite, making sure that every bug we uncover is caught immediately should it ever resurface.
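To give a flavor of what such a SQL-based regression test pins down, here is a simplified sketch with made-up table names; it is not CedarDB’s actual test harness or file format:

```sql
-- Simplified sketch of a SQL-level regression test: exercise a corner case
-- and pin down the expected result so it cannot silently regress.
CREATE TABLE readings (v smallint);
INSERT INTO readings VALUES (32767), (1), (NULL);

-- sum() must widen instead of overflowing, and aggregates must ignore NULLs.
SELECT sum(v) FROM readings;              -- expected: 32768
SELECT count(v), count(*) FROM readings;  -- expected: 2, 3
```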

Lines of code in unit tests in CedarDB today and in Umbra at the time of spinoff

We more than doubled our SQL-based tests and increased our internal unit tests by 10%, without adding much new functionality. Besides new unit tests, we also test for different things now. At university, an Umbra instance that ran for more than a day as part of an experiment was already considered a long-running test. Most benchmark setups focus on individual benchmarks and queries, often restarting the database between runs to test different configurations and algorithms. In production, of course, database systems run for years without interruption. We have already come a long way here, so much so that we now use CedarDB internally for months at a time to track and store internal metrics, without any incidents.

Building to solve real-world problems

Although our main focus was on becoming stable and reliable, our goal was never to be just another SQL database. So, while a lot of our work has been focused on becoming a more mature system, we did not stop building new functionality to solve today’s problems. And for that task, being a company instead of a research group helped a lot. Unfortunately, database research is often decoupled from real-world problems. That is not necessarily the fault of researchers. Sadly, not many companies openly talk about their data problems and the limitations of their current data stacks. That leaves researchers little choice but to scout for potential problems on their own: carefully reading company publications, listening to presentations about current market trends, and estimating which problems people might have before preemptively solving them.

While it was always hard for us as researchers to get companies to talk about problems in their data pipelines, it becomes surprisingly easy when you are working on a usable solution to those problems. This allowed us to work on features people actually want and need. And because we took great care to keep CedarDB’s internal architecture modular and future-proof, building these features was rather quick. So over the past months, we added, among other things, the often requested upsert support required by many CDC tools, as well as As-of Joins for more efficient time series operations, without much hassle. For As-of Joins, we could even reuse previous Umbra research on band joins [16].
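To give a rough idea of what these two features enable, here is a minimal sketch with made-up table names. The upsert uses PostgreSQL’s standard ON CONFLICT clause; the As-of Join is shown in the common ASOF JOIN style, so check our documentation for the exact form:

```sql
-- Upsert: insert a new reading, or update the existing row if the key is
-- already present. This is the standard PostgreSQL ON CONFLICT clause that
-- many CDC tools emit.
INSERT INTO sensor_state (sensor_id, last_value, last_seen)
VALUES (42, 21.5, now())
ON CONFLICT (sensor_id)
DO UPDATE SET last_value = EXCLUDED.last_value,
              last_seen  = EXCLUDED.last_seen;

-- As-of Join: match each trade with the most recent quote at or before its
-- timestamp (illustrative syntax; see the CedarDB docs for the exact form).
SELECT t.trade_id, t.ts, q.bid, q.ask
FROM trades t
ASOF JOIN quotes q
  ON q.symbol = t.symbol AND q.ts <= t.ts;
```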

What it means to be PostgreSQL compatible

As great as a clean internal structure is for building new features, it could not fully shield us from the technical debt of other systems. If you want to be a PostgreSQL-compatible system, you have to adhere to at least some of its design choices, even if they do not seem to make any sense. And while in research it is fine to interact with the database through hand-crafted tools supporting exactly what you need, it is less so for a production-ready database system. So, besides stability, we have spent and are still spending a lot of time supporting as many PostgreSQL tools and features as possible out of the box. I won’t get into details here, but check out our separate post on what it takes to be PostgreSQL compatible if you want to learn more.

At times, being PostgreSQL-compatible can feel like an uphill battle: implementing one PostgreSQL function may reveal the need for three more. Nevertheless, it is worth the effort. PostgreSQL is by far the most popular database ecosystem, and even though some of its internal structures and functions seem outdated or just plain wrong, it gets many things right. And the amplification goes both ways: while building one new function sometimes uncovers three more to build, ensuring compatibility with one tool also ensures compatibility with countless others. And so, more often than not, new tools that we try will just connect to CedarDB without error.

Thinking about user experience

Another major difference between a research project and a company is that for the latter, the vast majority of users are not developers of the system. For the Umbra research project, almost all major users knew the codebase by heart. And while I personally will always prefer seeing what goes wrong in a gdb session over reading an error message, the average user will see that differently. To that end, we vastly improved the log and error messages we print, as well as the output of other inspection functionality, like our new flat EXPLAIN plan layout. There is still a long way to go until all our log and error messages are fully polished, so if you encounter one that confuses you, please open an issue in our issue tracker.
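Inspecting a plan works just as it does in PostgreSQL; the query below is only an illustration with made-up table names, and the flat layout itself is best seen in a live session:

```sql
-- Plans are inspected with plain EXPLAIN, as in PostgreSQL. The new layout
-- presents the plan in a flat, easier-to-read form rather than a deeply
-- nested tree (output omitted here).
EXPLAIN SELECT c.name, sum(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
```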

As researchers, we also cared a lot about the exact specs of the hardware and operating system running our system, making sure to configure it exactly for our desired behavior. Of course, not everyone is as passionate about the exact latency characteristics of SSD fsync operations as we are. So, while Umbra had a lot of manual tuning knobs for hardware characteristics, CedarDB now turns these knobs for you, automatically adapting to main memory and CPU characteristics, as well as choosing the right transaction mode for your storage. All this happens transparently to the user, no matter which hardware specs and OS version you have, or whether you run CedarDB standalone or in a Docker container.

Automatically adapting to the underlying hardware is just one part of our effort to offer hassle-free deployment. Not every Linux kernel has the same set of features available, and while we always kept Umbra aligned only with the latest LLVM and Ubuntu versions, we are now encountering users with Linux kernels older than Umbra itself. And, in contrast to the research world, you cannot compile a database system on every machine that it is going to run on. So we cannot rely on detecting the available features, such as io_uring support and SSD write-through behavior for efficient I/O, at compile time; we need to detect them at runtime. All this is of course transparent to the user, whether you use our local binaries through our install script or our Docker image.

Shifting our focus to user experience is by no means limited to error messages and hassle-free deployment. You often want to push a research system to the absolute limits of both the system and the underlying hardware to investigate what is possible when you use resources to their fullest extent. So, for Umbra, we happily accepted the risk of being killed by the OOM killer just to make sure we could use every byte of available memory. For a production-ready database system, the story is different. So instead of relying on the OOM killer for memory management, we now aggressively track memory consumption, failing individual queries when they exceed the available memory instead of crashing the whole system.

Keeping forward momentum

While we have achieved a lot in the past year, we are not slowing down. Much of the development during the last year focused on features under the hood of CedarDB, aspects not directly visible to our users, but we are also already working on exciting user-visible features to roll out in the coming months and beyond.

Besides further extending PostgreSQL compatibility and improving the user experience of CedarDB, this includes, among other things, scaling storage to cloud object stores such as S3, support for new file formats such as Parquet, and improved monitoring support.

We are also working hard to support the most challenging workloads and environments, with high availability, efficient replication from other database systems to CedarDB, and efficient spooling for your most complex queries on small machines. If you want to keep up to date with our progress, make sure to subscribe to our newsletter.