April 2, 2025

Generating Tailored and Efficient Code for Every Query

Learn the basics of code generation, which is one of the secrets behind CedarDB's performance. CedarDB creates custom machine code for every query. This keeps data in CPU registers as long as possible and minimizes unnecessary data transfers.


Fast Compilation or Fast Execution: Just Have Both!

At CedarDB, we believe in high-performance computing. And one of the easiest tricks for high performance is to do only what you really need to do: reduce data movement and eliminate unnecessary branches. But how can you be sure you are only doing what you need to do, especially when you want to be flexible and configurable at runtime?

Basically, you want to tune the code based on as much information as you can get beforehand. If you’re a programmer tasked with optimizing a snippet, you’d probably start by gathering as much information as you can about the data set. Then you would look at data alignment and loops to remove any unnecessary checks and pre-calculate or cache as much as possible. Finally, you’d tailor the code as much as possible to your challenge so that it does only one thing, but does it extremely well. Unfortunately, as developers, we cannot just write code that does one thing, because there are users. Users demand interactivity, and in our case, they want to decide how they access the data. So isn’t there a way to have your cake and eat it too?

That’s where just-in-time compilation comes to the rescue. It’s an even older idea than B-Trees and has had its revival both in our field and for programming languages in general. The core idea is to let code generate code just before you need it. For a database, this means generating code only after receiving the SQL query from the user. We’ll first take a step back to code generation (without the focus on just-in-time) and then explain hands-on how to write your own C code generator.

Code Generation is Everywhere

At first glance, code generation may sound like a niche topic, but you have probably used code generation today without thinking about it. Browser engines, like Chrome’s V8, use code generation under the hood to compile even the most bloated JavaScript into something the processor can hopefully blaze through. Since the same V8 engine powers server-side JavaScript, code generation is a regular part of the job for most web devs. And certainly every vibe coder is a code generation expert, but ChatGPT’s latency and unreliability make it impractical for use in a database system.

Back on topic, every single database system we know of uses code generation in at least two steps. First, in the parsing step, where it transforms SQL into a machine-readable syntax tree. Typically, lexers and parsers are generated by parser generators, which take a grammar and generate code in a language of your choice to process the input. Second, when reading files like Parquet or interfacing with other systems. The required metadata usually comes with a format specification, such as Protobuf or, in the case of Parquet, Thrift. The Thrift compiler then generates code to read and write the metadata from the binary format into structs.

And if you think about it, what you are doing with Thrift or Protobuf is conceptually not too different from a database processing rows. You receive a lot of messages with the same structure in a format which is known beforehand. Then you perform a set of transformations and computations, depending on the contents and the user’s requests. So why don’t we just do the same as those frameworks, where everyone seems to agree that the best way is to generate code? The obvious problem is that we now have to do it at runtime, also called just-in-time, when the user requests it, instead of beforehand when compiling the system. And that’s exactly what you will do at the end of this article. But first, let’s take a step back and look at how we would program something like this by hand.

Parsing Binary Data by Hand

Since our running example will focus on code generation and not the parsing, we will use a fairly simple message format. Each message contains four different unsigned integers and is densely packed (i.e., contrary to usual struct layout in C, we don’t insert padding bytes between struct members).

struct Record {
	uint64_t id;
	uint32_t intVal;
	uint64_t longVal;
	uint32_t intVal2;
};

We encode all fixed-size data types in their literal representation. Hence, a uint64_t will just be 8 bytes on the wire, and a uint32_t will be 4 bytes.

Thus each struct gets serialized and packed like shown in the image below.

Memory layout of the record with 4 Byte uint32 and unaligned 8 Byte uint64.
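To make the wire format concrete, here is a minimal sketch of how one record could be packed into such a buffer. The serialize and appendBytes helpers are our own scaffolding for illustration, not part of any library:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct Record {
    uint64_t id;
    uint32_t intVal;
    uint64_t longVal;
    uint32_t intVal2;
};

// Append the raw bytes of one value to the buffer, without any padding
template <typename T>
void appendBytes(std::vector<unsigned char>& buf, T value) {
    unsigned char tmp[sizeof(T)];
    std::memcpy(tmp, &value, sizeof(T));
    buf.insert(buf.end(), tmp, tmp + sizeof(T));
}

// Pack one record into its densely packed 24-byte wire representation
std::vector<unsigned char> serialize(const Record& r) {
    std::vector<unsigned char> buf;
    appendBytes(buf, r.id);      // 8 bytes
    appendBytes(buf, r.intVal);  // 4 bytes
    appendBytes(buf, r.longVal); // 8 bytes
    appendBytes(buf, r.intVal2); // 4 bytes
    return buf;
}
```

Note that sizeof(Record) is typically 32 bytes on a 64-bit ABI because of padding, while the wire representation is only 24 bytes.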

Now with an Array of Structs (or row-major) format, we have a lot of these records densely packed one after the other. Assume we want to sum the second column. If you, as a programmer, were tasked with decoding such an array of structs, you’d probably come up with C++ code along these lines.

// memcpy avoids the undefined behavior of unaligned reads
template<typename T>
T readUnaligned(const std::byte* ptr) {
    T value;
    std::memcpy(&value, ptr, sizeof(T));
    return value;
}

// Hand-coded sum function
uint64_t sumRecords(const std::byte *ptr, uint64_t elements) {
	uint64_t sum = 0;
	while (elements--) {
	    ptr += sizeof(uint64_t); // Skip id
	    sum += readUnaligned<uint32_t>(ptr); // Add intVal to sum
	    ptr += sizeof(uint32_t); // Consume intVal
	    ptr += sizeof(uint64_t); // Skip longVal
	    ptr += sizeof(uint32_t); // Skip intVal2
	}
	return sum;
}

This code receives the number of elements and a pointer to the data buffer. It extracts the value to sum and skips all the other fields. So, this code gets everything done nicely and runs as fast as you’d expect, since the compiler takes care of optimizing it. On my laptop, this results in a read and sum performance of roughly 60 GB/s.
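As a quick sanity check, we can pack a couple of records by hand and verify the result. The appendRecord helper below is our own test scaffolding, not part of the article's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

template <typename T>
T readUnaligned(const std::byte* ptr) {
    T value;
    std::memcpy(&value, ptr, sizeof(T));
    return value;
}

// Same control flow as the hand-coded sum above
uint64_t sumRecords(const std::byte* ptr, uint64_t elements) {
    uint64_t sum = 0;
    while (elements--) {
        ptr += sizeof(uint64_t);             // Skip id
        sum += readUnaligned<uint32_t>(ptr); // Add intVal to sum
        ptr += sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uint32_t); // Rest of the record
    }
    return sum;
}

// Test scaffolding: append one densely packed record to the buffer
void appendRecord(std::vector<std::byte>& buf, uint64_t id, uint32_t intVal,
                  uint64_t longVal, uint32_t intVal2) {
    auto push = [&](const void* p, std::size_t n) {
        auto b = static_cast<const std::byte*>(p);
        buf.insert(buf.end(), b, b + n);
    };
    push(&id, 8);
    push(&intVal, 4);
    push(&longVal, 8);
    push(&intVal2, 4);
}
```

With two records holding intVal 10 and 32, sumRecords(buf.data(), 2) returns 42.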

Interpreted Approach

Transferring this to the database use case, you’d now be insanely fast as long as the query does not change. However, it’s obviously not feasible to write every query by hand in C++. You’d need a plethora of programmers to keep up with the query demand, and you’d still have latency problems since coding takes time. So, let’s do the next best thing and handle different input formats in the function. It should work with different combinations of uint32_t and uint64_t. Furthermore, the user can specify the column to sum. Normally, we’d now have some kind of input language to specify the format, or SQL in the case of a DBMS. Here, to keep things simple, we’ll directly use a vector with our types in C++ and skip the parsing.

// All supported data types
enum class Type {
	UINT32,
	UINT64,
};

// This can change for every query
// The record format
vector cols = { Type::UINT64, Type::UINT32, Type::UINT64, Type::UINT32 };
// The column we want to sum (zero-indexed)
int colIndex = 1;

With this information at hand, we know everything necessary to parse the array of structs in the binary data and sum only the requested column. The code follows a similar control flow to the hand-written code above, looping over all the rows. But instead of doing something hardcoded for each row, we now decide for each column how to process it. Based on the information in cols and colIndex, we advance over all the columns and only add one column to our central sum.

uint64_t sumRecords(const std::byte *ptr, uint64_t elements, const vector<Type>& cols, int colIndex) {
    uint64_t sum = 0;
    // Loop through all elements
    while (elements--) {
        // Loop through all columns
        int colNum = 0;
        for (const auto &c: cols) {
            // Decide based on the column type
            switch (c) {
                case Type::UINT64:
                    if (colNum == colIndex) {
                        sum += readUnaligned<uint64_t>(ptr);
                    }
                    ptr += sizeof(uint64_t);
                    break;
                case Type::UINT32:
                    if (colNum == colIndex) {
                        sum += readUnaligned<uint32_t>(ptr);
                    }
                    ptr += sizeof(uint32_t);
                    break;
            }
            // Increment column count
            colNum++;
        }
    }
    return sum;
}

Running this code leads to the same result as before, which is great, but it comes at the cost of being 5x slower. This slowdown stems from bloating the main loop. Instead of a few additions, we now have a lot of compare instructions. At least these comparisons lead to the same results for every row, so let’s thank the branch predictor that it is only a 5x slowdown and not even worse. However, if the comparisons do the same thing for every row, why can’t we hoist them out of the inner loop? And that’s exactly what code generation does.

Code Generation to the Rescue

Generating code combines the flexibility of the interpreted approach with the performance of the hand-written code. The biggest change is that we now have a two-stage process. First, we generate the code for execution and build it. Second, we run the code. For generating the code, we take the same program structure from above. But instead of running the code, we only store it in a string.

Generating C Code

string generateColumnCode(const vector<Type>& cols, int colIndex) {
    // Setup a stringstream to store the C code
    stringstream ss;
    // Loop over all columns
    int colNum = 0;
    for (auto &c: cols) {
        switch (c) {
            case Type::UINT64:
                if (colNum == colIndex) {
                    ss << "sum += *((uint64_t *) ptr);\n";
                }
                ss << "ptr += sizeof(uint64_t);\n";
                break;
            case Type::UINT32:
                if (colNum == colIndex) {
                    ss << "sum += *((uint32_t *) ptr);\n";
                }
                ss << "ptr += sizeof(uint32_t);\n";
                break;
        }
        colNum++;
    }
    return ss.str();
}

For generated code, we prefer to use C instead of C++ because compiling C code is much faster. At CedarDB, we even ditch C for our own programming language, which is even faster and gives us more control over database-specific tasks like overflow checking. But that is a story for another day. As we rely on C for this example, the generated code doesn’t use the modern C++ features that we used in the first example. This results in the following code for the inner loop. 1

ptr += sizeof(uint64_t);
sum += *((uint32_t *) ptr);
ptr += sizeof(uint32_t);
ptr += sizeof(uint64_t);
ptr += sizeof(uint32_t);

With the core part ready, we only need to wrap it in a loop and then in a function. That’s exactly what the next snippet does. First, we need to add a bit more to our code from above to make it a stand-alone C file. We need stdint.h for our data types and write out the desired function signature. The signature is the same as the hand-coded one from the beginning. Next we write out the loop and close the file. Thus, we now have a complete C implementation of our sum function, named sum, ready at sum.c.

void generateCCode(const vector<Type> &cols, int colIndex) {
    string filePath = "sum.c";
    string funName = "sum";

    // Open a file stream for the program
    ofstream fs;
    fs.open(filePath, ios::binary);

    // Add stdint for uint32_t, ...
    fs << "#include <stdint.h>\n";
    // Write function signature
    fs << "uint64_t " << funName << " (const char* ptr, uint64_t elements) {\n";
    // Write the main loop
    fs << "uint64_t sum = 0;\n";
    fs << "while (elements--) {\n";
    fs << generateColumnCode(cols, colIndex);
    fs << "}\n";
    // Return the sum and close the function
    fs << "return sum;\n";
    fs << "}\n";
    // Write the file and close it
    fs.flush();
    fs.close();
}

Compiling and Linking the Code

With the final C code at hand, the only thing left to do is getting it running within our program. On a conceptual level, we just have to compile the program to machine code and then make it known to the program we are running. So let’s take a look behind the curtain and explain how the magic works.

First, we compile the code using the C compiler cc. It generates natively optimized (O1) code 2 and names the output sumit.so. We named it .so since it is a shared object (or library) rather than a standalone executable, as there is no main function present. However, it contains our desired function, and we need to connect it to our program, which is where dlopen helps. It loads the object into our program and directly resolves all the contained symbols (RTLD_NOW). So we are almost ready to call our function. The only thing missing is to get the symbol we want with dlsym and return it as a function pointer with the right type.

The following code snippet does everything you need, plus a little bit of error reporting; otherwise, it would only be 3 lines. After each call, we check whether it was successful and print the result of dlerror otherwise. In the final step, we cast the raw function pointer to the type sumFun, so that the compiler knows how to call it. And that’s it. The code works on both Linux and macOS out of the box.

// Type of the sum function (like handwritten from above)
using sumFun = uint64_t (*)(const char *, uint64_t);

// Do the magic
sumFun compileAndLinkCCode() {
    // Call the c compiler on the file
    string command = "cc -shared -fPIC -march=native -O1 -o sumit.so sum.c";
    if (system(command.c_str()) != 0)
        return nullptr;

    // Dynamic load the compiled shared object
    auto handle = dlopen("./sumit.so", RTLD_NOW);
    if (!handle) {
        cout << dlerror() << endl; // Print the error messages
        return nullptr;
    }

    // Get the function pointer of the symbol sum (our function) out of the library
    auto res = dlsym(handle, "sum");
    if (!res) {
        cout << dlerror() << endl;
        return nullptr;
    }

    // Cast to the right type and return
    return reinterpret_cast<sumFun>(res);
}

Running the code

Running the code is now simple and almost a one-liner. What is very nice about this approach is that we can clearly see a difference between structure and data. The compile function only needs information on the structure, like the columns cols and which column to sum up. The runtime function sumFun only gets information about the data, which it then processes at the same speed as the hand-coded implementation.

uint64_t sumRecordsCodeGen(const std::byte *ptr, uint64_t elements, const vector<Type>& cols, int colIndex) {
    // Generate the function
    generateCCode(cols, colIndex);
    // Compile and link the function
    auto sumFun = compileAndLinkCCode();
    // Run the function
    return sumFun(reinterpret_cast<const char*>(ptr), elements);
}
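If you want to try the whole pipeline without the record machinery, here is a stripped-down, self-contained sketch of the same three stages: generate C source, compile it with cc, and load the result with dlopen. The file names add2.c and add2.so are placeholders chosen for this demo, and it assumes a POSIX system with a C compiler on the PATH:

```cpp
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <dlfcn.h>

using addFun = uint64_t (*)(uint64_t);

addFun jitAdd2() {
    // Stage 1: generate the C source file
    std::ofstream fs("add2.c");
    fs << "#include <stdint.h>\n"
          "uint64_t add2(uint64_t x) { return x + 2; }\n";
    fs.close();
    // Stage 2: compile it into a shared object
    if (std::system("cc -shared -fPIC -O1 -o add2.so add2.c") != 0)
        return nullptr;
    // Stage 3: load the object and resolve the symbol
    void* handle = dlopen("./add2.so", RTLD_NOW);
    if (!handle)
        return nullptr;
    return reinterpret_cast<addFun>(dlsym(handle, "add2"));
}
```

On success, jitAdd2()(40) returns 42; on older Linux systems you may need to link with -ldl.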

Performance Numbers

Coming back to performance, let’s have a brief rundown of the numbers. We measured them on a MacBook Air. Using the record format and code we developed in this running example, we get the numbers shown below.

Comparison of Throughput with two different structs

Generating the data, which we just included for reference, took 1.7 seconds. The hand-coded and code-generated versions are equally fast, with a throughput of 65 GB/s. Just-in-time compiling the C code takes roughly 1 millisecond of the 64 ms total runtime. Performing all the extra checks in the hot loop clearly diminishes performance, as the interpreted variant shows: it handles one tuple at a time and has a throughput of 10 GB/s.

Another benchmark, which we skipped for brevity in this blog post, used strings (the code is attached below). The strings were encoded as an unsigned integer for the length, followed by the raw bytes of the string. This makes them variable length, so we can no longer race over the records at a fixed stride. That is a problem our German strings don’t have, since they are fixed size. In this case, we cannot auto-vectorize as easily and performance goes down. The tuple-at-a-time approach barely slows down, since it performs a lot of checks in the fast path anyway. This also shows that the faster the system gets, the more tuning potential becomes visible. And code generation is still more than twice as fast in this case.
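To illustrate why strings slow things down, here is a small sketch of decoding one such length-prefixed string. The readString helper is our own illustration, not taken from the benchmark code:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Read one length-prefixed string and advance the cursor past it.
// The pointer bump depends on the data itself, so the loop loses its
// fixed stride and the compiler can no longer vectorize it as easily.
std::string readString(const unsigned char*& ptr) {
    uint32_t len;
    std::memcpy(&len, ptr, sizeof(uint32_t)); // 4-byte length prefix
    ptr += sizeof(uint32_t);
    std::string s(reinterpret_cast<const char*>(ptr), len);
    ptr += len; // data-dependent advance
    return s;
}
```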

Row Major Data is everywhere and feels natural

If you come from a vectorized engine or process Parquet files at scale, you’re probably ready to shout at us while reading: but there is columnar and vectorized data storage, and it solves all these problems! And yes, you are right: it exists and would work great for this toy example. However, as outlined in an extensive evaluation, compilation provides the best performance for OLAP and OLTP queries and outperforms vectorization for everything compute-heavy. Furthermore, we have next to no unnecessary intermediate representations and keep data as long as possible in CPU registers, exactly as a developer would do it. Meanwhile, we have also fixed all the shortcomings mentioned in the paper, namely: compile time, profiling, and adaptivity.

Code Generation in CedarDB

In this blog post, you got a look behind the curtain at why CedarDB is fast at answering arbitrary queries with code as good as if an experienced dev wrote it. However, we are not actually using C code generation in production. Our operators produce a custom intermediate representation which is similar to LLVM IR but tailored to our DBMS use case. This gives us the flexibility to compile to different backends, optimized for throughput, latency, or debuggability. In this blog post, we only covered the basics of the C backend, shown in yellow in the figure.

Comparison of Throughput and Latency of Compilation Backends (Source)

The figure shows throughput and latency for all our backends running the 22 queries from TPC-H. As you can see, the C backend has the highest throughput at the cost of the highest latency. Using LLVM for code generation drastically reduces query compilation time and achieves similar throughput to compiling C. Usually, we start out with our custom assembly emitter (ASM), which gives nearly 90% of the throughput in 1% of the compilation time, and switch to the LLVM binary once it is ready for long-running queries. But that’s something for another blog post.

So if you want to get hand-tuned query performance for your workload, give CedarDB a try. It does everything you read here and even more in the blink of an eye for every single query!

Join our waitlist!

Appendix

Benchmark Code


main.cpp

#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <span>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>
#include <dlfcn.h>

using namespace std;

// Helper class for measuring time
class ScopedTimer {
   public:
   // On construct, measure start
   ScopedTimer(string_view label, unsigned repeat = 1) : label(label), repeat(repeat),
                                                         start(chrono::high_resolution_clock::now()) {
   }

   // On destruct, measure end and print
   ~ScopedTimer() {
      auto end = chrono::high_resolution_clock::now();
      auto ms = chrono::duration_cast<chrono::milliseconds>(end - start).count() / repeat;
      cout << label << " took " << ms << " ms\n";
   }

   private:
   string_view label;
   unsigned repeat = 1;
   chrono::high_resolution_clock::time_point start;
};

struct Record {
   uint64_t id;
   uint32_t flags;
   double value;
   string name;
};

enum class Type {
   UINT32,
   UINT64,
   DOUBLE,
   STRING,
};

// unaligned reads would be undefined behavior
template <typename T>
   requires is_trivially_copyable_v<T>
T readUnaligned(const std::byte* ptr) {
   T value;
   std::memcpy(&value, ptr, sizeof(T));
   return value;
}

uint64_t parseRecordsHandCoded(const byte* ptr, uint64_t elements) {
   uint64_t sum = 0;
   while (elements--) {
      ptr += sizeof(Record::id);
      sum += readUnaligned<uint32_t>(ptr);
      ptr += sizeof(Record::flags) + sizeof(Record::value);
      auto nameLength = readUnaligned<uint32_t>(ptr);
      ptr += nameLength + sizeof(uint32_t);
   }
   return sum;
}

uint64_t parseRecordsSwitch(const byte* ptr, uint64_t elements, const vector<Type>& cols, int sumCol) {
   uint64_t sum = 0;
   while (elements--) {
      int colNum = 0;
      for (const auto& c : cols) {
         switch (c) {
            case Type::UINT64:
            case Type::DOUBLE:
               static_assert(sizeof(uint64_t) == sizeof(double));
               if (colNum == sumCol) {
                  sum += readUnaligned<uint64_t>(ptr);
               }
               ptr += sizeof(uint64_t);
               break;
            case Type::UINT32:
               if (colNum == sumCol) {
                  sum += readUnaligned<uint32_t>(ptr);
               }
               ptr += sizeof(uint32_t);
               break;
            case Type::STRING:
               auto nameLength = readUnaligned<uint32_t>(ptr);
               if (colNum == sumCol) {
                  sum += nameLength;
               }
               ptr += nameLength + sizeof(uint32_t);
         }
         colNum++;
      }
   }
   return sum;
}

using sumFun = uint64_t (*)(const char*, uint64_t);

sumFun parseRecordsCodeGen(vector<Type>& cols, int sumCol) {
   ofstream ss;
   ss.open("sum.c", ios::binary);

   ss << "#include <stdint.h>\n"; // For uint types
   ss << "#include <string.h>\n"; // For memcpy
   ss << "uint64_t sum (const char* ptr, uint64_t elements) {\n";
   ss << "uint64_t sum = 0;\n";
   ss << "while (elements--) {\n";
   int colNum = 0;
   for (auto& c : cols) {
      switch (c) {
         case Type::UINT64:
         case Type::DOUBLE:
            ss << "// 8-byte column\n";
            static_assert(sizeof(uint64_t) == sizeof(double));
            if (colNum == sumCol) {
               ss << "uint64_t val;\n";
               ss << "memcpy(&val, ptr, sizeof(uint64_t));\n";
               ss << "sum += val;\n";
            }
            ss << "ptr += sizeof(uint64_t);\n";
            break;
         case Type::UINT32:
            ss << "// 4-byte column\n";
            if (colNum == sumCol) {
               ss << "uint32_t val;\n";
               ss << "memcpy(&val, ptr, sizeof(uint32_t));\n";
               ss << "sum += val;\n";
            }
            ss << "ptr += sizeof(uint32_t);\n";
            break;
         case Type::STRING:
            ss << "// String column\n";
            ss << "uint32_t nameLength;\n";
            ss << "memcpy(&nameLength, ptr, sizeof(uint32_t));\n";
            if (colNum == sumCol) {
               ss << "sum += nameLength;\n";
            }
            ss << "ptr += nameLength + sizeof(uint32_t);\n";
      }
      colNum++;
   }
   ss << "}\n";
   ss << "return sum;\n";
   ss << "}\n";
   ss.flush();
   ss.close();

   string command = "cc -shared -fPIC -march=native -O1 -o sumit.so sum.c";

   if (system(command.c_str()) != 0)
      return nullptr;

   auto handle = dlopen("./sumit.so", RTLD_NOW);
   if (!handle) {
      cout << dlerror() << endl;
      return nullptr;
   }

   auto res = dlsym(handle, "sum");
   if (!res) {
      cout << dlerror() << endl;
      return nullptr;
   }

   return reinterpret_cast<sumFun>(res);
}

// Helper to fill the static size data
template <typename T>
   requires is_trivially_copyable_v<T>
span<byte>::iterator writeAndAdvance(span<byte>::iterator it, T value) {
   auto ptr = &*it;
   memcpy(ptr, &value, sizeof(T));
   return it + sizeof(T);
}

// Fill buffer with random numbers
uint64_t generateRandomPackedData(span<byte> buffer) {
   // Initialize random
   mt19937 gen(42);
   uniform_int_distribution<uint32_t> flagDist(0, 10);
   uniform_real_distribution<double> valueDist(-1e6, 1e6);
   uniform_int_distribution<uint32_t> strLenDist(3, 20);
   uniform_int_distribution<int> charDist('a', 'z');

   uint64_t els = 0;
   auto it = buffer.begin();
   constexpr auto rowSize = sizeof(uint64_t) + sizeof(uint32_t) + sizeof(double) + sizeof(uint32_t);

   while (true) {
      // Get length of variable component
      uint32_t nameLen = strLenDist(gen);
      // Break if it exceeds the buffer len
      if (it + rowSize + nameLen > buffer.end()) break;

      // Write fixed-sized values
      it = writeAndAdvance(it, els++);
      it = writeAndAdvance(it, flagDist(gen));
      it = writeAndAdvance(it, valueDist(gen));
      it = writeAndAdvance(it, nameLen);

      // Write the variable length string
      for (uint32_t i = 0; i < nameLen; ++i) {
         *it++ = static_cast<byte>(charDist(gen));
      }
      }
   }

   return els;
}

int main() {
   vector cols = {
      Type::UINT64, Type::UINT32, Type::DOUBLE, Type::STRING};
   int sumCol = 1;

   // Reserve a huge chunk of memory
   constexpr size_t BUFFER_SIZE = 1ull << 32; // 4 GB
   auto mem = static_cast<byte*>(malloc(BUFFER_SIZE));
   span buffer{mem, BUFFER_SIZE};

   cout << "Running with " << buffer.size() / (1024.0 * 1024.0) << " MB of data\n";
   // Generate sample data
   uint64_t els = 0;
   {
      ScopedTimer timer("Data Generation"sv);
      els = generateRandomPackedData(buffer);
   }

   // Repeat the measurements 10x
   constexpr unsigned repeat = 10;

   // Run the hand-coded variant
   uint64_t sum = 0;
   {
      ScopedTimer timer("Hand-Written"sv, repeat);
      for (int i = 0; i < repeat; ++i) {
         sum += parseRecordsHandCoded(buffer.data(), els);
      }
   }
   cout << "CHECKSUM:" << sum << endl;

   // Run the switch-case interpreted variant
   sum = 0;
   {
      ScopedTimer timer("Interpreted"sv, repeat);
      for (int i = 0; i < repeat; ++i) {
         sum += parseRecordsSwitch(buffer.data(), els, cols, sumCol);
      }
   }
   cout << "CHECKSUM:" << sum << endl;

   // Run code-generation
   sumFun sumFun;
   {
      ScopedTimer timer("Code Generation"sv, repeat);
      sumFun = parseRecordsCodeGen(cols, sumCol);
   }
   sum = 0;
   {
      ScopedTimer timer("Code Execution"sv, repeat);
      for (int i = 0; i < repeat; ++i) {
         sum += sumFun(reinterpret_cast<char*>(buffer.data()), els);
      }
   }
   cout << "CHECKSUM:" << sum << endl;

   free(mem);
}

  1. Of course, you’d need the readUnaligned load here as well. We’ll skip it for readability, and the uint32 is aligned here anyway. ↩︎

  2. In practice, we use a handpicked set of optimizations instead of the -O flag. Usually, -O3 is not beneficial here since it increases compile time quite a bit. ↩︎