September 24, 2024

The Hidden Cost of Data Movement

Moving data between disk, memory, caches, and CPU registers is one of the most critical paths when processing large amounts of data. In this post, we explore the often-overlooked costs of these data movements and show how they can be reduced for algorithms in general and specifically in a database system.

Christian Winter
Christian Winter
Christian Winter
Christian Winter
The Hidden Cost of Data Movement

The Hidden Cost of Data Movement

Recently, Mark Raasveldt of DuckDB wrote an excellent post about why memory management is crucial for efficient data processing. In his post, he focuses on the cost of having data on disk and moving it to memory. After all, everyone knows that having data in memory is what you want. As Jim Gray famously said in 2006:

Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King

Does that mean you can rest easy as long as you make sure your data is in memory?

Unfortunately not, because compute has come a long way since 2006. RAM locality may have been king in 2006. But the king is dead, long live the king. While memory locality was good enough for 2006, the CPU only performs operations on registers, making register locality the new king in town. The fact that CPUs have megabytes of caches whose sole purpose is to bridge the distance between memory and registers shows how far the distance between CPU and memory really is.

So what else do you need to consider when simply having your data in memory is not enough to guarantee efficient access to it? I want to use Mark’s post as motivation to dive deeper into the topic of data movement during data processing and answer this question. We’ll explore this topic for everyday algorithms, as well as what it means for a database system. You can find the code for all experiments in this post at the bottom of this page, or in our examples repository.

How Expensive are Data Movements Really?

I just made a bold claim that storing your data in memory doesn’t always guarantee efficient access, so let’s back it up with some data.

The Effects of Different Access Patterns

We will use a very simple task to analyze the impact of data movement between memory and the CPU. Suppose we recieve a vector of integers and a list of non-overlapping ranges and want to count how many integers lie within any of the ranges. Let’s write some C++ code to do this:

unsigned count(const vector<unsigned>& input, const vector<pair<unsigned, unsigned>>& ranges) {
   unsigned res = 0;

   for (const auto& i : input) {
      for (const auto& range : ranges) {
         if (i >= range.first && i <= range.second) res++;
      }
   }

   return res;
}
unsigned countSwappedLoops(const vector<unsigned>& input, const vector<pair<unsigned, unsigned>>& ranges) {
   unsigned res = 0;

   for (const auto& range : ranges) {
      for (const auto& i : input) {
         if (i >= range.first && i <= range.second) res++;
      }
   }

   return res;
}

Both methods compute the same result, with a minimal difference in approach. While count loops over all integer values in the outer loop and checks them against all ranges in the inner loop, countSwappedLoops loops over the ranges in the outer loop and the values in the inner loop. In both cases, each tuple is checked against each range once, resulting in n * r * 2 comparisons, with n input values and r ranges, and one add instruction for each hit. All input values and ranges are held in memory in a C++ vector, so both methods should be equally fast, right? Let’s look at the runtime for 1 billion randomly generated integer values between 0 and 100 and the two ranges [5, 6] and [30, 60].

Runtime for 1 billion rows

Huh, what happened here? The second method is 3 times slower than the first. Do we do more comparisons in countSwappedLoops after all? Answering this question requires a look under the hood of the CPU, using Viktor Leis’ excellent Perfevent helper:

AlgorithmInstructions
count1900.66
countSwappedLoops1536.11
CPU Instructions Executed per 100 input values

It’s actually the opposite, count has to execute more instructions!

The Benefits of Data Locality

If its not in the number of instructions the CPU executes, where is the difference coming from? It’s from cache misses:

AlgorithmL1 MissesLLC Misses
count0.073.38
countSwappedLoops0.826.38
CPU Cache Misses per 100 input values

The countSwappedLoops algorithm has almost twice as many last-level cache misses and more than 10 times as many L1 misses!

While checking every tuple for every range and checking every range for every tuple look identical at first glance, they are very different to the CPU. All ranges (two in our example) will easily fit into the first level cache of the executing CPU core (48KB data cache per core on the Intel i7-13700H in my laptop). A billion integers, on the other hand, will definitely not. In the count method, fetching the next predicate to check against the current tuple is a lookup in the L1 cache, which takes about 1 nanosecond. In countSwappedLoops, fetching the next tuple to check against the current predicate often results in a cache miss and a memory load, which is about 100 times more expensive.

So what can you take away from this for your own code? It is not only important whether your data is in memory, but also how long it can stay as close to the CPU as possible. A good rule of thumb is to do as much as possible with each tuple of a large input before moving on to the next. This is generally referred to as increasing data locality. Increasing data locality is generally beneficial for any algorithm, regardless of your CPU architecture. For example, you should see similar performance improvements if you run this code on a MacBook with an ARM CPU. If you are in doubt about which of your implementations is better for data locality, you can also compare two approaches and look at the execution statistics as we did above.

Minimizing Data Movement in Database Queries

As you have seen above, the performance impact of data locality is significant, even for simple algorithms. Therefore, data locality is not a trivial issue in a database system. After all, the efficient processing of data is its most important task, and the very thing that people use database systems for. In fact, data locality is so important to database systems that it spawned its own data centric query execution model!

Let’s take a look at the difference to the traditional query execution model using another example:

create table sensors (
    id integer,
    val double
);
create table monitored_sensors (
    id integer
);
select sum(val)
from sensors
where sensors.val >= -4.0
and sensors.id in (select id from monitored_sensors);
Try it in CedarDB

The query filters the relation sensors on the value val and performs a semi-join with the table monitored_sensors on the column id, effectively calculating the sum of all sensor readings not smaller than -4.0 from monitored sensors. This results in an execution plan with five operators:

Query Plan

For those not familiar with query plans, let’s take a quick look: You read query plans bottom-up, from left to right. Starting in the leftmost branch, we take every entry in table monitored_sensors and collect it to later combine with entries from table sensors. Since we are checking for equality on the id column, and are only interested in whether there is a matching tuple in monitored_sensors, not what that tuple looks like, we collect a hash set to search against. For each tuple of sensors we first check if val is greater than or equal to -4.0. If so, we check if there is an entry in monitored_sensors with the same id using the previously created hash set. Finally, we sum all val values of the remaining tuples.

Let’s assume both execution models build the hash table identically. Now, we can focus only on the orange path, i.e. the path of the tuples of the relation sensors.

Traditional Operator-Centric Execution

The traditional model for executing queries in a database system is operator-centric, centering on individual operators such as filter, aggregate, and join. This means that for the query above, a system like PostgreSQL or DuckDB would evaluate the query using operator-centric logic blocks, one for each operator. In our example, this would correspond to one method for processing the greater than or equal to filter, one for the join probe, and one for the sum aggregation:

// Semi Join, probing each tuple against the joinTable and adding tuples with hits to the result
vector<pair<unsigned, double>> semiJoin(const vector<pair<unsigned, double>>& input,
                                        const unordered_set<unsigned>& joinTable,
                                        double selectivityEstimate) {
   vector<pair<unsigned, double>> result;
   result.reserve(input.size() * selectivityEstimate);
   for (const auto& p : input)
      if (joinTable.contains(p.first))
         result.push_back(p);
   return result;
}

// Filter, specialized for a greater or equal on doubles, adding tuples with pair.second >= comparison to the result
vector<pair<unsigned, double>> GE(const vector<pair<unsigned, double>>& input,
                                  double comparison,
                                  double selectivityEstimate) {
   vector<pair<unsigned, double>> result;
   result.reserve(input.size() * selectivityEstimate);
   for (const auto& p : input)
      if (p.second >= comparison)
         result.push_back(p);
   return result;
}

// Sum aggregation, unconditionally adding up all pair.second values
double sum(const vector<pair<unsigned, double>>& input) {
   double result = 0;
   for (const auto& p : input)
      result += p.second;
   return result;
}

Of course, this is a simplistic model. DuckDB optimizes this further, e.g. by avoiding full materialization in each method using selection vectors, predicated execution, and calling it on smaller chunks of data in parallel. Conceptually, however, this repeated iteration over the input is how vectorized systems like DuckDB execute a query.

You can see that this falls victim to the same problems as the countSwappedLoops example from the previous section. Instead of keeping the tuples in the registers between the two filter operations, they are at least evicted to the CPU cache, or have to be fetched from memory all over again.

Data-Centric Execution in CedarDB

So how can a database system increase data locality between operators? Unfortunately, there is no simple answer. To increase data locality statically would mean preparing functions for all possible combinations of data types, filter predicates, and operators to keep tuples in registers longer. This is simply impossible when you allow user-specified queries like a database system. Instead, the only way to increase data locality this is to generate specialized code for each query. When generating specialized code for each query, we do not need to prepare fused loops for all possible filter and column combinations, only for those actually required by the query. This is a much more tractable problem. In fact, the code that needs to be generated for the above query would be only 5 lines of C++ code:

// A single method performing all steps of a query
double processQuery(const vector<pair<unsigned, double>>& input,
                    const unordered_set<unsigned>& joinTable,
                    double comparison) {
   double result = 0;
   for (const auto& p : input)
      if (p.second >= comparison && joinTable.contains(p.first))
         result += p.second;
   return result;
}

Let us compare how both approaches perform on a billion randomly generated tuples, 1% of which qualify for the join:

Runtime for 1 billion rows

Again, there is a clear advantage for the data-centric approach, which is almost twice as fast as the operator-centric approach, even for such a simple query. The speedup also translates to SQL queries, of course, with data-centric CedarDB taking ~1.5 seconds to complete the above query on my laptop, while both DuckDB and ClickHouse took about 4 seconds. You can find the code to try this yourself in the appendix of this post. So why doesn’t every database system take a data-centric approach to query processing?

From a query execution perspective alone, there is no downside to the data-centric execution model. However, writing code to generate code for each operator is much harder than implementing the operator itself, and the details of our code generation framework could (and probably will) fill many blog posts.

To get an idea of the complexity of this task, take a look at the operator-centric methods again. I am sure any reader could easily write operators for other data and comparison types in this style. While the processQuery method does not look any more complicated, as an exercise, try writing code that writes a specialized processQuery method for a given number of joins and filter operators on arbitrary predicates and columns.

If this is also easy for you, we are hiring!

Summary

While working with in-memory data is an essential component of high-performance data processing, it is not the only one. Keeping data as close to the CPU as possible for as long as possible is also critical. Even for simple algorithms like the count filter functions, switching the order of two for loops can make a 3x difference in throughput. For short code fragments, such optimizations are possible by hand. However, increasing data locality on the scale of a database system like CedarDB requires a completely different processing model with fine-grained code generation.

If you want to discuss data-centric execution and other databasy things, feel free to join our Community Slack!

Appendix

Benchmark Code

Show Bechmark Code

main.cpp

#include "PerfEvent.hpp"
#include <iostream>
#include <random>
#include <unordered_set>
#include <vector>
//------------------------------------------------------------------------
// Microbenchmark

unsigned count(const std::vector<unsigned>& input, const std::vector<std::pair<unsigned, unsigned>>& ranges) {
   unsigned res = 0;

   for (const auto& i : input) {
      for (const auto& range : ranges) {
         if (i >= range.first && i <= range.second) res++;
      }
   }

   return res;
}

unsigned countSwappedLoops(const std::vector<unsigned>& input, const std::vector<std::pair<unsigned, unsigned>>& ranges) {
   unsigned res = 0;

   for (const auto& range : ranges) {
      for (const auto& i : input) {
         if (i >= range.first && i <= range.second) res++;
      }
   }

   return res;
}

void microBenchmark() {
   // Fix the benchmark parameters
   auto inputSize = 1'000'000'000u;

   std::vector<std::pair<unsigned, unsigned>> ranges{{5, 6}, {30, 60}};

   // Prepare the result
   auto res = 0u;

   // Prepare random number generation to fill input
   std::mt19937 gen(42);
   std::uniform_int_distribution<unsigned> uniform_dist(1, 100);

   auto numWarmups = 3u;
   auto numRuns = 3u;

   // Prepare space for the input data
   std::vector<unsigned> input;
   input.reserve(inputSize);

   // Fill input with ramdom integers between 0 and 100
   for (auto i = 0; i < inputSize; i++)
      input.push_back(uniform_dist(gen));

   auto benchmarkCount = [&]() {
      res = count(input, ranges);
   };

   auto benchmarkCountSwappedLoops = [&]() {
      res = countSwappedLoops(input, ranges);
   };

   for (auto i = 0; i < numWarmups; i++) benchmarkCount();
   {
      std::cout << "Count: " << std::endl;
      PerfEventBlock e(inputSize / 100);
      benchmarkCount();
      e.e->stopCounters();
      std::cout << "Result: " << res << std::endl;
      std::cout << "Stats: " << std::endl;
   }
   std::cout << std::endl
             << std::endl
             << std::endl;

   for (auto i = 0; i < numWarmups; i++) benchmarkCountSwappedLoops();
   {
      std::cout << "Count with swapped loops: " << std::endl;
      PerfEventBlock e(inputSize / 100);
      benchmarkCountSwappedLoops();
      e.e->stopCounters();
      std::cout << "Result: " << res << std::endl;
      std::cout << "Stats: " << std::endl;
   }
}

//------------------------------------------------------------------------
// Query Benchmark

//------------------------------------------------------------------------
// DuckDB style methods

// Semi Join, probing each tuple against the joinTable and adding tuples with hits to the result
std::vector<std::pair<unsigned, double>> semiJoin(const std::vector<std::pair<unsigned, double>>& input,
                                                  const std::unordered_set<unsigned>& joinTable,
                                                  double selectivityEstimate) {
   std::vector<std::pair<unsigned, double>> result;
   result.reserve(input.size() * selectivityEstimate);
   for (const auto& p : input)
      if (joinTable.contains(p.first))
         result.push_back(p);
   return result;
}

// Filter, specialized for a greater or equal on doubles, adding tuples with pair.second >= comparison to the result
std::vector<std::pair<unsigned, double>> GE(const std::vector<std::pair<unsigned, double>>& input, double comparison,
                                            double selectivityEstimate) {
   std::vector<std::pair<unsigned, double>> result;
   result.reserve(input.size() * selectivityEstimate);
   for (const auto& p : input)
      if (p.second >= comparison)
         result.push_back(p);
   return result;
}

// Sum aggregation, unconditionally adding up all pair.second values
double sum(const std::vector<std::pair<unsigned, double>>& input) {
   double result = 0;
   for (const auto& p : input)
      result += p.second;
   return result;
}

//------------------------------------------------------------------------
// CedarDB Style method

// A single method performing all steps of a query
double processQuery(const std::vector<std::pair<unsigned, double>>& input,
                    const std::unordered_set<unsigned>& joinTable,
                    double comparison) {
   double result = 0;
   for (const auto& p : input)
      if (p.second >= comparison && joinTable.contains(p.first))
         result += p.second;
   return result;
}

void queryBenchmark() {
   // Fix the benchmark parameters
   auto inputSize = 1'000'000'000u;
   auto numWarmups = 3u;

   // Prepare space for the input data
   std::vector<std::pair<unsigned, double>> input;
   input.reserve(inputSize);

   // Select double comparison value for GE comparison
   double compValue = -4.0;

   // Prepare join hash table
   std::unordered_set<unsigned> joinTable;
   joinTable.reserve(inputSize * .01);

   // Prepare random number generation to fill input
   std::mt19937 gen(42);
   std::uniform_real_distribution<double> uniform_dist(-5.0, 6.0);
   std::uniform_int_distribution<int> selection_dist(1, 100);

   // Prepare result variable
   double result;

   for (auto i = 0; i < inputSize; i++) {
      // Fill input with ascending keys and random values
      input.push_back({i, uniform_dist(gen)});

      // Select every 100th tuple for the join
      if (selection_dist(gen) > 99) {
         joinTable.insert(i);
      }
   }

   auto measureDuckDB = [&]() {
      auto res = GE(input, compValue, .85);
      res = semiJoin(res, joinTable, .01);
      result = sum(res);
   };

   auto measureCedarDB = [&]() {
      result = processQuery(input, joinTable, compValue);
   };

   // Benchmark DuckDB style
   for (auto i = 0; i < numWarmups; i++) {
      measureDuckDB();
   }
   {
      std::cout << "DuckDB Style:" << std::endl;
      PerfEventBlock e(inputSize / 100);
      measureDuckDB();
      e.e->stopCounters();
      std::cout << "Result: " << result << std::endl;
      std::cout << "Stats: " << std::endl;
   }
   std::cout << std::endl
             << std::endl
             << std::endl;

   // Benchmark CedarDB style
   for (auto i = 0; i < numWarmups; i++) {
      measureCedarDB();
   }
   {
      std::cout << "CedarDB Style:" << std::endl;
      PerfEventBlock e(inputSize / 100);
      measureCedarDB();
      e.e->stopCounters();
      std::cout << "Result: " << result << std::endl;
      std::cout << "Stats: " << std::endl;
   }
   std::cout << std::endl
             << std::endl
             << std::endl;
}

//------------------------------------------------------------------------
// Benchmark

int main() {
   std::cout << "Microbenchmark: " << std::endl
             << std::endl;
   microBenchmark();

   std::cout << std::endl
             << std::endl
             << std::endl
             << "Query Benchmark: " << std::endl
             << std::endl;
   queryBenchmark();

   return 0;
}

Perfevent.hpp Source

/*

Copyright (c) 2018 Viktor Leis

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

 */

#pragma once

#if defined(__linux__)

#include <chrono>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct PerfEvent {

   struct event {
      struct read_format {
         uint64_t value;
         uint64_t time_enabled;
         uint64_t time_running;
         uint64_t id;
      };

      perf_event_attr pe;
      int fd;
      read_format prev;
      read_format data;

      double readCounter() {
         double multiplexingCorrection = static_cast<double>(data.time_enabled - prev.time_enabled) / static_cast<double>(data.time_running - prev.time_running);
         return static_cast<double>(data.value - prev.value) * multiplexingCorrection;
      }
   };

   enum EventDomain : uint8_t { USER = 0b1, KERNEL = 0b10, HYPERVISOR = 0b100, ALL = 0b111 };

   std::vector<event> events;
   std::vector<std::string> names;
   std::chrono::time_point<std::chrono::steady_clock> startTime;
   std::chrono::time_point<std::chrono::steady_clock> stopTime;

   PerfEvent() {
      registerCounter("cycles", PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
      registerCounter("kcycles", PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, KERNEL);
      registerCounter("instructions", PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
      registerCounter("L1-misses", PERF_TYPE_HW_CACHE, PERF_COUNT_HW_CACHE_L1D|(PERF_COUNT_HW_CACHE_OP_READ<<8)|(PERF_COUNT_HW_CACHE_RESULT_MISS<<16));
      registerCounter("LLC-misses", PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
      registerCounter("branch-misses", PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES);
      registerCounter("task-clock", PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK);
      // additional counters can be found in linux/perf_event.h

      for (unsigned i=0; i<events.size(); i++) {
         auto& event = events[i];
         event.fd = static_cast<int>(syscall(__NR_perf_event_open, &event.pe, 0, -1, -1, 0));
         if (event.fd < 0) {
            std::cerr << "Error opening counter " << names[i] << std::endl;
            events.resize(0);
            names.resize(0);
            return;
         }
      }
   }

   void registerCounter(const std::string& name, uint64_t type, uint64_t eventID, EventDomain domain = ALL) {
      names.push_back(name);
      events.push_back(event());
      auto& event = events.back();
      auto& pe = event.pe;
      memset(&pe, 0, sizeof(struct perf_event_attr));
      pe.type = static_cast<uint32_t>(type);
      pe.size = sizeof(struct perf_event_attr);
      pe.config = eventID;
      pe.disabled = true;
      pe.inherit = 1;
      pe.inherit_stat = 0;
      pe.exclude_user = !(domain & USER);
      pe.exclude_kernel = !(domain & KERNEL);
      pe.exclude_hv = !(domain & HYPERVISOR);
      pe.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;
   }

   void startCounters() {
      for (unsigned i=0; i<events.size(); i++) {
         auto& event = events[i];
         ioctl(event.fd, PERF_EVENT_IOC_RESET, 0);
         ioctl(event.fd, PERF_EVENT_IOC_ENABLE, 0);
         if (read(event.fd, &event.prev, sizeof(uint64_t) * 3) != sizeof(uint64_t) * 3)
            std::cerr << "Error reading counter " << names[i] << std::endl;
      }
      startTime = std::chrono::steady_clock::now();
   }

   ~PerfEvent() {
      for (auto& event : events) {
         close(event.fd);
      }
   }

   void stopCounters() {
      stopTime = std::chrono::steady_clock::now();
      for (unsigned i=0; i<events.size(); i++) {
         auto& event = events[i];
         if (read(event.fd, &event.data, sizeof(uint64_t) * 3) != sizeof(uint64_t) * 3)
            std::cerr << "Error reading counter " << names[i] << std::endl;
         ioctl(event.fd, PERF_EVENT_IOC_DISABLE, 0);
      }
   }

   double getDuration() {
      return std::chrono::duration<double>(stopTime - startTime).count();
   }

   double getIPC() {
      return getCounter("instructions") / getCounter("cycles");
   }

   double getCPUs() {
      return getCounter("task-clock") / (getDuration() * 1e9);
   }

   double getGHz() {
      return getCounter("cycles") / getCounter("task-clock");
   }

   double getCounter(const std::string& name) {
      for (unsigned i=0; i<events.size(); i++)
         if (names[i]==name)
            return events[i].readCounter();
      return -1;
   }

   static void printCounter(std::ostream& headerOut, std::ostream& dataOut, std::string name, std::string counterValue, bool addComma=true) {
      auto width=std::max(name.length(),counterValue.length());
      headerOut << std::setw(static_cast<int>(width)) << name << (addComma ? "," : "") << " ";
      dataOut << std::setw(static_cast<int>(width)) << counterValue << (addComma ? "," : "") << " ";
   }

   template <typename T>
   static void printCounter(std::ostream& headerOut, std::ostream& dataOut, std::string name, T counterValue, bool addComma=true) {
      std::stringstream stream;
      stream << std::fixed << std::setprecision(2) << counterValue;
      PerfEvent::printCounter(headerOut,dataOut,name,stream.str(),addComma);
   }

   void printReport(std::ostream& out, uint64_t normalizationConstant) {
      std::stringstream header;
      std::stringstream data;
      printReport(header,data,normalizationConstant);
      out << header.str() << std::endl;
      out << data.str() << std::endl;
   }

   void printReport(std::ostream& headerOut, std::ostream& dataOut, uint64_t normalizationConstant) {
      if (!events.size())
         return;

      // print all metrics
      for (unsigned i=0; i<events.size(); i++) {
         printCounter(headerOut,dataOut,names[i],events[i].readCounter()/static_cast<double>(normalizationConstant));
      }

      printCounter(headerOut,dataOut,"scale",normalizationConstant);

      // derived metrics
      printCounter(headerOut,dataOut,"IPC",getIPC());
      printCounter(headerOut,dataOut,"CPUs",getCPUs());
      printCounter(headerOut,dataOut,"GHz",getGHz(),false);
   }

   template <typename T>
   static void printCounterVertical(std::ostream& infoOut, std::string name, T counterValue, int eNameWidth) {
      std::stringstream stream;
      stream << std::fixed << std::setprecision(2) << counterValue;
      infoOut << std::setw(eNameWidth) << std::left << name << " : " << stream.str() << std::endl;
   }

   void printReportVertical(std::ostream& out, uint64_t normalizationConstant) {
      std::stringstream info;
      printReportVerticalUtil(info,normalizationConstant);
      out << info.str() << std::endl;
   }

   void printReportVerticalUtil(std::ostream& infoOut, uint64_t normalizationConstant) {
      if (!events.size())
         return;

      // get width of the widest event name. Minimum width is the one of 'scale'
      int eNameWidth=5;
      for (unsigned i=0; i<events.size(); i++) {
         eNameWidth=std::max(static_cast<int>(names[i].length()),eNameWidth);
      }

      // print all metrics
      for (unsigned i=0; i<events.size(); i++) {
         printCounterVertical(infoOut,names[i],events[i].readCounter()/static_cast<double>(normalizationConstant),eNameWidth);
      }

      printCounterVertical(infoOut,"scale",normalizationConstant,eNameWidth);

      // derived metrics
      printCounterVertical(infoOut,"IPC",getIPC(),eNameWidth);
      printCounterVertical(infoOut,"CPUs",getCPUs(),eNameWidth);
      printCounterVertical(infoOut,"GHz",getGHz(),eNameWidth);
   }
};

struct BenchmarkParameters {

   void setParam(const std::string& name,const std::string& value) {
      params[name]=value;
   }

   void setParam(const std::string& name,const char* value) {
      params[name]=value;
   }

   template <typename T>
   void setParam(const std::string& name,T value) {
      setParam(name,std::to_string(value));
   }

   void printParams(std::ostream& header,std::ostream& data) {
      for (auto& p : params) {
         PerfEvent::printCounter(header,data,p.first,p.second);
      }
   }

   BenchmarkParameters(std::string name="") {
      if (name.length())
         setParam("name",name);
   }

   private:
   std::map<std::string,std::string> params;
};

struct PerfRef {
   union {
      PerfEvent instance;
      PerfEvent *pointer;
   };
   bool has_instance;

   PerfRef() : instance(), has_instance(true) {}
   PerfRef(PerfEvent *ptr) : pointer(ptr), has_instance(false) {}
   PerfRef(const PerfRef&) = delete;

   ~PerfRef() {
      if (has_instance)
      instance.~PerfEvent();
   }

   PerfEvent* operator->() {
      return has_instance ? &instance : pointer;
   }
};

struct PerfEventBlock {
   PerfRef e;
   uint64_t scale;
   BenchmarkParameters parameters;
   bool printHeader;

   PerfEventBlock(uint64_t scale = 1, BenchmarkParameters params = {}, bool printHeader = true)
       : scale(scale),
         parameters(params),
         printHeader(printHeader) {
     e->startCounters();
   }

   PerfEventBlock(PerfEvent& perf, uint64_t scale = 1, BenchmarkParameters params = {}, bool printHeader = true)
       : e(&perf),
         scale(scale),
         parameters(params),
         printHeader(printHeader) {
     e->startCounters();
   }

   ~PerfEventBlock() {
      e->stopCounters();
      std::stringstream header;
      std::stringstream data;
      parameters.printParams(header,data);
      PerfEvent::printCounter(header,data,"time sec",e->getDuration());
      e->printReport(header, data, scale);
      if (printHeader)
         std::cout << header.str() << std::endl;
      std::cout << data.str() << std::endl;
   }
};

#else
#include <ostream>
struct PerfEvent {
   void startCounters() {}
   void stopCounters() {}
   void printReport(std::ostream&, uint64_t) {}
   template <class T> void setParam(const std::string&, const T&) {};
};

struct BenchmarkParameters {
};

struct PerfEventBlock {
   PerfEventBlock(uint64_t = 1, BenchmarkParameters = {}, bool = true) {};
   PerfEventBlock(PerfEvent e, uint64_t = 1, BenchmarkParameters = {}, bool = true) {};
};
#endif

Benchmark Output

Show Benchmark Results
Microbenchmark: 

Count:
Result: 330011698
Stats: 
time sec, cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
    0.87, 406.61,    0.71,      1900.66,      0.07,       3.38,          0.00,      87.34, 10000000, 4.67, 1.00, 4.66 



Count with swapped loops:
Result: 330011698
Stats: 
time sec,  cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
    2.83, 1296.79,    2.31,      1536.11,      0.82,       6.38,         35.31,     282.97, 10000000, 1.18, 1.00, 4.58 



Query Benchmark: 

DuckDB Style:
Result: 9.09568e+06
Stats: 
time sec,  cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
   20.21, 9454.34, 2595.89,      8562.50,     47.27,      99.71,         80.38,    2020.92, 10000000, 0.91, 1.00, 4.68 



CedarDB Style:
Result: 9.09568e+06
Stats: 
time sec,  cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
   12.33, 5741.43,   11.08,      2886.14,     38.97,      25.63,         80.31,    1233.21, 10000000, 0.50, 1.00, 4.66 

SQL Queries for Sensor Example

Show SQL Code

Code to prepare benchmark query in CedarDB and DuckDB

create table sensors (id integer, val double);
insert into sensors (
   select i as id, random()*9 - 5 as val from generate_series(1, 1000000000) x(i)
);
create table monitored_sensors (id integer);
insert into monitored_sensors (
   select i as id from generate_series(1, 1000000000) x(i)
   where random() < .01
);

Code to prepare benchmark query in ClickHouse

create table sensors (id integer, val double) engine=MergeTree() order by id;
insert into sensors select generate_series as id, randCanonical()*9 - 5 as val from generate_series(1, 1000000000);
create table monitored_sensors (id integer) engine=MergeTree() order by id;
insert into monitored_sensors select generate_series as id from generate_series(1, 1000000000) where randCanonical() < .01;

Benchmark query for CedarDB and DuckDB

select sum(val)
from sensors
where sensors.val >= -4.0
and sensors.id in (select id from monitored_sensors);

Benchmark query for ClickHouse with overflow checks enabled (why is this not the default?)

select sumWithOverflow(val)
from sensors
where sensors.val >= -4.0
and sensors.id in (select id from monitored_sensors);