Why You Shouldn’t Forget to Optimize the Data Layout

When you want to speed up your program, the obvious step is to recall the learnings of your data structure class and optimize the algorithmic complexity. Clearly, algorithms are the star of each program as swapping a hot O(n) algorithm with a lower complexity one, such as O(log n), yields almost arbitrary performance improvements. However, the way data is structured also affects performance significantly: Programs run on physical machines with physical properties such as varying data latencies to caches, disks, or RAM. After you optimized your algorithm, you need to consider these properties to achieve the best possible performance. An optimized data layout takes your algorithms and access patterns into account when deciding on how to store the bytes of your data structure on physical storage. Therefore, it can make your algorithms run several times faster. In this blog post, we will show you an example where we can achieve a 4x better read performance by just changing the data layout according to our access pattern.

Comparison of AoS and SoA Data Stores

Modern hardware, particularly the CPU, is designed to process data in specific ways. The arrangement of data in memory affects how efficiently your program can use the CPU’s cache, how often it suffers from cache misses, and how well it can leverage vectorized instructions (SIMD). Even with optimal algorithms, poor data layout can lead to frequent cache reloads, stalled pipelines, and excessive memory transfers, all of which reduce performance.

One of the most important choices for the data layout of your program comes down to how you store collections of objects or records. The remainder of this blog post builds upon a simple employee example that stores a collection of employees that contain an employee’s id, their first name, and salary.

The simple and intuitive data layout stores each employee’s complete record in contiguous memory, known as Array of Structures (AoS). This makes it easy to work with individual employees since all their information is packed together and can be fetched with a single array [i] access. The Structure of Arrays (SoA) format, on the other hand, separates the fields into distinct arrays. Each array holds one specific attribute for all employees. Here, all the identifiers are grouped in one continuous memory space, then all the first names, and lastly all the salaries.

Example employee data split into AoS or SoA.

To understand the performance implications of using different data layouts, we’ll use a simple C++ program to run multiple sets of experiments. During the experiments, we’ll do quite a bit of number crunching and analyze the runtime of the different data layout options to find the advantages and disadvantages of each format. The following snippet shows an example implementation for AoS and SoA.

#include <array>
#include <vector>
#include <cstdint>
// ----------------------------------------------------------------------------
using series = uint64_t;
using decimal = uint64_t;
// ----------------------------------------------------------------------------
// Represent all elements of a row consecutively in memory
struct Employee {
    series id;
    decimal salary;
    std::array<char, 16> firstname;
};
// Vector is a growable array of data, each element is a full row
struct ArrayOfStruct {
    std::vector<Employee> table;
};
// ----------------------------------------------------------------------------
// Store the same column consecutively in memory
struct StructOfArrays {
    std::vector<series> id;
    std::vector<decimal> salary;
    std::vector<std::array<char, 16>> firstname;
};
// ----------------------------------------------------------------------------

In the AoS format, we define a struct for an employee that contains all the necessary fields: id, first name, and salary. We then create a vector (i.e., a resizable array) of this structure to store data for multiple employees. In contrast, with SoA, we separate each field of the employee data into its own vector. This means we’ll have one array for ids, another for first names, and a third for salaries.

Update Performance

Let’s explore the properties of these different storage types. First, we want to understand the typical object-oriented way of accessing data. There, applications usually insert, update, and delete objects. In our example, we create employees when they are hired, update their salary over their time at the company, and delete them when they leave.

Because data movement is crucial to performance, we explore the different access paths of the different data layouts. All following experiments use 100 million records and were measured on a i7-13700H CPU. To understand the performance for the object-oriented access path, we update the name of persons (according to a newly earned degree) and adapt their salary to mimic the object updates over the lifetime of objects. In total, 10% of the employees are updated during the experiment.

Store	Execution Time
Array of Structs	126.58 ms
Struct of Arrays	195.66 ms

Even though the generated assembly code is almost identical, check for yourself here at godbolt, one approach is about 50% faster!

If it’s not the CPU, what is responsible for the difference? It’s the memory subsystem moving the data from main memory through the CPU caches into the CPU’s registers. The granularity of moving data is determined by the cache line size, usually 64 bytes on x86 machines. It doesn’t matter how much of a given cache line we need to access. As soon as we access any part of it we have to pay the cost of retrieving it from main memory.

Cache line accesses for accessing one object stored as AoS or SoA.

In this example, we touch a large fraction of each employee’s record as we access both name and salary. Having both fields in the same cache line lets us get away with having just one cache line access per employee update.

Read Performance

After examining both data layouts on a typical insert/update/delete workload, we want to focus on reading and analyzing properties of the whole collection. To understand the performance impact of the different data layouts on this workload, we aggregate the total personnel costs as follows.

// ----------------------------------------------------------------------------
decimal getTotalSalary(const ArrayOfStruct& store)
// Computes the salary from an array of struct store
{
    decimal salary = 0;
    // Iterate over the rows and select the salary
    for (const auto& r : store.table)
        salary += r.salary;
    return salary;
}
// ----------------------------------------------------------------------------
decimal getTotalSalary(const StructOfArrays& store)
// Computes the salary from a struct of arrays store
{
    decimal salary = 0;
    // Iterate over the column
    for (auto s : store.salary)
        salary += s;
    return salary;
}
// ----------------------------------------------------------------------------

Although the code looks quite similar, the performance is vastly different. On my machine, the SoA approach is roughly 3x faster than AoS.

Store	Execution Time
Array of Structs	137.22 ms
Struct of Arrays	45.67 ms

Interestingly, both versions are compiled to almost the same underlying assembly code. The only difference in the assembly code is the offset of accessing the next data element (32 bytes vs. 8 bytes), which can be seen here at godbolt.

As mentioned earlier, the CPU loads one 64 byte cache line at a time. It is therefore crucial to align the data layout to this granularity. The SoA approach loads 8 salary entries at once (8 * 8 byte = 64 byte), whereas the AoS storage can only move 2 records (2 * 32 byte = 64 byte) per cache line (and therefore only 2 salary entries). This wastes 75% of the memory bandwidth, achieving a goodput of only 16 bytes per 64 bytes loaded. Note that the experiments are completely in-memory and much higher differences are expected if the data resides on lower bandwidth storage devices instead of fast main memory.

Values accessed per cache line when one dimension is read using AoS or SoA.

Additionally, the SoA approach can greatly benefit of auto vectorization within the compiler. With the optimization level O3, g++ produces SIMD code for this data layout. For the AoS approach, the compiler still loops over the data given an array of employees, as shown here at godbolt. This auto vectorization gives us another 1.5x performance improvement, so in total SoA is more than 4x faster than AoS.

Store	Execution Time
Array of Structs	137.22 ms
Struct of Arrays	45.67 ms
Struct of Arrays (auto-vectorized)	33.13 ms

Comparing update and read access patterns, we can see that the data layout is a trade-off and better application knowledge helps to do the right decision. If you usually touch most of the data of your structure when looping over it, it is better to do AoS. Conversely, the salary aggregation experiment shows that SoA drastically increases performance when only few fields of a record are needed.

Compression

Two common approaches to optimizing data size are lightweight and strong compression. Strong compression (e.g., LZ77 and Huffman coding) minimizes data size without making additional assumptions about the data, but is often slow and CPU-intensive as it trades speed for size. On the other hand, lightweight compression allows you to eat your cake and have it too. Lightweight compression methods, such as run-length (RLE), frame-of-reference, or delta encoding, provide fast and efficient ways to reduce memory consumption without sacrificing access performance. Unlike strong compression, many lightweight encodings can be used to directly process the compressed data without additional extractions, further improving data movement. However, the effectiveness of these lightweight methods depends heavily on how the data is organized.

Lightweight compression schemes thrive when they can detect patterns or repeated values. An example of such a pattern is an ordered sequence of values. Using the SoA layout, we can store the data efficiently using delta encoding or a frame of reference encoding. Overall, SoA fits lightweight compression better because similar values are stored nearby. In addition, SoA provides more repetition within each field, which maximizes the compression ratio for schemes such as RLE. In contrast, the interleaving of different fields in AoS limits the effectiveness of lightweight compression.

Returning to our example, we test lightweight compression using frame-of-reference for decimal and serial values. Frame-of-reference encoding finds the minimum of the values and minimizes the bytes required to store the difference from the minimum value.

This simplified code snippet implements a frame-of-reference encoding that can store difference values of up to 2¹⁶. The full code in the appendix implements a more general version of this encoding scheme. In our example, the employees’ salary differences fit within the 2-byte difference, so we can calculate the salary using a frame-of-reference minimum value and the offset.

// ----------------------------------------------------------------------------
// Column compressed with frame-of-reference encoding
template <typename T>
struct FORColumn : Column<T> {
    // The minimum value of the column
    T minValue;
    // The FOR compressed values
    std::vector<uint16_t> compressed;
    // Constructor
    FORColumn(const std::vector<T>& values, T minValue) : minValue(minValue) {
        compressed.reserve(values.size());
        for (const auto& elem : values)
            compressed.emplace_back(elem - minValue);
    }
};
// ----------------------------------------------------------------------------
template <typename T>
std::unqiue_ptr<Column<T>> createColumn(const std::vector<T>& values)
// Compute if compression scheme can be applied and create the struct of array representation
{
    auto [minValue, maxValue] = std::ranges::minmax_element(values);
    if (*maxValue - *minValue <= std::numeric_limits<uint16_t>::max()) {
        return std::make_unique<FORColumn<T>>(values, *minValue);
    } else {
        return std::make_unique<Column<T>>(values);
    }
}
// ----------------------------------------------------------------------------
template <typename T>
decimal getTotalSalary(const FORColumn<T>& column)
// Computes the salary
{
    decimal salary = 0;
    for (auto s : column.compressed) {
        salary += s + column.minValue;
    }
    return salary;
}
// ----------------------------------------------------------------------------

As the following table shows, compression significantly reduces the data size and improves the performance. Our compression method allows us to process the data directly, while requiring less data movement. Overall, the optimized storage format results in a 7x speedup and reduces memory and storage consumption by 33%. Note that we measured these results using the generic implementation shown in the appendix.

Store	Execution Time	Size
Array of Structs	137.22 ms	3051 MB
Struct of Arrays (auto-vectorized)	33.13 ms	3051 MB
Struct of Compressed Arrays (auto-vectorized)	19.52 ms	2098 MB

Out-Of-Memory Optimization

Now that we’ve seen how SoA can greatly improve performance by using lightweight compression techniques, let’s consider how these concepts scale when dealing with large data sets. When the data set becomes too large to fit in memory, or when efficient access to only a portion of the data set is required, chunked structures of arrays come into play: We partition data into chunks of the same size and store them on disk. To speed up processing, we store the min and max value of each chunk at its start. Before processing the chunk, we can then look at the min and max values and skip chunks which don’t apply anyways, e.g. we don’t need to look at chunks with first names starting with ‘Z’ if we are only interested for first names ‘A’ to ‘M’. Because each chunk is compressed independently, we can scale the system to larger data sets without worrying about compression performance degradation, especially when considering lightweight formats such as frame-of-reference encoding.

Impact on Database Systems

Obviously, all of these observations are crucial to building a fast and scalable database system (and that’s why we blog about it here :)). Database people call AoS row-major storage and SoA column-major storage. Most traditional database systems focus on transactional processing (i.e., a lot of updates, inserts, and deletes) and implement AoS, whereas modern analytical database systems (i.e., a lot of reading data and number crunching) usually prefer the SoA data orientation. Chunked SoA has its origins in the Partition Attributes Across (PAX) format. Today, additional compression of data within the PAX format is used to read data from slow storage, such as cloud object storage (e.g., AWS S3).

In CedarDB, we combine the advantages of both formats by using a hybrid storage engine called Colibri. Unlike most database systems, Colibri allows us to excel at both transactional (write/update) and analytical (read) query processing. If you want to learn more about CedarDB or discuss storage technology, feel free to join our community Slack.

Join our Slack!

Appendix

Benchmark Code Array of Struct vs. Struct of Arrays

Show Code

main.cpp

#include <algorithm>
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
// ----------------------------------------------------------------------------
using series = uint64_t;
using decimal = uint64_t;
// ----------------------------------------------------------------------------
// Represent all elements of a row consecutively in memory
struct Employee {
    series id;
    decimal salary;
    std::array<char, 16> firstname;
};
// Vector is a growable array of data, each element is a full row
struct ArrayOfStruct {
    std::vector<Employee> table;
};
// ----------------------------------------------------------------------------
// Store the same column consecutively in memory
struct StructOfArrays {
    std::vector<series> id;
    std::vector<decimal> salary;
    std::vector<std::array<char, 16>> firstname;
};
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const ArrayOfStruct& store)
// Computes the salary from an array of struct store
{
    decimal salary = 0;
    // Iterate over the rows and select the salary
    for (const auto& r : store.table)
        salary += r.salary;
    return salary;
}
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const StructOfArrays& store)
// Computes the salary from a struct of arrays store
{
    decimal salary = 0;
    // Iterate over the column
    for (auto s : store.salary)
        salary += s;
    return salary;
}
// ----------------------------------------------------------------------------
static void updateEntry(ArrayOfStruct& store, std::vector<series> ids, std::array<char, 16>& newName)
// Update entries of AoS store
{
    for (auto id : ids) {
        auto& record = store.table[id];
        record.firstname = newName;
        record.salary *= 2;
    }
}
// ----------------------------------------------------------------------------
static void updateEntry(StructOfArrays& store, std::vector<series> ids, std::array<char, 16>& newName)
// Update entries of SoA store
{
    for (auto id : ids) {
        store.firstname[id] = newName;
        store.salary[id] = 2 * store.salary[id];
    }
}
// ----------------------------------------------------------------------------
int main() {
    // Create 100M Users
    auto users = 100 * 1000 * 1000ull;
    std::array<char, 16> name = {"Moritz - Felipe"};
    // For both versions
    ArrayOfStruct aos;
    StructOfArrays soa;
    // Fill the stores
    for (auto i = 0ull; i < users; i++) {
        aos.table.emplace_back(static_cast<series>(i), static_cast<decimal>(i * 100) , name);
        soa.id.push_back(static_cast<series>(i));
        soa.salary.push_back(static_cast<decimal>(i) * 100);
        soa.firstname.emplace_back(name);
    }

    // Computes the salary
    auto measureSalary = []<typename T>(const T& store) {
        // Start the timer
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        auto salary = getTotalSalary(store);
        // Stop the timer
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        // Determine store kind
        auto storeName = "SoA";
        if constexpr (std::is_same_v<ArrayOfStruct, T>)
            storeName = "AoS";
        // Print results
        std::cout << "Workload: Analyze; Store: " << storeName << "; Time: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;
        return salary;
    };

    // Check correctness
    if (measureSalary(aos) != measureSalary(soa))
        return -1;

    // -------------

    std::random_device rnd_device;
    std::mt19937 mt{rnd_device()}; // Generates random integers
    std::uniform_int_distribution<uint64_t> dist{0, users - 1};

    auto gen = [&]() {
        return dist(mt);
    };

    std::vector<series> ids(0.1 * users);
    std::generate(ids.begin(), ids.end(), gen);

    // Updates the column
    auto measureUpdate = []<typename T>(T& store, std::vector<series> ids) {
        // Update name and salary
        std::array<char, 16> name = {"Dr. Moritz - F."};

        // Start the timer
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        updateEntry(store, ids, name);
        // Stop the timer
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        // Determine store kind
        auto storeName = "SoA";
        if constexpr (std::is_same_v<ArrayOfStruct, T>)
            storeName = "AoS";
        // Print results
        std::cout << "Workload: Update; Store: " << storeName << "; Time: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;
    };

    measureUpdate(aos, ids);
    measureUpdate(soa, ids);

    return 0;
}
// ----------------------------------------------------------------------------

Full Benchmark Code

Show Code

full.cpp

#include <algorithm>
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <vector>
// ----------------------------------------------------------------------------
using series = uint64_t;
using decimal = uint64_t;
// ----------------------------------------------------------------------------
// Represent all elements of a row consecutively in memory
struct Employee {
    series id;
    decimal salary;
    std::array<char, 16> firstname;
};
// Vector is a growable array of data, each element is a full row
struct ArrayOfStruct {
    // The rows
    std::vector<Employee> table;

    // Calculates the size
    size_t getSize() const {
        return table.size() * sizeof(decltype(table)::value_type) + sizeof(ArrayOfStruct);
    }
};
// ----------------------------------------------------------------------------
// Store the same column consecutively in memory
// For simplicity we ignore, that vectors put the data onto different memory regions.
// A database would most likely use a custom data structure that packs all columns.
struct StructOfArrays {
    // The columns
    std::vector<series> id;
    std::vector<decimal> salary;
    std::vector<std::array<char, 16>> firstname;

    // Calculates the size
    size_t getSize() const {
        size_t size = id.size() * sizeof(decltype(id)::value_type);
        size += salary.size() * sizeof(decltype(salary)::value_type);
        size += firstname.size() * sizeof(decltype(firstname)::value_type);
        return size + sizeof(StructOfArrays);
    }
};
// ----------------------------------------------------------------------------
// The different compression algorithms
enum class CompressionType : uint8_t {
    RAW,
    FOR1,
    FOR2,
    FOR4
};
// ----------------------------------------------------------------------------
// Base class for raw and compressed columns
template <typename T>
struct Column {
    using value_type = T;
    // The compression scheme
    CompressionType type;
    // Destructor
    virtual ~Column() {}
    // Constructor
    Column(CompressionType type) : type(type) {}
    // Calculate the size (virtual)
    virtual size_t getSize() const = 0;
};
// ---------------------------------------------------------------------------
// Column stored in raw format, i.e. vector<type>
template <typename T>
struct RawColumn : public Column<T> {
    // The raw values
    std::vector<T> column;

    // Constructor
    RawColumn(auto begin, auto end) : Column<T>(CompressionType::RAW) {
        column.reserve(end - begin);
        for (auto it = begin; it != end; ++it)
            column.emplace_back(*it);
    }
    // Destructor
    ~RawColumn() override = default;
    // Calculates the size
    size_t getSize() const override {
        return column.size() * sizeof(typename decltype(column)::value_type) + sizeof(RawColumn<T>);
    }
};
// ----------------------------------------------------------------------------
// Column compressed with frame-of-reference encoding
template <typename T, typename C>
requires std::is_unsigned_v<T>
struct FORCompressedColumn : public Column<T> {
    // The minimum value of the column
    T minValue;
    // The maximum value of the column (not necessary for FOR, but useful later on)
    T maxValue;
    // The FOR compressed values
    std::vector<C> compressed;

    // Constructor
    FORCompressedColumn(auto begin, auto end, T minValue, T maxValue, CompressionType type) : Column<T>(type), minValue(minValue), maxValue(maxValue) {
        compressed.reserve(end - begin);
        for (auto it = begin; it != end; ++it)
            compressed.emplace_back(*it - minValue);
    }
    // Destructor
    ~FORCompressedColumn() override = default;
    // Calculates the size
    size_t getSize() const override {
        return compressed.size() * sizeof(typename decltype(compressed)::value_type) + sizeof(FORCompressedColumn<T, C>);
    }
};
// ----------------------------------------------------------------------------
// Store the same column with compression
struct CompressedStructOfArrays {
    // The potentially compressed columns
    std::unique_ptr<Column<series>> id;
    std::unique_ptr<Column<decimal>> salary;
    std::unique_ptr<Column<std::array<char, 16>>> firstname;

    // Calculates the size
    size_t getSize() const {
        return id->getSize() + salary->getSize() + firstname->getSize() + sizeof(CompressedStructOfArrays);
    }
};
// ---------------------------------------------------------------------------
static constexpr auto createColumn(auto begin, auto end) -> std::unique_ptr<Column<typename decltype(begin)::value_type>>
// Compute the perfect compression scheme according to values
{
    using T = decltype(begin)::value_type;
    if constexpr (!std::is_unsigned_v<T>) {
        return std::make_unique<RawColumn<T>>(begin, end);
    } else {
        auto [minValue, maxValue] = std::ranges::minmax_element(begin, end);
        if (*maxValue - *minValue > std::numeric_limits<uint32_t>::max())
            return std::make_unique<RawColumn<T>>(begin, end);

        if (*maxValue - *minValue <= std::numeric_limits<uint8_t>::max()) {
            return std::make_unique<FORCompressedColumn<T, uint8_t>>(begin, end, *minValue, *maxValue, CompressionType::FOR1);
        } else if (maxValue - minValue <= std::numeric_limits<uint16_t>::max()) {
            return std::make_unique<FORCompressedColumn<T, uint16_t>>(begin, end, *minValue, *maxValue, CompressionType::FOR2);
        } else {
            return std::make_unique<FORCompressedColumn<T, uint32_t>>(begin, end, *minValue, *maxValue, CompressionType::FOR4);
        }
    }
}
// ----------------------------------------------------------------------------
// Store the same columns now in PAX format
struct PaxStore {
    // The blocks of compressed columns
    std::vector<CompressedStructOfArrays> pax;

    // Calculates the size
    size_t getSize() const {
        size_t size = sizeof(PaxStore);
        for (const auto& elem : pax)
            size += elem.getSize();
        return size;
    }
};
// ---------------------------------------------------------------------------
static PaxStore createPaxColumn(StructOfArrays& store, size_t paxSize)
// Compute the perfect compression scheme according to values
{
    PaxStore pax;
    pax.pax.reserve(store.id.size() / paxSize + 1);
    for (size_t i = 0; i < store.id.size(); i += paxSize) {
        CompressedStructOfArrays c;
        if (i + paxSize > store.id.size()) {
            c.id = createColumn(store.id.begin() + i, store.id.end());
            c.salary = createColumn(store.salary.begin() + i, store.salary.end());
            c.firstname = createColumn(store.firstname.begin() + i, store.firstname.end());
        } else {
            c.id = createColumn(store.id.begin() + i, store.id.begin() + i + paxSize);
            c.salary = createColumn(store.salary.begin() + i, store.salary.begin() + i + paxSize);
            c.firstname = createColumn(store.firstname.begin() + i, store.firstname.begin() + i + paxSize);
        }
        pax.pax.emplace_back(std::move(c));
    }
    return pax;
}
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const ArrayOfStruct& store)
// Computes the salary
{
    decimal salary = 0;
    // Iterate over the rows and select the salary
    for (const auto& r : store.table)
        salary += r.salary;
    return salary;
}
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const StructOfArrays& store)
// Computes the salary
{
    decimal salary = 0;
    // Iterate over the column
    for (auto s : store.salary)
        salary += s;
    return salary;
}
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const CompressedStructOfArrays& store)
// Computes the salary
{
    auto forComputation = []<typename T>(const T& column) {
        decimal salary = 0;
        for (auto s : column.compressed) {
            salary += s + column.minValue;
        }
        return salary;
    };

    using T = decltype(store.salary)::element_type::value_type;
    decimal salary = 0;
    // Iterate over the column
    switch (store.salary->type) {
        case CompressionType::RAW:
            {
                auto column = static_cast<RawColumn<T>*>(store.salary.get());
                for (auto s : column->column) {
                    salary += s;
                }
                break;
            }
        case CompressionType::FOR1:
            return forComputation(*static_cast<FORCompressedColumn<T, uint8_t>*>(store.salary.get()));
        case CompressionType::FOR2:
            return forComputation(*static_cast<FORCompressedColumn<T, uint16_t>*>(store.salary.get()));
        case CompressionType::FOR4:
            return forComputation(*static_cast<FORCompressedColumn<T, uint32_t>*>(store.salary.get()));
    }
    return salary;
}
// ----------------------------------------------------------------------------
static decimal getTotalSalary(const PaxStore& store)
// Computes the salary
{
    decimal salary = 0;
    // Iterate over the column
    for (const auto& p : store.pax)
        salary += getTotalSalary(p);
    return salary;
}
// ----------------------------------------------------------------------------
static void updateEntry(ArrayOfStruct& store, std::vector<size_t> ids, std::array<char, 16>& newName)
// Update entries
{
    for (auto id : ids) {
        auto& record = store.table[id];
        record.firstname = newName;
        record.salary *= 2;
    }
}
// ----------------------------------------------------------------------------
static void updateEntry(StructOfArrays& store, std::vector<size_t> ids, std::array<char, 16>& newName)
// Update entries
{
    for (auto id : ids) {
        store.firstname[id] = newName;
        store.salary[id] = 2 * store.salary[id];
    }
}
// ----------------------------------------------------------------------------
int main() {
    // Create 100M Users
    auto users = 100 * 1000 * 1000ull;
    std::array<char, 16> name = {"Moritz - Felipe"};
    // For both versions
    ArrayOfStruct aos;
    StructOfArrays soa;
    CompressedStructOfArrays compressed;
    PaxStore pax;
    // Fill the stores
    for (auto i = 0ull; i < users; i++) {
        aos.table.emplace_back(static_cast<unsigned>(i), (1000 + (static_cast<decimal>(i) % 500)) * 100, name);

        soa.id.push_back(static_cast<unsigned>(i));
        soa.salary.push_back((1000 + (static_cast<decimal>(i) % 500)) * 100);
        soa.firstname.emplace_back(name);
    }

    compressed.id = createColumn(soa.id.cbegin(), soa.id.cend());
    compressed.salary = createColumn(soa.salary.cbegin(), soa.salary.cend());
    compressed.firstname = createColumn(soa.firstname.cbegin(), soa.firstname.cend());

    pax = createPaxColumn(soa, std::numeric_limits<uint16_t>::max());

    // Computes the salary
    auto measureSalary = []<typename T>(const T& store) {
        // Start the timer
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        auto salary = getTotalSalary(store);
        // Stop the timer
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        // Determine store kind
        auto storeName = "compressed";
        if constexpr (std::is_same_v<ArrayOfStruct, T>)
            storeName = "aos";
        if constexpr (std::is_same_v<StructOfArrays, T>)
            storeName = "soa";
        if constexpr (std::is_same_v<PaxStore, T>)
            storeName = "pax";
        // Print results
        std::cout << "Workload: Analyze; Store: " << storeName << "; Time: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]; Size: " << store.getSize() / (1024 * 1024) << " [MB]" << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        return salary;
    };

    // Check correctness
    decimal salary = 0;
    if (salary = measureSalary(soa); salary != measureSalary(compressed))
        return -1;

    // Check correctness
    if (salary != measureSalary(pax))
        return -1;

    // Check correctness
    if (salary != measureSalary(aos))
        return -1;

    // -------------

    std::random_device rnd_device;
    std::mt19937 mt{rnd_device()}; // Generates random integers
    std::uniform_int_distribution<uint64_t> dist{0, users - 1};

    auto gen = [&]() {
        return dist(mt);
    };

    std::vector<uint64_t> ids(0.1 * users);
    std::generate(ids.begin(), ids.end(), gen);

    // Updates the column
    auto measureUpdate = []<typename T>(T& store, std::vector<uint64_t> ids) {
        // Update name and salary
        std::array<char, 16> name = {"Dr. Moritz - F."};

        // Start the timer
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        updateEntry(store, ids, name);
        // Stop the timer
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        // Determine store kind
        auto storeName = "soa";
        if constexpr (std::is_same_v<ArrayOfStruct, T>)
            storeName = "aos";
        // Print results
        std::cout << "Workload: Update; Store: " << storeName << "; Time: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;
    };

    measureUpdate(aos, ids);
    measureUpdate(soa, ids);

    return 0;
}
// ----------------------------------------------------------------------------

The Hidden Cost of Data Movement

How to Correctly Sum Up Numbers