December 24, 2024

Helping Christmas Elves Count Presents (or: Vectorized Overflow Checking)

In a previous post, we explained the importance of overflow checks when summing numbers, and mentioned that the usual approaches are not easily vectorized. Read here how to get 4x the performance when adding integers by using specialized vector instructions.

Moritz Sichert
Moritz Sichert
Moritz Sichert
Moritz Sichert
Helping Christmas Elves Count Presents (or: Vectorized Overflow Checking)

Vectorized Overflow Checking

In our earlier post on overflow handling we explained how to use the CPU flags to detected integer overflows and wrote: “this doesn’t have zero overhead as the compiler can’t vectorize the function using checked arithmetic.” Not satisfied, we took matters into our own hands: If the compiler can’t help us, we have to help ourselves!

This post explains in very low-level detail how you can very quickly sum up (signed) integers on modern x86 CPUs using specialized vector instructions such as vpternlogd. We also show some numbers comparing the hand-written vectorized sum with the compiler-assisted checked arithmetic that demonstrate that you can gain a lot of performance even when checking for overflows.

If you are also interested in the higher-level discussion of why overflow checking matters, see our previous post on this topic!

Manually Detecting Signed Integer Overflow

The premise of this post is to do vectorized integer addition while correctly detecting overflows. Scalar add instructions set the overflow flag to indicate overflows. Vector operations don’t, so we need to detect overflows manually.

All relevant CPU architectures use two’s complement to represent signed integers. So, let’s look at some examples of the binary representation of some 8-bit integers when adding them:

aba + b
10000 1000 (=8)0010 0010 (=34)0010 1010 (=42)
20100 0000 (=64)0100 0001 (=65)1000 0001 (=-127) ⚠️
30000 1010 (=10)1111 1011 (=-5)0000 0101 (=5)
40000 0001 (=1)1111 0110 (=-10)1111 0111 (=-9)
51111 1101 (=-3)0000 0101 (=5)0000 0010 (=2)
61111 0100 (=-12)0000 0011 (=3)1111 0111 (=-9)
71011 1111 (=-65)1011 1110 (=-66)0111 1101 (=125) ⚠️
81111 1111 (=-1)1111 1110 (=-2)1111 1101 (=-3)

You can see that there are some cases where adding two numbers leads to unexpected results due to integer overflow. Here, it’s 64 + 65 which results in -127 (line 2) and -65 + -66 which results in 125 (line 7).

To understand how we can detect overflow, we need to find some pattern in the table which we can then try to detect. To find the pattern, let’s first reduce all numbers to just the most significant bit (MSB):

aba + b
1000
2001 ⚠️
3010
4011
5100
6101
7110 ⚠️
8111

In a two’s complement signed integer, the MSB is 0 if the number is positive (or zero) and 1 if the number is negative. So you could interpret the reduced table as follows: In line 2, we add two positive numbers (they have MSB 0), but we get a negative number (MSB 1) as a result. This can only happen if our addition had an overflow. Similarly, in line 7, we add two negative numbers (MSB 1), but get a positive number (MSB 0) as a result.

This holds for all integers, not just for the examples in the table! You can verify this claim by playing around with a few other numbers that could easily lead to overflow such as adding the two largest positive or the two smallest negative integers. So, you only need to look at the MSBs of the two inputs and the (potentially overflowing) result to determine whether the addition has overflowed.

You can even implement this check easily in a few lines of C++:

std::optional<int> addWithOverflow(int a, int b) {
    // Only unsigned overflow is allowed in C++, so explicitly convert the
    // arguments into unsigned integers. This doesn't change their bit
    // representation as signed integers are guaranteed to use two's complement
    // in C++.
    unsigned aU = a;
    unsigned bU = b;
    unsigned rU = aU + bU;

    unsigned aMSB = aU >> 31;
    unsigned bMSB = bU >> 31;
    unsigned rMSB = rU >> 31;

    if (
        (aMSB == 0 && bMSB == 0 && rMSB == 1) ||
        (aMSB == 1 && bMSB == 1 && rMSB == 0)
    ) {
        return std::nullopt;
    }

    return static_cast<int>(rU);
}

This allows you to implement overflow checks without hardware support for overflow flags, just by using bitwise integer arithmetic. Starting with this simple non-vectorized example, we can now try to vectorize it.

Vectorize Manual Overflow Checking

We now know how to check for overflows on individual integers using regular arithmetic. CPUs with vectorized instructions usually support integer arithmetic as well. Before we try to vectorize the overflow checks, let’s start with a vectorized loop to sum integers, which still ignores the vectorized overflow handling. We will use AVX512 vector registers, which are supported by modern x86 CPUs. Each AVX512 register, also called a ZMM register, has a size of 512 bits. Using ZMM registers, the CPU can perform 16 32-bit integer operations simultaneously:

std::optional<int> vectorizedSum(std::span<const int> values) {
    constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
    __m512i sumVector = _mm512_setzero_epi32();

    size_t index = 0;
    for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
        auto vec = _mm512_loadu_epi32(&values[index]);
        auto newSum = _mm512_add_epi32(sumVector, vec);
        sumVector = newSum;
    }

    // Sum up the numbers within sumVector and the trailing values that don't
    // fit in a ZMM register

    int sum = _mm512_reduce_add_epi32(sumVector)
    for (; index < values.size(); ++index)
        sum += values[sum];

    return sum;
}

In this code snippet, we start by initializing the vector variable sumVector with all zeros. This vector will contain the intermediate sum and we will update it in each loop iteration. Inside the loop, we first load the next vector of values from the input into vec and then add this vector to sumVector. Since we want to return a single integer, we have to add up all the values in the vector at the end. Also, the number of values may not be an exact multiple of the vector size, so we have to add up the trailing values as well.

Bitwise Arithmetic for Overflow Checks

Now we need to add the overflow checks. In vectorized code, it is very inefficient to check the MSB of each integer individually in an if-statement, since we want to perform multiple operations at once. Instead, we want to find bitwise arithmetic instructions that are semantically equivalent to the if-statement in the addWithOverflow because the CPU can do bitwise arithmetic on vectors very efficiently.

To find appropriate arithmetic instructions, let’s look at the table above, which shows all the MSBs again. If we consider a, b, and a + b as three binary inputs that each can be 0 or 1, and want to receive 1 as a result only if we have an overflow, the table looks like this:

aba + bOverflow
10000
20011
30100
40110
51000
61010
71101
81110

We have also systematically ordered the rows in the table: The first row has all zeros for the inputs a, b, and a + b. The next row increments the value by one, as if all three inputs were a three-bit integer. So the row after 001 becomes 010, which is the result of adding 1 to 001 in binary. Together with the output in the “Overflow” column, such a table is called a truth table with three inputs.

A truth table can be used to encode any bitwise arithmetic operation. So the table gives us a starting point to find the appropriate vectorized instructions to check the MSBs, which helps us detect overflows.

Vectorized Boolean Truth Tables

To vectorize our overflow checks, we need to evaluate our above truth table for multiple inputs at the same time. Fortunately, modern x86 CPUs with the “avx512f” extension support a vectorized instruction that can compute the result for any three inputs according to an arbitrary truth table. This instruction is called vpternlogd and has four operands: The first three operands are vector registers for the three inputs and the fourth operand is an immediate operand that encodes the truth table. You can read out the operand directly from the truth table by transposing the first column and the Overflow column:

87654321
01000010

You can see the binary number 0b0100'0010 (or 0x42) which is the immediate representing this specific truth table. In a C++ program, you can use the vpternlogd instruction by calling the _mm512_ternarylogic_epi32 function. For more efficiency and to optimize for the non-overflow case, we won’t shift any bits around in the loop and check the overflow bits only at the end of the loop. We also use the vpmovd2m instruction which extracts the MSB of every vector element (by calling the _mm512_movepi32_mask function). The adjusted loop looks like this:

std::optional<int> vectorizedSum(std::span<const int> values) {
    constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
    __m512i sumVector = _mm512_setzero_epi32();
    __m512i hadOverflow = _mm512_setzero_epi32();

    size_t index = 0;
    for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
        auto vec = _mm512_loadu_epi32(&values[index]);
        auto newSum = _mm512_add_epi32(sumVector, vec);

        // Check the addition for overflow using the encoded truth table
        auto didOverflow = _mm512_ternarylogic_epi32(sumVector, vec, newSum, 0x42);

        // Update the `hadOverflow` variable if any overflow bit is set
        hadOverflow = _mm512_or_epi32(hadOverflow, didOverflow);

        sumVector = newSum;
    }

    // Check if any of the overflow bits in the MSB positions are set
    auto msbs = _mm512_movepi32_mask(hadOverflow);
    if (_cvtmask16_u32(msbs) != 0)
        return std::nullopt;

    // Sum up the numbers within sumVector and the trailing values that don't
    // fit in a ZMM register

    // [...]
}

We now have an overflow checked vectorized loop that leaves us with the vector variable sumVector with the subresults and any trailing values that did not fit into a vector.

Non-vectorized Sum of Trailing Values

The final step in our journey to vectorize integer addition with overflow checks are essentially just post-processing of our vectorized output. Adding up all the numbers in sumVector and the trailing values must consider overflows, as well. In our initial vectorized example, we used the _mm512_reduce_add_epi32 function which sums up all 32-bit elements of a vector register and returns the result. The compiler translates this function call into a series of vector instructions without support for overflow checks.

To add overflow checks, we simply read each element of the vector individually and use the checked arithmetic builtin function __builtin_add_overflow. Similarly, the list of integers that a caller passes to the function is not always a multiple of the number of elements that fit into a vector. So we treat the last few values individually as well. We know that a vector holds exactly 16 32-bit integers, and we can have at most 15 trailing values. So, regardless of the number of inputs, these individual overflow checks are guaranteed to add only constant overhead. The last few lines of the function look like this when we add the overflow checks:

std::optional<int> vectorizedSum(std::span<const int> values) {
    // [...]

    // Sum up the numbers within sumVector and the trailing values that don't
    // fit in a ZMM register

    int sum = 0;
    {
        std::array<int, numValuesPerVector> vectorValues;
        _mm512_store_epi32(vectorValues.data(), sumVector);
        for (auto value : vectorValues) {
            if (__builtin_add_overflow(sum, value, &sum))
                return std::nullopt;
        }
    }

    for (; index < values.size(); ++index) {
        if (__builtin_add_overflow(sum, values[index], &sum))
            return std::nullopt;
    }

    return sum;
}

Performance of Overflow-Safe Vectorized Sum

In general, writing a vectorized function by hand is something you should usually avoid. Not only are functions with many vector intrinsics harder to read than a non-vectorized alternative, but the intrinsics are also usually hardware dependent. For example, the above function can only be executed on x86 CPUs with the avx512f extension. Many recent AMD or Intel CPUs have this extension, but older CPUs, especially those released before 2016, do not. Unfortunately, even modern Intel CPUs sometimes come without avx512f support, especially low-power CPUs. For a complete list, see Wikipedia.

Theoretical Performance Analysis

Vectorized code can often perform much better than scalar alternatives. So, let’s see how our vectorizedSum should perform and then run some actual microbenchmarks. In theory, we can analyze how many instructions we need to execute for every loop iteration: In the vectorized loop we first load a vector (_mm512_loadu_epi32) and then do the addition (_mm512_add_epi32). To check for overflows, we additionally execute the vpternlog instruction (_mm512_ternarylogic_epi32) and a bitwise or (_mm512_or_epi32).

The rest of the function has a fixed runtime which is independent from the number of values that we want to sum up. So, we will disregard this in our analysis. In general, we only add two instruction per loop iteration to add overflow checks. Both are cheap bitwise arithmetic instructions, so their performance overhead should be very low.

Performance Measurements

We can evaluate this function by running it on real hardware. We wrote a short benchmark program that adds 10 million integers. We compare the runtime to a simple loop without any overflow checks which the compiler can vectorize very easily. We also tested the runtime of a loop that uses the checked arithmetic builtin __builtin_add_overflow which the compiler doesn’t vectorize. On an AMD Ryzen 9 7950X CPU, we see that the vectorized sum with overflow checks takes 7% more time than the function without overflow checks (3.92 s for the vectorized function with overflow checks vs. 3.66 s for the function without checks). The checked arithmetic function is by far the worst performer: Its execution time (15.92 s) is 4.4 times that of the function without checks and 4 times that of the vectorized sum with overflow checks. The following table shows the full result of the benchmark program:

Name (* = baseline)DimTotal msns/opBaselineOps/second
noOverflowChecks *83.812476528-2098.5
overflowChecks815.21719020643.992525.7
vectorizedSumWithOverflow84.0525064441.0631974.6
noOverflowChecks *6428.357443079-2256.9
overflowChecks64123.62519316354.360517.7
vectorizedSumWithOverflow6430.2164721221.0662118.1
noOverflowChecks *512223.379436287-2292.1
overflowChecks512989.75419331134.431517.3
vectorizedSumWithOverflow512235.8194605841.0562171.2
noOverflowChecks *40961852.142452183-2211.5
overflowChecks40967938.78419381794.286515.9
vectorizedSumWithOverflow40961959.7534784551.0582090.1
noOverflowChecks *81923655.323446206-2241.1
overflowChecks819215923.00719437264.356514.5
vectorizedSumWithOverflow81923918.8934783801.0722090.4

If you want to try this out yourself, you can use the full program at the end of this post and run the benchmarks yourself. We compiled the program using the following command: g++ -std=c++20 -Wall -Wextra -march=znver4 -O3.


This approach to adding numbers correctly not only requires a newer CPU, but it's so new that we haven't included it in CedarDB yet. Sign up for our waitlist if you want to be notified when you can try this in CedarDB!
Join our waitlist!

Appendix

Benchmark Code: Vectorized Sums

overflow_benchmark.cpp

Show Code
#include <array>
#include <optional>
#include <span>
#include <stdexcept>
#include <vector>
#include <immintrin.h>

#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench/picobench.hpp"

int sumNoOverflowChecks(std::span<const int> numbers) {
   int sum = 0;
   for (auto i : numbers)
      sum += i;
   return sum;
}

int sumWithOverflow(std::span<const int> numbers) {
   int sum = 0;
   for (auto i : numbers) {
      if (__builtin_add_overflow(sum, i, &sum))
         throw std::overflow_error("sum overflow!");
   }
   return sum;
}

std::optional<int> vectorizedSum(std::span<const int> values) {
    constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
    __m512i sumVector = _mm512_setzero_epi32();
    __m512i hadOverflow = _mm512_setzero_epi32();

    size_t index = 0;
    for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
        auto vec = _mm512_loadu_epi32(&values[index]);
        auto newSum = _mm512_add_epi32(sumVector, vec);

        // Check the addition for overflow using the encoded truth table
        auto didOverflow = _mm512_ternarylogic_epi32(sumVector, vec, newSum, 0x42);

        // Update the `hadOverflow` variable if any overflow bit is set
        hadOverflow = _mm512_or_epi32(hadOverflow, didOverflow);

        sumVector = newSum;
    }

    // Check if any of the overflow bits in the MSB positions are set
    auto msbs = _mm512_movepi32_mask(hadOverflow);
    if (_cvtmask16_u32(msbs) != 0)
        return std::nullopt;

    // Sum up the numbers within sumVector and the trailing values that don't
    // fit in a ZMM register

    int sum = 0;
    {
        std::array<int, numValuesPerVector> vectorValues;
        _mm512_store_epi32(vectorValues.data(), sumVector);
        for (auto value : vectorValues) {
            if (__builtin_add_overflow(sum, value, &sum))
                return std::nullopt;
        }
    }

    for (; index < values.size(); ++index) {
        if (__builtin_add_overflow(sum, values[index], &sum))
            return std::nullopt;
    }

    return sum;
}

static void noOverflowChecks(picobench::state& s) {
   std::vector<int> ints;
   ints.resize(10'000'000, 1);

   picobench::scope scope(s);
   int result = 0;
   for ([[maybe_unused]] auto i : s) {
      result = sumNoOverflowChecks(ints);
   }
   s.set_result(result);
}
PICOBENCH(noOverflowChecks);

static void overflowChecks(picobench::state& s) {
   std::vector<int> ints;
   ints.resize(10'000'000, 1);

   picobench::scope scope(s);
   int result = 0;
   for ([[maybe_unused]] auto i : s) {
      result = sumWithOverflow(ints);
   }
   s.set_result(result);
}
PICOBENCH(overflowChecks);

static void vectorizedSumWithOverflow(picobench::state& s) {
   std::vector<int> ints;
   ints.resize(10'000'000, 1);

   picobench::scope scope(s);
   int result = 0;
   for ([[maybe_unused]] auto i : s) {
      result = *vectorizedSum(ints);
   }
   s.set_result(result);
}
PICOBENCH(vectorizedSumWithOverflow);