Vectorized Overflow Checking
In our earlier post on overflow handling we explained how to use the CPU flags to detected integer overflows and wrote: “this doesn’t have zero overhead as the compiler can’t vectorize the function using checked arithmetic.” Not satisfied, we took matters into our own hands: If the compiler can’t help us, we have to help ourselves!
This post explains in very low-level detail how you can very quickly sum up
(signed) integers on modern x86 CPUs using specialized vector instructions such
as vpternlogd
. We
also show some numbers comparing the hand-written vectorized sum with the
compiler-assisted checked arithmetic that demonstrate that you can gain a lot
of performance even when checking for overflows.
If you are also interested in the higher-level discussion of why overflow checking matters, see our previous post on this topic!
Manually Detecting Signed Integer Overflow
The premise of this post is to do vectorized integer addition while correctly
detecting overflows. Scalar add
instructions set the overflow flag to
indicate overflows. Vector operations don’t, so we need to detect overflows
manually.
All relevant CPU architectures use two’s complement to represent signed integers. So, let’s look at some examples of the binary representation of some 8-bit integers when adding them:
a | b | a + b | |
---|---|---|---|
1 | 0000 1000 (=8) | 0010 0010 (=34) | 0010 1010 (=42) |
2 | 0100 0000 (=64) | 0100 0001 (=65) | 1000 0001 (=-127) ⚠️ |
3 | 0000 1010 (=10) | 1111 1011 (=-5) | 0000 0101 (=5) |
4 | 0000 0001 (=1) | 1111 0110 (=-10) | 1111 0111 (=-9) |
5 | 1111 1101 (=-3) | 0000 0101 (=5) | 0000 0010 (=2) |
6 | 1111 0100 (=-12) | 0000 0011 (=3) | 1111 0111 (=-9) |
7 | 1011 1111 (=-65) | 1011 1110 (=-66) | 0111 1101 (=125) ⚠️ |
8 | 1111 1111 (=-1) | 1111 1110 (=-2) | 1111 1101 (=-3) |
You can see that there are some cases where adding two numbers leads to unexpected results due to integer overflow. Here, it’s 64 + 65 which results in -127 (line 2) and -65 + -66 which results in 125 (line 7).
To understand how we can detect overflow, we need to find some pattern in the table which we can then try to detect. To find the pattern, let’s first reduce all numbers to just the most significant bit (MSB):
a | b | a + b | |
---|---|---|---|
1 | 0 | 0 | 0 |
2 | 0 | 0 | 1 ⚠️ |
3 | 0 | 1 | 0 |
4 | 0 | 1 | 1 |
5 | 1 | 0 | 0 |
6 | 1 | 0 | 1 |
7 | 1 | 1 | 0 ⚠️ |
8 | 1 | 1 | 1 |
In a two’s complement signed integer, the MSB is 0 if the number is positive (or zero) and 1 if the number is negative. So you could interpret the reduced table as follows: In line 2, we add two positive numbers (they have MSB 0), but we get a negative number (MSB 1) as a result. This can only happen if our addition had an overflow. Similarly, in line 7, we add two negative numbers (MSB 1), but get a positive number (MSB 0) as a result.
This holds for all integers, not just for the examples in the table! You can verify this claim by playing around with a few other numbers that could easily lead to overflow such as adding the two largest positive or the two smallest negative integers. So, you only need to look at the MSBs of the two inputs and the (potentially overflowing) result to determine whether the addition has overflowed.
You can even implement this check easily in a few lines of C++:
std::optional<int> addWithOverflow(int a, int b) {
// Only unsigned overflow is allowed in C++, so explicitly convert the
// arguments into unsigned integers. This doesn't change their bit
// representation as signed integers are guaranteed to use two's complement
// in C++.
unsigned aU = a;
unsigned bU = b;
unsigned rU = aU + bU;
unsigned aMSB = aU >> 31;
unsigned bMSB = bU >> 31;
unsigned rMSB = rU >> 31;
if (
(aMSB == 0 && bMSB == 0 && rMSB == 1) ||
(aMSB == 1 && bMSB == 1 && rMSB == 0)
) {
return std::nullopt;
}
return static_cast<int>(rU);
}
This allows you to implement overflow checks without hardware support for overflow flags, just by using bitwise integer arithmetic. Starting with this simple non-vectorized example, we can now try to vectorize it.
Vectorize Manual Overflow Checking
We now know how to check for overflows on individual integers using regular arithmetic. CPUs with vectorized instructions usually support integer arithmetic as well. Before we try to vectorize the overflow checks, let’s start with a vectorized loop to sum integers, which still ignores the vectorized overflow handling. We will use AVX512 vector registers, which are supported by modern x86 CPUs. Each AVX512 register, also called a ZMM register, has a size of 512 bits. Using ZMM registers, the CPU can perform 16 32-bit integer operations simultaneously:
std::optional<int> vectorizedSum(std::span<const int> values) {
constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
__m512i sumVector = _mm512_setzero_epi32();
size_t index = 0;
for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
auto vec = _mm512_loadu_epi32(&values[index]);
auto newSum = _mm512_add_epi32(sumVector, vec);
sumVector = newSum;
}
// Sum up the numbers within sumVector and the trailing values that don't
// fit in a ZMM register
int sum = _mm512_reduce_add_epi32(sumVector)
for (; index < values.size(); ++index)
sum += values[sum];
return sum;
}
In this code snippet, we start by initializing the vector variable sumVector
with all zeros. This vector will contain the intermediate sum and we will
update it in each loop iteration. Inside the loop, we first load the next
vector of values from the input into vec
and then add this vector to
sumVector
. Since we want to return a single integer, we have to add up all
the values in the vector at the end. Also, the number of values may not be an
exact multiple of the vector size, so we have to add up the trailing values as
well.
Bitwise Arithmetic for Overflow Checks
Now we need to add the overflow checks. In vectorized code, it is very
inefficient to check the MSB of each integer individually in an if-statement,
since we want to perform multiple operations at once. Instead, we want to find
bitwise arithmetic instructions that are semantically equivalent to the
if-statement in the addWithOverflow
because the CPU can do bitwise arithmetic
on vectors very efficiently.
To find appropriate arithmetic instructions, let’s look at the table above,
which shows all the MSBs again. If we consider a
, b
, and a + b
as three
binary inputs that each can be 0 or 1, and want to receive 1 as a result only
if we have an overflow, the table looks like this:
a | b | a + b | Overflow | |
---|---|---|---|---|
1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 1 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 1 | 0 |
5 | 1 | 0 | 0 | 0 |
6 | 1 | 0 | 1 | 0 |
7 | 1 | 1 | 0 | 1 |
8 | 1 | 1 | 1 | 0 |
We have also systematically ordered the rows in the table: The first row has
all zeros for the inputs a
, b
, and a + b
. The next row increments the
value by one, as if all three inputs were a three-bit integer. So the row after
001
becomes 010
, which is the result of adding 1 to 001
in binary.
Together with the output in the “Overflow” column, such a table is called a
truth table with three inputs.
A truth table can be used to encode any bitwise arithmetic operation. So the table gives us a starting point to find the appropriate vectorized instructions to check the MSBs, which helps us detect overflows.
Vectorized Boolean Truth Tables
To vectorize our overflow checks, we need to evaluate our above truth table for
multiple inputs at the same time. Fortunately, modern x86 CPUs with the
“avx512f” extension support a vectorized instruction that can compute the
result for any three inputs according to an arbitrary truth table. This
instruction is called
vpternlogd
and has
four operands: The first three operands are vector registers for the three
inputs and the fourth operand is an immediate operand that encodes the truth
table. You can read out the operand directly from the truth table by
transposing the first column and the Overflow column:
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
You can see the binary number 0b0100'0010
(or 0x42
) which is the immediate
representing this specific truth table. In a C++ program, you can use the
vpternlogd
instruction by calling the _mm512_ternarylogic_epi32
function.
For more efficiency and to optimize for the non-overflow case, we won’t shift
any bits around in the loop and check the overflow bits only at the end of the
loop. We also use the vpmovd2m
instruction which extracts the MSB of every
vector element (by calling the _mm512_movepi32_mask
function). The adjusted
loop looks like this:
std::optional<int> vectorizedSum(std::span<const int> values) {
constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
__m512i sumVector = _mm512_setzero_epi32();
__m512i hadOverflow = _mm512_setzero_epi32();
size_t index = 0;
for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
auto vec = _mm512_loadu_epi32(&values[index]);
auto newSum = _mm512_add_epi32(sumVector, vec);
// Check the addition for overflow using the encoded truth table
auto didOverflow = _mm512_ternarylogic_epi32(sumVector, vec, newSum, 0x42);
// Update the `hadOverflow` variable if any overflow bit is set
hadOverflow = _mm512_or_epi32(hadOverflow, didOverflow);
sumVector = newSum;
}
// Check if any of the overflow bits in the MSB positions are set
auto msbs = _mm512_movepi32_mask(hadOverflow);
if (_cvtmask16_u32(msbs) != 0)
return std::nullopt;
// Sum up the numbers within sumVector and the trailing values that don't
// fit in a ZMM register
// [...]
}
We now have an overflow checked vectorized loop that leaves us with the vector
variable sumVector
with the subresults and any trailing values that did not
fit into a vector.
Non-vectorized Sum of Trailing Values
The final step in our journey to vectorize integer addition with overflow
checks are essentially just post-processing of our vectorized output. Adding up
all the numbers in sumVector
and the trailing values must consider overflows,
as well. In our initial vectorized example, we used the
_mm512_reduce_add_epi32
function which sums up all 32-bit elements of a
vector register and returns the result. The compiler translates this function
call into a series of vector instructions without support for overflow checks.
To add overflow checks, we simply read each element of the vector individually
and use the checked arithmetic builtin function __builtin_add_overflow
.
Similarly, the list of integers that a caller passes to the function is not
always a multiple of the number of elements that fit into a vector. So we treat
the last few values individually as well. We know that a vector holds exactly
16 32-bit integers, and we can have at most 15 trailing values. So, regardless
of the number of inputs, these individual overflow checks are guaranteed to add
only constant overhead. The last few lines of the function look like this when
we add the overflow checks:
std::optional<int> vectorizedSum(std::span<const int> values) {
// [...]
// Sum up the numbers within sumVector and the trailing values that don't
// fit in a ZMM register
int sum = 0;
{
std::array<int, numValuesPerVector> vectorValues;
_mm512_store_epi32(vectorValues.data(), sumVector);
for (auto value : vectorValues) {
if (__builtin_add_overflow(sum, value, &sum))
return std::nullopt;
}
}
for (; index < values.size(); ++index) {
if (__builtin_add_overflow(sum, values[index], &sum))
return std::nullopt;
}
return sum;
}
Performance of Overflow-Safe Vectorized Sum
Theoretical Performance Analysis
Vectorized code can often perform much better than scalar alternatives. So,
let’s see how our vectorizedSum
should perform and then run some actual
microbenchmarks. In theory, we can analyze how many instructions we need to
execute for every loop iteration: In the vectorized loop we first load a vector
(_mm512_loadu_epi32
) and then do the addition (_mm512_add_epi32
). To check
for overflows, we additionally execute the vpternlog
instruction
(_mm512_ternarylogic_epi32
) and a bitwise or (_mm512_or_epi32
).
The rest of the function has a fixed runtime which is independent from the number of values that we want to sum up. So, we will disregard this in our analysis. In general, we only add two instruction per loop iteration to add overflow checks. Both are cheap bitwise arithmetic instructions, so their performance overhead should be very low.
Performance Measurements
We can evaluate this function by running it on real hardware. We wrote a short
benchmark program that adds 10 million integers. We compare the runtime to a
simple loop without any overflow checks which the compiler can vectorize very
easily. We also tested the runtime of a loop that uses the checked arithmetic
builtin __builtin_add_overflow
which the compiler doesn’t vectorize. On an
AMD Ryzen 9 7950X CPU, we see that the vectorized sum with overflow checks
takes 7% more time than the function without overflow checks (3.92 s for the
vectorized function with overflow checks vs. 3.66 s for the function without
checks). The checked arithmetic function is by far the worst performer: Its
execution time (15.92 s) is 4.4 times that of the function without checks and
4 times that of the vectorized sum with overflow checks. The following table
shows the full result of the benchmark program:
Name (* = baseline) | Dim | Total ms | ns/op | Baseline | Ops/second |
---|---|---|---|---|---|
noOverflowChecks * | 8 | 3.812 | 476528 | - | 2098.5 |
overflowChecks | 8 | 15.217 | 1902064 | 3.992 | 525.7 |
vectorizedSumWithOverflow | 8 | 4.052 | 506444 | 1.063 | 1974.6 |
noOverflowChecks * | 64 | 28.357 | 443079 | - | 2256.9 |
overflowChecks | 64 | 123.625 | 1931635 | 4.360 | 517.7 |
vectorizedSumWithOverflow | 64 | 30.216 | 472122 | 1.066 | 2118.1 |
noOverflowChecks * | 512 | 223.379 | 436287 | - | 2292.1 |
overflowChecks | 512 | 989.754 | 1933113 | 4.431 | 517.3 |
vectorizedSumWithOverflow | 512 | 235.819 | 460584 | 1.056 | 2171.2 |
noOverflowChecks * | 4096 | 1852.142 | 452183 | - | 2211.5 |
overflowChecks | 4096 | 7938.784 | 1938179 | 4.286 | 515.9 |
vectorizedSumWithOverflow | 4096 | 1959.753 | 478455 | 1.058 | 2090.1 |
noOverflowChecks * | 8192 | 3655.323 | 446206 | - | 2241.1 |
overflowChecks | 8192 | 15923.007 | 1943726 | 4.356 | 514.5 |
vectorizedSumWithOverflow | 8192 | 3918.893 | 478380 | 1.072 | 2090.4 |
If you want to try this out yourself, you can use the full program at the end
of this post and run the benchmarks yourself. We compiled the program using the
following command: g++ -std=c++20 -Wall -Wextra -march=znver4 -O3
.
This approach to adding numbers correctly not only requires a newer CPU, but it's so new that we haven't included it in CedarDB yet. Sign up for our waitlist if you want to be notified when you can try this in CedarDB!
Appendix
Benchmark Code: Vectorized Sums
Show Code
#include <array>
#include <optional>
#include <span>
#include <stdexcept>
#include <vector>
#include <immintrin.h>
#define PICOBENCH_IMPLEMENT_WITH_MAIN
#include "picobench/picobench.hpp"
int sumNoOverflowChecks(std::span<const int> numbers) {
int sum = 0;
for (auto i : numbers)
sum += i;
return sum;
}
int sumWithOverflow(std::span<const int> numbers) {
int sum = 0;
for (auto i : numbers) {
if (__builtin_add_overflow(sum, i, &sum))
throw std::overflow_error("sum overflow!");
}
return sum;
}
std::optional<int> vectorizedSum(std::span<const int> values) {
constexpr size_t numValuesPerVector = sizeof(__m512i) / sizeof(int);
__m512i sumVector = _mm512_setzero_epi32();
__m512i hadOverflow = _mm512_setzero_epi32();
size_t index = 0;
for (; index + numValuesPerVector <= values.size(); index += numValuesPerVector) {
auto vec = _mm512_loadu_epi32(&values[index]);
auto newSum = _mm512_add_epi32(sumVector, vec);
// Check the addition for overflow using the encoded truth table
auto didOverflow = _mm512_ternarylogic_epi32(sumVector, vec, newSum, 0x42);
// Update the `hadOverflow` variable if any overflow bit is set
hadOverflow = _mm512_or_epi32(hadOverflow, didOverflow);
sumVector = newSum;
}
// Check if any of the overflow bits in the MSB positions are set
auto msbs = _mm512_movepi32_mask(hadOverflow);
if (_cvtmask16_u32(msbs) != 0)
return std::nullopt;
// Sum up the numbers within sumVector and the trailing values that don't
// fit in a ZMM register
int sum = 0;
{
std::array<int, numValuesPerVector> vectorValues;
_mm512_store_epi32(vectorValues.data(), sumVector);
for (auto value : vectorValues) {
if (__builtin_add_overflow(sum, value, &sum))
return std::nullopt;
}
}
for (; index < values.size(); ++index) {
if (__builtin_add_overflow(sum, values[index], &sum))
return std::nullopt;
}
return sum;
}
static void noOverflowChecks(picobench::state& s) {
std::vector<int> ints;
ints.resize(10'000'000, 1);
picobench::scope scope(s);
int result = 0;
for ([[maybe_unused]] auto i : s) {
result = sumNoOverflowChecks(ints);
}
s.set_result(result);
}
PICOBENCH(noOverflowChecks);
static void overflowChecks(picobench::state& s) {
std::vector<int> ints;
ints.resize(10'000'000, 1);
picobench::scope scope(s);
int result = 0;
for ([[maybe_unused]] auto i : s) {
result = sumWithOverflow(ints);
}
s.set_result(result);
}
PICOBENCH(overflowChecks);
static void vectorizedSumWithOverflow(picobench::state& s) {
std::vector<int> ints;
ints.resize(10'000'000, 1);
picobench::scope scope(s);
int result = 0;
for ([[maybe_unused]] auto i : s) {
result = *vectorizedSum(ints);
}
s.set_result(result);
}
PICOBENCH(vectorizedSumWithOverflow);