January 29, 2025

Why Trees Without Branches Grow Faster: The Case for Reducing Branches in Code

In the same way that arborists remove branches from trees to ensure healthy and desirable tree growth, it can also be beneficial to remove branches in software. We claim that pruning branches is a good thing in some of our blog posts, but we never got around to explaining why. In this post, we will rectify that and explore why, although branches are essential to software, it is a good idea to reduce them where possible to increase CPU efficiency.

What are Branches in Code?

To understand why we want to reduce the number of branches in code so badly, we first need to understand what branches are. In a nutshell, branches are control flow. Whenever you want to decide what to do next in a program based on variables or input, you need a branch instruction. Without branches, any program could only ever perform data-independent operations. For example, adding two integers is branch-free. Getting the minimum, on the other hand, requires a branch to decide whether to return the first or the second value.

Branching is an essential part of any program, so branch instructions are unavoidable. However, they also add unpredictability that CPUs handle poorly, which is why compilers try to minimize branching where they can.

Compilers can usually avoid explicit branch instructions for simple cases like the minimum calculation above, e.g. by replacing a potential branch with a flag-dependent instruction like cmov. In larger programs, however, you may want to do something completely different depending on a variable, e.g. call a different function, thereby completely changing the paths you take in your code. And even though on the machine code level everything is just an either/or decision, a switch statement, for example, will result in potentially hundreds of either/or cases. Such complex cases cannot be handled by the compiler and result in true branch instructions, such as conditional jumps. These are costly, not because of the cost of the instructions, but because of the resulting unpredictability during execution, which interferes with optimizations made by the CPU.

CPU Optimizations During Execution

CPUs are much more than just calculators. They are made up of many sub-components that perform specific tasks, such as an arithmetic-logic unit (ALU) to perform calculations, a control unit to load and decode instructions, and a memory unit to move data between storage, memory, caches and registers. To make the most efficient use of these components, CPUs themselves perform optimizations that are only possible during execution. Intel’s manual for optimizing for their CPU architectures is several hundred pages long, so we will only take a high-level look at the optimizations most affected by branches.

Instruction Pipelining

One of the most important optimizations in a CPU is instruction pipelining. The goal of instruction pipelining is to improve the utilization of the task-specific CPU components by keeping all of them busy with the next task even before the previous one is finished. Conceptually, each operation in the program, such as adding two integers, goes through the same 5 logical steps of the von Neumann cycle.

  1. IF: Fetch the instruction ("I want to do an add")
  2. ID: Decode the instruction (add instruction)
  3. MM: Fetch the operands (load the integers into registers if they are not already there)
  4. EX: Execute the instruction (add the two integers)
  5. WB: Write the result back into memory

If the processor were to process only one instruction at a time, only one of the 5 steps would take place per cycle, leaving 4 components idle. Reality is even more complex, with Intel Golden Cove offering 12 execution ports, each capable of performing one or more of these steps.

To make the most of their components, modern CPUs pipeline their execution. So while the first instruction is being decoded, the second instruction is already being fetched. By the time the result of the first instruction is written back to memory, the second instruction has been executed, the third has fetched its operands, the fourth has been decoded, and the fifth has been fetched, as shown in the figure below. This allows the CPU to complete one full instruction per cycle, i.e. an IPC (instructions per cycle) of 1 instead of 1/5, as outlined with a saturated pipeline below. With additional dark magic, you can even get an IPC of 6 out of modern CPUs.

CPU Pipelining Example with Pipeline Hazard

One caveat to instruction pipelining is pipeline hazards. We will focus just on the most relevant ones for branches: control hazards. If the CPU can execute one instruction after another, pipelining works fine. But what happens if the next instruction to be executed depends on the previous one?

In such cases, the CPU cannot simply preload the next instruction. In our example, the result of Instruction 7 (in Slot 1) is necessary to know which instruction to fetch next. Thus, we have to wait for the result of Instruction 7, which leaves Slot 2 empty until Step 9. Only then do we know which instruction comes next (Instruction 8) and can load it. Since the CPU had to wait for three cycles, the pipeline was stalled. In theory, this happens every time a branch is taken. Hence, each branch in the code can greatly reduce the IPC simply because the CPU is unable to pipeline instructions efficiently.

Branch Prediction

Because branches are such a crucial part of all software, their existence is hardly a surprise to CPU designers. The most common way to avoid breaking the execution pipeline is branch prediction. This is where the CPU tries to guess which branch will be taken, e.g. which value will be smaller at the minimum, and starts executing the code it thinks will come next. This behavior is called speculative execution, and you might be familiar with the security issues that can result from doing this carelessly.

Finding the minimum of two unknown values is always a 50/50 guess as to whether the first input is the minimum, so when the CPU guesses wrong, it has done a lot of wasted work and the pipeline breaks anyway. However, many branches are much more predictable than the minimum of two random values.

Each pass through a loop contains at least one branch that decides whether there will be another pass through the loop, or whether the loop is finished. And in all but one case there will be another pass. Simply assuming that there will be another pass through the loop is one of the simplest but most effective ways of predicting branches. Of course, CPUs will do much more than just assume that each loop will be repeated, e.g. by keeping track of the past behavior of each specific branch and combining multiple metrics for these decisions. But that goes too deep for this blog post. The important point is that branches are expensive enough that CPUs will go to great lengths to work around them. But of course, even the best branch predictor is sometimes wrong, resulting in a lot of wasted work and extra work to refill the instruction pipeline.

How to Reduce Branches

It is obvious that a lot of branches in code cannot be removed. For example, every loop will always need a branch to check if it is done. However, there are cases where branches in code can be removed without impacting the logic.

Working with Boolean Values

One way to reduce branches is to avoid them in the first place by using boolean values in arithmetic expressions instead of conditions. Where this is possible depends heavily on the context, so I will use one of my favorite examples from the MonetDB/X100 paper to show the idea. I will adapt it slightly while keeping the same optimizations as the original.

Suppose you receive an array of integer values and are tasked with selecting only the negative values while removing all others. A naive implementation would look like this:

#include <vector>
using std::vector;

void filterNegative(vector<int>& values) {
  size_t j = 0;
  for (size_t i = 0; i < values.size(); i++)
    if (values[i] < 0) values[j++] = values[i];
  values.resize(j);
}

We keep track of a read offset i and a write offset j. We only write values less than 0, increasing the write offset each time. However, we have a branch in the body of the loop because we treat the cases differently depending on the current value. We can also get the same result without a branch in the loop body:

void filterNegativeOpt(vector<int>& values) {
  size_t j = 0;
  for (size_t i = 0; i < values.size(); i++) {
    values[j] = values[i];
    j += values[i] < 0;
  }
  values.resize(j);
}

Here we always write the currently read value to the write position, whether it is negative or not, and only increase the write position if we have written a negative value. As we always run the same code, regardless of whether the read value was negative or not, we can avoid the branch altogether. To achieve this, we treat the boolean not as a conditional jump flag like above, but as an integer to increase the write offset by.

You can immediately see the advantage of this by looking at the execution statistics of both methods for 1 billion random integers with a 50/50 split of positive and negative values:

                    Execution Time   IPC    Branch-Misses   L1-Misses   LLC-Misses
filterNegative      13.17 sec        1.48   650,610         253,600     37,470
filterNegativeOpt    8.21 sec        2.98   149,050         251,550     39,170

The branch-free version is roughly 60% faster (8.21 s vs. 13.17 s) for exactly the same task. Contrary to the speedups observed in our blog post on data movement, the improvement here is not due to cache misses, which are almost identical for both versions. What is different are the branch misses and the IPC. The optimized version incurs only about 150,000 branch misses per million records, while the version with the if-branch incurs about 650,000. The resulting pipeline breaks are also visible in the IPC, where the branching version reaches only 1.48, compared to 2.98 for the branch-free version.

Generating Code

The final optimization I want to discuss is the most involved one. If you have repeating patterns, it is often possible to make decisions once and hard code them into the program. We programmers probably do this every day without even realizing it. If you’re parsing a clean JSON input array of integers, you probably don’t write code to handle strings and floats. But what if you don’t know what data types the input has, and the user doesn’t tell you until runtime?

In such cases you have two options. One is to look at the data type specified by the user, then select the appropriate parser and apply it to each input. The other is to defer writing your code until you know which data type you will encounter, and generate code just for that, avoiding all the branches that result from decisions you can make once but would otherwise have to execute many times. For an example of this size, the second option seems crazy, but at the scale of a database system it pays off, because many such small decisions add up quickly.

// Generic branching approach, supports e.g.:
// select sum(i) from foo (integer i)
// select max(n) from bar (numeric n)
int agg = 0;
for (auto& val : a.values) {
	switch(type) {
		case Int:
			switch(op) {
				case SUM: agg += val; break;
				case MAX: agg = max(agg, val); break;
			}
			break;
		case Numeric:
			switch(op) {
				case SUM: agg = numericAdd(agg, val); break;
				case MAX: agg = numericMax(agg, val); break;
			}
			break;
	}
}
Code supporting different type and aggregate combinations
// Codegen, generating specific code for each combination
// Generated code for select sum(i) from foo
int agg = 0;
for (auto& val : a.values) {
	agg += val;
}

// Generated code for select max(n) from bar
int agg = 0;
for (auto& val : a.values) {
	agg = numericMax(agg, val);
}
Code generated for only the required combinations

Even the simple example of an ungrouped aggregation in SQL requires a nested switch statement to handle different aggregate-to-type matchings, even though for one specific query the aggregate function and value type will never change. Of course, the CPU branch predictor will also notice that the same path is taken each time. However, generated code is still much faster.

Going to such great lengths to avoid branches is probably not necessary for most projects, but it is one of the components that make CedarDB so fast for complex analytics. And the blog post discussing this was one of the things that motivated me to write this post, so I wanted to come full circle.

Summary

In this post we discovered that while branches in code are essential for doing anything useful, they can be a real pain for the CPU during execution. Beyond relying on your CPU's branch predictor, we looked at steps you can take in everyday coding to reduce branching. While there are no simple patterns that can be applied everywhere, there are plenty of opportunities if you are willing to go the extra mile. We have seen that working with boolean values in unconventional ways can make code roughly 60% faster compared to using the same boolean in an if.

If you have complex decisions that need to be made once and then applied often at runtime, for example if you are building a database system that can handle the most complex SQL queries like we do, it is even worth generating code to reduce branching.

If you don’t want to miss future blog posts, sign up for our newsletter below!


Appendix

Optimized: 
Result: 497519210
Stats: 
time sec,   cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,   scale,  IPC, CPUs,  GHz 
    8.21, 30203.80, 6187.79,     89893.77,    251.55,      39.17,        149.05,    8211.13, 1000000, 2.98, 1.00, 3.68 

Unoptimized: 
Result: 497519210
Stats: 
time sec,   cycles, kcycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,   scale,  IPC, CPUs,  GHz 
   13.17, 48439.09, 6269.09,     71858.57,    253.60,      37.47,        650.61,   13164.33, 1000000, 1.48, 1.00, 3.68