Very few consider C++ attractive, and hardly anyone thinks it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today machines can process hundreds of Gigabytes per second, and we, as developers, should all learn to saturate those capabilities. So let’s look into a few simple code snippets and familiarize ourselves with Google Benchmark (GB) - the most famous library in the space. Here is the plan:
- When A Nanosecond Is Too Long!
- From Nanoseconds to Microseconds
- Bigger Benchmarks
For the impatient ones, here is the source code on GitHub.
That repo also has the results for AMD EPYC 7302 and Threadripper PRO 3995WX CPUs with
When A Nanosecond Is Too Long
Every benchmark has the following structure. You define the operation, but GB decides how many times to run it.
Compile it with
-O3, run, result: 0 ns.
Thanks for attention, that’s all 😅
Unfortunately, no operation runs this fast on the computer. On a 3 GHz CPU, you would perform 3 Billion ops every second. So each would take 0.33 ns, not 0 ns. The compiler just optimized everything out.
Randomizing The Inputs
What will happen if we mutate the addition arguments between loop iterations?
25ns, looks better than zero to me, but something doesn’t add up. If the addition is a single hardware instruction, why did it take ~100 cycles? There are BMI2 instructions with such runtimes on AMD Zen and Zen 2 chips, but not the addition.
Here is the pattern we often use in benchmarking.
It’s definitely not the cleanest approach for something as small as addition, but good enough to make a point.
Initialize randomly before evaluation, then mutate.
Your mutations can be regular and predictable.
They just shouldn’t be expensive.
We also apply the
DoNotOptimize function to force the compilation of that useless line.
This run took 0.7ns per iteration or around 2 cycles.
The first cycle was spent incrementing
b on different ALUs of the same core, while the second was performed the final accumulation.
Now let’s run those benchmarks on 8 threads.
std::rand powered function took 12'000 ns, while our latest variant remained the same.
Like many other
libc functions, it depends on the global state and uses mutexes to synchronize concurrent access to global memory.
Here is its source in glibc:
Doing Some Math
Let’s say you want to compute the sine of a floating-point number.
The standard has
std::sin for that, but there is a different way.
We can approximate the result with the Taylor-Maclaurin series, taking the first three components of the expansion.
Pretty simple, right? Here is the result:
- 51 ns for the
- 27 ns for the Maclaurin series with
Can we do better?
Of course, we can.
Raising to a power
x is a generic operation.
In other words, likely slow.
Let’s unfold the expression manually:
Result: 2.4 ns, or 10x improvement! The compiler will take care of redundant operations, but he cannot do permutations.
Both addition and multiplication are associative for real numbers in math but not in programming. The order of operands will affect the result, so the compilers have their hands in cuffs. Let’s set them free!
If you are not familiar with the modern C++ attributes, here is the older GCC version:
Here we advise GCC to apply extra flags when compiling the scope of this specific function.
This is one of my favourite tricks for rapid benchmarking without having to change
Result: 0.85 ns.
For those interested, GCC has a separate manual for the full list of available floating-point math optimizations.
-ffast-math is a shortcut for
-funsafe-math-optimizations, which is a shortcut for
In our specific case, with that many chained multiplications, the ordering may change under/overflow behavior, as stated in the “Transformations” section.
We have just accelerated a floating-point trigonometric function by a factor of 60x from 51 ns to less than a nanosecond! Can we do the same for integer division, the slowest integer arithmetical operation? To a certain extent, yes.
This variant runs in 3.3 ns, or 10x longer than addition. But what if we divide by a constant, not a mutating value? In that case, we just have to make sure that the compiler knows that our denominator remains constant!
The above functions only differ in that the first one is involved in something shady -
The compiler fails to trace the origins of
money, so it can’t replace it with shifts and multiplications.
How big is the difference: 3.3 ns vs 0.5 ns!
Most of the examples in this article are abstract, but this actually happened to me a couple of months ago.
Compilers have special intrinsics, like
Those are high-performance routines shipped with the compiler and aren’t portable.
Those are different from Intel or ARM intrinsics, hardware-specific and compiler-agnostic.
These are hardware-agnostic and compiler-specific.
I forgot to pass the
-mpopcnt flag in the first variant, and the compiler silently continued.
When you want to be specific - explicitly trigger the
The cost of this mistake was: 1.9 ns and 0.3 ns, or over 6x!
Compute may be expensive, but memory accesses always are! The more you miss your caches, the more you waste time!
Let’s illustrate it by creating a cache-aligned array with 32x
That means 2x 64-byte cache lines worth of content.
Now we run two benchmarks.
Both perform the same logical operations - 8x
float additions and 8x stores.
But the first took 8 ns and the seond took 4 ns.
Our benchmark ends up being dominated by the cost of the split-load, or the lack of memory alignment, in other words.
From Nanoseconds to Microseconds
Let’s gradually grow our operations in size. What’s wrong with the following benchmark?
We are benchmarking the
std::sort, the most used
Array sizes are fixed at 3 and 4 elements in different benchmarks.
Each runs twice:
std::reversepreprocessing time included.
std::reversepreprocessing time excluded.
One took 227 ns and another took 9 ns. Which is which?
The Cost of Timing in Google Benchmark
If we think about this, branches always evaluate identically, so the only problems might be coming from
state.*Timing() functions or the reversal itself.
Anyways, if you have read the title of this section, it shouldn’t be a surprise.
This took 213 ns. Those things look tiny when processing gigabytes, but you will run your benchmark on a small input one day. Plan ahead to extract unaffected asymptotic curve as we will do later.
Failing To Make A Point, Again
Branching is also critical. We avoid it on hot paths and design branchless vectorizable procedures. It’s time-consuming but makes further porting to SIMD assembly much more manageable.
Unfortunately, it’s tough to showcase its cost in a five-line code snippet. Every simple snippet like this completes in just around 2 ns. There are still papers like this regularly published on dynamic branch prediction, so I am looking for someone to contribute a better example for the cost of branching!
Even though I couldn’t show the cost of
if before, I still recommend to separate compile-time and run-time logic.
It just makes sense.
GB comes with all kinds of bells and whistles, like complexity analysis.
Those come handy when you analyze scaling.
Most of us assume that
std::sort uses Quick Sort, at least for big inputs.
But what if the input is ginormous?
There is the Parallel STL! In GCC, those procedures are directed towards Intel TBB, which will use many threads to sort your numbers. The underlying algorithm would almost certainly be a linear-time Radix sort on GPUs. But whats the complexity of TBBs implementation?
This short benchmark logs more information than just the runtime per iteration.
Aside from the built-in
SetBytesProcessed you can add anything else:
I then run this benchmark with a couple of different policies and on numerous sizes:
Which would yield results like this:
So GB took the heavy burden of fitting and comparing our results against the suggested complexity curve.
From those experiments,
NLogN fits best.
Another thing to note here is the missing
Without it the “CPU Time” is used by default.
If you were to sleep your process, the “CPU Time” would stop growing.
GBs functionality list is extensive, and many parts of it are little-known:
- Random Interleaving with
- Hardware Performance Counters via
- Comparing with previous results with
It’s impossible to cover the depth of performance engineering and complexity of modern software/hardware with just one tool. Luckily, they have excellent single-page user guide, and all the codes from this article are also available on GitHub. Feel free to fork and extend them!
The other essential tools to know are
valgrind in their ascending complexity.
We were just focusing on runtime speed today, that’s easy.
While memory usage or hardware stalls tracking is complicated yet equally crucial for evaluating software.
Especially if it claims to be the fastest Database Management Software ever designed 🤓
So expect deeper articles on that topic very soon!