Very few consider C++ attractive, and hardly anyone thinks it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today machines can process hundreds of Gigabytes per second, and we, as developers, should all learn to saturate those capabilities. So let’s look into a few simple code snippets and familiarize ourselves with Google Benchmark (GB) - the most famous library in the space. Here is the plan:

For the impatient ones, here is the source code on GitHub. That repo also has the results for AMD EPYC 7302 and Threadripper PRO 3995WX CPUs with -O1 and -O3 builds.

When A Nanosecond Is Too Long

Every benchmark has the following structure. You define the operation, but GB decides how many times to run it.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#include <benchmark/benchmark.h>

namespace bm = benchmark;

static void i32_addition(bm::State &state) {
    int32_t a, b, c;
    for (auto _ : state)
        c = a + b; 
}

BENCHMARK(i32_addition);

Compile it with -O3, run, result: 0 ns. Thanks for attention, that’s all 馃槄

Unfortunately, no operation runs this fast on the computer. On a 3 GHz CPU, you would perform 3 Billion ops every second. So each would take 0.33 ns, not 0 ns. The compiler just optimized everything out.

Randomizing The Inputs

What will happen if we mutate the addition arguments between loop iterations?

1
2
3
4
5
static void i32_addition_random(bm::State &state) {
    int32_t c = 0;
    for (auto _ : state)
        c = std::rand() + std::rand();
}

25ns, looks better than zero to me, but something doesn’t add up. If the addition is a single hardware instruction, why did it take ~100 cycles? There are BMI2 instructions with such runtimes on AMD Zen and Zen 2 chips, but not the addition.

1
2
3
4
5
static void i32_addition_semirandom(bm::State &state) {
    int32_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) + (++b));
}

Here is the pattern we often use in benchmarking. It’s definitely not the cleanest approach for something as small as addition, but good enough to make a point. Initialize randomly before evaluation, then mutate. Your mutations can be regular and predictable. They just shouldn’t be expensive. We also apply the DoNotOptimize function to force the compilation of that useless line.

This run took 0.7ns per iteration or around 2 cycles. The first cycle was spent incrementing a and b on different ALUs of the same core, while the second was performed the final accumulation.

1
2
BENCHMARK(i32_addition_random)->Threads(8);
BENCHMARK(i32_addition_semirandom)->Threads(8);

Now let’s run those benchmarks on 8 threads. The std::rand powered function took 12'000 ns, while our latest variant remained the same. Like many other libc functions, it depends on the global state and uses mutexes to synchronize concurrent access to global memory. Here is its source in glibc:

1
2
3
4
5
6
7
8
long int __random (void) {
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
weak_alias (__random, random)

Doing Some Math

Let’s say you want to compute the sine of a floating-point number. The standard has std::sin for that, but there is a different way. We can approximate the result with the Taylor-Maclaurin series, taking the first three components of the expansion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
static void f64_sin(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state)
        bm::DoNotOptimize(result = std::sin(argument += 1.0));
}

static void f64_sin_maclaurin(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - std::pow(argument, 3) / 6 + std::pow(argument, 5) / 120;
        bm::DoNotOptimize(result);
    }
}

Pretty simple, right? Here is the result:

  • 51 ns for the std::sin.
  • 27 ns for the Maclaurin series with std::pow.

Can we do better? Of course, we can. Raising to a power x is a generic operation. In other words, likely slow. Let’s unfold the expression manually:

1
2
3
4
5
6
7
8
9
static void f64_sin_maclaurin_powless(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - (argument * argument * argument) / 6.0 +
                 (argument * argument * argument * argument * argument) / 120.0;
        bm::DoNotOptimize(result);
    }
}

Result: 2.4 ns, or 10x improvement! The compiler will take care of redundant operations, but he cannot do permutations.

GCC Handcuffed by IEEE 754

Both addition and multiplication are associative for real numbers in math but not in programming. The order of operands will affect the result, so the compilers have their hands in cuffs. Let’s set them free!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
[[gnu::optimize("-ffast-math")]]
static void f64_sin_maclaurin_with_fast_math(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - (argument * argument * argument) / 6.0 +
                 (argument * argument * argument * argument * argument) / 120.0;
        bm::DoNotOptimize(result);
    }
}

If you are not familiar with the modern C++ attributes, here is the older GCC version: __attribute__((optimize("-ffast-math"))). Here we advise GCC to apply extra flags when compiling the scope of this specific function. This is one of my favourite tricks for rapid benchmarking without having to change CMakeLists.txt. Result: 0.85 ns.

For those interested, GCC has a separate manual for the full list of available floating-point math optimizations. The -ffast-math is a shortcut for -funsafe-math-optimizations, which is a shortcut for -fassociative-math. In our specific case, with that many chained multiplications, the ordering may change under/overflow behavior, as stated in the “Transformations” section.

Integer Division

We have just accelerated a floating-point trigonometric function by a factor of 60x from 51 ns to less than a nanosecond! Can we do the same for integer division, the slowest integer arithmetical operation? To a certain extent, yes.

1
2
3
4
5
static void i64_division(bm::State &state) {
    int64_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / (++b));
}

This variant runs in 3.3 ns, or 10x longer than addition. But what if we divide by a constant, not a mutating value? In that case, we just have to make sure that the compiler knows that our denominator remains constant!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
static void i64_division_by_const(bm::State &state) {
    int64_t money = 2147483647;
    int64_t a = std::rand(), c;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / *std::launder(&money));
}
static void i64_division_by_constexpr(bm::State &state) {
    constexpr int64_t b = 2147483647;
    int64_t a = std::rand(), b;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / b);
}

The above functions only differ in that the first one is involved in something shady - money laundering. The compiler fails to trace the origins of money, so it can’t replace it with shifts and multiplications. How big is the difference: 3.3 ns vs 0.5 ns!

Hardware Acceleration

Most of the examples in this article are abstract, but this actually happened to me a couple of months ago.

1
2
3
4
5
[[gnu::target("default")]] static void u64_population_count(bm::State &state) {
    auto a = static_cast<uint64_t>(std::rand());
    for (auto _ : state)
        bm::DoNotOptimize(__builtin_popcount(++a));
}

Compilers have special intrinsics, like __builtin_popcount. Those are high-performance routines shipped with the compiler and aren’t portable. Those are different from Intel or ARM intrinsics, hardware-specific and compiler-agnostic. These are hardware-agnostic and compiler-specific.

1
2
3
4
5
[[gnu::target("popcnt")]] static void u64_population_count_x86(bm::State &state) {
    auto a = static_cast<uint64_t>(std::rand());
    for (auto _ : state)
        bm::DoNotOptimize(__builtin_popcount(++a));
}

I forgot to pass the -mpopcnt flag in the first variant, and the compiler silently continued. When you want to be specific - explicitly trigger the popcnt instruction. The cost of this mistake was: 1.9 ns and 0.3 ns, or over 6x!

Data Alignment

Compute may be expensive, but memory accesses always are! The more you miss your caches, the more you waste time!

1
2
3
4
5
6
constexpr size_t f32s_in_cacheline_k = 64 / sizeof(float);
constexpr size_t f32s_in_halfline_k = f32s_in_cacheline_k / 2;

struct alignas(64) f32_array_t {
    float raw[f32s_in_cacheline_k * 2];
};

Let’s illustrate it by creating a cache-aligned array with 32x floats. That means 2x 64-byte cache lines worth of content.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
static void f32_pairwise_accumulation(bm::State &state) {
    f32_array_t a, b, c;
    for (auto _ : state)
        for (size_t i = f32s_in_halfline_k; i != f32s_in_halfline_k * 3; ++i)
            bm::DoNotOptimize(c.raw[i] = a.raw[i] + b.raw[i]);
}

static void f32_pairwise_accumulation_aligned(bm::State &state) {
    f32_array_t a, b, c;
    for (auto _ : state)
        for (size_t i = 0; i != f32s_in_halfline_k; ++i)
            bm::DoNotOptimize(c.raw[i] = a.raw[i] + b.raw[i]);
}

Now we run two benchmarks. Both perform the same logical operations - 8x float additions and 8x stores. But the first took 8 ns and the seond took 4 ns. Our benchmark ends up being dominated by the cost of the split-load, or the lack of memory alignment, in other words.

From Nanoseconds to Microseconds

Let’s gradually grow our operations in size. What’s wrong with the following benchmark?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
static void sorting(bm::State &state) {

    auto count = static_cast<size_t>(state.range(0));
    auto include_preprocessing = static_cast<bool>(state.range(1));

    std::vector<int32_t> array(count);
    std::iota(array.begin(), array.end(), 1);

    for (auto _ : state) {

        if (!include_preprocessing)
            state.PauseTiming();
        // Reverse order is the most classical worst case, but not the only one.
        std::reverse(array.begin(), array.end());
        if (!include_preprocessing)
            state.ResumeTiming();

        std::sort(array.begin(), array.end());
        bm::DoNotOptimize(array.size());
    }
}

BENCHMARK(sorting)->Args({3, false})->Args({3, true});
BENCHMARK(sorting)->Args({4, false})->Args({4, true});

We are benchmarking the std::sort, the most used <algorithm>. Array sizes are fixed at 3 and 4 elements in different benchmarks. Each runs twice:

  1. with std::reverse preprocessing time included.
  2. with std::reverse preprocessing time excluded.

One took 227 ns and another took 9 ns. Which is which?

The Cost of Timing in Google Benchmark

If we think about this, branches always evaluate identically, so the only problems might be coming from state.*Timing() functions or the reversal itself. Anyways, if you have read the title of this section, it shouldn’t be a surprise.

1
2
3
4
5
6
7
8
9
static void upper_cost_of_pausing(bm::State &state) {
    int32_t a = std::rand(), c = 0;
    for (auto _ : state) {
        state.PauseTiming();
        ++a;
        state.ResumeTiming();
        bm::DoNotOptimize(c += a);
    }
}

This took 213 ns. Those things look tiny when processing gigabytes, but you will run your benchmark on a small input one day. Plan ahead to extract unaffected asymptotic curve as we will do later.

Failing To Make A Point, Again

Branching is also critical. We avoid it on hot paths and design branchless vectorizable procedures. It’s time-consuming but makes further porting to SIMD assembly much more manageable.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
static void upper_cost_of_branching(bm::State &state) {
    volatile int32_t a = std::rand();
    volatile int32_t b = std::rand(), c;
    for (auto _ : state) {
        volatile bool prefer_addition = ((a * b ^ c) & 1) == 0;
        if (prefer_addition)
            b *= (++a) * c;
        else
            c -= (a = 3 * a + 1) & b;
    }
}

Unfortunately, it’s tough to showcase its cost in a five-line code snippet. Every simple snippet like this completes in just around 2 ns. There are still papers like this regularly published on dynamic branch prediction, so I am looking for someone to contribute a better example for the cost of branching!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
template <bool reverse_only_half_of_time_k>
static void sorting_template(bm::State &state) {

    auto count = static_cast<size_t>(state.range(0));
    std::vector<int32_t> array(count);
    std::iota(array.begin(), array.end(), 1);
    size_t iteration = 0;

    for (auto _ : state) {

        if constexpr (reverse_only_half_of_time_k)
            if (iteration & 1)
                std::reverse(array.begin(), array.end());
        else
            std::reverse(array.begin(), array.end());

        std::sort(array.begin(), array.end());
        ++iteration;
    }
}

BENCHMARK_TEMPLATE(sorting_template, false)->Arg(3);
BENCHMARK_TEMPLATE(sorting_template, true)->Arg(3);
BENCHMARK_TEMPLATE(sorting_template, false)->Arg(4);
BENCHMARK_TEMPLATE(sorting_template, true)->Arg(4);

Even though I couldn’t show the cost of if before, I still recommend to separate compile-time and run-time logic. It just makes sense.

Bigger Benchmarks

GB comes with all kinds of bells and whistles, like complexity analysis. Those come handy when you analyze scaling. Most of us assume that std::sort uses Quick Sort, at least for big inputs. But what if the input is ginormous?

There is the Parallel STL! In GCC, those procedures are directed towards Intel TBB, which will use many threads to sort your numbers. The underlying algorithm would almost certainly be a linear-time Radix sort on GPUs. But whats the complexity of TBBs implementation?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
template <typename execution_policy_t>
static void supersort(bm::State &state, execution_policy_t &&policy) {

    auto count = static_cast<size_t>(state.range(0));
    std::vector<int32_t> array(count);
    std::iota(array.begin(), array.end(), 1);

    for (auto _ : state) {
        std::reverse(policy, array.begin(), array.end());
        std::sort(policy, array.begin(), array.end());
        bm::DoNotOptimize(array.size());
    }

    state.SetComplexityN(count);
    state.SetItemsProcessed(count * state.iterations());
    state.SetBytesProcessed(count * state.iterations() * sizeof(int32_t));
}

This short benchmark logs more information than just the runtime per iteration. Aside from the built-in SetItemsProcessed and SetBytesProcessed you can add anything else:

1
state.counters["tempreture_on_mars"] = bm::Counter(-95.4);

I then run this benchmark with a couple of different policies and on numerous sizes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
BENCHMARK_CAPTURE(supersort, seq, std::execution::seq)
    ->RangeMultiplier(8)
    ->Range(1l << 20, 1l << 32)
    ->MinTime(10)
    ->Complexity(bm::oNLogN);

BENCHMARK_CAPTURE(supersort, par_unseq, std::execution::par_unseq)
    ->RangeMultiplier(8)
    ->Range(1l << 20, 1l << 32)
    ->MinTime(10)
    ->Complexity(bm::oNLogN);

Which would yield results like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
supersort/seq/1048576/min_time:10.000               ... 
supersort/seq/2097152/min_time:10.000               ... 
supersort/seq/16777216/min_time:10.000              ... 
supersort/seq/134217728/min_time:10.000             ... 
supersort/seq/1073741824/min_time:10.000            ... 
supersort/seq/4294967296/min_time:10.000            ... 
supersort/seq/min_time:10.000_BigO                1.78 NlgN
supersort/seq/min_time:10.000_RMS                   41 %   
supersort/par_unseq/1048576/min_time:10.000         ...
supersort/par_unseq/2097152/min_time:10.000         ...
supersort/par_unseq/16777216/min_time:10.000        ...
supersort/par_unseq/134217728/min_time:10.000       ...
supersort/par_unseq/1073741824/min_time:10.000      ...
supersort/par_unseq/4294967296/min_time:10.000      ...
supersort/par_unseq/min_time:10.000_BigO         0.09 NlgN
supersort/par_unseq/min_time:10.000_RMS             2 %   

So GB took the heavy burden of fitting and comparing our results against the suggested complexity curve. From those experiments, NLogN fits best.


Another thing to note here is the missing UseRealTime. Without it the “CPU Time” is used by default. If you were to sleep your process, the “CPU Time” would stop growing. GBs functionality list is extensive, and many parts of it are little-known:

It’s impossible to cover the depth of performance engineering and complexity of modern software/hardware with just one tool. Luckily, they have excellent single-page user guide, and all the codes from this article are also available on GitHub. Feel free to fork and extend them!

The other essential tools to know are perf and valgrind in their ascending complexity. We were just focusing on runtime speed today, that’s easy. While memory usage or hardware stalls tracking is complicated yet equally crucial for evaluating software. Especially if it claims to be the fastest Database Management Software ever designed 馃 So expect deeper articles on that topic very soon!

  • Tweet
  • Subscribe to receive similar articles 馃摠
  • Reach Us to try the fastest data processing software 馃敟