Context

Some time ago, I found that a piece of C++ code ran nearly an order of magnitude slower on macOS when compiled with the bundled Apple Clang than with GCC from Homebrew. The runtimes (in milliseconds, as printed by the benchmark below) were:

  • GCC-13 Homebrew: 300
  • Apple Clang: 2170

The code is as follows:

#include <algorithm>
#include <array>
#include <chrono>
#include <iostream>
#include <memory>
#include <numeric>  // for std::accumulate
#include <random>

constexpr size_t FILE_N = 1e8;
constexpr size_t DATA_R = (1 << 23);

int main() {
  for (size_t i = 0; i < 4; ++i) {
    auto start = std::chrono::high_resolution_clock::now();
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::uniform_int_distribution<> dis(0, DATA_R);
    std::shared_ptr<std::array<size_t, FILE_N>> data(
        new std::array<size_t, FILE_N>());
    std::generate(data->begin(), data->end(),
                  [&dis, &gen]() { return dis(gen); });
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end -
                                                                       start)
                     .count()
              << std::endl;
    auto mean = std::accumulate(data->begin(), data->end(), 0.0) / FILE_N;
    std::cout << mean << std::endl;
  }
  return 0;
}

First, the conclusion: GCC-13 from Homebrew uses libstdc++, while Apple Clang uses libc++. libstdc++ ships an optimised implementation of uniform_int_distribution, while libc++ uses a plain rejection-sampling implementation; on top of that, the benchmark's parameters happen to trigger exactly the worst case of the plain implementation, resulting in a huge performance gap.

Exploration

On the surface this looks like a big performance difference between GCC and Clang, but since an STL implementation is involved, it is important to control the variables: testing the combinations (e.g. switching Clang between -stdlib=libstdc++ and -stdlib=libc++) shows that Clang + libstdc++ performs well, GCC + libstdc++ performs well, and only Clang + libc++ performs poorly.

So the problem can be pinpointed to libc++. Let's look at libc++'s implementation of uniform_int_distribution:

// https://github.com/llvm/llvm-project/blob/9b2dfff57a382b757c358b43ee1df7591cb480ee/libcxx/include/__random/uniform_int_distribution.h#L233-L257
typename uniform_int_distribution<_IntType>::result_type
uniform_int_distribution<_IntType>::operator()(_URNG& __g, const param_type& __p)
_LIBCPP_DISABLE_UBSAN_UNSIGNED_INTEGER_CHECK
{
    static_assert(__libcpp_random_is_valid_urng<_URNG>::value, "");
    typedef __conditional_t<sizeof(result_type) <= sizeof(uint32_t), uint32_t, __make_unsigned_t<result_type> >
        _UIntType;
    const _UIntType __rp = _UIntType(__p.b()) - _UIntType(__p.a()) + _UIntType(1);
    if (__rp == 1)
        return __p.a();
    const size_t __dt = numeric_limits<_UIntType>::digits;
    typedef __independent_bits_engine<_URNG, _UIntType> _Eng;
    if (__rp == 0)
        return static_cast<result_type>(_Eng(__g, __dt)());
    size_t __w = __dt - std::__countl_zero(__rp) - 1;
    if ((__rp & (numeric_limits<_UIntType>::max() >> (__dt - __w))) != 0)
        ++__w;
    _Eng __e(__g, __w);
    _UIntType __u;
    do
    {
        __u = __e();
    } while (__u >= __rp);
    return static_cast<result_type>(__u + __p.a());
}

As you can see, the idea is: to generate a uniformly distributed random integer in [a, b], it first finds the smallest power of two 2^w that is no smaller than the range size rp = b - a + 1, then generates w-bit random numbers, and applies rejection sampling: if a sampled value is greater than or equal to rp, it resamples.
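
To make the control flow concrete, here is a minimal self-contained sketch of the same scheme (the names are illustrative, not libc++'s, and the full-range rp == 0 case that libc++ handles separately is omitted):

#include <bit>
#include <cstdint>
#include <random>

// Draw w random bits, where 2^w is the smallest power of two >= the range
// size, and resample whenever the draw overshoots the range.
uint32_t bounded_by_rejection(std::mt19937& gen, uint32_t a, uint32_t b) {
  uint32_t rp = b - a + 1;             // range size (assumed nonzero here)
  int w = std::bit_width(rp - 1);      // smallest w with 2^w >= rp
  uint32_t mask = (w == 32) ? UINT32_MAX : (uint32_t{1} << w) - 1;
  uint32_t u;
  do {
    u = gen() & mask;                  // a w-bit random number
  } while (u >= rp);                   // rejection sampling
  return a + u;
}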

But in the code tested earlier, the range size is 2^23 + 1, one more than a power of two, so it gets rounded up to 2^24 and each sample is accepted with probability (2^23 + 1) / 2^24, i.e. barely over 50%. That means roughly every other random number has to be regenerated, and the data-dependent retry loop causes a high branch-misprediction rate.
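
As a sanity check, this small standalone program reproduces the width computation from the libc++ snippet above for the benchmark's parameters:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
  // dis(0, 1 << 23) spans 2^23 + 1 values.
  uint32_t rp = (1u << 23) + 1;
  int dt = 32;                               // bits in the working type
  int w = dt - std::countl_zero(rp) - 1;     // 32 - 8 - 1 = 23
  if ((rp & (UINT32_MAX >> (dt - w))) != 0)  // the low 23 bits are nonzero,
    ++w;                                     // so w becomes 24
  std::printf("w = %d, acceptance = %f\n", w, double(rp) / (1u << w));
  // Prints: w = 24, acceptance = 0.500000 -- half of all draws are rejected.
}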

The data obtained by @Harry-Chen using perf is as follows:

  • branch-misses: 30% of all branches
  • Top-down: 44.6% Bad Speculation

This suggests that branch prediction is indeed the bottleneck. So how is libstdc++ implemented, and why doesn't it have this problem? Searching turned up a blog post: Doubling the speed of std::uniform_int_distribution in the GNU C++ library (libstdc++).

The paper Fast Random Integer Generation in an Interval proposes a new sampling algorithm for uniform_int_distribution that roughly doubles the performance of the original implementation, and it has since been merged into libstdc++.
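
For reference, here is a minimal sketch of the paper's multiply-and-reject technique for a 32-bit generator, producing a value in [0, s); the names are mine, and this is a simplification in the spirit of what libstdc++ ships, not a copy of it:

#include <cstdint>
#include <random>

// Map a 32-bit random draw into [0, s) via a 64-bit multiply. The rejection
// threshold 2^32 mod s is computed as (0u - s) % s, avoiding a 64-bit
// division. Assumes 0 < s.
uint32_t lemire_bounded(uint32_t s, std::mt19937& gen) {
  uint64_t m = uint64_t(gen()) * uint64_t(s);
  uint32_t l = uint32_t(m);            // low 32 bits of the product
  if (l < s) {                         // rarely taken slow path
    uint32_t t = (0u - s) % s;         // threshold: 2^32 mod s
    while (l < t) {
      m = uint64_t(gen()) * uint64_t(s);
      l = uint32_t(m);
    }
  }
  return uint32_t(m >> 32);            // high bits are uniform in [0, s)
}

The key property: the rejection probability is (2^32 mod s) / 2^32 < s / 2^32, which is negligible for small ranges and, unlike the power-of-two scheme, does not spike when s is just past a power of two.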

At this point the problem is clear: libstdc++ implements a better algorithm, libc++'s algorithm hits its worst case, and together the two effects produce the huge performance gap we observed.

If DATA_R is changed to (1 << 23) - 1 in the code, the range size becomes exactly 2^23, so libc++'s rejection sampling never needs to retry. The performance is then:

  • GCC-13 Homebrew: 253
  • Apple Clang: 490

You can see that a roughly twofold performance gap remains; this is the part contributed by the better sampling algorithm implemented in libstdc++.

The solutions: wait for libc++ to implement a better algorithm, avoid linking against libc++ where uniform_int_distribution matters, or implement a better algorithm yourself.
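
As a sketch of the last option: assuming the lemire_bounded function from earlier (with a 32-bit std::mt19937 as the generator, since the sketch is 32-bit), the fill loop in the benchmark could bypass the distribution entirely:

std::mt19937 gen32(rd());
std::generate(data->begin(), data->end(),
              [&gen32]() { return lemire_bounded(DATA_R + 1, gen32); });
              // [0, DATA_R], i.e. a range of size 2^23 + 1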