Context
Some time ago, I noticed that a piece of C++ code compiled on macOS with the bundled Apple Clang ran nearly an order of magnitude slower than the same code compiled with Homebrew GCC. The runtimes were:
- GCC-13 Homebrew: 300
- Apple Clang: 2170
The code boils down to timing a tight loop that repeatedly draws from a std::uniform_int_distribution whose range is controlled by a constant DATA_R.
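A minimal benchmark in that spirit is sketched below. The RNG, iteration count, timing code, and the exact bounds (0 to DATA_R, with DATA_R = 1 << 23) are assumptions inferred from the analysis later in the post, not the original listing:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>

// Assumed parameters: the post only reveals that DATA_R is later changed to
// (1 << 23) - 1, so DATA_R = 1 << 23 here is an inference.
constexpr int DATA_R = 1 << 23;
constexpr int N      = 100'000'000;  // iteration count chosen arbitrarily

int main() {
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, DATA_R);

    auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (int i = 0; i < N; ++i)
        sum += dist(gen);            // the hot loop: one bounded draw per iteration
    auto end = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::printf("sum = %llu, time = %lld ms\n",
                (unsigned long long)sum, (long long)ms);
}
```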
First, the conclusion: GCC-13 Homebrew uses libstdc++, while Apple Clang uses libc++. libstdc++ ships an optimised implementation of uniform_int_distribution, while libc++ uses a straightforward one, and the benchmark's parameters happen to hit exactly the worst case of that straightforward implementation, hence the huge performance gap.
Exploration
On the surface this looks like a large performance difference between GCC and Clang, but since an STL implementation is involved, it pays to control the variables. Testing the combinations shows that Clang + libstdc++ performs well, GCC + libstdc++ performs well, and only Clang + libc++ performs poorly.
The problem can therefore be pinned on libc++, so let's look at how libc++ implements uniform_int_distribution.
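The relevant logic boils down to something like the sketch below. This is illustrative only, not the verbatim libc++ source (libc++ routes the bit generation through an internal helper class), and the function name is mine:

```cpp
#include <cstdint>

// Sketch of the "round up to a power of two + rejection" approach
// (the shape of libc++'s algorithm, not its actual code;
// bounded_pow2_rejection is a hypothetical name). Returns a uniform value in [a, b].
template <class URBG>
std::uint64_t bounded_pow2_rejection(std::uint64_t a, std::uint64_t b, URBG& gen) {
    const std::uint64_t range = b - a;   // the inclusive interval holds range + 1 values
    // Build the smallest mask of the form 2^k - 1 that covers 0..range.
    std::uint64_t mask = range;
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;

    std::uint64_t x;
    do {
        x = gen() & mask;    // draw k random bits
    } while (x > range);     // reject anything outside [0, range] and retry
    return a + x;
}
```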
The idea is this: to generate a uniformly distributed random integer in [a, b], it rounds the number of possible values, b - a + 1, up to a power of two, generates a random number modulo that power of two, and then applies rejection sampling: if the sampled value is greater than b - a, it draws again.
But in the code tested earlier, b - a itself is a power of two (2^23), so there are 2^23 + 1 possible values and the draw gets rounded up to 2^24 raw values; roughly half of all samples are therefore rejected and retried. That means a lot of regenerated random numbers and, because the accept/reject branch is essentially a coin flip, a high branch-misprediction rate.
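To put numbers on that claim (taking the benchmark's interval to be [0, 1 << 23], as assumed above):

```cpp
#include <cstdio>

int main() {
    // [0, 2^23] has 2^23 + 1 possible values, but each draw is taken from 2^24 raw values.
    constexpr double values   = (1 << 23) + 1;
    constexpr double capacity = 1 << 24;
    std::printf("acceptance probability : %.4f\n", values / capacity);  // ~0.5000
    std::printf("expected draws per call: %.4f\n", capacity / values);  // ~2.0
}
```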
The numbers @Harry-Chen obtained with perf are as follows:
- branch-misses: 30% of all branches
- Top-down: 44.6% Bad Speculation
This suggests that branch prediction is indeed the bottleneck. So how does libstdc++ implement it, and why doesn't it have this problem? After some searching I found a blog post: Doubling the speed of std::uniform_int_distribution in the GNU C++ library (libstdc++).
It describes how the algorithm from the paper Fast Random Integer Generation in an Interval was turned into a new uniform_int_distribution implementation, roughly doubling performance over the original one, and merged into libstdc++.
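The core of that approach replaces the modulo with a 64x64-bit multiply and keeps the rejection probability tiny. Here is my own condensation of the paper's "nearly divisionless" 64-bit variant (not the actual libstdc++ code; lemire_bounded is a hypothetical name, and unsigned __int128 is a GCC/Clang extension):

```cpp
#include <cstdint>
#include <random>

// Sketch of the multiply-and-shift technique from
// "Fast Random Integer Generation in an Interval" (requires range >= 1).
// Returns a uniform value in [0, range - 1].
std::uint64_t lemire_bounded(std::uint64_t range, std::mt19937_64& gen) {
    unsigned __int128 m = (unsigned __int128)gen() * range;   // 64x64 -> 128-bit product
    std::uint64_t low = (std::uint64_t)m;                     // low 64 bits
    if (low < range) {                                        // only then can bias occur
        const std::uint64_t threshold = (0 - range) % range;  // == 2^64 mod range
        while (low < threshold) {                             // rare rejection path
            m   = (unsigned __int128)gen() * range;
            low = (std::uint64_t)m;
        }
    }
    return (std::uint64_t)(m >> 64);   // high 64 bits are the unbiased result
}
```

Because 2^64 mod range is tiny compared with 2^64, the retry branch is almost never taken, so the algorithm stays fast and predictable even when the interval size sits just above a power of two.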
So the picture is now clear: libstdc++ implements a better algorithm, libc++'s algorithm hits its worst case, and the two effects combine into the huge performance gap observed.
If DATA_R is changed to (1 << 23) - 1 in the code, the probability of sampling failure in the libc++ algorithm is minimised (the interval size becomes an exact power of two, so essentially no draws are rejected), at which point the runtimes are:
- GCC-13 Homebrew: 253
- Apple Clang: 490
A roughly twofold gap remains, and that part comes from the better sampling algorithm implemented in libstdc++.
The solution is to wait for libc++ to adopt a better algorithm, to avoid linking against libc++ where uniform_int_distribution performance matters, or to implement a better algorithm (such as the one sketched above) yourself.