At My Limit: A Story About Clamping
A short rabbit hole from a poorly named function
The other day, I was spelunking and found a simple function with a peculiar implementation1:
double limit(double x, double low, double high) {
return ((x < std::min(low,high)) ? std::min(low,high) :
((x > std::max(low,high)) ? std::max(low,high) : x));
}
You might recognize this by another name: clamp
.2 The choice to use the name limit
instead will bother me to the end of time.
I later found that clamp
exists in the same header, implemented two years after limit
. Presumably because someone was looking for it and couldn't find it because it was called limit
.
This function tries to be clever with its implementation by avoiding a swap of low
and high
if the arguments to the function were somehow mixed up. But its attempted cleverness makes it harder to read. Not only is it harder to read, it's harder for the compiler to properly optimize this function. Throwing the function into Godbolt using the armv8-a clang 16.0 compiler with -O2
, we get the following assembly:
limit(double, double, double):
fcmp d2, d1
fcsel d3, d2, d1, mi
fcmp d3, d0
b.gt .LBB0_3
fcmp d1, d2
fmov d3, d0
fcsel d1, d2, d1, mi
fcmp d1, d0
b.pl .LBB0_3
fmov d3, d1
.LBB0_3:
fmov d0, d3
ret
It manages to figure most of it out and replace most of the potential branches with conditional instructions, but there are two places where it decides it must branch instead.
The naive implementation with the same sanity check looks like this3:
double limit(double x, double low, double high) {
if (low > high)
std::swap(low, high);
if (x < low)
return low;
else if (x > high)
return high;
else
return x;
}
The else
isn't necessary but keeps the returns at the same level of indentation. Jury's out on which style looks nicer.
And it produces the following assembly given the same settings:
limit(double, double, double):
fcmp d1, d2
fcsel d3, d1, d2, gt
fcsel d1, d2, d1, gt
fcmp d3, d0
fcsel d2, d3, d0, mi
fcmp d1, d0
fcsel d0, d1, d2, gt
ret
The naive version is not only easier to read but the compiler recognizes every part of the pattern and is able to optimize it. The branches are removed in favor of conditional instructions and it's able to do the same work in only seven instructions versus the original eleven.
If you really want to be terse, you can still do this in two lines4:
double limit(double x, double low, double high) {
if (low > high) std::swap(low, high);
// Pick your poison
return x < low ? low : high < x ? high : x;
// return std::min(high, std::max(low, x));
}
In either case, the compiler recognizes the pattern and optimizes it. The latter saves three whole characters that we can cash in later.
Or if you have access to C++17, we can simplify this even further5:
double limit(double x, double low, double high) {
if (low > high) std::swap(low, high);
return std::clamp(x, low, high);
}
Without the sanity check, calling std::clamp
with low > high
results in undefined behavior. While most implementations of clamp
aren't going to be as paranoid as this one, it feels like a classic C++ move to standardize a worse version of something a decade or two after everyone already wrote their own version for something that should've been part of the standard library in the first place.
The output of either of these functions is effectively the same as the naive version. The final two fcsel
instructions choose different conditions and rearrange their arguments accordingly.
Curiously enough, I could not get GCC to optimize any of these functions to the level that Clang did and each function has a wildly different assembly. I found this bizarre since Clang produced three nearly identical outputs. I initially thought that I was just missing some compiler option but the handful I tried either didn't change the output or weren't recognized by Godbolt. I'm beginning to think it's likely that GCC decided that relying on branch prediction would be faster than the conditional instructions in practice since the happy path in this function would be a no-op. I would like to dig deeper and know the reason for this discrepency, but I'm at my limit.
The code used in these experiments along with the assembly can be found on Godbolt.