blomqu.ist

At My Limit: A Story About Clamping

A short rabbit hole from a poorly named function

The other day, I was spelunking and found a simple function with a peculiar implementation1:

double limit(double x, double low, double high) {
    return ((x < std::min(low,high)) ? std::min(low,high) :
        ((x > std::max(low,high)) ? std::max(low,high) : x));
}
1

You might recognize this by another name: clamp.2 The choice to use the name limit instead will bother me to the end of time.

2

I later found that clamp exists in the same header, implemented two years after limit. Presumably because someone was looking for it and couldn't find it because it was called limit.

This function tries to be clever with its implementation by avoiding a swap of low and high if the arguments to the function were somehow mixed up. But its attempted cleverness makes it harder to read. Not only is it harder to read, it's harder for the compiler to properly optimize this function. Throwing the function into Godbolt using the armv8-a clang 16.0 compiler with -O2, we get the following assembly:

limit(double, double, double):
    fcmp    d2, d1
    fcsel   d3, d2, d1, mi
    fcmp    d3, d0
    b.gt    .LBB0_3
    fcmp    d1, d2
    fmov    d3, d0
    fcsel   d1, d2, d1, mi
    fcmp    d1, d0
    b.pl    .LBB0_3
    fmov    d3, d1
.LBB0_3:
    fmov    d0, d3
    ret

It manages to figure most of it out and replace most of the potential branches with conditional instructions, but there are two places where it decides it must branch instead.

The naive implementation with the same sanity check looks like this3:

double limit(double x, double low, double high) {
    if (low > high) 
        std::swap(low, high);
    if (x < low)
        return low;
    else if (x > high)
        return high;
    else
        return x;
}
3

The else isn't necessary but keeps the returns at the same level of indentation. Jury's out on which style looks nicer.

And it produces the following assembly given the same settings:

limit(double, double, double):
    fcmp    d1, d2
    fcsel   d3, d1, d2, gt
    fcsel   d1, d2, d1, gt
    fcmp    d3, d0
    fcsel   d2, d3, d0, mi
    fcmp    d1, d0
    fcsel   d0, d1, d2, gt
    ret

The naive version is not only easier to read but the compiler recognizes every part of the pattern and is able to optimize it. The branches are removed in favor of conditional instructions and it's able to do the same work in only seven instructions versus the original eleven.

If you really want to be terse, you can still do this in two lines4:

double limit(double x, double low, double high) {
    if (low > high) std::swap(low, high);
    // Pick your poison
    return x < low ? low : high < x ? high : x;
    // return std::min(high, std::max(low, x));
}
4

In either case, the compiler recognizes the pattern and optimizes it. The latter saves three whole characters that we can cash in later.

Or if you have access to C++17, we can simplify this even further5:

double limit(double x, double low, double high) {
    if (low > high) std::swap(low, high);
    return std::clamp(x, low, high);
}
5

Without the sanity check, calling std::clamp with low > high results in undefined behavior. While most implementations of clamp aren't going to be as paranoid as this one, it feels like a classic C++ move to standardize a worse version of something a decade or two after everyone already wrote their own version for something that should've been part of the standard library in the first place.

The output of either of these functions is effectively the same as the naive version. The final two fcsel instructions choose different conditions and rearrange their arguments accordingly.

Curiously enough, I could not get GCC to optimize any of these functions to the level that Clang did and each function has a wildly different assembly. I found this bizarre since Clang produced three nearly identical outputs. I initially thought that I was just missing some compiler option but the handful I tried either didn't change the output or weren't recognized by Godbolt. I'm beginning to think it's likely that GCC decided that relying on branch prediction would be faster than the conditional instructions in practice since the happy path in this function would be a no-op. I would like to dig deeper and know the reason for this discrepency, but I'm at my limit.

The code used in these experiments along with the assembly can be found on Godbolt.