A bunch of core functions should be #[inline(always)]

I had a unit test that was running too slowly to be able to run it in debug mode (it completed instantly in release), so I started using perf to try and see what the bottlenecks were... and I was surprised.

Rust uses functions for a bunch of core operations that would be directly implemented by the compiler in C/C++. This results in some absurdities, like this code causing 4 billion function calls:

for i in 0..u32::MAX { ... }

If you profile this in a debug build, you will find:

callq core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next

Leading to the following workaround:

let mut i = 0;
loop {
    if i >= u32::MAX {
        break;
    }
    // ...
    i += 1;
}

There are many other functions like this, for example: core::ptr::mut_ptr::<impl *mut T>::add, core::ptr::write, core::ptr::drop_in_place, core::cell::UnsafeCell<T>::get.

AFAIK none of these should ever compile to more than a single instruction, and they are primitive enough that I don't think any reasonable user should really expect to step inside them.
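To make that list concrete, here is a small sketch that touches a few of these helpers. The function name fill_with_ptr_helpers is made up for illustration. In an unoptimized build I would expect as_mut_ptr, add and write to each show up as an ordinary call, though what is actually emitted depends on the compiler version and flags.

```rust
/// Fills `buf` with 0, 1, 2, ... using raw-pointer helpers from `core::ptr`.
/// (Hypothetical example function, not taken from the original post.)
pub fn fill_with_ptr_helpers(buf: &mut [u32]) {
    let base = buf.as_mut_ptr();
    let len = buf.len();
    let mut i = 0usize;
    while i < len {
        // SAFETY: `i < len`, so `base.add(i)` stays inside the slice.
        // `u32` has no destructor, so overwriting without dropping is fine.
        unsafe { core::ptr::write(base.add(i), i as u32) };
        i += 1;
    }
}
```

Calling it on a four-element buffer leaves 0, 1, 2, 3 in the slice; the interesting part is only what the debug build emits for the pointer helpers.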

Moreover, while I can write my own Cell with a get that is always inlined, I can't write an UnsafeCell replacement because it is special cased inside the compiler. Even if I could, most users are using a ton of third party libraries where the std types are ubiquitous.
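A rough sketch of what "my own Cell" could look like is below. InlineCell is a hypothetical name, and this is a simplified stand-in rather than a copy of std's Cell. Note that it still has to go through UnsafeCell::get internally, which is exactly the part that cannot be replaced in user code.

```rust
use core::cell::UnsafeCell;

/// Hypothetical stand-in for `Cell<T>` with always-inlined accessors.
pub struct InlineCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> InlineCell<T> {
    #[inline(always)]
    pub fn new(value: T) -> Self {
        Self { value: UnsafeCell::new(value) }
    }

    #[inline(always)]
    pub fn get(&self) -> T {
        // SAFETY: the type is !Sync (it contains an UnsafeCell) and never
        // hands out references to its interior, so nothing aliases this read.
        unsafe { *self.value.get() }
    }

    #[inline(always)]
    pub fn set(&self, value: T) {
        // SAFETY: same reasoning as `get`; `T: Copy` means no destructor
        // needs to run for the overwritten value.
        unsafe { *self.value.get() = value };
    }
}
```

The wrapper's own get and set should be inlined into callers, but the inner call to UnsafeCell::get is still whatever core decides it is.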

I can appreciate debug builds are meant to be easier to debug, but these seem to substantially slow down debug builds for no debuggability benefit. Thoughts?


Huh. I thought #[inline(always)] didn't actually have an impact in debug mode, but it does.

It's maybe worth noting that the (rough) equivalent of Rust's for i in 0..u32::MAX is not C's for (uint32_t i = 0; i < UINT32_MAX; ++i), but rather C++'s for (uint32_t i : Range<uint32_t>(0, UINT32_MAX)), which isn't even provided by the STL, and obviously thus wouldn't be inlined in debug mode.
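For reference, this is roughly how a range-based for loop expands, which is where the call to next comes from. It is a sketch of the documented desugaring, not the exact compiler output, and the two function names are made up for the comparison.

```rust
/// Counts iterations using the ordinary `for` syntax.
pub fn count_with_for(n: u32) -> u32 {
    let mut count = 0;
    for _ in 0..n {
        count += 1;
    }
    count
}

/// Approximately what the loop above expands to: the range is turned into
/// an iterator, and `Iterator::next` is called once per iteration plus one
/// final time to observe `None`.
pub fn count_desugared(n: u32) -> u32 {
    let mut count = 0;
    let mut iter = IntoIterator::into_iter(0..n);
    loop {
        match Iterator::next(&mut iter) {
            Some(_) => count += 1,
            None => break,
        }
    }
    count
}
```

Both functions return the same count; the second just makes the per-iteration call visible in the source.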

Rustc generally doesn't prioritize debug performance. It prefers to err on the side of faster compilation and to avoid open-coding methods that already exist, as long as release mode perf is unimpacted.

That said, I think the libs team would be willing to try a perf run marking some of these "cheaper than a callq" methods as #[inline(always)] and seeing how it impacts debug compile speed. If the impact is minimal, this seems reasonable.


I'll note as well that one can use -C opt-level=1 in debug mode, for situations where you want a bit more perf.

Work on the MIR inliner continues, and I suspect that it might one day be the answer for these things, rather than forcing a bunch of attributes all throughout core. (It'd be one of those changes where it's really hard to have a good answer for "no, that one should/shouldn't get it" when PRs show up.)


I would consider it to be one of the main purposes of the feature, at least it is in the C++ world. It's not uncommon to have applications (often games) that are completely unusable if some key functions are not inlined, so you need this sort of feature to debug those applications.

I would call it the "idiomatic equivalent". In each case it is the obvious way given to users to iterate over some integers. If that ever became idiomatic C++ I hope they would make it always inline as well :wink:

> Rustc generally doesn't really care about debug performance, and generally prefers erring on the side of faster compilation and avoiding open-coding methods that already exist to debug performance, if release mode perf is unimpacted.

If for these cases rustc emitted the appropriate LLVM IR directly, I suspect this would actually be a compile time improvement, but of course it would have to be measured.


Good to know!

> Work on the MIR inliner continues, and I suspect that it might one-day be the answer for these things, rather than forcing a bunch of attributes all throughout core. (It'd be one of those changes where it's really hard to have a good answer for "no, that one should/shouldn't get it" when PRs show up.)

I could foresee ambiguous cases, but I would argue the cases I'm listing here are unambiguous. What would be wrong with the litmus test: does this compile to one instruction or one unconditional function call?


That's an argument to run the LLVM inliner at O1 with a very low threshold, more than an argument to make a bazillion code changes to core/std, IMHO. And that's in essence what a bunch of the MIR optimizations are trying to be able to do.


I don't think it would be a bazillion, mostly just a lot of core::ptr::*. I guess what's surprising here is that things that I would normally expect to be language primitives are function calls, which for some of these ops is replacing what a C/C++ programmer would expect to be happening with an operation that is ~10x more expensive. Even for debug builds that seems outside the mental cost model most people would have. I think it's reasonable to use #[inline(always)] to make reality and people's mental model match up. Maybe a further litmus test would be: is this an operation directly implemented by the compiler in C, or does it exist exclusively to tell rustc information about the type (e.g. UnsafeCell::get)?

As an experiment I tried compiling with O1 -- it definitely helps, but still makes some weird choices -- e.g. it gets rid of UnsafeCell::get but not Cell::get.


IME LLVM often won't inline at O1, and so #[inline(always)] is still useful on trivial operations.

I've also seen performance degrade from switching code to use the (harder to use incorrectly) ptr::cast in a very tight loop, compared to manual casting.

This tradeoff sucks, tbh.
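For anyone unfamiliar with the two spellings being compared, here is a minimal sketch. The function names are invented for illustration, and the performance difference is the claim from the post above, not something this snippet measures. The cast method cannot accidentally change mutability, while the as form is a language primitive rather than a method call.

```rust
/// Reads the first byte of `value` using manual `as` casts.
pub fn first_byte_with_as(value: &u32) -> u8 {
    let p = value as *const u32 as *const u8;
    // SAFETY: `p` points at the first byte of a live, initialized `u32`,
    // and `u8` has no alignment requirement.
    unsafe { *p }
}

/// Same read, but using the `cast` method on the raw pointer.
pub fn first_byte_with_cast(value: &u32) -> u8 {
    let p = (value as *const u32).cast::<u8>();
    // SAFETY: same reasoning as above.
    unsafe { *p }
}
```

Both return the same byte (the first one in native byte order); only the way the pointer conversion is written differs.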

