Caller-side inline directives

While rustc's ability to work with multiple codegen units functionality brings major benefits to compilation times, it does have the drawback of causing semi-stochastic (deterministic, but sensitive to trivial code changes) loss of inlining of perf-sensitive methods.

When this happens in a crate I maintain, I just slap an #[inline] directive on the offending method and call it a day. However, it is a more problematic when it happens in a public method of a third-party crate that I depend on, because in current Rust, it is the code of the dependency that must be patched.

Examples of crates that regularly cause this kind of issues for me are std (most recently Iterator::fold()) and ndarray (most recently ArrayViewMut::split_at()).

When I believe that I'm not the only one for whom lack of inlining is an issue, I open an issue upstream, as I did for ndarray above. But when it's a narrower edge case specific to my project, it would be nice if I could use a caller-side inline directive, as recently introduced by clang, instead of having to go codegen-units = 1 or fat LTO (thin LTO inlining heuristics are less good) and suffer the associated bad compilation times.

Any thoughts on this idea?

Generic methods are generally not marked #[inline], since generic methods have traditionally already been codegened into each compilation unit separately (and thus available to inline) as a matter of necessity. (This has AIUI been policy in std, and generally culturally adopted by ecosystem libraries as well.) Perhaps the rollout of -Zshare-generics means we should reconsider the policy?

(Out of curiosity, are you using flags like -Zshare-generics, or just the default --release configuration? I'm not certain where we are in the rollout of sharable generic monomorphizations. It was my understanding that it wasn't on for cross-crate reuse yet, but it might be on for reuse between CGUs of a single crate.)

I'm impressed you managed to get a compilation of Iterator::fold reused cross CGU (and thus not available for inlining). It must've been for function pointers, since most any other usage would end up depending on a call-site dependent type (i.e. closure) and probably in the same CGU as the calling scope. If it was for such a type, perhaps there's some better heuristic we can introduce to hopefully generally keep unique monomorphizations in the same CGU as their single caller.

The underlying complication is that #[inline] doesn't just make a function available for inlining, it also applies the "good candidate for inlining" hint. For all but a select few generic methods, we'd prefer to have the former but not the latter. And maybe the aforementioned heuristic is rather the ability to specify e.g. #[inline(weak)] on functions to get them always duplicated into the caller's CGU. The idea being that when used on methods generic over Fn*, that generally still only means codegenning any given monomorphization once, just in a "closer" CGU. (Perhaps even lint against applying the attribute to functions not generic over Fn*. On the other hand, maybe just leave it a heuristic rather than asking libs to annotate for it, if the idea is that it's always desirable for functions generic over Fn*?)

If an upstream function isn't marked #[inline], isn't generic, and isn't const, there's potentially nothing even really there to inline anymore. For nongeneric, noninline upstream functions, the .rlib has typically already compiled the function down to target machine code and discarded the MIR. (MIR-only rlibs would change this, of course, but I'm considering (my understanding of) the current status.) Downstream pipelined compilation might even only have access to .rmeta, which doesn't even have the machine code yet, just signatures and other such metadata (such as MIR for inline and/or generic functions).

So of the cases where a caller-side #[inline] would actually be certain to have the upstream MIR available to inline into the same CGU:

  • The upstream function is already #[inline], thus doing so would have no impact;
  • The upstream function is generic, so traditionally (before -Zshare-generics) would be available to inlining in any CGU it got used, thus doing it would just apply the "good inlining candidate" hint; or
  • The upstream function is const, thus this both makes the function available to inline when it wouldn't otherwise be and applies the hint.

The clang attribute seems to be the equivalent of #[inline(always)]. I don't think we'd want to permit that on the caller side unless we're willing to commit to the MIR for upstream functions always being available for inlining (minimally into the CGU) if requested. Either way, caller side #[inline] should probably be an immediate error if used on extern declarations without function bodies.

Additionally, there's the risk of caller side #[inline] not really being strong enough for what you want. If the called function primarily consists of calls to other functions which are still cross-CGU, they still won't be able to be inlined, making inlining of the parent not all that significant. If your observed failure cases indeed come from -Zshare-generics runtime pessimization, then it's not all that unlikely that this would in fact be the case, and what you actually are wanting for is less a caller-side #[inline] and more a #[share_generics = false].

(My general inline annotation heuristic has been: if a nongeneric function compiles down to a leaf function which doesn't touch the stack, it should be marked #[inline]. Otherwise, it shouldn't be marked #[inline] until macrobenchmarks provide meaningful evidence otherwise. Some wiggle room is provided to "fundamental/primitive" vibe functions which happen to be generic and/or use some stack, but not enough of it to require stack pointer manipulation on targets that provide a red zone. There are, of course, still further refinements present.)


I am using rustc v1.70 with no special codegen flags (besides target-cpu which doesn't matter here), so share-generics should only play a role inasmuch as it's been enabled on stable.

However, I do regularly encounter inlining issues with generic methods, so it looks like the conventional wisdom that generics are always inlined is wrong (as it should : like any other method, they should be available for inlining, but whether they are actually inlined should be a balanced optimizer decision).

I suspect the iterator::fold outlining that I observed happened because my program has multiple compute backends, a sequential backend and a parallel backend that calls the sequential backend. The sequential backend is the one that calls fold. If both backends end up being enabled by accident (additive feature flag woes...), and the sequential backend is inlined into the parallel backend but the fold is not inlined into the sequential backend, then we can end up with both a parallel backend and a sequential backend calling an outlined fold. But I need to cross-check if that's actually what happened, will do once I have more than a cellphone at hand.

I also agree that it could be nice to have a way to say "make this available for inlining" without actually hinting the compiler into inlining. I can't count the number of times where I've needed #[inline] for good performance with multiple codegen units in code that builds just fine with one codegen unit, it looks like the logic for inlining in each codegen unit is pretty good and it's often just the logic for splitting codegen units that has issues. Therefore, while waiting for that heuristic to be fixed, if you give me a way to only act on it, without affecting inlining inside of a codegen unit, it's less likely to cause trouble.

And given the situation that you describe (no compiler IR available for other crates in absence of #[inline]), it indeed does not look like a caller-side inlining attribute that does what I want can be implemented in current rustc, so that approach should indeed not be pursued.

I'm a bit uneasy about rolling out something specific to share-generics, until I'm convinced that it's always generics sharing that cause all my inlining problems (this I don't remember). If the code can stay focused on my actual issue (lack of inlining in the final binary), rather than the lower-level compiler feature causing the issue, that feels nicer to me.

Speaking personally, I also apply #[inline] eagerly when an abstract method's implementation exposes knowledge that should ideally be const-propagated into the caller for performance, or conversely if the function is likely to be called with const arguments and knowledge of these arguments allows for powerful optimization (think any method that revolves around a bound check + some cheap follow up, like array indexing).

For example, I semi-recently built a ring buffer whose size is rounded up to the next power of two from the minimal size that the user requires, because this allows me to get faster index wraparound (bitmask not integer modulo). In this situation, storing the size as a bitshift and having an inlined len() method that returns 1 << bitshift allows my methods to get the desired fast wraparound while keeping the underlying bit hack implementation detail contained within the constructor and len method.

Some prior discussion mentioning top-down inlining:

One of the issues is that LLVM does bottom-up inlining. So we could only apply this via the MIR-inliner. Or someone would have to implement support in LLVM first.

1 Like

Maybe you could writting some benchmark and enable PGO for that bench.

IMHO PGO is always better than tweaking code manually.

I would disagree on this one. In my opinion, sprinkling around a few permanent inlining directive for functions you know should be inlined anyway is better than forcing people into a much more complicated build procedure for performance.

Of course, if we were talking about a multi-pass JIT with runtime profiling (think JavaScript runtimes), where PGO can be performed transparently without user involvement and lead to eventually good codegen, my opinion would be different. But in AoT compiled languages, PGO goes into the "last-resort trick I use if I have no other choice" bucket for me.

For a simple example of problematic std function inlining with multiple CGUs, please clone GitHub - HadrienG2/grayscott: Rust version of the "Performance with stencil" course's examples, checkout commit bd6b2fda82b85000a449e16d38705cf5f178299d, remove the codegen-units = 1 directive in Cargo.toml, then build and run the criterion benchmark for the autovectorized compute backend as follows:

$ cargo install cargo-criterion
$ cargo criterion -p compute_autovec -- --sample-size 10 'full.*2048x.*32'

Then switch back to codegen-units = 1 and observe how throughput improves. It is a factor of 16x on the Zen 3 CPU that I currently have my hands on, although admittedly this particular microarchitecture is crazy amazing and the improvement is less impressive on other CPUs I've tried (Zen 2, Comet Lake). Maybe more like 3x faster. Still huge.

# Before
                        time:   [1.3362 s 1.3368 s 1.3375 s]
                        thrpt:  [50.176 Melem/s 50.200 Melem/s 50.222 Melem/s]

# After
                        time:   [75.417 ms 80.169 ms 84.470 ms]
                        thrpt:  [794.47 Melem/s 837.10 Melem/s 889.84 Melem/s]
                        time:   [-94.396% -94.239% -94.043%] (p = 0.00 < 0.05)
                        thrpt:  [+1578.7% +1635.8% +1684.5%]
                        Performance has improved.

A perf profile reveals that one reason why switching to codegen-units = 1 is so beneficial is that std function inlining goes quite wrong in the presence of multiple CGUs. In particular, the FnMut::call_mut trait method is not inlined for a simple closure.

This is the code that calls the function. Don't mind the complicated iterator, this is just iterating over every element of identically shaped 2D arrays. I need to do the whole row/column dance because rustc can't figure out the codegen right when I use ndarray's flat iterators together with a fixed sized array's flattened iterator, which is annoying but fair enough :


And this is rustc mistakenly outlining the FnMut::call_mut-mediated call to the folding closure, as spotted by perf:

Further, as previously alluded too, the whole inner Iterator::fold() call is not inlined either...

...which is sad, because inlining this little reduction based on a loop with 9 iterations is something rustc should do. Especially given that it knows the loop has 9 iterations, since all arrays and ndarray windows involved have dimensions known at compile time and the function that contains this info is inlined... Oh well, something for another day.

Anyway, just for completeness, I tried to see if the whole-program optimizations of LTO can fix this :

  • Thin LTO does nothing, as unfortunately happens too often when I try to use it.
  • Fat LTO, as usual, is more interesting: at the expense of a great build slowdown, it does resolve the two inlining issues discussed above, but at the cost of introducing a new equally bad inlining issue, namely outlining of the FMA instruction of the SIMD library that I use. Then again, I contributed that mul_add function myself and forgot to mark it inline :person_facepalming:, will submit a patch once I'm done writing this post.

Heh. We don't have a consistent pattern in the standard library whether looping methods like fold should be marked as inline or not (Single-step methods like next() obviously should) so it's a bit inconsistent which one gets it and which one doesn't. Usually guided by benchmarks. Map::fold is one of the ones that isn't and then probably ends up in a different CGU which prevents further inlining.

1 Like