Some Rust abstractions only provide good runtime performance to the extent that critical functions and closures are inlined.
Iterator adapters that translate into a tight compute loop are a well-known case of this, especially where auto-vectorization is desired. The compiler backend's loop auto-vectorizer only works when it "sees" the whole final loop without any function calls, so any decision not to inline an iterator adapter, or the associated user-defined function/closure, can break it.
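For illustration, here is the kind of adapter chain I have in mind (a made-up axpy-style example):

```rust
/// Scale-and-accumulate over two slices. This only compiles down to a
/// tight, auto-vectorized loop if `iter_mut()`, `zip()`, `for_each()` and
/// the user-provided closure are all inlined: a single refused inlining
/// decision leaves an opaque function call inside the loop, and the
/// auto-vectorizer gives up.
pub fn axpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    ys.iter_mut().zip(xs).for_each(|(y, x)| *y += a * x);
}
```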
Until recently, I used to think that this was just an unavoidable limitation of iterator adapters, and that it was acceptable as long as there was a way to fall back to manually vectorized code. After all, you need manually vectorized code (or some moral equivalent thereof) anyway as soon as your computation contains any kind of floating-point reduction.
But a few months ago, I also found a more complex case, involving the combination of `std::simd` and the `multiversion` crate, for which I could not think of any clean workaround. To cut a long story short, I ended up having to reimplement an `ndarray` iterator with `unsafe` code to resolve it, and that is not something I would like to recommend to my students.
Here are more details on that case.
By design, multiversioning can only be applied to functions, not types. In applications where long-term storage of SIMD vectors of "native" length is desired (typically as a workaround for the lack of ergonomic aligned storage allocators in Rust), this forces the use of two layers of multiversioning: one at the app layer, where the SIMD vector length is selected and data structures are allocated, and one in the compute kernels themselves. And then the whole call path between these two points must be generic over the SIMD vector length...
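Here is a minimal sketch of this two-layer setup (assuming the `multiversion` 0.7 attribute syntax, that `#[multiversion]` accepts const-generic functions, and a recent nightly toolchain for `std::simd`; the width selection is hardcoded for brevity, where real code would query the selected target, and remainder handling is omitted):

```rust
#![feature(portable_simd)]
use multiversion::multiversion;
use std::simd::{prelude::*, LaneCount, SupportedLaneCount};

// Layer 2: multiversioned compute kernel, generic over the vector length
// selected by the outer layer. Each target version gets monomorphized for
// every LANES value that callers use, including pairings that can never
// execute (e.g. an SSE2 version manipulating Simd<f32, 8>).
#[multiversion(targets("x86_64+avx2+fma", "x86_64+sse2"))]
fn sum_kernel<const LANES: usize>(chunks: &[Simd<f32, LANES>]) -> f32
where
    LaneCount<LANES>: SupportedLaneCount,
{
    chunks
        .iter()
        .copied()
        .fold(Simd::<f32, LANES>::splat(0.0), |acc, v| acc + v)
        .reduce_sum()
}

// Layer 1: multiversioned app layer, which selects the "native" vector
// length once, allocates data structures in that layout, then calls the
// kernel. Everything in between must be generic over LANES.
#[multiversion(targets("x86_64+avx2+fma", "x86_64+sse2"))]
fn allocate_and_sum(scalars: &[f32]) -> f32 {
    if std::arch::is_x86_feature_detected!("avx2") {
        let chunks: Vec<Simd<f32, 8>> =
            scalars.chunks_exact(8).map(Simd::from_slice).collect();
        sum_kernel(&chunks)
    } else {
        let chunks: Vec<Simd<f32, 4>> =
            scalars.chunks_exact(4).map(Simd::from_slice).collect();
        sum_kernel(&chunks)
    }
}
```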
Obviously, this causes double dispatch overhead, but that may not be a concern (the first layer of dispatch is only reached at data structure allocation time, which only happens once in a well-optimized program), and it may even be fully optimized out in the presence of sufficient inlining.
Less obviously, however, this also results in the generation of "redundant" code (e.g. SSE code manipulating `Simd<f32, 8>`) that will never be called. Besides costing precious compilation time, it turns out that the generation of this "useless" code can also adversely affect later optimizer decisions pertaining to inlining.
Indeed, the compiler will see that there are two or more code paths that manipulate `Simd<f32, 8>`: one coming from the SSE compute kernel version (which will never be called) and one coming from the AVX compute kernel version (which is the one that we are actually interested in). Because the compiler does not realize that these code paths are actually implemented differently at the hardware level, it may then "helpfully" out-line some function calls in order to deduplicate the presumed-redundant code. This causes a significant performance degradation when the out-lining makes performance-critical AVX compute code lose its multiversioning and degrade into slow SSE2 code.
It may seem that a way out of this mess would be to avoid the double dispatch layer altogether, e.g. by introducing proper aligned allocators to Rust. But while that would be desirable for other reasons, like reduced compilation overhead, it would not completely resolve the problem, because there are ISA extensions that one may want to multiversion over which affect code generation but not the SIMD vector length.
See e.g. FMA and AES-NI on x86 today; and in the near future, the whole scalable vector can of worms that Arm and RISC-V have "blessed" us with will provide another example.
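To make this concrete, here is a hypothetical FMA-centric variant of the kernel above: both target versions manipulate the exact same `Simd<f32, 8>` type at the same vector width, and only the generated instructions differ (`mul_add` comes from the `std::simd::StdFloat` trait; same `multiversion` syntax assumptions as before):

```rust
#![feature(portable_simd)]
use multiversion::multiversion;
use std::simd::{Simd, StdFloat};

// Same Simd<f32, 8> type in both versions, so aligned allocators alone
// would not help here: the +fma version lowers mul_add to fused
// multiply-add instructions, while the plain AVX version must take a much
// slower path to preserve the fused (single-rounding) semantics.
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx"))]
fn axpy_kernel(a: f32, xs: &[Simd<f32, 8>], ys: &mut [Simd<f32, 8>]) {
    let a = Simd::splat(a);
    for (y, x) in ys.iter_mut().zip(xs) {
        *y = x.mul_add(a, *y);
    }
}
```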
As far as I know, only special-casing the `Simd` type and all of its methods, so that the inliner sees versions compiled with different `target_feature` flags as different, could resolve this without requiring manual tuning of the optimizer's inlining decisions. But hopefully we'll also agree that this is not really a better option from a compiler complexity point of view.
Existing function attributes like `#[inline]` and `#[inline(always)]` are not a complete solution to this problem because...
- The choice to inline or not to inline a function is a complex tradeoff involving compilation time, code size, L1i cache footprint, compiler backend thresholds that disable specific optimizations for overly complex code, and so on. We may not want to always inline a particular function, but only to do so in specific circumstances where it strongly benefits caller performance. Callee-side inlining directives deprive us of the ability to express this nuance.
- These attributes cannot be applied to closures, which are very often used with iterator adapters and in other performance-sensitive code.
- These attributes cannot be applied to third-party code (std, ndarray...), so it is very easy to end up in a situation where your runtime performance is blocked on some other maintainer deciding whether to merge an optimization that benefits some clients and harms others, which is not a nice state of affairs.
For a long time, I thought that the solution would come from some sort of call-site inlining attribute, along the lines of what clang has been providing for C for a while now (IIRC, Zig also introduced something like this recently). This would move the inlining decision to the caller, which is often the best place to make it.
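For the sake of argument, a Rust transposition might have looked something like this (purely hypothetical syntax, loosely modeled on clang's `[[clang::always_inline]]` statement attribute):

```rust
fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

fn energy(xs: &[f32]) -> f32 {
    // Hypothetical call-site directive: force inlining of *this* call
    // only, leaving every other call site of `dot` unaffected.
    #[inline(always)]
    dot(xs, xs)
}
```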
However, I'm less and less convinced nowadays that this approach is a good fit for Rust, for two reasons:
- In languages with operator overloading like Rust, not every function call looks like a function call or is amenable to the introduction of inlining attributes. What syntax would be used to control inlining of e.g. a Deref implementation? And would you really want to annotate every arithmetic operator in a complex mathematical expression?
- Less trivial code (e.g. iterator adapters again) calls through multiple layers of utility functions. With call-site inlining directives, we can only control inlining decisions of the top-level caller. But what would we do if the inlining problem resides in transitive function calls from transitive callees? Would we need even more complex annotations that break abstraction by allowing us to control the inlining of the N-th function call that a callee transitively makes?
Instead, it seems to me nowadays that the proper solution has to be something akin to GCC's `flatten` function attribute, which can be used to ensure that all calls made by a function are inlined, and that all calls transitively resulting from these inlined calls are recursively inlined as well, where possible.
This may sound like a bit of a sledgehammer, but I think it can actually be scoped well enough to be useful, provided that...
- The performance-critical code path is properly extracted into its own flattened function, to avoid the unnecessary flattening of unrelated code.
- For what it's worth, I think there's no good reason why the GCC `flatten` attribute can only be applied to functions, and it would make perfect sense for an `#[flatten]` Rust attribute to be applicable to code blocks in a finer-grained fashion instead (see the sketch after this list).
- Appropriate exceptions to the general flattening rule are introduced, to ensure that e.g. `#[cold]` functions representing mostly-dead error handling code paths are not inlined even when an indirect caller is flattened.
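Here is what this could look like, using a hypothetical `#[flatten]` attribute (to be clear, none of this syntax exists today):

```rust
// Hypothetical function-level form: every call that hot_kernel makes, and
// every call transitively made by those inlined calls, is inlined if
// possible.
#[flatten]
fn hot_kernel(xs: &[f32], ys: &mut [f32]) {
    ys.iter_mut().zip(xs).for_each(|(y, x)| *y += x * x);
}

fn mixed_workload(xs: &[f32], ys: &mut [f32]) {
    if xs.len() != ys.len() {
        // #[cold] callees would be exempt from flattening, so this
        // mostly-dead error path stays out of line.
        length_mismatch(xs.len(), ys.len());
    }
    // Hypothetical finer-grained block form: only the performance-critical
    // block is flattened, not the whole surrounding function.
    #[flatten]
    {
        ys.iter_mut().zip(xs).for_each(|(y, x)| *y += x * x);
    }
}

#[cold]
fn length_mismatch(xs: usize, ys: usize) -> ! {
    panic!("slice length mismatch: {xs} vs {ys}")
}
```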
What do you think?