We may need some sort of #[flatten]

Some Rust abstractions only provide good runtime performance to the extent that critical functions and closures are inlined.

Iterator adapters that translate into a tight compute loop are a well-known case of this, especially where auto-vectorization is desired. This happens because the compiler backend's loop auto-vectorizer only works when it "sees" the whole final loop without any function call, which means that any decision not to inline an iterator adapter or the associated user-defined function/closure can break it.
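
For concreteness, here is a minimal sketch (my own, not from any particular codebase) of an element-wise kernel that only compiles down to SIMD instructions if the whole adapter stack and the user-provided closure get inlined into a single flat loop:

// Element-wise product: this only auto-vectorizes if iter_mut(),
// zip(), for_each() and the closure are all inlined, so that the
// backend's vectorizer sees one flat loop over the three slices.
fn mul(out: &mut [f32], xs: &[f32], ys: &[f32]) {
    out.iter_mut()
        .zip(xs)
        .zip(ys)
        .for_each(|((o, &x), &y)| *o = x * y);
}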

Until recently, I thought that this was just an unavoidable limitation of iterator adapters, and that it was acceptable as long as there was a way to fall back to manually vectorized code. After all, you need manually vectorized code (or some moral equivalent thereof) for performance as soon as your computation contains any kind of floating-point reduction anyway.

But a few months ago, I also found a more complex case, involving the combination of std::simd and the multiversion crate, for which I could not think of any clean workaround. To cut a long story short, I ended up having to reimplement an ndarray iterator with unsafe code to resolve it, and that is not something I would want to recommend to my students.

Click here for more details on that

By design, multiversioning can only be applied to functions, not types. In applications where long-term storage of SIMD vectors of "native" length is desired (typically as a workaround for the lack of ergonomic aligned storage allocators in Rust), this forces the use of two layers of multiversioning: one at the app layer, where the SIMD vector length is selected and data structures are allocated, and one in the compute kernels themselves. And then the whole call path between these two points must be generic over SIMD vector length...
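
In code, the overall structure looks roughly like this (a hand-rolled nightly sketch with the app-layer dispatch written out manually; the real code would use the multiversion attribute on both layers):

#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// Layer 2: compute kernel, generic over the SIMD width chosen by
// layer 1 (and itself multiversioned again in the real code).
fn kernel<const LANES: usize>(data: &mut [Simd<f32, LANES>])
where
    LaneCount<LANES>: SupportedLaneCount,
{
    for v in data.iter_mut() {
        *v *= *v;
    }
}

// Layer 1: app-level dispatch that picks the "native" width once, at
// data structure allocation time (hand-rolled here for illustration).
fn allocate_and_compute() {
    if std::is_x86_feature_detected!("avx") {
        let mut data = vec![Simd::<f32, 8>::splat(1.0); 1024];
        kernel(&mut data);
    } else {
        let mut data = vec![Simd::<f32, 4>::splat(1.0); 1024];
        kernel(&mut data);
    }
}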

Obviously, this causes double dispatch overhead, but that may not be a concern (since the first layer of dispatch is only called at data structure allocation time, which only happens once in a well-optimized program) and it may even be fully optimized out in the presence of sufficient inlining.

Less obviously, however, this also results in the generation of "redundant" code (e.g. SSE code manipulating Simd<f32, 8>) that will never be called. Besides costing precious compilation time, it turns out that the generation of this "useless" code can also adversely affect later optimizer decisions pertaining to inlining.

Indeed, the compiler will see that there are two or more code paths that manipulate Simd<f32, 8>, one coming from the SSE compute kernel version (which will never be called) and one coming from the AVX compute kernel version (which is the one that we are actually interested in).

Because the compiler does not realize that these code paths are actually implemented differently at the hardware level, it may then "helpfully" out-line some function calls to deduplicate the presumed redundant code. This can cause a significant performance degradation if the out-lining makes performance-critical AVX compute code lose its multiversioning and degrade into slow SSE2 code.

It may seem that a way out of this mess would be to find a way to avoid the double dispatch layer, by e.g. introducing proper aligned allocators to Rust. But while that would be desirable for other reasons like reduced compilation overhead, it would not completely resolve the problem, because there are ISA extensions that one may want to multiversion over, which affect code generation but not SIMD vector length.

See e.g. FMA and AES-NI on x86 today; in the near future, the whole scalable vector can of worms that Arm and RISC-V have "blessed" us with will be another example.

As far as I know, only special-casing the Simd type and all of its methods so that the inliner sees versions compiled with different target_features flags as different could resolve this without requiring manual tuning of the optimizer's inlining decisions. But hopefully we'll also agree that this is not really a better option from a compiler complexity point of view.


Existing function attributes like #[inline] and #[inline(always)] are not a complete solution to this problem because...

  • The choice to inline or not to inline a function is a complex tradeoff involving compilation time, code size, L1i cache footprint, compiler backend thresholds that disable specific optimizations for overly complex code... so we may not want to always inline a particular function, but only to do so in specific circumstances where it strongly benefits caller performance. Callee-side inlining directives deprive us of the ability to express this nuance.
  • These attributes cannot be applied to closures, which are very often used together with iterator adapters and other performance-sensitive code.
  • These attributes cannot be applied to third-party code (std, ndarray...), so it is very easy to end up in a situation where your runtime performance is blocked on some other maintainer deciding whether to merge an optimization that benefits some clients and harms others, which is not a nice state of affairs.

For a long time, I thought that the solution would come from some sort of call-site inlining attribute, along the lines of what clang has been providing for C for a while now (IIRC zig also introduced something like this recently). This would move the decision of whether to inline to the caller, which is often the best place to make it.

However I'm less and less convinced nowadays that this approach is a good fit for Rust, for two reasons:

  1. In languages with operator overloading like Rust, not every function call looks like a function call or is amenable to the introduction of inlining attributes. What syntax would be used to control inlining of e.g. a Deref implementation? And would you really want to annotate every arithmetic operator in a complex mathematical expression?
  2. Less trivial code (e.g. iterator adapters again) calls through multiple layers of utility functions. With call-site inlining directives, we can only control inlining decisions of the top-level caller. But what would we do if the inlining problem resides in transitive function calls from transitive callees? Would we need even more complex annotations that break abstraction by allowing us to control the inlining of the N-th function call that a callee transitively makes?

Instead, it seems to me nowadays that the proper solution has to be something akin to GCC's flatten function attribute, which can be used to ensure that all calls made by a function are inlined, and all calls transitively resulting from these inlined calls, are recursively inlined if possible.
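
Transposed to Rust, the attribute would look something like this (hypothetical syntax, mirroring GCC's C semantics):

// Hypothetical: inline every call made by this function, then
// recursively inline the calls exposed by those inlined bodies
// (where possible), as GCC's flatten attribute does in C.
#[flatten]
fn hot_path(data: &mut [f32]) {
    data.iter_mut().for_each(|x| *x = x.mul_add(2.0, 1.0));
}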

This may sound like a bit of a sledgehammer, but I think it can actually be scoped well enough to be useful, provided that...

  1. The performance-critical code path is properly extracted into its own flattened function, to avoid the unnecessary flattening of unrelated code.
    • For what it's worth, I think there's no good reason why the GCC flatten attribute can only be applied to functions, and it would make perfect sense for an #[flatten] Rust attribute to be applied to code blocks in a finer-grained fashion instead.
  2. Appropriate exceptions to the general flattening rule are introduced to ensure that e.g. #[cold] functions representing mostly-dead error handling code paths are not inlined even when an indirect caller is flattened.

What do you think?

10 Likes

I'd like to see this feature, if only as a way to control inlining of standard library functions.

However, it would probably require an LLVM change. Clang currently doesn't seem to respect the recursive aspect of __attribute__((flatten)). It only forces a single level of inlining, unlike GCC.

4 Likes

This seems like a manual "this function is very hot, optimize it harder than the rest of the program" kind of compile-time/performance tradeoff, one that JIT-compiled languages get as part of their design. I wonder if applying PGO would already have a similar effect.

But I think caller-side inline annotations that don't go as hard would still be useful, e.g. if you want to only annotate some paths through a complex call-graph without flattening the entire thing because it's huge and there are a whole bunch of less-commonly taken paths that aren't exactly cold, just not quite as warm as the main path.

and it would make perfect sense for an #[flatten] Rust attribute to be applied to code blocks in a finer-grained fashion instead.

That'd make sense. But either rustc has to perform this inlining itself or the codegen backends have to offer the necessary features to apply the equivalent of flatten to all the individual calls in a code block.

2 Likes

A single level of inlining for all function calls inside a function would already be very useful. You could have #[flatten(single_level)] and then introduce #[flatten(full)] if and when llvm supports fully recursive flatten.

3 Likes

I did not find any equivalent attribute at the LLVM IR layer. Unless clang is supposed to do the flattening on its own, which would surprise me (and would also mean that rustc can't reuse the associated infrastructure), such an LLVM attribute would likely need to be added as a first step?

It might indeed! However, PGO is quite a pain to use, partly by design (you need to write a big, slow benchmark covering all expected usage of the program, and then must either make it even slower via instrumentation or have access to perf counters) and partly due to quality-of-implementation issues (there was a good talk at this year's RustFest regarding the latter problem, not sure if a recording has been uploaded somewhere).

So when we know ahead of time that a program section is going to be perf-critical and that compiler optimizer heuristics have gotten this wrong before, I think it's good to have less painful alternatives to PGO which ensure that a "normal" build of the program works well.

That being said, LLVM does have a hot function attribute that we do not expose on the Rust side. Maybe it could be used to achieve the desired effect in a less invasive fashion?

I wonder if we could get a good approximation of this without complex caller-side annotations by having more nuanced variants of the #[cold] annotation on the callee side? Like we already have #[inline] as a nuanced version of #[inline(always)].

The semantics of such a #[noflatten]/#[thicken]/#[tepid] annotation (name open to bikeshedding) could be that calls to this function (and the transitive calls that result from it) are ignored by the recursive inliner pass that #[flatten] enforces, but the function is not otherwise pessimized during normal optimization passes.

Or alternatively it could be block-scoped, and target only specific calls to the function. This would give more fine-grained control.

Combined with block-scoped #[flatten], this could be used to delineate a hot path across the call graph without needing syntax to specify, on the caller side, which parts of the callee are hot. Which I would rather avoid because...

  • It would breach the function call abstraction (when you call a function, you are not supposed to be aware of the details of what the function is doing) and thus lead to future compatibility hazards.
  • For arbitrarily deep call graphs, the syntax of the annotation would get arbitrarily complex (basically containing the equivalent of a stack trace), and it sounds like this could get out of hand.

Could it be emulated on the rustc side via some moral equivalent of the following?

// === What the user writes ===
#[flatten] {
    // ... code to be flattened ...
}

// === How rustc desugars it ===
{
    let flattened = #[inline(always)] #[flatten] || {
        // ... code to be flattened ...
    };
    flattened()
}

Or would that have too many undesirable side effects?

1 Like

From the subject I was expecting this to be about flattening types. &mut self in Iterator::next is a killer for optimization, if the iterator object isn't SRoA'd into oblivion.

So I wish we had something to address that more directly too, though I don't know what.
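
A minimal sketch of the shape I have in mind:

// If next() is not fully inlined, `it` must be materialized in memory
// so that `&mut it` can be passed, and every iteration then goes
// through loads/stores of the iterator's fields instead of registers.
// SRoA (scalar replacement of aggregates) is what dissolves `it` into
// registers, and it only triggers after complete inlining.
fn doubled_sum(v: &[i32]) -> i32 {
    let mut it = v.iter().map(|x| x * 2);
    let mut acc = 0;
    while let Some(x) = it.next() {  // takes &mut it on every call
        acc += x;
    }
    acc
}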

3 Likes

Well now you have to annotate the callees too. At that point a caller-side inlining annotation would work as well and you wouldn't have to do an opt-in-opt-out dance and could be more targeted instead. In fast path for process_obligations by the8472 · Pull Request #108815 · rust-lang/rust · GitHub I had a case where the fast path was only a rather narrow slice from the hot function all the way into an external crate. Though in that case there were no other callers and regular #[inline] was ok.

1 Like

I understand why SRoA matters here, but I'm not sure if I fully understand the factors that affect the optimizer's decision to perform this optimization or not. So let me check if I get it right.

Assuming that...

  1. The top-level Iterator::next() implementation is partially or fully #[flatten]ed in such a way that there is no out-lined function call that requires materializing an &mut to the iterator or one of its data members (that would typically be an out-lined call to the next() method of a lower-level iterator in a stack of combinators, or to one of the other Iterator trait methods).
  2. The iterator is used in a typical way, i.e. it is created as part of the right-hand side of a for loop and fully driven by said for loop, or consumed immediately after creation by a collect(). So the compiler can see every direct use of the iterator object by the caller code, and it's called in a simple loop-until-None manner that shouldn't trip any optimizer analysis.
  3. The iterator implementation does not feature any weird constructs that may lead the compiler optimizer to believe that references to the iterator or its internals could possibly escape the current stack frame (to another thread, a syscall, inline asm, etc...) during iteration.

...shouldn't the compiler's optimizer always be allowed to perform SRoA?

And if it is allowed, why would it not do so? Because the iteration code is deemed too cold for the optimization to be worthwhile? Because it goes above some complexity threshold where SRoA is considered to be too costly in terms of compilation time?

I think that in order to fully understand your position, I need the discussion to be reformulated in more concrete terms. So let's introduce some examples, together with some explanations for why I think that #[flatten] is generally more useful than a fine-grained inlining directive.

Scenario 1: A simple call chain

Let's start with a simple example of a performance problem that's already hard to solve today:

// === My code ===

fn caller() {
    // ...
    perf_critical_function();
    // ...
}

fn perf_critical_function() {
    // ...
    let result = third_party::should_be_inlined(input);
    // ...
}

// === Third-party crate ===

// No inlining directive here
pub fn should_be_inlined(/*...*/) -> /*...*/ {
    // ...
    callee();
    // ...
}

fn callee() {
    // ...
}

Through a combination of profiling and static machine code analysis, I have established that I need third_party::should_be_inlined() to be inlined into perf_critical_function() to get good runtime performance. Sadly, the compiler's optimizer does not want to inline it today.

We can solve the immediate problem by introducing a call-site inlining directive along the line of what Clang provides to C-style languages, let's call it #[inline_call] for now:

// Inside of perf_critical_function()
let result = #[inline_call] third_party::should_be_inlined(input);

Alas, from my experience, a common outcome may be what I call the inlining whack-a-mole problem:

  1. perf_critical_function() may stop being inlined into caller().
  2. callee() may stop being inlined into should_be_inlined().

Both of these outlining consequences may or may not have performance implications. After all, my end goal is to achieve good run-time performance, not to inline the entire program into main(). But let us assume for the sake of completeness that they do have undesirable performance implications.


The way to address these two issues is actually different due to differing code ownership:

  1. Failure to inline perf_critical_function() into caller() is not a big problem: I can resolve this situation using any of the tools that we have discussed so far. One way is to annotate this particular call to perf_critical_function() with our new #[inline_call] directive. I will do that if the lack of inlining only appears to be a problem in the particular context of caller()...
    // Inside of caller()
    #[inline_call] perf_critical_function();
    
    ...and another way is to mark perf_critical_function() itself #[inline]. This will be the better choice if I am convinced that all calls to this function should be inlined:
    #[inline]  // NEW
    fn perf_critical_function() { /* ... */ }
    
  2. Failure to inline callee() into should_be_inlined() is a bigger problem, because that's not my code, so I cannot easily change it. For sure, I could submit patches to the third-party crate that perform any of the aforementioned tweaks...
    // Inside of should_be_inlined()
    #[inline_call] callee();
    
    // === OR ===
    
    #[inline]  // NEW
    fn callee() { /* ... */ }
    
    ...but there is no guarantee that my patches will be accepted. If my use case is considered niche enough that only my code is affected, the maintainer may dislike the idea of inlining these calls for everyone.
    • Further, even if the patch is accepted, it may be a long while before I can easily use it, if I am in a situation where I cannot easily use my own patched version and the crate maintainer is busy / has a slow release cycle. This is typically the case if a patch goes into core/std, as using a patched std is not for the faint of heart. But it may also happen in the case of a security-critical dependency, for which maintaining my own patched version until the next release may not be acceptable.

At this point, as far as I can tell, I need another language extension to get the job done:

  • One option is the #[flatten] sledgehammer proposed at the top of this thread. It's not pretty, it will likely inline more than I need, which may cause other issues down the line, and I will get back to all this later. But it should solve my immediate inlining problem.
    // Inside of perf_critical_function()
    let result = #[flatten] third_party::should_be_inlined(input);
    
  • Another option is to introduce a more powerful version of #[inline_call], that allows me to control further inlining decisions inside of the implementation of the function call that I am inlining, in a fine-grained way. But I think this is less desirable for two reasons.
    1. It breaches the function abstraction:

      // Inside of perf_critical_function()
      let result =
          #[inline_call(also_inline_inner(callee))]
          third_party::should_be_inlined(input);
      

      Indeed, with this version, if the author of the third-party crate decides to rename their callee() internal utility function, my higher-order inlining directive will silently stop working (at best) or break my build (at worst).

    2. I can't think of any way to make this kind of fine-grained syntax scale nicely if I am dealing with a more complex call graph where I need many functions to be inlined:

      let result =
          #[inline_call(
              also_inline_inner(
                  callee -> also_inline_inner(
                      internal1 -> also_inline_inner(innermost),
                      internal2,
                  )
             )
          )] third_party::should_be_inlined(input);
      

So even in this simple scenario, I would argue that the #[flatten] approach feels more promising when dealing with complex third-party abstraction stacks like iterators and their adapters.

Scenario 2: Hidden/concise function calls

That's not the only problem with the #[inline_call] style of fine-grained annotation, though. Rust provides many concise ways to call functions without explicitly writing a function name followed by parentheses. Here are just a few examples:

let data = smart_pointer.target_method();  // Hidden call to Deref::deref()
for elem in data {  // Hidden calls to IntoIterator::into_iter() & Iterator::next()
    accumulator += elem.x * elem.y * elem.z;  // Concise arithmetic operators
}

It may be a matter of taste, but I would not like to live in a future where the recommended way to resolve inlining issues at these call sites looks like this:

let data = smart_pointer #[inline_call]. target_method();
for elem #[inline_call(IntoIterator::into_iter, Iterator::next)] in data {
    accumulator #[inline_call]+= elem.x #[inline_call]* elem.y
                                        #[inline_call]* elem.z;
}

Especially if I later need to add even more information to these inline_call directives in order to also inline inner utility functions!


Instead, having a way to enforce inlining across an entire block of code (or, almost equivalently, an entire function) would seem more appropriate:

#[inline_all_calls] {
    let data = smart_pointer.target_method();
    for elem in data {
        accumulator += data.x * data.y * data.z;
    }
}

However, by going in this direction, we lose the ability to precisely specify which calls we want to be inlined, and essentially get a non-recursive version of #[flatten]. For sure, we could recover some of this fine-grained control by at least targeting specific functions (not call-sites)...

#[inline_all_calls(
    Deref::deref,
    IntoIterator::into_iter,
    Iterator::next,
    AddAssign::add_assign,
    Mul::mul,
)] {
    let data = smart_pointer.target_method();
    for elem in data {
        accumulator += data.x * data.y * data.z;
    }
}

...but hopefully we will agree that this syntax does not scale nicely either to more complex code blocks, or to the inlining of inner utility functions within the implementation of each of these functions that we call directly.

To me, this is again an argument that pleads in favor of a #[flatten]-like design.

Scenario 3: Partial flattening

So far, we have seen that the #[flatten] design has several major arguments going for it, compared to a fine-grained approach like #[inline_call]:

  • It reliably resolves the inlining whack-a-mole problem without requiring modifications to the third-party crates that are being called.
    • This should not preclude contributing patches when we think that these crates lack inlining directives in the right places, obviously. But now we are in no rush to do so, and it's no big deal if upstream disagrees with us on the usefulness of specific inline directives for their average audience.
  • It does not require breaching the abstractions of these third-party crates, even if there are inlining issues inside of them too.
  • It scales to arbitrarily complex call trees where inlining must be forced at many different call sites (e.g. autovectorization of iterator adapters, where almost everything must be inlined).
  • It does not introduce too much syntax noise where concision is desired (e.g. arithmetic).
  • It works with closures, which in today's Rust cannot get an #[inline] directive and cannot be named (as my fine-grained #[inline_call(also_inline_inner(...))] would require).

Alas there is no free lunch in programming, so in exchange for all these nice properties, we must pay a high price: by virtue of being all-encompassing, #[flatten] will inevitably inline too much.


I would like to point out that this may not be a problem in some common Rust usage scenarios, like HPC simulations. In HPC, there can be many layers of "zero-cost" abstraction involved (ndarray, faer, slices, iterators, SIMD, rayon, dimensioned, etc.). But once all of these are peeled off, the actual calculation at the bottom of the stack is surprisingly simple.

However, if unconditional #[flatten] is applied to larger code blocks or more complex third-party abstractions, we can expect the usual problems of excessive inlining to arise:

  • Excessive compilation time/RAM footprint
  • Excessive code size
  • CPU ICache overflow
  • The optimizer's code complexity thresholds being reached, resulting in the loss of useful optimizations (e.g. loop unrolling, vectorization, etc.), which defeats the performance goal of inlining

Therefore, although #[flatten] on its own would resolve many simple inlining problems, more complex problems would indeed benefit from a way to restrict its inlining power.


What could these more nuanced approaches look like?

  • As mentioned, I am skeptical that some extended version of #[inline_call] can do the trick here. Maybe I'm not imaginative enough when it comes to syntax, but what I have in mind does not seem to scale to more complex abstraction stacks, involving multiple crates maintained by several different people, and layers of abstraction that optimization directives should not peek under in the interest of future maintainability.
  • For the same reason, I would be skeptical of an extended syntax for #[flatten] that lets you exclude specific functions from the flattening process at the call site. It seems like you would get all the issues that I envision for the extended versions of #[inline_call], but for exclusion lists rather than inclusion lists.
  • This is why I think that in this particular case, which again does not encompass all use cases of #[flatten], there is no sane choice but to collaborate with upstream to selectively introduce annotations that disable call flattening inside of the implementation of the function of interest.
    • Often, this can take the form of extracting code paths which are known to always be cold into #[cold] functions (see the sketch after this list).
    • In scenarios where the code is more "lukewarm" than cold and marking it cold would pessimize callers too much, a #[noflatten]/#[endflatten] directive could be introduced so that one can exclude specific code blocks from the flattening process (function calls inside of these blocks are not recursively inlined even when the caller uses #[flatten]), without otherwise affecting the way the compiler optimizes them.
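
To make that first sub-bullet concrete, here is what the #[cold] extraction pattern typically looks like (a generic sketch, all names made up):

// Unlikely path extracted into a #[cold] function: a #[flatten]ed
// caller would skip inlining it, keeping the hot code path small.
#[cold]
#[inline(never)]
fn len_mismatch(a: usize, b: usize) -> ! {
    panic!("slice length mismatch: {a} vs {b}");
}

fn sum_products(xs: &[f32], ys: &[f32]) -> f32 {
    if xs.len() != ys.len() {
        len_mismatch(xs.len(), ys.len());
    }
    // Hot code stays free of the panic machinery
    xs.iter().zip(ys).map(|(&x, &y)| x * y).sum()
}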

Does this help clarify my point of view? And if not, can you use these same examples to tell me where your point of view differs, why, and what kind of alternate design you would envision?

Yep, that's pretty much the thing I'm concerned about. If you're relying heavily on 3rd-party code, then that 3rd-party code might contain a lot of cold or lukewarm branches (panics, uncommon-case handling, loop trailers, one-time initialization). So you try to flatten it, realize that it does too much, and then you end up needing to modify your callee graph anyway, about whose feasibility you have taken two different positions:

...but there is no guarantee that my patches will be accepted. If my use case is considered niche enough that only my code is affected, the maintainer may dislike the idea of inlining these calls for everyone.

This is why I think that in this particular case, which again does not encompass all use cases of #[flatten], there is no sane choice but to collaborate with upstream to selectively introduce annotations that disable call flattening inside of the implementation of the function of interest.

All I'm saying is that flatten might sometimes be useful, if your code happens to have a shape where it provides good results. At other times, especially in branch-heavy code rather than easily vectorized loops, it can be necessary to apply far finer control, to only inline the hottest paths through the call-tree and keep the rest at normal optimization levels.

Whether one is more common than the other mostly depends on what kind of code you're dealing with.

For example, if you look at the rust codebase itself, you'll find that even the iterators (which normally are more like a call-pillar than a call-tree) contain complex closures or fold over Results, and end up being quite branchy rather than nice vectorizable loops.

It works with closures, which cannot get an #[inline] directive and cannot be named

You can annotate closures on nightly.

I can't think of any way to make this kind of fine-grained syntax scale nicely if I am dealing with a more complex call graph where I need many functions to be inlined:

Well, you could use a graph-theoretic approach. The starting point is your annotation, you define matchers (including wildcards) for paths through the graph and anything that matches gets inlined. But if it gets complex enough then yes, it shouldn't be an annotation but an external hint file, similar to PGO profiles, though perhaps human-writable if this is meant to be done manually.

If critical performance of an application depends on the precise inlining through several layers of 3rd-party crates, that already is a terrible situation, because you're now not only depending on the optimizer's whims but also on the 3rd-party authors'.

1 Like

I agree with a lot of what you say.

In particular, I agree that having more complex graph queries with wildcards and spillover to external files might be the missing idea that would work around the combinatorial complexity issue of fine-grained call site directives, and reduce their leaky abstraction problem. Especially if combined with block-scoping to handle "invisible" function calls and reduce verbosity in call-dense environments.

Indeed, a recursive catch-all name wildcard would be almost enough to make #[flatten] become a special case of query-based #[inline_call]. The main complication is that we'd still want to exclude #[cold] functions, so the call graph query language would need to be able to match function attributes in addition to names.

I also agree that my ideas for "descoping" #[flatten] do not really work in the use cases that I have in mind, where I can't wait for upstream to merge/release contributions or even rely on them to accept the contributions.

And I also agree with this...

If critical performance of an application depends on the precise inlining through several layers of 3rd-party crates, that already is a terrible situation, because you're now not only depending on the optimizer's whims but also on the 3rd-party authors'.

...though I think this ship has sailed: no one would want to reimplement ndarray, std::simd, or the whole core/std + third-party iterator ecosystem every time they start a new numerical computing project.

All that being said, though, I do have one nagging worry: at which point does this proposal become so complex that no one will want to implement it or maintain an externally contributed implementation, on either the rustc or the LLVM side?

I can't fully answer that. This generally involves lengthy discussions about cost-benefit and alternatives.

Perhaps some of the easier ones:

  • how much of a perf difference are you seeing
  • how critical is it to your application
  • as mentioned earlier, have you tried PGO?
  • have you tried extracting the hot part to a crate and then tuning inlining thresholds and the like
  • could your yak-shaving be solved by attacking the problem elsewhere (you mention allocations and inlining decisions around target_features)

1 Like

BTW, Rust has accidentally stabilized attributes on closures when they're inside expressions:

(1..10).map(#[inline(always)] |i| { i * 2 }).sum::<i32>();

2 Likes

Lack of inlining in the wrong loops typically results in the complete loss of some major loop optimizations. These include...

  • Autovectorization
    • For inner loops that fit well in the L1d cache, the slowdown ranges from 2x to 64x depending on the target CPU and data type. For my own work, I'm typically manipulating f32s on x86 without AVX-512, so we're in the 4-8x range.
  • Unrolling
    • This typically results in under-utilization of hardware ILP and extra slowdown caused by latency-bound dependency chains. Depending on how well the hardware tolerates the latter (x86 CPUs tend to be quite tolerant compared to other arches like GPUs), we're typically talking about an extra 2-6x slowdown in compute-bound loops.
    • It should be noted that compiler optimizers are not super-skilled at this optimization and I often need to give them a hand anyway. But my favorite manual unrolling hack, which avoids duplicating compute code or tweaking loop bounds (which would be bad for maintainability), also relies on aggressive inlining to work (a generic example of this kind of pattern is sketched after this list).
  • Bound check elision
    • This optimization is even more brittle than unrolling, so whenever I can use iterators instead, I do so. But some loops are just not easy to rewrite using available iterators, and then the choice becomes rolling my own iterator (yuck, lots of unsafe) vs relying on the compiler to elide the bound checks, so in that case I appreciate when the bound check elision works.
    • The perf impact is a little nuanced: bound checks themselves are not that expensive (though a typical CPU can only do one every 1-2cy, so if you need to access many different arrays, it can easily cause a 2-6x slowdown wrt ALU-bound code, which can do multiple ops per cycle). But loss of bound check elision typically also results in the loss of vectorization and unrolling, and that's way more costly.
  • And we can also lose some other less critical but nice-to-have optimizations, like the one whose name I don't know which does the ASM equivalent of turning for (int i = 0; i < N; ++i) { /* use arr[i] */ } into for (T* elem = &arr[0]; elem < &arr[N]; ++elem) { /* use *elem */ }, and thus slightly reduces the register pressure and AGU workload linked to array indexing.
    • I would also put in this "less bad, still bad" bucket the overhead of function calls themselves, or more precisely the overhead of spilling and reloading registers on every function call, losing SRoA for iterator structs, etc. (I have yet to see code that is bottlenecked by call/ret themselves). That still costs a small constant-factor slowdown, but it's not in the same badness range as losing more important optimizations like vectorization and unrolling. It is, however, highly correlated with losing these other optimizations.
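
As promised above, here is the general shape of such an inlining-reliant unrolling pattern (a generic sketch, not necessarily my exact hack):

// Unrolling without duplicating the compute code: the inner fixed-size
// loop gives the backend 8 independent accumulators (ILP), but this
// only pays off if chunks_exact() and both loops are fully inlined.
fn sum(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let mut chunks = xs.chunks_exact(8);
    for chunk in &mut chunks {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}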

Because these losses multiply each other, the total loss of compute throughput in the affected loop can easily go in the ~100x range. But depending on how performance-critical this particular compute loop is to the overall application, the application-wide impact will vary.

HPC computations are typically completely bottlenecked by a few tight compute loops (if you managed to avoid being IO-bound), so the impact on these will be huge. But for less compute-bound activities like O(N) data crunching or web servers, we would probably go into more reasonable territory, like maybe a 1.1-3x slowdown or something.

I get these issues all the time, in many different codebases, so maybe the easiest answer is to rephrase this as "how critical are my applications to the broader ecosystem" :slight_smile:

Right now, I have not convinced #work to embrace Rust for Serious Big Projects, so I mostly use it for smaller utilities where wasting a bit of compute time is not a huge deal. What I did manage, however, is to integrate a Rust chapter into our high-performance numerical computing course, which attracts a large audience (500 people registered for the last "multilingual" course in July, and 200 for a smaller Rust-specific, online-only spinoff next week).

My goal with this Rust compute teaching activity is to convince people that Rust can be a solid contender to the big players of that ecosystem (C++, Fortran, Python + native backend...) in scenarios where long-term human scalability and maintainability matters more than having the best established library or GPU ecosystem. Because attracting more people to the Rust dark side is the only way we are ever going to get a more established library or GPU ecosystem anyhow :wink:

But in this performance-focused crowd, it will be much harder to convince people if I cannot show that given a reasonable amount of work, Rust can achieve similar performance to C++ and Fortran, which to me means either "I look at the ASM from the compiler and I have no idea how I could have done it better" or "the computation is simple enough that I can compute the theoretical hardware throughput limit and we are right there as we should be".

In this context, any language or compiler feature that makes it easier to get good codegen out of Rust helps. I will leave you to decide how critical that is.

I guess I should give it a try at some point, but I really hope we won't end up with the official statement that PGO is necessary to get good performance out of even relatively simple numerical computations in Rust. Because PGO is a huge pain from the user side, and on the C++/Fortran side it is only rarely needed, so that's not a good look from a language comparison PoV...

Could that affect the compiler's ability to inline code from third-party crates? Because that's most often where the issue lies. I have regularly called LTO for help there, but never got any significant performance gain out of it: even when LTO allows the compiler to inline the offending function, it stubbornly refuses to do so. Perhaps a combination of LTO and aggressive compiler backend flags could work?

Well, that's the specific issue that led me to throw the towel and open this particular forum thread, but ndarray and iterator adapter inlining issues are an everyday annoyance. Resolving that for good (not iteratively in a fine-grained way, little by little) would, on its own, be a major quality-of-life improvement for numerical Rust practitioners. It's just that since I got inlining issues in other contexts, I figured that maybe attacking the problem from the general inlining angle was most useful.

3 Likes

Well that's a huge difference and I think it'd at least be a good motivation for people to look for some solution. Though I haven't seen your code, so the solution might also turn out to be "just set flag X" or "you're using it wrong". So it might help if you could provide testcases that still involve popular 3rd-party libraries and the equivalent hand-rolled loop showing a huge perf difference. Perhaps even two different flavors to show that this is a general problem rather than something that only needs a spot fix.

There's a difference between a function from a different crate being available for inlining at all (generic and #[inline] ones usually are, others depend on heuristics/thresholds) and them actually being inlined (also heuristics and thresholds).

Not sure if I've understood your issue correctly. If they stem from calls that change target-features, then I think flatten won't help, because those will still be inlining barriers. This would have to be solved by more clever codegen that starts using those features in the middle of the function without contaminating the branches where the features might not be available. Or by ways to make it more ergonomic to apply the multiversioning further up in the call tree.

2 Likes

another case where this is handy is to allow llvm to do more constant folding, especially useful for safety-preserving assertions that in practice almost always run on constant values. this would be an effective performance stopgap until we have const fn in traits.
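
for example, something like this (hand-wavy sketch, names made up): if window_start() is inlined where n is a compile-time constant, llvm can fold the assertion away instead of emitting a runtime check.

// the assertion preserves safety for arbitrary inputs, but constant
// folds to nothing at call sites where `n` is statically known
#[inline]
fn window_start(len: usize, n: usize) -> usize {
    assert!(n <= len, "window larger than buffer");
    len - n
}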

1 Like

rustc already has its own inlining pass, actually

doing this in rustc also makes it easier to support cranelift.

1 Like

this seems similar to my old proposal that allows annotating functions with attributes through an external file

1 Like

In your opinion, what would be a good place to report test cases in the context of this search for a more general solution?

So far, whenever I have found an "easy" inlining issue that I can easily reduce into a 5-lines reproducer, I have reported it as an issue to the project whose function does not get inlined when I need it to (e.g. rust-lang/rust) and it resulted in the affected function being marked #[inline] or #[inline(always)].

This local fix approach usually works (which is why I was not motivated to open this forum thread until recently), but it does have a number of downsides that make me want better alternatives/more choices in future Rust:

  • Applying local fixes function by function consumes a significant amount of person-power, both on my side (test case reduction, reporting, follow-up) and on upstream's side (bug triage & clarification, implement fix, possibly perf regression tests, release). Each small fix is less costly than a general solution, for sure, but accumulated over many small fixes I think it should add up to more work than a general fix.
  • Reducing larger test cases is very time consuming, so when I'm too busy (as I often am), I don't have the time for it. Yet I feel bad about handing over 1-10kLOCs to someone in a bug report with the moral equivalent of a "good luck lol" comment, because I know it will just become another one of the 300 abandoned upstream bug reports that make upstream people depressed, and making them feel bad won't solve my problem. This means that I'll end up prioritizing the reporting of inlining bugs by how nice-looking the bug report looks, rather than by their actual performance impact on my side. And for the nastier-looking bugs, I will often end up preferring local workarounds that are quite costly on my side (rewrite my own version using unsafe, spend hours randomly rewriting many semantically equivalent versions of the code in the hope of finding one that the optimizer likes better, etc) even when I genuinely think that an upstream fix would have been easier...
  • A few times before, I've been reminded that marking stuff #[inline] can make upstream uneasy in scenarios where it's not obvious that the function should be inlined in all cases, and my use case feels like an edge case. In this scenario, I can insist a little if the impact on my side is very large / hard to work around, but I do think that a local inlining hint as discussed in this thread would be better for everyone involved.
  • Sometimes, I would like to have an alternative to telling people to use the latest upstream version. This is typically the case with rustc. I like to track the MSRV of my projects and only update it for good reasons, because I want to be nice to those people who want to conveniently use their distro's automatically updated rustc package, and to the embedded people whose Rust toolchains take forever to get certified. But then I hesitate to rely on constructs that are only efficient on rustc versions way above my MSRV, as that's kind of a soft-incompatibility. Since these people are a minority, I wouldn't mind telling them to do something special with some undesirable side effects, like slapping a #[flatten] in their code, in order to accommodate the limitations of their older compiler version. But I can't afford to maintain a dual code path just for them, and I don't want to leave them with unavoidably slow code either.

This part I knew about. The question is, when the heuristics/thresholds don't work for me, can I do something about it on my side using the tools available today on stable (compiler codegen flags, etc.)?

Yeah, that top-level issue is trickier than a typical iterator inlining bug, which is why I have not yet had the chance to spend enough time fully checking my understanding of it. But from my current understanding, what happens is something along these lines:

  • When I mark a function with the #[multiversion] attribute from the multiversion crate, it results (among other things) in the code being copy-pasted into N different backend functions with different target-features attributes applied.
  • The optimizer notices that the body of these functions is identical, and tries to be helpful by factoring out common code into one or more out-lined functions that do not have the same target-features attributes applied.
  • This results in a performance disaster because (1) some of these functions must stay inlined for performance and (2) the outlined code does not benefit from the intended target-specific optimizations, so e.g. FMA is compiled into an indirect libm scalar call, AVX is not used...
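
Concretely, my mental model of what the macro expands to is roughly the following (a simplified reconstruction, not the actual multiversion output):

// N copies of the same body, differing only in target-features. The
// hazard: the optimizer may "deduplicate" their identical-looking code
// into a shared helper that carries no target-features at all.
#[target_feature(enable = "avx2,fma")]
unsafe fn kernel_avx2_fma(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = x.mul_add(2.0, 1.0);
    }
}

#[target_feature(enable = "sse4.1")]
unsafe fn kernel_sse41(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = x.mul_add(2.0, 1.0);
    }
}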

If my understanding is correct (again, I should probably check it further with more experimenting to be 100% sure), then aggressive call-site inlining directives are indeed only one way to resolve this problem. But they would resolve it, in the sense that given a multiversion patch that slaps a #[flatten] on each generated function copy, no one would get this issue anymore. Though some people would likely get other issues related to #[flatten].

Another approach that would benefit multiversion users only, but benefit them more (fewer #[flatten] side-effects), would be to improve interactions between target-features and outlining so that the optimizer is not allowed to out-line common code from N functions with target-features attributes into a single common utility function that does not have those attributes.

But I think we could use call-site inlining directives anyway in order to resolve the other everyday issues with iterator adapters & friends, which is why I would be more partial to having these first, and see if they also provide a good enough workaround to the multiversion problem.

On a related note, one of my top uses of #[inline] directives is when I want to hit an optimizer special case but don't want to hardcode the associated bit trick directly in my code.

For example, I have written several instances of this pattern before:

struct Ring<T: Default> {
    storage: Box<[T]>,
    size_pow2: u32,
}

impl<T: Default> Ring<T> {
    fn new(size_lower_bound: usize) -> Self {
        let size = size_lower_bound.next_power_of_two();
        let size_pow2 = size.trailing_zeros();
        Self {
            // repeat_with() only requires T: Default, unlike repeat(),
            // which would additionally require T: Clone
            storage: std::iter::repeat_with(T::default)
                               .take(size)
                               .collect(),
            size_pow2,
        }
    }

    // Must be inlined to get power-of-two fast path
    #[inline]
    const fn size(&self) -> usize {
        1 << self.size_pow2
    }

    fn index(&self, pos: usize) -> &T {
        // Look ma, power-of-two fast path without ugly bit tricks
        &self.storage[pos % self.size()]
    }
}

That being said, for this particular purpose, I have often found #[inline] directives on function declarations to be sufficient. So I would be curious to hear more about your examples that would need call-site directives, if you have any at hand.

1 Like