Can/Should we add an inlining pass at -Copt-level=1?

So... Yeah. Can/Should we add¹ an inlining pass to the set of passes that run for builds with opt-level = 1?

Rust is really reliant on #[inline] for perf — more so than even C++. This is true of both the stdlib code and user code. Unfortunately, inlining does not happen until opt-level = 2. In my experience there's almost no difference between -O1 and -O0 at the moment, but you get a huge jump at -O2.
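To make the OP's point concrete, here's a minimal sketch (the function name and values are made up for illustration) of the kind of hint being discussed:

```rust
// Hypothetical library function. Without the #[inline] hint, a
// non-generic function like this is not normally eligible for
// cross-crate inlining, because its body isn't emitted into
// downstream crates' codegen units.
#[inline]
pub fn square(x: u64) -> u64 {
    x * x
}

fn main() {
    // At opt-level = 2 this call is typically folded away entirely;
    // at opt-level = 0/1 it generally remains a real function call.
    assert_eq!(square(7), 49);
}
```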

I get that it's a compile-perf hit, which is why I'm not suggesting it for -O0. I'm also aware that at some point in the future an MIR-based inliner may or may not solve some of this, but that doesn't preclude addressing it now.

(Currently, to get better perf from non-release builds without moving all the way to using -O2 for debug builds, in my game engine I've reimplemented some stuff from libcore/liballoc, just with more aggressive #[inline(always)]. This is obviously suboptimal, and rather than continuing down that path, it would be nice if we could just... do something about it.)

¹ Other than the #[inline(always)] one which, well, always runs.

> Other than the #[inline(always)] one which, well, always runs.

https://doc.rust-lang.org/reference/attributes/codegen.html

> #[inline(always)] suggests that an inline expansion should always be performed.
>
> Note: #[inline] in every form is a hint, with no requirements on the language to place a copy of the attributed function in the caller.

I meant that it runs at all optimization levels (unless you do something like no-prepopulate-passes)

Maybe libstd could use #[inline(always)] more?

What's the status of the MIR inliner?

Maybe we could run it, with very tame heuristics, even in -O1. That could plausibly pick up most of the problems, and as it'd apply to the generic code might not give LLVM that much extra work to do either.

I'd really rather rustc just start to be smarter about this. I don't want to force the ecosystem to start putting #[inline(always)] all over things just to make -O1 more useful either.


Let's give it a shot. I've tried to inline functions that appear un-inlined for basic things like `for _ in 0..10` and `for x in slice`.


I know it was previously suggested on IRLO that we should #[inline(always)] any operation "cheaper than a function call" aggressively, and iirc (though I can't find the thread) the consensus there was that we don't care about -O0 perf at all, so we want to be hesitant in adding any more #[inline(always)] to the stdlib.

I personally think there's room both for more #[inline(always)] given some good heuristic for when that won't impact O0 compile time (roughly "as cheap as a call"), and for some #[inline(always(O1))] that does #[inline(always)] but only at O1 or higher optimization (roughly "core zero-cost abstraction that's unreasonably costly in debug builds").
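There is no #[inline(always(O1))] today; as a rough, imperfect approximation, one can key the attribute off debug_assertions via cfg_attr. Note this tracks the build profile rather than the actual opt-level, so treat this sketch as an assumption-laden workaround, not the proposed feature:

```rust
// Sketch: apply #[inline(always)] only in builds without debug
// assertions (typically release profiles). `debug_assertions` is a
// proxy for the profile, not for -C opt-level, so this only
// approximates a hypothetical `#[inline(always(O1))]`.
#[cfg_attr(not(debug_assertions), inline(always))]
fn wrapping_inc(x: u32) -> u32 {
    x.wrapping_add(1)
}

fn main() {
    assert_eq!(wrapping_inc(u32::MAX), 0);
    assert_eq!(wrapping_inc(41), 42);
}
```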

Canonical blog post on the topic:


If that was really the consensus, I'm seriously shocked. Maybe I should send my undebuggable program[1], which makes heavy use of iterator methods…

[1] That was meant literally. I probably could've let it run for over an hour without a result, whereas the release build took around 10 seconds to finish.


The ideal IIRC is that

  • -O0 is the fastest iteration time possible, while still following the normal compilation flow (i.e. it's not just piped through miri);
  • -O1 cuts out the low hanging fruit for optimization without significantly impacting debugging; and
  • -O2 is full optimization, maximum performance.

Plus, that you're able to mix and match which crates are optimized at what level (though this is limited somewhat due to monomorphization).
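For reference, the mix-and-match part is already expressible in Cargo via profile overrides; a sketch of a Cargo.toml fragment (the section names are Cargo's real syntax, the chosen levels are just illustrative):

```toml
# Keep the local crate at -O0 for fast iteration and debuggability,
# while optimizing all dependencies in the dev profile.
[profile.dev]
opt-level = 0

[profile.dev.package."*"]
opt-level = 2
```

As noted above, this is limited by monomorphization: generic code from dependencies is instantiated (and codegen'd) in the local crate, so it inherits the local crate's opt-level.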

I definitely think that more things can and should be (MIR?) inlined in -O0, when the function is basically just doing one micro op, or just calling another function. But I think that general outline is fairly non-contentious. At the very minimum, there needs to be some inlining of "unimportant" frames at -O1 to make it a meaningful improvement over -O0.

(Eventually, the difference between (defaulted) -O0 and -O1 might end up being that -O0 uses cranelift and -O1 uses LLVM, which is sure to make a difference.)

Is it possible for you to set the inline-threshold? It should work with O1.
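For reference, rustc exposes an LLVM inline-threshold knob; a hedged sketch of the invocation (exact spelling and availability depend on the toolchain version — `-C inline-threshold` has since been deprecated in favor of passing the option through `-C llvm-args`):

```
# Older toolchains: raise LLVM's inlining threshold at opt-level 1.
rustc -C opt-level=1 -C inline-threshold=275 main.rs

# Newer toolchains: pass the option to LLVM directly.
rustc -C opt-level=1 -C llvm-args=--inline-threshold=275 main.rs
```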


Recent success story of inline(always) benefits with UnsafeCell: Use `#[inline(always)]` on trivial UnsafeCell methods by joshtriplett · Pull Request #83858 · rust-lang/rust · GitHub. My takeaway was that, yes, inline(always) can improve compile time if used in the right places.

I understand if these need to be used judiciously, carefully.


Should it? The implication I get from that is that it's not applicable, because we don't run the inliner on O1.

My understanding here is that specifying the inline threshold always adds the inline pass.


So it turns out inline(always) is extremely performance-sensitive. I've made changes that seemed super simple and obvious, and it noticeably regressed compilation time.

It's really bizarrely sensitive. In std there's:

    fn into_iter(self) -> I {
        self
    }

which I'd think is a useless no-op that should obviously be inlined. But inlining it slows down compilation up to 3% (that's pretty big for this kind of change).


So the good news is that inlining pointer methods was acceptable. Unfortunately, inlining the basic iter methods for slices was a 3.7% slowdown. Inlining of ranges was another 3% slowdown.

For the slice methods PR, we need to keep in mind that inlining happens from the bottom up. When we see this:

    #[inline(always)]
    fn into_iter(self) -> slice::Iter<'a, T> {
        self.iter()
    }

This doesn't mean that rustc should just replace any into_iter call by .iter(). Inlines happen bottom up: first iter is compiled, then into_iter is compiled as a function, and maybe iter is inlined into it, using the regular heuristics for iter. If that happens, into_iter is not so trivial anymore.

After that, we tell rustc/LLVM to unconditionally inline the result of that into whatever called .into_iter(). For this reason, the #[inline(always)] attrs should probably be avoided on non-trivial methods, such as those from the Index trait or this .iter(). The triviality of a function needs to be judged from the total amount of code it can expand to (recursively). :slightly_smiling_face:
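A hypothetical chain restating the bottom-up point (names and bodies are invented for illustration, not taken from std):

```rust
// `outer` looks like a trivial one-line forwarder, which makes
// #[inline(always)] seem harmless. But inlining runs bottom up:
// `inner` may first be inlined into `outer`'s body, and only then
// is `outer` force-inlined into every caller — so what callers
// actually receive can be `inner`'s full expansion.
#[inline]
fn inner(v: &[u32]) -> usize {
    v.iter().filter(|&&x| x != 0).count()
}

#[inline(always)]
fn outer(v: &[u32]) -> usize {
    inner(v) // trivial-looking, but may expand to inner's whole body
}

fn main() {
    assert_eq!(outer(&[1, 0, 2, 0, 3]), 3);
}
```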


Ah, when you state it that way it seems obvious :smiley:

So this really blocks on a MIR inliner that would presumably operate top-down, inlining trivial function calls instead of bottom up?


Yeah, I was wondering about that recently.

I wonder if you could do a better "worse is better" optimization pass if you go top-down; kind of like how Cranelift gets decent performance by doing very basic optimizations in a very streamlined way.

So the top-down pass would run on large functions instead of small ones, and would symbolically execute the function, doing peephole optimizations as it goes (e.g. dead code elimination, eliminating local SSA duplicates, etc.) and would inline small functions recursively.

This might be an improvement for generic functions (eg the into_iter() methods) that are only ever instantiated with a single substitution in the entire project, so the bottom-up approach generates intermediary data that isn't reused.


So, continuing the discussion in this thread, it sounds like these are cases that should be handled by -O1?

A 3.7% slowdown probably isn't desirable for -O0 builds, even in exchange for non-negligible runtime improvements, but it fits the "low-hanging fruit" approach @CAD97 describes.