So... Yeah. Can/Should we add¹ an inlining pass to the set of passes that run for builds with opt-level = 1?
Rust is really reliant on #[inline] for perf — more so than even C++. This is true of both the stdlib code and user code. Unfortunately, inlining does not happen until opt-level = 2. In my experience there's almost no difference between -O1 and -O0 at the moment, but you get a huge jump at -O2.
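To make the reliance concrete, here's a sketch with a hypothetical newtype (not from any real crate): a non-generic method like this can't be inlined across crates in a non-LTO build unless it carries #[inline], so below -O2 the abstraction stops being zero-cost.

```rust
// Hypothetical newtype illustrating the kind of tiny wrapper Rust code
// leans on everywhere. Without inlining, every call to to_fahrenheit()
// pays the cost of a real function call just to do one multiply-add.
#[derive(Copy, Clone)]
pub struct Celsius(pub f64);

impl Celsius {
    // Non-generic, so cross-crate inlining needs the #[inline] hint —
    // and even then nothing happens until the inlining passes run.
    #[inline]
    pub fn to_fahrenheit(self) -> f64 {
        self.0 * 9.0 / 5.0 + 32.0
    }
}
```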
I get that it's a compile-perf hit, which is why I'm not suggesting it for -O0. I'm also aware that at some point in the future an MIR-based inliner may or may not solve some of this, but that doesn't preclude addressing it now.
(Currently, to get better perf from non-release builds without moving all the way to using -O2 for debug builds, in my game engine I've reimplemented some stuff from libcore/liballoc, just with more aggressive #[inline(always)] — this is suboptimal, obviously, and rather than continuing down that path, it would be nice if we could just... do something about it.)
¹ Other than the #[inline(always)] one which, well, always runs.
Maybe we could run it, with very tame heuristics, even in -O1. That could plausibly pick up most of the problems, and as it'd apply to the generic code might not give LLVM that much extra work to do either.
I'd really rather rustc just start to be smarter about this. I don't want to force the ecosystem to start putting #[inline(always)] all over things just to make -O1 more useful either.
I know it was previously suggested on IRLO that we should aggressively #[inline(always)] any operation "cheaper than a function call", and IIRC (though I can't find the thread) the consensus there was that we don't care about -O0 perf at all, so we should be hesitant about adding any more #[inline(always)] to the stdlib.
I personally think there's room both for more #[inline(always)], given some good heuristic for when that won't impact -O0 compile time (roughly "as cheap as a call"), and for some #[inline(always(O1))] that acts like #[inline(always)] but only at O1 or higher optimization (roughly "core zero-cost abstraction that's unreasonably costly in debug builds").
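For what it's worth, a rough approximation of that second attribute is possible today with cfg_attr — though it keys off debug-assertions rather than opt-level, so it's only a stand-in for the hypothetical #[inline(always(O1))] sketched above:

```rust
// Rough approximation (assumption: debug-assertions is a usable proxy for
// "unoptimized build"; it is NOT actually tied to opt-level): apply
// inline(always) only when debug assertions are compiled out.
#[cfg_attr(not(debug_assertions), inline(always))]
pub fn double(x: u32) -> u32 {
    x * 2
}
```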
If that was really the consensus, I'm seriously shocked. Maybe I should send my undebuggable program[1], which makes heavy use of iterator methods…
[1] That was meant literally. I probably could've let it run for over an hour without a result, whereas the release build took around 10 seconds to finish.
-O0 is the fastest iteration time possible, while still following the normal compilation flow (i.e. it's not just piped through miri);
-O1 cuts out the low hanging fruit for optimization without significantly impacting debugging; and
-O2 is full optimization, maximum performance.
Plus, you're able to mix and match which crates are optimized at what level (though this is somewhat limited by monomorphization).
I definitely think that more things can and should be (MIR?) inlined at -O0, when the function is basically just doing one micro-op or just calling another function. But I think that general outline is fairly non-contentious. At the very minimum, there needs to be some inlining of "unimportant" frames at -O1 to make it a meaningful improvement over -O0.
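A sketch of the two kinds of "unimportant" frames meant here, using a hypothetical newtype — one method is a single field read, the other just forwards to it; both seem like obvious candidates for MIR inlining even at -O0:

```rust
// Hypothetical newtype, for illustration only.
pub struct Meters(pub f64);

impl Meters {
    // Basically one micro-op: a field read.
    pub fn get(&self) -> f64 {
        self.0
    }

    // Basically just calling another function.
    pub fn value(&self) -> f64 {
        self.get()
    }
}
```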
(Eventually, the difference between (defaulted) -O0 and -O1 might end up being that -O0 uses cranelift and -O1 uses LLVM, which is sure to make a difference.)
So it turns out inline(always) is extremely performance-sensitive. I've made changes that seemed super simple and obvious, and they noticeably regressed compilation time.
One such change involved a function I'd think is a useless no-op that should obviously be inlined. But inlining it slows down compilation by up to 3% (which is pretty big for this kind of change).
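For illustration — this is a hypothetical stand-in, not necessarily the actual function under discussion — a "useless no-op" of this shape, like the stdlib's blanket identity into_iter for iterators, just returns its argument unchanged:

```rust
// Hypothetical stand-in for the kind of no-op forwarder in question:
// it does nothing but hand back the value it was given.
pub fn identity_into_iter<I: Iterator>(iter: I) -> I {
    iter
}
```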
So the good news is that inlining pointer methods was acceptable. Unfortunately, inlining basic iter methods for slices caused a 3.7% slowdown, and inlining of ranges another 3% slowdown.
This doesn't mean that rustc should just replace any into_iter with .iter(). Inlining happens bottom-up: first iter is compiled; then into_iter is compiled as a function, and maybe iter is inlined into it, using the regular heuristics for iter. If that happens, into_iter is not so trivial anymore.
After that, we tell rustc/LLVM to unconditionally inline the result of that into whatever called .into_iter(). For this reason, #[inline(always)] attrs should probably be avoided on non-trivial methods like those from the Index trait or this .iter(). The triviality of a function needs to be judged from the total amount of code it can expand to (recursively).
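The bottom-up effect can be sketched with a hypothetical wrapper type: into_iter looks like a one-line forwarder in source, but once iter — and the slice-iterator construction inside it — has been inlined into it, its compiled body is no longer trivial, so #[inline(always)] here would force that whole expansion into every caller.

```rust
// Hypothetical wrapper type, for illustration only.
pub struct Wrapper<'a>(pub &'a [i32]);

impl<'a> Wrapper<'a> {
    pub fn iter(&self) -> std::slice::Iter<'a, i32> {
        // Slice iterator construction gets inlined here first...
        self.0.iter()
    }

    pub fn into_iter(self) -> std::slice::Iter<'a, i32> {
        // ...so after bottom-up inlining, this "trivial" forwarder
        // actually contains all of iter()'s compiled body.
        self.iter()
    }
}
```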
I wonder if you could do a better "worse is better" optimization pass if you go top-down; kind of like how Cranelift gets decent performance by doing very basic optimizations in a very streamlined way.
So the top-down pass would run on large functions instead of small ones, and would symbolically execute the function, doing peephole optimizations as it goes (e.g. dead code elimination, eliminating local SSA duplicates, etc.), and would inline small functions recursively.
This might be an improvement for generic functions (e.g. the into_iter() methods) that are only ever instantiated with a single substitution in the entire project, where the bottom-up approach generates intermediate data that isn't reused.
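A toy sketch of the peephole side of that idea, over a hypothetical mini-IR (nothing like rustc's real MIR): walk the instructions once, tracking known constants, and fold an Add of two known values into a Const — the kind of cheap, streamlined rewrite such a pass could apply while it inlines.

```rust
use std::collections::HashMap;

// Toy IR: virtual registers holding i64 values.
#[derive(Debug, Clone, PartialEq)]
pub enum Inst {
    Const(u32, i64),    // dst, value
    Add(u32, u32, u32), // dst, lhs, rhs
}

// Single forward walk: constant-fold Add when both operands are known.
pub fn peephole(insts: &[Inst]) -> Vec<Inst> {
    let mut known: HashMap<u32, i64> = HashMap::new();
    let mut out = Vec::with_capacity(insts.len());
    for inst in insts {
        match *inst {
            Inst::Const(d, v) => {
                known.insert(d, v);
                out.push(inst.clone());
            }
            Inst::Add(d, a, b) => match (known.get(&a), known.get(&b)) {
                (Some(&x), Some(&y)) => {
                    // Both operands known: rewrite the Add into a Const.
                    known.insert(d, x + y);
                    out.push(Inst::Const(d, x + y));
                }
                _ => {
                    // Unknown operand: dst is no longer a known constant.
                    known.remove(&d);
                    out.push(inst.clone());
                }
            },
        }
    }
    out
}
```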
So, continuing the discussion in this thread, it sounds like these are cases that should be handled by -O1?
A 3.7% slowdown probably isn't desirable for -O0 builds, even for non-negligible runtime improvements, but it fits the "low hanging fruit" approach @CAD97 describes.