#[inline] has a lot of implications for both runtime and compile-time cross-crate performance.
The runtime cost is obvious -- if a function is both non-generic and not #[inline], then (cross-crate) it always compiles to an out-of-line function call (until LTO, anyway).
There's a compile-time cost to #[inline] as well, though -- AIUI, #[inline] gives LLVM a hint that the function is a good candidate for inlining, which can make LLVM inline more aggressively than it otherwise would. Additionally, as part of that, the function needs to be instantiated separately in every CGU that uses it, in every crate.
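To make the cross-crate mechanics concrete, here is a minimal sketch; the crate name and functions are made up purely for illustration:

```rust
// In some upstream library crate (hypothetical name `numbers`):

/// Without #[inline], this non-generic function is only codegenned into
/// `numbers`' own CGUs; a downstream crate gets an out-of-line call to it
/// unless (thin) LTO later stitches things back together.
pub fn double(x: u32) -> u32 {
    x * 2
}

/// With #[inline], the body is instantiated in every downstream CGU that
/// calls it, and LLVM is additionally nudged to actually inline it.
#[inline]
pub fn triple(x: u32) -> u32 {
    x * 3
}

/// Generic functions are monomorphized in the calling crate anyway, so
/// they are inline-available cross-crate even without the attribute.
pub fn scale<T: core::ops::Mul<Output = T>>(x: T, by: T) -> T {
    x * by
}
```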
Thus the idea for #[inline(const)]: the point is to make the function available for inlining without applying any inlining hints, and ideally to keep using the canonical instantiation in the source crate when inlining isn't done.
The reason to spell it #[inline(const)] is the intended application -- for a crate like bitvec, most functionality isn't an unambiguous candidate for inlining when the parameters are unknown, but at the same time it really wants to be able to constant fold.
Thus where #[inline] roughly communicates "this is a good candidate for inlining," #[inline(const)] is meant to roughly communicate "this is a good candidate for constant folding," importantly without impacting the default inlining heuristics.
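As a rough sketch of the intent, in the spirit of the bitvec example above (the attribute is exactly what's being proposed and is not accepted by rustc today, so it is shown commented out; the function is a simplified stand-in):

```rust
/// Hypothetical: the proposed attribute does not exist yet. The intent is
/// "don't nudge LLVM to inline this in general, but do ship the body so
/// that calls with compile-time-known arguments can constant fold away".
// #[inline(const)]
pub fn mask_up_to(bits: u32) -> u64 {
    // With a known `bits`, the whole call folds to a constant; with an
    // unknown `bits`, inlining it everywhere mostly just adds code size.
    if bits >= 64 {
        u64::MAX
    } else {
        (1u64 << bits) - 1
    }
}
```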
Is this even a distinction which we can communicate to LLVM?
And because inlining is bottom-up, whether a function is a good inlining candidate is typically not locally obvious to the developer; it's usually better to let the optimizer use its own inlining heuristics. ↩︎
Wouldn't that become irrelevant once MIR optimizations become sufficiently powerful and/or MIR libraries become possible? Also, how is that meaningfully different from lto=thin, other than more busywork for library authors?
Bikeshed: the notation is bad. I would expect it to mean something like "inlinable if called in const context". The inner syntax of the #[inline] attribute is its own private concern, so why not make it something more directly relevant, like #[inline(maybe)] or #[inline(enable)]?
True. The only thing blocking them from being inlined by LLVM is that generic functions end up in a separate CGU from their callers. Thin local LTO, as enabled by default when optimizations are enabled, should already allow inlining them, though.
It is possible, but it will increase the size of the crate metadata by a non-trivial amount and as such make rustc slower even for debug builds.
The nuances around CGUs are a royal pain right now. We sometimes end up needing to #[inline] things in core that already have MIR available and so are essentially inline-available, but only to one of the CGUs; thus random CGU choices make a huge impact on some things sometimes.
I think how thin LTO works (global, but parallel analysis of all TUs, which facilitates cross-TU inlining without requiring analyzing literally everything in one chunk) is on a very fundamental level the way Rust optimizing compilers should work. We can't do separate compilation (because zero-cost abstractions) and we can't merge everything into one TU (because we need scalability across CPUs or, ideally, machines). The map-reduce of thin LTO is what's left then.
Even if in the future we replace thin LTO with something like MIR-only rlibs, I think the overall feeling (compile time and runtime performance) should be roughly the same for the outside observer.
This makes me think that lto=thin is just the natural, neutral thing to do for --release, and that it should ideally have been the default. I think it is not the default simply because lto=thin postdates Rust.
But "In Rust 2024, the default for the release profile is lto=thin" seems like a great thing to have on a roadmap. For builds of rust-analyzer, I get the following (with -Clink-arg=-fuse-ld=lld on Linux):
cpu 1457.66s (1415.48s user + 42.18s sys)
cpu 1538.36s (1491.24s user + 47.12s sys)
The compile time hit here seems reasonable: of course, doing more global analysis is going to be slower than not doing it, and, given the mechanics of the language, that analysis seems more-or-less mandatory for reasonable runtime behavior.
The memory hit is quite a bit worse. I think the reason why we didn't enable LTO for rust-analyzer is that the default GitHub builders actually started to OOM? (We do use lto=thin when building ra.) OTOH, it doesn't seem like an unreasonable memory requirement, and memory is generally "cheaper" than time.
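For reference, a minimal sketch of what opting in looks like today, rather than waiting on any change of defaults (the target triple below is just an assumption matching the Linux builds mentioned above):

```toml
# Cargo.toml -- opt the release profile into thin LTO explicitly.
[profile.release]
lto = "thin"
```

```toml
# .cargo/config.toml -- the lld linker flag used for the measurements above.
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
```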