Why does Rust generate 10x as much unoptimized assembly as GCC?

Comparison here.

I'm just trying to write a simple function to extract the lowest set bit. With optimizations disabled, GCC still generates 8 instructions, while rustc generates almost 50. I've gone out of my way in the Rust version to use as for no-op conversions, and called wrapping_neg to opt out of the panic on overflow. Strangely, using wrapping_neg makes it worse: it causes a large overflowing_neg function to be emitted that looks to be doing something far more complicated than the single instruction that is necessary. Is there a way I could be doing this better, assuming I care about debug build performance? The hit going from optimized to debug in C++ can be 2-4x, but if this is representative of Rust the difference would be 20-40x, which makes running a Rust program in debug almost as bad as running an optimized binary under valgrind.

1 Like

Yep. It really is that bad. A lot of things just assume inlining and basic optimizations are present and won't generate even remotely reasonable code otherwise.

To be fair, template-heavy C++ can be just as bad.

In this case, wrapping_add, wrapping_sub, and wrapping_mul have their own intrinsics, but wrapping_neg is defined like this (from library/core/src/num/int_macros.rs):

#[inline]
pub const fn wrapping_neg(self) -> Self {
    self.overflowing_neg().0
}

A strange way of implementing it, but with optimizations enabled LLVM turns it into the expected thing, so probably nobody has noticed or bothered fixing it.

As a workaround, ((x as i64) & 0i64.wrapping_sub(x as i64)) as u64 doesn't produce any function calls.
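A minimal sketch of that workaround wrapped in a function, next to the wrapping_neg form for comparison. The function names are made up for illustration, and the original poster's exact code is not shown in the thread, so treat this as one plausible shape rather than their version:

```rust
/// Isolates the lowest set bit of `x`; returns 0 when `x` is 0.
/// Negation is spelled as `0 - x` via `wrapping_sub`, which avoids the
/// call chain through `overflowing_neg` described above.
pub fn lowest_set_bit(x: u64) -> u64 {
    let signed = x as i64; // same-width cast, only reinterprets the bits
    (signed & 0i64.wrapping_sub(signed)) as u64
}

/// The same computation written with `wrapping_neg`, for comparison.
/// Both functions return identical results; only the unoptimized
/// code generated for them differs.
pub fn lowest_set_bit_neg(x: u64) -> u64 {
    x & x.wrapping_neg()
}
```

Both versions compute x & -x in two's complement, so they agree for every input, including 0 and values with the top bit set.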

2 Likes

FWIW, depending on what exactly you're measuring, even "raw, performant C" can see 20-30x performance differences between opt-none and opt-all:

Source: https://youtu.be/6hC9IxqdDDw?t=929

But yes: Rust relies on the optimizer (and specifically, the inliner) a lot more than C++ traditionally does. Part of this is historical: C and C++'s compilation model requires that each source file is compiled separately, without the ability to inline anything across compilation units. As such, C/C++ peephole optimizations often target builtin library functionality, and more things are provided as (vendor extension) language items, such that they're always quick, even without aggressive whole-program optimization. Rust, by contrast, has always had the crate as its compilation unit, so inlining is always more available, even before you get to the fact that Rust generally provides more functionality as function calls rather than compiler magic or builtins.

It's even worth noting that, while e.g. memcpy is likely a compiler builtin in your C/C++ compiler, and other items might directly ask the compiler to expand to some code, every intrinsic in Rust is wrapped in another layer of function calls through the stdlib. Being a language born in the golden age of optimizing compilers, rustc isn't afraid of relying on the optimizer to do its work (to a fault of giving LLVM too much IR and slowing down compilation).

It'd be interesting to play with setting the inline-threshold for -C opt-level=1 higher, as it (AIUI) doesn't really do much of anything for rustc/llvm currently. Interestingly enough, for this example even -C inline-threshold=0 (that's a zero, and not a mistake) allows the compiler to collapse this into one (messy) function, and (unsurprisingly) adding opt-level=1 as well gets us the pure assembly equivalent to GCC's -Og.

(TL;DR: give me an -Og, rustc, which gives us a low-but-sufficient inlining threshold while maintaining in-order execution and source mapping for debugging.)

12 Likes

Maybe wrapping_neg() and overflowing_neg() should get #[inline(always)] just like wrapping_add(), etc. recently did with the patch for this issue?

Then no function calls would be produced, and at least with -C opt-level=1 the assembly would be optimal.

3 Likes

As a further debug-level optimization, wrapping_neg() could be implemented as 0.wrapping_sub(self), in effect making use of the wrapping_sub() intrinsic, so that even at -C opt-level=0 the generated code is similar to what gcc produces.
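A rough sketch of that suggestion, written as a free function on i64 because the real method lives inside a macro in the standard library. The name wrapping_neg_via_sub is hypothetical, and this is not the actual stdlib implementation:

```rust
/// Hypothetical stand-in for `i64::wrapping_neg`, expressed as
/// `0 - x` with wrapping arithmetic. `#[inline(always)]` asks the
/// compiler to inline it even when optimizations are disabled.
#[inline(always)]
pub const fn wrapping_neg_via_sub(x: i64) -> i64 {
    0i64.wrapping_sub(x)
}
```

In two's complement, 0 - x with wraparound is exactly negation, so this matches wrapping_neg for every input, including i64::MIN (which negates to itself).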

Yes, wrapping_neg does seem a good candidate for #[inline(always)], with some consideration. I think last discussion most people settled on new stdlib #[inline(always)] being appropriate for functions which transitively inline to "roughly a single instruction" (i.e. an intrinsic). (The threshold beyond that is contested, and nonobvious, due to inlining being bottom-up.)

3 Likes

I've tried improving the speed of opt-level=0 code by adding #[inline(always)] in several places. The problem is that inlining has a cost, which is non-trivial in opt-level=0 builds, and it regresses the compiler's own performance.

So this is not merely a technical problem, but a matter of policy. Is it ok for rustc to be slower in the default opt-level of the debug mode?

BTW: the opt-level profile setting can be changed, so the debug mode can be as optimized as the user wants, so the question is more about what the default behavior should be.

1 Like

When needing to choose, it's probably best to defer the choice to the user. I can imagine legitimate use cases for both scenarios:

  • "in a debug build, I just want to have my code compiled ASAP so I can actually start running and debugging it"
  • "in a debug build, I want my code to be as fast as possible because this strange error only manifests when I run it on a 1GB input file"

Although I have the feeling that #1 is way more frequent than #2. Consequently, I think it would make more sense to keep the meaning of the current debug build as the default (i.e. minimal optimizations, compile times as short as possible), and introduce another kind of target, let's call it "optimized debug" or something, that exchanges some compilation time for some additional optimizations.

7 Likes

Prior art for this is CMake's RelWithDebInfo[1]. Please don't reuse the name, but the idea is fine :)

[1] For the curious, it is named this way because older Visual Studio configuration selection used "contains substring" rather than "eq" to find the configuration to build, so if it had been named ReleaseDebug, it could have been built when the user selected either the Release or the Debug configuration.

4 Likes

I just hope custom profiles will be stabilized someday. They would (and do, if you're willing to use nightly cargo) solve this issue, among others.

3 Likes

The MIR inliner may help with this as well.

2 Likes

The problem is that there is one more dimension to this. While doing less optimization can mean faster compilation, it can also mean much more code is generated, which means slower compilation.

So it's not just a linear spectrum between fast compiles+slow binaries and slow compiles+fast binaries. When the early stages of the compiler generate more straightforward/naive output, and the program and libraries use lots of extra layers expecting them to be optimized away, that just increases the amount of data the later stages of the compiler have to churn through, generating more function prologues/epilogues and call sequences and shuffling things around in memory.

For a language like Rust where idiomatic code involves this much layering and "zero overhead abstractions," early-stage optimizations like MIR inlining probably should be enabled even in debug builds (though probably also with different tuning than in release builds). Nobody needs ptr::read on a u8 to compile to three levels of function calls around memcpy for any amount of debugging, and producing that output is almost certainly slower than producing a direct inline load.
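To make the ptr::read example concrete, here is a small sketch with two hypothetical helper functions. Both return the same byte; the point of the post above is that, without inlining, the first goes through library wrapper calls while the second is a plain load:

```rust
/// Reads a byte through `std::ptr::read`. In an unoptimized build this
/// goes through the library wrapper rather than compiling to a bare load.
pub fn read_via_ptr(byte: &u8) -> u8 {
    // SAFETY: the pointer comes from a valid, aligned reference,
    // and `u8` is `Copy`, so reading it cannot cause a double drop.
    unsafe { std::ptr::read(byte as *const u8) }
}

/// Reads the same byte with a plain dereference.
pub fn read_direct(byte: &u8) -> u8 {
    *byte
}
```

Comparing the output of the two functions at opt-level 0 and opt-level 1 (for example in a compiler explorer) is one way to see the layering being discussed.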

10 Likes

I would qualify that "nobody" slightly. There are (for unfortunate reasons) cases where you do end up wanting to step into these "trivial top-down inline" wrappers. I'll freely and enthusiastically say that the default debug profile should include the minimal optimization passes to make these zero-cost abstractions cheaper in debug, just caution against any absolutes.

3 Likes

In general, sure, there could be rare times when you want to debug trivial wrappers, though I chose that example as one that I really doubt would ever apply: the type is fixed at u8, so if that is going wrong then a debugger is probably the wrong tool. :)

Plus, high quality debug info should still be able to show you inlined "stack frames," which further reduces the need for actually generating code for them.

1 Like

I think we have to differentiate debug builds. If we just want to run cargo check or rust analyzer, then we want the fastest possible compilation times. If we have to do runtime debugging, then we (I, at least) want the best possible performance while retaining the ability to debug the program painlessly. Rust doesn't offer the latter, at the moment.

Some programs are practically impossible to debug at runtime without inlining (hint: ever tried chaining 20+ iterator methods and running it in debug mode?) and applying too many optimizations makes debugging impractical for other reasons.
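As a small illustration of the iterator point, here is a short adapter chain next to a hand-written loop. The function names are invented for this sketch. Each adapter (copied, filter, map, sum) is a generic library function, so without inlining every element passes through several nested calls, and a real 20+ step chain multiplies that:

```rust
/// Sums the squares of the even values using an iterator chain.
pub fn sum_even_squares_iter(values: &[u32]) -> u64 {
    values
        .iter()
        .copied()
        .filter(|v| v % 2 == 0)
        .map(|v| u64::from(v) * u64::from(v))
        .sum()
}

/// The same computation as an explicit loop, which stays reasonably
/// direct even when the compiler performs no inlining.
pub fn sum_even_squares_loop(values: &[u32]) -> u64 {
    let mut total = 0u64;
    for &v in values {
        if v % 2 == 0 {
            total += u64::from(v) * u64::from(v);
        }
    }
    total
}
```

With optimizations enabled the two typically compile to comparable code; the gap discussed in this thread only shows up in unoptimized builds.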

2 Likes

This is wanted for the embedded use case as well. Debug builds are basically not possible in the ecosystem, mainly due to code size (though also speed), which is annoying but also dangerous because, unless decent tests are written, debug asserts and the like never get run.

3 Likes

In the meantime, wouldn't enabling optimizations for debug builds help in that use case? It's not free because it's paid for with additional compile times, but so would this proposal.

2 Likes

The debug profile with debug assertions can be as optimized as the release profile, and the release profile can have debug info. You can mix and match all options in [profile.*].

You can have e.g. debug = true + debug-assertions = true + opt-level = "z".
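As a sketch, that combination could be written in Cargo.toml roughly like this. The values are examples to adapt, not a recommendation:

```toml
# Cargo.toml (sketch): a dev profile that keeps debugging aids
# but optimizes for size.
[profile.dev]
opt-level = "z"          # size-oriented optimization; use 1, 2, 3 or "s" as needed
debug = true             # keep debug info
debug-assertions = true  # keep debug_assert! and similar checks enabled
```

The same keys can be set under [profile.release] to go the other way, for example to add debug info to an optimized build.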

What Rust doesn't expose yet is opt-level=g that is like opt-level=2, but without a few optimizations that reduce precision of debug info.

4 Likes

I also support inlining things like iterators so that they really are zero-cost abstractions. Currently, I'm trying to debug memory management code, but this is made impossible because I get stuck in all the extra debugging noise (and it doesn't help that when running in QEMU I seem to get stack frame corruption, and I certainly didn't write any code that corrupts the stack frame in any way, and none of my deps do either). So my main project is in limbo because it's practically impossible to debug: the code is just too noisy.

Set opt-level = 1 under [profile.dev]; it enables inlining in debug mode.