Why does Rust generate 10x as much unoptimized assembly as GCC?

Maybe wrapping_neg() and overflowing_neg() should get #[inline(always)] just like wrapping_add(), etc. recently did with the patch for this issue?

Then no function calls would be produced and, at least with -C opt-level=1, the assembly would be optimal.


As a further debug-level optimization, wrapping_neg() could be implemented as 0.wrapping_sub(self), in effect making use of the wrapping_sub() intrinsic, so that even -C opt-level=0 would yield code similar to what GCC produces.
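To make the suggestion concrete, here is a minimal sketch of negation written as a subtraction from zero. The function name wrapping_neg_via_sub is hypothetical and the sketch is fixed to i32; it is not the standard library's actual implementation.

```rust
/// Hypothetical stand-in for `i32::wrapping_neg`, expressed as `0 - x`
/// with wrapping arithmetic. The attribute mirrors the proposal above.
#[inline(always)]
pub fn wrapping_neg_via_sub(x: i32) -> i32 {
    // `wrapping_sub` wraps on overflow instead of panicking, so negating
    // `i32::MIN` yields `i32::MIN` again.
    0i32.wrapping_sub(x)
}
```

For every i32 input this should agree with the built-in x.wrapping_neg(), including the i32::MIN edge case.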

Yes, wrapping_neg does seem a good candidate for #[inline(always)], with some consideration. I think in the last discussion most people settled on new stdlib #[inline(always)] being appropriate for functions which transitively inline to "roughly a single instruction" (i.e. an intrinsic). (The threshold beyond that is contested, and nonobvious, due to inlining being bottom-up.)


I've tried improving the speed of opt-level=0 code by adding #[inline(always)] in several places. The problem is that inlining has a cost, which is non-trivial in opt-level=0 builds, and it regresses the compiler's performance.

So this is not merely a technical problem, but a matter of policy. Is it ok for rustc to be slower in the default opt-level of the debug mode?

BTW: the opt-level profile setting can be changed, so the debug mode can be as optimized as the user wants, so the question is more about what the default behavior should be.

When needing to choose, it's probably best to defer the choice to the user. I can imagine legitimate use cases for both scenarios:

  • "in a debug build, I just want to have my code compiled ASAP so I can actually start running and debugging it"
  • "in a debug build, I want my code to be as fast as possible because this strange error only manifests when I run it on a 1GB input file"

Although I have the feeling that #1 is way more frequent than #2. Consequently, I think it would make more sense to keep the meaning of the current debug build as the default (i.e. minimal optimizations, compile times as short as possible), and introduce another kind of target, let's call it "optimized debug" or something, that exchanges some compilation time for some additional optimizations.


Prior art for this is CMake's RelWithDebInfo[1]. Please don't reuse the name, but the idea is fine :)

[1] For the curious, it is named this way because older Visual Studio configuration selection used a "contains substring" match, not "eq", to find the configuration to build, so a configuration named ReleaseDebug could have been built whether the user selected the Release or the Debug configuration.


I just hope custom profiles will be stabilized someday. They would (and do, if you're willing to use nightly cargo) solve this issue, among others.


The MIR inliner may help with this as well.


The problem is that there is one more dimension to this. While doing less optimization can mean faster compilation, it can also mean much more code is generated, which means slower compilation.

So it's not just a linear spectrum between fast compiles+slow binaries and slow compiles+fast binaries. When the early stages of the compiler generate more straightforward/naive output, and the program and libraries use lots of extra layers expecting them to be optimized away, that just increases the amount of data the later stages of the compiler have to churn through, generating more function prologues/epilogues and call sequences and shuffling things around in memory.

For a language like Rust where idiomatic code involves this much layering and "zero overhead abstractions," early-stage optimizations like MIR inlining probably should be enabled even in debug builds (though probably also with different tuning than in release builds). Nobody needs ptr::read on a u8 to compile to three levels of function calls around memcpy for any amount of debugging, and producing that output is almost certainly slower than producing a direct inline load.
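As a small illustration of the ptr::read point, the sketch below reads a byte two ways. The function names are made up for this example. Both functions return the same value; how many calls the first one expands to in an unoptimized build depends on the compiler version and is not something this snippet demonstrates.

```rust
use std::ptr;

/// Reads a byte through `ptr::read`, the layered path discussed above.
pub fn read_via_ptr(byte: &u8) -> u8 {
    // SAFETY: `byte` is a valid, aligned, initialized reference, and `u8`
    // is `Copy`, so duplicating it cannot cause a double drop.
    unsafe { ptr::read(byte as *const u8) }
}

/// Reads the same byte with a plain dereference, i.e. a direct load.
pub fn read_direct(byte: &u8) -> u8 {
    *byte
}
```

Comparing the assembly of the two functions at opt-level=0 and opt-level=1 is one way to check how much of the wrapper survives in a given toolchain.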


I would qualify that "nobody" slightly. There are (for unfortunate reasons) cases where you do end up wanting to step into these "trivial top-down inline" wrappers. I'll freely and enthusiastically say that the default debug profile should include the minimal optimization passes to make these zero-cost abstractions cheaper in debug, just caution against any absolutes.


In general, sure, there could be rare times that you want to debug trivial wrappers, though I chose that example as one that I really doubt would ever apply: the type is fixed at u8, so if that is going wrong then a debugger is probably the wrong tool. :)

Plus, high quality debug info should still be able to show you inlined "stack frames," which further reduces the need for actually generating code for them.

I think we have to differentiate debug builds. If we just want to run cargo check or rust-analyzer, then we want the fastest possible compilation times. If we have to do runtime debugging, then we (I, at least) want the best possible performance while retaining the ability to debug the program painlessly. Rust doesn't offer the latter at the moment.

Some programs are practically impossible to debug at runtime without inlining (hint: ever tried chaining 20+ iterator methods and running it in debug mode?) and applying too many optimizations makes debugging impractical for other reasons.
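For a sense of what such a chain looks like, here is a short sketch with only three adapters next to a hand-written loop that computes the same result. The function names and the computation are invented for illustration. Without inlining, each adapter in the first version contributes its own next() call per element, which is the overhead being described.

```rust
/// Sums the squares of the even values using an iterator chain.
pub fn sum_even_squares_iter(values: &[u32]) -> u32 {
    values
        .iter()
        .copied()
        .filter(|v| v % 2 == 0)
        .map(|v| v * v)
        .sum()
}

/// Same computation as an explicit loop, with no adapter layers.
pub fn sum_even_squares_loop(values: &[u32]) -> u32 {
    let mut total = 0;
    for &v in values {
        if v % 2 == 0 {
            total += v * v;
        }
    }
    total
}
```

In an optimized build the two are expected to compile to similar code; the gap shows up when the adapter calls are not inlined.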


This is wanted for the embedded use case as well. Debug builds are basically not possible in the ecosystem, mainly due to code size (though also speed), which is annoying but also dangerous because, unless decent tests are written, debug asserts and such never get run.


In the meantime, wouldn't enabling optimizations for debug builds help in that use case? It's not free because it's paid for with additional compile times, but so would this proposal.


The debug profile with debug assertions can be as optimized as the release profile, and the release profile can have debug info. You can mix and match all options in [profile.*].

You can have e.g. debug = true + debug-assertions = true + opt-level = "z".
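As a rough illustration of that mix-and-match, a Cargo.toml fragment along these lines keeps debug info and debug assertions while optimizing for size. The particular combination is only an example, not a recommendation.

```toml
# Cargo.toml (fragment): a size-optimized dev profile that keeps debugging aids.
[profile.dev]
opt-level = "z"          # optimize for size; "s", 1, 2 and 3 are also accepted
debug = true             # keep debug info
debug-assertions = true  # keep debug_assert! and friends enabled
```

The same keys can be set under [profile.release], for example to add debug info to an optimized build.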

What Rust doesn't expose yet is opt-level=g that is like opt-level=2, but without a few optimizations that reduce precision of debug info.


I also support inlining things like iterators so that they are zero-cost abstractions in debug builds too. Currently, I'm trying to debug memory management code, but this is made impossible because I get stuck in all the extra debugging noise (and it doesn't help that when running in QEMU I seem to get stack frame corruption, and I certainly didn't write any code that corrupts the stack frame in any way, and none of my deps do either). So my main project is in limbo because it's practically impossible to debug: the code is just too noisy.

Set opt-level = 1 under [profile.dev]; it enables inlining in debug mode.

That's actually not the case:

-C inline-threshold

This option lets you set the default threshold for inlining a function. It takes an unsigned integer as a value. Inlining is based on a cost model, where a higher threshold will allow more inlining.

The default depends on the opt-level:

opt-level   Threshold
0           N/A, only inlines always-inline functions
1           N/A, only inlines always-inline functions and LLVM lifetime intrinsics
2           225
3           275
s           75
z           25

Interestingly, setting a custom -C inline-threshold=0 still enables some inlining. I've seen some reasonable improvements in debug runtime speed in personal projects just by setting an inline threshold of 25.


I haven't looked at the implementation of the inlining cost function in LLVM, but that would make sense if fewer assembly instructions are emitted at the call site when the function is inlined. That could easily be the case for very small functions.


Note that in the case of cargo check the code isn't translated anyway, so the optimization level you set doesn't matter.