When working with manual loop unrolling macros I found that such a feature could be quite useful, e.g. for disabling unrolling in debug builds (to significantly reduce compilation times) and especially for `-Os` and `-Oz` builds. Currently I plan to use a `no_unroll` feature, but it's hard to signal it, even more so if the crate is deep in a dependency tree. Do you think such an addition is desirable for the language?
Seems similar to the existing `debug_assertions` cfg, though not the same.
I consider it very important that optimization level does not affect program semantics. It's not possible to achieve this entirely, but this cfg would be a pretty gaping hole. (`debug_assertions` is orthogonal to optimization level; it's just enabled by default in the debug profile and disabled in the release profile.)
The use case also doesn’t seem convincing to me. Loop unrolling should ideally be controlled by talking to the optimizer (https://github.com/rust-lang/rfcs/issues/2219), not done by hand with macros.
I agree about not affecting program semantics, but the same can be said about target features: we leave it up to authors to ensure that the program produces the same result for different sets of enabled target features, and I don't think optimization level is much different in that regard. As for loop unrolling, I again agree that ideally we should have compiler support for it, but it's just the most noticeable example; there are a lot of algorithms which have variants with different space-time trade-offs, and the compiler just can't help here.
`debug_assertions` is also enabled by default by rustc when compiling without optimizations, and disabled by default when compiling with optimizations. So it's not 100% orthogonal to `opt-level`.
I don't see how. If you are using intrinsics to access specific hardware capabilities, the presence or absence of the relevant target features is highly relevant to whether you can use those intrinsics. Of course ideally all software would be 100% portable, but that's completely impractical (for a systems language) as opposed to just being rough around the edges.
Are we talking about code size trade-offs here? Please give an example. I'm the first to tell you the compiler can't magically do everything for you; what I doubt is that it's a good idea to tie this to the `-C opt-level` flag.
True, but again that's just a matter of defaults. So you can run an optimized build with debug assertions enabled, and a debug build without such assertions.
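A minimal sketch: building the code below with `rustc -C opt-level=3 -C debug-assertions=on` keeps the check in an optimized binary, while `-C opt-level=0 -C debug-assertions=off` strips it from an unoptimized one (with Cargo, the `debug-assertions` profile key does the same).

```rust
fn average(values: &[u32]) -> u32 {
    // Kept or removed based on `-C debug-assertions`, independently of
    // `-C opt-level`.
    debug_assert!(!values.is_empty(), "average of an empty slice");
    let sum: u64 = values.iter().map(|&v| v as u64).sum();
    (sum / values.len() as u64) as u32
}

fn main() {
    println!("{}", average(&[1, 2, 3]));
    // Reports whether debug assertions were compiled into this build.
    println!("debug_assertions: {}", cfg!(debug_assertions));
}
```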
I am talking about the case when a crate provides a software fallback or several implementations for different feature sets. Those implementations should produce the same result, but it's up to the programmer to ensure it.
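A minimal sketch of what I mean (the function is a made-up stand-in, not real crate code): the implementation is selected by `target_feature` at compile time, and only the author's diligence keeps the two variants in agreement:

```rust
#[cfg(target_feature = "sse2")]
fn sum_u32(data: &[u32]) -> u64 {
    // Stand-in for a vectorized path; a real crate would use intrinsics here.
    data.iter().map(|&x| x as u64).sum()
}

#[cfg(not(target_feature = "sse2"))]
fn sum_u32(data: &[u32]) -> u64 {
    // Portable fallback; it must return exactly the same value as the path above.
    let mut acc = 0u64;
    for &x in data {
        acc += x as u64;
    }
    acc
}

fn main() {
    assert_eq!(sum_u32(&[1, 2, 3]), 6);
}
```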
Yes, size of the resulting binary to be exact (hence `-Os` and `-Oz`). Loop unrolling is the most notable example: for the keccak function, unrolling increases the binary size by an order of magnitude. A different example is this block cipher: in this implementation I use pre-computed expanded S-tables which take 1024 bytes each instead of the original 128 bytes. I know it's a relatively small difference, but for other algorithms it can be much bigger, which is substantial on embedded platforms.
I don't see another ergonomic way to do it. Yes, you can use features, but then you're essentially telling the compiler twice that "we want to minimize the binary size": once through a feature and once through the `-O` flag.
Sure. This is true every time you select between different implementations of the same algorithm though. The problematic part is what the selection depends on, and thus, what can affect your debugging when the implementations are not interchangeable.
I really want to avoid a situation where user code somewhere in the dependency tree has a logic bug, you try to debug it, and find that disabling optimizations (to get better debug info, and decrease the chance of miscompilations) affects whether the bug occurs. Optimization level is not expected to, and not supposed to, affect program semantics (and the bugs that do in practice depend on the optimizer are of a special kind: mostly UB or compiler bugs). If the bug depends on which features you enabled in your dependency, that's much easier to identify.
Thanks for the example.
I explained above why tying decisions to optimization flags is worrying to me. I also have some reservations about the argument that this is something people will want to tune solely via the optimization level. If the difference is big (e.g., a megabyte of tables), you may want to tune it independently of optimization level: you may find the performance difference of using the larger table to be negligible, but `-Copt-level=3` vs `-Copt-level=z` unacceptable.
Overall I understand your concerns; this is why I've posted this idea here, to see what others think about it. Though in my personal opinion the opportunity to make crates smarter with regard to optimization levels outweighs the potential hazards.
Keeping debug info is, strictly speaking, orthogonal to the optimization level. You can use `opt-level=3` and `debug=true` simultaneously.
The options are orthogonal, but the quality of the debug info suffers greatly when aggressive optimizations are enabled, and this is unavoidable to some degree (LLVM could still be better at this, though).
I am also strongly opposed to allowing conditional compilation on `opt_level`, for basically the same reasons.
But this does make me wonder if we could do something to make performance features more discoverable. As an initial strawman, maybe we could quasi-standardize a feature name like "optimize-for-code-size", such that when you run `cargo build -Os`, if there are any dependencies with a feature of that name that you did not explicitly turn on or off, cargo can print a message suggesting that you might want to consider enabling that feature.
Though now I’m wondering if there are really enough crates that need performance configuration done at build time to justify a feature like that. I think most crates can and should simply add various kinds of API surface and let clients be responsible for using it. Code size seems like it might be the only use case where “just use the faster API” is not a valid answer, since the bulkier API is still in the binary if you aren’t doing LTO. If the concern is only build time, as I think you said at the top, then having two separate macros (if the tradeoff really is that huge) seems like a more straightforward and more discoverable solution than a crate feature.
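To illustrate the "separate API surface" option (a minimal sketch with hypothetical names, using plain functions rather than macros): the crate exports both variants and the caller chooses explicitly, rather than a feature or the optimization level choosing for them.

```rust
// Compact variant: straightforward loop, smaller code.
pub fn rotate_lanes_compact(state: &mut [u64; 4]) {
    for lane in state.iter_mut() {
        *lane = lane.rotate_left(1);
    }
}

// Unrolled variant: same operation written out by hand, larger code.
// The two bodies must be kept in sync by the crate author.
pub fn rotate_lanes_unrolled(state: &mut [u64; 4]) {
    state[0] = state[0].rotate_left(1);
    state[1] = state[1].rotate_left(1);
    state[2] = state[2].rotate_left(1);
    state[3] = state[3].rotate_left(1);
}
```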