Why does rustc 1.71 no longer eliminate Box allocations?

findepi · February 11, 2025, 5:19pm

Following the example from the Stack Overflow question "How to make sure Box::new() really does heap allocation?", I compile the following code:

#[no_mangle]
pub fn many_heap_calls() {
    for _ in 0..10 {
        let b = Box::new(42);
    }
}

with rustc 1.70 the allocations are optimized away: Compiler Explorer
with rustc 1.71 the allocations are still there: Compiler Explorer
with rustc 1.84 the allocations are still there: Compiler Explorer
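
To check this at run time rather than by reading assembly, one option is to count calls into the global allocator. The following is a minimal sketch and not something posted in this thread: CountingAlloc, ALLOCS, allocations_during and boxes_kept_alive are hypothetical names, and std::hint::black_box is used on the assumption that it keeps the optimizer from removing the box (it is documented as best-effort only).

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator that counts `alloc` calls.
pub struct CountingAlloc;

/// Process-wide counter of allocation requests.
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        // SAFETY: the caller upholds the `GlobalAlloc::alloc` contract.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` was returned by `System.alloc` with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns how many allocations were requested while `f` ran.
pub fn allocations_during(f: impl FnOnce()) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    f();
    ALLOCS.load(Ordering::Relaxed) - before
}

/// Same loop as in the question, but each box is passed through `black_box`
/// so the optimizer has to assume it is observed.
pub fn boxes_kept_alive() {
    for _ in 0..10 {
        let b = black_box(Box::new(42_i32));
        drop(b);
    }
}
```

Calling allocations_during(boxes_kept_alive) should report at least 10 allocations. Removing the black_box call shows whether a given toolchain and opt-level still elide them. The counter is process-wide, so allocations made by other threads are counted too.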

In the release notes ("Announcing Rust 1.71.0" on the Rust Blog and "Release Rust 1.71.0" in rust-lang/rust on GitHub), nothing stood out to me as covering this change.

Questions:

Why does Rust 1.71 no longer eliminate allocations the way older versions did? Is this a feature? Can I somehow get the previous behavior?

Context

I.e. why this could matter.

I am aware that clippy can tell me about unused allocations and I can simply remove them from the code. The actual context is more complicated. I am trying to use Box<dyn Fn> nested inside impl Fn (because impl Fn cannot nest). The compiler can "see through" the Box dyn and realize there is only one implementation, inlining the code, which is awesome! I'm concerned that the Box allocation remains.
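
As an illustration of the pattern described here, the sketch below returns an impl Fn whose closure captures and calls a Box<dyn Fn>. This is an assumed reading of the setup, not the original code: the function add_then_double and its arithmetic are hypothetical and chosen only to keep the example small.

```rust
/// Hypothetical example: an opaque closure that wraps a boxed trait object.
pub fn add_then_double(offset: i64) -> impl Fn(i64) -> i64 {
    // The inner step is type-erased and heap-allocated.
    let inner: Box<dyn Fn(i64) -> i64> = Box::new(move |x| x + offset);
    // The outer closure owns the box and calls through it.
    move |x| inner(x) * 2
}
```

Since only one concrete closure is ever stored in the box, the optimizer has a chance to devirtualize and inline the call. Whether the allocation itself is also removed is the open question of this thread.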

More context

This is related to my older question "Can compiler’s optimizer eliminate trivial heap allocation?". There it was shown to me that the compiler can eliminate some allocations, although the example shared in reply #2 by the8472 also exhibits "better" behavior under 1.70 and "worse" under 1.71.

steffahn · February 11, 2025, 5:32pm

One can probably bisect for a better clue.

cuviper · February 11, 2025, 5:54pm

AFAICS that's not actually calling the allocator, just loading the relocated function pointer several times for the unrolled loop. That's still useless, but not as bad as an actual malloc+free every time.

(edit: it's not a function pointer -- see the read_volatile noted below)

findepi · February 11, 2025, 5:58pm

My bad. Thanks for correcting me, @cuviper! Should I still assume it's not without a performance impact, if called many times? (Context here is functions for a database engine, so hopefully the code will be called billions of times.)

findepi · February 11, 2025, 6:02pm

Thanks @steffahn, this tool is very cool!

I tried to use that with

$ cat script.sh
#!/usr/bin/env bash

set -x
rm lib.s
set -eu

rustc -C opt-level=3 --emit asm --crate-type=lib src/lib.rs
# must be found
cat lib.s | sed '1,/^_many_heap_calls/d' | grep .
# must not have allocations
! cat lib.s | sed '1,/^_many_heap_calls/d' | grep ___rust_no_alloc_shim_is_unstable
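
The two checks in the script boil down to: find the function label, then look for the shim symbol in whatever follows. Below is a rough Rust sketch of that check with a hypothetical function name, symbol_after_label. It only approximates the sed/grep pipeline and, like the script, scans to the end of the file rather than to the end of the function.

```rust
/// Returns `None` if no line starts with `label`; otherwise reports whether
/// `symbol` occurs on any line after the first line that starts with `label`.
pub fn symbol_after_label(asm: &str, label: &str, symbol: &str) -> Option<bool> {
    let mut lines = asm.lines();
    // Consume lines up to and including the label line.
    lines.find(|line| line.starts_with(label))?;
    // Search only the remaining lines.
    Some(lines.any(|line| line.contains(symbol)))
}
```

In the terms of the script, None corresponds to the "must be found" check failing, and Some(true) to the "must not have allocations" check failing.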

then

cargo bisect-rustc --start 1.70.0 --end 1.71.0 --script=./script.sh

the tail of output is

1 versions remaining to test after this (roughly 1 steps)
installing nightly-2023-05-26
rust-std-nightly-aarch64-apple-darwin: 26.80 MB / 26.80 MB [======================================================================================================================================================] 100.00 % 3.34 MB/s testing...
RESULT: nightly-2023-05-26, ===> Script returned error
uninstalling nightly-2023-05-26

searched toolchains nightly-2023-04-15 through nightly-2023-05-27


********************************************************************************
Regression in nightly-2023-05-26
********************************************************************************

fetching https://static.rust-lang.org/dist/2023-05-25/channel-rust-nightly-git-commit-hash.txt
nightly manifest 2023-05-25: 40 B / 40 B [======================================================================================================================================================================] 100.00 % 508.96 KB/s converted 2023-05-25 to c373194cb6d882dc455a588bcc29c92a96b50252
fetching https://static.rust-lang.org/dist/2023-05-26/channel-rust-nightly-git-commit-hash.txt
nightly manifest 2023-05-26: 40 B / 40 B [======================================================================================================================================================================] 100.00 % 683.81 KB/s converted 2023-05-26 to a2b1646c597329d0a25efa3889b66650f65de1de
looking for regression commit between 2023-05-25 and 2023-05-26
fetching (via remote github) commits from max(c373194cb6d882dc455a588bcc29c92a96b50252, 2023-05-23) to a2b1646c597329d0a25efa3889b66650f65de1de
ending github query because we found starting sha: c373194cb6d882dc455a588bcc29c92a96b50252
get_commits_between returning commits, len: 9
  commit[0] 2023-05-24: Auto merge of #111260 - petrochenkov:effvisperf7, r=cjgillot
  commit[1] 2023-05-24: Auto merge of #111919 - matthiaskrgr:rollup-8qcdp0q, r=matthiaskrgr
  commit[2] 2023-05-25: Auto merge of #111925 - Manishearth:rollup-z6z6l2v, r=Manishearth
  commit[3] 2023-05-25: Auto merge of #111575 - alex:patch-1, r=pietroalbini
  commit[4] 2023-05-25: Auto merge of #111933 - matthiaskrgr:rollup-m10k3ts, r=matthiaskrgr
  commit[5] 2023-05-25: Auto merge of #111473 - compiler-errors:opaques, r=lcnr
  commit[6] 2023-05-25: Auto merge of #110906 - ekusiadadus:ekusiadadus/fix-bash-complete-etc, r=albertlarsan68
  commit[7] 2023-05-25: Auto merge of #111512 - petrochenkov:microdoc2, r=GuillaumeGomez
  commit[8] 2023-05-25: Auto merge of #86844 - bjorn3:global_alloc_improvements, r=pnkfelix
ERROR: no CI builds available between c373194cb6d882dc455a588bcc29c92a96b50252 and a2b1646c597329d0a25efa3889b66650f65de1de within last 167 days

Of these, the line "commit[8] 2023-05-25: Auto merge of #86844 - bjorn3:global_alloc_improvements, r=pnkfelix" stands out to me, maybe because https://github.com/rust-lang/rust/pull/86844 is something I had already opened based on git log 1.70.0..1.71.0 --pretty=oneline | grep -i alloc in the rust repo.

I will read the PR, but I still don't know: is this a feature or a bug? How can I get the previous behavior?

cuviper · February 11, 2025, 6:26pm

Looks like this read_volatile (added in 86844) can't be optimized away:

github.com/rust-lang/rust, library/alloc/src/alloc.rs at 8c61cd4df:

          // Make sure we don't accidentally allow omitting the allocator shim in
          // stable code until it is actually stabilized.
          core::ptr::read_volatile(&__rust_no_alloc_shim_is_unstable);

cc @bjorn3
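
The effect can be reproduced in isolation. The sketch below uses a hypothetical static, MARKER, as a stand-in for __rust_no_alloc_shim_is_unstable (the real symbol is internal to the allocator shim). A volatile read has to be performed each time it appears in the program, so the load survives even when its result is thrown away.

```rust
use core::ptr;

// Hypothetical stand-in for the shim's marker symbol.
static MARKER: u8 = 0;

/// Discards the value, yet the compiler still has to emit the load.
pub fn discard_volatile_read() {
    // SAFETY: `MARKER` is a valid, aligned, initialized static.
    let _ = unsafe { ptr::read_volatile(&MARKER) };
}

/// Same read, returning the value so it can be inspected.
pub fn volatile_read() -> u8 {
    // SAFETY: as above.
    unsafe { ptr::read_volatile(&MARKER) }
}
```

Compiling discard_volatile_read with optimizations and inspecting the assembly should show a load of MARKER remaining in the function body, much like the leftover loads in the unrolled loop from the question.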

pitaj · February 11, 2025, 6:29pm

Relevant zulip thread

Might be worth taking a look at some alternatives

findepi · February 11, 2025, 7:41pm

My local benchmarks showed a 6x perf difference (for the functions from the Compiler Explorer links shared above).

I found the tracking issue "Missed optimization/perf oddity with allocations" (Issue #128854 in rust-lang/rust on GitHub). It looked like there was no "business justification" there, so I posted there.

Thank you All very much for your help and useful pointers!

bjorn3 · February 11, 2025, 9:01pm

Not much we can do about that until we either bless directly linking rlibs or get rid of the allocator shim altogether. I have a WIP PR for the latter by using weak symbols, but I still need to get it working on Windows.

findepi · February 12, 2025, 11:10am

As is perhaps not uncommon, it turned out my real problem is somewhere else. I saw an assembly difference, so I ran benchmarks, and the benchmarks said that one variant is slower. But the real difference does not come from that assembly difference. It comes from the inlining capabilities of the functions being benchmarked.

Different "inlineability" may be a result of allocations (later eliminated), closures (also later eliminated), or something else. I captured my results in "Allocation elimination, loops and benchmark results" on GitHub, if anyone is interested.
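
A minimal sketch of how inlining alone can separate two otherwise identical variants follows. The function names and the arithmetic are hypothetical and are not taken from the benchmark mentioned above. Both loops compute the same value, and the only difference is whether the per-element step may be inlined into the loop.

```rust
use std::hint::black_box;

#[inline(always)]
fn step_inlined(x: u64) -> u64 {
    x.wrapping_mul(3).wrapping_add(1)
}

#[inline(never)]
fn step_outlined(x: u64) -> u64 {
    x.wrapping_mul(3).wrapping_add(1)
}

/// The step can be folded into the loop body.
pub fn run_inlined(n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        acc = acc.wrapping_add(step_inlined(black_box(i)));
    }
    acc
}

/// Every iteration pays for a real function call.
pub fn run_outlined(n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        acc = acc.wrapping_add(step_outlined(black_box(i)));
    }
    acc
}
```

Benchmarking the two side by side is one way to estimate how much of a measured gap is due to call overhead and lost loop optimizations rather than to the code under test.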

Thank you All very much again for your replies!

findepi · February 13, 2025, 11:49am

A question about bisect. How do you use --prompt? It shows the compiler output, so I know how to use it for compiler messages. How do I use it with, e.g., cargo bench?

(I am observing a regression in nightly and would love to be able to find a corresponding commit.)

steffahn · February 13, 2025, 4:01pm

Using cargo bench should just work, shouldn't it? E.g.

cargo bisect-rustc --start … --end … --prompt -- bench

The final -- separates a custom cargo command, i.e. bisection calls cargo with anything that follows this -- as extra arguments; for the example above that’s a simple cargo bench then.

(If a --script is defined instead, that script will receive those arguments instead.)

With --prompt, I also like --preserve to more quickly re-trace the steps (if I made a mistake or just want to do it again for double-checking) to skip the downloads. You can later remove the installed toolchains normally via rustup. If they are too many for manual work, it could work programmatically e.g. with something like

rustup toolchain list | grep '^bisector-nightly-' | xargs -n1 rustup toolchain uninstall

With benchmarks, you don't really care about compiler-behavior either, I guess, so --preserve-target is a decent option to add, too; then you can rerun the bench (when re-doing bisection, or even with the > retry option from --prompt) without recompiling the benchmarking suite.

For fully manual interactive investigation at each stage… well one thing I've done before is opening a second terminal and just working with the installed bisector-nightly-… toolchain there.

cargo bisect-rustc doesn’t seem to be designed for this use-case; nonetheless, I was able to get a somewhat convenient fully-interactive experience with something like

… --prompt --script sh -- -c 'bash <&3' 3<&0

The --prompt already involves forwarding stdout and stderr anyway; file descriptor 3 is used to route stdin into the inner bash too, which otherwise doesn’t inherit the normal stdin. Then just do whatever you like, using simple cargo commands and the like within that shell (this apparently works because environment variables are set to define the toolchain that rustup should provide, and the target directory), then leave it with exit or Ctrl+D to fall back into the question from the interactive --prompt.

The above even works without extra script files, because the part behind -- gets passed to sh, so the short filehandle-redirecting can simply be defined inline with the command. [The “3<&0” part is not passed to cargo bisect-rustc at all; it makes 3 point to stdin before bisect-rustc even starts, and could appear anywhere in the whole command.]

If you want the right answer to be preselected using a command within the shell, exit 0 and exit 1 can work… setting up aliases is also possible ~ apparently one would need to abuse --rcfile for this, but it works then:

… --prompt --script bash 3<&0 -- -c "bash <&3 --rcfile <(cat ~/.bashrc; echo \"alias b='exit 0'; alias r='exit 1'\")"

For instance, this sets up b for baseline and r for regressed, which is then pre-selected in the --prompt interaction (based on the exit code, following the default regress=error logic).
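
For an unattended variant of a benchmark bisection, the same exit-code convention could be driven by a timing threshold. This is a rough sketch under stated assumptions: time_it, exit_code_for and the idea of a fixed time budget are hypothetical. Only the convention that exit status 0 means baseline and a nonzero status means regressed comes from the discussion above.

```rust
use std::time::{Duration, Instant};

/// Measures how long `f` takes to run once.
pub fn time_it(f: impl FnOnce()) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

/// Maps a measurement to an exit status: 0 = baseline, 1 = regressed.
pub fn exit_code_for(elapsed: Duration, budget: Duration) -> i32 {
    if elapsed <= budget {
        0
    } else {
        1
    }
}
```

A small binary could finish with std::process::exit(exit_code_for(time_it(workload), budget)), where workload and budget are placeholders, and a bisection script could build and run it and pass that status through. A single timing is noisy, so in practice the budget would need a generous margin or several repetitions.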

findepi · February 13, 2025, 6:52pm

This is an awesome writeup, thank you @steffahn!

The --prompt --script sh -- -c 'bash <&3' 3<&0 is the fully interactive mode I was hoping for, will give it a try!
