Unexpected 3x performance regression starting with Rust version 1.67

I'm using Rust in combination with Faust, a domain specific language for digital signal processing, that compiles to many languages, including Rust. I've contributed to the Faust-to-Rust code generation, and I'm maintaining some benchmarks that compare Rust against the C++ performance.

Up until Rust 1.67, performance was very good! Starting with 1.67, however, it has dropped significantly: in real-world applications I'm often seeing a 3x slowdown. I'm wondering what is causing this slowdown, and whether anything can be done to get back the performance of older Rust versions.

I asked this question on StackOverflow before, but it was closed and deleted, and I was redirected to this forum. The issue is that the problem cannot be reproduced in just a few lines of code: Faust's code generation produces very messy (but presumably highly "pre-optimized") Rust code, so any meaningful MRE requires quite a bit of code, I'm afraid. I've prepared such a single-file MRE here. If preferred, I can also post it here directly (~1000 LoC)?

Running it with cargo run --release on any version from 1.67 onward achieves an audio throughput of ~30 MiB/sec on my machine. For comparison, running it with older Rust versions (e.g. RUSTUP_TOOLCHAIN=1.66.0 cargo run --release) results in a throughput of almost 90 MiB/sec.

Looking at the 1.67 release notes, it isn't clear why the performance is so much worse. Perhaps it is the new MIR, or the field ordering that affects this particular code?

Any thoughts on whether we could make changes on the code-generation side in Faust to get back the old performance? Or does the regression justify opening an issue against the Rust compiler itself, in case it was unintentional?

3 Likes

What CPU are you targeting? How are you compiling your code? How has the assembly generated by rustc changed?

Everything is "default", i.e., I'm just compiling/running it with cargo run --release and this minimal Cargo.toml.

Could it be related to a change in the defaults?

My CPU is an Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz.

To really understand the regression you'd need to look at the assembly, but before that: why not compile with RUSTFLAGS+="-C target-cpu=native"? Your CPU is a Haswell and has FMA, but I don't think you'll get that in compiled code by default.

Thanks for the simple reproducer! I bisected down to the exact nightly where your performance regresses. nightly-2022-11-23 has 3x the performance of nightly-2022-11-24 on my system for your test case.

If I had a nickel for every time I saw a regression between those two nightlies, I'd have two nickels, which isn't a lot, but it's weird that it happened twice, right?

I think the most likely candidate for this performance regression is https://github.com/rust-lang/rust/pull/102750, which reorders fields for repr(Rust) types.

And sure enough, if I add #[repr(C)] to your struct Dsp, the performance triples once again with today's Rust.
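
Roughly, the change is just this (a sketch; field names as in the generated code, element type assumed to be f32 here):

#[allow(non_snake_case)]
#[repr(C)] // preserve the declared field order, keeping each small/large buffer pair adjacent
pub struct Dsp {
    fRec18: [f32; 3],
    fVec18: [f32; 8192],
    fRec19: [f32; 3],
    fVec19: [f32; 8192],
    // ... many more buffer pairs in the real generated code
}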

24 Likes

@bluenote10 should be fixed here: Set repr(C) on the Dsp struct to recover performance · grame-cncm/faust@cb0772a · GitHub

2 Likes

Thanks everyone for narrowing down the problem and even fixing the Faust code gen so quickly! I'll soon re-run the benchmarks to reflect the changes.

In retrospect, it kind of makes sense that re-ordering according to size can be detrimental in this case, right? The DSP struct contains lots of small+large buffer pairs like:

fRec18: [0.0; 3],
fVec18: [0.0; 8192],
fRec19: [0.0; 3],
fVec19: [0.0; 8192],
fRec20: [0.0; 3],
fVec20: [0.0; 8192],
fRec21: [0.0; 3],
fVec21: [0.0; 8192],
[...]

The DSP logic accesses these buffers in pairs, and in exactly that order, which should be optimal in terms of cache locality. Re-ordering by size separates all the pairs (the small fRec buffers end up grouped together, far away from their fVec partners), which probably results in significantly more cache misses.

So there should be nothing wrong with #102750 per se, and no further follow-up needed?

Yeah, it was sort of a "happy accident" that the former layout algorithm didn't change your field order, but since your code is sensitive to that order, repr(C) is the right solution.

1 Like

I applaud this quick and effective ‘customer support’, but doesn't the added *fOut << "#[repr(C)]" deserve a comment explaining its importance?

1 Like

Does Rust guarantee that the members of a field (e.g. a nested struct or tuple) are never split up when the containing struct is stored in memory? If so, an alternative strategy would be to use another struct, or even just a tuple, to keep your pairs together in memory.

Types that are #[repr(Rust)] have no layout guarantees beyond what is otherwise specified (namely the guaranteed niche-value optimizations).

If you mean like:

struct S {
    fRecVec1: ([f64; 3], [f64; 8192]),
    fRecVec2: ([f64; 3], [f64; 8192]),
    // ...
}

Then yes, a tuple ([f64; 3], [f64; 8192]) is a standalone type with its own layout, and each field must use that layout so that things like &s.fRecVec1 work as a reference to the inner tuple type. So the pairs will be together, but you can't rely on which part of the pair comes first, nor the order of fRecVec1, fRecVec2, etc. within S.
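
As a compilable sketch of that (the helper function is just for illustration):

#[allow(non_snake_case)]
struct S {
    fRecVec1: ([f64; 3], [f64; 8192]),
    fRecVec2: ([f64; 3], [f64; 8192]),
}

// The inner tuple is a complete type with its own layout, so it can be
// passed around by reference as a unit.
fn first_samples(pair: &([f64; 3], [f64; 8192])) -> f64 {
    pair.0[0] + pair.1[0]
}

fn main() {
    let s = S {
        fRecVec1: ([0.0; 3], [0.0; 8192]),
        fRecVec2: ([0.0; 3], [0.0; 8192]),
    };
    // References to a whole pair always work; the outer field order under
    // repr(Rust) remains unspecified.
    println!("{}", first_samples(&s.fRecVec1) + first_samples(&s.fRecVec2));
}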

5 Likes

Good point; that may still matter even though the individual fields cross cache lines. Of course, in this particular case, once you've done that you could switch to an outer array of those pairs as well, whose element order is trivially guaranteed.

So the relevant change was that fields are now sorted by the largest alignment they could have, as inferred from their size. I can see how that can be useful when, e.g., the array in (u32, u16, [u8; 4], u16) gets 4-byte alignment for free, since the u32 already requires it.
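
To illustrate with that example type (layout under repr(Rust) is an unstable detail, so this only prints what the current compiler happens to pick; offset_of! is stable in newer Rust):

use std::mem::{align_of, offset_of, size_of};

#[allow(dead_code)]
struct Example {
    a: u32,
    b: u16,
    c: [u8; 4],
    d: u16,
}

fn main() {
    println!("size  = {}", size_of::<Example>()); // 12: the four fields pack with no padding
    println!("align = {}", align_of::<Example>()); // 4, dictated by the u32
    // If the compiler places the [u8; 4] on a 4-byte boundary, code that loads
    // it as a whole gets that alignment "for free".
    println!("offset of c = {}", offset_of!(Example, c));
}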

I don't think that should apply to a hypothetical field alignment larger than that of the containing type, which seemed to be the case in this thread: all fields and the struct itself have alignment 4; some fields just happened to be arrays with a power-of-two length.

Yeah, I agree. It's worth sorting things differently if it saves padding, but if something is already aligned by its minimum alignment, we shouldn't move it.

2 Likes

But if struct layout is performance-sensitive, you should be specifying it manually anyway; -Zrandomize-layout is a valid compiler layout algorithm, so you can't rely on compilers that happen to let source order influence their layout. (It might actually pay to run benchmarks with -Zrandomize-layout and a few different -Zlayout-seeds to verify that your code is not sensitive to the layout algorithm.)

11 Likes

Though, without e.g. PGO data, if the simple layout is already size-optimal, it's probably marginally better on average (and significantly better in lucky edge cases like this one) to keep the source order when not randomizing layout, since the author probably put the fields in a reasonable semantic order and the default is supposed to be reasonably performance-oriented. (It may also be a marginal benefit that drop order then remains memory-linear.)

Though this obviously should remain an unstable detail of the layout that can be changed, I do think it's a fairly reasonable ask that #[repr(Rust)] use the simple layout when the compiler has no layout it can claim is actually "better" (rather than simply more convenient).

(There's some minor passing interest in guaranteeing that the #[repr(Rust)] memory layout of types which contain at most one non-1-ZST field type (but potentially many copies of it) is array-like. "In order if no better option" is a slight generalization of that.)

3 Likes

Rust lets you get references to fields, including mutable references, which means that no individual field can be broken up by the layout algorithm.

That's why, for example,

[src/main.rs:3] size_of::<(u8, u16, u8, u16)>() = 6
[src/main.rs:4] size_of::<((u8, u16), (u8, u16))>() = 8

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b2747f99b9e8c06ab2482ada3e55e0e4
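
In source form, that's roughly (reconstructed, since only the output is quoted above):

use std::mem::size_of;

fn main() {
    // The flat tuple's fields can be reordered freely, so it packs into 6 bytes.
    dbg!(size_of::<(u8, u16, u8, u16)>());
    // Each inner (u8, u16) is its own type with its own 4-byte layout, so the
    // pair of pairs needs 8 bytes.
    dbg!(size_of::<((u8, u16), (u8, u16))>());
}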

(This is also why the layout algorithm can't collapse multiple bool fields into a single byte, among other related restrictions.)

3 Likes

I very much want there to eventually be some way to disallow taking a reference to a particular field. It would allow optimizations like this while still letting you declare the type you actually want. I know I've fallen back to manually bit-packing bools and using MaybeUninit solely because there's no native way to do it.
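
E.g. something roughly like this (a sketch): two bools stored in one u8 behind accessors, so no reference to an individual bool can ever be taken.

#[derive(Default)]
struct Flags(u8);

impl Flags {
    fn enabled(&self) -> bool {
        self.0 & 0b01 != 0
    }
    fn set_enabled(&mut self, v: bool) {
        if v { self.0 |= 0b01 } else { self.0 &= !0b01 }
    }
    fn dirty(&self) -> bool {
        self.0 & 0b10 != 0
    }
    fn set_dirty(&mut self, v: bool) {
        if v { self.0 |= 0b10 } else { self.0 &= !0b10 }
    }
}

fn main() {
    let mut f = Flags::default();
    f.set_enabled(true);
    assert!(f.enabled() && !f.dirty());
    println!("size_of::<Flags>() = {}", std::mem::size_of::<Flags>()); // 1 byte for both flags
}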

5 Likes

We already have this for #[repr(packed)], so it should "just" be a matter of providing an appropriate annotation / alternative repr.
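
For example (a sketch of the existing behaviour: the compiler refuses references to packed fields that might be misaligned, and you copy the value out instead):

#[repr(packed)]
struct Packed {
    a: u8,
    b: u16,
}

fn main() {
    let p = Packed { a: 1, b: 2 };
    // let r = &p.b; // rejected by the compiler: reference to a packed field
    let (a, b) = (p.a, p.b); // copying the values out is fine
    println!("a = {a}, b = {b}");
}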

2 Likes