Unexpected 3x performance regression starting with Rust version 1.67

I'm using Rust in combination with Faust, a domain specific language for digital signal processing, that compiles to many languages, including Rust. I've contributed to the Faust-to-Rust code generation, and I'm maintaining some benchmarks that compare Rust against the C++ performance.

Up until Rust 1.67, performance was very good! Starting with 1.67, however, it has dropped significantly: in real-world applications I'm often seeing a 3x slowdown. I'm wondering what is causing this slowdown, and whether anything can be done to get back the performance of older Rust versions.

I asked this question on StackOverflow before, but it was closed and deleted, and I was redirected to this forum. The issue is that the problem cannot be reproduced in just a few lines of code: Faust's code generation produces very messy (but presumably highly "pre-optimized") Rust code, so any meaningful MRE requires quite a bit of code, I'm afraid. I've prepared such a single-file MRE here. If preferred, I can also post it here directly (~1000 LoC)?

Running it with cargo run --release on any version from 1.67 onward achieves an audio throughput of ~30 MiB/sec on my machine. For comparison, running it with older Rust versions (e.g. RUSTUP_TOOLCHAIN=1.66.0 cargo run --release) results in a throughput of almost 90 MiB/sec.

Looking at the 1.67 release notes, it isn't clear why the performance is so much worse. Perhaps it is the new MIR, or the field ordering that affects this particular code?

Any thoughts on whether we could make changes on the code-generation side in Faust to get back the old performance? Or does the regression justify opening an issue against the Rust compiler itself, in case it was unintentional?

3 Likes

What CPU are you targeting? How are you compiling your code? How has the assembly generated by rustc changed?

Everything is "default", i.e., I'm just compiling/running it with cargo run --release and this minimal Cargo.toml.

Could it be related to a change in the defaults?

My CPU is an Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz.

To really understand the regression you'd need to look at the assembly, but before that: why not compile with RUSTFLAGS+="-C target-cpu=native"? Your CPU is a Haswell and has FMA, but I don't think you'll get that in compiled code by default.

Thanks for the simple reproducer! I bisected down to the exact nightly where your performance regresses. nightly-2022-11-23 has 3x the performance of nightly-2022-11-24 on my system for your test case.

If I had a nickel for every time I saw a regression between those two nightlies, I'd have two nickels, which isn't a lot, but it's weird that it happened twice, right?

I think the most likely candidate for this performance regression is https://github.com/rust-lang/rust/pull/102750, which reorders fields for repr(Rust) types.

And sure enough, if I add #[repr(C)] to your struct Dsp, the performance triples once again with today's Rust.
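
Roughly, the change is just this (a sketch; field names as in the generated code, element type assumed to be f32 here):

#[allow(non_snake_case)]
#[repr(C)] // preserve the declared field order, keeping each small/large buffer pair adjacent
pub struct Dsp {
    fRec18: [f32; 3],
    fVec18: [f32; 8192],
    fRec19: [f32; 3],
    fVec19: [f32; 8192],
    // ... many more buffer pairs in the real generated code
}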

24 Likes

@bluenote10 should be fixed here: Set repr(C) on the Dsp struct to recover performance · grame-cncm/faust@cb0772a · GitHub

2 Likes

Thanks everyone for narrowing down the problem and even fixing the Faust code gen so quickly! I'll soon re-run the benchmarks to reflect the changes.

In retrospect, it kind of makes sense that re-ordering according to size can be detrimental in this case, right? The DSP struct contains lots of small+large buffer pairs like:

fRec18: [0.0; 3],
fVec18: [0.0; 8192],
fRec19: [0.0; 3],
fVec19: [0.0; 8192],
fRec20: [0.0; 3],
fVec20: [0.0; 8192],
fRec21: [0.0; 3],
fVec21: [0.0; 8192],
[...]

The DSP logic accesses these buffers in pairs, and in exactly that order, which should be optimal in terms of cache locality. Re-ordering by size separates all the pairs (the small fRec buffers end up grouped together, far away from their fVec partners), which probably results in significantly more cache misses.

So there should be nothing wrong with #102750 per se, and no further follow-up needed?

Yeah, it was sort of a "happy accident" that the former layout algorithm didn't change your field order, but since your code is sensitive to that order, repr(C) is the right solution.

1 Like

I applaud this quick and effective ‘customer support’, but doesn't the added *fOut << "#[repr(C)]" deserve a comment explaining its importance?

1 Like

Does Rust guarantee that the members of a field (e.g. a nested struct or tuple) are never split up when the containing struct is stored in memory? If so, an alternative strategy would be to use another struct, or even just a tuple, to keep your pairs together in memory.

Types that are #[repr(Rust)] have no layout guarantees beyond what is otherwise specified (namely the guaranteed niche-value optimizations).

If you mean like:

struct S {
    fRecVec1: ([f64; 3], [f64; 8192]),
    fRecVec2: ([f64; 3], [f64; 8192]),
    // ...
}

Then yes, a tuple ([f64; 3], [f64; 8192]) is a standalone type with its own layout, and each field must use that layout so that things like &s.fRecVec1 work as a reference to the inner tuple type. So the pairs will be together, but you can't rely on which part of the pair comes first, nor the order of fRecVec1, fRecVec2, etc. within S.
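
As a compilable sketch of that (the helper function is just for illustration):

#[allow(non_snake_case)]
struct S {
    fRecVec1: ([f64; 3], [f64; 8192]),
    fRecVec2: ([f64; 3], [f64; 8192]),
}

// The inner tuple is a complete type with its own layout, so it can be
// passed around by reference as a unit.
fn first_samples(pair: &([f64; 3], [f64; 8192])) -> f64 {
    pair.0[0] + pair.1[0]
}

fn main() {
    let s = S {
        fRecVec1: ([0.0; 3], [0.0; 8192]),
        fRecVec2: ([0.0; 3], [0.0; 8192]),
    };
    // References to a whole pair always work; the outer field order under
    // repr(Rust) remains unspecified.
    println!("{}", first_samples(&s.fRecVec1) + first_samples(&s.fRecVec2));
}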

5 Likes

Good point; that may still matter even though the individual fields cross cache lines. Of course, in this particular case, once you've done that you could switch to an outer array of those pairs as well, whose element order is trivially guaranteed.

So the relevant change was that fields are now sorted by the largest alignment they could have, as inferred from their size. I can see how that can be useful when, e.g., the array in (u32, u16, [u8; 4], u16) gets 4-byte alignment for free, since the u32 already requires it.
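
To illustrate with that example type (layout under repr(Rust) is an unstable detail, so this only prints what the current compiler happens to pick; offset_of! is stable in newer Rust):

use std::mem::{align_of, offset_of, size_of};

#[allow(dead_code)]
struct Example {
    a: u32,
    b: u16,
    c: [u8; 4],
    d: u16,
}

fn main() {
    println!("size  = {}", size_of::<Example>()); // 12: the four fields pack with no padding
    println!("align = {}", align_of::<Example>()); // 4, dictated by the u32
    // If the compiler places the [u8; 4] on a 4-byte boundary, code that loads
    // it as a whole gets that alignment "for free".
    println!("offset of c = {}", offset_of!(Example, c));
}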

I don't think that should apply to a hypothetical field alignment larger than that of the containing type, which seemed to be the case in this thread: all fields and the struct itself have alignment 4; some fields just happened to be arrays with a power-of-two length.

Yeah, I agree. It's worth sorting things differently if it saves padding, but if something is already aligned by its minimum alignment, we shouldn't move it.

2 Likes

But if struct layout is performance-sensitive, you should be specifying it manually anyway; -Zrandomize-layout is a valid compiler layout algorithm, so you can't rely on compilers that happen to let source order influence their layout. (It might actually pay to run benchmarks with -Zrandomize-layout and a few different -Zlayout-seeds to verify that your code is not sensitive to the layout algorithm.)

11 Likes

Though, without e.g. PGO data, if the simple layout is already size-optimal, it's probably marginally better on average (and significantly better in lucky edge cases like this one) to keep the source order when not randomizing layout, since the author probably put the fields in a reasonable semantic order and the default is supposed to be reasonably performance-oriented. (It may also be a marginal benefit that drop order then remains memory-linear.)

Though this obviously should remain an unstable detail of the layout that can be changed, I do think it's a fairly reasonable ask that #[repr(Rust)] use the simple layout when the compiler has no layout it can claim is actually "better" (rather than simply more convenient).

(There's some minor passing interest in guaranteeing that the #[repr(Rust)] memory layout of types which contain at most one non-1-ZST field type (but potentially many copies of it) is array-like. "In order if no better option" is a slight generalization of that.)

3 Likes

Rust lets you get references to fields, including mutable references, which means that no individual field can be broken up by the layout algorithm.

That's why, for example,

[src/main.rs:3] size_of::<(u8, u16, u8, u16)>() = 6
[src/main.rs:4] size_of::<((u8, u16), (u8, u16))>() = 8

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b2747f99b9e8c06ab2482ada3e55e0e4
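
In source form, that's roughly (reconstructed, since only the output is quoted above):

use std::mem::size_of;

fn main() {
    // The flat tuple's fields can be reordered freely, so it packs into 6 bytes.
    dbg!(size_of::<(u8, u16, u8, u16)>());
    // Each inner (u8, u16) is its own type with its own 4-byte layout, so the
    // pair of pairs needs 8 bytes.
    dbg!(size_of::<((u8, u16), (u8, u16))>());
}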

(This is also why the layout algorithm can't collapse multiple bool fields into a single byte, among other related restrictions.)

3 Likes

I very much want there to eventually be some way to disallow taking a reference to a particular field. It would allow optimizations like this while still letting you declare the type you actually want. I know I've fallen back to manually bit-packing bools and using MaybeUninit solely because there's no native way to do it.
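
E.g. something roughly like this (a sketch): two bools stored in one u8 behind accessors, so no reference to an individual bool can ever be taken.

#[derive(Default)]
struct Flags(u8);

impl Flags {
    fn enabled(&self) -> bool {
        self.0 & 0b01 != 0
    }
    fn set_enabled(&mut self, v: bool) {
        if v { self.0 |= 0b01 } else { self.0 &= !0b01 }
    }
    fn dirty(&self) -> bool {
        self.0 & 0b10 != 0
    }
    fn set_dirty(&mut self, v: bool) {
        if v { self.0 |= 0b10 } else { self.0 &= !0b10 }
    }
}

fn main() {
    let mut f = Flags::default();
    f.set_enabled(true);
    assert!(f.enabled() && !f.dirty());
    println!("size_of::<Flags>() = {}", std::mem::size_of::<Flags>()); // 1 byte for both flags
}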

5 Likes

We already have this for #[repr(packed)], so it should "just" be a matter of providing an appropriate annotation / alternative repr.
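
For example (a sketch of the existing behaviour: the compiler refuses references to packed fields that might be misaligned, and you copy the value out instead):

#[repr(packed)]
struct Packed {
    a: u8,
    b: u16,
}

fn main() {
    let p = Packed { a: 1, b: 2 };
    // let r = &p.b; // rejected by the compiler: reference to a packed field
    let (a, b) = (p.a, p.b); // copying the values out is fine
    println!("a = {a}, b = {b}");
}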

2 Likes