I'm using Rust in combination with Faust, a domain specific language for digital signal processing, that compiles to many languages, including Rust. I've contributed to the Faust-to-Rust code generation, and I'm maintaining some benchmarks that compare Rust against the C++ performance.
Up until Rust version 1.67, the performance of Rust has been very good! However, starting with Rust version 1.67 the performance has been reduced significantly. In real world applications, I'm often seeing a factor of 3 slow down. I'm wondering what is causing this slow down, and if there is anything that can be done to get back the performance from older Rust versions.
I've asked this question before on StackOverflow, but it has been closed and deleted, and I was redirected to this forum. The issue is that the problem cannot be reproduced in just a few lines of code. The code generation of Faust produces very messy (but presumably highly "pre-optimized") Rust code. Any meaningful MRE requires quite a bit of code I'm afraid. I've prepared such a single-file MRE here. If preferred, I can also post it here directly (~1000 LoC)?
Running it with cargo run --release with any version starting with 1.67 and later achieves an audio throughput of ~30 MiB/sec on my machine. For comparison, running it with older Rust versions (e.g. RUSTUP_TOOLCHAIN=1.66.0 cargo run --release) results in a throughput of almost 90 MiB/sec.
Looking at the 1.67 release notes, it isn't clear why the performance is so much worse. Perhaps it is the new MIR, or the field ordering that affects this particular code?
Any thoughts if we could make any changes on the code generation side in Faust to get back to the old performance, or does the performance regression justify opening an issue for the Rust compiler itself if it was by accident?
To really understand the regression you'd need to look at the assembly but before that, why not compile with RUSTFLAGS+="-C target-cpu=native"? Your CPU is a Haswell and has fma but I don't think you'll get that in compiled code by default.
Thanks for the simple reproducer! I bisected down to the exact nightly where your performance regresses. nightly-2022-11-23 has 3x the performance of nightly-2022-11-24 on my system for your test case.
If I had a nickel for every time I saw a regression between those two nightlies, I'd have two nickels, which isn't a lot, but it's weird that it happened twice, right?
The DSP logic accesses these buffers as pairs, and also exactly in that particular order. This should be optimal in terms of cache locality. Re-ordering by size separates all pairs, probably resulting in significant cache misses.
So there should be nothing wrong with #102750 per se, and no further follow-up needed?
Does Rust guarantee that the members of a field are never split up if the struct is stored in memory? If so, an alternate strategy would be to use another struct or even just a tuple to keep your pairs together in memory.
Then yes, a tuple ([f64; 3], [f64; 8192]) is a standalone type with its own layout, and each field must use that layout so that things like &s.fRecVec1 work as a reference to the inner tuple type. So the pairs will be together, but you can't rely on which part of the pair comes first, nor the order of fRecVec1, fRecVec2, etc. within S.
Good point, that may still be important enough even though the individual fields cross cache lines. Of course, in this particular case once you’ve done that you can switch to an outer array as well, which is trivially ordered.
So the relevant change was that fields are sorted based on the largest alignment they could have based on their size. I can see how that can be useful when e.g. the array in (u32, u16, [u8; 4], u16) gets to have 4-byte alignment for free since the u32 already requires it.
I don't think that should apply to a hypothetical field alignment larger than that of the containing type, which seemed to be the case in this thread: All fields and the struct itself have align 4, some fields just happened to be arrays with a power of two length.
But if struct layout is performance sensitive you should be manually specifying it anyway, -Zrandomize-layout is a valid compiler layout algorithm, so you can't rely on some compilers that happen to let source order influence their layout. (It might actually pay to run benchmarks with -Zrandomize-layout and a few -Zlayout-seed=s to verify your code is not layout algorithm sensitive).
Though without the presence of e.g. PGO data, if the simple layout is already size optimal, it's probably marginally better on average (and significantly in some lucky edge cases like this one) to keep the source order when not randomizing layout, since the author probably put the fields in a reasonable semantic order and the default is supposed to be reasonably performance oriented. (It might also have marginal benefits that drop order remains memory-linear, as well.)
Though this obviously should remain an unstable detail of the layout that can be changed, I do think it's a fairly reasonable ask that #[repr(Rust)] is the simple layout if there's not a layout the compiler is claiming should be "better" (rather than simply more convenient).
(There's some minor passing interest in guaranteeing #[repr(Rust)] memory layout for types which contain at most one non-1ZST type (but potentially many copies) to be arraylike. "Inorder if no better option" is a slight generalization of that.)
I very much want there to be some way to disallow getting a reference to a certain chunk in the future. It would allow optimizations like this while still allowing the desired type. I know I've fallen back to manually bitpacking bools and using MaybeUninitsolely because there's no native way to do it.