Register attribute

Rustc has a MIR inliner that runs before generic functions are monomorphized, but it still needs MIR, which is only available for generic and #[inline] functions for the reasons I mentioned.

But that does not work across crates.

The MIR inliner works just fine across crates. It is just limited to generic and #[inline] functions because we don't encode MIR for other functions for performance reasons.


Are we talking past each other? My dream is that crate A drops some MIR on the disk, then crate B reads that MIR and uses it for optimisations. AFAIK the MIR inliner only inlines within a crate.

The MIR inliner works between crates too. For example, on the Rust Playground, if you view the MIR in debug mode it shows an explicit call, while in release mode there is no call and most debuginfo scopes point into the standard library.
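
To make the cross-crate case concrete, here is a minimal sketch (the crate and function names are made up for illustration):

```rust
// crate_a/src/lib.rs
//
// Because of #[inline], rustc encodes this function's MIR into
// crate_a's metadata, so downstream crates can inline it at the
// MIR level, before any LLVM IR is generated.
#[inline]
pub fn add_one(x: u32) -> u32 {
    x + 1
}

// crate_b/src/main.rs
//
// In a release build the MIR inliner can replace this call with the
// body of add_one; in a debug build the MIR keeps an explicit call.
fn main() {
    println!("{}", crate_a::add_one(41));
}
```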

In my experience, humans almost never do better than compilers. And when they do, it's generally in a hand-written assembly routine, not in a language like Rust.

And yes, LLVM will prefer registers if it can at all help it, and it will even spill old registers to the stack to make room. Using an existing register costs zero cycles, while even a load from L1 cache costs several. The only case where I've seen myself do much better is when compilers, in their effort to minimize the number of registers used, insert moves instead of filling out the scratch registers first (ultimately generating suboptimal code), but admittedly, even that is rare.
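
As a tiny illustration of that register preference (exact codegen of course varies by target, CPU features, and compiler version):

```rust
// With only a handful of live values, LLVM keeps everything in
// registers: on x86-64 (System V ABI) the six f64 arguments arrive
// in xmm0..xmm5, and this typically compiles to a few mulsd/addsd
// instructions and a ret, with no stack traffic at all.
pub fn dot3(ax: f64, ay: f64, az: f64, bx: f64, by: f64, bz: f64) -> f64 {
    ax * bx + ay * by + az * bz
}
```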


I am starting to see what the MIR inliner is doing.

I also agree that it is hard to beat compilers. The only chance I have had was to help it along.

Just as a data point against adding such an annotation: the register keyword was deprecated in C++11 (IIRC) and removed entirely in C++17.


When topics like this arise, this phrase always comes to mind: "Use a profiler to find bottlenecks, not your sixth sense". For the past six weeks, I've been improving my risky library for encoding RISC-V instructions, and I spent a few hours optimising my own tailored bit-field implementation. Sure, I got great performance, but the main reason I started optimising it in the first place was to speed up code generation in my toy compiler. It turned out to have almost no effect, because most of the code generation time was spent on manipulating SSA, not on generating the instructions. Not a totally wasted effort in this case, but still completely misguided and, I would argue, unnecessary.
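
In that spirit, even a crude measurement beats guessing. A minimal sketch of what I mean (the two phases are hypothetical stand-ins for a real compiler pipeline):

```rust
use std::time::Instant;

// Hypothetical stand-ins for real compiler phases.
fn build_ssa(source: &str) -> Vec<String> {
    source.split_whitespace().map(str::to_owned).collect()
}

fn encode_instructions(ssa: &[String]) -> Vec<u32> {
    ssa.iter().map(|op| op.len() as u32).collect()
}

fn main() {
    let source = "add x1, x2, x3";

    // Wall-clock each phase before optimising any of them: it is the
    // cheapest way to find out where the time actually goes.
    let t = Instant::now();
    let ssa = build_ssa(source);
    eprintln!("SSA construction: {:?}", t.elapsed());

    let t = Instant::now();
    let _code = encode_instructions(&ssa);
    eprintln!("instruction encoding: {:?}", t.elapsed());
}
```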

As for PGO, though, I'm not sure it's the best solution. It usually improves performance, but it sometimes makes it worse, and then it's very hard to find out why. And, if I understand correctly, its effectiveness depends on how representative the profiling workload is. Manual optimisation, on the other hand, 99% of the time depends only on the fragments of code you touch, so it's more robust, even if labour-intensive.
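
For reference, the usual rustc PGO loop looks roughly like this (flags are from the rustc book's PGO chapter; the binary name, input, and paths are placeholders). Step 2 is exactly where the workload-dependence comes from:

```sh
# 1. Build with instrumentation.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# 2. Run the binary on a *representative* workload.
./target/release/my-app typical-input.dat

# 3. Merge the raw profiles (llvm-profdata ships with the
#    llvm-tools-preview rustup component).
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild, letting the profile guide optimisation.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```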

My overall opinion is that it's better not to interfere manually with compilers' optimisers: they are much too complex for such an effort to yield any meaningful result. The best course of action is either to leave them alone or, because optimisers want to be happy too, to give them some love and try to eliminate the pathological edge cases. The latter obviously requires the highest expertise in compiler development, but there's no way around it. Or, alternatively, teach a huge AI model to optimise code: I hear it works like magic most of the time, and only occasionally produces totally useless and harmful garbage :grin:

