Register attribute

The main motivation for this proposal is virtual machines. If a VM provides a virtual heap, stack, and registers, then the virtual registers are supposed to be fast. To guarantee the performance of those V-regs, the corresponding vars can be marked like so:

#[register(always)]
let mut r0: usize = 0;
#[register(always)]
let mut r1: u32 = 0;
#[register(always)]
let mut r2: u64 = 0;

This mirrors the syntax of #[inline(always)].

The attribute would emit a warning if it ends up being a no-op, and a compilation error if applied to a type that can't fit in a register.

Later, this could be extended to allow any type that implements both the Sized and Copy traits, as long as the type is const-evaluable.
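To illustrate which types would qualify under such an extension, here is a hedged sketch (the Pair type is hypothetical, invented for this example): a small Sized + Copy struct that fits within a 64-bit machine register, versus one that doesn't.

```rust
use std::mem::size_of;

// Hypothetical example type: 8 bytes, Sized + Copy, so it would fit
// in a single 64-bit general-purpose register.
#[derive(Clone, Copy)]
struct Pair {
    a: u32,
    b: u32,
}

fn main() {
    // The proposed extension would accept Pair on a 64-bit target...
    assert!(size_of::<Pair>() <= 8);
    // ...but reject a type larger than one register, like [u64; 4].
    assert!(size_of::<[u64; 4]>() > 8);
}
```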

If a virtual machine has infinite virtual registers, then the backend should be able to trivially do register allocation such that no registers need to spill. AIUI, the process for both the LLVM and Cranelift backends necessarily requires going through mem2reg optimizations, since the IR is SSA, so mut variables must be alloca'd rather than kept as IR virtual SSA registers.

Any sort of guaranteed optimization is fundamentally problematic. Unless you can show that LLVM supports this functionality, it's unlikely that Rust can do this even if we want to expose this level of regalloc bullying. (IIRC, C's register is a no-op in clang.)

If it's a custom backend, you can probably just lower MIR mutable places (MIR is not in SSA form and IIRC lowering handles the SSA transform) directly into VM registers if they fit, no annotation necessary.

9 Likes

AFAIK register in C is completely dead and hasn't been doing anything for decades. Optimizers know better than users.

Apart from being an optimization hint that optimizers don't need, it'd be semantically weird. What do moves mean? How do you take a reference to a register? Operators and some methods on integers use &self and &mut self.
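The operator question is concrete: even a plain `+=` on an integer takes a reference under the hood. A minimal sketch of why a register-pinned variable is at odds with Rust's operator desugaring:

```rust
fn main() {
    let mut r0: u64 = 0;
    // `r0 += 1` desugars to AddAssign::add_assign(&mut r0, 1).
    // The trait method takes `&mut self`, so r0 must have an address
    // the moment the reference is created -- something a value pinned
    // to a hardware register does not have.
    std::ops::AddAssign::add_assign(&mut r0, 1);
    // Taking a raw pointer likewise forces the variable to be
    // addressable, i.e. to live in memory rather than a register.
    let p: *const u64 = &r0;
    assert_eq!(unsafe { *p }, 1);
}
```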

17 Likes

register is dead, but at least gcc has a way of assigning specific registers to local or global variables:

However, reading the documentation, it sounds like it's best-effort at most, and full of footguns if you try calling any functions, especially ones outside the translation unit.

Perhaps there's scope for using (an extension of) inline asm to achieve a similar effect.

I understand the problem. Perhaps it could be solved by only allowing 1 register attribute. If more registers are used, the compiler throws an error.

That's why it's supposed to be a hint, meaning "please keep this var in a reg for as much time as possible". If there are no available hardware regs, then it's (silently) a no-op.

True, but they can't solve the halting problem. Humans can't solve it either, but at least a human dev knows what kind of program they're coding and how it's supposed to be used, so having a reg hint (used sparingly) may be a good idea.

Those are very good questions! If I understood correctly, those would only be a problem if reg optimizations were guaranteed. The compiler can ignore the hint if it determines it's impossible to honor (that would probably require partially solving the halting problem, so it wouldn't be reliable).

Interesting, but that would be unsafe. On the plus side, being unsafe will make people think twice (or thrice) before using it.

I am always on the fence about where my expertise ends and where the optimiser's begins. Should I help the optimiser, or do I know better and force my knowledge on it? Is the register attribute a win in the end?

I am really surprised by the inline attributes sprinkled over the Rust compiler and the standard library. Do the developers know better than the optimiser? Is some optimisation missing in the optimiser?

They don't! Several PRs adding #[inline(always)] have been rejected due to making code slower.

The ones that exist are mainly for the debug builds, where there's no optimizer to decide.

6 Likes

Inline-always attributes are mainly there to clean up the code in debug builds, which would otherwise see an excessive amount of function indirection, and to assert that certain cross-crate inlining occurs even without LTO.

2 Likes

The optimizer has more freedom in register allocation than the programmer. Rust's variables have their specific scopes, strictly defined semantics and limitations. If the compiler listened to you literally, it could reserve a register too early and keep it occupied for too long, or needlessly hold a value in a register that could have been merged with some other expression that lives elsewhere.

In the compiled code, variables can be heavily transformed: created later, destroyed earlier, or come in and out of existence, spilled onto stack and reloaded to free up registers for more useful purposes, or may not even exist at all if they get folded into some other expression.

Also any operation on a variable is going to create temporary copies that may need a register for themselves too, and holding a register literally for the source may be less optimal than updating it in-place with a new value that isn't that variable.

So if it's literally reserving a register for a variable, there's a high chance it will be too coarse-grained and prevent useful optimizations. If it's a vague hint of "you know what I mean, just make this variable faster" then it's not that different from what the optimizer does already.

8 Likes

IIRC, without #[inline], a function is not allowed to be inlined across crate boundaries. For simple field projection methods (i.e., getters), this can be a significant benefit.
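A sketch of the getter case (the Point type is illustrative): marking a trivial field projection #[inline] makes its body available to downstream crates even without LTO.

```rust
// Imagine this in an upstream library crate. Without #[inline], the
// body of a non-generic method is not encoded in the crate metadata,
// so a downstream crate cannot inline it (absent LTO).
pub struct Point {
    x: f64,
    y: f64,
}

impl Point {
    #[inline]
    pub fn x(&self) -> f64 {
        self.x
    }

    #[inline]
    pub fn y(&self) -> f64 {
        self.y
    }
}

fn main() {
    let p = Point { x: 1.0, y: 2.0 };
    assert_eq!(p.x() + p.y(), 3.0);
}
```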

1 Like

Then the semantics of the inline attribute are odd. The attribute does not force the compiler to do something; it enables it to do cross-crate inlining. I believe with ThinLTO, it is completely up to the compiler to perform cross-TU inlining.

1 Like

LLVM doesn't seem to support the register hint at all. If you use plain register, nothing in the LLVM ir generated by clang changes. If you use register int *p1 asm ("rax") = ... to specify a specific register before an asm block, LLVM will apply the chosen register as part of the call ptr asm LLVM ir instruction, not as part of the p1 local. In other words there is literally no way for the default LLVM backend of rustc to support what you want.

4 Likes

There's also global register variables, but this feature is so restricted in clang (at least when targeting x86-64) that it might as well not exist at all. If you write

register unsigned long rXX asm("rXX");

as a file-scope declaration, for any value of rXX except rbp and rsp, you get an error:

error: register 'rXX' unsuitable for global register variables on this target

Obviously rsp can't actually be used as anything except the machine-level stack pointer; I imagine the intended use of register ... asm("rsp") is to give C read-only access to the value of the stack pointer, which could be useful in rare circumstances (scanning the stack for GC roots, for instance).
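The read-only stack-pointer use case can be reproduced in Rust with an `asm!` read rather than a global register variable. A hedged sketch (x86-64 only):

```rust
use std::arch::asm;

// Read the current stack pointer -- the use case described above for
// `register ... asm("rsp")` in C (e.g. scanning the stack for GC roots).
fn stack_pointer() -> usize {
    let sp: usize;
    unsafe {
        asm!("mov {}, rsp", out(reg) sp);
    }
    sp
}

fn main() {
    let local = 0u8;
    let sp = stack_pointer();
    // The stack grows downward on x86-64, so a live local in the
    // caller's frame sits at or above the callee's stack pointer.
    assert!(&local as *const u8 as usize >= sp);
}
```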

That leaves rbp, but if you try to use rbp, clang crashes:

$ cat test.c
register unsigned long vm_ip asm("rbp");
unsigned long get_ip(void) { return vm_ip; }

$ clang -O2 -S test.c
Fatal error: error in backend: register rbp is allocatable:
    function has no frame pointer
...
4.	Running pass 'X86 DAG->DAG Instruction Selection' on function '@get_ip'
 #0 0x00007fae70ca5291 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int)
...
 #7 0x00007fae70bda393 llvm::report_fatal_error(llvm::Twine const&, bool)
 #8 0x00007fae734cc084
 #9 0x00007fae714c4d87 llvm::SelectionDAGISel::Select_READ_REGISTER(llvm::SDNode*)
#10 0x00007fae714c569a llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
#11 0x00007fae73476e4a 
#12 0x00007fae714c199f llvm::SelectionDAGISel::DoInstructionSelection()
...

If I change that to rsp (so it won't crash) and use -emit-llvm, what I get is a call to an intrinsic function:

define dso_local i64 @get_ip() local_unnamed_addr #0 {
  %1 = tail call i64 @llvm.read_register.i64(metadata !0)
  ret i64 %1
}
!llvm.named.register.rsp = !{!0}
!0 = !{!"rsp"}

I don't know how to get clang to cough up IR dumps any earlier in the compilation than what -emit-llvm produces.

1 Like

Isn't that what the compiler is always trying to do? It would much rather not spill to memory.

Said otherwise, if it's a hint that means that, why wouldn't I put it on every local variable ever?

Non-generic methods aren't instantiated in every translation unit by default, so they're not available to be inlined by default (without LTO). LLVM will inline things if they're available regardless of whether you put #[inline] on them, but if it doesn't have the implementation then it can't even if it would want to.

So I think it's less an optimizer problem and more a Rust problem -- perhaps there should be a crate-level #![auto_inline_functions_with_trivial_bodies] option, so that the vast majority of the current #[inline] annotations could be removed. Or maybe even just have that be the default (for non-debug?), making people #[inline(never)] things if they don't want that for some reason.

If you need this level of control, why not write in assembly?

You may be interested in the article Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C

I'm imagining some dude (probably me) literally talking to the compiler, saying that phrase, and the compiler replying "gotchu bro" in stdout, then the compiler proceeds to make the program 1% faster everywhere BUT a hot loop, where the program becomes 16x slower because of all the local vars. I had a good laugh at that thought!

Thank you for explaining why my idea is bad, and thank you for making me laugh!

Yes, but I meant something like "prioritize this var over all others"

I don't want to repeat myself, but I realized it's a bad idea, and that's why I suggested allowing only one occurrence of that attribute. The exact algorithm for counting occurrences is undefined; I would suggest counting crates individually, therefore allowing only one #[register] per crate.

To avoid unsafe (do I really have to say why?) if it's not necessary.

Thanks for the link!

1 Like

After thinking more, and reading that article, I've come to the conclusion that profile-guided optimization is better than my proposal. The only disadvantage is that PGO requires much more time and work. Perhaps this is a good thing, because it forces devs to actually optimize the program properly, rather than being lazy.

If you look at the MLIR-Clang RFC, you will find the word summaries. They were talking about ThinLTO at the Clang IR level. Swift is doing the same; see Eckstein's work.

If a future rustc would drop MIR summaries on disk and other crates would pick up those summaries, then maybe we could enable other optimizations. It is like ThinLTO, but with Rust knowledge.

We support encoding the MIR for all functions using -Zalways-encode-mir; however, it increases crate metadata size a lot and results in significant slowdown due to having to encode much more. That is why we currently only encode generic and #[inline] functions. In addition, unlike LLVM ThinLTO summaries, LLVM can't lazily load MIR functions; rustc has to codegen them to LLVM IR in advance. #[inline] helps rustc decide which functions to codegen to LLVM IR to make them available for inlining.

I believe the idea was that during the compilation of crate A, you drop some summary of its surface. During the compilation of crate B, which depends on A, you use the summary data of crate A for optimisations. It is the same idea as ThinLTO, but several levels of abstraction higher. LLVM does not know what generics are.