Should we ever stabilize inline assembly?

Backends care about more machine state than that, particularly about any flags that are set, reset, or toggled dynamically as a result of code execution. Thus the clobbers list potentially needs to cover almost all accessible machine state, which is more than can be declared as inputs and outputs.


Since I've never done inline assembly of any kind but still want to follow the discussion, and the LLVM Language Reference Manual is a bit much to digest in five minutes, am I interpreting these constraints right?

  • =&r means all output registers are "early-clobber" outputs, i.e. the backend may not assume this assembly reads all inputs before writing any outputs, so optimizations like using one register for both an input and an output are incorrect
  • =&{reg} means the output register named reg is an "early-clobber" output
  • =*m means output to wherever the host language variable m is
  • r means the input may be any register (is this actually useful? or is it just required by inline asm that you always provide some constraint?)
  • {reg} means the input register is reg
  • ~{reg} means this inline assembly "clobbers" the contents of reg, i.e. it is neither an input nor an output, but its value may get changed regardless so the optimizer can't rely on it staying untouched
  • ~{memory} means this inline assembly "clobbers everything" / "writes to arbitrary undeclared memory locations"

If that's what those constraints mean, it does seem like a reasonable ask of other backends, at least at first glance.
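
These concepts later surfaced in the `asm!` syntax Rust eventually stabilized, so a rough sketch of the mapping is possible (x86-64 assumed; this is my reading, not an official correspondence). `in(reg)` plays the role of `r`, and `out(reg)` is early-clobber in the `=&r` sense: it never shares a register with an input, whereas `lateout(reg)` corresponds to plain `=r` and may reuse one.

```rust
use std::arch::asm;

// Sketch of the constraint concepts in stabilized asm! (x86-64 assumed).
// `out` is an early-clobber style output (like `=&r`): the compiler will
// not allocate it to the same register as the input `x`.
fn double_plus_one(x: u64) -> u64 {
    let y: u64;
    unsafe {
        asm!(
            "lea {y}, [{x} + {x} + 1]", // y = 2*x + 1
            x = in(reg) x,              // `r`: any general-purpose register
            y = out(reg) y,             // `=&r`: early-clobber output
        );
    }
    y
}
```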

(I'd really like to hear more from potential backend implementers on how feasible all these various ideas are)


Well, you've got a problem there, because that's a complete list of everything LLVM will let you mark as a clobber! For anything not expressible as a register, I suspect LLVM treats inline assembly as a flag fence (x86 is weird in that you can write asm!("":::"rflags"), for example).

I doubt Rust needs to be informed of any machine state other than memory and registers; for example, I doubt LLVM, or any other reasonable backend, has any knowledge of CSRs in RISC-V.

You're basically correct. I specified =&r because, semantically, that's how we humans use the Clang =r constraint; Clang is often smart enough to use the more constrained =r LLVM constraint. =*m means that you're passing a pointer to the asm block, and that the assembly is going to read from or write to that pointer. Consider (I don't write a lot of IR, so I might have this slightly wrong):

; This is a made up assembly language with Intel syntax.
%1 = call i32 asm "load_word $0, $1", "=r,=*m"(i32* %my_ptr)

"r" is useful when you need to get values into the assembly but don't care which register they land in; compare with doing a manual callq, where you might instead want to ask the compiler to set up the calling convention in a way that is friendly to the register allocator. Of course, there are other register categories; FPU and SIMD registers are the most well-known, and what they mean tends to be really platform-dependent. There's also the problem that "register" is a pretty flimsy concept: x86 famously has the %al < %ax < %eax < %rax and %xmm0 < %ymm0 < %zmm0 hierarchies, so that when I write

asm!("movb $1, $0" : "=r"(a) : "r"(b));

the compiler has to figure out which width of each allocated register to use (in this case, the byte forms, such as %al).
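
In the `asm!` syntax Rust later stabilized, that width hierarchy is exposed through register classes rather than inferred from the mnemonic alone. A minimal sketch (x86-64 assumed): `reg_byte` asks for the byte view of a general-purpose register, e.g. %al rather than %rax.

```rust
use std::arch::asm;

// `reg_byte` selects the 8-bit view of a GPR (e.g. %al, %bl),
// so `mov` here is a byte move (x86-64 assumed).
fn copy_low_byte(b: u8) -> u8 {
    let a: u8;
    unsafe {
        asm!("mov {0}, {1}", out(reg_byte) a, in(reg_byte) b);
    }
    a
}
```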

~{memory} simply means that any speculative reads LLVM has made are invalid, and all writes must be flushed before the assembly executes. The only place I've ever used it (and where you always see it) is when implementing libc::syscall() or similar: syscalls don't actually clobber registers, except for the ones used in the syscall ABI.
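
A minimal sketch of that syscall pattern, written with the `asm!` syntax Rust later stabilized (x86-64 Linux assumed; 39 is SYS_getpid in that ABI). Only the registers named in the syscall ABI are marked clobbered, and because no `nomem` option is given, the block is treated as touching memory, matching the ~{memory} behavior described above.

```rust
use std::arch::asm;

// Raw getpid syscall (x86-64 Linux assumed). rax carries the syscall
// number in and the result out; the `syscall` instruction itself
// clobbers rcx and r11, and nothing else needs to be saved.
fn getpid_raw() -> i64 {
    let ret: i64;
    unsafe {
        asm!(
            "syscall",
            inout("rax") 39i64 => ret, // 39 = SYS_getpid on x86-64 Linux
            out("rcx") _,              // clobbered by `syscall`
            out("r11") _,              // clobbered by `syscall`
            options(nostack),
        );
    }
    ret
}
```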

Couldn't "flags" be modeled as single-bit registers for the purpose of saying what is used for inputs vs. outputs vs. clobbers? Consider a model that looks like this:

  1. Given a list of "registers" that the HW architecture supports, each with a declared bit-width (flags would be 1-bit registers, for example)
  2. Given the list of registers that the ASM block expects to be populated with inputs
  3. Given the list of registers that the ASM block will use for outputs
  4. Given the list of registers that the ASM block will modify (and that will need to be saved before the ASM block is invoked, and restored after, if needed)
  5. A list of memory segments that the ASM block expects some inputs to be in (if any)
  6. A list of memory segments that the ASM block will use for outputs (if any)
  7. A list of memory segments that the ASM block will modify, including the current stack frame

If all that information were declared (with #1 being a file that defines a HW architecture backend), wouldn't that be enough for the compiler to do what it needs to do? Couldn't it then just pass the inline assembly block to the system assembler and inline the object code that that generated? Obviously it would be unsafe, but why can't that work, and why couldn't it be stabilized? What are the blockers? Is there something fundamentally wrong with this model of things? I'm not normally working at the level of compilers/assemblers, but I'm having difficulty understanding why that wouldn't just work.
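
To make the shape of the proposal concrete, the seven points above could be encoded as a declarative description roughly like the following. Every name here is invented for illustration; this is not an actual rustc interface.

```rust
// Hypothetical declarative description of an asm block, following the
// numbered model above (numbers in comments refer to those points).
struct AsmBlockSpec<'a> {
    registers_file: &'a str,        // (1) arch file declaring registers and bit-widths
    input_regs: &'a [&'a str],      // (2) registers expected to hold inputs
    output_regs: &'a [&'a str],     // (3) registers the block writes outputs to
    clobbered_regs: &'a [&'a str],  // (4) registers modified; saved/restored around the block
    input_memory: &'a [&'a str],    // (5) memory segments inputs may live in
    output_memory: &'a [&'a str],   // (6) memory segments used for outputs
    modified_memory: &'a [&'a str], // (7) memory segments modified, incl. the stack frame
}

// A flag modeled as a 1-bit "register", as point (1) suggests: (name, bit-width).
const CARRY_FLAG: (&str, u32) = ("CF", 1);
```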

It seems like some are saying something like this can't work for theoretical reasons. Is that so? What exactly are those reasons?


Alignment sads could be a big one. And, what does "pasting in an object file" even mean? Are we asking the linker to do exotic, mid-function relocations? It sounds like you want .rs/.S LTO, which seems hard? Maybe I misunderstand.

Yeah, that part was rather amorphous. What I meant was that the assembler would create an object file from the assembly code passed from the compiler and hand back essentially an array of bytes of machine code. The compiler would then emit whatever instructions were necessary to populate the inputs/outputs/clobbers etc., inline that set of machine-code bytes, and then add whatever was necessary afterwards to pull out the outputs and restore clobbers, etc.

Why would alignment be an issue exactly? (Again, I'm not normally working on assembler level stuff. In fact, the last time I did assembler was almost 30 years ago on missile system trajectory computers for Air Defense Missile systems that involved manually putting in machine instructions by dialing in octal nibbles for a 21-bit address/operand 3-digits at a time), so please forgive my ignorance.

I'm mostly worried about some ISAs which require instructions to be aligned in certain ways. I don't have concrete examples, but here's something that comes to mind: RISC-V has word-aligned instructions, and has an extension allowing for "compressed" half-word instructions. Mind, in this extension, all instructions, even word-wide instructions, can be half-word aligned, so RISC-V doesn't have any issues here. But, you could imagine a nasty ISA that requires some instructions to be word aligned, and some to merely be half-word aligned; then you might have alignment sads. (If someone can tell me no ISA is crazy enough to do this, please correct me!)

This is pretty close to describing inline assembly as it currently exists, except that you would require the user to choose specific registers for inputs and outputs rather than letting the compiler choose. But that only slightly simplifies the job of register allocation around inline assembly, which still has to be done even with user-specified registers.

Other than that… it's true that not wanting to reimplement assembly parsing was one reason Cranelift didn't want to support inline assembly. But as I've been discussing, using an external assembler is in fact a valid implementation strategy of the existing user-facing design. There are two downsides:

  • The compiler has to be able to either

    • use your strategy: parse the object file the assembler outputs in order to integrate the code into its own representation of the function (including relocations), or
    • use GCC's strategy: output its own code as assembly and just have the assembler assemble both together.

    Currently Cranelift can do neither; AFAIK it only supports directly generating machine code and writing that to an object file (not reading). But I believe that can change.

  • Less than ideally suited for JIT due to the cost of invoking an external process. Not a huge deal for Rust – we don't currently have a JIT, and even if we wanted one, using an external assembler is still possible, just slower. You would only have to use it for functions that actually contain inline assembly.

That's not even necessary though. The inline asm code doesn't have to be inside the function, it can be assembled in a separate object file as an out-of-line blob which the function calls into.

See my previous comment on how this would work.


Yes, although the unnecessary jumps have a performance cost, and you still require the code generator to do register allocation around the inline assembly. It could work, but personally I favor a different approach. If we get to the point where inline assembly is designed, implemented, and ready to stabilize, yet Cranelift remains a blocker, then we should add a mode to rustc that supports inline assembly with zero backend support, at the cost of somewhat higher overhead. It would work as follows:

All asm blocks would turn into regular function calls to auto-generated external assembly functions; rustc itself would be responsible for wrapping the user's assembly with a suitable prolog/epilog (including assembly required to accept inputs and return outputs using the C calling convention), calling an external assembler, and linking the resulting object into the final binary. (This does mean doing work for each architecture, but it's not as bad as it sounds: we'd only need to support whatever architectures Cranelift supports.)
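
A sketch of what that lowering could look like, with `global_asm!` standing in for the auto-generated external assembly file (x86-64 System V assumed: argument in rdi, return in rax; all names invented). rustc would emit the stub plus an ordinary call to it.

```rust
use std::arch::global_asm;

// Hypothetical auto-generated out-of-line stub wrapping a user's asm
// body in a C-calling-convention prolog/epilog (x86-64 System V assumed).
global_asm!(
    ".globl asm_stub_add_three",
    "asm_stub_add_three:",
    "lea rax, [rdi + 3]", // <- the user's asm body; input rdi, output rax
    "ret",
);

extern "C" {
    fn asm_stub_add_three(x: u64) -> u64;
}

// What the rewritten asm! block in the caller would reduce to.
fn call_asm_block(x: u64) -> u64 {
    unsafe { asm_stub_add_three(x) }
}
```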

For asm without goto this is relatively straightforward. But even asm goto can be supported with an ugly hack.

(Disclaimer: The following is worded as if we were adopting asm goto as is, but in reality it would probably look more like an if or match statement, as previously discussed. My point here is the implementation.)

Suppose you have an asm block which wants to end by branching to one of label_1, label_2, or label_3. Well, just create dummy labels in the stub function: label_1 causes the stub function to return 1; label_2 makes it return 2; and so on. Then in the caller function, after calling the stub function, use the equivalent of a match statement to branch to one of the "real" labels depending on what the function returned. This is slow, but it implements the required semantics.
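
The hack can be sketched in plain Rust, with an ordinary function standing in for the generated assembly stub (no real asm here; names are invented):

```rust
// `stub` stands in for the generated assembly function, which "branches"
// to one of three labels by returning 1, 2, or 3.
fn stub(x: i32) -> u8 {
    // Imagine this is assembly that jumps to label_1/2/3 based on x.
    if x < 0 { 1 } else if x == 0 { 2 } else { 3 }
}

// The caller dispatches on the returned tag to reach the "real" labels.
fn caller(x: i32) -> &'static str {
    match stub(x) {
        1 => "label_1", // the real label_1 code would live here
        2 => "label_2",
        3 => "label_3",
        _ => unreachable!(),
    }
}
```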


ARM has 32-bit opcodes which must be align 4, and also starting in ARMv4T they added a separate execution mode for 16-bit opcodes that must be align 2, and also in later versions (I forget which) they added more to the 16-bit mode so that some instructions are two 16-bit values paired up (still only align 2 though, so not quite the same as the 32-bit operations).

Wouldn't that be solvable by the assembler generating aligned code as needed, and then including in its output the alignment required for the overall block of opcodes? The compiler would then simply insert an appropriate number of no-op instructions before the beginning of the block to obtain the necessary alignment. Would that not work?


That works, at least for the ARM THUMB paired opcodes described by @Lokathor.

Most of the x86 instructions are SIMD-related, and the intrinsics approach has already been adopted for those, so it shouldn't be too hard to add the remaining interesting ones.

Microsoft Visual C++ doesn't support inline assembly on x64, but only intrinsics (defined by Microsoft in intrin.h), and I think they compile the Windows NT kernel with it, so it should be a feasible approach.

I also think that inline assembly is a foundational feature, and Rust cannot be called a true systems programming language without it.

I like the suggested approach of consuming a raw binary blob which is simply inlined by the compiler. It allows us to completely sidestep the issue of choosing an assembly syntax. More convenient APIs can then be built on top of it using procedural macros, which could use external assemblers or eventually maybe even a pure-Rust assembler implemented as a proc macro.

Of course, we would have to specify an interface for interacting with the register allocator and a way to specify alignment requirements for the inlined binary code. But this feature can be done incrementally; e.g., we can start with explicitly named input and clobber registers and later extend it with dynamic registers chosen by the compiler.

Granted, even in this form the feature will not be an easy one, but I nevertheless think it should somewhat reduce the amount of design and implementation work.

Not quite: a smart backend (LLVM) understands the "meaning" of intrinsics and can freely replace the instructions you would expect with something entirely different. For example, here you will not find a vandps instruction even though the _mm256_and_si256 intrinsic was used.

But in the case of inline assembly we expect that the compiler will not interfere with its contents; e.g., the compiler should never remove a series of nops, even though they do "nothing".


++. Intrinsics are still subject to Godbolt's Law: "the compiler is smarter than you". Inline assembly is for the rare occasion when the compiler is actually not smarter than you.


I'm not totally against this idea, but I'm skeptical that it will actually reduce the amount of design and implementation work, for a few reasons:

  • Inline assembly based on textual substitution can be lowered to LLVM's existing inline assembly feature; you have to translate the template string and constraints to LLVM's syntax, but that's straightforward. Binary inclusion based on fixed input/output registers could also be lowered to LLVM inline assembly, but dynamic registers would require some sort of API to interact with the register allocator, which would require invasive changes to LLVM.

    • And would require even more work to support LTO (where LLVM is invoked by the linker to generate code, but rustc is not currently involved).

    • And would be hard to integrate with potential future backends that already support inline assembly, such as compile-to-C, or using GCC's backend.

  • It's not just a matter of passing a Vec<u8> to be included. For parity with existing inline assembly (in C or the unstable Rust version), you need at least:

    • Relocations against symbols (both local and external)
    • The ability to stick the address of an instruction into an external section using .pushsection/.popsection (e.g. probe does this).

    For arbitrary relocations and sections you probably want a platform-native object file (that's what assemblers output anyway), so you need to be able to parse those in order to splice them into the Rust code. Which requires code for each platform.

    That sounds a lot more complicated than having the backend manage the assembler.


I quite like the machine-code blob idea. Sure, it wouldn't be able to do all of what LLVM inline asm can do, but it would cover a lot of the use cases without requiring nearly as much specification as supporting full asm would.

Yes, I think dynamic register allocation should be out of the question for including blobs. I'd say most cases that require this (for performance) are better served by intrinsics anyway.

It should be trivial to support any backend that already supports inline asm because you can just use the .byte directive to insert the binary blob.
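
As a small demonstration of the `.byte` trick, here is a pre-assembled blob spliced into an asm block via `.byte` (x86-64 assumed; written with the `asm!` syntax Rust later stabilized). The two 0x90 bytes are the machine encoding of two `nop`s, inserted as raw data rather than assembled from mnemonics.

```rust
use std::arch::asm;

// Splice a raw machine-code blob into the instruction stream with
// `.byte` (x86-64 assumed): 0x90 is the encoding of `nop`.
fn with_blob() -> u64 {
    let out: u64;
    unsafe {
        asm!(
            ".byte 0x90, 0x90", // the "blob": two nops as raw bytes
            "mov {0}, 42",
            out(reg) out,
        );
    }
    out
}
```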

Regarding the alignment issues mentioned by other people: it shouldn't be a problem, because the last Rust-generated instruction should leave the current position suitably aligned. Alignment directives for code are only ever needed before the start of a function, which can't be part of the inline asm.


@gbutler pointed out that this isn't true. For example, on ARM-with-THUMB the alignment before machine-code insertion could be align 2, but the inserted code could require align 4, because some right-hand (i.e., @2 mod 4) 16-bit THUMB instructions are conditional on the execution result of the left-hand (i.e., @0 mod 4) 16-bit instruction of the pair.

You can just pad nops into whatever assembly prelude Rust emits to ensure appropriate alignment.
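
That kind of prelude padding is essentially what the `.balign` directive does today. A minimal sketch (x86-64 assumed, using the later-stabilized `asm!` syntax): the assembler emits nop bytes until the next instruction starts on a 4-byte boundary.

```rust
use std::arch::asm;

// `.balign 4` pads the instruction stream with nops so that the
// following instruction lands on a 4-byte boundary (x86-64 assumed).
fn aligned_body() -> u64 {
    let out: u64;
    unsafe {
        asm!(
            ".balign 4", // assembler-inserted nop padding
            "mov {0}, 7",
            out(reg) out,
        );
    }
    out
}
```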