Should we ever stabilize inline assembly?

I'm mostly worried about some ISAs which require instructions to be aligned in certain ways. I don't have concrete examples, but here's something that comes to mind: RISC-V requires word-aligned instructions, and has an extension allowing for "compressed" half-word instructions. Mind you, with this extension enabled, all instructions, even word-wide ones, can be half-word aligned, so RISC-V doesn't have any issues here. But you could imagine a nasty ISA that requires some instructions to be word-aligned and others to merely be half-word-aligned; then you might run into alignment headaches. (If someone can tell me no ISA is crazy enough to do this, please correct me!)

This is pretty close to describing inline assembly as it currently exists, except that you would require the user to choose specific registers for inputs and outputs rather than letting the compiler choose. But that only slightly simplifies the job of register allocation around inline assembly, which still has to be done even with user-specified registers.

Other than that… it's true that not wanting to reimplement assembly parsing was one reason Cranelift didn't want to support inline assembly. But as I've been discussing, using an external assembler is in fact a valid implementation strategy of the existing user-facing design. There are two downsides:

  • The compiler has to be able to either

    • use your strategy: parse the object file the assembler outputs in order to integrate the code into its own representation of the function (including relocations), or
    • use GCC's strategy: output its own code as assembly and just have the assembler assemble both together.

    Currently Cranelift can do neither; AFAIK it only supports directly generating machine code and writing that to an object file (not reading). But I believe that can change.

  • Less than ideally suited for JIT due to the cost of invoking an external process. Not a huge deal for Rust – we don't currently have a JIT, and even if we wanted one, using an external assembler is still possible, just slower. You would only have to use it for functions that actually contain inline assembly.

That's not even necessary though. The inline asm code doesn't have to be inside the function; it can be assembled in a separate object file as an out-of-line blob that the function calls into.

See my previous comment on how this would work.

1 Like

Yes, although the unnecessary jumps have a performance cost, and you still require the code generator to do register allocation around the inline assembly. It could work, but personally I favor a different approach. If we get to the point where inline assembly is designed, implemented, and ready to stabilize, yet Cranelift remains a blocker, then we should add a mode to rustc that supports inline assembly with zero backend support, at the cost of somewhat higher overhead. It would work as follows:

All asm blocks would turn into regular function calls to auto-generated external assembly functions; rustc itself would be responsible for wrapping the user's assembly with a suitable prolog/epilog (including assembly required to accept inputs and return outputs using the C calling convention), calling an external assembler, and linking the resulting object into the final binary. (This does mean doing work for each architecture, but it's not as bad as it sounds: we'd only need to support whatever architectures Cranelift supports.)
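As a rough sketch of what that lowering could look like on the Rust side (every name here is invented for illustration; in reality rustc would generate both halves, nobody would write this by hand):

    // What rustc could emit in place of an asm block: an ordinary call to a
    // stub that it assembles out of line with an external assembler. The stub
    // contains a prolog that moves the C-ABI argument registers into whatever
    // registers the user's template expects, the user's assembly verbatim, and
    // an epilog that moves the outputs into the C-ABI return registers.
    extern "C" {
        fn __asm_stub_0(input: u64) -> u64; // hypothetical generated symbol
    }

    fn wrapper(x: u64) -> u64 {
        // The original asm block is replaced by this call.
        unsafe { __asm_stub_0(x) }
    }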

For asm blocks without goto labels this is relatively straightforward. But even asm goto can be supported with an ugly hack.

(Disclaimer: The following is worded as if we were adopting asm goto as is, but in reality it would probably look more like an if or match statement, as previously discussed. My point here is the implementation.)

Suppose you have an asm block which wants to end by branching to one of label_1, label_2, or label_3. Well, just create dummy labels in the stub function: label_1 causes the stub function to return 1; label_2 makes it return 2; and so on. Then in the caller function, after calling the stub function, use the equivalent of a match statement to branch to one of the "real" labels depending on what the function returned. This is slow, but it implements the required semantics.
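A minimal sketch of that dispatch, again with invented names:

    extern "C" {
        // Generated stub: its assembly ends by branching to one of three dummy
        // labels, which return 1, 2, or 3 respectively.
        fn __asm_goto_stub_0() -> u32;
    }

    fn caller() {
        let target = unsafe { __asm_goto_stub_0() };
        // Translate the stub's return value back into the "real" labels.
        match target {
            1 => { /* code at label_1 */ }
            2 => { /* code at label_2 */ }
            3 => { /* code at label_3 */ }
            _ => unreachable!(),
        }
    }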

4 Likes

ARM has 32-bit opcodes which must be align 4. Starting in ARMv4T there is also a separate execution mode (Thumb) with 16-bit opcodes that must only be align 2, and in later versions (I forget which) the 16-bit mode gained instructions that are two 16-bit values paired up (still only align 2 though, so not quite the same as the 32-bit operations).

Wouldn't that be solvable by the assembler generating aligned code as needed and then including in its output the alignment required for the overall block of opcodes? The compiler would then simply insert an appropriate number of no-op instructions before the beginning of the block to obtain the necessary alignment.
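For what it's worth, the padding computation the compiler would need is trivial, assuming it knows the blob's required alignment and the current offset in the function (a sketch, not anyone's actual implementation):

    /// Number of padding bytes needed to bring `offset` up to `align`
    /// (`align` must be a power of two). The compiler would fill this gap
    /// with nops before emitting the assembled block.
    fn padding_before(offset: usize, align: usize) -> usize {
        debug_assert!(align.is_power_of_two());
        offset.wrapping_neg() & (align - 1)
    }

    fn main() {
        assert_eq!(padding_before(6, 4), 2); // e.g. one 2-byte Thumb nop
        assert_eq!(padding_before(8, 4), 0); // already aligned
    }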

1 Like

That works, at least for the ARM THUMB paired opcodes described by @Lokathor .

Most of the x86 instructions are SIMD-related, and the intrinsics approach has already been adopted for those, so it shouldn't be too hard to add the remaining interesting ones.

Microsoft Visual C++ doesn't support inline assembly on x64, only intrinsics (defined by Microsoft in intrin.h), and I think they compile the Windows NT kernel with MSVC, so intrinsics should be a feasible approach.

I also think that inline assembly is a foundational feature, and that Rust cannot be called a true systems programming language without it.

I like the suggested approach of consuming a raw binary blob which would simply be inlined by the compiler. It allows us to completely sidestep the issue of choosing an assembly syntax. More convenient APIs can then be built on top of it using procedural macros, which could use external assemblers or eventually maybe even a pure-Rust assembler implemented as a proc macro.

Of course we will have to specify an interface for interacting with the register allocator and a way to specify alignment requirements for the inlined binary code. But this feature can be done incrementally, e.g. we can start with explicitly named input and clobber registers and later extend it with dynamic registers chosen by the compiler.

Granted, even in this form the feature will not be an easy one, but nevertheless I think it should somewhat reduce the amount of design and implementation work.

Not quite: a smart backend (LLVM) understands the "meaning" of intrinsics and can freely replace the instructions you would expect with something entirely different. For example, here you will not find a vandps instruction even though the _mm256_and_si256 intrinsic was used.
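For illustration, something along these lines (standard std::arch intrinsics; nothing forces the backend to emit vpand, vandps, or indeed any instruction at all if it can fold the operation away):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__m256i, _mm256_and_si256, _mm256_set1_epi32};

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn mask_lanes(v: __m256i) -> __m256i {
        // LLVM knows the semantics of this intrinsic, so it is free to pick
        // whichever AND instruction suits the surrounding code, or none.
        _mm256_and_si256(v, _mm256_set1_epi32(0x00ff_00ff))
    }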

But in the case of inline assembly we expect the compiler not to interfere with its contents; e.g. the compiler should never remove a series of nops, even though they do "nothing".

10 Likes

++. Intrinsics are still subject to Godbolt's Law: "the compiler is smarter than you". Inline assembly is for the rare occasion when the compiler is actually not smarter than you.

3 Likes

I'm not totally against this idea, but I'm skeptical that it will actually reduce the amount of design and implementation work, for a few reasons:

  • Inline assembly based on textual substitution can be lowered to LLVM's existing inline assembly feature; you have to translate the template string and constraints to LLVM's syntax, but that's straightforward. Binary inclusion based on fixed input/output registers could also be lowered to LLVM inline assembly, but dynamic registers would require some sort of API to interact with the register allocator, which would require invasive changes to LLVM.

    • And would require even more work to support LTO (where LLVM is invoked by the linker to generate code, but rustc is not currently involved).

    • And would be hard to integrate with potential future backends that already support inline assembly, such as compile-to-C, or using GCC's backend.

  • It's not just a matter of passing a Vec<u8> to be included. For parity with existing inline assembly (in C or the unstable Rust version), you need at least:

    • Relocations against symbols (both local and external)
    • The ability to stick the address of an instruction into an external section using .pushsection/.popsection (e.g. probe does this).

    For arbitrary relocations and sections you probably want a platform-native object file (that's what assemblers output anyway), so you need to be able to parse those in order to splice them into the Rust code, which requires code for each platform (see the sketch after this list).

    That sounds a lot more complicated than having the backend manage the assembler.
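To make that concrete, here is roughly the kind of information the compiler would have to recover from the assembler's output. This sketch uses the object crate purely as an illustration of the parsing work involved; it is not a claim about how rustc or Cranelift would actually do it:

    use object::{Object, ObjectSection};

    fn inspect(obj_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
        let file = object::File::parse(obj_bytes)?;
        for section in file.sections() {
            // The machine-code (or data) bytes the assembler produced...
            let name = section.name()?;
            let data = section.data()?;
            println!("{}: {} bytes, align {}", name, data.len(), section.align());
            // ...plus the relocations that must be re-targeted when those
            // bytes are spliced into (or emitted alongside) the Rust code.
            for (offset, reloc) in section.relocations() {
                println!("  reloc at {:#x} -> {:?}", offset, reloc.target());
            }
        }
        Ok(())
    }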

6 Likes

I quite like the machine-code blob idea. Sure, it wouldn't be able to do all of what LLVM inline asm can do, but it would cover a lot of the use cases without requiring nearly as much specification work as supporting full asm would.

Yes, I think dynamic register allocation should be out of the question for including blobs. I'd say most cases that require this (for performance) are better served by intrinsics anyway.

It should be trivial to support any backend that already supports inline asm because you can just use the .byte directive to insert the binary blob.
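For illustration, here is the .byte trick written with the asm! macro Rust eventually stabilized (the function name is made up, and the bytes are the x86-64 encoding of rdtsc; any other blob would work the same way):

    #[cfg(target_arch = "x86_64")]
    fn rdtsc_via_blob() -> u64 {
        use std::arch::asm;
        let lo: u64;
        let hi: u64;
        unsafe {
            // The backend never sees a mnemonic, only raw bytes:
            // 0x0f 0x31 is the encoding of `rdtsc`, which writes eax/edx.
            asm!(".byte 0x0f, 0x31", out("rax") lo, out("rdx") hi);
        }
        (hi << 32) | lo
    }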

Regarding the alignment issues mentioned by other people, it shouldn't be a problem, because the last compiler-generated instruction should leave the current position suitably aligned. Alignment directives for code are only ever needed before the start of a function, which can't be part of the inline asm.

1 Like

@gbutler pointed out that that isn't true. For example, on ARM-with-THUMB the alignment before machine-code insertion could be align 2 but the inserted code could require align 4, because some right-hand (i.e., at offset 2 mod 4) 16-bit THUMB instructions are conditional on the execution result of the left-hand (i.e., at offset 0 mod 4) 16-bit instruction of the pair.

You can just pad with nops in whatever assembly prelude rustc emits to ensure appropriate alignment.

Can you give an example of such instructions? I'm unaware of such a thing. I've written some, but not much, Thumb assembly and I never padded with nops to increase alignment.

Only if you don't have dynamic register allocation. And it's somewhat less trivial when you consider relocations and multiple sections.

Even for simple cases on LLVM, you still need to

  • Convert Rust's register constraints into LLVM register constraints.
  • Somehow communicate, from LLVM to Rust, which registers were allocated.

1 Like

I'm somewhat surprised by the claims that a systems programming language requires inline asm support, when neither C nor C++ supports inline asm in their standards. Currently, Rust doesn't have a standard, and whatever is stabilized is effectively standardized due to the stability RFC. If inline asm were proposed as part of a new, optional extension mechanism, that would at least be a direct comparison with C/C++. The most accurate phrasing of what people want when they make this claim seems to be that systems programming languages support implementation-specific extensions and Rust seemingly doesn't, at least not on stable; that's what they actually want. Because as it stands, the argument that a systems programming language requires inline asm has no support, given the current standards.

3 Likes

I both strongly agree and strongly disagree. :slight_smile:

In this case, the biggest relevant difference is that Rust has a very strong policy of keeping stable and unstable features clearly separate so that nothing ever becomes "de facto stable." In contrast, C++ had multiple competing implementations for years before there even was a standard, so for a long time "de facto standard" was the only kind of standard there was. In practice, a Rust nightly-only feature feels like a much lower status than a "de facto standard" C++ feature; it's probably reasonable to say the former can break at any time but the latter really can't (if only because a major C++ compiler dropping something like inline asm would be utterly suicidal).

But as usual, we can make our comparisons apples-to-apples simply by being a lot more specific:

  • Do the most heavily used, battle-tested, well-maintained implementation(s) of the language with a strong support system and backwards compatibility guarantees provide inline assembly? For C++, yes. For Rust, no.
  • Does the formal specification of the abstract machine that defines the language provide an inline assembly feature, optional or otherwise? For C++, no. For Rust, no (and not just because Rust has no formal spec).

When people say things like "inline asm is a foundational feature of a systems language", I assume they're after something closer to the first bullet point. In particular, I think the kind of claim they're really getting at is that:

Rust should be an ideal language for implementing operating systems, device drivers, and other kinds of software that cannot be written without inline assembly. But because stable Rust lacks inline assembly those entire application domains are closed off to it. That is a serious problem, and addressing it should be one of our highest priorities.

And this claim still seems plausible to me, even after reading the >100 posts in this thread. Of course, I am putting words in other people's mouths right now, which is always a bit risky, but hopefully this helps break down the communication impasse I'm sensing here.

15 Likes

This is incorrect. ISO C++ specifies an asm(...); statement, with implementation-defined semantics for the tokens inside. The ISO committee hasn't specified the assembly itself, but it is recognized as an important feature of the language. Inline assembly isn't required, per se, but parsers are required to acknowledge it. See C++17 §10.4.

My personal metric is, "the only .S file I should need to write is the code that my bootloader jumps to, which immediately calls into a Rust function." The stability guarantees you quote are, I think, a corollary of that.

4 Likes