[Pre-RFC #2]: Inline assembly

I've spent the past week preparing an RFC for inline assembly and I would like to obtain some feedback from the community, especially regarding the unresolved questions at the end.

The text is based on @Florob's pre-RFC posted in an earlier thread, but heavily modified.

The latest version of this RFC can be found here.

41 Likes

I'm really liking the feel of this so far. I need to think through all the ramifications, but, it seems like a really good and thorough treatment that allows for future extension and doesn't make Rust too dependent on specific back-ends or architectures.

3 Likes

Looks great.

A few small bikeshed things:

  • Are there concrete use cases for reg_abcd, vreg_low, and vreg_low8?
  • lateout and especially inlateout are kind of ugly. I'd suggest out_late, but I'm not sure what to do with inlateout without making it less self-explanatory.
  • The name flags is confusing since one might think it refers to the flags register. Maybe settings? Or something like asm!("asdf", pure = true)?
2 Likes

One thing that might be nice is to list the specific things that can be done with inline assembly in GCC, D, VSC, etc. that you won't currently be able to do with what this RFC proposes. I see some things along those lines mentioned, but are those the only ones? It might be good for this RFC to explicitly state which features those implementations support that this doesn't, AND to state whether the RFC is deciding, here and now, whether or not those features should ever be supported. That might be asking too much, though; I'm not an expert in this area.

This looks great. Thank you so much for working on this!

A few requests:

  • While I do like using similar syntax for input/output registers that don't get directly referenced in the format string, it seems error-prone if we can't detect a mismatch between the input/output values and the values used in the format string. I would suggest some explicit indication that an input or output is implicit, and then an error if an asm! doesn't reference every non-implicit input/output in the format string (see the sketch after this list).
  • You have i8/i16/i32/i64 in many places; in every case, those should allow either signed or unsigned types. (I didn't see a note anywhere saying that those represented both signed and unsigned of the given size.)
  • Anywhere that allows v128 should also allow i128 or u128.
  • Please mention that we may wish to provide a standardized way of switching binary sections without relying on backend .section directives. That can be a future extension, but it seems worth mentioning.
  • When you mention that this uses Intel syntax by default, you should mention that we could easily implement an asm_att! or similar that does the opposite, for convenience of copy-pasting. And you should also mention, in the alternatives section, that we could choose to do the reverse: use AT&T syntax and provide an asm_intel! for Intel syntax. Both approaches have tradeoffs; for instance, some folks who work on kernels favor AT&T syntax to help reduce differences between architectures, while some folks who work primarily on Intel assembly prefer the Intel syntax. We should mention those tradeoffs explicitly.
  • For asm goto, I would propose a different syntax for the most common case of that, which integrates into an if/else statement rather than providing arbitrary multi-way goto. That would cover a large number of uses of asm goto I've seen in the wild, while remaining relatively structured and straightforward for the compiler and other tools to analyze.
  • Yes, imm should allow floating-point immediates.
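
A rough sketch of the first bullet's concern (x86_64; the operand and option spellings are assumed, roughly as proposed): the in("ecx") operand is consumed implicitly by the shift and never appears in the template, while {out} and {val} must appear. Nothing in the source marks the implicit use as deliberate rather than forgotten:

    unsafe fn shift_left(value: u64, amount: u32) -> u64 {
        let result: u64;
        core::arch::asm!(
            "mov {out}, {val}",
            "shl {out}, cl",        // the count comes from the implicit ecx input
            out = out(reg) result,
            val = in(reg) value,
            in("ecx") amount,
            options(nomem, nostack),
        );
        result
    }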
7 Likes

I am hesitant to add additional syntax for this since clobber specifications are already getting quite long. However, I think a simple solution would be to lint against any unused operands that are specified as register classes. Unused operands specified as explicit registers are silently allowed.

I thought that was obvious, but I'll add a note about it.

Vector types and large integers are treated very differently by the register allocator (one goes in a vector register, the other goes into a pair of general purpose registers). LLVM and GCC are very picky as to what types they accept for various constraint codes, so this wouldn't work.

We already have this in the form of #[link_section]. You only need to use .section if the data you are encoding needs to refer to a label inside the asm itself. I don't really see how this can be done outside of the asm string, and I would really rather not have to perform any parsing of the asm string within rustc itself.
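
For reference, a minimal sketch of that attribute (the section name here is just a placeholder):

    // Place a static in a named linker section without any .section
    // directive inside an asm string.
    #[link_section = ".mydata"]
    #[no_mangle]
    static MAGIC: u32 = 0xDEAD_BEEF;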

Actually the assembly languages of most architectures (at least ARM and RISC-V that I know of) are much closer to Intel syntax than AT&T. The only reason anyone is still using AT&T syntax is because GCC doesn't support inline assembly with Intel syntax. This is already somewhat covered by the "Unfamiliarity" drawback, but I can make it more explicit.
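
Purely as illustration (x86_64; the att_syntax option spelling is assumed and not part of this proposal), the same move written in both syntaxes:

    fn demo() {
        unsafe {
            // Intel syntax (the proposed default): destination first.
            core::arch::asm!("mov {0}, {1}", out(reg) _, in(reg) 5u64);
            // AT&T syntax: source first, %-prefixed registers, size suffixes.
            core::arch::asm!("movq {1}, {0}", out(reg) _, in(reg) 5u64, options(att_syntax));
        }
    }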

So this is a bit of a tricky case: Rust doesn't really have (C-style) labels that we can pass into the asm. Integrating into an if / else also doesn't work since that would require the asm to produce an intermediate bool result, which defeats the point. The only real way to support asm goto is what I proposed: you need to pass the code to be executed for each label directly to the asm! macro. If I misunderstood your proposal, please provide some example code to clarify what you are suggesting.

Note that this means we will be inserting the values directly into the asm string, rather than using LLVM's "i" constraint. It's probably better this way actually so let's just do that.

4 Likes

I've updated the pre-RFC based on @josh's feedback (see the edit history for changes).

3 Likes

This is a huge RFC, though it seems most of my comments have already been brought up by other folks. My main concern is making sure that it is absolutely painless to specify additional supported architectures. I work almost entirely in RISC-V, and I think that it's very important that it be easy for me to add support (even though I think that RISC-V, as an ISA, is sufficiently simple (and on-brand for Rust) that we should just support it from the beginning...). This goes for other users of less-mainstream ISAs.


Also, I think it's important to actually spell out, roughly, what sorts of things are UB inside of an asm! statement. Compilers tend to be pretty bad at this IME.

There's the obvious "don't scribble over registers you didn't say you were scribbling over", but there are a few questions (varying from "no, that's obviously stupid" to "I do this when I write inline assembly in C and have no idea if it's UB", in no particular order):

  • Can I do a far jump that never returns? (If so, it might be nice to have a mechanism to tell the Rust compiler that the asm! block should type as !.)
  • If so, can I scribble whatever I want in registers like the stack pointer and then never return? (E.g., I want to write the OS code that executes before a thread starts in pure Rust.)
  • Can I pretend to be a function call and grow the stack (making sure to shrink it before exiting the asm block)?
  • Can I do really rude things like ret?
  • Can I raise a hardware exception or similar that would really mess up the Rust implementation's unwinding? (You merely specify that you cannot begin unwinding from inside inline assembly; you might want to strengthen this).
  • Can I do my own save-and-restore of registers I haven't told the compiler I'm touching?
  • Can I read registers I didn't say I was going to read?

These aren't quite UB questions but are other things you don't specify:

  • Can I put an uninitialized let in an in parameter? More generally, could I write
let my_ptr: *const usize = ...;
let mut allegedly_frozen: usize;
asm!("mv {}, 0({})", out(reg) allegedly_frozen, in(reg) my_ptr);
  • Should I expect empty template strings to still force save/restores, and generally act as an optimization barrier? If not, I think we should be explicit that asm should not be used in this way, and maybe have a second discussion about providing an intrinsic. Here's one of many examples of this in Chromium. I.e.,
let x = ...;
asm!("", inout(reg) x);
  • Can I actively rely on the fact that the compiler will never peek into my assembly and try to optimize it because it thinks it's smart? This is especially important for constant-time cryptography, which tends to need to play chicken with the compiler.

As a final note, I think we should consider adding a shorthand for in(reg) and friends. 99% of register constraints are "any register", so I think such a shorthand would make constraints more readable. (I'm also not a fan of the juxtaposed constraint expr syntax, and it would be nice to have some punctuation separating them.)

5 Likes

Another nitpick (emphasis added):

The part I bolded seems like it should be based on the presence of nomem rather than being baked into pure. If you have pure but not nomem, the assembly should be a pure function of the input values together with what they (transitively) point to.

I think this would be consistent with the LLVM semantics we're translating to – i.e. it's legal to read from memory even if you don't pass sideeffect, as long as you don't pass readnone.
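
For concreteness, a rough sketch (x86_64; the exact option spellings are assumed) of what "pure without nomem" would permit — the block reads memory through its input, so it can only be deduplicated or hoisted while that memory is known to be unchanged:

    unsafe fn load_via_asm(p: *const u64) -> u64 {
        let value: u64;
        core::arch::asm!(
            "mov {out}, qword ptr [{ptr}]",
            ptr = in(reg) p,
            out = lateout(reg) value,
            options(pure, readonly, nostack),
        );
        value
    }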

1 Like

An unused operand specified as a register class can't possibly work, as you can't know what register you got without substituting it in; I think that should give a hard error, not a lint.

But I'd still like to catch possible mistakes caused by specifying an exact register and then forgetting to use that argument at all. I would like to distinguish that case somehow. I agree that the common case for "use this exact register" will be to use that register implicitly, and we shouldn't complicate that case. But is there some other way that we can make this error less likely?

LLVM doesn't already support putting a u128 into an SSE register and doing math on it using SSE? That seems quite unfortunate.

In any case, it seems extremely surprising to me if you can't provide a u128 input value for a 128-bit register, or get a 128-bit register into a 128-bit output.

On a different note, for ABIs that do commonly operate on register pairs, we need a good way to handle those. For instance, on 32-bit x86 ("i386"), it's common to operate on edx:eax as a register pair, for operations such as multiplication, division, or rdtsc. And on 64-bit x86, some instructions operate on rdx:rax as a pair for 128-bit operations. We need to have a way to specify those as register constraints; for instance, out("edx:eax") value should work with a u64 value.

I don't mean an intermediate bool. I'm still proposing that this would pass a label into the assembly. I'm just imagining something more structured, like this:

if_asm!("various assembly; je {else}") {
    // if body
} else {
    // else body
}

This would pass in an {else} label implicitly, and give an error if the assembly didn't reference {else}; finishing the assembly block without jumping to {else} would enter the if body. That would cover, for instance, every single use of asm goto in the Linux kernel.

A few additional thoughts on the RFC:

I don't think this should be limited to "defined in the current crate"; it should be acceptable to reference any symbol visible to the current crate. For instance, you should be able to pass a pointer to a function defined in another crate.

You might say "called as a function, potentially with a non-standard ABI".

Also, I can imagine other ways to implement this, such as running the external assembler and then inlining the resulting instructions; the only thing the backend would have to support is "make sure this value ends up in this register at this point". An inefficient implementation could simply move the value into that register right before the asm, rather than optimizing to make sure it's already there.

Proposed answers:

Yes, but if so, any memory that was actively mutably borrowed before you jumped will be left in an undefined state. For example, in this code:

fn foo(x: &mut u32) {
    *x = 42;
    bar();
}

the optimizer should be allowed to move the write after the call, but bar() could include an asm block that jumps into oblivion.

Also, you'd be skipping any destructors for objects which may have been on the stack. That's okay if you never reuse the stack. If you do reuse the stack, it's not automatically UB, but it's unsafe, i.e. you're responsible for ensuring there wasn't anything on the stack that depended on destructors being run for correct behavior.

If you want to write code that executes without already having a stack pointer, you need either global_asm! or a #[naked] function. #[naked] functions arguably should be removed from the language, but if not, they should be treated as basically syntax sugar for global_asm!. A naked function must consist of a single asm block, which is just plopped into the output file: there's no interaction with code generation or register allocation, no possibility of inlining. It really has completely different semantics from normal functions and normal asm blocks.
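
To illustrate (x86_64, System V ABI; the symbol name and exact macro spelling are assumed), the global_asm! shape such a function boils down to:

    core::arch::global_asm!(
        r#"
        .global asm_identity
        asm_identity:
            mov rax, rdi
            ret
        "#
    );

    extern "C" {
        fn asm_identity(x: u64) -> u64;
    }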

Not sure, but note that this also affects whether you can perform an actual function call.

No. You have no idea where the compiler has stashed the return address.

What do you mean by "mess up unwinding"? The consequences of raising a hardware exception depend on the exception handler. For example, under a typical non-embedded operating system kernel, userland code raises hardware exceptions all the time, when it accesses pages that haven't been faulted in yet. But once the kernel has loaded the page into memory and added it to the page table, it restores all registers and resumes execution at the instruction that produced the exception, making the whole scheme invisible to userland. In an embedded context... I'd say it really depends on what you're doing with the exceptions.

Yes.

Yes, but doing so is pointless since the compiler might be storing any value whatsoever in them. The stack pointer register might be an exception (see previous question about calls).

No, it should be a compile-time error.

Is the question here whether asm implicitly freezes its outputs? I'd say it does, since I can't think of any optimization that could make the output act 'weird' while still treating the assembly as a black box (which it should).

It should make the output a 'black box' to the optimizer, but whether that involves saving or restoring anything depends on the constraints you specified...

The compiler should treat the assembly as a black box. However, the compiler is allowed to perform crazy transformations like

let a = b * c;

->

let a = if c == 42 {
    b * 42
} else {
    b * c
};

which can make seemingly constant-time operations variable time, even if the compiler doesn't know anything about the inputs a priori.

So there is no way to do guaranteed constant-time cryptography unless your entire algorithm is in a single asm block. Even then, the compiler is free to take the outputs of the algorithm and leak those through timing. Yes, this sucks.

3 Likes

I don't think this is necessary. You can just specify two output variables and combine them afterwards with ((hi as u64) << 32) | (lo as u64). I believe the compiler is smart enough to produce optimal code in this situation.

I'd like the ability to do that without UB, yes. I'd also like the ability to possibly do a jump that never returns. I don't think either of those would cause horrible problems any more than exec or _exit or a fatal fault would.

EDIT: see comex's answer regarding mutable borrows.

As long as you never fall out the end of the asm block, I don't see how that would cause UB.

Now that has really interesting implications. We may at some point want to support inputs or outputs that would end up in stack locations (e.g. if a local value currently lives on the stack and not a register). If the compiler isn't using stack frames, it might access a value using an offset from the stack pointer, so changing the stack pointer would break the stack-pointer-relative location the compiler gives you.

For that matter, if you change the stack pointer in any way, it'd be polite to include the appropriate debug information so that debuggers can figure out where locals live at all times (even if you don't reference them).

All that said, I think you should be able to adjust the stack pointer if you restore it afterwards, but that would take some care to allow.

No, definitely not.

Possibly, if you handle the trap. Consider code that (for instance) runs rdmsr, and code elsewhere that has set up a GPF handler that knows if that particular instruction faults to jump to a recovery label in the same block.

I also think you can fault if you never return.

You can't raise a fault that other code will handle by unwinding your stack, but in that case, the fault lies with the code unwinding the stack.

Yes, that seems reasonable. You could push rax, use rax as a scratch register, and then pop rax.

That said, I think the best way to handle that would be to tell the asm! block you need a scratch register of a given type, and then the compiler might give you a register it already had free so you don't have to save and restore.
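
A rough sketch of that (x86_64; assuming the discard syntax out(reg) _ is how you request a scratch register):

    unsafe fn double_in_place(x: &mut u64) {
        core::arch::asm!(
            "mov {tmp}, {val}",
            "shl {tmp}, 1",
            "mov {val}, {tmp}",
            tmp = out(reg) _,       // scratch picked by the register allocator
            val = inout(reg) *x,
            options(nostack),
        );
    }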

The value of a register will always be a valid bit pattern, so there's no undefined behavior there, but you should not ever depend on the value. I can only think of two valid use cases: saving and restoring the register value, or debugging code that just captures registers and prints them.

Not in an in parameter, but it seems reasonable to put it in an out parameter.

We should have an explicit note that reading from an out or lateout register before you write it will produce an unspecified but valid bit pattern.

I think we should have a real "compiler barrier" operation rather than encouraging people to use the equivalent of GCC's __asm__ __volatile__ ("" : : : "memory"). But I do think we should make that work, even if we provide a better alternative.
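
For example, the GCC idiom translated into this proposal's shape (a rough sketch; whether this should remain the recommended spelling is exactly the question above):

    // Empty template; with the default "may read/write memory" assumption,
    // this acts as a pure compiler-level barrier and emits no instructions.
    #[inline(always)]
    fn compiler_barrier() {
        unsafe {
            core::arch::asm!("", options(nostack, preserves_flags));
        }
    }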

By default, yes. I think we might in the future want to offer a mechanism to explicitly label an asm! block as permitting peephole optimizations and similar, but by default the compiler should assume that it must emit exactly the assembly specified.

2 Likes

That seems reasonable (if verbose), but if that's the recommended approach, then we should document that (with examples using mul and div for instance).
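
For instance (x86_64 sketch): rdtsc returns its result in the edx:eax pair, read here as two 32-bit outputs and recombined:

    fn rdtsc() -> u64 {
        let lo: u32;
        let hi: u32;
        unsafe {
            core::arch::asm!(
                "rdtsc",
                out("eax") lo,
                out("edx") hi,
                options(nomem, nostack),
            );
        }
        ((hi as u64) << 32) | (lo as u64)
    }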

1 Like

Is this even meaningful, and if so, useful? I imagine that if you trash your current stack, and plan to diverge, the old stack is ipso facto gone. In other words, every inline assembly statement is of one of two types:

  • It never returns for any inputs, and as such can do anything to machine state: the current Rust thread isn't really a Rust thread anymore. It could jump back into Rust, but it would be more-or-less like a brand new thread with a brand new stack.
  • It may return, in which case upon control exiting the assembly, no registers may be out of place, so scribbling all over the stack pointer is just the same as spilling any other register and scribbling all over it.

In the context I'm thinking of, I'm going to land in a crt0.S anyway. Also, maybe this case wasn't clear: the only situation in which I expect to scribble over the stack pointer is to abandon the old stack and set up a new one.

By that token, do we even want to say something like "doodling garbage all over the stack pointer, then jumping into a C-ABI function out in the rhubarb, is UB"? I feel that after a certain point, I don't think declaring this to be undefined behavior means anything: you've jumped out of Rust, beyond any meaningful proposition of "well-formed program".

I guess I was thinking of doing something stupid with link registers, maybe this question was dumb.

Honestly, I have no idea, I know very little about unwinding because every system I have ever touched meaningfully doesn't have it. I'm just trying to cover the entire attack surface.

Trust me, I know; I used to sit near the guy who does the constant-time stuff in BoringSSL. =P I mostly thought of the asm!("") example in the context of this. My main point here was to poke at the degree to which it is meaningful or useful to promise that the compiler will not touch, peek into, or otherwise perform optimizations with knowledge about what your asm is doing. That said, specifying optimizations is usually a great way to make your language complicated (cough copy elision/RVO cough), and making meaningful promises about that is hard.

Right, this is less a question about scratch registers and register allocation and more of a question of "can I "spill" the stack register onto not-the-stack". You'll notice a lot of these questions are mostly "I want to do stupid things to the stack but I want to know that everything is ok if Rust never notices."

Ah, but what is that (like, what does that mean, formally)? I think that's where a lot of the mystery lies in wanting a "cfence".

I think the sadness here is dynamic linking. I honestly don't know that much about dynamic linking other than the one paper I read, but I think you can't just do a string-replace to set up a GOT lookup in your assembly language? Maybe I'm overthinking this. I otherwise agree that this distinction isn't the best.

1 Like

Is it UB to set up an out register, not write to it, then read it on the Rust side? Is it valid garbage value, or is it uninitialized? Rustc can't possibly know, at any rate.

I don't think that should be UB, as long as the type you pass as the out register allows any possible bit pattern. If we were to allow an out to refer to (for instance) a repr(u8) enum, then it'd be UB to supply a value that isn't a valid enum value.

Oh, sure. I think that's kind of implied... I think what I'm getting at is "under what circumstances do values that inline assembly interacts with cease to be uninitialized."

Actually this definition is based on how GCC interprets the volatile keyword (which is basically the inverse of pure). See this example where the asm code is executed only once despite it taking the address of a global as input, having a memory clobber and modifying that global on every loop iteration.

Sure, I was just considering the worst case scenario where the backend has exactly zero support for any form of inline asm. This is currently the case for Cranelift, but it seems that its main use case at the moment is to compile faster debug builds, so performance of inline asm isn't really critical here.

Unless you're on a "fun" architecture like Itanium where a register can contain NaT (not-a-thing) that faults if you try to write it to memory. If you want to use asm! to freeze some undefined bytes, use memory instead of a register.

Just because you don't return doesn't mean you can trash any objects that are currently on the stack. There could be another thread holding a reference to one of those objects (rayon does this). But as long as you only grow the stack and don't touch any parent frames, you can indeed do whatever you want as long as you never return (and never exit the thread in a way that would free the stack without unwinding it).

This is fine, in fact I use this (AArch64) asm code to switch to another stack:

    // Switch to the new stack and execute the given function on it
    asm!(
        r#"
            mov sp, ${0}
            mov x29, #0
            mov x30, #0
            br x0
        "#
        :
        : "r" (initial_sp),
          "{x0}" (func)
        :
        : "volatile"
    );
    hint::unreachable_unchecked();

The old stack is left untouched (and never freed), and the asm! never returns.

Exactly! If you are compiling a shared library, there is no way to access a symbol outside the current shared library except through the GOT. Also GCC/LLVM get rather unhappy if you try to pass an external symbol into an asm block.

1 Like

Fine, but in the single-threaded boot loader I'm working on, I don't have other threads (to say nothing of the OS infrastructure for Rayon...), so this should be safe. I think we need to have a discussion as to when it is acceptable to trash your stack, unfortunately. =(

Well, maybe you could get away with having the surrounding assembly generate the code to poke the GOT and then replace the sym with an imm? Also, I know some assemblers (I believe RISC-V gas will do this) silently emit GOT offsets. For example, in PIC mode,

la t0, my_got_symbol

will actually emit instructions that fuss with gp (la isn't a real RISC-V instruction; it's a spec-defined macro for auipc/addi), which would really hate receiving an immediate instead of a symbol. (I hate dynamic linking.)

1 Like