[Pre-RFC #2]: Inline assembly

I've spent the past week preparing an RFC for inline assembly and I would like to obtain some feedback from the community, especially regarding the unresolved questions at the end.

The text is based on @Florob's pre-RFC posted in an earlier thread, but heavily modified.


Summary

This RFC specifies a new syntax for inline assembly which is suitable for eventual stabilization.

The initial implementation of this feature will focus on the ARM, x86 and RISC-V architectures. Support for more architectures will be added based on user demand.

The existing asm! macro will be renamed to llvm_asm! to provide an easy way to maintain backwards-compatibility with existing code using inline asm. However llvm_asm! is not intended to ever be stabilized.


Motivation

In systems programming some tasks require dropping down to the assembly level. The primary reasons are for performance, precise timing, and low level hardware access. Using inline assembly for this is sometimes convenient, and sometimes necessary to avoid function call overhead.

The inline assembler syntax currently available in nightly Rust is very ad-hoc. It provides a thin wrapper over the inline assembly syntax available in LLVM IR. For stabilization a more user-friendly syntax that lends itself to implementation across various backends is preferable.

Guide-level explanation

Rust provides support for inline assembly via the asm! macro. It can be used to embed handwritten assembly in the assembly output generated by the compiler. Generally this should not be necessary, but might be where the required performance or timing cannot be otherwise achieved. Accessing low level hardware primitives, e.g. in kernel code, may also demand this functionality.

Note: the examples here are given in x86/x86-64 assembly, but ARM, AArch64 and RISC-V are also supported.

Basic usage

Let us start with the simplest possible example:

unsafe {
    asm!("nop");
}

This will insert a NOP (no operation) instruction into the assembly generated by the compiler. Note that all asm! invocations have to be inside an unsafe block, as they could insert arbitrary instructions and break various invariants. The instructions to be inserted are listed in the first argument of the asm! macro as a string literal.

Inputs and outputs

Now inserting an instruction that does nothing is rather boring. Let us do something that actually acts on data:

let x: u32;
unsafe {
    asm!("mov {}, 5", out(reg) x);
}
assert_eq!(x, 5);

This will write the value 5 into the u32 variable x. You can see that the string literal we use to specify instructions is actually a template string. It is governed by the same rules as Rust format strings. The arguments that are inserted into the template, however, look a bit different than you may be familiar with. First, we need to specify if the variable is an input or an output of the inline assembly. In this case it is an output. We declared this by writing out. We also need to specify in what kind of register the assembly expects the variable. In this case we put it in an arbitrary general purpose register by specifying reg. The compiler will choose an appropriate register to insert into the template and will read the variable from there after the inline assembly finishes executing.

Let us see another example that also uses an input:

let i: u32 = 3;
let o: u32;
unsafe {
    asm!("
        mov {0}, {1}
        add {0}, {number}
    ", out(reg) o, in(reg) i, number = imm 5);
}
// The result is written to o; i is only read.
assert_eq!(o, 8);

This will add 5 to the input in variable i and write the result to variable o. The particular way this assembly does this is first copying the value from i to the output, and then adding 5 to it.

The example shows a few things:

First, we can see that inputs are declared by writing in instead of out.

Second, one of our operands has a type we haven't seen yet, imm. This tells the compiler to expand this argument to an immediate inside the assembly template. This is only possible for constants and literals.

Third, we can see that we can specify an argument number or name, as in any format string. For inline assembly templates this is particularly useful as arguments are often used more than once. For more complex inline assembly using this facility is generally recommended, as it improves readability and allows reordering instructions without changing the argument order.

We can further refine the above example to avoid the mov instruction:

let mut x: u32 = 3;
unsafe {
    asm!("add {0}, {number}", inout(reg) x, number = imm 5);
}
assert_eq!(x, 8);

We can see that inout is used to specify an argument that is both input and output. This is different from specifying an input and output separately in that it is guaranteed to assign both to the same register.

Late output operands

The Rust compiler is conservative with its allocation of operands. It is assumed that an out can be written at any time, and can therefore not share its location with any other argument. However, to guarantee optimal performance it is important to use as few registers as possible, so they won't have to be saved and reloaded around the inline assembly block. To achieve this Rust provides a lateout specifier. This can be used on any output that is guaranteed to be written only after all inputs have been consumed. There is also an inlateout variant of this specifier.

Here is an example where inlateout cannot be used:

let mut a = 4;
let b = 4;
let c = 4;
unsafe {
    asm!("
        add {0}, {1}
        add {0}, {2}
    ", inout(reg) a, in(reg) b, in(reg) c);
}
assert_eq!(a, 12);

Here the compiler is free to allocate the same register for inputs b and c since it knows they have the same value. However it must allocate a separate register for a since it uses inout and not inlateout.

However the following example can use inlateout since the output is only modified after all input registers have been read:

let mut a = 4;
let b = 4;
unsafe {
    asm!("add {0}, {1}", inlateout(reg) a, in(reg) b);
}
assert_eq!(a, 8);

As you can see, this assembly fragment will still work correctly if a and b are assigned to the same register.
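To make the difference concrete, here is a minimal sketch that models the two examples with a toy register file. It is plain Rust rather than real inline assembly, and `run_adds` together with the register indices are hypothetical names chosen only for illustration.

```rust
/// Toy model: each instruction is `add regs[dst], regs[src]`.
/// `run_adds` is a hypothetical helper, not part of the proposed asm! syntax.
fn run_adds(mut regs: [u32; 3], program: &[(usize, usize)]) -> [u32; 3] {
    for &(dst, src) in program {
        regs[dst] = regs[dst].wrapping_add(regs[src]);
    }
    regs
}

/// First example: a, b and c each get their own register (r0, r1, r2).
fn separate_registers() -> u32 {
    // add r0, r1 ; add r0, r2
    run_adds([4, 4, 4], &[(0, 1), (0, 2)])[0]
}

/// Same two adds, but `a` shares r0 with `c`, which is what using
/// `inlateout` here would allow the compiler to do.
fn shared_register_two_adds() -> u32 {
    // add r0, r1 ; add r0, r0  (c is read after a was already overwritten)
    run_adds([4, 4, 0], &[(0, 1), (0, 0)])[0]
}

/// Second example: a single add stays correct when a and b share r0.
fn shared_register_one_add() -> u32 {
    // add r0, r0
    run_adds([4, 0, 0], &[(0, 0)])[0]
}
```

In this model the two-add sequence yields 12 with separate registers but 16 once a and c share a register, while the single add yields 8 either way.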

Explicit register operands

Some instructions require that the operands be in a specific register. Therefore, Rust inline assembly provides some more specific constraint specifiers. While reg is generally available on any architecture, these are highly architecture specific. E.g. for x86 the general purpose registers eax, ebx, ecx, edx, ebp, esi, and edi among others can be addressed by their name.

unsafe {
    asm!("out 0x64, {}", in("eax") cmd);
}

In this example we call the out instruction to output the content of the cmd variable to port 0x64. Since the out instruction only accepts eax (and its sub registers) as operand we had to use the eax constraint specifier.

It is somewhat common that instructions have operands that are not explicitly listed in the assembly (template). Hence, unlike in regular formatting macros, we support excess arguments:

fn mul(a: u32, b: u32) -> u64 {
    let lo: u32;
    let hi: u32;

    unsafe {
        asm!(
            "mul {}",
            in(reg) a, in("eax") b,
            lateout("eax") lo, lateout("edx") hi
        );
    }

    // Parenthesized because `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}

This uses the mul instruction to multiply two 32-bit inputs with a 64-bit result. The only explicit operand is a register, that we fill from the variable a. The second implicit operand is the eax register which we fill from the variable b. The lower 32 bits of the result are stored in eax from which we fill the variable lo. The higher 32 bits are stored in edx from which we fill the variable hi.

Note that lateout must be used for eax here since we are specifying the same register as both an input and an output.
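As a portable cross-check of what the mul example computes, the following sketch derives the same two halves with ordinary 64-bit arithmetic. The function names `mul_halves` and `combine` are made up for this illustration and are not part of the proposal.

```rust
/// Portable reference for the `mul` example: the full 64-bit product of
/// two 32-bit values, split into the halves that the x86 instruction
/// leaves in edx (high) and eax (low).
fn mul_halves(a: u32, b: u32) -> (u32, u32) {
    let wide = a as u64 * b as u64;
    ((wide >> 32) as u32, wide as u32) // (hi, lo)
}

/// Recombine the halves. The parentheses matter: in Rust `+` binds
/// tighter than `<<`, so `hi << 32 + lo` would parse as `hi << (32 + lo)`.
fn combine(hi: u32, lo: u32) -> u64 {
    ((hi as u64) << 32) + lo as u64
}
```

For instance, 0xFFFF_FFFF times 2 gives hi = 1 and lo = 0xFFFF_FFFE, which recombine to 0x1_FFFF_FFFE.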

Clobbered registers

In many cases inline assembly will modify state that is not needed as an output. Usually this is either because we have to use a scratch register in the assembly, or instructions modify state that we don't need to further examine. This state is generally referred to as being "clobbered". We need to tell the compiler about this since it may need to save and restore this state around the inline assembly block.

let ebx: u32;
let ecx: u32;

unsafe {
    asm!(
        "cpuid",
        in("eax") 4, in("ecx") 0,
        lateout("ebx") ebx, lateout("ecx") ecx,
        lateout("eax") _, lateout("edx") _
    );
}

println!(
    "L1 Cache: {}",
    ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1) * ((ebx & 0xfff) + 1) * (ecx + 1)
);

In the example above we use the cpuid instruction to get the L1 cache size. This instruction writes to eax, ebx, ecx, and edx, but for the cache size we only care about the contents of ebx and ecx.

However we still need to tell the compiler that eax and edx have been modified so that it can save any values that were in these registers before the asm. This is done by declaring these as outputs but with _ instead of a variable name, which indicates that the output value is to be discarded.

This can also be used with a general register class (e.g. reg) to obtain a scratch register for use inside the asm code.
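The size arithmetic in the cpuid example can be spelled out separately from the assembly. This is a sketch under the assumption that ebx and ecx follow the field layout documented for cpuid leaf 4 (each field stored as the value minus one); `cache_size_bytes` is a hypothetical helper name.

```rust
/// Decode the cache size in bytes from the ebx and ecx values returned
/// by cpuid leaf 4. Each field is stored as "value minus one".
fn cache_size_bytes(ebx: u32, ecx: u32) -> u64 {
    let ways = ((ebx >> 22) + 1) as u64; // bits 31..22
    let partitions = (((ebx >> 12) & 0x3ff) + 1) as u64; // bits 21..12
    let line_size = ((ebx & 0xfff) + 1) as u64; // bits 11..0
    let sets = ecx as u64 + 1; // widened first so ecx = u32::MAX cannot overflow
    ways * partitions * line_size * sets
}
```

With made-up register values describing 8 ways, 1 partition, 64-byte lines and 64 sets, this yields 32768 bytes.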


Flags

By default, an inline assembly block is treated the same way as an external FFI function call with a custom calling convention: it may read/write memory, have observable side effects, etc. However, in many cases it is desirable to give the compiler more information about what the assembly code is actually doing so that it can optimize better.

Let's take our previous example of an add instruction:

let mut a = 4;
let b = 4;
unsafe {
    asm!(
        "add {0}, {1}",
        inlateout(reg) a, in(reg) b,
        flags(pure, nomem)
    );
}
assert_eq!(a, 8);

Flags can be provided as an optional final argument to the asm! macro. We specified two flags here:

  • pure means that the asm code has no observable side effects and that its output depends only on its inputs.
  • nomem means that the asm code does not read or write to memory.

These allow the compiler to better optimize code using asm!, for example by eliminating pure asm! blocks whose outputs are not needed.

See the reference for the full list of available flags and their effects.

Reference-level explanation

Inline assembler is implemented as an unsafe macro asm!(). The first argument to this macro is a template string literal used to build the final assembly. The following arguments specify input and output operands. When required, flags are specified as the final argument.

The following ABNF specifies the general syntax:

dir_spec := "in" / "out" / "lateout" / "inout" / "inlateout"
reg_spec := <arch specific register class> / "<arch specific register name>"
operand_expr := expr / _
reg_operand := dir_spec "(" reg_spec ")" operand_expr
operand := reg_operand / "imm" const_expr / "sym" path
flag := "pure" / "nomem" / "readonly" / "preserves_flags" / "noreturn"
flags := "flags(" flag *["," flag] ")"
asm := "asm!(" format_string *("," [ident "="] operand) ["," flags] ")"

Template string

The assembler template uses the same syntax as format strings (i.e. placeholders are specified by curly braces). The corresponding arguments are accessed in order, by index, or by name.

The assembly code syntax used is that of the GNU assembler (GAS). The only exception is on x86 where the Intel syntax is used instead of GCC's AT&T syntax.

This RFC only specifies how operands are substituted into the template string. Actual interpretation of the final asm string is left to the assembler.

However there is one restriction on the asm string: any assembler state (e.g. the current section which can be changed with .section) must be restored to its original value at the end of the asm string.

The compiler will lint against any operands that are not used in the template string, except for operands that specify an explicit register.
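Because the template follows format-string rules, the standard `format!` macro can stand in to show how positional and named placeholders resolve. This is only an illustration: the register names passed in are hypothetical choices (in a real asm! the compiler picks them), and `render_add_template` is not part of the proposal.

```rust
/// Render the add-5 template with hypothetical register choices.
/// `format!` only illustrates the placeholder rules; it is not how
/// `asm!` is implemented.
fn render_add_template(out_reg: &str, in_reg: &str) -> String {
    format!(
        "mov {0}, {1}\nadd {0}, {number}",
        out_reg,
        in_reg,
        number = 5
    )
}
```

Calling it with "eax" and "ecx" produces the two lines `mov eax, ecx` and `add eax, 5`.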

Operand type

Several types of operands are supported:

  • in(<reg>) <expr>
    • <reg> can refer to a register class or an explicit register. The allocated register name is substituted into the asm template string.
    • The allocated register will contain the value of <expr> at the start of the asm code.
    • The allocated register must contain the same value at the end of the asm code (except if a lateout is allocated to the same register).
  • out(<reg>) <expr>
    • <reg> can refer to a register class or an explicit register. The allocated register name is substituted into the asm template string.
    • The allocated register will contain an unknown value at the start of the asm code.
    • <expr> must be a (possibly uninitialized) place expression, to which the contents of the allocated register are written at the end of the asm code.
    • An underscore (_) may be specified instead of an expression, which will cause the contents of the register to be discarded at the end of the asm code (effectively acting as a clobber).
  • lateout(<reg>) <expr>
    • Identical to out except that the register allocator can reuse a register allocated to an in.
    • You should only write to the register after all inputs are read, otherwise you may clobber an input.
    • lateout must be used instead of out if you are specifying the same explicit register as an in.
  • inout(<reg>) <expr>
    • <reg> can refer to a register class or an explicit register. The allocated register name is substituted into the asm template string.
    • The allocated register will contain the value of <expr> at the start of the asm code.
    • <expr> must be an initialized place expression, to which the contents of the allocated register are written at the end of the asm code.
  • inlateout(<reg>) <expr>
    • Identical to inout except that the register allocator can reuse a register allocated to an in (this can happen if the compiler knows the in has the same initial value as the inlateout).
    • You should only write to the register after all inputs are read, otherwise you may clobber an input.
  • imm <expr>
    • <expr> must be an integer or floating-point constant expression.
    • The value of the expression is formatted as a string and substituted directly into the asm template string.
  • sym <path>
    • <path> must refer to a fn or static defined in the current crate.
    • A mangled symbol name referring to the item is substituted into the asm template string.
    • The substituted string does not include any modifiers (e.g. GOT, PLT, relocations, etc).

Register operands

Input and output operands can be specified either as an explicit register or as a register class from which the register allocator can select a register. Explicit registers are specified as string literals (e.g. "eax") while register classes are specified as plain identifiers (e.g. reg).

Note that explicit registers treat register aliases (e.g. r14 vs lr on ARM) and smaller views of a register (e.g. eax vs rax) as equivalent to the base register. It is a compile-time error to use the same explicit register for two input operands or two output operands. Additionally on ARM, it is a compile-time error to use overlapping VFP registers in input operands or in output operands.

Different register classes have different constraints on which Rust types they allow. For example, reg generally only allows integers and pointers, but not floats or SIMD vectors.

If a value is of a smaller size than the register it is allocated in then the upper bits of that register will have an undefined value for inputs and will be ignored for outputs. It is a compile-time error for a value to be of a larger size than the register it is allocated in.

Here is the list of currently supported register classes:

| Architecture | Register class | Registers | LLVM constraint code | Allowed types |
| --- | --- | --- | --- | --- |
| x86 | reg | ax, bx, cx, dx, si, di, bp, r[8-15] (64-bit only) | r | i8, i16, i32, i64 (64-bit only) |
| x86 | reg_abcd | ax, bx, cx, dx | Q | i8, i16, i32, i64 (64-bit only) |
| x86 | vreg | xmm[0-7] (32-bit), xmm[0-15] (64-bit) | x | i32, i64, f32, f64, v128, v256, v512 |
| x86 | vreg_evex | xmm[0-31] (AVX-512, otherwise same as vreg) | v | i32, i64, f32, f64, v128, v256, v512 |
| x86 (AVX-512) | kreg | k[1-7] | Yk | i16, i32, i64 |
| AArch64 | reg | x[0-31] | r | i8, i16, i32, i64 |
| AArch64 | vreg | v[0-31] | w | i8, i16, i32, i64, f32, f64, v64, v128 |
| AArch64 | vreg_low | v[0-15] | x | i8, i16, i32, i64, f32, f64, v64, v128 |
| AArch64 | vreg_low8 | v[0-7] | y | i8, i16, i32, i64, f32, f64, v64, v128 |
| ARM | reg | r[0-12], r14 | r | i8, i16, i32 |
| ARM | vreg | s[0-31], d[0-31], q[0-15] | w | f32, f64, v64, v128 |
| ARM | vreg_low | s[0-31], d[0-15], q[0-7] | t | f32, f64, v64, v128 |
| ARM | vreg_low8 | s[0-15], d[0-7], q[0-3] | x | f32, f64, v64, v128 |
| RISC-V | reg | x1, x[5-31] | r | i8, i16, i32, i64 (64-bit only) |
| RISC-V | vreg | f[0-31] | f | f32, f64 |

Notes on allowed types:

  • Pointers and references are allowed where the equivalent integer type is allowed.
  • iLEN refers to both signed and unsigned integer types. It also implicitly includes isize and usize where the length matches.
  • Fat pointers are not allowed.
  • vLEN refers to a SIMD vector that is LEN bits wide.

Additional constraint specifications may be added in the future based on demand for additional register classes (e.g. MMX, x87, etc).

Some registers have multiple names. These are all treated by the compiler as identical to the base register name. Here is the list of all supported register aliases:

| Architecture | Base register | Aliases |
| --- | --- | --- |
| x86 | ax | al, eax, rax |
| x86 | bx | bl, ebx, rbx |
| x86 | cx | cl, ecx, rcx |
| x86 | dx | dl, edx, rdx |
| x86 | si | sil, esi, rsi |
| x86 | di | dil, edi, rdi |
| x86 | bp | bpl, ebp, rbp |
| x86 | sp | spl, esp, rsp |
| x86 | ip | eip, rip |
| x86 | r[8-15] | r[8-15]b, r[8-15]w, r[8-15]d |
| x86 | xmm[0-31] | ymm[0-31], zmm[0-31] |
| AArch64 | x[0-30] | w[0-30] |
| AArch64 | x29 | fp |
| AArch64 | x30 | lr |
| AArch64 | sp | wsp |
| AArch64 | xzr | wzr |
| AArch64 | v[0-31] | b[0-31], h[0-31], s[0-31], d[0-31], q[0-31] |
| ARM | r[0-3] | a[1-4] |
| ARM | r[4-9] | v[1-6] |
| ARM | r9 | rfp |
| ARM | r10 | sl |
| ARM | r11 | fp |
| ARM | r12 | ip |
| ARM | r13 | sp |
| ARM | r14 | lr |
| ARM | r15 | pc |
| RISC-V | x0 | zero |
| RISC-V | x1 | ra |
| RISC-V | x2 | sp |
| RISC-V | x3 | gp |
| RISC-V | x4 | tp |
| RISC-V | x[5-7] | t[0-2] |
| RISC-V | x8 | fp, s0 |
| RISC-V | x9 | s1 |
| RISC-V | x[10-17] | a[0-7] |
| RISC-V | x[18-27] | s[2-11] |
| RISC-V | x[28-31] | t[3-6] |
| RISC-V | f[0-7] | ft[0-7] |
| RISC-V | f[8-9] | fs[0-1] |
| RISC-V | f[10-17] | fa[0-7] |
| RISC-V | f[18-27] | fs[2-11] |
| RISC-V | f[28-31] | ft[8-11] |
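The alias rule can be sketched as a lookup followed by a duplicate check. The code below models only the x86 a/b/c/d rows of the alias table; `x86_base_register` and `has_duplicate_register` are hypothetical names, and this is not how rustc is claimed to implement the check.

```rust
/// Map a few x86 register names to their base register, following the
/// alias table (only the a/b/c/d rows are modeled here).
fn x86_base_register(name: &str) -> Option<&'static str> {
    match name {
        "al" | "ax" | "eax" | "rax" => Some("ax"),
        "bl" | "bx" | "ebx" | "rbx" => Some("bx"),
        "cl" | "cx" | "ecx" | "rcx" => Some("cx"),
        "dl" | "dx" | "edx" | "rdx" => Some("dx"),
        _ => None,
    }
}

/// Returns true if two explicit registers used in the same direction
/// (e.g. two inputs) resolve to the same base register.
fn has_duplicate_register(names: &[&str]) -> bool {
    let mut seen: Vec<&str> = Vec::new();
    for &name in names {
        // Names outside the modeled subset are compared as written.
        let base = match x86_base_register(name) {
            Some(base) => base,
            None => name,
        };
        if seen.contains(&base) {
            return true;
        }
        seen.push(base);
    }
    false
}
```

Under this model, using "eax" and "al" as two inputs is flagged as a conflict, while "eax" and "ebx" is not.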

Some registers are explicitly not supported for use with inline assembly:

| Architecture | Unsupported register | Reason |
| --- | --- | --- |
| All | sp | The stack pointer must be restored to its original value at the end of an asm code block. |
| x86 | ah, bh, ch, dh | These are poorly supported by compiler backends. Use 16-bit register views (e.g. ax) instead. |
| x86 | k0 | This is a constant zero register which can't be modified. |
| x86 | ip | This is the program counter, not a real register. |
| AArch64 | xzr | This is a constant zero register which can't be modified. |
| ARM | pc | This is the program counter, not a real register. |
| RISC-V | x0 | This is a constant zero register which can't be modified. |
| RISC-V | gp, tp | These registers are reserved and cannot be used as inputs or outputs. |

Template modifiers

The placeholders can be augmented by modifiers which are specified after the : in the curly braces. These modifiers do not affect register allocation, but change the way operands are formatted when inserted into the template string. Only one modifier is allowed per template placeholder.

The supported modifiers are a subset of LLVM's (and GCC's) asm template argument modifiers.

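| Architecture | Register class | Modifier | Input type | Example output |
| --- | --- | --- | --- | --- |
| x86 | reg | None | i8 | al |
| x86 | reg | None | i16 | ax |
| x86 | reg | None | i32 | eax |
| x86 | reg | None | i64 | rax |
| x86 (32-bit) | reg_abcd | b | Any | al |
| x86 (64-bit) | reg | b | Any | al |
| x86 | reg_abcd | h | Any | ah |
| x86 | reg | w | Any | ax |
| x86 | reg | k | Any | eax |
| x86 (64-bit) | reg | q | Any | rax |
| x86 | vreg | None | i32, i64, f32, f64, v128 | xmm0 |
| x86 (AVX) | vreg | None | v256 | ymm0 |
| x86 (AVX-512) | vreg | None | v512 | zmm0 |
| x86 (AVX-512) | kreg | None | Any | k1 |
| AArch64 | reg | None | Any | x0 |
| AArch64 | reg | w | Any | w0 |
| AArch64 | reg | x | Any | x0 |
| AArch64 | vreg | None | Any | v0 |
| AArch64 | vreg | b | Any | b0 |
| AArch64 | vreg | h | Any | h0 |
| AArch64 | vreg | s | Any | s0 |
| AArch64 | vreg | d | Any | d0 |
| AArch64 | vreg | q | Any | q0 |
| ARM | reg | None | Any | r0 |
| ARM | vreg | None | f32 | s0 |
| ARM | vreg | None | f64, v64 | d0 |
| ARM | vreg | None | v128 | q0 |
| ARM | vreg | e / f | v128 | d0 / d1 |
| RISC-V | reg | None | Any | x1 |
| RISC-V | vreg | None | Any | f0 |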


Notes:

  • on ARM e / f: this prints the low or high doubleword register name of a NEON quad (128-bit) register.
  • on AArch64 reg: a warning is emitted if the input type is smaller than 64 bits, suggesting to use the w modifier. The warning can be suppressed by explicitly using the x modifier.
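As a rough illustration of how a modifier changes only the printed name and not the allocated register, here is a sketch covering the x86 rows for the a/b/c/d registers. `format_x86_abcd` is a hypothetical helper, and the single-character modifier argument is an assumption made for brevity.

```rust
/// Print one of the x86 a/b/c/d registers according to a template
/// modifier, following the x86 rows of the modifier table.
/// `letter` is one of 'a', 'b', 'c', 'd'.
fn format_x86_abcd(letter: char, modifier: char) -> Option<String> {
    if !matches!(letter, 'a' | 'b' | 'c' | 'd') {
        return None;
    }
    match modifier {
        'b' => Some(format!("{}l", letter)),  // low byte, e.g. al
        'h' => Some(format!("{}h", letter)),  // high byte, e.g. ah
        'w' => Some(format!("{}x", letter)),  // 16-bit view, e.g. ax
        'k' => Some(format!("e{}x", letter)), // 32-bit view, e.g. eax
        'q' => Some(format!("r{}x", letter)), // 64-bit view, e.g. rax (64-bit targets only)
        _ => None,
    }
}
```

The same register 'a' is printed as al, ax, eax or rax depending on the modifier.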


Flags

Flags are used to further influence the behavior of the inline assembly block. Currently the following flags are defined:

  • pure: The asm block has no side effects, and its outputs depend only on its direct inputs (i.e. the values themselves, not what they point to). This allows the compiler to execute the asm block fewer times than specified in the program (e.g. by hoisting it out of a loop) or even eliminate it entirely if the outputs are not used.
  • nomem: The asm block does not read or write to any memory. This allows the compiler to cache the values of modified global variables in registers across the asm block since it knows that they are not read or written to by the asm.
  • readonly: The asm block does not write to any memory. This allows the compiler to cache the values of unmodified global variables in registers across the asm block since it knows that they are not written to by the asm.
  • preserves_flags: The asm block does not modify the condition flags. This allows the compiler to avoid recomputing the condition flags after the asm block.
  • noreturn: The asm block never returns, and its return type is defined as ! (never). Behavior is undefined if execution falls through past the end of the asm code.

The nomem and readonly flags are mutually exclusive: it is an error to specify both. Specifying pure on an asm block with no outputs is linted against since such a block will be optimized away to nothing.
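The two rules about flag combinations can be written down as a small check. This is a sketch only: the `Flag` and `FlagCheck` types and the `check_flags` function are invented for illustration, and the message strings are not the compiler's.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Flag {
    Pure,
    Nomem,
    Readonly,
    PreservesFlags,
    Noreturn,
}

#[derive(PartialEq, Debug)]
enum FlagCheck {
    Ok,
    Warning(&'static str),
    Error(&'static str),
}

/// Apply the two combination rules stated above to a flag list.
fn check_flags(flags: &[Flag], num_outputs: usize) -> FlagCheck {
    if flags.contains(&Flag::Nomem) && flags.contains(&Flag::Readonly) {
        return FlagCheck::Error("nomem and readonly are mutually exclusive");
    }
    if flags.contains(&Flag::Pure) && num_outputs == 0 {
        return FlagCheck::Warning("pure asm block with no outputs will be optimized away");
    }
    FlagCheck::Ok
}
```

Here `flags(pure, nomem)` with one output passes, `flags(nomem, readonly)` is rejected, and `flags(pure)` with no outputs only produces a warning.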

Mapping to LLVM IR

The direction specification maps to a LLVM constraint specification as follows (using a reg operand as an example):

  • in(reg) => r
  • out(reg) => =&r (Rust's outputs are early-clobber outputs in LLVM/GCC terminology)
  • inout(reg) => =&r,0 (an early-clobber output with an input tied to it, 0 here is a placeholder for the position of the output)
  • lateout(reg) => =r (Rust's late outputs are regular outputs in LLVM/GCC terminology)
  • inlateout(reg) => =r,0 (cf. inout and lateout)
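The mapping in this list can be sketched as a function from a direction and a constraint code to the output and input constraint strings. The `Direction` enum, the `llvm_constraints` name, and the `output_index` parameter (the position of the output that a tied input refers to) are assumptions made for this illustration.

```rust
#[derive(Clone, Copy)]
enum Direction {
    In,
    Out,
    InOut,
    LateOut,
    InLateOut,
}

/// Returns (output constraint, input constraint) for one operand.
/// `class_code` is the LLVM constraint code of the register class, e.g. "r".
fn llvm_constraints(
    dir: Direction,
    class_code: &str,
    output_index: usize,
) -> (Option<String>, Option<String>) {
    match dir {
        Direction::In => (None, Some(class_code.to_string())),
        // Rust outputs are early-clobber outputs in LLVM terms.
        Direction::Out => (Some(format!("=&{}", class_code)), None),
        Direction::InOut => (
            Some(format!("=&{}", class_code)),
            Some(output_index.to_string()), // input tied to the output
        ),
        // Late outputs are regular LLVM outputs.
        Direction::LateOut => (Some(format!("={}", class_code)), None),
        Direction::InLateOut => (
            Some(format!("={}", class_code)),
            Some(output_index.to_string()),
        ),
    }
}
```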

As written this RFC requires architectures to map from Rust constraint specifications to LLVM constraint codes. This is in part for better readability on Rust's side and in part for independence of the backend:

  • Register classes are mapped to the appropriate constraint code as per the table above.
  • imm operands are formatted and injected directly into the asm string.
  • sym is mapped to s for statics and X for functions.
  • a register name r1 is mapped to {r1}
  • additionally mappings for register classes are added as appropriate (cf. llvm-constraint)
  • lateout operands with an _ expression that are specified as an explicit register are converted to LLVM clobber constraints. For example, lateout("r1") _ is mapped to ~{r1} (cf. llvm-clobber).
  • If the nomem flag is not set then ~{memory} is added to the clobber list. (Although this is currently ignored by LLVM)
  • If the preserves_flags flag is not set then the following are added to the clobber list:
    • (x86) ~{dirflag}, ~{flags}, ~{fpsr}
    • (ARM/AArch64) ~{cc}
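The implicit clobbers described in the last two bullets could be computed along these lines. This is a sketch; the `Arch` enum and `default_clobbers` are hypothetical names, and RISC-V is given no flag clobbers only because the list above names none.

```rust
#[derive(Clone, Copy)]
enum Arch {
    X86,
    Arm,
    AArch64,
    RiscV,
}

/// Clobbers added implicitly, depending on which flags are absent.
fn default_clobbers(arch: Arch, nomem: bool, preserves_flags: bool) -> Vec<&'static str> {
    let mut clobbers = Vec::new();
    if !nomem {
        clobbers.push("~{memory}");
    }
    if !preserves_flags {
        match arch {
            Arch::X86 => clobbers.extend_from_slice(&["~{dirflag}", "~{flags}", "~{fpsr}"]),
            Arch::Arm | Arch::AArch64 => clobbers.push("~{cc}"),
            Arch::RiscV => {} // no condition-flag clobber is listed for RISC-V
        }
    }
    clobbers
}
```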

For some operand types, we will automatically insert some modifiers into the template string.

  • For sym and imm operands, we automatically insert the c modifier which removes target-specific modifiers from the value (e.g. # on ARM).
  • On AArch64, we will warn if a value smaller than 64 bits is used without a modifier since this is likely a bug (it will produce x* instead of w*). Clang has this same warning.
  • On ARM, we will automatically add the P or q LLVM modifier for f64, v64 and v128 passed into a vreg. This will cause those registers to be formatted as d* and q* respectively.

Additionally, the following attributes are added to the LLVM asm statement:

  • The nounwind attribute is always added: unwinding from an inline asm block is not allowed (and not supported by LLVM anyways).
  • If the nomem flag is set then the readnone attribute is added to the LLVM asm statement.
  • If the readonly flag is set then the readonly attribute is added to the LLVM asm statement.
  • If the pure flag is not specified then the sideeffect flag is added to the LLVM asm statement.
  • On x86 the inteldialect flag is added to the LLVM asm statement so that the Intel syntax is used instead of the AT&T syntax.

If the noreturn flag is set then an unreachable LLVM instruction is inserted after the asm invocation.
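Putting the attribute rules together, a frontend could derive the LLVM-side attributes roughly as follows. `llvm_asm_attributes` and its boolean parameters are invented for this sketch and do not describe an existing rustc function.

```rust
/// Attributes and flags attached to the LLVM asm statement, derived
/// from the target and the asm! flags as listed above.
fn llvm_asm_attributes(
    is_x86: bool,
    pure_flag: bool,
    nomem: bool,
    readonly: bool,
) -> Vec<&'static str> {
    let mut attrs = vec!["nounwind"]; // always added
    if nomem {
        attrs.push("readnone");
    }
    if readonly {
        attrs.push("readonly");
    }
    if !pure_flag {
        attrs.push("sideeffect");
    }
    if is_x86 {
        attrs.push("inteldialect");
    }
    attrs
}
```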



Drawbacks

Unfamiliarity

This RFC proposes a completely new inline assembly format. It is not possible to just copy examples of GCC-style inline assembly and re-use them. There is however a fairly trivial mapping between the GCC-style and this format that could be documented to alleviate this.

Additionally, this RFC proposes using the Intel asm syntax on x86 instead of the AT&T syntax. We believe this syntax will be more familiar to most users, but may be surprising for users used to GCC-style asm.

The cpuid example above would look like this in GCC-style inline assembly:

// GCC doesn't allow directly clobbering an input, we need
// to use a dummy output instead.
int ebx, ecx, discard;
asm (
    "cpuid"
    : "=a"(discard), "=b"(ebx), "=c"(ecx) // outputs
    : "a"(4), "c"(0) // inputs
    : "edx" // clobbers
);
printf("L1 Cache: %i\n", ((ebx >> 22) + 1)
    * (((ebx >> 12) & 0x3ff) + 1)
    * ((ebx & 0xfff) + 1)
    * (ecx + 1));

Limited set of operand types

The proposed set of operand types is much smaller than that which is available through GCC-style inline assembly. In particular, the proposed syntax does not include any form of memory operands and is missing many register classes.

We chose to keep operand constraints as simple as possible, and in particular memory operands introduce a lot of complexity since different instructions support different addressing modes. At the same time, the exact rules for memory operands are not very well known (you are only allowed to access the data directly pointed to by the constraint) and are often gotten wrong.

If we discover that there is a demand for a new register class or special operand type, we can always add it later.

Difficulty of support

Inline assembly is a difficult feature to implement in a compiler backend. While LLVM does support it, this may not be the case for alternative backends such as Cranelift (see this issue).

However it is possible to implement support for inline assembly without support from the compiler backend by using an external assembler instead. Take the following (AArch64) asm block as an example:

unsafe fn foo(mut a: i32, b: i32) -> (i32, i32) {
    let c;
    asm!("<some asm code>", inout(reg) a, in("x0") b, out("x20") c);
    (a, c)
}

This could be expanded to an external asm file with the following contents:

# Function prefix directives
.section ".text.foo_inline_asm"
.globl foo_inline_asm
.p2align 2
.type foo_inline_asm, @function
foo_inline_asm:

// If necessary, save callee-saved registers to the stack here.
str x20, [sp, #-16]!

// Move the pointer to the argument out of the way since x0 is used.
mov x1, x0

// Load input values
ldr w2, [x1, #0]
ldr w0, [x1, #4]

<some asm code>

// Store output values
str w2, [x1, #0]
str w20, [x1, #8]

// If necessary, restore callee-saved registers here.
ldr x20, [sp], #16

ret

# Function suffix directives
.size foo_inline_asm, . - foo_inline_asm

And the following Rust code:

unsafe fn foo(mut a: i32, b: i32) -> (i32, i32) {
    let c;
    {
        // The asm file reads and writes the fields at fixed offsets
        // (0, 4 and 8), so the struct needs a C-compatible layout.
        #[repr(C)]
        struct foo_inline_asm_args {
            a: i32,
            b: i32,
            c: i32,
        }
        extern "C" {
            fn foo_inline_asm(args: *mut foo_inline_asm_args);
        }
        let mut args = foo_inline_asm_args {
            a: a,
            b: b,
            c: mem::uninitialized(),
        };
        foo_inline_asm(&mut args);
        a = args.a;
        c = args.c;
    }
    (a, c)
}

Use of double braces in the template string

Because {} are used to denote operand placeholders in the template string, actual uses of braces in the assembly code need to be escaped with {{ and }}. This is needed for AVX-512 mask registers and ARM register lists.
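The escaping rule is the same one `format!` uses, so it can be demonstrated without any assembly. The sketch below builds an ARM register list; `render_push` is a hypothetical helper and the register name is an arbitrary example.

```rust
/// Build an ARM `push` with a register list. `{{` and `}}` produce
/// literal braces, while `{0}` is still an operand placeholder.
fn render_push(reg: &str) -> String {
    format!("push {{{0}, lr}}", reg)
}
```

Passing "r4" yields `push {r4, lr}`: the outer doubled braces survive as single braces and the inner `{0}` is substituted.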

Rationale and alternatives

Implement an embedded DSL

Both MSVC and D provide what is best described as an embedded DSL for inline assembly. It is generally close to the system assembler's syntax, but augmented with the ability to directly access variables that are in scope.

// This is D code
int ebx, ecx;
asm {
    mov EAX, 4;
    xor ECX, ECX;
    cpuid;
    mov ebx, EBX;
    mov ecx, ECX;
}
writefln("L1 Cache: %s",
    ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1)
    * ((ebx & 0xfff) + 1) * (ecx + 1));

// This is MSVC C++
int ebx_v, ecx_v;
__asm {
    mov eax, 4
    xor ecx, ecx
    cpuid
    mov ebx_v, ebx
    mov ecx_v, ecx
}
std::cout << "L1 Cache: "
    << (((ebx_v >> 22) + 1) * (((ebx_v >> 12) & 0x3ff) + 1)
        * ((ebx_v & 0xfff) + 1) * (ecx_v + 1))
    << '\n';

While this is very convenient on the user side in that it requires no specification of inputs, outputs, or clobbers, it puts a major burden on the implementation. The DSL needs to be implemented for each supported architecture, and full knowledge of the side effects of every instruction is required.

This huge implementation overhead is likely one of the reasons MSVC only provides this capability for x86, while D at least provides it for x86 and x86-64. It should also be noted that the D reference implementation falls slightly short of supporting arbitrary assembly. E.g. the lack of access to the RIP register makes certain techniques for writing position independent code impossible.

As a stop-gap the LDC implementation of D provides a llvmasm feature that binds it closely to LLVM IR's inline assembly.

We believe it would be unfortunate to put Rust into a similar situation, making certain architectures second-class citizens with respect to inline assembly.

Provide intrinsics for each instruction

In discussions it is often postulated that providing intrinsics is a better solution to the problems at hand. However, particularly where precise timing and full control over the number of generated instructions are required, intrinsics fall short.

Intrinsics are of course still useful and have their place for inserting specific instructions. E.g. making sure a loop uses vector instructions, rather than relying on auto-vectorization.

However, inline assembly is specifically designed for cases where more control is required. Also providing an intrinsic for every (potentially obscure) instruction that is needed e.g. during early system boot in kernel code is unlikely to scale.

Make the asm! macro return outputs

It has been suggested that the asm! macro could return its outputs like the LLVM statement does. The benefit is that it is clearer to see that variables are being modified. Particularly in the case of initialization it becomes more obvious what is happening. On the other hand, by necessity this splits the direction and constraint specification from the variable name, which makes this syntax overall harder to read.

fn mul(a: u32, b: u32) -> u64 {
    let (lo, hi) = unsafe {
        asm!("mul {}", in(reg) a, in("eax") b, lateout("eax"), lateout("edx"))
    };

    // Parenthesized because `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}

Prior art

GCC inline assembly

The proposed syntax is very similar to GCC's inline assembly in that it is based on string substitution while leaving actual interpretation of the final string to the assembler. However GCC uses poorly documented single-letter constraint codes and template modifiers. Clang tries to emulate GCC's behavior, but there are still several cases where its behavior differs from GCC's.

The main reason why this is so complicated is that GCC's inline assembly basically exports the raw internals of GCC's register allocator. This has resulted in many internal constraint codes and modifiers being widely used, despite them being completely undocumented.

D & MSVC inline assembly

See the section above.

Unresolved questions

  • Should a pure asm statement with no outputs be an error or just a warning? The asm block will be eliminated by the compiler since it has no side effects and no outputs. However, such asm blocks may be produced by auto-generated code or macros.

  • Should we keep the same flags for the template modifiers as LLVM/GCC? Or should we use our own?

  • Should we allow passing a value expression (rvalue) as an inout operand? The semantics would be that of an input which is allowed to be clobbered (i.e. the output is simply discarded).

  • Some registers are reserved and cannot be used in inline assembly. We already disallow the stack pointer from being used since it is always reserved, but there are other registers that are only sometimes reserved (e.g. the frame pointer if the function needs one, r9 on some ARM targets, etc). Should we disallow the use of these registers on the frontend (rustc) or leave it for the backend (LLVM) to produce a warning if these are used?

  • Do we need to add support for tied operands? Most use cases for those should already be covered by inout.

  • Should we support x86 high byte registers (ah, bh, ch, dh) as inputs/outputs? These are supported by LLVM but not by GCC, so I feel a bit uncomfortable relying on them.

  • Should we support memory operands ("m")? This would allow generating more efficient code by taking advantage of addressing modes instead of using an intermediate register to hold the computed address.

  • Should we add formatting flags for imm operands (e.g. x to format a number as hex)? This is probably not needed in practice.

  • Should we support some sort of shorthand notation for operand names to avoid needing to write blah = out(reg) blah? For example, if the expression is just a single identifier, we could implicitly allow that operand to be referred to using that identifier.

  • What should preserves_flags do on architectures that don't have condition flags (e.g. RISC-V)? Do nothing? Compile-time error?

Future possibilities

Flag outputs

GCC supports a special type of output which allows an asm block to return a bool encoded in the condition flags register. This allows the compiler to branch directly on the condition flag instead of materializing the condition as a bool.

We can support this in the future with a special output operand type.

asm goto

GCC supports passing C labels (the ones used with goto) to an inline asm block, with an indication that the asm code may jump directly to one of these labels instead of leaving the asm block normally.

This could be supported by allowing code blocks to be specified as operand types. The following code will print a if the input value is 42, or print b otherwise.

asm!("cmp {}, 42; je {}",
    in(reg) val,
    label { println!("a"); },
    fallthrough { println!("b"); }
);

Unique ID per asm

GCC supports %= which generates a unique identifier per instance of an asm block. This is guaranteed to be unique even if the asm block is duplicated (e.g. because of inlining).

We can support this in the future with a special input operand type.

imm and sym for global_asm!

The global_asm! macro could be extended to support imm and sym operands since those can be resolved by simple string substitution. Symbols used in global_asm! will be marked as #[used] to ensure that they are not optimized away by the compiler.


I'm really liking the feel of this so far. I need to think through all the ramifications, but, it seems like a really good and thorough treatment that allows for future extension and doesn't make Rust too dependent on specific back-ends or architectures.


Looks great.

A few small bikeshed things:

  • Are there concrete use cases for reg_abcd, vreg_low, and vreg_low8?
  • lateout and especially inlateout are kind of ugly. I'd suggest out_late, but I'm not sure what to do to inlateout without making it less self-explanatory.
  • The name flags is confusing since one might think it refers to the flags register. Maybe settings? Or something like asm!("asdf", pure = true)?

One thing that might be nice is to list specific things that can be done with inline assembly in GCC, D, MSVC, etc. that you won't currently be able to do with what this RFC proposes. I see some things along those lines mentioned, but are those the only things? It might be good for this RFC to explicitly state which features those implementations currently support that this one doesn't, AND whether this RFC is deciding if those things should ever be supported. That might be asking too much though. I'm not an expert in this area.

This looks great. Thank you so much for working on this!

A few requests:

  • While I do like using similar syntax for input/output registers that don't get directly referenced in the format string, it seems error-prone if we can't detect a mismatch between the input/output values and the values used in the format string. I would suggest some explicit indication that an input or output is implicit, and then an error if an asm! doesn't reference every non-implicit input/output in the format string.
  • You have i8/i16/i32/i64 in many places; in every case, those should allow either signed or unsigned types. (I didn't see a note anywhere saying that those represented both signed and unsigned of the given size.)
  • Anywhere that allows v128 should also allow i128 or u128.
  • Please mention that we may wish to provide a standardized way of switching binary sections without relying on backend .section directives. That can be a future extension, but it seems worth mentioning.
  • When you mention that this uses Intel syntax by default, you should mention that we could easily implement an asm_att! or similar that does the opposite, for convenience of copy-pasting. And you should also mention, in the alternatives section, that we could choose to do the reverse: use AT&T syntax and provide an asm_intel! for Intel syntax. Both approaches have tradeoffs; for instance, some folks who work on kernels favor AT&T syntax to help reduce differences between architectures, while some folks who work primarily on Intel assembly prefer the Intel syntax. We should mention those tradeoffs explicitly.
  • For asm goto, I would propose a different syntax for the most common case of that, which integrates into an if/else statement rather than providing arbitrary multi-way goto. That would cover a large number of uses of asm goto I've seen in the wild, while remaining relatively structured and straightforward for the compiler and other tools to analyze.
  • Yes, imm should allow floating-point immediates.

I am hesitant to add additional syntax for this since clobber specifications are already getting quite long. However, I think a simple solution would be to lint against any unused operands that are specified as register classes. Unused operands specified as explicit registers are silently allowed.

I thought that was obvious, but I'll add a note about it.

Vector types and large integers are treated very differently by the register allocator (one goes in a vector register, the other goes into a pair of general purpose registers). LLVM and GCC are very picky as to what types they accept for various constraint codes, so this wouldn't work.

We already have this in the form of #[link_section]. You only need to use .section if the data you are encoding needs to refer to a label inside the asm itself. I don't really see how this can be done outside of the asm string, and I would really rather not have to perform any parsing of the asm string within rustc itself.

Actually the assembly languages of most architectures (at least ARM and RISC-V that I know of) are much closer to Intel syntax than AT&T. The only reason anyone is still using AT&T syntax is because GCC doesn't support inline assembly with Intel syntax. This is already somewhat covered by the "Unfamiliarity" drawback, but I can make it more explicit.

So this is a bit of a tricky case: Rust doesn't really have (C-style) labels that we can pass into the asm. Integrating into an if / else also doesn't work since that would require the asm to produce an intermediate bool result, which defeats the point. The only real way to support asm goto is what I proposed: you need to pass the code to be executed for each label directly to the asm! macro. If I misunderstood your proposal, please provide some example code to clarify what you are suggesting.

Note that this means we will be inserting the values directly into the asm string, rather than using LLVM's "i" constraint. It's probably better this way actually so let's just do that.


I've updated the pre-RFC based on @josh's feedback (see the edit history for changes).


This is a huge RFC, though it seems most of my comments have already been brought up by other folks. My main concern is making sure that it is absolutely painless to specify additional supported architectures. I work almost entirely in RISC-V, and I think that it's very important that it be easy for me to add support (even though I think that RISC-V, as an ISA, is sufficiently simple (and on-brand for Rust) that we should just support it from the beginning...). This goes for other users of less-mainstream ISAs.

Also, I think it's important to actually spell out, roughly, what sorts of things are UB inside of an asm! statement. Compilers tend to be pretty bad at this IME.

There's the obvious "don't scribble over registers you didn't say you were scribbling over", but there's a few questions (varying from an obvious "no that's obviously stupid" to "I do this when I write inline assembly in C and have no idea if it's UB", in no particular order):

  • Can I do a far jump that never returns? (If so, it might be nice to have a mechanism to tell the Rust compiler that the asm! block should type as !.)
  • If so, can I scribble whatever I want in registers like e.g. the stack pointer and then never return (e.g., I want to write the OS code that executes before a thread starts in pure Rust).
  • Can I pretend to be a function call and grow the stack (making sure to shrink it before exiting the asm block)?
  • Can I do really rude things like ret?
  • Can I raise a hardware exception or similar that would really mess up the Rust implementation's unwinding? (You merely specify that you cannot begin unwinding from inside inline assembly; you might want to strengthen this).
  • Can I do my own save-and-restore of registers I haven't told the compiler I'm touching?
  • Can I read registers I didn't say I was going to read?

These aren't quite UB questions but are other things you don't specify:

  • Can I put an uninitialized let in an in parameter? More generally, could I write
let my_ptr: *const usize = ...;
let mut allegedly_frozen: usize;
asm!("mv {}, 0({})", out(reg) allegedly_frozen, in(reg) my_ptr);
  • Should I expect empty template strings to still force save/restores, and generally act as an optimization barrier? If not, I think we should be explicit that asm should not be used in this way, and maybe have a second discussion about providing an intrinsic. Here's one of many examples of this in Chromium. I.e.,
let x = ...;
asm!("", inout(reg) x);
  • Can I actively rely on the fact that the compiler will never peek into my assembly and try to optimize it because it thinks it's smart? This is especially important for constant-time cryptography, which tends to need to play chicken with the compiler.

As a final note, I think we should consider adding a shorthand for in(reg) and friends. 99% of register constraints are "any register", so I think such a shorthand would make constraints more readable. (I'm also not a fan of the juxtaposed constraint expr syntax, and it would be nice to have some punctuation separating them.)


Another nitpick (emphasis added):

The part I bolded seems like it should be based on the presence of nomem rather than being baked into pure. If you have pure but not nomem, the assembly should be a pure function of the input values together with what they (transitively) point to.

I think this would be consistent with the LLVM semantics we're translating to – i.e. it's legal to read from memory even if you don't pass sideeffect, as long as you don't pass readnone.

An unused operand specified as a register class can't possibly work, as you can't know what register you got without substituting it in; I think that should give a hard error, not a lint.

But I'd still like to catch possible mistakes caused by specifying an exact register and then forgetting to use that argument at all. I would like to distinguish that case somehow. I agree that the common case for "use this exact register" will be to use that register implicitly, and we shouldn't complicate that case. But is there some other way that we can make this error less likely?

LLVM doesn't already support putting a u128 into an SSE register and doing math on it using SSE? That seems quite unfortunate.

In any case, it seems extremely surprising to me if you can't provide a u128 input value for a 128-bit register, or get a 128-bit register into a 128-bit output.

On a different note, for ABIs that do commonly operate on register pairs, we need a good way to handle those. For instance, on 32-bit x86 ("i386"), it's common to operate on edx:eax as a register pair, for operations such as multiplication, division, or rdtsc. And on 64-bit x86, some instructions operate on rdx:rax as a pair for 128-bit operations. We need to have a way to specify those as register constraints; for instance, out("edx:eax") value should work with a u64 value.

I don't mean an intermediate bool. I'm still proposing that this would pass a label into the assembly. I'm just imagining something more structured, like this:

if_asm!("various assembly; je {else}") {
    // if body
} else {
    // else body
}

This would pass in an {else} label implicitly, and give an error if the assembly didn't reference {else}; finishing the assembly block without jumping to {else} would enter the if body. That would cover, for instance, every single use of asm goto in the Linux kernel.

A few additional thoughts on the RFC:

I don't think this should be limited to "defined in the current crate"; it should be acceptable to reference any symbol visible to the current crate. For instance, you should be able to pass a pointer to a function defined in another crate.

You might say "called as a function, potentially with a non-standard ABI".

Also, I can imagine other ways to implement this, such as running the external assembler and then inlining the resulting instructions; the only thing the backend would have to support is "make sure this value ends up in this register at this point", and an inefficient implementation of that could just arrange to move the value into that register at that point, rather than optimizing to make sure it's already in that register at that point.

Proposed answers:

Yes, but if so, any memory that was actively mutably borrowed before you jumped will be left in an undefined state. For example, in this code:

fn foo(x: &mut u32) {
    *x = 42;
    bar();
}

the optimizer should be allowed to move the write after the call, but bar() could include an asm block that jumps into oblivion.

Also, you'd be skipping any destructors for objects which may have been on the stack. That's okay if you never reuse the stack. If you do reuse the stack, it's not automatically UB, but it's unsafe, i.e. you're responsible for ensuring there wasn't anything on the stack that depended on destructors being run for correct behavior.

If you want to write code that executes without already having a stack pointer, you need either global_asm! or a #[naked] function. #[naked] functions arguably should be removed from the language, but if not, they should be treated as basically syntax sugar for global_asm!. A naked function must consist of a single asm block, which is just plopped into the output file: there's no interaction with code generation or register allocation, no possibility of inlining. It really has completely different semantics from normal functions and normal asm blocks.

Not sure, but note that this also affects whether you can perform an actual function call.

No. You have no idea where the compiler has stashed the return address.

What do you mean by "mess up unwinding"? The consequences of raising a hardware exception depend on the exception handler. For example, under a typical non-embedded operating system kernel, userland code raises hardware exceptions all the time, when it accesses pages that haven't been faulted in yet. But once the kernel has loaded the page into memory and added it to the page table, it restores all registers and resumes execution at the instruction that produced the exception, making the whole scheme invisible to userland. In an embedded context... I'd say it really depends on what you're doing with the exceptions.


Yes, but doing so is pointless since the compiler might be storing any value whatsoever in them. The stack pointer register might be an exception (see previous question about calls).

No, it should be a compile-time error.

Is the question here whether asm implicitly freezes its outputs? I'd say it does, since I can't think of any optimization that could make the output act 'weird' while still treating the assembly as a black box (which it should).

It should make the output a 'black box' to the optimizer, but whether that involves saving or restoring anything depends on the constraints you specified...
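For readers who only want the "make this value opaque to the optimizer" effect of an empty asm block, the standard library offers `std::hint::black_box`. The sketch below is an illustration of that alternative, not something proposed in this thread; `opaque_sum` is a hypothetical name, and `black_box` is documented as a best-effort hint rather than a guarantee.

```rust
use std::hint::black_box;

/// Sums the integers `1..=n`, passing every term through `black_box` so the
/// optimizer is discouraged from folding the loop into a closed-form result.
/// `opaque_sum` is a hypothetical name used only for this sketch.
pub fn opaque_sum(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 1..=n {
        // Best-effort hint: treat `i` as if its value were unknown.
        total += black_box(i);
    }
    // Also mark the result as "used" so the computation is not discarded.
    black_box(total)
}
```

Unlike an asm block with explicit constraints, this says nothing about which registers are saved or restored; it only limits what the optimizer may assume about the value.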

The compiler should treat the assembly as a black box. However, the compiler is allowed to perform crazy transformations like

let a = b * c;

into

let a = if c == 42 {
    b * 42
} else {
    b * c
};

which can make seemingly constant-time operations variable time, even if the compiler doesn't know anything about the inputs a priori.

So there is no way to do guaranteed constant-time cryptography unless your entire algorithm is in a single asm block. Even then, the compiler is free to take the outputs of the algorithm and leak those through timing. Yes, this sucks.
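As an illustration of the kind of source-level idiom at stake, here is a minimal sketch of a branch-free select written in plain Rust. The name `ct_select` is hypothetical. Per the point above, nothing stops the compiler from turning this back into a branch, so treat it as an example of intent rather than a guarantee.

```rust
/// Returns `a` when `choice` is 1 and `b` when `choice` is 0, with no branch
/// in the source. `ct_select` is a hypothetical name used for illustration.
/// The compiler remains free to compile this however it likes.
pub fn ct_select(choice: u8, a: u32, b: u32) -> u32 {
    debug_assert!(choice <= 1);
    // 0 becomes 0x0000_0000 and 1 becomes 0xFFFF_FFFF.
    let mask = (choice as u32).wrapping_neg();
    // With an all-ones mask this is b ^ (a ^ b) == a; with a zero mask it is b.
    b ^ (mask & (a ^ b))
}
```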


I don't think this is necessary. You can just specify two output variables and combine them afterwards with ((hi as u64) << 32) | (lo as u64). I believe the compiler is smart enough to produce optimal code in this situation.
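A minimal sketch of the two-output approach, written against the asm! syntax as it was later stabilized in `std::arch::asm` (which may differ in details from the draft discussed here). The name `mul_wide` is hypothetical, and a plain-Rust fallback is assumed for non-x86-64 targets.

```rust
/// Full 64x64 -> 128-bit multiply using the register pair rdx:rax.
/// `mul_wide` is a hypothetical name used only for this sketch.
#[cfg(target_arch = "x86_64")]
pub fn mul_wide(a: u64, b: u64) -> u128 {
    use std::arch::asm;

    let lo: u64;
    let hi: u64;
    // SAFETY: `mul` reads rax and its operand, and writes rax, rdx and the
    // flags; all of these are declared (flags are clobbered by default).
    unsafe {
        asm!(
            "mul {}",
            in(reg) a,
            inlateout("rax") b => lo,
            lateout("rdx") hi,
            options(pure, nomem, nostack)
        );
    }
    // Parenthesize explicitly: `<<` binds more loosely than `+` and `|`.
    ((hi as u128) << 64) | (lo as u128)
}

/// Portable fallback for other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn mul_wide(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}
```

The two halves are ordinary outputs bound to explicit registers, so no dedicated register-pair constraint is needed for this case.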

I'd like the ability to do that without UB, yes. I'd also like the ability to possibly do a jump that never returns. I don't think either of those would cause horrible problems any more than exec or _exit or a fatal fault would.

EDIT: see comex's answer regarding mutable borrows.

As long as you never fall out the end of the asm block, I don't see how that would cause UB.

Now that has really interesting implications. We may at some point want to support inputs or outputs that would end up in stack locations (e.g. if a local value currently lives on the stack and not a register). If the compiler isn't using stack frames, it might access a value using an offset from the stack pointer, so changing the stack pointer would break the stack-pointer-relative location the compiler gives you.

For that matter, if you change the stack pointer in any way, it'd be polite to include the appropriate debug information so that debuggers can figure out where locals live at all times (even if you don't reference them).

All that said, I think you should be able to adjust the stack pointer if you restore it afterwards, but that would take some care to allow.

No, definitely not.

Possibly, if you handle the trap. Consider code that (for instance) runs rdmsr, and code elsewhere that has set up a GPF handler that knows if that particular instruction faults to jump to a recovery label in the same block.

I also think you can fault if you never return.

You can't raise a fault that other code will handle by unwinding your stack, but in that case, the fault lies with the code unwinding the stack.

Yes, that seems reasonable. You could push rax, use rax as a scratch register, and then pop rax.

That said, I think the best way to handle that would be to tell the asm! block you need a scratch register of a given type, and then the compiler might give you a register it already had free so you don't have to save and restore.

The value of a register will always be a valid bit pattern, so there's no undefined behavior there, but you should not ever depend on the value. I can only think of two valid use cases: saving and restoring the register value, or debugging code that just captures registers and prints them.

Not in an in parameter, but it seems reasonable to put it in an out parameter.

We should have an explicit note that reading from an out or lateout register before you write it will produce an unspecified but valid bit pattern.

I think we should have a real "compiler barrier" operation rather than encouraging people to use the equivalent of GCC's __asm__ __volatile__ ("" : : : "memory"). But I do think we should make that work, even if we provide a better alternative.
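For comparison, Rust's standard library already has one operation in this spirit: `std::sync::atomic::compiler_fence`, which emits no instructions and only restricts compiler reordering. The sketch below follows the pattern from its documentation; the statics and function names are hypothetical, and it is not a drop-in replacement for every use of an empty asm block with a memory clobber.

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicUsize, Ordering};

// Hypothetical shared state, e.g. between a thread and a signal handler
// running on that same thread.
static PAYLOAD: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

/// Stores the payload, then sets the flag. The fence stops the compiler
/// from moving the payload store below the flag store.
pub fn publish(value: usize) {
    PAYLOAD.store(value, Ordering::Relaxed);
    compiler_fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

/// Reads the payload only if the flag is set. The fence stops the compiler
/// from moving the payload load above the flag load.
pub fn try_read() -> Option<usize> {
    if READY.load(Ordering::Relaxed) {
        compiler_fence(Ordering::Acquire);
        Some(PAYLOAD.load(Ordering::Relaxed))
    } else {
        None
    }
}
```

This only constrains the compiler, not the CPU, so it is not sufficient for synchronization between threads on different cores.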

By default, yes. I think we might in the future want to offer a mechanism to explicitly label an asm! block as permitting peephole optimizations and similar, but by default the compiler should assume that it must emit exactly the assembly specified.


That seems reasonable (if verbose), but if that's the recommended approach, then we should document that (with examples using mul and div for instance).

Is this even meaningful, or if so, useful? I imagine that if you trash your current stack, and plan to diverge, the old stack is ipso facto gone. In other words, every inline assembly statement is of one of two types:

  • It never returns for any inputs, and as such can do anything to machine state: the current Rust thread isn't really a Rust thread anymore. It could jump back into Rust, but it would be more-or-less like a brand new thread with a brand new stack.
  • It may return, in which case upon control exiting the assembly, no registers may be out of place, so scribbling all over the stack pointer is just the same as spilling any other register and scribbling all over it.

In the context I'm thinking, I'm going to land in a crt0.S anyway. Also, maybe this case wasn't clear: the only situation in which I expect to scribble over the stack pointer is to abandon the old stack and set up a new one.

By that token, do we even want to say something like "doodling garbage all over the stack pointer, then jumping into a C-ABI function out in the rhubarb, is UB"? I feel that after a certain point, declaring this to be undefined behavior doesn't mean anything: you've jumped out of Rust, beyond any meaningful proposition of "well-formed program".

I guess I was thinking of doing something stupid with link registers, maybe this question was dumb.

Honestly, I have no idea, I know very little about unwinding because every system I have ever touched meaningfully doesn't have it. I'm just trying to cover the entire attack surface.

Trust me, I know; I used to sit near the guy who does the constant-time stuff in BoringSSL. =P I mostly thought of the asm!("") example in the context of this. My main point here was to poke at the degree to which it is meaningful or useful to promise that the compiler will not touch, peek into, or otherwise perform optimizations with knowledge about what your asm is doing. That said, specifying optimizations is usually a great way to make your language complicated (cough copy elision/RVO cough), and making meaningful promises about that is hard.

Right, this is less a question about scratch registers and register allocation and more of a question of "can I "spill" the stack register onto not-the-stack". You'll notice a lot of these questions are mostly "I want to do stupid things to the stack but I want to know that everything is ok if Rust never notices."

Ah, but what is that (like, what does that mean, formally)? I think that's where a lot of the mystery lies in wanting a "cfence".

I think the sadness here is dynamic linking. I honestly don't know that much about dynamic linking other than the one paper I read, but I think you can't just do a string-replace to set up a GOT lookup in your assembly language? Maybe I'm overthinking this. I otherwise agree that this distinction isn't the best.

Is it UB to set up an out register, not write to it, then read it on the Rust side? Is it a valid garbage value, or is it uninitialized? Rustc can't possibly know, at any rate.

I don't think that should be UB, as long as the type you pass as the out register allows any possible bit pattern. If we were to allow an out to refer to (for instance) a repr(u8) enum, then it'd be UB to supply a value that isn't a valid enum value.
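One way to stay clear of that hazard is to receive the raw output as a plain integer, for which every bit pattern is valid, and validate it before turning it into an enum. A minimal sketch, with a hypothetical `Mode` enum and `mode_from_raw` helper standing in for whatever the asm block would produce:

```rust
/// Hypothetical enum standing in for a value produced by an asm block.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Idle = 0,
    Run = 1,
    Halt = 2,
}

/// Converts a raw byte (e.g. read from an `out(reg)` operand typed as `u8`)
/// into a `Mode`, rejecting bit patterns that are not valid discriminants.
pub fn mode_from_raw(raw: u8) -> Option<Mode> {
    match raw {
        0 => Some(Mode::Idle),
        1 => Some(Mode::Run),
        2 => Some(Mode::Halt),
        _ => None,
    }
}
```

With this shape, an unwritten or garbage register value shows up as `None` instead of an invalid enum value.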

Oh, sure. I think that's kind of implied... I think what I'm getting at is "under what circumstances do values that inline assembly interacts with cease to be uninitialized."

Actually this definition is based on how GCC interprets the volatile keyword (which is basically the inverse of pure). See this example where the asm code is executed only once despite it taking the address of a global as input, having a memory clobber and modifying that global on every loop iteration.

Sure, I was just considering the worst case scenario where the backend has exactly zero support for any form of inline asm. This is currently the case for Cranelift, but it seems that its main use case at the moment is to compile faster debug builds, so performance of inline asm isn't really critical here.

Unless you're on a "fun" architecture like Itanium where a register can contain NaT (not-a-thing) that faults if you try to write it to memory. If you want to use asm! to freeze some undefined bytes, use memory instead of a register.

Just because you don't return doesn't mean you can trash any objects that are currently on the stack. There could be another thread holding a reference to one of those objects (rayon does this). But as long as you only grow the stack and don't touch any parent frames, you can indeed do whatever you want as long as you never return (and never exit the thread in a way that would free the stack without unwinding it).

This is fine, in fact I use this (AArch64) asm code to switch to another stack:

    // Switch to the new stack and execute the given function on it
    asm!("
            mov sp, ${0}
            mov x29, #0
            mov x30, #0
            br x0
        "
        :
        : "r" (initial_sp),
          "{x0}" (func)
        :
        : "volatile"
    );

The old stack is left untouched (and never freed), and the asm! never returns.

Exactly! If you are compiling a shared library, there is no way to access a symbol outside the current shared library except through the GOT. Also GCC/LLVM get rather unhappy if you try to pass an external symbol into an asm block.

Fine, but on the single-threaded boot loader I'm working in, I don't have other threads (not to speak of OS infrastructure for Rayon...), so this should be safe. I think we need to have a discussion as to when it is acceptable to trash your stack, unfortunately. =(

Well, maybe you could get away with having the surrounding assembly generate the code to poke the GOT and then replace the sym with an imm?? Also, I know some assemblers (I believe RISC-V gas will do this) silently emit GOT offsets? Like, in PIC mode,

la t0, my_got_symbol

will actually emit instructions to fuss with gp (la isn't a real RISC-V instruction, and just a spec-defined macro for auipc/addi), which would really hate receiving an immediate instead of a symbol. (I hate dynamic linking.)
