Hello and a happy new year to everyone,
as some of you may be aware I gave a summary talk on inline assembly at the Rust Cologne Meetup in June 2017 (recording, slides). One reason for that was getting information to the Rust community to start a proper discussion on this (which I mostly failed to do, due to being preoccupied). The other reason was getting myself motivated to actually do the research, so I could come up with an RFC.
So this is a first draft of that RFC. It proposes an inline assembly syntax somewhat similar to what is available in gcc and clang, but in my opinion more readable and easier to remember.
Feedback and suggestions are very welcome.
Summary
Define a stable syntax for inline assembly, meant to be portable among various backends and architectures.
Motivation
In systems programming some tasks require dropping down to the assembly level. The primary reasons are for performance, precise timing, and low level hardware access. Using inline assembly for this is sometimes convenient, and sometimes necessary to avoid function call overhead.
The inline assembler syntax currently available in nightly Rust is very ad-hoc. It provides a thin wrapper over the inline assembly syntax available in LLVM IR. For stabilization a more user-friendly syntax that lends itself to implementation across various backends is preferable.
Guide-level explanation
Rust provides support for inline assembly via the asm! macro. It can be used to embed handwritten assembly in the assembly output generated by the compiler. Generally this should not be necessary, but might be where the required performance or timing cannot be otherwise achieved. Accessing low level hardware primitives, e.g. in kernel code, may also demand this functionality.
Let us start with the simplest possible example:
unsafe {
    asm!("nop");
}
This will insert a NOP (no operation) instruction into the assembly generated by the compiler. Note that all asm! invocations have to be inside an unsafe block, as they could insert arbitrary instructions and break various invariants. The instructions to be inserted are listed in the first argument of the asm! macro as a string literal.
Now inserting an instruction that does nothing is rather boring. Let us do something that actually acts on data:
let x: u32;
unsafe {
    asm!("movl $5, {}", out(reg) x);
}
This will write the value 5 into the u32 variable x.
You can see that the string literal we use to specify instructions is actually a template string. It is governed by the same rules as Rust format strings. The arguments that are inserted into the template, however, look a bit different than you may be familiar with. First we need to specify if the variable is an input or an output of the inline assembly. In this case it is an output. We declared this by writing out. We also need to specify in what kind of location the assembly expects the variable. This is called a constraint specification. In this case we put it in an arbitrary general purpose register by specifying reg. We could also have said mem, telling the compiler the assembly expects a memory location for this argument. The compiler will choose an appropriate register or memory location to insert into the template, and read the variable from there after the inline assembly.
Let us see another example that also uses an input:
let i: u32 = 3;
let o: u32;
unsafe {
    asm!("
        movl {0}, {1};
        addl {number}, {1};
    ", in(reg) i, out(reg) o, number = in(imm) 5);
}
This will add 5 to the input in variable i and write the result to variable o. The particular way this assembly does this is first copying the value from i to the output, and then adding 5 to it.
The example shows a few things:
First, we can see that inputs are declared by writing in instead of out.
Second, one of our input operands has a constraint specification we haven’t seen yet, imm. This tells the compiler to expand this argument to an immediate inside the assembly template. This is only possible for constants and literals.
Third we can see that we can specify an argument number, or name as in any format string. For inline assembly templates this is particularly useful as arguments are often used more than once. For more complex inline assembly using this facility is generally recommended, as it improves readability, and allows reordering instructions without changing the argument order.
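To make the template rules concrete, here is a small sketch that expands the template from the example with the ordinary format! macro. The register names passed in are hypothetical stand-ins for whatever the compiler would allocate, and the helper expand_template is purely illustrative and not part of the proposal.

```rust
/// Expands the assembly template from the example the way a format string
/// is expanded. `src` and `dst` are hypothetical register names standing in
/// for the compiler's allocation; `imm` is the already-rendered immediate.
fn expand_template(src: &str, dst: &str, imm: &str) -> String {
    // {0} and {1} are positional arguments, {number} is a named argument.
    // {1} is used twice, which is why indices or names are handy here.
    format!("movl {0}, {1}; addl {number}, {1};", src, dst, number = imm)
}

fn main() {
    println!("{}", expand_template("%eax", "%ecx", "$5"));
}
```

Reordering the two instructions in the template would not require touching the argument list, which is the readability benefit described above.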
In some cases we need an argument to be both an input and an output:
let mut bytes: u32 = 0x01_02_03_04;
unsafe {
    asm!("bswap {}", inout(reg) bytes);
}
assert_eq!(bytes, 0x04_03_02_01);
This example uses the bswap instruction to swap the byte order of the bytes variable. We can see that inout is used to specify an argument that is both input and output. This is different from specifying an input and output separately in that it is guaranteed to assign both to the same register or memory location.
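For comparison, the effect of the bswap example can be checked against a pure Rust reference. This sketch uses the standard library method u32::swap_bytes; the function name bswap_reference is made up for illustration.

```rust
/// Pure Rust reference for what the `bswap` example computes:
/// the four bytes of the value in reversed order.
fn bswap_reference(value: u32) -> u32 {
    value.swap_bytes()
}

fn main() {
    let mut bytes: u32 = 0x01_02_03_04;
    // Like an `inout` operand, the same variable is read and then overwritten.
    bytes = bswap_reference(bytes);
    assert_eq!(bytes, 0x04_03_02_01);
    println!("{:#010x}", bytes);
}
```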
The Rust compiler is conservative with its allocation of operands. It is assumed that an out can be written at any time, and can therefore not share its location with any other argument. However, to guarantee optimal performance it is important to use as few registers as possible, so they won’t have to be saved and reloaded around the inline assembly block. To achieve this Rust provides a lateout specifier. This can be used on any output that is guaranteed to be written only after all inputs have been consumed. There is also an inlateout variant of this specifier.
Some instructions require that the operands be in a specific register. Therefore, Rust inline assembly provides some more specific constraint specifiers. While reg, mem, and imm will be available on any architecture, these are highly architecture specific. Usually a specifier for each register class and each register will be provided. E.g. for x86 the general purpose registers eax, ebx, ecx, edx, esp, ebp, esi, and edi, among others, can be addressed by their name.
unsafe {
    asm!("out {}, $0x64", in(eax) cmd);
}
In this example we call the out instruction to output the content of the cmd variable to port 0x64. Since the out instruction only accepts eax (and its sub registers) as operand, we had to use the eax constraint specifier.
It is somewhat common that instructions have operands that are not explicitly listed in the assembly (template). Hence, unlike in regular formatting macros, we support excess arguments:
fn mul(a: u32, b: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        asm!("mul {}", in(reg) a, in(eax) b, lateout(eax) lo, lateout(edx) hi);
    }
    // Parenthesized: `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}
This uses the mul instruction to multiply two 32-bit inputs with a 64-bit result. The only explicit operand is a register, that we fill from the variable a. The second implicit operand is the eax register which we fill from the variable b. The lower 32 bits of the result are stored in eax from which we fill the variable lo. The higher 32 bits are stored in edx from which we fill the variable hi.
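As a sanity check of the hi/lo split, the same computation can be sketched in plain Rust without any assembly. The helper names mul_reference and combine are made up for this illustration; the sketch only models what ends up in edx and eax, not how the instruction is emitted.

```rust
/// Pure Rust model of the 32x32 -> 64 bit `mul`: returns (hi, lo),
/// i.e. what the example reads back from edx and eax.
fn mul_reference(a: u32, b: u32) -> (u32, u32) {
    let wide = a as u64 * b as u64;
    ((wide >> 32) as u32, wide as u32)
}

/// Recombines the two halves. The parentheses matter: `+` and `|` would
/// otherwise be applied in a different order than intended around `<<`.
fn combine(hi: u32, lo: u32) -> u64 {
    ((hi as u64) << 32) | lo as u64
}

fn main() {
    let (hi, lo) = mul_reference(0xffff_ffff, 0xffff_ffff);
    println!("hi = {:#x}, lo = {:#x}, product = {:#x}", hi, lo, combine(hi, lo));
}
```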
In many cases inline assembly will modify state that is not given as output. Usually this is either because we have to use a scratch register in the assembly, or instructions modify state that we don’t need to further examine. This state is generally referred to as being “clobbered”. We need to tell the compiler about this since it may need to save and restore this state around the inline assembly block.
let ebx: u32;
let ecx: u32;
unsafe {
    asm!("
        movl $4, %eax;
        xorl %ecx, %ecx;
        cpuid;
    ", out(ebx) ebx, out(ecx) ecx, clobber(eax, edx));
}
println!(
    "L1 Cache: {}",
    ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1) * ((ebx & 0xfff) + 1) * (ecx + 1)
);
We specify the clobbered state via a clobber argument following all inputs and outputs. In the example above we use the cpuid instruction to get the L1 cache size. This instruction writes to eax, ebx, ecx, and edx, but for the cache size we only care about the contents of ebx and ecx. Hence, we declare those as outputs, while declaring the other registers as clobbers.
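The arithmetic in the println! call is easier to follow with the bit fields named. The following is a sketch of the decoding only; the function name l1_cache_size is made up, and the register values in main are hypothetical sample values (an 8-way cache with 64-byte lines and 64 sets), not output captured from a real CPU.

```rust
/// Decodes the cache size from the ebx/ecx values returned by cpuid leaf 4.
/// Each field stores "value minus one", hence the `+ 1` on every term.
fn l1_cache_size(ebx: u32, ecx: u32) -> u32 {
    let ways = (ebx >> 22) + 1;                 // ebx bits 31..22
    let partitions = ((ebx >> 12) & 0x3ff) + 1; // ebx bits 21..12
    let line_size = (ebx & 0xfff) + 1;          // ebx bits 11..0
    let sets = ecx + 1;                         // all of ecx
    ways * partitions * line_size * sets
}

fn main() {
    // Hypothetical register contents: ways-1 = 7, partitions-1 = 0,
    // line_size-1 = 63, sets-1 = 63.
    let ebx: u32 = (7 << 22) | 63;
    let ecx: u32 = 63;
    println!("L1 Cache: {}", l1_cache_size(ebx, ecx));
}
```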
Clobber specifications are generally architecture specific. The only clobber specification that is always available is mem, meaning memory that is not specified as output is being written. Other than that all architecture registers are usually available by name.
When we said earlier that the asm!("nop") statement would insert a nop instruction, that was actually not the whole truth. Rust’s asm! macro is designed to allow optimization. This is another reason inputs and outputs need to be known to the compiler. If outputs of the inline assembly block are never read, or there are no outputs, the inline assembly block may be optimized away. Also, if inputs don’t change across multiple invocations of an inline assembly block, the compiler may assume it always yields the same result, only executing it once.
In some cases this may not be what we want. For example we may want to clear the interrupt flag on an x86 system:
unsafe {
    asm!("cli", flags(volatile));
}
As you can see in the example we do this using the cli instruction. However, this instruction has no output. We only run it for the side-effect. To avoid deletion of this inline assembly block by the optimizer we specify the volatile flag. Flags can be provided as an optional final argument to the asm! macro. For now the only generally available flag is volatile, which enforces that the inline assembly block is always executed. However, there may be other architecture specific flags. E.g. on x86 the intelsyntax flag is provided to switch from AT&T to Intel assembly syntax.
Reference-level explanation
Inline assembler is implemented as a macro asm!(). The first argument to this macro is a template used to build the final assembly. The following arguments specify input and output operands. When required, clobbers and flags are specified as the final two arguments.
The assembler template uses the same syntax as format strings. I.e. placeholders are specified by curly braces. The corresponding arguments are accessed in order, by index, or by name. Future revisions may also use the format_spec to specify what LLVM calls template argument modifiers. However, this initial proposal elides this, as it is not necessary for inline assembly to be useful.
The following ABNF specifies the general syntax:
dir_spec := "in" / "out" / "lateout" / "inout" / "inlateout"
constraint_spec := "reg" / "mem" / "imm" / <arch specific>
operand := [ident "="] dir_spec "(" constraint_spec ")" expr
clobber_spec := "mem" / <arch specific>
clobber := "clobber(" clobber_spec *("," clobber_spec) ")"
flag := "volatile" / <arch specific>
flags := "flags(" flag *("," flag) ")"
asm := "asm!(" format_string *("," operand) ["," clobber] ["," flags] ")"
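To illustrate the operand rule of the grammar, here is a rough string-level sketch of how one operand could be split into its parts. A real implementation would work on macro tokens rather than strings; the Operand struct and parse_operand function are hypothetical names introduced only for this example, and the sketch does not validate constraint names or expressions.

```rust
/// The pieces of one `operand` production:
/// [ident "="] dir_spec "(" constraint_spec ")" expr
#[derive(Debug, PartialEq)]
struct Operand {
    name: Option<String>,
    dir: String,
    constraint: String,
    expr: String,
}

/// Splits a single operand given as text. Returns None if the direction
/// is not one of the five dir_spec keywords or a part is missing.
fn parse_operand(input: &str) -> Option<Operand> {
    const DIRS: [&str; 5] = ["in", "out", "lateout", "inout", "inlateout"];
    let open = input.find('(')?;
    let close = open + input[open..].find(')')?;
    // Everything before "(" is either `dir` or `name = dir`.
    let head = input[..open].trim();
    let (name, dir) = match head.split_once('=') {
        Some((name, dir)) => (Some(name.trim().to_string()), dir.trim()),
        None => (None, head),
    };
    let constraint = input[open + 1..close].trim();
    let expr = input[close + 1..].trim();
    let name_is_empty = name.as_deref() == Some("");
    if !DIRS.contains(&dir) || constraint.is_empty() || expr.is_empty() || name_is_empty {
        return None;
    }
    Some(Operand {
        name,
        dir: dir.to_string(),
        constraint: constraint.to_string(),
        expr: expr.to_string(),
    })
}

fn main() {
    println!("{:?}", parse_operand("number = in(imm) 5"));
    println!("{:?}", parse_operand("lateout(eax) lo"));
}
```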
Direction specification
The direction specification indicates in what way the operand is being used by the generated assembly.
Five kinds of operands are supported:
- in
  - input operand
  - may be read at any time
  - may not be written
- out
  - output operand
  - may not be read
  - may be written at any time
- lateout
  - output operand
  - may not be read
  - may only be written after all inputs were consumed
- inout
  - input and output operand
  - may be read at any time
  - may be written at any time
- inlateout
  - input and output operand
  - may be read at any time
  - may only be written after all inputs were consumed
The expr given with an output must resolve to a mutable or uninitialized location.
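The five directions can be summarized by three yes/no properties. The following sketch models them as a Rust enum; the type Dir and its method names are invented for illustration and do not correspond to anything in the compiler.

```rust
/// The five direction specifications from the list above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Dir {
    In,
    Out,
    LateOut,
    InOut,
    InLateOut,
}

impl Dir {
    /// The assembly may read the operand.
    fn may_be_read(self) -> bool {
        matches!(self, Dir::In | Dir::InOut | Dir::InLateOut)
    }

    /// The assembly may write the operand.
    fn may_be_written(self) -> bool {
        !matches!(self, Dir::In)
    }

    /// The operand is only written after all inputs were consumed.
    fn written_late(self) -> bool {
        matches!(self, Dir::LateOut | Dir::InLateOut)
    }
}

fn main() {
    for dir in [Dir::In, Dir::Out, Dir::LateOut, Dir::InOut, Dir::InLateOut] {
        println!(
            "{:?}: read={} write={} late={}",
            dir,
            dir.may_be_read(),
            dir.may_be_written(),
            dir.written_late()
        );
    }
}
```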
Constraint specification
The constraint specification indicates which kind of operand is required by the assembly template in the operand’s position.
Across platforms three constraint specifications are supported:
- reg: the operand is placed in a general purpose register
- mem: the operand is placed in a memory location
- imm: the operand is an immediate
All other constraint specifications are defined per architecture. It is suggested that one exist for at least each physical register and register class (e.g. floating point register, 128-bit vector register). Names should be descriptive rather than single letter acronyms. I.e. prefer for example float over f and xmm_vector over x.
Clobber specification
The clobber specification is used to indicate what state is being modified apart from the outputs. The mem clobber specification is always available. It indicates that arbitrary memory is being modified.
All other clobber specifications are defined per architecture. It is suggested that one exist for at least each physical register.
Flags
Flags are used to further influence the behaviour of the inline assembly block.
The only flag defined at this point in time is volatile. The volatile flag indicates that the inline assembly block may have side-effects not indicated by inputs, outputs, or clobber (i.e. may not be optimized away).
Other flags can be defined per architecture. An intelsyntax flag for the x86 architecture should be provided.
Mapping to LLVM IR
The direction specification maps to an LLVM constraint specification as follows (using a register operand as an example):
- in(reg) => r
- out(reg) => =&r (Rust’s outputs are early-clobber outputs in LLVM/GCC terminology)
- inout(reg) => =&r,0 (an early-clobber output with an input tied to it, 0 here is a placeholder for the position of the output)
- lateout(reg) => =r (Rust’s late outputs are regular outputs in LLVM/GCC terminology)
- inlateout(reg) => =r,0 (cf. inout and lateout)
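The table above can be written down as a small function. This is only a sketch of the string mapping as listed, not how a backend would actually be structured; Dir, llvm_constraint, and the tied parameter (the position of the output an input is tied to) are names chosen for this example.

```rust
/// The five direction specifications.
#[derive(Clone, Copy, Debug)]
enum Dir {
    In,
    Out,
    LateOut,
    InOut,
    InLateOut,
}

/// Builds the LLVM constraint string for one operand.
/// `code` is the LLVM constraint code (e.g. "r"), `tied` is the position
/// of the output that the input half of an inout operand is tied to.
fn llvm_constraint(dir: Dir, code: &str, tied: usize) -> String {
    match dir {
        Dir::In => code.to_string(),
        // Rust outputs are early-clobber outputs: "=&".
        Dir::Out => format!("=&{}", code),
        Dir::InOut => format!("=&{},{}", code, tied),
        // Late outputs are regular LLVM outputs: "=".
        Dir::LateOut => format!("={}", code),
        Dir::InLateOut => format!("={},{}", code, tied),
    }
}

fn main() {
    for dir in [Dir::In, Dir::Out, Dir::InOut, Dir::LateOut, Dir::InLateOut] {
        println!("{:?}(reg) => {}", dir, llvm_constraint(dir, "r", 0));
    }
}
```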
As written this RFC requires architectures to map from Rust constraint specifications to LLVM constraint codes. This is in part for better readability on Rust’s side and in part for independence of the backend:
- reg is mapped to r
- mem is mapped to m
- a register name r1 is mapped to {r1}
- additionally mappings for register classes are added as appropriate (cf. llvm-constraint)
For clobber specifications the following mappings apply:
- mem is mapped to ~{memory}
- a register name r1 is mapped to ~{r1}
(cf. llvm-clobber)
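The two mappings can be sketched the same way. The function names are invented for this example. Two simplifications are assumptions of the sketch: every specification other than reg, mem, and imm is treated as a register name (register classes are ignored), and imm returns None because no mapping for it is listed above.

```rust
/// Maps a Rust constraint specification to an LLVM constraint code.
/// Assumption of this sketch: anything that is not reg/mem/imm is a
/// register name; register classes are not modelled.
fn llvm_constraint_code(spec: &str) -> Option<String> {
    match spec {
        "reg" => Some("r".to_string()),
        "mem" => Some("m".to_string()),
        // No mapping for imm is given in the lists above.
        "imm" => None,
        // `{{` and `}}` are escaped braces: eax becomes {eax}.
        reg_name => Some(format!("{{{}}}", reg_name)),
    }
}

/// Maps a Rust clobber specification to an LLVM clobber constraint.
fn llvm_clobber(spec: &str) -> String {
    match spec {
        "mem" => "~{memory}".to_string(),
        reg_name => format!("~{{{}}}", reg_name),
    }
}

fn main() {
    println!("{:?}", llvm_constraint_code("eax"));
    println!("{}", llvm_clobber("edx"));
}
```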
The volatile flag is mapped to adding the sideeffect keyword to the LLVM asm statement.
The intelsyntax flag is mapped to adding the inteldialect keyword to the LLVM asm statement.
Drawbacks
Unfamiliarity
This RFC proposes a completely new inline assembly format. It is not possible to just copy examples of gcc-style inline assembly and re-use them. There is however a fairly trivial mapping between the gcc-style and this format that could be documented to alleviate this.
The clobber example above would look like this in gcc-style inline assembly:
int ebx, ecx;
asm (
"mov $4, %%eax;"
"xor %%ecx, %%ecx;"
"cpuid;"
"mov %%ebx, %0;"
: "=r"(ebx), "=c"(ecx) // outputs
: // inputs
: "eax", "ebx", "edx" // clobbers
);
printf("L1 Cache: %i\n", ((ebx >> 22) + 1)
* (((ebx >> 12) & 0x3ff) + 1)
* ((ebx & 0xfff) + 1)
* (ecx + 1));
Rationale and alternatives
Implement an embedded DSL
Both MSVC and D provide what is best described as an embedded DSL for inline assembly. It is generally close to the system assembler’s syntax, but augmented with the ability to directly access variables that are in scope.
// This is D code
int ebx, ecx;
asm {
mov EAX, 4;
xor ECX, ECX;
cpuid;
mov ebx, EBX;
mov ecx, ECX;
}
writefln("L1 Cache: %s",
((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1)
* ((ebx & 0xfff) + 1) * (ecx + 1));
// This is MSVC C++
int ebx_v, ecx_v;
__asm {
mov eax, 4
xor ecx, ecx
cpuid
mov ebx_v, ebx
mov ecx_v, ecx
}
std::cout << "L1 Cache: "
          << ((ebx_v >> 22) + 1) * (((ebx_v >> 12) & 0x3ff) + 1)
             * ((ebx_v & 0xfff) + 1) * (ecx_v + 1)
          << '\n';
While this is very convenient on the user side in that it requires no specification of inputs, outputs, or clobbers, it puts a major burden on the implementation. The DSL needs to be implemented for each supported architecture, and full knowledge of the side-effect of every instruction is required.
This huge implementation overhead is likely one of the reasons MSVC only provides this capability for x86, while D at least provides it for x86 and x86_64. It should also be noted that the D reference implementation falls slightly short of supporting arbitrary assembly. E.g. the lack of access to the RIP register makes certain techniques for writing position independent code impossible. As a stop-gap the LDC implementation of D provides a llvmasm feature that binds it closely to LLVM IR’s inline assembly.
The author believes it would be unfortunate to put Rust into a similar situation, making certain architectures second-class citizens with respect to inline assembly.
Provide intrinsics for each instruction
In discussions it is often postulated that providing intrinsics is a better solution to the problems at hand. However, particularly where precise timing and full control over the number of generated instructions are required, intrinsics fall short.
Intrinsics are of course still useful and have their place for inserting specific instructions. E.g. making sure a loop uses vector instructions, rather than relying on auto-vectorization.
However, inline assembly is specifically designed for cases where more control is required. Also providing an intrinsic for every (potentially obscure) instruction that is needed e.g. during early system boot in kernel code is unlikely to scale.
Make the asm! macro return outputs
It has been suggested that the asm! macro could return its outputs like the LLVM statement does. The benefit is that it is clearer to see that variables are being modified. Particularly in the case of initialization it becomes more obvious what is happening. On the other hand, by necessity this splits the direction and constraint specification from the variable name, which makes this syntax overall harder to read.
fn mul(a: u32, b: u32) -> u64 {
    let (lo, hi) = unsafe {
        asm!("mul {}", in(reg) a, in(eax) b, lateout(eax), lateout(edx))
    };
    // Parenthesized: `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}
Unresolved questions
Clobbers
What actually can/has to be clobbered is somewhat unclear. The LLVM IR documentation claims that only explicit register constraints and ~{memory} are supported. Yet clang generates IR that has additional constraints. E.g. it will forward a cc (condition code) clobber from C inline assembly.
Flags
Is volatile or sideeffect a better flag name? LLVM internally uses sideeffect, which seems to describe the semantics more accurately. However, volatile is the more familiar name.