I believe the direct answer is "no", but at least it's getting discussed in UCG issues like https://github.com/rust-lang/unsafe-code-guidelines/issues/152
Feel free to give definitions, but I expect it will be a waste of time: the core of my argument does not change unless your definition is redundant because it's equivalent to "this is what executing inline asm does; anything that doesn't change program behavior¹ is a valid optimization". I am very sure that is the only sensible definition of what optimizations are allowed in almost every context, not just in the context of inline asm.
Practically, of course, compilers do not analyze the contents of inline asm, so they work based on what any inline asm string could do in that context; but the same is true of external function calls. In fact, I do not know of any case (even after an impromptu audit of all mentions of `InlineAsm` in the relevant LLVM libraries) where LLVM treats inline asm as more capable than a call to an unknown function in any respect, with the sole exception that one of the two passes that infer the `noreturn` attribute (so there's probably a bug somewhere) considers the possibility of inline asm returning from `naked` functions.
Incidentally, with the scheme for supporting inline asm under Cranelift proposed by @Amanieu, inline asm is just an external function call, so (presuming you accept that implementation strategy as correct) I don't think you can actually require Cranelift to be any more conservative around inline asm than it is around external function calls.
Ascribing such significance to function boundaries does not really work. Common code transformations such as outlining and partial inlining introduce function boundaries in places where none existed before, so if hiding inline asm behind a function removes its optimization-inhibiting effects, then those effects can't be relied on anyway and might as well not exist.
It is not defined yet (and should be), but since inline asm needs to be able to call e.g. functions written in C, realistically we'll still need to know what external function calls can do to define what inline asm can do, even if we do end up giving inline asm a little bit more power.
Unfortunately GCC is very imprecise at defining what this means, but I have never seen any evidence that it implies any more than an external function call with equivalent parameter list. Would you also call an external function a "compiler barrier"? Would you explicitly document it as such somewhere?
I don't think "compiler barrier" is a useful notion at all. Any accurate mental model of what kinds of optimizations are actually illegal (those which change observable program behavior) is too complex to be usefully discussed in terms of "barriers", and conversely any mental model that starts with "barriers" is tempted to underestimate what optimizations are legal (because "barriers" are naturally interpreted very broadly).
¹ To account for UB and non-determinism, we strictly speaking have to talk about refining the set of possible behaviors of each defined execution, or something like that, but you get the idea.
I feel this view amounts to "obviously this is going to go nowhere, so we shouldn't bother", which I don't think is a helpful view. The reason I keep bringing this up is because:
- As you've said, GCC and LLVM seem to do a spectacularly bad job of writing down what all these things mean... I don't want us to make that mistake.
- There seems to be disagreement in this thread about what a "compiler barrier" even is, never mind whether we need one. And I think you're probably right, we should just view inline assembly as a weird external function call, but we don't even know what that means.
Something I mentioned way, way upthread, which I believe is another case of this, is the classic `asm volatile("" :: "r"(val));` "hide a value from the compiler" trick. Is this just an "external function call"? Does the UCG definition for this mean what we think we want it to mean?
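For concreteness, here is a hedged sketch of that trick using the `asm!` syntax that was eventually stabilized (the `hide` helper and the `u64` type are my own illustration, not anything from the pre-RFC):

```rust
use std::arch::asm;

// Hypothetical helper: force the compiler to treat `x` as opaque, without
// emitting any actual instructions at runtime.
fn hide(mut x: u64) -> u64 {
    unsafe {
        // The operand only appears inside an assembler comment, so nothing is
        // executed, but the compiler must assume `x` was read and rewritten.
        asm!("/* {0} */", inout(reg) x, options(nostack, preserves_flags));
    }
    x
}

fn main() {
    // The value is unchanged at runtime; the compiler just can't prove that.
    assert_eq!(hide(42), 42);
}
```

Whether this counts as "just an external function call" is exactly the open question: the compiler must assume the register may have been rewritten, but unlike a real call it clobbers nothing else.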
What about stuff like processor fences? RISC-V's fences seem more granular than e.g. x86's, so if I want to use a specific RISC-V fence, is inline assembly going to allow me to describe the fencing behavior I'm asking of the hardware to the compiler? Or am I in "write the assembly in a separate function"[1] territory?
Really, I'm just worried about misspecifying ourselves into a corner...
[1] Not sure if this is even meaningful in the context of cross-LTO...
There's a fairly precise definition of the minimal expectations for compiler barriers at https://www.kernel.org/doc/Documentation/memory-barriers.txt . Search for COMPILER BARRIER, and note that the Linux kernel's `barrier()` translates to `__asm__ __volatile__("" : : : "memory")`.
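For comparison, the closest portable equivalent in Rust today is `std::sync::atomic::compiler_fence`; a minimal sketch (the `bump` function is just an illustration):

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Sketch of the kernel's barrier() in Rust terms: compiler_fence emits no
// machine instructions, but the compiler may not reorder memory accesses
// across it.
fn bump(counter: &mut u32) {
    *counter += 1;
    compiler_fence(Ordering::SeqCst); // "barrier()"
    *counter += 1;
}

fn main() {
    let mut c = 0;
    bump(&mut c);
    assert_eq!(c, 2);
}
```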
It would be really nice if proc macros weren't required to manipulate the string inputs to asm! calls. For instance, right now it isn't possible to build inline assembly sections using the concat! macro because the string expansion from it happens too late. I don't see this specifically addressed in the pre-RFC but ideally any string produced by a macro or const fn should be suitable.
Yes, we absolutely need the ability to generate assembly strings using macros, and ideally also const functions if possible. And we also need the ability to generate the whole `asm!` declaration from a macro, with inputs taken from the macro parameters. We should state both of those as requirements.
Are you certain `asm!(concat!() :::)` doesn't work? I didn't expect it to work with `format_args!`, and yet it does. IIRC the discussion around `format_args!(concat!("{", "f", "o", "o", "}"), foo=0)` said that `format_args!` and the other built-in macro look-alikes actually expand inner built-in macros (i.e. `concat!`) in order to try to expand the first argument to a string literal.
I just poked around with this a bit more. It seems that any of the macros which produce string literals work, but that nothing that deals with constants does. For instance, you can't even do something like:

```rust
const NOP: &'static str = "nop";
asm!(NOP :::); // ERROR: inline assembly must be a string literal
```
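For what it's worth, the same split survived into the `asm!` syntax that was eventually stabilized: a `concat!`-produced literal is accepted, a `const` still is not. A minimal sketch, assuming an architecture where `nop` is a valid instruction:

```rust
use std::arch::asm;

fn nop_via_concat() {
    unsafe {
        // concat! expands to a string literal early enough for asm! to see it.
        asm!(concat!("no", "p"));
    }
    // By contrast, `const NOP: &str = "nop"; asm!(NOP);` is still rejected,
    // because the template must be a literal at macro-expansion time.
}

fn main() {
    nop_via_concat();
}
```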
That document is not the greatest.

The "compiler barrier" section spends almost all its time discussing not `barrier()`, but `READ_ONCE` and `WRITE_ONCE`. These macros supposedly:

> can be thought of as weak forms of `barrier()` that affect only the specific accesses flagged by the `READ_ONCE()` or `WRITE_ONCE()`.
But this is not true. `READ_ONCE` and `WRITE_ONCE` are essentially just wrappers around volatile reads/writes, which in fact provide much stronger guarantees than compiler barriers. And the examples that follow depend on those guarantees.

I'm not sure if whoever wrote that documentation and/or code fully understands that. To be more precise, `READ_ONCE` and `WRITE_ONCE` expand to a volatile read or write, respectively, if they're called with a variable of size 1, 2, 4, or 8 – i.e. the typical sizes supported for machine loads/stores. AFAIK this is true for almost all uses of those macros in the Linux codebase. But if the variable has a different size, the macros instead perform:
```c
barrier();
__builtin_memcpy((void *)res, (const void *)p, size);
barrier();
```
This suggests that the author thought of the above code as equivalent to a `volatile` access. But it is not: unlike `volatile`, and despite the ONCE in the name, it does not prevent accesses from being split or duplicated!
(Note: `READ_ONCE` also calls `smp_read_barrier_depends()`, but that does nothing except on Alpha.)
I haven't investigated the kernel's other uses of `barrier()`, but there are a lot of them, and I suspect a substantial number are incorrect attempts to turn regular accesses preceding or following the barrier into volatile ones. They likely do work in practice, because compilers don't normally split or duplicate accesses unless doing so simplifies the generated code, and `barrier()` constrains the compiler enough to remove most simplification opportunities. But this behavior is not guaranteed.
Notwithstanding that...
Valid uses for compiler barriers
[Edit: Where by "compiler barrier" I mean the equivalent of `asm volatile("" ::: "memory")`.]
They do exist.
I claim that a compiler barrier combined with volatile accesses can perform a similar function to `atomic::fence` combined with relaxed atomic accesses. But whereas `atomic::fence` gives you specified ordering guarantees between multiple threads (depending on the ordering parameter), a compiler barrier gives you whatever the architecture guarantees based solely on program order of load/store instructions. (The architecture's guarantees are relevant because we're using volatile.)
On x86, the architecture's guarantees are at least as strong as the acquire/release ordering, so given an algorithm using acquire/release fences and relaxed atomic accesses, you could theoretically translate it to compiler barriers and volatile accesses, and the resulting program would be correct. But there's no reason to do so, since it wouldn't be any more efficient; acquire/release fences are already just a compiler barrier on x86.
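In Rust terms, the claim is that on x86 a release/acquire `fence` and a `compiler_fence` compile to exactly the same thing: zero instructions. A sketch (the `publish` function is my own illustration):

```rust
use std::sync::atomic::{compiler_fence, fence, Ordering};

// On x86, both fences below compile to zero instructions: the hardware
// already provides release ordering for ordinary stores, so fence(Release)
// only needs to restrain the compiler, just like compiler_fence(Release).
// The difference is purely in what they *promise*: fence() gives
// cross-thread ordering guarantees, compiler_fence() does not.
fn publish(data: &mut u32) {
    *data = 41;
    fence(Ordering::Release);
    compiler_fence(Ordering::Release);
    *data += 1;
}

fn main() {
    let mut d = 0;
    publish(&mut d);
    assert_eq!(d, 42);
}
```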
More interestingly, on architectures with weaker memory ordering guarantees, you still have strong memory ordering between two instructions executing on the same CPU or thread. That includes unusual pairs like:
- In userland, normal code versus a signal handler (which runs on the same thread).
- In kernels and bare-metal code, normal code versus an interrupt handler (if you know the normal code won't be migrated between CPUs).
- A pair of threads which you somehow know are running on the same CPU (e.g. by setting affinity, on systems where affinity is a requirement rather than a hint), but which are not otherwise synchronized.
In each of the above cases, you have two "thread"-ish things where one can be preempted by the other at any instruction boundary – which means that you can't rely on normal reads and writes to synchronize between the two. You could just use atomics, but since we're assuming a weakly ordered architecture, that would generate memory barrier instructions or special atomic load/store instructions, which are unnecessary in this special case. Instead you could use volatile accesses, which generate regular load/store instructions. You would then add compiler barriers as needed to prevent compiler reordering between volatile and non-volatile accesses.
Note that you don't need a compiler barrier to prevent the compiler from reordering multiple volatile accesses, as this is already forbidden. This is different from atomic fences, which can be used between a pair of atomic accesses as well as between non-atomic and atomic accesses.
A compiler barrier can also be useful when synchronizing between a CPU and a peripheral that performs DMA reads/writes. For example, if you've constructed a Ethernet packet in a buffer at some address, you might want to ask the NIC to read from that address and send the data over the network. To start the operation, you would typically write to an MMIO register using a volatile store. But before doing so, you need some sort of synchronization to ensure the data you wrote into the buffer will actually be visible to the hardware. Depending on the architecture, you might or might not need some assembly-level synchronization operation (e.g. flushing the data cache). But if none is required, then assuming you wrote data into the buffer using non-volatile stores, you still need a compiler barrier to prevent the compiler from moving those stores after the volatile store that kicks off the DMA.
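The DMA kick-off pattern might look like the following sketch; the `doorbell` pointer stands in for a real memory-mapped NIC register, and `send_packet` is a hypothetical name:

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

// Sketch of kicking off a DMA send: fill a buffer with ordinary stores,
// then ring the device's doorbell with a volatile store.
fn send_packet(buffer: &mut [u8; 4], doorbell: *mut u32) {
    // Ordinary (non-volatile) stores into the packet buffer.
    buffer.copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);

    // Prevent the compiler from sinking the buffer stores past the
    // volatile store that tells the device to start reading the buffer.
    compiler_fence(Ordering::Release);

    unsafe {
        // Volatile MMIO-style write: "start the DMA transfer".
        ptr::write_volatile(doorbell, 1);
    }
}

fn main() {
    let mut buf = [0u8; 4];
    let mut doorbell: u32 = 0;
    send_packet(&mut buf, &mut doorbell);
    assert_eq!(doorbell, 1);
    assert_eq!(buf, [0xde, 0xad, 0xbe, 0xef]);
}
```

On a real architecture you might additionally need a cache flush or hardware fence; the `compiler_fence` here only covers the case where no assembly-level synchronization is required.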
Important points:

- Yes, as @hanna-kruppe noted, a compiler barrier cannot provide more guarantees than an external function call. In particular, it can't synchronize memory that's "private" to a function further down the call stack (e.g. stack variables which don't have their address taken, or in Rust anything that's mutably borrowed). But we don't need that. We just need the guarantees of an external function call, without the overhead of actually performing a call.

- To reiterate, compiler barriers do not give you permission to perform unsynchronized non-volatile loads and stores. They only make sense if there is a volatile load or store on one side of the barrier. (Or an `asm` block, but in that case you could just mark that block with the "accesses memory" constraint rather than using an empty one.)
On another note...
I believe that it should guarantee that. There is not much potential benefit from the compiler analyzing asm blocks, and significant benefit from being able to assume that it won't. (Even if it is very easy to overestimate how much that actually restricts the compiler.)
And to be clear, D and MSVC don't use the semantic meaning of instructions for optimization purposes, do they? They only know enough to provide a more sugary syntax for asm blocks.
I think there's value in the compiler attempting to analyze `asm!` blocks for the purposes of lints (e.g. "you missed a clobber"), but only for the purposes of lints. We should absolutely guarantee that the compiler will never "optimize" the inside of an `asm!` block, or second-guess clobbers, or similar. Doing so would break horribly when a developer does something the compiler doesn't understand but thinks it does.
Assuming additional clobbers can only hurt performance, not correctness, right? Or did you mean removing them?
Removing clobbers would cause much more obvious correctness problems. But adding clobbers could potentially break at compile time in the backend at least, such as if they cause the backend to run out of registers.
@comex The only optimizations that I have in mind are the ones that allow the removal of assembly blocks, e.g. when they are empty, or when they are `pure` and have no outputs, etc. The proposal mentions making some of these an error, which would end up achieving the same effect, so I don't really mind much about this anyway. That is, I'd be fine with an RFC that guarantees that the compiler does not "optimize" based on the content of the assembly "string" (although I'm not sure how to word this in language-spec speak), maybe mentioning that it is an error to have a `pure` block without outputs, or a block with an empty string.
I think a few people on this thread, myself included, believe it would be useful to have empty assembly blocks still have meaning (like the classic "hide this local variable's value from the compiler" trick)... so maybe this isn't an optimization you want for non-`pure` blocks.
Agreed. Optimizing away `pure` blocks (or for that matter duplicating them) would be perfectly fine, though.
Should we put 'inout' etc. after the register name? We typically write variant types after their names in Rust code.
I think of in/out/inout more like `const` or `mut`, which come before the name. They indicate whether the variable gets read, written, or both.
I think roughly the standardsese you want here is "the contents of the string literal must be provided to the underlying platform verbatim, and, moreover, upon control reaching the assembly block, the underlying platform must be instructed to execute that verbatim string". (For `pure` you get to add "as if" in a couple of places.)

For an example of how not to write this kind of standardsese, look at ISO C's definition of `volatile` type qualification.
On a completely separate note: is there any reason to keep `#[naked]` (insofar as it is still a nightly feature)? I discovered today when trying to write some intrinsics in C (don't ask) that naked functions can't be inlined (which comes as a surprise to no one).

Before, I was kind of ambivalent towards naked functions, but now I worry that this behavior is a bit subtle (I didn't realize it until I godbolt'd some things), and I still don't see what naked functions get you that you can't already express.
(This comment might be out-of-scope but I think naked functions are enough of a companion feature to inline assembly that it's worth mentioning).