[Roadmap 2017] Needs of no-std / embedded developers

Access to volatile memory is trivial

I disagree. Accessing MMIO registers can cause side effects. And I don't think volatile operations should be "transparent" and look like normal memory accesses. For example:

REGISTER |= 1;
REGISTER |= 1 << 1;
REGISTER |= 1 << 2;

and

let mut t = REGISTER;
t |= 1;
t |= 1 << 1;
t |= 1 << 2;
REGISTER = t;

Are these equivalent? You can't tell just by looking at this code! If REGISTER is a "volatile chunk of memory" (as per your model) then the first snippet does 3 RMW operations and the latter only does one; more importantly, these snippets will likely have different semantics, because each write to REGISTER could cause a side effect (e.g. toggling one or more of N LEDs).

But if instead you saw this:

let mut t = REGISTER.read();
t |= 1;
t |= 1 << 1;
t |= 1 << 2;
REGISTER.write(t);

You'd wonder "Why isn't this using direct assignments?". Then you'd go to the documentation and learn about volatile operations, MMIO and the side effects they could cause. That's how good Rust code should be: beginner friendly.

Overall, :-1: from me to making volatile operations transparent. That hides side effects, which are important to note when reading code. Making code shorter to write at the expense of readability goes against the unwritten rules of writing good Rust code.


A load that is merely atomic won’t actually get to main memory (in this case, your memory mapped registers) until it’s bumped out of the level 3 cache. Clearly, you don’t want to delay turning an LED on like that.

I think that expecting OOM to panic is okay IFF libcore includes an unwinding library that:

  • has ZERO OS dependencies and can be used in kernel mode
  • does not perform any allocations itself
  • is very small

What would be ideal is some way to mark memory as volatile

Also, this already exists today in the form of VolatileCell<T> (or its variants); all reads/writes through this wrapper will lower to read_volatile/write_volatile operations.

If the argument is "library authors may forget to expose chunks of memory as VolatileCell<T> and instead expose them as plain static variables" then you can replace VolatileCell<T> with #[volatile] and the problem will persist. (I think the solution to this problem is encouraging the use of code generators like svd2rust that provide an API that uses volatile operations under the hood)
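For reference, a minimal sketch of what such a wrapper boils down to (the real VolatileCell lives in the vcell / volatile-register crates; the names here are illustrative):

use core::cell::UnsafeCell;
use core::ptr;

// Every access goes through read_volatile/write_volatile, so the
// compiler may not elide, merge or reorder these accesses relative
// to other volatile accesses.
pub struct VolatileCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> VolatileCell<T> {
    pub fn read(&self) -> T {
        unsafe { ptr::read_volatile(self.value.get()) }
    }

    pub fn write(&self, value: T) {
        unsafe { ptr::write_volatile(self.value.get(), value) }
    }
}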

Ok @japaric, you got me. I have to agree. Touché!


Ok, fair point, I assumed they would be fine. My point was, though, that the use of volatile is not for when the memory-mapped value changes in time, but for other reasons: the access has side effects or, as you point out, you need to ensure a specifically sized/typed access.

Personally, I think volatile is a crappy, overloaded term from C that Rust should get rid of, replacing it with separate, distinct concepts like:

  1. access with side effects
  2. access that must not be merged or shrunk

and others where needed, even if for now they are lowered to volatile in LLVM under the hood. This would make it possible for the compiler to do dead load/store elimination on memory-mapped accesses that use only 2. The only way to get 2 currently is inline assembly, which rustc could easily lower to on architectures where that has been implemented.

How does volatile change that? Isn't that just a matter of programming the memory region attributes correctly?

My point was, though, that the use of volatile is not for when the memory-mapped value changes in time

I may still be missing your point but volatile is also used when MMIO registers "change in time" (actually, "are changed by hardware" would be more accurate). For instance:

// busy wait for some condition (read_volatile is core::ptr::read_volatile)
while unsafe { read_volatile(0xdeadbeef as *const u32) } & 1 != 1 {}

Without read_volatile, that while loop could either be optimized away or turned into an infinite loop {}.

This would make it possible for the compiler to do dead load/store elimination on memory mapped accesses using 2

Do you have a real-world-ish example where dead load/store elimination would make sense, i.e. wouldn't change the semantics of the program? Of course, in the context of MMIO registers.

Yes, just like it would as if it was normal memory being written by another thread. I think it should be treated in the same way as that, albeit with "2." mentioned above if required by the hardware.

In the snippet below, you would like GPIO_PORT_A to be loaded only once when optimised. You wouldn't get that if the loads were volatile.

const GPIO_PORT_A: *const u32 = 0xdeadbeef as *const u32;

fn get_button_1_state() -> bool {
    unsafe {*GPIO_PORT_A & 0x1 != 0}
}

fn get_button_2_state() -> bool {
    unsafe {*GPIO_PORT_A & 0x8 != 0}
}

fn main() {
    let button_1 = get_button_1_state();
    let button_2 = get_button_2_state();
    unsafe {use_button_states(button_1, button_2)};
}

extern {
    fn use_button_states(a: bool, b: bool);
}

EDIT: changed the example because you specifically asked for MMIO

Yes, just like it would as if it was normal memory being written by another thread.

Sorry, I don't really understand what this comment is referring to.

albeit with "2"

Those two "distinc concepts" you mentioned. I feel that the compiler will respond to them, in both cases, by not reordering or merging memory operations (just like it does with volatile operations). Do you see in advantage in e.g. having two different attributes to indicate each one? IOW, should the compiler behave differently if it sees one situation or the other?

In the snippet below ...

IMO, this code feels brittle. For instance, if you add a delay like this:

    let button_1 = get_button_1_state();
    delay_ms(1000);
    let button_2 = get_button_2_state();

Then the behavior of the program will be up to the compiler, not the programmer. If compiled in debug mode, you get two reads of PORTA. If compiled in release mode, the compiler merges the reads and could read the state of the port either before the delay or after it; the behavior would be different in each case.

IMO, if you want to make sure the read operations "are merged", then call read_volatile once and pass the returned value to the get_button_*_state functions. That conveys the programmer's intention better.
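For instance (a sketch reusing GPIO_PORT_A and use_button_states from the snippet above):

fn main() {
    // One volatile read of the whole port; both button states are
    // derived from the same snapshot, so there are no loads to merge.
    let port = unsafe { core::ptr::read_volatile(GPIO_PORT_A) };
    let button_1 = port & 0x1 != 0;
    let button_2 = port & 0x8 != 0;
    unsafe { use_button_states(button_1, button_2) };
}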

No problem, I will try to explain better. Consider your case more generally:

// busy wait for some condition
while read_xxx(0xdeadbeef as *const u32) & 1 != 1 {}
// now we can do the thing we wanted to do after the condition
do_something();

The following is true regardless of whether 0xdeadbeef is the address of some memory mapped IO or the address of some variable written to by another thread.

  • If read_xxx is just a plain load, the compiler can assume 0xdeadbeef doesn't change, so it can read it once and then either go into an infinite loop or carry on, depending on what it read. Clearly not what we want; let's try to force the compiler to keep reading.
  • If read_xxx is a volatile load, then the compiler has to perform every load written in the code, so the wait line does what we want. But do_something doesn't depend on the value read by read_xxx, so, assuming the compiler can see there are no barriers in do_something, it is free to move do_something before the busy wait. Clearly not what we want either; let's force the compiler to keep do_something after.
  • If we insert a compiler barrier, such as an opaque function call, between the busy wait and do_something, there is still nothing stopping the processor from executing do_something before the busy wait! Imagine the case where 0xdeadbeef is the memory-mapped flag a DMA uses to signal that it has finished writing some memory, and do_something reads that memory. On a weakly ordered processor, it could read that memory before it reads 0xdeadbeef, even if the machine code had them in the correct order. Clearly not what we want either.
  • What read_xxx needs to be is an atomic load-acquire. This tells the compiler and the processor not to do any memory accesses in do_something until 0xdeadbeef has been read as set.
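For illustration, a sketch of that last bullet in Rust. The flag address is made up, AtomicU32 is still unstable (the idea is the same with AtomicUsize on a 32-bit target), and, per the caveat raised earlier in the thread, a plain atomic load is not volatile, so this only shows the ordering half of the story:

use core::sync::atomic::{AtomicU32, Ordering};

// Illustrative address of the DMA "done" flag.
const DMA_DONE: *const AtomicU32 = 0xdead_beef as *const AtomicU32;

unsafe fn wait_for_dma() {
    // Acquire load: neither the compiler nor the processor may move
    // the memory accesses that follow this loop to before it.
    while (*DMA_DONE).load(Ordering::Acquire) & 1 != 1 {}
    // Reads of the DMA-written buffer are now ordered after the flag read.
}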
As for the two concepts: (1) means "don't reorder or change the number of accesses"; (2) means "don't merge" (do exactly the accesses specified, but the compiler is free to reorder them and remove dead ones). These should be orthogonal IMO. They are both mentioned separately as requirements in the LLVM Language Reference Manual.

It couldn't eliminate one of them, because delay_ms should contain a barrier (or at least the compiler should not know that it doesn't). If it didn't, then the compiler could put both get_button_*_state calls before or after delay_ms anyway, even when using volatile, effectively eliminating one of them.

It's not that I want to ensure the dead load is eliminated, just that I want the compiler to be able to make the usual optimisations in that regard when possible. Imagine the two get-button functions are part of some generic API you are implementing, where it's possible that they are on different ports/word addresses. If the compiler can see they happen to be on the same word, then it would be good if it could eliminate one of the loads.


@parched Thanks for elaborating

Imagine the case where 0xdeadbeef is the memory mapped flag a DMA uses to signal it has finished writing some memory and do_something reads that memory.

This is a very interesting example. The way I have been thinking of modeling DMA is that it would take ownership of the memory (&mut [u8]) it is writing to, and won't return it until after the transmission is done (e.g. read_volatile(&REG) & 1 == 1). (This happens to map nicely to futures (impl Future<Item = &mut [u8]>).) More importantly, written this way the compiler would know that both memory addresses are related and thus, I think, would not run into the problem you mentioned. I would deem an API like this safe.
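A rough sketch of that shape (the status-register address and flag layout are made up for illustration):

// The transfer owns the buffer; the only way to get it back is to
// wait for the DMA to finish.
pub struct Transfer<'a> {
    buffer: &'a mut [u8],
}

impl<'a> Transfer<'a> {
    pub fn wait(self) -> &'a mut [u8] {
        // Illustrative DMA status register; bit 0 = transfer complete.
        const DMA_STATUS: *const u32 = 0x4002_0000 as *const u32;
        while unsafe { core::ptr::read_volatile(DMA_STATUS) } & 1 != 1 {}
        self.buffer
    }
}

Because wait consumes the Transfer, safe code cannot touch the buffer while the DMA is still writing to it.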

In the way you have written the DMA transfer, the compiler has no way to know that 0xdeadbeef and the memory accessed in do_something are related, and thus it misoptimizes the code. An atomic acquire-load operation is one (expensive, and not available on Cortex-M0 chips) way to solve this misoptimization. In any case, I would deem such an API unsafe, as ownership of the memory is not (correctly) specified at all.

It couldn't eliminate one of them because delay_ms should contain a barrier

If delay_ms uses a while loop of read_volatile calls, and get_button_*_state are written as they are now, then the compiler can move/merge/eliminate one of them, as you mentioned in your second bullet ("If read_xxx is a volatile load ...").

But I suppose you don't want get_button_*_state to be implemented as e.g. *GPIO_PORT_A & 0x1 != 0, but rather you want to mark this memory access as (1) or (2). Which one would it be in this case? Because the definitions still read the same to me: "don't change the number of accesses" and "don't merge" seem to overlap in particular.

Imagine the 2 get button functions are part of some generic API you are implementing where it's possible that they are on different PORTs/word addresses. If the compiler can see they happen to be on the same word then it would be good if it could eliminate one of the loads.

My impression is that this optimization would only make sense in very few cases, where it doesn't actually change the behavior of the program. I agree that it wouldn't be possible at all with read_volatile.

My gut tells me that having two attributes to achieve these (from my POV) rare optimizations would be too complex for the average library author to get right, and may actually cause more misoptimization problems if the library author gets them wrong. There's also the question of how much code-size savings this brings to the table.

Perhaps the rules for these attributes are actually simple and allow some other nice optimizations; I don't know but I'd be happy to hear about that.

BTW, does LLVM have IR attributes for (1) and (2)? Because we are constrained to what LLVM offers to do optimizations. Unless this is doable in MIR and doesn't involve tons of work to implement.

I can't quite picture exactly what you mean, but regardless of whether the compiler keeps things in the right order, there is nothing to stop the processor from doing it out of order without some synchronisation instruction.

Yes it is; ARMv6-M has the dmb instruction.

Yes, it would work as we want when get_button_*_state uses read_volatile too, but it would be a useless implementation because it wouldn't act as a barrier for any normal code that didn't use volatile.

Yes, (2) for this example, although that would still break if delay_ms were implemented as you suggest.

Apologies, I don't think I was clear on what I mean by "merge". I mean exactly what @comex said:

(2) means, for example, that repeated loads of one address can be replaced by one load, but two separate loads of adjacent u32s can't be replaced by one load of a u64. And, as @comex also mentions, it must use a plain load instruction. So, for example, on PowerPC a load-acquire with (2) semantics would have to be implemented as a plain load followed by a memory barrier, rather than a single load-acquire instruction.

Yes, exactly; that's the premise of dead load/store elimination, and it's, arguably, an important optimisation.

I'm not sure; I had never thought this much about volatile before writing these comments, so I am playing devil's advocate here a bit to see if it's worth changing the status quo with volatile.

Not as far as I know, but in most cases where you want (1) you also want (2), which together equate to volatile. (2) can easily be written in assembly, falling back to volatile on architectures where that hasn't been implemented.

EDIT: For example, (2) for u32 on ARM is just:

pub unsafe fn read_2_u32(address: *mut u32) -> u32 {
    let value: u32;
    asm!("ldr $0, $1" : "=r"(value) : "r"(address as usize));
    value
}
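(Note that without a "volatile" option on the asm! block, LLVM remains free to drop the load entirely if value is unused, which is exactly the dead-access elimination that (2) is meant to permit.)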

I'm seeing some confusion and mixed terminology in this thread and wanted to weigh in. I'm currently working on my own Rust embedded framework and have been thinking about this stuff a lot lately.

In the quote above, you two are talking about two different things.

I suspect @japaric is referring to ldrex, which is, in fact, not available on M0. But it is not an acquire-load, it is a load-exclusive (analogous to load-linked or load-and-reserve on some other RISC architectures). This has no direct mapping in either the C/C++ memory model, or the LLVM machine-independent IR.

ldrex can act as an acquire-load when paired with dmb. So can ldr. This is fortunate, because (1) acquire-loads against device memory are useful and (2) ldrex is technically verboten on Device-attributed memory on ARMv6 and ARMv7-M. (Though it works well on Cortex-M{3,4}, but do as I say, not as I do.) (And all this is completely different on AArch64, but let's save that for later.)
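As a sketch, in the same old-style asm! syntax as the read_2_u32 example above (untested, and assuming only acquire semantics are needed), the ldr + dmb pairing might look like:

pub unsafe fn load_acquire_u32(address: *const u32) -> u32 {
    let value: u32;
    // Plain word load followed by a full barrier: together they act as
    // a load-acquire, and unlike ldrex this is fine on Device memory.
    asm!("ldr $0, [$1]\n\tdmb"
         : "=r"(value)
         : "r"(address as usize)
         : "memory"
         : "volatile");
    value
}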

@comex noted that load-acquire on PowerPC also faults on direct-store memory (equivalent to ARM device memory); I suspect they are referring to load-and-reserve, not load-acquire, as I don't remember PowerPC having an explicit load-acquire like AArch64 does.

Finally, performance nitpick from a cycle-counter: dmb is not terribly expensive on the lower-end Cortex-Ms (cheaper than division).

Here is a very nice phrasebook mapping the C/C++ memory model atomic language to machine instructions on a variety of architectures.

Now -- at the LLVM level, atomic and volatile accesses are orthogonal. You can express a volatile-acquire-load, for example. This is excellent and exactly what we want for device accesses, since we can use explicit memory orderings to control barrier generation in a portable manner. Unfortunately, Rust does not currently expose these operations by e.g. adding an Ordering parameter to volatile_load / volatile_store.

(We can skip the combinatorial explosion of RMW operations because they are not well-defined on device memory anywhere but x86. Simply orderings on volatile accesses would be nice.)
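To make that concrete, the extension suggested above might look something like this. These signatures are hypothetical (mirroring the existing volatile_load / volatile_store intrinsics plus an Ordering parameter); nothing like them exists in core today:

use core::sync::atomic::Ordering;

// Hypothetical: volatile accesses that also carry an atomic ordering,
// lowering to LLVM's volatile atomic load/store.
pub unsafe fn volatile_load<T>(src: *const T, order: Ordering) -> T {
    unimplemented!()
}

pub unsafe fn volatile_store<T>(dst: *mut T, value: T, order: Ordering) {
    unimplemented!()
}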


I’m not sure how important the different shades of volatile are. I expect that driver developers would prefer to have just 1 option, rather than trying to pick the “best-performing” one.

If we take that into account, a "volatile" memory access is a memory access instruction that is essentially an IPC call that can access (and modify) all "non-owned" memory; it must be passed to the hardware as one.

Of course, on some processors you need to add barriers around IPC calls, so you need to have atomicity properties.

Exactly, and that was going to be my next suggestion for the 2017 roadmap: exposing volatile atomics.

To summarise my ramblings: volatile is for use when your memory access has side effects, not because it might change in time. If it might change in time, you need atomics, and if it might change in time and has side effects, then you need volatile atomics.

In regard to the second aspect of volatile, (2), that I have mentioned, that is, doing the exact access you specify with one instruction: reading the LLVM manual, it appears that is covered by unordered atomics, so I think Rust should expose that too.

I agree that those semantics would be quite useful for memory-mapped registers, in particular!

I like this, but I think we need a combination of both systems. So we stay with the current Volatile type, but we also lint when an address that points into "volatile" memory is ever dereferenced or passed to ptr::write or ptr::read. I'm fairly certain we can write a lint that uses an attribute of the form #![volatile("0x500", "0x1000")] or something similar to mark a region as volatile, and then detect all misuses in the current crate and in all generic code monomorphized by the current crate. Of course we can't really track if someone sends an integer around and uses it as a pointer address at some point, but I think in this case "good enough" can suffice.

There is an RFC for a #[repr(transparent)] for structs with one element, to guarantee that the struct's representation is equivalent to the representation of its single element. You could imagine this being extended to be able to say something like #[repr(transparent, at_address = 0x0000_0004)] to indicate the address that the volatile object is located at.

With both of these, unless you're trying to build some object-capability-based system, you don't need a pointer-to-volatile wrapper type; you just need a volatile wrapper type. In particular, there would in general not be any need to convert an integer value into a pointer to a Volatile, so such a lint wouldn't be necessary.
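As a sketch of how those pieces could compose (the at_address form is hypothetical and not valid Rust today; Volatile is the wrapper type discussed above):

// Hypothetical syntax, for illustration only: the struct's representation
// is exactly its single field, and its location is fixed at the given address.
#[repr(transparent, at_address = 0x0000_0004)]
struct SomeRegister(Volatile<u32>);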

(Normally in C you would do this with a linker script + extern volatile <type>.)

We gain nothing by that system; it's just a different representation of what we have now. Defining at the top of the crate which parts of memory are supposed to be volatile, and then enforcing that, gives us redundancy without too much verbosity, since we are specifying the entire range of memory instead of every single access (which makes it easy to forget one).