Compiling byte code and how to correctly restore write functionality of a pointer?

I want to write a small compiler for a simple bytecode language. This bytecode operates on a machine with a static amount of memory and can read and write at constant offsets; see a small snippet of the approximate instruction set below. I want to execute this instruction set. For that purpose I allocate the memory; my first thought was, for simplicity, a Vec<u8>. Now, the optimization I want to perform: since the memory buffer won't move for the duration of execution and all offsets are known ahead of time, I could calculate the address each opcode operates on ahead of time and "just perform its effect". But, alas, I believe pointer provenance will actually make this UB.

Hence the question: how would I correctly calculate the address ahead of time, but satisfy any supposed pointer provenance rules?

enum OpCode {
    WriteConstU8 { addr: usize, value: u8 },
    WriteConstU16 { addr: usize, value: u16 },
    // ... more consts
    AddU8 { addr_op1: usize, addr_op2: usize, addr_res: usize },
    AddU16 { addr_op1: usize, addr_op2: usize, addr_res: usize },
    // ... other opcodes
}

type MemoryU8 = *mut u8;
type MemoryU16 = *mut u16;
enum CompiledOp {
    WriteConstU8 { addr: MemoryU8, value: u8 },
    WriteConstU16 { addr: MemoryU16, value: u16 },
    // ... more consts
    AddU8 { addr_op1: MemoryU8, addr_op2: MemoryU8, addr_res: MemoryU8 },
    AddU16 { addr_op1: MemoryU16, addr_op2: MemoryU16, addr_res: MemoryU16 },
    // ... other opcodes
}

fn execute_compiled_op(op: &CompiledOp) {
    // SAFETY: we checked the alignment and OOB conditions when we passed from OpCode to CompiledOp
 
    match op {
        CompiledOp::WriteConstU8 { addr, value } => unsafe { addr.write(*value) },
        CompiledOp::WriteConstU16 { addr, value } => unsafe { addr.write(*value) },
        CompiledOp::AddU8 { addr_op1, addr_op2, addr_res } => {
            unsafe { addr_res.write(addr_op1.read() + addr_op2.read()) }
        }
        CompiledOp::AddU16 { addr_op1, addr_op2, addr_res } => {
            unsafe { addr_res.write(addr_op1.read() + addr_op2.read()) }
        }
    }
}

Full example in this playground. I know the current implementation fails under Miri with strict provenance. The question is how to fix it.

Miri's complaint is that each time you call memory.as_mut_ptr(), you invalidate the results of all previous calls to it, under the rules set by "Stacked Borrows".

Thus, you need to change your code to call it only once. This Rust Playground does that by passing pointers around instead of references to memory, but that approach is likely to cause friction in your "real" program, because passing raw pointers everywhere is awkward.
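For illustration, a minimal sketch of that single-call pattern (the buffer size and the two hard-coded "opcode effects" below are made up for the demo, not taken from the playground):

```rust
// Sketch: derive every opcode pointer from ONE call to as_mut_ptr().
const SIZE: usize = 16; // illustrative buffer size

fn run() -> u8 {
    let mut memory = vec![0u8; SIZE];
    // The single derivation: every later pointer comes from `base`,
    // never from another `as_mut_ptr()` call or a fresh `&mut`.
    let base: *mut u8 = memory.as_mut_ptr();

    // "Compile": precompute absolute addresses from constant offsets.
    let dst = unsafe { base.add(0) };
    let src = unsafe { base.add(1) };

    // "Execute": all accesses go through pointers derived from `base`.
    unsafe {
        src.write(41);
        dst.write(src.read() + 1);
        dst.read()
    }
}

fn main() {
    assert_eq!(run(), 42);
}
```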

So once I compute the pointers through which I want to write, I cannot use a mutable reference to the original memory until after invalidating all of them. That indeed sounds like a pain to uphold, but perhaps it's simple enough to encapsulate with a datatype that seemingly captures a &mut to the memory, but immediately converts it to a *mut internally.
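Something like this sketch, perhaps (MemHandle and its API are names I'm inventing here): the wrapper takes the &mut once, keeps the borrow alive via PhantomData so the borrow checker locks out the original reference, and hands out only raw pointers derived from a single base pointer.

```rust
use std::marker::PhantomData;

/// Hypothetical wrapper: captures the `&mut` once, then only deals in
/// raw pointers that all share the provenance of one base pointer.
struct MemHandle<'a> {
    base: *mut u8,
    len: usize,
    _borrow: PhantomData<&'a mut [u8]>, // keeps the original `&mut` borrowed
}

impl<'a> MemHandle<'a> {
    fn new(memory: &'a mut [u8]) -> Self {
        MemHandle {
            base: memory.as_mut_ptr(),
            len: memory.len(),
            _borrow: PhantomData,
        }
    }

    /// Bounds-checked address computation; every result is derived from
    /// the single `base` pointer, never from a fresh reference.
    fn addr(&self, offset: usize) -> *mut u8 {
        assert!(offset < self.len);
        unsafe { self.base.add(offset) }
    }
}

fn demo() -> u8 {
    let mut memory = [0u8; 8];
    let handle = MemHandle::new(&mut memory);
    let (a, b) = (handle.addr(0), handle.addr(1));
    unsafe {
        a.write(2);
        b.write(40);
        a.write(a.read() + b.read());
        a.read()
    }
}

fn main() {
    assert_eq!(demo(), 42);
}
```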

I kind of wonder, though: it seems multiple derived *muts are allowed to read and write the memory in any order. Is that guaranteed to work? I.e., the following "program" passes fine with your modifications (thanks already for that):

    let result = execute(vec![
        OpCode::WriteConstU16 { addr: 2, value: 2 },
        OpCode::WriteConstU16 { addr: 4, value: 40 },
        OpCode::AddU16 { addr_op1: 2, addr_op2: 4, addr_res: 6 },
        OpCode::AddU16 { addr_op1: 2, addr_op2: 4, addr_res: 6 }, // Aliased? write
        OpCode::AddU16 { addr_op1: 6, addr_op2: 8, addr_res: 0 }, // Aliased? read
    ]);

Do I understand correctly that all the pointers have a "SharedReadWrite" grant to the memory and are thus fine to read and write to it in any order?

Stacked Borrows has the formal rules; the informal statement of the rule is that, given a specific object, you cannot create or use an &mut to that object while any other &mut is still in use. And when you use an &mut to create a *mut, you create a use of the original &mut that's not tracked by the borrow checker, but instead by the programmer. It's thus up to you to make sure that the original &mut is not reused while the *muts are alive.
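As a minimal illustration of the ordering that rule demands (derive the raw pointer, finish all raw accesses, and only then touch the &mut again):

```rust
fn demo() -> u8 {
    let mut value: u8 = 0;
    let unique: &mut u8 = &mut value;
    // Derivation: `raw` inherits its permission from `unique`.
    let raw: *mut u8 = unique;
    // While `raw` is in use, `unique` must stay untouched...
    unsafe { raw.write(41) };
    // ...and only after the last raw access may the `&mut` be used again.
    // Under Stacked Borrows, this use invalidates `raw` from here on.
    *unique += 1;
    *unique
}

fn main() {
    assert_eq!(demo(), 42);
}
```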

I believe that you can safely have multiple aliasing *muts to the same object, as long as you don't dereference more than one at a time (since each &mut is expected to be unique), but I am not an expert here, and I've not fully internalised the Stacked Borrows ruleset for raw pointers. Note that if you do dereference into an &mut (as opposed to using pointer::read and pointer::write, as you do in your code so far), you'll need to be careful to make sure that you don't use any *mut to the object while that &mut is in scope. pointer::read and pointer::write avoid this footgun, but leave you needing to avoid data races etc. manually.
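For example, this interleaving through two aliasing *muts, sticking to pointer::read/pointer::write and never materializing an &mut in between, is the pattern I mean:

```rust
fn demo() -> u8 {
    let mut value: u8 = 1;
    let unique = &mut value;
    // Two aliasing raw pointers; `p2` is a copy of `p1`, so they share
    // the same permission to the memory.
    let p1: *mut u8 = unique;
    let p2: *mut u8 = p1;
    unsafe {
        // Interleaved reads and writes through both, in any order,
        // with no `&mut` created in between.
        p1.write(20);
        p2.write(p2.read() + 22);
        p1.read()
    }
}

fn main() {
    assert_eq!(demo(), 42);
}
```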

You don't need to do much. The only SB problem here is that an &mut _ argument to compile asserts uniqueness of the pointee for the duration of the call: each individual call to compile re-derives from that initial unique reference. Those uses then invalidate all the derived pointers in the CompiledOp structures that were returned and collected by earlier calls.

To fix it, pass the memory by pointer and it all works fine (except for the technicality that such a function should be unsafe in a public interface).

fn compile(memory: *mut [u8], op: OpCode) -> CompiledOp {
    // ... body as before ...
}
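A sketch of what that body might look like for the two write opcodes (the opcode subset is trimmed down, and the bounds/alignment checks are illustrative stand-ins for whatever the real playground does; note that `len` on raw slice pointers is stable since Rust 1.79):

```rust
enum OpCode {
    WriteConstU8 { addr: usize, value: u8 },
    WriteConstU16 { addr: usize, value: u16 },
}

enum CompiledOp {
    WriteConstU8 { addr: *mut u8, value: u8 },
    WriteConstU16 { addr: *mut u16, value: u16 },
}

fn compile(memory: *mut [u8], op: OpCode) -> CompiledOp {
    let len = memory.len(); // `<*mut [u8]>::len` needs no dereference
    let base = memory as *mut u8; // thin pointer with the slice's provenance
    match op {
        OpCode::WriteConstU8 { addr, value } => {
            assert!(addr < len);
            CompiledOp::WriteConstU8 { addr: unsafe { base.add(addr) }, value }
        }
        OpCode::WriteConstU16 { addr, value } => {
            // Illustrative check only: real code must verify the actual
            // pointer alignment, not just that the offset is even.
            assert!(addr % 2 == 0 && addr + 1 < len);
            CompiledOp::WriteConstU16 {
                addr: unsafe { base.add(addr) as *mut u16 },
                value,
            }
        }
    }
}

fn demo() -> u8 {
    let mut memory = vec![0u8; 8];
    let mem: *mut [u8] = memory.as_mut_slice();
    let op = compile(mem, OpCode::WriteConstU8 { addr: 3, value: 42 });
    if let CompiledOp::WriteConstU8 { addr, value } = op {
        unsafe { addr.write(value) };
    }
    memory[3]
}

fn main() {
    assert_eq!(demo(), 42);
}
```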

// ... and at the call site:

    let memory = &mut memory[correction..correction+SIZE];
    let run_memory: *mut [u8] = memory;
    
    let compiled = ops
        .into_iter()
        .map(|op| compile(run_memory, op))
        .collect::<Vec<_>>();
        
    for op in &compiled {
        execute_compiled_op(op)
    }

    memory[0]

If you want, you can write an abstraction around a raw pointer with a shared tag: a pointer that carries a lifetime but only some of the other validity requirements of references; precisely which ones would be your choice. It also provides the opportunity to tag CompiledOp with a lifetime, wrapping correct usage in a layer of safety.

A particular variant of that type already exists: it's Cell<[u8]>. Use the Cell, Luke. (* Using Cell isn't exactly equivalent under SB; it's just equivalent enough here to work.)

So similar, in fact, that Cell::set for non-Drop types such as u8 is semantically equivalent to a core::ptr::write. The correspondence is realized by having CompiledOp contain &Cell<u8> instead of a raw pointer; this still makes the type !Sync automatically, but much more explicitly and on purpose, and everything still works under Miri. Here's a variant built with Cell as a primitive:
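A minimal sketch of that shape, with an illustrative opcode subset (not the full playground):

```rust
use std::cell::Cell;

// `&Cell<u8>` replaces the raw pointers; the lifetime ties each
// CompiledOp to the memory it was compiled against.
enum CompiledOp<'a> {
    WriteConstU8 { addr: &'a Cell<u8>, value: u8 },
    AddU8 { op1: &'a Cell<u8>, op2: &'a Cell<u8>, res: &'a Cell<u8> },
}

fn execute_compiled_op(op: &CompiledOp<'_>) {
    // No `unsafe` left: Cell::set/get carry the write/read semantics.
    match op {
        CompiledOp::WriteConstU8 { addr, value } => addr.set(*value),
        CompiledOp::AddU8 { op1, op2, res } => res.set(op1.get() + op2.get()),
    }
}

fn demo() -> u8 {
    // Zero-initialized buffer of cells, built just for the demo.
    let memory: [Cell<u8>; 8] = std::array::from_fn(|_| Cell::new(0));
    let program = [
        CompiledOp::WriteConstU8 { addr: &memory[0], value: 2 },
        CompiledOp::WriteConstU8 { addr: &memory[1], value: 40 },
        CompiledOp::AddU8 { op1: &memory[0], op2: &memory[1], res: &memory[2] },
    ];
    for op in &program {
        execute_compiled_op(op);
    }
    memory[2].get()
}

fn main() {
    assert_eq!(demo(), 42);
}
```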

LLVM does not do much better than a jump table in either style. This is good motivation to introduce more builtin primitives on Cell, such as Cell<[u8]>::get(Range<usize>), and maybe conversions in the style of bytemuck but behind a Cell and only shared. The current, rather minimal, implementation is cumbersome to use for some operations, and many safe wrappers could be constructed.


Rather than using Cell<[u8]>, you can use [Cell<u8>], which makes indexing and slicing much easier: just use the [...] operator as usual.
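For example (building the buffer with std::array::from_fn just for the demo):

```rust
use std::cell::Cell;

fn demo() -> u8 {
    // A plain array of cells; no Cell<[u8]>-specific helpers needed.
    let memory: [Cell<u8>; 8] = std::array::from_fn(|_| Cell::new(0));
    // Ordinary indexing and slicing just work:
    memory[0].set(2);
    memory[1].set(40);
    let window = &memory[0..2];
    window[0].get() + window[1].get()
}

fn main() {
    assert_eq!(demo(), 42);
}
```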


Using Cell (or, I guess, in some situations UnsafeCell) sounds like a great idea. I'm still not familiar enough with the memory model to know why this works (why can we just create a shared reference to a Cell that is some part of a bigger Cell out of "thin air", pretending there was a Cell there in the first place?), but at least it sounds like it should work, and it neatly takes care of having to handle raw pointers in most places.
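Concretely, I suppose the safe route sidesteps the thin-air question entirely, since std itself provides the projection: Cell::from_mut and Cell::as_slice_of_cells are safe APIs (both stable since Rust 1.37).

```rust
use std::cell::Cell;

fn demo() -> u8 {
    let mut memory = vec![0u8; 8];
    // `&mut [u8]` -> `&Cell<[u8]>`: sound because the `&mut` proves we
    // have unique access, which is all `Cell` needs to permit shared
    // mutation for its lifetime.
    let all: &Cell<[u8]> = Cell::from_mut(&mut memory[..]);
    // `&Cell<[u8]>` -> `&[Cell<u8>]`: sound because `Cell<u8>` is
    // `repr(transparent)` over `u8`, so the layouts match.
    let cells: &[Cell<u8>] = all.as_slice_of_cells();
    cells[3].set(42);
    cells[3].get()
}

fn main() {
    assert_eq!(demo(), 42);
}
```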