Int2ptr and runtime provenance models

Even without those issues, wouldn't there still be the problem of memory overhead? With capabilities required to be aligned, you can store validity using one extra bit per 16 bytes. If they were unaligned, it seems like you'd need more bits to represent e.g. "there is a valid capability at offset 3".

Yes, that becomes more noticeable. If you just encode where a capability starts in each 16-byte region, if any, then you have (annoyingly) 17 states, which needs 5 bits, so 5 times the overhead. There are constraints based on adjacent regions, so not all pairs of states are possible, and so on as you expand outwards, so you could perhaps cut it down a bit more, but you quickly hit diminishing returns and end up with a huge window of interaction. I haven't thought about what the theoretical limit would be.
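A back-of-the-envelope check of that count (my own sketch with made-up names; this is just arithmetic, not CHERI code): a 16-byte granule with aligned capabilities has 2 states (tag set or clear), while allowing a capability to start at any of the 16 byte offsets gives 16 + 1 = 17 states:

```rust
// Bits needed to distinguish `states` states: ceil(log2(states)).
fn bits_per_granule(states: u32) -> u32 {
    32 - (states - 1).leading_zeros()
}

fn main() {
    // Aligned case: one tag bit per 16-byte granule.
    assert_eq!(bits_per_granule(2), 1);
    // Unaligned case: 16 possible start offsets + "no capability" = 17 states.
    assert_eq!(bits_per_granule(17), 5);
    println!(
        "aligned: {} bit(s) per granule, unaligned: {} bit(s) per granule",
        bits_per_granule(2),
        bits_per_granule(17)
    );
}
```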

I have no insider knowledge about that (not even that I can't share). I assume it's a combination of:

  1. Writing various different vectorised memcpys tuned to different microarchitectures is a pain
  2. Using the vector unit itself comes with some cost (power, but also having to context-switch the register file: it's often treated like the floating-point registers, where you don't pay the context-switch overhead if you don't use them), so it's convenient to be able to use the wide data paths in the core without touching the vector unit itself (you can even copy whole cache lines as a unit for large memcpys)
  3. CHERI requires that capabilities be preserved, so you can no longer use vectors of bytes for your memcpys; without adding support for vectors of capabilities you risk tanking performance by falling back to a scalar memcpy (even if, on Morello at least, you can use load/store capability-pair instructions to move 256 bits at a time)

Answering this question would require a full model with support for integer-pointer casts and "advanced provenance" such as C restrict or Stacked Borrows -- and I do not think such a model exists. Generally the goal is for the answer to be "no": the information loss due to lack of provenance is meant to only lead to more programs being accepted. What remains unclear is whether this is possible in a full model.

I can only assume you mean "optimizers that do not perform integer substitution", since substitution is just clearly wrong when integers carry provenance.

I assume your argument here is that while provenance on integers might not work for optimized IRs such as LLVM, it can work for surface languages such as Rust or C. That is true. The Rust spec would become more complicated in some places (provenance needs to support a "union" operation to model integer arithmetic in case both operands have provenance) and less complicated in others (int-to-ptr casts do not have to 'guess' a provenance). The proof of correspondence from Rust to LLVM would bear the burden of proving that whatever LLVM does "soundly approximates" actually tracking provenance properly everywhere.

This would be a departure from C, which seems to be heading towards PNVI (provenance not via integers). So it might be worth figuring out why they discarded the models (that have certainly been considered) that do track provenance on integers. I am not sure if it would make teaching low-level Rust easier or harder -- people would have to get used to the idea of every integer value carrying extra metadata that is relevant for program execution; right now this issue is nicely contained to pointers, in this world it would affect basically all variables. Miri would need a whole lot of (likely inefficient) machinery to actually track provenance through arbitrary integer operations.

These are all things that I wanted to avoid, but none of them is a clear-cut objective argument that would make this impossible. By accepting this proposal we would be saying that provenance is strictly syntactically tracked through all operations (and only data dependencies count, not control dependencies). Avoiding all the thorny questions around how to support the "wildcard provenance" is certainly tempting... on the other hand, it is not necessarily clear how to do Stacked Borrows when every pointer comes with a set of tags. And there is a risk that this language cannot be compiled efficiently to LLVM (which is amplified by the fact that C does not have provenance on integers, so supporting such compilation might not be an important constraint for LLVM).
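For contrast, current Rust (which does not put provenance on integers) surfaces exactly this design point in its pointer APIs: `expose_provenance` and `with_exposed_provenance` (stable since Rust 1.84, if I recall correctly; on older toolchains both are spelled as plain `as` casts). A minimal sketch:

```rust
use std::ptr;

fn main() {
    let x = 42i32;
    let p: *const i32 = &x;

    // The resulting integer carries no provenance; instead, this call marks
    // the provenance of `p` as "exposed" as a side effect.
    let addr: usize = p.expose_provenance();

    // Here the Abstract Machine has to 'guess': it picks some previously
    // exposed provenance that makes the resulting pointer usable, if any.
    let q: *const i32 = ptr::with_exposed_provenance(addr);

    unsafe { assert_eq!(*q, 42) };
}
```

With provenance on integers, the "expose" side effect would be unnecessary and the int-to-ptr cast would not have to guess -- at the cost of the complications described above.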

To solve this, the Abstract Machine keeps track of the "provenance of the environment". Whenever data with provenance is sent to the environment (e.g. via a syscall), that provenance is added to the "provenance of the environment". Whenever data is received from the environment, all that data carries the "provenance of the environment".

So, this way provenance could be "preserved" (in a soundly approximated way) even through roundtrips via the analog world.
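That bookkeeping can be sketched as a tiny toy model (my own illustration with made-up names; real provenance is much richer than an integer ID): a set that only ever grows, absorbing every provenance that escapes via I/O, and stamped onto everything that comes back in.

```rust
use std::collections::HashSet;

// Stand-in for a real provenance (e.g. an allocation ID).
type Provenance = u64;

#[derive(Default)]
struct Environment {
    provenance: HashSet<Provenance>,
}

impl Environment {
    // Data leaving the AM (e.g. a syscall argument) donates its provenance.
    fn send(&mut self, data_provenance: impl IntoIterator<Item = Provenance>) {
        self.provenance.extend(data_provenance);
    }

    // Data entering the AM carries the whole accumulated set.
    fn receive(&self) -> HashSet<Provenance> {
        self.provenance.clone()
    }
}

fn main() {
    let mut env = Environment::default();
    env.send([7]); // e.g. print a pointer whose provenance is 7
    // Reading the address back yields data that includes provenance 7,
    // so using it as that pointer again is still allowed.
    assert!(env.receive().contains(&7));
}
```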

However, there is a version of this code which avoids even sending data with provenance to the environment -- namely if you smuggle the data through a control dependency, such as this "provenance-stripping function":

```rust
fn id_u8(x: u8) -> u8 {
    for y in 0..=255u8 {
        if x == y {
            return y;
        }
    }
    unreachable!();
}
```

If the code did this to the pointer and then printed it, the original ptr provenance would never be added to the "provenance of the environment", and so the address (after being read back) would not be a valid pointer.

Basically, once integers have provenance, we have "non-standard integers", and this id_u8 function is not an identity function -- it actually changes the data, and adding/removing it can change whether a program has UB.
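For reference, here is a runnable version of that function (with an explicit `-> u8` return type): on values it really is the identity, and only in a provenance-on-integers model would its output differ from its input (in the attached metadata, since the returned `y` is derived from the loop counter rather than from `x`).

```rust
fn id_u8(x: u8) -> u8 {
    // Smuggle `x` through a control dependency: `y` equals `x` as a value
    // but has no data dependency on `x`.
    for y in 0..=255u8 {
        if x == y {
            return y;
        }
    }
    unreachable!();
}

fn main() {
    for x in [0u8, 1, 42, 255] {
        assert_eq!(id_u8(x), x);
    }
}
```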
