Supporting offset-based types (allocation-bound indicies)

Reading up on pointer provenance got me thinking about non-usize sized pointers and relative pointers - again. This time not with the goal of having self referential structs [1] but just to allow improving the memory layout. Rust already does something like this in the background at compile time: Every pointer exists of an origin (original) pointer to the allocated area and an offset into it. At compile time these are combined to a single address which is stored in memory.

But what if we (for a different pointer-like type) don't do this at compile time? When allocating something we get an address from the allocator/OS, which is combined with the offset 0 [2] to form the address that is stored in-memory. This is the case for Vec<T>, [T] and most other collections. We'd end up with the original pointer ("Anchor") stored as an address/pointer in-memory and the offset (as a usize or smaller) stored separately ("offset"/"index"). This is already done for arrays, exposing the offset as a "usize" [3], which can be stored in a smaller integer size (e.g. u32) to improve the memory layout. Additionally collections like Vec<T> can be resized and are thus moved, making the use of normal *T and &T even less practical.

Therefore many suggest storing references to entries of a large collection as indicies into the array, which is unfortunate because you loose a bunch of guarantees done by the borrow checker. Having a @T offset into a collection/allocation with pointer-like semantics + being bound to the allocation may improve that situation.

An way to offset into multiple arrays at the same time (i.e. not being bound to a single array/allocation) could improve the usability of this further, until then it would still be possible by casting Offset<Size,T> to Size and using normal index-based addressing, thus opting out of the guarantees Offset has.

fn move<T>(value: T) -> T {
    value
}

fn main() {
    let mut data: vec![0,1,2,3,4,5,6,7,8,9]; // Lifetime: 'data
    // Storing references
    let a: &u8 = data[5]; // What we have now, borrows data: 'data
    dbg!(a);
    // Storing indicies
    let a: usize = 5; // not bound to data or its lifetime
    dbg!(data[a]);
    // Storing offsets
    // Stored like the index, but contains borrow checking like a
    // reference, thus this cannot outlive `data`. For now a reference
    // to the entire Vec, but could be done using references to
    // individual items if supported (different topic).
    // I'm not sure if the `[i]` syntax would work with this,
    // hence I'm using a separte function.
    // We likely need this `apply_offset` because `a` alone (at runtime)
    // isn't sufficient to get to the real address.
    let a: @u8 = data.offset(5); // Short-hand for `Offset<usize, u8>`
    dbg!(data.apply_offset(a));
    // Storing offsets in smaller types to improve size in memory.
    let a: Offset<u16, u8> = data.offset(5);
    dbg!(data.apply_offset(a));

    // Due to being relative to the address of data you can even resize
    // the vector, while offsets exist, but this is something you may not
    // want, especially reducing the size while larger offsets exist.
    data.resize(100, 0);
    dbg!(data.apply_offset(a));
}

The example above could be done in a crate, but unless I'm mistaken that implementation wouldn't be able to ensure that the offset goes into the correct collection and thus wouldn't provide much benefit over using normal indicies. You could index into another array that still has a valid lifetime. Additionally there probably would be issues with the lifetime when resizing, which may not exist when this feature is supported by the std lib and the compiler itself.

The ability to resize does of course mean that an Offset may be out-of-bounds and using it would return an Error or panic. Avoiding this may require per-item lifetimes or similar, but the same issue already exists for index-based references. This could perhaps be solved by having a "strong" and a "weak" variant, with weak allowing the allocation size to shrink while any "strong" offset allows moving and increasing the allocated size but not decreasing it. This would almost certainly not be possible in a crate, as it requires a different kind of borrow (or two).

Don’t get me wrong, we could probably have offset-based types. You could imagine an @T (no lifetime necessary) being an offset to a T inside the containing struct. But this doesn’t do anything to support generators, which use regular code with references and then compiles it a self-referential struct, and needs to support arbitrary reference-based code.

Async/Await II: Narrowing the Scope of the Problem

As an aside: This may even allow/support self referential structs (if the lifetimes work out) in the future, as contrary to async and generators the code using this is aware of the existence of these.

What are your thoughts? Would this be useful to have, does it even make sense, do I have a wrong mental model of pointer provenance, should something like this be in the language, std or a crate (if possible)?


  1. see the many existing topics and the blog post quoted below ↩︎

  2. usually, this is a no-op in the generated assembly ↩︎

  3. I don't know if the compiler does it differently, but from the effect it has this is the same ↩︎

Neither could a native language feature, unless the key type is limited to usage within a single fn body. If you want to tie keys to their container, that requires the use of a "generative" brand lifetime like first introduced by (IIRC) the indexing crate and notably proven sound by the ghost cell paper.

Actually, a primary reason for storing indices instead of references is to avoid dealing with lifetimes and holding 'static keys. Even if you introduce branding lifetimes independent from the shared XOR mutable borrow checking, this is a huge change to only solve half of the reason to use keys instead of references. And doesn't really beat out a clever library solution since the same API limitations (scoping) probably still apply.

2 Likes