Blog post: Indirect ownership, shallow borrow and self-referential data structures

I wrote some thoughts on how first class self referential structs could look like in Rust.

5 Likes

@panstromek thanks for sharing. I like the way you structured the post and explained the different parts. Here are some thoughts I had while reading:

Indirect ownership

This trait or marker akin to the stable_deref crate must be unsafe. Incorrectly marking something can easily lead to UB.

Shallow borrows

I'm a little skeptical if something as fundamental as a new reference type will be added to the language. One of the core stated goals of the language team for Rust 2024 was to reduce the learning curve, and this would directly go against it.

I missed a section talking about variance. In my experience as the author of self_cell, the hardest part of ensuring a safe-to-use interface for self-referential-structs is variance. Especially if the owner is allowed to have a lifetime, for example.

let m = unsafe {
  let m: Box<MaybeUninit<MineField>> = Box::new_uninit();
  addr_of_mut!(m.num).write(8);
  addr_of_mut!(m.num_ref).write(&m.num);
  m.assume_init()
};

As far as I can tell, it is one correct way to do it. Assuming your type constructors and destructors can't panic. Arguably there are some problems when it comes to pointer aliasing and strict-provenance. But from my understanding, this case with a Box shouldn't be affected. Maybe @Gankra can weigh in.


Giving you a way to express these structs directly seems like a nice thing at first. But one curious property of all your examples was, there is always only one owner and dependent. Having a native syntax akin to what ouroboros and rental allow, opens the door for complex structs that have multiple owners and dependents, if not even chained ones. And I'd argue this quickly becomes so complex that even while the compiler might be able to give you compile errors for unsound code, you as human might not be able to understand what is permissible and not anymore. I'd prefer a future solution that is library only at first. A library type SelfCell<Owner, Dependent> that by default boxes its content. Currently this requires macros until GATs allow abstracting over lifetimes. This would be minimally invasive, not requiring new syntax. Not making Rust even more complex to learn. And support for avoiding additional heap indirection could be added later as an optimization via an unsafe marker trait.

2 Likes

Thanks for the feedback.

Indirect ownership

This trait or marker akin to the stable_deref crate must be unsafe. Incorrectly marking something can easily lead to UB.

I'd expect it to be, at least for most types like Vec that use unsafe internally. But I am not sure about wrapper types - if you wrap a Vec, maybe it could be assumed safely. But honestly I didn't think about it too deeply, it's totally possible I'm missing something trivial.

Shallow borrows

I'm a little skeptical if something as fundamental as a new reference type will be added to the language.

Yes, absolutely. Realistically it's just way too shiny future. But this is also the part that I'm the most uncertain with. I'm not even sure if a special new reference type is needed. Maybe it'd be enough to make it work like two-phase borrow, i.e. you don't borrow the indirect data by default, only after you touch it.

I missed a section talking about variance. In my experience as the author of self_cell, the hardest part of ensuring a safe-to-use interface for self-referential-structs is variance.

I definitely skipped a lot of important problems. Variance is especially something I didn't even want to think about, because it's complicated and I wanted to focus on describing the simple case that I'm actually interested in (and what I would expect to be the most common, but that might not be the case). I also skipped Clone, Drop, Debug and other derives, Send+Sync, internal mutability... there's a lot of problems to solve, I didn't really solve anything.

self_cell looks nice, I didn't know about it. I like the simplicity.

SelfCell<Owner, Dependent> is a good start, even though I don't necessarily agree that library solution is less complex than the native one. But it will certainly be more difficult to make the native solution feel less like "adding more complexity" and more like "just removing the restriction" (like GATs and NLL feel), which I would hope for in the end.

1 Like

@mvtec-bergdolll already mentioned ouroboros, but I wanted to reiterate what it can do. In my mind it solves exactly your use case (you do not really care where exactly the memory is stored, you only want easy access to it), if not then it would be interesting where it is not able to fulfill your requirements. A simple ParseFile/Iter example:

use std::slice::Iter;
use ouroboros::self_referencing;

#[self_referencing]
pub struct ParseFile {
    buf: Vec<u8>,
    #[borrows(buf)]
    #[covariant]
    iter: Iter<'this, u8>,
}

fn main() {
    let mut val = ParseFileBuilder {
        buf: b"Hello, world!".to_vec(),
        iter_builder: |buf| buf.iter(),
    }
    .build();
    while let Some(letter) = val.with_iter_mut(|iter| iter.next()) {
        if *letter == b' ' {
            break;
        }
    }
    println!(
        "{:?}",
        val.with_iter_mut(|iter| iter.map(|c| *c as char).collect::<String>())
    );
    let buf = val.into_heads().buf;
    println!("original data: {buf:?}");
}

I feel like this crate already provides you with all the necessary tools to create and use internally boxed (because each field that is self referenced will be stored in the heap to allow moving of the whole struct) self referential structs safely. It also uses a special type of box (custom implementation) to conform to the current SB implementation.

My problem with a language feature approach is that

  1. ouroboros exists and (seems to) solve your problems
  2. creating lang support requires extensive extension of the stdlib (every type that owns its data indirectly must be marked as such and possibly needs to fulfill new invariants): need new api to modify data without moving the contained data (vec already has some of these, but others probably do not)
  3. to me this feels like feature creep. I personally only had very limited problems trying to define self referential data structures where it was irrelevant where the data is stored (I do have problems with fully inline self referential structs, but I think that those should stay in the unsafe world and only need a little bit of new API and updated SB). I see where they can be useful (parsers), but otherwise this feels like a more niche feature.
1 Like

It just simply has a high cost, so it's almost never worth it. Of course, I know that external libraries can solve this use case, but that wasn't really the point. I wanted to explore how this could look in the language for this reason. I guess I should have clarified this in the alternatives section.

This is the key, I think - a big reason why the feature is niche is because it has such a high cost now. If this was in the language natively, I'd probably use it quite a lot. But I am not going to use an external library just to make some API a little bit nicer or a little bit more efficient. Using a macro heavy dependency with a ton of unsafe code just for that is not the right call.

For all use cases I encountered so far, I always chose either cloning, leaked 'static, indexing or I just left the owner on the stack and delt with it in a bit more cumbersome way. All of these felt sub-optimal, but still less costly than a dependency. If we instead had the language support, the compound effect of being able to define simpler APIs could make the language easier to use.

Of course, all usual downsides of adding a language feature still apply, but this was more just idealistic exploration than proposing anything serious.

Do you mean adding this dependency is a high cost? If so, why is adding a dependency such a high barrier?

You clarified it enough and already mentioned rental. My issue is that I feel like there are not good enough reasons for needing this feature as a language feature.

For example async requires some form of compiler support, because one cannot implement it using proc-marco alone. You need to name some unnameable types in the state of the future. You can implement some functionality of futures using proc-macros, but you can only store nameable types in your future state and you will need to enrich each .await with some type information making the library very cumbersome to use.

In contrast ouroboros feels very natural to use, you are protected from using borrowed/borrowing fields in a dangerous way and a language feature implementation would only improve the api slightly: my_struct.borrowed_field().foo() vs my_struct.borrowed_field.foo() (there might be other use cases where this is more pronounced, arguably they might be more significant, but I am unaware of such an example with outrageously bad ergonomics).

It would be good to see some of those examples, because I am not really sure where they would be used when the support was better.

Are you programming in embedded or why do you care about having dependencies? Less dependencies is of course most of the time better, but depending on a crate that has 700,000 recent downloads (ouroboros) feels reasonable to me.

The macro only produces unsafe code to artificially extend the lifetime of the borrow to 'static, which should be sound, as the borrow only lasts as long as the struct exists. The other unsafe code in ouroboros is very minimal and easy to check.

If the API improvement is only a little bit, then why should we create the language feature in the first place? The proposed solution (and to me at least all other possible propositions seem to go the same route) are very complex to implement and require extensive interaction checking. Why invest so much for such a small gain that a library already adequately achieves?

I would argue that the exact opposite would happen, new learners would have difficulty learning yet another complex feature that has a lot of weird interactions with other features that they would then also need to understand. If they never used the feature, only the API you provide, they also would not notice a difference if you used a library instead.

It is good to be idealistic, especially when it comes to language design. But at some point the realism needs to take over. It is just not feasible to a feature for which a good library exists. If you find some issues with those libraries, then one can try to think about improving it/creating a new library, or finding a lang solution.