Improving self-referential structs


#61

We run into this same issue with gimli where we want to do zero-copy parsing of many DWARF things, but then run into self-referential issues when we want to bundle the raw mmap of the debug info with its parsed form (say to stick them in a filename -> debug-info hashmap together).

We ended up creating a Reader trait that is similar to byte's Buf and BufMut traits. One can implement Reader with reference counting behind the scenes, but we just provide an implementation for slices0. It works OK, but now the whole code base is generic over a Reader and it feels a lot more complicated than it did with raw references.

cc @philipc


With cpp_demangle, I saw this issue coming before I started writing any code and there is a single nice place to encapsulate all generic ownership (where as with gimli there is no clear single place that would be the thing that owns the data or not, it would need to be ref-counted).

I ended up making demangled symbols generic over some T: AsRef<[u8]> and then the internal parsed AST items store indices into the input rather than slices or anything like that:

/// A mangled symbol that has been parsed into an AST.
///
/// This is generic over some storage type `T` which can be either owned or
/// borrowed.
pub struct Symbol<T: AsRef<[u8]> {
    raw: T,

    // These structures store indices pointing into `raw.as_ref()` rather than
    // slices or reference-counted views..
    substitutions: subs::SubstitutionTable,
    parsed: ast::MangledName,
}

/// A `Symbol` which owns the underlying storage for the mangled name.
pub type OwnedSymbol = Symbol<Vec<u8>>;

/// A `Symbol` which borrows the underlying storage for the mangled name.
pub type BorrowedSymbol<'a> = Symbol<&'a [u8]>;

So, I’d like to conclude that:

  1. Yes, this is totally a real problem, and not just in generators and async IO domains.

  2. There are ways to work around this issue. Nevertheless, they are relatively painful for library authors, and require more effort than otherwise.

  3. If a library’s author doesn’t go through that effort, then library consumers are hosed if they need owned versions.


0 Well slice and endianity.


#62

Could there be some rules of thumb obtained from the experience of working with graph-like data structures? I’d expect them to be something like this:

  • If the whole thing is immutable and Send/Sync, access nodes with statically checked borrowing. Access to the parent chain could be passed into recursive processing calls (visitor pattern or otherwise) with something like struct<'a> ParentPath(Vec<&'a Node>).
  • If mutable, but not Sync, implement internal links with Rc/Weak and provide a RefCell-like node mutator API.
  • Mutable and Sync: it’d be nice to learn why such a design should be chosen in preference to combining the other two with conversion traits/methods between them.

#63

Or even without dynamic allocation, though that would make multi-level access cumbersome:

struct ParentChain<'a> {
    node: &'a Node,
    parent: Option<&'a ParentChain<'a>>
};

#64

I’m not sure the mutability or immutability of the structure has much of an impact on the self-referentiality problem, or at least that hasn’t been the case in my experience with rental so far. By a vast majority, most uses of rental use the shared-borrowing flavor.

If an API returns a type to you that is parameterized by lifetime in any way, then your possible uses of that type are constrained in the manner described above. Just using Rc or Arc in all cases, or being generic over some kind of vocabulary trait that you need to define and agree upon, is the only way to be sure that your library can be adapted to the widest range of use cases.

On a more meta ecosystem level, I’m just very sceptical that, even if a convention did exist that would alleviate this problem, that it would achieve universal buy in from all crates, particularly when that workaround requires significantly more effort on the part of the author. And when the failure mode is that your crate is possibly rendered unusable, I believe the cost is too high to just leave it to the ecosystem to work around this issue. However, I acknowledge that this is a matter of opinion.


#66

This would probably have to be an unsafe trait.


#67

In case anyone here missed it:


#68

Simple scenario to run into such a problem.

You want a setup() method for your integration test. From checking several popular github project the idiom is to return a Struct e.g TestEnv from such a method owning the precondition instances. This is just not possible to do in one struct, if one of those preconditions has a reference.

Concrete example. Lets say you are implementing a rest client to a http api with reqwest crate. Lets call it CatsClient. reqwest is managing its own connection pool and you are supposed to init it once and pass a refernce to the Cats Client

    struct CatsClient<'a> {
        http: &'a reqwest::Client
    }

reqwest::Client can be instantiated with default headers and perhaps soon also with a base_url. For testing each of the CatsClient endpoints, you would like to have some default reqwest::Client and also a CatsClient instantiated with some default values for convenience.

fn setup<'a>() -> TestEnv<'a> {
    let http = reqwest::Client.new();
    let cats_client = CatsClient { http: &http  };
    
    TestEnv {
        http: http, 
        cats_client: cats_client
    }
}

There is actually no way to move out both http and the cats_cilent struct out of the function (not even with a tuple), or am i missing something?

Here is some playground


#69

No, there is no way.

  1. Address of let http and http: http is different, so the reference in cats_client points to an old invalid location. The address also changes every time you move the cats_client, and Rust by design wants to move only using memcpy without having to fix internal addresses of objects.
  2. Even if that was fixed (e.g by using Box::new(http)), it would be unsafe, because Rust would allow assignment to testenv.http = new_object that destroys the old object, making reference in CatsClient invalid again.

You have to wrap the HTTP client in Rc<Client> to have it in two places.


#70

Not sure if that’s suitable for you, but you can “disentangle” the owned values from the references, like this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=3f59786cc2a59f68e440742866573c70. There’s a struct IntermediateEnv (sorry, bad at naming here), that holds the owned values and is constructed first, and then TestEnv can just hold references to it. It doesn’t seem totally usefull to me, but might just be enough to get your testing done.


#71
struct MyType {
    f1: i32,
    f2: ^i32 // A weak reference
}
let mut x = MyType { f1: 42, f2: ^f1 };
let f = x.f2;
let y = ^f.clone(); //^f promotes f to either &i32 or &mut i32
//Explicitly promote f to &mut i32, only type check when x is mut
let mut v:&mut i32 = ^f;
//f is still valid as it is still "week", and this is a "copy", not a "move"
let g = f;
//error: immutable borrow whilst mutable borrow in scope
let u: &i32 = ^f;
//error: send a weak reference to a function alone, without the full object
std::mem::drop(f);
//OK, we send both the object and the weak reference
(|v,_| std::mem::drop(v))(x,f);

Reasons behind:

  1. When initialize self-referential structs, we don’t want the usual borrow rules apply - this gives us maximal flexibility to create the struct.
  2. When using those self references however, we need to apply the borrow rules to ensure safety.
  3. So this design is to separate these requirements - by introducing a new reference type - the “weak” reference
  4. A weak reference can be promoted to usual references and enforce borrow rules at the promotion site
  5. A week reference can only be passed to another function as arguments with its owner.

Shifgrethor: garbage collection as a Rust library