Improving self-referential structs

I don't know, maybe. I'd like to see what that would look like. Say the parser today returns a &str. With the bytes crate, it would instead return a Buf that is the same subslice that &str references. But probably what you want it to return is AsRef<str> (or more generally, AsRef<[u8]>). So to that end, the Bytes type itself would not be surfaced in the return types. But I've not thought this through fully. It seems like some abstraction over owned vs borrowed underlying types could be made.

The performance concerns still stand, of course. But I take @withoutboats's point that it's moot to compare this performance vs support in the language, particularly since we don't even have a design for the language support (let alone the implementation and its performance implications).

But I'm with @jpernst - I'd like folks to acknowledge that this is a real problem (fortunately not too widespread overall but unfortunately a difficult one), and to continue brainstorming of how it might be solved at the language level. I'm hopeful that we don't have to resort to a library solution for what's otherwise a language shortcoming.

6 Likes

We run into this same issue with gimli where we want to do zero-copy parsing of many DWARF things, but then run into self-referential issues when we want to bundle the raw mmap of the debug info with its parsed form (say to stick them in a filename -> debug-info hashmap together).

We ended up creating a Reader trait that is similar to byte's Buf and BufMut traits. One can implement Reader with reference counting behind the scenes, but we just provide an implementation for slices0. It works OK, but now the whole code base is generic over a Reader and it feels a lot more complicated than it did with raw references.

cc @philipc


With cpp_demangle, I saw this issue coming before I started writing any code and there is a single nice place to encapsulate all generic ownership (where as with gimli there is no clear single place that would be the thing that owns the data or not, it would need to be ref-counted).

I ended up making demangled symbols generic over some T: AsRef<[u8]> and then the internal parsed AST items store indices into the input rather than slices or anything like that:

/// A mangled symbol that has been parsed into an AST.
///
/// This is generic over some storage type `T` which can be either owned or
/// borrowed.
pub struct Symbol<T: AsRef<[u8]> {
    raw: T,

    // These structures store indices pointing into `raw.as_ref()` rather than
    // slices or reference-counted views..
    substitutions: subs::SubstitutionTable,
    parsed: ast::MangledName,
}

/// A `Symbol` which owns the underlying storage for the mangled name.
pub type OwnedSymbol = Symbol<Vec<u8>>;

/// A `Symbol` which borrows the underlying storage for the mangled name.
pub type BorrowedSymbol<'a> = Symbol<&'a [u8]>;

So, I’d like to conclude that:

  1. Yes, this is totally a real problem, and not just in generators and async IO domains.

  2. There are ways to work around this issue. Nevertheless, they are relatively painful for library authors, and require more effort than otherwise.

  3. If a library’s author doesn’t go through that effort, then library consumers are hosed if they need owned versions.


0 Well slice and endianity.

5 Likes

Could there be some rules of thumb obtained from the experience of working with graph-like data structures? I’d expect them to be something like this:

  • If the whole thing is immutable and Send/Sync, access nodes with statically checked borrowing. Access to the parent chain could be passed into recursive processing calls (visitor pattern or otherwise) with something like struct<'a> ParentPath(Vec<&'a Node>).
  • If mutable, but not Sync, implement internal links with Rc/Weak and provide a RefCell-like node mutator API.
  • Mutable and Sync: it’d be nice to learn why such a design should be chosen in preference to combining the other two with conversion traits/methods between them.

Or even without dynamic allocation, though that would make multi-level access cumbersome:

struct ParentChain<'a> {
    node: &'a Node,
    parent: Option<&'a ParentChain<'a>>
};

I’m not sure the mutability or immutability of the structure has much of an impact on the self-referentiality problem, or at least that hasn’t been the case in my experience with rental so far. By a vast majority, most uses of rental use the shared-borrowing flavor.

If an API returns a type to you that is parameterized by lifetime in any way, then your possible uses of that type are constrained in the manner described above. Just using Rc or Arc in all cases, or being generic over some kind of vocabulary trait that you need to define and agree upon, is the only way to be sure that your library can be adapted to the widest range of use cases.

On a more meta ecosystem level, I’m just very sceptical that, even if a convention did exist that would alleviate this problem, that it would achieve universal buy in from all crates, particularly when that workaround requires significantly more effort on the part of the author. And when the failure mode is that your crate is possibly rendered unusable, I believe the cost is too high to just leave it to the ecosystem to work around this issue. However, I acknowledge that this is a matter of opinion.

1 Like

This would probably have to be an unsafe trait.

In case anyone here missed it:

1 Like

Simple scenario to run into such a problem.

You want a setup() method for your integration test. From checking several popular github project the idiom is to return a Struct e.g TestEnv from such a method owning the precondition instances. This is just not possible to do in one struct, if one of those preconditions has a reference.

Concrete example. Lets say you are implementing a rest client to a http api with reqwest crate. Lets call it CatsClient. reqwest is managing its own connection pool and you are supposed to init it once and pass a refernce to the Cats Client

    struct CatsClient<'a> {
        http: &'a reqwest::Client
    }

reqwest::Client can be instantiated with default headers and perhaps soon also with a base_url. For testing each of the CatsClient endpoints, you would like to have some default reqwest::Client and also a CatsClient instantiated with some default values for convenience.

fn setup<'a>() -> TestEnv<'a> {
    let http = reqwest::Client.new();
    let cats_client = CatsClient { http: &http  };
    
    TestEnv {
        http: http, 
        cats_client: cats_client
    }
}

There is actually no way to move out both http and the cats_cilent struct out of the function (not even with a tuple), or am i missing something?

Here is some playground

1 Like

No, there is no way.

  1. Address of let http and http: http is different, so the reference in cats_client points to an old invalid location. The address also changes every time you move the cats_client, and Rust by design wants to move only using memcpy without having to fix internal addresses of objects.
  2. Even if that was fixed (e.g by using Box::new(http)), it would be unsafe, because Rust would allow assignment to testenv.http = new_object that destroys the old object, making reference in CatsClient invalid again.

You have to wrap the HTTP client in Rc<Client> to have it in two places.

Not sure if that’s suitable for you, but you can “disentangle” the owned values from the references, like this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=3f59786cc2a59f68e440742866573c70. There’s a struct IntermediateEnv (sorry, bad at naming here), that holds the owned values and is constructed first, and then TestEnv can just hold references to it. It doesn’t seem totally usefull to me, but might just be enough to get your testing done.

struct MyType {
    f1: i32,
    f2: ^i32 // A weak reference
}
let mut x = MyType { f1: 42, f2: ^f1 };
let f = x.f2;
let y = ^f.clone(); //^f promotes f to either &i32 or &mut i32
//Explicitly promote f to &mut i32, only type check when x is mut
let mut v:&mut i32 = ^f;
//f is still valid as it is still "week", and this is a "copy", not a "move"
let g = f;
//error: immutable borrow whilst mutable borrow in scope
let u: &i32 = ^f;
//error: send a weak reference to a function alone, without the full object
std::mem::drop(f);
//OK, we send both the object and the weak reference
(|v,_| std::mem::drop(v))(x,f);

Reasons behind:

  1. When initialize self-referential structs, we don’t want the usual borrow rules apply - this gives us maximal flexibility to create the struct.
  2. When using those self references however, we need to apply the borrow rules to ensure safety.
  3. So this design is to separate these requirements - by introducing a new reference type - the “weak” reference
  4. A weak reference can be promoted to usual references and enforce borrow rules at the promotion site
  5. A week reference can only be passed to another function as arguments with its owner.

Hi,

I just encountered this problem in my work and am trying to see what kind of solutions/workarounds we have for it in October 2018 (or in Rust 2018).

My particular scenario is that I want to store a String as parsed AST of that string, where the parser produces slices to that String.

So, something like this:

pub struct FluentResource {
    source: String,
    pub ast: Vec<&str>,
}

where the ast stores slices that are relatively costly to compute, so I’d like to either compute them lazily or eagerly but only once (thus the “store the String and add get_ast method” will not work).

What are my options?

1 Like

I don’t think much has changed:

A quick and easy way out would be to replace the &strs with (usize, usize)s, and just index into source with them on demand.

Alternatively you could try the rental crate mentioned at the start of this thread—AFAIK it’s still the state of the art in actual self-referential structs like this.

Or, you might try separating the String and the AST into separate structs and just letting the consumer of the API deal with it.

This is kind of what I'm doing now, but I must say that it feels quirky to force people to do:

let sources = vec![];
let bundle = L10nBundle::new();

for path in paths {
  sources.push(read_file(path));
}

for source in sources {
  let res = FluentResource::from_string(source);
  bundle.add_resource(res);
}

bundle should not be storing the Strings because in practice many bundles are constructed out of overlapping FluentResource structs, so it's better to have a cache of parsed FluentResource structs and make bundle production cheap.

But then I have this beautiful struct that is exactly what I think I need to feed it a String and get the AST, except that it doesn't work in Rust. Aggghhh :slight_smile:

Another solution is to apply a small smattering of unsafe, as is done here:

Uuu, that’s tempting. Does this looks reasonable then?

pub struct FluentResource<'resource> {
    source: String,
    pub ast: ast::Resource<'resource>,
}

impl<'resource> FluentResource<'resource> {
    pub fn from_string(source: String) -> Result<Self, (Self, Vec<ParserError>)> {
        let ast = match parse(&source) {
            Ok(ast) => ast,
            Err((ast, errors)) => ast,
        };

        let ast = unsafe { mem::transmute(ast) };

        Ok(FluentResource { source, ast })
    }
}

I think like this:

pub struct FluentResource {
    source: String,
    ast: ast::Resource<'static>,
}

And add a getter for ast that ensures the 'static lifetime isn’t visible to users (eg like ResponseData::parsed in tokio-imap).

I don’t know why tokio-imap made the response member public, seems dangerous.

Hmm, why <'static>? the AST should live only as long as the FluentResource::source String does, no?

It should, yes. Unfortunately, adding a lifetime param to your struct as above does not ensure that. The lifetime parameter isn’t deduced from any of the parameters of the from_string method, so the compiler has no idea what lifetime to use, and has to more or less conjure one from thin air. This is called an “unbounded” lifetime and is extremely dangerous when used with unsafe code, since the compiler has no idea where the lifetime comes from and thus can’t ensure that it’s used correctly. 'static is a slightly better choice in that it won’t arbitrarily bind to anything, but it’s still a “false” lifetime, and can lead to UB if the API around it isn’t carefully controlled.

Dealing with all of this mess is what rental is for, so if you want to create such a struct, I strongly recommend using rental, as it will expose a safe API for the struct and you can be assured that the data will not be misused.

You can just as easily write a wrapper around the Vec<(usize, usize)> version to get &strs out, and the only performance impact is the cost of adding the index offset to the address of the string. I doubt its worth the considerable safety complexity to get rid of that addition.

(That is to say, if you store it as a vector of index/len pairs, &self.resource[tuple.0][...tuple.1] is literally just the string addr added to the first member of the tuple, and the optimizer can probably figure that out. Even if it can't, that's not a very expensive bit of code. I'd want real benchmarks before I start pretending something has a 'static lifetime)

I mean, this is a variant lifetime, so 'static is not safer - its equally unsafe. But the code with a new lifetime parameter is definitely not safer than using 'static, and it just puts a lifetime in the API unnecessarily.

3 Likes