Improving self-referential structs

This would probably have to be an unsafe trait.

In case anyone here missed it:

1 Like

Simple scenario to run into such a problem.

You want a setup() method for your integration test. From checking several popular github project the idiom is to return a Struct e.g TestEnv from such a method owning the precondition instances. This is just not possible to do in one struct, if one of those preconditions has a reference.

Concrete example. Lets say you are implementing a rest client to a http api with reqwest crate. Lets call it CatsClient. reqwest is managing its own connection pool and you are supposed to init it once and pass a refernce to the Cats Client

    struct CatsClient<'a> {
        http: &'a reqwest::Client
    }

reqwest::Client can be instantiated with default headers and perhaps soon also with a base_url. For testing each of the CatsClient endpoints, you would like to have some default reqwest::Client and also a CatsClient instantiated with some default values for convenience.

fn setup<'a>() -> TestEnv<'a> {
    let http = reqwest::Client.new();
    let cats_client = CatsClient { http: &http  };
    
    TestEnv {
        http: http, 
        cats_client: cats_client
    }
}

There is actually no way to move out both http and the cats_cilent struct out of the function (not even with a tuple), or am i missing something?

Here is some playground

No, there is no way.

  1. Address of let http and http: http is different, so the reference in cats_client points to an old invalid location. The address also changes every time you move the cats_client, and Rust by design wants to move only using memcpy without having to fix internal addresses of objects.
  2. Even if that was fixed (e.g by using Box::new(http)), it would be unsafe, because Rust would allow assignment to testenv.http = new_object that destroys the old object, making reference in CatsClient invalid again.

You have to wrap the HTTP client in Rc<Client> to have it in two places.

Not sure if that’s suitable for you, but you can “disentangle” the owned values from the references, like this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=3f59786cc2a59f68e440742866573c70. There’s a struct IntermediateEnv (sorry, bad at naming here), that holds the owned values and is constructed first, and then TestEnv can just hold references to it. It doesn’t seem totally usefull to me, but might just be enough to get your testing done.

struct MyType {
    f1: i32,
    f2: ^i32 // A weak reference
}
let mut x = MyType { f1: 42, f2: ^f1 };
let f = x.f2;
let y = ^f.clone(); //^f promotes f to either &i32 or &mut i32
//Explicitly promote f to &mut i32, only type check when x is mut
let mut v:&mut i32 = ^f;
//f is still valid as it is still "week", and this is a "copy", not a "move"
let g = f;
//error: immutable borrow whilst mutable borrow in scope
let u: &i32 = ^f;
//error: send a weak reference to a function alone, without the full object
std::mem::drop(f);
//OK, we send both the object and the weak reference
(|v,_| std::mem::drop(v))(x,f);

Reasons behind:

  1. When initialize self-referential structs, we don’t want the usual borrow rules apply - this gives us maximal flexibility to create the struct.
  2. When using those self references however, we need to apply the borrow rules to ensure safety.
  3. So this design is to separate these requirements - by introducing a new reference type - the “weak” reference
  4. A weak reference can be promoted to usual references and enforce borrow rules at the promotion site
  5. A week reference can only be passed to another function as arguments with its owner.

Hi,

I just encountered this problem in my work and am trying to see what kind of solutions/workarounds we have for it in October 2018 (or in Rust 2018).

My particular scenario is that I want to store a String as parsed AST of that string, where the parser produces slices to that String.

So, something like this:

pub struct FluentResource {
    source: String,
    pub ast: Vec<&str>,
}

where the ast stores slices that are relatively costly to compute, so I’d like to either compute them lazily or eagerly but only once (thus the “store the String and add get_ast method” will not work).

What are my options?

1 Like

I don’t think much has changed:

A quick and easy way out would be to replace the &strs with (usize, usize)s, and just index into source with them on demand.

Alternatively you could try the rental crate mentioned at the start of this thread—AFAIK it’s still the state of the art in actual self-referential structs like this.

Or, you might try separating the String and the AST into separate structs and just letting the consumer of the API deal with it.

This is kind of what I’m doing now, but I must say that it feels quirky to force people to do:

let sources = vec![];
let bundle = L10nBundle::new();

for path in paths {
  sources.push(read_file(path));
}

for source in sources {
  let res = FluentResource::from_string(source);
  bundle.add_resource(res);
}

bundle should not be storing the Strings because in practice many bundles are constructed out of overlapping FluentResource structs, so it’s better to have a cache of parsed FluentResource structs and make bundle production cheap.

But then I have this beautiful struct that is exactly what I think I need to feed it a String and get the AST, except that it doesn’t work in Rust. Aggghhh :slight_smile:

Another solution is to apply a small smattering of unsafe, as is done here:

Uuu, that’s tempting. Does this looks reasonable then?

pub struct FluentResource<'resource> {
    source: String,
    pub ast: ast::Resource<'resource>,
}

impl<'resource> FluentResource<'resource> {
    pub fn from_string(source: String) -> Result<Self, (Self, Vec<ParserError>)> {
        let ast = match parse(&source) {
            Ok(ast) => ast,
            Err((ast, errors)) => ast,
        };

        let ast = unsafe { mem::transmute(ast) };

        Ok(FluentResource { source, ast })
    }
}

I think like this:

pub struct FluentResource {
    source: String,
    ast: ast::Resource<'static>,
}

And add a getter for ast that ensures the 'static lifetime isn’t visible to users (eg like ResponseData::parsed in tokio-imap).

I don’t know why tokio-imap made the response member public, seems dangerous.

Hmm, why <'static>? the AST should live only as long as the FluentResource::source String does, no?

It should, yes. Unfortunately, adding a lifetime param to your struct as above does not ensure that. The lifetime parameter isn’t deduced from any of the parameters of the from_string method, so the compiler has no idea what lifetime to use, and has to more or less conjure one from thin air. This is called an “unbounded” lifetime and is extremely dangerous when used with unsafe code, since the compiler has no idea where the lifetime comes from and thus can’t ensure that it’s used correctly. 'static is a slightly better choice in that it won’t arbitrarily bind to anything, but it’s still a “false” lifetime, and can lead to UB if the API around it isn’t carefully controlled.

Dealing with all of this mess is what rental is for, so if you want to create such a struct, I strongly recommend using rental, as it will expose a safe API for the struct and you can be assured that the data will not be misused.

You can just as easily write a wrapper around the Vec<(usize, usize)> version to get &strs out, and the only performance impact is the cost of adding the index offset to the address of the string. I doubt its worth the considerable safety complexity to get rid of that addition.

(That is to say, if you store it as a vector of index/len pairs, &self.resource[tuple.0][...tuple.1] is literally just the string addr added to the first member of the tuple, and the optimizer can probably figure that out. Even if it can’t, that’s not a very expensive bit of code. I’d want real benchmarks before I start pretending something has a 'static lifetime)

I mean, this is a variant lifetime, so 'static is not safer - its equally unsafe. But the code with a new lifetime parameter is definitely not safer than using 'static, and it just puts a lifetime in the API unnecessarily.

3 Likes

Would it need to verify it is a character boundary too?

True because its strings, I still doubt this is worth dropping into unsafe self references.

2 Likes

Transmute without turbofish is pretty scary, since it could easily be UB. I’m guessing you’re widening a lifetime here? You should covert it into a Box<str> with into_boxed_str() to pin it in the heap, if that’s the case.

1 Like

Uhhhhhhhhhhhh.

Okay, thank goodness at least one other person finds this hair-raising.

Here is UB in tokio_imap:

extern crate tokio_imap;
extern crate tokio_codec;
extern crate bytes;

use std::fmt::Debug;
use tokio_codec::Decoder;
use tokio_imap::proto::{ImapCodec};

// Returns Response<'static>
fn make_bomb() -> impl Debug {
    use bytes::{BytesMut, BufMut};
    let mut buf = BytesMut::with_capacity(1_000_000_000);
    buf.put("* OK [UNSEEN 3] Message 3 is first unseen\r\n");

    let mut decoder = ImapCodec::default();
    let mut result = decoder.decode(&mut buf).unwrap().unwrap();

    println!("{:?}", result.response); // seems legit
    result.response
} // result is freed. uh oh.

fn main() {
    let bomb = make_bomb();
    for _ in 0..10 {
        // make other allocations that zero out memory
        make_bomb();
    }
    println!("{:?}", bomb);
}

Release output:

Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("Message 3 is first unseen") }
Data { status: Ok, code: Some(Unseen(3)), information: Some("\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}") }

Edit: created https://github.com/djc/tokio-imap/issues/30

3 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.