Where's the catch with Box<Read + Write>?

Hello

One of my pain points with Rust is that things like Box<Read + Write> don’t work. However, Box<Read + Sync + 'static> is completely fine. I’m pretty sure most of us already tried something like this and were surprised when the compiler complained.

From my point of view, such syntax is already part of the language, even the error message kind of suggests this is not supported yet.

Considering how obvious the thing is and how painful it is to work around it, one has to wonder why nobody implemented this yet.

So my question is, what is the blocking issue with this? Where’s the catch? Or is it just hard to implement (I’d have few ideas how to go about it on the high level, but I don’t know enough about the compiler insides to know if it would actually work and they have some drawbacks)?

If it is just hard, what is the proper way to talk about how to solve it and the steps I could take to do something about it?

7 Likes

Besides the implementation effort, there’s also the question of how you can “narrow down” to more specific trait objects. For example, can Box<Read + Write> be converted to Box<Read> and/or Box<Write>? Supporting conversion to any subset of bounds would be the most sensible thing (and it’s already possible with types like Read+Send+Sync+'static), but supporting it has some big repercussions on the implementation:

  • If you represent Read+Write trait objects as (Read vtable pointer, Write vtable pointer, data pointer) then all upcasts work just fine, but it makes fat pointers much larger (n+1 words if n traits are involved), which is probably not desirable for the majority of use cases.
  • The alternative is to use a single vtable for Read+Write that contains the methods of Read and Write but is distinct from both traits’ vtables – i.e., a trait object is (Read+Write vtable pointer, data pointer. With that strategy, trait object pointers are always two words, but it leads to a combinatorial explosion of necessary vtables (exponential in the number of traits). Furthermore, you can’t instantiate those vtables on demand, because the actual upcast might be in a separate compilation unit, so any time you create a Tr1+Tr2+...+TrN trait object you have to create vtables for all subsets of {Tr1, Tr2, …, TrN} (and even more if those traits have supertypes and we finally permit upcasting of trait objects).
  • There’s also the theoretical option of making Read+Write's vtable pointer refer to a region where Read's vtable pointer and Write's vtable pointer is stored, but then we still need to instantiate all (exponentially many) subsets of vtable pointers, so this is just a constant factor improvement. It also adds an extra indirection to vtable accesses.
3 Likes

This is related to sub-trait coercion in RFC 0401, which is not yet implemented. To implement sub-trait coercion, we have to modify the structure of vtables. In the tracking issue of sub-trait coercion, @nikomatsakis mentions multiple trait bounds on trait object types.

Well, I wasn't really asking about the implementation effort ‒ there were „big“ projects like NLL and incremental compilation, so I think these are fine. But I see there are some open questions, like you said.

I might have a workable idea for that (one that I think isn't mentioned in the linked issue and it seems there's no preference for concrete solution).

Let's say each type gets all the vtables as it has now, but gains another one (except in cases where it is not needed), or automatically implements an additional trait. Furthermore, each trait gets a unique ID ‒ something similar to TypeId, but for traits. We could then have the new vtable/trait as follows (that's more like an example than real code).

trait MultiTraitObject {
   fn lookup_vtable(&self, trait_id: &TraitId) -> Option<&Vtable>;
}

Then the caller could first look up the relevant vtable and then use it (the caller knows that it holds a multi-trait object).

This has the downside of two virtual calls instead of one. This could be partially eliminated by also providing a unique id for each method (conceptually (TraitId, Index), but possibly smaller, if another global counter is used) in a trait and adding another method:

trait MultiTraitObject {
  fn lookup_vtable(&self, trait_id: &TraitId) -> Option<&Vtable>;
  /*
     Rust isn't able to express this properly.
     But this would be implemented not in rust code, but
     generated by the compiler and it knows the proper types on both sides of the call.
  */
  fn call_method(&self, method_id: &MethodId, params: ...) -> ... {
    let method = self.lookup_vtable(method_id.trait())[method_id.method_index()].unwrap();
    (*method)(self, params)
  }
}

Everything inside the call_method is static dispatch with possibility of inlining, so the cost should be small enough. Or smaller than another virtual dispatch. It's still more expensive than just direct single dynamic virtual dispatch.

This way adds just one vtable for the support, with size linear in the number of methods+traits.

  • Single-trait objects still can use ordinary vtables (which are likely to be a bit faster)
  • Narrowing down to smaller multi-trait object is a NO-OP.
  • Narrowing down to a single-trait object can be done by looking up the vtable.
  • This probably allows some more tricks, like „narrowing“ to parent traits, querying if a trait object supports other traits in runtime, etc.

However, I'm really not sure what the actual performance impact would be and how hard this would be to implement.

Or, should I move to the linked issues for further discussion? Write it down in more details and submit as RFC? Or is there some obvious problem with this?

1 Like

Thinking back, I haven’t made one thing clear. The lookup_vtable would be something similar to how match works ‒ let the compiler optimise it to save as much space and speed as possible, but not create a huge table with the size of number of all traits. Something like:

fn lookup_vtable(&self, trait_id: &TraitId) -> Option<&Vtable> {
    match trait_id {
      TraitId::of<Read> => Some(0x42),
      TraitId::of<Write> => Some(0x84),
      _ => None
   }
}

Your lookup_vtable needs to list all the traits that a given type implements, right? That can’t be implemented, as any crate can add new traits and impls to existing types.

2 Likes

Can’t compiler interpret Read+Write as a virtual auto-implemented trait and store pointer to its implementation vtable, and instead of casting to Read we’ll use an explicit conversion which will change vtable pointer to one for Read implementation? Yes, in this case we’ll have to generate two vtables, but cost is linear, not exponential, and depends only on number of trait combinations actually used in the code.

Yeah, but at code generation everything is known.

Not when building the crate that defines the type.

Preferably, yes. But I think we could get away with just the ones listed in the multi-trait.

Is there any change generating the vtable when the real object is casted to the multi-trait object instead at the crate that defines the type? At that point, at least all the needed traits are already known.

It has the downside that we could get multiple different multi-trait-object vtables in the end, each having some subset of the traits.

Or, would it be possible to delay the creation of this table to the main crate being compiled? At that point, all the dependencies are already known, aren't they?

@newpavlov

Not sure what you mean here. A cast to a different trait object will most likely have to change the vtable pointer, unless the original vtable contains the casted-to trait's vtable as a prefix (which isn't possible for multi-trait-objects when casting to different subsets). The implementation strategies I outlined were certainly all meant to change the vtable pointer(s) when casting/converting/whatever you wanna call it.

The problem is that you need to know the concrete underlying type (e.g., File) to emit the vtable for its trait implementation, but when you see the cast Read+Write -> Read, you don't know which concrete type that is (it might even be a different one every time the cast is executed). So you need to emit the Read vtable for every concrete type that is ever turned into a Read+Write trait object, and same for every other vtable that might ever be needed.

I'm not sure whether you're suggesting to analyze the program and (for example) if Read+Write is never casted to Write, then we shouldn't have to instantiate the Write vtable for concrete types that are turned into Read+Write trait objects? That is incompatible with separate compilation, which is not going to go away (more on this in my upcoming replies to other posts in this thread). But even if it worked, the number of vtables would be O(# of concrete types turned into multi-trait object * # of subsets that multi-trait object is upcasted to), so, potentially quadratic in the program size, not linear.

@vorner

As you note this would lead to:

Different copies of the same vtable can already happen today, but because in your strategy a Read+Write vtable might include information for a dozen other traits, trait objects would exhibit different overhead on virtual calls depending on how they were created. Not necessarily a deal breaker, but not exactly desirable either.

vtables are already generated at the location where the type erasure occurs, partly because not all traits are known when compiling the type definition, but also to avoid creating vtables for (trait object, concrete type) combinations that are never actually used.

Something along those lines has been suggested by several people. It's a natural impulse, but it means that compiling a Rust program requires whole-program analysis, i.e., any line of any crate involved can potentially influence how any other line is compiled. That has severe downsides:

  • Parallelization of the compiler becomes much more difficult (and gains an unavoidable serial fraction for merging analysis results from different parts of the program)
  • "Eager" codegen, where monomorphic code in libraries is already compiled to machine code while the library is compiled (rather than waiting) becomes impossible. rustc is doing "eager" codegen currently. We're actually trying to move away from that somewhat (though it's questionable whether that is even possible for dylib crates), but it seems very unwise to close off any implementation options just for this one feature.
  • Implementation details leak out of one function and affect many other functions, negatively impacting code reuse during incremental compilation.

Hm, you are right, in the best case scenario we'll get quadratic number of vtables and at the undesirable cost.

Can't we improve this "double indirection" approach in the following way? Let "combinator" trait objects be represented as the following heap allocated structure:

pub struct CombTraitObject {
    pub data: *mut (),
    pub vtables: [*mut ()],
}

Here length of vtables length is equal to the number of listed traits (with optimization for marker traits, perhaps), e.g. for Read+Write it will be just 2. So if we want to use Read method we'll take pointer vtables[0] and for Write it will be vtables[0].

Now if we have Box<A+B+C+D> and we want to convert it into Box<B+D> we dynamically generate appropriate CombTraitObject with new_vtables = [old_vtables[1], old_vtables[3]]. And if we doing conversion into Box<A> we create the standard single indirection TraitObject struct with an appropriate pointer.

To better distinguish from "single" indirection trait objects we probably better to use not a Box, but a different name.

If I am not missing something here we'll generate the same amount of vtables as we do today.

Dynamically generating means run-time heap allocation, right? That's not possible on targets without heap allocation, and therefore not acceptable for a core language feature.

Additionally, IIUC this approach still has the big per-object space overhead that "fat pointer with one vtable pointer per trait" has, it just makes the trait object pointer slightly cheaper to move (and trades that for other costs, like double indirection and heap allocation).

Yes this appears to be correct, because seems to be a variant of "fat pointer with one vtable pointer per trait" with different data organization.

First, let me step back and state what my motivation here is. The current situation without any multi-trait-object support is unsatisfactory. Even the support without „narrowing“ support would be a great help in most scenarios. I'd like this to start moving somewhere, if I can help to push it a bit, that's even better.

My impression of the current situation is that it's stuck because we don't have an obviously good solution, only several clearly sub-optimal in some ways. But from my point of view:

  • A sub-optimal solution is still better than none.
  • I don't think if we introduce some solution that the compiler would guarantee it'll always use the same way ‒ so if a better solution is found later on, it could be replaced.

The rest of this discussion is an attempt to brain-storm possible solutions ‒ just throwing incomplete ideas around, if something looks promising. Sure, it has problems that need to be solved ‒ for that, replies in the sense „Hey, this has this concrete problem“ are very useful. On the other hand, I'd really like this not to get stuck again on „we don't have the optimal solution, so it can't be done“ mind-set.

From the proposals that I'd seen so far, mine looks with the least severe drawbacks ‒ but that could very well be a personal opinion (both because I have inclination to like my own proposal, and because I might have different notion of the costs).

Different copies of the same vtable can already happen today

Don't these get deduplicated during link time? I hoped that if two crates in the dep graph both use HashMap<u32, u32>, only one monomorphisation gets put into the result and a vtable generated on the spot when converting to trait object looks like a very similar concept to this.

but because in your strategy a Read+Write vtable might include information for a dozen other traits, trait objects would exhibit different overhead on virtual calls depending on how they were created. Not necessarily a deal breaker, but not exactly desirable either.

Agreed. In that sense, it'd probably make sense to create only the slimmest vtables needed (eg. don't include the traits that are not used as part of the multi-trait-object), so the needless overhead is as small as possible. Then we could get worse performance than other multi-trait-objects with the same set of trait only if it got created by the narrowing operation, which I think would be quite rare. Furthermore, I expect the usual set of traits to be rather small in practice and I'd still expect the virtual call itself (opaque to both CPU and compiler optimisations) to dominate the cost. I'll think about a way to measure it without having to implement the support in the compiler.

This still creates a new vtable for each different subset o:f the traits actually used, but we avoid creating all the n² combination of that subset we would have to make just because they could be used by narrowing down. The fact the user must explicitly write each such subset on the conversion site means the number is reasonably limited.

Something along those lines has been suggested by several people. It’s a natural impulse, but it means that compiling a Rust program requires whole-program analysis

Not necessarily to that level. I could imagine (read: I have no idea if this is possible in current rustc) that the type erasure would link to an extern symbol for the vtable, so most of the compilation could happen without knowing the whole set of traits the type implements. Only the generation of the vtable would have to happen at the end, when all the traits for the type are known, and the vtables are likely to be reasonably small bits of code, so that work after this all is gathered together wouldn't be that big.

Something like this was actually a step before what I was thinking. I wanted to avoid the heap allocation (what if we are in no-std situation? We may not have the heap. And, who frees it?) and the double-virtual-indirection.

The other side where it may happen is in the fat pointer itself (eg. before the indirection, not after), but that makes it even more beefy.

Uh. How would that work? A trait object is a type in its own right, so things like &(Read + Write) and Rc<Read + Write> should still work. Would you introduce special names for all these? That doesn't sound right and makes the language irregular (eg. different than Read + Send).

3 Likes

Trait objects with Box already do heap allocation regardless, no?

Hm, you are right. With const generics and a bit of compiler magic we probably can remove additional heap allocation and double indirection. Cost of adding additional pointers to stack allocated TraitObject struct is unfortunate, but it's probably will be the best solution. Although you'll pay this cost only if you use traits combinations (the more traits in your combination the fatter will be the pointer) so I think it fits under "zero-cost".

To be explicit, in this approach we'll change TraitObject struct to:

pub struct TraitObject<const N: usize> {
    pub data: *mut (),
    pub vtables: [*mut (); N],
}

Which with N=1 will be equivalent to the current TraitObject.

1 Like

A good point, quite a few people in the Rust community (me included) tend to fall prey to "perfect is the enemy of good" :upside_down_face:

In terms of implementation strategy? More or less, sure.

However, implementation strategies matter insofar that if no possible implementations strategy for upcasting is acceptable, then we'll probably never have upcasts for multi-trait objects, period. If true (and it looks more and more true to me the longer I think about it) that has implications for the language design. "We'll accept this subset of the feature and work out the generalization later" is quite different from "we let you do this thing but the natural generalization will never be possible, sorry".

See vtable addresses differ based on the number of codegen units · Issue #46139 · rust-lang/rust · GitHub -- QoI issue, currently we don't deduplicate (at least not perfectly), and there are no guarantees.

I think it will be quite common to hold a trait object with a bunch of different traits, then upcast to one specific trait to interact with a library that only needs that one trait.

That still requires analyzing the whole program when you create the vtable at the end. And it assumes there is a unique place to do that -- in the usual work flow that's the leaf binary/staticlib, but AFAICT it doesn't play nice with dylibs, where one library's compilation artifact can be used from multiple binaries. (dylibs are Terrible for many reasons, but we need to keep them at least functional.)

Not all trait objects use Box. &Trait is a thing, for example. It works perfectly fine without any heap allocation.

Also, Box<Struct> -> Box<Trait> performs no heap allocation, it just reuses the heap allocation for the object and adds a vtable pointer "externally". With your strategy, there's a heap allocation any time a trait object pointer is created or upcasted.

Isn't that exactly the "fat pointer with many vtable pointers" approach I described at the start?

(Which isn't blocked on const generics because it's compiler built-in.)

2 Likes

In the end, yes, it is.

Is it? IIUC we will have to change std::raw::TraitObject, which will be blocked on const generics if we want to reduce amount of compiler magic to the minimum.

Oh. I forgot about that because it's an obscure and unstable piece of library code that isn't used very often. We don't have to extend it to multi-trait objects, we can just say it's for trait objects with a single non-auto trait, and all other trait objects have unspecified layout. This is especially useful if we want to keep the implementation details fluid, like @vorner suggested.

We could even deprecate and remove it. It's a very brittle and limited API.

To clarify the compiler jargon, upcasting means the narrowing? If so ‒ I think most of the mentioned strategies would be sub-optimal but workable solutions. Especially if they can be made to pay for the cost only if they are actually used.

Most libraries expose the static-generics, so (if I look at it correctly) if I have Box<Read + Write> and want to call serde_json::to_writer<W: Write> on it, I don't need to create narrowed-down trait object, I just generate another monomorfization of to_writer that takes W = Box<Read + Write>. Because that one implements Write.

Therefore, I think the need to convert to narrower trait object is useful mostly in cases where I have a collection of some trait objects and want to get a collection with more general (eg less trait-constrained) trait objects. The mentioned strategies have little problems with supporting conversion to single-trait objects. So we really are talking about something a bit rare ‒ upcasting from lest say 4-trait object to 2-trait object.