As part of learning about Rust, I came accross rust-lang #2035. It sounds like an interesting problem, so I'd like to suggest what I think is a new a way to implement Box<A+B+C>
→ Box<A>
or Box<A+…+B+…+C>
→ Box<A+B+C>
. Apologies if this has been considered (and discarded) already. I did search for prior art.
Terminology:
For convenience I'll call traits like "A+B+C
" "sum traits". Simple traits like just "A
" are "unit traits". I'll call v-ptr a fat pointer containing two fields: the v-table pointer and the data pointer.
Relation to prior art
All solutions, like JeffBurdges' and vorner's and this one, are making tradeoffs. In my understanding all considered solutions have one or more of the following downsides:
- Require heap allocation (eg: creating vtables at runtime)
- Constraints on which
A
orB
work (eg: offset-based meansBox<A+B+C>
→Box<A+B>
works butBox<A+B+C>
→Box<A+C>
doesn't) - Require memory in O(trait) (eg: extra fat pointers, aka one pointer per trait)
- Require global full program view (eg)
This solution selects (3), but with a very low multiplicative factor and some optimisations where possible (eg: when heap allocation is available).
TL;DR of the V0/MVP/Simplified version
I'll start presenting a simplified version that should be portable and reasonable as a first solution. I'll expend with some refinements.
Given traits
trait A { fn func_a(); }
trait B { fn func_b1(); fn func_b2(); }
trait C { fn func_c(); }
In this simplified version, there can only been 8 traits in a sum trait. (Note: A later refinement could lift this limitation.)
The V-tables equivalent of a sum trait are called sieve-tables and look as follow. The fat pointers to a sum trait are called sieve-ptr and look as follow.
┌──────────────────┐
│ sieve-ptr of s │
│ as A+B+C │
┌─────────────────┐ ├──────────────────┤
│s of type S │◄──────┤*data │ ┌┬─────────────────────┬┐
├─────────────────┤ │sieve: 0...0111 │ ││sieve-table of A+B+C ││
│... │ │*sievetable ├───────────►├┼─────────────────────┼│
│... │ └──────────────────┘ ││ ││
│... │ ││ ││
│... │ ┌──────────────────┐ ││ ││
│... │ │ sieve-ptr of s │ ││*vtable_a ││
│... │ │downcasted to A+C │ ││*vtable_b ││
│... │ ├──────────────────┤ ││*vtable_c ││
│... │◄──────┤*data │ ││ ││
└─────────────────┘ │sieve: 0...0101 │ ││ ││
│*sievetable ├───────────►││ ││
└──────────────────┘ ││ ││
└┴─────────────────────┴┘
Pointers to "sum traits" (A+B+C
), aka sieve-pointers, have an extra field, the sieve. Compared to pointer to "unit traits" A
aka v-pointers, they are one word larger. (Note: A later refinement makes the sieve-ptrs smaller.)
The sieve bits indicate which subset of the sieve-table is relevant for a given sum-trait. 0...11001
in the sieve of an A+B+C
object means that A
is the first v-table in the sieve-table, B
the fourth v-table in the sieve-table, and C
the fifth v-table in the sieve-table.
A few operations:
- When a function receiving a pointer of type "
A+C+F
", wants to callC::func_c()
, it:- finds the offset of the second bit set to 1 in the sieve (second bit because
C
is the second interface). This takes 2 cycles on x86 usingPDEP
+TZCNT
, if I got Agner right. - adds the offset to the address of the sieve-table to find the address of the v-table of
C
. This costs a dereference. (Note: A later refinement removes that step.) - proceeds normally to call
func_c()
- finds the offset of the second bit set to 1 in the sieve (second bit because
- When a function taking a sum type needs to produce a v-ptr (eg, to call a function taking a unit type) it likewise looks up the correct v-table from the sieve-table and produces a v-ptr.
- When a function taking a sum type needs to produce a narrower sum type (eg from
A+B+C+D+E
toA+C+E
) it: - To access
align_of()
,size_of()
,drop_in_place()
, it:- Dereferences the first vtable (these function are identical across v-tables).
At build time:
- The compiler, just like now, emits sieve-tables when it would have emited a v-table (in the calling crate).
- The linker, just like now, best-effort deduplicate sieve-tables like it best-effort deduplicates v-tables.
Why 8 traits?
The sieve is likely not just 8 bits, so it could accomodate more traits, even without the further refinements. In practice, we could have 16 traits at no cost (without impacting the recomendation of which refinements to implement).
The reason to suggest 8 traits is that it is much easier to increase this number than to decrease it. 8 looks useful enough and keeps options open.
Refinement
ALL THE FOLLOWING ARE OPTIONAL FUTURE IMPROVEMENTS
Refinements are not all available in all contexts.
They are different encoding of the sieve-table suitable for different situations. Each of them mentions: 1/ The benefit it brings. 2/ The constraint it puts on the system 3/ The impact on runtime cost. 4/ The sketch of is implementation.
Together, they may mean that on Linux x86-64, the overhead of most sieve-pointers (with fewer than 44 methods) is 3 cycles per two function calls and 0 memory overhead when compared to v-pointers.
Suggestions (based on gut feelings, benchmarks needed):
- 8 bits system should not implement refinements (I would need to learn about their constraints and features before I can suggest optimisations, but maybe let's not start with that).
- 16 bits system should implement only inlining and limited narrowing, and only if they do not feature branch prediction.
- 32 bits systems should implement inlining and full narrowing, and only if they can heap allocate.
- 64 bits systems should implement bitpacking, inlining and narrowing.
Inlining
- Benefit: removes step 2 of function calls, avoiding the need to go dereference the vtable and lowering overhead of sieve-ptr to 2 cycles on x86 when compared with v-ptr.
- Constraint on the system: Most useful if there are fewer than
$SIEVE_SIZE
methods in the vast majority sum trait (eg: less than 44 methods on x86-64 if also bit-packing). - Runtime cost: Cost of the narrowing operations (defined later) when a trait is transformed from a regular sieve-table to an inlined one.
- How it works: When there are fewer than
$SIEVE_SIZE
methods in a sum trait, inline the methods, so the sieve becomes:
┌──────────────────┐
│ sieve-ptr of s │
│as A+B+C │
├──────────────────┤
│ *data │ ┌┬─────────────────────┬┐
│sieve: 0...1111 │ ││sieve-table of A+B+C ││
│*sievetable ├─────────┬─►├┼─────────────────────┼│
└──────────────────┘ │ ││*v_tables │┼───┐
│ ││A::a() ││ │
┌──────────────────┐ │ ││B::b1() ││ │
│ sieve-ptr of s │ │ ││B::b2() ││ │
│downcasted to A+C │ │ ││C::c() ││ │
├──────────────────┤ │ ││*vtable_a │┤◄──┘
│*data │ │ ││*vtable_b ││
│sieve: 0...1001 │ │ ││*vtable_b ││
│*sievetable ├─────────┘ ││*vtable_c ││
└──────────────────┘ ││ ││
└┴─────────────────────┴┘
Bit-packing
- Benefit: makes the sieve-pointers same size as v-pointers and thus same size as existing fat pointers.
- Constraint on the system: linker needs to be able to place v-tables in a specific section, such that they can be indexed relative to it (I believe this is the case for ELF, wasm).
- Runtime cost: None I believe.
- How it works: the linker is instructed to place sieve-tables in a specific section of the binary, which is less than 1GiB. This makes it possible to refer to a v-table with just 20bits, leaving 12 bits to store our sieve on 32 bit architectures (and 44 on 64 bits archs).
Narrowing limited to de-inlining
Note: This only makes sense on systems that implement inlining.
Terminology: A sieve table using the "inlining" refinement above is said to be an "inlined sieve-table". Non-inlined sieve-tables are called "indirect sieve-tables".
- Benefit: Removes the limit that inlining sets on the number of methods in a sum trait (44 on 64 bits systems).
- Constraint on the system: None.
- Runtime cost:
- No cost for function expecting an indirect sieve-ptr as argument.
- One cycle for functions expecting an inlined sieve-table.
- cost of a branch misprediction on calls of an indirect sieve-ptr from functions expecting an inlined sieve-ptr (free on systems without branch prediction like AVR I think). In concrete terms: this is the cost when a sum trait made from an object with many methods (>44 methods on x86-64) is handed over to a function expecting few methods.
- How it works: Two sieve-tables are generated: one inlined, one not inlined (indirect). A bit is reserved in the sieve to indicate the type of sieve-table. On a method call, first determine the format then use the correct resolution mechanism.
Full narrowing on systems with heap allocation
Note: This is completely optional. It's sketches how to avoid having a cap on the number of trait in a sum type. I haven't fully fleshed it out because 8-trait-should-be-enough-for-everyone-TM and because it raises Interesting Questions.
- Benefit: Removes both constraints on the sieve size (limits on the number of methods or traits) in a sum, by allowing multiple sieve sizes to coexist within a binary.
- Constraint on the system: Requires a heap allocator (see next item for the version that does not heap allocate).
- Runtime cost:
- Converting from a smaller sieve to a larger sieve is free (just pad with 0 bits).
- Converting from a larger sieve to a smaller sieve requires the dynamic allocation of a new sieve-table or its lookup from a cache. Looking up the sieve-table in a cache when a v-ptr is converted from a larger sieve to a smaller sieve.
- How it works: build the narrower sieve-table from the larger sieve-table by copying the bit-selected v-tables and caching the result. It sounds slow, but golang does something like that so presumably it's not too bad.
Full narrowing on systems without heap allocation
Note: This is completely optional. It's sketches how to avoid having a cap on the number of trait in a sum type when heap allocation is not available. I haven't fully fleshed it out because 8-trait-should-be-enough-for-everyone-TM and because it raises Interesting Questions.
- Benefit: Removes both constraints on the sieve size (limits on the number of methods or traits) in a sum, by allowing multiple sieve sizes to coexist within a binary.
- Constraint on the system: None.
- Runtime cost:
- No cost for function expecting an indirect sieve-ptr as argument.
- One cycle for functions expecting an inlined sieve-table.
- On calls of an indirect sieve-ptr from functions expecting an inlined sieve-ptr: cost of one branch misprediction+one function call (something like 20-40 cycles in total I suspect).
- How it works: When the sieve is 0, call a
dispatch()
function instead, which is at a fixed offset in all sieve-tables. Since this is part of the sieve-table, there is one suchdispatch()
for each place where a type is cast into a sum trait. That function takes a global ID of the trait as an argument (the simplest would be its canonical name) and returns a vtable. The dispatch is like vorner's proposal except 1/ it's a fallback so performance is less critical, and 2/ the dispatch function is local with no need for whole program optimization.