Sieve-tables for multiple traits objects (Box<A+B+C> to Box<A+C>, #2035)

unbrice · September 29, 2021, 3:02pm

As part of learning about Rust, I came accross rust-lang #2035. It sounds like an interesting problem, so I'd like to suggest what I think is a new a way to implement Box<A+B+C> → Box<A> or Box<A+…+B+…+C> → Box<A+B+C>. Apologies if this has been considered (and discarded) already. I did search for prior art.

Terminology: For convenience I'll call traits like "A+B+C" "sum traits". Simple traits like just "A" are "unit traits". I'll call v-ptr a fat pointer containing two fields: the v-table pointer and the data pointer.

Relation to prior art

All solutions, like JeffBurdges' and vorner's and this one, are making tradeoffs. In my understanding all considered solutions have one or more of the following downsides:

Require heap allocation (eg: creating vtables at runtime)
Constraints on which A or B work (eg: offset-based means Box<A+B+C>→Box<A+B> works but Box<A+B+C>→Box<A+C> doesn't)
Require memory in O(trait) (eg: extra fat pointers, aka one pointer per trait)
Require global full program view (eg)

This solution selects (3), but with a very low multiplicative factor and some optimisations where possible (eg: when heap allocation is available).

TL;DR of the V0/MVP/Simplified version

I'll start presenting a simplified version that should be portable and reasonable as a first solution. I'll expend with some refinements.

Given traits

trait A { fn func_a(); }
trait B { fn func_b1(); fn func_b2(); }
trait C { fn func_c(); }

In this simplified version, there can only been 8 traits in a sum trait. (Note: A later refinement could lift this limitation.)

The V-tables equivalent of a sum trait are called sieve-tables and look as follow. The fat pointers to a sum trait are called sieve-ptr and look as follow.

                          ┌──────────────────┐
                          │ sieve-ptr of s   │
                          │ as A+B+C         │
┌─────────────────┐       ├──────────────────┤
│s of type S      │◄──────┤*data             │            ┌┬─────────────────────┬┐
├─────────────────┤       │sieve: 0...0111   │            ││sieve-table of A+B+C ││
│...              │       │*sievetable       ├───────────►├┼─────────────────────┼│
│...              │       └──────────────────┘            ││                     ││
│...              │                                       ││                     ││
│...              │       ┌──────────────────┐            ││                     ││
│...              │       │ sieve-ptr of s   │            ││*vtable_a            ││
│...              │       │downcasted to A+C │            ││*vtable_b            ││
│...              │       ├──────────────────┤            ││*vtable_c            ││
│...              │◄──────┤*data             │            ││                     ││
└─────────────────┘       │sieve: 0...0101   │            ││                     ││
                          │*sievetable       ├───────────►││                     ││
                          └──────────────────┘            ││                     ││
                                                          └┴─────────────────────┴┘

Pointers to "sum traits" (A+B+C), aka sieve-pointers, have an extra field, the sieve. Compared to pointer to "unit traits" A aka v-pointers, they are one word larger. (Note: A later refinement makes the sieve-ptrs smaller.)

The sieve bits indicate which subset of the sieve-table is relevant for a given sum-trait. 0...11001 in the sieve of an A+B+C object means that A is the first v-table in the sieve-table, B the fourth v-table in the sieve-table, and C the fifth v-table in the sieve-table.

A few operations:

When a function receiving a pointer of type "A+C+F", wants to call C::func_c(), it:
1. finds the offset of the second bit set to 1 in the sieve (second bit because C is the second interface). This takes 2 cycles on x86 using PDEP+TZCNT, if I got Agner right.
2. adds the offset to the address of the sieve-table to find the address of the v-table of C. This costs a dereference. (Note: A later refinement removes that step.)
3. proceeds normally to call func_c()
When a function taking a sum type needs to produce a v-ptr (eg, to call a function taking a unit type) it likewise looks up the correct v-table from the sieve-table and produces a v-ptr.
When a function taking a sum type needs to produce a narrower sum type (eg from A+B+C+D+E to A+C+E) it:
1. clears the corresponding bits from the field. So the sieve 000...11111 becomes 000...10101. (1 cycle on x86 using PDEP if I got Agner right)
To access align_of(), size_of(), drop_in_place(), it:
1. Dereferences the first vtable (these function are identical across v-tables).

At build time:

The compiler, just like now, emits sieve-tables when it would have emited a v-table (in the calling crate).
The linker, just like now, best-effort deduplicate sieve-tables like it best-effort deduplicates v-tables.

Why 8 traits?

The sieve is likely not just 8 bits, so it could accomodate more traits, even without the further refinements. In practice, we could have 16 traits at no cost (without impacting the recomendation of which refinements to implement).

The reason to suggest 8 traits is that it is much easier to increase this number than to decrease it. 8 looks useful enough and keeps options open.

Refinement

ALL THE FOLLOWING ARE OPTIONAL FUTURE IMPROVEMENTS

Refinements are not all available in all contexts.

They are different encoding of the sieve-table suitable for different situations. Each of them mentions: 1/ The benefit it brings. 2/ The constraint it puts on the system 3/ The impact on runtime cost. 4/ The sketch of is implementation.

Together, they may mean that on Linux x86-64, the overhead of most sieve-pointers (with fewer than 44 methods) is 3 cycles per two function calls and 0 memory overhead when compared to v-pointers.

Suggestions (based on gut feelings, benchmarks needed):

8 bits system should not implement refinements (I would need to learn about their constraints and features before I can suggest optimisations, but maybe let's not start with that).
16 bits system should implement only inlining and limited narrowing, and only if they do not feature branch prediction.
32 bits systems should implement inlining and full narrowing, and only if they can heap allocate.
64 bits systems should implement bitpacking, inlining and narrowing.

Inlining

Benefit: removes step 2 of function calls, avoiding the need to go dereference the vtable and lowering overhead of sieve-ptr to 2 cycles on x86 when compared with v-ptr.
Constraint on the system: Most useful if there are fewer than $SIEVE_SIZE methods in the vast majority sum trait (eg: less than 44 methods on x86-64 if also bit-packing).
Runtime cost: Cost of the narrowing operations (defined later) when a trait is transformed from a regular sieve-table to an inlined one.
How it works: When there are fewer than $SIEVE_SIZE methods in a sum trait, inline the methods, so the sieve becomes:

┌──────────────────┐
│ sieve-ptr of s   │
│as A+B+C          │
├──────────────────┤
│ *data            │            ┌┬─────────────────────┬┐
│sieve: 0...1111   │            ││sieve-table of A+B+C ││
│*sievetable       ├─────────┬─►├┼─────────────────────┼│
└──────────────────┘         │  ││*v_tables            │┼───┐
                             │  ││A::a()               ││   │
┌──────────────────┐         │  ││B::b1()              ││   │
│ sieve-ptr of s   │         │  ││B::b2()              ││   │
│downcasted to A+C │         │  ││C::c()               ││   │
├──────────────────┤         │  ││*vtable_a            │┤◄──┘
│*data             │         │  ││*vtable_b            ││
│sieve: 0...1001   │         │  ││*vtable_b            ││
│*sievetable       ├─────────┘  ││*vtable_c            ││
└──────────────────┘            ││                     ││
                                └┴─────────────────────┴┘

Bit-packing

Benefit: makes the sieve-pointers same size as v-pointers and thus same size as existing fat pointers.
Constraint on the system: linker needs to be able to place v-tables in a specific section, such that they can be indexed relative to it (I believe this is the case for ELF, wasm).
Runtime cost: None I believe.
How it works: the linker is instructed to place sieve-tables in a specific section of the binary, which is less than 1GiB. This makes it possible to refer to a v-table with just 20bits, leaving 12 bits to store our sieve on 32 bit architectures (and 44 on 64 bits archs).

Narrowing limited to de-inlining

Note: This only makes sense on systems that implement inlining.

Terminology: A sieve table using the "inlining" refinement above is said to be an "inlined sieve-table". Non-inlined sieve-tables are called "indirect sieve-tables".

Benefit: Removes the limit that inlining sets on the number of methods in a sum trait (44 on 64 bits systems).
Constraint on the system: None.
Runtime cost:
- No cost for function expecting an indirect sieve-ptr as argument.
- One cycle for functions expecting an inlined sieve-table.
- cost of a branch misprediction on calls of an indirect sieve-ptr from functions expecting an inlined sieve-ptr (free on systems without branch prediction like AVR I think). In concrete terms: this is the cost when a sum trait made from an object with many methods (>44 methods on x86-64) is handed over to a function expecting few methods.
How it works: Two sieve-tables are generated: one inlined, one not inlined (indirect). A bit is reserved in the sieve to indicate the type of sieve-table. On a method call, first determine the format then use the correct resolution mechanism.

Full narrowing on systems with heap allocation

Note: This is completely optional. It's sketches how to avoid having a cap on the number of trait in a sum type. I haven't fully fleshed it out because 8-trait-should-be-enough-for-everyone-TM and because it raises Interesting Questions.

Benefit: Removes both constraints on the sieve size (limits on the number of methods or traits) in a sum, by allowing multiple sieve sizes to coexist within a binary.
Constraint on the system: Requires a heap allocator (see next item for the version that does not heap allocate).
Runtime cost:
- Converting from a smaller sieve to a larger sieve is free (just pad with 0 bits).
- Converting from a larger sieve to a smaller sieve requires the dynamic allocation of a new sieve-table or its lookup from a cache. Looking up the sieve-table in a cache when a v-ptr is converted from a larger sieve to a smaller sieve.
How it works: build the narrower sieve-table from the larger sieve-table by copying the bit-selected v-tables and caching the result. It sounds slow, but golang does something like that so presumably it's not too bad.

Full narrowing on systems without heap allocation

Note: This is completely optional. It's sketches how to avoid having a cap on the number of trait in a sum type when heap allocation is not available. I haven't fully fleshed it out because 8-trait-should-be-enough-for-everyone-TM and because it raises Interesting Questions.

Benefit: Removes both constraints on the sieve size (limits on the number of methods or traits) in a sum, by allowing multiple sieve sizes to coexist within a binary.
Constraint on the system: None.
Runtime cost:
- No cost for function expecting an indirect sieve-ptr as argument.
- One cycle for functions expecting an inlined sieve-table.
- On calls of an indirect sieve-ptr from functions expecting an inlined sieve-ptr: cost of one branch misprediction+one function call (something like 20-40 cycles in total I suspect).
How it works: When the sieve is 0, call a dispatch() function instead, which is at a fixed offset in all sieve-tables. Since this is part of the sieve-table, there is one such dispatch() for each place where a type is cast into a sum trait. That function takes a global ID of the trait as an argument (the simplest would be its canonical name) and returns a vtable. The dispatch is like vorner's proposal except 1/ it's a fallback so performance is less critical, and 2/ the dispatch function is local with no need for whole program optimization.

newpavlov · September 29, 2021, 10:40pm

If I understand the proposed approach correctly, then it results in an additional misdirection. On modern CPUs such pointer-chasing misdirection has a higher cost than having several additional pointers on stack as with the fatter pointers (the third listed approach). Also implementation-wise it's significantly more complex than generalizing TraitObject to work with several pointers. IIUC compiler would have to track all "sum traits" across crates to generate efficient sieve tables, while with the fatter pointers no such tracking is required. Another disadvantage of the proposed approach is that it requires a total order of traits, i.e. you need a way to normalize B + A + C and C + B + A to A + B + C, which again requires some kind of global tracking. I guess you could simply use something like mangled trait names for ordering.

unbrice · September 30, 2021, 3:59pm

First, thanks for reading such a long diatribe and for taking the time to respond

You are referring to step 2 of the function call right? If so, then that is addressed by the "Inlining" refinement. As per the proposal, I expect an overhead of 2 cycles on x86_64 using PDEP and TZCNT. Glancing at eg: ARM led me to a similar conclusion.

Actually I expect the cost of indirections to affect extra-fat pointers more, as each of them needs to be dereferenced, and for cache locality reason (the sieve-table is contiguous). Of course, only a benchmark would tell what is the actual penalty and which of the various implementation is the fastest. But designing a proper benchmark is non-trivial and if we get that far I'll need to ask for help.

Possibly? But I don't think different approaches need to be opposed. IMHO it is a good thing for a language feature to have for example a simple portable implementation and the potential of an optimised implementation on some environments.
This being said if we can already conclude that sieve-ptr are both slower and more complex than extra-fat pointers then we should discard them. But I haven't been able to reach that conclusion (yet), as per previous point.

I don't think so no. With that proposal I tried hard that no cross-crate tracking is required. A sieve-table is generated by crate, as v-tables currently are. Dynamic linking remains possible and and deduplication by the linker is best effort. Please let me know if I overlooked something that requires shared cross-crate state.

Ah yes, I forgot to mention it. Indeed they do need to agree on an order for traits, and indeed alphabetic order is sufficient (or ordering by mangled symbols). As I understand it does not require a shared cross-crate state, only a deterministic comparison of trait names (or of their symbols).

crlf0710 · October 17, 2021, 5:40am

Now that dyn upcasting has been implemented on nightly, is there still strong motivation for this?

SkiFire13 · October 17, 2021, 9:11am

The implementation on nightly only supports upcasting to a single supertrait (e.g. from dyn A + B + C to dyn A, dyn B or dyn C, but not dyn A + B, dyn A + C or dyn B + C), this proposes a way to upcast to lift that limitation

unbrice · October 24, 2021, 10:27pm

I did my best to bench a few of the consider alternatives. Benchmarking is an art that I don't master, but the code is on github and feedback is welcome.

The conclusion is that Inlining was not as useful as I hoped, for this micro-benchmark. PackedSieve seem to do as well. Also sieve tables really seems quite an attractive performance profile, on that micro-benchmark at least.

	Size	2 traits	3 traits	4 traits	5 traits
`PackedSieve`	2 words	464 ns	488 ns	482 ns	495 ns
`InlineSieve`	3 words	510 ns	510 ns	510 ns	510 ns
`MultiVPtr`	N+1 words	472 ns	753 ns	764 ns	831 ns

The README has a few pointers but this is all quick-and-dirty code. If there's interest I could add another benchmark, or if someone is interested to write one I'll add some documentation.

riking · October 26, 2021, 6:46am

Note that this is two instructions, not two cycles. PDEP in particular was ~10+ cycles slow on pre-Zen AMD if I'm remembering correctly.

Additionally, those cycles are almost free compared to the costs of uncached memory pointer chasing, which is what you really need to be worried about if you're worried about performance. This is what @newpavlov was trying to tell you.

Iterating a giant array happens at memory bus throughput due to linear prediction prefetching. Iterating a pointer chase happens at memory bus latency because the data must complete a round trip back to the CPU.

unbrice · October 27, 2021, 6:08pm

Thanks for reading and responding.

You remember correctly. Those instructions are one cycle each on newer CPUs, but they used to be much more costly, as per the link in my initial design. And they are not available on some platforms anyways.

Thus in addition to auto-detection, the micro-benchmark has a slow_pdep feature to force the use of other instructions. Results are similar (on my machine). It is nevertheless the case that select_bit.rs would benefit from further optimisations. I have some ideas, but I didn't invest too much time as in spite of not being fully optimised, sieve tables seem to already outperform considered alternatives on my naive microbenchmark. Before optimising further I think I would want to first get help to make the benchmark more realistic and also to collect more runs from other archs.

Yes and I have taken particular attention to that concern, in my initial design as well as in my earlier responses to @newpavlov.

I did my best to fully consider both the theory and the practice.

The theory with the inlining optional refinement that gets rid of this indirection. Also with an observation that pointer chasing affects extra-fat pointers more than sieve tables (the later being mostly contiguous).
I also did my best to check in practice with a microbenchmark. I fully acknowledge that benchmarking is its own art, further from my field of expertise. Nevertheless, preliminary results are very encouraging. I welcome contributions and will better documentat the code structure if people are interested.

I hope this answers your concern, otherwise please let me know. Likewise, contributions would be very welcome. Suggestions of potential contributions: Running it on some arch and tuning select_bit.rs, Guidance on how to share this with devs interested in that feature, Making the benchmark more realistic (the best would be a hacked implementation in the rust compiler but we are probably a bit early for that).

system · January 25, 2022, 6:09pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Where's the catch with Box<Read + Write>? language design	39	6385	March 25, 2019
Trait upcasting compiler	8	1693	March 25, 2019
Pre-RFC: &[T: Trait] -> &dyn [Trait] language design	9	449	November 25, 2024
Small Buffer Optimization for trait objects?	1	1257	March 25, 2019
Why Rust fails hard at scientific computing	20	17512	March 25, 2019

Sieve-tables for multiple traits objects (Box<A+B+C> to Box<A+C>, #2035)

Relation to prior art

TL;DR of the V0/MVP/Simplified version

Refinement

Related topics