Aggressive vtable compression

I was chatting with some coworkers who work in an extremely size-constrained environment. Every dyn T vtable in a binary carries at least three words of overhead: size, alignment, and destructor. If you work exclusively with &mut dyn T, those three words are pure waste, and they add up: three words per type-to-dyn coercion, per trait. This waste almost never matters, but it becomes significant in regimes where the binary is on the order of a few kilobytes, and it gets much worse when &dyn Fn and friends get involved.

I've been thinking about a potential -Ccompress-vtables flag that sacrifices performance (and maybe even changes language semantics...) to get rid of as many of these words as possible.

The "obvious" implementation is to remove all three words and make size_of_val, align_of_val and drop_in_place panic/fail to link/fail to monomorphize when T = dyn. I don't think I need to explain why this is a bad idea. =)

Instead, we make the following observations:

  • Alignment can be stored as its logarithm.
  • In this size regime, destructors are actually kinda rare.
  • Gigantic types are rare.

One implementation I thought of was to pack all three words into one like so (for a 32-bit system; 64-bit is analogous, but less interesting). This word is called the "header".

  • The top bit of the word is set if drop_in_place is not a no-op.
  • The next five bits are the log of the alignment (which cannot be greater than 31)[1].
  • The remaining 26 low bits are the size.
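
To make the encoding concrete, here is a minimal sketch of packing and unpacking such a header; the constant names are mine, but the bit positions are exactly those listed above:

const HAS_DROP: u32 = 1 << 31;        // top bit: drop_in_place is not a no-op
const ALIGN_SHIFT: u32 = 26;          // next five bits: log2 of the alignment
const SIZE_MASK: u32 = (1 << 26) - 1; // low 26 bits: the size

fn pack_header(size: u32, align: u32, has_drop: bool) -> u32 {
    debug_assert!(align.is_power_of_two() && size <= SIZE_MASK);
    let log_align = align.trailing_zeros(); // log2 of a power of two
    (if has_drop { HAS_DROP } else { 0 }) | (log_align << ALIGN_SHIFT) | size
}

fn unpack_header(header: u32) -> (u32, u32, bool) {
    let size = header & SIZE_MASK;
    let align = 1u32 << ((header >> ALIGN_SHIFT) & 31);
    (size, align, header & HAS_DROP != 0)
}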

The actual pointer to the destructor, if there is one, now lives at the end of the vtable. The drop glue for a dyn object thus becomes:

// The virtual call dominates the cost of the added branch.
if vtable[0] & 0x8000_0000 != 0 {
    vtable[1 + num_methods](self);
}

If there is no destructor, we save both the alignment and destructor words; getting rid of 2/3 of the overhead is a pretty solid deal all things considered. There are a few additional caveats:

  • Sizes of dyn objects are restricted to 2^26 bytes (64 MiB) if the flag is turned on. This seems like a very hard restriction to run into if you care about vtable compression. We can just ban all types of such size if we want to avoid post-mono diagnostics (only with this flag turned on, ofc).
  • size_of_val incurs an extra bitwise-and instruction.
  • align_of_val becomes something like:
fn align_of_val(x: &dyn Trait) -> usize {
    // Pseudocode: extract the five log2(align) bits and exponentiate.
    1 << ((x.vtable[0] >> 26) & 31)
}
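
For completeness, a sketch of size_of_val in the same pseudocode style (the mask is the extra bitwise-and mentioned above):

fn size_of_val(x: &dyn Trait) -> usize {
    // Keep only the low 26 size bits of the header.
    (x.vtable[0] & ((1 << 26) - 1)) as usize
}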

These functions aren't called all that often, and the memory lookup will dominate performance either way.

Does this seem like an interesting opt-in optimization to try to implement? I might go and try to wire this into rustc at some point for fun, but I figured I'd pitch it here, too.

[1] Some ABIs might not allow such large alignments anyways... I suspect 2 bits is enough for some 32-bit ABIs, but that's not a big deal.

10 Likes

It surprises me that you would have any significant number of types and traits when the overall code size is only a few kilobytes. Do you have any real-world examples to quantify this?

(e.g. an X KB binary with Y number of types, leading to Z% vtable overhead.)

3 Likes

I linked the thread to the folks having issues. I'll let them comment if they can. =)

I don't have the numbers myself, but I've worked on size optimization for Rust before and this seems on-brand (after you throw out core::fmt, there's still a lot of low-hanging fruit).

Aside, from helping a coworker debug some measurements: it might be worthwhile to have rustc emit symbols for vtables... something that includes a mangled form of the impl, similar to <T as U> in trait method symbols. Even though they're not external, per se, being able to count them in nm would go a long way...

1 Like

In one codebase, an approximately 160 kB kernel has 123 vtables; at three words (12 bytes) apiece, that's 1476 wasted bytes.

In another codebase (a somewhat featureful Rust bootloader), a 26 kB binary has 28 vtables, equating to 336 wasted bytes.

FWIW, we regularly make decisions to remove useful features or debug information to save even 1 kB of flash, so 1.5 kB of space for free would definitely be meaningful in this context.

4 Likes

Something to note is that this flag would also affect ABI: code compiled with this flag could not be combined with code compiled without it. As a result, such a flag would always have to be used with -Z build-std (which would notably block its stabilization on that flag), since presumably the standard library would not be built with the flag enabled by default.

2 Likes

Yes, this is true of anything doing interesting things with ABI, but it does seem a bit silly to block a pure-Rustc flag behind a Cargo flag. ISTR the binary Hudson is referring to doesn't use Cargo at all. (Cargo is really bad for this kind of project, given how many sensitive flags you need to thread to build something that works at all...)

Even if they were using Cargo, you need -Zbuild-std to even be at the point where this optimization matters, since you'll want to -Copt-level=s/z (and similar flags) the standard library. (There is a separate conversation about how it seems really silly that we do not ship a size-optimized standard library...)

Either way, I'm more interested in implementing this optimization as a -Z flag and sorting out stabilization later. =)

1 Like

You could recover the ability to have objects larger than 2^26 bytes by defining all-bits-one in that field to mean that another word of size information follows.
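
In the same pseudocode style as the earlier posts, a sketch of what decoding could look like under that escape scheme, assuming the overflow word is the very last word of the vtable (the next reply argues for exactly that placement):

const SIZE_MASK: u32 = (1 << 26) - 1;

fn size_of_val(x: &dyn Trait) -> usize {
    let field = x.vtable[0] & SIZE_MASK;
    if field != SIZE_MASK {
        field as usize // common case: the size fits in 26 bits
    } else {
        // All-bits-one escape: the real size lives in an extra trailing word.
        x.vtable[x.vtable.len() - 1] as usize
    }
}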

1 Like

You probably still want that to go after the destructor to avoid a penalty to normal method calls. But that is a very cool encoding. This makes it a pure optimization that we could even turn on for s/z builds by default some day (with the caveat we should be shipping a s/z stdlib anyways...)

2 Likes

You will need a libcore one way or another. cargo build -Zbuild-std is the only way to do this that is likely to become stable in the future. You can't compile libcore yourself, as it depends on unstable features.

There could conceivably be a cargo rustc -Zbuild-std as well that would allow getting any artifacts that rustc outputs, to be later provided instead of sysroot items.

Similarly, rustc itself knows where sysroot is, so it can find the rustup std source distribution in the sysroot (or is it beside? I'm not completely sure) and provide a -Zbuild-std equivalent directly.


Edit: as @bjorn3 points out below, just rustc alone doesn't have enough information. cargo rustc should be able to -Zbuild-std though, and slot into any parent build system that uses rustc directly for the rest of the pipeline.

Rustc providing -Zbuild-std functionality would require duplicating the entirety of cargo: Cargo.toml files are used to define the project structure, and some dependencies are downloaded from crates.io rather than vendored.

Google did something funny in LLVM: 26723 – Improve code size with a more space-efficient vtable ABI

1 Like

This optimization has shipped in iOS! Their motivation for using 32-bit-relative vtables was to avoid relocation fixups which created unnecessary dirty pages in shared libraries.

1 Like

I mean, some people just don't use Cargo and build the standard library themselves by translating Cargo into the local build system. Ultimately those people are fussing around with nightly things anyways, but this all feels mildly silly. (This is kind of a tangent to this thread... -Zbuild-std is irrelevant from the perspective of the pure optimization, other than ABI silliness.)

1 Like

It occurs to me that the prebuilt-std ABI concern is actually a red herring, via the following observations:

  • The dyn versions of size_of_val, align_of_val, and drop_in_place are instantiated per trait.
  • Code involving dyn Trait is only instantiated by crates that can see Trait.

Therefore, the decision of whether vtables are compressed can be made per trait, and this decision can be delayed until rustc --emit metadata for each crate.

Thus, -Ccompress-vtables actually should have the following semantics: every trait may have compressed vtables; -Ccompress-vtables compresses the vtables for every trait in the crate. Codegen for dyn calls will be dependent on whether this compression bit is set, per trait.

There are some interesting considerations about turning this on for your build, since std vtables won't be compressed without -Zbuild-std. This also opens up the question of whether you want to do this per trait via a trait #[repr].

Given how little this saves, I think it's very much an all-or-nothing kind of thing. It probably makes sense to just do it at the crate level. Maybe turn it on for s/z builds by default?

5 Likes

Aside, from helping a coworker debug some measurements: it might be worthwhile to have rustc emit symbols for vtables... something that includes a mangled form of the impl, similar to <T as U> in trait method symbols. Even though they're not external, per se, being able to count them in nm would go a long way...

There are now some diagnostics via the rustc_dump_vtable attribute, added in "Refactor vtable format for upcoming trait_upcasting feature" by crlf0710 (Pull Request #86461 · rust-lang/rust · GitHub). The test cases added in that PR are quite self-explanatory.

Whoa. That's pretty cool.

That said, this makes me realize that "the end of the vtable" is not well-defined (even in current Rust)... which probably means getting rid of the destructor isn't as easy, unless we're willing to accept a cmov per trait method call... bleh.

Does the vtable "clobber" appear when there are no instances of dyn Trait at all? Because if that's the case (which can be enforced by using stuff such as #[dyn_safe(false)]), then you could manually hand-roll your own &mut dyn Trait equivalent with the vtable exactly as you want it. The whole thing can be automated through macros.

Sketch of the implementation
use ::core::ptr;

// `Arg` and `Ret` are placeholders for the real method signature.
pub struct Arg; pub struct Ret;

pub(…)
trait Trait {
    fn method (self: &'_ Self, arg: Arg)
      -> Ret
    ;
}

pub(…)
struct VTable {
    /* No need for non-`Box`ed trait objects!
    size_align_drop_in_place: PackedDataAsSuggestedInThisThread, // */
    method: unsafe fn(ptr::NonNull<ty::Erased>, Arg) -> Ret,
}

mod ty { pub struct Erased(()); }

/// &'lt dyn Trait
pub(…)
struct RefDynTrait<'lt> {
    ptr: ptr::NonNull<ty::Erased>,
    vtable: &'lt VTable,
}

impl<'lt, T : Trait> From<&'lt T> for RefDynTrait<'lt> {
    fn from (p: &'lt T)
      -> RefDynTrait<'lt>
    {
        RefDynTrait {
            ptr: ptr::NonNull::from(p).cast(),
            // Note: the inline `const` block gets us `'static` promotion.
            vtable: const { &VTable {
                method: |ptr, arg| unsafe {
                    ptr.cast::<T>().as_ref().method(arg)
                },
            } },
        }
    }
}

impl Trait for RefDynTrait<'_> {
    #[inline]
    fn method (self: &'_ Self, arg: Arg)
      -> Ret
    {
        unsafe {
            (self.vtable.method)(self.ptr, arg)
        }
    }
}
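
Usage would then look something like this (Foo is a hypothetical example type, not part of the sketch):

struct Foo;

impl Trait for Foo {
    fn method (self: &'_ Self, _arg: Arg)
      -> Ret
    {
        Ret
    }
}

fn demo(foo: &Foo) -> Ret {
    // One word of vtable per type/trait pair; no size/align/drop metadata.
    let erased = RefDynTrait::from(foo);
    erased.method(Arg)
}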

I'm mentioning this because there can be many desired changes w.r.t. vtable layout: people wanting it inlined contiguously within the object, or within each pointer itself, or in a different order for CPU cache micro-optimizations, etc. I believe we want a more general solution to these problems rather than trying to field a whole family of flags to tackle each one. Hence this "hand-rolled" vtable suggestion, which is general and customizable enough to cover each of these use cases. The downside is that a general-purpose macro to do this does not exist yet, and the lack of language sugar may make some of this a bit less ergonomic (e.g., while you can write AsRef<dyn Trait> or Index<Output = dyn Trait>, you cannot write something equivalent with MyOwnRefDynTrait<'_>; this, incidentally, is more of an issue with "stdlibs": the traits in the standard library hardcode reference types in too many places).

1 Like

Y tho? If that were an acceptable solution, we wouldn't even be having this discussion. This is about optimizing a large codebase through an ABI-only change, which can be applied uniformly to first- and third-party code; as I've pointed out above, this is ABI-compatible with crates that disagree on how their trait vtables should be laid out.

If people want to have many incongruous vtable ABIs, perhaps this is the direction to be taking there?