Optimizing layout of nested enums?


#1

(I don’t know if this is the right category for optimization of compiled code, not of the compiler. Let me know.)

Why this matters

Reading @camlorn’s Optimizing Rust Struct Size post on struct field reordering made me think of the memory layout of enums in particular. Because of alignment constraints, we often end up reserving 4 or 8 bytes for an enum discriminant even the vast majority of enum types have much less than 256 variants (so one byte would suffice). This can be very wasteful, in particular with nested enums. As an extreme example, Option<Option<Option<Option<*const ()>>>> is 40 bytes on 64-bit systems.

The example above is exaggerated, but this problem happens in practice. This is a real type typical of Servo’s style crate:

pub enum LengthOrKeyword {
    Length(pub enum Length {
        NoCalc(pub enum NoCalcLength {
            Absolute(pub enum AbsoluteLength {
                Px(f32),
                In(f32),
                Cm(f32),
                Mm(f32),
                Q(f32),
                Pt(f32),
                Pc(f32),
            }),
            FontRelative(pub enum FontRelativeLength {
                Em(f32),
                Ex(f32),
                Ch(f32),
                Rem(f32),
            }),
            ViewportPercentage(pub enum ViewportPercentageLength {
                Vw(f32),
                Vh(f32),
                Vmin(f32),
                Vmax(f32),
            }),
            ServoCharacterWidth(i32),
        }),
        Calc(
            Box<CalcLengthOrPercentage>,
            pub enum AllowedLengthType {
                All,
                NonNegative,
            }
        ),
    }),
    Keyword(pub enum SizeKeyword {
        ClosestSide,
        FarthestSide,
        ClosestCorner,
        FarthestCorner,
    }),
}

(Pseudo-syntax with definition of nested enums inlined.)

On x86_64 size_of::<LengthOrKeyword>() is 24 bytes. If alignment was not a constraint (and if I’m counting correctly) it would be 11 bytes. This is 41% of wasted space.

24 bytes by themselves may not seem like a lot, but these things add up. For example, the border-radius property defines 8 lengths (horizontal and vertical ellipsis arc radii in each of 4 corners). And then there is PropertyDeclaration, an enum with hundreds of (generated) variants, for every CSS property. This was several hundred bytes until we started boxing variants over a certain threshold. Boxing does not reduce total memory usage (and even costs time to allocate), but it does reduce the amount of memmove traffic that shows up in profiles.

We sometimes “flatten” enums (LengthOrPercentageOrAutoOrContent is another real type) but this is not practical to do pervasively.

@Gankro is currently experimenting with replacing WebRender’s serialization for inter-process communication with bincode, from transmuting to &[u8]. Initial results show half of the space used (and therefore half the time spent sending these bytes). Padding bytes for alignment (used in memory but not by bincode) are presumably a large part of the difference.

What to do

I’ve had some optimization ideas and RFC issue #1230 More Exotic Enum Layout Optimizations has some more, but some of them are made impossible or difficult by two constraints:

  • Matching can take a reference to any field of an enum’s variant, so each field needs to have the same layout (and size, and alignment) as a value of the same type created independently.
  • Such a reference can be mutable, and assigned to. Assignment to *(foo: &mut T) presumably overwrites size_of::<T>() bytes including any “unused” padding bytes, so these bytes can’t (easily) be used by an type that contains T. If we change assignment to write a minimal byte range, there’s still ptr::write. And if we change that there’s a risk that these optimizations break assumptions made by unsafe code. (Or conversely, writing unsafe code would require more care of subtle constraints.)

The idea from issue #1230 that sounds most promising to me is to try and pick discriminant values that do not overlap with values of the discriminant of inner enums, and share their storage. For example:

enum Foo {
    A(u32),
    B(Bar),
    C,
}

If Bar is an enum whose discriminant can be 0 to 4, the the compiler could pick 5 for Foo::A, 6 for Foo::C, and have 0 to 4 both indicate Foo::B and serve as Bar’s discriminant.

Matching Foo:B would require a range comparison rather than simple equality. I don’t know if the impact of that on runtime perfomance is significant.

Even on layout alone, this would help but it’s not perfect. This requires the inner enum to be the first field of its variant in memory, so there may be a tradeoff to make between sharing discriminant space and reordering fields to reduce padding. (Or maybe not? I can’t find an example where nested-enum-first is worse.) Or a tradeoff with promoting the discriminant to a wider type to support larger values v.s. doing less discriminant-space sharing. (Also maybe not?) There could be more than one variant with nested enums (like NoCalcLength above) and the ranges of these discriminants might overlap. Trying to make them not overlap would require a global analysis (looking where a type is used before deciding its layout) which at best would be heuristic past the crate boundary.

What do you all think? Other ideas?


#2

Because you’re reusing the inner enum’s discriminants, and they start at 0 (C-like enums notwithstanding), that “range comparison” is < instead of ==. In fact, if you’re matching on all the Foo and Bar cases, you’ll get a more efficient switch than if you had several discriminants.

I still like to do all of this, but we’ve had several rounds of blockers, #39999 isn’t merged yet and it waited a while for #40658 which is only the first half, there’s more changes that are waiting for the former PR to be landed.

IMO we have to first remove the difference between RawNullablePointer and StructWrappedNullablePointer, by having enum variant layouts that are newtypes - and newtypes in general are a pretty crazy topic (#[repr(transparent)] wants them too) - we’ll probably just use whatever’s orthogonal but doesn’t make working with layouts a complete pain.

Once we do all of that though, @camlorn came up with a very simple scheme for finding niches (if you consider the preorder traversal of a tree of ADTs, then you’ll always be “using up” niches in that order, so you can just start at the last one you already know is used and ignore everything before) and we don’t have as many things in rustc_trans to teach about enum discriminants.


#3

How about allowing completely custom memory layout, implemented by user function?

I wish I could define an internal layout of enums as an union or struct and be able to seamlessly convert it to and from Rust enum (basically plug in my own function that finds the discriminant in arbitrarily clever way)


#4

This is a good point, but the issue is exaggerated, I think.

TL;DR: all the optimizations in the first post of that issue are sound with internal pointers, and they can push down the size of your example Servo enum to an optimal 16-bytes, with values to spare for further nesting in e.g. Option.

Specifically, all of the proposed optimizations in my post are using forbidden values, and not padding. Nothing can be broken by internal writes to x in Foo(x) if the repr of Foo(x) is exactly that of x (plus maybe some values that are outside of sizeof(x).

Here’s my analysis of the proposals:

Use Undefined Values in Other Primitives

Some(Some(false) = 0 Some(Some(true)) = 1 Some(None) = 2 None = 3

As you can see, writes by inner pointers can never corrupt the outer variant (unless the outer variant was already overwritten, in which case the write is violating type-safety anyway).

Use Multiple Invalid-Value Fields

Some(Some(ref, ref)) = (ref, ref) Some(None) = (0, 1) None = (1, 0)

Still no way to corrupt externally.

Support More than a Single Bit

(I’ve already shown this in the first example)

Use all other fields as free bits when there exists another forbidden field

Some(Some((ref, x)) = (ref, x) Some(None) = (0, 0) None = (0, 1)

Still good.

Use the Fact that Void is Statically Unreachable

You can’t “write” Void, so there’s no concern for overwriting outer things

Support Enums that have More than One Non-C-Like Variant

enum Foo {
  A(&u8, &u8),
  B(&u8),
}

A(ref, ref) = (ref, ref) B(ref) = (ref, 0)

Still good.

Treat any True Discriminant Tags as a Type with Undefined Values

Some(Some(x)) = (0, x) Some(None) = (1, 0) None = (2, 0)

Still good.

Let’s apply these optimizations to your enum:

pub enum LengthOrKeyword {
    Length(pub enum Length {
        NoCalc(pub enum NoCalcLength {
            Absolute(pub enum AbsoluteLength {
                Px(f32),
                In(f32),
                Cm(f32),
                Mm(f32),
                Q(f32),
                Pt(f32),
                Pc(f32),
            }),
            FontRelative(pub enum FontRelativeLength {
                Em(f32),
                Ex(f32),
                Ch(f32),
                Rem(f32),
            }),
            ViewportPercentage(pub enum ViewportPercentageLength {
                Vw(f32),
                Vh(f32),
                Vmin(f32),
                Vmax(f32),
            }),
            ServoCharacterWidth(i32),
        }),
        Calc(
            Box<CalcLengthOrPercentage>,
            pub enum AllowedLengthType {
                All,
                NonNegative,
            }
        ),
    }),
    Keyword(pub enum SizeKeyword {
        ClosestSide,
        FarthestSide,
        ClosestCorner,
        FarthestCorner,
    }),
}

f32, i32: no optimization opportunity (let’s ignore NaN boxing for now)

AbsoluteLength, FontRelativeLength, and ViewPortPercentageLength, all get the same layout: (tag, u32). Naively they will all get the exact same tag values (0, 1, 2, …), which makes it harder to weave them together.

However a crate-local optimizer might find all the types they’re stored in, and give them disjoint sequences. If this sounds far-fetched, a similar solution can be achieved by simply picking a “random” (deterministic hash) initial value, so that most enums incidentally don’t overlap.

I’ll also assume tags are made to fill all padding, because padding is scary and out of our control, while tags are fully controlled. So in this case we get 32-bit tags, which are very sparse and almost certainly won’t collide. So all of these are 8-bytes in total.

Vw(x) = (0, x) Vh(x) = (1, x) Vmin(x) = (2, x) Vmax(x) = (3, x)

Em(x) = (10, x) Ex(x) = (11, x) Ch(x) = (12, x), Rem(x) = (13, x)

Px(x) = (20, x) In(x) = (21, x) Cm(x) = (22, x) Mm(x) = (23, x) Q(x) = (24, x) Pt(x) = (25, x) Pc(x) = (26, x)

NoCalcLength is the union of all of these, but also adds a ServoCharacterWidth(i32) variant. Trivial local evaluation tells us we can add:

ServoCharacterWidth(30, x)

Length adds a Calc(Box, AllowedLengthType) variant, which will be the primary optimization bottleneck. It must be 16-bytes. AllowedLengthType is C-like and can fit in a byte, so it will probably just be a byte.

(random guessed values) All = 40 NonNegative = 41

Calc(box, type) = (box, type, 7-bytes of 0) NoCalc(x) = (0, x)

Finally LengthOrKeyword adds a C-like SizeKeyword. We can just stuff it in some forbidden values of the x in NoCalc(x)

(random guessed values) ClosestSide = 50 FarthestSide = 51 ClosestCorner = 52 FarthestCorner = 53

Keyword(x) = (0, x, 7-bytes of 0)

So our final result is LengthOrKeyword at 16-bytes, which is just your 11 bytes rounded up to the next biggest power of two; so basically optimal.

Note that the (0, x) case still has plenty of forbidden values, so we can further nest this in other enums without growth.

Also note that matching on this enum is still fairly cheap; e.g. checking for Calc is just box != 0, and ever other variant is just box == 0 && min <= x && x <= max. However in a full match statement these can be coalesced into a decision tree:

match no_calc {
  Absolute(x) => ...
  FontRelative(x) => ...
  ViewportPercentage(x) => ...
  ServoCharacterWidth(x) => ...
}
let tag = no_calc.tag
if tag > FontRelative.max_tag {
  // Can't be FontRelative or Absolute here
} else {
  // Can't be ViewPortPercentage or ServoCharacterWidth here 
}

We could also further optimize by making Calc and Keyword use different patterns for the “padding” so we can branch of that instead of some range of values (we can specify it to not be padding, but instead another tag with a definite value).


#5

I was thinking of crate-wide analysis too. (Possibly of with a breath-first graph traversal.) Doing something random or based on hashing seems slightly worrying: if you’re “unlucky” something like renaming a variant could suddenly make another type bigger?

Other than that I like the idea of using forbidden values like you describe. (I don’t know why exactly why I assumed that the valid-refs constraints would prevent that.)


#6

crate-local analysis is definitely the more robust solution, but hashing is a decent 80% solution to get something working. It also naturally works cross-crate, so you kinda want it anyway? I imagine if you want a robust solution, we would allow you to specify tags manually; maybe we could have a system to ask the compiler what it picked and “save” the result in the source.


#8

For something like Servo, we could have a “manual” #[first_discriminant=N] tag somewhere, along with some linting/unit testing/global registry to ensure things don’t overlap, C-style.


#9

Here is a cut down version of the example:

enum Alpha { A(f32), B(f32) }

enum Beta { X(f32), Y(f32) }

enum Outer {
    One(Alpha),
    Two(Beta),
}

fn main() {
    println!("{}", std::mem::size_of::<Outer>());
}

Which prints 12 and thus illustrates the issue at hand.


With just a realignment of the discriminant fields, it should be at most 8 bytes (with two bytes of padding):

struct Alpha { content: union { A, B }, discriminant: u8 }

struct Beta { content: union { X, Y }, discriminant: u8 }

struct Outer {
    content: union { Alpha, Beta },
    discriminant: u8,
}

With the following sizes characteristics:

  • Alpha: size 5 (8 effective), alignment 4, offset of content: 0, offset of discriminant: 4,
  • Beta: size 5 (8 effective), alignment 4, offset of content: 0, offset of discriminant: 4,
  • Outer: size 6 (8 effective), alignment 4, offset of content: 0, offset of discriminant: 5,

The crucial point is that since the last 3 bytes of Alpha and Beta are known to be padding, they can be reused to stash data members of outer types (this is an optimization the C++ Itanium ABI has in limited fashion, letting derived classes reuse the padding at the end of their base classes in some circumstances).

There is one serious consequence in terms of code generation, though: any write to &mut Alpha must only ever write 5 bytes (not 8), since Alpha may be contained in an object using its tail padding for its own attributes.

The same “tail-padding” reuse optimization is usable in struct as well:

struct Space {
    what: Outer,
    more: u16,
}

could have size 8 (effective 8), alignment 4, offset of what: 0, offset of more: 6 (within the tail-padding of Outer).


#10

Not rouding size of types up to a multiple of alignment? is possibly related.


#11

Yes; I saw it afterwards :blush:


#12

I don’t think these constraints are necessarily fundamental to the Rust language. It’s actually a really old problem that language designers were struggling with 30-40 years ago.

The first C compilers in the seventies would keep each local variable in a stack slot, so taking their address for a “mutable borrow” wasn’t a big deal:

int x = 7;
update_my_int(&x);

Any problems were fixed with simple language rules: You can’t take the address of a register variable or a bit field.

By the time Ada came out in the eighties, register allocation was a common optimization, and the designers were more concerned with efficient data representations. Thus, Ada has parameter modes which specify intent rather than mechanism:

procedure Update_My_Int(x: in out Integer);

The Ada parameter modes have more in common with Rust borrows than C pointers. Most importantly, they don’t allow the address of a by-reference argument to escape or be captured by the callee. They also leave it up to the compiler and ABI to decide if x should be passed by copy or by reference. Expressed in C, the Ada compiler is allowed to choose between:

int update_my_int(int x);
void update_my_int(int *x_addr);

It is interesting to read the rationale for this design. Many of the issues they are dealing with regarding aliasing and concurrency are solved by the Rust system of borrowing.

Rust borrows as parameter modes

The Rust compiler could take the same liberty with borrows of small objects:

fn update_my_int(x: &mut i32)

Could be lowered to a function with a C-equivalent signature of:

int update_my_int(int x);

This has a number of advantages:

  • It is faster to pass arguments directly in registers.
  • The caller doesn’t need to allocate a stack slot for the local variable x. The register allocator is much happier.
  • It would be possible to safely borrow members of a packed struct. The callee doesn’t have to worry about alignment.
  • If x were a bool, it could be represented as a bit field in a struct.
  • If x were of a nested sum type, it would not be constrained to a single prescribed representation as you mentioned above.
  • If Rust gets refinement types, this would enable even better data representation, like Option<u32> where the inner value can’t be UINT_MAX. The Ada rationale has some discussion of this topic too.

The problem with this approach is that a borrow can be converted to a pointer in Rust. It is easy enough to save the object to the stack and take its address, but pointers also imply an object identity, not just a value. This leads to the key questions:

When converting a borrow to a pointer twice, do you get the same pointer each time? Are two different borrows of the same object guaranteed to produce the same pointer?

I would propose that the answer is “no” for Copy objects and “yes” for non-Copy objects. This is like saying that &x is borrowing a copy of x, just like f(x) is passing a copy of x when x implements Copy.

Of course, data structures that are tagged with repr(C) are forced to represent data like it’s 1970. Same thing for Rust functions callable from C.


#13

The problem is that creating references on the fly doesn’t work if the function can return the reference or stash it somewhere with a longer lifetime than its own execution frame. If you have fn foo(x: &mut i32) -> &i32 { x }, x obviously can’t be passed by value.

For transformations that apply to individual function arguments, that might be tolerable, because you can often rule that out based on the function signature (sort of - unsafe code might still cast x to a pointer, stash it somewhere, and expect it to stay alive as long as the storage of the original reference, but the nascent unsafe code guidelines could forbid that). Alternately, a global analysis could identify functions that don’t let certain reference arguments escape, and transform them (though if they’re ever cast to fn pointers or trait objects, the compiler would have to generate a copy that uses the standard ABI).

But for struct layout, for the compiler to optimize a field f: F of a struct S, the compiler would have to guarantee that nowhere in the program is a function like fn foo(x: &S) -> &F { &x.f } (or that all such functions are fully inlined). If f is a public field of a crate-public struct, this guarantee is impossible; if not, it still requires expensive global analysis, and is fragile (you wouldn’t expect marking something pub, or writing a function somewhere, to have a severe performance cost in seemingly unrelated code).

I’d prefer a way for the programmer to explicitly mark struct and enum fields “inline”*, which would authorize the compiler to ‘take apart’ the type of that field and use a non-standard representation for it within the outer struct/enum. This would either forbid taking references to the field, or reduce the lifetime of such references – maybe just treating field accesses as rvalues.

As you say, changing representation would both enable a ton of optimizations that are otherwise impossible, such as packing, bools to bit fields, mixing subfields of different fields, etc. It also makes optimizations that are sometimes possible without it more reliable. @Gankro mentioned how nested enums can be stored using a single discriminant while preserving referenceability, by adjusting the base representation of the inner enum based on global analysis – but there are a ton of ways that could break, like if two enum variants have the same sub-enum, or unrelated structures contain the same enum, or the inner enum is defined in a different crate, or one of the outer enum variants stores an Option<f32> rather than having a unique sub-enum… On the other hand, if the outer enum explicitly marks the field inline, the compiler can store the tag with an offset, and things would ‘just work’ in all of the cases I mentioned, no global analysis needed.

(You suggested that the compiler could use a different layout for struct parameters to specific function calls, but that’s even more fragile than the basic de-referencing transformation. I expect it would usually be impossible for long-lived values, so you’d still be using an inefficient representation at least sometimes.)

* Incidentally, I think I got the idea of calling it “inline” from one of Jonathan Blow’s JAI videos.


#14
enum MyEnum {
    __DUMMY = 10,
    A(A),
    B(B)
}

Lets the discriminant begin at 10.


#15

Uh, really? (Not to mention that it will break and/or complicate everywhere exhaustively matching against MyEnum.)


#16

Oops, didn’t know it could only be used with c-like enums. (Should test my code)


#17

I’m thinking about taking the discriminant outside of enum normal structure to allow this sort of optimizations. Remember fat pointers that don’t have everything at the location of the data but require additional information be stored by the pointer consumer.

In case of enums this would mean that an it occupies only the space of the biggest contents the same way @matthieum suggested but not storing the discriminant near the union:

This makes enums very much the same as untagged unions from https://github.com/rust-lang/rfcs/pull/1444

Here are some examples:

let some: Option<Option<int>> = Some(Some(2));

// is represented as
union Option<T> { Some: T }

// a combination of nested enums' values may be created on demand
enum Option_Option_Discriminant { None, Some_None, Some_Some }

let some_discriminant = Option_Option_Discriminant::Some_Some; 
let some_content = Option::<Option<i32>> { Some: Option::<i32> { Some: 2 } };

In a more complex example:

enum Alpha { A(f32), B(f32) }
enum Beta { X(f32), Y(f32) }
enum Outer { One(Alpha), Two(Beta) }

let value = Outer::Two(Beta:X(0.0));

// replaced with
union Alpha { A: f32, B: f32 }
union Beta { X: f32, Y: f32 }
union Outer { One: Alpha, Two: Beta }

// combination of all enum values in the typeof(value)
enum Outer_Discriminant { One_A, One_B, Two_X, Two_Y }
let value_discriminant = Outer_Discriminant::Two_X;
let value_content = Outer { Two: Beta { X: 0.0 } };

Then if something accepts mutable reference to the internal enum we have to supply simple conversion function:

fn replace(&mut beta: Beta) {
  match *beta {
    X(v) => *beta = Y(v - 1),
    Y(v) => *beta = X(v + 1),
  }
}

let value = Outer::Two(Beta:X(0.0));
match *value {
  Two(mut ref beta) => replace(beta),
}

// is replaced with
enum Beta_Discriminant { X, Y }

// fn replace(&mut beta: Beta) {
fn replace(beta_discriminant: &mut Beta_Discriminant, beta_content: &mut Beta) {
    // match *beta {
    match *beta_discriminant {
        // X(v) =>
        Beta_Discriminant::X => {
            let v = unsafe { beta_content.X };
            // *beta = Y(v - 1)
            *beta_discriminant = Beta_Discriminant::Y;
            *beta_content = Beta { Y: v - 1.0 }
        },
        // Y(v) =>
        Beta_Discriminant::Y => {
            let v = unsafe { beta_content.Y };
            // *beta = X(v + 1)
            *beta_discriminant = Beta_Discriminant::X;
            *beta_content = Beta { X: v + 1.0 }
        },
    }
}

// let mut value = Outer::Two(Beta:X(0.0));
let mut value_discriminant = Outer_Discriminant::Two_X;
let mut value_content = Outer { Two: Beta { X: 0.0 } };

// match value {
match value_discriminant {
    // Two(mut ref beta) => replace(beta),
    Outer_Discriminant::Two_X | Outer_Discriminant::Two_Y => {
        let mut beta_content = unsafe { &mut value_content.Two };
        let mut beta_discriminant = Outer_Two_To_Beta_Discriminant(&value_discriminant);
        // (mut ref beta) => replace(beta),
        replace(&mut beta_discriminant, beta_content);
        value_discriminant = Outer_Two_From_Beta_Discriminant(value_discriminant, beta_discriminant);
    }
    _ => {}
}

fn Outer_Two_To_Beta_Discriminant(v: &Outer_Discriminant) -> Beta_Discriminant {
  return match *v {
    Outer_Discriminant::Two_X => Beta_Discriminant::X,
    Outer_Discriminant::Two_Y => Beta_Discriminant::Y,
    _ => panic!(),
  }
}

fn Outer_Two_From_Beta_Discriminant(v: Outer_Discriminant, f: Beta_Discriminant) -> Outer_Discriminant {
  return match (v, f) {
    (Outer_Discriminant::Two_X, Beta_Discriminant::Y) => Outer_Discriminant::Two_Y,
    (Outer_Discriminant::Two_Y, Beta_Discriminant::X) => Outer_Discriminant::Two_X,
    _ => panic!(),
  }
}

Here I specifically didn’t try to use flag masks to show that they are not necessary at the first step. But bit arithmetics should simplify these operations quite a bit.

This code in playpen is here: https://is.gd/zFvb0P


#18

I’ve been thinking about such optimizations too and came mostly to similar conclusions. My ideas:

  • consecutive tags may be necessary to create jump table (I’ve never looked into jump tables deeply enough to know for sure)
  • marking a type with “I don’t care about identity” could help
  • hashing is probably better than no optimization at all (issue extra warning if collision happens? Allow custom seed for hash function?)
  • assign each crate an offset and let each crate to do analysis within that crate?
  • in case of simple approach try to flatten largest variant
  • how complicated would it be to actually implement such optimizations?

#19

This would allow for generically defined enums:

enum Cons<A, B> {
    Head(A),
    Tail(B),
}

fn main() {
    // currently errors: left = 8, right = 12
    assert_eq!(8, ::std::mem::size_of::<Cons<u32, Cons<i32, Cons<u8, ()>>>>());
}

#20

I guess it’d be remiss not to mention that https://github.com/rust-lang/rust/pull/45225 now exists and implements quite a few of the desired optimizations.