Could we implement some kind of small struct optimization?

I investigated an issue with small structs: [ER] Optimization of tuple equality vs array equality · Issue #83585 · rust-lang/rust · GitHub

A struct with 4 or 8 u8 fields and a derived PartialEq isn't zero cost: LLVM generates branched code with shifts for it. A much better option is to transmute such a struct into a [u8; 8] array, so that the entire struct comparison becomes a single assembly operation.
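
A minimal sketch of what I mean (assumed example, not the exact code from the issue); pasting something like this into the playground or a compiler explorer shows the difference in generated code:

    // Sketch of the case above (assumed example, not the issue's exact code).
    #[derive(PartialEq)]
    pub struct Four {
        a: u8,
        b: u8,
        c: u8,
        d: u8,
    }

    // Derived, field-by-field comparison: as reported in the linked issue,
    // this can lower to branches and shift/mask extraction.
    pub fn eq_struct(x: &Four, y: &Four) -> bool {
        x == y
    }

    // The same four bytes compared as an array: a single 4-byte compare.
    pub fn eq_array(x: &[u8; 4], y: &[u8; 4]) -> bool {
        x == y
    }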

I think we could do something like bit-equality types or a bit-equality optimization.

The first one is probably simpler and could possibly be done by changing #[derive(PartialEq)]. We could add a static method to the trait, something like:

    // default implementation in the trait
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        false
    }

    // generated by the derive macro
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        // no padding bytes...
        std::mem::size_of::<Self>() == sum_of_field_sizes()
        // ...and every field type is itself bit-equaliable
        && type0::is_bit_equaliable()
        ...
        && type_last::is_bit_equaliable()
    }

    fn eq(&self, other: &Self) -> bool {
        if Self::is_bit_equaliable() {
            let a: &[u8; std::mem::size_of::<Self>()] = transmute(self);
            let b: &[u8; std::mem::size_of::<Self>()] = transmute(other);
            a == b
        } else {
            // old field-by-field comparison
        }
    }

    // true for primitives like u8, i32, bool, char, but NOT for f32, pointers, etc.
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        true
    }

Another option is to do this optimization at the MIR level. We would need to somehow detect cases where all the bytes of one struct are compared with the bytes of another struct, and replace that with a direct byte comparison. That would be nice because it would work everywhere, but I can't even imagine the implementation complexity of such a feature.

2 Likes

We actually already have this in the compiler, for "structural equality," for the purpose of using consts as match arm branches. Perhaps we should extend structural equality to how we generate PartialEq::eq?
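
For context, a minimal illustration (assumed example, not code from this thread) of where that structural-equality machinery already shows up: a const can only be used as a match pattern if its type derives both PartialEq and Eq.

    // Assumed illustration: consts used as match patterns require
    // "structural equality", i.e. derived PartialEq and Eq.
    #[derive(PartialEq, Eq)]
    struct Point {
        x: u8,
        y: u8,
    }

    const ORIGIN: Point = Point { x: 0, y: 0 };

    fn is_origin(p: Point) -> bool {
        match p {
            ORIGIN => true, // allowed only because Point is structurally comparable
            _ => false,
        }
    }

    fn main() {
        assert!(is_origin(Point { x: 0, y: 0 }));
        assert!(!is_origin(Point { x: 1, y: 0 }));
    }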

2 Likes

However, there is an issue with padding.

    #[derive(PartialEq)]
    struct A {
        a: u8,
        b: u32,
    }

    #[derive(PartialEq)]
    struct B {
        a: A,
        b: u32,
    }

Struct B is StructuralEq, but it has padding bytes which shouldn't be used in the comparison.

1 Like

Well, it looks like we can calculate this using constants: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=3854125bbdc1da4fdbfddd11f47b8ee5
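
The playground isn't quoted here, but the idea is presumably something like this (a sketch with assumed names; the actual playground code may differ):

    // Sketch (assumed names): a struct has top-level padding iff its size
    // exceeds the sum of its field sizes. Padding *inside* a field type
    // needs the same check recursively, as in the derive sketch above.
    use std::mem::size_of;

    struct A {
        a: u8,
        b: u32,
    }

    const A_HAS_PADDING: bool =
        size_of::<A>() != size_of::<u8>() + size_of::<u32>();

    fn main() {
        // u8 + u32 is 5 bytes of payload, but A is 8 bytes due to alignment.
        assert!(A_HAS_PADDING);
    }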

Using transmute for this purpose is a spectacularly bad idea. First, transmute is being deprecated; second, padding is nontrivial to compute and is up to the discretion of the compiler, so a derive macro should not try to second-guess the compiler.

I see you opened a similar issue a couple of days ago. Is there a reason why that fix wouldn't apply here?

Also note that branched code isn't automatically slower than bit-packed code. In this thread, I demonstrate that LLVM probably knows what it is doing, as the branched code it generates ends up being 50% faster than its bit-masking equivalent.

Transmute is not being deprecated and likely will never be deprecated.

1 Like

I doubt this; we will always need something like reinterpret_cast or similar.

Yes, but detecting the presence of padding bytes is very easy: I already linked code for that.

That fix is more about codegen for the && and || operators. It fixed the situation for structs whose fields are at least u32-sized, because such structs are always passed to functions as pointers, but it doesn't help structs that are themselves smaller than u64/u32. It doesn't apply to those structs because rustc packs them into a u16/u32/u64 to fit into a single register for the call, and then extracts the fields with shifts and masks.

You motivated me to look at something more realistic instead of PartialEq implementations alone, and it actually optimizes pretty well on nightly, if inlined: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=9a90c34df778cecb76806b33c47d9121

However, there are still cases where this doesn't help. For example, transmuting into a byte array makes my benchmark 4-25 times faster:

u8/u8 tuple/Self        time:   [16.773 us 16.822 us 16.873 us]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
u8/u8 tuple/Random field
                        time:   [42.919 us 43.058 us 43.206 us]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
u8/u8 tuple/First field time:   [4.0004 us 4.0100 us 4.0200 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
u8/u8 tuple/Last field  time:   [16.684 us 16.716 us 16.750 us]
u8/u8 tuple/Every Second
                        time:   [25.289 us 25.429 us 25.616 us]
u8/u8 arr/Self          time:   [1.5741 us 1.5769 us 1.5801 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
u8/u8 arr/Random field  time:   [1.5677 us 1.5701 us 1.5729 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
u8/u8 arr/First field   time:   [1.5517 us 1.5540 us 1.5563 us]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe
u8/u8 arr/Last field    time:   [1.5585 us 1.5616 us 1.5648 us]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
u8/u8 arr/Every Second  time:   [1.5672 us 1.5690 us 1.5708 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
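
For reference, the arr variants above correspond to something like the following (a sketch with assumed names, not the actual benchmark code); the transmute is only sound because an all-u8 tuple has no padding bytes.

    // Sketch (assumed names, not the actual benchmark code): comparing an
    // all-u8 tuple by reinterpreting it as a byte array. Sound only because
    // an 8-field u8 tuple contains no padding.
    type T8 = (u8, u8, u8, u8, u8, u8, u8, u8);

    fn eq_bytes(a: &T8, b: &T8) -> bool {
        let a: &[u8; 8] = unsafe { std::mem::transmute(a) };
        let b: &[u8; 8] = unsafe { std::mem::transmute(b) };
        a == b // a single 8-byte comparison
    }

    fn main() {
        let x: T8 = (1, 2, 3, 4, 5, 6, 7, 8);
        let y: T8 = (1, 2, 3, 4, 5, 6, 7, 9);
        assert!(eq_bytes(&x, &x));
        assert!(!eq_bytes(&x, &y));
    }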

And @carbotaniuman: see the relevant IRLO thread.

Anyway, whether or not transmute is being deprecated is not the most important argument against it. The important argument is that it feels unwarranted to generously sprinkle something particularly unsafe all over the implementations of one of the most widely-used traits.

1 Like

I'm well aware of safe transmute, but just because we have a slightly safer option doesn't mean that the unsafe one is going to be flat-out deprecated. Anyhow, I do agree with the sentiment, and adding language features due to optimization issues that are likely to be fixed seems suboptimal.

1 Like

Well, I suggested a second approach with a MIR-level optimization too, but I just don't know how to even start on that.

2 Likes

Your benchmarks all run the function on the same input over and over again. That means the CPU will predict the branches perfectly, so you avoid the cost of branch misprediction while reaping the benefit of executing a few fewer instructions than the branchless version.

If you pick a random input value each time, the results should be different. (I just tested this: the difference showed up after I remembered that a uniformly random input can still be predicted well, since it will very rarely be 'aeiou'. You need a random distribution that chooses one of those letters a decent fraction of the time.)

The performance in a real program depends on how random the input is, which depends on the use case. But the branchless version is somewhat safer, since the cost of a branch misprediction is higher than the cost of a few extra instructions.
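
For completeness, a dependency-free sketch (not the actual test used above) of what such a skewed input generator could look like:

    // Sketch: generate inputs that keep the branch predictor guessing by
    // picking a vowel roughly half the time, since a uniformly random byte
    // is almost never one of 'aeiou'.
    fn xorshift(state: &mut u64) -> u64 {
        // tiny PRNG, good enough for producing benchmark inputs
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        *state
    }

    fn biased_letter(state: &mut u64) -> u8 {
        const VOWELS: &[u8; 5] = b"aeiou";
        let r = xorshift(state);
        if r & 1 == 0 {
            VOWELS[(r >> 1) as usize % VOWELS.len()]
        } else {
            b'a' + ((r >> 1) % 26) as u8
        }
    }

    fn main() {
        let mut state = 0x1234_5678_9abc_def0;
        let input: Vec<u8> = (0..32).map(|_| biased_letter(&mut state)).collect();
        println!("{}", String::from_utf8(input).unwrap());
    }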

12 Likes

Absolutely. That's why I didn't write anything like "branched code is always faster". I wrote that it isn't necessarily slower, and it was faster in the scenario I tested.

Of course, performing the bit packing optimization still makes sense in many cases, and I'm in favor of it. All I'm trying to say is that any such optimization should not result in leaky abstractions (whereby optimizations direct the implementation and not vice versa), nor should it introduce large amounts of unwarranted unsafety.

No, structural equality is not bit equality as described above. In particular, having padding is fine for a type to have structural equality.

That said, what exactly structural equality is is not entirely clear, either... see

1 Like

I've seen PartialEq be pretty terrible for enums, and I'm unsurprised it's fairly bad for structs which can't just be memcmped.

Unfortunately the suggestion in the first post has the downside of being a lot more code. It's worth keeping in mind that a huge portion of Rust types have a PartialEq derive, so significantly increasing the amount and complexity of the code the derive produces could really harm compile times (first the derive has to generate it, then the compiler has to emit it, send it to LLVM, optimize it, etc.).

That said, it's part of the compiler... I wonder if/how much the built-in derive for PartialEq can cheat — That is, at the point when that derive is run, can it do things like inspect what the layout of the type will be? Get type info about field types? (that one is plausibly not necessary, but would be nice)

I'm guessing no, since at that point the compiler only has tokens and hasn't yet processed all the Rust code it needs to determine the layout, but it seems worth asking, since it would open up the solution space here a great deal.

1 Like

I read that, and it looks like it just emits tokens which are then processed later as usual, despite being a built-in macro. It also emits the automatically_derived attribute; we could probably use that as a signal during later optimization.

It depends how far you want to go with it. At one end, it could emit a call to a magic_partial_eq_intrinsic that can be lowered to whatever it needs to be in the codegen backend.

(Is that a good idea? No clue. But it's possible.)

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.