Could we implement some kind of small struct optimization?

I investigated an issue with small structs: [ER] Optimization of tuple equality vs array equality · Issue #83585 · rust-lang/rust · GitHub

A struct with 4 or 8 u8 fields and a derived PartialEq isn't zero cost: LLVM generates branched code with shifts for it. A much better option is to transmute such a struct into a [u8; 8] array, so that the entire struct comparison becomes a single assembly operation.
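
A minimal sketch of what I mean (assumed example, not the exact code from the issue); pasting something like this into the playground or a compiler explorer shows the difference in generated code:

    // Sketch of the case above (assumed example, not the issue's exact code).
    #[derive(PartialEq)]
    pub struct Four {
        a: u8,
        b: u8,
        c: u8,
        d: u8,
    }

    // Derived, field-by-field comparison: as reported in the linked issue,
    // this can lower to branches and shift/mask extraction.
    pub fn eq_struct(x: &Four, y: &Four) -> bool {
        x == y
    }

    // The same four bytes compared as an array: a single 4-byte compare.
    pub fn eq_array(x: &[u8; 4], y: &[u8; 4]) -> bool {
        x == y
    }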

I think we could do something like bit-equality types or a bit-equality optimization.

The first one is probably simpler and could possibly be done by changing #[derive(PartialEq)]. We could add a static method to the trait, something like:

    // default implementation in the trait
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        false
    }

    // generated by the derive macro
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        // no padding bytes...
        std::mem::size_of::<Self>() == sum_of_field_sizes()
        // ...and every field type is itself bit-equaliable
        && type0::is_bit_equaliable()
        ...
        && type_last::is_bit_equaliable()
    }

    fn eq(&self, other: &Self) -> bool {
        if Self::is_bit_equaliable() {
            let a: &[u8; std::mem::size_of::<Self>()] = transmute(self);
            let b: &[u8; std::mem::size_of::<Self>()] = transmute(other);
            a == b
        } else {
            // old field-by-field comparison
        }
    }

    // true for primitives like u8, i32, bool, char, but NOT for f32, pointers, etc.
    #[inline(always)]
    unsafe fn is_bit_equaliable() -> bool {
        true
    }

Another option is to do this optimization at the MIR level. We would need to somehow detect cases where all the bytes of one struct are compared with the bytes of another struct, and replace that with a direct byte comparison. That would be nice because it would work everywhere, but I can't even imagine the implementation complexity of such a feature.

2 Likes

We actually already have this in the compiler, for "structural equality," for the purpose of using consts as match arm branches. Perhaps we should extend structural equality to how we generate PartialEq::eq?
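
For context, a minimal illustration (assumed example, not code from this thread) of where that structural-equality machinery already shows up: a const can only be used as a match pattern if its type derives both PartialEq and Eq.

    // Assumed illustration: consts used as match patterns require
    // "structural equality", i.e. derived PartialEq and Eq.
    #[derive(PartialEq, Eq)]
    struct Point {
        x: u8,
        y: u8,
    }

    const ORIGIN: Point = Point { x: 0, y: 0 };

    fn is_origin(p: Point) -> bool {
        match p {
            ORIGIN => true, // allowed only because Point is structurally comparable
            _ => false,
        }
    }

    fn main() {
        assert!(is_origin(Point { x: 0, y: 0 }));
        assert!(!is_origin(Point { x: 1, y: 0 }));
    }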

2 Likes

However, there is an issue with padding.

    #[derive(PartialEq)]
    struct A {
        a: u8,
        b: u32,
    }

    #[derive(PartialEq)]
    struct B {
        a: A,
        b: u32,
    }

Struct B is StructuralEq, but it has padding bytes which shouldn't be used in the comparison.

1 Like

Well, it looks like we can calculate this using constants: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=3854125bbdc1da4fdbfddd11f47b8ee5
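
The playground isn't quoted here, but the idea is presumably something like this (a sketch with assumed names; the actual playground code may differ):

    // Sketch (assumed names): a struct has top-level padding iff its size
    // exceeds the sum of its field sizes. Padding *inside* a field type
    // needs the same check recursively, as in the derive sketch above.
    use std::mem::size_of;

    struct A {
        a: u8,
        b: u32,
    }

    const A_HAS_PADDING: bool =
        size_of::<A>() != size_of::<u8>() + size_of::<u32>();

    fn main() {
        // u8 + u32 is 5 bytes of payload, but A is 8 bytes due to alignment.
        assert!(A_HAS_PADDING);
    }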

Using transmute for this purpose is a spectacularly bad idea. First, transmute is being deprecated; second, padding is nontrivial to compute and is up to the discretion of the compiler, so a derive macro should not try to second-guess the compiler.

I see you opened a similar issue a couple of days ago. Is there a reason why that fix wouldn't apply here?

Also note that branched code isn't automatically slower than bit-packed code. In this thread, I demonstrate that LLVM probably knows what it is doing, as the branched code it generates ends up being 50% faster than its bit-masking equivalent.

Transmute is not being deprecated and likely will never be deprecated.

1 Like

I doubt this; we will always need something like reinterpret_cast or similar.

Yes, but detecting the presence of padding bytes is very easy: I already linked code for that.

That fix is more about codegen for the && and || operators. It fixed the situation for structs whose fields are at least u32-sized, because such structs are always passed to functions as pointers, but it doesn't help structs that are themselves smaller than u64/u32. It doesn't apply to those structs because rustc packs them into a u16/u32/u64 to fit into a single register for the call, and then extracts the fields with shifts and masks.

You motivated me to look at something more realistic instead of PartialEq implementations alone, and it actually optimizes pretty well on nightly, if inlined: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=9a90c34df778cecb76806b33c47d9121

However, there are still cases where this doesn't help. For example, transmuting into a byte array makes my benchmark 4-25 times faster:

u8/u8 tuple/Self        time:   [16.773 us 16.822 us 16.873 us]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
u8/u8 tuple/Random field
                        time:   [42.919 us 43.058 us 43.206 us]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
u8/u8 tuple/First field time:   [4.0004 us 4.0100 us 4.0200 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
u8/u8 tuple/Last field  time:   [16.684 us 16.716 us 16.750 us]
u8/u8 tuple/Every Second
                        time:   [25.289 us 25.429 us 25.616 us]
u8/u8 arr/Self          time:   [1.5741 us 1.5769 us 1.5801 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
u8/u8 arr/Random field  time:   [1.5677 us 1.5701 us 1.5729 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
u8/u8 arr/First field   time:   [1.5517 us 1.5540 us 1.5563 us]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe
u8/u8 arr/Last field    time:   [1.5585 us 1.5616 us 1.5648 us]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
u8/u8 arr/Every Second  time:   [1.5672 us 1.5690 us 1.5708 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
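
For reference, the arr variants above correspond to something like the following (a sketch with assumed names, not the actual benchmark code); the transmute is only sound because an all-u8 tuple has no padding bytes.

    // Sketch (assumed names, not the actual benchmark code): comparing an
    // all-u8 tuple by reinterpreting it as a byte array. Sound only because
    // an 8-field u8 tuple contains no padding.
    type T8 = (u8, u8, u8, u8, u8, u8, u8, u8);

    fn eq_bytes(a: &T8, b: &T8) -> bool {
        let a: &[u8; 8] = unsafe { std::mem::transmute(a) };
        let b: &[u8; 8] = unsafe { std::mem::transmute(b) };
        a == b // a single 8-byte comparison
    }

    fn main() {
        let x: T8 = (1, 2, 3, 4, 5, 6, 7, 8);
        let y: T8 = (1, 2, 3, 4, 5, 6, 7, 9);
        assert!(eq_bytes(&x, &x));
        assert!(!eq_bytes(&x, &y));
    }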

And @carbotaniuman: see the relevant IRLO thread.

Anyway, whether or not transmute is being deprecated is not the most important argument against it. The important argument is that it feels unwarranted to generously sprinkle something particularly unsafe all over the implementations of one of the most widely-used traits.

1 Like

I'm well aware of safe transmute, but just because we have a slightly safer option doesn't mean that the unsafe one is going to be flat-out deprecated. Anyhow, I do agree with the sentiment, and adding language features due to optimization issues that are likely to be fixed seems suboptimal.

1 Like

Well, I suggested a second approach with a MIR-level optimization too, but I just don't know how to even start on that.

2 Likes

Your benchmarks all run the function on the same input over and over again. That means the CPU will predict the branches perfectly, so you avoid the cost of branch misprediction while reaping the benefit of executing a few fewer instructions than the branchless version.

If you pick a random input value each time, the results should be different. (I just tested this: the difference showed up after I remembered that a uniformly random input can still be predicted well, since it will very rarely be 'aeiou'. You need a random distribution that chooses one of those letters a decent fraction of the time.)

The performance in a real program depends on how random the input is, which depends on the use case. But the branchless version is somewhat safer, since the cost of a branch misprediction is higher than the cost of a few extra instructions.
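
For completeness, a dependency-free sketch (not the actual test used above) of what such a skewed input generator could look like:

    // Sketch: generate inputs that keep the branch predictor guessing by
    // picking a vowel roughly half the time, since a uniformly random byte
    // is almost never one of 'aeiou'.
    fn xorshift(state: &mut u64) -> u64 {
        // tiny PRNG, good enough for producing benchmark inputs
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        *state
    }

    fn biased_letter(state: &mut u64) -> u8 {
        const VOWELS: &[u8; 5] = b"aeiou";
        let r = xorshift(state);
        if r & 1 == 0 {
            VOWELS[(r >> 1) as usize % VOWELS.len()]
        } else {
            b'a' + ((r >> 1) % 26) as u8
        }
    }

    fn main() {
        let mut state = 0x1234_5678_9abc_def0;
        let input: Vec<u8> = (0..32).map(|_| biased_letter(&mut state)).collect();
        println!("{}", String::from_utf8(input).unwrap());
    }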

12 Likes

Absolutely. That's why I didn't write anything like "branched code is always faster". I wrote that it isn't necessarily slower, and it was faster in the scenario I tested.

Of course, performing the bit packing optimization still makes sense in many cases, and I'm in favor of it. All I'm trying to say is that any such optimization should not result in leaky abstractions (whereby optimizations direct the implementation and not vice versa), nor should it introduce large amounts of unwarranted unsafety.

No, structural equality is not bit equality as described above. In particular, having padding is fine for a type to have structural equality.

That said, what exactly structural equality is is not entirely clear, either... see

1 Like

I've seen PartialEq be pretty terrible for enums, and I'm unsurprised it's fairly bad for structs which can't just be memcmped.

Unfortunately the suggestion in the first post has the downside of being a lot more code. It's worth keeping in mind that a huge portion of Rust types have a PartialEq derive, so significantly increasing the amount and complexity of the code the derive produces could really harm compile times (first the derive has to generate it, then the compiler has to emit it, send it to LLVM, optimize it, etc.).

That said, it's part of the compiler... I wonder if/how much the built-in derive for PartialEq can cheat — That is, at the point when that derive is run, can it do things like inspect what the layout of the type will be? Get type info about field types? (that one is plausibly not necessary, but would be nice)

I'm guessing no, since at that point the compiler only has tokens and hasn't yet processed all the Rust code it needs to determine the layout, but it seems worth asking, since it would open up the solution space here a great deal.

1 Like

I read that, and it looks like it just emits tokens which are then processed later as usual, despite being a built-in macro. It also emits the automatically_derived attribute; we could probably use that as a signal during later optimization.

It depends how far you want to go with it. At one end, it could emit a call to a magic_partial_eq_intrinsic that can be lowered to whatever it needs to be in the codegen backend.

(Is that a good idea? No clue. But it's possible.)

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.