Slice comparison performance


#1

Hey there,

I just had some benchmarks with the nom crate and tested some fundamental performance degrease when using the tag! macro, which will internally work like &slice_a[..max_len] == &slice_b[..max_len]. So, are there any optimizations done on this in the past?

Usually it could be even faster than memcmp, for example if we implement something like this, if pa is known at compile time.

        match len {
            1 => *(pa as *const u8) == *(pb as *const u8),
            2 => *(pa as *const u16) == *(pb as *const u16),
            3 => {
                *(pa as *const u16) == *(pb as *const u16) &&
                *(pa.offset(2) as *const u8) == *(pb.offset(2) as *const u8)
            }
            4 => *(pa as *const u32) == *(pb as *const u32),
            5 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u8) == *(pb.offset(4) as *const u8)
            }
            6 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u16) == *(pb.offset(4) as *const u16)
            }
            7 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u16) == *(pb.offset(4) as *const u16) &&
                *(pa.offset(6) as *const u8) == *(pb.offset(6) as *const u8)
            }
            8 => *(pa as *const u64) == *(pb as *const u64),
            9 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u8) == *(pb.offset(8) as *const u8)
            }
            10 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u16) == *(pb.offset(8) as *const u16)
            }
            11 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u16) == *(pb.offset(8) as *const u16) &&
                *(pa.offset(10) as *const u8) == *(pb.offset(10) as *const u8)
            }
            12 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32)
            }
            13 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u8) == *(pb.offset(12) as *const u8)
            }
            14 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u16) == *(pb.offset(12) as *const u16)
            }
            15 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u16) == *(pb.offset(12) as *const u16) &&
                *(pa.offset(14) as *const u8) == *(pb.offset(14) as *const u8)
            }
            16 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u64) == *(pb.offset(8) as *const u64)
            }
            _ => memcmp(pa, pb, len) == 0,
        }

This could automatically be implemented for arrays up to a specific size, with some offsets and a mixture of casts. What I want to say is that comparing casted u64 values in c is much faster than the memcmp.

This discussion seems to be related to 16913 and also this crate.


#2

You seem to be assuming this is only about byte slices? Plus those cast tricks will only work on little-endian systems. Where that applies, I expect that memcmp probably does do something like word-at-a-time comparisons anyway, although that couldn’t get optimized like an inlined version might.


#3

It uses memcmp for equality checking nowadays but it has the problem that for short enough slices, it’s faster to just use an inline comparison instead of calling memcmp, so it can still be tuned.

One path would be to make memcmp a compiler intrinsic and depending on solution, maybe that can be enough to help it emit better code, especially for short input special cases.


#5

There is the lang item str_eq, which the compiler uses for byte array comparisons. So optimizations in that function would be what you’re looking for, right?


#6

Instersting, but when I change the tag macro of nom (slice compare) to a native call to memcmp the benchmark drops from 350 to 250ns/iter.


#7

LLVM may have special optimizations for memcmp when it knows enough?


#8

I dont think so. What we could do is to create a macro, which optimized the comparison traits here which my provided approach from the first post. What do you think?


#9

That’s a valid improvement in my opinion. @bluss suggestion seems to be the right thing long term though.


#10

Really? I am not sure if an intrinsic really fits here, what does the @compiler_subteam (@nikomatsakis) thinks?


#11

Users shouldn’t have to rely on libc for memcmp in my opinion. We already have the equivalents to memcpy and memmove.


#12

I think it seems like a good idea to implement and profile on a variety of code bases. If it helps nom significantly, it’s probably a good idea IMO. (The intrinsic thing is really to solve the dependency problem like arthurprs mentions.)

Notes on the code:

  • all those & are thankfully redundant
  • to be portable, pointer reads must be aligned, and *(pa as *const u16) is only ok if pa is well aligned for u16 and so on. You could have two separate implementations, one for arches where you know unaligned loads are fine, and one for the others.