Slice comparison performance

saschagrunert · November 20, 2016, 3:45pm

Hey there,

I just had some benchmarks with the nom crate and tested some fundamental performance degrease when using the tag! macro, which will internally work like &slice_a[..max_len] == &slice_b[..max_len]. So, are there any optimizations done on this in the past?

Usually it could be even faster than memcmp, for example if we implement something like this, if pa is known at compile time.

        match len {
            1 => *(pa as *const u8) == *(pb as *const u8),
            2 => *(pa as *const u16) == *(pb as *const u16),
            3 => {
                *(pa as *const u16) == *(pb as *const u16) &&
                *(pa.offset(2) as *const u8) == *(pb.offset(2) as *const u8)
            }
            4 => *(pa as *const u32) == *(pb as *const u32),
            5 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u8) == *(pb.offset(4) as *const u8)
            }
            6 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u16) == *(pb.offset(4) as *const u16)
            }
            7 => {
                *(pa as *const u32) == *(pb as *const u32) &&
                *(pa.offset(4) as *const u16) == *(pb.offset(4) as *const u16) &&
                *(pa.offset(6) as *const u8) == *(pb.offset(6) as *const u8)
            }
            8 => *(pa as *const u64) == *(pb as *const u64),
            9 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u8) == *(pb.offset(8) as *const u8)
            }
            10 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u16) == *(pb.offset(8) as *const u16)
            }
            11 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u16) == *(pb.offset(8) as *const u16) &&
                *(pa.offset(10) as *const u8) == *(pb.offset(10) as *const u8)
            }
            12 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32)
            }
            13 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u8) == *(pb.offset(12) as *const u8)
            }
            14 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u16) == *(pb.offset(12) as *const u16)
            }
            15 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u32) == *(pb.offset(8) as *const u32) &&
                *(pa.offset(12) as *const u16) == *(pb.offset(12) as *const u16) &&
                *(pa.offset(14) as *const u8) == *(pb.offset(14) as *const u8)
            }
            16 => {
                *(pa as *const u64) == *(pb as *const u64) &&
                *(pa.offset(8) as *const u64) == *(pb.offset(8) as *const u64)
            }
            _ => memcmp(pa, pb, len) == 0,
        }

This could automatically be implemented for arrays up to a specific size, with some offsets and a mixture of casts. What I want to say is that comparing casted u64 values in c is much faster than the memcmp.

This discussion seems to be related to 16913 and also this crate.

cuviper · November 20, 2016, 5:32pm

You seem to be assuming this is only about byte slices? Plus those cast tricks will only work on little-endian systems. Where that applies, I expect that memcmp probably does do something like word-at-a-time comparisons anyway, although that couldn’t get optimized like an inlined version might.

bluss · November 20, 2016, 5:39pm

It uses memcmp for equality checking nowadays but it has the problem that for short enough slices, it’s faster to just use an inline comparison instead of calling memcmp, so it can still be tuned.

One path would be to make memcmp a compiler intrinsic and depending on solution, maybe that can be enough to help it emit better code, especially for short input special cases.

seanmonstar · November 20, 2016, 6:21pm

There is the lang item str_eq, which the compiler uses for byte array comparisons. So optimizations in that function would be what you’re looking for, right?

saschagrunert · November 20, 2016, 6:48pm

Instersting, but when I change the tag macro of nom (slice compare) to a native call to memcmp the benchmark drops from 350 to 250ns/iter.

arthurprs · November 22, 2016, 3:37pm

LLVM may have special optimizations for memcmp when it knows enough?

saschagrunert · November 23, 2016, 9:21am

I dont think so. What we could do is to create a macro, which optimized the comparison traits here which my provided approach from the first post. What do you think?

arthurprs · November 23, 2016, 9:48am

That’s a valid improvement in my opinion. @bluss suggestion seems to be the right thing long term though.

saschagrunert · November 23, 2016, 9:57am

Really? I am not sure if an intrinsic really fits here, what does the @compiler_subteam (@nikomatsakis) thinks?

arthurprs · November 23, 2016, 10:04am

Users shouldn’t have to rely on libc for memcmp in my opinion. We already have the equivalents to memcpy and memmove.

bluss · November 23, 2016, 11:11am

I think it seems like a good idea to implement and profile on a variety of code bases. If it helps nom significantly, it’s probably a good idea IMO. (The intrinsic thing is really to solve the dependency problem like arthurprs mentions.)

Notes on the code:

all those & are thankfully redundant
to be portable, pointer reads must be aligned, and *(pa as *const u16) is only ok if pa is well aligned for u16 and so on. You could have two separate implementations, one for arches where you know unaligned loads are fine, and one for the others.

system · March 25, 2019, 8:27am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Why is String's equality comparison slower than `&[u8]` compiler	3	669	November 15, 2024
Slice equality? libs	7	6759	August 26, 2021
Stabilizing basic functions on arrays and slices libs	19	4135	March 25, 2019
Could we implement some kind of small struct optimization? compiler	16	2092	July 7, 2021
Array (fixed width slice) `Ord` implementation	3	303	August 6, 2025

Slice comparison performance

Related topics