Why is String's equality comparison slower than `&[u8]`

I noticed that equality comparison for String known at compile time with length of 33 or more with non repeating bytes is slower than bytes (such as Vec<u8> or &[u8]) on x86.

My benchmark shows that it is 10x slower.

Here is my github repo for the behcmark.

I noticed that String uses libc function bcmp. On the other hand, &[u8] or Vec<u8> doesn't use any libc function and it just generate assembly code that looks like strcmp.

I'm not quite sure why String's implementation is so different from the others. Can anyone give me an insight to this?

The reason is that you're not actually doing equality, you're doing pattern matching.

This is clearest in the MIR: https://rust.godbolt.org/z/K9z4WMsKz

The arm

b"68975c1f-bb5e-4640-a1bd-66bcce78b73e" => 1,

is actually

[b'6', b'8', b'9', b'7', b'5', b'c', b'1', b'f', b'-', b'b', b'b', b'5', b'e', b'-', b'4', b'640-', b'a', b'1', b'b', b'd-', b'6', b'6', b'b', b'c', b'c', b'e', b'7', b'8', b'b', b'7', b'3', b'e'] => 1,

and thus it's treated as a slice pattern that checks all the bytes individually.

If you switch it to using equality instead of a slice pattern,

#[no_mangle] pub fn bytes_match(totally_random_uuid_v4: &str) -> i64
{
    match totally_random_uuid_v4.as_bytes() {
        x if x == b"68975c1f-bb5e-4640-a1bd-66bcce78b73e" => 1,
        x if x == b"0cbb46f2-2bed-4168-8263-501dad40f515" => 2,
        // lots of uuid-s
        _ => -1
    }
}

Then you'll see that one use bcmp as well: https://rust.godbolt.org/z/WWsn5droT


As for why there's a performance difference with the different approaches that depends on the number of arms? My guess would be because of icache. One problem with microbenchmarks is that if you are running only that code, the lots-of-code version looks good until it starts to bust the icache. But the lots-of-code version is often worse in real life, because loading all that code into the icache pushes out the code for whatever else you're doing around it, making it overall worse.

3 Likes

…but that still seems like a missing optimization opportunity, a slice pattern full of constants should be converted to a memcmp somewhere in the compiler (though at what threshold it might be hard to say).

EDIT: ah, the memcmp is slower, hm.

7 Likes

If the constant is smaller or equal to 32 bytes the str match gets inlined and transformed into SSE comparisons. But the slice match does not, it remains a byte-wise constant comparison.

Does it make sense to change the generated MIR for slice matches to use slice equality for subsequent elements? This might even speed up compilation because there are fewer blocks.

EDIT:

Oh, I just read the "with length of 33 or more" in the original post :smiley: