Faster PartialEq for small arrays of non 2^n length

binarycat · June 14, 2024, 10:55pm

i was messing around trying to optimize some code, and accidentally made some code that outperforms the implementation of PartialEq from libcore for slices of length 3, 5, 6, and 7

unfortunately it definitely doesn't obey strict provenance, and i don't think it's entirely sound (namely, it will segfault if passed a slice that is on the very edge of a segment), but since bytestring equality is such a common operation, i thought i would share anyways.

the core idea is that trailing_zeros and xor can be used to give us the index of the first non-matching byte, so small bytestrings can be loaded into registers in their entirety and compared without a loop.

#![feature(test)]

extern crate test;

use std::hint::black_box;
use test::Bencher;

#[cfg(target_endian = "little")]
fn shared_prefix_z(a: &[u8], b: &[u8], min_len: usize) -> usize {
	// TODO: make sure the query Vec has a capacity of 8, so this is sound
	unsafe {
		let aa = (a.as_ptr() as *const u64).read_unaligned();
		let bb = (b.as_ptr() as *const u64).read_unaligned();
		((aa ^ bb).trailing_zeros() as usize / 8).min(min_len)
	}
}

#[bench]
fn eq(b: &mut Bencher) {
	b.iter(|| {
		black_box(black_box(b"123456") == black_box(b"123456"))
	})
}

#[bench]
fn hack(b: &mut Bencher) {
	b.iter(|| {
		black_box(shared_prefix_z(black_box(b"123456"),
						black_box(b"123456"),
						6) == 6)
	})
}
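To make the xor / trailing_zeros arithmetic concrete, here is a minimal sketch that works on fully initialized 8-byte arrays, so it involves no unsafe code. The name first_mismatch is hypothetical and is not part of the benchmark above.

```rust
/// Index of the first byte at which `a` and `b` differ, or 8 if they are equal.
fn first_mismatch(a: [u8; 8], b: [u8; 8]) -> usize {
    // A little-endian interpretation puts byte 0 in the lowest 8 bits,
    // regardless of the host's endianness.
    let diff = u64::from_le_bytes(a) ^ u64::from_le_bytes(b);
    // Matching bytes xor to zero, so the number of trailing zero bits,
    // divided by 8, is the number of leading bytes that match.
    // Equal inputs give diff == 0, trailing_zeros() == 64, and a result of 8.
    (diff.trailing_zeros() / 8) as usize
}
```

Because equal inputs yield 8 rather than the real length, a caller working with shorter strings still has to clamp the result, which is what the min(min_len) in shared_prefix_z does.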

scottmcm · June 15, 2024, 8:40am

It's easy to make something that outperforms when it's UB -- after all, you can't get any faster than unreachable_unchecked().

Note that that's not enough for it to be sound, because reading uninitialized memory as u64 is immediate UB.

Also, do you know exactly how long they are? If so, you should compare [u8; N]s instead, which has a different implementation from the slice one.
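As a rough sketch of that suggestion, assuming the length N is known at compile time: slices can be converted to array references with the standard TryFrom impl and then compared as arrays. The helper name eq_fixed is hypothetical.

```rust
/// Compares two slices as `[u8; N]` arrays.
/// Returns `None` if either slice is not exactly `N` bytes long.
fn eq_fixed<const N: usize>(a: &[u8], b: &[u8]) -> Option<bool> {
    // `&[u8; N]: TryFrom<&[u8]>` succeeds only when the lengths match.
    let a: &[u8; N] = a.try_into().ok()?;
    let b: &[u8; N] = b.try_into().ok()?;
    // This uses the array `PartialEq` impl rather than the slice one.
    Some(a == b)
}
```

For example, eq_fixed::<6>(x, y) compares two slices that are expected to hold six bytes each.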

binarycat · June 15, 2024, 7:17pm

ok so i'll zero-initialize it, then shrink the length. also i'm still kinda confused as to why reading uninit memory is UB instead of "Unspecified Result".

bytestrings have their length as part of their type, so this is using the faster array comparison route (thus why it only outperforms when the length is not a power of two).
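One way to sketch the zero-initialize idea without any unsafe code is to copy each input into a zeroed 8-byte buffer before loading it. The function name shared_prefix_padded is hypothetical, and this version has not been benchmarked, so no claim is made that it keeps the speed advantage.

```rust
/// Length of the common prefix of `a` and `b`.
/// Both inputs are assumed to be at most 8 bytes long.
fn shared_prefix_padded(a: &[u8], b: &[u8]) -> usize {
    assert!(a.len() <= 8 && b.len() <= 8);
    // Zero-filled buffers: every byte that gets loaded is initialized.
    let mut aa = [0u8; 8];
    let mut bb = [0u8; 8];
    aa[..a.len()].copy_from_slice(a);
    bb[..b.len()].copy_from_slice(b);
    let diff = u64::from_le_bytes(aa) ^ u64::from_le_bytes(bb);
    // Padding bytes may compare equal, so clamp to the shorter input.
    let min_len = a.len().min(b.len());
    ((diff.trailing_zeros() / 8) as usize).min(min_len)
}
```

Equality of two inputs of the same length n is then shared_prefix_padded(a, b) == n.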

scottmcm · June 15, 2024, 8:58pm

binarycat · June 15, 2024, 9:47pm

worth noting that my algorithm isn't actually broken by changing byte patterns, as long as the individual bytes are marked as undef and not the entire value, since it only reads the uninitialized memory once.

scottmcm · June 15, 2024, 10:14pm

That's called [MaybeUninit<u8>; 8], not u64.

That's irrelevant to it being UB.
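For illustration, a minimal sketch of what working with [MaybeUninit<u8>; 8] looks like: the type is allowed to hold uninitialized bytes, whereas a u64 must be fully initialized, and only the bytes known to be written are ever read back. The helper names load_prefix and initialized_bytes are hypothetical.

```rust
use std::mem::MaybeUninit;

/// Copies up to 8 bytes of `src` into a partially initialized buffer and
/// returns the buffer together with the number of initialized bytes.
fn load_prefix(src: &[u8]) -> ([MaybeUninit<u8>; 8], usize) {
    let n = src.len().min(8);
    // Holding uninitialized bytes is fine for `MaybeUninit<u8>`.
    let mut buf = [MaybeUninit::<u8>::uninit(); 8];
    for (slot, &byte) in buf.iter_mut().zip(src) {
        slot.write(byte);
    }
    (buf, n)
}

/// Reads back the first `n` bytes of `buf`.
///
/// # Safety
/// The first `n` slots of `buf` must have been initialized.
unsafe fn initialized_bytes(buf: &[MaybeUninit<u8>; 8], n: usize) -> Vec<u8> {
    assert!(n <= 8);
    buf[..n]
        .iter()
        // SAFETY: the caller guarantees these slots were written.
        .map(|slot| unsafe { slot.assume_init_read() })
        .collect()
}
```

Note that this sketch never reinterprets the buffer as a u64, so it does not reproduce the single-load comparison; it only shows which type can legally contain the uninitialized bytes.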

binarycat · June 15, 2024, 10:20pm

it's extremely relevant for whether it will have the expected behavior, though.

hopefully rust exposes the planned llvm freeze intrinsic someday, so code like this can be written soundly.

jhpratt · June 15, 2024, 10:42pm

It's undefined behavior by definition. There is no "expected" behavior; anything can happen. Trying to predict UB is fruitless.

binarycat · June 15, 2024, 10:50pm

definitions can be changed. much of rust's specified UB (such as str being valid UTF-8) is not actually utilized by the compiler, but it is specified as UB because defining behavior is a one-way ratchet in a language with strong backwards compatibility guarantees.

it's far from impossible, especially if you pin a specific compiler version, optimization level, and target architecture. it's been done many times, to varying levels of success.

Nemo157 · June 15, 2024, 11:07pm

Violating that is not direct UB, it is just a library invariant that leads to later UB, like IIRC in the Chars iterator it may lead to OOB reads.

binarycat · June 15, 2024, 11:09pm

has that change been officially made? last i heard it was something that was "being considered"

Nemo157 · June 15, 2024, 11:11pm

IIRC it got decided a few years ago.

quinedot · June 15, 2024, 11:25pm

github.com/rust-lang/rust

remove language-level UB for non-UTF-8 str

opened 04:11PM - 11 Apr 20 UTC

closed 06:13PM - 11 May 20 UTC

RalfJung

C-enhancement A-unicode T-lang disposition-merge finished-final-comment-period

This is the Rust-side issue for https://github.com/rust-lang/reference/pull/792 …just so that we can use fcpbot. The change description follows.

Ever [since Rust 1.0](https://doc.rust-lang.org/1.0.0/reference.html#behavior-considered-undefined), the reference said that a non-UTF-8 `str` causes immediate UB. In terms of [today's terminology](https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html), that means that `str` has a validity invariant of being valid UTF-8. However, that seems unnecessary: the compiler does not actually exploit this, nor is there any clear way it could exploit this. Making UTF-8 a library-level *safety invariant* is more than enough for everything `str` does. Most likely, it was made a validity invariant because we had not yet properly teased apart those two concepts when the document was initially written.

This is also the conclusion that the UCG WG arrived at in https://github.com/rust-lang/unsafe-code-guidelines/issues/78. I therefore propose we remove the UTF-8 clause from the language spec, so that `str` will have the same validity invariant as `[u8]`.

tczajka · June 16, 2024, 5:39am

I think this is a confusion in terminology.

As far as I understand: MIRI doesn't catch it as UB, but it's still UB because the documentation of from_utf8_unchecked says so and thus the function can't be trusted to do anything predictable when given non-UTF-8 bytes, which is the definition of UB.

workingjubilee · June 16, 2024, 7:13am

It is permitted for the implementation of from_utf8_unchecked to be

pub const unsafe fn from_utf8_unchecked(v: &[u8]) -> &str {
    let Ok(s) = str::from_utf8(v) else {
        unreachable_unchecked()
    };
    s
}

This would very much be UB caught by miri!

We mostly do not do this because it optimizes poorly.

tczajka · June 16, 2024, 5:08pm

Right, definitely seems so from the documentation of from_utf8_unchecked.

And yet, str docs say:

Constructing a non-UTF-8 string slice is not immediate undefined behavior

So how do you construct a non-UTF-8 str in a sound way? Or are the docs just wrong?

binarycat · June 16, 2024, 5:11pm

I don't think you can do it safely, but you should be able to do it soundly with some form of transmutation.

tczajka · June 16, 2024, 5:15pm

Right, I misspoke, had already corrected it.

Is transmuting &[u8] into &str sound? Is this in the reference?

binarycat · June 16, 2024, 5:17pm

yes

CAD97 · June 16, 2024, 5:27pm

The library versions of from_utf8_unchecked list WF UTF-8 as a prerequisite, so they're out. str::as_bytes_mut is highly relevant, since it allows temporarily putting non UTF-8 bytes in the string, but still requires restoring UTF-8 before the borrow is allowed to expire. Even though no action occurs on lifetime expiry, morally the documentation places a WF UTF-8 assertion at that point.

Ultimately this gets into a very subtle part of Rust in both the difference between library UB and language UB, and the fact that language validity of references doesn't care about the contents of the referenced memory beyond that it exists and is borrowed.

Even our more precise documentation does a poor job in properly distinguishing between language UB ("immediate" UB) and library UB ("deferred" UB). It's something we're slowly working on improving.

No; it is a potential valid implementation for &str to be (ptr, len) but &[u8] to be (len, ptr). However, casting with as between *const [u8] and *const str is documented to work as expected, since the memory representation of str and [u8] is equivalent and they have identical unsize kinds.

Using a pointer cast like this is AIUI the only way that is fully supported by the documentation as it stands today. …Or at least I think we merged the reference PR that clarifies how as casts between slice types work, I didn't actually go verify that.
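A minimal sketch of the pointer-cast route described above. The function name bytes_as_str is hypothetical, and the example only exercises it with valid UTF-8; what callers may do with a non-UTF-8 result is exactly the library-invariant question under discussion, so treat the safety comment as an assumption rather than a settled rule.

```rust
/// Reinterprets a byte slice as a `str` without validating it.
///
/// # Safety
/// Assumed contract for this sketch: the caller must not hand the result to
/// code that relies on `str` being valid UTF-8 unless the bytes really are.
unsafe fn bytes_as_str(bytes: &[u8]) -> &str {
    // The `as` cast between the two raw slice pointer types keeps the
    // address and the length metadata; no transmute of the reference is used.
    let ptr = bytes as *const [u8] as *const str;
    // SAFETY: `ptr` comes from a live shared reference with the same lifetime.
    unsafe { &*ptr }
}
```

With valid input, the result behaves like any other string slice, for example unsafe { bytes_as_str(b"hello") } compares equal to "hello".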
