Array (fixed width slice) `Ord` implementation

I hit a performance issue recently where the following code (data: &mut [u8])

data.as_chunks_mut::<4>().0
    .as_chunks_mut::<2>().0
    .sort_unstable()

went substantially (~1.5-2x) slower than

data.as_chunks_mut::<8>().0
    .sort_unstable()

which sorts a &mut [[u8; 8]] rather than a &mut [[[u8; 4]; 2]].
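
For reference, both variants wrapped into compilable functions (the function names are mine, and this assumes `data.len()` is a multiple of 8 so no trailing bytes are left out):

```rust
// Sketch only: names and the length assumption are illustrative.
fn sort_nested(data: &mut [u8]) {
    // sorts &mut [[[u8; 4]; 2]] (the slow variant)
    data.as_chunks_mut::<4>().0.as_chunks_mut::<2>().0.sort_unstable();
}

fn sort_flat(data: &mut [u8]) {
    // sorts &mut [[u8; 8]] (the fast variant)
    data.as_chunks_mut::<8>().0.sort_unstable();
}
```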

A coworker looked around and found that fixed-width slices have an Ord implementation (mod.rs source) that shells out to the general slice Ord implementation, which hides the length information. It seems easy enough (naively) to flesh out the fixed-width slice Ord implementation, and doing so might recover the performance in this case.
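
If it helps to see the shape, here is a paraphrase of that pattern on a local wrapper type (a sketch of the idea only, not the verbatim std source):

```rust
use std::cmp::Ordering;

// A stand-in for [u8; N]: its Ord just forwards to the general slice
// comparison, so the compare is written against &[u8] rather than [u8; N].
#[derive(PartialEq, Eq)]
struct FixedChunk<const N: usize>([u8; N]);

impl<const N: usize> Ord for FixedChunk<N> {
    fn cmp(&self, other: &Self) -> Ordering {
        // delegate to the slice comparison, losing the compile-time N
        Ord::cmp(&self.0[..], &other.0[..])
    }
}

impl<const N: usize> PartialOrd for FixedChunk<N> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```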

Now, this is the part of the post where I admit that I have no clue about anything, and there might be a great reason to do things this way. I couldn’t find anything with some light searching, and thought that folks here might have a take. I’ve unblocked myself with some matches on known products of numbers (e.g. 4 × 2 = 8), because afaict you cannot multiply const generics (e.g. as_chunks_mut::<M * N>()), but that leaks abstractions I’d love to avoid leaking (the 4 and the 2 are discovered in unrelated parts of the codebase).
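
For concreteness, the unblocking hack looks something like this (the function name and the particular products are made up; the point is just the match on known width pairs):

```rust
// Hypothetical sketch: dispatch on known products, because
// as_chunks_mut::<{ M * N }>() needs unstable const-generic arithmetic.
fn sort_with_known_product(data: &mut [u8], inner: usize, outer: usize) {
    match (inner, outer) {
        (4, 2) => data.as_chunks_mut::<8>().0.sort_unstable(),
        (2, 2) => data.as_chunks_mut::<4>().0.sort_unstable(),
        _ => unimplemented!("product {inner}x{outer} not special-cased"),
    }
}
```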

But if there are any clarifying thoughts here, I’d be delighted to hear! :smiley:

Thanks,

It looks like what's actually happening here is that comparing [u8; N] is specialized to just call memcmp, rather than having a loop at all, but that doesn't trigger for [[u8; N]; M]: https://rust.godbolt.org/z/h6h8PWPPT
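
Roughly what that link compares, for readers who don't want to click through (a stand-in sketch, not necessarily the exact snippet behind the URL):

```rust
use std::cmp::Ordering;

// Flat: comparing [u8; 8] can become a single memcmp-style comparison.
pub fn cmp_flat(a: &[u8; 8], b: &[u8; 8]) -> Ordering {
    a.cmp(b)
}

// Nested: [[u8; 4]; 2] doesn't hit that specialization.
pub fn cmp_nested(a: &[[u8; 4]; 2], b: &[[u8; 4]; 2]) -> Ordering {
    a.cmp(b)
}
```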

The comparison implementation just delegating to a slice seems not to materially impact things, since it still generates excellent assembly for the innermost comparison (movbe+cmp) and the [_; 2] part got unrolled.

Do you know you're sorting nested arrays? I wonder if passing a predicate using as_flattened could help a bunch: https://rust.godbolt.org/z/abs5hb7Gh.
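
Something along these lines, presumably (a sketch; it may differ from the code behind the link):

```rust
// Sort the nested chunks, but compare each [[u8; 4]; 2] as a flat &[u8].
fn sort_nested_by_flat_cmp(chunks: &mut [[[u8; 4]; 2]]) {
    chunks.sort_unstable_by(|a, b| a.as_flattened().cmp(b.as_flattened()));
}
```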

Thanks for the detail! I appreciate it.

> Do you know you're sorting nested arrays?

Ah, it is tricky to say when I “know” that they are nested arrays. This is a [u8] with a few layers of delimiting, and at runtime we may see that one layer is e.g. a multiple of 4, and another a multiple of 2. Seeing the first of these allows us to go to [[u8;4]], and the second allows us to go to [[[u8;4];2]], but both happen as a result of run-time tests (using as_chunks_mut). Ideally the two parts of the code wouldn’t have to know about each other, though (e.g. the second one going from [C] to [[C; 2]] without thinking about whether C happens to be a byte array). tl;dr: across the whole program, yes, I can know, but for software-architecture reasons I’d love to not have to know. :smiley:
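
As a toy example of that layering (all names here are made up), the second layer can pair things up generically, without knowing that the element type happens to be a byte array:

```rust
// Hypothetical sketch: the first layer discovers a width of 4 at runtime,
// the second pairs elements up without caring what C is.
fn pair_up<C>(items: &mut [C]) -> &mut [[C; 2]] {
    items.as_chunks_mut::<2>().0
}

fn layered(data: &mut [u8]) -> &mut [[[u8; 4]; 2]] {
    let quads: &mut [[u8; 4]] = data.as_chunks_mut::<4>().0;
    pair_up(quads)
}
```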

It does seem like the UnsignedBytewiseOrd trait, which appears to do the memcmp trick you point out, could perhaps be generalized to [U; K] for U: UnsignedBytewiseOrd. It’s an unsafe trait and seems specialization-adjacent, so I’m 100% making stuff up there.

I had no idea about as_flattened, and Vec::into_flattened (which I also didn’t know about, but just discovered based on this hint) solves a different problem that I had going in the other direction (from arrays back to flat slices). I shall ask more questions in the future!
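
For anyone else who hadn't run into them, a tiny illustration of both (the values are made up):

```rust
fn demo_flatten() {
    let nested: Vec<[u8; 4]> = vec![[1, 2, 3, 4], [5, 6, 7, 8]];

    // Borrow the nested arrays as one flat &[u8], no copying involved.
    let view: &[u8] = nested.as_flattened();
    assert_eq!(view.len(), 8);

    // Consume the Vec<[u8; 4]> and get a flat Vec<u8> back.
    let flat: Vec<u8> = nested.into_flattened();
    assert_eq!(flat, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```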

Thanks very much.

UnsignedBytewiseOrd currently requires sizeof(Self) == 1, so not trivially, but maybe something?

I took a quick stab at specializing SliceOrd for arrays to flatten first, but didn't get it to work. It might need more intermediate traits; I'm unsure.

Avoiding the nested loop would probably be good for all slices-of-arrays, even ones that don't get the special memcmp trick.
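
In userland terms the transformation would look something like this (a sketch of the idea, not the internal SliceOrd change):

```rust
use std::cmp::Ordering;

// Compare slices-of-arrays by flattening once, rather than comparing each
// inner array in a nested loop. For N > 0 this matches the element-wise
// lexicographic order; N == 0 would need its own case, since the
// element-wise order then falls back to comparing the outer lengths.
fn cmp_flattened<T: Ord, const N: usize>(a: &[[T; N]], b: &[[T; N]]) -> Ordering {
    a.as_flattened().cmp(b.as_flattened())
}
```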