Non-allocating stable sort

At the moment we have two sorting methods for slices:

  1. sort, which is stable but allocates memory for the merge. This makes it slower.
  2. sort_unstable, which uses O(log n) memory, but always on the stack, so no dynamic allocations.

I think it is possible that someone would need both stable sorting and the absence of allocations. After some thought, I came up with the following solution:

  1. Add a new method sort_stable_with_mem(&mut self, buf: &mut [MaybeUninit<Self::Item>]) which accepts a buffer of the same size as self and uses it for sorting.
  2. The standard sort() could then be a wrapper around the new method: it checks whether the slice is already sorted, then allocates a buffer and calls the new method.

This way, users would be able to provide space on the stack, or reuse a temporary allocation across multiple sorts.
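A minimal sketch of how such a buffer-taking stable sort could look. The name sort_stable_with_mem and its signature come from the proposal above and are hypothetical; for brevity this sketch is a free function restricted to Copy + Ord elements and requires a full-length buffer:

```rust
use std::mem::MaybeUninit;

/// Hypothetical sketch of the proposed API: a stable merge sort that merges
/// through a caller-provided scratch buffer instead of allocating.
/// Restricted to `T: Copy + Ord` to keep the unsafe code trivial.
fn sort_stable_with_mem<T: Copy + Ord>(slice: &mut [T], buf: &mut [MaybeUninit<T>]) {
    let n = slice.len();
    assert!(buf.len() >= n, "buffer must be at least as large as the slice");
    if n <= 1 {
        return;
    }
    let mid = n / 2;
    sort_stable_with_mem(&mut slice[..mid], buf);
    sort_stable_with_mem(&mut slice[mid..], buf);

    // Copy the two sorted runs into the scratch buffer, then merge them back.
    for (dst, &src) in buf.iter_mut().zip(slice.iter()) {
        *dst = MaybeUninit::new(src);
    }
    let (mut i, mut j) = (0, mid);
    for out in slice.iter_mut() {
        // Take from the left run on ties to preserve stability.
        // SAFETY: buf[..n] was fully initialized in the loop above.
        let take_left = i < mid
            && (j >= n || unsafe { buf[i].assume_init() <= buf[j].assume_init() });
        if take_left {
            *out = unsafe { buf[i].assume_init() };
            i += 1;
        } else {
            *out = unsafe { buf[j].assume_init() };
            j += 1;
        }
    }
}

fn main() {
    let mut data = [5i32, 3, 8, 1, 9, 2, 7];
    // The scratch space lives on the stack here; it could just as well be a
    // heap buffer reused across many sorts.
    let mut scratch = [MaybeUninit::<i32>::uninit(); 7];
    sort_stable_with_mem(&mut data, &mut scratch);
    assert_eq!(data, [1, 2, 3, 5, 7, 8, 9]);
}
```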

Alternatives:

  1. Use a stable, non-allocating sort algorithm. I think these are slower even though they don't allocate, which looks even worse when a smart allocator like jemalloc or mimalloc makes allocation cheap.
  2. Accept an allocator as a generic parameter (as Vec and Box do). I think this is the more complex route and would make some use cases harder (e.g., instead of simply reusing an allocation, one would need to write unsafe allocator code). Users who already have an Allocator can very easily use it to allocate memory before calling sort_stable_with_mem.

Maybe there are other options.

What do you think? Also, maybe this is not useful enough to be added to the standard library, and users can just use a crate for this.

While the documentation for sort_unstable does indicate that the function does not allocate:

This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.

this is not the case for the sort function: its allocating behavior is described as an implementation detail. If that is indeed the case, a future version of this function could use a non-allocating sorting algorithm.

Another thing to note is that the size of the buffer allocated during the sorting operation is also an implementation detail. The current implementation allocates at most half the size of the input slice. My understanding is that this could change at some point, which might make it tricky to choose the size of the buffer passed to the function.

With the allocator work happening, I think instead of _with_mem it'd be like the _in methods in Vec that take an allocator, rather than taking a buffer.

But it's not obvious to me that the middle ground here is necessary, because the other alternative is to just make sure your sorting predicate is distinct, like how sort_by_cached_key can use sort_unstable internally: it includes an index in the key, and is thus stable because there are no equivalent items.
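The distinct-predicate trick can be sketched by hand: decorate each element with its original index so no two keys compare equal, and an unstable sort then produces a stable order (this is roughly what sort_by_cached_key does internally):

```rust
fn main() {
    let data = vec![("b", 2), ("a", 1), ("b", 1), ("a", 2)];
    // Decorate with the index, sort unstably by (key, index), strip the index.
    let mut decorated: Vec<(usize, (&str, i32))> =
        data.into_iter().enumerate().collect();
    decorated.sort_unstable_by(|(i, a), (j, b)| a.0.cmp(b.0).then(i.cmp(j)));
    let sorted: Vec<(&str, i32)> = decorated.into_iter().map(|(_, x)| x).collect();
    // Equal keys keep their original relative order, as a stable sort would.
    assert_eq!(sorted, vec![("a", 1), ("a", 2), ("b", 2), ("b", 1)]);
}
```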


It would be nice to add a sort method on Vec<T> which calls self.reserve(self.len()/2) and exploits the unused space.
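For reference, stable Rust already exposes a Vec's unused capacity via spare_capacity_mut, which returns exactly the &mut [MaybeUninit<T>] shape a buffer-taking sort would want. The wiring below is only a sketch; the plain v.sort() stands in for the real dispatch:

```rust
fn main() {
    let mut v: Vec<i32> = Vec::with_capacity(8);
    v.extend([3, 1, 2]);
    // The unused capacity, viewed as an uninitialized buffer.
    let spare: &mut [std::mem::MaybeUninit<i32>] = v.spare_capacity_mut();
    assert!(spare.len() >= 5);
    // A buffer-taking stable sort could merge through `spare` here without a
    // fresh allocation; the ordinary sort below is just a stand-in.
    v.sort();
    assert_eq!(v, [1, 2, 3]);
}
```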


If that needs a reallocation, it's probably worse than just making a new allocation, since a fresh allocation wouldn't need to copy the data over. So sure, if the space already happened to be there it'd be nice to use it, but I don't know that I'd want it to add capacity that wasn't already there.


I would actually frown upon sorting touching the capacity at all. Conceptually, sorting is something that can be done purely in-place, therefore reallocation is unexpected, so it's a footgun.

Specifically, I'm thinking about unsafe code that checks the capacity and/or obtains a pointer to the buffer, sorts the vector, then uses the pointer. Although this kind of code is clearly bad practice and not robust, I would be extremely wary of introducing unexpected reallocations that cause it to misbehave. This is very subtle and non-obvious, which I think is outright inexcusable in the case of a fundamental type like Vec.


The sorting function could at least check if there is enough unused capacity in the vec for its purpose, and sort in that space if so (effectively removing an allocation and keeping the previous vec capacity). Though I'm doubtful as to how often this case occurs (the vec having at least 33% unused capacity, if we're talking about the current sort implementation).


When the Vec was created in a manner that knows the exact length of the Vec in advance (which is probably fairly common) then there generally wouldn’t ever be ⅓ unused capacity.


However, for the case that we have a Vec created by a sequence of push operations without knowing its size in advance, or e.g. by collecting from an iterator that does not have a statically known lower bound on its length (e.g. involving filter), then it’s not too unlikely.

Each resize doubles the capacity of a Vec. Assuming a Benford's-Law-style situation, i.e. a sort of scale invariance where the probability of each specific Vec size decreases as the size increases, the probability of having the necessary spare capacity should actually be about 41.5%.

calculation…

I’d model the situation of a growing Vec between any two resizes as follows.

  • The initial size is s0, and the initial capacity is C = 2·s0.
  • The growth of the size is modeled such that the logarithms are uniformly distributed: a parameter t between 0 and 1 is mapped to a size s0·2^t, which yields sizes between s0·2^0 = s0 and s0·2^1 = C as expected; and when t is uniformly distributed, then so are the logarithms of the sizes created in this manner.
  • This means that the probability of having ⅓ of the capacity unused at our disposal is given by the fraction of the 0-to-1 range for t where a size s0·2^t still leaves ⅓·C of unused capacity in C.
  • Formally, we look for x such that s0·2^x + ⅓·C = C; the range of t with sufficient capacity is then those t between 0 and x, hence the proportion we're after is equal to x.

s0·2^x + ⅓·C = C

can be simplified to

s0·2^x = (2/3)·C = (2/3)·2·s0

where now (unsurprisingly) the value of the initial size s0 becomes irrelevant:

2^x = (2/3)·2 = 4/3

which gives x = log2(4/3) ≈ 0.41503…
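The closed-form answer is easy to sanity-check numerically. The sketch below samples t uniformly (using a dependency-free xorshift64 generator; the constants s0 = 1000 and one million samples are arbitrary choices) and confirms that the fraction of sizes leaving at least ⅓·C spare matches log2(4/3):

```rust
fn main() {
    // Closed-form answer from the derivation above.
    let x = (4.0f64 / 3.0).log2();
    assert!((x - 0.41503).abs() < 1e-4);

    // Monte-Carlo sanity check: sample t uniformly in [0, 1), let the size be
    // s0·2^t with capacity C = 2·s0, and count how often at least C/3 of the
    // capacity is left unused.
    let s0 = 1000.0f64;
    let c = 2.0 * s0;
    let samples = 1_000_000u64;
    let mut hits = 0u64;
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // xorshift64 seed, no dependencies
    for _ in 0..samples {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Top 53 bits of the state mapped into [0, 1).
        let t = (state >> 11) as f64 / (1u64 << 53) as f64;
        let size = s0 * 2f64.powf(t);
        if c - size >= c / 3.0 {
            hits += 1;
        }
    }
    let frac = hits as f64 / samples as f64;
    assert!((frac - x).abs() < 0.01, "got {frac}, expected ~{x}");
}
```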


Finally, whenever we have a Vec created from a (significantly) larger Vec, e.g. via retain, without shrinking the capacity, then having the necessary free capacity should be even more likely.


I think a single usize check before delegating to the slice sort is much cheaper than an allocation, and could provide a positive effect.
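That check really is one comparison of usizes; something like the following (has_spare_for_merge is a hypothetical helper name, and len/2 assumes the buffer size of the current stable sort implementation):

```rust
/// Hypothetical helper: does this Vec already have enough spare capacity for
/// the merge buffer (half the length, under the current stable sort)?
fn has_spare_for_merge<T>(v: &Vec<T>) -> bool {
    v.capacity() - v.len() >= v.len() / 2
}

fn main() {
    let mut v: Vec<i32> = Vec::with_capacity(6);
    v.extend([4, 2, 3, 1]);
    assert!(has_spare_for_merge(&v)); // 2 spare >= 4/2
    v.extend([6, 5]);
    assert!(!has_spare_for_merge(&v)); // 0 spare < 6/2
}
```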


Maybe it's too tricky a case, but another optimization that has been mentioned before is that specialization could just dispatch from sort into sort_unstable for certain element types. Never with a custom comparator etc., but there are still a few use cases where it applies, for example <[i32]>::sort(). This optimization is also in the category of avoiding the allocation of extra buffers.
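The observation is easy to demonstrate: for plain integers under the default ordering, stability is unobservable, so sort and sort_unstable must agree on the result (the dispatch itself would need the unstable specialization feature, so it isn't sketched here):

```rust
fn main() {
    let data = vec![3i32, 1, 3, 2, 1];
    let mut a = data.clone();
    let mut b = data.clone();
    a.sort();
    b.sort_unstable();
    // Equal i32s are indistinguishable, so the two results are identical.
    assert_eq!(a, b);
}
```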

The stable O(n log n) sorts I've heard of, like GitHub - BonzaiThePenguin/WikiSort: Fast and stable sort algorithm that uses O(1) memory. Public domain., seem quite complicated, and in this case they still use a side buffer, but of constant size, so it's even more likely to fit into a Vec's spare capacity?

Not necessarily a good idea when the documentation of sort_unstable explicitly states

It is typically faster than stable sorting, except in a few special cases, e.g., when the slice consists of several concatenated sorted sequences.

(emphasis mine)

and similarly, the documentation of sort says

The current algorithm […] is designed to be very fast in cases where the slice is nearly sorted, or consists of two or more sorted sequences concatenated one after another.