Short String optimization

mcy · September 20, 2018, 3:55pm

Has there been any discussion of having String live entirely on the stack for len < size_of::<usize>() * n? Imagine

enum StringInner {
    Stack {
        len: u8,
        bytes: [u8; N],
    },
    Heap(Vec<u8>),
}

Of course, off the top of my head I don’t know how well this plays with existing APIs, but I wanted to put down the idea for discussion before I mull it over more.

Note that this is the same size as the current String, if we use StringInner::Heap.0's RawVec's Unique as a niche. I have no idea if @eddyb’s work on niches can handle something like this, though.
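For reference, the size claim can be probed with a small sketch. The enum mirrors the one in the post, with a hypothetical inline capacity N = 23 picked for illustration on a 64-bit target. Whether the compiler packs the enum into the size of a String depends on unspecified layout decisions, so the program prints that size rather than asserting it.

```rust
use std::mem::size_of;

// Hypothetical inline capacity, chosen only for illustration.
const N: usize = 23;

#[allow(dead_code)]
enum StringInner {
    Stack { len: u8, bytes: [u8; N] },
    Heap(Vec<u8>),
}

/// Size of today's String, which wraps a Vec<u8>.
fn string_size() -> usize {
    size_of::<String>()
}

/// Size of the sketched enum; not guaranteed to match String.
fn inner_size() -> usize {
    size_of::<StringInner>()
}

fn main() {
    println!("String:      {} bytes", string_size());
    println!("Vec<u8>:     {} bytes", size_of::<Vec<u8>>());
    println!("StringInner: {} bytes", inner_size());
}
```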

sfackler · September 20, 2018, 4:12pm

We guarantee that we will never perform this optimization for Vec: https://doc.rust-lang.org/std/vec/struct.Vec.html#guarantees. I don’t think the same is explicitly documented for String, but String is simply a wrapper around Vec so it still holds there at least in practice.

cuviper · September 20, 2018, 4:17pm

And that guarantee of indirection is necessary for StableDeref to work.
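A minimal sketch of the property StableDeref relies on: moving a String moves only the pointer, length, and capacity, so the address of the text itself does not change. The helper name is made up for this example. With an inline representation the bytes would travel with the value and the two addresses could differ.

```rust
/// Returns the buffer address seen before and after the String is moved.
fn addresses_across_move(s: String) -> (*const u8, *const u8) {
    let before = s.as_ptr();
    // Move the String value into a Vec; only its three words are copied.
    let holder = vec![s];
    let after = holder[0].as_ptr();
    (before, after)
}

fn main() {
    let (before, after) = addresses_across_move(String::from("hello"));
    assert_eq!(before, after);
    println!("buffer stayed at {:p}", before);
}
```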

Centril · September 20, 2018, 4:37pm

I think the assumption behind unsafe impl StableDeref for String {} was made in error, so it would be entirely legitimate for T-libs to break it if they want to, because no such guarantee was ever given.

leonardo · September 20, 2018, 4:41pm

Short string optimization is often regarded as quite an important optimization. But do we have benchmarks that compare the performance of the current String to the performance of a short-string-optimized type in some reasonable real programs/libraries?

Centril · September 20, 2018, 4:48pm

I’m not aware of this myself, but I believe the Tendril crate is relevant here. We could always do benchmarks in the future to see if such a change would be worth it =)

mbrubeck · September 20, 2018, 5:09pm

String docs do make a number of other promises that are hard to keep if it does not share its representation with Vec<u8>. For example, String::into_bytes says that it does not copy the contents.
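As an illustration of that promise, the sketch below checks that the Vec<u8> returned by into_bytes starts at the same address as the original String. The helper name is hypothetical, and the check reflects how the current representation behaves rather than anything an SSO design could keep for inline strings.

```rust
/// Returns true if into_bytes handed over the existing buffer
/// instead of copying the contents into a new one.
fn into_bytes_reuses_buffer(s: String) -> bool {
    let before = s.as_ptr();
    let bytes: Vec<u8> = s.into_bytes();
    before == bytes.as_ptr()
}

fn main() {
    let reused = into_bytes_reuses_buffer(String::from("hello"));
    println!("buffer reused: {reused}");
}
```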

pepp · September 20, 2018, 5:10pm

There have been some conversations in the past:

github.com/rust-lang/rust

`String::as_mut_vec` prevents small string optimization

Opened by petrochenkov on 24 Dec 2014, 12:23 UTC; closed on 14 Jan 2015, 17:43 UTC.

SSO is a popular optimization technique and is currently implemented in all the major C++ standard libraries*. If Rust decides to adopt it too then it will come into contradiction with the `String::as_mut_vec` method, which exposes details of the current `String` implementation incompatible with SSO. I suppose the choice between these two has to be made before marking `as_mut_vec` as stable.

*Although it's not used by default in libstdc++ due to ABI compatibility.

scottmcm · September 21, 2018, 12:20am

I don’t think there’s any choice of N here that’s obviously correct, so I prefer having both String as it is today and some sort of SmallString, just as we have both Vec and the smallvec crate.
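A minimal sketch of what such a separate type could look like, assuming a made-up name SmallString and an arbitrary inline capacity of 22 bytes. This is not the API of any existing crate; it only shows the inline/heap split and the branch that each access has to take.

```rust
use std::ops::Deref;

/// Hypothetical inline capacity; no value is obviously right.
const INLINE_CAP: usize = 22;

/// Hypothetical small-string type kept separate from String.
enum SmallString {
    Inline { len: u8, bytes: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallString {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut bytes = [0u8; INLINE_CAP];
            bytes[..s.len()].copy_from_slice(s.as_bytes());
            SmallString::Inline { len: s.len() as u8, bytes }
        } else {
            SmallString::Heap(s.to_owned())
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallString::Inline { .. })
    }
}

impl Deref for SmallString {
    type Target = str;

    fn deref(&self) -> &str {
        // Every access to the contents goes through this branch.
        match self {
            SmallString::Inline { len, bytes } => {
                // The bytes were copied from a valid &str, so this cannot fail.
                std::str::from_utf8(&bytes[..*len as usize]).expect("valid UTF-8")
            }
            SmallString::Heap(s) => s.as_str(),
        }
    }
}

fn main() {
    let short = SmallString::new("hello");
    let long = SmallString::new("a string that is too long to fit inline");
    println!("{:?} inline: {}", &*short, short.is_inline());
    println!("{:?} inline: {}", &*long, long.is_inline());
}
```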

rpjohnst · September 21, 2018, 1:56am

The importance of this optimization is perhaps less in Rust, which has safe string slices permeating its APIs, lessening the amount of string allocation and copying that goes on.

TechnoMancer · September 21, 2018, 5:44am

The String docs also talk about its representation and say the buffer is always allocated on the heap.

gnzlbg · September 21, 2018, 8:22am

Note that since we do not technically guarantee it for String, StableDeref is playing a risky game here.

cuviper · September 21, 2018, 5:22pm

I agree that StableDeref took a risk, but since we know that code exists, we should think carefully about whether it’s worth breaking it. Or we can just codify that guarantee for String and leave SSO to something like SmallString.

mcy · September 21, 2018, 6:03pm

Well, there is technically a rather nasty way out of this. I’ll note that I think this is a bad hack, and that guaranteeing that String is a newtype over Vec<u8> is a mistake.

Since things only break down when we give out references to things inside of String, we could treat SSO as a “deferred” allocation; anything that semantically gives out addresses into a String would trigger an allocation. Unfortunately, this means introducing interior mutability, since we’d need to mutate inside of Deref::deref…

mbrubeck · September 21, 2018, 6:49pm

This includes calling any &str methods on the String, and it would mean that almost any actual usage of the string's contents would prevent it from remaining on the stack.
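To make the deferred-allocation idea and this objection concrete, here is a rough sketch. DeferredString, INLINE_CAP, and has_spilled are hypothetical names, and std::cell::OnceCell (stable since Rust 1.70) stands in for the interior mutability mentioned above. The first Deref call copies the inline bytes to the heap, so even a plain len() call ends the inline phase.

```rust
use std::cell::OnceCell;
use std::ops::Deref;

/// Hypothetical inline capacity, chosen only for illustration.
const INLINE_CAP: usize = 16;

/// Hypothetical string that starts inline and moves to the heap on first deref.
struct DeferredString {
    len: u8,
    bytes: [u8; INLINE_CAP],
    // Interior mutability: filled in through `&self` the first time a &str is handed out.
    heap: OnceCell<Box<str>>,
}

impl DeferredString {
    /// Returns None if `s` does not fit inline.
    fn new(s: &str) -> Option<Self> {
        if s.len() > INLINE_CAP {
            return None;
        }
        let mut bytes = [0u8; INLINE_CAP];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Some(DeferredString { len: s.len() as u8, bytes, heap: OnceCell::new() })
    }

    /// True once the contents have been handed over to the heap copy.
    fn has_spilled(&self) -> bool {
        self.heap.get().is_some()
    }
}

impl Deref for DeferredString {
    type Target = str;

    fn deref(&self) -> &str {
        // The returned &str points into the Box, so it stays valid if `self` is moved later.
        self.heap.get_or_init(|| {
            let inline = std::str::from_utf8(&self.bytes[..self.len as usize])
                .expect("valid UTF-8");
            Box::from(inline)
        })
    }
}

fn main() {
    let s = DeferredString::new("hi").expect("fits inline");
    println!("spilled before use: {}", s.has_spilled());
    let n = s.len(); // an ordinary &str method, reached through Deref
    println!("len = {n}, spilled after use: {}", s.has_spilled());
}
```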

mcy · September 21, 2018, 6:54pm

I did say it was a bad hack.

FenrirWolf · September 21, 2018, 10:18pm

Yup. The guarantees aren't as iron-clad as those of Vec, but they do seem to suggest that a non-empty String is always heap-allocated. From the docs for String:

Representation

A String is made up of three components: a pointer to some bytes, a length, and a capacity. The pointer points to an internal buffer String uses to store its data. The length is the number of bytes currently stored in the buffer, and the capacity is the size of the buffer in bytes. As such, the length will always be less than or equal to the capacity.

This buffer is always stored on the heap.
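The three components described in the quoted docs can be read back through the public API. A small sketch, with a helper name made up for this example:

```rust
/// Returns the (pointer, length, capacity) triple of a String.
fn parts(s: &String) -> (*const u8, usize, usize) {
    (s.as_ptr(), s.len(), s.capacity())
}

fn main() {
    let mut s = String::with_capacity(16);
    s.push_str("hello");
    let (ptr, len, cap) = parts(&s);
    // The length never exceeds the capacity.
    assert!(len <= cap);
    println!("ptr = {:p}, len = {}, capacity = {}", ptr, len, cap);
}
```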

madmalik · September 23, 2018, 6:26pm

Is there actually reason to believe that the majority of use cases for std strings would benefit from SSO?

Most of the time we can use slices instead of owned strings (at least when it comes to typical situations where one uses a huge number of very small strings, where SSO strings shine), courtesy of the borrow checker and all that. Just because C++ does it does not mean it’s a win for us too.
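A short sketch of the point about slices: splitting a line into many small words creates no new strings, because each &str borrows from the original buffer. The function names are made up for this example.

```rust
/// Splits a line into words without allocating any new strings;
/// each &str borrows from `line` (only the Vec of slices is allocated).
fn words(line: &str) -> Vec<&str> {
    line.split_whitespace().collect()
}

/// Returns true if `part` lies entirely inside the buffer of `whole`.
fn borrows_from(part: &str, whole: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let end = start + whole.len();
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= end
}

fn main() {
    let line = String::from("many very small strings");
    for word in words(&line) {
        println!("{word}: borrowed = {}", borrows_from(word, &line));
    }
}
```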

pepp · September 24, 2018, 11:20am

I remember that in those old discussions it was said that C++ benefits from SSO because an empty std::string has to contain a terminating \0, and empty strings are actually very frequent. Rust does not allocate for empty strings. On the other hand, SSO adds to program size by adding branches to every use of a string. I am not going to search for it, and my memory may be failing me, but there were some posts stating that it had been measured and that the tested programs ran faster without SSO.
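The claim about empty strings is easy to check: a String created with String::new reports a capacity of zero until something is pushed into it. A small sketch with made-up helper names:

```rust
/// Capacity of a freshly created, empty String.
fn empty_capacity() -> usize {
    String::new().capacity()
}

/// Capacity after pushing a single character, when a buffer is first needed.
fn capacity_after_push() -> usize {
    let mut s = String::new();
    s.push('a');
    s.capacity()
}

fn main() {
    println!("empty: capacity {}", empty_capacity());
    println!("after push: capacity {}", capacity_after_push());
}
```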

mcy · September 24, 2018, 4:08pm

It definitely sounds like SSO isn’t quite as useful for Rust; std::string is pretty pervasive in C++, and things like absl::string_view are not nearly as pleasant to use as &str.

I might argue, though, that while branching will increase binary size, it might actually be valuable to be able to turn on SSO as an optimization. However, to do this we’d need to explicitly remove any guarantees about the layout of String, which really shouldn’t have been made in the first place.
