Small string optimization: remove as_mut_vec


It wouldn’t be mandatory under SSO for every String constructor to actually use it, including from_utf8 and from_raw_parts. However, in this still hypothetical world where SSO for String is benchmarked to be a benefit in general, I think from_utf8 might be common enough to want it; I think a copy bounded from above by 23ish bytes (which, since the destination is always exactly that length, can be done using all of three 64-bit load/stores in the common case the source isn’t near the end of a page) isn’t exactly what the documentation was worried about.


There seem to be several mentions of benchmarking SSO to see if it is a benefit. I thought the benefit of SSO was a reduction of memory use, and that it was generally done for that, despite having a performance hit.


One of the reasons this method exists is to allow direct mutation of strings as byte buffers without UTF8 checking (i.e., unsafe manipulation). There were previously a bunch of distinction unsafe APIs replicating some of the Vec methods in an ad hoc way, but as_mut_vec makes it possible to just compose with the Vec API directly.

It is somewhat limiting, in that it must be possible to acquire a particular kind of mutable reference into the string (it must be “upgradable” to a vector). But it does not imply that the String representation is always a simple Vec wrapper.


It is somewhat limiting, in that it must be possible to acquire a particular kind of mutable reference into the string (it must be “upgradable” to a vector). But it does not imply that the String representation is always a simple Vec wrapper.

I think this is likely to be quite limiting in practice in terms of the resulting available encoding space, unless String is changed to actually be larger than Vec, which sounds bad.


I agree. Sure: It’s possible, but not in a nice way.

Are there really situations in which I need to change the buffer in a safe and unsafe way alternating? Having a big raw buffer and wanting to wrap a String around it is often useful. And having a String and then mutate it in an unsafe way is too. But always swapping between unsafe and safe operations sounds strange…


After all things said, I have actually found a somewhat valid use case for as_mut_vec: in an optimized implementation of std::io::Read, where it’s certain that read_to_end would either return a complete UTF-8 string or an error, read_to_string can simply give the string’s underlying Vec to read_to_end.


I’ve to admit that I don’t quite get it. The only advantage of that is that read_to_string would be a one-liner? You are not talking about performance, right?


It is about performance and avoiding copies: read_to_string appends to a String passed by mutable reference. There is no way to move out of the reference, so it’s either reading into a temporary Vec that is then appended to the string, or passing the string into read_to_end as a Vec.

See how it’s done in std::io.


Ok, that makes sense. My important question is: Is it OK to have a bit overhead here? If we wanted to replace as_mut_vec with something else, we would have a bit additional work. Consider this (kinda pseudocode):

let vec = Vec::from_raw_parts(buf.to_raw_parts());
// do "append to vec" stuff with UTF8 checks
return String::from_raw_parts(vec.to_raw_parts());

Now this would be more work, yes. But could one live with that little additional work or is it a no-go?

  • If it’s not too bad, I will quickly try to find a fast API fix in RFC form.
  • If it’s a no-go, it is really a valid example that some stuff really just needs the fact that there is a Vec in String. But I would propose that the documentation makes clear that String is really intended to be a UTF8Wrapper around Vec.


I think reassembling the String from raw parts may be a more disciplined approach; as you can see in the std::io module, there is a guard that ensures the string is restored to a correct state even if an unwind happens, and this would look more natural with “phase transitions” between String and Vec. In general, any code temporarily lifting the string invariant for unsafe mutation should be similarly elaborate, so the verbosity of Strong::from_raw_parts is not a big issue.

@LukasKalbertodt, will you submit an RFC PR for “as_mut_vec considered harmful?”


[EDIT: This solution doesn’t make sense, read my next post]

Sorry for the delay! I would love to submit an RFC! I think I have an easy solution actually… What do you think of that:

Remove the method as_mut_vec from String and instead add the method: unsafe fn as_mut_slice(&mut self) -> &mut [u8]. Vec has the same method, though in Vec's case it is not unsafe. In String's case it needs to be unsafe because this method could insert non-UTF8 content into the buffer. Both Vec and String have the unsafe method from_raw_parts. You can obtain those “raw parts” from both classes in a similar way: Both have the method capacity that returns the current capacity. Both will have (!) the method as_mut_slice returning a slice to the buffer (which contains the ptr and len). Additionally both classes have a method to return an immutable slice (Vec::container_as_bytes and String::as_bytes). Furthermore String already has two methods to unsafely convert from and into a Vec without copying the buffer (fn into_bytes(self) -> Vec<u8> and unsafe fn from_utf8_unchecked(bytes: Vec<u8>) -> String).

TL;DR: If I’m right, we just need to replace as_mut_vec with another similar method as_mut_slice to resolve all problems.

Do you (all) agree? If yes, I will try to setup an RFC.


I have been playing with from_raw_parts and it’s not so simple. A library-provided RAII guard would be helpful for safer in-place mutation of String as a Vec. You can take a look at my attempt here.


That, of course, is only helpful in certain scenarios like unsafe appending. In general, in-place-as-Vec is both needed and highly unsafe against potential unwinding.


Ok so I figured that my last “solution” to this problem was pretty useless. Adding as_mut_slice doesn’t really make sense.

@mzabaluev: I agree that from_raw_parts is not the safest thing to use. But I don’t think a RAII guard like the one you proposed does really make sense. Your code still relies on several assumptions about the internal implementation of String, which isn’t optimal. And again: I don’t think that it’s really useful to edit the string buffer in an alternating UTF8-safe and UTF8-unsafe way.

And whenever we want to work on the string buffer without UTF8 checks, we could do:

    let mut v = s.into_bytes();
    // do stuff with v
    s = String::from_utf8_unchecked(v);

And this is not really more work than as_mut_vec, right?

So my new proposal is: Just remove as_mut_vec. You can do everything it could do with other methods anyway. I don’t really like the whole std::string design anyway… But removing the method as_mut_vec is a step in the right direction, I think.


What’s the status of the Small String Optimization? I think it’s a rather important optimization for Rust code. Most strings are quite short.


Reading the thread above it seems that the consensus is that due to the existance of &str, you can simply implement your own SmallString-Type that does the optimization and use it where it can help. Since there are few APIs requiring a String, you can just use your SmallString where you need it and create &str to it when interfacing with other APIs


That sounds like a poor workaround. I have not read the whole thread so I don’t know what the problems are (beside from_raw_parts), but I think a basic optimization like this should be automatic. Most people that could gain from this optimization will not use a SmallString. Think about having to import a crate to optimize floating point operations…


There was arguments in the github thread, that SSO can actually be a pessimization due to its unpredictable branches, code size and cheapness of small allocations.
If you want to start this discussion again, then measurements are welcome.

Also, part of C++'s motivation for SSO doesn’t apply to Rust, C++ needs it because it always has to keep the trailing zero and allocations for empty strings would be ridiculous, Rust doesn’t have this problem.


The last time the libs team talked about this I recall the prevailing attitude was that we value the simple representation of std strings and aren’t going to change it.

There are many ways to represent strings, and applications that really care about it care enough to make their own choices. For example, servo wouldn’t care about std’s small string representation, nor rustc - both have their very application-specific string representations.


I think a parallel to SmallVec is appropriate here. Using a SmallVec (for those that don’t know, it’s similar to SSO in that you have the values inline up to certain number where you switch to the heap) can be a big boon to performance. However there’s no good rule for where that threshold should be. It depends on the size of the contained type (which admittedly isn’t an issue here), and on the use case. One person might be ok with larger sizes, others might prefer a small cutoff (say one or two values). The same applies here, different people will want different cutoffs.

Also as @petrochenkov says, an empty string doesn’t need allocation in Rust.