I think reassembling the String from raw parts may be a more disciplined approach; as you can see in the std::io module, there is a guard that ensures the string is restored to a correct state even if an unwind happens, and this would look more natural with “phase transitions” between String and Vec. In general, any code temporarily lifting the string invariant for unsafe mutation should be similarly elaborate, so the verbosity of Strong::from_raw_parts is not a big issue.
@LukasKalbertodt, will you submit an RFC PR for “as_mut_vec considered harmful?”
[EDIT: This solution doesn’t make sense, read my next post]
Sorry for the delay!
I would love to submit an RFC! I think I have an easy solution actually… What do you think of that:
Remove the method as_mut_vec from String and instead add the method: unsafe fn as_mut_slice(&mut self) -> &mut [u8]. Vec has the same method, though in Vec's case it is not unsafe. In String's case it needs to be unsafe because this method could insert non-UTF8 content into the buffer.
Both Vec and String have the unsafe method from_raw_parts. You can obtain those “raw parts” from both classes in a similar way: Both have the method capacity that returns the current capacity. Both will have (!) the method as_mut_slice returning a slice to the buffer (which contains the ptr and len). Additionally both classes have a method to return an immutable slice (Vec::container_as_bytes and String::as_bytes).
Furthermore String already has two methods to unsafely convert from and into a Vec without copying the buffer (fn into_bytes(self) -> Vec<u8> and unsafe fn from_utf8_unchecked(bytes: Vec<u8>) -> String).
TL;DR: If I’m right, we just need to replace as_mut_vec with another similar method as_mut_slice to resolve all problems.
Do you (all) agree? If yes, I will try to setup an RFC.
I have been playing with from_raw_parts and it’s not so simple. A library-provided RAII guard would be helpful for safer in-place mutation of String as a Vec. You can take a look at my attempt
here.
That, of course, is only helpful in certain scenarios like unsafe appending. In general, in-place-as-Vec is both needed and highly unsafe against potential unwinding.
Ok so I figured that my last “solution” to this problem was pretty useless. Adding as_mut_slice doesn’t really make sense.
@mzabaluev: I agree that from_raw_parts is not the safest thing to use. But I don’t think a RAII guard like the one you proposed does really make sense. Your code still relies on several assumptions about the internal implementation of String, which isn’t optimal. And again: I don’t think that it’s really useful to edit the string buffer in an alternating UTF8-safe and UTF8-unsafe way.
And whenever we want to work on the string buffer without UTF8 checks, we could do:
let mut v = s.into_bytes();
// do stuff with v
s = String::from_utf8_unchecked(v);
And this is not really more work than as_mut_vec, right?
So my new proposal is: Just remove as_mut_vec. You can do everything it could do with other methods anyway.
I don’t really like the whole std::string design anyway… But removing the method as_mut_vec is a step in the right direction, I think.
Reading the thread above it seems that the consensus is that due to the existance of &str, you can simply implement your own SmallString-Type that does the optimization and use it where it can help. Since there are few APIs requiring a String, you can just use your SmallString where you need it and create &str to it when interfacing with other APIs
That sounds like a poor workaround. I have not read the whole thread so I don't know what the problems are (beside from_raw_parts), but I think a basic optimization like this should be automatic. Most people that could gain from this optimization will not use a SmallString. Think about having to import a crate to optimize floating point operations...
There was arguments in the github thread, that SSO can actually be a pessimization due to its unpredictable branches, code size and cheapness of small allocations.
If you want to start this discussion again, then measurements are welcome.
Also, part of C++'s motivation for SSO doesn’t apply to Rust, C++ needs it because it always has to keep the trailing zero and allocations for empty strings would be ridiculous, Rust doesn’t have this problem.
The last time the libs team talked about this I recall the prevailing attitude was that we value the simple representation of std strings and aren’t going to change it.
There are many ways to represent strings, and applications that really care about it care enough to make their own choices. For example, servo wouldn’t care about std’s small string representation, nor rustc - both have their very application-specific string representations.
I think a parallel to SmallVec is appropriate here. Using a SmallVec (for those that don't know, it's similar to SSO in that you have the values inline up to certain number where you switch to the heap) can be a big boon to performance. However there's no good rule for where that threshold should be. It depends on the size of the contained type (which admittedly isn't an issue here), and on the use case. One person might be ok with larger sizes, others might prefer a small cutoff (say one or two values). The same applies here, different people will want different cutoffs.
Also as @petrochenkov says, an empty string doesn't need allocation in Rust.