The problem with using some nontrivial combination of capacity and length or whatever as the short string indicator, regardless of whether this is generally possible with a Vec, is that it may require a longer assembly sequence to decode. Since strcat already, possibly correctly, believes that minimal sequences required to decode SSO bloat the icache too much, adding arbitrary constraints is probably a bad idea.
The concerns are fair, and I'm not (anymore) advocating deoptimization of String as a good builder in favor of it being a better owned data container for short strings. But:
- That does not mean there is a good enough reason to expose Vec as an inner value of String, to outweigh loss of freedom for backwards-compatible optimization.
- An immutable owned string type should exist and be promoted.
- An efficient and ergonomic way to create a value of such a type by joining together some likely-small string slices should exist and be promoted.
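As a rough illustration of the last two points, here is a minimal sketch that uses `Box<str>` as a stand-in for an immutable owned string type. The helper name `join_small` is hypothetical; `concat` and `into_boxed_str` are existing standard library methods.

```rust
/// Joins string slices into an immutable owned string.
/// `join_small` is a made-up name used only for this sketch.
fn join_small(parts: &[&str]) -> Box<str> {
    // `concat` sums the lengths of all parts first and allocates once;
    // `into_boxed_str` then drops the spare capacity field.
    parts.concat().into_boxed_str()
}
```

The result is two words wide (pointer and length) and cannot grow, which is roughly what an "owned data container" for short strings wants.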
Servo used to use as_mut_vec, but it seems that they do not anymore.
I guess I'm the only one who remembers the time when Rust tried to use a small vector optimization for all vectors and strings. The compile time and code size impact was extreme. Small string optimization is poison in that concentration. We can't go back to that world.
After thinking about this a bit, I'm not sure how often this would be an issue in practice since most consumers of strings would take &str, so it wouldn't matter if people are using a String, SmallString, etc. I think it'd be best for us to instead make sure that we don't accidentally limit people's ability to write their own custom owned string implementation.
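To make the "consumers take &str" point concrete, here is a small sketch. The function names are hypothetical; the point is only that one `&str` consumer serves several owned string types through deref coercion.

```rust
use std::borrow::Cow;

/// A consumer that only needs to read the text, so it takes `&str`.
fn count_words(s: &str) -> usize {
    s.split_whitespace().count()
}

/// Calls the same consumer with three different owned representations.
fn demo() -> [usize; 3] {
    let owned: String = String::from("a b c");
    let boxed: Box<str> = "d e".into();
    let cow: Cow<str> = Cow::Borrowed("f");
    // Each `&T` below coerces to `&str` because T derefs to `str`.
    [count_words(&owned), count_words(&boxed), count_words(&cow)]
}
```

A custom SSO string that implements `Deref<Target = str>` would plug into `count_words` the same way.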
The only thing that gives me pause is how we'd represent a discontinuous Rope, but that type probably has different enough semantics that we couldn't just drop it into most places that we'd use a String.
I'd also be okay if we make String::as_mut_vec() private, and instead added the unsafe methods String::as_mut_bytes() and String::push_bytes(). Both of those would make sense on a shared trait that unifies the owned string type interfaces. Maybe that could claim the StrBuf name?
The problem with String::from_raw_parts is essentially the same: it is documented to be interchangeable with Vec::from_raw_parts. I can see how this could be worked around should there ever be a need for it, though. For now, perhaps, some warning could be added to the documentation that there is no official guarantee on interchangeability?
Now, I don't assume that String could be changed to have a different internal structure in the 1.0 timeframe. But as a thought experiment, suppose you have discovered through extensive profiling that most current users of String could get a 25% speedup if the reallocation strategy is tuned in a certain way. The same does not apply for vectors in general; in fact, there are important cases where their performance would regress. So you decide to use a differently tuned implementation for String, but then when the string is cast as a Vec and used in this "mode" for a while, this upsets things to the extent that you have to pessimize the reallocation code to account for such a possibility. Whereas, if all your users go through String::into_bytes and String::from_utf8_unchecked (which as of now are essentially ways to change compile-time metadata on a structure slot), those would be the places to make the necessary adjustments to keep the hot paths clean.
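For reference, a minimal sketch of the by-value round trip described above. The function name is hypothetical; `into_bytes` and `from_utf8` are the real conversion points, and `from_utf8` is used here instead of `from_utf8_unchecked` so the sketch stays in safe code.

```rust
use std::string::FromUtf8Error;

/// Hands the string's buffer to byte-level code, then converts back.
/// `append_via_bytes` is a made-up name for illustration.
fn append_via_bytes(s: String, extra: &[u8]) -> Result<String, FromUtf8Error> {
    // Conversion point 1: String -> Vec<u8>, currently without copying.
    let mut bytes: Vec<u8> = s.into_bytes();
    bytes.extend_from_slice(extra);
    // Conversion point 2: Vec<u8> -> String, validating the bytes.
    // The unsafe `String::from_utf8_unchecked` would skip the validation.
    String::from_utf8(bytes)
}
```

Because ownership passes through these two calls, a differently tuned String would only need to adjust its bookkeeping there, not on every push.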
"(5) it is comparatively easy for a project that does need SSO to swap out string types compared to many other large-scale changes"
I'm not sure how often this would be an issue in practice since most consumers of strings would take &str
I agree, and I did this recently on my own project with gcc vstring (SSO optimized to hold 15 chars + null, but more importantly, non copy-on-write / buggy/ racy). As long as there is a lingua franca (&str) to read among string types, which rust seems to have, it seems fine.
As far as I can tell nobody in this thread was really concerned about the removal (or making private) of as_mut_vec. No one has evidence that this method is used often or in important ways. Is that correct?
And we also agree that this method can be replaced with other unsafe methods like from_raw_parts without doing much additional work (with the small exception of an SSO string, which needs to convert to a normal string first), right?
And we agree that keeping this method would limit the ability to change the implementation of the string type which could (or could not) be a problem in the future.
So I don't really see any reason to keep it.
But I read and thought about another idea that was mentioned somewhere. I am really not sure if this is useful, but I think it would be a pretty flexible solution.
We separate the "String" into two parts: Buffer and UTF8-Wrapper. We would have a trait called StringBuf (for example) and a type UTF8Wrapper<B: StringBuf>. The wrapper ensures that everything stays UTF8 by using the methods specified in the StringBuf interface. The string module would consist of the wrapper type, a few special string buffers (like SSO) and this type alias:
type String = UTF8Wrapper<Vec<u8>>; // or UTF8Wrapper<SSOBuf>
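A minimal sketch of what that split might look like. Everything here is hypothetical: the trait methods, the choice to make the trait `unsafe`, and the alias name `VecString` (used instead of `String` to avoid shadowing the real type) are assumptions made for illustration only.

```rust
use std::str;

/// Hypothetical buffer interface. It is an `unsafe trait` in this sketch
/// because the wrapper trusts `as_bytes` to return exactly the bytes that
/// were pushed, in order.
unsafe trait StringBuf: Default {
    fn as_bytes(&self) -> &[u8];
    fn push_bytes(&mut self, bytes: &[u8]);
}

unsafe impl StringBuf for Vec<u8> {
    fn as_bytes(&self) -> &[u8] {
        self
    }
    fn push_bytes(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes)
    }
}

/// Wrapper that keeps the UTF-8 invariant on top of any buffer.
struct UTF8Wrapper<B: StringBuf> {
    buf: B,
}

impl<B: StringBuf> UTF8Wrapper<B> {
    fn new() -> Self {
        UTF8Wrapper { buf: B::default() }
    }

    fn push_str(&mut self, s: &str) {
        self.buf.push_bytes(s.as_bytes());
    }

    fn as_str(&self) -> &str {
        // SAFETY: bytes only enter through `push_str`, which takes `&str`,
        // and concatenating valid UTF-8 sequences yields valid UTF-8.
        unsafe { str::from_utf8_unchecked(self.buf.as_bytes()) }
    }
}

type VecString = UTF8Wrapper<Vec<u8>>;
```

An SSO buffer would be a second `unsafe impl StringBuf`, and the wrapper code would stay unchanged.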
I really like the idea. But to really use this, some things need to be clear:
- Is there enough time to make such a big change?
- Is it possible to do that without loss in performance?
- Would the trait StringBuf allow an implementation by an immutable string buffer?
So: I really think we should at least remove as_mut_vec and maybe think of another even better solution.
FWIW, from_utf8() also seems to be strongly coupled with an internal Vec representation, because the docs say:
Returns the vector as a string buffer, if possible, **taking care not to copy it**.
It should be possible to implement that without copying the buffer even without using Vec as implementation, I think. And I don't think that this should be a part of the String type but rather of a special VecString type...
It wouldn't be mandatory under SSO for every String constructor to actually use it, including from_utf8 and from_raw_parts. However, in this still hypothetical world where SSO for String is benchmarked to be a benefit in general, I think from_utf8 might be common enough to want it; I think a copy bounded from above by 23ish bytes (which, since the destination is always exactly that length, can be done using all of three 64-bit load/stores in the common case the source isn't near the end of a page) isn't exactly what the documentation was worried about.
There seem to be several mentions of benchmarking SSO to see if it is a benefit. I thought the benefit of SSO was a reduction of memory use, and that it was generally done for that, despite having a performance hit.
One of the reasons this method exists is to allow direct mutation of strings as byte buffers without UTF8 checking (i.e., unsafe manipulation). There were previously a bunch of distinct unsafe APIs replicating some of the Vec methods in an ad hoc way, but as_mut_vec makes it possible to just compose with the Vec API directly.
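A short sketch of that kind of composition, using the real `String::as_mut_vec` together with an ordinary Vec method. The function name is hypothetical, and the caller-side invariant is spelled out in the comment.

```rust
/// Appends `n` ASCII digits (0, 1, 2, ... wrapping after 9) to `s`
/// by writing bytes directly, with no UTF-8 validation.
fn push_ascii_digits(s: &mut String, n: u8) {
    // SAFETY: only bytes in b'0'..=b'9' are appended, and ASCII is valid
    // UTF-8, so the string invariant still holds when the borrow ends.
    let v: &mut Vec<u8> = unsafe { s.as_mut_vec() };
    v.extend((0..n).map(|i| b'0' + i % 10));
}
```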
It is somewhat limiting, in that it must be possible to acquire a particular kind of mutable reference into the string (it must be "upgradable" to a vector). But it does not imply that the String representation is always a simple Vec wrapper.
It is somewhat limiting, in that it must be possible to acquire a particular kind of mutable reference into the string (it must be "upgradable" to a vector). But it does not imply that the String representation is always a simple Vec wrapper.
I think this is likely to be quite limiting in practice in terms of the resulting available encoding space, unless String is changed to actually be larger than Vec, which sounds bad.
I agree. Sure: It's possible, but not in a nice way.
Are there really situations in which I need to change the buffer in a safe and an unsafe way alternately? Having a big raw buffer and wanting to wrap a String around it is often useful. And having a String and then mutating it in an unsafe way is too. But always swapping between unsafe and safe operations sounds strange...
After all things said, I have actually found a somewhat valid use case for as_mut_vec: in an optimized implementation of std::io::Read, where it's certain that read_to_end would either return a complete UTF-8 string or an error, read_to_string can simply give the string's underlying Vec to read_to_end.
I have to admit that I don't quite get it. The only advantage of that is that read_to_string would be a one-liner? You are not talking about performance, right?
It is about performance and avoiding copies: read_to_string appends to a String passed by mutable reference. There is no way to move out of the reference, so it's either reading into a temporary Vec that is then appended to the string, or passing the string into read_to_end as a Vec.
See how it's done in std::io.
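The following is a simplified sketch of that pattern, not a copy of the std::io source. The names `Guard` and `read_to_string_via_vec` are assumptions for this example. The guard truncates the buffer back to its last known-valid length if validation fails or the reader panics, so the String never exposes unchecked bytes.

```rust
use std::io::{self, Read};
use std::str;

/// Truncates the vector to `len` on drop, discarding unvalidated bytes.
struct Guard<'a> {
    buf: &'a mut Vec<u8>,
    len: usize,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.buf.truncate(self.len);
    }
}

/// Appends everything from `r` to `buf` without a temporary Vec.
fn read_to_string_via_vec<R: Read>(r: &mut R, buf: &mut String) -> io::Result<usize> {
    let start = buf.len();
    // SAFETY: the guard cuts the vector back to `g.len` on every exit path,
    // and `g.len` only advances after the new bytes pass UTF-8 validation.
    let mut g = Guard { len: start, buf: unsafe { buf.as_mut_vec() } };
    let ret = r.read_to_end(g.buf);
    if str::from_utf8(&g.buf[start..]).is_err() {
        // Keep a read error if there was one; otherwise report bad data.
        ret.and_then(|_| {
            Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "stream did not contain valid UTF-8",
            ))
        })
    } else {
        g.len = g.buf.len();
        ret
    }
}
```

Without access to the underlying Vec, the same function would have to read into a scratch `Vec<u8>` and then copy it into the string with `push_str`.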
Ok, that makes sense. My important question is: Is it OK to have a bit of overhead here? If we wanted to replace as_mut_vec with something else, we would have a bit of additional work. Consider this (kinda pseudocode):
let vec = Vec::from_raw_parts(buf.to_raw_parts());
// do "append to vec" stuff with UTF8 checks
return String::from_raw_parts(vec.to_raw_parts());
Now this would be more work, yes. But could one live with that little additional work or is it a no-go?
- If it's not too bad, I will quickly try to find a fast API fix in RFC form.
- If it's a no-go, it is really a valid example that some stuff really just needs the fact that there is a Vec in String. But I would propose that the documentation makes clear that String is really intended to be a UTF8Wrapper around Vec.
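For comparison, the pseudocode above can be approximated with existing by-value conversions instead of raw parts. This is a sketch under assumptions: `append_via_take` is a hypothetical name, and it relies on `std::mem::take`, which is newer than this discussion.

```rust
use std::mem;
use std::str;

/// Appends `extra` to `buf` if it is valid UTF-8; returns whether it did.
fn append_via_take(buf: &mut String, extra: &[u8]) -> bool {
    // Move the buffer out, leaving an empty String (no allocation) behind.
    let mut vec: Vec<u8> = mem::take(buf).into_bytes();
    let old_len = vec.len();
    vec.extend_from_slice(extra);
    let ok = str::from_utf8(&vec[old_len..]).is_ok();
    if !ok {
        vec.truncate(old_len);
    }
    // SAFETY: the first `old_len` bytes came from a valid String, and the
    // appended bytes were either validated or removed just above.
    *buf = unsafe { String::from_utf8_unchecked(vec) };
    ok
}
```

With today's Vec-backed String the extra work is a handful of word-sized moves. An SSO representation would additionally need to spill short strings to the heap in `into_bytes`, which is the overhead the question above is about.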