Thus far my opinion on this issue is that it's fine, and maybe preferable, for basic data structures to have clear and simple performance characteristics (like `Vec` always being heap-allocated). Adding clever optimizations under the hood that may or may not kick in is its own kind of hazard.
To validate or disprove the claims that this would cripple the `String` interface, can someone measure the performance impact of converting uses of `as_mut_vec` to `into_bytes`, plus a backwards move with `String::from_utf8_unchecked` where necessary? As currently implemented, that pair of moving transformations is little more than changing the facade on the same underlying buffer. It is reported that `as_mut_vec` is hardly used, which is understandable given that it's unsafe, and there should almost always be a safe way to mutate a `String` that performs reasonably well.
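As a minimal sketch of that round trip (the function name `push_ascii` is mine, purely for illustration): move the buffer out with `into_bytes`, mutate the bytes, and move it back with `from_utf8_unchecked`. Both moves only change the facade on the same allocation.

```rust
// Sketch of replacing an `as_mut_vec` use with the
// `into_bytes` / `from_utf8_unchecked` round trip described above.
fn push_ascii(s: String, b: u8) -> String {
    assert!(b.is_ascii(), "only ASCII keeps the buffer valid UTF-8");
    let mut bytes = s.into_bytes(); // String -> Vec<u8>, no copy
    bytes.push(b);
    // Safety: we only appended an ASCII byte, so the buffer is still UTF-8.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```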
I'm of the opinion that any data-structure API that exposes the internals is a bad idea. I had intuitively assumed that `String` was SSO-able without a breaking change, so this was surprising to me.
Some of the points made by @strcat on GitHub need further evaluation, I think:
> Small allocations take ~8ns on modern hardware.
Sorry, but without hard data backing it up against some realistic workload, this is all as much hearsay as claiming that SSO/SVO is a magic bullet for speed.
> I don't think you realize how cheap small allocations are. Branch misses are much worse.
So how many mispredicted branches are caused by frequent allocation and deallocation over varying block sizes? For workloads where SSO would be relevant, the predominant usage pattern of `String` would likely be something like: allocate, a few appends, a few sequential reads, deallocate. So allocations are likely to be stacked fairly high relative to other operations, unlike workloads that tend to mutate bigger strings. Without seeing the details of the measurements, which have been scarce, it's hard to validate the claim that branches on buffer access are the main culprit in most realistic use cases, rather than, for example, the chosen implementation of SSO or suboptimal compiler behavior.
And the issue, as has been said in this discussion, is not about deciding to switch to SSO right now. It's about removing an unsafe, rarely used method that limits the options for changing the `String` implementation in a backwards-compatible way, if a need to do so is ever discovered. Only one such method has been identified so far, despite repeated claims to the contrary.
> There is only one such method so far identified, despite repeated claims to the contrary.
`from_raw_parts` would have to be altered to accept only pointers that are at least 16-bit aligned (or possibly be removed), so there are at least two. And it goes quite a bit deeper if we still allow `Vec<u8>` in other methods.
As far as strcat's performance numbers go: it's always best to validate, but he has contributed quite a bit to jemalloc as well as Rust's allocator API, is writing his own allocator (which is benchmarking very well), and even got a Linux kernel patch accepted to improve allocator performance. I'd be inclined to trust his numbers, particularly in the absence of hard evidence demonstrating the opposite.
I think the whole question of "should we keep `as_mut_vec`" is almost completely irrelevant to whether or not we eventually do SSO. `as_mut_vec` can easily be adjusted to de-SSO the `String` if it is SSO'd, because it takes `&mut self`, as @pepp noted above. I suspect that the performance regressions from that would be small compared with the regressions from requiring people to choose potentially slower alternatives to `as_mut_vec`. SSO is just an optimisation, ideally applied whenever it improves speed; losing one rare case where it could be used isn't much of a problem in my eyes. Keeping `as_mut_vec` does not rule out SSO at all, so I don't really see how it is a problem.
We are not really talking about C strings, but rather about `std::string`, which also has its length included. And I don't see why `std::string` would pay branches for the null byte. It could either `calloc` the buffer and only care about null bytes when reallocating, or just write a null byte unconditionally on every push.
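A hypothetical sketch of that second strategy (the type `NulString` and its methods are illustrative, not a real API): the buffer always ends with a trailing zero, and `push` overwrites the old terminator and appends a new one, with no branch on the NUL at all.

```rust
// Illustrative NUL-terminated byte buffer: push pays one extra write
// per call instead of any branch on the terminator.
struct NulString {
    buf: Vec<u8>, // invariant: the last byte is always 0
}

impl NulString {
    fn new() -> Self {
        NulString { buf: vec![0] }
    }

    fn push(&mut self, b: u8) {
        // Overwrite the old terminator, then unconditionally append a new one.
        *self.buf.last_mut().unwrap() = b;
        self.buf.push(0);
    }

    fn len(&self) -> usize {
        self.buf.len() - 1 // don't count the terminator
    }
}
```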
Sure: it's pretty damn hard to evaluate this problem properly. One could analyse big open-source projects in "similar" languages to gather information about string usage patterns. I think even C++ applications would give a good indication of how mutable strings are used. With that usage pattern, one could create a benchmark and profile everything.
And I really think that people use what's easiest to use and most familiar. If you have to choose between `String` and `SmallString`, you'd think the latter is a special thing for doing special things. And again: that's fine if SSO actually is something you'd only use in special cases. Problem: we don't know yet, and I hardly believe we'll have serious benchmarks before the beta.
I absolutely agree. But in languages aiming to be low-level, standard design rules sometimes have to be thrown out the window, I guess. But `Vec` is not necessary here. `String` could expose a raw pointer or something else that is not a library type. Creating those cross-type dependencies is not a good thing, IMO.
I think that this is a good question, and just asserting that these branches are worse than everything else is not a good idea. Often when we care about performance, we have a lot of items and we iterate over them in a loop somehow. Remembering that CPUs are pretty damn good at what they do (read: branch prediction), we should think about how regular our branches are. If our strings are usernames, for example, most strings will use SSO and the branches are easily predictable. Same thing if our strings are mostly large. The only problem case would be a scenario where our string lengths hover around 23, with 50% above and 50% below, and those strings randomly distributed in our container. And yes, if we use branches, we don't just pay for mispredictions but for a few other things, like preventing certain optimizations. But I don't think that's the real issue here.
Exactly! And yep, I am pretty sure `as_mut_vec` is the only method that really causes trouble, because it returns a reference to a `Vec`. Consuming the `String` and returning a new `Vec` wouldn't be a problem. Although I would say that we should remove all methods that mention the `Vec` type from the `String` interface. Converting could be done via a free function that uses the unsafe "get pointer" method of the `String` and the `from_raw_parts` method of `Vec`. When doing this conversion with a StrVec, it just needs to copy three pointers. Unconditionally. Should be fast enough, I guess.
And if this is not fast enough for your use case, or you actually need a reference to the underlying `Vec` to mutate the buffer in a non-UTF-8 way, I would consider you part of the 0.1% of people who should write their own string implementation anyway.
So: yes, this is not about "SSO is better", but about "this one method that is hardly used (and that could be replaced with almost no loss of performance by some other method) has huge implications and is preventing us from maybe doing smart stuff in the future".
EDIT:
I just don't see why he would be "hiding" the hard data from us when he is good at such things.
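A sketch of the free-function conversion described above (the name `string_into_vec` is mine, and the raw-parts decomposition stands in for the proposed "get pointer" method): read out the pointer, length, and capacity, and reassemble them with `Vec::from_raw_parts`. With the current non-SSO layout, this just copies three words.

```rust
use std::mem::ManuallyDrop;

// Sketch: convert a String into a Vec<u8> without the String interface
// ever mentioning Vec.
fn string_into_vec(s: String) -> Vec<u8> {
    // Don't run String's destructor; the new Vec takes ownership.
    let mut s = ManuallyDrop::new(s);
    let (ptr, len, cap) = (s.as_mut_ptr(), s.len(), s.capacity());
    // Safety: ptr/len/cap describe the live allocation previously owned by
    // the String, which we have suppressed dropping, so the Vec is now the
    // sole owner of that buffer.
    unsafe { Vec::from_raw_parts(ptr, len, cap) }
}
```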
Not really. To implement that method, we would have to store the `Vec` in place. Your idea, correct me if I misunderstood, is to just switch to the non-SSO way of storing the string, as one would do when the string becomes larger than 23 bytes.
Problem A: when we store the `Vec` inline, we still need some indicator of whether we're using SSO. At least one bit. One easy way to do this is to use the LSB of the `cap` field as the indicator and restrict the capacity to even numbers (which is not a huge restriction, especially with a doubling policy). We can't do that with `Vec`, since `Vec` has its own policy and we can't just use some of its bits for other things.
Problem B: even if we solved A, we would potentially add even more branches to every method, to check for the three cases (small and SSO; small but not SSO because `as_mut_vec` was called; big and not SSO).
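A minimal sketch of the tag scheme from Problem A (helper names are hypothetical): heap capacities are rounded up to even numbers, so the least significant bit of the capacity word is free to flag the SSO case.

```rust
// Round a requested heap capacity up to the nearest even number, so the
// LSB of the stored capacity word stays clear for heap strings.
fn round_cap_to_even(cap: usize) -> usize {
    (cap + 1) & !1
}

// With the rounding invariant above, an odd capacity word can only mean
// "stored inline" (SSO).
fn is_sso(cap_word: usize) -> bool {
    cap_word & 1 == 1
}
```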
There is no stated guarantee that `from_raw_parts` will use the passed buffer as-is. Actually, does it even work for arbitrarily created buffers without leaking or corrupting the heap? This function needs some more explanation...
Also, tagged pointers are not the only scheme of encoding SSO.
Feel free to suggest a variant of SSO that doesn't require at least one extra bit of information... in the meantime, it seems pretty clear to me that this would be a backwards-incompatible change to that function, since it doesn't require alignment right now. The same goes for any function allowing creation of a `String` from a `Vec<u8>`.
This is all assuming you want them to perform well, of course. If you have to make the existing methods all perform much worse in order to enable the optimization, it becomes much less attractive.
There is one linked up the thread that uses the length <= capacity invariant to encode the flag in two bits, which is enough to also encode the length in the same byte.
And as said above, `from_raw_parts` permits backwards-compatible circumvention, unlike `as_mut_vec`.
And I mentioned another method in my post above, in the "EDIT" section. Just want to make sure that you don't miss that spot, since you are all apparently answering too fast for me...
If you store length <= capacity in two bits, it is not a backwards-compatible change, since you have now restricted the maximum string length to differ from the maximum vector length. On a 32-bit system, this is a pretty significant difference. Restricting the capacity to even numbers is also not backwards compatible if you want to maintain existing performance.
I'm not sure if you misunderstood the method or if I misunderstood you. But we don't actually "store" anything more. We just know that if the MSB of the `len` field is 1, the MSB of the `cap` field can't be 0. So we still have the same capacity and length.
I'd argue that it's hardly a difference. I don't think that allocators ever create blocks of odd length, so this additional byte shouldn't hurt performance at all. Has to be measured, though.
No restriction: it just cannot happen that the MSB of the length is 1 while the MSB of the capacity is 0. So these two bits are encoded in the first byte, using a bit-mixed pattern for the fields of the non-SSO string/vector structure, and when these bits are "wrong", the string is stored in place. It's kinky, but it does not require the pointer to be aligned.
Speaking of pointers: if `from_raw_parts` is really only supposed to take buffers obtained from the standard allocator, that allocator could be guaranteed to always return word-aligned buffers, which is typically the case for other reasons as well.
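A sketch of the "impossible bit pattern" trick described above (the helper name is hypothetical): for any heap string, len <= cap, so the MSB of `len` being set while the MSB of `cap` is clear can never occur, and that combination is free to mark the in-place representation.

```rust
// Mask for the most significant bit of a word.
const MSB: usize = 1 << (usize::BITS - 1);

// A heap string always satisfies len <= cap, so MSB(len) = 1 together with
// MSB(cap) = 0 is unreachable for heap strings; treat it as the SSO tag.
fn is_inline(len_word: usize, cap_word: usize) -> bool {
    (len_word & MSB) != 0 && (cap_word & MSB) == 0
}
```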
I misunderstood the method. Okay, I agree that with the MSB method it would (seemingly, anyway) be backwards compatible (since it exploits some redundant information in the existing data structure).
What about renaming back to `StrBuf` and reserving `String` for a (mostly) immutable string, if the `Box<str>` thing does not turn out well?
> Your idea, correct me if I misunderstood, is to just switch to the non-SSO way of storing the string like one would do when the string becomes larger than 23 bytes.
Correct me if I'm wrong, but I think you misunderstand me. What I am proposing is that the current `as_mut_vec` implementation be changed to something like this:
```rust
fn as_mut_vec(&mut self) -> &mut Vec<u8> {
    if self.is_sso() {
        // This method changes the String's representation to a non-SSO one;
        // that is, one using a pointer, length, and capacity.
        self.change_from_sso_to_normal_representation();
    }
    &mut self.vec
}
```
This would work regardless of how SSO gets implemented.
`from_raw_parts` is potentially backwards-incompatible to change, but it depends on the precise SSO implementation. Forcing a capacity that is a multiple of 2 could be backwards compatible, as the `Vec` could simply be reallocated whenever the capacity is odd (I think). It's likely that most other SSO implementations could work with it, too.
@pepp The case of building a likely-small string before sealing it into an immutable value ought to be supported in an easy-to-use, idiomatic way, without making the developer think about different types and their tradeoffs. This could be done, e.g., with a `box_from_iterator` constructor and a macro:

`let s: Box<str> = build_str!(base, "foo", suffix);`
Well... not exactly, I think. To make this clear: we don't want our string to be larger than 3 words (ptr, len, cap). So a `Vec` has the same size as our `String` type, which means that every bit of our type is used by the vector. Since you are returning a reference to that internal vector, the user could call `reserve` on it with an odd number, and our `String` would then assume it is using SSO because the LSB(it) is set. So this is not possible. Furthermore: if we want to use the method with the LSB(it), we have to make sure that the LSB(yte) is at the beginning or the end of our type, which is not always the case, thanks to endianness. So yeah... we have so many special restrictions on the `cap`, `len`, and `ptr` fields (which are used by the vector) that it is impossible (I think) to do it this way.
@mzalaluev: The thing is that `Box<str>` currently does not work, is not promoted in The Book, and we have no experience with its ergonomics. It may turn out that it is not up to the task of being the number-one choice for strings, and then the best name for a suitable type, `String`, will have been taken by a string type optimized for mutability, which is arguably not the most common requirement.
I see several possibilities here:
- Status quo, and keep caring about optimizations for `String`.
- Have `&str`, `Box<str>`, `Rc<str>` as primary types promoted in The Book, and `StrBuf`, `SmallString`, and others as domain-specific types. Then `str` will need more work to stand up to the task.
- Have `&str` and `String` as primary types promoted in The Book, where `String` is a (mostly) immutable string type using whatever optimizations turn out to be best for general-case usage. Then the current `String` will have to be renamed, and part of it reused for the new `String`, for starters.