[quote="Gankro, post:19, topic:1320"]
It's not clear to me that the optimizations applied to C-strings necessarily apply to length-included utf-8-encoded strings with no null byte.[/quote]
We are not really talking about C-strings but rather about std::string, which also stores its length. And I don't see why std::string would pay for extra branches because of the null byte. It could either calloc the buffer and only care about null bytes when reallocating, or just write a null byte unconditionally on every push.
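To make the "write a null byte unconditionally" idea concrete, here is a rough sketch in Rust. `NulTerminated` is a made-up type for illustration only; it is not how any real std::string implementation is written, just the same invariant expressed on top of a `Vec<u8>`.

```rust
/// Hypothetical growable byte string that always keeps a trailing NUL.
/// Illustration only; the name and layout are assumptions.
pub struct NulTerminated {
    // Invariant: `buf` is never empty and its last byte is always 0.
    buf: Vec<u8>,
}

impl NulTerminated {
    pub fn new() -> Self {
        NulTerminated { buf: vec![0] }
    }

    /// Appends one byte. Nothing here asks "do I need a terminator?":
    /// the old NUL is overwritten and a fresh one is always pushed.
    /// The indexing bounds check and Vec's capacity check are the
    /// ordinary costs of a safe growable buffer.
    pub fn push(&mut self, byte: u8) {
        let last = self.buf.len() - 1;
        self.buf[last] = byte;
        self.buf.push(0);
    }

    /// Length without the terminator.
    pub fn len(&self) -> usize {
        self.buf.len() - 1
    }

    /// Contents without the terminator.
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[..self.len()]
    }

    /// Contents including the terminator, as a C API would expect.
    pub fn as_bytes_with_nul(&self) -> &[u8] {
        &self.buf
    }
}
```

The cost of the terminator is one extra store per push, with no data-dependent branch added for it.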
Sure: it's pretty damn hard to evaluate this problem properly. One could analyse big open source projects written in "similar" languages to get information about how strings are used. I think that even C++ applications would give a good indication of how mutable strings are used. With that usage pattern one could create a benchmark and profile everything.
And I really think that people use what's easiest and most familiar. If you have to choose between String and SmallString, you'd think that the latter is a special thing for doing special things. And again: that's fine if SSO is actually something that you would only use in special cases. Problem: we don't know yet, and I doubt that we will have any serious benchmarks before beta.
I absolutely agree. But in languages that aim to be low-level, standard design rules have to be thrown out the window sometimes, I guess. But Vec is not necessary here. String could expose a raw pointer or something else that is not a library type. Creating those cross-type dependencies is not a good thing IMO.
I think that this is a good question, and just saying that these branches are worse than everything else is not a good idea. Often when we care about performance we have a lot of data and we iterate over it in a loop. Remembering that CPUs are pretty damn good at what they do (read: branch prediction), we should think about how regular our branches are. If our strings are usernames, for example, most of them will use SSO and the branches are therefore easily predictable. Same thing if our strings are mostly large. The only problem would be a scenario where our string lengths are somewhere around 23, with 50% above and 50% below, and those strings are randomly distributed in our container.
And yes, if we use branches we don't just pay for mispredictions but for a few other things too, like preventing certain optimizations. But I don't think that's the real issue here.
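For reference, this is roughly the branch being discussed. It is only a sketch: `SsoString` and `total_len` are hypothetical names, the 23-byte inline capacity is the number used in this thread, and a real SSO string would pack both representations into the same 24 bytes instead of using a plain enum.

```rust
/// Inline capacity assumed in this discussion; not a property of `String`.
const INLINE_CAP: usize = 23;

/// Hypothetical string with two representations (unpacked, for clarity).
pub enum SsoString {
    Small { len: u8, buf: [u8; INLINE_CAP] },
    Large(String),
}

impl SsoString {
    pub fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SsoString::Small { len: s.len() as u8, buf }
        } else {
            SsoString::Large(s.to_owned())
        }
    }

    pub fn is_small(&self) -> bool {
        matches!(self, SsoString::Small { .. })
    }

    /// Every accessor has to pay this one branch on the representation.
    pub fn as_str(&self) -> &str {
        match self {
            SsoString::Small { len, buf } => {
                // The bytes were copied from a valid `&str` in `new`.
                std::str::from_utf8(&buf[..*len as usize])
                    .expect("inline bytes are valid UTF-8")
            }
            SsoString::Large(s) => s.as_str(),
        }
    }
}

/// A typical hot loop: the branch in `as_str` is taken once per element.
pub fn total_len(items: &[SsoString]) -> usize {
    items.iter().map(|s| s.as_str().len()).sum()
}
```

If `items` holds only usernames, the `match` goes the same way on every iteration, which is the easy case for a branch predictor. The bad case is a container where `Small` and `Large` alternate unpredictably.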
[quote="mzabaluev, post:25, topic:1320"]
And the issue, as has been said in this discussion, is not about deciding to switch to SSO right now. It's about removing an unsafe, rarely used method that limits the options for changing the String implementation in a backwards-compatible way if a need to do so is discovered. There is only one such method so far identified, despite repeated claims to the contrary.[/quote]
Exactly! And yep, I am pretty sure as_mut_vec is the only method that really causes any trouble, because it returns a reference to a Vec. Consuming the String and returning a new Vec wouldn't be a problem. That said, I would argue that we should remove all methods that involve the Vec type from the string interface. Converting could be done via a free function that uses the unsafe "get-pointer" method of the String and the from_raw_parts method of the Vec. When doing this conversion with a StrVec it just needs to copy three pointer-sized words. Unconditionally. Should be fast enough, I guess.
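A minimal sketch of what such a free function could look like. `MyString`, its `into_raw_parts` method, and `into_vec` are hypothetical stand-ins, not existing std APIs on String; only `Vec::from_raw_parts` and `ManuallyDrop` are real. In this sketch the "get-pointer" step happens to be safe to call, and the unsafety sits in the `from_raw_parts` call.

```rust
use std::mem::ManuallyDrop;

/// Hypothetical string type whose public interface never mentions `Vec`.
pub struct MyString {
    // Private detail; it could be swapped for another layout later
    // without touching the public interface.
    buf: Vec<u8>,
}

impl MyString {
    pub fn new(s: &str) -> Self {
        MyString { buf: s.as_bytes().to_vec() }
    }

    /// Gives up ownership and returns (pointer, length, capacity).
    /// The caller becomes responsible for freeing the buffer.
    pub fn into_raw_parts(self) -> (*mut u8, usize, usize) {
        let mut buf = ManuallyDrop::new(self.buf);
        (buf.as_mut_ptr(), buf.len(), buf.capacity())
    }
}

/// Free conversion function: copies three words, no branch, no reallocation.
pub fn into_vec(s: MyString) -> Vec<u8> {
    let (ptr, len, cap) = s.into_raw_parts();
    // SAFETY: the three values describe a live `u8` buffer from the global
    // allocator, and `into_raw_parts` released ownership of it, so the
    // new Vec is the only owner and the buffer is freed exactly once.
    unsafe { Vec::from_raw_parts(ptr, len, cap) }
}
```

The point is that the string type only promises a pointer, a length and a capacity, so its internal layout stays free to change.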
And if this is not fast enough for your use case, or you actually need a reference to the underlying Vec to mutate the buffer in a non-UTF-8 way, I would consider you one of the 0.1% of people who should write their own string implementation anyway.
So: Yes, this is not about "SSO is better", but about "this one method that is hardly used (and that could be replaced nearly without loss of performance with some other method) has huge implications and is preventing us from maybe doing smart stuff in the future".
EDIT:
I just don't see why he is "hiding" the hard data from us when he is good at such things.
Not really. To implement that method we would have to store the Vec in place. Your idea, correct me if I misunderstood, is to just switch to the non-SSO way of storing the string, like one would do when the string becomes larger than 23 bytes.
Problem A: When we store the Vec inline we still need some indicator of whether we're using SSO. At least one bit. One easy way to do this is to use the LSB of the cap field as the indicator and restrict the capacity to even numbers (which is not a huge restriction, especially with a doubling policy). We can't do that with Vec, since Vec has its own policy and we can't just use some of its bits for other stuff.
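The LSB trick could look something like this. All names here are hypothetical, and storing the inline length in the upper bits of the same word is my own assumption to make the example complete; the only part taken from the paragraph above is "bit 0 of cap says whether we are in SSO mode, and heap capacities are always even".

```rust
/// Bit 0 of the capacity word: set means inline (SSO), clear means heap.
const SMALL_TAG: usize = 1;

/// Rounds a requested heap capacity up to the next even number, so bit 0
/// stays clear. Real capacities are far below `usize::MAX`, so the `+ 1`
/// is assumed not to overflow.
pub fn heap_cap_word(requested: usize) -> usize {
    (requested + 1) & !SMALL_TAG
}

/// Encodes an inline string: tag in bit 0, length in the remaining bits
/// (an assumption of this sketch).
pub fn small_word(len: usize) -> usize {
    (len << 1) | SMALL_TAG
}

/// The single-bit test every method would perform first.
pub fn is_small(word: usize) -> bool {
    word & SMALL_TAG != 0
}

/// Recovers the inline length from a word produced by `small_word`.
pub fn small_len(word: usize) -> usize {
    word >> 1
}
```

With a doubling growth policy an even capacity stays even, so the tag bit is never disturbed by growth. This only works because the string type owns the cap field itself; with an embedded Vec that bit belongs to Vec.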
Problem B: Even if we had solved A, we would potentially add even more branches to every method to check for the three cases (small and SSO, small but not SSO because as_mut_vec was called, big and not SSO).