The standard library currently allocates a new String in to_uppercase and to_lowercase.
There are important use-cases where the user processes one string at a time and could re-use a single allocated String as a scratch buffer:
a streaming iterator adapter that consumes &str and yields its upper-cased version as &str, re-using an internal String as a scratch buffer
some vectorized formats, such as Apache Arrow, do not use String as their main carrier of UTF-8; in these cases the single-item to_* transformations benefit from not having to re-allocate on every item.
The std currently does not cater for such use-cases.
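For illustration, here is a sketch of the workaround available on stable Rust today: re-using a caller-owned String by extending it with the char-level case mapping. The helper name write_uppercase_into is made up for this example.

```rust
// Hypothetical helper: re-uses `out` as a scratch buffer instead of
// allocating a fresh String per item, as `str::to_uppercase` does.
fn write_uppercase_into(s: &str, out: &mut String) {
    out.clear();
    out.reserve(s.len()); // a lower bound; uppercasing can grow the string
    out.extend(s.chars().flat_map(char::to_uppercase));
}

fn main() {
    let mut buf = String::new(); // one allocation, re-used for every item
    for word in ["hello", "straße"] {
        write_uppercase_into(word, &mut buf);
        println!("{buf}");
    }
}
```

This keeps the allocation outside the loop, but it re-implements to_uppercase's logic in user code, which is exactly the duplication the proposal below tries to avoid.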
This pre-RFC proposes adding a function write_uppercase(&self, out: &mut String) that writes the upper-cased version of self into out, re-allocating out only if it lacks capacity:
pub fn write_uppercase(&self, out: &mut String) {
    out.clear();
    out.reserve(self.len());
    // below is copied from to_uppercase
    for c in self[..].chars() {
        match conversions::to_upper(c) {
            [a, '\0', _] => out.push(a),
            [a, b, '\0'] => {
                out.push(a);
                out.push(b);
            }
            [a, b, c] => {
                out.push(a);
                out.push(b);
                out.push(c);
            }
        }
    }
}
The existing function is then re-written as
pub fn to_uppercase(&self) -> String {
    let mut s = String::new();
    self.write_uppercase(&mut s);
    s
}
to avoid duplicating code. The same pattern is applied, mutatis mutandis, to to_lowercase.
The tradeoff is that this further pollutes std::str's namespace.
Alternatives
The current alternative is for users to copy-paste the code inside these functions. This causes users to miss e.g. future bug fixes and/or optimizations in std's code.
Edit: Why is String::to_uppercase written with this unrolled loop instead of s.extend(self.chars().map(char::to_uppercase).flatten())? Does the iterator version not get optimized correctly?
I believe if you care about performance to the point where you want to avoid allocating strings, you're a small step away from caring about the performance overhead of char itself (vs translating bytes directly). Therefore I believe that any API that works like a char iterator only satisfies a very small set of users: Those who care enough about performance to avoid the extra allocation, yet somehow do not care enough about converting their string back and forth from char.
as_uppercase() -> impl Display or write_uppercase both sound good.
This is necessary in order to do the case mapping routine, though. You have to decode the UTF-8 to codepoints, then run those through the case mapping table (basically fn(char) -> ArrayVec<char>, with some exceptions, because language is complicated), then re-encode the codepoints back into UTF-8 code units.
There's no way to do the case mapping "over bytes directly" without decoding into char. Perhaps you could implement a huge finite automaton to do recognition and yield bytes directly, but it'd have significantly larger code size than the mapping tables, and would likely lose any performance benefits to less local code. Plus, it still has to do the UTF-8 decoding and the variable many-to-many mapping; the only "saving" is that it'd encode the output code units rather than the output code points (which, again, increases the size of the routine a lot, since there are a lot of code points to map).
Case mapping isn't a solved problem with a given "best" solution (and there can't be a fixed one, since the problem set evolves over time), but I honestly believe that it's optimally going to stay a transform over code points, not code units. You can do recognition on code units directly, but when the domain is as purely code points as the Unicode algorithms are (they are defined over code points), the extra effort to work directly over code units isn't beneficial.
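The three steps described above (decode UTF-8 into code points, run the many-to-many mapping, re-encode into UTF-8 code units) can be sketched with char::to_uppercase standing in for the mapping table:

```rust
fn main() {
    let input = "großes ß";
    let mut out = Vec::new();
    for c in input.chars() {            // 1. decode UTF-8 into code points
        for m in c.to_uppercase() {     // 2. many-to-many case mapping ('ß' -> "SS")
            let mut buf = [0u8; 4];
            // 3. re-encode the mapped code points back into UTF-8 code units
            out.extend_from_slice(m.encode_utf8(&mut buf).as_bytes());
        }
    }
    println!("{}", std::str::from_utf8(&out).unwrap());
}
```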
You could add fast paths for skipping over large chunks of characters that are already in the desired casing, or implement separate codepaths for all-ASCII strings. That branching can be done on byte patterns instead of char ranges. By exposing a character iterator you throw that optimization potential out of the window.
But I wasn't trying to argue that those methods can be optimized (or even come up with concrete proposals); I am saying that we shouldn't build APIs that cannot be optimized. It shouldn't be the case that we build a new API doing the same thing as to_uppercase into std for performance, then that API is one-upped by some random crate (because somebody did find an optimization and cannot contribute it back due to API limitations) and the new std API ends up being unused.
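As a minimal sketch of such a fast path (not a concrete proposal), one can branch on a byte-level check and fall back to the general Unicode routine for non-ASCII input:

```rust
fn uppercase_with_fast_path(s: &str) -> String {
    if s.is_ascii() {
        // Byte-level fast path: no UTF-8 decoding, no many-to-many mapping.
        s.to_ascii_uppercase()
    } else {
        // General Unicode path.
        s.to_uppercase()
    }
}

fn main() {
    println!("{}", uppercase_with_fast_path("hello"));
    println!("{}", uppercase_with_fast_path("straße"));
}
```

Note that this branching happens inside the function, before any chars are produced, which is precisely what an API shaped as "char iterator in, char iterator out" would preclude.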
I really like this, something like as_display() for any char iter makes a lot of sense. It still loses the size hint if it's not specialized enough, but that may be okay.
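A rough sketch of such an adapter, assuming the iterator is Clone so that Display::fmt can be called more than once (the name DisplayChars is made up here):

```rust
use std::fmt;

// Hypothetical adapter: displays any cloneable char iterator
// without collecting it into a String first.
struct DisplayChars<I>(I);

impl<I: Iterator<Item = char> + Clone> fmt::Display for DisplayChars<I> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for c in self.0.clone() {
            write!(f, "{c}")?;
        }
        Ok(())
    }
}

fn main() {
    let upper = DisplayChars("straße".chars().flat_map(char::to_uppercase));
    // Writes straight into the formatter's buffer; no intermediate String.
    println!("{upper}");
}
```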
Well, it was already resurrected via specialization so that .flat_map(|x| [x, x]) has a useful hint.
Note that the best place to put the hint is on IntoIterator, not Iterator. Because array::IntoIter is variably-sized, but <[T; N] as IntoIterator>::WHATEVER_YOU_CALL_THE_HINT can be exactly N.
So to fix the flatten in this issue, you'd probably want to extend the current specialization to offer a (usize, Option<usize>) rather than just a usize, so that the uppercase mapping could be 1-3 chars. Except that there's no IntoIterator involved here, only the iterator itself, and thus the lower bound has to be 0, which makes it useless. (The upper bounds in size_hints are never actually used for anything, unfortunately.)