Pre-pre-RFC: Support `write_uppercase(&self, &mut String)`

The standard library's to_uppercase currently allocates a new String in to_uppercase and to_lowercase.

There are important use-cases where the user wants to process string by string and can re-use a single allocated String as a scratch zone:

  • a streaming iterator adapter that consumes &str and yields its upper-cased version as &str (and re-uses an internal String as a scratch zone)
  • some vectorized formats such as Apache Arrow do not use String as its main carrier of utf8; in this cases the single-item transformations to_* benefits from not having to re-allocate on every item when transforming a string.

The std currently does not cater for such use-cases.

This pre-RFC proposes adding a function write_uppercase(&self, out: &mut String) that writes the upper-cased version of self into out, optionally re-allocating out:

pub fn write_uppercase(&self, s: &mut String) {
        s.clear();
        s.reserve(self.len());

        // below is copied from to_uppercase
        for c in self[..].chars() {
            match conversions::to_upper(c) {
                [a, '\0', _] => s.push(a),
                [a, b, '\0'] => {
                    s.push(a);
                    s.push(b);
                }
                [a, b, c] => {
                    s.push(a);
                    s.push(b);
                    s.push(c);
                }
            }
        }
        s
    }

The existing function is then re-written as

pub fn to_uppercase(&self) -> String {
        let mut s = String::new();
        self.write_uppercase(&mut s);
        s
    }

to not duplicate code. This pattern is applied mutatis mutandis to to_lowercase.

The tradeoff is that this further polutes std::str's namespace.

Alternatives

The current alternative is for users to copy-paste the code inside these functions. This causes users to miss e.g. future bug fixes and/or optimizations in std's code.

2 Likes

Given that that implementation does it one-by-one anyway, what about something like .chars_uppercase() instead?

Then one could have to_uppercase be

pub fn to_uppercase(&self) -> String {
    self.chars_uppercase().collect()
}

And with an appropriate fold override on the iterator, it should be just as good performance-wise.

12 Likes

I don't think write_uppercase should clear the string first. But anyway this whole operation can be done relatively compactly now:

scratch.clear();
scratch.reserve(s.len());
scratch.extend(s.chars().map(char::to_uppercase).flatten());

I'm not sure it's worth the extra method.

Quick note, this loses the size hint.

Edit: Why is String::to_uppercase written with this unrolled loop instead of s.extend(self.chars().map(char::to_uppercase).flatten())? Does the iterator version not get optimized correctly?

1 Like

Flatten and flat_map always lose the lower bound, because it's possible that the inner IntoIterators don't have any elements in them. I once tried to fix it, but it was rejected (Produce better size_hints from Flatten & FlatMap by scottmcm · Pull Request #48544 · rust-lang/rust · GitHub).

But a dedicated to_upper iterator could have a useful size_hint, since it could know the underlying str.

EDIT: it looks like there's now a specialization for flat_map that returns arrays, but nothing else.

8 Likes

I would prefer a function that just returns a struct that implements Display and only converts the string lazily when printed:

// create an uppercase string:
format!("{}", s.as_uppercase())

// append to an existing string:
write!(s, "{}", s.as_uppercase())

// print to stdout:
println!("{}", s.as_uppercase())
7 Likes

Note that this is true for uppercase, but not lowercase, which must specially handle lunate sigma: str.rs - source

A string output could also be implemented for any iterator producing char, e.g. your as_uppercase() becomes:

// print to stdout:
println!("{}", s.chars().map(char::to_uppercase).display_string())

I believe if you care about performance to the point where you want to avoid allocating strings, you're a small step away from caring about the performance overhead of char itself (vs translating bytes directly). Therefore I believe that any API that works like a char iterator only satisfies a very small set of users: Those who care enough about performance to avoid the extra allocation, yet somehow do not care enough about converting their string back and forth from char.

as_uppercase() -> impl Display or write_uppercase both sound good.

This is necessary in order to do the case mapping routine, though. You have to decode the UTF-8 to codepoints, then run those through the case mapping table (basically fn(char) -> ArrayVec<char>, with some exceptions (because language is complicated)), then re-encode codepoints back into UTF-8 code units.

There's no way to do the case mapping "over bytes directly" without decoding into char. Perhaps you could implement a huge finite automaton to do recognition and yield bytes directly, but it'd be significantly larger code size than with the mapping tables, and could likely lose any performance benefits to less local code. Plus, it still has to do the UTF-8 decoding and variable many-many mapping process; the only "saving" is that it'd encode the code units output rather than the code points output (which, again, increases the space of the routine a lot, since there are a lot of code points to map).

Case mapping isn't a solved problem with a given "best" solution (and there can't be a fixed one, since the problem set evolves over time), but I honestly believe that it's going to optimally stay as a transform over code points, not code units. You can do recognition on code units directly, but once the domain is so purely code points as Unicode algorithms (which are defined over code points), the extra effort to do so directly over code units isn't beneficial.

1 Like

You could add fastpaths for skipping over large chunks of characters that are already in the desired casing, or implement separate codepaths for all-ascii strings. That branching can be done based on bytepatterns instead of char ranges. By exposing a character iterator you throw that optimization potential out of the window. But I wasn't trying to argue that those methods can be optimized (or even come up with concrete proposals), I am saying that we shouldn't build APIs that cannot be optimized. It shouldn't be the case that we build a new API doing the same thing as to_uppercase into std for performance, then that API is one-upped by some random crate (because somebody did find an optimization and cannot contribute it back due to API limitations) and the new std API ends up being unused.

1 Like

I really like this, something like as_display() for any char iter makes a lot of sense. It still loses the size hint if it's not specialized enough, but that may be okay.

What is most recently proposed in this thread is an exact duplicate of Add `as_lowercase` et. al BTW.

1 Like

The PR was rejected because consts are never object safe. But I'm now wandering if we can resurrect that PR by either

  • Moving the const into a separate trait (meh)
  • Making it a function instead (functions can have Self: Sized bounds)
trait Iterator {
    // ...

    // Well-behaved implementation will
    // - static_size_hint.0 <= size_hint.0
    // - static_size_hint.1.is_none() || size_hint.1 <= static_size_hint.1
    fn static_size_hint() -> (usize, Option<usize>) { (0, None) }
}

:thinking:

Well, it was already resurrected via specialization so that .flat_map(|x| [x, x]) has a useful hint.

Note that the best place to put the hint is on IntoIterator, not Iterator. Because array::IntoIter is variably-sized, but <[T;N] as IntoIterator>::WHATEVER_YOU_CALL_THE_HINT can be exactly N.

So to fix the flatten in this issue, you'd probably want to extend the current specialization to offer a (usize, Option<usize>) rather than just a usize, so that the uppercase could be 1-3 chars. Except that there's nothing to IntoIterator for it because there's only the iterator, and thus the lower-bound has to be 0, which makes it useless. (The upper bounds in size_hints are never actually used for anything, unfortunately.)