Why's char not an utf8mb4?

Soni · May 13, 2021, 4:55pm

char is, unfortunately, utf-32.

this means you need to use str.matches(|_| true) for performance, otherwise you're wasting cycles converting utf-8 to utf-32 (and potentially back to utf-8 again). it also means you may have unnecessary length checks going on, too. (altho if the optimizer is doing its job you may not actually have to deal with those. you may even be able to skip the utf-8 to utf-32 decoding as part of the str.matches(|_| true) as the char is not being used.)

bjorn3 · May 13, 2021, 4:58pm

char is not a DST and as such must be of a fixed size. Utf-8 uses a variable size for characters. Turning char into a DST would make it much harder to use by for example preventing you from storing it in a local variable or inside a struct.

Soni · May 13, 2021, 5:05pm

utf8char would be an [u8; 4], not a DST. the length is implicitly encoded in the utf-8.

mbrubeck · May 13, 2021, 5:34pm

The implementation of Pattern for char stores the UTF-8 encoding of the char needle into a [u8; 4] and uses this for searching within the UTF-8 haystack. So cases like this do not require any UTF-8 to UTF-32 decoding:

str.matches('\n').count()

But cases like this do require decoding each character of str from UTF-8 to UTF-32 in order to pass it to the provided fn(char) -> bool:

str.matches(char::is_numeric).count()

If char were internally represented as UTF-8 bytes stored in a [u8; 4] (with padding as necessary), then the performance of the latter case could be better, because it could save the cycles spent on UTF-8 decoding. (However, it would also impact the implementation of char::is_numeric.) This could also improve the performance of some string iterators, String::from_iterator(chars), etc.

Is this an accurate summary of the problem?

The encode_unicode crate includes a UTF-8 character type. It would be interesting to implement the (unstable) Pattern trait for types like fn(encode_unicode::Utf8Char) -> bool and do some benchmark comparisons.

Soni · May 13, 2021, 5:39pm

Pretty much, yeah.

Note that e.g. is_ascii_digit would still be fast. A lot of cases would remain fast. UTF-8 is also lexically ordered, which helps even more cases remain fast.

Our main motivation is e.g. this:

/// Converts the given reason and content to a CW2.
///
/// A CW2 is a string of `[CW (reason)](content)`.
pub fn to_cw2(reason: &str, content: &str) -> String {
    let expected_size = reason.len() + content.len() + "[CW ]".len();
    let mut sb = String::with_capacity(expected_size);
    let mut count: isize = 0;
    let iter = reason.matches(|_| true);
    iter.for_each(|c| {
        match c {
            "\x10" => {
                sb += "\x10\x10";
            },
            "[" => {
                count += 1;
                sb += c;
            },
            "]" => {
                count -= 1;
                if count < 0 {
                    sb += "\x10";
                    count = 0;
                }
                sb += c;
            },
            c => {
                sb += c;
            },
        }
    });
    if count > 0 {
        let size = sb.len() + usize::try_from(count).unwrap();
        let mut sb2 = String::with_capacity(size);
        for c in sb.matches(|_| true).rev() {
            match c {
                "[" if count > 0 => {
                    count -= 1;
                    sb2 += c;
                    sb2 += "\x10";
                },
                c => {
                    sb2 += c;
                },
            }
        }
        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.matches(|_| true).rev());
    }
    format!("[CW {}]{}", sb, content)
}

mbrubeck · May 13, 2021, 6:13pm

Comparisons would be more expensive than current char comparisons, though, at least on little-endian architectures (example). This would impact unicode predicates like is_numeric that are based on binary search. (Converting them to use tries might help; I'm not sure what the net impact would be.)

Have you benchmarked this code versus sb.chars()? In my own quick-and-dirty benchmark, this version using sb.chars() and sb.push(c) is actually slightly faster. It seems that the losses from UTF-8 decoding are offset by the gains from faster equality testing.

Of course, it's still possible that a UTF-8 char type would provide the best of both worlds, with iteration and equality testing both as fast as possible.

Soni · May 13, 2021, 6:40pm

Would using little-endian utf8mb4 help? It should still be faster than the current impl: you'd just be shifting whole-bytes into an u32 register, no masking needed, and shoving it back into the string should also be faster than the current impl?

We guess that does make sense, what with the extra cost of using pointers and string comparison. Altho it'd be interesting to see what happens if you change the char reversal from

        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.matches(|_| true).rev());

to

        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.chars().rev());

(unsure how you're benchmarking this. we actually don't know the correct way to benchmark this stuff.)

mbrubeck · May 13, 2021, 6:52pm

I created about 10,000 strings that are a few different variations on "[foo][bar][baz][qaz][foo][bar][baz][qaz]", and timed how long it takes to run to_cw(reason, "test") on every string in the array. On my test input, the matches version takes about 6.0ms to process 10,000 strings, while the chars version takes about 5.1ms.

Soni · May 14, 2021, 1:28am

Hmm...

While strongly not recommended (for obvious reasons), would it be any faster if one did

if c.len() < 1 {
  unsafe { std::hint::unreachable_unchecked(); }
}

in the relevant loops?

If so, would putting this in the relevant Pattern impls improve anything?

comex · May 14, 2021, 2:16am

It's not clear to me that producing a sequence of [u8; 4] values would actually be faster than producing a sequence of u32s. Either way you still have to calculate the offsets in the string corresponding to the beginning of UTF-8 characters.

If you want this loop to be fast, you would be better off iterating byte-by-byte instead of character-by-character. Since none of the characters your algorithm cares about – [, ], and 0x10 – are multi-byte in UTF-8, the algorithm would look pretty much the same. You could append to a Vec<u8>, then use String::from_utf8_unchecked at the end.

zackw · May 14, 2021, 6:08pm

Apart from everything else, the standards' limitation of UTF-8 to the same range as UTF-16 was an error. At the present rate of allocation, that limitation will have to be lifted in a little less than 600 years (the Unicode Consortium could buy themselves another century or so by reclaiming the private use areas, but I imagine that would be just as controversial if not more so).

With char having the same underlying representation as u32, the only thing that needs to change in Rust when this happens is the UTF-8 codec and validator. If char had the same representation as [u8;4], not so.

Soni · May 14, 2021, 6:36pm

Fortunately that's never gonna happen - lots of UTF-8 machinery today throws on >= 0x10FFFF, and UTF-8 is all about backwards compatibility.

If for some reason that does actually happen, a new Rust version (aka the dreaded Rust 2.0) may be in order.

Besides, it is unclear from existing docs whether or not char is actually UTF-32 or some other u32-compatible unspecified internal representation. So it should always be possible to change it or change it back as needed - as long as no new guarantees are made about it.

jdahlstrom · May 14, 2021, 7:57pm

char is defined to be a Unicode scalar value, which means it is also UTF-32, because UTF-32 is simply an identity mapping to the set of Unicode scalar values.

The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’, which is similar to, but not the same as, a ‘Unicode code point’.`

chrisd · May 14, 2021, 7:58pm

A code point isn't a user-perceived character. There are already characters made up of multiple code points. If in 500 years we find ourselves running low we can use the century's worth of remaining code points to help encode an infinite number of characters.

quinedot · May 14, 2021, 8:08pm

I believe they meant that as long as it is still a scalar value, and other guarantees (implements Copy) or pseudo-guarantees (size) are upheld, the exact representation could change. It's come up in other discussions too.

Or to put it mathily, the identity function is not the only bijection.

jdahlstrom · May 14, 2021, 8:25pm

Given that the documentation uses the term of art "Unicode scalar value" (not just any scalar value) and links to its definition, I at least would be extremely surprised if transmuting a char to u32 would yield anything else than the char's Unicode code.

zackw · May 15, 2021, 6:40pm

You may be right, but I would not consider it fortunate; rather the opposite.

Only because the standard says they have to. If the standard changed, we could just take all that code out, and the only other changes required would be to code that (now incorrectly) thinks UTF-8 sequences cannot be longer than 4 bytes, and/or doesn't know how to encode or decode five- and six-byte sequences.

All of the other alternatives would involve more changes to more code and would introduce more permanent sources of bugs; @chrisd's suggestion, for instance, would mean that UTF-8 was now saddled with all of the baggage of UTF-16 plus worse encoding density.

Soni · May 15, 2021, 7:46pm

It's okay, whatever happens happens. It's not a good excuse to avoid playing with performance-critical code tho. Performance-critical code is worth playing with, in the off-chance that you can improve performance.

char's representation is an implementation detail and not FFI-safe. If this is a problem, don't use char. Playing with it can also help improve the performance of other things.

system · August 13, 2021, 7:46pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Pre-RFC - allow usage of str::utf8_char_width libs	2	650	March 8, 2022
Wild idea: deprecating APIs that conflate str and [u8] libs	59	3572	November 12, 2020
Strings and UTF-8 language design	21	7521	March 25, 2019
Str.to_char() for full strings libs	6	2411	March 25, 2019
Pre-RFC: Add len_utf8_at method to str libs	8	1005	December 20, 2020

Why's char not an utf8mb4?

Related topics