Why's char not an utf8mb4?

char is, unfortunately, utf-32.

this means you need to use str.matches(|_| true) for performance, otherwise you're wasting cycles converting utf-8 to utf-32 (and potentially back to utf-8 again). it also means you may have unnecessary length checks going on, too. (altho if the optimizer is doing its job you may not actually have to deal with those. you may even be able to skip the utf-8 to utf-32 decoding as part of the str.matches(|_| true) as the char is not being used.)

1 Like

char is not a DST and as such must be of a fixed size. Utf-8 uses a variable size for characters. Turning char into a DST would make it much harder to use by for example preventing you from storing it in a local variable or inside a struct.

2 Likes

utf8char would be an [u8; 4], not a DST. the length is implicitly encoded in the utf-8.

The implementation of Pattern for char stores the UTF-8 encoding of the char needle into a [u8; 4] and uses this for searching within the UTF-8 haystack. So cases like this do not require any UTF-8 to UTF-32 decoding:

str.matches('\n').count()

But cases like this do require decoding each character of str from UTF-8 to UTF-32 in order to pass it to the provided fn(char) -> bool:

str.matches(char::is_numeric).count()

If char were internally represented as UTF-8 bytes stored in a [u8; 4] (with padding as necessary), then the performance of the latter case could be better, because it could save the cycles spent on UTF-8 decoding. (However, it would also impact the implementation of char::is_numeric.) This could also improve the performance of some string iterators, String::from_iterator(chars), etc.

Is this an accurate summary of the problem?

The encode_unicode crate includes a UTF-8 character type. It would be interesting to implement the (unstable) Pattern trait for types like fn(encode_unicode::Utf8Char) -> bool and do some benchmark comparisons.

9 Likes

Pretty much, yeah.

Note that e.g. is_ascii_digit would still be fast. A lot of cases would remain fast. UTF-8 is also lexically ordered, which helps even more cases remain fast.

Our main motivation is e.g. this:

/// Converts the given reason and content to a CW2.
///
/// A CW2 is a string of `[CW (reason)](content)`.
pub fn to_cw2(reason: &str, content: &str) -> String {
    let expected_size = reason.len() + content.len() + "[CW ]".len();
    let mut sb = String::with_capacity(expected_size);
    let mut count: isize = 0;
    let iter = reason.matches(|_| true);
    iter.for_each(|c| {
        match c {
            "\x10" => {
                sb += "\x10\x10";
            },
            "[" => {
                count += 1;
                sb += c;
            },
            "]" => {
                count -= 1;
                if count < 0 {
                    sb += "\x10";
                    count = 0;
                }
                sb += c;
            },
            c => {
                sb += c;
            },
        }
    });
    if count > 0 {
        let size = sb.len() + usize::try_from(count).unwrap();
        let mut sb2 = String::with_capacity(size);
        for c in sb.matches(|_| true).rev() {
            match c {
                "[" if count > 0 => {
                    count -= 1;
                    sb2 += c;
                    sb2 += "\x10";
                },
                c => {
                    sb2 += c;
                },
            }
        }
        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.matches(|_| true).rev());
    }
    format!("[CW {}]{}", sb, content)
}

Comparisons would be more expensive than current char comparisons, though, at least on little-endian architectures (example). This would impact unicode predicates like is_numeric that are based on binary search. (Converting them to use tries might help; I'm not sure what the net impact would be.)

Have you benchmarked this code versus sb.chars()? In my own quick-and-dirty benchmark, this version using sb.chars() and sb.push(c) is actually slightly faster. It seems that the losses from UTF-8 decoding are offset by the gains from faster equality testing.

Of course, it's still possible that a UTF-8 char type would provide the best of both worlds, with iteration and equality testing both as fast as possible.

Would using little-endian utf8mb4 help? It should still be faster than the current impl: you'd just be shifting whole-bytes into an u32 register, no masking needed, and shoving it back into the string should also be faster than the current impl?

We guess that does make sense, what with the extra cost of using pointers and string comparison. Altho it'd be interesting to see what happens if you change the char reversal from

        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.matches(|_| true).rev());

to

        sb = String::with_capacity(sb2.len());
        sb.extend(sb2.chars().rev());

(unsure how you're benchmarking this. we actually don't know the correct way to benchmark this stuff.)

I created about 10,000 strings that are a few different variations on "[foo][bar][baz][qaz][foo][bar][baz][qaz]", and timed how long it takes to run to_cw(reason, "test") on every string in the array. On my test input, the matches version takes about 6.0ms to process 10,000 strings, while the chars version takes about 5.1ms.

1 Like

Hmm...

While strongly not recommended (for obvious reasons), would it be any faster if one did

if c.len() < 1 {
  unsafe { std::hint::unreachable_unchecked(); }
}

in the relevant loops?

If so, would putting this in the relevant Pattern impls improve anything?

It's not clear to me that producing a sequence of [u8; 4] values would actually be faster than producing a sequence of u32s. Either way you still have to calculate the offsets in the string corresponding to the beginning of UTF-8 characters.

If you want this loop to be fast, you would be better off iterating byte-by-byte instead of character-by-character. Since none of the characters your algorithm cares about – [, ], and 0x10 – are multi-byte in UTF-8, the algorithm would look pretty much the same. You could append to a Vec<u8>, then use String::from_utf8_unchecked at the end.

10 Likes

Apart from everything else, the standards' limitation of UTF-8 to the same range as UTF-16 was an error. At the present rate of allocation, that limitation will have to be lifted in a little less than 600 years (the Unicode Consortium could buy themselves another century or so by reclaiming the private use areas, but I imagine that would be just as controversial if not more so).

With char having the same underlying representation as u32, the only thing that needs to change in Rust when this happens is the UTF-8 codec and validator. If char had the same representation as [u8;4], not so.

Fortunately that's never gonna happen - lots of UTF-8 machinery today throws on >= 0x10FFFF, and UTF-8 is all about backwards compatibility.

If for some reason that does actually happen, a new Rust version (aka the dreaded Rust 2.0) may be in order.

Besides, it is unclear from existing docs whether or not char is actually UTF-32 or some other u32-compatible unspecified internal representation. So it should always be possible to change it or change it back as needed - as long as no new guarantees are made about it.

char is defined to be a Unicode scalar value, which means it is also UTF-32, because UTF-32 is simply an identity mapping to the set of Unicode scalar values.

The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’, which is similar to, but not the same as, a ‘Unicode code point’.`

4 Likes

A code point isn't a user-perceived character. There are already characters made up of multiple code points. If in 500 years we find ourselves running low we can use the century's worth of remaining code points to help encode an infinite number of characters.

4 Likes

I believe they meant that as long as it is still a scalar value, and other guarantees (implements Copy) or pseudo-guarantees (size) are upheld, the exact representation could change. It's come up in other discussions too.

Or to put it mathily, the identity function is not the only bijection.

Given that the documentation uses the term of art "Unicode scalar value" (not just any scalar value) and links to its definition, I at least would be extremely surprised if transmuting a char to u32 would yield anything else than the char's Unicode code.

1 Like

You may be right, but I would not consider it fortunate; rather the opposite.

Only because the standard says they have to. If the standard changed, we could just take all that code out, and the only other changes required would be to code that (now incorrectly) thinks UTF-8 sequences cannot be longer than 4 bytes, and/or doesn't know how to encode or decode five- and six-byte sequences.

All of the other alternatives would involve more changes to more code and would introduce more permanent sources of bugs; @chrisd's suggestion, for instance, would mean that UTF-8 was now saddled with all of the baggage of UTF-16 plus worse encoding density.

It's okay, whatever happens happens. It's not a good excuse to avoid playing with performance-critical code tho. Performance-critical code is worth playing with, in the off-chance that you can improve performance.

char's representation is an implementation detail and not FFI-safe. If this is a problem, don't use char. Playing with it can also help improve the performance of other things.