To_upper speed

I thought I'd check if there's any reason this hasn't already been done:

In conversions::to_upper we can check whether the char is ASCII and skip the binary search by adding an is_ascii() check.

    pub fn to_upper(c: char) -> [char; 3] {
        if c.is_ascii() {
            // ASCII fast path: skip the binary search entirely.
            [(c as u8).to_ascii_uppercase() as char, '\0', '\0']
        } else {
            // Existing path: binary search in the full Unicode case table.
            match bsearch_case_table(c, to_uppercase_table) {
                None        => [c, '\0', '\0'],
                Some(index) => to_uppercase_table[index].1,
            }
        }
    }

It speeds up ASCII char conversion, which seems like a very common case, at the cost of one comparison.

The reason I'm asking is that I'm seeing people learning Rust hand-rolling this kind of thing because they find that the current to_upper is not as fast as it could be. I figure maybe we should just be fast by default, unless this is in some way incorrect?

2 Likes

(This potential speedup also applies to to_lower() - they are pretty similar)

As always, some benchmarks to compare speeds are very much needed.

3 Likes

Your suggestion would speed up English but slow down processing of any other language. Instead, I suggest the case-mapping tables should be restructured into pages for each script that has two cases, and str::to_{upper,lower}case should remember which page the previous letter was on and try that one first. That would speed up all languages.
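
For concreteness, here is one way the "try the previous page first" lookup might look. This is just a minimal sketch: it uses fixed-size pages rather than true per-script pages, and PAGE_SIZE, UppercaseMapper, and the table contents are all made up.

    // Fixed-size "pages" over the sorted case table; the real proposal is
    // per-script pages, but the caching idea is the same.
    const PAGE_SIZE: usize = 64;

    static TO_UPPERCASE_TABLE: &[(char, [char; 3])] = &[
        ('a', ['A', '\0', '\0']),
        ('à', ['À', '\0', '\0']),
        ('я', ['Я', '\0', '\0']),
        // ... generated from the Unicode data in reality
    ];

    struct UppercaseMapper {
        last_page: usize, // page that matched the previous letter
    }

    impl UppercaseMapper {
        fn new() -> Self {
            UppercaseMapper { last_page: 0 }
        }

        fn to_upper(&mut self, c: char) -> [char; 3] {
            // First, try the page the previous letter was found on.
            let start = self.last_page * PAGE_SIZE;
            let end = (start + PAGE_SIZE).min(TO_UPPERCASE_TABLE.len());
            if let Some(page) = TO_UPPERCASE_TABLE.get(start..end) {
                if page.first().map_or(false, |&(lo, _)| lo <= c)
                    && page.last().map_or(false, |&(hi, _)| c <= hi)
                {
                    if let Ok(i) = page.binary_search_by(|&(k, _)| k.cmp(&c)) {
                        return page[i].1;
                    }
                }
            }
            // Otherwise fall back to the whole table and remember the page.
            match TO_UPPERCASE_TABLE.binary_search_by(|&(k, _)| k.cmp(&c)) {
                Ok(i) => {
                    self.last_page = i / PAGE_SIZE;
                    TO_UPPERCASE_TABLE[i].1
                }
                Err(_) => [c, '\0', '\0'],
            }
        }
    }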

2 Likes

The change is trivial enough that you should be able to implement it and run benchmarks on a diverse corpus.

The branch predictor should, in theory, largely mitigate that. I'm very curious what benchmarks would show.

I'd be very curious to see a benchmark for this, and even a benchmark that implements both your idea and gilescope's idea.

1 Like

Hmm, doing some criterion tests: with English it's significantly faster.

Trying with aéDžßfiᾀaé, adding that check seems to improve performance as well. I think this is because some of the underlying letters are still ASCII, with the accents applied as combining characters afterwards.

So I then tried some Korean, and that didn't show any noticeable change in performance (which is expected, given that the is_ascii check is so cheap). Then I added in some spaces and noticed a performance improvement.

Most languages use spaces to separate their words (not Thai, I think?). If most languages use the ASCII space character, then this optimisation pays off handsomely.
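
For reference, a criterion comparison can be set up along these lines. The corpus strings below are short illustrative stand-ins and the benchmark names are made up; this only measures str::to_uppercase as it currently stands, so comparing the patched to_upper means building std with the change (or copying the function out into the benchmark).

    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    fn bench_to_uppercase(c: &mut Criterion) {
        let corpora = [
            ("english", "the quick brown fox jumps over the lazy dog"),
            ("mixed", "aéDžßfiᾀaé"),
            ("korean_with_spaces", "안녕하세요 러스트 여러분"),
        ];
        for (name, text) in corpora {
            c.bench_function(name, |b| b.iter(|| black_box(text).to_uppercase()));
        }
    }

    criterion_group!(benches, bench_to_uppercase);
    criterion_main!(benches);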

10 Likes

IIRC a common test for this is something like https://ja.wikipedia.org/wiki/錆: there's often just enough markup or whatever to make ASCII optimizations worthwhile. (I've also seen this in the context of measuring whether UTF-8 is acceptable for a language that uses many code points above 255.)

3 Likes

Although that does assume the text has a significant amount of markup. I could be wrong, but I'd be surprised if people are running to_upper on whole HTML documents (outside of benchmarks and tests). I'd assume most uses are more targeted.

2 Likes

I suggest testing on a long passage of text in Cyrillic, or some other non-Latin script that actually has both upper and lower case.

7 Likes

I'm wondering if it could also be faster to split the table, so that there is one smaller char -> char table and we only look up in the char -> [char; 3] table if the first one doesn't have a hit. A sentinel can be used in the first table to let us know whether the search is done or should continue to the fallback table.
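
To illustrate the idea, here is a minimal sketch with made-up table contents, using '\0' as the sentinel that means "continue to the fallback table":

    static SINGLE_CHAR_TABLE: &[(char, char)] = &[
        ('a', 'A'),
        ('ß', '\0'), // sentinel: expands to more than one char
    ];
    static MULTI_CHAR_TABLE: &[(char, [char; 3])] = &[
        ('ß', ['S', 'S', '\0']),
    ];

    fn to_upper(c: char) -> [char; 3] {
        match SINGLE_CHAR_TABLE.binary_search_by(|&(k, _)| k.cmp(&c)) {
            Ok(i) if SINGLE_CHAR_TABLE[i].1 != '\0' => {
                [SINGLE_CHAR_TABLE[i].1, '\0', '\0']
            }
            Ok(_) => {
                // Sentinel hit: continue in the multi-char fallback table.
                match MULTI_CHAR_TABLE.binary_search_by(|&(k, _)| k.cmp(&c)) {
                    Ok(j) => MULTI_CHAR_TABLE[j].1,
                    Err(_) => [c, '\0', '\0'],
                }
            }
            // Not in either table: the char maps to itself.
            Err(_) => [c, '\0', '\0'],
        }
    }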

3 Likes

That's certainly an option @bluss. While the ASCII check is the lowest-hanging fruit, I'd argue special-casing String::to_uppercase is worthwhile, because for the single-char case it doesn't need to return a 3-element array; it could just push the char into the new String. (This is an additional speedup on top of the ASCII check.)

It should be the case that a_string.to_uppercase() performs better than a_string.chars().flat_map(|ch| ch.to_uppercase()).collect::<String>(), but at the moment it doesn't.
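
A rough sketch of what such a specialised String::to_uppercase could do (the function name is illustrative, not the actual std implementation):

    fn to_uppercase_string(s: &str) -> String {
        let mut out = String::with_capacity(s.len());
        for c in s.chars() {
            if c.is_ascii() {
                // Fast path: push the mapped char directly, with no
                // [char; 3] round trip and no extra iterator.
                out.push(c.to_ascii_uppercase());
            } else {
                // char::to_uppercase yields an iterator, so multi-char
                // expansions (e.g. ß -> SS) still work.
                out.extend(c.to_uppercase());
            }
        }
        out
    }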

2 Likes

It's also important to note that an optimization similar to the ASCII one can be applied to Cyrillic characters as well:

    fn to_cyrillic_uppercase(c: char) -> char {
        let c = c as u32;
        let res = match c {
            0x0430..=0x044F => c - 0x20, // а..я (most often used characters) -> А..Я
            0x0450..=0x045F => c - 0x50, // ѐ..џ -> Ѐ..Џ
            // pairs where the uppercase letter has the even code point
            0x0460..=0x0481 | 0x048A..=0x04BF | 0x04D0..=0x04FF => c & !1,
            0x04CF => 0x04C0, // ӏ -> Ӏ
            // pairs where the uppercase letter has the odd code point
            0x04C1..=0x04CE if c & 1 == 0 => c - 1,
            _ => c,
        };
        unsafe {
            // All results above are valid code points.
            std::char::from_u32_unchecked(res)
        }
    }

playground

(BTW: it's quite annoying that character codes are not selected to simplify such conversions...)

I guess this can be done for most European languages as well. Maybe it's worth talking about a more general compression of the conversion table instead of only special-casing ASCII?

3 Likes

I was just thinking about that possibility myself. There are only about 1400 Unicode characters that are changed by case mapping, and UTF-8 does a decent job of grouping them for us: outside the ASCII range, the UTF-8 encoding of a character changed by to_lower always starts with one of these sequences:

    c3
    c4
    c5
    c6
    c7
    c8
    c9
    cd
    ce
    cf
    d0
    d1
    d2
    d3
    d4
    d5
    e1 82
    e1 83
    e1 8e
    e1 8f
    e1 b2
    e1 b8
    e1 b9
    e1 ba
    e1 bb
    e1 bc
    e1 bd
    e1 be
    e1 bf
    e2 84
    e2 85
    e2 86
    e2 92
    e2 93
    e2 b0
    e2 b1
    e2 b2
    e2 b3
    ea 99
    ea 9a
    ea 9c
    ea 9d
    ea 9e
    ea 9f
    ef bc
    f0 90 90
    f0 90 92
    f0 90 93
    f0 90 b2
    f0 91 a2
    f0 96 b9
    f0 9e a4

This suggests a trie-like data structure, working on the encoded representation rather than on abstract characters. I don't have time this week to tinker with one, unfortunately.
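
As a very rough illustration, the first level of such a trie could be a 256-entry table indexed by the UTF-8 lead byte, derived from the list above, so characters whose lead byte can never start a case-changing sequence skip any further lookup. The entries below are abbreviated and the lower trie levels are left out; LEAD_BYTE_MAY_CHANGE and may_change_case are made-up names.

    static LEAD_BYTE_MAY_CHANGE: [bool; 256] = {
        let mut t = [false; 256];
        t[0x41] = true; // 'A' (the other ASCII letters would be set too)
        t[0xc3] = true;
        t[0xc4] = true;
        // ... the remaining lead bytes from the list above
        t[0xe1] = true;
        t[0xe2] = true;
        t[0xea] = true;
        t[0xef] = true;
        t[0xf0] = true;
        t
    };

    fn may_change_case(c: char) -> bool {
        let mut buf = [0u8; 4];
        let lead = c.encode_utf8(&mut buf).as_bytes()[0];
        LEAD_BYTE_MAY_CHANGE[lead as usize]
    }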

5 Likes

If anyone wants to try this, there may be some useful prior art in @raphlinus's rust-lang/rust#33098, and @burntsushi's ucd-generate crate which it inspired.

1 Like

I think perfect hash functions could be a better fit here. Using the phf crate, it should be quite easy to create a mapping for the single-char cases (1383 out of 1485 entries for UPPERCASE_TABLE and 1392 out of 1393 for LOWERCASE_TABLE).
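
A sketch of what that could look like, assuming the phf crate with its "macros" feature; the entries below are abbreviated placeholders, and in practice the map would be generated from the Unicode data:

    use phf::phf_map;

    static UPPER_SINGLE: phf::Map<char, char> = phf_map! {
        'a' => 'A',
        'é' => 'É',
        'я' => 'Я',
        // ... the remaining single-char mappings
    };

    fn to_upper(c: char) -> [char; 3] {
        if let Some(&up) = UPPER_SINGLE.get(&c) {
            [up, '\0', '\0']
        } else {
            // Fall back to the (much smaller) multi-char table, or identity.
            [c, '\0', '\0']
        }
    }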

1 Like

Another possibility for packing the table: have different tables for < 2^8, < 2^16, and < 2^32. That gives an ASCII special case organically, while also being good for saving space in the BMP.

I have some experiments in that direction in https://github.com/scottmcm/uax/blob/master/src/property_tables/word_break.rs.
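
A minimal sketch of that split, handling only the simple single-char mappings; the table names and contents are made up, and real tables would be generated from the Unicode data. The u8 table gives the ASCII/Latin-1 fast path, the u16 table covers the BMP at half the width, and only astral-plane characters need full u32 entries.

    static TABLE_SMALL: &[(u8, u8)] = &[(b'a', b'A')];
    static TABLE_BMP: &[(u16, u16)] = &[(0x0430, 0x0410)]; // а -> А
    static TABLE_ASTRAL: &[(u32, u32)] = &[(0x10428, 0x10400)]; // 𐐨 -> 𐐀

    fn to_upper_simple(c: char) -> char {
        let cp = c as u32;
        let mapped = if cp < 0x100 {
            TABLE_SMALL
                .binary_search_by_key(&(cp as u8), |&(k, _)| k)
                .map(|i| TABLE_SMALL[i].1 as u32)
        } else if cp < 0x1_0000 {
            TABLE_BMP
                .binary_search_by_key(&(cp as u16), |&(k, _)| k)
                .map(|i| TABLE_BMP[i].1 as u32)
        } else {
            TABLE_ASTRAL
                .binary_search_by_key(&cp, |&(k, _)| k)
                .map(|i| TABLE_ASTRAL[i].1)
        };
        mapped.ok().and_then(char::from_u32).unwrap_or(c)
    }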

1 Like

Good idea! But we should not forget that the list of characters with codes smaller than 2^8 also includes µ and ÿ, for which to_ascii_uppercase() will not work.

Isn't µ's code point U+03BC, which is > 2^8?

I think that's U+00B5 (the micro sign), which is < 2^8. It turns out exactly half of the characters with code points < 2^8 are not ASCII characters:

Playground link
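
For instance, something like this reproduces the count (not necessarily what the linked playground does):

    fn main() {
        let non_ascii = (0u32..0x100)
            .filter_map(char::from_u32)
            .filter(|c| !c.is_ascii())
            .count();
        // Prints 128: exactly half of the 256 code points below 2^8.
        println!("{non_ascii}");
    }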

...which makes sense, because the ASCII standard uses 7-bit integers.

1 Like

Here is the initial PR on the subject: Add a check for ASCII characters in to_upper and to_lower by mcastorina · Pull Request #81358 · rust-lang/rust · GitHub (It's their first rustc contribution - yay!).

One thing that's not been mentioned is that, given a choice between speed and memory usage, the preference so far has been a small memory footprint. If we can use less space and be faster, that would be super!

6 Likes