I thought I'd check if there's any reason this hasn't already been done:
In conversions::to_upper we can skip the binary search for ASCII characters by adding an is_ascii() check:
pub fn to_upper(c: char) -> [char; 3] {
    if c.is_ascii() {
        [(c as u8).to_ascii_uppercase() as char, '\0', '\0']
    } else {
        match bsearch_case_table(c, to_uppercase_table) {
            None => [c, '\0', '\0'],
            Some(index) => to_uppercase_table[index].1,
        }
    }
}
It speeds up ASCII char conversion, which seems like a very common case, for the cost of one comparison.
The reason I'm asking is that I'm seeing people learning Rust code this kind of thing in themselves because they find that the current to_upper is not as fast as it could be. I figure maybe we should just be fast by default, unless this is in some way incorrect?
Your suggestion would speed up English but slow down processing any other language. Instead I suggest the casemapping tables should be restructured into pages for each script with two cases, and str::to_{upper,lower}case should remember which page the previous letter was on, and try that one first. That would speed up all languages.
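Something like the following is the shape I have in mind - a rough sketch with made-up page data, not the real tables:

struct CasePage {
    first: u32,                   // first code point covered by this page
    last: u32,                    // last code point covered by this page
    map: &'static [(char, char)], // sorted (lower, upper) pairs for this page
}

static PAGES: &[CasePage] = &[
    CasePage { first: 0x0061, last: 0x007A, map: &[('a', 'A'), ('b', 'B'), ('c', 'C')] },
    CasePage { first: 0x03B1, last: 0x03C9, map: &[('α', 'Α'), ('β', 'Β'), ('γ', 'Γ')] },
];

struct Uppercaser {
    last_page: usize, // page that matched the previous character
}

impl Uppercaser {
    fn to_upper(&mut self, c: char) -> char {
        let cp = c as u32;
        // Try the page remembered from the previous character first,
        // then fall back to scanning all pages.
        let page = if (PAGES[self.last_page].first..=PAGES[self.last_page].last).contains(&cp) {
            self.last_page
        } else {
            match PAGES.iter().position(|p| (p.first..=p.last).contains(&cp)) {
                Some(i) => {
                    self.last_page = i;
                    i
                }
                None => return c, // no page covers this character
            }
        };
        match PAGES[page].map.binary_search_by_key(&c, |&(lower, _)| lower) {
            Ok(i) => PAGES[page].map[i].1,
            Err(_) => c,
        }
    }
}

For text that stays within one script, the common case becomes a range check on the remembered page plus a binary search over a much smaller table.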
Hmm, doing some criterion tests, with English it's significantly faster.
Trying with aéDžßfiᾀaé, adding that check seems to improve performance as well. I think this is because some of the underlying letters are still ASCII, with the accents applied after them.
So I then tried some Korean and that didn't show any noticeable change in performance (which is expected, given that the is_ascii check is so cheap).
Then I added in some spaces and noticed a performance improvement.
Most languages use spaces to separate their words (not Thai, I think?). If most languages use the ASCII space character for space, then this optimisation pays off handsomely.
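For reference, the comparisons I'm running are roughly this shape; the exact inputs and names here are placeholders:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_to_uppercase(c: &mut Criterion) {
    let english = "The quick brown fox jumps over the lazy dog";
    let mixed = "aéDžßfiᾀaé";
    c.bench_function("to_uppercase english", |b| {
        b.iter(|| black_box(english).to_uppercase())
    });
    c.bench_function("to_uppercase mixed", |b| {
        b.iter(|| black_box(mixed).to_uppercase())
    });
}

criterion_group!(benches, bench_to_uppercase);
criterion_main!(benches);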
IIRC a common test for this is something like https://ja.wikipedia.org/wiki/錆, and there's often just enough markup or whatever to make ASCII optimizations worthwhile. (I've also seen this in the context of measuring whether UTF-8 is acceptable for a language that uses many codepoints above 255.)
Although that does assume the text has a significant amount of markup. I could be wrong, but I'd be surprised if people are running to_upper on whole HTML documents (outside of benchmarks and tests). I'd assume most uses are more targeted.
I'm wondering if it could also be faster to split the table so that there is one smaller char -> char table, and we only look up in the char -> [char; 3] table if the first one doesn't have a hit. A sentinel can be used in the first table to let us know whether the search is done or whether it should continue to the fallback table.
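Roughly something like this (toy data; '\0' works as the sentinel since it already serves as padding in the existing table):

const SENTINEL: char = '\0';

// Sorted single-char mappings; SENTINEL means "the expansion is in MULTI".
static SINGLE: &[(char, char)] = &[
    ('µ', 'Μ'),
    ('ß', SENTINEL), // expands to "SS"
    ('à', 'À'),
];

// Fallback table for the few characters that expand to more than one char.
static MULTI: &[(char, [char; 3])] = &[
    ('ß', ['S', 'S', '\0']),
];

fn to_upper_two_tables(c: char) -> [char; 3] {
    match SINGLE.binary_search_by_key(&c, |&(k, _)| k) {
        Err(_) => [c, '\0', '\0'],
        Ok(i) => match SINGLE[i].1 {
            SENTINEL => MULTI[MULTI.binary_search_by_key(&c, |&(k, _)| k).unwrap()].1,
            upper => [upper, '\0', '\0'],
        },
    }
}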
That's certainly an option @bluss. While the ASCII check is the lowest-hanging fruit, I'd argue special-casing String::to_uppercase is worthwhile because for the single-char case it doesn't need to return a 3-element array - it could just push the char into the new String. (This is an additional speedup on top of the ASCII check.)
It should be that a_string.to_uppercase() has better performance than a_string.chars().flat_map(|ch| ch.to_uppercase()).collect::<String>(), but at the moment that's not the case.
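For the String case, the fast path I have in mind looks roughly like this (using the public char methods rather than the internal tables):

fn to_uppercase_fast_path(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii() {
            // single byte in, single byte out - no [char; 3] round trip
            out.push(c.to_ascii_uppercase());
        } else {
            // full Unicode mapping, which may yield more than one char
            out.extend(c.to_uppercase());
        }
    }
    out
}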
It's also important to note that an optimization similar to the ASCII one can be applied to Cyrillic characters as well:
fn to_cyrillic_uppercase(c: char) -> char {
    let c = c as u32;
    let res = match c {
        0x0430..=0x044F => c - 0x20, // most often used characters: а..я -> А..Я
        0x0450..=0x045F => c - 0x50, // ѐ..џ -> Ѐ..Џ
        0x0460..=0x0481 | 0x048A..=0x04BF | 0x04D0..=0x04FF => c & !1, // paired letters: clear the low bit
        0x04CF => 0x04C0, // ӏ -> Ӏ
        0x04C1..=0x04CE if c & 1 == 0 => c - 1, // even code points are the lowercase of each pair
        _ => c,
    };
    unsafe {
        std::char::from_u32_unchecked(res)
    }
}
(BTW: it's quite annoying that character codes are not selected to simplify such conversions...)
I guess this can be done for most European languages as well. Maybe it's worth talking about a more general compression of the conversion table instead of only special-casing ASCII?
I was just thinking about that possibility myself. There are only about 1400 Unicode characters that are changed by case mapping, and UTF-8 does a decent job of grouping them for us: outside the ASCII range, the UTF-8 encoding of a character changed by to_lower always starts with one of these sequences:
c3
c4
c5
c6
c7
c8
c9
cd
ce
cf
d0
d1
d2
d3
d4
d5
e1 82
e1 83
e1 8e
e1 8f
e1 b2
e1 b8
e1 b9
e1 ba
e1 bb
e1 bc
e1 bd
e1 be
e1 bf
e2 84
e2 85
e2 86
e2 92
e2 93
e2 b0
e2 b1
e2 b2
e2 b3
ea 99
ea 9a
ea 9c
ea 9d
ea 9e
ea 9f
ef bc
f0 90 90
f0 90 92
f0 90 93
f0 90 b2
f0 91 a2
f0 96 b9
f0 9e a4
This suggests a trie-like data structure, working on the encoded representation rather than on abstract characters. I don't have time this week to tinker with one, unfortunately.
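Just to sketch the shape (toy data, only the c3 and ce prefixes with a few hand-picked entries; the longer prefixes such as e1 bc would need another level):

// Per-prefix tables keyed by the second UTF-8 byte, sorted for binary search.
static PREFIX_C3: &[(u8, char)] = &[(0xA0, 'À'), (0xA9, 'É')]; // à, é
static PREFIX_CE: &[(u8, char)] = &[(0xB1, 'Α'), (0xB2, 'Β')]; // α, β

fn to_upper_by_utf8_prefix(c: char) -> char {
    let mut buf = [0u8; 4];
    let bytes = c.encode_utf8(&mut buf).as_bytes();
    let table = match bytes[0] {
        0xC3 => PREFIX_C3,
        0xCE => PREFIX_CE,
        _ => return c, // ASCII and unhandled prefixes fall through unchanged
    };
    match table.binary_search_by_key(&bytes[1], |&(b, _)| b) {
        Ok(i) => table[i].1,
        Err(_) => c,
    }
}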
I think perfect hash functions could be a better fit here. Using the phf crate it should be quite easy to create a mapping for single char cases (1383 out of 1485 for UPPERCASE_TABLE and 1392 out of 1393 for LOWERCASE_TABLE).
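Something like this, assuming the phf crate with its macros feature, and with just a couple of hand-written entries (the real map would be generated from the Unicode tables, and multi-char expansions like ß would still need a fallback):

use phf::phf_map;

// Single-char uppercase mappings only.
static SINGLE_UPPER: phf::Map<char, char> = phf_map! {
    'à' => 'À',
    'µ' => 'Μ', // U+039C GREEK CAPITAL LETTER MU
    'ÿ' => 'Ÿ',
};

fn to_upper_phf(c: char) -> char {
    *SINGLE_UPPER.get(&c).unwrap_or(&c)
}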
Another possibility for packing the table: have different tables for <2^8, <2^16, and <2^32. That gives an ASCII special case organically, while also being good for saving space in the BMP.
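Roughly like this, with a handful of made-up entries:

// Keys stored in the narrowest integer that fits their range.
static UPPER_1B: &[(u8, u16)] = &[(0xE0, 0x00C0), (0xE9, 0x00C9)];      // à, é
static UPPER_2B: &[(u16, u16)] = &[(0x03B1, 0x0391), (0x0430, 0x0410)]; // α, а
static UPPER_4B: &[(u32, u32)] = &[(0x10428, 0x10400)];                 // Deseret 𐐨

fn to_upper_by_width(c: char) -> char {
    let cp = c as u32;
    let up = if c.is_ascii() {
        return c.to_ascii_uppercase(); // the "organic" ASCII special case
    } else if cp < 1 << 8 {
        UPPER_1B
            .binary_search_by_key(&(cp as u8), |&(k, _)| k)
            .map(|i| UPPER_1B[i].1 as u32)
    } else if cp < 1 << 16 {
        UPPER_2B
            .binary_search_by_key(&(cp as u16), |&(k, _)| k)
            .map(|i| UPPER_2B[i].1 as u32)
    } else {
        UPPER_4B
            .binary_search_by_key(&cp, |&(k, _)| k)
            .map(|i| UPPER_4B[i].1)
    };
    up.ok().and_then(char::from_u32).unwrap_or(c)
}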
Good idea! But we should not forget that the list of characters with codes smaller than 2^8 also includes µ and ÿ, for which to_ascii_uppercase() will not work.
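For example:

fn main() {
    // µ (U+00B5) and ÿ (U+00FF) are below 2^8 but are not ASCII, so the
    // ASCII path leaves them unchanged while the full mapping does not.
    assert_eq!('µ'.to_ascii_uppercase(), 'µ');
    assert_eq!('µ'.to_uppercase().next(), Some('Μ')); // U+039C GREEK CAPITAL MU
    assert_eq!('ÿ'.to_ascii_uppercase(), 'ÿ');
    assert_eq!('ÿ'.to_uppercase().next(), Some('Ÿ')); // U+0178
}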
One thing that hasn't been mentioned: given a choice between speed and memory usage, the preference so far has been to keep a small memory footprint. If we can use less space and be faster, that would be super!