I thought I'd check if there's any reason this hasn't already been done:
In conversions::to_upper we can skip the binary search for ASCII characters by adding an is_ascii() check:
pub fn to_upper(c: char) -> [char; 3] {
    if c.is_ascii() {
        [(c as u8).to_ascii_uppercase() as char, '\0', '\0']
    } else {
        match bsearch_case_table(c, to_uppercase_table) {
            None => [c, '\0', '\0'],
            Some(index) => to_uppercase_table[index].1,
        }
    }
}
It speeds up ASCII char conversion, which seems like a very common case, for the cost of one comparison.
The reason I'm asking is that I'm seeing people learning Rust code this kind of thing in themselves because they find that the current to_upper is not as fast as it could be. I figure maybe we should just be fast by default, unless this is in some way incorrect?
Your suggestion would speed up English but slow down processing any other language. Instead I suggest the casemapping tables should be restructured into pages for each script with two cases, and str::to_{upper,lower}case should remember which page the previous letter was on, and try that one first. That would speed up all languages.
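Something like the following is the shape I have in mind - a rough sketch with made-up page data, not the real tables:

struct CasePage {
    first: u32,                   // first code point covered by this page
    last: u32,                    // last code point covered by this page
    map: &'static [(char, char)], // sorted (lower, upper) pairs for this page
}

static PAGES: &[CasePage] = &[
    CasePage { first: 0x0061, last: 0x007A, map: &[('a', 'A'), ('b', 'B'), ('c', 'C')] },
    CasePage { first: 0x03B1, last: 0x03C9, map: &[('α', 'Α'), ('β', 'Β'), ('γ', 'Γ')] },
];

struct Uppercaser {
    last_page: usize, // page that matched the previous character
}

impl Uppercaser {
    fn to_upper(&mut self, c: char) -> char {
        let cp = c as u32;
        // Try the page remembered from the previous character first,
        // then fall back to scanning all pages.
        let page = if (PAGES[self.last_page].first..=PAGES[self.last_page].last).contains(&cp) {
            self.last_page
        } else {
            match PAGES.iter().position(|p| (p.first..=p.last).contains(&cp)) {
                Some(i) => {
                    self.last_page = i;
                    i
                }
                None => return c, // no page covers this character
            }
        };
        match PAGES[page].map.binary_search_by_key(&c, |&(lower, _)| lower) {
            Ok(i) => PAGES[page].map[i].1,
            Err(_) => c,
        }
    }
}

For text that stays within one script, the common case becomes a range check on the remembered page plus a binary search over a much smaller table.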
Hmm, doing some criterion tests, with English it's significantly faster.
Trying with aéDžßfiᾀaé, adding that check seems to improve performance as well. I think this is because some of the underlying letters are still ASCII, with the accents applied after them.
So I then tried some Korean and that didn't show any noticeable change in performance (which is expected, given that the is_ascii check is so cheap).
Then I added in some spaces and noticed a performance improvement.
Most languages use spaces to separate their words (not Thai, I think?). If most languages use the ASCII space character for space, then this optimisation pays off handsomely.
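For reference, the comparisons I'm running are roughly this shape; the exact inputs and names here are placeholders:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_to_uppercase(c: &mut Criterion) {
    let english = "The quick brown fox jumps over the lazy dog";
    let mixed = "aéDžßfiᾀaé";
    c.bench_function("to_uppercase english", |b| {
        b.iter(|| black_box(english).to_uppercase())
    });
    c.bench_function("to_uppercase mixed", |b| {
        b.iter(|| black_box(mixed).to_uppercase())
    });
}

criterion_group!(benches, bench_to_uppercase);
criterion_main!(benches);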
IIRC a common test for this is something like https://ja.wikipedia.org/wiki/錆, and there's often just enough markup or whatever to make ASCII optimizations worthwhile. (I've also seen this in the context of measuring whether UTF-8 is acceptable for a language that uses many codepoints above 255.)
Although that does assume the text has a significant amount of markup. I could be wrong, but I'd be surprised if people are running to_upper on whole HTML documents (outside of benchmarks and tests). I'd assume most uses are more targeted.
I'm wondering if it could also be faster to split the table so that there is one smaller char -> char table, and we only look up in the char -> [char; 3] table if the first one doesn't have a hit. A sentinel can be used in the first table to let us know whether the search is done or whether it should continue to the fallback table.
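Roughly something like this (toy data; '\0' works as the sentinel since it already serves as padding in the existing table):

const SENTINEL: char = '\0';

// Sorted single-char mappings; SENTINEL means "the expansion is in MULTI".
static SINGLE: &[(char, char)] = &[
    ('µ', 'Μ'),
    ('ß', SENTINEL), // expands to "SS"
    ('à', 'À'),
];

// Fallback table for the few characters that expand to more than one char.
static MULTI: &[(char, [char; 3])] = &[
    ('ß', ['S', 'S', '\0']),
];

fn to_upper_two_tables(c: char) -> [char; 3] {
    match SINGLE.binary_search_by_key(&c, |&(k, _)| k) {
        Err(_) => [c, '\0', '\0'],
        Ok(i) => match SINGLE[i].1 {
            SENTINEL => MULTI[MULTI.binary_search_by_key(&c, |&(k, _)| k).unwrap()].1,
            upper => [upper, '\0', '\0'],
        },
    }
}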
That's certainly an option @bluss. While the ASCII check is the lowest-hanging fruit, I'd argue special-casing String::to_uppercase is worthwhile because for the single-char case it doesn't need to return a 3-element array - it could just push the char into the new String. (This is an additional speedup on top of the ASCII check.)
It should be that a_string.to_uppercase() has better performance than a_string.chars().flat_map(|ch| ch.to_uppercase()).collect::<String>(), but at the moment that's not the case.
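For the String case, the fast path I have in mind looks roughly like this (using the public char methods rather than the internal tables):

fn to_uppercase_fast_path(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii() {
            // single byte in, single byte out - no [char; 3] round trip
            out.push(c.to_ascii_uppercase());
        } else {
            // full Unicode mapping, which may yield more than one char
            out.extend(c.to_uppercase());
        }
    }
    out
}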
It's also important to note that an optimization similar to the ASCII one can be applied to Cyrillic characters as well:
fn to_cyrillic_uppercase(c: char) -> char {
    let c = c as u32;
    let res = match c {
        0x0430..=0x044F => c - 0x20, // most often used characters: а..я -> А..Я
        0x0450..=0x045F => c - 0x50, // ѐ..џ -> Ѐ..Џ
        0x0460..=0x0481 | 0x048A..=0x04BF | 0x04D0..=0x04FF => c & !1, // paired letters: clear the low bit
        0x04CF => 0x04C0, // ӏ -> Ӏ
        0x04C1..=0x04CE if c & 1 == 0 => c - 1, // even code points are the lowercase of each pair
        _ => c,
    };
    unsafe {
        std::char::from_u32_unchecked(res)
    }
}
(BTW: it's quite annoying that character codes are not selected to simplify such conversions...)
I guess this can be done for most European languages as well. Maybe it's worth talking about a more general compression of the conversion table instead of only special-casing ASCII?
I was just thinking about that possibility myself. There are only about 1400 Unicode characters that are changed by case mapping, and UTF-8 does a decent job of grouping them for us: outside the ASCII range, the UTF-8 encoding of a character changed by to_lower always starts with one of these sequences:
c3
c4
c5
c6
c7
c8
c9
cd
ce
cf
d0
d1
d2
d3
d4
d5
e1 82
e1 83
e1 8e
e1 8f
e1 b2
e1 b8
e1 b9
e1 ba
e1 bb
e1 bc
e1 bd
e1 be
e1 bf
e2 84
e2 85
e2 86
e2 92
e2 93
e2 b0
e2 b1
e2 b2
e2 b3
ea 99
ea 9a
ea 9c
ea 9d
ea 9e
ea 9f
ef bc
f0 90 90
f0 90 92
f0 90 93
f0 90 b2
f0 91 a2
f0 96 b9
f0 9e a4
This suggests a trie-like data structure, working on the encoded representation rather than on abstract characters. I don't have time this week to tinker with one, unfortunately.
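Just to sketch the shape (toy data, only the c3 and ce prefixes with a few hand-picked entries; the longer prefixes such as e1 bc would need another level):

// Per-prefix tables keyed by the second UTF-8 byte, sorted for binary search.
static PREFIX_C3: &[(u8, char)] = &[(0xA0, 'À'), (0xA9, 'É')]; // à, é
static PREFIX_CE: &[(u8, char)] = &[(0xB1, 'Α'), (0xB2, 'Β')]; // α, β

fn to_upper_by_utf8_prefix(c: char) -> char {
    let mut buf = [0u8; 4];
    let bytes = c.encode_utf8(&mut buf).as_bytes();
    let table = match bytes[0] {
        0xC3 => PREFIX_C3,
        0xCE => PREFIX_CE,
        _ => return c, // ASCII and unhandled prefixes fall through unchanged
    };
    match table.binary_search_by_key(&bytes[1], |&(b, _)| b) {
        Ok(i) => table[i].1,
        Err(_) => c,
    }
}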
I think perfect hash functions could be a better fit here. Using the phf crate it should be quite easy to create a mapping for single char cases (1383 out of 1485 for UPPERCASE_TABLE and 1392 out of 1393 for LOWERCASE_TABLE).
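Something like this, assuming the phf crate with its macros feature, and with just a couple of hand-written entries (the real map would be generated from the Unicode tables, and multi-char expansions like ß would still need a fallback):

use phf::phf_map;

// Single-char uppercase mappings only.
static SINGLE_UPPER: phf::Map<char, char> = phf_map! {
    'à' => 'À',
    'µ' => 'Μ', // U+039C GREEK CAPITAL LETTER MU
    'ÿ' => 'Ÿ',
};

fn to_upper_phf(c: char) -> char {
    *SINGLE_UPPER.get(&c).unwrap_or(&c)
}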
Another possibility for packing the table: have different tables for <2^8, <2^16, and <2^32. That gives an ASCII special case organically, while also being good for saving space in the BMP.
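Roughly like this, with a handful of made-up entries:

// Keys stored in the narrowest integer that fits their range.
static UPPER_1B: &[(u8, u16)] = &[(0xE0, 0x00C0), (0xE9, 0x00C9)];      // à, é
static UPPER_2B: &[(u16, u16)] = &[(0x03B1, 0x0391), (0x0430, 0x0410)]; // α, а
static UPPER_4B: &[(u32, u32)] = &[(0x10428, 0x10400)];                 // Deseret 𐐨

fn to_upper_by_width(c: char) -> char {
    let cp = c as u32;
    let up = if c.is_ascii() {
        return c.to_ascii_uppercase(); // the "organic" ASCII special case
    } else if cp < 1 << 8 {
        UPPER_1B
            .binary_search_by_key(&(cp as u8), |&(k, _)| k)
            .map(|i| UPPER_1B[i].1 as u32)
    } else if cp < 1 << 16 {
        UPPER_2B
            .binary_search_by_key(&(cp as u16), |&(k, _)| k)
            .map(|i| UPPER_2B[i].1 as u32)
    } else {
        UPPER_4B
            .binary_search_by_key(&cp, |&(k, _)| k)
            .map(|i| UPPER_4B[i].1)
    };
    up.ok().and_then(char::from_u32).unwrap_or(c)
}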
Good idea! But we should not forget that the list of characters with codes smaller than 2^8 also includes µ and ÿ, for which to_ascii_uppercase() will not work.
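For example:

fn main() {
    // µ (U+00B5) and ÿ (U+00FF) are below 2^8 but are not ASCII, so the
    // ASCII path leaves them unchanged while the full mapping does not.
    assert_eq!('µ'.to_ascii_uppercase(), 'µ');
    assert_eq!('µ'.to_uppercase().next(), Some('Μ')); // U+039C GREEK CAPITAL MU
    assert_eq!('ÿ'.to_ascii_uppercase(), 'ÿ');
    assert_eq!('ÿ'.to_uppercase().next(), Some('Ÿ')); // U+0178
}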
One thing that hasn't been mentioned: given a choice between speed and memory usage, the preference so far has been to keep a small memory footprint. If we can use less space and be faster, that would be super!