Case Insensitive UTF-8 Comparison

The str has a case-insensitive string comparison method for comparing ASCII without allocating new strings here: str - Rust.

But why not UTF-8 comparison? I found this StackOverflow answer that does this: rust - What is an efficient way to compare strings while ignoring case? - Stack Overflow

To be clear, I am aware that I could just uppercase two strings and compare the uppercased equivalents, but I am trying to avoid an unnecessary allocation when each string could instead be compared character-by-character. My question is not how to do this, but rather, why the ASCII version of this is in the standard library, but not a UTF-8 version.

I know Unicode is complex and has things like diacritics, but I think for the vast majority of people, just a simple char::to_uppercase() would be sufficient.

I would like to add that I searched for string comparison libraries and nothing stood out to me as the go-to case-insensitive UTF-8 comparison library. I invite suggestions if anybody knows of such a thing.

Quick terminology correction: you don't want a "UTF-8" version, you want a Unicode version. You don't use an RGB888 editor, you use an image editor.

Secondly, the reason is not that "Unicode is complex and has things like diacritics", it's that "case-insensitive" is not a well-defined property in general, because it depends on what language you're talking about.

So, in order to do this, libstd would also have to grow a concept of language, and lots of big language tables, and my assumption is that no one wanted to deal with that, or bake that extra data into the standard library.

ASCII exists because it's easy and well-defined.

The problem with this approach is that it leads to people writing wrong, broken code without realising it. I mean, if there's a case-insensitive operation in the standard library, surely it's correct, right? It's not just a simplification that's actually wrong in lots of cases, right? And no, mentioning the issue in the documentation doesn't solve the issue because no one reads the documentation, especially not for a "simple" operation with a simple name. People will assume that it does what it says on the label.

10 Likes

Case-insensitive comparison is a locale-specific property. Most notably, “I” and “i” are two separate letters in Turkish, not just two different cases of the same letter. That’s out of scope for the standard library. But having a good default crate recommendation would be good.

4 Likes

Don't these same arguments apply to String::to_lowercase/uppercase

1 Like

While true, saying this in isolation without giving additional context makes your answer less useful than it could be. Unicode doesn't say, "the only way to do case insensitive comparisons is with language specific tailoring." For example, Unicode has an entire FAQ section devoted to this topic. Notice how it doesn't just throw its hands up and say, "the only correct thing is language specific tailoring." It mentions specific use cases and how certain operation may or may not be useful in specific contexts. The OP didn't give their specific context, so it's hard to say what is the best answer for them.

Unicode even offers different forms of case folding, "simple" and "full." Because sometimes "full" might not be feasible, for example, in certain types of regex engines that assume character classes match precisely a single codepoint. Despite it being incorrect in the ideal sense, it is still useful.

There's even a specific "default" case folding operation defined in Section 3.13 "Default Case Algorithms" in the Unicode Standard. @JonathanWilbur The caseless implements that algorithm, and unless you actually need language specific tailoring, that is probably what you should do.

The regex crate also offers Unicode case insensitive matching, although it is limited to "simple" case folding rules and thus will get some things wrong (even in a language agnostic context). But if one side of your comparison is invariant (because compiling a regex is expensive), that might be one avenue to pursue.

I personally find this framing very unhelpful. Unicode is itself a simplification. Hell, even language is a simplification. Remember, "all models are wrong, but, some are useful." The question is not whether you're implementing something incorrectly (a black and white view), but just what sorts of errors one can tolerate.

@JonathanWilbur If you want what is probably closest approximation of reality in a language specific context, then I'd probably start with the icu crate. I haven't used it myself, but for example, CaseMapper looks promising.

21 Likes

What immediately comes to mind for me as ‘the’ library for Unicode algorithms in general is International Components for Unicode (ICU).

The icu crate is a heavily renovated (or completely rewritten) version of ICU (called ICU4X) that is more or less specifically written for Rust; its design is not fully finished. The rust_icu crate is bindings to the traditional, mature C version of ICU.

One difference between these two, aside from the programming language, is that the traditional ICU is a dynamic library that has a single copy of the (large!) data tables needed to implement the Unicode algorithms, whereas ICU4X currently expects each program that uses it to keep its own copy of the data tables for the algorithms that that program uses.

1 Like

icu has already been suggested, and it actually has a string comparison mechanism, which is called icu::collator. This allows locale-aware string comparisons, with an option for case insensitivity. It might not have come up in your search because it's fairly new.

1 Like

You can't use to_uppercase or similar for case-insensitive comparison in Unicode, because in Turkish "i" changes to uppercase "İ" with a dot, not latin capital "I", and capital "I" lowercased changes to dotless lowercase "ı".

You also can't make any reliable string comparison in UTF-8 without applying some form of Unicode Normalization first, because Unicode has multiple different codepoints for the same letters (e.g. precomposed forms and forms with combining characters).

1 Like

Nitpick: Comparing an arbitrary pair of strings needs normalization, but if one of them is a known string, then it may be that you know there are no normalization differences that could apply — it looks like Unicode calls this subset “stable code points” — in which case there is no advantage to performing normalization before or as part of comparison. I'm not sure which common strings this applies to, but certainly the empty string "" and I'd expect also ASCII-subset non-letter strings like "*".

3 Likes

Upper/lower case is the wrong thing to do for case insensitive comparisons. What you want is known as "case folding". Unicode has algorithms for case insensitive comparisons and there are language-specific extensions. The advice to use a library that handles all this for you is a good one. Handling this all yourself is unnecessary and error prone.

5 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.