Rust + Unicode

Names in HFS+, OS X’s default filesystem, are (quoting Wikipedia) “normalized to a form very nearly the same as Unicode Normalization Form D (NFD)”. (More on “nearly” below.) This means that if you create a file with std::fs::File::create and get its name back with std::fs::read_dir, the name you get back may not compare equal to the name you passed in, in current Rust.
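To make the failure mode concrete, here is a std-only sketch of what such a round trip can produce: the precomposed (NFC) and decomposed (NFD) spellings of “é” render identically but are different byte sequences, so str’s PartialEq says they differ. (No filesystem involved here; the two literals just stand in for “name you wrote” and “name HFS+ handed back”.)

```rust
fn main() {
    // "é" in NFC: a single code point, U+00E9.
    let nfc = "\u{00E9}";
    // "é" in NFD: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
    // roughly the form an NFD-normalizing filesystem would return.
    let nfd = "e\u{0301}";

    // Visually identical, but str's PartialEq compares code units:
    assert_ne!(nfc, nfd);
    assert_eq!(nfc.chars().count(), 1);
    assert_eq!(nfd.chars().count(), 2);
}
```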

Most filesystems on Linux do no such normalization, so the strings would always compare equal. What do you mean by “it’s NFC”?

About “nearly”, let’s quote Apple:

IMPORTANT:
An implementation must not use the Unicode utilities implemented by its native platform (for decomposition and comparison), unless those algorithms are equivalent to the HFS Plus algorithms defined here, and are guaranteed to be so forever. This is rarely the case. Platform algorithms tend to evolve with the Unicode standard. The HFS Plus algorithms cannot evolve because such evolution would invalidate existing HFS Plus volumes.

I think this is a design mistake in HFS+, but we have to deal with it. Doing Canonical (not Compatibility!) Unicode normalization helps with that, but it’s not even quite right since there’s an Apple-specific flavor of the algorithm.


This makes it sound like “no canonicalization is bad, more canonicalization fixes everything”. But the Spotify story was much more subtle than that.

Their algorithm is not just Unicode normalization; it’s something called xmpp-nodeprep-03, which is itself a “profile” of stringprep. Unicode normalization is just one step of that algorithm (after “Map”, which includes case folding to lower case, and before “Prohibit” and “Check bidi”). Their problem was not failing to apply it: it was applying it more than once while their implementation was not idempotent. It stopped being idempotent because the Python standard library updated its unicodedata module to a new version of Unicode, while the implementation had an optimization that relied on that data being exactly Unicode version 3.2.
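The lesson generalizes: whatever canonicalization you pick, it should be idempotent, and that property is cheap to test. A minimal std-only sketch (is_idempotent is a hypothetical helper, not from any crate), using a toy non-idempotent pass for contrast:

```rust
// Checks that applying a canonicalization twice gives the same
// result as applying it once.
fn is_idempotent(f: impl Fn(&str) -> String, input: &str) -> bool {
    let once = f(input);
    f(&once) == once
}

fn main() {
    // Lowercasing is idempotent for this input.
    assert!(is_idempotent(|s| s.to_lowercase(), "François"));

    // A toy "squeeze doubled letters" pass is NOT idempotent:
    // "aaa" becomes "aa" on the first pass and "a" on the second.
    assert!(!is_idempotent(|s| s.replace("aa", "a"), "aaa"));
}
```

In practice you would run checks like this over a corpus (or property-test them) whenever the underlying Unicode tables are upgraded, which is exactly the event that broke Spotify’s implementation.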


Serious question: what is the obvious thing? It’s not at all obvious to me. Words only have the meaning we give them, and there are so many ways to define what makes strings "equivalent".

Are the capital omega letter Ω and the Ohm sign Ω equivalent? Are lower case and upper case equivalent? In French it’s common to omit diacritics. It kinda looks wrong, but your mail is still gonna be delivered if you write Francois instead of François on an envelope. Are these equivalent? François might be mildly annoyed. But in a search engine you want to do all that and more.
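The omega question already has two defensible answers in plain Rust, no normalization library needed. As code points the two characters are distinct, but Unicode’s case mappings send both to lowercase ω (U+03C9), so the “case-insensitive” notion of equivalence unifies them (assuming std’s to_lowercase follows the standard mapping for U+2126, which it does in current Rust):

```rust
fn main() {
    let omega = "\u{03A9}"; // GREEK CAPITAL LETTER OMEGA, Ω
    let ohm = "\u{2126}";   // OHM SIGN, Ω

    // Distinct code points, so not equal as strings:
    assert_ne!(omega, ohm);

    // But both lowercase to U+03C9 ω, so case folding
    // considers them "the same letter":
    assert_eq!(omega.to_lowercase(), ohm.to_lowercase());
}
```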


So, what should we do? We can polish the API of the unicode-normalization crate (unicode-rs/unicode-normalization on GitHub, which implements the normalization forms of UAX #15) and move it back into std. I think Cargo makes dependency handling easy enough that “in std” vs. “on crates.io” is not very relevant, but if it makes some people feel better, whatever.

However I firmly believe that PartialEq for str should not use Unicode normalization or any other kind of normalization. There are so many algorithms to choose from! Canonical or compatibility? Apple-specific or latest Unicode version? Is it OK if strings that were “different” become “equivalent” when you upgrade your compiler to one that uses a newer version of Unicode?
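To illustrate how many distinct spellings a normalizing PartialEq would have to arbitrate between, here is a std-only sketch: three canonically equivalent spellings of “Å”, plus a compatibility-only case (the “fi” ligature), all of which str’s code-unit comparison rightly keeps distinct:

```rust
fn main() {
    // Canonical equivalence: three spellings of "Å".
    let nfc = "\u{00C5}";      // LATIN CAPITAL LETTER A WITH RING ABOVE
    let nfd = "A\u{030A}";     // 'A' + COMBINING RING ABOVE
    let angstrom = "\u{212B}"; // ANGSTROM SIGN, canonically equivalent to both

    // Compatibility (not canonical) equivalence: the "fi" ligature.
    let ligature = "\u{FB01}"; // LATIN SMALL LIGATURE FI

    // str's PartialEq treats every one of these as distinct:
    assert_ne!(nfc, nfd);
    assert_ne!(nfc, angstrom);
    assert_ne!(ligature, "fi");
}
```

Whether any pair above is “equal” depends on which normalization form you pick (NFC and NFD unify the first three; only NFKC/NFKD touch the ligature), which is precisely why the choice belongs to the program, not to PartialEq.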

And that doesn’t mean Rust is bad at Unicode. We give each program the tools to use the flavor of normalization appropriate for its own use case; it doesn’t have to be an implicit default.
