I've been doing work on the docs for `char` and, as such, Unicode, and was tweeting about it. I have friends who have deep feels about Unicode. They were surprised that grapheme-related stuff is in an external crate.
One of them was kind enough to write something up, for me to post here. So here's their thoughts:
NFC, NFD: Normalization and Unicode Equality
I am unsure if this is a documentation complaint, or a runtime complaint:
The functionality included here is only that which is necessary to provide for basic string-related manipulations. This crate does not (yet) aim to provide a full set of Unicode tables.
I argue that canonicalization of a unicode string is a fundamental operation within unicode, and without it, you cannot safely do equality on unicode user input/file names.
background: strings, canonicalisation and equality
In the Rust world, a `char` is a Unicode scalar value: any code point except the surrogate code points. Unfortunately, in Unicode, there are multiple ways to encode the character é: either as one code point (U+00E9), or as two (e plus a combining acute accent, U+0301).
Unicode equivalence - Wikipedia has an overview that doesn't involve delving into technical reports.
These are canonically equivalent in Unicode, but Rust compares strings code point by code point, so these strings are different. Treating Unicode strings as different when they are canonically the same leads to fun and games.
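To make that concrete, here's a minimal std-only example showing the two spellings of é comparing unequal:

```rust
fn main() {
    // U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point (the NFC form).
    let composed = "\u{00E9}";
    // U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points (the NFD form).
    let decomposed = "e\u{0301}";

    // Both render as "é" and are canonically equivalent in Unicode,
    // but Rust's == compares code points, so they are not equal:
    assert_ne!(composed, decomposed);
    assert_eq!(composed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
}
```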
problems caused by a lack of canonicalisation.
For example, a git repo usually stores filenames as the raw code points, without normalizing. On OS X, HFS+ normalizes filenames to (a variant of) NFD; Linux filesystems store whatever bytes they're given, which in practice is usually NFC. Files committed on one system can break on the other.
The fix is telling git to NFC file names before commit.
At Spotify, a lack of canonicalization of Unicode strings led to hijacked accounts.
back to the argument:
I argue that canonicalization of a unicode string is a fundamental operation within unicode, and without it, you cannot safely do equality on unicode user input/file names.
There's a similar argument to be made for casefolding. Many developers use lowercasing to canonicalize strings, and enough people have seen a Joel Spolsky post to know about the Turkish i.
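The Turkish problem bites even with Unicode's default, locale-independent case mapping; a small std-only illustration:

```rust
fn main() {
    // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, as used in Turkish.
    let dotted_capital_i = "\u{0130}";

    // Unicode's default lowercase mapping expands it to 'i' followed by
    // U+0307 COMBINING DOT ABOVE, so lowercasing it does NOT produce a
    // string equal to plain ASCII "i":
    let lowered = dotted_capital_i.to_lowercase();
    assert_eq!(lowered, "i\u{0307}");
    assert_ne!(lowered, "i");
}
```

So "lowercase both sides and compare" is not a safe canonicalization on its own; you need proper casefolding (and normalization).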
As it stands right now, you may be able to publish two Cargo packages whose names differ only in normalization, which would clobber each other when written to disk on OS X. Similar problems will likely crop up anywhere Rust code compares UTF-8 strings for equality.
I do not know what path you wish to take, but any of the following might be better than the current status quo:

- Make the documentation clear that you're punting on canonicalization.
- Put the NFC et al. operations in core and let the user work it out.
- Pick a normalization (NFC, per RFC 5198, is as good a choice as any) and standardize on it internally.
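To make that last option concrete, "normalize before comparing" looks roughly like this. This is a toy sketch that composes only the single e + combining-acute pair from the example above; real code needs the full Unicode composition tables (which is what a crate like `unicode-normalization` provides), and `toy_nfc` is a made-up name:

```rust
// A toy sketch of canonical composition for one code-point pair only,
// purely to illustrate what NFC-before-compare means. A real
// implementation must consult the full Unicode composition tables.
fn toy_nfc(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        // Compose 'e' + U+0301 COMBINING ACUTE ACCENT into U+00E9 'é'.
        if c == 'e' && chars.peek() == Some(&'\u{0301}') {
            chars.next(); // consume the combining mark
            out.push('\u{00E9}');
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // The NFD and NFC spellings of "café" now compare equal:
    assert_eq!(toy_nfc("cafe\u{0301}"), toy_nfc("caf\u{00E9}"));
}
```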
So, what can we do / should we be doing something?
This was echoed by several other people as well. And, frankly, this whole conversation is colored by how "Unicode support" went in other languages, which left many bad feels; the Ruby 1.8 -> 1.9 transition, for example. So there are residual issues too.
Frankly, I am bad at Unicode, so I feel a bit under-qualified to comment here, but would like to hear yinz's thoughts. I had a small convo with @SimonSapin yesterday about related things as well.