Rust + Unicode


#1

I’ve been doing work on docs for char, and as such, unicode, and was tweeting about it. I have friends who have deep feels about unicode. They were surprised that grapheme-related stuff is in an external crate.

One of them was kind enough to write something up, for me to post here. So here’s their thoughts:


NFC, NFD: Normalization and Unicode Equality

I am unsure if this is a documentation complaint, or a runtime complaint:

The functionality included here is only that which is necessary to provide for basic string-related manipulations. This crate does not (yet) aim to provide a full set of Unicode tables.

I argue that canonicalization of a unicode string is a fundamental operation within unicode, and without it, you cannot safely do equality on unicode user input/file names.

background: strings, canonicalisation and equality

In the rust world, a character is a unicode scalar value, that is, any codepoint that isn’t a surrogate. Unfortunately, in unicode, there are multiple ways to encode the character é: either as one code point, or two (e plus a combining character).

http://doc.rust-lang.org/1.0.0/src/core/str/mod.rs.html#1335

https://en.wikipedia.org/wiki/Unicode_equivalence has an overview that doesn’t involve delving into technical reports.

These are (in unicode) canonically equal, but in rust strings are compared by codepoints, so these strings are different. Treating unicode strings as different when they are canonically the same leads to fun and games.
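A minimal sketch of the point above, using only std. The helper name `shape` is made up for illustration; the two literals are the precomposed and decomposed spellings of é.

```rust
/// Returns (number of code points, number of UTF-8 bytes) for a string.
fn shape(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    // "é" as the single precomposed code point U+00E9.
    let precomposed = "\u{e9}";
    // "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{301}";

    // Both render the same, but str equality compares the underlying
    // UTF-8 bytes, so these are two different strings.
    println!("equal: {}", precomposed == decomposed);
    println!("precomposed (chars, bytes): {:?}", shape(precomposed));
    println!("decomposed  (chars, bytes): {:?}", shape(decomposed));
}
```

Neither string is "wrong"; they are canonically equivalent, and nothing in std converts one into the other.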

problems caused by a lack of canonicalisation.

For example, a git repo usually stores filenames with the raw codepoints, without normalizing. On OSX, this is NFD for filenames. On Linux, it’s NFC. Files committed on one system would break on another.

The fix is telling git to NFC file names before commit. http://stackoverflow.com/questions/5581857/git-and-the-umlaut-problem-on-mac-os-x

In Spotify, a lack of canonicalization of unicode strings led to hijacked accounts.

back to the argument:

I argue that canonicalization of a unicode string is a fundamental operation within unicode, and without it, you cannot safely do equality on unicode user input/file names.

There’s a similar argument to be made for casefolding. Many developers use lowercase to canonicalize strings, and enough people have seen a Joel Spolsky post to know about the Turkish i.
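To make the casefolding concern concrete, here is a small sketch using only std's case mappings. `naive_key` is a hypothetical helper standing in for the "just lowercase it" habit.

```rust
/// The common "canonicalize by lowercasing" shortcut.
fn naive_key(s: &str) -> String {
    s.to_lowercase()
}

fn main() {
    // std's case mapping takes no locale: "I" always lowercases to "i",
    // whereas Turkish pairs "I" with dotless "ı" (U+0131).
    assert_eq!(naive_key("I"), "i");

    // U+0130 (capital I with dot above) lowercases to "i" + U+0307,
    // so one code point becomes two.
    assert_eq!(naive_key("\u{130}"), "i\u{307}");

    // Case mappings do not round-trip either: "ß" uppercases to "SS",
    // which lowercases to "ss".
    assert_eq!("ß".to_uppercase(), "SS");
    assert_eq!("ß".to_uppercase().to_lowercase(), "ss");
}
```

So lowercasing changes lengths, ignores locale, and is not a full case fold, which is why it makes a shaky canonical form.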

As it stands right now you may be able to publish two cargo packages with different normalizations, which clobber each other when written to disk on osx. There will likely be problems for any utf-8 string in rust that checks equality.

I do not know what path you wish to take, but one of the following might be better than the current status quo.

  • Make the documentation clear you’re punting on canonicalization

  • Put the nfc et al operations in core and let the user work it out.

  • Pick a normalization (NFC, per RFC 5198, is as good a choice as any), and standardise on it internally.


So, what can we do / should we be doing something?

This was echoed by several other people as well. And, frankly, this whole conversation is colored by other languages where “Unicode support” left many bad feels, for example the Ruby 1.8 -> 1.9 transition, so there are residual issues too.

Frankly, I am bad at Unicode, so I feel a bit under-qualified to comment here, but would like to hear yinz’s thoughts. I had a small convo with @SimonSapin yesterday about related things as well.


#2

clippy has a lint that can check for non-NFC unicode in source. To write it, I had to learn about this canonicalization thing (which I didn’t know in detail before I started).

I think we should

  • (short term) document the issue (and raise big red flags on converting Strings to OsPaths!)
  • get the canonicalizations back into std

We also may want to put the unicode_not_nfc lint into rustc to at least warn of non-standard encodings within the source.

However, we should not require canonicalization on type boundaries – this would severely penalize string construction for very little gain.

Unicode is hard, and we cannot make it easy. We should not attempt to make it look like it is.


#3

TMWSP: The docs should at least introduce the major issues with text handling and where to seek more information, canonicalising comparison seems a good idea, beyond that stuff should probably get crated and pulled in conservatively.


As someone who is… let’s say “recreationally” interested in Unicode details [1], a few thoughts:

I think the worst thing we could do is make people believe we have good support when we don’t. By that, I mean that I think we should be very careful about what Rust and the stdlib claim to do correctly. When in doubt, punt.

Proper Unicode support looks to be a colossal pain to get right. For example, doing case conversion properly requires us to also have some notion of locales… and a decent way of finding out what the user’s current locale is… and overriding it when we need to… and that’s almost immediately going to raise the question of “well what about things like money formats and date formats” and oh dear the library maintainer’s in the fetal position and sobbing. This, more than most things, seems like it should definitely start in external crates and be brought into std very cautiously.

I don’t think canonicalisation baked into the string type is a good idea. That’s getting into the realm of “magically transforming data behind your back”. I’d be in favour of comparison doing lazy canonicalisation by default though (i.e. canonicalise the next grapheme cluster during comparison iff direct byte-for-byte comparison fails), and having canonicalised forms supported for output, at the least.
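A rough sketch of the "only canonicalise when the cheap comparison fails" idea. This is a simplification: it falls back to normalizing the whole string rather than one grapheme cluster at a time. `eq_lazy` and `toy_compose` are hypothetical names, and `toy_compose` is a stand-in that handles a single sequence, not real NFC.

```rust
/// Compares two strings, normalizing only if the byte comparison fails.
/// `normalize` is supplied by the caller; std has no such function.
fn eq_lazy<F>(a: &str, b: &str, normalize: F) -> bool
where
    F: Fn(&str) -> String,
{
    // Fast path: byte-identical strings are equal under any normalization.
    if a == b {
        return true;
    }
    // Slow path: pay for normalization only now.
    normalize(a) == normalize(b)
}

/// Toy stand-in for NFC: composes "e" + U+0301 into U+00E9 and nothing else.
fn toy_compose(s: &str) -> String {
    s.replace("e\u{301}", "\u{e9}")
}

fn main() {
    assert!(eq_lazy("caf\u{e9}", "cafe\u{301}", toy_compose));
    assert!(!eq_lazy("cafe", "caf\u{e9}", toy_compose));
}
```

The common case (identical bytes) costs the same as today's comparison; only mismatches trigger extra work.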

I suspect a good direction beyond that would be to just start trying to expose the tables and algorithms as building blocks in a set of crates (which has already kinda started). On that subject, when I went to implement a string cursor library (to allow seeking by code point or grapheme cluster boundaries), I was really annoyed that unicode-segmentation contains all the data and code to detect and process grapheme clusters… but only exposes it as an iterator. This led to some really dicey code that runs around speculatively constructing iterators and stepping them and hoping it doesn’t explode. I mean, it seemed to work, but I have no idea if that’ll keep working or not.

I started looking into parsing the UCD into tables that could be exposed from crates, but got stalled on working out how exactly to represent said tables. Giant in-memory data structure? Giant optimisable match expression? shrug
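For what it's worth, the two representations mentioned could look roughly like this. The three entries are canonical combining class values for a few combining marks; everything else (the names `CCC_TABLE`, `ccc_table`, `ccc_match`, and returning 0 for anything outside the excerpt) is an assumption made for the sketch, not a proposal for a real crate layout.

```rust
/// Representation 1: a sorted static table, searched at run time.
/// Entries are (code point, canonical combining class).
static CCC_TABLE: &[(u32, u8)] = &[
    (0x0301, 230), // COMBINING ACUTE ACCENT
    (0x0323, 220), // COMBINING DOT BELOW
    (0x0327, 202), // COMBINING CEDILLA
];

fn ccc_table(c: char) -> u8 {
    match CCC_TABLE.binary_search_by_key(&(c as u32), |&(cp, _)| cp) {
        Ok(i) => CCC_TABLE[i].1,
        Err(_) => 0, // not in this excerpt: treated as class 0 here
    }
}

/// Representation 2: a match expression the compiler is free to optimise.
fn ccc_match(c: char) -> u8 {
    match c {
        '\u{301}' => 230,
        '\u{323}' => 220,
        '\u{327}' => 202,
        _ => 0,
    }
}

fn main() {
    // Both representations answer the same question.
    assert_eq!(ccc_table('\u{301}'), ccc_match('\u{301}'));
    assert_eq!(ccc_table('e'), ccc_match('e'));
}
```

The table form can be generated straight from the UCD and shared as data; the match form leaves layout decisions to the compiler.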

In terms of documentation, it’d be nice to have a chapter named “WEYTUKATHIWAWYRDWTHATDWIIYCPAI” (short for “why everything you think you know about text handling is wrong and why you really don’t want to have anything to do with it if you can possibly avoid it”) in the documentation that at least introduces these ideas and where to look for code to deal with it. I tried to find something appropriate a while back to direct people on IRC to… and couldn’t find anything. The old stand-by (Spolsky’s post) is incomplete on the subject. Even if it just brings up issues so that people are aware of them, that would probably be a decent improvement.

[1] By that, I mean that I know enough to be terrified of the idea of writing anything involving text manipulation and to be an insufferable pedant, but have no professional need to know any of this crap.

(Too Much Waffling; Summarise, Please.)


#4

There’s an underlying library that uses locales this way and has a concept of the current process’s locale: the C library. This approach has long since been proven insufficient in all but the simplest cases. In a desktop calculator it might be fine, but in a server that needs to handle localization on a per-session basis it’s useless. I guess it’s rather obvious from a modern perspective that Rust will not copy this chunk of process-global state manipulation.

The environment locale (C library’s locale) by the way has always just been a structured guess at which language or encoding to use. Fine, your locale ctype says UTF-8, so that’s a good guess that file names and command line input are UTF-8, but it’s no guarantee.
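A small sketch of why the environment is only a hint. `claims_utf8` is a hypothetical helper that applies the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) to values passed in explicitly; `main` then shows that command line arguments arrive as `OsString` and may not be UTF-8 whatever the locale says.

```rust
/// Guesses whether the environment claims UTF-8, using the precedence
/// LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
fn claims_utf8(lc_all: Option<&str>, lc_ctype: Option<&str>, lang: Option<&str>) -> bool {
    let effective = [lc_all, lc_ctype, lang]
        .iter()
        .filter_map(|v| *v)
        .find(|v| !v.is_empty());
    match effective {
        Some(value) => {
            let lower = value.to_ascii_lowercase();
            lower.contains("utf-8") || lower.contains("utf8")
        }
        None => false,
    }
}

fn main() {
    let get = |name: &str| std::env::var(name).ok();
    let (all, ctype, lang) = (get("LC_ALL"), get("LC_CTYPE"), get("LANG"));
    let claim = claims_utf8(all.as_deref(), ctype.as_deref(), lang.as_deref());
    println!("locale claims UTF-8: {}", claim);

    // The claim guarantees nothing about actual input.
    for arg in std::env::args_os() {
        match arg.to_str() {
            Some(s) => println!("UTF-8 argument: {}", s),
            None => println!("non-UTF-8 argument: {:?}", arg),
        }
    }
}
```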


#5

I meant like asking the OS what language the current session is using (in my case, English/Australian) so that that value can be provided to comparison or case functions. Surely all modern operating systems provide a way to find that out. Things can’t possibly be so awful that this is not the case.

I’m also not in any way advocating tying everything to some implicit bit of global state.


#6

I’m not saying you’re advocating that or anything else.

I’m saying that for most applications the concept of the environment’s locale isn’t relevant, and that it’s an obsolete concept. The same process may need to produce all of (for example) Danish, Swedish and Chinese text, to send to different connected clients.


#7

I’m opposed to any kind of unicode handling in libstd, including canonicalization (note that I don’t consider guaranteeing the integrity of a String to be “handling”). For the following piece, if I refer to “text”, I mean “natural text”.

Most Strings don’t encode natural text. Hash keys, configuration keys, public/private keys, UUIDs, HTTP Messages, JSON files, YAML files. They are, for many reasons, abstracted using printable and readable things, but have different expectations bound to them: strict semantics, speedy usage and a memory-saving representation. For example, indexing into such a string is very much a use case, as is parsing, splitting and merging. They don’t necessarily carry a language, and even when they do, it is usually fixed. They can ignore the issue presented here, because such canonicalization issues are very much an edge case here. Also, many convenient properties for those strings hold true (e.g. that two strings are not considered the same if they are of different lengths). They are often just displayed for debugging reasons.

Natural text is a different beast and should be treated as such. The basic operations differ. While indexing to a point in natural text is rather boring, indexing into clusters of the string (“the third word”) is very interesting (and non-trivial: what characters separate a word?). Trimming is very much a standard use-case, as is translation and proper (graphical) rendering. All these operations are locale-dependent. Often, these values are only passed through systems (e.g. from database to view layer) and very rarely manipulated. Interesting text operations are heavyweight and should only be used if necessary. Finally, many operations have ambiguous semantics in the presence of natural text: facing combined unicode characters, my_string[2] is a very interesting operation. Do I want the displayable entity of a string or just the third unicode character?
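To illustrate the `my_string[2]` ambiguity with what std offers today: a str cannot be indexed by an integer at all, so the choice between bytes and code points has to be spelled out, and displayable entities are not covered by std. `third_char` is a made-up helper name.

```rust
/// "The third unicode character", i.e. the third code point, if there is one.
fn third_char(s: &str) -> Option<char> {
    s.chars().nth(2)
}

fn main() {
    // "née" with the accent stored as a combining character:
    // n, e, U+0301, e => 4 code points, 5 bytes, 3 displayable entities.
    let s = "ne\u{301}e";

    // `s[2]` does not compile; each reading has its own API.
    assert_eq!(s.len(), 5);
    assert_eq!(s.chars().count(), 4);

    // The third code point is the combining accent, not a letter.
    assert_eq!(third_char(s), Some('\u{301}'));

    // Byte slicing must land on char boundaries, or it panics at run time.
    assert!(s.is_char_boundary(2)); // between "e" and U+0301
    assert!(!s.is_char_boundary(3)); // inside the two-byte U+0301
    assert_eq!(&s[..2], "ne");
}
```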

I fundamentally don’t believe that there is a simple abstraction over those two fields of usage.

After using multiple implementations (Java, Ruby, etc.), I concluded that Ruby is, with all its flaws, the most right: it only enforces validity of Strings and keeps its hands out of more general operations bound to a specific encoding. This is what should be in the language core - it’s doable. Bugs in any implementations on top of that should be out of scope for stdlib, because they are locale dependent and for that reason treacherous.

Finally, I believe that natural text should be abstracted through a different type: Text. This should only allow locale-bound operations (either by expecting a locale to be passed on every operation or encoding the locale with the Text). There should be a conversion to Strings and from Strings, but probably, it should also not be too attached to their behaviour. (probably to support additional encodings)

So, these are my confused thoughts about all these things; some more can maybe be found in my introductory presentation about Unicode from a while ago (note that these are meant for beginners): https://slidr.io/skade/unicode-a#1


#8

I believe implicit locales in any API are the worst example of side-effects. They make the program run unpredictably and are hard to identify.

All APIs actually manipulating text should have an explicit notion of the locale they run under. The program can decide to gain that locale from the OS, but no core library should assume that it can be gotten from the OS and that it is correct for that operation.
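As a sketch of what "explicit notion of the locale" could mean in a signature: the `Locale` enum and `to_lowercase_in` below are entirely hypothetical, and the Turkish branch is a simplification that only handles the two well-known I mappings, not the context-sensitive rules.

```rust
/// Hypothetical locale tag; a real API would need far more than two variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Locale {
    Root,
    Turkish,
}

/// Lowercases `s` under a locale that the caller must name explicitly.
fn to_lowercase_in(s: &str, locale: Locale) -> String {
    if locale == Locale::Root {
        // std's mapping, which takes no locale into account.
        return s.to_lowercase();
    }
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Turkish pairs I with dotless ı (U+0131), and İ (U+0130) with i.
            'I' => out.push('\u{131}'),
            '\u{130}' => out.push('i'),
            // Everything else falls back to std's per-char mapping.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}

fn main() {
    assert_eq!(to_lowercase_in("ISPARTA", Locale::Root), "isparta");
    assert_eq!(to_lowercase_in("ISPARTA", Locale::Turkish), "\u{131}sparta");
}
```

Where the locale value comes from (the OS, a request header, a user setting) stays the caller's decision.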


#9

I believe trying to build a Text that keeps track of its locale is bound to be futile. What Locale would you give diesem Text? Ceci n’est pas un Text at all, pardon my French. This stuff is really hard, so hard that no one on this planet has figured out the right solution (I’m not even sure such a thing exists).

@DanielKeep: I don’t think that we should complicate text equality for now. Perhaps add an .eq_nfc(_) and .eq_nfd() method to the libraries, if you really like.


#10

You’re saying that it’s an obsolete concept to have a user interface display itself in the user’s selected language? Really? Not everything is a server. What about just plain old user-facing applications? Having the ability to get that information and pass it as an argument in absolutely no way prevents you from passing some other locale instead.

I agree. Good thing I never mentioned implicit locales!

You say that like text equality isn’t already complicated, and that Rust isn’t currently ignoring reality. I believe that, for better or worse, people expect == to do the “obvious” thing. The fact of the matter is that correctly comparing Unicode strings (which is what Rust explicitly says it uses) involves taking canonicalisation into account.

Look at HashMap; the default hasher makes Rust look pretty bad in benchmarks, but it’s still a good default choice because it defends against binning attacks, and if perf is an issue it can be replaced.
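For reference, swapping the hasher looks roughly like this. The `Fnv1a` type and `FastMap` alias are written out here purely for illustration (a 64-bit FNV-1a with no collision resistance); they are not something std provides.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Minimal 64-bit FNV-1a hasher: fast, but offers no defence against
/// deliberately colliding keys.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    // Safe default: std's randomly seeded hasher.
    let mut safe: HashMap<&str, u32> = HashMap::new();
    safe.insert("key", 1);

    // Explicit opt-in when hashing shows up in a profile.
    let mut fast: FastMap<&str, u32> = FastMap::default();
    fast.insert("key", 1);

    assert_eq!(safe.get("key"), fast.get("key"));
}
```

The analogy to strings would be a safe-but-slower default comparison with an explicit way to opt out.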

Then again, you’d need pretty aggressive canonicalisation to get "ABC" == "ABC", so perhaps this is one of those things where we just tell people “this doesn’t do what you expect, text sucks, deal with it”.


#11

How would you meaningfully apply any kind of these algorithms in question on that?


#12

Please file bugs when you’re annoyed. I’m sure this API can be improved.


#13

Names in HFS+, OS X’s default filesystem, are (quoting Wikipedia) “normalized to a form very nearly the same as Unicode Normalization Form D (NFD)”. (More on “nearly” below.) This means that when creating a file with std::fs::File::create and getting its name back with std::fs::read_dir, that name may or may not compare equal to the original name in current Rust.
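A sketch of that round trip, for anyone who wants to try it on their own filesystem. `name_round_trips` is a hypothetical helper; it writes into a scratch directory under the system temp dir, and the result depends entirely on the filesystem, so no particular output is claimed.

```rust
use std::fs::{self, File};
use std::io;

/// Creates a file with the given name in a scratch directory and reports
/// whether the directory listing hands back the identical name.
fn name_round_trips(name: &str) -> io::Result<bool> {
    let dir = std::env::temp_dir().join(format!("unicode-name-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    File::create(dir.join(name))?;

    let mut same = false;
    for entry in fs::read_dir(&dir)? {
        // file_name() is an OsString; compare without lossy conversion.
        if entry?.file_name().to_str() == Some(name) {
            same = true;
        }
    }
    fs::remove_dir_all(&dir)?;
    Ok(same)
}

fn main() -> io::Result<()> {
    // Precomposed "é": a normalizing filesystem may return "e" + U+0301 instead.
    let name = "caf\u{e9}.txt";
    println!("round-trips unchanged: {}", name_round_trips(name)?);
    Ok(())
}
```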

Most filesystems on Linux do no such normalization; the strings would always compare equal. What do you mean by “it’s NFC”?

About “nearly”, let’s quote Apple: https://developer.apple.com/legacy/library/technotes/tn/tn1150.html#UnicodeSubtleties

IMPORTANT:
An implementation must not use the Unicode utilities implemented by its native platform (for decomposition and comparison), unless those algorithms are equivalent to the HFS Plus algorithms defined here, and are guaranteed to be so forever. This is rarely the case. Platform algorithms tend to evolve with the Unicode standard. The HFS Plus algorithms cannot evolve because such evolution would invalidate existing HFS Plus volumes.

I think this is a design mistake in HFS+, but we have to deal with it. Doing Canonical (not Compatibility!) Unicode normalization helps with that, but it’s not even quite right since there’s an Apple-specific flavor of the algorithm.


This makes it sound like “no canonicalization is bad, more canonicalization fixed everything”. But the Spotify story was much more subtle than that.

Their algorithm is not just Unicode normalization, it’s something called xmpp-nodeprep-03, which itself is a “profile” of stringprep. Unicode normalization is just one step of this algorithm (after “Map” (which includes case folding to lower case) and before “Prohibit” and “Check bidi”). Their problem was not failing to apply it, it was applying it more than once while their implementation was not idempotent because the Python standard library updated its unicodedata module to a new version of Unicode, and the implementation had some optimization that relied on that data being exactly Unicode version 3.2.
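The property that failed there is idempotence: applying the preparation step twice must give the same result as applying it once. A sketch of checking it, with a deliberately broken toy function; `is_idempotent_on` and `collapse_pass` are hypothetical and have nothing to do with the actual nodeprep code.

```rust
/// Returns true if applying `f` twice to `input` equals applying it once.
fn is_idempotent_on<F>(f: F, input: &str) -> bool
where
    F: Fn(&str) -> String,
{
    let once = f(input);
    let twice = f(&once);
    once == twice
}

/// Deliberately broken toy "canonicalizer": collapses "ss" to "s" in a
/// single left-to-right pass, so "sss" becomes "ss" and then "s".
fn collapse_pass(s: &str) -> String {
    s.replace("ss", "s")
}

fn main() {
    assert!(is_idempotent_on(|s: &str| s.to_lowercase(), "BigBird"));
    assert!(!is_idempotent_on(collapse_pass, "sss"));
}
```

If the first pass runs at signup and the second at login, a non-idempotent step can map two distinct accounts onto one name.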


Serious question: what is the obvious thing? It’s not at all obvious to me. Words only have the meaning we give them, and there are so many ways to define what makes strings “equivalent”.

Are the capital omega letter Ω and the Ohm sign Ω equivalent? Are lower case and upper case equivalent? In French it’s common to omit diacritics. It kinda looks wrong, but your mail is still gonna be delivered if you write Francois instead of François on an envelope. Are these equivalent? François might be mildly annoyed. But in a search engine you want to do all that and more.
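A quick check of how std treats these cases today. `eq_ignoring_case` is a made-up helper; the escapes are U+03A9 (Greek capital omega), U+2126 (ohm sign) and U+00E7 (c with cedilla).

```rust
/// One possible notion of "equivalent": equal after std's lowercasing.
fn eq_ignoring_case(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    let greek_omega = "\u{3a9}";
    let ohm_sign = "\u{2126}";

    // Distinct code points, so distinct strings under ==.
    assert_ne!(greek_omega, ohm_sign);
    assert_eq!(greek_omega.len(), 2); // two UTF-8 bytes
    assert_eq!(ohm_sign.len(), 3); // three UTF-8 bytes

    // Lowercasing happens to merge them: both map to U+03C9.
    assert!(eq_ignoring_case(greek_omega, ohm_sign));

    // Dropping a diacritic is not covered by any case mapping.
    assert!(!eq_ignoring_case("Francois", "Fran\u{e7}ois"));
}
```

Each answer comes from a different definition of "equivalent", which is rather the point.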


So, what should we do? We can polish the API of https://github.com/unicode-rs/unicode-normalization and move it back in std. I think that Cargo makes dependency handling easy enough that “in std” vs. “on crates.io” is not very relevant, but if it makes some people feel better, whatever.

However I firmly believe that PartialEq for str should not use Unicode normalization or any other kind of normalization. There are so many algorithms to choose from! Canonical or compatibility? Apple-specific or latest Unicode version? Is it OK if strings that were “different” become “equivalent” when you upgrade your compiler to one that uses a newer version of Unicode?

And that doesn’t mean Rust is bad at Unicode. We give the tools for each program to use the flavor of normalization appropriate for its own use case, it doesn’t have to be an implicit default.


#14

https://play.rust-lang.org/?gist=5c15c09cafba4df814ad&version=stable was also given as an executable example


#15

It sounds to me like we get back to the tricky question of what == means. I feel that @steveklabnik’s friend wants it to mean semantic equality of strings which would take into account canonicalisation/normalisation (are these two things the same, btw?), whereas at the moment it is something closer to bit-wise equality. IMO, Rust’s philosophy of equality is closer to the low-level view - although we are not quite C - we don’t compare for pointer equality for example - we do tend to favour simple implementations of ==. Therefore it seems to me that equality with canonicalisation should not be given by the == operator but by another method. That seems to be in agreement with most others on this thread.

So, concrete steps:

  • we should not move more unicode stuff into std in the near term,
  • we should have better documentation about unicode complexities,
  • we should advertise better that being on crates.io can still be ‘official’, being in std is not necessary (I believe the libs team will be pushing on this anyway in the near future),
  • we should continue to put effort into our unicode libs on crates.io.

#16

I don’t feel like that. “==” between strings should be “exactly the same”.

Many of these algorithms aim to provide a.operation() == b.operation() under the assumption that == is exact equality.

Note, for example, that advanced text searchers like Lucene still work with binary comparison at their core, they just have a very long preparation pipeline leading up to that check.
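In code, the "prepare first, then compare exactly" pattern is about this simple. `prepare` is a hypothetical pipeline (trim, then lowercase); a real one would be chosen per use case.

```rust
/// Hypothetical preparation step for a lookup key: trim, then lowercase.
fn prepare(raw: &str) -> String {
    raw.trim().to_lowercase()
}

fn main() {
    let a = "  Rust-Lang ";
    let b = "rust-lang";

    // == stays exact: the raw inputs differ.
    assert!(a != b);

    // Equivalence is defined by the preparation, then checked with plain ==.
    assert!(prepare(a) == prepare(b));
}
```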


#17

As SimonSapin says, there are too many alternatives to select any other algorithm for equality check.

It sounds like path equality checking is non-trivial. Is it correct?


#18

I was referring to binary equality :).


#19

As the original author of the NF(K){C,D} implementations that were originally in libunicode and are now in the unicode-normalization crate I feel obliged to chime in here.

I think performing normalization as part of PartialEq::eq to “fix” string equality would be misguided. Different applications want different normalization, and (as others have said) thinking normalization on its own is sufficient is naive in many cases. If you doubt this I’d point you at the IETF’s work on StringPrep, or more recently PRECIS and LUCID. E.g. assume we added normalization, you can still publish two cargo crates, one with, one without a ZWNJ in the name. They might not clobber each other per se, but they are indistinguishable to users, copy pasting from crates.io to their Cargo.toml. I believe it is better to teach users to properly prepare their strings for comparison, and then perform a byte-by-byte comparison, or better yet have crates doing this for specific use-cases.
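The ZWNJ case can be shown with std alone. The crate names and the `is_plain_name` rule below are invented for illustration; this is not a description of what crates.io actually enforces.

```rust
/// Hypothetical, deliberately strict rule: ASCII letters, digits, '-' and '_'.
fn is_plain_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

fn main() {
    let honest = "my-crate";
    let sneaky = "my-cr\u{200c}ate"; // contains U+200C ZERO WIDTH NON-JOINER

    // The two usually render identically, yet they are different strings,
    // and U+200C has no decomposition for normalization to act on.
    assert_ne!(honest, sneaky);
    assert_eq!(honest.chars().count(), 8);
    assert_eq!(sneaky.chars().count(), 9);

    // A use-case-specific preparation or validation step catches it.
    assert!(is_plain_name(honest));
    assert!(!is_plain_name(sneaky));
}
```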

Concerning whether normalization should be available in libstd, I have come to believe that, while it is an important algorithm, it actually shouldn’t. A separate crate can be more agile concerning Unicode versions. It hopefully would also be more eager to add various mappings, which is equally important. I remember we have been hesitant to add such mappings to libstd in the past, due to the large-ish tables required. IMHO Cargo makes it sufficiently easy to get access to this functionality, when required.


#20

I agree: there is no single right form of normalization, different forms of normalization have different characteristics, and the right choice might even depend on locale. As such, implementing implicit/standard normalization as part of str/String is not a good idea at all.

Documenting this is a must, including in the rust book. Though I don’t think it has to be in std. Having a crate in the rust nursery should be fine. (Though we might want to advertise official packages on crates.io and the rust nursery, maybe even calling them secondary std parts. Extending the API search of std to include all “official”/“nursery” and maybe even “recommended for nursery” packages might be a good idea.)

The problem is that JSON, and I think YAML too, do suffer from unicode equality problems, as they allow Unicode keys, which require “some correct” normalization to be completely comparable. I agree that they are more limited in practice and often require more “speedy” usage, so adding a normalization on each compare is a no go. Nevertheless, normalizing when they are parsed is a must for correct processing. (Btw. this also affects e.g. HashMap etc., as “magic strings” and natural Text are often not clearly separated.)
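The HashMap effect is easy to reproduce with std. `count_keys` is a hypothetical helper standing in for "parse some keys into a map".

```rust
use std::collections::HashMap;

/// Inserts every key into a map and returns how many distinct entries result.
fn count_keys(keys: &[&str]) -> usize {
    let mut map: HashMap<String, usize> = HashMap::new();
    for (i, key) in keys.iter().enumerate() {
        map.insert(key.to_string(), i);
    }
    map.len()
}

fn main() {
    // The same visible key, once precomposed and once decomposed.
    let keys = ["caf\u{e9}", "cafe\u{301}"];

    // Without normalization at parse time these are two separate entries.
    assert_eq!(count_keys(&keys), 2);
}
```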

Neither do I :)

I had a (very small) bit of linguistics in school and read a bit about Unicode and its normalization; I believe no completely right solution can exist wrt. natural text. Neither can you assume any text has a single correct locale (or any at all).

Even plain old user-facing applications should not use the concept of locales too much; if they do, it just gets extremely annoying for multilingual persons, as multilingual persons will produce text in different languages aligned to different locales, possibly even in the same sentence. The application therefore has to (re)produce and handle this correctly. (Still, using the locale for selecting the default language for descriptions, help-text etc. is partially fine.) Also, as far as I know, a little bit of locales, while broken, is still the only “solution”.


I feel a “best” hypothetical solution might have been to not implement PartialEq (or PartialOrd, etc.) for the standard string type and then have “zero-overhead (except in creation)” wrapper types which implement a specific normalization and have the equality/ordering requirements (with one normalization being selected as “default” and used for string literals). Though this could also produce a usability nightmare and might be extremely tricky to get right (as simple normalization does not solve all problems of text equality).
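A rough sketch of such a wrapper type, to show the shape of the idea. All names (`Normalizer`, `ToyNfc`, `Normalized`) are hypothetical, and `ToyNfc` is a stand-in that composes a single sequence rather than implementing real NFC.

```rust
use std::marker::PhantomData;

/// A normalization scheme, chosen at the type level.
trait Normalizer {
    fn normalize(s: &str) -> String;
}

/// Toy stand-in for NFC: composes "e" + U+0301 into U+00E9 and nothing else.
struct ToyNfc;

impl Normalizer for ToyNfc {
    fn normalize(s: &str) -> String {
        s.replace("e\u{301}", "\u{e9}")
    }
}

/// A string normalized once, at construction, under scheme `N`.
struct Normalized<N: Normalizer> {
    text: String,
    scheme: PhantomData<N>,
}

impl<N: Normalizer> Normalized<N> {
    fn new(raw: &str) -> Self {
        Normalized { text: N::normalize(raw), scheme: PhantomData }
    }

    fn as_str(&self) -> &str {
        &self.text
    }
}

impl<N: Normalizer> PartialEq for Normalized<N> {
    fn eq(&self, other: &Self) -> bool {
        // Plain byte comparison: the cost was paid in `new`.
        self.text == other.text
    }
}

impl<N: Normalizer> Eq for Normalized<N> {}

fn main() {
    let a = Normalized::<ToyNfc>::new("caf\u{e9}");
    let b = Normalized::<ToyNfc>::new("cafe\u{301}");
    assert!(a == b);
    assert_eq!(b.as_str(), "caf\u{e9}");
}
```

Values normalized under different schemes are different types, so they cannot be compared by accident.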

But this chance is gone for now, so wrt. current rust I think strong documentation, possibly the above-mentioned clippy lint, as well as “official” repos for normalization and text handling are the/a solution. (While I would like to have a lint against text comparisons which might trip over due to normalization, I don’t think this can be done.)