I’m opposed to any kind of Unicode handling in libstd, including canonicalization (note that I don’t consider guaranteeing the integrity of a String to be “handling”). In the following, when I refer to “text”, I mean “natural text”.
Most Strings don’t encode natural text: hash keys, configuration keys, public/private keys, UUIDs, HTTP messages, JSON files, YAML files. For many reasons they are represented as printable, readable things, but they come with different expectations: strict semantics, speedy usage and a memory-saving representation. For example, indexing into such a string is very much a use case, as are parsing, splitting and merging. They don’t necessarily carry a language, and even when they do, it is usually fixed. They can ignore the issue presented here, because canonicalization issues are very much an edge case for them. Also, many convenient properties hold true for those strings (e.g. that two strings of different lengths are never considered the same). They are often displayed only for debugging reasons.
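To make the “machine string” case concrete, here is a minimal sketch using only what libstd already offers. The helper `first_group` and the sample UUID are made up for illustration. It shows byte-offset slicing and splitting on an ASCII-only value, and that a plain byte comparison treats the two spellings of “é” as different strings of different lengths.

```rust
// Hypothetical helper: returns the first hyphen-separated group of a textual UUID.
// Slicing by byte offset is fine here because a textual UUID is pure ASCII.
fn first_group(uuid: &str) -> &str {
    &uuid[0..8]
}

fn main() {
    let uuid = "123e4567-e89b-12d3-a456-426614174000";
    assert_eq!(first_group(uuid), "123e4567");
    assert_eq!(uuid.split('-').count(), 5);

    // Two spellings of "é": precomposed, and "e" followed by a combining acute accent.
    let precomposed = "\u{e9}";
    let decomposed = "e\u{301}";

    // len() counts UTF-8 bytes, so the two spellings differ in length...
    assert_eq!(precomposed.len(), 2);
    assert_eq!(decomposed.len(), 3);
    // ...and, without canonicalization, they are simply different strings.
    assert_ne!(precomposed, decomposed);
}
```

For keys, identifiers and protocol messages this byte-wise behaviour is usually exactly what is wanted.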
Natural text is a different beast and should be treated as such. The basic operations differ. While indexing to a point in natural text is rather boring, indexing into clusters of the string (“the third word”) is very interesting (and non-trivial: what characters separate a word?). Trimming is very much a standard use case, as are translation and proper (graphical) rendering. All these operations are locale-dependent. Often, these values are only passed through systems (e.g. from database to view layer) and very rarely manipulated. Interesting text operations are heavyweight and should only be used if necessary. Finally, many operations have ambiguous semantics in the presence of natural text: in the face of combined Unicode characters, indexing into my_string is a very interesting operation. If I ask for the third element, do I want the third displayable entity of the string or just the third Unicode character?
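The ambiguity can be sketched with a four-scalar string. The helper `third_scalar` and the sample value are illustrative only. libstd can answer “third byte” and “third Unicode scalar value”, but not “third displayable entity”; that needs grapheme segmentation, which lives outside std (for example in a crate such as unicode-segmentation).

```rust
// Hypothetical helper: the third Unicode scalar value of a string, if any.
fn third_scalar(s: &str) -> Option<char> {
    s.chars().nth(2)
}

fn main() {
    // Displayed as three entities: "n", "é", "e".
    // Stored as four scalar values: 'n', 'e', U+0301 (combining acute), 'e'.
    let my_string = "ne\u{301}e";

    assert_eq!(my_string.len(), 5); // UTF-8 bytes
    assert_eq!(my_string.chars().count(), 4); // Unicode scalar values

    // The third scalar value is the combining accent, not a displayable entity.
    assert_eq!(third_scalar(my_string), Some('\u{301}'));

    // The third byte is the first half of the accent's two-byte encoding.
    assert_eq!(my_string.as_bytes()[2], 0xCC);

    // The third displayable entity would be the final 'e', which is the
    // fourth scalar value; std has no operation that finds it directly.
    assert_eq!(my_string.chars().nth(3), Some('e'));
}
```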
I fundamentally don’t believe that there is a simple abstraction over those two fields of usage.
After using multiple implementations (Java, Ruby, etc.), I concluded that Ruby, with all its flaws, gets it right most often: it only enforces the validity of Strings and keeps its hands off more general operations bound to a specific encoding. This is what should be in the language core, and it’s doable. Bugs in any implementation built on top of that should be out of scope for stdlib, because such implementations are locale-dependent and for that reason treacherous.
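As a rough Rust analogue of “only enforce validity” (my reading, not a description of Ruby’s internals), the sketch below uses the validation libstd already performs when building a String from bytes. The helper `is_valid_utf8` is a made-up name.

```rust
// Hypothetical helper: checks whether a byte slice is well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // Well-formed UTF-8 is accepted...
    assert!(String::from_utf8(vec![0x68, 0x69]).is_ok()); // "hi"
    // ...and malformed input is rejected rather than stored.
    assert!(String::from_utf8(vec![0xff, 0xfe]).is_err());
    assert!(!is_valid_utf8(&[0x80])); // lone continuation byte

    // Validity says nothing about canonical form: both spellings of "é"
    // are valid, and no normalization is applied to either.
    assert!(is_valid_utf8("\u{e9}".as_bytes()));
    assert!(is_valid_utf8("e\u{301}".as_bytes()));
}
```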
Finally, I believe that natural text should be abstracted through a different type: Text. This should only allow locale-bound operations (either by expecting a locale to be passed on every operation or by encoding the locale with the Text). There should be conversions to and from Strings, but it should probably not be too attached to their behaviour, perhaps so that it can support additional encodings.
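A minimal sketch of what such a type could look like, taking the “encode the locale with the Text” variant. Everything here (`Locale`, `Text`, its fields and methods) is hypothetical and not an existing API. Uppercasing serves as the example operation because its result depends on the locale: Turkish maps “i” to a dotted capital “İ”, and only that one rule is hand-coded here.

```rust
// Hypothetical locale tag; a real design would need far more than two variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Locale {
    English,
    Turkish,
}

// Hypothetical natural-text type that always carries its locale.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Text {
    locale: Locale,
    content: String,
}

impl Text {
    // Conversion from a string requires stating the locale explicitly.
    fn new(locale: Locale, content: &str) -> Self {
        Text { locale, content: content.to_string() }
    }

    // Locale-bound operation: uppercasing depends on the language.
    fn to_uppercase(&self) -> Text {
        let content: String = match self.locale {
            // Turkish: 'i' uppercases to U+0130 (dotted capital I).
            Locale::Turkish => self
                .content
                .chars()
                .flat_map(|c| match c {
                    'i' => vec!['\u{130}'],
                    other => other.to_uppercase().collect::<Vec<char>>(),
                })
                .collect(),
            // English: the locale-independent mapping from std is enough.
            Locale::English => self.content.to_uppercase(),
        };
        Text { locale: self.locale, content }
    }
}

// Conversion back to a plain String drops the locale.
impl From<Text> for String {
    fn from(text: Text) -> String {
        text.content
    }
}

fn main() {
    let english = Text::new(Locale::English, "istanbul").to_uppercase();
    let turkish = Text::new(Locale::Turkish, "istanbul").to_uppercase();

    assert_eq!(String::from(english), "ISTANBUL");
    assert_eq!(String::from(turkish), "\u{130}STANBUL");
}
```

The same input gives two different results depending on the locale, which is the kind of behaviour a plain String cannot express without extra arguments.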
So, these are my confused thoughts about all these things; some more can maybe be found in my introductory presentation about Unicode from a while ago (note that the slides are meant for beginners): https://slidr.io/skade/unicode-a#1