Support for grapheme clusters in std

Agree with everything @Manishearth said. :+1: With that said, I want to tug on the opposing side just a little, and give perspective from someone who spends a crazy amount of time dealing with text (both in the "let's parse and search text" and the "let's analyze text" senses).

There are strong and compelling reasons to care about not only the code unit indices in a UTF-8 encoded string, but also the codepoints themselves. They heavily revolve around trade-offs in terms of user experience, performance and difficulty of implementation. The key thing to remember is that Unicode, despite its amazing thoroughness, is still just a model of how to deal with human language in software, and therefore cannot itself always be correct. As a rule of thumb, strive to use Unicode where possible, but don't be afraid to make conscious trade-offs.

Byte indices are typically very useful in parsing oriented code, because byte indices can be used to slice a UTF-8 encoded string in constant time. Therefore, if something like a regex library returns a match in terms of offsets, then it is straightforward to turn that into a slice of the original string. Usually you just want the string slice that matched, but dealing with offsets is strictly more flexible. For example, it lets you say, "after finding a match at offset range [s, e), look for a subsequent match at [e, ∞)." If all you got back was the matching subslice, then it would be a little tricky to accommodate that use case (although not impossible with some pointer arithmetic).
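
To make that concrete, here's a quick sketch using the regex crate (the pattern and haystack are made up for illustration):

```rust
use regex::Regex;

fn main() {
    let haystack = "foo1 foo2 foo3";
    let re = Regex::new(r"foo[0-9]").unwrap();

    // Match offsets are byte offsets, so slicing is constant time.
    let m = re.find(haystack).unwrap();
    assert_eq!((m.start(), m.end()), (0, 4));
    assert_eq!(&haystack[m.start()..m.end()], "foo1");

    // "After a match at [s, e), look for the next match at [e, ∞)":
    // just hand the engine the rest of the haystack.
    let next = re.find(&haystack[m.end()..]).unwrap();
    assert_eq!(next.as_str(), "foo2");
}
```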

The use cases for codepoints are a little less clear, and almost always correspond to a balancing of trade-offs. In general, using codepoints almost always makes the user experience worse. Depending on how much weight you attach to the UX, using codepoints might be completely inappropriate. The weight you attach to it might in turn depend on your target audience. For example, if most or all people using your software speak English, then it's conceivable that one might find it OK to pretend that codepoints roughly approximate characters. Some might strenuously argue against that sort of approach to building software, but, as always, reality dictates.

For example:

  • If you want to track offsets across the same document in multiple distinct Unicode encodings, then codepoint offsets could be useful. Namely, given a codepoint offset range [s, e), it is possible to extract precisely the same codepoints from the same document, regardless of whether it is encoded as UTF-8 or UTF-16. You could do the same with code unit indices, but it requires an additional conversion step (see the first sketch after this list).
  • Many regular expression engines (including Rust's) treat the codepoint as the fundamental atom of a match. That is, . will match a codepoint instead of a single byte or a grapheme (see the second sketch after this list). In general, this means . doesn't have a terribly useful correspondence to what an arbitrary user would consider a character. Nevertheless, making the fundamental atom of a match be a grapheme has wide-ranging implications for the implementation of a regex engine. In fact, this is largely what motivates the structure of UTS #18, which describes Unicode support for regular expressions. The first level is mostly applicable to regex engines that match by codepoints, while the second level starts introducing more sophisticated grapheme oriented features.
  • Sometimes, you just need to extract a preview or some slice of text in a graphical display. This can manifest as a requirement like, "Show the description of this item, but in this particular view, cap the description to 50 characters." What does 50 characters mean? If you said that one codepoint was one character, what would happen? Certainly, it would produce more visual failures than if you said that one grapheme was one character. How much do those failures matter to you and your users? If a grapheme library is easily available to you, then you should probably just use it (see the third sketch after this list).
  • When implementing full text search, one has to decide how to dice up the text. For example, if you're implementing full text search via n-grams, then what does an n-gram even mean? Does it matter if you use N bytes? N codepoints? N graphemes? You likely want to pick the definition that best balances the relevance quality of your search engine against the speed of indexing (see the last sketch after this list).
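
For the first bullet, here's a sketch of translating between UTF-8 byte offsets and codepoint offsets. The helper names are mine, not from any particular crate:

```rust
fn byte_to_codepoint_offset(s: &str, byte_off: usize) -> usize {
    // Assumes byte_off falls on a char boundary; slicing panics otherwise.
    s[..byte_off].chars().count()
}

fn codepoint_to_byte_offset(s: &str, cp_off: usize) -> usize {
    s.char_indices().nth(cp_off).map_or(s.len(), |(i, _)| i)
}

fn main() {
    let s = "aé漢b"; // 'a' = 1 byte, 'é' = 2 bytes, '漢' = 3 bytes
    // 'b' sits at codepoint offset 3 regardless of encoding,
    // but at byte offset 6 in UTF-8.
    assert_eq!(byte_to_codepoint_offset(s, 6), 3);
    assert_eq!(codepoint_to_byte_offset(s, 3), 6);
}
```
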
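For the second bullet, the codepoint-as-atom behavior of Rust's regex crate looks like this:

```rust
use regex::Regex;

fn main() {
    let one = Regex::new(r"^.$").unwrap();
    // 'é' as a single codepoint (two UTF-8 bytes) is one `.`:
    assert!(one.is_match("é"));
    // But 'e' + U+0301 (combining acute accent) is one grapheme made of
    // two codepoints, so a single `.` does not match it:
    assert!(!one.is_match("e\u{301}"));
    assert!(Regex::new(r"^..$").unwrap().is_match("e\u{301}"));
}
```
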
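For previews, here's a sketch of codepoint versus grapheme truncation, using the unicode-segmentation crate for the latter:

```rust
use unicode_segmentation::UnicodeSegmentation;

// Truncate a preview to at most `n` "characters". Codepoints can split a
// user-perceived character; graphemes keep it intact.
fn preview_codepoints(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn preview_graphemes(s: &str, n: usize) -> String {
    s.graphemes(true).take(n).collect()
}

fn main() {
    // "🇺🇸" is one grapheme built from two regional indicator codepoints.
    let s = "a🇺🇸b";
    assert_eq!(preview_codepoints(s, 2), "a\u{1F1FA}"); // half a flag
    assert_eq!(preview_graphemes(s, 2), "a🇺🇸"); // intact
}
```
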
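And finally, one possible (certainly not definitive) definition of an n-gram for an index, with graphemes as the unit; swapping in chars() or bytes() shifts the balance between relevance and indexing speed:

```rust
use unicode_segmentation::UnicodeSegmentation;

// Slide a window of `n` graphemes across the text.
fn grapheme_ngrams(text: &str, n: usize) -> Vec<String> {
    let gs: Vec<&str> = text.graphemes(true).collect();
    gs.windows(n).map(|w| w.concat()).collect()
}

fn main() {
    assert_eq!(grapheme_ngrams("naïve", 3), vec!["naï", "aïv", "ïve"]);
}
```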

If I could summarize the above post, I think I'd say this: "Rules exist for a reason, and following them can bear lots of fruit. However, rules are also meant to be broken, and don't be afraid to do so when given sufficient reasoning. But by golly, know when you're breaking the rules so that you understand the consequences."
