Support for grapheme clusters in std


#1

Hi!

I was wondering whether Rust’s string types allowed us to iterate over/slice/count/interact with grapheme clusters in addition to code-points (.chars()) and bytes (.as_bytes()). The documentation for str mentions grapheme clusters briefly:

It’s important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want.

But AFAICT “what I actually want” is not present in std.
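To make the gap concrete, here is a minimal sketch using only std. The helper name `byte_and_char_counts` is made up for illustration. It shows the two views std does offer on a string that displays as a single symbol.

```rust
/// Returns (number of UTF-8 bytes, number of Unicode scalar values) in `s`.
/// Hypothetical helper, not part of std.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    // `str::len` counts bytes; `chars()` iterates over scalar values.
    (s.len(), s.chars().count())
}

fn main() {
    // French flag emoji: two regional-indicator code points (F and R)
    // that are rendered as one symbol.
    let flag = "\u{1F1EB}\u{1F1F7}";
    let (bytes, chars) = byte_and_char_counts(flag);
    println!("{}: {} bytes, {} chars", flag, bytes, chars);
    // std offers no built-in way to learn that this is one
    // user-perceived character.
}
```

Neither number is 1, which is the count most people would expect for a single flag.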

Digging further, I found this pull-request from 2014: it sounds like it implements exactly what the documentation is talking about :raised_hands: except that, if I understand correctly:

  • it was merged in 2014, and the corresponding pull-request was closed;
  • it vegetated in unstable;
  • it was then removed during some “spring-cleaning” in 2015.

This is somewhat disheartening. Right now the only hits I found for “grapheme” among open issues and pull-requests do not suggest this is on the menu; in a recent-ish thread on users.rust-lang.org, people seemed unaware that this pull-request even existed at some point, accepting that this functionality belongs in an external crate right off the bat.

What are the odds of this feature being resurrected?

  • From what I understand, this pull-request mainly added methods to str, so maybe it can be brought back with minimal breakage?

  • Although I imagine that changes in the last 3 years might make this PR hard to merge back.

  • Also, I don’t know how rigid Rust’s roadmap is;

    • on the one hand, I guess that as long as motivated volunteers have time to push for a feature, it can find its way in despite not being on the roadmap;

    • on the other hand, design and code reviews may be time-consuming enough that the Rust team would rather focus its efforts on the roadmap.

Then again, maybe everyone is fine with grapheme cluster support living in an external crate. From my limited understanding of how Unicode works,

  • bytes are useful to deal with plumbing issues such as serialization,
  • grapheme clusters are useful to deal with meat-space problems (display width, character string comparisons),
  • code-points are useful to… lure programmers into wrongly believing their language helps them deal with non-ASCII text?

That last bit was tongue-in-cheek, but I seriously can’t think of a use-case for code-points. AFAIU they are just an intermediate representation between bytes (“bits that programs can send to each other”), and grapheme clusters (“symbols that people can recognize and think of as units”): programs cannot rely on code-points to e.g. compute word widths and align text in a monospaced CLI, get the character next to the cursor and erase it in a GUI, …

tl;dr

  • Did I miss the rationale on why that feature died while in unstable?
  • Yes, I know there’s a crate for that.
  • I still feel like this feature belongs in a language’s standard string type; offering only bytes and code-points and telling the user “don’t use those; go find some crate if you want to handle text correctly” defeats the point of a string type IMO.

PS1: some last-minute googling brought these up:

  • @Manishearth’s take on code-points:

    Now, a lot of languages by default are now using Unicode-aware encodings. This is great. It gets rid of the misconception that characters are one byte long.

    But it doesn’t get rid of the misconception that user-perceived characters are one code point long.

    […]

    I strongly feel that languages should be moving in this direction, having defaults involving grapheme clusters.

    […]

    Now, Rust is a systems programming language and it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults. I’m very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust. Not that it can anyway, due to backwards compatibility :slight_smile:

    But I would prefer if the grapheme segmentation methods were in the stdlib (they used to be). This is probably not something that will happen, though I should probably push for the unicode crates being move into the nursery at least.

  • @steveklabnik’s explanation of strings in 10 slides:

    Note: Grapheme cluster support isn’t really in the standard library; check out the unicode-segmentation crate.

PS2: I hope this is the right forum and the right category.

PS3: Since…

  • I have never contributed to Rust so far,
  • although I read This Week In Rust religiously to see how the language is shaping up, I’ve only ever used it in toy programs to better understand some of its concepts,
  • I’m not an expert on Unicode either,

… I realize that my opinion on this subject may not be very well-informed.


#2

In general, a lot of what one would think of as “core” functionality lives in external crates, often maintained by the Rust team itself (but not always). As mentioned in the quote, I would like the segmentation crate to be moved into the nursery and get team support.

grapheme clusters are useful to deal with meat-space problems (display width, character string comparisons),

This is incorrect. They are useful, but for neither of these operations. Grapheme clusters are useful for knowing logical boundaries where text can be segmented, i.e. if you need to cut something off. If you wish to sort text, you should be relying on a collation crate. If you wish to compare text, you need normalization.
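A small std-only sketch of why plain `==` and `sort` are not enough here. The helper names `bytewise_equal` and `bytewise_sorted` are made up for illustration. Normalization and collation themselves are not in std, so this only shows the problem, not the fix.

```rust
/// Compares two strings the way `==` on `str` does: byte for byte.
/// Hypothetical helper, not part of std.
fn bytewise_equal(a: &str, b: &str) -> bool {
    a == b
}

/// Sorts with `str`'s built-in ordering, which is lexicographic by byte value.
/// Hypothetical helper, not part of std.
fn bytewise_sorted(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}

fn main() {
    let precomposed = "\u{e9}"; // U+00E9
    let decomposed = "e\u{301}"; // U+0065 followed by combining U+0301
    // Both render as the same letter, yet they are not equal without
    // normalizing first.
    println!("equal: {}", bytewise_equal(precomposed, decomposed));

    // The accented word lands after "zebra" because its first byte is
    // larger than b'z'; a collation-aware sort would order it differently.
    println!("{:?}", bytewise_sorted(vec!["zebra", "\u{e9}clair", "apple"]));
}
```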

Text is hard. Codepoints aren’t a very useful abstraction for text, but grapheme clusters aren’t the solution to all the problems. 99% of the time someone wants to do an operation on text, the answer is not to use grapheme clusters; the answer is that that operation makes no sense when generalized across languages.

I seriously can’t think of a use-case for code-points.

Codepoints are useful when parsing things or when implementing unicode algorithms, and not much else. The former is rather common in Rust.
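As a rough illustration of the parsing case, here is a sketch that walks a string code point by code point to split off a leading identifier. The function name `split_identifier` and the identifier rule (alphanumeric or `_`) are assumptions for the example, not any particular language's grammar.

```rust
/// Splits `input` into a leading identifier and the rest.
/// The rule used here (alphanumeric code points or '_') is an assumption
/// chosen for the example.
fn split_identifier(input: &str) -> (&str, &str) {
    // `char_indices` yields each code point with its byte offset, so the
    // offset found is always a valid place to split the string.
    let end = input
        .char_indices()
        .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    input.split_at(end)
}

fn main() {
    let (ident, rest) = split_identifier("na\u{ef}ve_count = 3");
    println!("identifier: {:?}, rest: {:?}", ident, rest);
}
```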

Grapheme clusters are useful too, but in specialized cases usually dealing with editing, where you probably need to be using more than just grapheme clusters.

Grapheme clusters are a better way of thinking about strings as a programmer, but really the core of it is that strings should be opaque to you and the only operations you think of are find/equality, and substringing based on find results. These are all hard to get right in a universal way, and we perhaps should expose operations for this, but this is also kinda locale dependent and an even more iffy thing to move into the stdlib.

This is a super complex space and “move grapheme clusters into the stdlib” doesn’t even begin to solve things here.

Handling text properly is not a matter of “just use this abstraction it magically solves things for you”, it’s a matter of asking the programmer what operation they actually want to do.

If Rust were more UI-focused I’d be much more for doing this, but it’s not.

programs cannot rely on code-points to e.g. compute word widths and align text in a monospaced CLI, get the character next to the cursor and erase it in a GUI, …

FWIW, grapheme clusters solve neither problem here. Both editing and monospace width are complex problems. Grapheme clusters are part of the solution for editing, but not entirely.


#3

Rust is pulled in two directions:

  • to have first class support for all of Unicode
  • to be lean, produce tiny executables, and support a wide range of platforms, including some that can’t even fit Unicode-related tables in memory.

Rust can’t do both at once, so it has compromised on having very basic Unicode in stdlib, and pushed everything else to crates.


#4

Agree with everything @Manishearth said. :+1: With that said, I want to tug on the opposing side just a little, and give perspective from someone who spends a crazy amount of time dealing with text (both in the “let’s parse and search text” and the “let’s analyze text” senses).

There are strong and compelling reasons to care about not only the code unit indices in a UTF-8 encoded string, but also the codepoints themselves. They heavily revolve around trade offs in terms of the user experience, performance and difficulty of implementation. The key thing to remember is that Unicode, despite its amazing thoroughness, is still just a model of how to deal with human language in software, and therefore cannot itself be always correct. As a rule of thumb, strive to use Unicode where possible, but don’t be afraid to make conscious trade offs.

Byte indices are typically very useful in parsing oriented code. The reason is that byte indices can be used to slice a UTF-8 encoded string in constant time. Therefore, if something like a regex library returns a match in terms of offsets, then it is straight-forward to turn that into a slice of the original string. Usually you just want the string slice that matched, but dealing with offsets is strictly more flexible. For example, it lets you say, “after finding a match at offset range [s, e), look for a subsequent match at [e, ∞).” If all you got back was the matching subslice, then it would be a little tricky to accommodate that use case (although not impossible with some pointer arithmetic).
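The “match at [s, e), then search from e” pattern can be sketched with std’s `str::find` standing in for a regex engine. The function name `find_all` is made up for illustration, and only literal, non-overlapping matches are handled.

```rust
/// Returns the byte ranges [s, e) of every non-overlapping occurrence of
/// `needle` in `haystack`. Hypothetical helper built on `str::find`.
fn find_all(haystack: &str, needle: &str) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    if needle.is_empty() {
        return matches; // avoid looping forever on empty matches
    }
    let mut start = 0;
    // Resume each search at the end offset of the previous match.
    while let Some(pos) = haystack[start..].find(needle) {
        let s = start + pos;
        let e = s + needle.len();
        matches.push((s, e));
        start = e;
    }
    matches
}

fn main() {
    let text = "caf\u{e9} au lait, caf\u{e9} noir";
    for (s, e) in find_all(text, "caf\u{e9}") {
        // Slicing by byte offsets is constant time and needs no decoding.
        println!("[{}, {}) -> {:?}", s, e, &text[s..e]);
    }
}
```

The offsets are byte offsets, so the accented letter counts as two positions and the second match starts at 15 rather than 14.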

The use cases for codepoints are a little less clear, and almost always correspond to a balancing of trade offs. In general, using codepoints almost always makes the user experience worse. Depending on how much weight you attach to the UX, using codepoints might be completely inappropriate. The weight you attach to it might in turn depend on your target audience. For example, if most or all people using your software speak English, then it’s conceivable that one might find it OK to pretend that codepoints roughly approximate characters. Some might strenuously argue against that sort of approach to building software, but, as always, reality dictates.

For example:

  • If you want to track offsets across the same document in multiple distinct Unicode encodings, then codepoint offsets could be useful. Namely, given a codepoint offset of [s, e), it is possible to extract precisely the same codepoints from the same document, regardless of whether it is encoded as UTF-8 or UTF-16. You could do the same with code unit indices, but it requires an additional conversion step.
  • Many regular expression engines (including Rust’s) treat the codepoint as the fundamental atom of a match. That is, . will match a codepoint instead of a single byte or a grapheme. In general, this means . doesn’t have a terribly useful correspondence to what an arbitrary user would consider as a character. Nevertheless, making the fundamental atom of a match be a grapheme implies wide ranging things about the implementation of a regex engine. In fact, this is largely what motivates the structure of UTS #18, which describes Unicode support for regular expressions. The first level is mostly applicable to regex engines that match by codepoints, whereas the second level starts introducing more sophisticated grapheme oriented features.
  • Sometimes, you just need to extract a preview or some slice of text in a graphical display. This can manifest as a requirement like, “Show the description of this item, but in this particular view, cap the description to 50 characters.” What does 50 characters mean? If you said that one codepoint was one character, what would happen? Certainly, it would produce more visual failures than if you said that one grapheme was one character. How much do those failures matter to you and your users? If a grapheme library is easily available to you, then you should probably just use it.
  • When implementing full text search, one has to decide how to dice up the text. For example, if you’re implementing full text search via n-grams, then what does an n-gram even mean? Does it matter if you use N bytes? N codepoints? N graphemes? It’s likely you want to pick the definition that best balances the relevance quality of your search engine and speed of indexing.
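For the “cap the description to N characters” case, here is a minimal sketch of the codepoint-based trade off using only std. The function name `truncate_chars` is made up for illustration. It treats one codepoint as one character, so it can cut a combining sequence in half.

```rust
/// Returns the longest prefix of `s` containing at most `max_chars`
/// code points. Hypothetical helper; it knows nothing about graphemes.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    // Find the byte offset where code point number `max_chars` starts;
    // if the string is shorter than that, keep all of it.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    println!("{:?}", truncate_chars("description", 4));
    // The accent is a separate combining code point, so a cut after one
    // code point silently drops it.
    println!("{:?}", truncate_chars("e\u{301}t\u{e9}", 1));
}
```

Whether losing the accent in the second call is acceptable is exactly the judgment call described above.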

If I could summarize the above post, I think I’d say this: “Rules exist for a reason, and following them can bear lots of fruit. However, rules are also meant to be broken, and don’t be afraid to do so when given sufficient reasoning. But by golly, know when you’re breaking the rules so that you understand the consequences.”


#5

Every time I learn something about text, I realise how little I understand.

I’d love to read a book that spells it out to me like I’m a child. Once I know what I actually want to do and why, I get a cookbook style section explaining how to do it in Rust.

It’d be great to start with “simple” solutions that allow me to be productive and build my understanding. Then move on to exploring trade-offs and performance.


#6

Thanks a lot for your answers.

I guess my first wrong assumption was that grapheme clusters are the “unit” that can be used to solve every text-handling problem.

I just recently stumbled on some limitations of code-points for counting/iterating on “characters” by observing that "é".chars().count() != "é".chars().count(), where the first “é” is the single precomposed code-point U+00E9 (easy to type on a French AZERTY keyboard) and the second is “e” followed by the combining acute accent U+0301 (easy to type with desktop environments that let you input arbitrary code-points with Ctrl-Shift-U).

That specific example led me to believe that all there was to “splitting strings into elementary symbols that I can count/split/reverse” was “decode code-points from bytes; then combine code-points into grapheme clusters; there; enjoy your iterable units”.
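The observation can be reproduced with std alone by writing the two spellings as escapes, so that copy-pasting cannot silently turn one into the other. The helper names `char_count` and `code_points` are made up for illustration.

```rust
/// Number of Unicode scalar values in `s`. Hypothetical helper.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// The scalar values of `s` as numbers, for inspection. Hypothetical helper.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    let precomposed = "\u{e9}"; // one code point
    let decomposed = "e\u{301}"; // base letter plus combining accent
    for s in [precomposed, decomposed] {
        println!("{}: {} chars, code points {:x?}", s, char_count(s), code_points(s));
    }
}
```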

If I understand your points correctly, while they may help for some specific operations (with some specific scripts), grapheme clusters are not a general solution to manipulating text. Thus it becomes a matter of evaluating how much value having them in std would bring vs how costly it would be, and making a judgment call on whether the tradeoff is worth the hassle.

I almost feel like arguing that if a language advertises a Unicode string type in its standard library, it should go all the way and expose every single feature the Unicode Consortium specified, since users will probably assume that this string type handles “everything” for them. I can readily accept that this is not a reasonable position to hold though; even if it didn’t represent a colossal amount of design and implementation work, if the end result is a complex API that no-one can use, this would still be an obvious loss.

My second bias is that external crates feel… mmm. I want to say “second-class”, but that would be very unfair to the Library Team. Obviously there is a lot of effort going into polishing the crate ecosystem; it just feels somewhat… diluted?

To try and explain this feeling (briefly, since this is off-topic and, AFAIK, a recurring discussion, so I guess I won’t bring anything new to the table), as someone interested in

  1. re-using code,
  2. making sure I won’t have to watch the repositories I depend on in case the original maintainer disappears and I have to switch to a fork for bug fixes,

external crates feel “risky”. I know there are curated lists like stdx, Awesome Rust or the Ergo Ecosystem, but the crates they highlight still mostly come from “random” sources.

More googling led me to the Rust Cookbook though, which I find more reassuring by virtue of being maintained in the nursery (even though some of the crates it showcases are not in the nursery themselves). I can’t seem to find it mentioned anywhere though, not on rust-lang.org, crates.io nor docs.rs; is it because it’s not considered ready for prime time yet?