Hi!
I was wondering whether Rust's string types allowed us to iterate over/slice/count/interact with grapheme clusters in addition to code-points (.chars()) and bytes (.as_bytes()). The documentation for str mentions grapheme clusters briefly:
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
But AFAICT "what I actually want" is not present in std.
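To make the gap concrete, here is a minimal sketch using only std. The helper name byte_and_char_counts is mine, purely for illustration; the point is that std gives me the first two numbers but, as far as I can tell, not the third one (the number of grapheme clusters).

```rust
/// Returns (UTF-8 byte count, code-point count) for a string, using only std.
/// Hypothetical helper, named for this example only.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, two code points, three UTF-8 bytes.
    let decomposed = "e\u{301}";
    let (bytes, chars) = byte_and_char_counts(decomposed);
    println!("bytes = {}, chars = {}", bytes, chars);

    // Assumption: with the external unicode-segmentation crate in scope,
    // `decomposed.graphemes(true).count()` should report 1 here.
    // That crate is not part of std, so it is left out of this sketch.
}
```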
Digging further, I found this pull-request from 2014: it sounds like it implements exactly what the documentation is talking about, except that, if I understand correctly:
- it was merged in 2014, and the corresponding pull-request was closed;
- it vegetated in unstable;
- it was then removed during some "spring-cleaning" in 2015.
This is somewhat disheartening. Right now the only hits I found for "grapheme" among open issues and pull-requests do not suggest this is on the menu; in a recent-ish thread on users.rust-lang.org, people seemed unaware that this pull-request even existed at some point, accepting that this functionality belongs in an external crate right off the bat.
What are the odds of this feature being resurrected?
- From what I understand, this pull-request mainly added methods to str, so maybe it can be brought back with minimal breakage?
  - Although I imagine that changes in the last 3 years might make this PR hard to merge back.
- Also, I don't know how rigid Rust's roadmap is;
  - on the one hand, I guess that as long as motivated volunteers have time to push for a feature, it can find its way in despite not being on the roadmap;
  - on the other hand, design and code reviews may be time-consuming enough that the Rust team would rather focus its efforts on the roadmap.
- Then again, maybe everyone is fine with grapheme cluster support living in an external crate. From my limited understanding of how Unicode works,
  - bytes are useful to deal with plumbing issues such as serialization,
  - grapheme clusters are useful to deal with meat-space problems (display width, character string comparisons),
  - code-points are useful to… lure programmers into wrongly believing their language helps them deal with non-ASCII text?
That last bit was tongue-in-cheek, but I seriously can't think of a use-case for code-points. AFAIU they are just an intermediate representation between bytes ("bits that programs can send to each other"), and grapheme clusters ("symbols that people can recognize and think of as units"): programs cannot rely on code-points to e.g. compute word widths and align text in a monospaced CLI, get the character next to the cursor and erase it in a GUI, …
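As a rough illustration of the "erase the character next to the cursor" case, here is a sketch of a backspace built on String::pop, which removes the last code point. The function name naive_backspace is hypothetical, and whether a given UI actually wants this behaviour is a separate question; the sketch only shows that a code point is not the same unit as the symbol on screen.

```rust
/// Naive "backspace": removes the last code point, exactly as String::pop does.
/// Hypothetical helper, named for this example only.
fn naive_backspace(text: &mut String) -> Option<char> {
    text.pop()
}

fn main() {
    // Displays as "café", but the final "é" is 'e' + U+0301 COMBINING ACUTE ACCENT.
    let mut text = String::from("cafe\u{301}");
    let removed = naive_backspace(&mut text);

    // Only the combining accent is removed; the base letter 'e' stays behind,
    // so the user sees "cafe" instead of "caf".
    println!("removed {:?}, left with {:?}", removed, text);
}
```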
tl;dr
- Did I miss the rationale on why that feature died while in unstable?
- Yes, I know there's a crate for that.
- I still feel like this feature belongs in a language's standard string type; offering only bytes and code-points and telling the user "don't use those; go find some crate if you want to handle text correctly" defeats the point of a string type IMO.
PS1: some last-minute googling brought these up:
- @Manishearth's take on code-points:
Now, a lot of languages by default are now using Unicode-aware encodings. This is great. It gets rid of the misconception that characters are one byte long.
But it doesn't get rid of the misconception that user-perceived characters are one code point long.
[…]
I strongly feel that languages should be moving in this direction, having defaults involving grapheme clusters.
[…]
Now, Rust is a systems programming language and it just wouldn't do to have expensive grapheme segmentation operations all over your string defaults. I'm very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust. Not that it can anyway, due to backwards compatibility
But I would prefer if the grapheme segmentation methods were in the stdlib (they used to be). This is probably not something that will happen, though I should probably push for the unicode crates being move into the nursery at least.
- @steveklabnik's explanation of strings in 10 slides:
Note: Grapheme cluster support isn't really in the standard library; check out the unicode-segmentation crate.
PS2: I hope this is the right forum and the right category.
PS3: Since…
- I have never contributed to Rust so far,
- although I read This Week In Rust religiously to see how the language is shaping up, I've only ever used it in toy programs to better understand some of its concepts,
- I'm not an expert on Unicode either,
… I realize that my opinion on this subject may not be very well-informed.