TL;DR: reconsider the trade-offs of `String::len()`, `str::len()`, and `str as Index` in order to make strings easier to learn for beginners and less error-prone for intermediate users.
Recently there was an interesting experience report on Reddit with feedback from a Rust beginner:
https://www.reddit.com/r/rust/comments/hzx1ak/beginners_critiques_of_rust/
One of the resulting comment threads went into the `String`/`str`/bytes confusion:
https://www.reddit.com/r/rust/comments/hzx1ak/beginners_critiques_of_rust/fzlwy22/
It pointed out the different ways of looking at a `String`:
> When speaking about UTF8 string, we have at least four different units of length:
>
> - length in bytes
> - length in code points
> - length in grapheme clusters
> - length in glyphs.
>
> Which one is 'natural' to use depends on context. (..) When we are speaking about UTF16 strings (like often happens in Windows and Java) situation is even worse, because sometimes you don't want length in bytes but in code pairs, so there are five 'natural' lengths.
TRPL tells a similar story in chapter 8:
> Another point about UTF-8 is that there are actually three relevant ways to look at strings from Rust’s perspective: as bytes, scalar values, and grapheme clusters (the closest thing to what we would call letters).
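For concreteness, here's a minimal sketch showing the first three lengths diverging on the same `&str` (grapheme clusters need the third-party `unicode-segmentation` crate; glyph count is a rendering-level notion and isn't measurable from the string alone):

```rust
// Requires the third-party `unicode-segmentation` crate for grapheme clusters.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'a', then 'e' followed by U+0301 COMBINING ACUTE ACCENT, which renders as "é"
    let s = "ae\u{301}";
    assert_eq!(s.len(), 4);                   // length in bytes
    assert_eq!(s.chars().count(), 3);         // length in code points
    assert_eq!(s.graphemes(true).count(), 2); // length in grapheme clusters
}
```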
I think idiomatic Rust APIs tend to make complexity explicit in order to make subtle problems more noticeable or salient. In a real sense, some `String`/`str` APIs have traded this property away for conciseness/less syntax and actually conflate the byte and code point notions. (It's also interesting that `len()` operates on bytes but `drain()` operates on chars.)
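A minimal sketch of that mismatch: `drain()` takes a byte-based range (panicking off char boundaries) yet yields `char`s, while `len()` reports bytes:

```rust
fn main() {
    let mut s = String::from("héllo");
    // `len()` counts bytes, not characters:
    assert_eq!(s.len(), 6);
    // `drain()` takes a *byte* range, but the iterator it returns yields `char`s:
    let removed: String = s.drain(0..3).collect();
    assert_eq!(removed, "hé"); // 'h' (1 byte) + 'é' (2 bytes)
    assert_eq!(s, "llo");
}
```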
As another data point, "on a `char` boundary" occurs in the "Panics" section of six different methods on `String` (including those available through `Deref<Target = str>`), but none of these methods' names make this constraint explicit.
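To illustrate the pitfall, here is a minimal sketch of code that works on ASCII input but panics on non-ASCII UTF-8:

```rust
fn first_two(s: &str) -> &str {
    &s[0..2] // byte-based slicing: fine for ASCII, a latent panic otherwise
}

fn main() {
    assert_eq!(first_two("hello"), "he"); // ASCII: works as expected
    // Panics: byte index 2 is not a char boundary; it falls inside 'é',
    // which occupies bytes 1..3 of "héllo".
    let _ = first_two("héllo");
}
```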
In my mind, the main culprits are `len()` and `str`'s `Index` implementation, which IMO make it easy to write code that works fine on ASCII text but fails as soon as it meets non-ASCII UTF-8. I wonder if we could prevent subtle issues like this by making the byte orientation more explicit (i.e., requiring redirection through `as_bytes()`). For `<str as Index>`, we could add a more explicit `str::slice_utf8(Range<usize>)` method instead.
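A hypothetical sketch of what that more explicit surface could look like; `slice_utf8` is only the name suggested above, not an existing std API, so it's modeled here as an extension trait to make the example compile today:

```rust
use std::ops::Range;

// Hypothetical extension trait standing in for the proposed `str::slice_utf8`;
// this method does not exist in std today.
trait SliceUtf8 {
    /// Slice by *byte* range; panics if the bounds are not char boundaries.
    /// Same semantics as `&s[range]`, but the name makes the byte-based
    /// indexing explicit at the call site.
    fn slice_utf8(&self, range: Range<usize>) -> &str;
}

impl SliceUtf8 for str {
    fn slice_utf8(&self, range: Range<usize>) -> &str {
        &self[range]
    }
}

fn main() {
    let s = "héllo";
    // Explicit byte length via `as_bytes()`, as proposed for `len()`:
    assert_eq!(s.as_bytes().len(), 6);
    // Explicit byte-range slicing:
    assert_eq!(s.slice_utf8(0..3), "hé"); // 'h' plus the two bytes of 'é'
}
```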
Of course this kind of change would have a sizable ecosystem cost. We would certainly want to do soft deprecations first and/or gate the deprecation on an edition (I think this is not a solved problem, but could conceivably be solved if deemed useful?).
This is mostly aimed at helping beginners, though even as a fairly experienced Rustacean I've found that slicing UTF-8 by byte offsets still trips me up sometimes. I'm curious to hear what others think.