Wild idea: deprecating APIs that conflate str and [u8]

TL;DR: reconsider the trade-offs of String::len(), str::len(), str as Index in order to make strings easier to learn for beginners and less error-prone for intermediate users.

Recently there was an interesting experience report on Reddit with feedback from a Rust beginner:

https://www.reddit.com/r/rust/comments/hzx1ak/beginners_critiques_of_rust/

One of the resulting comment threads went into the String/str/bytes confusion:

https://www.reddit.com/r/rust/comments/hzx1ak/beginners_critiques_of_rust/fzlwy22/

Which pointed out the different ways of looking at a String:

When speaking about UTF8 string, we have at least four different units of length

  • length in bytes
  • length in code points
  • length in grapheme clusters
  • length in glyphs.

Which one is 'natural' to use depends on context. (..) When we are speaking about UTF16 strings (like often happens in Windows and Java) situation is even worse, because sometimes you don't want length in bytes but in code pairs, so there are five 'natural' lengths.

TRPL tells a similar story in chapter 8:

Another point about UTF-8 is that there are actually three relevant ways to look at strings from Rust’s perspective: as bytes, scalar values, and grapheme clusters (the closest thing to what we would call letters).
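To make the distinction concrete, here is a minimal sketch that counts the same text in three of these units using only std. The helper name `lengths` is made up for illustration. Grapheme clusters are left out because std has no API for them (a crate such as unicode-segmentation would be needed).

```rust
/// Returns (bytes, Unicode scalar values, UTF-16 code units) for `s`.
/// `lengths` is a hypothetical helper, not a std API.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible letter.
    println!("{:?}", lengths("e\u{301}")); // (3, 2, 2)
    // A single scalar value outside the Basic Multilingual Plane.
    println!("{:?}", lengths("🦀")); // (4, 1, 2)
}
```

Only for pure ASCII input do all three numbers agree, which is why ASCII-only tests tend to hide this class of bug.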

I think idiomatic Rust APIs tend to make complexity explicit in the API in order to make subtle problems more noticeable or salient. In a real sense, some String/str APIs have traded off this property against conciseness/less syntax and actually conflate the bytes/code points notions. (It's also interesting that len() counts bytes, while drain() takes a byte range but yields chars.)
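A small sketch of that mix, assuming I read the std docs correctly: the range passed to drain() is measured in bytes, just like len(), but the items it yields are chars. `drain_prefix` is a made-up helper.

```rust
/// Removes the first `end` bytes from `s` and returns them as a new String.
/// Panics if `end` is not on a char boundary, like `String::drain` itself.
/// `drain_prefix` is a hypothetical helper for illustration.
fn drain_prefix(s: &mut String, end: usize) -> String {
    // The range is in bytes; the iterator yields `char`s.
    s.drain(..end).collect()
}

fn main() {
    let mut s = String::from("héllo");
    // 'h' is 1 byte and 'é' is 2 bytes, so the first two chars span bytes 0..3.
    let head = drain_prefix(&mut s, 3);
    println!("{:?} {:?}", head, s);
}
```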

As another data point, "on a char boundary" occurs in the "Panics" section of 6 different methods for String (which includes Deref<Target = str>), but none of these methods' names are very explicit about this constraint.
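For reference, a minimal sketch of what the constraint looks like in practice, using the existing str::is_char_boundary and the non-panicking str::get. The helper `prefix` is hypothetical.

```rust
/// Returns the first `end` bytes of `s` as a &str, or None if `end` is out of
/// range or falls inside a multi-byte character. `prefix` is a made-up helper.
fn prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1 and 2
    println!("{}", s.is_char_boundary(1)); // true
    println!("{}", s.is_char_boundary(2)); // false
    println!("{:?}", prefix(s, 1)); // Some("h")
    println!("{:?}", prefix(s, 2)); // None
    // `&s[..2]` would panic at run time instead of returning None.
}
```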

In my mind, this is mainly len() and str's Index implementation, which IMO make it easy to write code that works okay on ASCII text but will easily fail when working with non-ASCII UTF-8. I wonder if we would be able to prevent subtle issues like this by making it more explicit (i.e., requiring redirection through as_bytes()). For <str as Index>, we could add a more explicit str::slice_utf8(Range<usize>) method instead.
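As a rough sketch of what I have in mind, here is slice_utf8 written as an extension trait so it can be tried out today. The trait name `SliceUtf8` is an assumption of this sketch, and I kept the panicking behavior of the current Index implementation; returning an Option would be another reasonable design.

```rust
use std::ops::Range;

/// Hypothetical extension trait standing in for the proposed method.
trait SliceUtf8 {
    /// Slices by byte offsets. Panics if either end is not on a char boundary.
    fn slice_utf8(&self, range: Range<usize>) -> &str;
}

impl SliceUtf8 for str {
    fn slice_utf8(&self, range: Range<usize>) -> &str {
        // Delegates to the existing Index implementation.
        &self[range]
    }
}

fn main() {
    let s = "héllo";
    // The byte length is spelled out via as_bytes() instead of s.len().
    println!("{}", s.as_bytes().len()); // 6
    println!("{}", s.slice_utf8(0..3)); // hé
}
```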

Of course this kind of change would have a sizable ecosystem cost. We would certainly want to do soft deprecations first and/or gate the deprecation on an edition (I think this is not a solved problem, but could conceivably be solved if deemed useful?).

This is mostly intended for beginners, though even as a fairly experienced Rustacean I've found that slicing UTF-8 based on byte counts still trips me up sometimes. I'm curious to hear what others think.

The String#Representation section states that

A String is made up of three components: a pointer to some bytes, a length, and a capacity.

Redirecting to as_bytes() to get its length is not great either.

I think part of the problem is the name of String. As steveklabnik has often said, by rights it should be called StringBuf (to match PathBuf). It's a buffer containing a string. In that context len makes more sense. It's the readable length of the buffer, not the length of the string.
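A minimal sketch of the buffer view, using only existing std methods. `buffer_stats` is a made-up helper name.

```rust
/// Returns (readable length in bytes, allocated capacity in bytes).
/// `buffer_stats` is a hypothetical helper for illustration.
fn buffer_stats(buf: &String) -> (usize, usize) {
    (buf.len(), buf.capacity())
}

fn main() {
    let mut buf = String::with_capacity(16);
    buf.push_str("añb"); // 3 chars, 4 bytes
    let (len, cap) = buffer_stats(&buf);
    // len is the number of initialized bytes; cap is at least what was requested.
    println!("len = {}, capacity >= 16: {}", len, cap >= 16);
}
```

Read this way, len() sits next to capacity() as a property of the buffer, not as a count of letters.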

But it's too late to change now. Perhaps byte_len would be a viable alternative if as_bytes().len() is undesirable?

Feedback on Twitter comparing Swift:

It's interesting that count() returns grapheme clusters in Swift rather than bytes.

It's an interesting idea and I see where you're coming from. If we could do it all over again, removing or renaming len on strings might have been the right choice. (I note there is also an OsStr::len, which was added in Rust 1.9.) But I think the fact that String is essentially "just a Vec<u8> with a guarantee of valid UTF-8" is really part of its DNA. This is manifest in the APIs available on strings and in method behaviors. Notably, the as_bytes method, presenting itself as a zero-cost view into the underlying bytes, I think really drives that point home. And as you mentioned, the panicking behavior of various string methods when offsets aren't on UTF-8 boundaries. In that light, I think things like len and slicing make sense, because I think in order to effectively use strings in Rust, it's somewhat important to internalize the concept that it is just UTF-8 encoded bytes.
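To illustrate the "Vec<u8> with a guarantee" point, a short sketch using String::from_utf8 and into_bytes, both of which exist in std today. `round_trip` is a made-up helper.

```rust
use std::string::FromUtf8Error;

/// Validates `bytes` as UTF-8 and hands back the very same buffer as a String.
/// `round_trip` is a hypothetical helper for illustration.
fn round_trip(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

fn main() {
    // 0x68 is 'h'; 0xC3 0xA9 is the UTF-8 encoding of 'é'.
    let s = round_trip(vec![0x68, 0xC3, 0xA9]).unwrap();
    println!("{}", s); // hé
    println!("{:?}", s.as_bytes()); // a view of the same three bytes
    // A lone lead byte is not valid UTF-8, so the guarantee rejects it.
    println!("{}", round_trip(vec![0xC3]).is_err()); // true
}
```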

I think there are perhaps some higher ideals that might argue that such a view isn't necessary, and I might agree with them or would at least find it to be a very reasonable argument. But I think we have to evaluate the deprecation proposal in light of the current API and behaviors.

I recognize that my argumentation above is somewhat weak, and somewhat amounts to stare decisis. While I feel somewhat less strongly about the name or existence of len, I do find string slicing to be amazingly convenient and natural. I obviously work with search APIs a lot, and in that context, byte offsets show up remarkably often. Having a concise, convenient and correct syntax for using those byte offsets to get a substring is very nice from a "quality of life" perspective.
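A sketch of the kind of code I mean, using str::find, which returns a byte offset that is always on a char boundary. `after_needle` is a made-up helper.

```rust
/// Returns the part of `haystack` that follows the first match of `needle`.
/// `after_needle` is a hypothetical helper for illustration.
fn after_needle<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    // `find` yields the byte offset of the match start.
    let start = haystack.find(needle)?;
    // Adding the needle's byte length lands on the boundary right after the match.
    Some(&haystack[start + needle.len()..])
}

fn main() {
    println!("{:?}", after_needle("naïve café: ok", "café")); // Some(": ok")
    println!("{:?}", after_needle("naïve café: ok", "tea")); // None
}
```

The offsets here never need to be translated into chars or graphemes, even though the text is not ASCII.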

With that said, I think there is another argument against this sort of deprecation that you already mention: ecosystem cost. Not only would the churn here be monumental (although we could minimize that somewhat through automation and the edition process), but the amount of code it would make obsolete (or "unidiomatic" I suppose) is startling. String slicing and len calls are really really common in my experience, and there's a lot of code and tutorials out there that use those things. Beginners are likely to stumble into those and then get big deprecation warnings, which isn't a great experience IMO. Because now they have to not only deal with the cognitive burden of understanding the operations themselves, but also have to deal with mapping that to a new name. I admit it's not a huge leap, but we are discussing beginner papercuts here, so it seems fair to me.

I think for me personally, I see the cost of this sort of deprecation as very high, and would really like more compelling evidence that this is a serious stumbling block for beginners. Beginners will cut themselves on all sorts of things, and I agree that having too many of those cuts is not good and we should try to minimize them where we can. But when minimizing them involves costly things, I think we should be more discriminating in determining just how big of a papercut it is. Unfortunately, I don't quite know how to measure that in a low cost way. :-/

The question then becomes: how do you communicate the difference between the buffer (a Vec<u8> containing valid UTF-8) and the idealized view of a Unicode string (where the encoding is "merely" an implementation detail)? This seems to be an issue beginners often trip over and even experienced users can sometimes use the wrong thing.

I don't think this is purely an issue of documentation. There must be some way for the API itself to help communicate the distinction?

It'd be nice if the API itself can communicate that. However, I think the poster of that thread on Reddit tried to grasp the language by trying to write code without first reading at least the book. Luckily (s)he was gracious enough to explicitly state that the problem is not with the language, but due to a misunderstanding on his/her part. I suggested a cover-to-cover read of the book and to not skip anything. I also referred to chapter 8 for String and an explanation of its inner workings.

Sure. And if it were only one person I'd agree it would be enough to say "read the book". But this confusion comes up time and again. It's worth exploring how Rust's APIs can help, no?

Agreed. I just don't see an immediate solution at the moment. The best I can do right now is to refer people to the relevant chapter in the book.

If the API can reflect that, it'd be even better.

Would it be useful to have a newtype around String (maybe Text?), which can be trivially converted back and forth, but which exposes an API more focused on characters or graphemes rather than byte manipulation?

Of course this could be a library, but if the purpose is to help beginners, it might be useful to have it in std and refer to it in the book.
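Something like the following is what I'm picturing, as a very rough sketch. The name `Text` and the methods `char_count` and `nth_char` are all hypothetical, and a real design would probably want grapheme-aware methods too, which std cannot provide on its own.

```rust
/// Hypothetical newtype that hides byte-oriented operations.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Text(String);

impl From<String> for Text {
    fn from(s: String) -> Self {
        Text(s)
    }
}

impl From<Text> for String {
    fn from(t: Text) -> Self {
        t.0
    }
}

impl Text {
    /// Number of Unicode scalar values. O(n) in the byte length.
    fn char_count(&self) -> usize {
        self.0.chars().count()
    }

    /// The n-th scalar value, if any. Also O(n).
    fn nth_char(&self, n: usize) -> Option<char> {
        self.0.chars().nth(n)
    }
}

fn main() {
    let t = Text::from(String::from("héllo"));
    println!("{} {:?}", t.char_count(), t.nth_char(1)); // 5 Some('é')
    let back: String = t.into(); // trivial conversion back
    println!("{}", back);
}
```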

If we had different types: Bytes, CodePoints, GraphemeClusters and Glyphs returned by the various .bytes_length()/.character_length()/…, it would be possible to have indexing using slices, and it would prevent a lot of possible mistakes (compared to using usize everywhere). Compilation errors should help beginners avoid falling into those traps.

Using strong typing would change the signature of the indexing operation, but would not change the semantics or syntax of indexing in currently valid and sensible code. Broken code (where you do arithmetic on the value returned by len() before using it for indexing) would no longer compile, since Add<usize> and Sub<usize> would not be implemented for Bytes.

// Pseudocode for the proposed API: len() returns the hypothetical `Bytes` type.
let hello = String::from("Hello world");
let h = hello[0]; // does not compile
let h = hello[Bytes::new(0)]; // valid and explicit

let w = hello[String::from("Hello ").len()]; // compiles and works correctly

let nb_letters /* badly named, it's the number of bytes */ = hello.len();
let d = hello[nb_letters - 2]; // does not compile, `Sub<usize>` is not implemented for `Bytes`
let d = hello[nb_letters - Bytes::new(2)]; // valid

let space = hello[String::from("Hello").len() + 1]; // does not compile
let space = hello[String::from("Hello").len() + Bytes::new(1)]; // valid

The verbosity of arithmetic operations should be a red flag in code reviews, since the user probably wants to access a given character and not a given byte.
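For anyone who wants to play with the idea, here is a sketch that compiles on current Rust. It is only an approximation: the real len() cannot be changed from user code, so a free function `byte_len` stands in for it, and the `Bytes` newtype and its indexing behavior (returning the raw byte) are assumptions of this sketch.

```rust
use std::ops::{Add, Index, Sub};

/// Hypothetical newtype for a byte offset or byte length.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Bytes(usize);

// Arithmetic is only defined between two `Bytes` values, never with `usize`.
impl Add for Bytes {
    type Output = Bytes;
    fn add(self, rhs: Bytes) -> Bytes {
        Bytes(self.0 + rhs.0)
    }
}

impl Sub for Bytes {
    type Output = Bytes;
    fn sub(self, rhs: Bytes) -> Bytes {
        Bytes(self.0 - rhs.0)
    }
}

// Indexing a `str` with `Bytes` yields the raw byte at that offset.
impl Index<Bytes> for str {
    type Output = u8;
    fn index(&self, at: Bytes) -> &u8 {
        &self.as_bytes()[at.0]
    }
}

/// Stand-in for a `len()` that returns the typed length.
fn byte_len(s: &str) -> Bytes {
    Bytes(s.len())
}

fn main() {
    let hello = "Hello world";
    let h = hello[Bytes(0)];
    let w = hello[byte_len("Hello ")];
    let d = hello[byte_len(hello) - Bytes(1)];
    // `hello[byte_len(hello) - 1]` is rejected: `Sub<usize>` is not implemented for `Bytes`.
    println!("{} {} {}", h as char, w as char, d as char);
}
```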

While we are at it, the core string API should consider two more things:

  • Resource-constrained embedded targets might be too limited to use Unicode and use some kind of 1-byte encoding instead.
  • A general-purpose text API should be able to work with arbitrary non-Unicode encodings.

I strongly disagree with adding alternative encodings. It would make it so much harder to handle text correctly. UTF-8 should be the only option provided natively by Rust. If UTF-8 doesn't work for you, you can build your own string types.

If you're general-purpose enough to work on an arbitrary encoding, then you really are just exposing slice operations. If you don't know the encoding at all, there is nothing smarter your API can do than treat it as a byte slice.

The real world has plenty of non-Unicode data. Ignoring it is not wise. Forcing people to write their own types is a recipe for fragmentation of the ecosystem, which would be quite unpleasant.

I could see the advantage of having a standard buffer API (either a generic struct or a trait). This could also make it more obvious when you're handling the bits of a type vs. a more abstract view of the type (e.g. Unicode chars).

But I think non-unicode encodings are better handled outside of the standard library. The std is already quite large and it's not the easiest to contribute to.

I don't think I follow this suggestion. Maybe an example would help?

I'd encourage this kind of experimentation in a new crate to get a feel for how it works in practice. :slight_smile:

This is why I wrote bstr. Very few people are using it besides myself, so it's not clear to me that something like that belongs in std. Besides, there are also crates for doing transcoding very quickly, which works quite well when handling different encodings!

Also, it's hard for me to moderate this thread and participate in it at the same time, but I think this thread is starting to veer way off topic. This thread is proposing a targeted deprecation and it has now sprawled into discussing supporting multiple encodings in std. I don't know where the right line is, but I'd say that if you want to seriously propose that std support more encodings, then please create a new thread. :slight_smile:

I keep wondering: what is it that people actually need code point, grapheme cluster or glyph indexing for? Even computing the length of the string in these units seems rather pointless. (The length in bytes/code units is at least useful for checking whether a string will fit within a specific amount of storage.)

Never mind that indexing by glyph is impossible to do without access to the actual font rendering engine (that takes into account ligatures), and the others require O(n) time. (At least when the encoding is UTF-8.)
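To show where the O(n) comes from, a minimal sketch using str::char_indices. `nth_char_offset` is a made-up helper.

```rust
/// Finds the byte offset at which the n-th scalar value starts.
/// This has to walk the string from the front, so it is O(n).
/// `nth_char_offset` is a hypothetical helper for illustration.
fn nth_char_offset(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    let s = "héllo";
    // Char 2 is the first 'l', which starts at byte 3 because 'é' takes two bytes.
    println!("{:?}", nth_char_offset(s, 2)); // Some(3)
    // Indexing the bytes directly is O(1), but it addresses bytes, not chars.
    println!("{}", s.as_bytes()[3] as char); // l
}
```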

Does bstr not still distinguish UTF-8 as the preferred string encoding? I get the impression that @permeakra intended something actually encoding-agnostic here.

Specific non-Unicode encodings themselves probably don't, though it depends on the eventual intended audience. But I will argue that the core API should be encoding-agnostic. In an ideal world, encoding-dependent parts like case conversion, string normalization, character enumeration and character classification should be provided via swappable objects with a common interface. Core could provide an object for ASCII only, std could provide Unicode variants, and users could write their own implementations if needed.

That said, I think this thread needs people working on i18n and l10n and text processing applications like firefox/servo, libreoffice and the like.

This.

You mean those are newtypes around usize that still (internally) contain an index in terms of bytes, right? (If not, this would be horribly inefficient.) Then keep in mind that you could still apply a CodePoints index from one String to a different one, or to the same one after it’s modified. But I suppose it might not have been your goal to make the API 100% panic-free but just to catch some “obvious” mistakes. The question then is whether the hassle of all these new types is worth the perhaps rather small gain in safety.

Confusion also arises when the types, and the fact that there are multiple “length” methods, suggest that one is actually counting code points, grapheme clusters, etc. Thus to really “master” String and use it correctly, you would still need to understand that String contains a byte vector and that indexing internally uses byte indices for everything, even if the API looks like this is not the case at all.
