Wild idea: deprecating APIs that conflate str and [u8]

Rust has chosen an explicit string encoding, and limits itself to relatively low-level primitive operations. To me that's not a flaw to be fixed, merely one particular design with its own trade-offs.

Other languages try to abstract away their internal string encoding (often it's even variable at runtime), and choose to have higher-level grapheme-based APIs (or are stuck with clumsy codepoint == char APIs). "Clever" strings are nice, but come at a cost.

Rust here is different, but I don't think it's wrong. Once you know the encoding is UTF-8, and that Rust's std deliberately avoids complex algorithms like counting grapheme clusters, it all makes sense.

Maybe len() should have been byte_len() or utf8_len(), but that wouldn't be a massive improvement. Human writing systems are complex, and that makes Unicode complex in many ways. At least Rust currently makes it obvious that it doesn't handle Unicode semantics in full fidelity; instead it is optimized for parsing and copying string bytes.

9 Likes

The documentation for len() is very explicit about this:

Returns the length of this String, in bytes, not chars or graphemes. In other words, it may not be what a human considers the length of the string.

It even includes an example of how to compute the number of chars instead:

let fancy_f = String::from("ƒoo");       // 'ƒ' (U+0192) takes two bytes in UTF-8
assert_eq!(fancy_f.len(), 4);            // length in bytes
assert_eq!(fancy_f.chars().count(), 3);  // length in chars (code points)

It is logical that .len() operates on bytes, because anything else would have to be computed expensively.

However, it's weird that drain() returns an Iterator<Item = char> and not just a struct that implements Deref<Target = str>, because the latter would be both more powerful and more explicit: .drain() would have to be replaced with .drain().chars() (or .drain().bytes() or .drain().graphemes(true), if you want to iterate over bytes or grapheme clusters instead).
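
A minimal sketch of what such a Deref-based drain type could look like (DrainStr and its field are hypothetical names, not std API; a real version would also hold the String and range so it could remove the substring on Drop):

use std::ops::Deref;

// Hypothetical drain guard: derefs to the drained substring, so callers
// explicitly choose .chars(), .bytes(), or .graphemes(true).
struct DrainStr<'a> {
    drained: &'a str, // the substring being removed
}

impl<'a> Deref for DrainStr<'a> {
    type Target = str;
    fn deref(&self) -> &str {
        self.drained
    }
}

fn count_both(d: DrainStr<'_>) -> (usize, usize) {
    (d.chars().count(), d.bytes().count()) // explicit iteration granularity
}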

4 Likes

Only Bytes would be a newtype around usize (this is why it's the only type that should support indexing). CodePoints, GraphemeClusters and Glyphs would be used for operations that take a linear amount of time, like counting the number of characters in a word, finding the average number of letters per word in a text, or creating a compressed font that contains all (and only) the glyphs used in a given text.


If you want to index chars or grapheme clusters in constant time, you can use

type Chars = Vec<char>;
type GraphemeClusters<'a> = Vec<&'a str>;

Otherwise, I'm not sure how these types would be helpful. You can already iterate over bytes and chars with .bytes() and .chars(). When using the unicode-segmentation crate, you can iterate over graphemes with .graphemes(true). And glyphs can't be counted without knowing the font in which the text is rendered AFAIK.
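
For reference, here is a small runnable comparison of the three granularities (assuming the unicode-segmentation crate as a dependency):

// assumes unicode-segmentation = "1" in Cargo.toml
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{301}"; // 'e' plus a combining acute accent: displayed as one "é"
    assert_eq!(s.bytes().count(), 3);         // U+0301 takes 2 bytes in UTF-8
    assert_eq!(s.chars().count(), 2);         // two code points
    assert_eq!(s.graphemes(true).count(), 1); // one grapheme cluster
}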

EDIT: I think I misunderstood you; is this what you want to be able to do?

let idx: Chars = "12345".char_len();
let s: &str = &"12345678"[Chars::new(0)..idx];

The main problem with this is that it looks like a cheap operation, but has linear runtime.

1 Like

Okay, I see. But you could still re-use a Bytes index from a different string or after modification. And you'd also want to be able to add Bytes to other Bytes, because there are valid use cases for that. That makes it sound a lot like usize to me. Also, usize is kind of a type for indexing bytes by nature, isn't it?

Also, only providing the GraphemeClusters type etc. for length operations seems a bit overkill IMO. Even providing a function for these kinds of lengths feels unnecessary, because I find it far more convincing to have to access a linear-time length operation through an iterator anyway.

2 Likes

String's len is well documented and makes sense. But it is clearly something that can surprise people coming from other languages' standard libraries.

byte_len is clearer, explicitly declaring an important distinction. I'd definitely prefer to have had this instead of len.

First we'd have to establish some agreement on the above. But then the question is: is this improvement worth the cost of getting there?

2 Likes

I agree byte_len or len_bytes would have been better. However, if those methods were added now and len were marked as deprecated, just looking at the method names one might correctly read len_bytes as the length in bytes but mistakenly assume len is the length in characters. Which would make things worse than they are now.

For operations like len, which can only be O(1) when defined in terms of bytes, it makes sense to leave the signatures as-is, and to design and document them as clearly as possible so as to convey the correct mental model.

There are a few other methods that might deserve attention, though:

  • push and pop - These could be named push_char etc., but since the type signature is unambiguous, this does not seem like a big deal.
  • drain - This could probably be made more flexible as mentioned above.

The only other functions that don't "fit", because they are O(n) and their names/signatures don't make the byte-based semantics clear, are:

  • insert / insert_str
  • remove
  • replace_range
  • split_off

All of these methods work similarly. However, there may be a way to make them all go away altogether.

If it were possible to have a trait AssignSlice which allowed a slice to appear on the left-hand side of an assignment, these methods could be deprecated and replaced with more advanced slicing. For example:

    // Instead of replace_range:
    let name = "Rust";
    let mut greeting = String::from("Hello world");
    greeting[6..] = name;
    // Instead of insert_str:
    let mut greeting = String::from("Hello .");
    greeting[6..6] = name;
    // Instead of remove:
    let mut greeting = String::from("Hello world");
    greeting[5..] = "";

Obviously this pattern leans on the fact that the slice indices of a string are byte offsets.
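
For comparison, all three of those operations can already be written today with the real replace_range method (byte offsets, which must fall on char boundaries):

fn main() {
    let mut greeting = String::from("Hello world");
    greeting.replace_range(6.., "Rust"); // replace the tail
    assert_eq!(greeting, "Hello Rust");

    let mut greeting = String::from("Hello .");
    greeting.replace_range(6..6, "Rust"); // insert at byte 6
    assert_eq!(greeting, "Hello Rust.");

    let mut greeting = String::from("Hello world");
    greeting.replace_range(5.., ""); // remove the tail
    assert_eq!(greeting, "Hello");
}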

1 Like

So to summarise, Rust's String is a UTF-8 byte buffer. Except when it's treated as a more abstract Unicode string. This distinction isn't always as clear as it could be. However:

  • This is all explained in the book/docs
  • Changing anything now would introduce too much churn

I think this isn't exactly a great situation, because this is an issue that continually comes up with new users (and sometimes even trips up more experienced users). But it is what it is. If it's too late to change anything, then there's nothing that can be done to address this, at least in terms of Rust's APIs.

It's exactly what I want to do. I would just not implement Index<Chars>, so as not to make it look like a cheap operation.

What I want to prevent is something like this.

let text = String::from("Noël");
let idx: usize = text.find('ë').unwrap(); // assuming String::find(letter) returns the byte index of the first match in the string
let idx_next_letter: usize = idx + 1; // badly named! and it falls into the middle of a code point, since 'ë' doesn't fit in a single byte
let should_be_l: char = text[idx_next_letter]; // runtime crash

Whereas with strong typing:

let text = String::from("Noël");

let idx: Byte = text.find('ë').unwrap();
let next_letter: _ = idx + 1; // compile error, Add<{integer}> isn't implemented for Byte
let idx_next_letter: Byte = idx + Byte::new(1); // badly named, easy to spot in code-review
let should_be_l: char = text[idx_next_letter]; // runtime crash, but the line above would make it more obvious than the current situation

struct GraphemeOffset {
    byte: Byte, // a valid offset into the string; ideally it would track the lifetime of the string itself, or even hold a shared reference to it (in order to error if the string is modified)
    offset: Grapheme, // number of graphemes that we need to (linearly) consume
}
struct Grapheme {
    count: usize, // a number of graphemes
}
let idx: Byte = text.find('ë').unwrap();
let idx_next_letter: GraphemeOffset = idx + Grapheme::new(1);
let should_be_l: _ = text[idx_next_letter]; // compile error: Index<GraphemeOffset> is not implemented for String, since it would look like a cheap operation
let this_is_l: char = text.get_letter(idx_next_letter); // compiles, no runtime error, yeah!

// note: this code is equivalent to the snippet above
let idx_next_letter: Byte = text.get_offset(idx + Grapheme::new(1));
let this_is_l: char = text[idx_next_letter];

Even without type annotations, it should be easy to see in code review what is going on. I added them just to make it easy to understand how I intend strong types to be used.
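
For what it's worth, std already lets you step past a found char safely by deriving the next boundary from the match itself; this sketch uses only real APIs:

fn main() {
    let text = String::from("Noël");
    let idx = text.find('ë').unwrap();   // byte index of the match
    let idx_next = idx + 'ë'.len_utf8(); // first byte after 'ë': a valid char boundary
    let should_be_l = text[idx_next..].chars().next().unwrap();
    assert_eq!(should_be_l, 'l');
}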


I just want to highlight that this is how I would have done it if I had to do it from scratch. I didn't check what the migration cost of implementing such a solution would be. It should be considered only as a thought experiment.

1 Like

To be fair, most other languages also don't return the count of code points/grapheme clusters/glyphs from their string length getter. Instead, they just return the length of the underlying u16 slice, since they store strings in UTF-16 encoding underneath. Should we adopt this behavior?

Python 3 indexes strings by code point, so the length always matches the number of code points.

Should we adopt this behavior?

Returning the length in UTF-16 code units makes no sense for Rust, since it is

  • Undesired in 99.99% of cases
  • Expensive to compute for UTF-8 strings
    • Changing the representation of String to UTF-16 is not possible
    • Besides, why would we want to do that? UTF-16 is the worst of all worlds. It is not very memory efficient, it isn't compatible with ASCII, and characters aren't fixed width either
2 Likes

It can be more memory efficient than UTF-8 when dealing with e.g. text in East Asian languages, where the majority of characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16.
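
For example, three CJK characters take nine bytes in UTF-8 but six in UTF-16 (counting code units with std's encode_utf16):

fn main() {
    let s = "日本語"; // three CJK characters
    assert_eq!(s.len(), 9); // 3 bytes each in UTF-8
    assert_eq!(s.encode_utf16().count() * 2, 6); // 2 bytes each in UTF-16
}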

Depending on what you consider a “character”, they aren’t fixed width in UTF-32 either. Sure, code points are, but when handling Unicode properly, you quickly have to deal with combining characters / diacritical marks, too.

If you construct an artificial example, yes, but:

  • most interchange text has markup as well, which typically falls in the ASCII range, and
  • if you're at the scale where this is actually meaningfully noticeable, you should probably be using a streaming general-purpose compression algorithm, which will definitely do better (and roughly equally well on UTF-8 and UTF-16).

If anyone here hasn't already, I'd suggest reading the UTF-8 everywhere manifesto, which makes a great argument for why UTF-8 should be the encoding of text without specific (typically, legacy) design constraints.

The Chinese translation of this manifesto takes 58.8 KiB in UTF-16, and only 51.7 KiB in UTF-8.

In overview:

  • In most cases, text is treated as a mostly-opaque blob that you read from one location and pass along to another API.
  • In the cases where it isn't completely opaque, command characters typically fall into the ASCII range.
  • Because (transitively) codepoint ≠ grapheme ≠ glyph ≠ user-perceived character ≠ user-perceived character (in a different locale), there is no such thing as a constant-time text algorithm.
8 Likes

Honestly I think encoding is beside the point. Rust uses UTF-8. That's fine. What's not so great is the API and the terminology it uses.

Sometimes a string is a UTF-8 byte buffer. Sometimes a string is a Unicode string (in the abstract sense). Which is which depends on the context. And this distinction is often not very clear to new users, despite heroic efforts by the book and documentation.

1 Like

Folks, this thread isn't about the relative merits of utf8 vs utf16.

9 Likes

Naive question: what practical difference is there between the two?

1 Like

I'd say it is both: It is a Unicode string, represented as a byte buffer. Its methods operate on char when the operation is about single elements of the string:

  • push, pop
  • insert, remove, retain

Most other methods operate on byte ranges. I think this is the correct and most efficient thing to do. The only exception is drain, which returns an Iterator<Item = char> but shouldn't IMO.
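
To illustrate with std's actual methods: push/pop speak char only, while insert, remove and drain take byte positions but still produce or consume chars:

fn main() {
    let mut s = String::from("héllo");
    s.push('!');                     // char-based: append one char
    assert_eq!(s.pop(), Some('!'));  // char-based: remove the last char
    s.insert(1, 'à');                // byte index (must be a char boundary), char payload
    assert_eq!(s.remove(1), 'à');    // removes the char starting at byte 1
    let drained: String = s.drain(1..3).collect(); // byte range in, chars out
    assert_eq!(drained, "é");        // 'é' occupies bytes 1..3
}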

Would you perhaps elaborate what drain should do instead in your opinion? Or should it not exist at all?

@steffahn I mentioned it in this comment (sorry I was too lazy to link it).

2 Likes

I imagine APIs which provide substring indexing on Unicode strings provide codepoint-based indices if they're meant to be used across languages and libraries which may not be consistent in their preferred byte encoding. To provide a concrete example, the Twitter API provides codepoint indices when referring to the position of hashtags or URLs in a tweet. In egg-mode, I convert these into byte-based indices when providing the strings back to Rust code. I do this because Rust uses byte-based indexing, and I didn't want users of my library to try to slice a string with codepoint indices.
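
A sketch of that kind of conversion (codepoint_to_byte_index is a hypothetical helper for illustration, not egg-mode's actual code):

fn codepoint_to_byte_index(s: &str, cp_idx: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len())) // allow an index one past the last char
        .nth(cp_idx)
}

fn main() {
    let tweet = "héllo #rust";
    // the codepoint index of '#' is 6, but its byte index is 7, because 'é' takes 2 bytes
    assert_eq!(codepoint_to_byte_index(tweet, 6), Some(7));
    assert_eq!(&tweet[7..], "#rust");
}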

2 Likes