Rust has chosen an explicit string encoding, and limits itself to relatively low-level primitive operations. To me that's not a flaw to be fixed, merely one particular design with its own trade-offs.
Other languages try to abstract away their internal string encoding (often it's even variable at runtime), and choose to offer higher-level grapheme-based APIs (or are stuck with clumsy codepoint==char APIs). "Clever" strings are nice, but come at a cost.
Rust here is different, but I don't think it's wrong. Once you know it's UTF-8, and Rust's std avoids complex algorithms like counting grapheme clusters, it all makes sense.
Maybe len() should have been byte_len() or utf8_len(), but it wouldn't be a massive improvement. Human writing systems are complex, and that makes Unicode complex in many ways. At least Rust currently makes it obvious that it doesn't handle Unicode semantics in full fidelity; instead, it is optimized for parsing and copying string bytes.
Returns the length of this String, in bytes, not chars or graphemes. In other words, it may not be what a human considers the length of the string.
It even includes an example of how to compute the number of chars instead:
let fancy_f = String::from("ƒoo");
assert_eq!(fancy_f.len(), 4);
assert_eq!(fancy_f.chars().count(), 3);
It is logical that .len() operates on bytes, because anything else would have to be computed expensively.
However, it's weird that drain() returns an Iterator<Item = char> and not just a struct that implements Deref<Target = str>, because that would be both more powerful and more explicit: .drain() would have to be replaced with .drain().chars() (or .drain().bytes() or .drain().graphemes(true), if you want to iterate over bytes or grapheme clusters instead).
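For reference, this is what drain looks like today: it takes a byte range (which must lie on char boundaries) and yields the removed chars, so getting the removed text back as a str requires collecting it into a String.
let mut s = String::from("Noël!");
// drain removes the byte range 2..5 ("ël") and yields it one char at a time.
let removed: String = s.drain(2..5).collect();
assert_eq!(removed, "ël");
assert_eq!(s, "No!");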
Only Bytes would be a newtype around usize (this is why it's the only type that should support indexing). CodePoints, GraphemeClusters and Glyphs would be used for operations that take a linear amount of time, like counting the number of characters in a word, finding the average number of letters per word in a text, creating a compressed font that contains all (and only) the glyphs used in a given text, …
If you want to index chars or grapheme clusters in constant time, you can use
type Chars = Vec<char>;
type GraphemeClusters<'a> = Vec<&'a str>;
Otherwise, I'm not sure how these types would be helpful. You can already iterate over bytes and chars with .bytes() and .chars(). When using the unicode-segmentation crate, you can iterate over graphemes with .graphemes(true). And glyphs can't be counted without knowing the font in which the text is rendered AFAIK.
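A small illustration of the three kinds of counts; this assumes the unicode-segmentation crate is available for the grapheme count, the rest is std:
use unicode_segmentation::UnicodeSegmentation;

// "é" written as 'e' followed by a combining acute accent (U+0301).
let s = "Cafe\u{301}";
assert_eq!(s.len(), 6);                   // bytes
assert_eq!(s.chars().count(), 5);         // code points
assert_eq!(s.graphemes(true).count(), 4); // grapheme clusters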
EDIT: I think I misunderstood you; is this what you want to be able to do?
let idx: Chars = "12345".char_len();
let s: &str = &"12345678"[Chars::new(0)..idx];
The main problem with this is that it looks like a cheap operation, but has linear runtime.
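To make the cost visible: a hypothetical "12345678"[Chars::new(0)..idx] would have to scan the string to translate char positions into byte offsets, roughly like this sketch using std's char_indices:
// Walk the string char by char to find the byte boundary of char index 4.
let s = "Noël, Noël";
let byte_end = s.char_indices().nth(4).map(|(i, _)| i).unwrap_or(s.len());
assert_eq!(&s[..byte_end], "Noël"); // O(n) in the number of chars skipped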
Okay, I see. But you could still re-use a Bytes index from a different string or after modification. And you'd also want to be able to add Bytes to other Bytes, because there are valid use cases for that. That makes it sound a lot like usize to me. Also, usize is kind of a type for indexing bytes by nature, isn't it?
Also, only providing the GraphemeClusters type etc. for length operations seems a bit overkill IMO. Even providing a function for these kinds of lengths feels unnecessary, because I find it far more convincing to have to access a linear-time length operation through an iterator anyway.
I agree byte_len or len_bytes would have been better. However, if those methods were added now and len were marked as deprecated, then just from looking at the method names one might correctly assume len_bytes is the length in bytes but mistakenly conclude that len is the length in characters. That would make things worse than they are now.
For operations like len, which are only meaningful in terms of bytes if they are to be O(1), it makes sense to leave the signatures as-is, and to design and document them as clearly as possible so as to convey the correct mental model.
There are a few other methods that might deserve attention though:
push and pop - These could be named push_char etc., but since the type signature is unambiguous, this does not seem like a big deal.
drain - This could probably be made more flexible as mentioned above.
The only other functions that don't "fit" (because they are O(n) and their names/signatures don't make that clear) are:
insert / insert_str
remove
replace_range
split_off
All of these methods work similarly. However, there may be a way to make all of them go away altogether.
If it were possible to have a trait AssignSlice which allowed a slice to appear on the left-hand side of an assignment, these methods could be deprecated and replaced with more advanced slicing. For example:
// instead of replace_range
let mut greeting = String::from("Hello world");
greeting[6..] = name;

// instead of insert_str
let mut greeting = String::from("Hello .");
greeting[6..6] = name;

// instead of remove
let mut greeting = String::from("Hello world");
greeting[5..] = "";
Obviously this pattern leans into the fact that the slice indexes of a string are byte offsets.
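For comparison, the same edits with today's std methods look roughly like this; they also take byte offsets and panic if an offset does not fall on a char boundary. (The AssignSlice trait above is purely hypothetical.)
let name = "Rust";
// instead of greeting[6..] = name
let mut greeting = String::from("Hello world");
greeting.replace_range(6.., name);
assert_eq!(greeting, "Hello Rust");
// instead of greeting[6..6] = name
let mut greeting = String::from("Hello .");
greeting.insert_str(6, name);
assert_eq!(greeting, "Hello Rust.");
// instead of greeting[5..] = ""
let mut greeting = String::from("Hello world");
greeting.replace_range(5.., "");
assert_eq!(greeting, "Hello");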
So to summarise, Rust's String is a UTF-8 byte buffer. Except when it's treated as a more abstract Unicode string. This distinction isn't always as clear as it could be. However:
This is all explained in the book/docs
Changing anything now would introduce too much churn
I think this isn't exactly a great situation, because this is an issue that continually comes up with new users (and sometimes even trips up more experienced users). But it is what it is. If it's too late to change anything, then there's nothing that can be done to address this, at least in terms of Rust's APIs.
It's exactly what I want to do. I would just not implement Index<Chars>, so that it doesn't look like a cheap operation.
What I want to prevent is something like this.
let text = String::from("Noël");
let idx: usize = text.find('ë').unwrap(); // str::find returns the byte index of the first match in the string
let idx_next_letter: usize = idx + 1; // badly named! and it would fall into the middle of a codepoint, since 'ë' doesn't fit in a single byte
let should_be_l: char = text[idx_next_letter]; // runtime crash
Whereas with strong typing:
let text = String::from("Noël");
let idx: Byte = text.find('ë').unwrap();
let next_letter: _ = idx + 1; // compile error, Add<{integer}> isn't implemented for Byte
let idx_next_letter: Byte = idx + Byte::new(1); // badly named, easy to spot in code-review
let should_be_l: char = text[idx_next_letter]; // runtime crash, but the line above would make it more obvious than the current situation
struct GraphemeOffset {
    byte: Byte,       // a valid offset into the string; ideally it would track the lifetime of the string, or even hold a shared reference to it (in order to error if the string is modified)
    offset: Grapheme, // the number of graphemes that we need to (linearly) consume from there
}
struct Grapheme {
    count: usize, // a number of graphemes
}
let idx: Byte = text.find('ë').unwrap();
let idx_next_letter: GraphemeOffset = idx + Grapheme::new(1);
let should_be_l: _ = text[idx_next_letter]; // compile error: Index<GraphemeOffset> is not implemented for String, since it would look like a cheap operation.
let this_is_l: char = text.get_letter(idx_next_letter); // compiles, no runtime error, yay!
// note: this code is equivalent to the snippet above
let idx_next_letter: Byte = text.get_offset(idx + Grapheme::new(1));
let this_is_l: char = text[idx_next_letter];
Even without type annotations, it should be easy to see in code review what is going on. I added them just to make it easier to understand how I intend strong types to be used.
I just want to highlight that this is how I would have done it if I had to do it from scratch. I didn't check what the cost of migrating to such a solution would be. It should be considered only as a thought experiment.
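To make the thought experiment a bit more concrete, the Byte newtype used above could look roughly like this (hypothetical, not std; only Byte + Byte is implemented, so byte arithmetic stays explicit):
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Byte(usize);

impl Byte {
    fn new(n: usize) -> Self { Byte(n) }
}

// Only Byte + Byte is allowed; `idx + 1` with a bare integer is a compile error.
impl std::ops::Add for Byte {
    type Output = Byte;
    fn add(self, rhs: Byte) -> Byte { Byte(self.0 + rhs.0) }
}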
To be fair, most other languages also don't return the count of code points/grapheme clusters/glyphs from their string length getter. Instead, they just return the length of the underlying u16 slice, since they store strings in UTF-16 underneath. Should we adopt this behavior?
Python 3 uses UTF-32, so the length always matches the number of code points.
Should we adopt this behavior?
Returning the length in UTF-16 code units makes no sense for Rust:
It is undesired in 99.99% of cases
It is expensive to compute for UTF-8 strings
And changing the representation of String to UTF-16 is not possible
Besides, why would we want to do that? UTF-16 is the worst of all worlds: it is not very memory efficient, it isn't compatible with ASCII, and characters aren't fixed width either.
It can be more memory efficient than UTF-8 when dealing with e.g. text in east Asian languages, where the majority of characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16.
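A quick way to see that in Rust itself (std only; encode_utf16 counts UTF-16 code units):
let s = "日本語"; // three CJK characters
assert_eq!(s.len(), 9);                      // 3 bytes each in UTF-8
assert_eq!(s.encode_utf16().count() * 2, 6); // 2 bytes each in UTF-16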
Depending on what you consider a "character", they aren't fixed width in UTF-32 either. Sure, codepoints are, but when handling Unicode properly, you quickly have to deal with combining characters / diacritical marks, too.
most interchange text has markup as well, which typically falls in the ASCII range, and
if you're at the scale where this is actually meaningfully noticeable, you should probably be using a streaming general-purpose compression algorithm, which will definitely do better (and roughly equally well on UTF-8 and UTF-16).
If anyone here hasn't already, I'd suggest reading the UTF-8 everywhere manifesto, which makes a great argument for why UTF-8 should be the encoding of text without specific (typically, legacy) design constraints.
The Chinese translation of this manifesto takes 58.8 KiB in UTF-16, and only 51.7 KiB in UTF-8.
In overview:
In most cases, text is treated as a mostly-opaque blob that you read from one location and pass along to another API.
In the cases where it isn't completely opaque, command characters typically fall into the ASCII range.
Because (transitively) codepoint ≠ grapheme ≠ glyph ≠ user-perceived character ≠ user-perceived character (in a different locale), there is no such thing as a constant-time text algorithm.
Honestly I think encoding is beside the point. Rust uses UTF-8. That's fine. What's not so great is the API and the terminology it uses.
Sometimes a string is a UTF-8 byte buffer. Sometimes a string is a Unicode string (in the abstract sense). Which is which depends on the context. And this distinction is often not very clear to new users, despite heroic efforts by the book and documentation.
I'd say it is both: It is a Unicode string, represented as a byte buffer. Its methods operate on char when the operation is about single elements of the string:
push, pop
insert, remove, retain
Most other methods operate on byte ranges. I think this is the correct and most efficient thing to do. The only exception is drain, which returns an Iterator<Item = char> but shouldn't IMO.
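For example, with today's API the indices are byte offsets while the values are chars:
let mut s = String::from("Nol");
s.insert(2, 'ë');               // byte offset 2, single char value
assert_eq!(s, "Noël");
s.push('!');
assert_eq!(s.pop(), Some('!')); // pop yields a char, not a byte
assert_eq!(s.remove(2), 'ë');   // removes the whole char starting at byte 2
assert_eq!(s, "Nol");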
I imagine APIs which provide substring indexing on Unicode strings provide codepoint-based indices if they're meant to be used across languages and libraries which may not be consistent in their preferred byte encoding. To provide a concrete example, the Twitter API provides codepoint indices when referring to the position of hashtags or URLs in a tweet. In egg-mode, I convert these into byte-based indices when providing the strings back to Rust code. I do this because Rust uses byte-based indexing, and I didn't want users of my library to try to slice a string with codepoint indices.
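A sketch of that kind of conversion (not egg-mode's actual code; char_indices walks the string, so it's O(n)):
// Translate a codepoint index, as an external API might report it,
// into a byte index that can be used to slice a Rust &str.
fn codepoint_to_byte_index(s: &str, cp_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()))
        .nth(cp_index)
}

let tweet = "Noël #rust";
let byte_idx = codepoint_to_byte_index(tweet, 5).unwrap();
assert_eq!(&tweet[byte_idx..], "#rust");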