Implement Index<usize> for String and &str

Hello. Many languages offer an elegant way to get a char from a String, but Rust has nothing as concise as Python and others do. There are probably good reasons why this is not in Rust, but the current form is very inconvenient to use:

 string.chars().nth(2)

While with Index or a get function:

 string[2]

Or:

 string.get(2)

I think String and &str should have the same methods and implementations as Vec: get, unsafe get_unchecked, and Index<usize>.

Could someone explain why this is not already in Rust's std, or write an RFC that adds it? I hope I don't sound too dumb. Thanks for reading; ccgauche.


This is because Rust uses Unicode, and in Unicode encodings getting the n-th "character" in constant time is not possible. Some languages ignore the problem and simply give incorrect results when you ask for the n-th character: you get the n-th byte or the n-th code point, which may be a fragment of a character (e.g. in the case of combining characters or complex emoji) or multiple characters (e.g. ligatures). Actually Rust is a bit guilty of that too, because Rust's char is a Unicode Scalar Value, which isn't a "character".

The mere definition of what is a "character" is complex and messy for some writing systems. Getting "characters" correctly requires scanning the whole string from the beginning (in both UTF-8 and UTF-16), and a fair amount of computation and lookup tables (if you want to get "grapheme cluster", which is closest to what Unicode defines as a "character").

It's not a simple and cheap operation, so the interface also isn't simple to remind you about the cost, and nudge you towards other solutions. Use .as_bytes() if you don't care about non-ASCII characters. Use iterators to iterate over all code points. Use Unicode libraries to get proper human-recognizable characters.
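
A minimal sketch of the first two options, using only std. The function names are made up for illustration. Grapheme clusters need an external Unicode crate and are not shown here.

```rust
/// ASCII-only lookup: treats the n-th byte as a character.
/// Returns None for non-ASCII input, where bytes and characters differ.
fn ascii_char_at(s: &str, n: usize) -> Option<char> {
    if !s.is_ascii() {
        return None;
    }
    s.as_bytes().get(n).map(|&b| b as char)
}

/// Iterates over all code points (Unicode scalar values) of the string.
/// A single user-perceived character may produce several of them.
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For example, an "e" followed by a combining acute accent (U+0301) displays as one character but yields two code points.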


I understand this position! But currently, when using as_bytes().get() or chars().nth(), we encounter the same type of problem, so I don't see why it would be a bad idea to make that simpler, even if for special characters you would not get the entire character. If someone wants full Unicode handling, they should use a specific crate, not std.

If string[2] is equivalent to string.chars().nth(2), then string[2] would use char offsets while string[2..3] would use byte offsets. Putting aside everything else, that seems like a showstopping inconsistency to me.
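
To make the mismatch concrete, here is a small sketch with two hypothetical helpers, one per kind of offset. It uses str::get rather than the [] syntax so that a bad byte range returns None instead of panicking.

```rust
/// Char-offset lookup: counts code points from the start of the string.
fn by_char_offset(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Byte-offset lookup: returns the one-byte substring at `i`, or None when
/// `i..i + 1` does not fall on UTF-8 character boundaries or is out of range.
fn by_byte_range(s: &str, i: usize) -> Option<&str> {
    s.get(i..i + 1)
}
```

With the string "héllo" (where é is the two-byte U+00E9), offset 2 means the letter l when counting chars, but the second byte of é when counting bytes, so the byte-range lookup at 2 fails.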

We disagree. I don't see std reversing course on this.

The bottom line is that string.chars().nth(i) is a very uncommon operation. If you find yourself using it frequently, it's plausible that you're doing something incorrect. It's a good thing that std makes this operation verbose. There are very few scenarios in which string.chars().nth(i) is what you really want. So on that basis alone, it doesn't make sense to promote it to a convenient syntax.


It's not possible to write an Index<usize> impl for Rust's built-in string types that has type Output = char. The index method would have to return &char, but there is nowhere to store the char this reference points to.
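
A sketch of why the trait shape matters. ByteIndexed is a hypothetical wrapper, used here only because the impl cannot be written for str itself outside std. Returning a byte works because the byte already exists in the string's memory.

```rust
use std::ops::Index;

/// Hypothetical newtype around a string slice, for illustration only.
struct ByteIndexed<'a>(&'a str);

impl<'a> Index<usize> for ByteIndexed<'a> {
    type Output = u8;

    // `index` must return a reference to data that already exists.
    // The encoded bytes are stored in the string, so `&u8` is fine.
    // Panics when `i` is out of range, like slice indexing.
    fn index(&self, i: usize) -> &u8 {
        &self.0.as_bytes()[i]
    }
}

// With `type Output = char`, `index` would have to return `&char`.
// A `char` is a 4-byte value produced by decoding; the string holds only
// the 1 to 4 encoded bytes, so there is no stored `char` to borrow.
```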


I was going to say this counter-reasoning might be a good suggestion for inclusion in the docs, but it's already there. The docs could perhaps give more detail as to the reasoning, or call out .nth explicitly, but they are presently to the point and give enough keywords that an interested party can probably search out more details if they want to.

I think Index<usize, Output=u8> is implementable for String, since a String is a Vec<u8> internally. I don't see where the problem is.

While I agree with everything kornel said in their post, here's a slightly different take or phrasing:

You have to write string.chars().nth(2) not because we deliberately made it long and annoying to type, but because it's a reflection of what the computer actually has to do to find the nth character: It has to iterate over every character, counting them along the way until it gets to the nth character.
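
Spelled out by hand, that work looks roughly like the following sketch. The function name is made up, and this is not how std implements nth; it just shows the same linear walk.

```rust
/// Finds the n-th code point by decoding from the start and counting.
/// Cost grows with `n`, unlike indexing into a Vec.
fn nth_char_manual(s: &str, n: usize) -> Option<char> {
    let mut seen = 0;
    for c in s.chars() {
        if seen == n {
            return Some(c);
        }
        seen += 1;
    }
    // Ran out of characters before reaching position `n`.
    None
}
```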


But you're asking for string[2] to be the same as string.chars().nth(2).unwrap(), which returns a char.

I think it should be mentioned that you can already write byte string literals such as b"Hello world!", which gives you a &'static [u8; 12] that coerces to &[u8] and lets you index into the bytes. If you only need ASCII, using this is fine. You can also get ASCII character literals by adding a b before the literal, like this: b'.' or b'!'. So non-Unicode strings are already fairly well supported by std.
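
A short sketch of what that looks like in practice. GREETING and the helper names are illustrative, not from std.

```rust
/// A byte string literal; the array reference coerces to a byte slice.
const GREETING: &[u8] = b"Hello world!";

/// Returns the byte at `i`, if in range. Plain `bytes[i]` also works
/// but panics when `i` is out of range.
fn byte_at(bytes: &[u8], i: usize) -> Option<u8> {
    bytes.get(i).copied()
}

/// Counts occurrences of a byte literal such as `b'.'`.
fn count_byte(bytes: &[u8], target: u8) -> usize {
    bytes.iter().filter(|&&b| b == target).count()
}
```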


There is some support, but I wouldn't call them well supported. You can't even do substring search, for example. This is why I wrote the bstr crate.
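
To illustrate the gap: with std alone, a substring search over bytes has to be written by hand, for instance with the naive sketch below. find_bytes is a made-up name, and this is not how bstr implements its search.

```rust
/// Naive substring search over bytes: returns the offset of the first match.
/// Worst-case cost is proportional to haystack length times needle length.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` would panic, so handle the empty needle explicitly.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```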


As someone who supports non-unicode filenames and works with non-unicode data, I feel this can't be emphasized enough. Thank you so much for bstr! I hope more robust byte string and OsString support comes to std some day.


Sorry, but that's a terrible idea for a language in 2020 of which almost the entire ecosystem is (rightly) already committed to Unicode. We live in a global world. The times of assuming ASCII in text are over. And so should be the times of writing sloppy and incorrect code for the sake of marginal developer "convenience".

Please, just learn how to use Unicode properly. It will be better for everyone.


I would like to contest this somewhat. At least three people, including myself, reported running into this limitation today while working through Advent of Code (1, 2, 3).

This was all indeed on the same problem, but this kind of operation doesn't seem uncommon for problems like these. We already have a notion of a string length expressed in number of characters, and a way to iterate over those characters, and even an Index impl which can return a single character. In fact the str docs mention the word "index" 40 times. It definitely felt surprising to me that there wasn't a more apparent interface to get the char at a given index from a string.

These indexing arguments would be a lot more compelling if the provided examples were not strings of solely ASCII characters. Getting the n'th next lexeme in a string requires traversal from a starting point in the underlying slice that is known to be at the boundary between two lexemes, rather than a point within a single multi-byte lexeme. Likewise for getting the n'th next character.

UTF-8 characters are not simply single bytes! Comparing to programming languages that do not support the non-Roman written languages of the world simply misses the point.


You might be able to get away with saying it's a common operation in puzzles like AoC, but I don't think AoC puzzles (and the like) are that representative of code in general. I say this as someone who completed AoC in 2018. It was fun partially because I got to exercise muscles and techniques that are otherwise rare. And it was also fun because I could make some very very strong assumptions about my input that you often can't make outside of puzzle environments.

Overall, I stand by what I said. Indexing by codepoint is an exceptionally rare thing.


I don't understand why Index<Range> is implemented but we can't add a simple Index impl that returns the u8 at a given position (as a Vec<u8> would do).

You can return the u8 at a given index from a &[u8] slice. However, this thread is about indexing &str slices of encoded UTF-8, not slices of bytes. Since each encoded UTF-8 character occupies 1 to 4 consecutive bytes, returning potentially only part of a multi-byte character by simplistic indexing of bytes is not supported directly.

Note that you can always safely cast the string to a byte slice if, for some reason not based on the invalid presumption that all text worldwide is ASCII, you actually do want to index by bytes and not characters.
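
A minimal sketch of that conversion, with illustrative helper names. It also shows how to check whether a byte offset lands on a character boundary, and how many bytes each code point occupies.

```rust
/// Safe byte-level lookup; `as_bytes` borrows the string without copying.
fn byte_at_index(s: &str, i: usize) -> Option<u8> {
    s.as_bytes().get(i).copied()
}

/// Reports whether byte offset `i` starts a character
/// (offset 0 and the end of the string also count as boundaries).
fn starts_char(s: &str, i: usize) -> bool {
    s.is_char_boundary(i)
}

/// Encoded length in bytes of each code point in the string.
fn encoded_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For a string made of "a", "é", "€", and an emoji such as U+1F600, the encoded lengths are 1, 2, 3, and 4 bytes respectively, so byte index 2 falls in the middle of "é".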


I understand that, but str already uses byte indices for range indexing, so why can't we implement the same thing for a plain index, which would behave like [x..=x]? Furthermore, Rust's std supports Unicode poorly, which is why dedicated Unicode crates exist.

I think ccgauche's point is that it's a little weird (and sometimes annoying) that s[0..1] compiles but s[0] does not (s has type &str throughout this comment). That said, I can understand why some people might push back on this since s[0..1] yields &str (and checks for UTF-8 validity) but s[0] would have to yield u8 (and ignore UTF-8 rules), which some may see as an undesirable difference in behavior. I don't think I share those concerns, but I also don't really mind just using .as_bytes() when I want to index a string.
