Implement Index<usize> for String and &str

Hello. Many languages offer an elegant way to get a char from a String. Rust has no equivalent of what Python and others provide. There are probably good reasons why this is not in Rust, but the current way is very inconvenient to use:

 string.chars().nth(2)

While with Index or a get function:

 string[2]

Or:

 string.get(2)

I think the same methods and implementations as Vec should be in String and &str: get, unsafe get_unchecked, Index<usize>

Could someone explain to me why this is not already in Rust's std, or write an RFC that adds it to std? I hope I don't sound too dumb. Thanks for reading; ccgauche.


This is because Rust uses Unicode, and in Unicode encodings getting the n-th "character" in constant time is not possible. Some languages ignore the problem and simply give incorrect results when you ask for the n-th character: you'll get the n-th byte or the n-th code point, which may be a fragment of a character (e.g. in the case of combining characters or complex emoji) or multiple characters (e.g. ligatures). Actually Rust is a bit guilty of that too, because Rust's char is a Unicode Scalar Value, which isn't a "character".

The mere definition of what is a "character" is complex and messy for some writing systems. Getting "characters" correctly requires scanning the whole string from the beginning (in both UTF-8 and UTF-16), and a fair amount of computation and lookup tables (if you want to get "grapheme cluster", which is closest to what Unicode defines as a "character").

It's not a simple and cheap operation, so the interface isn't simple either, to remind you about the cost and nudge you towards other solutions. Use .as_bytes() if you don't care about non-ASCII characters. Use iterators to iterate over all code points. Use Unicode libraries to get proper human-recognizable characters.
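To make the first two options concrete, here is a minimal sketch using only std. The helper names (lengths, ascii_at, scalar_values) are made up for illustration. Grapheme clusters are left out because std has no API for them; that part needs an external Unicode crate.

```rust
/// Byte length and Unicode scalar value count of a string.
/// The two differ as soon as the text is not pure ASCII.
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Constant-time access for callers who only care about ASCII:
/// index the raw bytes and reject anything outside the ASCII range.
fn ascii_at(s: &str, i: usize) -> Option<char> {
    s.as_bytes()
        .get(i)
        .filter(|b| b.is_ascii())
        .map(|&b| b as char)
}

/// Walk the string once and collect every scalar value (Rust's char).
fn scalar_values(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For "e\u{301}" (an e followed by a combining acute accent), lengths reports 3 bytes and 2 chars, even though a reader sees a single character.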


I understand this position! But currently, as_bytes().get() and chars().nth() run into the same kind of problem, so I don't see why it is a bad idea to make this simpler, even if for special characters you will not get the entire correct character. If someone wants full Unicode handling, they should use a dedicated crate, not std.

If string[2] is equivalent to string.chars().nth(2), then string[2] would use char offsets while string[2..3] would use byte offsets. Putting aside everything else, that seems like a showstopping inconsistency to me.
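A small sketch of that inconsistency, with two hypothetical helpers: range slicing on str counts bytes, while chars().nth() counts scalar values, so the same number 2 points at different places once the text is not ASCII.

```rust
/// Range access on str uses byte offsets. `get` returns None when an
/// offset falls inside a multi-byte character (where `s[a..b]` would panic).
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// The proposed meaning of `s[n]`: the n-th scalar value, counted from 0.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}
```

With s = "h\u{e9}llo" (the é takes two bytes), nth_char(s, 2) is Some('l'), but byte_slice(s, 2, 3) is None because byte 2 is the middle of é.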

We disagree. I don't see std reversing course on this.

The bottom line is that string.chars().nth(i) is a very uncommon operation. If you find yourself using it frequently, it's plausible that you're doing something incorrect. It's a good thing that std makes this operation verbose. There are very few scenarios in which string.chars().nth(i) is what you really want. So on that basis alone, it doesn't make sense to promote it to a convenient syntax.


It's not possible to write an Index<usize> impl for Rust's built-in string types that has type Output = char. The index method would have to return &char, but there is nowhere to store the char this reference points to.
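To illustrate the constraint: Index::index must return a reference into something that already exists. The sketch below uses a hypothetical wrapper type, ByteView (not part of std), with Output = u8. That works only because the bytes are stored in the string; a char would have to be decoded on the fly and has no such storage.

```rust
use std::ops::Index;

/// Hypothetical wrapper, used only to show what `Index` requires.
struct ByteView<'a>(&'a str);

impl<'a> Index<usize> for ByteView<'a> {
    type Output = u8;

    fn index(&self, i: usize) -> &u8 {
        // Fine: the byte lives inside the string's buffer,
        // so there is something for the reference to point at.
        &self.0.as_bytes()[i]
    }
}

// With `type Output = char`, the body would have to look like
//     &self.0.chars().nth(i).unwrap()
// which is rejected by the compiler: the decoded char is a temporary
// that is dropped when `index` returns.
```

Usage would look like ByteView("abc")[1], which yields b'b'.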


I was going to say this counter-reasoning might be a good suggestion for inclusion in the docs, but it's already there. The docs could perhaps give more detail as to the reasoning, or call out .nth explicitly, but they are presently to the point and give enough keywords that an interested party can probably search out more details if they want to.

I think Index<usize, Output=u8> is implementable for String, since a String is a Vec<u8> internally. I don't see where the problem is?

While I agree with everything kornel said in their post, here's a slightly different take or phrasing:

You have to write string.chars().nth(2) not because we deliberately made it long and annoying to type, but because it's a reflection of what the computer actually has to do to find the nth character: It has to iterate over every character, counting them along the way until it gets to the nth character.
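Spelled out by hand, that work looks roughly like the sketch below. The function name is made up, and it also returns the byte offset to show how far into the buffer the walk had to go.

```rust
/// Find the n-th scalar value by decoding from the start of the string.
/// Returns the byte offset where it begins together with the char.
fn nth_char_manual(s: &str, n: usize) -> Option<(usize, char)> {
    // `char_indices` decodes one UTF-8 sequence per step; there is no way
    // to jump ahead, because earlier characters may be 1 to 4 bytes long.
    for (count, (byte_offset, c)) in s.char_indices().enumerate() {
        if count == n {
            return Some((byte_offset, c));
        }
    }
    None // the string has n or fewer chars
}
```

For "h\u{e9}llo", asking for n = 2 returns (3, 'l'): the third char starts at byte 3, not byte 2.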


But you're asking for string[2] to be the same as string.chars().nth(2).unwrap(), which returns a char.

I think it should be mentioned that you can already write byte string literals such as b"Hello world!", which gives you a &'static [u8; N] (usable as a &[u8]) that you can index into byte by byte. If you want to use ASCII, using this is fine. You can also get ASCII character literals by adding a b before the literal, like this: b'.' or b'!'. So non-Unicode strings are already fairly well supported by std.
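A short sketch of those literals in use. GREETING and the two helper functions are names invented for this example.

```rust
/// A byte string literal; the array reference coerces to a byte slice.
const GREETING: &[u8] = b"Hello world!";

/// Checked indexing into bytes; constant time, no decoding involved.
fn byte_at(bytes: &[u8], i: usize) -> Option<u8> {
    bytes.get(i).copied()
}

/// Compare against a byte character literal such as b'!'.
fn ends_with_bang(bytes: &[u8]) -> bool {
    bytes.last() == Some(&b'!')
}
```

Here GREETING[0] is b'H' (a u8, not a char), and byte_at(GREETING, 4) is Some(b'o').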


There is some support, but I wouldn't call them well supported. You can't even do substring search, for example. This is why I wrote the bstr crate.
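For a sense of what has to be hand-rolled on plain byte slices, here is a naive substring search built from slice methods. It is only an illustrative sketch with a made-up function name, quadratic in the worst case, and not a description of how bstr implements its search.

```rust
/// Naive search for `needle` inside `haystack`; returns the byte offset
/// of the first match. Works on arbitrary bytes, valid UTF-8 or not.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, and an empty needle matches at 0.
        return Some(0);
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}
```

For example, find_bytes(b"Hello world!", b"world") returns Some(6), and the same call works on input that contains invalid UTF-8 such as b"\xff\xfeabc".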


As someone who supports non-unicode filenames and works with non-unicode data, I feel this can't be emphasized enough. Thank you so much for bstr! I hope more robust byte string and OsString support comes to std some day.


Sorry, but that's a terrible idea for a language in 2020 of which almost the entire ecosystem is (rightly) already committed to Unicode. We live in a global world. The times of assuming ASCII in text are over. And so should be the times of writing sloppy and incorrect code for the sake of marginal developer "convenience".

Please, just learn how to use Unicode properly. It will be better for everyone.
