Implement Index<usize> for String and &str

I don't think that's at all obvious, or even correct. Indexing a single element of a string whose characters may span multiple bytes should return a single character (char), not a variable-length portion of that character's underlying binary representation.

2 Likes

I believe they were referring to @jschievink's observation that returning a &char is not possible.

That challenge can be solved very nicely using the ascii crate, which does provide indexing.

Only open this if you want to see a solution to that Advent of Code challenge.
static INPUT: &str = /* ... */;
use regex::Regex;
use ascii::AsAsciiStr;
fn main() {
    let ex = Regex::new(r"(?m)^(?P<a>\d+)-(?P<b>\d+) (?P<ch>.): (?P<st>.*)$").unwrap();
    let n = ex.captures_iter(INPUT).filter(|c| {
        let ch = c["ch"].chars().next().unwrap();
        let a: usize = c["a"].parse().unwrap();
        let b: usize = c["b"].parse().unwrap();
        // PART 1 solution: (uncomment this and comment out PART 2 below)
        // let n = c["st"].chars().filter(|&c| c == ch).count();
        // a <= n && n <= b
        // PART 2 solution:
        let st = c["st"].as_ascii_str().unwrap();
        (st[a-1] == ch) ^ (st[b-1] == ch)
    }).count();
    println!("{}", n);
}

Edit: Of course there’s also the option of working with u8s directly; the regex crate even offers an ASCII-only mode that can work on &[u8], e.g.

static INPUT: &[u8] = /* ... */;
use regex::bytes::Regex;
use std::str;
fn main() {
    let ex = Regex::new(r"(?m-u)^(?P<a>\d+)-(?P<b>\d+) (?P<ch>.): (?P<st>.*)$").unwrap();
    let n = ex.captures_iter(INPUT).filter(|c| {
        let ch = c["ch"][0];
        let a: usize = str::from_utf8(&c["a"]).unwrap().parse().unwrap();
        let b: usize = str::from_utf8(&c["b"]).unwrap().parse().unwrap();
        // PART 1 solution: (uncomment this and comment out PART 2 below)
        // let n = c["st"].iter().filter(|&&c| c == ch).count();
        // a <= n && n <= b
        // PART 2 solution:
        let st = &c["st"];
        (st[a-1] == ch) ^ (st[b-1] == ch)
    }).count();
    println!("{}", n);
}
5 Likes

I'm afraid you missed the fact that &str is Unicode (UTF-8) by design, and all of std builds upon this fact. This is not "poor support" by any reasonable measure.

There will always be use cases that someone needs but that should not be promoted to the standard library. Furthermore, std is maintained conservatively because of backwards compatibility (once something is added, it stays there forever). Since package management is easy in Rust, it is very often the case that even core functionality gets prototyped in external crates.

3 Likes

The only implementation of Index<usize> for str I can imagine is one that returns a str representation of the codepoint beginning at that byte, panicking if it is not the beginning of a codepoint. That would be analogous to the implementation of Index<Range<usize>> we already have.
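
For concreteness, here's a rough sketch of those semantics as a free function (this is not in std; index_codepoint is a made-up name):

fn index_codepoint(s: &str, i: usize) -> &str {
    // Mirror today's slice indexing: panic unless `i` is on a char boundary.
    assert!(s.is_char_boundary(i), "byte index {} is not a char boundary", i);
    let len = s[i..].chars().next().expect("index out of bounds").len_utf8();
    &s[i..i + len]
}

fn main() {
    let s = "a🔥b";
    assert_eq!(index_codepoint(s, 1), "🔥"); // same bytes as &s[1..5]
    // index_codepoint(s, 2) would panic: byte 2 is inside the 🔥 codepoint
}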

I'm not convinced that's a great idea, but I'm personally not as convinced it's as terrible as other people are (I wouldn't describe it as "wrong", for example). If we limited the question to that specific impl, I think we could have a more interesting conversation about the pros and cons than this one, which stops at memes about how Unicode is hard, how this would be an O(N) operation, how every other language that has indexable strings is wrong, etc.

8 Likes

Agreed that having

let s = "a🔥b";
// now s[1] ≡ s[1..5]

makes a lot of sense. It's certainly a little novel. It may make a bunch of range for-loops over (0..s.len()) seem to work until they don't, but maybe that can be linted for?

It shouldn't be, though. The debug version maybe, but if the string is valid UTF-8, then the length of the codepoint is available from its first byte.
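
For reference, the length really is encoded in the leading byte; a minimal sketch (the helper name is made up, and it assumes the input is valid UTF-8):

fn utf8_len_from_leading_byte(byte: u8) -> usize {
    match byte {
        0x00..=0x7F => 1, // 0xxxxxxx: ASCII
        0xC0..=0xDF => 2, // 110xxxxx
        0xE0..=0xEF => 3, // 1110xxxx
        0xF0..=0xF7 => 4, // 11110xxx
        _ => unreachable!("not a leading byte of valid UTF-8"),
    }
}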

Yeah, I'm referring to the previous arguments against an nth-char definition.

To add further to this idea: instead of panicking if the byte at the given index isn't a valid "Start Byte", back up (or move forward) to the first available "Start Byte" and return that code point, panicking only if no "Start Byte" can be found within the maximum allowable length (in bytes) of a Unicode code point (and/or no valid code point can be decoded there). You could also offer a non-panicking version of the Index trait (which would require additional syntax) such that:

let c = some_str[x]; // gives the valid sequence starting at the xth byte, or panics
let c_and_next_and_prev = some_str[x!]; // gives the valid sequence that spans the byte pointed to, as well as the previous and next code points' starting byte positions

There could also be versions that do not panic, such as:

let c_and_next_and_prev = some_str[x!?]; // gives the sequence spanning the pointed-to byte, as well as the previous and next code points' starting byte positions, or error values if any of the 3 are invalid sequences

These would all be traits that could be implemented for indexing into things that are effectively byte arrays but where there is a notion of valid "Start Positions" and valid "Sequences". This API could support things other than Unicode (for example, SNMP/ASN.1). I believe (correct me if I'm wrong) that the [index_val!] and [index_val!?] syntax could both be supported backwards-compatibly.

The return values would be something like:

s[index_val] -> TypeOfElement (or panic) // index_val must be an index of a valid start byte and the following bytes must be a well-formed sequence
s[index_val!] -> (usize, TypeOfElement, usize) // panics if index_val points within a malformed sequence, or if the bytes following or preceding the well-formed pointed-to sequence are themselves malformed
s[index_val!?] -> (Result<usize, Error<&[u8]>>, Result<TypeOfElement, Error<&[u8]>>, Result<usize, Error<&[u8]>>) // never panics on invalid sequences

This would be an extension of the existing Index<T> trait, adding the rule that the given usize index may return a value that starts at a different position, provided the index points within the span of a valid sequence for the indexed type. This would be backwards compatible as far as I can discern.

Also, two new traits (and corresponding syntax) would be added:

// corresponding to s[index_val!] syntax
trait IndexWithBoundaries< ... > {
    ...
}

// corresponding to s[index_val!?] syntax
trait TryIndexWithBoundaries< ... > {
   ...
}

It might be preferable to just have both versions in the same trait, with a default implementation of the non-try version in terms of the try version.
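
A rough sketch of that merged shape, with the per-slot Results collapsed into a single Result for brevity (all names here are invented, not part of any actual proposal):

trait IndexWithBoundaries<Idx> {
    type Output: ?Sized;
    type Error;

    // s[index_val!?]: never panics on invalid sequences.
    fn try_index_with_boundaries(&self, index: Idx)
        -> Result<(usize, &Self::Output, usize), Self::Error>;

    // s[index_val!]: defaults to panicking on the error path.
    fn index_with_boundaries(&self, index: Idx) -> (usize, &Self::Output, usize) {
        self.try_index_with_boundaries(index)
            .unwrap_or_else(|_| panic!("index points into an invalid sequence"))
    }
}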

str requires valid UTF-8. That panic shouldn't be possible. If it is, the program already has UB and the str API shouldn't try to make any promises to programs invoking UB.

3 Likes

If you slice &s[i..] from the middle of a codepoint it panics, so I think it would be weird if &s[i] did otherwise. I think &s[i..] and &s[i] should always have the same pointer, and further &s[i + 1] should be offset by 1 or not at all (panic).

8 Likes

It wouldn't inherently already have UB, but it would violate a safety invariant. Thus the library -- the str API -- would be allowed to create UB without breaking any promises. (But the panic would be kinder.)

1 Like

I agree and I don't like the indexing idea either.

It makes more sense to me to define functions on [u8] that would find the next and previous codepoint when given an index. Such a function should actually return something like Option<(codepoint, index, length)>. I can imagine using such a function in, e.g., a text file viewer that displays a part of a file chosen by the user, for instance by dragging a scroll bar, and that therefore needs to find the boundary from which it can start decoding the string (yes, I know, it might still not be the right thing to do considering graphemes).
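
Something along those lines is straightforward to sketch on top of std::str::from_utf8 (the name and exact return shape below are made up):

fn next_codepoint(bytes: &[u8], start: usize) -> Option<(char, usize, usize)> {
    // Skip UTF-8 continuation bytes (0b10xxxxxx) until a possible leading byte.
    let i = (start..bytes.len()).find(|&i| (bytes[i] & 0b1100_0000) != 0b1000_0000)?;
    // A codepoint is at most 4 bytes long; decode the valid UTF-8 prefix of that window.
    let window = &bytes[i..bytes.len().min(i + 4)];
    let valid = match std::str::from_utf8(window) {
        Ok(s) => s,
        // The window may cut a following codepoint in half; keep the valid prefix.
        Err(e) => std::str::from_utf8(&window[..e.valid_up_to()]).unwrap(),
    };
    let c = valid.chars().next()?; // None: the bytes at `i` are not valid UTF-8
    Some((c, i, c.len_utf8()))
}

Finding the previous codepoint would be the same kind of scan, just walking backwards from the given index.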

However, I'm not sure whether such a function should appear in std, or how useful it would be after all.

2 Likes

There's str::is_char_boundary, but maybe we could also add u8 methods like is_utf8_leading and is_utf8_continuation, and then you can use normal iterator methods to search for it.
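
Those two checks are trivial to write as free functions today, for what it's worth (the names just mirror the hypothetical methods above, and "leading" here assumes the bytes are valid UTF-8):

fn is_utf8_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

fn is_utf8_leading(b: u8) -> bool {
    !is_utf8_continuation(b) // only correct for valid UTF-8; 0xF8..=0xFF are neither
}

fn main() {
    let bytes = "a🔥b".as_bytes();
    // Find the first char boundary at or after byte index 2 with ordinary iterator methods.
    let boundary = (2..bytes.len()).find(|&i| is_utf8_leading(bytes[i]));
    assert_eq!(boundary, Some(5)); // 🔥 occupies bytes 1..5
}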

7 Likes

This makes me wonder: would a hypothetical Rust without a char type at all be feasible? That is, like Python, have only a string type, which isn't composed of atoms.

So, functions like chars become

impl str {
  fn codepoints(&self) -> impl Iterator<Item=&str> + '_;
}

I came to the conclusion that, if you work with strings, you should treat string slices as the atomic unit you work with.
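
That iterator can already be expressed on stable Rust in terms of char_indices; a sketch as a free function, since we can't add inherent methods to str:

fn codepoints<'a>(s: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    // Each item is the sub-slice covering exactly one codepoint.
    s.char_indices().map(move |(i, c)| &s[i..i + c.len_utf8()])
}

fn main() {
    let pieces: Vec<&str> = codepoints("a🔥b").collect();
    assert_eq!(pieces, ["a", "🔥", "b"]);
}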

5 Likes

Although it is beyond my experience and expertise, I guess that char could be useful for low-level routines, e.g., for segmentation, and probably for easier interoperability. From this point of view I suppose it is good to have a suitable type, and compiler support for it seems convenient.

However, I understand your point. The mere existence of such a type makes it possible for users to get it wrong or abuse it. Anyway, I think your idea (allow me to rephrase it) of avoiding char whenever possible and preferring &str instead is something that should be emphasized more in the Rust Book and elsewhere.

Searching for a single character is faster than searching for a string. I believe clippy even has a lint for this.

That's Clippy Lints
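
For a concrete example of what that lint nudges you towards (same result either way; the char pattern just gives the search more to specialize on):

fn main() {
    let haystack = "a long haystack with an x near the end";
    let with_str = haystack.find("x"); // single-character &str pattern
    let with_char = haystack.find('x'); // char pattern, preferred
    assert_eq!(with_str, with_char);
}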

This is very similar to strings in Swift, where a String acts as a collection of Characters, but internally a Character is represented as a String. So really a String is a collection of Strings. (This is especially necessary in Swift because their Character type represents an Extended Grapheme Cluster, which has no maximum storage size.)

Hm, I get a 4x difference in performance when searching (long) strings. This feels like a bug in string search to me, to be honest. It looks like the impl doesn't have a memchr fast path for short patterns and just uses the two-way algorithm.

1 Like

This is a nice idea. However, because &str is borrowed, using &str instead of char everywhere could quickly lead to lifetime hell.

Furthermore, I'm not entirely convinced this would actually lead to better code, since char is a more constrained type than &str (it is exactly one code point), leading to more type safety.
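
To illustrate both points, compare the two shapes of "get the first codepoint" (the function names here are made up): the char version hands back an owned, exactly-one-codepoint value, while the &str version keeps a borrow of the input alive.

fn first_char(s: &str) -> Option<char> {
    s.chars().next() // owned value, trivially storable
}

fn first_codepoint(s: &str) -> Option<&str> {
    // Borrowed value: the caller's use of it is tied to the lifetime of `s`.
    s.char_indices().next().map(|(_, c)| &s[..c.len_utf8()])
}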