Implement Index<usize> for String and &str

I don't think that's at all obvious, or even correct. Indexing a single element of a string whose characters may span multiple bytes should return a single character (char), not a variable-length portion of that character's underlying binary representation.

2 Likes

I believe they were referring to @jschievink's observation that returning a &char is not possible.

That challenge can be solved very nicely using the ascii crate, which does provide indexing.

Only open this if you want to see a solution to that Advent of Code challenge.
static INPUT: &str = /* ... */;
use regex::Regex;
use ascii::AsAsciiStr;
fn main() {
    let ex = Regex::new(r"(?m)^(?P<a>\d+)-(?P<b>\d+) (?P<ch>.): (?P<st>.*)$").unwrap();
    let n = ex.captures_iter(INPUT).filter(|c| {
        let ch = c["ch"].chars().next().unwrap();
        let a: usize = c["a"].parse().unwrap();
        let b: usize = c["b"].parse().unwrap();
        // PART 1 solution: (uncomment this and comment out PART 2 below)
        // let n = c["st"].chars().filter(|&c| c == ch).count();
        // a <= n && n <= b
        // PART 2 solution:
        let st = c["st"].as_ascii_str().unwrap();
        (st[a-1] == ch) ^ (st[b-1] == ch)
    }).count();
    println!("{}", n);
}

Edit: Of course there’s also the option of working with u8s directly; the regex crate even offers an ASCII-only mode that can work on &[u8], e.g.

static INPUT: &[u8] = /* ... */;
use regex::bytes::Regex;
use std::str;
fn main() {
    let ex = Regex::new(r"(?m-u)^(?P<a>\d+)-(?P<b>\d+) (?P<ch>.): (?P<st>.*)$").unwrap();
    let n = ex.captures_iter(INPUT).filter(|c| {
        let ch = c["ch"][0];
        let a: usize = str::from_utf8(&c["a"]).unwrap().parse().unwrap();
        let b: usize = str::from_utf8(&c["b"]).unwrap().parse().unwrap();
        // PART 1 solution: (uncomment this and comment out PART 2 below)
        // let n = c["st"].iter().filter(|&&c| c == ch).count();
        // a <= n && n <= b
        // PART 2 solution:
        let st = &c["st"];
        (st[a-1] == ch) ^ (st[b-1] == ch)
    }).count();
    println!("{}", n);
}
5 Likes

I'm afraid you missed the fact that &str is Unicode (UTF-8) by design, and all of std builds upon this fact. This is not "poor support" by any reasonable measure.

There will always be use cases that someone needs but that should not be promoted to the standard library. Furthermore, std is maintained conservatively because of backwards compatibility (once something is added, it stays there forever). Since package management is easy in Rust, it is very often the case that even core functionality gets prototyped in external crates.

3 Likes

The only implementation of Index<usize> for str I can imagine is one that returns a str representation of the codepoint beginning at that byte, panicking if it is not the beginning of a codepoint. That would be analogous to the implementation of Index<Range<usize>> we already have.
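
For concreteness, here's a rough sketch of those semantics as a free function (this is not in std; index_codepoint is a made-up name):

fn index_codepoint(s: &str, i: usize) -> &str {
    // Mirror today's slice indexing: panic unless `i` is on a char boundary.
    assert!(s.is_char_boundary(i), "byte index {} is not a char boundary", i);
    let len = s[i..].chars().next().expect("index out of bounds").len_utf8();
    &s[i..i + len]
}

fn main() {
    let s = "a🔥b";
    assert_eq!(index_codepoint(s, 1), "🔥"); // same bytes as &s[1..5]
    // index_codepoint(s, 2) would panic: byte 2 is inside the 🔥 codepoint
}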

I'm not convinced that's a great idea, but I'm personally not as convinced it's as terrible as other people are (I wouldn't describe it as "wrong", for example). If we limited the question to that specific impl, I think we could have a more interesting conversation about the pros and cons than this one, which stops at memes about how Unicode is hard, how this would be an O(N) operation, how every other language that has indexable strings is wrong, etc.

8 Likes

Agreed that having

let s = "a🔥b";
// now s[1] ≡ s[1..5]

makes a lot of sense. It's certainly a little novel. It may make a bunch of range for-loops over (0..s.len()) seem to work until they don't, but maybe that can be linted for?

It shouldn't be, though. The debug version maybe, but if the string is valid UTF-8, then the length of the codepoint is available from its first byte.
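
For reference, the length really is encoded in the leading byte; a minimal sketch (the helper name is made up, and it assumes the input is valid UTF-8):

fn utf8_len_from_leading_byte(byte: u8) -> usize {
    match byte {
        0x00..=0x7F => 1, // 0xxxxxxx: ASCII
        0xC0..=0xDF => 2, // 110xxxxx
        0xE0..=0xEF => 3, // 1110xxxx
        0xF0..=0xF7 => 4, // 11110xxx
        _ => unreachable!("not a leading byte of valid UTF-8"),
    }
}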

Yeah, I'm referring to the previous arguments against an nth-char definition.

To add further to this idea: instead of panicking if the byte at the given index isn't a valid "Start Byte", back up (or move forward) to the first available "Start Byte" and return that code point, panicking only if no "Start Byte" can be found within the maximum allowable length (in bytes) of a Unicode code point (and/or no valid code point can be decoded there). You could also offer a non-panicking version of the Index trait (which would require additional syntax) such that:

let c = some_str[x]; // gives the valid sequence starting at the xth byte, or panics
let c_and_next_and_prev = some_str[x!]; // gives the valid sequence that spans the byte pointed to, as well as the previous and next code points' starting byte positions

There could also be versions that do not panic, such as:

let c_and_next_and_prev = some_str[x!?]; // gives the sequence spanning the pointed-to byte, as well as the previous and next code points' starting byte positions, or error values if any of the 3 are invalid sequences

These would all be traits that could be implemented for indexing into things that are effectively byte arrays but where there is a notion of valid "Start Positions" and valid "Sequences". This API could support things other than Unicode (for example, SNMP/ASN.1). I believe (correct me if I'm wrong) that the [index_val!] and [index_val!?] syntax could both be supported backwards-compatibly.

The return values would be something like:

s[index_val] -> TypeOfElement (or panic) // index_val must be an index of a valid start byte and the following bytes must be a well-formed sequence
s[index_val!] -> (usize, TypeOfElement, usize) // panics if index_val points within a malformed sequence, or if the bytes following or preceding the well-formed pointed-to sequence are themselves malformed
s[index_val!?] -> (Result<usize, Error<&[u8]>>, Result<TypeOfElement, Error<&[u8]>>, Result<usize, Error<&[u8]>>) // never panics on invalid sequences

This would be an extension of the existing Index<T> trait, adding the rule that the given usize index may return a value that starts at a different position, provided the index points within the span of a valid sequence for the indexed type. This would be backwards compatible as far as I can discern.

Also, two new traits (and corresponding syntax) would be added:

// corresponding to s[index_val!] syntax
trait IndexWithBoundaries< ... > {
    ...
}

// corresponding to s[index_val!?] syntax
trait TryIndexWithBoundaries< ... > {
   ...
}

It might be preferable to just have both versions in the same trait, with a default implementation of the non-try version in terms of the try version.
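
A rough sketch of that merged shape, with the per-slot Results collapsed into a single Result for brevity (all names here are invented, not part of any actual proposal):

trait IndexWithBoundaries<Idx> {
    type Output: ?Sized;
    type Error;

    // s[index_val!?]: never panics on invalid sequences.
    fn try_index_with_boundaries(&self, index: Idx)
        -> Result<(usize, &Self::Output, usize), Self::Error>;

    // s[index_val!]: defaults to panicking on the error path.
    fn index_with_boundaries(&self, index: Idx) -> (usize, &Self::Output, usize) {
        self.try_index_with_boundaries(index)
            .unwrap_or_else(|_| panic!("index points into an invalid sequence"))
    }
}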

str requires valid UTF-8. That panic shouldn't be possible. If it is, the program already has UB and the str API shouldn't try to make any promises to programs invoking UB.

3 Likes

If you slice &s[i..] from the middle of a codepoint it panics, so I think it would be weird if &s[i] did otherwise. I think &s[i..] and &s[i] should always have the same pointer, and further &s[i + 1] should be offset by 1 or not at all (panic).

8 Likes

It wouldn't inherently already have UB, but it would violate a safety invariant. Thus the library -- the str API -- would be allowed to create UB without breaking any promises. (But the panic would be kinder.)

1 Like

I agree and I don't like the indexing idea either.

It makes more sense to me to define functions on [u8] that would find the next and previous codepoint when given an index. Such a function should actually return something like Option<(codepoint, index, length)>. I can imagine using such a function in, e.g., a text file viewer that displays a part of a file chosen by the user, for instance by dragging a scroll bar, and that therefore needs to find the boundary from which it can start decoding the string (yes, I know, it might still not be the right thing to do considering graphemes).
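
Something along those lines is straightforward to sketch on top of std::str::from_utf8 (the name and exact return shape below are made up):

fn next_codepoint(bytes: &[u8], start: usize) -> Option<(char, usize, usize)> {
    // Skip UTF-8 continuation bytes (0b10xxxxxx) until a possible leading byte.
    let i = (start..bytes.len()).find(|&i| (bytes[i] & 0b1100_0000) != 0b1000_0000)?;
    // A codepoint is at most 4 bytes long; decode the valid UTF-8 prefix of that window.
    let window = &bytes[i..bytes.len().min(i + 4)];
    let valid = match std::str::from_utf8(window) {
        Ok(s) => s,
        // The window may cut a following codepoint in half; keep the valid prefix.
        Err(e) => std::str::from_utf8(&window[..e.valid_up_to()]).unwrap(),
    };
    let c = valid.chars().next()?; // None: the bytes at `i` are not valid UTF-8
    Some((c, i, c.len_utf8()))
}

Finding the previous codepoint would be the same kind of scan, just walking backwards from the given index.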

However, I'm not sure whether such a function should appear in std, or how useful it would be after all.

2 Likes

There's str::is_char_boundary, but maybe we could also add u8 methods like is_utf8_leading and is_utf8_continuation, and then you can use normal iterator methods to search for it.
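
Those two checks are trivial to write as free functions today, for what it's worth (the names just mirror the hypothetical methods above, and "leading" here assumes the bytes are valid UTF-8):

fn is_utf8_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

fn is_utf8_leading(b: u8) -> bool {
    !is_utf8_continuation(b) // only correct for valid UTF-8; 0xF8..=0xFF are neither
}

fn main() {
    let bytes = "a🔥b".as_bytes();
    // Find the first char boundary at or after byte index 2 with ordinary iterator methods.
    let boundary = (2..bytes.len()).find(|&i| is_utf8_leading(bytes[i]));
    assert_eq!(boundary, Some(5)); // 🔥 occupies bytes 1..5
}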

7 Likes

This makes me wonder: would a hypothetical Rust without a char type at all be feasible? That is, like Python, have only a string type, which isn't composed of atoms.

So, functions like chars become

impl str {
  fn codepoints(&self) -> impl Iterator<Item=&str> + '_;
}

I came to the conclusion that, if you work with strings, you should treat string slices as the atomic unit you work with.
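
That iterator can already be expressed on stable Rust in terms of char_indices; a sketch as a free function, since we can't add inherent methods to str:

fn codepoints<'a>(s: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    // Each item is the sub-slice covering exactly one codepoint.
    s.char_indices().map(move |(i, c)| &s[i..i + c.len_utf8()])
}

fn main() {
    let pieces: Vec<&str> = codepoints("a🔥b").collect();
    assert_eq!(pieces, ["a", "🔥", "b"]);
}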

5 Likes

Although it is beyond my experience and expertise, I guess that char could be useful for low-level routines, e.g., for segmentation, and probably for easier interoperability. From this point of view I suppose it is good to have a suitable type, and compiler support for it seems convenient.

However, I understand your point. The mere existence of such a type makes it possible for users to get it wrong or abuse it. Anyway, I think your idea (allow me to rephrase it) of avoiding char whenever possible and preferring &str instead is something that should be emphasized more in the Rust Book and elsewhere.

Searching for a single character is faster than searching for a string. I believe clippy even has a lint for this.

That's Clippy Lints
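
For a concrete example of what that lint nudges you towards (same result either way; the char pattern just gives the search more to specialize on):

fn main() {
    let haystack = "a long haystack with an x near the end";
    let with_str = haystack.find("x"); // single-character &str pattern
    let with_char = haystack.find('x'); // char pattern, preferred
    assert_eq!(with_str, with_char);
}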

This is very similar to strings in Swift, where a String acts as a collection of Characters, but internally a Character is represented as a String. So really a String is a collection of Strings. (This is especially necessary in Swift because their Character type represents an Extended Grapheme Cluster, which has no maximum storage size.)

Hm, I get a 4x difference in performance when searching (long) strings. This feels like a bug in string search to me, to be honest. It looks like the impl doesn't have a memchr fast path for short patterns and just uses the two-way algorithm.

1 Like

This is a nice idea. However, because &str is borrowed, using &str instead of char everywhere could quickly lead to lifetime hell.

Furthermore, I'm not entirely convinced this would actually lead to better code, since char is a more constrained type than &str (it is exactly one code point), leading to more type safety.
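
To illustrate both points, compare the two shapes of "get the first codepoint" (the function names here are made up): the char version hands back an owned, exactly-one-codepoint value, while the &str version keeps a borrow of the input alive.

fn first_char(s: &str) -> Option<char> {
    s.chars().next() // owned value, trivially storable
}

fn first_codepoint(s: &str) -> Option<&str> {
    // Borrowed value: the caller's use of it is tied to the lifetime of `s`.
    s.char_indices().next().map(|(_, c)| &s[..c.len_utf8()])
}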