Idea: `char_index_after` and `char_index_before`

asa-z · June 17, 2020, 6:01pm

Add new members on str that allow for easily finding the next character boundary.

fn char_index_after(&self, mut index: usize) -> Option<usize> {
    index += 1;
    // Skip continuation bytes until the next char boundary or the end of the string.
    while index < self.len() && !self.is_char_boundary(index) {
        index += 1;
    }
    if index >= self.len() {
        None
    } else {
        Some(index)
    }
}

fn char_index_before(&self, mut index: usize) -> Option<usize> {
    if index == 0 {
        return None;
    }
    index -= 1;
    // Step back over continuation bytes until a char boundary is reached.
    while !self.is_char_boundary(index) {
        index -= 1;
    }
    Some(index)
}

There are obviously numerous ways to slightly tweak the function, but this is the best I have thought of.

jjpe · June 17, 2020, 7:41pm

Is a char boundary really what you want? Note that char boundaries and unicode grapheme boundaries are not the same, so using the former when you need the latter will quite likely result in broken text.

asa-z · June 17, 2020, 7:50pm

True, but the standard library provides no support for graphemes, and you have to go to an external crate for that. Adding grapheme segmentation to the standard library is therefore far out of scope.

Ixrec · June 17, 2020, 7:52pm

It sounds like you're using "working with graphemes requires external crates" as an argument for "std should have these char_* methods", which doesn't really follow. Why not just use one of those external crates for grapheme segmentation? Do you have a use case that would actually benefit from char_index_before/after, which wouldn't be more correct with graphemes?

josh · June 17, 2020, 8:05pm

What semantic operation are you trying to perform on strings? In what use case does this come up?

mbrubeck · June 17, 2020, 8:22pm

For comparison, there is a fairly short way of doing this in the standard library already, at least for the case where index is itself a char boundary:

fn char_index_after(&self, index: usize) -> Option<usize> {
    let (i, _) = self[index..].char_indices().nth(1)?;
    Some(index + i)
}

fn char_index_before(&self, index: usize) -> Option<usize> {
    let (i, _) = self[..index].char_indices().next_back()?;
    Some(i)
}

We also used to have char_range_at and char_range_at_reverse, which were similar to the proposed functions.
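For anyone who wants to try the two snippets above outside of an impl block, here is a minimal sketch that rewrites them as free functions. Taking the string as an explicit `s: &str` parameter is an assumption made for illustration; the bodies are otherwise unchanged, and both functions still panic if `index` is not on a char boundary.

```rust
/// Byte index of the char that starts after the one at `index`,
/// or None if the char at `index` is the last one.
/// Assumes `index` is a char boundary; slicing panics otherwise.
fn char_index_after(s: &str, index: usize) -> Option<usize> {
    // Item 0 is the char at `index` itself, so item 1 is the next char.
    let (i, _) = s[index..].char_indices().nth(1)?;
    Some(index + i)
}

/// Byte index of the char that ends at `index`,
/// or None if `index` is the start of the string.
/// Assumes `index` is a char boundary; slicing panics otherwise.
fn char_index_before(s: &str, index: usize) -> Option<usize> {
    let (i, _) = s[..index].char_indices().next_back()?;
    Some(i)
}
```

With `s = "aé€"` (1, 2 and 3 bytes respectively), `char_index_after(s, 1)` gives `Some(3)`, `char_index_after(s, 3)` gives `None`, and `char_index_before(s, 6)` gives `Some(3)`.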

asa-z · June 17, 2020, 9:08pm

I'm trying to move a cursor around on the string, by characters.

The issue is that when using a cursor on a string, this is a very common operation, and that pattern is rather long, hard to remember, and hard to type.

Tom-Phinney · June 17, 2020, 9:48pm

That's not an excuse to put it in std rather than some experimental crate that follows SemVer, and thus can be improved without backward-compatibility issues.

asa-z · June 17, 2020, 9:55pm

It doesn't make sense to have this as its own crate, though it could be included as one part of larger crates.

It's a fairly basic building block, and the design space is incredibly tiny, which is why I suggested it for inclusion in the standard library.

It seems that the consensus is that it is not worthy of inclusion, so users who need this feature can just roll their own or use a larger crate that provides this functionality.

CAD97 · June 17, 2020, 10:58pm

Contrary to "only one obvious way to do it," because UTF-8 encoded codepoints are a maximum of four bytes long, you can manually unroll the loop into a maximum of four turns for a slight boost. (If you're lucky or clever, you might even be able to check all four bytes for being a starting byte at the same time!)

I know I've implemented this exact functionality (aligning an arbitrary byte index to a character boundary), but a quick look wasn't able to find it again.
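A minimal sketch of that kind of alignment, not the lost implementation: it rounds an arbitrary byte index down to the start of the codepoint containing it by checking for UTF-8 continuation bytes (bit pattern `10xxxxxx`). The name `align_to_char_start` is hypothetical, and the choice to clamp out-of-range indices to `s.len()` is an assumption.

```rust
/// Hypothetical helper: round `index` down to the nearest char boundary.
/// Indices at or past the end are clamped to `s.len()` (an assumption).
fn align_to_char_start(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let bytes = s.as_bytes();
    let mut i = index;
    // Continuation bytes look like 0b10xx_xxxx. A valid UTF-8 sequence has
    // at most three of them, so this loop runs at most three times, and it
    // cannot underflow because byte 0 of a `str` is never a continuation byte.
    while bytes[i] & 0b1100_0000 == 0b1000_0000 {
        i -= 1;
    }
    i
}
```

For `"aé€"`, index 2 (inside `é`) aligns to 1 and index 5 (inside `€`) aligns to 3.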

But to be completely honest, in the supermajority of cases, I think you don't want to do this. (Thus a reason to let it be outside std: those that need it can get it but it's not going to tempt people that shouldn't.)

Specifically, using an arbitrary byte index into a UTF-8 string that isn't on a codepoint border is pretty broken. How did you get the index in the first place? If it's supposed to be a codepoint border, you definitely want to know if it isn't, which is difficult to do with a provided align function.

Basically, I think the overwhelming majority of cases you might legitimately want this functionality are when implementing Unicode algorithms, in which case you definitely are pulling in a Unicode library that can do that or implementing that library.

josh · June 18, 2020, 2:00am

What do you mean by "a cursor"?

Are you looking for a rope-like data structure?

asa-z · June 18, 2020, 3:02am

I mean an index into the string that I can advance forward or backwards in the string at will.

CAD97 · June 18, 2020, 3:31am

If it's not already aligned at a codepoint boundary, your system is almost certainly horribly broken. And as shown above, staying on a codepoint boundary is already easily doable (and more efficient than moving one byte and then aligning to a codepoint boundary, because of how UTF-8 works! It's O(1) either way, but it's 1 check versus up to 4).

asa-z · June 18, 2020, 3:47am

Yes, I now realize you could use the structure of UTF-8 to implement a more efficient version of the functionality. However, it seems pointless to improve the implementation when it seems there is consensus that it's not a worthy feature for std (actually core, probably).

josh · June 18, 2020, 5:57am

What you're getting in this thread is a lot of people asking you what you're trying to do. What are you using the string index for? The feedback you're getting is that there may be a better way to solve the use case you're trying to implement.

asa-z · June 18, 2020, 5:11pm

My specific use case is parsing.

CAD97 · June 18, 2020, 5:20pm

Then you definitely don't want an arbitrary alignment operation, because you definitely should always be at a codepoint boundary.

Getting the byte length of a single codepoint is effectively just string.chars().next()?.len_utf8(). And this is going to be strictly constant time, rather than the (bounded) loop required to align an arbitrary byte index to a codepoint boundary.
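As a rough sketch of how that expression turns into a cursor step: the helper below reads the char at the cursor and adds its encoded length. The name `advance` and the explicit `cursor` parameter are assumptions for illustration. Unlike the original proposal, it returns `Some(s.len())` after the last char and `None` only when the cursor is already at the end.

```rust
/// Hypothetical helper: given a cursor on a char boundary, return the
/// position just past the char that starts there, or None at end of input.
/// Panics if `cursor` is not a char boundary, which surfaces bugs early.
fn advance(s: &str, cursor: usize) -> Option<usize> {
    let ch = s[cursor..].chars().next()?;
    Some(cursor + ch.len_utf8())
}
```

For `"aé€"`, successive calls starting from 0 yield `Some(1)`, `Some(3)`, `Some(6)` and then `None`.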

asa-z · June 18, 2020, 5:27pm

Yes, that's probably a better implementation (and I didn't realize len_utf8 existed, so I'm probably going to switch to that in my project's implementation.) However, since there is consensus that this is not a useful enough feature, I don't see why we continue discussing the implementation.

josh · June 18, 2020, 8:54pm

Because people here are trying to help you with your actual use case, rather than just evaluate the original proposal you made for a library addition. Both because we care about Rust users succeeding (and enjoying themselves in the process), and because sometimes in the course of talking about the actual use case we find something that should be changed in the language or libraries.

asa-z · June 20, 2020, 3:12am

I am parsing a language, and in my lexer I am keeping track of the current location that I am lexing. I need to refer to the value at the index, as well as increment and decrement it. Since none of the characters that my parser cares about are more than one codepoint, I don't need to worry about graphemes. I still want to handle Unicode inputs (or at least fail gracefully on them), but I do not want to pull in unicode_segmentation or another similar crate when I don't need to. I rolled my own equivalent as a free function (though it also takes &mut usize).
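One possible shape for such a lexer cursor, built only from the std pieces mentioned earlier in the thread (chars(), next_back() and len_utf8()). This is a sketch under assumptions, not the poster's actual code: the `Cursor` type and its `peek`, `bump` and `back` methods are hypothetical names.

```rust
/// Hypothetical lexer cursor; type and method names are illustrative only.
struct Cursor<'a> {
    src: &'a str,
    /// Invariant: always a char boundary of `src`.
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(src: &'a str) -> Self {
        Cursor { src, pos: 0 }
    }

    /// The char at the current position, if any.
    fn peek(&self) -> Option<char> {
        self.src[self.pos..].chars().next()
    }

    /// Consume the current char and move the cursor past it.
    fn bump(&mut self) -> Option<char> {
        let ch = self.peek()?;
        self.pos += ch.len_utf8();
        Some(ch)
    }

    /// Move the cursor back over the previous char and return it.
    fn back(&mut self) -> Option<char> {
        let ch = self.src[..self.pos].chars().next_back()?;
        self.pos -= ch.len_utf8();
        Some(ch)
    }
}
```

Because every move adds or subtracts a whole char's encoded length, the cursor never lands inside a codepoint, so no separate alignment step is needed.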
