Idea: `char_index_after` and `char_index_before`

Add new methods on str that make it easy to find the next or previous character boundary.

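fn char_index_after(&self, mut index: usize) -> Option<usize> {
    index += 1;
    // Skip continuation bytes until the next char boundary (or the end).
    while index < self.len() && !self.is_char_boundary(index) {
        index += 1;
    }
    // The end of the string does not count as a "next character".
    if index >= self.len() {
        None
    } else {
        Some(index)
    }
}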

fn char_index_before(&self, mut index: usize) -> Option<usize> {
    if index == 0 {
        return None;
    }
    index -= 1;
    // Walk back over continuation bytes; index 0 is always a boundary.
    while !self.is_char_boundary(index) {
        index -= 1;
    }
    Some(index)
}

There are obviously numerous ways to slightly tweak the function, but this is the best I have thought of.

Is a char boundary really what you want? Note that char boundaries and unicode grapheme boundaries are not the same, so using the former when you need the latter will quite likely result in broken text.
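
To make the distinction concrete, here is a minimal sketch using only the standard library. The helper names `char_boundaries` and `take_chars` are hypothetical and exist only for this illustration. The string "e\u{301}x" renders as an accented e followed by x, yet the accent is its own code point, so a char boundary falls in the middle of what a reader sees as one character.

```rust
/// Returns every byte offset in `s` that is a char (code point) boundary,
/// including 0 and `s.len()`.
fn char_boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}

/// Keeps only the first `n` code points of `s`.
/// This cuts on char boundaries, not grapheme boundaries.
fn take_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((i, _)) => &s[..i],
        None => s,
    }
}
```

Taking one "character" from that string yields a bare "e": the combining accent is silently dropped, which is the kind of broken text the post warns about.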

True, but the standard library provides no support for graphemes, and you have to go to an external crate for that. Adding grapheme segmentation to the standard library is therefore far out of scope.

It sounds like you're using "working with graphemes requires external crates" as an argument for "std should have these char_* methods", which doesn't really follow. Why not just use one of those external crates for grapheme segmentation? Do you have a use case that would actually benefit from char_index_before/after, which wouldn't be more correct with graphemes?

What semantic operation are you trying to perform on strings? In what use case does this come up?

For comparison, there is a fairly short way of doing this in the standard library already, at least for the case where index is itself a char boundary:

fn char_index_after(&self, index: usize) -> Option<usize> {
    let (i, _) = self[index..].char_indices().nth(1)?;
    Some(index + i)
}

fn char_index_before(&self, index: usize) -> Option<usize> {
    let (i, _) = self[..index].char_indices().next_back()?;
    Some(i)
}

We also used to have char_range_at and char_range_at_reverse, which were similar to the proposed functions.
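
The two snippets above are written as methods on str. A sketch of the same idea as free functions, so it compiles outside the standard library, could look like the following. One deliberate deviation, which is an assumption of this sketch rather than part of the post: it uses `str::get` instead of indexing, so an `index` that is not on a char boundary yields `None` instead of panicking.

```rust
/// Byte index of the char after the one starting at `index`.
/// Returns `None` if there is no following char, or if `index`
/// is out of range or not on a char boundary.
fn char_index_after(s: &str, index: usize) -> Option<usize> {
    let rest = s.get(index..)?;
    let (i, _) = rest.char_indices().nth(1)?;
    Some(index + i)
}

/// Byte index of the char that ends at `index`.
/// Returns `None` at the start of the string, or if `index`
/// is out of range or not on a char boundary.
fn char_index_before(s: &str, index: usize) -> Option<usize> {
    let head = s.get(..index)?;
    let (i, _) = head.char_indices().next_back()?;
    Some(i)
}
```

Note that a single `None` here covers both "no such character" and "bad index"; a caller that needs to tell them apart would want a richer return type.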

I'm trying to move a cursor around on the string, by characters.

The issue is that when using a cursor on a string, this is a very common operation, and that pattern is rather long, hard to remember, and hard to type.

That's not an excuse to put it in std rather than some experimental crate that follows SemVer, and thus can be improved without backward-compatibility issues.

It doesn't make sense to have this as its own crate, though it could be included as one part of larger crates.

It's a fairly basic building block, and the design space is incredibly tiny, which is why I suggested it for inclusion in the standard library.

It seems that the consensus is that it is not worthy of inclusion, so users who need this feature can just roll their own or use a larger crate that provides this functionality.

Contrary to "only one obvious way to do it," because UTF-8 encoded codepoints are a maximum of four bytes long, you can manually unroll the loop into a maximum of four turns for a slight boost. (If you're lucky or clever, you might even be able to check all four bytes for being a starting byte at the same time!)
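
As a rough illustration of the bounded loop described here, the sketch below rounds an arbitrary byte index down to a char boundary while looking at no more than three bytes before giving up on continuation bytes. The function name `align_down` is hypothetical, and this is the plain bounded version, not the "check all four bytes at once" trick.

```rust
/// Rounds `index` down to the nearest char boundary in `s`.
/// Indices at or past the end are clamped to `s.len()`.
fn align_down(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let bytes = s.as_bytes();
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    let is_continuation = |b: u8| (b & 0xC0) == 0x80;
    let mut i = index;
    // A code point is at most 4 bytes, so at most 3 continuation bytes
    // can sit between `index` and the leading byte before it.
    for _ in 0..3 {
        if !is_continuation(bytes[i]) {
            break;
        }
        // Safe: valid UTF-8 never starts with a continuation byte,
        // so `i` is greater than 0 here.
        i -= 1;
    }
    i
}
```

Because a `&str` is guaranteed to be valid UTF-8, the loop always ends on a leading byte without any further checks.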

I know I've implemented this exact functionality (aligning an arbitrary byte index to a character boundary), but a quick look wasn't able to find it again.

But to be completely honest, in the supermajority of cases, I think you don't want to do this. (Thus a reason to let it be outside std: those that need it can get it but it's not going to tempt people that shouldn't.)

Specifically, because using an arbitrary byte index into a UTF-8 string that isn't on a codepoint border is pretty broken. How did you get the index in the first place? If it's supposed to be a codepoint border, you definitely want to know if it isn't, which is difficult to do with a provided align function.

Basically, I think the overwhelming majority of cases where you might legitimately want this functionality are when implementing Unicode algorithms, in which case you are definitely either pulling in a Unicode library that can do that or implementing that library.

What do you mean by "a cursor"?

Are you looking for a rope-like data structure?

I mean an index into the string that I can advance forward or backwards in the string at will.

If it's not already aligned at a codepoint boundary, your system is almost certainly horribly broken. And as shown above, staying on a codepoint boundary is already easily doable (and more efficient than moving one byte and then aligning to a codepoint boundary, because of how UTF-8 works! It's O(1) either way, but it's 1 check versus up to 4).
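
The "1 check" works because the leading byte of a UTF-8 sequence encodes the sequence length. A minimal sketch, assuming the index is expected to sit on a boundary (the names `utf8_width` and `next_boundary` are hypothetical):

```rust
/// Length in bytes of the UTF-8 sequence that starts with `lead`,
/// or `None` if `lead` cannot start a sequence (e.g. a continuation byte).
fn utf8_width(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Steps from one char boundary to the next by inspecting a single byte.
/// Returns `None` at the end of the string or when `index` is not
/// on a char boundary, so a misaligned index is reported, not repaired.
fn next_boundary(s: &str, index: usize) -> Option<usize> {
    let lead = *s.as_bytes().get(index)?;
    Some(index + utf8_width(lead)?)
}
```

Unlike an align operation, this surfaces a misaligned index as `None` instead of quietly fixing it.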

Yes, I now realize you could use the structure of UTF-8 to implement a more efficient version of the functionality. However, it seems pointless to improve the implementation when there appears to be consensus that it's not a worthy feature for std (actually core, probably).

What you're getting in this thread is a lot of people asking you what you're trying to do. What are you using the string index for? The feedback you're getting is that there may be a better way to solve the use case you're trying to implement.

My specific use case is parsing.

Then you definitely don't want an arbitrary alignment operation, because you definitely should always be at a codepoint boundary.

Getting the byte length of a single codepoint is effectively just string.chars().next()?.len_utf8(). And this is going to be strictly constant time, rather than the (bounded) loop required to align an arbitrary byte index to a codepoint boundary.
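
Wrapped in a function, that expression might be used like this. The name `advance` is hypothetical, and returning `None` for a misaligned index via `str::get` is a choice made for this sketch.

```rust
/// Moves `index` past the char that starts there.
/// Returns `None` at the end of the string, or if `index` is out of
/// range or not on a char boundary.
fn advance(s: &str, index: usize) -> Option<usize> {
    let rest = s.get(index..)?;
    let c = rest.chars().next()?;
    Some(index + c.len_utf8())
}
```

One difference from the original proposal: stepping past the last char returns `Some(s.len())` here, since the end of the string is a valid cursor position for a parser.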

Yes, that's probably a better implementation (and I didn't realize len_utf8 existed, so I'm probably going to switch to that in my project's implementation.) However, since there is consensus that this is not a useful enough feature, I don't see why we continue discussing the implementation.

Because people here are trying to help you with your actual use case, rather than just evaluate the original proposal you made for a library addition. Both because we care about Rust users succeeding (and enjoying themselves in the process), and because sometimes in the course of talking about the actual use case we find something that should be changed in the language or libraries.

I am parsing a language, and in my lexer I am keeping track of the current location that I am lexing. I need to refer to the value at the index, as well as increment and decrement it. Since none of the characters that my parser cares about are more than one codepoint, I don't need to worry about graphemes. I still want to handle Unicode input (or at least fail gracefully), but I do not want to pull in unicode_segmentation or another similar crate when I don't need to. I rolled my own equivalent as a free function (though it also takes &mut usize).
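
For readers with the same use case, one possible shape for such a lexer cursor is sketched below. It is not the poster's implementation; the `Cursor` type and its method names are assumptions made for illustration. It keeps the position on a char boundary at all times, so no alignment step is ever needed.

```rust
/// A lexer position over `src`. Invariant: `pos` is always a char boundary.
struct Cursor<'a> {
    src: &'a str,
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(src: &'a str) -> Self {
        Cursor { src, pos: 0 }
    }

    /// The char at the current position, without moving.
    fn peek(&self) -> Option<char> {
        self.src[self.pos..].chars().next()
    }

    /// Consumes the current char and moves forward by its encoded length.
    fn bump(&mut self) -> Option<char> {
        let c = self.peek()?;
        self.pos += c.len_utf8();
        Some(c)
    }

    /// Moves back over the previous char and returns it.
    fn back(&mut self) -> Option<char> {
        let c = self.src[..self.pos].chars().next_back()?;
        self.pos -= c.len_utf8();
        Some(c)
    }
}
```

Because `pos` only ever changes by a whole `len_utf8()`, the slicing in `peek` and `back` cannot panic, and non-ASCII input is stepped over one code point at a time.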