[ACP ALREADY WRITTEN] Pre-RFC: Bring back the `subslice_offset` methods (for extending `str::split` and such)

EDIT: I'VE OPENED AN ACP FOR THIS. THE UP-TO-DATE VERSION OF THIS IS THERE.

Summary

Bring back the str::subslice_offset and slice::subslice_offset methods. These use safe pointer arithmetic to find where something is within a slice/str.

impl str {
    fn subslice_offset(&self, item: &str) -> Range<usize>;
}
impl<T> [T] {
    fn subslice_offset(&self, item: &T) -> usize;
}

Motivation

One use of this is to improve the flexibility of methods like str::split. I find myself unable to use str::split or slice::split in many cases because I need more complex behavior.

Suppose we want to split a semicolon-separated string, but we want to include the semicolon at the start of each substring. We could use subslice_offset method to do this.

let foo = "abc;ab;a";
for sub in foo.split(";") {
    let idx = foo.subslice_offset(sub);
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // Now `sub` has the leading semicolon
    // This yields "abc", ";ab", and ";a" for `sub`
}

str::find cannot be used in place of str::subslice_offset.

let foo = "abc;ab;a";
for sub in foo.split(";") {
    let idx = foo.find(sub).unwrap();
    let sub = &foo[idx.saturating_sub(1)..idx + sub.len()];

    // This incorrectly yields "abc", "ab", and "a"
}

Another (more niche) but distinct problem is presented in "Problem 2" of my acp.

Exhibit A

This is the original extract_leading_metadata function in Rust's src/librustdoc/markdown.rs. As you can see, it used to use subslice_offset to find the lines at the start of the file that start with "%" and then return the remaining lines.

/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
    let mut metadata = Vec::new();
    for line in s.lines() {
        if line.starts_with("%") {
            // remove %<whitespace>
            metadata.push(line[1..].trim_left())
        } else {
            let line_start_byte = s.subslice_offset(line);
            return (metadata, &s[line_start_byte..]);
        }
    }
    // if we're here, then all lines were metadata % lines.
    (metadata, "")
}

However, when subslice_offset was deprecated, s.subslice_offset was replaced with s.find(line).unwrap() which then broke this function. The following is the fixed function (which is present in rustdoc to this day).

/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
    let mut metadata = Vec::new();
    let mut count = 0;
    for line in s.lines() {
        if line.starts_with("%") {
            // remove %<whitespace>
            metadata.push(line[1..].trim_left());
            count += line.len() + 1;
        } else {
            return (metadata, &s[count..]);
        }
    }
    // if we're here, then all lines were metadata % lines.
    (metadata, "")
}

As you can see, it now has this dorky count variable, and we need to manually keep track of what bytes s.lines() goes through.

Additionally, this breaks if s contains any CRLF line breaks (because these take up two bytes instead of one).

Guide-level explanation

This feature introduces the subslice_offset methods. These methods find where something is in a string/slice. Unlike str::find and similar methods, these do not actually search through the string.

Suppose we have the following code:

    let foo = "mississippis";
    let first_is: &str = foo.matches("is").nth(0).unwrap();
    let second_is: &str = foo.matches("is").nth(1).unwrap();

    assert_eq!(first_is, "is");
    assert_eq!(first_is, second_is);
    assert_eq!(foo.subslice_offset(first_is), 1..3);
    assert_eq!(foo.subslice_offset(second_is), 4..6);
    assert_eq!(foo.find(second_is), Some(1));

    foo.subslice_offset("is"); // PANICS !

Despite first_is and second_is both being "is", they point to different parts of foo. We can use subslice_offset to find where exactly they point to. In this case, first_is is &foo[1..3] and second_is is &foo[4..6]. In a sense, subslice_offset is the inverse of the indexing operation.

Unlike subslice_offset, str::find just searches for the first occurance of "is" within foo. This is why it gives the index 1 instead of 4 even though second_is is &foo[4..6].

In the last part, where it panics, we're trying to find where "is" points to within foo. However, "is" does not point to anything within foo because it is a separate string literal. This causes subslice_offset to panic.

Reference-level explanation

This could be implemented like so:

impl<T> [T] {
    fn subslice_offset(&self, item: &T) -> usize {
        let self_start = self.as_ptr() as usize;
        let item_start = item as *const T as usize;

        let byte_offset = item_start.wrapping_sub(self_start);
        let offset = byte_offset / core::mem::size_of::<T>();

        assert!(offset < self.len());

        offset
    }
}
impl str {
    fn subslice_offset(&self, substr: &str) -> Range<usize> {
        let self_start = self.as_ptr() as usize;
        let substr_start = substr.as_ptr() as usize;

        let start = substr_start.wrapping_sub(self_start);
        let end = start + substr.len();

        assert!(start < self.len());
        assert!(end <= self.len());

        start..end
    }
}

Panics

If the provided subslice is not in self, these methods panic. For example, the following would panic.

let foo = "hello world!";
let bar = "!world hello";

let world = foo.matches("world").next().unwrap();
let world2 = bar.matches("world").next().unwrap();

assert_eq!(world, world2);
assert_eq!(foo.subslice_offset(world), 6..11); // this does not panic
assert_eq!(bar.subslice_offset(world2), 1..6); // this does not panic

foo.subslice_offset(world2); // PANICS
// this panics because world2 was obtained from `bar` not `foo`

Drawbacks

  • This concept may be difficult to understand for people who haven't worked with pointers.
  • Due to their signature, these methods may be confused with str::find and similar methods which can cause bugs. In fact, this has already happened within rust itself.

Rationale and alternatives

I think that this is a good addition because it's a simple way to extend str::split and friends. On top of extending str::split and similar methods, it also has other uses (see this stack overflow post or "Problem 2" in my acp).

Alternatives

  • Add methods, such as split_indices or matches_indices that return ranges instead of subslices.
    • This adds a bunch of new methods to the standard library. This is especially bad if you look at all of the variants of these methods (eg str::rsplit, str::split_inclusive, slice::rsplitn, and str::rsplit_terminator).
  • Use a crates.rs crate such as https://docs.rs/subslice-offset/latest/subslice_offset/index.html.
    • subslice_offset is a very simple and powerful extension to slices and strings. I think that it is worth having in the standard library.
  • Not using str::split if you need more complex behavior.
    • str::split properly deals with certain edge cases, so it leads to more concise and less buggy code.

Prior art

Unresolved questions

  • What should these methods be named? Their name should clearly indicate that they are fundementally different from methods like str::find.
  • In the current design, these methods panic if the string/slice is not contained within self, but maybe they should return an Option instead.
  • Should there be a slice::subslice_offset method that takes in another slice and returns a range? What should that be named?

Future possibilities

1 Like

Prior art.

1 Like

Thanks. I somehow never found this while researching. This was very insightful for writing my RFC.

1 Like

This seems kind of brittle. Especially the str one when non-ASCII is involved. Wouldn't it be better for subslice_offset (or maybe a differently named method, since it won't be a subslice anymore) to accept *const T (in str's case, a *const u8) and just do the internal verification that the pointer really is inside the slice?

1 Like

Thanks for reading my proposal.

I think the str one should work with multi-byte characters just fine. Like str's indexing methods, this just uses ranges of byte indices.

My ACP has the most up-to-date version of this. In those methods, I do the verification and return None if it fails. Surprisingly, this can be done without any unsafe code.

I think that using references makes more sense because they are non-null and aligned, so it leads to better optimization. It's also more convenient to use.

Also, in certain cases, these methods can lead to unsound behavior when used in unsafe code. I feel like using raw pointers encourage peoples to use these in unsafe code.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.