EDIT: I'VE OPENED AN ACP FOR THIS. THE UP-TO-DATE VERSION OF THIS IS THERE.
Summary
Bring back the str::subslice_offset
and slice::subslice_offset
methods. These use safe pointer arithmetic to find where something is within a slice/str.
impl str {
fn subslice_offset(&self, item: &str) -> Range<usize>;
}
impl<T> [T] {
fn subslice_offset(&self, item: &T) -> usize;
}
Motivation
One use of this is to improve the flexibility of methods like str::split
. I find myself unable to use str::split
or slice::split
in many cases because I need more complex behavior.
Suppose we want to split a semicolon-separated string, but we want to include the semicolon at the start of each substring. We could use subslice_offset
method to do this.
let foo = "abc;ab;a";
for sub in foo.split(";") {
let idx = foo.subslice_offset(sub);
let sub = &foo[idx.start.saturating_sub(1)..idx.end];
// Now `sub` has the leading semicolon
// This yields "abc", ";ab", and ";a" for `sub`
}
str::find
cannot be used in place of str::subslice_offset
.
let foo = "abc;ab;a";
for sub in foo.split(";") {
let idx = foo.find(sub).unwrap();
let sub = &foo[idx.saturating_sub(1)..idx + sub.len()];
// This incorrectly yields "abc", "ab", and "a"
}
Another (more niche) but distinct problem is presented in "Problem 2" of my acp.
Exhibit A
This is the original extract_leading_metadata
function in Rust's src/librustdoc/markdown.rs
. As you can see, it used to use subslice_offset
to find the lines at the start of the file that start with "%"
and then return the remaining lines.
/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
let mut metadata = Vec::new();
for line in s.lines() {
if line.starts_with("%") {
// remove %<whitespace>
metadata.push(line[1..].trim_left())
} else {
let line_start_byte = s.subslice_offset(line);
return (metadata, &s[line_start_byte..]);
}
}
// if we're here, then all lines were metadata % lines.
(metadata, "")
}
However, when subslice_offset
was deprecated, s.subslice_offset
was replaced with s.find(line).unwrap()
which then broke this function. The following is the fixed function (which is present in rustdoc to this day).
/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
let mut metadata = Vec::new();
let mut count = 0;
for line in s.lines() {
if line.starts_with("%") {
// remove %<whitespace>
metadata.push(line[1..].trim_left());
count += line.len() + 1;
} else {
return (metadata, &s[count..]);
}
}
// if we're here, then all lines were metadata % lines.
(metadata, "")
}
As you can see, it now has this dorky count
variable, and we need to manually keep track of what bytes s.lines()
goes through.
Additionally, this breaks if s
contains any CRLF line breaks (because these take up two bytes instead of one).
Guide-level explanation
This feature introduces the subslice_offset
methods. These methods find where something is in a string/slice. Unlike str::find
and similar methods, these do not actually search through the string.
Suppose we have the following code:
let foo = "mississippis";
let first_is: &str = foo.matches("is").nth(0).unwrap();
let second_is: &str = foo.matches("is").nth(1).unwrap();
assert_eq!(first_is, "is");
assert_eq!(first_is, second_is);
assert_eq!(foo.subslice_offset(first_is), 1..3);
assert_eq!(foo.subslice_offset(second_is), 4..6);
assert_eq!(foo.find(second_is), Some(1));
foo.subslice_offset("is"); // PANICS !
Despite first_is
and second_is
both being "is"
, they point to different parts of foo
. We can use subslice_offset
to find where exactly they point to. In this case, first_is
is &foo[1..3]
and second_is
is &foo[4..6]
. In a sense, subslice_offset
is the inverse of the indexing operation.
Unlike subslice_offset
, str::find
just searches for the first occurance of "is"
within foo
. This is why it gives the index 1
instead of 4
even though second_is
is &foo[4..6]
.
In the last part, where it panics, we're trying to find where "is"
points to within foo
. However, "is"
does not point to anything within foo because it is a separate string literal. This causes subslice_offset
to panic.
Reference-level explanation
This could be implemented like so:
impl<T> [T] {
fn subslice_offset(&self, item: &T) -> usize {
let self_start = self.as_ptr() as usize;
let item_start = item as *const T as usize;
let byte_offset = item_start.wrapping_sub(self_start);
let offset = byte_offset / core::mem::size_of::<T>();
assert!(offset < self.len());
offset
}
}
impl str {
fn subslice_offset(&self, substr: &str) -> Range<usize> {
let self_start = self.as_ptr() as usize;
let substr_start = substr.as_ptr() as usize;
let start = substr_start.wrapping_sub(self_start);
let end = start + substr.len();
assert!(start < self.len());
assert!(end <= self.len());
start..end
}
}
Panics
If the provided subslice is not in self
, these methods panic. For example, the following would panic.
let foo = "hello world!";
let bar = "!world hello";
let world = foo.matches("world").next().unwrap();
let world2 = bar.matches("world").next().unwrap();
assert_eq!(world, world2);
assert_eq!(foo.subslice_offset(world), 6..11); // this does not panic
assert_eq!(bar.subslice_offset(world2), 1..6); // this does not panic
foo.subslice_offset(world2); // PANICS
// this panics because world2 was obtained from `bar` not `foo`
Drawbacks
- This concept may be difficult to understand for people who haven't worked with pointers.
- Due to their signature, these methods may be confused with
str::find
and similar methods which can cause bugs. In fact, this has already happened within rust itself.
Rationale and alternatives
I think that this is a good addition because it's a simple way to extend str::split
and friends. On top of extending str::split
and similar methods, it also has other uses (see this stack overflow post or "Problem 2" in my acp).
Alternatives
- Add methods, such as
split_indices
ormatches_indices
that return ranges instead of subslices.- This adds a bunch of new methods to the standard library. This is especially bad if you look at all of the variants of these methods (eg
str::rsplit
,str::split_inclusive
,slice::rsplitn
, andstr::rsplit_terminator
).
- This adds a bunch of new methods to the standard library. This is especially bad if you look at all of the variants of these methods (eg
- Use a crates.rs crate such as https://docs.rs/subslice-offset/latest/subslice_offset/index.html.
subslice_offset
is a very simple and powerful extension to slices and strings. I think that it is worth having in the standard library.
- Not using
str::split
if you need more complex behavior.str::split
properly deals with certain edge cases, so it leads to more concise and less buggy code.
Prior art
- subslice_offset crate.
- original
subslice_offset
PR from 2013.- deprecated here with
str::find
being implied as the alternative.
- deprecated here with
- RFC: Subslice-offset - Get the offset of references into a slice. by m-ou-se · Pull Request #2796 · rust-lang/rfcs · GitHub (a RFC similar to this but only for slices).
- rust - How to find the starting offset of a string slice of another string? - Stack Overflow (illustrates another important use for this).
Unresolved questions
- What should these methods be named? Their name should clearly indicate that they are fundementally different from methods like
str::find
. - In the current design, these methods panic if the string/slice is not contained within
self
, but maybe they should return anOption
instead. - Should there be a
slice::subslice_offset
method that takes in another slice and returns a range? What should that be named?