Pre-RFC: Add join_seq method to slices and strs

It is sometimes useful to join two strings which are known to be sequential into one string, such as when you have two strs produced from the split_at method. This is also occasionally used for slices of any type, the use case that I can immediately think of is joining two free blocks in a memory allocator.

Doing this requires some unsafe code and is orthogonal to split_at which leads me to believe it would be a good candidate for addition to the standard library. Here is an example implementation:

impl str {
    pub fn join_seq<'a>(&'a self, next: &'a str) -> Option<&'a str> {
        if next.is_empty() { return Some(self); }
        let len = self.len();
        let seq = unsafe { self.as_ptr().add(len) };
        if seq == next.as_ptr() {
            Some(unsafe {
                std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.as_ptr(), len + next.len()))                
            })
        } else {
            None
        }
    }
}

The use case I've run into that specifically requires this is when attempting to create an identifier parser in nom:

fn ident(input: &str) -> IResult<&str, &str> {
    map(
        tuple((
            take_while_m_n(1, 1, |chr: char| chr.is_alphabetic() || chr == '_'),
            take_while(|chr: char| chr.is_alphanumeric() || chr == '_')
        )),
        |(s1, s2): (&str, &str)| s1.join_seq(s2).unwrap()
    )(input)
}
2 Likes

This API has been implemented in several crates and even proposed in an RFC, but it's unsound (at least, not safe without significant modifications to the API). https://github.com/rust-lang/rust/issues/62765 discusses the underlying reason for that and has links to all the instances I'm aware of.

12 Likes

Ah, so the issue is allocation boundaries on the stack. That is unfortunate.

I'll comment on the issue you linked with a proposed solution.

Going to copy my comment there back here:

In order for this to work without invoking undefined behavior the underlying string type would have to be modified. First, rejoin or join_seq or whatever would only be allowed for strings, and then a null-terminator (or any arbitrary byte) would be appended to the end of every string, which is then not included in slices to the string content. This would ensure that slices cannot be contiguous across allocation boundaries, at the cost of one extra byte per string (and probably a lot of work to the standard library). This in theory could also be done for arrays, but I would argue that is likely to be a much larger cost than a single byte per string, which is already a cost that C accepts.

Edit: actually, upon further thought, there's no reason you couldn't use the same method on arrays of any type to allow join_seq

Edit2: You would have to end pad every allocation to allow for this. Seems like far too steep of a requirement.

Dumb question, but if the function is only supposed to work for substrings split from a valid string, then maybe you could use it as a method of the original string, with some additional checking?

So the prototype would be:

    pub fn join_seq<'a>(&'a self, first: &'a str, second: &'a str) -> Option<&'a str> { ... }

And the line at the end of your example would be:

 |(s1, s2): (&str, &str)| input.join_seq(s1, s2).unwrap()
1 Like

That would work, but how useful is it? If you have a shared reference to the original string anyways then you might as well take offsets and calculate the merge as an interval instead. Needs not the slightest bit of unsafe and is much more flexible in the involved lifetimes.

2 Likes

Perhaps this is a silly question, but: since this operation could only possibly be sound when the two slices originate from the same allocation, how do you end up in a situation where you want to do this and you don't have an original string you could simply re-slice?

Even if abstraction boundaries for libraries are involved, you'd still have to somehow tell the library you're calling that the slices you're passing are from the same allocation, at which point you might as well pass it the whole original string too, so I don't even see how multiple independently-maintained crates could lead to a use case for this.

And even if the abstract machines involved didn't care about "same allocation" rules, any optimization enabled by this seems like it would be too brittle to be worth implementing at all, unless you enforce the "same allocation" rule yourself to ensure those strings consistently end up next to each other. Does anyone have an example more compelling than "in the one-in-a-million chance these strings got allocated next to each other, concat()ing them can skip a reallocation"?

1 Like

Three reasons I can think of, right now:

  • It is not currently possible to store the source of a borrow and the borrow in the same struct. So the underlying allocation might be a Box<str> and the individual slices borrowed from it. It's a bit contrived though as it can often be changed.
  • Avoiding the overhead of keeping the allocation is not completely trivial, potentially doubling the required memory from two pointer sizes to three or four if the original itself is unsized.
  • Does not trivially generalize to mutable references as the parts that are to be joined would alias with the original slice.

Given that such proposals come up at least once a year, I will guess that people come up with it more often than that but we do not always hear about it. Any ides for what would be good places to explicitly document this as not sound, to hopefully prevent this?

Note that slice::from_raw_parts is part of this proposal and their docs already say

The entire memory range of this slice must be contained within a single allocated object! Slices can never span across multiple allocated objects.

I am not sure if there are other places where such a warning can be more useful.

Anybody trying to implement a join_seq method knows intuitively that the raw slice they have constructed comes from only one allocated object, so I don't believe it's entirely adequate (intuitively it can't be or fewer people would ask this question :slight_smile:)

I think the best way to deal with this issue is to publish a crate that provides a PadString and &PadStr type that allows for this to be done safely, and then in the readme provide a long explanation for why it's necessary and why its extra features (join_seq) cannot be imported into the standard library. That way, if someone asks for this at least the course of action is obvious - just link the to the crate readme.

Imho simply adding this very counter example there should suffice (in a similar fashion to your improving the Pin docs by a lot when you added counter-examples):

Something like this
6 Likes

@dhm that looks pretty good, do you want to turn that into a PR?

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.