Splitn but returning an array

We have split_once, which splits into 2 parts and returns a tuple. This is nice because I can use destructuring and there is no bounds checking or allocation.

The closest equivalent for >2 parts is splitn().collect::<Vec<_>>() followed by indexing. That has a few problems: it allocates a vector, and you have to hope that the bounds checks are optimized out. You could also call next() a bunch of times to build a tuple/array, but that's verbose. Collecting into an ArrayVec helps by avoiding the allocation, but that's another library.
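
For reference, the two workarounds described above look roughly like this on stable Rust. This is only a minimal sketch: the function names, the space separator, and the choice of 3 parts are illustrative assumptions, not anything from std.

```rust
/// Workaround 1: collect into a Vec and index into it.
/// Allocates, and each index is bounds-checked unless the optimizer removes the checks.
fn three_parts_vec(s: &str) -> Option<[&str; 3]> {
    let parts: Vec<&str> = s.splitn(3, ' ').collect();
    if parts.len() == 3 {
        Some([parts[0], parts[1], parts[2]])
    } else {
        None
    }
}

/// Workaround 2: call next() once per part.
/// No allocation, but the repetition grows with the number of parts.
fn three_parts_next(s: &str) -> Option<[&str; 3]> {
    let mut it = s.splitn(3, ' ');
    Some([it.next()?, it.next()?, it.next()?])
}
```

Both return None when the input has fewer than three parts, and both leave the remainder unsplit in the last element.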

So anyway, now that const generics are stable, some methods like the following would be nice:

pub fn str::splitn<'a, P, const N: usize>(&'a self, pat: P) -> Option<[&str; N]>
where
    P: Pattern<'a>, 

pub fn slice::splitn<F, const N: usize>(&self, pred: F) -> Option<[&[T]; N]>
where
    F: FnMut(&T) -> bool, 

as well as rsplit variations
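
To make the idea concrete, here is a rough approximation of the str version that works on stable today. It is a sketch under a few assumptions: the name splitn_array is made up, it is a free function rather than a method, the pattern is restricted to char because the Pattern trait is not stable, and the rsplit variation is left out.

```rust
/// Splits `s` into exactly N parts on `pat`, leaving the remainder in the last part.
/// Returns None if there are fewer than N parts.
fn splitn_array<const N: usize>(s: &str, pat: char) -> Option<[&str; N]> {
    let mut parts = s.splitn(N, pat);
    // Placeholder values; every slot is overwritten before the array is returned.
    let mut out = [""; N];
    for slot in out.iter_mut() {
        *slot = parts.next()?;
    }
    Some(out)
}
```

With this sketch, `let [a, b, c] = splitn_array::<3>("aa bb cc dd", ' ')?;` would bind c to "cc dd", mirroring how split_once leaves everything after the first separator unsplit.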

What would happen if there were insufficient places to split? As in, you want to split a string into 3 parts based on spaces, but there's only one space in the string.

Itertools has collect_tuple, which sort of fills this need:

let (a, b, c) = "aa bb cc".split_whitespace().collect_tuple().unwrap();

but it could definitely be improved alongside const generics (returning a fixed-size array and/or being available for more than 12 splits). I hope that APIs like this become easier and more common.

The function should return an Option; in this case it would be None.

What would happen if there were insufficient places to split? As in, you want to split a string into 3 parts based on spaces, but there's only one space in the string.

Oh yeah, probably something like split_once, which returns an Option. I will edit the OP. We could also use something like Result<[...], impl Iterator>, in case the user wants to iterate over whatever splits there are.
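
A possible shape for the Result idea, as a sketch only. The name try_splitn, the char-only pattern, and the decision to re-split the input on failure (rather than hand back the partially consumed iterator) are all assumptions made for illustration.

```rust
/// Ok with exactly N parts, or Err with an iterator over whatever parts exist.
fn try_splitn<'a, const N: usize>(
    s: &'a str,
    pat: char,
) -> Result<[&'a str; N], impl Iterator<Item = &'a str>> {
    let mut parts = s.splitn(N, pat);
    // Placeholder values; every slot is overwritten on the Ok path.
    let mut out = [""; N];
    for slot in out.iter_mut() {
        match parts.next() {
            Some(part) => *slot = part,
            // Too few parts: start over so the caller can see all of them.
            None => return Err(s.splitn(N, pat)),
        }
    }
    Ok(out)
}
```

Re-splitting costs a second scan of the string on the error path, but it keeps the success path allocation-free and gives the caller every part that was found.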

FWIW currently you could use array::map:

let mut split = "hey there world".split(' ');
let words = [(); 3].map(|()| split.next().unwrap());

I don't know of any good way of returning early besides panicking, though. That's just a general problem of array::map.
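
One way around the missing early return on stable, as a sketch: fill the array with Options first, then destructure with let-else. The function name first_three_words is hypothetical.

```rust
fn first_three_words(s: &str) -> Option<[&str; 3]> {
    let mut split = s.split(' ');
    // array::map cannot short-circuit, so every slot is filled with an Option first.
    let parts: [Option<&str>; 3] = [(); 3].map(|()| split.next());
    // let-else turns "some slot is missing" into an early return instead of a panic.
    let [Some(a), Some(b), Some(c)] = parts else {
        return None;
    };
    Some([a, b, c])
}
```

This avoids the panic, but the pattern has to be written out per length, so it does not generalize to an arbitrary N.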

It's worth noting that the documentation for array::map has a special note about poor optimization (and indeed I have observed some pretty insanely large and complex assembly for a simple "collect from &[u8] returning an Option<[u8; N]>" when toying around in godbolt). That seems unusual for the std docs, and it makes me wonder whether any such const-generic array-returning method would suffer the same.

Edit: added godbolt link for the mentioned poor optimization.

Itertools has collect_tuple, which sort of fills this need:

Yeah, that is close, but again it's another dependency. Also, split_once won't fail if there is more than one instance of the separator; it just splits at the first one. So we should use splitn().collect_tuple(), not split().collect_tuple().

Also, arrays make more sense than tuples because the elements are homogeneous. I wonder why Itertools generally uses tuples (collect_tuple, next_tuple, etc.) and whether they will add array versions. Is it to do with const generics?

makes me wonder whether any such const-generic array-returning method would suffer the same.

Interesting. That would be unfortunate, but it is something that could be fixed either in the optimizer or through judicious use of unsafe. That kind of code belongs in std, because I wouldn't be comfortable writing it myself. And I expect that [.next(), .next(), .next(), ...] would have similar codegen problems?

It's probably to do with patterns.

Being able to write let [a, b, c] = foo(); is relatively new, so a bunch of older code will return tuples so that let (a, b, c) = foo(); was available.

It seems that guess is wrong. The same code but expanded like you suggested optimizes much much better than the array::map version.

Edit: This weird poor optimization seems more related to the chaining of array::map invocations, which wouldn't apply to a const-generic str::splitn, so this is probably off topic.

That's interesting and makes sense. Maybe I should open an issue for array versions of those methods.

Okay, interesting to see. Yeah, I think codegen issues can be sorted out. I think the API would be very nice to have.

On nightly you can do that with array::try_from_fn, something like

let mut split = "hey there world".split(' ');
let words: Option<[_; 3]> = std::array::try_from_fn(|_| split.next());

Though of course that's still not helpful if you want to be able to use a partial result, since if there's only two parts it'll give you a None.

To be pedantic you can't do so yet. It's let [a, b, c]: [_; 3] = foo(); even with the #![feature(generic_arg_infer)].

I think str::split_once has the wrong signature (what is the point of using a heterogeneous datatype that only contains the same type???), there will always be at least one string returned by str::split, so this should be reflected in the signature:

fn split_once<'a, P: Pattern<'a>>(&'a self, delimiter: P) -> (&'a str, Option<&'a str>);

but this function has been stabilized already, so there is no point in discussing that.

Instead, I would suggest changing the proposed signature to something like this:

pub fn str::split_at_most<'a, P, const N: usize>(&'a self, pat: P) -> [Option<&str>; N]
where
    P: Pattern<'a>;

which is in my opinion more useful.

For my personal use I created an extension trait that looks something like this:

#![feature(pattern)]
use core::str::pattern::Pattern;

pub trait StrExt<'a> {
    fn split_once<P: Pattern<'a>>(&self, pattern: P) -> (&'a str, Option<&'a str>);

    fn split_at_most<P, const N: usize>(&self, pattern: P) -> [Option<&'a str>; N]
    where
        P: Pattern<'a>;
}

impl<'a> StrExt<'a> for &'a str {
    fn split_once<P: Pattern<'a>>(&self, pattern: P) -> (&'a str, Option<&'a str>) {
        if let [Some(first), second] = self.split_at_most::<_, 2>(pattern) {
            (first, second)
        } else {
            unreachable!("split_at_most must return at least one element")
        }
    }

    fn split_at_most<P, const N: usize>(&self, pattern: P) -> [Option<&'a str>; N]
    where
        P: Pattern<'a>,
    {
        let mut iter = self.splitn(N, pattern);
        [(); N].map(|_| iter.next())
    }
}
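
The trait above needs nightly because of the unstable Pattern trait. For experimenting on stable, the core of split_at_most can be sketched as a free function. Assumptions: it is restricted to char patterns, and it uses std::array::from_fn in place of [(); N].map.

```rust
/// Returns up to N parts; slots beyond the available parts are None.
/// The last filled slot holds the unsplit remainder, as with splitn.
fn split_at_most<const N: usize>(s: &str, pat: char) -> [Option<&str>; N] {
    let mut iter = s.splitn(N, pat);
    // from_fn walks the array front to back, so parts land in order.
    std::array::from_fn(|_| iter.next())
}
```

For any N of at least 1 the first slot is always Some, since splitting a string always yields at least one part, which is the invariant the unreachable!() branch above relies on.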

Personally, I would prefer if one were to wait for some kind of generic tuples, so you could have this signature:

fn split_at_most<P: Pattern, const N: usize>(&self, pattern: P) -> (&str, Option<&str>..{ N - 1 });

Here is an example that uses those functions:

use core::time::Duration;

use super::ParserError;

trait DurationExt {
    #[inline]
    #[must_use]
    fn from_hours(hours: u64) -> Duration { Duration::from_mins(hours * 60) }

    #[inline]
    #[must_use]
    fn from_mins(mins: u64) -> Duration { Duration::from_secs(mins * 60) }
}

impl DurationExt for Duration {}

fn parse_u64(string: &str) -> Result<u64, ParserError> {
    string
        .parse::<u64>()
        .map_err(|e| ParserError::parse_int_error(e, 0..string.len()))
}

/// Parses a `String` of the following format:
/// `hours:minutes:seconds,milliseconds`
pub fn parse_duration(input: &str) -> Result<Duration, ParserError> {
    if let [Some(hours), Some(minutes), Some(seconds_millis)] = input.split_at_most::<_, 3>(':') {
        if let (seconds, Some(millis)) = seconds_millis.split_once(',') {
            let mut result = Duration::from_secs(0);

            result += Duration::from_hours(parse_u64(hours)?);
            result += Duration::from_mins(parse_u64(minutes)?);
            result += Duration::from_secs(parse_u64(seconds)?);
            result += Duration::from_millis(parse_u64(millis)?);

            return Ok(result);
        }
    }

    Err(ParserError::invalid_duration(0..input.len()))
}

Note: This does not take advantage of the fact that [Option<&str>; N] is returned instead of Option<[&str; N]>, but this could be easily changed by replacing the if let with a match to report better errors:


/// Parses a `String` of the following format:
/// `hours:minutes:seconds,milliseconds`
pub fn parse_duration(input: &str) -> Result<Duration, ParserError> {
    match input.split_at_most::<_, 3>(':') {
        [Some(hours), Some(minutes), Some(seconds_millis)] => {
            if let (seconds, Some(millis)) = seconds_millis.split_once(',') {
                let mut result = Duration::from_secs(0);

                result += Duration::from_hours(parse_u64(hours)?);
                result += Duration::from_mins(parse_u64(minutes)?);
                result += Duration::from_secs(parse_u64(seconds)?);
                result += Duration::from_millis(parse_u64(millis)?);

                Ok(result)
            } else {
                Err(ParserError::InvalidSeconds)
            }
        }
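        [Some(_), Some(_), None] => Err(ParserError::MissingSeconds),
        // Any other shape (for example, only the hours field present) is an invalid duration.
        // The catch-all also makes the match exhaustive.
        _ => Err(ParserError::invalid_duration(0..input.len())),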
    }
}
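
For anyone who wants to try this on stable, here is a self-contained sketch of the same parser. Assumptions: ParseError is a simplified stand-in for the ParserError type used above (whose definition isn't shown), split_at_most is a free function limited to char patterns, and the standard str::split_once is used for the seconds,milliseconds part.

```rust
use std::time::Duration;

// Hypothetical stand-in for the ParserError type used in the post.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField,
    InvalidNumber,
}

fn split_at_most<const N: usize>(s: &str, pat: char) -> [Option<&str>; N] {
    let mut iter = s.splitn(N, pat);
    std::array::from_fn(|_| iter.next())
}

fn parse_u64(s: &str) -> Result<u64, ParseError> {
    s.parse::<u64>().map_err(|_| ParseError::InvalidNumber)
}

/// Parses `hours:minutes:seconds,milliseconds`.
fn parse_duration(input: &str) -> Result<Duration, ParseError> {
    let [Some(hours), Some(minutes), Some(rest)] = split_at_most::<3>(input, ':') else {
        return Err(ParseError::MissingField);
    };
    let (seconds, millis) = rest.split_once(',').ok_or(ParseError::MissingField)?;

    // Note: extremely large field values could overflow these multiplications;
    // a production parser would use checked arithmetic.
    let mut result = Duration::from_secs(parse_u64(hours)? * 3600);
    result += Duration::from_secs(parse_u64(minutes)? * 60);
    result += Duration::from_secs(parse_u64(seconds)?);
    result += Duration::from_millis(parse_u64(millis)?);
    Ok(result)
}
```

Matching on the individual slots of the array, as in the match version above, would let MissingField be split into more specific errors.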

Yeah, I'm aware of that for const generic lengths: "Irrefutable slice patterns should be inferred as arrays of that length" (Issue #76342, rust-lang/rust on GitHub).

But let [r, g, b, a] = my_u32_pixel.to_le_bytes(); does work, which is a situation where people might have wanted a tuple instead before array patterns existed.
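
A small runnable version of that pattern, with a hypothetical helper name (channels) and the return value packed into a tuple only so it can be inspected:

```rust
/// Destructures the little-endian bytes of a pixel with an array pattern.
fn channels(pixel: u32) -> (u8, u8, u8, u8) {
    // to_le_bytes returns [u8; 4], so the array pattern is irrefutable.
    let [r, g, b, a] = pixel.to_le_bytes();
    (r, g, b, a)
}
```

Because the length is part of the return type here, no annotation on the pattern is needed.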

let mut split = "hey there world".split(' ');
let words: Option<[_; 3]> = std::array::try_from_fn(|_| split.next());

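This would need to use splitn(3, ' '); otherwise it wouldn't do what one would expect, because split would cut the remainder at every further separator instead of keeping it whole in the last element.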
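
The difference shows up as soon as the input has more separators than requested parts. A small sketch with hypothetical helper names, collecting into a Vec only to make the results easy to compare:

```rust
/// First three pieces using split: anything after the third piece is dropped.
fn first_three_split(s: &str) -> Vec<&str> {
    s.split(' ').take(3).collect()
}

/// First three pieces using splitn: the third piece keeps the whole remainder.
fn first_three_splitn(s: &str) -> Vec<&str> {
    s.splitn(3, ' ').collect()
}
```

For "hey there big world", the split version yields "hey", "there", "big", while the splitn version yields "hey", "there", "big world", which matches the behavior of split_once.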
