Add split_ascii_whitespace to [u8]?

str and [u8] share a bunch of *ASCII* methods. There is (at least?) one exception: str::split_ascii_whitespace was stabilized in 1.34, but does not exist on [u8], which seems like an oversight. It's not difficult to work around with split(u8::is_ascii_whitespace), but it might warrant addition for consistency. There's also the gotcha that you have to remember to filter out empty subslices to get the same semantics. The str implementation already works on bytes internally:

pub fn split_ascii_whitespace(&self) -> SplitAsciiWhitespace<'_> {
    let inner = self.as_bytes()
        .split(IsAsciiWhitespace)
        .filter(BytesIsNotEmpty)
        .map(UnsafeBytesToStr);
    SplitAsciiWhitespace { inner }
}

This method is useful for ad-hoc parsers of possibly-non-utf8 text. It (and split_whitespace) is similarly missing from ByteStr, and could be added to that as well.
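For reference, the workaround mentioned above can be sketched as a free function. The name and signature here are placeholders of mine, not the proposed std API; the point is only that the filter step is what makes it match the str semantics.

```rust
/// Hypothetical stand-in for the proposed `<[u8]>::split_ascii_whitespace`.
/// Splits on ASCII whitespace and drops the empty subslices that `split`
/// produces for leading, trailing, and repeated separators.
fn split_ascii_whitespace<'a>(bytes: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
    bytes
        .split(u8::is_ascii_whitespace)
        .filter(|field| !field.is_empty())
}
```

Without the filter, b"a  b".split(u8::is_ascii_whitespace) yields three subslices (the middle one empty), which is the gotcha.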

I'd suggest filing an ACP.

Note that this method (with a different name) exists in bstr; see the ByteSlice trait in the bstr docs.

Should this be a method on [Char]?

Yes, probably. But as of now [Char] doesn't have any methods of its own, I think (Edit after checking: besides as_str and as_bytes).

Kind of seems like it should be a default method on Split + Seq<IsAsciiable>. But Rust doesn't really have an organizational abstraction that helps with that beyond "absolute pile of traits".

That's a separate issue. You may have data in unspecified ASCII-superset encoding, or even UTF-8, but want to split it first to avoid converting/validating the whole string.


Good luck optimizing that.

But then splitting on ASCII whitespace doesn't make much sense: for instance, in the Latin-1 encoding 0x85 is whitespace ("next line").

But this proposed method would do just that: split byte-soup text on ASCII whitespace, and nothing else. u8 already has is_ascii_whitespace, which the str implementation uses, and that implementation already works on [u8] behind the scenes.
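To make the 0x85 case concrete, here is a small sketch of what that means in practice. The helper name is made up for illustration; only u8::is_ascii_whitespace and slice::split are real std APIs.

```rust
/// Hypothetical helper: count the fields separated by ASCII whitespace.
fn count_ascii_fields(bytes: &[u8]) -> usize {
    bytes
        .split(u8::is_ascii_whitespace)
        .filter(|field| !field.is_empty())
        .count()
}

/// `u8::is_ascii_whitespace` accepts space, tab, LF, FF and CR only.
/// Bytes >= 0x80 are never separators, whatever the encoding calls them.
fn is_separator(byte: u8) -> bool {
    byte.is_ascii_whitespace()
}
```

So is_separator(0x85) is false, and b"a\x85b c" comes out as two fields, with the 0x85 byte left untouched inside the first one. Whether that byte means "next line" is a question for whoever knows the encoding.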

My present use case is parsing Wavefront .OBJ files, but there are of course many, many textual file formats made of fields separated by ASCII whitespace where you don't want to needlessly assume encoding beyond "ASCII superset".
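Roughly what I have in mind, as a sketch rather than a real OBJ parser: parse_vertex is a hypothetical helper, and it uses the split + filter workaround since the proposed method doesn't exist yet. Only the tokens that are actually parsed get UTF-8 validated.

```rust
/// Hypothetical helper: parse a Wavefront OBJ `v x y z` line given as raw bytes.
/// Returns None for other record types or malformed coordinates.
fn parse_vertex(line: &[u8]) -> Option<[f32; 3]> {
    // Stand-in for the proposed `line.split_ascii_whitespace()`.
    let mut fields = line
        .split(u8::is_ascii_whitespace)
        .filter(|field| !field.is_empty());

    // The first field is the record type; only plain vertices are handled.
    if fields.next()? != b"v".as_slice() {
        return None;
    }

    let mut coords = [0.0f32; 3];
    for slot in &mut coords {
        // Validate just this one token as UTF-8, then parse it as a float.
        let text = std::str::from_utf8(fields.next()?).ok()?;
        *slot = text.parse().ok()?;
    }
    // Any remaining fields (optional w, trailing junk) are ignored unread.
    Some(coords)
}
```

A line like b"v 1.0 -2.5 3 # caf\xe9" parses fine even though the whole line is not valid UTF-8, because the trailing bytes are never looked at.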


It makes sense because the encoding isn't always known (when searching files, for example, it's rarely known) and you may prioritize performance above other things. See the crate docs for bstr and my write-ups on motivation from a conceptual perspective and performance perspective. These use different examples than "split on ASCII whitespace," but the same ideas apply.

Opened an ACP.