Add split_ascii_whitespace to [u8]?

jdahlstrom · October 11, 2025, 11:15am

str and [u8] share a bunch of *ascii* methods. There is (at least?) one exception: str::split_ascii_whitespace was stabilized in 1.34, but has no counterpart on [u8], which seems like an oversight. It's not difficult to work around with split(u8::is_ascii_whitespace), but the method might warrant addition for consistency. There's also the gotcha that you have to remember to filter out empty subslices to get the same semantics, because split yields an empty subslice between adjacent separators. The str implementation already works on bytes internally:

pub fn split_ascii_whitespace(&self) -> SplitAsciiWhitespace<'_> {
    let inner = self.as_bytes()
        .split(IsAsciiWhitespace)
        .filter(BytesIsNotEmpty)
        .map(UnsafeBytesToStr);
    SplitAsciiWhitespace { inner }
}
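
For comparison, here is a minimal sketch of the workaround on a plain byte slice. The free function and its name are hypothetical, not an existing std API; only slice::split and u8::is_ascii_whitespace are real std methods.

```rust
/// Hypothetical helper (not part of std): yields the non-empty runs of bytes
/// between ASCII whitespace, mirroring what str::split_ascii_whitespace does.
pub fn split_ascii_whitespace<'a>(bytes: &'a [u8]) -> impl Iterator<Item = &'a [u8]> {
    bytes
        // Split at every byte for which u8::is_ascii_whitespace returns true
        // (space, tab, line feed, form feed, carriage return).
        .split(u8::is_ascii_whitespace)
        // Adjacent, leading, or trailing separators produce empty subslices;
        // drop them to match the str semantics.
        .filter(|field| !field.is_empty())
}
```

Leaving out the filter step is the gotcha: b"a  b".split(u8::is_ascii_whitespace) yields three subslices, the middle one empty.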

This method is useful for ad-hoc parsers of possibly-non-UTF-8 text. It (like split_whitespace) is also missing from ByteStr, and could be added there as well.

burntsushi · October 11, 2025, 12:37pm

I'd suggest filing an ACP.

Note that this method (with a different name) exists in bstr: ByteSlice in bstr - Rust

tczajka · October 11, 2025, 12:43pm

Should this be a method on [Char]?

jdahlstrom · October 11, 2025, 12:47pm

Yes, probably. But as of now [Char] doesn't have any methods of its own yet, I think (Edit after checking: besides as_str and as_bytes).

toc · October 11, 2025, 12:53pm

Kind of seems like it should be a default method on Split + Seq<IsAsciiable>. But rust doesn't really have an organizational abstraction that helps with that beyond "absolute pile of traits".

kornel · October 11, 2025, 1:08pm

That's a separate issue. You may have data in unspecified ASCII-superset encoding, or even UTF-8, but want to split it first to avoid converting/validating the whole string.
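
A rough sketch of that "split first, validate later" pattern, assuming the input is in some ASCII-superset encoding. The helper name is made up for illustration. Splitting UTF-8 on ASCII bytes is safe because bytes below 0x80 never occur inside a multi-byte sequence.

```rust
use std::str;

/// Hypothetical helper: returns the `index`-th ASCII-whitespace-separated
/// field as a &str, validating UTF-8 for that one field only. The rest of
/// the line is never validated or converted.
pub fn nth_field_as_str(line: &[u8], index: usize) -> Option<&str> {
    let field = line
        .split(u8::is_ascii_whitespace)
        .filter(|f| !f.is_empty())
        .nth(index)?;
    // Only this field has to be valid UTF-8.
    str::from_utf8(field).ok()
}
```

With this, a line whose first field is not valid UTF-8 can still have its later fields read as text without rejecting the whole line.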

burntsushi · October 11, 2025, 1:12pm

Good luck optimizing that.

tczajka · October 11, 2025, 1:39pm

But then splitting on ASCII whitespaces doesn't make much sense: for instance, in latin-1 encoding 0x85 is whitespace ("next line").

jdahlstrom · October 11, 2025, 1:49pm

But this proposed method would do just that: split byte soup text on ASCII whitespaces. u8 already has is_ascii_whitespace, used by the str implementation, which already works on [u8] behind the curtains.

My present use case is parsing Wavefront .OBJ files, but there are of course many, many textual file formats made of fields separated by ASCII whitespace where you don't want to needlessly assume encoding beyond "ASCII superset".
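
To make the use case concrete, here is a simplified sketch of parsing an OBJ vertex line ("v x y z") directly from bytes. The function name is hypothetical, and the sketch assumes exactly three coordinates matter: it ignores any trailing fields such as an optional w component and does not handle comments.

```rust
/// Hypothetical sketch: parses a line like b"v 1.0 2.0 3.0" into three
/// floats without assuming anything about the encoding of the rest of the file.
pub fn parse_vertex(line: &[u8]) -> Option<[f32; 3]> {
    let mut fields = line
        .split(u8::is_ascii_whitespace)
        .filter(|f| !f.is_empty());

    // The first field must be exactly the keyword "v".
    if fields.next()? != &b"v"[..] {
        return None;
    }

    let mut coords = [0.0f32; 3];
    for coord in &mut coords {
        // Only the numeric fields are converted to str, one at a time.
        let text = std::str::from_utf8(fields.next()?).ok()?;
        *coord = text.parse().ok()?;
    }
    Some(coords)
}
```

Lines with other keywords (vn, vt, f, ...) or with too few coordinates return None here; a real parser would dispatch on the first field instead.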

burntsushi · October 11, 2025, 1:52pm

It makes sense because the encoding isn't always known (when searching files, for example, it's rarely known) and you may prioritize performance above other things. See the crate docs for bstr and my write-ups on motivation from a conceptual perspective and performance perspective. These use different examples than "split on ASCII whitespace," but the same ideas apply.

jdahlstrom · October 11, 2025, 1:52pm

Opened an ACP.
