Idea: str::split_suffix/split_prefix

Several times now I have wanted to use str::split but have the resulting pieces include the separator. One question with such a method, though, is whether the separator should go at the start of each piece or at the end.

I'd like to propose two functions for str: split_suffix and split_prefix. They work like split, but include the separator at the end or beginning (respectively). Examples showing split/split_suffix/split_prefix:

let message = "foo\nbar\n\nbaz\n";

let splits: Vec<_> = message.split('\n').collect();
assert_eq!(splits, vec!["foo", "bar", "", "baz", ""]);

let splits: Vec<_> = message.split_suffix('\n').collect();
assert_eq!(splits, vec!["foo\n", "bar\n", "\n", "baz\n"]);

let splits: Vec<_> = message.split_prefix('\n').collect();
assert_eq!(splits, vec!["foo", "\nbar", "\n", "\nbaz", "\n"]);

I guess you could also argue for split_terminator_suffix and split_terminator_prefix (same as split_suffix/split_prefix, but a trailing terminator is omitted if there is one). Or for a variant that puts the separators in their own strings (e.g., ["foo", "\n", "bar", "\n", "\n", "baz", "\n"]). But I'm not interested in those functions. I'm primarily interested in split_suffix. My use case: I want to split a string into lines (on '\n'), but keep every character of the original in the pieces.
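To make the intended semantics concrete, here is a minimal sketch of both behaviours as free functions built on str::match_indices. The function names and the char-only separator are assumptions for illustration. A real proposal would presumably return a lazy iterator and accept any pattern, as str::split does. I have also assumed that split_prefix drops an empty leading piece, mirroring how split_suffix drops an empty trailing one.

```rust
/// Hypothetical helper: like `str::split`, but each piece keeps the
/// separator that ended it. An empty trailing piece is not produced.
fn split_suffix(s: &str, sep: char) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (idx, matched) in s.match_indices(sep) {
        // Cut just after the separator so it stays with the preceding text.
        let end = idx + matched.len();
        pieces.push(&s[start..end]);
        start = end;
    }
    // Whatever follows the last separator (if anything) is the final piece.
    if start < s.len() {
        pieces.push(&s[start..]);
    }
    pieces
}

/// Hypothetical helper: like `str::split`, but each piece keeps the
/// separator that began it. An empty leading piece is not produced
/// (an assumption; the proposal does not cover this case).
fn split_prefix(s: &str, sep: char) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (idx, _) in s.match_indices(sep) {
        // Cut just before the separator so it stays with the following text.
        if idx > start {
            pieces.push(&s[start..idx]);
        }
        start = idx;
    }
    if start < s.len() {
        pieces.push(&s[start..]);
    }
    pieces
}
```

With the message from the example above, split_suffix(message, '\n') gives ["foo\n", "bar\n", "\n", "baz\n"] and split_prefix(message, '\n') gives ["foo", "\nbar", "\n", "\nbaz", "\n"]. In both cases, concatenating the pieces reproduces the original string.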

I can implement this myself as a crate, but I think others would also find this useful.


You should probably publish it as a crate first, then if that gets popular we can contemplate getting it into std. Personally, I think it's probably too niche to be in the standard library.

IMO, there are enough non-trivial ways to customize string splitting/tokenization that we probably want a separate crate for this even if some of it made it into std.

For example, I have often wanted "soft" delimiters and "hard" delimiters as defined by the tokenizer in our BDE library. Basically, if you have a string like this:

"   foo   bar         baz\n\nquux\n    "

and you want to split it into:

["foo", "bar", "baz", "", "quux", ""]

that means you want "\n" to be a "hard delimiter character" (two or more of them in a row means two or more separate delimiters, with empty output tokens between them) and you want " " to be a "soft delimiter character" (two or more of them in a row just means one long delimiter). As the link above shows, just specifying this API unambiguously takes significant effort.
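As a rough illustration only, here is a small state-machine sketch that reproduces the example above. It is not the BDE implementation. The tokenize function, the State enum, and the edge-case choices are my own assumptions: leading soft delimiters are skipped, a hard delimiter at the very start of the input yields an empty first token, and input ending in a delimiter that contains a hard character yields an empty final token.

```rust
/// Where the scanner currently is (hypothetical helper type).
#[derive(Clone, Copy)]
enum State {
    /// Only soft delimiters (or nothing) seen so far.
    Start,
    /// Inside a token that began at this byte offset.
    InToken(usize),
    /// Inside a delimiter made of soft characters only.
    SoftDelim,
    /// Inside a delimiter that already contains one hard character.
    HardDelim,
}

/// Hypothetical tokenizer: a run of soft characters counts as one
/// delimiter, while every hard character is a delimiter of its own.
fn tokenize<'a>(input: &'a str, soft: &[char], hard: &[char]) -> Vec<&'a str> {
    let mut tokens = Vec::new();
    let mut state = State::Start;
    for (i, c) in input.char_indices() {
        state = if hard.contains(&c) {
            match state {
                State::InToken(start) => tokens.push(&input[start..i]),
                // A second hard character (or one at the very start)
                // delimits an empty token.
                State::Start | State::HardDelim => tokens.push(""),
                // Soft characters before a hard one belong to the same delimiter.
                State::SoftDelim => {}
            }
            State::HardDelim
        } else if soft.contains(&c) {
            match state {
                State::InToken(start) => {
                    tokens.push(&input[start..i]);
                    State::SoftDelim
                }
                // Soft characters never change any other state.
                other => other,
            }
        } else {
            match state {
                State::InToken(start) => State::InToken(start),
                _ => State::InToken(i),
            }
        };
    }
    match state {
        State::InToken(start) => tokens.push(&input[start..]),
        // Input ended right after a hard delimiter: emit the empty token it implies.
        State::HardDelim => tokens.push(""),
        State::Start | State::SoftDelim => {}
    }
    tokens
}
```

Calling tokenize("   foo   bar         baz\n\nquux\n    ", &[' '], &['\n']) returns ["foo", "bar", "baz", "", "quux", ""]. Even this small version has to settle several edge cases by fiat, which is part of why I think it belongs in a crate.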

I suspect if we thought about it some more we could come up with more bells and whistles than just my delimiter softness and @mjbshaw's delimiter retention. Fortunately, the current str::split appears to be uncontroversially what most people want most of the time, so letting the fancier notions get implemented in a much fancier crate seems like the right move to me.