Allow &[&str] in Pattern

As some may know, we're anti-char. As such it would be nice to have &[&str] and &[String] in std::str::pattern::Pattern, for consistency with discouraging indiscriminate use of USVs where actual characters/grapheme clusters may be more helpful. With potential optimizations for when the strings are sorted, ofc.

Are you suggesting that str.find(&["a", "b", "c"]) would behave like str.find("abc")?

I would intuitively expect it to do something different: find any of a, b, or c.

str.find(&["a", "b", "c", "d"]) should behave like str.find(&['a', 'b', 'c', 'd']):

fn main() {
    println!("{:?}", "Hello, world!".find(&['a', 'b', 'c', 'd'] as &[char])); // Some(11)
}

If you want to request a feature from T-libs, you probably want to open an issue on rust-lang/rust.

While at it: maybe also add &[char; N] and &[&str; N]? :sweat_smile:

I don't think we're "anti-char". char has its valid use cases. But I think that implementing Pattern for &[&str] is useful even when the strings in the slice aren't single Unicode graphemes. For example, a parser could use this:

input.find(&["//", "/*", "\"", "'"][..])

or a simple profanity filter:

haystack.find(&["asshat", "douchebag", /* more swear words */][..])

There are many more use cases.
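
For illustration, here is a minimal sketch of the "find any of these strings" semantics using only stable std. `find_any` is a hypothetical helper name, not an existing std API, and it scans the haystack once per needle, so it is the naive approach rather than anything optimized.

```rust
/// Hypothetical helper: byte offset of the earliest match of any needle,
/// or `None` if no needle occurs in the haystack.
fn find_any(haystack: &str, needles: &[&str]) -> Option<usize> {
    needles
        .iter()
        // Each needle is searched independently with the stable `&str` pattern.
        .filter_map(|needle| haystack.find(*needle))
        // The earliest match across all needles wins.
        .min()
}
```

With this, `find_any("let x = 1; // comment", &["//", "/*", "\"", "'"])` gives the offset of the `//`.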

Ideally, the Pattern trait would be stabilized so it could be implemented for types outside of std, like Regex :slight_smile:

The regex crate does implement it, but you have to enable its pattern feature, which requires nightly because the Pattern trait is unstable (see the regex::Regex docs).

Also, the multi-substring search API suggested here doesn't make a ton of sense since building the multi-substring searcher is typically expensive. (Unless you don't mind using a naive and slow search algorithm.)

The crate you want to do this for you is aho-corasick.

you can optimize if the array is sorted. but please deprecate char, as char generally leads to broken unicode handling. :) (see also: can't get a char from an str[usize] operation)

Please explain. stdlib's implementation works as expected, and nothing the user does could lead to things being broken there. Regardless, trait implementations cannot be deprecated even if it was wanted.

As we said:

Please don't use char, it's generally the wrong tool for the job. You can't even do e.g. 'é' in Rust, and if you attempt to match on 'é' it'll randomly fail even tho it'll look like it matches perfectly.

A naive and a sorted multi-substring searcher would be more than good enough.

Using a &str as pattern instead wouldn't help in this case. To handle it properly, you have to normalize the string first, and then it doesn't matter whether you use 'é' or "é" as pattern.
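
A small sketch of that point, using escapes so the bytes survive copy-paste. The two function names are made up for the example; both search for the precomposed U+00E9, once as a char and once as a &str.

```rust
/// Searches for the precomposed U+00E9 as a `char` pattern.
fn find_as_char(haystack: &str) -> Option<usize> {
    haystack.find('\u{e9}')
}

/// Searches for the precomposed U+00E9 as a `&str` pattern.
fn find_as_str(haystack: &str) -> Option<usize> {
    haystack.find("\u{e9}")
}
```

Against `"caf\u{e9}"` both return `Some(3)`. Against the decomposed `"cafe\u{301}"` both return `None`, so switching the pattern type alone changes nothing.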

you'd use &["é", "é"] as the pattern.

Imagine seeing that in some source code and wondering if it's correct, and then finding out it's actually the same bytes two times. This is the case in your post, which has most likely been normalized somewhere along the way over the web.

It's probably a good idea to use escapes when the distinction matters, so that the meaning of the code doesn't change under Unicode normalization:

["\u{e9}", "\u{65}\u{0301}"] /* two forms of é */
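
As a rough sketch of how the escaped forms could be used today on stable (the constant and function names are invented for the example):

```rust
/// The two spellings of "é" from above: precomposed, and 'e' + combining acute.
const E_ACUTE_FORMS: [&str; 2] = ["\u{e9}", "\u{65}\u{0301}"];

/// Byte offset of the first occurrence of either form, if any.
fn find_e_acute(haystack: &str) -> Option<usize> {
    E_ACUTE_FORMS
        .iter()
        .filter_map(|form| haystack.find(*form))
        .min()
}
```

The two forms are different byte sequences (2 and 3 bytes of UTF-8 respectively), which is exactly why a search for only one of them can miss text that looks identical on screen.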

honestly we were mostly too lazy to type in the correct unicode. besides, most ppl wouldn't know how to tell. but yes ideally you'd spell them out like that instead. :p

point being, &[&str] would be good. ^^

Define "sorted". Since we're dealing with Unicode, there are at least 5 different ways of sorting strings, at least two of which are locale-sensitive.

And in any case, Rust prefers avoiding APIs (where possible) that are just magically more or less performant depending on minor source code details.

It might be reasonable to have a pattern object constructor assume sorted input and miss some matches if it isn't, but it would be expected to make that known and offer a version that does the sort for you.
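
Roughly what that could look like, as a sketch only. `SortedNeedles`, `new`, `from_sorted_unchecked` and `find_in` are all hypothetical names, "sorted" here is assumed to mean UTF-8 byte order (the order of `str`'s `Ord`), and the binary search only narrows candidates by first byte before checking prefixes linearly.

```rust
/// Hypothetical needle set, expected to be sorted by UTF-8 byte order.
struct SortedNeedles<'a> {
    needles: Vec<&'a str>,
}

/// First byte of a needle, or `None` for the empty string (which sorts first).
fn first_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied()
}

impl<'a> SortedNeedles<'a> {
    /// Does the sort for the caller and drops empty needles.
    fn new(needles: &[&'a str]) -> Self {
        let mut needles: Vec<&'a str> =
            needles.iter().copied().filter(|n| !n.is_empty()).collect();
        needles.sort_unstable();
        SortedNeedles { needles }
    }

    /// Trusts the caller: the input must already be sorted by byte order.
    /// If it is not, matches may be missed (but nothing panics).
    fn from_sorted_unchecked(needles: &[&'a str]) -> Self {
        SortedNeedles { needles: needles.to_vec() }
    }

    /// Byte offset of the first position where any needle matches.
    fn find_in(&self, haystack: &str) -> Option<usize> {
        let bytes = haystack.as_bytes();
        (0..bytes.len()).find(|&i| {
            let first = bytes[i];
            // Needles sharing a first byte are contiguous in byte order,
            // so two binary searches bound the candidate range.
            let lo = self.needles.partition_point(|n| first_byte(n) < Some(first));
            let hi = self.needles.partition_point(|n| first_byte(n) <= Some(first));
            // `get` returns `None` for an inverted range (unsorted input).
            self.needles.get(lo..hi).map_or(false, |candidates| {
                candidates.iter().any(|n| bytes[i..].starts_with(n.as_bytes()))
            })
        })
    }
}
```

Note that the sorted-input requirement is in the constructor's name and docs, rather than being something the search silently depends on.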

The correct answer isn't to "just make &[&str] work as a pattern", as convenient as that would be, because of all the little pitfalls involved, which have been illustrated here.

The best solution would be to stabilize the Pattern trait so that you can use things like aho-corasick.

the naive approach works for [char]. so what's the problem? but in particular, sorted by UTF-8 byte order would be fine, and would allow binary searching the pattern (altho that's kinda slow so it might be worse than the naive approach).

and the point is moot when using actual arrays, as min const generics allows rust to create a copy of the array internally and sort it then.
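
a sketch of the array idea, assuming "sorted" means plain byte order. `sorted_copy` is a made-up name; the array is taken by value, so the caller's copy is left as it was.

```rust
/// Returns a copy of the array sorted by UTF-8 byte order (`str`'s `Ord`).
/// `[&str; N]` is `Copy`, so passing it by value does not disturb the original.
fn sorted_copy<'a, const N: usize>(mut needles: [&'a str; N]) -> [&'a str; N] {
    needles.sort_unstable();
    needles
}
```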

in any case we don't intend them to be a replacement for grep, but a replacement for &[char] Patterns.

(fwiw, we're pretty sure something like trim_start_matches would be faster even with a naive [&str] than with a [char], as it entirely avoids decoding)
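
a naive sketch of what that could look like on stable. `trim_start_any` is a hypothetical helper built on `str::strip_prefix`; no benchmark backs the speed claim here, this only shows the semantics.

```rust
/// Repeatedly strips any of the needles from the start of the haystack.
/// Empty needles are skipped, since they would otherwise match forever.
fn trim_start_any<'h>(haystack: &'h str, needles: &[&str]) -> &'h str {
    let mut rest = haystack;
    'outer: loop {
        for needle in needles.iter().filter(|n| !n.is_empty()) {
            // Plain byte-prefix comparison; no char decoding is needed.
            if let Some(stripped) = rest.strip_prefix(*needle) {
                rest = stripped;
                continue 'outer;
            }
        }
        // No needle matched at the current start: done.
        return rest;
    }
}
```

for single-char needles this agrees with `trim_start_matches(&['a', 'b'][..])`, and it also handles multi-code-point needles such as `"e\u{301}"`.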

As an aside, the semantics of a char slice as a pattern (which is to say, match any of the chars) seem to be rather poorly documented. Probably partly because the Pattern trait and its impls are still unstable…

The pattern can be a &str , char , a slice of char s, or a function or closure that determines if a character matches.
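
To make the four documented pattern kinds concrete, here is a small sketch on stable Rust (the function name is just for the example). The interesting contrast is between the `&str`, which must match as a whole substring, and the char slice, which matches any one of its chars.

```rust
/// Runs `str::find` with each documented pattern kind on the same haystack.
fn find_with_each_pattern_kind(haystack: &str) -> [Option<usize>; 4] {
    [
        haystack.find("wo"),                        // &str: the whole substring
        haystack.find('o'),                         // char: that one char
        haystack.find(&['w', 'o'][..]),             // &[char]: any of the chars
        haystack.find(|c: char| c.is_whitespace()), // closure: first char it accepts
    ]
}
```

For `"Hello, world!"` the substring `"wo"` is found at offset 7, while the slice `['w', 'o']` already matches the `o` at offset 4.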

Oh yeah. We'd say having a slice of &strs as a valid pattern would also help with that.

char is four bytes, so backtracking isn't a problem. &str can be arbitrarily long.

char is anywhere between 1 and 4 bytes, when matching against &str. either you scan the string as chars (which means decoding) or as bytes (which means encoding). with &str you're just matching raw bytes against raw bytes (thanks to UTF-8-provided guarantees).
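
both statements about size hold, just at different levels, which a short sketch can show (helper names are invented for the example): a `char` value occupies 4 bytes in memory, but its UTF-8 encoding inside a `&str` takes 1 to 4 bytes, so a byte-level comparison has to encode it first.

```rust
/// Size of a `char` value in memory (always 4 bytes).
fn char_size_in_memory() -> usize {
    std::mem::size_of::<char>()
}

/// Checks whether `haystack` starts with `c` by encoding `c` to UTF-8
/// and comparing raw bytes, instead of decoding the haystack into chars.
fn starts_with_char_bytes(haystack: &str, c: char) -> bool {
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    haystack.as_bytes().starts_with(encoded.as_bytes())
}
```

`char::len_utf8` reports the encoded length: 1 for `'a'`, 2 for `'\u{e9}'`, 3 for `'\u{20ac}'`, 4 for `'\u{1f600}'`.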