Allow &[&str] in Pattern

Soni · February 2, 2021, 1:18pm

As some may know, we're anti-char. As such it would be nice to have &[&str] and &[String] in std::str::pattern::Pattern, for consistency with discouraging indiscriminate use of USVs where actual characters/grapheme clusters may be more helpful. With potential optimizations for when the strings are sorted, ofc.

comex · February 3, 2021, 4:56pm

Are you suggesting that str.find(&["a", "b", "c"]) would behave like str.find("abc")?

I would intuitively expect it to do something different: find any of a, b, or c.

Soni · February 3, 2021, 5:21pm

str.find(&["a", "b", "c", "d"]) should behave like str.find(&['a', 'b', 'c', 'd']):

fn main() {
    println!("{:?}", "Hello, world!".find(&['a', 'b', 'c', 'd'] as &[char])); // Some(11)
}

camelid · February 3, 2021, 5:28pm

If you want to request a feature from T-libs, you probably want to open an issue on rust-lang/rust.

Soni · February 3, 2021, 5:31pm

While at it: maybe also add &[char; N] and &[&str; N]?

Aloso · February 6, 2021, 10:07pm

I don't think we're "anti-char". char has its valid use cases. But I think that implementing Pattern for &[&str] is useful even when the strings in the slice aren't single Unicode graphemes. For example, a parser could use this:

input.find(&["//", "/*", "\"", "'"][..])

or a simple profanity filter:

haystack.find(&["asshat", "douchebag", /* more swear words */][..])

There are many more use cases.

Ideally, the Pattern trait would be stabilized so it could be implemented for types outside of std, like Regex.
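As a rough illustration of the "find any of these strings" semantics discussed above, here is a minimal sketch using only stable std APIs. The helper name find_any is hypothetical and not part of std. It uses the naive approach (try every needle at every position) and assumes the needles are non-empty.

```rust
/// Hypothetical helper (not in std): byte offset of the earliest position
/// at which any of the needles matches. Needles are assumed non-empty.
fn find_any(haystack: &str, needles: &[&str]) -> Option<usize> {
    // Walk the haystack one char boundary at a time.
    for (pos, _) in haystack.char_indices() {
        let rest = &haystack[pos..];
        // Naive inner loop: test every needle as a prefix of the remainder.
        if needles.iter().any(|needle| rest.starts_with(*needle)) {
            return Some(pos);
        }
    }
    None
}

fn main() {
    let input = "let x = 1; // comment";
    println!("{:?}", find_any(input, &["//", "/*", "\"", "'"])); // Some(11)
}
```

The cost is roughly haystack length times number of needles, which is the trade-off a dedicated multi-substring searcher is meant to avoid.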

burntsushi · February 6, 2021, 11:46pm

The regex crate does implement it, but you have to enable the pattern feature; see the regex::Regex documentation.

Also, the multi-substring search API suggested here doesn't make a ton of sense since building the multi-substring searcher is typically expensive. (Unless you don't mind using a naive and slow search algorithm.)

The crate you want to do this for you is aho-corasick.

Soni · February 7, 2021, 2:21am

you can optimize if the array is sorted. but please deprecate char, as char generally leads to broken unicode handling. :) (see also: can't get a char from an str[usize] operation)

jhpratt · February 7, 2021, 8:46am

Please explain. stdlib's implementation works as expected, and nothing the user does could lead to things being broken there. Regardless, trait implementations cannot be deprecated even if it was wanted.

Soni · February 7, 2021, 12:06pm

As we said:

Please don't use char, it's generally the wrong tool for the job. You can't even do e.g. 'é' in Rust, and if you attempt to match on 'é' it'll randomly fail even tho it'll look like it matches perfectly.

A naive and a sorted multi-substring searcher would be more than good enough.

Aloso · February 7, 2021, 1:33pm

Using a &str as pattern instead wouldn't help in this case. To handle it properly, you have to normalize the string first, and then it doesn't matter whether you use 'é' or "é" as pattern.

Soni · February 7, 2021, 2:10pm

you'd use &["é", "é"] as the pattern.

quaternic · February 7, 2021, 6:24pm

Imagine seeing that in some source code and wondering if it's correct, and then finding out it's actually the same bytes two times. This is the case in your post, which has most likely been normalized somewhere along the way over the web.

It's probably a good idea to use escapes when the distinction matters, so that the meaning of the code doesn't change under Unicode normalization:

["\u{e9}", "\u{65}\u{0301}"] /* two forms of é */
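To make the two forms concrete, here is a small sketch using only std. The helper name find_e_acute is hypothetical. It searches for both spellings of é and returns the earliest hit, while a single char pattern only sees the precomposed form.

```rust
/// Hypothetical helper: byte offset of the first "é" in either the
/// precomposed (U+00E9) or the decomposed (U+0065 U+0301) form.
fn find_e_acute(haystack: &str) -> Option<usize> {
    const FORMS: [&str; 2] = ["\u{e9}", "\u{65}\u{0301}"];
    // Search for each form separately and keep the smallest offset.
    FORMS.iter().filter_map(|form| haystack.find(*form)).min()
}

fn main() {
    let composed = "caf\u{e9}";
    let decomposed = "cafe\u{301}";

    // A char pattern matches only the precomposed code point.
    println!("{:?}", composed.find('\u{e9}'));   // Some(3)
    println!("{:?}", decomposed.find('\u{e9}')); // None

    // Searching for both spellings finds either form.
    println!("{:?}", find_e_acute(composed));    // Some(3)
    println!("{:?}", find_e_acute(decomposed));  // Some(3)
}
```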

Soni · February 7, 2021, 6:55pm

honestly we were mostly too lazy to type in the correct unicode. besides, most ppl wouldn't know how to tell. but yes ideally you'd spell them out like that instead. :p

point being, &[&str] would be good. ^^

CAD97 · February 7, 2021, 7:07pm

Define "sorted". Since we're dealing with Unicode, there are at least 5 different ways of sorting strings, at least two of which are locale-sensitive.

And in any case, Rust prefers avoiding APIs (where possible) that are just magically more or less performant depending on minor source code details.

It might be reasonable to have a pattern object constructor assume sorted input and miss some matches if it isn't, but it would be expected to make that known and offer a version that does the sort for you.

The correct answer isn't to "just make &[&str] work as a pattern", as convenient as that would be, because of all the little pitfalls involved, which have been illustrated here.

The best solution would be to stabilize the Pattern trait so that you can use things like aho-corasick.
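As one illustration of why "sorted" needs defining: Rust's built-in ordering for str compares UTF-8 bytes, which is not dictionary order. The sketch below uses only std, and the helper name sort_bytewise is hypothetical.

```rust
/// Hypothetical helper: sort needles with the default `Ord` for `str`,
/// which compares the underlying UTF-8 bytes lexicographically.
fn sort_bytewise(mut needles: Vec<&str>) -> Vec<&str> {
    needles.sort_unstable();
    needles
}

fn main() {
    // "\u{e9}" is the precomposed é.
    let sorted = sort_bytewise(vec!["\u{e9}", "z", "a", "Z"]);
    // Byte order puts uppercase ASCII first and non-ASCII text last:
    // "Z" (0x5A) < "a" (0x61) < "z" (0x7A) < "é" (0xC3 0xA9).
    println!("{:?}", sorted);
}
```

A locale-aware collation would typically place é next to e, so a searcher that assumes sorted input would have to state which order it means.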

Soni · February 7, 2021, 7:27pm

the naive approach works for [char]. so what's the problem? but in particular, sorted by UTF-8 byte order would be fine, and would allow binary searching the pattern (altho that's kinda slow so it might be worse than the naive approach).

and the point is moot when using actual arrays, as min const generics allows rust to create a copy of the array internally and sort it then.

in any case we don't intend them to be a replacement for grep, but a replacement for &[char] Patterns.

(fwiw, we're pretty sure something like trim_start_matches would be faster even with a naive [&str] over an [char] as it entirely avoids decoding)
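For reference, a naive string-slice counterpart to trim_start_matches could look like the sketch below. The name trim_start_any is hypothetical and not part of std, empty needles are skipped to avoid looping forever, and no performance claim from the thread has been measured here.

```rust
/// Hypothetical helper: repeatedly strip any of the given non-empty
/// needles from the start of `s`, comparing prefixes as strings.
fn trim_start_any<'a>(mut s: &'a str, needles: &[&str]) -> &'a str {
    'outer: loop {
        for needle in needles {
            if needle.is_empty() {
                continue; // an empty needle would match forever
            }
            if let Some(rest) = s.strip_prefix(*needle) {
                s = rest;
                continue 'outer; // restart with the shortened string
            }
        }
        // No needle matched at the current start: done.
        return s;
    }
}

fn main() {
    let input = "  \t hello";
    // Existing std behavior with a char slice pattern.
    println!("{:?}", input.trim_start_matches(&[' ', '\t'][..])); // "hello"
    // The same result with string needles.
    println!("{:?}", trim_start_any(input, &[" ", "\t"])); // "hello"
}
```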

jdahlstrom · February 7, 2021, 8:36pm

As an aside, the semantics of a char slice as a pattern (which is to say, match any of the chars) seem to be rather poorly documented. Probably partly because the Pattern trait and its impls are still unstable…

The pattern can be a &str, char, a slice of chars, or a function or closure that determines if a character matches.
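The pattern kinds listed in that sentence can be compared side by side. This sketch uses only stable std calls; the wrapper function names are made up for illustration.

```rust
/// A slice of chars matches the first occurrence of ANY listed char.
fn first_vowel(s: &str) -> Option<usize> {
    let vowels: &[char] = &['a', 'e', 'i', 'o', 'u'];
    s.find(vowels)
}

/// A closure (or function) matches the first char for which it returns true.
fn first_whitespace(s: &str) -> Option<usize> {
    s.find(|c: char| c.is_whitespace())
}

fn main() {
    let text = "Hello, world!";
    println!("{:?}", text.find("lo"));        // &str: whole substring, Some(3)
    println!("{:?}", text.find('w'));         // char: Some(7)
    println!("{:?}", first_vowel(text));      // char slice, any-of: Some(1)
    println!("{:?}", first_whitespace(text)); // closure: Some(6)
}
```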

Soni · February 7, 2021, 9:19pm

Oh yeah. We'd say having a slice of &strs as a valid pattern would also help with that.

notriddle · February 9, 2021, 12:15am

char is four bytes, so backtracking isn't a problem. &str can be arbitrarily long.

Soni · February 9, 2021, 1:01am

char is anywhere between 1 and 4 bytes, when matching against &str. either you scan the string as chars (which means decoding) or as bytes (which means encoding). with &str you're just matching raw bytes against raw bytes (thanks to UTF-8-provided guarantees).
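The encoding point can be checked with std alone. In this sketch the helper names are hypothetical. A char is encoded to its 1 to 4 UTF-8 bytes once, and those bytes are then compared against the haystack's bytes without decoding the haystack.

```rust
/// Number of UTF-8 bytes each char of `s` occupies.
fn utf8_lens(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Hypothetical helper: check whether `haystack` starts with `c` by
/// encoding `c` once and comparing raw bytes.
fn starts_with_char_bytes(haystack: &str, c: char) -> bool {
    let mut buf = [0u8; 4]; // a char never needs more than 4 bytes
    let encoded: &str = c.encode_utf8(&mut buf);
    haystack.as_bytes().starts_with(encoded.as_bytes())
}

fn main() {
    // 'a', U+00E9, U+20AC, U+1F600
    println!("{:?}", utf8_lens("a\u{e9}\u{20ac}\u{1f600}")); // [1, 2, 3, 4]
    println!("{}", starts_with_char_bytes("\u{e9}t\u{e9}", '\u{e9}')); // true
}
```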
