As some may know, we're anti-char. As such, it would be nice to have &[&str] and &[String] in std::str::pattern::Pattern, for consistency with discouraging indiscriminate use of USVs where actual characters/grapheme clusters may be more helpful. With potential optimizations for when the strings are sorted, ofc.
Are you suggesting that str.find(&["a", "b", "c"]) would behave like str.find("abc")?
I would intuitively expect it to do something different: find any of "a", "b", or "c".
str.find(&["a", "b", "c", "d"]) should behave like str.find(&['a', 'b', 'c', 'd']):
fn main() {
println!("{:?}", "Hello, world!".find(&['a', 'b', 'c', 'd'] as &[char])); // Some(11)
}
If you want to request a feature from T-libs, you probably want to open an issue on rust-lang/rust.
While at it: maybe also add &[char; N] and &[&str; N]?
I don't think we're "anti-char". char has its valid use cases. But I think that implementing Pattern for &[&str] is useful even when the strings in the slice aren't single Unicode graphemes. For example, a parser could use this:
input.find(&["//", "/*", "\"", "'"][..])
or a simple profanity filter:
haystack.find(&["asshat", "douchebag", /* more swear words */][..])
There are many more use cases.
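Until such an impl exists, a stand-in can be written in a few lines of plain std. This is a naive sketch (find_any is a hypothetical helper name, not a std API) that reports the leftmost position at which any needle matches:

```rust
/// Naive multi-substring search: returns the byte offset and matching
/// needle for a leftmost match of any needle, if one exists. Worst-case
/// cost is O(haystack.len() * needles.len()); a real implementation
/// would use something like Aho-Corasick instead.
fn find_any<'n>(haystack: &str, needles: &[&'n str]) -> Option<(usize, &'n str)> {
    needles
        .iter()
        .filter_map(|&n| haystack.find(n).map(|pos| (pos, n)))
        .min_by_key(|&(pos, _)| pos)
}

fn main() {
    let input = "let x = 1; // a comment";
    // Find whichever comment/string delimiter appears first.
    println!("{:?}", find_any(input, &["//", "/*", "\"", "'"])); // Some((11, "//"))
}
```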
Ideally, the Pattern trait would be stabilized so it could be implemented for types outside of std, like Regex.
The regex crate does implement it, but you have to enable the pattern feature: regex::Regex - Rust
Also, the multi-substring search API suggested here doesn't make a ton of sense since building the multi-substring searcher is typically expensive. (Unless you don't mind using a naive and slow search algorithm.)
The crate you want for this is aho-corasick.
you can optimize if the array is sorted. but please deprecate char, as char generally leads to broken unicode handling. :) (see also: you can't get a char from an str[usize] operation)
Please explain. stdlib's implementation works as expected, and nothing the user does could lead to things being broken there. Regardless, trait implementations cannot be deprecated, even if that were wanted.
As we said: please don't use char, it's generally the wrong tool for the job. You can't even write the decomposed form of e.g. 'é' as a char in Rust, and if you attempt to match on 'é' it'll randomly fail even tho it'll look like it matches perfectly.
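To make that failure mode concrete, here is a self-contained sketch (the escapes make the two encodings explicit): the precomposed 'é' is a single char, while the canonically equivalent decomposed form is two scalar values and can never match it:

```rust
fn main() {
    let precomposed = "caf\u{e9}";   // "café" with é as one scalar (U+00E9)
    let decomposed = "cafe\u{301}";  // "café" as e + combining acute accent

    // A char pattern only matches the precomposed form:
    assert!(precomposed.contains('\u{e9}'));
    assert!(!decomposed.contains('\u{e9}'));

    // The two strings render identically but compare unequal:
    assert_ne!(precomposed, decomposed);
    println!("both render as café, yet only one contains the char é");
}
```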
A naive and a sorted multi-substring searcher would be more than good enough.
Using a &str as pattern instead wouldn't help in this case. To handle it properly, you have to normalize the string first, and then it doesn't matter whether you use 'é' or "é" as the pattern.
you'd use &["é", "é"] as the pattern.
Imagine seeing that in some source code and wondering if it's correct, and then finding out it's actually the same bytes two times. This is the case in your post, which has most likely been normalized somewhere along the way over the web.
It's probably a good idea to use escapes when the distinction matters, so that the meaning of the code doesn't change under Unicode normalization:
["\u{e9}", "\u{65}\u{0301}"] /* two forms of é */
honestly we were mostly too lazy to type in the correct unicode. besides, most ppl wouldn't know how to tell. but yes ideally you'd spell them out like that instead. :p
point being, &[&str] would be good. ^^
Define "sorted". Since we're dealing with Unicode, there are at least 5 different ways of sorting strings, at least two of which are locale-sensitive.
And in any case, Rust prefers avoiding APIs (where possible) that are just magically more or less performant depending on minor source code details.
It might be reasonable to have a pattern object constructor assume sorted input and miss some matches if it isn't, but it would be expected to make that known and offer a version that does the sort for you.
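As a sketch of what such an API shape could look like (all names here — MultiPattern, new, from_sorted_unchecked — are hypothetical, not anything in std), with the sorted-input assumption made explicit in the constructor names:

```rust
/// A hypothetical multi-substring pattern. Needles are kept sorted by
/// UTF-8 byte order so a smarter implementation could binary search.
struct MultiPattern<'a> {
    needles: Vec<&'a str>,
}

impl<'a> MultiPattern<'a> {
    /// Sorts the needles for you; always safe to use.
    fn new(mut needles: Vec<&'a str>) -> Self {
        needles.sort_unstable();
        MultiPattern { needles }
    }

    /// Assumes the caller already sorted the needles; unsorted input
    /// could silently miss matches in an optimized implementation.
    fn from_sorted_unchecked(needles: Vec<&'a str>) -> Self {
        debug_assert!(needles.windows(2).all(|w| w[0] <= w[1]));
        MultiPattern { needles }
    }

    /// Naive scan: the byte offset of the first position where any
    /// needle matches, checking only char boundaries.
    fn find(&self, haystack: &str) -> Option<usize> {
        (0..=haystack.len())
            .filter(|&i| haystack.is_char_boundary(i))
            .find(|&i| self.needles.iter().any(|n| haystack[i..].starts_with(n)))
    }
}
```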
The correct answer isn't to "just make &[&str] work as a pattern", as convenient as that would be, because of all the little pitfalls involved, which have been illustrated here.
The best solution would be to stabilize the Pattern trait so that you can use things like aho-corasick.
the naive approach works for [char]. so what's the problem? but in particular, sorted by UTF-8 byte order would be fine, and would allow binary searching the pattern (altho that's kinda slow so it might be worse than the naive approach).
and the point is moot when using actual arrays, as min const generics allows rust to create a copy of the array internally and sort it then.
in any case we don't intend them to be a replacement for grep, but a replacement for &[char] patterns.
(fwiw, we're pretty sure something like trim_start_matches would be faster even with a naive [&str] than with a [char], as it entirely avoids decoding)
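A naive version of that is only a few lines (a sketch; trim_start_any is a hypothetical name). starts_with with a &str needle compares raw bytes, so no code point is ever decoded:

```rust
/// Repeatedly strips any of the given prefixes from the front of `s`.
/// `starts_with` on a &str needle is a raw byte comparison, so this
/// never decodes a single code point.
fn trim_start_any<'a>(mut s: &'a str, prefixes: &[&str]) -> &'a str {
    'outer: loop {
        for p in prefixes {
            if !p.is_empty() && s.starts_with(p) {
                s = &s[p.len()..];
                continue 'outer;
            }
        }
        return s;
    }
}

fn main() {
    println!("{:?}", trim_start_any("-- >>hello", &["--", ">>", " "])); // "hello"
}
```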
As an aside, the semantics of a char slice as a pattern (which is to say, match any of the chars) seem to be rather poorly documented. Probably partly because the Pattern trait and its impls are still unstable…
"The pattern can be a &str, char, a slice of chars, or a function or closure that determines if a character matches."
Oh yeah. We'd say having a slice of &strs as a valid pattern would also help with that.
char is four bytes, so backtracking isn't a problem. &str can be arbitrarily long.
char is anywhere between 1 and 4 bytes when matching against &str. either you scan the string as chars (which means decoding) or as bytes (which means encoding the pattern). with &str you're just matching raw bytes against raw bytes (thanks to UTF-8-provided guarantees).
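A sketch of that raw-byte matching (find_bytes is a hypothetical name): because UTF-8 is self-synchronizing, a valid-UTF-8 needle can only ever match at a char boundary of a valid-UTF-8 haystack, so no boundary checks are needed at all:

```rust
/// Substring search over raw bytes. UTF-8's self-synchronizing property
/// guarantees any match of a valid needle starts on a char boundary.
fn find_bytes(haystack: &str, needle: &str) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .position(|w| w == needle.as_bytes())
}

fn main() {
    let s = "grüße";
    // Agrees with std's find, which also works on byte offsets:
    assert_eq!(find_bytes(s, "ße"), s.find("ße"));
    println!("{:?}", find_bytes(s, "ße")); // Some(4)
}
```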