I had this idea in my head for the last few months, in the last weeks I actually got around to actually implement it in a preliminary version, and now I’d like to get some feedback and questions about it. 
Current situation
Right now, string slices define a couple of methods to look for some string pattern in it - usually a char or a &str. To be exact, these are all methods in question:
pub trait StrSlice<'a> {
fn contains<'a>(&self, needle: &'a str) -> bool;
fn contains_char(&self, needle: char) -> bool;
fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
fn split_str(&self, &'a str) -> StrSplits<'a>;
fn match_indices(&self, sep: &'a str) -> MatchIndices<'a>;
fn starts_with(&self, needle: &str) -> bool;
fn ends_with(&self, needle: &str) -> bool;
fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn find<C: CharEq>(&self, search: C) -> Option<uint>;
fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
fn find_str(&self, &str) -> Option<uint>;
// ...
}
In this set of methods, there are quite some inconsistencies: Some are defined to work with char, some with CharEq, and some with &str. Furthermore, some operations are duplicated to provide different implementations for char/CharEq and &str.
In my opinion, this is a perfect opportunity to generically abstract over the concept of a “string pattern” to provide a unified API:
The new &str API
The new API is based on the idea of a string Pattern that can be used to obtain a double-ended-iterator-like matcher to find all occurrences of it in a string:
pub trait StrSlice<'a> {
fn contains<M, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn split<M, P: Pattern<'a, M>>(self, pat: P) -> Splits<M>;
fn rsplit<M, P: Pattern<'a, M>>(self, pat: P) -> RSplits<M>;
fn splitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> NSplits<M>;
fn rsplitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> RNSplits<M>;
fn split_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> TermSplits<M>;
fn rsplit_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> RTermSplits<M>;
fn matches<M, P: Pattern<'a, M>>(self, pat: P) -> Matches<M>;
fn rmatches<M, P: Pattern<'a, M>>(self, pat: P) -> RMatches<M>;
fn match_indices<M, P: Pattern<'a, M>>(self, pat: P) -> MatchIndices<M>;
fn rmatch_indices<M, P: Pattern<'a, M>>(self, pat: P) -> RMatchIndices<M>;
fn starts_with<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn ends_with<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn trim_matches<M: DoubleEndedMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn trim_left_matches<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn trim_right_matches<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn find<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
fn rfind<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
// ...
}
Design details
Separate methods for front and back matching
The API is based around finding all non-overlapping occurrences of an arbitrary pattern in a string. This has one issue though:
If a pattern is able to match overlapping substrings, then finding non-overlapping matches from the front becomes a different operation than finding non-overlapping matches from the back. As an example, take the pattern "aa" and the haystack "aaa". Matching from the front gives you "[aa]a", while matching from the back gives "a[aa]".
For this reason, all methods on StrSlice that return an iterator based on matchers exist in two single-ended versions, rather than in one double-ended one.
Traits
- A
Pattern<'a, M> acts as a builder for creating a string matcher M from Self. Types like char, &str or Regex implement this in order to be directly usable with the methods of StrSlice.
-
Matcher, LeftMatcher, RightMatcher, DoubleEndedMatcher: Those provide a iterator-like interface for finding all non-overlapping occurrences of a pattern in a string, and are used as a building block for higher level operations like splitting or trimming.
-
Matcher is the base trait, used for getting the original haystack &str back out of a matcher.
-
LeftMatcher is for advancing the matcher state by finding the next match from the front.
-
RightMatcher is for advancing the matcher state by finding the next match from the back.
-
DoubleEndedMatcher is a marker trait for expressing that the matches of LeftMatcher are equal to those of RightMatcher in reverse order. This allows iterators build on matchers to still provide a double-ended iterator API for Pattern implementations where overlapping matches are impossible. (This feature is optional, but provided to be compatible with the current API)
Advantages
- One method name per operation; A single common abstraction rather than special cases for a few types
- Allows extending the string API with custom string patterns (Eg
Ascii, Regex, predicates…)
- Allows extending the string API with custom string search algorithms (Eg
BoyerMoore(&str))
- Allows to specialize the implementation per pattern and matcher impl
Disadvantages
- More complexity up front due to generics
WIP Implementation
Issues, open questions
- Performance is untested
- Generics/abstraction layers might cause size bloat
- Generics/abstraction layers might hinder optimizations
- Some type inference bugs around closures get hit by the impl (
.split(|c: char| ... ) fails to inferr)
- Should
DoubleEndedIterator impls be supported where possible?
Future extensions
- A similar abstraction could be implemented for API’s that modify a
String, so that string.push("foo"), string.push('f'), string.push('f'.to_ascii()) all work.
- This would allow stuff like
s.replace(®ex!(...), "foo"), which would be generic over both the pattern matched and the string fragment it gets replaced with.
New API with associated items and where clauses
pub trait StrSlice<'a> {
fn contains<P>(self, pat: P) -> bool where P: Pattern<'a>;
fn split<P>(self, pat: P) -> Splits<P::Matcher> where P: Pattern<'a>;
fn rsplit<P>(self, pat: P) -> RSplits<P::Matcher> where P: Pattern<'a>;
fn splitn<P>(self, pat: P, n: uint) -> NSplits<P::Matcher> where P: Pattern<'a>;
fn rsplitn<P>(self, pat: P, n: uint) -> RNSplits<P::Matcher> where P: Pattern<'a>;
fn split_terminator<P>(self, pat: P) -> TermSplits<P::Matcher> where P: Pattern<'a>;
fn rsplit_terminator<P>(self, pat: P) -> RTermSplits<P::Matcher> where P: Pattern<'a>;
fn matches<P>(self, pat: P) -> Matches<P::Matcher> where P: Pattern<'a>;
fn rmatches<P>(self, pat: P) -> RMatches<P::Matcher> where P: Pattern<'a>;
fn match_indices<P>(self, pat: P) -> MatchIndices<P::Matcher> where P: Pattern<'a>;
fn rmatch_indices<P>(self, pat: P) -> RMatchIndices<P::Matcher> where P: Pattern<'a>;
fn starts_with<P>(self, pat: P) -> bool where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn ends_with<P>(self, pat: P) -> bool where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
fn trim_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: DoubleEndedMatcher<'a>;
fn trim_left_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn trim_right_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
fn find<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn rfind<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
// ...
}