I had this idea in my head for the last few months, in the last weeks I actually got around to actually implement it in a preliminary version, and now I'd like to get some feedback and questions about it.
Current situation
Right now, string slices define a couple of methods to look for some string pattern in it - usually a char
or a &str
. To be exact, these are all methods in question:
pub trait StrSlice<'a> {
fn contains<'a>(&self, needle: &'a str) -> bool;
fn contains_char(&self, needle: char) -> bool;
fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
fn split_str(&self, &'a str) -> StrSplits<'a>;
fn match_indices(&self, sep: &'a str) -> MatchIndices<'a>;
fn starts_with(&self, needle: &str) -> bool;
fn ends_with(&self, needle: &str) -> bool;
fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
fn find<C: CharEq>(&self, search: C) -> Option<uint>;
fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
fn find_str(&self, &str) -> Option<uint>;
// ...
}
In this set of methods, there are quite some inconsistencies: Some are defined to work with char
, some with CharEq
, and some with &str
. Furthermore, some operations are duplicated to provide different implementations for char/CharEq
and &str
.
In my opinion, this is a perfect opportunity to generically abstract over the concept of a "string pattern" to provide a unified API:
The new &str
API
The new API is based on the idea of a string Pattern
that can be used to obtain a double-ended-iterator-like matcher to find all occurrences of it in a string:
pub trait StrSlice<'a> {
fn contains<M, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn split<M, P: Pattern<'a, M>>(self, pat: P) -> Splits<M>;
fn rsplit<M, P: Pattern<'a, M>>(self, pat: P) -> RSplits<M>;
fn splitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> NSplits<M>;
fn rsplitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> RNSplits<M>;
fn split_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> TermSplits<M>;
fn rsplit_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> RTermSplits<M>;
fn matches<M, P: Pattern<'a, M>>(self, pat: P) -> Matches<M>;
fn rmatches<M, P: Pattern<'a, M>>(self, pat: P) -> RMatches<M>;
fn match_indices<M, P: Pattern<'a, M>>(self, pat: P) -> MatchIndices<M>;
fn rmatch_indices<M, P: Pattern<'a, M>>(self, pat: P) -> RMatchIndices<M>;
fn starts_with<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn ends_with<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
fn trim_matches<M: DoubleEndedMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn trim_left_matches<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn trim_right_matches<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
fn find<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
fn rfind<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
// ...
}
Design details
Separate methods for front and back matching
The API is based around finding all non-overlapping occurrences of an arbitrary pattern in a string. This has one issue though:
If a pattern is able to match overlapping substrings, then finding non-overlapping matches from the front becomes a different operation than finding non-overlapping matches from the back. As an example, take the pattern "aa"
and the haystack "aaa"
. Matching from the front gives you "[aa]a"
, while matching from the back gives "a[aa]"
.
For this reason, all methods on StrSlice
that return an iterator based on matchers exist in two single-ended versions, rather than in one double-ended one.
Traits
- A
Pattern<'a, M>
acts as a builder for creating a string matcherM
fromSelf
. Types likechar
,&str
orRegex
implement this in order to be directly usable with the methods ofStrSlice
. -
Matcher
,LeftMatcher
,RightMatcher
,DoubleEndedMatcher
: Those provide a iterator-like interface for finding all non-overlapping occurrences of a pattern in a string, and are used as a building block for higher level operations like splitting or trimming.-
Matcher
is the base trait, used for getting the original haystack&str
back out of a matcher. -
LeftMatcher
is for advancing the matcher state by finding the next match from the front. -
RightMatcher
is for advancing the matcher state by finding the next match from the back. -
DoubleEndedMatcher
is a marker trait for expressing that the matches ofLeftMatcher
are equal to those ofRightMatcher
in reverse order. This allows iterators build on matchers to still provide a double-ended iterator API forPattern
implementations where overlapping matches are impossible. (This feature is optional, but provided to be compatible with the current API)
-
Advantages
- One method name per operation; A single common abstraction rather than special cases for a few types
- Allows extending the string API with custom string patterns (Eg
Ascii
,Regex
, predicates...) - Allows extending the string API with custom string search algorithms (Eg
BoyerMoore(&str)
) - Allows to specialize the implementation per pattern and matcher impl
Disadvantages
- More complexity up front due to generics
WIP Implementation
Issues, open questions
- Performance is untested
- Generics/abstraction layers might cause size bloat
- Generics/abstraction layers might hinder optimizations
- Some type inference bugs around closures get hit by the impl (
.split(|c: char| ... )
fails to inferr) - Should
DoubleEndedIterator
impls be supported where possible?
Future extensions
- A similar abstraction could be implemented for API's that modify a
String
, so thatstring.push("foo")
,string.push('f')
,string.push('f'.to_ascii())
all work. - This would allow stuff like
s.replace(®ex!(...), "foo")
, which would be generic over both the pattern matched and the string fragment it gets replaced with.
New API with associated items and where
clauses
pub trait StrSlice<'a> {
fn contains<P>(self, pat: P) -> bool where P: Pattern<'a>;
fn split<P>(self, pat: P) -> Splits<P::Matcher> where P: Pattern<'a>;
fn rsplit<P>(self, pat: P) -> RSplits<P::Matcher> where P: Pattern<'a>;
fn splitn<P>(self, pat: P, n: uint) -> NSplits<P::Matcher> where P: Pattern<'a>;
fn rsplitn<P>(self, pat: P, n: uint) -> RNSplits<P::Matcher> where P: Pattern<'a>;
fn split_terminator<P>(self, pat: P) -> TermSplits<P::Matcher> where P: Pattern<'a>;
fn rsplit_terminator<P>(self, pat: P) -> RTermSplits<P::Matcher> where P: Pattern<'a>;
fn matches<P>(self, pat: P) -> Matches<P::Matcher> where P: Pattern<'a>;
fn rmatches<P>(self, pat: P) -> RMatches<P::Matcher> where P: Pattern<'a>;
fn match_indices<P>(self, pat: P) -> MatchIndices<P::Matcher> where P: Pattern<'a>;
fn rmatch_indices<P>(self, pat: P) -> RMatchIndices<P::Matcher> where P: Pattern<'a>;
fn starts_with<P>(self, pat: P) -> bool where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn ends_with<P>(self, pat: P) -> bool where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
fn trim_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: DoubleEndedMatcher<'a>;
fn trim_left_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn trim_right_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
fn find<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>,
P::Matcher: LeftMatcher<'a>;
fn rfind<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>,
P::Matcher: RightMatcher<'a>;
// ...
}