Pre-RFC: New generic string pattern API for `&str`


#1

I had this idea in my head for the last few months, in the last weeks I actually got around to actually implement it in a preliminary version, and now I’d like to get some feedback and questions about it. :slight_smile:


Current situation

Right now, string slices define a couple of methods to look for some string pattern in it - usually a char or a &str. To be exact, these are all methods in question:

pub trait StrSlice<'a> {
    fn contains<'a>(&self, needle: &'a str) -> bool;
    fn contains_char(&self, needle: char) -> bool;

    fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
    fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
    fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
    fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
    fn split_str(&self, &'a str) -> StrSplits<'a>;

    fn match_indices(&self, sep: &'a str) -> MatchIndices<'a>;

    fn starts_with(&self, needle: &str) -> bool;
    fn ends_with(&self, needle: &str) -> bool;

    fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;

    fn find<C: CharEq>(&self, search: C) -> Option<uint>;
    fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
    fn find_str(&self, &str) -> Option<uint>;

    // ...
}

In this set of methods, there are quite some inconsistencies: Some are defined to work with char, some with CharEq, and some with &str. Furthermore, some operations are duplicated to provide different implementations for char/CharEq and &str.

In my opinion, this is a perfect opportunity to generically abstract over the concept of a “string pattern” to provide a unified API:

The new &str API

The new API is based on the idea of a string Pattern that can be used to obtain a double-ended-iterator-like matcher to find all occurrences of it in a string:

pub trait StrSlice<'a> {
    fn contains<M, P: Pattern<'a, M>>(self, pat: P) -> bool;

    fn split<M, P: Pattern<'a, M>>(self, pat: P) -> Splits<M>;
    fn rsplit<M, P: Pattern<'a, M>>(self, pat: P) -> RSplits<M>;
    fn splitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> NSplits<M>;
    fn rsplitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> RNSplits<M>;
    fn split_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> TermSplits<M>;
    fn rsplit_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> RTermSplits<M>;

    fn matches<M, P: Pattern<'a, M>>(self, pat: P) -> Matches<M>;
    fn rmatches<M, P: Pattern<'a, M>>(self, pat: P) -> RMatches<M>;
    fn match_indices<M, P: Pattern<'a, M>>(self, pat: P) -> MatchIndices<M>;
    fn rmatch_indices<M, P: Pattern<'a, M>>(self, pat: P) -> RMatchIndices<M>;

    fn starts_with<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
    fn ends_with<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;

    fn trim_matches<M: DoubleEndedMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
    fn trim_left_matches<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
    fn trim_right_matches<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;

    fn find<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
    fn rfind<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;

    // ...
}

Design details

Separate methods for front and back matching

The API is based around finding all non-overlapping occurrences of an arbitrary pattern in a string. This has one issue though:

If a pattern is able to match overlapping substrings, then finding non-overlapping matches from the front becomes a different operation than finding non-overlapping matches from the back. As an example, take the pattern "aa" and the haystack "aaa". Matching from the front gives you "[aa]a", while matching from the back gives "a[aa]".

For this reason, all methods on StrSlice that return an iterator based on matchers exist in two single-ended versions, rather than in one double-ended one.

Traits

  • A Pattern<'a, M> acts as a builder for creating a string matcher M from Self. Types like char, &str or Regex implement this in order to be directly usable with the methods of StrSlice.
  • Matcher, LeftMatcher, RightMatcher, DoubleEndedMatcher: Those provide a iterator-like interface for finding all non-overlapping occurrences of a pattern in a string, and are used as a building block for higher level operations like splitting or trimming.
    • Matcher is the base trait, used for getting the original haystack &str back out of a matcher.
    • LeftMatcher is for advancing the matcher state by finding the next match from the front.
    • RightMatcher is for advancing the matcher state by finding the next match from the back.
    • DoubleEndedMatcher is a marker trait for expressing that the matches of LeftMatcher are equal to those of RightMatcher in reverse order. This allows iterators build on matchers to still provide a double-ended iterator API for Pattern implementations where overlapping matches are impossible. (This feature is optional, but provided to be compatible with the current API)

Advantages

  • One method name per operation; A single common abstraction rather than special cases for a few types
  • Allows extending the string API with custom string patterns (Eg Ascii, Regex, predicates…)
  • Allows extending the string API with custom string search algorithms (Eg BoyerMoore(&str))
  • Allows to specialize the implementation per pattern and matcher impl

Disadvantages

  • More complexity up front due to generics

WIP Implementation

Issues, open questions

  • Performance is untested
  • Generics/abstraction layers might cause size bloat
  • Generics/abstraction layers might hinder optimizations
  • Some type inference bugs around closures get hit by the impl (.split(|c: char| ... ) fails to inferr)
  • Should DoubleEndedIterator impls be supported where possible?

Future extensions

  • A similar abstraction could be implemented for API’s that modify a String, so that string.push("foo"), string.push('f'), string.push('f'.to_ascii()) all work.
  • This would allow stuff like s.replace(&regex!(...), "foo"), which would be generic over both the pattern matched and the string fragment it gets replaced with.

New API with associated items and where clauses

pub trait StrSlice<'a> {
    fn contains<P>(self, pat: P) -> bool where P: Pattern<'a>;

    fn split<P>(self, pat: P) -> Splits<P::Matcher> where P: Pattern<'a>;
    fn rsplit<P>(self, pat: P) -> RSplits<P::Matcher> where P: Pattern<'a>;
    fn splitn<P>(self, pat: P, n: uint) -> NSplits<P::Matcher> where P: Pattern<'a>;
    fn rsplitn<P>(self, pat: P, n: uint) -> RNSplits<P::Matcher> where P: Pattern<'a>;
    fn split_terminator<P>(self, pat: P) -> TermSplits<P::Matcher> where P: Pattern<'a>;
    fn rsplit_terminator<P>(self, pat: P) -> RTermSplits<P::Matcher> where P: Pattern<'a>;

    fn matches<P>(self, pat: P) -> Matches<P::Matcher> where P: Pattern<'a>;
    fn rmatches<P>(self, pat: P) -> RMatches<P::Matcher> where P: Pattern<'a>;
    fn match_indices<P>(self, pat: P) -> MatchIndices<P::Matcher> where P: Pattern<'a>;
    fn rmatch_indices<P>(self, pat: P) -> RMatchIndices<P::Matcher> where P: Pattern<'a>;

    fn starts_with<P>(self, pat: P) -> bool where P: Pattern<'a>, 
                                                  P::Matcher: LeftMatcher<'a>;
    fn ends_with<P>(self, pat: P) -> bool where P: Pattern<'a>, 
                                                P::Matcher: RightMatcher<'a>;

    fn trim_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                      P::Matcher: DoubleEndedMatcher<'a>;
    fn trim_left_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                           P::Matcher: LeftMatcher<'a>;
    fn trim_right_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                            P::Matcher: RightMatcher<'a>;

    fn find<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>, 
                                                   P::Matcher: LeftMatcher<'a>;
    fn rfind<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>, 
                                                    P::Matcher: RightMatcher<'a>;

    // ...
}

#2

I love it. I definitely think that if an API has 2+ methods to do the same action with different types, that is definitely where a Trait should be used instead.


#3

How will we ensure efficiency? That is, splitting on a character can be done in an O(n) pass over bytes (if splitting on ASCII) or codepoints (for a char), and splitting on a string should be O(n+m) (not the naive O(nm)).


#4

@huon: The trait-based design allows for such optimization cases in principle, and my current implementation already implements most of it:

  • Searching for a char that is ascii or a &str with length 1 dynamically chooses a O(n) byte searcher (A hypothetical implementation for the Ascii type would just hardcode this selection directly)
  • Searching for a &str with a len greater than 1 uses a dynamically chosen search algorithm as implemented in https://github.com/rust-lang/rust/pull/14135.
  • Searching for a char that is non-ascii turns it into a small inline utf8 string, and uses a dynamically selected searcher, which is at most O(4n).
  • Searching for a type implementing CharEq selects a impl that decodes and compares each character in O(n), and that can dynamically specialize for a byte searcher if the CharEq implementer identifies as only caring about ascii.

#5

Ah, cool; I’ll admit I didn’t look at the code at all; hopefully I’ll be able to glance over it sometime.

Is this faster than using a codepoint iterator?


#6

The code point iterator converts utf-8 to codepoints O(number of codepoints) – the other algorithm just converts the codepoint to utf-8 once and then does a bytewise search.


#7

Whether comparing a short slice of 2-4 bytes is faster than decoding a utf8 character and a integer comparison still has to be determined (I created https://github.com/Kimundi/utf8char to investigate that optimization in general). That said, as it currently is implemented, its probably slower.

And if it turns out to be a dead end in general, the implementation can simply be changed to use the current char-decoding one.


#8

@tbu the codepoint iterator is guaranteed to look at each byte in the input exactly once, this isn’t guaranteed for a byte-slice-search (although thinking about it now, the self-synchronisation property of UTF-8 means that it is actually possible (or better) for the byte search too).