Pre-RFC: New generic string pattern API for `&str`

Kimundi · August 12, 2014, 12:32am

I had this idea in my head for the last few months, in the last weeks I actually got around to actually implement it in a preliminary version, and now I'd like to get some feedback and questions about it.

Current situation

Right now, string slices define a couple of methods to look for some string pattern in it - usually a char or a &str. To be exact, these are all methods in question:

pub trait StrSlice<'a> {
    fn contains<'a>(&self, needle: &'a str) -> bool;
    fn contains_char(&self, needle: char) -> bool;

    fn split<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
    fn splitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
    fn rsplitn<Sep: CharEq>(&self, sep: Sep, count: uint) -> CharSplitsN<'a, Sep>;
    fn split_terminator<Sep: CharEq>(&self, sep: Sep) -> CharSplits<'a, Sep>;
    fn split_str(&self, &'a str) -> StrSplits<'a>;

    fn match_indices(&self, sep: &'a str) -> MatchIndices<'a>;

    fn starts_with(&self, needle: &str) -> bool;
    fn ends_with(&self, needle: &str) -> bool;

    fn trim_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_left_chars<C: CharEq>(&self, to_trim: C) -> &'a str;
    fn trim_right_chars<C: CharEq>(&self, to_trim: C) -> &'a str;

    fn find<C: CharEq>(&self, search: C) -> Option<uint>;
    fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
    fn find_str(&self, &str) -> Option<uint>;

    // ...
}

In this set of methods, there are quite some inconsistencies: Some are defined to work with char, some with CharEq, and some with &str. Furthermore, some operations are duplicated to provide different implementations for char/CharEq and &str.

In my opinion, this is a perfect opportunity to generically abstract over the concept of a "string pattern" to provide a unified API:

The new `&str` API

The new API is based on the idea of a string Pattern that can be used to obtain a double-ended-iterator-like matcher to find all occurrences of it in a string:

pub trait StrSlice<'a> {
    fn contains<M, P: Pattern<'a, M>>(self, pat: P) -> bool;

    fn split<M, P: Pattern<'a, M>>(self, pat: P) -> Splits<M>;
    fn rsplit<M, P: Pattern<'a, M>>(self, pat: P) -> RSplits<M>;
    fn splitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> NSplits<M>;
    fn rsplitn<M, P: Pattern<'a, M>>(self, pat: P, n: uint) -> RNSplits<M>;
    fn split_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> TermSplits<M>;
    fn rsplit_terminator<M, P: Pattern<'a, M>>(self, pat: P) -> RTermSplits<M>;

    fn matches<M, P: Pattern<'a, M>>(self, pat: P) -> Matches<M>;
    fn rmatches<M, P: Pattern<'a, M>>(self, pat: P) -> RMatches<M>;
    fn match_indices<M, P: Pattern<'a, M>>(self, pat: P) -> MatchIndices<M>;
    fn rmatch_indices<M, P: Pattern<'a, M>>(self, pat: P) -> RMatchIndices<M>;

    fn starts_with<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;
    fn ends_with<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> bool;

    fn trim_matches<M: DoubleEndedMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
    fn trim_left_matches<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;
    fn trim_right_matches<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> &'a str;

    fn find<M: LeftMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;
    fn rfind<M: RightMatcher<'a>, P: Pattern<'a, M>>(self, pat: P) -> Option<uint>;

    // ...
}

Design details

Separate methods for front and back matching

The API is based around finding all non-overlapping occurrences of an arbitrary pattern in a string. This has one issue though:

If a pattern is able to match overlapping substrings, then finding non-overlapping matches from the front becomes a different operation than finding non-overlapping matches from the back. As an example, take the pattern "aa" and the haystack "aaa". Matching from the front gives you "[aa]a", while matching from the back gives "a[aa]".

For this reason, all methods on StrSlice that return an iterator based on matchers exist in two single-ended versions, rather than in one double-ended one.

Traits

A Pattern<'a, M> acts as a builder for creating a string matcher M from Self. Types like char, &str or Regex implement this in order to be directly usable with the methods of StrSlice.
Matcher, LeftMatcher, RightMatcher, DoubleEndedMatcher: Those provide a iterator-like interface for finding all non-overlapping occurrences of a pattern in a string, and are used as a building block for higher level operations like splitting or trimming.
- Matcher is the base trait, used for getting the original haystack &str back out of a matcher.
- LeftMatcher is for advancing the matcher state by finding the next match from the front.
- RightMatcher is for advancing the matcher state by finding the next match from the back.
- DoubleEndedMatcher is a marker trait for expressing that the matches of LeftMatcher are equal to those of RightMatcher in reverse order. This allows iterators build on matchers to still provide a double-ended iterator API for Pattern implementations where overlapping matches are impossible. (This feature is optional, but provided to be compatible with the current API)

Advantages

One method name per operation; A single common abstraction rather than special cases for a few types
Allows extending the string API with custom string patterns (Eg Ascii, Regex, predicates...)
Allows extending the string API with custom string search algorithms (Eg BoyerMoore(&str))
Allows to specialize the implementation per pattern and matcher impl

Disadvantages

More complexity up front due to generics

WIP Implementation

Issues, open questions

Performance is untested
Generics/abstraction layers might cause size bloat
Generics/abstraction layers might hinder optimizations
Some type inference bugs around closures get hit by the impl (.split(|c: char| ... ) fails to inferr)
Should DoubleEndedIterator impls be supported where possible?

Future extensions

A similar abstraction could be implemented for API's that modify a String, so that string.push("foo"), string.push('f'), string.push('f'.to_ascii()) all work.
This would allow stuff like s.replace(&regex!(...), "foo"), which would be generic over both the pattern matched and the string fragment it gets replaced with.

New API with associated items and `where` clauses

pub trait StrSlice<'a> {
    fn contains<P>(self, pat: P) -> bool where P: Pattern<'a>;

    fn split<P>(self, pat: P) -> Splits<P::Matcher> where P: Pattern<'a>;
    fn rsplit<P>(self, pat: P) -> RSplits<P::Matcher> where P: Pattern<'a>;
    fn splitn<P>(self, pat: P, n: uint) -> NSplits<P::Matcher> where P: Pattern<'a>;
    fn rsplitn<P>(self, pat: P, n: uint) -> RNSplits<P::Matcher> where P: Pattern<'a>;
    fn split_terminator<P>(self, pat: P) -> TermSplits<P::Matcher> where P: Pattern<'a>;
    fn rsplit_terminator<P>(self, pat: P) -> RTermSplits<P::Matcher> where P: Pattern<'a>;

    fn matches<P>(self, pat: P) -> Matches<P::Matcher> where P: Pattern<'a>;
    fn rmatches<P>(self, pat: P) -> RMatches<P::Matcher> where P: Pattern<'a>;
    fn match_indices<P>(self, pat: P) -> MatchIndices<P::Matcher> where P: Pattern<'a>;
    fn rmatch_indices<P>(self, pat: P) -> RMatchIndices<P::Matcher> where P: Pattern<'a>;

    fn starts_with<P>(self, pat: P) -> bool where P: Pattern<'a>, 
                                                  P::Matcher: LeftMatcher<'a>;
    fn ends_with<P>(self, pat: P) -> bool where P: Pattern<'a>, 
                                                P::Matcher: RightMatcher<'a>;

    fn trim_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                      P::Matcher: DoubleEndedMatcher<'a>;
    fn trim_left_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                           P::Matcher: LeftMatcher<'a>;
    fn trim_right_matches<P>(self, pat: P) -> &'a str where P: Pattern<'a>, 
                                                            P::Matcher: RightMatcher<'a>;

    fn find<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>, 
                                                   P::Matcher: LeftMatcher<'a>;
    fn rfind<P>(self, pat: P) -> Option<uint> where P: Pattern<'a>, 
                                                    P::Matcher: RightMatcher<'a>;

    // ...
}

seanmonstar · August 12, 2014, 1:30am

I love it. I definitely think that if an API has 2+ methods to do the same action with different types, that is definitely where a Trait should be used instead.

huon · August 12, 2014, 10:43am

How will we ensure efficiency? That is, splitting on a character can be done in an O(n) pass over bytes (if splitting on ASCII) or codepoints (for a char), and splitting on a string should be O(n+m) (not the naive O(nm)).

Kimundi · August 12, 2014, 12:22pm

@huon: The trait-based design allows for such optimization cases in principle, and my current implementation already implements most of it:

Searching for a char that is ascii or a &str with length 1 dynamically chooses a O(n) byte searcher (A hypothetical implementation for the Ascii type would just hardcode this selection directly)
Searching for a &str with a len greater than 1 uses a dynamically chosen search algorithm as implemented in https://github.com/rust-lang/rust/pull/14135.
Searching for a char that is non-ascii turns it into a small inline utf8 string, and uses a dynamically selected searcher, which is at most O(4n).
Searching for a type implementing CharEq selects a impl that decodes and compares each character in O(n), and that can dynamically specialize for a byte searcher if the CharEq implementer identifies as only caring about ascii.

huon · August 12, 2014, 12:46pm

Ah, cool; I'll admit I didn't look at the code at all; hopefully I'll be able to glance over it sometime.

Is this faster than using a codepoint iterator?

tbu · August 12, 2014, 1:17pm

The code point iterator converts utf-8 to codepoints O(number of codepoints) – the other algorithm just converts the codepoint to utf-8 once and then does a bytewise search.

Kimundi · August 12, 2014, 1:53pm

Whether comparing a short slice of 2-4 bytes is faster than decoding a utf8 character and a integer comparison still has to be determined (I created https://github.com/Kimundi/utf8char to investigate that optimization in general). That said, as it currently is implemented, its probably slower.

And if it turns out to be a dead end in general, the implementation can simply be changed to use the current char-decoding one.

huon · August 13, 2014, 5:06am

@tbu the codepoint iterator is guaranteed to look at each byte in the input exactly once, this isn’t guaranteed for a byte-slice-search (although thinking about it now, the self-synchronisation property of UTF-8 means that it is actually possible (or better) for the byte search too).

system · March 25, 2019, 8:22am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Missing split_last method on &str	7	772	October 1, 2021
Substring slice pattern matching language design	8	5245	September 11, 2022
Str vs slice APIs libs	3	1590	March 25, 2019
Allow &[&str] in Pattern libs	27	1361	May 10, 2021
`str` method for slicing code-point (i.e. `char`) ranges libs	23	2871	March 25, 2019