Substring slice pattern matching

I could not find the following proposal discussed anywhere, which is surprising, since it looks like a very natural addition.

The proposal is to add substring pattern syntax for pattern matching, which would serve the same purpose as slice patterns do for slices and arrays. The syntax is prone to bikeshedding, but imho it would be most natural to generalize the concat! and concat_bytes! built-in macros to allow using them in pattern position.

An example which would be easier with this proposal:

// before
fn frobnicate_str(s: &str) -> Frob {
    if s.starts_with("foo") || s.starts_with("moo") {
        frobnicate(&s["foo".len()..])
    } else if s.starts_with("bozoom") {
        frobnicate(&s["bozoom".len()..])
    } else if s.starts_with("ba") {
        barnicate(&s["ba".len()..])
    } else if s.ends_with("quux") {
        quuxnicate(&s[..s.len() - "quux".len()])
    } else {
        panic!("not a frobnicatable string")
    }
}

// after
fn frobnicate_str(s: &str) -> Frob {
    match s {
        concat!("foo" | "moo" | "bozoom", rest@..) => frobnicate(rest),
        concat!("ba", rest@..) => barnicate(rest),
        concat!(start@.., "quux") => quuxnicate(start),
        _ => panic!("not a frobnicatable string"),
    }
}

In my opinion, a match is easier to read than a chain of ifs: the conditions are very explicit, pattern matching allows encoding complex conditions in a simple form, and the compiler guarantees exhaustiveness checking.

A similar construct should be added to allow matching on byte slices, i.e.

fn frobnicate_slice(s: &[u8]) -> Frob {
    match s {
        concat_bytes!(b"foo" | b"moo" | b"bozoom", rest@..) => frobnicate(rest),
        concat_bytes!(b"ba", rest@..) => barnicate(rest),
        concat_bytes!(start@.., b"quux") => quuxnicate(start),
        _ => panic!("not a frobnicatable string"),
    }
}

Note that I do not propose to expand the pattern syntax in any way beyond concat! and concat_bytes!. In particular, allowing non-literal slices in the pattern, patterns for containers such as String or Vec, custom pattern matching, and allowing more than one variable-length pattern are all beyond the scope of this proposal.

Note that this is more syntactic sugar than a real extension of the language's expressive power. For byte strings in particular, a pattern match

if let concat_bytes!(b"foo" | b"quux", rest@..) = b { ... }

can be equivalently rewritten to

if let [b'f', b'o', b'o', rest@..] | [b'q', b'u', b'u', b'x', rest@..] = b { ... }

However, the first pattern is much easier to read and write.

For strings, there is currently no direct equivalent of slice patterns. One could match the string slice as a byte slice and manually transcribe a string pattern into the corresponding byte pattern:

if let [b'f', b'o', b'o', rest@..] = s.as_bytes() { ... }

However, this is even less practical than for byte slices when the string pattern goes beyond the ASCII range. Also, rest will be a byte slice within the match arm, so if we need to work with actual strings, we must convert it back into a string slice, either performing redundant UTF-8 checks or using error-prone unsafe code.
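
For illustration, here is roughly what that workaround looks like today (frobnicate_prefix is a made-up helper; since the prefix here is plain ASCII, the UTF-8 re-validation is known to always succeed):

fn frobnicate_prefix(s: &str) -> Option<&str> {
    // &str has no slice patterns today, so match on the raw bytes instead.
    if let [b'f', b'o', b'o', rest @ ..] = s.as_bytes() {
        // `rest` is a &[u8]; getting a &str back means either re-validating...
        core::str::from_utf8(rest).ok()
        // ...or reaching for `unsafe { core::str::from_utf8_unchecked(rest) }`.
    } else {
        None
    }
}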

Syntax

In the current proposal the obvious, although somewhat clunky, pattern syntax concat(_bytes)!(p1, ..., pn) is used. Since concat(_bytes)! is a macro, if a better syntax is ever found, migration to the new syntax can be made seamless by changing the expansion of the macros to the new pattern syntax. Also, since concat(_bytes)! is a built-in macro, nothing prevents it from supporting the new patterns even if there is currently no non-macro way to express them in the language (just as there is currently no other way to concatenate compile-time slices).

The expansion of the macro doesn't need to know the surrounding context (i.e. whether it is expanded in pattern position), since adding the pattern syntax is backwards compatible. Currently the only valid input is a concatenation of slice literals, and the macro expands to their concatenation. With this proposal a concatenation of literals would still expand to a literal, just as before; however, it would now also be possible to use more general slice patterns within the macro, in which case it would expand to some (for the moment unstable) slice pattern.
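
To make that concrete (the expression form requires the nightly concat_bytes feature today, and the pattern form is of course hypothetical):

#![feature(concat_bytes)] // nightly-only at the time of writing

// Expression position, as the macro works today: literals only, and the
// expansion is simply the concatenated byte-string literal b"foobar".
const FOOBAR: &[u8; 6] = concat_bytes!(b"foo", b"bar");

// Pattern position, as proposed: a rest pattern inside the macro would turn
// the whole expansion into a slice pattern instead of a literal, i.e.
//     concat_bytes!(b"foo", rest @ ..)
// would behave like the stable pattern [b'f', b'o', b'o', rest @ ..].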

If Rust had a slice concatenation operator (say ++), then the most natural syntax for string patterns would use ++ instead of concat!. I find it unlikely that such an operator will ever be added, though.

Semantics

The byte string patterns with concat_bytes! can be directly desugared to the stable slice patterns as above, assuming that there are no or-patterns within the concat_bytes! call. This means that, in particular, arrays can also be matched with the byte string patterns just like with the usual slice patterns.
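
For example (matches! plus a plain slice pattern is stable; the concat_bytes! form in the comment is the hypothetical pattern it would desugar from):

fn starts_with_foo(tag: [u8; 4]) -> bool {
    // What concat_bytes!(b"foo", ..) would desugar to; slice patterns already
    // accept fixed-size arrays, so the proposed byte-string patterns would too.
    matches!(tag, [b'f', b'o', b'o', ..])
}

so starts_with_foo(*b"food") is true and starts_with_foo(*b"barf") is false.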

Nested or-patterns present the same challenges as any other or-patterns. In principle, they can be directly desugared into sequences of ordinary patterns. In practice, this can lead to an exponential code blowup for deeply nested or-patterns, so some smarter algorithm may be desirable.
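
For example, a hypothetical concat_bytes!(b"a" | b"bb", b"c" | b"dd", rest @ ..) would already need 2 × 2 = 4 alternatives under the naive desugaring (the desugared form below is stable Rust, reusing b and frobnicate from the earlier snippets):

match b {
    [b'a', b'c', rest @ ..]
    | [b'a', b'd', b'd', rest @ ..]
    | [b'b', b'b', b'c', rest @ ..]
    | [b'b', b'b', b'd', b'd', rest @ ..] => {
        // every alternative must bind `rest`, with the same type
        frobnicate(rest);
    }
    _ => { /* not matched */ }
}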

String slice patterns with concat! do not currently have a stable analogue. One option is to introduce special unstable string slice patterns, just like core::ptr::addr_of! expands to an unstable &raw reference. Another option is to desugar string slice patterns into byte slice patterns, together with conversions to and from a byte slice.
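
As a sketch of the second option, a hypothetical concat!("foo", rest @ ..) pattern on a &str could be lowered to a byte-slice match plus re-slicing of the original string; because the matched prefix is itself valid UTF-8, slicing at its byte length always lands on a char boundary, so no re-validation (and no unsafe) is needed:

// Hand-written version of what such a lowering could produce.
fn desugared(s: &str) -> Option<&str> {
    match s.as_bytes() {
        [b'f', b'o', b'o', ..] => Some(&s["foo".len()..]), // `rest` as a &str
        _ => None,
    }
}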

Alternatives

For byte slices, the currently stable slice patterns already allow expressing the same pattern matching as proposed above. However, this involves manually spelling out the byte string literals as explicit arrays of individual bytes, which is tedious and confusing when the byte slices actually represent ASCII strings.

For string slices, the stable alternative is to use the string splitting, pattern matching, and subslicing functions to implement the matching manually. This involves more boilerplate code, redundant definitions of the matched strings (possibly extracted to local variables), or manually computed and hardcoded offsets based on the lengths of fixed prefixes and suffixes. For ad-hoc parsing this can result in quite complex code, as in the example at the start.

strip_prefix and strip_suffix really weaken the argument since they greatly simplify the original code.

Wouldn't the gold standard here be regex? I wonder what the minimum number of features necessary to get this working would be:

use regex::r;

fn frobnicate_str(s: &str) -> Frob {
    match s {
        r!("(foo|moo|bozoom)(?P<rest>.*)") => frobnicate(rest),
        r!("ba(?P<rest>.*)") => barnicate(rest),
        r!("(?P<start>)quux") => quuxnicate(&start),
        _ => panic!("not a frobnicatable string"),
    }
}

Or something very close to this. When I first discovered the regex library in Rust, this is actually how I expected it to work!

For comparison, using strip_prefix and strip_suffix you get:

fn frobnicate_str(s: &str) -> Frob {
    if let Some(rest) = s.strip_prefix("foo") {
        frobnicate(rest)
    } else if let Some(rest) = s.strip_prefix("moo") {
        frobnicate(rest)
    } else if let Some(rest) = s.strip_prefix("bozoom") {
        frobnicate(rest)
    } else if let Some(rest) = s.strip_prefix("ba") {
        barnicate(rest)
    } else if let Some(start) = s.strip_suffix("quux") {
        quuxnicate(start)
    } else {
        panic!("not a frobnicatable string")
    }
}

Or:

fn frobnicate_str(s: &str) -> Frob {
    if let Some(rest) = s.strip_prefix("foo")
        .or_else(|| s.strip_prefix("moo"))
        .or_else(|| s.strip_prefix("bozoom"))
    {
        frobnicate(rest)
    } else if let Some(rest) = s.strip_prefix("ba") {
        barnicate(rest)
    } else if let Some(start) = s.strip_suffix("quux") {
        quuxnicate(start)
    } else {
        panic!("not a frobnicatable string")
    }
}

That depends. In terms of expressiveness, most likely yes. However, regexes can have O(2^n) runtime behavior, which no non-nested match expression exhibits today.

This is also an issue with .ends_with(), although there the code is O(n). Still, it could silently and suddenly make matches much more expensive.

I am not advocating for regex match support (a macro would probably be good enough), but using actual regular regexes would give great performance. The compiler could match all regexes simultaneously using a product automaton, which would make the overall match time O(n), where n is the string length.
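
For what it's worth, the regex crate's existing RegexSet already does the "match all of them in a single pass" part, though it only reports which patterns matched, not where (which is the limitation discussed below):

use regex::RegexSet;

// Rough sketch only: in real code the set would be built once, not per call.
fn which_arm(s: &str) -> Option<usize> {
    let set = RegexSet::new([
        r"^(foo|moo|bozoom)", // arm 0
        r"^ba",               // arm 1
        r"quux$",             // arm 2
    ]).ok()?;
    set.matches(s).iter().next() // lowest-numbered pattern that matched, if any
}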

Not really, no. It could be O(m * n) though, if it's based on the regex crate.

Now, while this is at least polynomial and not exponential in the worst case, as in some other languages, it's still easily more expensive than what one naively expects from a match expr.

Given your example, I believe this is not possible. In particular, your example includes the use of capturing groups and, importantly, extracting the match positions of specific capturing groups. This requires something strictly more powerful than a simple automaton (namely, a tagged automaton). You can flub simple automata to give you match offsets, but that's about as far as you can get.

If no regexes are used in the match loop, then you can provide a stricter time guarantee.

(I've specifically avoided trying to have an opinion on whether we "should" do any of this or not. I just wanted to clarify things about regexes.)
