std::str: return matched char for "starts_with", "strip_prefix" and other methods that take a Pattern

I'm building a lexer in Rust, and I am currently resolving tokens using a trie. I am using str over u8 since I want to accommodate UTF-8 strings, and str handles a lot of that heavy lifting for me. The current Pattern API allows specifying multiple chars to match on, but in such cases there is no way to find out which char the string matched. This leads to unergonomic and potentially slower code. Here's what my code looks like now.

if let Some(stripped) = string.strip_prefix('=') {
    if let Some(stripped) = stripped.strip_prefix('=') {
        if let Some(stripped) = stripped.strip_prefix('=') {
            string = stripped;
            make_eqeqeq_token()
        } else {
            string = stripped;
            make_eqeq_token()
        }
    } else {
        string = stripped;
        make_eq_token()
    }
} else if let Some(stripped) = string.strip_prefix('>') {
    if let Some(stripped) = stripped.strip_prefix('=') {
        string = stripped;
        make_geq_token()
    } else {
        string = stripped;
        make_greater_token()
    }
} else if let Some(stripped) = string.strip_prefix('<') {
    // repeats
}

I propose making a method that also returns what character matched, which would allow for a more pattern-matchy syntax like below.

match string.strip_prefix_with_match(&['=', '>', '<', ...]) {
    ('=', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => match stripped.strip_prefix('=') {
            Some(stripped) => {
                string = stripped;
                make_eqeqeq_token()
            }
            None => {
                string = stripped;
                make_eqeq_token()
            }
        }
        None => {
            string = stripped;
            make_eq_token()
        }
    }
    ('>', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => {
            string = stripped;
            make_geq_token()
        }
        None => {
            string = stripped;
            make_greater_token()
        }
    }
    ('<', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => {
            string = stripped;
            make_leq_token()
        }
        None => {
            string = stripped;
            make_less_token()
        }
    }
    ('!', stripped) => { /* repeats */ }
}

I see two advantages to this approach. The first is readability. In general, pattern matching is more readable than an "if / else if / else" chain, especially as the number of cases grows. Additionally, with this approach the reader can figure out which case they are looking at much more quickly, since they only need to read up to ('=', stripped), whereas in the current implementation they would need to read almost to the end of the line, up to strip_prefix.

Secondly, it may have performance benefits. I preface this by stating that I'm not too well versed in Rust internals, but intuitively, in this implementation only one function call is made, followed by a match on a character: an operation which can be optimized to constant time. In the former, a function call must be made for each and every case. To be honest, I wouldn't be surprised if LLVM magic already optimizes that into effectively one call plus constant-time matching, but this is a relatively minor benefit compared to the first anyway.

There existed a discussion on this here, but it was shut down, largely due to a lack of a clear use case. I present my code as a clear use case for a string pattern match that returns the matched pattern.

I think this proposal for substring pattern matching is also relevant.

For your concrete use case I would use the complete operator representation (==, =, >=, ...) as the prefix. Much better readability, and probably also better performance.

Using the complete operator representation (i.e. passing in a string as a pattern) doesn't quite offer everything that's needed in this use case in two regards:

  • For longer keywords, checking against each entire string would require many backtracks. For example, suppose I have the keywords return, unlike, unless, until, up, and upto, and the input string is "unlass". If we used each whole keyword as the pattern, we would have to walk all the way through "unl" to realize the input is not "unlike", and then walk that far again to realize it is not "unless". Passing in the entire string is therefore more computationally expensive than using a trie, which short-circuits as soon as it sees an unexpected character and never needs to backtrack after a mismatch. So, while passing in the entire string is a valid approach in my toy code, it is not a valid solution in general.

  • Even disregarding performance, sometimes we want to know exactly where the mismatch occurred, perhaps for better error messages when parsing. In that case, passing in the entire string tells us little beyond the fact that it didn't match. A trie lets us pinpoint where the mismatch occurred and inform the user accordingly.
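To illustrate that second point, the key piece of information is how far the match got before diverging. A minimal byte-level sketch (the helper name is mine, not from std; a trie traversal tracks this position implicitly as it descends):

```rust
// Hypothetical helper: byte offset at which `input` first diverges from
// `candidate`. Compares byte-by-byte, which is exact for ASCII keywords.
fn first_divergence(input: &str, candidate: &str) -> usize {
    input
        .bytes()
        .zip(candidate.bytes())
        .take_while(|(a, b)| a == b)
        .count()
}
```

For the "unlass" example above, both "unlike" and "unless" diverge at offset 3, which is exactly the position an error message would want to point at.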

As an appendix, here's (roughly) what a trie traversal would look like with the current API vs. the proposed one.

if let Some(stripped) = string.strip_prefix('u') {
  if let Some(stripped) = stripped.strip_prefix('n') {
    if let Some(stripped) = stripped.strip_prefix('l') {
      if let Some(stripped) = stripped.strip_prefix("ike") {
        emit_unlike()
      } else if let Some(stripped) = stripped.strip_prefix("ess") {
        emit_unless()
      } else {
        emit_error()
      }
    } else if let Some(stripped) = stripped.strip_prefix("til") {
      emit_until()
    } else {
      emit_error()
    }
  } else if let Some(stripped) = stripped.strip_prefix('p') {
    if let Some(stripped) = stripped.strip_prefix("to") {
      emit_upto()
    } else if stripped.is_empty() {
      emit_up()
    } else {
      emit_error()
    }
  } else {
    emit_error()
  }
} else if let Some(stripped) = string.strip_prefix('r') {
  // more matching...
} else {
  emit_error()
}

With some information about the match, we can have

match string.strip_prefix_with_match(&['r', 'u', ...]) {
  ('u', stripped) => match stripped.strip_prefix_with_match(&['p', 'n']) {
    ('p', stripped) => match stripped.strip_prefix("to") {
      Some(_) => emit_upto(),
      None if stripped.is_empty() => emit_up(),
      None => emit_error()
    }
    ('n', stripped) => match stripped.strip_prefix_with_match(&['t', 'l']) {
      ('t', stripped) => match stripped.strip_prefix("il") {
         Some(_) => emit_until(),
         None => emit_error(),
      }
      ('l', stripped) => match stripped.strip_prefix_with_match(&['e', 'i']) {
        ('e', stripped) => match stripped.strip_prefix("ss") {
          Some(_) => emit_unless(),
          None => emit_error()
        }
        ('i', stripped) => match stripped.strip_prefix("ke") {
          Some(_) => emit_unlike(),
          None => emit_error()
        }
      }
    }
  }
  // ...
}

Now granted, by the nature of tries, there is a fair bit of nesting going on in both cases. But I feel that the pattern matching offers better readability.

I wonder if you might be better off just using something like RegexSet from the regex crate instead of doing it via the Pattern-taking APIs in std...

Regarding performance: I wonder if it wouldn't be faster to branch on 4- or 8-byte comparisons, at least in some cases. It would potentially mean fewer branches (depending on your exact grammar), and working with the full register width costs essentially the same on modern CPUs.

You could add strip_prefix_with_match using an extension trait on str today in your own code. Would be a good way to test it out, and would allow you to use the functionality immediately.
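For reference, a minimal version of such an extension trait might look like the following. This is a sketch restricted to a list of chars rather than the full Pattern machinery; the method name simply mirrors the OP's hypothetical strip_prefix_with_match.

```rust
// User-side extension trait adding the proposed method to str.
trait StripPrefixWithMatch {
    fn strip_prefix_with_match(&self, chars: &[char]) -> Option<(char, &str)>;
}

impl StripPrefixWithMatch for str {
    fn strip_prefix_with_match(&self, chars: &[char]) -> Option<(char, &str)> {
        let first = self.chars().next()?;
        if chars.contains(&first) {
            // len_utf8 keeps the slice index on a char boundary, so this
            // stays correct for multi-byte UTF-8 characters.
            Some((first, &self[first.len_utf8()..]))
        } else {
            None
        }
    }
}
```

With this in scope, `"==>".strip_prefix_with_match(&['=', '>'])` yields `Some(('=', "=>"))`, which is exactly the shape the proposed match arms consume.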

Drawing from str::split_once's naming, I would probably expect the OP's desired method to be called split_prefix.

The way to write this as a match today might be something like:

match string.chars().next() {
    Some('=') => {
        string = &string[1..];
        match string.chars().next() {
            Some('=') => {
                string = &string[1..];
                match string.chars().next() {
                    Some('=') => {
                        string = &string[1..];
                        make_token![===];
                    }
                    _ => make_token![==],
                }
            }
            Some('>') => {
                string = &string[1..];
                make_token![=>];
            }
            _ => make_token![=],
        }
    }
    /* ... */
}

I don't know if the slicing panics fully optimize out, unfortunately. The Chars iterator provides access to the un-iterated string, but keeping access to that isn't very straightforward either.
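For completeness, Chars::as_str (stable) does hand back the un-iterated remainder, so one way to avoid the manual slicing is a small helper like this (the function name is mine):

```rust
// Pop one char and return it together with the rest of the string,
// using Chars::as_str instead of indexed slicing.
fn next_char(s: &str) -> (Option<char>, &str) {
    let mut iter = s.chars();
    let c = iter.next();
    (c, iter.as_str())
}
```

This sidesteps the `&string[1..]` slicing (and any panic checks it may carry), and is UTF-8 safe since the iterator advances by whole chars.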

And anyway, properly optimizing a lexer is far more complicated than just building a naive trie traversal and hoping for the best. That can be decent, but so can a more naive approach that rescans shared prefixes, because the optimizer can do some magic with the strong guarantees Rust references have. IMO the single-char-at-a-time style is only so common because of the prevalence of getc and other single-pass streaming input APIs.


This sounds to me like you'd want a Trie data structure, e.g., as in the trie-rs crate (Disclaimer: I haven't personally used that crate yet).

The reason I named it strip_prefix_with_match is that there already exists a strip_prefix method on str, which strips the specified Pattern and returns the string without the prefix. I agree it's not a very good name, but it was the clearest way to get my point across.

Regarding optimizations, you are 100% correct that I'm taking inspiration from C, and also that this is a naive optimization. Regardless, I do think the example demonstrates a valid use case for an additional "strip" method that returns the matched pattern.

On a semi-related note, could you elaborate more on the optimizer magic when rescanning shared prefixes? I'm not entirely sure I understand what you are referring to, but if there already exists a rustier way to do effectively the same thing, then this feature would be unneeded.

It's certainly not guaranteed, and I don't even know if it happens, but given code like

/* let s: &str; */
match &s[..6] {
    "unlike" => handle_unlike(),
    "unless" => handle_unless(),
    _ => {}
}

the compiler could compile it with straightforward repeated tests (pseudocode mix of Rust and C)

/* slice guard */
let p = s.as_ptr();
if strncmp(p, "unlike", 6) == 0 {
    handle_unlike()
} else if strncmp(p, "unless", 6) == 0 {
    handle_unless()
}

or it could spot the shared prefix and check it once

if strncmp(p, "unl", 3) == 0 {
    if strncmp(p + 3, "ike", 3) == 0 {
        handle_unlike()
    } else if strncmp(p + 3, "ess", 3) == 0 {
        handle_unless()
    }
}

and this is still true with starts_with or whatever method of doing the tests, because there's a guarantee that the bytes behind the &str reference don't change; once the test has been done once, the compiler can see that testing the first few bytes again is redundant and will give the same result.

Think about the API you want to provide for processing tokens. Now give the same thing to your lexer, but with the domain being text.

Hopefully, eventually it'll be possible to write (byte) string prefix patterns the way slice patterns work already, e.g.

match bytes {
    [b'=', b'=', b'=', rest @ ..] => make![===],
    [b'=', b'=', rest @ ..] => make![==],
    [b'=', b'>', rest @ ..] => make![=>],
    [b'=', rest @ ..] => make![=],
    // etc
}
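Note that slice patterns of exactly this shape already compile on &[u8] today; it's only matching string prefixes on &str itself that is still missing. A runnable sketch going through as_bytes() (token names are illustrative):

```rust
// Works on stable Rust today: match operator prefixes with slice patterns
// on a byte slice, returning the token text and the unconsumed rest.
fn lex_op(bytes: &[u8]) -> (&'static str, &[u8]) {
    match bytes {
        [b'=', b'=', b'=', rest @ ..] => ("===", rest),
        [b'=', b'=', rest @ ..] => ("==", rest),
        [b'=', b'>', rest @ ..] => ("=>", rest),
        [b'=', rest @ ..] => ("=", rest),
        rest => ("", rest),
    }
}
```

Arm order matters here: the longest prefixes must come first, since the shorter patterns would otherwise match eagerly.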

The suggestion was:

I think that makes more sense, because "strip" implies (to me) tossing something out, while "split" just means you get back the parts.


The optimizer will very likely transform s.strip_prefix("unless") into integer operations for every keyword that fits into a usize, rather than comparing the bytes individually. You probably get the best performance if you branch on prefixes of length 1, 2, 4, or 8, i.e. you would check for "retu", "unli", "unle", "unti" and then have an additional branch (which will be free if there are no syntax errors).

Have a look at shared_prefix_detection here: Compiler Explorer

The compiler optimizes a prefix check for "unlike" and "unlite" into a single 32-bit integer comparison, "unli" == 1768713845.
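You can reproduce that constant by hand; the sketch below (the helper name is mine) performs the same 4-byte comparison the optimizer emits, by reinterpreting the prefix as a little-endian u32:

```rust
// Read the first four bytes of `s` as a little-endian u32, if present.
// Comparing this value against u32::from_le_bytes(*b"unli") is the
// single-instruction check the optimizer generates for shared prefixes.
fn prefix_as_u32(s: &str) -> Option<u32> {
    s.as_bytes()
        .get(..4)
        .map(|b| u32::from_le_bytes(b.try_into().unwrap()))
}
```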


Wow, this is incredibly insightful! Thanks for the Compiler Explorer link. My takeaway from this discussion is that the Rust compiler and LLVM effectively do all the heavy lifting for whatever string matching I intend to do, so I just need to specify the logic.

I found a way to do what I'm looking for (and a bit more) using just the std::str operations. It looks a bit ugly, but it is safe and works. Note that this only builds on nightly because the pattern feature is needed (not the unstable internals, just the signature).

fn trim_start_with_match<'a, P: Pattern<'a>>(source: &'a str, pattern: P) -> (&'a str, &'a str) {
    let remaining = source.trim_start_matches(pattern);
    // SAFETY: `remaining` is always a suffix of `source`, so strip_suffix
    // is guaranteed to return Some here.
    unsafe {
        let trimmed = source.strip_suffix(remaining).unwrap_unchecked();
        (trimmed, remaining)
    }
}
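If you want to stay on stable and avoid the unsafe, the same split can be done with length arithmetic, since trim_start_matches always returns a suffix of its input. A sketch (name is mine) that takes a char predicate instead of the unstable Pattern generic:

```rust
// Safe, stable variant: recover the consumed prefix by length arithmetic
// instead of unsafe + strip_suffix.
fn trim_start_with_match_safe<'a>(
    source: &'a str,
    pred: impl FnMut(char) -> bool,
) -> (&'a str, &'a str) {
    let remaining = source.trim_start_matches(pred);
    // `remaining` is a tail of `source`, so this index lies on a char boundary.
    let trimmed = &source[..source.len() - remaining.len()];
    (trimmed, remaining)
}
```

The length subtraction can never slice mid-character, because the trimmed prefix always ends where `remaining` begins.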