I'm building a lexer in Rust, and I am currently resolving tokens using a trie. I am using str over u8 since I want to be able to accommodate UTF-8 strings, and str handles a lot of that heavy lifting for me. The current Pattern API allows specifying multiple chars to match on, but in such cases there is no way to find out which char the string was matched on. This leads to unergonomic and potentially slower code. Here's what my code looks like now:
if let Some(stripped) = string.strip_prefix('=') {
    // Nested checks continue from the already-stripped remainder.
    if let Some(stripped) = stripped.strip_prefix('=') {
        if let Some(stripped) = stripped.strip_prefix('=') {
            string = stripped;
            make_eqeqeq_token()
        } else {
            string = stripped;
            make_eqeq_token()
        }
    } else {
        string = stripped;
        make_eq_token()
    }
} else if let Some(stripped) = string.strip_prefix('>') {
    if let Some(stripped) = stripped.strip_prefix('=') {
        string = stripped;
        make_geq_token()
    } else {
        string = stripped;
        make_greater_token()
    }
} else if let Some(stripped) = string.strip_prefix('<') {
    // repeats
}
I propose making a method that also returns what character matched, which would allow for a more pattern-matchy syntax like below.
match string.strip_prefix_with_match(&['=', '>', '<', ...]) {
    ('=', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => match stripped.strip_prefix('=') {
            Some(stripped) => {
                string = stripped;
                make_eqeqeq_token()
            }
            None => {
                // Only "==" matched: advance past it.
                string = stripped;
                make_eqeq_token()
            }
        },
        None => {
            // Only "=" matched: advance past it.
            string = stripped;
            make_eq_token()
        }
    },
    ('>', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => {
            string = stripped;
            make_geq_token()
        }
        None => {
            string = stripped;
            make_greater_token()
        }
    },
    ('<', stripped) => match stripped.strip_prefix('=') {
        Some(stripped) => {
            string = stripped;
            make_leq_token()
        }
        None => {
            string = stripped;
            make_less_token()
        }
    },
    ('!', stripped) => { /* repeats */ }
}
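For illustration, something close to this can be approximated in user code today. The sketch below is only that: the free function name strip_prefix_with_match and its Option<(char, &str)> return type are my assumptions, not an existing str API, and a real method would presumably go through the Pattern machinery instead.

```rust
/// Hypothetical helper (not part of std): if `s` starts with one of
/// `candidates`, return the matched char together with the rest of the string.
fn strip_prefix_with_match<'a>(s: &'a str, candidates: &[char]) -> Option<(char, &'a str)> {
    // Decode the first char once (UTF-8 aware), then check membership.
    let first = s.chars().next()?;
    if candidates.contains(&first) {
        // `len_utf8` is the byte length of the matched char, so the slice
        // below always starts on a char boundary.
        Some((first, &s[first.len_utf8()..]))
    } else {
        None
    }
}

fn main() {
    let mut input = ">= 1";
    let kind = match strip_prefix_with_match(input, &['=', '>', '<']) {
        Some(('>', rest)) => match rest.strip_prefix('=') {
            Some(rest) => {
                input = rest;
                "geq"
            }
            None => {
                input = rest;
                "greater"
            }
        },
        Some((_, rest)) => {
            input = rest;
            "other operator"
        }
        None => "no operator",
    };
    println!("{kind}, remaining: {input:?}");
}
```

Returning an Option mirrors strip_prefix, so the no-match case is still explicit; the tuple carries the extra information about which char matched.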
I see two advantages to this approach.
The first is readability. In general, pattern matching is more readable than an if / else if chain, especially as the number of cases grows. Additionally, with this approach the reader can figure out which case an arm corresponds to much more quickly, since they only need to read up to ('=', stripped), whereas in the current implementation they need to read almost to the very end of the line, to the argument of strip_prefix.
Secondly, it may have performance benefits. I preface this by stating that I'm not too well versed in Rust internals, but intuitively, in this implementation only one function call is made, followed by a match on a character, an operation that can be optimized to constant time. In the current version, a function call must be made for each and every case. To be honest, I wouldn't be surprised if LLVM already optimizes this down to effectively one call plus constant-time matching, but this is a relatively minor benefit compared to the first anyway.
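To make the "one call, then match on a character" shape concrete, here is how that dispatch can be written by hand today: decode the first char once, then match on it. This is a sketch under assumptions: the Token enum and lex_operator function are hypothetical stand-ins for the make_*_token() helpers above, and I make no claim about what the optimizer actually emits for either version.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Eq,
    EqEq,
    EqEqEq,
    Greater,
    Geq,
    Less,
    Leq,
}

/// Hypothetical lexer step: try to read one operator from the front of `s`,
/// returning the token and the remaining input.
fn lex_operator(s: &str) -> Option<(Token, &str)> {
    let mut chars = s.chars();
    // Single decode of the leading char; `as_str` yields what follows it.
    let first = chars.next()?;
    let rest = chars.as_str();
    Some(match first {
        '=' => match rest.strip_prefix('=') {
            Some(rest) => match rest.strip_prefix('=') {
                Some(rest) => (Token::EqEqEq, rest),
                None => (Token::EqEq, rest),
            },
            None => (Token::Eq, rest),
        },
        '>' => match rest.strip_prefix('=') {
            Some(rest) => (Token::Geq, rest),
            None => (Token::Greater, rest),
        },
        '<' => match rest.strip_prefix('=') {
            Some(rest) => (Token::Leq, rest),
            None => (Token::Less, rest),
        },
        // Not an operator this sketch knows about.
        _ => return None,
    })
}

fn main() {
    let mut input = "===>=<";
    while let Some((token, rest)) = lex_operator(input) {
        println!("{token:?}");
        input = rest;
    }
}
```

This works, but it means stepping outside the Pattern API for the first character, which is the part the proposed method would cover.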
There was a previous discussion on this here, but it was shut down due to the lack of a clear use case. I present my code as a clear use case for a string pattern match that returns the matched pattern.