I was benchmarking match statements and realized that using runtime-allocated data structures can be faster than hard-coded match patterns.
e.g.

match string {
    // .. lots of strings
}

let mut data = someDataStructure();
data.insert_lots_of_strings();
data.get(value) // this is faster
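For concreteness, here is a minimal sketch of the two shapes being compared (names and values are invented for illustration; the original benchmark may differ):

use std::collections::HashMap;

// Hard-coded match over string literals.
fn via_match(s: &str) -> u32 {
    match s {
        "alpha" => 0,
        "beta" => 1,
        "gamma" => 2,
        _ => u32::MAX,
    }
}

// The same lookup through a runtime-built hash map.
fn via_map(s: &str, map: &HashMap<&str, u32>) -> u32 {
    map.get(s).copied().unwrap_or(u32::MAX)
}

fn main() {
    let map: HashMap<&str, u32> =
        [("alpha", 0), ("beta", 1), ("gamma", 2)].into_iter().collect();
    assert_eq!(via_match("beta"), via_map("beta", &map));
}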
It seems like the match statement always translates into a jump table. Do you think it would be beneficial if the compiler were more aggressive in its optimization strategy, e.g. inserting a hash function for strings and such?
So, if I were to implement this as a trait, it would look something like this:
trait MatchStmtTrait<T>
where
    T: Eq + PartialEq,
{
    /// Is there a function implemented for narrowing down possible candidates?
    const IS_MATCH_HINT_IMPLEMENTED: bool = false;

    /// Creates a value that helps you narrow down the options.
    fn match_hint(&self) -> T;
}
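For instance, an implementation might use the string length as a cheap discriminant (my own illustration of the sketch, not a worked-out design):

struct Keyword(String);

impl MatchStmtTrait<usize> for Keyword {
    const IS_MATCH_HINT_IMPLEMENTED: bool = true;

    // Use the length as a cheap value for narrowing down candidate arms.
    fn match_hint(&self) -> usize {
        self.0.len()
    }
}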
This is something that I just came up with, and I must admit that it's not well thought out. I'd appreciate feedback.
Match does get optimized by LLVM, but obviously LLVM can't get it right in every case. For instance, if the early match clauses were much more likely than the later ones, I would expect a naïve match to outperform a more complex search. someDataStructure also has different semantics from match in the case of duplicate keys, which is maybe harder for LLVM to reason about.
In general this seems more like a case of: if you understand your data, you can come up with faster code to deal with it.
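To illustrate the duplicate-key difference mentioned above (my example, assuming a HashMap as the data structure):

use std::collections::HashMap;

fn main() {
    // In a match, the first matching arm wins; a later duplicate arm is
    // simply unreachable (and rustc warns about it).
    let via_match = match "k" {
        "k" => 1,
        // "k" => 2, // would trigger an unreachable_patterns warning
        _ => 0,
    };

    // In a map, a later insert with the same key overwrites the earlier one.
    let mut map = HashMap::new();
    map.insert("k", 1);
    map.insert("k", 2);

    assert_eq!(via_match, 1);
    assert_eq!(map["k"], 2);
}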
This is up to LLVM, generally. It's allowed to mostly do what it wants with the control flow. So the default answer for "I'd like it to codegen differently" is to file an LLVM issue describing how it should have picked a different strategy. (After all, LLVM has the target-specific knowledge about things like branch costs, predicated instructions, etc.)
That said, it's possible that there are some cases where rustc isn't communicating enough information to LLVM for it to be able to make good choices. It would perhaps help to demonstrate some of those, in order to focus what would be best on the Rust side to be able to convey intent and thus let LLVM know better what to do.
A classic example: it might be good to have ways to mark some match arms as much more likely than others, in which case pre-checking that arm with a normal branch might be a good choice, then using a jump table for the others, say.
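Today the closest hand-written equivalent is hoisting the hot arm out yourself; a sketch (the arm values are invented):

fn lookup(s: &str) -> u32 {
    // Pre-check the (assumed) most likely value with a plain branch...
    if s == "hot" {
        return 0;
    }
    // ...then let the backend pick whatever strategy it likes for the rest.
    match s {
        "a" => 1,
        "b" => 2,
        "c" => 3,
        _ => u32::MAX,
    }
}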
But that does tend to quickly run into "well, if you care about that you should probably use PGO instead" to get far more information -- and more accurate information -- than the programmer can reasonably include in source.
I don't think it should be left up to LLVM. Generating naive code, having LLVM guess that it looks like a match, and then optimizing it somehow is a lot of work, and in practice it doesn't work well (e.g. a match on lots of strings ends up as a series of calls to memcmp). Rustc should be able to generate smarter match code in the first place.
What kind of "smarter" are you thinking here? Trying to change match lowering -- already one of the most complicated parts of the compiler -- sounds scary. And optimizing it later makes me think it'd be easier to just do in LLVM. (LLVM knows what memcmp is; it could optimize those string compares itself.)
This compiles down to this assembly.
It just calls PartialEq multiple times.
I think we could make it faster by comparing against the lengths of the strings before calling PartialEq; that should let us avoid all those costly operations.
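Hand-written, the length-first idea looks roughly like this (a sketch with invented identifiers, not compiler output):

fn lookup(s: &str) -> u32 {
    // Dispatch on length first, so each inner match only compares
    // candidates that could possibly be equal.
    match s.len() {
        3 => match s {
            "foo" => 0,
            "bar" => 1,
            _ => u32::MAX,
        },
        4 => match s {
            "quux" => 2,
            _ => u32::MAX,
        },
        _ => u32::MAX,
    }
}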
It appears that an alloc call happens before calling PartialEq; if we could remove it, we'd save CPU time and cache, and reduce memory allocation.
This starts by matching against the enum, then goes on to compare against the strings.
I think we might be able to further optimize this by telling LLVM that it can change the ordering of those operations.
I wouldn't expect rustc to be smart about complex bindings with mixed moves and if-guard clauses, but I think it could have predefined solutions for situations where the arms are all simple constants with strings or integers, and generate perfect hash tables/jump tables, or even use bisection.
I'm not sure what to do about the assumption that the first few match arms may be more likely. Perhaps it should leave things up to LLVM for low numbers of match arms, and be algorithmically clever only for huge match statements. Maybe it could do direct linear checks for the first 1-3 arms, and assume everything else is equally likely, and thus suitable for clever algorithms.
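That hybrid could look something like this if written by hand (a sketch; the names and the two-arm cutoff are invented):

use std::collections::HashMap;

fn lookup(s: &str, rest: &HashMap<&str, u32>) -> u32 {
    // Direct linear checks for the first few (presumed hot) arms...
    if s == "foo" {
        return 0;
    }
    if s == "bar" {
        return 1;
    }
    // ...and a table for the remaining arms, assumed equally likely.
    rest.get(s).copied().unwrap_or(u32::MAX)
}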
LLVM on its own is not clever. This code is awful:
The thing that's unclear to me is whether that's fundamental to LLVM somehow.
Is there a reason that LLVM shouldn't do better for that LLVM-IR input? My instinct is that it's better for LLVM to recognize matching on bytes like this and do something smarter so that clang and rustc and co don't all need to special-case it somehow and emit something harder for LLVM to recognize.
> but I think it could have predefined solutions for situations where the arms are all simple constants with strings or integers, and generate perfect hash tables/jump tables, or even use bisection.
I would note that any switch/if-ladder can be optimized by LLVM (and GCC) into:

- An if-ladder.
- An if-tree (bisection).
- A jump table.
- And combinations of the above (hand-written equivalents are sketched below).
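For a toy match x { 0 => 10, 1 => 20, 2 => 30, _ => 0 }, these strategies look roughly like this (my illustration of what the backend chooses between, not its actual output):

fn ladder(x: u32) -> u32 {
    // If-ladder: test each key in turn.
    if x == 0 { 10 } else if x == 1 { 20 } else if x == 2 { 30 } else { 0 }
}

fn tree(x: u32) -> u32 {
    // If-tree: bisect the key space instead of testing keys in order.
    if x < 2 {
        if x == 0 { 10 } else { 20 }
    } else if x == 2 {
        30
    } else {
        0
    }
}

fn table(x: u32) -> u32 {
    // Jump table: dense keys index into a table of results (or, for
    // real control flow, a table of branch targets).
    const T: [u32; 3] = [10, 20, 30];
    if (x as usize) < T.len() { T[x as usize] } else { 0 }
}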
Going straight to PHF in rustc may impact the ability of the backend to apply the above, and thus incur regressions on existing code, which would be fairly sad.
> LLVM on its own is not clever. This code is awful:
It doesn't look great indeed, but how does it perform?
First of all, the first comparison separates 4-byte strings from 5-byte strings.
Secondly, LLVM does apply the string -> int optimization, packing the bytes into a register and comparing the registers themselves... but it suffers from the fact that this strategy only works with power-of-2 sizes, and your identifiers foo10 to foo99 are 5 bytes long, requiring a 4-byte comparison followed by a 1-byte comparison.
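That trick, written out by hand for one of the identifiers (my sketch of what LLVM is effectively doing):

fn is_foo10(s: &str) -> bool {
    let b = s.as_bytes();
    // 5 bytes can't be compared in one power-of-2-sized load, so:
    // one 4-byte register comparison, then one 1-byte comparison.
    b.len() == 5
        && u32::from_le_bytes(b[..4].try_into().unwrap()) == u32::from_le_bytes(*b"foo1")
        && b[4] == b'0'
}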
I can see that LLVM picked an if-ladder for the 4-byte section, for example, instead of an if-tree. This does not seem algorithmically ideal, but I have no idea how it performs. And that performance may depend on the data submitted; it'll likely vary quite a lot depending on whether foo1 appears much more often than foo2 (etc...) or whether they're all equally likely.
A PHF's performance profile should be flat regardless of the input, which is better for random data, but worse for heavily biased data where the first branch matches most often. Preserving this "edge" for the first (few) branch(es) may be part of the reason LLVM preserves the if-ladder: assuming the developer knows best.
I don't want to dismiss the PHF idea, but I am afraid that because it won't automatically be a win, it'd be hard to apply systematically.
A more guided approach would avoid the issue. For example, if a PHF were only used when the match statement was adorned with a #[rustc::phf] attribute, then it would be easy to benchmark whether the PHF is beneficial or not, and apply it only when it is.
And interestingly, this seems like something a proc-macro could do, to validate the experiment.
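As a rough sketch of what such a macro could expand to, using the existing phf crate (my illustration; the attribute itself doesn't exist today):

use phf::phf_map;

// A compile-time perfect hash map over the arm strings.
static ARMS: phf::Map<&'static str, u32> = phf_map! {
    "foo" => 0,
    "bar" => 1,
    "baz" => 2,
};

fn lookup(s: &str) -> u32 {
    ARMS.get(s).copied().unwrap_or(u32::MAX)
}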