The current codegen for feature detection macros such as is_x86_feature_detected
can be poor, especially when detection is done at a fine granularity. One particular use case is an &&-chain over multiple features, where the atomic load is repeated for each check (Godbolt). Feature detection inside a loop also performs very poorly, and many opportunities are lost around inlining: when one function calls into another, both must do the full feature detection and dispatch.
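To make the repeated-load behavior concrete, here is a minimal model of a detection cache in the same style as std_detect's (an `AtomicU32` with an "initialized" bit, loaded with relaxed ordering); the layout and `cpuid_stub` are illustrative stand-ins, not std's actual code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of a detection cache: bit 31 marks "initialized",
// lower bits are feature flags.
static CACHE: AtomicU32 = AtomicU32::new(0);

const INITIALIZED: u32 = 1 << 31;

fn cpuid_stub() -> u32 {
    // Hypothetical OS/cpuid query; pretend features 0 and 1 are present.
    0b11
}

fn detect(bit: u32) -> bool {
    // Every call performs this atomic load; the compiler cannot
    // coalesce or hoist it without changing observable behavior.
    let mut cached = CACHE.load(Ordering::Relaxed);
    if cached & INITIALIZED == 0 {
        cached = cpuid_stub() | INITIALIZED;
        CACHE.store(cached, Ordering::Relaxed);
    }
    cached & (1 << bit) != 0
}

fn main() {
    // An &&-chain repeats the atomic load once per check:
    let both = detect(0) && detect(1);
    assert!(both);
}
```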
The culprit is the use of atomic memory semantics to represent the feature detection cache, which is conservative but not ideal. The semantics actually desired are "dynamically constant," which as far as I know is not represented in any attempt to formalize a memory model. However, LLVM's "unordered" ordering is a close approximation. I well understand the difficulties in exposing these semantics to arbitrary user code (see unordered as a solution to “Bit-wise reasoning for atomic accesses”), but I think there's a strong case for using it for this particular purpose.
There are other alternatives to consider. One is to coalesce the atomic reads, without making any language or user-code changes. Ralf Jung suggested in that thread that more optimization is possible. I'm wary: while such an optimization could fix the &&-chain linked above, you wouldn't want to apply it inside a loop, because hoisting the load would break other valid atomic use cases, such as polling a flag to cancel a long-running computation. That behavior is the exact opposite of what we want for dispatch based on feature detection.
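The cancellation pattern that loop hoisting would break looks like this (a minimal sketch):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = cancel.clone();
    let worker = thread::spawn(move || {
        let mut n: u64 = 0;
        // If the compiler hoisted this relaxed load out of the loop,
        // the thread could never observe the store below and would
        // spin forever -- exactly the hoisting that is fine for
        // feature detection but wrong here.
        while !flag.load(Ordering::Relaxed) {
            n += 1;
        }
        n
    });
    cancel.store(true, Ordering::Relaxed);
    // Terminates because the worker eventually observes the store.
    let _iters = worker.join().unwrap();
}
```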
Another alternative, which I think may be valid for some application developers but not good enough for the core language or ecosystem-infrastructure libraries, is to just use UnsafeCell. Accessing the cache with plain loads and stores is clearly a data race according to formal memory models, but it's also hard to see how things could go wrong in practice, since every thread writes the same value.
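A sketch of what that would look like, with a hypothetical `cpuid_stub` standing in for the real query; the `unsafe impl Sync` is precisely where the formal data race is smuggled in:

```rust
use std::cell::UnsafeCell;

// Wrapper so the static can be shared across threads. The unsafe
// Sync impl is the point: we are deliberately permitting what the
// memory model calls a data race.
struct RacyCache(UnsafeCell<u32>);
unsafe impl Sync for RacyCache {}

static CACHE: RacyCache = RacyCache(UnsafeCell::new(0));

const INITIALIZED: u32 = 1 << 31;

fn cpuid_stub() -> u32 {
    0b11 // hypothetical feature bits
}

fn features() -> u32 {
    unsafe {
        let p = CACHE.0.get();
        // Plain, non-atomic load: the compiler may freely coalesce
        // and hoist it, which is the performance win. Racing writers
        // all store the same value, so it is hard to see how things
        // go wrong in practice -- but it is still UB by the formal model.
        let mut v = *p;
        if v & INITIALIZED == 0 {
            v = cpuid_stub() | INITIALIZED;
            *p = v;
        }
        v
    }
}

fn main() {
    assert_eq!(features() & 0b11, 0b11);
}
```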
For CPU feature detection specifically, another possibility is to resolve it at link time or somewhere around lang_start. I understand the concerns about adding static initializers in general, but in this case the tradeoff may be worth it. I mention this possibility partly because Rust already has a poor implementation of multi-versioning for f32::mul_add() and friends, which is a serious performance footgun when compiling at the default x86-64 target-cpu level. With better-optimized feature detection, it may be possible to replace that with something better.
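In safe, stable Rust today, the closest approximation to resolve-once dispatch is caching a function pointer; a sketch, with `cfg!` standing in for runtime detection. Note that `OnceLock` still pays a per-call atomic check, which is exactly what a true static initializer (resolved around lang_start, or at link time via something like ifunc) would eliminate:

```rust
use std::sync::OnceLock;

fn mul_add_fma(a: f32, b: f32, c: f32) -> f32 {
    // In a real implementation this would be compiled with the
    // fma target feature enabled.
    a.mul_add(b, c)
}

fn mul_add_soft(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

static MUL_ADD: OnceLock<fn(f32, f32, f32) -> f32> = OnceLock::new();

fn mul_add(a: f32, b: f32, c: f32) -> f32 {
    // Resolve the implementation once; cfg! is a compile-time
    // stand-in for runtime feature detection.
    let f = MUL_ADD.get_or_init(|| {
        if cfg!(target_feature = "fma") {
            mul_add_fma
        } else {
            mul_add_soft
        }
    });
    f(a, b, c)
}

fn main() {
    assert_eq!(mul_add(2.0, 3.0, 1.0), 7.0);
}
```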
A related topic I want to mention is how to take the address of a function under multi-versioning, which is discussed as part of a Rust project goal. If feature detection were reasonably well optimized, then one fairly straightforward answer is to generate a function that performs the dynamic dispatch internally, and take the address of that. More generally, better codegen for detection and dispatch may relieve some of the pressure to model CPU feature detection in Rust's type system in proposals to improve multi-versioning.
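The dispatch-wrapper answer can be sketched as follows; `has_simd` and `sum_simd` are illustrative stand-ins for real detection and a real `#[target_feature]` version:

```rust
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum_simd(xs: &[f32]) -> f32 {
    // Stand-in for a #[target_feature(enable = "avx2")] version.
    xs.iter().sum()
}

fn has_simd() -> bool {
    // Stand-in for runtime feature detection.
    cfg!(target_arch = "x86_64")
}

// A plain function that dispatches internally. Its address is a
// single stable symbol, so it can be stored in a function pointer,
// put in a vtable, or handed to C -- unlike the individual versions,
// where the right choice depends on runtime detection.
fn sum(xs: &[f32]) -> f32 {
    if has_simd() {
        sum_simd(xs)
    } else {
        sum_scalar(xs)
    }
}

fn main() {
    let f: fn(&[f32]) -> f32 = sum;
    assert_eq!(f(&[1.0, 2.0, 3.0]), 6.0);
}
```

If the detection inside `sum` is well optimized, the cost of going through this wrapper is close to that of an ordinary indirect call, which is what makes it attractive as the default answer for function addresses.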