Better codegen for CPU feature detection

The current codegen for feature detection macros such as is_x86_feature_detected can be poor, especially if detection is done at a fine level of granularity. One particular use case is an &&-chain for multiple features, in which case the atomic load is done once per feature (Godbolt). Doing feature detection inside a loop also has very poor performance, and many potential optimizations are lost at inlining boundaries; when one function calls into another, both must do the full feature detection and dispatch.
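For concreteness, here's the shape of the &&-chain case (a sketch; the particular features are just an example):

```rust
// Each macro invocation performs its own atomic load of the detection
// cache, so this chain compiles to several loads and branches rather
// than a single check.
fn dispatch(_data: &mut [f32]) {
    if is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("bmi2")
    {
        // ... fast path ...
    } else {
        // ... fallback ...
    }
}
```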

The culprit is the use of atomic memory semantics to represent the feature detection cache, which is conservative but not ideal. The actual semantics desired are "dynamically constant," which as far as I know is not represented in any attempt to formalize a memory model. However, LLVM's "unordered" is a close approximation. I well understand the difficulties in exposing these semantics to arbitrary user code (unordered as a solution to “Bit-wise reasoning for atomic accesses”), but I think there's a strong case for using it for this particular purpose.

There are other alternatives to be considered. One is to coalesce the atomic reads, without making any language or code changes. Ralf Jung suggested in that thread that more optimization is possible. I'm wary; while I think that optimization could fix the &&-chain linked above, you wouldn't want to apply it inside a loop, as hoisting the load out would break other valid atomic use cases, such as polling a flag to cancel a long-running computation. For cancellation, re-loading on every iteration is the required behavior, which is the exact opposite of what we want for dispatch based on feature detection.

Another, which I think may be valid for some application developers but not good enough for the core language or ecosystem-infrastructure libraries, is to just use UnsafeCell. This is clearly a data race according to formal memory models, but it's also hard to see how things could go wrong in practice.

For CPU feature detection specifically, another possibility is to resolve it at link time or somewhere around lang_start. I understand the concerns about adding static initializers in general, but in this case the tradeoff may be worth it. I mention this possibility partly because Rust already has a poor implementation of multi-versioning for f32::mul_add() and friends, which is a serious performance footgun when compiled at the default x86-64 target-cpu level. With more optimized feature detection, it may be possible to replace this with something better.

Another related topic I want to mention is how to take the address of a function under multi-versioning, which is discussed as a Rust project goal. If feature detection were reasonably well optimized, then one possibly straightforward answer is to generate code that includes dynamic dispatch, and take the address of that. More generally, better codegen for detection and dispatch may take off some of the pressure to model CPU feature detection in Rust's type system in proposals to make multi-versioning better.


See also cpufeatures - Rust, which chooses to cache the results it gets.


Cpufeatures uses atomics to cache the result just like is_*_feature_detected!().

If you're considering mucking around with trickery like unordered, then the good old PROT_NONE + segfault-handler approach for initialize-once, read-often data could be an alternative on platforms that support it.


If you are able to map the memory as PROT_NONE at program startup, you can just as easily directly initialize the set of available cpu features at that point. Getting all cpu features is cheaper than a syscall that affects the page tables, especially on x86, where you don't need any syscalls at all.


Unfortunately, based on Public view of rust-lang | Zulip team chat it sounds like PROT_NONE might be UB. :frowning:

The pattern I was referring to involves the signal handler fixing up the access, so this would never be observable by the program. It's just a memory load that's unusually slow. Afaict that's a different pattern from what's discussed in that thread. See Public view of rust-lang | Zulip team chat

Ah good point, yeah the "there is no runtime" issue...

If we're willing to tolerate a bit of unsafe which is still sound when used correctly, we can insist that main() calls an initialization function which does the detection and stores the result in an UnsafeCell. Then subsequent nonatomic reads of that cell are fine. The issue is that the initialization function cannot be exported without an unsafe qualifier, as it would be a data race if there were any other concurrent reference to that cell. This may be a good tradeoff for some applications, but places a burden on applications to do the initialization early and to do it safely.
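A sketch of what I mean (names hypothetical; the wrapper type is needed because UnsafeCell itself isn't Sync):

```rust
use std::cell::UnsafeCell;

// Wrapper needed because UnsafeCell itself isn't Sync; the unsafe
// impl is justified only by the calling contract of `init` below.
struct RacyFlag(UnsafeCell<bool>);
unsafe impl Sync for RacyFlag {}

static HAS_AVX2: RacyFlag = RacyFlag(UnsafeCell::new(false));

/// # Safety
/// Must be called (e.g. at the top of main) before any other thread
/// can call `has_avx2`, and must not be called concurrently.
pub unsafe fn init() {
    *HAS_AVX2.0.get() = is_x86_feature_detected!("avx2");
}

/// Plain non-atomic read; sound only because `init` happened-before.
pub fn has_avx2() -> bool {
    unsafe { *HAS_AVX2.0.get() }
}
```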

Regarding doing CPU detection before main: on Windows we're already paying some of that cost and not getting the benefit. The __isa_available_init() function in the CRT does cpuid and stores the result in __isa_enabled, which is declared as an `extern "C"` long (see the Microsoft STL, which takes advantage of this).

Note that we can read __isa_enabled easily from Rust, just by declaring static __isa_enabled: u32; inside an extern "C" block. I'm now wondering whether on Windows it would be reasonable to just read this and skip the Rust-provided check.
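For example (a sketch relying on an undocumented CRT symbol; the 0x20 bit is the AVX2 bit discussed further down):

```rust
// Sketch, windows-msvc only. 0x20 is 1 << __ISA_AVAILABLE_AVX2,
// matching the __isa_enabled & 0x20 check mentioned below.
#[cfg(all(windows, target_env = "msvc"))]
pub fn crt_has_avx2() -> bool {
    extern "C" {
        static __isa_enabled: u32;
    }
    // SAFETY: the CRT initializes __isa_enabled before Rust code runs.
    unsafe { __isa_enabled & 0x20 != 0 }
}
```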

I've also been poking into the implementation of fmaf on Windows (using Visual Studio debugging) and I think there may be another CPU feature detection mechanism set up by the linker. The implementation of fmaf is a jmp to __imp_fmaf, which in turn reads a global dword value (which is 3 on my AMD HX 370), jumps to a fallback function if it's 0, then does vfmadd213ss xmm0, xmm1, xmm2; ret. I set a watchpoint on the dword and couldn't find any code that sets it; it seems to have its value before control is transferred to __scrt_common_main. So, if I'm reading this right, we have two feature detection mechanisms running before main, and one after, and we still get highly suboptimal codegen for idiomatic Rust code.


The MSVC intrinsic seems to be __check_isa_support. We'd[1] ultimately prefer for the windows-msvc target[2] to only use the UCRT and none of the runtime libraries that require a Visual Studio license, but doing the same thing as the intrinsic should be reasonable IMHO, especially while we do still utilize vcruntime by necessity.

I did a very small amount of googling and wasn't able to find any information on MSVC linker/loader functionality that depends on available CPU features. But, in a vacuum, support from the loader does seem like the ideal way to handle function multiversioning.

For GNU systems, STT_GNU_IFUNC seems like it should be able to do that?


  1. Morally required disclaimer: I say "we" but I do not aim to represent any part of the Rust project. This is based solely off of my own vague recollection of various and varied discussions which may have easily been superseded since. ↩︎

  2. Or potentially a new “windows-ucrt” target, depending on whether we consider the usage of vcruntime and vcstartup relevant to the “-msvc” target. I suppose we'll cross that bridge once it becomes reasonable to do so. ↩︎

What you need is compile-time multiversioning of a call graph with those specific instructions enabled (and thus no additional detection code inside those functions), and then runtime switching at some higher entry point. Doing it inside a hot loop is too fine-grained, and trying to make that more efficient seems misguided, because you'd still be branching inside the loop body.

Insist in what way? is_x86_feature_detected is stable, so new requirements cannot be imposed on it. It is possible to run Rust code that uses std without main, e.g. when providing Rust functions as a library to an existing application whose entrypoint is written in C; breaking feature detection for those applications would be a breaking change.

Beyond the PROT_NONE maybe being UB, I don't think I'd like to have to deal with already-installed signal handlers from Rust's runtime. Signals already give me headaches…really don't need more, especially if I'd rather just shove them through signalfd and…then what is supposed to handle the init-once logic?

It's still much better to have it be more efficient if possible, though, to avoid the performance pitfall of naive at-use feature dispatch. The canonical example is fNN::mul_add, which would much rather use the instruction when available than fall back to a libm function call.

Specifically, unordered only works for loads/stores the target guarantees not to tear, and thus ensures (if only unordered accesses are used) that only values which were written at some point will be read. This is the theoretically ideal constraint for racy initialization of a single-word value.

Rust actually already installs a handler for SIGSEGV and SIGBUS, so that a stack overflow gets to print a message indicating so.

But your point still stands: unlike one that would be used to initialize something, the current handler is just informative, so you won't break anything important by replacing it with your own.

That SAFETY comment in there is exactly the kind of stuff that makes my head hurt around signals :slight_smile: .


Indeed. I've done some more experiments and find that the compiler won't inline across a target_feature boundary, so calling into a function that uses a gated intrinsic won't be efficient. So the dream of getting good code with detection in a hot loop looks impractical.

I think I can make a more nuanced argument why you still want more optimized detection. You only get good code when there is an unbroken chain of inlines from a "head" function that has the #[target_feature] attribute enabled down to the leaves. There are cases where you cannot (dynamic dispatch) or simply do not want to inline everything. At those points, you do want dispatching, but you want that dispatching to be as efficient as possible.
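Concretely, the shape I have in mind (a sketch; the names and the avx2/fma choice are just illustrative):

```rust
// Plain Rust leaf; when inlined into the avx2 head below, it inherits
// the head's target features and mul_add lowers to an FMA instruction.
#[inline]
fn kernel(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = x.mul_add(2.0, 1.0);
    }
}

#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn process_avx2(data: &mut [f32]) {
    kernel(data); // inlines here, compiled with AVX2+FMA enabled
}

fn process_fallback(data: &mut [f32]) {
    kernel(data);
}

pub fn process(data: &mut [f32]) {
    // One detection and dispatch at the head, not per element.
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        unsafe { process_avx2(data) }
    } else {
        process_fallback(data)
    }
}
```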

Sorry for not being clear, that "we" was the perspective of writing a library that uses SIMD. Such a library could export a library_init() function that does the detection and stores the flag, but that function would have to be unsafe to avoid (perhaps technical) race conditions.

An update on __isa_enabled (I've been spending quality time with a debugger): it looks like the __ISA_AVAILABLE_AVX2 bit is set based strictly on the "avx2" feature; I don't see any code in there that's checking fma. On the other hand, detection of AVX-512 seems to check all 5 of the x86-64 v4 feature bits.

I also did some digging into the implementation of fmaf. There's an indirection through api-ms-win-crt-math-l1-1-0.dll, but the actual function is in ucrtbase.dll. It checks __use_fma3_lib, but that symbol does not seem to be exposed through the linker. That's actually set by __acrt_initialize_fma3, which does the feature detection using cpuid. That in turn is called by initialize_c, which I'm assuming is called some time before __scrt_common_main, but this is straining my knowledge and curiosity already. That also seems to only check the avx2 feature (bit 5 of extended_features_ebx). I think that's just imprecise detection (see a bug where it misdetects a CPU with AVX2 but not BMI2). That suggests there might be problems running this code on a VIA Isaiah C4650, but I haven't dug further. If this analysis is correct, then Rust code should be failing on that chip already, and new code that only relies on __isa_enabled & 0x20 is no worse.

In any case, my point in this is not that we should be relying on undocumented MS CRT symbols. Rather, it's to argue that it should be ok to do CPU feature detection before main, citing the precedent that we already are. Twice, in fact.

An atomic (relaxed) load is equivalent to a normal load at the machine level on all platforms. I think the problem here is not the atomic load itself, but the presence of multiple atomic loads.

I think a simpler way to resolve the && issue would be for the is_*_feature_detected macro to support taking multiple features and doing the && internally, so the end result of that is cached.

You can of course also write your own cache around it that does this combining.
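For example, something like this (a sketch; the tri-state encoding is just one possible choice):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Combined cache for the whole conjunction:
// 0 = unknown, 1 = not available, 2 = available.
static AVX2_AND_FMA: AtomicU8 = AtomicU8::new(0);

#[inline]
pub fn has_avx2_and_fma() -> bool {
    match AVX2_AND_FMA.load(Ordering::Relaxed) {
        0 => {
            // First call: do the full detection once and cache it.
            let yes = is_x86_feature_detected!("avx2")
                && is_x86_feature_detected!("fma");
            AVX2_AND_FMA.store(if yes { 2 } else { 1 }, Ordering::Relaxed);
            yes
        }
        n => n == 2,
    }
}
```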


I wonder if there is a way to express this using an inline assembly block marked pure and nomem, which checks the global cache and, if it's empty, executes cpuid and stores the value in the global cache for later. If I understand nomem correctly, it should be sound as long as the assembly block is the only one that can access the global cache, right?
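Something like this sketch, say (untested, AVX2-only, and skipping the XGETBV/OSXSAVE checks a real implementation would need; the soundness of claiming nomem here is exactly the open question):

```rust
use std::arch::asm;

// Cache word, touched only from inside the asm block below; that
// exclusivity is what the `nomem` claim relies on.
static mut CACHE: u32 = 0; // 0 = unknown; else bit 0 = init, bit 1 = AVX2

/// `pure` + `nomem` with no inputs tells the compiler the result is
/// effectively constant, so repeated calls can be CSE'd or hoisted.
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn has_avx2() -> bool {
    let res: u32;
    unsafe {
        asm!(
            "mov {res:e}, dword ptr [rip + {cache}]",
            "test {res:e}, {res:e}",
            "jnz 2f",
            // First call only: CPUID leaf 7, AVX2 is EBX bit 5.
            "push rbx",         // rbx is reserved by LLVM; preserve it
            "mov eax, 7",
            "xor ecx, ecx",
            "cpuid",
            "mov {res:e}, ebx",
            "pop rbx",
            "shr {res:e}, 4",   // move EBX bit 5 down to bit 1
            "and {res:e}, 2",
            "or {res:e}, 1",    // bit 0 marks "initialized"
            "mov dword ptr [rip + {cache}], {res:e}",
            "2:",
            cache = sym CACHE,
            res = out(reg) res,
            out("eax") _,
            out("ecx") _,
            out("edx") _,
            options(pure, nomem),
        );
    }
    res & 2 != 0
}
```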

SkiFire13: yes, I think this may be a winner. Originally I considered asm but had two thoughts, both of which now seem to be wrong. One was that the asm block would always be included, so it wouldn't be an optimization; the other was that it wouldn't be any safer to use asm than the equivalent unsafe Rust. But I think pure with no inputs is in fact a way to express "dynamically constant" as stated above. Also, I think the asm is safe where UnsafeCell would not be, and I agree with your interpretation of nomem: the cache is private to the asm.

This is actually better than unordered, as the compiler can assume the value won't change even with arbitrary function calls.

I'm prototyping and it looks promising. I'd appreciate confirmation that the approach is sound, or a heads-up on any potential risks or problems.


Also, I feel like the correct term for this is "idempotent". The operation of fetching the cpuid might have side effects the first time it's executed (e.g. setting the global cache), but repeated executions will always result in a no-op (which can hence be removed) and the same output (which can hence be cached).