I'm confused where an inline assembly block would be helpful. A relaxed atomic load should have the same runtime behavior.
Briefly: if a function calls several callees, and each of those callees has its own detection and dispatch, and those callees are inlined, then all but one of the detections can be eliminated.
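To make that concrete, here is a minimal sketch of the pattern, with hypothetical names (`LEVEL`, `level`, `scale`, `offset`) that are not from any particular crate. Each callee does its own cached detection through a relaxed atomic load; once both are inlined into the caller, the second check is redundant and is the candidate for elimination. Both branches run the same scalar loop here so the sketch stays portable.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Cached detection result: 0 = not yet detected, 1 = fallback, 2 = AVX2.
static LEVEL: AtomicU8 = AtomicU8::new(0);

#[cold]
fn detect() -> u8 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return 2;
        }
    }
    1
}

#[inline]
fn level() -> u8 {
    // Relaxed is enough: the value is idempotent and carries no other data.
    let cached = LEVEL.load(Ordering::Relaxed);
    if cached != 0 {
        return cached;
    }
    let detected = detect();
    LEVEL.store(detected, Ordering::Relaxed);
    detected
}

#[inline]
fn scale(xs: &mut [f32], k: f32) {
    if level() == 2 {
        // A real implementation would call an AVX2 kernel here.
        xs.iter_mut().for_each(|x| *x *= k);
    } else {
        xs.iter_mut().for_each(|x| *x *= k);
    }
}

#[inline]
fn offset(xs: &mut [f32], k: f32) {
    if level() == 2 {
        // A real implementation would call an AVX2 kernel here.
        xs.iter_mut().for_each(|x| *x += k);
    } else {
        xs.iter_mut().for_each(|x| *x += k);
    }
}

// After inlining, this body contains two `level()` checks; only one is needed.
fn scale_then_offset(xs: &mut [f32], a: f32, b: f32) {
    scale(xs, a);
    offset(xs, b);
}
```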
That said, I've been continuing to experiment, and current Rust and LLVM optimize this less well than the alternative approach: make functions generic over the SIMD capability (as pulp does), pass around a ZST token for it, and dispatch on the token (I do this with methods like .as_avx2() on the Simd trait that the token implements). That's yielding the highest quality code in my experiments, but ergonomically it is not ideal.
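A rough sketch of the token approach, to show the shape of it. All names here (`Avx2`, `Fallback`, `try_new`, `sum`, `sum_dispatch`) are made up for illustration and are not pulp's actual API; only the `Simd` trait and `.as_avx2()` come from the description above. Detection happens once at the outer entry point, and everything below it is generic over a zero-sized token, so after monomorphization the `as_avx2()` check is a compile-time constant.

```rust
/// Zero-sized token proving AVX2 was detected (private field, so it can
/// only be built through `try_new`).
#[derive(Clone, Copy)]
pub struct Avx2(());

/// Zero-sized token for the portable fallback.
#[derive(Clone, Copy)]
pub struct Fallback;

pub trait Simd: Copy {
    /// Returns an AVX2 token if this capability level includes AVX2.
    #[inline(always)]
    fn as_avx2(self) -> Option<Avx2> {
        None
    }
}

impl Simd for Fallback {}

impl Simd for Avx2 {
    #[inline(always)]
    fn as_avx2(self) -> Option<Avx2> {
        Some(self)
    }
}

impl Avx2 {
    /// The only runtime detection in the whole call tree.
    pub fn try_new() -> Option<Self> {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if std::is_x86_feature_detected!("avx2") {
                return Some(Avx2(()));
            }
        }
        None
    }
}

/// Generic over the capability; the branch folds away per instantiation.
#[inline(always)]
pub fn sum<S: Simd>(simd: S, xs: &[f32]) -> f32 {
    if let Some(_avx2) = simd.as_avx2() {
        // A real implementation would call an AVX2 kernel here.
        xs.iter().sum()
    } else {
        xs.iter().sum()
    }
}

/// Outer entry point: detect once, then pick the monomorphized version.
pub fn sum_dispatch(xs: &[f32]) -> f32 {
    match Avx2::try_new() {
        Some(token) => sum(token, xs),
        None => sum(Fallback, xs),
    }
}
```

The ergonomic cost is visible even in this small sketch: every function in the call tree has to take the extra `simd` parameter and carry the `S: Simd` bound.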