`cfg(target_feature = "fast-bmi2")`? `cfg(target_cpu = "...")`?

Example: _pext_u64() and _pdep_u64() can be useful to speed up some bit manipulation tasks. Unfortunately, the instructions are slow on some CPUs, where they are usually slower than problem-specific fallbacks.
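For context, here is a minimal sketch of the selection that can be written today. The names `pext` and `pext_fallback` are hypothetical, and the fallback is a generic bit-by-bit loop standing in for whatever problem-specific code a library would really use.

```rust
/// Generic software stand-in for PEXT: gathers the bits of `value`
/// selected by `mask` into the low bits of the result.
/// (Hypothetical helper; a real library would use a problem-specific fallback.)
fn pext_fallback(value: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut out_bit = 0u32;
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if value & lowest != 0 {
            result |= 1u64 << out_bit;
        }
        out_bit += 1;
        // Clear the lowest set bit and continue with the next one.
        mask &= mask - 1;
    }
    result
}

/// Chosen when the build statically enables BMI2.
/// Note: this says the instruction is available, not that it is fast.
#[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
fn pext(value: u64, mask: u64) -> u64 {
    // SAFETY: `bmi2` is statically enabled for this build,
    // so the instruction is available on the target.
    unsafe { core::arch::x86_64::_pext_u64(value, mask) }
}

/// Chosen on every other build.
#[cfg(not(all(target_arch = "x86_64", target_feature = "bmi2")))]
fn pext(value: u64, mask: u64) -> u64 {
    pext_fallback(value, mask)
}
```

Both paths give the same results; the open question in this thread is only which one is faster on a given CPU, and `target_feature = "bmi2"` alone cannot answer that.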

So how can library authors properly select the optimized or fallback implementation?


  • As of now #[cfg(target_feature = "bmi2")] means pext and pdep will be available on the target CPU. Introduce #[cfg(target_feature = "fast-bmi2")], meaning that the target CPU supports the feature with a fast implementation (instead of microcode).

    LLVM already has some fast-* subtarget features, like fast-bextr, wherever they are useful for LLVM's own code generation. Currently none of these are whitelisted for exposure by rustc.

  • Expose the target CPU via something like #[cfg(target_cpu = "znver3")] (with possible values taken from rustc --print target-cpus). That would be sufficiently powerful to manually select CPU-specific optimizations.
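To make the two proposals concrete, here is a rough sketch of how each might read in library code. This is hypothetical: rustc exposes neither `target_feature = "fast-bmi2"` nor a `target_cpu` cfg today, so both predicates currently evaluate to false (and may trigger an `unexpected_cfgs` warning). The function names and the CPU list are illustrative assumptions only.

```rust
// HYPOTHETICAL: `target_feature = "fast-bmi2"` and `target_cpu = "..."` are
// the proposals from this thread, not cfgs that rustc provides today.

/// Option 1: a dedicated "fast" feature alongside the plain availability flag.
fn use_pext_option1() -> bool {
    cfg!(all(target_feature = "bmi2", target_feature = "fast-bmi2"))
}

/// Option 2: match on the selected target CPU (illustrative allowlist).
fn use_pext_option2() -> bool {
    cfg!(all(
        target_feature = "bmi2",
        any(target_cpu = "znver3", target_cpu = "haswell")
    ))
}
```

With option 1 the knowledge about which CPUs are fast lives in the compiler; with option 2 every library has to maintain its own CPU list.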



I don't think we'd want to have target_cpu options, because those would break if someone selects a set of CPU features rather than a specific target CPU.

Having fast-xyz features seems potentially workable, though, if LLVM has a clear definition of those.

EDIT: Thinking about it further, I don't think we want fast-xyz; I think we want to know if a target CPU has slow xyz, and otherwise we should assume that it has fast xyz.

What kind of breakage do you mean with target_cpu?

Often, people enable specific target features, without enabling a specific target CPU. For instance, people may turn on specific instruction sets they expect to have available on machines they want to use, without ever setting a target CPU.

I also expect that many people will set target_cpu to a class of CPUs rather than a specific model; for instance, it'll become increasingly common to use x86-64-v2 or x86-64-v3.

And in any case, this is a temporary problem, since current generation AMD CPUs no longer have a massive performance limitation on PEXT/PDEP. So it may make sense to use the target CPU instead to determine if the CPU has slow PEXT/PDEP, and flag that, while treating all other CPUs (or just the enablement of bmi2 without a specific target CPU) as having fast PEXT/PDEP.

Indeed, I would write this case as

    target_feature = "bmi2",
    not(target_cpu = "znver1"),
    not(target_cpu = "znver2")
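Spelled out as a complete attribute, the condition above might look like the following sketch. `target_cpu` is still the hypothetical cfg proposed in this thread, and `PEXT_ASSUMED_FAST` is an illustrative name. Because rustc does not set `target_cpu` today, the two `not(...)` clauses are always true and the constant currently just mirrors whether `bmi2` is enabled.

```rust
// HYPOTHETICAL: `target_cpu` is not a cfg that rustc exposes today.
// Denylist form: assume PEXT/PDEP are fast unless the build targets
// one of the CPUs known to be slow.
#[cfg(all(
    target_feature = "bmi2",
    not(target_cpu = "znver1"),
    not(target_cpu = "znver2")
))]
const PEXT_ASSUMED_FAST: bool = true;

#[cfg(not(all(
    target_feature = "bmi2",
    not(target_cpu = "znver1"),
    not(target_cpu = "znver2")
)))]
const PEXT_ASSUMED_FAST: bool = false;
```

A denylist like this keeps working for builds that enable `bmi2` without naming a CPU, or that target a class such as x86-64-v3, which is the case raised earlier in the thread.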

Yes, someone compiling for e.g. x86-64-v3 may see bad performance when running the result on their AMD CPU, but they would still have the option to recompile for that specific target, fixing the issue without losing other optimizations that do apply. Notably, the rest of BMI2 performs fine on those processors: see the table of latencies at uops.info (smaller numbers are better).

