Pre-RFC: contextual target feature detection

Summary

Add new is_{arch}_feature_enabled macros for detecting at compile time whether a target feature is enabled in the current context.

Motivation

RFC #2045 added two methods of querying target features:

  • conditional compilation, e.g. #[cfg(target_feature = "avx")]
  • runtime detection, e.g. is_x86_feature_detected!("avx")

Conditional compilation only allows querying the base target features, and does not interact with #[target_feature]. Runtime detection is necessary for safely calling functions tagged with #[target_feature], but incurs a runtime overhead. In some cases, it is necessary to determine which target features are enabled at code generation (particularly after inlining) to allow various optimizations:

#[inline(always)]
fn conditional_compilation(...) {
    // This branch is always optimized out, and depends on the target features enabled at code generation.
    // If this were `cfg!(target_feature = "avx")`, this branch would never select the AVX version
    // (without enabling AVX for the entire binary).
    if is_x86_feature_enabled!("avx") {
        unsafe { avx_implementation(...) }
    } else {
        generic_implementation(...)
    }
}

#[inline(always)]
fn runtime_detection(...) {
    // If AVX is enabled at code generation, this branch is optimized out and runtime detection is skipped.
    if is_x86_feature_enabled!("avx") || is_x86_feature_detected!("avx") {
        unsafe { avx_implementation(...) }
    } else {
        generic_implementation(...)
    }
}

#[target_feature(enable = "avx")]
unsafe fn with_avx_enabled(...) {
    // This call selects the AVX implementation, because of this function's target features.
    conditional_compilation(...);

    // This call selects the AVX implementation, skipping runtime detection!
    runtime_detection(...);
}

A particularly useful case is nested runtime detection:

#[inline(always)]
fn first(...) {
    #[target_feature(enable = "avx")]
    #[inline]
    unsafe fn first_avx(...) {
        second(...)
    }
    
    #[inline(always)]
    fn first_generic(...) {
        second(...)
    }

    if is_x86_feature_detected!("avx") {
        unsafe { first_avx(...) }
    } else {
        first_generic(...)
    }
}

#[inline(always)]
fn second(...) {
    #[target_feature(enable = "avx")]
    #[inline]
    unsafe fn second_avx(...) {
        ...
    }
    
    #[inline(always)]
    fn second_generic(...) {
        ...
    }

    if is_x86_feature_enabled!("avx") || is_x86_feature_detected!("avx") {
        unsafe { second_avx(...) }
    } else {
        second_generic(...)
    }
}

After inlining, calling first only runs target feature detection once! The nested runtime detection is elided.

Guide-level explanation

The syntax of the is_{arch}_feature_enabled macros mirrors that of is_{arch}_feature_detected.

Instead of detecting if the target feature is supported at runtime, the macro returns whether the target feature is supported in the current context at compile time.

A target feature is supported in a particular context if at least one of the following is true:

  • The target supports the feature by default, or it is enabled with the -Ctarget-feature flag (i.e. cfg!(target_feature = "feature") is true)
  • The function containing the macro invocation is annotated with #[target_feature(enable = "feature")]
  • The function containing the macro invocation is inlined into a function annotated with #[target_feature(enable = "feature")]

The accuracy of this macro is "best-effort": depending on your particular code and Rust configuration, it may still return false even if the feature is enabled. For example:

  • Using #[inline(always)] may be more accurate than #[inline]
  • This feature depends on MIR inlining being enabled

Reference-level explanation

The macros generate calls to new compiler intrinsics. These intrinsics are lowered by each codegen backend to the appropriate true or false value depending on the target features of the parent function.

Codegen operates on optimized MIR, so the intrinsic is lowered after MIR inlining has occurred. When lowering the intrinsic, the backend can optionally account for additional inlining passes.
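The lowering described above can be summarized in pseudocode. This is only an illustrative sketch: FunctionCx, base_target_features, and attr_target_features are invented names, not the real rustc API.

```rust
// Pseudocode: all names here are invented for illustration.
fn lower_is_feature_enabled(parent_fn: &FunctionCx, feature: &str) -> bool {
    // The answer is fixed per compiled function: the base target features
    // (target default or -Ctarget-feature) plus the function's own
    // #[target_feature] attributes. Because MIR inlining has already run,
    // code inlined from callees inherits the parent function's answer.
    parent_fn.base_target_features().contains(feature)
        || parent_fn.attr_target_features().contains(feature)
}
```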

Drawbacks

Target features are already complicated, and adding this macro increases complexity.

Rationale and alternatives

Compared to runtime detection

In small functions, the runtime cost of feature detection and branching is unacceptable. Runtime detection is also inefficient and redundant when used in many functions that may call each other.

Compared to conditional compilation

RFC #2045 proposed making #[cfg(target_feature = "feature")] context-dependent, but this was never implemented. cfg is always consistent throughout a program, so making it context-dependent might be confusing and lead to mistakes. Additionally, it's not clear how a context-dependent #[cfg(target_feature = "feature")] on items (rather than blocks) should or could work. Adding a new is_{arch}_feature_enabled macro avoids this complexity.

Compared to a library

This kind of feature necessarily requires compiler support.

Some crates, such as glam or simd-json, opt to use cfg which simplifies the library API, but is pessimistic since many applications compile with default target features to maximize compatibility. Other libraries, such as rustfft or highway, leave the decision up to the user at the cost of a more verbose API, especially for users that are not aware of or interested in SIMD or target features.

Prior art

I've already mentioned RFC #2045 a couple times, which briefly discusses this problem, but the proposed solution was never implemented.

As far as I know, something like this has not been implemented in another compiler or language.

Unresolved questions

  • Is relying on MIR inlining sufficient, or does this need to interact with the LLVM inline pass?
  • How would the LLVM backend lower the intrinsic to account for LLVM's inline pass? A few possibilities:
    • Add a new intrinsic to LLVM
    • Use a custom optimization pass
  • Is the identifier too similar to is_{arch}_feature_detected?
    • The names are very similar (only 6 characters different) and may appear to be the same at a glance.
    • On the other hand, the two macros are very closely related, to the point of being interchangeable with just a compile time vs runtime tradeoff.

Future possibilities

The proposed macros are relatively simple, but there are a few avenues for future changes:

  • is_{arch}_feature_detected could make use of this macro for additional optimization opportunities.
  • Since inlining is necessarily late in the compilation process, it would be difficult to make the macro evaluate to a const value, and this RFC does not propose that. Future improvement could make it const, however.
  • There is additional opportunity for inlining after the MIR inliner, performed by LLVM. This functionality could be integrated into LLVM to improve accuracy in those cases.
  • New language features affecting target features (such as hypothetical #[target_feature] closure or block) could interact with these macros.

This isn't going to work simply as described, because in the context of the function without a #[target_feature], the feature isn't enabled yet. It's not sufficient for the rustc backend to lower the macro intrinsic to true/false; it'd need to lower to something the codegen backend (i.e. LLVM) is aware of, since that's where the optimization/inlining actually happens. Even then it's not quite that simple, since inlining happens bottom-up, so when the function is being optimized (the first time before it gets potentially inlined) it's without information about whether it's being called with the target feature enabled.

What I think you actually want is multiple function compilations (a la multiversion) but with which version the caller chooses selected by compile-time enabled detection instead of runtime available detection.


Thanks for the comment!

I specify that this is implemented as an intrinsic for that reason--this is something only the backend can lower. Initially, it would be possible to implement this without interacting with LLVM at all, since MIR inlining occurs before codegen. That alone would be sufficient for #[inline(always)], at least. (should the initial implementation only support #[inline(always)], to be consistent?) A more accurate implementation would indeed need to interact with LLVM. If accepted, this might result in a new LLVM intrinsic that implements this, but otherwise I think it would be possible with a custom pass after the inlining pass.

What I think you actually want is multiple function compilations (a la multiversion) but with which version the caller chooses selected by compile-time enabled detection instead of runtime available detection.

I am the author of multiversion 🙂 Some of this is inspired by the limitations I've run into there. multiversion actually already supports compile-time detection, but this only helps the multiversioned function. This proposal is complementary to multiversioning: you can design crates with small functions that inline into a larger multiversioned function. Imagine, for example, a linear algebra crate which depends on the contextual target features, inlined into an image processing crate with a multiversioned sharpening function.


As a comment, a comparison with multiversion would be a great section to add under Alternatives.


link to previous discussion: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/MIR.20inlining.20.2B.20target.20features

link to draft RFC in git:


Are there any use cases where I want compile-time detection of a feature being enabled, but not runtime detection?

I see the use of #[cfg(target_feature = "…")] as a way to not even attempt to compile a function where it requires a certain target feature. And I see the benefit of is_{arch}_feature_detected! to go down an optimized branch at runtime; in addition, it would be nice if is_{arch}_feature_detected! could be evaluated at compile time where it's known to always be true.

What I'm struggling to come up with is a use case for is_{arch}_feature_enabled! where you wouldn't want to write if is_{arch}_feature_enabled!("{feature}") || is_{arch}_feature_detected!("{feature}"); all the cases I can think of for this are as easily handled by #[cfg(target_feature = "{feature}")].

Would your use case be met by changing the definition of is_{arch}_feature_detected! to say that is_{arch}_feature_detected!("{feature}") is a compile-time true if #[cfg(target_feature = "{feature}")] would be enabled?

The two likely counterexamples would be a case where #[cfg(target_feature)] can't be used, but a compile-time macro can, or a case where target_features are set to enable a feature, but you still want to check at runtime for some reason that I'm not spotting.

Yes. A good example is implementing approximate mul-add on x86, where you want either a separate mul and add or a fused mul-add: you want the target features of the final function approx_mul_add is inlined into, so you can use the most efficient operation available, but you don't want is_x86_feature_detected! because it's way more expensive than a plain mul-add, defeating the entire purpose of selecting an implementation.

/// either `factor0 * factor1 + term` or `factor0.mul_add(factor1, term)`, whichever is likely faster
#[inline]
fn approx_mul_add(factor0: f32, factor1: f32, term: f32) -> f32 {
    if is_x86_feature_enabled!("fma") {
        factor0.mul_add(factor1, term) // a single instruction faster than separate mul and add
    } else {
        factor0 * factor1 + term // separate mul and add is faster than calling the library `fmaf` function
    }
}

But then why isn't that met by cfg!(target_feature="fma")? I see why you want to make a compile-time decision if at all possible; what I don't see is why we need two different ways to express the same constraint.

Naïvely, my expectation is that you'd use cfg!(target_feature="…") (or #[cfg(target_feature="…")]) for any case where you need the decision to be made at compile time, and the compiler will make sure that this always does the optimal thing. Similarly, you'd use is_x86_feature_detected! where a runtime decision is OK, and rely on the compiler removing the runtime check where it's always true. Any cases where is_x86_feature_enabled!(…) is different to cfg!(target_feature="…") feel like bugs - the compiler shouldn't have two different ideas of the current compile-time flags.


This is unfortunately a confusing aspect of target features--the simple answer is that cfg is not contextual. As I mentioned here, the original RFC for target features did in fact propose cfg to be contextual, but this is contrary to how cfg works elsewhere and this was never implemented. For example, how could conditional compilation (#[cfg]) possibly work after inlining? Adding special semantics into a few special cases of cfg is probably not a good idea, since it has a wide variety of parameters that are expected to be set uniformly across the program. It's unfortunate, but I think cfg's current behavior is unavoidable. The difference between the two should be well documented.


#[cfg(target_feature = "fma")] isn't sufficient by itself:

Let's say you have an approx_mul_add crate, a fast_sin crate that builds on it, and finally the user's code. In the following example, you need the post-inlining target features for approx_mul_add to use FMA when it is detected.

// in approx_mul_add crate:

/// either `factor0 * factor1 + term` or `factor0.mul_add(factor1, term)`, whichever is likely faster
#[inline]
pub fn approx_mul_add(factor0: f32, factor1: f32, term: f32) -> f32 {
    if is_x86_feature_enabled!("fma") {
        factor0.mul_add(factor1, term) // a single instruction faster than separate mul and add
    } else {
        factor0 * factor1 + term // separate mul and add is faster than calling the library `fmaf` function
    }
}

// in fast_sin crate:

#[inline]
fn eval_polynomial(x: f32, coeffs: &[f32]) -> f32 {
    let mut retval = 0.0;
    for coeff in coeffs.iter().rev() {
        retval = approx_mul_add(x, retval, *coeff);
    }
    retval
}

/// compute sin(π / 2 * v) with v.abs() <= 0.5
#[inline]
fn sin_poly(v: f32) -> f32 {
    v * eval_polynomial(
        v * v,
        &[
            1.570796326794897,
            -0.6459640975062462,
            0.07969262624616703,
            -0.004681754135318687,
        ],
    )
}

/// compute cos(π / 2 * v) with v.abs() <= 0.5
#[inline]
fn cos_poly(v: f32) -> f32 {
    eval_polynomial(
        v * v,
        &[
            1.0,
            -1.23370055013617,
            0.253669507901048,
            -0.02086348076335296,
        ],
    )
}

/// compute sin(2 * π * v)
#[inline]
pub fn fast_sin_2pi(mut v: f32) -> f32 {
    if !v.is_finite() {
        return f32::NAN;
    }
    if v.abs() >= (1 << 28) as f32 {
        return 0.0; // v is always an integer, hence sin == 0
    }
    v *= 4.0;
    let vi = v.round_ties_even() as i32;
    v -= vi as f32;
    match vi & 0x3 {
        0 => sin_poly(v),
        1 => cos_poly(v),
        2 => -sin_poly(v),
        _ => -cos_poly(v),
    }
}

// in main program:

#[inline]
fn algorithm(v: &mut [f32]) {
    for v in v {
        *v = fast_sin_2pi(*v);
    }
}

#[inline(never)]
#[target_feature(enable = "fma")]
unsafe fn with_fma(v: &mut [f32]) {
    algorithm(v)
}

#[inline(never)]
fn without_fma(v: &mut [f32]) {
    algorithm(v)
}

pub fn dispatch(v: &mut [f32]) {
    if is_x86_feature_detected!("fma") {
        unsafe { with_fma(v) };
    } else {
        without_fma(v);
    }
}

Thanks - that's the context I was missing. It leads to a second question: are there other cfg tests that would benefit from being context-dependent, or is this truly a special case?


I'm not sure if the reference is exhaustive, but I don't believe any of the others are context dependent. Things like architecture, endianness, and operating system aren't dependent on code generation (apart from maybe some obscure situations). Target features are uniquely runtime specified.

In which case, given that is_{arch}_feature_enabled! is basically only going to exist because cfg isn't context-aware, is it worth spending a bit of time working out what a context-aware cfg would look like? Or are there cases where you'd want cfg!(target_feature = "{feature}") to be false while is_{arch}_feature_enabled!("{feature}") is true?

My instinct is that a useful is_{arch}_feature_enabled! (a context-aware one; one that only returns true when cfg!(target_feature = "…") does is not useful) will be about as difficult to implement as teaching cfg! and #[cfg] to be context-aware. After all, the point of is_{arch}_feature_enabled! is conditional compilation (use the fast path when it's guaranteed available at compile time, the slow path when not, because a runtime test is too costly), and that's the very thing you gave as an example of what makes using cfg!(target_feature = "…") challenging.

#[cfg] has to happen during macro expansion as it can hide code that wouldn't compile and because you can have multiple disjoint #[cfg] on items with the same name. cfg!() could plausibly be made context dependent in some cases, but not in const contexts as in those cases it can affect typechecking which has to happen before we can figure out the context.

Would those cases also be problematic for is_{arch}_feature_enabled!, since that's basically a subset of cfg!()?

No, is_{arch}_feature_enabled!() would expand to a call of an intrinsic that would almost certainly not be marked as const fn and as such can't be used in const contexts, unlike cfg!().

