Getting explicit SIMD on stable Rust

Here is how I was thinking of solving this. Your example would fail to compile:

let answer = if avx_enabled() {
    _my_avx_intrinsic(arg)
    //^^^^ Error: tried to use AVX intrinsic but none in scope.
} else {
    fallback(arg)
};

but the following example would succeed:

let answer = if avx_enabled() {
    // User opts into target features explicitly:
    #[use_target_feature(SSE4, AVX)] {
        _my_avx_intrinsic(arg)
    }
} else {
    fallback(arg)
};

and the following would fail as well:

let answer = if avx_enabled() {
    #[use_target_feature(SSE4)] {
        _my_avx_intrinsic(arg)
        //^^^^ Error: tried to use AVX intrinsic but none in scope.
    }
} else {
    fallback(arg)
};
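
For comparison, here is a minimal sketch of the same opt-in shape using what `std::arch` provides, rather than the strawman `#[use_target_feature]` block. The feature is enabled per function with `#[target_feature(enable = "avx")]`, and the run-time check uses `is_x86_feature_detected!`. The names `add_avx`, `add_fallback` and `add` are made up for illustration, and the granularity is a whole function, not a block:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Portable version: no target features are assumed here.
fn add_fallback(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

// AVX is enabled for this function only. It is `unsafe` to call because
// the caller must guarantee that the CPU really supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb));
    }
    out
}

// Plays the role of `if avx_enabled() { ... } else { fallback(arg) }`.
pub fn add(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just checked at run time.
            return unsafe { add_avx(a, b) };
        }
    }
    add_fallback(a, b)
}
```

The `unsafe` call is where the user takes responsibility for the run-time check, which is roughly the job the explicit opt-in block does in the examples above.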

Libraries like liboil and OpenMP typically use something like the following pattern.

They have a static function pointer that is initialized to the implementation to be used. We can have a macro for conditional compilation on incompatible architectures. I just called it target_architecture, but that is a strawman:

// Conditional compilation for x86
#[cfg(target_architecture(x86))] {

// Detect the features at run-time and initialize a static function pointer
// with the appropriate algorithm implementation:
lazy_static! {
    static ref SOME_ALGORITHM_IMPL: fn(...) -> ... =
        if avx_enabled() {
            some_algorithm_avx_impl
        } else if sse42_enabled() {
            some_algorithm_sse42_impl
        } else {
            some_algorithm_fallback_impl
        };
}

}

Note how this code doesn't have any target_feature flags: it is not doing anything "feature" specific, it is just setting a function pointer.

In the same way, we can add the code for ARM:

// conditional compilation for ARM
#[cfg(target_architecture(ARM))] {

lazy_static! {
    static ref SOME_ALGORITHM_IMPL: fn(...) -> ... =
        if neon_enabled() {
            some_algorithm_neon_impl
        } else {
            some_algorithm_fallback_impl
        };
}

}

and the code for other architectures:

// conditional compilation for neither x86 nor ARM
#[cfg(not(any(target_architecture(x86), target_architecture(ARM))))] {

// no need to use lazy static here:
static SOME_ALGORITHM_IMPL:  fn(...) -> ... = some_algorithm_fallback_impl;

}

Now we implement the algorithm for all architectures. It just forwards to the function pointer:

// The algorithm just uses the function pointer
fn some_algorithm(args...) -> ... {
  SOME_ALGORITHM_IMPL(args...)
}
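
To make the function-pointer pattern concrete, here is a rough sketch of it in Rust as it exists, not in the strawman syntax. It assumes `std::sync::OnceLock` in place of `lazy_static!`, `cfg(target_arch = ...)` in place of `target_architecture`, and `is_x86_feature_detected!` in place of `avx_enabled()`. The names `SumFn`, `SUM_IMPL`, `select_impl` and the `sum_*` functions are invented for the example:

```rust
use std::sync::OnceLock;

// Type of the function pointer that the dispatcher stores.
type SumFn = fn(&[u32]) -> u32;

// Fallback, generated for all architectures.
fn sum_fallback(xs: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}

// Same source, but the compiler may use AVX2 when generating this function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2_inner(xs: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}

// Safe wrapper so the implementation fits the plain `fn` pointer type.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sum_avx2(xs: &[u32]) -> u32 {
    // SAFETY: only installed by `select_impl` after AVX2 was detected.
    unsafe { sum_avx2_inner(xs) }
}

// Run-time detection; no target features are needed to pick a pointer.
fn select_impl() -> SumFn {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return sum_avx2;
        }
    }
    sum_fallback
}

// The static function pointer, initialized on first use.
static SUM_IMPL: OnceLock<SumFn> = OnceLock::new();

// The algorithm just forwards to the function pointer.
pub fn sum(xs: &[u32]) -> u32 {
    let f = *SUM_IMPL.get_or_init(select_impl);
    f(xs)
}
```

Detection runs once, on the first call to `sum`. Every later call is a single indirect call through the stored pointer.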

And now we use the target_feature attributes combined with the target_architecture conditions to generate the code for the different implementations:

// For X86
#[cfg(target_architecture(x86))] { 

// Different implementations of the functions are generated by the compiler

#[target_feature(AVX)]
fn some_algorithm_avx_impl(args...) -> ... {
  // Might use AVX features (and probably SSE42, since AVX is a strict superset)
}

#[target_feature(SSE42)]
fn some_algorithm_sse42_impl(args...) -> ... {
  // Might use SSE42 features, cannot use AVX features (compiler error)
}
} 

// For ARM
#[cfg(target_architecture(ARM))] { 

#[target_feature(NEON)]
fn some_algorithm_neon_impl(args...) -> ... { }

}

// The fallback is generated for all architectures
fn some_algorithm_fallback_impl(args...) -> ... {
  // Compiler should error if user tries to use any target features here
}

Note how one must use #[target_feature(...)] on the functions to enable the features for the whole function. That should be just sugar for:

fn name(...) -> ... {
    #[target_feature(...)] {
        // body
    }
}

This should work very similarly to the current way in which code is conditionally included depending on enabled target features:

// This works in Rust today (in nightly)
pub fn pext<T: IntF32T64>(x: T, mask_: T) -> T {
    if cfg!(target_feature = "bmi2") {  // compile-time condition
        unsafe { intrinsics::pext(x, mask_) }
    } else {
        alg::bmi2::pext(x, mask_)
    }
}
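
A self-contained sketch of the same compile-time selection, assuming the BMI2 intrinsic `_pext_u32` from `std::arch::x86_64` and a hand-written software fallback. `pext_fallback` is a hypothetical stand-in for `alg::bmi2::pext`, and the example is fixed to `u32` instead of being generic:

```rust
// Software parallel bit extract: gathers the bits of `x` selected by
// `mask` into the low bits of the result, preserving their order.
fn pext_fallback(x: u32, mut mask: u32) -> u32 {
    let mut result = 0u32;
    let mut out_bit = 0u32;
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if x & lowest != 0 {
            result |= 1u32 << out_bit;
        }
        out_bit += 1;
        // Clear that bit and continue with the next one.
        mask &= mask - 1;
    }
    result
}

// Compiled only when BMI2 is enabled for the whole binary,
// e.g. via `-C target-feature=+bmi2`.
#[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
pub fn pext(x: u32, mask: u32) -> u32 {
    // SAFETY: BMI2 is statically enabled for this build.
    unsafe { std::arch::x86_64::_pext_u32(x, mask) }
}

// Compiled in every other configuration.
#[cfg(not(all(target_arch = "x86_64", target_feature = "bmi2")))]
pub fn pext(x: u32, mask: u32) -> u32 {
    pext_fallback(x, mask)
}
```

This sketch uses `#[cfg(...)]` on two definitions instead of `cfg!(...)` inside one body. With `cfg!` both branches still have to compile, and `_pext_u32` does not exist on non-x86 targets.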

I said before that inside the feature blocks the compiler should not use unsupported features even if the binary target is set to use them, but I think that does not make sense. The compiler will use those features everywhere else, so the binary cannot run on targets that don't support them anyway.