Suggestion for a low-effort way to take advantage of SIMD and other architecture-specific tricks LLVM knows about

Hi @pedorcr, here is the procedural macro I mentioned I have been working on. The macro itself is more or less working but the runtime library it calls still needs to be implemented. It works like this:

  • The original function is replaced by one that loads a static function pointer and calls that.
  • The function pointer initially points to a setup function that checks the hardware capabilities and replaces the function pointer with the optimal version of the function for subsequent calls.

Basically it turns

#[runtime_target_feature("+avx")]
pub fn sum(input: &[u32]) -> u32 {
    input.iter().sum()
}

into

pub fn sum(input: &[u32]) -> u32 {
    pub extern crate runtime_target_feature_rt as rt;

    static PTR: rt::atomic::Atomic<fn (&[u32]) -> u32> = rt::atomic::Atomic::new(setup);

    fn setup(input: &[u32]) -> u32 {
        let chosen_function = if rt::have_avx() {
            enable_avx
        } else {
            default
        };
        PTR.store(chosen_function, rt::atomic::Ordering::Relaxed);
        chosen_function(input)
    }

    fn default(input: &[u32]) -> u32 {
        input.iter().sum()
    }

    #[target_feature = "+avx"]
    fn enable_avx(input: &[u32]) -> u32 {
        input.iter().sum()
    }

    PTR.load(rt::atomic::Ordering::Relaxed)(input)
}
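
Since the rt crate is not written yet, here is a rough sketch of the same dispatch pattern using only the standard library, in case it helps to see it compile. Everything in it is an assumption on my part rather than what the macro emits: the names `SumFn`, `select`, `sum_default`, `sum_avx` and `sum_avx_checked` are made up for illustration, the pointer is stored type-erased in a `std::sync::atomic::AtomicPtr<()>` instead of `rt::atomic::Atomic`, detection uses `is_x86_feature_detected!` in place of `rt::have_avx`, and the attribute is spelled `#[target_feature(enable = "avx")]` on an `unsafe fn`.

```rust
use std::mem;
use std::sync::atomic::{AtomicPtr, Ordering};

// Signature shared by every candidate implementation (hypothetical alias).
type SumFn = fn(&[u32]) -> u32;

// Type-erased function pointer; it starts out pointing at `setup`.
static PTR: AtomicPtr<()> = AtomicPtr::new(setup as SumFn as *mut ());

// First call lands here: pick an implementation, cache it, then run it.
fn setup(input: &[u32]) -> u32 {
    let chosen: SumFn = select();
    PTR.store(chosen as *mut (), Ordering::Relaxed);
    chosen(input)
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn select() -> SumFn {
    if is_x86_feature_detected!("avx") {
        sum_avx_checked
    } else {
        sum_default
    }
}

// On other architectures there is nothing to detect in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn select() -> SumFn {
    sum_default
}

// Baseline version, compiled with the crate's default target features.
fn sum_default(input: &[u32]) -> u32 {
    input.iter().sum()
}

// Same body, but the compiler may use AVX instructions here.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(input: &[u32]) -> u32 {
    input.iter().sum()
}

// Safe wrapper so the AVX version fits the plain `SumFn` signature.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sum_avx_checked(input: &[u32]) -> u32 {
    // SAFETY: `select` only returns this function after AVX was detected.
    unsafe { sum_avx(input) }
}

pub fn sum(input: &[u32]) -> u32 {
    let raw = PTR.load(Ordering::Relaxed);
    // SAFETY: `PTR` only ever holds pointers that were cast from a `SumFn`.
    let f: SumFn = unsafe { mem::transmute::<*mut (), SumFn>(raw) };
    f(input)
}
```

The first call to `sum` goes through `setup`; later calls load the cached pointer and jump straight to the selected version. Relaxed ordering should be enough here because every value the pointer can hold is a valid function, so a thread that races with the store just runs `setup` once more.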

Your #[makefast] attribute could just be a wrapper around runtime_target_feature with some appropriately hardcoded features.

Links (note that this is the first proper Rust I’ve written, so any code review is welcome): the procedural macro attribute source, the rt crate, and the test crate.

The GNU ifunc you mention does have some advantages, but it is quite fragile and not very portable, so I wouldn’t recommend implementing that. See https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
