Suggestion for a low-effort way to take advantage of SIMD and other architecture specific tricks LLVM knows about

I opened issue #42432 but was directed to open a topic here instead. Issue #27731 already tracks the fine work being done to expose SIMD in ways that are explicit to the programmer. If you're able to code in those specific ways, big gains can be obtained. However, there is something simple that can be done in the meantime for performance-sensitive code that sometimes greatly improves its speed: just tell LLVM to take advantage of those instructions. The speedup from that is free in developer time and can be quite large. I extracted a simple benchmark from one of the computationally expensive functions in rawloader: matrix-multiplying camera RGB values to get XYZ.

I programmed the same multiplication over a 100MP image in both C and Rust. Here are the results. All values are in ms/megapixel, run on an i5-6200U. The runbench script in the repository will compile and run the tests for you with no other interaction.

| Compiler | -O3 | -O3 -march=native |
| --- | --- | --- |
| rustc 1.19.0-nightly (e0cc22b4b 2017-05-31) | 11.76 | 6.92 (-41%) |
| clang 3.8.0-2ubuntu4 | 13.31 | 5.69 (-57%) |
| gcc 5.4.0 20160609 | 7.77 | 4.70 (-40%) |

So Rust nightly is faster than clang (though that's probably LLVM 3.8 vs 4.0), and the reduction in runtime is quite worthwhile. The problem with doing this, of course, is that the binary is no longer portable to architectures below mine, and it isn't optimized for architectures above it either.
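For reference, the kernel being benchmarked is easy to sketch. Here is a minimal, illustrative version of a camera-RGB to XYZ conversion; the matrix values are the standard sRGB-to-XYZ coefficients rather than the ones from rawloader, and all names are made up for this sketch:

```rust
// Hypothetical sketch of the benchmarked kernel: a 3x3 color-matrix
// multiply applied to every pixel. The matrix values are the standard
// sRGB-to-XYZ coefficients, used here only as an illustration.
const CAM_TO_XYZ: [[f32; 3]; 3] = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
];

fn camera_to_xyz(pixels: &mut [f32]) {
    // Process pixels in RGB triples; this tight loop is exactly the
    // kind of code LLVM can auto-vectorize when wider instruction
    // sets (SSE/AVX) are enabled via -C target-cpu / target-feature.
    for px in pixels.chunks_exact_mut(3) {
        let (r, g, b) = (px[0], px[1], px[2]);
        for (out, row) in px.iter_mut().zip(CAM_TO_XYZ.iter()) {
            *out = row[0] * r + row[1] * g + row[2] * b;
        }
    }
}

fn main() {
    let mut pixels = vec![1.0f32; 6]; // two all-ones pixels
    camera_to_xyz(&mut pixels);
    println!("{:?}", &pixels[..3]);
}
```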

My suggestion is to allow the developer to write something like #[makefast] fn(...). Anything annotated like that gets compiled once for each architecture level, and at runtime the highest level the machine supports gets used. At least the GNU toolchain already includes a way to make the runtime dispatch penalty disappear as well:

http://www.agner.org/optimize/blog/read.php?i=167


I’ve had a look at how to use the simd crate in the benchmark. Here are the results of using the simd vector types:

| Compilation | normal code | simd code |
| --- | --- | --- |
| -O3 | 16.25 | 14.70 (-10%) |
| -O3 -C target-cpu=native | 7.06 | 6.31 (-11%) |

(commit with the code: https://github.com/pedrocr/rustc-math-bench/commit/f81e57c8794ba91b8cd870c2c434f35d413025b1)

Using simd is definitely worth it: it helps speed and the code looks good. Even then, it's quite worthwhile to be able to compile for the widest possible feature set to get some nice speedups. So even after simd is stabilized, it would be great to be able to automatically compile a multi-feature binary that decides what to use at runtime.
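To show the shape of the vectorized version without depending on the nightly-only simd crate, here is a rough stand-in type with the same four-lane layout as its f32x4; everything here is illustrative, not the code from the linked commit:

```rust
// Hypothetical stand-in for simd::f32x4: operating on four lanes at a
// time, in the layout the compiler can map onto SSE/AVX registers.
#[derive(Clone, Copy, Debug)]
struct F32x4([f32; 4]);

impl F32x4 {
    fn splat(v: f32) -> Self {
        F32x4([v; 4])
    }

    // Elementwise self * m + acc, the core operation of the
    // matrix-multiply kernel when processing four pixels at once.
    fn mul_add(self, m: f32, acc: F32x4) -> F32x4 {
        let mut out = acc.0;
        for (o, a) in out.iter_mut().zip(self.0.iter()) {
            *o += a * m;
        }
        F32x4(out)
    }
}

fn main() {
    // One matrix coefficient applied to four red channels at once.
    let reds = F32x4([1.0, 0.5, 0.25, 0.0]);
    let x = reds.mul_add(0.4124, F32x4::splat(0.0));
    println!("{:?}", x);
}
```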


Hi @pedrocr, here is the procedural macro I mentioned I have been working on. The macro itself is more or less working, but the runtime library it calls still needs to be implemented. It works like this:

  • The original function is replaced by one that loads a static function pointer and calls that.
  • The function pointer initially points to a setup function that checks the hardware capabilities and replaces the function pointer with the optimal version of the function for subsequent calls.

Basically it turns

#[runtime_target_feature("+avx")]
pub fn sum(input: &[u32]) -> u32 {
    input.iter().sum()
}

to

pub fn sum(input: &[u32]) -> u32 {
    pub extern crate runtime_target_feature_rt as rt;

    static PTR: rt::atomic::Atomic<fn(&[u32]) -> u32> = rt::atomic::Atomic::new(setup);

    fn setup(input: &[u32]) -> u32 {
        let chosen_function = if rt::have_avx() {
            enable_avx
        } else {
            default
        };
        PTR.store(chosen_function, rt::atomic::Ordering::Relaxed);
        chosen_function(input)
    }

    fn default(input: &[u32]) -> u32 {
        input.iter().sum()
    }

    #[target_feature = "+avx"]
    fn enable_avx(input: &[u32]) -> u32 {
        input.iter().sum()
    }

    PTR.load(rt::atomic::Ordering::Relaxed)(input)
}

Your #[makefast] attribute could just be a wrapper around runtime_target_feature with some appropriately hardcoded features.

Links (note: this is the first proper Rust I've written, so any code review is welcome): the procedural macro attribute source, the rt crate, and the test crate.

The GNU ifunc you mention does have some advantages, but it's quite fragile and not very portable, so I wouldn't recommend implementing that. https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html


@parched looks cool. How do you compile this then? Do you just force compiling with avx? Could lazy_static! be a good way to replace ifunc in a portable way?

You just compile as normal, but you add the attribute to your function with the features you want to enable, like https://github.com/parched/runtime-target-feature-rs/blob/master/tests/src/lib.rs

Note you also need nightly and add

#![feature(proc_macro)]
#![feature(target_feature)]
#![feature(const_fn)]

as well as adding the runtime library (runtime-target-feature-rt) as a dependency.
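For those following along, the dependency section might look something like the following; the crate names and git URLs here are guesses pieced together from the links in this thread, so check the linked test crate's Cargo.toml for the authoritative version:

```toml
[dependencies]
# Procedural macro crate (provides #[runtime_target_feature(...)])
runtime-target-feature = { git = "https://github.com/parched/runtime-target-feature-rs" }
# Runtime support crate the generated code calls into
runtime-target-feature-rt = { git = "https://github.com/parched/runtime-target-feature-rt-rs" }
```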

@parched and #[runtime_target_feature("+avx")] already makes it so avx is used? That’s pretty much the whole thing. I thought this was going to require compiler changes.

Yes, it uses avx if the hardware it is running on supports it, otherwise it just uses the default version. No compiler changes needed; just three feature gates need to stabilize before it can be used on stable.

This looks like pretty much the way to go then. I’m trying to benchmark but getting the following error compiling:

error[E0463]: can't find crate for `rt`
  --> src/main.rs:17:1
   |
17 | #[runtime_target_feature("+avx")]
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't find crate

I’ve followed your example code and added the crate directly from your git master.

It would also make sense to have several sets of features (e.g., sse3, avx, avx2, avx512) for the same function. It might even make sense to have a convenience attribute like #[make-simd-fast] that uses a pre-defined set.

You need to add the rt crate too as a dependency, see https://github.com/parched/runtime-target-feature-test-rs/blob/master/Cargo.toml (note the path will need replacing).

It should support more than one feature, e.g.:

 #[runtime_target_feature("+avx,+avx2,+sse3")]

Note though I haven't implemented the hardware checking functions yet (see https://github.com/parched/runtime-target-feature-rt-rs/blob/master/src/x86.rs), so it will always think it can use them.

@parched haha, that works great! With #[runtime_target_feature("+avx")] I get the same performance as compiling with target-cpu=native. Very cool.

Why all the different repos? Why not a single one for this feature?

Great! Note, there will be a slowdown on the first call to the function once the hardware checking functions are implemented, though how big I'm not sure.

Because currently procedural macros have to be in a crate by themselves, which I agree is a bit annoying.

Wouldn’t it be cleaner to just do the checks once on startup (with lazy_static! or similar) instead of on first call?

I don't think so; this works quite similarly to lazy_static anyway, and doing it at startup couldn't be done in a portable way. I'll implement the hardware feature check tomorrow and you can test your benchmarks again then.

Thanks, will check back then. What I meant was to replace the cpu-check functions with a set of lazy_static! global constants and then use those in whichever calls need them. That way the features get detected at startup instead of on first call. It will probably not make much of a difference in terms of performance. I'll have to redo the benchmark to detect it anyway, as I just added this to main() and am timing only part of the function.
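The startup-detection variant being discussed can be sketched with just the standard library; here an AtomicBool stands in for lazy_static!, the detection itself is stubbed out, and all names are made up for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Set once at program startup instead of on the first call.
static HAS_AVX: AtomicBool = AtomicBool::new(false);

fn detect_features() {
    // Stubbed: a real implementation would issue CPUID here (modern
    // Rust can use std's is_x86_feature_detected!("avx") macro).
    // We conservatively store false so the default path is taken.
    HAS_AVX.store(false, Ordering::Relaxed);
}

fn sum(input: &[u32]) -> u32 {
    // Branch on the pre-detected flag rather than swapping a
    // function pointer on first call.
    if HAS_AVX.load(Ordering::Relaxed) {
        sum_avx(input)
    } else {
        sum_default(input)
    }
}

fn sum_default(input: &[u32]) -> u32 {
    input.iter().sum()
}

// In the real macro this version would carry #[target_feature = "+avx"].
fn sum_avx(input: &[u32]) -> u32 {
    input.iter().sum()
}

fn main() {
    detect_features();
    println!("{}", sum(&[1, 2, 3, 4]));
}
```

The trade-off is the one mentioned above: the branch cost moves from the first call to every call, though a predictable branch on a cached flag is usually cheap.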

Ok, I’ve implemented the x86 target feature checks so feel free to test again.

This is very cool! As a followup, it would be awesome to be able to explicitly list separate implementations somehow, like:

runtime_alternates! {
    #[runtime_target_feature("+avx")]
    pub fn sum(input: &[u32]) -> u32 {
        // something with AVX intrinsics
    }

    pub fn sum(input: &[u32]) -> u32 {
        input.iter().sum()
    }
}

(totally made-up syntax there, but I think you get the idea) I looked up how we do this in Gecko C++ and it’s not very pretty. Having a nice Rust solution would be awesome!


Hi @luser, how I imagined doing such a thing was with cfg_target_feature, but unfortunately it's broken, see https://github.com/rust-lang/rust/issues/42515.

It's not broken. The cfg(target_feature) attribute is only set when you pass the appropriate command-line flags to rustc.
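That behaviour is easy to see in a few lines; this tiny illustration (names made up) shows the branch is fixed at compile time by the rustc flags, not by the hardware the binary runs on:

```rust
fn compiled_with_avx() -> bool {
    // cfg! is evaluated at compile time: without
    // `-C target-feature=+avx` (or target-cpu=native on an AVX
    // machine) this is false even when the running CPU has AVX.
    cfg!(target_feature = "avx")
}

fn main() {
    println!("compiled with avx: {}", compiled_with_avx());
}
```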

Broken is probably the wrong word; rather, it behaves unintuitively in my mind.


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.