I opened issue #42432 but was directed to open a topic here instead. Issue #27731 already tracks the fine work being done to expose SIMD in ways that are explicit to the programmer. If you're able to code in those specific ways, big gains can be obtained. However, there is something simpler that can be done to performance-sensitive code that sometimes greatly improves its speed: just tell LLVM to take advantage of those instructions. The speedup from that is free in developer time and can be quite large. I extracted a simple benchmark from one of the computationally expensive functions in rawloader, matrix multiplying camera RGB values to get XYZ:
I programmed the same multiplication over a 100MP image in both C and Rust. Here are the results. All values are in ms/megapixel, run on an i5-6200U. The runbench script in the repository will compile and run the tests for you with no other interaction.
So Rust nightly is faster than clang (but that's probably LLVM 3.8 vs 4.0) and the reduction in runtime is quite worthwhile. The problem with doing this, of course, is that the binary is now not portable to architectures below mine, and it's not optimized for architectures above it either.
My suggestion is to allow the developer to do something like #[makefast] fn(...). Anything annotated like that gets compiled multiple times, once for each architecture level, and then at runtime the highest level supported by the machine gets used. The GNU toolchain already includes a way to make the runtime dispatch penalty disappear as well:
Using simd is definitely worth it; it helps speed and the code looks good. Even then, it's still quite worthwhile to be able to compile for the most extensive feature set possible to get some nice speedups. So even after simd is stabilized it would be great to be able to automatically compile a multi-feature binary that decides what to use at runtime.
Hi @pedorcr, here is the procedural macro I mentioned I have been working on. The macro itself is more or less working but the runtime library it calls still needs to be implemented. It works like this:
The original function is replaced by one that loads a static function pointer and calls that.
The function pointer initially points to a setup function that checks the hardware capabilities and replaces the function pointer with the optimal version of the function for subsequent calls.
@parched looks cool. How do you compile this then? Do you just force compiling with AVX? Could lazy_static! be a good way to replace ifunc in a portable way?
@parched and #[runtime_target_feature("+avx")] already makes it so avx is used? That’s pretty much the whole thing. I thought this was going to require compiler changes.
Yes, it uses AVX if the hardware it is running on supports it; otherwise it just uses the default version. No compiler changes needed, just 3 feature gates need to stabilize before it can be used on stable.
I’ve followed your example code and added the crate directly from your git master.
It would also make sense to have several sets of features (e.g., sse3, avx, avx2, avx512) for the same function. It might even make sense to have a convenience attribute like #[make-simd-fast] that uses that pre-defined set.
Great! Note that there will be a slowdown on the first call to the function once the hardware-checking functions are implemented, though how big I'm not sure.
Because currently procedural macros have to be in a crate by themselves, which I agree is a bit annoying.
I don’t think so, at least not in a portable way; this works quite similarly to lazy_static anyway. I’ll implement the hardware feature check tomorrow and you can test your benchmarks again then.
Thanks, will check back then. What I meant was to just replace the CPU functions with a set of lazy_static! global constants and then use those in whichever calls need them. That way the features get detected at startup instead of on the first call. It will probably not make much of a difference in terms of performance. I’ll have to redo the benchmark to detect it anyway, as I just added this to main() and am timing only part of the function.
(totally made-up syntax there, but I think you get the idea) I looked up how we do this in Gecko C++ and it’s not very pretty. Having a nice Rust solution would be awesome!