Function Multi-Versioning for Rust?


#1

This is one of the things I wish Rust had, because it would make code possibly more performant:

FMV (as introduced in GCC 6) is a technique for generating vectorized code for newer targets alongside fallback versions for machines that don’t support a given instruction set. Producing the fast versions is already possible with the compiler today: if I compile something with -C target-feature=+avx2, I get very good assembly. If I compile the same code with -C target-feature=+sse2 or with the default flags (no features specified), I get worse assembly. That’s simply because LLVM can see that it has guaranteed access to those instructions, so it uses them.

The idea of FMV is to compile the same function multiple times (for whatever architectures you wish) - e.g. once with no SIMD, once for SSE2+, once for AVX2+. At the callsites of these functions, a small runtime check is added (probably via cpuid), or a global variable is set when the program starts that records which instruction sets are available. If the CPU supports AVX2, the AVX2 code branch is chosen. Otherwise, there are fallbacks down to the no-SIMD version.

Obviously, duplicating functions leads to code bloat, so it should be used sparingly. I know that Rust’s SIMD story isn’t stabilized yet, but I wanted to ask whether something like this would be possible in the Rust compiler and whether anyone has thought about it.

Something like this:

#[fmv(auto("+sse2", "+avx2"))]
fn auto_vectorized() {
    /* very performance critical code*/ 
}

#[fmv(manual("+sse2"))]
fn manually_vectorized() { }

This can’t be done using libraries; it would have to be built into the compiler. I think LLVM has limited support for FMV, but I don’t know for sure.
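For illustration, the dispatch half of this can be written by hand on current Rust (an x86-only sketch; the function names are invented, and both variants share one body here, whereas real FMV would compile each variant with different target features):

```rust
use std::sync::OnceLock;

fn sum_avx2(xs: &[f32]) -> f32 {
    // stand-in: FMV would emit this translation unit with +avx2
    xs.iter().sum()
}

fn sum_fallback(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum(xs: &[f32]) -> f32 {
    // the "global variable set at startup" approach: the cpuid check
    // runs once, and the chosen function pointer is cached afterwards
    static IMPL: OnceLock<fn(&[f32]) -> f32> = OnceLock::new();
    let f = IMPL.get_or_init(|| {
        let chosen: fn(&[f32]) -> f32 = if is_x86_feature_detected!("avx2") {
            sum_avx2
        } else {
            sum_fallback
        };
        chosen
    });
    f(xs)
}
```

What the compiler would have to add on top of this hand-written dispatcher is the automatic generation of the per-feature variants.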

Another thing I wanted to ask: would it be possible to auto-vectorize iterators? For example, you could iterate through elements in blocks of 4 or 8 with SIMD, instead of one at a time (the default iterator). Not sure if this is already done.
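For what it’s worth, blocked iteration can be approximated today with chunks_exact_mut, which often gives LLVM’s autovectorizer a loop shape it handles well (a sketch, not a built-in feature):

```rust
fn scale(xs: &mut [f32], k: f32) {
    let mut chunks = xs.chunks_exact_mut(8);
    for block in &mut chunks {
        // fixed trip count of 8: an easy target for autovectorization
        for x in block {
            *x *= k;
        }
    }
    // leftover elements (len % 8) are processed scalarly
    for x in chunks.into_remainder() {
        *x *= k;
    }
}
```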


#2

You can do this on nightly Rust. See https://docs.rs/stdsimd


#3

Also see https://github.com/parched/runtime-target-feature-rs, it’s a procedural macro almost exactly as you describe.


#4

runtime-target-feature is definitely a step in the right direction, however:

  • stdsimd and runtime-target-feature-rs do not automatically generate multiple versions of your function. Neither does GCC. However: if I compile the same function (without changing its contents, no manual SIMD), I get shorter assembly. E.g. by default I get movss instructions; with -C target-feature=+avx I get vmovss instructions, which are shorter. That is without using any stdsimd features - just the regular compiler.

So what I’d want is this:

  • I have a function, regular rust function, doesn’t use SIMD.
  • The compiler compiles the function essentially multiple times - once with -C target-feature=+avx, once with -C target-feature=+sse2, and once as a fallback version. This happens automatically (controlled via an attribute), without copying any code.
  • At the callsites of these functions, a small runtime check is implemented that selects the version of the function that can run.

This is why I thought of the “manual” and “automatic” way. Currently you could do this with runtime-target-feature-rs, but you’d have to implement the function twice. That is ok, that is the “manual” way to write SIMD. What I’d be asking for is the “automatic” way.
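A rough middle ground is possible by hand on current Rust: write the body once, create a second instantiation with #[target_feature] so LLVM may use AVX2 inside it, and dispatch at runtime. Only the thin wrapper is duplicated (an x86-only sketch; the names are invented):

```rust
#[inline(always)]
fn dot_body(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    // dot_body is inlined here, so it is codegen'd with +avx2
    dot_body(a, b)
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") {
        // safe: we just verified the CPU supports AVX2
        unsafe { dot_avx2(a, b) }
    } else {
        dot_body(a, b)
    }
}
```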


#5

This is exactly what the runtime_target_feature attribute does 🙂. Further, it only does the check the first time you call the function.


#6

I think a much better solution is to compile the whole binary for each of several targets, since that actually works with inlining, autovectorization, general purpose instructions and the different instruction scheduling needs of different CPU targets, while “function multi-versioning” is pretty much a hack that only works in very narrow cases.

Maybe there could be some support in Cargo for doing that?

That is:

  1. Ability to specify the desired set of targets to compile
  2. Ability to store multiple targets in rlibs
  3. Ability to link together multiple target binary compilations with a main() doing the target detection and dispatching
  4. Ability to link together multiple target dynamic library compilations with target dispatching done on each exported symbol (obviously the detection must be cached or even better done at library initialization if possible)
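Point 3 could look roughly like this hypothetical launcher, which detects CPU features once and hands off to the matching full build (binary names and feature levels invented for illustration; x86-only sketch):

```rust
use std::process::Command;

// pick the best feature level this CPU supports, checked once at startup
fn best_suffix() -> &'static str {
    if is_x86_feature_detected!("avx2") {
        "avx2"
    } else if is_x86_feature_detected!("sse4.2") {
        "sse42"
    } else {
        "baseline"
    }
}

// run the variant built for this CPU, forwarding our arguments;
// e.g. launch("./myapp") would exec "./myapp-avx2" on an AVX2 machine
fn launch(prefix: &str) -> std::io::Result<std::process::ExitStatus> {
    Command::new(format!("{}-{}", prefix, best_suffix()))
        .args(std::env::args_os().skip(1))
        .status()
}
```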

#7

Why is it a hack? Why do you think it has only very narrow use cases? I certainly don’t agree with that at all. Being able to ship a single binary is monumentally convenient. Certainly, std will want to do it, which is where the vendor intrinsics will be defined. More to the point, Rust isn’t unique here, because that’s how gcc and Clang do it as well. So I think calling that a “hack” is a bit much. If it’s a hack, it’s one that everyone uses and we shouldn’t be ashamed of doing the same thing.

I for one look forward to being able to ship a single binary without getting bug reports like this: https://github.com/BurntSushi/ripgrep/issues/135 — Compiling a separate binary for every possible supported combination of targets is a maintenance hazard and a UX nightmare. e.g., “I’m a user looking to download a release on Github. I know I’m on Linux, but there are a bazillion choices. Which one do I pick?”


#8
  1. You need to apply it manually all over the code
  2. It only works on the code directly in the functions you marked that way
  3. If using IFUNC or another dynamic linker-based scheme, the functions cannot be inlined
  4. Libraries either incur the dispatching (or no-inlining) cost on each function, or have to export _sse, _avx, _avx2 versions designed to be called by callers that are multi-versioned
  5. Does not work for autovectorization in the whole program (since you don’t know which code is going to be autovectorized)
  6. Does not let the whole program automatically take advantage of new instructions that the compiler can generate in normal code (same issue as autovectorization)
  7. Does not allow scheduling the code in the whole program differently depending on the most popular CPUs at each “level”
  8. Does not allow optimizing the data layout for each SIMD instruction set (i.e. using the minimal alignment for the vector length)

In other words, it’s a poor solution except for the specific case of making small manually chosen pieces of vectorizable code run fast at the cost of littering it with annotations.

On the other hand, simply compiling the binary several times for each instruction set level solves all these issues and is a relatively simple mechanism that just works. Obviously both binary size and release compilation times get multiplied by 4-8 times for popular architectures, but that’s not a big issue usually.

You can ship a single binary even with the multiple compilation solution, by just linking them together and adding a dispatcher on main() and on each exported symbol.

All it needs is Cargo support to do the separate compilations and to link them together.

It’s also possible to have a file for each binary and some sort of selection system, but that requires OS support at least for dynamic libraries which might not be available on all OSes.


#9

None of the things you listed are a problem in my domain (text search).

I think you grossly underestimate the release problems.

I would be more interested to hear why gcc/Clang have chosen to use things like target_feature despite all of the issues you listed. It suggests that there is more to the story that isn’t being told here.


#10

Why can’t that be a fat binary with selection at load time, thus having only one alternative in RAM?


#11

It’s a choice between making a few, specifically-selected functions “fat” and loading only one of each, or making the entire program “fat.” The first option still gives you the vast majority of the perf wins without doubling or tripling the binary size.


#12

I get that, but as long as the executable is better optimized and smaller in RAM, I’m not sure a bigger disk footprint is that important. Disk space is cheap.


#13

Nothing’s stopping you from using fat binaries right now. Asking why it can’t be done that way comes across as asking why it shouldn’t always be done that way.

And in some scenarios, file size is important, so even if 90% of users are satisfied with fat binaries, Rust needs a solution for the other 10%.


#14

I’d guess it’s because it does indeed solve the problem of writing a specific function with intrinsics and using it conditionally, and also because while Rust has Cargo, C/C++ has no standard build system, so they can’t add support for multiple compilation to the build system.

As for fat binaries, I don’t think Cargo has built-in support for that? (obviously you can do it manually, but it’s a moderate amount of work, esp. to make it work for exported symbols in shared libraries)


#15

So I tried out runtime-target-feature-rs. Currently it doesn’t work as it is supposed to - just a heads up. The problem is that you often need a different memory layout for SIMD to work effectively - contiguous arrays work better than struct fields - and currently runtime-target-feature-rs doesn’t optimize for that very well. This is an issue I addressed here.
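The layout point can be illustrated like this (types invented): a struct-of-arrays keeps each component contiguous, which vectorizes more readily than interleaved struct fields:

```rust
// array-of-structs: x and y interleaved in memory (x0 y0 x1 y1 ...)
struct PointsAos {
    pts: Vec<(f32, f32)>,
}

// struct-of-arrays: each component is one contiguous run of f32s
struct PointsSoa {
    xs: Vec<f32>,
    ys: Vec<f32>,
}

impl PointsSoa {
    fn scale(&mut self, k: f32) {
        // two simple contiguous loops: a good shape for autovectorization,
        // unlike strided access into the interleaved PointsAos layout
        for x in &mut self.xs {
            *x *= k;
        }
        for y in &mut self.ys {
            *y *= k;
        }
    }
}
```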

I am currently not sure whether it is faster to call cpuid repeatedly or just once - mainly because the memory access to the stored value could take longer than the few extra assembly instructions. Here is a test I wrote that makes a rectangle-rotation function SIMD-optimized, with a fallback function. The main point is that I am now doing the SIMD test manually, but only once per vector (instead of once per element).

Also, my main point about auto-vectorizing functions saves roughly 3 - 5 assembly instructions per function, while the rest of the assembly is identical, so there is a large overhead in code bloat. This could maybe be solved with extra labels (in LLVM), but that is more wishful thinking than reality.

So it’s true that you can do runtime SIMD detection, but you have to do it manually anyways to be effective, so the automatic vectorization I proposed is more or less pointless.


#16

Sorry, I missed that until now, see comment there.

Well, it doesn’t really access a stored value other than the function pointer, so no more overhead than calling a function from a shared library.


#17

Note that GCC uses IFUNCs on targets which support them, so calls to target clones turn into indirect function calls, without a run-time check after the implementing function has been patched in (either as the result of lazy binding, or during initial relocation with BIND_NOW).