Getting explicit SIMD on stable Rust

burntsushi · November 14, 2016, 1:26am

Last week, the libs team discussed SIMD stabilization. I’d like to write up some of the problems we discussed and some possible avenues to move forward on getting explicit use of SIMD on stable Rust. (Explicit use of SIMD means that the programmer takes some explicit action to vectorize their code, as opposed to relying on the compiler to vectorize it for them.)

Disclaimer: I personally have very little experience with SIMD, and my compiler backend knowledge is relatively limited. I have no doubt committed serious errors and omissions. I welcome fixes. I’m hoping the compiler team can chime in!

Prior work on this topic:

RFC: https://github.com/rust-lang/rfcs/blob/master/text/1199-simd-infrastructure.md
The simd crate.

In the current state of the world, the only way to use explicit SIMD instructions—whether they are intrinsics exposed by LLVM directly or a convenient abstraction as defined in the simd crate—requires unstable Rust. There are a number of features required:

cfg_target_feature - AFAIK, this feature permits instructing the compiler to actually emit SIMD instructions. For example, much of the simd crate uses target_feature for conditional compilation.
repr_simd - This is used to annotate structs such that they can be used as parameters to SIMD intrinsics. There are some limitations on where repr(simd) can be used (for example, it can’t be used with generics?), but I don’t know the details here.
platform_intrinsics - This makes various LLVM intrinsics available for use with an explicit extern block.

If the above features were stabilized, then for example, the simd crate could be made to work on stable Rust. With that said, the path to stabilizing them isn’t clear. There are numerous problems. I’ll try to outline them below:

`-C target-feature=foo` is hard to use

In today’s Rust, actually using target-feature is pretty inconvenient. I’ve at least been telling people to use RUSTFLAGS. For example, to compile ripgrep with SIMD support, one needs to do this:

RUSTFLAGS="-C target-feature=+ssse3" cargo build --release --features simd-accel

One could also use target-cpu=native, but the advantage of the above command is that binaries can be distributed to most x86_64 platforms (but not all).
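To make the compile-time side concrete, here is a minimal sketch of how a crate can branch on whatever was passed through `-C target-feature`. The names `backend_name` and `built_with_ssse3` are made up for illustration. The `target_feature` cfg predicate is what the cfg_target_feature gate covers, and it later became usable on stable toolchains.

```rust
/// Hypothetical helper: reports which code path was selected at compile time.
/// This variant only exists if the build enabled SSSE3, for example via
/// RUSTFLAGS="-C target-feature=+ssse3".
#[cfg(target_feature = "ssse3")]
pub fn backend_name() -> &'static str {
    "ssse3"
}

/// Variant compiled when SSSE3 was not enabled for the build.
#[cfg(not(target_feature = "ssse3"))]
pub fn backend_name() -> &'static str {
    "fallback"
}

/// The same check as an expression instead of an attribute.
pub fn built_with_ssse3() -> bool {
    cfg!(target_feature = "ssse3")
}
```

Note that this decision is baked in when the binary is built, so it does nothing for a binary that has to adapt to the CPU it happens to run on.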

It is possible that this specific thing might be able to get worked out with scenarios, but most folks probably will want to eschew this anyway in favor of runtime detection. Which brings us to the next concern.

How does runtime detection using cpuid work?

The libs team didn’t quite seem to know how this would work. Here’s an example problem that I think should be solvable to help motivate this:

I’d like to compile a single binary that works on all Linux x86_64 platforms.
I’d like for that binary to make use of SIMD instructions such as those introduced in SSE 4.2 only if they are available. If they aren’t available, then the program should be capable of using a fallback implementation that doesn’t use SSE 4.2 instructions.

A key thing to note here is that the current system is subtly insufficient. In particular, while said binary might be capable of using SSE 4.2 instructions in places, the compiler probably shouldn’t be using any SSE 4.2 instructions for autovectorization optimizations, since that could preclude running on a platform without SSE 4.2! (N.B. I’m using SSE 4.2 just as an example here.)
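For illustration only, here is roughly what a solution to the problem above can look like with the `std::arch` API that landed in later Rust releases (it did not exist when this post was written). The function names `crc32c`, `crc32c_sse42` and `crc32c_fallback` are assumptions made for this sketch. `#[target_feature(enable = "sse4.2")]` limits SSE 4.2 code generation to one function, so the rest of the binary stays runnable on older CPUs, and `is_x86_feature_detected!` does the cpuid-style check at runtime.

```rust
/// Bitwise software CRC-32C (Castagnoli). Works on any CPU.
pub fn crc32c_fallback(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // 0x82F63B78 is the bit-reflected Castagnoli polynomial.
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Same checksum using the SSE 4.2 CRC32 instruction.
/// Only this function is compiled with SSE 4.2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn crc32c_sse42(data: &[u8]) -> u32 {
    use std::arch::x86_64::_mm_crc32_u8;
    let mut crc = !0u32;
    for &byte in data {
        crc = _mm_crc32_u8(crc, byte);
    }
    !crc
}

/// Picks an implementation at runtime based on what the CPU reports.
pub fn crc32c(data: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Safety: the CPU was just confirmed to support SSE 4.2.
            return unsafe { crc32c_sse42(data) };
        }
    }
    crc32c_fallback(data)
}
```

Callers only ever see `crc32c`; both paths compute the same checksum, so the choice of implementation is invisible apart from speed.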

Intrinsics

It’s my understanding that there are thousands of intrinsics, and all of them follow specific LLVM naming conventions. (What else is LLVM specific?) Stabilizing these directly seems potentially ill-advised for a couple important reasons:

The API surface area is huge and platform dependent. If LLVM decided to change or remove one of these intrinsics, we would be beholden to them on the next LLVM upgrade, and thereby possibly sacrificing our stability story.
If, one day, someone wanted to write a Rust compiler that didn’t use LLVM, would it be feasible for that compiler to provide exactly the same set of intrinsics as LLVM? (Probably not.)

An alternative to stabilizing intrinsics directly

One thing that has been tossed around is the ability to stabilize an abstraction around SIMD instructions without exposing the intrinsics directly. For example, we could, in theory, move the existing simd crate into std and stabilize that without tackling the problems with stabilizing intrinsics directly.

My personal take on this is that we really need to provide a way to use intrinsics on stable Rust. There are so many of them that it would be a herculean task to build an abstraction around all of them that met everyone’s use cases. Moreover, my understanding of the current feel of things is that the simd crate’s abstraction is controversial pending potential future language changes (like integer generics?).

The libs team discussed this particular issue, and one possibility came up of building a special libstd-llvm crate that shipped with stable Rust and provided access to LLVM’s intrinsics with the caveat that it exists outside of Rust’s stability story. How do folks feel about that?

est31 · November 14, 2016, 2:21am

Why not do autovectorisation for autodetecting binaries too? The problems you deal with are the same, the only difference is that the compiler has to provide the fallback and not the programmer.

Generally I would really love to see SIMD in stable rust.

burntsushi · November 14, 2016, 2:36am

I don’t think I said we shouldn’t do that.

comex · November 14, 2016, 5:24am

I still think it makes sense to start with stabilizing inline assembly, which can be used as a suboptimal but not terrible alternative to native SIMD intrinsics, and has many other uses too.

burntsushi · November 14, 2016, 1:06pm

That’s certainly one approach to take, but I’d really like to focus on what we can do to make explicit SIMD work well in Rust, since I think we want to do that irrespective of inline assembly on stable Rust. (It may turn out that inline assembly is less work, but I don’t think we know that yet.)

emoon · November 14, 2016, 1:17pm

As someone who has written SIMD code in C/C++ for a fair number of years, I would prefer a two-step approach:

Implement the intrinsics as they are listed in the Intel manuals (and take the same approach for other CPUs). In Intel’s case, they already provide lots of documentation for them, and people coming from the C/C++ world would be very familiar with them. As Clang already supports them with LLVM, they should match quite well. If there were ever a change to another backend, it would still be possible to use them, as there are docs around for them anyway.
Approach 1 is fairly typeless. This is actually a good thing when writing optimized SIMD code, where you may do floating point tricks directly with integer instructions. As always with something close to assembly, it’s easy to do something wrong. Step 2 would be something more typed (I think I have seen things like f32x4, i16x8, etc.), which is more user friendly but can make it somewhat non-intuitive when doing trickery (you would need to do a fair amount of casting).

I can see Rust wanting to go more for approach 2 but I think it would be nice to have approach 1 also.
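As a rough illustration of the trade-off between the two approaches, here is a portable sketch of the typed style. It uses plain arrays instead of real vector registers, and the names `F32x4`, `U32x4` and `abs_via_mask` are hypothetical, not taken from the simd crate. The point is only that an integer trick on float data (clearing the sign bit) needs explicit casts in a typed API.

```rust
use std::ops::Add;

/// Hypothetical typed vector of four f32 lanes (array-backed, not a real SIMD register).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

/// Hypothetical typed vector of four u32 lanes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U32x4(pub [u32; 4]);

impl Add for F32x4 {
    type Output = F32x4;

    /// Lane-wise addition.
    fn add(self, rhs: F32x4) -> F32x4 {
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = self.0[i] + rhs.0[i];
        }
        F32x4(out)
    }
}

impl F32x4 {
    /// Reinterpret the float lanes as integer lanes (the "cast" a typed API demands).
    pub fn to_bits(self) -> U32x4 {
        U32x4(self.0.map(f32::to_bits))
    }

    /// Reinterpret integer lanes as float lanes.
    pub fn from_bits(bits: U32x4) -> F32x4 {
        F32x4(bits.0.map(f32::from_bits))
    }

    /// Absolute value done as an integer operation: mask off the sign bit of each lane.
    pub fn abs_via_mask(self) -> F32x4 {
        let U32x4(bits) = self.to_bits();
        F32x4::from_bits(U32x4(bits.map(|b| b & 0x7FFF_FFFF)))
    }
}
```

With untyped intrinsics the mask could be applied to the register directly; here it takes a round trip through `to_bits` and `from_bits`, which is the casting overhead mentioned above.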

willi_kappler · November 14, 2016, 1:32pm

...

Sure, it's a huge task, but why not start small and grow over time? That way you could eliminate the other two points you also mentioned:

Or do something higher level, like liboil does, for example.

That would be an option (and I would be happy with that too), but I can already hear people complaining when things break. Rust has just earned a very good reputation for its stability and backward compatibility.

burntsushi · November 14, 2016, 2:01pm

This is something we should do regardless of the path we take. The difference is: does all growth need to be through std (which has an insanely high bar) or can growth happen on crates.io? I am very strongly in favor of the latter because I don't think the former can scale.

Right, that's why using them would have to entail some kind of warning/tooling that it exists outside our stability guarantees.

jneem · November 14, 2016, 2:12pm

I don't think this is actually the case. rustc defines a mapping (in src/etc/platform-intrinsics/) from its own names (which in the case of x86 are basically Intel's names) to LLVM's names. In particular, if LLVM changed the names for some reason then rustc would need to be updated anyway to reflect the new names.

burntsushi · November 14, 2016, 2:25pm

Oh, cool. In terms of the stability story, is naming the only thing we have to worry about? Can the semantics of an intrinsic change?

eddyb · November 14, 2016, 2:59pm

We can certainly use the definitions of the vendors, which couldn’t change, but that’s only for platform-specific intrinsics. In fact, the only thing that would tie us to LLVM is the llvmint approach where llvm.* functions are imported directly - that is not the same as platform-intrinsics.

Platform-independent SIMD intrinsics would have straightforward semantics we define, many of those not even being LLVM intrinsics underneath but rather instructions using its vector types (e.g. you can do iadd or fadd on a vector directly).

If you wanted to stabilize something in libcore it’d probably have to be the latter, to keep it target-feature-agnostic for the most part.

However, doing anything in libcore at all would mean all SIMD types have the same semantics and if you want to isolate that you’d have to wrap them in newtypes. The difference between such an abstraction and exposed intrinsics is mostly in the type-system machinery involved. If the intrinsics can be declared numerous times (e.g. by some macro in some crates.io crate), each individual declaration can be checked in a way that might not be in stable Rust for a couple years or more, if ever!

We certainly could’ve had a better intrinsic story already, if anyone were actively working on it. The MIR transition has taken priority, but now it should be possible to model the ideal situation where even, e.g. trait impl methods (like Add::add for some SIMD type) can be hooked up to intrinsics. A few people have expressed some interest in writing up a proposal taking all of that into account but it sort of got lost in the cracks.
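As a library-level approximation of that idea (not the compiler-level hookup described above), the sketch below routes `Add::add` for a newtype to a vendor intrinsic. It assumes the `std::arch` intrinsics that became available in later Rust releases, and the names `Float4` and `backend` are invented for this example. On x86_64 the addition goes through `_mm_add_ps`, which is safe to rely on there because SSE is part of the x86_64 baseline; other targets get a plain array fallback.

```rust
use std::ops::Add;

#[cfg(target_arch = "x86_64")]
mod backend {
    use std::arch::x86_64::{__m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

    pub type Raw = __m128;

    pub fn load(v: [f32; 4]) -> Raw {
        // Safety: SSE is always present on x86_64 and the pointer covers four f32 values.
        unsafe { _mm_loadu_ps(v.as_ptr()) }
    }

    pub fn store(r: Raw) -> [f32; 4] {
        let mut out = [0.0f32; 4];
        // Safety: `out` has room for exactly four f32 values.
        unsafe { _mm_storeu_ps(out.as_mut_ptr(), r) };
        out
    }

    pub fn add(a: Raw, b: Raw) -> Raw {
        // Safety: SSE is always present on x86_64.
        unsafe { _mm_add_ps(a, b) }
    }
}

#[cfg(not(target_arch = "x86_64"))]
mod backend {
    // Portable fallback so the sketch builds on other targets.
    pub type Raw = [f32; 4];

    pub fn load(v: [f32; 4]) -> Raw {
        v
    }

    pub fn store(r: Raw) -> [f32; 4] {
        r
    }

    pub fn add(a: Raw, b: Raw) -> Raw {
        [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
    }
}

/// Hypothetical newtype that isolates the SIMD representation from users.
#[derive(Clone, Copy)]
pub struct Float4(backend::Raw);

impl Float4 {
    pub fn new(v: [f32; 4]) -> Self {
        Float4(backend::load(v))
    }

    pub fn to_array(self) -> [f32; 4] {
        backend::store(self.0)
    }
}

impl Add for Float4 {
    type Output = Float4;

    /// The `+` operator forwards to the backend's vector add.
    fn add(self, rhs: Float4) -> Float4 {
        Float4(backend::add(self.0, rhs.0))
    }
}
```

Users of `Float4` write `a + b` and never touch the intrinsic, which is the kind of isolation the newtype wrapping is meant to give.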

stoklund · November 14, 2016, 3:30pm

I’ve been working on a proposal for a portable SIMD instruction set for WebAssembly. It’s based on the work of the SIMD.js working group who made sure that it can be mapped to Intel, ARM, and MIPS architectures.

eddyb · November 14, 2016, 3:42pm

That’s a nice set of “abstracted SIMD intrinsics”, and I agree we want something like that for the baseline. Sadly that’s only half the story and we will likely try to match the vendors on their individual ISA extensions.

burntsushi · November 14, 2016, 4:22pm

That’s cool, but I still think that’s something that should be iterated on outside of std. For example, it doesn’t look like that includes SSE 4.2 instructions like PCMPESTRI or CRC, both of which I’d love to use. Plus a whole boatload from AVX2 too.

eddyb · November 14, 2016, 4:35pm

I believe it’s supposed not to (contain such instructions), but rather form a common base, i.e. the “abstract subset”, which are given semantics by us (or by some standard like wasm), as opposed to by vendors (which usually expose intrinsics that are 1:1 with actual instructions in their respective architectures).

burntsushi · November 14, 2016, 4:44pm

I understand that. I’m just trying to push my argument in favor of providing access to SIMD operations directly on stable Rust, as opposed to trying to stabilize an abstraction in std like the existing simd crate or @stoklund’s work. I feel like we need to at least reach a consensus on that before anything else, no? I say that because if we do decide to make SIMD operations available on Rust stable, then the entire question of a nicer API can be completely punted and we can therefore avoid getting into those weeds here.

eddyb · November 14, 2016, 5:01pm

IMO that’s a red herring, unless the vendor intrinsics happen to cover absolutely everything? That is, I’m actually not sure it’s possible to get rid of any abstracted operation. Even if it is, keep in mind that for most backends you’ll have to effectively turn SSE into vector ops, with the target codegen of the backend turning the vector ops into (possibly different) SSE instructions.

It’s not a deal-breaker and it might be easier to only maintain vendor intrinsics as the stable thing exposed by rustc, just pointing out how some intuition might be wrong here.

burntsushi · November 14, 2016, 5:07pm

@eddyb

I think there are probably gaps in my understanding of how this actually works in the compiler. Could you help me fill them? (I can’t quite connect all of the dots in your comments, probably because I don’t quite have a full grasp of the jargon involved, but I’d like to.)

This is what my current understanding is:

LLVM has a giant list of intrinsics that it supports using an LLVM specific naming convention.
rustc maps a subset of the LLVM intrinsics to intrinsics exposed for use by users of rustc. The names of the intrinsics exposed by rustc aren’t necessarily the names chosen by LLVM. In today’s world, users can get at these intrinsics by using extern "platform-intrinsics" { ... }.
The subset mentioned above is arbitrary. That is, it’s a subset purely because it hasn’t been made exhaustive yet. (i.e., “we haven’t added the requisite JSON yet.”)

And in particular, these intrinsics might be the things that we could make available in Rust stable. In my view, this is almost no abstraction at all.

I feel like I’m missing a step though! Note in particular that this proposal isn’t necessarily based on me knowing all of the same alternatives that you know!

eddyb · November 14, 2016, 5:29pm

You’re correct AFAIK, although some comments above suggest we moved to vendor-specific names.

However, that’s not all of "platform-intrinsics", only platform-specific ones (x86_*, arm_*, etc.).

On top of that we have a basic set of abstracted “vector operations” (see typeck and trans).

The LLVM intrinsic story is indeed changing over time, and interestingly enough they seem to prefer replacing intrinsics with canonical combinations of instructions. That’s the history of the “AutoUpgrade” mechanism which is how old intrinsics are kept working in newer versions - I am not aware of the actual “statute of limitations” on those, but we can always go back and read what they did if we need to.

jethrogb · November 14, 2016, 5:33pm

I think we should support some form of stable use of all vendor-defined intrinsics.

Here’s an article on GCC’s function multi-versioning: https://lwn.net/Articles/691932/
