Getting explicit SIMD on stable Rust


Sure, it’s a huge task, but why not start small and grow with time? That way you could eliminate the other two points you also mentioned:

Or do higher level like for ex. liboil does.

That would be an option (and I would be happy with that also), but I can already hear people complaining when things break. Rust just got a very good reputation for its stability and backward compatibility.


This is something we should do regardless of the path we take. The difference is: does all growth need to be through std (which has an insanely high bar), or can growth happen outside it? I am very strongly in favor of the latter because I don’t think the former can scale.

Right, that’s why using them would have to entail some kind of warning/tooling that it exists outside our stability guarantees.


I don’t think this is actually the case. rustc defines a mapping (in src/etc/platform-intrinsics/) from its own names (which in the case of x86 are basically Intel’s names) to LLVM’s names. In particular, if LLVM changed the names for some reason then rustc would need to be updated anyway to reflect the new names.


Oh, cool. In terms of the stability story, is naming the only thing we have to worry about? Can the semantics of an intrinsic change?


We can certainly use the definitions of the vendors, which couldn’t change, but that’s only for platform-specific intrinsics. In fact, the only thing that would tie us to LLVM is the llvmint approach where llvm.* functions are imported directly - that is not the same as platform-intrinsics.

Platform-independent SIMD intrinsics would have straightforward semantics that we define; many of them aren’t even LLVM intrinsics underneath, but rather plain instructions operating on LLVM’s vector types (e.g. you can do iadd or fadd on a vector directly).
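To make those semantics concrete, here is a portable sketch of what a lane-wise `iadd` means, with a plain Rust struct standing in for the vector type (the real thing would be `#[repr(simd)]` and lower to LLVM vector instructions directly):

```rust
// A stand-in for a 4-lane i32 vector type; a real SIMD type would be
// #[repr(simd)] and compile to a single vector instruction.
#[derive(Clone, Copy, Debug, PartialEq)]
struct I32x4([i32; 4]);

impl I32x4 {
    // The semantics an abstract `iadd` would have: lane-wise wrapping add.
    fn iadd(self, other: I32x4) -> I32x4 {
        let mut out = [0i32; 4];
        for i in 0..4 {
            out[i] = self.0[i].wrapping_add(other.0[i]);
        }
        I32x4(out)
    }
}

fn main() {
    let a = I32x4([1, 2, 3, 4]);
    let b = I32x4([10, 20, 30, 40]);
    assert_eq!(a.iadd(b), I32x4([11, 22, 33, 44]));
    println!("{:?}", a.iadd(b));
}
```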

If you wanted to stabilize something in libcore it’d probably have to be the latter, to keep it target-feature-agnostic for the most part.

However, doing anything in libcore at all would mean all SIMD types have the same semantics and if you want to isolate that you’d have to wrap them in newtypes. The difference between such an abstraction and exposed intrinsics is mostly in the type-system machinery involved. If the intrinsics can be declared numerous times (e.g. by some macro in some crate), each individual declaration can be checked in a way that might not be in stable Rust for a couple years or more, if ever!

We certainly could’ve had a better intrinsic story already, if anyone were actively working on it. The MIR transition has taken priority, but now it should be possible to model the ideal situation where even, e.g. trait impl methods (like Add::add for some SIMD type) can be hooked up to intrinsics. A few people have expressed some interest in writing up a proposal taking all of that into account but it sort of got lost in the cracks.


I’ve been working on a proposal for a portable SIMD instruction set for WebAssembly. It’s based on the work of the SIMD.js working group who made sure that it can be mapped to Intel, ARM, and MIPS architectures.


That’s a nice set of “abstracted SIMD intrinsics”, and I agree we want something like that for the baseline. Sadly that’s only half the story, and we will likely also want to match the vendors on their individual ISA extensions.


That’s cool, but I still think that’s something that should be iterated on outside of std. For example, it doesn’t look like that includes SSE 4.2 instructions like PCMPESTRI or CRC, both of which I’d love to use. :slight_smile: Plus a whole boatload from AVX2 too.


I believe it’s not supposed to contain such instructions, but rather to form a common base, i.e. the “abstract subset”, whose semantics are given by us (or by some standard like wasm), as opposed to by vendors (which usually expose intrinsics that are 1:1 with actual instructions in their respective architectures).


I understand that. I’m just trying to push my argument in favor of providing access to SIMD operations directly on stable Rust, as opposed to trying to stabilize an abstraction in std like the existing simd crate or @stoklund’s work. I feel like we need to at least reach a consensus on that before anything else, no? I say that because if we do decide to make SIMD operations available on Rust stable, then the entire question of a nicer API can be completely punted and we can therefore avoid getting into those weeds here.


IMO that’s a red herring, unless the vendor intrinsics happen to cover absolutely everything? That is, I’m actually not sure it’s possible to get rid of any abstracted operation. Even if it is, keep in mind that for most backends you’ll have to effectively turn SSE into vector ops, with the target codegen of the backend turning the vector ops into (possibly different) SSE instructions.

It’s not a deal-breaker and it might be easier to only maintain vendor intrinsics as the stable thing exposed by rustc, just pointing out how some intuition might be wrong here.



I think there are probably gaps in my understanding of how this actually works in the compiler. Could you help me fill them? (I can’t quite connect all of the dots in your comments, probably because I don’t quite have a full grasp of the jargon involved, but I’d like to.)

This is what my current understanding is:

  • LLVM has a giant list of intrinsics that it supports using an LLVM specific naming convention.
  • rustc maps a subset of the LLVM intrinsics to intrinsics exposed for use by users of rustc. The names of the intrinsics exposed by rustc aren’t necessarily the names chosen by LLVM. In today’s world, users can get at these intrinsics by using extern "platform-intrinsics" { ... }.
  • The subset mentioned above is arbitrary. That is, it’s a subset purely because it hasn’t been made exhaustive yet. (i.e., “we haven’t added the requisite JSON yet.”)
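For concreteness, the nightly-only surface described above looks roughly like this (a sketch only; it won’t compile on stable, and the intrinsic name here is illustrative of the Intel-style naming in the platform-intrinsics JSON):

```rust
// Nightly-only sketch: declaring platform intrinsics by name.
// rustc resolves these names (via src/etc/platform-intrinsics/)
// to the corresponding LLVM intrinsics.
#![feature(platform_intrinsics, repr_simd)]

#[repr(simd)]
struct f32x4(f32, f32, f32, f32);

extern "platform-intrinsics" {
    // Intel-style name; maps to SSE's max-packed-singles underneath.
    fn x86_mm_max_ps(a: f32x4, b: f32x4) -> f32x4;
}
```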

And in particular, these intrinsics might be the things that we could make available in Rust stable. In my view, this is almost no abstraction at all.

I feel like I’m missing a step though! Note in particular that this proposal isn’t necessarily based on me knowing all of the same alternatives that you know!


You’re correct AFAIK, although some comments above suggest we moved to vendor-specific names.

However, that’s not all of "platform-intrinsics", only platform-specific ones (x86_*, arm_*, etc.).

On top of that we have a basic set of abstracted “vector operations” (see typeck and trans).

The LLVM intrinsic story is indeed changing over time, and interestingly enough they seem to prefer replacing intrinsics with canonical combinations of instructions. That’s the history of the “AutoUpgrade” mechanism which is how old intrinsics are kept working in newer versions - I am not aware of the actual “statute of limitations” on those, but we can always go back and read what they did if we need to.


I think we should support some form of stable use of all vendor-defined intrinsics.

Here’s an article on GCC’s function multi-versioning:


If you want a very ugly starting point: I tooled up this crate to generate intrinsics for x86_64, ARM, and AArch64. It isn’t type safe and assumes everything is signed (as LLVM does), so I retracted the patch. The crate steals the names of GCC symbols but uses the definitions from llvmint.

But this may be a good starting point to build on if you want to define all the SIMD instructions fairly quickly in a way that can plug into rustc easily.


I think you should focus on the underlying features here, rather than SIMD specifically. There may well be other processor-specific instruction sets we want to handle in the same way. In particular: a general-purpose intrinsics mechanism, something like target-feature, and something to handle the dynamic CPUID stuff.

I think that beyond the core of SIMD (roughly the SIMD.js subset), there are a huge number of instructions which can’t really have abstract support; we need a mechanism for these to be available to programmers, similar to how we expose platform-specific features which are OS-dependent. A better intrinsic story might be a solution here (together with scenarios or target-feature + CPUID stuff).

Some random thoughts:

  • inline assembly or auto-vectorisation is not enough - you need to be able to mark data as SIMD data or you inevitably lose performance marshalling/un-marshalling. This can be seen today - LLVM does some auto-vectorisation, but if you manually use the simd crate you can get significant perf improvements.
  • I think the core SIMD stuff (as in Huon’s SIMD crate and SIMD.js) should be on the path to inclusion in std - it needs work, but it is fairly general purpose and should be in the nursery and have a roadmap to at least official ownership, if not quite a physical place in std. The non-core stuff is too platform-specific to be anywhere near std, IMO.
  • Given that SIMD is defined by the processors, I think it is fine to have ‘Rust intrinsics’ for them and to assume that any backend will support them (either as intrinsics or assembly or whatever). In fact what Rust offers might end up being a superset of what LLVM offers.
  • I think that SIMD ‘intrinsics’ don’t really have to be intrinsics at all - they could be just extern functions with a single asm instruction.
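That last point can be sketched with inline assembly: a “popcnt intrinsic” written as an ordinary function wrapping a single instruction (x86_64 only here; other targets take a portable fallback, and a production version would gate on the CPUID popcnt feature bit rather than assuming it):

```rust
use std::arch::asm;

// A single-instruction "intrinsic" written as an ordinary function.
#[cfg(target_arch = "x86_64")]
fn popcnt(x: u64) -> u64 {
    let r: u64;
    // SAFETY: popcnt exists on essentially all current x86_64 CPUs; a
    // production version would check the CPUID popcnt feature bit first.
    unsafe {
        asm!("popcnt {0}, {1}", out(reg) r, in(reg) x);
    }
    r
}

// Portable fallback for other targets.
#[cfg(not(target_arch = "x86_64"))]
fn popcnt(x: u64) -> u64 {
    x.count_ones() as u64
}

fn main() {
    assert_eq!(popcnt(0b1011_0111), 6);
    println!("{}", popcnt(u64::MAX)); // 64
}
```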


I think it’s pretty early to talk about making the SIMD crate official. The lack of value generics is a big barrier to having any sort of interface over the intrinsics (see, e.g., the SIMD crate’s lack of support for shuffles), and so I think value generics have to happen before we can get any experience on what official SIMD support should look like.

But I completely agree about stabilizing intrinsics. They will be an important part of platform-specific programming no matter what nicer facade eventually goes on top, so I think we should just go ahead and stabilize them.

The more interesting/difficult part (IMO) is the runtime detection. @jethrogb mentioned gcc’s function multi-versioning, but I think there’s at least one more wrinkle: if I have a Foo&lt;T&gt; type then I might want to say “every method on Foo&lt;u8x16&gt; should be compiled using SSE 4.1, and every method on Foo&lt;u8x32&gt; should be compiled using AVX2.”


@comex One difference between intrinsics and inline assembly is that the optimizer understands intrinsics and can optimize them, while it currently does not optimize inline assembly (it just emits what was written). Inline assembly is, as a consequence, a worse solution than using the backend intrinsics.

@burntsushi Would it be possible to allow mixing crates that require nightly internally (but not in their API, or under some other constraints) with crates that require stable? That could be an alternative.

@stoklund I still think we need to provide a way to call into the intrinsics directly, not just an abstraction.

I think that whether std has something like std::arch::x86/x86_64/AArch64/... with intrinsics available in the different architectures is orthogonal to a std::simd module that provides a cross-platform abstraction over simd intrinsics. Both things are worth pursuing.

Note that SIMD intrinsics are not the only kind of intrinsics available. Other examples are the bit-manipulation instruction sets (BMI1, BMI2, ABM, and TBM) and the AES encryption intrinsics, which are also architecture dependent, and also orthogonal to a cross-platform library for bit manipulation or for doing encryption. All of these are worth pursuing.

Still, there has been too little experimentation in Rust with building cross-platform libraries that expose subsets of platform intrinsics. I think this could be improved if these libraries could be written and used in stable Rust. The features that would enable this are mainly:

  • exposing architecture dependent intrinsics in rust stable, probably through std::arch::_ or similar,
  • a stable way to branch on the architecture (scenarios, target_feature, …),
  • #[repr(simd)] or similar.

Once we are there we might not need to stabilize the simd crate anymore, and if somebody wants to implement a different way of doing simd, they will be able to write their own crate for that which uses intrinsics directly.
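For the second bullet, a branching sketch that combines compile-time `cfg` with runtime feature detection might look like this (using the `is_x86_feature_detected!` macro; non-x86 targets take the portable path):

```rust
// Compile-time branch: is this an x86_64 build at all?
#[cfg(target_arch = "x86_64")]
fn describe() -> &'static str {
    // Runtime branch: does this particular CPU have AVX2?
    if is_x86_feature_detected!("avx2") {
        "x86_64 with AVX2"
    } else {
        "x86_64 without AVX2"
    }
}

// Portable fallback selected at compile time on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn describe() -> &'static str {
    "portable fallback"
}

fn main() {
    println!("{}", describe());
}
```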

To support run-time architecture-dependent code we would need a lot of compiler support. The binary has to include code for a set of architectures, detect the kind of CPU on initialization, and be able to modify itself at run-time. AFAIK OpenMP’s pragma simd does this, but it is not trivial.


I wanted to jot down some thoughts as well that we discussed in libs triage yesterday, but this is also primarily my own personal opinion about how we should proceed here.

  • We should strive not to implement or stabilize a “high level” or “type safe” wrapper around simd in the standard library. That is, there will be no ergonomic interface to simd in the standard library. My thinking is that we may not quite have all the language features necessary to even do this, and it’s also not clear what it should look like. This of course doesn’t mean that such a wrapper shouldn’t exist (e.g. the simd crate); I’m just thinking it should explicitly be built on top of, rather than in, the standard library. In that sense, I’d personally like to table all discussion (just for now) about a high-level, type-safe, generic, etc, wrapper around simd.

    Note that this sentiment echoes @emoon’s thoughts, which I’ve also heard from many others who desire SIMD: intrinsics are crucial and a type-safe interface isn’t. Also note that the simd subset @stoklund is working on would be a great candidate to base the simd crate on (I believe).

  • We should not stabilize intrinsics as literally intrinsics. Consider, for example, the transmute function. This is an intrinsic, but defining it as such is not stable; rather, it’s reexported as mem::transmute. This is essentially my preferred strategy with simd intrinsics as well. That is, I think we should table all discussion of how these intrinsics and functions are defined, and rather focus on where they’re defined, the APIs, the semantics, etc. Note that this would also table discussion about whether these are lang items or not, as that wouldn’t be part of the stable API surface area.

  • Given that information, I’d like to propose that we follow these guidelines for stabilizing access to SIMD:

    • A new module of the standard library is created, std::simd
    • All new intrinsics are placed directly inside this simd module (or perhaps inside an inner intrinsics module).
    • Method names match exactly what’s found in official architecture documentation. That is, these aren’t tied to LLVM at all but rather the well-known and accepted documentation for these intrinsics. Note that this helps porting code, reading official docs and translating to Rust, searching Rust docs, adding new intrinsics, etc. Recall that our goal is not to have an ergonomic interface to simd in the standard library yet, so that’s not necessarily a requirement here.

To me, this is at least a pretty plausible route to stabilization of the APIs involved with SIMD. Unfortunately, SIMD has lots of weird restrictions as well. For example @huon’s shuffle intrinsics, a critical simd operation, require that the indices passed be a constant array of constants. I suspect there are other APIs like this in SIMD with requirements we can’t quite express in the type system per se.

@eddyb assures me, however, that we can at least encode these restrictions in the compiler. That is, we could make it an error to call a shuffle intrinsic without a constant-array-of-constants, and this error could also be generated pre-monomorphization to ensure that it doesn’t turn up as some weird downstream bug.
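The flavor of that restriction can be mimicked on stable today with const generics: the indices become compile-time constants, so they can be checked per instantiation rather than at runtime (a portable sketch, not the actual shuffle intrinsic):

```rust
// A shuffle over 4-lane vectors whose indices are compile-time constants,
// mirroring the "constant array of constants" requirement.
fn shuffle4<const I0: usize, const I1: usize, const I2: usize, const I3: usize>(
    v: [i32; 4],
) -> [i32; 4] {
    // An out-of-range index fails here at monomorphization time; the real
    // intrinsic would reject it pre-monomorphization with a compile error.
    [v[I0], v[I1], v[I2], v[I3]]
}

fn main() {
    let v = [10, 20, 30, 40];
    // Reverse the lanes.
    assert_eq!(shuffle4::<3, 2, 1, 0>(v), [40, 30, 20, 10]);
    println!("{:?}", shuffle4::<3, 2, 1, 0>(v));
}
```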

@burntsushi and I will be looking into the intrinsic space here to ensure we have a comprehensive view of what all the intrinsics look like. Basically we want to classify them into categories to ensure that we have a semi-reasonable route to expressing them all in Rust.

Ok, so with that in mind, we’re in theory in a position to add all intrinsics to the standard library, give them a stable interface, and remain flexible to change implementation details in the future as we see fit. I’d imagine that this set of intrinsics can cover anything that standard simd features enable. That is, @burntsushi’s other desired intrinsics, which come with SSE 4.2, @nrc’s thoughts about more than just SIMD, and @gnzlbg’s desire for perhaps other intrinsics (e.g. ABM/TBM/AES) could all be defined roughly in this manner. Perhaps not all literally in std::simd itself, but we could consider a std::arch module as well (maybe).

The last major problem which needs to be solved for SIMD support (I believe) is the “dynamic dispatch” problem. In other words, compiling multiple versions of a function and then dispatching to the correct one at runtime. @jethrogb pointed out that gcc has support for this with multi-versioning, which is indeed pretty slick!

I think this problem boils down to two separate problems:

  1. How do I compile one function separately from the rest (with different codegen options)?
  2. How do I dispatch at runtime to the right function?

Problem (1) was recently added to LLVM (i.e. support at the lowest layer), and I would propose punting on problem (2). The dispatch problem can be solved with a feature called ifunc on ELF targets (read: Linux), which is quite efficient but not portable. In that sense I think we should hold off on solving it just yet and perhaps leave it to be discussed another day.
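Even punting on compiler support, problem (2) can be approximated in user code: resolve the implementation once and cache a function pointer, which is essentially what ifunc does at load time. A sketch, with the CPU check stubbed out and the “accelerated” version standing in for a real `target_feature`-compiled one:

```rust
use std::sync::OnceLock;

// Portable scalar implementation.
fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().sum()
}

// Stand-in for an AVX2-accelerated version; a real one would be compiled
// with the avx2 feature enabled and use vector intrinsics.
fn sum_avx2(xs: &[u32]) -> u32 {
    xs.iter().sum()
}

// Hypothetical CPU check; real code would read CPUID.
fn cpu_has_avx2() -> bool {
    false
}

// Resolved once, then reused: the user-space analogue of ELF ifunc.
static SUM_IMPL: OnceLock<fn(&[u32]) -> u32> = OnceLock::new();

fn sum(xs: &[u32]) -> u32 {
    let f = SUM_IMPL.get_or_init(|| {
        if cpu_has_avx2() {
            sum_avx2 as fn(&[u32]) -> u32
        } else {
            sum_scalar
        }
    });
    f(xs)
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
    println!("{}", sum(&[1, 2, 3, 4]));
}
```

Unlike ifunc this pays an atomic load per call through `sum`, but it is portable across targets and needs no linker support.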

So for the problem of compiling multiple functions, you could imagine something like:

#[target_feature = "avx"]
fn foo() {
    // the avx feature is enabled for this, and only this, function
}

#[target_feature = "ssse3"]
#[target_feature = "avx2"]
fn bar() {
    // avx2/ssse3 are both enabled for this function
}

fn dispatch() {
    if cpuid().supports_avx2() {
        bar()
    } else {
        foo()
    }
}
That is, we could add an attribute to just enable features, like __attribute__((target("avx2"))) in C. My main question and hesitation around this feature is what it implies on the LLVM side of things. That is, what happens if we call an avx2 intrinsic in a function which hasn’t enabled the avx2 feature set? If LLVM silently works then it’s perhaps flexible enough that we’re basically done at that point, but if LLVM hits a random codegen error then we’ll have to enforce stricter guarantees about which functions you can call and where you can call them. Put another way, our SIMD support should literally never show you an LLVM error, in any possible usage of SIMD.

I believe @aturon is going to coordinate with compiler folks to ensure that the LLVM and codegen side of things is ready for this problem.

Ok, that was a bit longer than anticipated, but I’m curious to hear others’ thoughts about this!


Just a quick +1 to the idea of providing a stable interface to CPU-specific instructions in any way, shape, or form. The other day I wanted to use the PEXT instruction from BMI2 (so: not SIMD), which is available on Haswell processors or later, and found the fact that I needed to use a nightly build of Rust to do it without calling out to C to be rather silly, considering that the stability of these instructions has everything to do with Intel and nothing to do with Rust. People who are willing to accept the portability implications of using CPU-specific instructions should be able to use them.
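For the curious: PEXT gathers the bits of its source selected by a mask and packs them contiguously into the low bits. A software model, plus the hardware instruction where it exists (guarded by runtime detection, with `_pext_u64` from `std::arch`), looks like:

```rust
// Software model of BMI2's PEXT: extract the bits of `x` selected by
// `mask` and pack them into the low bits of the result.
fn pext_soft(x: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0;
    while mask != 0 {
        let lsb = mask & mask.wrapping_neg(); // lowest set bit of the mask
        if x & lsb != 0 {
            out |= 1 << bit;
        }
        bit += 1;
        mask &= mask - 1; // clear that bit and continue upward
    }
    out
}

fn main() {
    // Extract bits 4..8 of x (0, 1, 0, 1 from low to high) -> 0b1010.
    assert_eq!(pext_soft(0b1010_0110, 0b1111_0000), 0b1010);

    // Where the hardware instruction exists, it must agree with the model.
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("bmi2") {
        // SAFETY: guarded by the runtime bmi2 check above.
        let hw = unsafe { std::arch::x86_64::_pext_u64(0b1010_0110, 0b1111_0000) };
        assert_eq!(hw, 0b1010);
    }

    println!("{}", pext_soft(0b1010_0110, 0b1111_0000)); // 10
}
```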

(Everything @alexcrichton wrote also sounds very reasonable to me, although I am not an expert in the area.)