Getting explicit SIMD on stable Rust


#21

If you want a very ugly starting point: I tooled up this crate to generate x86_64, ARM, and AArch64 intrinsics. It isn't type safe and assumes everything is signed (as LLVM does), so I retracted the patch. The crate steals the names of GCC symbols, but uses the definitions from llvmint.

But this may be a good starting point to build on if you want to define all the SIMD instructions fairly quickly in a manner that can plug into rustc easily.


#22

I think that you should focus on the underlying features here, rather than SIMD. There may well be other processor-specific instruction sets we want to handle in the same way. In particular: a general-purpose intrinsics mechanism, something like target-feature, and something to handle the dynamic CPUID stuff.

I think that beyond the core of SIMD (roughly the SIMD.js subset), there are a huge number of instructions which can't really have abstract support; we need a mechanism for these to be available to programmers, similarly to how we expose platform-specific features which are OS-dependent. A better intrinsics story might be a solution here (together with scenarios or target-feature + CPUID stuff).

Some random thoughts:

  • inline assembly or auto-vectorisation is not enough - you need to be able to mark data as SIMD data or you inevitably lose performance marshalling/un-marshalling. This can be seen today - LLVM does some auto-vectorisation, but if you manually use the simd crate you can get significant perf improvements.
  • I think the core SIMD stuff (as in Huon’s SIMD crate and SIMD.js) should be on the path to inclusion in std - it needs work, but it is fairly general purpose and should be in the nursery and have a roadmap to at least official ownership, if not quite a physical place in std. The non-core stuff is too platform-specific to be anywhere near std, IMO.
  • Given that SIMD is defined by the processors, I think it is fine to have ‘Rust intrinsics’ for them and to assume that any backend will support them (either as intrinsics or assembly or whatever). In fact what Rust offers might end up being a superset of what llvm offers.
  • I think that SIMD ‘intrinsics’ don’t really have to be intrinsics at all - they could be just extern functions with a single asm instruction.

#23

I think it’s pretty early to talk about making the SIMD crate official. The lack of value generics is a big barrier to having any sort of interface over the intrinsics (see, e.g., the SIMD crate’s lack of support for shuffles), and so I think value generics have to happen before we can get any experience on what official SIMD support should look like.

But I completely agree about stabilizing intrinsics. They will be an important part of platform-specific programming no matter what nicer facade eventually goes on top, so I think we should just go ahead and stabilize them.

The more interesting/difficult part (IMO) is the runtime detection. @jethrogb mentioned gcc’s function multi-versioning, but I think there’s at least one more wrinkle: if I have a Foo<T> type then I might want to say “every method on Foo<u8x16> should be compiled using SSE4.1, and every method on Foo<u8x32> should be compiled using AVX2.”


#24

@comex One difference between intrinsics and inline assembly is that while the optimizer understands intrinsics and can optimize them, it currently does not optimize inline assembly (it just emits what is provided). Inline assembly is, as a consequence, a worse solution than using the backend intrinsics.

@burntsushi Would it be possible to allow mixing crates that require nightly internally (but not on their API or some other constraints) with crates that require stable? That could be an alternative.

@stoklund I still think we need to provide a way to call into the intrinsics directly, not just an abstraction.

I think that whether std has something like std::arch::x86/x86_64/AArch64/... with intrinsics available in the different architectures is orthogonal to a std::simd module that provides a cross-platform abstraction over simd intrinsics. Both things are worth pursuing.

Note that SIMD intrinsics are not the only kind available. Other examples are the bit-manipulation instruction sets (BMI1, BMI2, ABM, and TBM) and the AES encryption intrinsics, which are also architecture-dependent, and also orthogonal to a cross-platform library for bit manipulation or for doing encryption. All of these are worth pursuing.

Still, there has been too little experimentation in Rust with building cross-platform libraries that expose subsets of platform intrinsics. I think that this could be improved if these libraries could be written and used in stable Rust. The features that would enable this are mainly:

  • exposing architecture dependent intrinsics in rust stable, probably through std::arch::_ or similar,
  • a stable way to branch on the architecture (scenarios, target_feature, …),
  • #[repr(simd)] or similar.

Once we are there we might not need to stabilize the simd crate anymore, and if somebody wants to implement a different way of doing simd, they will be able to write their own crate for that which uses intrinsics directly.

To support run-time architecture-dependent code we would need a lot of compiler support. The binary has to include code for a set of architectures, detect the architecture on initialization, and be able to modify itself at run-time. AFAIK OpenMP’s pragma simd does this, but it is not trivial.


#25

I wanted to jot down some thoughts as well that we discussed in libs triage yesterday, but this is also primarily my own personal opinion about how we should proceed here.

  • We should strive to not implement or stabilize a “high level” or “type safe” wrapper around simd in the standard library. That is, there will be no ergonomic interface to simd in the standard library. My thinking is that we may not quite have all the language features necessary to even do this, and it’s also not clear what this should look like. This of course does not mean that this sort of wrapper should not exist (e.g. the simd crate); I’m just thinking it should be explicitly built on crates.io rather than in the standard library. In that sense, I’d personally like to table all discussion (just for now) about a high-level, type-safe, generic, etc., wrapper around simd.

    Note that this sentiment is echoing @emoon’s thoughts which I’ve also heard from many others who desire SIMD, which is that intrinsics are crucial and a type-safe interface isn’t. Also note that the simd subset @stoklund’s working on would be a great candidate for basing the simd crate off (I believe).

  • We should not stabilize intrinsics as literally intrinsics. Consider, for example, the transmute function. It is defined as an intrinsic, which is not stable, but it is re-exported as the stable mem::transmute. This is essentially my preferred strategy with simd intrinsics as well. That is, I think we should table all discussion of how these intrinsics and functions are defined, and rather focus on where they’re defined, the APIs, the semantics, etc. Note that this would also table discussion about whether these are lang items or not, as it wouldn’t be part of the stable API surface area.

  • Given that information, I’d like to propose that we follow these guidelines for stabilizing access to SIMD:

    • A new module of the standard library is created, std::simd
    • All new intrinsics are placed directly inside this simd module (or perhaps inside an inner intrinsics module).
    • Method names match exactly what’s found in official architecture documentation. That is, these aren’t tied to LLVM at all but rather the well-known and accepted documentation for these intrinsics. Note that this helps porting code, reading official docs and translating to Rust, searching Rust docs, adding new intrinsics, etc. Recall that our goal is not to have an ergonomic interface to simd in the standard library yet, so that’s not necessarily a requirement here.

To me, this is at least a pretty plausible route to stabilization of the APIs involved with SIMD. Unfortunately, SIMD has lots of weird restrictions as well. For example @huon’s shuffle intrinsics, a critical operation with simd, have the requirement that the indices passed are a constant array-of-constants. I suspect that there are other APIs like this in SIMD which have requirements we can’t quite express in the type system per se.

@eddyb assures me, however, that we can at least encode these restrictions in the compiler. That is, we could make it an error to call a shuffle intrinsic without a constant-array-of-constants, and this error could also be generated pre-monomorphization to ensure that it doesn’t turn up as some weird downstream bug.

@burntsushi and I will be looking into the intrinsic space here to ensure we have a comprehensive view of what all the intrinsics look like. Basically we want to classify them into categories to ensure that we have a semi-reasonable route to expressing them all in Rust.


Ok, so with that in mind, we’re in theory in a position to add all intrinsics to the standard library, give a stable interface, and remain flexible to change implementation details in the future as we see fit. I’d imagine that this set of intrinsics can cover anything that standard simd features enable. That is, @burntsushi’s other desired intrinsics, which come with SSE 4.2, @nrc’s thoughts about more than just SIMD, and @gnzlbg’s desire for perhaps other intrinsics (e.g. ABM/TBM/AES) could all be defined roughly in this manner. Perhaps not all literally in std::simd itself, but we could consider a std::arch module as well (maybe).

The last major problem which needs to be solved for SIMD support (I believe) is the “dynamic dispatch” problem. In other words, compiling multiple versions of a function and then dispatching to the correct one at runtime. @jethrogb pointed out that gcc has support for this with multi versioning which is indeed pretty slick!

I think this problem boils down to two separate problems:

  1. How do I compile one function separately from the rest (with different codegen options)?
  2. How do I dispatch at runtime to the right function?

Support for problem (1) was added to LLVM recently (i.e. support at the lowest layer), and I would propose punting on problem (2). The dispatch problem can be solved with a feature called ifunc on ELF targets (read: Linux), which is quite efficient but not portable. In that sense I think we should hold off on solving this just yet and perhaps leave that to be discussed another day.

So for the problem of compiling multiple functions, you could imagine something like:

#[target_feature = "avx"]
fn foo() {
    // the avx feature is enabled for this and only this function
}

#[target_feature = "ssse3"]
#[target_feature = "avx2"]
fn bar() {
    // avx2/ssse3 are both defined for this function
}

fn dispatch() {
    if cpuid().supports_avx2() {
        bar();
    } else {
        foo();
    }
}

That is, we could add an attribute to just enable features, like __attribute__((target("avx2"))) in C. My main question and hesitation around this feature is what it implies on the LLVM side of things. That is, what happens if we call an avx2 intrinsic in a function which hasn’t enabled the avx2 feature set? If LLVM silently works then it’s perhaps flexible enough that we’re basically done at that point, but if LLVM hits a random codegen error then we’ll have to enforce stricter guarantees about what functions you can call and where you can call them. Put another way, our SIMD support should literally never show you an LLVM error, in any possible usage of SIMD.

I believe @aturon is going to coordinate with compiler folks to ensure that the LLVM and codegen side of things is ready for this problem.


Ok, that was a bit longer than anticipated, but I’m curious to hear others’ thoughts about this!


#26

Just a quick +1 to the idea of providing a stable interface to CPU-specific instructions in any way, shape, or form. The other day I wanted to use the PEXT instruction from BMI2 (so: not SIMD), which is available on Haswell processors or later, and found the fact that I needed to use a nightly build of Rust to do it without calling out to C to be rather silly, considering that the stability of these instructions has everything to do with Intel and nothing to do with Rust. People who are willing to accept the portability implications of using CPU-specific instructions should be able to use them.

(Everything @alexcrichton wrote also sounds very reasonable to me, although I am not an expert in the area.)


#27

Sounds reasonable.

I think that the easiest thing to do would be to “dump” all intrinsics into std::intrinsics (or arch or arch_intrinsics), have each guarded with a target_feature that enables it, and completely ignore the concept of different architectures. If the feature is enabled, the intrinsic can be used, period.

My suggestion of classifying intrinsics into an std::arch module with x86, ARM, … submodules is a bad idea. It doesn’t add any value, it puts us in the business of classifying architectures, and it introduces other problems like what to do when an intrinsic is supported on multiple architectures, or what are we going to do if ARM adds SIMD support next year?

My main question and hesitation around this feature is what it implies on the LLVM side of things. That is, what happens if we call an avx2 intrinsic in a function which hasn’t enabled the avx2 feature set?

Functions don’t have to enable feature sets to use intrinsics. Whoever compiles the binary needs to enable the feature sets it is going to use. If you try to use an intrinsic and the feature set for that intrinsic is not enabled you should get a compiler error. If the feature set is enabled LLVM will generate code for it, and if you try to run the binary on an architecture without the intrinsic you will get a run-time error.

EDIT: in particular for your example, currently one just uses a macro at compile time to match on the enabled architectures, and select an appropriate intrinsic. This works fine.

Switching at run-time is way harder. You would need to enable all features at compile time (to be able to use them and generate code for them). That is, you enable AVX2 and SSE3, and branch at run-time, and use one or the other. The problem is that if you enable AVX2 in LLVM it will generate code for it (e.g. through auto-vectorization), so even if you can branch at run-time explicitly, you will still trap due to optimizations happening somewhere else.

You can, for every function that uses a target dependent intrinsic, generate a set of functions at compile-time, and on initialization of the binary, detect the architecture, and “monomorphize the binary for the architecture”.


#28

@gnzlbg ah yes so one clarifying point is that I’m assuming that the organization is all gated with scenarios to ensure platform-compatibility is still reasonable for downstream crates. In that sense I agree we’d dump them in the same module and avoid premature organization.

You brought up an important point, though, about target_feature. There are a number of intrinsics that are only available with AVX2, for example. My thinking, however, is that you can access these intrinsics no matter what. That is, so long as you’re compiling for a CPU architecture that could ever have an intrinsic, you have access to it. This solves the case where in general your code can’t use AVX but for small sections it can. I feel like it’d be too weird of a language feature to say the std::simd module had different contents based on where it was viewed from.

Note, though, that this hinges on the fact that LLVM can actually work with this (on a technical level). That is, if it actually causes a codegen error to call an avx2 function from a function where that’s not enabled then we have a whole new can of worms to deal with.

It’s true, yeah, that LLVM will auto-vectorize based on enabled features, so we can’t blanket-enable them. I also don’t think we can double/triple/etc. the size of binaries just by compiling one copy with AVX and another without.


#29

I think I don’t fully understand what you mean.

Platforms, as in Windows, Linux, macOS, are orthogonal to CPU/architecture features (SSE, AVX, BMI, …). Did you mean that the intrinsics will be organized using scenarios for the architectures/CPU features? Or could you elaborate on what this has to do with platforms?

If I cannot always use an intrinsic how is that enforced at compile time? (Using something like the current target_feature?).

Do you mean something like a binary compiled for SSE2, but where some functions are compiled for AVX, and at run-time the program checks what the target supports and uses one or the other?

This works fine in clang (OpenMP does this), and also works fine in rustc (e.g. by using inline assembly). The only thing one cannot do is insert e.g. ARM instructions in an x86 binary.


I think there are 2 main ways in which intrinsics for multiple features are combined:

  1. Generate code for different targets depending on what features the target architecture supports.
    • Requires detecting the features of the target at compile-time.
    • Requires branching/matching on the available features of the target at compile-time.
    • In those branches, code for other targets is not included in the binary.
  2. Generate code for different targets independently of what features the binary target supports.
    • Requires setting the target of a function/block (necessary so that LLVM doesn’t use AVX in an SSE2 function just because the final binary is compiled with AVX).
    • Requires incompatible features (one cannot embed an ARM Neon instruction in an x86 binary).
    • Run-time feature detection would make this more useful (e.g. it allows generating some function for different targets, and then at run-time setting a function pointer to the best implementation for the target).

In both cases it would be really helpful if the compiler would tell me that I cannot use some AVX function when writing code for SSE4 (e.g. because I am writing an SSE4 function, or am inside a compile-time branch that is only taken for SSE4), or that if I use an ARM Neon instruction in some place I cannot target x86 anymore.


#30

There are levels of official-ness. It is clearly too early to move it to std, but it could be moved to rust-unofficial or rust-nursery. While it may be far from an eventual solution, I think there is a benefit to getting experience with that crate as is, and at the moment it is somewhat languishing.

Sure, I think exactly where a high-level lib lives doesn’t matter so much. I think it is important that the libs team support such a library to some extent. I expect it would be the primary way for most users to use SIMD, so while it might not be the focus for now, it is important to have a good story there.

Isn’t the existing simd crate already based on that subset (simd.js)?

Do expect to have intrinsics for other instructions which we would want to expose in this way? If so is it worth having some hierarchy, e.g., std::instructions::simd or something?

ah, yes, I would like that and std::arch::simd


#31

Considering intrinsics are tightly bound to architectures, and that operations that sound similar may operate in subtly different ways, I think this is a terrible idea. Intrinsics are intrinsically tied to a target architecture. If “ARM adds SIMD support next year” (which it already did, many years ago) we will add the relevant intrinsics to the language.

Yes, there should be a higher-level SIMD abstraction library, but since it is not at all clear what the API of such an abstraction should look like (as discussed above) this won’t live in std any time soon. What is needed from the stable Rust compiler is a way to access all architecture-specific functionality such that those higher-level abstractions can be built in stable Rust.

  1. What will be the type of parameters passed to these intrinsics?
  2. Remember when talking about a simd module that we might aim to support more than SIMD instructions.
  3. “exactly” including unnecessary leading underscores?

#32

FWIW we already do this, we’d just need to know it in the compiler, which we have to anyway, if the compiler codegens the intrinsics. If you do what Clang does and effectively polyfill a bunch of intrinsics (in terms of more general operations), you could also have a way to force arguments to be constant (the equivalent of a feature I’ve mentioned in the past, fn foo<C: const u8>(imm8: C), making the function const-generic over its value argument, although you’d use an attribute for now).


#33

The exact ones from their architecture documentation, although I’m not sure we’ll need the __m128i vs __m128d distinction, we can probably get away fine with one type per bit width - possibly with aliases?


#34

If I understand correctly, at least SSE actually has (on some micro-architectures) two internal registers per XMM register, for integer and floating-point operations. Executing instructions that implicitly move data from the floating-point pipeline to the integer pipeline (and vice versa) is slower than if you’d kept everything in the same pipeline. This is the distinction that’s being made in the i vs. d types, which we might keep, requiring explicit casting to denote the domain crossing.


#35

Please make it work with core from the start, if at all practical. I also suggest that it might be better to give it access to unstable features like core and std have, but still keep it outside the core/std namespace so that it doesn’t get fossilized too early.

For ring it would be very useful to have architecture-specific CPUID intrinsics that allow full access to the underlying CPUID instruction feature set. Such a feature is useful outside of SIMD so it would be great if it could be agreed-to and implemented independently from the SIMD stuff.

I would rather have intrinsics for architecture-specific instruction sets rather than any kind of abstraction or high-level interface, even if they are only temporary to allow independent parallel development of a permanent, perhaps high-level, SIMD interface.


#36

Even if there were two internal registers, this is totally hidden from the user. At the assembly level the instructions work on the same registers regardless of whether it’s an integer or floating-point instruction. The only reason I see for the cast is that it provides some slight type safety, which I really don’t think matters much at this level.

To cast from integer to float you use (in C/C++) _mm_castsi128_ps, which returns an __m128 from an __m128i input. Looking at clang’s implementation, it looks like this

static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_castsi128_ps(__m128i in)
{
    return (__m128)in;
}

http://clang.llvm.org/doxygen/emmintrin_8h-source.html

So this just casts without doing anything special, and I’m pretty sure it’s the same for all compilers. If there is a cost when switching from float to integer, it’s more likely that you will get a stall if you use the result before it’s fully completed, but that is up to the programmer to deal with anyway, imo.


#37

I’d say Rust is designed (and well-equipped) to make such things apparent to the programmer.


#38

Btw: Intel Intrinsics Guide. I hope this is exhaustive. But there are also AMD-specific intrinsics for e.g. FMA4.


#39

The programmer has no idea of the actual cost just because you need to do a cast. This is just the same as casting from float to int in regular code. The programmer will still need to read the hardware docs for the actual cost (which is more complicated these days due to out-of-order execution on many CPUs).


#40

Using different types can lead the programmer to the documentation pages for those types, where all of this could be clearly explained. If they’re all one type, there is no signalling to the programmer at all.