Getting explicit SIMD on stable Rust

@burntsushi I think this discussion is touching too many orthogonal concepts at once, and it is hard to keep track of all of them when they are interleaved. At the same time, it feels like we all agree on 80% of @stoklund's RFC, and we are just arguing about some parts of it, and sometimes only about details (or just bikeshedding).

Would it be possible to have a GitHub repo with @stoklund's “pre-RFC” as a starting point, so that we can file issues there and/or send PRs? This would allow us to discuss the different aspects of the pre-RFC independently, which I think would add a lot of clarity to the discussion. It would let us send PRs with particular improvements/additions/subtractions to the RFC, discuss these independently, and collaboratively grow an RFC-candidate text (with examples and docs and …) that has at least 90% consensus among anybody interested.

There is precedent for doing things this way (for example, the rustfmt RFCs). I don’t know if it is worth going that far (and having a special community team, FCPs for the issues, and so on), but we are approaching 300 comments here, and without a concrete text capturing the status quo and what we are discussing, it is extremely time-consuming for new people to chime in.

(EDIT: or maybe we should move the discussion to @burntsushi's stdsimd repo, file issues there, and work towards a pre-RFC together. I don’t know, I am open to alternatives.)

Is there a fundamental reason why we shouldn't omit 512-bit? "There are a few systems that make use of it" doesn't seem like a compelling enough reason to do it right now as opposed to later. By comparison, every single x86_64 system has SSE2 operations that are completely unavailable on stable Rust. Let's start there.

The key motivating reason for providing some niceties is to make it possible for crates like simd to exist on stable Rust without doing a monumental amount of work.

If we do decide to punt on the niceties, it shouldn't be because they are nice. It should be because we've explicitly chosen to make it unreasonable for a cross platform simd crate to exist on stable Rust. (I would personally be OK if we landed there, but my feeling is that there's overwhelming support for being a touch more ambitious with very low-risk niceties to foster more experimentation on crates.io.)

I think this discussion is long because 1) I had a lot of misconceptions at the start and needed to be educated and 2) a lot of folks kept insisting on doing more stuff initially instead of choosing to start as small as reasonably possible.

How about I write a pre-RFC first and then we can decide whether this is worth doing? I will start a new thread for that.

The status quo in my mind is roughly:

  • ~@stoklund's proposal
  • Stabilize #[target_feature = "..."] to expose function-specific target optimizations. Define all vendor intrinsics with them. (A minimal sketch follows this list.)
  • At monomorphization time, if a function calls another function that has a SIMD type in its type signature, then set the appropriate target_feature attributes for that function. See this post for context.
  • Specify a module layout of vendor intrinsics, presumably using scenarios.
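For concreteness, here is a minimal sketch of what the #[target_feature] bullet could look like, using the attribute form and vendor types that eventually landed in std::arch (this thread's proposal would use lane-oriented signatures like i16x8 instead; treat the details as assumptions, not the final design):

```rust
// Sketch only: a vendor intrinsic wrapped in a function that carries a
// per-function target feature, callable only when SSE2 is available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn add_16bit_lanes(
    a: std::arch::x86_64::__m128i,
    b: std::arch::x86_64::__m128i,
) -> std::arch::x86_64::__m128i {
    // A single intrinsic call; compiles down to one PADDW instruction.
    std::arch::x86_64::_mm_add_epi16(a, b)
}
```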

The only reason why the pre-RFC doesn't exist right now is because I haven't written it yet. Give me some time. :slight_smile: There will have to be significant portions devoted to motivating this design, explaining SIMD and why we've somewhat abandoned our previous SIMD path to stabilization.


Here is an idea: to further reduce the scope, table everything related to target_feature; we can do that later. As I understand it, target_feature is unnecessary for the SSE2-only-on-x86_64 case. Also, unlike having a cross-platform SIMD type, it doesn’t seem to raise forward-compatibility concerns. (That’s my understanding; if it does, we certainly should consider them.)

My understanding is that this does not satisfy @burntsushi’s requirements, but it does satisfy @hsivonen’s.

I'll note that both Clang's and GCC's headers for SSE2 do set the sse2 target-feature option. For Clang: clang/lib/Headers/emmintrin.h at master · llvm-mirror/clang · GitHub

Only doing SSE2 doesn't really help with this. We still need to write the type signatures for each of the integer-specific SSE2 vendor intrinsics. We can either use exactly what Intel provides---which I think everyone agrees would be sad---or we can use lane-oriented types like i16x8. If we do that, then it's a very small jump to just defining all lane-oriented types for all platforms.
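To make the contrast concrete, a sketch with stand-in types (these definitions are hypothetical; real ones would use rustc's unstable #[repr(simd)] mechanism):

```rust
#[derive(Clone, Copy)]
#[allow(non_camel_case_types)]
struct m128i(u128); // Intel-style: one opaque 128-bit bag of bits

#[derive(Clone, Copy)]
#[allow(non_camel_case_types)]
struct i16x8([i16; 8]); // lane-oriented: lane width and count in the type

// Intel's signature, transliterated: the type system cannot catch
// mixed-up lane widths, because everything is m128i.
unsafe fn add_epi16_untyped(a: m128i, _b: m128i) -> m128i {
    a // body elided in this sketch
}

// Lane-oriented signature: mixing lane widths is a type error.
unsafe fn add_epi16(a: i16x8, _b: i16x8) -> i16x8 {
    a // body elided in this sketch
}
```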

The problem with punting on #[target_feature] is that we also end up punting on some of the most critical issues with how SIMD support will work going forward.

Nevertheless, it is an interesting thought. If we gated everything behind cfg(target_arch = "x86_64"), then I think everything would work.

Is there a fundamental reason why we shouldn't omit 512-bit?

To me the fundamental reason is that AVX-512 is worse than AVX/AVX2 with respect to the "explosion" of partially untyped vector types/intrinsics. So things that might be "barely" worth it design-wise for dealing with AVX2 might become clearly worth it for dealing with AVX-512 (rustc already has a DSL for generating intrinsics for AVX2...). An RFC for SIMD would need to convince me that the proposed architecture scales to AVX-512 without issues. AVX-512 also exposes more shuffle intrinsics, which might make some generic shuffle functionality worth it (just like for LLVM), and it might also make some generic prefetching functionality worth it (again, just like for LLVM). Things like imprecise division might mean that Div is less straightforward for AVX-512.

Or it might not. It might well be that the proposed approach is fine and can deal with AVX-512 without trouble. But I don't know, and any RFC that wants to convince me would at least need to show how these would be done for AVX-512: FMA, shuffles, prefetching, imprecise division, ...

IMO the best we can do, given all of the above, is to at least attempt to support AVX-512 from the very beginning. We don't need to include that in the RFC, but we should definitely show, with an implementation, that it will be possible to do so in a future RFC without major hassles. Whether we gain anything by then omitting AVX-512 from the initial RFC, I don't know, but I don't think so.

The key motivating reason for providing some niceties is to make it possible for crates like simd to exist on stable Rust without doing a monumental amount of work.

What monumental amount of work exactly? I guess they would need to wrap all vector types in newtypes to be able to implement Add for them, but I also guess that these crates will want to do that anyway to support multiple architectures, software fallbacks, ...
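A minimal sketch of that newtype pattern (F32x4 here is a hypothetical stand-in backed by a plain array rather than a real SIMD register type):

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl Add for F32x4 {
    type Output = F32x4;
    fn add(self, rhs: F32x4) -> F32x4 {
        // A real crate would forward to a vector intrinsic here, or to
        // a software fallback, depending on the target architecture.
        let mut out = [0.0; 4];
        for i in 0..4 {
            out[i] = self.0[i] + rhs.0[i];
        }
        F32x4(out)
    }
}
```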

We have very little experience building higher-level SIMD APIs. I would rather focus on a truly minimal lower-level SIMD API that lets us use all the hardware first (at least all that LLVM/Clang supports), and then, as higher-level APIs are implemented on top of that and we gain more experience, decide whether there are niceties worth adding to std.

I think this discussion is long because 1) I had a lot of misconceptions at the start and needed to be educated and 2) a lot of folks kept insisting on doing more stuff initially instead of choosing to start as small as reasonably possible.

How about I write a pre-RFC first and then we can decide whether this is worth doing? I will start a new thread for that.

The only reason why the pre-RFC doesn't exist right now is because I haven't written it yet. Give me some time. :slight_smile: There will have to be significant portions devoted to motivating this design, explaining SIMD and why we've somewhat abandoned our previous SIMD path to stabilization.

I don't know. I think it would be enough for you to let us know when you have an implementation with the "fundamental architecture" and a bunch of intrinsics, by no means complete (not even full SSE2 support). For the RFC, a "totally-incomplete pre-RFC" without design/motivation/SIMD explanation/... but with the "fundamental architecture/pieces" of the proposal sketched would also be enough.

We can help implement the rest of the intrinsics (at least the ones that Clang supports), and as they get implemented new issues will arise, which we can then file and discuss over there. Once the implementation is "feature complete", writing an RFC for it will be easier.

You can, of course, do all the work yourself and show the pre-RFC and the implementation when you are finished. But then I think it is kind of pointless to keep discussing much here until that is the case. And I also think that might be risky, since a lot of people here from different backgrounds could offer significant feedback during implementation/prototyping that could save you a lot of work.

But we aren't looking to expose any generic shuffle or prefetching operations, so I don't understand why we need to solve this now.

Please see: Getting explicit SIMD on stable Rust - #210 by stoklund

I remain unconvinced that "all" is a necessary initial criterion.

I am convinced that demanding "all" initially is a very good way to defeat an effort to get something stabilized.

Please, do not let perfect be the enemy of good.

That does not sound prudent. We shouldn't put too much work into an implementation that people haven't agreed to stabilize yet.

Well... right... This thread had died down because I thought we had reached a somewhat reasonable consensus. I think I had said a few times already that I'd like to organize this thread into a pre-RFC. If there's widespread disagreement over the details of that pre-RFC, then perhaps we can go your route. But I don't see any reason to do that right now.

That's literally what this thread was for.

Perusing the AVX-512 headers in Clang, there doesn't appear to be anything starkly different when compared with the SSE/AVX2 headers. The intrinsics exposed use types like __m512i, just like the SSE/AVX2 headers do. The main differences as far as I can see are pure volume and widespread use of masks, but the masks are just normal u32s or u64s.

Probably the most significant problem, actually, is needing to go through and change all uses of __m512i to the appropriate lane oriented type. But that's human labor that needs to only be done once. Is there something else that concerns you?

I remain unconvinced that "all" is a necessary initial criterion.

I would like the RFC to convince me that all intrinsics supported by LLVM/Clang are implementable with the proposed system/architecture/mechanism without having to write a SIMD2.0 RFC that deprecates the first one. To me that means that it should at least explore how representative intrinsics of instruction sets not being proposed in the first version would be implemented. At that point it might be better to just implement those in the first version of the RFC (I don't know).

Please, do not let perfect be the enemy of good.

The only intrinsics I care about are AVX/AVX2, AVX-512, and BMI2. So I am actually genuinely interested in answering the following questions: "Will I need to drop down to C to use AVX/AVX2/AVX-512 after this RFC? If not, good. If yes, will that be the case forever because this RFC prevents an approach that solves it from being implementable?"

Probably the most significant problem, actually, is needing to go through and change all uses of __m512i to the appropriate lane oriented type. But that's human labor that needs to only be done once. Is there something else that concerns you?

I think that doing the labor manually is a perfectly valid solution (if you don't have time for it, I can help!). If you say that adding support for AVX-512 is tedious but does not require solving any new issues, that's enough to calm my concerns a bit; they will be fully calmed when I look at an implementation and try to implement one or two myself. @stoklund mentioned above that it might be possible to do something better for the masks using boolean vectors (maybe he can expand on what he meant). I'd rather have the same API as C (with u32/u64 for the masks), but I am open to anything better (although the boolean-vector approach sounds to me like it belongs in a high-level API).

But it is also incomplete

If we don't have direct platform bindings, why are we trying to abstract these into generics?!? You have to walk before you can run.

Prefetching doesn't work on x86_64 currently [1][2][3][4][5]. It remains to be seen if AVX-512 will improve this situation. Again, why are we abstracting away what we don't support?

[1] The problem with prefetch [LWN.net] [2] Software prefetching considered harmful [LWN.net] [3] Prefetching considered harmful [LWN.net] [4] Re: Software prefetching considered harmful [LWN.net] [5] kernel facilities for cache prefetching [LWN.net]

AVX-512 nearly triples the number of platform intrinsics. Full support of it is kind of a poison pill for the RFC to move forward.

110% this.

I attempted to merge in full intrinsic support and was blocked by the fact that nobody is sure how they want to handle the current SIMD vector ducktyping.

Again just voicing support.

AVX-512 is currently only supported on the Xeon Phi, and only extensions F, CDI, ERI, and PFI. Skylake-E/EP and Knights Lake will (most likely) be announced this August at Gamescom. That will start the mainstream roll-out of AVX-512. I mean, AVX/AVX2 is still fairly rare.

Everything else is >18 months away. AVX-512 doesn't exist yet. It is okay to not support it initially.

Not deprecate but extend.

I feel the best solution is:

RFC 1:

  • Figure out internal rustc SIMD vector ducktyping.
  • Agree on how to round out AVX-512, AArch64, MIPS, and PPC extensions as a continuous process going forward.
  • Agree on how to handle codegen surrounding SIMD versioning on self-similar target platforms.
  • Expose SIMD as raw unsafe compiler intrinsics on stable.

RFC 2:

  • Start building generic, non-platform-specific wrappers.
  • High-level things that do platform-specific codegen.

LLVM (and Clang) already expose builtin vector-shuffling functionality (__builtin_shufflevector), so we could just expose something like this (and... run before we walk). This might not be worth it for SSE4 or AVX, but as the number of vector types explodes (going to AVX-512) we might want to consider it. I don't think we should, but I do think that we should prototype against AVX-512, because it makes everything more tedious, which might encourage us to at least look at and weigh alternative solutions that we might not have considered before. Nothing more, nothing less.

(I am personally against generic vector functionality. For a low level API, I only want intrinsics that map directly to the hardware. But I also want the design space to be fully explored, and I am open to be convinced of a better solution if there is one).


Prefetching doesn't work on x86_64 currently

It doesn't work for them, but it works for us (~15-20% speed-ups) and for others (with similar speed-ups). Why? I only read the first of the 5 links you posted, but they are using prefetching on the next element of a linked list. To get a speed-up from this, the amount of work per element of the list needs to be huge (~1000 cycles), and the next element of the list needs to not be in cache. If the elements of the list are close enough in memory, or you don't have enough work to do, the prefetching instructions are going to either have no impact or result in a slow-down.

Again, why are we abstracting away what we don't support?

Clang (and LLVM), GCC, Intel, Cray, IBM, PGI, nvcc, and MSVC all support prefetching. So what do you mean by abstracting away what we don't support? The prefetching instructions are there. I want to be able to emit them if I need to.
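A sketch of the kind of call being asked for, using the shape the _mm_prefetch binding eventually took in std::arch (treat the exact signature, including the const-generic hint parameter, as an assumption; it has changed form over time):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn prefetch_for_read(p: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // Emits a PREFETCHT0: a hint to pull the cache line containing `p`
    // into all cache levels; it is only a hint and never faults.
    _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
}
```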

But they aren't. The SSE prefetching instructions aren't in the Rust compiler currently. Yes, they're in [insert mainstream compiler here], but I digress.

My point is: Rust's current intrinsic support is horrible.

Focusing on getting the SIMD subsystems into a state where all available LLVM bindings can be brought in is what I thought this discussion was about. I attempted to write a patch to fix this, and ultimately I had to close it because nobody is sure how the Rust compiler should represent SIMD registers internally. I was told this discussion should solve that.

Figuring out the high-level cross-platform type-safe bikeshed wrapper isn't running before we walk.

We don't even have legs.

But they aren't.

They aren't exposed in rustc yet because, when the SIMD intrinsics were implemented, Clang didn't have AVX-512 support. The intrinsics are, however, there in the AVX-512 instruction set, and they are there in the Clang AVX-512 headers (and obviously in LLVM IR as __builtin_ia32_xxxx).

Figuring out the high-level cross-platform type-safe bikeshed wrapper isn't running before we walk.

Who is talking about that? I've only talked about low-level intrinsics, and about taking into account what tools LLVM gives us to avoid repetitive work when implementing those. We might not need those tools going up to AVX, but we might need them going to AVX2/AVX-512 because of the explosion in the number of vector types, so we should at least evaluate whether we want to expose these LLVM intrinsics as well. There are 3 distinct things, from lower to higher level:

  • Lowest level: non-generic LLVM intrinsics (each maps to 1 asm instruction) and "generic" LLVM intrinsics, both used to implement some of the low-level intrinsics in the Clang headers. Why do the "generic" LLVM intrinsics exist? Because sometimes the optimizer needs more semantic information to emit better code, and other times the gazillion vector types in Intel AVX/AVX2/AVX-512 drove somebody insane enough that they implemented a "generic" shuffle operation in LLVM and built the low-level intrinsics on top of it... This is probably why rustc has a domain-specific language for generating intrinsics when too many vector types are involved (somebody did not want to add and maintain all of that by hand).

  • Low level: direct maps to the hardware intrinsics; these should generate the assembly instructions you want, basically what the Clang headers offer. Most are implemented by calling LLVM __builtins that map directly to assembly instructions, but in a way that is not opaque to the optimizer (i.e. not using inline assembly directly). Some are implemented by calling "generic" LLVM builtins, which are SIMD algorithms for some operations; for example, this header calls the "generic" shufflevector intrinsic all over the place.

  • High-level SIMD library: a type-safe portable wrapper for the intrinsics, with SIMD algorithms, iteration over memory, and whatnot.

My stance on these things is that we should offer the low-level intrinsics that map directly to the hardware in a first RFC.

But... if when doing so it turns out that it would be useful to expose some of the lower-level LLVM intrinsics (like the generic ones), or that exposing those gives us a nicer API for free and means less stuff to expose in total, well, I'd probably still want to have the 1:1 intrinsics anyway but would be open to changing my mind. The issue is that some of these generic intrinsics are only worth it if we go up to AVX2/AVX-512, because the explosion in vector types is what makes them so much nicer than the real ones. A high-level library like the simd crate might, for this reason, prefer to build on top of the LLVM intrinsics rather than the 1:1 maps.

I don't know. As @burntsushi mentions, AVX-512 should only be more tedious to implement. But LLVM has tools to make it less tedious. My point is that we should explore those tools and not ignore them; they are there for multiple reasons, and Clang uses them.

Internal details of how this is implemented probably don't need to be part of the RFC.

As it stands now, rustc has most of what we need to provide access to low level vendor intrinsics: GitHub - rust-lang/stdarch: Rust's standard library vendor-specific APIs and run-time feature detection

The idea here is to build consensus on where we want to go before actually adding this to std.

I think this is where everyone else stands too. The only addition is a very small cross platform API built on top of existing LLVM generic intrinsics.

What tools does LLVM expose? If the only problem is indeed tedium, then it seems like a cost we should note, but not one that is insurmountable.

Also, I don't understand why you keep talking about an explosion in vector types. SSE adds 10, AVX adds 10, and doesn't AVX-512 add another 10? AVX-512 also has mask types, but there's no combinatoric explosion as far as I can see.

What tools does LLVM expose?

For example, llvm.shufflevector (aka __builtin_shufflevector in Clang), which is used to implement the different AVX intrinsics for shuffling vectors.
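For reference, rustc's unstable "platform intrinsics" interface already lowered to LLVM's shufflevector around the time of this discussion. A nightly-only sketch, written against that era's interface (it has since been reworked):

```rust
#![feature(platform_intrinsics, repr_simd)]

#[repr(simd)]
#[derive(Clone, Copy, Debug)]
#[allow(non_camel_case_types)]
struct f32x4(f32, f32, f32, f32);

extern "platform-intrinsic" {
    // Lowers to LLVM's shufflevector; the indices select lanes from
    // the concatenation of `x` and `y` and must be constants.
    fn simd_shuffle4<T, U>(x: T, y: T, idx: [u32; 4]) -> U;
}

fn main() {
    let a = f32x4(1.0, 2.0, 3.0, 4.0);
    let b = f32x4(5.0, 6.0, 7.0, 8.0);
    // Interleave the low halves of `a` and `b`, like _mm_unpacklo_ps.
    let c: f32x4 = unsafe { simd_shuffle4(a, b, [0, 4, 1, 5]) };
    println!("{:?}", c); // f32x4(1.0, 5.0, 2.0, 6.0)
}
```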

If the only problem is indeed tedium, then it seems like a cost we should note, but not one that is insurmountable.

Is tedium the only problem? Maybe? I haven't tried, so I don't know. "We have explored the other architectures and there don't seem to be any significant new problems (see here and there for some incomplete prototype experiments)" sounds way better than "We haven't looked at the other architectures and don't know if there will be significant problems".

Also, I don't understand why you keep talking about an explosion in vector types. SSE adds 10, AVX adds 10, and doesn't AVX-512 add another 10? AVX-512 also has mask types, but there's no combinatoric explosion as far as I can see.

The 10 512-bit vector types are just i64x8, u64x8, f64x8, i32x16, u32x16, f32x16, i16x32, u16x32, i8x64, u8x64. The mask types are kind of irrelevant, since we can model them with u16/u32/u64 (unless we want to do something "more clever" than C).
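To illustrate the "masks are plain integers" point, a scalar sketch of how an AVX-512 masked add behaves (all names here are hypothetical stand-ins; the real intrinsic this mimics is _mm512_mask_add_epi32):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
#[allow(non_camel_case_types)]
struct i32x16([i32; 16]); // stand-in for the real 512-bit vector type

// Bit `lane` of the u16 mask `k` selects, per lane, between the sum
// `a + b` (bit set) and the pass-through value `src` (bit clear).
fn mask_add_epi32(src: i32x16, k: u16, a: i32x16, b: i32x16) -> i32x16 {
    let mut out = src;
    for lane in 0..16 {
        if (k >> lane) & 1 == 1 {
            out.0[lane] = a.0[lane].wrapping_add(b.0[lane]);
        }
    }
    out
}
```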

My point is that 10 vector types in SSE is "fine", 20 vector types in AVX starts to seem like a lot, and 30 vector types in AVX-512 seems like too much. I don't want to predict the future, but it seems reasonable to assume that this number will keep increasing (it would not surprise me if the recently announced Phis for deep learning come with half-precision float SIMD lanes, f16x{2...32}, just like NVIDIA GPUs do).

With type-level integers the "vector-type user interface" could be just simd::pack<T, N={2...64}>, which would automatically scale without having to add new vector types to std every time Intel increases the width of the vector lanes.
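A sketch of that idea using const generics, which did not exist when this was written (all names here are hypothetical):

```rust
// One generic "pack" type covers every lane count, so new hardware
// widths would not require adding new vector types to std.
#[derive(Clone, Copy, Debug)]
struct Pack<T, const N: usize>([T; N]);

#[allow(non_camel_case_types)]
type f32x16 = Pack<f32, 16>; // a 512-bit pack

#[allow(non_camel_case_types)]
type f16x32 = Pack<u16, 32>; // u16 as a stand-in for half-float lane bits
```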

I personally don't think that waiting for type-level integers is worth it, and I think we can add such a "generic" abstraction afterwards, but IIRC the lack of type-level integers was one of the reasons mentioned in the simd crate for the slowdown in its development.

Anyhow, if AVX-512 is just "add these 10 vector types", and it doesn't introduce any further difficulties, I don't see why it should be omitted/ignored in an initial RFC.

Yes, I'm using that in various places. Example: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse2.rs#L286

FWIW, we've been down the "but we could have awesome SIMD types with language features X, Y and Z" road a few times, both in this thread and in other places. It's something that needs to be covered in the RFC, but I think most folks are on the same page that we shouldn't be blocking SIMD on type-level integers.

FYI, I was trying to see how the proposed approach could be extended to RISC-V and ARM’s SVE and ran into this ARM white paper: A sneak peek into SVE and VLA programming. It is a short read (16 pages) if you are interested.

IIUC, from a programming point of view, nothing changes. The main idea is that SVE registers are always multiples of 128 bits (up to 2048 bits wide). The programmer writes code that targets 128-bit registers (e.g. using f32x4), where the elements in the registers can be [8,16,32,64] bits wide. So how do you target wider registers? You cannot do that directly. What you do is, e.g., “loop” over the 128-bit registers. The compiler backend (LLVM for us) then inserts an instruction that tells the CPU how often you loop over the register, and at run-time the CPU decides whether to promote the 128-bit operations to a wider register. For example, if you loop 4 times over an f32x4 doing some vector operation, the CPU decides whether to promote that loop to a single operation on a 512-bit-wide vector lane, or to two operations on two 256-bit-wide vector lanes. (Note: one does not really need to target 128-bit registers; each SVE vector has a “predicate” register that is basically a mask for the vector elements, so that a loop over 7 f32s can still be vectorized at run-time using a single 256-bit-wide lane, without dealing with the “tail”, by masking away the “operations” on the 8th element.)
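A rough illustration of that programming model in plain Rust (scalar stand-in code, just to show the shape of "write 128-bit chunks, let the hardware widen them"):

```rust
// Each iteration processes one f32x4-sized chunk; under SVE/VLA the
// backend and CPU may execute several such iterations as one wider
// vector operation. Any tail shorter than 4 lanes is ignored here.
fn add_in_chunks(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((a4, b4), o4) in a
        .chunks_exact(4)
        .zip(b.chunks_exact(4))
        .zip(out.chunks_exact_mut(4))
    {
        for i in 0..4 {
            o4[i] = a4[i] + b4[i];
        }
    }
}
```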

I haven’t been able to find any info on RISC-V vector instructions, but hopefully they will work similarly.

What does this mean? IIUC, it means that everything being proposed here should work seamlessly “as is” for ARM’s SVE. The code one writes using f32x4 for NEON today should auto-magically be promoted to wider instructions on newer CPUs, if LLVM doesn’t screw up. I guess this is good news.


So what is the current status of this?