Getting explicit SIMD on stable Rust

hanna-kruppe · November 23, 2016, 8:08pm

After reading this post and thinking about it some more, I now lean towards "intrinsics built on general vector support where possible" being the right layering.

However, I'm not sure what (if any) user-visible consequences you think should come from this. Would you like to see #[repr(simd)] and friends prioritized and stabilized before (or at least no later than) intrinsics? Would you cut all "legacy" intrinsics that cover the same ground as generalized SIMD operations (i.e., simple arithmetic, shuffles, and element access)?

stoklund · November 23, 2016, 8:10pm

By basic operations, I mean trivial things like simd_add and accessing individual lanes of a vector. It isn’t tied to LLVM any more than having a u64 type that supports +.

Both Clang and GCC provide this kind of portable SIMD support, and their builtins for the target-specific intrinsics use those types.

Here’s an example of portable SIMD code written without the Intel headers. Yes, what you can do without target-specific intrinsics is very limited, but at least you have a common type system between SSE and NEON.

The types defined by arm_neon.h make a lot more sense, BTW.

stoklund · November 23, 2016, 9:19pm

I think the most important thing is to have a stable way of talking about SIMD types in Rust. These SIMD types should support lane types that are the same as Rust's integer and floating point scalar types. They should provide a way of accessing individual lanes as normal Rust scalar L-values. The #[repr(simd)] types seem to fit this mold, allowing for x.3 = y.2 + 2 syntax.
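
A minimal sketch of the lane-access syntax mentioned above, written against stable Rust. The #[repr(simd)] attribute is unstable and is left out here, so this is an ordinary tuple struct and not a real vector type; `I32x4` and `set_last_lane` are hypothetical names used only for illustration.

```rust
// Hypothetical 4-lane type. A real SIMD type would carry #[repr(simd)],
// which is omitted here so that the sketch compiles on stable Rust.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x4(pub i32, pub i32, pub i32, pub i32);

/// Writes lane 3 of `x` from lane 2 of `y`, using lanes as ordinary
/// scalar L-values, as in the `x.3 = y.2 + 2` example.
pub fn set_last_lane(mut x: I32x4, y: I32x4) -> I32x4 {
    x.3 = y.2 + 2;
    x
}
```

With tuple-struct fields, each lane is a normal place expression, so no dedicated insert/extract intrinsic is needed for this kind of access.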

Even inline assembly would require some type to represent a SIMD vector.

Second, give shared target-independent names to the most basic operations:

Bitcasting between SIMD types of the same size.
A splat operation which replicates a Rust scalar into all lanes.
Bitwise and/or/xor/not.
Integer wrapping_add, wrapping_sub, wrapping_neg, and wrapping_mul.
Floating point arithmetic.

These basic operations don't vary across platforms, so there doesn't seem to be an advantage to defining target-specific intrinsics for them.
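
To illustrate what the semantics of these basic operations could look like, here is a scalar sketch in stable Rust. It models a 128-bit vector as a plain array wrapper, so it says nothing about code generation; the type names `U32x4` and `U8x16` and all method names are assumptions made for this example, not an existing API.

```rust
// Hypothetical stand-ins for 128-bit SIMD types, modelled as array wrappers.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U32x4(pub [u32; 4]);

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U8x16(pub [u8; 16]);

impl U32x4 {
    /// Splat: replicate one scalar into all lanes.
    pub fn splat(x: u32) -> Self {
        U32x4([x; 4])
    }

    /// Lane-wise wrapping addition.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        let mut out = [0u32; 4];
        for i in 0..4 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        U32x4(out)
    }

    /// Lane-wise bitwise xor.
    pub fn xor(self, rhs: Self) -> Self {
        let mut out = [0u32; 4];
        for i in 0..4 {
            out[i] = self.0[i] ^ rhs.0[i];
        }
        U32x4(out)
    }

    /// Bitcast to a vector of the same size: reinterpret the 16 bytes
    /// in native byte order.
    pub fn to_u8x16(self) -> U8x16 {
        let mut out = [0u8; 16];
        for (i, lane) in self.0.iter().enumerate() {
            out[4 * i..4 * i + 4].copy_from_slice(&lane.to_ne_bytes());
        }
        U8x16(out)
    }
}
```

None of these definitions mention a target, which is the point being made: the same names and semantics would apply whether the lanes end up in an SSE or a NEON register.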

Now you can provide target-specific intrinsics for everything else, but filter out those intrinsics that are exact semantic copies of an existing basic operation.

The arm_neon.h types are pretty much an exact match with the Rust types, except for the polynomial types which can be mapped to the equivalent unsigned integer types.

The Intel header types are pretty weird. In many cases, an intrinsic has an obvious type. In other cases, it might be a good idea to make the intrinsic generic so it can work on both signed and unsigned SIMD types.

There are many more SIMD operations that can be defined in a target-independent way, but they have minor differences between platforms, so I don't think they should stand in the way of defining target-specific intrinsics that do the same thing. These are:

Shuffles and swizzles. The set of efficient shuffles varies.
Reciprocal and reciprocal sqrt approximations. Most ISAs have them, but with varying precision in the approximation.
Conversions between integer and floating point. Rounding and overflow behavior varies. Availability of unsigned integer conversions varies.
Lane-wise comparisons. Should the result be a boolean vector or an unsigned mask vector?
Blendv / lane-wise select. Should the controlling vector be booleans or ints? Should the sign bit control the whole lane, or is the select bit-wise?
Shift by a scalar. The overflow behavior varies when the scalar is larger than the bit width of the lanes.
Saturating add/sub is universally available for 8-bit and 16-bit lanes, but not for larger lanes.
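
To make a few of these variations concrete, here is a small scalar sketch in stable Rust. It picks one possible answer for each open question (comparisons return an all-ones mask, select is bit-wise) and shows two different policies for an oversized shift. All function names are hypothetical, and arrays stand in for vector types.

```rust
/// Lane-wise equality returning an unsigned mask vector:
/// all ones where the lanes are equal, zero otherwise.
pub fn eq_mask(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut m = [0u32; 4];
    for i in 0..4 {
        m[i] = if a[i] == b[i] { u32::MAX } else { 0 };
    }
    m
}

/// Bit-wise select: each result bit comes from `t` where the mask bit
/// is set and from `f` where it is clear.
pub fn select_bits(mask: [u32; 4], t: [u32; 4], f: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = (mask[i] & t[i]) | (!mask[i] & f[i]);
    }
    out
}

/// Shift policy 1: a shift amount >= the lane width yields zero.
pub fn shl_or_zero(a: [u32; 4], n: u32) -> [u32; 4] {
    a.map(|x| x.checked_shl(n).unwrap_or(0))
}

/// Shift policy 2: the shift amount is taken modulo the lane width.
pub fn shl_masked(a: [u32; 4], n: u32) -> [u32; 4] {
    a.map(|x| x.wrapping_shl(n))
}
```

The two shift functions agree for amounts below 32 and disagree above, which is the kind of difference a target-independent operation would have to pin down.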

gnzlbg · November 24, 2016, 9:36am

And what would prevent that from being possible?

That function would just dispatch to the Add function implemented for T. If T implements Add, then nothing can fail.

f32x8 would only be available if some target feature is enabled, and it might conditionally have different Add implementations (or none) depending on the target features available.

The only error you could get from your code is a front-end compiler error saying Add is not implemented for T.

A per function target feature always tries to emit the code. If the code cannot be emitted, you will get a backend error. You would need to wrap per function target features in conditional compilation blocks:


#[cfg(target_architecture("x86"))] {
  
  #[target_feature("AVX")]
  fn foo() { ... }  // code for foo is always generated on x86

}

instead of requiring ... unsafe { ... }

Intrinsics will always be unsafe. That is possible with a higher level library that abstracts the intrinsics away and provides fallback implementations (this is exactly what the simd crate currently does).

It seems you are trying to reduce the need for #[cfg(target_feature(...))] and rely as much as possible on #[target_feature]. One is a compile-time flag; the other is just a function attribute that says: generate the code for this function with this feature enabled.

If you want to generate multiple functions with #[target_feature] that do the same thing for different feature sets, you can either:

give them different names (Rust doesn't have overloading), and wrap them in another function without the attribute that dispatches to the proper one, either through run-time detection or through #[cfg(target_feature(...))] at compile time
give them the same name but only compile one of them (by using #[cfg(target_feature(...))])

The same would be true for impls, when you write:

#[target_feature("AVX")]
impl Add for f32x8 { ... }

what this is saying is always generate the code for Add for f32x8, but the whole thing should be wrapped in an #[cfg] anyways because doing so is not always possible (e.g. ARM). In particular you might want to offer an f32x8-like abstraction that also works when AVX is not available, but you cannot have two impls for the same type, so adding the following below the previous impl does not work:

#[target_feature("SSE4.2")]
impl Add for f32x8 { ... }

The only ways you could make it work are the same as for functions: make sure only one impl of Add is generated for f32x8, either by using conditional compilation or by writing an impl of Add that is independent of the target feature and uses a layer of indirection. For example:

// independent of target features
impl Add for f32x8 {
  // compile-time switch
  fn foo() {
    if #[cfg(target_feature("AVX"))] {
      // do this using AVX
    } else {
      // fallback impl, can be plain Rust, works for x86 and ARM
    }
  }
  // run-time switch:
  fn bar() {
    if #[cfg(architecture("X86"))] {
      if cpuid.is("AVX") {
        f32x8_add_AVX_bar_impl(); //< written with #[target_feature]
      } else {
        f32x8_add_fallback_bar_impl();
      }
    } else {
      // do something for ARM
    }
  }
}
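
As a rough illustration of the two switches above, here is a minimal sketch in present-day stable Rust. It is not the syntax proposed in this thread: it assumes the `#[target_feature(enable = "avx")]` attribute, `#[cfg(target_feature = "avx")]`, and the `is_x86_feature_detected!` macro as they exist in current std. The function names are hypothetical, and the AVX variant is just the same loop compiled with AVX enabled.

```rust
/// Fallback in plain Rust; works on any architecture.
fn add_fallback(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

/// Same loop, but the compiler may emit AVX instructions in this function.
/// Calling it on a CPU without AVX is undefined behaviour, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

/// Compile-time switch: same name, only one definition is compiled.
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "avx"))]
pub fn add_compile_time(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // SAFETY: AVX is enabled for the whole compilation in this configuration.
    unsafe { add_avx(a, b) }
}

#[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "avx")))]
pub fn add_compile_time(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    add_fallback(a, b)
}

/// Run-time switch: detect the feature, then dispatch.
pub fn add_run_time(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: the CPU was just checked for AVX support.
            return unsafe { add_avx(a, b) };
        }
    }
    add_fallback(a, b)
}
```

Both public functions are safe to call on any target; the unsafety is confined to the two call sites where AVX availability has been established.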

gnzlbg · November 24, 2016, 10:03am

I showed here, in this discussion how you can do this. You basically use a static function pointer that you set on first call. No branches.
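
A hedged sketch of the "select once, then call through a pointer" idea in current stable Rust. It swaps in `std::sync::OnceLock` for a hand-written static function pointer that overwrites itself, so it is simpler but not strictly branch-free: the once-cell still checks whether it has been initialized. The names `SumFn`, `select_sum`, `sum_avx`, and `sum` are hypothetical.

```rust
use std::sync::OnceLock;

// The pointer type is `unsafe fn` because one candidate requires AVX.
type SumFn = unsafe fn(&[f32]) -> f32;

/// Portable implementation.
fn sum_fallback(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Same code, compiled with AVX enabled for this function only.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Runs feature detection and picks an implementation.
fn select_sum() -> SumFn {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return sum_avx;
        }
    }
    sum_fallback
}

/// Public entry point: detection happens on the first call only;
/// later calls reuse the cached function pointer.
pub fn sum(xs: &[f32]) -> f32 {
    static IMPL: OnceLock<SumFn> = OnceLock::new();
    let f = *IMPL.get_or_init(select_sum);
    // SAFETY: `select_sum` only returns `sum_avx` after detecting AVX.
    unsafe { f(xs) }
}
```

Callers only see the safe `sum` function, so the dispatch mechanism can be changed later without touching them.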

gnzlbg · November 24, 2016, 10:25am

What @hsivonen proposes is useful on its own, it already works, and I think it is completely orthogonal to the other features being proposed (@burntsushi ?) :

#[target_feature] function attribute (LLVM's __attribute__((target("OPTION")))),
- this is a function attribute; I also wanted the attribute to be required at the call site, so that it is clear what is going on
run-time feature detection: cpuid.feature(str) -> bool.

I would also like to have these features, but I don’t think that it is necessary for the first RFC on SIMD to cover them all in detail. However, I do think that a second RFC following up on these features should be a blocker for the stabilization of the first RFC.

jneem · November 24, 2016, 12:32pm

There were a couple of points that I was worried about, but they were all addressed by @hanna-kruppe.

currently, the impl of Add for a simd type is only conditionally available depending on the program-wide cpu features. I think it should be always available (or maybe, conditionally available whenever the architecture has any possibility of supporting it).
I want the function calling T::add to inline it, because otherwise the performance cost will negate the benefits of using SIMD. That means the compiler needs to be smart enough to automatically compile the generic function with the cpu features that its callee requires.

gnzlbg · November 24, 2016, 1:59pm

I think this is something everybody wants, but note that providing "fallbacks" is a different problem from providing intrinsics. While intrinsics must be provided by the compiler (or a plugin), fallbacks can be provided by Rust code and libraries. We have one library already that does this to some degree, the simd crate, but it is not widely used, so we don't have that much experience with it.

In particular, different types of fallback behavior would make sense for different types of applications. Software emulation of the intrinsics is just one possibility, another would be to just panic!, or return an error type, or...

Hopefully, once intrinsics get stabilized, higher-level libraries like the simd crate will get used more, more libraries for abstracting simd might be written, and at some point, we might even consider stabilizing one of those. But right now, I think we are far away from that point. Raw intrinsics are the first step towards that goal.

I want the function calling T::add to inline it, because otherwise the performance cost will negate the benefits of using SIMD. That means the compiler needs to be smart enough to automatically compile the generic function with the cpu features that its callee requires.

Not really. When you have:

#[target_feature("AVX")]
fn foo() { ... }

the compiler might generate a copy of foo and put it in the binary (if the function is exported), but if it can prove that the function is never used it doesn't need to do that (e.g. if you are using conditional compilation to control which intrinsic functions you are using).

When you call foo from another function, generic or not, like this:

fn bar() { 
  foo()
}

the compiler can still inline foo like this:

fn bar() {
  #[target_feature("AVX")] {
    // foo content
  }
}

and optimize across the function boundary just fine. #[target_feature("AVX")] just means that in this scope the compiler can emit AVX instructions independently of the target_feature of the binary. Since this can introduce SIGILL in your binaries, I kind of want functions marked with #[target_feature("AVX")] to be "unsafe" and require an annotation at the call site, so that the code you write looks like this:

fn bar() {
  #[uses_feature("SSE2")] {
    foo();  //< Error: foo uses feature AVX which is not available in this scope
  }
}

The objective of this would be to maintain the low number of segmentation faults that can actually happen in safe Rust. Another thing that was discussed a long time ago would be to allow finer unsafe granularity, so that you could write unsafe(uses_feature("SSE2")) { ... } instead, without opting into "full" unsafety.

jneem · November 24, 2016, 3:00pm

I agree, this is a library issue and so probably not worth discussing further yet.

Without actually trying to implement this, this is just speculation, but I'm not sure that sort of inlining would be enough to give good performance. For example, in a generic function bar<T>, if T is 256 bits then I really want to keep T in registers even in the part of bar that wasn't inlined from foo. But anyway, @hanna-kruppe convinced me that it's easy to do this with a MIR pass, by propagating the #[target_feature] annotation from foo to bar if bar unconditionally calls foo.

eddyb · November 24, 2016, 4:35pm

@stoklund Do you believe there’s any problem with stabilizing the vendor names first, even if they’re not more fundamental? Even if we implement them just like clang does?

hsivonen · November 26, 2016, 12:59pm

It would:

Involve more work for rustc/stdlib developers to write the vendor-name wrappers for loads/stores, basic comparisons, basic arithmetic and basic bitwise ops. (As opposed to shipping what already exists.)
Involve more work for Rust programmers who support more than one ISA who'd have to use conditional compilation to invoke loads/stores, basic comparisons, basic arithmetic and basic bitwise ops in different ways depending on target ISA.

emoon · November 26, 2016, 1:08pm

Maybe it’s just me, but I think that many parts of this thread (not all posts though) have derailed a bit into talking about high-level SIMD abstractions. The topic of this thread is getting explicit SIMD support on stable Rust, so it would be great if the discussion kept to that, as this thread is getting quite long.

eddyb · November 26, 2016, 1:14pm

This is already done by clang; we can just copy and adapt their stubs. Partially automatable.
The simd crate could easily do that (given how it’s also partially automated) and everyone who wants to build their own high-level API can reuse parts of it.

hanna-kruppe · November 26, 2016, 1:47pm

All the codegen work is already done (simd_add, #[repr(simd)] etc.). What remains is deciding, for each intrinsic, whether it's "really an intrinsic" or a wrapper around one of the previously mentioned operations. Not only can we consult clang for that, as @eddyb pointed out, but the same decisions would be needed to say "this intrinsic duplicates functionality we already have, we won't expose it". All that's really additional work is writing some functions with bodies as trivial as @burntsushi demonstrated above (transmute and call simd_add etc.), for which we can consult clang again.

That only applies if the whole surrounding algorithm is using only this bare-bones set of operations. Otherwise, you already need conditional compilation (and possibly a whole different algorithm) because of the other, platform-specific operations you're doing. For those algorithms, a crate like simd or even one with a much smaller API surface is sufficient, and only needs to be written once.

stoklund · November 27, 2016, 8:31pm

This topic is getting a bit confusing, so I don't quite know what you mean by this question.

Are you proposing to stabilize intrinsics without first stabilizing #[repr(simd)]? Is that even possible? These intrinsics take SIMD arguments, and stable Rust doesn't have any SIMD types.
Are you proposing to stabilize polymorphic extern "platform-intrinsic" intrinsics as currently implemented, or do you want to expose only functions bound to concrete types as @alexcrichton proposed above?

In my opinion, we shouldn't stabilize vendor names for basic operations. Instead of:

simd_xor(x, y)

We would get:

__msa_xor_v(x, y) (pseudo-polymorphic)
veorq_u8(x, y)
veorq_u32(x, y)
veorq_u64(x, y)
veorq_u16(x, y)
veorq_s8(x, y)
veorq_s32(x, y)
veorq_s64(x, y)
veorq_s16(x, y)
veor_u8(x, y)
veor_u32(x, y)
veor_u64(x, y)
veor_u16(x, y)
veor_s8(x, y)
veor_s32(x, y)
veor_s64(x, y)
veor_s16(x, y)
_mm_xor_si64(x, y)
_m_pxor(x, y)
_mm_xor_ps(x, y)
_mm_xor_pd(x, y)
_mm_xor_si128(x, y)
_mm256_xor_pd(x, y)
_mm256_xor_ps(x, y)
_mm256_xor_si256(x, y)
_mm512_xor_pd(x, y)
_mm512_xor_ps(x, y)
_mm512_xor_epi32(x, y)
_mm512_xor_epi64(x, y)
_mm512_xor_si512(x, y)
vec_xor(x, y) (polymorphic)

I agree with @hsivonen that this would be sad. I also don't think this would provide any benefit over using platform-independent names for these basic operations.

It is also worth noting here that the different CPU vendors use different type systems when mapping to C.

ARM NEON uses strongly typed SIMD types with explicitly signed/unsigned/float lanes. The NEON types don't automatically bitcast to another SIMD type of the same size.
MIPS MSA uses GCC-style vector types with explicitly signed/unsigned/float lanes. These types do automatically bitcast, which is why there is only one __msa_xor_v() intrinsic. It's pseudo-polymorphic because of the automatic bit-casting, so only one is needed.
Intel uses GCC-style vector types, but only distinguishes float/double/integer lanes as discussed above. There is no difference between 8x16, 16x8, 32x4, or 64x2 integer vectors.
AltiVec (POWER) uses a C language extension with a vector keyword and function overloading, so vec_xor works on a large number of heterogeneous argument types.

If we stabilize functions bound to concrete types instead of polymorphic platform intrinsics, would each vendor get a separate set of types?

The Intel convention of squashing all integer-like SIMD types into a single __m128i type sticks out like a sore thumb. All the other vendors use types that explicitly specify the size and signedness of integer lanes. And so does the simd crate, even for x86.

comex · November 27, 2016, 10:50pm

There would be some benefit.

Porting: Using low-level vendor names makes it easy to port C code that uses SIMD intrinsics to Rust, as well as for people to leverage their existing familiarity with those names when writing Rust code. Both of these are a little more difficult if Rust exposes only part of the vendor interfaces, while some semi-random subset requires a new interface.
Consistency: Yes, simd_xor would be equivalent to a boatload of functions, mainly because some architectures take the approach of having a separate function name for each width variant rather than using polymorphism. (If there were only one function per architecture, your list wouldn't look nearly as scary.) But since the plan would still be to use the vendor names for all other (non-"basic") operations, the explicit-SIMD crate is going to end up full of similar long lists anyway. For example, ARM has a set of instructions for transposing 2x2 vectors, so you'd end up with vtrn_s8, vtrn_s16, vtrn_s32, vtrn_u8, vtrn_u16, vtrn_u32, vtrn_f32, vtrn_p8, vtrn_p16, vtrnq_s8, vtrnq_s16, vtrnq_s32, vtrnq_u8, vtrnq_u16, vtrnq_u32, vtrnq_f32, vtrnq_p8, and vtrnq_p16.

Having some operations explicitly designate width and use vendor naming conventions, while others are polymorphic and use a different naming convention, is confusing and harder to use even for people without preexisting experience.

Support: Even basic operations often vary wildly in architecture support. For example, simd_add looks good on x86 because SSE2 has instructions to add vectors of 8, 16, 32, or 64-bit integers packed into 128-bit registers. But what about simd_mul? SSE2 has 16-bit integer multiplication, and... that's it. SSE4 adds a 32-bit version; AVX-512VL/DQ adds a 64-bit version; there's no 8-bit version.

Similarly, look at comparisons (simd_eq etc.). SSE2 has equality, signed greater-than, and signed less-than on 8, 16, and 32-bit packed integers, but for 64-bit you need SSE4.2. There are no >= or <= variants, and there are no unsigned variants.

So how should simd_mul or simd_eq work in cases of unsupported operations or unsupported platforms? In some cases there are reasonably efficient workarounds to accomplish the same operation using a sequence of instructions; in others there are not. In any case, quietly generating workarounds isn't appropriate for a "direct to the metal" interface; the user should be confident that the intrinsics correspond directly to individual machine instructions.

So I guess you would say that these operations should simply fail to compile in unsupported cases. But I'd argue that a generic-looking operation that's not actually generic is unergonomic. For a user, it's very little help that they only have to remember simd_mul rather than N architecture-specific intrinsics, if they still have to remember which architectures actually support various sorts of simd_mul.

Backwards compatibility: I'm assuming that you want operations deemed "basic" to only be supported using generic functions, as opposed to supporting both generic and vendor-specific versions. However, that creates a problem if we ever want to expand the set of "basic" operations in the future. For a SIMD operation to be considered "basic", it would probably need a long history of preexisting support in various architectures, so its vendor-specific equivalents would already exist in Rust. Stability would preclude removing those functions, so we'd end up with a confusing situation where some basic operations have vendor equivalents and others don't.
Bikeshedding: Alternately, if basic operations are to be supported both ways, then it doesn't hurt to start by stabilizing the vendor-specific versions. This avoids bikeshedding over things like which operations should be considered "basic" and how the generic functions should be named. Bikeshedding over types, on the other hand, seems unavoidable, but at least we can cut down the amount that has to be decided...

jethrogb · November 27, 2016, 11:45pm

This seems unavoidable if we want to support non-basic intrinsics.

stoklund · November 28, 2016, 4:24am

Most of these concerns can be addressed by a simd-vendor-lockin crate which provides all the vendor-specific names for the basic operations, possibly disabled as desired by target feature checks. These details are not fundamental, and we don't need to burden the Rust compiler and standard library with them.

Since the #[repr(simd)] types are nominal, it would be quite unfortunate to partition the SIMD types by vendor in my opinion. It would make it much more difficult to write platform-independent SIMD code in the future without layers of abstractions and unnecessary target feature checking.

It doesn't seem difficult to provide a common set of standard SIMD types that all the platform-specific operations can use. ARM NEON, MIPS MSA, and AltiVec already agree on the 128-bit wide types with all the Rust integer (less isize and usize) and floating point types as lanes. The minor differences are:

ARM has polynomial vector types. Integer types can be substituted.
AltiVec has boolean types like vector bool int which is like an i32x4 where each lane can only be 0 or -1. LLVM doesn't support these types, but Cretonne does (as b32x4). They're not so easy to express in Rust's type system, so they can be substituted with integer vector types with little loss.

For the Intel intrinsics, we should use more detailed integer types than __m128i. Note that even Clang's builtins already have this detailed integer type information.
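
To show why the lane type matters, here is a scalar sketch in stable Rust contrasting an untyped 128-bit value, in the spirit of __m128i, with a typed vector. With the untyped value, the lane width has to live in the function name, and the same bits give different sums. `Bits128`, `U16x8`, `add_lanes8`, and `add_lanes16` are hypothetical names, and little-endian lane layout is an assumption of this example.

```rust
/// Untyped 128-bit value: nothing says how the bytes are grouped into lanes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bits128(pub [u8; 16]);

/// Untyped style: sixteen 8-bit lanes, chosen by the function name.
pub fn add_lanes8(a: Bits128, b: Bits128) -> Bits128 {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a.0[i].wrapping_add(b.0[i]);
    }
    Bits128(out)
}

/// Untyped style: eight 16-bit lanes (little-endian), chosen by the name.
pub fn add_lanes16(a: Bits128, b: Bits128) -> Bits128 {
    let mut out = [0u8; 16];
    for i in 0..8 {
        let x = u16::from_le_bytes([a.0[2 * i], a.0[2 * i + 1]]);
        let y = u16::from_le_bytes([b.0[2 * i], b.0[2 * i + 1]]);
        out[2 * i..2 * i + 2].copy_from_slice(&x.wrapping_add(y).to_le_bytes());
    }
    Bits128(out)
}

/// Typed style: the lane width is part of the type, so one method name suffices.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U16x8(pub [u16; 8]);

impl U16x8 {
    pub fn wrapping_add(self, rhs: Self) -> Self {
        let mut out = [0u16; 8];
        for i in 0..8 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        U16x8(out)
    }
}
```

In the typed style, passing a vector with the wrong lane width is a compile error instead of a silently different result.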

I noted that the D language provides shared SIMD types with basic operations, but completely punts on typed platform-specific intrinsics. All their operands are void16.

scottmcm · November 28, 2016, 4:44am

Warning: naïve statements ahead, made without historical context.

When I look at everything here, I think the biggest sadness for me is the proliferation of uninteresting & unextensible nominal types.

I was very much expecting a type that acted mostly like a normal Rust array, just slightly more restricted in exchange for more functionality. I figured it would compile to an LLVM vector with insert/extract through [] (since they don’t require constants) and with the wrapping_* methods producing the LLVM vector instructions where easy. I figure that then whatever LLVM does to lower that would be pretty good even if I make currently-slightly-silly choices like < i32 x 8 >. (These don’t feel unreasonable for the compiler to know about, as while it’s obviously more than just #[repr(simd)] would be, it doesn’t depend on not-at-all-obvious things like “what’s a good typesafe general shuffle API over arbitrary sized stuff?”.)

That by no means obviates intrinsics (and really bad choices like < i16 x 9 > would probably never have great performance), but does mean that the signatures of those intrinsics could reference common & clear types, problems like shuffle! needing to generate types go away naturally, avoids excess transmute, means data structures aren’t tied to the algorithm architecture, and means multiple crates for nicer interfaces would be at least somewhat interoperable.

Probably insanity: I pessimistically guess I can’t get a slice into an LLVM < i32 x 4 >? It’d be magical if the fixed-length LLVM array was SIMD-compatible for SIMD-reasonable types. If they’re sliceable, would it impact more than alignment? Are things like [i32;4] commonly found in the middle of structs?

fyl2xp1 · November 28, 2016, 6:02pm

I finally think my question in the user’s forum isn’t too off-topic any more: Why do the simd crates use a tuple rather than an array?
