Getting explicit SIMD on stable Rust

Works fine for me: Rust Playground

This doesn’t: https://is.gd/cH5VHp

I wonder if your example is a bug…

We seem to be converging on a consensus, so I’ve fleshed out my straw man proposal and added rationales. The proposal is divided into steps that reflect my sense of priority.

Step 1: Common opaque SIMD types

Define a set of opaque SIMD types in std::simd for the commonly supported vector types:

64-bit vectors:

  • i32x2, u32x2, f32x2,
  • i16x4, u16x4,
  • i8x8, u8x8.

128-bit vectors:

  • i64x2, u64x2, f64x2,
  • i32x4, u32x4, f32x4,
  • i16x8, u16x8,
  • i8x16, u8x16.

256-bit vectors:

  • i64x4, u64x4, f64x4,
  • i32x8, u32x8, f32x8,
  • i16x16, u16x16,
  • i8x32, u8x32.

These types are opaque except for the following traits:

  • Copy and Clone.
  • Default. The default vector is the all-zeros bit pattern.
  • From<[T; N]> and Into<[T; N]>, where T is the lane type and N is the number of lanes, so f32x4 implements From<[f32; 4]>, for example.
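
For example, everything in the following sketch would compile against that surface, and nothing else would (assuming the std::simd module layout described above):

use std::simd::f32x4;

fn demo() {
    let a = f32x4::from([1.0, 2.0, 3.0, 4.0]); // From<[f32; 4]>
    let z = f32x4::default();                  // the all-zeros bit pattern
    let b = a;                                 // Copy
    let lanes: [f32; 4] = b.into();            // Into<[f32; 4]>
    let _ = (z, lanes);
}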

The SIMD types differ from a user-defined newtype like struct Foo([T;N]) in the following ways:

Alignment

The alignment of SIMD types can be specified by the target ABI, but not all SIMD vector sizes are supported for all targets.

  1. If the target ABI specifies the alignment of a SIMD type, use that. Otherwise,
  2. If a smaller SIMD type with the same lane type exists, use the alignment of the smaller type. Otherwise,
  3. Use the alignment of the lane type.
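
To illustrate, here is the fallback chain expressed as a small function (the parameters are descriptive stand-ins for information the compiler would have, not a real API):

// Illustrative only: the three alignment rules above as a fallback chain.
// abi_align is the ABI-specified alignment, if any; smaller_align is the
// alignment of a smaller SIMD type with the same lane type, if one exists.
fn simd_alignment(abi_align: Option<usize>,
                  smaller_align: Option<usize>,
                  lane_align: usize) -> usize {
    abi_align                  // rule 1: the target ABI specifies it
        .or(smaller_align)     // rule 2: inherit from the smaller type
        .unwrap_or(lane_align) // rule 3: fall back to the lane type
}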

The alignment of SIMD types that are not supported on a target today is subject to change if that target adds support for the type. For example, if ARM decides to add support for 256-bit SIMD, the alignment of the 256-bit types may have to change on that platform.

FFI

Some ABIs specify alternative behavior of SIMD types in function call parameters and return types. The SIMD types here behave like C SIMD types when used in FFI calls.

If SIMD types that are not covered by the ABI are used in FFI function calls, they behave the same way as a user-defined struct Foo([T;N]) newtype would.
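
As a sketch of the intent, on an x86-64 target whose ABI covers 128-bit vectors, a binding like the following would pass the vector exactly as a C compiler passes a __m128 (the C function here is hypothetical):

use std::simd::f32x4;

extern "C" {
    // C side: __m128 scale(__m128 v);
    fn scale(v: f32x4) -> f32x4;
}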

Rationale

The goal here is not to provide general, ergonomic language support for SIMD programming. The goal is to:

  1. Establish standard, well-known names for the SIMD types that are used in practice in order to prevent per-vendor nominal types.
  2. Provide a minimal basis for the implementation of vendor intrinsics.
  3. Provide a minimal basis for portable SIMD programming.
  4. Be forwards compatible with future language support for SIMD types.

The intention is that when full language support for SIMD types is added, these names can be replaced with type aliases (whatever that future syntax may be):

type f32x4 = Simd<f32, 4>;
...

We should make sure today that such a substitution won’t break code tomorrow.

In the spirit of minimalism, I even removed constructors from these types, so they have no methods outside the trait impls. Constructors can be provided externally via the Default and From<[T;N]> implementations.

Step 2: Vendor intrinsics

Provide a complete mapping of the intrinsics in vendor header files like <arm_neon.h>, but using the standard SIMD types. All of these are exposed as functions that are guarded by target feature detection.

The exposed names of the intrinsics should match the vendor names so they can be searched for easily.

Intel integers

The mapping of vendor types to Rust SIMD types is trivial except for the Intel integer vector types. They will be mapped as follows:

  • __m128i becomes i64x2, i32x4, i16x8, or i8x16.
  • __m256i becomes i64x4, i32x8, i16x16, or i8x32.

For most intrinsics, there is only one obvious choice which can be derived from Clang’s corresponding builtin signature. Some intrinsics will need to be provided in multiple per-type versions.
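
As a sketch of how the rule plays out for two add intrinsics (the signatures follow the Clang builtins; the wrapping_add bodies are stand-ins borrowed from step 3, not how the real implementation would work):

pub fn _mm_add_epi32(a: i32x4, b: i32x4) -> i32x4 {
    a.wrapping_add(b) // would lower to paddd (32-bit lanes)
}

pub fn _mm_add_epi16(a: i16x8, b: i16x8) -> i16x8 {
    a.wrapping_add(b) // would lower to paddw (16-bit lanes)
}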

Rationale

Since we’re also adding portable SIMD arithmetic operations, many vendor intrinsics will be redundant. However, since there are thousands of vendor intrinsics, the redundant portion is relatively tiny. It is beneficial for somebody porting SIMD code written in C to be able to find everything in one place.

The mapping of the Intel integer types is a compromise which:

  • Avoids the creation of a separate nominal vendor-specific SIMD type like x86::__m128i.
  • Provides a small improvement in type safety over Intel’s approach by forcing explicit casts when switching lane geometry.
  • Avoids the duplication of a large number of intrinsic names into signed/unsigned variants.
  • Avoids the mistakes we would inevitably make if we attempted to manually pick correct signed or unsigned types for some 3,000 intrinsics.

Step 3: Portable SIMD operations

Provide a basic set of SIMD operations that are available unconditionally on all target platforms.

For all SIMD types, implement:

  • BitAnd,
  • BitOr,
  • BitXor, and
  • Not.

For all integer SIMD types, add methods:

  • wrapping_neg(),
  • wrapping_add(),
  • wrapping_sub(), and
  • wrapping_mul().

For all floating point SIMD types, implement:

  • Neg,
  • Add,
  • Sub,
  • Mul, and
  • Div.

Add methods:

  • abs() and
  • sqrt().
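
A small usage sketch, assuming the types from step 1 and the operations above (the method signatures are my assumption):

use std::simd::{f32x4, i32x4, u32x4};

fn axpy(a: f32x4, x: f32x4, y: f32x4) -> f32x4 {
    a * x + y // Mul and Add on floating point vectors
}

fn sum_wrap(a: i32x4, b: i32x4) -> i32x4 {
    a.wrapping_add(b) // wraps on overflow in each lane, never panics
}

fn select(mask: u32x4, a: u32x4, b: u32x4) -> u32x4 {
    (a & mask) | (b & !mask) // BitAnd, BitOr, Not: a branchless bitwise select
}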

Rationale

It would be possible, but very complicated, to implement portable SIMD operations in terms of the vendor intrinsics, which are basically a 1-1 mapping of the instruction sets. There are a lot of strange holes in the complicated availability matrix, and picking the right instructions is equivalent to writing a code generator. Rust already has a code generator which encodes all of that information: LLVM.

I omitted wrapping_div on purpose because it is not supported by any current architecture.

Step 4: Bitcasts

Provide methods which make it easy to reinterpret the bits in the lanes of a vector as a different type.

  • For floating point and unsigned integer SIMD types, add a method to_ibits() which produces a vector with the same lane geometry, but with signed integer lanes, so f32x4 -> i32x4, u8x16 -> i8x16, etc.
  • For floating point and signed integer SIMD types, add a method to_ubits() which produces a vector with the same lane geometry, but with unsigned integer lanes, so f32x4 -> u32x4, i8x16 -> u8x16, etc.
  • For integer SIMD types with 32-bit or 64-bit lanes, add a to_fbits() method which reinterprets the lanes as floating point. u32x4 -> f32x4, etc.
  • For all integer SIMD types T1, T2 of the same size, implement From<T2> for T1.
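
A sketch combining steps 3 and 4: clearing the sign bit of every f32 lane by viewing the vector as unsigned integers (this assumes the to_ubits/to_fbits methods above and the From<[u32; 4]> constructor from step 1):

fn abs_via_bitcast(v: f32x4) -> f32x4 {
    let mask = u32x4::from([0x7fff_ffff; 4]); // every bit except the sign bit
    (v.to_ubits() & mask).to_fbits()
}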

Rationale

Bitcasts are much more common in SIMD programming than when using regular scalar variables. They should be easy to use.

Our compromise in mapping the Intel intrinsics requires some amount of signed/unsigned flipping.

By providing bitcasts that don’t change the number of lanes, we are able to preserve some of the benefits of type checking, since it is more common to change lane types than to change lane geometry.

Lane geometry changes also happen often enough that it makes sense to provide them with the From trait.


Your type is private, mine is public but can’t be named.

@stoklund I think we are pretty close to the same page modulo maybe some bikeshedding. I’m still skeptical of step 3. When you put it in context, it doesn’t seem like much. It is tempting.

I think my current plan is to keep chugging away at my implementation and write up a pre-RFC in a new thread. If we have broad consensus there, then I’ll mush on to a full RFC.

Sounds great, thanks!

FWIW, I think that step 3 is realistically blocking stabilization of the current simd crate. It would be nuts to start implementing portable SIMD arithmetic in terms of vendor intrinsics when LLVM has already done all that hard work.

Understood. That’s a really important point I think. Would it be possible to unpack that a bit? Or point to something that expands a bit more on just how complex it is? Not that I don’t buy it—but if we’re going to sell a small cross platform API in std to folks, we’ll need a good motivation.

For what it’s worth, while the simd crate couldn’t be stabilized as-is, I feel like one could offer up nice APIs for specific platforms without too much trouble. (e.g., Just x86.) Is that accurate?

I'm afraid that stabilizing vendor intrinsics first and a Rustic cross-platform API later may lead to lots of non-cross-platform SIMD code being produced in the period when the only stable way to use SIMD in Rust is via platform-specific intrinsics. To make it harder for people to fall into that pit of non-cross-platform code, I think that Rust has to have a cross-platform SIMD API (at least for simple things like add, mul, cmp and shuffle) ready at the very moment of stabilising platform intrinsics.

The first Rust API doesn't have to be perfect. It should just allow simple cross-platform SIMD to be possible. It could be just a bunch of opaque types like i32x4 with no public fields and all SIMD operations available through methods. It could live in a std::simd::v1 module, which could simply be deprecated if a better SIMD abstraction comes along (e.g. when the type system evolves to have type-level integers).

There are a lot of details to work out. Suppose you’re writing impl Mul for i32x4. It would go something like this:

  • If MIPS MSA is available, use __msa_mulv_w().
  • If ARM NEON is available, use vmulq_s32().
  • If SSE 4.1 is available, use _mm_mullo_epi32().
  • If SSE2 is available, try to cobble something together out of 16-bit multiplications using _mm_mulhi_epi16() and _mm_mullo_epi16(), or maybe _mm_mul_epu32() combined with some shuffling.
  • Otherwise, expand into a lane-wise scalar multiplication.

In particular, trying to construct an i32x4 multiplication out of existing SSE2 intrinsics requires some work and knowledge. Work and knowledge that has already been put into LLVM.
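
To give a feel for the cobbling, here is the _mm_mul_epu32-plus-shuffles route sketched with today's std::arch names, purely for illustration (SSE2 is baseline on x86-64, so no feature detection is shown):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
fn mullo_epi32_sse2(a: __m128i, b: __m128i) -> __m128i {
    unsafe {
        // widening 32x32->64 multiplies of the even (0, 2) and odd (1, 3) lanes
        let even = _mm_mul_epu32(a, b);
        let odd = _mm_mul_epu32(_mm_srli_si128::<4>(a), _mm_srli_si128::<4>(b));
        // keep the low 32 bits of each product and interleave them back together
        _mm_unpacklo_epi32(_mm_shuffle_epi32::<0b1000>(even),
                           _mm_shuffle_epi32::<0b1000>(odd))
    }
}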

This is just one operation for one type. There’s about 150 of those to go through. Then you would need to write individual unit tests for all of them since you’re guaranteed to have picked the wrong intrinsic by mistake at least once. Then find a MIPS machine to run your unit tests on. No, not that one. One with SIMD instructions available.


I just don't think this is a good enough reason to delay stabilization. It's a cost, yes, but it's one I'd happily pay to get a low level form of SIMD out the door. Regardless of whether we have a Rustic API, we need the low level API on stable Rust.

Also, as has been repeated many many many times in this thread, the set of intrinsics that can be put behind a cross platform API is tiny. Overall, this is the beginning of a stabilization effort that will include thousands of intrinsics.

We cannot keep rehashing this. Please see: Getting explicit SIMD on stable Rust - #195 by scottmcm


Thank you for spelling that out. Some variation of it will no doubt find its way into the RFC. :slight_smile:

:+1: But you forgot 512-bit (AVX-512 / __m512i).

Also, you should specify how the per-type versions are distinguished. To avoid adding friction by changing names compared to the vendor standard, I suggest making the relevant functions generic, bounded on traits which would be implemented for suitable vector types. The traits wouldn’t have to be stabilized themselves.

Perhaps the same approach could be used to let the many signedness-agnostic functions work on, e.g., both i32x4 and u32x4.
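
A minimal sketch of that suggestion, with hypothetical names throughout (the trait would stay unstable; only the vendor-named function is public):

pub trait AddEpi32: Copy {
    #[doc(hidden)]
    fn paddd(self, other: Self) -> Self; // wraps the real instruction
}

// impl AddEpi32 for i32x4 { ... }
// impl AddEpi32 for u32x4 { ... }

pub fn _mm_add_epi32<T: AddEpi32>(a: T, b: T) -> T {
    a.paddd(b)
}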


The only way to get around not stabilizing the traits AFAIK is by adding them to the prelude. I doubt that's going to fly.

Nevermind the fact that I definitely think adding generics to the vendor Intel functions is a bridge too far at this point. If we got there, I think I'd sooner go back to the __m128/__m128i/__m128d types.

Nope, I silently omitted it :wink:

I would prefer to skip AVX512 in the first round of stabilization. The mask registers are mapped to scalar integers in the C API, and I think we could do something better if we had boolean vectors.

Is there real demand for AVX-512 today? Is it available in shipping hardware beyond Xeon Phi?

Isn’t that only true for trait methods, as opposed to bounds on functions?

Hmm. Good point. My general dislike for adding generics at this point stands though. :slight_smile:

I don't know about non-Intel architectures, but Intel intrinsics all have a different suffix based on the type you're supposed to pass in.

+1 for punting on AVX-512. Although yesterday's AWS announcement suggests it's coming to regular processors sometime soon. OTOH I've heard Xeon E5/E7 v5 is still about a year out at this point.

Feels like a great sweet spot. Small enough set to be uncontroversial, just enough traits that it's possible to use non-horribly.

And I'm pleased to see that, even if you ignore intrinsics, the simple cases can be coded reasonably (no unsafe and fairly ergonomically) in such a way that they even produce LLVM vector instructions. For example, this gives %2 = fmul <4 x float> %0, %1 and then mulps %xmm1, %xmm0:

pub fn my_mul(a: f32x4, b: f32x4) -> f32x4 {
    let a: [f32;4] = a.into();
    let b: [f32;4] = b.into();
    [
        a[0] * b[0],
        a[1] * b[1],
        a[2] * b[2],
        a[3] * b[3],
    ].into()
}

I think it might need to be more nuanced than that. Something like __m128i _mm_adds_epu8(__m128i a, __m128i b) that "Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating arithmetic" should probably take u8x16s, not i8x16s.

Maybe it's sufficient to say "a function takes a type implied by the suffix of its name"? That'd justify only providing the signed versions of the ones that don't care initially, though I bet we'll want to offer _mm_add_epu32 at some point, despite only _mm_add_epi32 existing officially.

Will adding literally thousands more functions to std impact non-simd-using consumers at all? I take it they can't go in an intel_intrinsics crate because of needing nightly for something?

:sparkles: Yay! :fireworks:

Given all the LLVM intrinsics that take vectors, I suspect there's a few more. The only one I'll recommend for the RFC, though, is mul_add on floating-point vectors. (Oh, and since we have i32::saturating_add in stable, i32x4::saturating_add would be nice, and there's an MMX instruction for it.)

And I assume "step 3" isn't gated on completion of "step 2", despite the numbering?

Just a word of warning here: It's easy enough to tickle LLVM's SLP vectorizer into producing one or two SIMD instructions with tiny toy examples, but this is not a robust alternative to hooking up the real portable SIMD operations (my step 3). LLVM has a number of optimizations that run before the SLP vectorizer gets to see the code, and they can easily obfuscate things by optimizing the separate lane expressions in different ways.

Try creating a larger example, and make sure that there are constant folding opportunities in some lanes, but not in others. That should trip up the SLP vectorizer enough to just give you scalar code. When we have to defend this proposal in RFC form, it would actually be good to have real counterexamples at hand to show that this is not a viable alternative to step 3.

I was going for conservatism rather than completeness, so I left out many things.

  • Fused multiply-add (mul_add) is available on many platforms, but not universally, and it is a very expensive operation to emulate. f64x2::mul_add() requires a 106-bit wide multiplication which you really, really want implemented in hardware.
  • Saturating arithmetic is universally available for 8-bit and 16-bit lanes, but not for i32x4. It turns out that nobody actually needs saturating arithmetic larger than 16 bits. I left it out to avoid the asymmetry.
  • Comparisons and selects are universally available, but with subtle variations across platforms that we don't need to resolve now. Compare the semantics of Intel's blendvps instruction with NEON's vbsl. The difference is just big enough to get in the way.
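
To make that last point concrete, here are the per-lane semantics of the two selects, sketched on scalars (my reading of the instruction references, not a proposed API):

// blendvps keys each 32-bit lane off the sign bit of the mask lane only
fn blendvps_lane(mask: u32, a: u32, b: u32) -> u32 {
    if mask & 0x8000_0000 != 0 { b } else { a }
}

// vbsl selects every bit independently from the full mask
fn vbsl_lane(mask: u32, a: u32, b: u32) -> u32 {
    (a & mask) | (b & !mask)
}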

Most of these things should be fairly straightforward to implement in terms of vendor intrinsics, which I'm sure the simd crate will do.

I would like to see a second round of stabilization that brings the portable operations up to par with SIMD.js at least, but there are a number of issues to work out first.

Right. The steps represent my priority, not a dependency.


Absolutely.

As you suggested, man, was one easy to produce :disappointed: This one, which I'm not surprised didn't work, does generate vector instructions, but it's four mulss, not a mulps :wink:

pub fn my_dot_direct(a: f32x4, b: f32x4) -> f32 {
    let a: [f32;4] = a.into();
    let b: [f32;4] = b.into();
    a[0] * b[0] +
    a[1] * b[1] +
    a[2] * b[2] +
    a[3] * b[3]
}

And I guess the inliner is one of the passes that runs before the SLP vectorizer. I was hoping this would inline the vector fmul from my_mul, but nope. No fmul <4 x float> in the LLVM, no mulps in the assembly.

pub fn my_dot_usingmul(a: f32x4, b: f32x4) -> f32 {
    let c: [f32;4] = my_mul(a, b).into();
    c[0] + c[1] + c[2] + c[3]
}

But then a tiny change makes it happy again. This one actually generates rather nice-looking vector operations even for the additions:

pub fn my_dot_usingmul_balanced(a: f32x4, b: f32x4) -> f32 {
    let c: [f32;4] = my_mul(a, b).into();
    (c[0] + c[1]) + (c[2] + c[3])
}

(Well, except for the fact that apparently it didn't notice that %2 and %4 are exactly the same thing, which I thought SSA was really good at.)

  %2 = fmul <4 x float> %0, %1
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 2>
  %4 = fmul <4 x float> %0, %1
  %5 = shufflevector <4 x float> %4, <4 x float> undef, <2 x i32> <i32 1, i32 3>
  %6 = fadd <2 x float> %3, %5
  %7 = extractelement <2 x float> %6, i32 0
  %8 = extractelement <2 x float> %6, i32 1
  %9 = fadd float %7, %8
  ret float %9

Yay, non-associative floating point math! :angry: Makes me want a bunch of sum_unspecified_order() methods (on simd, slices, ...), but that's definitely a post-step-4 bikeshed.

(Hmm, looking at LLVM's fastmath flags makes me want to go make an nnan_f32 type that lowers like that all the way to LLVM, and would actually be Ord & Eq. But that's a distraction for the future...)