Getting explicit SIMD on stable Rust

Yet another thing to bikeshed: what syntax or method should be used for bitcasting?

transmute is unnecessarily unsafe. as could work, but it would be a bit odd that u32 as f32 converts the numbers to floating-point while u32x4 as f32x4 bitcasts. You might expect the latter to do an implicit _mm_cvtepi32_ps instead. On the other hand, if there’s a method or function to bitcast between SIMD types, it could also be implemented for non-SIMD types, e.g. bitcasting u32 to f32. (There isn’t already a method for that, is there?)
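
For the scalar case, current stable Rust does have such methods: f32::from_bits and f32::to_bits reinterpret the bits, while as converts the value. A minimal sketch of the difference (the helper names convert and bitcast are made up for illustration):

```rust
/// Value conversion: the integer 1 becomes the float 1.0.
pub fn convert(x: u32) -> f32 {
    x as f32
}

/// Bitcast: the same 32 bits are reinterpreted as an IEEE 754 single.
pub fn bitcast(x: u32) -> f32 {
    f32::from_bits(x)
}
```

With these, convert(1) is 1.0, whereas bitcast(1) is the smallest positive denormal; bitcast(0x3f80_0000) is what yields 1.0.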

We could start with From impls for all the integer vectors (which do a transmute internally). Every other cast could require the appropriate _mm_cvt... call.

If we can’t put some trust in the types as provided by LLVM, then it seems like the only choice is to revert back to __m128/__m128i/__m128d, no? Even then, consumers will need to get the types right on their own, which seems worse…

That covers conversions, but how do you bitcast between two different SIMD types without converting? If the answer is as, my point is that it’s strange for as to convert scalar types but bitcast SIMD types…

Sorry, I’m not following. If we leave as out of this, what does From impls for integer vectors and _mm_cvt... for float<->int conversion leave out?

Are you saying you want to be able to bitcast from, say, f32x4 to u32x4? If so, what’s the use case? Can we just force people to use transmute for that if they really need it? Seems kind of like a detail we can punt on…

Yes, bitcast. Offhand, I recall working on code in the Dolphin emulator that uses integer SIMD operations on floats for various reasons, such as quickly checking for denormals/NaN. In addition to the float<->int case there’s casting between different lane widths, like u32x4 <-> u64x2.

But sure, punting works, as does transmute.
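
To make the use case concrete, here is a scalar sketch of the kind of integer test on float bits described above. It is not taken from Dolphin; the function names and constants are illustrative, and a SIMD version would apply the same masks to a whole vector after a bitcast.

```rust
// IEEE 754 single-precision field masks.
const EXP_MASK: u32 = 0x7f80_0000;
const MANT_MASK: u32 = 0x007f_ffff;

/// NaN: exponent bits all ones and a non-zero mantissa.
pub fn is_nan_bits(x: f32) -> bool {
    let bits = x.to_bits();
    (bits & EXP_MASK) == EXP_MASK && (bits & MANT_MASK) != 0
}

/// Denormal: exponent bits all zero and a non-zero mantissa.
pub fn is_denormal_bits(x: f32) -> bool {
    let bits = x.to_bits();
    (bits & EXP_MASK) == 0 && (bits & MANT_MASK) != 0
}

/// Lane-wise check over a stand-in for an f32x4.
pub fn any_nan(v: [f32; 4]) -> bool {
    v.iter().any(|&x| is_nan_bits(x))
}
```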

I think this is covered by From impls on all of the integer vector types? Basically, every integer vector type has a From impl from every other integer vector type (of the same bit size). That way, you can move between them freely. All of those can just bitcast I think?

Oh, sure, if that’s included. I thought you had a more limited set of impls in mind.
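
A rough sketch of what such From impls could look like. The types U32x4 and U64x2 here are hypothetical array wrappers standing in for real SIMD types, so this only shows the shape of the API, not how std would implement it.

```rust
use std::mem::transmute;

// Hypothetical stand-ins for real SIMD vector types.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U32x4(pub [u32; 4]);

#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U64x2(pub [u64; 2]);

impl From<U32x4> for U64x2 {
    fn from(v: U32x4) -> U64x2 {
        // Both types are 16 bytes of plain integers, and every bit
        // pattern is valid for both, so the unsafety stays inside.
        unsafe { transmute(v) }
    }
}

impl From<U64x2> for U32x4 {
    fn from(v: U64x2) -> U32x4 {
        unsafe { transmute(v) }
    }
}
```

Callers then write U64x2::from(v) or v.into() with no unsafe block. Note that the lane values seen after the bitcast depend on the target's endianness.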

I agree. But I still think it needs to be stabilized in a crate, not standardized in std.

Having _mm_load_si128 and _mm_loadu_si128 provided by the crate is good. But being forced to have them cluttering up std forever? Please no. I guess that's why in the OP you mentioned the "special rules crate" possibility.

Where did the particular push for this thread come from, BTW? What kind of "in stable" is it looking for if "without stability" is a plausible solution?

Do we know how close we are to being able to write "canonical Rust" implementations of (the vast majority) intrinsics in nightly rust? Some easy cases work, like

#![feature(repr_simd)] // nightly only; needed for __m128i below
use std::mem::transmute;

#[repr(simd)]
pub struct __m128i(u64, u64);

struct i16x8(i16, i16, i16, i16, i16, i16, i16, i16);

#[no_mangle]
pub fn _mm_add_epi16(a: __m128i, b: __m128i) -> __m128i {
    let a: i16x8 = unsafe { transmute(a) };
    let b: i16x8 = unsafe { transmute(b) };
    let c = i16x8(
        a.0.wrapping_add(b.0),
        a.1.wrapping_add(b.1),
        a.2.wrapping_add(b.2),
        a.3.wrapping_add(b.3),
        a.4.wrapping_add(b.4),
        a.5.wrapping_add(b.5),
        a.6.wrapping_add(b.6),
        a.7.wrapping_add(b.7),
    );
    unsafe { transmute(c) }
}

produces the expected LLVM vector instruction (plus bitcasts), resulting in the expected assembly instruction

_mm_add_epi16:
	.cfi_startproc
	paddw	%xmm1, %xmm0
	retq

Note that this only needed unstable for the definition of __m128i, not for i16x8.

Maybe that's a way to go. Stabilize a some_ugly_name_for_a_128_bit_bag_type_128_bit_alignment hidden somewhere (with no methods or anything, but which might even be helpful as PhantomData until the alignment RFC stabilizes), with a comment that people should probably not use it directly, but probably want to use __m128i from the csimd crate.

Then write Rust implementations of all the intrinsics, not unlike the pattern above. Now you can immediately code using intrinsics, and they are functional, if not performant, on all targets (even ones like JavaScript). (A csimd_unstable crate that uses intrinsics, for people who want performant output now, would be a good thing to have in addition.)

Then there are just "should be better vectorized" bugs to fix. With a particular case in hand, it'd be easier to decide between "oh, we could translate that to LLVM better" or "man, LLVM should have noticed that" or "you know what? this case is weird enough that a std intrinsic sounds reasonable".

I think there are plenty of interesting things about codegen to be found doing that. For example, I tried this:

#[no_mangle]
pub fn _mm_adds_epi16(a: __m128i, b: __m128i) -> __m128i {
    let a: i16x8 = unsafe { transmute(a) };
    let b: i16x8 = unsafe { transmute(b) };
    let c = i16x8(
        a.0.saturating_add(b.0),
        a.1.saturating_add(b.1),
        a.2.saturating_add(b.2),
        a.3.saturating_add(b.3),
        a.4.saturating_add(b.4),
        a.5.saturating_add(b.5),
        a.6.saturating_add(b.6),
        a.7.saturating_add(b.7),
    );
    unsafe { transmute(c) }
}

After all, wrapping_add worked great, so I figured that'd give me paddsw no problem. But oh man is it unhappy! There's eight of these:

  %37 = extractelement <8 x i16> %bc102, i32 5
  %38 = extractelement <8 x i16> %bc103, i32 5
  %39 = tail call { i16, i1 } @llvm.sadd.with.overflow.i16(i16 %37, i16 %38) #4
  %40 = extractvalue { i16, i1 } %39, 0
  %41 = extractvalue { i16, i1 } %39, 1
  %42 = lshr i16 %38, 15
  %43 = add nuw i16 %42, 32767
  %_0.0.i80 = select i1 %41, i16 %43, i16 %40

It makes sense that that's what rustc needs to do given what's in LLVM, but ouch. No wonder it doesn't auto-vectorize. Maybe we could get an @llvm.sadd.with.saturation.i16? Would make rustc's life easier and add more autovectorization opportunities. Or I wonder what would happen if rustc emitted saturating_add as building a vector and using an llvm intrinsic? Is LLVM smart enough to combine vector types if some parts are undef?
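
For readers who don't want to decode the IR, the per-lane sequence above corresponds roughly to the following scalar Rust. This is a hand translation for illustration (the function name is made up), not what rustc literally contains.

```rust
/// Mirrors the IR: sadd.with.overflow, then on overflow pick
/// (b >> 15) + 32767, i.e. i16::MAX for positive b and i16::MIN
/// for negative b.
pub fn saturating_add_like_ir(a: i16, b: i16) -> i16 {
    // %39 = @llvm.sadd.with.overflow.i16(a, b)
    let (sum, overflowed) = a.overflowing_add(b);
    // %42 = lshr i16 b, 15 ; %43 = add nuw i16 %42, 32767
    let limit = (((b as u16) >> 15) + 32767) as i16;
    // select i1 overflowed, limit, sum
    if overflowed { limit } else { sum }
}
```

The data-dependent select on an overflow flag per lane is the part that has no direct counterpart in the wrapping_add version.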

From your comments earlier in this topic, it looks like you might have misunderstood what the blendvps instruction does. It doesn't use the 4 low bits of the mask argument; it uses the MSB of each of the 4 lanes. In a floating point number, the MSB is the sign bit, so if you pass it a vector of floats, it gives you a-lanes for the positive mask numbers, and b-lanes for the negative mask numbers. That's a perfectly useful operation on its own. For example, _mm_blendv_ps(a, 0, a) replaces all the negative lanes in a with 0.
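
A scalar model of that behavior, using plain arrays as a stand-in for f32x4 (the function below is an illustrative emulation, not the real intrinsic):

```rust
/// Per lane: take b where the mask's most significant bit (the sign
/// bit of an f32) is set, otherwise take a.
pub fn blendv_ps(a: [f32; 4], b: [f32; 4], mask: [f32; 4]) -> [f32; 4] {
    let mut out = a;
    for i in 0..4 {
        if mask[i].to_bits() >> 31 == 1 {
            out[i] = b[i];
        }
    }
    out
}
```

Calling blendv_ps(a, [0.0; 4], a) with a = [1.0, -2.0, 3.0, -4.0] gives [1.0, 0.0, 3.0, 0.0]. Because only the sign bit is inspected, a mask lane of -0.0 also selects b.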

Sure, there's also cases where your mask vector will be an integer, but Intel's choice of type for the mask is not wrong. And Clang is consistent with Intel's choice.

For sure. This is very common in simd code. An example is the mask argument to the blendvps instruction. I just argued above that it can be an f32x4, but it is just as commonly the output of a comparison which will be an integer vector.

I totally agree. To me (like I stated before) a simd value (read register in many cases) is something that holds some data. To the CPU this data is untyped and any instruction that operates on that register will use whatever data that is in it. Here is an example from algorithm - John Carmack's Unusual Fast Inverse Square Root (Quake III) - Stack Overflow

float Q_rsqrt( float number )
{
  long i;
  float x2, y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;
  i  = 0x5f3759df - ( i >> 1 );
  y  = * ( float * ) &i;
  y  = y * ( threehalfs - ( x2 * y * y ) );

  return y;
}

Now this exact operation isn't usually needed in SIMD as there are approximations for it but you get the idea.
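
For comparison, a sketch of the same trick in Rust, assuming a positive, finite input. The scalar bitcasts are spelled f32::to_bits and f32::from_bits, and the function name is just a transliteration of the C one.

```rust
/// Approximate 1/sqrt(number) for positive, finite inputs.
pub fn q_rsqrt(number: f32) -> f32 {
    let x2 = number * 0.5;
    // Bitcast the float to an integer and apply the magic constant.
    let i = 0x5f37_59df_u32.wrapping_sub(number.to_bits() >> 1);
    // Bitcast back, then do one Newton-Raphson refinement step.
    let y = f32::from_bits(i);
    y * (1.5 - x2 * y * y)
}
```

The point for this thread is that both directions are bitcasts between same-sized types, which is exactly the operation being discussed for vectors.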

I suppose I was confused by this part in the description: "The mask operand is the third source register, and encoded in bits[7:4] of the immediate byte(imm8)." as well as the operation of the BLEND instruction. I looked at the RTL again and you're completely right.

Actually, it looks like that's no longer a problem. Per LangRef for getelementptr-instruction:

When indexing into an array, pointer or vector, integers of any width are allowed, and they are not required to be constant. [...] The first type indexed into must be a pointer value, subsequent types can be arrays, vectors, and structs.

Nightly will even generate such a thing if you do things like write to a SIMD type through a borrow transmuted to &mut [f32; 4], like

  %8 = tail call float @llvm.fma.f32(float %5, float %6, float %7) #5
  %9 = getelementptr inbounds <4 x float>, <4 x float>* %r, i64 0, i64 0
  store float %8, float* %9, align 16

Noticed while experimenting further with what happens if I try to code intrinsics in Rust. Turns out that while "You can use llvm.fma on any floating point or vector of floating point type", it doesn't currently manage to turn an extract-fma-insert-each into an @llvm.fma.v4f32.

Rust + LLVM for fma thing
#[no_mangle]
pub fn _mm_fmadd_ps(a: Simd4x<f32>, b: Simd4x<f32>, c: Simd4x<f32>) -> Simd4x<f32> {
    Simd4x(
        a.0.mul_add(b.0, c.0),
        a.1.mul_add(b.1, c.1),
        a.2.mul_add(b.2, c.2),
        a.3.mul_add(b.3, c.3),
    )
}

; Function Attrs: nounwind readnone uwtable
define <4 x float> @_mm_fmadd_ps(<4 x float>, <4 x float>, <4 x float>) unnamed_addr #0 {
entry-block:
  %a.0.vec.extract = extractelement <4 x float> %0, i32 0
  %b.0.vec.extract = extractelement <4 x float> %1, i32 0
  %c.0.vec.extract = extractelement <4 x float> %2, i32 0
  %3 = tail call float @llvm.fma.f32(float %a.0.vec.extract, float %b.0.vec.extract, float %c.0.vec.extract) #5
  %a.4.vec.extract = extractelement <4 x float> %0, i32 1
  %b.4.vec.extract = extractelement <4 x float> %1, i32 1
  %c.4.vec.extract = extractelement <4 x float> %2, i32 1
  %4 = tail call float @llvm.fma.f32(float %a.4.vec.extract, float %b.4.vec.extract, float %c.4.vec.extract) #5
  %a.8.vec.extract = extractelement <4 x float> %0, i32 2
  %b.8.vec.extract = extractelement <4 x float> %1, i32 2
  %c.8.vec.extract = extractelement <4 x float> %2, i32 2
  %5 = tail call float @llvm.fma.f32(float %a.8.vec.extract, float %b.8.vec.extract, float %c.8.vec.extract) #5
  %a.12.vec.extract = extractelement <4 x float> %0, i32 3
  %b.12.vec.extract = extractelement <4 x float> %1, i32 3
  %c.12.vec.extract = extractelement <4 x float> %2, i32 3
  %6 = tail call float @llvm.fma.f32(float %a.12.vec.extract, float %b.12.vec.extract, float %c.12.vec.extract) #5
  %_0.0.vec.insert = insertelement <4 x float> undef, float %3, i32 0
  %_0.4.vec.insert = insertelement <4 x float> %_0.0.vec.insert, float %4, i32 1
  %_0.8.vec.insert = insertelement <4 x float> %_0.4.vec.insert, float %5, i32 2
  %_0.12.vec.insert = insertelement <4 x float> %_0.8.vec.insert, float %6, i32 3
  ret <4 x float> %_0.12.vec.insert
}

I'd like to impose a new parameter for discussion in this thread because it is too big and too unwieldy to keep rehashing the same thing over and over. If your proposal is to stabilize a Rusty SIMD API in std, then your proposal must also provide a way for Rust programmers to use vendor defined intrinsics such as _mm_cmpestri, _mm_sad_epu8, _mm_crc32_u64 and _mm_alignr_epi8 on stable Rust. If your proposal cannot account for using those intrinsics on stable Rust, then I think it is reasonable to say that it is dead on arrival. The reason why it's dead on arrival is because the set of intrinsics that do not fit in any obvious higher level cross platform abstraction dwarfs the set of intrinsics that make up our current simd crate or similar efforts such as the SIMD APIs in WebAssembly. Therefore, building a Rusty API for them is not feasible in a time frame measured on a scale of years. Thus, our only choice here is to stabilize some way to access them at the lowest common denominator: via an implementation of vendor defined interfaces. The canonical path for this stabilization is to expose them through std.

An argument could be made to identify the set of vendor defined intrinsics that are part of a reasonable higher level API (take the current simd crate as a straw man proposal), stabilize that higher level API and avoid defining any intrinsics that are covered by that API. The problem with this proposal is two-fold:

  1. We are almost 2 years after Rust 1.0 without stable SIMD support precisely because nobody can agree on what that higher level API looks like in Rust. Actually, that's probably not quite accurate. Possibly the issue is that Rust the language doesn't quite have everything we need to make the "best" SIMD API. See C like SIMD intrinsics for Rust · Issue #1639 · rust-lang/rfcs · GitHub for some details.
  2. The set of intrinsics that you omit is a teeny tiny portion of all available intrinsics, so you still need to provide everything else. At that point, for simplicity's sake, it seems reasonable to just provide all intrinsics anyway, because it seems surprising that they wouldn't be there. A key advantage of using vendor defined APIs is that there's an obvious mapping between what one does in C/C++ and what one does in Rust. Since the set of intrinsics is so large, the obviousness of this mapping is critical.

What all of this means is that we can punt on the Rusty API for now and move forward only with the vendor defined intrinsic interfaces. The Rust API can come later.

This suggestion was born out of a misunderstanding I had. Namely, that intrinsics like _mm_load_si128 or _mm_cmpestri are actually part of a vendor defined interface and are not tied to any specific compiler backend. I had thought that we were forced to stabilize a set of intrinsics that were very specific to llvm, but this is in fact not the case. We can provide an API that is completely divorced from LLVM.

I don't understand this question. Are you still referring to my weird "stable-but-not-stable" crate idea? If so, let's drop it. It was a bad idea. (I had thought we moved on from that a long time ago in this thread at least.)

AFAIK, there are no obvious blockers, but that is indeed a question that is my responsibility to answer as part of this stabilization effort. Thankfully, a lot of this is done already in the Clang and GCC compilers. All that's left for us to do is port it. Here are some SSE2 definitions in Clang: clang/lib/Headers/emmintrin.h at master · llvm-mirror/clang · GitHub --- Here is my in-progress work to provide the same API in Rust: https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse2.rs --- Note the use of llvm intrinsics internally and the use of llvm's cross platform SIMD APIs. The llvm intrinsics are, for example, something that we will never ever never stabilize because we can't guarantee a stable interface to them. Therefore, the only way to make an implementation that uses them on stable is to put them in std, since std is uniquely endowed with the ability to use unstable Rust features.


I'm hoping this resolves some of the other questions/comments in your post that I didn't directly respond to.

You're quite right. I guess I was assuming that all intrinsics would be marked as unsafe, as they currently are. Of course, most of them (everything except load/store/gather/scatter?) could be safe. Similarly, bitcasting between SIMD types of the same size could be safe.

Because the choice of type isn't always obvious, and sometimes there are multiple reasonable choices.

I agree 100% that there should be a wrapping_add that works. I'm not sure I agree that _mm_add_epi32 should be left out. There is a (small) advantage to including it, namely that rust would support the C api completely.

To summarize, we have three choices for the types of intrinsics:

  1. Copy the vendor exactly. In the case of Intel, this means using their weird untyped interface, with casts for everything (but not necessarily unsafe ones).

  2. Copy clang exactly. This would still involve casts whenever you wanted to use an unsigned variant, and possibly also for some intrinsics that don't have a single canonical type.

  3. Copy clang, but add all the variants. There are no casts, but a bunch more functions.
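
The difference between choices 2 and 3 can be sketched with scalar stand-ins. Everything here is hypothetical: I32x4 and U32x4 are array wrappers rather than real SIMD types, and add_epu32 is an invented name for the extra unsigned variant that choice 3 would add. Under choice 1, both functions would instead take and return a single untyped __m128i.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x4(pub [i32; 4]);

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U32x4(pub [u32; 4]);

// Safe, bit-preserving casts between the two integer vector types.
impl From<U32x4> for I32x4 {
    fn from(v: U32x4) -> I32x4 {
        I32x4(v.0.map(|x| x as i32))
    }
}

impl From<I32x4> for U32x4 {
    fn from(v: I32x4) -> U32x4 {
        U32x4(v.0.map(|x| x as u32))
    }
}

/// Choice 2: one signature with signed lanes; unsigned callers cast.
pub fn add_epi32(a: I32x4, b: I32x4) -> I32x4 {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a.0[i].wrapping_add(b.0[i]);
    }
    I32x4(out)
}

/// Choice 3: an additional variant so unsigned callers need no cast.
pub fn add_epu32(a: U32x4, b: U32x4) -> U32x4 {
    add_epi32(a.into(), b.into()).into()
}
```

Under choice 2 the caller writes the into() calls at every use site; under choice 3 they move into the library at the cost of more functions.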

There is also an orthogonal question: we want some nicer API on top of the bare intrinsics. Some of the intrinsics (e.g. _mm_add_epi32) will be redundant in the presence of the nicer API. Two choices:

a. include all the intrinsics. This has two advantages: it's more consistent with the vendor's C API, and we can stabilize it now without worrying about what the nicer API looks like.

b. Don't include all the intrinsics. This has the advantage of reducing "clutter" in std, although @burntsushi points out that the number of intrinsics in question is a tiny fraction.

I'm in favor of 1a here, although I think 1b (edit: sorry, I meant 2a...) is fine too.

I'm strongly in favor of a over b, because I like consistency and I continue to want to punt on almost every SIMD abstraction in the first pass, sans perhaps some safe type conversions and constructors.

I was in favor of 1 (copy Intel), but I am now leaning more towards 2 (copy Clang internals into our public API by using more appropriate vector types). I personally found @stoklund's arguments persuasive. My plan is to actually put it into practice and see how it works out.

By the way, #[target_feature] should be landing soon.

Question: does anyone have any contact with folks responsible for maintaining the SIMD Intel headers in either gcc or Clang? It feels potentially useful to get their views on this.

I’d like to take a stab at a concrete proposal given all the discussion we’ve had so far:

Motivation

  • It’s unlikely that we’ll get consensus on a higher-level abstraction any time soon and without having tried various APIs. In order to gain experience with SIMD API design, we want to be able to experiment with higher-level SIMD abstractions on stable Rust. We must therefore implement the lower-level intrinsics first.
  • Whatever lower-level abstraction we implement, we must make sure that it’s stably upgradeable to a higher-level abstraction in the future.
  • We must make sure that the lower-level intrinsics are sufficiently similar to the vendor documentation for ease of use (conversion of assembly/C and looking up behavior), and the lower-level intrinsics must remain easily usable when combined with higher-level abstractions.

Proposal

  1. Implement opaque types for common SIMD lane configurations which are the same across all platforms and which can be aliased in the future to some other type.
  2. Implement all named vendor-specific intrinsics with the correct type signatures.

std::simd

Contains public type aliases for the opaque types fWxN, iWxN and uWxN for W ∈ {16,32,64} for floats and W ∈ {8,16,32,64} for integers, and N such that 64-bit, 128-bit and 256-bit widths are defined. They must be type aliases so that they only exist in the type namespace and we can alias them to or redefine them as something stable in the future (like Simd<T>).

Bitcasts between the integer types are supported using as, otherwise use transmute.

std::intrinsics::arch

We’ll have one submodule per architecture, which will have a submodule per feature set, which will contain the relevant intrinsics.

For example:

mod arch {
  mod x86 {
    mod sse2 {
      fn _mm_cvtpd_pi32(a: f64x2) -> i32x2;
      // ...
    }
    mod sse3 {
      // ...
    }
  }
  mod arm {
      // ...
  }
  // ...
}

Side note: allowing bitcasts between integers means that we don’t have to worry so much about “missing” integer intrinsics for Intel.

Unresolved questions

  • Do we implement any architecture-independent intrinsics in round 1? (candidates: shuffle, insert/extract, broadcast, bitwise logic, arithmetic)
  • We can use the Clang database to get Intel’s types right, but it is not complete: the operations that Clang supports using operators are not in there.
  • For Intel operations that don’t specify a lane configuration (like PAND), do we introduce another opaque integer type __m128i which can also be bitcast?
  • Do we mark any intrinsics as safe? And in particular, do we mark intrinsics that could result in integer overflow as safe?

A note that is unrelated to the proposal, which I just thought of: if in the future we use array- or tuple-style intrinsics, we don’t really need a shuffle primitive, because you can pretty much write it directly as a tuple or array initializer. Similarly for insert/extract. Broadcast works easily with array-style intrinsics, as previously pointed out in this thread.
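
As a sketch of that idea, with plain arrays standing in for an array-style f32x4 (the function names are illustrative only, and whether the compiler lowers these to actual shuffle instructions is a separate question):

```rust
/// Shuffle written as an array initializer: reverse the lanes.
pub fn reverse(v: [f32; 4]) -> [f32; 4] {
    [v[3], v[2], v[1], v[0]]
}

/// A two-input shuffle: interleave the low halves of a and b.
pub fn interleave_lo(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0], b[0], a[1], b[1]]
}

/// Broadcast is the array repeat expression.
pub fn broadcast(x: f32) -> [f32; 4] {
    [x; 4]
}

/// Insert and extract are ordinary indexing.
pub fn insert(mut v: [f32; 4], lane: usize, x: f32) -> [f32; 4] {
    v[lane] = x;
    v
}
```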

We can't expose type aliases that alias to private types. Stated differently, if a type alias is public then so must be the types it points to. (If you try to do this today, you'll get a warning that this will eventually be a hard error.)

I don't think there's a problem with just using opaque types. Even if we didn't define target independent types, we'd need to define target dependent types. Either way, if it's not seamlessly upgrade-able to some other type representation, then we'll have to deal with that.

I don't think we should try to modify as to permit this. My vague feeling is that we'll get push back, and using From/Into conversions seems just fine to me.

This probably makes sense in today's world. My vague feeling though is that we'll want to use scenarios for this, in which case I think everything can just go in one flat std::intrinsics namespace. (Sad to say that I honestly don't know too much more than this, but it's what I've heard from around the block. @alexcrichton might be able to say more.)

I also suspect that the word intrinsics is not going to be popular, because it conflates "compiler intrinsics" with "vendor intrinsics." My current plan is that these functions will be normal functions and won't be stabilized as compiler intrinsics themselves. (Even though of course they'll be using them internally.)

Finally, I'm not quite sure about trying to group these intrinsics into fine-grained categories. We could just have one flat namespace and not bother with what's sse2 and what's ssse3, etc...

We don't. Not in this initial stabilization effort anyway. We just expose the types, maybe some constructors along with the vendor intrinsics and that's it. I hope we can look towards arch independent stuff later.

I think the idea here is to just use signed types for those operations and require folks to bitcast if necessary.

I think this is a possibility. It's either that or we add variants of those intrinsics for every type. e.g., _mm_and_si128_u8x16, _mm_and_si128_i8x16, _mm_and_si128_u16x8, ..., _mm_and_si128_u64x2, _mm_and_si128_i64x2.

I think the only question here is whether executing a non-existent CPU instruction is safe or not. I suspect it is, and therefore most intrinsics should be safe I think.

Normal integer overflow in Rust is safe. Is there a reason why overflow using vendor intrinsics wouldn't also be safe?