Getting explicit SIMD on stable Rust


Historical note: SIMD types in Rust were originally implemented (in 2013, by me) as structural types, like tuples. (See #5841.) Graydon was “uncomfortable” with adding an entirely new kind of type and wanted to use attributes instead. But attributes couldn’t be attached to structural types, so SIMD types became nominal types.

I believe this was a wrong decision.


Can you please elaborate on the different kinds of types you’re talking about here?


(f64, f64) is a structural type. It is the same type everywhere and has no place of definition. struct f64x2(f64, f64) is a nominal type: it has a name and a place of definition, and it is incompatible with types defined in exactly the same way elsewhere.
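A scalar sketch of the distinction (the struct names here are made up for illustration):

```rust
// Tuples are structural: every (f64, f64) is the same type, no matter
// where it appears, so this function accepts any of them.
fn swap(p: (f64, f64)) -> (f64, f64) {
    (p.1, p.0)
}

// Structs are nominal: these two have identical layout, but they are
// distinct, incompatible types.
struct Xy(f64, f64);
struct Wh(f64, f64);

fn main() {
    assert_eq!(swap((1.0, 2.0)), (2.0, 1.0));

    let a = Xy(1.0, 2.0);
    // let b: Wh = a; // error[E0308]: mismatched types
    let _ = a;
}
```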

Now it is preferable to have user-defined SIMD types, so that people can name fields XYZ, RGB, etc., to convey what they mean. But when SIMD types are nominal, the XYZ and RGB types are incompatible with the f32x4 type defined in the standard library and used in standard library function signatures.

You can work around this by defining generic functions, like fn simd_add<T>(T, T) -> T. (This is the current status.) But that doesn’t help when you are trying to define a conversion from i32x4 to f32x4, since you can’t derive the output type from the input type.
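A scalar analogy of the problem, with ordinary generics standing in for the intrinsic machinery:

```rust
use std::ops::Add;

// "Same type in, same type out" is easy to express generically:
fn simd_add<T: Add<Output = T>>(x: T, y: T) -> T {
    x + y
}

// A conversion is not: the output type U cannot be derived from the
// input type T, so the caller has to spell it out every time.
fn simd_cast<T, U: From<T>>(x: T) -> U {
    U::from(x)
}

fn main() {
    assert_eq!(simd_add(2, 3), 5);

    // Type inference cannot pick U; we must annotate it ourselves.
    let y: i64 = simd_cast(7i32);
    assert_eq!(y, 7);
}
```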

It makes much more sense for SIMD types with the same layout to be automatically compatible with each other.


I think Rust should be optimized for writing new Rust code instead of for porting C code. Surely it’s better for Rust to excel at new code than at line-by-line ports of old code.

Surely the old thing can be kept around as an alias for the new thing. This seems like a poor reason not to have a cross-ISA syntax/API for the basic stuff.

Considering the overlap of what’s already in platform_intrinsics and @stoklund’s work on SIMD for WebAssembly, it seems to me that this concern is exaggerated. Also, if Rust went for vendor-specificity over cross-ISA operations due to being unable to decide how to name the cross-ISA stuff, Rust would have a very serious non-technical problem. I’d prefer such a non-technical problem to be addressed right away (by some sort of leader fiat on naming if needed) instead of letting a culture of bikeshed avoidance steer technical outcomes over time.


On the one hand, we have a vendor-specific API that is proven to cover a lot of use cases. There is literally no design work required. On the other hand, we have a cross-ISA abstraction that would take design work to bring to Rust. I don’t think anyone has argued against having that, but I think we need to be able to experiment with it. Stabilizing the vendor-specific APIs should enable folks to experiment with cross-ISA abstractions on stable Rust. Once we stabilize something in std, we’re pretty much stuck with it, so we should have very high confidence that whatever we stabilize is something we’ll be happy with for years to come. I feel like we can say that about vendor-specific APIs, but I have little confidence that we can say that about anything else at this point.

Just like Rust 1.0 was not the end, neither will this be. We can start with what we know works and build more convenient APIs later.

In general, I feel like your comment doesn’t quite resolve all of the concerns laid out by @comex either.

I can say from personal experience, as someone who is learning SIMD as they go in the context of Rust, that this tripped me up significantly. Just the mere act of untangling the differences between vendor-specific APIs, compiler built-ins, intrinsics and the special SIMD intrinsics exposed by Rust via LLVM has been a long, arduous journey. There’s really nothing that describes the relationship between all of them, and even as I’m working on a vendor-specific API in Rust now, it seems nearly arbitrary to me whether a vendor intrinsic is implemented in terms of a cross-platform API (e.g., _mm_slli_si128) or whether it’s implemented in terms of a compiler builtin/intrinsic (e.g., _mm_slli_epi16). (I don’t mean to claim that it’s actually arbitrary. I’m just confronting the limits of my own knowledge here.)

Another interesting thing I’ve learned is that many intrinsics don’t necessarily correspond to a single particular instruction. Depending on how the compiler is invoked, for example, _mm_slli_si128 might use either the PSLLDQ or VPSLLDQ instructions (even though the Intel documentation specifies PSLLDQ). I don’t know if there’s any actual practical ramifications of this though.


Today’s unstable interface between the compiler and library on one side and the simd crate on the other side is the extern "platform-intrinsic" {...} intrinsics and the #[repr(simd)] types. This interface has:

  1. Structural typing, provided by the somewhat quirky way the intrinsics bind to any SIMD type when they are declared.
  2. A rich set of basic operations that are available on any target.

I am not suggesting that we stabilize this particular interface, but it is the interface we have tested in the past year. The interface that is now being proposed for stabilization has:

  1. Nominal vendor-specific SIMD types.
  2. Vendor-specific names for operations that map to a single instruction, and nothing else.

We do not have experience implementing the simd crate on top of such an interface, and I’ll assume we can agree that we would need some experience with this interface before stabilizing it.

To bring up one concrete example of the problems we’ll find:

So how is the simd crate supposed to provide impl Mul for u32x4? Is it supposed to play tricks with pmulhuw on SSE2 targets? LLVM would do a much better job of that. Or are we not supposed to have a simd crate that provides portable types on top of this interface? Is this just about getting a stable per-vendor SIMD assembler in Rust?
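For concreteness, here is what a portable u32x4 multiply looks like when the library has to spell it out itself. The type and the scalar fallback below are illustrative only; a real SSE2 path would combine _mm_mul_epu32 with shuffles per lane pair:

```rust
use std::ops::Mul;

// Illustrative stand-in for the simd crate's u32x4.
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);

impl Mul for U32x4 {
    type Output = U32x4;
    fn mul(self, rhs: U32x4) -> U32x4 {
        // Scalar fallback, lane by lane. LLVM's generic vector multiply
        // would pick the best instruction sequence per target instead.
        let mut out = [0u32; 4];
        for i in 0..4 {
            out[i] = self.0[i].wrapping_mul(rhs.0[i]);
        }
        U32x4(out)
    }
}

fn main() {
    assert_eq!(
        U32x4([1, 2, 3, 4]) * U32x4([5, 6, 7, 8]),
        U32x4([5, 12, 21, 32])
    );
}
```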

Could you clarify this? Do you mean the C APIs?

For what it’s worth, Clang today has to support four slightly different variations of SIMD types at the language level in order to support those C APIs.

It’s been claimed many times in this topic that we can easily add the portable SIMD semantics later. There haven’t been any concrete examples of what that would look like. GCC and Clang were able to slide their portable SIMD types underneath the vendor intrinsic headers while preserving source compatibility. Rust has stronger typing than C, so it’s not obvious to me that we can pull off the same trick. Note that I don’t have a lot of experience with stabilization issues in Rust.


For it to work on stable Rust, the implementation would, I guess, have to use _mm_mul_epu32 when it’s available and equivalent functions for other platforms.

While LLVM could bail us out here with its SIMD multiplication intrinsic, it’s not going to bail us out for other things. For example, we should provide a way for folks to use, say, _mm_sad_epu8 (or equivalent) if they want to on stable Rust. If we provide that, then shouldn’t we also provide _mm_mul_epu32? I think the answer is “yes,” and this is completely orthogonal to whether our current simd crate could be (optimally) built on top of this interface or not. That is, we should probably be doing it anyway.

I feel like there is more focus on a convenient abstraction than on actually providing a way for folks to use vendor-defined intrinsics on stable Rust. A convenient abstraction covers a teeny tiny portion of what’s available. It doesn’t answer enough questions on its own. Therefore, if you want to argue in favor of a convenient abstraction, then I think you must simultaneously address the large swath of vendor intrinsics that won’t be part of that abstraction. How do you expose them?

Yes. The APIs defined by the vendors.

Can you unpack this a bit more? As a straw man to get things started, consider this design progression:

  1. We move forward with a no-frills port of vendor specified C APIs in Rust. This probably includes defining specific nominal types from each vendor.
  2. Some time later, we decide that we want a convenient SIMD abstraction. We’ve been experimenting with one, and the maintainers have become frustrated because they can’t access LLVM’s cross platform SIMD functions like simd_mul or simd_add.
  3. We decide we need to stabilize some way of getting at those cross platform functions. Whether it’s a thin wrapper or a thick abstraction in std, we realize that something about our vendor specific APIs (from 1) doesn’t mix well with this new abstraction. Therefore, we define a whole new set of cross platform SIMD types and provide a way to convert between them and the vendor specific types defined in 1.

Is there something in this progression that I’ve missed? Is it not feasible or possible?


I think you misunderstand my position. I am not arguing that we don’t expose all the vendor intrinsics. I am arguing that we make the effort to design a shared set of SIMD types and expose the vendor intrinsics on top of the shared types, and that the shared SIMD types provide basic arithmetic like a * b (or a.wrapping_mul(b), if you prefer).

The D language core.simd module is very close to what I have in mind. It provides a convenient foundation for defining the vendor intrinsics, and for defining abstract platform-independent functionality. Their matrix of supported basic operations is a good start. (We can do better than their typeless interface for accessing vendor intrinsics).

I don’t have a strong opinion on whether _mm_mul_epu32 should be exposed as well. I guess it makes sense to provide the intrinsic under target feature flags, so you can use it if it is important to you to know that it will map to a single instruction.

I don’t think there are technical obstacles to your strawman plan, but it would cause the ~500 lines of common operators in the simd crate to explode into some 150 individually spelled out functions per supported target architecture, not including patching all the weird holes in older SSE versions.

If I were the simd crate maintainer, I would wait for stable portable SIMD operations rather than attempting such an error-prone and temporary transformation.

Clang’s implementation of the Intel intrinsics headers uses the GCC-style vector types, so it is possible to write code like this:

__m128 mul4(__m128 x, __m128 y) {
    return x * y;
}

I don’t think MSVC lets you do the same.

The result is that when using Clang, you can write code using mostly the portable SIMD operators and only fall back to, say, NEON intrinsics when you need to. There is no need for typecasts when you use the intrinsics, and the NEON intrinsics are still type checked.

With your strawman plan, it seems to me that you would need some kind of casts when mixing portable code with vendor intrinsics, possibly even losing the benefits of type checking in the process? The vendor intrinsics effectively become typeless.

It seems unnecessary to me to create this split between nominal vendor types and portable SIMD types when there is wide agreement between GCC, Clang, D, OpenCL, and SIMD.js on what such a type system should look like. The only odd man out is the Intel headers, and then only for the integer types.

My alternative strawman proposal would be this:

  1. Provide a std::simd module which defines 64-bit, 128-bit, and 256-bit SIMD types with lanes that match Rust’s integer and floating point scalar types. So something like f32x2, f32x4, f32x8, f64x2, f64x4, u8x8, u8x16, ... u64x4, i8x8, i8x16, ...i64x4. No need for boolean or polynomial types to begin with. Let these types be opaque except perhaps for a T::splat(x) constructor and a Default implementation.
  2. Provide a no-frills port of vendor specified C APIs, but using the std::simd types. You will find that everything but the Intel integer types maps trivially.
  3. Add operators to the std::simd types following the D matrix.
  4. Have debates about whether we should provide boolean vector types.
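Step 1 of this proposal might look roughly like the sketch below. This is illustrative only: the real types would also need the appropriate alignment (e.g. via #[repr(simd)] internally), and only one of the lane types is shown.

```rust
// One of the proposed opaque lane types; the others follow the same shape.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct f32x4([f32; 4]); // field stays private: the type is opaque

impl f32x4 {
    /// The one constructor besides `Default`: broadcast a scalar to all lanes.
    pub fn splat(x: f32) -> f32x4 {
        f32x4([x; 4])
    }
}

fn main() {
    assert_eq!(f32x4::splat(0.0), f32x4::default());
}
```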


OK. I understand now. The part that I’m not understanding is why you think we must do this in our initial stabilization. If we’re going to stabilize the vendor intrinsic APIs in both your proposal and in mine, then why can’t we just start with that? (OK, I think I got it now. Read on.) Why do we need to also solve the “nice abstraction” problem too? (I will note that @eddyb can probably chime in with a number of reasons why that’s not feasible in today’s Rust because we don’t have the necessary language features.)

I see. Yes. It would require casting. I’m not really sure how one could button that up either, so it does seem slightly unfortunate.

Could you expand on why Intel is a problem here? It seems like the types you’ve given should map cleanly to the Intel intrinsics too?

Another option here is to take your proposal and cut it down to just (1) and (2). That would at least give us type safety, but would allow us to punt on convenient operations for now. Another question: if we didn’t have boolean vector types, then what would the return type of _mm_cmpeq_epi8 be, for example?

@eddyb What are your thoughts on @stoklund’s straw man proposal? How does it jibe with your ideal API given the requisite Rust language features?

cc @alexcrichton


Certainly. If you look in the <arm_neon.h> header file, you will find a set of typedefs that are basically equivalent to my proposed std::simd types. The Intel headers, however, only define 3 types (restricting ourselves to 128 bits without loss of generality):

  • __m128 which maps to f32x4,
  • __m128d which maps to f64x2, and
  • __m128i which corresponds to i8x16, i16x8, i32x4, or i64x2 depending on the instruction.

And there’s no distinction between signed and unsigned integers either. So the “no frills” port of intrinsics using the __m128i type needs to make a choice for that type. Some reasonable solutions would be:

  1. Provide an opaque Intel-specific __m128i type, and let the Intel vendor intrinsics be forever doomed to incessant casting when mixed with portable code.
  2. Map each instance of __m128i to one of the signed std::simd types based on the common-sense information we can extract from Clang’s database of builtins and LLVM’s type requirements for the corresponding intrinsics. Pretend the unsigned integer types don’t exist.
  3. As 2., but also provide unsigned versions of the intrinsics. Some intrinsics are inherently signed or unsigned (like saturating arithmetic). Many others don’t have a preference and would need to be provided in two versions.

Clang uses 2. for its builtins used to implement the Intel headers. So, to use your own example, _mm_sad_epu8 has the signature (__m128i, __m128i) -> __m128i in <emmintrin.h> as specified by Intel, but Clang’s __builtin_ia32_psadbw128 which implements this particular operation has the signature (i8x16, i8x16) -> i64x2. The “inherent” signature of the operation is (u8x16, u8x16) -> u64x2.
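A scalar model of the operation makes the “inherent” signature visible: each output lane is a sum of absolute differences over eight unsigned byte lanes. (Plain arrays stand in for the SIMD types here.)

```rust
// What PSADBW computes, written out with plain arrays: the natural
// signature is (u8x16, u8x16) -> u64x2.
fn sad_epu8(a: [u8; 16], b: [u8; 16]) -> [u64; 2] {
    let mut out = [0u64; 2];
    for i in 0..16 {
        // Absolute difference of unsigned bytes, accumulated per 8-lane half.
        out[i / 8] += (a[i] as i32 - b[i] as i32).abs() as u64;
    }
    out
}

fn main() {
    // Every lane differs by 1, so each half sums to 8.
    assert_eq!(sad_epu8([2; 16], [1; 16]), [8, 8]);
}
```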

OpenCL, which extends C with SIMD types, defines comparisons between vector types as returning a vector of signed integer type with the same lane configuration as the inputs:

int4 f32x4_lt(float4 x, float4 y) {
    return x < y;
}

The resulting i32x4 lanes are 0 for false and -1 for true. This is consistent with how most instruction set architectures work. (The exception on the horizon is AVX512 which has mask registers).

LLVM IR uses a single <4 x i1> type to represent the result of comparing both f32x4 and f64x4 vectors. Its lower-level code generator expands this into something similar to i32x4 or i64x4 depending on context.

Cretonne IR, which attempts to represent a lower abstraction level than LLVM IR, has types b1x4, b32x4, and b64x4 to cover all needs.

Boolean vector types are not present in vendor C mappings, except for <altivec.h>. This is probably related to the fact that C didn’t have a _Bool type when these things were defined.


My opinion basically matches @stoklund.

  1. SIMD types should be an entirely new kind of type, using structural typing, not attributes like #[repr(simd)].

  2. 1 does not exist today. So failing that, SIMD types should be defined in a cross-platform way, as TxN where T is a Rust scalar type and N is the number of lanes.

  3. There are opinions here that we should just define architecture dependent SIMD types, thus x86::f32x4, arm::f32x4, etc. This is a bad idea.

  4. There are opinions here that we should give up on type safe SIMD to match Intel header exactly, thus x86::__m128i instead of x86::u8x16. This is a very bad idea.


I highly support increased type safety. I feel that this point might be the most divisive in the vendor-specific vs. generic debate. A compromise might be that we do stabilize all vendor intrinsics but with “correct” types. I do worry specifically for the Intel ones that we’re not going to get the signatures right on all variants of all intrinsics. I already gave an example before of the BLENDV family of instructions (where the third operand is not a float) and I just found another gem in SSE2: ANDPx computes the bitwise AND of packed floating-point numbers.

They are basically the same instruction. If you look at the encoding table under PSLLDQ you’ll see the VPSLLDQ instruction there as well. In general the Vxxx instructions (if they exist) are AVX versions of the same xxx instructions. AVX supports non-destructive operation (the destination operand can be different from both source operands), 256-bit vectors for some instructions as well as unaligned memory operands. Mixing AVX and SSE instructions supposedly decreases performance.


Thanks for the pointer! There’s so much good stuff in that thread :trophy:

I read three requirements from the types section of RFC 1199:

  1. Primitive type, repeated 1<<N times
  2. No padding
  3. Appropriate alignment

“Something with the same layout but different type-level attributes” (3) makes me think newtype. And according to the repr(Rust) section of the Nomicon, existing Rust arrays meet (1) & (2):

However with the exception of arrays (which are densely packed and in-order), the layout of data is not by default specified in Rust.

Combining those in stable, without any special anything, even seems to compile to the LLVM vector instructions:

pub struct Simd<T>(T);

pub fn demo(a: &mut Simd<[i32; 4]>, b: &Simd<[i32; 4]>) {
    a.0[0] += b.0[0];
    a.0[1] += b.0[1];
    a.0[2] += b.0[2];
    a.0[3] += b.0[3];
}


define void @demo(%"Simd<[i32; 4]>"* nocapture dereferenceable(16), %"Simd<[i32; 4]>"* noalias nocapture readonly dereferenceable(16)) unnamed_addr #0 {
  %2 = bitcast %"Simd<[i32; 4]>"* %1 to <4 x i32>*
  %3 = load <4 x i32>, <4 x i32>* %2, align 4
  %4 = bitcast %"Simd<[i32; 4]>"* %0 to <4 x i32>*
  %5 = load <4 x i32>, <4 x i32>* %4, align 4
  %6 = add <4 x i32> %5, %3
  %7 = bitcast %"Simd<[i32; 4]>"* %0 to <4 x i32>*
  store <4 x i32> %6, <4 x i32>* %7, align 4
  ret void
}

and then

	movdqu	(%rsi), %xmm0
	movdqu	(%rdi), %xmm1
	paddd	%xmm0, %xmm1
	movdqu	%xmm1, (%rdi)

(The alignment RFC isn’t in nightly, right? I couldn’t find any way to force those align 4s to align 128s to see what would happen.)

So I’d tweak part 1 of stoklund’s proposal to something like this:

  1. Add pub struct Simd<T>(T); to the library. Don’t stabilize #[repr(simd)] for now, but use it internally to require T to be [ (i|u|f)N; 1 << M ] and to add the “appropriate” alignment to the monomorphized type. (It could also continue to do what it does today, for people opted-in in unstable.) Derive the obvious things, but otherwise add the absolute minimal set of trait implementations—I’m thinking just Index and IndexMut, plus maybe letting it coerce to a slice
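The Index/IndexMut part of that item can be sketched by forwarding to the wrapped array (minimal sketch, with no alignment handling or lane-type checking):

```rust
use std::ops::{Index, IndexMut};

pub struct Simd<T>(T);

// Forward indexing to the wrapped value, so Simd<[i32; 4]> indexes like [i32; 4].
impl<T: Index<I>, I> Index<I> for Simd<T> {
    type Output = T::Output;
    fn index(&self, i: I) -> &T::Output {
        &self.0[i]
    }
}

impl<T: IndexMut<I>, I> IndexMut<I> for Simd<T> {
    fn index_mut(&mut self, i: I) -> &mut T::Output {
        &mut self.0[i]
    }
}

fn main() {
    let mut s = Simd([1, 2, 3, 4]);
    s[0] += 10;
    assert_eq!(s[0], 11);
    assert_eq!(s[2], 3);
}
```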

Miscellaneous thoughts and justifications:

  • I kept trying to come up with a bikeshed syntax for what a new primitive type would look like for this. Given the similarity, I figured it should be close to arrays, but everything seemed either awkward or likely to collide with associated constants, const fn, value generics, etc. Simd<[T; N]> is surprisingly good.
  • Anyone who wants them can add type aliases for f32x4, i16x8, etc.
  • The layout is predictable enough that unsafe will give you reasonable conversions, so safe conversions are not needed for stabilization.
  • There’s no std trait for wrapping_add, so crates can write some trait you import to get it. Not implementing Add is intentional, as it should panic for overflow like normal [ui]N, and it feels like that would generate fundamentally not-SIMD-like code. (It could also be implemented later if that statement turns out to be incorrect.)
  • Layout-compatible conversions (so I can use .r .g .b .a instead of [0] [1] [2] [3]) are something that can be figured out later. I also think they should be discussed broader than just simd types, since I see no reason I shouldn’t always be allowed to convert my &[T;7] into &(T,T,T,T,T,T,T). (With generics I figure it’s harder to allow &(A,B,C) to convert to anything.) And if tuples are just structs with implied field names, maybe I can opt-in to the same stuff with something like #[repr(array)] struct RGBA { ... }
  • This doesn’t decide one way or another on whether intrinsics should be stabilized, or what their names should be. But it does say what their types should be.
  • I just realized I never provided anything to let you create such a thing. Maybe it should contain pub T? Describing broadcast as Simd([x; 4]) looks great, and even without alignment gives reasonable-seeming insertelement instructions in LLVM.


Why is this such a bad idea? I’m perfectly content to be forever doomed to incessant casting, especially since this is only the lowest-level interface, and a nicer abstraction can be built on top of it.


To me, it is a bad idea because of casting; no, really, unsafe transmuting. I don’t see how it can be implemented as casting, a.k.a. as.

I think a nicer abstraction cannot be built on top of it, because of said unsafe transmuting. Requiring unsafe transmuting is not a nice abstraction.


Why can’t a safe abstraction for casting be built on top of unsafe transmutes? What’s the problem with that?


I think we are talking past each other, but hopefully we will arrive at some common ground soon. As I understand it, one proposal is to have a “low-level SIMD API” with types like x86::f32x4 (or even x86::__m128), as well as a “high-level SIMD API” with types like f32x4. Since they won’t be compatible with each other, conversions will be necessary if you want to use the two APIs together.

Sure, since the conversion itself is safe, the conversion function can be a safe Rust function, just implemented with an unsafe transmute. But it is much nicer to do without a conversion function at all.
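For instance, a signed↔unsigned view conversion could be wrapped in a safe function like this. The types are stand-ins; the soundness of the transmute rests on the two having identical size and no invalid bit patterns:

```rust
// Stand-ins for a signed and an unsigned view of the same 128-bit vector.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct I32x4([i32; 4]);

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);

impl From<I32x4> for U32x4 {
    fn from(v: I32x4) -> U32x4 {
        // Safe interface over an unsafe transmute: both types are 16 bytes
        // and every bit pattern is valid for both.
        unsafe { std::mem::transmute(v) }
    }
}

fn main() {
    assert_eq!(U32x4::from(I32x4([1, 2, 3, 4])), U32x4([1, 2, 3, 4]));
}
```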

Here is my understanding of current proposals:

Proposal A:

  1. Have architecture-dependent SIMD types. (Not yet implemented, but easy.)
  2. Have a low-level SIMD API using 1. (Not yet implemented, but easy.)

Later,

  3. Have common SIMD types. (Already implemented, but not stable.)
  4. Have a high-level SIMD API using 2 and 3. (Not yet implemented; the current unstable implementation uses LLVM vector types and normal LLVM instructions, not architecture-dependent intrinsics. Implementing, for example, SIMD multiplication in terms of intrinsics involves legalizing or scalarizing depending on the architecture to fill in architecture capability holes, which duplicates what is already in LLVM.)
  5. Live with casting between 1 and 3 approximately forever.

A variant of the above is to not change the implementation of the high-level SIMD API: do not reimplement it in terms of the low-level SIMD API; just keep the current implementation.

Proposal B:

  1. Have common SIMD types. (Already implemented, but not stable.)
  2. Have a high-level SIMD API using 1. (Already implemented, but not stable.)
  3. Have a low-level SIMD API using 1. (Not yet implemented, and harder than in proposal A, but still easy, I think.)
  4. No casting.


I came here to make the same suggestion as @scottmcm.

As far as I know, a SIMD type in memory has the same layout as a fixed-length array, except for stricter alignment. Is that correct?

In that case, to avoid a combinatorial explosion of new nominal SIMD types – i8x16, i16x8, and so on, which we would presumably have to keep extending whenever larger register sizes are introduced – we can just define a single type:

struct Simd<T>(pub T);

This immediately lets us express any SIMD type we could ever possibly need as Simd<[i8; 16]>, Simd<[i16; 8]>, and so on, without any further definitions required.

Semantically, the Simd type behaves the same way as every newtype. You can use it with any T, put any value in, access it, and take it back out. It’s just a transparent wrapper.

For choices of T which correspond to valid SIMD types on the target architecture – that is [Prim; N] where Prim is a fixed-size primitive integer or floating point type, N is an appropriate power of 2, and multiplying yields a size of 128, 256, 512, or whatever – Simd<T> has the alignment of the corresponding machine SIMD type.

For all other choices of T, the alignment doesn’t really matter, and could be left unspecified, could be the same as that of T itself, or could be equal to size_of::<T>() (which I believe is always the case for SIMD types? so it would be consistent in that way).

This immediately gives us SIMD vector construction in terms of just array construction – per-element Simd([1, 2, 3, 4]), splatting Simd([42; 4]) – and deconstruction and lane extraction as array element access: simd.0[0], simd.0[1], simd.0[2], simd.0[3], and so on.
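In code, this looks like the following sketch (without the alignment machinery):

```rust
// The single transparent wrapper; alignment handling omitted.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd<T>(pub T);

fn main() {
    // Per-element construction and splatting are just array expressions:
    let a = Simd([1i32, 2, 3, 4]);
    let b = Simd([42i32; 4]);

    // Lane extraction is plain array indexing through the public field:
    assert_eq!(a.0[2], 3);
    assert_eq!(b.0[0], 42);
}
```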

Now the important bit: all of the exposed instrinsics would still be defined in terms of concrete choices of types. That is, with typedefs for convenience:

type f32x4 = Simd<[f32; 4]>;
type f64x2 = Simd<[f64; 2]>;

// I'm not familiar with all the vendor intrinsic names so I made some up for example's sake
fn _mwhatever_add_four_floats(a: f32x4, b: f32x4) -> f32x4;
fn _mwhatever_add_two_doubles(a: f64x2, b: f64x2) -> f64x2;

The win is just that we don’t need to define completely separate struct types for every single valid SIMD type and all of their construction, casting, and so on operations.

If and when the type system is suitably extended with e.g. constant generics, it will also be straightforwardly possible to define the signatures of generic operations such as fn simd_mul<T: Primitive, const N: usize>(Simd<[T; N]>, Simd<[T; N]>) -> Simd<[T; N]>, without having to define even more separate types for that purpose, but there’s no reason that needs to be done in the first iteration. (Just because the Simd type itself is generic, doesn’t mean that the operations defined over it need to be!)
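With const generics, that signature can be written out directly. The sketch below fixes the lane type to i32 and uses a scalar loop as a stand-in for the real vector operation:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd<T>(pub T);

// One generic signature covers every lane count; no per-width types needed.
pub fn simd_mul<const N: usize>(a: Simd<[i32; N]>, b: Simd<[i32; N]>) -> Simd<[i32; N]> {
    let mut out = [0i32; N];
    for i in 0..N {
        out[i] = a.0[i].wrapping_mul(b.0[i]);
    }
    Simd(out)
}

fn main() {
    assert_eq!(simd_mul(Simd([1, 2]), Simd([3, 4])).0, [3, 8]);
    assert_eq!(simd_mul(Simd([1, 2, 3, 4]), Simd([2; 4])).0, [2, 4, 6, 8]);
}
```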


Could someone comment on what it would take to make #[lang(the_simd_type)] struct Simd<T>(pub T); work in Rust today?


I think we’re actually at mutual understanding. I think your key point is the negative weight you’re assigning to an API that requires casts. My point is that if that API can be buttoned up behind a safe interface, then perhaps its negative weight is lessened a bit.

The additional point to make here, I think (as expressed by @jneem a few comments up), is that if you’re using the higher-level cross-platform API with the lower-level target-dependent vendor-defined intrinsics, then having to deal with a safe interface for casts doesn’t seem like the end of the world. I do expect that most uses of vendor intrinsics should themselves be bundled up behind safe platform-independent APIs. For example, if I wanted to write a SIMD-accelerated string searching library, then I’d probably be OK with a bit of inconvenience when using vendor-provided intrinsics, but consumers of my API shouldn’t have to deal with that.
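A sketch of that bundling, using the real SSE2 intrinsic _mm_add_epi32 behind a safe, platform-independent function (the function name and the array-based signature are made up for the example):

```rust
// On x86-64, lower to the SSE2 vector add intrinsic.
#[cfg(target_arch = "x86_64")]
fn add_4xi32(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    use std::arch::x86_64::*;
    // SSE2 is part of the x86-64 baseline, so these calls are always available.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let mut out = [0i32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_add_epi32(va, vb));
        out
    }
}

// Everywhere else, a portable scalar fallback with identical behavior.
#[cfg(not(target_arch = "x86_64"))]
fn add_4xi32(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

fn main() {
    // Callers never see the unsafe code or the target dispatch.
    assert_eq!(add_4xi32([1, 2, 3, 4], [10, 20, 30, 40]), [11, 22, 33, 44]);
}
```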

With all that said, if the only thing standing in the way of getting rid of that casting is defining some widely recognized architecture independent SIMD types, then that does feel like something that is viable to do. It sounds like the remaining challenge there is making sure we get the types right with Intel’s intrinsics.