Getting explicit SIMD on stable Rust


I’ve been busy, but my planned next step is to write up a pre-RFC.


I just released the 0.1 version of bitintr, my library for using the bit manipulation intrinsics supported by rustc (ABM, TBM, BMI1, BMI2). It is a bit off-topic because it is not about SIMD, but not completely off-topic because it is about intrinsics.


I think the best solution to the SIMD issues I am having is to go primarily down the Clang / OpenCL / GLSL / HLSL / Metal track and have something similar to [u8; 8], but for SIMD vectors, that magically supports swizzling, shuffling, and the more complicated constructors via the compiler. (This has been done before, as @sanxiyn mentioned.)

The main issue with the current SIMD implementation, for me, is that it complicates basic things massively compared to Clang C: the types are entirely arbitrary, and there are enough complications that I can’t even write a macro to abstract them away. I just want to write let position = matrix * float4(input.position, 1.0) or something like that when input.position is a float3, and have it work. If I have to add extern stuff to call intrinsics, that’s more than fine with me; I think we can leave most of that up to crates, at least to start with.
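To make the ergonomics concrete, here is a minimal, SIMD-free sketch of the call site I have in mind; every type and function name in it is made up for illustration:

use std::ops::Mul;

// Made-up illustration types; a real library would back these with vectors.
#[derive(Clone, Copy)]
pub struct Float3 { pub x: f32, pub y: f32, pub z: f32 }
#[derive(Clone, Copy)]
pub struct Float4 { pub x: f32, pub y: f32, pub z: f32, pub w: f32 }
pub struct Float4x4(pub [Float4; 4]);

// The widening constructor: float4(position, 1.0) appends a w component.
pub fn float4(v: Float3, w: f32) -> Float4 {
    Float4 { x: v.x, y: v.y, z: v.z, w }
}

impl Mul<Float4> for Float4x4 {
    type Output = Float4;
    fn mul(self, v: Float4) -> Float4 {
        // Row-major matrix * column vector, one dot product per row.
        let dot = |r: &Float4| r.x * v.x + r.y * v.y + r.z * v.z + r.w * v.w;
        Float4 {
            x: dot(&self.0[0]),
            y: dot(&self.0[1]),
            z: dot(&self.0[2]),
            w: dot(&self.0[3]),
        }
    }
}

// The line I actually want to write:
// let position = matrix * float4(input.position, 1.0);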

If we also want Intel-compatible intrinsics (as opposed to the LLVM ones) that actually match what we do in C, then we can let a crate or something provide that. I don’t think there are any actual roadblocks to that even with the current #[repr(simd)], and they could then be converted to the OpenCL-like types at a later date with no change in interface. I honestly think that if you just want to call random intrinsics, the current stuff is kinda okay.


I talked to @burntsushi on IRC and they told me to clarify why I think the current implementation of #[repr(simd)] is a less future-compatible building block for SIMD vector types. I’m not sure if I’m able to, but this is my attempt!

Instead of SIMD vectors behaving like struct X, I would like them to behave like [T; n]. This allows for some neat stuff later, so let’s imagine a macro simd_vector_ty![T; n] that returns a vector type, to avoid inventing any new syntax:

type f32x4 = simd_vector_ty![f32; 4];
type __m128 = simd_vector_ty![f32; 4];

It should in most ways act like a [T; n] array, but I’m not sure what would work best as a constructor: f32x4(x, y, z, w), simd_vector_ty![x, y, z, w], or just solving it the Intel-intrinsic way and not having a stable / portable constructor for now.
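Concretely, the candidates look like this (the first two spellings are hypothetical; the third is the existing Intel intrinsic, which takes its arguments highest lane first):

let a = f32x4(x, y, z, w);           // hypothetical function-style constructor
let b = simd_vector_ty![x, y, z, w]; // hypothetical macro constructor
let c = _mm_set_ps(w, z, y, x);      // the Intel way: per-target intrinsics only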

Utilising generic integers

If (when?) we get generic integers (I’m going to just make up syntax here, so ignore that bit), we want to be able to use them to make SIMD vectors more comfortable and generic. With tuple structs it isn’t obvious how we would utilise this, but if the type is like [T; n] then we’re forward compatible.

You’ll see how C++/Metal templates solve this compatibility problem in the appendix.

Utilising generic integers would allow us to write a smoothstep prototype as follows; this would not be possible with #[repr(simd)]:

fn<N: int> smoothstep(e0: f32, e1: f32, x: simd_vector_ty![f32; N]) -> simd_vector_ty![f32; N]

Or a dynamic shuffle function like:

fn<N: int, M: int> dynamic_shuffle(x: simd_vector_ty![f32; N], mask: simd_vector_ty![i32; M]) -> simd_vector_ty![f32; M]

We could even silently replace simd_vector_ty! with a normal macro that expands to SimdVector<T, N> or something, and keep old code compiling. How to do this transition in a struct-based approach isn’t obvious to me.
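Here is a minimal sketch of that migration, assuming a future SimdVector<T, N> type; I’m spelling the integer generic with const-generics-style syntax, which is an assumption on my part:

// Assumed future generic vector type, stubbed with an array for illustration.
pub struct SimdVector<T, const N: usize>([T; N]);

// Old spellings keep compiling: the macro now just expands to the new type.
macro_rules! simd_vector_ty {
    ($t:ty; $n:literal) => { SimdVector<$t, $n> };
}

#[allow(non_camel_case_types)]
type f32x4 = simd_vector_ty![f32; 4];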


Some vector operations just plain don’t match the Rust type system; we’ll use the LLVM instruction shufflevector as it is the most obvious example. LLVM describes it like this:

<result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <m x i32> <mask>    ; yields <m x <ty>>

The first two operands of a ‘shufflevector’ instruction are vectors with the same type. The third argument is a shuffle mask whose element type is always ‘i32’. The result of the instruction is a vector whose length is the same as the shuffle mask and whose element type is the same as the element type of the first two operands.

The shuffle mask operand is required to be a constant vector with either constant integer or undef values.

But this signature is not expressible in the Rust type system, at least not right now. (Generic integers and constexpr would be required, and we could map negative numbers to undef.)

But with some compiler magic I imagine this could be turned into an intrinsic: fn<T, U, Mask> shufflevector(v1: T, v2: T, mask: Mask) -> U, with a magic bound requiring Mask to be a constexpr [i32; n] and U to be a SIMD struct with the same element type as T and the same n as Mask. Unfortunately, this function would then look like a normal function but behave entirely differently.

We would call it with let v2 = shufflevector::<f32x4, f32x3, [i32; 3]>(v1, v1, [0, 1, 2]);, and it would be slightly better than the current hack, which is probably the worst-case scenario.

The thing is that I don’t want to expose this function from a generic wrapper. I might want to implement it as let v2 = v1.shufflevector::<f32x3, [i32; 3]>([0, 1, 2]) or something in my wrapper, but since that function isn’t expressible in the type system either, I cannot do that.

But suppose we drop the requirement that this must be forced into a function prototype, try to make the best of the situation, and instead implement it as faux structure members, as Clang/OpenCL/GLSL/HLSL/Metal/RenderScript do. Then suddenly it’s a familiar syntax for graphics programmers.

Then the shuffle from above could be written as let v2 = v1.s012;, which to me is clearer and shorter. If we use the OpenCL syntax, it covers up to 16 elements (s0 - sF), which is enough for AVX-512. These shuffles are completely portable and allow you to use the same code on, for example, ARM and AMD64, which is very nice.
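Spelled out, the accessors would look something like this (hypothetical Rust, mirroring OpenCL swizzles; none of this exists in rustc today):

let v2  = v1.s012;  // lanes [0, 1, 2] of an f32x4, yielding an f32x3
let rev = v1.s3210; // the same f32x4 with its lanes reversed
let dup = v1.s0000; // broadcast lane 0 to all four lanes
let xy  = v1.xy;    // OpenCL also aliases the first four lanes as xyzw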

But it has a catch. Since we’re no longer explicit with the types, the compiler needs to be able to figure out the return type of that accessor. If we allow more than one f32x3-like type, the compiler won’t be able to pick which one. This is where struct X becomes an issue. If we instead have vectors mirror [T; n], we can easily generate those types in the compiler without knowing the name of a specific struct.

The accessor syntax wouldn’t cover the issue of implementing intrinsics; we still need a shufflevector intrinsic for that. But since we don’t need to expose it to users, it can have a similar interface to what simd_shuffle has now.
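For reference, this is roughly the shape of that nightly interface today (a sketch; it sits behind a feature gate, and the mask must be a compile-time constant):

#![feature(platform_intrinsics)]

// Nightly-style declaration of a fixed-arity shuffle intrinsic.
extern "platform-intrinsic" {
    fn simd_shuffle4<T, U>(x: T, y: T, idx: [u32; 4]) -> U;
}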


This is the weakest argument, but if we ever let #[repr(simd)] leak out of std, we’ll have multiple types that are almost, but not completely, the same. I don’t see why anyone would want their own f32-but-with-different-traits type, and I do not think it is a good idea for vectors either, especially since they end up as exactly the same type at the LLVM level.

Appendix: Types like Clang

We should compare this to how other languages and implementations do this, since that should give us insight into the most common representation of these types. We’ll focus on the Intel intrinsics, since those are by far the most portable (supported by at least 4 different compilers), and then take a quick look at ARM and some accelerator/GPU variants.

From xmmintrin.h (the SSE intrinsics header) we have two variants, one from Clang and one from GCC:

typedef float __m128 __attribute__((__vector_size__(16))); // Clang's version
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__)); // GCC's version

In MSVC, __m128 is a keyword implemented with some compiler magic; with the Intel compiler it depends on the version. It was magic in older versions as far as I can see (typedef long long __m128), but now it seems to be something like #[decl(simd)] with a struct or union, depending on the exact type.

From arm_neon.h (the ARM NEON intrinsics header) we have two versions. Clang’s is like the one I propose, and __simd128_float32_t is a magic type that GCC implements internally:

typedef __attribute__((neon_vector_type(4))) float32_t float32x4_t; // Clang's version
typedef __simd128_float32_t float32x4_t; // GCC's AArch32 version

In metal_types.h (the type header for the Metal GPU language, which is essentially C++ for GPUs), Clang gives us an alias template that is again almost the same, but with integer generics support. This is what simd_vector_ty![T; n] could be if we got integer generics too:

template <typename T, int numElt> using vec = __attribute__((ext_vector_type(numElt))) T;

cl_platform.h (the host header for OpenCL) has multiple versions of this for various compilers:

typedef __attribute__((ext_vector_type(4))) float __cl_float4; // Clang
typedef __attribute__((vector_size(16))) float __cl_float4; // GCC
typedef vector float __cl_float4; // Standard AltiVec intrinsics on GCC/Clang/IBM VisualAge on PowerPC (also possibly MSVC, I have no access to the Xbox 360 compiler)
typedef __m128 __cl_float4; // GCC/Clang/Intel/MSVC on x86


@Aurora That looks similar to the Simd<T> idea floated by @scottmcm and me earlier in the thread. The main counterargument was that if we introduce opaque types like u64x2, i32x4, etc. for now, they can backwards-compatibly be replaced with typedefs for Simd<[u64; 2]>, Simd<[i32; 4]>, etc. later on.

Incidentally, couldn’t we just use plain array syntax for shuffles as well? E.g. let v2 = Simd([v1.0.0, v1.0.1, v1.0.2]) in the Simd<T> scenario (or ignore the Simd() and .0 noise otherwise).


It’s very similar to Simd<[T; n]>, @glaebhoerl, but with a primitive structural type instead of a magical nominal lang-item struct. I think they would both be implemented similarly in the compiler. I support that solution too.

We could do the migration from #[repr(simd)] to something else later, but is it worth it? I’m sure that if we teamed up we could get a prototype for simd_vector_ty![T; n] or Simd<[T; n]> up and running in a weekend, if people agreed on a basic syntax. It has been in the compiler before, so it shouldn’t be super-hard; I’m volunteering.

The issue with let v2 = Simd([v1.0.0, v1.0.1, v1.0.2]) is that it is very hard to read, and it would need an optimiser to turn all of that indexing and vector construction into a single shufflevector instruction. I’m not convinced LLVM can do that reliably, especially when building without optimisations / in debug mode.


@Aurora The plan is to stabilize intrinsics and the basic way to define vector types, so that we can build the lowest-level portable libraries around those, and on top of them build a stack. A plausible stack could look like this:

  • L0: std::intrinsics: non-portable unsafe simd compiler intrinsics and vector types
  • L1: low-level portable SIMD wrapper for x86/ARM/Neon/SVE with software fallback
  • L2: low-level portable SIMD algorithms for words
  • L3: low-level portable SIMD algorithms for sequences of bytes
  • L4: mid-level application SIMD algorithms (math, string processing, linear algebra, regex, stencils, images, CPU software rendering, CPU ray casting, collision detection…)
  • L-WorldDomination: high-level application libraries, e.g., a linear algebra library that, on shared memory, uses either rayon + SIMD on the CPU or OpenCL/CUDA on the GPU (or both), and MPI to distribute the work across a cluster.

So when you say:

I just want to write let position = matrix * float4(input.position, 1.0)

It looks to me like you want something at L4, which might be provided by a SIMD stack but, as you argue, could also be part of an OpenCL/CUDA or HLSL/GLSL stack.

The thing is, we are not even at L0 yet! And arguably, as you recognize, an L0 for OpenCL and CUDA would look very different from an L0 for SIMD.

So the main objective here is to get L0 done, so that people can start experimenting with building libraries to solve L>0. Some people will try to tackle L-WorldDomination directly (I’m looking at you @burntsushi and your absurdly fast regex implementations), and others will start by building L1 (which is what huon’s simd crate actually is). Hopefully, as time progresses, we’ll get good libraries for L1-3 so that nobody has to attempt to write them ever again, and maybe some day we’ll even get good L4 libraries for some well-understood application domains. I doubt we’ll ever get perfect L-WorldDomination libraries, since no language has them, and the higher level one goes, the larger the design space is and the more trade-off decisions one has to make (which for better or worse cannot always please everybody).


About “generic integers”, do you mean something like typenum?


typenum is a nice hack which somewhat eases the pain, but it has its own limits and costs: for non-trivial stuff, type signatures become bloated; errors are usually hard to read, because the types are defined recursively (this can be fixed, but still); and it has compatibility issues with arrays.
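For anyone who hasn’t used it, the recursion is visible in how typenum spells even small numbers (a minimal sketch using the typenum crate):

// The type-level number 4 is a chain of type-level bits: binary 100.
use typenum::{UInt, UTerm, B0, B1};
type U4 = UInt<UInt<UInt<UTerm, B1>, B0>, B0>;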

I think @Aurora implied something like this RFC.


@burntsushi any progress on the pre-RFC? Is there anything I could do to help?


FWIW I’ve started testing the assembly generated by rustc on different architectures (x86_64, armv7, and armv8) in my bitintr crate (see the asm/ directory). I’ve found a couple of issues already (in particular with respect to missing LLVM armv8 optimizations). If anybody working on SIMD support wants to use this, maybe we could refactor it into its own cargo-script.


16 elements is enough for vectors with 32-bit elements, but not for smaller element types. Shuffles for those are not part of AVX-512 proper, but are included by further instruction set extensions. Besides, current hardware SIMD width should not dictate the design of vector types, nor should syntax be designed under the assumption that hardware won’t change in the future.

I do admit that dedicated syntax is quite convenient when you need to shuffle a vector, but for the above reasons it can’t be the only way to shuffle. If it’s even a good fit for Rust to begin with, which I’m not sure of.

This is true, but it points to an issue with wrapper types: the primitive type would be privileged, gaining special syntax that a newtype can’t emulate. If you wanted to add a method, or present a different/more expansive interface to SIMD, you’d need an ugly extension trait, or client code would need to regularly take out the structural type, perform an operation (e.g., a shuffle) with it, and then wrap it back up. Moreover, if you wanted to represent a vector with some invariant as a newtype, you’d be out of luck.

So to encourage a strong ecosystem around SIMD computations, I believe our representation of shuffles should be expressible in library code, to enable newtypes. Naturally, we don’t currently have the means to express a fully generic shuffle signature; but luckily, we don’t need fully generic shuffles right now if we start with intrinsics. If, however, we start out with special syntax for shuffles, it becomes much harder to turn that into something that can be supported by libraries.
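A sketch of the kind of library code this enables, assuming a nominal f32x4 vector type and a fixed-arity shuffle intrinsic like today’s nightly simd_shuffle4 (both are assumptions here, not stable API):

// A newtype with an invariant can still offer shuffles, because the
// underlying operation is an intrinsic call rather than special syntax.
pub struct Direction(f32x4); // invariant: the vector is normalized

impl Direction {
    pub fn lanes_reversed(&self) -> Direction {
        // Reordering lanes preserves the length, so the invariant holds
        // and we may re-wrap the result.
        Direction(unsafe { simd_shuffle4(self.0, self.0, [3, 2, 1, 0]) })
    }
}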


I have never felt that this is a big issue in C++/ObjC, but I would love it if there were some sort of macros(?) to give libraries the ability to provide custom accessors on newtypes.

I want a generic interface for portable code (targeting at least AArch64 + AMD64), and you want close-to-the-metal performance for a single architecture. Fortunately these goals do not block each other in any way, but they are unfortunately also pretty much orthogonal. I think that’s why we have different priorities and see things differently.

I realise that AMD64 is the platform that 99% of all Rust code targets right now, so maybe the right choice is to just ignore portability for now, but I hope it is not.

The VBMI permutes are not an issue, since they are already expressible in the Rust type system: fn vpermb(a: __m512i, index: __m512i) -> __m512i, for example.


By that standard, there is no need for a generic shuffle operation since all extant and future shuffle instructions are available as intrinsics. But in fact, a generic shuffle operation is highly desirable, and it would be extremely sad if it was restricted to an arbitrary subset of sensible SIMD types.

Consider: LLVM happily shuffles vectors larger than the largest register supported by the hardware.


I do subscribe to the view that platform-specific types and intrinsics are the only things that can reliably be stabilized right now, but I’m also arguing with an eye towards a future where higher-level, portable SIMD libraries flourish. These libraries, and platform-agnostic code written using them, will have good reasons to want to create newtypes of vectors. Any special magic that gets added to SIMD types now will make such wrappers feel less first-class, and thus hurt that pretty, intrinsic-free code.


Both of the following functions produce shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2> in release mode today, FWIW:

pub fn rearrange_via_tuplestruct(a: Simd4<i32>) -> Simd4<i32> {
    Simd4(a.1, a.3, a.0, a.2)
}

pub fn rearrange_via_array(a: Simd4<i32>) -> Simd4<i32> {
    [ a[1], a[3], a[0], a[2] ].into()
}

(Helpers, including horrible transmutes, as a simd struct can't have an array in today's nightly:)

#[repr(simd)]
pub struct Simd4<T>(T, T, T, T);

impl<T> std::ops::Deref for Simd4<T> {
    type Target = [T; 4];
    fn deref(&self) -> &Self::Target {
        // Horrible transmute: &Simd4<T> and &[T; 4] point at the same layout.
        unsafe { std::mem::transmute(self) }
    }
}

impl<T> From<[T; 4]> for Simd4<T> where T: Copy {
    fn from(a: [T; 4]) -> Self {
        unsafe { std::mem::transmute_copy(&a) }
    }
}

a.1 and a[1] are the same LLVM GEP, so struct vs tuple vs array doesn’t really matter to it.

Of course, as Aurora said, this is the simplest case in release mode, not a proof of bet-your-business-on-it-ility. But I’m not convinced that either is “very hard to read”; they seem to quite explicitly and directly show what’s happening.


It should in most ways act like a [T; n] array

Is that actually desirable from the point of view of making costs explicit in a systems language?


I thought exposing generic LLVM shuffling had been deemed out of scope for the initial release-channel SIMD feature, and that initially only shuffles that map to specific instructions would be provided, in a manner similar to C intrinsics, even if underneath the API they generated shufflevector in the LLVM IR. Is this not so?

(Generic shufflevector is rather bad at making costs explicit. For example, LLVM uses shufflevector to view the higher or lower half of a NEON quadword register via its doubleword aliasing, but of course such viewing is only needed for type system purposes and there’s no instruction generated at all. OTOH, if you want a shuffle that the ISA doesn’t have a single instruction for, you get a non-obvious number of instructions.)

Anyway, I’d like to avoid slowdowns in getting SIMD to the release channel, so I’m worried about reopening the discussion regarding the types. Firefox is about to drop support for non-NEON ARM, so soon SIMD-in-release-channel-Rust will not only be SSE2-relevant for Firefox but NEON-relevant, too.


That’s because you’re using a much better syntax :wink:

I was discussing Simd([a.0.0, a.0.1, a.0.2]), which I find very hard to read. Simd4(a.1, a.3, a.0, a.2), as in your example, isn’t nearly as hard to read for me.


Looks like AVX512 will be available on regular server platforms: Rust will need to support it.


So, I’ve tried to implement memcmp and memcpy with what Rust currently has, and I couldn’t find a way to express the case that matters for implementing both of them: memory operands for some of the instructions. For example, in SSE there’s pcmpeqb, which admits an unaligned memory operand. That’s the missing piece for implementing memcmp between two differently aligned pointers.

Intel intrinsics do not cover this use-case (they all take vectors as well), and it might be hard to make LLVM emit such an instruction as well (to do that we’d need at least a way to do unaligned vector loads from the memory (i.e. load %vecty %ptr, align 1)).