I talked to @burntsushi on IRC and they told me to clarify why I think the current implementation of
#[repr(simd)] is a less future-compatible building block for SIMD vector types. I'm not sure if I'm able to, but this is my attempt!
Instead of SIMD vectors behaving like
struct X I would like it if they behaved like
[T; n]. This allows for some neat stuff later, so let's imagine a macro
simd_vector_ty![T; n] returning a vector type to avoid any new syntax:
type f32x4 = simd_vector_ty![f32; 4];
type __m128 = simd_vector_ty![f32; 4];
It should in most ways act like a
[T; n] array, but I'm not sure what would work best for a constructor:
f32x4(x, y, z, w),
simd_vector_ty![x, y, z, w], or just solving it the Intel intrinsic way and not having a stable / portable constructor for now.
Utilising generic integers
If (when?) we get generic integers (I'm going to just make up syntax here, ignore that bit), we want to be able to use them to make SIMD vectors more comfortable and generic. If we use tuple structures it isn't obvious how we would utilise this, but if it is a type like
[T; n] then we're forward compatible.
You'll see how C++/Metal templates handle this compatibility in the appendix.
Utilising generic integers would allow us to write a smoothstep prototype as follows, which would not be possible with the struct-based approach:
fn<N: int> smoothstep(e0: f32, e1: f32, x: simd_vector_ty![f32; N]) -> simd_vector_ty![f32; N]
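To make the prototype above a bit more concrete, here is a rough sketch that compiles on stable Rust. It assumes const generics (`const N: usize`) in place of the made-up `N: int` syntax, and it uses a plain `[f32; N]` array as a stand-in for `simd_vector_ty![f32; N]`, so it only illustrates the shape of the signature and not actual SIMD code generation.

```rust
/// Per-lane smoothstep over an array standing in for a SIMD vector.
/// Assumption: `[f32; N]` plays the role of `simd_vector_ty![f32; N]`.
fn smoothstep<const N: usize>(e0: f32, e1: f32, x: [f32; N]) -> [f32; N] {
    x.map(|v| {
        // Normalise the lane into [0, 1] relative to the two edges.
        let t = ((v - e0) / (e1 - e0)).clamp(0.0, 1.0);
        // Cubic Hermite interpolation: 3t^2 - 2t^3.
        t * t * (3.0 - 2.0 * t)
    })
}
```

The same function body then works for 2, 4, 8 or 16 lanes without a separate definition per vector type.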
Or a dynamic shuffle function like:
fn<N: int, M: int> dynamic_shuffle(x: simd_vector_ty![f32; N], mask: simd_vector_ty![i32; M]) -> simd_vector_ty![f32; M]
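A similar sketch for the dynamic shuffle, again with arrays as stand-ins for the vector types and const generics instead of the made-up syntax. How out-of-range or negative mask entries should behave is not specified above; returning `0.0` for them is purely an assumption of this example.

```rust
/// Runtime shuffle: output lane `i` is the input lane named by `mask[i]`.
/// Assumption: arrays stand in for `simd_vector_ty![..]`, and invalid
/// indices produce 0.0 (the real behaviour would be a design decision).
fn dynamic_shuffle<const N: usize, const M: usize>(x: [f32; N], mask: [i32; M]) -> [f32; M] {
    mask.map(|i| {
        if i >= 0 && (i as usize) < N {
            x[i as usize]
        } else {
            0.0
        }
    })
}
```

Note that the output length follows the mask length `M`, not the input length `N`, which is the part that is awkward to express with one named struct per vector type.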
We could even silently replace
simd_vector_ty! with a normal macro that expands to
SimdVector<T, N> or something and keep old code compiling. How to do this transition in a struct-based approach isn't obvious to me.
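As a sketch of that transition: the macro could become an ordinary `macro_rules!` macro used in type position. `SimdVector` here is a hypothetical library type wrapping an array, not anything that exists in `std`, and the const-generic syntax is an assumption standing in for whatever generic integers end up looking like.

```rust
/// Hypothetical generic vector type the macro could expand to later on.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SimdVector<T, const N: usize>([T; N]);

/// Old code keeps writing `simd_vector_ty![T; n]`; it now names `SimdVector<T, n>`.
macro_rules! simd_vector_ty {
    ($t:ty; $n:expr) => { SimdVector<$t, { $n }> };
}

#[allow(non_camel_case_types)]
type f32x4 = simd_vector_ty![f32; 4];
#[allow(non_camel_case_types)]
type __m128 = simd_vector_ty![f32; 4];

/// Both aliases name the same type, so no conversion is needed here.
fn takes_m128(v: __m128) -> f32x4 {
    v
}
```

Because both aliases resolve to the same `SimdVector<f32, 4>`, a value of one can be passed where the other is expected.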
Some vector operations just plain don't match the Rust type system; we will use the LLVM instruction
shufflevector as it is the most obvious example. LLVM describes it like this:
<result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <m x i32> <mask> ; yields <m x <ty>>
The first two operands of a ‘shufflevector‘ instruction are vectors with the same type. The third argument is a shuffle mask whose element type is always ‘i32’. The result of the instruction is a vector whose length is the same as the shuffle mask and whose element type is the same as the element type of the first two operands.
The shuffle mask operand is required to be a constant vector with either constant integer or undef values.
But this signature is not expressible in the Rust type system, at least not right now (generic integers and constexpr would be required; we could map negative numbers to undef).
But with some compiler magic I imagine this could be turned into an intrinsic.
fn<T, U, Mask> shufflevector(v1: T, v2: T, mask: Mask) -> U with a magic bound on Mask to be a constexpr
[i32; n] and
U to be a SIMD struct with the same element type as
T and the same length as
Mask. Unfortunately this function will then look like a normal function but behave entirely differently.
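For reference, the lane semantics such an intrinsic would have can be modelled on plain arrays. This is only a behavioural sketch under several assumptions: const generics stand in for generic integers, arrays stand in for vector types, a negative mask entry (the undef mapping suggested above) yields `T::default()`, and the mask is an ordinary runtime value, so the constant-mask requirement is not captured at all.

```rust
/// Reference model of LLVM's `shufflevector` on arrays (not an intrinsic).
/// Mask indices 0..N select from `v1`, N..2N select from `v2`.
/// Assumption: negative indices model `undef` and yield `T::default()`.
fn shufflevector_model<T: Copy + Default, const N: usize, const M: usize>(
    v1: [T; N],
    v2: [T; N],
    mask: [i32; M],
) -> [T; M] {
    mask.map(|i| {
        if i < 0 {
            return T::default(); // stand-in for an undef lane
        }
        let i = i as usize;
        // Index into the concatenation of v1 and v2; panics if i >= 2 * N.
        if i < N { v1[i] } else { v2[i - N] }
    })
}
```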
We would call it with,
let v2 = shufflevector::<f32x4, f32x3, [i32; 3]>(v1, v1, [0, 1, 2]);, and it would be slightly better than the current hack, where https://github.com/BurntSushi/stdsimd/blob/master/src/x86/sse2.rs#L956-L1008 is probably the worst-case scenario.
The thing is that I don't want to expose this function for a generic wrapper. I might want to implement it as
let v2 = v1.shufflevector::<f32x3, [i32; 3]>([0, 1, 2]) or something in my wrapper, but since this function isn't expressible in the type system either, I cannot do that.
But if we drop the requirement that we must force this into a function prototype, and instead make the best of the situation by implementing it as faux structure members as Clang/OpenCL/GLSL/HLSL/Metal/RenderScript do, then suddenly it's a familiar syntax for graphics programmers.
Then the shuffle from above could be written as
let v2 = v1.xyz;, which to me is clearer and shorter. If we use the OpenCL syntax then it covers up to 16 elements (s0 - sF), which is enough for AVX512. These shuffles are completely portable and allow you to use the same code on ARM and AMD64 for example, which is very nice.
But it has a catch. Since we're not explicit with the types anymore, the compiler needs to be able to figure out the return type of that accessor. If we allow more than one
f32x3 kind of type, then the compiler won't be able to pick which one. This is where
struct X becomes an issue. If we instead have vectors mirror
[T; n], we can easily generate those types in the compiler without knowing the name of a specific struct.
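A small sketch of why the structural view helps here. `Swizzle4` is a hypothetical trait used only for illustration (the real thing would be a compiler-provided field accessor, not a method), and arrays again stand in for vector types. The point is that the result type of each accessor is fully determined by the element type and the number of selected lanes, with no struct name to look up.

```rust
/// Hypothetical swizzle accessors for any 4-lane array-like vector.
trait Swizzle4<T> {
    fn xyz(self) -> [T; 3];
    fn wzyx(self) -> [T; 4];
}

impl<T: Copy> Swizzle4<T> for [T; 4] {
    /// First three lanes; the return type is simply `[T; 3]`.
    fn xyz(self) -> [T; 3] {
        [self[0], self[1], self[2]]
    }

    /// All four lanes in reverse order.
    fn wzyx(self) -> [T; 4] {
        [self[3], self[2], self[1], self[0]]
    }
}
```

With nominal structs, the equivalent of `xyz` would have to pick one particular `f32x3` struct among possibly many, which is exactly the ambiguity described above.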
The accessor syntax wouldn't cover the issue of implementing intrinsics; we still need a
shufflevector intrinsic for that, but since we don't need to expose that to users, it can have a similar interface to what
simd_shuffle has now.
This is the weakest argument, but if we ever let
#[repr(simd)] leak out of
std then we'll have multiple types that are almost, but not completely the same. I don't see why anyone should want their own
f32-but-with-different-traits type, and I do not think it is a good idea for vectors either. Especially since they end up as exactly the same type on the LLVM-level.
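To illustrate the concern with ordinary stable Rust: the two structs below are hypothetical stand-ins for vector types defined independently by two crates (plain tuple structs here, since `#[repr(simd)]` is not available on stable). They have the same fields and size, yet the type system treats them as unrelated, so every boundary between the crates needs an explicit conversion.

```rust
/// Hypothetical vector type from one crate.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CrateAF32x4(f32, f32, f32, f32);

/// Hypothetical vector type from another crate, identical in layout.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CrateBF32x4(f32, f32, f32, f32);

/// Glue that users would have to write (or depend on) for each pair of crates.
impl From<CrateAF32x4> for CrateBF32x4 {
    fn from(v: CrateAF32x4) -> Self {
        CrateBF32x4(v.0, v.1, v.2, v.3)
    }
}
```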
Appendix: Types like Clang
We should compare this to how other languages and implementations do this, since that should give us insight into what is the most common representation of these types. We'll focus on the Intel intrinsics since those are by far the most portable, supported by at least 4 different compilers, then take a quick look at ARM and some accelerator/GPU variants.
In xmmintrin.h (the SSE intrinsics header) we have two variants, one from Clang and one from GCC:
typedef float __m128 __attribute__((__vector_size__(16))); // Clang's version
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__)); // GCC's version
In MSVC, __m128 is a keyword (https://msdn.microsoft.com/en-us/library/2e6a4at9.aspx) with some magic; for the Intel compiler it depends on the version. It was magic in older versions as far as I can see
typedef long long __mm128, but now it seems to be something like
#[decl(simd)] with a struct or union depending on the exact type.
In arm_neon.h (the ARM Neon intrinsics header) we have two versions. Clang is like the one I propose, and the
__simd128_float32_t type is a magic type that GCC implements internally:
typedef __attribute__((neon_vector_type(4))) float32_t float32x4_t; // Clang's version
typedef __simd128_float32_t float32x4_t; // GCC's AArch32 version
In metal_types.h (the type headers for the Metal GPU language, which is essentially C++ for GPUs) from Clang we get an alias template that is again almost the same but with integer generics support. This is what
simd_vector_ty![T; n] could be if we got integer generics too:
template <typename T, int numElt> using vec = __attribute__(( ext_vector_type(numElt))) T;
cl_platform.h (the host header for OpenCL) has multiple versions of this for various compilers:
typedef __attribute__((ext_vector_type(4))) float __cl_float4; // Clang
typedef __attribute__((vector_size(16))) float __cl_float4; // GCC
typedef vector float __cl_float4; // Standard AltiVec intrinsics on GCC/Clang/IBM VisualAge on PowerPC (also possibly MSVC, I have no access to the Xbox 360 compiler)
typedef __m128 __cl_float4; // GCC/Clang/Intel/MSVC on x86