Getting explicit SIMD on stable Rust

Simd<[T; N]> is very appealing to me because it gives us structural typing. I think the primary reason to be against using array types is what to do with taking addresses.

That is, if Simd<[T; N]> is implemented as <N x T> in LLVM, you can’t use getelementptr on it. This means that if x is Simd([1, 2, 3, 4]), the two dots in x.0.0 need to be translated differently. More importantly, I don’t think taking the address of an element of an LLVM vector value works at all. This probably needs an extra type check to prevent taking addresses.

I think the Simd idea is orthogonal to stabilization, because we can migrate to it backward-compatibly using a type alias. Is this correct?

I don’t think dealing with Intel’s intrinsics is problematic. Annoying, yes, but not a problem. After all, the “low-level and high-level SIMD API with a common type” design simply mirrors what is available in LLVM: the llvm.x86.sse2.add.sd intrinsic and the add instruction both take the same LLVM vector type.

That is, we can just surface the type signature of the LLVM intrinsic, instead of casting as Clang’s intrinsic headers do for compatibility. In this case we are trusting LLVM to get the type signature right, which seems reasonable.

That is a good point. I forgot that the LLVM intrinsics are indeed typed. (Which is silly, because I knew that.)

Hmm, that's unfortunate. What's the fundamental reason for this restriction?

That said, I believe we already have all of the infrastructure needed for such a check in place for #[repr(packed)] (which only requires unsafe to take an address, instead of forbidding it outright), so the same machinery could probably be adapted for use here, and the restriction can always be lifted backwards-compatibly later on if there's a way to do so.

My understanding is that the unsafe check for #[repr(packed)], aka RFC 1240, is approved but not yet implemented.

By the way, I think there is no fundamental reason why taking the address of an element of an LLVM vector value shouldn’t work. It’s complicated because registers do not have addresses, but an i32 can also live in a register, and taking the address of an i32 works fine, because LLVM does all the appropriate transformations when promoting from memory to registers.

So as I understand it, it is a “simple” matter of implementing the memory-to-register promotion transformation for LLVM vector types; it is just unimplemented. In practice it is probably super complicated and needs tons of LLVM experience to implement.


Sorry to be thick, but I don't understand why this is the case. Couldn't it be safe to cast between integer types (e.g. from u8x16 to u16x8 or __m128i)?

I also don't see why the existence of (possibly unsafe) transmuting at the lower level precludes a nicer interface. For example, if Rust exposes _mm_add_epi16 exactly the way Intel defines it, I could still write

fn add_u16x8(a: u16x8, b: u16x8) -> u16x8 {
    unsafe { transmute(_mm_add_epi16(transmute(a), transmute(b))) }
}
fn add_i16x8(a: i16x8, b: i16x8) -> i16x8 {
    unsafe { transmute(_mm_add_epi16(transmute(a), transmute(b))) }
}

in my wrapper crate, and now everyone gets a typesafe interface without any casts.

That would be a reasonable option, although it doesn't seem obviously better than using Intel's definitions. For one thing, LLVM's choice to use signed instead of unsigned types is a little arbitrary. Another advantage of using Intel's type signatures is that they are well known and (more importantly, IMO) well documented. At some point I was trying to add support for _mm256_permute2x128_si256 to rustc. Can you guess what LLVM's corresponding type signature is? I had to search the LLVM source tree to find out, whereas Intel's documentation has a great web page.

Clang has a database of type signatures that we can use.

I think we should do something like this, but I would like to defer that discussion. We have a lot of people who just want to use the vendor intrinsics, and we should give them something as soon as possible. I proposed defining these types as opaque so that we have the freedom to do this later. Once we have the vendor intrinsics out the door, we'll have the time to do it right.

We should make sure that the types I propose here can be replaced with type aliases in the future without breaking existing sources.

FWIW, when we do have the time to do it right, I think we should aim higher than the Simd<T> construct. I would encourage you and @glaebhoerl to take a closer look at the vector types in the OpenCL language. They have really neat syntax for swizzles and shuffles.

This is correct, although there are some subtle issues on big-endian machines that we don't need to get into here.

This is exactly what I had in mind with my proposed types. I only ask that we start out with opaque types for the purposes of shipping the vendor intrinsics, and then change them to aliases like this later when we have a better design for the SIMD language support.

Please let's defer this if at all possible. I think we can do much better than Simd<T>, but designing that doesn't have to block the vendor intrinsics. (And I don't want to derail this already very long topic with that discussion).

It's worth remembering that we are talking about literally thousands of intrinsics. If this second interface uses different names, it becomes difficult to find anything.

The stack of abstractions in Clang goes like this:

  • _mm_sad_epu8(__m128i, __m128i) -> __m128i uses
  • __builtin_ia32_psadbw128(i8x16, i8x16) -> i64x2 which becomes
  • declare <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8>, <16 x i8>) in LLVM.

Note how the lower-level abstractions have correct types and only the Intel name defined in the header file has the weird types. That is because it has to be source compatible with the Intel and Microsoft compilers, and the integer types look like that for historical reasons. It can't be changed now in C compilers without breaking source compatibility.

We don't have the burden of source compat since all those existing sources are C and C++. So we don't need to carry that mistake (IMHO) forward.

Rust has excellent searchable documentation too, so as soon as _mm256_permute2x128_si256 is exposed as a function in the standard library somewhere, it will be trivial to look up its Rust signature.

I agree that the lack of signed / unsigned information is unfortunate. Maybe we can get our hands on a better database than Clang's?

FWIW, when we do have the time to do it right, I think we should aim higher than the Simd<T> construct. (snip) Take a closer look at the vector types in the OpenCL language. They have really neat syntax for swizzles and shuffles.

You know what? All of OpenCL's swizzles and shuffles were implemented in 2013. See [rust-dev] OpenCL-style accessors, casts.

@stoklund I’m trying to apply your proposal to practice. One thing I’ve come across (@eddyb pointed it out to me) is that not all vendor intrinsics actually have a type that is expressible in terms of lanes. Here’s an example: _mm_setzero_si128. What is its return type?

There are others: _mm_storeu_si128, _mm_castpd_si128, _mm_and_si128 (plus variants of those).

Do these mean we still need the __m128/__m128i/__m128d types? And therefore, there is still some casting required if, say, you want to apply _mm_and_si128 on a u8x16.

I guess another alternative eddyb had was to use generics to make those vendor intrinsic functions work with any of the lane-specific types. I’m not sure how that shakes out, but I imagine you’d just need to implement something that does automatic casting (and supports return type polymorphism for cases like _mm_setzero_si128).

You are right, some operations have types that don’t map immediately to the concrete types I proposed. I actually attempted to classify these things as part of the process of defining WebAssembly SIMD operations.

But let’s be concrete. The _mm_setzero_si128 intrinsic is actually provided by the impl Default for ... I suggested, but suppose we want the explicit ones too. We could define intrinsics:

fn _mm_setzero_si128_u8x16() -> u8x16;
fn _mm_setzero_si128_i8x16() -> i8x16;
fn _mm_setzero_si128_u32x4() -> u32x4;
fn _mm_setzero_si128_i32x4() -> i32x4;
...

Similarly, _mm_and_si128 corresponds to providing impl And for ..., but we could also provide per-type incarnations of this one. For comparison, the <arm_neon.h> header provides per-type vand_... functions.

Moving on, what about _mm_add_epi16? This is an operation on 16x8 sign-agnostic integers, so we would generate two bindings:

fn _mm_add_epi16_u(u16x8, u16x8) -> u16x8;
fn _mm_add_epi16_i(i16x8, i16x8) -> i16x8;

One day, we may be able to specify trait bounds on an intrinsic like this, so we could add:

fn _mm_add_epi16<T>(T, T) -> T  where T: ...mumble...

A number of intrinsics are inherently signed or unsigned, so you would only provide a single version:

fn _mm_sad_epu8(u8x16, u8x16) -> u64x2;

Now, it is important to realize that this is not nearly the combinatorial explosion it appears to be. Only the bitwise operators and a couple special cases like _mm_setzero_si128 need to be provided for all the integer types. The majority of operations only need signed/unsigned variants. We are talking and/or/xor versus thousands of operations that don’t explode.

The compromise mapping that Clang uses is to pretend that unsigned types don’t exist. Then the bulk of Intel integer intrinsics map to a single Rust signature.


I'm not familiar with the compiler, but here's my guess based on searching for the error message: Update TyS's is_simd, simd_type, and simd_size to allow TyArray. Something like (warning! not tried; probably doesn't compile)

    pub fn is_simd(&self) -> bool {
        match self.sty {
            TyAdt(def, _) => def.is_simd(),
            TyArray(def, n) => def.is_fundamental() && n.is_power_of_two(),
            _ => false
        }
    }

(I think Layout::compute_uncached and sizing_type_of would then do the right thing.)

But you can bitcast between %"Simd<[i32; 4]>"* and <4 x i32>* (seen here already happening in Rust stable) and [4 x i32] as well (presumably). So I'm not convinced this is an issue.

I don't think it's a nice abstraction if using the intrinsics requires unsafe. Why would we force an abstraction to wrap every single intrinsic (and there will always be some random intrinsic that's weird, and thus used rarely enough that nobody bothers to wrap it) when the intrinsics could take a reasonable type in the first place?

I totally agree that a nice interface to more complicated stuff like splitting, rearranging, sums-of-differences, etc. is out of scope for right now. But some things here just seem obvious to me, like that there should just be a wrapping_add that works, rather than _mm_add_epi32 and _mm_add_pi32 and unsigned versions and versions for all the sizes. I'm not even convinced that an intrinsic is necessary for it, as I'm confident that LLVM can turn "you added each element of a vector type" into "you added a vector type" (since it already turns "you added each element of an array-in-a-struct" into "you added a vector type").

What would opaque versions of these look like? How are they initialized or read from?

Technically, we don't have to provide anything at all to get started. In the official C API, the __m128 type is completely opaque, and you have to use intrinsics for everything.

Right. I do appreciate that this particular problem could be solved to some extent by a nicer abstraction.

How do you know when an intrinsic is sign-agnostic or not? I note that for _mm_add_epi16, the Intel intrinsic docs say this:

Add packed 16-bit integers in "a" and "b", and store the results in "dst".

But the Clang docs say this:

/// \brief Adds the corresponding elements of two 128-bit vectors of [8 x i16],
/// saving the lower 16 bits of each sum in the corresponding element of a
/// 128-bit result vector of [8 x i16]. The integer elements of both
/// parameters can be either signed or unsigned.

But if we look at _mm_sub_epi16, the Intel docs say:

Subtract packed 16-bit integers in "b" from packed 16-bit integers in "a", and store the results in "dst".

And the Clang docs say:

/// \brief Subtracts the corresponding 16-bit integer values in the operands.

If I had to guess, I'd say _mm_sub_epi16 is sign agnostic.

This pattern keeps going. Consider _mm_mulhi_epi16. Intel:

Multiply the packed 16-bit integers in "a" and "b", producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in "dst".

I'd guess that this is also sign agnostic, but Clang says:

/// \brief Multiplies the corresponding elements of two signed [8 x i16]
/// vectors, saving the upper 16 bits of each 32-bit product in the
/// corresponding element of a 128-bit signed [8 x i16] result vector.

I guess this is the price we wind up having to pay for better type safety here. We have to interpret every intrinsic carefully and diverge more and more from what the Intel headers say. I understand the divergence is not great and that a straightforward mapping is still maintained. But still, it makes me slightly queasy.

Question for the thread: Is standardizing the C API a goal?

My opinion: we should standardize something Rusty, and the C api should be in a c_simd crate.

For a trivial example, I don't think there's a place in std for _mm_setzero_si128. It feels like T::default() (as was mentioned) is the only reasonable name for that method in something "official".

I think standardizing something with a straight-forward mapping to the C API is a goal, yes.

I don't think there's a soul in the world that doesn't want a Rusty API. But the C API is tested and well known, and its path to stabilization is dramatically shorter than a Rusty API.

The high bits of a wide multiplication are not sign-agnostic :wink:

0x100i16 * -1i16 = -0x100i32 = 0xffffff00i32
0x100u16 * 0xffffu16         = 0x00ffff00u32

I don't think it would be wise to attempt this classification of the intrinsics manually. We need a database with the relevant information.

If we can't get our hands on such a database, we can fall back to the next best thing: Bind Intel intrinsics to signed types only, using the Clang database. We could provide signed/unsigned casts that don't change the lane geometry to still get some benefit from type checking.


This seems like the most prudent approach then.


We can't use that as a definitive source of truth. I obviously only took a cursory look just now, but I already found that they got BLENDV wrong (see discussion in this topic previously for why).

I don't think that's a good idea. If we go down this path, we need to maintain our own database, because no other language (I think?) cares enough about actual type safety and correctness for the low-level intrinsics. We can start out with the Clang database, but we have to verify everything. There are about 1400 Intel intrinsics (with overlap between single/double/128-bit/256-bit, and not counting AVX-512). If everyone in this topic (I'm guessing about 20 people?) takes 70 each to check, we're done.