Getting explicit SIMD on stable Rust

OK, confusion is cleared up I think. As a simple PoC, this works:

[andrew@Liger simdtest] cat test.c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

__attribute__((target("avx2")))
int64_t test_avx2() {
    __m256i a = _mm256_set_epi64x(1, 2, 3, 4);
    __m256i b = _mm256_set_epi64x(5, 6, 7, 8);
    __m256i r = _mm256_add_epi64(a, b);
    return ((int64_t *) &r)[3];
}

int main() {
    printf("%ld\n", test_avx2());
    return 0;
}
[andrew@Liger simdtest] clang test.c -O
[andrew@Liger simdtest] ./a.out 
6

With the inline assembly approach, you'd have functions like this:

// Assumes a #[repr(simd)] __m256i type and nightly's current asm! syntax.
#[inline(always)]
fn _mm256_add_epi64(a: __m256i, b: __m256i) -> __m256i {
    let result: __m256i;
    // "x" constrains the operands to vector registers; which registers
    // get used is still up to the compiler.
    unsafe { asm!("vpaddq $1, $2, $0" : "=x"(result) : "x"(a), "x"(b)); }
    result
}

There are no ties to specific registers; the compiler is still responsible for register allocation. The main drawback compared to intrinsics is that the compiler doesn't have an estimate for instruction cost, so the scheduling may be worse.

(I verified that this actually works with the appropriate feature flag. I hope the asm syntax is improved before stabilization, though.)

This, by the way, is how we could implement intrinsics that are missing in LLVM.

Small suggestion re: target_feature from an embedded ARM perspective:

If you use “+avx” instead of just “avx”, it opens the door to a future “-avx” syntax, that is, CLI settings that disable a feature. LLVM supports this, and it is occasionally useful, e.g. for avoiding the use of floating point in interrupt handlers.
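
For illustration, here is how the two spellings could line up. The attribute form is straw syntax for this thread, and mm256_add_epi64 is an assumed intrinsic name:

// CLI (already works today): -C target-feature=+avx2,-sse
// Per-function straw syntax using the same "+FEAT" spelling:
#[target_feature = "+avx2"]
unsafe fn add_avx2(a: __m256i, b: __m256i) -> __m256i {
    mm256_add_epi64(a, b) // assumed intrinsic name, for illustration only
}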

3 Likes

Simply because the other stuff already exists on nightly and is useful for SSE2 without runtime detection; stabilizing it is mainly a matter of allowing the existing features to be used on the release channel. I do want to have runtime detection, I just don't want the existing features to wait for one that isn't yet on nightly.

By having rustc emit the same LLVM IR as clang emits for what looks to the user like an Intel intrinsic in emmintrin.h. For example, a user-facing x86_mm_unpacklo_epi8 "intrinsic" would internally map to simd_shuffle16(a, b, [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]) as in emmintrin.h (though rustc already builds a bit of type safety (number of lanes) on top of the underlying LLVM shufflevector instruction).
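
As a rough sketch of what that mapping could look like on today's nightly (my own reconstruction with an i8x16 stand-in for the lane types, not the actual rustc internals):

#![feature(repr_simd, platform_intrinsics)]

#[repr(simd)]
#[derive(Copy, Clone)]
struct i8x16(i8, i8, i8, i8, i8, i8, i8, i8,
             i8, i8, i8, i8, i8, i8, i8, i8);

extern "platform-intrinsic" {
    fn simd_shuffle16<T, U>(x: T, y: T, idx: [u32; 16]) -> U;
}

// Interleave the low 8 lanes of a and b, as in clang's emmintrin.h.
unsafe fn x86_mm_unpacklo_epi8(a: i8x16, b: i8x16) -> i8x16 {
    simd_shuffle16(a, b, [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23])
}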

OK. Nice.

Since actual intrinsics already work on nightly, I think the focus should be on shipping those. It seems weird to ship something worse due to concerns. (I'm fine with shipping asm!, too. I'm just unhappy about the prospect of shipping asm! becoming an excuse not to ship repr_simd and platform_intrinsics.)

As noted above, whenever there's an ISA vendor-defined intrinsic in a C header for which a corresponding LLVM intrinsic does not exist, the correct solution is to look at what C incantation the header file has for clang and what LLVM IR clang produces for that incantation, and then make rustc produce the same LLVM IR. (Though occasionally there are ways to start from slightly different code and still get the right optimization results. Compare the Rust code for unaligned loads with the emmintrin.h code for unaligned loads.)
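
To illustrate the parenthetical: one possible Rust incantation for an unaligned load starts from a byte copy rather than emmintrin.h's packed-struct load, yet optimizes to the same movdqu. A minimal sketch, assuming nightly's repr(simd):

#![feature(repr_simd)]

use std::ptr;

#[repr(simd)]
#[derive(Copy, Clone)]
struct __m128i(i64, i64);

// Zero-initialize, then overwrite all 16 bytes; the dead stores optimize
// away, leaving a single unaligned 128-bit load.
unsafe fn loadu_si128(p: *const u8) -> __m128i {
    let mut r = __m128i(0, 0);
    ptr::copy_nonoverlapping(p, &mut r as *mut __m128i as *mut u8, 16);
    r
}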

2 Likes

This connected a few dots for me. Very helpful. Thank you!

1 Like

That’s a very good point: +FEAT and -FEAT match what users already supply to -C target-feature on the command line. That it also helps embedded developers is nice, too.

Here's a random collection of thoughts I had while catching up on the replies in the last day:


I agree that intrinsics and runtime detection support should be pursued as separate features, and that the former can and should be stabilized before the latter. However, I don't think it's a good idea to stabilize the SIMD-related features as they currently exist on nightly. There are a number of issues and unresolved questions discussed in the tracking issue:

The most important issues seem to be post-monomorphization errors with "generic intrinsics" such as simd_eq and the exact interface to the intrinsics.


Regarding naming, I think sticking to the vendor names is good simply for the buy-in from people used to intrinsics in C. They're atrocious for regular use, but higher-level wrappers can and should be developed on crates.io. The only changes I would make are namespacing and dropping the leading underscores (which I think are just an artifact of C lacking namespacing). So, to be specific, the C intrinsic _mm256_set_epi64x should be mapped to std::intrinsics::mm256_set_epi64x. I could be convinced to make further mechanical adjustments (e.g., remove the mm_ and mm256_ prefixes and put them in feature namespaces such as std::intrinsics::sse, std::intrinsics::avx2).


Regarding intrinsic signatures: I see the appeal of getting more type safety. But I am also quite skeptical that we will find exactly the right way to do all this in the first pass (assuming there is even a single right answer for all intrinsics), and I'd rather not block stabilization on that issue. IIRC the main concern with just copying C's intrinsic signatures was mixing of floats and integers, but if we make it so the already-existing-in-C-but-apparently-not-enforced types __m128 (f32), __m128d (f64), and __m128i (integers of any bit width, which is harmless as far as UB is concerned) are separate types and require transmute (or unsafe utility methods) to convert between them, that should solve those concerns.
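
A minimal sketch of that idea (the conversion helper's name is mine):

#![feature(repr_simd)]

use std::mem::transmute;

#[repr(simd)]
#[derive(Copy, Clone)]
struct __m128(f32, f32, f32, f32);

#[repr(simd)]
#[derive(Copy, Clone)]
struct __m128i(i64, i64);

// Both types are 128 bits wide, so this is a pure reinterpretation; making
// it explicit keeps floats and integers from mixing silently.
fn as_integer_vector(v: __m128) -> __m128i {
    unsafe { transmute(v) }
}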


Regarding platform-agnostic intrinsics such as simd_add<T>, I am ambivalent. On the one hand, they're certainly aesthetically pleasing (though I am a bit worried that the exact set of operations is rather arbitrary, as in, basically what LLVM decided to add IR support for). But on the other hand, they also cause problems:

  • trans-time errors as discussed in the tracking issue linked above
  • need for richer signatures encoding lane count for all values (rather than just the int/f32/f64 split that can be lifted verbatim from the C headers)

I believe code that's already knee-deep in platform-specific intrinsics won't significantly suffer from not having those higher-level intrinsics, and any higher-level library has the opportunity to abstract over these operations with traits, which it will need anyway (for its own API and for its clients). In short, I don't believe the troubles are worth solving for the MVP (explicit C-style intrinsics on stable). Even if we later add such intrinsics, having the C-style platform-specific intrinsics available might be useful for porting C code to Rust.
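
For example, such a library could hide the platform-specific intrinsics behind a trait along these lines (a sketch; mm_add_epi64 stands in for whatever the stabilized intrinsic ends up being called):

// Hypothetical higher-level trait a crates.io library could expose.
trait VectorAdd {
    fn vector_add(self, other: Self) -> Self;
}

impl VectorAdd for __m128i {
    #[inline(always)]
    fn vector_add(self, other: Self) -> Self {
        unsafe { mm_add_epi64(self, other) } // assumed intrinsic name
    }
}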

3 Likes

I believe code that's already knee-deep in platform-specific intrinsics won't significantly suffer from not having those higher-level intrinsics […]

I would like to learn more about this. How do we stabilize intrinsics that simd_add is supposed to replace if LLVM itself doesn't expose them? Should we provide an implementation ourselves using simd_add internally, for example?

By implementing them so that they emit the same LLVM IR as clang does. I'm not sure if that's completely trivial in general, but in the case that there's a one-to-one mapping from LLVM intrinsic to platform intrinsic (e.g. in the case of adding vectors) it should be easy to get right.

As @jneem said, you’d basically “special case” the intrinsics to generate the proper IR instructions instead of a call to a corresponding LLVM intrinsic. For example, _mm_add_ps(a, b) would generate a vector addition fadd <4 x float> %a, %b. Experimentation with gcc.godbolt.org (not as good as just running clang locally and inspecting the output, but good enough as I don’t have a clang at hand right now) suggests that that’s the full extent of what Clang does:

  • It doesn’t try to be smart about what LLVM vector type it maps the __m128[id]? types to: Regardless of what you do with the values later on, __m128 is simply <4 x float>, __m128d is <2 x double>, and __m128i is <2 x i64>.
  • If I pass (for example) __m128d arguments to _mm_add_ps, it just inserts bitcasts from <2 x double> to <4 x float> to get the arguments of that fadd instruction, rather than changing the IR types of the __m128d values or anything more complicated. It seems to be a local “if types don’t match but vector width does, bitcast” check.
  • Likewise, __m128i are always just represented as <2 x i64> and bitcast'd on demand to smaller integer types when needed (e.g., <8 x i16> for *_epi16 intrinsics).
  • It generates horrible IR even for basic operations such as creating a vector literal (_mm_set_epi16 with constant arguments): It puts the eight scalars into allocas, then loads them and inserts them into a <8 x i16> vector one by one (and the machine code generated at -O0 is correspondingly terrible), and finally bitcasts that to <2 x i64> (because hey, the result is a __m128i = <2 x i64>). Of course this is optimized away entirely, even at -O1, to a constant of the “right” vector type (as needed later on, e.g. <4 x float> if the result is fed into _mm_add_ps).

So I have hope that it will be rather easy to match clang’s codegen, because all the smartness seems to be in the LLVM optimization passes rather than in clang.
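
In rustc terms, the special-casing could amount to lowering such an intrinsic to a plain lane-wise operation, e.g. via the existing simd_add (a sketch of the idea, not an actual implementation plan):

#![feature(repr_simd, platform_intrinsics)]

#[repr(simd)]
#[derive(Copy, Clone)]
struct __m128(f32, f32, f32, f32);

extern "platform-intrinsic" {
    fn simd_add<T>(x: T, y: T) -> T;
}

// Lane-wise addition on a repr(simd) float vector is emitted as
// `fadd <4 x float>`, exactly what clang produces for _mm_add_ps.
#[inline(always)]
unsafe fn mm_add_ps(a: __m128, b: __m128) -> __m128 {
    simd_add(a, b)
}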

1 Like

I think it would help me best if we used concrete examples. It’s really hard for me to understand otherwise.

For example, in Clang’s emmintrin.h file, consider its definition of _mm_add_epi64:

static __inline__ __m128i __DEFAULT_FN_ATTRS
_mm_add_epi64(__m128i __a, __m128i __b)
{
  return (__m128i)((__v2du)__a + (__v2du)__b);
}

In Rust, I’m assuming we’d use simd_add here? (I guess in an ideal world, we’d impl Add on vector types, but that impl would use simd_add…?)
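
If so, a direct transliteration of the clang definition might look like this (a sketch assuming the nightly simd_add intrinsic and the repr(simd) type translations further down):

use std::mem::transmute;

extern "platform-intrinsic" {
    fn simd_add<T>(x: T, y: T) -> T;
}

// Mirror clang: reinterpret as the unsigned 2-lane view, add, reinterpret back.
unsafe fn mm_add_epi64(a: __m128i, b: __m128i) -> __m128i {
    let au: __v2du = transmute(a);
    let bu: __v2du = transmute(b);
    transmute(simd_add(au, bu))
}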

The other bit I find interesting here is the use of cast to the __v2du type. There are others:

/* from mmintrin.h */
typedef long long __m64 __attribute__((__vector_size__(8)));

typedef long long __v1di __attribute__((__vector_size__(8)));
typedef int __v2si __attribute__((__vector_size__(8)));
typedef short __v4hi __attribute__((__vector_size__(8)));
typedef char __v8qi __attribute__((__vector_size__(8)));

/* from xmmintrin.h */
typedef int __v4si __attribute__((__vector_size__(16)));
typedef float __v4sf __attribute__((__vector_size__(16)));
typedef float __m128 __attribute__((__vector_size__(16)));

/* Unsigned types */
typedef unsigned int __v4su __attribute__((__vector_size__(16)));

/* from emmintrin.h */
typedef double __m128d __attribute__((__vector_size__(16)));
typedef long long __m128i __attribute__((__vector_size__(16)));

/* Type defines.  */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef short __v8hi __attribute__((__vector_size__(16)));
typedef char __v16qi __attribute__((__vector_size__(16)));

/* Unsigned types */
typedef unsigned long long __v2du __attribute__ ((__vector_size__ (16)));
typedef unsigned short __v8hu __attribute__((__vector_size__(16)));
typedef unsigned char __v16qu __attribute__((__vector_size__(16)));

/* We need an explicitly signed variant for char. Note that this shouldn't
 * appear in the interface though. */
typedef signed char __v16qs __attribute__((__vector_size__(16)));

I guess some of these types indicate the lane size to LLVM? Can we try to convert these to Rust types to make sure we’re all on the same page? I’ll take the first crack.

#[repr(simd)]
struct __m128(f32, f32, f32, f32);
#[repr(simd)]
struct __m128d(f64, f64);
#[repr(simd)]
struct __m128i(i64, i64);
#[repr(simd)]
struct __v2df(f64, f64);
#[repr(simd)]
struct __v2di(i64, i64);
#[repr(simd)]
struct __v8hi(i16, i16, i16, i16, i16, i16, i16, i16);
#[repr(simd)]
struct __v16qi(i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8);
#[repr(simd)]
struct __v2du(u64, u64);
#[repr(simd)]
struct __v8hu(u16, u16, u16, u16, u16, u16, u16, u16);
#[repr(simd)]
struct __v16qu(u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8);
#[repr(simd)]
struct __v16qs(i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8);

If I have this right, then I have some questions. To start:

  • Why is there both a __m128d and a __v2df? They appear identical?
  • Similarly, why is there both a __v16qi and a __v16qs?
  • The implementation of _mm_add_epi64 in Clang asks for two __m128i values (which I’ve surmised to be signed), but internally, it casts them each to __v2du (which I’ve surmised to be unsigned). What is going on here? Is the interface this way simply because Intel didn’t spec a __m128u type? How does one know to cast to unsigned or not? (I guess you need to look at the intrinsic name, e.g., _mm_adds_epi16.)
  • Presumably, we’d export __m128, __m128d and __m128i, but not __v2di, __v8hu, etc.?
  • Other things I’m missing?

I would like to avoid derailing this discussion about stabilizing a base level of SIMD support with a discussion on a higher level interface that can be evolved on crates.io.

Maybe folks interested in the higher level APIs can start a new thread?

2 Likes

The reason I avoided that so far is that this requires bikeshedding over how intrinsics are represented internally. But let's go with the style you use in your post, at least as straw syntax.

If we wanted to re-use the existing repr(simd) infrastructure, yes. I'm not sure if that's the best way long-term, but luckily those are just implementation details and need not be stabilized.

I am not familiar with clang's internals, but it seems to me that these __v* types (or more specifically the __vector_size__ attribute) are indeed roughly equivalent to rustc's current #[repr(simd)]. They encode the element type and the total vector size, and thus the lane count. So I agree with your translation.

Presumably for consistency (__v* types for implementation, __m128* types for the API in keeping with Intel docs). There doesn't seem to be a functional difference between __m128d and __v2df in a quick experiment.

I have no idea.

Neither CPUs nor LLVM IR care whether a value is "signed" or "unsigned". Registers are registers and the iN type is the iN type. Operations may or may not be signed or unsigned. Addition doesn't care about signedness, and thus both LLVM IR and CPUs have only a single (sign-agnostic) kind of addition instruction. So the difference between __v2du and __v2di (if it exists at all) is probably irrelevant for the addition intrinsics.

This bleeds over into the intrinsics API: there is no such thing as a signed or unsigned __m128* type (just as there are no separate types for lane widths), only some intrinsics could be said to use lane width X and be signed/unsigned. If __m128* types are internally defined in terms of int or unsigned, that's probably just because C has no concept of an integer type that is neither signed nor unsigned.

I'd like to give an example of an intrinsic where signedness of the intermediate __v* type is important, but all operations I can think of where signedness could matter either don't exist in common SIMD instruction sets (e.g., integer division) or map to LLVM intrinsics (so to LLVM it's just a call with iN-typed arguments, which is sign-agnostic, so clang doesn't have to care about signedness either). As an example of the latter, compare the SSE 4.1 _mm_mul_epi32 and _mm_mul_epu32 intrinsics.
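
Concretely, the two would bind to distinct LLVM intrinsics, so the signedness lives in the operation, not the type. A sketch in the style of the old nightly bindings; the link names come from LLVM, but treat the exact signatures as my assumption:

#![feature(repr_simd, simd_ffi, link_llvm_intrinsics)]

#[repr(simd)] #[derive(Copy, Clone)] struct i32x4(i32, i32, i32, i32);
#[repr(simd)] #[derive(Copy, Clone)] struct i64x2(i64, i64);
#[repr(simd)] #[derive(Copy, Clone)] struct u32x4(u32, u32, u32, u32);
#[repr(simd)] #[derive(Copy, Clone)] struct u64x2(u64, u64);

extern {
    // _mm_mul_epi32: signed multiply of the even 32-bit lanes.
    #[link_name = "llvm.x86.sse41.pmuldq"]
    fn pmuldq(a: i32x4, b: i32x4) -> i64x2;
    // _mm_mul_epu32: unsigned multiply of the even 32-bit lanes.
    #[link_name = "llvm.x86.sse2.pmulu.dq"]
    fn pmuludq(a: u32x4, b: u32x4) -> u64x2;
}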

To answer your confusion: I think signedness of the element type usually doesn't matter, and therefore clang probably just plays really fast and loose with it.

Yes. (And analogous types for the intrinsics belonging to other instruction sets.)

2 Likes

Thank you. That helped immensely!

Doesn't LLVM have a way to signal that signed overflow is undefined behavior? I know Rust doesn't use this, but I thought Clang did.

That indeed exists, but it’s also a property of the operation rather than the type. Addition instructions which come from C-level signed additions are tagged nsw (assume no signed overflow). There’s also nuw for assuming no unsigned overflow.

SIMD operations that depend on integer signedness:

  • Right shifts, if you choose to call arithmetic shift ‘signed’ (see the scalar sketch after this list).
  • Greater than / less than lane-wise comparisons.
  • Saturating arithmetic.
  • Integer to float conversion.
  • Float to integer conversion. SSE only has a signed conversion, but NEON has both.
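
To spell out that first bullet with plain scalars (the lane-wise SIMD shifts behave the same way per lane):

fn main() {
    let s: i32 = -16;
    let u: u32 = 0xFFFF_FFF0; // same bit pattern as s
    assert_eq!(s >> 2, -4);          // arithmetic shift: sign bits shift in
    assert_eq!(u >> 2, 0x3FFF_FFFC); // logical shift: zeros shift in
}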

Other operations like add/sub don’t care about signedness, but they do care if the lanes are ints or floats.

Operations like shuffles and swizzles don’t care if the lanes are ints or floats, but they do care how many lanes you have and how many bits are in each lane.

Operations like and/or/xor don’t care about lanes, they just see 128 bits.

When designing a compiler IR, you get to choose which of these details go in the value types, and which go in the instructions. If you choose non-signed integer types like LLVM does, you then need to provide signed/unsigned instruction variants for those operations that care. In assembly, registers don’t have types, so the instructions carry all the type information.

In a programming language, as opposed to a compiler IR, you typically use detailed types (i.e., with signed and unsigned integer types). This is also what the arm_neon.h header file does for C, and what SIMD.js does for JavaScript.
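
In Rust terms, #[repr(simd)] can already express such detailed types; a sketch:

#![feature(repr_simd)]

// Lane count, lane width, and signedness all live in the type (as in
// arm_neon.h or SIMD.js), rather than in the operation.
#[repr(simd)] struct f32x4(f32, f32, f32, f32);
#[repr(simd)] struct i32x4(i32, i32, i32, i32);
#[repr(simd)] struct u32x4(u32, u32, u32, u32);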

I don’t actually know the history of the Intel intrinsics headers, but they look like an attempt at ramming the assembly type system through the compilers of the time. The __m64 type was magic for “use an MMX register”, and the __m128 type was magic for “use an XMM register”. Later, compilers added more portable SIMD types with the __vector_size__ attribute, and the Intel intrinsics headers had to adapt by adding more types.

Today, the types and intrinsics provided in the Intel headers are significantly lower-level than LLVM IR. The headers provided with Clang emulate the lower-level intrinsics by mapping them to LLVM IR instructions. You don’t actually get the low-level control that the intrinsic names are suggesting.

For example, the Intel headers provide _mm_and_ps, _mm_and_pd, and _mm_and_si128 intrinsics, and you might think that these map to the andps, andpd, and pand instructions. They don’t. They all map to an LLVM and instruction, and the code generator will select a concrete instruction independently from the original intrinsic you used.

Similarly, _mm_sub_ss maps to a scalar subtraction followed by a vector lane insertion in LLVM IR. If you’re lucky, that then becomes a single subss instruction, but probably not.
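
A reconstruction of that lowering in terms of the nightly platform intrinsics (my sketch, not clang’s actual header code):

#![feature(repr_simd, platform_intrinsics)]

#[repr(simd)]
#[derive(Copy, Clone)]
struct __m128(f32, f32, f32, f32);

extern "platform-intrinsic" {
    fn simd_extract<T, E>(x: T, idx: u32) -> E;
    fn simd_insert<T, E>(x: T, idx: u32, val: E) -> T;
}

// Subtract lane 0 as a scalar, then reinsert it; whether this becomes a
// single subss is up to the instruction selector.
unsafe fn mm_sub_ss(a: __m128, b: __m128) -> __m128 {
    let a0: f32 = simd_extract(a, 0);
    let b0: f32 = simd_extract(b, 0);
    simd_insert(a, 0, a0 - b0)
}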

Given that SIMD intrinsics in Rust will be mapped to LLVM IR (or Cretonne) and not directly to SSE or NEON assembly, a shared SIMD type system and a shared set of basic operations are actually more fundamental than providing a set of intrinsics for SSE or NEON.

Once you have this shared SIMD foundation, you can then add ISA-specific intrinsics for all those weird instructions that only a few SIMD ISAs implement.

The target-specific intrinsics should build on a shared system of target-independent SIMD types and basic operations, not the other way around.

This is also how C compilers work now; the Intel header types and basic operations are only for backwards source compatibility.

Intel does not define the __v* types. AFAICT this is something the GCC authors came up with.

It's only more fundamental if your compiler backend is LLVM. As has been discussed earlier in this topic, the language needs to define backend-independent operations, and for architecture-specific intrinsics the only authoritative source is really the architecture vendors: either the intrinsics they define or the instruction mnemonics they define.