[Pre-RFC] Meta-target feature for 128-bit SIMD


#1

Now that the standard library is on track to get portable SIMD under core::simd/std::simd, it would be useful to be able to query whether 128-bit SIMD (the width that’s actually widely portable) is available without having to check the various architecture-specific 128-bit target feature names.

So instead of having to query for SSE2, for NEON, for aarch64 (which always has 128-bit SIMD) and, once Rust exposes SIMD on PPC/POWER and MIPS, for AltiVec/VMX or MSA, it would be useful to be able to query a single meta target_feature covering all of these. (The SSE levels already provide precedent for target_features that imply others: a given SSE level implies the presence of the lower levels.)

Logically, this should be a target_feature value. However, that namespace seems to be owned by LLVM. Instead of just writing an RFC, should I somehow seek to reserve the name token at the LLVM level? (What the name would be is, of course, a total bikeshed. I suggest simd128.)

(In the SSE world, this would trigger on SSE2 and not on just SSE, because SSE alone seems insufficient for good code and CPUs that have SSE but not SSE2 are rare anyway.)
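To make the proposal concrete, here is a sketch of how the meta-feature would be queried. The name simd128 is, as noted, a bikeshed, and no such target feature exists today, so the cfg below compiles but is never true:

```rust
// Hypothetical: if a meta target_feature named "simd128" existed, one cfg
// would replace the per-architecture list. Since it does not exist yet,
// the fallback arm is always the one selected.
#[cfg(target_feature = "simd128")]
fn simd128_available() -> bool {
    true
}

#[cfg(not(target_feature = "simd128"))]
fn simd128_available() -> bool {
    false
}

fn main() {
    println!("simd128: {}", simd128_available());
}
```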

// CC @gnzlbg


#2

What exactly does “128-bit SIMD” mean?


#3

I understand that to mean single-instruction multiple-data instructions operating on hardware vectors that are 128 bits wide and can consequently be filled with the following data:

  • 2x f64/u64/i64
  • 4x f32/u32/i32
  • 8x u16/i16
  • 16x u8/i8

As a possible portability caveat, I’m not sure if all SIMD instruction sets support all of these combinations. The capabilities of Intel’s SIMD instruction sets certainly vary depending on which data types you are operating on, and IIRC some of the above combinations may not be available in hardware (e.g. u64 on 32-bit systems).

Another longer-term concern, on the Rust side this time, is that the language does not yet support f16, which is quickly gaining popularity on the hardware side due to the ongoing machine learning craze.


#4
  • On x86 and x86_64 it means that target_feature sse2 is present.
  • On arm it means that target_feature neon is present.
  • On aarch64 128-bit SIMD is always present.
  • On ppc/ppc64/ppc64le it means that AltiVec/VMX is present.
  • On MIPS it means that MSA is present.

AFAICT, for the programmer this means (approximately?) that the portable operations on the following types are fast enough to be worthwhile to use in place of unvectorized plain types: u8x16, i8x16, u16x8, i16x8, u32x4, i32x4, f32x4.
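The architecture-by-architecture list above can already be spelled out as one (unwieldy) cfg! expression; this sketch shows roughly what a simd128 meta-feature would have to expand to, using the target_arch and target_feature names rustc currently exposes:

```rust
// Returns true when compiled for a target where 128-bit SIMD is available,
// per the per-architecture list: SSE2 on x86/x86_64, NEON on arm,
// unconditionally on aarch64, AltiVec on powerpc, MSA on mips.
fn has_128bit_simd() -> bool {
    cfg!(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "sse2"
        ),
        all(target_arch = "arm", target_feature = "neon"),
        target_arch = "aarch64",
        all(
            any(target_arch = "powerpc", target_arch = "powerpc64"),
            target_feature = "altivec"
        ),
        all(target_arch = "mips", target_feature = "msa"),
    ))
}

fn main() {
    // On x86_64 this prints "true", because sse2 is part of the baseline.
    println!("128-bit SIMD: {}", has_128bit_simd());
}
```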


#5

You may want to consider adding f64x2 to this list, as it is often used in scientific computations and very widely supported. I do not know if it is well-supported in all of the hardware architectures that you listed, though, that should definitely be cross-checked.


#6

I left f64x2 out because, on superficial inspection, it doesn’t seem to be available in all of these 128-bit SIMD instruction sets.


#7

If the feature enables features of different targets, does that mean that you want the following to be valid?

#[cfg(all(target_arch = "x86_64", target_feature = "128simd"))] {
    std::arch::x86_64::_mm_add_epi16(a.into_bits(), b.into_bits());
}
#[cfg(all(target_arch = "arm", target_feature = "128simd"))] {
    std::arch::arm::vaddq_s8(a.into_bits(), b.into_bits());
}
#[cfg(target_arch = "x86_64")]
if is_x86_feature_detected!("128simd") {
    std::arch::x86_64::_mm_add_epi16(a.into_bits(), b.into_bits());
}
#[cfg(target_arch = "arm")]
if is_arm_feature_detected!("128simd") {
    std::arch::arm::vaddq_s8(a.into_bits(), b.into_bits());
}

Note in particular that the arch module and the arch specific run-time feature detection macros are only available on the specific archs, and also, that each arch intrinsics take arch-specific vector types that you would need to convert to/from.
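To make the conversion point concrete, here is a sketch of that dance on x86_64 with SSE2 (a target feature that exists today), with a scalar loop standing in for every other architecture; plain arrays play the role of the portable vector type:

```rust
// Arch-specific path: SSE2 intrinsics take __m128i, so the portable data
// (here an [i16; 8]) has to be loaded into and stored out of that type.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn add_i16x8(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    use std::arch::x86_64::*;
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let vr = _mm_add_epi16(va, vb);
        let mut out = [0i16; 8];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, vr);
        out
    }
}

// Fallback for every other target. NEON, AltiVec and MSA would each need
// their own variant of the function above, with their own vector types.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn add_i16x8(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

fn main() {
    assert_eq!(add_i16x8([1; 8], [2; 8]), [3; 8]);
}
```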


#8

Superficially, I’m guessing, “yes”, but for the immediate use cases that I have in mind, it doesn’t matter.


#9

Aw, that is unfortunate. But on second thought, even if f64 turned out to be supported across the whole list of archs that you proposed earlier, it is definitely true that native double-precision support is either lacking or badly castrated (cough cough NVidia) on other hardware archs that Rust may want to support someday like GPUs.

Maybe the scientific computation use case could be better handled by adding an additional “f64simd” feature flag to the target, indicating native f64 SIMD support? I have not read up well enough on target feature flags; are they composable?


#10

@hsivonen I think this makes sense.

Right now, if one writes code that only uses portable vector operations, having to do the cfg_attr dance is pure cancer:

#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), target_feature(enable = "sse,sse2"))]
#[cfg_attr(target_arch = "arm", target_feature(enable = "neon"))]
#[cfg_attr(target_arch = "aarch64", target_feature(enable = "asimd"))]
#[cfg_attr(target_arch = "powerpc64", target_feature(enable = "altivec,vsx"))]
#[cfg_attr(target_arch = "mips64", target_feature(enable = "msa"))]
unsafe fn adds_three_vectors(x: u16x8, y: u16x8, z: u16x8) -> u16x8 { x + y + z }

The good thing is that one can write a proc macro that does this:

#[sane_simd]
unsafe fn adds_three_vectors(x: u16x8, y: u16x8, z: u16x8) -> u16x8 { x + y + z }

The bad thing is that one needs a proc macro to do this. There is no easy way to define these in your code once, and then reuse them.


#11

If you’re getting into the business of discriminating by hardware capabilities, excluding certain vector types that are 128 bits wide because they’re not supported by all targets you care about, etc., then why not just roll your own criteria from cfg(target_arch) and the existing target_features?

I assume the goal is to restrict use of SIMD (and all the extra contortions that you usually need to do to make algorithms SIMD-friendly) to targets where this actually has a reasonable chance to be fast. But I would expect this is just a crude heuristic – to actually see if some SIMD-fied algorithm will be fast on some platform, you need to actually measure. So what’s the value of codifying this particular approximation?


#12

FWIW what I meant with “this makes sense” was that I think it makes sense to solve this problem. Whether the solution should be a target feature just for this… I am less convinced about that.

I’d prefer to either add a crate that exposes this proc macro so that you can just use it, or to extend the language to allow users to reuse their “annotations” in their programs.


#13
  1. I’m not aware of a convenient way to bundle a bunch of cfg criteria into something that can be checked by checking one feature/cfg symbol elsewhere in the crate. (Maybe such a thing exists and I’m just unaware?)
  2. While testing is always ideal, it’s a matter of failure mode for when the crate author doesn’t have the hardware to test everything. If I test that the SIMD code path is faster on mainstream desktop (x86 and x86_64) and mainstream mobile (arm and aarch64) architectures, it seems like a fair guess to give SIMD to POWER and MIPS. It’s possible that it’s a pessimization compared to the ALU code, but before actually measuring, it seems like the better guess that the SIMD will be faster on POWER and MIPS, too.

#14

So @hsivonen, my recommendation right now is to just implement a proc macro that does this for you. It only takes 10 lines of Rust code to do so and it will make your life infinitely better (please share the crate if you do this!).

Right now implementing support for something like this is far down my priority list (portable vector types and related features would need to come first).


#15

Yes this is a real problem, but also one independent of SIMD and worth solving more generally (rather than adding ad-hoc “combined” cfgs to the language).

Yes, but you can do this regardless of who creates the heuristic. It doesn’t need to be in the language, and doing it outside the language makes it easier to experiment with the heuristics or offer multiple ones.

Aside: With enough contortions you don’t even need proc macro attributes. This syntax is possible with macro_rules or macros 1.1 (the stable portions of proc macros, via proc-macro-hack), as long as you keep the syntax to

cfg_128_simd! {
    fn foo() { ... }
    fn bar() { ... }
}
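A minimal macro_rules sketch of that idea, using cfg_128_simd! as the hypothetical name from the snippet above (only the x86/x86_64 and arm attributes are shown for brevity, and the functions must be unsafe because target_feature(enable = ...) requires it on stable Rust):

```rust
// Stamps the per-architecture target_feature attributes onto each item,
// so the cfg_attr list is written once instead of at every function.
macro_rules! cfg_128_simd {
    ($($item:item)*) => {
        $(
            #[cfg_attr(
                any(target_arch = "x86", target_arch = "x86_64"),
                target_feature(enable = "sse2")
            )]
            #[cfg_attr(target_arch = "arm", target_feature(enable = "neon"))]
            $item
        )*
    };
}

cfg_128_simd! {
    unsafe fn adds_three(x: u16, y: u16, z: u16) -> u16 {
        x.wrapping_add(y).wrapping_add(z)
    }
}

fn main() {
    // sse2 is part of the x86_64 baseline, so calling the generated
    // function is fine there; the call is unsafe because the fn is.
    let r = unsafe { adds_three(1, 2, 3) };
    assert_eq!(r, 6);
}
```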

#16

Wouldn’t it be better to have a numpy-like arbitrarily-sized-vector cross-platform API instead of only supporting 128-bit SIMD, esp. since 256-bit SIMD is also widely available in AVX and that API would also work better on platforms with no SIMD support?

If this kind of feature is desired anyway, then being able to check for any specific vector type seems the correct solution (something like #[cfg(target_simd(f32x4))]).


#17

It’s not just a matter of vector sizes, but what operations are supported. For example, if you want to multiply instead of add, x86 supports doing that on i16x8 in SSE2, i32x4 in SSE4.1, i64x2 with AVX512, and i8x16 not at all. Or if you want to compare two vectors (for greater/less-than), you can do that in SSE2 for all four integer sizes, but only for signed integers; unsigned integers have to wait until AVX512. Other platforms have their own idiosyncratic limitations…
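The i32x4-multiply case makes a concrete runtime-dispatch example. This sketch (arrays again standing in for a portable vector type) uses the SSE4.1 _mm_mullo_epi32 intrinsic when the feature is detected at runtime and falls back to a scalar loop otherwise:

```rust
// SSE4.1 path: _mm_mullo_epi32 is the i32x4 multiply that plain SSE2 lacks.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn mul_i32x4_sse41(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    use std::arch::x86_64::*;
    let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
    let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
    let vr = _mm_mullo_epi32(va, vb);
    let mut out = [0i32; 4];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, vr);
    out
}

// Dispatcher: checks for SSE4.1 at runtime on x86_64, otherwise (or on any
// other architecture) multiplies lane by lane in scalar code.
fn mul_i32x4(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            return unsafe { mul_i32x4_sse41(a, b) };
        }
    }
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_mul(b[i]);
    }
    out
}

fn main() {
    assert_eq!(mul_i32x4([1, 2, 3, 4], [5, 6, 7, 8]), [5, 12, 21, 32]);
}
```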


#18

The idea is to use 128-bit SIMD to implement an arbitrarily-sized cross-platform vector API.


#19

This is more or less exactly what faster does, or how is that any different from what you mean?