Getting explicit SIMD on stable Rust

Here's a bit of a contrived example which would be safe if you never bit-cast floats: Rust Playground

Edit: I guess you can get the same result by doing std::f32::INFINITY/std::f32::INFINITY

No. Just to name one of many reasons, wouldn't you like to write:

fn f(a: uint32x4) -> uint32x4 {
    a / 3
}

instead of

fn f(a: uint32x4) -> uint32x4 {
    (a * 0xaaaaaaab) >> 1
}

?

We could presumably do this through a deny-by-default lint. For dependencies the lint would be turned off by default, but for your crate you'd have to deal with it. That's probably the best story we have, though, for adding this in a backwards-compatible fashion.

Hm, sorry, I find it terribly hard to parse all the weird Intel names, so I'm not following 100%. So the C headers have three types: one for vectors with f32 elements, one for f64 elements, and one for integer elements of any size? The intrinsics then say which width they take and you can pass in any matching type with the same width?

If the C intrinsics and/or headers are so duck typed this would indeed pose a problem. We may have to develop some form of naming convention to differentiate if we want to do so.


Certainly! So one thing is I would prefer to never have errors come up during monomorphization. That is, if a crate compiles successfully, then 100% of downstream consumers will also compile successfully with it (no matter how they use it). In that sense I personally at least prefer to avoid adding codegen errors wherever possible.

Additionally, this feature doesn't really enable any more use cases. It's primarily just a lint for newbies (like me) to make sure things weren't messed up. That being said, that's also true of the entire simd module I'm thinking of. It's just a bunch of intrinsics which have terrible names. The module is very much an "expert mode" style. Downstream usage in a more high-level or typesafe fashion is where I think this abstraction would happen.

Given that, I personally can't see any plausible route to getting this sort of static enforcement stabilized in a reasonable time frame. Stabilization I think is important because if this feature existed it would want to be used by the wrapper crates on crates.io.

I don't mind if a feature like this is perhaps discussed in parallel, though! It'd certainly be a nice-to-have.


Inline assembly would indeed be nice to have! Unfortunately the stability story for it is much harder than SIMD's seems like it might be (even though SIMD itself is not easy). In that sense I think it makes sense to pursue the intrinsic route rather than the inline assembly route, as it's a faster (and perhaps, later, more ergonomic) way of getting access to SIMD.

This is almost correct (and the general implications about duck typing are the same). Intel supports vector operations on 128-bit and 256-bit operands. These vectors can be composed of different scalar types; for example you can fit 4 32-bit floats in a 128-bit vector or 16 16-bit signed integers in a 256-bit vector. Intrinsics generally specify the concrete type you need to pass in, but this can either be over-specific (in the case of parameters a and b of blendvps) or just plain wrong (for the mask parameter of blendvps).
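For concreteness, a quick check of the lane arithmetic mentioned above (plain Rust arrays stand in for the actual vector types here; this is only an illustration, not any proposed API):

fn main() {
    // 4 x 32-bit floats = 128 bits, 16 x 16-bit integers = 256 bits
    assert_eq!(std::mem::size_of::<[f32; 4]>() * 8, 128);
    assert_eq!(std::mem::size_of::<[i16; 16]>() * 8, 256);
}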

If C doesn’t prevent you from using one of these types as another, surely we just need one type (e.g. __m128)?

Well, you have to explicitly cast them, sometimes. If we have a single type in Rust, how will you construct/deconstruct the type? We could have a #[repr(simd)] union but now we're suddenly tied to those semantics for stabilization.

Oh I expected the type to be opaque and have a bunch of constructors/accessors. Anything else I can think of would have more drawbacks than my favorite solution (let the user access the intrinsic-importing mechanism directly, and plug in their own types).
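For illustration, a minimal sketch of the "opaque type plus constructors/accessors" shape (the type and method names here are made up, not anyone's proposed API):

#[derive(Copy, Clone)]
pub struct M128([f32; 4]); // opaque: the representation is not exposed

impl M128 {
    pub fn from_f32x4(lanes: [f32; 4]) -> M128 { M128(lanes) }
    pub fn to_f32x4(self) -> [f32; 4] { self.0 }
    // analogous constructors/accessors would exist for the integer views
}

fn main() {
    let v = M128::from_f32x4([1.0, 2.0, 3.0, 4.0]);
    assert_eq!(v.to_f32x4()[2], 3.0);
}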

Rather than stabilizing all those “platform-intrinsics”, why not instead stabilize #![feature(link_llvm_intrinsics)]?
IMO, this would be the minimum stabilization necessary to allow SIMD on stable. Most of the rest of the proposed features (comparisons, shuffling, etc.) could be implemented as a library.

Edit: We’d still need #[repr(simd)] and #[target_feature()], of course.

So a crate would need to opt into this lint in the future? (and it would be just a warning and not an error?).

If this is the case that sounds like it would not be what we want. I would like to know the technical reasons why such an extension isn't trivial to implement (it seems that @hanna-kruppe also thinks that it should be trivial to implement), or why it is a bad idea.

IIUC the only argument that has been given (multiple times) is that "it would delay stabilization". While I understand the desire to stabilize intrinsics as soon as possible, I don't think that time pressure should drive the design. In particular, when the proposed extension should not only be trivial to implement, but would prevent completely broken programs from compiling and/or prevent backend errors.

I still agree with the overall design, but haven't given much thought yet to the interaction of #[repr(simd)] with feature-dependent vector types like __m128/256/512..., or masks like __mmask16. My gut feeling tells me that we might be able to define the vector types as aliases on tuples with #[repr(simd)], such that we only need to stabilize the #[repr(simd)] feature without introducing any new primitive types. However, since the vector types would be stabilized as well, independently of how we define them, I wonder whether the way in which they are defined would make a difference or not.
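For what it's worth, a rough sketch of that idea (nightly-only, written as tuple structs, matching how #[repr(simd)] types are declared there; the lane interpretation chosen for the integer type is just one possibility):

#![feature(repr_simd)]
#![allow(non_camel_case_types)]

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct __m128(f32, f32, f32, f32);

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct __m128i(i64, i64);

fn main() {
    // both are 128-bit (16-byte) vector types
    assert_eq!(std::mem::size_of::<__m128>(), 16);
    assert_eq!(std::mem::size_of::<__m128i>(), 16);
}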

I also think that if the spec defines two types to be different and not interchangeable, we should enforce that in the intrinsics API even if C doesn't do that. Simplifying the use of the spec is the job of a higher-level library. The job of the low-level library is to match the spec as closely as possible.

Do note that clang doesn’t use builtins for many vendor intrinsics that can otherwise be implemented in C (it just guarantees the right codegen).

On top of that, while LLVM may have some of those as intrinsics, it keeps removing intrinsics, in favor of canonical IR forms representing them (e.g. min/max are icmp + select).

Not to mention how irresponsible stabilizing access to LLVM would be - LLVM itself isn’t stable! And we already have 2 or 3 non-LLVM backends being worked on.

This is true, good, and bad.

True: LLVM is really good at optimizing e.g. bit manipulation. Recently, all the TBM LLVM intrinsics were removed from LLVM (~10-15 intrinsics) because the optimizer became perfect at recognizing their algorithms.

Good: if you write one of the TBM bit manipulation algorithms in C or Rust, LLVM's optimizer will always give you the fast machine instruction for it.

Bad: this only happens as long as you enable LLVM optimizations! When there were LLVM intrinsics, you always got the machine code you wanted, even on debug builds. When they were removed, the only way to always get the machine code you wanted became to use inline assembly.

I actually don't think that pursuing the guarantee that the right intrinsics are used in debug builds is worth it, but this is a side-effect of how LLVM works: optimizations only happen if you enable the optimizer...
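To make that concrete, here is the flavor of pattern involved, using the closely related "clear the lowest set bit" idiom (blsr from BMI1 rather than TBM, but the same story: with optimizations on, LLVM recognizes the pattern and can emit the single instruction; a debug build emits the plain sub-and-and sequence):

fn clear_lowest_set_bit(x: u32) -> u32 {
    x & x.wrapping_sub(1)
}

fn main() {
    assert_eq!(clear_lowest_set_bit(0b1011_0100), 0b1011_0000);
}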

On the other hand, baking a massive number of processor-specific intrinsics into the language, that processors happen to support this year, also feels iffy.

I would prefer to outsource dealing with architecture peculiarities to crates such as simd, and instead equip them for dealing with platform differences (such as providing cfgs for detecting whether a particular intrinsic is supported by the compiler, so they can fall back to a generic implementation).

The objective is that these crates can be used in stable, so they would need a stable way of dealing with platform differences, their intrinsics, and types, at both compile-time and run-time.

Is there actually a way of providing that without stabilizing platforms and intrinsics into the language?

Even if we went the suboptimal inline assembly route, we would need to "stabilize" assembly instructions, registers, ... if we wanted to prevent backend errors...

Is preventing backend errors worth pursuing? (I just assumed it was).

As I said, I believe that this feature won't lead to errors during monomorphization. Whether a #[target_feature] attribute is set on a function which uses an intrinsic doesn't depend on the types substituted for the generic parameters, because it is a property of the source-level function definition. If you're worried about function pointers or Fn* objects/type parameters sneaking in, note that this "lint" isn't transitive. This is perfectly fine IMHO, even though it will crash at run time (on machines without AVX, assuming there isn't any CPUID detection in the callers of look_ma_no_hands):

#[target_feature(avx)]
fn foo() { some_avx_intrinsic(...) }

fn look_ma_no_hands() { foo(); }

On the contrary, I believe that not explicitly marking functions in which instrinsics may be called will lead to codegen errors. See, for example, this program which uses an AVX intrinsic and crashes in LLVM. Again, AFAIK the official way to prevent this is to set an attribute on the (LLVM-level) function that enables the "subtarget feature" (here, AVX) for this function only.

It might be required to make things not crash in codegen, see above. That is in fact my only motivation for this feature: I don't consider it a lint for the user, I consider it an unfortunate but necessary (in the sense of intrinsic complexity, not a silly limitation) tool for instructing the compiler's codegen.

That would be unfortunate, since it would either prevent you from writing code that is generic over SIMD widths, or it would prevent LLVM from inlining. For example, in the current SIMD crate f32x4 implements Add, and f32x8 implements Add if AVX is available. I'd like to be able to write

fn sum_simd<T: Simd + Add>(xs: &[T]) -> T {
  //...
}

and have it be compiled with AVX instructions when T is f32x8 and SSE instructions when T is f32x4. Also, I'd like LLVM to inline the actual add instructions, since having a function call for every addition would defeat the purpose. For a less trivial example, see the teddy crate, which uses generics to implement a text matching algorithm for both 16- and 32-byte blocks using mostly the same code.
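Roughly, the shape of the generic code in question would be something like this sketch (a hypothetical Simd wrapper crate would supply f32x4/f32x8; plain f32 satisfies the same bounds, so the sketch stands alone):

use std::ops::Add;

fn sum_simd<T: Copy + Add<Output = T>>(zero: T, xs: &[T]) -> T {
    // with f32x4/f32x8 this would add whole vectors per iteration
    xs.iter().fold(zero, |acc, &x| acc + x)
}

fn main() {
    assert_eq!(sum_simd(0.0f32, &[1.0, 2.0, 3.0]), 6.0);
}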

I too had concerns about composability, but they seem mostly unfounded. The scenario you mention is not really affected. If I understand correctly, f32x8 implements Add if AVX is globally available (e.g. through -C target-feature=+avx or -C target-cpu=native). Then #[target_feature] doesn’t even enter the picture, since it’s only an escape hatch from cfg(target_feature), i.e. it allows using features that aren’t available for the rest of the crate.

#[target_feature] and generics would need to interact if the choice to use intrinsics depended on the types substituted, which is essentially specialization. But if you specialize you do have a separate source-level function to which the attribute can be applied.

That is certainly the case now, since global availability is currently the only availability. But I would hope that a world with per-function #[target_feature] would allow

#[target_feature("avx")]
fn foo(a: f32x8, b: f32x8) -> f32x8 { a + b }

instead of requiring

#[target_feature("avx")]
fn foo(a: f32x8, b: f32x8) -> f32x8 { unsafe {  _mm256_add_ps(a, b) } }

Getting back to generics, I don't think specialization is required in order to have a situation where #[target_feature] interacts with generics. For example, here is a pattern that currently works using #[cfg(target_feature)], and I would be sad if it were impossible using #[target_feature]. I make a trait that abstracts over all the SIMD operations I need:

trait SimdStuff {
  fn foo(self) -> Self;
}

#[target_feature("sse")]
impl SimdStuff for f32x4 {
  #[inline(always)]
  fn foo(self) -> Self { /* ... */ }
}

#[target_feature("avx")]
impl SimdStuff for f32x8 {
  #[inline(always)]
  fn foo(self) -> Self { /* ... */ }
}

Then I write my actual algorithm to only use the operations in SimdStuff:

#[what_should_go_here?]
fn my_alg<T: SimdStuff>(input: T) -> T {
   input.foo()
}

In the example above, I'm not using specialization but I still have separate source-level functions (namely, the impls of foo) to apply #[target_feature] attributes. Nevertheless, it seems to me that I still need to apply some attribute to my_alg, and that attribute needs to "see" T.

Is there a better way to write this that doesn't require the attribute on my_alg to know what T is?

Unfortunately that gets very complicated very quickly (how do you propagate it to callees, how do you avoid post-monomorphization errors, what happens if one function is called from multiple locations with different sets of target features, etc.). For that sort of thing I would agree with @alexcrichton: intrinsics are too important to block on nice additional features like that. I'm also hopeful that expanding the regions where certain intrinsics are available can be done in a backwards compatible manner (in principle, if the aforementioned problems can be solved).

I assume you mean #[cfg(target_feature="...")] in the code? Resolving cfgs happens early in the compilation pipeline, long before even name resolution (which would be the bare minimum needed to even recognize intrinsics, let alone do anything about them). So the snippet you post would simply have a different set of impls depending on the global target_features. Any code that requires f32x8 : SimdStuff would likewise need to deal with cfg(target_feature), not set a target feature for itself.

Nothing needs to go there. In general (without even involving generics) it is nonsensical to require that callers transitively add all target features their callees use (not to mention that this would fall apart in the face of generics and function pointers and trait objects). After all, the only reason for this localized fiddling with target_features is to allow CPUID detection, i.e. using intrinsics in subprograms without requiring the rest of the program does it.

To address your inlining concern: If LLVM can prove that one function unconditionally calls another one with an attribute such as "avx", it should be able to propagate the attribute to the caller, which should enable inlining if it was blocked on that attribute being missing in the caller. If LLVM doesn't currently do this, it should be feasible to add.

Just to be completely clear, this attribute I’ve been pushing (straw-named #[target_feature(..)]) would not be the be-all and end-all. It would serve exactly one purpose: to instruct LLVM that it may assume additional CPU features in that function which aren’t present elsewhere. So it would be the equivalent of gcc/clang __attribute__((target(".."))). It would not deal with propagating these decisions up to callers or down to callees, or solve the other hard questions about compositionality that need to be solved for a really high-quality, higher-level SIMD library.

@hanna-kruppe it's not a big step from there to add a restriction that only some dispatcher macro (one that works with static function pointers, initialized either at startup or on the first call), and functions with a compatible #[target_feature(...)], can call functions marked as such.
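Something like the following shape, say. This is only a hand-written sketch of the caching-dispatcher idea: the feature check is a placeholder for real CPUID detection, the names are made up, and a real implementation would more likely cache a function pointer than a selector value:

use std::sync::atomic::{AtomicUsize, Ordering};

fn sum_avx(xs: &[f32]) -> f32 { xs.iter().sum() }      // would really use AVX intrinsics
fn sum_fallback(xs: &[f32]) -> f32 { xs.iter().sum() } // portable version

fn cpu_has_avx() -> bool { false } // placeholder for CPUID-based detection

static SELECTED: AtomicUsize = AtomicUsize::new(0); // 0 = undetected, 1 = AVX, 2 = fallback

fn sum(xs: &[f32]) -> f32 {
    let which = match SELECTED.load(Ordering::Relaxed) {
        0 => {
            // detect once, on the first call, and cache the result
            let w = if cpu_has_avx() { 1 } else { 2 };
            SELECTED.store(w, Ordering::Relaxed);
            w
        }
        w => w,
    };
    if which == 1 { sum_avx(xs) } else { sum_fallback(xs) }
}

fn main() {
    assert_eq!(sum(&[1.0, 2.0, 3.0]), 6.0);
}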

If such an error doesn’t get added, and it gets stabilized this way, adding it later will be much harder.

Rust should strive to make SIMD abstractions as safe to use as possible, as it does for other aspects of systems programming.