Getting explicit SIMD on stable Rust

And one more thing regarding type naming: Intel calls all the different packings the same type. 16 packed unsigned bytes and 4 packed signed 32-bit integers are both __m128i. I don’t think we should copy this directly in Rust.
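For illustration, a Rust-flavored alternative could give each packing its own nominal type. This is only a sketch; the names follow the simd crate's conventions and are not a concrete proposal:

// Hypothetical: one type per packing, instead of a single __m128i.
#[repr(simd)]
struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
             u8, u8, u8, u8, u8, u8, u8, u8);

#[repr(simd)]
struct i32x4(i32, i32, i32, i32);

In the C model, both of these are just __m128i, and every integer intrinsic takes and returns that single type.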


Well, if you look at the implementation of the cast I posted, it actually doesn’t do anything. The only reason I see for keeping these casts would be to keep the API closer to the C/C++ API.

If "ARM adds SIMD support next year" (which it already has many years ago) we will add the relevant intrinsics to the language.

Oops, sorry, I meant to ask what happens if ARM adds e.g. SSE4 or AVX support (not SIMD support, which is already available through NEON).

Considering that intrinsics are tightly bound to architectures, and that operations that sound similar may operate in subtly different ways, I think this is a terrible idea. Intrinsics are intrinsically tied to a target architecture.

The problem is that defining a set of architectures is really hard and time-consuming: x86 vs ARM vs x86_64 vs x86::Haswell vs ...

What I am proposing is that instead of defining architectures, we just do what Clang and GCC do, that is: define architecture features. For example, SSE2, SSE3, SSE4.2, SSE4A, AVX, NEON, ..., BMI, AES, ...

The programmer then opts into using only some sets of architecture features for some part of their code, and the compiler enforces that they don't use something outside these features in those parts (both via direct intrinsic calls and backend optimizations).
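As a rough sketch of that opt-in (the attribute name and syntax are strawmen, and _mm256_add_ps stands in for any AVX intrinsic):

// Strawman: this function may use SSE2 and AVX intrinsics; the compiler
// rejects anything outside that set, e.g. a NEON intrinsic.
#[target_feature(SSE2, AVX)]
fn kernel(a: __m256, b: __m256) -> __m256 {
    unsafe { _mm256_add_ps(a, b) } // OK: AVX is among the enabled features
}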

A problem that one must consider is that of "incompatible architectures", e.g., one cannot embed NEON and SSE assembly in the same binary, so we should probably have some notion of "incompatible features" (that might become compatible in the future). The objective is that if you embed SSE4 unconditionally in a binary and try to target ARM, you will get a front-end error (instead of a backend error).

Your concern, that "intrinsics [...] that sound similar may operate in subtly different ways", should be addressed by this, since intrinsics for an architecture feature like SSE4.2 sound and behave exactly the same regardless of the architecture used (Intel Haswell vs AMD Bulldozer). If ARM were to add SSE4.2 support to their architecture, they would have to ensure exactly the same semantics, or they would not be able to call it SSE4.2.


The main problem with the crate's exposure is that it requires a nightly compiler. Would moving it to rust-official or rust-nursery remove this requirement and make it work on stable? Even if it doesn't make it work on stable, moving it to the nursery will probably require an RFC, which would give the library more exposure and improve it.


But we have to if we want to keep the lowest-level API vendor-oriented.

The strength of crates.io is that we can have multiple competing crates before anointing an official (or semi-official) one. Since Rust is missing features that are necessary for a good SIMD crate, I don't think we can even have that competition yet. (IMO, that's also the reason that the simd crate is languishing.)


Ah yes when I say "platform" I really mean "target", as in any value of --target you pass to the compiler. So in that sense 32-bit x86 Linux is a different platform than 64-bit x86 Linux.

Ah, I was thinking you can indeed always use an intrinsic, so I may be misinterpreting this. My point, though, is that there is no compile-time guarantee that you don't use AVX in a binary that doesn't have it enabled; it's up to you to avoid doing so.

Indeed!

I definitely agree that it would be quite helpful. I think we can definitely solve the I-used-neon-on-sse3 problem by ensuring that std::simd (or whatever the module is) only has relevant intrinsics for your platform. This is sort of how on Linux there's no AsRawHandle trait, but it exists on Windows. The API of the standard library is just different.
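Concretely, that could look something like this (a sketch; the module names are chosen for illustration):

// Vendor intrinsics only exist in the standard library for the target you
// are compiling for, much like std::os::windows today.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub mod x86 {
    // _mm_* and _mm256_* intrinsics live here
}

#[cfg(target_arch = "arm")]
pub mod arm {
    // NEON intrinsics live here
}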

Telling you that AVX can't be used when SSE4 is enabled is much harder, though. For example consider:

let answer = if avx_enabled() {
    _my_avx_intrinsic(arg)
} else {
    fallback(arg)
};

Here the compiler would have to know that you can use AVX intrinsics in the first arm of the if, but not in the latter. In general this seems like a very hard problem to solve, with odd repercussions on the language itself (weird changes to name resolution). I'd prefer to treat this as "would be nice to solve" but not necessary for stabilizing SIMD access.


A good point yeah! I suspect that we'll forever be working very closely with the simd crate to ensure it's always got a path forward.

Perhaps! I don't really mind too much where these go. It seems like the placement should satisfy at least two constraints, though:

  • Accommodate growth to other intrinsics for a platform other than purely SIMD
  • Allow a high-level type-safe SIMD wrapper to be added at some point (if ever)

In that sense std::arch seems fine to me, but I'm ok punting on this until RFC-time.


My thinking is that all the intrinsics match the vendor headers exactly. That is, what you see in the Intel or ARM docs is what you'll see in the standard library, no more, no less. And yes, I'm thinking leading underscore and all. Basically this is trying to ensure that you don't have to jump through hoops to get at SIMD; instead, you have raw access to the instructions the CPU supports. It also means there are no name transformations to apply when searching the docs, etc.
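For example (a sketch; only the signature matters here), the declaration mirrors Intel's __m128i _mm_add_epi32 (__m128i a, __m128i b) verbatim:

// Same name, same types, same argument order as the vendor header.
pub unsafe fn _mm_add_epi32(a: __m128i, b: __m128i) -> __m128i { ... }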


Sounds like a good idea to me! Should be quite plausible. All this discussion about std::simd can simply be replaced with "defined in core::simd and reexported in std".


To be clear, I'm thinking we should specifically avoid trying to "improve" the intrinsics. That's the responsibility of a type-safe and/or high-level wrapper (e.g. the simd crate), not the standard library. Anecdotes seem to suggest that SIMD veterans would not benefit from, and might even be hindered by, premature abstractions, while SIMD newbies would greatly benefit from a higher-level layer (e.g. the simd crate).


Unfortunately it's basically impossible to have any crate using unstable APIs unless it's literally shipped with the compiler. So no matter how "official" we make the simd crate, it'll never compile on stable as-is (unless we stabilize the intrinsic access).


Ok, so all intrinsics are going to be unsafe, then?

Presumably, yes.

The way I was thinking about solving this is the following. Your example would fail to compile:

let answer = if avx_enabled() {
    _my_avx_intrinsic(arg)
    //^^^^ Error: AVX intrinsic used, but the AVX target feature is not in scope.
} else {
    fallback(arg)
};

but the following example would succeed:

let answer = if avx_enabled() {
   // User opts into target features explicitly:
   #[use_target_feature(SSE4, AVX)] {
     _my_avx_intrinsic(arg)
   }
} else {
    fallback(arg)
};

and the following would fail as well:

let answer = if avx_enabled() {
   #[use_target_feature(SSE4)] {
     _my_avx_intrinsic(arg)
      //^^^^ Error: AVX intrinsic used, but the AVX target feature is not in scope.
   }
} else {
    fallback(arg)
};

Libraries like liboil and OpenMP typically use something like the following pattern.

They have a static function pointer that is initialized to the implementation to be used. We can have a macro for conditional compilation on incompatible architectures; I've called it target_architecture here, but that is a strawman:

// Conditional compilation for x86
#[cfg(target_architecture(x86))] {  

// Detect the features at run-time and initialize a static function pointer
// with the appropriate algorithm implementation: 
lazy_static! {
    static ref SOME_ALGORITHM_IMPL: fn(...) -> ... =
        if avx_enabled() {
            some_algorithm_avx_impl
        } else if sse42_enabled() {
            some_algorithm_sse42_impl
        } else {
            some_algorithm_fallback_impl
        };
}

}

Note how this code doesn't have any target_feature flags, since it is not doing anything feature-specific; it is just setting a function pointer.

In the same way, we can add the code for ARM:

// conditional compilation for ARM
#[cfg(target_architecture(ARM))] {  
lazy_static! {
    static ref SOME_ALGORITHM_IMPL: fn(...) -> ... =
        if neon_enabled() {
            some_algorithm_neon_impl
        } else {
            some_algorithm_fallback_impl
        };
}

}

and the code for other architectures:

// Conditional compilation for neither x86 nor ARM
#[cfg(!target_architecture(x86), !target_architecture(ARM))] {

// no need to use lazy static here:
static SOME_ALGORITHM_IMPL:  fn(...) -> ... = some_algorithm_fallback_impl;

}

Now we implement the algorithm for all architectures; it just forwards to the function pointer:

// The algorithm just uses the function pointer
fn some_algorithm(args...) -> ... {
  SOME_ALGORITHM_IMPL(args...)
}
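For reference, here is the dispatch half again in more concrete syntax. Note that #[cfg(target_arch = "...")] already covers the architecture-level conditional compilation; avx_enabled/sse42_enabled remain strawman runtime checks, and the fn(&[f32]) -> f32 signature is made up for illustration:

#[macro_use]
extern crate lazy_static;

type AlgoFn = fn(&[f32]) -> f32;

#[cfg(target_arch = "x86_64")]
lazy_static! {
    // Pick the best implementation once, at first use.
    static ref SOME_ALGORITHM_IMPL: AlgoFn = if avx_enabled() {
        some_algorithm_avx_impl
    } else if sse42_enabled() {
        some_algorithm_sse42_impl
    } else {
        some_algorithm_fallback_impl
    };
}

#[cfg(not(target_arch = "x86_64"))]
lazy_static! {
    static ref SOME_ALGORITHM_IMPL: AlgoFn = some_algorithm_fallback_impl;
}

fn some_algorithm(xs: &[f32]) -> f32 {
    // Deref the lazy static to reach the function pointer.
    (*SOME_ALGORITHM_IMPL)(xs)
}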

And now we combine the target_feature attributes with the target_architecture macros to generate the code of the different implementations:

// For X86
#[cfg(target_architecture(x86))] { 

// Different implementations of the functions are generated by the compiler

#[target_feature(AVX)]
fn some_algorithm_avx_impl(args...) -> ... {
  // Might use AVX features (and probably SSE42, since AVX is a strict superset)
}

#[target_feature(SSE42)]
fn some_algorithm_sse42_impl(args...) -> ... {
  // Might use SSE42 features, cannot use AVX features (compiler error)
}
} 

// For ARM
#[cfg(target_architecture(ARM))] { 

#[target_feature(NEON)]
fn some_algorithm_neon_impl(args...) -> ... { }

}

// The fallback is generated for all architectures
fn some_algorithm_fallback_impl(args...) -> ... {
  // Compiler should error if user tries to use any target features here
}

Note how one must use #[target_feature(...)] on the functions to enable the features for the whole function. That should be just sugar for:

fn name(...) -> ... {
 #[target_feature(...)] {
  // body
 }
}

This should work very similarly to the current way in which code is conditionally included depending on enabled target features:

// This works in Rust today (in nightly)
pub fn pext<T: IntF32T64>(x: T, mask_: T) -> T {
    if cfg!(target_feature = "bmi2") {  // compile-time condition
        unsafe { intrinsics::pext(x, mask_) }
    } else {
        alg::bmi2::pext(x, mask_)
    }
}

I said before that, inside the feature blocks, the compiler should not use features beyond those listed, even if the binary's target is set to use those features, but I now think that does not make sense: the compiler will use those features everywhere else anyway, so the binary cannot run on targets that don't support them regardless.

  1. I have a pretty good idea of how to map what we’ve discussed here onto the features currently available through RFC 1199, except for how to allow mixing Intel’s __m128i and __m128d types. Considering that these both map to various actual Rust types and that you can interchange them freely, that doesn’t really fit the rigidity of a #[repr(simd)] struct __m128d(f64, f64);.
  2. we should really enable stable generic shuffle/insert/extract operations, since the compiler is much better than humans at choosing the right instructions for these (see the sketch below).
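On nightly, these generic operations exist today as platform intrinsics, roughly as the simd crate declares them (a sketch; the exact names and signatures are unstable):

extern "platform-intrinsic" {
    // The compiler lowers these to the best shuffle/insert/extract
    // instructions available on the target.
    fn simd_shuffle4<T, U>(x: T, y: T, idx: [u32; 4]) -> U;
    fn simd_insert<T, E>(x: T, idx: u32, val: E) -> T;
    fn simd_extract<T, E>(x: T, idx: u32) -> E;
}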

This seems risky. If LLVM knows that some SIMD intrinsics are pure, it would be within its rights to hoist them out of the if statement… I’m not sure whether it actually does so, but if not it could in the future, and so could an alternative Rust backend.

Also, SIMD extensions sometimes introduce not just new instructions but new registers, like YMMn (wider than XMMn) with AVX. If you use intrinsics within such a conditional that operate on YMM-sized values, the register allocator would have to know to allocate YMM registers within guarded blocks but never touch them outside, including when saving/restoring around function calls and maybe in the prolog/epilog*. This is certainly possible to implement, but it sounds messy and I doubt LLVM currently does it.

Probably best to keep architecture variants in their own functions, and make the compiler fully aware of what variant it’s targeting in a given function.

* upper bits of YMMn are always caller-saved for compatibility, but on Windows the lower bits of some of them are callee-saved; saving only the lower bits in the prolog/epilog can be done portably but I’m not sure if it always is


@gnzlbg right yeah I understand what you mean, and I understand the organization as well. I would personally claim that such an implementation in the compiler would unreasonably delay SIMD stabilization. The alternative, not giving you this static checking, seems to me like the more appealing route.

That's not to say we shouldn't have static checking, in one form or another, that you don't use AVX outside of an AVX-enabled function. I just think that feature shouldn't block SIMD stabilization.


I'm not personally super familiar with all the x86 SIMD instructions, so could you go into some more detail about the implications of having these two types? Put another way, with a strict interpretation of what I'm thinking we'd end up with:

// in src/libstd/simd.rs
pub struct __m128i(...);
pub struct __m128d(...);

and then we'd match those types exactly with all the intrinsics in Intel's documentation. Is there something wrong with this approach, though?

Indeed! This is something @burntsushi is looking into, and we'll likely have some intrinsic or another to do this.

Do you think it is possible to add this checking in a backwards-compatible way? I don’t see how this could be done, so if we ever want this feature it might block stabilization anyway :confused:

First: looks like we missed a type in prior discussions: there's __m128 (4xf32), __m128d (2xf64) and __m128i (any packed integer vector), as well as the 256-bit versions of these from AVX onwards.
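One plausible set of definitions for the three (a sketch; the element type chosen for __m128i is arbitrary, since it stands in for any packed-integer layout):

#[repr(simd)]
pub struct __m128(f32, f32, f32, f32);

#[repr(simd)]
pub struct __m128d(f64, f64);

#[repr(simd)]
pub struct __m128i(i64, i64); // reinterpreted freely as 16xi8 through 2xi64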

The C API lets you use these types interchangeably, and this is necessary, because some operations that you might want to use on both kinds of data exist only for one type. For example, there's the BLENDV family of instructions, which is like a shuffle except that you pass a variable mask. AVX defines __m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask), which lets you blend single-precision floats, but it could be used for 32-bit integers just as well. Only with AVX2 did they introduce the integer version _mm256_blendv_epi8. Also note that it takes the mask as a __m256, but it really only uses the sign bit of each 32-bit element (so not a float at all).
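In C that interchangeability shows up as explicit, zero-cost casts. Assuming Rust mirrors the same cast intrinsics, blending 32-bit integers with the float blendv would look like this (a sketch; the names match Intel's C API):

// Blend packed 32-bit integers with AVX's float blendv. The casts compile
// to nothing; only the sign bit of each 32-bit mask element is consulted.
unsafe fn blendv_epi32(a: __m256i, b: __m256i, mask: __m256i) -> __m256i {
    let r = _mm256_blendv_ps(
        _mm256_castsi256_ps(a),
        _mm256_castsi256_ps(b),
        _mm256_castsi256_ps(mask),
    );
    _mm256_castps_si256(r)
}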

Could you elaborate on why you think that? While I have some concerns about what it will mean for modularity/abstraction, the actual feature seems straightforward enough, especially if restricted to annotating functions with #[target_feature] rather than arbitrary blocks. It wouldn't interact with monomorphization, and with a matching perma-unstable attribute (call it #[needs_target_feature]) on the intrinsics and potential wrapper functions, the check should be a rather simple lint. It would also be a good match for how LLVM supports code with different ISA extensions within one compilation unit; I am a bit skeptical that LLVM is, in your words, flexible enough to just support any intrinsics anywhere. What will definitely work is setting the corresponding attributes on functions marked #[target_feature(..)].
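A sketch of that pairing (both attribute spellings are placeholders):

// The intrinsic declares what it needs...
#[needs_target_feature(avx)]
pub unsafe fn _mm256_add_ps(a: __m256, b: __m256) -> __m256 { ... }

// ...and the lint checks that every caller enables it.
#[target_feature(avx)]
unsafe fn sum(a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b) // OK: the enclosing function enables AVX
}

fn bad(a: __m256, b: __m256) -> __m256 {
    unsafe { _mm256_add_ps(a, b) } // lint error: AVX not enabled here
}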

On the topic of unsafety, anything not dealing with pointers (e.g. gather/scatter) or uninitialized memory should be safe, at least IMO, unless the cost of figuring out which intrinsics can be safe outweighs the benefit.

Currently in safe Rust you can't bit-cast an integer to a float. It's not clear this property would be preserved when making intrinsics safe.

I expect to find safe wrappers for such transmutes all over the place, TBQH. Although I'm not sure everyone else agrees on this, or if there's some weird UB-like semantics based on floating-point bitpatterns.
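For example, the trivial wrapper, which is sound on the usual reading that every u32 bit pattern is a valid f32 (signalling NaNs included; that is exactly the debate mentioned above):

fn f32_from_bits(x: u32) -> f32 {
    // No bit pattern of x produces an invalid f32, so this can be exposed
    // as a safe function.
    unsafe { std::mem::transmute(x) }
}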

I would like to second @comex's opinion on this. I think stabilizing inline assembly is a better approach, not only to this problem but for Rust in general. Instead of stabilizing intrinsics or some higher-level SIMD API in core or std, we could work on a set of crates built on asm!, starting from platform-dependent low-level crates that provide intrinsic-like interfaces and going up to higher-level things like the simd crate, with fallbacks for platforms without the necessary instructions, using either runtime or compile-time feature detection. The advantage of this approach is that it allows us to freely experiment with interfaces and crate structure without imposing backward-compatibility requirements.
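For instance, an intrinsic-like wrapper over a single instruction could look like this (a sketch; POPCNT is chosen for brevity, and since asm! is still unstable, the operand syntax here is illustrative):

// An intrinsic-like function a low-level crate could export, built on
// asm! instead of a compiler intrinsic. Assumes POPCNT is available.
#[cfg(target_arch = "x86_64")]
fn popcnt(x: u64) -> u64 {
    let out: u64;
    unsafe {
        asm!("popcnt {0}, {1}", out(reg) out, in(reg) x);
    }
    out
}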

Also, do not forget about other low-level, platform-dependent features (e.g. AES-NI) which are currently not accessible in stable pure Rust. And say next-gen processors bring new instruction sets: should we wait until new intrinsics are stabilized in Rust’s std or core, or should we allow developers to create experimental crates that use those instructions through asm!?

Yes, there is a concern about the inability to optimize inline assembly, but I think compiler optimizations are not that important at this level, since you only reach for it when those optimizations do not give you the desired performance. And in some cases the absence of “optimizations” is quite desirable (the most notable example is cryptography).

So, summarizing: in my opinion, stabilized asm! plus runtime feature detection will, in the mid-term, not only more or less solve the SIMD problem, but also provide a significant boost for low-level and performance-critical programming in Rust.
