Pre-RFC: SIMD groundwork


#1
  • Feature Name: simd_basics, cfg_target_feature
  • Start Date: 2015-06-02
  • RFC PR: (leave this empty)
  • Rust Issue: (leave this empty)

Summary

Lay the groundwork for building powerful SIMD functionality.

Motivation

SIMD (Single Instruction, Multiple Data) is an important part of performant modern applications. Most modern CPUs provide dedicated hardware and instructions for operating on multiple values in a single instruction, and exposing this hardware is an important part of being a low-level language.

This RFC lays the groundwork for building nice SIMD functionality, but doesn’t fill everything out. The goal here is to provide the raw types and access to the raw instructions on each platform.

Where does this code go? a.k.a. why not in std?

This RFC is focused on building stable, powerful SIMD functionality in external crates, not std.

This makes it much easier to support functionality only "occasionally" available with Rust’s preexisting cfg system. There’s no way for std to conditionally provide an API based on the target features used for the final artifact. Building std in every configuration is certainly untenable. Hence, if it were to be in std, there would need to be some highly delayed cfg system to support that sort of conditional API exposure.

With an external crate, we can leverage cargo's existing build infrastructure: compiling with some target features will rebuild with those features enabled.

Detailed design

The design comes in three parts, all on the path to stabilisation:

  • types (feature(simd_basics))
  • operations (feature(simd_basics))
  • platform detection (feature(cfg_target_feature))

The general idea is to avoid bad performance cliffs, so that an intrinsic call in Rust maps to preferably one CPU instruction, or, if not, the “optimal” sequence required to do the given operation anyway. This means exposing a lot of platform specific details, since platforms behave very differently: both across architecture families (x86, x86-64, ARM, MIPS, …), and even within a family (x86-64’s Skylake, Haswell, Nehalem, …).

There is definitely a common core of SIMD functionality shared across many platforms, but this RFC doesn’t try to extract that; it is just building tools that can be wrapped into a more uniform API later.

Types & traits

There are two new attributes: repr(simd) and simd_primitive_trait.

#[repr(simd)]
struct f32x4(f32, f32, f23, f23);

#[repr(simd)]
struct Simd2<T>(T, T);

#[simd_primitive_trait]
trait SimdPrim {}

repr(simd)

The simd repr can be attached to a struct and will cause such a struct to be compiled to a SIMD vector. It can be generic, but it is required that any fully monomorphised instance of the type consist of only a single “primitive” type, repeated some number of times. The restrictions on the element type are exactly the same restrictions as #[simd_primitive_trait] traits impose on their implementing types.

The compiler may not enforce the trait bound at the type-checking level for generic repr(simd) types. As such, it will be possible to get the code generator to error out (à la the old transmute size errors); however, this shouldn’t cause problems in practice: libraries wrapping this functionality would layer type-safety on top (i.e. generic repr(simd) types would use the SimdPrim trait as a bound).

It is illegal to take an internal reference to the fields of a repr(simd) type, because the representation of booleans may need to change so that booleans are bit-packed. The official external library providing SIMD support will have private fields, so this will not be generally observable.

simd_primitive_trait

Traits marked with the simd_primitive_trait attribute are special: types implementing it are those that can be stored in SIMD vectors. Initially, only primitives and single-field structs that store SimdPrim types will be allowed to implement it.

This is explicitly not a lang item: it is legal to have multiple distinct traits in a compilation. The attribute just adds the restriction and possibly tweaks the type’s internal representation (as such, it’s legal for a single type to implement multiple traits with the attribute, if a bit pointless).

This trait exists to allow new-type wrappers around primitives to also be usable in a SIMD context. However, this only works in limited scenarios (i.e. when the type wraps a single primitive) and so needs to be an explicit part of every type’s API: type authors opt in to being designed-for-SIMD. If it were implicit, changes to private fields could break downstream code.

Operations

CPU vendors usually offer “standard” C headers for their CPU specific operations, such as arm_neon.h and the ...mmintrin.h headers for x86(-64).

All of these would be exposed as compiler intrinsics with names very similar to those the vendor suggests (the only difference being some form of manual namespacing, e.g. prefixing with the CPU target), loadable via an extern block with an appropriate ABI. This subset of intrinsics would be on the path to stabilisation (that is, one could "import" them with extern in stable code), and would not be exported by std.

extern "rust-intrinsic" {
    fn x86_mm_abs_epi16(a: Simd8<i16>) -> Simd8<i16>;
    // ...
}

These all use entirely concrete types, and this is the core interface to these intrinsics: essentially it is just allowing code to specify exactly which CPU instruction to use. These intrinsics only actually work on a subset of the CPUs that Rust targets, and are only available for externing on those targets. The signatures are typechecked, but in a “duck-typed” manner: the compiler will just ensure that the types are SIMD vectors with the appropriate length and element type; it will not enforce a specific nominal type.

There would additionally be a small set of cross-platform operations that are either generally efficiently supported everywhere or are extremely useful. These won’t necessarily map to a single instruction, but will be shimmed as efficiently as possible.

  • shuffles and extracting/inserting elements
  • comparisons

Lastly, arithmetic and conversions are supported via built-in operators.

Shuffles & element operations

One of the most powerful features of SIMD is the ability to rearrange data within vectors, giving super-linear speed-ups sometimes. As such, shuffles are exposed generally: intrinsics that represent arbitrary shuffles.

This may violate the “one instruction per intrinsic” principle depending on the shuffle, but rearranging SIMD vectors is extremely useful, and providing a direct intrinsic lets the compiler (a) do the programmer’s work in synthesising the optimal (short) sequence of instructions for a given shuffle and (b) track data through shuffles without having to understand all the details of every platform-specific shuffle intrinsic.

extern "rust-intrinsic" {
    fn simd_shuffle2<T, Elem>(v: T, w: T, i0: u32, i1: u32) -> Simd2<Elem>;
    fn simd_shuffle4<T, Elem>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4<Elem>;
    fn simd_shuffle8<T, Elem>(v: T, w: T,
                              i0: u32, i1: u32, i2: u32, i3: u32,
                              i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8<Elem>;
    fn simd_shuffle16<T, Elem>(v: T, w: T,
                               i0: u32, i1: u32, i2: u32, i3: u32,
                               i4: u32, i5: u32, i6: u32, i7: u32,
                               i8: u32, i9: u32, i10: u32, i11: u32,
                               i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16<Elem>;
}

The raw definitions are only checked for validity at monomorphisation time, ensuring that T is a SIMD vector, Elem is the element type of T, etc. Libraries can use traits to ensure that these properties are enforced by the type checker too.

This approach has some downsides: simd_shuffle32 (e.g. Simd32<u8> on AVX, and Simd32<u16> on AVX-512) and especially simd_shuffle64 (e.g. Simd64<u8> on AVX-512) are unwieldy. These have similar type "safety"/code-generation errors to the vectors themselves.

These operations are semantically:

// vector of double length
let z = concat(v, w);

return [z[i0], z[i1], z[i2], ...]

The indices iN have to be compile time constants. Out of bounds indices yield unspecified results.
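As a concrete illustration, the semantics above can be modelled in plain scalar Rust. This is a sketch, not the real intrinsic: `shuffle4_model` is a hypothetical name, arrays stand in for SIMD vectors, and the real `simd_shuffle4` requires compile-time constant indices.

```rust
// Scalar model of the shuffle semantics: concatenate the two inputs
// into a double-length vector, then index into the concatenation.
fn shuffle4_model(v: [f32; 4], w: [f32; 4], idx: [usize; 4]) -> [f32; 4] {
    let mut z = [0.0f32; 8];
    z[..4].copy_from_slice(&v);
    z[4..].copy_from_slice(&w);
    // Indices 0-3 select lanes of `v`; indices 4-7 select lanes of `w`.
    [z[idx[0]], z[idx[1]], z[idx[2]], z[idx[3]]]
}

fn main() {
    let v = [1.0, 2.0, 3.0, 4.0];
    let w = [5.0, 6.0, 7.0, 8.0];
    assert_eq!(shuffle4_model(v, w, [7, 6, 1, 0]), [8.0, 7.0, 2.0, 1.0]);
}
```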

Similarly, intrinsics for inserting/extracting elements into/out of vectors are provided, to allow modelling the SIMD vectors as actual CPU registers as much as possible:

extern "rust-intrinsic" {
    fn simd_insert<T, Elem>(v: T, i0: u32, elem: Elem) -> T;
    fn simd_extract<T, Elem>(v: T, i0: u32) -> Elem;
}

The i0 indices do not have to be constant. These are equivalent to v[i0] = elem and v[i0] respectively. They are type checked similarly to the shuffles.
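For intuition, these two intrinsics can also be modelled in scalar code (hypothetical names; the real intrinsics operate on repr(simd) types, not arrays):

```rust
// Scalar model of simd_insert: equivalent to `v[i0] = elem` on a
// vector register, returning the updated vector.
fn simd_insert_model(mut v: [i32; 4], i: usize, elem: i32) -> [i32; 4] {
    v[i] = elem;
    v
}

// Scalar model of simd_extract: equivalent to reading lane `i0`.
fn simd_extract_model(v: [i32; 4], i: usize) -> i32 {
    v[i]
}

fn main() {
    assert_eq!(simd_insert_model([1, 2, 3, 4], 2, 9), [1, 2, 9, 4]);
    assert_eq!(simd_extract_model([1, 2, 3, 4], 1), 2);
}
```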

Comparisons

Comparisons are implemented via intrinsics, because the current comparison operator infrastructure doesn’t easily lend itself to return vectors, as required.

The raw signatures would look like:

extern "rust-intrinsic" {
    fn simd_eq<T, U>(v: T, w: T) -> U;
    fn simd_ne<T, U>(v: T, w: T) -> U;
    fn simd_lt<T, U>(v: T, w: T) -> U;
    fn simd_le<T, U>(v: T, w: T) -> U;
    fn simd_gt<T, U>(v: T, w: T) -> U;
    fn simd_ge<T, U>(v: T, w: T) -> U;
}

These are type checked during code generation similarly to the shuffles, ensuring that T and U have the same length and that U is appropriately “boolean”-y. Libraries can use traits to ensure that these properties are enforced by the type checker too.
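A scalar model of simd_lt on 4-lane i32 vectors might look like the following. The all-ones/all-zeros lane encoding is an assumption about what “boolean”-y means, following the mask style most SIMD hardware produces:

```rust
// Scalar model of simd_lt: each result lane is all-ones (-1) when the
// comparison holds and all-zeros otherwise, in the usual hardware mask
// style. (Illustrative sketch, not the real intrinsic.)
fn simd_lt_model(v: [i32; 4], w: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = if v[i] < w[i] { -1 } else { 0 };
    }
    out
}

fn main() {
    assert_eq!(simd_lt_model([1, 5, 3, 7], [2, 4, 3, 8]), [-1, 0, 0, -1]);
}
```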

Built-in functionality

Any type marked repr(simd) automatically has the +, - and * operators work. The / operator works for floating point, and the << and >> ones work for integers.

SIMD vectors can be converted with as. As with intrinsics, this is "duck-typed" it is possible to cast a vector type V to a type W if their lengths match and their elements are castable (i.e. are primitives), there’s no enforcement of nominal types.

None of these operations are checked: explicit SIMD is essentially only required for speed, and checking would inflate one instruction into 5 or more.
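For integer lanes, “unchecked” presumably means the wrapping semantics the hardware provides; a scalar sketch of that assumption:

```rust
// Scalar sketch of unchecked lane-wise addition: each lane wraps on
// overflow rather than panicking, matching typical SIMD hardware.
// (Assumes the built-in `+` lowers to a plain vector add; this is an
// illustration, not the compiler's actual lowering.)
fn simd_add_model(v: [i32; 4], w: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = v[i].wrapping_add(w[i]);
    }
    out
}

fn main() {
    // Overflow in lane 0 wraps instead of panicking.
    assert_eq!(simd_add_model([i32::MAX, 1, 2, 3], [1, 1, 1, 1]),
               [i32::MIN, 2, 3, 4]);
}
```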

Platform Detection

The availability of efficient SIMD functionality is very fine-grained, and our current cfg(target_arch = "...") is not precise enough. This RFC proposes a target_feature cfg, that would be set to the features of the architecture that are known to be supported by the exact target e.g.

  • a default x86-64 compilation would essentially only set target_feature = "sse" and target_feature = "sse2"
  • compiling with -C target-feature="+sse4.2" would set target_feature = "sse4.2", target_feature = "sse4.1", …, target_feature = "sse".
  • compiling with -C target-cpu=native on a modern CPU might set target_feature = "avx2", target_feature = "avx", …

The possible values of target_feature will be a selected whitelist, not necessarily just everything LLVM understands. There are other non-SIMD features that might set target_feature too, such as popcnt and rdrnd on x86/x86-64.

With a cfg_if_else! macro that expands to the first cfg that is satisfied (à la @alexcrichton’s cascade), code might look like:

cfg_if_else! {
    if #[cfg(target_feature = "avx")] {
        fn foo() { /* use AVX things */ }
    } else if #[cfg(target_feature = "sse4.1")] {
        fn foo() { /* use SSE4.1 things */ }
    } else if #[cfg(target_feature = "sse2")] {
        fn foo() { /* use SSE2 things */ }
    } else if #[cfg(target_feature = "neon")] {
        fn foo() { /* use NEON things */ }
    } else {
        fn foo() { /* universal fallback */ }
    }
}

Extensions

  • scatter/gather operations allow (partially) operating on a SIMD vector of pointers. This would require extending SimdPrim to also allow pointer types.
  • allow (and ignore for everything but type checking) zero-sized types in repr(simd) structs, to allow tagging them with markers

Alternatives

  • The SIMD on-route-to-stable intrinsics could have their own ABI

  • Intrinsics could instead be namespaced by ABI, extern "x86-intrinsic", extern "arm-intrinsic".

  • There could be more syntactic support for shuffles, either with true syntax, or with a syntax extension. The latter might look like: shuffle![x, y, i0, i1, i2, i3, i4, ...]. However, this requires that shuffles are restricted to a single type only (i.e. Simd4<T> can be shuffled to Simd4<T> but nothing else), or some sort of type synthesis. The compiler has to somehow work out the return value:

    let x: Simd4<u32> = ...;
    let y: Simd4<u32> = ...;
    
    // reverse all the elements.
    let z = shuffle![x, y, 7, 6, 5, 4, 3, 2, 1, 0];
    

    Presumably z should be Simd8<u32>, but it’s not obvious how the compiler can know this. The repr(simd) approach means there may be more than one SIMD-vector type with the Simd8<u32> shape (or, in fact, there may be zero).

  • Instead of platform detection, there could be feature detection (e.g. "platform supports something equivalent to x86’s DPPS"), but there probably aren’t enough cross-platform commonalities for this to be worth it. (Each “feature” would essentially be a platform specific cfg anyway.)

  • Check vector operators in debug mode just like the scalar versions.

Unresolved questions

  • Should integer vectors get / and % automatically? Most CPUs don’t support them for vectors.
  • How should out-of-bounds shuffle and insert/extract indices be handled?

#2

I’ve not finished reading yet, so I apologize if the answer is in the text (feel free to tell me to RTFM), but why would we want additional simd_primitive_trait Traits? i.e., why not a lang item?


#3

Looks good.

It isn’t clear why this is desirable. Other common marker traits are lang items. Is there any reason to want more than one SIMD marker?

Making all these grungy intrinsics stable makes me pretty uncomfortable. We carefully hide away all the other intrinsics, and these are acknowledged to be ugly but we’re going to put them in the spec. It would be nice if we could cover our asses and make these available for out-of-tree work without making any commitments about them. __ prefixes maybe?!

Why have ‘cross-platform operations’ if all the platform-specific intrinsics are available. The cross-platform stuff can be implemented in the libs right?

Would be nice to see all the intrinsics enumerated.

The intrinsics are not bound by anything that limits them to Simd. Also they specify the user-defined Simd type in the signature? How does that work when other intrinsics are using primitives or other compiler-known types?


#4

This seems like something that would work well with https://github.com/rust-lang/rfcs/pull/1062 (which really ought to get some love and attention)


#5

Yeah, I am not in love with “random attributes”. Attributes still make me kind of uncomfortable – I wonder if it’s the lack of namespacing. That said, I think the arguments in favor of moving most of the “built-up infrastructure” to crates is excellent – but it’s unclear that these have to be stable crates, at least not for a while, right? That is, they might require nightly builds.


#6

I like this idea, we can specifically document all of these as “very likely to change” to highly discourage use of them and then make the extern crate simd experience so nice you never even feel like you need to use them. Overall it seems like we should definitely be sternly warning against using anything in this RFC as it’s primarily just meant to build the crate externally.

Yeah I think we should definitely land everything in this RFC as unstable to start out with. Only once the external crate has gotten some traction, has been implemented, and is confident that it can expand successfully do I think we should actually stabilize the infrastructure here. Restricting “nice SIMD” to nightly for a little while longer doesn’t seem so bad at all.


#7

As long as it’s Nightly only, and we make it clear that we expect this to evolve, I think that’s fine.


#8

This is a common question so I’ll definitely need to clarify.

Basically: our current lang item system means that there can only be one instance of it in a whole dependency graph. This would mean that two versions of the hypothetical simd crate cannot be linked into a final artifact, which would be extremely unfortunate, I think. (I.e. if a used simd 1.0 and b used simd 2.0, then it would be illegal to depend on both a and b.)

I’m also not particularly comfortable with having a large number of new intrinsics, but I’m not sure there’s a good alternative. Each intrinsic essentially maps to a single hardware CPU instruction on ARM/x86/…, and tweaking the exact instructions is something that users of SIMD care about. We could take a more curated approach where we have a smaller set of intrinsics but this would probably take more effort.

Happy to put in __ prefixes or generally strongly discourage people other than the simd crate itself using this directly.

Yes and no. If you’re talking about shuffles and comparisons: it is much easier to implement shuffles in the compiler, or else every programmer would have to think about the sequence of instructions required to get a given shuffle (I don’t think Rust has enough metaprogramming to allow computing the optimum sequence given an input series of indices).

There are many intrinsics. I’m not sure the RFC would benefit from lists of several thousand intrinsics. Vendors usually provide a canonical C header for their intrinsics, so there’s pretty much no question of naming/functionality with the approach the RFC sketches out.

The approach I want to take is to have them weakly type-checked: basically at monomorphisation time the compiler will check the inputs and outputs all make sense. In the worst case, this will mean that one can get very delayed errors with bad error messages (like the old transmute) however, in practice this shouldn’t occur. The API for, say, simd_shuffle2 would look more like

fn simd_shuffle2<T, U>(v: T, w: T, i0: u32, i1: u32) -> Simd2<U>
    where T: SimdVector<Elem = U>

And the SimdVector trait would be defined to essentially ensure all errors that would be caught at monomorphisation time are caught much earlier. (It would be an unsafe trait etc.) Furthermore, generally people won’t be calling these intrinsics directly; instead they’ll use higher level APIs that manage keeping things all in order.
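A minimal runnable sketch of what such a wrapper trait could look like (all names here are hypothetical, following the signature above):

```rust
// Hypothetical sketch of a library-level SimdVector trait: `unsafe` so
// that only correctly-shaped vector types may implement it, carrying
// the element type so intrinsic wrappers can be bounded by it.
unsafe trait SimdVector {
    type Elem;
    fn lanes() -> usize;
}

// An illustrative vector type; a real simd crate would use #[repr(simd)].
#[allow(dead_code)]
struct F32x4(f32, f32, f32, f32);

unsafe impl SimdVector for F32x4 {
    type Elem = f32;
    fn lanes() -> usize { 4 }
}

fn main() {
    assert_eq!(F32x4::lanes(), 4);
}
```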

Yup! My experiments have been (weakly) blocked on that, in fact.


#9

This looks excellent for my purposes, in terms of functionality :slight_smile:


#10

Wow. Why do these have to be intrinsics and not simply inline assembly? LLVM doesn’t define these bazilion intrinsics does it? What kind of magic do the corresponding C headers do?


#11

Being intrinsics allows for more optimisation, however it’s a good point that inline assembly may work?

I believe it does. The C headers call compiler built-ins that generally follow the intrinsics themselves, e.g. this is an excerpt of clang 3.6.0’s xmmintrin.h (the x86 SSE intrinsics):

static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_max_pu8(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pmaxub((__v8qi)__a, (__v8qi)__b);
}

static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_min_pi16(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pminsw((__v4hi)__a, (__v4hi)__b);
}

#12

It does. The target definition files contain all the information, here’s a section from one of the files: https://github.com/rust-lang/llvm/blob/master/lib/Target/X86/X86InstrSSE.td#L3712-L3736

Inline assembly is not a brilliant option because in most cases it’s completely opaque. I’d much rather let LLVM handle it if it can, instead of using inline asm.


#13

f23 are typos?

Can’t these be linted?

I assume this refers to operations and conversions, because it is not completely clear what “all of this” refers to. While I agree about operations, I think it would be possible to check/lint conversions at compile time, no?

A bit nitpicky, but isn’t cfg!(target_feature="X") and regular branches fine? Looks more, you know, native to rust and optimises well.


#14

The namespace prefixing for the low level intrinsics seems pretty bad to me - could we use modules here?

If the shuffles are going to have compiler support, is it also worth giving them syntax? e.g., x.3210 or x[3210] to reverse a length 4 vec.

I feel like fixed length arrays would be a better match for SIMD vectors than structs, e.g,

#[repr(simd)]
type f32x4 = [f32; 4];

Of course that is adding functionality to type aliases which is probably undesirable. Maybe we should allow struct foo[T; n] as a way to de-anonymise arrays in the same way that tuple structs de-anonymise tuples?


#15

Unfortunately the branch elimination happens during translation so all the code in all the branches needs to be valid. If you limit the availability of platform-specific intrinsics then you need to have a way of removing the invalid functions entirely from the AST.

Well the not-a-constant one is an error. You can’t have non-constant indexes for shuffling.


#16

There’s one major missing component to this RFC: runtime CPU feature detection. Getting good performance out of SIMD on x86 in particular is heavily dependent on the exact features exposed by a CPU, and this RFC provides no way to conditionally use a CPU feature. It’s not necessary to implement runtime detection immediately, but the design of platform detection needs to take it into account. I’m not sure that cfg() is the right model: cfg features have to be consistent across an entire program.

One argument in favor of supporting integer /: some common cases, like x / 10, can be vectorized without any special CPU support.
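The x / 10 point can be seen in scalar code: division by a compile-time constant lowers to a multiply-and-shift sequence, so a loop like the following can auto-vectorize even though x86 has no packed integer divide (a sketch; `div10` is an illustrative name):

```rust
// Division by a constant needs no hardware division instruction:
// compilers lower `x / 10` to multiply-and-shift, so this loop is a
// candidate for auto-vectorization without vector-divide support.
fn div10(xs: &mut [u32]) {
    for x in xs.iter_mut() {
        *x /= 10;
    }
}

fn main() {
    let mut v = vec![25u32, 7, 100, 999];
    div10(&mut v);
    assert_eq!(v, [2, 0, 10, 99]);
}
```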


#17

Regarding the question of intrinsics as opposed to inline assembly, and considering the consensus seems to be that extern crate simd; would be quite okay living on nightly for now, why not have the first few iterations be along these lines?

#[inline(always)]
fn _mm_max_pu8(a: [u8; 8], b: [u8; 8]) {
    asm!("just one instruction" : "with proper" : "register", "specifiers");
}

While it would only accept literal types, the compiler should be able to optimize it quite well without requiring a shedload of new intrinsics.

I also think that the quirky structural typing behavior being proposed for the intrinsics needs a lot more discussion, and might be better served by creating a trait for types to opt in to structural typing For Realsies. Then the interfaces could change to:

#[inline(always)]
fn _mm_max_pu8<T>(a: T, b: T) where T: Structural<Layout=[u8;8]> {
    asm!("just one instruction" : "with proper" : "register", "specifiers");
}

(cue the bikeshed)


#18

Why is this attribute necessary? Wouldn’t a marker-trait suffice?

unsafe trait SimdPrim {}

unsafe impl SimdPrim for u32 {}
unsafe impl SimdPrim for i32 {}
unsafe impl SimdPrim for u64 {}
unsafe impl SimdPrim for i64 {}
unsafe impl SimdPrim for u16 {}
unsafe impl SimdPrim for i16 {}
unsafe impl SimdPrim for u8 {}
unsafe impl SimdPrim for i8 {}
unsafe impl SimdPrim for usize {}
unsafe impl SimdPrim for isize {}
unsafe impl SimdPrim for f32 {}
unsafe impl SimdPrim for f64 {}
unsafe impl SimdPrim for bool {}

struct Simd4<T: SimdPrim>([T; 4]);

// this might not be right, since it allows nesting...
unsafe impl<T: SimdPrim> SimdPrim for Simd4<T> {}

So a [T: SimdPrim; N] ?

Until we get value generics, couldn’t this be a shim to the actual intrinsic + some debug_assert calls?


#19

What I am asking is whether it would be possible to write a lint/compile-time check to ensure indices are not out of bounds. At first sight it shouldn’t be impossible, because the indices are constant and the vector length seems to be part of the type, hence the question.


#20

Thanks everyone for your responses! I’m replying here and also adjusting my local copy of the RFC. :smile:

Yes, thanks.

As @Aatch said, not really. The constants have to actually be constants for code generation: linting isn’t enough. For out-of-bounds accesses, I’m not sure we can tackle every case (and, even if we do have a lint, we have to do something for allow/warn, since the code will run at runtime). In particular, my intention is to use something like RFC 1062 to wrap the raw intrinsics, so it’s not obvious when an index will be out of bounds. (In general it won’t be known until code generation time.)

Could you expand? I don’t know what linting/checking you’re envisioning: it’s not possible to check/lint conversions of values that are too large for the target type since the values are only known at runtime in general.

The namespacing is for the compiler to recognise which intrinsic to call, we’d have to have more trickery if we just wanted to use modules (the compiler would have to consider the name of the module when looking at an extern block to work out what it should be doing).

I agree it’s an important part of SIMD functionality. However, I’m not sure this RFC is the place to solve it. Certainly the concern essentially only applies to the cfg(target_feature) part of the RFC.

I thought about it before posting this RFC, and I’m not sure how to do it any way other than cfg. I think we may want some way to compile a crate with several different configurations and load them together, similar to the C/C++ method. (I.e. basically C/C++ will compile each file with different configurations.)

I’d be extremely interested in hearing others’ thoughts about this.

Good point. I wonder if something like vector.const_div::<10>() works… Or maybe vector / Const::<10> (brainstorming…). Or if we should just eat the performance cliff and allow plain old vector / 10 (or vector1 / vector2) and rely on the optimiser to handle the constant cases.

This sounds like it may be something to investigate for libraries to be able to impose type-safety on the raw intrinsics, but the utility of enforcing this on every intrinsic at the compiler level is not totally obvious to me.

It’s not obvious to me how much this representation detail matters. :slight_smile:

In any case: we can define [T; n] as another thing that can be repr(simd)'d. If/when we get generic integers that can be used for array lengths, it seems very useful to allow it, but it’s not clearly useful right now.

The original intention was to use the attribute as a cue to flatten the representation. Currently we represent Foo in the following with several layers of LLVM structs.

struct Foo(Bar);
struct Bar(Baz);
struct Baz(u8);

However, we need to represent it as a raw u8 (well, i8 in LLVM’s parlance). I suppose we could just do this automatically whenever types are used in repr(simd): they’re totally flattened. It then becomes the responsibility of the libraries building on this functionality to provide the appropriate bounds to ensure the non-representation properties (i.e. making SIMD-compatibility part of a type’s interface).

I’m not sure what you mean. Could you clarify? This RFC isn’t proposing how to implement higher-level interfaces but it sounds like this may be what you’re talking about?

In any case, debug_assert!s aren’t enough to ensure something is a compile-time constant. (Totally minor note: if things are compile time constants there’s no reason to use debug_assert over assert: the branches will be statically known.)

Also, shimming with some sort of match to “convert” runtime values into statically known compile-time ones and relying on the optimiser to eliminate the branch for true compile-time ones runs into an exponential explosion: even just shuffling f32x4's requires 8^4 = 4096 branches, and it grows super-exponentially with the number of elements n ((2n)^n): certainly totally unreasonable to do u8x16 shuffles in this way.
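The branch-count arithmetic can be made concrete (an illustrative helper, not proposed API):

```rust
// Number of match arms a shuffle shim would need: an n-lane shuffle
// draws each of its n indices from the 2n-element concatenation of the
// two inputs, giving (2n)^n combinations. u128 avoids overflow for
// n = 16, where 32^16 = 2^80.
fn shim_arms(n: u32) -> u128 {
    (2 * n as u128).pow(n)
}

fn main() {
    assert_eq!(shim_arms(4), 4096);          // f32x4, as noted above
    assert_eq!(shim_arms(16), 1u128 << 80);  // u8x16: utterly infeasible
}
```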

In any case, I have no idea if this is what you were envisioning. Please correct any of my misunderstandings! :smile: