Pre-RFC: SIMD groundwork

I like this idea. We can specifically document all of these as "very likely to change" to strongly discourage their use, and then make the extern crate simd experience so nice you never even feel like you need them. Overall it seems like we should definitely warn sternly against using anything in this RFC directly, as it's primarily just meant for building the crate externally.

Yeah, I think we should definitely land everything in this RFC as unstable to start out with. Only once the external crate has gotten some traction, has been implemented, and is confident that it can expand successfully do I think we should actually stabilize the infrastructure here. Restricting "nice SIMD" to nightly for a little while longer doesn't seem so bad at all.

As long as it's Nightly only, and we make it clear that we expect this to evolve, I think that's fine.

This is a common question so I'll definitely need to clarify.

Basically: our current lang item system means that there can only be one instance of it in a whole dependency graph. This would mean that two versions of the hypothetical simd crate cannot be linked into a final artifact, which would be extremely unfortunate, I think. (I.e. if a used simd 1.0 and b used simd 2.0, then it would be illegal to depend on both a and b.)

I'm also not particularly comfortable with having a large number of new intrinsics, but I'm not sure there's a good alternative. Each intrinsic essentially maps to a single hardware CPU instruction on ARM/x86/..., and tweaking the exact instructions is something that users of SIMD care about. We could take a more curated approach where we have a smaller set of intrinsics but this would probably take more effort.

Happy to put in __ prefixes or generally strongly discourage people other than the simd crate itself using this directly.

Yes and no. If you're talking about shuffles and comparisons: it is much easier to implement shuffles in the compiler, or else every programmer would have to think about the sequence of instructions required to get a given shuffle (I don't think Rust has enough metaprogramming to allow computing the optimum sequence given an input series of indices).

There are many intrinsics. I'm not sure the RFC would benefit from lists of several thousand intrinsics. Vendors usually provide a canonical C header for their intrinsics, so there's pretty much no question of naming/functionality with the approach the RFC sketches out.

The approach I want to take is to have them weakly type-checked: basically, at monomorphisation time the compiler will check that the inputs and outputs all make sense. In the worst case, this will mean that one can get very delayed errors with bad error messages (like the old transmute); however, in practice this shouldn't occur. The API for, say, simd_shuffle2 would look more like

fn simd_shuffle2<T, U>(v: T, w: T, i0: u32, i1: u32) -> Simd2<U>
    where T: SimdVector<Elem = U>

And the SimdVector trait would be defined to essentially ensure all errors that would be caught at monomorphisation time are caught much earlier. (It would be an unsafe trait etc.) Furthermore, generally people won't be calling these intrinsics directly; instead they'll use higher level APIs that manage keeping things all in order.
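To make the shape of that API a bit more concrete, here is a rough software-only sketch. Everything in it is an assumption for illustration: the `lane` method, the `LANES` constant, and the tuple-struct `Simd2`/`Simd4` types are hypothetical, and the function body just emulates what the real intrinsic would do in hardware. The point is only that the `T: SimdVector<Elem = U>` bound rejects mismatched types during type checking rather than at monomorphisation time.

```rust
/// Hypothetical marker for element types that may live in a SIMD vector.
pub unsafe trait SimdPrim: Copy {}
unsafe impl SimdPrim for f32 {}
unsafe impl SimdPrim for u32 {}

/// Hypothetical trait tying a vector type to its element type and length.
pub unsafe trait SimdVector: Copy {
    type Elem: SimdPrim;
    const LANES: usize;
    /// Reads one lane; panics if `i` is out of range (emulation only).
    fn lane(self, i: usize) -> Self::Elem;
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd2<T>(pub T, pub T);

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd4<T>(pub T, pub T, pub T, pub T);

unsafe impl<T: SimdPrim> SimdVector for Simd2<T> {
    type Elem = T;
    const LANES: usize = 2;
    fn lane(self, i: usize) -> T {
        match i {
            0 => self.0,
            1 => self.1,
            _ => panic!("lane index out of bounds"),
        }
    }
}

unsafe impl<T: SimdPrim> SimdVector for Simd4<T> {
    type Elem = T;
    const LANES: usize = 4;
    fn lane(self, i: usize) -> T {
        match i {
            0 => self.0,
            1 => self.1,
            2 => self.2,
            3 => self.3,
            _ => panic!("lane index out of bounds"),
        }
    }
}

/// Software stand-in for the proposed `simd_shuffle2`: indices below
/// `T::LANES` select from `v`, the remaining ones select from `w`.
pub fn simd_shuffle2<T, U>(v: T, w: T, i0: u32, i1: u32) -> Simd2<U>
where
    T: SimdVector<Elem = U>,
    U: SimdPrim,
{
    let pick = |i: u32| {
        let i = i as usize;
        if i < T::LANES { v.lane(i) } else { w.lane(i - T::LANES) }
    };
    Simd2(pick(i0), pick(i1))
}
```

With this, `simd_shuffle2(v, w, 0, 5)` on two `Simd4<f32>` values picks lane 0 of `v` and lane 1 of `w`, while passing a `Simd4<f32>` and a `Simd4<u32>` together is a plain type error.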

Yup! My experiments have been (weakly) blocked on that, in fact.

This looks excellent for my purposes, in terms of functionality :slight_smile:

Wow. Why do these have to be intrinsics and not simply inline assembly? LLVM doesn't define these bazillion intrinsics, does it? What kind of magic do the corresponding C headers do?

Being intrinsics allows for more optimisation, however it's a good point that inline assembly may work?

I believe it does. The C headers call compiler built-ins that generally follow the intrinsics themselves, e.g. this is an excerpt of clang 3.6.0's xmmintrin.h (the x86 SSE intrinsics):

static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_max_pu8(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pmaxub((__v8qi)__a, (__v8qi)__b);
}

static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_min_pi16(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pminsw((__v4hi)__a, (__v4hi)__b);
}

It does. The target definition files contain all the information, here's a section from one of the files: llvm/lib/Target/X86/X86InstrSSE.td at master · rust-lang/llvm · GitHub

Inline assembly is not a brilliant option because in most cases it's completely opaque. I'd much rather let LLVM handle it if it can, instead of using inline asm.

f23 are typos?

Can’t these be linted?

I assume this refers to operations and conversions, because it is not completely clear what “all of this” refers to. While I agree about operations, I think it would be possible to check/lint conversions at compile time, no?

A bit nitpicky, but isn’t cfg!(target_feature="X") and regular branches fine? Looks more, you know, native to rust and optimises well.

The namespace prefixing for the low level intrinsics seems pretty bad to me - could we use modules here?

If the shuffles are going to have compiler support, is it also worth giving them syntax? e.g., x.3210 or x[3210] to reverse a length 4 vec.

I feel like fixed length arrays would be a better match for SIMD vectors than structs, e.g,

#[repr(simd)]
type f32x4 = [f32; 4];

Of course that is adding functionality to type aliases which is probably undesirable. Maybe we should allow struct foo[T; n] as a way to de-anonymise arrays in the same way that tuple structs de-anonymise tuples?

Unfortunately the branch elimination happens during translation, so all the code in all the branches needs to be valid. If you limit the availability of platform-specific intrinsics then you need to have a way of removing the invalid functions entirely from the AST.
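A small sketch of the difference, in case it helps. The function names and lane counts here are made up for illustration, and `wide_only` merely stands in for a platform-specific intrinsic that only exists when AVX is enabled at compile time.

```rust
// Stand-in for an item that only exists on some targets (hypothetical).
#[cfg(target_feature = "avx")]
fn wide_only() -> usize {
    8
}

// With `cfg!`, both branches are type-checked on every target, so neither
// branch may mention `wide_only`: writing `if cfg!(..) { wide_only() }`
// would fail to compile wherever AVX is not enabled.
pub fn lanes_via_macro() -> usize {
    if cfg!(target_feature = "avx") { 8 } else { 4 }
}

// With `#[cfg]`, the non-matching definition is removed before type
// checking, so this version is free to call the target-only item.
#[cfg(target_feature = "avx")]
pub fn lanes_via_attr() -> usize {
    wide_only()
}

#[cfg(not(target_feature = "avx"))]
pub fn lanes_via_attr() -> usize {
    4
}
```

Both functions return the same number on any given target; the attribute form is the only one that can reference items that are missing elsewhere.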

Well the not-a-constant one is an error. You can't have non-constant indexes for shuffling.

There’s one major missing component to this RFC: runtime CPU feature detection. Getting good performance out of SIMD on x86 in particular is heavily dependent on the exact features exposed by a CPU, and this RFC provides no way to conditionally use a CPU feature. It’s not necessary to implement runtime detection immediately, but the design of platform detection needs to take it into account. I’m not sure that cfg() is the right model: cfg features have to be consistent across an entire program.
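For illustration only, this is roughly the kind of dispatch I have in mind. It is a sketch under assumptions: it uses the `is_x86_feature_detected!` macro and the `#[target_feature]` attribute as they exist in later versions of Rust, neither of which is part of this RFC, and the function names are hypothetical. Note that the feature check applies to a single function, not to the whole program.

```rust
/// Portable fallback that works on every target.
pub fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Same body as the fallback, but the attribute allows the compiler to use
/// AVX2 instructions inside this one function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Chooses an implementation when called, based on the running CPU.
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```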

One argument in favor of supporting integer /: some common cases, like x / 10, can be vectorized without any special CPU support.
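As a sketch of why no special support is needed: unsigned division by 10 can be done with a widening multiply and a shift, both of which vectorise. The function name is hypothetical and a plain array stands in for a u32x4 vector; the constant is ceil(2^35 / 10), the usual reciprocal for 32-bit unsigned operands.

```rust
/// Divides each lane by 10 using only a widening multiply and a shift.
pub fn div10_lanes(v: [u32; 4]) -> [u32; 4] {
    // ceil(2^35 / 10); the product below always fits in 64 bits.
    const MAGIC: u64 = 0xCCCC_CCCD;
    let mut out = [0u32; 4];
    for (o, &x) in out.iter_mut().zip(v.iter()) {
        *o = ((x as u64 * MAGIC) >> 35) as u32;
    }
    out
}
```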

Regarding the question of intrinsics as opposed to inline assembly, and considering the consensus seems to be that extern crate simd; would be quite okay living on nightly for now, why not have the first few iterations be along these lines?

#[inline(always)]
fn _mm_max_pu8(a: [u8; 8], b: [u8; 8]) {
    asm!("just one instruction" : "with proper" : "register", "specifiers");
}

While it would only accept literal types, the compiler should be able to optimize it quite well without requiring a shedload of new intrinsics.

I also think that the quirky structural typing behavior being proposed for the intrinsics needs a lot more discussion, and might be better served by creating a trait for types to opt in to structural typing For Realsies. Then the interfaces could change to:

#[inline(always)]
fn _mm_max_pu8<T>(a: T, b: T) where T: Structural<Layout=[u8;8]> {
    asm!("just one instruction" : "with proper" : "register", "specifiers");
}

(cue the bikeshed)

Why is this attribute necessary? Wouldn't a marker-trait suffice?

unsafe trait SimdPrim {}

unsafe impl SimdPrim for u32 {}
unsafe impl SimdPrim for i32 {}
unsafe impl SimdPrim for u64 {}
unsafe impl SimdPrim for i64 {}
unsafe impl SimdPrim for u16 {}
unsafe impl SimdPrim for i16 {}
unsafe impl SimdPrim for u8 {}
unsafe impl SimdPrim for i8 {}
unsafe impl SimdPrim for usize {}
unsafe impl SimdPrim for isize {}
unsafe impl SimdPrim for f32 {}
unsafe impl SimdPrim for f64 {}
unsafe impl SimdPrim for bool {}

struct Simd4<T: SimdPrim>([T; 4]);

// this might not be right, since it allows nesting...
unsafe impl<T: SimdPrim> SimdPrim for Simd4<T> {}

So a [T: SimdPrim; N] ?

Until we get value generics, couldn't this be a shim to the actual intrinsic + some debug_assert calls?

What I am asking is whether it would be possible to write a lint/compile-time check to ensure indices are not out-of-bounds. At first sight it shouldn’t be impossible, because indices are constant and vector length seems to be a part of the type, hence the question.

Thanks everyone for your responses! I'm replying here and also adjusting my local copy of the RFC. :smile:

Yes, thanks.

As @Aatch said, not really. The constants have to actually be constants for code-generation: linting isn't enough. For out of bounds accesses, I'm not sure we can tackle every case (and, even if we do have a lint, we have to do something for allow/warn, since the code will run at runtime). In particular, my intention is to use something like RFC 1062 to wrap the raw intrinsics, so it's not obvious when an index will be out-of-bounds. (In general it won't be known until code generation time.)

Could you expand? I don't know what linting/checking you're envisioning: it's not possible to check/lint conversions of values that are too large for the target type since the values are only known at runtime in general.

The namespacing is for the compiler to recognise which intrinsic to call, we'd have to have more trickery if we just wanted to use modules (the compiler would have to consider the name of the module when looking at an extern block to work out what it should be doing).

I agree it's an important part of SIMD functionality. However, I'm not sure this RFC is the place to solve it. Certainly the concern essentially only applies to the cfg(target_feature) part of the RFC.

I thought about it before posting this RFC, and I'm not sure how to do it any way other than cfg. I think we may want some way to compile a crate with several different configurations and load them together, similar to the C/C++ method. (I.e. basically C/C++ will compile each file with different configurations.)

I'd be extremely interested in hearing others' thoughts about this.

Good point. I wonder if something like vector.const_div::<10>() works... Or maybe vector / Const::<10> (brainstorming...). Or if we should just eat the performance cliff and allow plain old vector / 10 (or vector1 / vector2) and rely on the optimiser to handle the constant cases.

This sounds like it may be something to investigate for libraries to be able to impose type-safety on the raw intrinsics, but the utility of enforcing this on every intrinsic at the compiler level is not totally obvious to me.

It's not obvious to me how much this representation detail matters. :slight_smile:

In any case: we can define [T; n] as another thing that can be repr(simd)'d. If/when we get generic integers that can be used for array lengths, it seems very useful to allow it, but it's not clearly useful right now.

The original intention was to use the attribute as a cue to flatten the representation. Currently we represent Foo in the following with several layers of LLVM structs.

struct Foo(Bar);
struct Bar(Baz);
struct Baz(u8);

However, we need to represent it as a raw u8 (well, i8 in LLVM's parlance). I suppose we could just do this automatically whenever types are used in repr(simd): they're totally flattened. It then becomes the responsibility of the libraries building on this functionality to provide the appropriate bounds to ensure the non-representation properties (i.e. making SIMD-compatibility part of a type's interface).

I'm not sure what you mean. Could you clarify? This RFC isn't proposing how to implement higher-level interfaces but it sounds like this may be what you're talking about?

In any case, debug_assert!s aren't enough to ensure something is a compile-time constant. (Totally minor note: if things are compile time constants there's no reason to use debug_assert over assert: the branches will be statically known.)

Also, shimming with some sort of match to "convert" runtime values into statically known compile-time ones and relying on the optimiser to eliminate the branch for true compile-time ones runs into an exponential explosion: even just shuffling f32x4's requires 8^4 = 4096 branches, and it grows super-exponentially with the number of elements ((2n)^n): certainly totally unreasonable to do u8x16 shuffles in this way.
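To put numbers on that, a quick back-of-the-envelope helper (the function name is just for illustration): each of the n output lanes can come from any of the 2n input lanes, so a match-based shim needs (2n)^n arms.

```rust
/// Number of match arms needed to cover every shuffle of two n-lane vectors:
/// each of the n outputs picks one of 2n inputs, giving (2n)^n combinations.
pub fn shuffle_branches(n: u32) -> u128 {
    (2 * n as u128).pow(n)
}
```

For n = 4 this gives the 4096 above; for n = 16 (u8x16) it gives 32^16 = 2^80.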

In any case, I have no idea if this is what you were envisioning. Please correct any of my misunderstandings! :smile:

Rereading the original post now, I think I misunderstood what it meant by checking (overflow checking (?)), while I meant something along the lines of cast validity checking (e.g. from u16x4 to u8x4 is valid (?) and u16x4 to u8x6 is not).

I got that, but misunderstood the rest. Those debug_asserts definitely don't make sense at all. I think I now understand.

Macros could be the solution until we get value generics:


macro_rules! do_something_with_const {
    ($a: expr) => ({
        const A: usize = $a;
        call_only_with_const_arg(A);
    })
}

fn call_only_with_const_arg(_: usize) {}

fn main() {
    let a = 42;
    do_something_with_const!(a);
}

yields the following:

<anon>:13:30: 13:31 error: attempt to use a non-constant value in a constant
<anon>:13     do_something_with_const!(a);
                                       ^
<anon>:2:1: 7:2 note: in expansion of do_something_with_const!
<anon>:13:5: 13:33 note: expansion site
<anon>:13:30: 13:31 error: unresolved name `a`
<anon>:13     do_something_with_const!(a);
                                       ^
<anon>:2:1: 7:2 note: in expansion of do_something_with_const!
<anon>:13:5: 13:33 note: expansion site

Together with static_assert, it should even be possible to do bounds checks.

It's not pretty, but it does the job.
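For comparison, here is a rough sketch of what the same idea might look like once value generics exist. It assumes const generic parameters and panics in constant evaluation, neither of which is available as of this thread, and the `InBounds` helper and `shuffle2` signature are hypothetical. Plain arrays stand in for f32x4 vectors.

```rust
/// Hypothetical helper: evaluating `OK` fails to compile when `I >= N`.
struct InBounds<const I: usize, const N: usize>;

impl<const I: usize, const N: usize> InBounds<I, N> {
    const OK: () = assert!(I < N, "shuffle index out of bounds");
}

/// Two-output shuffle over two 4-lane inputs with compile-time indices.
/// Indices 0..=3 select from `v`, indices 4..=7 select from `w`.
pub fn shuffle2<const I0: usize, const I1: usize>(v: [f32; 4], w: [f32; 4]) -> [f32; 2] {
    // Referencing the constants forces the bounds checks to be evaluated
    // when this function is instantiated, not when it runs.
    let () = InBounds::<I0, 8>::OK;
    let () = InBounds::<I1, 8>::OK;
    let both = [v[0], v[1], v[2], v[3], w[0], w[1], w[2], w[3]];
    [both[I0], both[I1]]
}
```

Here `shuffle2::<0, 5>(v, w)` compiles, whereas `shuffle2::<0, 8>(v, w)` is rejected when the function is instantiated.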

This is exactly what I see as "quirky structural typing" - "these types are equivalent if their representations are equivalent."

Treating this as some special-case of intrinsics (and only some intrinsics at that!) strikes me as strange and problematic; if structural typing is valuable, I think it needs to be handled carefully. Until then, I honestly think that requiring exact types is Good Enough.

Which exact types? If we required specific types, we'd need some way to inform the compiler of them, and the specificity requirement would imply that there's exactly one type that can be used? If so, that would impose the requirement that there's only one definition of SIMD types in a given tree of dependencies, and I really really don't want that requirement. (For one, I personally don't expect to get everything right first time, so it'd be very good if people are free to experiment themselves without "accidental"/arbitrary restrictions.)

I'm moderately concerned that introducing a vein of "relaxed" typing into the compiler will leave the door open for abuse/crazy tricks, but I'm unsure. However it seems quite restricted, so it's not clear to me that one can do anything even slightly useful with it. Note, in practice, people using SIMD won't need to worry about this at all: libraries will define things to ensure type safety (even at the intrinsic level).

NB. for the platform specific intrinsics we can/will require that they aren't generic, so can type-check them in the type checker, properly (properly == answering is this a SIMD type of the appropriate length with the appropriate element type?). Hence, this discussion is basically the difference between being able to write fn simd_shuffle2<T, U>(v: T, w: T, ...) -> Simd2<U> and having the option to impose type safety (which is what will happen in practice), or being forced to have a scheme by which the compiler can be totally assured that things will work. This would either require writing separate shuffle/comparison intrinsics for every concrete SIMD type, or would require compiler-known traits (like #[simd_primitive_trait], but also at least one more, I think) with some compulsory associated types and so on.

#[simd_primitive_trait]
trait SimdPrim {
     type Bool: SimdPrim;
}
#[simd_vector_trait]
trait SimdVector {
     type Elem: SimdPrim;
     type Bool: SimdVector<Elem = Self::Elem::Bool>;
}

#[repr(simd)]
struct Simd2<T: SimdPrim>(T, T);

impl<T: SimdPrim> SimdVector for Simd2<T> {
    type Elem = T;
    type Bool = Simd2<T::Bool>;
}

extern {
    fn simd_shuffle2<T: SimdVector>(v: T, w: T, i0: u32, i1: u32) -> Simd2<T::Elem>;
    // ...

    fn simd_lt<T: SimdVector>(v: T, w: T) -> T::Bool;
    // ...
}

We'd need to have careful restrictions about how the implementations of SimdPrim and SimdVector can work, and especially around generic types. It seems very complicated, and I'm not sure it's worth it.

Seems like a good work around for prototyping/while we wait, yeah. Thanks!