Getting explicit SIMD on stable Rust

Right. I forgot to include that in my previous comment. My understanding of that solution is that at codegen time, if a function being called has a SIMD type in its type signature, then the enclosing function would have target feature X turned on automatically? I guess the idea here is that this works even in the presence of generics, but could possibly result in more SIGILLs than one might expect? The other interesting part of this is determining what X is. e.g., I guess it would be +sse on x86 for 128 bit vectors and +avx on x86 for 256 bit vectors.

What about this:

  1. you may only use vectors in function signatures if those vectors are supported with the features the function is compiled for (either through #[target_feature] or -C target-feature)
  2. you may only call functions with vectors in the signature if those vectors are supported with the features the calling function is compiled for (again, either through #[target_feature] or -C target-feature)

This should stop any “feature” leakage while point 2 still allows you to write abstractions that can be called by regular functions.

Example (assuming -C target-feature is not passed):

#[target_feature = "+avx"]
fn memcpy_aligned_avx(dst: &mut [u8x32], src: &[u8x32]) {
    /* ... */
}

#[target_feature = "+avx"]
fn memcpy_avx(dst: &mut [u8], src: &[u8]) {
    let dst_avx = /* ... */;
    let src_avx = /* ... */;
    memcpy_aligned_avx(dst_avx, src_avx); // OK because this function has AVX
    // + handle edges
}

fn memcpy(dst: &mut [u8], src: &[u8]) {
    if cpu_has_feature(Feature::Avx) {
        memcpy_avx(dst, src); // OK because signature does not use vectors
    } else {
        /* ... */
    }
}

fn do_memcpy() {
    memcpy(/*...*/); // OK
}

fn do_memcpy2() {
    memcpy_aligned_avx(/*...*/); // ERROR: function `memcpy_aligned_avx` requires
                                 // target feature AVX but the calling function
                                 // `do_memcpy2` isn't compiled for that target.
}

Also in this example, memcpy_avx (and memcpy_aligned_avx) would fail to compile if #[target_feature] wasn’t specified.

This should all hold true regardless of automatic/forced inlining. If this setup results in any codegen errors, those would need to be fixed at the LLVM level, since I think this abstraction is sound.

What about generics? For example, if our simd vector types implement Mul, then a function generic over Mul (without any target feature attribute) could accept simd vectors as parameters. Wouldn’t that cause problems with your approach?

I think the same checks I proposed could apply at monomorphization time. Although, yeah, that does seem to be a problem, considering you won’t be able to apply #[target_feature] to the monomorphized functions appropriately.
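For concreteness, the problematic generic scenario can be sketched with a stand-in vector type (the `F32x4` struct and `square` function here are hypothetical illustrations, not from any actual crate):

```rust
use std::ops::Mul;

// Stand-in for a SIMD vector type; a real `f32x4` would map to a vector
// register and carry a target-feature requirement.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl Mul for F32x4 {
    type Output = F32x4;
    fn mul(self, rhs: F32x4) -> F32x4 {
        let mut out = [0.0; 4];
        for i in 0..4 {
            out[i] = self.0[i] * rhs.0[i];
        }
        F32x4(out)
    }
}

// A generic function with no #[target_feature] attribute: nothing stops a
// caller from instantiating T with a SIMD vector type, so any feature
// requirement can only be discovered at monomorphization time.
fn square<T: Mul<Output = T> + Copy>(x: T) -> T {
    x * x
}

fn main() {
    let v = F32x4([1.0, 2.0, 3.0, 4.0]);
    // `square` happily accepts the SIMD-like type even though no target
    // feature was checked anywhere in its signature.
    println!("{:?}", square(v));
}
```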

Another interesting question is what happens if you put a vector type in a tuple or struct.

If the SIMD function you’re calling depends on, say, AVX, and takes an AVX type as an argument (f64x4), how do you expect the compiler to construct an instance of this f64x4 in the AVX register if the AVX feature is disabled in the first place?

It seems to me that constructing the AVX argument ought to imply the relevant feature, rather than calling an AVX-enabled function (which might not even care about features in the caller; e.g. if it doesn’t take any AVX vectors as arguments, or return them).

The feature required to construct a SIMD vector could then be part of the type itself (e.g. a #[requires_feature] annotation, which could be made somewhat cross-platform too).


EDIT: I should note that I very much dislike how much implicitness would be going on. Instead of propagating target features, I feel like the compiler ought to generate AVX shims, which take your vectors using some arbitrary Rust ABI and convert those vectors to the relevant SIMD registers before calling the feature-ful function. We already do something similar with "C" ABI functions.
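A hand-written version of that shim idea might look like the following sketch. Everything here is illustrative: the names are made up, and the inner function uses a scalar loop where a real implementation would carry `#[target_feature(enable = "avx2")]` and operate on vector registers.

```rust
// Stand-in for a 256-bit vector type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct I64x4([i64; 4]);

// Public shim: ordinary Rust ABI, so callers need no target feature.
// The vector travels across the boundary in a feature-agnostic form
// (a plain array) and is converted at the call site.
fn add_vectors(a: [i64; 4], b: [i64; 4]) -> [i64; 4] {
    add_vectors_inner(I64x4(a), I64x4(b)).0
}

// In the real scheme this inner function would be compiled with the
// target feature enabled and would load its arguments into ymm registers;
// here a scalar loop stands in for that body.
fn add_vectors_inner(a: I64x4, b: I64x4) -> I64x4 {
    let mut out = [0i64; 4];
    for i in 0..4 {
        out[i] = a.0[i] + b.0[i];
    }
    I64x4(out)
}

fn main() {
    println!("{:?}", add_vectors([1, 2, 3, 4], [10, 20, 30, 40]));
}
```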

Not anymore. Also doing something specifically for the Rust ABI is never a real option because you can have generic extern "C" functions. But maybe you're only referring to not propagating the features to the caller, i.e. having an ABI barrier preventing instructions from being generated in a caller that didn't opt in?

Well, the caller can also be generic and we don't want to error at monomorphization time. So we don't really have as many options as one may think.

AFAICT, there is basically only one option: if monomorphization leads to a function either passing a SIMD type as an argument or receiving one as a return value, it must be compiled with the appropriate target feature. Anything else seems like it would lead to either monomorphization-time errors or surprising behavior.

Here are two suggestions to mitigate the unfortunate implicitness:

  • pre-monomorphization, it is a compile error to pass a SIMD type or get one as a return value, unless you are in a function that is compiled with the appropriate target feature. This limits the implicit target_feature propagation to generic functions, and doesn't introduce any post-monomorphization errors.

  • post-monomorphization, it is a warning to call a monomorphized function that got an implicit target feature, unless you are calling from a function with that target feature (implicit or not). There would be an attribute to silence this warning, since it's legitimate to do this after a runtime CPU feature check.

The implication of the first rule for @burntsushi's examples: in

#[inline(always)]
#[target_feature = "+avx2"]
fn testfoo(x: i64, y: i64, shift: i32) -> i64 {
    let a = i64x4(x, x, y, y);
    _mm256_slli_epi64(a, shift).0
}

it would be a compile-time error to leave off target_feature, regardless of whether inline is specified. In

#[target_feature = "+avx2"]
fn testfoo(x: i64, y: i64, shift: i32) -> i64x4 {
    let a = i64x4(x, x, y, y);
    _mm256_slli_epi64(a, shift)
}

it would be a compile-time error to call testfoo from a function without an appropriate target_feature (avx or avx2).

I am out of my depth here, but for my own edification: could it potentially make sense to pass e.g. i64x4 the same way as an [i64; 4] in the case where the required target feature isn’t available, instead of making it an error? Presumably at the cost of performance, but maintaining compatibility.

My first instinct was to do this always (part of the problem is that you need ABI compatibility between code using and not using the target feature; otherwise what you’re saying is the UB status quo), and AFAICT you can do it for the “Rust” ABI, but you still have the original problem for extern "C" (which can be both generic and safe).

I guess the issue here is dynamic calls? (For static calls you could just insert a shim at mismatches?)

Even so, at a minimum that means you could do it this way for extern "Rust" and only make it a hard error for extern "C" then, doesn’t it?

What is the purpose of generic extern "C"?

Any function definition can have any ABI, in theory. In practice, there’s a common pattern of writing generic callbacks that take a “user data void* pointer” and call a closure behind it. That way, you can pass the closure to C through the typical callback fn pointer (an instance of your callback adapter, for that closure type) and the user data pointer (the closure’s captures).
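That pattern looks roughly like this sketch; `c_style_invoke` is a hypothetical stand-in for a real C API that takes a callback plus a user-data pointer:

```rust
use std::os::raw::c_void;

// Generic adapter with the "C" ABI: C code can call it through a plain
// function pointer, while the void* user-data pointer carries the closure.
extern "C" fn call_closure<F: FnMut(i32)>(user_data: *mut c_void, arg: i32) {
    // Safety: `user_data` must point at a live `F`; upheld by `with_callback`.
    let f = unsafe { &mut *(user_data as *mut F) };
    f(arg);
}

// Monomorphizes `call_closure` for the concrete closure type `F` and pairs
// it with a pointer to the closure's captures.
fn with_callback<F: FnMut(i32)>(f: &mut F, arg: i32) {
    c_style_invoke(call_closure::<F>, f as *mut F as *mut c_void, arg);
}

// Stand-in for a real C API taking (callback, user_data).
fn c_style_invoke(cb: extern "C" fn(*mut c_void, i32), data: *mut c_void, arg: i32) {
    cb(data, arg);
}

fn main() {
    let mut total = 0;
    let mut add = |x: i32| total += x;
    with_callback(&mut add, 5);
    with_callback(&mut add, 7);
    println!("{}", total); // 12
}
```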

Seems good.

I think this isn't an acceptable solution.

For example, encoding_rs has a function that takes u8x16 as an argument and returns a boolean. Conceptually, this shouldn't have to be an unsafe function from the perspective of the caller: the input is Copy and passed by value, and the output is Copy, so there shouldn't be an opportunity for exterior unsafety. Being able to factor small operations like this into separate (inline) functions is a good thing: it allows the domain-specific concepts (like "is this ASCII?") to be implemented in a per-ISA manner while allowing the overall caller algorithm to be ISA-agnostic and safe. It would be very sad if unsafe had to spill over to the caller algorithm and couldn't be hidden away inside these small functions implementing the domain-specific operations that may have ISA-specific implementations.
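The shape of that pattern, with a scalar stand-in for the per-ISA body (`is_ascii_block` is a hypothetical name, not the actual encoding_rs function, and a plain array stands in for u8x16):

```rust
// Safe, ISA-agnostic interface over what would be a per-ISA body.
// A real SSE2 implementation would test the sign bits of all 16 lanes at
// once (e.g. via _mm_movemask_epi8); this scalar loop is the portable path.
#[inline]
fn is_ascii_block(block: [u8; 16]) -> bool {
    block.iter().all(|&b| b < 0x80)
}

fn main() {
    // The caller stays safe and never mentions any ISA.
    assert!(is_ascii_block(*b"exactly 16 bytes"));
    assert!(!is_ascii_block([0xC3; 16]));
    println!("ok");
}
```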

It seems to me that if the signature of a function requires certain types of registers to exist for the ABI to work, then the compiler must require the callers to be compiled in a mode that's aware of those registers existing. This will make the "can this function call this other function" question more complicated than a binary safe/unsafe, but I think a complication like "the callee's ABI requires ymm registers, so the caller has to be compiled with ymm register awareness enabled" is better than expanding unsafe so far that everything in the proximity of SIMD becomes unsafe and the benefit of having safe code gets lost.

I think we're in violent agreement. I do want to at least be able to enumerate the possible solutions though. unsafe seemed like one of them, but I actually think it literally doesn't work because of generics. (This is completely aside from the other points you've raised, which I agree with.)

To be clear, I think this is the same line of thinking that @eddyb and @stoklund have, yes?

OK. I see.

I'm not sure if @eddyb's recent comments mean the same thing as mine. I failed to locate @stoklund's take on this, mainly because Discourse doesn't make the whole discussion available in the browser DOM at once and Discourse's own search function is case-insensitive only.

Whoops. I didn’t mean @stoklund. I meant @jneem. This one: Getting explicit SIMD on stable Rust

I think that makes sense.

Maybe it was already meant, but for clarity: If the whole crate is being compiled with (e.g.) AVX enabled, then I think it shouldn't be necessary to annotate the individual functions. I.e. what should count is what target features a function is being compiled with, not the specific source of those features.

I’ve been doing some experiments on auto-generating rust code from clang’s headers. Some figures:

  • after ignoring AVX512, KNC, and SVML, Intel defines 1184 intrinsics
  • of those, clang supports 1129. The intrinsics that clang doesn’t support are mostly covered by language features anyway (e.g. _bittest).
  • of the intrinsics that clang supports, 892 are defined as functions and 237 are defined as macros
  • of the 892 functions, 504 are completely trivial to translate (i.e. the function just looks like __m128i foo(int a) { return __some_llvm_intrinsic(a); })
  • of the 237 macros, 70 are just a trivial renaming

Overall, about half the intrinsics are completely trivial to translate automatically to Rust. In the remaining half, there is also some low-hanging fruit: lots of functions are defined using LLVM’s built-in operators (+, -, &, etc.), and many more are just a single invocation of __builtin_shufflevector.

Would it be useful for me to continue this attempt at automatic translation? Does it overlap with something that is already being done?

I am working on doing it by hand, and including at least one test for each intrinsic. I’m almost done with SSE2. The implementation isn’t the only thing that needs to be translated; you also need to work out the types, and those are not always straightforward to determine. You can invent some rules that work for a lot of cases based on the suffixes (e.g., epi8/epi16/epi32/epi64), but they don’t universally apply. For example, _mm_packs_epi32 takes two i32x4 vectors and returns an i16x8 vector. The other bit to do here is documentation. The Intel interface includes a description for each intrinsic, but they are not universally good. Finally, some uses of __builtin_shufflevector are pretty tricky to port, since the arguments must be constants.
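A scalar model of _mm_packs_epi32's semantics shows why the epi32 suffix alone doesn't determine the types: it packs two i32x4 inputs into one i16x8 with signed saturation (the helper name here is made up for illustration; arrays stand in for the vector types):

```rust
// Scalar model of _mm_packs_epi32: pack two i32x4 vectors into one i16x8,
// saturating each lane to the i16 range.
fn packs_epi32(a: [i32; 4], b: [i32; 4]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for (i, &x) in a.iter().chain(b.iter()).enumerate() {
        // Lanes outside [-32768, 32767] clamp to the nearest bound.
        out[i] = x.clamp(i16::MIN as i32, i16::MAX as i32) as i16;
    }
    out
}

fn main() {
    println!("{:?}", packs_epi32([70_000, -70_000, 1, -1], [0, 0, 0, 0]));
}
```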

My feeling is that there is enough human attention required in this process that it may not be completely wise to automate it. I am at least willing to do all of SSE and I hope to do AVX/AVX2 as well. However, I’d like to write the RFC before completely finishing it.

Fair enough. I can do some of them, if you tell me which ones.

Good work, I think this is a first step I can agree with.

Is there a fundamental reason that you omitted 512bit wide vector types in the proposal?

Is there a real demand for AVX512 today? Is it available in shipping hardware beyond Xeon Phi?

Yes, there is demand, hence why Clang and GCC already support AVX-512. There are a bunch of 1st gen Xeon Phi (Knights Corner) systems available, the Top500 list currently contains >10 systems with 2nd gen Xeon Phis (both Knights Landing and Knights Landing F), and 3rd gen Xeon Phi systems (Knights Hill, like Argonne's Aurora) will be rolled into production in 2018, with tuning workshops using prototype boards starting as early as this year (2016).

These systems perform really badly if AVX-512 is not used, and the alternative ways of using AVX-512 (like OpenMP 4) are not available in Rust. So I think it is important to roll out AVX-512 low-level intrinsics from the start.

It is also a good stress test for the RFC, since the "Intel weirdness" with the vector types only gets worse, and AVX-512 also adds AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF, ...

Fused multiply-add (mul_add) is available on many platforms, but not universally, and it is a very expensive operation to emulate. f64x2::mul_add() requires a 104-bit wide multiplication which you really, really want implemented in hardware. Saturating arithmetic is universally available for 8-bit and 16-bit lanes, but not for i32x4. It turns out that nobody actually needs saturating arithmetic larger than 16 bits. I left it out to avoid the asymmetry.
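The single-rounding behaviour that makes fused multiply-add so expensive to emulate can be seen even with scalar f64 (std's mul_add is documented to round only once, falling back to a software fma where hardware lacks it):

```rust
fn main() {
    // x*x is exactly 1 + 2^-26 + 2^-54; the 2^-54 term is lost when the
    // product is rounded to f64 before the subtraction happens.
    let x = 1.0 + (2.0_f64).powi(-27);
    let y = -(1.0 + (2.0_f64).powi(-26));

    let separate = x * x + y;    // product rounded first: the low bits vanish
    let fused = x.mul_add(x, y); // single rounding: the 2^-54 term survives

    assert_eq!(separate, 0.0);
    assert!(fused > 0.0);
    println!("{:e} vs {:e}", separate, fused);
}
```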

Leaving things out, avoiding asymmetries, providing a nice interface ... is the job of a high-level SIMD interface.

The whole point of a low-level SIMD interface is to provide a direct map to the hardware, to allow the user to do anything the hardware can do. That is, by definition, to avoid leaving anything out. Hardware instruction sets are complicated, inconsistent, asymmetric, incompatible, and weird. We can try to make the low-level intrinsics as safe and nice to use as possible, but we should not try to make them a high-level SIMD API. Implementing traits like Add is, for me, on the boundary with a high-level API, and I don't know how I feel about that. I would be fine with a low-level SIMD API that only provides intrinsic functions, without niceties like Add.