Getting explicit SIMD on stable Rust

Well, part of my point is that a vector-length-agnostic API that monomorphizes+dispatches for each supported packed SIMD ISA under the hood is viable today, and also frees the programmer of the burden (on "iterated instructions") of selecting a specific SIMD extension, or dispatching manually.

Indeed, that's one of the absolute top goals of both SVE and the V extension. The problem is...

LLVM's optimizer just isn't there yet. It is only beginning to grow the kind of polyhedral optimizations that let GCC do exactly this. In the meantime, the single best thing we can do to ensure it does vectorize code is to force our vectorizable code into easily-recognized idioms.

That, then, is exactly what my proposal is meant to do: force "iterated instructions" into trivial, one-pass, easily-fused, contiguous loops. Under the hood, we can use LLVM intrinsics (in order to avoid performance cliffs in debug mode), or we can rely on the optimizer (and have simpler implementations), or other options.
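For instance, a minimal illustrative sketch (mine, not from the proposal) of the loop shape meant here:

// The shape of loop the proposal forces "iterated instructions" into:
// a single pass over contiguous data with a straight-line body, which
// the loop vectorizer reliably recognizes.
fn add_assign(dst: &mut [f32], src: &[f32]) {
    assert_eq!(dst.len(), src.len()); // lets bounds checks be hoisted
    for i in 0..dst.len() {
        dst[i] += src[i];
    }
}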

I don’t think we need to block SIMD intrinsic stabilization on supporting arbitrary-length vectors. In particular, if we’re going with the Simd<[T; N]> abstraction in the future, I could see a very realizable common abstraction in the form of Vector<[T]>.
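Very roughly, and purely as a hypothetical sketch of how the two shapes could relate:

// Hypothetical: one wrapper could cover both shapes, with the unsized
// case deferring vector length to runtime.
struct Vector<A: ?Sized>(A);
type F32x4 = Vector<[f32; 4]>; // the fixed-width Simd<[T; N]> shape
type F32s = Vector<[f32]>;     // the length-agnostic Vector<[T]> shape: a DST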

1 Like

You're right. I used a mental shortcut and forgot about the signedness conversions. I still think they're surprising (From impls for i32x4→u32x4 and i32→f32 but no i32→u32 nor i32x4→f32x4), but they won't cause any ambiguity.

Neither is the hardware. Nor even a public spec.

If I understand you correctly, your proposal is that we design and stabilize a Rust language feature in order to work around a temporary deficiency in LLVM's optimizer. A deficiency that will probably be fixed before the relevant hardware is available.

I don't think we should do that.

1 Like

This discussion of variable-width vectors and compilation strategies is seriously off topic. This topic is about a low-level, architecture-specific SIMD interface. Unless you think a high-level API can (a) always generate perfect vector code on existing architectures, that a human can’t improve on, and (b) provide access to every single weird Intel/ARM/etc. vector instruction that might be necessary for optimal performance, any such API is essentially orthogonal to the set of low-level near-assembly primitives that are desired here. (Actually, not orthogonal, dependent: stabilizing low-level primitives allows experimentation with variant designs of high-level APIs in third-party crates before maybe moving to std someday, which is part of the motivation in the first place.) The only exception is that the set of fixed-width vector types and basic high-level operations on them being proposed could conflict with a future variable-width approach, but this seems minimal: the fixed-width types are necessary no matter what, to properly type the architecture-specific intrinsics, and adding a handful of impls on them really doesn’t hurt.

6 Likes

I concur with @comex. I don’t think we should try to design around RISC-V and ARM stuff that isn’t shipping yet in mainstream hardware. Especially since RISC-V might well fail (I love the idea of it too, but it does no good to pretend there isn’t significant risk).

The road to hell is paved with attempts to be future-compatible with standards that haven’t shipped yet and whose usage is not well-understood. This is how JavaScript and Java got their obnoxious two-byte wide strings, for example; this new shiny Unicode thing that was 16-bit was right around the corner and they felt they had to embrace it…

5 Likes

Hmm, I was arguing for From impls between every pair of integer vector types, but no From impls between float/float or float/integer. So we'd get: i8x16<->u8x16, i8x16<->u16x8, i8x16<->i16x8, u8x16<->u16x8, u8x16<->i16x8, and so on.

@burntsushi I was talking about the same set of impls, but I’ve made a typo, sorry for that. Fixed the previous post, so it’s not self-contradicting now. I wanted to write that From impls for i32x4→u32x4 and i32→f32 but no i32→u32 nor i32x4→f32x4 would be surprising, but at least not ambiguous (and I think we do agree on that).


I’ve just noticed that I erroneously thought that there is an impl From<i32> for f32. It weakens my argument a little, but I still think that implementing From as transmutes is surprising (note that std doesn’t do it for signedness conversion).
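For concreteness, the debated impls look something like this (hand-sketched with nightly #[repr(simd)] stand-ins, not the actual proposal text):

#[repr(simd)]
#[derive(Clone, Copy, Debug)]
#[allow(non_camel_case_types)]
struct i32x4(i32, i32, i32, i32);

#[repr(simd)]
#[derive(Clone, Copy, Debug)]
#[allow(non_camel_case_types)]
struct u32x4(u32, u32, u32, u32);

// The "transmute-style" conversion under discussion: bits are
// preserved, values are reinterpreted, so a lane holding -1i32 comes
// out as u32::MAX.
impl From<i32x4> for u32x4 {
    fn from(v: i32x4) -> u32x4 {
        unsafe { ::std::mem::transmute(v) }
    }
}

The surprise is that From, which elsewhere in std implies a lossless value conversion, would here be a bit-reinterpretation.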

+1. .foo(…) should mean "semantically as-if" into array => map foo => into simd. I'm glad to see the proposed simd types not implementing Add, even in step 3. Might be worth adding map directly on the simd types, in step 3, to give an ergonomic way to do things for which there aren't vector instructions when needed. (Seeing the closure should make it plenty obvious that this isn't guaranteed to be a single SIMD instruction.)
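Concretely, that might look like the following sketch (the stand-in type mirrors the proposal's f32x4):

// Stand-in for the proposal's f32x4; the real type would be #[repr(simd)].
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug)]
struct f32x4(f32, f32, f32, f32);

impl f32x4 {
    // Semantically "into array => map f => into simd". The explicit
    // closure makes it obvious this need not be one vector instruction.
    fn map<F: Fn(f32) -> f32>(self, f: F) -> f32x4 {
        f32x4(f(self.0), f(self.1), f(self.2), f(self.3))
    }
}

So v.map(|x| x.tan()), say, gives a lane-wise escape hatch for operations that have no dedicated instruction.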

I agree that wrapping all of the llvm intrinsics into pub fn in std would be horrible. Lots of work, unstable, etc.

But I do believe that the intel intrinsics (step 2) should be a crate with code sort of like this:

#[link(name = "llvm", kind = "intrinsics")]
extern {
    #[link_name="llvm.fmuladd.v4f32"]
    fn llvm_fmuladd_v4f32(a: v4f32, b: v4f32, c: v4f32) -> v4f32;
}
pub fn _mm_fmadd_ps(a: v4f32, b: v4f32, c: v4f32) -> v4f32 {
    unsafe { llvm_fmuladd_v4f32(a, b, c) }
}

Then different people could make different choices about how to pick the type signatures, for example. Most of step 3 (the parts not constrained by coherence) can be experimented on outside of std. And people don't need to use nightly just for the one weird, possibly-custom LLVM intrinsic that isn't wrapped yet.

The stability of #[link(name = "foo")] obviously depends on the semver of foo, so I don't find it unreasonable to call this "stable", so long as cargo can understand the dependency. And having the [link] means the intrinsics aren't linked "sneakily".

How does cargo support people using different versions of rust? Can a crate say "I use the ? operator, so I need rust >= 1.13"? Can a crate say "I'm still depending on a broken inference rule, so you'd better not use me after 1.n"?

Maybe linking intrinsics can be stable in the same sense as normal linking: can break if the [dependency] doesn't prevent it. So essentially require the intel_intrinsics crate to have something like

[compiler-dependencies]
llvm = "~3.8"

(And I assert that updating an intel_intrinsics crate for new versions of LLVM isn't harder than updating std for them. Might even be easier, if it means that a new LLVM version can go into rustc without needing to update potentially hundreds of pub methods in the library.)

@scottmcm

I don’t see any future in which we provide a way for folks to link to LLVM intrinsics directly on stable Rust.

I don’t understand why you aren’t on board with defining and exposing vendor APIs. My view is that the prevailing opinion of everyone here is that we should, at a minimum, export vendor intrinsic APIs in std. Can you explain your disagreement in more depth?

I suggest we table the From impl discussion for now. I continue to think all integer vectors should be convertible to one another via From. Either way, we can hash that out in the RFC.

1 Like

I agree with the conclusion, but I disagree with the reasoning. ARM SVE is a low-level, architecture-specific interface. It is not a high-level API. It is near-assembly primitives.

I propose ignoring ARM SVE because ARM SVE hardware isn't shipping, not because it is a high-level interface.

1 Like

Let me start with this: I believe that vendor SIMD APIs should absolutely be defined and exposed for use in stable Rust. Sorry that I was unclear about that. I think stoklund's proposal is great. I expect yours will be as well.

I guess my mental block here is "what does std mean?" When I go look it up in the docs I find,

These modules are the bedrock upon which all of Rust is forged, and they have mighty names like std::slice and std::cmp

(I'm probably also biased based on "what would go in std in C++?", but I figure the similarity in names is not unintentional.)

I don't think 4 different names for "add four f32s to four corresponding f32s"—none of which is the usual and mighty add—is bedrock. I don't think that std::simd::intel_intrinsics::_mm512_mask_add_round_ps is bedrock. (Picked on only for having a long function name; I didn't even look at what it does.)

Suppose Rust takes off so much that there are 5 competing implementations that decide to write an official specification for Rust. Would I include some sort of SIMD type in the spec? Absolutely. Would I include the intel or arm or mips intrinsics in the spec? No, I wouldn't.

If I'm a vendor and want to release a new, clearly-better vendor API, what do I do? Ideally the answer would be "release a crate that everyone would be able to use immediately", not "go through the Rust commit process so people can use it when they upgrade their compiler 12+ weeks after that". (And I acknowledge the realities that make this point irrelevant to a practical solution to "I want to use Intel's MMX+SSE+SSE2 intrinsics in rust 1.15", but do believe in it as a laudable goal. Hardware vendors being deeply entrenched with LLVM is symbiotic, but they shouldn't need to be entrenched in frontends for general-purpose languages.)

I think there are some choices here where neither is legitimately better than the other. For example, the "Should _mm_add_epi16 take an __m128i (to match C and allow for easier mixing with other lane widths) or an i16x8 (to match its semantics and allow for easier mixing with the rest of rust's type system)?" debate. Both are fine options, and while I'll argue for one, I won't say it's unambiguously better. Standardizing both would be a little weird, and having a crate for one by wrapping the standardized other invites "why did they standardize the wrong one" grumbles that wouldn't happen with both being crates where you pick the one that best matches your weighting of trade-offs. And cargo means that "there's a crate for that" is a great answer (unlike the equivalent in many other languages).
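For concreteness, here are the two candidate signatures side by side (a sketch only; the stub types and bodies are my placeholders):

// Sketch only: the same intrinsic under the two debated signatures.
#[allow(non_camel_case_types)]
pub struct __m128i(pub [u8; 16]); // untyped 128 bits, as in C

#[allow(non_camel_case_types)]
pub struct i16x8(pub [i16; 8]); // lane-typed

pub mod c_style {
    // Option A: match the C API; every lane width shares one type.
    #[allow(unused_variables)]
    pub fn _mm_add_epi16(a: super::__m128i, b: super::__m128i) -> super::__m128i {
        unimplemented!() // would defer to the real intrinsic
    }
}

pub mod lane_typed {
    // Option B: match the instruction's lane semantics.
    #[allow(unused_variables)]
    pub fn _mm_add_epi16(a: super::i16x8, b: super::i16x8) -> super::i16x8 {
        unimplemented!() // would defer to the real intrinsic
    }
}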

Maybe all I'm wishing is that "Rust, the currently-most-popular distribution of Rust the language" comes with some special sauce that's explicitly not part of "Rust the language" when that sauce needs to be tightly integrated for practical reasons.


Postscript, feel free to skip:

As I was writing this, I stumbled on C++ committee paper n3759: SIMD Vector Types. Its details aren't relevant to this thread, but conceptually it's exactly what I'd expect from a first step for standardized SIMD:

  • Add opaque simd types for each fundamental lane type
  • Explicitly say that typical use will often "depend on implementation-specific language extensions (e.g. SIMD intrinsics/builtins)"
  • Include overloads for the operators included in the language
  • Punt on standard support for masks, gather/scatter, swizzles, interleaving, f16, iterators, etc

Surprisingly similar to stoklund's proposal, if you treat step 2 as a "vendor extension".

(If you do look at the details, it has an interesting approach to vector width: they're whatever width is good for the (compile-time) target, not something you pick. It's an interesting thought, especially after the ARM SVE discussion. Makes me glad that Rust is so good at DSTs.)

1 Like

SIMD intrinsics should definitely be part of any spec. Since they’re tied to the architecture, not the compiler, there is no reason why every competing implementation shouldn’t support the same intrinsics. Exposing LLVM internals just penalizes any such implementations.

1 Like

(ARM SVE itself can have a low-level interface, and should once it ships, but the interface @eternaleye was proposing is generic and high-level.)

1 Like

There should absolutely be a spec for the SIMD intrinsics. But as you say, they’re tied to the architecture, not the compiler, so the spec should be written by the architecture vendor, not the language designers. If you want the spec for the Intel intrinsic instructions C API, you don’t ask ISO/IEC JTC1/SC22/WG14, you download https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.3.15.xml (And yes, we need to define an “intel” intrinsic instructions Rust API for now because Intel won’t. But if Rust in the future ends up with as many compilers and users as C currently does, I bet they would then.)

I’m not convinced that allowing a user who knows they’re targeting LLVM to write #[link("llvm.intrinsics")] “penalizes” an implementation any more than allowing a user who knows they’re targeting Windows to write #[link("Advapi32")]. If I’m writing a compiler that targets Javascript, I’m probably not going to support either of those #[link]s. (That said, I don’t really care whether people can link llvm intrinsics—that was never a high-level goal of mine—so am totally fine with it remaining a nightly-only thing or becoming impossible.)

A small correction - N3759 is old/superseded and may not reflect the current position of the committee's concurrency study group. The most recent version of the document is P0214R2 Data-Parallel Vector Types & Operations; it's mostly specification, but it refers to the most recent versions of the design papers as well - N4184 SIMD Types: The Vector Type & Operations, N4185 SIMD Types: The Mask Type & Write-Masking, and SIMD Types: ABI Considerations.

Some more links. Official specifications for ARM, recently updated to ARMv8.1: IHI0053D ARM® C Language Extensions 2.1 contains a high-level description, including data types, and detailed descriptions of both SIMD and non-SIMD intrinsics. IHI0073B ARM® NEON™ Intrinsics Reference is a list of the intrinsics corresponding specifically to Advanced SIMD (aka NEON) instructions. ARM SVE is supposed to be part of ARMv8.3 and its specification is not public yet; the only official doc seems to be [DUI0965C ARM® Compiler Version 6.6 Scalable Vector Extension User Guide](http://infocenter.arm.com/help/topic/com.arm.doc.dui0965c/DUI0965C_scalable_vector_extension_guide.pdf).

@scottmcm Whether the intrinsics wind up in std or in a different crate, their implementation is compiler-specific. A GCC-backed rust compiler would deal with (say) shuffle intrinsics differently than the LLVM-backed reference compiler, and how that one defines an intrinsic might also change when it grows a Cretonne backend. So even if we decide not to expose vendor intrinsics from std, they would still have to live in a crate that ships with the compiler, is tied to the compiler’s internals, and follows the same general stability story. That is certainly a possibility you can argue for.

However:

  • std is, despite the documentation you quote, probably not a perfect paragon of Stuff That Will Be Standardized™. It’s currently the sole interface between the compiler and the stable-using programmer (save for core which is a strict subset of std). So if current Rust already has any API that’s specific to rustc rather than Rust-the-platonic-language, it’ll already be in std.
  • Adding a whole separate crate to the facade is a big step. Maybe justified, since SIMD intrinsics are a big deal, but…
  • Having a stable “standard distribution” crate that isn’t part of the std facade is unprecedented, and in fact goes explicitly against the policy of the std facade.

All right folks, strap yourselves in, because I think it’s going to get a little bumpy!

I think that, for now, we’ve mostly settled on the least offensive way to expose (in std) a set of low level vendor intrinsics and a very tiny cross platform API based on defining some platform independent SIMD types. The low level vendor intrinsics are not exposed as compiler intrinsics. Instead, we define normal Rust function stubs implemented by LLVM intrinsics and export those. There are undoubtedly details on which we’ll disagree, but I think we can punt on those until the RFC. (I do intend to write a pre-RFC before that point though.)

So… time to mush on to the next problem: dealing with functions whose type signatures permit SIMD types to either be passed as parameters or returned. I’d like to describe this problem from first principles so that we can get other folks participating in this thread without having to read all of it.

A key part of the SIMD proposal thus far is a new attribute called #[target_feature]. It maps precisely to the __attribute__((target("..."))) annotation found in both gcc and Clang. An initial form of #[target_feature] was recently merged into Rust proper. In short, #[target_feature] is an attribute applied to a function that specifies target options for that function that may differ from the ones specified on the command line (e.g., with rustc -C target-feature=...). This means that one can, for example, call AVX2 intrinsics like _mm256_slli_epi64 in a function labeled with #[target_feature = "+avx2"] without having to use rustc -C target-feature=+avx2.

There are two really really important problems that this solves:

  1. This permits Rust programmers to do runtime detection with CPUID to determine which subtarget features are supported by the host running the program. By comparison, achieving something similar with a cfg!-oriented system would require producing a binary for every interesting combination of target features, and then requiring the end user to download the right one. With #[target_feature], we can tell LLVM to enable a specific target feature for codegen of a specific function. If the caller calls that function on a platform without that support, then they could wind up executing a CPU instruction that doesn't exist (e.g., on Linux, a SIGILL is raised). A sketch of the resulting dispatch pattern follows below.
  2. This permits the maintainers of the Rust distribution to distribute a single std for a particular target triple. By a similar line of reasoning as (1), if std used a cfg!-oriented system, then you'd need to re-compile std for any combination of target features you'd want to enable. With #[target_feature], we can ship everything in one compiled version of std, and it will be the Rust programmer's responsibility to ensure such functions are only called on platforms that support them. This opens the door to common pitfalls (like calling _mm256_slli_epi64 on a host without AVX2 support), but, by the same token, these vendor intrinsics are intended to be a very low level interface.

If we decided not to go through with #[target_feature], we'd be missing out on (1) completely, and I don't think there's any way around it. We'd also have to completely rethink how we stabilize these intrinsics, since a cfg!-oriented system in std doesn't quite work, as explained in (2).
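To illustrate (1), the dispatch pattern that #[target_feature] enables looks roughly like this, using this proposal's syntax (has_avx2 is my stand-in for a real CPUID query, not an actual API):

#[target_feature = "+avx2"]
fn sum_avx2(xs: &[i64]) -> i64 {
    // The real body would use AVX2 intrinsics; a scalar body stands in
    // so the sketch is self-contained.
    xs.iter().sum()
}

fn sum_scalar(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

// Stand-in for a real CPUID query (deliberately conservative here).
fn has_avx2() -> bool {
    false
}

fn sum(xs: &[i64]) -> i64 {
    if has_avx2() {
        // Reached only after verifying the host actually supports AVX2,
        // so the feature-gated codegen can't trip a SIGILL.
        sum_avx2(xs)
    } else {
        sum_scalar(xs)
    }
}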

Let’s make this concrete with an example. This is how one might define the _mm256_slli_epi64 AVX2 intrinsic:

#[inline(always)]
#[target_feature = "+avx2"]
fn _mm256_slli_epi64(a: i64x4, imm8: i32) -> i64x4 {
    // I think this extern block may be offensive.
    // Presuppose some other way to access the
    // correct LLVM intrinsic from *inside std*. :-)
    #[allow(improper_ctypes)]
    extern {
        #[link_name = "llvm.x86.avx2.pslli.q"]
        fn pslliq(a: i64x4, imm8: i32) -> i64x4;
    }
    unsafe { pslliq(a, imm8) }
}

And an example use inside of a larger program:

#![feature(link_llvm_intrinsics, repr_simd, simd_ffi, target_feature)]

#[repr(simd)]
#[derive(Debug)]
#[allow(non_camel_case_types)]
struct i64x4(i64, i64, i64, i64);

#[inline(always)]
#[target_feature = "+avx2"]
fn _mm256_slli_epi64(a: i64x4, imm8: i32) -> i64x4 {
    #[allow(improper_ctypes)]
    extern {
        #[link_name = "llvm.x86.avx2.pslli.q"]
        fn pslliq(a: i64x4, imm8: i32) -> i64x4;
    }
    unsafe { pslliq(a, imm8) }
}

#[inline(always)]
#[target_feature = "+avx2"]
fn testfoo(x: i64, y: i64, shift: i32) -> i64 {
    let a = i64x4(x, x, y, y);
    _mm256_slli_epi64(a, shift).0
}

#[target_feature = "+avx2"]
fn main() {
    let x: i64 = ::std::env::args().nth(1).unwrap().parse().unwrap();
    let y: i64 = ::std::env::args().nth(2).unwrap().parse().unwrap();
    let shift: i32 = ::std::env::args().nth(3).unwrap().parse().unwrap();
    println!("{:?}", testfoo(x, y, shift));
}

Compile/run with (remember this uses #[target_feature] which I don’t think has hit nightly yet):

$ rustc -O test1.rs  # no -C target-feature required
$ ./test1 15 5 4
240

The generated ASM is correct. Namely, I see a vpsllq instruction emitted and everything appears to get inlined. We’ve made it to the promised land! Errmm, not quite…

There are some really interesting failure modes here. First up, removing #[target_feature = "+avx2"] on testfoo is A-OK. This makes sense, I think, because the caller is asserting that they know the platform has AVX2 support. Indeed, compiling and running the program produces the expected output. However, keeping the #[target_feature = "+avx2"] removed from testfoo and also changing inline(always) to inline(never) produces an LLVM codegen error:

#[inline(never)]
fn testfoo(x: i64, y: i64, shift: i32) -> i64 {
    let a = i64x4(x, x, y, y);
    _mm256_slli_epi64(a, shift).0
}

Compiling yields:

$ rustc -O test1.rs
LLVM ERROR: Do not know how to split the result of this operator!

AFAIK, this cannot be part of stable SIMD on Rust. We must prevent all LLVM codegen errors. How? What exactly is going on here?

OK, let’s move on to another interesting problem. Instead of testfoo calling a SIMD intrinsic internally, it will try to return a SIMD vector. Here is the full program. Notice the missing inline annotation on testfoo:

#![feature(link_llvm_intrinsics, repr_simd, simd_ffi, target_feature)]

#[repr(simd)]
#[derive(Debug)]
#[allow(non_camel_case_types)]
struct i64x4(i64, i64, i64, i64);

#[inline(always)]
#[target_feature = "+avx2"]
fn _mm256_slli_epi64(a: i64x4, imm8: i32) -> i64x4 {
    #[allow(improper_ctypes)]
    extern {
        #[link_name = "llvm.x86.avx2.pslli.q"]
        fn pslliq(a: i64x4, imm8: i32) -> i64x4;
    }
    unsafe { pslliq(a, imm8) }
}

#[target_feature = "+avx2"]
fn testfoo(x: i64, y: i64, shift: i32) -> i64x4 {
    let a = i64x4(x, x, y, y);
    _mm256_slli_epi64(a, shift)
}

fn main() {
    let x: i64 = ::std::env::args().nth(1).unwrap().parse().unwrap();
    let y: i64 = ::std::env::args().nth(2).unwrap().parse().unwrap();
    let shift: i32 = ::std::env::args().nth(3).unwrap().parse().unwrap();
    println!("{:?}", testfoo(x, y, shift));
}

Compiling is successful, but running the program leads to interesting results:

$ rustc -O test.rs
$ ./test 15 5 4
i64x4(240, 240, 4, 0)

The correct output, AIUI, should be:

$ ./test 15 5 4
i64x4(240, 240, 80, 80)

Interestingly, if we apply an inline(always) annotation to testfoo, then we get an LLVM error again:

$ rustc -O test.rs 
LLVM ERROR: Do not know how to split the result of this operator!

If I force the issue and instruct rustc to enable avx2 explicitly, then things are golden:

$ rustc -C target-feature=+avx2 -O test.rs 
$ ./test 15 5 4
i64x4(240, 240, 80, 80)

The above works regardless of the inline annotations used on testfoo.

I am pretty flummoxed by the above. I don’t actually know what’s going on at the lowest level here, although I can make a high level guess that applying #[target_feature = "..."] to a function changes something about its ABI, and that this can interact weirdly with SIMD vectors. Since I don’t quite understand the problem, it’s even harder for me to understand the solution space.


@alexcrichton and I talked about the above briefly yesterday, and we brainstormed a few things. The solutions we were tossing out were things along the lines of “always passing SIMD vectors by-ref to LLVM” or “banning SIMD vectors from appearing in the type signature of a safe function” or something similar. These all have additional complications, such as what to do in the presence of generics or when defining type signatures for other ABIs like extern "C" ....
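For illustration, the by-ref idea would reshape the earlier example along these lines (a sketch, reusing the i64x4 and _mm256_slli_epi64 definitions above):

#[target_feature = "+avx2"]
fn testfoo(out: &mut i64x4, x: i64, y: i64, shift: i32) {
    // The vector now crosses the function boundary only behind a
    // pointer, so caller and callee can't disagree about whether it
    // lives in a ymm register.
    let a = i64x4(x, x, y, y);
    *out = _mm256_slli_epi64(a, shift);
}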


Overall, this seems pretty hairy. I briefly tried to play with something analogous in C and compared the behaviors of gcc and Clang. It seems like they suffer from similarish problems, although gcc appears to warn you. For example, consider this C program:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

__attribute__((target("avx2")))
__m256i test_avx2() {
    __m256i a = _mm256_set_epi64x(1, 2, 3, 4);
    __m256i b = _mm256_set_epi64x(5, 6, 7, 8);
    return _mm256_add_epi64(a, b);
}

int main() {
    __m256i result = test_avx2();
    int64_t *x = (int64_t *) &result;
    printf("%ld %ld %ld %ld\n", x[0], x[1], x[2], x[3]);
    return 0;
}

Compiled/run with Clang:

$ clang test.c -O
$ ./a.out 
12 10 3399988123389603631 3399988123389603631

And now with gcc:

$ gcc test.c -O
test.c: In function ‘main’:
test.c:13:13: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     __m256i result = test_avx2();
             ^~~~~~
$ ./a.out 
140724298882158 140120498842949 1 4195757
$ ./a.out 
1 4195757 0 0
$ ./a.out 
140736787806926 139757628374341 1 4195757

Now, of course, this isn’t to say that I expected this C program to work. Rather, I’m showing this as a comparison point to demonstrate different failure modes of other compilers. In Rust, we have to answer questions like whether an analogous Rust program should be rejected by the compiler and/or whether the behavior in the above C program exhibits memory unsafety.


Can anyone help me untangle this?

cc @alexcrichton, @nikomatsakis, @withoutboats, @aatch, @aturon, @eddyb (please CC other relevant folks that I’ve missed :-))

4 Likes

Wasn’t the solution for the ABI problem to turn the necessary features on (i.e. resulting in usage of ymm registers) and let the code SIGILL on misuse instead of causing UB?