I’m very happy that this topic has been picked up. I’m eager to ship explicit SIMD-using Rust code in Firefox sooner rather than later. Currently, SSE2-accelerated encoding_rs can be over four times as fast as encoding_rs with ALU word-based acceleration on x86_64. It would be sad not to be able to ship that kind of improvement.
While I’d like to have runtime detection in the standard library, as well as the capability of compiling the bulk of the code with some instruction set extension level and then enabling more instruction set extensions for particular pieces of code (called only after a runtime check), I think it is far more important to enable repr_simd and platform_intrinsics on the release channel than to enable runtime detection. That’s especially because SSE2 is useful without runtime detection, since it’s unconditionally part of the x86_64 baseline (as well as of what Mozilla-shipped rustc and (soon) Firefox treat as the x86 baseline). (cfg_target_feature is a simple enough feature that it would be silly not to allow it as a companion of repr_simd and platform_intrinsics on the release channel once those are allowed.)
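As a minimal sketch of why SSE2 needs no runtime check (assuming a nightly with the cfg_target_feature gate; `describe_path` is a hypothetical function invented for this example):

```rust
// Compile-time dispatch with cfg_target_feature: on x86_64, SSE2 is in
// the target baseline, so the first branch is selected with no runtime
// check at all.
#![feature(cfg_target_feature)]

#[cfg(target_feature = "sse2")]
fn describe_path() -> &'static str {
    "SSE2-accelerated path (unconditional on x86_64)"
}

#[cfg(not(target_feature = "sse2"))]
fn describe_path() -> &'static str {
    "ALU word-based fallback path"
}

fn main() {
    println!("{}", describe_path());
}
```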
Naming is just a bikeshed. I think naming is a bad reason not to allow this stuff on the release channel. If a non-LLVM back end comes later, mere naming is a matter of a mapping table. I think we should simply use the naming that the platform_intrinsics feature has already adopted. It’s close enough to ISA vendor naming but a) takes into account the existence of multiple ISAs and b) doesn’t have un-Rust-like underscores.
The bigger issue is that LLVM doesn’t have ISA-specific intrinsics for everything that ISA vendors define as intrinsics: some things that ISA vendors specify as ISA-specific intrinsics don’t exist in LLVM and are implemented by clang using more generalized LLVM features. It’s kind of sad that this kind of generalization is what seems to be considered eventually desirable for Rust but is, for the time being, treated as a problem because the generalization level is defined by LLVM rather than by rustc.
I think having cross-ISA generalizations for lane getting and setting, the per-SIMD-lane versions of basic comparisons, arithmetic and bitwise operations, as well as aligned and unaligned loads and stores, is unquestionably a good thing. The vector shuffle stuff in LLVM could be seen as too magic: if you use it with just the right arguments, it compiles to a specific instruction, but you may get something less performant if you use arguments that aren’t wired up for magic treatment.
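For concreteness, here’s a sketch of what those cross-ISA generalizations look like today (assuming a nightly with the repr_simd and platform_intrinsics gates; the declarations are in the style of the simd crate, and the `u16x8` type is defined locally for the example):

```rust
#![feature(repr_simd, platform_intrinsics)]

#[repr(simd)]
#[derive(Clone, Copy)]
struct u16x8(u16, u16, u16, u16, u16, u16, u16, u16);

extern "platform-intrinsic" {
    // By-lane addition: one instruction per ISA (e.g. PADDW on SSE2).
    fn simd_add<T>(a: T, b: T) -> T;
    // Cross-ISA lane get/set.
    fn simd_extract<T, E>(v: T, idx: u32) -> E;
    fn simd_insert<T, E>(v: T, idx: u32, elem: E) -> T;
}

fn main() {
    let a = u16x8(1, 2, 3, 4, 5, 6, 7, 8);
    let b = u16x8(8, 7, 6, 5, 4, 3, 2, 1);
    let sum = unsafe { simd_add(a, b) };
    let lane0: u16 = unsafe { simd_extract(sum, 0) };
    assert_eq!(lane0, 9);
}
```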
I could live with access to LLVM’s SIMD features existing outside the stability story as long as it was available to Rust-in-Firefox; having it on the release channel of Rust would avoid the controversy of advocating nightly Rust for Firefox and the mismatches with Linux distros that would arise from that. However, I’d much prefer that API to look like the current platform_intrinsics feature rather than the link_llvm_intrinsics feature. (I’m thinking in particular of the difference between the SIMD shuffle signatures in the two.)
Still, I think the worry about LLVM dependency has been exaggerated, and I think my proposal at the end of this post would work under the normal Rust stability story.
I disagree with starting by stabilizing inline assembly as a solution for SIMD. I don’t want to write full algorithms in assembly. Instead, I want to be able to write the control structures in Rust and have inline functions that map to specific SIMD operations. I don’t want to deal with instruction scheduling manually; that’s what compilers are for. Also, when the specific operations are isolated into small inline functions, I don’t want those functions to be tied to specific registers. If I use a given operation twice in an algorithm, I want the compiler to be able to use different registers for the different instances of the operation, instead of having to move values in and out of a specific register to make the register available to the inline assembly multiple times.
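A sketch of the style argued for here, under the same nightly-gate assumptions as above: the operation lives in a small inline function, so when it’s used twice, the register allocator is free to pick different registers for each use, and the compiler handles instruction scheduling.

```rust
#![feature(repr_simd, platform_intrinsics)]

#[repr(simd)]
#[derive(Clone, Copy)]
struct i32x4(i32, i32, i32, i32);

extern "platform-intrinsic" {
    fn simd_add<T>(a: T, b: T) -> T;
}

// The specific SIMD operation isolated into a small inline function,
// not tied to any particular register.
#[inline(always)]
fn add(a: i32x4, b: i32x4) -> i32x4 {
    unsafe { simd_add(a, b) }
}

fn sum_four(a: i32x4, b: i32x4, c: i32x4, d: i32x4) -> i32x4 {
    // The same operation used three times; no values have to be moved
    // in and out of a fixed register as with naive inline assembly.
    add(add(a, b), add(c, d))
}

fn main() {
    let v = i32x4(1, 2, 3, 4);
    let _ = sum_four(v, v, v, v);
}
```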
Furthermore, the current intrinsic-based SIMD functionality is in such good shape that I think it doesn’t make sense to avoid enabling it on the release channel.
This is a good way to avoid any semblance of an LLVM dependency. However, considering that there seems to be a general consensus that we would eventually want cross-ISA SIMD generalizations where possible, it seems sad to avoid the generalizations that LLVM already has: they look so fundamentally reasonable that it doesn’t seem unreasonable to ask other back ends to implement the same, and LLVM already proves them possible. Specifically, when LLVM already provides cross-ISA generalizations for loads/stores, lane get/set, basic by-lane comparisons and basic by-lane arithmetic and bitwise ops, it would be sad to make Rust programmers write per-ISA code for these. This set of operations is pretty close to @stoklund’s WebAssembly set of operations. For these features to work with LLVM, though, LLVM needs to be aware of the types and numbers of the lanes, which would mean stabilizing lane-aware types instead of moving to lane-unaware 128-bit types.
I agree that platform_intrinsics should be allowed to cover non-SIMD (e.g. AES) ISA features: it would be a mistake to limit intrinsics that give access to ISA-specific features to SIMD, whether by policy or by mental barriers arising from naming.
I think that the compiler shouldn’t make automatic decisions on where to place an instruction set level divergence point. Instead, the programmer should have control over the placement of a run-time check.
–
Concretely, I propose the following:
- Allow repr_simd on the release channel. (Required for lane-aware stuff.)
- Allow cfg_target_feature on the release channel. (Trivially obvious.)
- Stipulate that dereferencing a pointer to a SIMD type shall generate an aligned load/store. (Already true with the LLVM back end. Seems reasonable to require of alternative back ends.)
- Add ptr::{read,write}_unaligned per existing RFC.
- Stipulate that ptr::{read,write}_unaligned applied to a pointer to a SIMD type shall generate unaligned load/store instructions. (Already true with the LLVM back end. Seems reasonable to require of alternative back ends.)
- Stipulate that transmute_copy applied to SIMD types of the same size reinterprets the SIMD register with a different lane configuration. (Already true with the LLVM back end. Seems reasonable to require of alternative back ends.)
- Allow all existing platform_intrinsics except the SIMD shuffles on the release channel. This includes the cross-ISA intrinsics for by-lane comparison, arithmetic & bitwise ops as well as lane get/set. These already work with the LLVM back end and are so fundamentally reasonable as cross-ISA generalizations that it seems reasonable to require non-LLVM back ends to implement the same. This also includes the ISA-specific intrinsics that are defined by the ISA vendor for operations other than loads/stores, lane get/set, by-lane comparison, by-lane arithmetic & bitwise ops, and shuffles. The ISA vendor-defined operations are fundamentally not LLVM-specific. (This point excludes stabilizing the platform intrinsics for SIMD shuffles.)
- Add a new platform intrinsic for each shuffle-esque instruction that the ISAs have but that isn’t covered by the above points. Use the same naming scheme as for the platform intrinsics that map to ISA vendor-defined operations. In the LLVM back end, map these to the appropriate SIMD shuffles. For example, introduce x86_mm_unpacklo_epi8 that maps to simd_shuffle16(a, b, [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]); (see the sketch after this list).
- Defer alternative code paths for different instruction set extension levels in a single binary to a different iteration.
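To make the pieces above concrete, here is a sketch of how the unaligned-load, transmute_copy and proposed shuffle points would fit together, under the same nightly-gate assumptions as before (ptr::read_unaligned assumed as added per the RFC mentioned above; the proposed x86_mm_unpacklo_epi8 is shown via the simd_shuffle16 call it would map to):

```rust
#![feature(repr_simd, platform_intrinsics)]

use std::mem;
use std::ptr;

#[repr(simd)]
#[derive(Clone, Copy)]
struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
             u8, u8, u8, u8, u8, u8, u8, u8);

#[repr(simd)]
#[derive(Clone, Copy)]
struct u16x8(u16, u16, u16, u16, u16, u16, u16, u16);

extern "platform-intrinsic" {
    fn simd_shuffle16<T, U>(a: T, b: T, idx: [u32; 16]) -> U;
}

// The proposed x86_mm_unpacklo_epi8, expressed through the shuffle it
// would map to in the LLVM back end (PUNPCKLBW on x86/x86_64).
#[inline(always)]
fn unpacklo_epi8(a: u8x16, b: u8x16) -> u8x16 {
    unsafe {
        simd_shuffle16(a, b,
                       [0, 16, 1, 17, 2, 18, 3, 19,
                        4, 20, 5, 21, 6, 22, 7, 23])
    }
}

fn main() {
    let bytes = [0u8; 17];
    // Per the stipulation above, this shall compile to an unaligned
    // load instruction (note the deliberately misaligned pointer).
    let v: u8x16 = unsafe {
        ptr::read_unaligned(bytes.as_ptr().offset(1) as *const u8x16)
    };
    // Same-size lane reinterpretation per the transmute_copy stipulation.
    let w: u16x8 = unsafe { mem::transmute_copy(&v) };
    let _ = (unpacklo_epi8(v, v), w);
}
```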
In summary: rubber-stamp and ship the existing nightly features except for the SIMD shuffles, which would involve a through-the-stack coupling with LLVM back end behaviors, and instead expose the shuffles under the vendor names munged to the platform_intrinsics naming convention. As a result, Rust programmers would see a cross-ISA language/API surface for loads/stores, lane get/set, by-lane basic comparisons, by-lane basic arithmetic and by-lane bitwise ops, and vendor intrinsics for the rest.