Getting explicit SIMD on stable Rust


#219

Feels like a great sweet spot. Small enough set to be uncontroversial, just enough traits that it’s possible to use non-horribly.

And I’m pleased to see that, even if you ignore intrinsics, the simple cases can be coded reasonably—no unsafe and fairly ergonomically—in such a way that they even produce LLVM vector instructions. For example, this gives %2 = fmul <4 x float> %0, %1 and then mulps %xmm1, %xmm0:

pub fn my_mul(a: i32x4, b: i32x4) -> i32x4 {
    let a: [i32;4] = a.into();
    let b: [i32;4] = b.into();
    [
        a[0] * b[0],
        a[1] * b[1],
        a[2] * b[2],
        a[3] * b[3],
    ].into()
}

I think it might need to be more nuanced than that. Something like __m128i _mm_adds_epu8(__m128i a, __m128i b) that “Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating arithmetic” should probably take u8x16s, not i8x16s.
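
For reference, here’s a scalar sketch of what that operation computes (not a proposed binding, just the semantics), which is why the unsigned type seems like the natural fit:

pub fn adds_epu8_scalar(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        // per-lane unsigned saturating add, clamping at 255
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}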

Maybe it’s sufficient to say “a function takes a type implied by the suffix of its name”? That’d justify only providing the signed versions of the ones that don’t care initially, though I bet we’ll want to offer _mm_add_epu32 at some point, despite only _mm_add_epi32 existing officially.

Will adding literally thousands more functions to std impact non-simd-using consumers at all? I take it they can’t go in an intel_intrinsics crate because of needing nightly for something?

:sparkles: Yay! :fireworks:

Given all the llvm intrinsics that take vectors, I suspect there are a few more. The only one I’ll recommend for the RFC, though, is mul_add on floating-point vectors. (Oh, and since we have i32::saturating_add in stable, i32x4::saturating_add would be nice, and there’s an MMX instruction for it.)

And I assume “step 3” isn’t gated on completion of “step 2”, despite the numbering?


#220

Just a word of warning here: It’s easy enough to tickle LLVM’s SLP vectorizer into producing one or two SIMD instructions with tiny toy examples, but this is not a robust alternative to hooking up the real portable SIMD operations (my step 3). LLVM has a number of optimizations that run before the SLP vectorizer gets to see the code, and they can easily obfuscate things by optimizing the separate lane expressions in different ways.

Try creating a larger example, and make sure that there are constant folding opportunities in some lanes, but not in others. That should trip up the SLP vectorizer enough to just give you scalar code. When we have to defend this proposal in RFC form, it would actually be good to have real counterexamples at hand to show that this is not a viable alternative to step 3.
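
Something along these lines, say (just a sketch of the kind of input I mean, not a verified counterexample), where one lane has a constant-folding opportunity that earlier passes can rewrite before the SLP vectorizer sees it:

pub fn my_mul_uneven(a: f32x4, b: f32x4) -> f32x4 {
    let a: [f32; 4] = a.into();
    let b: [f32; 4] = b.into();
    [
        a[0] * b[0],
        a[1] * 1.0, // foldable lane: x * 1.0 simplifies to just x
        a[2] * b[2],
        a[3] * b[3],
    ].into()
}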

I was going for conservatism rather than completeness, so I left out many things.

  • Fused multiply-add (mul_add) is available on many platforms, but not universally, and it is a very expensive operation to emulate. f64x2::mul_add() requires a 104-bit wide multiplication which you really, really want implemented in hardware.
  • Saturating arithmetic is universally available for 8-bit and 16-bit lanes, but not for i32x4. It turns out that nobody actually needs saturating arithmetic larger than 16 bits. I left it out to avoid the asymmetry.
  • Comparisons and selects are universally available, but with subtle variations across platforms that we don’t need to resolve now. Compare the semantics of Intel’s blendvps instruction with NEON’s vbsl. The difference is just big enough to get in the way.
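
To make that select difference concrete (scalar sketches only, with the operand order simplified; see the vendor docs for the exact definitions): blendvps picks whole 32-bit lanes based on each mask lane’s sign bit, while vbsl selects individual bits:

// roughly blendvps, per 32-bit lane: only the mask's sign bit matters
fn blend_lane(mask: u32, if_set: f32, if_clear: f32) -> f32 {
    if mask & 0x8000_0000 != 0 { if_set } else { if_clear }
}

// roughly vbsl, per lane: every bit of the mask selects independently
fn bitwise_select_lane(mask: u32, if_set: u32, if_clear: u32) -> u32 {
    (mask & if_set) | (!mask & if_clear)
}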

Most of these things should be fairly straightforward to implement in terms of vendor intrinsics, which I’m sure the simd crate will do.

I would like to see a second round of stabilization that brings the portable operations up to par with SIMD.js at least, but there are a number of issues to work out first.

Right. The steps represent my priority, not a dependency.


#221

(Oops, it looks like I pasted in the wrong code in my previous post. That rust code should have been f32s, not i32s. I won’t edit it since that seems to do weird things to timestamps here?)

Absolutely.

As you suggested, man, was one easy to produce :disappointed: This one, which I’m not surprised didn’t work, does generate vector instructions, but it’s four mulss, not a mulps :wink:

pub fn my_dot_direct(a: f32x4, b: f32x4) -> f32 {
    let a: [f32;4] = a.into();
    let b: [f32;4] = b.into();
    a[0] * b[0] +
    a[1] * b[1] +
    a[2] * b[2] +
    a[3] * b[3]
}

And I guess the inliner is one of the passes that runs before the SLP vectorizer. I was hoping this would inline the vector fmul from my_mul, but nope. No fmul <4 x float> in the LLVM, no mulps in the assembly.

pub fn my_dot_usingmul(a: f32x4, b: f32x4) -> f32 {
    let c: [f32;4] = my_mul(a, b).into();
    c[0] + c[1] + c[2] + c[3]
}

But then a tiny change makes it happy again. This one actually generates rather nice-looking vector operations even for the additions:

pub fn my_dot_usingmul_balanced(a: f32x4, b: f32x4) -> f32 {
    let c: [f32;4] = my_mul(a, b).into();
    (c[0] + c[1]) + (c[2] + c[3])
}

(Well, except for the fact that apparently it didn’t notice that %2 and %4 are exactly the same thing, which I thought SSA was really good at.)

  %2 = fmul <4 x float> %0, %1
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 2>
  %4 = fmul <4 x float> %0, %1
  %5 = shufflevector <4 x float> %4, <4 x float> undef, <2 x i32> <i32 1, i32 3>
  %6 = fadd <2 x float> %3, %5
  %7 = extractelement <2 x float> %6, i32 0
  %8 = extractelement <2 x float> %6, i32 1
  %9 = fadd float %7, %8
  ret float %9

Yay, non-associative floating point math! :angry: Makes me want a bunch of sum_unspecified_order() methods (on simd, slices, …), but that’s definitely a post-step-4 bikeshed.

(Hmm, looking at llvm’s fastmath flags makes me want to go make an nnan_f32 type that lowers like that all the way to llvm, and would actually be Ord & Eq. But that’s a distraction for the future…)


#222

One thing I think is important to bring up: Both ARM SVE and the RISC-V V[ector] extension (old slides, old video, related work, new presentation a couple days ago with media not up yet) are on the very immediate horizon.

The key distinguishing factor is that they do not have fixed vector lengths.

In addition, there are three main categories of instructions that use SIMD registers:

  1. Iterated instructions
    • Parallel add/sub/and/or/xor/etc
    • These are actually quite poorly served by SIMD (see RISC-V slides for summary, or “related work” for detail). Reasons:
      • Needing to handle the boundary cases
      • Code bloat for handling the different register widths of every generation
      • Requires source changes to handle new generations
      • Code written for new generations not backwards compatible
      • Strip-mining loop is boilerplate, ripe for zero-overhead abstraction
    • For these instructions, we might be far better served by intrinsic (or library) functions that take slices of primitive types, and handle the strip-mining for the programmer.
      • These, then, would also work for ARM SVE or RISC-V+V.
  2. Permutative/combinatorial instructions
    • PSHUFB and friends; reductions
    • This is a category that is very SIMD-friendly, and does not generalize well to arbitrary width.
    • However, such instructions may see less use than “iterated instructions” outside of crypto/compression.
    • Survey?
    • EDIT: According to someone who was present and watched the new RISC-V talk:
      • I asked krste whether permute-heavy code like crypto and codecs fits into the model at all, he said that they had permutes but there wasn’t time to discuss
      • I also asked about reductions, response was “recursive halving or something”
        • Editor’s note: If you have permutes, then I think they can be used to recover any reduction order under recursive halving.
  3. Scalar instructions with large bit-width
    • This covers cases like the AES or SHA acceleration instructions on x86.
    • Heavily architecture specific, heavily purpose-specific.
    • Likely quite worth waiting a while to stabilize.

I think talking about “SIMD intrinsics” as a single unitary thing is a huge mistake: these categories may very well merit being handled differently.


#223

If I can take a different course on this, maybe there could be a level of stabilisation between “unstable” and “stable”, so that #[feature(...)] can be used on stable but the feature in question is not necessarily included in Rust’s stability story. This would be a more general solution than a special-case libstd-llvm that is only for this one case. It could be that only the small number of features strictly needed for this could be included to start with, but it might also allow us to get things like Macros 1.1 into the bloodstream as soon as possible while keeping the unstable parts explicit in the code.

I haven’t read this whole thread, sorry if something similar or strictly better has been suggested.


#224

I came here via the Reddit thread. I read the topic two weeks ago but it was already long back then, so I only skimmed through what happened since. I’m sorry for going a bit off topic, but this seems like the best way to share my opinion.

Almost a year ago I wrote Convector, a project in Rust that heavily exercises AVX. This is my experience/wishlist:

  • The tooling around target features could indeed be better. I proposed a way to deal with this in Cargo, but we could not reach consensus in the topic. Then we got support for RUSTFLAGS in .cargo/config, so I just put that under source control. It is an ugly hack, but it actually works fine.
  • I want access to the raw platform intrinsics with the same names as in the Intel Intrinsics Guide. I’m glad this topic is going in that direction. They are weird and ugly, but at least they are documented and searchable, and they would be consistent with C/C++. One of the things that surprised me about the current intrinsics is that e.g. _mm256_add_ps is called simd_add instead. The latter looks friendlier, but it is undiscoverable (also because it is not documented, I suppose). The main issue, though, is that if you go this way you have to draw the line somewhere about what to rename. I propose not renaming anything, especially since the consensus seems to be to not focus on portable SIMD at the moment.
  • The types of these intrinsics are sometimes weird (e.g. _mm256_and_ps operates on floats?), but at this level types have little meaning anyway. To some operations, the operands are just “a sequence of bits”, and they might later be interpreted as floats or integers or bitmasks or whatever. Code full of these intrinsics is hard enough to read without all the transmutes. I’m not sure what the best way to go about this is. Maybe an opaque 128/256/512 bit type with methods to cast from and to tuples of float and integer types?
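
Something like this, perhaps (a rough sketch; the names are made up and plain arrays stand in for the vector types):

#[derive(Copy, Clone)]
pub struct Bits128([u8; 16]); // opaque 128-bit value, no preferred interpretation

impl Bits128 {
    pub fn from_f32x4(v: [f32; 4]) -> Bits128 {
        // same size, so this is a pure reinterpretation of the bits
        unsafe { std::mem::transmute(v) }
    }
    pub fn as_f32x4(self) -> [f32; 4] {
        unsafe { std::mem::transmute(self) }
    }
    pub fn from_i32x4(v: [i32; 4]) -> Bits128 {
        unsafe { std::mem::transmute(v) }
    }
    pub fn as_i32x4(self) -> [i32; 4] {
        unsafe { std::mem::transmute(self) }
    }
}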

#225

@eternaleye I’m not really sure what to do with your comment. I don’t understand your warning. Can you please make your fears more concrete? If we go with something similar to @stoklund’s proposal above, what are the drawbacks from your perspective?

@jFransham We got off that “unstable stable because LLVM” thing a while back. It was a bad suggestion on my part based on a misunderstanding I had. We don’t need to expose something that is LLVM specific. We can expose something that matches vendor specific APIs as closely as feasible. Much of the conversation since this has revolved around two things: 1) keeping this discussion focused by reminding everyone that we need to punt on a complete cross platform abstraction and 2) just how closely we want to match the vendor specific APIs. For example, Intel’s API uses __m128i for every single integer vector operation, but there’s a strong push toward breaking that out into separate u8x16, i16x8, u64x2, etc., types that are defined in a cross platform way.

@ruudva With respect to target_feature, ya, it’s not ideal today. I think we’ll probably want to stabilize cfg(target_feature) in this effort, but I’m not sure whether we’ll get to making it ergonomic in Cargo just yet. I will say though that we can also do runtime detection, which should hopefully land soon. The key here is that you don’t need to use RUSTFLAGS or cfg!(target_feature) at all for that approach.

With respect to naming. Yes. An explicit goal of this approach is to retain an obvious mapping. We definitely won’t be stabilizing simd_add. (Instead, it might be buttoned up behind an impl of Add on various vector types, for example.) But we’ll still also stabilize _mm256_add_ps too.

With respect to type casting… Many (I dare say most) of the Intel intrinsics have a very obvious single type that they operate on, so I think it might make sense to define intrinsics with the appropriate types. As you say though, not all intrinsics have an obvious type and some of them are just explicitly operations on the bits themselves. We might want an x86-specific type alias like __m128i = i8x16 to express that in the function signatures. We’ll also include bitcasting From impls for at least all of the integer vector types.


#226

I think that a bitcasting From would be a bad idea, as it has different semantics than the current From impls for numerics, which convert, not bitcast. If From were implemented only for integer types it might be less ambiguous, but then we’ll need some other way to bitcast float↔float and float↔int anyway, so I don’t see the point of such impls. The two solutions I see are:

  1. separate bitcast method (or a trait method) to convert between SIMD types of the same width,
  2. instead of doing __m128i = i8x16, let __m128i be a truly different type, which then can implement From and Into for all the SIMD types of the same width.
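
A rough sketch of option 1, with a made-up trait name and arrays standing in for the SIMD types:

pub trait Bitcast<T>: Sized {
    // reinterpret the bits; only meant for types of the same total width
    fn bitcast(self) -> T;
}

impl Bitcast<[f32; 4]> for [i32; 4] {
    fn bitcast(self) -> [f32; 4] {
        unsafe { std::mem::transmute(self) } // 128 bits in, 128 bits out
    }
}

Option 2 is the same idea, with a distinct __m128i type playing the role of the bits-only intermediary.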

#227

If we limit the From conversions to integer vector types, then they should be consistent with what we have. Notably, the conversions are lossless and can never fail.

I don’t think it’s ambiguous at all.

I personally haven’t really settled on this myself. If our initial effort requires one to transmute for float<->integer bitcasting, then I think that’s OK.


#228
  1. We still need to manually implement the stripmine loop, which is pure boilerplate
  2. We cannot execute the stripmine loop we wrote on hardware with narrower vectors, and so suffer code-bloat for compatibility
  3. We cannot benefit from executing the stripmine loop we wrote on hardware with wider vectors, and so suffer both code-bloat and upgrade-treadmill for performance
  4. We must unroll size_of(vector)/size_of(field) - 1 iterations of our loop to scalar code to handle the loop tail (and possibly the loop head, for vectorizing operations on unhelpfully-aligned slices) manually, and so suffer code-bloat for correctness
  5. The interface grows without bound as new generations of hardware with wider vectors are introduced
  6. We cannot take advantage of hardware that provides proper vector extensions (SVE, RISC-V+V) with that interface, and so must introduce something like I describe anyway in the long run (or to be honest, the medium run)
  7. An interface like that exposes less information to the compiler (as the approach I describe could easily use the proper intrinsics under the hood, but also use them in idiomatic ways the compiler can recognize - additional degrees of freedom here lead to distinction-without-difference in stripmine implementations)

My proposal, then, is basically “put the stripmine loop behind the interface”.

  • We then suffer no source code bloat on any architecture for “iterated instructions”
  • We only suffer binary code bloat on architectures that force packed SIMD
  • We only suffer recompilation treadmill (rather than upgrade treadmill) for performance
  • We avoid the need to hand-roll a number of fiddly corner cases
  • We specify a smaller interface
  • We actually benefit from architectures that support proper vectors
  • We open the door to superior optimization

Loop fusion optimizations trivially unify the stripmine loops, and you wind up with nice, dense SIMD code - moreover, loop fusion is very likely to take advantage of register allocation / instruction cache information to decide how many loops to fuse.
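
As a very rough sketch of what I mean (a plain slice-based signature; the body shown here is scalar, but an implementation is free to use packed SIMD of whatever width, or a true vector ISA, underneath):

// Hypothetical library function: the caller never writes the stripmine loop.
pub fn add_assign_f32(acc: &mut [f32], src: &[f32]) {
    assert_eq!(acc.len(), src.len());
    for (a, s) in acc.iter_mut().zip(src) {
        *a += *s; // lane width, loop tail, and alignment are the implementation's problem
    }
}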


#229

Unfortunately, I understand very little of what you said. I don’t know what the “stripmine loop” is. I’m at work, so I don’t have time to read the materials you linked unfortunately. I don’t understand why “the interface grows without bound” is a problem. We don’t control the interfaces. The vendor does. (For example, Intel’s AVX-512 interface is absolutely huge.)

Since I don’t understand what you’re saying, I’d like to request that you be extremely concrete. You probably need to use real examples. I would also like to request that you put more focus on the following: what part of the problems you’re trying to describe explicitly need to be solved in our initial stabilization effort? Can the problems be solved later?

(Emphasis mine.) I don’t see any reason whatsoever to introduce value judgments about vendor APIs into this discussion. Leave them out, please.


#230

The stripmine loop is the part that chunks up your input (arbitrary-length) vector into your architectural (finite-length) vectors, and loads it into the appropriate registers.
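
In code, it’s roughly the boilerplate around the actual SIMD work (a scalar stand-in is shown where the 4-wide loads/adds/stores would go):

fn add_assign_by_hand(acc: &mut [f32], src: &[f32]) {
    assert_eq!(acc.len(), src.len());
    let lanes = 4; // architectural vector width, baked in at compile time
    let main = acc.len() / lanes * lanes;
    let mut i = 0;
    while i < main {
        // the real SIMD body: load `lanes` elements, add, store `lanes` elements
        for j in 0..lanes {
            acc[i + j] += src[i + j];
        }
        i += lanes;
    }
    // loop tail: up to lanes - 1 leftovers handled as scalars
    while i < acc.len() {
        acc[i] += src[i];
        i += 1;
    }
}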

The interface growing without bound on some axes (functionality) is unavoidable, but it growing along the vector size axis (at least in the “iterated instructions” category, and possibly “permute/combine” as well) is eminently preventable, and has major downsides.

Another preventable axis is “argument length/type” - RISC-V’s V extension (and I think also ARM SVE) addresses this in a way that doesn’t map onto the argument size being baked into each instruction.

Also, if you read none of the other things I linked, read the slides - they motivate my arguments concisely and thoroughly.

I’ll try.

Also, I’d argue that these concerns are very important to solve before stabilization, or else we will need to introduce a second API which massively overlaps this one (and stabilize it) in order to support certain hardware at all because of assumptions made in the current proposals.

This is not a value judgement; “Packed SIMD” vs. “Vector Processor” are terms of art.

The former refers to the general approach taken by NEON, SSE, AVX, etc - that of architecturally-fixed-length vector-registers, with a new instruction set for each length.

The latter refers to Cray-style vector instruction sets, which effectively perform hardware-accelerated iteration using a wide, pipelined engine, applied to a memory vector of arbitrary length. Both ARM SVE and RISC-V’s V extension are members of this family.


#231

You’re right, those conversions are not ambiguous when considered separately. I should have written surprising or confusing instead.

The problem is that when a Rust user sees the f64::from(i32) implementation in std, which converts, they may expect to see an f32x4::from(i32x4) impl which also converts. On the other hand, if they see a bitcasting i8x16::from(i32x4), they’d expect f32x4::from(i32x4) to bitcast.

So the integer-to-integer From conversions won’t be confusing only if we say that we’ll never use From in a SIMD context for any conversion other than integer bitcasting (i.e. the f32x4 case, lane widening, vector-of-bool to vector-of-int conversion, etc). If we’re ready to say that implementing From for these cases should never be possible, then SIMD integer-to-integer From won’t in fact be confusing. I still prefer the bitcast or “separate ‘bits’ type” way though, since the rule “to bitcast SIMD you use From for integers and transmute in other cases” seems ad hoc.


#232

The problem here is that if we block this round of stabilization on a uniform API that can work as well as possible for both fixed length vector APIs and variable length vector APIs on the horizon, then it’s likely that stabilization of anything will just never happen at all. There’s a saying along the lines of “don’t let perfect be the enemy of good.” I personally hate it when people tell me that, but we as a community need to decide whether we want access to SIMD intrinsics as they have existed for years in other ecosystems, or whether we want to wait until we can implement the best API possible for all new vendor provided vector APIs on the horizon. I admit this depends on what exactly a variable length vector API entails, and I don’t think you’ve really made that clear yet unfortunately. :-/

In the interest of moving this forward, could you propose a straw man extension or replacement to @stoklund’s proposal that addresses your concerns?

Can you also explicitly state whether it’s possible to even experiment with these variable length vector APIs? If we can’t, then I personally think your request here is really unreasonable.


#233

u64::from(1u8) bitcasts but f64::from(1u8) doesn’t.

How do you bitcast a u8 to a f64 in today’s Rust?


#234

Both u64::from(1u8) and f64::from(1u8) perform an integer conversion. The fact that the first one does some bit-copying is just a side effect. And also, I was using the word bitcast to refer to bitcasting of values of the same size (which I think is the most common meaning of this word).


#235

@krdln For all SIMD integer vector types of the same bit size, conversion between them is bitcasting. The only problems arise when you need to do integer<->float bitcasts, which aren’t the same as conversions. Hence why I think we should just punt on integer<->float bitcasts. But the From conversions for all the integer vector types seem completely straight-forward and they do exactly the obvious thing.

I’ll reformulate my previous question: how do you bitcast a i64 to a f64 in today’s Rust?


#236

The thing is that you’re basically asking me to copy/paste exactly what’s in the slide deck I linked. It describes why variable-length vectors are good, describes the exact programming model supported, has example assembly side-by-side with SIMD, the works.

(And copy/pasting from PDF is a royal pain.)

In essence:

trait VectorizablePrimitive: Copy {} // {u,i}{8,16,32,64,size} f{16,32,64}

trait VectorizationOp<T> {
    type Output: VectorizablePrimitive;
    extern "rust-intrinsic" fn perform(...);
}

trait VectorizableIterator<T: VectorizablePrimitive>: Iterator<Item=T> {
    unsafe fn vectorize<O: VectorizationOp<T>>(self, op: O)
        -> impl VectorizableIterator<O::Output>;
}

struct IndexedGather<V: VectorizablePrimitive>(*const V);
impl<V: VectorizablePrimitive> VectorizationOp<usize> for IndexedGather<V> {
    type Output = V;
    extern "rust-intrinsic" fn perform(...) {
        // RISC-V vector load goes here
    }
}

fn main() {
    let x = [3i32, 1, 2, 4, 0];
    let indices = [4usize, 1, 2, 0, 3];

    println!("{:?}", unsafe {
        indices.into_iter()
        .vectorize(IndexedGather(x.as_ptr()))
    });
}

// prints "[0, 1, 2, 3, 4]"

There’s currently work on adding SVE to LLVM, and I believe the Spike RISC-V emulator has support for the draft V extension (possibly in a branch).


#237

My thoughts:

  1. Designing an API for something that can’t even be feasibly experimented with isn’t something I’m personally capable of doing. I won’t be able to lead that effort.
  2. Assuming we ignore (1) and we want to address your concerns, the only reasonable thing to do (as far as I can see) is to say that absolutely zero cross platform API is possible at this time. No cross platform types. Nothing.

#238

I won’t call it a conversion. For me, it’s only bitcasting or transmute. (But that’s bikeshedding over the meaning of conversion, so let’s ignore naming.) The fact that on x86 it’s just a matter of using a register in a different instruction is a platform implementation detail (e.g. the upcoming Mill architecture treats the number of lanes differently). (Note: I’m assuming that the i32x4-like types will be cross-platform, not separate per architecture. If that’s not the case, ignore the rest of this paragraph.) If we look at the SIMD types in the abstract, types such as i32x4 and i8x16 have no more in common than i64 and f64 (which you’ve mentioned). They just share a size. Therefore it would be surprising for the latter pair to be converted by transmute and the former’s transmute to be glorified into a From implementation. I think that the right way to convert between different SIMD types of the same size should be either transmute (or a safe-transmute method, if we want to avoid unsafe) or platform-specific intrinsics.

I do agree. The problem also arises if you implement From for any pair of types with the same number of lanes. Therefore I’d just say that if we implement integer→integer SIMD transmutes as From (which I think is a bad idea, but you don’t have to agree), we should never implement From for any pair of types with the same number of lanes, to avoid confusion. Do you agree with that “rule”?