One thing I think is important to bring up: Both ARM SVE and the RISC-V V[ector] extension (old slides, old video, related work, new presentation a couple days ago with media not up yet) are on the very immediate horizon.
The key distinguishing factor is that they do not have fixed vector lengths.
In addition, there are three main categories of instructions that use SIMD registers:
Iterated instructions
Parallel add/sub/and/or/xor/etc
These are actually quite poorly served by SIMD (see the RISC-V slides for a summary, or "related work" for detail). Reasons:
Needing to handle the boundary cases
Code bloat for handling the different register widths of every generation
Requires source changes to handle new generations
Code written for new generations not backwards compatible
Strip-mining loop is boilerplate, ripe for zero-overhead abstraction
For these instructions, we might be far better served by intrinsic (or library) functions that take slices of primitive types, and handle the strip-mining for the programmer.
These, then, would also work for ARM SVE or RISC-V+V.
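For illustration, here is a minimal sketch of what such a slice-level interface might look like (the name add_slices and its exact shape are hypothetical, purely illustrative):

/// Hypothetical library function: out[i] = a[i] + b[i] for all i.
/// The implementation owns the strip-mining, tail handling, and
/// dispatch on the available vector width; callers never see any of it.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && b.len() == out.len());
    // Scalar stand-in body; a real implementation would chunk the
    // slices by the architectural vector width and use intrinsics,
    // or a single vector-length-agnostic loop on SVE / RISC-V+V.
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    let mut out = [0.0f32; 3];
    add_slices(&a, &b, &mut out);
    assert_eq!(out, [5.0, 7.0, 9.0]);
}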
Permutative/combinatorial instructions
PSHUFB and friends; reductions
This is a category that is very SIMD-friendly, and does not generalize well to arbitrary width.
However, such instructions may see less use than "iterated instructions" outside of crypto/compression.
Survey?
EDIT: According to someone who was present and watched the new RISC-V talk:
I asked Krste whether permute-heavy code like crypto and codecs fits into the model at all; he said that they had permutes but there wasn't time to discuss
I also asked about reductions; the response was "recursive halving or something"
Editor's note: If you have permutes, then I think they can be used to recover any reduction order under recursive halving.
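For concreteness, here is a scalar model of a recursive-halving reduction (my own illustration, assuming a power-of-two lane count; on hardware each step is one permute to line up the halves plus one vertical add):

// Recursive halving: repeatedly fold the upper half of the lanes onto
// the lower half; log2(n) steps for n lanes.
fn reduce_add(lanes: &mut [i32]) -> i32 {
    let mut n = lanes.len();
    assert!(n.is_power_of_two());
    while n > 1 {
        n /= 2;
        for i in 0..n {
            // On hardware, a permute brings lanes[i + n] into lane i.
            lanes[i] += lanes[i + n];
        }
    }
    lanes[0]
}

fn main() {
    let mut v = [1, 2, 3, 4, 5, 6, 7, 8];
    assert_eq!(reduce_add(&mut v), 36);
}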
Scalar instructions with large bit-width
This covers cases like the AES or SHA acceleration instructions on x86.
I think talking about "SIMD intrinsics" as a single unitary thing is a huge mistake: these categories may very well merit being handled differently.
If I may suggest a different course on this: maybe there could be a level of stabilisation between "unstable" and "stable", so that #[feature(...)] can be used on stable but the feature in question is not necessarily included in Rust's stability story. This would be a more general solution than a special-case libstd-llvm that exists only for this one case. It could be that only the small number of features strictly needed for this would be included to start with, but it might also allow us to get things like Macros 1.1 into the bloodstream as soon as possible while keeping the unstable parts explicit in the code.
I haven't read this whole thread, sorry if something similar or strictly better has been suggested.
I came here via the Reddit thread. I read the topic two weeks ago, but it was already long back then, so I only skimmed through what has happened since. I'm sorry for going a bit off topic, but this seems like the best way to share my opinion.
Almost a year ago I wrote Convector, a project in Rust that heavily exercises AVX. This is my experience/wishlist:
The tooling around target features could indeed be better. I proposed a way to deal with this in Cargo, but we could not reach consensus in the topic. Then we got support for RUSTFLAGS in .cargo/config, so I just put that under source control. It is an ugly hack, but it actually works fine.
I want access to the raw platform intrinsics with the same names as in the Intel Intrinsics Guide. I'm glad this topic is going in that direction. They are weird and ugly, but at least they are documented and searchable, and they would be consistent with C/C++. One of the things that surprised me about the current intrinsics is that e.g. _mm256_add_ps is called simd_add instead. The latter looks friendlier, but it is undiscoverable (also because it is not documented, I suppose); the main issue is that if you go this way, you have to draw the line somewhere about what to rename. I propose not renaming anything, especially since the consensus seems to be not to focus on portable SIMD at the moment.
The types of these intrinsics are sometimes weird (e.g. _mm256_and_ps operates on floats?), but at this level types have little meaning anyway. To some operations, the operands are just "a sequence of bits", and they might later be interpreted as floats or integers or bitmasks or whatever. Code full of these intrinsics is hard enough to read without all the transmutes. I'm not sure what the best way to go about this is. Maybe an opaque 128/256/512-bit type with methods to cast from and to tuples of float and integer types?
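Something along these lines, perhaps (a sketch only; the Bits128 name and its methods are invented):

use std::mem;

/// Hypothetical opaque 128-bit value: "just bits", no lane interpretation.
#[derive(Copy, Clone)]
struct Bits128([u8; 16]);

impl Bits128 {
    fn from_f32x4(v: [f32; 4]) -> Bits128 {
        unsafe { mem::transmute(v) } // same size, pure reinterpretation
    }
    fn to_i32x4(self) -> [i32; 4] {
        unsafe { mem::transmute(self) }
    }
    /// Bitwise ops like _mm256_and_ps would naturally live on this type.
    fn and(self, other: Bits128) -> Bits128 {
        let mut out = [0u8; 16];
        for i in 0..16 {
            out[i] = self.0[i] & other.0[i];
        }
        Bits128(out)
    }
}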
@eternaleye I'm not really sure what to do with your comment. I don't understand your warning. Can you please make your fears more concrete? If we go with something similar to @stoklund's proposal above, what are the drawbacks from your perspective?
@jFransham We got off that "unstable stable because LLVM" thing a while back. It was a bad suggestion on my part based on a misunderstanding I had. We don't need to expose something that is LLVM specific. We can expose something that matches vendor specific APIs as closely as feasible. Much of the conversation since then has revolved around two things: 1) keeping this discussion focused by reminding everyone that we need to punt on a complete cross platform abstraction and 2) just how closely we want to match the vendor specific APIs. For example, Intel's API uses __m128i for every single integer vector operation, but there's a strong push toward breaking that out into separate u8x16, i16x8, u64x2, etc., types that are defined in a cross platform way.
@ruuda With respect to target_feature, ya, it's not ideal today. I think we'll probably want to stabilize cfg(target_feature) in this effort, but I'm not sure whether we'll get to making it ergonomic in Cargo just yet. I will say though that we can also do runtime detection, which should hopefully land soon. The key here is that you don't need to use RUSTFLAGS or cfg!(target_feature) at all for that approach.
With respect to naming: yes. An explicit goal of this approach is to retain an obvious mapping. We definitely won't be stabilizing simd_add. (Instead, it might be buttoned up behind an impl of Add on various vector types, for example.) But we'll still stabilize _mm256_add_ps too.
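To illustrate what "buttoned up behind an impl of Add" might look like (a sketch; the f32x8 type and its scalar body are stand-ins for the real thing):

use std::ops::Add;

#[allow(non_camel_case_types)]
#[derive(Copy, Clone, Debug, PartialEq)]
struct f32x8([f32; 8]); // stand-in for a cross-platform vector type

impl Add for f32x8 {
    type Output = f32x8;
    fn add(self, rhs: f32x8) -> f32x8 {
        // On AVX targets, this lane-wise loop is what _mm256_add_ps
        // does in a single instruction.
        let mut out = self.0;
        for i in 0..8 {
            out[i] += rhs.0[i];
        }
        f32x8(out)
    }
}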
With respect to type casting... Many (I dare say most) of the Intel intrinsics have a very obvious single type that they operate on, so I think it might make sense to define intrinsics with the appropriate types. As you say though, not all intrinsics have an obvious type and some of them are just explicitly operations on the bits themselves. We might want an x86 specific type alias like __m128i = i8x16 to express that in the function signatures. We'll also include bitcasting From impls for at least all of the integer vector types.
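Concretely, such a bitcasting From impl would be nothing more than a same-size reinterpretation between integer vector types. A sketch, with stand-in struct types:

use std::mem;

#[allow(non_camel_case_types)]
#[derive(Copy, Clone)]
struct i8x16([i8; 16]);

#[allow(non_camel_case_types)]
#[derive(Copy, Clone)]
struct u64x2([u64; 2]);

impl From<i8x16> for u64x2 {
    fn from(v: i8x16) -> u64x2 {
        // Same 128 bits reinterpreted: lossless and infallible.
        unsafe { mem::transmute(v) }
    }
}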
I think that bitcasting From would be a bad idea, as it has different semantics than the current From impls for numerics, which convert, not bitcast. If From were implemented only for integer types it may be less ambiguous, but then we'll need some other way to bitcast float<->float and float<->int anyway, so I don't see a point in such impls. The two solutions I see are:
separate bitcast method (or a trait method) to convert between SIMD types of the same width,
instead of doing __m128i = i8x16, let __m128i be a truly different type, which then can implement From and Into for all the SIMD types of the same width.
If we limit the From conversions to integer vector types, then they should be consistent with what we have. Notably, the conversions are lossless and can never fail.
I don't think it's ambiguous at all.
I personally haven't really settled on this myself. If our initial effort requires one to transmute for float<->integer bitcasting, then I think that's OK.
We still need to manually implement the stripmine loop, which is pure boilerplate
We cannot execute the stripmine loop we wrote on hardware with narrower vectors, and so suffer code-bloat for compatibility
We cannot benefit from executing the stripmine loop we wrote on hardware with wider vectors, and so suffer both code-bloat and upgrade-treadmill for performance
We must unroll up to size_of(vector)/size_of(element) - 1 iterations of our loop to scalar code to handle the loop tail (and possibly the loop head, for vectorizing operations on unhelpfully-aligned slices) manually, and so suffer code-bloat for correctness
The interface grows without bound as new generations of hardware with wider vectors are introduced
We cannot take advantage of hardware that provides proper vector extensions (SVE, RISC-V+V) with that interface, and so must introduce something like I describe anyway in the long run (or to be honest, the medium run)
An interface like that exposes less information to the compiler (as the approach I describe could easily use the proper intrinsics under the hood, but also use them in idiomatic ways the compiler can recognize - additional degrees of freedom here lead to distinction-without-difference in stripmine implementations)
My proposal, then, is basically "put the stripmine loop behind the interface".
We then suffer no source code bloat on any architecture for "iterated instructions"
We only suffer binary code bloat on architectures that force packed SIMD
We only suffer recompilation treadmill (rather than upgrade treadmill) for performance
We avoid the need to hand-roll a number of fiddly corner cases
We specify a smaller interface
We actually benefit from architectures that support proper vectors
We open the door to superior optimization
Loop fusion optimizations trivially unify the stripmine loops, and you wind up with nice, dense SIMD code - moreover, loop fusion is very likely to take advantage of register allocation / instruction cache information to decide how many loops to fuse.
Unfortunately, I understand very little of what you said. I don't know what the "stripmine loop" is. I'm at work, so I don't have time to read the materials you linked unfortunately. I don't understand why "the interface grows without bound" is a problem. We don't control the interfaces. The vendor does. (For example, Intel's AVX-512 interface is absolutely huge.)
Since I don't understand what you're saying, I'd like to request that you be extremely concrete. You probably need to use real examples. I would also like to request that you put more focus on the following: what part of the problems you're trying to describe explicitly need to be solved in our initial stabilization effort? Can the problems be solved later?
(Emphasis mine.) I don't see any reason whatsoever to introduce value judgments about vendor APIs into this discussion. Leave them out, please.
The stripmine loop is the part that chunks up your input (arbitrary-length) vector into your architectural (finite-length) vectors, and loads it into the appropriate registers.
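For example, here is what a hand-written strip-mine for adding two f32 slices looks like in a scalar model (purely illustrative; on SSE the chunk body would be a single _mm_add_ps, and LANES would be fixed by the target):

const LANES: usize = 4; // architectural vector width, in f32 lanes

fn add_stripmined(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len();
    let main = n - n % LANES;
    let mut i = 0;
    // Main strip-mine loop: one whole architectural vector per step.
    while i < main {
        for lane in 0..LANES {
            out[i + lane] = a[i + lane] + b[i + lane]; // one vector add
        }
        i += LANES;
    }
    // Loop tail: the boundary case, handled in scalar code by hand.
    while i < n {
        out[i] = a[i] + b[i];
        i += 1;
    }
}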
The interface growing without bound on some axes (functionality) is unavoidable, but it growing along the vector size axis (at least in the "iterated instructions" category, and possibly "permute/combine" as well) is eminently preventable, and has major downsides.
Another preventable axis is "argument length/type" - RISC-V's V extension (and I think also ARM SVE) addresses this in a way that has no mapping to the argument size being specified by the instruction.
Also, if you read none of the other things I linked, read the slides - they motivate my arguments concisely and thoroughly.
I'll try.
Also, I'd argue that these concerns are very important to solve before stabilization, or else we will need to introduce a second API which massively overlaps this one (and stabilize it) in order to support certain hardware at all because of assumptions made in the current proposals.
This is not a value judgement; "Packed SIMD" vs. "Vector Processor" are terms of art.
The former refers to the general approach taken by NEON, SSE, AVX, etc - that of architecturally-fixed-length vector-registers, with a new instruction set for each length.
The latter refers to Cray-style vector instruction sets, which effectively perform hardware-accelerated iteration using a wide, pipelined engine, applied to a memory vector of arbitrary length. Both ARM SVE and RISC-V's V extension are members of this family.
You're right, those conversions are not ambiguous when considered separately. I should have written surprising or confusing instead.
The problem is that when a Rust user sees the f64::from(i32) impl in std, which converts, they may expect to see an f32x4::from(i32x4) impl which also converts. On the other hand, if they see a bitcasting i8x16::from(i32x4), they'd expect f32x4::from(i32x4) to bitcast.
So the integer-to-integer From conversions won't be confusing only if we say that we'll never use From in a SIMD context for any other conversion than integer bitcasting (i.e. the f32x4 case, lane widening, vector-of-bool to vector-of-int conversion, etc.). If we're ready to say that implementing From for these cases should never be possible, then SIMD integer-to-integer From won't in fact be confusing. I still prefer the bitcast or "separate 'bits' type" way though, since the rule "to bitcast SIMD you use From for integers and transmute in other cases" seems ad hoc.
The problem here is that if we block this round of stabilization on a uniform API that can work as well as possible for both fixed length vector APIs and variable length vector APIs on the horizon, then it's likely that stabilization of anything will just never happen at all. There's a saying along the lines of "don't let perfect be the enemy of good." I personally hate it when people tell me that, but we as a community need to decide whether we want access to SIMD intrinsics as they have existed for years in other ecosystems, or whether we want to wait until we can implement the best API possible for all new vendor provided vector APIs on the horizon. I admit this depends on what exactly a variable length vector API entails, and I don't think you've really made that clear yet unfortunately. :-/
In the interest of moving this forward, could you propose a straw man extension or replacement to @stoklund's proposal that addresses your concerns?
Can you also explicitly state whether it's possible to even experiment with these variable length vector APIs? If we can't, then I personally think your request here is really unreasonable.
Both u64::from(1u8) and f64::from(1u8) perform an integer conversion. The fact that the first one does some bit-copying is just a side effect. And also, I was using the word bitcast to refer to bitcasting of values of the same size (which I think is the most common meaning of this word).
@krdln For all SIMD integer vector types of the same bit size, conversion between them is bitcasting. The only problems arise when you need to do integer<->float bitcasts, which aren't the same as conversions. Hence why I think we should just punt on integer<->float bitcasts. But the From conversions for all the integer vector types seem completely straightforward and they do exactly the obvious thing.
I'll reformulate my previous question: how do you bitcast an i64 to an f64 in today's Rust?
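(For the record, as far as I know the only way today is a transmute:)

use std::mem;

fn bitcast_i64_to_f64(x: i64) -> f64 {
    // No From impl exists for this; it takes unsafe code today.
    unsafe { mem::transmute::<i64, f64>(x) }
}

fn main() {
    // 4611686018427387904 is the bit pattern of 2.0_f64.
    assert_eq!(bitcast_i64_to_f64(4611686018427387904), 2.0);
}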
The thing is that you're basically asking me to copy/paste exactly what's in the slide deck I linked. It describes why variable-length vectors are good, describes the exact programming model supported, has example assembly side-by-side with SIMD, the works.
(And copy/pasting from PDF is a royal pain.)
In essence:
// Sketch only: `extern "rust-intrinsic"` functions and `impl Trait` in
// trait methods are hand-waved here; safe stand-ins mark where the
// real vector intrinsics would go.
trait VectorizablePrimitive: Copy {} // {u,i}{8,16,32,64,size} f{16,32,64}

trait VectorizationOp<T: VectorizablePrimitive> {
    type Output: VectorizablePrimitive;
    unsafe fn perform(&self, input: T) -> Self::Output; // the intrinsic
}

trait VectorizableIterator<T: VectorizablePrimitive>: Iterator<Item = T> {
    unsafe fn vectorize<O: VectorizationOp<T>>(self, op: O)
        -> impl VectorizableIterator<O::Output>;
}

// A gather: for each index in the input stream, load the element at
// that offset from the base pointer.
struct IndexedGather<V: VectorizablePrimitive>(*const V);

impl<V: VectorizablePrimitive> VectorizationOp<usize> for IndexedGather<V> {
    type Output = V;
    unsafe fn perform(&self, index: usize) -> V {
        // A RISC-V indexed vector load goes here.
        *self.0.offset(index as isize)
    }
}

fn main() {
    let x = [3i32, 1, 2, 4, 0];
    let indices = [4usize, 1, 2, 0, 3];
    println!("{:?}", unsafe {
        indices.iter().cloned()
            .vectorize(IndexedGather(x.as_ptr()))
            .collect::<Vec<_>>()
    });
}
// prints "[0, 1, 2, 3, 4]"
There's currently work on adding SVE to LLVM, and I believe the Spike RISC-V emulator has support for the draft V extension (possibly in a branch).
Designing an API for something that can't even be feasibly experimented with isn't something I'm personally capable of doing. I won't be able to lead that effort.
Assuming we ignore (1) and we want to address your concerns, the only reasonable thing to do (as far as I can see) is to say that absolutely zero cross platform API is possible at this time. No cross platform types. Nothing.
I won't call it a conversion. For me, it's only bitcasting or transmute. (But that's bikeshedding on the meaning of conversion, so let's ignore naming.) The fact that on x86 it's just a matter of using a register in a different instruction is a platform implementation detail (e.g. the upcoming Mill architecture treats the number of lanes differently).
(Note: I'm assuming that the i32x4-like types will be cross-platform, not separate per-architecture types. If that's not the case, ignore the rest of this paragraph.) If we look at the SIMD types in the abstract, types such as i32x4 and i8x16 have no more in common than i64 and f64 (which you've mentioned). They just share a size. Therefore it would be surprising for the latter pair to be converted by transmute and the former's transmute to be glorified into a From implementation. I think that the right way to convert between different SIMD types of the same size should be either transmute (or a safe-transmute method, if we want to avoid unsafe) or platform-specific intrinsics.
I do agree. The problem also arises if you implement From for any pair of types with the same number of lanes. Therefore I'd just say that if we implement integer<->integer SIMD transmutes as From (which I think is a bad idea, but you don't have to agree), we shall never implement any From for any pair of types with the same number of lanes, to avoid confusion. Do you agree with that "rule"?
Wait, huh? i8x16 <-> u8x16 and i8x16 <-> i64x2 and i8x16 <-> u64x2 are all just lossless bitcasts. None of those suffer the same problems as, for example, i32x4 <-> f32x4 or u8x16 <-> f32x4. Are you saying otherwise? Could you give an example just so I make sure we're on the same page?
I find @eternaleye's concerns very legitimate. Thanks for bringing them up.
ARM SVE is very much real, and it does seem to me that the future hardware direction is heading toward vector-length-agnostic ISAs. The problem is that it is the future, not the present.
I consider this analogous to the ATC/HKT discussion. While the Associated Type Constructor proposal is very much not adding Higher-Kinded Types to Rust, it is relevant for the ATC design to be forward compatible with HKT. Forward compatibility with desirable future additions should be a consideration in stabilization.
That's why I want cross-platform types: to be forward compatible with a future high-level SIMD API. I strongly believe forward compatibility concerns should be considered. Whether we should make changes for forward compatibility depends, but in the case of SIMD types, I think the cost is light enough for the tradeoff to make sense.
For the same reason, I think we should consider forward compatibility with future vector-length-agnostic ISAs. And whatever we do, there should be discussion of this in the RFC rationale.
I think the limiting factor here is that vector-length-agnostic ISA support is not yet upstream in LLVM. As I understand it, ARM is proposing to extend <4 x f32> to <n x 4 x f32> (the "n" here is literal) to express a vector-length-agnostic value with a minimum of 4 lanes. This is already implemented in ARM's fork, but the design is bound to change in the process of upstreaming. That will take quite some time, and there is no ETA.
Stripmine loops, loop fusion, these are compiler optimizer terms, not programming language terms.
Surely, a modern variable-length vector instruction set designed in this decade would be laser-focused on providing a good optimizer target? Back when Cray was still shipping vector machines, most of their customers just wrote Fortran code. They didn't have to worry about vectorization at all, or at least they didn't have to explicitly vectorize things.
Can't we just write Rust code and let LLVM handle the vectorization?
The reason we need to provide explicit SIMD support is precisely because SIMD auto-vectorization can only reach a tiny amount of the available functionality. We need to provide an almost assembly-like interface because compilers can't figure out how to use these weird instructions.
If these new architectures make the transition from slides to shipping silicon, it is possible that they will fail to deliver on their promises. Then it might be time to design an explicit programming model for them. But let's give them a chance to succeed first.