Is there a fundamental reason why we shouldn't omit 512bit?
To me the fundamental reason is that AVX-512 is worse than AVX/AVX2 with respect to the "explosion" of partially untyped vector types/intrinsics. So things that might be only "barely" worth it design-wise for dealing with AVX2 might become clearly worth it for dealing with AVX-512 (rustc already has a DSL for generating intrinsics for AVX2...). An RFC for SIMD would need to convince me that the proposed architecture scales to AVX-512 without issues. AVX-512 also exposes more shuffle intrinsics that might make some generic shuffle functionality worth it (just like for LLVM), and it also might make some generic prefetching functionality worth it (again just like for LLVM). Things like imprecise division might mean that `Div` is less straightforward to do for AVX-512.
Or it might not. It might well be that the proposed approach can deal with AVX-512 just fine. But I don't know, and any RFC that wants to convince me would at least need to show how these things would be done for AVX-512: FMA, shuffles, prefetching, imprecise division, ...
IMO the best we can do given all of the above is to at least attempt to support AVX-512 from the very beginning. We don't need to include it in the RFC, but we should definitely show, with an implementation, that it will be possible to add it in a future RFC without major hassles. Whether we gain anything by then omitting AVX-512 from the initial RFC, I don't know, but I don't think so.
The key motivating reason for providing some niceties is to make it possible for crates like simd to exist on stable Rust without doing a monumental amount of work.
What monumental amount of work exactly? I guess they would need to wrap all vector types in newtypes to be able to implement `Add` for them, but I also guess that these crates will want to do that anyway to support multiple architectures, software fallbacks, ...
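To make the newtype point concrete, here is a rough sketch of what such a wrapper could look like. The name `F32x4` and its methods are made up for illustration and are not taken from the simd crate or from any proposal in this thread. It assumes the x86_64 SSE intrinsics as exposed today in `std::arch`, and uses a plain scalar loop as the software fallback on every other target.

```rust
use std::ops::Add;

/// Hypothetical portable 4-lane f32 vector (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4([f32; 4]);

impl F32x4 {
    pub fn new(a: f32, b: f32, c: f32, d: f32) -> Self {
        F32x4([a, b, c, d])
    }

    pub fn to_array(self) -> [f32; 4] {
        self.0
    }
}

// Hardware path: SSE is part of the x86_64 baseline, so no runtime
// feature detection is needed here.
#[cfg(target_arch = "x86_64")]
impl Add for F32x4 {
    type Output = F32x4;

    fn add(self, rhs: F32x4) -> F32x4 {
        use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

        let mut out = [0.0f32; 4];
        // SAFETY: the unaligned load/store touch exactly four f32 values,
        // which is the size of each backing array.
        unsafe {
            let a = _mm_loadu_ps(self.0.as_ptr());
            let b = _mm_loadu_ps(rhs.0.as_ptr());
            _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(a, b));
        }
        F32x4(out)
    }
}

// Software fallback: lane-wise scalar addition on all other targets.
#[cfg(not(target_arch = "x86_64"))]
impl Add for F32x4 {
    type Output = F32x4;

    fn add(self, rhs: F32x4) -> F32x4 {
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = self.0[i] + rhs.0[i];
        }
        F32x4(out)
    }
}
```

With this, `F32x4::new(1.0, 2.0, 3.0, 4.0) + F32x4::new(10.0, 20.0, 30.0, 40.0)` goes through the intrinsic on x86_64 and through the scalar loop elsewhere. The wrapper is needed for the per-architecture dispatch regardless of whether std implements `Add` on the raw vector types.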
We have very little experience building higher-level SIMD APIs. I would rather focus first on a truly minimal lower-level SIMD API that lets us use all the hardware (at least all that LLVM/Clang supports), and then, as higher-level APIs are implemented on top of it and we gain experience, decide whether there are some niceties worth adding to std.
I think this discussion is long because 1) I had a lot of misconceptions at the start and needed to be educated and 2) a lot of folks kept insisting on doing more stuff initially instead of choosing to start as small as reasonably possible.
How about I write a pre-RFC first and then we can decide whether this is worth doing? I will start a new thread for that.
The only reason why the pre-RFC doesn't exist right now is because I haven't written it yet. Give me some time. There will have to be significant portions devoted to motivating this design, explaining SIMD and why we've somewhat abandoned our previous SIMD path to stabilization.
I don't know. I think it would be enough for you to let us know when you have an implementation with the "fundamental architecture" and a bunch of intrinsics, but by no means complete (not even full SSE2 support). For the RFC, a "totally-incomplete-pre-RFC" without design/motivation/SIMD explanation/... but with the "fundamental architecture/pieces" of the proposal sketched is also enough.
We can help implement the rest of the intrinsics (at least the ones that Clang supports), and as they get implemented new issues will arise, which we can then file and discuss over there. Once the implementation is "feature complete", writing an RFC for it will be easier.
You can, of course, do all the work yourself, and when you are finished, show the pre-RFC and the implementation. But then I think it is kind of pointless to keep discussing here much until that is the case. And I also think that might be risky, since a lot of people here from different backgrounds could offer significant feedback during the implementation/prototyping that could save you a lot of work.