I agree. But I still think it needs to be stabilized in a crate, not standardized in `std`.
Having `_mm_load_si128` and `_mm_loadu_si128` provided by the crate is good. But being forced to have them cluttering up `std` forever? Please no. I guess that's why in the OP you mentioned the "special rules crate" possibility.
Where did the particular push for this thread come from, BTW? What kind of "in stable" is it looking for if "without stability" is a plausible solution?
Do we know how close we are to being able to write "canonical Rust" implementations of (the vast majority of) intrinsics in nightly Rust? Some easy cases work, like

```rust
use std::mem::transmute;

#[repr(simd)]
pub struct __m128i(u64, u64);

struct i16x8(i16, i16, i16, i16, i16, i16, i16, i16);

#[no_mangle]
pub fn _mm_add_epi16(a: __m128i, b: __m128i) -> __m128i {
    let a: i16x8 = unsafe { transmute(a) };
    let b: i16x8 = unsafe { transmute(b) };
    let c = i16x8(
        a.0.wrapping_add(b.0),
        a.1.wrapping_add(b.1),
        a.2.wrapping_add(b.2),
        a.3.wrapping_add(b.3),
        a.4.wrapping_add(b.4),
        a.5.wrapping_add(b.5),
        a.6.wrapping_add(b.6),
        a.7.wrapping_add(b.7),
    );
    unsafe { transmute(c) }
}
```
This produces the expected LLVM vector instruction (plus bitcasts), resulting in the expected assembly instruction:

```asm
_mm_add_epi16:
    .cfi_startproc
    paddw %xmm1, %xmm0
    retq
```
Note that this only needed unstable for the definition of `__m128i`, not for `i16x8`.
Maybe that's a way to go. Stabilize a `some_ugly_name_for_a_128_bit_bag_type_128_bit_alignment` hidden somewhere (with no methods or anything, but which might even be helpful as `PhantomData` until the alignment RFC stabilizes), with a comment that people should probably not use it directly, but probably want `__m128i` from the `csimd` crate instead.
Then write Rust implementations of all the intrinsics, not unlike the pattern above. Now you can immediately write code using intrinsics, and they are functional, if not performant, on all targets (even ones like JavaScript). (A `csimd_unstable` crate that provides intrinsics for people who want performant output now would be a good thing to have in addition.)
Then there are just "should be better vectorized" bugs to fix. With a particular case in hand, it'd be easier to decide between "oh, we could translate that to LLVM better", "man, LLVM should have noticed that", and "you know what? this case is weird enough that a `std` intrinsic sounds reasonable".
I think there are plenty of interesting things about codegen to be found doing that. For example, I tried this:

```rust
#[no_mangle]
pub fn _mm_adds_epi16(a: __m128i, b: __m128i) -> __m128i {
    let a: i16x8 = unsafe { transmute(a) };
    let b: i16x8 = unsafe { transmute(b) };
    let c = i16x8(
        a.0.saturating_add(b.0),
        a.1.saturating_add(b.1),
        a.2.saturating_add(b.2),
        a.3.saturating_add(b.3),
        a.4.saturating_add(b.4),
        a.5.saturating_add(b.5),
        a.6.saturating_add(b.6),
        a.7.saturating_add(b.7),
    );
    unsafe { transmute(c) }
}
```
After all, `wrapping_add` worked great, so I figured that'd give me `paddsw` no problem. But oh man is it unhappy! There are eight of these:

```llvm
%37 = extractelement <8 x i16> %bc102, i32 5
%38 = extractelement <8 x i16> %bc103, i32 5
%39 = tail call { i16, i1 } @llvm.sadd.with.overflow.i16(i16 %37, i16 %38) #4
%40 = extractvalue { i16, i1 } %39, 0
%41 = extractvalue { i16, i1 } %39, 1
%42 = lshr i16 %38, 15
%43 = add nuw i16 %42, 32767
%_0.0.i80 = select i1 %41, i16 %43, i16 %40
```
It makes sense that that's what rustc needs to do given what's in LLVM, but ouch. No wonder it doesn't auto-vectorize. Maybe we could get an `@llvm.sadd.with.saturation.i16`? That would make rustc's life easier and add more autovectorization opportunities. Or I wonder what would happen if rustc emitted `saturating_add` as building a vector and using an LLVM intrinsic? Is LLVM smart enough to combine vector types if some parts are `undef`?