There are a lot of details to work out. Suppose you’re writing `impl Mul for i32x4`. It would go something like this:
- If MIPS MSA is available, use `__msa_mulv_w()`.
- If ARM NEON is available, use `vmulq_s32()`.
- If SSE 4.1 is available, use `_mm_mullo_epi32()`.
- If SSE2 is available, try to cobble something together out of 16-bit multiplications using `_mm_mulhi_epi16()` and `_mm_mullo_epi16()`, or maybe `_mm_mul_epu32()` combined with some shuffling.
- Otherwise, expand into a lane-wise scalar multiplication.
In particular, trying to construct an `i32x4` multiplication out of existing SSE2 intrinsics requires some work and knowledge. Work and knowledge that has already been put into LLVM.
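To give a feel for what the SSE2 branch alone involves, here is a minimal sketch of the `_mm_mul_epu32()`-plus-shuffling approach, with the lane-wise scalar fallback next to it. The `I32x4` wrapper and the helper names `mul_scalar` and `mul_sse2` are hypothetical stand-ins for illustration, not any real crate's API, and the sketch assumes wrapping (overflow-ignoring) semantics for each lane. The MSA, NEON, and SSE 4.1 branches are left out.

```rust
use std::ops::Mul;

/// Hypothetical stand-in for a 4-lane i32 vector type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x4(pub [i32; 4]);

/// Lane-wise scalar fallback: wrapping multiply of each lane.
pub fn mul_scalar(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [
        a[0].wrapping_mul(b[0]),
        a[1].wrapping_mul(b[1]),
        a[2].wrapping_mul(b[2]),
        a[3].wrapping_mul(b[3]),
    ]
}

/// SSE2-only multiply built from two 32x32 -> 64-bit unsigned multiplies.
/// The low 32 bits of a product do not depend on signedness, so the
/// unsigned intrinsic still yields the correct wrapping i32 result.
#[cfg(target_arch = "x86_64")]
pub fn mul_sse2(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    use core::arch::x86_64::*;
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned
    // load/store intrinsics only need 16 readable/writable bytes.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // 64-bit products of lanes 0 and 2.
        let even = _mm_mul_epu32(va, vb);
        // Shift lanes 1 and 3 down into positions 0 and 2, then multiply.
        let odd = _mm_mul_epu32(_mm_srli_si128::<4>(va), _mm_srli_si128::<4>(vb));
        // Move the low 32 bits of each 64-bit product into elements 0 and 1.
        let even_lo = _mm_shuffle_epi32::<0b00_00_10_00>(even);
        let odd_lo = _mm_shuffle_epi32::<0b00_00_10_00>(odd);
        // Interleave back into lane order: [p0, p1, p2, p3].
        let packed = _mm_unpacklo_epi32(even_lo, odd_lo);
        let mut out = [0i32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, packed);
        out
    }
}

#[cfg(target_arch = "x86_64")]
impl Mul for I32x4 {
    type Output = I32x4;
    fn mul(self, rhs: I32x4) -> I32x4 {
        I32x4(mul_sse2(self.0, rhs.0))
    }
}

#[cfg(not(target_arch = "x86_64"))]
impl Mul for I32x4 {
    type Output = I32x4;
    fn mul(self, rhs: I32x4) -> I32x4 {
        I32x4(mul_scalar(self.0, rhs.0))
    }
}
```

With this in place, `I32x4([1, -2, 3, 4]) * I32x4([5, 6, -7, 8])` should agree lane for lane with `mul_scalar` on the same inputs, which is also the obvious way to unit test it. That is five intrinsics and two magic shuffle constants for a single operator on a single type.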
This is just one operation for one type. There are about 150 of those to go through. Then you would need to write individual unit tests for all of them, since you’re guaranteed to have picked the wrong intrinsic by mistake at least once. Then find a MIPS machine to run your unit tests on. No, not that one. One with SIMD instructions available.