(Oops, it looks like I pasted the wrong code in my previous post. That Rust code should have been f32s, not i32s. I won’t edit it, since editing seems to do weird things to timestamps here.)
Absolutely.
As you suggested, that one was easy to produce.
This next one, which I’m not surprised didn’t work, does use SSE instructions, but it’s four scalar mulss instructions, not a single mulps:
pub fn my_dot_direct(a: f32x4, b: f32x4) -> f32 {
    let a: [f32; 4] = a.into();
    let b: [f32; 4] = b.into();
    a[0] * b[0] +
    a[1] * b[1] +
    a[2] * b[2] +
    a[3] * b[3]
}
And I guess the inliner is one of the passes that runs before the SLP vectorizer. I was hoping this would inline the vector fmul from my_mul and keep it intact, but nope: no fmul <4 x float> in the LLVM IR, no mulps in the assembly.
pub fn my_dot_usingmul(a: f32x4, b: f32x4) -> f32 {
    let c: [f32; 4] = my_mul(a, b).into();
    c[0] + c[1] + c[2] + c[3]
}
But then a tiny change makes it happy again. This one actually generates rather nice-looking vector operations even for the additions:
pub fn my_dot_usingmul_balanced(a: f32x4, b: f32x4) -> f32 {
    let c: [f32; 4] = my_mul(a, b).into();
    (c[0] + c[1]) + (c[2] + c[3])
}
(Well, except for the fact that apparently it didn’t notice that %2 and %4 are exactly the same thing, which I thought SSA was really good at.)
%2 = fmul <4 x float> %0, %1
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 2>
%4 = fmul <4 x float> %0, %1
%5 = shufflevector <4 x float> %4, <4 x float> undef, <2 x i32> <i32 1, i32 3>
%6 = fadd <2 x float> %3, %5
%7 = extractelement <2 x float> %6, i32 0
%8 = extractelement <2 x float> %6, i32 1
%9 = fadd float %7, %8
ret float %9
Yay, non-associative floating point math!
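To make the non-associativity concrete: here’s a small stable-Rust demonstration (plain scalars, no f32x4 needed) that the two associations the compiler is choosing between really can give different answers, which is exactly why LLVM won’t re-associate the adds without fast-math flags:

```rust
// Strict source order: a left-to-right chain of fadds, which is what
// `c[0] + c[1] + c[2] + c[3]` means and what the vectorizer must preserve.
fn sum_chain(c: [f32; 4]) -> f32 {
    ((c[0] + c[1]) + c[2]) + c[3]
}

// Balanced tree, as in the "happy" version: pairs first, then the pair sums.
// This shape maps nicely onto shuffles plus vector fadds.
fn sum_tree(c: [f32; 4]) -> f32 {
    (c[0] + c[1]) + (c[2] + c[3])
}

fn main() {
    // 1.0e8 + 1.0 rounds back to 1.0e8 in f32 (the ulp at 1e8 is 8),
    // so where the 1.0s get absorbed depends on the association.
    let c = [1.0e8f32, 1.0, -1.0e8, 1.0];
    println!("chain = {}, tree = {}", sum_chain(c), sum_tree(c));
    // chain = 1, tree = 0: same inputs, different results.
}
```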
Makes me want a bunch of sum_unspecified_order() methods (on simd, slices, …), but that’s definitely a post-step-4 bikeshed.
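For the bikeshed’s sake, here’s one possible shape such a method could take — the name and signature are entirely made up, nothing like this exists in std. A pairwise (tree) reduction both gives the compiler license to use whatever association vectorizes best and happens to have better worst-case rounding error than a left-to-right fold:

```rust
// Hypothetical sketch of a sum_unspecified_order() for slices: recursive
// pairwise reduction. The association is a balanced tree rather than a
// chain, so no particular evaluation order is promised to the caller.
fn sum_unspecified_order(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            sum_unspecified_order(lo) + sum_unspecified_order(hi)
        }
    }
}

fn main() {
    println!("{}", sum_unspecified_order(&[1.0, 2.0, 3.0, 4.0]));
}
```

A real version would presumably bottom out into SIMD chunks rather than recurse all the way to single elements, but the point is just that the signature promises a sum without promising an order.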
(Hmm, looking at LLVM’s fast-math flags makes me want to go make an nnan_f32 type that lowers like that all the way down to LLVM, and would actually be Ord & Eq. But that’s a distraction for the future…)