Hi - I'm playing with optimised sort routines. I have compared the code generated from a C++ and a Rust implementation of the same 32-element sorting network. The C++ version benchmarks about a third faster than the Rust version (approx. 70ns vs 100ns). The assembly for the C++ uses vectorised ops, whereas the Rust does not.
The main difference that I can see is that the C++ code 'cheats' by providing a hand-rolled implementation of min/max. Rustc ends up producing scalar code that looks pretty similar to what the hand-rolled min/max compiles to, but in the Rust case it doesn't then go on to vectorise it.
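To make the pattern concrete, here is a minimal sketch of the kind of hand-rolled min/max compare-exchange I mean, written in Rust. It assumes `i32` elements and uses a 4-element network for brevity; the names `min_i32`, `max_i32`, `compare_exchange` and `sort4` are illustrative only and this is not the benchmarked code.

```rust
/// Hand-rolled min: a plain select the compiler can lower to a
/// conditional move or a vector min (illustrative helper).
#[inline(always)]
fn min_i32(a: i32, b: i32) -> i32 {
    if a < b { a } else { b }
}

/// Hand-rolled max, the mirror image of `min_i32`.
#[inline(always)]
fn max_i32(a: i32, b: i32) -> i32 {
    if a < b { b } else { a }
}

/// One comparator of the network: afterwards v[i] <= v[j].
#[inline(always)]
fn compare_exchange(v: &mut [i32; 4], i: usize, j: usize) {
    let lo = min_i32(v[i], v[j]);
    let hi = max_i32(v[i], v[j]);
    v[i] = lo;
    v[j] = hi;
}

/// Standard 5-comparator sorting network for 4 elements.
/// The sequence of comparators is fixed and data-independent.
pub fn sort4(v: &mut [i32; 4]) {
    compare_exchange(v, 0, 1);
    compare_exchange(v, 2, 3);
    compare_exchange(v, 0, 2);
    compare_exchange(v, 1, 3);
    compare_exchange(v, 1, 2);
}
```

The 32-element version follows the same shape with more comparators. Comparators within one layer (for example the first two above) touch disjoint elements, which is what makes a vectorised min/max possible in principle.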
My target is an i7-8700 (Skylake), compiled with -msse4.2.
So my question is twofold. Why is the C++ code faster? And why is rustc not able to vectorise the Rust code in the same way?