Vectorisation for min/max sorting

Hi - I'm playing with optimised sort routines. I have compared the code generated from some C++ and Rust that both compute a 32-element sorting network. The C++ code benchmarks about 1/3 faster than the Rust code (approx. 70ns vs 100ns). The assembly for the C++ uses vectorised ops, whereas the Rust does not.

c++: rust:

The main difference that I can see is that the C++ code 'cheats' by providing a hand-rolled implementation of min/max. Rustc ends up producing something that ultimately looks pretty similar to the hand-rolled assembly, but in the Rust case it doesn't then go on to vectorise it.
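For readers without the original listings, here is a minimal sketch of the kind of min/max compare-exchange step a sorting network is built from. It is not the benchmarked code: the names `compare_exchange` and `sort4`, the `i32` element type, and the 4-element size (rather than 32) are all assumptions made for illustration.

```rust
/// Hypothetical compare-exchange step: after the call, the smaller of the
/// two values is at index `i` and the larger is at index `j`.
/// The `i32` element type is an assumption, not taken from the original code.
#[inline(always)]
fn compare_exchange(v: &mut [i32; 4], i: usize, j: usize) {
    let (a, b) = (v[i], v[j]);
    v[i] = a.min(b); // branch-free at the source level: no `if a > b { swap }`
    v[j] = a.max(b);
}

/// Sorts four elements with a fixed 5-comparator network.
/// A 32-element network has the same shape, just with many more comparators.
pub fn sort4(v: &mut [i32; 4]) {
    compare_exchange(v, 0, 1);
    compare_exchange(v, 2, 3);
    compare_exchange(v, 0, 2);
    compare_exchange(v, 1, 3);
    compare_exchange(v, 1, 2);
}
```

Because the comparator sequence is fixed and each step is just a min and a max, independent comparators are candidates for packed min/max instructions. Whether the compiler actually vectorises them depends on the compiler version and target flags.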

My target is i7-8700, skylake, -msse4.2.

So my question is two-fold: why is the C++ code faster, and why is rustc not able to achieve this?

The Users forum would be the better forum for questions like this. There are loads of other "why is this code slower?" threads there. Of course, the obligatory first question is "Did you build with --release?", so either retry with it or state in your new post that you did use it.


To extend the above reply, this forum is for discussion of proposed changes to the Rust language and its compiler and related tooling. Queries such as yours belong in the Users forum, whose URL is in the prior post.