We want some of the libcore functionality, like Iterator::is_sorted, to use SIMD when the CPU that the program runs on supports it. However, for this we would need to use run-time feature detection in libcore when it’s available, but we currently can’t because run-time feature detection requires libstd.
This post reviews “Why is run-time feature detection provided in libstd and not libcore?” and “Why do we want to use it in libcore?”, and proposes one way of solving this problem to kickstart the discussion.
Why is run-time feature detection provided in libstd?
Currently, run-time feature detection is provided in libstd and not libcore by design. The semantics and implementation of a run-time feature detection API are, in general, platform-specific, but libcore is platform-agnostic. For example:
API-wise, one use of run-time feature detection is to answer the question “Can I use 256-bit wide vector registers?”. The answer depends, however, on whether the CPU supports them and on whether the OS has enabled user-space applications to use them (e.g., since this requires saving / restoring 256-bit wide vector registers on context switches). Even more dubious would be trying to answer this question from libcore inside an OS kernel, where the kernel might have set everything up for user-space to use 256-bit wide vectors, but might not be able to use them itself (e.g., because the kernel itself does not handle these registers across kernel ABIs).
Implementation-wise, querying run-time features in user-space must, in general, be done by querying the OS for this information. For example, cpuid is a privileged instruction, so one can’t execute it from user-space. However, a no_std Rust program using libcore running in privileged mode could use cpuid to do run-time feature detection, but should it? It depends: is there an OS in charge of this? If so, it should probably use the OS facilities instead.
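To make the first point concrete, here is a minimal sketch of what the libstd side looks like today on x86, using the existing std macro is_x86_feature_detected!. The wrapper name can_use_avx2 is made up for this illustration, and the non-x86 branch simply reports that the feature is unavailable.

```rust
// Hypothetical helper (not part of std): answers "can this process use
// 256-bit AVX2 code?" by asking libstd's run-time feature detection.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn can_use_avx2() -> bool {
    // The macro only exists on x86 targets and lives in libstd, not libcore.
    std::is_x86_feature_detected!("avx2")
}

// On every other architecture this sketch conservatively answers "no".
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn can_use_avx2() -> bool {
    false
}
```

Nothing equivalent can be written against libcore alone, which is the limitation discussed in this post.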
These points are mainly why we currently only provide run-time feature detection in libstd. When libstd is linked, we know which OS is being targeted, what the OS can / cannot do, and how to answer questions related to the availability of run-time features.
The status quo is that if somebody wants to do run-time feature detection in #[no_std], then they should implement that functionality themselves in whichever way makes sense for their application. There are already many crates that help with this, and I am convinced that this was and still is the right way to go.
Why do we want to use this in libcore?
A significant part of the libstd functionality is actually implemented in libcore and just re-exported by libstd. This functionality cannot currently use run-time feature detection. In some cases, we could just re-implement the functionality in libstd using extra features. However, we can’t do this in most cases.
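As a rough sketch of what "re-implementing the functionality in libstd using extra features" could mean, the following shows a byte search with a portable fallback and a variant compiled with AVX2 enabled, selected at run time. The function names are invented for this example, and the AVX2 variant is just the same loop compiled with the feature enabled, not a hand-written SIMD kernel.

```rust
/// Portable fallback: the kind of code libcore can always provide.
fn memchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Same loop, but compiled with AVX2 enabled so the compiler is allowed
/// to use 256-bit instructions. Calling it on a CPU without AVX2 is
/// undefined behaviour, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn memchr_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Hypothetical libstd-side entry point: picks an implementation at run time.
pub fn memchr_dispatch(needle: u8, haystack: &[u8]) -> Option<usize> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at run time.
            return unsafe { memchr_avx2(needle, haystack) };
        }
    }
    memchr_fallback(needle, haystack)
}
```

Only callers that go through memchr_dispatch benefit; anything that calls the fallback directly keeps the slower path.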
For example, we could try to implement memchr twice, once in libcore and once in libstd, where the libstd implementation uses run-time feature detection but the libcore implementation does not. However, this is a pretty bad solution because there are other libcore APIs that get re-exported by libstd that also use memchr, and these APIs wouldn’t be using the implementation of libstd.
We could then try to re-implement everything that uses memchr in libstd, except that this doesn’t work either, because one thing that potentially uses memchr is Iterator::find, but the Iterator trait cannot be re-implemented in libstd since it has to be the exact same trait as the one from libcore. This problem also affects
So what could we currently easily do? The status quo is that we do compile-time feature detection in libcore and use the best implementation possible, which gets re-exported by libstd. If someone wants to use a better implementation, they can re-compile libcore with xargo, e.g., with AVX2 enabled, or they can use a crate from crates.io. This is also pretty bad, since, for example, the memchr crate wouldn’t only need to provide a better memchr implementation, but also better implementations of everything in libstd that uses memchr.
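For comparison, compile-time feature detection can be sketched with the cfg! macro and the target_feature configuration option, both of which work without libstd. The function name selected_impl is hypothetical; it only reports which branch the compiler kept.

```rust
/// Hypothetical helper: reports which implementation was selected when
/// this crate was compiled. The choice is fixed at compile time and never
/// looks at the CPU the program actually runs on.
pub fn selected_impl() -> &'static str {
    if cfg!(target_feature = "avx2") {
        // Only true when built with AVX2 enabled, e.g. via
        // `-C target-feature=+avx2` or `-C target-cpu=native`.
        "avx2"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        "scalar"
    }
}
```

A binary built for a baseline target keeps the baseline branch even when it later runs on a CPU with AVX2, which is the gap that run-time detection is meant to close.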
How bad is this, you may ask? For &[u8] slices, using AVX2 when available delivers a 4-5x speed-up over SSE2. So it can be pretty bad depending on the functionality involved and the particular application. The status quo is a very limiting place to be in. We have all the new SIMD functionality in core::arch, but everything beyond SSE2 is not worth using in libcore.
What could we do about this?
The only idea I have is to expose run-time feature detection in libcore via lang-items that users can optionally define. When they are not defined, run-time feature detection would just always report that no features are available. When libstd is linked, the functionality of libstd would be used. When libstd is not linked and the user defines them, they do whatever the user wants.
That way we could just put
libcore, make use of run-time feature detection inside them, and in the extremely common case in which libstd is also linked, everybody would benefit from all the CPU features available.
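The shape of this idea can be approximated in today's Rust without real lang-items. The sketch below is only an emulation under stated assumptions: the names Detector, set_detector, and is_feature_detected are invented, the hook is registered at run time through an atomic pointer, and an actual lang-item would instead be resolved when the program is built. It uses only core paths, so it does not depend on libstd.

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical signature of the user-definable detection hook.
pub type Detector = fn(&str) -> bool;

/// Null means "no hook defined".
static DETECTOR: AtomicPtr<()> = AtomicPtr::new(core::ptr::null_mut());

/// What libstd (or a no_std user) would call to provide detection.
pub fn set_detector(f: Detector) {
    DETECTOR.store(f as *mut (), Ordering::Release);
}

/// What code inside libcore would call before taking a SIMD path.
pub fn is_feature_detected(name: &str) -> bool {
    let p = DETECTOR.load(Ordering::Acquire);
    if p.is_null() {
        // No hook defined: report that no features are available.
        return false;
    }
    // SAFETY: the only non-null values ever stored come from
    // `set_detector`, which stores a valid `Detector` function pointer.
    let f: Detector = unsafe { core::mem::transmute::<*mut (), Detector>(p) };
    f(name)
}
```

With no hook registered, every query returns false, matching the "not defined" case above; once libstd or the user registers a hook, the same call sites start seeing the real answer.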