We want some of the `libcore` functionality, like `memchr` or `Iterator::is_sorted`, to use SIMD when the CPU that the program runs on supports it. For that, `libcore` would need to perform run-time feature detection when it is available, but it currently can't, because run-time feature detection requires `libstd`.
This post reviews “Why is run-time feature detection provided in `libstd` and not `libcore`?” and “Why do we want to use it in `libcore`?”, and proposes one way of solving this problem to kick-start the discussion.
Why is run-time feature detection provided in `libstd`?
Currently, run-time feature detection is provided in `libstd` and not in `libcore` by design. Both the semantics and the implementation of a run-time feature detection API are, in general, platform-specific, while `libcore` is platform-agnostic. For example:
- API-wise, one use of run-time feature detection is to answer the question “Can I use 256-bit wide vector registers?”. The answer depends not only on whether the CPU supports them, but also on whether the OS allows user-space applications to use them (e.g., because this requires saving / restoring the 256-bit wide vector registers on context switches). It would be even more dubious to answer this question from `libcore` inside an OS kernel: the kernel might have set everything up for user-space to use 256-bit wide vectors, yet be unable to use them itself (e.g., because the kernel itself does not handle these registers across kernel ABIs [0-1]).
- Implementation-wise, querying run-time features from user-space must, in general, be done by asking the OS for this information. For example, `aarch64`'s `cpuid` is a privileged instruction, so one can't execute it from user-space. A `no_std` Rust program using `libcore` that runs in privileged mode could use `cpuid` to do run-time feature detection, but should it? It depends: is there an OS in charge of this? If so, it should probably use the OS facilities instead.
These points are the main reason why we currently only provide run-time feature detection in `libstd`. When `libstd` is linked, we know which OS is being targeted, what the OS can and cannot do, and how to answer questions about the availability of run-time features.
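For reference, this is roughly what using `libstd`'s facility looks like from user code on x86_64, via the stable `is_x86_feature_detected!` macro. The `sum_avx2`/`sum_scalar` helpers are made up for the example, and the snippet only compiles on x86_64:

```rust
// Run-time dispatch with libstd on x86_64 (sketch). The macro asks the
// OS / CPU whether AVX2 is actually usable, and the AVX2-enabled function
// is only called after that check succeeds.
fn sum(xs: &[u8]) -> u64 {
    if is_x86_feature_detected!("avx2") {
        // Sound only because we just verified AVX2 support at run-time.
        unsafe { sum_avx2(xs) }
    } else {
        sum_scalar(xs)
    }
}

// Compiled with AVX2 enabled regardless of the crate's baseline features,
// so it may only be called once AVX2 has been detected.
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u8]) -> u64 {
    // With AVX2 enabled the compiler is free to auto-vectorize this loop.
    xs.iter().map(|&x| u64::from(x)).sum()
}

fn sum_scalar(xs: &[u8]) -> u64 {
    xs.iter().map(|&x| u64::from(x)).sum()
}

fn main() {
    println!("{}", sum(&[1, 2, 3, 4]));
}
```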
The status quo is that if somebody wants to do run-time feature detection in `#[no_std]`, they should implement that functionality themselves in whichever way makes sense for their application. There are already many crates that help with this, and I am still convinced that this was, and still is, the right way to go.
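As a rough sketch of what “do it yourself” can look like: on x86_64 the `cpuid` instruction is not privileged, so a `#[no_std]` crate can query it directly through the intrinsics that `libcore` already exposes. The leaf/bit numbers below follow Intel's documentation, but a robust check would also verify the maximum supported CPUID leaf and, via `XGETBV`, that the OS actually saves the YMM registers:

```rust
#![no_std]
// Hand-rolled AVX2 detection for a #[no_std] x86_64 crate (sketch).
// CPUID leaf 7, sub-leaf 0, EBX bit 5 reports AVX2 support. This omits the
// check of the maximum supported leaf and of OS support for the YMM state
// (OSXSAVE / XGETBV), which real code should perform as well.
use core::arch::x86_64::__cpuid_count;

pub fn avx2_detected() -> bool {
    // `__cpuid_count` is an architecture intrinsic and therefore unsafe to
    // call, but it does not fault on ordinary x86_64 CPUs.
    let leaf7 = unsafe { __cpuid_count(7, 0) };
    (leaf7.ebx >> 5) & 1 == 1
}
```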
Why do we want to use this in `libcore`?
A significant part of the `libstd` functionality is actually implemented in `libcore` and just re-exported by `libstd`. This functionality cannot currently use run-time feature detection. In some cases we could re-implement it in `libstd` on top of the run-time-detected features, but in most cases we can't.
For example, we could try to implement `memchr` twice, once in `libcore` and once in `libstd`, where the `libstd` implementation uses run-time feature detection and the `libcore` implementation does not. This is a pretty bad solution, however, because there are other `libcore` APIs that are re-exported by `libstd` and also use `memchr`, and those APIs would not pick up the `libstd` implementation.
We could then try to re-implement everything that uses `memchr` in `libstd`, except that this doesn't work either: one thing that potentially uses `memchr` is `Iterator::find`, and the `Iterator` trait cannot be re-implemented in `libstd`, since it has to be the exact same trait as the one from `libcore`. This problem also affects `Iterator::is_sorted`.
So what could we currently easily do? The status quo is that we do compile-time feature detection in `libcore` and use the best implementation possible for the compile-time target features, which then gets re-exported by `libstd`. If someone wants a better implementation, they can re-compile `core` and `std` using `xargo`, e.g., with AVX2 enabled, or they can use a crate from crates.io. This is also pretty bad, since, for example, the `memchr` crate would not only need to provide a better `memchr` implementation, but also better implementations of everything in `libcore` and `libstd` that uses `memchr`…
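Concretely, compile-time detection means something along these lines (a simplified sketch, not the actual `libcore` source; `memchr_avx2` is a made-up stand-in): the branch is fixed when `libcore` itself is built, so a stock x86_64 build can never take the AVX2 path, no matter what CPU the program ends up running on.

```rust
// Compile-time feature selection (sketch). `cfg!(target_feature = "avx2")`
// expands to a constant `true`/`false` when this code is compiled, so the
// choice is baked into the libcore binary and cannot react to the CPU the
// program eventually runs on.
pub fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    if cfg!(target_feature = "avx2") {
        // Only ever taken if libcore itself was built with
        // `-C target-feature=+avx2`, e.g. by recompiling it with xargo.
        memchr_avx2(needle, haystack)
    } else {
        // Portable fallback.
        haystack.iter().position(|&b| b == needle)
    }
}

// Stand-in for a real AVX2 implementation; it just reuses the scalar scan
// so that this sketch compiles on any target.
fn memchr_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```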
How bad is this, you may ask? For `is_sorted` on `&[u8]` slices, using AVX2 when available delivers a 4-5x speed-up over SSE2. So it can be pretty bad, depending on the functionality involved and the particular application. The current status quo is a very limited place to be in: we have all the new SIMD functionality in `core::arch`, but nothing beyond SSE2 can actually be put to use in `libcore`/`libstd` itself.
What could we do about this?
The only idea I have is to expose run-time feature detection in `libcore` via lang items that users can optionally define. When they are not defined, run-time feature detection would just always report that no features are available. When `libstd` is linked, the `libstd` functionality would be used. When `libstd` is not linked and the user defines them, they do whatever the user wants.
That way we could just put `memchr` and `Iterator::is_sorted` in `libcore`, make use of run-time feature detection inside them, and in the extremely common case in which `libstd` is also linked, everybody would benefit from all the CPU features available.
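To make the mechanism a little more concrete, here is a stand-alone sketch of the “optionally defined hook” idea, using an ordinary `extern` declaration plus a `#[no_mangle]` definition. Nothing like this exists today and every name in it is invented; in particular, the built-in “no features available” default that the proposal needs when no definition is linked in cannot be expressed on stable Rust, so the sketch simply provides a trivial definition in the same program:

```rust
// Purely hypothetical sketch of the proposed hook; no such lang item
// exists. The shape mirrors existing overridable items such as
// `#[panic_handler]` and `#[global_allocator]`.

// What libcore would contain: a declaration it can call without knowing who
// provides the body. In the real proposal this would be a lang item with a
// built-in default that always answers "not available".
extern "Rust" {
    fn __example_feature_detected(feature: u32) -> bool;
}

// Made-up feature identifier for the example.
const FEATURE_AVX2: u32 = 0;

// What a libcore routine such as `memchr` could then do internally.
fn pick_implementation() -> &'static str {
    if unsafe { __example_feature_detected(FEATURE_AVX2) } {
        "AVX2 path"
    } else {
        "portable path"
    }
}

// What libstd (or a #[no_std] user) would provide: the platform's actual
// detection logic. This trivial definition answers "no features", which is
// what the proposal's built-in default would do when nothing else is linked.
#[no_mangle]
pub fn __example_feature_detected(_feature: u32) -> bool {
    false
}

fn main() {
    println!("{}", pick_implementation());
}
```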