Using run-time feature detection in core


#1

We want some libcore functionality, like memchr or Iterator::is_sorted, to use SIMD when the CPU that the program runs on supports it. For this we would need to use run-time feature detection in libcore when it’s available, but we currently can’t, because run-time feature detection requires libstd.

This post reviews “Why is run-time feature detection provided in libstd and not libcore?” and “Why do we want to use it in libcore?”, and proposes one way of solving this problem to kickstart the discussion.

Why is run-time feature detection provided in libstd?

Currently, run-time feature detection is provided in libstd and not libcore by design. Run-time feature detection API semantics and implementation are, in general, platform-specific, but libcore is platform agnostic. For example:

  • API-wise, one use of run-time feature detection is to answer the question “Can I use 256-bit wide vector registers?”. The answer depends on whether the CPU supports them, and on whether the OS has enabled user-space applications to use them (e.g., since this requires saving / restoring 256-bit wide vector registers on context switches). Even more dubious would be trying to answer this question from libcore inside an OS kernel, where the kernel might have set everything up for user-space to use 256-bit wide vectors, but might not be able to use them itself (e.g., because the kernel itself does not handle these registers across kernel ABIs).

  • Implementation-wise, querying run-time features in user-space must, in general, be done by querying the OS for this information. For example, aarch64's cpuid is a privileged instruction, so one can’t execute it from user-space. A no_std Rust program using libcore running in privileged mode could use cpuid to do run-time feature detection, but should it? It depends: is there an OS in charge of this? If so, it should probably use the OS facilities instead.

These points are mainly why we currently only provide run-time feature detection in libstd. When libstd is linked, we know which OS is being targeted, what the OS can / cannot do, and how to answer questions related to the availability of run-time features.
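For reference, here is a minimal sketch of what run-time feature detection looks like on the libstd side today, using the stable `std::is_x86_feature_detected!` macro. The function names (`is_sorted_scalar`, `is_sorted_avx2`, `is_sorted`) are made up for illustration, and the AVX2 variant simply relies on the compiler being allowed to vectorize with AVX2 rather than on hand-written intrinsics.

```rust
/// Portable fallback: true if every adjacent pair is in non-decreasing order.
fn is_sorted_scalar(s: &[u8]) -> bool {
    s.windows(2).all(|w| w[0] <= w[1])
}

/// Same logic, but compiled with AVX2 enabled for this one function, so the
/// compiler may use 256-bit vector instructions here (hypothetical helper).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn is_sorted_avx2(s: &[u8]) -> bool {
    s.windows(2).all(|w| w[0] <= w[1])
}

/// Picks an implementation at run time. This only works because libstd is
/// linked: the detection macro lives in std, not in core.
pub fn is_sorted(s: &[u8]) -> bool {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked that the CPU and OS support AVX2.
            return unsafe { is_sorted_avx2(s) };
        }
    }
    is_sorted_scalar(s)
}
```

Calling `is_sorted(&[1, 2, 2, 3])` takes the AVX2 path on machines that support it and the scalar path everywhere else; the result is the same either way.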

The status quo is that if somebody wants to do run-time feature detection in #[no_std], then they should implement that functionality themselves in whichever way makes sense for their application. There are already many crates that help with this, and I am convinced that this was, and still is, the right way to go.

Why do we want to use this in libcore?

A significant part of the libstd functionality is actually implemented in libcore and just re-exported by libstd. This functionality cannot currently use run-time feature detection. In some cases, we could just re-implement the functionality in libstd using extra features. However, we can’t do this in most cases.

For example, we could try to implement memchr twice, once in libcore and once in libstd, where the libstd implementation uses run-time feature detection but the libcore implementation does not. However, this is a pretty bad solution, because there are other libcore APIs re-exported by libstd that also use memchr, and these APIs would still call the libcore implementation rather than the libstd one.

We could then try to re-implement everything that uses memchr in libstd, except that this doesn’t work either, because one thing that potentially uses memchr is Iterator::find, but the Iterator trait cannot be re-implemented in libstd since it has to be the exact same trait as the one from libcore. This problem also affects Iterator::is_sorted.

So what could we currently easily do? The status quo is that we do compile-time feature detection in libcore and use the best implementation available at compile time, which gets re-exported by libstd. If someone wants to use a better implementation, they can re-compile core and std using xargo, e.g., with AVX2 enabled, or they can use a crate from crates.io. This is also pretty bad, since, for example, the memchr crate wouldn’t only need to provide a better memchr implementation, but also better implementations of everything in libcore and libstd that uses memchr.
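To make the compile-time approach concrete, here is a minimal sketch of selecting an implementation with `cfg(target_feature)`. The names `find_byte`, `find_byte_impl` and `built_with_avx2` are hypothetical and are not the actual libcore code; both bodies are deliberately identical, since the point is only which one gets compiled in.

```rust
/// Reports whether this build had AVX2 enabled at compile time,
/// e.g. via `-C target-feature=+avx2` or a suitable `-C target-cpu`.
pub fn built_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// Variant that only exists in builds where AVX2 is statically enabled.
/// A real implementation could use 256-bit intrinsics here.
#[cfg(target_feature = "avx2")]
fn find_byte_impl(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Baseline variant used by every other build, including the
/// precompiled core/std that most users get.
#[cfg(not(target_feature = "avx2"))]
fn find_byte_impl(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// The choice is fixed when the crate is compiled; the CPU the program
/// actually runs on is never consulted.
pub fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_byte_impl(needle, haystack)
}
```

With a default build, `built_with_avx2()` is false on typical x86_64 targets, so the baseline variant is what everybody runs unless they re-compile.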

How bad is this, you may ask? For is_sorted on &[u8] slices, using AVX2 when available delivers a 4-5x speed-up over SSE2. So it can be pretty bad, depending on the functionality involved and the particular application. The status quo is very limiting: we have all the new SIMD functionality in core::arch, but everything beyond SSE2 is not worth using in libcore/libstd itself.

What could we do about this?

The only idea I have is to expose run-time feature detection in libcore via lang-items that users can optionally define. When not defined, run-time feature detection would just always return that no features are available. When libstd is linked, the functionality of libstd would be used. When libstd is not linked, and the user defines them, then they do whatever the user wants.
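Lang items cannot be defined on stable Rust, so the following is only a rough model of the behavior described above, not the proposed mechanism. It assumes a hypothetical registration hook (`set_detector`) that libstd or a no_std user would call, stores the hook in an `AtomicPtr` (so it assumes the target has atomics), and answers "no features" until something is registered.

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical signature of the user-provided detection routine.
pub type Detector = fn(&str) -> bool;

/// Null means "nobody defined the hook", which models the case where
/// the lang item is left undefined.
static DETECTOR: AtomicPtr<()> = AtomicPtr::new(core::ptr::null_mut());

/// Stand-in for what defining the lang item would do: libstd (or a
/// no_std user) installs its own detection routine.
pub fn set_detector(f: Detector) {
    DETECTOR.store(f as *mut (), Ordering::Release);
}

/// What code inside libcore would call. Without a registered detector
/// it conservatively reports that no features are available.
pub fn is_feature_detected(name: &str) -> bool {
    let p = DETECTOR.load(Ordering::Acquire);
    if p.is_null() {
        return false;
    }
    // SAFETY: the only non-null values ever stored come from
    // `set_detector`, which stores a valid `Detector` function pointer.
    let f: Detector = unsafe { core::mem::transmute::<*mut (), Detector>(p) };
    f(name)
}

/// Example detector for illustration only: pretends AVX2 is the only
/// available feature.
pub fn pretend_avx2_only(name: &str) -> bool {
    name == "avx2"
}
```

Before `set_detector(pretend_avx2_only)` is called, `is_feature_detected("avx2")` returns false; afterwards it returns true, while other feature names still return false.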

That way we could just put memchr and Iterator::is_sorted in libcore, make use of run-time feature detection inside them, and in the extremely common case in which libstd is also linked everybody would benefit from all the CPU features available.



#2

What about caching the results of this feature detection? This seems like it would rely on some form of atomics, which are not available on every target. A link to some function in std that’s currently using feature detection would be really nice.


#3

Currently the libstd implementation of run-time feature detection always caches the results using atomics. AFAIK all targets supported by libstd have atomics. A valid implementation for libcore on targets without atomics / without atomics emulation could just not cache anything (cpuid instruction calls are expensive, ~15-150 cycles, but depending on the platform that’s in the ballpark of some atomic memory operations).

EDIT: also, whatever lang items we come up with here should also work for targets that do not support run-time feature detection at all, so that for those targets “not implementing anything” would also be a valid implementation of run-time feature detection.
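As a rough illustration of the caching discussed above, here is a minimal sketch of a one-feature cache built on a single atomic byte. It is not the actual libstd implementation; `CachedFeature` and the `probe` closure are hypothetical names, and the probe stands in for whatever expensive query (cpuid, an OS call) the platform needs.

```rust
use core::sync::atomic::{AtomicU8, Ordering};

const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

/// Caches the answer for one CPU feature (hypothetical type).
pub struct CachedFeature {
    state: AtomicU8,
}

impl CachedFeature {
    pub const fn new() -> Self {
        Self { state: AtomicU8::new(UNKNOWN) }
    }

    /// Returns the cached answer, running `probe` only while the answer
    /// is still unknown. Racing threads may each run the probe once, but
    /// they all store the same value, so relaxed ordering is enough.
    pub fn get(&self, probe: impl FnOnce() -> bool) -> bool {
        match self.state.load(Ordering::Relaxed) {
            PRESENT => true,
            ABSENT => false,
            _ => {
                let found = probe();
                let new_state = if found { PRESENT } else { ABSENT };
                self.state.store(new_state, Ordering::Relaxed);
                found
            }
        }
    }
}
```

On a target without atomics, the same interface could skip the cache and call `probe` every time, which matches the "just not cache anything" option.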

A link to some function in std that’s currently using feature detection would be really nice.

AFAIK most of the functionality where this would be worth doing is implemented in libcore where run-time feature detection cannot currently be done, so I don’t think any function in libstd actually does this.

The is_sorted crate uses it in all functions in its API, but none of that is preserved in the current WIP implementation for libcore. As mentioned, the status quo is that even if you can use libstd, you should not, and should instead use a third-party crate that does not have these issues. This is bad, and not easy to fix, but I think it’s worth fixing before we start using SIMD everywhere in libstd/libcore, duplicating functionality to work around this, etc.


#4

This sounds like a prime use case for https://github.com/rust-lang/rfcs/pull/2492