Pre-RFC: target-feature detection in libcore

gnzlbg · July 11, 2019, 7:38am

Motivation

libcore cannot use SIMD intrinsics when these are available at run-time. This is bad, because str, [T], and Iterator are provided in libcore, and some of their methods could be much faster (>10x faster) if they were to use SIMD intrinsics available at run-time.

Understanding the problems of different solutions

First: run-time feature detection requires operating system support. libcore is operating-system independent. That is, we can’t just move all the run-time feature detection system to libcore, and do it all there, because without an operating system, there is no way to know what to do.

Second: we could hack our way out of this. We have extension traits for slices, so we could probably provide different methods for str and [T], depending on whether libstd is linked. We already do something like this for f32/f64, which are types defined in libcore, but where libstd adds inherent methods to them. Doing this is painful, and while it would be possible, it would require a significant amount of work and hacks.

Third: Iterator is a trait provided in libcore, where default implementations of the methods are provided. How could libstd provide different default implementations of the Iterator methods? AFAICT this would require providing a different Iterator trait in libstd, and “somehow” hacking our way into making them identical, so that a function bounded by libcore::Iterator becomes bounded by libstd::Iterator if libstd is linked. Possible? Probably. Hacky? Definitely.

Proposal

We add yet another lang-item, in the same spirit as #[global_allocator], #[panic_handler], etc:

#[is_target_feature_detected]
fn is_target_feature_detected(x: &'static str) -> bool;

There can only be one definition of this item in the whole binary; otherwise, compilation should fail, #[global_allocator]-#[panic_handler]-style.

libstd would provide such an item that would just call the std::detect module:

#[is_target_feature_detected]
fn is_target_feature_detected(x: &'static str) -> bool {
  std::detect::check_for(x)
}

If #![no_std] binaries do not provide this item, it will be automatically polyfilled as:

#[is_target_feature_detected]
fn is_target_feature_detected(x: &'static str) -> bool {
  false
}

That is, when this item is not provided, this API always returns false: “feature x is not available”. Note, however, that the feature detection macros will not call this API if the feature was enabled at compile-time, so if the feature is enabled at compile time, you’ll correctly get that the feature is also available at run-time (if it isn’t, undefined behavior was invoked since the moment the binary started to run).

We will then move the is_{target_arch}_feature_detected! macros from the std::detect module into libcore, and implement them to call the is_target_feature_detected lang item.
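A minimal sketch of the macro behavior described above, assuming an ordinary function stands in for the proposed lang item (a real lang item would be resolved by the compiler, not defined like this). The macro name `feature_detected_sketch!` is made up for illustration; the function body mirrors the "always false" polyfill.

```rust
// Hypothetical stand-in for the proposed lang item. It mirrors the polyfill
// from the proposal: nothing is detected at run-time.
fn is_target_feature_detected(_name: &'static str) -> bool {
    false
}

// Sketch of a detection macro: a feature that was enabled at compile-time is
// reported as available without ever calling the run-time hook, because `||`
// short-circuits on the `cfg!` result.
macro_rules! feature_detected_sketch {
    ($feature:tt) => {
        cfg!(target_feature = $feature) || is_target_feature_detected($feature)
    };
}

// Example caller: picks a code path based on the macro result.
fn describe_avx2() -> &'static str {
    if feature_detected_sketch!("avx2") {
        "avx2 path"
    } else {
        "fallback path"
    }
}
```

With the polyfill in place, the result depends only on the compile-time flags: building with `-C target-feature=+avx2` selects the AVX2 path, and any other build selects the fallback.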

Alternatives

That is what felt like the best alternative to me, since the feature-detection run-time does not change at run-time.

An alternative could be to, e.g., use an AtomicPtr in libcore that points to a fn(&'static str) -> bool, that users can configure to make it point somewhere else. This would allow users to change the runtime during runtime as often as they want, which I don’t think makes sense. A problem with this approach is that one of the things one often wants to check, is whether the CPU supports atomics, and, e.g., use a Mutex if it does not. With an AtomicPtr, one would need atomics to be able to check whether atomics are available. So, AFAICT, we would need to use a Mutex here. That might, however, require operating system support, which we don’t have in libcore.
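For comparison, here is a rough sketch of what the AtomicPtr alternative could look like. The names `DETECTOR`, `set_detector`, `is_detected`, and `only_sse2` are hypothetical, and `std` is used only for brevity (the same atomic types live in `core::sync::atomic`). Note that the sketch itself needs pointer-sized atomics, which is exactly the limitation raised above.

```rust
use std::mem;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Signature of a user-provided detection function (hypothetical).
type Detector = fn(&'static str) -> bool;

// Null means "no run-time installed", which behaves like the polyfill.
static DETECTOR: AtomicPtr<()> = AtomicPtr::new(ptr::null_mut());

// Installs a detection function; it can be called again to swap it later.
pub fn set_detector(f: Detector) {
    DETECTOR.store(f as *const () as *mut (), Ordering::Release);
}

// Queries the installed function, or reports "not available" if none is set.
pub fn is_detected(name: &'static str) -> bool {
    let raw = DETECTOR.load(Ordering::Acquire);
    if raw.is_null() {
        return false;
    }
    // SAFETY: the only non-null values stored in DETECTOR come from
    // `set_detector`, which always stores a valid `Detector` pointer.
    let f: Detector = unsafe { mem::transmute::<*mut (), Detector>(raw) };
    f(name)
}

// Example run-time that only ever reports SSE2.
fn only_sse2(name: &'static str) -> bool {
    name == "sse2"
}
```

Before `set_detector(only_sse2)` is called, every query returns false; afterwards, `is_detected("sse2")` returns true. Nothing stops a program from calling `set_detector` repeatedly, which is the "change the runtime during runtime" property questioned above.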

I can’t think of any other alternatives, but please, this is the brainstorming phase, so maybe you do?

How do you use this

In general, users don’t need to do anything. They just call an Iterator method, and if their CPU supports AVX2, the method might just use it internally.

Today, is_x86_feature_detected! is only exported from libstd, so #![no_std] libraries cannot use it. With this RFC, that would change, and #![no_std] libraries would be able to use the run-time feature detection macros.

Users implementing #![no_std] binaries don’t need to do anything either. The compiler polyfills a sound is_target_feature_detected lang item, and they can just use the run-time detection macros, and if the features are enabled at compile-time, they will benefit from them.

These users can, however, add their own feature-detection run-time. If their app is user-space-like, they can probably just add the std_detect crate as a dependency, use its cargo features to tailor it to their application, and that’s it - they get quite good run-time feature detection.

If their application is more os-kernel-like, then they can implement their own runtime, maybe submitting PRs to std_detect to allow it to support kernel-like platforms, or maybe we can maintain a second std_detect crate tailored for OS kernels that all kernel devs can reuse. This type of experimentation can happen on crates.io. This proposal just enables it.

Drawback

Yet another lang item. We could remove this drawback with extern existentials, but those have been postponed indefinitely.

gnzlbg · July 11, 2019, 7:58am

cc @alexcrichton

HadrienG · July 11, 2019, 9:55am

May I request a little bit of background?

Coming from x86, where CPUID is an unprivileged instruction, this statement surprised me. Do other CPU architectures supported by Rust lack a similar mechanism for probing the set of SIMD ISA extensions that are available on a given chip?

EDIT: Or perhaps the issue is with people who run obsolete operating systems that do not fully support their CPU, e.g. can't save/restore the newer SIMD registers on context switches, yielding fewer usable SIMD features than what's actually supported in hardware?

gnzlbg · July 11, 2019, 12:29pm

While CPUID is unprivileged, whether certain features are enabled or not depends on the platform. For example, even if the CPU supports AVX2, that is, 256-bit wide vector registers, the operating system might not support copying these registers on context-switches (or it might not want to do that on all context switches), so there are a range of flags that the OS can set and must be checked to advertise that. This assumes that an OS exists, and has set those appropriately, which doesn't hold if you don't have an OS, which is the assumption that libcore makes.

This can be problematic if you are writing an OS kernel, which might want to set these flags for user-space and not for kernel-space. These flags are global, so if the kernel were to use the same run-time for feature detection as user-space, the kernel would need to switch all these flags on and off properly on context switches. However, since run-time feature detection isn't part of libcore, a kernel can just implement a different run-time, that returns appropriate values for kernel space. Ideally, all functionality defined in libcore would be able to just use that.

Do other CPU architectures supported by Rust lack a similar mechanism for probing the set of SIMD ISA extensions that are available on a given chip?

We currently support run-time feature detection for x86/x86_64, arm32/arm64, ppc32/ppc64, mips32/64, and maybe some more. Only x86 architectures newer than 486 have an unprivileged CPUID instruction. In all other architectures, this instruction is privileged, and user space cannot, in general, use it.

The way this information is obtained is by asking the OS for it, e.g., on Linux, the available CPU features are put in the ELF auxiliary vector by the OS on initialization, and calling getauxval queries them. Older Linux and Android versions can just query /proc/cpuinfo.
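As a rough illustration of the procfs fallback, the sketch below scans the text of `/proc/cpuinfo` for a feature name. The function names are hypothetical and this is not how std_detect is implemented; it only assumes the usual layout, where x86 kernels list features on a `flags` line and ARM kernels on a `Features` line.

```rust
// Returns true if `flag` appears as a whole word on a `flags` (x86) or
// `Features` (ARM) line of the given /proc/cpuinfo text.
fn cpuinfo_has_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter_map(|line| line.split_once(':'))
        .filter(|(key, _)| matches!(key.trim(), "flags" | "Features"))
        .any(|(_, value)| value.split_whitespace().any(|f| f == flag))
}

// Reads the file and queries it. Returns None when the file cannot be read,
// e.g. on a platform without procfs.
fn detect_from_proc(flag: &str) -> Option<bool> {
    let text = std::fs::read_to_string("/proc/cpuinfo").ok()?;
    Some(cpuinfo_has_flag(&text, flag))
}
```

The file access is the part that libcore cannot do on its own: it needs a file system, and therefore an operating system.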

On modern Linux (kernel > 5) and FreeBSD 12, there is user-space emulation of the aarch64 equivalent instruction for CPUID. That is, while the instruction is privileged, user space can execute it, and the OS will intercept that and provide the results. We currently only support this for FreeBSD 12 because many Linux users are still stuck with older kernel versions, and there is no easy way to test whether this is supported. I suppose we could call getauxval to test this, but since getauxval returns all available features, it is not necessary to do anything else afterwards.

alexcrichton · July 11, 2019, 2:21pm

I would personally prefer to see possible performance numbers for libcore to see what’s possible before adding new lang items. I personally loathe adding so many lang items and I think it’s also questionable whether we even can. Today you can produce programs without defining this lang item, and if we start requiring a definition that’s a breaking change.

I don’t really know of great alternatives though unfortunately, but I feel like there’s design space that should happen here before we rush to include it in libcore/libstd.

burntsushi · July 11, 2019, 2:48pm

On x86_64, it seems like you should be able to use SSE2 simd unconditionally, since SSE2 is part of x86_64. But in order to use anything above that, you need either a compilation flag or runtime CPU feature detection.
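For readers unfamiliar with the pattern, this is roughly what run-time dispatch looks like today with libstd's `is_x86_feature_detected!`. The `count_byte*` functions are made up for illustration, and the AVX2 variant simply reuses the scalar body and relies on the compiler being allowed to emit AVX2 code for it, rather than using hand-written intrinsics.

```rust
// Portable implementation; also the only one used on non-x86_64 targets.
fn count_byte_fallback(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

// Same algorithm, but the compiler may use AVX2 instructions in this function.
// Calling it on a CPU without AVX2 is undefined behavior, hence `unsafe`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_byte_avx2(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

// Public entry point: checks the CPU at run-time and picks an implementation.
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at run-time.
            return unsafe { count_byte_avx2(haystack, needle) };
        }
    }
    count_byte_fallback(haystack, needle)
}
```

Code in libcore cannot be written this way at the moment, because the macro is only exported from libstd.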

memchr might be a good example. On my machine (Intel i5-7600), the AVX version of memchr is about 1.5x faster than the SSE2 version, with the SSE2 version being about 2.5x faster than the non-SIMD version.

$ critcmp memchr-avx memchr-sse memchr-fallback
group                     memchr-avx                             memchr-fallback                        memchr-sse
-----                     ----------                             ---------------                        ----------
memchr1/rust/huge/rare    1.00      9.5±0.09µs    58.5 GB/sec    3.97     37.6±0.12µs    14.7 GB/sec    1.51     14.3±0.09µs    38.8 GB/sec

The extent to which these kind of gains can be realized for other operations isn't clear though, as workloads vary.

Is there any way to make it optional such that the default implementation always returns false?

alexcrichton · July 11, 2019, 3:13pm

Heh makes sense in terms of speedups! Is there a good idea as to what can be optimized? For example I would expect a lot of iterator adaptors wouldn’t really be able to SIMD-specialize since they’re too high-level to see they’re working on slices with a particular query (for example). Something like str::lines, however, could be optimized (and is probably already using memchr in libcore). Basically what I’m saying is that it would probably be good to have a list of APIs that would benefit from SIMD-acceleration in libcore which can’t today.

And yes a default implementation could return false, and similarly this sort of lang item could hopefully be kept unstable for a very very long time (possibly forever).

gnzlbg · July 11, 2019, 3:16pm

So I just re-ran the is_ascii::is_ascii and the is_sorted::is_sorted benchmarks (e.g. from the is_sorted RFC), and is_ascii using AVX2 is ~1.8x faster than the optimized version from rustc. The is_sorted crate is ~16x faster than [T]::is_sorted, when using AVX (4,832,577 ns/iter vs 298,360 ns/iter), for some of the benchmarks (sorting a large array of u8s/i8s using <, >, etc.).

I expect the biggest impact would be using SIMD for [u8]::is_utf8, which is something that gets called quite often in Rust programs (e.g. every time one uses String::from_utf8). There are a couple of papers by Lemire's group that show that one can continue to make the check faster as one moves from SSE to AVX2, and from AVX2 to AVX512. EDIT: I remember seeing a Rust implementation of some of these, but don't recall by whom, @killercup was it by you? EDIT2: found Lemire's group C++ impl: GitHub - lemire/fastvalidate-utf-8: header-only library to validate utf-8 strings at high speeds (using SIMD instructions) They show a 14x perf improvement from the non-SIMD version to the AVX one, and a 1.6x improvement from the SSE version to the AVX version.

I personally loathe adding so many lang items and I think it’s also questionable whether we even can.

Me too. Progress here was blocked on resolving the extern existential RFC, which proposed a language feature to allow defining "global singletons" like the global allocator, panic handlers, etc. in Rust libraries, without resorting to lang items. The lang team resolved the RFC by, IIRC, rejecting the overall direction, since adding a complex language feature to spare a couple of lang items was hard to justify. So requiring newer lang items was a trade-off that was known and acceptable to the language team. Obviously, each lang item needs to be worth it and stand on its own - that shouldn't be interpreted as a free license to add lang items.

I don’t really know of great alternatives though unfortunately, but I feel like there’s design space that should happen here before we rush to include it in libcore/libstd.

I hope somebody has a better idea than using a lang item. A static function pointer approach might work, if we are able to work around the limitation of atomics when it comes to no_std targets.

gnzlbg · July 11, 2019, 3:30pm

Some x86_64 targets have SSE3 enabled by default, e.g., x86_64-apple-darwin.

One can use specialization to specialize the Iterator methods for concrete iterator types, like core::slice::Iter. The is_sorted crate does this, but beyond slice::Iter and slice::IterMut, I don't know if any other type is relevant. Fast SIMD algorithms require slices anyways.

josh · July 11, 2019, 3:49pm

That's true. However, using automatic detection seems like a reasonable default, with the ability to override that default if you want to disable or hardcode certain features.

newpavlov · July 11, 2019, 3:51pm

It's concerning, since it would be nice to have a similar lang item for getrandom. If that's the case, I think the extern existential proposal should be revisited, i.e. we need functionality to define items overridable from other crates. It may even find uses outside of core/std.

josh · July 11, 2019, 3:51pm

Even then, you might be in an operating system environment that doesn't want to save and restore the SSE registers. So you can't quite assume that.

Centril · July 11, 2019, 3:52pm

The RFC was not rejected, but is proposed to be postponed due to a lack of bandwidth to deal with it and not due to some disagreement with the feature proposed itself.

That was not the decision and the conclusion is not necessarily to instead add lang items (as you note later).

One thing that has me concerned here is the interaction with const fn. It seems to me that if you start querying for target features in libcore that will significantly postpone (or possibly render impossible) the constification of libcore.

alexcrichton · July 11, 2019, 4:10pm

Those are some compelling numbers! Is there an implementation of SIMD-accelerated utf-8 validation that we could compare against as well to see the kind of possible speedups there?

Here’s a bad idea which is an alternative to lang items at least. We could add a perma-unstable function to libcore which is “tell libcore about cpu features”. Internally it does atomics or w/e to store it in some global, and runtime checks for simd features check this global. The standard library when it starts up would then tell libcore about detected cpu features. The main downside of this approach is that rust-used-as-a-library won’t have a stable way to enable simd acceleration in libcore, only Rust binaries will have a way to do it. Hence the introduction of this not being a great idea.

gnzlbg · July 11, 2019, 4:13pm

The RFC was not rejected, but is proposed to be postponed due to a lack of bandwidth

Sorry, you are right. I might have had a different RFC in mind.

That's a very important concern. IMO, a const fn language feature that needs to pick between compile-time execution or efficient execution at run-time is flawed. This requires users to either be extremely conservative with making functions const fn, since that could prevent future run-time performance improvements, or to duplicate APIs for use in constant or "runtime" expressions.

C++ constexpr had this problem, and they fixed it by allowing const fns to query whether they were executing in a constant or run-time context, such that they can provide different implementations.

I don't know if C++ solution would be a good solution for Rust, but this is a problem worth solving if we want to prevent an API split.

gnzlbg · July 11, 2019, 4:20pm

The other problem I see with this is that, if a binary doesn't use run-time feature detection, they pay for it when libstd is initialized. libstd already stores a static atomic for caching the features, so if these are not removed, all rust binaries are already paying a price for it, even if they don't use it. So this might be an acceptable tradeoff.
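The caching scheme mentioned here can be pictured roughly as follows. This is a simplified sketch and not the actual std_detect code: the bit layout, the names, and the `detect_all` slow path are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// The top bit records that detection already ran; the low bits are features.
const INITIALIZED: u64 = 1 << 63;
pub const SSE2: u64 = 1 << 0;
pub const AVX2: u64 = 1 << 1;

// One word of read-write static data, zero until the first query.
static CACHE: AtomicU64 = AtomicU64::new(0);

// Hypothetical slow path: the real one would run CPUID, call getauxval, etc.
// For this sketch it pretends that only SSE2 is present.
fn detect_all() -> u64 {
    SSE2
}

// Runs the slow path on first use and answers from the cache afterwards.
// Two racing threads may both run detection; they store the same value.
pub fn has(feature: u64) -> bool {
    let mut bits = CACHE.load(Ordering::Relaxed);
    if bits & INITIALIZED == 0 {
        bits = detect_all() | INITIALIZED;
        CACHE.store(bits, Ordering::Relaxed);
    }
    bits & feature != 0
}
```

The static is present in the binary whether or not anything ever calls `has`; the detection work itself only happens on the first query.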

We could move these atomics to libcore, initializing them always to false, and instead of initializing them properly on first use, do that during libstd initialization. Some of the initialization code needs to do system calls and/or access the file system, so I don't know whether all of this is available during libstd initialization, but it could work.

Either way, we would then move the functionality to query the active features to libcore, leaving only the initialization code in libstd. This would work for binaries linked with libstd.

The question is whether we want to also allow binaries that are not linked with libstd to provide their own initialization, and how would that work. I suppose that if these atomics are public, and their api is clearly defined, then a no_std binary can just write anything to them, at the beginning of main. Or maybe we could expose the "tell me about target features" API to users. That might work too.
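A sketch of the "push" design discussed in the last few posts, assuming a single atomic word that would live in libcore and start out as "nothing detected". All names here (`CORE_FEATURES`, `tell_core_about_features`, `core_has_feature`, the `FEAT_*` bits) are hypothetical, and `std` is used only for brevity since `core::sync::atomic` offers the same type.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Feature bits (illustrative values only).
pub const FEAT_SSE2: u32 = 1 << 0;
pub const FEAT_AVX2: u32 = 1 << 1;

// Would live in libcore: zero means no feature has been reported yet.
static CORE_FEATURES: AtomicU32 = AtomicU32::new(0);

// Called once by whoever knows how to detect features: libstd during its
// initialization, or a no_std binary at the beginning of main.
pub fn tell_core_about_features(bits: u32) {
    CORE_FEATURES.store(bits, Ordering::Release);
}

// What the detection macros in libcore would consult at run-time.
pub fn core_has_feature(bit: u32) -> bool {
    CORE_FEATURES.load(Ordering::Acquire) & bit != 0
}
```

Until someone calls `tell_core_about_features`, every query answers false, which matches the conservative behavior of the polyfill in the original proposal.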

comex · July 11, 2019, 11:09pm

If you're building a cdylib then there is no place for libstd to run initialization code.

killercup · July 12, 2019, 7:29am

I only did some benchmarks last year: GitHub - killercup/simd-utf8-check

Edit: Oh, maybe I did implement this, too? I honestly don't recall doing this

gnzlbg · July 12, 2019, 8:12am

@comex cdylibs are tricky. First, note that access to the feature-detection runtime would happen in libcore, which doesn’t really have any tools (e.g. dlsym) to handle weak symbols, so libcore cannot query whether the runtime state exists, or whether a libstd function exists, that could be used to initialize these symbols.

In C++, you can initialize global variables on binary initialization and dynamic-library link time:

static int foo = runtime_initialization();

IIRC, there is a segment in the binary where you can have an array of function pointers that get called on initialization and finalization of the binary or dynamic-library by the linker. We can add a function pointer there to some function (e.g. from libstd) that initializes the runtime for cdylibs, when the cdylib is compiled with libstd. This is super hacky, but I don’t know how we could solve this in Rust.

The Rust way would be to use lazy_static!, and initialize this on first access, but libcore cannot do the initialization since it doesn’t know how. Some other code needs to call libcore to do the initialization, and that code needs to be called at runtime somehow, either in life before main, or by the linker when the library is dynamically linked.

EDIT: if libstd is not linked into the cdylib but linked into the final binary, ideally, the cdylib should be able to do proper run-time feature detection. I have no idea how to allow that. While the runtime could be weakly linked to the cdylib, a libcore-only cdylib has no tools to introspect that.

whitequark · July 12, 2019, 10:20am

Right now, AFAIK, libcore does not have any read-write data, and it would be concerning for embedded targets if it did gain even a single field like that.
