I'm confused about where an inline assembly block would be helpful. A relaxed atomic load should have the same runtime behavior.
Briefly: if a function calls several callees, and each of those callees has its own detection and dispatch, and those callees are inlined, then all but one of the detections can be eliminated.
That said, I've been continuing to experiment, and current Rust and LLVM don't do as much optimization here as they do with the approach of making functions generic over the SIMD capability (as pulp does), passing around a ZST for it, and dispatching on the token (I do this by having methods like .as_avx2() on the Simd trait that the token type implements). That's yielding the highest-quality code in my experiments, but ergonomically it's not ideal.
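A rough sketch of the shape I mean (hypothetical names, not pulp's actual API; x86-64 only): the capability token is a ZST, inner code is generic over it, and detection happens exactly once at the outer dispatch point, so inlined callees never re-detect.

trait Simd: Copy {
    fn sum(self, data: &[f32]) -> f32;
}

#[derive(Clone, Copy)]
struct Avx2Token;

#[derive(Clone, Copy)]
struct ScalarToken;

impl Simd for ScalarToken {
    fn sum(self, data: &[f32]) -> f32 {
        data.iter().sum()
    }
}

impl Simd for Avx2Token {
    #[inline]
    fn sum(self, data: &[f32]) -> f32 {
        // SAFETY: an Avx2Token is only constructed after AVX2 was detected.
        unsafe { sum_avx2(data) }
    }
}

#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[f32]) -> f32 {
    // The autovectorizer may use AVX2 here; real code might use intrinsics.
    data.iter().sum()
}

// Inner routines are generic over the token, so they inline into whichever
// specialisation the single outer dispatch picks.
fn mean<S: Simd>(simd: S, data: &[f32]) -> f32 {
    simd.sum(data) / data.len() as f32
}

fn mean_dispatch(data: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") {
        mean(Avx2Token, data)
    } else {
        mean(ScalarToken, data)
    }
}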
I found this thread while researching why is_x86_feature_detected is surprisingly slow - especially when used in a fine-grained way in leaf functions that I expect to get inlined and used in loops.
Is it likely that is_x86_feature_detected will be made more optimal? I am currently considering writing my own workaround for this.
A more optimal version of all the is_*_feature_detected would certainly be nice, and perhaps even a way to get the cache directly so that more complex checks can be performed.
Currently I use these macros for my maybe_special attribute macro, which automatically implements CPU-feature function multiversioning, similar to Clang's target_clones attribute. The initialiser function is only called once by the dispatch, since the detected specialisation is cached in a static mut (I don't use an atomic because I assume that every specialisation will always behave identically, so race conditions don't matter), which limits the severity of the performance impact. Still, an improvement to the performance of these macros would definitely be nice, especially since my crate has to call them many times in a row during its initialiser (which is per function, too).
Obligatory reminder that UB is UB. If you don't care about which of the values you get, use relaxed atomics. But races on non-atomic accesses are always UB, and the argument "I don't care which of the two values I get" is not valid -- the program might do something that doesn't correspond to either value, or to any sensible value at all.
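For example, a minimal sketch of caching a detection result behind relaxed atomics (names hypothetical; is_x86_feature_detected! stands in for whatever expensive check is being cached) -- concurrent initialisation may run the check twice, but every access is atomic, so there is no UB and no torn value:

use std::sync::atomic::{AtomicU8, Ordering};

// 0 = not yet checked, 1 = feature absent, 2 = feature present
static AVX2_CACHE: AtomicU8 = AtomicU8::new(0);

fn has_avx2() -> bool {
    match AVX2_CACHE.load(Ordering::Relaxed) {
        0 => {
            let detected = is_x86_feature_detected!("avx2");
            AVX2_CACHE.store(if detected { 2 } else { 1 }, Ordering::Relaxed);
            detected
        }
        v => v == 2,
    }
}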
Ideally the CPU detection could be done early, while still single-threaded, and then jumps patched into the code to use the appropriate implementation. I believe the Linux kernel does this sort of thing, and Function Multiversioning (Using the GNU Compiler Collection (GCC)) is a variant assisted by the dynamic linker (specifically, it uses IFUNCs on Linux, which allow executing code to compute symbol resolution[1]; GCC uses this to do feature detection). It tends to conflict with ASAN for some reason in my experience.
I suppose the issue for Rust is that it wants a cross-OS way to do this, though, and one that works inside shared libraries without relying on .init_array or similar mechanisms that execute code before main or during dlopen.
Which is coincidentally the same mechanism the xz backdoor tried to abuse.
Yeah, I was operating under the assumption that ptr-sized writes were always atomic without needing to use atomic operations, but someone else had already corrected me about a day after I wrote that comment. I promptly implemented relaxed atomics within my crate (funnily enough though, it seems to generate identical assembly on x86-64).
As for @Vorpal's suggestions: I had already considered all of that, but it seems that IFUNCs would require compiler support and in general be really annoying to deal with. Additionally, any form of load-time dispatch would restrict the architectures and OSes the code works on even more than it already does (one of my major design decisions for my crate was to make it support every architecture and OS that std_detect/std::arch supports, and fall back to a generic implementation for other architectures/OSes).
Great.
Yeah identical assembly is what I would hope for, the point of using atomics here is to prevent the compiler from applying certain optimizations -- optimizations which are unlikely to apply when you test your code, but who knows what happens when this gets inlined into the right context or when optimizations get smarter in future compiler versions...
Relaxed atomics are all about telling the compiler your intent, and have very little to do with the hardware. (That's similar to e.g. MaybeUninit.)
I was operating under the assumption that ptr-sized writes were always atomic without needing to use atomic operations
I wonder if there's some place in the docs we could put a big fat warning to avoid such misconceptions... but it's hard because there are many different unsafe operations you could use to do this, and if you already made the assumption you're unlikely to specifically go and search for confirmation.
On x86-64, instructions that read but don't write are acquire-ordered by default, and instructions that write but don't read are release-ordered by default. So if you are only reading, or only writing, you effectively get some fairly powerful atomics for free on that platform. (But there are plenty of platforms where that isn't true!) Unless you need sequential consistency, atomics on x86-64 are only expensive in cases where you read and write in the same instruction (in which case the CPU core has to take a lock in order to ensure that the other cores don't write to the same address in between; an atomic read-write is not equivalent to an atomic read followed by an atomic write). So optimisation of atomics on x86-64 is mostly about trying to reduce the number of atomic read-writes you need rather than the number of atomic instructions in general.
As mentioned above, you still need to mark the operations as atomic to prevent compiler optimisations interfering with them, even if the hardware treats them as atomic by default.
Not true for compare and exchange, fetch add etc. But true for loads and stores.
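To make that concrete (typical x86-64 codegen; exact instructions depend on the compiler and surrounding code):

use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

// A relaxed (or acquire) load usually compiles to a plain mov, the same as a
// non-atomic read.
fn read() -> u64 {
    COUNTER.load(Ordering::Relaxed)
}

// A read-modify-write is where the cost shows up: this needs a lock xadd
// (or a cmpxchg loop), since it must be a single indivisible operation.
fn bump() -> u64 {
    COUNTER.fetch_add(1, Ordering::Relaxed)
}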
Except IA-64, I believe, has different opcodes for relaxed and non-atomic stores? Or was it Alpha? One of the two, anyway.
Every relevant CPU architecture has at least consume ordering as far as I know. (I have heard that GPUs are a bit of a wild west for this, but I know very little of them.)
And of course, there may be different rules for different types of memory: MMIO, write combining (on x86 even), etc.
But yes, in general relaxed is about the compiler. But with a ton of asterisks.
A memory of seeing a crate for this in Rust kept nagging at me, so I took some time today to look for it. Eventually I found it:
It allows for Linux-kernel-style patching of your code, but since it is not in the kernel it unfortunately cannot safely patch multi-threaded code. (Thinking about it, I believe it should be possible on x86-64 due to specifics of instruction encoding and cache coherency, but it would absolutely not be portable to e.g. ARM.)
Yeah I meant loads/stores only, not RMWs.
(The stricter orderings also have aspects that are only for the compiler. But anyway, this is really going way too far into the weeds. It should be clear that I am not fully spelling out everything there is to say about the memory model here, I am just giving some loose guidance to help fight an apparent misunderstanding.)
I agree that this would be good and have explored it, but don't know how to express the constraint "this needs to be done in a single-threaded context before any multithreaded access" in the Rust type system. Obviously one way to guarantee that would be to run it before main, but there's an aversion to having static initialization code in Rust (even though, as I've written, on Windows there are two instances of CPU detection that run before main).
There's other situations where it would be nice to express "this needs to happen in a single-threaded context before any multithreaded access", such as environment variable mutation[1].
Unfortunately, nothing stops C++ static constructors from calling pthread_create, and I have heard that there exist shared libraries that actually do that[2], so "before main" isn't even good enough.
We'd need a new "super early" process initialization phase, defined specifically by a guarantee that it would be executed single-threaded, after the dynamic linker was done processing relocations, but before the existing static initialization phase.
I am still interested in pursuing the POSIX-level API change proposal outlined in that thread, but not on a volunteer basis: someone would need to fund me to do it. Please get in touch if you're interested in seeing a concrete grant proposal.
Nobody's ever come out and said which libraries these are in my hearing, though.
IFUNCs would work for this, though they would definitely need compiler support (though perhaps it's possible via inline asm? Not sure).
Upon reading this, I thought "so if we need to initialize it that early, maybe we can get the dynamic linker to do it", and then thought "hmm, maybe the dynamic linker already does that?".
And indeed, it seems that at least on Linux, CPU feature detection is one of the things that's placed in the auxiliary vector. Unfortunately, the auxiliary vector isn't in a stable/known location: it's placed immediately after the terminating NULL pointer of the initial environment, so the location depends on how many environment variables exist (and the initial environment might be quite hard to find if someone sets some environment variables and thus changes environ to point elsewhere). So this is probably a dead end: we'd need the address to be stable enough to link against in order to use it for efficient feature detection (finding the auxiliary vector every time is probably slower than just using a stable address of our own).
Ifuncs have a bunch of their own problems, but CPU feature detection is the thing they are actually intended to be used for, so, at least, if we run into the problems we should get some sympathy from the C library people. Note however that ifunc resolvers are not guaranteed to run at any particular time -- they might run as late as just before the target function is actually called for the first time, or as early as during processing of dynamic data relocations. This means there's no guarantee they're called from a single-threaded context and there's no guarantee you can make function calls or access global data.
It is indeed possible to construct an ifunc using inline assembly; I don't remember how off the top of my head, but glibc used to do this (possibly still does).
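Something along these lines might work on Linux/ELF (an untested sketch in gas syntax, loosely following the pattern glibc's macros used before compilers grew attribute((ifunc)); frob and friends are made-up names):

use std::arch::global_asm;
use std::arch::x86_64::__cpuid_count;

extern "C" fn frob_avx2() { /* hypothetical AVX2 implementation */ }
extern "C" fn frob_default() { /* hypothetical fallback */ }

// The resolver is exported under the public symbol name; the directive below
// then marks that symbol as an IFUNC, so the loader calls this function and
// binds "frob" to whatever pointer it returns. Resolvers can run very early
// (before relocations/TLS are usable), so this avoids globals and asks CPUID
// directly: leaf 7, subleaf 0, EBX bit 5 = AVX2.
#[export_name = "frob"]
extern "C" fn frob_resolver() -> extern "C" fn() {
    // SAFETY: CPUID is available on every x86-64 CPU.
    let ebx = unsafe { __cpuid_count(7, 0).ebx };
    if ebx & (1 << 5) != 0 { frob_avx2 } else { frob_default }
}

// Mark the symbol as STT_GNU_IFUNC (ELF-specific, gas syntax).
global_asm!(".type frob, @gnu_indirect_function");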
This headache is encapsulated by getauxval. We wouldn't want to call that every time we needed the information, but we could call it once during Rust stdlib startup and stash the result in a global.
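Something of this shape, assuming Linux and the libc crate (a sketch of the idea, not what std actually does; hwcap is a made-up helper):

use std::sync::OnceLock;

// Query the auxiliary vector once via getauxval and stash the result; later
// feature checks are just a load from this global.
fn hwcap() -> libc::c_ulong {
    static HWCAP: OnceLock<libc::c_ulong> = OnceLock::new();
    *HWCAP.get_or_init(|| unsafe { libc::getauxval(libc::AT_HWCAP) })
}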
I don't think an initial implementation really needs to worry about the auxval vector. Clang and gcc call the runtime's __cpu_indicator_init function in the ifunc resolver, and each multiversioned function gets its own ifunc resolver. This function is also called as a constructor before main, but that's not relevant to ifuncs since they often run before-before main. This function runs cpuid and stores the result to a global variable. It does this in each ifunc resolver with no synchronization and there's no synchronization on the reads.
I think this is safe because everything is under the loader lock, and even if it weren't, you've got bigger problems if different threads are writing different data. Theoretically that could happen on big.LITTLE machines, but in practice it's rare for big.LITTLE CPUs to support different features on different cores (at least intentionally).
I think running cpuid for each resolver function is probably also fine just because the granularity here is likely to be pretty large.
And yes, ifuncs are designed for exactly this; they are superior to keeping a global function-pointer table yourself and dispatching based on that, because delayed resolution, interposition, and suchlike still work. When you do it yourself you really are just duplicating stuff the loader should take care of.
macOS has "dyld-resolvers", which seem similar, and Windows actually uses something similar for things like memcpy/memmove/etc. So this approach is actually somewhat cross-platform.
Compilers could do better at optimizing based on multiversioned functions; nobody is identifying code that would benefit from newer vector instructions and outlining it into multiversioned functions, for example.
What about musl? It doesn't support IFUNCs, and a lot of projects prefer building static binaries with it since they are portable between distros. Then there are all the BSDs as well.
I would like to see something like this, but I don't see how it can be made in any way portable. But experimenting with this in a library/macro form should be possible using global_asm, I believe, so a prototype using that could be useful as a proof of concept (and for those who don't care about portability).
Well, fully static binaries are kind of a broken idea anyway, so users doing that should be used to things not working.
The thing you need compiler support for is aggregating all the various possible function versions and their CPU features and using that to generate the dispatcher. Once you have the dispatcher you could call it yourself to populate a function pointer without going through the loader, too.
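Sketched out (hypothetical names, with the dispatcher written by hand rather than compiler-generated):

use std::sync::OnceLock;

fn frob_avx2() { /* hypothetical AVX2 implementation */ }
fn frob_default() { /* hypothetical fallback */ }

// The "dispatcher": the piece a compiler would generate from the set of
// versions; written by hand here.
fn frob_dispatcher() -> fn() {
    if is_x86_feature_detected!("avx2") {
        frob_avx2
    } else {
        frob_default
    }
}

// Calling the dispatcher ourselves and caching the result in a function
// pointer, instead of letting the loader resolve an ifunc.
static FROB_IMPL: OnceLock<fn()> = OnceLock::new();

fn frob() {
    (FROB_IMPL.get_or_init(frob_dispatcher))()
}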
You're right that it might be implementable in asm, but at the very least having support for emitting the LLVM ifunc declaration would be useful.
Maybe something like:
#[target_feature(enable="avx2")]
fn frob_avx2() {
}
fn frob_default() {
}
gen_dispatcher!(frob_dispatcher, {
avx2: frob_avx2,
default: frob_default
});
decl_ifunc!(frob, frob_dispatcher);
Where the block in that macro could be a match body on the CPU information, and decl_ifunc! could emit the module-level assembly.
Perhaps all of that could be condensed into one macro if you want the same body for each version.