SIMD in `const` easily

Kixunil · July 17, 2024, 7:00am

There are algorithms that could be const but aren't because they are optimized for certain platforms using SIMD. (e.g. SHA256). So far the only stable way to provide const impl is having another function. However it's annoying because it clutters the API, can be confusing and it's hard to come up with a good name.

I've just realized there's a simple solution. Pretty much all sound code uses is_*_feature_detected macros in a condition and then runs the optimized impl if it returns true. So by just allowing the macro in const and always returning false from it combined with allowing writing all SIMD intrinsics in const but failing the evaluation when actually executed gets the job done. You could think of const evaluation as another platform that just doesn't provide any features at all.

I couldn't find any relevant discussion on this topic. Is there one already? Is this solution really that simple or am I missing some problem?

farnz · July 17, 2024, 9:17am

While this solves this use case, there's the flip-side that some users would like is_*_feature_detected to be a constant true whenever the feature is statically enabled. For example, on 64-bit x86, is_x86_feature_detected("sse2") should be unconditionally true, since SSE2 is always available and used by the compiler; similarly, if I add -C target-features=avx2, or use -C target-cpu=x86-64-v3, is_x86_feature_detected("avx2") should be unconditionally true.

This is useful when you compile for targets with known feature sets; by making is_*_feature_detected compile-time constant wherever possible, you ensure that the compiler correctly optimizes in the knowledge that it has a given SIMD ISA available. I can see a use case for const fns that do know which SIMD ISAs are available, so that I can (e.g.) only SIMDify my algorithm if I'm going to be able to use AVX2 throughout, since part of my algorithm depends on AVX2 intrinsics, while other parts only need SSE4.2 intrinsics (but I don't believe I'd gain from SIMD instrinsics if I can't inline functions using SSE4.2).

How would you reconcile these two positions? Deem one of them "unsupported"? Offer a different macro for the case where I want it to be false regardless of target-features in a const context?

kornel · July 17, 2024, 10:25am

There could be separate is_const_feature_detected($literal) or const fn is_x86_sse2_enabled().

the8472 · July 17, 2024, 10:30am

With const_eval_select (tracking issue) it's possible to have a const and a runtime implementation inside a single facade function.

This is a rather unsafe, internal feature but maybe there'll be a stable equivalent one day.

Kixunil · July 17, 2024, 11:08am

Thanks all for the feedback!

I see I wasn't clear enough. This is an easy solution that could enable them right now but doesn't in any way rule out other possibilities to make those intrinsics work in the future. For instance the example with some being always available: those can be software emulated instead of failing. There are multiple ways to do this: implement it in the compiler directly (miri already implements a bunch) or implement it in core using const_eval_select. (Which I knew about but I also know the problems with it.)

The cool thing about this approach is that it bypasses difficult questions like "what the API of const_eval_select should be?", "should it be possible for const code to behave differently?", "do we spend shitton of time to implement all intrinsics in const and if not which go first?", "do we want to bloat the compiler by implementing all intrinsics? if not, which will enjoy favorism?"

The thing with some features being always on justifies implementing them in const IMO (last two questions). This can be postponed or expanded in the future.

farnz · July 17, 2024, 11:20am

The challenge here is that I can make any feature always-on with -C target-features. So ultimately, the end state has to be that all SIMD intrinsics are implemented in a const-friendly fashion for use in const fn when evaluating at compile time (although compile-time evaluation does not need to be SIMD).

I like @kornel 's suggest of is_const_feature_detected, personally - then is_const_feature_detected is true if the feature is guaranteed available at runtime and implemented in a const form, or false otherwise.

So, on x86-64-v2 target CPUs, is_x86_const_feature_detected("sse2") is only false until SSE2 intrinsics can be evaluated at compile time (once that's possible, it'll be true), while is_x86_const_feature_detected("avx512f") will be false because it could be false or true at runtime. Meanwhile, is_x86_feature_detected("sse4.2") will be true regardless, because it's always true on that CPU type, while is_x86_feature_detected("avx512f") will be checked at runtime.

Kixunil · July 17, 2024, 11:40am

Isn't the purpose of that override to compile for specific CPU and remove runtime checks? In that case it'd still make sense to write "conditional" code and still return false from const for unimplemented things.

Or to put it differently: target-features literally says the word "target" which means the machine the resulting binary will run on. The const code is not running on that machine, it's running on the machine that's compiling the code. (And even that is interpreted, so it's actually a VM.) It's a different platform just like build scripts and proc macros are. You wouldn't expect target-features to apply to those, would you?

Anyway, it looks like I've misunderstood your original comment and I thought the reason for hard-coding true for some combinations was "all platforms have it so don't bother". That made me realize that my comment about implementing some intrinsics in const doesn't even make sense implementing intrinsics in const is not needed unless they are much more optimal (ironically this might apply to complex niche intrinsics like sha256rnds2) or people just don't want to be bothered writing more code just to allow yet another target - the const one but they need to support it. However it looks like portable SIMD is a better solution here.

farnz · July 17, 2024, 12:20pm

If I've understood you properly, the goal is to be able to write a const fn do_something_simdy() that can be evaluated at compile time regardless of CPU features on the host or target, and that will make use of SIMD intrinsics if it's executed on the target outside a const context. If this isn't your goal, then the rest of what I'm saying will make no sense.

The problem then is that I want codegen to know what target-features are set this time, so that it can (for example) not bother generating a scalar version if the build will only ever run on AVX2 platforms. But I also don't want to have to write compile-time evaluation of every possible SIMD intrinsic for the compiler, because that's a lot of work when do_something_simdy will contain a fallback implementation without any SIMD intrinsics that I could use at compile time.

This gives me two subtly different is_*_feature_detected meanings:

Is this feature available on the target, at runtime?
Can the compiler evaluate the relevant intrinsics at compile time?

Now, if const fn do_something_simdy is only ever evaluated at compile time, only meaning 2 makes sense. Equally, if it's only ever executed at runtime, only meaning 1 makes sense. But the reason your proposal is compelling is that when do_something_simdy is evaluated at compile time in some places, and executed at runtime in other places, we want both meanings available, depending on whether I'm evaluating at compile time (where it should be meaning 2), or whether I'm doing codegen for a runtime use (where it should be meaning 1).

And this is what I'm getting at - and why I'd prefer the clarity of two separate macros. You want to be able to write one function that's evaluated at compile time, executed without SIMD on a CPU without suitable SIMD instructions, executed with SIMD on a CPU with suitable instructions, and that's optimized in the knowledge that the CPU will always have suitable instructions where that's available.

Kixunil · July 17, 2024, 12:45pm

Just the host, there's no "compile time" on target.

Yes, this should be possible (though it might require some compiler magic). The codegen meant to run on target gets fixed true passed to if and the optimizer throws it out but the interpreted code will see it as false and avoid running unsupported intrinsics.

Right. Today you can't even mark it as const so you don't have to do anything but also can't do anything. If you continue to not mark it const you still don't have to do anything. If you want it const then you can either accept the cost or make a proposal to Rust itself to support your selected intrinsics which would then start returning true in const eval and not fail when called. However this is a lot of work for the compiler instead and one that has to be supported forever, so it's understandable that it might be undesirable.

Both meanings make sense because "compile time" is really just "a special target that's a compiler-implemented VM running on host". You could theoretically implement const eval by simply copying the const code to a parallel crate, compile it for host just as you would a proc macro and run it on host.

"Is this feature available on whatever machine is executing this code?" makes sense - it's either target or host they might have different features or same. "Can the compiler evaluate the relevant intrinsics at compile time?" makes sense because it's really "is this feature available on const VM"?

You don't need two macros for this. Just change the macro expansion appropriately depending on target, where by target I also include the VM that executes const evaluation.

farnz · July 17, 2024, 1:40pm

My concern with doing this is that, absent more context, there's two reasonable interpretations of the following const fn in a const evaluation context:

const fn can_simdify() {
    is_x86_feature_detected!("avx512f")
    && is_x86_feature_detected!("avx512vl")
    && is_x86_feature_detected!("avx512vp2intersect")
}

One is to interpret it as "am I guaranteed access to these AVX-512 features without runtime checks in code being used at runtime?" and the other is "can I use all three of these features right now?"

Without further context, you can't tell me which of the two interpretations is intended - and Rust in general (but especially when it's performance-relevant) tries quite hard to allow purely local reasoning.

And the first is a useful meaning in a situation where the cost of runtime checks for a feature outweighs the benefit of using it; "avx512vp2intersect", for example, could well be such a feature, where using it saves single-digit clock cycles as compared to not using it in some situations. Since we're talking about performance optimizations (and not correctness), it seems important to me to be able to make that distinction if it matters.

bascule · July 17, 2024, 5:30pm

I'll start by saying that this would be super handy in cryptographic code of all sorts, and beyond SIMD intrinsics, it would be nice to be able to use asm! in non const fn contexts, with a portable fallback in const contexts. In e.g. elliptic curve libraries, it's nice to have the full curve/field arithmetic available for calculating constants at compile time, but also be able to provide optimized implementations, including things that aren't currently possible with even SIMD intrinsics (e.g. ADX/MULX).

There are a few problems with using is_*_feature_detected for this I think, and has already been mentioned a stable API for const_eval_select or thereabouts seems like the real solution.

is_*_feature_detected is implemented in the std_detect crate, which means that at a minimum to make is_*_feature_detected into something const fn-friendly would require a stable const_eval_select API to begin with.

Kixunil · July 18, 2024, 8:01pm

The current implementation has the latter meaning, so changing it would be backwards-incompatible except for it simply returning true which gets optimized out.

Right but most people would likely group this to a bigger block. Or outside of a hot loop. If not then you can't really perceive the time difference so nobody cares.

Exactly, that's actually my motivation. You can pre-compute hash midstate in const fn (for tagged hashes) and then efficiently hash whatever you want at runtime.

I'd love to read it if you can find the link.

This is actually super complicated. The API doesn't look good as many people said, there's a question whether const code should be even allowed to output different things etc. Long term yes, it's superior but short term my solution is a workable stopgap.

Ddystopia · July 23, 2024, 6:47am

It would be really great to have more things during compile time, I hope it will get us a step closer to it

Ddystopia · July 23, 2024, 6:50am

Using asm in const sounds like untenable task - should rust use qemu of something like that? This is already quite unsafe and sounds just like a good use case for const_eval_select.

bascule · July 23, 2024, 5:29pm

The purpose of const_eval_select is to allow two different codepaths in const fn contexts: one which is allowed to use SIMD/asm!/etc when called from a regular fn, and one which is const fn and selected if the function is being evaluated in an actual const context

bascule · July 23, 2024, 5:31pm

I spelled out why your solution doesn't work. is_x86_feature_detected! is implemented in a third party crate (std_detect), not the compiler itself. It would need an API like const_eval_select to call to even offer this functionality.

And it's a non-solution for no_std purposes.

Kixunil · August 2, 2024, 7:30am

Why not just wrap the macro to call const_eval_select inside?

Nemo157 · August 2, 2024, 8:11am

const_eval_select! takes two callbacks to choose between depending on the current environment, so it can't be used to return a boolean result, it could be used to build a x86_feature_select! with a similar api instead (one runtime-only callback when the feature is active, one compile-time or runtime-without-feature callback to use otherwise).

bascule · August 2, 2024, 12:38pm

That's literally what I was suggesting:

It would need an API like const_eval_select to call to even offer this functionality.

(in fact, I suggested it all the way back in my first post)

But to do that, you need a stable const_eval_select-like API first. So really that seems like the priority.

Kixunil · August 2, 2024, 1:47pm

It can, it has a generic parameter. The const function can just return false unconditionally and the non-const one can call into the macro.

No need to stabilize it first, just write the macro to call into the unstable API like other macros do it (e.g. pin!).

Topic		Replies	Views
Getting explicit SIMD on stable Rust	335	47414	November 1, 2017
Pre-RFC: contextual target feature detection libs	15	984	June 12, 2023
Adding #[cfg(const)]? language design	5	1603	November 5, 2020
[pre-RFC] [idea] Detecting const evaluation language design	5	924	November 13, 2019
How to handle constants in the compiler compiler	18	4867	October 19, 2015

SIMD in `const` easily

Related topics