Then if you use them only via indirection, I don't think the rust side of things will introduce new copies to stack. So you'll end up with the LLVM IR that rustc emits, and you can file bugs in LLVM if LLVM's optimizers add spurious stack copies of them.
Even with the very direct code
_mm256_cmpgt_epi32(x[i], bbox[0]),
rustc does initially produce IR containing allocas:
%34 = alloca <4 x i64>, align 32
[..]
%52 = getelementptr inbounds [4 x <4 x i64>], ptr %x, i64 0, i64 %i
%_18 = load <4 x i64>, ptr %52, align 32
[..]
store <4 x i64> %_18, ptr %34, align 32
[..]
call void @_ZN4core9core_arch3x864avx218_mm256_cmpgt_epi3217h9610cc6e0e2aae81E(ptr noalias nocapture noundef sret(<4 x i64>) align 32 dereferenceable(32) %35, ptr noalias nocapture noundef align 32 dereferenceable(32) %34, ptr noalias nocapture noundef align 32 dereferenceable(32) %33)
But that's likely not the actual problem, because the allocas do get eliminated by SROAPass
:
%11 = getelementptr inbounds [4 x <4 x i64>], ptr %x, i64 0, i64 %_13.sroa.6.2
%_18 = load <4 x i64>, ptr %11, align 32
[..]
%12 = bitcast <4 x i64> %_18 to <8 x i32>
[..]
%14 = icmp sgt <8 x i32> %12, %13
and they never get reintroduced at the LLVM IR level. But SCCPPassGVNPass
hoists the load out of the loop and, like @newpavlov said, the stack copies are thanks to virtual registers being spilled to stack.
It's possible that more-straightforward LLVM IR (that doesn't require inlining a bunch of intrinsic wrapper functions, that doesn't use temporary allocas; etc.) might produce better output. But it's also possible that it wouldn't. And it's not like there's a good way to produce that more-straightforward IR from Rust.
This is really something that should be reported to LLVM. (Though, as others have said, while this codegen is clearly suboptimal, you also should not be expecting LLVM to avoid leaking sensitive data to the stack.)
But if you really want it to work for now, there's a simple, albeit hacky workaround. Just prevent LLVM from knowing that the loads produce the same value on each iteration:
let mut res = [_mm256_setzero_si256(); N];
for bbox in bboxes {
for i in 0..N {
+ let x = std::hint::black_box(x);
+ let y = std::hint::black_box(y);
+ let z = std::hint::black_box(z);
let tx = _mm256_and_si256(
_mm256_cmpgt_epi32(x[i], bbox[0]),
_mm256_cmpgt_epi32(bbox[1], x[i]),
(Though that does unnecessarily load and store the pointer to the stack. It's possible to avoid this but it requires code that’s a bit more complicated.)