Rustc copying cryptographic keys onto the stack instead of using them via pointer

bascule · September 26, 2023, 1:44pm

I thought I'd shine a light on a particular problem we're running into in the RustCrypto project, namely an issue with the hardware accelerated backends of the aes crate where rustc seems to be making copies of cryptographic keys on the stack behind our back:

github.com/rust-lang/rust

Unnecessary stack usage

opened 11:21AM - 14 Sep 21 UTC

newpavlov

A-LLVM I-slow T-compiler

I have the following simple SIMD-powered function which tests whether points are… inside one of bounding boxes: ```rust pub unsafe fn foo( x: &[__m256i; N], y: &[__m256i; N], z: &[__m256i; N], bboxes: &[[__m256i; 6]], ) -> [__m256i; N] { let mut res = [_mm256_setzero_si256(); N]; for bbox in bboxes { for i in 0..N { let tx = _mm256_and_si256( _mm256_cmpgt_epi32(x[i], bbox[0]), _mm256_cmpgt_epi32(bbox[1], x[i]), ); let ty = _mm256_and_si256( _mm256_cmpgt_epi32(y[i], bbox[2]), _mm256_cmpgt_epi32(bbox[3], y[i]), ); let t = _mm256_and_si256(tx, ty); let tz = _mm256_and_si256( _mm256_cmpgt_epi32(z[i], bbox[4]), _mm256_cmpgt_epi32(bbox[5], z[i]), ); let t = _mm256_and_si256(t, tz); res[i] = _mm256_or_si256(res[i], t); } } res } ``` By inspecting the [generated assembly](https://rust.godbolt.org/z/6fahjYKGP) we can see that for some reason it caches coordinates to stack and reads them from it each iteration instead of using the input pointers. The same behavior can be [observed](https://rust.godbolt.org/z/c5hGxsrvP) for a function which processes coordinate slices. This caching looks quite redundant to me, especially considering that `noalias` is enabled (i.e. compiler should know that memory at which coordinates are stored can not change during function execution). It looks like LLVM correctly moves coordinate loads from the inner loop using its infinite virtual registers. And it's exactly the behavior we want when there is enough physical registers. But when it's not true, it spills virtual register values to stack instead of relying on the original locations. On Rust 1.51 code from the first link does not have this issue, but not from the second one. UPD: See this [comment](https://github.com/rust-lang/rust/issues/88930#issuecomment-1734619004) for additional example affecting cryptographic code..

Anyone have insights into this particular problem? It's something we'd definitely like to avoid.

BD103 · September 26, 2023, 6:33pm

I know that Rust in general prefers the stack over the heap, but I unfortunately can't give any insight on how to fix it. Something similar is Box::new() allocating on the stack before copying to the heap, but I'm not educated enough on these issues to help.

newpavlov · September 26, 2023, 7:40pm

One potential reason why compilers behave like this could be a variation of register renaming. I think I read somewhere that CPU may assign stack location to a physical register (high performance CPUs usually have more "physical" registers, than "logical" ones accessible according to its ISA), i.e. movdqa xmm0, xmmword ptr [rsp - 16] could be executed as movdqa xmm0, physN, where physN is a physical register which caches [rsp - 16]. This optimization apparently could be done only for stack registers and it will not work with GP registers.

But after a brief search I couldn't find a confirmation that such optimization indeed exists, so take it with a sizable grain of salt.

programmerjake · September 26, 2023, 8:43pm

compilers usually load values into virtual registers (SSA values) before operating on them, if there aren't enough physical registers at a particular point in the program, the rest of the virtual registers will usually be stored on the stack -- this is called spilling.

newpavlov · September 26, 2023, 8:53pm

Please read the issue description linked in the OP, in it I wrote exactly about that. The question is: why compiler uses stack spilling instead of using original location directly. The comment above states a potential answer to it. Another (most likely) answer: it's just suboptimal compilation of the currently used version of LLVM.

HeroicKatora · September 26, 2023, 9:22pm

Any secret exposed to Rust as a variable is at risk of getting leaked through side channels out-of-scope of the Rust language model. That's just it, neither Rust nor the remainder of the compiler pipeline guarantees absence of 'superficial' effects outside their model and this is just bound to slip in. This issue specifically seems like an ice berg tip shaped straw that breaks the camel's back of practicality, the practicality by which avoidance of this fundamental mismatch is justified. Using assembly will only seem more and more attractive the better the optimizer gets, absent fundamental compiler changes that the linked paper is at best a puzzle piece for.

It doesn't even stop with register / memory use. Instructions on x86 can exhibit data-dependent operand timings. Surely the Rust compiler won't add the DOITM bits to every function prelude. In particular, even pure-cryptography instructions such as AESDEC are not guaranteed constant time unless some setup is done. How would you guarantee absence of Rust-generated instructions operating on bytes you regard as secret if those values are Rust code?

I think trying to resolve this issues within Rust without whole classes of new intrinsics is a quest for some Dulcinea.

quaternic · September 26, 2023, 9:32pm

From that linked comment:

The keys are accessed through a shared reference, so the compiler knows that the keys can not change under its foot.

That could be part of the problem. It's the only reason the compiler is allowed to cache the value in a register. I'm assuming the code in question is just always using the data through indexed accesses, but I couldn't easily find what the actual Rust source looks like.

Yes, a sufficiently smart compiler wouldn't cache them to the stack, but lacking that, the code could always be written to be more explicit about when values are loaded from memory. That seems warranted for the crypto-routines that might care about performance and not making unnecessary copies a lot more than your average piece of code.

bascule · October 2, 2023, 11:44am

The question here is less whether or not rustc is "allowed" to do this and more about how to avoid or otherwise work around this very specific problem in our code, and we are aware that even if we do successfully work around it in this case, it could be changed by future changes to compiler behavior.

All that said, the compiler copying encryption keys onto the stack every time encryption is performed is something we'd like to avoid if all possible, if anyone has insights as to how to do it here.

scottmcm · October 2, 2023, 11:23pm

Well, first I'd suggest making them non-Copy if they're not already, just to force extra care about it.

Then if you use them only via indirection, I don't think the rust side of things will introduce new copies to stack. So you'll end up with the LLVM IR that rustc emits, and you can file bugs in LLVM if LLVM's optimizers add spurious stack copies of them.

But it's most likely not something that'd be fixed on the rust side, if it's about whether things stay in registers or not.

bascule · October 3, 2023, 2:48am

The problem happening here is specifically that the type is a register type (i.e. __m128i on x86/x86_64), which is a Copy type, and rustc is making copies instead of keeping the data in registers.

comex · October 3, 2023, 3:02am

Then if you use them only via indirection, I don't think the rust side of things will introduce new copies to stack. So you'll end up with the LLVM IR that rustc emits, and you can file bugs in LLVM if LLVM's optimizers add spurious stack copies of them.

Even with the very direct code

_mm256_cmpgt_epi32(x[i], bbox[0]),

rustc does initially produce IR containing allocas:

%34 = alloca <4 x i64>, align 32
[..]
%52 = getelementptr inbounds [4 x <4 x i64>], ptr %x, i64 0, i64 %i
%_18 = load <4 x i64>, ptr %52, align 32
[..]
store <4 x i64> %_18, ptr %34, align 32
[..]
call void @_ZN4core9core_arch3x864avx218_mm256_cmpgt_epi3217h9610cc6e0e2aae81E(ptr noalias nocapture noundef sret(<4 x i64>) align 32 dereferenceable(32) %35, ptr noalias nocapture noundef align 32 dereferenceable(32) %34, ptr noalias         nocapture noundef align 32 dereferenceable(32) %33)

But that's likely not the actual problem, because the allocas do get eliminated by SROAPass:

%11 = getelementptr inbounds [4 x <4 x i64>], ptr %x, i64 0, i64 %_13.sroa.6.2
%_18 = load <4 x i64>, ptr %11, align 32
[..]
%12 = bitcast <4 x i64> %_18 to <8 x i32>
[..]
%14 = icmp sgt <8 x i32> %12, %13

and they never get reintroduced at the LLVM IR level. But ~~SCCPPass~~GVNPass hoists the load out of the loop and, like @newpavlov said, the stack copies are thanks to virtual registers being spilled to stack.

It's possible that more-straightforward LLVM IR (that doesn't require inlining a bunch of intrinsic wrapper functions, that doesn't use temporary allocas; etc.) might produce better output. But it's also possible that it wouldn't. And it's not like there's a good way to produce that more-straightforward IR from Rust.

This is really something that should be reported to LLVM. (Though, as others have said, while this codegen is clearly suboptimal, you also should not be expecting LLVM to avoid leaking sensitive data to the stack.)

But if you really want it to work for now, there's a simple, albeit hacky workaround. Just prevent LLVM from knowing that the loads produce the same value on each iteration:

     let mut res = [_mm256_setzero_si256(); N];
     for bbox in bboxes {
         for i in 0..N {
+            let x = std::hint::black_box(x);
+            let y = std::hint::black_box(y);
+            let z = std::hint::black_box(z);
             let tx = _mm256_and_si256(
                 _mm256_cmpgt_epi32(x[i], bbox[0]),
                 _mm256_cmpgt_epi32(bbox[1], x[i]),

(Though that does unnecessarily load and store the pointer to the stack. It's possible to avoid this but it requires code that’s a bit more complicated.)

tczajka · October 3, 2023, 8:17am

This code should be written in assembly.

It's already platform specific (x86-64), it contains a bunch of SIMD operations, and you expect a specific translation into assembly. The assembly code isn't going to be much harder to write than doing the same in Rust and then verifying the assembly is what you want. Each SIMD operation is one assembly instruction. Also writing this in assembly will guarantee the machine code doesn't change with a new Rust version.

HeroicKatora · October 3, 2023, 11:03am

The contemporary approach here, imo, would be to utilize the semantic models of the target ISA and certify the resulting assembly for the exact formulation of leak-freedom you want. Hence also the pointer to the paper. (Granted, that discusses more advanced how to certify by proof the property for all compiler invocations instead of a single one and changing llvm is probably not the approach you want.) It could be somewhat feasible with cargo-asm and some combination of types / macro annotations to specificy the intended information secrecy.

Such as, when you pass multiple buffers to key material (with my recommendation of not passing this as a pure reference, but maybe the approach makes this unecessary), it's okay when data is transferred from one of the buffers to the other; but not when data from one of the buffers ends up in a non-zeroed register.

Results from such a tool would be, at best, sometimes actionable. For instance, you might add XMM-zeroing inline blocks after noticing the compiler making use of them to move data. If working correctly, the tool likely finds memcpy having data paths loading into XMM0-4 and never cleaning them up as they are caller preserved on most (all?) x86 targets. Or lint on the memcpy in the first place, but I think this is harder as banning all high-level function abstractions instead of considering them in the model seems detrimental to the approach in the long run.

the8472 · October 3, 2023, 6:30pm

Would using a pointer instead of a reference work?

scottmcm · October 3, 2023, 7:45pm

Probably not just a pointer, because without synchronization it can still optimize since racing on the data would be UB.

Maybe weakest-possible atomic reads? Though it'd still depend on LLVM's alias/escape analysis being unable to prove that changes are impossible anyway...

bascule · October 3, 2023, 9:33pm

We are definitely aware of and already provide assembly as an option, even going so far as to reimplement unstabilized intrinsics using asm! so we can use them on stable Rust.

But that's a somewhat unhelpful answer to this particular problem. These cryptographic intrinsics are a first-class feature of the Rust programming language, and it would be a really sad state of affairs to say they're unsuitable for their purpose.

tczajka · October 4, 2023, 6:12am

I don't know what "first-class feature" means in this context. I'd say the intrinsics are there for completeness along with all the other SIMD instructions, but that doesn't mean they necessarily are the recommended way to implement secure cryptographic code.

Also there are uses of cryptographic primitives outside of secure cryptographic code, for random number generators and other things, where they don't have the same security requirements for constant-time execution and storing variables in registers.

You stated a problem and I offered what I saw as the best solution. Regular Rust doesn't seem to have the right tools for this niche. Perhaps one could build a domain-specific language using Rust macros for such secure computation.

bascule · October 4, 2023, 11:46am

These particular instructions literally have "AES" in their name. That is their purpose. And speaking as a cryptographic expert, you should not be using cryptography for non-cryptographic applications.

I think our preference would be to find solutions within the Rust language itself, rather than to rewrite all of our Rust code in assembly.

tczajka · October 4, 2023, 11:52am

Well this is a bit off topic, but this is just false. For instance the rand crate's default random number generator uses ChaCha12, a cryptographic primitive. One could also use AES for this. That's the proper way to create a pseudo-random generator if you care about statistical quality of the pseudo-random numbers, not necessarily for cryptographic purposes. Non-cryptographically-strong PRNGs have statistical biases, basically by definition.

bascule · October 4, 2023, 11:55am

It is explicitly marked as a CryptoRng, meaning it is safe to use in cryptographic operations: ChaCha12Rng in rand_chacha - Rust

RNGs which don't have cryptographic requirements can go much faster.

Also this is very much off topic for this thread.

Topic		Replies	Views
A new crypto library libs	7	1984	February 9, 2021
SIMD now available in libstd on nightly! libs	15	9244	March 25, 2019
Safe trasnsmute for slices; e.g. &[u64] -> &[u32], particularly SIMD types libs	12	5323	March 25, 2019
Optimization barriers suitable for cryptographic use language design	83	2235	November 4, 2024
Why is a trusted, feature-complete crypto library not a top priority for the Rust community? libs	19	21042	March 25, 2019

Rustc copying cryptographic keys onto the stack instead of using them via pointer

Related topics