Rustc copying cryptographic keys onto the stack instead of using them via pointer

tczajka · October 4, 2023, 12:28pm

That is true, but performance isn't the only criterion for the quality of RNGs.

In fact, speed is rarely that important in that context since random number generation is not usually the bottleneck of whatever randomized algorithm or Monte Carlo simulation you're running. Crypto RNGs can generate bits pretty fast.

But what you do often care about is the statistical quality of the random numbers, i.e. (algorithmic) indistinguishability from true randomness. Which is precisely the definition of what a cryptographically strong RNG is.

HeroicKatora · October 4, 2023, 12:36pm

It tooks a while for me to realize just how cursed the problem is. The intrinsic's signature is of course already unfortunate:

pub unsafe fn _mm_aesdec_si128(a: __m128i, round_key: __m128i) -> __m128i

This of course makes it impossible to heed the caution against turning cryptographic material into Rust values.

I tend to agree with calling this a 'bug' of sorts in the standard library's way to provide these functions, even though the library itself is pretty innocent in only following Intel's own header files for C. A language which .. comes with the same problem. It's tragic that it makes the intrinsic dubious in its most pressing use case.

newpavlov · October 4, 2023, 1:23pm

I disagree. The intrinsic definition is fine, according to ABI all its inputs and outputs will be passed in registers. You can see it clearly in the LLVM IR in the @comex's comment above. The problem is with optimization passes used by LLVM.

github.com/llvm/llvm-project

Unnecessary stack usage

opened 01:20AM - 22 Jan 22 UTC

newpavlov

backend:RISC-V backend:X86

Initially reported in rust-lang/rust#88930 I have the following simple SIMD-p…owered function which tests whether points are inside one of bounding boxes written in Rust (should be trivial to translate it to a different language): ```rust pub unsafe fn foo( x: &[__m256i; N], y: &[__m256i; N], z: &[__m256i; N], bboxes: &[[__m256i; 6]], ) -> [__m256i; N] { let mut res = [_mm256_setzero_si256(); N]; for bbox in bboxes { for i in 0..N { let tx = _mm256_and_si256( _mm256_cmpgt_epi32(x[i], bbox[0]), _mm256_cmpgt_epi32(bbox[1], x[i]), ); let ty = _mm256_and_si256( _mm256_cmpgt_epi32(y[i], bbox[2]), _mm256_cmpgt_epi32(bbox[3], y[i]), ); let t = _mm256_and_si256(tx, ty); let tz = _mm256_and_si256( _mm256_cmpgt_epi32(z[i], bbox[4]), _mm256_cmpgt_epi32(bbox[5], z[i]), ); let t = _mm256_and_si256(t, tz); res[i] = _mm256_or_si256(res[i], t); } } res } ``` By inspecting the [generated assembly](https://rust.godbolt.org/z/6fahjYKGP) we can see that for some reason it caches coordinates to stack and reads them from it each iteration instead of using the input pointers. The same behavior can be [observed](https://rust.godbolt.org/z/c5hGxsrvP) for a function which processes coordinate slices. This caching looks quite redundant to me, especially considering that `noalias` is enabled (i.e. compiler should know that memory at which coordinates are stored can not change during function execution). It looks like LLVM correctly moves coordinate loads from the inner loop using its infinite virtual registers. And it's exactly the behavior we want when there is enough physical registers. But when it's not true, it spills virtual register values to stack instead of relying on the original locations. Looks like this issue also [affects](https://github.com/RustCrypto/hashes/issues/328) SHA-2 implementation on RISC-V targets.

HeroicKatora · October 4, 2023, 2:55pm

I don't quite follow. The original definition is certainly not fine. It may be that the ABI itself is permissible (both extern "C" in core #[inline] "rust" assign registers here, that's hopefully right) but that is only a necessary and not a sufficient condition. Further it already leaks the values out of the registers as in initially containing allocas.

The value semantics of Rust don't allow denoting places which are allocated in registers, as places need addresses. As soon as execution gets into the intrinsic implementation, the argument may be moved onto the stack as none of the semantics require the value passed in the arguments to stay in registers. This level of guarantee is only good enough if there's a reasonable way to enforce the optimization or check whether it has take place .. and to quote:

And it's not like there's a good way to produce that more-straightforward IR from Rust.

The full unoptimized IR for the exposed intrinsic is as follows: godbolt

use std::arch::x86_64::*;

pub unsafe fn foo(x: __m256i, bbox: __m256i) -> __m256i {
    _mm256_cmpgt_epi32(x, bbox)
}

define internal void @core::core_arch::x86::avx2::_mm256_cmpgt_epi32(ptr sret(<4 x i64>) align 32 %_0, ptr align 32 %a, ptr align 32 %b) unnamed_addr {
start:
  %0 = alloca <8 x i32>, align 32
  %_5 = alloca <8 x i32>, align 32
  %_4 = alloca <8 x i32>, align 32
  %1 = load <4 x i64>, ptr %a, align 32
  store <4 x i64> %1, ptr %_4, align 32
  %2 = load <4 x i64>, ptr %b, align 32
  store <4 x i64> %2, ptr %_5, align 32
  %3 = load <8 x i32>, ptr %_4, align 32
  %4 = load <8 x i32>, ptr %_5, align 32
  %5 = icmp sgt <8 x i32> %3, %4
  %6 = sext <8 x i1> %5 to <8 x i32>
  store <8 x i32> %6, ptr %0, align 32
  %_3 = load <8 x i32>, ptr %0, align 32
  store <8 x i32> %_3, ptr %_0, align 32
  ret void
}

Many stack moves, a type change from 4 × i64 to 8 × i32 and all those addresses involved which demonstrate a lack of enforcement of register use. The original function arguments are even defined as pointers to values and not as registers in that level of ABI. These are all obviously wrong for the sake of any semblence of leak-freedom. Imho the implementation of these intrinsics would need to utilize inline assembly itself to truly promise absence of stack spillage / a direct translation to the intrinsic. And even that will be wrong on the caller side where no value used in arguments can be certified to keep being allocated within a register without spillage.

None of this is to say that rustc or coreis at some kind of fault here. If anything, I think C is in the exact same predicament regarding actual guarantees around intrinsics; and its silent saving grace is the lack of wide-spread dereferencable attributes that ensure no improper optimizations involving 'deduplication' of a loop constant to the stack for 'efficiency' as here.

comex · October 5, 2023, 10:22pm

I agree it would be nice if rustc generated IR that didn't use pointers for intrinsics like this.

But on the other hand, I'll reiterate that 'secret values will not be written to the stack' is fundamentally not a security property that LLVM tries to guarantee. The same is true for GCC and other highly optimizing backends. This is an extreme case because the codegen is not just insecure but also inefficient – but if it weren't inefficient, I wouldn't call this a bug in LLVM or rustc.

Also, direct stores to the stack are not the only way values can be leaked. If you look at the generated assembly, it doesn't just leave secret values on the stack, it also leaves parts of them in caller-saved xmm registers. Depending on the attack scenario, it may be possible to retrieve those registers directly, or they may be spilled to the stack by another operation (e.g. C variadic functions tend to dump xmm0-xmm7 to the stack). If on the other hand the function uses callee-saved xmm registers, then those will be restored once the function returns, but if the function calls something else first, the second function may save them to the stack. Even if the original function does not make any calls, it might do so after inlining or outlining. (The x86-64 Unix ABI doesn't actually make any xmm registers callee-saved, but the x86-64 Windows ABI does, and – if you happen to be writing similar code on ARM – the standard ARM ABI makes some vector registers callee-saved.)

Ideally LLVM would have an option to zero the stack and registers before returning, which would effectively mitigate most kinds of leakage, including the stack spills originally being discussed. This has been proposed before in the context of Rust, and (entirely separately) someone made a prototype patch to LLVM a few years ago, to implement this for C. But as far as I know there's nothing upstream. In lieu of that, if you want to reliably prevent key material from being leaked to the stack, you really should be using functions written entirely in assembly.

Though there is one more source of leakage: signal handlers, which can interrupt any code at any point and can store both caller-saved and callee-saved registers to the stack. There is no good way to prevent that, if you are a library running in some arbitrary program that might be using signal handlers. At best you could clean some region below the stack before returning, in the hope that it covers whatever a signal handler might have been using, if one happened to interrupt your function. But that won't work if the signal handler is using sigaltstack, or other OS-specific signal-like mechanisms that might save data somewhere other than the current stack. Of course, the fact that there may be leakage you can't control doesn't mean you shouldn't try to control what you can control, but… just saying.

robinm · October 5, 2023, 11:13pm

If you write a function entirely in assembly to prevent your compiler to leak any data, couldn't the linker, especially when using LTO, rewrite your assembly and expose your secrets?

scottmcm · October 6, 2023, 12:05am

Normal LTO pipelines work at a higher level than assembly, so in practice no. ASM is a horrible level for optimizers to work at, because the side effects are just so pervasive.

(That's why LTO and BOLT are two different things, and why BOLT is about layout, but doesn't really change instructions other than branches.)

the8472 · October 6, 2023, 11:51am

Though things like wasm JITs could still screw things up since there you're handing off your already linked code to another level of optimizations.

bascule · October 6, 2023, 2:19pm

I'm reminded of the secret types RFC although unfortunately it was only for the core integer types and not SIMD registers (and in that regard, a generic-based syntax like Secret<__m128i> might be nice for future extensibility).

Though now that I re-read it:

Out of scope: Memory zeroing (beyond implementing the drop trait) and register spilling

system · January 4, 2024, 2:19pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Safe trasnsmute for slices; e.g. &[u64] -> &[u32], particularly SIMD types libs	12	5236	March 25, 2019
Operational semantics and high-level vs low-level language design	24	2775	July 4, 2020
Stacked Borrows: An Aliasing Model For Rust Unsafe Code Guidelines	49	7238	March 25, 2019
Exploit the padding?	50	2379	September 28, 2021
Pre-RFC: bless the `type FastHashMap` pattern	92	3844	March 25, 2019

Rustc copying cryptographic keys onto the stack instead of using them via pointer

Related Topics