Control over CPU cache

Are there nice mechanisms to explicitly declare that a type, allocation, or data loaded by a function would probably be used briefly and then not again for some time?

In some cases, you could maybe achieve this by altering file formats and using mmap: you split the file format into two related files, placing repeatedly used data in the first file and data used only once in the second. You then mmap both files, perform all operations that require both, and then munmap the file used only once. This might do nothing, but it would provide information the kernel could use if the architecture supported page-level cache instructions.
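A rough sketch of that two-file idea on Linux, using raw mmap/munmap bindings so it stays self-contained (the file names, sizes, and the "work" done with both mappings are all invented for illustration):

```rust
// Map a "hot" file (repeatedly used) and a "cold" file (used once),
// do the combined work, then unmap only the cold one early.
use std::fs::File;
use std::os::unix::io::AsRawFd;

// Minimal bindings so the example needs no crates (Linux x86-64).
extern "C" {
    fn mmap(addr: *mut u8, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut u8;
    fn munmap(addr: *mut u8, len: usize) -> i32;
}
const PROT_READ: i32 = 0x1;
const MAP_PRIVATE: i32 = 0x2;

unsafe fn map_readonly(f: &File) -> (*mut u8, usize) {
    let len = f.metadata().unwrap().len() as usize;
    let p = mmap(std::ptr::null_mut(), len, PROT_READ, MAP_PRIVATE, f.as_raw_fd(), 0);
    assert!(p as isize != -1, "mmap failed");
    (p, len)
}

fn main() {
    let dir = std::env::temp_dir();
    let hot_path = dir.join("hot.bin");
    let cold_path = dir.join("cold.bin");
    std::fs::write(&hot_path, vec![1u8; 4096]).unwrap();
    std::fs::write(&cold_path, vec![2u8; 4096]).unwrap();
    let hot_f = File::open(&hot_path).unwrap();
    let cold_f = File::open(&cold_path).unwrap();
    unsafe {
        let (hot, hot_len) = map_readonly(&hot_f);
        let (cold, cold_len) = map_readonly(&cold_f);
        // ... perform the operations that require both mappings ...
        let sum = *hot as u32 + *cold as u32;
        assert_eq!(sum, 3);
        // Drop the once-used mapping early; the hot mapping stays live.
        assert_eq!(munmap(cold, cold_len), 0);
        // ... keep working with `hot` ...
        assert_eq!(munmap(hot, hot_len), 0);
    }
    std::fs::remove_file(&hot_path).unwrap();
    std::fs::remove_file(&cold_path).unwrap();
}
```

Whether the early munmap actually helps is up to the kernel; as noted below, it can also be expensive in its own right.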

I know some older architectures have load operations that carry cache hints, but that's maybe too fine-grained, since you often do operate on particular data multiple times; the point would be more about saying what could be evicted sooner.

1 Like

I think using this with the NonTemporal variant is supposed to tell the CPU that the data won't be needed again.
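On x86-64, the non-temporal prefetch hint is exposed as a stable intrinsic; a small example (the array and look-ahead distance are made up for illustration, and the hint is purely advisory):

```rust
// Prefetch upcoming elements with the NTA ("non-temporal") hint, asking
// the CPU to fetch them while minimizing cache pollution. The hint may
// be ignored; the loads below behave identically either way.
#[cfg(target_arch = "x86_64")]
fn sum_once(data: &[i32]) -> i64 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA};
    let mut total = 0i64;
    for (i, &x) in data.iter().enumerate() {
        // Prefetch a little ahead of the current position.
        if i + 16 < data.len() {
            unsafe { _mm_prefetch::<_MM_HINT_NTA>(data.as_ptr().add(i + 16) as *const i8) };
        }
        total += x as i64;
    }
    total
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let data: Vec<i32> = (0..1000).collect();
    assert_eq!(sum_once(&data), 499_500);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```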

1 Like

munmap is extremely slow in practice because it has to invalidate the TLB both on the current CPU and on every CPU that might potentially have cached a mapping between the address as seen by software and the physical address as seen by memory (because after a munmap, accesses to the memory in question are supposed to segfault rather than load the values from memory, and programs are allowed to rely on this for correctness). This ends up throwing out a huge number of TLB entries, both for the addresses you no longer need and for a lot of addresses you do still need.

Nontemporal load/store intrinsics are probably the best solution to this sort of problem (especially if they're defined to do regular loads and stores on platforms that don't support the nontemporal version): they're basically loads and stores that don't leave the loaded/stored value in cache, and thus are useful if you know you won't need it again in the near future. Because the value isn't cached in the first place, there are no collateral damage issues with trying to remove the cache entry. (This is related to the nontemporal prefetches mentioned in the other reply, but not the same.)
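A minimal sketch of such a wrapper, assuming the "fall back to a regular store" design; the function name is invented, and on x86-64 it uses the stable `_mm_stream_si32` intrinsic (MOVNTI):

```rust
// Hypothetical portable wrapper: a store that bypasses the cache where
// the target supports it, degrading to an ordinary store elsewhere.
#[inline]
fn store_nontemporal(dst: &mut i32, value: i32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        // MOVNTI: store `value` without allocating a cache line for `dst`.
        std::arch::x86_64::_mm_stream_si32(dst as *mut i32, value);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        *dst = value;
    }
}

fn main() {
    let mut buf = vec![0i32; 1024];
    for (i, slot) in buf.iter_mut().enumerate() {
        store_nontemporal(slot, i as i32);
    }
    // On x86-64, an SFENCE makes the streaming stores above globally
    // visible before the read below.
    #[cfg(target_arch = "x86_64")]
    unsafe { std::arch::x86_64::_mm_sfence() };
    assert_eq!(buf[1023], 1023);
}
```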

The API would have to be carefully designed, because on some platforms nontemporal stores have weird coherency/timing requirements that don't apply to normal stores. (For example, on x86-64, the normal memory store instructions are release-atomic, but the nontemporal instructions aren't.) I'm unclear on whether or not all processors guarantee a nontemporal write can be soundly read back even from the same thread if you don't use a barrier – as such, it would seem sensible for the API to work something like "write this value without caching it, and it's UB to read the resulting memory until you call a nontemporal-barrier instruction".
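The "UB to read until you call a barrier" contract could be sketched as an API shape like the following; every name here is hypothetical (this is not an existing std API), with x86-64's SFENCE standing in for the nontemporal barrier:

```rust
// Sketch of the proposed API contract: nontemporal writes plus an
// explicit barrier, with reads of the written memory deferred until
// after the barrier. All names are hypothetical.
mod nontemporal {
    /// Store `value` without leaving it in cache. Under the proposed
    /// contract, reading `*dst` back (even from this thread) is not
    /// allowed until `barrier()` has been called.
    pub unsafe fn write(dst: *mut i32, value: i32) {
        #[cfg(target_arch = "x86_64")]
        std::arch::x86_64::_mm_stream_si32(dst, value);
        #[cfg(not(target_arch = "x86_64"))]
        std::ptr::write(dst, value);
    }

    /// Make all earlier nontemporal writes visible; after this,
    /// reading the written memory is allowed again.
    pub fn barrier() {
        #[cfg(target_arch = "x86_64")]
        unsafe { std::arch::x86_64::_mm_sfence() };
    }
}

fn main() {
    let mut x = 0i32;
    unsafe { nontemporal::write(&mut x, 7) };
    nontemporal::barrier(); // required before reading `x` back
    assert_eq!(x, 7);
}
```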

2 Likes