Control over CPU cache

Are there nice mechanisms to explicitly declare that a type, allocation, or data loaded by a function would probably be used briefly and then not again for some time?

In some cases, you could maybe achieve this by altering file formats and using mmap: you split the file format into two related files, placing repeatedly used data in the first file and data used only once in the second. You then mmap both files, perform all operations that require both, and then munmap the file used only once. This might do nothing, but it would provide information the kernel could use if the architecture supported page-level cache instructions.
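A rough sketch of that two-file idea on Linux, using raw mmap/munmap bindings so it stays self-contained (the file names, sizes, and the "work" done with both mappings are all invented for illustration):

```rust
// Map a "hot" file (repeatedly used) and a "cold" file (used once),
// do the combined work, then unmap only the cold one early.
use std::fs::File;
use std::os::unix::io::AsRawFd;

// Minimal bindings so the example needs no crates (Linux x86-64).
extern "C" {
    fn mmap(addr: *mut u8, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut u8;
    fn munmap(addr: *mut u8, len: usize) -> i32;
}
const PROT_READ: i32 = 0x1;
const MAP_PRIVATE: i32 = 0x2;

unsafe fn map_readonly(f: &File) -> (*mut u8, usize) {
    let len = f.metadata().unwrap().len() as usize;
    let p = mmap(std::ptr::null_mut(), len, PROT_READ, MAP_PRIVATE, f.as_raw_fd(), 0);
    assert!(p as isize != -1, "mmap failed");
    (p, len)
}

fn main() {
    let dir = std::env::temp_dir();
    let hot_path = dir.join("hot.bin");
    let cold_path = dir.join("cold.bin");
    std::fs::write(&hot_path, vec![1u8; 4096]).unwrap();
    std::fs::write(&cold_path, vec![2u8; 4096]).unwrap();
    let hot_f = File::open(&hot_path).unwrap();
    let cold_f = File::open(&cold_path).unwrap();
    unsafe {
        let (hot, hot_len) = map_readonly(&hot_f);
        let (cold, cold_len) = map_readonly(&cold_f);
        // ... perform the operations that require both mappings ...
        let sum = *hot as u32 + *cold as u32;
        assert_eq!(sum, 3);
        // Drop the once-used mapping early; the hot mapping stays live.
        assert_eq!(munmap(cold, cold_len), 0);
        // ... keep working with `hot` ...
        assert_eq!(munmap(hot, hot_len), 0);
    }
    std::fs::remove_file(&hot_path).unwrap();
    std::fs::remove_file(&cold_path).unwrap();
}
```

Whether the early munmap actually helps is up to the kernel; as noted below, it can also be expensive in its own right.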

I know some older architectures have load operations that carry cache hints, but that's maybe too fine-grained, since you often do operate on particular data multiple times; the point would be more about saying what could be evicted sooner.

1 Like

I think using this with the NonTemporal variant is supposed to tell the CPU that the data won't be needed again.
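On x86-64, the non-temporal prefetch hint is exposed as a stable intrinsic; a small example (the array and look-ahead distance are made up for illustration, and the hint is purely advisory):

```rust
// Prefetch upcoming elements with the NTA ("non-temporal") hint, asking
// the CPU to fetch them while minimizing cache pollution. The hint may
// be ignored; the loads below behave identically either way.
#[cfg(target_arch = "x86_64")]
fn sum_once(data: &[i32]) -> i64 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA};
    let mut total = 0i64;
    for (i, &x) in data.iter().enumerate() {
        // Prefetch a little ahead of the current position.
        if i + 16 < data.len() {
            unsafe { _mm_prefetch::<_MM_HINT_NTA>(data.as_ptr().add(i + 16) as *const i8) };
        }
        total += x as i64;
    }
    total
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let data: Vec<i32> = (0..1000).collect();
    assert_eq!(sum_once(&data), 499_500);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```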

1 Like

munmap is extremely slow in practice because it has to invalidate the TLB both on the current CPU and on every CPU that might potentially have cached a mapping between the address as seen by software and the physical address as seen by memory (because after a munmap, accesses to the memory in question are supposed to segfault rather than load the values from memory, and programs are allowed to rely on this for correctness). This ends up throwing out a huge number of TLB entries, both for the addresses you no longer need and for a lot of addresses you do still need.

Nontemporal load/store intrinsics are probably the best solution to this sort of problem (especially if they're defined to do regular loads and stores on platforms that don't support the nontemporal version): they're basically loads and stores that don't leave the loaded/stored value in cache, and thus are useful if you know you won't need it again in the near future. Because the value isn't cached in the first place, there are no collateral damage issues with trying to remove the cache entry. (This is related to the nontemporal prefetches mentioned in the other reply, but not the same.)
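A minimal sketch of such a wrapper, assuming the "fall back to a regular store" design; the function name is invented, and on x86-64 it uses the stable `_mm_stream_si32` intrinsic (MOVNTI):

```rust
// Hypothetical portable wrapper: a store that bypasses the cache where
// the target supports it, degrading to an ordinary store elsewhere.
#[inline]
fn store_nontemporal(dst: &mut i32, value: i32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        // MOVNTI: store `value` without allocating a cache line for `dst`.
        std::arch::x86_64::_mm_stream_si32(dst as *mut i32, value);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        *dst = value;
    }
}

fn main() {
    let mut buf = vec![0i32; 1024];
    for (i, slot) in buf.iter_mut().enumerate() {
        store_nontemporal(slot, i as i32);
    }
    // On x86-64, an SFENCE makes the streaming stores above globally
    // visible before the read below.
    #[cfg(target_arch = "x86_64")]
    unsafe { std::arch::x86_64::_mm_sfence() };
    assert_eq!(buf[1023], 1023);
}
```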

The API would have to be carefully designed, because on some platforms nontemporal stores have weird coherency/timing requirements that don't apply to normal stores. (For example, on x86-64, the normal memory store instructions are release-atomic, but the nontemporal instructions aren't.) I'm unclear on whether or not all processors guarantee a nontemporal write can be soundly read back even from the same thread if you don't use a barrier – as such, it would seem sensible for the API to work something like "write this value without caching it, and it's UB to read the resulting memory until you call a nontemporal-barrier instruction".
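The "UB to read until you call a barrier" contract could be sketched as an API shape like the following; every name here is hypothetical (this is not an existing std API), with x86-64's SFENCE standing in for the nontemporal barrier:

```rust
// Sketch of the proposed API contract: nontemporal writes plus an
// explicit barrier, with reads of the written memory deferred until
// after the barrier. All names are hypothetical.
mod nontemporal {
    /// Store `value` without leaving it in cache. Under the proposed
    /// contract, reading `*dst` back (even from this thread) is not
    /// allowed until `barrier()` has been called.
    pub unsafe fn write(dst: *mut i32, value: i32) {
        #[cfg(target_arch = "x86_64")]
        std::arch::x86_64::_mm_stream_si32(dst, value);
        #[cfg(not(target_arch = "x86_64"))]
        std::ptr::write(dst, value);
    }

    /// Make all earlier nontemporal writes visible; after this,
    /// reading the written memory is allowed again.
    pub fn barrier() {
        #[cfg(target_arch = "x86_64")]
        unsafe { std::arch::x86_64::_mm_sfence() };
    }
}

fn main() {
    let mut x = 0i32;
    unsafe { nontemporal::write(&mut x, 7) };
    nontemporal::barrier(); // required before reading `x` back
    assert_eq!(x, 7);
}
```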

2 Likes