Request for prioritization: fast thread locals

Note: this is only used if you try to globally override malloc on macOS; otherwise it uses the equivalent of #[thread_local] (although it can also use pthread_getspecific).

These hacks are needed for that case, but doing so is inadvisable in general... Even though the known issues with mimalloc on mac when used this way should be fixed, it still requires setting DYLD_FORCE_FLAT_NAMESPACE=1 in your environment (or setting the MH_FORCE_FLAT flag in the header of the Mach-O executable), or the dylibs you link/load will get a different allocator.

Unfortunately, forcing the flat namespace (i.e. disabling the two-level namespace) tends to break a lot of things in weird and confusing ways, including in system libraries.

Also note that the slot that mimalloc uses is probably not ideal. Slots 6 and 11 are only used by software targeting the WIN64 ABI (e.g. code running under something like WINE): https://github.com/apple/darwin-libpthread/blob/03c4628c8940cca6fd6a82957f683af804f62e7f/private/tsd_private.h#L93-L96, which is probably safe to not support (although, maybe WINE wants to RIIR :thinking:). I believe golang's runtime also uses one of these, though...


Anyway, I did a benchmark of some of this stuff a while ago (it includes benches for per-CPU stuff, since what I really wanted was a way to heuristically reduce atomic contention, but where the algo would still be safe): GitHub - thomcc/threadidbench (this is an older version of what I used for the numbers in Proposal `std::thread::ThreadId::current()`). It includes a reimplementation of mimalloc's hackery, both in inline asm and called separately.
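To make the "heuristically reduce contention, but stay safe" idea concrete, here's a minimal sketch. It is not code from that repo: `ShardedCounter`, `SHARDS`, and `shard_hint` are hypothetical names, and it hashes the stable `std::thread::current().id()` rather than using any of the fast thread-id or per-CPU tricks being benchmarked. The point is only that the thread-derived index is a hint: any shard gives a correct result, so a bad hint costs performance, not correctness.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Hypothetical shard count, picked arbitrarily for the sketch.
const SHARDS: usize = 16;

/// Counter split across several atomics so that different threads
/// usually hit different ones.
struct ShardedCounter {
    shards: [AtomicU64; SHARDS],
}

impl ShardedCounter {
    fn new() -> Self {
        ShardedCounter {
            shards: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Picks a shard from the current thread's id. This is only a hint:
    /// two threads landing on the same shard is slower, not wrong.
    fn shard_hint() -> usize {
        let mut hasher = DefaultHasher::new();
        thread::current().id().hash(&mut hasher);
        (hasher.finish() as usize) % SHARDS
    }

    fn add(&self, n: u64) {
        self.shards[Self::shard_hint()].fetch_add(n, Ordering::Relaxed);
    }

    /// Sums every shard; exact once all writers have been joined.
    fn total(&self) -> u64 {
        self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}
```

A cheaper "which thread/CPU am I on" primitive would just replace the body of `shard_hint`; that lookup is the part the benchmark above is measuring.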

(Major caveat with that repo: the code has major bugs and is just intended to get a feel for the ballpark of what the various performance options would be, not to actually be usable. Also, I don't think I pushed my fixes that make it compile on the non-mac targets.)

Note that there's a similarly evil trick for Windows too, which has the benefit of using the values handed out by TlsAlloc. It's discussed here: Thread Local Storage, part 2: Explicit TLS « Nynaeve

At one point I ported the x86_64 equivalent to Rust, eliminating the checks that aren't needed if you're sure the index came from there. I don't know how much of a benefit there is to using this over using #[thread_local], but I know there's sure a lot of downsides! I don't think this is remotely stable to rely on, beyond it being a de facto part of the ABI.

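// Assumes the `winapi` and `ntapi` crates for these two items.
use ntapi::winapi_local::um::winnt::__readgsqword;
use winapi::shared::minwindef::DWORD;

// Equivalent to TlsGetValue. Requires `slot` be a
// value given to you by TlsAlloc or you get massive UB
#[inline]
#[cfg(target_arch = "x86_64")]
unsafe fn tls_get_value_fast(slot: DWORD) -> u64 {
    // NB: __readgsqword is https://docs.rs/ntapi/0.3.6/src/ntapi/winapi_local/um/winnt.rs.html#35-44
    if slot < 64 {
        // The first 64 slots are stored inline in the TEB, 8 bytes each.
        __readgsqword(0x1480 + slot * 8)
    } else {
        // Later slots live in a separately allocated array whose pointer is
        // stored in the TEB. `add` takes a usize, so the index must be cast.
        (__readgsqword(0x1780) as *const u64).add((slot - 64) as usize).read()
    }
}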
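The address arithmetic in that function can be checked without touching GS at all. Here's a small sketch that separates out the "where does this slot live" computation, using only the constants from the snippet above (64 inline slots, inline array at 0x1480, expansion pointer at 0x1780). `SlotLocation` and `slot_location` are names I made up for illustration; they're not part of any Windows or crate API.

```rust
/// Where a TlsAlloc-issued slot lives relative to the x86_64 TEB,
/// per the offsets used in the snippet above.
#[derive(Debug, PartialEq, Eq)]
enum SlotLocation {
    /// Stored inline in the TEB: read 8 bytes at this GS-relative offset.
    Inline { gs_offset: u32 },
    /// Stored in the expansion array: load the array pointer from
    /// `pointer_gs_offset`, then read element `index`.
    Expansion { pointer_gs_offset: u32, index: usize },
}

const INLINE_SLOTS: u32 = 64;
const INLINE_SLOTS_OFFSET: u32 = 0x1480;
const EXPANSION_PTR_OFFSET: u32 = 0x1780;

/// Pure computation only; it performs no memory access and so is safe
/// to call with any value (unlike the real read, which needs a valid slot).
fn slot_location(slot: u32) -> SlotLocation {
    if slot < INLINE_SLOTS {
        SlotLocation::Inline {
            gs_offset: INLINE_SLOTS_OFFSET + slot * 8,
        }
    } else {
        SlotLocation::Expansion {
            pointer_gs_offset: EXPANSION_PTR_OFFSET,
            index: (slot - INLINE_SLOTS) as usize,
        }
    }
}
```

So slot 0 is at GS offset 0x1480, slot 63 is at 0x1678, and slot 64 is element 0 of the expansion array.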