Request for prioritization: fast thread locals

Rust's stable thread-locals (the thread_local! macro) are not zero-cost -- they support destructors, and so every access must perform an "is this thread-local already destroyed?" check at runtime.

This is perfectly fine for one use-case of thread locals -- providing a cheap way to pass application context around without actually threading a ctx parameter through every function.

However, I think (and here I am a bit out of my depth) this is unsatisfactory for another use-case -- speeding up concurrent algorithms. For example, the attempt to port the mimalloc allocator to Rust immediately hits the need for fast, no-frills thread locals. Counting things also needs fast TLS.

This is a pretty frustrating limitation, as you can't work around it by writing more, less elegant code. It is an expressiveness gap in the kind of machine code you can produce, not in the shape of the source code that type-checks.

At the same time, it is a relatively niche and obscure use-case. I feel that some top-down nudging from the lang team to solve this problem would benefit the language quite a bit, hence the suggestion to up the priority on this one :slight_smile: . As far as I know, this isn't blocked on anything external to the problem itself.


I would love to see this happen, if someone is interested in working on it.

In addition, I'd also like to see the tls-model option stabilized, so that applications can use a faster mechanism for thread-local storage.

Could the stable thread-local support be improved without a breaking change? It would need to avoid using Box, and it could skip registering a destructor -- but without a Box, only values up to a certain size could be supported(?)


This seems unlikely. Unlike with statics, we need to run destructors for thread locals -- otherwise we'd leak memory. So we'll need to somehow specialize on "needs drop?", and at that level it seems better to have the difference visible in the surface syntax?

Forgive me if I'm misunderstanding but I think there are (broadly speaking) two ways of initializing thread locals:

  • statically initialized variables where the initial value can simply be copied to a new thread or directly set (e.g. a counter that starts at zero).
  • lazily initialized variables that need to call a function to recreate them for a new thread.

And on thread exit, a variable either requires a function to run (e.g. drop) or it doesn't.

So in total there are four basic combinations for handling a thread local. I think efficiently encapsulating these different uses will require a different API from the one currently in std, no?

There are two ways to go about this:

  • Improve thread_local! to better handle no-drop types and constant initializers.
  • Stabilize #[thread_local] statics (with some design work to fix outstanding issues).

+1 for #[thread_local] -- they're useful for FFI too.


Special-casing on needs-drop is not very special; we even have an old function called needs_drop which lets you check it dynamically if that's the right fit.


Note that the instruction sequences used for plain-old-data thread-locals in C differ greatly depending on OS and even the type of binary.

On x86-64 Linux, in an executable, it's a single load instruction:

        mov     eax, dword ptr fs:[x@TPOFF]

But in a dynamic library (-fPIC), it's a call:

        lea     rdi, [rip + x@TLSGD]
        data16
        data16
        rex64
        call    __tls_get_addr@PLT

And on macOS, whether executable or library, it's… an indirect call!?

        movq	_x@TLVP(%rip), %rdi
        callq	*(%rdi)

(As far as I can tell, this call always goes to tlv_get_addr.)

So even if Rust makes thread-locals zero-cost relative to the underlying OS mechanism, that doesn't necessarily mean they're fast. Unfortunately.


Oh wow, it appears that mimalloc is doing something exceptionally cursed on mac:


On x86-64 Linux, apparently:

  • TLS defined in the .exe and used in the .exe -- no extra cost:
        mov eax, dword ptr fs:[x@TPOFF]
    Local Exec (LE) model.

  • TLS defined in a .so and used in the .exe, where the .so is never loaded with dlopen() -- one extra 64-bit memory read, and it uses an extra register:
        mov rax, qword ptr [rip + 0x200893]
        mov eax, dword ptr fs:[rax]
    Initial Exec (IE) model.

  • TLS used in and defined in a .so that can be loaded with dlopen() -- lots of extra cost: __tls_get_addr() does a number of reads.
    Local Dynamic (LD) and General Dynamic (GD) models; the difference between them is tiny -- with LD, ld.so has less work to do when loading the library, but there is no difference at run-time.

Are these expenses relevant for Rust, though? Isn't a monolithic executable the main use case? Is anybody really compiling Rust code into .so files?

I don't know how well supported this is right now, but I sure would like to be able to use Rust to implement a shared library that presents an interface that's ABI-compatible with an existing C shared library.


The most expensive Linux/x86-64 TLS access models (LD, GD) seem necessary only inside an .so that can be loaded with dlopen().

If your .so can only be loaded before main(), then IE should suffice. Theoretically. The LLVM LangRef says:

This seems to confirm the hypothesis that you can use IE from an .so. And IE is just one extra read from the GOT plus an extra register. Not that expensive.

It seems rustc could take some flags to control this... and make it really cheap for monolithic executables by using LE.

That's what the tls_model option @josh mentioned does, though it's not stable.
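For the curious, the unstable knob looks roughly like this today (nightly-only, and the exact spelling may change; `initial-exec` is just one of the accepted model names):

```shell
# Pick the TLS access model for the whole crate graph (nightly rustc).
RUSTFLAGS="-Z tls-model=initial-exec" cargo +nightly build --release
```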


It is well-supported enough. Examples of Rust being used to reimplement existing C APIs:

  • librsvg, which has been mostly ported to Rust
  • mesalink, which emulates OpenSSL's API (and maybe ABI, not sure)

Examples of crates with their own C APIs:

Re-cursed!

[/off]


Note: this is only used if you try to globally override malloc on mac; otherwise it uses the equivalent of #[thread_local] (although it can also use pthread_getspecific).

These hacks are needed for that case, but doing so is inadvisable in general... Even though the known issues with mimalloc on mac when used this way should be fixed, it still requires setting DYLD_FORCE_FLAT_NAMESPACE=1 in your environment (or setting the MH_FORCE_FLAT flag in the header of the Mach-O executable), or dylibs you link/load will get a different allocator.

Unfortunately, forcing the flat namespace tends to break a lot of things in weird and confusing ways, including in system libraries.

Also note that the slot mimalloc uses is probably not ideal. Slots 6 and 11 are only used by software targeting the WIN64 ABI (e.g. code running under something like WINE): darwin-libpthread/private/tsd_private.h at 03c4628c8940cca6fd6a82957f683af804f62e7f · apple/darwin-libpthread · GitHub. That's probably safe not to support (although maybe WINE wants to RIIR :thinking:). I believe golang's runtime uses one of these too, though...


Anyway, I did a benchmark on some of this stuff a while ago (it includes benches for per-CPU stuff, since what I really wanted was a way to heuristically reduce atomic contention while keeping the algorithm safe): GitHub - thomcc/threadidbench (this is an older version of what I used for the numbers in Proposal `std::thread::ThreadId::current()`). It includes a reimplementation of mimalloc's hackery, both in inline asm and called separately.

(Major caveat with that repo: the code has major bugs and is just intended to give a ballpark feel for the various performance options, not to actually be usable -- also, I don't think I pushed the fixes that make it compile on non-mac targets.)

Note that there's a similarly evil trick for Windows too, which has the benefit of using the values handed out by TlsAlloc -- discussed here: Thread Local Storage, part 2: Explicit TLS « Nynaeve

At one point I ported the x86_64 equivalent to Rust, eliminating the checks that aren't needed if you're sure the index came from TlsAlloc. I don't know how much of a benefit there is over #[thread_local], but I know there are a lot of downsides! I don't think this is remotely stable to rely on, beyond it being a de facto part of the ABI.

// Equivalent to TlsGetValue. Requires that `slot` be a value
// handed out by TlsAlloc, or you get massive UB.
#[inline]
#[cfg(target_arch = "x86_64")]
unsafe fn tls_get_value_fast(slot: DWORD) -> u64 {
    // NB: __readgsqword is https://docs.rs/ntapi/0.3.6/src/ntapi/winapi_local/um/winnt.rs.html#35-44
    if slot < 64 {
        // TEB.TlsSlots is an inline array starting at offset 0x1480.
        __readgsqword(0x1480 + slot * 8)
    } else {
        // TEB.TlsExpansionSlots (a pointer) lives at offset 0x1780.
        (__readgsqword(0x1780) as *const u64)
            .add((slot - 64) as usize)
            .read()
    }
}
