Request for prioritization: fast thread locals

Rust's stable thread-locals (the thread_local! macro) are not zero-cost -- they support destructors, and so every access must perform an "is this thread-local already destroyed?" check at runtime.

This is perfectly fine for one use-case of thread locals -- providing a cheap way to pass application context around without actually threading a ctx parameter through every function.

However, I think (and here I am a bit out of my depth) this is unsatisfactory for another use-case -- speeding up concurrent algorithms. For example, the attempt to port the mimalloc allocator to Rust immediately hits the need for fast, no-frills thread locals. Counting things also needs fast TLS.

This is a pretty frustrating limitation, as you can't work around it by writing more, less elegant code. It is an expressiveness gap in the kind of machine code you can produce, not in the shape of the source code that type-checks.

At the same time, it is a relatively niche and obscure use-case. I feel that some top-down nudging from the lang team to solve this problem would benefit the language quite a bit, hence the suggestion to up the priority on this one :slight_smile: . As far as I know, this isn't blocked on anything external to the problem itself.


I would love to see this happen, if someone is interested in working on it.

In addition, I'd also like to see the tls-model option stabilized, so that applications can use a faster mechanism for thread-local storage.

Could the stable thread-local support be improved without a breaking change? It would need to avoid using Box, and it could skip registering a destructor -- but without a Box, only values up to a certain size could be supported(?)


This seems unlikely. Unlike with statics, we need to run destructors for thread locals -- otherwise we'd leak memory. So we'll need to somehow specialize on "needs drop?", and at that level it seems better to have the difference visible in the surface syntax?

Forgive me if I'm misunderstanding but I think there are (broadly speaking) two ways of initializing thread locals:

  • statically initialized variables where the initial value can simply be copied to a new thread or directly set (e.g. a counter that starts at zero).
  • lazily initialized variables that need to call a function to recreate them for a new thread.

And on thread exit, a variable either requires a function to run (e.g. drop) or it doesn't.

So in total there are four basic combinations for handling a thread local. I think efficiently encapsulating these different uses will require a different API from the one currently in std, no?

There are two ways to go about this:

  • Improve thread_local! to better handle no-drop types and constant initializers.
  • Stabilize #[thread_local] statics (with some design work to fix outstanding issues).

+1 for #[thread_local] -- they're useful for FFI too.


Special-casing on needs-drop is not very special; we even have an old function called needs_drop which lets you check it dynamically if that's the right fit.


Note that the instruction sequences used for plain-old-data thread-locals in C differ greatly depending on OS and even the type of binary.

On x86-64 Linux, in an executable, it's a single load instruction:

        mov     eax, dword ptr fs:[x@TPOFF]

But in a dynamic library (-fPIC), it's a call:

        lea     rdi, [rip + x@TLSGD]
        data16
        data16
        rex64
        call    __tls_get_addr@PLT

And on macOS, whether executable or library, it's… an indirect call!?

        movq	_x@TLVP(%rip), %rdi
        callq	*(%rdi)

(As far as I can tell, this call always goes to tlv_get_addr.)

So even if Rust makes thread-locals zero-cost relative to the underlying OS mechanism, that doesn't necessarily mean they're fast. Unfortunately.


Oh wow, it appears that mimalloc is doing something exceptionally cursed on mac:


On x86-64 Linux, apparently:

  • TLS defined in the .exe and used in the .exe -- no extra cost:
        mov eax, dword ptr fs:[x@TPOFF]
    Local Exec (LE) model.

  • TLS defined in a .so and used in the .exe, where the .so is never loaded with dlopen() -- one extra 64-bit memory read, and it uses an extra register:
        mov rax, qword ptr [rip + 0x200893]
        mov eax, dword ptr fs:[rax]
    Initial Exec (IE) model.

  • TLS used in and defined in a .so that can be loaded with dlopen() -- lots of extra cost: __tls_get_addr() does a number of reads.
    Local Dynamic (LD) and General Dynamic (GD) models; the difference between them is tiny -- with LD, ld.so has less work to do when loading the library, but there is no difference at run-time.

Are these expenses relevant for Rust, though? Isn't a monolithic executable the main use case? Is anybody really compiling Rust code into .so files?

I don't know how well supported this is right now, but I sure would like to be able to use Rust to implement a shared library that presents an interface that's ABI-compatible with an existing C shared library.


The most expensive Linux/x86-64 TLS access models (LD, GD) seem necessary only inside an .so that can be loaded with dlopen().

If your .so can only be loaded before main(), then IE should suffice. Theoretically. The LLVM LangRef says:

This seems to confirm the hypothesis that you can use IE from an .so. And IE is just one extra read from the GOT plus an extra register. Not that expensive.

It seems rustc could take some flags to control this... and make it really cheap for monolithic executables by using LE.

That's what the tls_model option @josh mentioned does, though it's not stable.
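For the curious, the unstable knob looks roughly like this today (nightly-only, and the exact spelling may change; `initial-exec` is just one of the accepted model names):

```shell
# Pick the TLS access model for the whole crate graph (nightly rustc).
RUSTFLAGS="-Z tls-model=initial-exec" cargo +nightly build --release
```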


It is well-supported enough. Examples of Rust being used to reimplement existing C APIs:

  • librsvg, which has been mostly ported to Rust
  • mesalink, which emulates OpenSSL's API (and maybe ABI, not sure)

Examples of crates with their own C APIs:

Re-cursed!

[/off]


Note: this is only used if you try to globally override malloc on mac; otherwise it uses the equivalent of #[thread_local] (although it can also use pthread_getspecific).

These hacks are needed for that case, but doing so is inadvisable in general... Even though the known issues with mimalloc on mac when used this way should be fixed, it still requires setting DYLD_FORCE_FLAT_NAMESPACE=1 in your environment (or setting the MH_FORCE_FLAT flag in the header of the Mach-O executable), or dylibs you link/load will get a different allocator.

Unfortunately, forcing the flat namespace tends to break a lot of things in weird and confusing ways, including in system libraries.

Also note that the slot mimalloc uses is probably not ideal. Slots 6 and 11 are only used by software targeting the WIN64 ABI (e.g. code running under something like WINE): darwin-libpthread/private/tsd_private.h at 03c4628c8940cca6fd6a82957f683af804f62e7f · apple/darwin-libpthread · GitHub. That's probably safe not to support (although maybe WINE wants to RIIR :thinking:). I believe golang's runtime uses one of these too, though...


Anyway, I did a benchmark on some of this stuff a while ago (it includes benches for per-CPU stuff, since what I really wanted was a way to heuristically reduce atomic contention while keeping the algorithm safe): GitHub - thomcc/threadidbench (this is an older version of what I used for the numbers in Proposal `std::thread::ThreadId::current()`). It includes a reimplementation of mimalloc's hackery, both in inline asm and called separately.

(Major caveat with that repo: the code has major bugs and is just intended to give a ballpark feel for the various performance options, not to actually be usable -- also, I don't think I pushed the fixes that make it compile on non-mac targets.)

Note that there's a similarly evil trick for Windows too, which has the benefit of using the values handed out by TlsAlloc -- discussed here: Thread Local Storage, part 2: Explicit TLS « Nynaeve

At one point I ported the x86_64 equivalent to Rust, eliminating the checks that aren't needed if you're sure the index came from TlsAlloc. I don't know how much of a benefit there is over #[thread_local], but I know there are a lot of downsides! I don't think this is remotely stable to rely on, beyond it being a de facto part of the ABI.

// Equivalent to TlsGetValue. Requires that `slot` be a value
// handed out by TlsAlloc, or you get massive UB.
#[inline]
#[cfg(target_arch = "x86_64")]
unsafe fn tls_get_value_fast(slot: DWORD) -> u64 {
    // NB: __readgsqword is https://docs.rs/ntapi/0.3.6/src/ntapi/winapi_local/um/winnt.rs.html#35-44
    if slot < 64 {
        // TEB.TlsSlots is an inline array starting at offset 0x1480.
        __readgsqword(0x1480 + slot * 8)
    } else {
        // TEB.TlsExpansionSlots (a pointer) lives at offset 0x1780.
        (__readgsqword(0x1780) as *const u64)
            .add((slot - 64) as usize)
            .read()
    }
}
