So, right now, getting the `ThreadId` of the current thread requires `std::thread::current().id()`. On my machine (macOS), this takes about 10ns-20ns, usually around 12ns, and can't really be easily optimized without a lot of complexity¹.
The only fix for this I can think of is to provide a new (initially unstable) API that returns the same value as `thread::current().id()`, but caches the value directly in a thread local. Specifically, adding `std::thread::ThreadId::current()` doesn't feel out of place to me, but the name is infinitely bikesheddable.
This brings the perf of access in hot code down to around 1-2ns on my machine, which is comparable to the options provided by the `thread_id` crate (but with the benefits that Rust's `ThreadId` brings, specifically that it's never reused by other threads, which can be useful for unsafe code to rely on).
So that we're 100% on the same page, I'm proposing something like this:
```rust
impl ThreadId {
    pub fn current() -> ThreadId {
        #[thread_local]
        static CACHED_ID: Cell<u64> = Cell::new(0);
        // Try to have only one TLS read, even though LLVM will
        // definitely emit two for the initial read.
        let cached = &CACHED_ID;
        let id = cached.get();
        if likely(id != 0) {
            unsafe { Self(NonZeroU64::new_unchecked(id)) }
        } else {
            Self::get_and_cache(cached)
        }
    }

    #[cold]
    #[inline(never)]
    fn get_and_cache(cache: &Cell<u64>) -> Self {
        let id = thread::current().id();
        cache.set(id.0.get());
        id
    }
}
```
To be clear, I'm not saying this will definitely be the implementation (the generated code has meaningful issues in the non-cached case, which is hard to benchmark), but this API allows for an efficient implementation, whereas I don't see a possible way for the current API to allow for one.
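For reference, here's roughly the same caching trick as it can be written on stable Rust today via `thread_local!` (the helper name `cached_thread_id` is mine, not a std API; this is a sketch, not the proposed implementation). It avoids the `Arc` clone/drop on the hot path, but keeps the macro's lazy-init machinery that the `#[thread_local]` version sidesteps:

```rust
use std::cell::Cell;
use std::thread::{self, ThreadId};

// Stable-Rust sketch of the proposed caching: `ThreadId` is `Copy`,
// so we can stash it in a thread-local `Cell` and only pay for the
// `Arc` clone/drop inside `thread::current()` once per thread.
fn cached_thread_id() -> ThreadId {
    thread_local! {
        static CACHED: Cell<Option<ThreadId>> = Cell::new(None);
    }
    CACHED.with(|c| match c.get() {
        Some(id) => id,
        None => {
            // Cold path: hit the real API once and cache the result.
            let id = thread::current().id();
            c.set(Some(id));
            id
        }
    })
}

fn main() {
    let a = cached_thread_id();
    // Same value as the uncached API, and stable across calls.
    assert_eq!(a, thread::current().id());
    assert_eq!(a, cached_thread_id());
    // A different thread sees a different (never-reused) id.
    let other = thread::spawn(cached_thread_id).join().unwrap();
    assert_ne!(a, other);
    println!("ok");
}
```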
Additional notes
- Using `#[thread_local]` over `thread_local!` saves ~30% in my measurements, and reduces code bloat by a lot², even for `!needs_drop` types. (Using `thread_local!` does make the code somewhat cleaner though, since the macro takes care of caching the value.)
- On `cfg(not(target_thread_local))` targets, we would probably just make this a function that returns `thread::current().id()` directly, with no caching.
- Making this be the canonical location for the value rather than a cache could be avoided with More Engineering, of course, but the cost is a branch that is usually correctly predicted, and doing so would be more complicated.
- A big reason the current `std::thread::current().id()` is slow is that, in addition to the TLS reads, it must clone and drop an `Arc`, which requires 2 RMW operations. Even with low contention, and on x86, these are not free. The low contention is essentially guaranteed here (unless you get very unlucky and the allocator causes the refcount to be on the same cache line as some high-contention atomic value), but obviously running on x86 is not guaranteed.

  Invoking the `Arc::drop` also requires including a bunch of glue code for the case where the decref is the final decref, which is impossible as we're still on that thread, but not in a way that seems communicable to the compiler. It's possible this is avoidable by using unsafe code for the `get_and_cache` case. (This is further harmed by the fact that the compiler can't tell that `std::thread::Thread::id` doesn't unwind, which we could avoid by making it inline.)

  For clarity: I suspect it would be worth trying to improve `get_and_cache()` further here, as every thread goes through it once, and many use cases would involve hitting it only once. The version posted above is just the conceptual version (and the version that is worth using if further optimization attempts don't bear fruit).
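To make the `Arc` clone/drop cost concrete, here's a rough microbenchmark sketch. The iteration count, the `black_box` usage, and the plain `Arc<u64>` standing in for the `Thread` inner allocation are all my choices, and absolute numbers will vary by machine; it only measures the uncontended RMW pair described above:

```rust
use std::hint::black_box;
use std::sync::Arc;
use std::time::Instant;

// Measure the cost of one Arc clone + drop pair: one atomic
// increment and one atomic decrement, i.e. the two RMW operations
// `thread::current()` pays on every call. Returns ns per pair.
fn bench_clone_drop(iters: u32) -> f64 {
    let handle = Arc::new(0u64); // stand-in for the `Thread` allocation
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` keeps the clone from being optimized away.
        drop(black_box(Arc::clone(&handle)));
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let ns = bench_clone_drop(10_000_000);
    // Uncontended, but still not free, even on x86.
    println!("~{ns:.2} ns per Arc clone+drop");
}
```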
¹ Specifically: there are suggestions for improving thread-local performance, but the TLS for the current thread does actually need both lazy init and for its dtor to be run, so there's no reason to assume it will get much better, at least on macOS where `-Ztls-model` doesn't apply. Also, see above, where I go over why this being slow is plausibly more about atomic RMWs than about TLS reads, although the TLS reads certainly don't help.
² I've spent probably around 15 hours of combined time³ trying to improve `thread_local!` to avoid these issues without also harming performance in the use cases that need dtor/lazy init... every time I improve things for copy/const init, it ends up causing LLVM to do something bad like emit multiple reads to the TLS variable, and thus multiple `__tls_get_addr()` invocations on ELF platforms... I'm sure I'm not alone here either; it feels like a macro that could use a `total_hours_wasted_here` comment.