So, right now, getting the `ThreadId` of the current thread requires `std::thread::current().id()`. On my machine (macOS), this takes about 10ns-20ns, usually around 12ns, and can't really be easily optimized without a lot of complexity¹.
The only fix for this I can think of is to provide a new (initially unstable) API that returns the same value as `thread::current().id()`, but caches the value directly in a thread local. Specifically, adding `std::thread::ThreadId::current()` doesn't feel out of place to me, but the name is infinitely bikesheddable.
This brings the perf of access in hot code down to around 1-2ns on my machine, which is comparable to the options provided by the `thread_id` crate (but with the benefits that Rust's `ThreadId` brings, specifically that it's never reused by other threads, which can be useful for unsafe code to rely on).
So that we're 100% on the same page, I'm proposing something like this:
```rust
impl ThreadId {
    pub fn current() -> ThreadId {
        #[thread_local]
        static CACHED_ID: Cell<u64> = Cell::new(0);
        // Try to have only one TLS read, even though LLVM will
        // definitely emit two for the initial read.
        let cached = &CACHED_ID;
        let id = cached.get();
        if likely(id != 0) {
            unsafe { Self(NonZeroU64::new_unchecked(id)) }
        } else {
            Self::get_and_cache(cached)
        }
    }

    #[cold]
    #[inline(never)]
    fn get_and_cache(cache: &Cell<u64>) -> Self {
        let id = thread::current().id();
        cache.set(id.0.get());
        id
    }
}
```
To be clear, I'm not saying this will definitely be the implementation (the generated code has meaningful issues in the non-cached case, which is hard to benchmark), but this API allows for an efficient implementation, whereas I don't see a possible way for the current API to allow for one.
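For reference, here's roughly the same caching trick as it can be written on stable Rust today via `thread_local!` (the helper name `cached_thread_id` is mine, not a std API; this is a sketch, not the proposed implementation). It avoids the `Arc` clone/drop on the hot path, but keeps the macro's lazy-init machinery that the `#[thread_local]` version sidesteps:

```rust
use std::cell::Cell;
use std::thread::{self, ThreadId};

// Stable-Rust sketch of the proposed caching: `ThreadId` is `Copy`,
// so we can stash it in a thread-local `Cell` and only pay for the
// `Arc` clone/drop inside `thread::current()` once per thread.
fn cached_thread_id() -> ThreadId {
    thread_local! {
        static CACHED: Cell<Option<ThreadId>> = Cell::new(None);
    }
    CACHED.with(|c| match c.get() {
        Some(id) => id,
        None => {
            // Cold path: hit the real API once and cache the result.
            let id = thread::current().id();
            c.set(Some(id));
            id
        }
    })
}

fn main() {
    let a = cached_thread_id();
    // Same value as the uncached API, and stable across calls.
    assert_eq!(a, thread::current().id());
    assert_eq!(a, cached_thread_id());
    // A different thread sees a different (never-reused) id.
    let other = thread::spawn(cached_thread_id).join().unwrap();
    assert_ne!(a, other);
    println!("ok");
}
```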
Additional notes
- Using `#[thread_local]` over `thread_local!` saves ~30% in my measurements, and reduces code bloat by a lot², even for `!needs_drop` types. (Using `thread_local!` does make the code somewhat cleaner though, since the macro takes care of caching the value.)
- On `cfg(not(target_thread_local))` targets, we would probably just make this a function that returns `thread::current().id()` directly, with no caching.
- Making this be the canonical location for the value rather than a cache could be avoided with More Engineering, of course, but the cost is a branch that is usually correctly predicted, and doing so would be more complicated.
- A big reason the current `std::thread::current().id()` is slow is that, in addition to the TLS reads, it must clone and drop an `Arc`, which requires 2 RMW operations. Even with low contention, and on x86, these are not free. The low contention is essentially guaranteed here (unless you get very unlucky and the allocator causes the refcount to be on the same cache line as some high-contention atomic value), but obviously running on x86 is not guaranteed.

  Invoking the `Arc::drop` also requires including a bunch of glue code for the case where the decref is the final decref, which is impossible as we're still on that thread, but not in a way that seems communicable to the compiler. It's possible this is avoidable by using unsafe code for the `get_and_cache` case. (This is further harmed by the fact that the compiler can't tell that `std::thread::Thread::id` doesn't unwind, which we could avoid by making it inline.)

  For clarity: I suspect it would be worth trying to improve `get_and_cache()` further here, as every thread goes through it once, and many use cases would involve hitting it only once. The version posted above is just the conceptual version (and the version that is worth using if further optimization attempts don't bear fruit).
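To make the `Arc` clone/drop cost concrete, here's a rough microbenchmark sketch. The iteration count, the `black_box` usage, and the plain `Arc<u64>` standing in for the `Thread` inner allocation are all my choices, and absolute numbers will vary by machine; it only measures the uncontended RMW pair described above:

```rust
use std::hint::black_box;
use std::sync::Arc;
use std::time::Instant;

// Measure the cost of one Arc clone + drop pair: one atomic
// increment and one atomic decrement, i.e. the two RMW operations
// `thread::current()` pays on every call. Returns ns per pair.
fn bench_clone_drop(iters: u32) -> f64 {
    let handle = Arc::new(0u64); // stand-in for the `Thread` allocation
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` keeps the clone from being optimized away.
        drop(black_box(Arc::clone(&handle)));
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let ns = bench_clone_drop(10_000_000);
    // Uncontended, but still not free, even on x86.
    println!("~{ns:.2} ns per Arc clone+drop");
}
```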
¹ Specifically: there are suggestions for improving thread-local performance, but the TLS for the current thread does actually need both lazy init and for its dtor to be run, so there's no reason to assume it will get much better, at least on macOS where `-Ztls-model` doesn't apply. Also, see above, where I go over why this being slow is plausibly more about atomic RMWs than about TLS reads, although the TLS reads certainly don't help.
² I've spent probably around 15 hours of combined time³ trying to improve `thread_local!` to avoid these issues without also harming performance in the use cases that need dtor/lazy init... every time I improve things for copy/const init, it ends up causing LLVM to do something bad like emit multiple reads to the TLS variable, and thus multiple `__tls_get_addr()` invocations on ELF platforms... I'm sure I'm not alone here either; it feels like a macro that could use a `total_hours_wasted_here` comment.