Allow thread locals destructors to be blocked by Condvar

As demonstrated within my project async-local, it is possible to use a shutdown barrier to synchronize the destruction of thread local data during shutdown. By protecting every async runtime worker-owned thread this way, no async task will outlive the thread local data owned by any thread in this group, and pointers to thread locals created within an async context will be of a valid lifetime while within that context or within a runtime-managed thread. In current stable Rust, this only works for !Drop types because types that impl Drop register destructor functions that cannot be circumvented by way of using Condvar even though that thread will actually in fact block. For types that don't impl Drop, memory won't deallocate until the thread actually drops, and so thread locals of these types can be guarded by using a separate thread local to block the thread. With better language support, it should be possible to make thread locals owned by runtime worker threads always usable within an async context/runtime managed thread regardless of whether the type implements Drop.

Is it a bug that Context::new has a “Safety” comment despite not being an unsafe function?


Also, with_async being safe and guarded_ref unsafe seems inconsistent. It’s farily trivial to extract the RefGuard from the future returned by with_async.

use std::{
    future::Future,
    sync::Arc,
    task::{self, Wake, Waker},
    thread,
};

use async_local::{AsyncLocal, Context, RefGuard};
use pin_utils::pin_mut;

fn main() {
    let guard = thread::spawn(run).join().unwrap();
    println!("{}", *guard); // -> miri reports UB
    // "Undefined Behavior: pointer to alloc3596 was dereferenced after this allocation got freed"
}

thread_local! {
    static LOCAL: Context<u8> = Context::new(0);
}

fn run() -> RefGuard<'static, u8> {
    let f = LOCAL.with_async(|r| async move { r });
    pin_mut!(f);
    struct W;
    impl Wake for W {
        fn wake(self: Arc<Self>) {}
    }
    let task::Poll::Ready(r) = f.poll(&mut task::Context::from_waker(&Waker::from(Arc::new(W))))
    else { panic!() };
    r
}

In earlier versions it was marked as unsafe, however I changed the design so that instead AsContext is unsafe and promises that the type is !Drop and that Context can’t be invalidated, and then there is a panic when issuing references if the type implements Drop. I wanted to use static_assertion to make the AsContext derive enforce Drop but that doesn’t work on generics. I could use generic_const_exprs to enforce this with trait bounds instead of a panic on ref creation but in the past I’ve found the error messages that occur when this is missing can be confusing. I am thinking of adding a static assertion for non-generic derives. This is also hard to enforce because !Drop is forbidden; this doesn’t make sense to me because such bounds are useful for at least some use cases such as bump allocation. Whenever generic_const_exprs is stabilized I’ll make this so creating refs doesn’t panic and instead enforce by trait bounds.

Regarding your example, I'll have to take another look at the lifetime constraints to make it so that RefGuard can never be 'static as it this is only valid for non-'static lifetimes as shown in the safety notes. Perhaps I should make it so that async_local sets up the async runtime with thread start hooks enabling the barrier and resign to only supporting tokio; this really only works when using one or more async runtimes setup so that all worker threads synchronize shutdown. There's not really any way to make this work outside the context of barrier protected runtimes.

I remain unconvinced. It looks like what you're doing is preventing the release of thread-local data by blocking in a thread-local destructor until all threads which poked it have run their copy of the destructor.

There's two obvious potential holes in the scheme:

  • This relies on thread-local destructors running at all. As far as I recall, this isn't in any way guaranteed, and some targets may reuse/deallocate/reassign thread local slots without running destructors.
  • If you can get a reference into a thread local off of the threads which have poked the barrier. I presume that you've attempted to handle this via with_async (which pokes the barrier)... but I don't see anything preventing you from polling the with_async future a couple times then sending it to a different thread that hasn't poked the barrier. (IOW, it's only (presumably) sound if immediately .awaited without shenanigans.)

Obviously, thread-local destructors run in an unspecified order (and the compiler has essentially no control over that, it's done by the target), so doing this for references to any type with drop glue[1] is obviously unsound. It might still be Rust-level unsound for types without drop glue, independent of the lowering; it could be unsound because the AM invalidates thread-locals' references' provenance, or perhaps more simply that dropping a place still invalidates references to it even if it doesn't have drop glue. Miri seems to agree with the latter, though you could argue a difference between that example on the stack and thread-locals... though I think it unlikely that all targets would guarantee no thread-local storage slots are reused/deallocated/etc until all slots' registered destructors finish; it seems much more likely that someone would stick them in the free list as soon as running the registered cleanup, before moving to the next slot's cleanup routine.

Is it possible to test any use of your crate under Miri? Doing so in addition to the loom tests would help improve confidence, if Miri supports enough functionality to get a simple test of a reference to thread-local outliving the thread it's local to working.

In short: it may seem to work on your machine(s), but that's a common manifestation of UB, and the most dangerous: looking like it works until you hit the perfectly unfortunate set of conditions to make it not work. It might be that it'll always work on an Itanium+pthreads machine, but on some ARM chip it goes belly up.


  1. not just that don't impl Drop; the Drop trait bound is fundamentally useless, especially when inverted; you need needs_drop instead, which is also allowed to spuriously be consistently true for any type (and in the future, perhaps inconsistently, so long as constant evaluation is consistent). ↩︎

No, danger, danger!

struct Oops {
    data: String,
}

fn assert_drop<T: Drop>() {}
const _: fn() = || {
    assert_drop::<Oops>(); // compile error
};

[playground]

This type does not implement Drop, but it still has drop glue, needs to be dropped, and using it after it's been dropped is unequivocally unsound.

where const { !needs_drop::<T>() } may be possible to express in the future, but

  • needs_drop is allowed to be true spuriously for all types, and
  • a Drop trait bound is absolutely never what you want, and only isn't always forbidden because of stability.
1 Like

It's acceptable to limit support to a subset of targets.

Yes that's an issue I've been pondering. If I choose to only support the Tokio runtime, then I can export my own main macro that uses on_thread_start to setup the barrier on runtime worker threads, this way it's guaranteed all runtime worker threads synchronize. It might make sense to have runtimes adopt this via feature flags.

The trick here would be to use the Drop of Context as a rendezvous point; this way it wouldn't matter which thread local is destroyed first, only that first time a thread starts to destroy any protected thread local it will synchronize with other threads during shutdown.

how would this cause issues? So long as the data is not deallocated a raw pointer could dereference this?

I just tested on Miri and had no issues. Added command to repo

The tls-barrier-check and previously doomsday-clock I wrote specifically for testing this, so presumably these things can be tested on a number of targets to see which can't be support

I think all that's necessary is that for a given type T that std::mem::needs_drop::<T>() be consistent as what really matters is whether the thread_local macro registers a destructor as should that be the case no barrier will protect invalidation. A type that conservatively returns needs_drop true would fail to compile and a type that returns false will be able to be barrier-protected.

Something like this:

#![allow(incomplete_features)]
#![feature(generic_const_exprs)]

use const_assert::{Assert, IsTrue, IsFalse};
use std::mem::needs_drop;

struct Buffer<T, const N: usize> {
  inner: [T; N],
}

impl<const N: T> Buffer<N>
where
  Assert<{ N == N.next_power_of_two() }>: IsTrue,
  Assert<{ N == 1 }>: IsFalse,
  Assert<{ needs_drop::<T>() }: IsFalse
{
  ...
}

It doesn't matter if it works on the target if the AM semantics say it's UB. A sanitizing implementation could potentially be allowed to decide to overwrite all memory with 0xFF after it's dropped to help try to catch use-after-drop bugs. Codegen could change arbitrarily such that it's now causing problems on the target, if it's UB according to AM semantics. UB means all bets are off and even your debugger could be lying to you.

This probably isn't allowed for Copy types, as ptr::drop_in_place at least implies that using a Copy value after it's been drop_in_placed isn't UB.

It's still an open/undecided question whether it's valid to read memory after it's been dropped (or perhaps more precisely, whether it's been uninitialized). This holds just as much for drop-glue-free types (that aren't Copy) as it does for drop-glue-having types.

I'd suggest checking to ensure Miri runs TLS destructors at all. Miri doesn't show the absence of potential nondeterministic UB, but if it's just not deallocating TLS at all it's extra not showing anything.

... if your main macro is used. If you go that path, make sure you require the macro is used, else it can just not be used and bypass any protection it was supposed to be providing.

For libraries, sure, but not for the language or std itself.


If you're relying on the presence of a runtime anyway, though... just don't use TLS! It can still be keyed off of worker thread ID and be associated to the thread via TLS, but if you put the runtime in charge of cleaning up resources it can trivially defer that until the runtime threads are all joined. You still need some way to prevent the references from escaping the lifetime of the runtime, but you don't have to worry about TLS destructors at all.

I also managed to forget the third big hole while drafting: doing it twice. Nothing ever resets the finalized flag, so spin up a second time after the first shutdown barrier and completely bypass the barrier :slight_smile: At leasg that one's easily addressable by just resetting the finalized flag when new threads register on the barrier.

It passes under TSAN; what else would be needed here?

If a type doesn't impl Drop then the thread local barrier in shutdown-barrier will block on Condvar and all TLS destructors registered will be called regardless but for types not registered the dropping actually happens later which is why there's no use-after-drop occurring.

Confirmed TLS destructors are being ran under MIRI

The only use case this will ever support is one shutdown, one barrier.

Sanitizers are hardly catch-all. Even miri can miss some kinds of undefined behavior. If your code exhibits UB, the compiler is entirely free to turn your code into a goo that does the bare minimum that passes your sanitizer. If your program ever enters a branch that unconditionally leads to UB, you completely lose the ability to reason about your code.

Then it's unsound unless you prevent me from going twice. I can always just write

fn main() {
    tmain();
    tmain();
}

#[tokio::main]
async fn tmain() {
    // ...
}

and if the second time allows use-after-free then it's unsound.

It's absolutely fine if starting up again ends up causing some old threads to not finish shutting down. It's not fine if it's unsound.

If there's no observable destructor, then there's no way to know when the destructor happens. Both before all other TLS destructors or after all other TLS destructors are possible choices.

Because there's no actual code involved, no LLVM-level sanitizer is going to care. It's only a distinction that makes any difference at the compiler (Abstract Machine) level, in whether compiler optimization and codegen choices are allowed to assume you don't access TLS after the thread has terminated but before your barrier TLS destructor has finished blocking.

I.e. the relevant example of reclaiming the TLS slots before all TLS destructors have finished.

It's absolutely not a question about what the compiler happens to be doing today, it's a question about what the compiler is allowed to do in the future. That would includes registering extra TLS destructors just to uninit that data, if AM semantics permit it. (rustc almost certainly wouldn't, because that'd be pointless, but the point is that there's no guarantee it won't.)

In that case, especially if it's restricted to data without a destructor anyway, why not just Box::leak the data? Then there's no soundness worries to be had, and the data will be cleaned up after the main thread exits.

More than anything else, really, it just feels like having the data owned by TLS is the incorrect choice if it needs to outlast the death of the thread and attempted expiry of TLS. The correct owner for data that dies with runtime shutdown is the runtime. It can be referenced from TLS just fine, but rather than force TLS to wait for confirmation all the executor runtime threads have exited... just own the data from the runtime. And then you only hold up the data that matters, not also anyone else using TLS as well.

With how tokio spawns and kills off blocking/IO threads, it also seems relatively possible this'd end up causing TLS exhaustion if you do hook tokio's thread spawn.

#[tokio::main] won't compile unless on main so this risk doesn't exit.

What I showed with the tls-barrier-test is that TLS with Drop will be dropped in place no matter even if Condvar is used, but that TLS without Drop won't be dropped until the thread actually drops, and can have drop deferred an arbitrarily long time without invalidating. Probably it could be tested to see if this the case for all targets that have:

        #[cfg(all(
                target_thread_local,
                not(all(target_family = "wasm", not(target_feature = "atomics"))),
            ))]

It adds extra indirection. I could probably benchmark this to see how much it actually matters, but also I just wanted to see if this can be done.

If runtimes synchronized during shutdown TLS raw pointers owned by worker threads wouldn't need to be protected by blocking during TLS destruction in the first place because all async tasks and all blocking threads will have dropped before any worker thread proceeds to drop any TLS. Using a sidekick destructor to block as a way of deferring unregistered TLS drop was only a way to try to avoid asking runtimes to make this change, and the hope was this could be made to be deterministically safe for some set of targets

Threads in the blocking pool wouldn't need to be barrier protected because it's guaranteed at least one worker thread will outlive all blocking threads and so it would only need be the case that worker threads have all at some point enabled the barrier. These threads would be able to freely dereference raw pointers to TLS owned by worker threads without any atomic reference counter nor TLS initialization on these threads

That’s not true. Also, tokio::main is merely a convenience wrapper around the ordinary API for creating and running a tokio::runtime::Runtime anyways.

4 Likes

Oh interesting good point! I mixed that up; I was reading through async-std and saw they had the assertion there but yes you're right this isn't something that Tokio has and so that's something I would need to include.

The only thing special that I would do differently than that of Tokio for main is to use on_thread_start to make runtime threads enter a barrier via shutdown_barrier::guard_thread_shutdown. So long as this can be made to synchronize the shutdown of worker threads without affecting blocking threads then this alone would make it such that no async task will outlive the TLS destructor internal too shutdown-barrier. There's then the question of how useful is that; I think it might be the case here that all targets with target_thread_local sans wasm will defer deallocations of unregistered thread locals should Condvar block, meaning those can be protected.

It also just occurred to me that I can use on_thread_stop as a synchronization point to get the same effect without needing to suspend during TLS destruction at all, so maybe the best option for me here is to abandon supporting async-std altogether and to just use a barrier protected Tokio Runtime instead. It's still unclear if it's actually undefined behavior to dereference a raw pointer to a thread local value for a value that has no registered destructor on a thread that is blocking during the Drop of a LocalKey and it would be cool if there were a way to support blocking during a TLS drop as a means of reliably extending the lifetime.

And it's essentially just a lint. It's trivially possible to bypass by just defining a fn main in a module. Or even by layering a second procedural macro underneath that will change the function name.

The only way to prevent calling the function again is if it never exits (i.e. you call process::exit to terminate the entire process), including via unwind. It can still be called reentrantly, but at least that will only soundly deadlock.

And I'll just say again that if you're terminating the process directly afterwards, it's barely worth it to allow TLS cleanup to continue rather than just blocking indefinitely.

If TLS cleanup happens before the thread is reported finished and joined[1] (reasonable choice; it ensures TLS destructors actually get a chance to run), then now doing a block_on from the main thread will ensure[2] runtime shutdown deadlock (waiting for the main thread TLS destructors to rendezvous with the barrier, which isn't going to happen). If TLS destructors run after the thread is joined (but before TLS is freed up for reuse), then you're likely cutting off cleanup by exiting the process immediately after stopping blocking TLS cleanup.


  1. I tested the behavior on my Windows machine, since I was curious. thread::spawn(_).join() waits for TLS destructors, but thread::scope doesn't. scope has the spawned threads' main notify/unpark the scope thread when they're done rather than .join()ing them. ↩︎

  2. I'm assuming all of the worker threads get told to exit before joining any of them. Not doing so would be somewhat silly and lead to a deadlock waiting for the first workers' TLS to finish cleanup. ↩︎

Agreed. I had (in my younger years) sent a patch to clean up dlopen-associated memory stored in TLS to make the APIs thread-safe. Ulrich Drepper shot it down saying "no one cares; the process is exiting and the kernel can take care of it". At the time (green as I was) I thought that leaks should be plugged (one of my earliest patches to a project I wasn't a developer of from this effort is still open…I should probably close that loop someday), but now I know that TLS can be a maze full of dark corners and pits that is best left to the kernel to clean up if it's not critical (and you should leave as much outside of TLS as possible anyways).

... Actually, because I somehow didn't put two and two together in the last post:

On Windows, from my quick test, when the main thread exits / the process is exited: currently active threads are immediately terminated without TLS destructors, currently active TLS destructors complete, and then the process-exiting thread's TLS destructors run (based on rust docs, not the test) and the process actually exits.

This means that blocking infinitely in a TLS destructor is a horrid idea on Windows. The process-global loader lock is held during that thread cleanup period, so anything else taking that lock (most OS calls can) will deadlock with that. Leak stuff, sure and please, but don't block in TLS destructors.

Based on the Rust docs, on pthreads targets the process exit skips any remaining TLS destructors and that probably includes the infinitely blocking one(s).

Yea so what I’ll do instead here is give up on TLS destructor blocking and instead make a macro that builds a barrier protected tokio runtime instead and I’ll do so in a way that has one barrier per expansion and then I’ll have to figure out some solution for making sure references are made only by runtime worker threads probably by debug assertions with named threads.