Design rationale: why uncaught panics in threads don't abort?

It seems Rust emphasizes not masking errors. Result::Err is either explicitly handled, or it will loudly blow up on .unwrap(). Panics in the main thread are either wrapped in catch_unwind() or will lead to immediate process termination with an error message that is hard to miss. I appreciate this a lot, it's an important aid for writing correct programs.

Uncaught panics in spawned threads are an unfortunate departure from this principle. Something is printed to stderr and the thread terminates, but the rest of the program keeps running as if nothing happened.

This also makes panics and threads non-orthogonal. There are two ways to not die on panic: either catch it or just have it in a spawned thread where it won't bother anybody. That's awkward.

I'm curious why things were designed this way. Was it a mistake that is now entrenched, or are there good reasons for this behavior?


JoinHandle::join actually tells you whether the thread has panicked, so the rest of the program only keeps running if you ignore the handle (maybe it should be #[must_use]?)
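To make that concrete, here is a minimal sketch of what join reports. The helper names child_panicked and join_or_propagate are made up for illustration; only thread::spawn, JoinHandle::join and panic::resume_unwind come from std.

```rust
use std::thread;

/// Spawns a thread that panics and reports whether `join` observed the panic.
fn child_panicked() -> bool {
    let handle = thread::spawn(|| {
        panic!("boom in the child thread");
    });
    // `join` returns `Err(payload)` if the thread panicked, `Ok(value)` otherwise.
    handle.join().is_err()
}

/// Re-raises the child's panic in the joining thread instead of returning a `Result`.
fn join_or_propagate<T>(handle: thread::JoinHandle<T>) -> T {
    match handle.join() {
        Ok(value) => value,
        Err(payload) => std::panic::resume_unwind(payload),
    }
}
```

If the handle is dropped without calling join, nothing of this is ever checked, which is the situation the original post complains about.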


#[must_use] won't help, because it's not enough to not ignore the JoinHandle right away; one also has to remember to check it later. Linearity has to be enforced, either by the type system or by panicking in JoinHandle::drop().

And this does not address the non-orthogonality concern. In a world where Result, panics, and threads are orthogonal, JoinHandle::join() would return T. If one wants to signal an error from a thread, they'll make that T a Result. If one wants panics to produce a Result::Err, they'll wrap stuff in catch_unwind(), which does exactly that. I don't see why it has to be a package deal.

And there are cases where one spawns a thread to do stuff in the background and doesn't care when it joins. I dunno, maybe this should be declared bad practice, but it's tempting for example to implement simple TCP servers this way (one thread per connection). Adding explicit joining would complicate it a bit.


Yeah, I agree that the status quo here is unfortunate. Instead of abort, my preferred solution would be different though.

  • Make JoinHandle wait for a child thread in Drop (a-la C++ jthread) (obviously can't do this because of backwards compatibility, but I'd love to see jthread library on crates.io).
  • By default, propagate the panic when joining thread (ie, Drop unwraps the Result, getting a Result back requires a dedicated method call).
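A rough sketch of what such a handle could look like as a library type. JoinOnDrop and try_join are hypothetical names, not an existing crate API, and the sketch deliberately skips propagation while the parent is already unwinding, because a second panic at that point would abort the process.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical wrapper: joins the thread when dropped and re-raises its panic.
struct JoinOnDrop<T>(Option<JoinHandle<T>>);

impl<T> JoinOnDrop<T> {
    fn spawn<F>(f: F) -> Self
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        JoinOnDrop(Some(thread::spawn(f)))
    }

    /// Dedicated method for callers who want the `Result` instead of a panic.
    fn try_join(mut self) -> thread::Result<T> {
        // After `take`, the `Drop` impl below finds `None` and does nothing.
        self.0.take().expect("handle already taken").join()
    }
}

impl<T> Drop for JoinOnDrop<T> {
    fn drop(&mut self) {
        if let Some(handle) = self.0.take() {
            if let Err(payload) = handle.join() {
                // Propagate only if we are not already unwinding.
                if !thread::panicking() {
                    std::panic::resume_unwind(payload);
                }
            }
        }
    }
}
```

With this, simply letting the handle go out of scope waits for the child and turns its panic into a panic of the parent, while try_join keeps the current Result-based behavior available.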

And there are cases where one spawns a thread to do stuff in the background and doesn't care when it joins.

I'd say it almost always makes sense to join the threads, as that leads to more robust software (structured concurrency FTW!). But yeah, in the jthread world we can have an explicit .detach method, which would abort if the thread panics.


Why would it not be forward compatible to do join by default in Drop?

There is one perspective from which the current behavior makes sense. An uncaught panic in any thread doesn't result in an abort. If the main thread panics, the stack is unwound without aborting (unless something happens to force an abort, like a double panic). The same thing happens with any other thread: if it panics, the thread unwinds and terminates.

It just so happens that the main thread is special, though. When it terminates, the program terminates. This behavior is not specific to panics.

I'm not sure if main-thread-termination-terminating-the-program is OS-specific or not. And I don't think it's ideal that panics in other threads can go unnoticed. But at least from this perspective (panics terminate threads, not programs) it's more consistent.


I've complained about this some time ago. I haven't gotten motivated enough to write an RFC, and there seems to be a somewhat but not completely related issue somewhere (I can't find it right now :frowning: ). However, the solution proposed at the time was to define another panic = "..." value (maybe abort-uncaught) that would abort if the unwind falls off the stack. That could arguably be set in new projects by cargo new if it was deemed a better default.

And yes, there are reasons to have threads you don't join ‒ maybe a background thread that sends metrics out every 5 minutes, or one that watches the health of the rest of the program, or clean up unused files. That thread is not supposed to terminate at all, so one doesn't join it (one could join it on shutdown, but by that time if the thread died 4 weeks ago, it would still be too late to propagate the panic).

You still need to join those threads though, lest your test suite spawn a dozen interfering zombie background threads, or the cleanup thread fail to clean up files after ctrl+C because Drop is not run for abruptly terminated threads.

The point about delayed reporting of panics is a good one, but it is orthogonal to joining, I think. The ideal way this should work is that the main thread waits to join both the worker thread and the background thread in parallel (using select), and, if the background thread panics, the panic propagates upwards, cancelling the worker thread. But to implement this plan, you need both select and cooperative cancellation, which are underexplored areas of API design.
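Since std has no select over join handles, the closest thing I can sketch with std alone is a channel on which every thread reports how it ended. Everything here (Exit, Reporter, spawn_supervised, first_exit_demo) is a made-up name, and closing a channel stands in for real cooperative cancellation, so treat it as an approximation of the plan rather than a solution.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// How a supervised thread ended.
#[derive(Debug, PartialEq)]
enum Exit {
    Finished(&'static str),
    Panicked(&'static str),
}

/// Reports the thread's fate when dropped, including during unwinding.
struct Reporter {
    name: &'static str,
    tx: Sender<Exit>,
}

impl Drop for Reporter {
    fn drop(&mut self) {
        let exit = if thread::panicking() {
            Exit::Panicked(self.name)
        } else {
            Exit::Finished(self.name)
        };
        // The receiver may already be gone; nothing useful to do in that case.
        let _ = self.tx.send(exit);
    }
}

fn spawn_supervised<F>(name: &'static str, tx: Sender<Exit>, f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        let _reporter = Reporter { name, tx };
        f();
    })
}

/// Waits for whichever thread exits first, then asks the worker to stop.
fn first_exit_demo() -> Exit {
    let (tx, rx) = mpsc::channel();
    // Background thread that dies early.
    let _background = spawn_supervised("background", tx.clone(), || panic!("cleanup failed"));
    // Worker that runs until its stop channel is closed.
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let worker = spawn_supervised("worker", tx, move || {
        let _ = stop_rx.recv();
    });
    let first = rx.recv().expect("at least one thread reports");
    drop(stop_tx); // stand-in for cancellation: wakes the worker up
    worker.join().unwrap();
    first
}
```

The main thread learns about the dead background thread as soon as it happens rather than at shutdown, and can then decide to panic itself, abort, or restart the thread.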


I'm not arguing that that design is without its problems for sure. But there are situations where you don't have tests and the cleanup would clean them up next time the program runs or whatever.

I'm just saying that the default of threads dying silently makes a non-optimal design even worse and does that without pointing it out in any way. My point there is, the borrow checker tells you if you manage your memory in a wrong way without your previous knowledge of what memory management is, but nothing alerts you here. The problem is that people can be ignorant about their own ignorance.

(The old blog post actually argues that it would be better if unwinding just didn't exist, and that was probably a very extreme opinion, but it still contains several examples of what can go wrong: https://vorner.github.io/2018/07/22/dont_panic.html)

That's true, but this also feels like an extremely narrow use-case. The big problem is that those background threads won't run destructors, so if, for example, the background thread uses some resources (buffered files, sockets), there's a risk that those resources won't be cleaned up. Basically, if you can't put std::process::abort() in the middle of your program, you shouldn't detach threads.

Given that the failure mode here "happens only when you are super unlucky, but can corrupt your data", I feel that defensive joining of all threads is a significantly better approach in practice, for all applications.

Um, I'm not sure I've described the situation well enough. I can put that abort in the middle of the program, that is fine. It'll recover the next time it is started. What I can't have is an application in a half-dead state. Here is what I mean by cleaning up files: let's say the main application generates a new file every 5 minutes, and the cleanup thread wakes up every 5 minutes and deletes all files except the 3 newest. If you lose that cleanup thread, everything seems fine until you run out of disk space. The better situation is to die completely and get restarted or alert someone, not stay half-zombie for 2 months.

Though, the exact details are probably moot. People seem to agree that the current situation is not exactly ideal.

Sidebar, but I really like Erlang process linking[1][2] as a solution to this problem. I may be mistaken, but I thought Rust had some of the rudiments of it at least at one point?


If you wish to have a spawned thread abort on panic, all you have to do is add some abort on unwind bomb:

use std::thread::JoinHandle;

pub fn spawn_aborting_if_unwind<T, F>(f: F) -> JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    ::std::thread::spawn(move || {
        // Runs only if this scope is left by unwinding, i.e. if `f` panicked.
        ::scopeguard::defer_on_unwind!({ ::std::process::abort() });
        f()
    })
}

Version without scopeguard:

use std::thread::JoinHandle;

pub fn spawn_aborting_if_unwind<T, F>(f: F) -> JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // Aborts the process when dropped; the drop only runs if `f` unwinds.
    struct AbortOnDrop;
    impl Drop for AbortOnDrop {
        fn drop(&mut self) {
            ::std::process::abort();
        }
    }
    ::std::thread::spawn(move || {
        let abort_on_drop = AbortOnDrop;
        let ret = f();
        // `f` returned normally, so defuse the bomb.
        ::core::mem::forget(abort_on_drop);
        ret
    })
}

But, again, this only truly works if the main thread waits for the spawned threads, so using a scoped API such as ::crossbeam's, or one where the JoinHandles .join() on drop would make this design quite good.
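For illustration, here is what the scoped approach looks like with std::thread::scope (stable since Rust 1.63). Note that I'm swapping in the std API rather than crossbeam's; the two helper functions are made-up examples. The scope joins every thread before returning, and it panics itself if a thread that was not joined manually has panicked.

```rust
use std::thread;

/// Sums each half of `data` on its own scoped thread.
fn parallel_sum(data: &[u64]) -> u64 {
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        let l = s.spawn(|| left.iter().sum::<u64>());
        let r = s.spawn(|| right.iter().sum::<u64>());
        // Explicit joins hand back each thread's result.
        l.join().unwrap() + r.join().unwrap()
    })
}

/// Returns true if a panic in an unjoined scoped thread reached the parent.
fn scope_propagates_panic() -> bool {
    std::panic::catch_unwind(|| {
        thread::scope(|s| {
            s.spawn(|| {
                panic!("worker failed");
            });
            // No explicit join: the scope joins the thread for us on exit.
        });
    })
    .is_err()
}
```

So inside a scope, a worker's panic cannot go unnoticed: the parent either handles the Err from join or unwinds when the scope ends.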


That being said, I do agree that not making panics "cross" the thread boundary (for instance, by aborting the process) is a surprising default w.r.t. other guarantees.

While the fact that you can run a thread and know that if it panics the failure can be contained there is indeed quite useful and definitely something that was worth designing around, this "capability" should have been opt-in, rather than opt-out as showcased at the beginning of my post.

Something that "proves" my point is the fact that spawning a thread does not require its closure to be UnwindSafe & co.

A nicer design would have been for thread::spawn to spawn a thread where panics abort (at the end of the unwind chain), and with two opt-in settings on that JoinHandle:

  • .ignore_unwind(), that would switch to the currently default behavior, while requiring the provided F closure parameter be UnwindSafe!

    thread::spawn(|| { ... })
      .ignore_unwind() // there is a race condition with this pattern, but it shouldn't matter
    // alternative non-racy version (with a type-level enum, we can still require UnwindSafe when applicable)
    thread::spawn(|| { ... }, OnUnwind::Ignore)
    
  • .join_on_drop(), that would enable joining the thread when the JoinHandle would be dropped, thus making it possible to propagate the panic without aborting, by making the .join() call panic itself (infecting the JoinHandle with the UnwindSafety of F).


I can deal with it (currently I do if catch_unwind(..).is_err() { abort() }). The problem is I have to know about having to deal with it and not forget to do it.
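Spelled out, that workaround is roughly the following. spawn_or_abort is a made-up helper name, and wrapping the closure in AssertUnwindSafe is my assumption about how to satisfy catch_unwind's bound; it is harmless here because the process dies before anyone can observe broken state.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::process;
use std::thread::{self, JoinHandle};

/// Spawns a thread whose uncaught panic takes the whole process down.
fn spawn_or_abort<F, T>(f: F) -> JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(value) => value,
        // The panic message has already been printed by the panic hook.
        Err(_) => process::abort(),
    })
}
```

It does the job, but only for threads spawned through this helper, which is exactly the "have to know about it" problem.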

I do like the general idea of configurable thread behavior on its spawn you propose (bikeshedding needed, etc). Though I'm afraid changing the default is not backwards compatible :-(.


So create a non-default version that people who care about such things can opt into, and perhaps let rustfix at an edition change rewrite the old default version into the correctly-optioned new version.


Is this something that the 2021 edition could help with?

Unfortunately, I'm pretty sure the answer is "no." You need to be able to mix multiple edition crates into the same final executable, and there's really no good way to have threads work differently depending on the crate. What happens when a thread is spawned by one crate, but the panic occurs within a different crate from a different edition? Who wins?

"More generally, breaking changes to the standard library are not possible." So editions simply don't affect what std can and can't get away with changing.

I guess this would require some synchronization, like a Mutex. When the cleanup thread panics, the Mutex gets poisoned, so when the main thread accesses the Mutex, it panics as well. But this doesn't work with panic=abort :thinking:

I believe that it is OS-specific (at least insofar as it relates to panics). One OS in use today doesn't really have a notion of a process distinct from a thread. It has a notion of fault handlers, such that a fault handler can resume execution of the faulted thread (this can be used, for instance, to implement user-mode POSIX system call emulation, by setting up a fault handler that performs the system call and then resumes the process).

In seL4, faults are modeled as separately programmer-designated “fault handler” threads. In monolithic kernels, faults are not usually delivered to a userspace handler, but they are handled by the monolithic kernel itself.

I'm not exactly sure about the specifics of non-faulty termination. Just that it doesn't really seem to treat the main thread in any special way...