Standard library, synchronization primitives, and undefined behavior

Hey all! The standard library provides a number of synchronization primitives for all Rust programs to use in a cross-platform fashion. Ideally the standard library also provides sound implementations of these primitives, so that safe code can't trigger undefined behavior through them at runtime!

Currently the system primitives Mutex, RwLock, Condvar, and ReentrantMutex (internal to libstd) all wrap underlying primitives for each OS (pthreads on non-Windows platforms, corresponding objects on Windows). Unfortunately, though, using the OS primitives brings quite a few caveats:

And that’s just the ones we’ve managed to fix! There are a number of open issues that don’t currently have proof-of-concept programs showing unsoundness, but where we’re still causing undefined behavior in the “spec says it’s undefined” sense:

Overall this makes the current situation in the standard library pretty unfortunate! We have already gone to great lengths to try to work around the OS primitives, but we’re still not all the way there in terms of allowing 100% of safe Rust to leverage the primitives without undefined behavior. We think that this represents the set of guarantees we need to uphold (basically fixing reentrant locking of RwLock on all platforms), but there could be more future gotchas as well!

We had a short discussion about these issues at the recent libs triage meeting, and we’d like to send out a call for help and suggestions of how to fix these! Correctness here is absolutely paramount, and the secondary concern for these primitives is speed. There seem to be two broad approaches we could take to solve these issues:

  • Continue to hack around the OS primitives in the standard library. It’s unclear what the performance hit would be for fixing the RwLock reentrancy, and it’s also somewhat unclear what the solution would look like! (There aren’t any proposals yet.)
  • Alternatively we could take it upon ourselves in the standard library to reimplement these primitives without basing them on the system implementations. The most prominent implementation for doing this is parking_lot.

It may seem obvious that "just use parking_lot" is the solution here, but it’s unfortunately more nuanced than that (some more discussion can be found on an old RFC as well). The libs team is not ready to simply move parking_lot into the standard library and call it a day; we would prefer to move more carefully and deliberately to measure performance, ensure stability guarantees can be upheld, and ensure that platform compatibility is maintained. Additionally, we’d ideally like to keep an implementation that continues to rely on system primitives as a “bottom line” to start from and against which to evaluate an alternative strategy like parking_lot.

Would others be willing to help us out with these issues? @Amanieu or other parking_lot folks, would y’all be willing to help work with us and figure out what it would look like for parking_lot to be in the standard library? Do others have ideas on how to fix these issues?

Help is always greatly appreciated here!

26 Likes

I want to just take the opportunity to emphasize how much I love that this is a community with people and a library team which takes correctness extremely seriously. :heart:

18 Likes

parking_lot has evolved quite a lot since the old RFC, so I'll start by giving an overview of it and how it differs from the standard library synchronization primitives.

parking_lot is actually split into 3 crates:

  • parking_lot_core provides the basic functionality for parking and unparking threads, which the synchronization primitives build on. The API somewhat resembles that of the futex Linux system call, with several extensions, and is implemented in userspace.
  • lock_api provides all the boilerplate for type-safe Mutex<T> and MutexGuard<'a, T> types from a type which implements the much simpler RawMutex trait (lock, unlock and try_lock). It also supports RwLock and ReentrantMutex. (A minimal example follows this list.)
  • parking_lot provides the actual implementation of the synchronization primitives based on parking_lot_core. For example, its RawMutex type implements the RawMutex trait from lock_api, and its Mutex type is a typedef for the Mutex type from lock_api.
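
To make the lock_api layer concrete, here is a minimal sketch closely mirroring the example in lock_api's own documentation: a toy spinlock gets the full typed Mutex/MutexGuard API for free (this assumes a lock_api version where RawMutex::unlock is unsafe):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

use lock_api::{GuardSend, RawMutex};

// A toy spinlock; lock_api derives the full typed Mutex/MutexGuard
// API from this one small unsafe trait impl.
pub struct RawSpinlock(AtomicBool);

unsafe impl RawMutex for RawSpinlock {
    const INIT: RawSpinlock = RawSpinlock(AtomicBool::new(false));

    // The guard may be sent to another thread.
    type GuardMarker = GuardSend;

    fn lock(&self) {
        while !self.try_lock() {
            std::hint::spin_loop();
        }
    }

    fn try_lock(&self) -> bool {
        self.0
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    unsafe fn unlock(&self) {
        self.0.store(false, Ordering::Release);
    }
}

// The type-safe wrappers come for free:
pub type Spinlock<T> = lock_api::Mutex<RawSpinlock, T>;
pub type SpinlockGuard<'a, T> = lock_api::MutexGuard<'a, RawSpinlock, T>;
```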

parking_lot's API has quite a few differences from the standard library primitives, for example:

  • Locks are not poisoned during unwinding; they are unlocked normally. This avoids having to call unwrap all the time. (A short example of this and a few other differences follows this list.)
  • Locks can be forcibly unlocked with the unsafe force_unlock method.
  • Locking can be attempted with a timeout using the try_lock_for and try_lock_until methods.
  • Lock guards can be temporarily released using the unlocked method.
  • Fair locks are optionally supported through the unlock_fair method on lock guards.
  • A lock can be yielded to another thread blocked on a lock using the bump method. This has a fast path if there are no threads waiting on the lock.
  • Lock guards can be mapped, which returns a MappedMutexGuard. For soundness reasons, MappedMutexGuard does not support the unlocked and bump methods, and cannot be used with Condvar.
  • RwLock supports upgradable locks (RwLockUpgradableReadGuard) and downgradable locks (RwLockWriteGuard::downgrade).
  • RwLock supports recursive locking without deadlocking through the read_recursive method. Attempting to acquire a read lock using read can block if there is a pending writer thread.
  • A ReentrantMutex type is provided.
  • A Condvar can be used with different Mutexes over its lifetime, just not with two different Mutexes at the same time (that will panic).
  • All primitives have const fn constructors, can be safely moved and do not require drop.
  • Locks have well-defined behavior when locked recursively (guaranteed to either deadlock or succeed).
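
For illustration, a quick sketch of a few of those differences in action (API names as in current parking_lot releases):

```rust
use std::time::Duration;

use parking_lot::{Mutex, RwLock};

fn main() {
    let m = Mutex::new(0u32);

    // No poisoning: lock() returns the guard directly, no unwrap().
    *m.lock() += 1;

    // Locking with a timeout.
    if let Some(mut guard) = m.try_lock_for(Duration::from_millis(10)) {
        *guard += 1;
    }

    let rw = RwLock::new(vec![1, 2, 3]);

    // Recursive read locking that cannot deadlock on a pending writer.
    let r1 = rw.read_recursive();
    let r2 = rw.read_recursive();
    assert_eq!(r1.len(), r2.len());
}
```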

Given that quite a few of these are breaking changes, I think the best way forward would be to keep the existing Mutex and MutexGuard facades in libstd, but replace the OS-specific backend implementations with the RawMutex from parking_lot. parking_lot_core can be imported into libstd mostly unchanged, even though it has several features that may not be needed (e.g. fairness support).

Also do keep in mind that the main reason why parking_lot is so much faster than the standard library is that the locking/unlocking fast path when there is no contention can be fully inlined. This is only possible because we implement the synchronization primitives ourselves.
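
To illustrate, here is a condensed sketch of what a parking_lot-style mutex looks like when built directly on parking_lot_core's public park/unpark_one API. It omits parking_lot's adaptive spinning and fairness logic, so it is a simplification of the real RawMutex, but the shape of the inlinable fast path is the same:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

use parking_lot_core::{DEFAULT_PARK_TOKEN, DEFAULT_UNPARK_TOKEN};

const LOCKED_BIT: u8 = 1;
const PARKED_BIT: u8 = 2;

pub struct RawMutex {
    state: AtomicU8,
}

impl RawMutex {
    pub const fn new() -> Self {
        RawMutex { state: AtomicU8::new(0) }
    }

    #[inline]
    pub fn lock(&self) {
        // Uncontended fast path: one CAS, no syscall, fully inlinable.
        if self
            .state
            .compare_exchange_weak(0, LOCKED_BIT, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            self.lock_slow();
        }
    }

    #[inline]
    pub fn unlock(&self) {
        // Uncontended fast path: nobody is parked, just clear the bit.
        if self
            .state
            .compare_exchange(LOCKED_BIT, 0, Ordering::Release, Ordering::Relaxed)
            .is_err()
        {
            self.unlock_slow();
        }
    }

    #[cold]
    fn lock_slow(&self) {
        loop {
            let state = self.state.load(Ordering::Relaxed);
            if state & LOCKED_BIT == 0 {
                // The lock was released; try to grab it (keeping PARKED_BIT).
                if self
                    .state
                    .compare_exchange_weak(
                        state,
                        state | LOCKED_BIT,
                        Ordering::Acquire,
                        Ordering::Relaxed,
                    )
                    .is_ok()
                {
                    return;
                }
                continue;
            }
            // Advertise that a thread will park, then sleep on the lock's
            // address via the global hashtable in parking_lot_core.
            if state & PARKED_BIT == 0
                && self
                    .state
                    .compare_exchange_weak(
                        state,
                        state | PARKED_BIT,
                        Ordering::Relaxed,
                        Ordering::Relaxed,
                    )
                    .is_err()
            {
                continue;
            }
            unsafe {
                parking_lot_core::park(
                    self as *const _ as usize,
                    // validate: re-check the state under the queue lock.
                    || self.state.load(Ordering::Relaxed) == (LOCKED_BIT | PARKED_BIT),
                    || {},     // before_sleep
                    |_, _| {}, // timed_out
                    DEFAULT_PARK_TOKEN,
                    None, // no timeout
                );
            }
            // Unparked (or validation failed): retry from the top.
        }
    }

    #[cold]
    fn unlock_slow(&self) {
        unsafe {
            // The callback runs while parking_lot_core holds the queue
            // lock, making the state change atomic with the wakeup.
            parking_lot_core::unpark_one(self as *const _ as usize, |result| {
                let new = if result.have_more_threads { PARKED_BIT } else { 0 };
                self.state.store(new, Ordering::Release);
                DEFAULT_UNPARK_TOKEN
            });
        }
    }
}
```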

I don't understand what you have in mind here. parking_lot_core makes use of the system primitives as a "back-end" to actually put threads to sleep and wake them up. See the different ThreadParker implementations.

15 Likes

The idea there is to see what a "correct" version of the style of implementation we have today (i.e. Mutex is backed by pthread_mutex_t, etc.) looks like. In particular, how painful/slow is it to do deadlock detection?

To solve the reentrancy problem, you basically need to keep a thread-local map of counters keyed by mutex address which track how many times the current thread has locked a particular mutex. This map will need to be updated on every lock and unlock operation, which will likely be a disaster for performance.
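
A minimal sketch of that bookkeeping (all names hypothetical), just to show where the per-operation cost comes from:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // How many times the current thread holds each mutex, keyed by
    // the mutex's address. Consulted on every lock and unlock.
    static HELD: RefCell<HashMap<usize, usize>> = RefCell::new(HashMap::new());
}

// Called before blocking on the OS mutex.
fn note_lock(mutex_addr: usize) {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        let count = held.entry(mutex_addr).or_insert(0);
        if *count > 0 {
            // Locking again would be UB in the default pthreads mode.
            panic!("reentrant locking of a non-reentrant mutex");
        }
        *count += 1;
    });
}

// Called after unlocking the OS mutex.
fn note_unlock(mutex_addr: usize) {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        let count = held.get_mut(&mutex_addr).expect("unlock without lock");
        *count -= 1;
    });
}
```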

I think trying to detect reentrancy around a primitive where reentrancy is UB is just a waste of time. It is a lot of work, complex code that can go wrong, and likely horrible for performance.

On Windows, even reentrant read locks of a RwLock are UB! TBH I can only consider such a primitive broken, and the best one can do with it is avoid it. On macOS, "only" recursive write locking of an RwLock is UB.

On Linux, we are actually fine because pthreads there says recursive RwLock is okay, and for Mutex we go to great lengths to set the required flags and not move stuff. Unfortunately some parts of libstd have to use the non-reentrant-safe mutex because they need a const initializer, so this is still far from a great situation, but at least it is not UB (hoping none of the libstd-internal users screwed up).


Given the rather shocking situation that system-level concurrency primitives are in, I am fully in favor of rolling our own implementation. We already have park/unpark in libstd, which only need a condvar. So we could restrict our platform-specific implementations to just what is needed for a basic condvar (and it'd likely be unsafe because we'd use a non-reentrant-safe mutex, but it only has one user so that seems okay), and then build the rest on top of that. Since parking_lot does pretty much that, we probably do not even have to figure out what this API provided by libstd::sys should look like. :wink:

I imagine that would be the ThreadParker @Amanieu mentioned? That's on a higher level than I expected, i.e. it seems to do everything park/unpark do instead of a shared implementation using even lower-level primitives that then have platform-specific implementations. @Amanieu could you briefly explain the key differences between ThreadParker and the corresponding code in libstd?


Oh, so we do not have a guarantee that when the current thread holds a read lock, acquiring another read lock will succeed? That could conflict with fairness, but it is also extremely useful; actually it seems to me that it is basically a requirement to make reentrant read locking not a hazard.

But I see this is a method on RwLock (I somehow expected it would be on RwLockReadGuard), so I could get my guarantee by always using read_recursive. I just have to know that I have to do that. (Also, once we come to API bikeshedding I'd suggest calling it read_reentrant. This is not just about recursive functions.)

2 Likes

A thread can hold a mutex only once, so it's not really a count. I have an implementation somewhere of a reentrancy-fixed mutex that stores the ID of the thread holding the mutex, so in lock we can test whether it is the current thread and then panic (or loop {}, or whatever). But even that was already killing perf in my very limited benchmarking.
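
A rough sketch of that idea (the thread-ID scheme and all names here are illustrative, not the actual implementation; it is generic over any lock_api::RawMutex standing in for the OS mutex):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A cheap stand-in for a thread ID on stable Rust: the address of a
// thread-local is unique among live threads.
fn thread_id() -> usize {
    thread_local! { static ID: u8 = 0; }
    ID.with(|x| x as *const u8 as usize)
}

// Wraps an underlying raw mutex and panics on reentrant locking
// instead of hitting the underlying primitive's UB.
pub struct CheckedRawMutex<R> {
    inner: R,
    owner: AtomicUsize, // 0 = unowned
}

impl<R: lock_api::RawMutex> CheckedRawMutex<R> {
    pub const fn new() -> Self {
        CheckedRawMutex { inner: R::INIT, owner: AtomicUsize::new(0) }
    }

    pub fn lock(&self) {
        // One extra TLS lookup and atomic load on every acquisition.
        if self.owner.load(Ordering::Relaxed) == thread_id() {
            panic!("reentrant locking of a non-reentrant mutex");
        }
        self.inner.lock();
        self.owner.store(thread_id(), Ordering::Relaxed);
    }

    // Caller must actually hold the lock, hence unsafe.
    pub unsafe fn unlock(&self) {
        self.owner.store(0, Ordering::Relaxed);
        unsafe { self.inner.unlock() };
    }
}
```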

2 Likes

ThreadParker is a very low-level API that is tied to how parking_lot_core manages thread queues. Essentially, parking_lot_core will place a thread on a wait queue for the lock that it is currently waiting on. Each queue has its own lock.

Here is the process for parking a thread (code):

  1. Before inserting the current thread into a queue, prepare_park is called to reset the ThreadParker. This method is not thread-safe, but at this point no other thread can access the ThreadParker since it isn't in a queue yet.
  2. The queue is locked, the thread is inserted into the queue, and then the queue is unlocked.
  3. park or park_until is called, which will block until another thread unparks us or the timeout is reached.
  4. If we were unparked, then we are done: the thread that unparked us will have removed us from the queue.
  5. If we timed out, then we need to lock the queue again to remove ourselves from the queue. However there may have been a race where another thread unparked us after park_until returned but before we locked the queue. We call timed_out (thread-unsafe, must hold queue lock) to check whether we were unparked by another thread (with no races since we hold the lock).
  6. If we timed out, remove the current thread from the queue and then unlock it.

Here is the process for unparking a thread (code):

  1. Lock the queue.
  2. For each thread that we want to unpark, call the unpark_lock method. This sets a flag to tell the ThreadParker to wake up, but doesn't actually wake up the thread. An UnparkHandle is returned and kept for later.
  3. Unlock the queue.
  4. For each UnparkHandle that we collected, call the unpark method. This will perform the actual system call which wakes up the target thread. We do this outside the queue lock because we don't want to hold the lock for long periods while performing system calls.

As you can see, this is significantly more complex than the simple park and unpark functions provided by the standard library, especially UnparkHandle which needs to ensure that the ThreadParker it is associated with doesn't disappear.

I think the standard library park and unpark should just be implemented using parking_lot::Condvar to keep things simple.
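
Something along those lines, presumably: a minimal sketch of park/unpark built on parking_lot's Mutex and Condvar (one such structure per thread; names hypothetical):

```rust
use parking_lot::{Condvar, Mutex};

// One of these would live in each thread's handle.
struct Parker {
    notified: Mutex<bool>,
    cvar: Condvar,
}

impl Parker {
    fn park(&self) {
        let mut notified = self.notified.lock();
        // The token makes an unpark() that happens before park() count.
        while !*notified {
            self.cvar.wait(&mut notified);
        }
        *notified = false;
    }

    fn unpark(&self) {
        let mut notified = self.notified.lock();
        *notified = true;
        self.cvar.notify_one();
    }
}
```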

It is useful to have RwLock prefer writers over readers since this avoids writer starvation if there is a continuous stream of overlapping readers. To make this work, new readers need to block if there is a pending writer, even if they could acquire a read lock directly (i.e. the lock is currently owned by existing readers). The downside is that if you try to acquire a read lock when you already own one in the current thread, you could deadlock if there is a writer currently waiting on the lock.

read_recursive (I don't mind changing the name) simply changes the behavior to always acquire a read lock if possible, even if there is a pending writer. The downside, as mentioned before, is that it may lead to writer starvation.

3 Likes

I should also note that one of the reasons ThreadParker is so complicated is that parking_lot guarantees that there will not be any spurious wakeups, e.g. for Condvar.

1 Like

The big fundamental change in parking_lot seems to be the use of a global map (“static HASHTABLE” defined in parking_lot.rs) to store data related to contended locks.

However, it seems like one could relatively easily create a variant of parking_lot that eliminates that map and instead puts that data in the Mutex/RwLock/etc. structures. The result would be architecturally the same as the current synchronization primitives, with the same trade-offs, except implemented in pure Rust and safer (in particular on Linux, both would be using futexes in similar ways).

Obviously this will make those structures bigger, but it might improve performance and especially worst-case performance, and generally make such a change much less controversial and risky (although I think the best “default” choice at least for new code would be to have a global map).

Then one could consider the change to a global map for contended primitives separately (of course both approaches can be supported in the same library by adding a generic parameter that would be stored as a field in Mutex/etc. and have two implementations that either contain inline “parking” data or delegate to the global map).

Maybe this could be an option that could be considered?

I am somewhat hesitant to make large changes to parking_lot at this point. The current implementation has been extensively tested in the wild and works well enough.

Also do keep in mind that the global map is only used in the slow path where you are going to be making at least one system call anyways (either waking a thread, or suspending the current thread). So there isn’t much performance advantage in putting that data in the lock itself.

1 Like

I really think that we should move away from an OS-backed system so that we have more control and can support things that some OSes don’t (like timeouts on lock acquisition, which Windows doesn’t support if I am not mistaken).

Does parking_lot’s Mutex only require 2 bits to store each lock, like WebKit’s original parking lot implementation? If this is the case, does it mean that we can teach the compiler about this and make types like Mutex<Box<T>>, Mutex<Option<T>>, etc. store the mutex state for free?

parking_lot::Mutex currently uses 1 full byte, but only uses the low 2 bits in that byte, like the WebKit implementation.

Regarding your idea, I think that a much easier way of saving space would be to place the mutex in the padding bytes at the end of a struct. See https://github.com/rust-lang/rfcs/issues/1397.
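
A quick check of the sizes involved (assuming a 64-bit target and a current parking_lot release):

```rust
use std::mem::size_of;

fn main() {
    // The raw lock state is a single byte.
    assert_eq!(size_of::<parking_lot::RawMutex>(), 1);
    assert_eq!(size_of::<parking_lot::Mutex<()>>(), 1);
    // For a u8 payload the lock state costs exactly one extra byte...
    assert_eq!(size_of::<parking_lot::Mutex<u8>>(), 2);
    // ...but for word-sized payloads, alignment rounds it up to a
    // whole extra word (16 bytes on a 64-bit target).
    assert_eq!(size_of::<parking_lot::Mutex<usize>>(), 16);
}
```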

The usual enum niche savings come from re-using full values that would be invalid instances of the inner type -- like a null Box<T>.

A Mutex needs to keep additional information alongside a valid instance of the inner type. For a Box<T> that could mean using the pointer's low bits, which alignment leaves unused. However, you'd no longer be able to have a &Box<T> anywhere (like for Deref::deref(&self)) if you don't have a "clean" value to reference.

Sorry I’ve been a bit slow catching up on all this, but thanks for all the replies everyone!

@Amanieu it sounds like it’s not necessarily super easy to integrate parking lot primitives into libstd (not that we necessarily expected it to be so), but I definitely agree that we should keep the same API in libstd for now and avoid any breaking changes. We can hopefully leverage a different implementation to eventually provide a more feature-ful API (timeouts, etc), but for now we just need to strive for feature parity with the current API.

I doubt I’m alone when I say that the libs team would be very interested in seeing what at least a prototype of parking lot (or more likely just a subset) being moved into the standard library would look like. This’ll provide a great basis to compare/contrast various implementations.


For continuing to use OS-level primitives, I realized earlier that we’re already doing a TLS lookup for the current state of panicking on locks/unlocks, so we can probably piggy-back on this to store information (optionally) about lock state in the local thread structure. That may give us a good entry point to detect recursion in a cheap fashion.

For example, as a bare-bones implementation we could maintain a linked list in a thread structure of locks currently held (nodes on the stack). This’d be slow if we held a lot of locks at the same time (all new acquisitions would walk the list), so we could instead maybe use some bloom-filter-like thing to optimize the fast path, where it’s highly unlikely you’re acquiring the same mutex.
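
Something like this, perhaps (a deliberately bare-bones sketch; all names are hypothetical, and it assumes locks are released in LIFO order to keep the unlock path trivial):

```rust
use std::cell::Cell;
use std::ptr;

thread_local! {
    // Head of the per-thread list of currently-held locks.
    static HELD: Cell<*const Node> = Cell::new(ptr::null());
}

// One of these lives on the stack for each lock acquisition.
struct Node {
    mutex_addr: usize,
    next: *const Node,
}

// Called before blocking on the OS mutex.
fn check_and_push(node: &mut Node) {
    HELD.with(|head| {
        // Walk the locks this thread already holds.
        let mut cur = head.get();
        while !cur.is_null() {
            // SAFETY: every node on the list lives in a stack frame of
            // this thread that is still active.
            unsafe {
                if (*cur).mutex_addr == node.mutex_addr {
                    panic!("reentrant lock acquisition detected");
                }
                cur = (*cur).next;
            }
        }
        node.next = head.get();
        head.set(node as *const Node);
    });
}

// Called after unlocking; assumes LIFO release order for brevity.
fn pop() {
    HELD.with(|head| {
        // SAFETY: the head was pushed by check_and_push on this thread.
        unsafe { head.set((*head.get()).next) };
    });
}
```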

In any case this is just a thought! I probably don’t have time to implement it personally just yet.


One last thing I wanted to mention: I’ve witnessed historically that Linux pthread mutexes were orders of magnitude faster than a naive park/unpark-based mutex, so we need to be careful when evaluating performance to not just consider uncontended performance. I believe the issue I witnessed a long time ago was that Linux pthread mutexes conveniently had all threads scheduled on the same CPU core, whereas a park/unpark-based mutex had threads spread out across all cores, so waking up other threads took much longer (IPIs vs. a context switch).

1 Like

I think that integrating parking_lot primitives is easier than you think: parking_lot::RawMutex is almost a drop-in replacement for the internal platform-specific sys::mutex::Mutex in libstd. Same with the other primitives (Condvar, RwLock, etc). We can make changes to the public API later as a separate step.

I personally feel that continuing to use OS-level primitives is a dead end and not worth pursuing considering how difficult and costly it is to avoid all the UB cases.

I think you may be thinking of barging vs FIFO (fair) locks (see this article). In summary: when a mutex is unlocked, a FIFO lock will directly transfer lock ownership to the next thread in the queue without unlocking it, while a barging lock will simply unlock the mutex, allowing any thread to acquire it. In practice barging locks are much faster because they allow a thread to re-acquire the same lock multiple times in a row without waiting for a context switch.

parking_lot uses barging locks (as does pthread_mutex_t on Linux), so it doesn't suffer from this performance penalty. In any case there should be no scheduling difference from the kernel's point of view, since we are just blocking on a futex like the pthread implementation.

11 Likes

Is there a plan here? Pending PR to put parking_lot in std? :slight_smile:

3 Likes

The next step would be for someone to write up an implementation and open a PR!

1 Like