Non-Send Futures


I had given this general idea some thought myself after reading the Jane Street post about proposed OCaml features for thread-safety. They include an API they call “capsule”, which is analogous to the thread-safety boundary for memory that physical threads in Rust offer, but separate from physical threads.

This idea thus doesn’t just relate to Futures. It could also apply to better ways of wrapping non-thread-safe data structures, and more. E.g. for Rust, if you have a graph of Rc pointers contained within a data structure, that whole thing could safely be transferred between threads, as long as it's ensured that all contained Rcs cross threads together.

My most minimal and initial idea of what a “capsule” type in Rust could look like (first and foremost for illustration, and ignoring most of the details and additional features around how the thing is designed for OCaml in the Jane Street post) was

impl<T> Capsule<T> {
    fn new(init: impl FnOnce() -> T + Send) -> Capsule<T>;
    fn with<R: Send>(
        &mut self,
        cb: impl FnOnce(&mut T) -> R + Send
    ) -> R;
}

impl<T> Send for Capsule<T> // !!
impl<T> Sync for Capsule<T> // this one uninteresting with only `&mut self` API

And such a minimal API can be implemented safely, with some additional 'static bounds, using real threads: Rust Playground. The idea then is, of course, that it in principle could truly be implemented way more efficiently by just calling the callbacks in-place, and merely simulating the memory-boundaries of threads via the API.
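For illustration, here is a minimal sketch of how such a thread-backed Capsule might look. It is not the code from the linked playground: the Job alias, the worker loop and the demo function are assumptions made up for this example, and it includes the extra 'static bounds mentioned above. The T is created on a dedicated worker thread and never leaves it; only Send callbacks and Send results cross the boundary.

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// A boxed callback that runs on the worker thread owning the `T`.
/// (Hypothetical helper alias, not part of the API in the post.)
type Job<T> = Box<dyn FnOnce(&mut T) + Send>;

/// Thread-backed capsule sketch: it only holds a channel to the worker,
/// so it is `Send` even when `T` itself is not.
pub struct Capsule<T: 'static> {
    jobs: mpsc::Sender<Job<T>>,
}

impl<T: 'static> Capsule<T> {
    pub fn new(init: impl FnOnce() -> T + Send + 'static) -> Self {
        let (jobs, inbox) = mpsc::channel::<Job<T>>();
        thread::spawn(move || {
            // The value is created, used and dropped on this thread only.
            let mut value = init();
            for job in inbox {
                job(&mut value);
            }
        });
        Capsule { jobs }
    }

    pub fn with<R: Send + 'static>(
        &mut self,
        cb: impl FnOnce(&mut T) -> R + Send + 'static,
    ) -> R {
        let (reply, result) = mpsc::channel();
        let job: Job<T> = Box::new(move |value: &mut T| {
            let _ = reply.send(cb(value));
        });
        if self.jobs.send(job).is_err() {
            panic!("capsule worker has exited");
        }
        result.recv().expect("capsule callback panicked")
    }
}

/// Usage: an `Rc` lives inside the capsule while the capsule itself is
/// moved to another thread.
pub fn demo() -> (i32, usize) {
    let mut capsule = Capsule::new(|| Rc::new(Cell::new(1)));
    let sum = capsule.with(|rc| {
        let alias = rc.clone(); // a second Rc, confined to the capsule
        alias.set(alias.get() + 41);
        rc.get()
    });
    let count = thread::spawn(move || capsule.with(|rc| Rc::strong_count(&*rc)))
        .join()
        .unwrap();
    (sum, count)
}
```

Dropping the capsule closes the channel, which ends the worker loop and drops the value on the worker thread. An in-place implementation would keep the same signatures but skip the thread and the channels.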


Because of thread_locals as well as Mutex’s guard’s Drop implementation, I arrived at the same conclusion that there’s a big conflict with the current meaning of Send & Sync, and one would ideally want two separate kinds of Send/Sync, which I thought to call CSend/CSync (for Capsule) and TSend/TSync (for Thread). In each case, T: …Sync just relates to the &T: …Send bound, so I’m focusing mostly on just the …Send traits.

The Capsule type above would be written with CSend throughout

impl<T> Capsule<T> {
    fn new(init: impl FnOnce() -> T + CSend) -> Capsule<T>;
    fn with<R: CSend>(
        &mut self,
        cb: impl FnOnce(&mut T) -> R + CSend
    ) -> R;
}

impl<T> CSend for Capsule<T>
impl<T: TSend> TSend for Capsule<T>

The impl<T: TSend> TSend for Capsule<T> makes the whole thing sound in the presence of physical-thread-bound APIs like MutexGuard.

Migration strategies seem a huge pain… just pretending legacy Send (which I still call just “Send”) means the same as CSend makes a large amount of pre-existing code unsound. The sound way is to make old Send equal TSend + CSend (with a strong enough trait-synonym feature, or something equivalent, so that current explicit implementations of the trait would still work and result in the respective TSend and CSend implementations), and then allow new non-breaking refinements to existing APIs.

Regarding the implementations, an existing one like

impl<T: Send> Send for Option<T> {}

would thus (at least if it were explicitly written) translate to

impl<T: Send> TSend for Option<T> {}
impl<T: Send> CSend for Option<T> {}

i.e.

impl<T: TSend + CSend> TSend for Option<T> {}
impl<T: TSend + CSend> CSend for Option<T> {}

and the desired new implementations

impl<T: TSend> TSend for Option<T> {}
impl<T: CSend> CSend for Option<T> {}

are no longer a breaking change.
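To make the bookkeeping concrete, here is a rough sketch using ordinary marker traits on stable Rust. Everything in it is an assumption for illustration: real TSend/CSend would be auto traits, LegacySend is a made-up stand-in for today's Send, OsThreadBound is a made-up stand-in for something like MutexGuard<()>, and the Rc impl uses only a T: TSend bound rather than the full set of bounds one might want.

```rust
use std::marker::PhantomData;
use std::rc::Rc;

/// Hypothetical: the value may be moved to another physical (OS) thread.
pub unsafe trait TSend {}
/// Hypothetical: the value may cross a capsule (or task) boundary.
pub unsafe trait CSend {}

/// Stand-in for legacy `Send`, modelled as the conjunction of both traits.
pub trait LegacySend: TSend + CSend {}
impl<T: TSend + CSend> LegacySend for T {}

// Plain data is fine on both axes.
unsafe impl TSend for i32 {}
unsafe impl CSend for i32 {}

// The refined `Option` impls: each axis only looks at its own bound on `T`.
unsafe impl<T: TSend> TSend for Option<T> {}
unsafe impl<T: CSend> CSend for Option<T> {}

/// Stand-in for a type tied to its OS thread (think `MutexGuard<()>`):
/// it may cross a capsule boundary, but not a physical-thread boundary.
pub struct OsThreadBound(PhantomData<*const ()>);
unsafe impl CSend for OsThreadBound {}

// An `Rc` that is never `CSend` but is `TSend` when its contents are.
// (Assumption: simplified bound, chosen only to keep the sketch short.)
unsafe impl<T: TSend> TSend for Rc<T> {}

fn requires_tsend<T: TSend>() {}
fn requires_csend<T: CSend>() {}
fn requires_legacy_send<T: LegacySend>() {}

/// Each call compiles only because the corresponding impl exists.
pub fn demo() {
    requires_legacy_send::<Option<i32>>(); // both axes, like `Send` today
    requires_csend::<Option<OsThreadBound>>(); // newly allowed by refinement
    requires_tsend::<Option<Rc<i32>>>(); // newly allowed by refinement
    // requires_legacy_send::<Option<Rc<i32>>>(); // would not compile
}
```

Code written against LegacySend keeps compiling unchanged, while the two refined Option impls accept strictly more types than the conjunction did, which is why the refinement is not a breaking change.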

As for some types’ respective bounds: A MutexGuard<i32> would be CSend + CSync + !TSend + TSync of course, and generically for MutexGuard<T>, the traits other than TSend would just require the same bound on T.

For Rc, one could simply deny CSend while still allowing TSend, as in

// analogous to current `Rc<T>: !Send`
impl<T> !CSend for Rc<T> {}
impl<T> !CSync for Rc<T> {}

// analogous to current `Arc` implementation
impl<T: TSync + TSend> TSend for Rc<T> {} 
impl<T: TSync + TSend> TSync for Rc<T> {} 

though now that I’m writing this, I’m wondering whether or not the TSend one could even be weakened to just T: TSend, since !CSync implies it cannot be referenced on multiple physical threads at once anyways.

Interestingly enough, TSend and CSend stand for two orthogonal kinds of memory-boundary: one based on capsules and other thread-like boundaries, like tasks for futures; and one based on physical threads. Given that, I’m noticing that if we call the above a CSync-limited Rc, it’s also feasible to build a TSync-limited version. (Take the above implementations and swap the roles of the T… and C… traits.) I’ve wondered whether that’s necessarily only possible by fully duplicating the type and its API, or whether something similar can be achieved by wrapping a CSync-limited Rc somehow to make it - effectively - TSync-limited.


Another very fascinating takeaway from the Jane Street blog (which takes a while to understand, given one has to learn their syntax first) is that when you end the life of a capsule, you can just have it “merge” with any other given context. In other words: while Rust’s threading API requires today that in

pub fn spawn<F, T>(f: F) -> JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,

the result of a thread of type T must be allowed to be sent between threads, if we forego the physical threads interpretation it doesn’t have to be the case. A joining thread can simply “inherit” the thread’s local memory (if it weren’t for thread_locals and the like). With CSend-ness taken into account, the new version could look like

pub fn spawn<F, T>(f: F) -> JoinHandle<T>
where
    F: FnOnce() -> T + TSend + CSend + 'static,
    T: TSend + 'static,

with no requirement of T: CSend. Ultimately, this has the effect that a thread can return some data structure containing Rcs to the thread it joins with, and that should work and be sound.


Of course now thread_local is also a large issue. It works without any trait bounds today, so it would not translate well without breakage. If we were to freshly design Rust today, one could simply say “thread-local data must be CSync”, similar to how global statics must be Sync today. That’s probably the “correct” change, but it would be breaking. Another thing I’ve considered and that might work is an effects system for access to thread_local:


Imagine Rust was single-threaded today, and we wanted to introduce Sync and Send for the first time. Of course single-threaded Rust allows you to place any data in global variables. You cannot mutably borrow them (without a RefCell), but there are no restrictions. Also, unsafe code commonly relies on the single-threaded nature of execution for soundness.

How to introduce Send/Sync and thread::spawn, anyways? Well… how did we introduce running Rust at compile-time, even though so many operations, including basic things like allocation, aren’t possible at compile-time? With effects, like const fn. (I call const an effect here, even though it’s arguably the taking-away of an effect; but I’ll stay consistent with const here and just call the additional keyword an “effect”.)
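As a reminder of how that existing marker behaves, here is a small sketch (the function names are made up for this example). A const fn is restricted in what it may do, and in exchange it can be called both at compile time and at runtime; unmarked functions keep their old, unrestricted meaning.

```rust
/// Restricted function: its body may only use operations that are allowed
/// during compile-time evaluation.
pub const fn square(x: u32) -> u32 {
    x * x
}

/// Evaluated by the compiler, because a `const` item is a const context.
pub const AT_COMPILE_TIME: u32 = square(7);

/// The very same function is still callable from ordinary runtime code.
pub fn at_runtime(x: u32) -> u32 {
    square(x)
}

/// An unmarked function keeps its unrestricted meaning (here: it allocates),
/// and in exchange cannot be called from a const context.
pub fn label(x: u32) -> String {
    format!("value = {x}")
}
```

The thread fn idea follows the same pattern: the new keyword marks the restricted functions, and everything unmarked keeps behaving as before.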

Unmarked functions (ordinary fn) still only run single-threaded. The analogue of “single-threaded” here is “running in the main thread”. Any code that supports being executed in threads other than the main thread would need to be a thread fn. Like const fn, this comes with language restrictions. Notably, in a thread fn, you can no longer access the legacy version of global variables, and only the new ones that require T: Sync are available. The same issues as with const fn apply: we still need a solution for effects in traits (e.g. T: const Eq or const FnOnce() bounds, etc… and const Destruct), but eventually we can mark everything thread-safe with thread fn, and write thread::spawn as

pub thread fn spawn<F, T>(f: F) -> JoinHandle<T>
where
    F: thread FnOnce() -> T + Send + 'static + thread Destruct,
    T: Send + 'static + thread Destruct,

where A: thread Destruct means values of type A may be dropped outside of the main thread, and thread FnOnce() -> T can be called outside of the main thread.


Now introducing capsules could in principle work the same. A new effect capsule (naming TBD) would mark code that can be executed in the context of a Capsule::with callback. It cannot access legacy thread_local, but only a new thread_local with a CSync constraint. (Whether or not there are ways to still have these two thread_local versions merged into one is something I’m not sure about yet - probably yes.)

In either case, the obvious downside is that there’s a lot of code to mark then. This same problem applies with const already, it’s slow to mark everything const as such, and it’s manual effort. I believe, for safe code, there can be lints, or even automatic tools, to add const markers to functions that support it, but automation can only go so far, and in any case it would be a deliberate opt-in, also for future-compatibility (this argument of “what if I want to make my function do non-const stuff later!?” is a lot stronger for const, as that’s severely limited compared to non-const). For a capsule effect, even more code could support it than for const, so it would have the unfortunate effect that ultimately, probably most code would be marked with the effect. Editions could help making the opposite the default, but it’s still not an easy problem to solve.


Finally, some more interactions with existing code / other crates, besides the obvious case of thread_local and MutexGuard that currently rely on the “Sync/Send means physical thread” assumption.

  • Crates like send_wrapper (I believe there was at least one more similar one, too) offer an API that allows moving non-Send data between threads, and ensures safety by guarding access to the data with a check against the current thread’s id. In a world as outlined above, such a type could only remove TSend bounds with such dynamic checks, not CSend ones. With an effects system, the relevant API and all of its users would simply never gain the capsule fn marker.

    This does demonstrate that even a breaking change to thread_local itself wouldn’t eliminate all problematic API; code can use ThreadIds and Send bounds manually to connect Send with a notion of physical threads.

  • On a positive note, the currently problematic Ungil of pyo3 could benefit significantly from changes like the introduction of separate CSend vs TSend. Instead of their current unsound solution to use Send (and in the presence of scoped thread-locals, perhaps even their nightly auto-trait approach isn’t sound), it could simply use CSend bounds and require the callback to be capsule FnOnce at the same time.
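To show what “guarding access with a check against the current thread’s id” means in the first bullet, here is a simplified sketch in the spirit of such crates. It is not send_wrapper’s actual API: the ThreadBound name, try_get, and the choice to leak on a foreign-thread drop are assumptions for this example.

```rust
use std::mem::ManuallyDrop;
use std::rc::Rc;
use std::thread::{self, ThreadId};

/// Hypothetical wrapper: the value may travel between threads, but is only
/// reachable (and only dropped) on the thread that created it.
pub struct ThreadBound<T> {
    owner: ThreadId,
    value: ManuallyDrop<T>,
}

// SAFETY (under today's "Send means physical threads" reading only): every
// access to `value`, including its destructor, is guarded by a ThreadId check.
unsafe impl<T> Send for ThreadBound<T> {}

impl<T> ThreadBound<T> {
    pub fn new(value: T) -> Self {
        ThreadBound {
            owner: thread::current().id(),
            value: ManuallyDrop::new(value),
        }
    }

    /// Returns the value only when called on the owning thread.
    pub fn try_get(&self) -> Option<&T> {
        if thread::current().id() == self.owner {
            Some(&*self.value)
        } else {
            None
        }
    }
}

impl<T> Drop for ThreadBound<T> {
    fn drop(&mut self) {
        if thread::current().id() == self.owner {
            // SAFETY: we are on the owning thread and drop the value once.
            unsafe { ManuallyDrop::drop(&mut self.value) };
        }
        // On any other thread the value is deliberately leaked.
    }
}

/// Usage: a non-Send `Rc` takes a round trip through another thread.
pub fn demo() -> (bool, bool, bool) {
    let bound = ThreadBound::new(Rc::new(5));
    let on_owner = bound.try_get().is_some();
    let (bound, on_other) = thread::spawn(move || {
        let seen = bound.try_get().is_some();
        (bound, seen)
    })
    .join()
    .unwrap();
    let back_home = bound.try_get().map(|rc| **rc) == Some(5);
    (on_owner, on_other, back_home)
}
```

The check only distinguishes physical threads. Two capsules (or tasks) running on the same OS thread would both pass it, which is why a type like this could drop a TSend requirement but not a CSend one.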


Oooh interestingly Capsule looks a lot like an experimental crate I wrote a while ago: diplomatic_bag - Rust (docs.rs).

I'm slightly confused by your TSend and CSend, you say:

For Rc, one could simply deny CSend while still allowing TSend,

but I don't see how Rc could possibly be TSend: if two Rcs end up on different threads at the same time, you can immediately cause UB. More generally, I can't see how something could be TSend but not CSend; if it can move between threads, then why couldn't it move between Capsules?

On the original post, I nearly want to suggest that Tokio could just do:

trait TaskSend {}
impl<T: Send> TaskSend for T {}
impl<T: TaskSend> TaskSend for Cell<T> {}

This doesn't work for specialization reasons, and more importantly isn't very useful because it's not an auto-trait and isn't implemented on generators. However, I do think this gives a possible path for question (3) "Is there a backwards-compatible path we could take to make this a reality?":

  1. Add TaskSend (bikesheddable) as an auto trait like Send, and have trait Send: TaskSend.
  2. Implement TaskSend for currently !Send types like Cell.
  3. Libraries like tokio drop their bounds from Send to TaskSend.

This wouldn't be entirely backwards compatible: things like bounds on associated types couldn't be relaxed easily because someone might be relying on their Send nature, but that's always going to be a problem.

On question (2) "Is this an important problem in practice to look into?", I don't think I've ever hit this issue in practice, purely because I can always drop down to the thread-safe versions of things. Admittedly those types are marginally slower but not enough so that I've ever worried about it.


Yep. The lack of CSend (and CSync) already means that clones of the same Rc will never end up on different threads simultaneously.

The Rc<Something>: TSend however is important to make, say, a Capsule<Vec<Rc<Something>>> still implement Send (Send == TSend + CSend). This is only sound if Something isn’t a type like MutexGuard<()>, hence we track the whole TSend-ness all the way through Rc, Vec, and Capsule. The capsule prevents any clones of the Rcs from leaking (accessibly) anywhere other than the Vec in question, and sending the whole capsule at once to another physical thread is thus ultimately allowed.

So ultimately TSend means “can be moved to another physical thread”, but does not imply any real thread-safety. For actually directly sending something between threads, like sending something through a mpsc::channel, you usually need TSend + CSend. TSend on its own is only relevant in contexts where identifying threads matters, e.g. by the OS caring which specific thread does something, or some API that looks at the ThreadId, or API that involves thread_locals.


I’ve just realized that one place where this comes up is async traits. Right now, we need new language syntax to specify the sendness of futures returned by async methods.

In a world where sendness isn’t required for work-stealing, I think we could safely make all futures !Send, without any extra linguistic surface area?

True, although there are futures that are not movable between OS threads. We'd need some mechanism to write traits/bounds that accept ?OsThreadSend futures and that problem is identical to the current one with Send.

Rc isn’t just about simultaneous mutation, though. It uses non-atomic operations, which means you can’t safely increment the reference count on one OS thread and decrement it on another. By contrast, if you have one OS thread that’s multiplexing futures, it should actually work fine. So I think that makes it CSend but not TSend.

It's a common misconception (or, should I say, oversimplification). Remember that threads routinely migrate across CPU cores. In other words, in certain situations, you can:

The biggest problem here is thread-local storage, which has poor compatibility with the async execution model, and Rust currently does not provide any protection in this area.

That’s true of pretty much any !Send type: “if you happen to not expose any of the problem cases, nothing goes wrong”.

To make Foo unsound it is sufficient to add fn get_a(&self) -> Rc<u32> (which would be a totally innocuous and normal thing to do if not for the unsafe impl Send ).

from the response, Controversial opinion: keeping `Rc` across `await` should not make future `!Send` by itself - #5 by trentj - The Rust Programming Language Forum

The task at hand isn’t just to develop a rule that allows safe programs and conservatively disallows unsafe programs. It also has to be a model a human can understand, depend mostly on local reasoning for an efficient debug build, produce diagnostics that are highly likely to point to the correct conflict if a user breaks the rules, and allow for evolution of implementation across versions without breaking API. So having the compiler guess what is and isn’t safe has to be done very carefully.

Which doesn’t mean it’s impossible! Swift is trying to do it right now, using a very conservative escape analysis: SE-0414: Region Based Isolation - Proposal Reviews - Swift Forums. But Send checking has been in Rust a lot longer than it’s been in Swift.


We already have a very good and widely known model. It's called threading. And Rust supports it splendidly. You spawn tasks/threads which may be executed in parallel and may migrate between CPU cores freely. You get safety by carefully guarding boundaries between tasks/threads using Send/Sync.

I believe that an asynchronous execution model should emulate the threading model as closely as possible (after all, fundamentally they are not that different from each other), which the current Rust async model clearly fails at. An asynchronous execution model allows very nice additional tricks thanks to the additional scheduling control, but they are just extensions to the base model.

This is one of the reasons why I strongly dislike Rust async and stay as far away from it as possible. Rust got itself into a corner by making Future implementers nothing more than a common type, instead of treating task (persistent) stacks specially like we do with thread stacks. In my opinion, we would've been much better off with a stackful (fiber) model together with semi-breaking (i.e. introduced in a new edition) changes to TLS to fix the async incompatibility and, maybe, with additional "async" compilation targets.

Yes, I know that Rust had it pre-1.0, but, in my experience, we can implement it in a much more lightweight fashion, especially on top of new APIs like io-uring. And people do it in production as, for example, can be seen in this comment! At my work we also plan to completely side-step Rust async and use our custom stackful executor (our solution for the TLS problem is, unfortunately, "just carefully review all TLS uses").

Another common counter-argument to the stackful model is embedded (bare-metal) devices. But there is a potential solution for that as well. In most cases the compiler is able to compute exactly how much stack space functions will use, and on embedded devices you often have to track it either way (e.g. using tools like stack-sizes). So a future compiler could in theory compute the stack size necessary for a stackful task. There are also potential niceties about being able to build hybrid cooperative/preemptive systems (e.g. higher priority tasks launched by an interrupt may forcefully preempt tasks with lower priority) not possible in the stackless model, but that's a whole other topic.

Another breakage example is various OS APIs that just mandate that things happen on a particular execution thread, like pthread_mutex_unlock. Though I think that the turtle those APIs stand on is thread locals again?

This is arguably off-topic, but - it's usually not thread locals.

On OSes like Darwin where pthread mutexes use priority inheritance by default, there's a good reason you can't unlock from another thread. With priority inheritance, if a high-priority thread wants the mutex, the OS temporarily boosts the priority of the current owner in the hope they'll finish their work and unlock the mutex sooner. But that doesn't work if the current owner handed off the MutexGuard to some random other thread. The OS has to know who is doing the work. (It might be nice if there was a way to tell the OS to switch mutex ownership to a specified other thread, but there is no API for it.)

Most OSes don't enable priority inheritance by default, and in the implementations I've seen, pthread_mutex_unlock has an ownership check just as a way to detect API misuse, not because the implementation internals require it.


Linux too has mutices with PI (though not by default). We use them heavily at work on real time Linux. The lack of exposing this functionality in the Rust standard library is annoying to be honest.


Just a quick remark on effects:

The effect should be "tls" or so. Effects generally say "this code may do something"; absence of effects implies "code definitely does not do something".

"const" is not an effect, it's the absence of an effect.

It's not just TLS, there's also thread IDs.

Basically your argument relies on the Rust concept of threads being entirely "virtual", with no way for code to check "which thread it is in". Then the async runtime can just pretend like each async task is its own thread and it can migrate that between host OS threads. (This can be modeled perfectly in my RustBelt soundness proof.) But this is broken both by Rust threads being tied to TLS and by thread IDs.

If you mean ThreadId, aren't those implemented using TLS?

I’m aware, hence also the remark a little bit further up in my post:

Just like with const, the newly introduced kind of function is the one without the effect, so it’s the one that gets the keyword, so that keyword-less functions keep the same meaning as before: “unrestricted TLS access[1] being allowed”. Negated names in keywords (say, not_tls[2]) would be a bit weird. The name is a placeholder anyways. I’m working analogously to const also since it’s precedent and familiar to Rust programmers, thus familiar to all readers.


  1. as well as more generally code that makes the assumption that non-Send data is allowed to be used safely as long as thread identity is dynamically checked

  2. or … well … no_data_race_protection_via_dynamic_thread_identity_checks or something like that, as it’s not only about TLS?

Ah true. The ID is assigned using a global counter but then it is stored using TLS. So I guess a no-tls function could also not fetch the current thread ID.

I’ll stay consistent with const here and just call the additional keyword an “effect”.

I did indeed miss the paragraph you quoted in your reply, sorry for that. However calling "const" an effect is just as misleading. These are technical terms and misuse of technical terms is a great way to cause misunderstandings. (I'm not involved in the effects initiative so I don't know their latest terminology. But if they plan to call "const" an effect then I think that's a mistake.)

IMO no_tls is also way more clear than capsule. It fits well with other qualifiers that are considered regularly, like no_unwind. (But anyway we don't have to bikeshed this now. I just wanted to reduce the confusion around the term "effect". We already have plenty of confusion there as things stand. :wink: )


While Rust's ThreadId is physically disassociated from the OS thread, the OS still provides a concept of thread ID which is fundamentally tied to the actual execution thread. Rust std doesn't expose a way to get the current OS thread ID IIRC (unix can get pthread_t from a join handle, but that's definitely an OS thread anyway, even in a world with virtual thread isolation), but linking in extern code which provides that access is straightforward and would currently be implied sound.

(While const is the removal of an effect, being generic over const is still a generalization over an effect, it's just that the effect is the absence of the const restriction. So const would still qualify as an “effect keyword” since it controls an effect.)

Yes, it can be summarized like this. To use your terminology, I think that ideally Rust should define a basic model of entirely "virtual" threads/tasks and it should be the only model available in std. Most code can run in this basic model without any issues. The model then can be extended on an opt-in basis with additional capabilities available for native OS threads or fiber-based asynchronous execution contexts.

I don't see a big problem with ThreadId. In async context it would simply change meaning to TaskId. We also probably could somehow emulate "task-local storage", which would be used as "async TLS".

I probably should explicitly highlight that I imply the introduction of new experimental "asynchronous" fiber-based compilation targets, something like x86_64-unknown-linux-io-uring. Shoehorning an asynchronous execution model on top of synchronous std, while probably possible, would be a much harder task.

While I do think having !Send futures and work stealing at the same time would be awesome, I don't think this alone justifies the amount of breakage required. If we ever plan to have Rust 2.0 I would certainly put this on the list. But in the meantime, I think our best path forward is to drop the !Send bounds wherever possible.


I think Rust is the only place I've ever seen these kinds of negated effects discussed, the only previous encounter being a recent post by boats.

https://without.boats/blog/poll-next/

This is a lead-in to a longer term vision of introducing a “pinned effect.”

Which would be like a not_move.