LLVM coroutines - to bring awareness


#1

Gor Nishanov has posted an LLVM RFC to add support for coroutines directly in LLVM.


#2

Keep in mind you would also need borrow-checking support for it to be safe in Rust.


#3

@zonyitoo What do you think about it? Is it useful for coroutine-rs?


#4

What’s being added to LLVM is native support for stackless coroutines, e.g. ES6/C# generators, while coroutine-rs appears to implement stackful coroutines. The difference is whether you can call “yield” from any function (stackful) or only at the top level of the coroutine itself (stackless).
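
To make the distinction concrete, here is a minimal sketch in Rust of what a stackless coroutine boils down to: a hand-written state machine whose only suspension points are in its own body. The names Counter, resume and drain are made up for illustration; they are not part of the LLVM RFC or of coroutine-rs.

```rust
/// Hand-written state machine for a generator that yields 0, 1, 2.
/// (Hypothetical type, written by hand; a compiler would generate this.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Counter {
    Start,        // not resumed yet
    Yielded(u32), // suspended right after yielding this value
    Done,         // ran to completion
}

impl Counter {
    /// Runs the body until the next "yield"; None means the body returned.
    fn resume(&mut self) -> Option<u32> {
        match *self {
            Counter::Start => {
                *self = Counter::Yielded(0);
                Some(0)
            }
            Counter::Yielded(n) if n < 2 => {
                *self = Counter::Yielded(n + 1);
                Some(n + 1)
            }
            _ => {
                *self = Counter::Done;
                None
            }
        }
    }
}

/// Resumes the generator until it finishes and collects what it yielded.
fn drain(mut counter: Counter) -> Vec<u32> {
    let mut out = Vec::new();
    while let Some(v) = counter.resume() {
        out.push(v);
    }
    out
}
```

Every suspension point is an arm of resume itself, which is the "only at the top level" restriction. A stackful coroutine would instead keep a whole call stack alive, so any function called from the body could suspend.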


#5

From the demonstrations that Gor did, it seems that you can, quite efficiently, build stackful coroutines on top of stackless ones.

Basically, you end up with a linked list of “memoized” stack frames, one for each active function call.
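
A rough sketch of that idea in Rust, assuming each "function" is its own small state machine and the caller's frame owns the callee's frame. Inner and Outer are hypothetical names and this is not how Gor's demonstration is written; it only shows how a yield inside a nested call can suspend the whole chain.

```rust
/// Frame of the callee: yields the values in next..end.
struct Inner {
    next: u32,
    end: u32,
}

impl Inner {
    fn resume(&mut self) -> Option<u32> {
        if self.next < self.end {
            let v = self.next;
            self.next += 1;
            Some(v)
        } else {
            None // callee returned
        }
    }
}

/// Frame of the caller: calls the inner "function" twice (bases 10 and 20).
/// The link to the live callee frame is what forms the chain of frames.
struct Outer {
    step: u32,
    callee: Option<Box<Inner>>,
}

impl Outer {
    fn new() -> Self {
        Outer { step: 0, callee: None }
    }

    fn resume(&mut self) -> Option<u32> {
        loop {
            // A live callee frame means we were suspended inside a nested
            // call, so resuming the chain resumes the innermost frame first.
            if let Some(inner) = self.callee.as_mut() {
                if let Some(v) = inner.resume() {
                    return Some(v);
                }
                self.callee = None; // callee finished, drop its frame
            }
            // Advance the caller's body to its next call site.
            let base = match self.step {
                0 => 10,
                1 => 20,
                _ => return None,
            };
            self.step += 1;
            self.callee = Some(Box::new(Inner { next: base, end: base + 2 }));
        }
    }
}

/// Resumes the outer frame until it finishes.
fn drain(mut outer: Outer) -> Vec<u32> {
    let mut out = Vec::new();
    while let Some(v) = outer.resume() {
        out.push(v);
    }
    out
}
```

Here the chain is only two frames deep and the callee is boxed for simplicity; a deeper call chain would add one such link per active call.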


#6

Of course! That would be helpful!!

BTW, what I would pay more attention to is whether this extension is compatible with TLS. For example, when you call resume on another thread, would LLVM swap the TLS table?


#7

It would be nice to have this extension built directly into LLVM. I have already built a usable coroutine scheduling library in Rust, see https://github.com/zonyitoo/coio-rs .

But we found that it is very hard to work around TLS variables when migrating coroutines between threads, see https://github.com/zonyitoo/coio-rs/issues/56 . This extension might make it possible to tell LLVM to re-evaluate TLS accesses after a coroutine is resumed!


#8

BTW, what I would pay more attention to is whether this extension is compatible with TLS. For example, when you call resume on another thread, would LLVM swap the TLS table?

See https://github.com/GorNishanov/CoroutineWording/issues/2 .

Thread local storage is not special in any way. When you read a TLS variable, you always get the value of the variable in the current thread (per the definition of TLS). In particular, Gor says:

Compilers won’t cache the addresses of a TLS across the suspend point as it will violate the “you get the thread-local of the currently running thread” behavior.

Whether you are shooting yourself in the foot or not is up to you. It is very easy to shoot yourself in the foot. For example, in the above issue, I posted the following C++ code:

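// Assumes a coroutine-enabled std::future and an awaitable SomeAsyncApi(), as in the issue.
std::future<void> foo() {
  thread_local auto tls = 314;
  for (int i = 0; i < 10; ++i) {
    std::cout << tls << std::endl;  // reads the copy of tls that belongs to the current thread
    co_await SomeAsyncApi();        // suspension point: may be resumed on a different thread
  }
}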

On the thread where this function is first called, the thread_local variable tls is initialized to 314 (thread_local implies static). If, after suspension, I call .get on the same thread but the system scheduler resumes the coroutine on a different thread, then reading from the variable tls would be a read from uninitialized memory (and thus UB).

Gor confirmed that this behavior is correct, and that whether UB occurs or not will depend on what the system/environment scheduler does, which, at least for C++, is allowed to migrate coroutines between threads at will (so you have no guarantees about which thread a coroutine will run on).

This issue is completely orthogonal to TLS though. There are other C++ proposals about executors and schedulers that provide more control.

The main point is, however, that reasoning about TLS variables inside resumable functions is, in general, impossible, since you are not even guaranteed that these are initialized.
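
For comparison, a small sketch of the same migration hazard in Rust, using a hand-written resumable Task instead of a real coroutine (Task, TLS and resume_on_two_threads are hypothetical names). One difference from the C++ scenario above: thread_local! initializes lazily on each thread, so the second thread reads its own default value rather than uninitialized memory. The value the task "remembers" from before the suspension is still gone.

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Every thread gets its own copy, lazily initialized to 0.
    static TLS: Cell<u32> = Cell::new(0);
}

/// Stand-in for a suspended coroutine; it carries no reference to TLS.
struct Task {
    resumes: u32,
}

impl Task {
    /// Returns (number of resumes so far, TLS value seen on this thread).
    fn resume(&mut self) -> (u32, u32) {
        self.resumes += 1;
        (self.resumes, TLS.with(|v| v.get()))
    }
}

/// Resumes the task once on the current thread and once on a new thread.
fn resume_on_two_threads() -> ((u32, u32), (u32, u32)) {
    let mut task = Task { resumes: 0 };

    // Set the current thread's copy, then "run until the first suspension".
    TLS.with(|v| v.set(314));
    let first = task.resume();

    // "Migrate" the task: resume it on another thread, which has its own TLS.
    let second = thread::spawn(move || task.resume())
        .join()
        .expect("worker thread panicked");

    (first, second)
}
```

The first resume sees 314 and the second sees 0, because the lookup happens on whichever thread runs the resume. If the compiler had cached the address of TLS across the suspension, the second read would instead go through the first thread's slot, which is the problem discussed in this thread.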


#9

Compilers won’t cache the addresses of a TLS across the suspend point as it will violate the “you get the thread-local of the currently running thread” behavior.

Nah, as I said in the issue https://github.com/zonyitoo/coio-rs/issues/56 , LLVM will actually cache the address, which we have already confirmed with the official Rust team.

If TLS addresses won’t be cached, then we (with @lhecker) can continue working on coio-rs!!


#10

Currently, LLVM doesn’t know anything about coroutines, and gives you no guarantees.

Gor’s RFC gives you this guarantee. If his implementation caches the address, it is a bug.

He has a fork of LLVM where he implemented the RFC, so you might want to give that a try. The changes have not been upstreamed yet since the RFC is still evolving.

In my opinion, the most relevant aspect for Rust is that coroutines containing DSTs are not part of the RFC and are only mentioned in the future work section. This is not a problem for C++ (which does not have DSTs), but it is a huge problem for Rust. If rustc wants to be able to reuse LLVM’s coroutine implementation, coroutines must support DSTs.


#11

If he wants his coroutine implementation to be used in a multithreaded environment, he has to tell LLVM not to cache TLS addresses between suspensions. But it seems that what he mostly focuses on is stackless coroutines, which cannot be transferred between threads.


#12

IIUC this is what he does in both his MSVC and LLVM implementations: TLS addresses are not cached across @llvm.experimental.coro.suspend invocations, but I don’t know whether the intrinsic handles that or clang does. The current revision of the RFC does not mention anything about this, so it might be clang doing it. I’ve pinged him on the issue and will let you know once he answers.

But it seems that what he mostly focuses on is stackless coroutines, which cannot be transferred between threads.

I think you have the wrong expectations about this RFC. This RFC proposes primitives for defining functions with suspension points and transforming them into state machines (as well as optimization passes on those). Implementing coroutines as state machines is not the only way of implementing coroutines, but it is one of the most efficient ways of implementing coroutines that we know.

Clang uses these primitives to implement the C++ Coroutines Technical Specification, which supports both stackless and stackful coroutines with a combination of language features, library types, and runtime support (including a multi-threaded system scheduler that migrates coroutines between threads).

This RFC is basically the set of primitives that the main author of C++ Coroutines (both the specification and MSVC and Clang implementations) thinks would be useful to the whole LLVM community, such that other languages can reuse these to build whatever coroutine semantics they want (not necessarily those of the C++ coroutines TS).

However, this has obviously only been tested with clang and C++ coroutines, which is why he is asking for feedback. IMO he is only going to get good feedback if people try to reuse the primitives to implement coroutines in other languages and report their findings on the mailing list. The RFC has had no responses.

One of the places where this shows is, for example, in the definition of a coroutine stack frame, where the size of the frame must be a constant. What happens when a coroutine has a DST in its stack frame, like a C99 VLA or a Rust DST? Then the size of the stack frame must be dynamic and the coroutine itself becomes a DST, but it might still be possible to avoid any memory allocation at run-time. Since C++ doesn’t have DSTs, this is left as future work (mainly because clang does offer VLAs in C++ as an extension, and at some point it might want to allow using VLAs inside coroutines as well). Progress would be faster here if frontends for languages with DSTs gave this a try.
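
To illustrate what a frame with a dynamic size means on the Rust side, here is a small sketch with a hypothetical Frame type whose last field may be unsized. It says nothing about how the RFC lays out coroutine frames; it only shows that the size of such a value is known at compile time for a fixed-size buffer, but only at run time once the buffer is viewed as a slice, and that no heap allocation is involved.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical coroutine frame: a resume point plus saved locals.
/// With T = [u8; N] the frame is sized; with T = [u8] it is a DST.
struct Frame<T: ?Sized> {
    resume_point: u32,
    locals: T,
}

/// Size of a frame with an 8-byte buffer, a compile-time constant.
const FIXED_FRAME_SIZE: usize = size_of::<Frame<[u8; 8]>>();

/// Size of a frame whose buffer length is only known at run time.
fn dynamic_frame_size(frame: &Frame<[u8]>) -> usize {
    size_of_val(frame)
}

/// Builds two frames on the stack and measures them through the unsized view.
fn frame_sizes() -> (usize, usize) {
    let small = Frame { resume_point: 0, locals: [0u8; 8] };
    let large = Frame { resume_point: 0, locals: [0u8; 64] };
    // &Frame<[u8; N]> coerces to &Frame<[u8]> without any allocation.
    (dynamic_frame_size(&small), dynamic_frame_size(&large))
}
```

Both frames go through the same dynamic_frame_size function, so the callee cannot know the size statically. That is the situation a frontend with DSTs would need the coroutine primitives to handle.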

So @zonyitoo if you want to give his LLVM fork a try I think that would be awesome. You might want to contact him first and tell him about it in case he has any hints.


#13

Update on this topic with a video from Gor: https://www.youtube.com/watch?v=8C8NnE1Dg4A

It contains a lot of info for implementors.

Ping @zonyitoo @alexcrichton @aturon @carllerche


#14

Clang builtins for C patch: https://reviews.llvm.org/rL283155


#15

BTW, this LLVM coroutine support can only provide stackless coroutines, which work just like Python’s generators. It is not suitable for building stackful coroutines like those in coio-rs.


#16

Also, they are aware of the TLS caching problem: http://lists.llvm.org/pipermail/llvm-dev/2016-June/100840.html