I spent some time reading into Rayon’s codebase, specifically rayon-core. Let me summarize some of the patterns I saw:
Panic or return: you got two choices.
First, a sort of obvious one, but Rayon relies on being able to fully control control flow. Specifically, we assume that when we call a function, that function will either return, panic, or run forever. Since we can catch panics, that means we can guarantee that if a function returns, it returns to us. This allows us to advertise pointers into our stack, confident that before we return, we can ensure that those pointers are no longer active. Something like setjmp()/longjmp() would ruin this.
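To make the "return or panic" assumption concrete, here is a minimal sketch (not Rayon's actual code). The helper name run_with_stack_slot is made up for illustration. It hands out a raw pointer into its own stack frame and relies on std::panic::catch_unwind to regain control before that frame is popped:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper (not part of Rayon): lends `work` a raw pointer into
/// this function's stack frame and regains control whether `work` returns
/// or panics.
pub fn run_with_stack_slot<F>(work: F) -> Result<i32, String>
where
    F: FnOnce(*mut i32),
{
    let mut slot: i32 = 0;
    let slot_ptr: *mut i32 = &mut slot;

    // `work` either returns or panics; in both cases control comes back here
    // before `slot` goes out of scope.
    let outcome = panic::catch_unwind(AssertUnwindSafe(|| work(slot_ptr)));

    // From this point on the advertised pointer is no longer used by anyone.
    match outcome {
        Ok(()) => Ok(slot),
        Err(_) => Err("work panicked".to_string()),
    }
}
```

A caller would pass something like `|p| unsafe { *p = 7 }`. The point is only that no path out of `work` can skip the code after catch_unwind.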
Transmute, and plenty of it.
Second, true confessions, I kind of love transmute(). It’s such a handy interconversion tool, and Rayon’s source uses it frequently, though typically in tightly controlled ways (i.e., with the target type explicitly written). I guess it’d be interesting to see if those uses can be expressed without transmute – and, perhaps more importantly, if that is in some way safer versus just being different (I admit I am dubious).
(Ultimately I kind of think we’ll want some rules that say “types can be interconverted in these scenarios”, and those rules will apply whether you use transmute, a union, or any other sort of cast.)
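As a rough illustration of what "with the target type explicitly written" can look like, and of what a transmute-free variant might be, here is a small sketch. The function names are invented for this example. The data pointer can be erased with a plain cast. As far as I know there is no equivalent cast between two fn pointer types, so that conversion still goes through transmute:

```rust
use std::mem::transmute;

/// Erase the pointee type with a transmute whose source and target types
/// are both spelled out.
pub fn erase_with_transmute<T>(p: *const T) -> *const () {
    // Safety: both types are thin raw pointers of the same size.
    unsafe { transmute::<*const T, *const ()>(p) }
}

/// The same erasure written as an ordinary pointer cast, no transmute.
pub fn erase_with_cast<T>(p: *const T) -> *const () {
    p.cast::<()>()
}

/// Erase the argument type of a fn pointer. Calling the result is only
/// sound if it is handed a pointer that really points at a `T`.
pub fn erase_fn<T>(f: unsafe fn(*const T)) -> unsafe fn(*const ()) {
    // Safety: fn pointers have a fixed size; only the argument type changes.
    unsafe { transmute::<unsafe fn(*const T), unsafe fn(*const ())>(f) }
}
```

Both pointer versions yield the same address, which fits the suspicion that the cast is mostly "different" rather than obviously safer.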
Background: JobRef
The key abstraction in rayon-core is JobRef, which is a kind of home-spun trait object:
struct JobRef {
    pointer: *const (),
    execute_fn: unsafe fn(*const ())
}
We create the JobRef by taking in a *const T of some type T: Job, where Job is a trait like so:
trait Job {
    unsafe fn execute(this: *const Self);
}
The actual construction just extracts T::execute as the fn pointer and then does some transmutes to erase the type T etc:
impl JobRef {
    pub unsafe fn new<T>(data: *const T) -> JobRef
        where T: Job
    {
        let fn_ptr: unsafe fn(*const T) = <T as Job>::execute;
        JobRef {
            pointer: transmute(data), // erase types
            execute_fn: transmute(fn_ptr),
        }
    }
}
I think we could have used a real trait object, but those do not currently support *const self methods (for which there is no good reason). So we create it ourselves. We used to use a trait with &self, but that always felt like something of a lie. fn(self) would probably be more correct, but we don’t support that in trait objects either. =)
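Putting the pieces above together, a self-contained version of this home-spun trait object might look roughly like the following. This is a simplified sketch rather than the rayon-core source: CountingJob and run_counting_job_twice are invented for the example, the execute method on JobRef is my assumption about how such a handle would be invoked, and the data pointer is erased with a cast instead of a transmute.

```rust
use std::cell::Cell;
use std::mem::transmute;

pub trait Job {
    unsafe fn execute(this: *const Self);
}

/// Type-erased handle: a thin data pointer plus the function that knows
/// how to interpret it.
#[derive(Clone, Copy)]
pub struct JobRef {
    pointer: *const (),
    execute_fn: unsafe fn(*const ()),
}

impl JobRef {
    /// Safety: `data` must stay valid for as long as the JobRef may run.
    pub unsafe fn new<T: Job>(data: *const T) -> JobRef {
        let fn_ptr: unsafe fn(*const T) = <T as Job>::execute;
        JobRef {
            pointer: data as *const (), // erase the pointee type
            // Safety: only the argument's pointee type differs.
            execute_fn: unsafe {
                transmute::<unsafe fn(*const T), unsafe fn(*const ())>(fn_ptr)
            },
        }
    }

    /// Safety: the pointer stored at construction must still be valid.
    pub unsafe fn execute(self) {
        unsafe { (self.execute_fn)(self.pointer) }
    }
}

/// Hypothetical job type used only for this example.
pub struct CountingJob {
    pub runs: Cell<u32>,
}

impl Job for CountingJob {
    unsafe fn execute(this: *const Self) {
        // Safety: the caller guarantees `this` points at a live CountingJob.
        let job = unsafe { &*this };
        job.runs.set(job.runs.get() + 1);
    }
}

/// Builds a job on the stack, runs it twice through the erased handle,
/// and reports how often it ran.
pub fn run_counting_job_twice() -> u32 {
    let job = CountingJob { runs: Cell::new(0) };
    unsafe {
        let job_ref = JobRef::new(&job as *const CountingJob);
        job_ref.execute();
        job_ref.execute();
    }
    job.runs.get()
}
```

Nothing in the JobRef type records what T was; the only thing keeping the call sound is that execute_fn and pointer were created together.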
For one thing, the backing pointer in a JobRef can actually be of many different kinds:
- when using join(), the pointer is an &StackJob<F>, logically – it points into the join() stack frame, which is guaranteed not to be popped until the job has executed. (Here F is the user’s closure.)
- when using spawn(), the pointer is a Box<HeapJob<F>> – the job is allocated on the heap, but there is only ever one pointer to it. (Again, the F is a closure.)
- when using futures, the pointer is an Arc<Future<F>> – the job is allocated on the heap, and there are multiple handles to it (e.g., it may be enqueued, but also referenced by other futures). (Here the F is a future.)
In all of these cases, the code that creates the JobRef ensures that it has the correct type of pointer, and the Job implementations then transmute the *const Self into the appropriate pointer type (e.g., Box<Self> etc). I think I might have preferred to implement the Job trait for the pointer types (e.g., implement Job for Box<HeapJob<F>> instead of HeapJob<F>) but that would have required passing the trait objects “by value” which doesn’t really work (basically, the existing struct encodes the constraint that a Job can only be implemented for something that is represented as a thin pointer).
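For the Box case, the round trip could be sketched as below. This is an illustration under my own assumptions, not the rayon-core implementation: the HeapJob here is cut down to a single closure field, run_heap_job is a made-up driver, and ownership is recovered with Box::from_raw rather than a transmute.

```rust
use std::cell::Cell;
use std::rc::Rc;

pub trait Job {
    unsafe fn execute(this: *const Self);
}

/// Simplified stand-in for a heap-allocated job holding the user's closure.
pub struct HeapJob<F: FnOnce()> {
    func: F,
}

impl<F: FnOnce()> HeapJob<F> {
    pub fn new(func: F) -> Box<Self> {
        Box::new(HeapJob { func })
    }

    /// Gives up ownership of the box; `execute` is expected to reclaim it.
    pub fn into_raw(self: Box<Self>) -> *const Self {
        Box::into_raw(self) as *const Self
    }
}

impl<F: FnOnce()> Job for HeapJob<F> {
    unsafe fn execute(this: *const Self) {
        // Safety: the caller promises `this` came from `into_raw` and that
        // the job is executed at most once, so we hold the only pointer.
        let job: Box<Self> = unsafe { Box::from_raw(this as *mut Self) };
        let HeapJob { func } = *job; // move the closure out; the box is freed
        func();
    }
}

/// Hypothetical driver: boxes a job, leaks it to a raw pointer, runs it
/// once, and returns how many times the closure fired.
pub fn run_heap_job() -> u32 {
    let hits = Rc::new(Cell::new(0));
    let hits_in_job = Rc::clone(&hits);
    let job = HeapJob::new(move || hits_in_job.set(hits_in_job.get() + 1));
    let raw = job.into_raw();
    unsafe { Job::execute(raw) };
    hits.get()
}
```

The Arc case would presumably follow the same shape with Arc::into_raw and Arc::from_raw; the stack case simply reborrows the pointer as a reference and never frees anything.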
Observations on JobRef: escaping
One thing I noticed is that Rayon frequently has cases where we “hide” lifetimes. Often this is accomplished by an (unsafe) function. One example is the one that creates a JobRef for a stack job (this is mildly simplified to remove irrelevant details):
impl<F, R> StackJob<F, R>
    where F: FnOnce() -> R
{
    pub unsafe fn as_job_ref(&self) -> JobRef {
        JobRef::new(self) // implicit: coercion from `self` to `*const Self`
    }
}
The problem here? The &self is actually escaping into the JobRef. Under some rules, this would be illegal.
I have to say that literally every time I look for this pattern in Rayon I find it. Now, I’ve generally been careful to write unsafe fn in those cases, precisely because I want to signal that something fishy is going on. But I wouldn’t be surprised to learn that this sort of erasure is a very common thing for people to do (at least if their unsafe code is trying to be clever with lifetimes).
In any case, there are numerous other examples of hiding lifetimes in this way.
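Here is a stripped-down sketch of the lifetime-hiding pattern. The names Erased, as_erased, and sum_through_erased_handle are invented, and this StackJob is a toy that borrows a slice rather than the real Rayon type. The handle returned by as_erased carries no lifetime at all, so the borrow checker can no longer see that it points into the caller's stack:

```rust
/// Lifetime-free handle: nothing in this type mentions the borrow it hides.
pub struct Erased {
    pointer: *const (),
    call: unsafe fn(*const ()) -> i32,
}

/// Toy stack job that borrows its input for `'a`.
pub struct StackJob<'a> {
    pub input: &'a [i32],
}

impl<'a> StackJob<'a> {
    unsafe fn sum_erased(this: *const ()) -> i32 {
        // Safety: the caller guarantees `this` still points at a live
        // StackJob whose borrowed input has not been freed.
        let job = unsafe { &*(this as *const StackJob<'a>) };
        job.input.iter().sum()
    }

    /// # Safety
    /// The returned handle must not be used after `self` (or the data it
    /// borrows) has gone out of scope. The compiler will not check this.
    pub unsafe fn as_erased(&self) -> Erased {
        Erased {
            pointer: self as *const Self as *const (), // `&self` escapes here
            call: Self::sum_erased,
        }
    }
}

/// Uses the erased handle while the job and its data are still alive.
pub fn sum_through_erased_handle() -> i32 {
    let data = vec![1, 2, 3];
    let job = StackJob { input: &data };
    let handle = unsafe { job.as_erased() };
    // Sound only because `data` and `job` outlive this call.
    unsafe { (handle.call)(handle.pointer) }
}
```

If the handle were returned from sum_through_erased_handle instead of being used inside it, the code would still compile, which is exactly why the function is marked unsafe.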
Future code: types that may indeed be out of scope (but no accessible data)
In the futures code, the future ultimately has a type like ScopeFuture<'scope, F, S> where F is the future code. This has a field of type Option<Spawn<CU<F>>> – the details of the type aren’t important, except to note that this type F may have references (though F: 'scope). In general, we guarantee that all aliases of ScopeFuture expose this F type, and hence whenever they are used, the type F is in scope (and hence the references are valid).
However, there is one case where that is not true: you also get a future that represents the result of the future. It is of type RayonFuture<T, E>, where T and E are the result and error types of the future. Note that this type does not include F. This is intentional, as we want this type to escape and be usable wherever a T is usable. This is achieved by hiding the type F through a trait object (and some transmuting). The subtle point is this: through the methods on that trait object, it is possible to execute methods of the original ScopeFuture after 'scope has ended. However, in such cases, all the fields that may have references (e.g., that Option I mentioned) have been set to None (and, in any case, are not accessed). So there are no actual references accessible; it is only the types that may be invalid.
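A heavily simplified sketch of that situation follows. None of these names come from Rayon: ScopedWork stands in for ScopeFuture, ErasedResult for the hidden trait object, and erase_scope for the transmuting step. The work closure borrows a local, is run and cleared to None inside the scope, and only then is the handle allowed to outlive the scope:

```rust
use std::mem::transmute;

/// The interface the escaping handle exposes; it does not mention `F`.
pub trait ErasedResult {
    fn take_result(&mut self) -> Option<usize>;
}

/// Stand-in for the scoped future: `work` may hold references that are
/// only valid during the scope, `result` never does.
pub struct ScopedWork<F> {
    work: Option<F>,
    result: Option<usize>,
}

impl<F: FnOnce() -> usize> ScopedWork<F> {
    /// Runs the work (if any) and leaves `work` as `None` afterwards.
    pub fn run(&mut self) {
        if let Some(f) = self.work.take() {
            self.result = Some(f());
        }
    }
}

impl<F> ErasedResult for ScopedWork<F> {
    fn take_result(&mut self) -> Option<usize> {
        debug_assert!(self.work.is_none()); // no scoped data left to touch
        self.result.take()
    }
}

/// # Safety
/// The boxed value must no longer hold or access any data borrowed for
/// `'scope` once the scope has ended.
pub unsafe fn erase_scope<'scope>(
    b: Box<dyn ErasedResult + 'scope>,
) -> Box<dyn ErasedResult + 'static> {
    unsafe { transmute(b) }
}

/// The handle outlives `text`, but by then `work` is already `None`.
pub fn result_outlives_scope() -> Option<usize> {
    let mut handle: Box<dyn ErasedResult> = {
        let text = String::from("scoped");
        let text_ref = &text;
        let mut fut = Box::new(ScopedWork {
            work: Some(move || text_ref.len()),
            result: None,
        });
        fut.run();
        unsafe { erase_scope(fut) }
    };
    handle.take_result()
}
```

After the inner block ends, the concrete type behind the handle still names a closure that borrowed text, but no value of that closure type exists any more; only the type is stale.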
Panic guards
There are various places where we use “aborting” panic guards. These are in place to prevent unexpected unwinding from messing things up. I have found it tends to be impossible – and kind of subtle! – to predict all the places that user code may be invoked and try to catch unwinding. In particular, it is hard to see all the places where Drop might run – for a while I tried to be very careful about catching drop failures, but eventually I gave up and decided that we would just abort if a Drop impl should panic (as such, I’ve added drop guards here and there). A lint that indicates when a value of generic type might be dropped would, I think, be very useful – since I’ve likely missed some spots.
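An aborting guard of this kind can be sketched in a few lines. The names AbortIfPanic and run_guarded are chosen for this example and the details are my assumption, not a copy of Rayon's guard. The guard's Drop only runs if unwinding passes through the frame, because the normal path forgets the guard first:

```rust
use std::mem;
use std::process;

/// Guard whose destructor aborts the process. It is meant to be dropped
/// only by an unexpected unwind.
pub struct AbortIfPanic;

impl Drop for AbortIfPanic {
    fn drop(&mut self) {
        eprintln!("unexpected panic; aborting");
        process::abort();
    }
}

/// Runs `f`; if `f` (or a Drop impl it triggers) panics, the process
/// aborts instead of unwinding through this frame.
pub fn run_guarded<R>(f: impl FnOnce() -> R) -> R {
    let guard = AbortIfPanic;
    let result = f();
    mem::forget(guard); // normal return: defuse the guard without dropping it
    result
}
```

The appeal is that it needs no knowledge of where user code might run inside `f`; the cost is that a panic there takes down the whole process.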
Scoping of unsafe
Rayon is pretty careful to mark functions as unsafe if they place any “additional constraints” on their caller (e.g., cannot allow return value to escape a certain lifetime etc). There are usually comments explaining what those constraints are, as well, though not always. =)