The main reason of having a split between safety and validity is that whether an execution of a program is defined is only controlled by validity.
We very much want the question of “is this execution well-defined” to be answerable (and decidable),
However, it is possible (and in fact, had been proposed) to have a “tootsie pop” model with 2 separate notions of validity - one for safe code and one for unsafe code, but that is not what this post is talkin about.
Continuing a discussion from Zulip, I think we may want to phrase both of these invariants as a kind of “range”. In other words, we might want to specify two things:
Things that your unsafe code must ensure is true
Things that your unsafe code can rely on being true
this should be a subset of the first one
The idea is that there are some validity invariants that we require now but which we might not always want to require. For example, right now we require that &T is aligned. Unsafe code might therefore rely on this by taking some random &T, casting it to a usize, and packing things in the low bits. If we decided to drop the requirement that &T is aligned – we could adjust our own codegen, but we can’t change that unsafe code.
Put another way, if all unsafe code implicitly has “validity requirements are met” as a precondition, we are not free to make more things valid without breaking that unsafe code.
Now, I’m not sure if there are any things we might consider changing in the future, but I think this is something we have to keep in mind.
This article is both rigorous and accessible at the same time - a great achievement!!
Notice that the “boundary” is not necessarily where the unsafe block ends. The boundary occurs
where the unsafe code provides a public API that safe code is intended to use – that might be at
the module boundary, or it might even be at the crate level.
(and more discussion in the comments)
I think that even if the model doesn’t insist on the ‘correct’ placement of unsafe, it’s still worth encouraging authors to do so. If you had a function like
RalfJung, after reading your arguments on the MaybeUninit RFC and in your blog post “Two Kinds of Invariants”, I think you’re right that &mut T should be UB for uninitialized T. But, as you mentioned, there is already a lot of code that relies on this behaving as naively expected. All of this code could be converted to use *mut T, but the only way I’m aware of to get struct/enum member offsets (in general) is to create two &mut T's to uninitialized T, cast them to pointers, and then subtract. So at the very least, an intrinsic needs to be provided to get struct/enum member offsets before exploiting &mut T being UB for uninitialized T.
Even better would be to provide type or compiler-level support for partial initialization. I wrote a blog post recently about one approach, but to make it truly usable (i.e. efficient), a high-level optimization needs to be added to rustc. It would be amazing if someone intimately familiar with rustc and compiler optimizations could comment on the feasibility/difficulty of implementing such an optimization.
In the Reddit discussion for my blog post, we came up with an alternative approach using an &uninit T type that is linear (exactly single use) and write-only. This approach would be more ergonomic and more deeply engrained into the compiler. I plan to write a blog post exploring this latter idea a bit more, but it seems like it would be much more involved to implement, and so should likely be de-prioritized in favor of a quicker solution; but again, I would love to get input from someone intimately familiar with rustc.
That's funny, because I never argued for this. I brought it up as a possibility, but my own thinking is actually that &mut T is only UB for NULL/unaligned references, and (through the aliasing model) for dangling pointers, but makes no statement about the pointee being initialized. The next topic of discussion in the unsafe-code-guidelines WG will be covering this.
It is true that we would be in a more type-safe world if people used some kind of &out instead of &mut when data is still to-be initialized -- but it is not at all clear to me that all that extra complexity would be worth the benefit.