Excellent! This is the first solid answer I’ve seen to a question I’ve been wondering about: Is it okay for this code to mutate the memory behind an &mut str such that it is temporarily not UTF-8?
Under this model, I think this code can be correct if a non-UTF-8 str is valid but not safe.
The main reason of having a split between safety and validity is that whether an execution of a program is defined is only controlled by validity.
We very much want the question of “is this execution well-defined” to be answerable (and decidable),
However, it is possible (and in fact, had been proposed) to have a “tootsie pop” model with 2 separate notions of validity - one for safe code and one for unsafe code, but that is not what this post is talkin about.
To be more precise, an &mut str pointing to non-UTF-8 is valid but not safe.
I think str itself would have being UTF-8 as validity requirement. Being unsized, that’s not extremely relevant, but would be relevant for unsized locals, for example.
And tail is always safe, right? So graphemes is only ever called on safe data? Yeah then I think this is fine under my proposal.
Continuing a discussion from Zulip, I think we may want to phrase both of these invariants as a kind of “range”. In other words, we might want to specify two things:
Things that your unsafe code must ensure is true
Things that your unsafe code can rely on being true
this should be a subset of the first one
The idea is that there are some validity invariants that we require now but which we might not always want to require. For example, right now we require that &T is aligned. Unsafe code might therefore rely on this by taking some random &T, casting it to a usize, and packing things in the low bits. If we decided to drop the requirement that &T is aligned – we could adjust our own codegen, but we can’t change that unsafe code.
Put another way, if all unsafe code implicitly has “validity requirements are met” as a precondition, we are not free to make more things valid without breaking that unsafe code.
Now, I’m not sure if there are any things we might consider changing in the future, but I think this is something we have to keep in mind.
This article is both rigorous and accessible at the same time - a great achievement!!
Notice that the “boundary” is not necessarily where the unsafe block ends. The boundary occurs
where the unsafe code provides a public API that safe code is intended to use – that might be at
the module boundary, or it might even be at the crate level.
(and more discussion in the comments)
I think that even if the model doesn’t insist on the ‘correct’ placement of unsafe, it’s still worth encouraging authors to do so. If you had a function like
that you know is safe because of the module/crate visibility boundary, I would still consider this bad practice - the function should be marked unsafe.
RalfJung, after reading your arguments on the MaybeUninit RFC and in your blog post “Two Kinds of Invariants”, I think you’re right that &mut T should be UB for uninitialized T. But, as you mentioned, there is already a lot of code that relies on this behaving as naively expected. All of this code could be converted to use *mut T, but the only way I’m aware of to get struct/enum member offsets (in general) is to create two &mut T's to uninitialized T, cast them to pointers, and then subtract. So at the very least, an intrinsic needs to be provided to get struct/enum member offsets before exploiting &mut T being UB for uninitialized T.
Even better would be to provide type or compiler-level support for partial initialization. I wrote a blog post recently about one approach, but to make it truly usable (i.e. efficient), a high-level optimization needs to be added to rustc. It would be amazing if someone intimately familiar with rustc and compiler optimizations could comment on the feasibility/difficulty of implementing such an optimization.
In the Reddit discussion for my blog post, we came up with an alternative approach using an &uninit T type that is linear (exactly single use) and write-only. This approach would be more ergonomic and more deeply engrained into the compiler. I plan to write a blog post exploring this latter idea a bit more, but it seems like it would be much more involved to implement, and so should likely be de-prioritized in favor of a quicker solution; but again, I would love to get input from someone intimately familiar with rustc.
That's funny, because I never argued for this. I brought it up as a possibility, but my own thinking is actually that &mut T is only UB for NULL/unaligned references, and (through the aliasing model) for dangling pointers, but makes no statement about the pointee being initialized. The next topic of discussion in the unsafe-code-guidelines WG will be covering this.
It is true that we would be in a more type-safe world if people used some kind of &out instead of &mut when data is still to-be initialized -- but it is not at all clear to me that all that extra complexity would be worth the benefit.