Two Kinds of Invariants: Safety and Validity

mbrubeck · August 24, 2018, 12:05am

Excellent! This is the first solid answer I’ve seen to a question I’ve been wondering about: Is it okay for this code to mutate the memory behind an &mut str such that it is temporarily not UTF-8?

github.com

mbrubeck/unicode-reverse/blob/cf392aafd82f333cc5ed1f805fa3844f994d5543/src/lib.rs#L105-L127


pub fn reverse_grapheme_clusters_in_place(s: &mut str) {
unsafe {
    let v = s.as_bytes_mut();


    // Part 1: Reverse the bytes within each grapheme cluster.
    // This does not preserve UTF-8 validity.
    {
        // Invariant: `tail` points to data we have not modified yet, so it is always valid UTF-8.
        let mut tail = &mut v[..];
        while let Some(len) = str::from_utf8_unchecked(tail).graphemes(true).next().map(str::len) {
            let (grapheme, new_tail) = {tail}.split_at_mut(len);
            grapheme.reverse();
            tail = new_tail;
        }
    }


    // Part 2: Reverse all bytes. This restores multi-byte sequences to their original order.
    v.reverse();


    // The string is now valid UTF-8 again.

This file has been truncated. show original

Under this model, I think this code can be correct if a non-UTF-8 str is valid but not safe.

arielb1 · August 24, 2018, 9:32am

The main reason of having a split between safety and validity is that whether an execution of a program is defined is only controlled by validity.

We very much want the question of "is this execution well-defined" to be answerable (and decidable),

However, it is possible (and in fact, had been proposed) to have a "tootsie pop" model with 2 separate notions of validity - one for safe code and one for unsafe code, but that is not what this post is talkin about.

RalfJung · August 24, 2018, 10:05am

To be more precise, an &mut str pointing to non-UTF-8 is valid but not safe.

I think str itself would have being UTF-8 as validity requirement. Being unsized, that's not extremely relevant, but would be relevant for unsized locals, for example.

And tail is always safe, right? So graphemes is only ever called on safe data? Yeah then I think this is fine under my proposal.

nikomatsakis · August 24, 2018, 12:59pm

Continuing a discussion from Zulip, I think we may want to phrase both of these invariants as a kind of “range”. In other words, we might want to specify two things:

Things that your unsafe code must ensure is true
Things that your unsafe code can rely on being true
- this should be a subset of the first one

The idea is that there are some validity invariants that we require now but which we might not always want to require. For example, right now we require that &T is aligned. Unsafe code might therefore rely on this by taking some random &T, casting it to a usize, and packing things in the low bits. If we decided to drop the requirement that &T is aligned – we could adjust our own codegen, but we can’t change that unsafe code.

Put another way, if all unsafe code implicitly has “validity requirements are met” as a precondition, we are not free to make more things valid without breaking that unsafe code.

Now, I’m not sure if there are any things we might consider changing in the future, but I think this is something we have to keep in mind.

derekdreery · August 28, 2018, 1:38pm

This article is both rigorous and accessible at the same time - a great achievement!!

Notice that the “boundary” is not necessarily where the unsafe block ends. The boundary occurs where the unsafe code provides a public API that safe code is intended to use – that might be at the module boundary, or it might even be at the crate level.

(and more discussion in the comments)

I think that even if the model doesn't insist on the 'correct' placement of unsafe, it's still worth encouraging authors to do so. If you had a function like

fn change_vec_len(v: &mut Vec) {
  unsafe {
    v.set_len(v.len() + 10);
  }
}

that you know is safe because of the module/crate visibility boundary, I would still consider this bad practice - the function should be marked unsafe.

scottjmaddox · September 5, 2018, 5:11pm

RalfJung, after reading your arguments on the MaybeUninit RFC and in your blog post “Two Kinds of Invariants”, I think you’re right that &mut T should be UB for uninitialized T. But, as you mentioned, there is already a lot of code that relies on this behaving as naively expected. All of this code could be converted to use *mut T, but the only way I’m aware of to get struct/enum member offsets (in general) is to create two &mut T's to uninitialized T, cast them to pointers, and then subtract. So at the very least, an intrinsic needs to be provided to get struct/enum member offsets before exploiting &mut T being UB for uninitialized T.

Even better would be to provide type or compiler-level support for partial initialization. I wrote a blog post recently about one approach, but to make it truly usable (i.e. efficient), a high-level optimization needs to be added to rustc. It would be amazing if someone intimately familiar with rustc and compiler optimizations could comment on the feasibility/difficulty of implementing such an optimization.

In the Reddit discussion for my blog post, we came up with an alternative approach using an &uninit T type that is linear (exactly single use) and write-only. This approach would be more ergonomic and more deeply engrained into the compiler. I plan to write a blog post exploring this latter idea a bit more, but it seems like it would be much more involved to implement, and so should likely be de-prioritized in favor of a quicker solution; but again, I would love to get input from someone intimately familiar with rustc.

RalfJung · September 17, 2018, 12:36pm

That's funny, because I never argued for this. I brought it up as a possibility, but my own thinking is actually that &mut T is only UB for NULL/unaligned references, and (through the aliasing model) for dangling pointers, but makes no statement about the pointee being initialized. The next topic of discussion in the unsafe-code-guidelines WG will be covering this.

It is true that we would be in a more type-safe world if people used some kind of &out instead of &mut when data is still to-be initialized -- but it is not at all clear to me that all that extra complexity would be worth the benefit.

system · March 25, 2019, 8:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Types as Contracts Unsafe Code Guidelines	118	11684	March 25, 2019
Idea for UnsafeFn and calling convention traits	11	1176	March 25, 2019
Ambiguity of "unsafe" causes confusion in understanding	5	1406	March 25, 2019
Allow safety checks inside unsafe functions Unsafe Code Guidelines	5	1214	December 22, 2024
Rust's stability story should apply to bool in FFI language design	13	4602	March 25, 2019

Two Kinds of Invariants: Safety and Validity

Related topics