Slice patterns in the land of MIR

nikomatsakis · August 18, 2015, 3:27pm

As I mentioned in another thread, I’ve been working on getting the MIR to handle the full compiler and test suite. Right now it’s stuck on rustdoc due to the use of slice patterns like [x, ..y]. These patterns raise some interesting questions about how to integrate indices into the borrow checker in a MIR-like setting. I thought I’d jot down quickly the conflict and my proposed solution, along with some thoughts on future directions we might go.

The Problem

The problem is that the borrow checker does not, by and large, reason about indices. But slice patterns only make sense if you can reason about indices. For example, a pattern like [ref mut x, ..y, ref mut z], when matching against foo, is roughly equivalent to:

let len = foo.len();
if len >= 2 {
    x = &mut foo[0];
    y = &foo[1..len-1];
    z = &mut foo[len-1];
    ...
} else {
    // other match arms
}

To know that patterns like this are safe, you must consider the length len and reason that foo[0] is disjoint from foo[len-1], and that both of those are disjoint from foo[1..len-1]. Adding this sort of logic to the borrow checker in general is a slippery scope that we have tried to avoid. Of course, such detailed reasoning shouldn’t be necessary – in the case of slice patterns, we know by construction that the indices are disjoint.

Trusted parts of MIR

Slice patterns thus join a small set of other things (bounds checks, matches) that are obviously safe in the AST but not obviously safe when desugared to the MIR level. In those other cases, I’ve taken the simplifying approach of generating “unsafe” MIR. For example, MIR generally assumes that all array indices are in bounds, and trusts the front-end to produce bounds checking code. I’d like to do the same for slice patterns, as I describe below. In all of these cases – slice patterns included – it should be possible to write an extended static check for MIR that checks that the unsafe constructors are in fact used safely; this check would require a data-flow analysis, or else a dependent type system, and hence would be somewhat more complex than a typical type checker. I also talk a bit about that below.

What do we do today?

In fact the same problem occurs today. The ExprUseVisitor tries to translate matches into individual borrows and thus it must figure out how to translate a match pattern. To handle this, the mem_categorization code marks borrows that arise from a slice pattern in a special way, which is recognized by the borrow checker and treated differently. In addition, the borrow checker today is ok with the idea of a single node in the AST originating several borrows, and it assumes that they are all disjoint. But in MIR this single pattern node has been busted up into smaller pieces, so we need to recover that assumption somehow.

How to fix it most easily?

Currently, the MIR contains “index lvalues” LV1[LV2], where the idea was that the borrow checker will make no attempt to reason about what the index LV2 might be (i.e., statically, we might simplify x[y] to just x[_]). The idea is to add a second form of path that is generated by slice patterns. It would look like LV1[LV2; ALIAS] where ALIAS is a pair of integers (pattern, item). The idea is that two paths LV[_; (P, I1)] where LV[_; (P, I2)] where I1 != I2 are statically known to be disjoint (this is a trusted assertion). We also need to add a new rvalue for slicing LV[LV1..LV2; ALIAS] (this is only used for slice patterns; slice expressions are just overloaded method calls).

Let me show you a snippet of code to give you a rough idea of how this would work:

let x = &foo[0]; // => generates &foo[tmp1]
let y = &foo[2]; // => generates &foo[tmp2]
match foo {
    [ref a, ref b, ..c, ref d, ref mut e] => {
        // => generates:
        //     len = foo.len()
        //     a = &foo[0; (0, 0)]
        //     b = &foo[1; (0, 1)]
        //     c = &foo[2..len-2; (0, 2)]
        //     d = &foo[len-2; (0, 3)]
        //     e = &foo[len-1; (0, 4)]
    }
}

Now when the code is expanded, the various paths for a, b, etc will be considered disjoint with one another.

Note that while the uses of these alias-hinted paths is confined to the evaluation of the match pattern (after that point, bindings have been made), those uses can still be spread across the CFG, since the match evaluation can span basic blocks. (However, they always come in one DAG, there is never an internal loop.) For example consider this slice pattern:

match foo {
    [(ref a, 22), b] => { /* ARM 0 */ }
    _ => { /* ARM 1 */ }
}

This will desugar to something like:

BB0:
    tmp0 = foo[0; (0,0)].1 == 22; 
    if tmp0 goto BB1 else BB2
    
BB1:
    a = &foo[0; (0,0)].0;
    b = foo[1; (0,1)];
    /* ARM 0 */
    
BB2:
    /* ARM 1 */

Here you can see that the same path foo[0; (0,0)] appears twice, once in BB0 and once in BB1. Those uses will naturally be considered overlapping (but the fact that we then access the 1 and 0 fields of the tuple renders them again disjoint).

Loops and correctness

I was somewhat concerned about what would happen in the face of borrows that persist over loops, but I think I have convinced myself that everything works out fine. Here is an example that I will use to explain:

fn f(foo: [i32; 2]) {
    let mut p;
    loop {
        match foo {
            [ref mut a, ref mut b] => {
                p = a;
                ...
            }
        }
    }
}

This code is (rightly) rejected today:

<anon>:7:14: 7:23 error: cannot borrow `foo[..]` as mutable more than once at a time
<anon>:7             [ref mut a, ref mut b] => {
                      ^~~~~~~~~
<anon>:7:14: 7:23 note: previous borrow of `foo[..]` occurs here; the mutable borrow prevents subsequent moves, borrows, or modification of `foo[..]` until the borrow ends
<anon>:7             [ref mut a, ref mut b] => {
                      ^~~~~~~~~

The problem is that p from the previous iteration will overlap with the borrow a. Under the newer model, this function translates to MIR something like:

BB0:
  a = &mut foo[0; (0,0)]; // (a)
  b = &mut foo[1; (0,1)]; // (b)
  p = a;
  ...
  goto BB0;

From the POV of borrowck, line (a) creates a loan. Due to the assignment to p, this loan will still be in scope when we branch back around, which means we will still get an error, as expected.

Nonetheless, it seems creepy that a borrow from a previous iteration can persist until the next one – after all, the disjointedness reasoning applies only to a single match statement. If that same match statement executes twice, is there any guarantee that the paths are still disjoint/overlapping in the same way?

It turns out that I think there is such a guarantee, I think. Not, of course, in general, but specifically in the case where a loan flows through from a previous iteration. The reason is basically that the only case there would be a problem is if the slice has changed to a new pointer or the length of the slice has changed. Both of those would require mutation to the variable storing the slice, and that can’t happen if there is an outstanding loan.

Type checking

If we wanted to verify these alias sets, it is certainly doable. The basic idea is that the compiler must reason about the length of the vector and ensure that, whenever we create a value with a given alias set, it is disjoint from later values in that same alias set. However, this is somewhat non-trivial, particularly given loops; we would presumably want to infer such a region and insist that it has a dominator and post-dominator that are confined to a single loop nest.

Anyway, this is all doable, but it leads one to wonder if we can instead modify MIR to be more of a dependent type system that can express the invariants that we actually want. There is some progress and clues here (for example, the work that Ralf is doing on “lambda Rust”), but it’s something I’d like to keep thinking and maybe evolve towards. I imagine that immutable, SSA-like values – as opposed to the mutable stack slots I proposed – might be a useful component. They might be a useful addition, or of course we could go whole hog and migrate to an LLVM-like notation of allocas and values. This has its appeal, but it might complicate the borrowck (and mapping errors to a user’s model) if we traverse too far from the paths present in the language today, so I think we’ll have to feel our way.

arielb1 · August 18, 2015, 5:42pm

Another option would be to have a slice sub-borrow contain an assertion of the slice’s minimum length (it can’t be changed except by mutating the slice itself of course), and a position which is either pos, real_len-pos, or [start, real_len-end), and check for overlap using the largest minimum length. This can even support patterns of the form:

    fn foo<'a>(s: &'a mut [u8]) {
        let r : Option<&'a mut u8> = None;
        if let [ref mut t, _, ..] = s {
            r = Some(s); // the borrow of s[0; min_len=2]
        }
        if let [.., t] = s {
            use(r, t); // does not conflict with the borrow of s[-1; min_len=1]
        }
    }

nikomatsakis · August 18, 2015, 6:42pm

Yes, right, this is what I was trying to allude to in the end. The problem is that one must contend with "real_len" potentially being mutated (or else the checking isn't particularly safe). This is why I said we might consider using SSA-style temporaries, which can never be reassigned. That said, since it's ok to have "side constraints" on the MIR that we know are satisfied by construction, we could also just assume that the length in question is never mutated (which would obviously also be easier to check if we ever chose to do so). That is appealing.

nikomatsakis · August 18, 2015, 7:01pm

Ah, I see, we don’t even have to have the length as something that’s in the MIR itself, we could just make the index forms be:

LV[LV]
LV[C; N]
LV[-C; N]

and add the slice rvalue:

LV[C..-C; N]

where C,N are just usizes derived from the pattern. N is the minimum length of slice, -C means “C back from the end.”

I considered this too, and I’m not sure why I rejected it. It seems good now!

UPDATE: Tweaked language to, you know, make sense.

arielb1 · August 18, 2015, 7:31pm

The problem is that one must contend with "real_len" potentially being mutated (or else the checking isn't particularly safe).

How is this different from ordinary enum downcasts i.e. exactly what the borrow-checker was designed for? Is the problem just that type-checkers don't like integers?

Anyway, how is SSA significant here? The matched-on value could be any number of mutable references, and the borrow-checker would handle that just as well.

nikomatsakis · August 18, 2015, 7:35pm

What I was concerned about was, I realize now, not quite what you were proposing. I think you meant for "real len" to be a bit of the syntax, so to speak, and not a separate lvalue. I was originally thinking we would make "real len" an lvalue, in which case if I conclude that REAL_LEN-1 and REAL_LEN-2 are disjoint, it is only valid under the assumption that the variable REAL_LEN has not been mutated (I agree the borrow checker's basic rules ought to be ensuring that the length remains constant.)

bluss · August 18, 2015, 7:38pm

The current slice_pattern implementation has many memory safety bugs open, it is not sound as it is.

arielb1 · August 18, 2015, 7:40pm

that’s mostly trans::_match being buggy (and maybe also middle::check_match - I don’t know it well enough) - we are talking about borrowck.

jnicklas · August 19, 2015, 9:32am

This might be a stupid question, but does this still hold when the variable storing the cell has interior mutability, such as if it were a Cell<[u32;2]> or something?

nikomatsakis · August 19, 2015, 9:35am

Yes. In the case of Cell, you could never obtain a reference into that array, you would have to copy it out first and take a reference into the copy. In the case of RefCell, you could obtain a reference, but you are dynamically prevented from mutating it.

oli-obk · August 20, 2015, 10:18am

wouldn’t it basically be sufficient to offer a split_at_mut-equivalent MIR-RValue and letting the normal borrow-checker handle the rest?

on a side-note: the loop example given is totally orthogonal imo, this causes the same error:

fn f(mut foo: i32) {
    let mut p;
    loop {
        p = &mut foo;
    }
}

nikomatsakis · August 20, 2015, 3:52pm

Hmm, so, I considered this, and for some reason I thought I found that it didn't work, but I'm not sure now what that was. It feels like it could indeed work to me, though we have to handle the case where this no ..x somewhat differently due to the potential for moves. But that's ok. That is, we could have an rvalue SPLICE(P,S) that yields up a tuple of slices. I do like that, I'll ponder it.

arielb1 · August 20, 2015, 9:10pm

That wouldn’t allow the different-parts-inside-a-loop example I gave. Should be good enough for most purposes through.

RalfJung · August 21, 2015, 9:17am

I was about to ask exactly that, what about having primitive operations on slices to split off a constant amount of elements from the beginning or the end, plus something that turns a slice of length 1 into a normal borrow?

I wasn't able to follow your discussion about real_len though. arielb1, which example are you referring to? The only example of you that I can find here doesn't have a loop...

Anyway, this is all doable, but it leads one to wonder if we can instead modify MIR to be more of a dependent type system that can express the invariants that we actually want. There is some progress and clues here (for example, the work that Ralf is doing on "lambda Rust"), but it's something I'd like to keep thinking and maybe evolve towards.

My current plan for handling these checks is not full-blown dependent types (I'll try to avoid them if at all possible), but something more special-purpose: A particular typing rule that says that after you did an "if idx < slice.len", that you know that accessing slice at idx is safe. So, the type of either idx or slice changes to encode this relationship, or maybe there's a new type witness for that. (idx is Copy, so we can easily have lots of type witnesses for it.) Since variables never change their value in my formalism, I think that should be good enough. I already have downcasting in matches solved, that's part of the typing rule of match: It tells you that the data part of the enum has to type given by the arm you are in. I don't yet have a full plan for conditional drops, though. Somehow I need to tie a boolean variable to whether a stack slot is currently initialized. Or maybe I'll have some kind of single-purpose "fat pointer" for that, but then it's unclear to me who's going to modify the flag bit... really, "fill with 0 on drop" is probably the easiest to do, but for obvious reasons not preferable

nikomatsakis · August 21, 2015, 3:31pm

Yes, this sounds roughly like what I was thinking. I certainly did not intend dependent types in their full generality, but just add some way to refer to indices. (But I also don't think it's particularly useful/imp't in our setting, as compared to just expressing what we want as a static analysis.)

nikomatsakis · August 31, 2015, 8:43pm

Thinking on this more, I think that moves are exactly the reason I had rejected this. In particular, we still need to specialize the idea of vec[0], vec[1] being disjoint (that is, a constant known index), and cover the idea that this overlaps with other indices where we don't have a constant. If we have to include that in the borrowck, it's not a big extension to cover paths like @arielb1 was proposing.

llogiq · September 1, 2015, 9:03am

We could even extend this idea to prove that two indices are disjoint even if they aren’t constant. For example, vec[x] and vec[x + 1] are always disjoint. If we can catch this (equal expession, one time with a non-zero value added), the extension could be much more useful.

Thiez · September 1, 2015, 9:19pm

Surely it is a mere coincidence that different indices do not overlap when using Vec. Implementors of Index need not obey this restriction, and it would be a shame if some magic got added only to Vec and/or [T]. I can imagine an implementation of Index that just ignores the key and always returns the same element. Or an implementation that uses a two-tuple for keys, but only uses the first element for indexing (e.g. (1, "cheese") and (1, "banana") would return the same value.). These seem like nonsense examples, but perhaps a more realistic example would be a circular buffer that implements Index and performs a modulo operation so there are no out of bounds indices. &mut c_buffer[n] and &mut c_buffer[n + c_buffer.length()] would alias in that case.

I agree it would be nice to have the borrow checker understand that vec[0] and vec[1] are disjoint and mutable references to these values can exist at the same time, but perhaps this knowledge should only be used if some (unsafe?) trait is implemented.

There is of course also the question of when two indices are equal. For the integers this is simple, but since any type can be used for indexing this is trickier.

nikomatsakis · September 1, 2015, 10:24pm

Quick note: when I wrote vec, I meant a variable of “array” type. The limitations around Index are one reason I’d prefer not to go too far in this direction.

Niko

nikomatsakis · March 25, 2019, 8:25am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Nascent Idea: allow `.0`, `.1`, etc on arrays language design	32	2431	May 4, 2019
Mini idea: slice length patterns language design	26	1313	March 13, 2024
Indexing syntax borrow checking should work as with functions language design	8	953	July 8, 2021
Tree Borrows explained Unsafe Code Guidelines	31	4998	December 22, 2024
[blog post] Nested method calls via two-phase borrowing language design	36	8209	March 25, 2019