Rough thoughts on the impl of region inference, MIR, etc


#1

The current region inference scheme has some known limitations. Many of these were uncovered and discussed as part of the work on RFC 1214. There have also been a number of recent regressions that are tangentially, at least, related to the way region inference currently works. At its worst, the shortcomings in region inference can lead to nondeterministic errors, where doing small things like adding redundant parentheses can cause or solve type inference failures. Yuck. This post covers a smattering of thoughts on the topic.

Nondeterminism and lifetime parameters

Problem

The current region system allows a region to be inferred to one of:

  • a scope or expression
  • a named lifetime parameter ('a)
  • 'static
  • a special empty region that has no corresponding syntax

To a first approximation, it works by starting region variables small and growing them in response to constraints. It assumes that there is a “least upper bound” operation that can be applied to any two regions to yield something sensible. When it was written, this was basically true, but in the meantime lifetime parameters have gotten a lot more complex. Now we have things like:

    where 'b: 'a,
          'b: 'c,
          'd: 'a,
          'd: 'c,

now what is the LUB('a, 'c)? Basically, there is no single answer: both 'b and 'd are greater than both 'a and 'c, but neither 'b: 'd nor 'd: 'b holds. Right now, we pick something conservative ('static, in fact), but this is not great, because the order in which we process constraints can affect the final solution.

For example, imagine three constraints:

  1. 'a <= $0
  2. 'c <= $0
  3. 'b <= $0

If we process these in order, we’ll first grow $0 to 'a and then to 'static, where it remains. But if we process in the order 1, 3, 2, we get a different result. We first grow to 'a and then to 'b, where $0 remains. Double yuck!

Proposed solution

Allow region variables to be inferred to the intersection of a list of named lifetime parameters. So the result of LUB('a, 'c), in the previous example, would be {'b & 'd}, instead of 'static. If you work this through, you’ll see that the order of processing constraints no longer matters.

The LUB operation on two sets of named lifetime parameters X and Y can be defined roughly by:

concat([forall i, j. UBS(X[i], Y[j]))

where UBS yields the set of disjoint upper-bounds (so, e.g., UBS('a, 'c) = ['b, 'd]). (This UBS operation is already implemented, see the method minimal_upper_bounds on transitive_relation.)

Adopt a pure dataflow solution; impoverished constraint language

Problem

The regionck constraint model consists of a set of constraints of the form R1 <= R2, all of which must hold:

C = R <= R
  | AND(C, C)

But, as discussed in RFC 1214, if you want to handle projections, you really need the ability to have “OR” constraints. Similar problems arise from where-clauses and a few other sources.

When solving this constraint set, the region inference algorithm also classifies nodes as “expanding” or “contracting” based upon what kinds of incoming edges they have. This is sort of a holdover from the days before there was an empty region. It is also problematic because it means we can’t incrementally solve region constraints at all: we have to wait until all constraints are known so that we can do these classifications.

This contraction mode means that if you have a variable with only upper bounds, like a single constraint $0 <= X, we will infer $0 to X. Since this represents a maximal solution, it can be useful for uncovering missing lower bounds, since often a solution of X is too large (for example, the missing constraint #29006 was uncovered by such an error). nut the contraction mode also causes some difficulties (e.g., changing this behavior happens to fix #29048).

Solution

Extend constraint language to include OR:

C = R <= R
  | AND(C, C)
  | OR(C, C)

Of course, this has a big impact on how the code works. Basically, the idea would be that we have to exhaustively explore the options. Luckily, we mostly generate a pretty flat tree, so we ought to be able to do a strategy like:

  1. At any given level, gather up all the R <= R constraints that are reachable without going through an OR node.
  2. Solve those using data-flow, to find minimal regions. Call this interim solution X.
  3. Explore the required OR constraints, starting from X. For each constraint, at least one of the sides must hold. Note that there are a lot of possible combinations here. We certainly want to at minimum continue distingushing “verifies”, where inference is not affected.

Note that at any given level I have implicitly adopted a pure dataflow model, where all variables start as empty and strictly grow. (We can also grow the root level incrementally, which may allow us to make some progress on problems like #21974, but there are also more complications in that general area.)

We will lose the “bug-finding” behavior of the current algorithm. This is, however, not a reliable way to uncovering bugs, so I’m not sure that’s a great loss. Instead, I think we should just focus on doing type-checking on the MIR as a kind of “double check” for inference. In fact, I’d like to move all region checking to the MIR, as I discuss in the next section.

More principled treatment of regions in closures, non-lexical lifetimes, etc

This is not a real solution yet, more of a general direction. The current typeck code feels overly complex, and it has a lot of trouble squaring the type inference we need around closures with the region hierarchy that closures ought to have. In particular, closures are distinct functions, and so should have a distinct region hierarchy, but we have to type check them altogether to get better inference for local variables and the like.

In my ideal world, we’d address this by having typeck not handle regions at all, instead substituting some kind of dummy region (let’s call it 'dummy) everywhere (though we will have to track where region bindings occur as well). We’d then build up the MIR, which will have all regions just using 'dummy initially.

We’d then re-run a full type-check on the MIR. This will be relatively easy because the MIR is very simple. In this later pass, we will also replace regions with inference variables, and construct a proper graph. This should let us treat closures as completely distinct fns, like we want to. It also makes it easy to move to non-lexical lifetimes. (This doesn’t really solve #21974 yet, since we’ll still have to be doing trait selection.)

All done.

Thanks for listening.


#2

I am not entirely sure what is the current region-inference vs. closures situation, but the lifetimes of a closure signature (both early and late-bound) interact with the parent function. Having concrete regions on the MIR alone is probably required for SEME.


#3

Why do we need to do this? Can we not use the type_ck information to start region inference based on the MIR?


#4

The idea is that typeck wouldn’t be recording region subtyping etc. When I say “run a full typeck” I don’t mean that we do any inference – we have the types given to us, we’d just be replacing the regions in those types with variables and then replaying the appropriate trait matching, subtyping relationships, etc. (This also serves to double check the inferred types.)