[MIR] Move up propagation


#1

Update: I wrote this before I knew this PR was updated recently: https://github.com/rust-lang/rust/pull/34164. I’ll look at that PR and revise this if needed.


I’m planning to implement one simple instance of the several optimizations discussed in https://github.com/rust-lang/rust/issues/32966.

I’m calling this “move up propagation.” For example, we want to replace:

tmp0 = ...
// statements that do not use tmp0
tmp1 = tmp0

With:

tmp1 = ...
// statements that do not use tmp0

The definition of “statements that do not use tmp0” is a little tricky. For my first step, I’m planning on considering any reference as potentially aliasing tmp0. Possible refinements are considering any particular reference to only alias things that have already been borrowed. Even better refinements are definitely possible in the future.

Another potential issue I discussed with @nikomatsakis was threading. For example:

let x = ...1;
let p = &mut x;
send(p); // some other thread may mutate x through ref
let tmp0 = ...2;
block(); // thread ends
x = tmp0;

A bad optimization would be:

let x = ...1;
let p = &mut x;
send(p); // some other thread may mutate x through ref
x = ..2;
block(); // thread ends

The current plan to avoid this problem is not apply the optimization when the statements between the definition and use include function call(s). Some refinement to this restriction is probably possible in the future (again).

I plan to implement this and measure how beneficial it is (or isn’t) and then think about the refinements based on that data.

I’m interested in any feedback people have on this specific optimization. I’m planning on holding off on the other ideas we discussed here (or generalizing) until later.


#2

Wouldn’t borrowck reject that code?


#3

This is @nikomatsakis’s example that I may have mistranslated. Maybe it should be:

let x = ...1;
let p: *mut T = &mut x;
send(p); // some other thread may mutate x through ref
let tmp0 = ...2;
block(); // thread ends
x = tmp0;

… or I’ll let @nikomatsakis clarify…


#4

I think the most conservative option is to consider any local indefinitely unusable for this optimization after its address was first observed. And even simpler, “if its address was ever observed”.
That way we can trivially ensure soundness, and then move from there as needed to obtain performance.

@pcwalton mentioned some Servo crates (WebRender?) are in need of this optimization, so they might be able to provide benchmarks.


#5

I looked at 34164 this morning. I think for now I will implement “move up propagation” as its own thing and wait to see if there is anything in common with PR 34164 when both are finished.


#6

That’s right, you need to have some raw pointers around or else borrowck will get upset. The point is that it’s really hard to avoid entanglements around the rules of what kind of unsafe code is permitted. But I think we should proceed and try to steer a conservative course, knowing that we’ll have to revisit some of this eventually as we reach consensus there.


#7

I think we got this discussion totally wrong anyway. Here’s my model of the optimization:

The optimization is to transform

S1:
    tmp = SRC
    ...
S2:
    DEST = tmp

to

S1:
    DEST = SRC

I think this is best split into 4 steps. I don’t think we ever want to do the steps separately, but this clarifies the analysis rules.

Step 0 (original)

S1:
    tmp = SRC
    ...
S2:
    DEST = tmp

Step 1 - add additional dead write

S1:
    tmp = SRC
    DEST = tmp
    ...
S2:
    DEST = tmp

This requires that DEST is dead at S1, and that it can be evaluated there (e.g. it is not a dereference of a pointer that is uninitialized there). Nothing else matters.

Step 2: common subexpression introduction

S1:
    tmp = SRC
    DEST = tmp
    ...
S2:
    tmp2 = DEST
    DEST = tmp2

This requires that the newly-added read links with the write of tmp - basically a “memdep” analysis.

This also requires that the address of DEST does not change, which is non-trivial because DEST can be the dereference of a pointer.

Step 3: remove write-of-read

S1:
    tmp = SRC
    DEST = tmp
    ...
S2:
    NOP

After step 2, S2 is obviously a NOP and can be removed.

Step 4: remove tmp

S1:
    DEST = SRC
    ...
S2:
    NOP

This is purely a local operation, but may require some sophistication if e.g. SRC is a function call. I think we are always justified doing it, but we should make sure it is OK.


#8

I feel like I’m missing something – does this breakdown into steps suggest changing the way we are thinking about the problem in some way? It seems like this is one way to look at the optimization, but I am not sure what one gains to look at it this way. In particular, the MIR in between the steps doesn’t seem to be always valid or to have the same meaning, so I don’t think you could do these transformations independently.

For example, after step 1, you have:

S1:
    tmp = SRC
    DEST = tmp
    ...
S2:
    DEST = tmp

but here DEST = tmp is happening twice; if tmp is an affine value, that might not even be legal? And the second DEST = tmp would free the old value, potentially?

I guess if you view this optimization as happening in the “LIR” point of the pipeline, it is still equivalent, since DEST = tmp is just a memcpy (and the before-hand-drops have been made explicit?)?


#9

I think that two key questions that @scottcarr and I have debated a bit over IRC are:

  • the role of postdominators, if any;
  • whether to try and do some sort of “global” analysis for conflicting uses.

I think we’ve more-or-less concluded that postdominators are not helpful, as @arielb1 suggested here. I tried to summarize the reasoning already on Github – but basically the idea is that the postdom check was really overly strict, and we can fold the logic into checking for conflicting uses that may occur.

The question of “global analysis” is a bit more interesting. One way to do the optimization is to find a temporary you’d like to eliminate (TMP = ...; ...; LV = tmp;) and then scan between the TMP = ... and the LV = tmp for potential uses of LV. The problem with this is that you re-do this scan every time, which is sort of an O(n^2) pattern.

One might also imagine that it’s useful to do some kind of “conflicting use” analysis over the whole graph, for all lvalues at once. Ultimately, this seems like it’s purely an empirical question. There are some complications involved with doing things all at once, though: for one thing, the set of lvalues to consider is kind of open-ended – maybe we want to use something like fragments? Also, we’d have to keep those results up to date as we optimize (though that might not be so hard, since the transformation we’re doing is fairly straightforward, and perhaps doesn’t really break the results anyway).

Anyway the key thing is that ultimately it seems like something we can change fairly easily (and I expect we will – particularly if we find that we can do one analysis and drive multiple optimizations from it).


#10

I was just trying to figure out a sufficient set of conditions for the optimization. While the “invalid” MIR breaks the move-checker, it has well-defined semantics and will not be actually generated everywhere.

It really seems that there are different sets of “design rules” for MIR - e.g. non-zeroing drop trans requires no duplicate drops, and borrowck requires no weird things with indexes.