Aggressive loop hoisting (in MIR?)

djzin · April 29, 2017, 10:53am

It would be nice to finally use some strong properties of rust’s type system in order to do some aggressive high-level optimizations. I came across something like this when profiling code - >90% of the time was spent computing a function that only took immutable references as arguments, and so it returned the same result each time. It was at the bottom of a stack of many loops. LLVM does do loop hoisting, but it seems to be very hit-and-miss, especially buried so deep in code that does a bunch of other stuff.

So I had an idea. Maybe the reason LLVM cannot optimize this code is because it doesn’t know enough about rust’s type system to know when values are immutable. At a higher level, we can reason about return values using the fact that references are immutable, values are immutable unless specified etc. - and then hoist the values out of the loop before it ever gets to LLVM.

Here’s an example I came up with where rustc produces inefficient code:

struct Struct {
    data: Vec<i32>,
}

extern {
    fn use_value(input: i32);
}

impl Struct {
    fn compute(&self) -> i32 {
        let mut ret = 0;
        for _ in 0..10 {
            ret += self.data[0] * 0b110110 ^ ret / 7 & self.data[0] + ret;
        }
        ret
    }
}

#[no_mangle]
pub fn compute(data: i32, n: usize) {
    let s = Struct { data: vec![data] };
    for _ in 0..n {
        unsafe {
            use_value(s.compute());
        }
    }
}

Here, the generated assembly computes the value each time, even though from the immutability of s (and the lack of unsafe code in Struct::compute) we can reason that s.compute() will return the same value each time. Note that with a constant n the value is only computed once - I had to add complexity until it wasn’t.

So the transformation would turn the compute function into the following:

#[no_mangle]
pub fn compute(data: i32, n: usize) {
    let s = Struct { data: vec![data] };
    let value = s.compute();
    for _ in 0..n {
        unsafe {
            use_value(value);
        }
    }
}

I will admit, I do not know very much about this kind of stuff, so I guess the key questions are:

Will this be genuinely useful?
What are the precise conditions under which this transformation is valid? I mentioned “lack of unsafe code” - but this is not strictly true - there is unsafe code when accessing an element of a Vec, and other “hidden” unsafe code (like accessing a Mutex) could result in different results from the function each time. (Even a for loop calls mem::swap deep in its bowels, which is implemented with unsafe code)
Can this be generalized to values in general rather than just return values from functions?
Where/how should this optimization be applied? Is MIR the right place for this kind of thing? I notice MIR seems to add a lot of mutability where none is needed - this might be a problem for this kind of reasoning

bjorn3 · April 29, 2017, 11:01am

<Vec as Index>::index() contains unsafe code.

What about possible interior mutability?

djzin · April 29, 2017, 11:05am

<Vec as Index>::index() contains unsafe code.

Yes I mention this - I am not sure what the precise condition should be - but certainly in this case the code is not "unsafe enough" to give different results each time

What about possible interior mutability?

AFAIU a type can only have interior mutability if it contains (transitively) an UnsafeCell. Otherwise any interior mutability would be undefined. Is this a sufficient condition? It is certainly true that Struct here has no possibility of interior mutability.

bjorn3 · April 29, 2017, 11:06am

I think so

hanna-kruppe · April 29, 2017, 11:27am

One thing that hasn't been mentioned so far is that compute must not have any side effects. Many kinds of side effects are ruled out by noting that all the arguments are references to UnsafeCell-free types, there are still at least two kinds still possible with that signature: Modifying global(ly reachable) memory locations, and I/O.

While MIR optimizations are desired (and some are already implemented), they will never be able to compete with LLVM's optimizations. This is potentially an issue because quite a bit of "grunt work" (inlining code, simplifying it, culling dead paths, etc.) may be needed to prove the numerous pre-conditions necessary for hoisting the loop, especially in more complicated examples. It is far better to teach LLVM the necessary ingredients to do the desired transformation, after it has simplified surrounding code to make it easier to see. However, this raises the question of how to communicate the right information to LLVM to enable it to perform the transformation while remaining relatively language agnostic.

One thing that would help a lot with this and many other cases is giving LLVM better aliasing information. I'd guess that LLVM just didn't hoist s.compute() because it couldn't prove that use_value can't write to s.data[0]. This would not surprise me at all, it's a known issue that we don't give LLVM all the aliasing information we could:

It's not 100% clear what Rust's aliasing guarantees are, at least around unsafe code -- the unsafe code guidelines will settle this and other questions, but they're a work in progress.
In cases where we are sure about the aliasing semantics, it's not always clear how to best communicate this to LLVM. Its aliasing attributes and metadata have historically been mostly inspired by C and C++, so they don't always fit Rust very well.
Even when we are sure about something and can articulate it to LLVM, there were some misoptimizations in the past that caused us to be more conservative with emitting such aliasing information. For example, &mut T references are currently not marked noalias!

djzin · April 29, 2017, 1:47pm

Ah yes hmm, maybe something like the following is good enough? -

A function f is "hoistable" (pure?) iff it can be written as a function g with the following properties:

The arguments of g are UnsafeCell-free
The arguments of g are not &mut references
g (transitively) makes no reference to UnsafeCell-containing global consts/statics
g (transitively) makes no reference to mutable statics
g (transitively) makes no ffi calls
- make an exception for certain well-behaved calls such as malloc, free, memcpy etc

(then the bit that requires the unsafe code guidelines...)

g (transitively) accesses no "wild" memory locations
- e.g. unsafe { *(0x8402859 as *mut u8) = 0xff }

Not sure what to do if g can panic, either (panics generally make things more complicated, but the guarantees around what the stack frame looks like when a function panics don't need to be too rigid)

Yes this is quite a problem! That's why I thought maybe handling it at a higher level would be better, since we necessarily lose some information in the translation to lower layers. But you seem to think this is not a good idea. Would doing a bunch of the "grunt work" in MIR be worthwhile, or a complete waste of time, if it opened up the door to more optimizations like this?

eesh did not know this!

Seems like we want LLVM to be more "aware" of rust so that far more assumptions can be made. Might mean more rust-centric optimization passes in LLVM itself. I guess noalias must be uncommon enough in C++ that these bugs weren't found - but that means that also we can't really expect there to have been much work done on optimizations that utilize noalias effectively (maybe I am wrong on that point).

frankmcsherry · April 29, 2017, 2:06pm

The reason, I assume, is that with variable n it can be either zero on non-zero. The optimization that hoists the computation is much harder to motivate (and prove equivalent) if the unoptimized could wouldn't have called s.compute() ever. It may be that LLVM can tell that it is safe to do, but no longer wants to because it could be only an expensive cost.

Edit: if you change the 0 .. n to 0 .. (n/2)+1 it seems to hoist the computation. It is a bit hard for me to read, but it moves the tail call into use_value from the end of compute, and puts it in a different basic block after the call into the compute basic block.

hanna-kruppe · April 29, 2017, 2:12pm

There are a few optimizations that are inherently language specific . But hoisting loop-invariant computations doesn’t strike me as an example of that. In fact, LLVM already has that, under the standard term “loop invariant code motion”. The parts of Rust’s semantics that are useful for this optimizations are mostly restrictions around aliasing. When faced with the choice between communicating these aliasing restrictions to LLVM and implementing LICM for MIR, I cannot understate how strongly I would err towards the former. Here are a few reasons:

There’s the aforementioned network effect with other optimizations. Duplicating all of LLVM (well, all of its optimization passes) is entirely impractical, but it’s necessary to do as well as the former option.
Better aliasing information benefits virtually every other optimization, whereas LICM has limited scope.
On a more abstract level, there is no good reason why LICM should need to be language specific. Essentially, the optimizer just needs to answer the question “is it safe to eliminate a call to this function if the result is not needed”, which is such an essential query that it should be answerable just by analyzing IR, or – if that runs into practical limitations – with a few selected annotations emitted by the front end.

djzin · April 29, 2017, 2:56pm

Thanks for your input rkruppe, you have convinced me that better aliasing info is the way to go on this one.

Does LLVM have some way to say “look, I know that this is a pointer, but the data it points to will never ever change forever and ever” (as is the case with any immutable UnsafeCell-free data structure)?. Is there a change to the generated LLVM IR in this case that would generate better code?

hanna-kruppe · April 29, 2017, 3:19pm

There are numerous problems that crop up when you try to make that precise, both around semantic questions such as “what transformations are allowed, actually? when does the promise stop holding?” and around actually representing this in LLVM IR in a way that’s robust under transformations and easily accessible to alias analysis. I believe work on this has mostly stalled in favor of fleshing out the unsafe code guidelines (which is itself going slowly, unfortunately) because, as mentioned before, the precise semantics of Rust’s aliasing restrictions will strongly influence the answers to these questions. You might want to talk to some people who are involved hands-on in both efforts, e.g., @arielb1.

nathansizemore · May 2, 2017, 1:06am

This video will give you some high level insight on the optimizer in LLVM, how inlining and friends are decided, and why front ends do no optimizations besides dead code removal. @frankmcsherry is correct in noting that it cannot determine whether n is non-zero. Giving it any literal, or non-zero provable will cause it to unroll the entire loop, or only run the computation once and loop the call to use_value.

Clang also behaves the same way.

gnzlbg · May 4, 2017, 12:03pm

Have you tried making that function take &mut instead of & ? My point is, &mut guarantees that there is no aliasing, while & allows aliasing. While LLVM might not understand much about reading/writing, it does understand about aliasing. Trying this might reveal a bit more about what is going on with the optimizations.

hanna-kruppe · May 4, 2017, 12:56pm

It allows multiple pointers, but it doesn't allow any of those pointers to write to the memory (modulo UnsafeCell), which empirically seems good enough to mark those pointers as noalias in LLVM.

And as mentioned before, &mut references currently aren't noalias:

gnzlbg · May 4, 2017, 1:16pm

Thanks for the clarification

system · March 25, 2019, 8:28am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Taming Undefined Behavior in LLVM Unsafe Code Guidelines	6	2239	March 25, 2019
Expose LLVM integer intrinsics for arbitrarily-large integers compiler	16	1810	August 31, 2023
An immutable loop structure `loop let` language design	13	923	October 23, 2024
Using a custom optimisation pass pipeline compiler	7	3838	March 25, 2019
Stacked Borrows Implemented Unsafe Code Guidelines	16	3595	March 25, 2019

Aggressive loop hoisting (in MIR?)

Related topics