Possible alternative compiler backend: Cretonne

Except LLVM can’t always fold loads, not easily though. Bytes + relocations ensure O(lg R) load folding if it is safe to do so, where R is the number of relocations (because you have to do a binary search or something similar). For undef there’s a mask, I was thinking bit-level for LLVM, while miri uses byte-level because it denies using those values anyway.

1 Like

By then, multi-core CPUs will have been mainstream for 20 years, so let's assume a single-threaded compiler is off the table.

Inlining is a global optimization that interacts with (and modifies) the whole call graph. This means that its design depends a lot on the architecture you choose for a compiler that supports concurrent and incremental compilation efficiently. We probably want something like:

  • Multiple threads working on the call graph simultaneously. (And safely, and deterministically).
  • Don't require everything to be in memory at the same time. Some programs are large.
  • Parts of the call graph can be reused from the previous compilation because nothing they depend on has changed.

For compiling C/C++ with LLVM, ThinLTO is a really good bid. I think Rust can do better because it doesn't have to deal with C++'s legacy, but it's a high bar.

So I think inlining is something that evolves along with the compiler / optimizer architecture. We can start by build building a single-threaded, non-incremental inliner pretty easily.

Here @pcwalton and other Servo devs might have some insight from building a parallelized layout engine - both that and inlining operate on trees (call graph -> DAG subsets -> tree of inlining), at the end of the day.

My intuition is that the worker pool would handle the “global callgraph” and partition it such that bottom-up local optimizations + inlining can be done in parallel.
That is, once a leaf function has been optimized, it can be “frozen” and shared between callees which may inline it from different threads.

Cretonne’s index-based ownership model (or lack thereof) helps with keeping things efficient inside a function, allowing even the “heavy-duty” Arc<RwLock<Function>> to work for such a bottom-up scheme without per-instruction costs or unsafety.

LLVM would have to sync types, constant expressions and metadata nodes (which are all interned), but Cretonne can just side-step all of that and joggle with self-contained functions across threads.

2 Likes

This project is really cool, but the comments in this threads seems, well, a little too optimistic.

Some assume that in the far future Cretonne could achieve performance similar to LLVM, but that simply doesn’t seem to be a possibility, and I’m not talking about the massive development and research that have gone into LLVM, but rather an architectural limitation: the lack of undefined behavior.

Undefined behavior is the cornerstone in modern optimizers. Without that, you never come near to the speed of GCC, LLVM, VeLLVM, etc.

Many optimizations are impossible without UB. In fact, the vast majority of optimization passes rely on UB in one way or another. If everything is well-defined, there’s a hell lot of things you cannot do.

I don’t see any obvious way to get around this without either adding heavy runtime checks (e.g. derefing dangling pointers gives zero or crashes the program) or undefined behavior (something happen).

I’d say making a compromise between the two extremes would make sense. For one thing, you could make derefing null pointers defined and so on. Another less extreme idea is to remove all undefined behavior except for the Undefined Value-based (Undef, as the value is known in LLVM) UB. This allows for smoother reasoning, but still carries many of the same disadvantages of normal UB, unfortunately.

That being said, I don’t think it’s a bad idea, because there are cases where accidental UB is a no-go even on the cost of performance. The web is an example. I just think that it cannot (and should not) replace the default optimizer in --release as some call for in this thread (and the reddit thread).

7 Likes

This one stands out in particular.

The optimizer wouldn't be able to reason about values behind pointers at all then, right?

If it isn’t clear yet, runtime performance would not be the primary goal, compile-time performance and correctness are. There are a lot of important use cases for code generation with these qualities, even if they can’t match LLVM’s runtime performance.

Cretonne’s design is philosophically aligned with Rust’s: the same way Rust starts from a disposition of zero-cost abstractions, and is working toward making that more ergonomic; Cretonne starts from a disposition of compiling fast and not having undefined behavior, and can push the boundaries of what is possible from there. I think it is not obvious at all what the limits of such an approach are in a production compiler, and we’re not likely to know without doing it. As @stoklund already mentioned, there are already notions to add powerful aliasing annotations - which are unique to Rust’s needs and could give it unique advantages - that would introduce undefined behavior in a controlled way, to allow specific important optimizations. This is a smart approach - start from a completely safe model and open escape hatches where absolutely necessary, similar to unsafe blocks in safe Rust.

15 Likes

I still doubt it would (even if it heavily exploits Rust-specific constructs) match performance of LLVM, and I don’t see it becoming the default backend for --release mode.

That being said, it could easily outperform LLVM compile-time-wise, and as such I think it would fit as a good default for non-release compilation.

4 Likes

As @stoklund said:

  • Dereferencing an invalid pointer will cause the resulting machine code program to do the same. It doesn't cause Cretonne to assume that is can't happen.

So Cretonne doesn't emit runtime checks for pointer validity – but it doesn't assume that dereferencing an invalid pointer cannot happen either.

LLVM mostly can’t either. LLVM IR has no concept of “strict aliasing”: unless the frontend supplies annotations any pointer can alias any value anywhere (unless LLVM can locally prove it can’t).

  1. LLVM puts a lot of effort into finding and proving noalias pointers (and is actually pretty good at it at this point).

  2. It is UB to break the assumptions of noalias, so in the end it relies on UB, which was my point.

I see this as targeting this roadmap issue: "Rust should have a pleasant edit-compile-debug cycle", but admittedly in a long-term way. Therefore, I wouldn't want to divert resources away from incremental compilation and other ongoing compiler improvements -- but if we can do some early design work, and hopefully get community involvement in pushing the project forward, this seems like a promising direction for us. I am particularly excited about the idea of having optimized code that is also debuggable -- most of my attempts to use debuggers on optimized Rust have been exercises in frustration.

I'd be very interested in this too. LLVM supports building with debuginfo+optimizations, but we have often run into assertions because optimization passes rely on undocumented assumptions or were just buggy wrt debuginfo. Having more control over the whole process would maybe also allows to provide something like an -Og optimization level that tries to preserve good quality debuginfo.

My understanding is that we would need to implement MIR inliner to use Cretonne, because Cretonne doesn’t inline but Rust requires inlining, if only for #[inline(always)]. Is my understanding correct?

As I described in the comparison with LLVM linked above, Cretonne’s intermediate representation attempts to span the same abstraction levels that LLVM spans with 4 different representations. To achieve this, Cretonne uses a basic data structure consisting of an SSA data flow graph with a simple type system, and a layout of instructions in extended basic blocks with few constraints. This fabric is then populated with instructions from a quite large instruction set which spans many abstraction levels. At a given abstraction level, only a subset of the instructions are used, but many instructions (like basic arithmetic) are used at multiple abstraction levels.

Sharing a fabric and common instructions minimizes the memory and allocation bandwidth needed to switch abstraction levels since most instructions can simply stay the same. Incidentally, the concept of an IR fabric is also the origin of the Cretonne name.

WebAssembly subset

WebAssembly instructions can be mapped directly to a subset of the Cretonne instructions. This subset is sandbox-safe and has no undefined behavior at all. A WebAssembly load instruction is mapped to a Cretonne heap_load instruction which includes a bounds check:

v10 = heap_load.i16 v1, 4

This instruction will either return the i16 value loaded from v1+4 in the wasm heap, or it will produce a trap if the address is out of bounds. The trap is not optional or undefined like it is in C.

Cretonne internally transforms this instruction to a lower abstraction level:

v2 = heap_addr v1, 6
v10 = load.i16 v2, 4

Here, heap_addr performs a bounds check and translates the wasm heap offset in v1 to a native address in v2. The load instruction corresponds to a native load instruction, and is not part of the sandbox-safe subset. (Neither is heap_addr since it leaks the location of the wasm heap). The heap_addr instruction is then further expanded to code that checks that the range [v1;v1+6[ are valid heap offsets and so on.

Unsafe optimizations

In the expanded code, should we delete the load.i16 instruction if the loaded value is unused? This is the kind of optimization that can lead to surprises, as Chris Lattner describes. Two things to consider:

  1. It would be perfectly safe to delete the load.i16 instruction because the heap_addr bounds check is right there, and it is guaranteed to never return an address that would trap.
  2. Cretonne is a low-level code generator, so it expects that obviously redundant code has already been optimized out by higher-level optimizers. Therefore, the benefits of deleting loads like this are very small.

Initially, Cretonne won’t have optimizations like this, but I agree with the strategy @brson lays out above of gradually introducing annotations and transformations that enable higher-level optimizations. For example, load instructions could get a notrap flag which asserts that the address is known to not cause a trap:

v2 = heap_addr v1, 6
v10 = load.i16 v2, 4, notrap

The notrap flag would give the optimizer permission to delete an unused load. When Cretonne is used as a backend for the Rust compiler, presumably load instructions would be injected directly, not via a safe head_load version as wasm does. You then have the option of adding the notrap flag or not, and the choice affects the tradeoff between performance and predictable behavior when unsafe Rust code violates its constraints.

This is not fundamentally different from the various annotations supported by LLVM. The important difference is that we don’t start out with the whole set of C-derived undefined behaviors, but that we can introduce annotations and optimizations gradually and evaluate their tradeoffs as we go.

Stretching the fabric

I think Cretonne’s IR fabric could support mid-level optimizations like those performed on LLVM IR, even if that is not initially in scope for the project. The larger type system in LLVM doesn’t inform the optimizers much, except perhaps in the ability to distinguish pointers from integers. Cretonne could easily add a single address type if needed.

As @eddyb mentioned above, Cretonne uses integer indexing and side-tables via the EntityRef trait and the EntityMap data structure. This means that it is also possibly to add things like use-lists if that is deemed necessary for the higher-level optimizations. The low-level passes don’t need them, so they can simply be discarded before code generation.

I imagine mid-level optimizations could be implemented in a separate crate from the low-level Cretonne code generator, sharing the IR fabric. There are still use cases for Cretonne that don’t need this type of optimization.

The inliner loop

Should inlining happen on MIR or the mid-level Cretonne IL? Two points to consider are:

  1. Inlining heuristics are usually driven by code size estimates of the callee function. This works best if the callee has already been optimized as much as possible, after its own calls have been inlined. This can be achieved by traversing the call graph in a bottom-up order, inlining and optimizing before moving on to the next function. In languages like Rust and C++, multiple levels of function calls can collapse into almost no code, making it very difficult to estimate code size without preceding optimizations.
  2. Optimizations following inlining can expose more opportunities for inlining because indirect calls are turned into direct calls by constant propagation. For Rust, this applies to method calls on trait objects. It is beneficial to retry inlining following optimizations that discover direct calls.

Both of these considerations imply that inlining works best if it is followed by optimizations on the same intermediate representation. In particular, it is not optimal to perform inlining on rustc’s MIR and then immediately lower to the next IR.

If Rust has cases where inlining is required for correctness, as the #[inline(always)] attribute is hinting, it could make sense to inline those calls on MIR and let the following passes deal with heuristic-driven inlining where the purpose is performance.

7 Likes

Whether one function is actually inlined into another is not detectable by the code being compiled, just like any other optimization. There might be unreliable hacks like taking addresses of local variables to try and keep track of the stack, but that’s very clearly nonsense which shouldn’t be graced with guaranteed results. So IMHO #[inline(always)] should not force actual inlining, like any other optimization it should be fully optional.

There is a semantically relevant component to #[inline] and #[inline(always)]: A copy of the function is placed in crates that call it. Because some codegen decisions such as overflow checks can be made per-crate, the presence or absence of #[inline] can have an observable effect. However, this is not separate from any actual inlining decisions so it’s easy to support even if the backend can’t perform inlining.

That would be my preferred approach too.

It's worth noting that Clang considers __attribute__((always_inline)) as mandatory, and will inline those functions even with -O0. I've even seen code depending that for correctness, like:

int mylib_setjmp(jmp_buf env) __attribute__((always_inline)) {
    return setjmp(env);
}

This depends on a rather liberal interpretation of the spec for setjmp() which requires The longjmp() routines may not be called after the routine which called the setjmp() routines returns. In practice, it works with both Clang and GCC.

If Rust doesn't allow such shenanigans, that would be fine by me.

Inline assembly might need some of its inputs to be const-propagated, which forced inlining would help with. Not sure if I’ve seen anything else which requires forced inlining except for performance reasons.

Would it be possible in the non-recursive case to translate one function at a time using Cretonne and choose whether or not to inline (at the MIR level, causing us to recompile the function if we do) after seeing the size of the compiled function?

That would be possible, and it would give you some very accurate size estimates, of course. It would also mean duplicating the optimization work at every call site a function gets inlined.

Another side of inlining heuristics, though, is trying to predict optimizations that will be enabled by inlining a function and that will cause the inlined body to shrink a lot. This helps with inlining medium-sized functions where a chunk of the body may be eliminated after inlining. For example:

let mut v = Vec::new()
v.push(1);
v.push(2);
v.push(3);
foo(&v[..]);

Ideally, we would want the optimizers to turn this code into:

let v = [1, 2, 3];
foo(&v[..]);

This would require inlining both Vec::push() and the RawVec::double() it calls. The latter method has two parts about the same size: One half for the initial allocation, and one half for growing an existing allocation.

After inlining Vec::new(), the initial 0 capacity and size are known constants. The Vec::push() method is pretty small, so we’ll inline those calls, but we probably didn’t inline RawVec::double() into push() because the two halves of the function together are too big.

We now have three copies of the push() body. After cleaning up and simplifying expressions, we’ll be able to determine that the first push() body calls double(). We also know that v.buf.cap is 0, so by examining the body of double(), we can predict that half the function will go away. This is enough to drop the predicted size under the inlining threshold, and we can inline double().

We now have code like this:

let mut v = // body of Vec::new().
// body of v.buf.double()
v[0] = 1
// body of v.push(2)
// body of v.push(3)
foo(&v[..]);

That code then simplifies into what we wanted. In particular, inlining double() makes it possible to accurately track the vector capacity and size so the push() bodies simplify to:

let mut v = allocate(4*4);
v[0] = 1
v[1] = 2
v[2] = 3
foo(&v[..]);
deallocate(v);

We couldn’t inline double() without looking at the body of the function, knowing that v.buf.cap == 0 and predicting that half the function would disappear.

These two things are hard when inlining MIR:

  1. We had to first inline Vec::new() and then clean up and simplify code before we could discover that v.buf.cap was a known constant.
  2. We had to look at the code of RawVec::double() to predict that it would shrink a lot, and we had to estimate that it would shrink under the inlining threshold.
9 Likes

Very good walkthrough, this should be on some MIR issue as an example of an optimization we should strive for.

What I have in mind is to find opportunities which would result in dramatic impact (branch conditions and trait method calls, mostly) and later use them to probe whether inlining would be worth the effort.

It might even be worth using temporary arenas/overlays to do cheap speculative inlining and value propagation, bailing as soon as irreducible complexity is found, but keeping the simplified form otherwise, for insertion/materialization back into the caller.