Possible alternative compiler backend: Cretonne

It limits the assumptions the compiler can make when optimizing. For example, signed integer overflow is undefined in C, so the compiler can optimize a + 1 < a to false, even though that condition can be true in an implementation that wraps on overflow.

6 Likes

Indeed. If you use a store instruction to write ‘7’ as your return address, something bad will happen when you try to return. Cretonne removes focus from the undefined-behavior approach to optimizations, but there are still invariants that shouldn’t be broken. Cretonne differs from LLVM on these points:

  • Arithmetic instructions either trap or produce a value.
  • There is no concept of undef or poison values in the type system.
  • There is no unreachable instruction to indicate that a code path is impossible. Use a trap instead.
  • Dereferencing an invalid pointer will cause the resulting machine code program to do the same. It doesn’t cause Cretonne to assume that is can’t happen.

The LLVM optimizations that exploit undefined behavior don’t have counterparts in Cretonne. It is not the case that Cretonne has a new innovative way of performing these optimizations safely.

12 Likes

Ok, thanks for elaborating!

I wrote a document comparing Cretonne to LLVM.

I am quite excited about this project. It would be great for Cretonne to have multiple clients because it forces us to define the right abstractions and not let too many client-specific hacks creep in. For Rust, I believe Cretonne can provide debug builds that are both faster and that generate faster code than LLVM’s debug builds.

Debug information

A JIT compiler like IonMonkey does speculative optimizations that sometimes turn out to be incorrect, and a running function needs to be de-optimized. When this happens, the values of local variables are derived from the state of the optimized code and copied into an interpreter stack frame where execution continues.

This means that at certain safe points in the optimized code, it must be possible to compute the value of all local variables that are in scope. This is very similar to the requirements when breaking a debug-compiled Rust program in a debugger. The debugger can display the values of all local variables.

In LLVM, this is achieved with a -O0 build which generates code that keeps all local variables in a designated stack slot at all times. Such code is very slow and very large—register allocation is an important optimization.

A debug build that uses safe-point-preserving optimizations to generate accurate DWARF should run much faster than everything-on-the-stack -O0 code while being just as debuggable.

Schedule

Cretonne is a work in progress, and it can’t do much of anything at the moment. The first goal is to be able to compile WebAssembly as a second-tier compiler in SpiderMonkey. I expect we will get to that point sometime in 2017.

REPL

Cretonne can be used as a JIT for a REPL as can LLVM. I expect Cretonne to provide a better tradeoff between latency and optimizations than LLVM. Of course I have no data to back that up at this time.

Higher-level optimizations

Cretonne does not aim to compete with LLVM in the -O3 optimization space, but it could be a component of a compiler that did. I agree there is an interesting opportunity here, in particular if we can exploit Rust’s explicit aliasing model.

Cretonne’s intermediate representation isolates functions so they can be compiled in parallel. This also means that it can’t really express inter-procedural optimizations like inlining which is essential to optimizing Rust. There’s nothing wrong with inlining one Cretonne function into another, but the process needs to be guided by a call graph representation that is external to Cretonne.

There is an interesting challenge in defining a compiler architecture that permits concurrent inter-procedural optimizations as well as incremental compilation. I hope Cretonne can be a component of such an architecture—it does not attempt to define it.

25 Likes

I would personally be super excited about a new backend for Rust, especially one written in Rust and super fast. I’m particularly excited about even faster debug builds through an alternate backend like Cretonne.

That… sounds amazing!

So one thing I just thought of is that we could perhaps run both Cretonne and LLVM for release builds. Most of our time in LLVM is spent just because we’re sending such a huge wad of code to LLVM, but if Cretonne could very quickly rip through most of that to greatly reduce the size of the IR we send to LLVM, then Cretonne+LLVM could arguably be faster than just LLVM in release mode.

Now release mode builds aren’t necessarily the critical piece to make faster, just an idea :slight_smile:

12 Likes

This is really interesting - I wonder how it compares to WebKit’s own B3.

I’ve had some thoughts on an IR that doesn’t share LLVM’s problems (as I understand them) for a while now and Cretonne matches my expectations almost entirely, when it comes to the type system - aggregates only serve to allow sub-optimal assumptions and are otherwise a pointless price to pay - so this makes me very happy indeed. (LLVM devs thought I was mad, ha!)

The one thing that having no aggregate types implies is no aggregate constant expressions (at least for LLVM). What I came up with as a replacement for LLVM constants (not that LLVM devs would even consider it) was described in my design for an interpretable abstract Rust machine, which miri (developed by @scott and oli_obk) already implements.

The gist of it, especially for a backend as opposed to an interpreter, is to have globals be byte buffers with relocations (just like you’d emit in the binary you give a linker) that point to other globals, as opposed to having nested aggregates with getelementptr constant expressions embedded in them.

I can’t find anything related to globals or constants in Cretonne’s source so maybe I’m off the mark.

I did get to see the integer indexing (as opposed to pointers with complex ownership rules), which is another thing that makes me quite happy. If bounds checking ever gets too costly you could consider using the indexing crate - it would work remarkably well for any kind of analysis that doesn’t need to modify anything, since you only need to bound-check everything once and then provide the sound compiler-checked bound-less indexing.

EDIT: Ah, one thing I forgot: the design using various side-tables (EntityMap) is nicely extendable and one place this really matters is debuginfo: wasting exactly 0 bytes per instruction when it’s not in use is really important.
On top of that, one complaint I hear about removing complex types from LLVM is that you lose information needed to debug LLVM IR - the answer there is debuginfo!
If you already have code to emit debuginfo in your front-end, why not make use of it in the back-end for presenting a richer IR?
LLVM’s is terrible, all you see is !dbg !123 and have to look up !123, and repeat this recursively because it’s a tree of metadata nodes.

The opportunity here is to annotate the textual IL form with all sorts of details that can be lazily reconstructed from debuginfo. For example, (*ptr).z.0.foo[5] might be GEP ptr, 0, 2, 0, 1, 5 in LLVM and offset 41 in Cretonne - but you can store the compact offset and use debuginfo to reproduce (*ptr).z.0.foo[5] in an IL comment - nicer than LLVM’s and more efficient underneath!

IOW, Cretonne can have its cake and eat it too - did I mention how happy I am this is happening? :smile:

12 Likes

Cough goblin

Would give me a kick in the butt to finish the rest of the binary formats. Adding a writer API wouldn’t be too bad either, I think, depending on requirements.

So yea, I’d be interested in helping :slight_smile:

8 Likes

Five years from now how much work will be needed to add inlininig to Cretonne?

To clarify, you’re talking about lowering MIR to Cretonne IR, letting Cretonne optimize that, and then translating the resulting Cretonne IR to LLVM IR? I have doubts about the economic viability of developing such a translator, since there’s already a path forward for reducing IR size: MIR-level optimizations. Unlike Cretonne, MIR optimizations can work pre-monomorphization, so they have the potential for much more “bang for the buck”. Moreover, if Cretonne’s optimizations really are as simple as is expected from a mid-tier JIT compiler, Rust’s own MIR optimizations may cover much of the same ground, leaving Cretonne little room for improvement.

8 Likes

Globals in WebAssembly are different, and I haven’t gotten to them yet. Like stack objects, you can’t take their address. I think they’ll end up looking a lot like stack objects in Cretonne:

  • Globals accessed by the function are declared in the preamble, including a name à la FunctionName. A byte size is optional, and there is a constant flag and probably an alignment.
  • Instructions global_load and global_store work like the stack slot equivalents, and their offsets are verified for sized globals. For unsized globals, these instructions are not sandbox-safe.
  • A global_addr instruction is used when lowering from WASM, like stack_addr.

Regarding data initializers, I think your idea makes sense. Here’s an LLVM example using somewhat contrived C++ code, but a constant array of enum variants is not unreasonable in Rust:

extern int earray[];

union U {
    int *p;
    struct S { float c; int *p; } b;
};

U myarray[2] = {
    { .p = &earray[5] },
    { .b = { 3.14, &earray[17] } },
};

To initialize this array, Clang actually has to declare it as a struct type (my formatting):

%union.U = type { %struct.U::S }
%struct.U::S = type { float, i32* }

@earray = external global [0 x i32], align 4
@myarray = global <{ { i32*, [8 x i8] }, %union.U }> <{
    { i32*, [8 x i8] } {
        i32* bitcast (i8* getelementptr (i8, i8* bitcast ([0 x i32]* @earray to i8*), i64 20) to i32*),
        [8 x i8] undef
    },
    %union.U {
        %struct.U::S {
            float 0x40091EB860000000,
            i32* bitcast (i8* getelementptr (i8, i8* bitcast ([0 x i32]* @earray to i8*), i64 68) to i32*)
        }
    }
}>, align 16 

The resulting assembly is much nicer:

	.section	__DATA,__data
	.globl	_myarray                ## @myarray
	.align	4
_myarray:
	.quad	_earray+20
	.space	8
	.long	1078523331              ## float 3.1400001
	.space	4
	.quad	_earray+68

The LLVM initializer expression does have the advantage that it can specify undef bits, which LLVM’s optimizers will make use of. However, Cretonne won’t, so a bag of bytes with relocations makes sense to me.

Sometimes optimizers want to constant fold loads from constant data, and that’s where LLVM’s representation makes sense. If the declared type matches the loaded type, the load can simply be replaced with the constant expression from the initializer. It is important that this kind of constant folding works when relocations are present. Constant folding vtable loads can enable more inlining.

1 Like

Except LLVM can’t always fold loads, not easily though. Bytes + relocations ensure O(lg R) load folding if it is safe to do so, where R is the number of relocations (because you have to do a binary search or something similar). For undef there’s a mask, I was thinking bit-level for LLVM, while miri uses byte-level because it denies using those values anyway.

1 Like

By then, multi-core CPUs will have been mainstream for 20 years, so let’s assume a single-threaded compiler is off the table.

Inlining is a global optimization that interacts with (and modifies) the whole call graph. This means that its design depends a lot on the architecture you choose for a compiler that supports concurrent and incremental compilation efficiently. We probably want something like:

  • Multiple threads working on the call graph simultaneously. (And safely, and deterministically).
  • Don’t require everything to be in memory at the same time. Some programs are large.
  • Parts of the call graph can be reused from the previous compilation because nothing they depend on has changed.

For compiling C/C++ with LLVM, ThinLTO is a really good bid. I think Rust can do better because it doesn’t have to deal with C++'s legacy, but it’s a high bar.

So I think inlining is something that evolves along with the compiler / optimizer architecture. We can start by build building a single-threaded, non-incremental inliner pretty easily.

Here @pcwalton and other Servo devs might have some insight from building a parallelized layout engine - both that and inlining operate on trees (call graph -> DAG subsets -> tree of inlining), at the end of the day.

My intuition is that the worker pool would handle the “global callgraph” and partition it such that bottom-up local optimizations + inlining can be done in parallel.
That is, once a leaf function has been optimized, it can be “frozen” and shared between callees which may inline it from different threads.

Cretonne’s index-based ownership model (or lack thereof) helps with keeping things efficient inside a function, allowing even the “heavy-duty” Arc<RwLock<Function>> to work for such a bottom-up scheme without per-instruction costs or unsafety.

LLVM would have to sync types, constant expressions and metadata nodes (which are all interned), but Cretonne can just side-step all of that and joggle with self-contained functions across threads.

2 Likes

This project is really cool, but the comments in this threads seems, well, a little too optimistic.

Some assume that in the far future Cretonne could achieve performance similar to LLVM, but that simply doesn’t seem to be a possibility, and I’m not talking about the massive development and research that have gone into LLVM, but rather an architectural limitation: the lack of undefined behavior.

Undefined behavior is the cornerstone in modern optimizers. Without that, you never come near to the speed of GCC, LLVM, VeLLVM, etc.

Many optimizations are impossible without UB. In fact, the vast majority of optimization passes rely on UB in one way or another. If everything is well-defined, there’s a hell lot of things you cannot do.

I don’t see any obvious way to get around this without either adding heavy runtime checks (e.g. derefing dangling pointers gives zero or crashes the program) or undefined behavior (something happen).

I’d say making a compromise between the two extremes would make sense. For one thing, you could make derefing null pointers defined and so on. Another less extreme idea is to remove all undefined behavior except for the Undefined Value-based (Undef, as the value is known in LLVM) UB. This allows for smoother reasoning, but still carries many of the same disadvantages of normal UB, unfortunately.

That being said, I don’t think it’s a bad idea, because there are cases where accidental UB is a no-go even on the cost of performance. The web is an example. I just think that it cannot (and should not) replace the default optimizer in --release as some call for in this thread (and the reddit thread).

7 Likes

This one stands out in particular.

The optimizer wouldn’t be able to reason about values behind pointers at all then, right?

If it isn’t clear yet, runtime performance would not be the primary goal, compile-time performance and correctness are. There are a lot of important use cases for code generation with these qualities, even if they can’t match LLVM’s runtime performance.

Cretonne’s design is philosophically aligned with Rust’s: the same way Rust starts from a disposition of zero-cost abstractions, and is working toward making that more ergonomic; Cretonne starts from a disposition of compiling fast and not having undefined behavior, and can push the boundaries of what is possible from there. I think it is not obvious at all what the limits of such an approach are in a production compiler, and we’re not likely to know without doing it. As @stoklund already mentioned, there are already notions to add powerful aliasing annotations - which are unique to Rust’s needs and could give it unique advantages - that would introduce undefined behavior in a controlled way, to allow specific important optimizations. This is a smart approach - start from a completely safe model and open escape hatches where absolutely necessary, similar to unsafe blocks in safe Rust.

15 Likes

I still doubt it would (even if it heavily exploits Rust-specific constructs) match performance of LLVM, and I don’t see it becoming the default backend for --release mode.

That being said, it could easily outperform LLVM compile-time-wise, and as such I think it would fit as a good default for non-release compilation.

4 Likes

As @stoklund said:

  • Dereferencing an invalid pointer will cause the resulting machine code program to do the same. It doesn’t cause Cretonne to assume that is can’t happen.

So Cretonne doesn’t emit runtime checks for pointer validity – but it doesn’t assume that dereferencing an invalid pointer cannot happen either.

LLVM mostly can’t either. LLVM IR has no concept of “strict aliasing”: unless the frontend supplies annotations any pointer can alias any value anywhere (unless LLVM can locally prove it can’t).

  1. LLVM puts a lot of effort into finding and proving noalias pointers (and is actually pretty good at it at this point).

  2. It is UB to break the assumptions of noalias, so in the end it relies on UB, which was my point.