Computed gotos, TCO, threaded interpreters: experiments and findings

Throughout the forums there’s a sprinkling of posts enquiring about computed gotos for implementing threaded interpreters or faster FSMs.

With an interest in interpreters, I thought I’d investigate the current possibilities and realities of interpreter dispatch methods in Rust.

For anybody who is similarly interested, I wrote up my findings. The code to go with the article is hosted here.

This hasn’t been reviewed by anybody and I’m not sure who I’d ask to review it before publishing a corrected version, so here it is!

Members whose posts and comments have expressed interest in TCO and gotos: @dobenour @leonardo

2 Likes

Is it a useful data point to add the performance of C code compiled with GCC that uses computed gotos?

Looks interesting. Trivial thing I noticed - I guess the date at the top should be 2017.

Thanks! Corrected!

I would assume that gcc/clang would produce similar code to threadedasm.rs except that the bounds checking would be up to you to implement. It would be interesting to compare, but I don’t want to reimplement everything in C myself.

There’s more going on here than meets the eye. If you compile Eli’s C example with Clang, you’ll observe that it generates an LLVM indirectbr instruction for the computed gotos:

$ clang -O -o- -S -emit-llvm interp_cgoto.c | grep indirectbr
  indirectbr i8* %12, [label %6, label %13, label %24, label %31, label %38, label %45, label %52]

But notice that Clang only emits a single indirectbr instruction. All of the computed gotos in the original source become normal direct jumps to that single indirectbr, passing the destination label in a register.

You might think that defeats the purpose of the computed gotos, but if you look at the generated assembly instead, there are indeed multiple indirect jump instructions:

$ clang -O -o- -S interp_cgoto.c | grep jmpq
	jmpq	*(%rcx)
	jmpq	*(%rcx)
	jmpq	*(%rcx)
	jmpq	*(%rcx)
	jmpq	*(%rdx)
	jmpq	*(%rcx)
	jmpq	*(%rcx)

Why does Clang merge all the indirectbr instructions?

Every indirect branch can potentially jump to any label in the function that has its address taken. If your virtual machine has 1000 opcodes, there are 1000 indirect branches, each with 1000 potential destination labels, so the control-flow graph has a million edges. Since many of LLVM’s optimizations use the control-flow graph, they risk suddenly running very slowly.

By merging the indirectbr instructions, the control-flow graph instead has 1000 edges to the single indirect branch, and 1000 edges from the indirect branch. N x M becomes N + M. That is much more manageable.

So why does it work anyway?

LLVM’s code generator is where LLVM IR is turned into real machine code. The main passes are instruction selection and register allocation, but it also has a tail duplication pass. This pass knows to be much more aggressive with branches leading to an indirect branch, and it ends up duplicating the indirect branch into each of its predecessors. This usually restores the original computed gotos from the source.

This also brings back the million CFG edges, but the code generator passes that run after the tail duplicator know how to handle that safely.

Conclusion

You don’t need to put computed gotos in your favorite programming language. Known compiler transformations can do the same thing to a loop { match { .. } }.

LLVM doesn’t optimize a switch in a loop that aggressively today, probably because it tends to bloat code size, and modern CPUs have good indirect branch predictors anyway.

The presence of an indirectbr instruction effectively acts as a compiler flag that says “I’m compiling an interpreter loop, and I don’t care about code size right now”.

6 Likes

Interesting! Maybe there should just be some attribute to put on match statements requesting this optimization.

One situation where this wouldn’t be sufficient is if you actually want regular conditional branches after some handlers, which check for commonly-following opcodes before falling back to the indirect branch. My guess is that on modern “big” CPUs, this would only harm performance, since they have sophisticated prediction for indirect branches, and adding more branches to the mix would just make mispredictions more costly. But on older or smaller CPUs, it might help…
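
For concreteness, here’s a minimal GNU C sketch of that kind of fast path. Everything in it is hypothetical: the opcodes are made up, and the guess that OP_DEC commonly follows OP_INC stands in for whatever profile data would actually tell you.

enum { OP_HALT, OP_INC, OP_DEC };

int
interp_fastpath(unsigned char* code, int initval)
{
  static void* dispatch_table[] = { &&do_halt, &&do_inc, &&do_dec };
  int pc = 0;
  int val = initval;

  goto* dispatch_table[code[pc++]];
do_halt:
  return val;
do_inc:
  val++;
  /* Ordinary conditional branch for the assumed-common successor;
     fall back to the indirect branch for everything else. */
  if (code[pc] == OP_DEC) {
    pc++;
    goto do_dec;
  }
  goto* dispatch_table[code[pc++]];
do_dec:
  val--;
  goto* dispatch_table[code[pc++]];
}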

However, I think direct gotos will be expressible in Rust with become + inlining, once (if) that becomes a thing.

My post isn’t arguing for computed gotos in Rust.

It was an exploration of what assembly rustc can generate and whether there are still benefits these days from gotos, since the topic has come up on occasion in this forum and urlo.

I like learning by doing, and so my conclusion from benchmarking is definitely “it depends.” In the case of Rust, I don’t see any value in building gotos into the language, except possibly for become, but that’s a different story.

Thanks for the explanation of how clang/llvm works in this case, that is really interesting. However, I don’t see at all how your conclusions, however true, derive from that explanation…

Sorry, I didn’t explain that very well. Suppose we rewrite Eli’s computed goto example to look like this:

int
interp_cgoto2(unsigned char* code, int initval)
{
  static void* dispatch_table[] = { &&do_halt, &&do_inc,  &&do_dec, &&do_mul2,
                                    &&do_div2, &&do_add7, &&do_neg };
  int pc = 0;
  int val = initval;

  while (1) {
    goto* dispatch_table[code[pc++]];
  do_halt:
    return val;
  do_inc:
    val++;
    continue;
  do_dec:
    val--;
    continue;
  do_mul2:
    val *= 2;
    continue;
  do_div2:
    val /= 2;
    continue;
  do_add7:
    val += 7;
    continue;
  do_neg:
    val = -val;
    continue;
  }
}

And compare that to his normal switch code, which goes like this:

    while (1) {
        switch (code[pc++]) {
            case OP_HALT:
                return val;
            case OP_INC:
                val++;
                break;
                ...

We should expect that to generate about the same machine code, since a switch that is big and dense enough is translated to a jump table and an indirect branch:

$ clang -O -o- -S interp_switch.c | grep jmpq
	jmpq	*%rcx

However, LLVM compiles the interp_cgoto2 function to almost the same code as the original interp_cgoto which has a computed goto at the end of each opcode:

$ clang -O -o- -S interp_cgoto2.c | grep jmpq
	jmpq	*(%rax,%rcx,8)
	jmpq	*(%rax,%rcx,8)
	jmpq	*(%rax,%rcx,8)
	jmpq	*(%rax,%rcx,8)
	jmpq	*(%rax,%rdx,8)
	jmpq	*(%rax,%rcx,8)
	jmpq	*(%rax,%rcx,8)

LLVM is effectively replacing the continue statements with copies of the indirect branch at the top of the loop. It could also do that with the indirect branch that is inherent in a switch; it chooses not to for code-size reasons.

So, LLVM can transform back and forth between having a single indirect branch at the top of the loop, or having individual indirect branches at the end of each opcode handler. It can do this whether you write your source code using a normal switch statement or a computed goto.

6 Likes

Thanks for taking the time to explain :slight_smile:

I checked clang again. Starting with version 6.0, it does this for the switch-statement version as well when using -O3.

See for yourself: Compiler Explorer.

3 Likes

In my own threaded interpreter, built on top of gcc, I plant the labels directly into the generated code array, so that the dispatch is “goto *pc++;” where pc is a pointer into the code. This is much more direct than “goto *dispatch_table[code[pc++]];”, where pc is an int that is indirected through a code array. I designed the instruction set so that I could try out both forms, and my experiments showed a very large difference in performance with gcc; the exact value depends on the architecture, but it was never less than a factor of 2.5 and ranged up to a factor of 6.
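
For anyone curious, here is a rough GNU C sketch of that direct-threaded layout (the names are invented and the opcode stream is hand-planted; a real interpreter would fill the code array in a translation pass, and the exact declaration of pc can vary):

int
interp_direct(int initval)
{
  /* The code array holds label addresses directly, so dispatch is a
     single indirection: the hand-planted program here is inc, inc, halt. */
  void* code[] = { &&do_inc, &&do_inc, &&do_halt };
  void** pc = code;
  int val = initval;

  goto **pc++;
do_inc:
  val++;
  goto **pc++;
do_halt:
  return val;
}

Note that the code array now stores machine addresses rather than compact opcode bytes, which is part of the size/speed trade-off.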

I would love to move off C++ and onto Rust one day. Any thoughts on how to get a similar effect in Rust?

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.