There’s more going on here than meets the eye. If you compile Eli’s C example with Clang, you’ll observe that it generates an LLVM indirectbr instruction for the computed gotos:
$ clang -O -o- -S -emit-llvm interp_cgoto.c | grep indirectbr
indirectbr i8* %12, [label %6, label %13, label %24, label %31, label %38, label %45, label %52]
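For readers who haven’t seen Eli’s post, the dispatch pattern in question looks roughly like this. This is only a minimal sketch using the GNU C labels-as-values extension, with made-up opcodes; it is not the original interp_cgoto.c, but compiling something like it with the command above should show the same indirectbr:

#include <stdio.h>

/* Minimal computed-goto interpreter sketch (GNU C labels-as-values).
   Not Eli's interp_cgoto.c, just the same dispatch pattern. */
int interp(const unsigned char *code, int x)
{
    /* One table entry per opcode; && takes the address of a label. */
    static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;

    goto *dispatch[*pc++];        /* computed goto: an indirect jump */

do_inc:
    x++;
    goto *dispatch[*pc++];        /* every handler ends with its own computed goto */
do_dec:
    x--;
    goto *dispatch[*pc++];
do_halt:
    return x;
}

int main(void)
{
    const unsigned char prog[] = { 1, 1, 2, 0 };  /* inc, inc, dec, halt */
    printf("%d\n", interp(prog, 0));              /* prints 1 */
    return 0;
}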
But notice that Clang only emits a single indirectbr instruction. All of the computed gotos in the original source become normal direct jumps to that single indirectbr, passing the destination label in a register.
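Conceptually, the merged IR corresponds to source code that funnels every handler through one shared dispatch point. Here is a hand-written C approximation of that shape, a drop-in variant of the interp() sketch above (the names are made up for illustration; this is not actual Clang output):

/* Hand-written approximation of the merged form: each handler picks a
   destination and jumps *directly* to one shared dispatch label, and
   only that label performs the indirect jump. */
int interp_merged(const unsigned char *code, int x)
{
    static void *table[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;
    void *target;

    target = table[*pc++];
dispatch:
    goto *target;                 /* the single indirect branch */

do_inc:
    x++;
    target = table[*pc++];        /* pick the destination label... */
    goto dispatch;                /* ...then jump directly to the shared dispatch */
do_dec:
    x--;
    target = table[*pc++];
    goto dispatch;
do_halt:
    return x;
}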
You would think this defeats the purpose of the computed gotos, but if you look at the generated assembly instead, there are indeed multiple indirect jump instructions:
$ clang -O -o- -S interp_cgoto.c | grep jmpq
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rdx)
jmpq *(%rcx)
jmpq *(%rcx)
Why does Clang merge all the indirectbr instructions?
Every indirect branch can potentially jump to any label in the function that has its address taken. If your virtual machine has 1000 opcodes, there are 1000 indirect branches, each with 1000 potential destination labels, so the control-flow graph has a million edges. Since many of LLVM’s optimizations work on the control-flow graph, they risk suddenly running very slowly.
By merging the indirectbr instructions, the control-flow graph instead has 1000 edges to the single indirect branch, and 1000 edges from the indirect branch. N x M becomes N + M. That is much more manageable.
So why does it work anyway?
LLVM’s code generator is where LLVM IR is turned into real machine code. The main passes are instruction selection and register allocation, but it also has a tail duplication pass. This pass knows that it should be much more aggressive with branches leading to an indirect branch, and the result is that it ends up duplicating the indirect branch. This usually restores the original computed gotos from the source.
This also brings back the million CFG edges, but the code generator passes that run after the tail duplicator know how to handle that safely.
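In source terms, tail duplication copies the shared dispatch point back into each handler, so the end result is essentially the shape we started from. A sketch of the duplicated form, again hypothetical rather than actual compiler output:

/* After tail duplication (conceptually): the shared dispatch block is
   copied back into each of its predecessors, so every handler again
   ends in its own indirect jump, matching the original source. */
int interp_tailduped(const unsigned char *code, int x)
{
    static void *table[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;

    goto *table[*pc++];
do_inc:
    x++;
    goto *table[*pc++];   /* one duplicated copy of the dispatch */
do_dec:
    x--;
    goto *table[*pc++];   /* another copy */
do_halt:
    return x;
}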
Conclusion
You don’t need to put computed gotos in your favorite programming language. Known compiler transformations can do the same thing to a loop { match { .. } }.
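For reference, the portable form being referred to is a plain switch inside a loop; in C it would look something like this (again a made-up sketch, using the same toy opcodes as above):

/* The portable alternative: a plain switch inside a loop, the C analogue
   of loop { match { .. } }. A sufficiently aggressive compiler could
   tail-duplicate the dispatch here too. */
int interp_switch(const unsigned char *code, int x)
{
    const unsigned char *pc = code;
    for (;;) {
        switch (*pc++) {
        case 1: x++; break;   /* inc */
        case 2: x--; break;   /* dec */
        default: return x;    /* 0: halt */
        }
    }
}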
LLVM doesn’t optimize a switch in a loop that aggressively today, probably because it tends to bloat code size, and modern CPUs have good indirect branch predictors anyway.
The presence of an indirectbr instruction really works as a compiler flag that says “I’m compiling an interpreter loop, and I don’t care about code size right now”.