There’s more going on here than meets the eye. If you compile Eli’s C example with Clang, you’ll observe that it generates an LLVM indirectbr instruction for the computed gotos:
$ clang -O -o- -S -emit-llvm interp_cgoto.c | grep indirectbr
indirectbr i8* %12, [label %6, label %13, label %24, label %31, label %38, label %45, label %52]
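For readers who haven’t seen Eli’s post, the dispatch pattern in question looks roughly like this. This is only a minimal sketch using the GNU C labels-as-values extension, with made-up opcodes; it is not the original interp_cgoto.c, but compiling something like it with the command above should show the same indirectbr:

#include <stdio.h>

/* Minimal computed-goto interpreter sketch (GNU C labels-as-values).
   Not Eli's interp_cgoto.c, just the same dispatch pattern. */
int interp(const unsigned char *code, int x)
{
    /* One table entry per opcode; && takes the address of a label. */
    static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;

    goto *dispatch[*pc++];        /* computed goto: an indirect jump */

do_inc:
    x++;
    goto *dispatch[*pc++];        /* every handler ends with its own computed goto */
do_dec:
    x--;
    goto *dispatch[*pc++];
do_halt:
    return x;
}

int main(void)
{
    const unsigned char prog[] = { 1, 1, 2, 0 };  /* inc, inc, dec, halt */
    printf("%d\n", interp(prog, 0));              /* prints 1 */
    return 0;
}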
But notice that Clang only emits a single indirectbr instruction. All of the computed gotos in the original source become normal direct jumps to that single indirectbr, passing the destination label in a register.
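Conceptually, the merged IR corresponds to source code that funnels every handler through one shared dispatch point. Here is a hand-written C approximation of that shape, a drop-in variant of the interp() sketch above (the names are made up for illustration; this is not actual Clang output):

/* Hand-written approximation of the merged form: each handler picks a
   destination and jumps *directly* to one shared dispatch label, and
   only that label performs the indirect jump. */
int interp_merged(const unsigned char *code, int x)
{
    static void *table[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;
    void *target;

    target = table[*pc++];
dispatch:
    goto *target;                 /* the single indirect branch */

do_inc:
    x++;
    target = table[*pc++];        /* pick the destination label... */
    goto dispatch;                /* ...then jump directly to the shared dispatch */
do_dec:
    x--;
    target = table[*pc++];
    goto dispatch;
do_halt:
    return x;
}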
You would think this defeats the purpose of the computed gotos, but if you look at the generated assembly instead, there are indeed multiple indirect jump instructions:
$ clang -O -o- -S interp_cgoto.c | grep jmpq
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rcx)
jmpq *(%rdx)
jmpq *(%rcx)
jmpq *(%rcx)
Why does Clang merge all the indirectbr instructions?
Every indirect branch can potentially jump to any label in the function that has its address taken. If your virtual machine has 1000 opcodes, there are 1000 indirect branches, each with 1000 potential destination labels, so the control-flow graph has a million edges. Since many of LLVM’s optimizations work on the control-flow graph, they risk suddenly running very slowly.
By merging the indirectbr instructions, the control-flow graph instead has 1000 edges to the single indirect branch, and 1000 edges from the indirect branch. N x M becomes N + M. That is much more manageable.
So why does it work anyway?
LLVM’s code generator is where LLVM IR is turned into real machine code. The main passes are instruction selection and register allocation, but it also has a tail duplication pass. This pass knows that it should be much more aggressive with branches leading to an indirect branch, and the result is that it ends up duplicating the indirect branch. This usually restores the original computed gotos from the source.
This also brings back the million CFG edges, but the code generator passes that run after the tail duplicator know how to handle that safely.
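In source terms, tail duplication copies the shared dispatch point back into each handler, so the end result is essentially the shape we started from. A sketch of the duplicated form, again hypothetical rather than actual compiler output:

/* After tail duplication (conceptually): the shared dispatch block is
   copied back into each of its predecessors, so every handler again
   ends in its own indirect jump, matching the original source. */
int interp_tailduped(const unsigned char *code, int x)
{
    static void *table[] = { &&do_halt, &&do_inc, &&do_dec };
    const unsigned char *pc = code;

    goto *table[*pc++];
do_inc:
    x++;
    goto *table[*pc++];   /* one duplicated copy of the dispatch */
do_dec:
    x--;
    goto *table[*pc++];   /* another copy */
do_halt:
    return x;
}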
Conclusion
You don’t need to put computed gotos in your favorite programming language. Known compiler transformations can do the same thing to a loop { match { .. } }.
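For reference, the portable form being referred to is a plain switch inside a loop; in C it would look something like this (again a made-up sketch, using the same toy opcodes as above):

/* The portable alternative: a plain switch inside a loop, the C analogue
   of loop { match { .. } }. A sufficiently aggressive compiler could
   tail-duplicate the dispatch here too. */
int interp_switch(const unsigned char *code, int x)
{
    const unsigned char *pc = code;
    for (;;) {
        switch (*pc++) {
        case 1: x++; break;   /* inc */
        case 2: x--; break;   /* dec */
        default: return x;    /* 0: halt */
        }
    }
}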
LLVM doesn’t optimize a switch in a loop that aggressively today, probably because it tends to bloat code size, and modern CPUs have good indirect branch predictors anyway.
The presence of an indirectbr instruction really works as a compiler flag that says “I’m compiling an interpreter loop, and I don’t care about code size right now”.