Idea for safe computed goto using enums

I noticed LLVM lets you use blockaddress outside of the function defining the labels (though I think it requires being in the same codegen unit as the function containing the labels), so I think it would be useful for interpreters and other uses of computed goto to be able to express that like so:

// must not be visible outside `crate`
#[repr(label(interpreter))] // it's label addresses in `fn interpreter`
pub(crate) enum Insn {
    A,
    B,
    C,
}

pub(crate) fn build_insns() -> Vec<Insn> {
    vec![Insn::A, Insn::B, Insn::A, Insn::C]
}

pub(crate) fn interpreter(insns: &[Insn]) {
    let mut pc = 0;
    k#goto insns[pc];
    #[defines(Insn::A)]
    {
        // blocks are like the let else block -- they must diverge
        println!("A");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::B)]
    {
        println!("B");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::C)]
    {
        println!("C");
        return;
    }
}

If we can get LLVM to allow making aliases to blockaddresses (which seems likely to be easy to implement, since a blockaddress is just another assembly-level label), then it should be possible to make the label enum type public. That would greatly simplify the implementation in rustc, since you could just emit an alias for each label (making the address accessible in other codegen units and across crates) and use those aliases wherever you need a variant's value. This would also make inlining functions that use the enum much easier, since you could refer to the enum values anywhere rather than only in the same codegen unit as the function defining the labels.

It seems like a pretty big ask to add any form of computed goto to a language that lacks even static goto.

That's because most existing control flow structures can usually be converted efficiently into equivalent static gotos, so static goto isn't as necessary. Computed goto doesn't really have any equally efficient equivalents -- the two closest are:

  • match -- this has the downside of being less efficient in some cases, because it has to go through a translation table to get block addresses, and it has a hard time expressing a separate computed goto at the end of every block, which more efficient interpreters want for the branch-prediction advantages and for having fewer branches per interpreted instruction (see the sketch after this list).
  • tail calls -- these are much more complex to implement at the compiler level, since they need new ABIs and backend support; they are also more limited in how many values can be kept in registers or on the stack, and borrow checking is more limited because each handler becomes a separate function.
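
For concreteness, here's a rough sketch of the match-based shape being compared against (illustrative code, reusing the names from the example above) -- note the single dispatch point that every handler funnels back through:

#[derive(Clone, Copy)]
enum Insn {
    A,
    B,
    C,
}

fn interpreter(insns: &[Insn]) {
    let mut pc = 0;
    loop {
        // One shared indirect branch (or jump table) for the whole
        // interpreter; the proposal above instead wants one per handler.
        match insns[pc] {
            Insn::A => {
                println!("A");
                pc += 1;
            }
            Insn::B => {
                println!("B");
                pc += 1;
            }
            Insn::C => {
                println!("C");
                return;
            }
        }
    }
}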

Uh, this logic isn't sound. Of course it is possible to convert a less flexible control flow construct (e.g. labeled blocks) into a more powerful construct (static goto).

Perhaps what you meant is "most common usage patterns of static goto can be expressed in terms of existing control flow concepts"?

Regardless, the main difficulty of gotos in Rust is the complexity of integrating them with the borrow checker.

Perhaps you would be interested in a special #[repr] for enums, where the discriminant is assigned based on the address of the arms of a specific match statement?

Also, eliminating a single jmp addr[offset] is usually not a big deal in all but the innermost of loops. Do you have benchmarks to support the idea that this is a bottleneck? Both are computed jumps; all you are eliminating is a single memory access.

I found this article about computed goto that gives me some idea of what you might mean by "having fewer branches", but it's worth noting that in Rust you can get the same advantage by just inserting unreachable_unchecked into the default branch (godbolt link).
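
Something like this (my own illustrative sketch, not the exact code from that godbolt link):

use std::hint::unreachable_unchecked;

fn dispatch(op: u8) -> u32 {
    debug_assert!(op < 4);
    match op {
        0 => 10,
        1 => 20,
        2 => 30,
        3 => 40,
        // SAFETY: relies on the caller never passing op >= 4; telling the
        // compiler this arm is unreachable lets it drop the range check in
        // front of the jump table.
        _ => unsafe { unreachable_unchecked() },
    }
}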

This may be true in theory, but it's important to note that basically 100% of that backend work has already been done: LLVM has full tail call support. Additionally, there's already an accepted RFC for tail calls in Rust; all that is needed is for someone to implement it in rustc.
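
For illustration, interpreter dispatch under that RFC's proposed become keyword could look roughly like this (a hedged sketch -- the RFC is accepted but the syntax isn't implemented yet, so treat it as pseudocode):

enum Insn {
    A,
    B,
    C,
}

fn dispatch(insns: &[Insn], pc: usize) {
    match insns[pc] {
        Insn::A => become op_a(insns, pc),
        Insn::B => become op_b(insns, pc),
        Insn::C => become op_c(insns, pc),
    }
}

fn op_a(insns: &[Insn], pc: usize) {
    println!("A");
    become dispatch(insns, pc + 1); // guaranteed to reuse the current stack frame
}

fn op_b(insns: &[Insn], pc: usize) {
    println!("B");
    become dispatch(insns, pc + 1);
}

fn op_c(_insns: &[Insn], _pc: usize) {
    println!("C"); // halt: plain return, no tail call
}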

Yeah, that's what I meant.

Actually, since the borrow checker runs on MIR, and MIR is already a list of basic blocks with branches between them (i.e. a control flow graph), I expect borrow checking to need no modification and to just work as-is.

That has been suggested before; however, it has the downside of only working when you never need more than one computed goto per enum. Good interpreters need one computed goto per match arm, at the end of each block, i.e. more than one computed goto per enum.

It's not so much eliminating a single memory access as having a separate indirect branch for each interpreted instruction instead of one indirect branch shared across the whole interpreter. It took me a bit of work to come up with a decent example, since Rust doesn't have computed goto: the interpreter with multiple computed gotos is about 2.9% faster than the one with a single shared goto.

Note that using the godbolt.org server to run the program doesn't give decent results, since it's extremely noisy due to being a cloud VM (like ±25% or more). Also note that I'm using -O with gcc 12, since using clang and/or higher optimization levels duplicated the computed goto, making both tests effectively the same. Running it on one of my computers at home (with a Ryzen 9 3900X) gives much more predictable results. I used the medians from the last 2 runs, ignoring the first 2 because the CPU is warming up the cache, changing clock speeds, and so on:

$ g++-12 -O -std=c++20 -o computed-goto computed-goto.cpp && ./computed-goto
single: 513 iterations:
    min: 0.0396113 sec -- 77215 ns/iter
    max: 0.0463892 sec -- 90427.4 ns/iter
    median: 0.042533 sec -- 82910.3 ns/iter
    mean: 0.0427916 sec -- 83414.4 ns/iter
multiple: 571 iterations:
    min: 0.0453155 sec -- 79361.7 ns/iter
    max: 0.0464836 sec -- 81407.4 ns/iter
    median: 0.0453678 sec -- 79453.2 ns/iter
    mean: 0.045558 sec -- 79786.4 ns/iter
single: 530 iterations:
    min: 0.040901 sec -- 77171.7 ns/iter
    max: 0.0470188 sec -- 88714.7 ns/iter
    median: 0.0432859 sec -- 81671.4 ns/iter
    mean: 0.0434654 sec -- 82010.2 ns/iter
multiple: 572 iterations:
    min: 0.0453794 sec -- 79334.6 ns/iter
    max: 0.0477413 sec -- 83463.8 ns/iter
    median: 0.0454158 sec -- 79398.2 ns/iter
    mean: 0.046009 sec -- 80435.4 ns/iter
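
(Taking the medians of the last two runs: 81671.4 / 79398.2 ≈ 1.029, which is where the roughly 2.9% figure above comes from.)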

The other problem with tail calls that computed goto doesn't have is that tail calls don't work on targets that don't support them, such as wasm without the tail-call feature. However, LLVM can translate computed gotos back into a switch, so they work fine on wasm.

This makes sense, but it can just be handled in an optimization pass; there's no need to introduce a new language concept. Indeed, this optimization pass already exists, which is why you were unable to reproduce your example using clang.

What's more, clang and rustc use the same backend (LLVM), so this optimization already happens in Rust code! If you carefully inspect my previous godbolt example, you'll see that the basic blocks for each match arm all end the same way:

        movzx   esi, byte ptr [rdi + rcx]    # load the next opcode byte
        movsxd  rsi, dword ptr [rdx + 4*rsi] # look up its offset in the jump table
        add     rsi, rdx                     # turn the offset into an absolute address
        jmp     rsi                          # indirect branch -- one copy per arm

This is a computed goto! One per arm, for improved branch prediction, just like you want.

If you change opt-level to 0, you'll see that every match arm instead ends with a static jump to the same basic block.

It only works that way on some LLVM targets, though; e.g. on aarch64-unknown-linux-gnu it doesn't use an indirect branch, instead using a complex set of conditional branches, and it doesn't duplicate that for each match arm.

If it used computed goto instead of match, then it would have indirect branches as desired.