Idea for safe computed goto using enums

I noticed LLVM lets you use blockaddress outside of the function defining the labels (I think it requires being in the same codegen unit as the function containing the labels, though), so I think it would be useful for interpreters and other uses of computed goto to be able to express that like so:

// must not be visible outside `crate`
#[repr(label(interpreter))] // discriminants are label addresses in `fn interpreter`
pub(crate) enum Insn {
    A,
    B,
    C,
}

pub(crate) fn build_insns() -> Vec<Insn> {
    vec![Insn::A, Insn::B, Insn::A, Insn::C]
}

pub(crate) fn interpreter(insns: &[Insn]) {
    let mut pc = 0;
    k#goto insns[pc];
    #[defines(Insn::A)]
    {
        // like the else block of a let-else, each of these blocks must diverge
        println!("A");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::B)]
    {
        println!("B");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::C)]
    {
        println!("C");
        return;
    }
}

If we can get LLVM to allow making aliases to blockaddresses (which seems likely to be easy to implement, since a blockaddress is just another assembly-level label), then it should be possible to make the label enum type public. That would greatly simplify the implementation in rustc: you could just emit an alias for each label (making the address accessible in other codegen units and across crates) and use those aliases whenever you need a variant's value. It would also make inlining functions that use the enum much easier, since you could refer to the enum values anywhere, rather than only in the same codegen unit as the function defining the labels.

It seems like a pretty big ask to add any form of computed goto to a language that lacks even static goto.

That's because most existing control flow structures can usually be converted efficiently into equivalent static gotos, so static goto isn't as necessary. Computed goto doesn't really have any equally efficient equivalent -- the two closest are:

  • match -- this can be less efficient in some cases, since it has to go through a translation table to get block addresses, and it has a hard time expressing a separate computed goto at the end of every block, which more efficient interpreters want (for the branch-prediction advantages and for having fewer branches per interpreted instruction).
  • tail calls -- these are much more complex to implement at the compiler level, since they need new ABIs and backend support; they are more limited in the number of values that can be kept in registers or on the stack; and borrow checking is more limited because each block becomes a separate function (see the sketch below).
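
As a rough illustration of that last point (the opcode set and all names here are invented for the example), the tail-call style forces the interpreter into roughly this shape: every handler is a separate function, so all interpreter state has to be threaded through the signature rather than living in one function's locals, and the borrow checker sees each handler in isolation.

struct State<'a> {
    insns: &'a [u8],
    pc: usize,
    acc: u32,
}

type Handler = fn(&mut State) -> bool; // returns false to halt

fn op_add(s: &mut State) -> bool { s.acc = s.acc.wrapping_add(1); s.pc += 1; true }
fn op_mul(s: &mut State) -> bool { s.acc = s.acc.wrapping_mul(3); s.pc += 1; true }
fn op_halt(_s: &mut State) -> bool { false }

const TABLE: [Handler; 3] = [op_add, op_mul, op_halt];

fn run(insns: &[u8]) -> u32 {
    let mut s = State { insns, pc: 0, acc: 0 };
    // With guaranteed tail calls each handler would end by tail-calling
    // TABLE[next] itself; this plain dispatch loop stands in for that here.
    while TABLE[s.insns[s.pc] as usize](&mut s) {}
    s.acc
}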

Uh, this logic isn't sound. Of course it is possible to convert a less flexible control flow construct (e.g. labeled blocks) into a more powerful one (static goto).

Perhaps what you meant is "most common usage patterns of static goto can be expressed in terms of existing control flow concepts"?

Regardless, the main difficulty of gotos in rust is the complexity of integrating them with the borrow checker.

Perhaps you would be interested in a special #[repr] for enums, where the discriminant is assigned based on the address of the arms of a specific match statement?

Also, eliminating a single `jmp addr[offset]` is usually not a big deal in all but the innermost of loops. Do you have benchmarks to support the idea that this is a bottleneck? Both are computed jumps; all you are eliminating is a single memory access.

I found this article about computed goto that gives me some idea of what you might mean by "having fewer branches", but it's worth noting that in Rust you can get the same advantage by just inserting unreachable_unchecked into the default branch (godbolt link).
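
Something along these lines (a minimal sketch with a made-up three-opcode encoding; not the exact code from the godbolt link):

use std::hint::unreachable_unchecked;

// SAFETY contract: `insns` must contain only opcodes 0..=2, and every
// program must reach opcode 2 (halt) before running off the end.
pub unsafe fn run(insns: &[u8]) -> u32 {
    let mut acc = 0u32;
    let mut pc = 0;
    loop {
        match insns[pc] {
            0 => { acc = acc.wrapping_add(1); pc += 1; } // add
            1 => { acc = acc.wrapping_mul(3); pc += 1; } // mul
            2 => return acc,                             // halt
            // Promises the optimizer no other opcode occurs, so the
            // jump-table dispatch needs no range check or default branch.
            _ => unsafe { unreachable_unchecked() },
        }
    }
}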

This may be true in theory, but it's important to note that basically 100% of that backend work has already been done: LLVM has full tail call support. Additionally, there's already an accepted RFC for tail calls in Rust; all that is needed is for someone to implement it in rustc.

Yeah, that's what I meant.

Actually, since the borrow checker runs on MIR, and MIR is already a list of basic blocks with branches between them (i.e. a control flow graph), I expect borrow checking to need no modification and to just work as-is.

That has been suggested before; however, it has the downside of only working when you need exactly one computed goto per enum. Good interpreters need one computed goto per match arm, at the end of each block -- so more than one computed goto per enum.

It's not so much eliminating a single memory access as having a separate indirect branch for each interpreted instruction instead of one indirect branch shared across the whole interpreter. It took me a bit of work to come up with a decent example, since Rust doesn't have computed goto: the interpreter with multiple computed gotos is about 2.9% faster than the one with a single shared goto.

Note that using the godbolt.org server to run the program doesn't give decent results, since it's extremely noisy due to being a cloud VM (±25% or more). Also note that I'm using -O with GCC 12, since using Clang and/or higher optimization levels duplicated the computed goto, making both tests effectively the same. Running it on one of my computers at home (with a Ryzen 9 3900X) gives much more predictable results. I used the medians from the last two runs, ignoring the first two because the CPU is still warming up the cache and changing clock speeds and such:

$ g++-12 -O -std=c++20 -o computed-goto computed-goto.cpp && ./computed-goto
single: 513 iterations:
    min: 0.0396113 sec -- 77215 ns/iter
    max: 0.0463892 sec -- 90427.4 ns/iter
    median: 0.042533 sec -- 82910.3 ns/iter
    mean: 0.0427916 sec -- 83414.4 ns/iter
multiple: 571 iterations:
    min: 0.0453155 sec -- 79361.7 ns/iter
    max: 0.0464836 sec -- 81407.4 ns/iter
    median: 0.0453678 sec -- 79453.2 ns/iter
    mean: 0.045558 sec -- 79786.4 ns/iter
single: 530 iterations:
    min: 0.040901 sec -- 77171.7 ns/iter
    max: 0.0470188 sec -- 88714.7 ns/iter
    median: 0.0432859 sec -- 81671.4 ns/iter
    mean: 0.0434654 sec -- 82010.2 ns/iter
multiple: 572 iterations:
    min: 0.0453794 sec -- 79334.6 ns/iter
    max: 0.0477413 sec -- 83463.8 ns/iter
    median: 0.0454158 sec -- 79398.2 ns/iter
    mean: 0.046009 sec -- 80435.4 ns/iter

The other problem with tail calls, which computed goto doesn't have, is that tail calls don't work on targets that don't support them, such as wasm without the tail-call feature. LLVM, however, can translate computed gotos back into a switch, so they work fine on wasm.

This makes sense, but it can just be handled in an optimization pass; there's no need to introduce a new language concept. Indeed, this optimization pass already exists, which is why you were unable to reproduce your example using Clang.

What's more, Clang and rustc use the same backend (LLVM), so this optimization actually already happens in Rust code! If you carefully inspect my previous godbolt example, you'll see that the basic blocks for each match arm all end the same way:

        movzx   esi, byte ptr [rdi + rcx]
        movsxd  rsi, dword ptr [rdx + 4*rsi]
        add     rsi, rdx
        jmp     rsi

This is a computed goto! One per arm, for improved branch prediction, just like you want.

If you change the opt-level to 0, you'll see that every match arm instead ends with a static jump to the same basic block.

That only works that way on some LLVM targets. E.g. on aarch64-unknown-linux-gnu it doesn't use an indirect branch, instead using a complex set of conditional branches, and it doesn't duplicate them for each match arm.

If it used computed goto instead of match then it would have indirect branches as desired.

I'm not convinced that support for computed goto is worth it at this point. But if it's done, I'd propose the following approach:

  • An enum needs to be marked with a new representation, e.g. #[repr(code_pointer)].
    • This makes the enum pointer sized, and leaves the choice of discriminants up to the compiler.
    • It could also disable `enum as usize` casts in const contexts, if the value of constants needs to be known before the address of the code block has been determined.
  • In the same crate there must be at most one match on that enum, marked with a #[computed_goto] attribute. This match can be used to decide the discriminants of the enum. The function containing it must not be generic (except lifetimes).
  • Compiler optimizations that turn a direct jump to this match into a computed goto (effectively inlining the computed goto).
    Typically this direct jump will be a continue inside a loop.

This design adds no new syntax. And it allows a compiler that doesn't support computed goto to use simple integers as discriminants and a normal match.

For example:

#[repr(code_pointer)]
#[derive(Copy, Clone, Eq, PartialEq)]
enum OpCode {
    Add,
    Sub,
    Stop,
}

fn run(program: &[OpCode]) -> u8 {
    let mut index = 0;
    let mut state = program[index];
    let mut value = 0;
    
    macro_rules! next {
        () => {{
            index += 1;
            state = program[index];
            continue;
        }}
    }
    
    loop {
        #[computed_goto]
        match state {
            OpCode::Add => {
                value += 1;

                index += 1;
                state = program[index]; // bounds checking happens here
                continue; // this turns into a computed goto
            },
            OpCode::Sub => {
                value -= 1;
                // Equivalent to what happens in Add.
                // Demonstrates you can avoid duplication in such an interpreter
                next!();
            },
            OpCode::Stop => {
                // ...
                break;
            },
        }
    }
    value
}


Yeah, that could work...

Compilers (Clang in particular) already do that for computed goto in C anyway when targeting wasm; they just do the conversion after running many optimizations that treat it as a computed goto.


Reminder that new syntax is really not a problem. It's trivial compared to all the semantic things, like how doing this means changing the initialization checking, drop checking, etc., to account for people being able to jump into weird places.

The proposal you're replying to doesn't add anything new to the borrow checker, since it's based on match.

My proposal is designed to minimize the impact on the language, and affects none of the things you mention.

Probably the biggest impact is that it's not possible to get the integer discriminant of #[repr(code_pointer)] enums until an address has been chosen (I assume by the linker). This is quite similar to how pointers to statics are already handled.

const { my_enum as usize } will fail with an error like "error: pointers cannot be cast to integers during const eval".
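
A tiny illustration of mine, using the behavior statics already have today:

static X: u8 = 0;

// The address of `X` is only decided at link time, so const eval cannot
// produce a concrete integer here; rustc rejects this with an error along
// the lines of the one quoted above.
const P: usize = &X as *const u8 as usize;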


However, it could affect parts of the compiler that assume that the discriminant value is already available (e.g. match).

It's also necessary to ensure that each enum value gets a distinct discriminant, so unifying match cases with identical code into a single goto target needs to be prevented somehow.
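
To make that concrete, a hypothetical example (the names are mine): two variants whose match arms have identical code are exactly what a deduplication pass would want to merge.

#[derive(Copy, Clone, PartialEq, Eq)]
enum Op {
    Nop1,
    Nop2,
    Halt,
}

fn step(op: Op) -> bool {
    match op {
        // Identical bodies: a tail-merging / block-deduplication pass could
        // fold these two arms into one block. Under a code_pointer-style
        // repr, Nop1 and Nop2 would then share an address, i.e. a
        // discriminant, silently making Op::Nop1 == Op::Nop2.
        Op::Nop1 => true,
        Op::Nop2 => true,
        Op::Halt => false,
    }
}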

LLVM seems to have unnamed_addr/local_unnamed_addr at the function/static level, but I assume it implicitly treats labels inside a function as unnamed.

It might also require the targets of computed gotos to be unnamed. So LLVM might not actually support enum-based computed goto (which would affect both @programmerjake's proposal and mine).

Related issues:


You are concerned about targeting architectures that don't support computed goto. What I mean is that neither the compiler frontend nor the backend needs to implement support for computed goto to compile the program; they just won't be able to take advantage of the computed-goto performance improvement.

Can't one "just" point one target at a nop sequence placed ahead of the other, so the two labels still get distinct addresses?

tgt1:
    nop
tgt2_same_code:
    non_nop_sequence