Idea for safe computed goto using enums

I noticed LLVM lets you use blockaddress outside of the function defining the labels (I think it requires being in the same codegen unit as the function containing the labels, though), so I think it would be useful for interpreters and other uses of computed goto to be able to express that like so:

// must not be visible outside `crate`
#[repr(label(interpreter))] // discriminants are label addresses in `fn interpreter`
pub(crate) enum Insn {
    A,
    B,
    C,
}

pub(crate) fn build_insns() -> Vec<Insn> {
    vec![Insn::A, Insn::B, Insn::A, Insn::C]
}

pub(crate) fn interpreter(insns: &[Insn]) {
    let mut pc = 0;
    k#goto insns[pc];
    #[defines(Insn::A)]
    {
        // like the else block of a let-else, each of these blocks must diverge
        println!("A");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::B)]
    {
        println!("B");
        pc += 1;
        k#goto insns[pc];
    }
    #[defines(Insn::C)]
    {
        println!("C");
        return;
    }
}

If we can get LLVM to allow making aliases to blockaddresses (which seems likely to be easy to implement, since a blockaddress is just another assembly-level label), then it should be possible to make the label enum type public. That would greatly simplify the implementation in rustc: you could just emit an alias for each label (making the address accessible in other codegen units and across crates) and use those aliases whenever you need a variant's value. It would also make inlining functions that use the enum much easier, since you could refer to the enum values anywhere, rather than only in the same codegen unit as the function defining the labels.

It seems like a pretty big ask to add any form of computed goto to a language that lacks even static goto.

That's because most existing control flow structures can usually be converted efficiently into equivalent static gotos, so static goto isn't as necessary. Computed goto doesn't really have any equally efficient equivalent -- the two closest are:

  • match -- this can be less efficient in some cases, since it has to go through a translation table to get block addresses, and it has a hard time expressing a separate computed goto at the end of every block, which more efficient interpreters want (for the branch-prediction advantages and for having fewer branches per interpreted instruction).
  • tail calls -- these are much more complex to implement at the compiler level, since they need new ABIs and backend support; they are more limited in the number of values that can be kept in registers or on the stack; and borrow checking is more limited because each block becomes a separate function (see the sketch below).
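
As a rough illustration of that last point (the opcode set and all names here are invented for the example), the tail-call style forces the interpreter into roughly this shape: every handler is a separate function, so all interpreter state has to be threaded through the signature rather than living in one function's locals, and the borrow checker sees each handler in isolation.

struct State<'a> {
    insns: &'a [u8],
    pc: usize,
    acc: u32,
}

type Handler = fn(&mut State) -> bool; // returns false to halt

fn op_add(s: &mut State) -> bool { s.acc = s.acc.wrapping_add(1); s.pc += 1; true }
fn op_mul(s: &mut State) -> bool { s.acc = s.acc.wrapping_mul(3); s.pc += 1; true }
fn op_halt(_s: &mut State) -> bool { false }

const TABLE: [Handler; 3] = [op_add, op_mul, op_halt];

fn run(insns: &[u8]) -> u32 {
    let mut s = State { insns, pc: 0, acc: 0 };
    // With guaranteed tail calls each handler would end by tail-calling
    // TABLE[next] itself; this plain dispatch loop stands in for that here.
    while TABLE[s.insns[s.pc] as usize](&mut s) {}
    s.acc
}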

Uh, this logic isn't sound. Of course it is possible to convert a less flexible control flow construct (e.g. labeled blocks) into a more powerful one (static goto).

Perhaps what you meant is "most common usage patterns of static goto can be expressed in terms of existing control flow concepts"?

Regardless, the main difficulty of gotos in rust is the complexity of integrating them with the borrow checker.

Perhaps you would be interested in a special #[repr] for enums, where the discriminant is assigned based on the address of the arms of a specific match statement?

Also, eliminating a single `jmp addr[offset]` is usually not a big deal in all but the innermost of loops. Do you have benchmarks to support the idea that this is a bottleneck? Both are computed jumps; all you are eliminating is a single memory access.

I found this article about computed goto that gives me some idea of what you might mean by "having fewer branches", but it's worth noting that in Rust you can get the same advantage by just inserting unreachable_unchecked into the default branch (godbolt link).
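
Something along these lines (a minimal sketch with a made-up three-opcode encoding; not the exact code from the godbolt link):

use std::hint::unreachable_unchecked;

// SAFETY contract: `insns` must contain only opcodes 0..=2, and every
// program must reach opcode 2 (halt) before running off the end.
pub unsafe fn run(insns: &[u8]) -> u32 {
    let mut acc = 0u32;
    let mut pc = 0;
    loop {
        match insns[pc] {
            0 => { acc = acc.wrapping_add(1); pc += 1; } // add
            1 => { acc = acc.wrapping_mul(3); pc += 1; } // mul
            2 => return acc,                             // halt
            // Promises the optimizer no other opcode occurs, so the
            // jump-table dispatch needs no range check or default branch.
            _ => unsafe { unreachable_unchecked() },
        }
    }
}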

This may be true in theory, but it's important to note that basically 100% of that backend work has already been done: LLVM has full tail call support. Additionally, there's already an accepted RFC for tail calls in Rust; all that is needed is for someone to implement it in rustc.

Yeah, that's what I meant.

Actually, since the borrow checker runs on MIR, and MIR is already a list of basic blocks with branches between them (i.e. a control flow graph), I expect borrow checking to need no modification and to just work as-is.

That has been suggested before; however, it has the downside of only working when you need exactly one computed goto per enum. Good interpreters need one computed goto per match arm, at the end of each block -- so more than one computed goto per enum.

It's not so much eliminating a single memory access as having a separate indirect branch for each interpreted instruction instead of one indirect branch shared across the whole interpreter. It took me a bit of work to come up with a decent example, since Rust doesn't have computed goto: the interpreter with multiple computed gotos is about 2.9% faster than the one with a single shared goto.

Note that using the godbolt.org server to run the program doesn't give decent results, since it's extremely noisy due to being a cloud VM (±25% or more). Also note that I'm using -O with GCC 12, since using Clang and/or higher optimization levels duplicated the computed goto, making both tests effectively the same. Running it on one of my computers at home (with a Ryzen 9 3900X) gives much more predictable results. I used the medians from the last two runs, ignoring the first two because the CPU is still warming up the cache and changing clock speeds and such:

$ g++-12 -O -std=c++20 -o computed-goto computed-goto.cpp && ./computed-goto
single: 513 iterations:
    min: 0.0396113 sec -- 77215 ns/iter
    max: 0.0463892 sec -- 90427.4 ns/iter
    median: 0.042533 sec -- 82910.3 ns/iter
    mean: 0.0427916 sec -- 83414.4 ns/iter
multiple: 571 iterations:
    min: 0.0453155 sec -- 79361.7 ns/iter
    max: 0.0464836 sec -- 81407.4 ns/iter
    median: 0.0453678 sec -- 79453.2 ns/iter
    mean: 0.045558 sec -- 79786.4 ns/iter
single: 530 iterations:
    min: 0.040901 sec -- 77171.7 ns/iter
    max: 0.0470188 sec -- 88714.7 ns/iter
    median: 0.0432859 sec -- 81671.4 ns/iter
    mean: 0.0434654 sec -- 82010.2 ns/iter
multiple: 572 iterations:
    min: 0.0453794 sec -- 79334.6 ns/iter
    max: 0.0477413 sec -- 83463.8 ns/iter
    median: 0.0454158 sec -- 79398.2 ns/iter
    mean: 0.046009 sec -- 80435.4 ns/iter

The other problem with tail calls, which computed goto doesn't have, is that tail calls don't work on targets that don't support them, such as wasm without the tail-call feature. LLVM, however, can translate computed gotos back into a switch, so they work fine on wasm.

This makes sense, but it can just be handled in an optimization pass; there's no need to introduce a new language concept. Indeed, this optimization pass already exists, which is why you were unable to reproduce your example using Clang.

What's more, Clang and rustc use the same backend (LLVM), so this optimization actually already happens in Rust code! If you carefully inspect my previous godbolt example, you'll see that the basic blocks for each match arm all end the same way:

        movzx   esi, byte ptr [rdi + rcx]
        movsxd  rsi, dword ptr [rdx + 4*rsi]
        add     rsi, rdx
        jmp     rsi

This is a computed goto! One per arm, for improved branch prediction, just like you want.

If you change the opt-level to 0, you'll see that every match arm instead ends with a static jump to the same basic block.

That only works that way on some LLVM targets. E.g. on aarch64-unknown-linux-gnu it doesn't use an indirect branch, instead using a complex set of conditional branches, and it doesn't duplicate them for each match arm.

If it used computed goto instead of match then it would have indirect branches as desired.

I'm not convinced that support for computed goto is worth it at this point. But if it's done, I'd propose the following approach:

  • An enum needs to be marked with a new representation, e.g. #[repr(code_pointer)].
    • This makes the enum pointer sized, and leaves the choice of discriminants up to the compiler.
    • It could also disable `enum as usize` casts in const contexts, if the value of constants needs to be known before the address of the code block has been determined.
  • In the same crate there must be at most one match on that enum, marked with a #[computed_goto] attribute. This match can be used to decide the discriminants of the enum. The function containing it must not be generic (except lifetimes).
  • Compiler optimizations that turn a direct jump to this match into a computed goto (effectively inlining the computed goto).
    Typically this direct jump will be a continue inside a loop.

This design adds no new syntax. And it allows a compiler that doesn't support computed goto to use simple integers as discriminants and a normal match.

For example:

#[repr(code_pointer)]
#[derive(Copy, Clone, Eq, PartialEq)]
enum OpCode {
    Add,
    Sub,
    Stop,
}

fn run(program: &[OpCode]) -> u8 {
    let mut index = 0;
    let mut state = program[index];
    let mut value = 0;
    
    macro_rules! next {
        () => {{
            index += 1;
            state = program[index];
            continue;
        }}
    }
    
    loop {
        #[computed_goto]
        match state {
            OpCode::Add => {
                value += 1;

                index += 1;
                state = program[index]; // bounds checking happens here
                continue; // this turns into a computed goto
            },
            OpCode::Sub => {
                value -= 1;
                // Equivalent to what happens in Add.
                // Demonstrates you can avoid duplication in such an interpreter
                next!();
            },
            OpCode::Stop => {
                // ...
                break;
            },
        }
    }
    value
}


Yeah, that could work...

Compilers (Clang in particular) already do that for computed goto in C anyway when targeting wasm; they just do the conversion after running many optimizations that treat it as a computed goto.


Reminder that new syntax is really not a problem. It's trivial compared to all the semantic things, like how doing this means changing the initialization checking, drop checking, etc., to account for people being able to jump into weird places.

The proposal you're replying to doesn't add anything new to the borrow checker, since it's based on match.

My proposal is designed to minimize the impact on the language, and affects none of the things you mention.

Probably the biggest impact is that it's not possible to get the integer discriminant of #[repr(code_pointer)] enums until an address has been chosen (I assume by the linker). This is quite similar to how pointers to statics are already handled.

const { my_enum as usize } will fail with an error like "error: pointers cannot be cast to integers during const eval".
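
A tiny illustration of mine, using the behavior statics already have today:

static X: u8 = 0;

// The address of `X` is only decided at link time, so const eval cannot
// produce a concrete integer here; rustc rejects this with an error along
// the lines of the one quoted above.
const P: usize = &X as *const u8 as usize;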


However, it could affect parts of the compiler that assume that the discriminant value is already available (e.g. match).

It's also necessary to ensure that each enum value gets a distinct discriminant, so unifying match cases with identical code into a single goto target needs to be prevented somehow.
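
To make that concrete, a hypothetical example (the names are mine): two variants whose match arms have identical code are exactly what a deduplication pass would want to merge.

#[derive(Copy, Clone, PartialEq, Eq)]
enum Op {
    Nop1,
    Nop2,
    Halt,
}

fn step(op: Op) -> bool {
    match op {
        // Identical bodies: a tail-merging / block-deduplication pass could
        // fold these two arms into one block. Under a code_pointer-style
        // repr, Nop1 and Nop2 would then share an address, i.e. a
        // discriminant, silently making Op::Nop1 == Op::Nop2.
        Op::Nop1 => true,
        Op::Nop2 => true,
        Op::Halt => false,
    }
}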

LLVM seems to have unnamed_addr/local_unnamed_addr at the function/static level, but I assume it implicitly treats labels inside a function as unnamed.

It might also require the targets of computed gotos to be unnamed. So LLVM might not actually support enum-based computed goto (which would affect both @programmerjake's proposal and mine).

Related issues:


You are concerned about targeting architectures that don't support computed goto. What I mean is that neither the compiler frontend nor the backend needs to implement support for computed goto to compile the program; they just won't be able to take advantage of the computed-goto performance improvement.

Can't one "just" point one target at a nop sequence placed ahead of the other, so the two labels still get distinct addresses?

tgt1:
    nop
tgt2_same_code:
    non_nop_sequence