Pre-RFC: Add Labeled Switches to Rust

Zig 0.14.0 was recently released, and one of its headline features is the labeled switch. This is a new construct that lets Zig express computed gotos while still having structured control flow.

foo: switch (@as(u8, 1)) {
        1 => continue :foo 2,
        2 => continue :foo 3,
        3 => return,
        4 => {},
        else => unreachable,
}

The basic idea is a new syntax for jumping directly to another switch case. An equivalent in Rust could look like this (the exact syntax doesn't matter to me).

'foo: match (1u8) {
   1 => continue 'foo 2,
   2 => continue 'foo 3,
   3 => return,
   _ => {},
}

What makes this so powerful is that the value being switched on can be computed at runtime, and each arm can end in its own dispatch branch, leading to much better branch predictor performance. This is explained in greater detail in the codegen section.

This would be very useful for implementing bytecode interpreters, FSMs, and parsers. When the Zig project switched its own parser to this method, they measured a 13% speedup.

This would also enable Rust to achieve some of the performance Python saw in their new tail-call interpreter without requiring guaranteed tail calls and specialized calling conventions. Currently, that level of performance can only be achieved in C (and potentially now Zig).

There is a GitHub issue that discusses the design and reasoning behind this in more detail.

Would this be a reasonable thing to add to Rust? From the borrow checker's perspective, this doesn't seem any more complicated than a match statement in a loop.

4 Likes

Since you can build this trivially out of a loop:

let mut subject = 1u8;
loop {
  match subject {
    1 => subject = 2,
    2 => subject = 3,
    3 => return,
    _ => break,
  }
}

It’s not perfect if any of the arms already contain breaks or continues, but we’re talking about a small fraction of cases in an already niche pattern.

So if we think this is important, the easiest way to make it happen is probably for someone to make a macro for it, and someone else to go see what it would take for the optimizer to produce the same code as the Zig situation.

6 Likes

Sounds like you're looking for the conversation in RFC: Improved State Machine Codegen by folkertdev · Pull Request #3720 · rust-lang/rfcs · GitHub

12 Likes

Thanks. This is just what I was looking for.

If I recall correctly, actually optimizing such a switch down to a computed goto unfortunately relies on the fact that Zig's unreachable is really unreachable_unchecked(); in all but the simplest cases LLVM needs that to simplify. And testing confirms it: the version with a _ => break compiles to a loop head with a compare and branch table, whereas the _ => unreachable_unchecked() version directly does a computed goto. At the least, each computed goto will need a "bounds check" in order to implement a fallback case.
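To make the two variants concrete, here is a minimal sketch (my own example, not from the thread) of the difference. The `_ => break` fallback forces a range check before dispatch, while `unreachable_unchecked()` promises LLVM that no other value can occur:

```rust
use std::hint::unreachable_unchecked;

// Safe version: the wildcard arm acts as a fallback, so the compiled
// dispatch must bounds-check the state value before jumping.
fn run_checked(mut state: u8) -> u8 {
    loop {
        match state {
            1 => state = 2,
            2 => state = 3,
            3 => return state,
            _ => break,
        }
    }
    0
}

// Unchecked version: undefined behavior if `state` is ever outside
// 1..=3, but that promise of unreachability is what lets LLVM drop
// the range check and emit a direct computed goto.
unsafe fn run_unchecked(mut state: u8) -> u8 {
    loop {
        match state {
            1 => state = 2,
            2 => state = 3,
            3 => return state,
            _ => unreachable_unchecked(),
        }
    }
}
```

Both functions walk the same tiny state machine; only the codegen of the dispatch differs.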

Now, it could be possible for an MIR transform to simplify a loop-and-match into a more direct CFG representation. The viability of such and whether it would actually help, though, I have no clue.

5 Likes

And in a declarative macro, since I kinda nerd sniped myself:

macro_rules! fsm {
    {
        'start => $start:expr,
        $($($case:pat_param)|+ => $arm:block $(,)?)*
    } => {{
        let mut switch = $start;
        'switch: loop {
            // `goto!` sets the new state and restarts the dispatch loop.
            macro_rules! goto {($arg:expr) => {{
                switch = $arg;
                continue 'switch;
            }}}
            match switch {
                // `let false = true else $arm` never matches, so `$arm`
                // always runs and is required to diverge (goto!/break/return).
                $($($case => { let false = true else $arm; unreachable!() },)+)*
            }
            }
        }
    }};
}

pub fn ex0(switch: u8) {
    fsm! {
        'start => switch,
        1 => { goto!(case1()) },
        2 => { goto!(case2()) },
        3 => { goto!(case3()) },
        4 => { goto!(case4()) },
        _ => { break },
    }
}

This should somewhat consistently compile to using a computed goto, as long as the scrutinee range is contiguous and each pattern only matches a single scrutinee value. (Duplicating arms for top-level or-patterns is handled by the macro.)

1 Like

That macro doesn't reliably work. Rewriting the suboptimal codegen example from https://github.com/folkertdev/rust-rfcs/blob/labeled-match/text/3720-labeled-match.md#doesnt-llvm-optimize-this-already using your macro, I get this Compiler Explorer output, which doesn't do direct jumps.

But this wouldn’t be necessary if you used an exhaustive enum instead of matching on a u8, correct? You could get the optimization without unsafe.
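For the record, an unsafe-free enum version of the earlier loop-and-match might look like this (a sketch with made-up state names):

```rust
#[derive(Clone, Copy)]
enum State {
    One,
    Two,
    Three,
    Done,
}

// Because the match is exhaustive over the enum, the compiler already
// knows no other discriminant can occur; no wildcard fallback or
// `unreachable_unchecked()` is required.
fn run(mut state: State) -> u32 {
    let mut steps = 0;
    loop {
        match state {
            State::One => { steps += 1; state = State::Two }
            State::Two => { steps += 1; state = State::Three }
            State::Three => { steps += 1; state = State::Done }
            State::Done => return steps,
        }
    }
}
```

Whether LLVM actually turns this into direct jumps still depends on the shape of the loop, as the thread discusses.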

2 Likes

Adding an enum variant to @CAD97’s playground results in

ex3:
	pushq	%rbx
	movzbl	%dil, %eax
	decl	%eax
	leaq	.LJTI2_0(%rip), %rbx
	movslq	(%rbx,%rax,4), %rax
	addq	%rbx, %rax
	jmpq	*%rax
…

so it does work. It has a few more unsafe blocks for transmuting caseN to the new enum, but that's just me being lazy.

2 Likes

I quite like the OP's proposal: it avoids the issue with the RFC's loop match proposal, which tries to do too many things at once (state + loop + match).

Question: would this proposal still be useful with the additional restriction that we can only jump to a branch below the current one?

That would force an acyclic directed graph within the match, which you can already express with existing constructs (e.g. labelled break from a block, or loop+break+continue). What's missing is the ability to express cyclic ones too without having to go through a common merge point (like what happens with a loop).
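As a concrete illustration of the acyclic case (my own example): forward-only jumps are already expressible on stable Rust with labeled blocks, since `break 'label` can only jump forward out of a block it is inside of:

```rust
// Forward jumps via labeled blocks (stable since Rust 1.65).
// Each `break` targets an enclosing block, so the resulting
// control-flow graph is necessarily acyclic.
fn classify(x: u8) -> &'static str {
    'done: {
        'small: {
            if x < 10 {
                break 'small; // jump to just after the 'small block
            }
            if x < 100 {
                break 'done; // jump to just after the 'done block
            }
            return "large";
        }
        return "small";
    }
    "medium"
}
```

Cyclic jump graphs are exactly the part this pattern cannot express without routing everything through a loop's merge point.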

1 Like

The issue is that allowing backward jumps here would make this basically a general labeled goto feature. If that's what we wanted I don't think using match to express this is the most natural: may as well use block labels, no?

In concrete cases of needing direct jumps for performance, does the graph of direct jumps actually sometimes need to be cyclic? Is putting the match in a loop not enough?

For zlib-rs the performance benefit of the loop match proposal over a regular match inside a loop is well over 20% in some cases.[1] And for re2rust-compiled regexes the difference can be well over 2x.[2] And maybe it turns out the graph of direct jumps doesn't need to be cyclic after all, but then the question becomes: what are the best edges to cut? Having to rewrite your entire function every time you want to cut at a different location quickly becomes infeasible. And it may well be that on a different CPU, cutting at a different location is faster.


  1. For decompressing in chunks of 2^4 bytes match inside a loop is 10% slower than zlib-ng while with my implementation of the loop match proposal it is 8% faster than zlib-ng. For comparison -Cllvm-args=-enable-dfa-jump-thread only makes it 2% faster than zlib-ng and cargo doesn't have any way to force zlib-rs to be compiled with that flag, so most users will not set it. ↩︎

  2. ./autolink_email_loop_match ran
       1.42 ± 0.03 times faster than ./autolink_email_llvm_dfa
       2.74 ± 0.07 times faster than ./autolink_email
    

    comrak/autolink_email.rs at loop_match_attr · bjorn3/comrak · GitHub ↩︎

4 Likes

Ok, sounds like it's pretty important then :smiley:

1 Like

Yes. The usual example is a bytecode interpreter,[1] which is naturally written like

for op in opcode_array {
    match op {
       Op::ADD => ...,
       Op::CALL => ...,
       ...
    }
}

but if you translate this straight to machine code without any cleverness you get something like

    lea     opcode_dispatch_table(%rip), %rbx
forloop:
    movzbl  (%rsi), %eax
    inc     %rsi
    jmp     *(%rbx, %rax, 8)
op_add:
    ...
    jmp     forloop
op_call:
    ...
    jmp     forloop
...

and the indirect jump is totally unpredictable so you get a massive pipeline stall after every bytecode instruction. It's much better if you can somehow get machine code that duplicates the load and dispatch sequence at the tail of every opcode implementation,

begin_interpreting:
    lea     opcode_dispatch_table(%rip), %rbx
    movzbl  (%rsi), %eax
    inc     %rsi
    jmp     *(%rbx, %rax, 8)

op_add:
    ...
    movzbl  (%rsi), %eax
    inc     %rsi
    jmp     *(%rbx, %rax, 8)

op_call:
    ...
    movzbl  (%rsi), %eax
    inc     %rsi
    jmp     *(%rbx, %rax, 8)

...

because then the branch predictor has a chance to learn common sequences of bytecode instructions.[2]

But you can see that the control flow graph of the second version of the machine code is not just irreducible but fully connected: any opcode could, in principle, follow any other opcode.
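For reference, here is a runnable miniature of the naive loop-and-match interpreter shape sketched above; the opcode set and stack machine are invented for illustration:

```rust
// A tiny stack-machine bytecode; the opcodes here are made up.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Mul,
    Halt,
}

fn interpret(program: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for &op in program {
        // This single match compiles to the one shared indirect branch
        // that the branch predictor struggles with; tail-duplicating the
        // dispatch into every arm is the transformation discussed above.
        match op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a * b);
            }
            Op::Halt => break,
        }
    }
    stack.pop().unwrap_or(0)
}
```

Nothing in the source expresses the per-opcode dispatch duplication, which is why it's left to the optimizer's discretion.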

Coincidentally, I just saw an article go by about the authors of a real production-grade interpreter (CPython) struggling to get the machine code they want out of their main interpreter loop. One striking fact is that current-generation compilers might generate the "tail duplicated" version of the machine code from the straightforward loop-and-match (loop-and-switch, in C) source code. But "might" isn't good enough when a 15-20% performance delta is on the table.


  1. Both zlib-rs and re2rust can be thought of as special-purpose bytecode interpreters. ↩︎

  2. Branch prediction costs are also why even the simplest possible "copy and patch" compiler can give another 2-10% speedup over a tail-duplicated bytecode interpreter loop: it completely eliminates the troublesome indirect branch after every bytecode instruction. ↩︎

5 Likes

https://crates.io/crates/fallthrough

You don't need a language feature for that!

2 Likes

TBH I still think we should start with something simpler, a dedicated FSM construct with only static jumps.

Then inside that, if you want to match an enum and write 3 => continue 'foo or Some(_) => continue 'bar or whatever, fine. (Where 'foo and 'bar are named FSM states.) LLVM should have no problem translating that to a jump table if needed, but it avoids a bunch of Rust-side weirdness.

4 Likes