Pre-ACP: floor halve with remainder

Many processors (including many widely used ones) support an operation that divides a number by 2, rounding towards negative infinity, and also returns whether it was odd or even. In Rust, this would look something like this (implemented as a method on each integer type):

fn floor_halve_with_remainder(self) -> (Self, bool) {
    (self.div_euclid(2), (self & 1) == 1)
}

(Although I'm not planning to use this on negative numbers, the floor-halving behaviour, rather than rounding towards zero, is important to avoid needing a special case for them that would lose performance.)
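
For illustration, here's a standalone sketch of the intended behaviour for one concrete type (the function name is just for this example; the proposal itself is the method form above, on every integer type):

fn floor_halve_with_remainder_i32(x: i32) -> (i32, bool) {
    // div_euclid(2) rounds towards negative infinity, so -3 halves to -2
    // with remainder 1, rather than truncating towards zero to give -1.
    (x.div_euclid(2), (x & 1) == 1)
}

fn main() {
    assert_eq!(floor_halve_with_remainder_i32(7), (3, true));
    assert_eq!(floor_halve_with_remainder_i32(6), (3, false));
    assert_eq!(floor_halve_with_remainder_i32(-3), (-2, true));
}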

Although this could be added as a user-facing function and might be useful on occasion, I'm primarily hoping for it to be added as an intrinsic that the code-generation backends can special-case, so that it could be used internally in the compiler as a building block for lowering.

Motivation

The motivation for this comes from looking at possible enum niche optimizations. In current Rust, an enum like this is as large as 3 pointers:

enum Example {
    One(&'static u16, &'static u16),
    Two(&'static u16),
    Three(&'static u16),
    Four(&'static u16),
    Five(&'static u16),
}
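
(As a quick sanity check of that size claim – hedged, since enum layout is unspecified and this only reflects current rustc on typical 64-bit targets:)

const _: () = assert!(
    std::mem::size_of::<Example>() == 3 * std::mem::size_of::<usize>()
);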

Clearly this could fit into a smaller space by exploiting the fact that references to u16 are always stored as even numbers in memory. However, most approaches I've seen for that lose performance due to bad code generation for match statements: although the memory usage can be lower (depending on whether or not the memory allocator needs to round up the size), the code generated when using the enum is bad enough that the code ends up slower regardless.

However, if you have an efficient floor-halve-with-remainder operation, it leads to an efficient encoding of this sort of enum: the large case (two references) is stored as-is, and the small cases are stored using consecutive odd numbers in place of the pointer (in this case 1, 3, 5, and 7). To decode the enum, you do a floor-halve-with-remainder on the word that holds either the pointer or the discriminant: if there's no remainder you have the large option; if there is a remainder you have a small option, and the result of the halving is a number in a consecutive small range, so you can index directly into a jump table with it.
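
As a rough sketch of that decoding step (the names and the exact tag-to-variant mapping here are hypothetical – the real mapping would be up to the compiler):

// `word` holds either a genuine (even) pointer for the large variant,
// or one of the odd tags 1/3/5/7 for the four small variants.
fn decode_variant(word: usize) -> u32 {
    let (half, odd) = (word >> 1, (word & 1) == 1);
    if !odd {
        0 // large variant: `word` is a real pointer
    } else {
        // Tags 1, 3, 5 and 7 halve to 0, 1, 2 and 3: a consecutive range
        // that can index directly into a jump table.
        half as u32 + 1
    }
}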

If the pointer you're using the niches of is larger than u16, there are alternative possible arrangements – but this one is still extremely efficient given the intrinsic, and (marginally) beats more obvious arrangements like using 0/1/2/3 or 1/2/3/4 as the value to put in the niches of the small options (it also has the advantage of mixing well with the existing niche optimization, meaning you can put an Option inside or outside without losing performance).

Questions

Does something like this exist in LLVM already, so that it can be lowered to, or would it need to be added to LLVM first? (And is there an efficient way to do this in Rust already? The implementation I gave above calculates the two parts of the result separately, even when you feed the resulting boolean into an if statement, rather than using the processor instruction directly.)

Also, is an ACP the appropriate tool for adding something like this? Requests for new compiler intrinsics are probably quite rare, so I'm not sure what the appropriate process for them is.

As long as we're talking unsigned, I would assume that LLVM recognizes the idiom

let (quot, rem) = (a / 2, a % 2);

like it does for arbitrary divisors, and translates it optimally on targets that have something more efficient than (a >> 1, a & 1). On x86 the lowest bit (i.e. the remainder) is shifted into CF, but LLVM does translate the code as separate SHR and AND. Not sure if that's the most efficient code possible on x86, but at least those two can trivially be executed simultaneously. And I wouldn't be surprised if the processor recognized the idiom and took the bit from CF if that's more efficient.

These compile to identical code on x86-64:

pub fn qr1(v: usize) -> (usize, bool) {
    (v / 2, v % 2 != 0)
}

pub fn qr2(v: usize) -> (usize, bool) {
    (v.div_euclid(2), (v & 1) != 0)
}

namely

    mov   %rdi, %rax
    shr   %rax
    and   $1, %dil
    mov   %edi, %edx
    ret

I'm fairly sure that this is not what @ais523 was thinking of; they were most likely thinking of something like this:

    mov    %rdi, %rax
    xor    %edx, %edx  # also clears the carry flag
    rcr    %rax
    setc   %dl
    ret

However, I could easily believe that this is actually slower than what LLVM currently produces, as RCR is a rarely-used instruction that may not be implemented as efficiently as SHR and AND. Also note that if you don't clear the carry flag before the RCR then the high bit of %rax will end up being whatever random value it used to have, which is a detail that might be really obnoxious to make LLVM handle efficiently (I know nothing whatsoever about its x86 back end).

In my tests, when you plug this into an if statement, LLVM compiles it for x86-64 as if a & 1 == 0 { … }, calculating a >> 1 separately after the branch – this agrees with your observations. However, that only allows the two to be calculated simultaneously if the branch is correctly predicted: with a mispredicted branch, one of the calculations happens after the processor has rewound for the misprediction.
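
For concreteness, the kind of pattern being measured is a sketch like this (the helper functions are just illustrative, to force a real branch):

// Per the discussion above, LLVM branches on a low-bit test (test + je)
// and computes a >> 1 separately on the path that needs it, rather than
// branching on shr's carry-flag output.
#[inline(never)]
fn odd_case(half: usize) -> usize { half + 1 }

#[inline(never)]
fn even_case(half: usize) -> usize { half }

pub fn example(a: usize) -> usize {
    let (half, odd) = (a >> 1, (a & 1) == 1);
    if odd { odd_case(half) } else { even_case(half) }
}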

That said, although the list of idioms that are recognised by the processors is known (and test $1 doesn't combine with shr), most x86-64 processors are capable of combining test with je (but not shr with jc). So LLVM's current implementation isn't horrible: it has one instruction too many, but most of the costs of that are subsequently undone by a processor optimisation. Thinking about it, on Intel processors, LLVM's version would be better in code that was limited by latency/dependency chains rather than by throughput/execution speed or instruction decoding, because the processor can forward information faster from a bit-test to a jump than from a shift to a jump, so you would have lower latency even though the throughput is worse.

I wonder if this means that an odd-discriminants enum representation would be a substantial gain even with current LLVM? Thinking about the sort of code that often uses enums, latency is often going to be the bottleneck in enum-heavy code.

No, it's much simpler:

mov %rdi, %rax
shr %rax
setc %dl

(assuming that the setc becomes a jc in context so that you don't pay the cost of merging %dl with the rest of %rdx – likewise the mov would be unnecessary when this is inlined into a larger function).

I did not know SHR sent the (last) bit shifted out to the carry flag! I would have assumed that that only happened for left shifts. Not needing RCR does make it a lot more tempting of an optimization.

I would be very surprised if we bothered doing this. This is the kind of thing where the backend should just be seeing that you're using both and make it work -- just like there's no "div and rem at once" operation in LLVM, since it just sees that you're doing both and arranges things properly.

Do you know any backend that needs and offers a separate intrinsic for this?

I was mostly thinking about x86-64 – LLVM currently doesn't make use of the hardware support, even though every processor I've ever learned the assembly for in detail has a remainder flag for its shift-right-by-1 instruction (there are probably some RISC processors that don't). The discussion in this thread, however, has made me think that the current code generation is probably fine in the scenario where I wanted it; it's more verbose than it could be, but makes some plausible tradeoffs (it's slower if the program is bottlenecked on frontend but faster if it's bottlenecked on latency – I actually verified this using uICA). So I probably won't progress the ACP further unless it turns out that there's a widely used platform where the code generation is even worse.

If your case needs the remainder as an integer in a GPR, the independent shr and and should indeed save on latency. The codegen could still be improved in situations where the carry flag could be utilized directly. For example:

y = x.carrying_add(y >> 1, (y & 1) != 0).0;
// or equivalently:
y = x.wrapping_add(y.div_ceil(2));

Either of which generates

        mov     rcx, rdx
        shr     rcx
        and     edx, 1
        add     rdx, rax
        add     rdx, rcx

... which would be strictly better as:

        shr     rdx
        adc     rdx, rax

Anyways, I agree with scottmcm: the backend should be able to optimize this, so I don't see an intrinsic being necessary for this.

Maybe you could just make benchmarks that prove the alternative code generation is better, and open an issue for LLVM? I don't see why Rust has anything to do with it – it's just LLVM's optimization.

Isn't this just borrowing_shr but without the explicit borrow parameter?