Adding access to the carry flag bit

I've often lamented that most compilers do not expose the carry bit, detecting overflow/underflow and rotate through carry instructions...

I think that the performance of numerous programs can be significantly enhanced by the addition of instructions that operate on the carry flag.

The tricky bit...

The actual condition is easy, but for anyone not familiar with most cpu implementations, the carry flag is a flag set only by various instructions, usually, add, sub, cmp, rcr/rcl (rotate right/left through carry) to indicate numerical overflow (carry), or subtractive borrowing.

I don't know how difficult it would be add the compiler logic to make sure nothing clobbered the carry flag from the point that it's set to the point that it's used in a conditional.

For example:

z = x + y; if (y < z) and carry_set() then ... // The y < z usually used a cmp that would clobber the carry bit

It may be that we'd want to create a few special instructions that would set and tets the carry bit atomically.

if (rcl_set(x, 3)) then ... if (add_set(x, y)) then...

It's also powerful to use the carry bit to transfer bits from one variable to another: y = y << 1; rcl_carry(x,1); rcr_carry(y,1); // Moves the top bit (sign bit?) from x to y

To high level programmer, access the carry bit can seem excessively trivial... but with a little practice, it is amazing the extra power and speed available.

A lot more information about c languages that support intrinsic rotate instructions, etc here:

Thanks

Add-with-carry is exposed via std::arch::x86::addcarry_uN, so it is usable. If you use that in a way compatible with keeping the info in the carry flag, the compiler should be able to do so.

1 Like

The standard library already includes functions like overflow_add for integer types which return the wrapped result and a boolean representing if an overflow occurred or not. playground example. Accessing the carry bit directly is probably not possible without using asm intrinsic as certain architectures like RISC 5 don't have a carry bit.

9 Likes

How do you cope with architectures such as RISC-V that don't materialize a carry bit?

My first inclination was to bag the whole idea because the RISC V doesn't support the carry bit...

But, upon further reflection, I still think it would be a very worthwhile feature. The carry bit and carry bit functions can be programmed separately for the Risc V, and the functionality will be blazing fast for the other processors... I still think it's a worthwhile addition, and it's a rather small change that would add a significant performance boost, especially when mucking with bits.

Do you have an example of overflowing_add optimizing to the wrong thing? It uses an intrinsic, so as far as I can tell that would be a bug and you could report it along with the miscompiling example. Otherwise it sounds like exactly the thing you are asking for.

3 Likes

Why would it be a bug? Intrinsics are implemented in LLVM for each supported platform, so it should work just fine on RISC-V.

I think toc means that a suboptimal overflowing_add would be a bug.

Overflowing_add is perfect as is, even on the RISC V.

The purpose/intent of this idea is to grant access to the additional size and performance improvements that can be gained by using the carry flag bit.

Usual code for single bit manipulations in c:

let a=...;

if (a && 256) != 256 then ...

Using the carry bit it's possible to do the same thing with:

load regx, 3382

rcr regx, 8 // Rotate right through carry

jns SKIP_282: // Skip following code if carry flag is set

The challenges:

  • Not all processors implement the carry flag
  • llvm doesn't implement anything like rotate through carry.
  • The carry bit is set and reset by a variety of instructions. Having the compiler ensure that the carry bit is not overwritten while optimizating before the carry flag is used would be problematic.

An unlikely alternative:

The speed gains are mostly realized when combining the bit test with a branch.

It's relatively easy to write a couple of functions that return if a bit is set:

fn is_bit_set( v, bit:i8) -> bool

fn is_bit_clear( v, bit:i8) -> bool

Part of the power of the carry bit in combination with the branch:

if is_bit_clear(v,8) then ...

What's seems needed is an optimization rule that would recompile this to cpu assembly (not llvm) to use the carry bit on cpus that support a carry bit.

  1. What are these potential improvements? What's the high-level use case that isn't already covered by one of the dozens of existing std arithmetic functions like overflowing_add?
  2. If the use case truly requires application code to have direct access to platform-specific concepts like carry bits, why is platform-specific assembly not the right solution?
4 Likes

Oops, that wasn't clear.

1.) Using the carry bit could provide smaller faster code on some architectures for code that manipulates bits directly.

Branching if a bit is 0/1.

In assembly the carry bit can also be useful for moving a bit from one variable to another, reversing the bit order of a word, etc.

There doesn't seem to be any improvements with additions/subraction that are not already covered by overflowing_add.

Code that uses the carry bit is strongly coupled... It needs to be performed in a certain order without using any intervening instructions that overwrite the carry bit. If all architectures supported carry bits, it seems like command(s) that took advantage of the enhancement would be worthwhile... since they do not, it's much more uncertain.

I've run few tests and seems both overflowing_add and ...::x86_64::_addcarry_u32 produces semi optimal assembly.

For example here compilers uses carry flag directly, no local variables on stack of type u8 or bool are actually created!

like:

pub fn test_carry(num: u32,num2:u32) -> u32 {
    let mut res:u32=0;
    let c =  unsafe {core::arch::x86_64::_addcarry_u32(0, num, num2, &mut res)};
    if c==1u8 {
        42
    } else {
        res
    }
}

becomes:

example::test_carry:
        add     edi, esi
        mov     eax, 42
        cmovae  eax, edi
        ret

So you could use x86_64 intrinsic or standardized overflowing_add and in most cases LLVM would get you covered by optimizations.

P.S. to beclear, _addcarry_u32 doesn't emit adc instruction, it emits regular add if first argument is constant 0

3 Likes

Instruction-level access to a "carry bit", so that the value can be used as an input to a subsequent instruction, was trivial to implement in the early days of computing when each instruction was completed before the next instruction was begun.

For a modern, out-of-order, superscalar processor implementation that cost/benefit is reversed; the cost in gates and/or instruction-cycle slowdown of the "carry-bit feature" far, far outweighs any possible benefit. That is why RISC-V, which is a state-of-the-art computer architecture whose expected implementations span the range from embedded processors of 10k gate complexity (e.g., RV32EC) to superscalar processors with 100x more gates, does not materialize an instruction-stream-synchronous carry bit.

8 Likes

Are you saying it should emit adc? If so, why? The result of adc is carry_flag + arg1 + arg2 and if carry_flag is 0 it seems wrong not to constant propagate this to arg1 + arg2 aka. a simple add.

1 Like

As I recall, you would want to use the adc on the higher words as you added them so it would also add in the carry bit from the result of the lowest word addition.

For example: Adding 2 64 bit words on a 32 bit machine:

add A.lo, B.lo adc A.hi, B.hi

The adc on the high words would A.hi + B.hi + c from the first add instruction.

Here's how you would do that today in cross-platform stable Rust. This code:

pub fn add((a1, a0): (u32, u32), (b1, b0): (u32, u32)) -> (u32, u32) {
    let (sum0, carry) = a0.overflowing_add(b0);
    let sum1 = a1 + b1 + carry as u32;
    (sum1, sum0)
}

compiles to add followed by adc:

example::add:
        mov     eax, edi
        add     esi, ecx
        adc     eax, edx
        mov     edx, esi
        ret

(Godbolt link)

9 Likes

No, i mean if you need just add with carry check, it would emit just that, without overhead, if you pass const/literal valued.

I just pointed that this intrinsic is optimized well.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.