Rc optimization on 64-bit targets

I was optimizing some code and noticed a surprising branch. It turned out to come from Rc::clone. It's an overflow check for the refcount, and it turns what could be a single add instruction into 6-8 instructions, including a branch.

Godbolt link

I found this very old thread discussing it. I think the check makes sense on 32-bit targets, but not 64-bit+ ones. With a 64-bit usize, if you cloned and leaked the same Rc one billion times per second, it would take 584 years to overflow. To me that's unlikely enough to ignore, or at least leave it for one of my distant great-great¹⁶-grandchildren to deal with.

Could we make the check conditional on the size of usize? Or is there still some value to having it on 64-bit targets?

1 Like

Is the branch actually significant? I would expect the branch predictor to do really well with this branch.

I think the code could be optimized to skip the strong == 0 check (the comment even says "The reference count will never be zero when this is called"), and to change the overflow check from a pre-condition to a post-condition (though I suppose there's a window of opportunity here for UB if another thread reads the ref count before we manage to abort).

1 Like

If it has any performance impact, perhaps core::intrinsics::unlikely could help.

The branch predictor should handle it fine, but the problem is more one of code bloat. Rc::clone is inlined, which means you end up with thousands of copies of this extra dead code spread around the binary. That bloats the instruction cache (and branch predictor state tables), and the bigger your code size, the more often it falls out of cache.

That's a nice optimization! You don't have to worry about other threads since Rc is single-threaded. I'm specifically talking about Rc and not Arc. I think your change or something like it should be made. Though I'd still prefer the perhaps more controversial option of removing the check entirely.

1 Like

As I wrote back then, the overflow could be triggered on 64-bit targets, but only with pathological code that did nothing but clone and forget. This is because LLVM would optimize "N times in a row, increase refcount by 1" into "increase refcount by N".

Then see this exchange from the PR that added the overflow check. I suggested that the check could be skipped on 64-bit targets as long as we could prevent the optimization mentioned previously, which could be done with as little as an empty inline assembly block. But this was judged too risky, and the need for it uncertain. Perhaps it's time to revisit that.

5 Likes

I personally prefer @mjbshaw's optimized solution over skipping the check entirely. While I fully agree that overflowing the counter is difficult on 64-bit machines, as @comex pointed out it is possible to do so with malicious code.

How hard would it be to test the differences in code size using the different options on real world code? That is, something like a crater run, but for code size?

2 Likes

Out of curiosity, I tried optimizing inc_strong to this:

    #[inline]
    fn inc_strong(&self) {
        let strong = self.strong();

        // SAFETY: The reference count will never be zero when this is called;
        // nevertheless, we insert an assume here to hint LLVM at
        // an otherwise missed optimization.
        unsafe { assume(strong != 0) };

        // We want to abort on overflow instead of dropping the value.
        self.strong_ref().set(strong.checked_add(1).unwrap_or_else(|| abort()));
    }

This generates shorter assembly:

example::rc_inc:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rax]
        inc     rcx
        je      .LBB1_1
        mov     qword ptr [rax], rcx
        ret
.LBB1_1:
        ud2
        ud2

However, I doubt doing so would have a practical impact on performance. A cmp instruction is cheap compared to a memory access, so removing it is not going to be particularly meaningful.

Although I suppose shorter code does mean less stuff to cache...

What is the missed optimization?

Anyway, as already said, since this is inlined many times it can have an effect.

I went ahead and made a pull request.

2 Likes