Rc optimization on 64-bit targets

I was optimizing some code and noticed a surprising branch. It turned out to come from Rc::clone. It's an overflow check for the refcount, and it turns what could be a single add instruction into 6-8 instructions, including a branch.

Godbolt link

I found this very old thread discussing it. I think the check makes sense on 32-bit targets, but not 64-bit+ ones. With a 64-bit usize, if you cloned and leaked the same Rc one billion times per second, it would take 584 years to overflow. To me that's unlikely enough to ignore, or at least leave it for one of my distant great-great¹⁶-grandchildren to deal with.

Could we make the check conditional on the size of usize? Or is there still some value to having it on 64-bit targets?

1 Like

Is the branch actually significant? I would expect the branch predictor to do really well with this branch.

I think the code could be optimized to skip the strong == 0 check (the comment even says "The reference count will never be zero when this is called"), and to change the overflow check from a pre-condition to a post-condition (though I suppose there's a window of opportunity here for UB if another thread reads the ref count before we manage to abort).

1 Like

If it has any performance impact, perhaps core::intrinsics::unlikely could help.

The branch predictor should handle it fine, but the problem is more one of code bloat. Rc::clone is inlined, which means you end up with thousands of copies of this extra dead code spread around the binary. That bloats the instruction cache (and branch predictor state tables), and the bigger your code size, the more often it falls out of cache.

That's a nice optimization! You don't have to worry about other threads since Rc is single-threaded. I'm specifically talking about Rc and not Arc. I think your change or something like it should be made. Though I'd still prefer the perhaps more controversial option of removing the check entirely.

1 Like

As I wrote back then, the overflow could be triggered on 64-bit targets, but only with pathological code that did nothing but clone and forget. This is because LLVM would optimize "N times in a row, increase refcount by 1" into "increase refcount by N".

Then see this exchange from the PR that added the overflow check. I suggested that the check could be skipped on 64-bit targets as long as we could prevent the optimization mentioned previously, which could be done with as little as an empty inline assembly block. But this was judged too risky, and the need for it uncertain. Perhaps it's time to revisit that.

5 Likes

I personally prefer @mjbshaw's optimized solution over skipping the check entirely. While I fully agree that overflowing the counter is difficult on 64-bit machines, as @comex pointed out it is possible to do so with malicious code.

How hard would it be to test the differences in code size using the different options on real world code? That is, something like a crater run, but for code size?

2 Likes

Out of curiosity, I tried optimizing inc_strong to this:

    #[inline]
    fn inc_strong(&self) {
        let strong = self.strong();

        // SAFETY: The reference count will never be zero when this is called;
        // nevertheless, we insert an assume here to hint LLVM at
        // an otherwise missed optimization.
        unsafe { assume(strong != 0) };

        // We want to abort on overflow instead of dropping the value.
        self.strong_ref().set(strong.checked_add(1).unwrap_or_else(|| abort()));
    }

This generates shorter assembly:

example::rc_inc:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rax]
        inc     rcx
        je      .LBB1_1
        mov     qword ptr [rax], rcx
        ret
.LBB1_1:
        ud2
        ud2

However, I doubt doing so would have a practical impact on performance. A cmp instruction is cheap compared to a memory access, so removing it is not going to be particularly meaningful.

Although I suppose shorter code does mean less stuff to cache...

What is the missed optimization?

Anyway, as already said, since this is inlined many times it can have an effect.

I went ahead and made a pull request.

2 Likes