Reference-counting garbage collection can be quite efficient

I thought this might interest some people here: Lean (a theorem prover written in C++) uses something like Arc for garbage collection in the purely functional language it implements, and with some tricks (basically make_mut) it performs very well. The language guarantees that its data structures contain no cycles, which definitely helps. Still, the conventional wisdom I was aware of was "refcounting garbage collection is inefficient", and that may be less clear-cut than I thought.


Thanks for sharing!

The note about the resurrection hypothesis makes me wonder whether that's something we could be optimizing in Rust. For example, the following program currently gets the obvious alloc-then-dealloc implementation:

pub fn foo(x: Box<i32>) -> Box<i32> {
    Box::new(*x + 1)
}

But it feels like we ought to have the freedom to just reuse the allocation we were given instead.

(On a very unimportant note, that paper has some of the worst into-the-margin overruns I've ever seen in a LaTeX paper.)


I believe C++ does optimizations of this flavor already? It is definitely smart enough to pair ::new and ::delete calls across basic blocks and "unspill" the heap allocation onto the stack. I'm curious to what extent this can be emulated in Rust... this seems like a place where the borrow semantics could be useful constraints.

1 Like

And at some point in the future also reuse the stack space used by the Box. A first step would be to guarantee that the returned Box is pre-allocated in the caller's stack frame (without copying).

Somewhat relevant (but a bit off-topic in relation to the OP).

I also would love to see code like this:

let mut buf = Vec::new();
buf.extend_from_slice(b"prefix"); // 6 bytes
buf.extend_from_slice(var);
buf.extend_from_slice(b"\r\n"); // 2 bytes

To get optimized into:

let mut buf = Vec::new();
buf.reserve(6 + var.len() + 2); // one reservation instead of repeated growth
buf.extend_from_slice(b"prefix");
buf.extend_from_slice(var);
buf.extend_from_slice(b"\r\n");

IIRC it could benefit, for example, diesel, since after macro expansion its SQL queries are assembled using similar pushes.

Even more ideal would be if Vec::new() were automatically replaced with Vec::with_capacity(6 + var.len() + 2). And in general, a better understanding of allocation semantics by the compiler would be nice. Inefficient boxing of large static objects (today Rust first places the object on the stack and only then copies it to the heap) could probably also be solved by improving this understanding.

We can already write

pub fn foo(x: Box<i32>) -> Box<i32> {
    let mut x = x;
    let r = &mut *x;
    *r += 1;
    x
}

Yeah, I know, it's a bit more work, but it works and reuses the allocation.

1 Like

Or if you want to be succinct

pub fn foo(mut x: Box<i32>) -> Box<i32> {
    *x += 1;
    x
}

It would also help greatly if this were possible for distinct types, because that is really not possible right now without unsafe:

pub fn foo(x: Box<i32>) -> Box<u32> {
    Box::new(*x as u32)
}
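For reference, the reuse can be spelled out with unsafe today. This is only a sketch for this specific pair of types, relying on i32 and u32 having identical size and alignment; it does not generalize to arbitrary type pairs:

```rust
// Sketch only: reuse the Box<i32> allocation for a Box<u32>.
// Sound here because i32 and u32 have identical size and alignment.
pub fn foo(x: Box<i32>) -> Box<u32> {
    let raw = Box::into_raw(x);
    unsafe {
        let val = raw.read() as u32; // move the old value out
        let raw = raw as *mut u32;   // same layout: keep the allocation
        raw.write(val);
        Box::from_raw(raw)
    }
}
```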

Isn't this the kind of thing that we could potentially get from LLVM, "for free"?

This would be a use case for something like Haskell's Coercible - which I also want for a bunch of other reasons.

I knew someone who was working on a Rust system using auto traits to implement something like that - a function that always monomorphizes to mem::transmute, but has trait bounds so it only compiles in cases where the conversion is known to be safe, starting with simple things like u32 to i32, but including much more sophisticated things like &Box<[u32]> to &&[[u8; 4]]. Unfortunately, I don't think she's still working on it, I don't have contact with her, and it doesn't seem to be anywhere online :frowning:

Not sure which person you mean, but I hacked together a proof-of-concept for something similar a while ago. There's no inference with this version, though: you have to build up the chain of conversions yourself.

(Note that the linked implementation doesn't get variance quite right, so I recommend against using it in its current form.)

(It also invokes undefined behavior left and right, since you're not allowed to e.g. transmute Vec<u32> to Vec<i32> directly even though they have the same representation.)
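(For that particular case, the usual workaround is to go through raw parts instead of transmuting the Vec itself, as the Vec::from_raw_parts docs describe. A sketch, still leaning on u32 and i32 having the same size and alignment:)

```rust
// Sketch: convert Vec<u32> -> Vec<i32> without transmuting the Vec.
// Disassemble into raw parts and reassemble around the same buffer.
fn convert(mut v: Vec<u32>) -> Vec<i32> {
    let ptr = v.as_mut_ptr() as *mut i32;
    let (len, cap) = (v.len(), v.capacity());
    std::mem::forget(v); // don't free: ownership moves to the new Vec
    unsafe { Vec::from_raw_parts(ptr, len, cap) }
}
```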

As I recall, the system I saw did do variance and inference right. It might have even been theoretically complete, but it had a few issues because it was making rustc figure out the chains through trait inference – which made it have long compile times and sometimes not manage to find that a moderately complex conversion was valid.

Your system for explicitly specifying the sub-conversions looks pretty neat. If I find some free time and energy, I might think about a way to include variance (and some other issues) in it. (EDIT: this is very unlikely, because I know this project could turn into an endless rabbit hole for me.)

May I ask why the code uses instances of the Transmute ZST rather than having it be an uninhabited type? Is it just to make the syntax easier?

I use this pattern all the time btw. So convenient. (Not only with boxes, but actually any owning container.)

That is a different operation, though: + 1 is like fn(i32) -> i32, while += 1 is like fn(&mut i32).

Maybe a better example is negation, for which there is no &mut method.

pub fn foo(mut x: Box<i32>) -> Box<i32> {
    *x = -*x;
    x
}

The main point still holds though. :slight_smile:

Would it be possible to implement a reuse method like this:

impl<T> Box<T> {
    fn reuse<N>(mut self, new: N) -> Box<N>
}

except that it ensures that T and N have the same size?

We could do this right now, with no change to the compiler

impl<T: ?Sized> Box<T> {
    fn try_reuse<N>(self, new: N) -> Result<Box<N>, Self> {
        self.try_reuse_with(|| new)
    }

    fn try_reuse_with<N, F>(self, new: F) -> Result<Box<N>, Self>
    where F: FnOnce() -> N {
        use std::alloc::Layout;
        let layout = Layout::for_value(&*self);

        if layout == Layout::new::<N>() {
            unsafe {
                let ptr = Box::into_raw(self);
                ptr.drop_in_place();      // drop the old value in place
                let ptr = ptr as *mut N;
                ptr.write(new());         // reuse the allocation
                Ok(Box::from_raw(ptr))
            }
        } else {
            Err(self)
        }
    }
}

Another possibility:

fn reuse<N>(self, new: N) -> Box<N> {
    use std::alloc::Layout;
    // check the layout up front so we still own `new` on the fallback path
    if Layout::for_value(&*self) == Layout::new::<N>() {
        self.try_reuse(new).ok().unwrap()
    } else {
        Box::new(new)
    }
}
1 Like

I thought they might have cited

It appears all their benchmarks are single threaded. What really worries me about refcounting performance is when multicore workloads that are logically read-only, like traversing a complex read-only datastructure simultaneously from multiple cores, turn into read-write workloads and now you have lots of CREW cache misses when you wouldn't without refcounting.
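To make the worry concrete with a toy sketch (hypothetical types, not from the paper; Rc stands in for Arc for brevity, and with Arc across threads the count updates become contended atomic read-modify-writes): a "read-only" traversal that bumps counts writes to shared memory on every visit, while a borrowed traversal touches no counters at all.

```rust
use std::rc::Rc;

// Toy tree with refcounted children, standing in for a refcounted runtime.
enum Tree {
    Leaf(i32),
    Node(Rc<Tree>, Rc<Tree>),
}

// "Read-only" traversal that clones each child: every visit does a
// count increment and decrement, i.e. writes to shared memory.
fn sum_counting(t: Rc<Tree>) -> i32 {
    match &*t {
        Tree::Leaf(v) => *v,
        Tree::Node(l, r) => sum_counting(Rc::clone(l)) + sum_counting(Rc::clone(r)),
    }
}

// Borrowed traversal: no counter traffic at all.
fn sum_borrowed(t: &Tree) -> i32 {
    match t {
        Tree::Leaf(v) => *v,
        Tree::Node(l, r) => sum_borrowed(l) + sum_borrowed(r),
    }
}
```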


Hi, one of the authors of the paper here.

Note that you don't really need our hypothesis in this case, because Rust statically knows that x will be deallocated here; we need the hypothesis because we do not have a linear type system to give us static guarantees about memory management, so we basically have to guess and check at runtime. A closer analogue in Rust would be a function that takes and returns an Rc/Arc and tries to reuse the allocation via try_unwrap (but don't forget to drop nested Rcs early, as in the map example in the paper!).
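A minimal sketch of that guess-and-check pattern (my wording, not the paper's), using Rc::get_mut to test uniqueness: mutate in place when we hold the only reference, otherwise allocate fresh.

```rust
use std::rc::Rc;

// Guess-and-check reuse: if this is the only reference, update the
// value in the existing allocation; otherwise fall back to a new one.
fn increment(mut x: Rc<i32>) -> Rc<i32> {
    if let Some(v) = Rc::get_mut(&mut x) {
        *v += 1; // unique: reuse the allocation
        x
    } else {
        Rc::new(*x + 1) // shared: allocate fresh
    }
}
```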

Yeah, arXiv was not kind to our LaTeX sources. It should be better now.


Thanks for the link, I wasn't aware of this paper before. The analysis section looks very nice and thorough. The proposed delayed refcounting optimizations do not seem to be compatible with ours, which require exact reference counts.

I would hope that for your use case, most of the traversal is done using borrowed references that do not touch the reference counter. But since our (heuristic) borrowing is much more restrictive than, say, Rust's, there is certainly room for improvement in that regard.