Immortal Arc API?

It does cover the need. A couple replies here have done a great job answering the XY question (roughly "how does one handle immortalizing (leaking) memory in rust?").

This is possible: Arcs share a Box<(AtomicUsize, AtomicUsize, T)>[1]. If the refcount is 1, that box can be leaked[2] and a &'static mut T handed out, but then the refcounts would be leaked along with it[3]. You could use Box::leak(Box::new(*Arc::get_mut(&mut my_arc).unwrap())) (which makes a new allocation to leak), but I think the best advice, if you want a leakable Arc, is to use an Arc<Box<T>>.
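As a sketch of the "new allocation to leak" route without the implicit Copy requirement of `*Arc::get_mut(..)`: `Arc::try_unwrap` can move the value out when the strong count is 1, and that value can then be boxed and leaked. The helper name `leak_unique` is hypothetical, just for illustration.

```rust
use std::sync::Arc;

fn leak_unique<T>(arc: Arc<T>) -> &'static mut T {
    // Only succeeds when this is the sole strong reference.
    let value = Arc::try_unwrap(arc).ok().expect("refcount was not 1");
    // Move into a fresh allocation and leak that instead; the Arc's
    // own allocation (with the refcounts) is released above.
    Box::leak(Box::new(value))
}

fn main() {
    let s = leak_unique(Arc::new(String::from("hi")));
    s.push('!');
    assert_eq!(s, "hi!");
}
```

Note this still pays for one extra allocation at leak time, which is exactly why `Arc<Box<T>>` may be the more comfortable shape to begin with.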


  1. That inner type is actually an "ArcInner". Same idea. ↩ī¸Ž

  2. The box has actually been pre-leaked by the Arc, usually the last Arc will clean it up though. ↩ī¸Ž

  3. And leak functions probably shouldn't leak. ↩ī¸Ž

2 Likes

The need only arises in a scenario similar to the cpython implementation. You have a big codebase that uses Arc extensively in basically all interfaces. However, you have objects of which you know that they exist for the complete runtime of your application, i.e. they effectively have a static lifetime. You share these objects between multiple threads and pay a lot of synchronization cost for incrementing/decrementing the counter, so you look for a solution that allows you to use the existing functions and interfaces without paying the reference counter synchronization cost.

The cpython motivation is a bit more complicated, because it is not about thread synchronization but about interpreter forking; the point stays the same, though.

This is such a niche application that it should not be included in the rust std. If you write an application like a python interpreter you should use a custom smart pointer, because you probably want to employ several domain-specific optimizations (like this one). In the current python implementation it imposes a severe performance cost, several percent, on the rest of the users.

2 Likes

Perhaps something like a CoW would fulfill that use case better than modifying Arc?

enum AoR<T> {
    Arc(Arc<T>),
    Ref(&'static T),
}

impl<T> AoR<T> {
    fn leak(this: &mut Self) {
        let Self::Arc(a) = this else { return };

        *this = Self::Ref(
            // Leaking one strong count keeps the allocation alive
            // forever, so the reference really is 'static.
            unsafe { &*Arc::into_raw(a.clone()) },
        );
    }
}

impl<T> Deref for AoR<T> { ... }

I guess this wouldn't be pointer-sized though

I don't think this is equivalent to the original idea, as it only works for Copy types.

After that line, my_arc still holds on to its portion of the allocation, and the get_mut result had wrong provenance anyway. I find the combination Box::leak(Box::new(_)) suspect, too. As you said, the Arc already contains an allocation. Am I overlooking other semantics of Box that would be useful?

Another implementation, yielding a leaked &mut, could be:

// Ensure unique ownership of the value, `T: Clone`.
Arc::make_mut(&mut my_arc);
// Now take over and leak the share of the allocation.
unsafe { &mut *(Arc::into_raw(my_arc) as *mut _) }

For a leaked static shared reference it could be simply:

unsafe { & *Arc::into_raw(my_arc) }
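Wrapped as complete functions, the two snippets above might look like the following. The names `leak_shared` and `leak_exclusive` are made up for illustration, and whether the mutable variant is fully sound under Rust's aliasing rules is, as discussed, debatable.

```rust
use std::sync::Arc;

fn leak_shared<T>(arc: Arc<T>) -> &'static T {
    // `into_raw` forgets the Arc, so its strong count is never
    // decremented and the allocation lives forever.
    unsafe { &*Arc::into_raw(arc) }
}

fn leak_exclusive<T: Clone>(mut arc: Arc<T>) -> &'static mut T {
    // Clone-on-write to guarantee we hold the only strong reference...
    Arc::make_mut(&mut arc);
    // ...then leak that sole reference and claim exclusive access.
    unsafe { &mut *(Arc::into_raw(arc) as *mut T) }
}

fn main() {
    let shared = leak_shared(Arc::new(1u32));
    assert_eq!(*shared, 1);

    let exclusive = leak_exclusive(Arc::new(vec![1, 2]));
    exclusive.push(3);
    assert_eq!(*exclusive, vec![1, 2, 3]);
}
```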

I think this could be sufficient?

use std::mem::ManuallyDrop;
use std::sync::Arc;

fn immortal_arc<T>(t: T) -> Arc<T> {
    let tmp = Arc::new(t);
    let _ = ManuallyDrop::new(Arc::clone(&tmp));
    tmp
}

The allocation is too big by two AtomicUsizes + padding. This might be fine in some cases (and e.g. Vec::leak leaks spare capacity), but isn't what I'd expect.

That is indeed immortal, and is roughly what Arc::increment_strong_count does internally. This will still perform the atomic increments/decrements as we clone the Arc, and in the original use case (such as it is) those increments/decrements are a huge cost to our program runtime :upside_down_face:.

Can you clarify what you mean by "huge cost," do you have some specific use-case in mind? (EDIT: Oh do you mean the use-case in the linked article?)

When I think of Arc-heavy code, I think of tree-like data structures where the creation of the nodes is often an insignificant cost that is paid up-front.

Yes, this is theoretically what I want. However, the problem is that the types are not the same and so a CPython-esque use case would not work.

A "zero overhead" solution is to change the APIs from taking Arc<Ty> to impl P<Ty> for some trait P<T> = 'static + Deref<Target=T> + Clone + Send + Sync. (Use a trait with blanket impl instead of a trait alias for stable.) Then any such APIs can be monomorphized for Arc<T> or &'static T.

The cost of such is of course the need to change every such API and to monomorphize two copies of the API. This isn't exactly comfortable in Rust, but it's not even really an option in C (e.g. for CPython).
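A minimal sketch of that trait-with-blanket-impl spelling on stable Rust (the trait name `P` and the function `length` are placeholders; `Send + Sync` bounds are omitted for brevity):

```rust
use std::ops::Deref;
use std::sync::Arc;

// Stable spelling of `trait P<T> = 'static + Deref<Target = T> + Clone`:
// an empty trait plus a blanket impl.
trait P<T>: Deref<Target = T> + Clone + 'static {}
impl<T, U> P<T> for U where U: Deref<Target = T> + Clone + 'static {}

// An API written against `impl P<T>` monomorphizes for both
// `Arc<T>` and `&'static T`.
fn length(s: impl P<String>) -> usize {
    s.len()
}

fn main() {
    assert_eq!(length(Arc::new(String::from("abc"))), 3);

    static S: String = String::new();
    assert_eq!(length(&S), 0);
}
```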

You can of course also polymorphize with an enum of Arc or &'static for the cost of an extra pointer sized int on the stack. Or you can use a tagged pointer to pack the flag into the pointer for the cost of a bitmask before deref. But I highly doubt that std is going to add this branch into every Arc — it does pessimize every case that doesn't also deal with a large number of "immortalized" values to materialize the value of branching over a single AcqRel sync per clone/drop.

The "best" solution is to "just" take advantage of Rust's borrow checking to minimize unnecessary reference count manipulation. Use "+1" passing (function takes Arc) when the function generally needs ownership AND callers can regularly pass along ownership instead of cloning, and "+0" passing (function takes &Arc<T> or ArcBorrow<'_, T>) when the function generally doesn't need to take ownership, OR if the caller would rarely be able to give up ownership.
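The two calling conventions might look like this (hypothetical `store`/`maybe_store` functions, just to show the signatures):

```rust
use std::sync::Arc;

// "+1" passing: the callee receives ownership; a caller that is done
// with its Arc moves it in, performing no refcount traffic at all.
fn store(slot: &mut Vec<Arc<String>>, value: Arc<String>) {
    slot.push(value);
}

// "+0" passing: the callee borrows; no refcount traffic unless it
// decides to keep the value, in which case it clones (one increment).
fn maybe_store(slot: &mut Vec<Arc<String>>, value: &Arc<String>, keep: bool) {
    if keep {
        slot.push(Arc::clone(value));
    }
}

fn main() {
    let mut slot = Vec::new();
    let v = Arc::new(String::from("x"));
    store(&mut slot, Arc::clone(&v)); // caller still needs `v`: one clone
    maybe_store(&mut slot, &v, false); // no refcount traffic at all
    assert_eq!(slot.len(), 1);
}
```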

An interpreter like CPython is a fun case where the interpreter does not know either whether the calling scope will use the value again or whether the callee scope will use ownership, and it has a good number of known functionally immortal objects which are commonly manipulated in the same dynamic codepaths as nonimmortal objects. Plus, CPython objects aren't exclusively just reference counted; there's also a reference cycle reclaiming GC involved, which is another cost that immortal objects can avoid, in addition to the reference counting.

If the cost of atomic reference counting is really a notable chunk of your runtime, the best approach might actually be a "hybrid rc" approach, where within a thread you use a thread local count, and only touch the atomic count when you need to potentially cross threads (be Send). This can work surprisingly well when data only rarely (potentially) crosses threads, although that's incompatible with a multi-thread work-stealing executor (e.g. tokio).
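A heavily simplified sketch of the hybrid idea, using `Rc<Arc<T>>` so that within-thread clones are non-atomic and only an explicit `share` touches the atomic count (the `Hybrid` type and its methods are invented for illustration; a real hybrid-rc merges the two counts into one allocation):

```rust
use std::rc::Rc;
use std::sync::Arc;

// Within a thread, cloning bumps the cheap non-atomic Rc count;
// only `share` performs an atomic increment to cross threads.
struct Hybrid<T> {
    local: Rc<Arc<T>>,
}

impl<T> Hybrid<T> {
    fn new(value: T) -> Self {
        Self { local: Rc::new(Arc::new(value)) }
    }
    // One atomic increment, producing a handle that may leave the thread.
    fn share(&self) -> Arc<T> {
        Arc::clone(&self.local)
    }
}

impl<T> Clone for Hybrid<T> {
    fn clone(&self) -> Self {
        // Non-atomic increment only.
        Self { local: Rc::clone(&self.local) }
    }
}

impl<T> std::ops::Deref for Hybrid<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &**self.local
    }
}

fn main() {
    let h = Hybrid::new(41);
    let h2 = h.clone(); // non-atomic
    let across = h.share(); // one atomic increment
    let t = std::thread::spawn(move || *across + 1);
    assert_eq!(*h2, 41);
    assert_eq!(t.join().unwrap(), 42);
}
```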

6 Likes

As far as I know the immortal objects in python are an optimization for the garbage collector. By not having to check these immortal objects the garbage collector pauses get shorter. I don't see how this would apply to rust, since there is no garbage collector and dropping of the backing storage happens quite naturally when the last Arc is dropped. No need to scan or not scan objects to decide whether they need to be deleted or not.

5 Likes

I thought it was to prevent contention on the refcount from different threads.

EDIT: https://peps.python.org/pep-0683/#reducing-cpu-cache-invalidation

How is that different from what I said?

One of the goals of immortal objects in Python is to not have to touch the refcount at all; this is unrelated to GC, and applies to Rust as well as to Python. Instead, once an object is deemed immortal, you leave the refcount alone; Arc::clone and Arc::drop would not increment and decrement the refcount.

If you have many Arc::clone and Arc::drop calls in your code, then the cost of incrementing and decrementing the refcount is significant; as a rule of thumb, an atomic operation is 10x the cost of a non-atomic operation when uncontended, going upwards as multiple CPUs attempt to access the same atomic. The "immortal Arc" proposal cuts those costs out for the case where the Arc is guaranteed to live long enough, and in Python that's helpful since all objects are effectively wrapped in an Arc if they're not garbage collected by other means.

2 Likes

By avoiding touching the refcount atomically at all, this is faster, right?

Yes, exactly. Please see this type which incorporates an efficient immortal API.

If, as in the OP, immortality is signaled by a special strong count, then it still needs to be read atomically to determine if it is immortal. This would be less contention than a rmw op, but it's still an atomic operation.
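A simplified sketch of that sentinel-count scheme (real Arc code uses more careful orderings; `IMMORTAL` and `incr_strong` are illustrative names): the load is still an atomic operation, but immortal handles never perform the read-modify-write.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

// A strong count pinned at this sentinel marks the value immortal.
const IMMORTAL: usize = usize::MAX;

fn incr_strong(strong: &AtomicUsize) {
    // Atomic load either way; the rmw is skipped for immortals.
    if strong.load(Relaxed) != IMMORTAL {
        strong.fetch_add(1, Relaxed);
    }
}

fn main() {
    let normal = AtomicUsize::new(1);
    incr_strong(&normal);
    assert_eq!(normal.load(Relaxed), 2);

    let immortal = AtomicUsize::new(IMMORTAL);
    incr_strong(&immortal);
    assert_eq!(immortal.load(Relaxed), IMMORTAL); // untouched
}
```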

You do put the "is immortal" discriminant in a separate field, which permits checking it non-atomically. However, because you're using typestate to specify the mode, there's no benefit to unifying the types instead of having separate types, e.g. Arc<T> and a "LeakedBox<T>".

The proposed benefit of immortal arcs is giving them to code handling Arc<T> and having that code perform closer to if it were handling &'static T. If you're going to require editing the code to make it monomorphic anyway, just make it monomorphic over impl Deref<Target=T> + Clone and just use &T instead of Arc<T>. There's no benefit to making the data representation polymorphic if the type representation is monomorphic.

LeakedBox pseudocode
unsafe trait SmartPtr: Deref {
    fn into_raw(self) -> NonNull<Self::Target>;
    unsafe fn from_raw(ptr: NonNull<Self::Target>) -> Self;
}

impl<T> SmartPtr for Box<T>, Arc<T>, Rc<T>;

struct Leaked<P: SmartPtr>(NonNull<P::Target>, PhantomData<P>);
impl<P: SmartPtr> From<P> for Leaked<P>;

impl<P: SmartPtr> Copy + Clone for Leaked<P>;

unsafe impl<P: SmartPtr> Send for Leaked<P> where P: Send;
unsafe impl<P: SmartPtr> Sync for Leaked<P> where P: Sync;

impl<P: SmartPtr> Deref for Leaked<P> where P: Deref;
impl<P: SmartPtr> DerefMut for Leaked<P> where P: DerefMut;

impl<P: SmartPtr> Leaked<P> {
    unsafe fn into_inner_unchecked(this: Self) -> P;
}

Dynamically thread-local reference counting is likely more impactful than immortal reference counts. (And immortal reference counts are likely more impactful with thread-local reference counts, since then you can skip the thread-local access.)

But that's all a much more involved thing than std's Arc/Rc. Them necessarily supporting weak references is already seen as "too much" by some (it's unnecessary overhead if you never use weak references); doing more certainly steps out of std's intended scope of providing the simple building block vocabulary.

Perhaps hybrid reference counting should be a preferred default, with non-hybrid being an optimization. But that's far from proven at this point, and the perfect domain for an external crate to provide. If you want to push for it to be "more available" than "some random crate," crossbeam would be a natural home for such a tool. Or just create a high quality implementation, show its benefits with real-world benchmarks, and offer PRs to projects using it and showing the real-world benefit.

3 Likes

Would it actually need to be read atomically (or is this architecture-dependent)? If IMMORTAL is read, then no other writes should ever occur. Although, since there is no atomic fetch_add_if_not_equal operation (?), the immortal flag should probably be a separate AtomicBool.

1 Like

It would race writes for nonimmortal strong counts, so yes, it would need to be a separate byte, written only at initialization, if reads are supposed to be nonatomic.
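A sketch of that layout (the `Inner` struct and `incr` are invented for illustration): the flag is a plain bool written once at construction, so clone/drop can branch on it without any atomic access at all.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

struct Inner<T> {
    immortal: bool, // never written after `new`, so plain reads are fine
    strong: AtomicUsize,
    value: T,
}

impl<T> Inner<T> {
    fn new(value: T, immortal: bool) -> Self {
        Self { immortal, strong: AtomicUsize::new(1), value }
    }
    fn incr(&self) {
        // Non-atomic read of the flag; atomic rmw only for mortals.
        if !self.immortal {
            self.strong.fetch_add(1, Relaxed);
        }
    }
}

fn main() {
    let a = Inner::new((), false);
    a.incr();
    assert_eq!(a.strong.load(Relaxed), 2);

    let b = Inner::new((), true);
    b.incr();
    assert_eq!(b.strong.load(Relaxed), 1); // untouched
}
```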

1 Like

Yes. Like you pointed out, I use typestate. However, the benefit is a safe API with a safe method to convert.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.