Crossbeam: request for help

As for the proof of correctness: I proved the SegQueue correct* by creating its FSM and proving the correctness of its states and transitions.

*I suspect there is a bug on one of the edges, which if reachable, I cannot prove correct.

Similar, relatively simple reasoning can be carried out on the other primitives.

I think you shouldn't overdo it with too much work. There is no reason to rush. Think of your health!!!

I'm fine, everybody.

(just to be clear: I haven't worked 20 hours straight, only around 11 or 12)

But yeah, I need to get some rest.

That sounds very good to me; I'd been thinking along similar lines, but was a little worried about slowing things down.

For larger proposed changes, I'd suggest using the issue tracker to try to hash it out with owners/stakeholders, and move forward when a reasonable level of consensus is reached. If need be we can get more formal later, but that seems like a decent starting point.


On the topic of weakly ordered hardware, it would be really great to have a testbed for that. I've looked into finding a cheap Power cloud provider but had no luck, and the only ARM platform I've been able to easily run/compile code on (Cavium Thunder) doesn't reorder aggressively and serializes L1 misses.

I used to feel otherwise, but now I don't find per-datastructure garbage bags so bad. I suspect there's a performance-friendly way to limit the impact of a stuck thread to the specific structure, using that sort of scheme.

I think crossbeam should allow for destructors in the GC bags. I don't think the overhead of refcounting items is acceptable, especially since there's already a full fence on pin. From a very x86 perspective, these refcounts would introduce full fences on load, and I consider that unacceptable. Maybe in practice the overhead is smaller, since you can't reorder around the cache misses anyway, but I want to see strong evidence of that. There are other ways to control for latency and GC urgency that I have in mind.
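To make concrete what I mean by destructors in the GC bags, here is a minimal sketch (assuming a crossbeam-epoch style API with pin, Owned, and defer_destroy; not necessarily the exact interface): the destructor of an unlinked node is simply deferred into the current garbage bag, so the read path never touches a per-item refcount.

use std::sync::atomic::Ordering;
use crossbeam_epoch::{self as epoch, Atomic, Owned};

// A node whose payload may have a non-trivial Drop impl.
struct Node<T> {
    value: T,
}

struct Slot<T> {
    head: Atomic<Node<T>>,
}

impl<T> Slot<T> {
    fn new() -> Self {
        Slot { head: Atomic::null() }
    }

    fn replace(&self, value: T) {
        let guard = epoch::pin();
        let old = self.head.swap(Owned::new(Node { value }), Ordering::AcqRel, &guard);
        if !old.is_null() {
            // Defer both the deallocation and the destructor of the old node
            // until no pinned thread can still be holding a reference to it.
            unsafe { guard.defer_destroy(old) };
        }
    }
}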

If you are refcounting the items, is there a point to EBR? Or is it that you only refcount items which need destructors?

:tada: Release candidate for 1.0.0


I just think it's unfair for a data structure that doesn't have droppable elements to run destructors for objects that belonged to completely different structures. Is this something you agree with?

Kinda, but not really. It's basically the premise of crossbeam, and users have already accepted an expensive set of deallocations on pin. Same with Arc - you accept that maybe you trigger a huge drop set. I don't think the solution is to give up and do something more expensive. The epoch GC should definitely be incremental, drops should probably be considered differently from plain frees, and per-datastructure or per-type bags should be considered. But I really don't think that a refcount on data reference is a good forced solution.

This is true. Pinning requires a SeqCst fence and refcounting requires AcqRel fences. A simple way to eliminate subsequent SeqCst fences incurred by pinning is to create a pin yourself beforehand (remember that pinning is reentrant).
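As a small illustration of the reentrancy point (a sketch assuming the crossbeam-epoch style API where pin() returns a guard; the exact surface may differ): only the outermost pin pays the SeqCst fence, nested pins just bump a counter.

use crossbeam_epoch as epoch;

fn batch_of_operations() {
    // The outermost pin is the one that issues the SeqCst fence.
    let _outer = epoch::pin();
    for _ in 0..1_000 {
        // Pinning is reentrant: while `_outer` is alive, these inner pins
        // only increment a thread-local count and issue no extra fence.
        let _inner = epoch::pin();
        // ... lock-free operations under `_inner` ...
    }
}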

On Intel, all RMW operations act as a SeqCst fence; on ARM, AcqRel results in the same fences as SeqCst for RMW operations, and Power isn't much better. Even on AArch64, I suspect that the 'optimized' LL/SC operations carry the same fence cost on most implementations. Some very weak results I have support this. Ultimately, though, an atomic refcount increment and decrement doesn't come cheap unless the cache line is already held in an exclusive state, in which case it's just expensive.

Can you wait for one week, starting from today, until I finish my skiplist? :slight_smile: Iterators are based on refcounting and the overhead is really small.

Absolutely! I would be very happy if empirically the cost wasn't very big! With anything linked you have to wait for loads to complete anyways so maybe it doesn't matter too much. I'm very glad to see that a good skiplist is almost done since I need one for transactional datastructures and don't want to rewrite one.

Care to share them?

For latency control - my thoughts are essentially that if you accept the fact that you may enter a GC run and do a lot of deallocations, you've already given up on any sort of tight latency. With that in mind, I think it's more fruitful to think about controlling when threads can/can't GC, how much time they can GC for, how many elements, and so on. Segregated drop/non-drop lists could go a long way, especially in combination with per-datastructure drop bags.
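By way of illustration, something along these lines is what I have in mind for bounding a single GC step (a purely hypothetical sketch; Bag, collect_with_budget, and the budget itself are made-up names, not crossbeam API):

use std::collections::VecDeque;

// Hypothetical per-datastructure garbage bag with a bounded collection step.
struct Bag {
    // Deferred destructors, queued when nodes are unlinked.
    garbage: VecDeque<Box<dyn FnOnce() + Send>>,
}

impl Bag {
    // Run at most `budget` deferred destructors and then return, so a single
    // pin never pays for an unbounded GC run.
    fn collect_with_budget(&mut self, budget: usize) {
        for _ in 0..budget {
            match self.garbage.pop_front() {
                Some(destroy) => destroy(),
                None => break,
            }
        }
    }
}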

Memory usage control: Similar situation to above. You have to accept that in a concurrent situation, you can't get exact control over how much memory is used. Refcounting can free memory more quickly, but still can't free you from a stuck thread. With per-datastructure epochs, or multiple distinct EBR groups that a structure can register with, a lot could be done to prevent a stuck thread in some random location from blocking memory reclamation.
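For what it's worth, this is roughly what separate Collector/LocalHandle instances give you; a minimal sketch, assuming that kind of API (as exposed by the crossbeam-epoch rewrite, or something shaped like it):

use crossbeam_epoch::{Collector, Guard};

struct MyStructure {
    collector: Collector,
    // ... the actual lock-free structure would live here ...
}

impl MyStructure {
    fn new() -> Self {
        MyStructure { collector: Collector::new() }
    }

    fn pin(&self) -> Guard {
        // Pin only this structure's epoch domain; a thread stuck inside a
        // pin on some other collector cannot hold back reclamation here.
        // (In practice you'd cache the LocalHandle per thread instead of
        // registering on every call.)
        self.collector.register().pin()
    }
}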

Maybe a better plan would be to find a scheme that lets eager destruction work on top of crossbeam without preventing one from submitting destructors to the garbage bags. While exposing this to users would increase complexity, maybe it could be hidden by default, so that only users who can sacrifice throughput for tighter drop latency would opt into it. Sounds tricky though.

Shameless self-plug: I already have one here. It's a bit weird in that it does ordering on key hashes instead of keys themselves, but that's easily fixable.

For transactions though, lists are the better choice due to the existence of efficient implementations.

That's largely due to Java having managed memory, unfortunately. The problems of memory reclamation are non-existent. Designing concurrent APIs in languages like Rust and C++ is a lot trickier, but hopefully we can leverage lifetimes to get something nice.

If you are concerned about cmpxchg16b performance, one hacky workaround is to add an ABA tag so that you can do loads using smaller primitives.

For example,

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

// layout assumes x86-64: little-endian, GCC-style bitfields, 16 bytes total
struct queue {
    __uint128_t tag : 32U;
    __uint128_t head : 48U;
    __uint128_t tail : 48U;
};

struct queue fast_load(struct queue _Atomic *queuep)
{
    uint32_t tag;
    uint32_t a;
    uint32_t b;
    uint32_t c;
    // bad, C-standard-violating hack: view the 128-bit object as four
    // 32-bit words so it can be read without cmpxchg16b
    uint32_t _Atomic *words = (uint32_t _Atomic *)queuep;
    tag = atomic_load_explicit(words, memory_order_acquire);
    for (;;) {
        a = atomic_load_explicit(words + 1U, memory_order_acquire);
        b = atomic_load_explicit(words + 2U, memory_order_acquire);
        c = atomic_load_explicit(words + 3U, memory_order_acquire);
        uint32_t tag_again = atomic_load_explicit(words, memory_order_acquire);
        // if the tag is unchanged, the four words form a consistent snapshot;
        // this relies on every writer bumping the tag in its cmpxchg16b
        if (tag == tag_again)
            break;
        tag = tag_again;
    }
    // reassemble the 128-bit value from the four 32-bit words
    struct queue result;
    memcpy((char *)&result, &tag, sizeof tag);
    memcpy((char *)&result + 4U, &a, sizeof a);
    memcpy((char *)&result + 8U, &b, sizeof b);
    memcpy((char *)&result + 12U, &c, sizeof c);
    return result;
}

Would hazard pointers be a good choice for eager deallocation? I can think of latency-critical situations that need eager deallocation.

Thanks for your feedback @anon19897381.

It's a bit unfortunate that we lose wider applicability by using CMPXCHG16B, though.

If the performance numbers hold up, maybe it's worth having different implementations on different architectures? Either using @sstewartgallus's CMPXCHG16B version, or having a completely different implementation of the whole data structure.

Regarding hazard pointers, I'm not sure why you'd want to use them. Crossbeam's EBR is almost strictly better.

Maybe I dismissed Crossbeam's EBR too quickly. Let me try it again.


Maybe it's a little bit late, but I would like to volunteer for the Crossbeam project. My expertise is in C/C++ relaxed memory models, on which I have published two conference papers (http://sf.snu.ac.kr/jeehoon.kang/). Hopefully I can review/make PRs on weak-memory orderings.


You've got an invite, we'd love to have your help!
