Async-traits - the less dynamic allocations edition

I will start this post with a shoutout to to @dtolnay! async-trait is a great crate, which removes a lot of friction trying to work with "Rust async"!

While async functions alone are nice, we still need some kind of interfaces to make our software more flexible and to test it. Traits for async Rust are kind of a tricky topic. There exist poll_() methods which are object-safe, and there is the hope that GATs at some day might provide true zero-cost async interfaces. All of those are however harder to understand and use than async-trait, which provides traits in a form that resemble the synchronous counterparts.

However up to async-trait comes with 2 downsides compared to an ideal version of async traits:

  1. Dynamic dispatch requires an indirect jump and prevents inlining and some further optimizations
  2. it requires a dynamic allocation for each invocation of an async method on a trait.

As the title of this post hints, there exist some ideas around how to reduce the overhead of the 2nd part:

While thinking about the problem I recalled some optimizations and findings around async methods in other ecosystems - here most notably the .NET ecosystem.

In .NET the return values of async methods initially all had to be heap allocated (in the form of a Task<T> object). C#'s async methods are not "lazy" as Rust's, and therefore the .NET team could already earlier on make an optimization which reduced the amount of required allocations: If the method had chances to often finish without suspension, it could return a ValueTask<T> object which avoids the heap allocation for the state-machine. This makes sense in a lot of cases - e.g. when writing on sockets, which are typically "ready" and do not "block".

The ValueTask optimization for async operations which do not actually suspend unfortunately does not carry over to Rust: Due to the lazy nature of Rusts Futures we have to allocate the Future before the first check for completion occurs.

However later on the .NET team made another realization:

  • An async method on an object is not concurrently called in most cases. Users call the method and await the returned Tasks.
  • This means for each async method on an object there exists typically only one Task<T> object in memory.
  • This object is a good candidate for pooling and reuse!

After investing some more thoughts into this, I figured out that this approach actually can carry over into Rust:

  • async methods on Rusts trait objects are typically also only called one at a time, and the returned Future is immediately .awaited.
  • if the method takes &mut self - this would even be guaranteed!
  • the implementation behind the async trait always returns the same concrete Future which needs to get boxed, so any allocation which can hold the state of the initial Future can also hold the state of additional Futures.

==> Based on this I thought it should be possible to reuse allocations for Futures returned from async trait objects in Rust, in case those methods are called more than once.

I implemented a proof of concept of this of the approach here.

The results are rather promising, so I wanted to share them here. E.g. on windows I could observe an up to 5x performance improvement for repeated calls to async functions on trait objects. On Linux the performance improvement was far lower and very different between memory allocators. More details are in the repositories readme.

The whole appraoch definitely needs a bit more evaluation on real applications, but I'm still rather excited about it. For some applications it might open up the opportunity to use an easy-to-implement async fn based Stream implementation instead of a manual poll_fn based version. It could also allow us to rethink what the best way is to represent async IO traits (which are typically called more than once).

The linked repository contains a bit more description. The code is a bit beyond proof of concept state, it might still need some fixes here and there. If you notice anything broken or undefined behavior feel free to open an issue on Github and/or propose a fix.

Back to async-trait itself: I think if this is interesting, then a future version of async-trait could integrate support for it which is invisible from an API point ov view: The return type of async methods would need to be chaned from Box<Pin<dyn Future>> to DynamicFuture - as shown in the linked repository. In this case implementations of the traits could decide whether they want to return unique heap-allocated Futures for each call - or whether they would rather want to reuse Futures between calls. This could easily be configured by an attribute on methods. Since DynamicFuture is fully type erased the behavior would not impact consumers.

25 Likes

I did a bit more experimentation on this this morning. I figured out I could also add an implementation of DynamicFuture which mostly resembles Pin<Box<dyn Future>> - which means that every instance is allocated on the heap instead of using recyling. This variant can be used by the recycling allocator if recycling is not possible - and won't carry any overhead of refcounting.

This variant can be found here.

I didn't expect this to behave very different from Pin<Box<dyn Future>> - but to my huge surprise it can show 5-30% performance benefit in benchmarks :hushed::

nested_stream_benches/async-trait
                        time:   [5.2282 us 5.2461 us 5.2624 us]

nested_stream_benches/non recyclable DynamicFuture
                        time:   [3.2658 us 3.2727 us 3.2804 us]

nested_stream_benches/recyclable DynamicFuture
                        time:   [1.5016 us 1.5052 us 1.5099 us]

My guess so far is that this is due to the vtable pointer being stored in the Future instead of in the heap allocated state, which requires one less indirection to load the vtable for a poll or drop call. That should be the only difference between the implementations.

Most interestingly storing the function pointers in the Future directly - which should require even one less indirection than looking up the vtable first - made things again slower instead of faster. My guess so far is due to the optimizations in the CPU, where memory access and/or branch prediction are more optimized depending on the location of function pointers.

Apart from that there might obviously also exist a bug or oversight which blurs the comparison. If you find it let me know :slight_smile:

6 Likes

I'm suspicious of performing allocation optimizations like this at the library or language level, without knowledge of the underlying allocator being used or the usage pattern in the application. What would be interesting would be to figure out how to allow users to introduce this sort of optimization when they know that their use case benefits from it.

3 Likes

Just wanted to quickly mention this could work really well with procedural vtables (e.g. RFC-2967). Then the async_trait crate could do the memory recycling instead of the language itself (while leaving the impl AsyncTrait case completely untouched). If you want to know what that would look like, I did a sketch of it here (using my version of RFC-2967 because I understand it better): https://gitlab.com/-/snippets/2013937

1 Like

I'm suspicious of performing allocation optimizations like this at the library or language level, without knowledge of the underlying allocator being used or the usage pattern in the application.

There is nothing language level involved here, and also nothing on an "interface" level. If one would standardize an async-trait based Stream type [1], which makes use of DynamicFuture, then each implementation can still choose an implementation strategy that is best for their use-case.

And I think library implementors typically have a vergy good understand how their libraries and contained types are used. E.g. if I implement an async TCP, TLS or Quic library then I know very well that users would call the read or write functions more than once in most cases. However I know that they will call connect at most once - or that any kind of async factory function will be called at most once (and that it might not actually be async). Based on that knowledge I can measure how the library will perform when opting in or out of a recycling mechanism. And since everything is dynamically dispatched they can even let their users make the choice. A simple boolean parameter could do the job - the only drawback is then increasing the size of the async trait object by 1 ptr size.

And about

without knowledge of the underlying allocator

in particular: The main benefit of reducing allocations is that it makes application performance far more predictable without having to rely on a very peculiar allocator behavior. Yes, there are some allocators which are doing better on lots of allocations and memory churn then others - and they are great. However no allocations still beat great allocators :slight_smile:

[1] I am not proposing this here.

Hmm - it definitely seems like it's related. However I can not fully grasp what is going on here. Your implementation of the async trait has 2 impl Writer blocks - which use a different return type. Besides not knowing which one would actually be used I don't think this has the implementation independent return type that the examples in my repository have. There the type of the returned Future is always DynamicFuture<T> - independent on how memory is managed. I think in your example the user would need to explicitly choose between dyn Writer and Recycle<dyn Writer> - right? In mine the user always only sees a dyn Writer which returns a DynamicFuture - and the Recycler is an optional implementation detail.

So one point of dynamic dispatch here is also to abstract and type erase the backing storage location. I'm not sure whether the linked RFCs capture this aspect. As far as I understand them they are more about where to store the vtable if the storage location is known.

I'm not sure whether the linked RFCs capture this aspect.

The point of RFC-2967 here is that it allows full customization of the dyn Writer type and how it implements the Writer trait, to the extent that all impl Writer could continue to use -> impl Future for static dispatch but would switch to -> DynamicFuture for dynamic dispatch after being cast to a dyn Writer. That way you get the minimum abstraction cost in both cases.

I think in your example the user would need to explicitly choose between dyn Writer and Recycle<dyn Writer> - right?

That is correct but the reason I did so was to simplify the example and make it clear where the Recycler was being kept: along side the dyn Writer via wrapper struct. Really it could be stored onboard the dyn Writer instead if desired.

As for DynamicFuture vs &move dyn Future, the two are totally identical as far as I can tell. DynamicFuture is the realistic of the two (&move is pretty unlikely to happen any time soon) but I thought the equivalence was sorta interesting and worth pointing out.

// We shouldn't have any alignment issues, since RecyclableFutureHeader // is aligned to usize - which should cover what everything else needs.

Actually there are possibility of types with bigger alignment. For example buffers used with mmx/avx could have alignment requirements (MOVAPS require up to 64 byte alignment, and it's relative MOVUPS wich is bit slower but don't require alignment).

I'm not sure if itt applies for variables inside future, and someone would store io buffer there, but unbuffered io has requirement to align buffer to device block size (eg 512 or 4096 bytes).

So what I am saying is that 8 byte alignment won't be enaught for all cases. Not to be relevant for the proof of concept, but then comment should be reprased.

UPD.

It's indeed debug asserted, so would blow up in case someone would try using it with inappropriate type.

Furthermore, we could squeeze whole header into 8 bytes regardless of target bitness, in contrast of two usizes.

Fitst, we could store size in u32, as if we have Future bigger than 4 GiB we are already having bigger troubles, and this allocation trick won't help a lot.

Second: we could use AtomicU8 or AtomicU16 for counter, as we are using only 2 values, 1 and 2. It could be even AtomicBool perhaps, but that won't help in any way.

Third: alignment is always power of two, so we could store it in u16 as power²-1 like:

real_align= 1 << header.stored_align

P.S. sorry if i was incorrect in any way, I was typing it from phone before sleep.

Nice work on this proof of concept :+1:

1 Like

Good call. I'm not sure whether anyone would use those in a Future, but it could still be nice to handle them.

but unbuffered io has requirement to align buffer to device block size (eg 512 or 4096 bytes).

I'm less worried about these. People would likely have them in another buffer type (e.g. to make the buffers poolable too) and not as part of the Future. And "normal" IO buffers which are [u8] work just fine.

Furthermore, we could squeeze whole header into 8 bytes regardless of target bitness, in contrast of two usize s.

Yes! those tricky could indeed help cutting the size down.

After looking at some of the things that @samsartor had posted I thought about another way to do it: Just like rustc does, the Layout could be stored as part of the vtable instead of the Future. It's 'static there, and therefore mostly for free. From there it can be passed to the allocate/deallocate methods.

There is a catch however: The RecyclableFutureAllocator has no reference to the vtable - since it can allocate any kind of Future. So this would require adding the size information there for cases where the storage has to be deallocated while no Future had been handed out. This would just move the memory around - but not save it.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.