Efficiently bubbling Results

Rust's error handling is typically quite efficient and low-overhead, but there's a major performance gap compared to exceptions/panics with returning Result<T, E> because of a forced move due to an inefficiency in the ABI.

Problem

Compare:

type LargeType = [u8; 4096];

#[inline(never)]
fn create_large_type() -> LargeType {
    [0u8; 4096]
}

pub fn create_two_large_types() -> (LargeType, LargeType) {
    (create_large_type(), create_large_type())
}

#[inline(never)]
fn create_large_type_result() -> Result<LargeType, ()> {
    Ok([0u8; 4096])
}

pub fn create_two_large_types_result() -> Result<(LargeType, LargeType), ()> {
    Ok((create_large_type_result()?, create_large_type_result()?))
}

create_large_type(_result) simulates a function that cannot be inlined, for some reason (e.g. it's simply long and often reused).

With create_two_large_types, return-value optimization can be used to construct the two large values in the same location where the callee of create_two_large_types excepts it without creating and then bitwise-moving a temporary. Indeed, this is what rustc does with optimizations enabled.

With create_two_large_types_result, however, RVO cannot be used because Result<(LargeType, LargeType), ()> clearly does not contain a Result<LargeType, ()> as a subfield. Indeed, rustc creates temporaries and issues a memcpy after verifying that the Ok variant is selected.

Rationale

I'm not sure how wide-spread and important this problem is. I'm worried that even when T is small and memcpy is inlined, the code bloat due to all the movs might affect performance negatively due to cache pollution.

Another troublesome part is that we're leaving performance on the table without even explicitly mentioning it. anyhow goes out of its way to provide a single-word error type with a niche, supposedly because this leads to noticeable performance increments. Indeed, introducing the niche means Result<T, anyhow::Error> is just T with one additional machine word, which seems to be efficient, but actually isn't the moment we need to decompose the Result, which we constantly implicitly do with the ? operator.

Proposed fix

As the problem stems from cross-function returning of Result<T, E> values, perhaps it's most productive to change the calling convention in this case.

For now, assume no niche optimizations are present.

Instead of receiving a single pointer where the return value of type Result<T, E> is supposed to be stored (in register rdi for x86-64), functions would receive two pointers *mut T and *mut U (e.g. rdi and rsi) and return a boolean flag indicating success (via rax, or perhaps the carry flag).

Indeed, this doesn't require hard-coding Result but can be introduced as an opt-in feature for arbitrary enums, e.g.:

#[decompose_on_return]
enum Result<T, E> {
    Ok(T),
    Err(E),
}

In case of a more complicated enum, instead of returning a boolean flag, we would return the discriminant.

Let's see how the code at the start of the post is compiled with this ABI.

type LargeType = [u8; 4096];

#[inline(never)]
fn create_large_type_result() -> Result<LargeType, ()> {
    Ok([0u8; 4096])
}

pub fn create_two_large_types_result() -> Result<(LargeType, LargeType), ()> {
    Ok((create_large_type_result()?, create_large_type_result()?))
}

According to the new ABI, create_two_large_types_result takes rdi pointing at a place to construct (LargeType, LargeType) and rsi at... well, supposedly rsi can be entirely removed for ZSTs. It then calls create_large_type_result without modifying rdi, and verifies whether rax indicates that an error happened. If it did, the temporary is destructed and the error in rax is bubbled up. If it didn't, create_large_type_result is called once again with rdi increased by 4096 bytes. It then checks whether an error happened once again and if it didn't, leaves rax as-is and returns.

Note that this sort of ABI change (almost) does not slow down storing the Result returned by a function to a local variable. Indeed, with the old ABI, rdi was set to the address of the local Result object, while with the new ABI, both rdi and rsi would be set to the addresses of the corresponding fields in the enum and the discriminant would be saved after the call. This increases the callee's code size a tiny bit, but this effect is hopefully compensated by the other improvements provided by this ABI change.

Now what do we do about niches? I propose that whenever a variant fits in a machine word with a niche, that variant is returned by value in a register (rax, for x86-64) instead of necessiating an output pointer (rdi/rsi). The niche value is used to indicate that the other variant is used.

For instance, consider the anyhow error type. A function returning anyhow::Result<T> = Result<T, anyhow::Error> would accept the pointer at which to create a T instance in rdi, just like a function returning T. In contrast to a function returning T, however, it returns anyhow::Error (a machine-sized smart pointer, effectively) in rax, or 0 if the Ok variant is selected and the T was constructed. This enables RVO and furthermore simplifies error checks, no longer requiring read from memory to check if an error has happened.

Of course, this could be extended to enums with more than 2 variants, but perhaps this is enough for now. Am I missing anything? Has this been proposed before?

Prior art

The only general-purpose instance of my idea, as far as I'm aware, is P0709:

It could simplify our C++ implementation engineering if C functions could gain the ability to return A-or-B values, so for example:

_Either(int, std_error) somefunc(int a) {
    return a > 5 ? _Expected(a) : _Unexpected(a);
}
// ...
_Either(int, std_error) ret = somefunc(a);
if(ret)
    printf("%d\n", ret.expected);
else
    printf("%f\n", ret.unexpected);

Here _Either would be pure compiler magic, like va_list, not a type. Under the hood, functions would return with some CPU flag (e.g., carry) set if it were a right value, clear if it were a left value. An alternative is setting a bit in the thread environment block which probably is much better, as it avoids a calling convention break, though potentially at the cost of using thread local storage. It would be up to the platform vendor what to choose (this is an ABI extension), but the key is that C and platforms’ C-level calling conventions would be extended to understand the proposed union-like return of values from functions to use the same return channel storage, with a bit to say what the storage means.

In Rust, we don't have to bother about making _Either compile-time magic, or worry about breaking ABI, so perhaps using the carry flag or a separate register is a fine solution.

Outside of the C/Rust world, I've often seen returning error flags via the carry flag in low-level programming. This is mostly the same idea as proposed in this post, just with () as the error type. I'm not quite sure if using a register containing 0 or 1 is better or worse than a carry flag, but supposedly the former is easier to implement and does not suffer from flag stall or require handling the error immediately, before any instructions have a change to change CF.

3 Likes

Taken literally, this would mean rdi and rsi are pointing to overlapping memory, which would prevent some optimizations. Do you mean to have them be guaranteed-nonoverlapping? If so, that's an interesting thought – you could then write the discriminant next to whichever one was chosen as the actual return value, allowing the callee to use both of them as scratch space in the meantime (at only the cost of wasting some stack space if it turns out to not be needed).

Here's a thread of some prior discussion of similar ideas – I've specifically linked to a post that shows the gotchas with overlapping them: Pre-RFC: minimal, flexible placement and returning unsized types - #9 by notriddle

Here’s a use case: recursive deserialization code:

We ended up using serde’s secret deserialize_in_place (thanks, Gankra), but we might not have needed to if there was ABI support for “stret-in-place-with-discriminator”.

1 Like

You must not write the discriminator next to the output contents if the output address is embedded in a larger object. But I’m not sure what optimizations are impeded by having the pointers potentially alias each other (and no other memory), since a function will only write to at most one of them, and at the very end of execution…as long as LLVM can represent that.

1 Like

Taken literally, this would mean rdi and rsi are pointing to overlapping memory

Yes, this is what I kept in mind: as only one of them is supposed to be initialized, aliasing shouldn't be a problem. However, perhaps you're right that in practice this would hinder some optimizations, so perhaps wasted stack is not that bad a compromise.

Thanks for the reference to the pre-RFC! I'll take a look, but it seems like what they advise makes this relatively simple optimization much more difficult to get right in all cases.

Sorry, should've mentioned the context where it matters – the only time you need to write the discriminant is if other code in-fact requires you to produce an object of type Result<T,E>, rather than just immediately matching on it (or returning to other functions that immediately match on it). In the latter case, your "larger object" will semantically only be used after you match on the Result, so you didn't need to write the discriminant next to it in the first place.

That's what the post I linked is about! A function may have two mutable local variables foo and bar that it uses a scratch space, and, at the end, return either Ok(foo) or Err(bar). If the return pointers are nonoverlapping, it can avoid a copy, but if they are overlapping, at least one of those variables has to be in the local stack frame instead, because it can't use them as separate scratch spaces.

1 Like

Oh, now that I think about it – if we are customizing the calling conventions of functions, perhaps each individual function could decide whether it wanted nonoverlapping return pointers or would be just as happy with overlapping ones! Then there's no compromise to performance, because you would only have to allocate extra space in the caller's stack frame if it would otherwise be needed as an extra allocation in the callee's frame.

I believe it would actually be acceptable to apply LLVM noalias / C restrict to these pointers, actually. The requirement is a dynamic one that they aren't used in a way that aliases writes, but it's perfectly acceptable for the pointers to alias if they aren't actually used in an aliasing manner.

This isn't true at the Rust level, of course, since &mut function arguments are stronger.

2 Likes