Rust's error handling is typically quite efficient and low-overhead, but there's a major performance gap compared to exceptions/panics with returning Result<T, E>
because of a forced move due to an inefficiency in the ABI.
Problem
Compare:
type LargeType = [u8; 4096];
#[inline(never)]
fn create_large_type() -> LargeType {
[0u8; 4096]
}
pub fn create_two_large_types() -> (LargeType, LargeType) {
(create_large_type(), create_large_type())
}
#[inline(never)]
fn create_large_type_result() -> Result<LargeType, ()> {
Ok([0u8; 4096])
}
pub fn create_two_large_types_result() -> Result<(LargeType, LargeType), ()> {
Ok((create_large_type_result()?, create_large_type_result()?))
}
create_large_type(_result)
simulates a function that cannot be inlined, for some reason (e.g. it's simply long and often reused).
With create_two_large_types
, return-value optimization can be used to construct the two large values in the same location where the callee of create_two_large_types
excepts it without creating and then bitwise-moving a temporary. Indeed, this is what rustc does with optimizations enabled.
With create_two_large_types_result
, however, RVO cannot be used because Result<(LargeType, LargeType), ()>
clearly does not contain a Result<LargeType, ()>
as a subfield. Indeed, rustc creates temporaries and issues a memcpy after verifying that the Ok
variant is selected.
Rationale
I'm not sure how wide-spread and important this problem is. I'm worried that even when T
is small and memcpy is inlined, the code bloat due to all the mov
s might affect performance negatively due to cache pollution.
Another troublesome part is that we're leaving performance on the table without even explicitly mentioning it. anyhow goes out of its way to provide a single-word error type with a niche, supposedly because this leads to noticeable performance increments. Indeed, introducing the niche means Result<T, anyhow::Error>
is just T
with one additional machine word, which seems to be efficient, but actually isn't the moment we need to decompose the Result
, which we constantly implicitly do with the ?
operator.
Proposed fix
As the problem stems from cross-function returning of Result<T, E>
values, perhaps it's most productive to change the calling convention in this case.
For now, assume no niche optimizations are present.
Instead of receiving a single pointer where the return value of type Result<T, E>
is supposed to be stored (in register rdi
for x86-64), functions would receive two pointers *mut T
and *mut U
(e.g. rdi
and rsi
) and return a boolean flag indicating success (via rax
, or perhaps the carry flag).
Indeed, this doesn't require hard-coding Result
but can be introduced as an opt-in feature for arbitrary enums, e.g.:
#[decompose_on_return]
enum Result<T, E> {
Ok(T),
Err(E),
}
In case of a more complicated enum, instead of returning a boolean flag, we would return the discriminant.
Let's see how the code at the start of the post is compiled with this ABI.
type LargeType = [u8; 4096];
#[inline(never)]
fn create_large_type_result() -> Result<LargeType, ()> {
Ok([0u8; 4096])
}
pub fn create_two_large_types_result() -> Result<(LargeType, LargeType), ()> {
Ok((create_large_type_result()?, create_large_type_result()?))
}
According to the new ABI, create_two_large_types_result
takes rdi
pointing at a place to construct (LargeType, LargeType)
and rsi
at... well, supposedly rsi
can be entirely removed for ZSTs. It then calls create_large_type_result
without modifying rdi
, and verifies whether rax
indicates that an error happened. If it did, the temporary is destructed and the error in rax
is bubbled up. If it didn't, create_large_type_result
is called once again with rdi
increased by 4096 bytes. It then checks whether an error happened once again and if it didn't, leaves rax
as-is and returns.
Note that this sort of ABI change (almost) does not slow down storing the Result
returned by a function to a local variable. Indeed, with the old ABI, rdi
was set to the address of the local Result
object, while with the new ABI, both rdi
and rsi
would be set to the addresses of the corresponding fields in the enum and the discriminant would be saved after the call. This increases the callee's code size a tiny bit, but this effect is hopefully compensated by the other improvements provided by this ABI change.
Now what do we do about niches? I propose that whenever a variant fits in a machine word with a niche, that variant is returned by value in a register (rax
, for x86-64) instead of necessiating an output pointer (rdi
/rsi
). The niche value is used to indicate that the other variant is used.
For instance, consider the anyhow error type. A function returning anyhow::Result<T> = Result<T, anyhow::Error>
would accept the pointer at which to create a T
instance in rdi
, just like a function returning T
. In contrast to a function returning T
, however, it returns anyhow::Error
(a machine-sized smart pointer, effectively) in rax
, or 0 if the Ok
variant is selected and the T
was constructed. This enables RVO and furthermore simplifies error checks, no longer requiring read from memory to check if an error has happened.
Of course, this could be extended to enums with more than 2 variants, but perhaps this is enough for now. Am I missing anything? Has this been proposed before?
Prior art
The only general-purpose instance of my idea, as far as I'm aware, is P0709:
It could simplify our C++ implementation engineering if C functions could gain the ability to return A-or-B values, so for example:
_Either(int, std_error) somefunc(int a) { return a > 5 ? _Expected(a) : _Unexpected(a); } // ... _Either(int, std_error) ret = somefunc(a); if(ret) printf("%d\n", ret.expected); else printf("%f\n", ret.unexpected);
Here
_Either
would be pure compiler magic, likeva_list
, not a type. Under the hood, functions would return with some CPU flag (e.g., carry) set if it were a right value, clear if it were a left value. An alternative is setting a bit in the thread environment block which probably is much better, as it avoids a calling convention break, though potentially at the cost of using thread local storage. It would be up to the platform vendor what to choose (this is an ABI extension), but the key is that C and platforms’ C-level calling conventions would be extended to understand the proposed union-like return of values from functions to use the same return channel storage, with a bit to say what the storage means.
In Rust, we don't have to bother about making _Either
compile-time magic, or worry about breaking ABI, so perhaps using the carry flag or a separate register is a fine solution.
Outside of the C/Rust world, I've often seen returning error flags via the carry flag in low-level programming. This is mostly the same idea as proposed in this post, just with ()
as the error type. I'm not quite sure if using a register containing 0 or 1 is better or worse than a carry flag, but supposedly the former is easier to implement and does not suffer from flag stall or require handling the error immediately, before any instructions have a change to change CF.