Rethinking Failure


It seems that making Rust code memory-safe during unwinding is rather complicated, and that unwinding also introduces performance problems. Additionally, current LLVM does not allow unwinding from failure during stack overflow, which means that recovery-on-failure does not actually protect against many kinds of logic errors.

The job of unwinding during failure is to release resources used by the failing task (otherwise, loop { sleep(UINT_MAX) } would be a decent implementation of failure). There are essentially 4 common ways of doing that:

  1. Return-Value based - every function that can fail can return, in addition to its normal return value, an extra “failure” return code. Functions receiving that return code free the resources they own and return failure. This has the advantage of not requiring runtime support, and of being doable (though ugly) even without language support, and the disadvantage of creating large amounts of relatively slow machine code.
  2. Table based/dw2-style - essentially a variant of the previous version, requiring less machine code but more runtime support. Takes advantage of the fact that the failure-handling code is rather structured to compress it and move it to out-of-band tables, in which it does not take extra execution time or cache space in the fast path. This has the disadvantage of requiring a good amount of runtime support, and being hard to explicitly control.
  3. Process based/sjlj-style - each task stores the resources it uses in a list along with a freeing function, maintains it, and ensures it is up to date whenever a failure is possible. This has the advantages of not messing with control flow, and of not requiring functions that don’t manage resources to be aware of it at all, at the cost of relatively low performance in resource-using functions. This method is essentially what kernels use to ensure process isolation, with the file-descriptor table serving as the resource list, and is typically useful when there are few resources (when managing memory using this method, it is generally used to manage a small number of large “arenas”, rather than managing each allocation on its own).
  4. GC - This one is pretty well-known: a garbage collector traces through all live tasks’ data structures, and collects everything that is no longer used. This method has many well-studied advantages and disadvantages unrelated to failure, but we can notice that it does not require dealing with failing tasks at all, only with live ones.
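
To make schemes 1 and 3 more concrete, here is a minimal sketch in present-day Rust. It is only an illustration under stated assumptions, not a proposed design: the names Resource, Failure, Task, register and fail are hypothetical, and the shared log exists only so the example can observe the order in which resources are released.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shared log so the example can observe the order in which resources are freed.
type Log = Rc<RefCell<Vec<String>>>;

// Hypothetical resource: records its release in the log when dropped.
struct Resource {
    name: &'static str,
    log: Log,
}

impl Drop for Resource {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("free {}", self.name));
    }
}

// Scheme 1: the failure travels as an extra return value.
#[derive(Debug, PartialEq)]
struct Failure;

fn inner(fail: bool) -> Result<u32, Failure> {
    if fail { Err(Failure) } else { Ok(1) }
}

fn outer(log: &Log, fail: bool) -> Result<u32, Failure> {
    let _a = Resource { name: "a", log: log.clone() };
    let _b = Resource { name: "b", log: log.clone() };
    // `?` checks the failure code; on Err the function returns early,
    // dropping `_b` and then `_a` on the way out.
    let v = inner(fail)?;
    Ok(v + 1)
}

// Scheme 3: the task keeps an explicit list of freeing functions,
// one per resource it currently owns.
struct Task {
    cleanups: Vec<Box<dyn FnOnce()>>,
}

impl Task {
    fn new() -> Self {
        Task { cleanups: Vec::new() }
    }

    // Record a freeing function for a resource the task now owns.
    fn register(&mut self, free: impl FnOnce() + 'static) {
        self.cleanups.push(Box::new(free));
    }

    // On failure, release everything in reverse order of acquisition.
    // No stack frames are walked; only the list is consulted.
    fn fail(&mut self) {
        while let Some(free) = self.cleanups.pop() {
            free();
        }
    }
}

fn main() {
    let log: Log = Rc::new(RefCell::new(Vec::new()));

    // Scheme 1: the failure code propagates and locals are freed on return.
    assert_eq!(outer(&log, true), Err(Failure));
    assert_eq!(*log.borrow(), vec!["free b".to_string(), "free a".to_string()]);

    // Scheme 3: resources are freed from the task's list.
    log.borrow_mut().clear();
    let mut task = Task::new();
    for name in ["fd 3", "arena 0"] {
        let log = log.clone();
        task.register(move || log.borrow_mut().push(format!("free {}", name)));
    }
    task.fail();
    assert_eq!(*log.borrow(), vec!["free arena 0".to_string(), "free fd 3".to_string()]);
}
```

In the first half every function on the call path has to test and forward the failure code, which is where the extra machine code comes from. In the second half only the code that acquires a resource touches the list, and intermediate functions need not know about failure at all.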

Currently, Rust forces every program into dw2-style resource management, but some programs would prefer to use other schemes, and would rather not deal with unwinding at all.