Notes + Interviews on Fallible Collection Allocations

Bingo.

Any Rust code that does neither I/O nor manual memory management is automatically exception-safe in all but a very few corner-cases. For example, a parser is generally a pure function fn Parse(&str) -> Result<ParseTree, Vec<ParseError>>. On panic, all allocations are automatically cleaned up and the code can safely continue. In fact, one could change the API to return Result<ParseTree, Result<Vec<ParseError>, InsufficientMemory>> and recover from memory exhaustion without any changes.

C++ has a very hard time with exception safety for various reasons, not the least of which are the pervasiveness of manual resource management (in poorly-written code) and user-defined move constructors (which can throw). The latter simply doesn’t exist in Rust (memcpy cannot fail!) and the former is extremely rare in Rust, not least because it is generally unsafe. Some complex data structures might be left in inconsistent states if a panic occurs during traversal, and some system resources require memory to release the resource, not just to acquire it. However, I suspect that both of those are small minorities.

I know that if an RDBMS or web server I was using ran out of memory and crashed, I would report that as a bug, on the grounds that a crash in such a system is likely to mean production downtime. IMO panicing on allocation failure and catching the panic near top-level is much, much easier than the alternative (manually tracking all allocations and ensuring that they stay under a certain limit) which is extremely error-prone. I don’t need it to do anything fancy on OOM. Just log it (ignoring failure), and return a 500 Internal Server Error or rollback the transaction.

2 Likes

I agree exception-safety is a fraction of the difficulty in Rust, but unsafe code is the key place where this isn’t true. Lots of code uses Vec unsafely, to e.g. pass an uninitialized buffer to some C code.

I am not satisfied with “well getting completely owned by exception-safety problems will be rare”.

I am willing to consider adding an oom=unwind toggle for rust, though. But I would advise against its usage in general.

Actually, I’d like to say we should table any discussion of unwinding since it’s irrelevant to the motivating use-case (and largely irrelevant to what I consider to be the second-most-important one: embedded).

The server case is interesting, oom=unwind can be handled orthogonally.

1 Like

Perhaps one way you could design this (which would fit in well with custom allocators) is through adding another type parameter to Vec:

pub struct Vec<T, Allocator> { /* stuff */ }.

The Allocator trait would have some way to perform an allocation:

fn allocate(size, alignment) -> Result<*mut u8, Error>.

For an infallible allocator, you simply set Error to enum Void {} or ! (i.e. non-instantiatable type).

Then, you avoid the duplication of having to have fallible & infallible versions of Vec, by making every method look somewhat like this:

fn push(&mut self, item: T) -> Result<(), Error> where the Error is the same type as above. Of course, if you’re using an infallible allocator, this Result essentially is ().

Now, obviously, this doesn’t actually work, as Rust doesn’t appreciate that Result<(), !> == (). As a stopgap, you could make a wrapper type that unwraps all of these results, but then we’re back to the duplication thing again. However, if Rust ever does gain the ability to equate Result<(), !> to (), then this model may prove helpful, as you can get rid of the duplicated wrapper version.

2 Likes

Hooking fallibility into the (already existing) allocator type parameter is equivalent to the Vec<T, Fallible> proposal, no?

[rereading intensifies] oh crap, yeah - sorry! chalk one up to not reading things carefully enough… (to be fair, though, the Fallibility type thing didn’t really mention hooking it into the allocator API - I was suggesting adding the concept of an “infallible allocator” instead of a Vec that acts infallibly. Doing it that way sorta helps prevent code duplication - if you wanted an infallible HashMap, all your scary “abort on OOM” routines are in one thin wrapper over some fallible allocator instead of having to be distributed across collections. It also opens up the opportunity for weird allocators that e.g. try to stack-allocate and, if they fail, heap-allocate instead).

1 Like

I have another generics based idea for solving the problem. We change all methods that can fail due to oom to have a generic parameter of an AllocFailure trait, which is defaulted to give the current experience:

#![feature(default_type_parameter_fallback, associated_type_defaults)]
trait OomFailure<ReturnTy> {
    type ResultType = ReturnTy;
}

struct AbortOnOom;
impl<T> OomFailure<T> for AbortOnOom {}

struct OomResult;
struct OomError;
impl<T> OomFailure<T> for OomResult {
    type ResultType = Result<T, OomError>;
}

struct Vec<T>(T);

impl<T> Vec<T> {
    fn push<AF: OomFailure<()> = AbortOnOom>(&mut self, t: T) -> AF::ResultType {
        unimplemented!()
    }
}

fn foo<T>(v: Vec<T>, u: T, w: T) {
    v.push(u); // doesn't work yet due to https://github.com/rust-lang/rust/issues/36887#issuecomment-296787518 being unresolved
    v.push::<AbortOnOom>(u); // the above line should work just like this one
    v.push::<OomResult>(w); // warning: unused `std::result::Result` which must be used
}
4 Likes

I think the server aspect of this discussion is under-appreciated. Note that none of the people you interviewed are working on servers. Embedded is not quite the same.

I think there are at least three different types of servers with different memory uses:

  • Request-based (think web application): many concurrent requests with individually low memory use. Technically you might be able to isolate the effects of memory allocation failure between individual requests.
  • Classic RDBMS: a long-running process that uses as much memory as possible. A process should never die.
  • Big Data: a (relatively) short-running process that uses as much memory as possible. A process abort can be handled gracefully but should be avoided as it increases overall computation time.

I can put you in touch with a Spark developer if you’re interested.

1 Like

The pushback on fallible allocations is incredibly bizarre to me.

Graceful handling is the status quo with C libraries, and Rust is supposed to be a systems programming language with a high level of control over memory. I was shocked to find Rust lacking in this when I first started using it full-time, and a year and a half later the same flawed arguments against proper allocation keep returning.

I’m questioning what Rust is even for, now.

4 Likes

There’s been some discussion of isolating per-request (or per-work unit) failures by having allocation panic (via unwinding) and threads catch the unwind. We should keep in mind that this only works for the thread-per-task model, but not for an event loop model (ie, used in async IO) with worker threads. For that, you actually need normal fallible allocation that returns an error when allocation can’t be performed.

2 Likes

Depends on how expensive panic isolation is. Potentially something like tokio could wrap every single invocation into a Task and fail that Task as a whole if it panics. If you have something like a web server then it could isolate the request processing into a separate Task from the overall request handling which would allow returning a 500 on allocation failure during the actual processing stage.

Unwinding is not a panacea. Depending on unwinding for continuity makes it so that you can’t use std::sync::{Mutex, Once, RwLock} anywhere in your code because of std::sync::PoisonError.

Can you elaborate on the problems? It seems like you just tack .catch_unwind() onto your futures and call it a day? (outside my area of expertise by a mile)

Can you elaborate on the problems? It seems like you just tack .catch_unwind() onto your futures and call it a day? (outside my area of expertise by a mile)

Sure. Two issues that I see: performance and unwinding. I benchmarked it, and the performance overhead of catch_unwind isn’t bad, but it’s definitely worse than just matching on a Result (I forget the exact numbers - this was about a week ago). Also, this assumes that the panic strategy is unwind - it doesn’t work for panic = abort.

How is this different than using threads? I don’t see what’s unique about futures here.

I suppose that the panic = abort vs panic = unwind issue isn’t different - you’re right. For the performance issue, I’m thinking that in the threaded model, you have some outer loop that loops acquiring more work to do and then doing it, and the catch_unwind is called in a parent function of that loop, so the check is only performed either when the thread is quitting normally or when there’s been an unwinding panic. On the other hand, since in an event loop-based system, the unit of work that you’d want to restart on a panic is a single future/async execution, you’d need to catch at that granularity, and thus you’d need to do catch_panic around each of those executions, checking whether you were panicking each time you were about to switch to do a different unit of work.

Based on the discussion and some others I’ve had elsewhere I’ve posted the following RFC: https://github.com/rust-lang/rfcs/pull/2116

It ain’t perfect, but to be blunt I’m exhausted and this problem is awful.

3 Likes

Could you elaborate? Poisoning actually helps dealing with unwinding; if it wasn’t for poisoning, unwinding would be much more likely to introduce subtle bugs into programs. So it actually seems to me like especially when you do unwinding should you use concurrency primitives that do poisoning.

If you don’t do unwinding, things will never be poisoned anyway.

I’ve responded to your question on GitHub to keep the discussion in one place.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.