Notes + Interviews on Fallible Collection Allocations

I have another generics-based idea for solving the problem: change every method that can fail due to OOM to take a generic parameter bounded by an OomFailure trait, with a default that preserves the current behavior:

#![feature(default_type_parameter_fallback, associated_type_defaults)]
trait OomFailure<ReturnTy> {
    type ResultType = ReturnTy;
}

struct AbortOnOom;
impl<T> OomFailure<T> for AbortOnOom {}

struct OomResult;
struct OomError;
impl<T> OomFailure<T> for OomResult {
    type ResultType = Result<T, OomError>;
}

struct Vec<T>(T);

impl<T> Vec<T> {
    fn push<AF: OomFailure<()> = AbortOnOom>(&mut self, t: T) -> AF::ResultType {
        unimplemented!()
    }
}

fn foo<T>(mut v: Vec<T>, t: T, u: T, w: T) {
    v.push(t); // doesn't work yet due to https://github.com/rust-lang/rust/issues/36887#issuecomment-296787518 being unresolved
    v.push::<AbortOnOom>(u); // the above line should work just like this one
    v.push::<OomResult>(w); // warning: unused `std::result::Result` which must be used
}

I think the server aspect of this discussion is under-appreciated. Note that none of the people you interviewed are working on servers. Embedded is not quite the same.

I think there are at least three different types of servers with different memory uses:

  • Request-based (think web application): many concurrent requests with individually low memory use. Technically you might be able to isolate the effects of memory allocation failure between individual requests.
  • Classic RDBMS: a long-running process that uses as much memory as possible. A process should never die.
  • Big Data: a (relatively) short-running process that uses as much memory as possible. A process abort can be handled gracefully but should be avoided as it increases overall computation time.

I can put you in touch with a Spark developer if you’re interested.


The pushback on fallible allocations is incredibly bizarre to me.

Graceful handling is the status quo with C libraries, and Rust is supposed to be a systems programming language with a high level of control over memory. I was shocked to find Rust lacking here when I first started using it full-time, and a year and a half later the same flawed arguments against proper allocation handling keep resurfacing.

I’m questioning what Rust is even for, now.


There’s been some discussion of isolating per-request (or per-work-unit) failures by having allocation panic (via unwinding) and having threads catch the unwind. We should keep in mind that this only works for the thread-per-task model, not for an event-loop model (i.e., as used in async I/O) with worker threads. For that, you actually need normal fallible allocation that returns an error when the allocation can't be performed.
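
For concreteness, here is a minimal sketch of what such a Result-returning allocation path could look like; the AllocError and try_push names are purely illustrative, not an existing std API:

#[derive(Debug)]
pub struct AllocError;

pub struct Buffer {
    data: Vec<u8>,
}

impl Buffer {
    // Hypothetical fallible push: report OOM as an ordinary error
    // instead of aborting or unwinding the whole worker thread.
    pub fn try_push(&mut self, byte: u8) -> Result<(), AllocError> {
        // Stand-in for a real "reserve or fail" allocation call.
        if self.data.len() == self.data.capacity() {
            self.data.reserve(1); // imagine this returned Err(AllocError) on OOM
        }
        self.data.push(byte);
        Ok(())
    }
}

// An event-loop server can then fail a single request on OOM
// without affecting the other requests multiplexed on the same thread.
fn handle_request(buf: &mut Buffer, payload: &[u8]) -> Result<(), AllocError> {
    for &b in payload {
        buf.try_push(b)?;
    }
    Ok(())
}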


Depends on how expensive panic isolation is. Potentially something like tokio could wrap every single invocation in a Task and fail that Task as a whole if it panics. If you have something like a web server, it could isolate the request processing into a Task separate from the overall request handling, which would allow returning a 500 on allocation failure during the actual processing stage.
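
As a rough sketch of that isolation (the Request and Response types here are made up for illustration), the processing stage can be wrapped in std::panic::catch_unwind and a panic mapped to a 500:

use std::panic::{catch_unwind, AssertUnwindSafe};

struct Request;
struct Response(u16);

fn process(_req: &Request) -> Response {
    // ... request processing that might panic, e.g. on allocation failure ...
    Response(200)
}

// Only the processing stage is isolated; the surrounding request-handling
// machinery keeps running and turns the panic into an error response.
fn handle(req: Request) -> Response {
    match catch_unwind(AssertUnwindSafe(|| process(&req))) {
        Ok(resp) => resp,
        Err(_) => Response(500),
    }
}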

Unwinding is not a panacea. Depending on unwinding for continuity means you can't use std::sync::{Mutex, Once, RwLock} anywhere in your code because of std::sync::PoisonError.
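
For illustration, every lock site has to decide what to do with a poisoned lock; a common (and easy to forget) way to recover is to pull the guard back out of the PoisonError:

use std::sync::{Mutex, PoisonError};

fn increment(counter: &Mutex<u64>) {
    // If another thread panicked while holding the lock, `lock()` returns
    // Err(PoisonError); here we recover the guard and carry on regardless.
    let mut guard = counter.lock().unwrap_or_else(PoisonError::into_inner);
    *guard += 1;
}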

Can you elaborate on the problems? It seems like you just tack .catch_unwind() onto your futures and call it a day? (outside my area of expertise by a mile)

Sure. Two issues that I see: performance and the panic strategy. I benchmarked it, and the performance overhead of catch_unwind isn't bad, but it's definitely worse than just matching on a Result (I forget the exact numbers; this was about a week ago). Also, this assumes the panic strategy is unwind; it doesn't work with panic = abort.
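
For what it's worth, the two shapes being compared look roughly like this (no numbers implied, just the structural difference):

use std::panic::catch_unwind;

fn fallible() -> Result<u32, ()> {
    Ok(42)
}

fn may_panic() -> u32 {
    42
}

// Matching on a Result: the failure path is an ordinary branch.
fn via_result() -> u32 {
    fallible().unwrap_or(0)
}

// Catching an unwind: the failure path goes through the panic machinery,
// cheap when nothing panics but still more work than a plain branch,
// and unavailable under panic = abort.
fn via_catch_unwind() -> u32 {
    catch_unwind(|| may_panic()).unwrap_or(0)
}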

How is this different from using threads? I don’t see what’s unique about futures here.

I suppose the panic = abort vs. panic = unwind issue isn’t different; you’re right. On the performance issue: in the threaded model you have some outer loop that keeps acquiring work and doing it, and catch_unwind is called in a parent function of that loop, so the check is only performed when the thread quits normally or when there has been an unwinding panic. In an event-loop-based system, on the other hand, the unit of work you’d want to restart on a panic is a single future/async execution, so you need to catch at that granularity: a catch_unwind around each of those executions, checking whether you were panicking every time you were about to switch to a different unit of work.
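
A sketch of that difference in granularity (the worker-pool types are made up for illustration):

use std::panic::{catch_unwind, AssertUnwindSafe};

type Job = Box<dyn FnOnce() + Send>;

// Thread-per-task model: one catch_unwind around the whole worker loop,
// so the unwind check is only reached when the thread exits or panics.
fn worker_thread(next_job: impl Fn() -> Option<Job>) {
    let _ = catch_unwind(AssertUnwindSafe(|| {
        while let Some(job) = next_job() {
            job();
        }
    }));
}

// Event-loop model: the restartable unit is a single future/async execution,
// so each one has to be wrapped individually before yielding to the next.
fn run_one_task(task: impl FnOnce()) {
    let _ = catch_unwind(AssertUnwindSafe(task));
}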

Based on this discussion and some others I’ve had elsewhere, I’ve posted the following RFC: https://github.com/rust-lang/rfcs/pull/2116

It ain’t perfect, but to be blunt I’m exhausted and this problem is awful.


Could you elaborate? Poisoning actually helps when dealing with unwinding; if it weren't for poisoning, unwinding would be much more likely to introduce subtle bugs into programs. So it actually seems to me that especially when you do unwinding, you should use concurrency primitives that poison.

If you don't do unwinding, things will never be poisoned anyway.
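
A small demonstration of that point: if a thread unwinds while holding the lock, the next user sees the poison instead of silently observing a possibly half-updated value:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    let d = Arc::clone(&data);
    let _ = thread::spawn(move || {
        let mut v = d.lock().unwrap();
        v.push(4);
        panic!("unwound while holding the lock");
    })
    .join();

    // The lock is now poisoned, surfacing the fact that the protected data
    // may have been left in an inconsistent state.
    assert!(data.lock().is_err());
}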

I’ve responded to your question on GitHub to keep the discussion in one place.
