Feedback from adoption of fallible allocations

I've converted most of my Rust projects to use fallible_collections, including a service running in production at "web scale". It does regularly abort itself due to actual OOM, on Linux. With plain libstd I had a couple thousand rust_oom coredumps per day. With try_reserve and proactive detection of low-memory situations I brought it down to 5-10 aborts per day, which are all in 3rd party code now.

I've heard that there's an open question whether Rust should use a different approach to fallibility and have a FallibleVec type or similar, like Vec<T, SomeFallibleNonDefaultFlag>.

In my experience with the fallible_collections prototype: type-based enforcement of fallibility is both too restrictive and not restrictive enough at the same time. It sucks.

Returning a non-standard variant of a Vec (in this case FallibleVec) is "viral" and forces callers of the function to also switch their Vec type.
If the caller also has to work with other functions that use the vec, it requires those other functions to change too. Sometimes it's impossible if it involves 3rd party crates or std (e.g. io::Read). If the Vec needs to be stored in a struct, then the methods of the struct and users of the struct also need to change, causing more and more changes. This makes the switch difficult, because a small change in one function may snowball into refactoring of the whole dependency tree. It also feels too viral, because not every place that receives a Vec needs the ability to grow it. Very often data ends up in a FallibleVec only because it had to be fallible at creation time, but afterwards it's effectively treated as immutable or fixed-size. The right type for such non-growable owned data should have been Box<[T]> rather than [Fallible]Vec<T>, but a conversion from a Vec with excess capacity to a boxed slice is not free, so in low-memory situations it's actually better to keep using a needlessly-growable Vec than to switch to a more theoretically-accurate Box<[T]>.
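To make the "not free" part concrete, here is a minimal sketch using only std. The helper names `freeze` and `excess_capacity` are made up for illustration. `Vec::into_boxed_slice` discards excess capacity before converting, which can mean a reallocation at exactly the moment memory is tight.

```rust
/// Hypothetical helper: turn a growable Vec into a fixed-size boxed slice.
/// `into_boxed_slice` first drops excess capacity (like `shrink_to_fit`),
/// so when `capacity() > len()` the allocator may have to reallocate.
fn freeze<T>(v: Vec<T>) -> Box<[T]> {
    v.into_boxed_slice()
}

/// Hypothetical helper: how many spare elements a Vec is holding on to.
fn excess_capacity<T>(v: &Vec<T>) -> usize {
    v.capacity() - v.len()
}
```

When `excess_capacity` is zero the conversion has nothing to shrink; otherwise keeping the Vec avoids touching the allocator at all.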

Adoption of a FallibleVec variant, despite its virality, is also quite insufficient for ensuring all allocations are handled fallibly throughout the codebase. This type used in APIs doesn't do anything about Vecs used inside function bodies. I've had to also find and eliminate all uses of temporary vecs inside functions, things like a .collect::<Vec<_>>() to a temporary to sort it or to convert [Owned] to [&borrowed]. It's not easy to grep for all such cases, because there are too many methods that may allocate, e.g. From/Into.
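As a rough sketch of what replacing such a temporary can look like with plain std: the helpers `try_collect_vec` and `sorted_refs` below are hypothetical names, not part of fallible_collections or std, and only `Vec::try_reserve` is assumed.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: collect an iterator into a Vec without aborting on OOM.
fn try_collect_vec<I>(iter: I) -> Result<Vec<I::Item>, TryReserveError>
where
    I: IntoIterator,
{
    let iter = iter.into_iter();
    let mut out = Vec::new();
    // Reserve the lower bound up front; size_hint is only a hint.
    out.try_reserve(iter.size_hint().0)?;
    for item in iter {
        // No-op while spare capacity remains, so `push` never has to grow.
        out.try_reserve(1)?;
        out.push(item);
    }
    Ok(out)
}

/// The "[Owned] to [&borrowed], then sort" case from the paragraph above.
fn sorted_refs(owned: &[String]) -> Result<Vec<&str>, TryReserveError> {
    let mut refs = try_collect_vec(owned.iter().map(String::as_str))?;
    // sort_unstable works in place; the stable `sort` may allocate a buffer.
    refs.sort_unstable();
    Ok(refs)
}
```

This only covers the Vec itself. If producing the items allocates (for example cloning Strings), that still has to be handled separately.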

Most importantly, adoption of a FallibleVec in my codebase does absolutely nothing about 3rd party code. Crates may return wrappers around aborting-Vec such as Image or Bytes. They should of course add fallibility support in their codebase, but my point is that my use of a FallibleVec type doesn't force them to do so. Even if a 3rd party crate's API doesn't expose any aborting types, there's no guarantee that it doesn't use them internally. I would like to detect and potentially block use of 3rd party code that aborts on OOM, but there are no language features for that, and type-based enforcement can't do it.

My conclusions so far:

  • Vec::try_reserve + Vec::try_with_capacity + Vec::try_extend are a big improvement. They're easy to adopt. Finding all places where allocations happen is a whack-a-mole, but once it's done it works well.

  • Handling fallibility through a FallibleVec type is not worth it. It's a pain from a usability and interoperability perspective, and it doesn't improve anything over Vec::try_reserve. Just like Vec::try_reserve it's only a partial opt-in solution. I would also be wary of code that uses generics to allow both fallible and non-fallible flavors of Vecs, because that can cause generics bloat, and still fails to give a guarantee that the program won't abort. I would not use such a type if it was in std.

  • I would like to have some additional solution that enforces fallible allocation for entire scopes or entire crates, including calls to 3rd party code, even code that uses aborting-Vec only privately. I think it would be similar to enforcing no panics.

  • I've never needed any detailed information from the allocation error. From the context it's always obvious what I was trying to allocate, and details don't matter beyond the fact that it failed. Any information about allocation size is redundant. Any information about remaining free mem is unusable due to inherent TOCTOU race, so there's really nothing useful I could do with details in allocation errors. The only option is to give up and propagate the error, so to me a zero-sized AllocError type is entirely sufficient.
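A small sketch of the style these conclusions describe, assuming only stable std (`try_reserve` and `try_reserve_exact`). `OutOfMemory`, `try_to_vec` and `concat` are illustrative names, not an existing API.

```rust
/// Zero-sized error: the caller already knows what it tried to allocate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OutOfMemory;

/// Hypothetical helper: copy a slice into a new Vec, reporting OOM as an error.
fn try_to_vec<T: Clone>(src: &[T]) -> Result<Vec<T>, OutOfMemory> {
    let mut out = Vec::new();
    // Returns Err instead of aborting when the allocation fails.
    out.try_reserve_exact(src.len()).map_err(|_| OutOfMemory)?;
    // Capacity is already reserved, so this does not grow the buffer.
    out.extend_from_slice(src);
    Ok(out)
}

/// The error propagates with `?` like any other error.
fn concat(a: &[u8], b: &[u8]) -> Result<Vec<u8>, OutOfMemory> {
    let mut out = try_to_vec(a)?;
    out.try_reserve(b.len()).map_err(|_| OutOfMemory)?;
    out.extend_from_slice(b);
    Ok(out)
}
```

Mapping `TryReserveError` to a unit struct throws away the details on purpose, matching the last point above. Note that `T::clone` itself may allocate for types that own heap data; the sketch only guards the Vec's own buffer.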

As a general point around this, I wish Rust had counterpart traits for fallible collection operations.

The main one I have in mind is Extend.

It'd be nice to have something like TryExtend:

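pub trait TryExtend<A> {
    type Error;

    // Appends every item, or reports why the container could not grow.
    fn try_extend<T>(&mut self, iter: &mut T) -> Result<(), Self::Error>
    where
        T: Iterator<Item = A>;

    // One possible default: clone each element and feed it to try_extend.
    fn try_extend_from_slice(&mut self, slice: &[A]) -> Result<(), Self::Error>
    where
        A: Clone,
    {
        self.try_extend(&mut slice.iter().cloned())
    }
}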

There’s been some conversation about the recent Linux/Rust RFC about using feature flags and recompilation of std to disable infallible allocation APIs for a given build. If that were available, do you think you would try to enable it for your service? I imagine it would probably require work with your dependencies to make sure they all have fallible variants of their allocating functionality.

@anp personally I'm interested in no_std / no-alloc use cases, but in such use cases fallible collections are provided by crates like heapless.

These crates do impl traits like Extend, but the problem is that with a fixed-size container, extending is fallible, and when it fails it panics, hence my interest in TryExtend.
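Here is a sketch of what that could look like for a fixed-capacity buffer. `FixedBuf` and `CapacityError` are hypothetical and deliberately simplistic (this is not how heapless stores its data); the trait is a reduced copy of the TryExtend idea so the example stands alone.

```rust
/// Reduced copy of the proposed trait, repeated to keep the sketch self-contained.
pub trait TryExtend<A> {
    type Error;
    fn try_extend<T>(&mut self, iter: &mut T) -> Result<(), Self::Error>
    where
        T: Iterator<Item = A>;
}

/// Hypothetical fixed-capacity buffer; works without any heap allocation.
pub struct FixedBuf<A, const N: usize> {
    items: [Option<A>; N],
    len: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub struct CapacityError;

impl<A, const N: usize> FixedBuf<A, N> {
    pub fn new() -> Self {
        Self { items: core::array::from_fn(|_| None), len: 0 }
    }
    pub fn len(&self) -> usize {
        self.len
    }
    pub fn get(&self, i: usize) -> Option<&A> {
        self.items.get(i)?.as_ref()
    }
}

impl<A, const N: usize> TryExtend<A> for FixedBuf<A, N> {
    type Error = CapacityError;

    fn try_extend<T>(&mut self, iter: &mut T) -> Result<(), CapacityError>
    where
        T: Iterator<Item = A>,
    {
        for item in iter {
            // A full buffer becomes an error instead of a panic. The item that
            // did not fit is dropped; items stored before it stay in the buffer.
            let slot = self.items.get_mut(self.len).ok_or(CapacityError)?;
            *slot = Some(item);
            self.len += 1;
        }
        Ok(())
    }
}
```

Taking `&mut T` rather than `T` lets the caller keep the iterator after a failure and decide what to do with the remaining items.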

If we're implicitly going off feedback on Rust in the Linux kernel, panic-free code is important as well, so it'd be nice to have some core traits for fallible buffers which are also panic-free.

I'm afraid that an absolute feature toggle for all of std would be too harsh, because it'd prevent non-fallible code from compiling, even code that is not actively used.

Such a compile-breaking split would require crates to specifically support the no-aborting-std case, similar to how they have to specifically support no-std today. I'm not sure the crates ecosystem would embrace that enough not to cause friction. E.g. a dependency-of-a-dependency-of-a-dependency may just want to format! a string somewhere, and it'd be hard to argue they can't do such a trivial thing because my big server needs to prevent allocs elsewhere for other reasons.

Not sure how painful it is to find all such cases, but at least in theory ktrianta/rust-callgraphs or rust-corpus/qrates (a framework for large-scale analysis of the Rust ecosystem) could be adapted for this.

@kornel thanks for bringing this up. I think this is a rather important topic, especially in the light of Rust becoming a serious alternative for C and C++ projects.

I did some further reading. "How to Deal with Out-of-memory Conditions in Rust" (CrowdStrike) gives a good outline and demonstrates a way to test OOM behavior.

Herb Sutter's paper about deterministic exceptions in C++, http://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0709r0.pdf, gives interesting insights too. In particular, the later revision http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0709r4.pdf talks about OOM in section 4.3.

Testing allocation failure handling is much more difficult because allocation is pervasive, and lots of code that thinks it handles allocation failure is wrong.

This is a quite important lesson I think we shouldn't ignore.

Programs don’t run forever; the highest reliability comes from embracing termination, not applying increasingly heroic measures to prevent it.

The more I think about it, maybe this is more of a philosophical issue. We might get somewhat close to a goal like 'bug free software' but we know we can't reach it. So we've invented independent approaches and systems like process isolation to deal with that reality. The same goes for server hardware: instead of attempting to build crash-free servers, we build software that can fail over, load-balance, etc., and accept hardware failure as inevitable at some point. Of course that doesn't mean we should neglect software and hardware reliability on purpose, but maybe relying on a single approach is fundamentally at odds with our human ability to design and operate complex systems. This same line of thinking seems to apply to OOM. We can get far with a global allocator opt-in that specifies that this allocator reports errors. This global mechanism would be pervasive for all code, including third party code. But error handling code can still be wrong, so we also need additional out-of-process measures to be truly resilient and highly available.

I do have an "embracing termination" environment: my server's listening socket is managed externally by systemd and transparently passed on to the next instance after a crash. The socket is behind multiple layers of proxies and load balancers. The whole system is extremely redundant and distributed over many thousands of machines in over 200 data centers.

And yet, I have to handle OOM.

This is because crashing is expensive. My server is handling hundreds of requests in parallel per process. An abort of the process means that the hundreds of in-flight requests are suddenly aborted, the work is lost, and all resources spent on them so far are wasted. All of them will have to be retried: not just the OOMing one, but all that got caught in the abort. And when they're retried, I don't have a guarantee they won't crash the process again. These start-crash-restart cycles have a visible impact on cost and latency of the service. I can't guarantee 100% uptime, so I do have an environment technically prepared to handle crashes, but the crashes still have a cost and need to be avoided as much as possible.


To me, opinions of C and C++ programmers about OOM handling are not relevant. I agree that OOM handling in these languages, especially C, is incredibly difficult, and frequently goes through broken untested code paths. This is not the case in Rust.

  • Rust has Drop, which runs for you automatically whenever you exit a scope. There's no special code path for OOM. There's no goto cleanup which could jump in a weird state. There's no risk of double-free. Drop is a path that's guaranteed to be correct, and drops are always automatically executed, regardless of why the function exits.

  • Allocations are not that pervasive in Rust. Rust is pretty good at making them explicit and avoiding them. The perceived impossibility of handling OOM errors comes from the fear of needing to allocate during OOM handling. This can be avoided in Rust. The majority of Drop implementations never need to allocate anything, and strictly free memory. It's possible to use simple enum types for errors (as opposed to fancy backtrace-allocating error libraries), and then error handling can be guaranteed not to allocate. I am convinced that 100% bullet-proof OOM handling is very much achievable in Rust; in places where I was able to use fallible_collections it does work.

  • Even if OOM handling can't ever be 100% perfect, I benefit from any improvement over the current abort() approach which has 0% chance of success by design. Even bad OOM handling that works half of the time is a 50% improvement in the cost of restarting the servers I'm running.
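The first two points can be sketched in a few lines. Everything below (`RequestError`, `Connection`, `handle_request`, the `CLEANUPS` counter) is hypothetical and exists only to make the behaviour observable; it is not code from the service described above.

```rust
use std::fmt;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Plain enum error: creating, returning and printing it never allocates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestError {
    OutOfMemory,
    BadInput,
}

impl fmt::Display for RequestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            RequestError::OutOfMemory => "out of memory",
            RequestError::BadInput => "bad input",
        })
    }
}

/// Counts cleanups so the effect of Drop can be checked (illustration only).
static CLEANUPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical per-request resource; its Drop only releases, never allocates.
struct Connection;

impl Drop for Connection {
    fn drop(&mut self) {
        CLEANUPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Handles one request; `body_len` stands in for a size taken from the client.
fn handle_request(body_len: usize) -> Result<usize, RequestError> {
    let _conn = Connection;
    if body_len == 0 {
        return Err(RequestError::BadInput);
    }
    let mut body: Vec<u8> = Vec::new();
    // On failure `?` returns early and `_conn` is dropped exactly as on a
    // normal return; there is no separate cleanup path to get wrong.
    body.try_reserve_exact(body_len)
        .map_err(|_| RequestError::OutOfMemory)?;
    Ok(body.capacity())
}
```

The OOM exit and the success exit share the same Drop calls, so the cleanup exercised by every ordinary request is the same cleanup that runs when allocation fails.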

I agree with most of the things you say, although I wouldn't go as far as calling the experience of C and C++ developers completely irrelevant. For example, the part about virtual memory and over-commit is applicable across languages. But let's ignore that for now and focus only on the valuable goal you emphasized: reducing OOM crashes.

The approach I mentioned could look like this:

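use std::alloc::Layout;
use std::ptr::NonNull;

pub unsafe trait GlobalFallibleAlloc {
    // Trait methods take no `pub`; they share the trait's visibility.
    // Returns None instead of a null pointer when allocation fails.
    unsafe fn try_alloc(&self, layout: Layout) -> Option<NonNull<u8>>;
    unsafe fn dealloc(&self, ptr: NonNull<u8>, layout: Layout);

    // [...]
}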

One could configure a project to use a GlobalFallibleAlloc instead of GlobalAlloc. The standard library and other libraries could cfg out functions like Vec::reserve but keep functions like Vec::try_reserve. This would be a pervasive change that truly forces all of your own and third-party code to handle allocation failure in some way. Maybe, as an addition to make adoption easier, users could choose this at per-crate granularity and make exceptions for some crates, allowing them to use GlobalAlloc.
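To show that such a trait can sit on top of today's allocators, here is a sketch that adapts the system allocator. The trait is the hypothetical one from this post (reduced to the two methods shown); `FallibleSystem` is a made-up name, and nothing here exists in std.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::ptr::NonNull;

/// The trait proposed above (hypothetical; not part of std).
pub unsafe trait GlobalFallibleAlloc {
    unsafe fn try_alloc(&self, layout: Layout) -> Option<NonNull<u8>>;
    unsafe fn dealloc(&self, ptr: NonNull<u8>, layout: Layout);
}

/// Adapter exposing the system allocator through the fallible trait.
pub struct FallibleSystem;

unsafe impl GlobalFallibleAlloc for FallibleSystem {
    unsafe fn try_alloc(&self, layout: Layout) -> Option<NonNull<u8>> {
        // GlobalAlloc signals failure with a null pointer; map that to None.
        // As with GlobalAlloc::alloc, the caller must pass a non-zero size.
        NonNull::new(unsafe { System.alloc(layout) })
    }

    unsafe fn dealloc(&self, ptr: NonNull<u8>, layout: Layout) {
        unsafe { System.dealloc(ptr.as_ptr(), layout) }
    }
}
```

The adapter itself is trivial. The pervasive part of the proposal is the cfg-ing out of the infallible std functions, which a sketch like this cannot demonstrate.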

Note that GlobalAlloc already supports fallible allocations.

From the docs for GlobalAlloc::alloc:

pub unsafe fn alloc(&self, layout: Layout) -> *mut u8

Allocate memory as described by the given layout.

Returns a pointer to newly-allocated memory, or null to indicate allocation failure.

It's up to the users of GlobalAlloc to handle allocation failure in whatever way they see fit. The way suggested in the docs (and the most pervasive way to handle allocation failure) is handle_alloc_error, which aborts.
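For illustration, a sketch of a GlobalAlloc implementation that actually reports failure through the null return. `Capped` and its byte budget are hypothetical; the example calls the allocator directly instead of registering it, so it can be tried without affecting the rest of the program.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper that refuses requests once a byte budget is exceeded.
pub struct Capped {
    used: AtomicUsize,
    limit: usize,
}

impl Capped {
    pub const fn new(limit: usize) -> Self {
        Self { used: AtomicUsize::new(0), limit }
    }
}

unsafe impl GlobalAlloc for Capped {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let size = layout.size();
        // Claim the bytes first; undo the claim if that overshoots the budget.
        let prev = self.used.fetch_add(size, Ordering::Relaxed);
        if prev.saturating_add(size) > self.limit {
            self.used.fetch_sub(size, Ordering::Relaxed);
            return std::ptr::null_mut(); // null means "allocation failed"
        }
        let ptr = unsafe { System.alloc(layout) };
        if ptr.is_null() {
            self.used.fetch_sub(size, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.used.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}
```

If a type like this were installed with `#[global_allocator]`, a null return would surface as an `Err` from `Vec::try_reserve`, while the infallible paths such as `Vec::push` would go through handle_alloc_error.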

Well, handle_alloc_error does not offer a way to affect the control flow of calling code. So the error can't bubble, with the exception of panics. Plus users are still free to not call handle_alloc_error. @kornel would set_alloc_error_hook(|_layout| panic!()) + catch_unwind address your use case?

Oh wow that is quite an odd signature for a fallible method. I'd have expected something like a Result<T, E> or an Option<T> rather than a pointer that can be a null reference, aka the billion dollar mistake.

Is there any particular reason that the signature is defined this way?

I think it was for compatibility with existing allocators.

TL;DR: GlobalAlloc was the minimal viable API for making it possible to swap out Rust's global allocator. As such, its API mostly just matches that of the existing allocators that it was created to serve: malloc, jemalloc, and the system allocator (System), mainly. Basically, GlobalAlloc is exactly the API that the compiler uses to allocate in generated code (i.e. Box).

The current nightly Allocator trait does in fact have the richer type signature of fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>.

The exact API is still in flux, because the best API for allocators is still a fairly open question. As such, GlobalAlloc was the compromise solution to enable #[global_allocator] and nothing more.

As far as I can tell, currently handle_alloc_error intentionally forbids panics and defines them as Undefined Behavior (it or functions around it are marked "nounwind").

So I have no use for handle_alloc_error at all. The only operations it permits are the equivalent of abort() or loop {}, and these are the disruptive behaviors I want to avoid.

BTW, the draft code of Rust for Linux tried to panic there, but it's been flagged as incorrect usage of this function. So I presume Linux won't use it either. I don't understand why this handler exists.

Shouldn't a fallible vector have a zero-cost conversion to and from a std::vec::Vec using the same allocator? I would see that as the most natural interface. That would remove the viral nature of it.
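A minimal sketch of that idea, assuming a plain newtype over Vec. `FallibleVec` and `try_push` here are hypothetical and are not the fallible_collections API.

```rust
use std::collections::TryReserveError;

/// Hypothetical wrapper: same buffer and allocator as Vec, only fallible growth.
#[repr(transparent)]
pub struct FallibleVec<T>(Vec<T>);

impl<T> FallibleVec<T> {
    pub fn new() -> Self {
        FallibleVec(Vec::new())
    }

    pub fn try_push(&mut self, value: T) -> Result<(), TryReserveError> {
        self.0.try_reserve(1)?;
        // Capacity is guaranteed at this point, so push cannot reallocate.
        self.0.push(value);
        Ok(())
    }

    pub fn as_slice(&self) -> &[T] {
        &self.0
    }
}

// Both conversions just move the Vec; no elements are copied or reallocated.
impl<T> From<Vec<T>> for FallibleVec<T> {
    fn from(v: Vec<T>) -> Self {
        FallibleVec(v)
    }
}

impl<T> From<FallibleVec<T>> for Vec<T> {
    fn from(v: FallibleVec<T>) -> Self {
        v.0
    }
}
```

The conversion is free, but it also means the type stops enforcing anything the moment a caller converts back to Vec, which is the "not restrictive enough" half of the original complaint.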

I would recommend checking out https://github.com/rust-lang/rust/pull/84266. The "raw cfg" intentionally is hard to use right now --- a more accessible knob or other interface will require an RFC --- but if one can get past that I believe this is a nice alternate approach to fallible_collections surmounting many of the problems mentioned in the original post.
