Notes + Interviews on Fallible Collection Allocations


#1

Here are my notes on adding fallible allocation to Rust’s collection types, including several interviews with stakeholders or people with experience in the problem space.

This document doesn’t propose a concrete solution, but instead lays out the options and my notes on them. I have a pretty strong bias going in, and it was reinforced by the interviews.

I’ll be prepping an RFC this week based on this feedback and the discussion here.

Background

Methods like push on Vec don’t provide any way for the caller to handle OOM, and this is problematic for our several users. Notably gecko devs will be forking our collections to add this functionality for Firefox 57 (November 2017). As such we should fast track a design so, at very least, their fork matches what std ends up doing eventually.

Note: I will be usually be using push as a stand-in for "all allocating APIs" for simplicity here.

  • Today we always abort (call Alloc::oom())

    • Unless computing the allocation size overflows isize::MAX (yes isize), then it panics
      • Both of these are in the implementation of Vec, and not done by the allocator
  • Moving to unwinding on OOM is problematic

    • unsafe code may be exception-unsafe on the basis of current strategy
    • almost everyone tells me that C++'s version of this (bad_alloc) is bad and disliked
    • libunwind allocates, making panic on OOM sketchy at best
    • Alloc::oom says it can panic, so local or custom global allocators might unwind…?
      • Seems sketchy… should we update docs to forbid this? Was this discussed in the RFCs?
      • pnkfelix and sfackler seem to think it was only intended for local allocators
        • global allocators aren’t currently distinguished by the Alloc trait
      • this + catch_panic is a “simple” way to get fallible allocation out of the existing APIs
        • still unstable APIs anyway, so won’t help gecko devs for 57
        • unwinding isn’t always supported by our stake holders (no unwinding in gecko)

For all considered proposals, I’ll be using a strawman error type, which I’ve only thought about a little:

enum AllocFailure { Exhausted, CapacityOverflow }

Note that the allocator API defines:

pub enum AllocErr {
    Exhausted { request: Layout },
    Unsupported { details: &'static str },
}
  • I don’t think Unsupported should be exposed (it will instead be folded into Exhausted).
  • I don’t think Layout should be exposed (it’s an implementation detail).

Most consumers will probably just do:

foo.try_push(x)?; // evaporate details

// or

if foo.try_push(x).is_err() { /* do something */ }

But some might be interested in reproducing Vec’s behaviour (including Vec itself):

match foo.try_push(x) {
    Exhausted        => Allocator::oom(),
    CapacityOverflow => panic!("capacity overflow"),
}

Felix notes Exhausted having requestedBytes: usize might be useful for debugging crashes – was it “real” oom or did we try to allocate something massive?

Major contenders

  • Types to distinguish fallibility

    • FallibleVec<T>, replaces push(T) with push(T) -> Result<(), (T, AllocFailure)>
      • doesn’t support generic use of Vec/FallibleVec
      • hard to do mixed usage of fallible and non-fallible
        • or at least, outside allocating code, fallibility loses relevance
    • Vec<T, F: Fallibility=Infallible>, makes push(T) -> F::Result<(), T>
      • requires generic associated types (stable late 2018, optimistically)
      • probably requires type defaults to be improved?
      • works with generics, but makes all of our signatures in rustdoc hellish
        • maybe needs “rustdoc lies and applies defaults” feature
  • Methods to distinguish fallibility

    • Make mirrors of all methods – try_push(T) -> Result<(), (T, AllocFailure)>
      • works fine, but people aren’t happy about lots of methods
    • Only add try_reserve() -> Result<(), AllocFailure>
      • minimal impact
      • methods like extend/splice have unpredictable allocations
      • doesn’t work with portability lints (see below)
      • might be nice to have anyway?
    • Add some methods, but ignore niche ones
      • Weird, going to make people mad
  • Middle ground: method to temporarily change type

    • as_fallible(&'a mut self) -> FallibleVec<'a, T>
      • can do it for one method: vec.as_fallible().push(x)
      • or for a whole scope: let mut vec = vec.as_fallible()
      • doesn’t enable generic use, weak for library interop
      • can be built on method style
      • note: this is different from type-based b/c a lifetime is involved

Possible augmentation: negative portability lints

In some sense “don’t use infallible allocation” is the same kind of constraint that kernel code has for “don’t use floats”. The latter is intended to be handled by negative portability lints, so we can do that too.

portability lints were spec’d here: https://github.com/rust-lang/rfcs/blob/master/text/1868-portability-lint.md

But the negative version (removing portability assumptions) was left as future work.

Strawman syntax – add maybe as a cfg selector in analogy to ?Sized:

// In liballoc
impl<T> Vec<T> {
    // No need to mark push, implicitly #[cfg(std)] ?
    fn push(elem: T) { ... }

    // Say try_push doesn't infallibly allocate -- forces verification of body
    #[cfg(maybe(infallible_allocation))]
    fn try_push(elem: T) -> Result<(), AllocFailure> { ... }
}



// In your code
#![cfg(maybe(infallible_allocation))]
/* a bunch of functions/types that shouldn't use infallible allocation */

// or (equivalent)
#[cfg(maybe(infallible_alloction))]
mod allocation_sensitive_task;

// or (more granular)
#[cfg(maybe(infallible_allocation))]
fn process_task() {
    /* will get warning if any function called isn't #[cfg(maybe(infallible_allocation))] */
}

Note this analysis is local, so if you call any undecorated function from a third-party library, you’ll get a warning. This is a bit annoying, but strictly correct insofar as longterm stability is concerned: they should publicly declare that they guarantee this. In this vein, adding a #[cfg(maybe)] from a public item isn’t a breaking change, but removing one is.

This will also require a ton of careful stdlib decorating (don’t want to promise things we shouldn’t).

Interviews

I interviewed several people with industry experience in this problem, only some stakeholders in Rust providing this API (noted here).

Interview with Ehsan (Gecko dev; doesn’t use Rust for it):

Gecko has fallible allocation in its standard collection types. Distinction can be done at the type level or method level – there are factions that disagree on the right approach, and the issue doesn’t appear to be settled?

  • Personally prefers methods
  • Almost all allocations in gecko are infallible; crashing is simple and maintainable (especially with multi-process!)
  • Will fallibly allocate for some key things to improve reliability. Notably when website can create allocation disproportionate in size to network traffic (image size is a few bytes).
  • Doesn’t need to handle all fallible allocation in that region of code, or even on that buffer
    • happy to crash if the going gets tough.
    • In quick search of gecko, [^1] couldn’t find any actual mixed use
      • Except a sketchy pattern [^2]
  • Fallibility is a maintenance/safety hazard! Many untested branches.
    • In a quick search of gecko, I found a few cases that are written in a confusing way

(last two points are why methods are preferred)

[^1]: https://searchfox.org/mozilla-central/search?q=%5B%5En%5Dfallible%5C)&case=false&regexp=true&path==

[^2]:

// Fallibly reserve space
if (!aOutput.SetCapacity(aSource.Length(), fallible)) { 
    return false; 
}

for (auto x : aSource) { 
    // Says fallible, but this is actually infallible; otherwise this is UB on OOM
    *aOutput.AppendElement(fallible) = x;
}

In rust this would probably just be output.try_extend(source)?, although FFI might make you write code like above?

Interview with Whitequark (embedded dev; uses Rust for it):

Three lines of defense against the specter of allocation:

  • First: statically allocate; much harder to mess up.
  • Second: Crash on oom! Usually hard abort (need to know how to recover anyway), but sometimes unwind (some Cortex-M devs)
    • unwinding isn’t commonly supported here, so unwinding won’t ever be a complete solution.
  • Third: actually handle oom.
    • fail at a task/request granularity
    • all allocations for task are in a pool, so that on failure we free the whole pool; avoid fragmentation
    • all allocations in this region of code are handled fallibly, no mixing strategies

Likes try_push, but wants #[deny(infallible_allocation)]

If we do a typed approach, would prefer something generic for library compat.

Fallible allocation is a last resort, and devs are willing to put in work to use it properly.

Interview with Swgillespie (CLR dev, works on the GC; doesn’t use Rust for it)

Need collections for state in GC traces, e.g. stacks in graph traversal. If allocation fails, can try to shrink the stack and retry. OOMing while trying to GC is a bug.

  • Uses global allocator (new with std::nothrow)
  • Would use #[deny(infallible_allocations)]
  • No preference on typed vs untyped.
  • No need for being generic over fallibility (GCs are fairly concretely typed)
  • No concern with interop with third-parties
  • Lots of bugs from missing spots or failing to check results

Interview with nox (Servo dev; uses Rust for it)

Stylo needs it for Firefox 57, will be forking libstd collections until we provide these APIs.

Code like this which parses a list should be fallible: https://github.com/servo/servo/blob/de0ee6cebfcaad720cd3568b19d2992349c8825c/components/style_traits/values.rs#L251

Style sheet should just come out as “couldn’t parse”/“didn’t load” when this happens.

  • Prefers methods to integrate into existing code where desired
  • Moving to infallible likely to be incremental, as it’s a big job
  • Controls all the relevant libraries
  • Doesn’t care about generics
  • Would like #[deny(infallible_allocations)], not super important though

Relevant Reading


#2

Handling memory allocation failures is a niche usage, but if you think having it is important, then having all try_* methods + infallible_allocations lint + some way to group very similar methods in docs is the most clean solution, is idiomatic (Rust uses similar solutions for integral overflows, etc), and I think it doesn’t require more library code than a type solution.


#3

I discussed this with @nox and other people on IRC today. It appears the above was based on a misunderstanding in a prior IRC discussion. The vast majority of allocations in the C++ code that Stylo is replacing are not fallible. They only are in a few places, but these may be important to reduce crashiness on 32-bit platforms.

Regarding prioritization: based on the release schedule of both projects and the constraint that Firefox wants to support stable Rust, it looks like Firefox 57 will ship with Rust 1.20 as the minimum supported version. 1.20 went to beta last week, so it’s already too late for getting new language or standard library features in time.


#4

I think it would be very cool if rustdoc could show documentation for types with all default type parameters substituted!


#5

Regarding the middle-ground: I am not sure why a lifetime is necessary here.

Couldn’t the interface be Vec<T>::into_fallible(self) -> FallibleVec<T> with the symmetric FallibleVec<T>::into_infallible(self) -> Vec<T>? (With appropriate implementations of From/Into)

As long as both variants have strictly the same data-members, the transformation is effectively a no-op.


I must admit that the method approach seems unfortunate in the sense of the large duplication of interface that it entails. There is already quite a number of methods on Vec, especially once one considers all those it “inherits” from [T], and this already makes finding the right method for the job somewhat challenging (“There should be a method to do that, let me try to find it in the doc…”).

I would really favor splitting the interfaces in two, so that newcomers and “casual” users don’t have to wallow in the world of fallible allocations when they really do not need it.

Of course, this is only viable if conversion between the two is easy and cost-less, otherwise choosing one or the other has long-lasting consequences on all APIs which is going to be painful.


#6

I think if the methods are thoughtfully arranged such that all of the try_ methods are in their own section and are in the same order as the non-try_ methods, then I think the wallowing could be minimized quite a bit. :slight_smile:

(I do like your into_ idea though.)


#7

I was assuming an into-like conversion for the type-based approach. It has significant ergonomics issues in the temporary usage case:

  • Can’t be done on &mut Vec without swap shenanigans
  • Requires an into conversion at the end. That is:
vec.as_fallible().push(x)?;

becomes (in the worst case)

let f_vec = mem::replace(vec, Vec::new()).into_fallible();
f_vec.push(x)?;
mem::replace(vec, f_vec.into_infallible());

Or (if you have ownership):

let f_vec = vec.into_fallible();
f_vec.push(x)?;
vec = f_vec.into_infallible();

#8

I’m wary of folding Unsupported and Exhausted into the same error type. I have a long comment about this here, but the TLDR is that Unsupported and Exhausted express fundamentally different ideas. Unsupported implies that the requested layout is fundamentally unsupported by this allocator, and could never be supported regardless of system resources. Exhausted, on the other hand, to quote the documentation, “strongly implies that some series of deallocations would allow a subsequent reissuing of the original allocation request to succeed.”

As I discuss in the linked comment, the only reasonable response to an Unsupported error is to crash. An Unsupported error should be taken as evidence of a bug - that a given piece of code is being combined with an allocator that cannot support the allocations that the code requires. In fact, it might even be reasonable to remove the Unsupported error entirely and have Alloc's methods panic/abort on unsupported requests.

enum AllocFailure { Exhausted, CapacityOverflow }

To revisit this proposed type, what is the benefit of having CapacityOverflow be its own variant? Are there cases in which it both the case that a) a request is valid (in the sense of being supported by the allocator) but also, b) the request is too large to ever be supported regardless of system resources? CapacityOverflow seems to me to be a special case of Unsupported - an allocation request that is fundamentally unsatisfiable regardless of system resources.

EDIT: I think I may have misunderstood. Is AllocFailure intended to be an alternative to AllocErr for the Alloc trait’s methods to return, or is it intended as the error type returned by fallible methods on containers (e.g., Vec::try_push)?


This is mostly awesome work, though! Very in favor of making progress on the infallible/fallible distinction.


#9

You can have both?


#10

Have you considered providing try_* methods via an extension trait, such as FallibleVec which would be automatically implemented for Vec?. This allows for maximum convenience, while being out of the way for users not interested in handling fallible allocations.

The trait import would also signal that a piece of code uses fallible allocations somewhere, plus this pattern can easily extend to other types which may perform allocations.


#11

Regarding the difficulty of properly handling OOM and untested branches: in SpiderMonkey, the only way we’ve been able to reliably handle fallible allocation (all allocation is fallible in SpiderMonkey) is by having integration with our fuzzers. All allocations go through a common path and when we OOM, we set a flag. When we recover from OOMs, we reset that flag, and then various places make assertions that the flag is not set. Then, we give the fuzzers the ability to inject simulated OOMs, and they try their best to trigger the assertions.

http://searchfox.org/mozilla-central/rev/8a61c71153a79cda2e1ae7d477564347c607cc5f/js/src/builtin/TestingFunctions.cpp#1463-1605

Note that libunwind actually does have some infrastructure in place to do unwinding in difficult situations, such as under OOM and in a signal handler when locks can’t be taken: https://github.com/libunwind/libunwind/blob/7c079200d024d5868073246c4ec8f80446b0a4c0/include/mempool.h#L29-L48


#12

You’re missing servers in your use-cases. For servers it’s desirable to fail a single request that caused OOM, but not fail any other requests that are already running.

Proposed solutions are focused on returning Result, but I don’t see how that would work with the upcoming box keyword and <- placement new.

I’m very unsatisfied with proposed solutions. The try methods are useful, but not sufficient. From my perspective keeping the uncontrollably aborting push(), Box and collect() behavior is a mistake:

  • If aborting methods exist alongside faillible ones, they will be used, and it will be endless whack-a-mole.
  • It will cause difficulty using 3rd party libraries. Not everyone knows or cares about the problem, so it may split the ecosystem into OOM-compatible and aborting groups.
  • The proposed solution effectively deprecates push(), Box::new, collect(), which makes almost all existing Rust code and tutorials deprecated.

I’m unhappy that panicing on OOM has been dismissed, as I don’t find any of the arguments convincing enough:

  • All existing code already has to be prepared to deal with panics, so panicing in one more case doesn’t introduce anything fundamentally new, especially that too-large allocations panic today.
  • C++ exception safety is much harder than panic safety, and shouldn’t be used as a guide.
  • libunwind could preallocate to be more robust. In practice exhaustion of memory to the last byte does not happen. The larger the allocation, the higher the chance of hitting OOM, so in practice programs mostly fail on the largest memory blocks they use, so there’s likely to be enough memory left for unwinding.
  • OOM during unwinding can cause abort — this is also noting new, it’s same as panic during panic unwinding.

If you’re worried about compatibility of existing code, then I propose extending existing panic = abort to distinguish between “always abort”, “abort on OOM and panic-in-unwind (today’s default)” and “abort only on panic-in-unwind” (giving a chance to recover from OOM). Existing behavior can be kept as default.

[profile.release]
panic = "unwind"
panic = "abort-oom"/"unwind-oom" // to be bikeshedded
panic = "abort"

#13

The proposed solution effectively deprecates push(), Box::new, collect(), which makes almost all existing Rust code and tutorials deprecated.

Those would not be deprecated. Lots of programs, maybe most programs, work best by defaulting to not explicitly checking OOM.

But I agree with you that unwinding on OOM is worth pushing on a bit harder. It would be great to have because it would mean that safe Rust code running on its own thread provides stronger isolation.


#14

Is fallible allocation actually useful at all?

The problem is that, if the OS does not apply a memory quota to processes, then OS memory allocation simply cannot fail, and thus a small allocation in a malloc-like memory allocator will also never fail (assuming a virtual address space larger than physical RAM + swap space, i.e. a 48/64-bit address space).

To see why, note that if the OS allowed small allocations to fail, it would be due to complete exhaustion of RAM, which would result in memory allocations failing in ALL processes, which would most likely make the system completely unusable as all user and network interaction would fail due to the inability to allocate memory to process it, requiring an hard reboot.

Thus, what happens instead is that the OS identifies a process that the biggest offender and kills it (this is called “OOM killer” mechanism in Linux). If the process killed is not the one trying to allocate memory, the allocation will then succeed since memory has been freed. If it is, then the process has been killed and thus the allocation function never returns. Hence, allocation never fails.

If instead there is a per-process memory quota, then the OS allocation can indeed fail without issues once the quota is exhausted, since there is still RAM available for other processes due to the quota. Likewise, with a 32-bit address space, OS allocation can fail due to virtual address space exhaustion. Unusually large allocations can also fail simply because they are large, while other programs are still able to make reasonably sized allocations.

However, programs need anyway to function in the quota-less OOM killer model since that’s what Linux does by default, which means that they must cope with the kernel suddenly killing their processes due to OOM at any time, including when not allocating memory (this can be handled either by simply having systemd or a similar tool restart them, or by explicitly doing most of the work in one or multiple subprocess and detecting their abnormal termination).

But if they are already coping correctly with being killed due to the kernel OOM killer, then the simplest way to support OS memory allocation failures is to pretend that an OOM kill happened, by artificially aborting the process like Rust does now.

This approach greatly reduces complexity since memory allocation always either succeeds or results in the process being killed regardless of OS and OS behavior. Also, most people aren’t going to take the time and trouble to test allocation failures, so there’s an high likelihood that allocation failure handling code, if it were to be required, would often be broken.

So the value of supporting fallible allocation at least in userspace seems dubious, and the current behavior might be the best, perhaps just with the addition of a crate that allows to easily spawn subprocesses and determine if they were killed due to either the kernel OOM killer or allocation failure abort.

When the Rust code in question is implementing a Linux-like kernel, things are a bit different, because you can’t just “abort” or kill a kernel thread without panicing the whole system due to the lack of memory isolation. In this case fallible allocation might make sense, although it’s also possible to make all kernel code exception safe and use unwinding, catching the panic at the system call or interrupt entry.


#15

You are completely ignoring a major use-case in embedded. Also, some software are designed to be run in cgroups, and will want to be able to handle OOM, and not care about the OOM killer.


#16

The interview with whitequark was intended to capture requirements for embedded development. Can you be more specific about what use case was ignored?


#17

Where are you getting this idea? On unix, heap allocators will request memory from the OS using brk/sbrk or mmap, which will both fail with ENOMEM if there is not enough physical memory available for the requested amount.


#18

RE: the impossibility of handling oom

  • As an application-level concern, a careful system administrator can avoid all the issues with overcommit.
  • This is also a complete non-concern by default in the embedded context.
  • OOM can always be reasonably handled in the case of local allocators. e.g. it’s very reasonable for a pool or arena to run out of resources deterministically, even on an overcommiting system.
  • OOM handling doesn’t need to be perfect. In the case of Firefox, it’s a best-effort quality-of-experience issue. Being able to render a page with a catastrophically large image showing up as broken is nice. Sometimes the whole page will crash; such is life.
  • Something being difficult to use correctly isn’t sufficient to dismiss its availability. It’s just a reason to make it opt-in (as this proposal does). Almost all users are still expected to use infallible allocation, as is practical.

RE: servers

So basically this is the same as the embedded “task” case, but where it’s (reasonably) deemed impractical to put in the elbow grease to do the job right?

I’ll be honest: if you’re not putting in much more effort than installing a panic handler, I’m incredibly suspicious of your server’s ability to properly recover from allocation troubles. I suspect purging the whole process with an abort will be more reliable. A good server needs to recover from process aborts anyway.

I have received repeated warnings from developers I respect that handling oom is basically a fool’s errand unless you’re already carefully managing all the allocations.

I also fundamentally disagree with your assessment of how this changes idioms. Almost no one should ever use these APIs. It takes a lot of effort to handle OOM in a meaningful way.

RE: keywords/syntax

Those are low priority for me at the moment because they’re also unstable. They should definitely be looked at but it’s important to note that the box/place syntax is just sugar to make common cases less clumsy. (please correct me if this is wrong)

Could this be made to work? (I haven’t read those RFCs recently)

in vec.as_fallible().emplace_back() { ... }

RE: AllocFailure

Yes it’s only intended to be exposed by collections. The description of Unsupported I saw in the docs seemed to suggest to me it was possible to get unsupported as a psuedo-oom. For instance, requesting a very large allocation might plausibly come back with Unsupported. I’m happy to bend to the will of the users here. I don’t have strong opinions on this aspect of the design.

You make a good point that CapacityOverflow is basically Unsupported, but the devil’s in the details here. Do we handle AllocErr::Unsupported by panicking? Because we panic on CapacityOverflow today. If handling is different, they should probably take different paths.

RE: libunwind

Ah, I believe I was looking at an different (old?) version of libunwind, which was much more liberal with blind mallocs. This is good to hear!


#19

I’m unhappy that panicing on OOM has been dismissed, as I don’t find any of the arguments convincing enough:

Why not both?

It’s absolutely necessary to have Result-based methods for embedded/kernel use cases, but that doesn’t mean the existing methods can’t also be switched from aborting to panicking.

(Also, if placement new doesn’t support fallibility, placement new is wrong. Not sure what the latest design is, but I’d like to see something like (HEAP <- some_expr())? be supported. That is, if the allocation failed, some_expr() would not be evaluated and the placement expression would evaluate to Err.)


#20

I’m incredibly suspicious of your server’s ability to properly recover from allocation troubles

I’m not, because I already have a Nodejs + C server setup that does it and relies on that.

I have an image processing pipeline running in a memory-constrained VM. Because many requests are running in parallel, images have various sizes, and the pipeline’s buffering approach varies per image, peak memory usage varies a lot.

OOM handling in my case is super easy, because on failure I can immediately free hundreds of megs of memory. My only problem is that Rust does not give me a chance to do so.

I already handle failure of the whole server, but it’s literally expensive, because all concurrent requests need to be retried, rather than just an unlucky one.

I can in my code use try methods, but I’m afraid of it being a whack a mole game. Apart from needing to review and change all of my code, I’d have to convince authors of most of my dependencies to do so as well.

if you’re saying that handling OOM will be a rare concern, then it worries me even more, since it will be rare for Rust crates to be suitable for my server usage.