Notes + Interviews on Fallible Collection Allocations

I don’t like any of the approaches. Either we end up with 2xN collections to support fallible allocations, or 2xM methods per collection.

I would like a type like the integer arithmetic wrappers for fallible allocations, e.g., Fallible<C: Collection> that accepts Vec, VecDeque, HashMap, SmallVec… etc. This type should adapt C’s methods to make allocation on them fallible, that is, change their signatures, catch the panics, and somehow be able to return the element that failed to insert on OOM to the caller of these methods.
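
For Vec specifically, here is a minimal sketch of the shape I have in mind, assuming a fallible reservation primitive such as Vec::try_reserve to build on (the generic Fallible<C: Collection> version is exactly the part we cannot write today, and the Fallible type here is hypothetical):

    use std::collections::TryReserveError;

    // Sketch only: `Fallible` does not exist. The point is the adapted
    // signature: OOM becomes an Err that hands the element back to the
    // caller instead of aborting the process.
    pub struct Fallible<C>(pub C);

    impl<T> Fallible<Vec<T>> {
        /// Like `push`, but returns the element on allocation failure.
        pub fn try_push(&mut self, elem: T) -> Result<(), (T, TryReserveError)> {
            if let Err(e) = self.0.try_reserve(1) {
                return Err((elem, e));
            }
            self.0.push(elem); // capacity is reserved, so this cannot allocate
            Ok(())
        }
    }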

I understand that this is hard, and that it will take time and language features to get there. I also understand that Gecko cannot wait. If they just need Vec, nothing prevents them from implementing a FallibleVec collection as an external crate and using that instead. This would be generally useful and could belong in collections-rs (@Gankra), but I don’t think we should rush any of this into std.

I like the try_ methods approach because it allows controlling behavior for each call separately. Granted, that is a niche of a niche, but it’s good to have the freedom.

If you want to switch between fallible and non-fallible allocations you can just move a Vec in or out of a Fallible<Vec>. Those in the niche of needing to handle OOM want to always handle it (at least those whom @Gankra interviewed above). That is, they don’t want to call push instead of try_push by mistake. Fallible<Collection> allows you to express this invariant for a piece of data and enforces it for you. If there are regions in your program in which you want to panic on OOM, you either keep using Fallible<Collection> and .unwrap() directly on errors, or just move the Collection out of the Fallible wrapper, in which case your Collection type expresses that OOM will panic.
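
In usage terms (building on the hypothetical Fallible<Vec<T>> sketched above, with its field left public for brevity), switching between the two modes would be just a move:

    fn example() {
        let v: Vec<u32> = Vec::new();

        // Opt in: from here on only the fallible methods are available,
        // so `push` cannot be called by mistake.
        let mut fallible = Fallible(v);
        if let Err((elem, _oom)) = fallible.try_push(42) {
            // handle OOM; `elem` is handed back, nothing is lost
            drop(elem);
        }

        // Opt out: moving the Vec back out expresses "OOM panics/aborts here".
        let mut v = fallible.0;
        v.push(43);
    }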


EDIT: I want to repeat that we probably cannot write Fallible in Rust today, but we should think about what it would take to do so, and see if going that route, at least for std, would make more sense than exploding either the number of collections or the number of methods per collection.

1 Like

Obviously we don't have a way to express the Collection trait yet, but it's clear how that might become possible (associated type constructors, ATCs) and that we'll likely get the required feature in the future. But that's the easy part. I don't understand how generic code might hook deep into the internals of the collection to modify the behavior when allocation returns an error. The collection's interface panics; how can you stop it from panicking "from the outside" without changing the interface? I could understand Infallible<C: FallibleCollection> (unwrap everywhere), but not this direction.

Unless, of course, Fallible would be closer to Wrapping in that Fallible<Vec>, Fallible<HashSet>, etc. are all entirely separate impls (even if they may offer a uniform interface) that need to be specifically written by the author of each and every collection. But that is mostly a rebranding of FallibleVec, FallibleHashSet, etc. so I guess this is not what you mean?

2 Likes

I could understand Infallible<C: FallibleCollection> (unwrap everywhere), but not this direction.

This direction would be panic::catch_unwind everywhere (in the wrapper), but this means that once you move an object in by, e.g., calling vec.push(object), that object is lost if the push panics, which is a big downside.
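
A minimal sketch of why the object is lost (this also assumes allocation failure unwinds at all, which it does not by default today, it aborts):

    use std::panic::{self, AssertUnwindSafe};

    // The object has to be moved into the closure handed to `catch_unwind`,
    // so if `push` panics the value is dropped during unwinding and there is
    // no way to give it back to the caller.
    fn try_push_via_unwind<T>(vec: &mut Vec<T>, object: T) -> Result<(), ()> {
        panic::catch_unwind(AssertUnwindSafe(|| vec.push(object)))
            .map_err(|_payload| ()) // `object` is already gone here
    }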

Another big downside is that it requires unwinding, which is not universally available (e.g., embedded) or desired. Several other downsides are mentioned in the OP. Requiring unwinding is pretty much right out, as far as I can tell.

2 Likes

Aside from embedded systems and local allocators, this is also not the case on Windows, which doesn't do overcommit. It instead allows programs to separately reserve and commit memory. Windows is, for example, Firefox's biggest platform.

The only reason Linux has to have an OOM killer is that it does CoW allocation of virtual memory: when an application writes to a page, that write may require an allocation, that allocation may fail, and if it does, there's no way to report the failure to the application. Granted, since this can happen to any application running on the system, the kernel may not choose to kill the application that first allocated after memory was exhausted, but instead kill another process; that's just a heuristic, though.

On Windows, for example, MS SQL Server can successfully run on a machine while consuming most of the available physical memory without things grinding to a halt. Windows also includes mechanisms to get notified when the OS is under memory pressure so that memory can be freed. SQL Server uses this to flush caches and free them until the memory pressure is gone.

While you're right that some kernels make dealing with memory allocation failure almost pointless, that is far from a universal property of kernels.

2 Likes

No, the OOM killer is not really related to overcommit at all. You could just send a SIGBUS in response to a failed CoW event instead of having an OOM killer.

The issue is that if you don’t have an OOM killer and just return ENOMEM to allocations, then RAM can become exhausted and then ANY application that wants to allocate memory will see allocation failures.

This means that you may not be able to, say, switch to another window with Alt+Tab because the window manager would need to allocate memory to display the window list, but can’t because RAM is exhausted. Likewise you probably can’t SSH into the system because sshd’s allocations will fail, can’t type into terminals because they can’t allocate memory to process the text you typed, etc.

Unless all programs are written, and constantly tested, to cope with a system whose RAM is exhausted (and obviously most software is never tested for allocation failures and just crashes or aborts when they happen), you can’t have a functioning system where memory allocation fails and there are no memory quotas.

Conversely, with an OOM killer, when you press Alt+Tab and the kernel can’t find any RAM to give to the window manager, it will notice that 99% of the RAM is being hogged by another process and kill that other process, resulting in the window manager getting its memory and the system being usable (unless there’s a memory leak in the window manager, in which case it will be the one killed).

There is another problem I didn’t mention in the previous post, which is that this behavior also happens within a single process.

Let’s say your OS uses memory quotas and returns ENOMEM on allocation, and your web page allocates a bunch of objects until the quota is exhausted. Now, ALL allocation by that browser content process will fail since its memory quota is exhausted, which means that most JavaScript code will fail and in general any sort of interaction will fail, including visiting a link to another page unless the browser is specifically designed to make no allocations for the new page before unloading the old one.

Even scrolling or resizing a browser window may fail, since allocation may be required to do so (in the case of CPU painting, this is even unavoidable, since when resizing a window you need to allocate a larger back buffer to render the content into, unless you want to, say, preallocate a 16384x16384 back buffer and not support any window larger than that, wasting 1GB of RAM per tab in the process).

The only real usefulness of fallible memory allocation (other than the embedded use case, where the whole system is written with allocation failures in mind) is thus to reject unusually large allocations without aborting the whole process, and even that is of limited use, since while such large allocations will be rejected, a huge number of small allocations will also effectively require an abort due to the issue described above.

In my experience memory is very rarely exhausted to the last byte like this, because:

  • Likelihood of allocation failure is proportional to allocation size, so larger-than-average allocations are most likely to fail first.
  • Processes have their own caches of buckets/free lists and allocate memory in pages, freeing them lazily, which creates a buffer that limits interaction between processes and gives a bit of extra room to handle OOM.
  • Allocation failures cause more memory to be freed (programs that handle failures abort operations and free the memory allocated so far, and programs that don’t handle OOM crash, so the OS frees even more memory).

To really run out of memory you need a pathological case like while(1) malloc(rand());.

1 Like

I’m not going to argue that every application can usefully recover from memory allocation failure. But then again, I don’t need to. Your post seems to talk about the behavior that you believe is implemented by a specific operating system kernel. Other kernels implement other behavior. And some applications can operate successfully in the face of memory allocation failure.

The point being, if you want to be a systems programming language, you have to handle ‘all’ cases, not just the lowest common denominator. Arguing that you can’t write software that works in the face of memory allocation failures on Linux has nothing to do with the ability to write that software on Windows.

5 Likes

It would make more sense to have an Infallible wrapper wrapping the fallible versions of the collections.

2 Likes

I think std::collections would have been better if it had been designed as you suggest by default. That is, all collections support failure by returning Results from their methods, and then we would just somehow have an Infallible<Collection> wrapper that .unwrap()s.

We are now in the opposite starting position. I think I would be “Ok” with adding FallibleVec and friends with an API that is a close 1:1 map of Vec, but we should attempt to add an Infallible<Collection> wrapper as well that just unwraps, and ideally implement Vec<T> and friends as just type aliases: type Vec<T> = Infallible<FallibleVec<T>>;
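
To make the intended layering concrete, here is a rough sketch. All of these types are hypothetical, with an ordinary Vec standing in for the real storage and try_reserve standing in for a fallible growth path:

    use std::collections::TryReserveError;

    pub struct FallibleVec<T>(std::vec::Vec<T>); // stand-in storage for the sketch

    impl<T> FallibleVec<T> {
        pub fn push(&mut self, elem: T) -> Result<(), TryReserveError> {
            self.0.try_reserve(1)?;
            self.0.push(elem);
            Ok(())
        }
    }

    // `Infallible<_>` restores today's panic/abort-on-OOM behavior by unwrapping.
    pub struct Infallible<C>(pub C);

    impl<T> Infallible<FallibleVec<T>> {
        pub fn push(&mut self, elem: T) {
            self.0.push(elem).expect("out of memory");
        }
    }

    // Today's Vec API would then ideally be just a type alias:
    pub type Vec<T> = Infallible<FallibleVec<T>>;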

3 Likes

I’ve run into similar issues developing embedded APIs. It’s commonplace to have devices that are accessed through an on-board interface such as I2C or SPI. In the vast majority of cases, these devices are going to be physically on the same PCB as the MCU; if there is any kind of error accessing the device, it’s an indicator of a hardware problem and the right thing to do is to abort.

However, there are some cases where you don’t want to abort. First, some designs might have pluggable devices, in which case you would want to be able to handle the error. Second, safety critical systems might want to handle some types of device errors explicitly for all sorts of reasons.

The problem is that you don’t want to force all of the first category of users to use a Fallible API, you don’t want to double the number of methods in your API by presenting both choices, and you don’t want to have independent Fallible and Infallible APIs because users might only want to handle a subset of errors.

Rust has evolved solutions for making Fallible APIs easier to use by the caller: first with the try! macro and then with the ? operator. Both of these make it easy to wrap Fallible calls with a default error handler that early-returns the error back up the call chain.

What would be nice is a language-level way for the API implementer to build a Fallible API and then attach a default Infallible wrapper that could be unwrapped by callers in an analogous way.

For instance, imagine if there was a #[abort_on_error] attribute that you could attach to any function returning a Result, which would convert any error result into an abort. Then, provide callers with the opposite of the ? operator (! would be nice if it weren’t already used to indicate macro usage) to tell the compiler that they want the error result directly instead of using the abort wrapper.

I don’t think there’s any reasonable way to do this in Rust at this moment - you could probably do the first part with a procedural macro, but not the second.
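
The closest you can get today without a procedural macro is doing it at the call site with an ordinary macro, which mostly illustrates the gap: the default lives with the caller rather than being attached to the API, and there is no operator to opt back out of it. (Purely illustrative sketch, not a proposed API; the names and the config path are made up.)

    // Hypothetical helper: turn any `Result` into "abort on error" at the call site.
    macro_rules! abort_on_error {
        ($expr:expr) => {
            match $expr {
                Ok(val) => val,
                Err(_) => ::std::process::abort(),
            }
        };
    }

    fn read_config() -> Result<String, std::io::Error> {
        std::fs::read_to_string("config.toml") // illustrative path
    }

    fn main() {
        let config = abort_on_error!(read_config());
        println!("{}", config);
    }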

Does anyone know if there are language constructs of this type in any other languages?

2 Likes

Bingo.

Any Rust code that does neither I/O nor manual memory management is automatically exception-safe in all but a very few corner cases. For example, a parser is generally a pure function fn parse(&str) -> Result<ParseTree, Vec<ParseError>>. On panic, all allocations are automatically cleaned up and the code can safely continue. In fact, one could change the API to return Result<ParseTree, Result<Vec<ParseError>, InsufficientMemory>> and recover from memory exhaustion without any changes.
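
A sketch of that wrapping, with placeholder types, and assuming allocation failure unwinds rather than aborts (which is not today's default):

    use std::panic::{self, AssertUnwindSafe};

    struct ParseTree;
    struct ParseError;
    struct InsufficientMemory;

    // A pure parser: it allocates, but does no I/O or manual memory management.
    fn parse(_input: &str) -> Result<ParseTree, Vec<ParseError>> {
        Ok(ParseTree)
    }

    fn parse_recovering(
        input: &str,
    ) -> Result<ParseTree, Result<Vec<ParseError>, InsufficientMemory>> {
        match panic::catch_unwind(AssertUnwindSafe(|| parse(input))) {
            Ok(Ok(tree)) => Ok(tree),
            Ok(Err(errors)) => Err(Ok(errors)),
            // All of the parser's allocations were dropped during unwinding,
            // so it is safe to continue. (A real version would also want to
            // distinguish OOM panics from other panics.)
            Err(_panic) => Err(Err(InsufficientMemory)),
        }
    }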

C++ has a very hard time with exception safety for various reasons, not the least of which are the pervasiveness of manual resource management (in poorly written code) and user-defined move constructors (which can throw). The latter simply don’t exist in Rust (a move is a memcpy, which cannot fail!) and the former is extremely rare in Rust, not least because it is generally unsafe. Some complex data structures might be left in inconsistent states if a panic occurs during traversal, and some system resources require memory to release the resource, not just to acquire it. However, I suspect that both of those are small minorities.

I know that if an RDBMS or web server I was using ran out of memory and crashed, I would report that as a bug, on the grounds that a crash in such a system is likely to mean production downtime. IMO panicking on allocation failure and catching the panic near the top level is much, much easier than the alternative (manually tracking all allocations and ensuring that they stay under a certain limit), which is extremely error-prone. I don’t need it to do anything fancy on OOM. Just log it (ignoring failure), and return a 500 Internal Server Error or roll back the transaction.

2 Likes

I agree exception-safety is a fraction of the difficulty in Rust, but unsafe code is the key place where this isn’t true. Lots of code uses Vec unsafely, to e.g. pass an uninitialized buffer to some C code.

I am not satisfied with “well getting completely owned by exception-safety problems will be rare”.

I am willing to consider adding an oom=unwind toggle for Rust, though. But I would advise against its usage in general.

Actually, I’d like to say we should table any discussion of unwinding since it’s irrelevant to the motivating use-case (and largely irrelevant to what I consider to be the second-most-important one: embedded).

The server case is interesting, oom=unwind can be handled orthogonally.

1 Like

Perhaps one way you could design this (which would fit in well with custom allocators) is by adding another type parameter to Vec:

pub struct Vec<T, Allocator> { /* stuff */ }.

The Allocator trait would have some way to perform an allocation:

fn allocate(size, alignment) -> Result<*mut u8, Error>.

For an infallible allocator, you simply set Error to enum Void {} or ! (i.e., an uninhabited type that cannot be instantiated).

Then, you avoid the duplication of having to have fallible & infallible versions of Vec, by making every method look somewhat like this:

fn push(&mut self, item: T) -> Result<(), Error> where the Error is the same type as above. Of course, if you’re using an infallible allocator, this Result essentially is ().

Now, obviously, this doesn’t actually work, as Rust doesn’t appreciate that Result<(), !> == (). As a stopgap, you could make a wrapper type that unwraps all of these results, but then we’re back to the duplication thing again. However, if Rust ever does gain the ability to equate Result<(), !> to (), then this model may prove helpful, as you can get rid of the duplicated wrapper version.
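
Putting those pieces together, here is a sketch of the shape being described. The trait and types are hypothetical (this is not std's allocator API, and the real Vec does not look like this); an empty enum stands in for !:

    use std::alloc::Layout;
    use std::marker::PhantomData;
    use std::ptr::NonNull;

    trait FallibleAlloc {
        type Error;
        fn allocate(&mut self, layout: Layout) -> Result<NonNull<u8>, Self::Error>;
    }

    // An "infallible" allocator picks an uninhabited error type.
    enum Never {} // stand-in for `!`

    struct AbortingAlloc;
    impl FallibleAlloc for AbortingAlloc {
        type Error = Never;
        fn allocate(&mut self, layout: Layout) -> Result<NonNull<u8>, Never> {
            // Call the global allocator and abort on failure
            // (ignoring zero-sized layouts for brevity).
            let ptr = unsafe { std::alloc::alloc(layout) };
            match NonNull::new(ptr) {
                Some(p) => Ok(p),
                None => std::alloc::handle_alloc_error(layout),
            }
        }
    }

    struct Vec<T, A: FallibleAlloc> {
        alloc: A,
        // pointer/len/capacity omitted
        _marker: PhantomData<T>,
    }

    impl<T, A: FallibleAlloc> Vec<T, A> {
        // Every method threads the allocator's error through its signature;
        // with `Error = Never` the Result is morally just `()`.
        fn push(&mut self, _item: T) -> Result<(), A::Error> {
            unimplemented!()
        }
    }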

2 Likes

Hooking fallibility into the (already existing) allocator type parameter is equivalent to the Vec<T, Fallible> proposal, no?

[rereading intensifies] oh crap, yeah - sorry! chalk one up to not reading things carefully enough… (to be fair, though, the Fallibility type thing didn’t really mention hooking it into the allocator API - I was suggesting adding the concept of an “infallible allocator” instead of a Vec that acts infallibly. Doing it that way sorta helps prevent code duplication - if you wanted an infallible HashMap, all your scary “abort on OOM” routines are in one thin wrapper over some fallible allocator instead of having to be distributed across collections. It also opens up the opportunity for weird allocators that e.g. try to stack-allocate and, if they fail, heap-allocate instead).
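
Reusing the hypothetical FallibleAlloc trait and Never type from the sketch above, that thin wrapper might look something like this:

    // One place that turns any fallible allocator into an aborting one,
    // reusable by every collection that is generic over the allocator.
    struct AbortOnOom<A>(A);

    impl<A: FallibleAlloc> FallibleAlloc for AbortOnOom<A> {
        type Error = Never; // the uninhabited stand-in for `!`
        fn allocate(&mut self, layout: Layout) -> Result<NonNull<u8>, Never> {
            match self.0.allocate(layout) {
                Ok(ptr) => Ok(ptr),
                Err(_) => std::alloc::handle_alloc_error(layout),
            }
        }
    }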

1 Like