I do have an "embracing termination" environment: my server's listening socket is managed externally by systemd and transparently passed on to the next instance after a crash. The socket sits behind multiple layers of proxies and load balancers. The whole system is extremely redundant, distributed over many thousands of machines in over 200 data centers.
And yet, I have to handle OOM.
This is because crashing is expensive. My server handles hundreds of requests in parallel per process. Aborting the process means those hundreds of in-flight requests are suddenly cut off, their work is lost, and all resources spent on them so far are wasted. All of them have to be retried: not just the OOMing one, but every request that got caught in the abort. And when they're retried, I have no guarantee they won't crash the process again. These start-crash-restart cycles have a visible impact on the cost and latency of the service. I can't guarantee 100% uptime, so my environment is technically prepared to handle crashes, but the crashes still have a cost and need to be avoided as much as possible.
To me, the opinions of C and C++ programmers about OOM handling are not relevant here. I agree that OOM handling in these languages, especially C, is incredibly difficult and frequently goes through broken, untested code paths. That is not the case in Rust.
- Rust has `Drop`, which runs for you automatically whenever you exit a scope. There's no special code path for OOM. There's no `goto cleanup` which could jump in a weird state. There's no risk of double-free. `Drop` is a path that's guaranteed to be correct, and drops are always executed automatically, regardless of why the function exits.
- Allocations are not that pervasive in Rust. Rust is pretty good at making them explicit and avoiding them. The perceived impossibility of handling OOM errors comes from the fear of needing to allocate during OOM handling. This can be avoided in Rust. The majority of `Drop` implementations never need to allocate anything, and strictly free memory. It's possible to use simple `enum` types for errors (as opposed to fancy backtrace-allocating error libraries), and then error handling can be guaranteed not to allocate. I am convinced that 100% bullet-proof OOM handling is very much achievable in Rust: in places where I was able to use `fallible_collections`, it does work.
- Even if OOM handling can't ever be 100% perfect, I benefit from any improvement over the current `abort()` approach, which has a 0% chance of success by design. Even bad OOM handling that works half of the time is a 50% improvement in the cost of restarting the servers I'm running.
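The `Drop` guarantee in the first point can be sketched as follows. This is a minimal illustration, not code from my servers; the `Connection` type and the atomic flag are made up for the example:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flag so we can observe that cleanup ran (illustrative only).
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// A hypothetical resource wrapper; its Drop strictly releases
// the resource and never allocates.
struct Connection;

impl Drop for Connection {
    fn drop(&mut self) {
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn handle_request(fail: bool) -> Result<(), &'static str> {
    let _conn = Connection;
    if fail {
        // Early error return: no `goto cleanup`, no special OOM path.
        // `_conn` is still dropped before the function exits.
        return Err("out of memory");
    }
    Ok(())
}

fn main() {
    assert!(handle_request(true).is_err());
    // Drop ran even though we exited via the error path.
    assert!(CLEANED_UP.load(Ordering::SeqCst));
}
```

The same drop runs on `Ok`, on `Err`, and on `?`-propagation, which is why there is no separate, rarely-tested cleanup path to get wrong.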
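For the second point, the standard library's `Vec::try_reserve` is a built-in analogue of what `fallible_collections` provides: allocation failure surfaces as an `Err` instead of an abort, and the error type below is a plain non-allocating `enum`. The `buffer_request` function and `RequestError` are hypothetical names for the sketch:

```rust
// A simple enum error: no backtraces, no allocation on the error path.
#[derive(Debug, PartialEq)]
enum RequestError {
    OutOfMemory,
}

fn buffer_request(payload_len: usize) -> Result<Vec<u8>, RequestError> {
    let mut buf = Vec::new();
    // try_reserve reports failure as a Result instead of aborting,
    // so an OOM becomes a recoverable per-request error.
    buf.try_reserve(payload_len)
        .map_err(|_| RequestError::OutOfMemory)?;
    buf.resize(payload_len, 0);
    Ok(buf)
}

fn main() {
    // An ordinary request succeeds...
    assert!(buffer_request(1024).is_ok());
    // ...while an impossibly large one fails cleanly, and only that
    // one request is lost instead of the whole process.
    assert_eq!(
        buffer_request(usize::MAX).err(),
        Some(RequestError::OutOfMemory)
    );
}
```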