I do have an "embracing termination" environment: my server's listening socket is managed externally by systemd and transparently passed on to the next instance after a crash. The socket sits behind multiple layers of proxies and load balancers. The whole system is extremely redundant, distributed over many thousands of machines in over 200 data centers.
And yet, I have to handle OOM.
This is because crashing is expensive. My server handles hundreds of requests in parallel per process. Aborting the process means those hundreds of in-flight requests are suddenly cut off, their work is lost, and all resources spent on them so far are wasted. All of them have to be retried: not just the OOMing one, but every request that got caught in the abort. And when they're retried, I have no guarantee they won't crash the process again. These start-crash-restart cycles have a visible impact on the cost and latency of the service. I can't guarantee 100% uptime, so my environment is technically prepared to handle crashes, but the crashes still have a cost and need to be avoided as much as possible.
To me, the opinions of C and C++ programmers about OOM handling are not relevant here. I agree that OOM handling in these languages, especially C, is incredibly difficult and frequently goes through broken, untested code paths. That is not the case in Rust.
- Rust has `Drop`, which runs for you automatically whenever you exit a scope. There's no special code path for OOM. There's no `goto cleanup` which could jump in a weird state. There's no risk of double-free. `Drop` is a path that's guaranteed to be correct, and drops are always executed automatically, regardless of why the function exits.
- Allocations are not that pervasive in Rust. Rust is pretty good at making them explicit and avoiding them. The perceived impossibility of handling OOM errors comes from the fear of needing to allocate during OOM handling. This can be avoided in Rust. The majority of `Drop` implementations never need to allocate anything, and strictly free memory. It's possible to use simple `enum` types for errors (as opposed to fancy backtrace-allocating error libraries), and then error handling can be guaranteed not to allocate. I am convinced that 100% bullet-proof OOM handling is very much achievable in Rust: in places where I was able to use `fallible_collections`, it does work.
- Even if OOM handling can't ever be 100% perfect, I benefit from any improvement over the current `abort()` approach, which has a 0% chance of success by design. Even bad OOM handling that works half of the time is a 50% improvement in the cost of restarting the servers I'm running.
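The `Drop` guarantee in the first point can be sketched as follows. This is a minimal illustration, not code from my servers; the `Connection` type and the atomic flag are made up for the example:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flag so we can observe that cleanup ran (illustrative only).
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// A hypothetical resource wrapper; its Drop strictly releases
// the resource and never allocates.
struct Connection;

impl Drop for Connection {
    fn drop(&mut self) {
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn handle_request(fail: bool) -> Result<(), &'static str> {
    let _conn = Connection;
    if fail {
        // Early error return: no `goto cleanup`, no special OOM path.
        // `_conn` is still dropped before the function exits.
        return Err("out of memory");
    }
    Ok(())
}

fn main() {
    assert!(handle_request(true).is_err());
    // Drop ran even though we exited via the error path.
    assert!(CLEANED_UP.load(Ordering::SeqCst));
}
```

The same drop runs on `Ok`, on `Err`, and on `?`-propagation, which is why there is no separate, rarely-tested cleanup path to get wrong.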
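For the second point, the standard library's `Vec::try_reserve` is a built-in analogue of what `fallible_collections` provides: allocation failure surfaces as an `Err` instead of an abort, and the error type below is a plain non-allocating `enum`. The `buffer_request` function and `RequestError` are hypothetical names for the sketch:

```rust
// A simple enum error: no backtraces, no allocation on the error path.
#[derive(Debug, PartialEq)]
enum RequestError {
    OutOfMemory,
}

fn buffer_request(payload_len: usize) -> Result<Vec<u8>, RequestError> {
    let mut buf = Vec::new();
    // try_reserve reports failure as a Result instead of aborting,
    // so an OOM becomes a recoverable per-request error.
    buf.try_reserve(payload_len)
        .map_err(|_| RequestError::OutOfMemory)?;
    buf.resize(payload_len, 0);
    Ok(buf)
}

fn main() {
    // An ordinary request succeeds...
    assert!(buffer_request(1024).is_ok());
    // ...while an impossibly large one fails cleanly, and only that
    // one request is lost instead of the whole process.
    assert_eq!(
        buffer_request(usize::MAX).err(),
        Some(RequestError::OutOfMemory)
    );
}
```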