No longer possible to exercise system call failure due to EBADF

std PR #124210 made dropping a File whose OS-level file descriptor is invalid cause an immediate process abort in debug builds.

In principle, I support this change; however, it has broken a number of test cases for a library I'm working on (not currently public). The problem is that some low-level operations, notably File::metadata() and libc::fcntl(●, F_GETFL), can only fail because of an invalid fd. Exhaustively exercising error handling in a crate that uses these therefore requires mock Files that wrap invalid fds. It was formerly possible to construct such a File and have it "work" (i.e. all operations on it would fail, as the test code requires), but now they crash the entire test group.

I would like to start a conversation about how to recover the ability to unit-test error handling for code that calls OS primitives that can only fail when supplied an invalid fd. Possibilities that occur to me include

  • a new File constructor, File::unusable(), or something like that, that gives you a File on which all I/O is guaranteed to fail, and as_raw_fd() is guaranteed to give you an invalid fd, but dropping it is fine
  • alter debug_assert_fd_is_open so that if the fd has a particular, documented value (−2 perhaps) that means "I made this File wrap an invalid fd on purpose", it doesn't abort the process; make similar changes throughout the ecosystem (e.g. rustix has several APIs that will abort if passed −2 as an fd).
  • have someone write a really thorough crate for failure injection on system calls in general, obviating the need to pull tricks like this

I am open to any and all suggestions ... other than "don't bother testing your error handling" :wink:

Example of code broken by #124210 (playground link)
use std::os::fd::AsRawFd;
use std::fs::File;
use std::io;

#[allow(dead_code)]
#[derive(Debug)]
struct WrappedFile {
    f: File,
    // more fields in the real thing
}

fn wrap_read_only_file(f: File) -> io::Result<WrappedFile> {
    // SAFETY: trusts you to supply a valid fd, manual error checking
    // required.  Various other safety issues are not relevant when
    // the second argument is F_GETFL.
    let fl = unsafe { libc::fcntl(f.as_raw_fd(), libc::F_GETFL) };
    if fl == -1 {
        return Err(io::Error::last_os_error());
    }
    if (fl & libc::O_ACCMODE) != libc::O_RDONLY {
        return Err(io::Error::from_raw_os_error(libc::EINVAL));
    }

    Ok(WrappedFile { f })
}

// originally a unit test of wrap_read_only_file's error handling
pub fn main() {
    use libc::{c_int, rlimit};
    use std::os::fd::FromRawFd;

    // The only way to make fcntl(fd, F_GETFL) fail is to supply an
    // invalid file descriptor.  An fd whose numeric value is larger
    // than the process's hard limit on the number of open files is
    // guaranteed to be invalid, no matter what the rest of the
    // process does.
    let f = unsafe {
        let mut rl = rlimit {
            rlim_cur: 0,
            rlim_max: 0,
        };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) < 0 {
            panic!(
                "getrlimit failed despite valid args: {}",
                io::Error::last_os_error()
            );
        }
        // rlim_max is presumably either u32 or u64, and c_int is
        // almost certainly i32.  Under these assumptions, if the
        // system call returns the maximum value for the type (and
        // thus .saturating_add(1) returns the value unchanged) then
        // the try_into is guaranteed to fail, so one expect() will do
        // for both.
        let fd: c_int = rl
            .rlim_max
            .saturating_add(1)
            .try_into()
            .expect("fake fd ought to be in positive range of c_int");
        File::from_raw_fd(fd)
    };

    let err = wrap_read_only_file(f).unwrap_err();
    assert_eq!(err.raw_os_error(), Some(libc::EBADF));
}

First of all, before anything else: :+1: to the problem statement of wanting to test your error-handling code. That seems like a perfectly reasonable problem to solve.

The first two of the options you're suggesting would break the intended property that those debug assertions are attempting to maintain, namely that a File shouldn't contain an invalid FD. This seems comparable to, for instance, testing an error path using a str that contains invalid UTF-8.

The error-injection approach would certainly work, but I can absolutely understand not wanting to do that much just to test error handling.

While I'm not sure this is a good idea, one possible approach would be to mem::forget your File so it never gets dropped. I don't know that that would avoid all possible future issues, but it might address this one.
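Concretely, the mem::forget route might look like this minimal sketch. The hard-coded fd value and the function name are illustrative only; the fd stands in for the rlimit-derived value in the original example.

```rust
use std::fs::File;
use std::mem;
use std::os::fd::FromRawFd;

// Illustrative only: an fd far above any plausible RLIMIT_NOFILE,
// standing in for the rlimit-derived value in the original example.
fn metadata_fails_on_invalid_fd() -> bool {
    // SAFETY: the fd is never closed, because the File is forgotten below.
    let f = unsafe { File::from_raw_fd(1_000_000) };
    let failed = f.metadata().is_err();
    mem::forget(f); // the File is never dropped, so its destructor never runs
    failed
}

fn main() {
    assert!(metadata_fails_on_invalid_fd());
}
```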


Can you do drop(OwnedFd::from(file)) at the end of your test cases? Then you'll only hit the abort if the test fails.

Edit: mem::forget as suggested by Josh is probably a better option


Another option would be to Box::leak your files


Well, the more important property is to not close an FD you do not own. But we can't detect that in that codepath, or in general. So this is actually a proxy measure: we notice when something else previously closed an FD it did not own, leaving ours invalid.

So not killing the process when it's a guaranteed-never-valid number can at least be considered, because that shouldn't lead to UB. But it generally still is a bug, not something OwnedFd is intended for.

I am open to any and all suggestions ... other than "don't bother testing your error handling" :wink:

Other options:

  • run tests as separate processes and handle the abort (or signal success-up-to-the-abort) some other way
  • take AsFd/AsRawFd or a custom trait instead of File

I believe these are all equivalent to wrapping the file object in a ManuallyDrop, which works for most of my test cases, but not for the one I cut down to make the example code up top, which has this structure:

struct WrappedFile { f: File /* etc */ }
fn wrap_file(f: File) -> io::Result<WrappedFile> {
    something_that_can_fail(&f)?;
    Ok(WrappedFile { f /* etc */ })
}

#[test]
fn test_wrap_file() {
    let f = file_wrapping_invalid_fd();
    let err = wrap_file(f).unwrap_err();
    assert!(something_about(err))
}

The drop that triggers the abort happens inside wrap_file, when something_that_can_fail fails. In the real thing, wrap_file is part of my library's public API, so its signature cannot easily be changed.

Also, ManuallyDrop or equivalent only avoids the abort right now because debug_assert_fd_is_open is only called from File's destructor. I would not be at all surprised if, in the future, File::metadata() and/or File::as_raw_fd() start doing the same check, and then my crate is broken again. I'd like an officially supported way to test this class of failures, please :wink:

I don't want to get into custom test runners.

This is an interesting suggestion, however, that I will look into. WrappedFile isn't using the File for much besides its file descriptor, and it might not be too big of an API break.


I would not be at all surprised if, in the future, File::metadata() and/or File::as_raw_fd() start doing the same check, and then my crate is broken again. I'd like an officially supported way to test this class of failures, please :wink:

Well... currently it's only a debug assert because it incurs an extra syscall, which we do to work around FUSE issues. But we could extend this assert to other places in the future where we know FUSE can't interfere (e.g. duplicating file descriptors) and at that point you would no longer be dealing with error handling, we'd just unconditionally abort the process on EBADF.

The problem is that some low-level operations, notably File::metadata() and libc::fcntl(●, F_GETFL), can only fail because of an invalid fd.

I think FUSE can inject errors in metadata via getattr, I haven't tried it though.

Strictly speaking, no it doesn't. FromRawFd is unsafe, and the safe From<OwnedFd> requires an owned file descriptor that can be closed. This is the oft-maligned concept of IO safety; File has a safety invariant that it must hold an owned/closable file descriptor, and code that doesn't document otherwise doesn't need to / isn't expected to handle the case where the file descriptor isn't, with a consequence of library-level UB.

I don't think that's what I'm saying. Rather, if your code is resilient to unsafe Files as input, then it must be careful to only handle the unsafe Files in permitted manners. std doesn't document any File functionality as supporting unsafe File states, so this would mean using the absolute bare minimum to test your code (from_raw_fd and the relevant API), and taking responsibility for if any of the used API changes to exploit the library UB you're expecting to stay latent.

Alternatively, test your error handling with mocks that don't use File directly or by hooking fcntl directly.

This definitely feels like a case for panic! instead of rtabort! since during Drop for OwnedFd there's no UB left to be potentially caused at that point. (Also, impl-wise, it feels like this should check the result of close for EBADF, not fcntl beforehand. Also also, Dir just below in the diff uses assert! for a very similar check for last_os_error().is_interrupted(), which would panic on EBADF prior to this change.)

Also×3, it should be panic_nounwind! anyway, not rtabort!.

If these do, it's a pretty strong motivator that you shouldn't need to handle that case. Eventually the blessed way to test it anyway will be disabling cfg(ub_checks), like cfg(debug_assertions) is handled. In fact, IIRC, ub_checks is currently set by debug_assertions of the calling crate, so turning off debug assertions would be the semiofficial way to test the behavior gated behind UB checks.
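If ub_checks does follow the calling crate's debug_assertions as described, then a Cargo profile tweak would be that semiofficial switch. A sketch, with the caveat that this disables every debug assertion in the test profile, not just the fd check:

```toml
# In Cargo.toml: turn off debug assertions (and with them, currently,
# ub_checks) for the profile `cargo test` uses.
[profile.test]
debug-assertions = false
```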

If it's just a matter of writing the custom runner, I think cargo-nextest can handle this out of the box.

Then storing the RawFd directly seems like the appropriate behavior. And existing API you don't want to break can still take File and turn it into a RawFd eagerly. (But adding the generic would only break type inference, and thus technically be a minor change.)
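A sketch of that shape. The names are hypothetical, the fd < 0 check stands in for the real fcntl validation, and a real WrappedFile would need a Drop impl that closes the fd it now owns:

```rust
use std::fs::File;
use std::io;
use std::os::fd::{IntoRawFd, RawFd};

// Hypothetical: the wrapper stores a RawFd instead of a File, so tests
// can hand the fallible path any integer without constructing a File.
struct WrappedFile {
    fd: RawFd,
}

fn wrap_raw_fd(fd: RawFd) -> io::Result<WrappedFile> {
    // Stand-in for the real validation (e.g. fcntl(fd, F_GETFL)).
    if fd < 0 {
        return Err(io::Error::from_raw_os_error(9)); // EBADF on Linux
    }
    Ok(WrappedFile { fd })
}

// The existing public API keeps its signature and extracts the fd eagerly.
pub fn wrap_file(f: File) -> io::Result<WrappedFile> {
    wrap_raw_fd(f.into_raw_fd())
}

fn main() -> io::Result<()> {
    let ok = wrap_file(File::open("/dev/null")?)?;
    assert!(ok.fd >= 0);
    // A unit test can probe the failure path without ever making a File.
    assert!(wrap_raw_fd(-1).is_err());
    Ok(())
}
```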

It does vaguely feel like there should be a policy on when assert_unsafe_precondition!(check_library_ub, ...) can be allowed to cause an unwinding panic versus an aborting panic. Since there's library UB, an abort is always allowed, but if the potential fallout is "low stakes" enough, an unwinding panic is in many cases a nicer user experience.

My reading of fcntl(2) would allow EACCES, EAGAIN, or ENOLCK failures for F_GETFL as documented (they aren't restricted to specific values of op nor dependent on an optional argument). And my understanding of IO errors was that any IO call that is allowed to error could return with any error number, and the documentation only lists common error causes. But POSIX may certainly be stricter than my recollection here (which is Win32 biased, which does explicitly document "common errors").

panicking may try to open further files to do symbolication. If your process is in a state where the file descriptor table has become corrupted you want to bail asap.

I was replying to metadata being mentioned, which calls stat.

I did that initially, but then FUSE dashed any hope for sanity


Maybe this is me being overly optimistic, but corrupting kernel state badly enough that properly error-tolerant code paths in backtrace display create further problems seems very unlikely, and making sense of kernel issues is out of Rust's scope. The kernel is supposed to respond to any FD syscall with, at a minimum, coherent behavior, so newly opened FDs would still work normally enough.

So attempting to close an FD which isn't open isn't an indicator that there are unfixable issues requiring an immediate abort imo, only that something has gone wrong. But a chance to report the issue via panic hook shouldn't create additional issues.

I'm pretty sure that we still eagerly abort if the panic hook starts recursively panicking, without relying on unwinding reaching the drop glue unwind abort edge.

Unfortunately, it seems that in reality exotic IO drivers will expose you to every edge case that can exist.


By corrupted I didn't mean the kernel structures becoming broken. I meant that the ownership in userspace is violated. If you let things continue to run like this, something may end up writing to the wrong FD and then damaging files you'd rather not have damaged. Bailing out fast is just damage minimization.

In particular, you want to abort the process as soon as possible so that other threads, that might be about to access a closed-incorrectly-and-reopened file descriptor, will stop as soon as possible.

I feel like I need to clarify that my goal is not to create, manipulate, or be resilient to Files that violate their own internal constraints.

Rather, my goal is to make the ? in these functions return an error, from a unit test, without changing the function's signature.

pub(crate) fn last_modified_time(file: &File) -> io::Result<SystemTime> {
    file.metadata()?.modified()
}

pub(crate) fn relevant_xattrs(file: &File) -> io::Result<Vec<String>> {
    use xattr::FileExt;
    Ok(file.list_xattr()?
        .filter_map(|tag| 
            tag.to_str()
               .filter(|s| s.starts_with("mycrate_"))
               .map(|s| s.to_owned()))
        .collect())
}

It just happens to be that the only way I know to make file.metadata() and file.list_xattr() fail [on Unix anyway] is to feed them one of these messed-up Files.

you shouldn't need to handle that case

If File::metadata (and Metadata::modified, btw) were changed to guarantee that they never return an Err ("returns io::Result for backward compatibility only, cannot fail" or some such in the documentation) then I'd cheerfully switch to using unwrap(), but until then I feel a need for tests that exercise the failure edges. More specifically, I want to be able to write tests of the functions that call these functions that probe those functions' ability to cope with these functions failing. Which is why it's important to not change their signatures.

I'm going to look into this, but I'm not sure it solves all my problems: the appropriate public API after making this change would be something like

pub fn wrap<T>(f: T) -> io::Result<WrappedF>
    where T: Into<OwnedFd>

and OwnedFd has the same validity constraint for its fd as File does. Also, this would seriously complicate a future port to Windows (where it'd need to be the same API but with OwnedHandle instead of OwnedFd, and I don't want to think about how many functions would need to be duplicated with cfg(unix) and cfg(windows) just for that...)

I am not sure myself whether the ERRORS section of https://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html is meant to be exhaustive, but it is largely moot. As a practical matter, I know of no other way to make fcntl(fd, F_GETFL) fail in a predictable fashion besides handing it an invalid fd, and the same goes for fstat and flistxattr and probably a couple others I'm not remembering right now but my crate does use. The code samples I keep posting are meant to be representative, not exhaustive.


I wouldn't try to do this with unit tests, but with an integration test that can e.g. use strace fault injection, which is just going to better exercise realistic code paths.
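For reference, a sketch of what that could look like; the binary name is a placeholder, and the `-e inject=` fault-injection syntax requires a reasonably recent strace:

```
strace -f -e inject=fstat:error=EBADF ./my_integration_test
```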


If all you care about is making the function fail within a specific unit test, you could hack that together with #[cfg(test)] and a thread-local variable to inject an Err return.

It's not pretty, but it doesn't involve library UB.
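A minimal sketch of that idea, with hypothetical names; the flag and the early return are left unconditional here, rather than behind #[cfg(test)], so the example runs as a plain program:

```rust
use std::cell::Cell;
use std::fs::File;
use std::io;
use std::time::SystemTime;

// Hypothetical injection flag; in a real crate this (and the early
// return below) would be gated with #[cfg(test)].
thread_local! {
    static FAIL_METADATA: Cell<bool> = Cell::new(false);
}

fn last_modified_time(file: &File) -> io::Result<SystemTime> {
    if FAIL_METADATA.with(|flag| flag.get()) {
        // Injected failure; 9 is EBADF on Linux.
        return Err(io::Error::from_raw_os_error(9));
    }
    file.metadata()?.modified()
}

fn main() -> io::Result<()> {
    let f = File::open("/dev/null")?;
    assert!(last_modified_time(&f).is_ok());
    FAIL_METADATA.with(|flag| flag.set(true));
    assert!(last_modified_time(&f).is_err());
    Ok(())
}
```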