Introduce write_at/write_all_at/read_at/read_exact_at on Windows

I have discovered that std::os::unix::fs::FileExt::read_exact_at and std::os::windows::fs::FileExt::seek_read are very far from being similar: the Windows version is MUCH slower.

There was a tracking issue a while ago ("Tracking issue for write_all_at/read_exact_at convenience methods", rust-lang/rust#51984) that introduced those methods for Unix. I think there would be significant value in introducing similar/the same methods for Windows as well, especially considering that doing it in a performant way seems to require unsafe code and is not trivial.

For context, my use case involves reading 32-byte sectors at random offsets from 1 GiB chunks, about 1 MiB in total. It has pretty good performance on Linux thanks to the high read concurrency available on modern SSDs, but takes orders of magnitude more time on Windows.

FileExt::seek_read ultimately calls sys::windows::Handle::synchronous_read, which calls NtReadFile; I'm not a Windows expert, but, as used here, this system call appears to be effectively the same thing as Unix pread. The code you linked as "the performant way", however, does a MapViewOfFile and an UnmapViewOfFile every time it's called. Modifying a process's memory map is inherently expensive at the hardware level, so, I would expect the technique used by positioned-io to be measurably slower than what the stdlib is doing. You might want to look into alternative explanations for the slowdown you're observing.

Independent of that, I agree that there's no good reason for the API for positioned read/write presented by std::os::unix::fs::FileExt to be different from that presented by std::os::windows::fs::FileExt. Can we converge them?
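For illustration, here is a minimal sketch of what a converged API could look like as a user-side extension trait. The trait name and error handling here are my own, not std's; the Windows branch loops on short reads the way std's read_exact_at does, since seek_read may return fewer bytes than requested and, unlike pread, also moves the file cursor as a side effect:

```rust
use std::fs::File;
use std::io;

/// Hypothetical converged positioned-read API; not part of std.
pub trait ReadExactAt {
    fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> io::Result<()>;
}

impl ReadExactAt for File {
    #[cfg(unix)]
    fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> io::Result<()> {
        // Unix already has exactly this method on FileExt (pread-based).
        std::os::unix::fs::FileExt::read_exact_at(self, buf, offset)
    }

    #[cfg(windows)]
    fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> io::Result<()> {
        use std::os::windows::fs::FileExt;

        // seek_read may return short reads, so loop until the buffer is
        // full, mirroring read_exact_at on Unix. Note that seek_read also
        // advances the file cursor, which pread does not.
        let mut filled = 0;
        while filled < buf.len() {
            match self.seek_read(&mut buf[filled..], offset + filled as u64) {
                Ok(0) => {
                    return Err(io::Error::new(
                        io::ErrorKind::UnexpectedEof,
                        "failed to fill whole buffer",
                    ));
                }
                Ok(n) => filled += n,
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }
}
```

This is only a user-space shim; a real std implementation could avoid the cursor side effect entirely by passing the offset to NtReadFile on a handle opened appropriately.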


Even when the I/O Manager is maintaining the current file position, the caller can reset this position by passing an explicit ByteOffset value to NtReadFile. Doing this automatically changes the current file position to that ByteOffset value, performs the read operation, and then updates the position according to the number of bytes actually read. This technique gives the caller atomic seek-and-read service.


  pread()  reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf.
  The file offset is not changed.

These are just not compatible.
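To make the incompatibility concrete, here is a small Unix-only sketch (the helper function is mine, using only std) showing that a pread-style read_at leaves the stream position untouched, which is exactly the guarantee seek_read does not give:

```rust
use std::fs::File;
use std::io::{self, Seek};
use std::path::Path;

/// Reads two bytes at offset 4 with pread-style `read_at`, then returns
/// the bytes and the stream position afterwards. On Unix the position
/// stays at 0; a Windows `seek_read` at the same offset would leave the
/// cursor at 6 instead.
#[cfg(unix)]
fn positioned_read_demo(path: &Path) -> io::Result<([u8; 2], u64)> {
    use std::os::unix::fs::FileExt;

    let mut file = File::open(path)?;
    let mut buf = [0u8; 2];
    file.read_at(&mut buf, 4)?;
    let pos = file.stream_position()?;
    Ok((buf, pos))
}
```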

I must admit I have not actually tested the performance of that crate; it seemed like it should be fast, but I might be completely wrong.

What I did test, and what was much faster, is memory-mapped I/O, but mapping the whole file was too much (we have seen files over 60 TiB in size). I am currently testing mapping individual sectors of the file instead, but then we're performing the file mapping a lot of times, and ultimately we don't want memory-mapped I/O due to its numerous drawbacks. It seems like there must be a way to simply read the bytes I need without going through unnecessary kernel abstractions.

It might be worth experimenting with Windows' asynchronous (aka "overlapped") I/O API to queue up a whole bunch of these reads and let the kernel complete them in the most efficient order. The stdlib doesn't have anything for that but there's probably at least one crate out there.

That doesn't mean there's no way to converge the API presented by the stdlib, it just means NtReadFile used in this fashion doesn't provide the semantics required to match pread. I'm curious what the alternative to "Even when the I/O Manager is maintaining the current file position" is.

Another option would be io_uring or its Windows equivalent, which should minimize the number of syscalls for those small reads.

Yeah, 32k randomly scattered 32-byte reads are pretty bad. I would expect even pread to be suboptimal here just due to the syscall overhead.

Yeah, io_uring is the ultimate goal, but it is Linux-only and requires a fairly new kernel (many users are on Ubuntu 20.04 with a non-HWE kernel, unfortunately).

There was an idea to just read the whole gigabyte and drop the unnecessary data, which would actually be faster on Windows than what I have experienced with seek_read, but it wouldn't be fast enough on SATA SSDs, and we have tight time requirements for this operation.
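As a back-of-the-envelope check on why that fallback cannot meet a tight time budget on SATA: assuming roughly 550 MB/s sequential throughput (my assumption of a typical SATA SSD figure, not a number from this thread), reading a full 1 GiB chunk takes about two seconds, and less than 0.1% of the data read is actually needed:

```rust
/// Estimate the time to read a full 1 GiB chunk sequentially at an
/// assumed SATA SSD throughput, and the fraction of it actually needed.
fn full_chunk_read_estimate() -> (f64, f64) {
    let sata_bytes_per_sec = 550e6; // assumed sequential throughput
    let chunk_bytes = 1024f64 * 1024.0 * 1024.0; // 1 GiB chunk
    let useful_bytes = 1024f64 * 1024.0; // ~1 MiB of 32-byte sectors

    let seconds = chunk_bytes / sata_bytes_per_sec; // ~1.95 s per chunk
    let useful_fraction = useful_bytes / chunk_bytes; // 1/1024
    (seconds, useful_fraction)
}
```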

Modern SSDs are shockingly good at serving multiple concurrent requests; we just need to find a way to leverage that on all 3 major platforms (yes, macOS is also a target).

Files that aren't seekable don't maintain a file pointer. Files opened for async (overlapped) access do not need to maintain a file position.

I would strongly suggest coming up with some benchmarks first. Improving performance by guesswork is a lot trickier.

That said, I doubt it'd be hard to beat std for read/write perf. All the std APIs are designed to be synchronous and to work on any open handle, so using IOCP would be a great improvement. Even better would be IoRing but that requires a newer kernel.

Implemented a benchmark (not reduced for now, just in my app) and ran it with both memory-mapped I/O and seek_read on Windows. The result is that memory-mapped I/O leaves the disk with 0% idle time, while with seek_read disk idle time is ~15% and total execution time of the app is much longer: 71 ms vs 267 ms.

In the meantime, is anyone aware of a non-async library for Windows for reading files at an arbitrary offset without seeking first (on Linux the async version of the app is slower anyway)?

If you're doing synchronous I/O you'll either want to issue readaheads or throw more threads at the problem so that the parallelism can mask the wait-induced stalls.
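The threads-masking-stalls idea can be sketched roughly as follows (names are mine, not from this thread): split the positioned 32-byte reads across several threads, each holding its own handle to the same file so there is no shared cursor to contend on. Unix-only for brevity; on Windows the same shape works with seek_read plus a short-read retry loop:

```rust
use std::fs::File;
use std::io;
use std::path::Path;
use std::thread;

/// Illustrative sketch: issue positioned reads from several threads,
/// each with its own file handle, to keep the SSD's queue busy.
#[cfg(unix)]
fn parallel_read_32(path: &Path, offsets: &[u64], threads: usize) -> io::Result<Vec<[u8; 32]>> {
    use std::os::unix::fs::FileExt;

    let threads = threads.max(1);
    let chunk = ((offsets.len() + threads - 1) / threads).max(1);
    let mut out = vec![[0u8; 32]; offsets.len()];

    thread::scope(|scope| -> io::Result<()> {
        let handles: Vec<_> = offsets
            .chunks(chunk)
            .zip(out.chunks_mut(chunk))
            .map(|(offs, bufs)| {
                scope.spawn(move || -> io::Result<()> {
                    // One handle per thread: no shared cursor, no locking.
                    let file = File::open(path)?;
                    for (offset, buf) in offs.iter().zip(bufs.iter_mut()) {
                        file.read_exact_at(buf, *offset)?;
                    }
                    Ok(())
                })
            })
            .collect();
        for handle in handles {
            handle.join().expect("reader thread panicked")?;
        }
        Ok(())
    })?;

    Ok(out)
}
```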

Reads are random, the number of threads is already equal to the number of cores, and this works great on Linux and macOS. My current suspicion is that seek_read has the "seek" part, so calling it concurrently from multiple threads can be problematic. I'll try opening the file multiple times, once in each thread, on Windows and see how that goes.

The memory-mapped I/O version: are you making one big map of the entire file and keeping it around for the duration of whatever it is your app actually does, or are you doing what positioned-io does, mapping and unmapping chunks as needed?

Memory mapped I/O is great! However, being realistic it's not something std can do by default behind the user's back. Crates have a lot more freedom though!

Mapping the whole file for the duration of the app's lifetime; otherwise performance is even worse than direct file reads. The challenge is that files are multiple (sometimes tens of) terabytes in size, so Windows fills memory with pages (and users are concerned about the application using 100% of available memory, even though that is not quite the case). Apparently there is also a limit on how much space can be mapped this way in total, which seems to be smaller than the amount of virtual memory, and some users are hitting it as well (though it might be Linux-specific, I'm not sure).

I'm not suggesting it either; I'm just sharing what seems to work better, and that on other platforms there are APIs in std that do exactly what I need efficiently, without memory-mapped I/O.

Opening the file multiple times, once for every thread in a thread pool, works even better than memory-mapped I/O, somewhat expectedly. I think I'll go with that for now. It seems to be a tiny bit faster on Linux with read_exact_at as well, but hardly outside of the noise range.

On Windows relative numbers are:

  • 936 ms for a single file with seek_read (+ handling of partial reads identically to read_exact_at in std)
  • 245 ms for a single file with memory-mapped I/O
  • 190 ms with the file opened many times, once for each thread in the thread pool, using seek_read

This was on Micron 5200 3.8T SATA SSD and i7-6700 processor.

Can you post your test program somewhere? The numbers for memory-mapped I/O with a map and unmap for each read are so far off from what I'd expect, that I want to run the same benchmark on a couple different Unixes to find out if my mental model of the cost of altering a process's address space is really that wrong.


I'm afraid the program will not be very useful, as it requires a long preparation step and then audits the results of that preparation.

Essentially I have this trait:

pub trait ReadAtSync: Send + Sync {
    /// Fill the buffer by reading bytes at a specific offset
    fn read_at(&self, buf: &mut [u8], offset: usize) -> io::Result<()>;
}

It is implemented for [u8]:

impl ReadAtSync for [u8] {
    fn read_at(&self, buf: &mut [u8], offset: usize) -> io::Result<()> {
        if buf.len() + offset > self.len() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "Buffer length with offset exceeds own length",
            ));
        }

        buf.copy_from_slice(&self[offset..][..buf.len()]);
        Ok(())
    }
}


And File:

impl ReadAtSync for File {
    fn read_at(&self, buf: &mut [u8], offset: usize) -> io::Result<()> {
        self.read_exact_at(buf, offset as u64)
    }
}

And then I have created the following wrapper that opens the file multiple times and implements the same trait, leveraging the above File implementation:

pub struct RayonFiles {
    files: Vec<File>,
}

impl ReadAtSync for RayonFiles {
    fn read_at(&self, buf: &mut [u8], offset: usize) -> io::Result<()> {
        let thread_index = rayon::current_thread_index().ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::Other,
                "Reads must be called from rayon worker thread",
            )
        })?;
        let file = self.files.get(thread_index).ok_or_else(|| {
            io::Error::new(io::ErrorKind::Other, "No files entry for this rayon thread")
        })?;

        file.read_at(buf, offset)
    }
}

impl RayonFiles {
    pub fn open(path: &Path) -> io::Result<Self> {
        let files = (0..rayon::current_num_threads())
            .map(|_| {
                // Simplified here; the real code also applies OS hints
                // for random access when opening.
                let file = OpenOptions::new().read(true).open(path)?;

                Ok::<_, io::Error>(file)
            })
            .collect::<Result<Vec<_>, _>>()?;

        Ok(Self { files })
    }
}

All file operations above are providing OS hints about random file reads using cross-platform utility traits (they implement a cross-platform version of read_exact_at as described before).

Now, what I actually do with that is run some CPU-intensive work interleaved with random file reads (~20 kiB each per 1 GB of space used) using rayon on a large file (2.7 TB in the above case).

So the above numbers are not just disk reads; they represent the workload I actually care about, where reads are only one component. But it is clear that there is a massive difference depending on how files are read on Windows, and the difference is even larger than the above results show, because CPU time is a significant contributor to the totals.

Linux only gained finer-grained mmap locking fairly recently, and I think there have also been some optimizations to reduce IPIs/TLB shootdowns, so it can depend on which kernel version you're using.

$ zcat /proc/config.gz | grep CONFIG_PER_VMA_LOCK