Reading into uninitialized buffers, yet again

I’ve been thinking about @sfackler’s “Reading into uninitialized buffers” RFC and the “API to acquire arbitrarily initialized buffer” thread which respectively propose a complicated solution that seems to have gotten stuck right before merging, and a simple solution requiring changes to LLVM that may not be feasible, to the problem of reading into uninitialized buffers:

the Read trait is

fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>

and you have to needlessly initialize buf before you can pass it to read().

It seems to me that the key to the problem is that the above method has the wrong signature. It should actually be something like

fn read_buf(&mut self, buf: &'a mut [MaybeUninit<u8>])
    -> io::Result<&'a mut [u8]>

That is, it takes a mutable reference to a maybe-uninitialized buffer, and on success it gives you back a mutable reference to a subslice of that buffer which is now guaranteed to be initialized. It’s safe to call, but implementations will need internal unsafe blocks. I don’t know how to extend this to vectored reads, but I think it should take care of encapsulating the unsafety of partial buffer initialization without needing all the complexity of ReadBuf or any changes to LLVM.

(This is quite similar to an example in the MaybeUninit documentation.)

Hypothetically, the safe and unsafe Read implementations from sfackler’s RFC would look like this using this interface:

impl ReadIntoUninit for MyReader {
    fn read_buf(&mut self, buf: &'a mut [MaybeUninit<u8>])
            -> io::Result<&'a mut [u8]> {
        // Fill the whole buffer with some nonsense.
        for (i, byte) in unfilled.iter_mut().enumerate() {
            byte.write(i as u8);
        }

        // And indicate that we've written the whole thing.
        Ok(unsafe { buf.slice_get_mut() })
    }
}

impl ReadIntoUninit for TcpStream {
    fn read_buf(&mut self, buf: &'a mut [MaybeUninit<u8>])
            -> io::Result<&'a mut [u8]> {
        unsafe {
            // We're just delegating to the libc read function, which returns an `isize`.
            // The return value indicates an error if negative and the number of bytes read otherwise.
            let nread = libc::read(self.fd, buf.as_mut_ptr().cast::<libc::c_void>(), buf.len());

            if nread < 0 {
                Err(io::Error::last_os_error())
            } else {
                Ok(buf[..(nread as usize)].slice_get_mut())
            }
        }
    }
}
1 Like

Iirc, the main issue with this proposal is that you can't implement read in terms of read_buf, because casting from &mut [u8] to &mut [MaybeUninit<u8>] is unsafe. We really want an &out abstraction for this to work properly.

The writeup linked by the RFC discusses that exact option: https://paper.dropbox.com/doc/IO-Buffer-Initialization--A~SYVgn5DfaK57b22IUNJQyuAg-MvytTgjIOTNpJAS6Mvw38. It does not provide sufficient guarantees for many use cases.

1 Like

Could this be done by extending the Read trait with a new method like this?

pub trait Read {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize>;

    // Name subject to bikeshedding. Let's ignore that unimportant detail.
    fn read_uninit<'a>(&mut self, buf: &'a mut [MaybeUninit<u8>]) -> Result<&'a mut [u8]> {
        let ptr = MaybeUninit::first_ptr_mut(buf);
        let init_buf;
        unsafe {
            core::ptr::write_bytes(ptr, 0u8, buf.len());
            init_buf = MaybeUninit::slice_get_mut(buf);
        }
        let len = self.read(init_buf)?;
        return Ok(&mut init_buf[0..len]);
    }
}

The default implementation is suboptimal because it has to initialize the whole buffer, particularly since only a portion of the buffer may be filled. But implementations that explicitly implement read_uninit could be optimal.

I know this is more or less what the Dropbox paper proposes and then discards. I'm not sure I fully understand Dropbox's concerns with the API. I think the read_uninit method could also return Result<(&'a mut [u8], &' mut [MaybeUninit<u8>])>, which is basically buf.split_at_mut(...).

As for the vectored API: I've never used vectored I/O APIs. I'm sure they're useful, but I don't think this should be abandoned just because the vectored API doesn't have an obvious solution. There are plenty of people like me that would benefit from a non-vectored API.

1 Like

I'm not sure I understand the problem. Is it that

The Read​ trait is not unsafe, so we can’t assume it’s implemented as we’d expect. The implementation could read from the buffer, or return the wrong number of bytes read.

(quoting the document you linked) Since I think we'd have to make any new read_buf function be part of a new trait anyway, why not have it be an unsafe trait? Implementations inevitably need unsafe code internally, so it isn't that much of an extra requirement.

That's no obstacle to the blanket impl that would be in the stdlib, though, surely? Concretely, this compiles fine for me ...

use std::mem::MaybeUninit;
use std::io::Result;
use std::io::Read;

unsafe trait ReadUninit<'a> {
    fn read_uninit(&mut self, buf: &'a mut [MaybeUninit<u8>])
        -> Result<&'a [u8]>;
}

pub struct UReadable;

unsafe impl<'a> ReadUninit<'a> for UReadable {
    fn read_uninit(&mut self, _buf: &'a mut [MaybeUninit<u8>])
        -> Result<&'a [u8]>
        {
            unimplemented!();
        }
}

impl<'a> Read for UReadable {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.read_uninit(unsafe { &mut *(buf as *mut [u8] as *mut[MaybeUninit<u8>]) })
        .and_then(|r| Ok(r.len()))
    }
}

pub fn demo(h: &mut UReadable, b: &mut [u8]) -> Result<usize> {
    h.read(b)
}

(playground link)

Additionally, the method’s signature doesn’t provide enough guarantees for common cases. Imagine a read_exact method. It’s going to be repeatedly calling read2 with progressively smaller slices of the input buffer. For that to work, it needs to be the case that the returned slice comes from the start of the input buffer, not the middle, or end, or some other random region of memory. Since Read is not an unsafe trait, the read_exact implementation can’t make that assumption, and would have to manually check that the slice points at the right place.

In addition, the default implementation has no other option than to zero out the entire buffer every time before passing it to read, which imposes a signifcant overhead when using a reader that doesn't override it.

1 Like

Why do you think the RFC is "stuck"? I don't see anything indicating that.

I like this (it's certainly way more elegant and causes way less churn than a whole new kind of pointer.).

However, I have one concern: in particular, I'm not sure that

is sound, because the read_buf() method would then need to make sure not to read from the uninitialized buffer. Can this be resolved without marking read_buf() as unsafe?

The various guarantee issues seem like they'd all go away if the new extension trait were unsafe; what is the obstacle to that?

the default implementation has no other option than to zero out the entire buffer every time before passing it to read

Does there need to be a default implementation of read_into_uninit? There certainly needs to be a default implementation of Read for anything that implements the new trait, but ISTM we could get away without going the other way.

Why do you think the RFC is "stuck"? I don't see anything indicating that.

I could be wrong, but it was approved for merge almost two months ago, this was immediately followed by a report of problems with the implementation, discussion since then seems not to have resolved said problem, and the current state is "merging is blocked".

It doesn't need a default implementation if it's going to be a new trait, but that makes it incompatible with anything that works with Read and doesn't want to require implementations of the new trait.

As my comment said, the issue I identified doesn't block the RFC merging. It's just needs someone with the right permissions to actually merge it.

FWIW, it looks like the RFC has just been merged.