Pre-RFC: Read::read_into_uninitialized

Background reading

Current situation

The std::io::Read trait is the primary way to do the I half of I/O, such as reading from a file. It looks like this:

pub trait Read {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;

    fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> { ... }
    fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> { ... }

    // [Other default methods…]
}

A Read impl is expected to have its read method write to the start of buf, and return the number of bytes written. A well-behave impl does nothing else, but there is nothing stopping a buggy impl from reading from buf, or from returning an integer greater than the number of bytes actually written.

For this reason, although a typical use case of Read is to read into a newly-allocated buffer, callers cannot safely call read with an uninitialized buffer in a generic context. It has to initialize the entire buffer first, for example with zeroes. Zeroing reportedly has a 20-25% overhead in somewhat realistic benchmarks. Only with a concrete type such as fs::File whose Read impl is known to be “well-behaved” can we avoid this overhead.

And this is exactly what read_to_end and read_to_string do. The default implementations spend time writing zeroes, and many Read impls in the standard library override these methods to use uninitialized memory instead.

This trick only works for Vec<u8> and String. Other buffer types are out of luck. A method I just wrote taking a generic R: Read type parameter has to be unsafe, leaving it up to users to unsure that whatever Read impl they use is well-behaved.

Proposal

Edit: comments below list several reasons why this doesn’t work at all at maintaining safety.

I’d like to add a new default method to Read that calls read after zeroing, and can be overridden by Read impls to skip zeroing whey they also make sure that their read method is well-behaved.

This trait method must be unsafe to implement, but not to call. So the reverse of unsafe fn. This is what unsafe trait does, but applying it to Read would be a breaking change, and introducing a new trait would not help with all the existing Read impls (including those outside of the standard library).

So we have to get a bit creative.

pub trait Read {
    // [Existing methods…]

    fn read_into_uninitialized(&mut self, buf: *mut [u8]) -> io::Result<TrustedUsize> {
        unsafe {
            let slice = &mut *buf;
            ptr::write_memory(slice.as_mut_ptr(), 0, slice.len());
            // `slice` is now full initialized
            self.read(slice).map(TrustedUsize::new)
        }
    }
}

#[derive(Debug, Copy, Clone)]
pub struct TrustedUsize(usize);

impl TrustedUsize {
    pub unsafe fn new(x: usize) -> Self { Self(x) }
    pub fn get(self) -> usize { self.0 }
}

We use a raw slice (Which is totally a thing! Thanks generalized DST!) so that reading form it (or writing to it) requires unsafe to dereference it. Introducing a whole new TrustedUsize type is unfortunate, but I don’t know how else to “protect” from a non-zero return value in an impl that doesn’t write anything to buf (and so doesn’t need to unsafe’ly dereference it).

Details subject to bikeshed, of course.

Alternatives

Tokio has a different approach.

1 Like

Hmm… What would prevent me from writing something like this?

struct MyWrongStuff {};

impl Read for MyWrongStuff {
    fn read_into_uninitialized(&mut self, buf: *mut [u8]) -> io::Result<TrustedUsize> {
        let x = SomeOKStruct {};
        x.read_into_uninitialized(buf.wrapping_offset(1)) // Now I'm returning a TrustedUsize just by using one generated by a well-behaving impl
    }
}
1 Like

Rezeroing the buffer on every call by default is going to be pretty brutal performance-wise. If you have e.g. a BufferedReader over SomethingThatDoesntOverrideReadIntoUninitialized it’s going to be writing 8KB of zeroes and then reading probably way less than that.

Almost all impls in the standard library are well-behaved and so should override read_into_uninitialized to not do zeroing. BufferedReader in particular already has a zeroed vec![0; cap] buffer for calling the nested Read::read.

But yeah, it’s pretty heavy.

Every reasonable implementation of Read anywhere should be not reading from the buffer.

BufferedReader only has a zeroed buffer because read_into_uninitialized doesn’t already exist. If it did, we’d have to deal with the same question that everyone else would have to deal with - is it better to use read_into_uninitialized and have significantly worse behavior when interacting with code that hasn’t been updated to override read_into_uninitialize or continue using read and not take advantage of the implementations that do override it.

It seems like an unsafe trait TrustedRead {} together with specialization would be a more robust way of doing this, but that’s obviously blocked on specialization stabilizing.

This trait method must be unsafe to implement, but not to call

What would stop someone from doing this?

let buf: *mut [u8] = {
    let mut v = vec![0u8; 16];
    &mut v as &mut [u8]
    // v is dropped here
};
f.read_into_uninitialized(buf);
3 Likes

It would be better to have an ‘uninitialized buffer’ trait rather than forcing every Read impl to take care of it (and use unsafe code). Something like

unsafe trait UninitializedBuffer<T> {
    fn get(&mut self) -> *mut [T];
    unsafe fn did_fill(&mut self, size: usize);
    // + safe helper methods to append, etc.
}
// example
unsafe impl<T> UninitializedBuffer<T> for Vec<T> {
    fn get(&mut self) -> *mut [T] {
        unsafe {
            slice::from_raw_parts_mut(self.as_mut_ptr().offset(self.len() as isize),
                                      self.capacity() - self.len()) as *mut [T]
        }
    }
    unsafe fn did_fill(&mut self, size: usize) {
        self.set_len(self.len() + size);
    }
}
2 Likes

https://github.com/rust-lang/rust/pull/42002

Ok, so my initial proposal doesn’t work at all :slight_smile: I’ve edited it to cross it out.

@comex, How would UninitializedBuffer be used with Read?

If the read method parameter type can change, how about creating an InitializeOnReadSlice trait that provides slice-like operations, with two implementations:

  1. A pass-through wrapper for an initialized slice which implements From<&mut [u8]>;
  2. A wrapper for an uninitialized slice, which (partially?) initializes elements only if the slice is read, using clone of a default element.

Change read to accept mut T where T : Into<InitializeOnReadSlice> so that current caller code works.

Although all read implementations must be changed to the new signature, the unsafety is confined to the uninitialized wrapper implementation.

We are not breaking every implementation of Read in existence.

Please consider other “read-like” low-level APIs as well, such as recvmsg and inflate.

I think you can save the initial proposal by using phantom lifetimes:

#![feature(unique)]

use std::marker::PhantomData;
use std::ptr::{self, Unique};

// invariant: `len` is equal to the length of the uninitialized
// buffer with the lifetime of `sticky`.
pub struct TrustedLen<'sticky> {
    len: usize,
    sticky: PhantomData<*mut &'sticky ()>
}

impl<'sticky> TrustedLen<'sticky> {
    fn get(self) -> usize { self.len }
}

pub struct UninitializedBuffer<'sticky, T> {
    ptr: Unique<T>,
    capacity: usize,
    len: usize,
    sticky: PhantomData<*mut &'sticky ()>, // IMPORTANT: invariant lifetime
}

impl<'sticky, T> UninitializedBuffer<'sticky, T> {
    // ... all non-resizing Vec methods, including `to_raw_parts`
    pub fn push(&mut self, value: T) {
        if self.len == self.capacity {
            panic!("ran out of space")
        }
        unsafe {
            let end = self.ptr.as_ptr().offset(self.len as isize);
            ptr::write(end, value);
            self.len += 1;
        }
    }

    /// important: this borrows the UninitializedBuffer for 'sticky, so you can't change the
    /// length afterwards (this can also return a &mut [T] if that is desired).
    pub fn trusted_len(&'sticky mut self) -> TrustedLen<'sticky> {
        TrustedLen { len: self.len, sticky: PhantomData }
    }

    // requirement: ptr..ptr+capacity must be modifiable memory for the duration
    // of this function
    pub unsafe fn with<F>(ptr: *mut T, len: usize, capacity: usize, f: F) -> usize
        where F: for<'s> FnOnce(&'s mut UninitializedBuffer<'s, T>) -> TrustedLen<'s>
    {
        let mut this = UninitializedBuffer {
            ptr: Unique::new(ptr),
            capacity: capacity,
            len: len,
            sticky: PhantomData,
        };
        f(&mut this).get()
    }
    
    pub fn with_vec<F>(vec: &mut Vec<T>, f: F)
        where F: for<'s> FnOnce(&'s mut UninitializedBuffer<'s, T>) -> TrustedLen<'s>
    {
        unsafe {
            // if we want to allow *shrinking* our Vec, we need to zeroize the length on a panic
            // but ATM it can only lead to leaks.
            let len_ = Self::with(vec.as_mut_ptr(), vec.len(), vec.capacity(), f);
            vec.set_len(len_)
        }
    }
}

pub fn evil() {
    let mut v = Vec::with_capacity(16);
    UninitializedBuffer::with_vec(&mut v, |buf0| {
        buf0.push("hello".to_owned());
        let mut w : Vec<String> = Vec::with_capacity(16);
        UninitializedBuffer::with_vec(&mut w, |buf1| {
            buf0.trusted_len() //~ ERROR cannot infer an appropriate lifetime for autoref
        });
        panic!("game over - {:?}", w.get(0)) // or is it?
    });
}

fn main() {
    // but this isn't evil, so it works OK.
    let mut v = Vec::with_capacity(16);
    UninitializedBuffer::with_vec(&mut v, |buf0| {
        buf0.push("hello".to_owned());
        buf0.trusted_len()
    });
    println!("{:?}", v)
}

Actually, all that complexity isn’t needed - you can just use UninitializedBuffer by itself.

Isn’t it simply reading into an existing Vec?

It should be possible to have Buffer type that will be similar to Vec in having pre-allocated size and also the amount of bytes actually filled, but the difference would be that the slice is allocated on stack rather than on heap, and it doesn’t allow to change allocated amount.

Nope, because this is agnostic to the buffer's location - it can be a pre-existing Vecs allocation, a stack buffer, a field in a struct, or anything else.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.