`size_hint` for `std::io::Read`

Proposing a std::io::Read::size_hint, similar to the one on Iterator:

trait Read {
    // ...
    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None)
    }
}

Allocation

size_hint would allow for better optimized pre-allocation for memory to read the entire Reader into. For example, std::io::read_to_string is currently implemented as :

pub fn read_to_string<R: Read>(reader: &mut R) -> Result<String> {
    let mut buf = String::new();
    reader.read_to_string(&mut buf)?;
    Ok(buf)
}

but this may require reallocating multiple times. size_hint could at least reduce this

// ..
let mut buf = String::with_capacity(reader.size_hint().0);
// ..

or even

// ..
let size_hint = reader.size_hint();
let size = size_hint.1.unwrap_or(size_hint.0);
let mut buf = String::with_capacity(size);
// ..

though this option may allocate too much space, which could be undesirable


Bytes::size_hint

<std::io::Bytes as Iterator>::size_hint currently operates by "magic", using an internal SizeHint trait that only implements a specific size for &[u8].

Having a size_hint method on Read would allow Bytes::size_hint to have a more obvious implementation, and allow use on more types. BufReader could return the minimum as the amount currently in its buffer, for example.


Thoughts?

4 Likes

I could see where a different name might be good, I'm just not sure what.

Note that size_hint is kindof a bad API, because the upper limit is basically never used.

I would suggest something much simpler like fn collect_hint(&self) -> usize; -- especially because that's that way it's not a lower-bound, so it's ok for it to slightly overestimate.

7 Likes

The problem with slight overestimations is that they add up: if my reader only returns up to one redundant byte, then a collection of 1,000 readers will allocate a spare kilobyte.

That's more a problem with iterators than with readers, however, because you don't compose readers usually.

I like that idea. It leaves a bit of a question as to what the upper limit of Bytes::size should return, but if it doesn't matter as you say (and I would agree) then I'd suggest having it return None (or keep the magic impl, but that seems strange to me)

In terms of unused allocated capacity, that's still probably better than having Vec/String organically grow their allocation.

Prior art: A few months ago when I was working Optimize File::read_to_end and read_to_string by jkugelman · Pull Request #89582 · rust-lang/rust · GitHub, I toyed with adding a Read::remaining_hint trait method. For much of the same reasons you've outlined I figured it would be a more elegant way to optimize not just a couple of specific readers (namely, File and BufReader), but any readers that wish to provide hints.

I didn't end up submitting it as I didn't have the energy/confidence to push for a libs-api addition. Still, here is what I came up with:

Returns the bounds on the number of bytes remaining to be read.

Specifically, remaining_hint() returns a tuple where the first element is the lower bound, and the second element is the upper bound.

The second half of the tuple that is returned is an Option<usize>. A None here means that either there is no known upper bound, or the upper bound is larger than u64.

Implementation notes

It is not enforced that a reader implementation yields the declared number of bytes. A reader may yield less than the lower bound or more than the upper bound of bytes. A file could be truncated or appended to after calculating its size, for instance.

remaining_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the stream, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of remaining_hint() should not lead to memory safety violations.

That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol.

The default implementation returns (0, None) which is correct for any reader.

pub trait Read {
    #[unstable(feature = "remaining_hint", issue = "none")]
    fn remaining_hint(&self) -> (u64, Option<u64>) {
        (0, None)
    }
}

And here is the full commit, which includes a couple of implementations on std library types.

2 Likes

But if it can't overestimate at all -- like size_hint().0 -- then something like a decompressor will end up having to give a hint that's never big enough to avoid a reallocation.

1 Like

I really like the name! And yeah, u64 is better here. But I would agree with @scottmcm that simply returning the lower bound might be better

What about requiring that it be a correct minimum at the time of calling, but can change later?

Edit: Reading it again, that might have been what you meant. Was it?

Oh, that's an interesting issue. What about a tuple of (min, likely), instead of just min or (min, max)? Not sure what would exactly constitute "likely", though.

Eh. With alignments of allocations from the typical allocators, these bytes only actually add up in usable memory when you're already on a allocation boundary (say 8 or 16 bytes), so effective usage is probably on the order of 6-12% of that "extra" kilobyte in practice (yes, alignments overall probably are some kind of Benford's Law in how their % 8 sizes land, but I'd expect File sizes to be way more even than data structures).

Note that this is bad for File when reading contents out of /sys because the "sizes" returned by stat are complete fabrications. Most everything is either 0 or 4096 (one page). Even "symlinks" have 0 size despite their size being "more known". Things like sys/dev/block/*/stat have a stat size of 4096 but cat stat | wc -c is 153 (in one instance I have here).

I know /sys is not a land of "files", but making them dangerous to use with the standard library doesn't sound good to me (there have been DoS vulns where read loops expected st_size to be trustworthy and it just stalled out despite the EOF state having been reached), but this is just not the case in such "special" places.

I tried essentially min(hint.0.saturating_mul(2), hint.1) for Iterators back in Try to guess a smarter initial capacity in Vec::from_iter by scottmcm · Pull Request #53086 · rust-lang/rust · GitHub, but it wasn't an obvious win.

Maybe it could be better here, but it's also unclear to me how a Read would normally know a maximum other than infinity or that's the same as the min. Like a TcpStream or a DecompressStream can't really give a meaningful max.

Can you elaborate on where you'd use the extra size here on a 32-bit platform?

What can you do with that fact? Is it even possible to tell whether it was wrong? Is there any way you could write safe code which would misbehave if it wasn't the "minimum at time of calling", since it could change immediately after calling anyway?


Basically, my meta-point here is that if this is for

better optimized pre-allocation for memory

then perhaps it should just be reserve_hint(&self) -> usize, with no requirements other than "should return something reasonably useful". Especially since, unlike for iterators, TOCTOU is an issue for many common Reads and thus it can rarely be relied upon precisely.

(And, for example, I think it would be perfectly fine for it to return a size rounded up to a sector multiple -- at least for multi-sector files -- rather than potentially needing to use a more precise API.)

3 Likes

This behavior's already in the standard library.

Can you elaborate on what the danger is? What are these DoS vulnerabilities you mention? Is it because somebody relied on the size being accurate instead of treating it as a "hint"?

I can't think of an example either.

I was thinking about the fact that Seek deals in u64s, but seeing as Read deals in usize, that may indeed make more sense. I really gotta think more before I speak...

I mean that if a file returns the amount of bytes left until the end as the hint, and the file was subsequently shortened, that wouldn't be a violation of the trait

That makes sense

Either one makes sense to me. The standard library uses both.

  • Read returns usize when it's reading into a buffer since a buffer can obviously never hold more than usize::MAX bytes.
  • Other methods like seek, take, and Metadata::len use u64.

I previously thought u64 made the most sense, but given that the intended use case for this hint is to preallocate memory I see that usize would be more convenient. Callers would have to deal with overflow if it were u64.

Yep, it is.

1 Like

On the other hand, this hint will always be misleading for files larger than what usize can represent. I agree that this is fine since we're dealing with allocation, but we have to make sure to communicate that this is the sole purpose of the function, and that it shouldn't be relied on any other thing. (i.e. it shouldn't be used to calculate progress percentage).

2 Likes

Yes. I searched for it, but wasn't able to find it right now. I'll poke around later again.

Edit: OK, just came across it again. It was this subthread where grousing about userspace doing silly things was discussed and it was brought up that anything trusting the size is 1) a TOCTOU watiing to happen and 2) not considering the shared mutable-ness of filesystems. There wasn't actually a DoS in the wild, just silly userspace behaviors.