[Pre-RFC] Fixed-capacity view of Vec


#1

Hey everyone! Recently I’ve been looking for vulnerabilities in unsafe code in Rust and trying to rewrite unsafe code in popular crates in safe Rust without degrading performance. My goal with the rewrite attempts was not to fix the entire ecosystem single-handedly, but to identify missing safe abstractions, good or bad language practices, and so on.

This is a proposal I came up with after discovering vulnerabilities and trying to eliminate the relevant unsafe code in the png and claxon crates.

The problem

Following up on my post “Auditing popular crates: how a one-line unsafe has nearly ruined everything”, I’ve taken another stab at eliminating the last unsafe block in the png crate without degrading performance, and pretty much hit a wall. There seems to be no way to accomplish what it needs using safe Rust alone.

This is the code I would like to write (simplified compared to real code):

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_ideal(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill); // allocate required memory immediately, it's faster this way
    let slice_to_repeat = &buffer[buffer.len() - repeating_fragment_len..]; // figure out what to repeat
    buffer.extend(slice_to_repeat.iter().cycle().take(num_bytes_to_fill));
}

This encodes the intent very well, and can be easily rewritten to use vec.extend_from_slice() for optimal performance in case the compiler doesn’t figure it out.

Sadly, this doesn’t compile: we cannot append a slice of the vector to the vector itself using slices because appending to the vector might cause it to reallocate, invalidating the reference to our slice.

So in practice this was written instead:

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_working(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill); // allocate required memory immediately, it's faster this way
    for _ in 0..num_bytes_to_fill {
        // byte_to_copy variable is needed because buffer.push(buffer[i]) doesn't compile
        let byte_to_copy = buffer[buffer.len() - repeating_fragment_len];
        buffer.push(byte_to_copy);
    }
}

Except this was prohibitively slow (I’ve even tried to get this to go fast under various constraints on the values and failed). Zero-extending the vector up front so the memory would be initialized before use was also prohibitively slow, so it was rewritten using the unsafe function set_len() and operated on a slice of the vector that could contain uninitialized memory:

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_vulnerable(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill); // allocate required memory immediately, it's faster this way
    unsafe { // set length of the buffer up front so we can set elements in a slice instead of pushing
        buffer.set_len(buffer.len() + num_bytes_to_fill);
    }
    for i in (buffer.len() - num_bytes_to_fill)..buffer.len() {
        buffer[i] = buffer[i - repeating_fragment_len];
    }
}

This implementation has a security vulnerability: if you pass it repeating_fragment_len set to 0, it will expose contents of uninitialized memory in the output. And for a PNG decoder this is a big deal.

But not only is this implementation vulnerable, it is also slow! We’re still copying bytes one by one instead of doing a memcpy() as we know we could!

It is the worst of both worlds, and this is what was actually used in practice.

And while this example may seem a little contrived, predicting future data based on past input is a common pattern in decompressors of all kinds. Image and video decoders rely on it especially heavily, as do general-purpose decompressors such as gzip.

Making things possible (and efficient)

If we can guarantee that a vector would not reallocate while we append to it, we could implement the above function both safely and efficiently.

Let’s introduce a fixed-capacity view of Vec that’s identical to Vec in every way except it panics on out-of-capacity append instead of reallocating, and a function as_fixed_capacity() similar to as_slice() to obtain such a view. Then decode_rle_ideal above can be rewritten as follows:

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_fixedcap(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill); // allocate required memory immediately, it's faster this way
    {   // view buffer as fixed-capacity vector so we can append data to it while its contents are borrowed
        let mut fixed_cap_buffer = buffer.as_fixed_capacity();
        let slice_to_repeat = &fixed_cap_buffer[fixed_cap_buffer.len() - repeating_fragment_len..];
        fixed_cap_buffer.extend(slice_to_repeat.iter().cycle().take(num_bytes_to_fill));
    }
    // fixed_cap_buffer went out of scope, we can use buffer normally again
}

What’s more, it’s fairly trivial to rewrite it using a more explicit slice copying that should directly map into memcpy():

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_fixedcap_memcpy(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill); // allocate required memory immediately, it's faster this way
    {   // view buffer as fixed-capacity vector so we can append data to it while its contents are borrowed
        let mut fixed_cap_buffer = buffer.as_fixed_capacity();
        let slice_to_repeat = &fixed_cap_buffer[fixed_cap_buffer.len() - repeating_fragment_len..];
        for _ in 0..(num_bytes_to_fill / repeating_fragment_len) {
            fixed_cap_buffer.extend_from_slice(slice_to_repeat);
        }
        let remaining_bytes_to_fill = num_bytes_to_fill % repeating_fragment_len;
        fixed_cap_buffer.extend(slice_to_repeat.iter().take(remaining_bytes_to_fill));
    }
    // fixed_cap_buffer went out of scope, we can use buffer normally again
}

Not only is this entirely safe, it is also faster than the real-world unsafe implementation! Now we get the best of both worlds!

Making things convenient

A really common thing to do in code handling binary data - decoders especially - is to preallocate a certain amount of memory that’s known in advance and write data to it. The common idiom seems to be creating a vector with the desired capacity, setting vector length equal to its capacity and then passing around slices of it. And once you sit down and do some profiling, you may find that zeroing the memory you’re going to overwrite anyway adds roughly 20% to the total execution time, so you swap that vec.resize() for unsafe { vec.set_len() }, and that’s where it all goes south.

In fact, this is what has happened in Claxon. It worked fine over the course of normal operation, but when given a specially crafted file it included contents of uninitialized memory in the output. This issue has been assigned RUSTSEC-2018-0004, see also my original report and the author’s explanation of why this bug occurred.

I don’t know why people choose to pass around slices instead of appending to the vector. I can only state that it’s a very common thing to do in the multimedia decoding crates that I’ve looked at. However, after trying to rewrite Claxon to pass a vector around instead of slices of it, I can make an educated guess.

Consider this trivial function:

let mut buffer = Vec::with_capacity(block_size);
unsafe {buffer.set_len(block_size)};
// -- snip --
fn decode_verbatim<R: ReadBytes>(input: &mut Bitstream<R>, bps: u32, buffer: &mut [i32]) {
    for s in buffer {
        *s = extend_sign_u32(try!(input.read_leq_u32(bps)), bps);
    }
}

Rewriting it without either unsafety or performance overhead due to zeroing memory got me this:

let mut buffer = Vec::with_capacity(block_size);
// -- snip --
fn decode_verbatim<R: ReadBytes>(input: &mut Bitstream<R>, bps: u32, buffer: &mut Vec<i32>, bs: u16) {
    for _ in 0..bs {
        buffer.push(extend_sign_u32(try!(input.read_leq_u32(bps)), bps));
    }
}

This is safe, but worse for two reasons. First, it requires explicitly passing around an extra length parameter; this not only adds boilerplate, but also lets you mess up. In fact, in the real-world code this complicated things enough for me to get the math wrong somewhere, which is why the Vec-based rewrite doesn’t even work correctly. And second, unlike with the iterator over the slice, it is much harder for the optimizer to elide bounds checks in this case, which makes the safe version slower.

Could we make a better safe abstraction that would solve these problems? I believe we can, using fixed-capacity vectors.

If we could set capacity explicitly when creating a fixed-capacity view of a vector, we could pass it around just as conveniently as a slice without exposing uninitialized memory. We could even iterate over it without jumping through extra hoops to elide bounds checks if we create methods fill() and fill_with() analogous to extend() and resize_with() that never go beyond the fixed capacity. Then we’d get something like this:

let mut reusable_buffer = Vec::with_capacity(block_size);
let buffer = reusable_buffer.as_fixed_capacity();
// -- snip --
fn decode_verbatim<R: ReadBytes>(input: &mut Bitstream<R>, bps: u32, buffer: &mut FixedCapacityVec<i32>) {
    buffer.fill_with(|| extend_sign_u32(try!(input.read_leq_u32(bps)), bps));
}

Now this is safe and fast!
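As an aside, the intended semantics of the proposed fill_with can be sketched against today’s Vec. This is only an illustration of the behavior, not the proposed API itself: the free function and the explicit cap parameter stand in for the capacity that the real fixed-capacity view would carry internally.

```rust
// Sketch of the proposed `fill_with` semantics, expressed with today's Vec.
// A real FixedCapacityVec would take `cap` from the view itself, so callers
// could never overshoot the reserved capacity.
fn fill_with<T>(v: &mut Vec<T>, cap: usize, mut f: impl FnMut() -> T) {
    // Append values produced by `f` until the fixed capacity is reached.
    while v.len() < cap {
        v.push(f());
    }
}
```

Because the loop is bounded by the capacity rather than by an externally supplied length, there is no extra length parameter to get wrong.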

What’s more, it simplifies our decode_rle_fixedcap from earlier considerably:

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_fixedcap2(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    // view buffer as fixed-capacity vector so we can append data to it while its contents are borrowed
    let mut fixed_cap_buffer = buffer.as_fixed_capacity(buffer.len() + num_bytes_to_fill);
    let slice_to_repeat = &fixed_cap_buffer[fixed_cap_buffer.len() - repeating_fragment_len..];
    fixed_cap_buffer.fill(slice_to_repeat.iter().cycle());
}

And the same goes for the manually optimized function:

/// Repeats `repeating_fragment_len` bytes at the end of the buffer
/// until an additional `num_bytes_to_fill` in the buffer is filled
pub fn decode_rle_fixedcap2_memcpy(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    // view buffer as fixed-capacity vector so we can append data to it while its contents are borrowed
    let mut fixed_cap_buffer = buffer.as_fixed_capacity(buffer.len() + num_bytes_to_fill);
    let slice_to_repeat = &fixed_cap_buffer[fixed_cap_buffer.len() - repeating_fragment_len..];
    for _ in 0..(num_bytes_to_fill / repeating_fragment_len) {
        fixed_cap_buffer.extend_from_slice(slice_to_repeat);
    }
    fixed_cap_buffer.fill(slice_to_repeat.iter());
}

Concerns, caveats, etc

So far I’ve been focusing on binary formats and multimedia decoding because that’s where memory management bugs are most common and have the highest impact. This may not be relevant in other domains, but I still consider memory safety in this domain critical.

So far I have only looked in-depth into crates with actual vulnerabilities. So while this is based on real-world failures, I still only have two samples and this might not generalize well. If you have use cases for something like this, please speak up.

All names are subject to bikeshedding.

References to production code discussed: vulnerable function from png crate, my attempt to rewrite Claxon safely

Also, I’d like to note that memory disclosure vulnerabilities are probably under-discovered in Rust code compared to use-after-free or buffer overflows because the only tool that can discover them, Memory Sanitizer, currently does not work with anything that uses Rust standard library. I literally had to write my own tool to discover the bug in Claxon, I’ll probably make a blog post announcing it soon.

At this point I’m looking for feedback on this proposal. Does it sound reasonable? Do you have use cases for this thing? Do you see a better way to solve this problem? Let me know!


Proposal: Security Working Group
#2

Wow! This is truly interesting and exciting work! This is something that needs to be studied and documented in the “Unsafe Code Guidelines” I would think.


#3

More like “How to avoid unsafe code” guidelines. And perhaps clippy warnings.

I do believe a coordinated effort for providing good enough safe abstractions so that unsafe code can be eliminated from most crates is required. I was hoping the security WG proposal would result in something like that, but turns out the term “security” is too broad. So here I am, trying to blaze the trail and lead by example.


#4

I think this is very cool, and very clear. Thanks for writing this up!


#5

Have you already investigated other parts of the ecosystem? So far I’m thinking of std::io::Cursor, Pin<Vec> and maybe something from the bytes crate (like BytesMut).


#6

Note that we have something very similar to this: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.repeat, which is even more efficient than cycle-take because it uses doubling to copy larger and larger chunks.

That is, however, just on slices, producing a new Vec each time. Perhaps there’s a way to expose it on &mut Vec somehow that could be useful for this?


#7

I’m not very familiar with the broader Rust ecosystem, so thanks a lot for the suggestions!

As far as I know, std::io::Cursor provides no benefits over passing around a plain Vec.

Pin<Vec> sounds like something that could potentially serve the RLE use case, but I’m not sure how to apply it. Is it possible to pin the backing buffer of the vector at all? I haven’t found a way to do that. Perhaps a non-reallocating view of a vector might be useful for the use cases that require pinning as well?

BytesMut from bytes crate seems to support both of these use cases:

split_to() allows holding a slice of already written memory and appending to the buffer at the same time. Hooray!

BytesMut is pretty good on ergonomics and as far as the API goes I could see it replacing slices in Claxon too. Its behavior is actually pretty close to my proposal. Methods such as fill() or fill_with() are missing, but should not be hard to implement.

However, it has significant runtime complexity:

  • It has an inlining optimization to avoid allocation, which is detrimental to performance in the general case, as the Vec documentation notes and SmallVec benchmarks confirm
  • Its types are reference-counted, which is not needed in this context and introduces runtime overhead. To make matters worse, they’re reference-counted atomically.

That sounds like a deal-breaker. Someone will come along and rewrite it into unsafe code to get rid of Arc overhead, and we’ll be back at square one.


#8

Oooh, that’s clever!

Yeah, if that worked on &mut Vec it would help with performance even more. But it would still require a non-reallocating view of Vec to pass the borrow checker.


#9

Per the latest Pin discussion, I believe the form of pin we’re going with is Pin<P> where P is some sort of pointer/ref. By this definition of Pin, a Pin<Vec<_>> would mean that the slice that it derefs to does not move.

This would mean that a Pin<Vec<_>> plus some stdlib impl blocks would be a Vec that’s not allowed to reallocate (without those impl blocks and unsafe code, you can only get a shared ref out, so you would just be unable to push/pop).

The only gotcha is that going back from a Pin<Vec<_>> to a Vec<_> is illegal (Pin means pinned forever), so this use case would still want a unique type (or some sort of reasoning around temporary pinning). All in all I think the use of Pin for this is more complicated than it should be and a view that prevents reallocation would be better than reusing Pin for this.


#10

Once we have &out/&in references (or whatever we want to call them) we could have an IndexIn trait, analogous to Index and IndexMut, which would be used when someone wants to index into an uninitialized part of a value. Vec could then implement IndexIn and allow you to index into the entire capacity of the Vec, not just the length, eg.

buffer.reserve(num_bytes_to_fill);
let space = &in buffer[buffer.len()..buffer.capacity()];

However, to solve OP’s problem while appeasing the borrow checker, Vec would need to provide a method to borrow both the used capacity and the unused capacity in a single call (as a &mut [T] and an &in [T] respectively).

pub fn decode_rle_ideal(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    buffer.reserve(num_bytes_to_fill);
    let len = buffer.len();
    let (data_mut, mut data_in) = buffer.borrow_capacity(num_bytes_to_fill);
    let slice_to_repeat = &data_mut[len-repeating_fragment_len..]; // figure out what to repeat
    for chunk_in in data_in.chunks_in(repeating_fragment_len) {
        chunk_in.clone_from_slice(&slice_to_repeat[..chunk_in.len()]);
    }
}

#11

I don’t think it necessarily needs something that complex. (Sure, in general it might, but that’s hard to design, and we can reasonably make something simpler.)

I was just picturing an addition on Vec in stdalloc like

impl<T> Vec<T> {
    fn repeat_tail(&mut self, r: RangeFrom<usize>, n: usize) where T: Copy;
}

so you’d have things like

    let mut v = vec![1, 2, 3, 4, 5];
    v.repeat_tail(3.., 2);
    assert_eq!(v, vec![1, 2, 3, 4, 5, 4, 5, 4, 5]);
    let mut v = vec![1, 2, 3, 4, 5];
    v.repeat_tail(4.., 6);
    assert_eq!(v, vec![1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5]);
    let mut v = vec![1, 2, 3, 4, 5];
    v.repeat_tail(0.., 1);
    assert_eq!(v, vec![1, 2, 3, 4, 5, 1, 2, 3, 4, 5]);

in terms of which one could implement slice::repeat, and seems plausible for RLE as well, assuming one doesn’t repeat a non-integer number of times.
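The intended semantics can be sketched as a free function in today’s safe Rust. Vec::extend_from_within (which only became available in later Rust, 1.53) happens to express exactly the needed non-reallocating self-append, so it serves here purely to illustrate the behavior:

```rust
// Sketch of `repeat_tail` semantics as a free function (the name is the
// one proposed above; a real impl would be an inherent method on Vec).
fn repeat_tail<T: Copy>(v: &mut Vec<T>, r: std::ops::RangeFrom<usize>, n: usize) {
    let (start, end) = (r.start, v.len());
    v.reserve((end - start) * n); // single up-front allocation
    for _ in 0..n {
        // `start..end` keeps referring to the original tail even as `v` grows
        v.extend_from_within(start..end);
    }
}
```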


#12

In that post there is still talk about Unpin. Why wouldn’t Pin<Vec> also implement Unpin?


#13

Unpin would be based on the T in Pin<Vec<T>>. I’ll admit that Unpin is the part of the Pin guarantee I understand the least, so I’m not really the one to answer that. It might work though.


#14

Why was that written instead? Don’t you have an upper bound on how large num_bytes_to_fill can actually be? You could just use an ArrayVec (or just an array…) on the stack (or on the heap, if the repeating byte patterns are of unbounded length), where you copy the pattern first and then just extend the vector from that.


#15

I’m not entirely sure why it was written like that because it wasn’t me who wrote it.

In this particular case the upper bound on the slice length is u16::MAX bytes, so about 64 KiB. So stack-based allocation or SmallVec is probably not a great option, but perhaps copying the slice in question into a preallocated buffer and reusing the allocation between invocations might help.

In real-world code it’s not going to be pretty, but probably workable for the RLE use case. Doesn’t do anything about the “people keep passing around slices of uninitialized memory” problem, though.
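Concretely, that workaround might look roughly like this. The function name is illustrative, and for simplicity this sketch allocates a fresh scratch buffer on every call rather than reusing one between invocations:

```rust
/// Safe workaround: copy the repeating fragment into a scratch Vec first,
/// then append it in whole chunks plus a final partial chunk.
fn decode_rle_scratch(buffer: &mut Vec<u8>, repeating_fragment_len: usize, num_bytes_to_fill: usize) {
    assert!(repeating_fragment_len > 0, "a zero-length fragment cannot fill anything");
    // The copy sidesteps the borrow-checker conflict at the cost of an allocation
    let fragment = buffer[buffer.len() - repeating_fragment_len..].to_vec();
    buffer.reserve(num_bytes_to_fill);
    let mut remaining = num_bytes_to_fill;
    while remaining >= fragment.len() {
        buffer.extend_from_slice(&fragment); // maps to memcpy
        remaining -= fragment.len();
    }
    buffer.extend_from_slice(&fragment[..remaining]);
}
```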


#16

This would solve the RLE use case in the fastest possible way. However, why repeat the tail, and not an arbitrary subslice?


#17

A SmallVec might work pretty well here if most of your allocations are smaller in practice than u16::MAX and the upper bound for most allocations can fit on the stack.


#18

I’ve ported lodepng from C to Rust, and ran into this problem several times. Not just for repetition, but in all kinds of places. “Put the data here” is a typical C design pattern.

Small nit about the proposed solution: I’m not keen on needing a separate .reserve() call to set the capacity beforehand. The code that reserves and the code that fills may be far apart, so I’d be anxious about reserving enough space ahead of time to keep later .as_fixed_capacity() calls from panicking or failing to fill enough data. I’d rather see a .with_fixed_capacity(size) that ensures capacity and returns an accessor for it in one call.


#19

I’d expect the C code to pass the expected length explicitly alongside the buffer. Or was it using a sizeof trick? What were the issues with refactoring that into a Vec?


#20

The problem is that “Put the data here” requires a pointer to uninitialized memory, which is a no-no in Rust:

char *buf = malloc(size);
write_it_here(buf);

The size is not always passed explicitly, sometimes it’s supposed to be known (e.g. in case of the png library it was implied by the dimensions and color mode of the PNG file, or fixed sizes of chunks as defined in the PNG spec).

The Rusty solution is to flip it around and return the buffer:

let buf = get_it();

or take an empty buffer and push:

append_it(&mut vec);

However, the pushing solution doesn’t always work, e.g. if the data is written out of order, or because of the borrow-checker problem mentioned earlier.