`VecDeque`: `pop_front+push_back`

Phil15 · March 16, 2024, 9:47am

Hi, I come from this PR where we compute the tail of an iterator. I use a vector that I cyclically update but we thought that using a VecDeque would be as efficient except it was not (a benchmark was 4 times slower with VecDeque than with Vec: let v = (0..1024).collect_vec(); v.iter().tail(10)).

let data: VecDeque<_> = ...; // non empty
iter.for_each(|val| {
    data.pop_front();
    data.push_back(val);
});

The flamegraph of the benchmark said that nothing apart from push_back and pop_front mattered, and that most of the time was spent in push_back (and roughly 7 times less in pop_front). I tend to think that a new method could be more efficient than pop_front+push_back.

scottmcm suggested that I file an issue which lead me here. What do you think?

the8472 · March 16, 2024, 11:25am

You can probably amortize the cost of those operations by overfilling the vecdeque by some constant factor (or perhaps by a constant amount) and then trimming it back to the intended size every now and then.

Phil15 · March 16, 2024, 11:36am

In my case, use a vector works fine and efficiently without allocation malus. So I would prefer to keep it if use VecDeque is not as efficient. I'm not very familiar with VecDeque internals and maybe that pop_front+push_back can not be optimized better.

push_back does if self.is_full() { self.grow(); } which in my pop+push case is definitely not used. Decrement the length then increment it back does not seem very useful either. About buffer_read+buffer_write I just don't know.

sahnehaeubchen · March 16, 2024, 12:45pm

It is probably more optimizer friendly to provide a method that does not touch the len at all but fails if the len is 0. Something like

fn try_shift_replace_forward(&mut self, elem: T) -> Result<T, T> {
    if self.len == 0 {
        Err(elem)
    } else {
        unsafe {
            let old = self.ptr().add(self.head).read();
            self.ptr().add(self.to_physical_idx(self.len)).write(elem);
            self.head = self.wrap_add(self.head, 1);
            Ok(old)
        }
    }
}

Phil15 · March 16, 2024, 1:44pm

That seems close to pop_front+push_back inlined and simplified with what we know, and I agree it should fail when empty (in my case, I would unwrap).

I merely guess that it would a bit faster, but slower than what I do with my vector because what is read and what is wrote in the VecDeque are not at the same place in memory while what I do in my vector: replace an old value with a new one at a single place in the Vec memory.

I suppose we can't do better in general because of VecDeque structure? (which is not familiar to me.)

pitaj · March 16, 2024, 2:51pm

What about an ArrayDeque with a constant size that never allocates.

sahnehaeubchen · March 16, 2024, 2:54pm

For your usecase I would stick with the Vec and use an iterator

    assert_ne!(data.len(), 0);
    let rotate = 'r: loop {
        for (idx, place) in data.iter_mut().enumerate() {
            if let Some(val) =  iter.next() {
                *place = val;
            } else {
                break 'r idx
            }
        }
    };
    let mut data = VecDeque::from(data);
    data.rotate_left(rotate);

It might actually be faster if read and write are not at the same location because this means they can be done in parallel. This probably depends on the drop code of T and the actual distance. The location will be the same in the VecDeque case as well if len == capacity.

Did you try out @the8472 's suggestion? If you iterate between filling and cleaning the deque in chunks this could result in easier to optimize/predict code.

Phil15 · March 16, 2024, 3:39pm

I did not know about this ArrayDeque, nice to know. However, my case is arbitrary n: usize so it has to allocate. And add a dependency is something we don't really do.

Phil15 · March 16, 2024, 3:51pm

I would expect iter.fold(..) (or iter.for_each(..)) to be in general faster than repeatedly call iter.next() (it's a whole project for us).

I'm gonna try some things like overfilling the VecDeque a bit.

Aside from my use case, from what I know, VecDeque is a lot about pop_front and push_back. Do you think that such try_shift_replace_forward would be a valuable addition to VecDeque?

After benchmarking multiple changes (thanks for the suggestions), I'm gonna stick with our Vec+VecDeque version.

However, if VecDeque had a try_shift_replace_forward method someday, I would definitely give it a try and eventually use it once our MSRV is big enough.

If you don't think such method is worth it, feel free to close this.

parasyte · March 16, 2024, 7:20pm

Given only this use case, I think I would amortize the push/pop operations by doing them in bulk:

data.drain(..iter.len());
data.extend(iter);

This assumes iter: ExactSizeIterator.

scottmcm · March 16, 2024, 7:32pm

The situation in the OP cannot make that assumption.

If you're ExactSizeIterator, you can just advance_by over the things you don't need instead.

parasyte · March 16, 2024, 7:41pm

Thanks for pointing that out. Of course, the size can be captured with inspect if it's acceptable to advance_by after extending. But my main point is that the compiler ought to be able to better optimize the bulk operations than one-at-a-time. (I have not measured this, it's just a hunch.)

kornel · March 17, 2024, 2:04pm

You get much simpler/shorter code generated if you do:

      data.pop_front();
      if data.capacity() <= data.len() { return }
      data.push_back(val);

I think VecDeque needs the same kind of fix:

github.com/rust-lang/rust

Missed optimization opportunity when Vec::push follows Vec::pop

opened 03:56PM - 01 Aug 23 UTC

closed 11:00PM - 16 Oct 23 UTC

krtab

A-LLVM I-slow

The following code ```rust pub fn noop(v: &mut Vec<u8>) { if let Some(x…) = v.pop() { v.push(x) } } ``` should be optimized to a no-op afaict, but instead gives: ```asm example::noop: push rbp push rbx push rax mov rcx, qword ptr [rdi + 16] test rcx, rcx je .LBB2_4 mov rbx, rdi lea rsi, [rcx - 1] mov qword ptr [rdi + 16], rsi mov rax, qword ptr [rdi] movzx ebp, byte ptr [rax + rcx - 1] cmp rsi, qword ptr [rdi + 8] jne .LBB2_3 mov rdi, rbx call alloc::raw_vec::RawVec<T,A>::reserve_for_push mov rax, qword ptr [rbx] mov rsi, qword ptr [rbx + 16] .LBB2_3: mov byte ptr [rax + rsi], bpl inc rsi mov qword ptr [rbx + 16], rsi .LBB2_4: add rsp, 8 pop rbx pop rbp ret ``` (this is true on stable, beta and nightly). I think a blocking point here is that rustc/LLVM do not manage to infer the invariant: ```rust pub fn noop(v: &mut Vec<u8>) { // v.len <= v.capacity if let Some(x) = v.pop() { // v.len < v.capacity v.push(x) } } ``` Note that the version with added asserts for this invariant removes the useless conditional call to `reserve_for_push` but instead gets useless calls to assert failure code. ```rust pub fn noop(v: &mut Vec<u8>) { assert!(v.len() <= v.capacity()); if let Some(x) = v.pop() { assert!(v.len() < v.capacity()); v.push(x) } } ``` ```asm example::noop: push rax mov rcx, qword ptr [rdi + 8] mov rax, qword ptr [rdi + 16] cmp rax, rcx ja .LBB0_6 test rax, rax je .LBB0_4 lea rdx, [rax - 1] mov qword ptr [rdi + 16], rdx cmp rdx, rcx jae .LBB0_5 mov qword ptr [rdi + 16], rax .LBB0_4: pop rax ret .LBB0_6: lea rdi, [rip + .L__unnamed_1] lea rdx, [rip + .L__unnamed_2] mov esi, 41 call qword ptr [rip + core::panicking::panic@GOTPCREL] ud2 .LBB0_5: lea rdi, [rip + .L__unnamed_3] lea rdx, [rip + .L__unnamed_4] mov esi, 40 call qword ptr [rip + core::panicking::panic@GOTPCREL] ud2 .L__unnamed_3: .ascii "assertion failed: v.len() < v.capacity()" .L__unnamed_5: .ascii "/app/example.rs" .L__unnamed_4: .quad .L__unnamed_5 .asciz "\017\000\000\000\000\000\000\000\025\000\000\000\t\000\000" .L__unnamed_1: .ascii "assertion failed: v.len() <= v.capacity()" .L__unnamed_2: .quad .L__unnamed_5 .asciz "\017\000\000\000\000\000\000\000\023\000\000\000\005\000\000" ```

github.com/rust-lang/rust

Vec-growing logic emitted for Vecs which never grow

opened 09:41AM - 03 Dec 23 UTC

LingMan

T-compiler I-heavy T-libs

Consider these two examples: ```rust pub fn foo(sizes: &[u64]) -> u64 { l…et mut top = Vec::with_capacity(1); for &size in sizes { /*if top.capacity() <= top.len() { unsafe { std::hint::unreachable_unchecked() } }*/ top.push(size); top.pop(); } top.iter().sum() } ``` ```rust pub fn bar(sizes: &[u64]) -> u64 { const TOP_ELEMENTS: usize = 3; let mut top = Vec::with_capacity(TOP_ELEMENTS + 1); for &size in sizes { /*if top.capacity() <= top.len() { unsafe { std::hint::unreachable_unchecked() } }*/ let i = 0; //top.partition_point(|&cur| cur > size); top.insert(i, size); top.truncate(TOP_ELEMENTS); } top.iter().sum() } ``` Godbolt: https://rust.godbolt.org/z/PhrMr7dYr Neither are ever going to grow their Vec from the initial size. Yet, `RawVec::reserve_for_push` and friends are emitted when compiling. Uncommenting the `unreachable_unchecked` hints allows the optimizer to clean everything up. Might be asking a lot of the compiler but it would be great if it could figure that out by itself. Tested with Rust 1.74 and with 1.76.0-nightly (87e1447aa 2023-11-30).

CAD97 · March 17, 2024, 3:15pm

Does deque.rotate_left(1) help accomplish what you want / optimize better? Specifically, writing your OP snippet as:

let data: VecDeque<_> = ...; // non empty
iter.for_each(|mut val| {
    mem::swap(&mut val, &mut data[0]);
    data.rotate_left(1);
});

Obviously it's best if pop; push would optimize better, but rotate_left should hopefully sidestep whatever opt barrier is getting run into (doesn't need a potential grow).

Phil15 · March 17, 2024, 3:27pm

Indeed. The fastest full-VecDeque version I get of my tail algo is with

let mut data: VecDeque<_> = iter.by_ref().take(n).collect();
iter.for_each(|val| {
    // specific to my use case, small win but still:
    if data.is_empty() { unsafe { core::hint::unreachable_unchecked() } }
    data.pop_front();
    // best win here:
    if data.capacity() <= data.len() { unsafe { core::hint::unreachable_unchecked() } }
    data.push_back(val);
});

which is nearly as fast as my Vec version with manual index manipulation. So I think that the optimization you are talking about is the best thing we can do here.

@CAD97 I did try data[0] = val; data.rotate_left(1) in the past and it was not better. (Here, swap is even worse, which makes sense as I just drop the old value).

CAD97 · March 17, 2024, 3:48pm

Potential unsoundness note: for that snippet specifically, it's unsound unless iter is TrustedFused. If iter.next() yields [None, Some(_), ..], you'll hit the first unreachable_unchecked because data will be empty but iter.for_each will run more than zero times.

If the win is small (I expect the worst it'd be is a well-predicted branch), I'd recommend skipping that unsafe assertion as too fragile for the benefit. With or without, I'd still recommend an if !data.is_empty() check around the iter.for_each call, such that iter doesn't get used after exhaustion (which is a potential panic from nonfused iterators).

Phil15 · March 17, 2024, 3:50pm

I did not mention here that the iterator is fused (by me) so if data is empty then the iterator is too and the closure in for_each is not called.

Phil15 · March 21, 2024, 9:15am

From what @kornel pointed out, something like Add invariant to Vec::pop that len < cap if pop successful by krtab · Pull Request #114370 · rust-lang/rust · GitHub should probably be done for VecDeque::pop_front (and logically for VecDeque::pop_back as well).

So is there a volunteer?

I would propose myself if I understood how this is actually tested but I don't have a clue.

A4-Tacks · March 21, 2024, 9:39pm

The std::deque of C++ seems to be very effective in this situation, as we do not have to tolerate amortization due to implementation reasons But I haven't found its implementation on crates-io

Phil15 · March 26, 2024, 10:42am

...Well I volunteered, see Add invariant to VecDeque::pop_* that len < cap if pop successful by Philippe-Cholet · Pull Request #123089 · rust-lang/rust · GitHub

Topic		Replies	Views
`VecDeque`: allow `len == cap` to make full usage of memory in the `RawVec` libs	8	1854	June 21, 2022
Performance of Iterator::chain and std::iter::once libs	7	1973	June 2, 2019
Optimization for `std::vec::Drain` to directly access an indexed element compiler	6	338	October 25, 2024
VecDeque: binary search fns	3	817	December 25, 2020
[Pre-RFC] Retain Iterators API for HashSet/HashMap/Vec/VecDeque libs	13	1793	March 25, 2019

`VecDeque`: `pop_front+push_back`

Related topics