Hi, I come from this PR, where we compute the tail of an iterator.
I use a vector that I cyclically update, but we thought that using a VecDeque would be just as efficient. It was not: a benchmark (let v = (0..1024).collect_vec(); v.iter().tail(10)) was 4 times slower with VecDeque than with Vec.
let mut data: VecDeque<_> = ...; // non-empty
iter.for_each(|val| {
    data.pop_front();
    data.push_back(val);
});
The flamegraph of the benchmark showed that nothing apart from push_back and pop_front mattered, and that most of the time was spent in push_back (with roughly 7 times less in pop_front).
I tend to think that a new method could be more efficient than pop_front+push_back.
scottmcm suggested that I file an issue, which led me here.
What do you think?
You can probably amortize the cost of those operations by overfilling the vecdeque by some constant factor (or perhaps by a constant amount) and then trimming it back to the intended size every now and then.
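A minimal sketch of that amortized approach, under my own assumptions about the details (the function name, the factor of 2, and the trim points are mine, not part of the suggestion):

```rust
use std::collections::VecDeque;

// Sketch: keep the last `n` elements of an iterator, but let the deque
// overfill up to `2 * n` before trimming back, so the per-item cost is
// a plain push plus an occasional bulk drain.
fn tail<I: Iterator>(iter: I, n: usize) -> VecDeque<I::Item> {
    let mut data = VecDeque::with_capacity(2 * n);
    for val in iter {
        if data.len() == 2 * n {
            // trim back to the intended size every now and then
            data.drain(..n);
        }
        data.push_back(val);
    }
    // final trim: keep only the last `n` elements
    let excess = data.len().saturating_sub(n);
    data.drain(..excess);
    data
}
```

The bulk drain touches each slot once, so the average work per element stays constant while the hot loop avoids the pop_front entirely.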
In my case, using a vector works fine and efficiently, without an allocation penalty. So I would prefer to keep it if using a VecDeque is not as efficient. I'm not very familiar with VecDeque internals, and maybe pop_front+push_back simply cannot be optimized better.
push_back does if self.is_full() { self.grow(); }, which in my pop+push case is definitely never taken. Decrementing the length and then incrementing it back does not seem very useful either. As for buffer_read+buffer_write, I just don't know.
That seems close to pop_front+push_back inlined and simplified with what we know, and I agree it should fail when empty (in my case, I would unwrap).
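A hedged sketch of what such a combined method could look like, written here as a free function (the try_shift_replace_forward name and semantics come from this thread's proposal, not from any std API):

```rust
use std::collections::VecDeque;

// Sketch: pop the front element and push `val` at the back in one
// call, failing (returning None, and dropping `val`) when the deque is
// empty. Because the length is restored to what it was, the grow check
// inside push_back can never fire here.
fn try_shift_replace_forward<T>(data: &mut VecDeque<T>, val: T) -> Option<T> {
    let old = data.pop_front()?; // fails when empty
    data.push_back(val); // cannot grow: len was just decremented
    Some(old)
}
```

A std implementation could go further and write the new value directly over the old slot's ring-buffer position when len == capacity, which is the case this thread is about.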
I merely guess that it would be a bit faster, but still slower than what I do with my vector, because what is read and what is written in the VecDeque are not at the same place in memory, while in my vector I replace an old value with a new one at a single place in memory.
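For reference, a hedged reconstruction of the Vec version described above (the function name and exact structure are my guesses from the description): each element overwrites a single slot, with an index cycling through the buffer, and one rotate at the end restores order.

```rust
// Sketch: keep the last `n` items of an iterator by overwriting one
// slot per element in a plain Vec used as a ring buffer.
fn tail_vec<I: Iterator>(mut iter: I, n: usize) -> Vec<I::Item> {
    if n == 0 {
        return Vec::new();
    }
    let mut data: Vec<_> = iter.by_ref().take(n).collect();
    if data.len() == n {
        // the buffer is full; cycle an index and overwrite in place
        let mut idx = 0;
        iter.for_each(|val| {
            data[idx] = val; // single in-place write, no pop/push pair
            idx = (idx + 1) % n;
        });
        // `idx` now points at the oldest element: rotate it to the front
        data.rotate_left(idx);
    }
    data
}
```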
I suppose we can't do better in general because of the VecDeque structure (which I'm not familiar with)?
For your use case I would stick with the Vec and use an iterator:
assert_ne!(data.len(), 0);
let rotate = 'r: loop {
    for (idx, place) in data.iter_mut().enumerate() {
        if let Some(val) = iter.next() {
            *place = val;
        } else {
            break 'r idx;
        }
    }
};
let mut data = VecDeque::from(data);
data.rotate_left(rotate);
It might actually be faster if read and write are not at the same location because this means they can be done in parallel. This probably depends on the drop code of T and the actual distance. The location will be the same in the VecDeque case as well if len == capacity.
Did you try out @the8472 's suggestion? If you iterate between filling and cleaning the deque in chunks this could result in easier to optimize/predict code.
I did not know about this ArrayDeque, nice to know. However, my case is an arbitrary n: usize, so it has to allocate. And adding a dependency is something we don't really do.
I would expect iter.fold(..) (or iter.for_each(..)) to be in general faster than repeatedly calling iter.next() (it's a whole project for us).
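A tiny illustration of the difference between the two styles (the function names are mine, chosen for the example):

```rust
// External iteration: the caller drives the loop with repeated next()
// calls, one dynamic decision per element.
fn sum_external(mut iter: impl Iterator<Item = u32>) -> u32 {
    let mut acc = 0;
    while let Some(x) = iter.next() {
        acc += x;
    }
    acc
}

// Internal iteration: fold hands the whole loop to the iterator, which
// adapters like Chain can specialize into tight sub-loops.
fn sum_internal(iter: impl Iterator<Item = u32>) -> u32 {
    iter.fold(0, |acc, x| acc + x)
}
```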
I'm gonna try some things like overfilling the VecDeque a bit.
Aside from my use case, from what I know, VecDeque is a lot about pop_front and push_back. Do you think that such try_shift_replace_forward would be a valuable addition to VecDeque?
After benchmarking multiple changes (thanks for the suggestions), I'm gonna stick with our Vec+VecDeque version.
However, if VecDeque had a try_shift_replace_forward method someday, I would definitely give it a try and eventually use it once our MSRV is high enough.
If you don't think such method is worth it, feel free to close this.
Thanks for pointing that out. Of course, the size can be captured with inspect if it's acceptable to advance_by after extending. But my main point is that the compiler ought to be able to better optimize the bulk operations than one-at-a-time. (I have not measured this, it's just a hunch.)
Does deque.rotate_left(1) help accomplish what you want / optimize better? Specifically, writing your OP snippet as:
let mut data: VecDeque<_> = ...; // non-empty
iter.for_each(|mut val| {
    mem::swap(&mut val, &mut data[0]);
    data.rotate_left(1);
});
Obviously it's best if pop; push would optimize better, but rotate_left should hopefully sidestep whatever opt barrier is getting run into (doesn't need a potential grow).
Indeed. The fastest full-VecDeque version I get of my tail algo is with
let mut data: VecDeque<_> = iter.by_ref().take(n).collect();
iter.for_each(|val| {
    // specific to my use case, small win but still:
    if data.is_empty() { unsafe { core::hint::unreachable_unchecked() } }
    data.pop_front();
    // best win here:
    if data.capacity() <= data.len() { unsafe { core::hint::unreachable_unchecked() } }
    data.push_back(val);
});
which is nearly as fast as my Vec version with manual index manipulation. So I think that the optimization you are talking about is the best thing we can do here.
@CAD97 I did try data[0] = val; data.rotate_left(1) in the past and it was not better. (Here, swap is even worse, which makes sense as I just drop the old value).
Potential unsoundness note: for that snippet specifically, it's unsound unless iter is TrustedFused. If iter.next() yields [None, Some(_), ..], you'll hit the first unreachable_unchecked because data will be empty but iter.for_each will run more than zero times.
If the win is small (I expect the worst it'd be is a well-predicted branch), I'd recommend skipping that unsafe assertion as too fragile for the benefit. With or without, I'd still recommend an if !data.is_empty() check around the iter.for_each call, such that iter doesn't get used after exhaustion (which is a potential panic from nonfused iterators).
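A hedged sketch of the snippet rewritten per this advice (the function name is mine): a plain emptiness check around the loop replaces the unsafe assertion and keeps iter untouched once exhausted.

```rust
use std::collections::VecDeque;

// Sketch: keep the last `n` elements of `iter` without unsafe hints.
fn tail_checked<I: Iterator>(mut iter: I, n: usize) -> VecDeque<I::Item> {
    let mut data: VecDeque<_> = iter.by_ref().take(n).collect();
    // guard: if nothing was collected, never touch `iter` again, so
    // nonfused iterators can't be resumed after exhaustion, and the
    // pop/push pair below is trivially sound (the deque stays non-empty)
    if !data.is_empty() {
        iter.for_each(|val| {
            data.pop_front();
            data.push_back(val);
        });
    }
    data
}
```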
C++'s std::deque seems to be very effective in this situation, since it does not have to pay amortization costs for implementation reasons. But I haven't found an equivalent implementation on crates.io.