[Pre-RFC] Retain Iterators API for HashSet/HashMap/Vec/VecDeque

Edit: Document moved to gist

1 Like

I think DrainIf::next has to be a bit more complicated, as it has to maintain the map’s invariants. pop_internal, which is called by remove, shifts elements to account for this. Drain has no such problem, because the map is completely emptied regardless of whether the iterator is consumed.

Relevant is https://github.com/rust-lang/rust/issues/33549.

1 Like

New revision with my first attempt at an algorithm to fix up displacement here

I’ve noticed this API is most similar to the retain function, so I’ve updated the scope of this RFC to include modifying those APIs to return iterators as well.

For HashMap & HashSet

Does retain make tables vulnerable to bad performance, or to O(n²) blowup? In some complex cases, I think it is. There's a reason why drain doesn't let you leave any entries in the map. Here's an example.

let mut map: HashMap<u64, _> = (0..900000).map(|i| (i, ())).collect();
let mut counter = 0;
map.retain(|_| {
  counter += 1;
  /* keeps entry if */ counter & (1024 - 1) == 0
});
map.shrink_to_fit();
// now, operations on the map are O(n²).

See Exposure of HashMap iteration order allows for O(n²) blowup. · Issue #36481 · rust-lang/rust · GitHub to read more.

As I understand it, the minimization of displacement is used to remove multiple entries in one sweep, so that each entry we want to back-shift is moved only once instead of some number of times by one position.

For the implementation of Iterator, the simplest option just uses shift after every removal. However this option is not very performant. Worst case, requires O(n²) shifts.

How do you know it's O(n²) shifts? Hash tables are expected to do better than that. Maybe you meant to say O(n log n).

I suggest two improvements. First, for the current hash table implementation, make Retain that starts at the end of the last run in the table, and iterates backwards. This way, you don't need any back-shifting or fancy algorithms. The change in the iteration order doesn't matter, I think.

Second, are you familiar with GitHub - bluss/indexmap: A hash table with consistent order and fast iteration; access items by key or sequence index? Its iteration order is unrelated to hashing, so it has no way to abuse retain. However, the improvement for back-shifting can't be used in ordermap. Perhaps you still don't need the trick for displacement minimization. The ordermap uses entries with 8 bytes to store both the hash and an index to another array. If you back-shift elements of 8 bytes, copying them is always cheap.

invarient

typo: invariant

For Vec & VecDeque

Changing the & to &mut is a tiny change. It's a breaking change nonetheless. I think adding retain_mut is the safer choice.

retain is so similar to drain that they could be confused.

What to do in case the Retain iterator is leaked? The comments on drain say

When the Drain is first created, it shortens the length of the source vector to make sure no uninitalized or moved-from elements are accessible at all if the Drain's destructor never gets to run.

###For Vec & VecDeque They are indeed similar, but drain is index focused, whereas retain is value focused. Right now it just seems a waste that drain callers can get the removed values directly, but retain callers have to clone them on their way out.

What to do in case the Retain iterator is leaked? The comments on drain say

I'll make this more clear in the RFC, but the proposed function will be modifying length each iteration instead of accumulating in del, so the same guarantee holds true.

This does bring up an interesting point, doesn't forgeting Drain currently leave HashMaps in a broken state? Turns out, yes, yes it does. I'll file an issue for this.

###For HashMap & HashSet Wow, I can't believe I missed that reversing iteration order pretty much eliminates the O(n²) shifts. Given my proposed algorithm isn't forget resistant, I pretty much have to go with yours (Drain will also probably have to change to something similar). It's certainly a lot simpler, although it's still subject to the attack you've described.

Speaking of which, iteration order could be somewhat randomized. For example, instead of single pointer to the last full bucket, we could track 2 (or perhaps more) from separate runs (leap-frogging as they hit empty buckets so as to remain separate), and randomly switch between them. I'm not sure if that's truly random enough to mitigate the issue.

How do you know it's O(n²) shifts? Hash tables are expected to do better than that. Maybe you meant to say O(n log n).

Just to clarify, the worst case assumes a hash table with 100% collisions and nothing being retained. With 5 items (a, b, c, d, e) that all want to be in bucket 0, if we remove them using pop_internal, shift will be called 4 times for the removal of a, then 3 for the removal of b, 2 for c, 1 for d. That's (n-1)+(n-2)+...+2+1 or n*(n-1)/2, which is O(n²) .

Just realized that what I said about retain for Vec was false with the current RFC. I will update the algorithm to make it true.

Updated. And I finally had a look at retain for OrderMap, and essentially no work is necessary to change it to return an iterator thanks to swap_remove_index.

Thanks for working on this. I don’t think changing the return type of Vec::retain is viable, and it’s not the kind of type change that has an exception in the API evolution guidelines for breaking changes. (I used github code search to find an example of code that would break: (1) (that last line has no semicolon)).

Hmmm… technically RFC 1105 doesn’t mention changing return types, but that’s probably just the pedant in me talking :stuck_out_tongue_winking_eye:.

I’m fine with going back to drain_if, if it’s not viable to merge with retain

You're right, changing the return type could be allowed under those rules, since they require code to be able to be fixed up front (just add ; or { ... ; } to the code), and that the fix is local.

If return types are not mentioned, there's this:

Breaking changes are assumed to be major changes unless otherwise stated.

With some more coercion / subtyping inprovements, this wouldn’t need to be a breaking change. Functions are already contravariant over their arguments, so you might as well extend that to closures and we could just fix Vec::retain.

Oh wait, you guys are talking about its return type. I’m talking about its parameters.

Well… we could generalize it, which is a minor change. fn retain<F, R=()> and put some dummy constraint like R: RetainIteratorFromVec which () and Retain will implement.

I’m not sure if I’m helping or making things worse…

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.