Maps, Sets, and the value of Keys


#1

I’ve had this conversation a lot lately, so I’m just going to post this here and hope we can settle it once and for all.

So currently, Maps treat keys like trash. These are the two main MutableMap fns:

    fn swap(&mut self, k: K, v: V) -> Option<V>;
    fn pop(&mut self, k: &K) -> Option<V>;

swap takes a key by value, and in the case of a key-match returns a value, and just drops one of the keys. pop meanwhile removes a key, but never gives it back. The search methods are a similar story, they only yield values. As a consequence, Sets (which are just a Map<T, ()>) can’t tell you at all about what’s happening inside them:

    fn insert(&mut self, value: T) -> bool;
    fn remove(&mut self, value: &T) -> bool;

They just know if something was there, and nothing else.

Now, on one hand, this all makes sense. The current convention in Rust is that if two objects are Eq, they’re basically byte-wise equal (ignoring actual memory addresses). Similarly, the two biggest keys are integers and strings, objects which are completely indistinguishable when equal. Therefore, information about keys would just be noise when we really do generally only care about values. If we accept this, then this opens up room for some internal optimizations. On key collisions, the implementation can keep whichever key is easiest (probably the old one, to avoid a write). In fact, this is actually basically the case right now. HashMap keeps the old key, and TreeMap keeps the new key. The behavior of a generic Map is basically undefined with respect to which keys are kept or lost.

On the other hand, there’s no fundamental reason why Eq keys should be considered indistinguishable. It’s arbitrary code, the user can make it work how they want. Further, the keys could implement Drop, and therefore which does or doesn’t survive could really matter. Sets, which consist only of Keys, are exceptional minimal because of this behavior. All you can do is ask yes/no questions or get an iterator. Also, with tuple indexing, returning a “useless” key wouldn’t be a huge ergonomic loss.

So there’s the issue. We have Maps and Sets which basically assert that keys are uninteresting as actual values. This provides room for optimization, and keeps the API simple. However this constrains some niche hypothetical use cases. At very least, the current behavior is undocumented and inconsistent. I’d like to at least have a documented convention for how keys should be treated.


#2

I’d be interested to hear any compelling examples you can come up with or precedent in other languages. As this stands, this doesn’t seem worthwhile IMHO.

I think Eq should fundamentally mean that the 2 objects are indistinguishable for useful purposes. Obviously that can’t be enforced but it should be the high level contract. Failing to meet that expectation seems both fundamentally avoidable and certain to lead to confusion. It seems like this would make the API more complex where the only gain would be making it convenient to allow people to make bad design decisions.

So perhaps the idea of Eq and/or the way collections treat keys should be documented more explicitly but I think the current state of affairs is fine. Thoughts? Am I missing something?


#3

I’d like to clarify that I am not advocating any particular stance on this matter. I wrote out my thoughts on the matter, and I just want to obtain a consensus so that it can be documented and coded against. The current state of affairs is effectively “a concrete map will do whatever the author decided made sense at the time”, and I don’t think that’s an acceptable way to build out standard APIs.

After a cursory scan of some languages, the following behaviour can be found on a key collision in a Map:

  • Java: Change the value, not the key
  • Python: Ambiguous
  • C#: Throw an IllegalArgumentException (lolwut)
  • C++: Drop both the new key and value
  • Ruby: Ambiguous
  • Qt: Change the value, not the key

There’s some inconsistency here, but the general trend is that keys are considered uninteresting or forgettable during a collision. This seems like a perfectly reasonably solution. I’m not sure if I’m willing to commit to Eq === Indistinguishable, but that would also certainly be a sensible convention.


#4

Sorry, to clarify are you suggesting (thought not necessarily advocating for) a richer API to return keys in more places so they can be used as more meaningful values? That was mainly what I was objecting to. I don’t think there’s a use case for swap or pop to return the key in addition to the value that’s not an anti-pattern and I don’t think there’s much precedent for collection APIs doing things like that. If this is just about what the convention for which key should win on an insert collision, I agree that should be documented. I’d vote in favor of it being defined as implementation dependent (since I think Eq !== Indistinguishable is an anti-pattern) but I don’t see that as mattering too much either way.


#5

I think it’s worth considering exposing the keys, if only as a “we thought about it, and it wasn’t worth it” so I have somewhere to point when people mention it.

For what it’s worth, right now I’m leaning strongly towards “it’s not worth it”, because it clearly hasn’t been a problem, and no one can give me a concrete use case. Certainly for 1.0, where we’re going for minimal and lean, it seems unnecessary. I’m not sure if I’d go so far as to say it’s an anti-pattern, but I can certainly see where you’re coming from. It would likely be a symptom of a hack around something.

Edit: and I lumped in the “which key wins” thing because it is clearly part of the same overall “how much do we care about keys” issue.


#6

If you want to look up the actual key value in Python, you can just make a dict of key -> key mappings. Since rust maps want to own both key and value, we don’t have that option.


#7

@eddyb has a use case in the compiler that makes keys interesting. He basically wants a HashSet to use as a cache that he can query with equiv keys to get the real back out. He wants something like:

HashSet::find_equiv<Q: Hash<S> + Equiv<T>>(&self, &Q) -> Option<&T>

Specifically:

>  right now I have FnvHashMap<InternedTy<'tcx>, Ty<'tcx>>
> InternedTy is a special wrapper for Ty that just has the right Equiv, Eq and Hash impls1
> (to work as a key)
> I would prefer to use FnvHashSet<InternedTy<'tcx>>

#8

Sorry for reviving this, but it came up again recently for me. One distinction I see between Rust and most of these languages you surveyed is the linear-ness of Rusts types.

This sort of answers the “are they effectively indistinguishable?” question with a “no, because if I move/drop one, I still have the other”. Cloning or dropping something can be expensive, and it would be great to avoid this when possible (e.g. hand things back to the user and let them drop them if appropriate, and spare them an early clone if they don’t need one).

I have several cases where keys can be String or Vec<T>, in which case it is a pain to have one taken away from me, even if it is Eq to another one I happen to have nearby.


#9

I’ve been needing to “pop” values from sets.

Basically: HashSet::pop(&mut self) -> Option<T>

This came up when I was trying to implement DFA minimization (Hopcroft’s algorithm) http://en.wikipedia.org/wiki/DFA_minimization. Basically, I have a set of partitions and loop, removing an element from the set each iteration. The set is also mutated later (items removed / added) so set operations are needed as well as being able to remove a value from the set each iteration (the specific value does not matter).


#10

never mind this comment, misunderstood the issue.


#11

I’m also having performance issues where I have to excessively clone keys because I can’t get them back if they’re not needed. In my case, instead of using the entry API, I just use multiple calls on the hashtable to check if a key is present and if not, insert the key. This is less than optimal, but better than excessive clones.