Fallible Hash trait

davidhewitt · November 23, 2021, 7:55am

In PyO3 we're haven't implemented the Hash trait for the main Python object type (PyAny) because we're not able to guarantee __hash__ implementations written in Python will not error. (The return value of PyAny::hash is Result<isize, PyErr>).

Has anyone else come across problems like this in the past, and do you think it warrants a solution in std? It's unfortunate that it means Python objects cannot be used with Rust's collections like HashSet and HashMap. On the other hand, Hash being infallible is hugely convenient the vast majority of the time. I'm also not aware of any modification which could be made to the Hash trait backwards-compatibly to resolve this, and it feels like I'm staring at an edge case here.

The best solution I can think of that doesn't require support in std is just to panic in the Hash implementation. PyO3 already uses catch_unwind at the FFI boundary to prevent Rust panics propagating into C; it wouldn't be hard to transform panics originating from Hash back into Python exceptions. This just feels a little... dirty, as I understood idiomatic use of panics discourages them from being used for regular error control flow!

scottmcm · November 23, 2021, 8:25am

You're not able to guarantee that Hash implementations in rust won't panic either -- they're safe code, they can do whatever, even if they oughtn't.

Is it particularly common that Python __hash__s can fail? Or are they not supposed to fail in Python either?

Just unwrapping them might well be fine.

davidhewitt · November 23, 2021, 8:41am

Is it particularly common that Python __hash__ s can fail? Or are they not supposed to fail in Python either?

It's fairly likely that PyAny::hash will return an error. Every Python object that isn't hashable (e.g. lists) will exercise that code path.

NoamB · November 23, 2021, 10:33am

You can add a wrapper type which stores the hash on construction (and implements the hash trait). This approach doesn't scale if you need to implement many traits in this fashion.

bill_myers · November 23, 2021, 11:48am

But that's almost always a bug in the code and it's only a runtime error because Python lacks a proper type system (in Rust the type would not implement Hash and the code would fail to compile).

I think panicing in Rust is the best solution.

quaternic · November 23, 2021, 4:01pm

It looks like PyAny implements PartialEq as pointer equality, i.e. it uses object identity.¹ Would it be reasonable to do that for Hash as well? In fact, I think it should be, since otherwise you would just be introducing hash collisions in Hash{Map,Set} for distinct objects that happen to be equal in python.

¹) In this macro, when invoked here.

felix.s · November 23, 2021, 4:08pm

I don’t remember, is __hash__ in Python allowed to change over the course of the object’s lifetime?

Because if not, you could implement a wrapper type with a fallible constructor that invokes __hash__ and caches the result in the wrapper, then returns the cached value in the Hash implementation, and otherwise Derefs to the original object. If the initial invocation fails, the constructor returns an error.

mjbshaw · November 23, 2021, 4:39pm

From the docs:

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

Most of Python’s immutable built-in objects are hashable; mutable containers (such as lists or dictionaries) are not; immutable containers (such as tuples and frozensets) are only hashable if their elements are hashable. Objects which are instances of user-defined classes are hashable by default. They all compare unequal (except with themselves), and their hash value is derived from their id() .)

H2CO3 · November 23, 2021, 4:40pm

Well, this is a typing issue. Since PyAny is a dynamically typed object, it can't be guaranteed that hashing even makes sense for the underlying concrete type at all. So you really shouldn't try to implement Hash for PyAny.

Rather, implement the trait for known-hashable types (e.g. str) only. And go with unwrapping the result in those cases. Python's exceptions aren't effects (they can occur anywhere and everywhere), and except for some unfortunate cases, they usually represent a programmer error. Certainly this is the case for hashing at least. So an exception in __hash__ is much more akin to a panic in Rust, and not to a fallible function.

davidhewitt · November 23, 2021, 6:32pm

Thank you everyone for the many responses!

You can add a wrapper type which stores the hash on construction (and implements the hash trait). This approach doesn't scale if you need to implement many traits in this fashion.

Indeed this is also a good approach because calling __hash__ requires the Python GIL to be held, so there's potentially quite significant performance saving here too. This is a pattern I've considered recommending to users, though haven't gone for providing such a wrapper in PyO3 itself.

But that's almost always a bug in the code and it's only a runtime error because Python lacks a proper type system (in Rust the type would not implement Hash and the code would fail to compile).

While true, at the moment PyO3 implements panics as a destructive crash that brings down the process. We also have to consider many users interact with Python packages through REPL environments. If such a user passed an unhashable object to a function which assumed a hashable object and subsequently panicked, I think they would be very upset to have their REPL environment die.

It looks like PyAny implements PartialEq as pointer equality, i.e. it uses object identity.¹ Would it be reasonable to do that for Hash as well? In fact, I think it should be, since otherwise you would just be introducing hash collisions in Hash{Map,Set} for distinct objects that happen to be equal in python.

This is a very good point. There is an argument to be made that PartialEq should instead go via Python's __eq__ in a similar way to how we're discussing __hash__. Certainly we should not implement Hash without also modifying PartialEq to match.

I personally wouldn't use pointer identity as a Python object hash, because I think users would expect Python and Rust hashes of an object to match. I could be wrong on that assumption, however it seems like a footgun if hashing in Rust hashed the pointer value, but hashing in Python hashed individual fields via __hash__.

Well, this is a typing issue.

This is a very good point, definitely a potential avenue to explore. Certainly types like str make sense; tuple is more interesting because a tuple is hashable only if all of its constituents are. At the moment PyO3 only has a single dynamic PyTuple type so we wouldn't be able to express it as being hashable under this paradigm; maybe that's fine.

bluss · November 23, 2021, 9:54pm

Using identity sounds promising.

If there was a fallible hash trait, what would happen then? Would HashMap in std support it? I would guess it wouldn't, and then the new trait is not really solving the original problem.

I maybe oversimplify sometimes, but the Hash trait's main reason to exist is to support HashMap. In that sense, I don't think it's so important if py objects end up in the same hash bucket as the equivalent rust objects - more important to be able to use them in HashMaps at all.

NoamB · November 23, 2021, 10:50pm

I'm not sure what the use cases for using python objects in a rust hashmap, but in python I often use namedtuples (with integers members, re-created many times) as keys. Using the pointer for equality/hash breaks this use case. In general, I would not expect hashmap keys to always come from the same place - for example, a hashmap can be constructed with one set of key objects, and indexed by another.

I agree that if the pointer value is used for equality, it should also be used for hashing.

Maybe there can be a HashKey wrapper type which delegates to python's __eq__ and communicates (in the docs) that an exception there will cause a rust panic? Like @H2CO3 said, an exception in __eq__ is probably almost always a programmer error.

kornel · November 24, 2021, 1:53pm

hash.unwrap_or(0) should be technically correct .

adriangb · November 25, 2021, 3:17am

I recently used the wrapper approach. It worked well, and I think it did help performance.

__eq__ is a bit trickier insofar as that the result depends on the object it is being compared to, so you can't just get it in the constructor. You could check if self.__eq__(self) == true, which you can then replace w/ an identity comparison in Eq so that at least in the common case of looking up something in a HashMap there is no need to call into Python.

H2CO3 · November 25, 2021, 12:50pm

No, that wouldn't work. The reason and the consequence are backwards. For any sane object, if equality is defined, self == self (except for horrible designs like NaN).

So, when identity is used for equality, that does imply self == self. However, the converse is not true! For instance, two strings that come from different sources but have the same content compare equal (just like in Rust). If you replaced such comparison with the identity, it would incorrectly conclude that two distinct strings with the same contents are not equal.

By the way, the "Hash should agree with Eq" constraint is not a matter of litigation. The standard library requires it, period. Consequently, implementing them in a way inconsistent with each other is a programmer error.

adriangb · November 26, 2021, 7:54pm

Sorry, let me clarify: if self.__eq__(self) == true and you can check if other.eq(self) in rust and if so, exit early with true. If either one of those two conditions is not true, you have to ask Python. But luckily for a HashMap lookup, most of your collisions are going to be with the same object, so for that specific situation this should provide some pretty tangible performance benefits (I think). Here's my Rust code (bear in mind I'm very new to Rust, please feel free to comment on it): Hashable Python objects in PyO3 (github.com)

That said, I benchmarked a bit and realized something important: this helps in some cases, but hurts in many others. You trade off always making 1 call into Python (in the constructor) and maybe making no future calls (in the case that all hash collisions are w/ the same object) vs. maybe making a lot more calls into Python (for any hash collision, even w/ the same object). I think ideally this check would be run in eq itself JIT and cached, but that's not possible because eq's self is not mut.

system · February 24, 2022, 7:55pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Stability of hash values documentation	8	5523	March 25, 2019
Unstable hash architecture ideas (deprecated)	13	3341	March 25, 2019
Low-latency hashing libs	11	1288	December 18, 2024
Pre-RFC: bless the `type FastHashMap` pattern	92	3972	March 25, 2019
[Pre-RFC] Safer Transmutation language design	38	6291	November 30, 2020

Fallible Hash trait

Related topics